A Study on Cluster-based Indexing of Textual Data
Project/Area Number |
15500081
|
Research Category |
Grant-in-Aid for Scientific Research (C)
|
Allocation Type | Single-year Grants |
Section | General |
Research Field |
Media informatics/Database
|
Research Institution | National Institute of Informatics |
Principal Investigator |
AIZAWA Akiko National Institute of Informatics, Research Center for Information Resources, Professor (90222447)
|
Project Period (FY) |
2003 – 2004
|
Project Status |
Completed (Fiscal Year 2004)
|
Budget Amount |
¥3,500,000 (Direct Cost: ¥3,500,000)
Fiscal Year 2004: ¥1,700,000 (Direct Cost: ¥1,700,000)
Fiscal Year 2003: ¥1,800,000 (Direct Cost: ¥1,800,000)
|
Keywords | Text Mining / Statistical Language Model / Document Clustering / Information Retrieval / Amount of Information / Extraction of Noun Phrases |
Research Abstract |
In this study, we proposed and implemented a framework for an information retrieval system that utilizes clusters of similar documents. The proposed method first generates document clusters, together with their representative terms and phrases, based on term distributions or term sequence matches. Next, an extended index is created by treating each document cluster as a single virtual document. When a query is submitted, the system searches both the original and the extended indices and returns the integrated result. We also demonstrated that indices generated from different viewpoints can be used to make the retrieval system more flexible.
During the research period, we focused on the following topics:
1. A co-clustering method based on co-occurrence statistics and mutual information
2. A suffix-array-based clustering method that utilizes the repetition of textual elements, measured by the proposed coincidence score
3. A framework for cluster-based indexing and its implementation
4. Entity identification using the fast repetition-based clustering method
Future research issues include statistical and analytical text processing methods for automatically extracting index phrases from the retrieved target document set, and methods for identifying textual elements that refer to the same real-world entities.
|
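The retrieval flow described in the abstract (an original document index plus an extended index over cluster-level virtual documents, with the two results integrated at query time) can be illustrated with a minimal sketch. This is not the project's implementation: the clusters are assumed to be given rather than computed, scoring is a plain term-frequency sum, and the function names and the mixing weight `alpha` are hypothetical choices made for illustration only.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Whitespace tokenizer; a stand-in for real term and phrase extraction."""
    return text.lower().split()


def build_index(docs):
    """Inverted index: term -> {item_id: term frequency}."""
    index = defaultdict(dict)
    for item_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][item_id] = tf
    return index


def build_cluster_index(docs, clusters):
    """Extended index: each cluster is indexed as one virtual document."""
    virtual_docs = {
        cid: " ".join(docs[d] for d in members)
        for cid, members in clusters.items()
    }
    return build_index(virtual_docs)


def score(index, query):
    """Sum of term frequencies of the query terms per indexed item."""
    scores = Counter()
    for term in tokenize(query):
        for item_id, tf in index.get(term, {}).items():
            scores[item_id] += tf
    return scores


def search(query, doc_index, cluster_index, clusters, alpha=0.5):
    """Integrate document-level and cluster-level scores (assumed scheme)."""
    combined = defaultdict(float)
    for doc_id, s in score(doc_index, query).items():
        combined[doc_id] += s
    for cid, s in score(cluster_index, query).items():
        members = clusters[cid]
        for doc_id in members:
            # Spread the cluster score evenly over its member documents.
            combined[doc_id] += alpha * s / len(members)
    # Highest score first; ties broken by document id.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data; the cluster assignment is assumed, not computed here.
docs = {
    "d1": "suffix array clustering",
    "d2": "repetition based clustering of text",
    "d3": "mutual information statistics",
}
clusters = {"c1": ["d1", "d2"], "c2": ["d3"]}

doc_index = build_index(docs)
cluster_index = build_cluster_index(docs, clusters)
results = search("suffix", doc_index, cluster_index, clusters)
print(results)
```

In this toy run, `d2` does not contain the query term but is still returned, with a lower score than `d1`, because it shares a cluster with `d1`. That is the intended effect of consulting the extended index alongside the original one.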
Report
(3 results)
Research Products
(27 results)