Research Abstract |
In this study, we proposed a new framework for information retrieval, which we call "cluster-based indexing", and evaluated its effectiveness on actual document collections. The proposed scheme clusters documents and terms simultaneously, using the previously proposed "probability weighted amount of information" as its navigation criterion. Its distinguishing feature is that it aims to exploit the extracted associations between terms and documents by treating them as 'indices' in conventional retrieval systems. The proposed scheme can also be regarded as an adaptation of the "co-evolutionary framework" of genetic algorithms to the domain of text retrieval, since it first randomly initializes clusters of neighboring terms and documents and then applies local optimization to the generated clusters in order to deal with the large scale of real-world document collections. We also investigated the effectiveness of the proposed method on test collections of 10,000 to 100,000 documents, such as abstracts of academic conference papers extracted from NTCIR1, newspaper articles from the Mainichi and Nikkei CD-ROM databases, and English stories from Reuters or the Financial Times. In an evaluation on a text categorization task, the categorization performance of the generated clusters was slightly worse than, but almost comparable to, that of a Support Vector Machine, which is known to be one of the best classifiers for text categorization. Furthermore, the method was shown to successfully extract associations between documents on class borders, which is difficult with conventional machine-learning-based categorization methods.
|