Research Abstract
The main objective of this research has been to leverage the vector space model for "concept search", i.e., searching for "conceptually similar" documents given a query document in either Japanese or English. During the past three years, we have developed two new methods that can be applied to concept search: the first is based on our new data structure for representing a hierarchy of clusters over massive document collections, and the second is based on tuned categorization followed by category-wise dimensionality reduction using latent semantic indexing (LSI). The most important advantage of the vector space model over other document models is that processing becomes independent of language once a document has been transformed into a vector.

For the first method, developed in 2004, we repeatedly applied a "co-clustering" algorithm to sampled document collections to obtain word-document correlated clusters. We created a hierarchy of clusters by varying the sample size in powers of two (specifically 16, 32, 64, 128, 256, 512, 1024, and 2048 documents) and applying co-clustering to each sampled collection. This method works well for patent data because patents tend to show a strong interrelationship between words and documents. For example, "pachinko" and "video game" appear only in sub-class "A63F" of the IPC (International Patent Classification). We participated in the NTCIR-5 patent task organized by NII and made a poster presentation at the NTCIR-5 international workshop in Tokyo in December 2005. The research paper on this method was submitted to and accepted by the AIRS 2005 international conference held in Jeju, Korea. Although this work explored a new way of automatically grouping patent data at different levels of granularity by combining hierarchical sampling with co-clustering, it had difficulty identifying minor clusters.

For the second method, developed in 2006, we first classified the entire patent collection into about 200 categories based on the IPC, and then applied LSI to each category in turn. This method overcomes the difficulty of the first one because it does not rely on sampling, which inevitably leaves some documents out and therefore fails to identify minor clusters. A paper on this new method has been submitted to NTCIR-6, an international workshop to be held in Tokyo in May 2007.

Search result visualization has been another objective of this research, and we have developed several visualization methods. They fall into two categories: one visualizes clusters after summarizing the search result, and the other uses maps to help users understand geographical location when a retrieved document contains geographical information such as prefecture, city, town, or village names.

The concept search research has also led to "Semantic Web" applications. In particular, we have investigated new algorithms for "ontology" alignment, where "ontology" denotes a shared conceptualization.
One of these algorithms was submitted to and accepted by ASWC (Asian Semantic Web Conference), an international conference held in Beijing in 2006. Our method was competitive with the best-known algorithms in this research field to date.
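
The vector space idea that underlies both methods can be illustrated with a minimal sketch. This is not the implementation used in the project: it omits co-clustering, IPC categorization, and LSI, and simply ranks documents by cosine similarity between term-frequency vectors. The function names (`to_vector`, `cosine`, `concept_search`) and the toy collection are hypothetical, and whitespace tokenization is an assumption that would have to be replaced by a morphological analyzer for Japanese text.

```python
import math
from collections import Counter


def to_vector(text):
    """Turn a document into a sparse term-frequency vector (term -> count).

    Assumption: terms are separated by whitespace, which holds for English
    but not for Japanese.
    """
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def concept_search(query_doc, collection, top_k=3):
    """Rank documents in `collection` (id -> text) by similarity to `query_doc`."""
    q = to_vector(query_doc)
    scored = [(cosine(q, to_vector(text)), doc_id)
              for doc_id, text in collection.items()]
    # Highest similarity first; ties are broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, round(score, 3)) for score, doc_id in scored[:top_k]]


# Hypothetical toy collection standing in for patent abstracts.
docs = {
    "d1": "pachinko game machine with ball launcher",
    "d2": "video game controller with vibration motor",
    "d3": "engine cooling system with radiator fan",
}
results = concept_search("game machine using pachinko ball", docs)
print(results)
```

Once documents are vectors, the ranking step no longer depends on the source language, which is the property the abstract highlights. In a category-wise variant, the same ranking would presumably be run only within the query's category and in a reduced LSI space rather than on raw term counts.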