2004 Fiscal Year Final Research Report Summary
Corpus-based Word Sense Disambiguation and its application to Information Retrieval
Project/Area Number |
15500087
|
Research Category |
Grant-in-Aid for Scientific Research (C)
|
Allocation Type | Single-year Grants |
Section | General |
Research Field |
Intelligent informatics
|
Research Institution | University of Yamanashi |
Principal Investigator |
FUKUMOTO Fumiyo, University of Yamanashi, Department of Research Interdisciplinary Graduate School of Medicine and Engineering, Associate Professor (60262648)
|
Project Period (FY) |
2003 – 2004
|
Keywords | Word Sense Disambiguation / Category Hierarchies / Detecting and Correcting Category Errors |
Research Abstract |
In this work, we proposed a method to disambiguate word senses and applied the results to query expansion in Information Retrieval. We mainly focused on the following four methods.
(1) Learning Subject Drift for Topic Tracking. In topic tracking, where data is collected over an extended period of time, the discussion of a topic, i.e. the subject of a story, changes over time. This work focuses on subject drift and presents a method for topic tracking on broadcast news stories that handles it. The basic idea is to automatically extract the optimal positive training data for the target topic, so that it includes only data sufficiently related to the current subject. The method was tested on the TDT1 and TDT2 corpora, and the results show its effectiveness.
(2) Correcting Category Errors in Text Classification. We proposed a method for correcting category annotation errors in multi-labeled data, which deteriorate the overall performance of text classification. We used the hierarchical structure of the categories as a simple heuristic: the corrected category should be at the same level as, or a parent or child of, the category originally assigned to the document. Experimental results on the Reuters 1996 corpus show that our method achieves high precision in detecting and correcting annotation errors. Further, text classification accuracy improves when the corrected data is used.
(3) A Comparison of Manual and Automatic Constructions of Category Hierarchy for Classifying Large Corpora. We addressed the problem of dealing with a large collection of data and investigated automatically constructing a category hierarchy from a given set of categories to improve the classification of large corpora. We used two well-known techniques to create the hierarchy: k-means, a partitioning clustering algorithm, and a loss function. K-means clusters the given categories into a hierarchy. To select an appropriate value of k, we use a loss function that measures the degree of our disappointment in any differences between the true distribution over inputs and the learner's prediction. Once the optimal k is selected, the procedure is repeated for each cluster. Our evaluation on the 1996 Reuters corpus, which consists of 806,791 documents, shows that an automatically constructed hierarchy improves classification accuracy.
(4) Word Sense Disambiguation in Information Retrieval. We proposed a feature selection method for disambiguating word senses. In our method, a set of features corresponding to each sense of an ambiguous word is selected by applying a statistical technique. Further, we applied the results to query expansion in Information Retrieval.
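The hierarchy heuristic in item (2) can be illustrated with a minimal sketch. It is not the project's implementation: the toy hierarchy, the category names, the function names and the per-category scores are all hypothetical, and "same level" is assumed here to mean categories sharing the same parent. The sketch only shows how restricting a correction to the parent, children and siblings of the original label narrows the candidate set.

```python
from collections import defaultdict

# Hypothetical toy hierarchy: child -> parent (None marks the root).
PARENT_OF = {
    "root": None,
    "economics": "root",
    "markets": "root",
    "inflation": "economics",
    "trade": "economics",
    "equity": "markets",
}

def build_children(parent_of):
    """Invert the child -> parent map into parent -> set of children."""
    children = defaultdict(set)
    for child, parent in parent_of.items():
        if parent is not None:
            children[parent].add(child)
    return children

def allowed_corrections(category, parent_of):
    """Categories a label may be corrected to: parent, children, siblings.

    Assumption: 'same level' is read as 'shares the same parent'.
    """
    children = build_children(parent_of)
    candidates = set(children.get(category, set()))
    parent = parent_of.get(category)
    if parent is not None:
        candidates.add(parent)
        candidates |= children[parent] - {category}
    return candidates

def correct_label(original, scores, parent_of):
    """Pick the best-scoring label among the original and its allowed neighbours.

    `scores` stands in for any classifier confidence; it is an assumption,
    not something specified in the abstract.
    """
    options = allowed_corrections(original, parent_of) | {original}
    return max(sorted(options), key=lambda c: scores.get(c, float("-inf")))
```

For example, with scores `{"inflation": 0.1, "trade": 0.7, "equity": 0.9}`, a document originally labeled `inflation` would be relabeled `trade`: `equity` scores higher but lies outside the allowed neighbourhood, so the heuristic never considers it.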
|
Research Products
(10 results)