Automatic Acquisition of Linguistic Knowledge Using Bilingual Comparable Corpora and its Application to Topic Tracking
Project/Area Number |
17500091
|
Research Category |
Grant-in-Aid for Scientific Research (C)
|
Allocation Type | Single-year Grants |
Section | General |
Research Field |
Intelligent informatics
|
Research Institution | University of Yamanashi |
Principal Investigator |
FUKUMOTO Fumiyo University of Yamanashi, Department of Research Interdisciplinary Graduate School of Medicine and Engineering, Associate Professor (60262648)
|
Project Period (FY) |
2005 – 2006
|
Project Status |
Completed (Fiscal Year 2006)
|
Budget Amount |
¥3,600,000 (Direct Cost: ¥3,600,000)
Fiscal Year 2006: ¥800,000 (Direct Cost: ¥800,000)
Fiscal Year 2005: ¥2,800,000 (Direct Cost: ¥2,800,000)
|
Keywords | Comparable Corpora / Polysemous Word / Bilingual Terms / Topic Tracking / Semi-supervised Clustering / Linguistic Knowledge Acquisition / Word Sense / Follow-up Stories / Multilingual Corpora / EM Algorithm / Word Sense Disambiguation |
Research Abstract |
With the exponential growth of information on the Internet, it is becoming increasingly difficult to find and organize relevant material. Topic Detection and Tracking (TDT) is a research area that addresses this problem and consists of five tasks: story link detection, clustering topic detection, new event detection, story segmentation, and topic tracking. The last task, topic tracking, is the focus of this work. Topic tracking starts from a few sample stories and finds all subsequent stories that discuss the target topic. Here, a topic in the TDT context is something that happens at a specific place and time and is associated with some specific action. We address the problem of skewed data in topic tracking, namely the small number of stories labeled positive compared with the number labeled negative, and propose a method for estimating effective training stories for the topic tracking task. To compensate for the small number of labeled positive stories, we use bilingual comparable corpora, i.e., English and Japanese corpora, together with the EDR bilingual dictionary, and extract story pairs consisting of positive and associated stories. To handle the large number of labeled negative stories, we classify them into clusters using a semi-supervised clustering algorithm that combines k-means with EM. The method was tested on the TDT English corpus, and the results showed that the system works well when the tracked topic concerns an event originating in the source-language country, even with a small number of initial positive training stories.
|
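The abstract mentions a semi-supervised clustering algorithm that combines k-means with EM but gives no details. The following is only a minimal sketch of one common way such a combination can be set up, not the project's actual method: a few labeled items seed the clusters, a hard k-means pass finds centroids, and a soft EM pass refines them. The function names (`seeded_kmeans`, `em_refine`, `cluster`), the fixed isotropic variance `sigma`, and the toy 2-D points are all assumptions made for illustration; real stories would be represented as term vectors.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors, weights=None):
    """(Weighted) mean of a list of equal-length vectors."""
    if weights is None:
        weights = [1.0] * len(vectors)
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[d] for v, w in zip(vectors, weights)) / total
            for d in range(dim)]

def seeded_kmeans(points, seeds, iters=10):
    """Hard k-means whose centroids start from labeled seed points.

    seeds maps a cluster id to a non-empty list of indices into points.
    Seed points keep their given cluster throughout (an assumption).
    """
    cluster_ids = sorted(seeds)
    fixed = {i: c for c, idxs in seeds.items() for i in idxs}
    centroids = {c: mean([points[i] for i in seeds[c]]) for c in cluster_ids}
    for _ in range(iters):
        labels = [fixed.get(i, min(cluster_ids,
                                   key=lambda c: sq_dist(p, centroids[c])))
                  for i, p in enumerate(points)]
        for c in cluster_ids:
            members = [p for p, l in zip(points, labels) if l == c]
            centroids[c] = mean(members)  # non-empty: seeds are fixed
    return centroids

def em_refine(points, centroids, sigma=1.0, iters=10):
    """Soft refinement: EM for isotropic Gaussians with a fixed variance."""
    centroids = dict(centroids)  # do not modify the caller's dict
    cluster_ids = sorted(centroids)
    resp = []
    for _ in range(iters):
        # E-step: responsibility of each cluster for each point.
        resp = []
        for p in points:
            d = [sq_dist(p, centroids[c]) for c in cluster_ids]
            d_min = min(d)  # shift exponents for numerical stability
            scores = [math.exp(-(x - d_min) / (2 * sigma ** 2)) for x in d]
            z = sum(scores)
            resp.append([s / z for s in scores])
        # M-step: centroids become responsibility-weighted means.
        for k, c in enumerate(cluster_ids):
            weights = [r[k] for r in resp]
            if sum(weights) > 0:
                centroids[c] = mean(points, weights)
    return centroids, resp

def cluster(points, seeds):
    """Run seeded k-means, refine with EM, return the most likely cluster ids."""
    centroids, resp = em_refine(points, seeded_kmeans(points, seeds))
    cluster_ids = sorted(centroids)
    return [cluster_ids[max(range(len(r)), key=r.__getitem__)] for r in resp]

# Toy usage: two well-separated groups, one labeled seed per cluster.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
seeds = {0: [0], 1: [3]}
labels = cluster(points, seeds)
```

In this sketch the seeds only influence initialization and the hard k-means pass; the EM pass is unconstrained, which is one of several possible design choices.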
Report
(3 results)
Research Products
(18 results)