Budget Amount |
¥3,600,000 (Direct Cost: ¥3,600,000)
Fiscal Year 2006: ¥800,000 (Direct Cost: ¥800,000)
Fiscal Year 2005: ¥2,800,000 (Direct Cost: ¥2,800,000)
|
Research Abstract |
With the exponential growth of information on the Internet, it is becoming increasingly difficult to find and organize relevant material. Topic Detection and Tracking (TDT) is a research area that addresses this problem and consists of five tasks: story link detection, clustering topic detection, new event detection, story segmentation, and topic tracking. The last task, topic tracking, is the focus of this paper. Topic tracking starts from a few sample stories and finds all subsequent stories that discuss the target topic. Here, a topic in the TDT context is something that happens at a specific place and time, associated with some specific action. In this work, we address the problem of skewed data in topic tracking, namely the small number of stories labeled positive compared with the number labeled negative, and propose a method for estimating effective training stories for the topic tracking task. To compensate for the small number of labeled positive stories, we use bilingual comparable corpora, i.e., English and Japanese corpora, together with the EDR bilingual dictionary, and extract story pairs consisting of positive and associated stories. To handle the large number of labeled negative stories, we group them into clusters using a semi-supervised clustering algorithm that combines k-means with EM. The method was tested on the TDT English corpus, and the results showed that the system works well when the topic being tracked concerns an event originating in the source-language country, even with a small number of initial positive training stories.
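The abstract does not say how k-means and EM are combined, so the following is only a minimal sketch of one possible reading: a seeded k-means pass, in which a few labeled stories fix the initial centroids and keep their labels, followed by EM refinement of a spherical Gaussian mixture with equal priors and a fixed variance. The function names, the `seeds` mapping, the `variance` parameter, and the representation of stories as plain numeric vectors are all assumptions made for illustration, not details of the reported system.

```python
import math


def _mean(vectors):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(vectors)
    return tuple(sum(v[d] for v in vectors) / n for d in range(len(vectors[0])))


def _sqdist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def seeded_kmeans(points, seeds, k, iters=10):
    """Hard-assignment stage (assumed design).

    seeds maps a point index to its known cluster; every cluster is
    assumed to have at least one seed. Seeded points never change label.
    """
    centroids = [
        _mean([points[i] for i, c in seeds.items() if c == j]) for j in range(k)
    ]
    labels = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            if i in seeds:
                labels[i] = seeds[i]
            else:
                labels[i] = min(range(k), key=lambda j: _sqdist(p, centroids[j]))
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = _mean(members)
    return centroids, labels


def em_refine(points, centroids, iters=10, variance=1.0):
    """Soft-assignment stage: EM for a spherical Gaussian mixture.

    Equal mixing weights and a fixed shared variance are assumed, so only
    the component means are re-estimated.
    """
    k = len(centroids)
    dims = len(points[0])
    means = list(centroids)
    resp = []
    for _ in range(iters):
        # E-step: responsibilities, computed stably from log weights.
        resp = []
        for p in points:
            logw = [-_sqdist(p, m) / (2.0 * variance) for m in means]
            top = max(logw)
            w = [math.exp(v - top) for v in logw]
            total = sum(w)
            resp.append([x / total for x in w])
        # M-step: responsibility-weighted means.
        for j in range(k):
            mass = sum(r[j] for r in resp)
            if mass > 0:
                means[j] = tuple(
                    sum(r[j] * p[d] for r, p in zip(resp, points)) / mass
                    for d in range(dims)
                )
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return means, labels
```

In this sketch, calling `seeded_kmeans(points, seeds, k)` and passing the returned centroids to `em_refine(points, centroids)` yields refined cluster means and a hard label per story. In a tracking setting the vectors would presumably be term-weight vectors for the negative training stories, but that mapping is likewise an assumption.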
|