2001 Fiscal Year Final Research Report Summary
A Study about automatic domain term extraction from corpus
Project/Area Number |
12680368
|
Research Category |
Grant-in-Aid for Scientific Research (C)
|
Allocation Type | Single-year Grants |
Section | General |
Research Field |
Intelligent informatics
|
Research Institution | The University of Tokyo |
Principal Investigator |
NAKAGAWA Hiroshi Information Technology Center, The University of Tokyo, Professor (20134893)
|
Co-Investigator (Kenkyū-buntansha) |
TANAKA Kumiko (ISHII Kumiko) Interfaculty Information Initiative, The University of Tokyo, Lecturer (10323528)
|
Project Period (FY) |
2000 – 2001
|
Keywords | Term Extraction / Information Extraction / domain term / Corpus / Translation / NTCIR / Natural Language Processing / Information Retrieval |
Research Abstract |
We worked mainly on automatic term extraction methods that extract domain-specific terms from the domain corpora distributed by the NTCIR1 TMREC task group. Most existing work on automatic term extraction relies on statistics such as frequency in the corpus, and few studies focus on the characteristics of the space formed by the extracted terms. This work focuses on the latter. We propose a method that uses the statistical relation between compound nouns, which account for up to 85% of all terms, and simple nouns, which make up the remaining 15%. For instance, given many compound nouns such as "human information system", "social information system" and so on, the importance of "information" is defined as the number of distinct nouns that adjoin it or are adjoined by it. The importance of a compound noun is then defined as the geometric mean of the importance of its component nouns. Our system consists of (1) morphological analysis, (2) extraction of candidate terms, (3) assignment of an importance value to each candidate term, and (4) evaluation with the NTCIR1 TMREC test collection. The proposed method scored highly among the methods that participated in NTCIR1. We also adapted our method to English so that translation extraction can be investigated in the following year.
|
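The scoring idea in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the project's implementation: the function names `noun_scores` and `term_importance` are hypothetical, compound nouns are assumed to be already segmented into space-separated nouns (the morphological analysis step is skipped), the left and right neighbour counts are assumed to be simply added, and 1 is added to each noun's score so that a noun with no neighbours does not force the geometric mean to zero. The abstract does not specify any of these details.

```python
from collections import defaultdict
from math import prod


def noun_scores(compounds):
    """Count, for each noun, the distinct nouns adjacent to it in the compounds."""
    left = defaultdict(set)   # nouns seen immediately to the left of a noun
    right = defaultdict(set)  # nouns seen immediately to the right of a noun
    for term in compounds:
        nouns = term.split()
        for a, b in zip(nouns, nouns[1:]):
            right[a].add(b)
            left[b].add(a)
    all_nouns = set(left) | set(right)
    # Assumption: left and right neighbour counts are summed.
    return {n: len(left[n]) + len(right[n]) for n in all_nouns}


def term_importance(term, scores):
    """Geometric mean of the (smoothed) scores of the term's component nouns."""
    nouns = term.split()
    # Assumption: add-one smoothing so unseen nouns do not zero the product.
    values = [scores.get(n, 0) + 1 for n in nouns]
    return prod(values) ** (1 / len(nouns))


compounds = ["human information system", "social information system"]
scores = noun_scores(compounds)
ranked = sorted(compounds, key=lambda t: term_importance(t, scores), reverse=True)
```

With the two example compounds from the abstract, "information" has two distinct left neighbours ("human", "social") and one distinct right neighbour ("system"), so it receives a higher score than the other nouns under these assumptions.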
Research Products
(8 results)