Project/Area Number |
18500093
|
Research Category |
Grant-in-Aid for Scientific Research (C)
|
Allocation Type | Single-year Grants |
Section | General |
Research Field |
Media informatics/Database
|
Research Institution | National Institute of Informatics |
Principal Investigator |
AIZAWA Akiko National Institute of Informatics, Digital Content and Media Sciences Research Division, Professor (90222447)
|
Project Period (FY) |
2006 – 2007
|
Project Status |
Completed (Fiscal Year 2007)
|
Budget Amount |
¥4,010,000 (Direct Cost: ¥3,500,000, Indirect Cost: ¥510,000)
Fiscal Year 2007: ¥2,210,000 (Direct Cost: ¥1,700,000, Indirect Cost: ¥510,000)
Fiscal Year 2006: ¥1,800,000 (Direct Cost: ¥1,800,000)
|
Keywords | natural language processing / compound extraction / dictionary construction / information retrieval / lexicon / dedicated portal sites / indexing tools / CRF / dedicated portal / vocabulary extraction / specialized portal / EM algorithm |
Research Abstract |
In recent years, building dedicated web portals has become common practice among academics. These portals are valuable information sources that help maintain the diversity of web content and disseminate academic or specialized knowledge to the public. Dedicated portals with specialized content need a good term extraction tool to identify multi-word expressions that are not found in general dictionaries, but existing segmentation tools are not satisfactory for this purpose. This study therefore focuses on a keyword extraction method that enhances the search capability of dedicated portal servers. During the two-year research period, we addressed the following: 1. A framework for automatic extraction of multi-word expressions (compounds), in which two modules are applied sequentially but independently: (A) a segmentation module that identifies the longest multi-word regions in a given input text, and (B) a parsing module that analyzes the cost of word connections within each multi-word region. 2. A new method for (B) in which the tree structure of a multi-word expression is determined using a statistical cost function. The parameters of the function are obtained by applying a CRF (conditional random field) to technical terms extracted from the handbooks of academic societies. Future issues include (i) implementing a lightweight tool for automatic keyword extraction based on the proposed method, and (ii) using the extracted terms for search navigation or text categorization.
|
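The two-module framework in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the project: the tag set `CANDIDATE_TAGS`, the function names, the toy cost table (which stands in for the CRF-trained cost function), and the rule of splitting a region at its most costly word connection. The abstract does not specify how the project derives the tree from the costs.

```python
from typing import Callable, List, Sequence, Tuple, Union

# A tree is either a single word or a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]

# Hypothetical set of part-of-speech tags allowed inside a multi-word region.
CANDIDATE_TAGS = {"NOUN", "ADJ"}


def longest_regions(tagged: Sequence[Tuple[str, str]]) -> List[List[str]]:
    """Module (A): collect maximal runs of candidate-tagged words (length >= 2)."""
    regions: List[List[str]] = []
    current: List[str] = []
    for word, tag in tagged:
        if tag in CANDIDATE_TAGS:
            current.append(word)
        else:
            if len(current) >= 2:
                regions.append(current)
            current = []
    if len(current) >= 2:
        regions.append(current)
    return regions


def build_tree(words: Sequence[str], cost: Callable[[str, str], float]) -> Tree:
    """Module (B): split recursively at the most costly connection (assumed rule)."""
    if len(words) == 1:
        return words[0]
    # Boundary i lies between words[i] and words[i + 1]; ties pick the leftmost.
    split = max(range(len(words) - 1), key=lambda i: cost(words[i], words[i + 1]))
    return (build_tree(words[: split + 1], cost), build_tree(words[split + 1 :], cost))


# Toy connection costs (hypothetical); a trained model would supply these values.
TOY_COSTS = {
    ("conditional", "random"): 0.6,
    ("random", "field"): 0.1,
    ("field", "model"): 0.9,
}


def toy_cost(left: str, right: str) -> float:
    """Look up a connection cost, defaulting to 1.0 for unseen word pairs."""
    return TOY_COSTS.get((left, right), 1.0)


if __name__ == "__main__":
    tagged = [
        ("the", "DET"), ("conditional", "ADJ"), ("random", "ADJ"),
        ("field", "NOUN"), ("model", "NOUN"), ("is", "VERB"), ("trained", "VERB"),
    ]
    for region in longest_regions(tagged):
        print(region, "->", build_tree(region, toy_cost))
```

In this sketch the two modules stay independent, as in the described framework: `longest_regions` knows nothing about costs, and `build_tree` accepts any cost function, so a learned model could replace `toy_cost` without changing the segmentation step.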