Research on Automatic Incremental Construction of Language Resources
Project/Area Number |
09308009
|
Research Category |
Grant-in-Aid for Scientific Research (A)
|
Allocation Type | Single-year Grants |
Section | General |
Research Field |
Intelligent informatics
|
Research Institution | Tokyo Institute of Technology |
Principal Investigator |
TANAKA Hozumi Graduate School of Information Science and Engineering, Tokyo Institute of Technology, Professor (80163567)
|
Co-Investigator (Kenkyū-buntansha) |
SHIRAI Kiyoaki Graduate School of Information Science and Engineering, Tokyo Institute of Technology, Research Assistant (30302970)
INUI Kentaro Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, Assistant Professor (60272689)
TOKUNAGA Takenobu Graduate School of Information Science and Engineering, Tokyo Institute of Technology, Associate Professor (20197875)
|
Project Period (FY) |
1997 – 1999
|
Project Status |
Completed (Fiscal Year 1999)
|
Budget Amount |
¥22,700,000 (Direct Cost: ¥22,700,000)
Fiscal Year 1999: ¥2,600,000 (Direct Cost: ¥2,600,000)
Fiscal Year 1998: ¥7,200,000 (Direct Cost: ¥7,200,000)
Fiscal Year 1997: ¥12,900,000 (Direct Cost: ¥12,900,000)
|
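Keywords | language resource / corpus / morphological analysis / syntactic analysis / probabilistic language model / automatic acquisition of knowledge / linguistic knowledge base / natural language processing / annotated corpus / linguistic knowledge acquisition / MSLR parsing method / probabilistic generalized LR model / probabilistic GLR parsing method / morpheme connection table |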
Research Abstract |
This research project targets the automatic, incremental construction of a corpus annotated with morphological information, such as word segmentation and part-of-speech (POS) tags, and syntactic information, such as syntactic trees. The proposed method works as follows. We first analyze large volumes of text to obtain the morphological and syntactic information with which to annotate the text. Next, we acquire new knowledge for natural language analysis: we acquire a connection matrix and train the probabilistic generalized LR language model (PGLR model hereafter). The connection matrix describes adjacency constraints between POS pairs. It can be acquired automatically from a POS-tagged corpus by regarding each POS pair as legally adjacent if it appears in sequence in the training corpus, and as illegal otherwise. The PGLR model provides a probabilistic language model and is easily trainable given a tree-annotated corpus. Given these knowledge resources, we re-analyze the sentences and obtain new morphological and syntactic information. By repeating this procedure, we automatically construct a corpus annotated with morphological and syntactic information. Our experiment shows that the proposed method is effective for enlarging existing annotated corpora.
|
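The connection-matrix acquisition rule described in the abstract (a POS pair is legal if and only if it appears in sequence in the training corpus) can be illustrated with a minimal Python sketch. This is not the project's implementation: the function names `acquire_connection_matrix` and `is_legal`, the set-of-pairs representation, and the toy English tag set are all assumptions made for illustration.

```python
def acquire_connection_matrix(tagged_sentences):
    """Collect every POS pair that occurs adjacently in the corpus.

    tagged_sentences: iterable of sentences, each a list of (word, pos) tuples.
    Returns a set of (left_pos, right_pos) pairs treated as legally adjacent;
    any pair absent from the set is treated as illegal.
    """
    allowed = set()
    for sentence in tagged_sentences:
        tags = [pos for _word, pos in sentence]
        # zip(tags, tags[1:]) yields each pair of neighbouring tags.
        allowed.update(zip(tags, tags[1:]))
    return allowed


def is_legal(allowed, tags):
    """Check that every adjacent POS pair in a candidate tag sequence is allowed."""
    return all(pair in allowed for pair in zip(tags, tags[1:]))


# Hypothetical toy corpus; words and tags are illustrative only.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("small", "ADJ"), ("cat", "NOUN"), ("sleeps", "VERB")],
]
matrix = acquire_connection_matrix(corpus)
```

In this sketch a candidate analysis such as `["DET", "ADJ", "NOUN", "VERB"]` passes `is_legal`, while `["VERB", "DET"]` is rejected because that pair never occurs in the toy corpus. Re-acquiring the set after each round of re-analysis would correspond to the repeated procedure the abstract describes.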
Report
(4 results)
Research Products
(10 results)