Study of Class-based Language Model and its Application to Japanese Morphological Analysis
Project/Area Number |
10680383
|
Research Category |
Grant-in-Aid for Scientific Research (C)
|
Allocation Type | Single-year Grants |
Section | General |
Research Field |
Intelligent informatics
|
Research Institution | The University of Tokushima |
Principal Investigator |
KITA Kenji, The University of Tokushima, Faculty of Engineering, Associate Professor (10243734)
|
Project Period (FY) |
1998 – 1999
|
Project Status |
Completed (Fiscal Year 1999)
|
Budget Amount |
¥2,400,000 (Direct Cost: ¥2,400,000)
Fiscal Year 1999: ¥800,000 (Direct Cost: ¥800,000)
Fiscal Year 1998: ¥1,600,000 (Direct Cost: ¥1,600,000)
|
Keywords | natural language processing / Japanese language processing / morphological analysis / word segmentation / probabilistic language model / PPM* model / character class / clustering / PPM model |
Research Abstract |
Morphological analysis is the most fundamental process in Japanese language processing. Within it, word segmentation is an important problem because the Japanese writing system does not mark word boundaries. In this research project, we first studied a word segmentation model using a character-based n-gram model, which serves as our baseline method. Next, we applied the PPM* compression algorithm to the word segmentation problem. PPM (Prediction by Partial Matching) is a lossless compression algorithm based on finite-context probabilistic modeling, and PPM* is a variant of PPM that places no a priori bound on context length. We then studied a word segmentation method based on a character class model. The character class model is more robust than a character-based model because it has fewer parameters. The criterion for clustering Japanese characters is the entropy measured on a corpus different from the one used for model estimation, and the search is performed with a greedy algorithm. Because the criterion is evaluated on separate data, this clustering method gives an optimum character classification without requiring the number of classes to be specified in advance. In experiments on the ADD (ATR Dialogue Database) corpus, the proposed Japanese word segmenter using the character class model achieved higher accuracy than the character-based model. In particular, the proposed method using a variable-length n-gram class model achieved 96.38% recall and 96.23% precision on open text.
|
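The clustering idea in the abstract (greedy search, entropy measured on a corpus separate from the estimation corpus, no preset number of classes) can be illustrated with a minimal sketch. This is not the project's implementation. The bottom-up merging direction, the class bigram factorization, the add-one smoothing, and the names `heldout_entropy` and `greedy_cluster` are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def heldout_entropy(train, heldout, cls):
    """Per-character cross-entropy (bits) of `heldout` under a class bigram model.

    Assumed factorization (not taken from the source):
      P(c_i | c_{i-1}) = P(class(c_i) | class(c_{i-1})) * P(c_i | class(c_i))
    Both factors are estimated on `train` with add-one smoothing.
    """
    n_classes = len(set(cls.values()))
    class_bigrams = Counter((cls[a], cls[b]) for a, b in zip(train, train[1:]))
    class_context = Counter(cls[a] for a in train[:-1])
    class_total = Counter(cls[c] for c in train)
    char_total = Counter(train)
    class_size = Counter(cls.values())  # number of characters in each class

    bits = 0.0
    for a, b in zip(heldout, heldout[1:]):
        ka, kb = cls[a], cls[b]
        p_class = (class_bigrams[(ka, kb)] + 1) / (class_context[ka] + n_classes)
        p_char = (char_total[b] + 1) / (class_total[kb] + class_size[kb])
        bits -= math.log2(p_class * p_char)
    return bits / (len(heldout) - 1)


def greedy_cluster(train, heldout):
    """Greedily merge character classes while held-out entropy keeps falling.

    Starts with one class per character. Each round applies the single merge
    that lowers held-out entropy the most and stops when no merge helps, so
    the number of classes is a result of the search rather than an input.
    """
    alphabet = sorted(set(train) | set(heldout))
    cls = {c: i for i, c in enumerate(alphabet)}
    best = heldout_entropy(train, heldout, cls)
    while True:
        candidate = None
        for k1, k2 in combinations(sorted(set(cls.values())), 2):
            merged = {c: (k1 if k == k2 else k) for c, k in cls.items()}
            h = heldout_entropy(train, heldout, merged)
            if h < best:
                best, candidate = h, merged
        if candidate is None:
            return cls, best
        cls = candidate
```

Calling `greedy_cluster(train_text, heldout_text)` on two strings returns a character-to-class mapping and the held-out entropy it reached. Because the stopping test uses data not seen during estimation, merging stops once coarser classes no longer generalize better, which is one way to read the abstract's remark that the number of classes need not be given.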
Report
(3 results)
Research Products
(30 results)