A study on optimization of units for statistical language models
Project/Area Number | 14580403
Research Category | Grant-in-Aid for Scientific Research (C)
Allocation Type | Single-year Grants
Section | General
Research Field | Intelligent informatics
Research Institution | University of Tsukuba
Principal Investigator | YAMAMOTO Mikio, University of Tsukuba, Graduate School of Systems and Information Engineering, Department of Computer Science, Associate Professor (40210562)
Project Period (FY) | 2002 – 2004
Project Status | Completed (Fiscal Year 2004)
Budget Amount | ¥4,000,000 (Direct Cost: ¥4,000,000)
Fiscal Year 2004: ¥1,100,000 (Direct Cost: ¥1,100,000)
Fiscal Year 2003: ¥1,100,000 (Direct Cost: ¥1,100,000)
Fiscal Year 2002: ¥1,800,000 (Direct Cost: ¥1,800,000)
Keywords | Natural language processing / Machine translation / Spell checker / Mutual information / Bayesian statistics / Text modeling / Hierarchical models / Bilingual phrase dictionary / Mutual information minimization / Dirichlet distribution / Document model / Speech recognition / Statistical language model / Statistical machine translation / Context model / Modeling units
Research Abstract |
In this project, we investigated and reconsidered two kinds of 'units' as basic properties of statistical language models. The first unit is the 'token', or 'dictionary entry', the minimal unit of a sentence. Ordinary statistical language models use words or characters as tokens, but for some applications, such as machine translation, longer tokens such as phrases are known to improve system performance. We focused on automatic phrase extraction for building machine translation dictionaries with a statistical criterion. We proposed a new criterion, mutual information minimization, and showed that it outperforms previous phrase extraction methods. The second kind of unit is the 'target' that a model assesses. Ordinary statistical language models evaluate sentences as targets, but many language applications must output text consisting of multiple sentences. We proposed a model that evaluates whole texts using Dirichlet mixtures as the distribution over the parameters of a multinomial distribution; the resulting compound distribution is a mixture of Polya distributions. We showed that our model achieves lower perplexity than other text models such as latent Dirichlet allocation (LDA). Experiments with a speech recognizer on read documents showed that the model effectively corrects many misrecognized words using information from the whole text.
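The abstract names mutual information as the statistical criterion behind phrase extraction but does not spell out the formula. As a point of reference only, the sketch below scores adjacent word pairs by pointwise mutual information, PMI(x, y) = log [ p(x, y) / (p(x) p(y)) ], a standard building block of such criteria; the corpus, threshold, and function names are illustrative assumptions, not the project's actual mutual-information-minimization method.

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information.

    PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ), estimated from unigram
    and bigram counts over the token stream. High-PMI pairs co-occur far
    more often than chance and are natural phrase candidates.
    (Illustrative sketch, not the project's exact criterion.)
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (x, y), c in bigrams.items():
        if c < min_count:          # ignore unreliable rare pairs
            continue
        p_xy = c / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log(p_xy / (p_x * p_y))
    return scores

# Toy usage: "new york" should outscore frequent-but-independent pairs.
corpus = "i live in new york . new york is big . i live well .".split()
for pair, s in sorted(pmi_bigrams(corpus).items(), key=lambda kv: -kv[1]):
    print(pair, round(s, 3))
```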
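For the text model, the abstract states that a Dirichlet mixture over multinomial parameters compounds into a mixture of Polya (Dirichlet-multinomial) distributions. The minimal sketch below computes a document's log-likelihood under such a mixture; the mixture weights, component parameters, and function names are assumed for illustration, and the project's actual estimation procedure is not specified in the abstract.

```python
import math

def log_polya(counts, alpha):
    """Log-probability of one word sequence with bag-of-words counts
    `counts` under a single Polya component with parameters `alpha`:
        log p(n | alpha) = lnG(A) - lnG(A + N)
                         + sum_w [ lnG(alpha_w + n_w) - lnG(alpha_w) ]
    where A = sum(alpha), N = sum(counts), lnG = log-gamma. The
    multinomial coefficient is omitted, as for sequence perplexity.
    """
    A, N = sum(alpha), sum(counts)
    lp = math.lgamma(A) - math.lgamma(A + N)
    for n_w, a_w in zip(counts, alpha):
        lp += math.lgamma(a_w + n_w) - math.lgamma(a_w)
    return lp

def log_polya_mixture(counts, weights, alphas):
    """Mixture of Polya components: log-sum-exp over
    log(weight_k) + log p(counts | alpha_k)."""
    lps = [math.log(w) + log_polya(counts, a)
           for w, a in zip(weights, alphas)]
    m = max(lps)
    return m + math.log(sum(math.exp(lp - m) for lp in lps))

# Toy usage with two hypothetical components over a 3-word vocabulary.
doc = [4, 1, 0]                       # word counts for one document
weights = [0.6, 0.4]                  # assumed mixture weights
alphas = [[2.0, 0.5, 0.5], [0.5, 2.0, 2.0]]
print(log_polya_mixture(doc, weights, alphas))
```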