Improvement of topic-based language models using Dirichlet mixtures and their applications
Project/Area Number | 17500105 |
Research Category | Grant-in-Aid for Scientific Research (C) |
Allocation Type | Single-year Grants |
Section | General |
Research Field | Perception information processing/Intelligent robotics |
Research Institution | University of Tsukuba |
Principal Investigator | YAMAMOTO Mikio, University of Tsukuba, Graduate School of Systems and Information Engineering, Associate Professor (40210562) |
Project Period (FY) | 2005 – 2006 |
Project Status | Completed (Fiscal Year 2006) |
Budget Amount | ¥3,700,000 (Direct Cost: ¥3,700,000)
Fiscal Year 2006: ¥1,400,000 (Direct Cost: ¥1,400,000)
Fiscal Year 2005: ¥2,300,000 (Direct Cost: ¥2,300,000) |
Keywords | Dirichlet Mixtures / statistical language models / topic-based models / Bayesian statistics / speech recognition / statistical machine translation / cross-language models / Bayesian models |
Research Abstract |
To improve statistical language models, we enhanced the predictive power of n-gram models, the most common type of language model, using topic and context information. We proposed new estimation methods for Dirichlet mixtures and evaluated the model on two applications: speech recognition and statistical machine translation.
1. We developed a robust estimation method for Dirichlet mixture language models using hierarchical Bayesian models. To approximate the integrals that appear in Bayesian inference, we used the reversing-EM algorithm and variational approximation. In experiments on various text corpora, we showed that this estimation method achieves the lowest perplexity.
2. Our model was integrated into speech recognition systems and evaluated by recognition rate. Two integration methods were developed: (1) modifying trigram probabilities via unigram rescaling, and (2) document-level optimization using the document likelihood computed by our model. Comparing Latent Dirichlet Allocation (LDA) with our model, we showed that the recognition rate of the system using our model is higher than that of the LDA-based system.
3. We proposed cross-language Dirichlet mixture models and integrated them into phrase-based statistical machine translation systems. With this model, the system can select contextually and topically appropriate Japanese words from among the translation candidates for an English input document. Experiments on newspaper-article translation showed that the topic models were effective in lowering perplexity.
|
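The unigram rescaling mentioned in item 2 adapts a context probability p(w | h) toward a topic model's document-level unigram distribution by weighting it with the ratio p_topic(w | d) / p_uni(w) and renormalizing. A minimal sketch of that computation, using toy probability tables (not values from this project):

```python
# Unigram rescaling: bias a trigram probability p(w | h) toward a
# topic model's document unigram p_topic(w | d) by multiplying in the
# ratio p_topic(w | d) / p_uni(w), then renormalizing over the vocabulary.
# All distributions below are hypothetical toy values for illustration.

def unigram_rescale(p_trigram, p_topic, p_uni):
    """Return topic-adapted probabilities over the vocabulary."""
    scores = {w: p_trigram[w] * (p_topic[w] / p_uni[w]) for w in p_trigram}
    z = sum(scores.values())  # renormalize so probabilities sum to 1
    return {w: s / z for w, s in scores.items()}

# Toy distributions over a three-word vocabulary.
p_trigram = {"bank": 0.5, "river": 0.3, "loan": 0.2}   # p(w | h)
p_uni     = {"bank": 0.4, "river": 0.3, "loan": 0.3}   # background unigram
p_topic   = {"bank": 0.2, "river": 0.7, "loan": 0.1}   # topic unigram for document d

adapted = unigram_rescale(p_trigram, p_topic, p_uni)
print(adapted)  # mass shifts toward "river" for a river-topic document
```

In the project's setting the topic distribution p_topic(w | d) would come from the Dirichlet mixture model rather than the fixed table used here.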
Report (3 results)
Research Products (11 results)