2004 Fiscal Year Final Research Report Summary

A study on optimization of units for statistical language models

Research Project

Project/Area Number	14580403
Research Category	Grant-in-Aid for Scientific Research (C)
Allocation Type	Single-year Grants
Section	General
Research Field	Intelligent informatics
Research Institution	University of Tsukuba
Principal Investigator	YAMAMOTO Mikio, University of Tsukuba, Graduate School of Systems and Information Engineering, Department of Computer Science, Associate Professor (40210562)
Project Period (FY)	2002 – 2004
Keywords	Natural language processing / Machine translation / Spell checker / Mutual information / Bayesian statistics / Text modeling / Hierarchical models
Research Abstract	In this project, we reconsidered two kinds of 'units' that are basic properties of statistical language models. The first is the 'token', or dictionary entry, which is the minimal unit of a sentence. Ordinary statistical language models use words or characters as tokens, but for some applications, such as machine translation, longer tokens such as phrases are known to improve system performance. We focused on automatically extracting phrases with a statistical criterion in order to build dictionaries for machine translation. We proposed a new criterion, minimal mutual information, and showed that the method outperforms previous phrase extraction methods. The second is the 'target' that the model assesses. Ordinary statistical language models evaluate sentences as targets, but many language applications must output text made up of multiple sentences. We proposed a model that evaluates a whole text by using Dirichlet mixtures as the distribution over the parameters of a multinomial distribution; the resulting compound distribution is a Polya mixture. Our model achieved lower perplexity than other text models such as latent Dirichlet allocation (LDA). Experiments with a speech recognizer on read documents showed that the model effectively corrects many misrecognized words by using information from the whole text.
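
As a rough illustration of the text model described in the abstract, the following Python sketch scores a whole document with a Polya mixture (a multinomial whose parameters follow a mixture of Dirichlet distributions) and converts the score to per-word perplexity. The function names, the toy two-word vocabulary, and all parameter values are assumptions made for this example and do not come from the project. The summary does not say how the mixture was trained or how perplexity was measured, so the sketch covers only the closed-form likelihood.

```python
import math


def log_polya(counts, alpha):
    """Log-probability of one word sequence with the given per-word counts
    under a single Dirichlet component (multinomial parameters integrated out).
    The multinomial coefficient is omitted because a sequence is scored,
    not an unordered bag of words."""
    n = sum(counts)
    a = sum(alpha)
    lp = math.lgamma(a) - math.lgamma(a + n)
    for y, al in zip(counts, alpha):
        if y:  # words with zero count contribute a factor of 1
            lp += math.lgamma(al + y) - math.lgamma(al)
    return lp


def log_polya_mixture(counts, weights, alphas):
    """Log-probability under a mixture of Dirichlet components,
    combined with the log-sum-exp trick for numerical stability."""
    logs = [math.log(w) + log_polya(counts, a) for w, a in zip(weights, alphas)]
    m = max(logs)
    return m + math.log(sum(math.exp(l - m) for l in logs))


def perplexity(counts, weights, alphas):
    """Per-word perplexity of the whole document."""
    n = sum(counts)
    if n == 0:
        raise ValueError("document must contain at least one word")
    return math.exp(-log_polya_mixture(counts, weights, alphas) / n)


if __name__ == "__main__":
    # Hypothetical two-component model over a two-word vocabulary.
    weights = [0.5, 0.5]
    alphas = [[1.0, 1.0], [3.0, 1.0]]
    doc_counts = [4, 1]  # word 0 occurs four times, word 1 once
    print(perplexity(doc_counts, weights, alphas))
```

Because each component keeps its own Dirichlet parameters, a document dominated by one word is scored mostly by the component that favors that word, which is how such a model can use information from the whole text rather than from a single sentence.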

Research Products
(13 results)

All Journal Article (13 results)

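[Journal Article] 文書確率を用いた文書読み上げ音声認識 [Read document recognition using document probability] (2005)
- Author(s)
  Rie NAKAZATO (中里理恵)
- Journal Title
  Proceedings of the 2005 Spring Meeting of the Acoustical Society of Japan, I
  Pages: 47-48
- Description
  From the "Research Report Summary (Japanese)"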
[Journal Article] Read document recognition using document probability (2005)
- Author(s)
  Rie NAKAZATO
- Journal Title
  The 2005 Spring Meeting of the Acoustical Society of Japan
  Pages: 47-48
- Description
  From the "Research Report Summary (English)"
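[Journal Article] 確率的LSAを用いた日本語同音異義語誤りの検出・訂正 [Detection and correction of Japanese homophone errors using probabilistic LSA] (2004)
- Author(s)
  Takuya MISHINA (三品拓也)
- Journal Title
  IPSJ Journal Vol.45, No.9
  Pages: 2168-2176
- Description
  From the "Research Report Summary (Japanese)"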
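[Journal Article] 確率的LSAに基づくngramモデルの変分ベイズ学習を利用した文脈適応化 [Context adaptation using variational Bayesian learning for ngram models based on probabilistic LSA] (2004)
- Author(s)
  Takuya MISHINA (三品拓也)
- Journal Title
  The IEICE Transactions on Information and Systems Vol.J87-D-II, No.7
  Pages: 1409-1417
- Description
  From the "Research Report Summary (Japanese)"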
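[Journal Article] 混合ディリクレ分布パラメータの階層ベイズモデルを用いたスムージング法 [A smoothing method for parameters of Dirichlet mixtures using hierarchical Bayesian models] (2004)
- Author(s)
  Kugatsu SADAMITSU (貞光九月)
- Journal Title
  IPSJ SIG Technical Report 2004-SLP-53
  Pages: 1-6
- Description
  From the "Research Report Summary (Japanese)"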
[Journal Article] Context adaptation using variational Bayesian learning for ngram models based on probabilistic LSA (2004)
- Author(s)
  Takuya MISHINA
- Journal Title
  The IEICE Transactions on Information and Systems Vol.J87-D-II, No.7
  Pages: 1409-1417
- Description
  From the "Research Report Summary (English)"
[Journal Article] Detection and correction of Japanese homophone errors using probabilistic LSA (2004)
- Author(s)
  Takuya MISHINA
- Journal Title
  IPSJ Journal Vol.45, No.9
  Pages: 2168-2175
- Description
  From the "Research Report Summary (English)"
[Journal Article] A smoothing method for parameters of Dirichlet mixtures using hierarchical Bayesian models (2004)
- Author(s)
  Kugatsu SADAMITSU
- Journal Title
  IPSJ SIG Technical Report 2004-SLP-53
  Pages: 1-6
- Description
  From the "Research Report Summary (English)"
[Journal Article] A model for n-terms document frequency using Polya mixtures (2004)
- Author(s)
  Kugatsu SADAMITSU
- Journal Title
  Proceedings of the Tenth Annual Meeting of the Association for Natural Language Processing
  Pages: 697-700
- Description
  From the "Research Report Summary (English)"
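[Journal Article] 混合ディリクレ分布を用いた文脈のモデル化と言語モデルへの応用 [Context modeling using Dirichlet mixtures and its applications to language models] (2003)
- Author(s)
  Mikio YAMAMOTO (山本幹雄)
- Journal Title
  IPSJ SIG Technical Report 2003-SLP-48
  Pages: 29-34
- Description
  From the "Research Report Summary (Japanese)"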
[Journal Article] Context modeling using Dirichlet mixtures and its applications to language models (2003)
- Author(s)
  Mikio YAMAMOTO
- Journal Title
  IPSJ SIG Technical Report 2003-SLP-48
  Pages: 29-34
- Description
  From the "Research Report Summary (English)"
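[Journal Article] 混合ディリクレ分布を用いたトピックに基づく言語モデル [Topic-based language models using Dirichlet mixtures]
- Author(s)
  Kugatsu SADAMITSU (貞光九月)
- Journal Title
  The IEICE Transactions on Information and Systems D-II (in press)
- Description
  From the "Research Report Summary (Japanese)"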
[Journal Article] Topic-based language models using Dirichlet mixtures
- Author(s)
  Kugatsu SADAMITSU
- Journal Title
  The IEICE Transactions on Information and Systems (to appear)
- Description
  From the "Research Report Summary (English)"
