2004 Fiscal Year Final Research Report Summary

A study on optimization of units for statistical language models

Research Project

Project/Area Number 14580403
Research Category

Grant-in-Aid for Scientific Research (C)

Allocation Type Single-year Grants
Section General
Research Field Intelligent informatics
Research Institution University of Tsukuba

Principal Investigator

YAMAMOTO Mikio  University of Tsukuba, Graduate School of Systems and Information Engineering, Department of Computer Science, Associate Professor (40210562)

Project Period (FY) 2002 – 2004
Keywords Natural language processing / Machine translation / Spell checker / Mutual information / Bayesian statistics / Text modeling / Hierarchical models
Research Abstract

In this project, we reexamined two kinds of 'units' that are basic properties of statistical language models. The first is the 'token', or dictionary entry: the minimal unit from which sentences are built. Ordinary statistical language models use words or characters as tokens, but in some applications, such as machine translation, longer tokens such as phrases are known to improve system performance. We focused on automatic phrase extraction, guided by a statistical criterion, to build dictionaries for machine translation. We proposed a new criterion, minimal mutual information, and showed that it performs better than previous phrase extraction methods.
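The abstract does not define the proposed minimal mutual information criterion, so the following is only a generic illustration of how a mutual-information-style score can rank candidate two-word phrases. It computes standard pointwise mutual information (PMI) over adjacent word pairs. The function name `bigram_pmi` and the toy token list are assumptions made for this sketch; it is not the method developed in this project.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(x, y): PMI in bits} for every adjacent word pair in tokens.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with probabilities
    estimated by relative frequency. This is plain PMI, used here only
    as a stand-in for a statistical phrase-extraction criterion.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy example (hypothetical data): pairs that co-occur more often than
# their individual frequencies predict receive higher scores.
tokens = ["a", "b", "a", "b", "c", "a"]
scores = bigram_pmi(tokens)
ranked = sorted(scores, key=scores.get, reverse=True)
```

A phrase extractor built on such a score would typically keep only pairs above some threshold and add them to the dictionary as single tokens; the threshold and any extension to longer phrases are left open here.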
The second kind of unit is the 'target' that the model evaluates. Ordinary statistical language models evaluate individual sentences, but many language applications must output text made up of multiple sentences. We proposed a model that evaluates a whole text, using Dirichlet mixtures as the distribution over the parameters of a multinomial distribution; the resulting compound distribution is a Polya mixture. Our model achieved lower perplexity than other text models such as latent Dirichlet allocation (LDA). In speech recognition experiments on read documents, the model effectively corrected many misrecognized words by using information from the whole text.
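As a rough illustration of the compound distribution mentioned above, the sketch below evaluates a whole text under a mixture of Polya (Dirichlet compound multinomial) components and converts the result to per-word perplexity. It uses the standard closed form obtained by integrating the multinomial parameters out against a Dirichlet prior. The function names, the integer word encoding, and the parameter values are assumptions for this example; parameter estimation, which the abstract does not describe, is not shown.

```python
import math
from collections import Counter


def log_polya(counts, alpha):
    """Log probability of a word sequence under one Polya component.

    counts: mapping word index -> count in the text
    alpha:  list of positive Dirichlet parameters, one per vocabulary word

    P(w_1..w_n | alpha) = G(A) / G(A + n) * prod_v G(alpha_v + c_v) / G(alpha_v)
    where G is the gamma function, A = sum(alpha), n = sum of counts.
    """
    n = sum(counts.values())
    a_sum = sum(alpha)
    logp = math.lgamma(a_sum) - math.lgamma(a_sum + n)
    for v, c in counts.items():
        logp += math.lgamma(alpha[v] + c) - math.lgamma(alpha[v])
    return logp


def log_polya_mixture(counts, weights, alphas):
    """Log of sum_m weights[m] * P(text | alphas[m]), via log-sum-exp."""
    terms = [math.log(w) + log_polya(counts, a) for w, a in zip(weights, alphas)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def perplexity(doc, weights, alphas):
    """Per-word perplexity of a whole text (a list of word indices)."""
    counts = Counter(doc)
    return math.exp(-log_polya_mixture(counts, weights, alphas) / len(doc))


# Hypothetical two-word vocabulary with two mixture components.
weights = [0.5, 0.5]
alphas = [[2.0, 0.5], [0.5, 2.0]]
ppl = perplexity([0, 0, 1, 0], weights, alphas)
```

Because the probability depends on the counts for the whole text, a word that has already appeared becomes more likely to appear again. With a single component and `alpha = [1.0, 1.0]`, for instance, the sequence `[0, 0]` receives probability 1/3 rather than the 1/4 a fixed uniform multinomial would give.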

  • Research Products

    (13 results)

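  • [Journal Article] Read document recognition using document probability (2005)

    • Author(s)
      Rie NAKAZATO
    • Journal Title

      Proceedings of the 2005 Spring Meeting of the Acoustical Society of Japan, I

      Pages: 47-48

    • Description
      From "Summary of Research Results Report (Japanese)"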
  • [Journal Article] Read document recognition using document probability (2005)

    • Author(s)
      Rie NAKAZATO
    • Journal Title

      The 2005 Spring Meeting of the Acoustical Society of Japan

      Pages: 47-48

    • Description
      From "Summary of Research Results Report (English)"
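  • [Journal Article] Detection and correction of Japanese homophone errors using probabilistic LSA (2004)

    • Author(s)
      Takuya MISHINA
    • Journal Title

      IPSJ Journal Vol.45, No.9

      Pages: 2168-2176

    • Description
      From "Summary of Research Results Report (Japanese)"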
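  • [Journal Article] Context adaptation using variational Bayesian learning for n-gram models based on probabilistic LSA (2004)

    • Author(s)
      Takuya MISHINA
    • Journal Title

      The IEICE Transactions on Information and Systems D-II Vol.87, No.7

      Pages: 1409-1417

    • Description
      From "Summary of Research Results Report (Japanese)"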
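  • [Journal Article] A smoothing method for parameters of Dirichlet mixtures using hierarchical Bayesian models (2004)

    • Author(s)
      Kugatsu SADAMITSU
    • Journal Title

      IPSJ SIG Technical Report 2004-SLP-53

      Pages: 1-6

    • Description
      From "Summary of Research Results Report (Japanese)"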
  • [Journal Article] Context adaptation using variational Bayesian learning for n-gram models based on probabilistic LSA (2004)

    • Author(s)
      Takuya MISHINA
    • Journal Title

      The IEICE Transactions on Information and Systems Vol.J87-D-II, No.7

      Pages: 1409-1417

    • Description
      From "Summary of Research Results Report (English)"
  • [Journal Article] Detection and correction of Japanese homophone errors using probabilistic LSA (2004)

    • Author(s)
      Takuya MISHINA
    • Journal Title

      IPSJ Journal Vol.45, No.9

      Pages: 2168-2175

    • Description
      From "Summary of Research Results Report (English)"
  • [Journal Article] A smoothing method for parameters of Dirichlet mixtures using hierarchical Bayesian models (2004)

    • Author(s)
      Kugatsu SADAMITSU
    • Journal Title

      IPSJ SIG Technical Report 2004-SLP-53

      Pages: 1-6

    • Description
      From "Summary of Research Results Report (English)"
  • [Journal Article] A model for n-terms document frequency using Polya mixtures (2004)

    • Author(s)
      Kugatsu SADAMITSU
    • Journal Title

      Proceedings of the Tenth Annual Meeting of the Association for Natural Language Processing

      Pages: 697-700

    • Description
      From "Summary of Research Results Report (English)"
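  • [Journal Article] Context modeling using Dirichlet mixtures and its applications to language models (2003)

    • Author(s)
      Mikio YAMAMOTO
    • Journal Title

      IPSJ SIG Technical Report 2003-SLP-48

      Pages: 29-34

    • Description
      From "Summary of Research Results Report (Japanese)"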
  • [Journal Article] Context modeling using Dirichlet mixtures and its applications to language models (2003)

    • Author(s)
      Mikio YAMAMOTO
    • Journal Title

      IPSJ SIG Technical Report 2003-SLP-48

      Pages: 29-34

    • Description
      From "Summary of Research Results Report (English)"
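  • [Journal Article] Topic-based language models using Dirichlet mixtures

    • Author(s)
      Kugatsu SADAMITSU
    • Journal Title

      The IEICE Transactions on Information and Systems D-II (in press)

    • Description
      From "Summary of Research Results Report (Japanese)"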
  • [Journal Article] Topic-based language models using Dirichlet mixtures.

    • Author(s)
      Kugatsu SADAMITSU
    • Journal Title

      The IEICE Transactions on Information and Systems (to appear)

    • Description
      From "Summary of Research Results Report (English)"

Published: 2006-07-11  
