
Study of Class-based Language Model and its Application to Japanese Morphological Analysis

Research Project

Project/Area Number 10680383
Research Category

Grant-in-Aid for Scientific Research (C)

Allocation Type Single-year Grants
Section General
Research Field Intelligent informatics
Research Institution The University of Tokushima

Principal Investigator

KITA Kenji  The University of Tokushima, Faculty of Engineering, Associate Professor (10243734)

Project Period (FY) 1998 – 1999
Project Status Completed (Fiscal Year 1999)
Budget Amount
¥2,400,000 (Direct Cost: ¥2,400,000)
Fiscal Year 1999: ¥800,000 (Direct Cost: ¥800,000)
Fiscal Year 1998: ¥1,600,000 (Direct Cost: ¥1,600,000)
Keywords natural language processing / Japanese language processing / morphological analysis / word segmentation / probabilistic language model / PPM* model / character class / clustering / PPM model
Research Abstract

Morphological analysis is the most fundamental process in Japanese language processing. Within Japanese morphological analysis, word segmentation is an important problem because the Japanese writing system does not mark word boundaries.
In this research project, we first studied a word segmentation model using a character-based n-gram model, which serves as our baseline method. Next, we applied the PPM* compression algorithm to the problem of word segmentation. PPM (Prediction by Partial Matching) is a lossless compression algorithm based on a finite-context probabilistic modeling technique, and PPM* is a variant of PPM that places no a priori bound on context length.
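As a rough illustration of the baseline idea, the sketch below treats the word boundary as an explicit symbol inside a character bigram model and places a boundary wherever doing so raises the probability of the symbol sequence. The boundary symbol `|`, the add-one smoothing, the bigram order, and the function names `train`, `logp`, and `segment` are all assumptions made for this example; the abstract does not specify the project's actual formulation.

```python
import math
from collections import Counter

SEP = "|"  # hypothetical word-boundary symbol (an assumption of this sketch)


def train(segmented_sentences):
    """Count bigrams over characters interleaved with boundary symbols."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for words in segmented_sentences:
        symbols = [SEP]
        for word in words:
            symbols.extend(word)
            symbols.append(SEP)
        vocab.update(symbols)
        for a, b in zip(symbols, symbols[1:]):
            bigrams[a, b] += 1
            contexts[a] += 1
    return bigrams, contexts, len(vocab)


def logp(model, a, b):
    """Add-one smoothed log P(b | a)."""
    bigrams, contexts, vocab_size = model
    return math.log((bigrams[a, b] + 1) / (contexts[a] + vocab_size))


def segment(model, text):
    """Insert a boundary in each gap where it makes the sequence more probable."""
    if not text:
        return []
    words, current = [], text[0]
    for prev, nxt in zip(text, text[1:]):
        joined = logp(model, prev, nxt)
        split = logp(model, prev, SEP) + logp(model, SEP, nxt)
        if split > joined:
            words.append(current)
            current = nxt
        else:
            current += nxt
    words.append(current)
    return words


# Toy training data: each sentence is a list of words.
model = train([["ab", "cd"], ["ab", "cd"], ["ab"]])
print(segment(model, "abcd"))
```

With a bigram model each gap can be decided independently, which keeps the sketch short. A longer context, or an unbounded one as in PPM*, would instead need a search over whole segmentations, for example by dynamic programming.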
We then studied a word segmentation method based on a character class model. A character class model is more robust than a character-based model because it has fewer parameters. Japanese characters are clustered by a greedy search whose criterion is the entropy measured on a corpus different from the one used for model estimation. Because the criterion is evaluated on this separate corpus, the clustering method yields an optimum character classification without the number of classes being specified in advance. In experiments on the ADD (ATR Dialogue Database) corpus, the proposed Japanese word segmenter using the character class model achieved higher accuracy than a character-based model. In particular, the proposed method using a variable-length n-gram class model achieved 96.38% recall and 96.23% precision on open text.
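The clustering idea can be sketched as follows, under strong simplifying assumptions: a class bigram model with add-one smoothing, a search that only merges whole classes, and acceptance of the first merge that lowers the entropy of a held-out text. The function names `cross_entropy` and `greedy_cluster` and the toy data are hypothetical; the project's actual search moves and model order are not given in the abstract.

```python
import math
from collections import Counter
from itertools import combinations


def cross_entropy(train, heldout, cls):
    """Bits per character of `heldout` under a class bigram model fit on `train`."""
    k = len(set(cls.values()))        # number of classes
    members = Counter(cls.values())   # characters per class
    class_bigrams, class_contexts = Counter(), Counter()
    char_counts, class_sizes = Counter(), Counter()
    for a, b in zip(train, train[1:]):
        class_bigrams[cls[a], cls[b]] += 1
        class_contexts[cls[a]] += 1
    for ch in train:
        char_counts[ch] += 1
        class_sizes[cls[ch]] += 1
    bits = 0.0
    for a, b in zip(heldout, heldout[1:]):
        # P(b | a) = P(class(b) | class(a)) * P(b | class(b)), both add-one smoothed
        p_class = (class_bigrams[cls[a], cls[b]] + 1) / (class_contexts[cls[a]] + k)
        p_char = (char_counts[b] + 1) / (class_sizes[cls[b]] + members[cls[b]])
        bits -= math.log2(p_class * p_char)
    return bits / (len(heldout) - 1)


def greedy_cluster(train, heldout):
    """Merge classes greedily while held-out entropy keeps decreasing."""
    cls = {ch: ch for ch in set(train) | set(heldout)}  # one class per character
    best = cross_entropy(train, heldout, cls)
    improved = True
    while improved:
        improved = False
        for x, y in combinations(sorted(set(cls.values())), 2):
            trial = {ch: (x if c == y else c) for ch, c in cls.items()}
            h = cross_entropy(train, heldout, trial)
            if h < best:
                cls, best, improved = trial, h, True
                break  # restart the scan from the new classification
    return cls, best


train_text = "abcabdabcabd" * 3
heldout_text = "abdabcabc"
classes, bits = greedy_cluster(train_text, heldout_text)
print(len(set(classes.values())), "classes,", round(bits, 3), "bits per character")
```

The search stops on its own when no merge lowers the held-out entropy, so the number of classes is an outcome of the procedure rather than an input. Measuring entropy on the training text itself would not give such a stopping point, since merging classes can only worsen the unsmoothed fit to the data used for estimation.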

Report

(3 results)
  • 1999 Annual Research Report   Final Research Report Summary
  • 1998 Annual Research Report
  • Research Products

    (30 results)

All Publications (30 results)

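  • [Publications] H. Oda, S. Mori, K. Kita: "A Japanese Word Segmenter by a Character Class Model" Journal of Natural Language Processing (in Japanese). 6・7. 93-108 (1999)

    • Description
      From the Final Research Report Summary (Japanese)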
    • Related Report
      1999 Final Research Report Summary
  • [Publications] K.Kita: "Automatic Clustering of Languages Based on Probabilistic Models" Journal of Quantitative Linguistics. 6・2. 167-171 (1999)

    • Description
      From the Final Research Report Summary (Japanese)
    • Related Report
      1999 Final Research Report Summary
  • [Publications] H.Oda, K.Kita: "A Character-Based Japanese Word Segmenter Using PPM*-Based Language Model" Proceedings of ICCPOL'99. 527-532 (1999)

    • Description
      From the Final Research Report Summary (Japanese)
    • Related Report
      1999 Final Research Report Summary
  • [Publications] X-Y.Tai, Y.Kato, K.Kita: "Automatically Compiling Multilingual Translations from the World Wide Web" Proceedings of ISMT & CLIP. 516-521 (1998)

    • Description
      From the Final Research Report Summary (Japanese)
    • Related Report
      1999 Final Research Report Summary
  • [Publications] Y.Tanaka, K.Kita: "JCKE Multilingual Corpus of Major Asian Languages" Proceedings of TKE'99. 660-670 (1999)

    • Description
      From the Final Research Report Summary (Japanese)
    • Related Report
      1999 Final Research Report Summary
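  • [Publications] H. Oda, K. Kita: "A Japanese Word Segmenter Using a PPM*-Based Language Model" Journal of IPSJ (in Japanese). (in press). (2000)

    • Description
      From the Final Research Report Summary (Japanese)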
    • Related Report
      1999 Final Research Report Summary
  • [Publications] K. Kita: "Probabilistic Language Models" (in Japanese). University of Tokyo Press. 256 (1999)

    • Description
      From the Final Research Report Summary (Japanese)
    • Related Report
      1999 Final Research Report Summary
  • [Publications] K. Kita, M. Sasaki, X-Y. Tai: "Rule-Based Hierarchical Document Categorization for the World Wide Web" Asia Pacific Web Conference (AP-Web98). 269-273 (1998)

    • Description
      From the Final Research Report Summary (English)
    • Related Report
      1999 Final Research Report Summary
  • [Publications] M. Sasaki, K. Kita: "Rule-Based Text Categorization Using Hierarchical Categories" 1998 IEEE International Conference on Systems, Man and Cybernetics. 2827-2830 (1998)

    • Description
      From the Final Research Report Summary (English)
    • Related Report
      1999 Final Research Report Summary
  • [Publications] M. Sasaki, K. Kita: "Automatic Text Categorization based on Hierarchical Rules" 5th International Conference on Soft Computing. 935-938 (1998)

    • Description
      From the Final Research Report Summary (English)
    • Related Report
      1999 Final Research Report Summary
  • [Publications] Y. Kato, K. Kita: "Modern Japanese Processing Problems - Problems of "Kata-Kana" Appeared in Loan Words -" 18th International Conference on Computer Processing of Oriental Languages. 305-308 (1999)

    • Description
      From the Final Research Report Summary (English)
    • Related Report
      1999 Final Research Report Summary
  • [Publications] H. Oda, K. Kita: "A Character-Based Japanese Word Segmenter Using PPM*-Based Language Model" 18th International Conference on Computer Processing of Oriental Languages. 527-532 (1999)

    • Description
      From the Final Research Report Summary (English)
    • Related Report
      1999 Final Research Report Summary
  • [Publications] X-Y. Tai, Y. Kato, K. Kita: "Automatically Compiling Multilingual Translations from the World Wide Web" International Symposium on Machine Translation and Computer Language Information Processing (ISMT & CLIP). 516-521 (1999)

    • Description
      From the Final Research Report Summary (English)
    • Related Report
      1999 Final Research Report Summary
  • [Publications] Y. Tanaka, K. Kita: "JCKE Multilingual Corpus of Major Asian Languages" Fifth International Congress on Terminology and Knowledge Engineering (TKE'99). 660-670 (1999)

    • Description
      From the Final Research Report Summary (English)
    • Related Report
      1999 Final Research Report Summary
  • [Publications] K. Kita: "Automatic Clustering of Languages Based on Probabilistic Models" Journal of Quantitative Linguistics. Vol. 6, No. 2. 167-171 (1999)

    • Description
      From the Final Research Report Summary (English)
    • Related Report
      1999 Final Research Report Summary
  • [Publications] H. Oda, S. Mori, K. Kita: "A Japanese Word Segmenter by a Character Class Model" Journal of Natural Language Processing. Vol. 6, No. 7 (in Japanese). 93-108 (1999)

    • Description
      From the Final Research Report Summary (English)
    • Related Report
      1999 Final Research Report Summary
  • [Publications] H. Oda, K. Kita: "A Japanese Word Segmenter Using a PPM*-Based Language Model" Journal of IPSJ. (in Japanese) (in press).

    • Description
      From the Final Research Report Summary (English)
    • Related Report
      1999 Final Research Report Summary
  • [Publications] H. Oda, S. Mori, K. Kita: "A Japanese Word Segmenter by a Character Class Model" Journal of Natural Language Processing (in Japanese). 6・7. 93-108 (1999)

    • Related Report
      1999 Annual Research Report
  • [Publications] K.Kita: "Automatic Clustering of Languages Based on Probabilistic Models" Journal of Quantitative Linguistics. 6・2. 167-171 (1999)

    • Related Report
      1999 Annual Research Report
  • [Publications] H.Oda, K.Kita: "A Character-Based Japanese Word Segmenter Using PPM*-Based Language Model" Proceedings of ICCPOL'99. 527-532 (1999)

    • Related Report
      1999 Annual Research Report
  • [Publications] X-Y.Tai, Y.Kato, K.Kita: "Automatically Compiling Multilingual Translations from the World Wide Web" Proceedings of ISMT & CLIP. 516-521 (1998)

    • Related Report
      1999 Annual Research Report
  • [Publications] Y.Tanaka, K.Kita: "JCKE Multilingual Corpus of Major Asian Languages" Proceedings of TKE'99. 660-670 (1999)

    • Related Report
      1999 Annual Research Report
  • [Publications] H. Oda, K. Kita: "A Japanese Word Segmenter Using a PPM*-Based Language Model" Journal of IPSJ (in Japanese). (in press). (2000)

    • Related Report
      1999 Annual Research Report
  • [Publications] K. Kita: "Probabilistic Language Models" (in Japanese). University of Tokyo Press. 256 (1999)

    • Related Report
      1999 Annual Research Report
  • [Publications] Kenji Kita et al.: "Rule-based hierarchical document categorization for the World Wide Web" Proceedings of APWEB'98. (1998)

    • Related Report
      1998 Annual Research Report
  • [Publications] Minoru Sasaki and Kenji Kita: "Automatic text categorization based on hierarchical rules" Proceedings of IIZUKA'98. (1998)

    • Related Report
      1998 Annual Research Report
  • [Publications] Minoru Sasaki and Kenji Kita: "Rule-based text categorization using hierarchical categories" Proceedings of IEEE SMC'98. (1998)

    • Related Report
      1998 Annual Research Report
  • [Publications] H. Oda, K. Kita: "Japanese Word Segmentation Using a PPM Model" (in Japanese). Natural Language Processing Research Group (自然言語処理研究会). 128. 2827-2830 (1998)

    • Related Report
      1998 Annual Research Report
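  • [Publications] K. Kita, N. Yamaguchi (山口直宏): "Automatic Collection of Bilingual Translation Data from the World Wide Web" (in Japanese). Natural Language Processing Research Group (自然言語処理研究会). 128. 127-134 (1998)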

    • Related Report
      1998 Annual Research Report
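  • [Publications] H. Oda, K. Kita: "Japanese Word Segmentation Based on a Character Class Model" (in Japanese). Natural Language Processing Research Group (自然言語処理研究会). (forthcoming). (1999)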

    • Related Report
      1998 Annual Research Report

Published: 1998-04-01   Modified: 2016-04-21  
