Study of Class-based Language Model and its Application to Japanese Morphological Analysis

Research Project

Project/Area Number	10680383
Research Category	Grant-in-Aid for Scientific Research (C)
Allocation Type	Single-year Grants
Section	General
Research Field	Intelligent informatics
Research Institution	The University of Tokushima
Principal Investigator	KITA Kenji The University of Tokushima, Faculty of Engineering, Associate Professor (10243734)
Project Period (FY)	1998 – 1999
Project Status	Completed (Fiscal Year 1999)
Budget Amount	¥2,400,000 (Direct Cost: ¥2,400,000); Fiscal Year 1999: ¥800,000 (Direct Cost: ¥800,000); Fiscal Year 1998: ¥1,600,000 (Direct Cost: ¥1,600,000)
Keywords	natural language processing / Japanese language processing / morphological analysis / word segmentation / probabilistic language model / PPM* model / character class / clustering / PPM model
Research Abstract	Morphological analysis is the most fundamental process in Japanese language processing. Word segmentation is a central problem in Japanese morphological analysis because the writing system does not mark word boundaries. In this research project, we first studied a word segmentation model that uses a character-based n-gram model, which serves as our baseline method. Next, we applied the PPM* compression algorithm to word segmentation. PPM (Prediction by Partial Matching) is a lossless compression algorithm based on finite-context probabilistic modeling, and PPM* is a variant of PPM that places no a priori bound on context length. We then studied a word segmentation method based on a character class model. The character class model is more robust than a character-based model because it has fewer parameters. The criterion for clustering Japanese characters is the entropy measured on a corpus different from the one used for model estimation, and the search is based on a greedy algorithm. For this reason, the clustering method yields an optimum character classification without requiring the number of classes to be specified in advance. In experiments on the ADD (ATR Dialogue Database) corpus, the proposed Japanese word segmenter using the character class model achieved higher accuracy than a character-based model. In particular, the proposed method using a variable-length n-gram class model achieved 96.38% recall and 96.23% precision on open text.
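
The clustering idea in the abstract (greedy search, scored by entropy on a corpus separate from the one used for estimation, with no preset number of classes) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the project reports: the class bigram model, the add-one smoothing, the bottom-up merge direction, the toy strings, and the hypothetical function names `heldout_entropy` and `greedy_cluster`.

```python
import math
from collections import Counter
from itertools import combinations


def heldout_entropy(train, heldout, cls):
    """Per-character cross-entropy (bits) of `heldout` under a class bigram
    model estimated from `train`. `cls` maps each character to a class id.
    Assumed model: P(c | prev) = P(class(c) | class(prev)) * P(c | class(c)),
    both with add-one smoothing (an illustrative choice)."""
    n_classes = len(set(cls.values()))
    members = Counter(cls.values())   # number of characters in each class
    bigram, context = Counter(), Counter()
    emit, class_total = Counter(), Counter()
    prev = "<s>"                      # sentence-start context
    for ch in train:
        k = cls[ch]
        bigram[(prev, k)] += 1
        context[prev] += 1
        emit[ch] += 1
        class_total[k] += 1
        prev = k
    bits, prev = 0.0, "<s>"
    for ch in heldout:
        k = cls[ch]
        p_class = (bigram[(prev, k)] + 1) / (context[prev] + n_classes)
        p_char = (emit[ch] + 1) / (class_total[k] + members[k])
        bits -= math.log2(p_class * p_char)
        prev = k
    return bits / len(heldout)


def greedy_cluster(train, heldout):
    """Start with one class per character and repeatedly apply the single
    merge that lowers held-out entropy the most. Stop when no merge helps,
    so the number of classes falls out of the search."""
    alphabet = sorted(set(train) | set(heldout))
    cls = {ch: i for i, ch in enumerate(alphabet)}
    best = heldout_entropy(train, heldout, cls)
    while True:
        candidate = None
        for a, b in combinations(sorted(set(cls.values())), 2):
            trial = {ch: (a if k == b else k) for ch, k in cls.items()}
            h = heldout_entropy(train, heldout, trial)
            if h < best:              # keep the best merge seen in this pass
                best, candidate = h, trial
        if candidate is None:
            return cls, best
        cls = candidate


if __name__ == "__main__":
    # Toy data only; a real experiment would use two disjoint corpora.
    toy_train = "わたしはがくせいですあなたはせんせいです"
    toy_heldout = "あなたはがくせいですわたしはせんせいです"
    classes, entropy = greedy_cluster(toy_train, toy_heldout)
    print(len(set(classes.values())), "classes,", round(entropy, 3), "bits/char")
```

Scoring merges on held-out text rather than on the training text is what gives the search a natural stopping point: a merge that only fits the training data tends not to reduce held-out entropy, so it is rejected.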

Report

(3 results)

1999 Annual Research Report Final Research Report Summary
1998 Annual Research Report

Research Products
(30 results)

All Publications (30 results)

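[Publications] H.Oda,S.Mori,K.Kita: "A Japanese Word Segmenter by a Character Class Model"Journal of Natural Language Processing (in Japanese). 6・7. 93-108 (1999)
- Description
  From the Final Research Report Summary (Japanese)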
- Related Report
  1999 Final Research Report Summary
[Publications] K.Kita: "Automatic Clustering of Languages Based on Probabilistic Models"Journal of Quantitative Linguistics. 6・2. 167-171 (1999)
- Description
  From the Final Research Report Summary (Japanese)
- Related Report
  1999 Final Research Report Summary
[Publications] H.Oda,K.Kita: "A Character-Based Japanese Word Segmenter Using PPM*-Based Language Model"Proceedings of ICCPOL'99. 527-532 (1999)
- Description
  From the Final Research Report Summary (Japanese)
- Related Report
  1999 Final Research Report Summary
[Publications] X-Y.Tai,Y.Kato,K.Kita: "Automatically Compiling Multilingual Translations from the World Wide Web"Proceedings of ISMT & CLIP. 516-521 (1998)
- Description
  From the Final Research Report Summary (Japanese)
- Related Report
  1999 Final Research Report Summary
[Publications] Y.Tanaka,K.Kita: "JCKE Multilingual Corpus of Major Asian Languages"Proceedings of TKE'99. 660-670 (1999)
- Description
  From the Final Research Report Summary (Japanese)
- Related Report
  1999 Final Research Report Summary
[Publications] H.Oda,K.Kita: "A Japanese Word Segmenter Using a PPM*-Based Language Model"Journal of IPSJ (in Japanese). (in press). (2000)
- Description
  From the Final Research Report Summary (Japanese)
- Related Report
  1999 Final Research Report Summary
[Publications] K.Kita: "Probabilistic Language Models"University of Tokyo Press (in Japanese). 256 (1999)
- Description
  From the Final Research Report Summary (Japanese)
- Related Report
  1999 Final Research Report Summary
[Publications] K. Kita, M. Sasaki, X-Y. Tai: "Rule-Based Hierarchical Document Categorization for the World Wide Web"Asia Pacific Web Conference (AP-Web98). 269-273 (1998)
- Description
  From the Final Research Report Summary (English)
- Related Report
  1999 Final Research Report Summary
[Publications] M. Sasaki, K. Kita: "Rule-Based Text Categorization Using Hierarchical Categories"1998 IEEE International Conference on Systems, Man and Cybernetics. 2827-2830 (1998)
- Description
  From the Final Research Report Summary (English)
- Related Report
  1999 Final Research Report Summary
[Publications] M. Sasaki, K. Kita: "Automatic Text Categorization based on Hierarchical Rules"5th International Conference on Soft Computing. 935-938 (1998)
- Description
  From the Final Research Report Summary (English)
- Related Report
  1999 Final Research Report Summary
[Publications] Y. Kato, K. Kita: "Modern Japanese Processing Problems - Problems of "Kata-Kana" Appearing in Loan Words -"18th International Conference on Computer Processing of Oriental Languages. 305-308 (1999)
- Description
  From the Final Research Report Summary (English)
- Related Report
  1999 Final Research Report Summary
[Publications] H. Oda, K. Kita: "A Character-Based Japanese Word Segmenter Using PPM*-Based Language Model"18th International Conference on Computer Processing of Oriental Languages. 527-532 (1999)
- Description
  From the Final Research Report Summary (English)
- Related Report
  1999 Final Research Report Summary
[Publications] X-Y. Tai, Y. Kato, K. Kita: "Automatically Compiling Multilingual Translations from the World Wide Web"International Symposium on Machine Translation and Computer Language Information Processing (ISMT & CLIP). 516-521 (1999)
- Description
  From the Final Research Report Summary (English)
- Related Report
  1999 Final Research Report Summary
[Publications] Y. Tanaka, K. Kita: "JCKE Multilingual Corpus of Major Asian Languages"Fifth International Congress on Terminology and Knowledge Engineering (TKE'99). 660-670 (1999)
- Description
  From the Final Research Report Summary (English)
- Related Report
  1999 Final Research Report Summary
[Publications] K. Kita: "Automatic Clustering of Languages Based on Probabilistic Models"Journal of Quantitative Linguistics. Vol. 6, No. 2. 167-171 (1999)
- Description
  From the Final Research Report Summary (English)
- Related Report
  1999 Final Research Report Summary
[Publications] H. Oda, S. Mori, K. Kita: "A Japanese Word Segmenter by a Character Class Model"Journal of Natural Language Processing. Vol. 6, No. 7 (in Japanese). 93-108 (1999)
- Description
  From the Final Research Report Summary (English)
- Related Report
  1999 Final Research Report Summary
[Publications] H. Oda, K. Kita: "A Japanese Word Segmenter Using a PPM*-Based Language Model"Journal of IPSJ. (in Japanese) (in press).
- Description
  From the Final Research Report Summary (English)
- Related Report
  1999 Final Research Report Summary
[Publications] H.Oda,S.Mori,K.Kita: "A Japanese Word Segmenter by a Character Class Model"Journal of Natural Language Processing (in Japanese). 6・7. 93-108 (1999)
- Related Report
  1999 Annual Research Report
[Publications] K.Kita: "Automatic Clustering of Languages Based on Probabilistic Models"Journal of Quantitative Linguistics. 6・2. 167-171 (1999)
- Related Report
  1999 Annual Research Report
[Publications] H.Oda,K.Kita: "A Character-Based Japanese Word Segmenter Using PPM*-Based Language Model"Proceedings of ICCPOL'99. 527-532 (1999)
- Related Report
  1999 Annual Research Report
[Publications] X-Y.Tai,Y.Kato,K.Kita: "Automatically Compiling Multilingual Translations from the World Wide Web"Proceedings of ISMT&CLIP. 516-521 (1998)
- Related Report
  1999 Annual Research Report
[Publications] Y.Tanaka,K.Kita: "JCKE Multilingual Corpus of Major Asian Languages"Proceedings of TKE'99. 660-670 (1999)
- Related Report
  1999 Annual Research Report
[Publications] H.Oda,K.Kita: "A Japanese Word Segmenter Using a PPM*-Based Language Model"Journal of IPSJ (in Japanese). (in press). (2000)
- Related Report
  1999 Annual Research Report
[Publications] K.Kita: "Probabilistic Language Models"University of Tokyo Press (in Japanese). 256 (1999)
- Related Report
  1999 Annual Research Report
[Publications] Kenji Kita et al.: "Rule-based hierarchical document categorization for the World Wide Web" Proceedings of APWEB'98. (1998)
- Related Report
  1998 Annual Research Report
[Publications] Minoru Sasaki and Kenji Kita: "Automatic text categorization based on hierarchical rules" Proceedings of IIZUKA'98. (1998)
- Related Report
  1998 Annual Research Report
[Publications] Minoru Sasaki and Kenji Kita: "Rule-based text categorization using hierarchical categories" Proceedings of IEEE SMC'98. (1998)
- Related Report
  1998 Annual Research Report
[Publications] H. Oda, K. Kita: "Japanese Word Segmentation Using a PPM Model" Special Interest Group on Natural Language Processing (in Japanese). 128. 2827-2830 (1998)
- Related Report
  1998 Annual Research Report
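[Publications] K. Kita, N. Yamaguchi (山口直宏): "Automatic Collection of Bilingual Data from the World Wide Web" Special Interest Group on Natural Language Processing (in Japanese). 128. 127-134 (1998)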
- Related Report
  1998 Annual Research Report
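[Publications] H. Oda, K. Kita: "Japanese Word Segmentation Based on a Character Class Model" Special Interest Group on Natural Language Processing (in Japanese). (forthcoming). (1999)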
- Related Report
  1998 Annual Research Report
