Studies on Corpus Creation and Use for Linguistic Research

Research Project

Project/Area Number	15300046
Research Category	Grant-in-Aid for Scientific Research (B)
Allocation Type	Single-year Grants
Section	General
Research Field	Intelligent informatics
Research Institution	Nara Institute of Science and Technology
Principal Investigator	MATSUMOTO Yuji, Nara Institute of Science and Technology, Graduate School of Information Science, Professor (10211575)
Co-Investigator (Kenkyū-buntansha)	ASAHARA Masayuki, Nara Institute of Science and Technology, Graduate School of Information Science, Assistant Professor (80379528); HASHIMOTO Kiyota, Osaka Prefectural University, School of Humanities & Social Sciences, Associate Professor (50278818); TONO Yukio, Meikai University, Faculty of Languages, Professor (10211393); OHTANI Akira, Osaka Gakuin University, Faculty of Informatics, Lecturer (50283817); INUI Kentaro, Nara Institute of Science and Technology, Graduate School of Information Science, Associate Professor (60272689)
Project Period (FY)	2003 – 2005
Project Status	Completed (Fiscal Year 2005)
Budget Amount	¥14,500,000 (Direct Cost: ¥14,500,000); Fiscal Year 2005: ¥5,300,000 (Direct Cost: ¥5,300,000); Fiscal Year 2004: ¥4,600,000 (Direct Cost: ¥4,600,000); Fiscal Year 2003: ¥4,600,000 (Direct Cost: ¥4,600,000)
Keywords	corpus / natural language processing / part-of-speech tagging / dependency analysis / database / retrieval / multi-lingual processing / KWIC / language corpus / language processing / word search / string search / tagged corpus
Research Abstract	For language processing, we extended the language analysis tools we have been developing, such as our Japanese morphological analyzer and Japanese dependency analyzer, to handle Chinese. For dictionary development, we implemented an unknown-word analysis system for Chinese and extracted candidate new word entries by running the system on a large-scale Chinese corpus. Through this experiment, we constructed a large-scale Chinese dictionary with about one hundred thousand word entries. For Japanese, we described the constituent-word information of Japanese compound words and registered this information in the dictionary. For English, we developed a method for distinguishing literal and idiomatic uses of English multi-word expressions and showed that it distinguishes them with high accuracy. For corpus tool development, we designed detailed database schemas for annotated corpora and dictionary entries and re-implemented the corpus management tool based on these schemas. We also implemented error correction functions for part-of-speech and dependency analysis errors, and designed and implemented an interface for these functions. We further implemented a visualization function that displays phrasal chunks and their dependency relations; one of the error correction functions is built on it. The corpus management tools have been released to the public, and we held two seminars to publicize them and explain their usage to interested users, with the aim of collecting user feedback. We also opened a Web page for introducing and downloading the tools.

Report

(4 results)

2005 Annual Research Report / Final Research Report Summary
2004 Annual Research Report
2003 Annual Research Report

Research Products
(28 results)

All Journal Article (23 results) Publications (5 results)

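[Journal Article] Japanese Dependency Analysis Model with Relative Strength of Dependency (in Japanese) (2005)
- Author(s)
  Taku Kudo, Yuji Matsumoto
- Journal Title
  
  Transactions of Information Processing Society of Japan Vol.46, No.4
  
  Pages: 1082-1092
- NAID
  110002911748
- Description
  From the "Final Research Report Summary (Japanese)"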
- Related Report
  2005 Final Research Report Summary
[Journal Article] Chinese Word Segmentation by Classification of Characters (2005)
- Author(s)
  Chooi-Ling Goh, Masayuki Asahara, Yuji Matsumoto
- Journal Title
  
  International Journal of Computational Linguistics and Chinese Language Processing Vol.10, No.3
  
  Pages: 381-396
- Description
  From the "Final Research Report Summary (Japanese)"
- Related Report
  2005 Annual Research Report, 2005 Final Research Report Summary
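[Journal Article] Chinese and Japanese Word Segmentation with Word Level and Character Level Information (in Japanese) (2005)
- Author(s)
  Tetsuji Nakagawa, Yuji Matsumoto
- Journal Title
  
  Transactions of Information Processing Society of Japan Vol.46, No.11
  
  Pages: 2714-2727
- NAID
  110002911747
- Description
  From the "Final Research Report Summary (Japanese)"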
- Related Report
  2005 Final Research Report Summary
[Journal Article] ChaKi: An Annotated Corpora Management and Search System (2005)
- Author(s)
  Yuji Matsumoto, Masayuki Asahara, et al.
- Journal Title
  
  Proceedings from the Corpus Linguistics Conference Series Vol.1, No.1
- Description
  From the "Final Research Report Summary (Japanese)"
- Related Report
  2005 Final Research Report Summary
[Journal Article] Automatic Extraction of Fixed Multiword Expressions (2005)
- Author(s)
  Campbell Hore, Masayuki Asahara, Yuji Matsumoto
- Journal Title
  
  Natural Language Processing, Second International Joint Conference, Lecture Notes in Artificial Intelligence Vol.3651
  
  Pages: 565-575
- NAID
  110002949453
- Description
  From the "Final Research Report Summary (Japanese)"
- Related Report
  2005 Final Research Report Summary
[Journal Article] Chinese Deterministic Dependency Analyzer: Examining Effects of Global Features and Root Node Finder (2005)
- Author(s)
  Yuchang Cheng, Masayuki Asahara, Yuji Matsumoto
- Journal Title
  
  Fourth SIGHAN Workshop on Chinese Language Processing, Proceedings of the Workshop Vol.4
  
  Pages: 17-24
- Description
  From the "Final Research Report Summary (Japanese)"
- Related Report
  2005 Final Research Report Summary
[Journal Article] Japanese Dependency Analysis Model with Relative Strength of Dependency (in Japanese) (2005)
- Author(s)
  Taku Kudo, Yuji Matsumoto
- Journal Title
  
  Transactions of Information Processing Society of Japan Vol.46, No.4
  
  Pages: 1082-1092
- Description
  From the "Final Research Report Summary (English)"
- Related Report
  2005 Final Research Report Summary
[Journal Article] Chinese Word Segmentation by Classification of Characters (2005)
- Author(s)
  Chooi-Ling Goh, Masayuki Asahara, Yuji Matsumoto
- Journal Title
  
  International Journal of Computational Linguistics and Chinese Language Processing Vol.10, No.3
  
  Pages: 381-396
- Description
  From the "Final Research Report Summary (English)"
- Related Report
  2005 Final Research Report Summary
[Journal Article] Chinese and Japanese Word Segmentation with Word Level and Character Level Information (in Japanese) (2005)
- Author(s)
  Tetsuji Nakagawa, Yuji Matsumoto
- Journal Title
  
  Transactions of Information Processing Society of Japan Vol.46, No.11
  
  Pages: 2714-2727
- Description
  From the "Final Research Report Summary (English)"
- Related Report
  2005 Final Research Report Summary
[Journal Article] ChaKi: An Annotated Corpora Management and Search System (2005)
- Author(s)
  Yuji Matsumoto, Masayuki Asahara, Yukio Tono, Akira Ohtani, Toshio Morita
- Journal Title
  
  Proceedings from the Corpus Linguistics Conference Series Vol.1, No.1
- Description
  From the "Final Research Report Summary (English)"
- Related Report
  2005 Final Research Report Summary
[Journal Article] Automatic Extraction of Fixed Multiword Expressions (2005)
- Author(s)
  Campbell Hore, Masayuki Asahara, Yuji Matsumoto
- Journal Title
  
  Natural Language Processing, Second International Joint Conference, Lecture Notes in Artificial Intelligence Vol.3651
  
  Pages: 565-575
- NAID
  110002949453
- Description
  From the "Final Research Report Summary (English)"
- Related Report
  2005 Final Research Report Summary
[Journal Article] Chinese Deterministic Dependency Analyzer: Examining Effects of Global Features and Root Node Finder (2005)
- Author(s)
  Yuchang Cheng, Masayuki Asahara, Yuji Matsumoto
- Journal Title
  
  Fourth SIGHAN Workshop on Chinese Language Processing, Proceedings of the Workshop Vol.4
  
  Pages: 17-24
- Description
  From the "Final Research Report Summary (English)"
- Related Report
  2005 Final Research Report Summary
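[Journal Article] Japanese Dependency Analysis Model with Relative Strength of Dependency (in Japanese) (2005)
- Author(s)
  Taku Kudo, Yuji Matsumoto
- Journal Title
  
  Transactions of Information Processing Society of Japan Vol.46, No.4
  
  Pages: 1082-1092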
- NAID
  110002911748
- Related Report
  2005 Annual Research Report
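[Journal Article] Chinese and Japanese Word Segmentation with Word Level and Character Level Information (in Japanese) (2005)
- Author(s)
  Tetsuji Nakagawa, Yuji Matsumoto
- Journal Title
  
  Transactions of Information Processing Society of Japan Vol.46, No.11
  
  Pages: 2714-2727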
- NAID
  110002911747
- Related Report
  2005 Annual Research Report
[Journal Article] ChaKi: An Annotated Corpora Management and Search System (2005)
- Author(s)
  Yuji Matsumoto, Masayuki Asahara, Kou Kawabe, Yurika Takahashi, Yukio Tono, Akira Ohtani, Toshio Morita
- Journal Title
  
  Proceedings from the Corpus Linguistics Conference Series Vol.1, No.1
- Related Report
  2005 Annual Research Report
[Journal Article] Automatic Extraction of Fixed Multiword Expressions (2005)
- Author(s)
  Campbell Hore, Masayuki Asahara, Yuji Matsumoto
- Journal Title
  
  Natural Language Processing, Second International Joint Conference, Lecture Notes in Artificial Intelligence 3651
  
  Pages: 565-575
- NAID
  110002949453
- Related Report
  2005 Annual Research Report
[Journal Article] Chinese Deterministic Dependency Analyzer: Examining Effects of Global Features and Root Node Finder (2005)
- Author(s)
  Yuchang Cheng, Masayuki Asahara, Yuji Matsumoto
- Journal Title
  
  Fourth SIGHAN Workshop on Chinese Language Processing, Proceedings of the Workshop 4
  
  Pages: 17-24
- Related Report
  2005 Annual Research Report
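[Journal Article] Current Status of ChaKi, a Management and Search Tool for Tagged Corpora (in Japanese) (2005)
- Author(s)
  Yuji Matsumoto, Masayuki Asahara, Kou Kawabe, Yurika Takahashi, Yukio Tono, Akira Ohtani, Toshio Morita
- Journal Title
  
  Annual Meeting of the Association for Natural Language Processing 11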
- Related Report
  2004 Annual Research Report
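[Journal Article] Resolving the Word Segmentation Problem in Japanese Named Entity Extraction (in Japanese) (2004)
- Author(s)
  Masayuki Asahara, Yuji Matsumoto
- Journal Title
  
  Transactions of Information Processing Society of Japan Vol.45, No.5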
  
  Pages: 1442-1450
- NAID
  110002712193
- Related Report
  2004 Annual Research Report
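[Journal Article] Deterministic Bottom-up Dependency Structure Analysis Using Support Vector Machines (in Japanese) (2004)
- Author(s)
  Hiroyasu Yamada, Yuji Matsumoto
- Journal Title
  
  Transactions of Information Processing Society of Japan Vol.45, No.10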
  
  Pages: 2416-2427
- NAID
  110002712084
- Related Report
  2004 Annual Research Report
[Journal Article] Japanese Unknown Word Identification by Character-based Chunking (2004)
- Author(s)
  Masayuki Asahara, Yuji Matsumoto
- Journal Title
  
  Proceedings of 20th International Conference on Computational Linguistics 20
  
  Pages: 459-465
- Related Report
  2004 Annual Research Report
[Journal Article] Pruning False Unknown Words to Improve Chinese Word Segmentation (2004)
- Author(s)
  Chooi-Ling Goh, Masayuki Asahara, Yuji Matsumoto
- Journal Title
  
  Proceedings of the 18th Pacific Asia Conference on Language, Information and Computation 18
  
  Pages: 139-149
- NAID
  120006851427
- Related Report
  2004 Annual Research Report
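[Journal Article] Japanese Analysis with ChaSen and CaboCha: Classifying Sentence Roles Using Syntactic Information (in Japanese) (2004)
- Author(s)
  Yuji Matsumoto, Kazuma Takaoka, Masayuki Asahara, Taku Kudo
- Journal Title
  
  Journal of the Japanese Society for Artificial Intelligence Vol.19, No.3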
  
  Pages: 334-339
- Related Report
  2004 Annual Research Report
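[Publications] Tetsuji Nakagawa, Taku Kudo, Yuji Matsumoto: "Morphological Analysis Using Support Vector Machines and a Proposal of Revision Learning" (in Japanese) Transactions of Information Processing Society of Japan. Vol.44, No.5. 1354-1367 (2003)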
- Related Report
  2003 Annual Research Report
[Publications] Masayuki Asahara, Yuji Matsumoto: "Filler and disfluency identification based on morphological analysis and chunking" Proceedings of ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition. 163-166 (2003)
- Related Report
  2003 Annual Research Report
[Publications] Masayuki Asahara, Yuji Matsumoto: "Japanese named entity extraction with redundant morphological analysis" Proc. Human Language Technology and North American Chapter of Association for Computational Linguistics. 4. 8-15 (2003)
- Related Report
  2003 Annual Research Report
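[Publications] Taku Kudo, Yuji Matsumoto: "Markov Random Fields Based on Subtrees and Their Application to Language Analysis" (in Japanese) IPSJ SIG Technical Reports, Natural Language Processing / Fundamental Infology. 157. 33-40 (2003)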
- Related Report
  2003 Annual Research Report
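[Publications] Yuji Matsumoto and 8 others: "ChaKi, a Storage and Search Tool for Tagged Corpora" (in Japanese) Proceedings of the 10th Annual Meeting of the Association for Natural Language Processing. 10. 405-408 (2004)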
- Related Report
  2003 Annual Research Report