Construction of large scale annotated corpus and its management system

Research Project

Project/Area Number	12480082
Research Category	Grant-in-Aid for Scientific Research (B)
Allocation Type	Single-year Grants
Section	General
Research Field	Intelligent informatics
Research Institution	Tokyo Institute of Technology
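Principal Investigator	TOKUNAGA Takenobu Department of Computer Science, Graduate School of Information Science and Engineering, Associate Professor (20197875)
Co-Investigator(Kenkyū-buntansha)	TANAKA Hozumi Tokyo Institute of Technology, Department of Computer Science, Graduate School of Information Science and Engineering, Professor (80163567); SHIRAI Kiyoaki Tokyo Institute of Technology, Graduate School of Information Science and Engineering, Research Associate (30302970)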
Project Period (FY)	2000 – 2002
Project Status	Completed (Fiscal Year 2002)
Budget Amount	¥14,000,000 (Direct Cost: ¥14,000,000) Fiscal Year 2002: ¥3,600,000 (Direct Cost: ¥3,600,000) Fiscal Year 2001: ¥3,600,000 (Direct Cost: ¥3,600,000) Fiscal Year 2000: ¥6,800,000 (Direct Cost: ¥6,800,000)
Keywords	Natural language processing / Large scale corpora / Japanese grammar / Syntactic analysis / Syntactically annotated corpora / Statistical NLP / Large-scale Japanese grammar / Corpus construction support
Research Abstract	Since the mid-1980s, natural language processing based on large-scale linguistic data has become mainstream in this research area. Linguistic resources play an important role in this kind of research, and there have been many attempts to create various kinds of resources. This research project aims to construct an environment for creating large-scale, syntactically annotated Japanese corpora. To achieve this goal, we conducted research on the following topics. In 2000, we built an annotation tool that helps a user annotate sentences with syntactic structure interactively. The tool works with an existing parser, and the user can efficiently select the correct syntactic structure from among the parser's outputs. In addition, the tool can guide the user by suggesting the order in which to make choices; following this order, the user can annotate sentences efficiently. In 2001, we extracted grammar rules from the EDR corpus, one of the largest existing Japanese corpora. The drawback of the EDR corpus is that the grammar on which its annotation is based is not available. We therefore first extracted a grammar from the EDR corpus automatically and then improved it to make its ambiguity as small as possible. In addition, we proposed a new framework for building semantic knowledge, which plays an important role not only in semantic analysis but also in syntactic analysis. Because it is difficult to build semantic knowledge from scratch, we took the approach of combining existing semantic knowledge. In 2002, we continued to work on the two topics started in 2001. In addition, we constructed a management system for annotated corpora. This system allows users to retrieve various kinds of syntactic structures efficiently. The structures in each sentence are stored in a relational database system, giving users versatile retrieval capabilities.
To verify the results of the above research, we built a Japanese corpus of about 20,000 sentences excerpted from the EDR corpus. The corpus is annotated according to the grammar extracted from the EDR corpus and improved in this project. The annotation tool developed in this project was used to annotate the corpus, and the resulting corpus was managed by the system mentioned above.
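
The grammar-extraction step described in the abstract can be illustrated with a minimal sketch. The record does not describe the EDR corpus format or the project's actual extraction procedure, so the tree encoding, the category labels, the example sentence, and the function names `extract_rules` and `grammar_from_treebank` below are all hypothetical. The code only shows the general idea of reading context-free rules off annotated trees and counting how often each rule occurs.

```python
from collections import Counter

# A toy annotated sentence: (label, child, child, ...).
# A node whose only child is a plain string is a preterminal over a word.
# Labels and sentence are hypothetical, not taken from the EDR corpus.
TREE = ("S",
        ("PP", ("N", "猫"), ("P", "が")),
        ("VP", ("PP", ("N", "魚"), ("P", "を")), ("V", "食べた")))


def extract_rules(tree, counts):
    """Record one context-free rule per node, visiting the tree recursively."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        # Preterminal node: the rule rewrites a category as a word.
        counts[(label, (children[0],))] += 1
        return
    # Internal node: the right-hand side is the sequence of child labels.
    counts[(label, tuple(child[0] for child in children))] += 1
    for child in children:
        extract_rules(child, counts)


def grammar_from_treebank(trees):
    """Collect rule counts over a list of annotated trees."""
    counts = Counter()
    for tree in trees:
        extract_rules(tree, counts)
    return counts


rules = grammar_from_treebank([TREE])
for (lhs, rhs), n in sorted(rules.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{n}]")
```

A grammar read off a treebank in this way is typically highly ambiguous, which is consistent with the abstract's remark that the extracted grammar had to be improved afterwards; the rule counts are one possible starting point for deciding which rules to merge or drop.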

Report

(4 results)

2002 Annual Research Report Final Research Report Summary
2001 Annual Research Report
2000 Annual Research Report

Research Products
(20 results)

All Publications (20 results)

[Publications] 田中穂積, 徳永健伸: "コンピュータが拓く新しい言語世界"月刊言語. 31・3. 16-22 (2002)
- Description
  From the Final Research Report Summary (Japanese)
- Related Report
  2002 Final Research Report Summary
[Publications] 野呂智哉, 白井清昭, 徳永健伸, 田中穂積: "大規模日本語文法の開発-事例研究"情報処理学会自然言語処理研究会. 2002・66. 149-156 (2002)
- Description
  From the Final Research Report Summary (Japanese)
- Related Report
  2002 Final Research Report Summary
[Publications] 野呂智哉, 岡崎篤, 徳永健伸, 田中穂積: "大規模日本語文法構築に関する一考察"言語処理学会第8回年次大会予稿集. 387-390 (2002)
- Description
  From the Final Research Report Summary (Japanese)
- Related Report
  2002 Final Research Report Summary
[Publications] 美野秀弥, 橋本泰一, 徳永健伸, 田中穂積: "決定リストを利用した形容動詞の修飾先の決定"言語処理学会第8回年次大会予稿集. 411-414 (2002)
- Description
  From the Final Research Report Summary (Japanese)
- Related Report
  2002 Final Research Report Summary
[Publications] Tokunaga Takenobu, Syotu Yasuhiro, Tanaka Hozumi, Shirai Kiyoaki: "Integration of heterogeneous language resources : A monolingual dictionary and a thesaurus"Proceedings of NLPRS 2001. 135-142 (2001)
- Description
  From the Final Research Report Summary (Japanese)
- Related Report
  2002 Final Research Report Summary
[Publications] 白井清昭, 植木正裕, 橋本泰一, 徳永健伸, 田中穂積: "自然言語解析のためのMSLRパーザツールキット"自然言語処理. 7・5. 93-112 (2000)
- Description
  From the Final Research Report Summary (Japanese)
- Related Report
  2002 Final Research Report Summary
[Publications] Tanaka, H. and Tokunaga, T.: "New research program of language by computer"Gengo. 31, No. 3. 16-22 (2002)
- Description
  From the Final Research Report Summary (English)
- Related Report
  2002 Final Research Report Summary
[Publications] Noro, T., Sirai, K., Tokunaga, T. and Tanaka, H.: "Development of large Japanese grammar-A case study-"IPSJ-SIGNL. 2002.6. 149-156 (2002)
- Description
  From the Final Research Report Summary (English)
- Related Report
  2002 Final Research Report Summary
[Publications] Noro, T., Okazaki, A., Tokunaga, T. and Tanaka, H.: "A study on large Japanese grammar development"Annual meeting of Association of Natural Language Processing. 387-390 (2002)
- Description
  From the Final Research Report Summary (English)
- Related Report
  2002 Final Research Report Summary
[Publications] Mino, H., Hasimoto, T., Tokunaga, T. and Tanaka, H.: "Disambiguation of adverbial phrase attachment by using decision tree"Annual meeting of Association of Natural Language Processing. 411-414 (2002)
- Description
  From the Final Research Report Summary (English)
- Related Report
  2002 Final Research Report Summary
[Publications] Tokunaga, T., Syotu, Y., Sirai, K. and Tanaka, H.: "Integration of heterogeneous language resources : A monolingual dictionary and a thesaurus"Proc. of NLPRS 2001. 135-142 (2001)
- Description
  From the Final Research Report Summary (English)
- Related Report
  2002 Final Research Report Summary
[Publications] Sirai, K., Ueki, M., Hasimoto, T., Tokunaga, T. and Tanaka, H.: "The MSLR parser : A toolkit of natural language processing"Natural Language Processing. 7, No. 5. 93-112 (2000)
- Description
  From the Final Research Report Summary (English)
- Related Report
  2002 Final Research Report Summary
[Publications] 田中穂積, 徳永健伸: "コンピュータが拓く新しい言語世界"月刊言語. 31・3. 16-22 (2002)
- Related Report
  2002 Annual Research Report
[Publications] 野呂智哉, 白井清昭, 徳永健伸, 田中穂積: "大規模日本語文法の開発-事例研究"情報処理学会自然言語処理研究会. 2002・66. 149-156 (2002)
- Related Report
  2002 Annual Research Report
[Publications] 野呂智哉, 岡崎篤, 徳永健伸, 田中穂積: "大規模日本語文法構築に関する一考察"言語処理学会第8回年次大会予稿集. 387-390 (2002)
- Related Report
  2002 Annual Research Report
[Publications] 美野秀弥, 橋本泰一, 徳永健伸, 田中穂積: "決定リストを利用した形容動詞の修飾先の決定"言語処理学会第8回年次大会予稿集. 411-414 (2002)
- Related Report
  2002 Annual Research Report
[Publications] 徳永健伸, 阿辺川武: "統計情報による連体修飾節の解析"日本語学. 20・12. 20-27 (2001)
- Related Report
  2001 Annual Research Report
[Publications] Tokunaga T., Syotu Y., Tanaka H., Shirai K.: "Integration of heterogeneous language resources"Proc. of 6th NLPRS. 135-142 (2001)
- Related Report
  2001 Annual Research Report
[Publications] 木村健司, 徳永健伸, 田中穂積: "漢字インデックスを利用したパラフレーズの抽出"自然言語処理研究会予稿集. 2001・112. 39-45 (2001)
- Related Report
  2001 Annual Research Report
[Publications] 八木豊, 橋本泰一, 美野秀弥, 徳永健伸, 田中穂積: "決定リストにおける規則の適用順序に関する考察"自然言語処理研究会予稿集. 2001・112. 21-26 (2001)
- Related Report
  2001 Annual Research Report
