2002 Fiscal Year Final Research Report Summary

Construction of large scale annotated corpus and its management system

Research Project

Project/Area Number	12480082
Research Category	Grant-in-Aid for Scientific Research (B)
Allocation Type	Single-year Grants
Section	General
Research Field	Intelligent informatics
Research Institution	Tokyo Institute of Technology
Principal Investigator	TOKUNAGA Takenobu, Department of Computer Science, Graduate School of Information Science and Engineering, Associate Professor (20197875)
Co-Investigator(Kenkyū-buntansha)	TANAKA Hozumi, Tokyo Institute of Technology, Department of Computer Science, Graduate School of Information Science and Engineering, Professor (80163567)
Project Period (FY)	2000 – 2002
Keywords	Natural language processing / Large scale corpora / Japanese grammar / Syntactic analysis / Syntactically annotated corpora / Statistical NLP
Research Abstract	Since the mid-1980s, natural language processing based on large-scale linguistic data has become mainstream in this research area. Linguistic resources play an important role in this kind of research, and there have been many attempts to create various kinds of resources. This research project aims to construct an environment for creating syntactically annotated Japanese corpora on a large scale. To achieve this goal, we conducted research on the following topics. In 2000, we built an annotation tool that helps a user annotate the syntactic structure of sentences interactively. The tool works with an existing parser, and the user can efficiently select the correct syntactic structure from among the parser's outputs. In addition, the tool guides the user by suggesting the order in which to make choices; following this order, the user can annotate sentences efficiently. In 2001, we extracted grammar rules from the EDR corpus, which is one of the largest existing Japanese corpora. The drawback of the EDR corpus is that the grammar on which its annotation is based is not available. We therefore first extracted a grammar from the EDR corpus automatically and then improved it so that its ambiguity became as small as possible. In addition, we proposed a new framework for building semantic knowledge, which plays an important role not only in semantic analysis but also in syntactic analysis. Because it is difficult to build semantic knowledge from scratch, we took the approach of combining existing semantic knowledge. In 2002, we continued to work on the two topics started in 2001. In addition, we constructed a management system for annotated corpora. This system allows users to retrieve various kinds of syntactic structures efficiently. The structures in each sentence are stored in a relational database system, which gives users versatile retrieval capabilities.
To verify the results of the above research, we built a Japanese corpus consisting of about 20,000 sentences. This sentence set is an excerpt from the EDR corpus. The corpus is annotated according to the grammar that was extracted from the EDR corpus and improved in this project. The annotation tool developed in this project was used to annotate the corpus, and the resulting corpus was managed by the system mentioned above.

Research Products
(12 results)

All Publications (12 results)

[Publications] 田中穂積, 徳永健伸: "コンピュータが拓く新しい言語世界"月刊言語. 31・3. 16-22 (2002)
- Description
  From the Final Research Report Summary (Japanese)
[Publications] 野呂智哉, 白井清昭, 徳永健伸, 田中穂積: "大規模日本語文法の開発-事例研究"情報処理学会自然言語処理研究会. 2002・66. 149-156 (2002)
- Description
  From the Final Research Report Summary (Japanese)
[Publications] 野呂智哉, 岡崎篤, 徳永健伸, 田中穂積: "大規模日本語文法構築に関する一考察"言語処理学会第8回年次大会予稿集. 387-390 (2002)
- Description
  From the Final Research Report Summary (Japanese)
[Publications] 美野秀弥, 橋本泰一, 徳永健伸, 田中穂積: "決定リストを利用した形容動詞の修飾先の決定"言語処理学会第8回年次大会予稿集. 411-414 (2002)
- Description
  From the Final Research Report Summary (Japanese)
[Publications] Tokunaga Takenobu, Syotu Yasuhiro, Tanaka Hozumi, Shirai Kiyoaki: "Integration of heterogeneous language resources : A monolingual dictionary and a thesaurus"Proceedings of NLPRS 2001. 135-142 (2001)
- Description
  From the Final Research Report Summary (Japanese)
[Publications] 白井清昭, 植木正裕, 橋本泰一, 徳永健伸, 田中穂積: "自然言語解析のためのMSLRパーザツールキット"自然言語処理. 7・5. 93-112 (2000)
- Description
  From the Final Research Report Summary (Japanese)
[Publications] Tanaka, H. and Tokunaga, T.: "New research program of language by computer"Gengo. 31, No. 3. 16-22 (2002)
- Description
  From the Final Research Report Summary (English)
[Publications] Noro, T., Sirai, K., Tokunaga, T. and Tanaka, H.: "Development of large Japanese grammar-A case study-"IPSJ-SIGNL. 2002.6. 149-156 (2002)
- Description
  From the Final Research Report Summary (English)
[Publications] Noro, T., Okazaki, A., Tokunaga, T. and Tanaka, H.: "A study on large Japanese grammar development"Annual meeting of Association of Natural Language Processing. 387-390 (2002)
- Description
  From the Final Research Report Summary (English)
[Publications] Mino, H., Hasimoto, T., Tokunaga, T. and Tanaka, H.: "Disambiguation of adverbial phrase attachment by using decision tree"Annual meeting of Association of Natural Language Processing. 411-414 (2002)
- Description
  From the Final Research Report Summary (English)
[Publications] Tokunaga, T., Syotu, Y., Sirai, K. and Tanaka, H.: "Integration of heterogeneous language resources : A monolingual dictionary and a thesaurus"Proc. of NLPRS 2001. 135-142 (2001)
- Description
  From the Final Research Report Summary (English)
[Publications] Sirai, K., Ueki, M., Hasimoto, T., Tokunaga, T. and Tanaka, H.: "The MSLR parser : A toolkit of natural language processing"Natural Language Processing. 7, No. 5. 93-112 (2000)
- Description
  From the Final Research Report Summary (English)
