2002 Fiscal Year Final Research Report Summary
Construction of large scale annotated corpus and its management system
Project/Area Number |
12480082
|
Research Category |
Grant-in-Aid for Scientific Research (B)
|
Allocation Type | Single-year Grants |
Section | General |
Research Field |
Intelligent informatics
|
Research Institution | Tokyo Institute of Technology |
Principal Investigator |
TOKUNAGA Takenobu Department of Computer Science, Graduate School of Information Science and Engineering, Associate Professor (20197875)
|
Co-Investigator(Kenkyū-buntansha) |
TANAKA Hozumi Tokyo Institute of Technology, Department of Computer Science, Graduate School of Information Science and Engineering, Professor (80163567)
|
Project Period (FY) |
2000 – 2002
|
Keywords | Natural language processing / Large scale corpora / Japanese grammar / Syntactic analysis / Syntactically annotated corpora / Statistical NLP |
Research Abstract |
Since the mid-1980s, natural language processing based on large-scale linguistic data has become mainstream in this research area. For this kind of research, linguistic resources play an important role, and there have been many attempts to create various kinds of resources. This research project aims to construct an environment for creating syntactically annotated Japanese corpora on a large scale. To achieve this goal, we conducted research on the following topics. In 2000, we built an annotation tool that helps a user annotate the syntactic structure of sentences interactively. The tool works with an existing parser, and the user can efficiently select the correct syntactic structure from among the parser's outputs. In addition, the tool guides the user by suggesting the order in which to make choices. Following this order, the user can annotate sentences efficiently. In 2001, we extracted grammar rules from the EDR corpus, one of the largest existing Japanese corpora. The drawback of the EDR corpus is that the grammar on which its annotation is based is missing. We therefore first extracted the grammar from the EDR corpus automatically and then improved it so that its ambiguity became as small as possible. In addition, we proposed a new framework for building semantic knowledge, which plays an important role not only in semantic analysis but also in syntactic analysis. Because it is difficult to build semantic knowledge from scratch, we took the approach of combining existing semantic knowledge. In 2002, we continued to work on the two topics started in 2001. In addition, we constructed a management system for annotated corpora. This system allows users to retrieve various kinds of syntactic structures efficiently. The structures in each sentence are stored in a relational database system, giving users versatile retrieval capabilities. To verify the results of the above research, we built a Japanese corpus consisting of about 20,000 sentences. This sentence set is an excerpt from the EDR corpus. The corpus is based on the grammar extracted from the EDR corpus and improved in this project. To annotate the corpus, we used the annotation tool developed in this project, and the resulting corpus was managed by the system mentioned above.
|
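The abstract states that syntactic structures are stored in a relational database so that they can be retrieved flexibly, but it does not describe the schema or the corpus format. The following is only a minimal sketch of that general idea, assuming a simple bracketed tree notation and SQLite; the table `node`, its columns, and the helpers `parse_bracketed`, `store_tree`, and `find_children` are hypothetical names for illustration, not the project's actual design.

```python
import sqlite3


def parse_bracketed(text):
    """Parse a bracketed tree such as '(S (NP taro) (VP (V yomu)))'.

    Returns nested (label, children) tuples; leaf words are plain strings.
    The bracket notation is an assumption, not the EDR corpus format.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def store_tree(conn, sentence_id, tree):
    """Insert one row per constituent with its parent and word span."""

    def walk(n, parent_id, start):
        label, children = n
        cur = conn.execute(
            "INSERT INTO node (sentence_id, parent_id, label, start_pos, end_pos) "
            "VALUES (?, ?, ?, ?, ?)",
            (sentence_id, parent_id, label, start, start),
        )
        node_id = cur.lastrowid
        end = start
        for child in children:
            if isinstance(child, str):
                end += 1              # a leaf word covers one position
            else:
                end = walk(child, node_id, end)
        conn.execute("UPDATE node SET end_pos = ? WHERE id = ?", (end, node_id))
        return end

    walk(tree, None, 0)


def find_children(conn, parent_label, child_label):
    """Return (sentence_id, start, end) of child_label nodes directly under parent_label."""
    rows = conn.execute(
        "SELECT c.sentence_id, c.start_pos, c.end_pos "
        "FROM node AS c JOIN node AS p ON c.parent_id = p.id "
        "WHERE p.label = ? AND c.label = ? "
        "ORDER BY c.sentence_id, c.start_pos",
        (parent_label, child_label),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, sentence_id INTEGER, "
    "parent_id INTEGER, label TEXT, start_pos INTEGER, end_pos INTEGER)"
)
store_tree(conn, 1, parse_bracketed("(S (NP taro) (VP (NP hon) (V yomu)))"))
hits = find_children(conn, "VP", "NP")
node_count = conn.execute("SELECT COUNT(*) FROM node").fetchone()[0]
```

Storing one row per constituent with a parent pointer and a word span lets structural queries, such as "every NP directly under a VP", be written as ordinary joins. A real system for a corpus of this size would likely add indexes on `label` and `parent_id`.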