A theoretical and practical investigation to construct a lexicon for analyzing German text databases.
Grant-in-Aid for Scientific Research (C).
|Research Institution||GIFU KEIZAI UNIVERSITY|
YAMADA Yoshihisa GIFU KEIZAI UNIVERSITY, Faculty of Business Administration, Professor (50192406)
|Project Fiscal Year||1997 – 2000|
Completed (Fiscal Year 2001)
|Budget Amount||¥2,900,000 (Direct Cost: ¥2,900,000)|
Fiscal Year 2000: ¥500,000 (Direct Cost: ¥500,000)
Fiscal Year 1999: ¥500,000 (Direct Cost: ¥500,000)
Fiscal Year 1998: ¥800,000 (Direct Cost: ¥800,000)
Fiscal Year 1997: ¥1,100,000 (Direct Cost: ¥1,100,000)
|Keywords||natural language processing / corpus / text database / database / software / Brothers Grimm / HPSG / German language information processing / corpus linguistics|
To obtain morphosyntactic information such as part of speech from a plain corpus, the syntactic structure must be parsed at least to some extent. For this purpose, this research designed a lexicon for analyzing text databases and implemented it as software. Grimm's fairy tales served as the base material.
The concrete results achieved by this research are as follows.
1. Continuation of the Grimm corpus
The 1812 and 1819 editions of the fairy tales of the Brothers Grimm were digitized. These data were combined with the existing 1857 edition and reorganized as a Grimm corpus. Although relatively small in scale, it could be called the first diachronic corpus of German.
2. Completion of the lemma frequency list
The lemma frequency list of the 1857 edition, which contains more than 220,000 words, was completed. Compared with a simple word-form frequency list, a lemma frequency list is laborious to compile, especially for inflectional languages such as German, because each inflected form must be assigned to its base form. The list is therefore an innovative experiment and can be valuable in areas such as linguistics, lexicography, and stylistics.
3. Completion of the corpus analyzing software TEDDY II
The software TEDDY II, which implements ASA (Auflösungsstrategie der Ambiguität, a strategy for resolving ambiguity), was completed. Compared with the previous version of TEDDY, the user interface and the display of the output were also improved.
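The difference between a simple word frequency list and a lemma frequency list (result 2 above) can be illustrated with a minimal Python sketch. It is not the method used in this project or in TEDDY II: the tiny lexicon `LEMMA_LEXICON`, the sample sentence, and the helper functions are hypothetical, and the naive dictionary lookup ignores ambiguous forms, which a real system would have to resolve from context.

```python
from collections import Counter

# Hypothetical toy lexicon mapping inflected German word forms to lemmas.
# A real lexicon would be far larger and would need context to
# disambiguate forms that belong to more than one lemma.
LEMMA_LEXICON = {
    "ging": "gehen",
    "geht": "gehen",
    "gehen": "gehen",
    "könig": "könig",
    "königs": "könig",
    "könige": "könig",
}


def tokenize(text):
    """Lowercase the text, split on whitespace, strip surrounding punctuation."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def word_frequencies(tokens):
    """Count each surface word form separately."""
    return Counter(tokens)


def lemma_frequencies(tokens, lexicon):
    """Count lemmas; forms missing from the lexicon count as themselves."""
    return Counter(lexicon.get(tok, tok) for tok in tokens)


# Invented sample text, not taken from the Grimm corpus.
sample = "Der König ging. Die Könige gehen. Des Königs Sohn geht."
tokens = tokenize(sample)
word_freq = word_frequencies(tokens)
lemma_freq = lemma_frequencies(tokens, LEMMA_LEXICON)

print(word_freq.most_common(3))
print(lemma_freq.most_common(3))
```

In the word frequency list every inflected form is a separate entry, whereas the lemma frequency list groups "König", "Könige", and "Königs" under one lemma. The articles "der", "die", and "des" stay separate here only because the toy lexicon does not cover them.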
Research Output (5 results)