Project/Area Number |
14510638
|
Research Category |
Grant-in-Aid for Scientific Research (C)
|
Allocation Type | Single-year Grants |
Section | General |
Research Field |
Linguistics / Phonetics
|
Research Institution | The National Institute for Japanese Language |
Principal Investigator |
MAEKAWA Kikuo The National Institute for Japanese Language, Department of Language Research (Research and Development Division, Area 2), Section head (20173693)
|
Co-Investigator (Kenkyū-buntansha) |
TSUKAHARA Wataru The University of Electro-Communications, Graduate School of Information Systems, Assistant
KIKUCHI Hideaki Waseda University, Faculty of Human Science, Lecturer (70308261)
KOISO Hanae The National Institute for Japanese Language, Department of Language Research (Research and Development Division, Area 2), Researcher (30312200)
YONEYAMA Kiyoko Daito-Bunka University, Faculty of Foreign Languages, Lecturer (60365856)
KAGOMIYA Takayuki (籠宮 隆之) The National Institute for Japanese Language (Independent Administrative Institution), Research and Development Division, Area 2, Special Research Fellow
|
Project Period (FY) |
2002 – 2003
|
Project Status |
Completed (Fiscal Year 2003)
|
Budget Amount |
¥3,900,000 (Direct Cost: ¥3,900,000)
Fiscal Year 2003: ¥1,200,000 (Direct Cost: ¥1,200,000)
Fiscal Year 2002: ¥2,700,000 (Direct Cost: ¥2,700,000)
|
Keywords | XML / Corpus of Spontaneous Japanese / Spontaneous speech / Corpus / Spoken language / 『日本語話し言葉コーパス』 (Corpus of Spontaneous Japanese) |
Research Abstract |
Annotating spontaneous speech data is a difficult task, but maintaining a large annotated spontaneous speech database and retrieving information from it is more difficult still. We proposed an XML format that can represent nearly all the annotation information of the Corpus of Spontaneous Japanese (CSJ). The CSJ is one of the world's largest spontaneous speech databases and carries very rich annotation, including transcription, POS information, clause boundary information, dependency-structure information, discourse-boundary information, segment labels, intonation labels, and so forth. Our XML format comprises 10 layers (starting with the "Talk" element and ending with the "Phone" and "Tone" elements) arranged according to the structure of natural language. A total of 208 attributes cover linguistic, paralinguistic, and non-linguistic annotation of the speech data, as well as various disfluency phenomena. Some further attributes were introduced to represent the format of the transcription text. We converted all 3302 talks of the CSJ (661 hours, over 7.5 million morphemes) into XML documents and used them for data validation. Information retrieval experiments were also conducted on the XML documents, and XSLT gave satisfactory performance: queries of modest complexity completed within 15 to 30 minutes on a PC of ordinary performance (3 GHz CPU with 2 GB of memory). Lastly, we developed a simple GUI-based search tool that helps novice users write XSLT query scripts. The software is written in Java and runs on nearly all PC platforms. The XML documents and the GUI search tool are scheduled for public release as part of the CSJ in June 2004.
|
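The layered format described in the abstract lends itself to attribute-based queries over nested elements. The following is a minimal sketch of that idea, not the project's actual schema or tooling: only the element names "Talk", "Phone", and "Tone" come from the abstract, while the "Word" element, all attribute names, and the helper functions are hypothetical. It also uses Python's ElementTree path queries in place of the XSLT scripts used in the project.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment. Only "Talk", "Phone", and "Tone" are named in the
# abstract; "Word", every attribute name, and the nesting are assumptions.
SAMPLE = """
<Talk TalkID="demo01">
  <Word Orth="eto" Filler="1">
    <Phone Label="e"/>
    <Phone Label="t"/>
    <Phone Label="o"/>
  </Word>
  <Word Orth="kyou" Filler="0">
    <Phone Label="k"/>
    <Phone Label="y"/>
    <Phone Label="o"><Tone Label="H"/></Phone>
  </Word>
</Talk>
""".strip()


def find_words(talk, attr, value):
    """Return the Orth value of every Word whose attribute `attr` equals `value`."""
    return [w.get("Orth") for w in talk.iterfind(f".//Word[@{attr}='{value}']")]


def phones_of(talk, orth):
    """Return the Phone labels under the first Word with the given Orth value."""
    word = talk.find(f".//Word[@Orth='{orth}']")
    return [] if word is None else [p.get("Label") for p in word.iter("Phone")]


talk = ET.fromstring(SAMPLE)
fillers = find_words(talk, "Filler", "1")   # words marked as fillers
kyou_phones = phones_of(talk, "kyou")       # segment labels of one word
```

Because each annotation layer is an element and each label an attribute, a query such as "all words marked as fillers" reduces to a single path expression; an XSLT template matching the same path would select the same nodes.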