2003 Fiscal Year Final Research Report Summary
XML documentation of complex annotation on spontaneous speech data
Project/Area Number |
14510638
|
Research Category |
Grant-in-Aid for Scientific Research (C)
|
Allocation Type | Single-year Grants |
Section | General |
Research Field |
Linguistics / Phonetics
|
Research Institution | The National Institute for Japanese Language |
Principal Investigator |
MAEKAWA Kikuo The National Institute for Japanese Language, Department of Language Research, Section Head (20173693)
|
Co-Investigator (Kenkyū-buntansha) |
TSUKAHARA Wataru The University of Electro-Communications, Graduate School of Information Systems, Assistant
KIKUCHI Hideaki Waseda University, Faculty of Human Science, Lecturer (70308261)
KOISO Hanae The National Institute for Japanese Language, Department of Language Research, Researcher (30312200)
YONEYAMA Kiyoko Daito-Bunka University, Faculty of Foreign Languages, Lecturer (60365856)
|
Project Period (FY) |
2002 – 2003
|
Keywords | XML / Corpus of Spontaneous Japanese / Spontaneous speech |
Research Abstract |
Annotating spontaneous speech data is a difficult task, but maintaining a large annotated spontaneous speech database and retrieving information from it are more difficult still. We proposed an XML format that can represent nearly all of the annotation information in the Corpus of Spontaneous Japanese (CSJ). The CSJ is the world's largest spontaneous speech database, with very rich annotation that includes transcription, POS information, clause boundary information, dependency-structure information, discourse-boundary information, segment labels, intonation labels, and so forth. Our XML format comprises 10 layers (starting with the "Talk" element and ending in the "Phone" and "Tone" elements) arranged according to the structure of natural language. Its 208 attributes cover linguistic, paralinguistic, and non-linguistic annotation of the speech data as well as various disfluency phenomena. Some further attributes were introduced to represent the format of the transcription text. We converted all 3302 talks of the CSJ (661 hours, over 7.5 million morphemes) into XML documents and used them for data validation. Information retrieval experiments were also conducted on the XML documents, and XSLT proved to give satisfactory performance: queries of modest complexity completed within 15 to 30 minutes on a PC of ordinary performance (3 GHz CPU with 2 GB of memory). Lastly, we developed a simple GUI-based search tool that helps non-expert users build XSLT query scripts. The software is written in Java and runs on nearly all PC platforms. The XML documents and the GUI search tool will be made publicly available as part of the CSJ in June 2004.
|
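As a rough illustration of the kind of retrieval described in the abstract, the following minimal sketch queries a toy layered XML document. It is not the project's method: the project used XSLT, whereas this sketch substitutes Python's standard `xml.etree.ElementTree` for brevity. Only the element names `Talk`, `Phone`, and `Tone` come from the abstract; the intermediate `Word` layer, all attribute names (`pos`, `filler`, `label`, `id`), and the helper `phones_by_word` are hypothetical placeholders and do not reflect the actual CSJ schema.

```python
import xml.etree.ElementTree as ET

# Toy document. Only the element names Talk, Phone and Tone appear in the
# abstract; the "Word" layer and every attribute name here are hypothetical
# placeholders, not the real CSJ schema.
SAMPLE = """
<Talk id="demo01">
  <Word pos="noun" filler="0">
    <Phone label="k"/><Phone label="a"/>
  </Word>
  <Word pos="interjection" filler="1">
    <Phone label="e"/><Tone label="H"/>
  </Word>
  <Word pos="verb" filler="0">
    <Phone label="i"/><Phone label="k"/><Phone label="u"/>
  </Word>
</Talk>
"""


def phones_by_word(xml_text, attr, value):
    """Return the Phone labels of every Word whose `attr` equals `value`."""
    root = ET.fromstring(xml_text)
    hits = []
    for word in root.iter("Word"):
        if word.get(attr) == value:
            # Collect only Phone children; Tone elements are skipped.
            hits.append([p.get("label") for p in word.iter("Phone")])
    return hits


if __name__ == "__main__":
    # Retrieve the phone sequences of words flagged as fillers.
    print(phones_by_word(SAMPLE, "filler", "1"))
```

Because every annotation layer is nested inside the one above it, a query can select units at one layer by their attributes and then descend to a lower layer, which is the general pattern an XSLT query script over such documents would also follow.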
Research Products
(11 results)