XML documentation of complex annotation on spontaneous speech data

Research Project

Project/Area Number	14510638
Research Category	Grant-in-Aid for Scientific Research (C)
Allocation Type	Single-year Grants
Section	General
Research Field	Linguistics / Phonetics
Research Institution	The National Institute for Japanese Language
Principal Investigator	MAEKAWA Kikuo, The National Institute for Japanese Language, Department of Language Research, Section Head (20173693)
Co-Investigator(Kenkyū-buntansha)	TSUKAHARA Wataru, The University of Electro-Communications, Graduate School of Information Systems, Assistant; KIKUCHI Hideaki, Waseda University, Faculty of Human Science, Lecturer (70308261); KOISO Hanae, The National Institute for Japanese Language, Department of Language Research, Researcher (30312200); YONEYAMA Kiyoko, Daito-Bunka University, Faculty of Foreign Languages, Lecturer (60365856); KAGOMIYA Takayuki, The National Institute for Japanese Language, Department of Language Research, Special Research Fellow
Project Period (FY)	2002 – 2003
Project Status	Completed (Fiscal Year 2003)
Budget Amount	¥3,900,000 (Direct Cost: ¥3,900,000); Fiscal Year 2003: ¥1,200,000 (Direct Cost: ¥1,200,000); Fiscal Year 2002: ¥2,700,000 (Direct Cost: ¥2,700,000)
Keywords	XML / Corpus of Spontaneous Japanese / Spontaneous speech / Corpus / Spoken language
Research Abstract	Annotating spontaneous speech data is difficult, but maintaining a large annotated spontaneous speech database and retrieving information from it are harder still. We proposed an XML format that can represent nearly all of the annotation information in the Corpus of Spontaneous Japanese (CSJ). The CSJ is one of the world's largest spontaneous speech databases, with very rich annotation including transcription, POS information, clause boundary information, dependency-structure information, discourse-boundary information, segment labels, intonation labels, and so forth. Our XML format comprises 10 layers (starting with the "Talk" element and ending with the "Phone" and "Tone" elements) arranged according to the structure of natural language. A total of 208 attributes cover linguistic, paralinguistic, and non-linguistic annotation of the speech data, as well as various disfluency phenomena. Some attributes were also introduced to represent the format of the transcription text. We converted all 3302 talks of the CSJ (661 hours, over 7.5 million morphemes) into XML documents and used them for data validation. Information retrieval experiments were also conducted using the XML documents, and XSLT gave satisfactory performance: retrievals of modest complexity could be performed within 15 to 30 minutes on a PC of ordinary performance (3 GHz CPU with 2 GB of memory). Lastly, we developed a simple GUI-based search tool that helps novice users write XSLT query scripts. The software is written in Java and runs on nearly all PC platforms. The XML documents and the GUI search tool will be made publicly available as part of the CSJ in June 2004.
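
The layered format and the kind of retrieval described in the abstract can be illustrated with a minimal sketch. It is not the project's schema or tooling: only the element names "Talk", "Phone", and "Tone" come from the abstract, while the intermediate "Unit" and "Word" layers, every attribute name, and the sample data are hypothetical. The project used XSLT for retrieval; the sketch substitutes Python's standard-library ElementTree, which is enough to show how a query walks the hierarchy.

```python
import xml.etree.ElementTree as ET

# Toy document. "Talk", "Phone" and "Tone" are named in the abstract;
# "Unit", "Word" and all attribute names here are hypothetical.
SAMPLE = """
<Talk TalkID="demo01">
  <Unit UnitID="u1">
    <Word WordID="w1" POS="noun">
      <Phone Label="a"/><Phone Label="m"/><Phone Label="e"/>
      <Tone Label="H"/>
    </Word>
    <Word WordID="w2" POS="particle">
      <Phone Label="g"/><Phone Label="a"/>
      <Tone Label="L"/>
    </Word>
  </Unit>
  <Unit UnitID="u2">
    <Word WordID="w3" POS="verb">
      <Phone Label="f"/><Phone Label="u"/><Phone Label="r"/><Phone Label="u"/>
      <Tone Label="H"/>
    </Word>
  </Unit>
</Talk>
"""

ROOT = ET.fromstring(SAMPLE)


def words_with_tone(root, tone_label):
    """Return the IDs of Word elements that have a Tone child with the given label."""
    return [
        word.get("WordID")
        for word in root.iter("Word")
        if any(tone.get("Label") == tone_label for tone in word.findall("Tone"))
    ]


def phones_per_word(root):
    """Map each Word ID to the number of Phone children it contains."""
    return {word.get("WordID"): len(word.findall("Phone")) for word in root.iter("Word")}


if __name__ == "__main__":
    print(words_with_tone(ROOT, "H"))
    print(phones_per_word(ROOT))
```

Because prosodic and segmental labels sit under the same parent element, a single traversal can combine conditions from different annotation layers, which is the property that makes a nested representation convenient for both validation and search.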

Report

(3 results)

2003 Annual Research Report, Final Research Report Summary
2002 Annual Research Report

Research Products

(24 results)

All Publications (24 results)

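[Publications] H.Kikuchi, K.Maekawa, Y.Igarashi, K.Yoneyama, M.Fujimoto: "Phonetic labeling of the 'Corpus of Spontaneous Japanese'" (in Japanese) Journal of the Phonetic Society of Japan. 7(3). 16-26 (2003)
- Description
  From the Final Research Report Summary (Japanese)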
- Related Report
  2003 Final Research Report Summary
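[Publications] H.Kikuchi: "Validation and retrieval of the 'Corpus of Spontaneous Japanese' using XML" (in Japanese) Proceedings of the FY2003 Public Research Meeting of the National Institute for Japanese Language. 15-20 (2003)
- Description
  From the Final Research Report Summary (Japanese)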
- Related Report
  2003 Final Research Report Summary
[Publications] H.Kikuchi, K.Maekawa: "Evaluation of the effectiveness of "X-JToBI": A new prosodic labeling scheme for spontaneous Japanese speech." Proceedings of the 15th International Congress of Phonetic Sciences. 1. 579-582 (2003)
- Description
  From the Final Research Report Summary (Japanese)
- Related Report
  2003 Final Research Report Summary
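[Publications] H.Kikuchi, W.Tsukahara, K.Maekawa: "Consistency validation of the 'Corpus of Spontaneous Japanese' (CSJ) using XML" (in Japanese) Proceedings of the 3rd Workshop on Spontaneous Speech Science and Engineering. 27-32 (2004)
- Description
  From the Final Research Report Summary (Japanese)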
- Related Report
  2003 Final Research Report Summary
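[Publications] W.Tsukahara, H.Kikuchi, K.Maekawa: "An XML search environment for the 'Corpus of Spontaneous Japanese'" (in Japanese) Proceedings of the 3rd Workshop on Spontaneous Speech Science and Engineering. 33-38 (2004)
- Description
  From the Final Research Report Summary (Japanese)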
- Related Report
  2003 Final Research Report Summary
[Publications] K.Maekawa, H.Kikuchi, W.Tsukahara: "Corpus of Spontaneous Japanese: Design, Annotation and XML Representation" Proceedings of the International Symposium on Large-scale Knowledge Resources (LKR2004). 19-24 (2004)
- Description
  From the Final Research Report Summary (Japanese)
- Related Report
  2003 Final Research Report Summary
[Publications] K.Maekawa, H.Kikuchi: "Corpus-based analysis of vowel devoicing in spontaneous Japanese -An interim report-" J. van de Weijer, K.Nanjo and T.Nishihara (eds.) Voicing in Japanese. The Hague: Mouton. (in press). (2004)
- Description
  From the Final Research Report Summary (Japanese)
- Related Report
  2003 Final Research Report Summary
[Publications] H.Kikuchi, K.Maekawa, Y.Igarashi, K.Yoneyama, M.Fujimoto: "Phonetic labeling of the 'Corpus of Spontaneous Japanese'." Journal of the Phonetic Society of Japan. 7(3). 15-26 (2003)
- Description
  From the Final Research Report Summary (English)
- Related Report
  2003 Final Research Report Summary
[Publications] H.Kikuchi, K.Maekawa: "Evaluation of the effectiveness of "X-JToBI": A new prosodic labeling scheme for spontaneous Japanese speech" Proceedings of the 15th International Congress of Phonetic Sciences, 1, Barcelona. 579-582 (2003)
- Description
  From the Final Research Report Summary (English)
- Related Report
  2003 Final Research Report Summary
[Publications] K.Maekawa, H.Kikuchi, W.Tsukahara: "Corpus of Spontaneous Japanese: Design, Annotation and XML Representation" Proceedings of the International Symposium on Large-scale Knowledge Resources (LKR2004) (Tokyo Inst. Technology) (INVITED TALK). 19-24 (2003)
- Description
  From the Final Research Report Summary (English)
- Related Report
  2003 Final Research Report Summary
[Publications] K.Maekawa, H.Kikuchi: "Corpus-based analysis of vowel devoicing in spontaneous Japanese -An interim report-" J. van de Weijer, K.Nanjo and T.Nishihara (eds.) Voicing in Japanese. The Hague: Mouton. (in press).
- Description
  From the Final Research Report Summary (English)
- Related Report
  2003 Final Research Report Summary
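[Publications] H.Kikuchi, K.Maekawa, Y.Igarashi, K.Yoneyama, M.Fujimoto: "Phonetic labeling of the 'Corpus of Spontaneous Japanese'" (in Japanese) Journal of the Phonetic Society of Japan. 7(3). 16-26 (2003)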
- Related Report
  2003 Annual Research Report
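[Publications] H.Kikuchi: "Validation and retrieval of the 'Corpus of Spontaneous Japanese' using XML" (in Japanese) Proceedings of the FY2003 Public Research Meeting of the National Institute for Japanese Language. 15-20 (2003)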
- Related Report
  2003 Annual Research Report
[Publications] H.Kikuchi, K.Maekawa: "Evaluation of the effectiveness of "X-JToBI": A new prosodic labeling scheme for spontaneous Japanese speech." Proceedings of the 15th International Congress of Phonetic Sciences. 1. 579-582 (2003)
- Related Report
  2003 Annual Research Report
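[Publications] H.Kikuchi, W.Tsukahara, K.Maekawa: "Consistency validation of the 'Corpus of Spontaneous Japanese' (CSJ) using XML" (in Japanese) Proceedings of the 3rd Workshop on Spontaneous Speech Science and Engineering. 27-32 (2004)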
- Related Report
  2003 Annual Research Report
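[Publications] W.Tsukahara, H.Kikuchi, K.Maekawa: "An XML search environment for the 'Corpus of Spontaneous Japanese'" (in Japanese) Proceedings of the 3rd Workshop on Spontaneous Speech Science and Engineering. 33-38 (2004)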
- Related Report
  2003 Annual Research Report
[Publications] K.Maekawa, H.Kikuchi, W.Tsukahara: "Corpus of Spontaneous Japanese: Design, Annotation and XML Representation" Proceedings of the International Symposium on Large-scale Knowledge Resources (LKR2004). 19-24 (2004)
- Related Report
  2003 Annual Research Report
[Publications] K.Maekawa, H.Kikuchi: "Corpus-based analysis of vowel devoicing in spontaneous Japanese -An interim report-" J. van de Weijer, K.Nanjo, and T.Nishihara (eds.) Voicing in Japanese. The Hague: Mouton. (in press). (2004)
- Related Report
  2003 Annual Research Report
[Publications] K.Maekawa: "Design, compilation, and preliminary analyses of the Corpus of Spontaneous Japanese" Proceedings of the NTT-Stanford workshop on concept and language processing. 1. 13-14 (2002)
- Related Report
  2002 Annual Research Report
[Publications] K.Maekawa, H.Kikuchi, Y.Igarashi, J.Venditti: "X-JToBI: An extended J_ToBI for spontaneous speech" Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP2002), Denver, Colorado USA. 3. 1545-1548 (2002)
- Related Report
  2002 Annual Research Report
[Publications] K.Maekawa: "Possible uses of spoken language corpora" (in Japanese) 日本研究的深化与拓展 (Deepening and Expansion of Japanese Studies). 1. 46-47 (2002)
- Related Report
  2002 Annual Research Report
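[Publications] H.Kikuchi, K.Maekawa: "Verification of labeling accuracy with X-JToBI, a prosodic labeling scheme for spontaneous speech" (in Japanese) Proceedings of the 2002 Autumn Meeting of the Acoustical Society of Japan. 1. 259-260 (2002)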
- Related Report
  2002 Annual Research Report
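[Publications] H.Kikuchi, K.Maekawa: "Verification of the capabilities of X-JToBI, a prosodic labeling scheme for spontaneous speech" (in Japanese) Japanese Society for Artificial Intelligence, SIG-SLUD. A-202-06. 33-36 (2002)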
- Related Report
  2002 Annual Research Report
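[Publications] K.Maekawa: "A study of language variation using the 'Corpus of Spontaneous Japanese'" (in Japanese) Journal of the Phonetic Society of Japan. 6(3). 48-59 (2002)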
- Related Report
  2002 Annual Research Report
