FY2020 Research Status Report

Neural Machine Translation for User-Generated Content

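Research Project

Project/Area Number	20K19879
Research Institution	National Institute of Information and Communications Technology
Principal Investigator	MARIE BENJAMIN, National Institute of Information and Communications Technology, Advanced Speech Translation Research and Development Promotion Center, Advanced Translation Technology Laboratory, Researcher (30869433)
Project Period (FY)	2020-04-01 – 2022-03-31
Keywords	machine translation / user-generated content / COVID-19 / Asian languages / user-generated text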
Outline of Research Achievements	The objective of this research is to improve neural machine translation (NMT) for user-generated content (UGC), i.e., texts written by users of online services such as social networking services (SNS). The main challenge is that training NMT requires a large amount of human-produced translations. Such training data is not available for UGC and must be created. The main achievement of the first year of this research is a new method that transforms existing NMT training data into translations of UGC that can be used to train better NMT systems. For instance, given training data consisting of translations of parliamentary debates, the method gives this data the style of UGC, such as online discussions, while preserving its meaning. Training NMT on the transformed data yields systems that are better at translating any kind of UGC (tweets, restaurant reviews, online discussions, etc.). This work has been published in the journal "Transactions of the Association for Computational Linguistics". Another achievement is the creation of a new dataset for evaluating NMT for UGC: 1,099 tweets about the COVID-19 crisis translated from Japanese into English, Indonesian, Khmer, and Myanmar.
Current Status of Research Progress (Category)	2: Progressing largely as planned
Reason	All the objectives for this first year have been achieved as planned. A first method to generate synthetic parallel data of user-generated content (UGC) has been proposed and published. It enables significantly better machine translation of UGC. The new dataset of translations of UGC in Asian languages for evaluating machine translation of UGC has also been created as planned.
Strategy for Future Research Activity	
- Use the dataset created during the first year of this research (tweets about the COVID-19 crisis) to evaluate state-of-the-art neural machine translation (NMT) for the translation of user-generated content (UGC) from Japanese into English, Indonesian, Khmer, and Myanmar.
- Extend the dataset created during the first year to evaluate NMT for other kinds of UGC (online discussions, product reviews, ...) or other languages (Hindi, Chinese, ...).
- Propose new technologies to generate more UGC data to train better NMT systems.
- Present this research at top-tier international conferences.
Reason for Carrying Funds Over to the Next Fiscal Year	The main reason for the amount carried over to the next fiscal year is the focus of this research on user-generated content, especially tweets about the COVID-19 crisis, which took more time to collect. The remaining amount will be used to further extend the dataset created during the previous fiscal year. New languages and/or types of texts will be added to the dataset as originally planned. The data are ready to be translated by professional translators.

Research Products
(4 results)

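Journal Article (2 results) (of which International Co-authorship: 2, Peer Reviewed: 2, Open Access: 2); Presentation (2 results) (of which International Conference: 1)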

[Journal Article] Extremely low-resource neural machine translation for Asian languages (2020)
- Author(s)/Presenter(s)
  Rubino Raphael, Marie Benjamin, Dabre Raj, Fujita Atsushi, Utiyama Masao, Sumita Eiichiro
- Journal Title
  Machine Translation
  Volume: 34 Pages: 347-382
- DOI
  10.1007/s10590-020-09258-6
- Peer Reviewed / Open Access / International Co-authorship
[Journal Article] Synthesizing Parallel Data of User-Generated Texts with Zero-Shot Neural Machine Translation (2020)
- Author(s)/Presenter(s)
  Marie Benjamin, Fujita Atsushi
- Journal Title
  Transactions of the Association for Computational Linguistics
  Volume: 8 Pages: 710-725
- DOI
  10.1162/tacl_a_00341
- Peer Reviewed / Open Access / International Co-authorship
[Presentation] Altering Parallel Data into User-Generated Texts with Zero-Shot Neural Machine Translation (2021)
- Author(s)/Presenter(s)
  Marie Benjamin, Fujita Atsushi
- Organizer
  The 27th Annual Meeting of the Association for Natural Language Processing (NLP2021)
[Presentation] Tagged Back-translation Revisited: Why Does It Really Work? (2020)
- Author(s)/Presenter(s)
  Marie Benjamin, Rubino Raphael, Fujita Atsushi
- Organizer
  The 58th Annual Meeting of the Association for Computational Linguistics
- International Conference
