2020 Fiscal Year Research-status Report

Neural Machine Translation for User-Generated Content

Research Project

Project/Area Number	20K19879
Research Institution	National Institute of Information and Communications Technology
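Principal Investigator	MARIE BENJAMIN National Institute of Information and Communications Technology, Advanced Speech Translation Research and Development Promotion Center, Advanced Translation Technology Laboratory, Researcher (30869433)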
Project Period (FY)	2020-04-01 – 2022-03-31
Keywords	machine translation / user-generated content / COVID-19 / Asian languages / user-generated text
Outline of Annual Research Achievements	The objective of this research is to improve neural machine translation (NMT) for user-generated content (UGC), i.e., texts written by users of online services such as SNS. The main challenge is that training NMT requires a large amount of human-produced translations. Such training data is not available for UGC and must be created. The main achievement of the first year of this research is a new method that transforms existing NMT training data into translations of UGC that can be used to train better NMT. For instance, given training data made of translations of parliamentary debates, this method transforms the data to give it the style of UGC, such as online discussions, while preserving its meaning. Training NMT on the transformed data makes it better at translating any kind of UGC (tweets, restaurant reviews, online discussions, etc.). This work has been published in the journal "Transactions of the Association for Computational Linguistics". Another achievement is the creation of a new dataset for evaluating NMT for UGC: 1,099 tweets about the COVID-19 crisis translated from Japanese to English, Indonesian, Khmer, and Myanmar.
Current Status of Research Progress	2: Research has progressed on the whole more than it was originally planned. Reason: All the objectives for this first year have been achieved as planned. A first method to generate synthetic parallel data of user-generated content (UGC) has been proposed and published. It enables significantly better machine translation for UGC. The new datasets of translations of UGC in Asian languages for evaluating the translation of UGC have also been created as planned.
Strategy for Future Research Activity
- Use the datasets created during the first year of this research (tweets about the COVID-19 crisis) to evaluate state-of-the-art neural machine translation (NMT) for the translation of user-generated content (UGC) from Japanese to English, Indonesian, Khmer, and Myanmar.
- Extend the dataset created during the first year to evaluate NMT for other kinds of UGC (online discussions, product reviews, ...) or other languages (Hindi, Chinese, ...).
- Propose new technologies to generate more UGC data to train better NMT systems.
- Present this research at top-tier international conferences.
Causes of Carryover	The main reason an amount is carried over to the next fiscal year is that this research focuses on user-generated content, especially tweets about the COVID-19 crisis, which took more time to collect. The remaining amount will be used to further extend the dataset created during the previous fiscal year. New languages and/or types of texts will be added to the datasets as originally planned. The data are ready to be translated by professional translators.

Research Products
(4 results)

Journal Article (2 results) (of which Int'l Joint Research: 2 results, Peer Reviewed: 2 results, Open Access: 2 results) Presentation (2 results) (of which Int'l Joint Research: 1 result)

[Journal Article] Extremely low-resource neural machine translation for Asian languages (2020)
- Author(s)
  Rubino Raphael、Marie Benjamin、Dabre Raj、Fujita Atsushi、Utiyama Masao、Sumita Eiichiro
- Journal Title
  Machine Translation, Volume: 34, Pages: 347–382
- DOI
  10.1007/s10590-020-09258-6
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Synthesizing Parallel Data of User-Generated Texts with Zero-Shot Neural Machine Translation (2020)
- Author(s)
  Marie Benjamin、Fujita Atsushi
- Journal Title
  Transactions of the Association for Computational Linguistics, Volume: 8, Pages: 710–725
- DOI
  10.1162/tacl_a_00341
- Peer Reviewed / Open Access / Int'l Joint Research
[Presentation] Altering Parallel Data into User-Generated Texts with Zero-Shot Neural Machine Translation (2021)
- Author(s)
  Marie Benjamin、Fujita Atsushi
- Organizer
  The 27th Annual Meeting of the Association for Natural Language Processing (NLP2021)
[Presentation] Tagged Back-translation Revisited: Why Does It Really Work? (2020)
- Author(s)
  Marie Benjamin、Rubino Raphael、Fujita Atsushi
- Organizer
  The 58th Annual Meeting of the Association for Computational Linguistics
- Int'l Joint Research
