2020 Fiscal Year Research-status Report
Neural Machine Translation for User-Generated Content
Project/Area Number |
20K19879
|
Research Institution | National Institute of Information and Communications Technology |
Principal Investigator |
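MARIE BENJAMIN National Institute of Information and Communications Technology, Advanced Speech Translation Research and Development Promotion Center, Advanced Translation Technology Laboratory, Researcher (30869433)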
|
Project Period (FY) |
2020-04-01 – 2022-03-31
|
Keywords | machine translation / user-generated content / COVID-19 / Asian languages / user-generated text |
Outline of Annual Research Achievements |
The objective of this research is to improve neural machine translation (NMT) for user-generated content (UGC), i.e., texts written by users of online services such as SNS. The main challenge is that training NMT requires a large amount of human-produced translations. Such training data is not available for UGC and must be created. The main achievement of the first year of this research is a new method that transforms existing NMT training data into translations of UGC that can be used to train better NMT systems. For instance, given training data made of translations of parliamentary debates, this method transforms the data to give it the style of UGC, such as online discussions, while preserving its meaning. Training NMT on the transformed data yields systems that are better at translating any kind of UGC (tweets, restaurant reviews, online discussions, etc.). This work has been published in the journal "Transactions of the Association for Computational Linguistics". Another achievement is the creation of a new dataset for evaluating NMT for UGC: 1,099 tweets about the COVID-19 crisis translated from Japanese into English, Indonesian, Khmer, and Myanmar.
|
Current Status of Research Progress |
2: Research has progressed on the whole more than it was originally planned.
Reason
All the objectives for this first year have been achieved as planned. A first method to generate synthetic parallel data of user-generated content (UGC) has been proposed and published. It enables significantly better machine translation of UGC.
The new dataset of translations of UGC in Asian languages for evaluating NMT on UGC has also been created as planned.
|
Strategy for Future Research Activity |
- Use the dataset created during the first year of this research (tweets about the COVID-19 crisis) to evaluate state-of-the-art neural machine translation (NMT) for the translation of user-generated content (UGC) from Japanese to English, Indonesian, Khmer, and Myanmar.
- Extend the dataset created during the first year to evaluate NMT for other kinds of UGC (online discussions, product reviews, ...) or other languages (Hindi, Chinese, ...).
- Propose new technologies to generate more UGC data to train better NMT systems.
- Present this research at top-tier international conferences.
|
Causes of Carryover |
The main reason for the amount carried over to the next fiscal year is the focus of this research on user-generated content, especially tweets about the COVID-19 crisis, which took more time to collect. The remaining amount will be used to further extend the dataset created during the previous fiscal year. New languages and/or types of texts will be added to the dataset as originally planned. The data are ready to be translated by professional translators.
|