Summary of Research Achievements |
The objective of this research is to improve neural machine translation (NMT) for user-generated content (UGC), i.e., texts written by users of online services such as social networking services (SNS). The main challenge is that training NMT requires a large amount of human-produced translations. Such training data is not available for UGC and must be created. The main achievement of the first year of this research is a new method that transforms existing NMT training data into translations of UGC that can be used to train better NMT systems. For instance, given training data made of translations of parliamentary debates, the method gives this data the style of UGC, such as online discussions, while preserving its meaning. Training NMT on the transformed data yields systems that are better at translating various kinds of UGC (tweets, restaurant reviews, online discussions, etc.). This work has been published in the journal "Transactions of the Association for Computational Linguistics". Another achievement is the creation of a new dataset for evaluating NMT for UGC: 1,099 tweets about the COVID-19 crisis translated from Japanese into English, Indonesian, Khmer, and Myanmar.
|
Plan for Future Research |
- Use the dataset created during the first year of this research (tweets about the COVID-19 crisis) to evaluate state-of-the-art neural machine translation (NMT) for translating user-generated content (UGC) from Japanese into English, Indonesian, Khmer, and Myanmar.
- Extend the dataset created during the first year to evaluate NMT for other kinds of UGC (online discussions, product reviews, ...) or other languages (Hindi, Chinese, ...).
- Propose new technologies to generate more UGC data to train better NMT systems.
- Present this research at top-tier international conferences.
|