2017 Fiscal Year Annual Research Report
Multiple resource adaptation for low resource neural machine translation
Project/Area Number | 17H06822
Research Institution | Osaka University
Principal Investigator | チョ シンキ, Osaka University, Institute for Datability Science, Specially Appointed Assistant Professor (full-time) (70784891)
Project Period (FY) | 2017-08-25 – 2019-03-31
Keywords | Machine Translation / Low Resource / Domain Adaptation / Neural Machine Translation
Outline of Annual Research Achievements |
In Japan, because of the rapid increase in foreign tourists and the hosting of the 2020 Tokyo Olympic Games, translation needs are growing rapidly, making machine translation (MT) indispensable. In MT, translation knowledge is acquired from parallel corpora (sentence-aligned bilingual texts). However, because parallel corpora between Japanese and most other languages (e.g., Japanese-Indonesian) and in many domains (e.g., the medical domain) are very scarce (only tens of thousands of parallel sentences or fewer), translation quality is not satisfactory. Improving MT quality in this low-resource scenario is a challenging unsolved problem. The purpose of this research is to improve MT quality in this low-resource scenario using multiple resources, including parallel corpora of resource-rich languages (such as French-English) and domains (such as the parliamentary domain), as well as large-scale monolingual web corpora.

In FY2017, we established model adaptation technologies using resource-rich language and domain parallel corpora. Specifically, we obtained the following achievements: 1. Single language/domain adaptation: we developed novel methods and conducted a comprehensive empirical comparison of previous studies; these achievements have been published at ACL 2017 (the top conference in natural language processing) and accepted for publication in the Journal of Information Processing in June. 2. Multiple language/domain adaptation: we also developed methods for domain adaptation using multilingual and multi-domain corpora, and presented our work at NLP 2018.
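To make the adaptation setting concrete, the sketch below illustrates one common family of techniques for adapting an NMT model with both a resource-rich out-of-domain corpus and a small in-domain corpus: mixing the two with artificial domain tags and oversampling the smaller one, then continuing training on the mixture. This is a minimal illustration only; the file names, tag strings, and oversampling heuristic are hypothetical and do not necessarily reflect the exact methods evaluated in this project.

```python
# Illustrative sketch only: prepare a mixed training corpus for NMT domain
# adaptation by (1) prepending an artificial domain tag to every source
# sentence and (2) oversampling the smaller in-domain corpus so that both
# domains contribute comparable amounts of data. File names are hypothetical.

from pathlib import Path


def read_parallel(src_path, tgt_path):
    """Read a sentence-aligned parallel corpus as a list of (src, tgt) pairs."""
    with open(src_path, encoding="utf-8") as f_src, \
         open(tgt_path, encoding="utf-8") as f_tgt:
        return [(s.strip(), t.strip()) for s, t in zip(f_src, f_tgt)]


def tag_and_mix(out_domain, in_domain, in_tag="<in>", out_tag="<out>"):
    """Tag each pair with its domain and oversample the in-domain data."""
    mixed = [(f"{out_tag} {s}", t) for s, t in out_domain]
    # Oversample: repeat the in-domain corpus until it roughly matches the
    # out-of-domain corpus size (a common heuristic, not a fixed rule).
    repeats = max(1, len(out_domain) // max(1, len(in_domain)))
    mixed += [(f"{in_tag} {s}", t) for s, t in in_domain] * repeats
    return mixed


if __name__ == "__main__":
    # Hypothetical files: a resource-rich out-of-domain corpus (parliamentary)
    # and a small in-domain corpus (medical).
    out_domain = read_parallel("parliament.ja", "parliament.en")
    in_domain = read_parallel("medical.ja", "medical.en")
    mixed = tag_and_mix(out_domain, in_domain)
    Path("mixed.ja").write_text("\n".join(s for s, _ in mixed) + "\n", encoding="utf-8")
    Path("mixed.en").write_text("\n".join(t for _, t in mixed) + "\n", encoding="utf-8")
```

In practice, a model pre-trained on the out-of-domain corpus would be further trained on this mixed corpus, and at test time the in-domain tag would be prepended to each input sentence.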
Current Status of Research Progress |
2: Research has progressed on the whole more than it was originally planned.
Reason
This research is divided into three sub-topics: 1. Model adaptation using resource-rich language and domain parallel corpora; 2. Data adaptation using large-scale monolingual web corpora; 3. Multiple resource adapted system integration (see the sketch below). In FY2017, we established model adaptation technologies based on both resource-rich language and domain parallel corpora, as scheduled.
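As a rough illustration of what sub-topic 3 involves, one generic way to integrate several independently adapted systems is to ensemble them by interpolating their next-token probability distributions at each decoding step. The sketch below uses toy distributions and uniform weights purely for illustration; it is not the integration method this project has committed to.

```python
# Illustrative sketch only: combine the next-token probability distributions
# of several independently adapted NMT systems by linear interpolation (a
# simple ensemble). Real systems would expose such distributions at each
# decoding step; here the distributions and weights are toy values.

def interpolate(distributions, weights):
    """Linearly interpolate token probability distributions from several systems."""
    assert len(distributions) == len(weights)
    combined = {}
    for dist, w in zip(distributions, weights):
        for token, p in dist.items():
            combined[token] = combined.get(token, 0.0) + w * p
    return combined


if __name__ == "__main__":
    # Toy next-token distributions from two adapted systems for the same prefix.
    lang_adapted = {"hospital": 0.6, "clinic": 0.3, "school": 0.1}
    domain_adapted = {"hospital": 0.4, "clinic": 0.5, "ward": 0.1}
    mixed = interpolate([lang_adapted, domain_adapted], weights=[0.5, 0.5])
    best = max(mixed, key=mixed.get)
    print(best, mixed[best])  # the ensemble's most likely next token
```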
Strategy for Future Research Activity |
We will study the remaining two topics as scheduled: data adaptation using large-scale monolingual web corpora, and multiple resource adapted system integration. In our journal paper, to be published in the Journal of Information Processing in June, we have already conducted a comparison of previous studies on these two topics. In addition, we wrote a survey paper on domain adaptation for neural machine translation and submitted it to COLING 2018 (a top conference in natural language processing). We believe that these preliminary studies will help our research in FY2018 proceed smoothly.
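As one concrete example of how large-scale monolingual web corpora can be exploited, a widely used technique (mentioned here only for illustration, not as this project's chosen method) is back-translation: target-language monolingual sentences are translated into the source language with a reverse model to create synthetic parallel data, which is then mixed with the genuine parallel corpus. In the sketch below, reverse_translate is a stub standing in for a real target-to-source NMT system, and the file names are hypothetical.

```python
# Illustrative sketch only: build synthetic parallel data from target-language
# monolingual text via back-translation. reverse_translate is a stub standing
# in for a real English-to-Japanese (target-to-source) NMT model; file names
# are hypothetical placeholders.

def reverse_translate(target_sentence):
    """Stub for a target-to-source NMT model; replace with a real system."""
    return "<synthetic source for: " + target_sentence + ">"


def back_translate(mono_path, out_src_path, out_tgt_path):
    """Create a synthetic parallel corpus from a monolingual target-side file."""
    with open(mono_path, encoding="utf-8") as f_mono, \
         open(out_src_path, "w", encoding="utf-8") as f_src, \
         open(out_tgt_path, "w", encoding="utf-8") as f_tgt:
        for line in f_mono:
            target = line.strip()
            if not target:
                continue
            f_src.write(reverse_translate(target) + "\n")  # synthetic source
            f_tgt.write(target + "\n")                     # real target


if __name__ == "__main__":
    back_translate("web_monolingual.en", "synthetic.ja", "synthetic.en")
```

The resulting synthetic corpus would then be combined with the genuine low-resource parallel data when training the translation model.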