Outline of Research at the Start
Machine translation has achieved significant advances during the last decade thanks to deep learning technologies and the establishment of neural machine translation (NMT). However, noisy user-generated content (UGC), for instance from online social networks, can still cause disastrous mistranslations in most NMT systems. NMT for UGC remains an under-studied and challenging topic. This research will create new datasets of UGC for evaluating state-of-the-art NMT systems and will propose new methods to improve NMT for UGC through unsupervised machine translation and style-transfer technologies.
Outline of Annual Research Achievements
The main achievement of the second year of this research is a new method for extending monolingual data in a low-resource domain and style (e.g., tweets on the topic of COVID-19) into a much larger corpus for training NMT. For instance, given a small set of Japanese tweets (e.g., 1,000 tweets) about the COVID-19 crisis, which is too small to train NMT, the method artificially extends it to millions of tweets on the same topic, making it useful for training better NMT systems for translating tweets. Training NMT on this artificial data yields systems that translate better even in domains and styles for which very little data is available. Experiments have been successfully conducted in various domains and styles (medical, IT, news, tweets, online discussions) and languages (French, German, Japanese). This work has also been extended to "personalizing" NMT, i.e., adapting NMT so that it translates texts written by a specific person while preserving that person's writing style.
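The report does not specify how the monolingual data is extended, so the snippet below is only an illustrative sketch of one common way such an extension could be implemented: sampling synthetic in-domain sentences from a causal language model that is assumed to have been fine-tuned beforehand on the small seed set of tweets. The checkpoint path, prompt, and sampling parameters are placeholders and are not part of the reported method; the resulting synthetic monolingual corpus could then be used, e.g., via back-translation, to adapt an NMT system.

    # Illustrative sketch (not the reported method): sample synthetic in-domain
    # text from a causal LM assumed to be fine-tuned on a small seed set of tweets.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_DIR = "path/to/lm-finetuned-on-seed-tweets"  # hypothetical checkpoint

    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoModelForCausalLM.from_pretrained(MODEL_DIR)
    model.eval()

    def generate_synthetic_tweets(prompt: str, n: int = 8, max_new_tokens: int = 40):
        """Sample n synthetic in-domain sentences conditioned on a short prompt."""
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                do_sample=True,           # stochastic decoding for diverse outputs
                top_p=0.95,               # nucleus sampling
                temperature=0.9,
                max_new_tokens=max_new_tokens,
                num_return_sequences=n,
                pad_token_id=tokenizer.eos_token_id,
            )
        return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

    if __name__ == "__main__":
        # Repeating this over many prompts drawn from the seed tweets would expand
        # a small seed corpus into a much larger synthetic in-domain corpus.
        for sentence in generate_synthetic_tweets("COVID-19"):
            print(sentence)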