Project/Area Number | 20K19879 |
Research Category | Grant-in-Aid for Early-Career Scientists |
Allocation Type | Multi-year Fund |
Review Section | Basic Section 61030: Intelligent informatics-related |
Research Institution | National Institute of Information and Communications Technology |
Principal Investigator | MARIE BENJAMIN, National Institute of Information and Communications Technology, Advanced Speech Translation Research and Development Promotion Center, Advanced Translation Technology Laboratory, Researcher (30869433) |
Project Period (FY) | 2020-04-01 – 2022-03-31 |
Project Status | Discontinued (Fiscal Year 2021) |
Budget Amount | ¥4,160,000 (Direct Cost: ¥3,200,000, Indirect Cost: ¥960,000) |
Fiscal Year 2021: ¥520,000 (Direct Cost: ¥400,000, Indirect Cost: ¥120,000)
Fiscal Year 2020: ¥3,640,000 (Direct Cost: ¥2,800,000, Indirect Cost: ¥840,000)
|
Keywords | machine translation / COVID-19 / user-generated content / language model / Asian languages / user-generated text / deep learning / unsupervised learning / social media |
Outline of Research at the Start |
Machine translation has achieved significant advances during the last decade thanks to deep learning technologies and the establishment of neural machine translation (NMT). However, noisy user-generated content (UGC), for instance from online social networks, can still cause disastrous mistranslations in most NMT systems. NMT for UGC is an under-studied and challenging topic. This research will create new datasets of UGC for evaluating state-of-the-art NMT systems and will propose new methods to improve NMT for UGC through unsupervised machine translation and style-transfer technologies.
|
Outline of Annual Research Achievements |
The main achievement of the second year of this research is a new method that extends monolingual data in a low-resource domain and style (e.g., tweets on the topic of COVID-19) to generate larger data for training NMT. For instance, given a small set of Japanese tweets about the COVID-19 crisis (e.g., 1,000 tweets), which is too small to train NMT, the method artificially extends it to millions of tweets on the same topic, making it useful for training better NMT systems for translating tweets. Training NMT on this artificial data yields systems that translate better even for domains and styles for which very little data is available. Experiments were successfully conducted in various domains and styles (medical, IT, news, tweets, online discussions) and languages (French, German, Japanese). This work has also been extended to "personalizing" NMT, i.e., adapting NMT so that it translates texts written by a specific person while preserving the writing characteristics of that person.
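The record does not specify how the small seed corpus is extended; as a purely illustrative toy sketch (not the project's actual technique), the general idea of growing a small in-domain corpus into a much larger synthetic one can be shown by sampling from a language model estimated on the seed — here a simple bigram model stands in for the large neural language models such work typically relies on:

```python
import random
from collections import defaultdict

def train_bigram(sentences):
    """Collect bigram continuations; None marks sentence start/end."""
    model = defaultdict(list)
    for s in sentences:
        tokens = [None] + s.split() + [None]
        for a, b in zip(tokens, tokens[1:]):
            model[a].append(b)
    return model

def sample_sentence(model, rng, max_len=20):
    """Sample one synthetic sentence by walking the bigram table."""
    out, cur = [], None
    for _ in range(max_len):
        cur = rng.choice(model[cur])
        if cur is None:  # reached an end-of-sentence marker
            break
        out.append(cur)
    return " ".join(out)

def augment(seed_sentences, n, seed=0):
    """Extend a small in-domain seed corpus to n synthetic sentences."""
    model = train_bigram(seed_sentences)
    rng = random.Random(seed)
    return [sample_sentence(model, rng) for _ in range(n)]

# Hypothetical tiny seed corpus of in-domain (COVID-19-style) sentences.
seed_corpus = [
    "new covid cases reported today",
    "covid vaccine rollout continues today",
    "new vaccine doses arrive",
]
synthetic = augment(seed_corpus, 1000)
print(len(synthetic))  # prints 1000
```

The synthetic sentences stay within the vocabulary and style of the seed set, which is the property that makes such artificial data useful as in-domain training material for NMT; in practice the generator would be a large pretrained model rather than a bigram table.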
|