Summary of Research Achievements |
We carried out two lines of work to improve machine translation this year.

First, to improve machine translation for lectures, we created Japanese-English and Chinese-English corpora for training machine translation models. Translated lecture transcripts help learners follow online courses; however, building a high-quality lecture translation system is hindered by the lack of publicly available parallel corpora. To address this, we examined a framework for parallel corpus mining, which provides a quick and effective way to mine parallel corpora from publicly available lectures on Coursera. To create the dataset, we proposed a sentence alignment algorithm and used it to extract Japanese-English and Chinese-English corpora of approximately 50,000 lines each. Machine translation experiments show that the mined corpora significantly improve the quality of lecture transcript translation when used in conjunction with out-of-domain parallel corpora. This work has been submitted to a journal and is under review.

Second, to lower the cost of building machine translation systems, especially for low-resource domains and languages, we proposed an efficient way to pre-train sequence-to-sequence models. Pre-trained models are widely used for machine translation; however, training them requires large datasets and substantial computational resources. We proposed a data selection algorithm that selects a tiny but representative subset from a billion-scale dataset. Experimental results show that pre-training with 0.26% of the data and 7.3% of the energy consumption achieves about 90% of the performance on machine translation.
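The sentence alignment algorithm itself is not detailed in this summary. As an illustration only, the following is a minimal sketch of one common family of approaches: monotonic 1-to-1 alignment by dynamic programming with a pluggable similarity score. The `length_sim` scorer below is a hypothetical stand-in; a real aligner would score candidate pairs with a bilingual lexicon or multilingual sentence embeddings.

```python
def length_sim(s, t):
    # Hypothetical stand-in score in [-1, 1] based on length ratio only;
    # real aligners use bilingual lexicons or sentence embeddings.
    longer = max(len(s), len(t)) or 1
    return 2.0 * min(len(s), len(t)) / longer - 1.0


def align(src, tgt, sim=length_sim, skip_penalty=-0.5):
    """1-1 monotonic sentence alignment via dynamic programming,
    allowing either side to skip a sentence at a fixed penalty."""
    n, m = len(src), len(tgt)
    NEG = float("-inf")
    score = [[NEG] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if score[i][j] == NEG:
                continue
            # match src[i] with tgt[j]
            if i < n and j < m:
                s = score[i][j] + sim(src[i], tgt[j])
                if s > score[i + 1][j + 1]:
                    score[i + 1][j + 1] = s
                    back[i + 1][j + 1] = (i, j, True)
            # skip one source sentence
            if i < n and score[i][j] + skip_penalty > score[i + 1][j]:
                score[i + 1][j] = score[i][j] + skip_penalty
                back[i + 1][j] = (i, j, False)
            # skip one target sentence
            if j < m and score[i][j] + skip_penalty > score[i][j + 1]:
                score[i][j + 1] = score[i][j] + skip_penalty
                back[i][j + 1] = (i, j, False)
    # backtrack from the final cell, collecting matched pairs
    pairs, (i, j) = [], (n, m)
    while back[i][j] is not None:
        pi, pj, matched = back[i][j]
        if matched:
            pairs.append((src[pi], tgt[pj]))
        i, j = pi, pj
    return pairs[::-1]
```

The skip penalty trades recall against precision: a harsher penalty forces more matches, while a milder one lets the aligner drop sentences that have no counterpart (common in transcripts).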
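The data selection algorithm for pre-training is likewise only summarized above. One hypothetical way to pick a tiny but representative subset from a large pool is greedy k-center selection over document feature vectors, sketched below; how the feature vectors are produced is assumed to happen elsewhere and is not part of the reported method.

```python
import math


def euclid(p, q):
    # Euclidean distance between two equal-length feature vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_center_greedy(points, k):
    """Pick k indices that cover the feature space: repeatedly add the
    point farthest from everything selected so far."""
    selected = [0]  # arbitrary seed point
    dist = [euclid(p, points[0]) for p in points]
    while len(selected) < k:
        far = max(range(len(points)), key=dist.__getitem__)
        selected.append(far)
        # update each point's distance to its nearest selected center
        for i, p in enumerate(points):
            dist[i] = min(dist[i], euclid(p, points[far]))
    return selected
```

Because each newly selected point is the one farthest from the current subset, near-duplicate documents contribute almost nothing, which is the intuition behind covering a billion-scale corpus with a far smaller representative sample.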
|
Current Degree of Progress (Category) |
2: Progressing rather smoothly
Reason
This year we met some of our planned goals. First, to improve machine translation for lectures, we achieved our goals of collecting raw data from the Coursera website, proposing a sentence alignment algorithm, and creating Japanese-English and Chinese-English parallel corpora. Second, to improve low-resource machine translation, we achieved the goal of proposing an efficient pre-training method. In addition to the above work, I am also involved in work that uses both visual and textual information to improve machine translation, known as multimodal machine translation.
|
Strategy for Future Research Activity |
To construct high-quality multilingual machine translation systems, collecting parallel data is essential. We have collected large-scale raw data for more than 60 low-resource languages. This year, instead of extracting parallel corpora for each specific language pair, we plan to propose language-agnostic alignment algorithms that extract multilingual multi-way parallel sentences from the raw data. Specifically, we will 1) try multilingual models and verify them through experiments, 2) propose algorithms to extract multi-way parallel sentences, and 3) evaluate the performance of the proposed method through machine translation experiments.
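As a rough illustration of the planned language-agnostic direction, the sketch below shows embedding-based bitext mining: sentences from any language pair are mapped into a shared vector space by a multilingual encoder (assumed to exist outside this snippet), and each source sentence is linked to its most similar target when the similarity clears a threshold. The encoder, the threshold value, and the plain cosine scoring are all assumptions for illustration, not the algorithm to be proposed.

```python
import math


def cosine(u, v):
    # Cosine similarity between two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def mine_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """Link each source sentence (by index) to its nearest target
    sentence in the shared embedding space, keeping only confident
    matches. Vectors are assumed to come from a multilingual encoder."""
    pairs = []
    for i, u in enumerate(src_vecs):
        sims = [cosine(u, v) for v in tgt_vecs]
        j = max(range(len(sims)), key=sims.__getitem__)
        if sims[j] >= threshold:
            pairs.append((i, j))
    return pairs
```

Because the scoring never looks at the languages themselves, the same routine could in principle be run across all language pairs at once, which is what makes an embedding-based approach attractive for extracting multi-way parallel sentences.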
|