Summary of Research Achievements |
We carried out two lines of work to improve machine translation this year.

First, to improve machine translation for lectures, we created Japanese-English and Chinese-English corpora for training machine translation models. Translated lecture transcripts help learners follow online courses; however, building a high-quality lecture translation system is hindered by the lack of publicly available parallel corpora. To address this, we examined a framework for parallel corpus mining, which provides a quick and effective way to mine parallel corpora from publicly available lectures on Coursera. To create the dataset, we proposed a sentence alignment algorithm and used it to extract Japanese-English and Chinese-English corpora of approximately 50,000 lines each. Machine translation experiments show that the mined corpora significantly improve the quality of lecture transcript translation when used in conjunction with out-of-domain parallel corpora. This work has been submitted to a journal and is under review.

Second, to lower the cost of building machine translation systems, especially for low-resource domains and languages, we proposed an efficient way to pre-train sequence-to-sequence models. Pre-trained models are widely used for machine translation; however, training them requires large datasets and substantial computational resources. We proposed a data selection algorithm that selects a tiny but representative subset from a billion-scale dataset. Experimental results show that pre-training with 0.26% of the data and 7.3% of the energy consumption achieves about 90% of the performance on machine translation.
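The sentence alignment algorithm itself is not detailed in this summary. As an illustration only, the following is a minimal sketch of one common family of approaches: monotonic 1-to-1 alignment by dynamic programming with a pluggable similarity score. The `length_sim` scorer below is a hypothetical stand-in; a real aligner would score candidate pairs with a bilingual lexicon or multilingual sentence embeddings.

```python
def length_sim(s, t):
    # Hypothetical stand-in score in [-1, 1] based on length ratio only;
    # real aligners use bilingual lexicons or sentence embeddings.
    longer = max(len(s), len(t)) or 1
    return 2.0 * min(len(s), len(t)) / longer - 1.0


def align(src, tgt, sim=length_sim, skip_penalty=-0.5):
    """1-1 monotonic sentence alignment via dynamic programming,
    allowing either side to skip a sentence at a fixed penalty."""
    n, m = len(src), len(tgt)
    NEG = float("-inf")
    score = [[NEG] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if score[i][j] == NEG:
                continue
            # match src[i] with tgt[j]
            if i < n and j < m:
                s = score[i][j] + sim(src[i], tgt[j])
                if s > score[i + 1][j + 1]:
                    score[i + 1][j + 1] = s
                    back[i + 1][j + 1] = (i, j, True)
            # skip one source sentence
            if i < n and score[i][j] + skip_penalty > score[i + 1][j]:
                score[i + 1][j] = score[i][j] + skip_penalty
                back[i + 1][j] = (i, j, False)
            # skip one target sentence
            if j < m and score[i][j] + skip_penalty > score[i][j + 1]:
                score[i][j + 1] = score[i][j] + skip_penalty
                back[i][j + 1] = (i, j, False)
    # backtrack from the final cell, collecting matched pairs
    pairs, (i, j) = [], (n, m)
    while back[i][j] is not None:
        pi, pj, matched = back[i][j]
        if matched:
            pairs.append((src[pi], tgt[pj]))
        i, j = pi, pj
    return pairs[::-1]
```

The skip penalty trades recall against precision: a harsher penalty forces more matches, while a milder one lets the aligner drop sentences that have no counterpart (common in transcripts).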
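The data selection algorithm for pre-training is likewise only summarized above. One hypothetical way to pick a tiny but representative subset from a large pool is greedy k-center selection over document feature vectors, sketched below; how the feature vectors are produced is assumed to happen elsewhere and is not part of the reported method.

```python
import math


def euclid(p, q):
    # Euclidean distance between two equal-length feature vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_center_greedy(points, k):
    """Pick k indices that cover the feature space: repeatedly add the
    point farthest from everything selected so far."""
    selected = [0]  # arbitrary seed point
    dist = [euclid(p, points[0]) for p in points]
    while len(selected) < k:
        far = max(range(len(points)), key=dist.__getitem__)
        selected.append(far)
        # update each point's distance to its nearest selected center
        for i, p in enumerate(points):
            dist[i] = min(dist[i], euclid(p, points[far]))
    return selected
```

Because each newly selected point is the one farthest from the current subset, near-duplicate documents contribute almost nothing, which is the intuition behind covering a billion-scale corpus with a far smaller representative sample.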
|
Current Degree of Progress (Category) |
2: Progressing rather smoothly
Reason
This year we met some of our planned goals. First, to improve machine translation for lectures, we achieved our goals of collecting raw data from the Coursera website, proposing a sentence alignment algorithm, and creating Japanese-English and Chinese-English parallel corpora. Second, to improve low-resource machine translation, we achieved the goal of proposing an efficient pre-training method. In addition to the above work, I am also involved in work that uses both visual and textual information to improve machine translation, known as multimodal machine translation.
|
Strategy for Future Research Activity |
To construct high-quality multilingual machine translation systems, collecting parallel data is essential. We have collected large-scale raw data for more than 60 low-resource languages. This year, instead of extracting parallel corpora for each specific language pair, we plan to propose language-agnostic alignment algorithms that extract multilingual multi-way parallel sentences from the raw data. Specifically, we will 1) try multilingual models and verify them through experiments, 2) propose algorithms to extract multi-way parallel sentences, and 3) evaluate the performance of the proposed method through machine translation experiments.
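As a rough illustration of the planned language-agnostic direction, the sketch below shows embedding-based bitext mining: sentences from any language pair are mapped into a shared vector space by a multilingual encoder (assumed to exist outside this snippet), and each source sentence is linked to its most similar target when the similarity clears a threshold. The encoder, the threshold value, and the plain cosine scoring are all assumptions for illustration, not the algorithm to be proposed.

```python
import math


def cosine(u, v):
    # Cosine similarity between two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def mine_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """Link each source sentence (by index) to its nearest target
    sentence in the shared embedding space, keeping only confident
    matches. Vectors are assumed to come from a multilingual encoder."""
    pairs = []
    for i, u in enumerate(src_vecs):
        sims = [cosine(u, v) for v in tgt_vecs]
        j = max(range(len(sims)), key=sims.__getitem__)
        if sims[j] >= threshold:
            pairs.append((i, j))
    return pairs
```

Because the scoring never looks at the languages themselves, the same routine could in principle be run across all language pairs at once, which is what makes an embedding-based approach attractive for extracting multi-way parallel sentences.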
|