
2022 Fiscal Year Annual Research Report

Low-Resource Machine Translation Integrating Pre-training and Multilingual Semantic Representation Learning

Research Project

Project/Area Number: 22J13719
Allocation Type: Grant-in-Aid
Research Institution: Kyoto University

Principal Investigator

MAO Zhuoyuan, Kyoto University, Graduate School of Informatics, Research Fellow (DC2)

Project Period (FY): 2022-04-22 – 2024-03-31
Keywords: multilingual translation / low-resource translation / multilingual embedding / model efficiency
Outline of Annual Research Achievements

In the past year, we focused on improving the efficiency of multilingual sentence representation learning and on exploring novel methods for improving multilingual machine translation. Both lines of research advance multilingual / low-resource neural machine translation.
(1) We proposed an efficient and effective training method and presented the work at 言語処理学会 2023 (the Annual Meeting of the Association for Natural Language Processing). In addition, we proposed knowledge distillation for compressing a large model, which enables efficient model inference; this work was accepted to the EACL 2023 main conference. These achievements will accelerate the collection of parallel sentences for training translation systems: the model training phase can be accelerated by 4 to 16 times, and the model inference phase achieves a 2.5 to 5 times speedup, with even greater gains on downstream tasks.
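To illustrate the compression idea, here is a minimal sketch of feature distillation between sentence encoders — our own simplified assumption, not the actual implementation: the student's smaller embedding is linearly projected into the teacher's space, and the mean squared distance between the normalized embedding pairs is the loss to minimize.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-9):
    # Scale each vector to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def distillation_loss(teacher_emb, student_emb, proj):
    """Feature distillation: project the student's low-dimensional embedding
    up to the teacher's dimension, then penalize the mean squared distance
    between the normalized teacher/student embedding pairs."""
    t = l2_normalize(teacher_emb)
    s = l2_normalize(student_emb @ proj)  # (batch, d_s) @ (d_s, d_t)
    return float(np.mean(np.sum((t - s) ** 2, axis=-1)))

# Toy batch: a 768-dim teacher distilled into a 128-dim student.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 768))
student = rng.normal(size=(4, 128))
proj = rng.normal(size=(128, 768))
loss = distillation_loss(teacher, student, proj)
```

In practice the projection and the student encoder would be trained jointly by gradient descent; the numpy version only shows the shape of the objective.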
(2) We explored novel ways to improve multilingual translation systems with a word-level contrastive learning technique and obtained better translation quality for low-resource language pairs; this work was accepted to Findings of NAACL 2022. We also explained the improvements by showing the relationship between BLEU scores and the sentence retrieval performance of the NMT encoder, which suggests that future work can focus on further improving the encoder's retrieval performance in many-to-many NMT and on the feasibility of the contrastive objective in massively multilingual scenarios.
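The word-level contrastive idea can be sketched as an InfoNCE-style objective over aligned word pairs — again a simplified assumption for illustration, not the exact published formulation: aligned source/target word vectors are pulled together while the other pairs in the batch act as negatives.

```python
import numpy as np

def word_contrastive_loss(src_vecs, tgt_vecs, temperature=0.1):
    """InfoNCE-style loss over word pairs: row i of src_vecs is aligned with
    row i of tgt_vecs, and every other row in the batch is a negative."""
    s = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    t = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    logits = (s @ t.T) / temperature                 # (n, n) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))       # pull aligned pairs together

# Perfectly aligned embeddings should score far lower than random ones.
rng = np.random.default_rng(0)
aligned = np.eye(4)
low = word_contrastive_loss(aligned, aligned)
high = word_contrastive_loss(rng.normal(size=(4, 4)), rng.normal(size=(4, 4)))
```

The word alignments that supply the positive pairs would come from an external aligner; here the pairing is simply assumed by row index.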

Current Status of Research Progress (Assessment)

2: Progressing rather smoothly

Reason

We have mostly completed the plans intended for the past year, including proposing novel methods for training multilingual neural machine translation systems and exploring corpus construction for multilingual / low-resource neural machine translation. However, as recent work on large language models (GPT) shows that the scale of the model and training data is essential, we adjusted our original plan of constructing corpora ourselves. Instead, we focused on the efficiency of the methods for constructing new training data, proposing two methods for improving training efficiency and inference efficiency, respectively. Therefore, the current research progress is good, with only an appropriate adjustment to one specific sub-plan.

Strategy for Future Research Activity

In the following year, we will focus on improving the translation quality for more language pairs, especially for zero-shot neural machine translation.
Specifically, we will first explore the optimal model setting for training large-scale multilingual neural machine translation systems. Subsequently, we will explore ways to improve the translation quality for zero-resource language pairs by training intermediate language-agnostic sentence representations within the encoder-decoder model architecture.
Moreover, we will submit our efficient and effective sentence representation learning method for journal review and present our existing work at international conferences to promote progress in multilingual / low-resource machine translation.
Furthermore, with the emergence of GPT-like large language models, we plan to add a new research topic as a sub-project to this series of translation research. Specifically, we will explore how to prompt large language models to perform well in any desired translation direction. We plan to utilize our proposed multilingual sentence representation techniques to generate robust translation task-specific prompts for large language models.
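As a rough illustration of the planned prompting direction, the sketch below (entirely hypothetical — the function name, the 2-dimensional "embeddings", and the example sentences are all ours) retrieves the parallel examples closest to the input by sentence-embedding similarity and assembles them into a few-shot translation prompt:

```python
import numpy as np

def build_translation_prompt(query_emb, pool_embs, pool_pairs, src_text,
                             src_lang="German", tgt_lang="English", k=2):
    """Hypothetical sketch: pick the k parallel examples whose precomputed
    sentence embeddings are most similar (by cosine) to the query sentence,
    then format them as a few-shot translation prompt for an LLM."""
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    top = np.argsort(-(p @ q))[:k]                   # nearest neighbors first
    lines = [f"Translate {src_lang} to {tgt_lang}."]
    for i in top:
        src, tgt = pool_pairs[i]
        lines.append(f"{src_lang}: {src}\n{tgt_lang}: {tgt}")
    lines.append(f"{src_lang}: {src_text}\n{tgt_lang}:")
    return "\n\n".join(lines)

# Toy retrieval pool with 2-dim "embeddings" for illustration only.
pool_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
pool_pairs = [("Guten Morgen", "Good morning"),
              ("Danke schön", "Thank you very much"),
              ("Wie geht es dir?", "How are you?")]
prompt = build_translation_prompt(np.array([1.0, 0.1]), pool_embs, pool_pairs,
                                  "Bis bald", k=1)
```

A real system would compute the embeddings with a multilingual sentence encoder and send the resulting prompt to the language model.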

  • Research Products

    (11 results)

All 2022

All: Journal Article (2 results: international co-authorship 2, peer-reviewed 2, open access 1), Presentation (8 results: international conference 7), Organizing Symposium (1 result)

  • [Journal Article] Linguistically Driven Multi-Task Pre-Training for Low-Resource Neural Machine Translation (2022)

    • Author(s)
      Zhuoyuan Mao, Chenhui Chu and Sadao Kurohashi
    • Journal

      ACM Transactions on Asian and Low-Resource Language Information Processing

      Volume: 21, Issue: 4, Article: 68 / Pages: 1-29

    • DOI

      10.1145/3491065

    • Peer Reviewed / Open Access / International Co-authorship
  • [Journal Article] SCTB-V2: the 2nd Version of the Chinese Treebank in the Scientific Domain (2022)

    • Author(s)
      Chenhui Chu, Zhuoyuan Mao, Toshiaki Nakazawa, Daisuke Kawahara and Sadao Kurohashi
    • Journal

      Language Resources and Evaluation

      Volume: Oct. 2022 / Pages: 1-15

    • DOI

      10.1007/s10579-022-09615-2

    • Peer Reviewed / International Co-authorship
  • [Presentation] When do Contrastive Word Alignments Improve Many-to-many Neural Machine Translation? (2022)

    • Author(s)
      Zhuoyuan Mao, Chenhui Chu, Raj Dabre, Haiyue Song, Zhen Wan and Sadao Kurohashi
    • Conference
      Findings of the Association for Computational Linguistics: NAACL 2022
    • International Conference
  • [Presentation] Seeking Diverse Reasoning Logic: Controlled Equation Expression Generation for Solving Math Word Problems (2022)

    • Author(s)
      Yibin Shen, Qianying Liu, Zhuoyuan Mao, Zhen Wan, Fei Cheng and Sadao Kurohashi
    • Conference
      Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing
    • International Conference
  • [Presentation] BERTSeg: BERT Based Unsupervised Subword Segmentation for Neural Machine Translation (2022)

    • Author(s)
      Haiyue Song, Raj Dabre, Zhuoyuan Mao, Chenhui Chu and Sadao Kurohashi
    • Conference
      Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing
    • International Conference
  • [Presentation] Textual Enhanced Contrastive Learning for Solving Math Word Problems (2022)

    • Author(s)
      Yibin Shen, Qianying Liu, Zhuoyuan Mao, Fei Cheng and Sadao Kurohashi
    • Conference
      Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
    • International Conference
  • [Presentation] Rescue Implicit and Long-tail Cases: Nearest Neighbor Relation Extraction (2022)

    • Author(s)
      Zhen Wan, Qianying Liu, Zhuoyuan Mao, Fei Cheng, Sadao Kurohashi and Jiwei Li
    • Conference
      Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
    • International Conference
  • [Presentation] LEALLA: Learning Lightweight Language-agnostic Sentence Embedding with Knowledge Distillation (2022)

    • Author(s)
      Zhuoyuan Mao and Tetsuji Nakagawa
    • Conference
      Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
    • International Conference
  • [Presentation] Relation Extraction with Weighted Contrastive Pre-training on Distant Supervision (2022)

    • Author(s)
      Zhen Wan, Fei Cheng, Qianying Liu, Zhuoyuan Mao, Haiyue Song and Sadao Kurohashi
    • Conference
      Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
    • International Conference
  • [Presentation] Efficiently Learning Multilingual Sentence Representation for Cross-lingual Sentence Classification (2022)

    • Author(s)
      Zhuoyuan Mao, Chenhui Chu and Sadao Kurohashi
    • Conference
      言語処理学会 第29回年次大会 (the 29th Annual Meeting of the Association for Natural Language Processing)
  • [Organizing Symposium] Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (2022)


Published: 2023-12-25
