2022 Fiscal Year Annual Research Report
Unifying Pre-training and Multilingual Semantic Representation Learning for Low-resource Neural Machine Translation
Project/Area Number | 22J13719 |
Allocation Type | Single-year Grants |
Research Institution | Kyoto University |
Principal Investigator | 毛 卓遠, Kyoto University, Graduate School of Informatics, Research Fellow (DC2) |
Project Period (FY) | 2022-04-22 – 2024-03-31 |
Keywords | multilingual translation / low-resource translation / multilingual embedding / model efficiency |
Outline of Annual Research Achievements |
In the past year, we focused on improving the efficiency of multilingual sentence representation learning and on exploring novel methods for improving multilingual machine translation. Both lines of research advance multilingual / low-resource neural machine translation. (1) We proposed an efficient and effective training method for multilingual sentence representation learning and presented the work at the 2023 Annual Meeting of the Association for Natural Language Processing (言語処理学会). We also proposed a knowledge distillation method for compressing a large sentence representation model, which enables efficient model inference; this work was accepted to the EACL 2023 main conference. These achievements will accelerate the collection of parallel sentences for training translation systems: the model training phase can be accelerated by 4-16 times, and the model inference phase achieves a 2.5-5 times speedup, with even faster speed on downstream tasks. (2) We explored novel ways to improve multilingual translation systems with a word-level contrastive learning technique and obtained better translation quality for low-resource language pairs; this work was accepted to the Findings of NAACL 2022. We also explained the improvements by showing the relationship between BLEU scores and the sentence retrieval performance of the NMT encoder, which suggests that future work can focus on further improving the encoder's retrieval performance in many-to-many NMT and on the contrastive objective's feasibility in a massively multilingual scenario.
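As a concrete illustration of the model-compression direction above, the following is a minimal sketch of embedding-space knowledge distillation, in which a small student encoder is trained to reproduce the sentence embeddings of a large frozen multilingual teacher. The module names, dimensions, and the choice of an MSE feature-matching loss are illustrative assumptions, not the exact formulation of the accepted EACL 2023 method.

```python
# Minimal sketch (assumptions labeled): distill a large multilingual sentence
# encoder into a smaller student by matching L2-normalized sentence embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillationHead(nn.Module):
    """Projects student embeddings into the teacher's embedding space (illustrative)."""
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_emb: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(student_emb), dim=-1)

def distillation_loss(student_emb, teacher_emb, head):
    """MSE between projected student embeddings and frozen teacher embeddings."""
    teacher_emb = F.normalize(teacher_emb, dim=-1).detach()  # teacher is not updated
    return F.mse_loss(head(student_emb), teacher_emb)

# Toy tensors stand in for encoder outputs (batch of 8 sentences).
head = DistillationHead(student_dim=256, teacher_dim=768)
student_emb = torch.randn(8, 256)   # from a small student encoder
teacher_emb = torch.randn(8, 768)   # from a large multilingual teacher
loss = distillation_loss(student_emb, teacher_emb, head)
loss.backward()
```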
|
Current Status of Research Progress |
2: Research has progressed on the whole more than it was originally planned.
Reason
We have largely completed the plans intended for the past year, including proposing novel methods for training multilingual neural machine translation systems and exploring corpus construction for multilingual / low-resource neural machine translation. However, as recent work on large language models (e.g., GPT) shows that the scale of the model and training data is essential, we adjusted our original plan of constructing corpora ourselves. Instead, we focused on the efficiency of the methods used to construct new training data, proposing two methods that improve training efficiency and inference efficiency, respectively. Therefore, the current research progress is good, with only an appropriate adjustment to one specific sub-plan.
|
Strategy for Future Research Activity |
In the following year, we will focus on improving translation quality for more language pairs, especially for zero-shot neural machine translation. Specifically, we will first explore the optimal model settings for training large-scale multilingual neural machine translation systems. Subsequently, we will explore ways to improve translation quality for zero-resource language pairs by training intermediate language-agnostic sentence representations within the encoder-decoder model architecture. Moreover, we will submit our previous efficient and effective sentence representation learning method for journal review and present our existing work at international conferences to promote progress in multilingual / low-resource machine translation. Furthermore, with the emergence of GPT-like large language models, we plan to add a new research topic as a sub-project within this series of translation research. Specifically, we will explore how to prompt large language models to perform well in any desired translation direction. We plan to utilize our proposed multilingual sentence representation techniques to generate robust, translation-task-specific prompts for large language models.
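To make the last point more concrete, below is a minimal sketch of one way multilingual sentence embeddings could be used to build translation prompts: retrieving the most similar example pairs from a pool and formatting them as few-shot demonstrations. The embed() placeholder, the prompt template, and the example pool are hypothetical; the actual prompting strategy is part of the planned future work.

```python
# Minimal sketch (assumptions labeled): embedding-based retrieval of few-shot
# translation examples for prompting a large language model.
import numpy as np

def embed(sentences):
    """Placeholder for a multilingual sentence encoder; returns unit vectors."""
    rng = np.random.default_rng(0)
    vecs = rng.standard_normal((len(sentences), 512))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def build_prompt(source, pool, k=3, src_lang="German", tgt_lang="English"):
    """Retrieve the k pool sentences closest to the source and format a prompt."""
    src_vec = embed([source])[0]
    pool_vecs = embed([s for s, _ in pool])
    sims = pool_vecs @ src_vec                      # cosine similarity of unit vectors
    top = np.argsort(-sims)[:k]
    demos = "\n".join(f"{src_lang}: {pool[i][0]}\n{tgt_lang}: {pool[i][1]}" for i in top)
    return f"{demos}\n{src_lang}: {source}\n{tgt_lang}:"

# Hypothetical example pool and query sentence.
pool = [("Guten Morgen.", "Good morning."),
        ("Wie geht es dir?", "How are you?"),
        ("Das Wetter ist schön.", "The weather is nice.")]
print(build_prompt("Guten Abend.", pool, k=2))
```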
|