2022 Fiscal Year Annual Research Report
Unifying Pre-training and Multilingual Semantic Representation Learning for Low-resource Neural Machine Translation
Project/Area Number | 22J13719 |
Allocation Type | Single-year Grants |
Research Institution | Kyoto University |
Principal Investigator | 毛 卓遠, Kyoto University, Graduate School of Informatics, Research Fellow (DC2) |
Project Period (FY) | 2022-04-22 – 2024-03-31 |
Keywords | multilingual translation / low-resource translation / multilingual embedding / model efficiency |
Outline of Annual Research Achievements |
In the past year, we focused on improving the efficiency of multilingual sentence representation learning and on exploring novel methods for improving multilingual machine translation. Both lines of research advance multilingual / low-resource neural machine translation. (1) We proposed an efficient and effective training method for multilingual sentence representation learning and presented the work at the 2023 Annual Meeting of the Association for Natural Language Processing (言語処理学会). We also proposed a knowledge distillation method for compressing a large sentence representation model, which enables efficient model inference; this work was accepted to the EACL 2023 main conference. These achievements will accelerate the collection of parallel sentences for training translation systems: the model training phase can be accelerated by 4-16 times, and the model inference phase achieves a 2.5-5 times speedup, with even faster speed on downstream tasks. (2) We explored novel ways to improve multilingual translation systems with a word-level contrastive learning technique and obtained better translation quality for low-resource language pairs; this work was accepted to the Findings of NAACL 2022. We also explained the improvements by showing the relationship between BLEU scores and the sentence retrieval performance of the NMT encoder, which suggests that future work can focus on further improving the encoder's retrieval performance in many-to-many NMT and on the contrastive objective's feasibility in a massively multilingual scenario.
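As a concrete illustration of the model-compression direction above, the following is a minimal sketch of embedding-space knowledge distillation, in which a small student encoder is trained to reproduce the sentence embeddings of a large frozen multilingual teacher. The module names, dimensions, and the choice of an MSE feature-matching loss are illustrative assumptions, not the exact formulation of the accepted EACL 2023 method.

```python
# Minimal sketch (assumptions labeled): distill a large multilingual sentence
# encoder into a smaller student by matching L2-normalized sentence embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillationHead(nn.Module):
    """Projects student embeddings into the teacher's embedding space (illustrative)."""
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_emb: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(student_emb), dim=-1)

def distillation_loss(student_emb, teacher_emb, head):
    """MSE between projected student embeddings and frozen teacher embeddings."""
    teacher_emb = F.normalize(teacher_emb, dim=-1).detach()  # teacher is not updated
    return F.mse_loss(head(student_emb), teacher_emb)

# Toy tensors stand in for encoder outputs (batch of 8 sentences).
head = DistillationHead(student_dim=256, teacher_dim=768)
student_emb = torch.randn(8, 256)   # from a small student encoder
teacher_emb = torch.randn(8, 768)   # from a large multilingual teacher
loss = distillation_loss(student_emb, teacher_emb, head)
loss.backward()
```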
|
Current Status of Research Progress |
2: Research has progressed on the whole more than it was originally planned.
Reason
We have largely completed the plans intended for the past year, including proposing novel methods for training multilingual neural machine translation systems and exploring corpus construction for multilingual / low-resource neural machine translation. However, as recent work on large language models (e.g., GPT) shows that the scale of the model and training data is essential, we adjusted our original plan of constructing corpora ourselves. Instead, we focused on the efficiency of the methods used to construct new training data, proposing two methods that improve training efficiency and inference efficiency, respectively. Therefore, the current research progress is good, with only an appropriate adjustment to one specific sub-plan.
|
Strategy for Future Research Activity |
In the following year, we will focus on improving translation quality for more language pairs, especially for zero-shot neural machine translation. Specifically, we will first explore the optimal model settings for training large-scale multilingual neural machine translation systems. Subsequently, we will explore ways to improve translation quality for zero-resource language pairs by training intermediate language-agnostic sentence representations within the encoder-decoder model architecture. Moreover, we will submit our previous efficient and effective sentence representation learning method for journal review and present our existing work at international conferences to promote progress in multilingual / low-resource machine translation. Furthermore, with the emergence of GPT-like large language models, we plan to add a new research topic as a sub-project within this series of translation research. Specifically, we will explore how to prompt large language models to perform well in any desired translation direction. We plan to utilize our proposed multilingual sentence representation techniques to generate robust, translation-task-specific prompts for large language models.
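To make the last point more concrete, below is a minimal sketch of one way multilingual sentence embeddings could be used to build translation prompts: retrieving the most similar example pairs from a pool and formatting them as few-shot demonstrations. The embed() placeholder, the prompt template, and the example pool are hypothetical; the actual prompting strategy is part of the planned future work.

```python
# Minimal sketch (assumptions labeled): embedding-based retrieval of few-shot
# translation examples for prompting a large language model.
import numpy as np

def embed(sentences):
    """Placeholder for a multilingual sentence encoder; returns unit vectors."""
    rng = np.random.default_rng(0)
    vecs = rng.standard_normal((len(sentences), 512))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def build_prompt(source, pool, k=3, src_lang="German", tgt_lang="English"):
    """Retrieve the k pool sentences closest to the source and format a prompt."""
    src_vec = embed([source])[0]
    pool_vecs = embed([s for s, _ in pool])
    sims = pool_vecs @ src_vec                      # cosine similarity of unit vectors
    top = np.argsort(-sims)[:k]
    demos = "\n".join(f"{src_lang}: {pool[i][0]}\n{tgt_lang}: {pool[i][1]}" for i in top)
    return f"{demos}\n{src_lang}: {source}\n{tgt_lang}:"

# Hypothetical example pool and query sentence.
pool = [("Guten Morgen.", "Good morning."),
        ("Wie geht es dir?", "How are you?"),
        ("Das Wetter ist schön.", "The weather is nice.")]
print(build_prompt("Guten Abend.", pool, k=2))
```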
|