2022 Fiscal Year Annual Research Report
Multilingual corpus construction and domain adaptation for low-resource machine translation
Project/Area Number | 21J23124 |
Allocation Type | Single-year Grants |
Research Institution | Kyoto University |
Principal Investigator | 宋 海越 (Haiyue Song), Kyoto University, Graduate School of Informatics, JSPS Research Fellow (DC1) |
Project Period (FY) | 2021-04-28 – 2024-03-31 |
Keywords | machine translation / ChatGPT / subword segmentation |
Outline of Annual Research Achievements |
During this fiscal year, I published five papers, with one further journal paper under review. Three of these are first-authored: 1) the first, published at the international conference AACL-IJCNLP 2022, exploits BERT-based unsupervised subword segmentation for neural machine translation and is effective in low-resource to high-resource scenarios; 2) the second, published at the domestic conference NLP2023, uses machine translation of prompts to adapt GPT-3 to Japanese tasks; 3) the third, submitted to the NLP journal, leverages information from multiple subword segmenters through a proposed subword-relation-aware attention mechanism and an alignment loss objective. The other works are: video information for multimodal NMT, published in the JIP journal; contrastive word alignments for multilingual NMT, published at the top international conference NAACL 2022; and contrastive pre-training for relation extraction, published at the top international conference EMNLP 2022. Two co-authored papers are under review for the international conference ACL 2023 and one for EAMT 2023. I also participated in on-campus symposiums and workshops in Japan, where I communicated with many researchers. Moreover, I did an internship at NICT, a national laboratory focusing on machine translation, and we applied for one patent on the BERT-based unsupervised subword segmentation.
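As a minimal illustration of the idea behind the BERT-based unsupervised subword segmentation work, the Python sketch below ranks candidate segmentations of a single word by masked-LM pseudo-log-likelihood under a pretrained BERT. The scoring criterion, the `candidate_segmentations` helper, and the model choice are assumptions made for this sketch, not the published method.

```python
# Hedged sketch: rank candidate subword segmentations of one word by
# masked-LM pseudo-log-likelihood under a pretrained BERT. Helper names
# and the scoring criterion are assumptions for illustration.
import itertools

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")
model.eval()

def candidate_segmentations(word, max_splits=2):
    """Enumerate segmentations of `word` into contiguous pieces."""
    for k in range(max_splits + 1):
        for cuts in itertools.combinations(range(1, len(word)), k):
            bounds = (0, *cuts, len(word))
            pieces = [word[a:b] for a, b in zip(bounds, bounds[1:])]
            # BERT marks word-internal continuation pieces with "##".
            yield [pieces[0]] + ["##" + p for p in pieces[1:]]

def pseudo_log_likelihood(pieces):
    """Sum masked-LM log-probs, masking one piece at a time."""
    ids = tokenizer.convert_tokens_to_ids(pieces)
    if tokenizer.unk_token_id in ids:
        return float("-inf")  # a piece is not in the BERT vocabulary
    ids = [tokenizer.cls_token_id] + ids + [tokenizer.sep_token_id]
    total = 0.0
    for i in range(1, len(ids) - 1):
        masked = ids.copy()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(torch.tensor([masked])).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

# e.g. picks something like ['token', '##ization'] for an unseen word
print(max(candidate_segmentations("tokenization"), key=pseudo_log_likelihood))
```

Summed pseudo-log-likelihood favors segmentations whose pieces BERT finds predictable; a practical system would also normalize for the number of pieces.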
Current Status of Research Progress |
2: Research has progressed on the whole more than it was originally planned.
Reason
In this project, we focus on improving the performance of neural machine translation (NMT) systems, especially low-resource machine translation. So far, I have completed part of the project's goals, including building a multilingual parallel dataset and creating a high-quality NMT system through improved subword-segmented input and a multiple-segmentation-aware model. Moreover, following the rise of ChatGPT, we have also conducted experiments on leveraging machine translation to improve the performance of ChatGPT on Japanese natural language processing tasks. In detail, we have done the following:
1) To improve low-resource machine translation quality, we built a BERT-based unsupervised subword segmentation system that generates linguistically motivated segmentations for English words, including rare or unseen words. Experimental results show improved performance on Asian-language-to-English translation directions.
2) We built a multiple-subword-aware NMT system that leverages information from multiple subword segmenters through a proposed subword-relation-aware attention mechanism and an alignment loss objective, which improves the translation system from the model perspective.
3) We applied machine translation to assist ChatGPT on non-English data: we translate the Japanese input into English and combine both as the input to ChatGPT (see the sketch below). With the precise information from the original Japanese data and the English translation, English being the main training data of ChatGPT, we observed near-human performance on the JGLUE (Japanese GLUE) dataset.
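The translate-then-prompt pipeline in point 3 can be sketched as follows; the model names, prompt wording, and the use of the OpenAI chat API for the MT step are illustrative assumptions, not the exact experimental setup.

```python
# Hedged sketch of the translate-then-prompt pipeline: combine the original
# Japanese input with its English machine translation in a single prompt.
# Model names and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def chat(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def machine_translate_ja_en(japanese: str) -> str:
    # Placeholder MT step; any Japanese-to-English system could be used here.
    return chat(f"Translate the following Japanese into English:\n{japanese}")

def solve_japanese_task(task_instruction: str, japanese: str) -> str:
    english = machine_translate_ja_en(japanese)
    return chat(
        f"{task_instruction}\n"
        f"Japanese input: {japanese}\n"
        f"English translation: {english}\n"
        f"Answer:"
    )
```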
Strategy for Future Research Activity |
Recently, large models led by ChatGPT have provided convenient solutions for various natural language processing tasks; however, there is little research exploring their use for machine translation, especially in low-resource scenarios. We plan to explore how to apply large language models such as ChatGPT to multilingual and low-resource machine translation. Currently we aim at two ideas: 1) using the existing ChatGPT model with improved prompting methods, and 2) fine-tuning an open-sourced GPT-like model. For the first idea, we will first test the existing GPT-4 model on multilingual machine translation through the official API. We focus on the prompt construction process, including retrieving examples from the train set that are similar to the input source or target sentence (sketched below), and retrieving sentences in the same language family as the target language, for example, using a larger English-Chinese dataset to improve translation quality in the English-Japanese direction. This is especially useful for low-resource languages that have a similar higher-resource language. For the second idea, we plan to fine-tune our own GPT-like model for the machine translation task. We will adapt the existing model to more fine-grained domains, such as the low-resource machine translation task, using small-scale supervised data. Compared to using the API, a locally trained model lets us better understand how the model works by inspecting the training and inference process.
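As a sketch of the planned prompt construction for the first idea, the snippet below retrieves the training pairs whose source side is most similar to the input sentence and prepends them as few-shot examples. Character n-gram TF-IDF similarity, the function name, and the prompt format stand in for whatever retriever and template are finally adopted.

```python
# Hedged sketch: few-shot MT prompt built from retrieved similar train pairs.
# TF-IDF similarity and all names here are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_fewshot_prompt(src, train_pairs, k=3):
    """train_pairs: list of (source, target) sentences from the train set."""
    sources = [s for s, _ in train_pairs]
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    matrix = vec.fit_transform(sources + [src])
    # Similarity of the input sentence to every training source sentence.
    sims = cosine_similarity(matrix[-1], matrix[:-1])[0]
    top = sims.argsort()[::-1][:k]
    lines = ["Translate English to Japanese."]
    for i in top:
        s, t = train_pairs[i]
        lines.append(f"English: {s}\nJapanese: {t}")
    lines.append(f"English: {src}\nJapanese:")
    return "\n\n".join(lines)

print(build_fewshot_prompt(
    "The cat sat on the mat.",
    [("A cat sleeps on the sofa.", "猫がソファで寝ている。"),
     ("It is raining today.", "今日は雨が降っている。"),
     ("The dog sat on the rug.", "犬が敷物の上に座った。")],
    k=2,
))
```

Retrieved pairs from a related higher-resource language (e.g. English-Chinese examples for an English-Japanese input) could be appended to the same prompt in the same way.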