2022 Fiscal Year Annual Research Report
Multilingual corpus construction and domain adaptation for low-resource machine translation
Project/Area Number | 21J23124 |
Allocation Type | Single-year Grants |
Research Institution | Kyoto University |
Principal Investigator | 宋 海越 (Haiyue Song), Kyoto University, Graduate School of Informatics, JSPS Research Fellow (DC1) |
Project Period (FY) | 2021-04-28 – 2024-03-31 |
Keywords | machine translation / ChatGPT / subword segmentation |
Outline of Annual Research Achievements |
During this fiscal year, I published five papers, with one further journal paper under review. Three of these are first-authored: 1) the first, published at the international conference AACL-IJCNLP 2022, exploits BERT-based unsupervised subword segmentation for neural machine translation and is effective in low-resource to high-resource scenarios; 2) the second, published at the domestic conference NLP2023, uses machine translation of prompts to adapt GPT-3 to Japanese tasks; 3) the third, submitted to the NLP journal, leverages information from multiple subword segmenters through a proposed subword-relation-aware attention mechanism and an alignment loss objective. The other works are: video information for multimodal NMT, published in the JIP journal; contrastive word alignments for multilingual NMT, published at the top international conference NAACL 2022; and contrastive pre-training for relation extraction, published at the top international conference EMNLP 2022. Two co-authored papers are under review for the international conference ACL 2023 and one for EAMT 2023. I also participated in on-campus symposiums and workshops in Japan, where I communicated with many researchers. Moreover, I did an internship at NICT, a national laboratory focusing on machine translation, and we applied for one patent on the BERT-based unsupervised subword segmentation.
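As a minimal illustration of the idea behind the BERT-based unsupervised subword segmentation work, the Python sketch below ranks candidate segmentations of a single word by masked-LM pseudo-log-likelihood under a pretrained BERT. The scoring criterion, the `candidate_segmentations` helper, and the model choice are assumptions made for this sketch, not the published method.

```python
# Hedged sketch: rank candidate subword segmentations of one word by
# masked-LM pseudo-log-likelihood under a pretrained BERT. Helper names
# and the scoring criterion are assumptions for illustration.
import itertools

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")
model.eval()

def candidate_segmentations(word, max_splits=2):
    """Enumerate segmentations of `word` into contiguous pieces."""
    for k in range(max_splits + 1):
        for cuts in itertools.combinations(range(1, len(word)), k):
            bounds = (0, *cuts, len(word))
            pieces = [word[a:b] for a, b in zip(bounds, bounds[1:])]
            # BERT marks word-internal continuation pieces with "##".
            yield [pieces[0]] + ["##" + p for p in pieces[1:]]

def pseudo_log_likelihood(pieces):
    """Sum masked-LM log-probs, masking one piece at a time."""
    ids = tokenizer.convert_tokens_to_ids(pieces)
    if tokenizer.unk_token_id in ids:
        return float("-inf")  # a piece is not in the BERT vocabulary
    ids = [tokenizer.cls_token_id] + ids + [tokenizer.sep_token_id]
    total = 0.0
    for i in range(1, len(ids) - 1):
        masked = ids.copy()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(torch.tensor([masked])).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

# e.g. picks something like ['token', '##ization'] for an unseen word
print(max(candidate_segmentations("tokenization"), key=pseudo_log_likelihood))
```

Summed pseudo-log-likelihood favors segmentations whose pieces BERT finds predictable; a practical system would also normalize for the number of pieces.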
Current Status of Research Progress |
2: Research has progressed on the whole more than it was originally planned.
Reason
In this project, we focus on improving the performance of neural machine translation (NMT) systems, especially low-resource machine translation. So far, I have completed part of the project's goals, including building a multilingual parallel dataset and creating a high-quality NMT system through improved subword-segmented input and a multiple-segmentation-aware model. Moreover, following the rise of ChatGPT, we have also conducted experiments on leveraging machine translation to improve the performance of ChatGPT on Japanese natural language processing tasks. In detail, we have done the following:
1) To improve low-resource machine translation quality, we built a BERT-based unsupervised subword segmentation system that generates linguistically motivated segmentations for English words, including rare or unseen words. Experimental results show improved performance on Asian-language-to-English translation directions.
2) We built a multiple-subword-aware NMT system that leverages information from multiple subword segmenters through a proposed subword-relation-aware attention mechanism and an alignment loss objective, which improves the translation system from the model perspective.
3) We applied machine translation to assist ChatGPT on non-English data: we translate the Japanese input into English and combine both as the input to ChatGPT (see the sketch below). With the precise information from the original Japanese data and the English translation, English being the main training data of ChatGPT, we observed near-human performance on the JGLUE (Japanese GLUE) dataset.
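The translate-then-prompt pipeline in point 3 can be sketched as follows; the model names, prompt wording, and the use of the OpenAI chat API for the MT step are illustrative assumptions, not the exact experimental setup.

```python
# Hedged sketch of the translate-then-prompt pipeline: combine the original
# Japanese input with its English machine translation in a single prompt.
# Model names and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def chat(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def machine_translate_ja_en(japanese: str) -> str:
    # Placeholder MT step; any Japanese-to-English system could be used here.
    return chat(f"Translate the following Japanese into English:\n{japanese}")

def solve_japanese_task(task_instruction: str, japanese: str) -> str:
    english = machine_translate_ja_en(japanese)
    return chat(
        f"{task_instruction}\n"
        f"Japanese input: {japanese}\n"
        f"English translation: {english}\n"
        f"Answer:"
    )
```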
Strategy for Future Research Activity |
Recently, large models led by ChatGPT have provided convenient solutions for various natural language processing tasks; however, there is little research exploring their use for machine translation, especially in low-resource scenarios. We plan to explore how to apply large language models such as ChatGPT to multilingual and low-resource machine translation. Currently we aim at two ideas: 1) using the existing ChatGPT model with improved prompting methods, and 2) fine-tuning an open-sourced GPT-like model. For the first idea, we will first test the existing GPT-4 model on multilingual machine translation through the official API. We focus on the prompt construction process, including retrieving examples from the train set that are similar to the input source or target sentence (sketched below), and retrieving sentences in the same language family as the target language, for example, using a larger English-Chinese dataset to improve translation quality in the English-Japanese direction. This is especially useful for low-resource languages that have a similar higher-resource language. For the second idea, we plan to fine-tune our own GPT-like model for the machine translation task. We will adapt the existing model to more fine-grained domains, such as the low-resource machine translation task, using small-scale supervised data. Compared to using the API, a locally trained model lets us better understand how the model works by inspecting the training and inference process.
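As a sketch of the planned prompt construction for the first idea, the snippet below retrieves the training pairs whose source side is most similar to the input sentence and prepends them as few-shot examples. Character n-gram TF-IDF similarity, the function name, and the prompt format stand in for whatever retriever and template are finally adopted.

```python
# Hedged sketch: few-shot MT prompt built from retrieved similar train pairs.
# TF-IDF similarity and all names here are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_fewshot_prompt(src, train_pairs, k=3):
    """train_pairs: list of (source, target) sentences from the train set."""
    sources = [s for s, _ in train_pairs]
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    matrix = vec.fit_transform(sources + [src])
    # Similarity of the input sentence to every training source sentence.
    sims = cosine_similarity(matrix[-1], matrix[:-1])[0]
    top = sims.argsort()[::-1][:k]
    lines = ["Translate English to Japanese."]
    for i in top:
        s, t = train_pairs[i]
        lines.append(f"English: {s}\nJapanese: {t}")
    lines.append(f"English: {src}\nJapanese:")
    return "\n\n".join(lines)

print(build_fewshot_prompt(
    "The cat sat on the mat.",
    [("A cat sleeps on the sofa.", "猫がソファで寝ている。"),
     ("It is raining today.", "今日は雨が降っている。"),
     ("The dog sat on the rug.", "犬が敷物の上に座った。")],
    k=2,
))
```

Retrieved pairs from a related higher-resource language (e.g. English-Chinese examples for an English-Japanese input) could be appended to the same prompt in the same way.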