2022 Fiscal Year Annual Research Report

Multilingual corpus construction and domain adaptation for low-resource machine translation

Research Project

Project/Area Number 21J23124
Allocation Type Single-year Grants
Research Institution Kyoto University

Principal Investigator

SONG Haiyue (宋 海越), Kyoto University, Graduate School of Informatics, JSPS Research Fellow (DC1)

Project Period (FY) 2021-04-28 – 2024-03-31
Keywords machine translation / ChatGPT / subword segmentation
Outline of Annual Research Achievements

During this fiscal year, I published five papers, and one journal paper is under review. Three are first-author works: 1) the first, published at the international conference AACL-IJCNLP 2022, exploits BERT-based unsupervised subword segmentation for neural machine translation and is effective in low-resource to high-resource scenarios; 2) the second, published at the domestic conference NLP2023, utilizes machine translation of prompts to adapt GPT-3 to Japanese tasks; 3) the third, submitted to the NLP journal, leverages information from multiple subword segmenters through a proposed subword-relation-aware attention mechanism and an aligning loss objective. Other works include video information for multimodal NMT, published in the JIP journal; contrastive word alignments for multilingual NMT, published at the top international conference NAACL 2022; and contrastive pre-training for relation extraction, published at the top international conference EMNLP 2022. Two co-authored papers are under review for the international conference ACL 2023, and one is under review for EAMT 2023. I also participated in symposiums on campus and workshops in Japan and communicated with many researchers there.
Moreover, I took an internship at NICT, a national laboratory focusing on machine translation, and we applied for one patent on the BERT-based unsupervised subword segmentation.

Current Status of Research Progress

2: Research has progressed on the whole more than it was originally planned.

Reason

In this project, we focus on improving the performance of neural machine translation (NMT) systems, especially for low-resource machine translation. So far, I have completed part of the project goals, including building a multilingual parallel dataset and creating a high-quality NMT system through improved subword segmentation input and a multiple-segmentation-aware model. Moreover, following the trend around ChatGPT, we have also conducted experiments on leveraging machine translation to improve the performance of ChatGPT on Japanese natural language processing tasks. In detail, we have done the following (illustrative code sketches for each item appear after the list):
1) To improve low-resource machine translation quality, we built a BERT-based unsupervised subword segmentation system that generates linguistically motivated segmentations for English words, including rare or unseen words. Experimental results show improved performance on Asian-language-to-English translation directions.
2) We built a multiple-subword-aware neural machine translation system that leverages information from multiple subword segmenters through a proposed subword-relation-aware attention mechanism and an aligning loss objective, which improves the translation system from the model perspective.
3) We applied machine translation to assist ChatGPT on data other than English. We translate the Japanese input into English and combine the two as the input to ChatGPT. With the precise information from the original Japanese data plus the English data, which matches the main training data of ChatGPT, we observed near-human performance on the JGLUE (Japanese GLUE) dataset.
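For item 1), below is a minimal sketch of the general idea of scoring candidate subword splits of a word with a BERT masked language model. The pseudo-log-likelihood scoring, the helper names, and the choice of bert-base-cased are illustrative assumptions; the published BERTSeg method may differ in detail.

    # Hedged sketch: choose among candidate splits of a word by BERT's
    # masked-LM pseudo-log-likelihood. Illustrative only, not the exact
    # BERTSeg algorithm.
    import itertools
    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")
    model.eval()

    def pseudo_log_likelihood(pieces):
        """Sum of log P(piece_i | rest), masking each piece in turn."""
        ids = tokenizer.convert_tokens_to_ids(pieces)
        if tokenizer.unk_token_id in ids:   # split uses out-of-vocabulary pieces
            return float("-inf")
        input_ids = torch.tensor([[tokenizer.cls_token_id, *ids, tokenizer.sep_token_id]])
        total = 0.0
        for i, tok_id in enumerate(ids, start=1):   # position 0 is [CLS]
            masked = input_ids.clone()
            masked[0, i] = tokenizer.mask_token_id
            with torch.no_grad():
                logits = model(masked).logits
            total += torch.log_softmax(logits[0, i], dim=-1)[tok_id].item()
        return total

    def segment(word, max_splits=2):
        """Return the candidate split that BERT finds most predictable."""
        candidates = [[word]]               # the unsegmented word itself
        for k in range(1, max_splits + 1):
            for cuts in itertools.combinations(range(1, len(word)), k):
                bounds = (0, *cuts, len(word))
                pieces = [word[a:b] for a, b in zip(bounds, bounds[1:])]
                # BERT marks word-internal pieces with a "##" prefix.
                candidates.append([pieces[0]] + ["##" + p for p in pieces[1:]])
        return max(candidates, key=pseudo_log_likelihood)

    print(segment("unseen"))                # e.g. ['un', '##seen']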
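For item 2), the sketch below shows one plausible way to relate subwords produced by two different segmenters through character-span overlap; such a relation matrix could serve as input to an attention bias or an aligning loss. The helper names char_spans and relation_matrix are hypothetical, and the paper's exact formulation is not reproduced here.

    # Hedged sketch: a 0/1 relation matrix between two segmentations of the
    # same word, based on overlapping character spans.
    def char_spans(pieces):
        spans, start = [], 0
        for p in pieces:
            surface = p.lstrip("#")          # drop BERT-style "##" continuation marks
            spans.append((start, start + len(surface)))
            start += len(surface)
        return spans

    def relation_matrix(seg_a, seg_b):
        """M[i][j] = 1 iff subword i of seg_a overlaps subword j of seg_b."""
        A, B = char_spans(seg_a), char_spans(seg_b)
        return [[int(a0 < b1 and b0 < a1) for (b0, b1) in B] for (a0, a1) in A]

    bpe     = ["un", "##se", "##en"]         # e.g. a BPE segmentation
    unigram = ["un", "##seen"]               # e.g. a unigram-LM segmentation
    print(relation_matrix(bpe, unigram))     # [[1, 0], [0, 1], [0, 1]]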
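For item 3), here is a rough sketch of the translation-assisted prompting setup, assuming the official openai Python client; the model name, the sentiment task, and the prompt wording are placeholders rather than the exact experimental configuration.

    # Hedged sketch: pair the original Japanese input with its machine
    # translation in a single prompt. Illustrative setup, not the paper's.
    from openai import OpenAI

    client = OpenAI()   # reads OPENAI_API_KEY from the environment

    def classify_with_multilingual_prompt(ja_text: str, en_text: str) -> str:
        prompt = (
            "Decide whether the following review is positive or negative.\n"
            f"Japanese (original): {ja_text}\n"
            f"English (machine translation): {en_text}\n"
            "Answer with one word: positive or negative."
        )
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return resp.choices[0].message.content.strip()

    print(classify_with_multilingual_prompt(
        "この映画は素晴らしかった。", "This movie was wonderful."))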

Strategy for Future Research Activity

Recently, large models led by ChatGPT have provided convenient solutions for various natural language processing tasks; however, there is little research on leveraging them for the machine translation task, especially in low-resource scenarios. We plan to explore how to apply large language models such as ChatGPT to multilingual and low-resource machine translation. Currently we aim at two ideas: 1) using the existing ChatGPT model with improved prompting methods, and 2) fine-tuning an open-sourced GPT-like model.
For the first idea, we will first test the existing GPT-4 model on the multilingual machine translation task through the official API. We focus on the prompt construction process, which includes retrieving examples from the training set that are similar to the input source or target sentence, and retrieving sentences in the same language family as the target language; for example, using a larger English-Chinese dataset to improve translation quality in the English-Japanese direction. This is especially useful for low-resource languages that have a similar higher-resource language. A minimal sketch of the retrieval-based prompt construction follows.
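The sketch below illustrates the example-retrieval idea: pick the training pairs most similar to the input sentence and place them in the prompt as few-shot demonstrations. The character-overlap similarity from difflib, the toy training set, and the prompt format are stand-ins for whatever retriever and data are actually used.

    # Hedged sketch: retrieval-based few-shot prompt construction for MT.
    from difflib import SequenceMatcher

    train = [
        ("I like cats.", "私は猫が好きです。"),
        ("He reads books every day.", "彼は毎日本を読みます。"),
        ("The weather is nice today.", "今日は天気がいいです。"),
    ]

    def retrieve(src: str, k: int = 2):
        """Return the k training pairs whose source side is most similar."""
        return sorted(train,
                      key=lambda pair: SequenceMatcher(None, src, pair[0]).ratio(),
                      reverse=True)[:k]

    def build_prompt(src: str) -> str:
        shots = "\n".join(f"English: {e}\nJapanese: {j}" for e, j in retrieve(src))
        return f"Translate English to Japanese.\n{shots}\nEnglish: {src}\nJapanese:"

    print(build_prompt("I like dogs."))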
For the second idea, we plan to fine-tune our own GPT-like model for the machine translation task. We will adjust an existing open model to more fine-grained domains, such as the low-resource machine translation task, utilizing small-scale supervised data. Compared to using the API, a locally trained model lets us gain a better understanding of how the model works by inspecting the training and inference processes. A rough fine-tuning sketch follows.
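Below is a rough sketch of the fine-tuning idea: adapt an open causal language model to a small parallel corpus by training on "source => target" strings with the Hugging Face Trainer. The gpt2 checkpoint, the data format, and the hyperparameters are placeholders for illustration.

    # Hedged sketch: small-scale supervised fine-tuning of an open GPT-like
    # model on translation pairs. Illustrative configuration only.
    from datasets import Dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    tok = AutoTokenizer.from_pretrained("gpt2")
    tok.pad_token = tok.eos_token            # GPT-2 has no pad token by default
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    pairs = [
        {"text": "English: I like cats. => Japanese: 私は猫が好きです。"},
        {"text": "English: Good morning. => Japanese: おはようございます。"},
    ]
    ds = Dataset.from_list(pairs).map(
        lambda ex: tok(ex["text"], truncation=True, max_length=128),
        remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="mt-finetune",
                               num_train_epochs=3,
                               per_device_train_batch_size=2),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    )
    trainer.train()                          # training and inference can then be inspected locally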

  • Research Products

    (6 results)


All Journal Article (1 result) (of which Peer Reviewed: 1 result, Open Access: 1 result) Presentation (4 results) (of which Int'l Joint Research: 3 results) Patent (Industrial Property Rights) (1 result)

  • [Journal Article] Spatial Hierarchical Attention Network Based Video-guided Machine Translation (2023)

    • Author(s)
      Weiqi Gu, Haiyue Song, Chenhui Chu, Sadao Kurohashi
    • Journal Title

      Journal of Information Processing

      Volume: 31 Pages: -

    • Peer Reviewed / Open Access
  • [Presentation] BERTSeg: BERT Based Unsupervised Subword Segmentation for Neural Machine Translation (2022)

    • Author(s)
      Haiyue Song, Raj Dabre, Zhuoyuan Mao, Chenhui Chu, and Sadao Kurohashi
    • Organizer
      2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing
    • Int'l Joint Research
  • [Presentation] Large Pre-trained Language Models with Multilingual Prompt for Japanese Natural Language Tasks (2022)

    • Author(s)
      Haiyue Song, Raj Dabre, Chenhui Chu and Sadao Kurohashi
    • Organizer
      The 29th Annual Meeting of the Association for Natural Language Processing (NLP2023)
  • [Presentation] When do Contrastive Word Alignments Improve Many-to-many Neural Machine Translation? (2022)

    • Author(s)
      Zhuoyuan Mao, Chenhui Chu, Raj Dabre, Haiyue Song, Zhen Wan, and Sadao Kurohashi
    • Organizer
      Findings of the Association for Computational Linguistics: NAACL 2022
    • Int'l Joint Research
  • [Presentation] Relation Extraction with Weighted Contrastive Pre-training on Distant Supervision (2022)

    • Author(s)
      Zhen Wan, Fei Cheng, Qianying Liu, Zhuoyuan Mao, Haiyue Song, Sadao Kurohashi
    • Organizer
      17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023)
    • Int'l Joint Research
  • [Patent (Industrial Property Rights)] BERTSeg: BERT Based Subword Segmentation (2022)

    • Inventor(s)
      SONG Haiyue (ソウ カイエツ)
    • Industrial Property Rights Holder
      National Institute of Information and Communications Technology (NICT)
    • Industrial Property Rights Type
      Patent
    • Industrial Property Number
      -


Published: 2023-12-25  
