
Multilingual corpus construction and domain adaptation for low-resource machine translation

Research Project

Project/Area Number 22KJ1724
Project/Area Number (Other) 21J23124 (2021-2022)
Research Category

Grant-in-Aid for JSPS Fellows

Allocation Type Multi-year Fund (2023)
Single-year Grants (2021-2022)
Section Domestic
Review Section Basic Section 61030:Intelligent informatics-related
Research Institution National Institute of Information and Communications Technology (2023)
Kyoto University (2021-2022)

Principal Investigator

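Song Haiyue (宋 海越), National Institute of Information and Communications Technology, Universal Communication Research Institute, Advanced Speech Translation Research and Development Promotion Center, Advanced Translation Technology Laboratory, Technical Researcher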

Project Period (FY) 2023-03-08 – 2024-03-31
Project Status Completed (Fiscal Year 2023)
Budget Amount
¥2,200,000 (Direct Cost: ¥2,200,000)
Fiscal Year 2023: ¥700,000 (Direct Cost: ¥700,000)
Fiscal Year 2022: ¥700,000 (Direct Cost: ¥700,000)
Fiscal Year 2021: ¥800,000 (Direct Cost: ¥800,000)
Keywords machine translation / low-resource languages / subword segmentation / subword encoding / decoding algorithm / corpora creation / ChatGPT / Machine translation / Parallel corpus creation / Pre-training / Data selection
Outline of Research at the Start

We focus on improving neural machine translation quality by leveraging large language models such as ChatGPT (currently GPT-4). We will first evaluate GPT-4 on the translation task and identify its main problems. We will then improve the GPT-4-based method through better prompts, for example by providing similar examples. We also plan to fine-tune our own GPT model for machine translation, based on open-source models such as LLaMA.
In addition, we will continue to apply better subword segmentation in the neural machine translation model.
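
The idea of improving prompts by providing similar examples can be sketched as follows. This is a minimal illustration only, not the project's actual method: the function names, the prompt wording, the toy example pairs, and the use of character-level similarity from Python's standard `difflib` are all assumptions made for this sketch, and the call to a language model is deliberately left out.

```python
from difflib import SequenceMatcher


def select_similar_examples(source, example_pairs, k=2):
    """Return the k (source, target) pairs whose source side is most similar to `source`.

    Similarity is a simple character-level ratio (an assumption for this sketch;
    a real system might use embeddings or n-gram retrieval instead).
    """
    def similarity(pair):
        return SequenceMatcher(None, source.lower(), pair[0].lower()).ratio()

    return sorted(example_pairs, key=similarity, reverse=True)[:k]


def build_prompt(source, example_pairs, src_lang="English", tgt_lang="Japanese", k=2):
    """Build a few-shot translation prompt from the k most similar examples."""
    lines = [f"Translate the following {src_lang} text into {tgt_lang}."]
    for src, tgt in select_similar_examples(source, example_pairs, k):
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
    # The final pair is left open for the model to complete.
    lines.append(f"{src_lang}: {source}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)


# Hypothetical toy translation memory, used only for illustration.
EXAMPLES = [
    ("The exam starts at nine.", "試験は9時に始まります。"),
    ("Where is the station?", "駅はどこですか。"),
    ("The exam ends at noon.", "試験は正午に終わります。"),
]

if __name__ == "__main__":
    print(build_prompt("The exam starts at ten.", EXAMPLES))
```

The resulting string would be sent to the language model as the prompt; retrieving examples close to the input sentence is one common way to give the model in-domain context.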

Outline of Annual Research Achievements

Our research focused on improving machine translation in low-resource scenarios, such as translation between Asian languages and English and translation in specific domains such as education. To this end, we proposed to 1) create bilingual corpora for the low-resource domain, mainly in the first year, and 2) optimize subword segmentation during the encoding phase in the second year and during the decoding phase in the final year.
In terms of publications, 3 first-authored journal papers and 1 conference paper were published or submitted during the last year. Over the three years, the project produced 4 journal papers and 9 international conference papers, including co-authored papers. In addition, one patent application is in progress.
This research significantly improved translation quality in low-resource scenarios: in our experiments, BLEU scores improved by more than 3 points.
Low-resource translation systems are indispensable for cross-cultural communication at international events such as EXPO 2025. Our approach makes such systems more practical for participants who speak low-resource languages.

Report

(3 results)
  • 2023 Annual Research Report
  • 2022 Annual Research Report
  • 2021 Annual Research Report
  • Research Products

    (24 results)


All Int'l Joint Research (1 results) Journal Article (3 results) (of which Peer Reviewed: 3 results,  Open Access: 3 results) Presentation (16 results) (of which Int'l Joint Research: 12 results) Remarks (3 results) Patent(Industrial Property Rights) (1 results)

  • [Int'l Joint Research] University of Cape Town (South Africa)

    • Related Report
      2023 Annual Research Report
  • [Journal Article] DiverSeg: Leveraging Diverse Segmentations with Cross-granularity Alignment for Neural Machine Translation (2024)

    • Author(s)
      Song Haiyue, Mao Zhuoyuan, Dabre Raj, Chu Chenhui, Kurohashi Sadao
    • Journal Title

      Journal of Natural Language Processing

      Volume: 31 Issue: 1 Pages: 155-188

    • DOI

      10.5715/jnlp.31.155

    • ISSN
      1340-7619, 2185-8314
    • Related Report
      2023 Annual Research Report
    • Peer Reviewed / Open Access
  • [Journal Article] SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation (2023)

    • Author(s)
      Song Haiyue, Dabre Raj, Chu Chenhui, Kurohashi Sadao, Sumita Eiichiro
    • Journal Title

      ACM Transactions on Asian and Low-Resource Language Information Processing

      Volume: 22 Issue: 8 Pages: 1-24

    • DOI

      10.1145/3610611

    • Related Report
      2023 Annual Research Report
    • Peer Reviewed / Open Access
  • [Journal Article] Spatial Hierarchical Attention Network Based Video-guided Machine Translation (2023)

    • Author(s)
      Weiqi Gu, Haiyue Song, Chenhui Chu, Sadao Kurohashi
    • Journal Title

      Journal of Information Processing

      Volume: 31

    • Related Report
      2022 Annual Research Report
    • Peer Reviewed / Open Access
  • [Presentation] SubMerge: Merging Equivalent Subword Tokenizations for Subword Regularized Models in Neural Machine Translation (2024)

    • Author(s)
      Haiyue Song, Francois Meyer, Raj Dabre, Hideki Tanaka, Chenhui Chu, and Sadao Kurohashi.
    • Organizer
      The 25th Annual Conference of the European Association for Machine Translation (EAMT 2024)
    • Related Report
      2023 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Linguistically Motivated Neural Machine Translation (2024)

    • Author(s)
      Haiyue Song, Hour Kaing, and Raj Dabre.
    • Organizer
      The 25th Annual Conference of the European Association for Machine Translation (EAMT 2024)
    • Related Report
      2023 Annual Research Report
    • Int'l Joint Research
  • [Presentation] NGLUEni: Benchmarking and Adapting Pretrained Language Models for Nguni Languages (2024)

    • Author(s)
      Francois Meyer, Haiyue Song, Abhisek Chakrabarty, Jan Buys, Raj Dabre and Hideki Tanaka.
    • Organizer
      The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
    • Related Report
      2023 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Enhancing Personality Recognition in Dialogue by Data Augmentation and Heterogeneous Conversational Graph Networks (2024)

    • Author(s)
      Yahui Fu, Haiyue Song, Tianyu Zhao, Tatsuya Kawahara.
    • Organizer
      The 14th International Workshop on Spoken Dialogue Systems Technology (IWSDS2024)
    • Related Report
      2023 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Robust Neural Machine Translation for Abugidas by Glyph Perturbation (2024)

    • Author(s)
      Hour Kaing, Chenchen Ding, Haiyue Song, Jiannan Mao, Hideki Tanaka, and Masao Utiyama.
    • Organizer
      The 30th Annual Meeting of the Association for Natural Language Processing
    • Related Report
      2023 Annual Research Report
  • [Presentation] GPT-RE: In-context Learning for Relation Extraction using Large Language Models (2023)

    • Author(s)
      Zhen Wan, Fei Cheng, Zhuoyuan Mao, Qianying Liu, Haiyue Song, Jiwei Li, Sadao Kurohashi.
    • Organizer
      The 2023 Conference on Empirical Methods in Natural Language Processing
    • Related Report
      2023 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Exploring the Impact of Layer Normalization for Zero-shot Neural Machine Translation (2023)

    • Author(s)
      Zhuoyuan Mao, Raj Dabre, Qianying Liu, Haiyue Song, Chenhui Chu, and Sadao Kurohashi.
    • Organizer
      The 61st Annual Meeting of the Association for Computational Linguistics
    • Related Report
      2023 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Variable-length Neural Interlingua Representations for Zero-shot Neural Machine Translation (2023)

    • Author(s)
      Zhuoyuan Mao, Haiyue Song, Raj Dabre, Chenhui Chu, Sadao Kurohashi.
    • Organizer
      Proceedings of the 1st International Workshop on Multilingual, Multimodal and Multitask Language Generation (Multi3Generation) held in conjunction with EAMT2023.
    • Related Report
      2023 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Relation Extraction with Weighted Contrastive Pre-training on Distant Supervision (2023)

    • Author(s)
      Zhen Wan, Fei Cheng, Qianying Liu, Zhuoyuan Mao, Haiyue Song and Sadao Kurohashi.
    • Organizer
      The 17th Conference of the European Chapter of the Association for Computational Linguistics
    • Related Report
      2023 Annual Research Report
    • Int'l Joint Research
  • [Presentation] BERTSeg: BERT Based Unsupervised Subword Segmentation for Neural Machine Translation (2022)

    • Author(s)
      Haiyue Song, Raj Dabre, Zhuoyuan Mao, Chenhui Chu, and Sadao Kurohashi
    • Organizer
      2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing
    • Related Report
      2022 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Large Pre-trained Language Models with Multilingual Prompt for Japanese Natural Language Tasks (2022)

    • Author(s)
      Haiyue Song, Raj Dabre, Chenhui Chu and Sadao Kurohashi
    • Organizer
      The 29th Annual Meeting of the Association for Natural Language Processing
    • Related Report
      2022 Annual Research Report
  • [Presentation] When do Contrastive Word Alignments Improve Many-to-many Neural Machine Translation? (2022)

    • Author(s)
      Zhuoyuan Mao, Chenhui Chu, Raj Dabre, Haiyue Song, Zhen Wan, and Sadao Kurohashi
    • Organizer
      Findings of the Association for Computational Linguistics: NAACL 2022
    • Related Report
      2022 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Relation Extraction with Weighted Contrastive Pre-training on Distant Supervision (2022)

    • Author(s)
      Zhen Wan, Fei Cheng, Qianying Liu, Zhuoyuan Mao, Haiyue Song, Sadao Kurohashi
    • Organizer
      17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023)
    • Related Report
      2022 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Representative Data Selection for Sequence-to-Sequence Pre-training (2022)

    • Author(s)
      Haiyue Song, Raj Dabre, Zhuoyuan Mao, Chenhui Chu, Sadao Kurohashi
    • Organizer
      The 28th Annual Meeting of the Association for Natural Language Processing
    • Related Report
      2021 Annual Research Report
  • [Presentation] Improving Medical Relation Extraction with Distantly Supervised Pre-training (2022)

    • Author(s)
      Zhen Wan, Fei Cheng, Qianying Liu, Zhuoyuan Mao, Haiyue Song, Sadao Kurohashi
    • Organizer
      The 28th Annual Meeting of the Association for Natural Language Processing
    • Related Report
      2021 Annual Research Report
  • [Presentation] Video-guided Machine Translation with Spatial Hierarchical Attention Network (2021)

    • Author(s)
      Weiqi Gu, Haiyue Song, Chenhui Chu, Sadao Kurohashi
    • Organizer
      ACL-IJCNLP 2021 Student Research Workshop
    • Related Report
      2021 Annual Research Report
    • Int'l Joint Research
  • [Remarks] Haiyue Song's Homepage

    • URL

      https://shyyhs.github.io/

    • Related Report
      2023 Annual Research Report
  • [Remarks] Language Media Laboratory (言語メディア研究室): List of Research Presentations

    • URL

      https://nlp.ist.i.kyoto-u.ac.jp/?%E7%A0%94%E7%A9%B6%E7%99%BA%E8%A1%A8%E4%B8%80%E8%A6%A7

    • Related Report
      2023 Annual Research Report
  • [Remarks] Advanced Translation Technology Laboratory (先進的翻訳技術研究室): Papers

    • URL

      https://att-astrec.nict.go.jp/result/

    • Related Report
      2023 Annual Research Report
  • [Patent(Industrial Property Rights)] BERTSeg: BERT Based Subword Segmentation (2022)

    • Inventor(s)
      Song Haiyue (ソウ カイエツ)
    • Industrial Property Rights Holder
      National Institute of Information and Communications Technology
    • Industrial Property Rights Type
      Patent
    • Filing Date
      2022
    • Related Report
      2022 Annual Research Report

Published: 2021-05-27   Modified: 2024-12-25  
