Project/Area Number |
22KJ1843
|
Project/Area Number (Other) |
22J13719 (2022)
|
Research Category |
Grant-in-Aid for JSPS Fellows
|
Allocation Type | Multi-year Fund (2023) / Single-year Grants (2022) |
Section | Domestic |
Review Section |
Basic Section 61030:Intelligent informatics-related
|
Research Institution | Kyoto University |
Principal Investigator |
毛 卓遠, Kyoto University, Graduate School of Informatics, JSPS Research Fellow (DC2)
|
Project Period (FY) |
2023-03-08 – 2024-03-31
|
Project Status |
Completed (Fiscal Year 2023)
|
Budget Amount |
¥1,700,000 (Direct Cost: ¥1,700,000)
Fiscal Year 2023: ¥800,000 (Direct Cost: ¥800,000)
Fiscal Year 2022: ¥900,000 (Direct Cost: ¥900,000)
|
Keywords | low-resource translation / sentence embedding / multilingual translation / multilingual embedding / model efficiency |
Outline of Research at the Start |
With the progress of globalization, the demand for automatic multilingual language understanding and translation is increasing dramatically in many scenarios. We aim to tackle the technical barriers in low-resource machine translation and to design a robust multilingual translation system that supports a large number of languages, including several low-resource languages (i.e., languages for which we do not have sufficient data resources to train translation models).
|
Outline of Annual Research Achievements |
In the last fiscal year, we developed a state-of-the-art lightweight sentence embedding model, LEALLA. With this pre-trained sentence-level semantic model, new parallel corpora can be constructed more efficiently (a minimal illustration of such embedding-based mining is sketched below). We also analyzed the Transformer architecture for low-resource translation and published a paper at a top conference. Finally, we compiled all of this work into a thesis. Overall, this research is a comprehensive exploration of multilingual representation learning, especially for low-resource translation, addressing three identified challenges in this domain: (1) To address the high computational demand that accompanies expanding the language coverage of multilingual models, we proposed an efficient and effective multilingual sentence embedding (MSE) model and introduced a new knowledge distillation method for training lightweight MSE models. (2) To tackle the data scarcity of low-resource languages, we proposed new pre-training objectives for low-resource NMT, introduced word-level contrastive learning for low-resource NMT utilizing statistical word alignments, and introduced AlignInstruct to enhance translation accuracy in low-resource languages for large language models. (3) To address the limitations of the Transformer architecture for zero-shot NMT, we proposed a new Transformer architecture that constructs interlingual representations on top of the Transformer encoder, and we comprehensively examined the effects of layer normalization in zero-shot NMT.
|
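Note: the following Python sketch illustrates, under stated assumptions, how a lightweight multilingual sentence embedding model of the kind described above can be used to mine parallel sentence pairs by cosine similarity. The encode function is a hypothetical placeholder for any encoder that maps a list of sentences to one embedding vector per sentence; it is not LEALLA's actual interface, and the greedy thresholding shown here is only one simple mining strategy (margin-based scoring is a common refinement).

import numpy as np

def cosine_similarity_matrix(src_emb: np.ndarray, tgt_emb: np.ndarray) -> np.ndarray:
    # Pairwise cosine similarity between two (n, d) and (m, d) embedding matrices.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    return src @ tgt.T

def mine_pairs(src_sents, tgt_sents, encode, threshold=0.8):
    # Greedily align each source sentence to its best-scoring target sentence,
    # keeping only pairs whose cosine similarity exceeds the threshold.
    # `encode` is any user-supplied function returning an (n, d) numpy array
    # of sentence embeddings (a hypothetical stand-in for a multilingual
    # sentence embedding model).
    sims = cosine_similarity_matrix(encode(src_sents), encode(tgt_sents))
    pairs = []
    for i, row in enumerate(sims):
        j = int(np.argmax(row))
        if row[j] >= threshold:
            pairs.append((src_sents[i], tgt_sents[j], float(row[j])))
    return pairs

# Example usage (with a user-supplied encoder):
# pairs = mine_pairs(english_sentences, swahili_sentences, my_encoder_fn)

The design choice here is deliberately minimal: normalized dot products and a fixed threshold keep the example self-contained, while in practice corpus-level mining typically adds margin-based scoring and nearest-neighbor search to scale to large monolingual collections.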