2018 Fiscal Year Research-status Report
Self-explainable and fast-to-train example-based machine translation using neural networks
Project/Area Number | 18K11447 |
Research Institution | Waseda University |
Principal Investigator | LEPAGE Yves, Waseda University, Faculty of Science and Engineering (Graduate School of Information, Production and Systems), Professor (70573608) |
Project Period (FY) | 2018-04-01 – 2021-03-31 |
Keywords | natural language / machine translation / case-based reasoning / analogy / explainable AI |
Outline of Annual Research Achievements
Two example-based machine translation (EBMT) systems were implemented. The first EBMT system follows the direct approach; numerical approaches were introduced for adaptation and retrieval (1 paper at an international conference). The second EBMT system follows the indirect approach (1 paper at an international conference). Several versions of the systems were implemented, and a better formalisation has been proposed (1 paper submitted to an international conference).

Data were collected. The BTEC corpus proved too expensive, but contacts with ATR were established for further enquiry; in experiments, data from the Tatoeba corpus were used instead. Very large word embeddings (continuous vector representations of words) in several languages were downloaded and cleaned up.

Experiments were conducted. For comparison, an SMT system and an NMT system were built, and all systems were tested on the same data to assess their respective translation accuracies. A GPU machine, a DeepLearning Box, was acquired to run the experiments.

Work on analogy: the use of neural networks has been proposed for formal analogies, and experiments were conducted with the first EBMT system (1 paper at an international conference). An algorithm has been proposed for semantico-formal analogies between sentences using vector representations of words; the produced data have been publicly released (1 paper at an international conference in the next fiscal year). Formal transformations which preserve analogies have been studied (1 paper at an international conference, best paper award).
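The semantic side of analogies between words is commonly handled with the vector-offset method over word embeddings: solve A : B :: C : x by searching for the word whose vector is closest to B - A + C. The following is a minimal sketch of that standard technique, with toy illustrative vectors rather than the project's actual embeddings:

```python
import numpy as np

# Toy word vectors (illustrative values only; real experiments would use
# large pre-trained multilingual embeddings).
vectors = {
    "man":   np.array([1.0, 0.0, 0.1]),
    "woman": np.array([1.0, 1.0, 0.1]),
    "king":  np.array([0.2, 0.0, 1.0]),
    "queen": np.array([0.2, 1.0, 1.0]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def solve_analogy(a, b, c, vectors):
    """Solve a : b :: c : x by the vector-offset method:
    x = argmax over the vocabulary of cosine(v_x, v_b - v_a + v_c),
    excluding the three given terms."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(solve_analogy("man", "woman", "king", vectors))  # -> queen
```

Extending this from single words to sentences (sequences of word vectors, possibly softly aligned) is precisely the harder problem the project addresses.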
Current Status of Research Progress
2: Research has progressed on the whole more than it was originally planned.
Reason
The plan has largely been kept. In the 1st year, as planned, systems were built, data were collected, and experiments were performed to compare translation accuracy. In addition, theoretical work was conducted on the formalisation of the EBMT system as a case-based reasoning system and on the formalisation of analogy. As for data, the acquisition of the BTEC corpus was cancelled because it proved too expensive.

Work delayed: although initially planned for the 1st fiscal year, work on the segmentation of parallel corpora with soft sub-sentential alignment has been postponed to the 2nd fiscal year.

Work done in advance of schedule: (1) Although initially planned for the 2nd fiscal year, work has begun on solving analogies between soft alignments of sequences of continuous representations of words; it will continue during the 2nd fiscal year. The project is also ahead on the resolution of semantico-formal analogies between sentences, which was scheduled for the 2nd year: a set of analogies between sentences extracted from Tatoeba has been produced and released on our global server. (2) Although initially planned for the 3rd fiscal year, work on self-explanation of translation has already begun. It consists in tracing the execution of the translation process to explain the choices made by the system when translating. Console traces and interfaces have been implemented; further work is needed to make them shorter, more user-friendly, and more understandable.
Strategy for Future Research Activity
As announced in the project plan, the main theoretical topic of research for the 2nd fiscal year will be the resolution of analogies between soft representations of sentences using neural networks. This will be integrated into the EBMT systems. Experiments will be performed with bilingual analogies for the direct approach to EBMT, and with monolingual analogies for the indirect approach. Monolingual and bilingual alignments and analogies will be used in the final translation system, which requires merging the direct and indirect approaches.

A formalisation of EBMT as the local enrichment of a translation memory by the application of linguistic variations has been proposed. Linguistic variations in the source language and in the target language are extracted from analogical clusters. The selection of the most effective variations is still an open problem, and the use of penalties to assess the quality of newly created sentence pairs resulting from the enrichment of the case base remains to be evaluated. To merge the direct and indirect approaches to EBMT, how to associate variations in the source language with variations in the target language (i.e., analogical clusters) will be studied.

Work on self-explanation will continue. Interfaces for visualising the application of variations and the enrichment of the translation memory will be improved; the existing explanations need to be shorter and more easily understandable by a standard user.
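For formal analogies between character strings, a well-known necessary (but not sufficient) condition can serve as a cheap filter when checking candidate analogies such as those in analogical clusters: in a : b :: c : d, every character must occur as many times in a and d together as in b and c together. A minimal illustrative sketch of this filter (not the project's actual implementation):

```python
from collections import Counter

def may_be_formal_analogy(a, b, c, d):
    """Necessary (not sufficient) condition for the formal analogy
    a : b :: c : d between character strings: for every character x,
    |a|_x + |d|_x == |b|_x + |c|_x, where |s|_x is the number of
    occurrences of x in s."""
    return Counter(a) + Counter(d) == Counter(b) + Counter(c)

# The condition holds for a genuine analogy ...
print(may_be_formal_analogy("walk", "walked", "talk", "talked"))  # -> True
# ... and rules out a non-analogy.
print(may_be_formal_analogy("walk", "walked", "talk", "spoke"))   # -> False
```

Strings passing the filter still require a full verification (or resolution) procedure; the filter only discards impossible candidates early.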
Causes of Carryover
The carried-over budget will be spent on participation in an international conference in Europe.
|
Research Products (5 results)