2019 Fiscal Year Research-status Report
Self-explainable and fast-to-train example-based machine translation using neural networks
Project/Area Number | 18K11447 |
Research Institution | Waseda University |
Principal Investigator | LEPAGE YVES, Waseda University, Faculty of Science and Engineering (Graduate School of Information, Production and Systems), Professor (70573608) |
Project Period (FY) | 2018-04-01 – 2021-03-31 |
Keywords | analogy / machine translation |
Outline of Annual Research Achievements
After working on the direct approach in the first year, work on the indirect approach in the example-based machine translation (EBMT) system was performed in the second fiscal year. A system was implemented. Numerical approaches were introduced in adaptation and retrieval (1 paper at an international conference). In addition, it was studied how to merge the direct and the indirect approaches in EBMT by analogy. A model was proposed; it has not yet been integrated into the final EBMT system. It exploits vector representations of words for monolingual comparison (results from neural NLP) and sub-sentential alignment for bilingual comparison (results from SMT) (1 paper at a national conference, accepted, to be published in fiscal year 2020). Also, work on sentence representations for retrieval and similarity computation started. Data was collected: because we could not acquire the BTEC corpus, we used data from the Tatoeba corpus. A method to produce semantico-formal analogies between sentences was proposed (1 paper at an international conference). The dataset was publicly released. Preliminary experiments on matrix representations of sentences and the resolution of analogies between such representations were conducted; no paper has been published yet. Experiments on improving bilingual word embedding mapping were also conducted (1 paper published at an international conference). To run experiments, we could not buy another DeepLearning Box as planned because prices went up. Instead, one graphics card (GPU) was added to the DeepLearning Box already acquired in fiscal year 2018.
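EBMT by analogy solves analogical equations of the form A : B :: C : D. As a minimal illustration of the word-vector side of the model above, the offset method below resolves such an equation in embedding space. The `vocab` dictionary and its unit-norm toy vectors are illustrative assumptions, not the project's actual embeddings or resolution algorithm.

```python
import numpy as np

def solve_analogy(a, b, c, vocab):
    """Solve a : b :: c : d in embedding space via the offset d ~ b - a + c.
    vocab maps words to unit-norm embedding vectors (illustrative toy data)."""
    target = vocab[b] - vocab[a] + vocab[c]
    target /= np.linalg.norm(target)
    # Pick the vocabulary word closest to the target by cosine similarity,
    # excluding the three words that form the equation.
    best, best_sim = None, -1.0
    for w, v in vocab.items():
        if w in (a, b, c):
            continue
        sim = float(np.dot(target, v))
        if sim > best_sim:
            best, best_sim = w, sim
    return best
```

With a toy vocabulary in which the fourth vector is exactly the offset of the first three, the function recovers the expected fourth term.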
Current Status of Research Progress
2: Research has progressed on the whole more than it was originally planned.
Reason
The plan has basically been kept. Work planned for the 2nd year was performed as scheduled: (1) The use of (a) word vector representations, coming from neural NLP, and of (b) sub-sentential alignment, coming from statistical machine translation, was adopted for the monolingual and the bilingual cases respectively. The correspondence between sentences is represented by similarity matrices. The use of sub-sentential alignment and bilingual word embedding mapping was compared in an experiment. (2) In order to go from formal and crisp analogies to softer and more semantically relevant analogies between sentences, a method to solve semantico-formal analogies between sentences was designed. A resource of semantico-formal analogies in English was produced automatically and publicly released. Work for the 3rd year was initiated: (1) The study of representations of sentences themselves, by matrices of interpolated points in a word embedding space (an original approach) or by direct sentence embeddings, started and is continuing. (2) A set of bilingual analogies between sentences extracted from the Tatoeba corpus has been produced; this dataset will be released. Some work was delayed: the work on self-explanation of translations was initially planned for the 3rd fiscal year. It was initiated in the 1st year but suspended in the 2nd year; it will resume in the 3rd fiscal year. In addition, integrating the resolution of soft analogies into the EBMT system has been slightly delayed.
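The similarity-matrix representation of the correspondence between two sentences can be sketched as follows. The unit-norm word vectors in `vocab` are a stand-in assumption for the embeddings actually used in the project, and the function name is hypothetical.

```python
import numpy as np

def similarity_matrix(sent_a, sent_b, vocab):
    """Word-to-word cosine similarity matrix between two sentences.
    vocab maps each word to a unit-norm embedding vector (illustrative).
    Cell [i, j] scores how well word i of sent_a matches word j of sent_b."""
    A = np.stack([vocab[w] for w in sent_a])  # |sent_a| x dim
    B = np.stack([vocab[w] for w in sent_b])  # |sent_b| x dim
    return A @ B.T  # dot products equal cosines since vectors are unit-norm
```

High cells in the matrix indicate word correspondences, which is what makes this representation usable both monolingually (with word embeddings) and bilingually (with sub-sentential alignment or mapped embeddings).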
Strategy for Future Research Activity
During the 3rd year, work on the prototype system will continue. The self-explanation functionality for tracing the recursive translation of fragments of sentences was addressed in the 2nd year in the model proposed for the indirect approach to example-based machine translation. A first interface has been designed. However, work on the visualisation of the traces is still needed, because traces need to be shorter and more readable for the user. The explanation of how similar retrieved sentences match the sentence to be translated also needs to be inspected. One of the main tasks will be to conduct experiments measuring to what extent crisp vs. soft comparison of words is more efficient in the translation of shorter vs. longer sentences using denser vs. sparser corpora. For that, data should be prepared. Work on sentence representations and on representations of the correspondence between sentences will continue. Training times will also be measured and compared with training times in the neural approach to machine translation. Work will be conducted on the retrieval of similar sentences, a necessary component in an example-based machine translation system. The use of vector representations of sentences with cosine similarity will be compared with more classical methods using suffix arrays and pattern-matching techniques. Work on self-explanation will also resume: interfaces for the visualisation of traces will be improved, and the existing explanations need to be made shorter and more easily understandable by a standard user.
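The embedding-based side of the planned retrieval comparison could start from a baseline like the one below: nearest-neighbour search over precomputed sentence embeddings by cosine similarity. The embeddings are assumed to be given, and the function name is hypothetical; the suffix-array alternative is not sketched here.

```python
import numpy as np

def retrieve_most_similar(query_vec, example_vecs):
    """Index of the stored example whose embedding is closest to the query
    by cosine similarity. example_vecs: (n_examples x dim) matrix of
    sentence embeddings (assumed precomputed)."""
    q = query_vec / np.linalg.norm(query_vec)
    E = example_vecs / np.linalg.norm(example_vecs, axis=1, keepdims=True)
    sims = E @ q  # cosine similarity of the query with every example
    best = int(np.argmax(sims))
    return best, float(sims[best])
```

Such a dense baseline returns graded similarity scores, whereas suffix-array retrieval returns exact longest common substrings; the planned experiments would compare the two on retrieval quality and speed.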
Causes of Carryover
The carried-over funds pay a research assistant (研究補助者) for weekly work and cover participation in international conferences.
|
Research Products
(4 results)