2016 Fiscal Year Research-status Report
Linguistic productivity: rapid extraction of effective analogical clusters and their evaluation in statistical machine translation
Project/Area Number |
15K00317
|
Research Institution | Waseda University |
Principal Investigator |
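LEPAGE YVES Waseda University, Faculty of Science and Engineering (Graduate School and Research Center of Information, Production and Systems), Professor (70573608)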
|
Project Period (FY) |
2015-04-01 – 2018-03-31
|
Keywords | Natural language processing / Artificial intelligence / Data structures / Morphologically rich languages / Chinese and Japanese |
Outline of Annual Research Achievements |
Analogical clusters are series of pairs of words or word sequences such as: `invariable` : `lack of variability` :: `invisible` : `lack of visibility` :: ... :: `intractable` : `lack of tractability`. Given `incredible` : x, x can be deduced automatically with high confidence: x = `lack of credibility`. Given `inedible` : x, `lack of edibility` can be created, but with less confidence, as it is rare in English. Given `in double` : x, the production of the ridiculous word `doubility` should be barred. This research addresses three questions: How can analogical clusters be collected automatically and rapidly from data? How can the reliability of new words or word sequences be estimated? Can such new word sequences be useful in machine translation? For Chinese-Japanese machine translation, redesigning our production pipeline and eliminating unpromising short word sequences made the production of new short sentences in Japanese or Chinese 50 times faster. Results in Chinese-to-Japanese translation using quasi-parallel data produced with analogical clusters have been published in a scientific journal. Analogical clusters have been produced for various languages from various corpora, and a large amount of data has been released on our web site. A method to reduce numerous individual analogical equations to one massive analogical equation has been proposed and implemented. Results on comparing the morphological richness of languages and on predicting new words have been published as workshop and conference papers.
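The deduction of x from a cluster can be illustrated with a minimal Python sketch. It assumes a deliberately simplified view in which a ratio A : B is described by the prefixes and suffixes surrounding the longest common substring of A and B; this is an illustrative stand-in, not the solving algorithm used in the project, and the function names `ratio` and `solve` are hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Describe a : b as (prefix of a, suffix of a, prefix of b, suffix of b),
    taken around their longest common substring (a crude stand-in for a stem)."""
    m = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
        0, len(a), 0, len(b)
    )
    return (a[:m.a], a[m.a + m.size:], b[:m.b], b[m.b + m.size:])


def solve(a, b, c):
    """Solve a : b :: c : x under the affix assumption above.
    Returns None when c does not carry the same affixes as a."""
    pa, sa, pb, sb = ratio(a, b)
    if len(c) < len(pa) + len(sa):
        return None
    if not (c.startswith(pa) and c.endswith(sa)):
        return None
    stem = c[len(pa):len(c) - len(sa)]
    return pb + stem + sb


# All pairs of the cluster share the same simplified ratio.
cluster = [
    ("invariable", "lack of variability"),
    ("invisible", "lack of visibility"),
    ("intractable", "lack of tractability"),
]
print({ratio(a, b) for a, b in cluster})

# Solving `incredible` : x against one pair of the cluster.
print(solve("invariable", "lack of variability", "incredible"))
```

A purely formal solver of this kind has no notion of reliability: it would just as readily build a form from `in double`, which is why estimating the confidence of generated forms is a separate question.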
|
Current Status of Research Progress |
2: Research has progressed on the whole more than it was originally planned.
Reason
The project is now on schedule, and ahead of schedule in some parts. Promising new tracks that were not originally scheduled have also been explored.
Making up for delay: Analogical clusters for all n-grams (from 1 to 6) in 11 languages of the Europarl corpus have been uploaded to our web server (http://lepage-lab.ips.waseda.ac.jp/ > Projects > Kakenhi 15K00317 > Experimental results). This had been delayed in the first year.
Task finished in advance: All the work on applying analogical clusters to improve statistical machine translation, scheduled for the third year, was done during the second year. Very significant improvements in Chinese-Japanese machine translation, obtained with quasi-parallel corpora produced using analogical clusters, have been reported in a scientific journal.
Continuous development and exploration of new tracks: The programs for the automatic extraction of clusters implemented during the first year were improved during the second year. The reduction of numerous individual analogical equations to one massive analogical equation has been implemented on schedule but not yet extensively tested. The automatic production of analogical grids is an additional technique that has been presented in two papers. It was not on our schedule, but it will contribute to our work during the last year on estimating the probabilities of new word sequences.
Third year on schedule: The work on estimating the probability of new sequences of words using analogical clusters or grids is scheduled for the third year. It started during the second year.
|
Strategy for Future Research Activity |
One way to speed up the solving of analogical equations was to reduce analogical clusters to only a few representative ratios. We use the notion of the median string of a set of strings to measure representativeness. This is not a true pre-compilation, but solving a small number of analogical equations has been made much faster. The reduction of numerous individual analogical equations to one massive analogical equation has been implemented on schedule but not yet extensively tested. It uses a special compact representation. These techniques, which partially rely on maximum entropy and on the resolution of assignment problems, will be reported in coming workshops or conferences.
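As an illustration of the median string notion, the following Python sketch computes a set median: the member of a set of strings that minimises the sum of edit distances to all members. Two points are assumptions rather than facts from this report: the use of the Levenshtein distance, and the restriction of the median to members of the set (the report does not say whether a set median or a generalised median is used). The names `levenshtein` and `set_median` are hypothetical.

```python
def levenshtein(s, t):
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(
                prev[j] + 1,               # delete cs
                cur[j - 1] + 1,            # insert ct
                prev[j - 1] + (cs != ct),  # substitute or match
            ))
        prev = cur
    return prev[-1]


def set_median(strings):
    """Return the member with the smallest total distance to all members.
    Ties are broken in favour of the earliest string in the input."""
    return min(strings, key=lambda s: sum(levenshtein(s, t) for t in strings))


words = ["invariable", "invisible", "intractable", "incredible"]
print(set_median(words))
```

Restricting the search to members of the set keeps the cost quadratic in the number of strings, whereas a generalised median over all possible strings is much harder to compute.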
The third year will concentrate on the estimation of probabilities of new words inside analogical clusters or grids. Filling holes in an analogical grid will be safer than generating a word or a word sequence by solving one analogical equation with one analogical cluster, because more information from more clusters will contribute to the generation. This was not announced in the plan, but it is a very promising aid in assigning more exact probability scores to newly generated sequences. We intend to participate in the SIGMORPHON campaign 2017 to test our use of analogical clusters and, if possible, the resolution of massive analogical equations. This also was not on our schedule, but it offers a way to compare our techniques with those of other international research centres.
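The idea that a hole in a grid draws on several analogies at once can be sketched as follows in Python. The sketch assumes a grid stored as a list of rows with `None` marking holes, a toy solver restricted to suffix changes, and a plain vote count over candidates; none of these is the project's actual representation or probability model, and the names `solve_suffix` and `fill_hole` are hypothetical.

```python
from collections import Counter


def solve_suffix(a, b, c):
    """Solve a : b :: c : x when a and b differ only in their suffix."""
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    sa, sb = a[k:], b[k:]
    if not c.endswith(sa):
        return None
    return c[:len(c) - len(sa)] + sb


def fill_hole(grid, i, j):
    """Collect candidates for grid[i][j] from every other row and column.
    Each filled 2x2 sub-grid around the hole yields one analogical equation."""
    votes = Counter()
    for r, row in enumerate(grid):
        if r == i:
            continue
        for col in range(len(row)):
            if col == j:
                continue
            a, b, c = row[col], row[j], grid[i][col]
            if a is None or b is None or c is None:
                continue
            x = solve_suffix(a, b, c)
            if x is not None:
                votes[x] += 1
    return votes


grid = [
    ["walk", "walks", "walked"],
    ["talk", "talks", "talked"],
    ["jump", None, "jumped"],
]
print(fill_hole(grid, 2, 1).most_common(1))
```

In this toy grid, four independent equations agree on the same candidate, whereas a single equation against a single cluster would provide only one piece of evidence.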
|
Causes of Carryover |
The amount to be used in the next fiscal year (56,810) comes from underuse of manpower (regular checks of experiment machines, global server and network; see (2) in the expenditure plan).
|
Expenditure Plan for Carryover Budget |
The carryover will mainly be used for a desktop computer (1), manpower (2) and travel (3). (1) A desktop computer for experiments will be bought. (2) Experiments will be run by 1 PhD student and master students; 1 technician will perform regular inspections of experiment machines, global server and network. (3) One participation (1 person / 1 week) in an international conference is scheduled for the presentation of results.
|
Remarks |
Analogical clusters in 11 languages, for N-grams of different sizes, computed over all N-grams in the same corresponding first thousand lines of the Europarl corpus, version 3.
|
Research Products
(8 results)