2015 Fiscal Year Research-status Report

Language productivity: fast extraction of effective analogical clusters and their evaluation with statistical machine translation

Research Project

Project/Area Number	15K00317
Research Institution	Waseda University
Principal Investigator	LEPAGE YVES Waseda University, Faculty of Science and Engineering, Professor (70573608)
Project Period (FY)	2015-04-01 – 2018-03-31
Keywords	natural language processing / artificial intelligence / data structures / morphologically rich languages
Outline of Annual Research Achievements	Analogical clusters are series of pairs of words or word sequences such as `invariable` : `lack of variability` :: `invisible` : `lack of visibility` :: ... :: `intractable` : `lack of tractability`. Given a new pair in which one member is unknown, such as `incredible` : x, the word sequence x = `lack of credibility` can be created automatically with high confidence. Given `inedible` : x, `lack of edibility` can be created, but with less confidence, as it is rare in English. Given `in double` : x, the production of the ridiculous word `doubility` should be blocked. How can analogical clusters be collected from data? How can the reliability of a new word sequence be estimated? During the first year: 1) the basic core programs were packaged for very fast computation of distances between strings and for the resolution of analogies between strings; 2) analogical clusters were extracted on words and bigrams in 11 languages, including morphologically rich languages. Run times were measured for different versions of the programs, and improvements in speed were obtained. Experiments with the new programs were conducted on Chinese and Japanese. Some results in Chinese-to-Japanese translation using quasi-parallel data produced with analogical clusters have been submitted to a journal. In the coming years, the project will deliver programs that quickly create new words or word sequences and estimate their probability. Producing new word sequences quickly and reliably will make it possible to generate new data artificially so as to improve the quality of machine translation systems.
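The resolution of an analogy such as `invariable` : `lack of variability` :: `incredible` : x can be illustrated with a minimal sketch. It is not the project's program: it assumes the simplest case, in which the first and third terms share a prefix and a suffix and differ only in one middle part, and that this middle part occurs exactly once in the second term. The function name `solve_analogy` is hypothetical.

```python
def solve_analogy(a, b, c):
    """Solve a : b :: c : x in the simple case where a and c differ
    only in one middle part (illustrative assumption, not a general solver)."""
    # Length of the longest common prefix of a and c.
    p = 0
    while p < min(len(a), len(c)) and a[p] == c[p]:
        p += 1
    # Length of the longest common suffix that does not overlap the prefix.
    q = 0
    while q < min(len(a), len(c)) - p and a[len(a) - 1 - q] == c[len(c) - 1 - q]:
        q += 1
    middle_a = a[p:len(a) - q]  # part of a to be replaced
    middle_c = c[p:len(c) - q]  # part of c to put in its place
    # Give up when the part to replace is empty or ambiguous in b.
    if not middle_a or b.count(middle_a) != 1:
        return None
    return b.replace(middle_a, middle_c)


print(solve_analogy("invariable", "lack of variability", "incredible"))
print(solve_analogy("invariable", "lack of variability", "inedible"))
```

Such a sketch produces a candidate but says nothing about its reliability; estimating that reliability is the question the project addresses.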
Current Status of Research Progress	3: Progress in research has been slightly delayed. Reason: We faced a theoretically difficult problem in the extraction of analogical clusters. Extracting all analogical clusters of the largest possible size was found to be equivalent to the problem of extracting all maximal cliques from a graph. Classical algorithms, such as the Bron-Kerbosch algorithm, are too slow for the size of our data. We therefore propose a partial answer to the problem: we greedily extract large cliques so as to cover all vertices in the graph. Analogical clusters were extracted on words and bigrams in 11 languages of the Europarl corpus, including both morphologically rich and less rich languages. Run times were measured for different versions of our programs. Improvements in speed were obtained, but we are still working on accelerating the programs. A speed problem appeared with bigrams in Greek: extraction takes 10 times longer than for the other languages. We are currently trying to identify the reason. Contrary to our expectations, no clear relationship between the time needed to extract analogical clusters and the morphological richness of the language can be shown at the moment. The analogical clusters produced in the 11 languages and in Chinese and Japanese have not yet been uploaded to a web server, but this will be done in the near future.
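The greedy strategy described above can be sketched as follows. This is one possible reading, not the project's implementation: the graph is assumed to be given as a dictionary mapping each vertex to the set of its neighbours (undirected, vertices sortable), and the seeding and growth heuristics (highest degree first, then the candidate that keeps the most other candidates) are assumptions. The function name `greedy_clique_cover` is hypothetical.

```python
def greedy_clique_cover(adj):
    """Greedily extract cliques until every vertex belongs to at least one.

    adj maps each vertex to the set of its neighbours (undirected graph).
    Heuristics are illustrative assumptions; cliques may overlap.
    """
    uncovered = set(adj)
    cliques = []
    while uncovered:
        # Seed with the uncovered vertex of highest degree.
        seed = max(sorted(uncovered), key=lambda v: len(adj[v]))
        clique = {seed}
        # Candidates are adjacent to every vertex already in the clique.
        candidates = set(adj[seed])
        candidates.discard(seed)
        while candidates:
            # Prefer the candidate that keeps the most candidates alive.
            v = max(sorted(candidates), key=lambda u: len(candidates & adj[u]))
            clique.add(v)
            candidates &= adj[v]
            candidates.discard(v)
        cliques.append(clique)
        uncovered -= clique
    return cliques


graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}
print(greedy_clique_cover(graph))
```

Each vertex seeds at most one clique, so the number of extracted cliques is bounded by the number of vertices, whereas the number of maximal cliques of a graph can be exponential in it. The price is that not all maximal cliques are found.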
Strategy for Future Research Activity	Regarding the work planned for the second fiscal year, a study on the pre-compilation of analogical clusters has started in two directions. 1) Programs have been written to reduce analogical clusters to only a few representative ratios. We use the notion of the median string of a set of strings as the measure of representativeness. This is not a true pre-compilation, but solving a small number of analogical equations will be faster than solving all possible analogical equations with all the ratios in an analogical cluster. 2) Preliminary studies have started on pre-compiling a large list of ratios (i.e., an analogical cluster) into a single pattern ratio. E.g., `invisible` : `visibility` :: `incredible` : `credibility` :: … :: `intractable` : `tractability` is represented by `in`.5+/-1.`ble` : ``.5+/-1.`bility`, where 5+/-1 gives the average and the standard deviation of the length of the variable part. During the second year, we will study how to use this representation to generate word sequences efficiently. Regarding the plan for the third fiscal year, a study on generalizing clusters to grids has started. A grid is several analogical clusters put side by side. Filling holes in an analogical grid will be safer than generating a word or a word sequence by solving one analogical equation with one analogical cluster, because information from more clusters will contribute to the generation. This was not announced in the plan, but it should help in the work on assigning more exact probability scores to newly generated sequences.
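The pattern ratio in 2) can be reproduced on the three pairs shown in the example with a minimal sketch. It assumes that the fixed parts of a pattern are simply the longest common prefix and suffix of the words on one side of the cluster, and it uses the sample standard deviation; neither choice is confirmed by the report. The function names are hypothetical, and the elided members of the cluster are not taken into account.

```python
import os
from statistics import mean, stdev


def common_suffix(strings):
    """Longest common suffix, computed as a common prefix of reversed strings."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def summarize_side(words):
    """Return (prefix, suffix, mean length, standard deviation of length)
    of the variable part for one side of an analogical cluster.
    Illustrative assumption: fixed parts are the common prefix and suffix."""
    prefix = os.path.commonprefix(words)
    rest = [w[len(prefix):] for w in words]
    suffix = common_suffix(rest)
    lengths = [len(r) - len(suffix) for r in rest]
    return prefix, suffix, mean(lengths), stdev(lengths)


left = ["invisible", "incredible", "intractable"]
right = ["visibility", "credibility", "tractability"]
print(summarize_side(left))
print(summarize_side(right))
```

On these three pairs the variable parts are `visi`, `credi` and `tracta`, of lengths 4, 5 and 6, which gives the 5+/-1 of the example on both sides.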
Causes of Carryover	The amount to be used in the next fiscal year (420,000) comes from the underuse of manpower (450,000 - 84,000 = 366,000) and from the acquisition of equipment (300,000 - 236,000 = 64,000) during the last fiscal year. A budget of 450,000 yen was planned for (1) running experiments and uploading the data produced to a web server, and (2) checking the network and the experiment machines. Problems were encountered in the development of the programs, and the help of a master's student was required to solve them. Only 84,000 yen were used for this task. The principal investigator ran the major part of the experiments (1) by himself, but the data have not yet been uploaded to the web server. Two experiment machines (236,000 yen) were acquired at a price lower than the total budget assigned for equipment (300,000 yen).
Expenditure Plan for Carryover Budget	The carryover will be spent on the manpower that was not used during the last fiscal year (master's students, a PhD student and a technician). The experiments on European languages and on Chinese-Japanese (1) will be conducted again with more recent and faster programs. The data produced will be uploaded to a web site so that the outcomes of the research are made visible. Checking the network and setting up the experiment machines (2) will be done.