2021 Fiscal Year Research-status Report
Theoretically founded algorithms for the automatic production of analogy tests in NLP
Project/Area Number |
21K12038
|
Research Institution | Waseda University |
Principal Investigator |
LEPAGE YVES Waseda University, Faculty of Science and Engineering (Graduate School of Information, Production and Systems), Professor (70573608)
|
Project Period (FY) |
2021-04-01 – 2024-03-31
|
Keywords | natural language processing / embedding representations / analogical relations |
Outline of Annual Research Achievements |
During the first year, work on tasks (a) to (c) announced in the plan has been pursued in parallel.
(a) Casting string edit distances into vector spaces. The literature on 1-D clustering and multidimensional scaling was reviewed. Metrics for measuring the difference between analogical grids (e.g., the mutual information score) were studied and implemented.
(b) Adapting existing algorithms for arithmetic analogy on integer values to real values. The distribution of values in word embedding models was studied. Finding: the distribution on each dimension is Gaussian. This poses a problem: no clustering method can be applied to separate values on one dimension. Correlations between dimensions in word embedding models were also studied. Finding: some dimensions are correlated within subspaces, which allows some dimensionality reduction.
(c) Parallelising existing algorithms. The numerical library numpy was introduced into existing programs. A master's student was hired in August and September. Result: a speed-up in the retrieval of analogical clusters.
Work on the extraction of all analogies from a word space was started. As many semantic phenomena are realised formally in language (e.g., the male/female opposition is expressed by the suffixes -er/-ress), the approach starts with regular patterns such as waiter : waitress :: mister : mistress and extends to vector representations to capture irregular patterns such as king : queen. First experiments were run locally. A study on the retrieval of all possible formal analogies between sentences at different granularities has been conducted. Only analogies on the formal level were retrieved, but a journal paper has been published.
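The move from integer to real values in task (b) can be illustrated with a minimal sketch. It assumes the usual arithmetic reading of analogy, a : b :: c : d if and only if a - b = c - d on every component, and it is not taken from the project's programs. The function names and the tolerance parameter `tol` are hypothetical. With integers the equality can be tested exactly; with real values (for instance, word embedding components) rounding makes exact equality fragile, so a tolerance is introduced. numpy is used so that many quadruples can be checked in one call.

```python
import numpy as np

def solve_analogy(a, b, c):
    """Solve a : b :: c : x under the arithmetic reading a - b = c - x."""
    a, b, c = (np.asarray(v, dtype=float) for v in (a, b, c))
    return c - a + b

def is_analogy(a, b, c, d, tol=0.0):
    """Check a : b :: c : d on every component.

    tol=0.0 is the exact test suited to integer values; a small positive
    tol (an assumption of this sketch) is needed for real values.
    Inputs may be single vectors or arrays of shape (n, dim); the check
    is made along the last axis, giving one answer per quadruple.
    """
    a, b, c, d = (np.asarray(v, dtype=float) for v in (a, b, c, d))
    return np.all(np.abs((a - b) - (c - d)) <= tol, axis=-1)

# Integer case: exact test.
print(is_analogy([1, 2], [3, 5], [10, 20], [12, 23]))

# Real-valued case: the exact test may fail because of rounding,
# so a tolerance is passed.
print(is_analogy([0.1], [0.3], [1.0], [1.2], tol=1e-9))

# Batch of two quadruples checked in one vectorised call.
A = [[1, 2], [0, 0]]
B = [[3, 5], [1, 1]]
C = [[10, 20], [5, 5]]
D = [[12, 23], [6, 7]]
print(is_analogy(A, B, C, D))
```

How to choose the tolerance for real embedding values is left open here; the sketch only shows where such a choice enters.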
|
Current Status of Research Progress |
4: Progress in research has been delayed.
Reason
Tasks (a) to (c) are experiencing theoretical or experimental difficulties.
(a) No satisfactory way of casting edit distances into vector spaces has been found; negative results have even been obtained. These negative results still need to be proved, and the proof should justify turning to empirical recipes.
(b) The distribution of values on each dimension of vectors in word embedding spaces has been found to be Gaussian. Hence, no satisfactory way of distributing values into separate bins has been found. A new approach should be attempted, which requires modifying existing C programs. This is difficult engineering work.
(c) In the parallelisation task, introducing numpy in more places in existing programs led to a dead end. No further speed-up was obtained.
The project is experiencing delays, for two reasons: 1. The principal investigator has been unexpectedly appointed chair of the field in his school, so the percentage of effort that can be allocated to research has decreased. 2. A master's student who was supposed to work with the principal investigator on the problem of converting programs on integer values to real values has not been able to work during the first and second semesters and has taken a leave of absence for personal reasons for the coming semesters.
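The binning difficulty reported under (b) can be made concrete with a small sketch. It is a generic recipe and not the approach planned in the project: a Gaussian is fitted to the values of one dimension and cut at its quantiles, giving equal-probability bins. Because a Gaussian has a single mode and no gaps, the cut points fall in densely populated regions, so values very close to each other can land in different bins. The function names and the sample values are hypothetical; only the Python standard library is used.

```python
from bisect import bisect_right
from statistics import NormalDist

def gaussian_bin_edges(values, n_bins):
    """Fit a Gaussian to the values of one dimension and return the
    n_bins - 1 cut points that split it into equal-probability bins."""
    fitted = NormalDist.from_samples(values)
    return fitted.quantiles(n=n_bins)

def bin_index(value, edges):
    """Return the index (0 .. len(edges)) of the bin containing value."""
    return bisect_right(edges, value)

# Hypothetical values observed on one embedding dimension.
values = [-2.0, -1.0, 0.0, 1.0, 2.0]
edges = gaussian_bin_edges(values, n_bins=4)

# Two nearly identical values around the central cut point end up in
# different bins, although nothing in the data separates them.
print(bin_index(-0.01, edges), bin_index(0.01, edges))
```

Any other choice of cut points on a unimodal distribution has the same property, which is one way to read the remark that values cannot be separated into bins in a satisfactory way.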
|
Strategy for Future Research Activity |
In the second fiscal year, work on tasks (a) and (b) should continue. The infrastructure for conducting experiments on the extraction of portions or the entirety of an embedding space should be established. The conversion of programs working on integer values to real values, and a framework for measuring results, will be implemented and tested with existing analogy test sets. Preparations are under way for the parallelisation of the tools. The implementation using GPUs should be tested at the end of the second fiscal year.
|
Causes of Carryover |
The budget in the next fiscal year will be mainly dedicated to the following items. (1) Acquisition of computation power. This expenditure was delayed last year because of price increases. (2) Personnel expenditures. This year, a PhD student will be hired to work on the problem of converting programs working on integer values to real values, and on the development of a framework for the measurement of results, which will be implemented and tested with existing analogy test sets. (3) Travel expenses. Two invitations have been received for invited talks, one at a workshop and one in a regular seminar at a foreign university. The topics of the talks will be related to this research project. If conditions permit, the principal investigator will participate in these events and in conferences, and the budget will be used for travel expenses.
|
Research Products
(1 result)