Project Area | Compilation of a balanced corpus of written Japanese: Infrastructure for the coming Japanese linguistics |
Project/Area Number |
18061003
|
Research Category |
Grant-in-Aid for Scientific Research on Priority Areas
|
Allocation Type | Single-year Grants |
Review Section |
Humanities and Social Sciences
|
Research Institution | Tokyo Institute of Technology |
Principal Investigator |
OKUMURA Manabu Tokyo Institute of Technology, Precision and Intelligence Laboratory, Professor (60214079)
|
Co-Investigator(Kenkyū-buntansha) |
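SHIRAI Kiyoaki Japan Advanced Institute of Science and Technology, School of Information Science, Associate Professor (30302970)
SHINNOU Hiroyuki Ibaraki University, Faculty of Engineering, Associate Professor (10250987)
TAKAMURA Hiroya Tokyo Institute of Technology, Precision and Intelligence Laboratory, Associate Professor (80361773)
TAKEUCHI Kouichi Okayama University, Graduate School of Natural Science and Technology, Lecturer (80311174)
SASAKI Minoru Ibaraki University, Faculty of Engineering, Lecturer (60344834)
NAKAMURA Makoto Japan Advanced Institute of Science and Technology, School of Information Science, Assistant Professor (50377438)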
|
Project Period (FY) |
2006 – 2010
|
Project Status |
Completed (Fiscal Year 2010)
|
Budget Amount |
¥84,700,000 (Direct Cost: ¥84,700,000)
Fiscal Year 2010: ¥18,400,000 (Direct Cost: ¥18,400,000)
Fiscal Year 2009: ¥18,400,000 (Direct Cost: ¥18,400,000)
Fiscal Year 2008: ¥18,400,000 (Direct Cost: ¥18,400,000)
Fiscal Year 2007: ¥18,400,000 (Direct Cost: ¥18,400,000)
Fiscal Year 2006: ¥11,100,000 (Direct Cost: ¥11,100,000)
|
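Keywords | sense-tagged corpus / discovery of new word senses / machine learning / lexical conceptual structure / clustering / word sense disambiguation / new word sense discovery / representativeness |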
Research Abstract |
1) We constructed a word-sense-annotated corpus based on the Balanced Corpus of Contemporary Written Japanese.
2) Using the corpus constructed in 1), we organized the SemEval-2 Japanese Word Sense Disambiguation (WSD) task. Nine systems from four organizations participated.
3) We showed that, in domain adaptation for WSD, the most effective adaptation method varies with the properties of the source and target data. We also presented a way to select the most effective method according to these properties using decision tree learning. Average WSD accuracy improved significantly when the automatically selected method was applied case by case, compared with applying each original method uniformly to all cases.
4) We proposed a supervised WSD system that uses features obtained from the clustering results of word instances. The approach is novel in that it employs semi-supervised clustering that controls the fluctuation of cluster centroids, selects seed instances by considering the frequency distribution of word senses, and excludes outliers when introducing "must-link" constraints between seed instances. Features computed from the word instances in the resulting clusters improved supervised WSD accuracy.
5) We proposed a method for detecting new word senses in a corpus. It consists of two steps: (A) word instances are clustered so that instances of the same sense are merged; (B) the similarity between each cluster and each sense in a dictionary is measured to determine the senses of the instances in that cluster.
6) We proposed a method for detecting peculiar examples of a target word in a corpus. It combines the density-based method Local Outlier Factor (LOF) with One-Class SVM, both representative outlier detection methods in data mining. The combined method improved on the precision and recall of LOF and One-Class SVM used alone. Using the noun 'midori (green)', we also showed that the method can detect new meanings.
7) We presented a co-clustering-based approach to verb synonym extraction that increases the number of meanings of polysemous verbs extracted from a large text corpus. The approach extracts different meanings of polysemous verbs by recursively eliminating extracted clusters from the initial data set. In experiments on verb synonym extraction, it increased the number of correct verb clusters by about 50%, with a 0.9% increase in precision and a 1.5% increase in recall over the previous approach.
|
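Step (B) of item 5 in the abstract, matching clusters of word instances against dictionary senses, might be sketched roughly as follows. The abstract does not say how similarity is measured, so the bag-of-words vectors, the cosine similarity, the 0.2 threshold, the function names, and the toy data are all assumptions made for illustration rather than the project's actual method.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def label_clusters(clusters, sense_glosses, threshold=0.2):
    """Give each cluster its closest dictionary sense, or None.

    clusters:      cluster id -> list of context token lists
    sense_glosses: sense id -> list of gloss tokens
    threshold:     hypothetical cutoff; a cluster whose best similarity
                   falls below it is treated as a candidate new sense
    """
    sense_vecs = {s: Counter(toks) for s, toks in sense_glosses.items()}
    labels = {}
    for cid, contexts in clusters.items():
        # Pool all context words of the cluster into one vector.
        cluster_vec = Counter(tok for ctx in contexts for tok in ctx)
        best_sense, best_sim = None, 0.0
        for sense, vec in sense_vecs.items():
            sim = cosine(cluster_vec, vec)
            if sim > best_sim:
                best_sense, best_sim = sense, sim
        labels[cid] = best_sense if best_sim >= threshold else None
    return labels


# Toy data invented for illustration; not taken from the project's corpus.
clusters = {
    "c1": [["green", "color", "paint"], ["color", "leaf", "green"]],
    "c2": [["train", "ticket", "window"], ["ticket", "station"]],
}
sense_glosses = {
    "midori#1": ["green", "color"],
    "midori#2": ["plant", "leaf", "nature"],
}
labels = label_clusters(clusters, sense_glosses)
print(labels)
```

In this sketch a cluster labelled `None` matches no dictionary sense closely enough and would be passed on as a candidate new sense. Step (A), the clustering itself, is taken as given input here.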