Text Mining for Languages of All Ages and Countries
Project/Area Number |
22500140
|
Research Category |
Grant-in-Aid for Scientific Research (C)
|
Allocation Type | Single-year Grants |
Section | General |
Research Field |
Intelligent informatics
|
Research Institution | Shonan Institute of Technology |
Principal Investigator |
SUZUKI Makoto Shonan Institute of Technology, Faculty of Engineering, Associate Professor (80339796)
|
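Co-Investigator (Renkei-kenkyūsha) |
OHSUGA Akihiko The University of Electro-Communications, Graduate School of Information Systems, Professor (90393842)
GOTO Masayuki Waseda University, School of Creative Science and Engineering, Department of Industrial and Management Systems Engineering, Professor (40287967)
SUKO Tota Waseda University, Media Network Center, Assistant Professor (40409660)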
|
Project Period (FY) |
2010 – 2012
|
Project Status |
Completed (Fiscal Year 2012)
|
Budget Amount |
¥4,160,000 (Direct Cost: ¥3,200,000, Indirect Cost: ¥960,000)
Fiscal Year 2012: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000)
Fiscal Year 2011: ¥1,300,000 (Direct Cost: ¥1,000,000, Indirect Cost: ¥300,000)
Fiscal Year 2010: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000)
|
Keywords | Multilingual processing / Machine learning / Modeling / Automatic document classification / N-gram / Text mining |
Research Abstract |
We proposed the accumulation method, a language-independent text classification method based on character N-grams. Because it uses character N-grams as index terms, the method does not depend on the structure of any particular language: as long as the documents are encoded in Unicode, the same algorithm can classify them. We applied the method to English, Japanese, Korean, and Chinese text documents. The highest macro-averaged F-measures obtained were 94.5% for the English Reuters-21578 data set, 88.5% for the Japanese CD-Mainichi 2002 data set, 90.2% for the Korean Hankyoreh 2008 data set, and 92.6% for the Chinese People's Daily 2009-2010 data set, so the method performed well in all four languages. We also constructed a mathematical model of the accumulation method and clarified its mathematical meaning.
|
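As a rough illustration of the two technical notions the abstract relies on, the sketch below extracts character N-grams from Unicode text and computes a macro-averaged F-measure. It is not the accumulation method itself, whose scoring rule is not described here; the function names `char_ngrams` and `macro_f1` and the toy labels are assumptions made for this example only.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count overlapping character N-grams.

    A Python str is a sequence of Unicode code points, so the same
    slicing works for English, Japanese, Korean, or Chinese text
    without word segmentation.
    """
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def macro_f1(gold, predicted):
    """Macro-averaged F-measure: the unweighted mean of per-class F1."""
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show the calls.
    print(char_ngrams("text mining", 2).most_common(3))
    print(char_ngrams("テキスト", 2))
    print(macro_f1(["sports", "sports", "economy"],
                   ["sports", "economy", "economy"]))
```

Because the N-grams are taken over code points rather than words, no tokenizer or morphological analyzer is needed, which is the property the abstract attributes to character N-gram index terms.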
Report
(4 results)
Research Products
(23 results)
[Book] 確率統計学 (Probability and Statistics), 2010
Author(s)
SUKO Tota, SUZUKI Makoto, UKITA Yoshifumi, KOBAYASHI Manabu, GOTO Masayuki
Total Pages
251
Publisher
Ohmsha