Project Area | Compilation of a balanced corpus of written Japanese: Infrastructure for the coming Japanese linguistics |
Project/Area Number |
18061007
|
Research Category |
Grant-in-Aid for Scientific Research on Priority Areas
|
Allocation Type | Single-year Grants |
Review Section |
Humanities and Social Sciences
|
Research Institution | The National Institute for Japanese Language |
Principal Investigator |
YAMAZAKI Makoto The National Institute for Japanese Language, Department of Corpus Studies, Associate Professor (30182489)
|
Co-Investigator (Kenkyū-buntansha) |
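MARUYAMA Takehiko The National Institute for Japanese Language, Department of Corpus Studies, Assistant Professor (90392539)
KASHINO Wakako The National Institute for Japanese Language, Department of Corpus Studies, Associate Professor (50311147)
SANO Motoki The National Institute for Japanese Language, Center for Corpus Development, Project Research Fellow (60455425)
YAMAGUCHI Masaya The National Institute for Japanese Language, Department of Corpus Studies, Assistant Professor (30302920)
MABUCHI Yoko The National Institute for Japanese Language, Center for Corpus Development, Project Research Fellow (10415614)
TAKADA Tomokazu The National Institute for Japanese Language, Department of Linguistic Theory and Structure, Associate Professor (90415612)
OGURA Hideki The National Institute for Japanese Language, Department of Corpus Studies, Associate Professor (00321547)
FUJIIKE Yumi The National Institute for Japanese Language, Center for Corpus Development, Project Research Fellow (20510572)
ONUMA Etsu The National Institute for Japanese Language, Administration Department, Research Promotion Division, Specialist (00311150)
MORIMOTO Sachiko Gakushuin University, Graduate School of Humanities, Assistant Professor (80342939)
大和 淳 Agency for Cultural Affairs, Commissioner's Secretariat, Copyright Division, Deputy Director (10377103)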
|
Project Period (FY) |
2006 – 2010
|
Project Status |
Completed (Fiscal Year 2010)
|
Budget Amount |
¥242,200,000 (Direct Cost: ¥242,200,000)
Fiscal Year 2010: ¥17,500,000 (Direct Cost: ¥17,500,000)
Fiscal Year 2009: ¥29,300,000 (Direct Cost: ¥29,300,000)
Fiscal Year 2008: ¥54,900,000 (Direct Cost: ¥54,900,000)
Fiscal Year 2007: ¥86,200,000 (Direct Cost: ¥86,200,000)
Fiscal Year 2006: ¥54,300,000 (Direct Cost: ¥54,300,000)
|
Keywords | balanced corpus / written language / representativeness / books / sampling / XML / morphological analysis / copyright clearance |
Research Abstract |
We have compiled a large balanced corpus of books that will be a highly useful resource for future research on the Japanese language. It is the first full-fledged balanced corpus of written language in Japan and has the following characteristics. (1) It properly represents the distribution of the population through random sampling. (2) It is segmented into two kinds of word unit (short word unit and long word unit). (3) Text structure, morphological information, and character information are annotated in XML. (4) Copyright permission was sought for every sample as far as possible. The book corpus is the main part of the BCCWJ (Balanced Corpus of Contemporary Written Japanese) and will be open to the public in 2011.
|