Budget Amount |
¥2,100,000 (Direct Cost: ¥2,100,000)
Fiscal Year 2003: ¥700,000 (Direct Cost: ¥700,000)
Fiscal Year 2002: ¥1,400,000 (Direct Cost: ¥1,400,000)
|
Research Abstract |
Textual material on the Internet, or the Internet corpus, is an important and valuable language resource for natural language processing. In this research, we have tried to use it in devising a method for analyzing compound words in Japanese, a writer's aid program for translating Japanese into English, and an automatic summarization system for newspaper articles on sassho-jiken. Our approach to natural language processing is statistical rather than grounded in linguistic theory. This approach faces a difficulty that requires a solution to the sparse data problem: whatever result we obtain, it will not be reliable if it is derived from an insufficient amount of data. The data on the Internet are practically infinite, and our research has demonstrated that the Internet corpus can be used effectively in the areas we dealt with. At the same time, however, it has revealed a problem: the data on the Internet are not well formed, and a mechanism for eliminating "junk" data would be a necessary step for many language processing systems.
|