2017 Fiscal Year Research-status Report
Research on Knowledge Extraction from Ancient Mongolian Historical Documents using Deep Learning
Project/Area Number | 17K00457 |
Research Institution | Ritsumeikan University |
Principal Investigator | バトジャルガル ビルゲ, Ritsumeikan University, Research Organization of Science and Technology, Researcher (30725396) |
Project Period (FY) | 2017-04-01 – 2021-03-31 |
Keywords | historical documents / traditional Mongolian / Deep learning / machine learning |
Outline of Annual Research Achievements |
In this research, we propose a comprehensive information extraction and analysis method for digitized ancient Mongolian historical documents. The proposed method will recognize new features and patterns in historical manuscripts by utilizing deep learning techniques. In FY2017, the following language resources were prepared:
1. Ancient Mongolian corpora: Corpora of ancient Mongolian manuscripts were prepared, including i) the “Qad-un undusun-u quriyangγui altan tobci neretu sudur” (The Altan Tobchi, or the Golden Summary: Short History of the Origins of the Khans; written in 1604), a.k.a. the “Little” Altan Tobchi; ii) the “Asaragci neretu-yin teuke” or “Asragch nertiin tuukh” (The Story of Asragch; written in 1677); and iii) the “Monggol-un nigucha tobciyan” or “Mongoliin nuuts tovchoo” (The Secret History of the Mongols).
2. Manually annotated training data: Annotated training data were prepared manually. For syntactic annotation, we performed part-of-speech tagging manually. Moreover, each token of the digitized ancient Mongolian manuscripts is annotated with IOB2 tags. Because of some unique features of traditional Mongolian script, we also used the “Start/End” (SE) chunk tag set, which represents the character position in a word, along with the IOB2 tags.
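The two tag sets mentioned above can be illustrated with a minimal Python sketch. It is not the project's annotation tool. The IOB2 conversion follows the standard convention (B- on the first token of a chunk, I- on the following tokens, O elsewhere). The chunk label TITLE, the example span, and the character-level S/M/E scheme are assumptions made only for illustration, since the report does not specify the labels or the exact SE encoding.

```python
from typing import List, Tuple


def spans_to_iob2(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end_exclusive, label) token spans into IOB2 tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the chunk
    return tags


def char_position_tags(word: str) -> List[str]:
    """Hypothetical SE-style tags: S = first character, E = last character,
    M = any character in between (assumed; the report does not define this)."""
    if not word:
        return []
    if len(word) == 1:
        return ["S"]
    return ["S"] + ["M"] * (len(word) - 2) + ["E"]


# Tokens taken from a title quoted in the report; the TITLE span is illustrative.
tokens = ["altan", "tobci", "neretu", "sudur"]
spans = [(0, 2, "TITLE")]

for token, tag in zip(tokens, spans_to_iob2(len(tokens), spans)):
    print(token, tag, "".join(char_position_tags(token)))
```

Keeping the word-internal position separate from the chunk tag lets a model see both where a chunk begins and where a character sits inside its word.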
|
Current Status of Research Progress |
2: Research has progressed on the whole more than it was originally planned.
Reason
Our research has been conducted according to the research plan, and useful language resources have been prepared. As planned, these language resources and the achievements of FY2017 will allow us to advance toward the research goal of developing a comprehensive information extraction and analysis method that recognizes new features and patterns in historical manuscripts by utilizing deep learning techniques. Moreover, the success of any machine learning algorithm depends on good training data and large data sets.
Parts of the ongoing research results and achievements have been published in international conference papers.
|
Strategy for Future Research Activity |
In FY2018, we will build a deep learning model for processing, classifying, and analyzing digital texts and scanned images of ancient Mongolian historical documents. The manually annotated training data and the collected digital texts will be used to recognize features and patterns of ancient Mongolian grammar within the manuscripts by employing deep learning networks. The following iterative tasks will be carried out:
- Starting from an initial setting, samples from the data set are presented to the deep learning network one after another.
- At each iteration, the deep learning network's settings are tuned slightly.
- Eventually, repeated iterations bring the deep learning network's output closer to the correct output.
Some features of ancient Mongolian historical documents in traditional Mongolian script could be given higher weights in the deep learning networks: 1) suffixes, which have some unique features, and 2) the end of a token, because several final letters have special features in traditional Mongolian script.
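The iterative procedure described above (an initial setting, samples presented one at a time, and a small adjustment at every step) can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the planned model: a single linear unit trained by stochastic gradient descent stands in for a deep learning network, and the function name, learning rate, epoch count, and synthetic data (y = 2x + 1) are all hypothetical.

```python
import random


def train_linear_unit(samples, epochs=200, lr=0.05, seed=0):
    """Fit y ~ w * x + b by presenting samples one after another."""
    rng = random.Random(seed)
    # Initial setting: small random parameters.
    w, b = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    for _ in range(epochs):
        for x, y in samples:
            error = (w * x + b) - y   # distance from the correct output
            # Tune the settings slightly, in the direction that reduces
            # the squared error for this sample.
            w -= lr * error * x
            b -= lr * error
    return w, b


# Synthetic, noise-free data generated from y = 2x + 1 (illustrative only).
samples = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
w, b = train_linear_unit(samples)
print(f"learned w={w:.3f}, b={b:.3f}")
```

A real network would replace the single unit with many layers and compute the adjustments by backpropagation, but the loop structure stays the same.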
|
Causes of Carryover |
A pre-ordered book costing 60 EUR (approximately 9,000 JPY) did not arrive by the end of FY2017 (H29).
We would like to carry the remaining budget over to next year's research.
|
Research Products
(5 results)