Summary of Research Achievements |
In this research, we propose an information extraction method for digitized ancient Mongolian documents that uses an ancient-modern dictionary. In FY2014, the following language resources were prepared.
1. An ancient-modern (traditional Mongolian and Cyrillic Mongolian) dictionary and parallel corpora: A dictionary has been built by comparing statistical information, such as co-occurrence frequencies and word frequencies, for words that appear in both the modern and ancient sides of parallel corpora of ancient Mongolian historical documents such as "The Altan Tobchi", "The Story of Asragch" and "The Secret History of the Mongols".
2. Annotated training data: Annotated training data have been prepared manually using "Altan tovch", a chronological book of ancient Mongolian kings and the Mongol Empire.
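The dictionary-building step in item 1 can be illustrated with a minimal sketch. The report does not state which association measure was used, so the Dice coefficient below is an assumption, as are the function name `build_dictionary`, the `min_score` threshold, and the toy romanized tokens, which are not taken from the project's corpora.

```python
from collections import Counter
from itertools import product


def build_dictionary(pairs, min_score=0.5):
    """Map each ancient word to its best-scoring modern word.

    pairs: list of (ancient_tokens, modern_tokens) for aligned sentences.
    The score is the Dice coefficient over sentence-level counts
    (an assumed measure, not necessarily the one used in the project).
    """
    anc_freq, mod_freq, co_freq = Counter(), Counter(), Counter()
    for anc, mod in pairs:
        anc_set, mod_set = set(anc), set(mod)
        anc_freq.update(anc_set)                   # sentences containing each ancient word
        mod_freq.update(mod_set)                   # sentences containing each modern word
        co_freq.update(product(anc_set, mod_set))  # sentence-level co-occurrences

    best = {}
    for (a, m), c in co_freq.items():
        score = 2 * c / (anc_freq[a] + mod_freq[m])
        if score >= min_score and score > best.get(a, (None, 0.0))[1]:
            best[a] = (m, score)
    return best


# Toy aligned sentences (hypothetical illustration data).
toy_pairs = [
    (["yeke", "qagan"], ["их", "хаан"]),
    (["qagan", "ulus"], ["хаан", "улс"]),
    (["yeke", "ulus"], ["их", "улс"]),
]
toy_dictionary = build_dictionary(toy_pairs)
```

On real corpora, a frequency cutoff for rare words would likely be needed as well, since words seen only once can reach a perfect score by chance.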
Plan for Future Research |
In FY2015, we will propose a named-entity extraction method for ancient Mongolian historical documents that combines techniques based on ancient Mongolian grammar with a statistical model, employing text mining techniques. The following tasks will be implemented:
1) Extracting and tagging named entities such as historical figures and place names in ancient Mongolian historical documents.
2) Tagging personal names, including generational or dynastic information, an inherited or lifetime title of nobility, or a traditional descriptive phrase or nickname.
Besides extracting named entities, the following tasks will be carried out in creating digital representations of ancient Mongolian historical documents:
1. To encode contextual information, formalizing and representing explicit information about context.
2. To encode ancient words that were misspelled or written differently from the ancient orthography, along with their modern orthography, while preserving the writing of the original manuscripts.
3. To represent editorial markup, commentaries, alterations, revisions, corrections, transcriptions and interpretations.
Moreover, experiments will be conducted continuously to improve the proposed methods.
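As a rough illustration of encoding a word with both its original and modern orthography, together with a named-entity tag, the sketch below builds a small XML fragment. The report does not name an encoding scheme; the TEI-style element names (`name`, `choice`, `orig`, `reg`, `w`) are an assumption, and `encode_word` and the sample tokens are hypothetical.

```python
import xml.etree.ElementTree as ET


def encode_word(original, modern=None, name_type=None):
    """Encode one word, keeping the manuscript form alongside a modern form.

    original:  spelling as written in the manuscript (always preserved)
    modern:    modern orthography, if it differs from the original
    name_type: optional named-entity type, e.g. "person" or "place"
    """
    if modern is None or modern == original:
        node = ET.Element("w")            # plain word, no variant to record
        node.text = original
    else:
        node = ET.Element("choice")       # pair the two spellings
        ET.SubElement(node, "orig").text = original
        ET.SubElement(node, "reg").text = modern

    if name_type:
        wrapper = ET.Element("name", {"type": name_type})
        wrapper.append(node)
        return wrapper
    return node


# Hypothetical sample: a title tagged as a person-related name.
fragment = ET.tostring(encode_word("qagan", "хаан", "person"), encoding="unicode")
```

Keeping the original spelling in its own element means the manuscript reading is never overwritten, while search and extraction tools can work on the regularized form.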