2001 Fiscal Year Annual Research Report

Research on the Construction of Large-Scale Compressed Document Databases and Advanced Search Methods

Research Project

Project/Area Number	13780184
Research Category	Grant-in-Aid for Encouragement of Young Scientists (A)
Research Institution	Tohoku University
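Principal Investigator	Kunihiko Sadakane (定兼邦彦), Tohoku University, Graduate School of Information Sciences, Research Associate (20323090)
Keywords	string search / database / suffix array / compressed suffix array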
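Research Abstract	The goal of this research is to construct space-efficient indexes for string search and to devise fast, high-precision search algorithms. As such an index, we built a compressed suffix array in practice, using the entire human DNA sequence as data. The sequence consists of about 3 billion bases, and an uncompressed suffix array for it would occupy about 13 gigabytes. In this research we compressed the index to under 2 gigabytes. Compression slows down search, but computational experiments confirmed that the slowdown is within an acceptable range. Using the resulting compressed suffix array, we implemented similar-sequence search over very long DNA sequences. As a data structure that speeds up search when combined with the compressed suffix array, we proposed a space-efficient index for quickly computing the longest common prefix (lcp) of suffixes. No previously known data structure for this purpose had linear size, whereas ours has linear size and can compute any element quickly. With this data structure, string search based on a suffix array can be simulated. We also improved the compressed suffix array data structure itself, enabling a search algorithm whose running time is proportional to the length of the query string. As a method for high-precision retrieval from document databases, we proposed proximity search and developed a fast algorithm for it. When the user specifies several words in a query, documents in which those words appear close to one another are regarded as important and receive a higher search score. Retrieval experiments on Web pages confirmed that this finds more relevant documents than existing search algorithms based on word occurrence frequency. Search time was also short.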
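The proximity scoring idea described in the abstract can be illustrated with a minimal sketch. This is not the algorithm from the publications listed below. It is a plain sliding-window scan, and the function names `smallest_window` and `proximity_score`, the whitespace tokenization, and the scoring formula (reciprocal of the window length) are all assumptions made for illustration.

```python
from collections import Counter


def smallest_window(tokens, query_words):
    """Return (start, end) of the shortest token span containing every
    query word at least once, or None if some word never occurs.
    Hypothetical helper: a simple sliding window, not the paper's method."""
    needed = set(query_words)
    if not needed:
        return None
    counts = Counter()   # occurrences of each query word inside the window
    covered = 0          # number of distinct query words inside the window
    best = None
    left = 0
    for right, tok in enumerate(tokens):
        if tok not in needed:
            continue
        counts[tok] += 1
        if counts[tok] == 1:
            covered += 1
        # Shrink from the left while the window still holds every word.
        while covered == len(needed):
            if best is None or right - left < best[1] - best[0]:
                best = (left, right)
            left_tok = tokens[left]
            if left_tok in needed:
                counts[left_tok] -= 1
                if counts[left_tok] == 0:
                    covered -= 1
            left += 1
    return best


def proximity_score(tokens, query_words):
    """Assumed scoring rule: a shorter covering window gives a higher score."""
    window = smallest_window(tokens, query_words)
    if window is None:
        return 0.0
    start, end = window
    return 1.0 / (end - start + 1)


if __name__ == "__main__":
    near = "compressed suffix array index for genome search".split()
    far = "suffix tree and an inverted index or array".split()
    query = ["suffix", "array"]
    print(proximity_score(near, query), proximity_score(far, query))
```

Under this assumed rule, a document in which the query words are adjacent scores higher than one in which they are spread apart, which is the ranking behaviour the abstract describes. A real system would combine such a score with other signals.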

Research Products
(3 results)

[Publications] K. Sadakane, H. Imai: "Fast Algorithms for k-Word Proximity Search" IEICE Trans. Fundamentals. Vol.E84-A No.9. 2311-2318 (2001)
[Publications] K. Sadakane, T. Sibuya: "Indexing Huge Genome Sequences for Solving Various Problems" Genome Informatics 2001 (Universal Academy Press). No.12. 175-183 (2001)
[Publications] K. Sadakane: "Succinct Representations of lcp Information and Improvements in the Compressed Suffix Arrays" Proceedings of ACM-SIAM Symposium on Discrete Algorithms. 225-232 (2002)
