Research on Search Algorithms for Large-Scale Web Information

Research Project

Project/Area Number	08J08116
Research Category	Grant-in-Aid for JSPS Fellows
Allocation Type	Single-year Grants
Section	Domestic
Research Field	Fundamental theory of informatics
Research Institution	The University of Tokushima
Principal Investigator	Susumu Yata (矢田晋), The University of Tokushima, Graduate School, Institute of Technology and Science (大学院・ソシオテクノサイエンス研究部), Research Fellow (PD)
Project Period (FY)	2008 – 2009
Project Status	Completed (Fiscal Year 2009)
Budget Amount	¥1,600,000 (Direct Cost: ¥1,600,000); Fiscal Year 2009: ¥800,000 (Direct Cost: ¥800,000); Fiscal Year 2008: ¥800,000 (Direct Cost: ¥800,000)
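Keywords	large-scale n-gram data / Web corpus / large-scale dictionary / trie / succinct data structure / weighted input completion / double-array
Research Abstract	Large-scale n-gram data is useful for building language models used in machine translation and kana-kanji conversion, and for acquiring linguistic knowledge based on syntax and co-occurrence. However, the data is too large to use casually, so its use has been limited to a small number of studies. This project therefore developed a search system for large-scale n-gram data. The system is easy to install and also supports searching from a Web browser, which greatly reduces the burden of using the data. It has already produced several results in linguistic knowledge acquisition. The Web corpus whose construction began in the previous fiscal year has been expanded to about 1.8 TB in database size and about 60 million HTML documents. The corpus is used to build the large-scale n-gram data described above and to test the search system. It is also being applied to language processing that presupposes large-scale corpora, an area of active research in recent years. In the research on dictionary construction methods, it was confirmed that compact data structures known as succinct data structures make it possible to build extremely large dictionaries with a vocabulary of more than one billion entries. A comparison of various data structures also clarified the advantages and disadvantages of each, yielding information that can guide the choice among them for a given application. Furthermore, a new method was proposed that completes input in priority order, using the words registered in a dictionary as candidates. With the proposed method, input can be completed quickly even when there are many candidates, making it possible to provide a more responsive interface.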
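
The abstract does not describe how the priority-ordered completion is implemented, so the following is only an illustrative sketch of one common way to do it, not the method proposed in the project. It assumes a plain pointer-based trie in which every node caches the largest weight found in its subtree, and it uses best-first search with a heap so that only the most promising branches are expanded. The names `Node`, `WeightedTrie`, `insert`, and `complete` are hypothetical.

```python
import heapq
from itertools import count


class Node:
    def __init__(self):
        self.children = {}               # character -> child Node
        self.weight = None               # weight of the key ending here, if any
        self.max_weight = float("-inf")  # largest weight stored in this subtree


class WeightedTrie:
    """Hypothetical trie for top-k completion (each key inserted once)."""

    def __init__(self):
        self.root = Node()

    def insert(self, key, weight):
        node = self.root
        node.max_weight = max(node.max_weight, weight)
        for ch in key:
            node = node.children.setdefault(ch, Node())
            node.max_weight = max(node.max_weight, weight)
        node.weight = weight

    def complete(self, prefix, k):
        """Return up to k (key, weight) pairs starting with prefix, heaviest first."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        tie = count()  # unique tiebreaker so heap entries never compare Nodes
        # Heap entries: (-priority, tiebreak, text, node, is_key).
        # For a subtree entry the priority is an upper bound (max_weight);
        # for a key entry it is the exact weight of that key.
        heap = [(-node.max_weight, next(tie), prefix, node, False)]
        results = []
        while heap and len(results) < k:
            neg, _, text, cur, is_key = heapq.heappop(heap)
            if is_key:
                results.append((text, -neg))
                continue
            if cur.weight is not None:
                heapq.heappush(heap, (-cur.weight, next(tie), text, cur, True))
            for ch, child in cur.children.items():
                heapq.heappush(
                    heap, (-child.max_weight, next(tie), text + ch, child, False)
                )
        return results
```

Because a key is emitted only when its exact weight is at least as large as the upper bound of every subtree still on the heap, the results come out in descending weight order, and the search can stop after `k` keys without visiting the remaining candidates. That is the property that keeps completion fast when a prefix has many candidates.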

Report

(2 results)

2009 Annual Research Report
2008 Annual Research Report

Research Products

(12 results)

Presentation (6 results), Remarks (6 results)

[Presentation] Customized Tries for Weighted Key Completion (2010)
- Author(s)
  Susumu Yata
- Organizer
  ICCEA 2010
- Place of Presentation
  Bali Dynasty Resort(Bali Island, Indonesia)
- Year and Date
  2010-03-20
- Related Report
  2009 Annual Research Report
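[Presentation] Evaluation of Trie Dictionaries Using Succinct Representations of Ordered Trees (2010)
- Author(s)
  Susumu Yata
- Organizer
  72nd National Convention of the Information Processing Society of Japan (IPSJ)
- Place of Presentation
  Hongo Campus, The University of Tokyo (Tokyo)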
- Year and Date
  2010-03-11
- Related Report
  2009 Annual Research Report
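[Presentation] Compression of Large-Scale Trie Dictionaries with Many Duplicate Records (2009)
- Author(s)
  Susumu Yata
- Organizer
  8th Forum on Information Technology (FIT 2009)
- Place of Presentation
  Yagiyama Campus, Tohoku Institute of Technology (Miyagi Prefecture)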
- Year and Date
  2009-09-02
- Related Report
  2009 Annual Research Report
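[Presentation] A Search System for Large-Scale N-gram Data Using Inverted Files (2009)
- Author(s)
  Susumu Yata
- Organizer
  95th Meeting of the IPSJ Special Interest Group on Fundamental Infology (FI 95)
- Place of Presentation
  Kobe Fashion Mart (Hyogo Prefecture)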
- Year and Date
  2009-07-28
- Related Report
  2009 Annual Research Report
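[Presentation] Construction and Evaluation of Dynamic Dictionaries Using Double-Arrays (2009)
- Author(s)
  Susumu Yata
- Organizer
  71st National Convention of the Information Processing Society of Japan (IPSJ)
- Place of Presentation
  Ritsumeikan University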
- Year and Date
  2009-03-11
- Related Report
  2008 Annual Research Report
[Presentation] Fast string matching with space-efficient word graphs (2008)
- Author(s)
  Susumu Yata
- Organizer
  Innovations'2008
- Place of Presentation
  Al Ain, UAE
- Year and Date
  2008-11-16
- Related Report
  2008 Annual Research Report
[Remarks] Double-array library. A clone of the well-known library Darts; it can be embedded in tools such as the morphological analyzer MeCab.
- URL
  http://code.google.com/p/darts-clone/
- Related Report
  2009 Annual Research Report
[Remarks] Google N-gram search system. Searches the Google n-gram corpus for n-grams that contain a specified morpheme pattern.
- URL
  http://code.google.com/p/ssgnc/
- Related Report
  2009 Annual Research Report
[Remarks] Library of various tries. Provides a variety of tries through the same interface, so they can be compared easily.
- URL
  http://code.google.com/p/sumire-tries/
- Related Report
  2009 Annual Research Report
[Remarks] Library for large-scale tries using succinct data structures. Can build large-scale tries with more than one billion registered strings.
- URL
  http://code.google.com/p/taiju/
- Related Report
  2009 Annual Research Report
[Remarks]
- URL
  http://code.google.com/p/darts-clone/
- Related Report
  2008 Annual Research Report
[Remarks]
- URL
  http://code.google.com/p/ssgnc/
- Related Report
  2008 Annual Research Report
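
The record names inverted files as the basis of the n-gram search system but gives no implementation details. As a rough illustration of the general technique only, and not of the actual ssgnc code, the sketch below indexes each n-gram under every token it contains and answers a query by intersecting posting lists. The functions `build_index` and `search` and the `(tokens, count)` record layout are assumptions made for this example.

```python
from collections import defaultdict


def build_index(ngrams):
    """Map each token to the list of ids of the n-grams that contain it.

    `ngrams` is a list of (tokens, count) pairs, where tokens is a tuple
    of strings and count is the n-gram frequency (assumed layout).
    """
    postings = defaultdict(list)
    for ngram_id, (tokens, _count) in enumerate(ngrams):
        for token in set(tokens):  # index each distinct token once per n-gram
            postings[token].append(ngram_id)
    return postings


def search(ngrams, postings, query_tokens):
    """Return the n-grams containing every query token, most frequent first."""
    if not query_tokens:
        return []
    # Start from the shortest posting list to keep the intersection small.
    lists = sorted((postings.get(t, []) for t in query_tokens), key=len)
    hits = set(lists[0])
    for other in lists[1:]:
        hits &= set(other)
    # Sort by descending count, breaking ties by the token tuple.
    return sorted((ngrams[i] for i in hits), key=lambda e: (-e[1], e[0]))
```

A production system over data of the size described in the abstract would keep compressed posting lists on disk rather than in Python dictionaries; the sketch only shows the lookup logic.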
