
A Study on Cluster-based Indexing of Textual Data

Research Project

Project/Area Number 15500081
Research Category

Grant-in-Aid for Scientific Research (C)

Allocation Type Single-year Grants
Section General
Research Field Media informatics/Database
Research Institution National Institute of Informatics

Principal Investigator

AIZAWA Akiko  National Institute of Informatics, Research Center for Information Resources, Professor (90222447)

Project Period (FY) 2003 – 2004
Project Status Completed (Fiscal Year 2004)
Budget Amount
¥3,500,000 (Direct Cost: ¥3,500,000)
Fiscal Year 2004: ¥1,700,000 (Direct Cost: ¥1,700,000)
Fiscal Year 2003: ¥1,800,000 (Direct Cost: ¥1,800,000)
Keywords Text Mining / Statistical Language Model / Document Clustering / Information Retrieval / Amount of Information / Extraction of Noun Phrases
Research Abstract

In this study, we proposed a framework for, and an implementation of, an information retrieval system that utilizes clusters of similar documents. The proposed method first generates document clusters, together with their representative terms and phrases, based on term distributions or term-sequence matches. Next, treating each document cluster as a single virtual document, it creates an extended index. When a query is submitted, the system searches both the original and the extended indices and returns the integrated result. We also demonstrated that indices generated from different viewpoints can be used to make the retrieval system more flexible.
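As a rough illustration of the indexing scheme described above, and not the project's actual implementation, the following Python sketch builds an ordinary inverted index, builds an extended index in which each cluster is one virtual document, and merges hits from both at query time. The function names, the whitespace tokenizer, the additive scoring, the `cluster_weight` parameter, and the sample data are all assumptions made for this example; the clusters are given by hand rather than generated.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def build_extended_index(docs, clusters):
    """Index each cluster as one virtual document made of its members' text."""
    virtual_docs = {
        cid: " ".join(docs[d] for d in members)
        for cid, members in clusters.items()
    }
    return build_index(virtual_docs)

def search(query, index, ext_index, clusters, cluster_weight=0.5):
    """Score documents by direct term hits plus discounted cluster-level hits."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1.0
        for cid in ext_index.get(term, ()):
            # A hit on the virtual document is credited to every member.
            for doc_id in clusters[cid]:
                scores[doc_id] += cluster_weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical sample data; the cluster assignment is supplied manually.
docs = {
    "d1": "suffix array clustering",
    "d2": "repetition based clustering of text",
    "d3": "budget report",
}
clusters = {"c1": ["d1", "d2"]}

index = build_index(docs)
ext_index = build_extended_index(docs, clusters)
results = search("suffix clustering", index, ext_index, clusters)
print(results)
```

In this sketch, "d2" gains credit for the term "suffix" only through its cluster, which is the kind of effect an extended index is meant to provide.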
During the research period, we focused on the following research topics:
1. A co-clustering method based on co-occurrence statistics and mutual information
2. A suffix-array-based clustering method that exploits the repetition of textual elements, measured by the proposed coincidence score
3. A framework of cluster-based indexing and its implementation
4. Entity identification using the fast repetition-based clustering method
Future research issues include statistical and analytical text processing methods to automatically extract index phrases from the target retrieved document set, and also methods for the identification of textual elements that refer to the same real-world entities.
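To make the idea of detecting repeated textual elements with a suffix array more concrete, here is a minimal Python sketch that sorts all suffixes of a string and compares neighbouring suffixes to find the longest repeated substring. It does not implement the coincidence score mentioned in the list above, whose definition is not given here; the function names and the naive sorting-based construction are assumptions chosen for brevity, not efficiency.

```python
def suffix_array(text):
    """Return suffix start positions in lexicographic order (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def longest_repeated_substring(text):
    """Find the longest substring that occurs at least twice in text."""
    sa = suffix_array(text)
    best = ""
    # Repeated substrings show up as common prefixes of adjacent suffixes.
    for a, b in zip(sa, sa[1:]):
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        if n > len(best):
            best = text[a:a + n]
    return best

print(longest_repeated_substring("cluster-based indexing and cluster-based retrieval"))
```

A practical system would use a linear-time or near-linear-time suffix array construction and would score repetitions across documents rather than within a single string.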

Report

(3 results)
  • 2004 Annual Research Report   Final Research Report Summary
  • 2003 Annual Research Report
  • Research Products

    (27 results)

Journal Article (21 results), Publications (6 results)

  • [Journal Article] Techniques and Research Trends in Record Linkage Studies (2005)

    • Author(s)
      Akiko Aizawa, Keizo Oyama, Atsuhiro Takasu, Jun Adachi
    • Journal Title

      Journal of IEICE, D1 Vol.J88-D1 No.3 (in Japanese)

      Pages: 576-589

    • NAID

      110003207354

    • Description
      From the Final Research Report Summary (Japanese version)
    • Related Report
      2004 Final Research Report Summary
  • [Journal Article] A Fast Linkage Detection Scheme for Multi-Source Information Integration (2005)

    • Author(s)
      Akiko Aizawa, Keizo Oyama
    • Journal Title

      WIRI2005 (International Workshop on Challenges in Web Information Retrieval and Integration)

    • Description
      From the Final Research Report Summary (Japanese version)
    • Related Report
      2004 Final Research Report Summary
  • [Journal Article] Techniques and Research Trends in Record Linkage Studies (2005)

    • Author(s)
      Akiko Aizawa, Atsuhiro Takasu, Keizo Oyama, Jun Adachi
    • Journal Title

      Journal of IEICE Vol.J88-D1 No.3 (in Japanese)

      Pages: 576-589

    • NAID

      110003207354

    • Description
      From the Final Research Report Summary (English version)
    • Related Report
      2004 Final Research Report Summary
  • [Journal Article] A Fast Linkage Detection Scheme for Multi-Source Information Integration (2005)

    • Author(s)
      Akiko Aizawa, Keizo Oyama
    • Journal Title

      WIRI2005 (International Workshop on Challenges in Web Information Retrieval and Integration)

    • Description
      From the Final Research Report Summary (English version)
    • Related Report
      2004 Final Research Report Summary
  • [Journal Article] Techniques and Research Trends in Record Linkage Studies (2005)

    • Author(s)
      Akiko Aizawa, Keizo Oyama, Atsuhiro Takasu, Jun Adachi
    • Journal Title

      Journal of IEICE, D1 Vol.J88-D1 No.3 (in Japanese)

      Pages: 576-589

    • NAID

      110003207354

    • Related Report
      2004 Annual Research Report
  • [Journal Article] A Fast Linkage Detection Scheme for Multi-Source Information Integration (2005)

    • Author(s)
      Akiko Aizawa, Keizo Oyama
    • Journal Title

      WIRI2005 (International Workshop on Challenges in Web Information Retrieval and Integration)

    • Related Report
      2004 Annual Research Report
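  • [Journal Article] An Approach to Automatic Generation of Multi-lingual Synonymous Terms Dictionary using Japanese-English Bilingual Author's Keywords (2004)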

    • Author(s)
      Akiko Aizawa
    • Journal Title

      Journal of Information Processing and Management (in Japanese) Vol.47, no.6

      Pages: 401-409

    • NAID

      130000072076

    • Description
      From the Final Research Report Summary (Japanese version)
    • Related Report
      2004 Final Research Report Summary
  • [Journal Article] Record Linkage of Multi-source Databases: Research Trends (2004)

    • Author(s)
      Akiko Aizawa, Atsuhiro Takasu, Keizo Oyama, Jun Adachi
    • Journal Title

      NII Journal (in Japanese) No.8

      Pages: 43-51

    • NAID

      110001276082

    • Description
      From the Final Research Report Summary (English version)
    • Related Report
      2004 Final Research Report Summary
  • [Journal Article] An Approach to Automatic Generation of Multi-lingual Synonymous Terms Dictionary using Japanese-English Bilingual Author's Keywords (2004)

    • Author(s)
      Akiko Aizawa
    • Journal Title

      Journal of Information Processing and Management (in Japanese) Vol.47 no.6

      Pages: 401-409

    • NAID

      130000072076

    • Description
      From the Final Research Report Summary (English version)
    • Related Report
      2004 Final Research Report Summary
  • [Journal Article] A Fast Method for Duplicated Entries Detection in Bibliographic Databases (2004)

    • Author(s)
      Akiko Aizawa, Atsuhiro Takasu, Keizo Oyama, Jun Adachi
    • Journal Title

      IPSJ SIG Notes, DBS (in Japanese) Vol.2004 No.45

      Pages: 111-118

    • Description
      From the Final Research Report Summary (English version)
    • Related Report
      2004 Final Research Report Summary
  • [Journal Article] An Approach to Cluster-based Indexing (2004)

    • Author(s)
      Akiko Aizawa
    • Journal Title

      IPSJ SIG Notes, NL (in Japanese) 159-007

      Pages: 159-7

    • NAID

      110002911663

    • Description
      From the Final Research Report Summary (English version)
    • Related Report
      2004 Final Research Report Summary
  • [Journal Article] A Fast Method for Duplicated Entries Detection in Bibliographic Databases (2004)

    • Author(s)
      Akiko Aizawa, Keizo Oyama, Atsuhiro Takasu, Jun Adachi
    • Journal Title

      IPSJ SIG Notes, DBS (Database Systems) (in Japanese) Vol.2004 No.45

      Pages: 111-118

    • NAID

      110002911297

    • Related Report
      2004 Annual Research Report
  • [Journal Article] An Approach to Cluster-based Indexing (2004)

    • Author(s)
      Akiko Aizawa
    • Journal Title

      IPSJ SIG Notes, NL (Natural Language Processing) (in Japanese) No.159-007

      Pages: 159-7

    • NAID

      110002911663

    • Related Report
      2004 Annual Research Report
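  • [Journal Article] An Approach to Automatic Generation of Multi-lingual Synonymous Terms Dictionary using Japanese-English Bilingual Author's Keywords (2004)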

    • Author(s)
      Akiko Aizawa
    • Journal Title

      Journal of Information Processing and Management (in Japanese) Vol.47, no.6

      Pages: 401-409

    • NAID

      130000072076

    • Related Report
      2004 Annual Research Report
  • [Journal Article] Analysis of Source Identified Text Corpora: Exploring the Statistics of the Reused Text and Authorship (2003)

    • Author(s)
      Akiko Aizawa
    • Journal Title

      Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL-03)

      Pages: 383-390

    • Description
      From the Final Research Report Summary (Japanese version)
    • Related Report
      2004 Final Research Report Summary
  • [Journal Article] Improving the Performance of Text Categorization Using Low Frequency Terms (2003)

    • Author(s)
      Akiko Aizawa
    • Journal Title

      Journal of Information Processing Society of Japan (in Japanese) 44,7

      Pages: 1720-1730

    • Description
      From the Final Research Report Summary (Japanese version)
    • Related Report
      2004 Final Research Report Summary
  • [Journal Article] Discovering Homographs using N-partite Graph Clustering (2003)

    • Author(s)
      Hidekazu Nakawatase, Akiko Aizawa
    • Journal Title

      Proceedings of the 6th International Conference on Discovery Science (DS'03)

      Pages: 402-409

    • Description
      From the Final Research Report Summary (Japanese version)
    • Related Report
      2004 Final Research Report Summary
  • [Journal Article] Improving the Performance of Text Categorization Using Low Frequency Terms (2003)

    • Author(s)
      Akiko Aizawa
    • Journal Title

      Journal of Information Processing Society of Japan (in Japanese)

      Pages: 1720-1730

    • NAID

      110002711767

    • Description
      From the Final Research Report Summary (English version)
    • Related Report
      2004 Final Research Report Summary
  • [Journal Article] Extracting and Analyzing Recycled Word Sentences from Text (2003)

    • Author(s)
      Akiko Aizawa
    • Journal Title

      IPSJ SIG Notes, FI 2003-FI-71

      Pages: 189-196

    • Description
      From the Final Research Report Summary (English version)
    • Related Report
      2004 Final Research Report Summary
  • [Journal Article] On the Analysis of Source Identified Text Corpora (2003)

    • Author(s)
      Akiko Aizawa
    • Journal Title

      The 17th Annual Conference of the Japanese Society for Artificial Intelligence (in Japanese) 1C5-05

    • NAID

      40020007253

    • Description
      From the Final Research Report Summary (English version)
    • Related Report
      2004 Final Research Report Summary
  • [Journal Article] Word Sense Discrimination based on Complete N-partite Graph (2003)

    • Author(s)
      Hidekazu Nakawatase, Akiko Aizawa
    • Journal Title

      Technical Report of IEICE AI2003-2 (in Japanese) 103

      Pages: 7-23

    • NAID

      110003176886

    • Description
      From the Final Research Report Summary (English version)
    • Related Report
      2004 Final Research Report Summary
  • [Publications] Akiko Aizawa: "Analysis of Source Identified Text Corpora: Exploring the Statistics of the Reused Text and the Authorship" Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL-03). 383-390 (2003)

    • Related Report
      2003 Annual Research Report
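  • [Publications] Akiko Aizawa: "Improving the Performance of Text Categorization Using Low Frequency Terms" Journal of Information Processing Society of Japan (in Japanese). 44,7. 1720-1730 (2003)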

    • Related Report
      2003 Annual Research Report
  • [Publications] Akiko Aizawa: "Extracting and Analyzing Recycled Word Sentences from Text" IPSJ SIG Notes (in Japanese) 2003-FI-71. 189-196 (2003)

    • Related Report
      2003 Annual Research Report
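  • [Publications] Akiko Aizawa: "On the Analysis of Source Identified Text Corpora" Proceedings of the 17th Annual Conference of the Japanese Society for Artificial Intelligence (in Japanese), 1C5-05. (2003)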

    • Related Report
      2003 Annual Research Report
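  • [Publications] Hidekazu Nakawatase, Akiko Aizawa: "Word Sense Discrimination based on Complete N-partite Graph" Technical Report of IEICE (Artificial Intelligence and Knowledge Processing) (in Japanese). 103. 7-23 (2003)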

    • Related Report
      2003 Annual Research Report
  • [Publications] Hidekazu Nakawatase, Akiko Aizawa: "Discovering Homographs using N-partite Graph Clustering" Proceedings of the 6th International Conference on Discovery Science (DS'03). 402-409 (2003)

    • Related Report
      2003 Annual Research Report


Published: 2003-04-01   Modified: 2016-04-21  
