Efficient Plagiarism Detection Based on Image Processing for Documents

Research Project

Project/Area Number	19K12133
Research Category	Grant-in-Aid for Scientific Research (C)
Allocation Type	Multi-year Fund
Section	General
Review Section	Basic Section 61030: Intelligent informatics-related
Research Institution	Okayama University (2020-2021); FUJITSU LABORATORIES LTD. (2019)
Principal Investigator	Baba Kensuke Okayama University, Cyber-Physical Engineering Informatics Research Core, Specially Appointed Professor (70380681)
Project Period (FY)	2019-04-01 – 2022-03-31
Project Status	Completed (Fiscal Year 2021)
Budget Amount	¥4,290,000 (Direct Cost: ¥3,300,000, Indirect Cost: ¥990,000); Fiscal Year 2021: ¥910,000 (Direct Cost: ¥700,000, Indirect Cost: ¥210,000); Fiscal Year 2020: ¥1,820,000 (Direct Cost: ¥1,400,000, Indirect Cost: ¥420,000); Fiscal Year 2019: ¥1,560,000 (Direct Cost: ¥1,200,000, Indirect Cost: ¥360,000)
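Keywords	Search / Document analysis / Plagiarism detection / Natural language processing / Distributed representation / Frequency analysis / Information retrieval / Text processing / Image processing
Outline of Research at the Start	This research develops an efficient plagiarism detection method for large-scale data. It reduces the size of the data needed to quickly compute the similarity between an input document and a large collection of documents. By representing the words that occur in a document as vectors, it applies the idea of image filtering techniques to documents. As the research procedure, we first develop a method for obtaining suitable vector representations of words and design an algorithm that computes the similarity between documents. Next, we implement the algorithm and collect document data to serve as targets for plagiarism detection. Finally, we measure the detection accuracy, the execution time, and the size of the detection data to verify the effectiveness of the proposed technique.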
Outline of Final Research Achievements	We developed an efficient plagiarism detection method for large-scale document data. Computing the similarity between an input document and a large document collection quickly requires either a long computation time or a large data structure. To solve this problem, we applied the idea of image filters to documents in order to reduce the size of the data used for plagiarism detection. By using a vector representation of words, the proposed method can detect not only plagiarism based on simple string matching but also plagiarism based on word similarity. We applied the proposed method to documents in the institutional repository of Okayama University and to research-related documents owned by each department, and implemented it as a search system for researchers and research seeds.
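Academic Significance and Societal Importance of the Research Achievements	As a result of this research, general knowledge obtained by applying machine learning techniques to large-scale document data can now be used for a concrete application, namely plagiarism detection. Machine learning techniques can convert words into numerical vectors, which makes it possible to treat a document like an image. Using this idea, image processing methods that exhaustively examine similar regions can now be applied to documents. As a result, we obtained a fast and space-efficient method for detecting similar parts between documents that tolerates a certain degree of ambiguity.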
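
The idea described above, treating a document as an image-like sequence of word vectors and filtering it to shrink the data, can be illustrated with a minimal Python sketch. This is not the project's actual algorithm, which the record does not specify. The toy vectors in `TOY_VECTORS`, the helper names `embed`, `mean_pool`, and `best_match`, and the choice of a mean (box) filter with cosine similarity are all assumptions made for illustration.

```python
import math

# Hypothetical 2-D word vectors for illustration only; a real system would
# learn higher-dimensional vectors from a large corpus.
TOY_VECTORS = {
    "students": (0.0, 1.0), "pupils": (0.1, 0.9),
    "copy": (1.0, 0.0), "duplicate": (0.9, 0.1),
    "essays": (0.7, 0.7), "reports": (0.6, 0.8),
    "rain": (-0.7, -0.7), "falls": (0.0, -1.0),
}

def embed(words):
    """Turn a word sequence into a sequence of vectors (unknown words -> zero)."""
    return [TOY_VECTORS.get(w, (0.0, 0.0)) for w in words]

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_pool(vectors, width):
    """Box filter with downsampling: average each run of `width` vectors.

    The output has about len(vectors) / width entries, which is the
    size reduction this sketch assumes.
    """
    pooled = []
    for start in range(0, len(vectors), width):
        block = vectors[start:start + width]
        pooled.append(tuple(sum(axis) / len(block) for axis in zip(*block)))
    return pooled

def best_match(query, target):
    """Slide `query` over `target`; return (best mean cosine, offset).

    Returns (-inf, -1) when the query is empty or longer than the target.
    """
    best_score, best_offset = float("-inf"), -1
    if not query:
        return best_score, best_offset
    for offset in range(len(target) - len(query) + 1):
        window = target[offset:offset + len(query)]
        score = sum(cosine(q, t) for q, t in zip(query, window)) / len(query)
        if score > best_score:
            best_score, best_offset = score, offset
    return best_score, best_offset

source = "rain falls rain students copy essays".split()
suspect = "pupils duplicate reports".split()  # shares no exact word with source

# Full-resolution comparison, then the same comparison on pooled data.
print(best_match(embed(suspect), embed(source)))
print(best_match(mean_pool(embed(suspect), 3), mean_pool(embed(source), 3)))
```

In this toy example the suspect passage shares no exact word with the source, so simple string matching would find nothing, yet the vector comparison locates the paraphrased span at word offset 3. The pooled comparison works on a third as many vectors and still points at the matching block. Pooling discards word order inside each block and depends on block alignment, so in this sketch it is best read as a coarse prefilter rather than a complete detector.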

Report

(4 results)

2021 Annual Research Report; Final Research Report (PDF)
2020 Research-status Report
2019 Research-status Report

Research Products
(2 results)

All Patent(Industrial Property Rights) (2 results)

[Patent(Industrial Property Rights)] Change detection program, change detection device, and change detection method 2020
- Inventor(s)
  Baba Kensuke
- Industrial Property Rights Holder
  Fujitsu Limited
- Industrial Property Rights Type
  Patent
- Industrial Property Number
  2020-085172
- Filing Date
  2020
- Related Report
  2020 Research-status Report
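[Patent(Industrial Property Rights)] Similar document search method, similar document search program, similar document search device, index information creation method, index information creation program, and index information creation device 2019
- Inventor(s)
  Baba Kensuke, 野呂智哉, 福田茂紀, 大倉清司
- Industrial Property Rights Holder
  Baba Kensuke, 野呂智哉, 福田茂紀, 大倉清司
- Industrial Property Rights Type
  Patent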
- Industrial Property Number
  2019-034306
- Filing Date
  2019
- Related Report
  2019 Research-status Report
