A Study on Data-Space Generating Operations on Massive Data Platforms
Project/Area Number |
24500109
|
Research Category |
Grant-in-Aid for Scientific Research (C)
|
Allocation Type | Multi-year Fund |
Section | 一般 |
Research Field |
Media informatics/Database
|
Research Institution | The University of Electro-Communications |
Principal Investigator |
OHMORI Tadashi 電気通信大学, その他の研究科, 教授 (30233274)
|
Project Period (FY) |
2012-04-01 – 2016-03-31
|
Project Status |
Completed (Fiscal Year 2015)
|
Budget Amount *help |
¥2,990,000 (Direct Cost: ¥2,300,000、Indirect Cost: ¥690,000)
Fiscal Year 2014: ¥650,000 (Direct Cost: ¥500,000、Indirect Cost: ¥150,000)
Fiscal Year 2013: ¥910,000 (Direct Cost: ¥700,000、Indirect Cost: ¥210,000)
Fiscal Year 2012: ¥1,430,000 (Direct Cost: ¥1,100,000、Indirect Cost: ¥330,000)
|
Keywords | 大規模データ処理 / MapReduce / 類似結合 / 多対多関係 / 編集距離 / ハッシュ結合 / 巨大データ処理 / 2段階ハッシュ分割 / 編集距離結合 / mapreduce / 結合演算 / map/reduce |
Outline of Final Research Achievements |
Similarity joins on massive datasets are useful operations to detect many-to-many relationship residing in target datasets. However many join algorithms on various similarity functions are known to have unstable performance on map/reduce systems. The objective of this research is to clarify reasons of this unstablity, and to solve it. To do so, the research proposes two new algorithmic frameworks. One is the hybrid-hash join enhanced with bucket-regrouping techniques, named HSJ+BR. It solves unexpected unbalance between reducers without intermediate mapreduce jobs. The other is called two-stage hash-partitioning strategy. It can greatly reduce the shuffle overhead caused by too much record-replication associated with many similarity join algorithms. Using these two frameworks, it is shown that stable and efficient performance of similarity joins on map/reduce systems (where, as typical cases, m-to-n equi-join and edit-distance join are used) is achieved.
|
Report
(5 results)
Research Products
(7 results)