2020 Fiscal Year Research-status Report
Efficient Query Processing for Learning-based Data Management
Project/Area Number |
19K11979
|
Research Institution | Osaka University |
Principal Investigator |
肖 川 Osaka University, Graduate School of Information Science and Technology, Associate Professor (full-time) (10643900)
|
Project Period (FY) |
2019-04-01 – 2022-03-31
|
Keywords | query processing / machine learning / database / data science |
Outline of Annual Research Achievements |
There were two major achievements in FY2020. First, we studied efficient blocking techniques for queries with learning-based predicates. A blocking rule is a conjunction of similarity predicates over high-dimensional data. To apply blocking rules efficiently, we modeled rule evaluation as a query optimization problem. We developed a learning-based method that accurately estimates the cardinality of each similarity predicate and chooses the processing order with the smallest cost. Experiments demonstrated both the effectiveness and the efficiency of our approach. This study has been accepted as a full research paper at the ACM SIGMOD International Conference on Management of Data (SIGMOD) 2021.
Second, we studied the problem of joinable table discovery in data lakes, which is an important task for data enrichment. We proposed to embed textual values as high-dimensional vectors and to join columns on similarity predicates over these vectors, thereby addressing the limitations of traditional equi-join approaches and identifying more meaningful results. We devised a series of techniques to speed up the discovery process. Our solution identifies substantially more useful results than equi-joins and outperforms other similarity-based options. Its efficiency was also demonstrated through experimental evaluation. This study appeared as a full research paper at the IEEE International Conference on Data Engineering (ICDE) 2021.
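To make the ordering step for blocking rules concrete, the following is a minimal sketch and not the method of the SIGMOD 2021 paper. It assumes that a selectivity estimate and a per-evaluation cost are already available for each similarity predicate (in the project these estimates would come from the learned cardinality estimator; here they are hard-coded, hypothetical numbers), and that the predicates are independent. Under those assumptions, the classical rule of sorting by cost / (1 - selectivity) minimizes the expected evaluation cost of the conjunction. All predicate names are illustrative.

```python
from itertools import permutations


def expected_cost(order):
    """Expected cost per candidate pair when predicates run in this order.

    Each predicate is (name, cost, selectivity). A predicate is evaluated
    only if every earlier predicate passed; independence is assumed.
    """
    total, pass_prob = 0.0, 1.0
    for _name, cost, selectivity in order:
        total += pass_prob * cost
        pass_prob *= selectivity
    return total


def order_predicates(predicates):
    """Sort by cost / (1 - selectivity), the classical rank for independent filters."""
    def rank(predicate):
        _name, cost, selectivity = predicate
        # A predicate that filters nothing is pushed to the end.
        return cost / (1.0 - selectivity) if selectivity < 1.0 else float("inf")
    return sorted(predicates, key=rank)


# Hypothetical (name, cost, estimated selectivity) triples.
predicates = [
    ("title_cosine", 4.0, 0.10),
    ("author_jaccard", 1.0, 0.50),
    ("abstract_euclidean", 8.0, 0.05),
]

plan = order_predicates(predicates)

# Brute-force check over all orders, feasible only for a handful of predicates.
best = min(expected_cost(list(o)) for o in permutations(predicates))

print([name for name, _, _ in plan], expected_cost(plan), best)
```

The brute-force comparison is included only to show that the rank-based order matches the cheapest order on this toy input; the quality of any such plan depends on how accurate the cardinality estimates are.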
|
Current Status of Research Progress |
1: Research has progressed more than originally planned.
Reason
Our plan for FY2020 was to finish Task 2 and develop generic blocking techniques for queries with learning-based predicates. We achieved this goal and developed a solution that works for high-dimensional vectors and a variety of similarity functions over high-dimensional data. We published these results at the ACM SIGMOD International Conference on Management of Data (SIGMOD) 2021, a top-tier conference in the database area. We also explored the problem of joinable table discovery in data lakes. We targeted the case in which textual values are embedded as high-dimensional vectors and columns are joined on similarity predicates over these vectors. This study was published at the IEEE International Conference on Data Engineering (ICDE) 2021, also a top-tier conference in the database area. Based on these achievements in FY2020, we believe that the project has been progressing more smoothly than initially planned. In addition, we started the initial work of implementing a prototype system that integrates all the methods proposed in this research period.
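The idea of joining columns on similarity predicates over embedded values can be illustrated with a brute-force sketch. This is not the ICDE 2021 algorithm and includes none of its speed-up techniques. It assumes, purely for illustration, that a column's joinability is the fraction of query values that have at least one target value with cosine similarity above a threshold; the 2-dimensional vectors, column names, and both thresholds are hypothetical stand-ins for real embeddings and tuned parameters.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def joinability(query_col, target_col, sim_threshold):
    """Fraction of query vectors with at least one sufficiently similar target vector."""
    if not query_col:
        return 0.0
    matched = sum(
        1
        for q in query_col
        if any(cosine_similarity(q, t) >= sim_threshold for t in target_col)
    )
    return matched / len(query_col)


# Hypothetical embeddings of the values in a query column.
query = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Hypothetical embedded columns in a data lake.
lake = {
    "table_a.city": [(0.9, 0.1), (0.1, 0.9)],
    "table_b.code": [(-1.0, 0.0), (0.0, -1.0)],
}

# Score every column with a nested-loop comparison, then keep the joinable ones.
scores = {name: joinability(query, col, 0.95) for name, col in lake.items()}
joinable = [name for name, score in scores.items() if score >= 0.6]

print(scores, joinable)
```

Unlike an equi-join, this matches values whose embeddings are close rather than identical. The nested loops cost time proportional to the product of the column sizes, which is why dedicated techniques are needed at data-lake scale.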
|
Strategy for Future Research Activity |
In FY2021, our ongoing work is to further study the problem of joinable table discovery in data lakes. We will explore the direction of column embedding. The new approach is expected to be significantly more efficient than the one we proposed at ICDE 2021 while retaining its accuracy. In addition, the new approach can be extended to solve other related problems in data lake management. Another academic goal of FY2021 is to complete Task 3 and work on system prototyping and evaluation. We have already started the initial work of implementing the prototype system. The implemented system will integrate all the methods proposed in this research period, and we will seek opportunities to release it.
|
Causes of Carryover |
Due to the COVID-19 outbreak, the PI was unable to attend onsite conferences, which resulted in the aforementioned unused amount that would otherwise have been used for travel expenses. The PI requests that this amount be carried forward to FY2021, during which conference registration, journal publication, and equipment purchases may occur.
|
Research Products
(14 results)