2019 Fiscal Year Research-status Report
Efficient Query Processing for Learning-based Data Management
Project/Area Number |
19K11979
|
Research Institution | Osaka University |
Principal Investigator |
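肖 川 Osaka University, Graduate School of Information Science and Technology, Specially Appointed Associate Professor (full-time) (10643900)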
|
Project Period (FY) |
2019-04-01 – 2022-03-31
|
Keywords | query processing / ML + DB |
Outline of Annual Research Achievements |
There were two major achievements in FY2019. First, we developed efficient query processing methods for embeddings. We focused on dense high-dimensional data, which are widely used in important real-world applications. Building on the discoveries from our pilot studies, we reached a solution that works for binary high-dimensional vectors and efficiently answers similarity search and join queries with Hamming distance constraints. Our experimental results showed very promising query processing performance (4 to 10 times faster than existing solutions). We published these results in IEEE Transactions on Knowledge and Data Engineering (TKDE). Second, we started studying efficient blocking techniques for queries with learning-based predicates, which can be used for entity matching. The blocking rules are conjunctions of similarity predicates generated through active learning. To apply these blocking rules efficiently, we modeled the task as a query optimization problem and developed a deep learning-based method that generates fast query plans through cardinality estimation. The proposed approach is up to one order of magnitude faster than the traditional method of employing sampling techniques for cardinality estimation. This study has been accepted as a full research paper at the ACM SIGMOD International Conference on Management of Data (SIGMOD) 2020. In addition, we reported a series of discoveries related to this project in premier database journals and conferences such as the VLDB Journal and the IEEE International Conference on Data Engineering (ICDE) 2019.
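For readers unfamiliar with the query types named above, the following is a minimal sketch of similarity search and similarity join under a Hamming distance constraint. It is a brute-force baseline for illustration only, not the method published in TKDE. The function names, the threshold parameter `tau`, and the choice to pack each binary vector into a Python integer are assumptions made for this example.

```python
from typing import List, Tuple


def hamming(a: int, b: int) -> int:
    """Hamming distance between two binary vectors packed into integers."""
    # XOR leaves a 1 bit at every position where the vectors differ.
    return bin(a ^ b).count("1")


def hamming_search(data: List[int], query: int, tau: int) -> List[int]:
    """Indices of all vectors within Hamming distance tau of the query.

    Linear scan over the dataset (assumed baseline, no index).
    """
    return [i for i, v in enumerate(data) if hamming(v, query) <= tau]


def hamming_join(data: List[int], tau: int) -> List[Tuple[int, int]]:
    """All index pairs (i, j) with i < j within Hamming distance tau.

    Nested-loop self-join (assumed baseline, quadratic in the dataset size).
    """
    result = []
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            if hamming(data[i], data[j]) <= tau:
                result.append((i, j))
    return result
```

A scan of this kind compares the query against every vector, which is the cost that index-based methods for Hamming distance aim to avoid.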
|
Current Status of Research Progress |
1: Research has progressed more than originally planned.
Reason
Our plan for FY2019 was to finish Task 1 and develop efficient query processing methods for embeddings. We reached this goal and developed a solution that works for binary high-dimensional vectors and efficiently answers similarity search and join queries with Hamming distance constraints. We published these results in IEEE Transactions on Knowledge and Data Engineering (TKDE), a premier journal in the database area. In addition, we made an initial attempt at Task 2, the development of generic blocking techniques for queries with learning-based predicates, which was originally planned as a target for FY2020. We modeled this task as a query optimization problem and developed a deep learning-based method that generates fast query plans through cardinality estimation. This study has been accepted as a full research paper at the ACM SIGMOD International Conference on Management of Data (SIGMOD) 2020, a top-tier conference in the database area. We also published several works in premier database journals and conferences such as the VLDB Journal and the IEEE International Conference on Data Engineering (ICDE) 2019. Based on these achievements in FY2019, we believe that the project is progressing more smoothly than initially planned.
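To illustrate how cardinality estimation can drive a query plan for conjunctive blocking rules, the sketch below estimates each predicate's selectivity on a sample and evaluates the most selective predicate first. It uses plain sampling rather than the deep learning-based estimator of the accepted paper. All names (`jaccard`, `estimate_selectivity`, `order_predicates`, `apply_blocking`) are hypothetical and chosen for this example.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
Predicate = Callable[[T], bool]


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the whitespace-token sets of two strings."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def estimate_selectivity(pred: Predicate, sample: Sequence[T]) -> float:
    """Fraction of sampled items that satisfy the predicate."""
    if not sample:
        return 1.0  # no evidence: assume the predicate filters nothing
    return sum(1 for x in sample if pred(x)) / len(sample)


def order_predicates(preds: Sequence[Predicate], sample: Sequence[T]) -> List[Predicate]:
    """Query plan: predicates sorted by ascending estimated selectivity."""
    return sorted(preds, key=lambda p: estimate_selectivity(p, sample))


def apply_blocking(candidates: Sequence[T], preds: Sequence[Predicate],
                   sample: Sequence[T]) -> List[T]:
    """Keep the candidates that satisfy every predicate of the blocking rule.

    all() stops at the first failing predicate, so placing the most
    selective predicate first rejects most candidates after one check.
    """
    plan = order_predicates(preds, sample)
    return [c for c in candidates if all(p(c) for p in plan)]
```

In an entity matching setting, each candidate would be a pair of records and each predicate a similarity test such as `lambda pair: jaccard(pair[0], pair[1]) >= 0.5`. The quality of the plan depends on how well the sample reflects the full data.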
|
Strategy for Future Research Activity |
In FY2020, we will report our discoveries and give a tutorial on similarity query processing for high-dimensional data at the International Conference on Very Large Data Bases (VLDB) 2020, a top-tier conference in the database area. We will continue our investigation of Task 2 and develop generic blocking techniques for queries with learning-based predicates. We plan to submit this work to VLDB 2021. Another ongoing effort is to extend the method developed for Task 1 so that it supports efficient query processing not only on binary vectors with Hamming distance constraints but also on real-valued vectors with Euclidean distance or cosine similarity constraints. We plan to submit this work to a top-tier database conference (SIGMOD 2021 or VLDB 2021). In addition, we will prepare for Task 3 and work on system prototyping and evaluation. At the end of FY2020, we will start implementing a prototype system that integrates all the methods proposed during this research period. The system will be designed to run on Apache Spark or Amazon Web Services for distributed query processing on very large datasets.
|
Causes of Carryover |
In FY2019, the funding was mainly used for registering for and attending academic conferences to report our discoveries. Due to the COVID-19 outbreak, the onsite meeting of the Forum on Data Engineering and Information Management (DEIM) 2020 at the planned venue (Bandaiatami, Fukushima) was canceled, and the forum was held online in March. The PI therefore could not attend the forum onsite, which left an unused amount of 85,243 yen that had been budgeted for travel expenses. The PI requests that this amount be carried forward to FY2020, during which it may be used for conference registration, journal publication, and equipment purchases.
|
Research Products
(17 results)