2019 Fiscal Year Research-status Report

Efficient Query Processing for Learning-based Data Management

Research Project

Project/Area Number	19K11979
Research Institution	Osaka University
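Principal Investigator	Chuan Xiao, Osaka University, Graduate School of Information Science and Technology, Specially Appointed Associate Professor (full-time) (10643900)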
Project Period (FY)	2019-04-01 – 2022-03-31
Keywords	query processing / ML + DB
Outline of Annual Research Achievements	There were two major achievements in FY2019. First, we developed efficient query processing methods for embeddings. We focused on dense high-dimensional data, which are widely used in important real-world applications. Building on the findings of our pilot studies, we reached a solution that works for binary high-dimensional vectors and efficiently answers similarity search and join queries with Hamming distance constraints. Our experimental results showed very promising query processing performance (4 to 10 times faster than existing solutions). We published these findings in IEEE Transactions on Knowledge and Data Engineering (TKDE). Second, we started studying efficient blocking techniques for queries with learning-based predicates, which can be used for entity matching. The blocking rules are a conjunction of similarity predicates generated through active learning. To apply these blocking rules efficiently, we modeled the task as a query optimization problem and developed a deep learning-based method that generates fast query plans through cardinality estimation. The proposed approach is up to one order of magnitude faster than the traditional method, which employs sampling techniques for cardinality estimation. This study has been accepted as a full research paper at the ACM SIGMOD International Conference on Management of Data (SIGMOD) 2020. In addition, we reported a series of findings related to this project in premier database journals and conferences such as the VLDB Journal and the IEEE International Conference on Data Engineering (ICDE) 2019.
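To illustrate the kind of Hamming-distance similarity search mentioned above, here is a minimal Python sketch of the basic pigeonhole filter: if two binary vectors differ in at most tau bits and each is split into tau + 1 parts, at least one part must match exactly. This is only the textbook principle, not the generalized method of the TKDE paper; the function names (`split_parts`, `build_index`, `hamming_search`) and the representation of vectors as Python integers are assumptions made for illustration.

```python
from collections import defaultdict


def split_parts(vec: int, n_bits: int, m: int) -> list[int]:
    """Split an n_bits-wide integer into m contiguous parts (last part takes the remainder)."""
    base = n_bits // m
    parts = []
    shift = 0
    for i in range(m):
        width = base if i < m - 1 else n_bits - base * (m - 1)
        parts.append((vec >> shift) & ((1 << width) - 1))
        shift += width
    return parts


def build_index(data: list[int], n_bits: int, m: int) -> list[dict]:
    """One inverted index per part: part value -> ids of vectors having that value."""
    index = [defaultdict(list) for _ in range(m)]
    for rid, vec in enumerate(data):
        for i, part in enumerate(split_parts(vec, n_bits, m)):
            index[i][part].append(rid)
    return index


def hamming_search(query: int, tau: int, data: list[int], index: list[dict], n_bits: int) -> list[int]:
    """Return ids of vectors within Hamming distance tau of the query."""
    m = len(index)
    if m != tau + 1:
        raise ValueError("the basic pigeonhole filter needs an index with tau + 1 parts")
    # Filter: a true result must share at least one part exactly with the query.
    candidates = set()
    for i, part in enumerate(split_parts(query, n_bits, m)):
        candidates.update(index[i].get(part, ()))
    # Verify: compute the exact Hamming distance only for the candidates.
    return sorted(rid for rid in candidates if bin(data[rid] ^ query).count("1") <= tau)
```

For example, with `data = [0b00000000, 0b00000011, 0b11110000, 0b00000111]`, an index built by `build_index(data, 8, 2)`, and the query `0b00000001` with `tau = 1`, the vector `0b11110000` is never verified because it shares no part with the query.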
Current Status of Research Progress	1: Research has progressed more than originally planned. Reason: In our plan for FY2019, we aimed to finish Task 1 and develop efficient query processing methods for embeddings. We reached this goal and developed a solution that works for binary high-dimensional vectors and efficiently answers similarity search and join queries with Hamming distance constraints. We published these findings in IEEE Transactions on Knowledge and Data Engineering (TKDE), a premier journal in the database area. In addition, we made an initial attempt at Task 2, developing generic blocking techniques for queries with learning-based predicates, which was originally planned as a target for FY2020. We modeled this task as a query optimization problem and developed a deep learning-based method that generates fast query plans through cardinality estimation. This study has been accepted as a full research paper at the ACM SIGMOD International Conference on Management of Data (SIGMOD) 2020, a top-tier conference in the database area. We also published several works in premier database journals and conferences such as the VLDB Journal and the IEEE International Conference on Data Engineering (ICDE) 2019. Based on these achievements in FY2019, we believe that the project has been progressing more smoothly than initially planned.
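The idea of treating a conjunction of predicates as a query optimization problem can be sketched as follows: given a cardinality estimate for each predicate, evaluate the most selective predicate first so that later predicates see fewer records. This is a simplified illustration, not the method of the SIGMOD 2020 paper; the `estimate` callable stands in for any cardinality estimator (learned or sampling-based), and all names below are hypothetical.

```python
from typing import Callable, Sequence

Predicate = Callable[[dict], bool]


def plan_conjunction(
    predicates: Sequence[tuple[str, Predicate]],
    estimate: Callable[[str], float],
) -> list[tuple[str, Predicate]]:
    """Order named predicates by ascending estimated cardinality (most selective first)."""
    return sorted(predicates, key=lambda p: estimate(p[0]))


def run_plan(records: Sequence[dict], plan: Sequence[tuple[str, Predicate]]) -> tuple[list[dict], int]:
    """Apply predicates in plan order; return the survivors and the number of predicate evaluations."""
    evaluations = 0
    survivors = list(records)
    for _name, pred in plan:
        evaluations += len(survivors)
        survivors = [r for r in survivors if pred(r)]
    return survivors, evaluations
```

The result of the conjunction is the same for every ordering; only the evaluation count changes, which is why the quality of the cardinality estimates matters for plan quality.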
Strategy for Future Research Activity	In FY2020, we will report our findings and give a tutorial on similarity query processing for high-dimensional data at the International Conference on Very Large Data Bases (VLDB) 2020, a top-tier conference in the database area. We will continue our investigation of Task 2 and develop generic blocking techniques for queries with learning-based predicates; we plan to submit this work to VLDB 2021. Another ongoing effort is to extend the method developed for Task 1 so that it handles not only binary vectors with Hamming distance constraints but also real-valued vectors with Euclidean distance or cosine similarity constraints. We plan to submit this work to a top-tier database conference (SIGMOD 2021 or VLDB 2021). In addition, we will prepare for Task 3, which covers system prototyping and evaluation. At the end of FY2020, we will start implementing a prototype system that integrates all the methods proposed during this research period. The system will be designed to run on Apache Spark or Amazon Web Services for distributed query processing on very large datasets.
Causes of Carryover	In FY2019, the funding was mainly used for registering for and attending academic conferences to report our findings. Due to the COVID-19 outbreak, the on-site Forum on Data Engineering and Information Management (DEIM) 2020 at the planned venue (Bandaiatami, Fukushima) was canceled, and the forum was held online in March instead. The PI was therefore unable to attend on site, which left 85,243 yen budgeted for travel expenses unused. The PI requests that this amount be carried forward to FY2020, when it may be used for conference registration, journal publication, and equipment purchases.

Research Products
(17 results)

All Int'l Joint Research (2 results) Journal Article (5 results) (of which Int'l Joint Research: 3 results, Peer Reviewed: 5 results, Open Access: 5 results) Presentation (7 results) (of which Int'l Joint Research: 4 results) Remarks (3 results)

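[Int'l Joint Research] University of New South Wales / University of Melbourne (Australia)
- Country Name
  AUSTRALIA
- Counterpart Institution
  University of New South Wales / University of Melbourne
[Int'l Joint Research] Hong Kong University of Science and Technology / Beijing Institute of Technology / Shenzhen Institute of Computing Sciences (China)
- Country Name
  CHINA
- Counterpart Institution
  Hong Kong University of Science and Technology / Beijing Institute of Technology / Shenzhen Institute of Computing Sciences
- # of Other Institutions
  1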
[Journal Article] Efficient Query Autocompletion with Edit Distance-based Error Tolerance (2020)
- Author(s)
  Jianbin Qin, Chuan Xiao, Sheng Hu, Jie Zhang, Wei Wang, Yoshiharu Ishikawa, Koji Tsuda, Kunihiko Sadakane
- Journal Title
  The VLDB Journal
  Volume: - Pages: -
- DOI
  10.1007/s00778-019-00595-4
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Generalizing the Pigeonhole Principle for Similarity Search in Hamming Space (2020)
- Author(s)
  Jianbin Qin, Chuan Xiao, Yaoshu Wang, Wei Wang, Xuemin Lin, Yoshiharu Ishikawa, Guoren Wang
- Journal Title
  IEEE Transactions on Knowledge and Data Engineering
  Volume: - Pages: -
- DOI
  10.1109/TKDE.2019.2899597
- Peer Reviewed / Open Access / Int'l Joint Research
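[Journal Article] Compressed Index for Trajectory Data on Road Networks (道路ネットワーク上の軌跡データに対する圧縮索引) (2020)
- Author(s)
  Satoshi Koide, Chuan Xiao, Yoshiharu Ishikawa
- Journal Title
  IEICE Transactions on Information and Systems D, Japanese Edition (電子情報通信学会論文誌 D)
  Volume: J103-D Pages: 393-402
- DOI
  10.14923/transinfj.2019DET0001
- Peer Reviewed / Open Access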
[Journal Article] Scope-aware Code Completion with Discriminative Modeling (2019)
- Author(s)
  Sheng Hu, Chuan Xiao, Yoshiharu Ishikawa
- Journal Title
  IPSJ Journal of Information Processing
  Volume: 27 Pages: 469-478
- DOI
  10.2197/ipsjjip.27.469
- Peer Reviewed / Open Access
[Journal Article] Building Hierarchical Spatial Histograms for Exploratory Analysis in Array DBMS (2019)
- Author(s)
  Jing Zhao, Yoshiharu Ishikawa, Lei Chen, Chuan Xiao, Kento Sugiura
- Journal Title
  IEICE Transactions on Information and Systems
  Volume: E102-D Pages: 788-799
- DOI
  10.1587/transinf.2018DAP0020
- Peer Reviewed / Open Access / Int'l Joint Research
[Presentation] Monotonic Cardinality Estimation of Similarity Selection: A Deep Learning Approach (2020)
- Author(s)
  Yaoshu Wang, Chuan Xiao, Jianbin Qin, Xin Cao, Yifang Sun, Wei Wang, Makoto Onizuka
- Organizer
  ACM SIGMOD International Conference on Management of Data (SIGMOD 2020)
- Int'l Joint Research
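[Presentation] Distributed Transaction Control Using a Ticket-Based Method in a P2P Data Integration Architecture (P2P型データ統合アーキテクチャにおけるチケットベース手法を用いた分散トランザクション制御) (2020)
- Author(s)
  Kota Miyake, Yusuke Wakuta, Yuya Sasaki, Chuan Xiao, Makoto Onizuka
- Organizer
  The 12th Forum on Data Engineering and Information Management (DEIM 2020)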
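[Presentation] A Scalable Method for Inferring the Full Names of Abbreviations Based on Tries and GMM (トライ木及びGMMに基づく略語のフルネームのスケーラブルな推測手法) (2020)
- Author(s)
  高明敏, Chuan Xiao, Yoshiharu Ishikawa
- Organizer
  The 12th Forum on Data Engineering and Information Management (DEIM 2020)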
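[Presentation] A Unified Query Paradigm for Efficient Search of Diversified Trajectories (多様化軌跡を効率検索するための統合クエリパラダイム) (2020)
- Author(s)
  Sheng Hu, Qiang Ma, Chuan Xiao
- Organizer
  The 12th Forum on Data Engineering and Information Management (DEIM 2020)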
[Presentation] Distributed Transaction Management for P2P-based Update Propagation (2019)
- Author(s)
  Makoto Onizuka, Yusuke Wakuta, Yuya Sasaki, Chuan Xiao
- Organizer
  The 3rd Workshop on Software Foundations for Data Interoperability (SFDI 2019)
- Int'l Joint Research
[Presentation] Autocompletion for Prefix-Abbreviated Input (2019)
- Author(s)
  Sheng Hu, Chuan Xiao, Jianbin Qin, Yoshiharu Ishikawa, Qiang Ma
- Organizer
  ACM SIGMOD International Conference on Management of Data (SIGMOD 2019)
- Int'l Joint Research
[Presentation] Dynamic Set kNN Self-Join (2019)
- Author(s)
  Daichi Amagata, Takahiro Hara, Chuan Xiao
- Organizer
  The 35th IEEE International Conference on Data Engineering (ICDE 2019)
- Int'l Joint Research
[Remarks] Osaka University, Big Data Engineering Laboratory, Onizuka Lab
- URL
  http://www-bigdata.ist.osaka-u.ac.jp/ja/paper/
[Remarks] Nagoya University, Graduate School of Informatics, Database Laboratory (Ishikawa Lab)
- URL
  https://www.db.is.i.nagoya-u.ac.jp/ja/research/publications/
[Remarks] Chuan Xiao's homepage
- URL
  https://sites.google.com/site/chuanxiao1983/publication
