VLDB 2020 Tutorial: Similarity Query Processing for High-Dimensional Data

This is the resource page for our VLDB 2020 tutorial titled “Similarity Query Processing for High-Dimensional Data” by Jianbin Qin, Wei Wang, Chuan Xiao, and Ying Zhang.


Jianbin Qin, Wei Wang, Chuan Xiao, Ying Zhang: Similarity Query Processing for High-Dimensional Data. Proc. VLDB Endow. 13(12): 3437-3440 (2020) pdf


Similarity query processing has been an active research topic for several decades and is an essential procedure in a wide range of applications. Recently, embedding and auto-encoding methods, as well as pre-trained models, have gained popularity. These methods typically operate on high-dimensional data, and this trend brings new opportunities and challenges to similarity query processing for such data. Meanwhile, new techniques have emerged to tackle this long-standing problem both theoretically and empirically.

In this tutorial, we summarize existing solutions, especially recent advancements from both the database (DB) and machine learning (ML) communities, and analyze their strengths and weaknesses. We review exact and approximate methods such as cover trees, locality-sensitive hashing, product quantization, and proximity graphs. We also discuss the selectivity estimation problem and show how researchers are bringing in state-of-the-art ML techniques to address it. By highlighting the strong connections between DB and ML, we hope that this tutorial provides an impetus towards new ML-for-DB solutions and vice versa.


Slides (updated on 4 Sept 2020):

  1. Introduction
  2. Exact Query Processing
  3. Approximate-1
  4. Approximate-2
  5. Approximate-3
  6. Estimation
  7. Open Problems