1. Project Overview

Given a similarity function and two sets of objects, a similarity join returns all pairs of objects (from each set respectively) such that their similarity value satisifies a given criterion. A typical example is to find pairs of documents such that their cosine similarity is above a constant threshold (say, 0.95), as they are likely to be near duplicate documents.

In this project, we focus on efficient algorithms to perform the similarity join on (multi-) sets or strings both exactly and approximately. Commonly used similarity functions for (multi-) sets or strings are more complex than Euclidean distance functions. As a result, many previous approaches have to compute the approximate results instead.

We have developed several fast algorithms that address the above performance issue. They work for Jaccard, Dice, Cosine similarities and Edit distance constraints. We have also devised algorithms to compute top-k similar pairs progressively so that a user does not need to specify a similarity threshold for some unknown dataset. Recently, we have obtained several interesting new results on edit similarity search and joins by exploiting asymmetric signature schemes. The resulting IndexChunkTurbo algorithm can process most of the similarity queries efficiently while occupying almost the least amount of index space.

We also investigated the problem of approximate entity matching. Our SIGMOD09 work can extract approximate mentions of entities in a document given an entity dictionary. Our DEXA11 and SIGMOD13 work considers finding semantically similar but syntactically different entities with the help of transformation rules (e.g., "big apple" = "new york").

We have also worked on similarity queries on other types of objects, e.g., graphs and uncertain/probabilistic sets.

This project is partly funded by ARC Discovery Projects DP0987273, DP0881779, and DP130103401.

2. People

Academic Staff:

PhD Students:

  • Yifei Lu

  • Yifang Sun

  • Xiaoyang Zhang

  • Xiaoling Zhou


  • Sunanda Patro

3. Demo

4. Publications

  1. Pei Wang, Chuan Xiao, Jianbin Qin, Wei Wang, Xiaoyang Zhang, Yoshiharu Ishikawa. Local Similarity Search for Unstructured Text. SIGMOD, 2016.

  2. Wei Song, Jianbin Qin, Muhammad Aamir Cheema, Wei Wang. Pre-computed : Region Guardian Sets Based Reverse kNN Queries. DASFAA, 2016. (Best Student Paper)

  3. Xiaochun Yang, Yaoshu Wang, Bin Wang, Wei Wang. Local Filtering: Improving the Performance of Approximate Queries on String Collections. SIGMOD, 2015.

  4. Yifang Sun, Wei Wang, Jianbin Qin, Ying Zhang, Xuemin Lin. SRS: Solving c-Approximate Nearest Neighbor Queries in High Dimensional Euclidean Space with a Tiny Index. PVLDB, 2015.

    • PDF with errata fixed

    • Slides: PPTX, PDF

    • Source code at github

    • Summary:

      • Nearest Neighbor (NN) search in high dimensional spaces is a challenging problem, and even its approximate version is not any easier. We propose a novel index and query processing algorithms to answer c-Approximate NN queries with probabilistic guarantees as well as excellent empirical performance. We demonstrate that our method supports billions of points on a single commondity PC, while previous methods either cannot handle such large-scale datasets or have to resort to powerful computer clusters.

      • Interesting facts:

        • c-approximate NN queries in arbitrarily high-dimensional space can be solved by exact k-NN search in low-dimensional space, with index size and worst-case query time linear in n (for a typical setting of c = 4, the low dimensional space has 6-8 dimensions).

        • Our method can handle any c, even for c = 1.

        • The first conditional probabilistic guarantee for top-k c-ANN queries.

  5. Yifang Sun, Jianbin Qin, Wei Wang. Near Duplicate Text Detection Using Frequency-Biased Signatures. WISE, 2013.

    • PDF

    • Summary: Winnowing [Schleimer, Wilkerson, and Aiken, SIGMOD 2013] is a classic method for near duplicate text detection, with the unique locality property that stands out from other methods. We propose the k-stability concept, which is a natural and useful extension of the locality property, and compute the k-stability for all Winnowing family algorithms. We further propose a variant of Winnowing algorithm exploiting collection statistics as well as other improvements such as candidate text generation. We experimentally demonstrate the effectiveness and efficiency of our method on the PAN workshop dataset (for Plagiarism Detection).

  6. Chin-Wan Chung, Yufei Tao, Wei Wang. I/O-Efficient Dictionary Search with One Edit Error. SPIRE, 2014.

    • PDF

    • Summary: We give a new structure for 1-error dictionary search in the external memory model that uses O(n/B) space, answers a query in O(1 + m/(wB) + k/B) I/Os, and supports the insertion and deletion of a length-l string in O(l) expected I/Os.

  7. Jianbin Qin, Wei Wang, Chuan Xiao, Yifei Lu, Xuemin Lin, Haixun Wang. Asymmetric Signature Schemes for Efficient Exact Edit Similarity Query Processing. TODS, 2013.

    • PDF

    • Summary: This work significantly extends our previous SIGMOD 09 paper. On the theoretical side, it gives lower bounds on minimum signature sizes for content-based signature schemes (including substrings and subsequences). On the practical side, it gives two new families of algorithms, Super and Turbo, which substantially improves the basic asymmetric signature schemes proposed in our SIGMOD 09 paper. Specifically, the Turbo family algorithms also achieves an improved maximum signature size, and it compares favorably against another recent work (PASSJOIN) belonging to the asymmetric signature scheme (O(τ) vs. O(τ3)).

  8. Xiaoyang Zhang, Jianbin Qin, Wei Wang, Yifang Sun, Jiaheng Lu. HmSearch: An Efficient Hamming Distance Query Processing Algorithm. SSDBM 2013, 2013.

    • PDF

    • Summary: Hamming distance based similarity search has many important applications. Existing methods can only handle very small Hamming distance thresholds and/or cannot deal with data skews effectively. We have developed the HmSearch method which solve the aforementioned limitations.

  9. Jiaheng Lu, Chunbin Lin, Wei Wang, Chen Li, Haiyong Wang. String Similarity Measures and Joins with Synonyms. SIGMOD 2013.

    • PDF

    • Summary: The majority of approaches to define similarity between two "mentions" (or strings referring to an entity) is based on their syntactical similarities. There are many instances where two strings with very different surface forms may be indeed similar. E.g., "big apple" = "new york". We propose a new similarity measure between two strings in the presence of a set of equivalence rules (generally referred to as "Synonyms" in the paper, but they can also capture other semantics, e.g., typographical errors). The new measure compares favorably with existing measures in both effectiveness and computational efficiency. We have also develop efficient similarity join algorithm with two features: (1) a new index capturing both prefix and length filtering, and (2) an online query optimization method based on a novel estimator.

  10. Chuan Xiao, Jianbin Qin, Wei Wang, Yoshiharu Ishikawa, Koji Tsuda, Kunihiko Sadakane. Efficient Error-tolerant Query Autocopletion. PVLDB 2013.

    • PDF

    • Summary: Traditional query autocompletion only suggest strings whose prefix matches the current input query exactly. In contrast, error-tolerant autocompletion allows a fixed amount of edit errors, e.g., suggesting "facebook" for the input query "fcebo". We present a method combining variant generation and trie. Our method achieves a speed-up of up to 3,175 times than traditional trie-based methods.

  11. Jianbin Qin, Xiaoling Zhou, Wei Wang, Chuan Xiao. Trie-based Similarity Search and Join. Scalable Similarity String Search/Join Competition, 2013.

  12. Xiang Zhao, Chuan Xiao, Xuemin Lin, Wei Wang, Yoshiharu Ishikawa. Efficient Processing of Graph Similarity Queries with Edit Distance Constraints. VLDB Journal, 2013.

  13. Ming Gao, Cheqing Jin, Wei Wang, Xuemin Lin, Aoying Zhou. Similarity Query Processing for Probabilistic Sets. ICDE 2013.

    • PDF

    • Summary: We study a simple yet powerful model and similarity measures for probabilistic sets, and both exact and approximate query processing algorithms, with an emphasis on probabilistic sets with large number of elements (thousands rather than tens as studied in prior work).

  14. Jianbin Qin, Chuan Xiao, Wei Wang, Xuemin Lin. A Space-Efficient Indexing Algorithm for Boolean Query Processing. WISE 2013.

    • PDF / TR version

    • Summary: There is much redundancy among postings lists. For example, given the postings list of hong, the postings list of kong can be represented in a much more concise way given that these two terms typically occur together in a document. In this work, we extend this idea to allow M terms to be merged and propose efficient algorithms to achieve this in a computationally efficient way.

  15. Heng Tao Shen, Jiajun Liu, Zi Huang, Chong-Wah Ngo, and Wei Wang. Near-duplicate video retrieval: Current research and future trends. ACM Computing Survey, 2013.

    • [PDF]

    • Summary:

  16. Wei Wang, Jianbin Qin, Chuan Xiao, Xuemin Lin, Heng Tao Shen. VChunkJoin: An Efficient Algorithm for Edit Similarity Joins. TKDE, 2012.

    • PDF

    • Summary: This work proposes a novel method to efficiently process edit similarity join. Compared with existing methods (e.g., ed-join, vgram-join), it uses less space yet it is faster in many parameter settings.

  17. Xiang Zhao, Chuan Xiao, Xuemin Lin, Wei Wang. Efficient Graph Similarity Joins with Edit Distance Constraints. ICDE 2012.

    • PDF

    • Summary: An efficient algorithm to compute graphs within certain (graph) edit distance away from a query graph.

  18. Sunanda Patro, Wei Wang. Learning Top-k Transformation Rules. DEXA 2011.

    • PDF

    • Summary: A new algorithm to learn transformation rules (e.g., VLDB = Very Large Data Base) from unannotated, potentially noisy data. Compared with previous approaches, our method can find much more valid rules in the top-k output.

  19. Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu, Guoren Wang. Efficient Similarity Joins for Near Duplicate Detection. ACM Transaction of Database Systems (TODS).

    • PDF

    • Summary: This is the journal version of our WWW 2008 paper, with extension to implementing the PPJoin family of algorithms on top of relational database systems.

  20. Jianbin Qin, Wei Wang, Yifei Lu, Chuan Xiao, Xuemin Lin. Efficient Exact Edit Similarity Query Processing with Asymmetric Signature Schemes. SIGMOD 2011.

    • PDF

    • TBA

    • Summary: Two simple yet highly efficient algorithms are proposed in this paper that works very well for both edit similarity search and joins. Perhaps equally intereting is a comprehensive experiment involding Flamingo (ver 3), PartEnum, Ed-Join, Bed-tree, Trie-Join, NGPP, VGRAM, NS09.

  21. Chaokun Wang, Jianmin Wang, Xuemin Lin, Wei Wang, Haixun Wang, Hongsong Li, Wanpeng Tian, Jun Xu, Rui Li. MapDupReducer: Detecting Near Duplicates over Massive Datasets. SIGMOD 2010.

    • PDF

    • PPT

    • Summary: This work essentially ports the ppjoin+ algorithm to the Map-Reduce framework in order to deal with huge volume of data.

  22. Haichuan Shang, Xuemin Lin, Ying Zhang, Jeffrey Xu Yu, Wei Wang. Connected Substructure Similarity Search. SIGMOD 2010.

    • PDF

    • PPT

    • Summary: This work defines a new (assymmetric) similarity measure between graphs and proposes a new indexing and query processing method to perform efficient subgraph similarity search based on this new measure.

  23. Wei Wang, Chuan Xiao, Xuemin Lin, Chengqi Zhang. Efficient Approximate Entity Extraction with Edit Distance Constraints. SIGMOD 2009.

    • PDF

    • Slides

    • Summary: This work presents an efficient algorithm to extract approximate mentions of entities from incoming documents. E.g., we can match Ronald Regan to Ronald Reagan in the Reuters news. The algorithm is also interesting in its own right as it generalizes the neighborhood-generation-base approach for approximate string matching.

  24. Wei Wang. Efficient Exact Similarity Join Algorithms. Seminar at University of Technology, Sydney, Oct 2009.

  25. Chuan Xiao, Wei Wang, Xuemin Lin, Haichuan Shang. Top-k Set Similarity Joins. ICDE 2009.

    • PDF

    • Slides

    • Summary: Not sure about which similarity threshold to use when performing similarity join? Then this top-k similarity join algorithm is for you; it computes the k pairs of records that have the highest similarity value in a progressive manner.

  26. Wei Wang. Similarity Join Algorithms: An Introduction. Tutorial at SEBD 2008.

    • Slides (ppt slides available upon request).

  27. Chuan Xiao, Wei Wang, Xuemin Lin. Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints. VLDB 2008.

    • PDF

    • Slides

    • Summary: An efficient algorithm, Ed-Join for similarity joins with edit distance constraints. Ed-Join is capable of computing edit similarity join for half a million protein sequences with edit distance threshold up to 20 within a minute.

  28. Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. WWW 2008.

    • PDF

    • Slides

    • Summary: Efficient similarity join algorithms (ppjoin and ppjoin+) for several commonly used similarity functions (including Jaccard, cosine, overlap, edit distance). (Ed-Join is recommended for similarity join with edit distance constraint though)

5. Download

  1. Slides

    • 2014, Locality Sensitive Hashing for Big Data. The 3rd China-Australia Workshop. PPTX (2.4MB) PDF (1MB)

    • 2013, Similarity Query Processing Algorithms: Use of Enumeration and Divide and Conquer Techniques. PPTX (1.5MB) PDF (2.5MB)

  2. Download the latest PPJoin source code here. There is a java ppjoin(plus) package at http://code.google.com/p/ppjoinplus/ and another simjoin package at https://code.google.com/p/similarity-join-tools/. A python implementation is available at https://github.com/teh/ppjoin

  3. Download the PPJoin/Ed-Join binary code (including a simple documentation) here (1.3MB)

  4. Download the PPJoin/Ed-Join datasets (i.e., DBLP, TEXAS, TREC, and UNIREF here (233MB). Or download the DBLP (bzip2 compressed) dataset (25MB).

  5. Download the NGPP binary code (i.e., our SIGMOD09 algorithm) here and a simple README.

  6. Download the GramChunk binary code (i.e., our SIGMOD11 algorithms) here, or read the online documentation.

  7. Download the SRS source code at gibhub: https://github.com/DBWangGroupUNSW/SRS