1. Project Overview
Given a similarity function and two sets of objects, a similarity join returns all pairs of objects (from each set respectively) such that their similarity value satisifies a given criterion. A typical example is to find pairs of documents such that their cosine similarity is above a constant threshold (say, 0.95), as they are likely to be near duplicate documents.
In this project, we focus on efficient algorithms to perform the similarity join on (multi) sets or strings both exactly and approximately. Commonly used similarity functions for (multi) sets or strings are more complex than Euclidean distance functions. As a result, many previous approaches have to compute the approximate results instead.
We have developed several fast algorithms that address the above
performance issue. They work for Jaccard, Dice, Cosine similarities
and Edit distance constraints. We have also devised algorithms to
compute topk similar pairs progressively so that a user does not
need to specify a similarity threshold for some unknown
dataset. Recently, we have obtained several interesting new results on
edit similarity search and joins by exploiting asymmetric signature
schemes. The resulting IndexChunkTurbo
algorithm can process most of
the similarity queries efficiently while occupying almost the least
amount of index space.
We also investigated the problem of approximate entity matching. Our SIGMOD09 work can extract approximate mentions of entities in a document given an entity dictionary. Our DEXA11 and SIGMOD13 work considers finding semantically similar but syntactically different entities with the help of transformation rules (e.g., "big apple" = "new york").
We have also worked on similarity queries on other types of objects, e.g., graphs and uncertain/probabilistic sets.
This project is partly funded by ARC Discovery Projects DP0987273, DP0881779, and DP130103401.
2. People
Academic Staff:
PhD Students:

Yifang Sun

Xiaoyang Zhang

Xiaoling Zhou
Alumni:

Sunanda Patro
4. Publications

Yifang Sun, Jianbin Qin, Wei Wang. Near Duplicate Text Detection Using FrequencyBiased Signatures. WISE, 2013.

[PDF TBA]

Summary: Winnowing ^{[Schleimer, Wilkerson, and Aiken, SIGMOD 2013]} is a classic method for near duplicate text detection, with the unique locality property that stands out from other methods. We propose the kstability concept, which is a natural and useful extension of the locality property, and compute the kstability for all Winnowing family algorithms. We further propose a variant of Winnowing algorithm exploiting collection statistics as well as other improvements such as candidate text generation. We experimentally demonstrate the effectiveness and efficiency of our method on the PAN workshop dataset (for Plagiarism Detection).


Jianbin Qin, Wei Wang, Chuan Xiao, Yifei Lu, Xuemin Lin, Haixun Wang. Asymmetric Signature Schemes for Efficient Exact Edit Similarity Query Processing. TODS, 2013.

Summary: This work significantly extends our previous SIGMOD 09 paper. On the theoretical side, it gives lower bounds on minimum signature sizes for contentbased signature schemes (including substrings and subsequences). On the practical side, it gives two new families of algorithms, Super and Turbo, which substantially improves the basic asymmetric signature schemes proposed in our SIGMOD 09 paper. Specifically, the Turbo family algorithms also achieves an improved maximum signature size, and it compares favorably against another recent work (PASSJOIN) belonging to the asymmetric signature scheme (O(τ) vs. O(τ^{3})).

Xiaoyang Zhang, Jianbin Qin, Wei Wang, Yifang Sun, Jiaheng Lu. HmSearch: An Efficient Hamming Distance Query Processing Algorithm. SSDBM 2013, 2013.

Summary: Hamming distance based similarity search has many important applications. Existing methods can only handle very small Hamming distance thresholds and/or cannot deal with data skews effectively. We have developed the HmSearch method which solve the aforementioned limitations.

Jiaheng Lu, Chunbin Lin, Wei Wang, Chen Li, Haiyong Wang. String Similarity Measures and Joins with Synonyms. SIGMOD 2013.

Summary: The majority of approaches to define similarity between two "mentions" (or strings referring to an entity) is based on their syntactical similarities. There are many instances where two strings with very different surface forms may be indeed similar. E.g., "big apple" = "new york". We propose a new similarity measure between two strings in the presence of a set of equivalence rules (generally referred to as "Synonyms" in the paper, but they can also capture other semantics, e.g., typographical errors). The new measure compares favorably with existing measures in both effectiveness and computational efficiency. We have also develop efficient similarity join algorithm with two features: (1) a new index capturing both prefix and length filtering, and (2) an online query optimization method based on a novel estimator.

Chuan Xiao, Jianbin Qin, Wei Wang, Yoshiharu Ishikawa, Koji Tsuda, Kunihiko Sadakane. Efficient Errortolerant Query Autocopletion. PVLDB 2013.

Summary: Traditional query autocompletion only suggest strings whose prefix matches the current input query exactly. In contrast, errortolerant autocompletion allows a fixed amount of edit errors, e.g., suggesting "
facebook
" for the input query "fcebo
". We present a method combining variant generation and trie. Our method achieves a speedup of up to 3,175 times than traditional triebased methods.

Jianbin Qin, Xiaoling Zhou, Wei Wang, Chuan Xiao. Triebased Similarity Search and Join. Scalable Similarity String Search/Join Competition, 2013.

Xiang Zhao, Chuan Xiao, Xuemin Lin, Wei Wang, Yoshiharu Ishikawa. Efficient Processing of Graph Similarity Queries with Edit Distance Constraints. VLDB Journal, 2013.

Ming Gao, Cheqing Jin, Wei Wang, Xuemin Lin, Aoying Zhou. Similarity Query Processing for Probabilistic Sets. ICDE 2013.

Summary: We study a simple yet powerful model and similarity measures for probabilistic sets, and both exact and approximate query processing algorithms, with an emphasis on probabilistic sets with large number of elements (thousands rather than tens as studied in prior work).

Jianbin Qin, Chuan Xiao, Wei Wang, Xuemin Lin. A SpaceEfficient Indexing Algorithm for Boolean Query Processing. WISE 2013.

Summary: There is much redundancy among postings lists. For example, given the postings list of
hong
, the postings list ofkong
can be represented in a much more concise way given that these two terms typically occur together in a document. In this work, we extend this idea to allow M terms to be merged and propose efficient algorithms to achieve this in a computationally efficient way.

Heng Tao Shen, Jiajun Liu, Zi Huang, ChongWah Ngo, and Wei Wang. Nearduplicate video retrieval: Current research and future trends. ACM Computing Survey, 2013.

[PDF]

Summary:


Wei Wang, Jianbin Qin, Chuan Xiao, Xuemin Lin, Heng Tao Shen. VChunkJoin: An Efficient Algorithm for Edit Similarity Joins. TKDE, 2012.

Summary: This work proposes a novel method to efficiently process edit similarity join. Compared with existing methods (e.g., edjoin, vgramjoin), it uses less space yet it is faster in many parameter settings.

Xiang Zhao, Chuan Xiao, Xuemin Lin, Wei Wang. Efficient Graph Similarity Joins with Edit Distance Constraints. ICDE 2012.

Summary: An efficient algorithm to compute graphs within certain (graph) edit distance away from a query graph.

Sunanda Patro, Wei Wang. Learning Topk Transformation Rules. DEXA 2011.

Summary: A new algorithm to learn transformation rules (e.g.,
VLDB
=Very Large Data Base
) from unannotated, potentially noisy data. Compared with previous approaches, our method can find much more valid rules in the topk output.

Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu, Guoren Wang. Efficient Similarity Joins for Near Duplicate Detection. ACM Transaction of Database Systems (TODS).

Summary: This is the journal version of our WWW 2008 paper, with extension to implementing the PPJoin family of algorithms on top of relational database systems.

Jianbin Qin, Wei Wang, Yifei Lu, Chuan Xiao, Xuemin Lin. Efficient Exact Edit Similarity Query Processing with Asymmetric Signature Schemes. SIGMOD 2011.

Chaokun Wang, Jianmin Wang, Xuemin Lin, Wei Wang, Haixun Wang, Hongsong Li, Wanpeng Tian, Jun Xu, Rui Li. MapDupReducer: Detecting Near Duplicates over Massive Datasets. SIGMOD 2010.

Haichuan Shang, Xuemin Lin, Ying Zhang, Jeffrey Xu Yu, Wei Wang. Connected Substructure Similarity Search. SIGMOD 2010.

Wei Wang, Chuan Xiao, Xuemin Lin, Chengqi Zhang. Efficient Approximate Entity Extraction with Edit Distance Constraints. SIGMOD 2009.

Summary: This work presents an efficient algorithm to extract approximate mentions of entities from incoming documents. E.g., we can match Ronald Regan to Ronald Reagan in the Reuters news. The algorithm is also interesting in its own right as it generalizes the neighborhoodgenerationbase approach for approximate string matching.

Wei Wang. Efficient Exact Similarity Join Algorithms. Seminar at University of Technology, Sydney, Oct 2009.

Chuan Xiao, Wei Wang, Xuemin Lin, Haichuan Shang. Topk Set Similarity Joins. ICDE 2009.

Wei Wang. Similarity Join Algorithms: An Introduction. Tutorial at SEBD 2008.

Slides (ppt slides available upon request).


Chuan Xiao, Wei Wang, Xuemin Lin. EdJoin: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints. VLDB 2008.

Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. WWW 2008.
5. Download

Slides

2013, Similarity Query Processing Algorithms: Use of Enumeration and Divide and Conquer Techniques: PPTX (1.5MB) PDF (2.5MB)


Download the latest PPJoin source code here. There is a java ppjoin(plus) package at http://code.google.com/p/ppjoinplus/ and another simjoin package at https://code.google.com/p/similarityjointools/. A python implementation is available at https://github.com/teh/ppjoin

Download the PPJoin/EdJoin binary code (including a simple documentation) here (1.3MB)

Download the PPJoin/EdJoin datasets (i.e., DBLP, TEXAS, TREC, and UNIREF here (233MB). Or download the DBLP (bzip2 compressed) dataset (25MB).

Download the NGPP binary code (i.e., our SIGMOD09 algorithm) here and a simple README.

Download the GramChunk binary code (i.e., our SIGMOD11 algorithms) here, or read the online documentation.