Efficient Exact Similarity Join

1. Project Overview

Given a similarity function and two sets of object, a similarity join returns all pairs of objects (from each set respectively) such that their similarity value satisifies a given criterion. A typical example is to find pairs of documents such that their cosine similarity is above a constant (say, 0.95), as they are likely to be near duplicate documents.

In this project, we focus on efficient algorithms to perform the similarity join exactly on (multi-) sets or strings. Commonly used similarity functions for (multi-) sets or strings are more complex than Euclidean distance functions. As a result, many previous approaches have to compute the approximate results instead.

We have developed several fast algorithms that address the above performance issue. They work for Jaccard, Dice, Cosine similarities and Edit distance constraints. We have also devised algorithms to compute top-k similar pairs progressively so that a user does not need to specify a similarity threshold for some unknown dataset.

This project is partly funded by ARC Discovery Project DP0987273.

2. People

Academic Staff:

PhD Student:

3. Publications

  1. Wei Wang. Efficient Exact Similarity Join Algorithms. Seminar at University of Technology, Sydney, Oct 2009.

  2. Chuan Xiao, Wei Wang, Xuemin Lin, Haichuan Shang. Top-k Set Similarity Joins. ICDE 2009.

  3. Wei Wang. Similarity Join Algorithms: An Introduction. Tutorial at SEBD 2008.

  4. Chuan Xiao, Wei Wang, Xuemin Lin. Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints. VLDB 2008.

  5. Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. WWW 2008.

4. Download