Documentation

1. Software License Agreement

Please read carefully this Software License Agreement before using the package. By installing and using the package, you agree to be bounded by the terms of this agreement. If you do not agree to the terms of this license, please do not use the package. Using any part of the package indicates that you accept these terms.

  1. This package is for research purpose only. Other usage is not allowed without the permission from the authors. It cannot be used for any commercial interest.

  2. This software license agreement and author information will not be altered.

2. Overview of Programs

2.1. Related Papers

  1. Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu: Efficient Similarity Joins For Near Duplicate Detection. WWW 2008: 131-140.

  2. Chuan Xiao, Wei Wang, Xuemin Lin: Ed-Join: An Efficient Algorithm for Similarity Join with Edit Distance Constraints. VLDB 2008.

  3. Roberto J. Bayardo, Yiming Ma, Ramakrishnan Srikant: Scaling up all pairs similarity search. WWW 2007: 131-140.

2.2. Executables

Program Name Description
txtformat Normalize the text data
txtdedup Remove exact duplicates in the text data file
bindedup Remove exact duplicates in the binary data file
tokenizer Tokenize the text file into tokens and store it in the binary format
qtokenizer Tokenize the text file into q-grams and store it in the binary format
allpairs The implementation of the All-Pairs algorithm in [3]
ppjoin, ppjoinplus The implementations of the PPJoin and PPJoin+ algorithms in [1]
allpairsed The implementation of the All-Pairs-Ed algorithm in [2]
edjoin-l, edjoin The implementation of the Ed-Join and Ed-Join-l algorithm in [2]
edjoin-l, edjoin The implementation of the Ed-Join and Ed-Join-l algorithm in [2]
binary-win32.zip A precompiled version for Microsoft Windows

All binary codes were compiled on Debian 4.1.1-21 using g++ 4.2.3 with -O3 flag.

3. Package Manual

3.1. Preprocessing

3.1.1. txtformat

txtformat processes the input file by converting all punctuation to underscores, and all letters to upper case (default) or lowercase (optional).

Usage:

txtformat [input] [output] [case (optional)]

Arguments:
Argument Description
input input text file
output output text file
case (optional) when case is not supplied, no change of cases to the input file; when case is 'l', the program will convert all letters to their lower cases; when case is 'u', then the program will convert all letters to their upper cases.

Example:

txtformat dblp.txt dblp.norm.txt
txtformat dblp.txt dblp.normlower.txt l

3.1.2. txtdedup

txtdedup removes exact duplicates in the text data file

Usage:

txtdedup [input] [output]

Example:

txtdedup dblp.txt dblp.nodup.txt

3.1.3. bindedup

bindedup removes exact duplicates in the binary data file.

Usage:

bindedup [input] [output]

Example:

bindedup dblp.bin dblp.nodup.bin

3.1.4. Tokenizer

Usage:

tokenizer [Dataset] [Record-Num (Optional)]

Arguments:
Argument Description
Dataset Original dataset in txt format.
Record-Num (Optional) Assign a number N so that only the first N records are tokenized.

Example:

tokenizer dblp.txt
Tokenize dblp.txt into binary dataset. The output filename is dblp.txt.bin.

Description:

Create a dataset in binary form with the input dataset in txt format. Each line in the input dataset is considered as a record, and is tokenized into a SET of tokens with the following characters:

.,?!\t
Specifically, if the same token appears k times in a record, it will be counted as k unique tokens.

The records are then sorted in ascending order of size. The output binary dataset is in the form of:

[record-id][record-size][token-list]
Each field is represented as a 4-byte integer.

Example:

	[102][3][1][2][3]
	[253][5][2][3][4][5][6]
The above lines represent two records. The first (id = 102) has 3 tokens, and the token ids are 1, 2 and 3. The second (id = 253) has 5 tokens, and the token ids are 2, 3, 4, 5, and 6.

The output filename appends '.bin' to the input filename. E.g., the output filename is dataset.txt.bin if the input filename is dataset.txt.

3.1.5. q-gram Tokenizer

Usage:

qtokenizer [Q] [Dataset] [Record-Num (Optional)]

Arguments:
Argument Description
Q q-gram length.
Dataset Original dataset in txt format.
Record-Num (Optional) Assign a number N so that only the first N records are tokenized into q-grams.

Example:

qtokenizer 5 dblp.txt
Tokenize dblp.txt into binary dataset with q-gram length 5. The output filenames are dblp.txt.5gram.bin and dblp.txt.5gram.bin.loc.

Description:

Create a q-gram dataset in binary form with the input original strings. Each line in the input dataset is considered as a record, and tokenized into a MULTISET of q-grams. The records are then sorted in ascending order of size. The outputs are two files. The first output file is the q-gram set of records in the form of:

[record-id] [record-size] [qgram-list]
Each field is represented as a 4-byte integer.

The second output file is the locations of the q-grams in the original string, in the form of:

[location_list]
Each unit is represented as a 2-byte integer. The locations have 1-to-1 correspondence with the _q_-grams in the previous q-gram file.

The output filenames are 'qgram.bin' and 'qgram.bin.loc' appended to the input filename. E.g., the outputs are 'dataset.txt.5gram.bin' and 'dataset.txt.5gram.bin.loc' if the input filename is 'dataset.txt' and the q-gram length is 5. When applying Ed-Join/All-Pairs-Ed family algorithm, please ensure that the '.loc' file is in the same folder as the corresponding '.bin' file.

3.2. Set Similarity Join Algorithms

Usage:

allpairs/ppjoin/ppjoinplus [Function] [Threshold] [Dataset-Bin] [Max-Depth (Optional)]

Algorithms:

Arguments:
Argument Description
Function Similarity functions used in set similarity joins. Choose from j (for Jaccard similarity) and c (for Cosine similarity)
Threshold Simlarity threshold, e.g., 0.95.
Dataset-Bin The filename of dataset in binary form.
Max-Depth (Optional) The max recursive level (MAXDEPTH) for suffix filtering. The default value is 2.

Example:

ppjoin j 0.95 dblp.bin
Run ppjoin algorithm on the dblp dataset with Jaccard similarity threshold 0.95, MAXDEPTH = 2.

Example:

ppjoinplus c 0.8 trec.bin 4 > result.txt
Run ppjoin+ algorithm on trec dataset with Cosine similarity threshold 0.8, MAXDEPTH = 4. Output the join results to file result.txt.

Description: The outputs of the program are:

3.3. Edit Similarity Join Algorithms

Usage:

allpairsed/edjoin-l/edjoin [Tau] [Q] [QGram-Set-Bin] [String] [Min-Length (Optional)]

Algorithms:

Arguments:
Argument Description
Tau Edit Distance Threshold.
Q q-gram length.
QGram-Set-Bin The filename of q-gram sets in binary form.
String The filename of the original string.
Min-Length (Optional) The minimum string length. Strings with length below this value will be discarded. The default value is q * (tau + 1), the minimum required string length for count filtering.

Example:

edjoin 3 5 dblp.bin dblp.txt
Run the Ed-Join algorithm on the dblp dataset. The edit distance threshold is 3, and the q-gram length is 5. To ensure that count filtering won't miss any join result, strings shorter than 20 are discarded.

Example:

edjoin-l 10 8 trec.bin trec.txt 100 > result.txt
Run the Ed-Join algorithm with location-based filtering only on the trec dataset. The edit distance threshold is 10, and the q-gram length is 8. Strings shorter than 100 are discarded. Output join results to file result.txt.

4. Complete Examples

4.1. All-Pairs/PPJoin/PPJoin+

Suppose the original text dataset is texas.raw.txt. To perform set similarity join on this dataset with Jaccard similarity t = 0.85, using PPJoin+, run the following command:

$ ./tokenizer texas.raw.txt
10000 27049
20000 45443
30000 61658
40000 76936
50000 91748
60000 106277
70000 120378
80000 134277
90000 148464
100000 162690
110000 176556
120000 190384
130000 204046
140000 219018
150000 233973
String Number: 154987
Token Sorted!
Duplicate Token Removed!
Vector Sorted!
Tokens: 241317

$ ./ppjoinplus j 0.85 texas.raw.txt.bin > result.txt Document: texas.raw.txt.bin Algorithm: ppjoin+ MAXDEPTH = 2 Threshold: Jaccard 0.85 ... LOADING DATASET ... # Records: 154987 # Average Size: 11.2869 === BEGIN JOIN (TIMER STARTED) === # Distinct Indexed Tokens: 233534 # Inverted Index Entries: 41981 # Distinct Tokens: 242241 # Results: 14 # Candidate Pairs: 2016 # Intersections Performed: 2016 === END JOIN (TIMER STOPPED) === Total Running Time: 0.060 $ cat result.txt 72666 9062 0.857 78919 21526 0.857 104174 104173 0.857 120685 98148 0.857 123381 123380 0.857 125968 125966 0.857 126085 126084 0.857 39025 37904 0.867 120835 51430 0.867 123616 123615 0.867 32880 18161 0.875 58241 45306 0.875 121670 121669 0.875 121861 121859 0.882

4.2. All-Pairs-Ed/Ed-Join-l/Ed-Join

Suppose the original txt dataset is dblp.raw.txt. To perform edit similarity join on this dataset with q = 5, tau = 3, discarding strings shorter than 20 and using Ed-Join, run the following command::

$ ./txtformat dblp.raw.txt dblp.norm.txt

$ ./qtokenizer 6 dblp.norm.txt 10000 273345 20000 417843 30000 552687 40000 666132 50000 771969 60000 871278 70000 971830 80000 1052267 90000 1109326 100000 1162820 110000 1226821 120000 1280482 130000 1335476 140000 1381963 150000 1429457 160000 1476352 170000 1519018 180000 1559930 190000 1619003 200000 1661134 210000 1705322 220000 1745568 230000 1784313 240000 1813389 250000 1849278 260000 1881403 270000 1913562 280000 1941345 290000 1971211 300000 2001216 310000 2027749 320000 2059471 330000 2092141 340000 2135331 350000 2164359 360000 2194595 370000 2229923 380000 2254357 390000 2277680 400000 2308603 410000 2333160 420000 2356993 430000 2376388 440000 2400154 450000 2423227 460000 2450382 470000 2476920 480000 2504020 490000 2520704 500000 2541188 510000 2564068 520000 2585998 530000 2611282 540000 2637431 550000 2660196 560000 2678713 570000 2724430 580000 2745555 590000 2769391 600000 2788771 610000 2807129 620000 2822161 630000 2839145 640000 2859084 650000 2882005 660000 2901236 670000 2919307 680000 2939506 690000 2953774 700000 2970362 710000 2989198 720000 3022978 730000 3036501 740000 3050360 750000 3067383 760000 3097902 770000 3117676 780000 3130042 790000 3146643 800000 3158762 810000 3172939 820000 3184544 830000 3194977 840000 3208239 850000 3219395 860000 3234128 870000 3265917 String Number: 873524 Token Sorted! Vector Sorted! Token Number: 3274086 Non-Widow Token Number: 2010404 Maximum Length: 1738 $ ./edjoin 3 6 dblp.norm.txt.6gram.bin dblp.norm.txt 24 > result2.txt Document: dblp.norm.txt.6gram.bin Threshold: Edit Distance 3 Q-Gram Length: 6 Algorithm: Ed-Join Remove String Length Below: 24 ... LOADING DATASET ... # Strings: 872271 # Peak Length: 1743 # Average Length: 104.618 === BEGIN JOIN (TIMER STARTED) === # Distinct Tokens: 3274086 # Distinct Widow Tokens: 1468549 # Distinct Indexed Tokens: 1593032 # Peak Token Frequency within Index: 418 # Average Prefix Length: 9.151 # Inverted Index Entries: 6509857 # CAND-1: 446177 # CAND-2: 34586 # Final Results: 33823 === END JOIN (TIMER STOPPED) === Total Running Time: 7.574 $ wc result2.txt 135292 169115 4779132 result2.txt

5. Binaries for Microsoft Windows

We also compiled binaries (binary-win32.zip) for Microsoft Windows. They are only briefly tested under Microsoft Windows Vista.

6. References

  1. Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu: Efficient Similarity Joins For Near Duplicate Detection. WWW 2008: 131-140.

  2. Chuan Xiao, Wei Wang, Xuemin Lin: Ed-Join: An Efficient Algorithm for Similarity Join with Edit Distance Constraints. VLDB 2008.

  3. Roberto J. Bayardo, Yiming Ma, Ramakrishnan Srikant: Scaling up all pairs similarity search. WWW 2007: 131-140.


Chuan Xiao and Wei Wang

Last Modified: Tue Sep 30 10:14:05 2008.