Please read carefully this Software License Agreement before using the package. By installing and using the package, you agree to be bounded by the terms of this agreement. If you do not agree to the terms of this license, please do not use the package. Using any part of the package indicates that you accept these terms.
This package is for research purpose only. Other usage is not allowed without the permission from the authors. It cannot be used for any commercial interest.
This software license agreement and author information will not be altered.
Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu: Efficient Similarity Joins For Near Duplicate Detection. WWW 2008: 131-140.
Chuan Xiao, Wei Wang, Xuemin Lin: Ed-Join: An Efficient Algorithm for Similarity Join with Edit Distance Constraints. VLDB 2008.
Roberto J. Bayardo, Yiming Ma, Ramakrishnan Srikant: Scaling up all pairs similarity search. WWW 2007: 131-140.
| Program Name | Description | ||
|---|---|---|---|
| txtformat | Normalize the text data | ||
| txtdedup | Remove exact duplicates in the text data file | ||
| bindedup | Remove exact duplicates in the binary data file | ||
| tokenizer | Tokenize the text file into tokens and store it in the binary format | ||
| qtokenizer | Tokenize the text file into q-grams and store it in the binary format | ||
| allpairs | The implementation of the All-Pairs algorithm in [3] | ||
| ppjoin, ppjoinplus | The implementations of the PPJoin and PPJoin+ algorithms in [1] | ||
| allpairsed | The implementation of the All-Pairs-Ed algorithm in [2] | ||
| edjoin-l, edjoin | The implementation of the Ed-Join and Ed-Join-l algorithm in [2] | ||
| edjoin-l, edjoin | The implementation of the Ed-Join and Ed-Join-l algorithm in [2] | ||
| binary-win32.zip | A precompiled version for Microsoft Windows |
All binary codes were compiled on Debian 4.1.1-21 using g++ 4.2.3 with -O3 flag.
txtformat processes the input file by converting all punctuation to underscores, and all letters to upper case (default) or lowercase (optional).
Usage:
txtformat [input] [output] [case (optional)]
Arguments:
| Argument | Description | ||
|---|---|---|---|
| input | input text file | ||
| output | output text file | ||
| case (optional) | when case is not supplied, no change of cases to the input file; when case is 'l', the program will convert all letters to their lower cases; when case is 'u', then the program will convert all letters to their upper cases. |
Example:
txtformat dblp.txt dblp.norm.txt txtformat dblp.txt dblp.normlower.txt l
txtdedup removes exact duplicates in the text data file
Usage:
txtdedup [input] [output]
Example:
txtdedup dblp.txt dblp.nodup.txt
bindedup removes exact duplicates in the binary data file.
Usage:
bindedup [input] [output]
Example:
bindedup dblp.bin dblp.nodup.bin
Usage:
tokenizer [Dataset] [Record-Num (Optional)]
Arguments:
| Argument | Description | ||
|---|---|---|---|
| Dataset | Original dataset in txt format. | ||
| Record-Num (Optional) | Assign a number N so that only the first N records are tokenized. |
Example:
tokenizer dblp.txtTokenize dblp.txt into binary dataset. The output filename is dblp.txt.bin.
Description:
Create a dataset in binary form with the input dataset in txt format. Each line in the input dataset is considered as a record, and is tokenized into a SET of tokens with the following characters:
.,?!\tSpecifically, if the same token appears k times in a record, it will be counted as k unique tokens.
The records are then sorted in ascending order of size. The output binary dataset is in the form of:
[record-id][record-size][token-list]Each field is represented as a 4-byte integer.
Example:
[102][3][1][2][3] [253][5][2][3][4][5][6]The above lines represent two records. The first (id = 102) has 3 tokens, and the token ids are 1, 2 and 3. The second (id = 253) has 5 tokens, and the token ids are 2, 3, 4, 5, and 6.
The output filename appends '.bin' to the input filename. E.g., the output filename is dataset.txt.bin if the input filename is dataset.txt.
Usage:
qtokenizer [Q] [Dataset] [Record-Num (Optional)]
Arguments:
| Argument | Description | ||
|---|---|---|---|
| Q | q-gram length. | ||
| Dataset | Original dataset in txt format. | ||
| Record-Num (Optional) | Assign a number N so that only the first N records are tokenized into q-grams. |
Example:
qtokenizer 5 dblp.txtTokenize dblp.txt into binary dataset with q-gram length 5. The output filenames are dblp.txt.5gram.bin and dblp.txt.5gram.bin.loc.
Description:
Create a q-gram dataset in binary form with the input original strings. Each line in the input dataset is considered as a record, and tokenized into a MULTISET of q-grams. The records are then sorted in ascending order of size. The outputs are two files. The first output file is the q-gram set of records in the form of:
[record-id] [record-size] [qgram-list]Each field is represented as a 4-byte integer.
The second output file is the locations of the q-grams in the original string, in the form of:
[location_list]Each unit is represented as a 2-byte integer. The locations have 1-to-1 correspondence with the _q_-grams in the previous q-gram file.
The output filenames are 'qgram.bin' and 'qgram.bin.loc' appended to the input filename. E.g., the outputs are 'dataset.txt.5gram.bin' and 'dataset.txt.5gram.bin.loc' if the input filename is 'dataset.txt' and the q-gram length is 5. When applying Ed-Join/All-Pairs-Ed family algorithm, please ensure that the '.loc' file is in the same folder as the corresponding '.bin' file.
Usage:
allpairs/ppjoin/ppjoinplus [Function] [Threshold] [Dataset-Bin] [Max-Depth (Optional)]
Algorithms:
All-Pairs
PPJoin
PPJoin+
Arguments:
| Argument | Description | ||
|---|---|---|---|
| Function | Similarity functions used in set similarity joins. Choose from j (for Jaccard similarity) and c (for Cosine similarity) | ||
| Threshold | Simlarity threshold, e.g., 0.95. | ||
| Dataset-Bin | The filename of dataset in binary form. | ||
| Max-Depth (Optional) | The max recursive level (MAXDEPTH) for suffix filtering. The default value is 2. |
Example:
ppjoin j 0.95 dblp.binRun ppjoin algorithm on the dblp dataset with Jaccard similarity threshold 0.95, MAXDEPTH = 2.
Example:
ppjoinplus c 0.8 trec.bin 4 > result.txtRun ppjoin+ algorithm on trec dataset with Cosine similarity threshold 0.8, MAXDEPTH = 4. Output the join results to file result.txt.
Description: The outputs of the program are:
some statistics printed to stderr. They should be self-explanatory.
the join result. Each line consists of line-x line-y sim. line-x/y are line numbers in the original text data file.
Usage:
allpairsed/edjoin-l/edjoin [Tau] [Q] [QGram-Set-Bin] [String] [Min-Length (Optional)]
Algorithms:
All-Pairs-Ed
Ed-Join-l
Ed-Join
Arguments:
| Argument | Description | ||
|---|---|---|---|
| Tau | Edit Distance Threshold. | ||
| Q | q-gram length. | ||
| QGram-Set-Bin | The filename of q-gram sets in binary form. | ||
| String | The filename of the original string. | ||
| Min-Length (Optional) | The minimum string length. Strings with length below this value will be discarded. The default value is q * (tau + 1), the minimum required string length for count filtering. |
Example:
edjoin 3 5 dblp.bin dblp.txtRun the Ed-Join algorithm on the dblp dataset. The edit distance threshold is 3, and the q-gram length is 5. To ensure that count filtering won't miss any join result, strings shorter than 20 are discarded.
Example:
edjoin-l 10 8 trec.bin trec.txt 100 > result.txtRun the Ed-Join algorithm with location-based filtering only on the trec dataset. The edit distance threshold is 10, and the q-gram length is 8. Strings shorter than 100 are discarded. Output join results to file result.txt.
Suppose the original text dataset is texas.raw.txt. To perform set similarity join on this dataset with Jaccard similarity t = 0.85, using PPJoin+, run the following command:
$ ./tokenizer texas.raw.txt 10000 27049 20000 45443 30000 61658 40000 76936 50000 91748 60000 106277 70000 120378 80000 134277 90000 148464 100000 162690 110000 176556 120000 190384 130000 204046 140000 219018 150000 233973 String Number: 154987 Token Sorted! Duplicate Token Removed! Vector Sorted! Tokens: 241317 $ ./ppjoinplus j 0.85 texas.raw.txt.bin > result.txt Document: texas.raw.txt.bin Algorithm: ppjoin+ MAXDEPTH = 2 Threshold: Jaccard 0.85 ... LOADING DATASET ... # Records: 154987 # Average Size: 11.2869 === BEGIN JOIN (TIMER STARTED) === # Distinct Indexed Tokens: 233534 # Inverted Index Entries: 41981 # Distinct Tokens: 242241 # Results: 14 # Candidate Pairs: 2016 # Intersections Performed: 2016 === END JOIN (TIMER STOPPED) === Total Running Time: 0.060 $ cat result.txt 72666 9062 0.857 78919 21526 0.857 104174 104173 0.857 120685 98148 0.857 123381 123380 0.857 125968 125966 0.857 126085 126084 0.857 39025 37904 0.867 120835 51430 0.867 123616 123615 0.867 32880 18161 0.875 58241 45306 0.875 121670 121669 0.875 121861 121859 0.882
Suppose the original txt dataset is dblp.raw.txt. To perform edit similarity join on this dataset with q = 5, tau = 3, discarding strings shorter than 20 and using Ed-Join, run the following command::
$ ./txtformat dblp.raw.txt dblp.norm.txt $ ./qtokenizer 6 dblp.norm.txt 10000 273345 20000 417843 30000 552687 40000 666132 50000 771969 60000 871278 70000 971830 80000 1052267 90000 1109326 100000 1162820 110000 1226821 120000 1280482 130000 1335476 140000 1381963 150000 1429457 160000 1476352 170000 1519018 180000 1559930 190000 1619003 200000 1661134 210000 1705322 220000 1745568 230000 1784313 240000 1813389 250000 1849278 260000 1881403 270000 1913562 280000 1941345 290000 1971211 300000 2001216 310000 2027749 320000 2059471 330000 2092141 340000 2135331 350000 2164359 360000 2194595 370000 2229923 380000 2254357 390000 2277680 400000 2308603 410000 2333160 420000 2356993 430000 2376388 440000 2400154 450000 2423227 460000 2450382 470000 2476920 480000 2504020 490000 2520704 500000 2541188 510000 2564068 520000 2585998 530000 2611282 540000 2637431 550000 2660196 560000 2678713 570000 2724430 580000 2745555 590000 2769391 600000 2788771 610000 2807129 620000 2822161 630000 2839145 640000 2859084 650000 2882005 660000 2901236 670000 2919307 680000 2939506 690000 2953774 700000 2970362 710000 2989198 720000 3022978 730000 3036501 740000 3050360 750000 3067383 760000 3097902 770000 3117676 780000 3130042 790000 3146643 800000 3158762 810000 3172939 820000 3184544 830000 3194977 840000 3208239 850000 3219395 860000 3234128 870000 3265917 String Number: 873524 Token Sorted! Vector Sorted! Token Number: 3274086 Non-Widow Token Number: 2010404 Maximum Length: 1738 $ ./edjoin 3 6 dblp.norm.txt.6gram.bin dblp.norm.txt 24 > result2.txt Document: dblp.norm.txt.6gram.bin Threshold: Edit Distance 3 Q-Gram Length: 6 Algorithm: Ed-Join Remove String Length Below: 24 ... LOADING DATASET ... # Strings: 872271 # Peak Length: 1743 # Average Length: 104.618 === BEGIN JOIN (TIMER STARTED) === # Distinct Tokens: 3274086 # Distinct Widow Tokens: 1468549 # Distinct Indexed Tokens: 1593032 # Peak Token Frequency within Index: 418 # Average Prefix Length: 9.151 # Inverted Index Entries: 6509857 # CAND-1: 446177 # CAND-2: 34586 # Final Results: 33823 === END JOIN (TIMER STOPPED) === Total Running Time: 7.574 $ wc result2.txt 135292 169115 4779132 result2.txt
We also compiled binaries (binary-win32.zip) for Microsoft Windows. They are only briefly tested under Microsoft Windows Vista.
Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu: Efficient Similarity Joins For Near Duplicate Detection. WWW 2008: 131-140.
Chuan Xiao, Wei Wang, Xuemin Lin: Ed-Join: An Efficient Algorithm for Similarity Join with Edit Distance Constraints. VLDB 2008.
Roberto J. Bayardo, Yiming Ma, Ramakrishnan Srikant: Scaling up all pairs similarity search. WWW 2007: 131-140.
Last Modified: Tue Sep 30 10:14:05 2008.