Ass1 FAQ

1. Typos

2. Clarification

In Q2.3, you can safely assume that (1) neither s nor t is the empty string, and (2) 0 < alpha < 1.

3. Q&A

No. Only a star schema is required.

Consider this special case, say, s = “abcd”, t = “abcd...z”, and alpha = 0.5.

Simply put, we treat multiple occurrences of the same q-gram in a string as if they were different q-grams. My preferred notation is to use a different subscript to distinguish each occurrence. See Page 8 of SSJoin.ppt or Sec. 4.3.1 in “Surajit Chaudhuri, Venkatesh Ganti, Raghav Kaushik: A Primitive Operator for Similarity Joins in Data Cleaning. ICDE 2006: 5.”
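
The subscript idea can be sketched in Python as follows. This is only an illustration, not the required solution: the function name numbered_qgrams, the choice q = 2, the example strings, and the absence of any padding characters are all assumptions made for the example.

```python
from collections import Counter


def numbered_qgrams(s, q):
    """Return the set of (q-gram, occurrence number) pairs of s.

    The occurrence number plays the role of the subscript: the k-th
    occurrence of a q-gram becomes a distinct element (gram, k).
    Assumption: no padding characters are added to s.
    """
    seen = Counter()
    result = set()
    for i in range(len(s) - q + 1):
        gram = s[i:i + q]
        seen[gram] += 1                 # 1 for the first occurrence, 2 for the second, ...
        result.add((gram, seen[gram]))
    return result


# Hypothetical example strings with repeated q-grams.
s_grams = numbered_qgrams("abab", 2)
t_grams = numbered_qgrams("ababab", 2)

# Ordinary set intersection now respects repeated q-grams.
overlap = s_grams & t_grams
print(sorted(overlap))
```

Here "ab" occurs twice in "abab" and three times in "ababab", so the two strings share (ab, 1), (ab, 2), and (ba, 1), and the overlap has size 3 rather than 2.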

Note that the way we handle this case is different from the project, where we essentially just keep the multiplicity of the duplicate tokens. The main reason is that different data preprocessing methods are needed for different data mining tasks/applications.