Q2: “probable” -› “probability”
In Q2.3, you can safely assume that (1) neither s nor t is the empty string, and (2) 0 ‹ alpha ‹ 1.
Do I need to perform further normalization for Q1?
No. Only a star schema is required.
Any hint for Q2.3?
Consider a special case: s = “abcd”, t = “abcd...z”, and alpha = 0.5.
In Q2, how should we handle multiple occurrences of the same q-gram in a string?
Simply put, we treat multiple occurrences of the same q-gram in a string as if they are different q-grams. My preferred notation is to use a different subscript to distinguish them. See Page 8 of SSJoin.ppt or Sec 4.3.1 in “Surajit Chaudhuri, Venkatesh Ganti, Raghav Kaushik: A Primitive Operator for Similarity Joins in Data Cleaning. ICDE 2006: 5.”
Note that this differs from the project, where we essentially just keep the multiplicity of duplicate tokens. The main reason is that different data mining tasks and applications call for different data preprocessing methods.
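To illustrate the subscript notation described above, here is a minimal Python sketch. It is not part of the assignment or of the SSJoin paper's code: the function names `qgrams` and `indexed_qgrams` are hypothetical, and it assumes q-grams are taken without any padding characters, which may differ from the convention used in Q2. Each occurrence of a q-gram is paired with its occurrence number, so repeated q-grams become distinct set elements.

```python
from collections import defaultdict


def qgrams(s, q):
    """Return the list of q-grams of s in order (assumption: no padding)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def indexed_qgrams(s, q):
    """Return a set of (q-gram, occurrence number) pairs.

    The k-th occurrence of a q-gram gets subscript k, so duplicates
    are treated as if they were different q-grams.
    """
    seen = defaultdict(int)
    result = set()
    for g in qgrams(s, q):
        seen[g] += 1
        result.add((g, seen[g]))
    return result


if __name__ == "__main__":
    a = indexed_qgrams("abab", 2)  # "ab" occurs twice, "ba" once
    b = indexed_qgrams("ab", 2)    # "ab" occurs once
    print(sorted(a))
    print(len(a & b))              # size of the overlap
```

With this representation, the second “ab” in “abab” does not match the single “ab” in “ab”, so the overlap between the two strings is one q-gram rather than two.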