COMP9315 21T1 ♢ Assignment 2 ♢ [0/12]
Aim: implement variants of signature-based indexing (SIMC, CATC)
Implement individual relations and commands to work on them.
Each relation R
consists of multiple files:
-
R.info
... relation meta-data (e.g. # tuples)
-
R.data
... data file containing pages of tuples
-
R.tsig
... file containing tuple signatures
-
R.psig
... file containing page-level signatures
-
R.bsig
... file containing bit-sliced signatures
Each relation implements one signature style
(SIMC or CATC)
COMP9315 21T1 ♢ Assignment 2 ♢ [1/12]
File structures for signature info + data files
Tuples are all tupSize
bytes long (based on # attributes)
Signatures are m bits long, rounded to ceil(m/8) bytes
COMP9315 21T1 ♢ Assignment 2 ♢ [2/12]
Superimposed codewords (SIMC):
- signature has m bits, ~m/2 bits set to 1
- codewords have m bits, each has k bits set to 1
- codewords combined via OR
- false matches caused by hash collisions and superimposition
Concatenated codewords (CATC):
- signature has m bits, ~m/2 bits set to 1
- codewords have ui ≅ m/n bits, each has u/2 bits set to 1
- codewords combined via shift + OR
- false matches caused by hash collisions (worse than SIMC)
COMP9315 21T1 ♢ Assignment 2 ♢ [3/12]
More detailed file structures for data + signature files
COMP9315 21T1 ♢ Assignment 2 ♢ [4/12]
COMP9315 21T1 ♢ Assignment 2 ♢ [5/12]
We supply:
- complete command programs to build and query relations
- partially-complete ADTs for operations needed by commands
You complete the ADTs so that the commands work properly
-
create
, insert
, select
... build/query commands
-
gendata
, stats
, dump
... utility commands
-
x1
, x2
, x3
... commands for debugging ADTs
COMP9315 21T1 ♢ Assignment 2 ♢ [6/12]
./create RelName SigType #tuples #attrs 1/pF
- creates a new relation with prefix
RelName
- uses n =
#attrs
to determine tuple size
- uses r =
#tuples
to determine bs ⇒ length of bit-slices
(#tuples
suggests maximum number of tuples to be stored)
- uses
pF
and n to determine m and k for SIMC signatures
- uses
pF
and n to determine m and ui for CATC signatures
- creates files
RelName.info
, RelName.data
, etc. etc.
- as supplied, data and signature files are empty after
create
- when complete,
RelName.bsig
should contain all-zero bit-slices
COMP9315 21T1 ♢ Assignment 2 ♢ [7/12]
./gendata #tuples #attributes [startID] [seed]
- generates
#tuples
tuples in a standard format, e.g.
1234567,iuwhfkajewhkfjkwefbx,a3-101,a4-256,a5-013,...
- first attribute is unique id
- second attribute is 20-char random string (most likely unique)
- rest are
aN-DDD
up to #attributes
- attributes
a3-DDD
to aN-DDD
are not unique
- if no
startID
given, use 1000000
; if no seed
given, use 0
COMP9315 21T1 ♢ Assignment 2 ♢ [8/12]
COMP9315 21T1 ♢ Assignment 2 ♢ [9/12]
./select RelName 'Query'
- finds all matches for pmr query in relation
RelName
- queries are expressed using
?
for unknown attributes, e.g.
?,?,?,?,?
1234567,?,?,?,?
?,?,a3-101,?,?
?,?,a3-101,a4-200,a5-013
- should enclose
Query
in single quotes (to avoid problems with zsh)
- prints one tuple per line (note: order of tuples is not important)
- prints page read statistics at end of output
COMP9315 21T1 ♢ Assignment 2 ♢ [10/12]
ADTs that you need to complete
-
Bits
... operations on bit-strings (set,unset,and,or,shift)
-
Reln
... operations on relations (new,open,close)
-
Query
... operations on queries (start,scan,stats,close)
-
Tsig
... operations on tuple signatures (makeSig,find)
-
Psig
... operations on page signatures (makeSig,find)
-
Bsig
... operations on bit-slices (find)
You
should not change any of the commands
(e.g. select.c
)
COMP9315 21T1 ♢ Assignment 2 ♢ [11/12]
What to do now?
- review the notes on signature-based indexing
- read the spec carefully
- read the code for the commands (to see how they use the ADTs)
- read the ADT
*.h
files; read the ADT *.c
files
- write and test your code (suggest: tsig, then psig, then bsig)
Testing script available next week.
Don't wait. It's easy to devise your own tests.
COMP9315 21T1 ♢ Assignment 2 ♢ [12/12]
Produced: 30 Mar 2021