Applying Space-filling Curves in QBIC Indexing
(Some Quantitative Results)
John Shepherd
IBM Almaden Research Center (Feb-Aug 1998)
University of New South Wales, Sydney (regularly)
joint work with Nimrod Megiddo, Xiaoming Zhu
The Problem
Inputs
- a database of N objects
(obj1 , obj2 , ..., objN)
- representations of objects as d-dimensional vectors
(vobj1 , vobj2 , ..., vobjN)
(each vector may represent one or more features of the object)
- a query object q with associated d-dimensional vector (vq)
- a distance measure D(vi, vj) → [0,1)
(D = 0 iff vi = vj)
Outputs
- a list of the k nearest objects in the database
[a1 , a2 , ... ak]
- ordered by distance to query object:
D(va1, vq) ≤ D(va2, vq) ≤ ... ≤ D(vak, vq)
And all done as efficiently as possible ...
(time most important, space less so)
The Context
The solution has to work for ...
- number of dimensions d = 1 .. 500
- number of objects N = 10^4 .. 10^10
- all values in query vector are specified (i.e. no partial-match)
- implementation runs on "standard" hardware/OS
(e.g. don't assume arrays of parallel disks, huge memory, ...)
- independent of distance measure (?)
- 100% retrieval accuracy (?)
And, most importantly, the solution should ...
- use minimal resources: i/o, cpu-time, space
- provide fast database updates (for some applications e.g. Magnifi)
The Costs
I/O cost
- the dominant cost (given current disk technology)
- measured in terms of pages read, not records read
CPU cost
- primarily related to computing distance measure D(vobj , vq)
- can be relatively expensive for some QBIC features
Space
- need to store data (feature vectors) within page-size chunks
- access methods typically require additional disk space
(which can be anywhere from 10% extra to >100% extra)
- intermediate data structures in memory?
Cost Parameters
We need to consider the following quantities:
Name | Meaning | Typically |
N | number of objects in the database | 10^3 .. 10^10 |
d | number of dimensions in object features | 1 .. 256 |
P | number of bytes per disk page | 512 .. 8K |
V | number of bytes for d-dimensional vector | 2d, 4d, 8d |
O | average number of bytes per stored object (key+vector) | 12+V .. 256+V |
N_O | average number of stored objects per page (= floor(P/O)) | 1 .. 512 |
N_P | number of disk pages to hold stored objects (= ceil(N/N_O)) | 50 .. 10^10 |
T_open | time to open a database file | 10ms |
T_P | time to read a page from disk into memory | 10ms |
T_D | time to compute distance between two objects (using vectors) | 100us (?) |
T_K | average time to find the page address for a given object | 30ms (?) |
T_sel | average time to fetch the page for a given curve point | 30ms (?) |
T_K relates to the use of a DBMS to look up a feature vector
via an object identifier/key (the file name of the image).
T_sel relates to the indexed lookup of the nearest curve point
via a B-tree during CurveIx querying.
Current Approach
Given a query vector vq ,
the current QBIC method is (essentially):
results = empty list
for each image j in the database
{
    dist = D(vj, vq)
    if (#results < k or dist < maxD)
    {
        insert (j, dist) into results
        sort results by dist, keeping only the k smallest
        maxD = largest dist in results
    }
}
Cost = T_open + N_P·T_P + N·T_D
Note: If q is an image from the database, we can use a
pre-computed distance table to make this much faster.
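The scan above can be sketched in Python (illustrative only: `D` here is a plain Euclidean distance on toy vectors, standing in for QBIC's actual distance function):

```python
import math

def D(v1, v2):
    # stand-in distance measure: Euclidean; QBIC's real measure differs
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def linear_scan_knn(db, vq, k):
    """db: list of (imageId, vector); returns the k nearest as (dist, imageId)."""
    results = []          # kept sorted by distance, at most k entries
    maxD = float("inf")   # distance of the current k-th neighbour
    for j, vj in db:
        dist = D(vj, vq)
        if len(results) < k or dist < maxD:
            results.append((dist, j))
            results.sort()          # sort results by dist
            results = results[:k]   # keep only the k smallest
            maxD = results[-1][0]
    return results
```

The pruning via `maxD` saves only sorting work, not distance computations: every object in the database is still read and compared, which is exactly the N·T_D term above.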
Goals for a Solution
Any reasonable solution aims to:
- reduce the number of objects considered (determine candidates)
- ... which, in turn, reduces #vectors read and D computations
- but without excessive overhead in determining candidates
Beware: fewer vectors read does not necessarily mean fewer pages read
Proposed Approach: CurveIx
Based on the method developed in
[Megiddo and Shaft]
The basic idea:
- project d-dimensional vector onto 1-d space-filling curve
- collect set of curve-neighbours (should contain some k-NN)
- repeat above for several curves with different paths through space
- union of neighbour sets from all curves gives candidates
- retrieve and compute distance only for candidates
We want candidate set to contain most/all of k-NN ...
- how many curves do we need?
- how many neighbours do we examine on each curve?
Curve Mapping Functions
Each Ci has the following properties:
- maps d-dimensional vectors onto points on a line
i.e. [0,1)^d → [0,1)
- mapping is one-to-one and defines an order on [0,1)d
- inverse mapping is continuous
Points within a small region of the curve are likely
to have been mapped from vectors that are close in d-space.
We generate multiple Cis by applying:
- permutations over vector elements
- translations of individual elements
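A hedged sketch of how such a curve family might be generated: the actual system uses Hilbert numbers, but a Morton (Z-order) key illustrates the same idea of combining a bit-interleaving space-filling curve with per-curve permutations and translations (all names and the `BITS` precision are illustrative assumptions):

```python
import random

BITS = 8  # quantisation bits per dimension (assumption; the real system expands further)

def morton_key(v, bits=BITS):
    """Bit-interleave the quantised coordinates of v in [0,1)^d into one integer."""
    coords = [min(int(x * (1 << bits)), (1 << bits) - 1) for x in v]
    key = 0
    for b in range(bits - 1, -1, -1):      # most-significant bits first
        for c in coords:
            key = (key << 1) | ((c >> b) & 1)
    return key

def make_curve(d, seed):
    """Build one curve mapping C_i: a random permutation of the vector
    elements plus a random translation of each element (mod 1)."""
    rnd = random.Random(seed)
    perm = list(range(d))
    rnd.shuffle(perm)
    shift = [rnd.random() for _ in range(d)]
    def C(v):
        w = [(v[p] + shift[p]) % 1.0 for p in perm]
        return morton_key(w)
    return C
```

Each seed yields a different path through the space, so points that are split apart by one curve have a chance of being curve-neighbours under another.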
Data Structures for CurveIx
Ci | a mapping function for each of the m different space-filling curves (i = 1..m) |
Db | a database of N d-dimensional feature vectors; each entry is a pair (imageId, vector); imageId forms a primary key |
Index | a database of N·m curve points; each point is represented by a tuple (curveId, point, imageId); the pair (curveId, point) forms an ordered primary key, where point = C_curveId(v_imageId) |
vq | feature vector for query q |
Database Construction
For each image obj, insertion requires:
for i in 1..m
{
p = Ci(vobj)
insert (i, p, obj) into Index
}
insert (obj, vobj) into Db
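A sketch of this construction, with an in-memory sorted list standing in for the dbm B-tree (names are illustrative; `curves` is assumed to be a list of curve mapping functions):

```python
import bisect

def build(images, curves):
    """images: list of (imageId, vector); curves: list of mapping functions C_i.
    Returns (db, index): db maps imageId -> vector, index is a sorted list of
    (curveId, point, imageId) tuples, standing in for the Index B-tree."""
    db = {}
    index = []
    for obj, v in images:
        for i, C in enumerate(curves):
            bisect.insort(index, (i, C(v), obj))  # insert (i, p, obj) into Index
        db[obj] = v                               # insert (obj, v) into Db
    return db, index
```

Sorting on (curveId, point) mirrors the B-tree key order, so curve-neighbours of any point are adjacent entries in `index`.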
Finding k-NN (Simple Approach)
Given: Db, vq, C1..Cm, Index
for i in 1..m
{
p = Ci(vq)
lookup (i,p) in Index
fetch (i,p',j)
while p' "close to" p on curve i
{
collect next (i,p',j) as candidate
}
}
for each candidate j
{
lookup j in Db
fetch (j,vj)
d = D(vj , vq)
include j in k-NN if d small enough
}
Cost = 2·T_open + m·T_sel + N·f·T_P + N·f·T_D
where
- Tsel is the cost of Index lookup using e.g. B-tree
- f is the filtering factor
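The simple approach can be sketched end-to-end as follows (a toy stand-in: a sorted list replaces the Index B-tree, `rng` plays the role of the "close to" curve-neighbour range, and `D` is Euclidean rather than QBIC's measure):

```python
import bisect
import math

def D(v1, v2):
    # stand-in distance measure (Euclidean)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def curveix_query(index, db, curves, vq, k, rng):
    """index: sorted list of (curveId, point, imageId);
    db: imageId -> vector; rng: curve-neighbours examined per curve."""
    candidates = set()
    for i, C in enumerate(curves):
        p = C(vq)
        pos = bisect.bisect_left(index, (i, p, ""))   # lookup (i,p) in Index
        lo = max(0, pos - rng)
        for ci, _, j in index[lo:pos + rng]:          # neighbours on either side
            if ci == i:                               # stay within curve i
                candidates.add(j)
    # refine: fetch vectors and compute distances only for candidates
    results = sorted((D(db[j], vq), j) for j in candidates)
    return results[:k]
```

The candidate set is the union over all m curves, so the filtering factor f is |candidates|/N; the refinement step is where the N·f·T_P and N·f·T_D costs arise.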
CurveIx vs. Linear Scan
For linear scan:
Cost = T_open + N_P·T_P + N·T_D
For CurveIx:
Cost = 2·T_open + m·T_sel + N·f·T_P + N·f·T_D
If CurveIx is to be worthwhile, we need:
(1) m·T_sel + N·f·T_P < N_P·T_P
(2) N·f·T_D < N·T_D
(2) simply requires f < 1, which is essential anyway.
... CurveIx vs. Linear Scan
If we assume that each curve lookup involves 3-4 page accesses
(i.e. T_sel = 3·T_P), we can restate (1) as:
(1) (3m + N·f)·T_P < N_P·T_P
Some observations, given N_P = 0.05·N:
- we can't afford to have too many curves (m)
- CurveIx has to do effective filtering (f)
- clustering can reduce the N·f·T_P term
Difficult to analyse further, so we implemented the system to see
how it performed ...
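The break-even point can be made concrete with the numbers above. Dividing inequality (1) by T_P and solving for f gives the largest filtering factor at which CurveIx still beats a linear scan on I/O (a back-of-envelope sketch; parameter names are illustrative):

```python
def max_filter_factor(N, m, pages_per_lookup=3, np_ratio=0.05):
    """Largest filtering factor f satisfying (pages_per_lookup*m + N*f) < N_P,
    i.e. CurveIx reads fewer pages than a linear scan (I/O terms only)."""
    NP = np_ratio * N                    # N_P = 0.05 * N, as assumed above
    return (NP - pages_per_lookup * m) / N

# with N = 20,000 objects and m = 100 curves:
f_max = max_filter_factor(N=20000, m=100)
# f_max = 0.035, i.e. CurveIx must examine under 3.5% of the objects
```

This makes the two observations above quantitative: doubling m to 200 drops f_max to 0.02, and at m ≈ 333 curves the lookups alone cost as much as the whole linear scan.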
Implementation
Aim: build a quick/easy/light implementation of the above approach.
- make it stand-alone from QBIC, but use QBIC's distance function
- use the Unix dbm libraries to implement keyed retrieval
Two components:
build
- reads image idents and color histograms from stdin
- stores each in the Db, a dbm B-tree keyed on ident
- computes m curve-points for each, and stores them in the Index,
a dbm B-tree keyed on curveId and position (the Ci(v) value)
query
- reads a single color histogram from stdin
- uses the above method to find the k best matches
- displays the list of best-match idents and distances on stdout,
along with performance statistics
Example for build Program
% build db < data
Inserted img00000
Inserted img00001
...
where the data file looks like:
...
img00102
54 0 0 0 0 0 0 5 10 3 1 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0
0 1 8 10 3 2 1 7 3 2 1 1 0 0 0 0
0 0 0 2 14 34 3 1 2 1 1 0 0 0 0 0
0 0 0 0 0 0 2 1 0 0 0 3 1 1 2 737
3 0 5 2 1 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 7 0 25 2 1 1 1 0 0 0 9 2
1 1 0 0 1 2 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 3 0 0 0 5 0 0 0 1 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
...
Example for query Program
An example of executing query on a database of 20,000
images (color histograms) with 100 curves:
% time query db 10 1 100 < qry007
QUERY: img00102
0.000000 img00102
0.005571 img00713
0.008826 img00268
0.011413 img05054
0.011811 img00631
0.014259 img04042
0.027598 img00203
0.037471 img00287
0.063639 img00244
0.067583 img00306
# dist calculations: 655
...
various dbm, i/o statistics
...
2.00user 0.27system ...
Ci Values as Keys
One problem arose early in the implementation:
- the Hilbert numbers produced by Ci can be expanded to arbitrary
precision in order to distinguish between Ci(p1) and Ci(p2)
To use such values as dbm keys, we need them at fixed precision,
so we limit the expansion (to 4 levels).
Problems:
- this gives keys that are 96 bytes long
(producing very large Index files)
- different vj can be mapped to the same point
(so the size of candidate sets is artificially inflated)
Solution:
- use a "prefix" index structure (something like a trie)
Query Algorithm
Given: Db , Index , q , k , Range , Ncurves
dbm.open(Db)
dbm.open(Index)
compute Ci perms and shifts
for i in 1..Ncurves
{
    p = Ci(vq)
    up = dbm.lookup(Index, (i,p))   // cursor scanning forward
    dn = dbm.lookup(Index, (i,p))   // cursor scanning backward
    img = dbm.fetch(up)
    check insert img in results
    while within Range for up and dn
    {
        dbm.next(up); img = dbm.fetch(up)
        check insert img in results
        dbm.prev(dn); img = dbm.fetch(dn)
        check insert img in results
    }
}
for i in 1..k
    display (dist, imageId) for results[i]
... Query Algorithm
check insert img in results
{
if already seen img
return // no need to check
data = dbm.lookup(Db , img)
vimg = dbm.fetch(data)
d = D(vimg , vq)
include img in k-NN if d small enough
}
The dbm.lookup steps are expensive (T_sel);
once a cursor is established, access is via a cached buffer.
Performance Analysis
Since CurveIx does not guarantee that the true k-NN appear
among its candidates, we set an "acceptable" accuracy level of 90%.
In other words, CurveIx must deliver at least 0.9·k of the true
nearest neighbours to be considered useful.
Our initial concern has been to answer the questions:
- How many curves are needed to achieve 90% accuracy?
- How many curve-neighbours do we need to examine?
- Can all of this be done reasonably efficiently?
So far, preliminary answers are only available for 256-d
QBIC color histograms:
- around 80-100 curves
- around 15-30 curve-neighbours
- not particularly ... yet
Experiments
Measures for accuracy:
- Acc1: average number of true top-10 entries appearing in the CurveIx top-10
- Acc2: how frequently CurveIx gives 10 out of 10
Measures for efficiency:
- Size: size of Db file + Index file
- Dist: number of distance calculations required
- IO: total amount of I/O performed
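The two accuracy measures can be computed from the exact and CurveIx answer sets along these lines (a sketch; function and variable names are illustrative):

```python
def accuracy(exact_topk, curveix_topk):
    """exact_topk: per-query lists of the true top-k imageIds;
    curveix_topk: the corresponding CurveIx top-k lists.
    Returns (Acc1, Acc2): the average overlap size, and the
    fraction of queries where CurveIx recovers all k."""
    overlaps = [len(set(e) & set(c))
                for e, c in zip(exact_topk, curveix_topk)]
    k = len(exact_topk[0])
    acc1 = sum(overlaps) / len(overlaps)
    acc2 = sum(1 for o in overlaps if o == k) / len(overlaps)
    return acc1, acc2
```

Note that Acc2 is the stricter measure: a single missed neighbour on a query zeroes its contribution, which is why Acc2 lags Acc1 in the tables below.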
To determine how these measures vary:
- built databases of size 5K, 10K, 15K, 20K (supersets)
- for each database, ran a 25-query "benchmark" set
- for each query, ran with 3,5,10,20,30,40 curve-neighbours
(but, because of the curve-mapping problem, only 20,30,40 were usable)
- for each query, ran with 20,40,60,80,100 curves
Also implemented a linear scan version for comparison and
to collect the exact answer sets.
Sample Comparison
CurveIx Indexing:
% query db 10 1 100
QUERY: img00102
0.000000 img00102
0.005571 img00713
0.008826 img00268
0.011413 img05054
0.011811 img00631
0.014259 img04042
0.027598 img00203
0.037471 img00287
0.063639 img00244
0.067583 img00306
# dist calcs: 524
1.67user 0.23system ...

Linear Scan:
% query db 10
QUERY: img00102
0.000000 img00102
0.005571 img00713
0.008826 img00268
0.011413 img05054
0.011811 img00631
0.014259 img04042
0.027598 img00203
0.037471 img00287
0.063639 img00244
0.067583 img00306
# dist calcs: 20000
1.93user 1.12system ...
Experimental Results
For fixed database (20K), effect of varying Range, Ncurves
Curves | Range | Acc1 | Acc2 | Dist |
20 | 20 | 6.72 | 0.20 | 426 |
20 | 30 | 7.28 | 0.28 | 695 |
20 | 40 | 7.68 | 0.36 | 874 |
40 | 20 | 7.68 | 0.28 | 957 |
40 | 30 | 8.16 | 0.40 | 1301 |
40 | 40 | 8.60 | 0.44 | 1703 |
60 | 30 | 8.40 | 0.44 | 1905 |
60 | 40 | 8.60 | 0.48 | 2413 |
80 | 30 | 8.87 | 0.58 | 2485 |
80 | 40 | 9.20 | 0.72 | 3381 |
100 | 30 | 9.10 | 0.70 | 3061 |
100 | 40 | 9.28 | 0.72 | 4156 |
Results: Size vs. Accuracy
For fixed CurveIx parameters (80 curves, 30 range), effect of varying
database size:
#Images | Acc1 | Acc2 |
5K | 9.72 | 0.76 |
10K | 9.44 | 0.80 |
15K | 9.16 | 0.70 |
20K | 9.04 | 0.64 |
Results: Costs
For the 20K database (CurveIx (80,30)):
Approach | Size(Db) | Size(Ix) | IO(bytes) |
Linear Scan | 42MB | - | 42MB |
CurveIx | 46MB | 345MB | 4+36MB |
Remaining Work
The following aspects require further research and implementation
before this approach could be considered a real solution for k-NN:
- control of search over curves and neighbours
- better data structures for the curve index
- clustering strategies to reduce page reads
- comparison with other approaches (e.g. [Pyramid technique])
References/Further Reading
- Berchtold, Böhm & Kriegel,
The Pyramid Technique: Towards Breaking the Curse of Dimensionality,
ACM SIGMOD, 1998
[proposes a new data access technique optimized for high-dimensional range queries]
- Gaede & Günther,
Multidimensional Access Methods,
Technical Report, Humboldt University, Berlin, 1995
[comprehensive survey of multidimensional database access methods]
- Hellerstein, Koutsoupias & Papadimitriou,
On the Analysis of Indexing Schemes,
ACM PODS, 1997
[initial work on a general theory of "indexability", analogous to
complexity theory, describing how effective data/query/access-method
combinations are in terms of disk access and storage cost]
- Megiddo & Shaft,
Efficient Nearest Neighbour Indexing Based on a Collection of Space-filling Curves,
IBM Internal Research Report, 1997
[description of a method to index multidimensional data based on
multiple complementary "linearisations" of the data points]
- White & Jain,
Algorithms and Strategies for Similarity Retrieval,
UCSD Visual Computing Lab Report, 1996
[survey and characterisation of a large number of approaches
to multidimensional access methods]
Plus other papers in /home/jas/papers.
Produced: 27 Jul 98