Applying Space-filling Curves in QBIC Indexing
(Some Quantitative Results)
John Shepherd
IBM Almaden Research Center (Feb-Aug 1998)
University of New South Wales, Sydney (regularly)
joint work with Nimrod Megiddo, Xiaoming Zhu
The Problem
Inputs
- a database of N objects
(obj1 , obj2 , ..., objN)
- representations of objects as d-dimensional vectors
(vobj1 , vobj2 , ..., vobjN)
(each vector may represent one or more features of the object)
- a query object q with associated d-dimensional vector (vq)
- a distance measure D(vi, vj) → [0,1)
(D = 0 iff vi = vj)
Outputs
- a list of the k nearest objects in the database
[a1 , a2 , ... ak]
- ordered by distance to query object:
D(va1, vq) ≤ D(va2, vq) ≤ ... ≤ D(vak, vq)
And all done as efficiently as possible ...
(time most important, space less so)
The Context
The solution has to work for ...
- number of dimensions d = 1 .. 500
- number of objects N = 10^4 .. 10^10
- all values in query vector are specified (i.e. no partial-match)
- implementation runs on "standard" hardware/OS
(e.g. don't assume arrays of parallel disks, huge memory, ...)
- independent of distance measure (?)
- 100% retrieval accuracy (?)
And, most importantly, the solution should ...
- use minimal resources: i/o, cpu-time, space
- provide fast database updates (for some applications e.g. Magnifi)
The Costs
I/O cost
- the dominant cost (given current disk technology)
- measured in terms of pages read, not records read
CPU cost
- primarily related to computing distance measure D(vobj , vq)
- can be relatively expensive for some QBIC features
Space
- need to store data (feature vectors) within page-size chunks
- access methods typically require additional disk space
(which can be anywhere from 10% extra to >100% extra)
- intermediate data structures in memory?
Cost Parameters
We need to consider the following quantities:
Name | Meaning | Typically |
N | number of objects in the database | 10^3 .. 10^10 |
d | number of dimensions in object features | 1 .. 256 |
P | number of bytes per disk page | 512 .. 8K |
V | number of bytes for d-dimensional vector | 2d, 4d, 8d |
O | average number of bytes per stored object (key+vector) | 12+V .. 256+V |
N_O | average number of stored objects per page (= floor(P/O)) | 1 .. 512 |
N_P | number of disk pages to hold stored objects (= ceil(N/N_O)) | 50 .. 10^10 |
T_open | time to open a database file | 10ms |
T_P | time to read a page from disk into memory | 10ms |
T_D | time to compute distance between two objects (using vectors) | 100us (?) |
T_K | average time to find the page address for a given object | 30ms (?) |
T_sel | average time to fetch the page for a given curve point | 30ms (?) |
T_K relates to the use of a DBMS to look up a feature vector
via an object identifier/key (the file name of the image).
T_sel relates to the indexed lookup of the nearest curve point
via a B-tree during CurveIx querying.
Current Approach
Given a query vector vq ,
the current QBIC method is (essentially):
results = empty list
for each image j in the database
{
    dist = D(vj, vq)
    if (#results < k or dist < maxD)
    {
        insert (j, dist) into results
        sort results by dist, keeping only the k smallest
        maxD = largest dist in results
    }
}
Cost = T_open + N_P·T_P + N·T_D
Note: If q is an image from the database, we can use a
pre-computed distance table to make this much faster.
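The scan above can be sketched in Python (illustrative only: `D` here is a plain Euclidean distance on toy vectors, standing in for QBIC's actual distance function):

```python
import math

def D(v1, v2):
    # stand-in distance measure: Euclidean; QBIC's real measure differs
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def linear_scan_knn(db, vq, k):
    """db: list of (imageId, vector); returns the k nearest as (dist, imageId)."""
    results = []          # kept sorted by distance, at most k entries
    maxD = float("inf")   # distance of the current k-th neighbour
    for j, vj in db:
        dist = D(vj, vq)
        if len(results) < k or dist < maxD:
            results.append((dist, j))
            results.sort()          # sort results by dist
            results = results[:k]   # keep only the k smallest
            maxD = results[-1][0]
    return results
```

The pruning via `maxD` saves only sorting work, not distance computations: every object in the database is still read and compared, which is exactly the N·T_D term above.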
Goals for a Solution
Any reasonable solution aims to:
- reduce the number of objects considered (determine candidates)
- ... which, in turn, reduces #vectors read and D computations
- but without excessive overhead in determining candidates
Beware: fewer vectors read does not necessarily mean fewer pages read
Proposed Approach: CurveIx
Based on the method developed in
[Megiddo and Shaft]
The basic idea:
- project d-dimensional vector onto 1-d space-filling curve
- collect set of curve-neighbours (should contain some k-NN)
- repeat above for several curves with different paths through space
- union of neighbour sets from all curves gives candidates
- retrieve and compute distance only for candidates
We want candidate set to contain most/all of k-NN ...
- how many curves do we need?
- how many neighbours do we examine on each curve?
Curve Mapping Functions
Each Ci has the following properties:
- maps d-dimensional vectors onto points on a line
i.e. [0,1)^d → [0,1)
- mapping is one-to-one and defines an order on [0,1)d
- inverse mapping is continuous
Points within a small region of the curve are likely
to have been mapped from vectors that are close in d-space.
We generate multiple Cis by applying:
- permutations over vector elements
- translations of individual elements
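A hedged sketch of how such a curve family might be generated: the actual system uses Hilbert numbers, but a Morton (Z-order) key illustrates the same idea of combining a bit-interleaving space-filling curve with per-curve permutations and translations (all names and the `BITS` precision are illustrative assumptions):

```python
import random

BITS = 8  # quantisation bits per dimension (assumption; the real system expands further)

def morton_key(v, bits=BITS):
    """Bit-interleave the quantised coordinates of v in [0,1)^d into one integer."""
    coords = [min(int(x * (1 << bits)), (1 << bits) - 1) for x in v]
    key = 0
    for b in range(bits - 1, -1, -1):      # most-significant bits first
        for c in coords:
            key = (key << 1) | ((c >> b) & 1)
    return key

def make_curve(d, seed):
    """Build one curve mapping C_i: a random permutation of the vector
    elements plus a random translation of each element (mod 1)."""
    rnd = random.Random(seed)
    perm = list(range(d))
    rnd.shuffle(perm)
    shift = [rnd.random() for _ in range(d)]
    def C(v):
        w = [(v[p] + shift[p]) % 1.0 for p in perm]
        return morton_key(w)
    return C
```

Each seed yields a different path through the space, so points that are split apart by one curve have a chance of being curve-neighbours under another.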
Data Structures for CurveIx
Ci | a mapping function for each of the m different space-filling curves (i = 1..m) |
Db | a database of N d-dimensional feature vectors; each entry is a pair (imageId, vector); imageId forms a primary key |
Index | a database of N·m curve points; each point is represented by a tuple (curveId, point, imageId); the pair (curveId, point) forms an ordered primary key, where point = C_curveId(v_imageId) |
vq | feature vector for query q |
Database Construction
For each image obj, insertion requires:
for i in 1..m
{
p = Ci(vobj)
insert (i, p, obj) into Index
}
insert (obj, vobj) into Db
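A sketch of this construction, with an in-memory sorted list standing in for the dbm B-tree (names are illustrative; `curves` is assumed to be a list of curve mapping functions):

```python
import bisect

def build(images, curves):
    """images: list of (imageId, vector); curves: list of mapping functions C_i.
    Returns (db, index): db maps imageId -> vector, index is a sorted list of
    (curveId, point, imageId) tuples, standing in for the Index B-tree."""
    db = {}
    index = []
    for obj, v in images:
        for i, C in enumerate(curves):
            bisect.insort(index, (i, C(v), obj))  # insert (i, p, obj) into Index
        db[obj] = v                               # insert (obj, v) into Db
    return db, index
```

Sorting on (curveId, point) mirrors the B-tree key order, so curve-neighbours of any point are adjacent entries in `index`.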
Finding k-NN (Simple Approach)
Given: Db, vq, C1..Cm, Index
for i in 1..m
{
p = Ci(vq)
lookup (i,p) in Index
fetch (i,p',j)
while p' "close to" p on curve i
{
collect next (i,p',j) as candidate
}
}
for each candidate j
{
lookup j in Db
fetch (j,vj)
d = D(vj , vq)
include j in k-NN if d small enough
}
Cost = 2·T_open + m·T_sel + N·f·T_P + N·f·T_D
where
- Tsel is the cost of Index lookup using e.g. B-tree
- f is the filtering factor
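The simple approach can be sketched end-to-end as follows (a toy stand-in: a sorted list replaces the Index B-tree, `rng` plays the role of the "close to" curve-neighbour range, and `D` is Euclidean rather than QBIC's measure):

```python
import bisect
import math

def D(v1, v2):
    # stand-in distance measure (Euclidean)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def curveix_query(index, db, curves, vq, k, rng):
    """index: sorted list of (curveId, point, imageId);
    db: imageId -> vector; rng: curve-neighbours examined per curve."""
    candidates = set()
    for i, C in enumerate(curves):
        p = C(vq)
        pos = bisect.bisect_left(index, (i, p, ""))   # lookup (i,p) in Index
        lo = max(0, pos - rng)
        for ci, _, j in index[lo:pos + rng]:          # neighbours on either side
            if ci == i:                               # stay within curve i
                candidates.add(j)
    # refine: fetch vectors and compute distances only for candidates
    results = sorted((D(db[j], vq), j) for j in candidates)
    return results[:k]
```

The candidate set is the union over all m curves, so the filtering factor f is |candidates|/N; the refinement step is where the N·f·T_P and N·f·T_D costs arise.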
CurveIx vs. Linear Scan
For linear scan:
Cost = T_open + N_P·T_P + N·T_D
For CurveIx:
Cost = 2·T_open + m·T_sel + N·f·T_P + N·f·T_D
If CurveIx is to be worthwhile, we need:
(1) m·T_sel + N·f·T_P < N_P·T_P
(2) N·f·T_D < N·T_D
(2) simply requires f < 1, which is essential anyway.
... CurveIx vs. Linear Scan
If we assume that each curve lookup involves 3-4 page accesses
(i.e. T_sel = 3·T_P), we can restate (1) as:
(1) (3m + N·f)·T_P < N_P·T_P
Some observations, given N_P = 0.05·N:
- we can't afford to have too many curves (m)
- CurveIx has to do effective filtering (f)
- clustering can reduce the N·f·T_P term
Difficult to analyse further, so we implemented the system to see
how it performed ...
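The break-even point can be made concrete with the numbers above. Dividing inequality (1) by T_P and solving for f gives the largest filtering factor at which CurveIx still beats a linear scan on I/O (a back-of-envelope sketch; parameter names are illustrative):

```python
def max_filter_factor(N, m, pages_per_lookup=3, np_ratio=0.05):
    """Largest filtering factor f satisfying (pages_per_lookup*m + N*f) < N_P,
    i.e. CurveIx reads fewer pages than a linear scan (I/O terms only)."""
    NP = np_ratio * N                    # N_P = 0.05 * N, as assumed above
    return (NP - pages_per_lookup * m) / N

# with N = 20,000 objects and m = 100 curves:
f_max = max_filter_factor(N=20000, m=100)
# f_max = 0.035, i.e. CurveIx must examine under 3.5% of the objects
```

This makes the two observations above quantitative: doubling m to 200 drops f_max to 0.02, and at m ≈ 333 curves the lookups alone cost as much as the whole linear scan.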
Implementation
Aim: build a quick/easy/light implementation of the above approach.
- make it stand-alone from QBIC, but use QBIC's distance function
- use the Unix dbm libraries to implement keyed retrieval
Two components:
build
- reads image idents and color histograms from stdin
- stores each in the Db, a dbm B-tree keyed on ident
- computes m curve-points for each, and stores them in the Index,
a dbm B-tree keyed on curveId and position (the Ci(v) value)
query
- reads a single color histogram from stdin
- uses the above method to find the k best matches
- displays the list of best-match idents and distances on stdout,
along with performance statistics
Example for build Program
% build db < data
Inserted img00000
Inserted img00001
...
where the data file looks like:
...
img00102
54 0 0 0 0 0 0 5 10 3 1 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0
0 1 8 10 3 2 1 7 3 2 1 1 0 0 0 0
0 0 0 2 14 34 3 1 2 1 1 0 0 0 0 0
0 0 0 0 0 0 2 1 0 0 0 3 1 1 2 737
3 0 5 2 1 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 7 0 25 2 1 1 1 0 0 0 9 2
1 1 0 0 1 2 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 3 0 0 0 5 0 0 0 1 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
...
Example for query Program
An example of executing query on a database of 20,000
images (color histograms) with 100 curves:
% time query db 10 1 100 < qry007
QUERY: img00102
0.000000 img00102
0.005571 img00713
0.008826 img00268
0.011413 img05054
0.011811 img00631
0.014259 img04042
0.027598 img00203
0.037471 img00287
0.063639 img00244
0.067583 img00306
# dist calculations: 655
...
various dbm, i/o statistics
...
2.00user 0.27system ...
Ci Values as Keys
One problem arose early in the implementation:
- the Hilbert numbers produced by Ci can be expanded to arbitrary
precision in order to distinguish between Ci(p1) and Ci(p2)
To use such values as dbm keys, we need them at fixed precision,
so we limit the expansion (to 4 levels).
Problems:
- this gives keys that are 96 bytes long
(producing very large Index files)
- different vj can be mapped to the same point
(so the size of candidate sets is artificially inflated)
Solution:
- use a "prefix" index structure (something like a trie)
Query Algorithm
Given: Db , Index , q , k , Range , Ncurves
dbm.open(Db)
dbm.open(Index)
compute Ci perms and shifts
for i in 1..Ncurves
{
    p = Ci(vq)
    up = dbm.lookup(Index, (i,p))   // cursor scanning forward
    dn = dbm.lookup(Index, (i,p))   // cursor scanning backward
    img = dbm.fetch(up)
    check insert img in results
    while within Range for up and dn
    {
        dbm.next(up); img = dbm.fetch(up)
        check insert img in results
        dbm.prev(dn); img = dbm.fetch(dn)
        check insert img in results
    }
}
for i in 1..k
    display (dist, imageId) for results[i]
... Query Algorithm
check insert img in results
{
if already seen img
return // no need to check
data = dbm.lookup(Db , img)
vimg = dbm.fetch(data)
d = D(vimg , vq)
include img in k-NN if d small enough
}
The dbm.lookup steps are expensive (T_sel);
once a cursor is established, access is via a cached buffer.
Performance Analysis
Since CurveIx does not guarantee that the true k-NN appear
among its candidates, we set an "acceptable" accuracy level of 90%.
In other words, CurveIx must deliver at least 0.9·k of the true
nearest neighbours to be considered useful.
Our initial concern has been to answer the questions:
- How many curves are needed to achieve 90% accuracy?
- How many curve-neighbours do we need to examine?
- Can all of this be done reasonably efficiently?
So far, preliminary answers are only available for 256-d
QBIC color histograms:
- around 80-100 curves
- around 15-30 curve-neighbours
- not particularly ... yet
Experiments
Measures for accuracy:
- Acc1: average number of true top-10 entries appearing in the CurveIx top-10
- Acc2: how frequently CurveIx gives 10 out of 10
Measures for efficiency:
- Size: size of Db file + Index file
- Dist: number of distance calculations required
- IO: total amount of I/O performed
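The two accuracy measures can be computed from the exact and CurveIx answer sets along these lines (a sketch; function and variable names are illustrative):

```python
def accuracy(exact_topk, curveix_topk):
    """exact_topk: per-query lists of the true top-k imageIds;
    curveix_topk: the corresponding CurveIx top-k lists.
    Returns (Acc1, Acc2): the average overlap size, and the
    fraction of queries where CurveIx recovers all k."""
    overlaps = [len(set(e) & set(c))
                for e, c in zip(exact_topk, curveix_topk)]
    k = len(exact_topk[0])
    acc1 = sum(overlaps) / len(overlaps)
    acc2 = sum(1 for o in overlaps if o == k) / len(overlaps)
    return acc1, acc2
```

Note that Acc2 is the stricter measure: a single missed neighbour on a query zeroes its contribution, which is why Acc2 lags Acc1 in the tables below.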
To determine how these measures vary:
- built databases of size 5K, 10K, 15K, 20K (supersets)
- for each database, ran a 25-query "benchmark" set
- for each query, ran with 3,5,10,20,30,40 curve-neighbours
(but, because of the curve-mapping problem, only 20,30,40 were usable)
- for each query, ran with 20,40,60,80,100 curves
Also implemented a linear scan version for comparison and
to collect the exact answer sets.
Sample Comparison
CurveIx Indexing:
% query db 10 1 100
QUERY: img00102
0.000000 img00102
0.005571 img00713
0.008826 img00268
0.011413 img05054
0.011811 img00631
0.014259 img04042
0.027598 img00203
0.037471 img00287
0.063639 img00244
0.067583 img00306
# dist calcs: 524
1.67user 0.23system ...

Linear Scan:
% query db 10
QUERY: img00102
0.000000 img00102
0.005571 img00713
0.008826 img00268
0.011413 img05054
0.011811 img00631
0.014259 img04042
0.027598 img00203
0.037471 img00287
0.063639 img00244
0.067583 img00306
# dist calcs: 20000
1.93user 1.12system ...
Experimental Results
For fixed database (20K), effect of varying Range, Ncurves
Curves | Range | Acc1 | Acc2 | Dist |
20 | 20 | 6.72 | 0.20 | 426 |
20 | 30 | 7.28 | 0.28 | 695 |
20 | 40 | 7.68 | 0.36 | 874 |
40 | 20 | 7.68 | 0.28 | 957 |
40 | 30 | 8.16 | 0.40 | 1301 |
40 | 40 | 8.60 | 0.44 | 1703 |
60 | 30 | 8.40 | 0.44 | 1905 |
60 | 40 | 8.60 | 0.48 | 2413 |
80 | 30 | 8.87 | 0.58 | 2485 |
80 | 40 | 9.20 | 0.72 | 3381 |
100 | 30 | 9.10 | 0.70 | 3061 |
100 | 40 | 9.28 | 0.72 | 4156 |
Results: Size vs. Accuracy
For fixed CurveIx parameters (80 curves, 30 range), effect of varying
database size:
#Images | Acc1 | Acc2 |
5K | 9.72 | 0.76 |
10K | 9.44 | 0.80 |
15K | 9.16 | 0.70 |
20K | 9.04 | 0.64 |
Results: Costs
For the 20K database (CurveIx (80,30)):
Approach | Size(Db) | Size(Ix) | IO(bytes) |
Linear Scan | 42MB | - | 42MB |
CurveIx | 46MB | 345MB | 4+36MB |
Remaining Work
The following aspects require further research and implementation
before this approach could be considered a real solution for k-NN:
- control of search over curves and neighbours
- better data structures for the curve index
- clustering strategies to reduce page reads
- comparison with other approaches (e.g. [Pyramid technique])
References/Further Reading
- Berchtold, Böhm & Kriegel,
The Pyramid Technique: Towards Breaking the Curse of Dimensionality,
ACM SIGMOD, 1998
[proposes a new data access technique optimized for high-dimensional range queries]
- Gaede & Günther,
Multidimensional Access Methods,
Technical Report, Humboldt University, Berlin, 1995
[comprehensive survey of multidimensional database access methods]
- Hellerstein, Koutsoupias & Papadimitriou,
On the Analysis of Indexing Schemes,
ACM PODS, 1997
[initial work on a general theory of "indexability", analogous to
complexity theory, describing how effective data/query/access-method
combinations are in terms of disk access and storage cost]
- Megiddo & Shaft,
Efficient Nearest Neighbour Indexing Based on a Collection of Space-filling Curves,
IBM Internal Research Report, 1997
[description of a method to index multidimensional data based on
multiple complementary "linearisations" of the data points]
- White & Jain,
Algorithms and Strategies for Similarity Retrieval,
UCSD Visual Computing Lab Report, 1996
[survey and characterisation of a large number of approaches
to multidimensional access methods]
Plus other papers in /home/jas/papers.
Produced: 27 Jul 98