Applying Space-filling Curves in QBIC Indexing

(Some Quantitative Results)


John Shepherd


IBM Almaden Research Center   (Feb-Aug 1998)
University of New South Wales, Sydney   (regularly)


joint work with Nimrod Megiddo, Xiaoming Zhu


The Problem

Inputs
Outputs
And all done as efficiently as possible ... (time most important, space less so)


The Context

The solution has to work for ...
And, most importantly, the solution should ...


The Costs

I/O cost
CPU cost
Space


Cost Parameters

We need to consider the following quantities:

Name     Meaning                                                         Typically
N        number of objects in the database                               10^3 .. 10^10
d        number of dimensions in object features                         1 .. 256
P        number of bytes per disk page                                   512 .. 8K
V        number of bytes for a d-dimensional vector                      2d, 4d, 8d
O        average number of bytes per stored object (key+vector)          12+V .. 256+V
N_O      average number of stored objects per page (= floor(P/O))        1 .. 512
N_P      number of disk pages to hold stored objects (= ceil(N/N_O))     50 .. 10^10
T_open   time to open a database file                                    10ms
T_P      time to read a page from disk into memory                       10ms
T_D      time to compute distance between two objects (using vectors)    100us (?)
T_K      average time to find the page address for a given object        30ms (?)
T_sel    average time to fetch the page for a given curve point          30ms (?)

T_K is related to the use of a DBMS to look up a feature vector via an object identifier/key (the file name of the image).

T_sel is related to the indexed lookup of the nearest curve point via a B-tree during CurveIx querying.


Current Approach

Given a query vector v_q, the current QBIC method is (essentially):

results = empty list
maxD = infinity

for each image j in the database
{
     dist = D(v_j, v_q)
     if (#results < k or dist < maxD)
     {
          insert (j, dist) into results
          sort results by dist, keeping only the k smallest
          maxD = largest dist in results
     }
}


Cost   =   T_open  +  N_P T_P  +  N T_D

Note: If q is an image from the database, we can use a pre-computed distance table to make this much faster.
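The linear scan can be sketched in Python as follows. This is a hedged illustration, not the QBIC code: `math.dist` (Euclidean) stands in for the distance function D, and a bounded max-heap replaces the insert/sort/truncate steps.

```python
import heapq
import math

def linear_scan_knn(db, vq, k):
    """db: iterable of (imageId, vector) pairs.
    Returns the k nearest neighbours of vq as sorted (dist, imageId) pairs."""
    heap = []  # max-heap of the k best so far, via negated distances
    for j, vj in db:
        dist = math.dist(vj, vq)          # D(vj, vq); Euclidean stand-in
        if len(heap) < k:
            heapq.heappush(heap, (-dist, j))
        elif dist < -heap[0][0]:          # dist < maxD
            heapq.heapreplace(heap, (-dist, j))
    return sorted((-d, j) for d, j in heap)
```

Every vector is read and every distance is computed, which is exactly what the cost formula charges for.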


Goals for a Solution

Any reasonable solution aims to: ...
Beware: fewer vectors read does not necessarily mean fewer pages read


Proposed Approach: CurveIx

Based on the method developed in [Megiddo and Shaft]

The basic idea:

[Diagram:pic/c-curves]

We want the candidate set to contain most/all of the k-NN ...


Curve Mapping Functions

Each C_i has the following property: points within a small region of the curve are likely to have been mapped from vectors that are close in d-space.

We generate multiple C_i's by applying:


Data Structures for CurveIx

C_i   a mapping function for each of the m different space-filling curves (i = 1..m)
Db   a database of N d-dimensional feature vectors; each entry is a pair:

(imageId, vector)

imageId forms a primary key

Index   a database of Nm curve points; each point is represented by a tuple:

(curveId, point, imageId)

the pair (curveId, point) forms a primary key with an ordering, where point = C_curveId(v_imageId)

v_q   feature vector for query q


Database Construction

For each image obj, insertion requires:

for i in 1..m
{
     p = C_i(v_obj)
     insert (i, p, obj) into Index
}
insert (obj, v_obj) into Db
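The insertion loop can be sketched in Python. The actual curve family used by CurveIx is not specified here, so a bit-interleaved Z-order curve with a random coordinate permutation and cyclic shift per curve stands in for C_i; all names are illustrative.

```python
import random

BITS = 4  # precision per dimension (curve expanded to 4 levels)

def z_order(vec, perm, shift, bits=BITS):
    """Map a d-dim integer vector to a 1-D curve point by bit interleaving,
    after permuting dimensions and applying a cyclic shift (one curve variant)."""
    size = 1 << bits
    coords = [(vec[p] + s) % size for p, s in zip(perm, shift)]
    point = 0
    for b in range(bits - 1, -1, -1):        # interleave from high bit down
        for c in coords:
            point = (point << 1) | ((c >> b) & 1)
    return point

def make_curves(d, m, seed=0):
    """Generate m curve variants as (permutation, shift) pairs."""
    rnd = random.Random(seed)
    curves = []
    for _ in range(m):
        perm = list(range(d))
        rnd.shuffle(perm)
        shift = [rnd.randrange(1 << BITS) for _ in range(d)]
        curves.append((perm, shift))
    return curves

def build(data, curves):
    """data: dict imageId -> vector. Returns (Db, Index), where Index is a
    sorted list of (curveId, point, imageId) tuples, as in the Index key."""
    db = dict(data)
    index = []
    for obj, vec in data.items():
        for i, (perm, shift) in enumerate(curves):
            index.append((i, z_order(vec, perm, shift), obj))
    index.sort()   # ordering on (curveId, point)
    return db, index
```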


Finding k-NN   (Simple Approach)

Given:   Db,   v_q,   C_1 .. C_m,   Index

for i in 1..m
{
     p = C_i(v_q)
     lookup (i,p) in Index
     fetch (i,p',j)
     while p' "close to" p on curve i
     {
          collect next (i,p',j) as candidate
     }
}
for each candidate j
{
     lookup j in Db
     fetch (j, v_j)
     d = D(v_j, v_q)
     include j in k-NN if d small enough
}

Cost   =   2T_open  +  mT_sel  +  Nf T_P  +  Nf T_D

where Nf is the number of candidates collected (f = the fraction of the database examined)


CurveIx vs. Linear Scan

For linear scan:

    Cost   =   T_open  +  N_P T_P  +  N T_D

For CurveIx:

    Cost   =   2T_open  +  mT_sel  +  Nf T_P  +  Nf T_D

If CurveIx is to be worthwhile, we need:

    (1)   mT_sel  +  Nf T_P   <   N_P T_P

    (2)   Nf T_D   <   N T_D

(2) simply requires f < 1, which is essential anyway.


... CurveIx vs. Linear Scan

If we assume that each curve lookup involves 3-4 page accesses (i.e. T_sel = 3T_P), we can restate (1) as:

    (1)   (3m + Nf) T_P   <   N_P T_P

Some observations, given N_P = 0.05N:

Difficult to analyse further, so we implemented the system to see how it performed ...
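Condition (1) can at least be made concrete with a small illustrative calculation (assuming T_sel = 3T_P and N_P = 0.05N as above; figures are for intuition only):

```python
def max_candidate_fraction(N, m, pages_per_lookup=3, page_fraction=0.05):
    """Largest f for which condition (1) holds, i.e.
    pages_per_lookup*m + N*f < page_fraction*N,
    assuming each candidate fetch costs one page read."""
    return page_fraction - pages_per_lookup * m / N

# For the 20,000-image example database with m = 100 curves:
f = max_candidate_fraction(20_000, 100)   # 0.05 - 300/20000 = 0.035
# i.e. at most 0.035 * 20,000 = 700 candidates if each needs its own page
```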


Implementation

Aim: build a quick/easy/light implementation of the above approach. Two components:

build

query


Example for build Program

% build db < data
Inserted img00000
Inserted img00001
...

where the data file looks like:

...
img00102
 54 0 0 0 0 0 0 5 10 3 1 1 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0
 0 1 8 10 3 2 1 7 3 2 1 1 0 0 0 0
 0 0 0 2 14 34 3 1 2 1 1 0 0 0 0 0
 0 0 0 0 0 0 2 1 0 0 0 3 1 1 2 737
 3 0 5 2 1 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 0 7 0 25 2 1 1 1 0 0 0 9 2
 1 1 0 0 1 2 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 3 0 0 0 5 0 0 0 1 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
...


Example for query Program

An example of executing query on a database of 20,000 images (color histograms) with 100 curves:

% time query db 10 1 100 < qry007

QUERY: img00102
0.000000  img00102
0.005571  img00713
0.008826  img00268
0.011413  img05054
0.011811  img00631
0.014259  img04042
0.027598  img00203
0.037471  img00287
0.063639  img00244
0.067583  img00306
# dist calculations: 655
...
   various dbm, i/o statistics
...
2.00user 0.27system ...


Ci Values as Keys

One problem raised early in the implementation: to use C_i values as dbm keys, we need them at fixed precision, so we limit curve expansion (to 4 levels).

Problems:

Solution:
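One way to realise fixed-precision keys is sketched below; the actual dbm key format is not shown on this slide, so the encoding is purely illustrative. The point is that truncating the curve position to a fixed number of levels and zero-padding to a fixed width makes lexicographic key order agree with numeric order along the curve.

```python
def curve_key(curve_id, point, d, levels=4):
    """Encode (curveId, point) as a fixed-width string key whose
    lexicographic order matches numeric order along the curve.
    point: integer curve position using `levels` bits per dimension."""
    width = (d * levels + 3) // 4          # hex digits needed for d*levels bits
    return f"{curve_id:04d}:{point:0{width}x}"
```

For example, with d = 16 and 4 levels, every key has the same length, so a B-tree over these strings orders entries by (curveId, point) as the Index requires.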


Query Algorithm

Given:   Db,  Index,  v_q,  k,  Range,  Ncurves

dbm.open(Db)
dbm.open(Index)
compute C_i perms and shifts
for i in 1..Ncurves
{
     p = C_i(v_q)
     up = dbm.lookup(Index, (i,p))
     dn = dbm.lookup(Index, (i,p))
     img = dbm.fetch(up)
     insert img in results
     while within Range for up and dn
     {
          dbm.next(up); img = dbm.fetch(up)
          check insert img in results
          dbm.prev(dn); img = dbm.fetch(dn)
          check insert img in results
     }
}
for i in 1..k
     display (dist, imageId) for results_i


... Query Algorithm

check insert img in results
{
     if already seen img
          return  // no need to check
     
     data = dbm.lookup(Db, img)
     v_img = dbm.fetch(data)
     d = D(v_img, v_q)
     include img in k-NN if d small enough
}

The dbm.lookup steps are expensive   (T_sel)

Once a cursor is established, access is via cached buffer.
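A self-contained Python sketch of the query loop, with a sorted in-memory list and `bisect` standing in for the B-tree/dbm cursors (all names are illustrative, and `math.dist` stands in for D):

```python
import bisect
import math

def curveix_query(db, index, curves, curve_fn, vq, k, rng):
    """db: dict imageId -> vector; index: sorted list of
    (curveId, point, imageId) tuples; curve_fn(vq, perm, shift) maps vq
    onto curve i; rng: entries scanned each way along each curve.
    Returns the k best (dist, imageId) pairs among the candidates."""
    seen, results = set(), []
    for i, (perm, shift) in enumerate(curves):
        p = curve_fn(vq, perm, shift)
        pos = bisect.bisect_left(index, (i, p, ""))   # the T_sel lookup
        lo, hi = pos - 1, pos                         # dn and up cursors
        for _ in range(rng):
            for cur in (lo, hi):
                if 0 <= cur < len(index) and index[cur][0] == i:
                    img = index[cur][2]
                    if img not in seen:               # "check insert"
                        seen.add(img)
                        d = math.dist(db[img], vq)    # fetch vector, T_D
                        results.append((d, img))
            lo, hi = lo - 1, hi + 1                   # dbm.prev / dbm.next
    return sorted(results)[:k]
```

The `seen` set plays the role of the "already seen img" check: each candidate's vector is fetched and its distance computed at most once, however many curves it appears on.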


Performance Analysis

Since CurveIx does not guarantee to deliver the k-NN among its candidates, we set an "acceptable" accuracy level of 90%.

In other words, CurveIx must deliver at least 0.9k of the true nearest neighbours to be considered useful.

Our initial concern has been to answer the questions:

So far, preliminary answers are only available for 256-d QBIC color histograms:


Experiments

Measures for accuracy:
Measures for efficiency:
To determine how these measures vary:
Also implemented a linear scan version for comparison and to collect the exact answer sets.


Sample Comparison

CurveIx Indexing:

% query db 10 1 100

QUERY: img00102
0.000000  img00102
0.005571  img00713
0.008826  img00268
0.011413  img05054
0.011811  img00631
0.014259  img04042
0.027598  img00203
0.037471  img00287
0.063639  img00244
0.067583  img00306

# dist calcs: 524
1.67user 0.23system ...

Linear Scan:

% query db 10

QUERY: img00102
0.000000  img00102
0.005571  img00713
0.008826  img00268
0.011413  img05054
0.011811  img00631
0.014259  img04042
0.027598  img00203
0.037471  img00287
0.063639  img00244
0.067583  img00306

# dist calcs: 20000
1.93user 1.12system ...


Experimental Results

For fixed database (20K), effect of varying Range, Ncurves

Curves Range Acc1 Acc2 Dist
20 20 6.72 0.20 426
20 30 7.28 0.28 695
20 40 7.68 0.36 874
40 20 7.68 0.28 957
40 30 8.16 0.40 1301
40 40 8.60 0.44 1703
60 30 8.40 0.44 1905
60 40 8.60 0.48 2413
80 30 8.87 0.58 2485
80 40 9.20 0.72 3381
100 30 9.10 0.70 3061
100 40 9.28 0.72 4156


Results: Size vs. Accuracy

For fixed CurveIx parameters (80 curves, 30 range), effect of varying database size:

#Images Acc1 Acc2
5K 9.72 0.76
10K 9.44 0.80
15K 9.16 0.70
20K 9.04 0.64


Results: Costs

For the 20K database (CurveIx (80,30)):

Approach      Size(Db)   Size(Ix)   IO(bytes)
Linear Scan   42MB       -          42MB
CurveIx       46MB       345MB      4+36MB


Remaining Work

The following aspects require further research and implementation before this approach could be considered a real solution for k-NN:


References/Further Reading

  1. Berchtold, Böhm & Kriegel,
    The Pyramid Technique: Towards breaking the curse of dimensionality,
    ACM SIGMOD, 1998
    [proposes a new data access technique which is optimized for high-dimensional range queries]
  2. Gaede & Günther,
    Multidimensional access methods,
    Technical Report, Humboldt University, Berlin, 1995
    [comprehensive survey of multidimensional database access methods]
  3. Hellerstein, Koutsoupias & Papadimitriou,
    On the analysis of indexing schemes,
    ACM PODS, 1997
    [initial work on a general theory of "indexability", which is analogous to complexity theory and describes how effective data/query/access-method combinations are in terms of disk access and storage cost]
  4. Megiddo & Shaft,
    Efficient nearest neighbour indexing based on a collection of space-filling curves,
    IBM Internal Research Report, 1997
    [description of method to index multidimensional data based on a multiple complementary "linearisations" of the data points]
  5. White & Jain,
    Algorithms and Strategies for Similarity Retrieval,
    UCSD Visual Computing Lab Report, 1996
    [survey and characterisation of a large number of approaches to multidimensional access methods]
Plus other papers in /home/jas/papers.


Produced: 27 Jul 98