Making QBIC Faster


Efficient Content-based Image Retrieval


John Shepherd

joint work with

Nimrod Megiddo, Xiaoming Zhu

IBM Almaden Research Center   (Feb-Aug,1998)


Overview

  1. QBIC and Content-based Image Retrieval
  2. High-dimensional k Nearest Neighbours Problem
  3. Database Indexing Schemes and HDkNN Retrieval
  4. The Curveix Approach ... Preliminary Results
  5. Conclusions and Future Work


The QBIC System

QBIC was developed around 1993 as an experiment in image retrieval.

The aim was to build a system for images that provides an analogue of the text-retrieval operation ``Find documents like this one''.

The focus was on developing measures of image similarity that:


... The QBIC System

[Diagram:pic/qbic]


Content-based Image Retrieval

User supplies a description of desired image via feature vectors.

System returns a ranked list of "matching" images from DB.


[Diagram:pic/cbir-sys]


The Problem

The goal of the QBIC project is to develop a system that is both effective and efficient.

So far, the system has been somewhat successful on the first point (effectiveness), but has limited itself to databases of < 10,000 images.

One efficiency measure already implemented: an N × N table of pre-computed distances, to speed up retrieval when the query image is itself in the database.
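As a sketch of how such a table can be used (illustrative Python; the metric D and the array layout are assumptions, not the QBIC implementation):

import numpy as np

def build_distance_table(vectors, D):
    """Pre-compute the N x N table of pairwise distances D(v_i, v_j)."""
    n = len(vectors)
    table = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            table[i, j] = table[j, i] = D(vectors[i], vectors[j])
    return table

def knn_from_table(table, query_index, k):
    """k-NN of an image already in the database: sort one row of the table."""
    order = np.argsort(table[query_index])
    return order[:k]            # the query itself appears first, at distance 0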


High-dimensional Nearest Neighbours Search

The problem of finding k similar images is an instance of the more general problem ...

Given: a set of N points in d-dimensional space and a query point q,

find the k data points in the set that are closest to q.

A solution to this problem would be useful in several areas:


The Problem (Re-stated)

Inputs:

Outputs:

And all done as efficiently as possible ...

... where time is the most important efficiency factor


The Context

The solution has to work for ...

And, most importantly, the solution should ...


The Costs

I/O cost

CPU cost

Space


Cost Parameters

We need to consider the following quantities:

Name     Meaning                                                         Typically
N        number of objects in the database                               10^3 .. 10^10
d        number of dimensions in object features                         1 .. 256
P        number of bytes per disk page                                   512 .. 8K
V        number of bytes for a d-dimensional vector                      4d
O        average number of bytes per stored object (key+vector)          12+4d .. 256+4d
N_O      average number of stored objects per page  (= P/O)              1 .. 512
N_P      number of disk pages to hold stored objects  (= N/N_O)          50 .. 10^10
T_open   time to open a database file                                    10ms
T_P      time to read a page from disk into memory                       10ms
T_D      time to compute distance between two objects (using vectors)    100us (?)
T_K      average time to find the page address for a given object        10ms (?)

The T_K measure is related to the use of a DBMS (e.g. dbm or DB2) to look up object vectors via the object identifier/key (the file name of the image).


Analysing Costs

We (pessimistically) assume that there is no effective caching of pages read from the database.

That is, accessing n object vectors (chosen at random) is going to result in n page reads.

This gives an upper-bound on the actual cost.
The real cost will typically be less due to clustering and caching.

On the other hand, we also assume that if the access to the vectors is via a sequential scan, then each page in the database file will be read exactly once.


Measuring Costs

Unix has useful tools for measuring/analysing CPU costs (gprof).

Unix (AIX) is not helpful in measuring I/O costs
(the rusage structure doesn't seem to count block input/output)

We use a combination of


Current Approach

Given a query vector vq, current QBIC method is (essentially):

results = [];   maxD = infinity

for each obj in the database
{
	dist = D(vobj,vq)
	if (#results < k or dist < maxD)
	{
		insert (obj,dist) into results
		// results is sorted with length <= k
		maxD = largest dist in results
	}
}


Cost  =  T_open  +  N_P·T_P  +  N·T_D

Note: If q is an image from the database, we can use a pre-computed distance table to make this much faster.
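A runnable sketch of this linear scan (Python; Euclidean distance stands in for whatever D QBIC actually uses, which is an assumption):

import heapq
import numpy as np

def linear_scan_knn(db, vq, k):
    """db: iterable of (objId, vector) pairs; returns [(dist, objId)] of the k closest."""
    results = []                                     # max-heap via negated distances
    for obj, vobj in db:
        dist = float(np.linalg.norm(vobj - vq))      # D(vobj, vq)
        if len(results) < k:
            heapq.heappush(results, (-dist, obj))
        elif dist < -results[0][0]:                  # dist < maxD
            heapq.heapreplace(results, (-dist, obj))
    return sorted((-d, obj) for d, obj in results)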


Specific Goals for a Solution

Any reasonable solution aims to:


Some Potential Solutions

Partition-based

Approximation-based

A new approach, the pyramid technique [Berchtold et al.], has elements of both partitioning and projection.


... Some Potential Solutions

Other Optimisations


Tree-based Methods

Each leaf in the tree corresponds to a subregion of the d-dimensional space

Non-leaf nodes correspond to a union of subregions in children

Partitioned space containing data points:

[Diagram:pic/space]


... Tree-based Methods

Tree that induced above partitioning:

[Diagram:pic/tree]

Above has no overlap among subregions; some partitioning schemes permit overlap.

Above uses hyper-rectangles for partitioning; other shapes (e.g. spheres, convex polyhedra) have been used.

A detailed summary/comparison of tree methods can be found in [White & Jain] and [Kurniawati et al.]


Insertion with Trees

Given: a feature vector vobj for a new object.

Insertion procedure:

   traverse tree to find leaf node n
   	    whose region contains vobj
   
   if (enough space in n)
   {
      insert vobj into corresponding page
   }
   else
   {
      need to perform local rearrangement in tree
         by splitting points between n and n's parent
         and n's siblings according to split heuristic
   }


Split heuristics include:


Query with Trees

Given:   query vector vq,   index tree.

Method involves:

Consider the following space and two query points q1 and q2.

[Diagram:pic/qspace]


... Query with Trees

Aim of search: reduce the NN-region as quickly as possible
(no need to consider nodes which do not intersect the NN-region)

Variations on search methods in literature:

May not achieve 100% accuracy; depends on pruning strategy.
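As one concrete (and simplified) instance of such pruning, the sketch below does a best-first search over a tree of bounding hyper-rectangles, visiting only nodes that could still intersect the current NN-region. The node layout (.lo/.hi bounds, .children on internal nodes, .points on leaves) is an assumption for illustration, not the structure of any particular tree scheme:

import heapq
import numpy as np

def mindist(q, lo, hi):
    """Minimum possible distance from query q to the hyper-rectangle [lo, hi]."""
    return float(np.linalg.norm(q - np.clip(q, lo, hi)))

def tree_knn(root, q, k):
    """Best-first k-NN search; leaves have .points = [(objId, vector)] and empty .children."""
    results = []                                     # max-heap via negated distances
    frontier = [(mindist(q, root.lo, root.hi), id(root), root)]
    while frontier:
        bound, _, node = heapq.heappop(frontier)
        if len(results) == k and bound >= -results[0][0]:
            break                                    # no remaining node can improve the k-NN
        if node.children:                            # internal node: expand children
            for c in node.children:
                heapq.heappush(frontier, (mindist(q, c.lo, c.hi), id(c), c))
        else:                                        # leaf: compute exact distances
            for obj, v in node.points:
                d = float(np.linalg.norm(v - q))
                if len(results) < k:
                    heapq.heappush(results, (-d, obj))
                elif d < -results[0][0]:
                    heapq.heapreplace(results, (-d, obj))
    return sorted((-d, obj) for d, obj in results)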


Why Trees Don't Work (for d > 10)

[Weber and Blott] derive some results on average performance of kNN search:


Signature-based Methods   (VA-Files)

Signature methods abandon the idea of search complexity better than O(N).

Use a file of signatures in parallel with main data file, one signature for each vector.

[Diagram:pic/sigfile]

Signatures have properties:

One example is the Vector Approximation file (VA-file) of [Weber and Blott].


VA-File Signatures

Vector (b_1, b_2, ..., b_d) has corresponding signature h_m(b_1) h_m(b_2) ... h_m(b_d)

where h_m gives the top-order m bits of each component

This essentially:

[Diagram:pic/va-partition]

For non-equi-width regions, use a mapping array to determine partition boundaries.
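A sketch of both variants in Python (parameter names and the value range are illustrative assumptions):

import numpy as np

def va_signature(v, m, v_max=256.0):
    """Equi-width regions: keep the top m bits of each (scaled) component."""
    cells = np.floor(np.asarray(v) / v_max * (2 ** m)).astype(int)
    return np.clip(cells, 0, 2 ** m - 1)

def quantile_boundaries(data, m):
    """Mapping array for non-equi-width regions: per-dimension boundaries chosen
    so that each of the 2^m slices holds roughly equal numbers of points."""
    qs = np.linspace(0.0, 1.0, 2 ** m + 1)
    return np.quantile(data, qs, axis=0)              # shape (2^m + 1, d)

def va_signature_mapped(v, boundaries):
    """Non-equi-width regions: locate each component within its dimension's slices."""
    d = len(v)
    return np.array([np.searchsorted(boundaries[1:-1, j], v[j]) for j in range(d)])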


Insertion with VA-Files

Given: a feature vector vobj for a new object.

Insertion is extremely simple.

sig = signature(vobj)
append vobj to data file
append sig to signature file

Storage overhead determined by m   (e.g. 3/32).


Query with VA-Files

Given:   query vector vq.

results = [];  maxD = infinity;

for each (obj,sig) in signature file
{
	vnear = closest point to vq in region[sig]
	dist = D(vnear,vq)                  // lower bound on the true distance
	if (#results < k  or  dist < maxD)  // filter step
	{
		fetch vobj from data file
		dist = D(vobj,vq)           // exact distance (refine step)
		if (#results < k  or  dist < maxD)
		{
			insert (obj,dist) into results
			maxD = largest dist in results
		}
	}
}

Cost  =  2·T_open  +  T_P·(N_S + f·N)  +  T_D·(N + f·N)

where N_S is the number of pages in the signature file and f is the fraction of signatures that pass the filtering step (the filtering factor).

Note: achieves 100% accuracy.
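A sketch of the filter-and-refine loop above (Python, reusing the boundary array from the previous sketch; the data file is abstracted behind fetch_vector, which is an assumption):

import heapq
import numpy as np

def region_lower_bound(vq, sig, boundaries):
    """Closest any vector in cell `sig` could possibly be to vq (the vnear distance)."""
    dims = np.arange(len(sig))
    lo, hi = boundaries[sig, dims], boundaries[sig + 1, dims]
    return float(np.linalg.norm(vq - np.clip(vq, lo, hi)))

def va_knn(signatures, fetch_vector, vq, boundaries, k):
    """signatures: list of (objId, sig); fetch_vector(objId) reads the data file."""
    results = []                                              # max-heap via negated distances
    for obj, sig in signatures:
        maxD = -results[0][0] if len(results) == k else float("inf")
        if region_lower_bound(vq, sig, boundaries) < maxD:    # filter on the signature
            d = float(np.linalg.norm(fetch_vector(obj) - vq)) # refine on the real vector
            if d < maxD:
                if len(results) == k:
                    heapq.heapreplace(results, (-d, obj))
                else:
                    heapq.heappush(results, (-d, obj))
    return sorted((-d, obj) for d, obj in results)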


Why VA-Files Don't Work (for QBIC)

In the experiments reported by [Weber and Blott], they use uniform data distributions and achieve a filtering factor of 0.0005.

In experiments with QBIC colour histograms, which have extremely skewed distributions, we were unable to achieve better than 0.10 filtering on average.

And this was after "tuning" partition boundaries to place approximately equal numbers of points in each partition.

Before tuning, filtering factor was 0.70.

Conclusion: VA-file performance is very sensitive to data distribution.

VA-files will also perform poorly if T_D is large   (we compute a distance for every signature)


Projection-based Methods   (C-curves)

Achieve efficiency by projecting the d-dimensional points onto one (or a few) dimensions and indexing the projected values.

The projection typically loses information, and so these methods may give an approximate answer.

Our method, CurveIx (Curve Index), is an example of this approach.


Proposed Approach: CurveIx

Further develops a method initially proposed in [Megiddo and Shaft]

(And possibly earlier? ... the idea of space-filling curves has been around for a while)

The basic idea:

  1. map each feature vector to a position on each of m space-filling curves
  2. store the (curveId, position, imageId) tuples in a sorted one-dimensional index
  3. at query time, collect as candidates the objects whose positions lie close to the query's position on any of the curves
  4. compute exact distances only for the candidates

[Diagram:pic/c-curves]

We want candidate set to contain most/all of k-NN ...


Curve Mapping Functions

Each Ci has the following properties:

Points within a small region of the curve are likely to have been mapped from vectors that are close in d-space.

We generate multiple Cis by applying different coordinate permutations and shifts (a sketch follows):
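A minimal sketch of one such family: a Z-order (bit-interleaved) curve, varied per curve by a random coordinate permutation and a random shift. The actual CurveIx curves may be constructed differently; this only illustrates the idea, using the 4 levels of expansion mentioned later:

import numpy as np

def make_curve(d, levels, rng):
    """Build one mapping C_i: d-dimensional vector -> position on a 1-d curve."""
    perm = rng.permutation(d)                       # random coordinate order
    shift = rng.integers(0, 2 ** levels, size=d)    # random per-dimension offset

    def C(v, v_max=256.0):
        cells = np.clip(np.floor(np.asarray(v) / v_max * (2 ** levels)).astype(int),
                        0, 2 ** levels - 1)
        cells = (cells[perm] + shift) % (2 ** levels)
        key = 0                                     # interleave bits, most significant first
        for bit in reversed(range(levels)):
            for c in cells:
                key = (key << 1) | ((int(c) >> bit) & 1)
        return key

    return C

rng = np.random.default_rng(0)
curves = [make_curve(d=256, levels=4, rng=rng) for _ in range(100)]   # m = 100 curves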


Data Structures for CurveIx

Ci a mapping function for each of the m different space-filling curves (i=1..m)
Db a database of N d-dimensional feature vectors; each entry is a pair:

(imageId, vector)

imageId forms a primary key

Index a database of Nm curve points; each point is represented by a tuple

(curveId, point, imageId)

the pair (curveId, point) forms a primary key with an ordering, where point = C_curveId(v_imageId)

vq feature vector for query q
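A sketch of this layout using a dbm-style key-value store (Python's dbm module; the key encodings are illustrative assumptions, and the key width matches the 4-level, 256-d curves sketched earlier). Note that a plain hash-based dbm gives no ordered range scan, so the real Index needs an ordered access method such as a B-tree cursor:

import dbm
import numpy as np

db = dbm.open("Db", "c")        # imageId -> feature vector
ix = dbm.open("Index", "c")     # (curveId, point) -> imageId

def insert_image(image_id, v, curves):
    """Store the vector once, plus one Index entry per curve."""
    db[image_id] = np.asarray(v, dtype=np.float32).tobytes()
    for i, C in enumerate(curves):
        # fixed-width, zero-padded fields so byte order matches (curveId, point) order
        key = b"%03d:%0256x" % (i, C(v))
        ix[key] = image_id.encode()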


Database Construction

For each image obj, insertion requires:

for i in 1..m
{
     p = Ci(vobj)
     insert (i, p, obj) into Index
}
insert (obj, vobj) into Db


Example: Search with 5 Curves

[Diagram:pic/curves]


Finding k-NN   (Simple Approach)

Given:   Db ,   vq ,   C1 .. Cm ,   Index

for i in 1..m
{
     p = Ci(vq)
     lookup (i,p) in Index
     fetch (i,p',j)
     while p' "close to" p on curve i
     {
          collect next (i,p',j) as candidate
     }
}
for each candidate j
{
     lookup j in Db
     fetch (j,vj)
     d = D(vj , vq)
     include j in k-NN if d small enough
}

Cost   =   2·T_open  +  m·T_sel  +  f·N·T_P  +  f·N·T_D

where T_sel is the cost of one Index lookup (locating the query's position on a curve) and f is the fraction of the database collected as candidates.
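An in-memory sketch of this search (Python; a sorted list with binary search stands in for the ordered Index, and reach corresponds to the Range parameter used later, counting neighbouring curve entries taken on each side — an illustrative choice):

import bisect
import numpy as np

def build_index(db, curves):
    """db: dict imageId -> vector.  One sorted list of (position, imageId) per curve."""
    return [sorted((C(v), obj) for obj, v in db.items()) for C in curves]

def curveix_knn(db, index, curves, vq, k, reach=30):
    candidates = set()
    for C, entries in zip(curves, index):
        pos = bisect.bisect_left(entries, (C(vq), ""))          # query's place on this curve
        for _, obj in entries[max(0, pos - reach): pos + reach]:  # nearby entries -> candidates
            candidates.add(obj)
    dists = sorted((float(np.linalg.norm(db[obj] - vq)), obj) for obj in candidates)
    return dists[:k]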


CurveIx vs. Linear Scan

For linear scan:

    Cost   =   T_open  +  N_P·T_P  +  N·T_D

For CurveIx:

    Cost   =   2·T_open  +  m·T_sel  +  f·N·T_P  +  f·N·T_D

If CurveIx is to be worthwhile, we need:

    (1)   m·T_sel  +  f·N·T_P   <   N_P·T_P

    (2)   f·N·T_D   <   N·T_D

(2) simply requires f < 1, which is essential anyway.


... CurveIx vs. Linear Scan

If we assume that each curve lookup involves 3-4 page accesses (i.e. T_sel ≈ 3·T_P), we can restate (1) as:

    (1)   (3m + f·N)·T_P   <   N_P·T_P

Some observations, given N_P = 0.05·N:
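As one illustrative observation, plugging in the figures used in the experiments below (N = 20,000 images, m = 100 curves; these particular values are only an example):

    3·100  +  20,000·f   <   0.05 · 20,000   =   1,000

    =>   f   <   0.035

i.e. with 100 curves, the candidate set must be below about 3.5% of the database for the I/O cost alone to beat a linear scan.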

Difficult to analyse further, so we implemented the system to see how it performed ...


Implementation

Aim: build a quick/easy/light implementation of the above approach. Two components:

build

query


Example for build Program

Inserted img00000
Inserted img00001
...

where the data file looks like:

...
img00102
 54 0 0 0 0 0 0 5 10 3 1 1 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0
 0 1 8 10 3 2 1 7 3 2 1 1 0 0 0 0
 0 0 0 2 14 34 3 1 2 1 1 0 0 0 0 0
 0 0 0 0 0 0 2 1 0 0 0 3 1 1 2 737
 3 0 5 2 1 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 0 7 0 25 2 1 1 1 0 0 0 9 2
 1 1 0 0 1 2 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 3 0 0 0 5 0 0 0 1 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
...


Example for query Program

An example of executing query on a database of 20,000 images (color histograms) with 100 curves:


QUERY: img00102
0.000000  img00102
0.005571  img00713
0.008826  img00268
0.011413  img05054
0.011811  img00631
0.014259  img04042
0.027598  img00203
0.037471  img00287
0.063639  img00244
0.067583  img00306
# dist calculations: 655
...
   various dbm, i/o statistics
...
2.00user 0.27system ...


Ci Values as Keys

One problem was raised early in the implementation: to use curve values as dbm keys, we need them at fixed precision, so we limit curve expansion (to 4 levels).

Problems:

Solution:


Query Algorithm

Given:   Db,  Index,  vq,  k,  Range,  Ncurves

dbm.open(Db)
dbm.open(Index)
compute Ci perms and shifts
for i in 1..Ncurves
{
     p = Ci(vq)
     up = dbm.lookup(Index , (i,p))   // cursor to scan up the curve from p
     dn = dbm.lookup(Index , (i,p))   // cursor to scan down the curve from p
     img = dbm.fetch(up)
     insert img in results
     while within Range for up and dn
     {
          dbm.next(up); img = dbm.fetch(up)
          check insert img in results
          dbm.prev(dn); img = dbm.fetch(dn)
          check insert img in results
     }
}
for i in 1..k
     display (dist,imageId) for results[i]


... Query Algorithm

check insert img in results
{
     if already seen img
          return  // no need to check
     
     data = dbm.lookup(Db , img)
     vimg = dbm.fetch(data)
     d = D(vimg , vq)
     include img in k-NN if d small enough
}

The dbm.lookup steps are expensive   (T_sel)

Once a cursor is established, access is via a cached buffer.


Performance Analysis

Since CurveIx does not guarantee to deliver the k-NN among its candidates, we set an "acceptable" accuracy level of 90%.

In other words, CurveIx must deliver at least 0.9·k of the true nearest neighbours to be considered useful.

Our initial concern has been to answer the questions:

So far, preliminary answers are only available for 256-d QBIC color histograms:


Experiments

Measures for accuracy:

Measures for efficiency:

To determine how these measures vary:

Also implemented a linear scan version, for comparison and to collect the exact answer sets.


Sample Comparison

CurveIx Indexing:


QUERY: img00102
0.000000  img00102
0.005571  img00713
0.008826  img00268
0.011413  img05054
0.011811  img00631
0.014259  img04042
0.027598  img00203
0.037471  img00287
0.063639  img00244
0.067583  img00306

# dist calcs: 524
1.67user 0.23system ...


Linear Scan:

QUERY: img00102
0.000000  img00102
0.005571  img00713
0.008826  img00268
0.011413  img05054
0.011811  img00631
0.014259  img04042
0.027598  img00203
0.037471  img00287
0.063639  img00244
0.067583  img00306

# dist calcs: 20000
1.93user 1.12system ...


Experimental Results

For fixed database (20K), effect of varying Range, Ncurves

Curves   Range   Acc1   Acc2   Dist
  20       20    6.72   0.20    426
  20       30    7.28   0.28    695
  20       40    7.68   0.36    874
  40       20    7.68   0.28    957
  40       30    8.16   0.40   1301
  40       40    8.60   0.44   1703
  60       30    8.40   0.44   1905
  60       40    8.60   0.48   2413
  80       30    8.87   0.58   2485
  80       40    9.20   0.72   3381
 100       30    9.10   0.70   3061
 100       40    9.28   0.72   4156


Results: Size vs. Accuracy

For fixed CurveIx parameters (80 curves, 30 range), effect of varying database size:

#Images   Acc1   Acc2
   5K     9.72   0.76
  10K     9.44   0.80
  15K     9.16   0.70
  20K     9.04   0.64


Results: Costs

For the 20K database (CurveIx (80,30)):

Approach      Size(Db)   Size(Ix)   IO(bytes)
Linear Scan   42MB       -          42MB
CurveIx       46MB       345MB      4+36MB


Other Optimisations for QBIC Indexing

The following techniques can be applied regardless of the underlying index structure.

They involve exploiting properties of data structures and access patterns to reduce amount of I/O.

Derived primarily from profiling studies of QBIC.


Caching

Maintain an in-memory collection of relevant pages.

Read/write pages only when absolutely necessary.

Most commercial DBMSs do this invisibly and effectively.

The dbm system:

Current implementation of QBIC opens/closes file for each record look-up.


... Caching

Interposing an open-file cache significantly reduces the number of calls to the operating system for file I/O.

Example: insertion of 1000 colour histograms, first without and then with the open-file cache:

...
<1000> Insert of ... successful
Reads: 37504864/13137   Writes: 9383936/2291
1213.8u 49.2s 26:02 80% 1760+4256k 0+0io 2087pf+0w

...
<1000> Insert of ... successful
Reads: 4096/1   Writes: 618496/151
1207.2u 11.8s 24:33 82% 2895+4440k 0+0io 2116pf+0w

Since Unix performs its own caching, this technique yields little impact on wall-clock times.

Since queries are handled by sequential scan, cache provides little benefit for them.

For a query server, it may provide some benefit at query time.


Compression

QBIC colour histograms:

Scope for compression

Have implemented second:


Clustering

Reducing to k candidates is no better than k·N_O candidates if each one comes from a separate page   (recall that N_O may be larger than 100)

This situation becomes worse when vectors are smaller, and so N_O becomes larger

Approach: try to ensure that "similar" objects are placed in the same database page.

Plan to look at how effectively clustering can be performed for very large numbers of dimensions.


Conclusions

The CurveIx approach shows some potential as a solution for the HDkNN search problem.

The following aspects require further research and implementation before this approach could be considered a real solution for k-NN:


References/Further Reading

  1. Berchtold, Böhm & Kriegel,
    The Pyramid Technique: Towards breaking the curse of dimensionality,
    ACM SIGMOD, 1998
    [proposes a new data access technique which is optimized for high-dimensional range queries]
  2. Gaede & Günther,
    Multidimensional access methods,
    Technical Report, Humboldt University, Berlin, 1995
    [comprehensive survey of multidimensional database access methods]
  3. Hellerstein, Koutsoupias & Papadimitriou,
    On the analysis of indexing schemes,
    ACM PODS, 1997
    [initial work on a general theory of "indexability", which is analogous to complexity theory and describes how effective data/query/access-method combinations are in terms of disk access and storage cost]
  4. Kurniawati, Jin & Shepherd,
    Techniques for supporting efficient content-based retrieval in multimedia databases,
    Australian Computer Journal, 1997
  5. Megiddo & Shaft,
    Efficient nearest neighbour indexing based on a collection of space-filling curves,
    IBM Internal Research Report, 1997
    [description of a method to index multidimensional data based on multiple complementary "linearisations" of the data points]
  6. Weber & Blott,
    An Approximation-Based Data Structure for Similarity Search,
    Unpublished Report, Bell Labs, 1998
    [provides a reasonably formal argument that partitioning-based methods degenerate to worse than linear search once dimensionality is around 6..10; proposes VA-files as an alternative]
  7. White & Jain,
    Algorithms and Strategies for Similarity Retrieval,
    UCSD Visual Computing Lab Report, 1996
    [survey and characterisation of a large number of approaches to multidimensional access methods]
Plus numerous other papers downloaded into /home/jas/papers.


Produced: 28 Aug 98