Making QBIC Faster
Efficient Content-based Image Retrieval
John Shepherd
joint work with
Nimrod Megiddo, Xiaoming Zhu
IBM Almaden Research Center (Feb-Aug, 1998)
Overview
- QBIC and Content-based Image Retrieval
- High-dimensional k Nearest Neighbours Problem
- Database Indexing Schemes and HDkNN Retrieval
- The Curveix Approach ... Preliminary Results
- Conclusions and Future Work
The QBIC System
QBIC was developed around 1993 as an experiment in image retrieval.
The aim was to build, for images, an analogue of the text retrieval
operation "Find documents like this one".
The focus was on developing measures of image similarity that:
- corresponded somewhat to human perception of "similarity"
- were not too expensive to compute
... The QBIC System
Content-based Image Retrieval
User supplies a description of desired image via feature vectors.
System returns a ranked list of "matching" images from DB.
The Problem
The goal of the QBIC project is to develop a system that
- returns images that closely match the user's intent
- does this quickly, even for very large image collections
So far, the system has been somewhat successful on the first point,
but has limited itself to databases of < 10,000 images.
One efficiency measure already implemented: an N × N table of
pre-computed distances to speed up retrieval within the database.
High-dimensional Nearest Neighbours Search
The problem of finding k similar images is an instance of
the more general problem ...
Given:
- a set of data points in a d-dimensional space
- a specific query point q in that space
find the k data points in the set that are closest to q.
A solution to this problem would be useful in several areas:
- information retrieval, multimedia databases, machine learning, ...
The Problem (Re-stated)
Inputs
- a database of N objects
(obj1, obj2, ..., objN)
- representations of objects as d-dimensional vectors
(vobj1, vobj2, ..., vobjN)
(each vector may represent one or more features of the object)
- a query object q with associated d-dimensional vector (vq)
- a distance measure D(vi,vj) : [0..1)
(D=0 -> vi=vj)
Outputs
- a list of the k nearest objects in the database
[a1, a2, ..., ak]
- ordered by distance to query object:
D(va1,vq) <= D(va2,vq) <= ... <= D(vak,vq)
And all done as efficiently as possible ...
... where time is the most important efficiency factor
The Context
The solution has to work for ...
- number of dimensions d = 1 .. 500
- number of objects N = 10^4 .. 10^10
- all values in query vector are specified (i.e. no partial-match)
- implementation runs on "standard" hardware/OS
(e.g. don't assume arrays of parallel disks, huge memory, ...)
- independent of distance measure (?)
- 100% retrieval accuracy (?)
And, most importantly, the solution should ...
- use minimal resources: i/o, cpu-time, space
- provide fast database updates
(important for some applications e.g. Magnifi)
The Costs
I/O cost
- the dominant cost (given current disk technology)
- measured in terms of pages read, not records read
CPU cost
- primarily related to computing distance measure D(vobj, vq)
- can be relatively expensive for some QBIC features
Space
- need to store data (feature vectors) within page-size chunks
- access methods typically require additional disk space
(which can be anywhere from 10% extra to >100% extra)
- intermediate data structures in memory?
Cost Parameters
We need to consider the following quantities:
Name  | Meaning                                                       | Typically
N     | number of objects in the database                             | 10^3 .. 10^10
d     | number of dimensions in object features                       | 1 .. 256
P     | number of bytes per disk page                                 | 512 .. 8K
V     | number of bytes for a d-dimensional vector                    | 4d
O     | average number of bytes per stored object (key+vector)        | 12+4d .. 256+4d
NO    | average number of stored objects per page (= P/O)             | 1 .. 512
NP    | number of disk pages to hold stored objects (= N/NO)          | 50 .. 10^10
Topen | time to open a database file                                  | 10ms
TP    | time to read a page from disk into memory                     | 10ms
TD    | time to compute distance between two objects (using vectors)  | 100us (?)
TK    | average time to find the page address for a given object      | 10ms (?)
The TK measure is related to the use of a DBMS (e.g. dbm or DB2)
to look up object vectors via the object identifier/key
(the file name of the image).
Analysing Costs
We (pessimistically) assume that there is no effective caching of
pages read from the database.
That is, accessing n object vectors (chosen at random) will result
in n page reads.
This gives an upper bound on the actual cost; the real cost will
typically be less due to clustering and caching.
On the other hand, we also assume that if the vectors are accessed
via a sequential scan, then each page in the database file is read
exactly once.
Measuring Costs
Unix has useful tools for measuring/analysing CPU costs (gprof).
Unix (AIX) is not helpful in measuring I/O costs
(the rusage structure doesn't seem to count block input/output).
We use a combination of:
- our own I/O monitoring (to count read, write ops)
- profiling (fine-grained analysis of cpu-time costs)
- wall-clock time (the "bottom line" for users)
Current Approach
Given a query vector vq, the current QBIC method is (essentially):
results = []; maxD = infinity
for each obj in the database
{
dist = D(vobj,vq)
if (#results < k or dist < maxD)
{
insert (obj,dist) into results
// results is sorted with length <= k
maxD = largest dist in results
}
}
Cost = Topen + NP·TP + N·TD
Note: If q is an image from the database, we can use a
pre-computed distance table to make this much faster.
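To make the scan concrete, here is a minimal runnable sketch in Python.
The heap-based result list plays the role of the sorted results list
above; plain Euclidean distance is only a stand-in for QBIC's more
expensive histogram distance D.

import heapq
import numpy as np

# A sketch of the linear scan, assuming (obj, numpy-vector) records and
# Euclidean distance as a stand-in for QBIC's histogram distance D.
def knn_scan(db, vq, k):
    heap = []  # max-heap via negated distances; holds at most k entries
    for obj, v in db:
        dist = float(np.linalg.norm(v - vq))   # dist = D(vobj, vq)
        if len(heap) < k:
            heapq.heappush(heap, (-dist, obj))
        elif dist < -heap[0][0]:               # closer than current maxD
            heapq.heapreplace(heap, (-dist, obj))
    return sorted((-negd, obj) for negd, obj in heap)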
Specific Goals for a Solution
Any reasonable solution aims to:
- reduce the number of objects considered (find candidates)
- ... which, in turn, reduces I/O and D computations
- but without excessive overhead in determining candidates
Some Potential Solutions
Partition-based
- use auxiliary (spatial) data structure to identify candidates
- space-partitioning methods: Grid file, k-d-b-tree, quadtree
- data-partitioning methods: R-tree, X-tree, SS-tree, TV-tree, ...
Approximation-based
- use approximating data structure to identify candidates
- signature files: VA-files
- projections: space-filling curves
A new approach, the pyramid technique [Berchtold et al.],
has elements of both partitioning and projection.
... Some Potential Solutions
Other Optimisations
- reduce I/O by reducing size of vectors (compression)
- reduce I/O by placing "similar" records together (clustering)
- reduce I/O by remembering previous pages (caching)
- reduce cpu by making D computation faster (ColorHist)
Tree-based Methods
Each leaf in the tree corresponds to a subregion of the d-dimensional space
Non-leaf nodes correspond to a union of subregions in children
Partitioned space containing data points: [figure omitted]
... Tree-based Methods
Tree that induces the above partitioning: [figure omitted]
Above has no overlap among subregions; some partitioning schemes
permit overlap.
Above uses hyper-rectangles for partitioning;
other shapes (e.g. spheres, convex polyhedra) have been used.
A detailed summary/comparison of tree methods can be found in
[White & Jain], [Kurniawati et al.]
Insertion with Trees
Given: a feature vector vobj for a new object.
Insertion procedure:
traverse tree to find leaf node n
whose region contains vobj
if (enough space in n)
{
insert vobj into corresponding page
}
else
{
need to perform local rearrangement in tree
by splitting points between n and n's parent
and n's siblings according to split heuristic
}
Split heuristics include (the first is sketched below):
- minimise total volume of bounding regions
- minimise variance in volume of regions
- minimise overlap among regions
- .....
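To illustrate the first heuristic, a small Python sketch (the box
representation and names are assumptions for illustration, not QBIC
code): choose the bounding hyper-rectangle whose volume grows least
when it absorbs a new point.

import numpy as np

def volume(lo, hi):
    # volume of the hyper-rectangle with corners lo, hi
    return float(np.prod(np.maximum(hi - lo, 0.0)))

def enlargement(lo, hi, p):
    # extra volume needed for the box (lo, hi) to cover point p
    return volume(np.minimum(lo, p), np.maximum(hi, p)) - volume(lo, hi)

def choose_box(boxes, p):
    # index of the box needing minimal enlargement to hold p
    return min(range(len(boxes)), key=lambda i: enlargement(*boxes[i], p))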
Query with Trees
Given:   query vector vq, index tree.
Method involves:
- a search over the tree
- a "nearest neighbour (NN) region"
(related to maxD
in the original search method).
Consider the following space and two query points q1 and q2: [figure omitted]
... Query with Trees
Aim of search: reduce the NN-region as quickly as possible
(no need to consider nodes which do not intersect the NN-region)
Variations on search methods in the literature:
- overall search order (e.g. depth-first vs. breadth-first)
- order in which to check nodes (e.g. distance from query point to node centroid)
- pruning criteria (e.g. minimum possible distance exceeds maxD; see the sketch below)
May not achieve 100% accuracy; depends on pruning strategy.
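The pruning test can be sketched with the usual "mindist" computation
(a standard construction, assumed here rather than taken from any one
of the cited papers): the closest point of a hyper-rectangle to the
query is the query clipped to the box.

import numpy as np

def mindist(vq, lo, hi):
    # smallest Euclidean distance from vq to the box [lo, hi]
    nearest = np.clip(vq, lo, hi)   # closest point of the box to vq
    return float(np.linalg.norm(vq - nearest))

def can_prune(vq, lo, hi, maxD):
    # a node can be skipped if it cannot contain anything within maxD
    return mindist(vq, lo, hi) > maxD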
Why Trees Don't Work (for d > 10)
[Weber and Blott]
derive some results on average performance of kNN search:
- (degeneration)
the complexity of searching via a partitioning scheme tends
to O(N) for sufficiently large d
- (complexity)
for every partitioning scheme, there exists some d
such that all blocks are accessed if dimensionality exceeds d
- (performance)
most data- and space-partitioning schemes perform worse than
sequential scan by the time d=10
Signature-based Methods (VA-Files)
Signature methods abandon the idea of search complexity better than O(N).
Use a file of signatures in parallel with main data file,
one signature for each vector.
Signatures have properties:
- (much) more compact than vectors
- provide approximation to information in vectors
One example is the Vector Approximation (VA) file of [Weber and Blott].
VA-File Signatures
A vector (b1, b2, ..., bd) has the corresponding signature
hi_m(b1) hi_m(b2) ... hi_m(bd)
where hi_m gives the top-order m bits.
This essentially:
- partitions each dimension into 2^m regions
- maps the vector to the "identifier" of the region that contains it
For non-equi-width regions, use a mapping array to determine
partition boundaries. (Both variants are sketched below.)
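A sketch of both variants in Python, assuming coordinates scaled to
[0,1); signature is the equi-width case, signature_nonuniform the
mapping-array case.

import numpy as np

def signature(v, m):
    # equi-width: hi_m(b) is the index of b's cell among 2^m cells
    return np.minimum((v * (1 << m)).astype(int), (1 << m) - 1)

def signature_nonuniform(v, boundaries):
    # boundaries[j]: sorted 2^m - 1 inner cut points for dimension j
    return np.array([np.searchsorted(boundaries[j], v[j], side='right')
                     for j in range(len(v))])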
Insertion with VA-Files
Given: a feature vector vobj for a new object.
Insertion is extremely simple.
sig = signature(vobj)
append vobj to data file
append sig to signature file
Storage overhead is determined by m (e.g. m = 3 bits per 32-bit coordinate gives 3/32).
Query with VA-Files
Given: query vector vq.
results = []; maxD = infinity
for each sig in signature file
{
    vnear = closest point to vq in region[sig]
    bound = D(vnear,vq)   // lower bound on the true distance
    if (#results < k or bound < maxD)
    {
        fetch vobj from data file
        dist = D(vobj,vq)
        if (#results < k or dist < maxD)
        {
            insert (obj,dist) into results
            maxD = largest dist in results
        }
    }
}
Cost = 2·Topen + TP·(Ns + f·N) + TD·(N + f·N)
where
- Ns is the number of pages in the signature file
- f is the filtering factor (0.0005 under ideal conditions)
Note: achieves 100% accuracy.
Why VA-Files Don't Work (for QBIC)
In the experiments reported by [Weber and Blott],
uniform data distributions are used, achieving 0.0005 filtering.
In experiments with QBIC colour histograms, which have extremely skewed
distributions, we were unable to achieve better than 0.10 filtering
on average.
And this was after "tuning" partition boundaries to place approximately
equal numbers of points in each partition.
Before tuning, filtering factor was 0.70.
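The "tuning" amounts to choosing the cut points as quantiles of the
data, so that each of the 2^m cells holds roughly the same number of
points. A sketch under that reading (the (N, d) array layout is an
assumption):

import numpy as np

def tuned_boundaries(data, m):
    # per-dimension cut points so each cell gets roughly N / 2^m points
    q = np.linspace(0.0, 1.0, (1 << m) + 1)[1:-1]  # inner quantile levels
    return [np.quantile(data[:, j], q) for j in range(data.shape[1])]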
Conclusion: VA-file performance is very sensitive to data distribution.
VA-files will also perform poorly if TD is large
(since we compute D for every signature).
Projection-based Methods (C-curves)
Achieve efficiency by
- projecting data space onto reduced dimensions
- utilising existing access methods (e.g. B-trees) to find objects in projection
The projection typically loses information, and so these methods
may give an approximate answer.
Our method, CurveIx (Curve Index),
is an example of this approach.
Proposed Approach: CurveIx
CurveIx further develops a method initially proposed in [Megiddo and Shaft].
(And possibly earlier ... the idea of space-filling curves has been
around for a while.)
The basic idea:
- project d-dimensional vector onto 1-d space-filling curve
- collect set of curve-neighbours (should contain some k-NN)
- repeat above for several curves with different paths through space
- union of neighbour sets from all curves gives candidates
- retrieve and compute distance only for candidates
We want the candidate set to contain most/all of the k-NN ...
- how many curves do we need?
- how many neighbours do we examine on each curve?
Curve Mapping Functions
Each Ci has the following properties:
- maps d-dimensional vectors onto points on a line, i.e. [0,1)^d -> [0,1)
- the mapping is one-to-one and defines an order on [0,1)^d
- the inverse mapping is continuous
Points within a small region of the curve are likely
to have been mapped from vectors that are close in d-space.
We generate multiple Ci's by applying (see the sketch below):
- permutations over vector elements
- translations of individual elements
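A sketch of curve generation: permute and translate coordinates, then
apply one fixed base curve. A Z-order (bit-interleaving) curve stands
in here for the Hilbert mapping CurveIx actually uses, and the
resolution bits per dimension is an assumption.

import random

def z_order(cells, bits):
    # interleave the bits of d cell indices into one curve position
    key = 0
    for b in range(bits - 1, -1, -1):
        for c in cells:
            key = (key << 1) | ((c >> b) & 1)
    return key

def make_curve(d, bits, seed):
    rnd = random.Random(seed)
    perm = rnd.sample(range(d), d)             # permute vector elements
    shift = [rnd.random() for _ in range(d)]   # translate each element
    def curve(v):                              # v: d floats in [0,1)
        w = [(v[perm[j]] + shift[j]) % 1.0 for j in range(d)]
        cells = [min(int(x * (1 << bits)), (1 << bits) - 1) for x in w]
        return z_order(cells, bits)
    return curve

curves = [make_curve(d=256, bits=4, seed=i) for i in range(100)]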
Data Structures for CurveIx
Ci    | a mapping function for each of the m different space-filling curves (i = 1..m)
Db    | a database of N d-dimensional feature vectors; each entry is a pair (imageId, vector); imageId forms a primary key
Index | a database of N·m curve points; each point is represented by a tuple (curveId, point, imageId); the pair (curveId, point) forms an ordered primary key, where point = C_curveId(v_imageId)
vq    | the feature vector for query q
Database Construction
For each image obj, insertion requires:
for i in 1..m
{
p = Ci(vobj)
insert (i, p, obj) into Index
}
insert (obj, vobj) into Db
Example: Search with 5 Curves [figure omitted]
Finding k-NN (Simple Approach)
Given: Db, vq, C1..Cm, Index
for i in 1..m
{
p = Ci(vq)
lookup (i,p) in Index
fetch (i,p',j)
while p' "close to" p on curve i
{
collect next (i,p',j) as candidate
}
}
for each candidate j
{
lookup j in Db
fetch (j,vj)
d = D(vj , vq)
include j in k-NN if d small enough
}
Cost = 2·Topen + m·Tsel + N·f·TP + N·f·TD
where
- Tsel is the cost of an Index lookup (e.g. via a B-tree)
- f is the filtering factor
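A sketch of the candidate-collection step, with a sorted Python list
plus bisect standing in for the dbm B-tree; rng plays the role of the
"close to" window.

import bisect

def candidates(index, curves, vq, rng):
    # index: sorted list of (curve_id, point, obj) tuples
    cands = set()
    for i, C in enumerate(curves):
        p = C(vq)
        pos = bisect.bisect_left(index, (i, p, ""))
        lo, hi = max(pos - rng, 0), min(pos + rng, len(index))
        for cid, _, obj in index[lo:hi]:
            if cid == i:            # stay on curve i
                cands.add(obj)
    return cands    # then fetch each candidate's vector, compute D, rank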
CurveIx vs. Linear Scan
For linear scan:
Cost = Topen + NP·TP + N·TD
For CurveIx:
Cost = 2·Topen + m·Tsel + N·f·TP + N·f·TD
If CurveIx is to be worthwhile, we need:
(1) m·Tsel + N·f·TP < NP·TP
(2) N·f·TD < N·TD
(2) simply requires f < 1, which is essential anyway.
... CurveIx vs. Linear Scan
If we assume that each curve lookup involves 3-4 page accesses
(i.e. Tsel ≈ 3·TP), we can restate (1) as:
(1) (3m + N·f)·TP < NP·TP
Some observations, given NP = 0.05 N:
- we can't afford to have too many curves (m)
- CurveIx has to do effective filtering (f)
- clustering can reduce the NfTP term
Difficult to analyse further, so we implemented the system to see
how it performed ...
Implementation
Aim: build a quick/easy/light implementation of the above approach.
- make it stand-alone from QBIC, but use QBIC's distance function
- use the Unix dbm libraries to implement keyed retrieval
Two components:
build
- reads image idents and color histograms from stdin
- stores each in the Db, a dbm B-tree keyed on ident
- computes m curve-points for each, and stores them in the Index, a dbm B-tree keyed on curveId and position (the Ci(v) value)
query
- reads a single color histogram from stdin
- uses the above method to find the k best matches
- displays the list of best-match idents and distances on stdout, along with performance statistics
Example for build Program
Inserted img00000
Inserted img00001
...
where the data file looks like:
...
img00102
54 0 0 0 0 0 0 5 10 3 1 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0
0 1 8 10 3 2 1 7 3 2 1 1 0 0 0 0
0 0 0 2 14 34 3 1 2 1 1 0 0 0 0 0
0 0 0 0 0 0 2 1 0 0 0 3 1 1 2 737
3 0 5 2 1 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 7 0 25 2 1 1 1 0 0 0 9 2
1 1 0 0 1 2 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 3 0 0 0 5 0 0 0 1 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
...
Example for query Program
An example of executing query on a database of 20,000
images (color histograms) with 100 curves:
QUERY: img00102
0.000000 img00102
0.005571 img00713
0.008826 img00268
0.011413 img05054
0.011811 img00631
0.014259 img04042
0.027598 img00203
0.037471 img00287
0.063639 img00244
0.067583 img00306
# dist calculations: 655
...
various dbm, i/o statistics
...
2.00user 0.27system ...
Ci Values as Keys
One problem arose early in the implementation: the Hilbert numbers
produced by Ci can be expanded to arbitrary precision in order to
distinguish between Ci(p1) and Ci(p2).
To use such values as dbm keys, we need them at fixed precision,
so we limit expansion (to 4 levels).
Problems:
- gives keys that are 96 bytes long (producing very large Index files)
- different vj can be mapped to the same point (so the size of candidate sets is artificially inflated)
Solution:
- use a "prefix" index structure (something like a trie)
Query Algorithm
Given: Db , Index , q , k , Range , Ncurves
dbm.open(Db)
dbm.open(Index)
compute Ci perms and shifts
for i in 1..Ncurves
{
p = Ci(vq)
up = dbm.lookup(Index , (i,p))
dn = dbm.lookup(Index , (i,p))
img = dbm.fetch(up)
insert img in results
while within Range for up and dn
{
dbm.next(up); img = dbm.fetch(up)
check insert img in results
dbm.prev(dn); img = dbm.fetch(dn)
check insert img in results
}
}
for i in 1..k
    display (dist, imageId) for results[i]
... Query Algorithm
check insert img in results
{
if already seen img
return // no need to check
data = dbm.lookup(Db , img)
vimg = dbm.fetch(data)
d = D(vimg , vq)
include img in k-NN if d small enough
}
The dbm.lookup steps are expensive (Tsel).
Once a cursor is established, access is via a cached buffer.
Performance Analysis
Since CurveIx does not guarantee to deliver the k-NN among its
candidates, we set an "acceptable" accuracy level of 90%.
In other words, CurveIx must deliver 0.9k of the true nearest
neighbours to be considered useful.
Our initial concern has been to answer the questions:
- How many curves are needed to achieve 90% accuracy?
- How many curve-neighbours do we need to examine?
- Can all of this be done reasonably efficiently?
So far, preliminary answers are available only for 256-d
QBIC color histograms:
- around 80-100 curves
- around 15-30 neighbours per curve
- not particularly ... yet
Experiments
Measures for accuracy:
- Acc1: average number of the true top-10 entries in the CurveIx top-10
- Acc2: how frequently CurveIx gives 10 out of 10
Measures for efficiency:
- Size: size of Db file + Index file
- Dist: number of distance calculations required
- IO: total amount of I/O performed
To determine how these measures vary, we:
- built databases of size 5K, 10K, 15K, 20K (supersets)
- for each database, ran a 25-query "benchmark" set
- for each query, ran with 3, 5, 10, 20, 30, 40 curve-neighbours (but because of the curve-mapping problem, only 20, 30, 40 were usable)
- for each query, ran with 20, 40, 60, 80, 100 curves
Also implemented a linear scan version for comparison and
to collect the exact answer sets.
Sample Comparison
CurveIx Indexing:
QUERY: img00102
0.000000 img00102
0.005571 img00713
0.008826 img00268
0.011413 img05054
0.011811 img00631
0.014259 img04042
0.027598 img00203
0.037471 img00287
0.063639 img00244
0.067583 img00306
# dist calcs: 524
1.67user 0.23system ...

Linear Scan:
QUERY: img00102
0.000000 img00102
0.005571 img00713
0.008826 img00268
0.011413 img05054
0.011811 img00631
0.014259 img04042
0.027598 img00203
0.037471 img00287
0.063639 img00244
0.067583 img00306
# dist calcs: 20000
1.93user 1.12system ...
Experimental Results
For a fixed database (20K), the effect of varying Range and Ncurves:
Curves | Range | Acc1 | Acc2 | Dist
20     | 20    | 6.72 | 0.20 |  426
20     | 30    | 7.28 | 0.28 |  695
20     | 40    | 7.68 | 0.36 |  874
40     | 20    | 7.68 | 0.28 |  957
40     | 30    | 8.16 | 0.40 | 1301
40     | 40    | 8.60 | 0.44 | 1703
60     | 30    | 8.40 | 0.44 | 1905
60     | 40    | 8.60 | 0.48 | 2413
80     | 30    | 8.87 | 0.58 | 2485
80     | 40    | 9.20 | 0.72 | 3381
100    | 30    | 9.10 | 0.70 | 3061
100    | 40    | 9.28 | 0.72 | 4156
Results: Size vs. Accuracy
For fixed CurveIx parameters (80 curves, range 30), the effect of
varying database size:
#Images | Acc1 | Acc2
5K      | 9.72 | 0.76
10K     | 9.44 | 0.80
15K     | 9.16 | 0.70
20K     | 9.04 | 0.64
Results: Costs
For the 20K database (CurveIx (80,30)):
Approach    | Size(Db) | Size(Ix) | IO(bytes)
Linear Scan | 42MB     | -        | 42MB
CurveIx     | 46MB     | 345MB    | 4+36MB
Other Optimisations for QBIC Indexing
The following techniques can be applied regardless of the underlying
index structure.
They involve exploiting properties of data structures and access
patterns to reduce the amount of I/O.
They were derived primarily from profiling studies of QBIC.
Caching
Maintain an in-memory collection of relevant pages.
Read/write pages only when absolutely necessary.
Most commercial DBMSs do this invisibly and effectively.
The dbm system:
- performs useful caching while a database file is open
- throws away its cache each time the file is closed
The current implementation of QBIC opens/closes the file for each
record look-up.
... Caching
Interposing an open-file cache significantly reduces calls to the
operating system for file I/O.
Example: insertion of 1000 colour histograms, without and then with
the cache:
...
<1000> Insert of ... successful
Reads: 37504864/13137 Writes: 9383936/2291
1213.8u 49.2s 26:02 80% 1760+4256k 0+0io 2087pf+0w
...
<1000> Insert of ... successful
Reads: 4096/1 Writes: 618496/151
1207.2u 11.8s 24:33 82% 2895+4440k 0+0io 2116pf+0w
Since Unix performs its own caching, this technique
yields little impact on wall-clock times.
Since queries are handled by sequential scan, the cache provides
little benefit for them.
For a query server, it may provide some querying benefit.
Compression
QBIC colour histograms:
- consist of 256 values in [0..1000], stored as short[256]
- contain many zero values (on average >200 of 256)
Scope for compression:
- 16-bit quantities are used for 10-bit values (fixed-length compressed vectors)
- a particular value (zero) is heavily repeated (variable-length compressed vectors)
We have implemented the second (sketched below):
- reduced storage costs and query I/O by around 2/3
- no increase in processing cost (in fact, a slight reduction)
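A sketch of the variable-length scheme: store only the non-zero
(bin, count) pairs. With >200 zero bins, at most ~56 pairs of ~4 bytes
replace the original 512 bytes, which is consistent with the ~2/3
reduction above. Function names are illustrative.

def compress(hist):
    # hist: sequence of 256 small ints (a QBIC colour histogram)
    return [(i, c) for i, c in enumerate(hist) if c != 0]

def decompress(pairs, d=256):
    hist = [0] * d
    for i, c in pairs:
        hist[i] = c
    return hist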
Clustering
Reducing to k candidates is no better than reducing to k·NO candidates
if each candidate comes from a separate page
(recall that NO may be larger than 100).
This situation becomes worse as vectors get smaller,
since NO then becomes larger.
Approach: try to ensure that "similar" objects are
placed in the same database page.
Plan to look at how effectively clustering can be performed
for very large numbers of dimensions.
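One plausible sketch (an assumption, not an implemented scheme): lay
records out in the order of a single space-filling curve, so that
curve-neighbours, which are often similar, share pages.

def paged_layout(db, curve, per_page):
    # db: list of (obj, vector); curve: vector -> curve position
    ordered = sorted(db, key=lambda rec: curve(rec[1]))
    return [ordered[i:i + per_page]
            for i in range(0, len(ordered), per_page)]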
Conclusions
The CurveIx approach shows some potential as a solution for the
HDkNN search problem.
The following aspects require further research and implementation
before this approach could be considered a real solution for k-NN:
-
control of search over curves and neighbours
-
better data structure for curve index
-
clustering strategies to reduce page reads
- compare with other approaches (e.g. the [Pyramid technique])
References/Further Reading
-
Berchtold & Böhm & Kriegel,
The Pyramid Technique: Towards breaking the curse of dimensionality,
ACM SIGMOD, 1998
[proposes a new data access technique which is optimized for high-dimensional
range queries]
-
Gaede & Günther,
Multidimensional access methods,
Technical Report, Humboldt University, Berlin, 1995
[comprehensive survey of multidimensional database access methods]
-
Hellerstein, Koutsoupias & Papadimitriou,
On the analysis of indexing schemes,
ACM PODS, 1997
[initial work on a general theory of "indexability", which is
analogous to complexity theory and describes how effective
data/query/access-method combinations are in terms
of disk access and storage cost]
-
Kurniawati, Jin & Shepherd,
Techniques for supporting efficient content-based retrieval
in multimedia databases,
Australian Computer Journal, 1997
-
Megiddo & Shaft,
Efficient nearest neighbour indexing based on a collection
of space-filling curves,
IBM Internal Research Report, 1997
[description of a method to index multidimensional data based on
multiple complementary "linearisations" of the data points]
-
Weber & Blott,
An Approximation-Based Data Structure for Similarity Search,
Unpublished Report, Bell Labs, 1998
[provides a reasonably formal argument that partitioning-based methods
degenerate to worse than linear search once dimensionality reaches
around 6..10; proposes VA-files as an alternative]
-
White & Jain,
Algorithms and Strategies for Similarity Retrieval,
UCSD Visual Computing Lab Report, 1996
[survey and characterisation of a large number of approaches
to multidimensional access methods]
Plus numerous other papers downloaded into /home/jas/papers.
Produced: 28 Aug 98