Can QBIC Be Made Faster?

(viewing version)


Can QBIC Be Made Faster?


Some reflections on the HDkNN Problem
(High Dimensional k Nearest-Neighbours)
for Image Indexing


John Shepherd


IBM Almaden research Center   (Feb-Aug,1998)
University of New South Wales, Sydney   (regularly)


The Problem

Inputs Outputs And all done as efficiently as possible ...

... where time is the most important efficiency factor


The Context

The solution has to work for ... And, most importantly, the solution should ...


The Costs

I/O cost CPU cost Space


Cost Parameters

We need to consider the following quantities:

Name Meaning Typically
N number of objects in the database 103 .. 1010
d number of dimensions in object features 1 .. 256
P number of bytes per disk page 512 .. 8K
V number of bytes for d-dimensional vector 4d
O average number of bytes per stored object (key+vector) 12+4d .. 256+4d
NO average number of stored objects per page (=floor(P/O)) 1 .. 512
NP number of disk pages to hold stored objects (=ceil(N/NO)) 50 .. 1010
Topen time to open a database file 10ms
TP time to read a page from disk into memory 10ms
TD time to compute distance between two objects (using vectors) 100us (?)
TK average time to find the page address for a given object 10ms (?)

The TK measure is related to the use of a DBMS (e.g. dbm or DB2) to look up object vectors via the object identifier/key (the file name of the image).


Analysing Costs

We (pessimistically) assume that there is no effective caching of pages read from the database.

That is, accessing n object vectors (chosen at random) is going to result in n page reads.

This gives an upper-bound on the actual cost; we discuss possible amelioration of this via clustering and caching later.

On the other hand, we also assume that if the access to the vectors is via a sequential scan, then each page in the database file will be read exactly once.


Measuring Costs

Unix has useful tools for measuring/analysing CPU costs (gprof).

Unix (AIX) is not helpful in measuring I/O costs
(the rusage structure doesn't seem to count block input/output)

We use a combination of


Current Approach

Given a query vector vq, current QBIC method is (essentially):

results = [];   maxD = infinity

for each obj in the database
{
	dist = D(vobj,vq)
	if (#results < k or dist < maxD)
	{
		insert (obj,dist) into results
		// results is sorted with length <= k
		maxD = largest dist in results
	}
}


Cost  =  Topen  +  NPTP  +  NTD

Note: If q is an image from the database, we can use a pre-computed distance table to make this much faster.


Goals for a Solution

Any reasonable solution aims to:


Some Potential Solutions

Partition-based Approximation-based Other Optimisations


Tree-based Methods

Each leaf in the tree corresponds to a subregion of the d-dimensional space

Parent nodes correspond to union of subregions in children

Partitioned space containing data points:

[Diagram:pic/space]

Tree that induced above partitioning:

[Diagram:pic/tree]

Above has no overlap among subregions; some partitioning schemes permit overlap.

Above uses hyper-rectangles for partitioning; other shapes (e.g. spheres, convex polyhedra) have been used.

Detailed summary/comparison of tree methods can be found in [White & Jain] , [Kurniawati et al.]


Insertion with Trees

Given: a feature vector vobj for a new object.

Insertion procedure:

traverse tree to find leaf node n
		whose region contains vobj

if (enough space in n)
{
	insert vobj into corresponding page
}
else
{
	need to perform local rearrangement in tree
	   by splitting points between n and n's parent
	   and n's siblings according to split heuristic
}


Split heuristics include:


Query with Trees

Given:   query vector vq,   index tree.

Method involves:

Consider the following space and two query points q1 and q2.

[Diagram:pic/qspace]


... Query with Trees

Aim of search: reduce NN-region quickly as possible
(no need to consider nodes which do not intersect NN-region)

Variations on search methods in literature:

May not achieve 100% accuracy; depends on pruning strategy.


Why Trees Don't Work (for d > 10)

[Weber and Blott] derive some results on average performance of kNN search:


Signature-based Methods   (VA-Files)

Signature methods abandon the idea of search complexity better than O(N).

Use a file of signatures in parallel with main data file, one signature for each vector.

[Diagram:pic/sigfile]

Signatures have properties:

One example is the Vector Approximation file by [Weber and Blott.]


VA-File Signatures

Vector (b1, b2, ..., bd) has corresponding signature him(b1)him(b2)...him(bd)

where him gives the top-order m bits

This essentially:

[Diagram:pic/va-partition]

For non-equi-width regions, use mapping array to determine partition boundaries.


Insertion with VA-Files

Given: a feature vector vobj for a new object.

Insertion is extremely simple.

sig = signature(vobj)
append vobj to data file
append sig to signature file

Storage overhead determined by m   (e.g. 3/32).


Query with VA-Files

Given:   query vector vq.

results = [];  maxD = infinity;

for each sig in signature file
{
	vnear = closest point to vq in region[sig]
	dist = D(vnear,vq)
	if (#results < k  or  dist < maxD)
	{
		dist = D(vobj,vq)
		if (#results < k  or  dist < maxD)
		{
			insert (obj,dist) into results
			maxD = largest dist in results
		}
	}
}

Cost = 2Topen   +   TP(Ns + fN)   +   TD(N + fN)

where

Note: achieves 100% accuracy.


Why VA-Files Don't Work (for QBIC)

In the experiments reported by [Weber and Blott] they use uniform data distributions and achieve 0.0005 filtering.

In experiments with QBIC colour histograms, which have extremely skew distributions, we were unable to achieve better than 0.10 filtering on average.

And this was after "tuning" partition boundaries to place approximately equal numbers of points in each partition.

Before tuning, filtering factor was 0.70.

Conclusion VA-file performance is very sensitive to data distribution.

VA-files will also perform poorly if TD is large   (we compute D for every signature)


Projection-based Methods   (C-curves)

Achieve efficiency by In the case of C-curves [Megiddo and Shaft]

[Diagram:pic/c-curves]

Question: how many curves do we need to ensure that we find all/most of the near neighbours?

Answer: depends on the number of "significant" dimensions (determined by e.g. PCA).


Data Structures for C-curves

One encoding function Ci for each of the m different space-filling curves.

One "indexing" relation Curves consisting of (curveId, pointId, objectId) tuples.

Note that (curveId,pointId) form a primary key with an ordering.

The relation can thus be indexed via an efficient database access method.

Insertion requires:

for i in 1 .. m
{
	pointId = Ci(vobj)
	insert (i, pointId, obj) into Curves
}
insert vobj into data file


Query with C-curves

Given:   vq,   C1 .. Cm,   Curves.

candidates = []
for i in 1 .. m
{
	x[i] = Ci(vq)
	objects = select objectId from Curves
	            where C.curveId = i
		    and x[i]-m <= C.pointId <= x[i]+m
	candidates = candidates * objects
}
results = [];   maxD = infinity
for each obj in candidates
{
	dist = D(vobj, vq)
	if (#results < k  or  dist < maxD)
	{
		insert (obj,dist) into results
		maxD = largest dist in results
	}
}

Cost = 2Topen  +  mTsel  +  TPfN  +  TDfN

where


Other Optimisations

The following techniques can be applied regardless of the underlying index structure.

They involve exploiting properties of data structures and access patterns to reduce amount of I/O.

Derived primarily from profiling studies of QBIC.


Caching

Maintain an in-memory collection of relevant pages.

Read/write pages only when absolutely necessary.

Most commercial DBMSs do this invisibly and effectively.

The dbm system:

Current implementation of QBIC opens/closes file for each record look-up.


... Caching

Interpolation of open-file cache significantly reduces calls to operating system for file I/O.

Example: insertion of 1000 colour histograms:

% time QbMkDbs.nocache ... -f QbColorHistogramFeatureClass
...
<1000> Insert of ... successful
Reads: 37504864/13137   Writes: 9383936/2291
1213.8u 49.2s 26:02 80% 1760+4256k 0+0io 2087pf+0w

% time QbMkDbs.cached ... -f QbColorHistogramFeatureClass
...
<1000> Insert of ... successful
Reads: 4096/1   Writes: 618496/151
1207.2u 11.8s 24:33 82% 2895+4440k 0+0io 2116pf+0w

Since Unix performs its own caching, this technique yields little impact on wall-clock times.

Since queries are handled by sequential scan, cache provides little benefit for them.

For a query server, may provide some querying benefit.


Compression

QBIC colour histograms:

Scope for compression

Have implemented second:


Clustering

Reducing to k candidates is no better than kNO candidates if each one comes from a separate page   (recall than NO may be larger than 100)

This situation becomes worse when vectors are smaller, and so NO becomes larger

Approach: try to ensure that "similar" objects are placed in the same database page.

Plan to look at how effectively clustering can be performed for very large numbers of dimensions.


References/Further Reading

  1. Gaede & Günther,
    Multidimensional access methods,
    Technical Report, Humboldt University, Berlin, 1995
    [comprehensive survey of multidimensional database access methods]
  2. Hellerstein, Koutsoupias & Papadimitriou,
    On the analysis of indexing schemes,
    ACM PODS, 1997
    [initial work on a general theory of "indexability", which is analogous to complexity theory and describes how effective data/query/access-method combinations are in terms of disk access and storage cost]
  3. Kurniawati, Jin & Shepherd,
    Techniques for supporting efficient content-based retrieval in multimedia databases,
    Australian Computer Journal, 1997
  4. Megiddo & Shaft,
    Efficient nearest neighbour indexing based on a collection of space-filling curves,
    IBM Internal Research Report, 1997
    [description of method to index multidimensional data based on a multiple complementary "linearisations" of the data points]
  5. Weber & Blott,
    An Approximation-Based Data Struture for Similarity Search,
    Unpublished Report, Bell Labs, 1998
    [provides reasonably formal argument that partitioning-based methods degenerate to worse than linear search once dimensionality around 6..10; proposes VA-files as an alternative]
  6. White & Jain,
    Algorithms and Strategies for Similarity Retrieval,
    UCSD Visual Computing Lab Report, 1996
    [survey and characterisation of a large number of approaches to multidimensional access methods]
Plus numerous other papers downloaded into /home/jas/papers.


Produced: 31 Mar 98