Can QBIC Be Made Faster?
(viewing version)
Can QBIC Be Made Faster?
Some reflections on the HDkNN Problem
(High Dimensional k Nearest-Neighbours)
for Image Indexing
John Shepherd
IBM Almaden research Center (Feb-Aug,1998)
University of New South Wales, Sydney (regularly)
The Problem
Inputs
- a database of N objects
(obj1, obj2, ..., objN)
- representations of objects as d-dimensional vectors
(vobj1, vobj2, ..., vobjN)
(each vector may represent one or more features of the object)
- a query object q with associated d-dimensional vector (vq)
- a distance measure D(vi,vj) : [0..1)
(D=0 -> vi=vj)
Outputs
- a list of the k nearest objects in the database
[a1, a2, ... ak]
- ordered by distance to query object
D(va1,vq) <=
D(va2,vq) <= ... <=
D(vak,vq)
And all done as efficiently as possible ...
... where time is the most important efficiency factor
The Context
The solution has to work for ...
- number of dimensions d = 1 .. 500
- number of objects N = 104 .. 1010
- all values in query vector are specified (i.e. no partial-match)
- implementation runs on "standard" hardware/OS
(e.g. don't assume arrays of parallel disks, huge memory, ...)
- independent of distance measure (?)
- 100% retrieval accuracy (?)
And, most importantly, the solution should ...
- use minimal resources: i/o, cpu-time, space
- provide fast database updates (for some applications e.g. Magnifi)
The Costs
I/O cost
- the dominant cost (given current disk technology)
- measured in terms of pages read, not records read
CPU cost
- primarily related to computing distance measure D(vobj, vq)
- can be relatively expensive for some QBIC features
Space
- need to store data (feature vectors) within page-size chunks
- access methods typically require additional disk space
(which can be anywhere from 10% extra to >100% extra)
- intermediate data structures in memory?
Cost Parameters
We need to consider the following quantities:
| Name |
Meaning |
Typically |
| N |
number of objects in the database |
103 .. 1010 |
| d |
number of dimensions in object features |
1 .. 256 |
| P |
number of bytes per disk page |
512 .. 8K |
| V |
number of bytes for d-dimensional vector |
4d |
| O |
average number of bytes per stored object (key+vector) |
12+4d .. 256+4d |
NO |
average number of stored objects per page (=floor(P/O)) |
1 .. 512 |
| NP |
number of disk pages to hold stored objects (=ceil(N/NO)) |
50 .. 1010 |
| Topen |
time to open a database file |
10ms |
| TP |
time to read a page from disk into memory |
10ms |
| TD |
time to compute distance between two objects (using vectors) |
100us (?) |
| TK |
average time to find the page address for a given object |
10ms (?) |
The TK measure is related to the use of a DBMS
(e.g. dbm or DB2) to look up object vectors via the object
identifier/key (the file name of the image).
Analysing Costs
We (pessimistically) assume
that there is no effective caching of pages read from the database.
That is, accessing n object vectors (chosen at random) is
going to result in n page reads.
This gives an upper-bound on the actual cost;
we discuss possible amelioration of this via clustering and caching later.
On the other hand, we also assume that if the access to
the vectors is via a sequential scan, then each page in
the database file will be read exactly once.
Measuring Costs
Unix has useful tools for measuring/analysing CPU costs (gprof).
Unix (AIX) is not helpful in measuring I/O costs
(the rusage structure doesn't seem to count block input/output)
We use a combination of
- our own I/O monitoring (to count
read, write ops)
- profiling (fine-grained analysis of cpu-time costs)
- wall-clock time (the "bottom line" for users)
Current Approach
Given a query vector vq, current QBIC method is (essentially):
results = []; maxD = infinity
for each obj in the database
{
dist = D(vobj,vq)
if (#results < k or dist < maxD)
{
insert (obj,dist) into results
// results is sorted with length <= k
maxD = largest dist in results
}
}
Cost = Topen + NPTP + NTD
Note: If q is an image from the database, we can use a
pre-computed distance table to make this much faster.
Goals for a Solution
Any reasonable solution aims to:
- reduce the number of objects considered (find candidates)
- ... which, in turn, reduces i/o and D computations
- but without excessive overhead in determining candidates
Some Potential Solutions
Partition-based
- use auxiliary (spatial) data structure to identify candidates
- space-partitioning methods: Grid file, k-d-b-tree, quadtree
- data-partitioning methods: R-tree, X-tree, SS-tree, TV-tree, ...
Approximation-based
- use approximating data structure to identify candidates
- signature files: VA-files
- projections: space-filling curves
Other Optimisations
- reduce I/O by reducing size of vectors (compression)
- reduce I/O by placing "similar" records together (clustering)
- reduce I/O by remembering previous pages (caching)
- reduce cpu by making D computation faster (ColorHist)
Tree-based Methods
Each leaf in the tree corresponds to a subregion of the d-dimensional space
Parent nodes correspond to union of subregions in children
Partitioned space containing data points:
Tree that induced above partitioning:
Above has no overlap among subregions; some partitioning schemes
permit overlap.
Above uses hyper-rectangles for partitioning;
other shapes (e.g. spheres, convex polyhedra) have been used.
Detailed summary/comparison of tree methods can be found in
[White & Jain]
,
[Kurniawati et al.]
Insertion with Trees
Given: a feature vector vobj for a new object.
Insertion procedure:
traverse tree to find leaf node n
whose region contains vobj
if (enough space in n)
{
insert vobj into corresponding page
}
else
{
need to perform local rearrangement in tree
by splitting points between n and n's parent
and n's siblings according to split heuristic
}
Split heuristics include:
- minimise total volume of bounding regions
- minimise variance in volume of regions
- minimise overlap among regions
- .....
Query with Trees
Given:   query vector vq, index tree.
Method involves:
- a search over the tree
- a "nearest neighbour (NN) region"
(related to maxD in the original search method).
Consider the following space and two query points q1 and q2.
... Query with Trees
Aim of search: reduce NN-region quickly as possible
(no need to consider nodes which do not intersect NN-region)
Variations on search methods in literature:
- overall search order
e.g. depth-first vs. breadth-first
- order to check nodes
e.g. distance from query point to node centroid
- pruning criteria
e.g. minimum distance exceeds
maxD
May not achieve 100% accuracy; depends on pruning strategy.
Why Trees Don't Work (for d > 10)
[Weber and Blott]
derive some results on average performance of kNN search:
- (degeneration)
the complexity of searching via a partitioning scheme tends
to O(N) for sufficiently large d
- (complexity)
for every partitioning scheme, there exists some d
such that all blocks are accessed if dimensionality exceeds d
- (performance)
most data- and space-partitioning schemes perform worse than
sequential scan by the time d=10
Signature-based Methods (VA-Files)
Signature methods abandon the idea of search complexity better than O(N).
Use a file of signatures in parallel with main data file,
one signature for each vector.
Signatures have properties:
- (much) more compact than vectors
- provide approximation to information in vectors
One example is the Vector Approximation file by
[Weber and Blott.]
VA-File Signatures
Vector (b1, b2, ..., bd)
has corresponding signature
him(b1)him(b2)...him(bd)
where him gives the top-order m bits
This essentially:
- partitions each dimension into 2m regions
- maps vector into "identifier" of region that contains it
For non-equi-width regions, use mapping array to determine
partition boundaries.
Insertion with VA-Files
Given: a feature vector vobj for a new object.
Insertion is extremely simple.
sig = signature(vobj)
append vobj to data file
append sig to signature file
Storage overhead determined by m (e.g. 3/32).
Query with VA-Files
Given: query vector vq.
results = []; maxD = infinity;
for each sig in signature file
{
vnear = closest point to vq in region[sig]
dist = D(vnear,vq)
if (#results < k or dist < maxD)
{
dist = D(vobj,vq)
if (#results < k or dist < maxD)
{
insert (obj,dist) into results
maxD = largest dist in results
}
}
}
Cost = 2Topen + TP(Ns + fN) + TD(N + fN)
where
- Ns is number of pages in signature file
- f is filtering factor (0.0005 under ideal conditions)
Note: achieves 100% accuracy.
Why VA-Files Don't Work (for QBIC)
In the experiments reported by
[Weber and Blott]
they use uniform data distributions and achieve 0.0005 filtering.
In experiments with QBIC colour histograms, which have extremely skew
distributions, we were unable to achieve better than 0.10 filtering
on average.
And this was after "tuning" partition boundaries to place approximately
equal numbers of points in each partition.
Before tuning, filtering factor was 0.70.
Conclusion VA-file performance is very sensitive to data distribution.
VA-files will also perform poorly if TD is large
(we compute D for every signature)
Projection-based Methods (C-curves)
Achieve efficiency by
- projecting data space onto reduced dimensions
- utilising established access methods to find objects in projection
In the case of C-curves
[Megiddo and Shaft]
- project d-dimensional vectors onto several 1-d space-filling curves
- use efficient primary-key look-up (e.g. B-tree) to find neighbours
- intersect neighbour sets from several curves to obtain candidates
Question: how many curves do we need to ensure that we find all/most
of the near neighbours?
Answer: depends on the number of "significant" dimensions (determined
by e.g. PCA).
Data Structures for C-curves
One encoding function Ci for each of the m
different space-filling curves.
One "indexing" relation Curves consisting of (curveId, pointId, objectId) tuples.
Note that (curveId,pointId) form a primary key
with an ordering.
The relation can thus be indexed via an efficient database
access method.
Insertion requires:
for i in 1 .. m
{
pointId = Ci(vobj)
insert (i, pointId, obj) into Curves
}
insert vobj into data file
Query with C-curves
Given: vq, C1 .. Cm, Curves.
candidates = []
for i in 1 .. m
{
x[i] = Ci(vq)
objects = select objectId from Curves
where C.curveId = i
and x[i]-m <= C.pointId <= x[i]+m
candidates = candidates * objects
}
results = []; maxD = infinity
for each obj in candidates
{
dist = D(vobj, vq)
if (#results < k or dist < maxD)
{
insert (obj,dist) into results
maxD = largest dist in results
}
}
Cost = 2Topen + mTsel + TPfN + TDfN
where
- Tsel is the cost of the range query using e.g. B-tree
(typically log N)
- f is the filtering factor
Other Optimisations
The following techniques can be applied regardless of
the underlying index structure.
They involve exploiting properties of data structures
and access patterns to reduce amount of I/O.
Derived primarily from profiling studies of QBIC.
Caching
Maintain an in-memory collection of relevant pages.
Read/write pages only when absolutely necessary.
Most commercial DBMSs do this invisibly and effectively.
The dbm system:
- performs useful caching while a database file is open
- throws away cache each time file is closed
Current implementation of QBIC opens/closes file for each
record look-up.
... Caching
Interpolation of open-file cache significantly reduces calls
to operating system for file I/O.
Example: insertion of 1000 colour histograms:
% time QbMkDbs.nocache ... -f QbColorHistogramFeatureClass
...
<1000> Insert of ... successful
Reads: 37504864/13137 Writes: 9383936/2291
1213.8u 49.2s 26:02 80% 1760+4256k 0+0io 2087pf+0w
% time QbMkDbs.cached ... -f QbColorHistogramFeatureClass
...
<1000> Insert of ... successful
Reads: 4096/1 Writes: 618496/151
1207.2u 11.8s 24:33 82% 2895+4440k 0+0io 2116pf+0w
Since Unix performs its own caching, this technique
yields little impact on wall-clock times.
Since queries are handled by sequential scan, cache
provides little benefit for them.
For a query server, may provide some querying benefit.
Compression
QBIC colour histograms:
- consist of 256 [0..1000] values as
short[256]
- contain many zero values (on average >200/256)
Scope for compression
- 16-bit quantities used for 10-bit values
(fixed-length compressed vectors)
- because of repetition of particular value (zero)
(variable-length compressed vectors)
Have implemented second:
- reduced storage costs and query I/O by around 2/3
- no increase in processing cost (slight reduction)
Clustering
Reducing to k candidates is no better than
kNO candidates
if each one comes from a separate page
(recall than NO may be larger than 100)
This situation becomes worse when
vectors are smaller,
and so NO becomes larger
Approach: try to ensure that "similar" objects are
placed in the same database page.
Plan to look at how effectively clustering can be performed
for very large numbers of dimensions.
References/Further Reading
-
Gaede & Günther,
Multidimensional access methods,
Technical Report, Humboldt University, Berlin, 1995
[comprehensive survey of multidimensional database access methods]
-
Hellerstein, Koutsoupias & Papadimitriou,
On the analysis of indexing schemes,
ACM PODS, 1997
[initial work on a general theory of "indexability", which is
analogous to complexity theory and describes how effective
data/query/access-method combinations are in terms
of disk access and storage cost]
-
Kurniawati, Jin & Shepherd,
Techniques for supporting efficient content-based retrieval
in multimedia databases,
Australian Computer Journal, 1997
-
Megiddo & Shaft,
Efficient nearest neighbour indexing based on a collection
of space-filling curves,
IBM Internal Research Report, 1997
[description of method to index multidimensional data based on
a multiple complementary "linearisations" of the data points]
-
Weber & Blott,
An Approximation-Based Data Struture for Similarity Search,
Unpublished Report, Bell Labs, 1998
[provides reasonably formal argument that partitioning-based methods
degenerate to worse than linear search once dimensionality around
6..10; proposes VA-files as an alternative]
-
White & Jain,
Algorithms and Strategies for Similarity Retrieval,
UCSD Visual Computing Lab Report, 1996
[survey and characterisation of a large number of approaches
to multidimensional access methods]
Plus numerous other papers downloaded into /home/jas/papers.
Produced: 31 Mar 98