Spatial Data-Mining and Data Anomaly Detection
Opportunity
There is an ever-growing need for "watch-dog"
programs that can process truly vast data volumes and autonomously
identify, or shortlist, anomalous points or areas in the data,
sometimes without needing to be told specifically what constitutes an
anomaly. Customers for such capabilities include banks, insurance
companies, tax offices, homeland security and other security
organisations such as the INS, statistics bureaus, and
scientific institutions.
Example using meteorological data.
These days it is important that such systems operate in real time
alongside existing information processing systems, identifying patterns
of behaviour that require deeper scrutiny for reasons of fraud,
investment, defense, crime-fighting, or medical or scientific interest.
In some cases, specific sources of data may have no intrinsic anomalous
qualities, but may show up as anomalous
when compared with other sources of data. Making such comparisons
normally leads to prohibitive computational requirements, particularly
when the data may be updated several times per second. In other cases,
anomalies may only show up in unusual behaviour patterns over time.
Such cases are also difficult to identify with standard technologies.
The state of the art is quite limited. It is currently very difficult
or impossible to:
automatically classify large numbers of data sources
automatically classify the data into coherent groupings in real time
automatically update these classifications as the nature of the data changes
identify which sources of information are actually sources of misinformation
visually highlight individual sources of information, or groups of related sources
visually highlight data that is anomalous and unable to fit into a normal grouping.
This means that unusual behaviour, whether it be
criminal, potentially of a terrorist nature, of scientific interest,
statistically important, or otherwise instructive, can often go
undetected.
JIGSAW benefits – short term
JIGSAW is unusual in several ways:
scales to very large data volumes
identifies unusual behaviour patterns, rather than just unusual data
requires no cueing or priming to search for preconceived anomalies
works in real time or batch mode
tolerates high levels of noise or incoherence
finds the nearest relations to any data point
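To make the idea of "nearest relations" concrete, here is a minimal sketch of a brute-force nearest-neighbour lookup. It is a generic illustration only, not a description of how JIGSAW works internally: the function name nearest_relations, the use of Euclidean distance, and the dictionary layout of the data are all assumptions made for this example, and a brute-force scan like this would not scale to the data volumes discussed above.

```python
import heapq
import math


def nearest_relations(points, query_key, k=3):
    """Return the k keys whose vectors lie closest to points[query_key].

    points maps a hypothetical record identifier to a numeric vector.
    Closeness is plain Euclidean distance, chosen only for illustration.
    """
    query = points[query_key]
    # Distance from the query record to every other record.
    distances = (
        (math.dist(query, vector), key)
        for key, vector in points.items()
        if key != query_key
    )
    # Keep only the k smallest distances, closest first.
    return [key for _, key in heapq.nsmallest(k, distances)]
```

Given a dictionary of records, nearest_relations(records, "some_id", k=5) would return the five identifiers most similar to "some_id" under this assumed distance measure.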
When processing published meteorological data from
Australia's weather stations, JIGSAW (without prior suspicion) remotely
identified a poorly calibrated barometer at one of the outback
stations, and also estimated the magnitude of error in the instrument.
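One simple way to picture this kind of cross-source check is sketched below. It is not JIGSAW's method, which is not described here. The sketch assumes, for illustration only, that all stations report pressure at the same time steps and track a broadly common signal, so that a station's typical offset from the median of the other stations approximates its instrument error. The function name estimate_station_biases, the threshold value, and the data layout are hypothetical.

```python
from statistics import median


def estimate_station_biases(readings, threshold=3.0):
    """Estimate a constant offset per station and flag outlying stations.

    readings maps a station name to a list of pressure values (hPa),
    with all lists aligned on the same time steps (an assumption).
    Returns {station: (estimated_bias, flagged)}.
    """
    stations = list(readings)
    n_times = len(next(iter(readings.values())))

    biases = {}
    for station in stations:
        residuals = []
        for t in range(n_times):
            # Compare this station with the median of all the others.
            others = [readings[o][t] for o in stations if o != station]
            residuals.append(readings[station][t] - median(others))
        # The typical residual serves as the estimated instrument offset.
        biases[station] = median(residuals)

    # Robust spread of the offsets: median absolute deviation (MAD).
    centre = median(biases.values())
    mad = median(abs(b - centre) for b in biases.values()) or 1e-9

    return {
        s: (b, abs(b - centre) / mad > threshold)
        for s, b in biases.items()
    }
```

A station whose offset sits many MADs from the rest is flagged, and the offset itself gives a rough estimate of the size of the error. Real pressure data varies with location and altitude, so a practical system would need a far richer model than this shared-signal assumption.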