Handling missing values in database systems using a naive Bayesian classifier
Banchong Harangsri,
Samuel Matsushima,
John Shepherd,
Anne H.H. Ngu,
SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery,
Tucson, Arizona, May 1997.
This paper demonstrates the utility of a naive Bayesian classifier
for handling the problem of missing values in database systems.
Previous work has proposed storing null or default values in place
of missing values in databases. We propose a novel probabilistic
approach to the problem, based on a learning classifier from the
machine learning community. We show, through a variety of
experiments with both continuous and discrete attribute domains,
that the values generated probabilistically by the classifier are
more accurate than default values, i.e., they give lower
misclassification rates. Across all 25 of our experiments, the
misclassification rate was at most 21 percent with our approach and
at most 98 percent with the default value approach; our approach was
more accurate in every experiment.
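
The idea of filling a missing value with a classifier's prediction rather than a fixed default can be illustrated with a minimal sketch. This is not the paper's implementation: the toy table, the attribute names, the use of Laplace smoothing, and the choice of "most frequent value" as the default are all assumptions made for illustration, and only discrete attributes are handled here.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, target):
    """Collect the counts a naive Bayesian classifier needs.

    rows   -- list of dicts with no missing values (complete tuples)
    target -- name of the attribute whose missing values we want to fill
    """
    class_counts = Counter()              # value of target -> frequency
    value_counts = defaultdict(Counter)   # (attribute, class) -> value frequencies
    domains = defaultdict(set)            # attribute -> observed values
    for row in rows:
        c = row[target]
        class_counts[c] += 1
        for attr, val in row.items():
            if attr == target:
                continue
            value_counts[(attr, c)][val] += 1
            domains[attr].add(val)
    return class_counts, value_counts, domains

def predict(model, row):
    """Return the most probable target value given the known attributes of row."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    best, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / total               # prior P(c)
        for attr, val in row.items():
            if attr not in domains:       # skips the target and unknown attributes
                continue
            k = len(domains[attr])
            # Laplace-smoothed estimate of P(attr = val | c) (an assumption here)
            score *= (value_counts[(attr, c)][val] + 1) / (n_c + k)
        if score > best_score:
            best, best_score = c, score
    return best

def default_value(rows, target):
    """Baseline: always fill in the most frequent observed value."""
    return Counter(r[target] for r in rows).most_common(1)[0][0]

# Hypothetical toy relation; "office" is the attribute with missing values.
rows = [
    {"dept": "sales", "level": "junior", "office": "north"},
    {"dept": "sales", "level": "senior", "office": "north"},
    {"dept": "sales", "level": "junior", "office": "north"},
    {"dept": "eng",   "level": "junior", "office": "south"},
    {"dept": "eng",   "level": "senior", "office": "south"},
]
model = train_naive_bayes(rows, "office")
incomplete = {"dept": "eng", "level": "senior"}   # tuple missing "office"
print(predict(model, incomplete))      # classifier-generated value
print(default_value(rows, "office"))   # default-value baseline
```

On this toy table the classifier uses the known attributes of the incomplete tuple, whereas the baseline returns the same value for every tuple regardless of its other attributes.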
Keywords:
database systems,
default values,
missing values,
machine learning