Handling missing values in database systems using a naive Bayesian classifier

Banchong Harangsri, Samuel Matsushima, John Shepherd, Anne H.H. Ngu

SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, Tucson, Arizona, May 1997.


This paper demonstrates the utility of a naive Bayesian classifier for handling missing values in database systems. Earlier work in the literature has proposed storing null or default values in place of missing values in databases. We propose a novel probabilistic approach to the problem that uses a learning classifier from the machine learning community. We show, through a variety of experiments with both continuous and discrete attribute domains, that the probabilistically generated values derived by the classifier are more accurate than default values, i.e., they give lower misclassification rates. Across all 25 of our experiments, the misclassification rate was at most 21 percent with our approach and at most 98 percent with the default value approach; our approach was more accurate in every experiment.
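
As a rough illustration of the idea in the abstract, and not the authors' implementation, the sketch below fills in a missing discrete attribute with a naive Bayesian classifier and compares the result with a default value. The table, the attribute names (`dept`, `level`, `city`), the use of Laplace smoothing, and the choice of the most frequent value as the "default" are all assumptions made for this example; continuous attributes are not covered.

```python
from collections import Counter, defaultdict

def train(rows, target):
    """Count how often each target value occurs, and how often each
    attribute value occurs together with each target value."""
    class_counts = Counter()
    value_counts = defaultdict(Counter)  # (attribute, class) -> value counts
    domains = defaultdict(set)           # attribute -> set of observed values
    for row in rows:
        c = row[target]
        class_counts[c] += 1
        for attr, val in row.items():
            if attr != target:
                value_counts[(attr, c)][val] += 1
                domains[attr].add(val)
    return class_counts, value_counts, domains

def impute(model, known):
    """Return the target value with the highest naive Bayes score,
    P(class) * product of P(attribute value | class)."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    best, best_score = None, -1.0
    for c, n in class_counts.items():
        score = n / total
        for attr, val in known.items():
            # Laplace smoothing (an assumption of this sketch) avoids zero
            # probabilities for combinations never seen in the training rows.
            score *= (value_counts[(attr, c)][val] + 1) / (n + len(domains[attr]))
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical complete tuples used to train the classifier.
rows = [
    {"dept": "sales", "level": "junior", "city": "Sydney"},
    {"dept": "sales", "level": "senior", "city": "Sydney"},
    {"dept": "sales", "level": "junior", "city": "Sydney"},
    {"dept": "research", "level": "senior", "city": "Tucson"},
    {"dept": "research", "level": "junior", "city": "Tucson"},
]

model = train(rows, target="city")

# A tuple whose "city" value is missing.
guess = impute(model, {"dept": "research", "level": "senior"})

# Default value approach, assumed here to be the most frequent value.
default = Counter(r["city"] for r in rows).most_common(1)[0][0]

print(guess, default)
```

In this toy table the classifier uses the known attributes of the incomplete tuple, whereas the default value ignores them and returns the same value for every tuple.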

Keys: database systems, default values, missing values, machine learning

