What is cross-entropy, and why use it?

The cross-entropy measure has been used as an alternative to squared error. Cross-entropy can be used as an error measure when a network's outputs can be thought of as representing independent hypotheses (e.g. each node stands for a different concept), and the node activations can be understood as representing the probability (or confidence) that each hypothesis might be true. In that case, the output vector represents a probability distribution, and our error measure, cross-entropy, indicates the distance between what the network believes this distribution should be and what the teacher says it should be. There is a practical reason to use cross-entropy as well. It may be more useful in problems in which the targets are 0 and 1 (though the outputs may, of course, take values in between). Cross-entropy tends to allow errors to change the weights even when nodes saturate (that is, when their derivatives are asymptotically close to 0).


This is an excerpt from p. 166 of Plunkett and Elman: Exercises in Rethinking Innateness, MIT Press, 1997.
Wikipedia article on cross entropy
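
The point about saturation can be made concrete with a small numerical sketch (this is not part of the excerpt above). For a single sigmoid output y with target t, the cross-entropy error is E = -(t ln y + (1 - t) ln(1 - y)), while the squared error is E = 0.5 (y - t)^2. Differentiating with respect to the node's net input z shows why saturation matters: the squared-error gradient carries a factor y(1 - y) that vanishes as y approaches 0 or 1, whereas for cross-entropy that factor cancels and the gradient reduces to y - t. The Python sketch below (names such as output_gradients are purely illustrative) prints both gradients for a node that saturates at the wrong answer.

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def output_gradients(z, t):
        # For a single sigmoid output y = sigmoid(z) with target t,
        # compare dE/dz under the two error measures.
        y = sigmoid(z)
        # Squared error  E = 0.5 * (y - t)^2 ; the chain rule keeps the
        # factor y * (1 - y), which goes to 0 as the node saturates.
        grad_squared = (y - t) * y * (1.0 - y)
        # Cross-entropy  E = -(t * ln y + (1 - t) * ln(1 - y)) ; here the
        # y * (1 - y) factor cancels, so the gradient is simply y - t.
        grad_xent = y - t
        return y, grad_squared, grad_xent

    # A node that is badly wrong: the target is 1, but the net input z
    # drives the activation toward 0 (the node saturates at the wrong end).
    for z in (-1.0, -4.0, -8.0):
        y, g_sq, g_xe = output_gradients(z, t=1.0)
        print("z=%5.1f  y=%.4f  squared-error dE/dz=%+.5f  cross-entropy dE/dz=%+.5f"
              % (z, y, g_sq, g_xe))

Running this shows the squared-error gradient shrinking toward 0 as the node saturates, while the cross-entropy gradient stays close to -1, so the errors can still change the weights.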