Neural Networks and Error Backpropagation Learning
Reference: Haykin, chapter 4
To introduce some basic concepts of neural networks and then describe
the backpropagation learning algorithm for feedforward neural networks.
backward pass in backpropagation,
forward pass in backpropagation,
generalization in backprop,
generalized delta rule,
hidden unit / node,
layer in a neural network,
linear threshold unit,
momentum in backprop,
multilayer perceptron (MLP),
sequence prediction tasks,
simple recurrent network,
stopping criterion in backprop,
total net input,
total sum-squared error,
- linear threshold units, perceptrons
- outline of biological neural processing
- artificial neurons and the sigmoid function
- error backpropagation learning
- delta rule
- forward and backward passes
- generalized delta rule
- example: XOR with bp in
- generalization and over-fitting
- applications of backprop
- Statistical and connectionist approaches to machine learning are related to
function approximation methods in mathematics.
- For the purposes of illustration let us assume that the learning task
is one of classification.
- That is, we wish to find ways of grouping objects in a universe.
- In Figure 1 we have a universe of objects that belong to either of
two classes `+' or `–'.
Figure 1: A linear discrimination between two classes
Classification Tasks 2
- Using function approximation techniques, we find a surface that separates
the space, and the objects in it, into two different regions.
- In the case shown in Figure 1, the "surface" is just a line, and the
associated function is called a linear discriminant function.
- Linear regression methods, or perceptron learning (see below) can be used
to find linear discriminant functions.
- A perceptron is a simple pattern classifier.
- Given a binary input vector, x, a weight vector, w,
and a threshold value, T, if
Σi wixi > T
then the output is 1, indicating membership of a class, otherwise it is 0,
indicating exclusion from the class.
- w.x = T describes a hyperplane and the goal
of perceptron learning is to find a weight vector, w, that results
in correct classification for all training examples.
Perceptron Learning (Outline)
- All the elements of the weight vector are initially set to 0.
- For each training example, if the perceptron outputs 0 when it
should output 1 then add the input vector to the weight vector; if
the perceptron outputs 1 when it should output 0 then subtract the
input vector from the weight vector; otherwise, do nothing.
- This is repeated until the perceptron yields the correct result
for each training example.
- The algorithm has the effect of reducing the error between
the actual and desired output.
- The perceptron is an example of a linear threshold unit (LTU).
- A single LTU can only recognise one kind of pattern, provided that
the input space is linearly separable.
Perceptron Learning (Outline) 2
Neural Models of Computation
- Biological neurons provide a model for computation (after all,
brains are built from them).
- They have inputs (dendrites), outputs (axon) and a
response to the inputs is generated by a process that gives rise to an
electrical pulse propagated along the axon.
- The place where an axon of one neuron connects to a dendrite of
another neuron is termed a synapse.
- Synapses differ in the effect that they have on the output of
- Some are strongly excitatory, some weakly excitatory, some
weakly inhibitory, and so on.
Figure 2: Neuron architecture
Biological Neurons and Artificial Neurons
- There is an analogy between a biological neuron and the simple
model neurons (also termed nodes or units) used in
- The incoming signals from the axons of other neurons correspond
to the inputs yi to the node.
- The strength of the synapse corresponds to weight
wi associated with a node.
- The signal propagated along the axon corresponds to the output
φ(Σi wiyi) of the node. (The
operator φ will be explained below.)
Figure 3: Artificial neuron (node) architecture
Overall Layout of MLP
- Typically also, each node in a layer (other than the output
layer) is connected to every node in the next layer by a trainable
- The overall layout is illustrated in Figure 5.
Figure 5. A multi-layer network
- Figure 6 shows the internals of a node.
- The weight wj0 acts like a threshold.
- The yi are the outputs of other nodes
(or perhaps inputs to the network).
- The first step is forming the weighted sum vj =
- The second step is applying the nonlinearity function φ
to vj to produce the output yj.
Figure 6. Internal functioning of node j
The Error Back-Propagation Learning Algorithm
- This algorithm was discovered and rediscovered a number of times -
for details, see, e.g. chapter 4 of Haykin, S. Neural Networks -
a comprehensive foundation, 2nd ed., p.43. This reference also
contains the mathematical details of the derivation of the
backpropagation equations, (2nd ed., p.161-167, 3rd ed., p.129-134)
which we shall omit. (This is covered in COMP9444 Neural Networks.)
- Like perceptron learning, back-propagation attempts to reduce the
errors between the output of the network and the desired result.
- However, assigning blame for errors to hidden nodes (i.e. nodes
in the intermediate layers), is not so straightforward. The error of
the output nodes must be propagated back through the hidden nodes.
- The contribution that a hidden node makes to an output node is
related to the strength of the weight on the link between the
two nodes and the level of activation of the hidden node when the
output node was given the wrong level of activation.
- This can be used to estimate the error value for a hidden node
in the penultimate layer, and that can, in turn, be used in making
error estimates for earlier layers.
Weight Change Equation
If node j is an output node, then δj is the product of
φ'(vj) and the error signal ej,
where φ(_) is the
logistic function and vj is the total input to node j (i.e.
and ej is the error signal for node j
(i.e. the difference between the desired output and the actual output);
- If node j is a hidden node, then
δj is the product of
φ'(vj) and the weighted sum of the
δ's computed for the nodes in
the next hidden or output layer that are connected to node j.
[The actual formula is δj =
φ'(vj) &Sigmak δkwkj
where k ranges over those nodes for which wkj is non-zero (i.e.
nodes k that actually have connections from node j. The δk values
have already been computed as they are in the output layer (or a layer closer to the output layer
than node j).]
Two Passes of Computation
FORWARD PASS: weights fixed, input signals propagated through network
and outputs calculated. Outputs oj are compared with desired outputs
dj; the error signal ej = dj - oj
BACKWARD PASS: starts with output layer and recursively computes the
local gradient δj for each node.
Then the weights are updated using the equation above for
Δwji, and back to another forward
With the sigmoidal function φ(x) defined above, it is the case
that φ'(vj) =
yj(1 - yj), a fact that simplifies the computations.
Rate of Learning
- If the learning rate η is very small, then the algorithm proceeds
slowly, but accurately follows the path of steepest descent in weight
- If η is largish, the algorithm may oscillate ("bounce off the
A simple method of effectively increasing the rate of learning is
to modify the delta rule by including a momentum term:
= α Δwji(n-1) +
where α is a positive constant termed the momentum constant.
This is called the generalized delta rule.
- The effect is that if the basic delta rule is consistently pushing
a weight in the same direction, then it gradually gathers "momentum"
in that direction.
Two commonly used stopping criteria are:
- stop after a certain number of runs through all the training
data (each run through all the training data is called an
- stop when the total sum-squared error reaches some low level.
By total sum-squared error we mean
where p ranges over all of the training patterns and i
ranges over all of the output units.
- The weights of a network to be trained by backprop must be
initialized to some non-zero values.
- The usual thing to do is to initialize the weights to small
- The reason for this is that sometimes backprop training runs
become "lost" on a plateau in weight-space, or for some other
reason backprop cannot find a good minimum error value.
- Using small random values means different starting points
for each training run, so that subsequent training runs have a
good chance of finding a suitable minimum.
The Discovery of Backprop
- The backprop algorithm was discovered by three groups at around about the same time.
- First chronologically was Paul Werbos, who published a more general version in his PhD in 1974.
He subsequently "spent many years struggling with folks who refused to listen or publish
or tolerate the idea. Finally, in 1981 [he] published a more persuasive brief paper [on it].
Both that paper and the thesis are reprinted in entirety in P. Werbos,
Roots of Backpropagation, Wiley 1994."
[Source: email from Paul Werbos.]
You can find material by Paul Werbos on his algorithm at http://www.werbos.com/AD2004.pdf
He is now a program manager in the US National Science Foundation.
- Rumelhart, Hinton and Williams published their version of the algorithm in the
mid-1980s. Rumelhart and McClelland produced/edited a two-volume book that included the RHW chapter on backprop, and
chapters on a wide range of other neural network models, in 1986. This book, known humorously as
the "PDP Bible" was extremely influential. Rumelhart did further important work
on neural networks before succumbing to a neurodegenerative disease. The Rumelhart
Prize - the "Nobel Prize" of Cognitive Science, was established in his honour. Hinton is
still going strong, at the University of Toronto. He won the first Rumelhart Prize in
2001. Williams is at Northeastern University, in Massachusetts.
- Yann le Cun, a PhD student in Paris, independently discovered the algorithm (PhD
completed in 1987). He subsequently joined Hinton's group for a while. He now works
at Courant Institute of Mathematical Sciences at New York University.
Yann le Cun
The XOR Problem
Backprop Specification in
Backprop Specification in
- The NODES section says that there
are 3 (non-input) nodes or neurons, of which one, node 3,
is an output node. And that there are two input nodes. In
fact these are referred two as i1 and i2. The hidden layer
nodes are 1 and 2. Node 0 is the bias node.
- The CONNECTIONS section specifies the
connections. Don't worry about groups.
1-3 from 0
means that all non-input nodes receive input from the bias
1-2 from i1-i2 says that the hidden layer
neurons receive input from the input nodes. And so on. Don't
worry about the SPECIAL section.
xor.data file specifies the input patterns,
xor.teach specifies the corresponding training
output patterns. Don't worry about
The second line,
4, just gives the number of
input/training patterns that are to follow.
Backprop Specification in
- create new project called "xor" in Network menu
- create xor.cf, xor.data, xor.teach files as above
- choose "Training Options ..." in Network menu; set options
- choose "Error Display" in Displays menu
- choose "Train the Network" in Network menu
Backprop as a Black Art
The tricky things about backprop networks include designing the network
and then setting the adjustable parameters:
It also turns out to be advisable to stop training early, for the
sake of better generalization performance.
- number of inputs
- number of hidden nodes
- number of layers,
- number of output nodes,
- Generalization means performance on unseen input patterns,
i.e. input patterns which were not among the patterns on which
the network was trained.
- If you train for too long, you can often get the total
sum-squared error very low, by over-fitting the training
data - you get a network which performs very well on the training
data, but not as well as it could on unseen data.
- By stopping training earlier, one hopes that the network
will have learned the broad rules of the problem, but not bent
itself into the shape of some of the more idiosyncratic (perhaps
even noisy) training patterns.
- The next section describes a strategy for picking the
right stopping point.
- Assuming that training patterns are relatively plentiful, one
can divide them into two sections (perhaps 80% / 20% of the original
collection of training patterns).
Figure 8: Illustration of over-fitting
- Train on the first section of the training data (80% of them in
the example above).
- Test on the second section, the test set.
- Train for different numbers of epochs.
- Typically what is found is that, while error on the training set
falls monotonically with the number of epochs, error on the test set
falls and then rises.
- Estimate the point at which test-set error begins to rise again.
Train for this number of epochs.
Successful Applications of Backprop
Backprop tends to work well in some situations where human experts are unable
to articulate a rule for what they are doing - e.g. in areas depending on raw
perception, and where it is difficult to determine the attributes (in
the ID3 sense) that are relevant to the problem at hand.
For example, there is a proprietary system, which includes a backprop
component, for assisting in classifying Pap smears.
- The system picks out from the image the most suspicious-looking
- A human expert then inspects these cells.
- This reduces the problem from looking at maybe 10,000 cells to
looking at maybe 100 cells - this reduces the boredom-induced
- Other successful systems have been built for tasks like
reading handwritten postcodes.
- Function approximation methods, such as the ones discussed
above, can often produce quite accurate classifiers because they
are capable of constructing complex decision surfaces.
- However, knowledge is stored as weights in a matrix, so
the results of learning are not easily available
for inspection by a human reader.
- Moreover, the design of a network usually requires informed
guesswork on the part of the user in order to obtain satisfactory
- Although some effort has been devoted to extracting
meaning from networks, they still communicate little about the
"rules" they implicitly encode.
- Connectionist learning algorithms are still computationally
- A critical factor in their speed is the encoding of
the inputs to the network.
- When learning is complete, the speed of
computation of the resulting network is very high.
|Summary: Error Backpropagation Learning|
- After briefly describing linear threshold units, neural network
computation paradigm in general, and the use of the logistic function
(or similar functions) to transform weighted sums of inputs to a
neuron, we outlined the error backpropagation learning algorithm.
- Backprop's performance on the XOR problem was demonstrated using the
tlearn backprop simulator.
- A number of refinements to backprop were looked at briefly,
including momentum and a technique to obtain the best generalization
- Backprop nets learn slowly but compute quickly once they have learned.
- They can be trained so as to generalize reasonably well.
perceptron activation rule
perceptron learning rule
perceptrons can't learn XOR, backprop-trained nets can learn XOR
parts of a biological neuron (to the level of detail given in lectures)
artificial neuron model (weights, activation function, nonlinearity, logistic function)
layout of multilayer perceptron
generalised delta rule (momentum)
forward and backward pass
stopping criteria and initialisation
generalisation, over-fitting, and how to avoid it
advantages and disadvantages of neural nets vs symbolic AI methods
Copyright © Bill Wilson, 2010, except where another source is acknowledged.
Bill Wilson's contact info
UNSW's CRICOS Provider No. is 00098G