Neural Networks and Error Backpropagation Learning
Reference: Haykin, chapter 4
Aim: 
To introduce some basic concepts of neural networks and then describe
the backpropagation learning algorithm for feedforward neural networks.

Keywords:
activation level,
activation function,
axon,
backpropagation,
backward pass in backpropagation,
bias,
biological neuron,
cell body,
clamping,
connectionism,
delta rule,
dendrite,
epoch,
error backpropagation,
error surface,
excitatory connection,
feedforward networks,
firing,
forward pass in backpropagation,
generalization in backprop,
generalized delta rule,
gradient descent,
hidden layer,
hidden unit / node,
inhibitory connection,
input unit,
layer in a neural network,
learning rate,
linear threshold unit,
local minimum,
logistic function,
momentum in backprop,
multilayer perceptron (MLP),
neural network,
neurode,
neuron (artificial),
node,
output unit,
overfitting,
perceptron,
perceptron learning,
recurrent network,
sequence prediction tasks,
sigmoidal nonlinearity,
simple recurrent network,
squashing function,
stopping criterion in backprop,
synapse,
target output,
threshold,
training pattern,
total net input,
total sumsquared error,
trainable weight,
training pattern,
unit,
weight,
weight space,
XOR problem

Plan: 
 linear threshold units, perceptrons
 outline of biological neural processing
 artificial neurons and the sigmoid function
 error backpropagation learning
 delta rule
 forward and backward passes
 generalized delta rule
 initialization
 example: XOR with bp in
tlearn
 generalization and overfitting
 applications of backprop

Classification Tasks
 Statistical and connectionist approaches to machine learning are related to
function approximation methods in mathematics.
 For the purposes of illustration let us assume that the learning task
is one of classification.
 That is, we wish to find ways of grouping objects in a universe.
 In Figure 1 we have a universe of objects that belong to either of
two classes `+' or `–'.
Figure 1: A linear discrimination between two classes
Classification Tasks 2
 Using function approximation techniques, we find a surface that separates
the space, and the objects in it, into two different regions.
 In the case shown in Figure 1, the "surface" is just a line, and the
associated function is called a linear discriminant function.
 Linear regression methods, or perceptron learning (see below) can be used
to find linear discriminant functions.
History: Perceptrons
 A perceptron is a simple pattern classifier.
 Given a binary input vector, x, a weight vector, w,
and a threshold value, T, if
Σ_{i} w_{i}x_{i} > T
then the output is 1, indicating membership of a class, otherwise it is 0,
indicating exclusion from the class.
 w.x = T describes a hyperplane and the goal
of perceptron learning is to find a weight vector, w, that results
in correct classification for all training examples.
Perceptron Learning (Outline)
 All the elements of the weight vector are initially set to 0.
 For each training example, if the perceptron outputs 0 when it
should output 1 then add the input vector to the weight vector; if
the perceptron outputs 1 when it should output 0 then subtract the
input vector from the weight vector; otherwise, do nothing.
 This is repeated until the perceptron yields the correct result
for each training example.
 The algorithm has the effect of reducing the error between
the actual and desired output.
 The perceptron is an example of a linear threshold unit (LTU).
 A single LTU can only recognise one kind of pattern, provided that
the input space is linearly separable.
Perceptron Learning (Outline) 2
Neural Models of Computation
 Biological neurons provide a model for computation (after all,
brains are built from them).
 They have inputs (dendrites), outputs (axon) and a
response to the inputs is generated by a process that gives rise to an
electrical pulse propagated along the axon.
 The place where an axon of one neuron connects to a dendrite of
another neuron is termed a synapse.
 Synapses differ in the effect that they have on the output of
the neuron.
 Some are strongly excitatory, some weakly excitatory, some
weakly inhibitory, and so on.
Figure 2: Neuron architecture
Biological Neurons and Artificial Neurons
 There is an analogy between a biological neuron and the simple
model neurons (also termed nodes or units) used in
neural networks.
 The incoming signals from the axons of other neurons correspond
to the inputs y_{i} to the node.
 The strength of the synapse corresponds to weight
w_{i} associated with a node.
 The signal propagated along the axon corresponds to the output
φ(Σ_{i} w_{i}y_{i}) of the node. (The
operator φ will be explained below.)
Figure 3: Artificial neuron (node) architecture
Multilayer Perceptrons
Overall Layout of MLP
 Typically also, each node in a layer (other than the output
layer) is connected to every node in the next layer by a trainable
weight.
 The overall layout is illustrated in Figure 5.
Figure 5. A multilayer network
Node Internals
 Figure 6 shows the internals of a node.
 The weight w_{j0} acts like a threshold.
 The y_{i} are the outputs of other nodes
(or perhaps inputs to the network).
 The first step is forming the weighted sum v_{j} =
Σ_{i} w_{ji}y_{i}.
 The second step is applying the nonlinearity function φ
to v_{j} to produce the output y_{j}.
Figure 6. Internal functioning of node j
The Error BackPropagation Learning Algorithm
 This algorithm was discovered and rediscovered a number of times 
for details, see, e.g. chapter 4 of Haykin, S. Neural Networks 
a comprehensive foundation, 2nd ed., p.43. This reference also
contains the mathematical details of the derivation of the
backpropagation equations, (2nd ed., p.161167, 3rd ed., p.129134)
which we shall omit. (This is covered in COMP9444 Neural Networks.)
 Like perceptron learning, backpropagation attempts to reduce the
errors between the output of the network and the desired result.
 However, assigning blame for errors to hidden nodes (i.e. nodes
in the intermediate layers), is not so straightforward. The error of
the output nodes must be propagated back through the hidden nodes.
 The contribution that a hidden node makes to an output node is
related to the strength of the weight on the link between the
two nodes and the level of activation of the hidden node when the
output node was given the wrong level of activation.
 This can be used to estimate the error value for a hidden node
in the penultimate layer, and that can, in turn, be used in making
error estimates for earlier layers.
Weight Change Equation

If node j is an output node, then δ_{j} is the product of
φ'(v_{j}) and the error signal e_{j},
where φ(_) is the
logistic function and v_{j} is the total input to node j (i.e.
Σ_{i} w_{ji}y_{i}),
and e_{j} is the error signal for node j
(i.e. the difference between the desired output and the actual output);
 If node j is a hidden node, then
δ_{j} is the product of
φ'(v_{j}) and the weighted sum of the
δ's computed for the nodes in
the next hidden or output layer that are connected to node j.
[The actual formula is δ_{j} =
φ'(v_{j}) Σ_{k} δ_{k}w_{kj}
where k ranges over those nodes for which w_{kj} is nonzero (i.e.
nodes k that actually have connections from node j. The δ_{k} values
have already been computed as they are in the output layer (or a layer closer to the output layer
than node j).]
Two Passes of Computation
FORWARD PASS: weights fixed, input signals propagated through network
and outputs calculated. Outputs o_{j} are compared with desired outputs
d_{j}; the error signal e_{j} = d_{j}  o_{j}
is computed.
BACKWARD PASS: starts with output layer and recursively computes the
local gradient δ_{j} for each node.
Then the weights are updated using the equation above for
Δw_{ji}, and back to another forward
pass.
Sigmoidal Nonlinearity
With the sigmoidal function φ(x) defined above, it is the case
that φ'(v_{j}) =
y_{j}(1  y_{j}), a fact that simplifies the computations.
Rate of Learning
 If the learning rate η is very small, then the algorithm proceeds
slowly, but accurately follows the path of steepest descent in weight
space.
 If η is largish, the algorithm may oscillate ("bounce off the
canyon walls").
A simple method of effectively increasing the rate of learning is
to modify the delta rule by including a momentum term:
Δw_{ji}(n)
= α Δw_{ji}(n1) +
η δ_{j}(n)y_{i}(n)
where α is a positive constant termed the momentum constant.
This is called the generalized delta rule.
 The effect is that if the basic delta rule is consistently pushing
a weight in the same direction, then it gradually gathers "momentum"
in that direction.
Stopping Criterion
Two commonly used stopping criteria are:
 stop after a certain number of runs through all the training
data (each run through all the training data is called an
epoch);
 stop when the total sumsquared error reaches some low level.
By total sumsquared error we mean
Σ_{p}Σ_{i}e_{i}^{2}
where p ranges over all of the training patterns and i
ranges over all of the output units.
Initialization
 The weights of a network to be trained by backprop must be
initialized to some nonzero values.
 The usual thing to do is to initialize the weights to small
random values.
 The reason for this is that sometimes backprop training runs
become "lost" on a plateau in weightspace, or for some other
reason backprop cannot find a good minimum error value.
 Using small random values means different starting points
for each training run, so that subsequent training runs have a
good chance of finding a suitable minimum.
The Discovery of Backprop
 The backprop algorithm was discovered by three groups at around about the same time.
 First chronologically was Paul Werbos, who published a more general version in his PhD in 1974.
He subsequently "spent many years struggling with folks who refused to listen or publish
or tolerate the idea. Finally, in 1981 [he] published a more persuasive brief paper [on it].
Both that paper and the thesis are reprinted in entirety in P. Werbos,
Roots of Backpropagation, Wiley 1994."
[Source: email from Paul Werbos.]
You can find material by Paul Werbos on his algorithm at http://www.werbos.com/AD2004.pdf
He is now a program manager in the US National Science Foundation.
 Rumelhart, Hinton and Williams published their version of the algorithm in the
mid1980s. Rumelhart and McClelland produced/edited a twovolume book that included the RHW chapter on backprop, and
chapters on a wide range of other neural network models, in 1986. This book, known humorously as
the "PDP Bible" was extremely influential. Rumelhart did further important work
on neural networks before succumbing to a neurodegenerative disease. The Rumelhart
Prize  the "Nobel Prize" of Cognitive Science, was established in his honour. Hinton is
still going strong, at the University of Toronto. He won the first Rumelhart Prize in
2001. Williams is at Northeastern University, in Massachusetts.
 Yann le Cun, a PhD student in Paris, independently discovered the algorithm (PhD
completed in 1987). He subsequently joined Hinton's group for a while. He now works
at Courant Institute of Mathematical Sciences at New York University.

Paul Werbos
 Dave Rumelhart
 Geoff Hinton

Ronald Williams
 Yann le Cun


The XOR Problem
Backprop Specification in tlearn
Backprop Specification in tlearn
2
 The NODES section says that there
are 3 (noninput) nodes or neurons, of which one, node 3,
is an output node. And that there are two input nodes. In
fact these are referred two as i1 and i2. The hidden layer
nodes are 1 and 2. Node 0 is the bias node.
 The CONNECTIONS section specifies the
connections. Don't worry about groups.
13 from 0
means that all noninput nodes receive input from the bias
node. 12 from i1i2
says that the hidden layer
neurons receive input from the input nodes. And so on. Don't
worry about the SPECIAL section.
 The
xor.data
file specifies the input patterns,
and xor.teach
specifies the corresponding training
output patterns. Don't worry about distributed
.
The second line, 4
, just gives the number of
input/training patterns that are to follow.
Backprop Specification in tlearn
3
 start
tlearn
 create new project called "xor" in Network menu
 create xor.cf, xor.data, xor.teach files as above
 choose "Training Options ..." in Network menu; set options
 choose "Error Display" in Displays menu
 choose "Train the Network" in Network menu
Backprop as a Black Art
The tricky things about backprop networks include designing the network
architecture:
 number of inputs
 number of hidden nodes
 number of layers,
 number of output nodes,
and then setting the adjustable parameters:
It also turns out to be advisable to stop training early, for the
sake of better generalization performance.
Generalization
 Generalization means performance on unseen input patterns,
i.e. input patterns which were not among the patterns on which
the network was trained.
 If you train for too long, you can often get the total
sumsquared error very low, by overfitting the training
data  you get a network which performs very well on the training
data, but not as well as it could on unseen data.
 By stopping training earlier, one hopes that the network
will have learned the broad rules of the problem, but not bent
itself into the shape of some of the more idiosyncratic (perhaps
even noisy) training patterns.
 The next section describes a strategy for picking the
right stopping point.
Testing
 Assuming that training patterns are relatively plentiful, one
can divide them into two sections (perhaps 80% / 20% of the original
collection of training patterns).
Figure 8: Illustration of overfitting
 Train on the first section of the training data (80% of them in
the example above).
 Test on the second section, the test set.
 Train for different numbers of epochs.
 Typically what is found is that, while error on the training set
falls monotonically with the number of epochs, error on the test set
falls and then rises.
 Estimate the point at which testset error begins to rise again.
Train for this number of epochs.
Successful Applications of Backprop

Backprop tends to work well in some situations where human experts are unable
to articulate a rule for what they are doing  e.g. in areas depending on raw
perception, and where it is difficult to determine the attributes (in
the ID3 sense) that are relevant to the problem at hand.

For example, there is a proprietary system, which includes a backprop
component, for assisting in classifying Pap smears.
 The system picks out from the image the most suspiciouslooking
cells.
 A human expert then inspects these cells.
 This reduces the problem from looking at maybe 10,000 cells to
looking at maybe 100 cells  this reduces the boredominduced
error rate.
 Other successful systems have been built for tasks like
reading handwritten postcodes.
Discussion
 Function approximation methods, such as the ones discussed
above, can often produce quite accurate classifiers because they
are capable of constructing complex decision surfaces.
 However, knowledge is stored as weights in a matrix, so
the results of learning are not easily available
for inspection by a human reader.
 Moreover, the design of a network usually requires informed
guesswork on the part of the user in order to obtain satisfactory
results.
 Although some effort has been devoted to extracting
meaning from networks, they still communicate little about the
"rules" they implicitly encode.
 Connectionist learning algorithms are still computationally
expensive.
 A critical factor in their speed is the encoding of
the inputs to the network.
 When learning is complete, the speed of
computation of the resulting network is very high.
Summary: Error Backpropagation Learning 
 After briefly describing linear threshold units, neural network
computation paradigm in general, and the use of the logistic function
(or similar functions) to transform weighted sums of inputs to a
neuron, we outlined the error backpropagation learning algorithm.
 Backprop's performance on the XOR problem was demonstrated using the
tlearn backprop simulator.
 A number of refinements to backprop were looked at briefly,
including momentum and a technique to obtain the best generalization
ability.
 Backprop nets learn slowly but compute quickly once they have learned.
 They can be trained so as to generalize reasonably well.

Revision Topics 
perceptrons
perceptron activation rule
perceptron learning rule
perceptrons can't learn XOR, backproptrained nets can learn XOR
parts of a biological neuron (to the level of detail given in lectures)
artificial neuron model (weights, activation function, nonlinearity, logistic function)
layout of multilayer perceptron
delta rule
generalised delta rule (momentum)
forward and backward pass
stopping criteria and initialisation
generalisation, overfitting, and how to avoid it
advantages and disadvantages of neural nets vs symbolic AI methods

Copyright © Bill Wilson, 2012, except where another source is acknowledged.
Bill Wilson's contact info
UNSW's CRICOS Provider No. is 00098G