## Neural Networks and Error Backpropagation Learning

Reference: Haykin, chapter 4

 Aim: To introduce some basic concepts of neural networks and then describe the backpropagation learning algorithm for feedforward neural networks. Keywords: activation level, activation function, axon, backpropagation, backward pass in backpropagation, bias, biological neuron, cell body, clamping, connectionism, delta rule, dendrite, epoch, error backpropagation, error surface, excitatory connection, feedforward networks, firing, forward pass in backpropagation, generalization in backprop, generalized delta rule, gradient descent, hidden layer, hidden unit / node, inhibitory connection, input unit, layer in a neural network, learning rate, linear threshold unit, local minimum, logistic function, momentum in backprop, multilayer perceptron (MLP), neural network, neurode, neuron (artificial), node, output unit, over-fitting, perceptron, perceptron learning, recurrent network, sequence prediction tasks, sigmoidal nonlinearity, simple recurrent network, squashing function, stopping criterion in backprop, synapse, target output, threshold, training pattern, total net input, total sum-squared error, trainable weight, training pattern, unit, weight, weight space, XOR problem Plan: linear threshold units, perceptrons outline of biological neural processing artificial neurons and the sigmoid function error backpropagation learning delta rule forward and backward passes generalized delta rule initialization example: XOR with bp in `tlearn` generalization and over-fitting applications of backprop

• Statistical and connectionist approaches to machine learning are related to function approximation methods in mathematics.

• For the purposes of illustration let us assume that the learning task is one of classification.

• That is, we wish to find ways of grouping objects in a universe.

• In Figure 1 we have a universe of objects that belong to either of two classes `+' or `–'.

Figure 1: A linear discrimination between two classes

### Classification Tasks 2

• Using function approximation techniques, we find a surface that separates the space, and the objects in it, into two different regions.

• In the case shown in Figure 1, the "surface" is just a line, and the associated function is called a linear discriminant function.

• Linear regression methods, or perceptron learning (see below) can be used to find linear discriminant functions.

### History: Perceptrons

• A perceptron is a simple pattern classifier.

• Given a binary input vector, x, a weight vector, w, and a threshold value, T, if

Σi wixi > T

then the output is 1, indicating membership of a class, otherwise it is 0, indicating exclusion from the class.

• w.x = T describes a hyperplane and the goal of perceptron learning is to find a weight vector, w, that results in correct classification for all training examples.

### Perceptron Learning (Outline)

• All the elements of the weight vector are initially set to 0.

• For each training example, if the perceptron outputs 0 when it should output 1 then add the input vector to the weight vector; if the perceptron outputs 1 when it should output 0 then subtract the input vector from the weight vector; otherwise, do nothing.

• This is repeated until the perceptron yields the correct result for each training example.

• The algorithm has the effect of reducing the error between the actual and desired output.

• The perceptron is an example of a linear threshold unit (LTU).

• A single LTU can only recognise one kind of pattern, provided that the input space is linearly separable.

### Perceptron Learning (Outline) 2

• If we wish to recognise more than one pattern, several LTU's can be combined. In this case, instead of having a vector of weights, we have an array. The output will now be a vector:

u = W x

where each element of u indicates membership of a class and each row in W is the set of weights for one LTU.

• This architecture is called a pattern associator.

• LTU's can only produce linear discriminant functions and consequently, they are limited in the kinds of classes that they can learn.

• In particular, they cannot be trained to compute the exclusive-or (XOR) function. [Footnote: (XOR(p,q) is true (i.e. value 1) if p is true or q is true but not both.]

• This seemed in the 1960s to kill off perceptron-based learning.

• However, subsequently a generalization of perceptrons was found that solved this problem.

### Neural Models of Computation

• Biological neurons provide a model for computation (after all, brains are built from them).

• They have inputs (dendrites), outputs (axon) and a response to the inputs is generated by a process that gives rise to an electrical pulse propagated along the axon.

• The place where an axon of one neuron connects to a dendrite of another neuron is termed a synapse.

• Synapses differ in the effect that they have on the output of the neuron.

• Some are strongly excitatory, some weakly excitatory, some weakly inhibitory, and so on.

Figure 2: Neuron architecture

### Biological Neurons and Artificial Neurons

• There is an analogy between a biological neuron and the simple model neurons (also termed nodes or units) used in neural networks.

• The incoming signals from the axons of other neurons correspond to the inputs yi to the node.

• The strength of the synapse corresponds to weight wi associated with a node.

• The signal propagated along the axon corresponds to the output φ(Σi wiyi) of the node. (The operator φ will be explained below.)

Figure 3: Artificial neuron (node) architecture

### Multilayer Perceptrons

• In a multi-layer perceptron (MLP), there is a layer of input nodes, a layer of output nodes, and one or more intermediate layers of nodes that are referred to as hidden nodes (hidden layers).

• In addition to this, each node in an MLP includes a nonlinearity at its output end.

• That is, the output function of the node is of the form:

φ(Σi wixi)

where φ(x) is a differentiable (smooth) function, frequently the logistic function:

φ(x) = 1/(1 + e-x)

Figure 4: Graph of logistic function φ(x)

### Overall Layout of MLP

• Typically also, each node in a layer (other than the output layer) is connected to every node in the next layer by a trainable weight.

• The overall layout is illustrated in Figure 5.

Figure 5. A multi-layer network

### Node Internals

• Figure 6 shows the internals of a node.

• The weight wj0 acts like a threshold.

• The yi are the outputs of other nodes (or perhaps inputs to the network).

• The first step is forming the weighted sum vj = Σi wjiyi.

• The second step is applying the nonlinearity function φ to vj to produce the output yj.

Figure 6. Internal functioning of node j

### The Error Back-Propagation Learning Algorithm

• This algorithm was discovered and rediscovered a number of times - for details, see, e.g. chapter 4 of Haykin, S. Neural Networks - a comprehensive foundation, 2nd ed., p.43. This reference also contains the mathematical details of the derivation of the backpropagation equations, (2nd ed., p.161-167, 3rd ed., p.129-134) which we shall omit. (This is covered in COMP9444 Neural Networks.)

• Like perceptron learning, back-propagation attempts to reduce the errors between the output of the network and the desired result.

• However, assigning blame for errors to hidden nodes (i.e. nodes in the intermediate layers), is not so straightforward. The error of the output nodes must be propagated back through the hidden nodes.

• The contribution that a hidden node makes to an output node is related to the strength of the weight on the link between the two nodes and the level of activation of the hidden node when the output node was given the wrong level of activation.

• This can be used to estimate the error value for a hidden node in the penultimate layer, and that can, in turn, be used in making error estimates for earlier layers.

### Weight Change Equation

• The basic algorithm can be summed up in the following equation (the delta rule) for the change to the weight wji from node i to node j:

 weightchange learningrate localgradient input signalto node j Δwji = η × δj × yi

where the local gradient δj is defined as follows:

1. If node j is an output node, then δj is the product of φ'(vj) and the error signal ej, where φ(_) is the logistic function and vj is the total input to node j (i.e. Σi wjiyi), and ej is the error signal for node j (i.e. the difference between the desired output and the actual output);

2. If node j is a hidden node, then δj is the product of φ'(vj) and the weighted sum of the δ's computed for the nodes in the next hidden or output layer that are connected to node j.
[The actual formula is δj = φ'(vj) Σk δkwkj where k ranges over those nodes for which wkj is non-zero (i.e. nodes k that actually have connections from node j. The δk values have already been computed as they are in the output layer (or a layer closer to the output layer than node j).]

### Two Passes of Computation

FORWARD PASS: weights fixed, input signals propagated through network and outputs calculated. Outputs oj are compared with desired outputs dj; the error signal ej = dj - oj is computed.

BACKWARD PASS: starts with output layer and recursively computes the local gradient δj for each node. Then the weights are updated using the equation above for Δwji, and back to another forward pass.

### Sigmoidal Nonlinearity

With the sigmoidal function φ(x) defined above, it is the case that φ'(vj) = yj(1 - yj), a fact that simplifies the computations.

### Rate of Learning

• If the learning rate η is very small, then the algorithm proceeds slowly, but accurately follows the path of steepest descent in weight space.

• If η is largish, the algorithm may oscillate ("bounce off the canyon walls").

A simple method of effectively increasing the rate of learning is to modify the delta rule by including a momentum term:

Δwji(n) = α Δwji(n-1) + η δj(n)yi(n)

where α is a positive constant termed the momentum constant. This is called the generalized delta rule.

• The effect is that if the basic delta rule is consistently pushing a weight in the same direction, then it gradually gathers "momentum" in that direction.

### Stopping Criterion

Two commonly used stopping criteria are:

• stop after a certain number of runs through all the training data (each run through all the training data is called an epoch);

• stop when the total sum-squared error reaches some low level. By total sum-squared error we mean ΣpΣiei2 where p ranges over all of the training patterns and i ranges over all of the output units.

### Initialization

• The weights of a network to be trained by backprop must be initialized to some non-zero values.

• The usual thing to do is to initialize the weights to small random values.

• The reason for this is that sometimes backprop training runs become "lost" on a plateau in weight-space, or for some other reason backprop cannot find a good minimum error value.

• Using small random values means different starting points for each training run, so that subsequent training runs have a good chance of finding a suitable minimum.

### The Discovery of Backprop

• The backprop algorithm was discovered by three groups at around about the same time.

• First chronologically was Paul Werbos, who published a more general version in his PhD in 1974. He subsequently "spent many years struggling with folks who refused to listen or publish or tolerate the idea. Finally, in 1981 [he] published a more persuasive brief paper [on it]. Both that paper and the thesis are reprinted in entirety in P. Werbos, Roots of Backpropagation, Wiley 1994." [Source: email from Paul Werbos.] You can find material by Paul Werbos on his algorithm at http://www.werbos.com/AD2004.pdf
He is now a program manager in the US National Science Foundation.

• Rumelhart, Hinton and Williams published their version of the algorithm in the mid-1980s. Rumelhart and McClelland produced/edited a two-volume book that included the RHW chapter on backprop, and chapters on a wide range of other neural network models, in 1986. This book, known humorously as the "PDP Bible" was extremely influential. Rumelhart did further important work on neural networks before succumbing to a neurodegenerative disease. The Rumelhart Prize - the "Nobel Prize" of Cognitive Science, was established in his honour. Hinton is still going strong, at the University of Toronto. He won the first Rumelhart Prize in 2001. Williams is at Northeastern University, in Massachusetts.

• Yann le Cun, a PhD student in Paris, independently discovered the algorithm (PhD completed in 1987). He subsequently joined Hinton's group for a while. He now works at Courant Institute of Mathematical Sciences at New York University.

 Paul Werbos Dave Rumelhart Geoff Hinton Ronald Williams Yann le Cun

### The XOR Problem

• Because the XOR problem killed perceptrons in the 1960s, it is common to demonstrate the ability of backprop-trained MLPs using the XOR problem.

We'll look at an example using a backprop simulator called `tlearn`.

The architecture of the desired network is shown below:

Figure 7: Network architecture for XOR problem

### Backprop Specification in `tlearn`

• We must specify the structure of the network, and the training patterns. This is done, for XOR, using two files: the `.cf` (configuration) file and the `.data` and `.teach` files.

```xor.cf:
NODES:
nodes = 3
inputs = 2
outputs = 1
output node is 3
CONNECTIONS:
groups = 0
1-3 from 0
1-2 from i1-i2
3 from 1-2
SPECIAL:
selected = 1-2
weight_limit = 1.0

xor.data:          xor.teach:
distributed        distributed
4                  4
1 0                1
0 1                1
0 0                0
1 1                0
```

### Backprop Specification in `tlearn` 2

• The NODES section says that there are 3 (non-input) nodes or neurons, of which one, node 3, is an output node. And that there are two input nodes. In fact these are referred two as i1 and i2. The hidden layer nodes are 1 and 2. Node 0 is the bias node.

• The CONNECTIONS section specifies the connections. Don't worry about groups. `1-3 from 0` means that all non-input nodes receive input from the bias node. `1-2 from i1-i2` says that the hidden layer neurons receive input from the input nodes. And so on. Don't worry about the SPECIAL section.

• The `xor.data` file specifies the input patterns, and `xor.teach` specifies the corresponding training output patterns. Don't worry about `distributed`. The second line, `4`, just gives the number of input/training patterns that are to follow.

### Backprop Specification in `tlearn` 3

• start `tlearn`
• create new project called "xor" in Network menu
• create xor.cf, xor.data, xor.teach files as above
• choose "Training Options ..." in Network menu; set options
• choose "Error Display" in Displays menu
• choose "Train the Network" in Network menu

### Backprop as a Black Art

The tricky things about backprop networks include designing the network architecture:

• number of inputs
• number of hidden nodes
• number of layers,
• number of output nodes,
and then setting the adjustable parameters:
• learning rate
• momentum.
It also turns out to be advisable to stop training early, for the sake of better generalization performance.

### Generalization

• Generalization means performance on unseen input patterns, i.e. input patterns which were not among the patterns on which the network was trained.

• If you train for too long, you can often get the total sum-squared error very low, by over-fitting the training data - you get a network which performs very well on the training data, but not as well as it could on unseen data.

• By stopping training earlier, one hopes that the network will have learned the broad rules of the problem, but not bent itself into the shape of some of the more idiosyncratic (perhaps even noisy) training patterns.

• The next section describes a strategy for picking the right stopping point.

### Testing

• Assuming that training patterns are relatively plentiful, one can divide them into two sections (perhaps 80% / 20% of the original collection of training patterns).

Figure 8: Illustration of over-fitting

• Train on the first section of the training data (80% of them in the example above).

• Test on the second section, the test set.

• Train for different numbers of epochs.

• Typically what is found is that, while error on the training set falls monotonically with the number of epochs, error on the test set falls and then rises.

• Estimate the point at which test-set error begins to rise again. Train for this number of epochs.

### Successful Applications of Backprop

• Backprop tends to work well in some situations where human experts are unable to articulate a rule for what they are doing - e.g. in areas depending on raw perception, and where it is difficult to determine the attributes (in the ID3 sense) that are relevant to the problem at hand.

• For example, there is a proprietary system, which includes a backprop component, for assisting in classifying Pap smears.

• The system picks out from the image the most suspicious-looking cells.
• A human expert then inspects these cells.
• This reduces the problem from looking at maybe 10,000 cells to looking at maybe 100 cells - this reduces the boredom-induced error rate.

• Other successful systems have been built for tasks like reading handwritten postcodes.

### Discussion

• Function approximation methods, such as the ones discussed above, can often produce quite accurate classifiers because they are capable of constructing complex decision surfaces.

• However, knowledge is stored as weights in a matrix, so the results of learning are not easily available for inspection by a human reader.

• Moreover, the design of a network usually requires informed guesswork on the part of the user in order to obtain satisfactory results.

• Although some effort has been devoted to extracting meaning from networks, they still communicate little about the "rules" they implicitly encode.

• Connectionist learning algorithms are still computationally expensive.

• A critical factor in their speed is the encoding of the inputs to the network.

• When learning is complete, the speed of computation of the resulting network is very high.

 Summary: Error Backpropagation Learning After briefly describing linear threshold units, neural network computation paradigm in general, and the use of the logistic function (or similar functions) to transform weighted sums of inputs to a neuron, we outlined the error backpropagation learning algorithm. Backprop's performance on the XOR problem was demonstrated using the `tlearn` backprop simulator. A number of refinements to backprop were looked at briefly, including momentum and a technique to obtain the best generalization ability. Backprop nets learn slowly but compute quickly once they have learned. They can be trained so as to generalize reasonably well.

Revision Topics
perceptrons
perceptron activation rule
perceptron learning rule
perceptrons can't learn XOR, backprop-trained nets can learn XOR
parts of a biological neuron (to the level of detail given in lectures)
artificial neuron model (weights, activation function, nonlinearity, logistic function)
layout of multilayer perceptron
delta rule
generalised delta rule (momentum)
forward and backward pass
stopping criteria and initialisation
generalisation, over-fitting, and how to avoid it
advantages and disadvantages of neural nets vs symbolic AI methods

Copyright © Bill Wilson, 2012, except where another source is acknowledged.
Bill Wilson's contact info

UNSW's CRICOS Provider No. is 00098G