The Backpropagation neural network is a multilayered, feedforward neural network and is by far the most extensively used[6]. It is also considered one of the simplest and most general methods used for supervised training of multilayered neural networks[6]. Backpropagation works by approximating the nonlinear relationship between the input and the output by adjusting the weight values internally. It can also generalize to inputs that are not included in the training patterns (predictive ability).
Generally, the Backpropagation network has two stages, training and testing. During the training phase, the network is "shown" sample inputs and the correct classifications. For example, the input might be an encoded picture of a face, and the output could be represented by a code that corresponds to the name of the person.
A further note on encoding information: a neural network, like most learning algorithms, needs to have its inputs and outputs encoded according to an arbitrary, user-defined scheme. The scheme determines the network architecture, so once a network is trained, the scheme cannot be changed without creating a totally new net. Similarly, there are many forms of encoding the network response.
The following figure shows the topology of the Backpropagation neural network that includes an input layer, one hidden layer and an output layer. It should be noted that Backpropagation neural networks can have more than one hidden layer.


The operation of the Backpropagation neural network can be divided into two steps: feedforward and Backpropagation. In the feedforward step, an input pattern is applied to the input layer and its effect propagates, layer by layer, through the network until an output is produced. The network's actual output value is then compared to the expected output, and an error signal is computed for each of the output nodes. Since all the hidden nodes have, to some degree, contributed to the errors evident in the output layer, the output error signals are transmitted backwards from the output layer to each node in the hidden layer that immediately contributed to the output layer. This process is then repeated, layer by layer, until each node in the network has received an error signal that describes its relative contribution to the overall error.
Once the error signal for each node has been determined, the errors are then used by the nodes to update the value of each connection weight until the network converges to a state that allows all the training patterns to be encoded. The Backpropagation algorithm looks for the minimum value of the error function in weight space using a technique called the delta rule or gradient descent[2]. The weights that minimize the error function are then considered to be a solution to the learning problem.
The network behaviour is analogous to a human that is shown a set of data and is asked to classify them into predefined classes. Like a human, it will come up with "theories" about how the samples fit into the classes. These are then tested against the correct outputs to see how accurate the guesses of the network are. Radical changes in the latest theory are indicated by large changes in the weights, and small changes may be seen as minor adjustments to the theory.
There are also issues regarding generalizing a neural network. Issues to consider are problems associated with undertraining and overtraining data. Undertraining can occur when the neural network is not complex enough to detect a pattern in a complicated data set. This is usually the result of networks with so few hidden nodes that they cannot accurately represent the solution, therefore underfitting the data (Figure 6)[1].


On the other hand, overtraining can result in a network that is too
complex, resulting in predictions that are far beyond the range of the
training data. Networks with too many hidden nodes will tend to overfit
the solution (Figure 7)[1].


The aim is to create a neural network with the "right" number of hidden
nodes that will lead to a good solution to the problem (Figure 8)[1].


Using Figure 3, the following describes the learning algorithm and the equations used to train a neural network. For an extensive description of the derivation of the equations used, please refer to reference [10].
Feedforward[10]
When a specified training pattern
is fed to the input layer, the weighted sum of the input to the j^{th}
node in the hidden layer is given by

Net_{j} = \sum_{i} w_{i,j} x_{i} + w_{bias,j}    (1)

Equation (1) is used to calculate the aggregate input to the neuron. The w_{bias,j} term is the weighted value from a bias node that always has an output value of 1. The bias node is considered a "pseudo input" to each neuron in the hidden layer and the output layer, and is used to overcome the problems associated with situations where the values of an input pattern are zero. If any input pattern has zero values, the neural network could not be trained without a bias node.
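As an illustration, the weighted sum of equation (1) can be sketched in a few lines of Python (the function name and the example weights below are hypothetical, chosen only for the demonstration):

```python
def net_input(inputs, weights, bias_weight):
    """Weighted sum of the inputs to a node, per equation (1).
    The bias node always outputs 1, so its contribution is
    simply bias_weight * 1."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias_weight

# Even when every input value is zero, the bias keeps the sum nonzero,
# which is what allows such patterns to be trained on at all.
print(net_input([0.0, 0.0], [0.5, -0.3], 0.1))  # 0.1
```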
To decide whether a neuron should fire, the "Net" term, also known as the action potential, is passed to an appropriate activation function. The resulting value from the activation function determines the neuron's output, and becomes the input value for the neurons in the next layer connected to it.
Since one of the requirements for the Backpropagation algorithm is that
the activation function is differentiable, a typical activation function
used is the Sigmoid equation (refer to Figure
4):

O_{j} = \frac{1}{1 + e^{-Net_{j}}}    (2)

It should be noted that many other types of functions can be, and are, used: the hyperbolic tangent being another popular choice.
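A minimal sketch of equation (2) and its derivative in Python (function names are illustrative; `math.tanh` is shown only as the alternative mentioned above):

```python
import math

def sigmoid(net):
    """Sigmoid activation, equation (2): 1 / (1 + e^(-Net))."""
    return 1.0 / (1.0 + math.exp(-net))

def sigmoid_derivative(output):
    """Derivative expressed in terms of the node's output: O(1 - O).
    Differentiability is what Backpropagation requires."""
    return output * (1.0 - output)

print(sigmoid(0.0))             # 0.5
print(sigmoid_derivative(0.5))  # 0.25 (the derivative's maximum)
print(math.tanh(0.0))           # a popular differentiable alternative
```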
Similarly, equations (1) and (2) are used to determine the output value for node k in the output layer.
Error Calculations and Weight Adjustments: Backpropagation[10]
Output Layer
If the actual activation value of the output node, k, is O_{k},
and the expected target output for node k is t_{k}, the difference
between the actual output and the expected output is given by:

(t_{k} - O_{k})    (3)

The error signal for node k in the output layer can be calculated as

\delta_{k} = O_{k}(1 - O_{k})(t_{k} - O_{k})    (4)

where the O_{k}(1 - O_{k}) term is the derivative of the Sigmoid function.
With the delta rule, the change in the weight connecting node j and output node k is proportional to the error at node k multiplied by the activation of node j.
The formulas used to modify the weight, w_{j,k}, between the output node, k, and the node, j, are:

\Delta w_{j,k} = l_{r} \cdot \delta_{k} \cdot x_{k}    (5)
w_{j,k} = w_{j,k} + \Delta w_{j,k}    (6)
where \Delta w_{j,k} is the change in the weight between nodes j and k, and l_{r} is the learning rate. The learning rate is a relatively small constant that scales the size of each weight change. If the learning rate is too low, the network will learn very slowly; if the learning rate is too high, the network may oscillate around a minimum point (refer to Figure 9), overshooting the lowest point with each weight adjustment but never actually reaching it. Usually the learning rate is very small, with 0.01 not an uncommon value. Some modifications to the Backpropagation algorithm allow the learning rate to decrease from a large initial value during the learning process. This has many advantages. Since it is assumed that the network starts in a state distant from the optimal set of weights, training will initially be rapid. As learning progresses, the learning rate decreases as the network approaches the optimal point in the minimum. Slowing the learning process near the optimal point encourages the network to converge to a solution while reducing the possibility of overshooting. If, however, the learning process starts close to the optimal point, the system may initially oscillate, but this effect is reduced over time as the learning rate decreases.
It should also be noted that, in equation (5), the x_{k} variable is the input value to the node k, and is the same value as the output from node j.
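Equations (3) through (6) can be sketched as follows (the function names and the sample values are illustrative; the learning rate of 0.01 is the typical value mentioned above):

```python
def output_error_signal(o_k, t_k):
    """Equation (4): O_k(1 - O_k) is the sigmoid derivative and
    (t_k - O_k) is the difference from equation (3)."""
    return o_k * (1.0 - o_k) * (t_k - o_k)

def updated_weight(w_jk, delta_k, x_k, lr=0.01):
    """Equations (5) and (6): w_{j,k} += l_r * delta_k * x_k,
    where x_k is the input to node k (the output of node j)."""
    return w_jk + lr * delta_k * x_k

delta = output_error_signal(0.8, 1.0)   # output too low -> positive signal
print(updated_weight(0.4, delta, 0.9))  # the weight is nudged upward
```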
To improve the process of updating the weights, a modification to equation
(5) is made:

\Delta w_{j,k}(n) = l_{r} \cdot \delta_{k} \cdot x_{k} + \mu \cdot \Delta w_{j,k}(n-1)    (7)

Here the weight update during the nth iteration is determined by including a momentum term (\mu), which is multiplied by the weight change from the (n-1)th iteration, \Delta w_{j,k}(n-1). The introduction of the momentum term accelerates the learning process by "encouraging" the weight changes to continue in the same direction with larger steps. Furthermore, the momentum term helps prevent the learning process from settling in a local minimum by "overstepping" the small "hill" (refer to Figure 9). Typically, the momentum term has a value between 0 and 1.
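Implementing equation (7) only requires remembering the previous weight change; a minimal sketch (the value \mu = 0.9 below is illustrative):

```python
def weight_change(delta_k, x_k, prev_change, lr=0.01, momentum=0.9):
    """Equation (7): the nth change adds a fraction (mu) of the
    (n-1)th change, so successive steps taken in the same
    direction grow larger."""
    return lr * delta_k * x_k + momentum * prev_change

# Repeated pushes in the same direction build up "speed", which can
# carry the weights over a small hill in the error surface.
change = 0.0
for _ in range(5):
    change = weight_change(0.032, 0.9, change)
print(change)
```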


Figure 9 Global and Local Minima of Error Function[3] 
Hidden Layer
The error signal for node j in the hidden layer can be calculated as

\delta_{j} = O_{j}(1 - O_{j}) \sum_{k} \delta_{k} w_{j,k}    (8)

where the summation term adds the weighted error signals of all nodes, k, in the output layer.
As before, the formulas to adjust the weight, w_{i,j}, between the input node, i, and the node, j, are:

\Delta w_{i,j} = l_{r} \cdot \delta_{j} \cdot x_{i}    (9)
w_{i,j} = w_{i,j} + \Delta w_{i,j}    (10)
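Equation (8) can be sketched as follows (names are, again, illustrative):

```python
def hidden_error_signal(o_j, output_deltas, weights_to_outputs):
    """Equation (8): delta_j = O_j(1 - O_j) * sum_k delta_k * w_{j,k}.
    Each output node passes back its error signal, weighted by the
    strength of its connection to hidden node j."""
    back_sum = sum(d * w for d, w in zip(output_deltas, weights_to_outputs))
    return o_j * (1.0 - o_j) * back_sum

# A hidden node strongly connected to an erring output node
# receives a correspondingly larger share of the blame.
print(hidden_error_signal(0.6, [0.032, -0.01], [0.8, 0.2]))
```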
Global Error
Finally, Backpropagation is derived by assuming that it is desirable
to minimize the error on the output nodes over all the patterns presented
to the neural network. The following equation is used to calculate the
error function, E,
for all patterns

E = \frac{1}{2} \sum_{p} \sum_{k} (t_{k} - O_{k})^{2}    (11)

Ideally, the error function should have a value of zero when the neural network has been correctly trained. This, however, is numerically unrealistic.
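A short sketch of equation (11), summing over all patterns and output nodes (the function name is hypothetical):

```python
def global_error(targets, outputs):
    """Equation (11): E = 1/2 * sum over all patterns p and output
    nodes k of (t_k - O_k)^2. Zero only for a perfect fit, which is
    numerically unrealistic; training stops at a small threshold."""
    return 0.5 * sum(
        (t - o) ** 2
        for pattern_t, pattern_o in zip(targets, outputs)
        for t, o in zip(pattern_t, pattern_o)
    )

print(global_error([[1.0], [0.0]], [[0.9], [0.2]]))
```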
Pseudo Coding
The following describes the Backpropagation algorithm[9],[10]:

Assign all network inputs and outputs
Initialize all weights with small random numbers, typically between -1 and 1

repeat
    for every pattern in the training set
        Present the pattern to the network

        // Propagate the input forward through the network:
        for each layer in the network
            for every node in the layer
                1. Calculate the weighted sum of the inputs to the node (equation 1)
                2. Calculate the activation for the node (equation 2)
            end
        end

        // Propagate the errors backward through the network:
        for every node in the output layer
            calculate the error signal (equations 3 and 4)
            update the weights into the node (equations 5 to 7)
        end

        for all hidden layers
            for every node in the layer
                1. Calculate the node's error signal (equation 8)
                2. Update the weights into the node (equations 9 and 10)
            end
        end

        // Calculate Global Error
        Calculate the Error Function (equation 11)
    end
while ((number of iterations < the specified maximum) AND
       (Error Function > the specified threshold))
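The pseudocode above can be sketched as a small, self-contained Python program. Everything here is illustrative rather than part of the original text: the network size, the learning rate of 0.5, the momentum of 0.5, the XOR training set, and the iteration and error limits.

```python
import math
import random

random.seed(0)  # reproducible illustrative run

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))  # equation (2)

class BackpropNet:
    """One hidden layer; the last weight in each row is the bias weight."""

    def __init__(self, n_in, n_hidden, n_out, lr=0.5, momentum=0.5):
        rand = lambda: random.uniform(-1.0, 1.0)  # small random weights
        self.w_ih = [[rand() for _ in range(n_in + 1)] for _ in range(n_hidden)]
        self.w_ho = [[rand() for _ in range(n_hidden + 1)] for _ in range(n_out)]
        self.dw_ih = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
        self.dw_ho = [[0.0] * (n_hidden + 1) for _ in range(n_out)]
        self.lr, self.momentum = lr, momentum

    def forward(self, pattern):
        # Feedforward: equations (1) and (2), layer by layer
        self.x = list(pattern) + [1.0]  # bias node always outputs 1
        self.h = [sigmoid(sum(w * v for w, v in zip(row, self.x)))
                  for row in self.w_ih]
        self.hb = self.h + [1.0]
        self.o = [sigmoid(sum(w * v for w, v in zip(row, self.hb)))
                  for row in self.w_ho]
        return self.o

    def backward(self, target):
        # Error signals: equations (4) and (8)
        d_out = [o * (1.0 - o) * (t - o) for o, t in zip(self.o, target)]
        d_hid = [h * (1.0 - h) *
                 sum(d_out[k] * self.w_ho[k][j] for k in range(len(d_out)))
                 for j, h in enumerate(self.h)]
        # Weight updates with momentum: equations (6), (7) and (10)
        for k, row in enumerate(self.w_ho):
            for j in range(len(row)):
                self.dw_ho[k][j] = (self.lr * d_out[k] * self.hb[j]
                                    + self.momentum * self.dw_ho[k][j])
                row[j] += self.dw_ho[k][j]
        for j, row in enumerate(self.w_ih):
            for i in range(len(row)):
                self.dw_ih[j][i] = (self.lr * d_hid[j] * self.x[i]
                                    + self.momentum * self.dw_ih[j][i])
                row[i] += self.dw_ih[j][i]
        # This pattern's contribution to the global error, equation (11)
        return 0.5 * sum((t - o) ** 2 for t, o in zip(target, self.o))

patterns = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
net = BackpropNet(2, 4, 1)
error = 1.0
iterations = 0
while iterations < 20000 and error > 0.001:  # the two stopping conditions
    error = 0.0
    for inputs, target in patterns:
        net.forward(inputs)
        error += net.backward(target)
    iterations += 1

for inputs, _ in patterns:
    print(inputs, round(net.forward(inputs)[0], 3))
```

With four hidden nodes this run typically reproduces XOR (rounded outputs 0, 1, 1, 0) well before the iteration limit; with too few hidden nodes it tends to underfit, as discussed earlier.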
