Single Layer Perceptron

  1. Perceptron
  2. Representational Power of Perceptrons
  3. The Perceptron Training Rule
  4. Gradient descent and the delta rule
  5. Visualizing the hypothesis space
  6. Summary of Gradient descent rule


A perceptron takes a vector of real-valued inputs, calculates a linear combination of these inputs, then outputs a 1 if the result is greater than some threshold and -1 otherwise. More precisely, given inputs x1 through xn, the output o(x1, ..., xn) computed by the perceptron is

  o(x1, ..., xn) = 1 if w0 + w1x1 + w2x2 + ... + wnxn > 0, and -1 otherwise

  • wi - the weight that determines the contribution of input xi to the perceptron output

Notice the quantity (-w0) is a threshold that the weighted combination of inputs w1x1+...+wnxn must surpass in order for the perceptron to output a 1.

Learning a perceptron involves choosing values for the weights w0, ..., wn. Therefore, the hypothesis space considered in perceptron learning is the set of all possible real-valued weight vectors.
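The definition above can be sketched in a few lines of Python (the function name `perceptron_output` is illustrative; `weights[0]` plays the role of the threshold weight w0):

```python
def perceptron_output(weights, x):
    """Perceptron output: +1 if w0 + w1*x1 + ... + wn*xn > 0, else -1.

    weights[0] is the threshold weight w0; x holds the inputs x1..xn.
    """
    s = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if s > 0 else -1
```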

Representational Power of Perceptrons

A single perceptron can be used to represent many boolean functions. For example, if we assume boolean values of 1 (true) and -1 (false), then one way to use a two-input perceptron to implement the AND function is to set the weights w0 = -0.8 and w1 = w2 = 0.5.

In fact, AND and OR can be viewed as special cases of m-of-n functions: that is, functions where at least m of the n inputs to the perceptron must be true. However, some boolean functions cannot be represented by a single perceptron, such as the XOR function.
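As a quick check of the AND example above, this sketch evaluates the stated weights (w0 = -0.8, w1 = w2 = 0.5) over all four boolean input combinations (the helper name `output` is illustrative):

```python
def output(x1, x2, w0=-0.8, w1=0.5, w2=0.5):
    # Perceptron output over boolean inputs encoded as 1 (true) / -1 (false).
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1

for x1 in (1, -1):
    for x2 in (1, -1):
        expected = 1 if (x1 == 1 and x2 == 1) else -1
        assert output(x1, x2) == expected  # behaves exactly like AND
```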

The decision surface represented by a two-input perceptron. x1 and x2 are the perceptron inputs.

  • (a) A set of training examples and the decision surface of a perceptron that classifies them correctly
  • (b) A set of training examples that is not linearly separable

The Perceptron Training Rule

How does a single perceptron learn the weights? The precise learning problem is to determine a weight vector that causes the perceptron to produce the correct +1 or -1 output for each of the given training examples.

One way to learn an acceptable weight vector is to:

  1. begin with random weights,
  2. iteratively apply the perceptron to each training example,
  3. modify the perceptron weights whenever it misclassifies an example,
  4. repeat this process until the perceptron classifies all training examples correctly.

Weights are modified at each step according to the perceptron training rule, which revises the weight wi associated with input xi:

  wi <- wi + Δwi, where Δwi = η(t - o)xi

Here t is the target output for the current training example, o is the output generated by the perceptron, and η is a small positive constant called the learning rate.
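The training procedure above can be sketched in Python as follows (a minimal version; the name `train_perceptron`, the zero initial weights, and the epoch cap are illustrative choices, not part of the original rule):

```python
def train_perceptron(examples, eta=0.1, max_epochs=100):
    """Perceptron training rule: wi <- wi + eta * (t - o) * xi.

    examples: list of (x, t) pairs, x a tuple of inputs, t in {+1, -1}.
    Weights start at zero here for simplicity (random is also common).
    """
    n = len(examples[0][0])
    w = [0.0] * (n + 1)                  # w[0] is the threshold weight w0
    for _ in range(max_epochs):
        misclassified = False
        for x, t in examples:
            s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            o = 1 if s > 0 else -1
            if o != t:                   # update only on a misclassification
                misclassified = True
                w[0] += eta * (t - o)    # x0 is implicitly 1
                for i, xi in enumerate(x, start=1):
                    w[i] += eta * (t - o) * xi
        if not misclassified:            # all examples classified correctly
            break
    return w
```

On a linearly separable set such as the AND examples, this loop converges after a few epochs; on a non-separable set it simply stops at the epoch cap, which foreshadows the convergence problem discussed next.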

Gradient descent and the delta rule

Although the perceptron rule finds a successful weight vector when the training examples are linearly separable, it can fail to converge if the examples are not linearly separable.

Gradient descent searches the hypothesis space of possible weight vectors to find the weights that best fit the training examples, even when the examples are not linearly separable.

Training error is a measure of the difference between the target outputs and the actual outputs. Mathematically, it is defined as

  E(w) = (1/2) * Σ_{d ∈ D} (td - od)^2

where:

  • D is the set of training examples
  • td is the target output for the training example d
  • od is the output of the linear unit for training example d
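Given those definitions, the training error can be computed directly (a sketch; `training_error` is an illustrative name, and the factor 1/2 follows the definition E(w) = (1/2) Σ (td - od)^2):

```python
def training_error(weights, examples):
    """E(w) = 1/2 * sum over d in D of (td - od)^2 for a linear unit."""
    total = 0.0
    for x, t in examples:
        # od: output of the *linear* unit (no thresholding)
        o = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
        total += (t - o) ** 2
    return 0.5 * total
```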

Visualizing the hypothesis space

The following graph shows:

  • w0, w1 plane - the entire hypothesis space
  • vertical axis - the error E relative to a set of training examples

Gradient descent is an algorithm that searches in the direction of steepest descent along the error surface. It determines a weight vector that minimizes E by starting with an arbitrary initial weight vector, then repeatedly modifying it in small steps.

For a linear unit, this error surface has a single global minimum. Gradient descent continues the search until that global minimum error is reached.

Summary of Gradient descent rule

Each training example is a pair of the form <x, t>, where x is the vector of input values and t is the target output value. Applying the gradient descent rule gives the weight update

  wi <- wi + Δwi, where Δwi = η * Σ_{d ∈ D} (td - od) * xid

  • D - the set of training examples
  • td - the target output for training example d
  • od - the output of the linear unit for training example d
  • xid - the value of input xi for training example d

From this we can see that each weight wi is adjusted in proportion to the error between the target and the output, scaled by the learning rate.
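A minimal batch gradient-descent step for a linear unit might look like this (the function name `gradient_descent_step` and the toy one-input data in the note below are illustrative):

```python
def gradient_descent_step(w, examples, eta=0.05):
    """One batch update: delta_wi = eta * sum over d of (td - od) * xid."""
    delta = [0.0] * len(w)
    for x, t in examples:
        xs = (1.0,) + tuple(x)                      # x0 = 1, so w[0] acts as w0
        o = sum(wi * xi for wi, xi in zip(w, xs))   # linear unit output od
        for i, xi in enumerate(xs):
            delta[i] += eta * (t - o) * xi          # accumulate over all examples
    return [wi + di for wi, di in zip(w, delta)]
```

Repeating this step shrinks E; for example, on the toy data [((1.0,), 1.0), ((2.0,), 2.0)] the weights approach w0 = 0, w1 = 1, the single global minimum of the error surface.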