The Machine Learning Dictionary

for COMP9414

Copyright © Bill Wilson, 1998 - 2012 Contact Info


You should use The Machine Learning Dictionary to clarify or revise concepts that you have already met. The Machine Learning Dictionary is not a suitable way to begin to learn about Machine Learning.

Further information on Machine Learning can be found in the class web page lecture notes section. Other places to find out about machine learning would be the AAAI (American Association for Artificial Intelligence) Machine Learning page and their AI Reference Shelf for less specific information.

Other related dictionaries:

The Prolog Dictionary - URL:
The Artificial Intelligence Dictionary - URL:
The NLP Dictionary (Natural Language Processing) - URL:

The URL of this Machine Learning Dictionary is

Topics Covered

This dictionary is limited to the ML concepts covered in COMP9414 Artificial Intelligence, at the University of New South Wales, Sydney:

activation level | activation function | Aq | asynchronous | attributes | axon | backed-up error estimate | backpropagation | backward pass in backpropagation | bias | biological neuron | C4.5 | C5 | cell body | clamping | classes in classification tasks | concept learning system (CLS) | conjunctive expressions | connectionism | covering algorithm | decision trees | delta rule | dendrite | entropy | epoch | error backpropagation | error surface | excitatory connection | expected error estimate | feature | feedforward networks | firing | forward pass in backpropagation | function approximation algorithms | generalization in backprop | generalized delta rule | gradient descent | hidden layer | hidden unit / node | hypothesis language | ID3 | inhibitory connection | input unit | instances | Laplace error estimate | layer in a neural network | learning program | learning rate | linear threshold unit | local minimum | logistic function | machine learning | momentum in backprop | multilayer perceptron (MLP) | neural network | neurode | neuron (artificial) | node | noisy data in machine learning | observation language | output unit | over-fitting | perceptron | perceptron learning | propositional learning systems | pruning decision trees | recurrent network | sequence prediction tasks | sigmoidal nonlinearity | simple recurrent network | splitting criterion in ID3 | squashing function | stopping criterion in backprop | supervised learning | symbolic learning algorithms | synapse | synchronous | target output | testing | threshold | training pattern | total net input | total sum-squared error | trainable weight | training pattern | tree induction algorithm | unit | unsupervised learning | weight | weight space | windowing in ID3 | XOR problem


activation level
The activation level of a neuron in an artificial neural network is a real number often limited to the range 0 to 1, or –1 to 1. In the case of an input neuron the value is obtained externally to the network. In the case of a hidden neuron or output neuron the value is obtained from the neuron's activation function. The activation of a neuron is sometimes thought of as corresponding to the average firing rate of a biological neuron.

activation function
In neural networks, an activation function is the function that describes the output behaviour of a neuron. Most network architectures start by computing the weighted sum of the inputs (that is, the sum of the product of each input with the weight associated with that input. This quantity, the total net input is then usually transformed in some way, using what is sometimes called a squashing function. The simplest squashing function is a step function: if the total net input is less than 0 (or more generally, less than some threshold T) then the output of the neuron is 0, otherwise it is 1. A common squashing function is the logistic function.

In summary, the activation function is the result of applying a squashing function to the total net input.

A propositional learning system, developed by Michalski.

asynchronous vs synchronous
When a neural network is viewed as a collection of connected computation devices, the question arises whether the nodes/devices share a common clock, so that they all perform their computations ("fire") at the same time, (i.e. synchronously) or whether they fire at different times, e.g. they may fire equally often on average, but in a random sequence (i.e. asynchronously). In the simplest meaningful case, there are two processing nodes, each connected to the other, as shown below.

In the asynchronous case, if the yellow node fires first, then it uses the then current value of its input from the red node to determine its output in time step 2, and the red node, if it fires next, will use the updated output from the yellow node to compute its new output in time step 3. In summary, the output values of the red and yellow nodes in time step 3 depend on the outputs of the yellow and red nodes in time steps 2 and 1, respectively.

In the synchronous case, each node obtains the current output of the other node at the same time, and uses the value obtained to compute its new output (in time step 2). In summary, the output values of the red and yellow nodes in time step 2 depend on the outputs of the yellow and red nodes in time step 1. This can produce a different result from the asynchronous method.

Some neural network algorithms are firmly tied to synchronous updates, and some can be operated in either mode. Biological neurons normally fire asynchronously.

An attribute is a property of an instance that may be used to determine its classification. For example, when classifying objects into different types in a robotic vision task, the size and shape of an instance may be appropriate attributes. Determining useful attributes that can be reasonably calculated may be a difficult job - for example, what attributes of an arbitrary chess end-game position would you use to decide who can win the game? This particular attribute selection problem has been solved, but with considerable effort and difficulty.

Attributes are sometimes also called features.

The axon is the "output" part of a biological neuron. When a neuron fires, a pulse of electrical activity flows along the axon. Towards its end, or ends, the axon splits into a tree. The ends of the axon come into close contact with the dendrites of other neurons. These junctions are termed synapses. Axons may be short (a couple of millimetres) or long (e.g. the axons of the nerves that run down the legs of a reasonably large animal.)


backed-up error estimate
In decision tree pruning one of the issues in deciding whether to prune a branch of the tree is whether the estimated error in classification is greater if the branch is present or pruned. To estimate the error if the branch is present, one takes the estimated errors associated with the children of the branch nodes (which of course must have been previously computed), multiplies them by the estimated frequencies that the current branch will classify data to each child node, and adds up the resulting products. The frequencies are estimated from the numbers of training data instances that are classified as belonging to each child node. This sum is called the backed-up error estimate for the branch node. (The concept of a backed-up error estimate does not make sense for a leaf node.)

See also expected error estimate.

backward pass in backpropagation
The phase of the error backpropagation learning algorithm when the weights are updated, using the delta rule or some modification of it.

The backward pass starts at the output layer of the feedforward network, and updates the incoming weights to units in that layer using the delta rule. Then it works backward, starting with the penultimate layer (last hidden layer), updating the incoming weights to those layers.

Statistics collected during the forward pass are used during the backward pass in updating the weights.

backpropagation or backprop
see error backpropagation

In feedforward and some other neural networks, each hidden unit and each output unit is connected via a trainable weight to a unit (the bias unit) that always has an activation level of –1.

This has the effect of giving each hidden or output a trainable threshold, equal to the value of the weight from the bias unit to the unit.

biological neuron

This is a very much simplified diagram of a biological neuron. Biological neurons come in a variety of types. There is a lot of further structure and physiology that could be considered. The features of a neuron shown above are those of most interest for those constructing artificial neural networks (other than spiking-neuron-based models, and those relying on synchronous activation, such as the Shastri and Ajjanagadde model: see L.Shastri and V. Ajjanagadde: Behavioral and Brain Sciences (1993) 16, 417-494).

However, from the artificial neural network point of view, a biological neuron operates as follows: electrical pulses from other neurons cause the transfer of substances called neurotransmitters (of which there are several varieties) from the synaptic terminals of a neuron's axon (think "output") across a structure called a synapse to the dendrites of other neurons (call them downstream neurons). The arrival of the neurotransmitter in the dendrite of the downstream neuron increases the tendency of the downstream neuron to send an electrical pulse itself ("fire"). If enough dendrites of a neuron receive neurotransmitters in a short enough period of time, the neuron will fire.

Caveat: neurotransmitter substances may be excitatory or inhibitory. The text above assumes that only excitatory neurotransmitters are involved. Inhibitory neurotransmitters, as the name suggests, reduce the tendency of a neuron to fire. Some neurons have a mixture of excitatory synapses and inhibitory synapses (i.e. synapses using inhibitory neurotransmitters) and will only fire if there is enough additional excitatory input to overcome the effect of the inhibitory synapses.


C4.5 is a later version of the ID3 decision tree induction algorithm.

C5 is a later version of the ID3 decision tree induction algorithm.

cell body
The cell body is the large open space in the "middle" of the neuron, between the dendrites and the axon, where the cell nucleus lives. A landmark in biological neuron architecture, but not of significance in relation to artificial neural networks (except for those trying to model biological neurons as distinct from using simplified neuron models to solve diverse problems).

When a neuron in an neural network has its value forcibly set and fixed to some externally provided value, it is said to be clamped. Such a neuron serves as an input unit for the net.

classes in classification tasks
In a classification task in machine learning, the task is to take each instance and assign it to a particular class. For example, in a machine vision application, the task might involve analysing images of objects on a conveyor belt, and classifying them as nuts, bolts, or other components of some object being assembled. In an optical character recognition task, the task would involve taking instances representing images of characters, and classifying according to which character they are. Frequently in examples, for the sake of simplicity if nothing else, just two classes, sometimes called positive and negative, are used.

concept learning system (CLS)
A decision tree induction program - a precursor of ID3.

conjunctive expressions
A conjunctive expression is an expression like:
size=large and colour in {red, orange}
that is the conjunction of two or more simple predicates like size=large. The term conjunction refers to the presence of the logical operator and, which joins or conjoins the simple predicates. Occasionally the conjunctive expression might just consist of a single predicate.

See also propositional learning systems and covering algorithm.

Connectionism is the neural network approach to solving problems in artificial intelligence - the idea being that connectionists feel that it is appropriate to encode knowledge in the weighted connections between nodes in a neural net. The word "connectionist" is sometimes used with all the heat and force associated with political "isms": "He's a connectionist" can be just as damning, coming from the wrong person, as "He's a communist (or capitalist)". It is also sometimes used as a simple adjective, as in "NetTalk is a connectionist system."

covering algorithm
A covering algorithm, in the context of propositional learning systems, is an algorithm that develops a cover for the set of positive examples - that is, a set of conjunctive expressions that account for all the examples but none of the non-examples.

The algorithm - given a set of examples:

  1. Start with an empty cover.
  2. Select an example.
  3. Find the set of all conjunctive expressions that cover that example.
  4. Select the "best" expression x from that set, according to some criterion (usually "best" is a compromise between generality and compactness and readability).
  5. Add x to the cover.
  6. Go to step 2, unless there are no examples that are not already covered (in which case, stop).

decision trees
A decision tree is a tree in which each non-leaf node is labelled with an attribute or a question of some sort, and in which the branches at that node correspond to the possible values of the attribute, or answers to the question. For example, if the attribute was shape, then there would be branches below that node for the possible values of shape, say square, round and triangular. Leaf nodes are labelled with a class. Decision trees are used for classifying instances - one starts at the root of the tree, and, taking appropriate branches according to the attribute or question asked about at each branch node, one eventually comes to a leaf node. The label on that leaf node is the class for that instance.

delta rule
The delta rule in error backpropagation learning specifies the update to be made to each weight during backprop learning. Roughly speaking, it states that the change to the weight from node i to node j should be proportional to output of node j and also proportional to the "local gradient" at node j.

The local gradient, for an output node, is the product to the derivative of the squashing function evaluated at the total net input to node j, and the error signal (i.e. the difference between the target output and the actual output). In the case of a hidden node, the local gradient is the product of the derivative the squashing function (as above) and the weighted sum of the local gradients of the nodes to which node j is connected in subsequent layers of the net. Got it?

A dendrite is one of the branches on the input end of a biological neuron. It has connections, via synapses to the axons of other neurons.



For our purposes, the entropy measure
–Σi pilog2pi
gives us the average amount of information in bits in some
attribute of an instance. The information referred to is information about what class the instance belongs to, and pi is the probability that an instance belongs to class i.

The rationale for this is as follows: –log2(p) is the amount of information in bits associated with an event of probability p - for example, with an event of probability ½, like flipping a fair coin, log2((p) is –log2(½) = 1, so there is one bit of information. This should coincide with our intuition of what a bit means (if we have one). If there is a range of possible outcomes with associated probabilities, then to work out the average number of bits, we need to multiply the number of bits for each outcome (–log2(p)) by the probability p and sum over all the outcomes. This is where the formula comes from.

Entropy is used in the ID3 decision tree induction algorithm.

In training a neural net, the term epoch is used to describe a complete pass through all of the training patterns. The weights in the neural net may be updated after each pattern is presented to the net, or they may be updated just once at the end of the epoch. Frequently used as a measure of speed of learning - as in "training was complete after x epochs".

error backpropagation learning algorithm
The error backpropagation learning algorithm is a form of supervised learning used to train mainly feedforward neural networks to perform some task. In outline, the algorithm is as follows:
  1. Initialization: the weights of the network are initialized to small random values.
  2. Forward pass: The inputs of each training pattern are presented to the network. The outputs are computed using the inputs and the current weights of the network. Certain statistics are kept from this computation, and used in the next phase. The target outputs from each training pattern are compared with the actual activation levels of the output units - the difference between the two is termed the error. Training may be pattern-by-pattern or epoch-by-epoch. With pattern-by-pattern training, the pattern error is provided directly to the backward pass. With epoch-by-epoch training, the pattern errors are summed across all training patterns, and the total error is provided to the backward pass.

  3. Backward pass: In this phase, the weights of the net are updated. See the main article on the backward pass for some more detail.

  4. Go back to step 2. Continue doing forward and backward passes until the stopping criterion is satisfied.

See also forward pass, backward pass, delta rule, error surface, local minimum, gradient descent and momentum.

Error backpropagation learning is often familiarly referred to just as backprop.

error surface
When total error of a backpropagation-trained neural network is expressed as a function of the weights, and graphed (to the extent that this is possible with a large number of weights), the result is a surface termed the error surface. The course of learning can be traced on the error surface: as learning is supposed to reduce error, when the learning algorithm causes the weights to change, the current point on the error surface should descend into a valley of the error surface.

The "point" defined by the current set of weights is termed a point in weight space. Thus weight space is the set of all possible values of the weights.

See also local minimum and gradient descent.

excitatory connection
see weight.

expected error estimate
In pruning a decision tree, one needs to be able to estimate the expected error at any node (branch or leaf). This can be done using the Laplace error estimate, which is given by the formula

E(S) = (Nn + k – 1) / (N + k).
Sis the set of instances in a node
kis the number of classes (e.g. 2 if instances are just being classified into 2 classes: say positive and negative)
Nis the is the number of instances in S
Cis the majority class in S
nout of N examples in S belong to C


see attribute.

feedforward net
A kind of neural network in which the nodes can be numbered, in such a way that each node has weighted connections only to nodes with higher numbers. Such nets can be trained using the error backpropagation learning algorithm.

In practice, the nodes of most feedforward nets are partitioned into layers - that is, sets of nodes, and the layers may be numbered in such a way that the nodes in each layer are connected only to nodes in the next layer - that is, the layer with the next higher number. Commonly successive layers are totally interconnected - each node in the earlier layer is connected to every node in the next layer.

The first layer has no input connections, so consists of input units and is termed the input layer (yellow nodes in the diagram below).

The last layer has no output connections, so consists of output units and is termed the output layer (maroon nodes in the diagram below).

The layers in between the input and output layers are termed hidden layers, and consist of hidden units (light blue nodes and brown nodes in the diagram below).

When the net is operating, the activations of non-input neurons are computing using each neuron's activation function.

Feedforward network. All connections (arrows) are in one direction; there are no cycles of activation flow (cyclic subgraphs). Each colour identifies a different layer in the network. The layers 1 and 2 are fully interconnected, and so are layers 3 and 4. Layers 2 and 3 are only partly interconnected.

  1. In a biological neural network: neurons in a biological neural network fire when and if they receive enough stimulus via their (input) synapses. This means that an electrical impulse is propagated along the neuron's axon and transmitted to other neurons via the output synaptic connections of the neuron. The firing rate of a neuron is the frequency with which it fires (cf. activation in an artificial neural network.

  2. In an expert system: when a rule in the expert system is used, it is said to fire.
function approximation algorithms
include connectionist and statistical techniques of machine learning. The idea is that machine learning means learning, from a number of examples or instances or training patterns, to compute a function which has as its arguments variables corresponding to the input part of the training pattern(s), and has as its output variables corresponding to the output part of the training patterns, which maps the input part of each training pattern to its output part. The hope is that the function will interpolate / generalize from the training patterns, so that it will produce reasonable outputs when given other inputs.

See also symbolic learning algorithms.

forward pass in backpropagation
In the forward pass in backpropagation, each training pattern is presented to the input units of the network. The hidden unit activations are computed from the inputs and input-to-hidden unit weights, and then (in the case of a 3-layer network, with only a single layer of hidden units) the outputs are computed using the hidden layer activations and the current hidden-to-output weights. Certain statistics are kept from this computation, and used in the backward pass. The target outputs from each training pattern are compared with the actual activation levels of the output units - the difference between the two is termed the error. Training may be pattern-by-pattern or epoch-by-epoch. With pattern-by-pattern training, the pattern error is provided directly to the backward pass. With epoch-by-epoch training, the pattern errors are summed across all training patterns, and the total error is provided to the backward pass.

generalization in backprop
Learning in backprop seems to operate by first of all getting a rough set of weights which fit the training patterns in a general sort of way, and then working progressively towards a set of weights that fit the training patterns exactly. If learning goes too far down this path, one may reach a set of weights that fits the idiosyncrasies of the particular set of patterns very well, but does not interpolate (i.e. generalize) well.

Moreover, with large complex sets of training patterns, it is likely that some errors may occur, either in the inputs or in the outputs. In that case, and again particularly in the later parts of the learning process, it is likely that backprop will be contorting the weights so as to fit precisely around training patterns that are actually erroneous! This phenomenon is known as over-fitting.

This problem can to some extent be avoided by stopping learning early. How does one tell when to stop? One method is to partition the training patterns into two sets (assuming that there are enough of them). The larger part of the training patterns, say 80% of them, chosen at random, form the training set, and the remaining 20% are referred to as the test set. Every now and again during training, one measures the performance of the current set of weights on the test set. One normally finds that the error on the training set drops monotonically (that's what a gradient descent algorithm is supposed to do, after all). However, error on the test set (which will be larger, per pattern, than the error on the training set) will fall at first, then start to rise as the algorithm begins to overtrain. Best generalization performance is gained by stopping the algorithm at the point where error on the test set starts to rise.

generalized delta rule
An improvement on the in error backpropagation learning. If the learning rate (often denoted by η) is small, the backprop algorithm proceeds slowly, but accurately follows the path of steepest descent on the error surface. If η is too large, the algorithm may "bounce off the canyon walls of the error surface" - i.e. not work well. This can be largely avoided by modifying the delta rule to include a momentum term:

Δwji(n) = α Δwji(n–1) + η δj(n) yi(n)

in the notation of Haykin's text (Neural networks - a comprehensive foundation). The constant α is a termed the momentum constant and can be adjusted to achieve the best effect. The second summand corresponds to the standard delta rule, while the first summand says "add α × the previous change to this weight."

This new rule is called the generalized delta rule. The effect is that if the basic delta rule would be consistently pushing a weight in the same direction, then it gradually gathers "momentum" in that direction.

gradient descent
Understanding this term depends to some extent on the error surface metaphor.

When an artificial neural network learning algorithm causes the weights of the net to change, it will do so in such a way that the current point on the error surface will descend into a valley of the error surface, in a direction that corresponds to the steepest (downhill) gradient or slope at the current point on the error surface. For this reason, backprop is said to be a gradient descent method, and to perform gradient descent in weight space.

See also local minimum.


hidden layer
Neurons or units in a feedforward net are usually structured into two or more layers. The input units constitute the input layer. The output units constitute the output layer. Layers in between the input and output layers (that is, layers that consist of hidden units) are termed hidden layers.

In layered nets, each neuron in a given layer is connected by trainable weights to each neuron in the next layer.

hidden unit / node
A hidden unit in a neural network is a neuron which is neither an input unit nor an output unit.

hypothesis language
Term used in analysing machine learning methods. The hypothesis language refers to the notation used by the learning method to represent what it has learned so far. For example, in ID3, the hypothesis language would be the notation used to represent the decision tree (including partial descriptions of incomplete decision trees). In backprop, the hypothesis language would be the notation used to represent the current set of weights. In Aq, the hypothesis language would be the notation used to represent the class descriptions (e.g.
class1 ← size=large and colour in {red, orange}).

See also observation language.


A decision tree induction algorithm, developed by Quinlan. ID3 stands for "Iterative Dichotomizer (version) 3". Later versions include C4.5 and C5.

inhibitory connection
see weight.

input unit
An input unit in a neural network is a neuron with no input connections of its own. Its activation thus comes from outside the neural net. The input unit is said to have its value clamped to the external value.

This term has two, only distantly related, uses:

  1. An instance is a term used in machine learning particularly with symbolic learning algorithms, to describe a single training or test item, usually in the form of a description of the item in terms of its attributes, along with its intended classification. With supervised connectionist learning algorithms, it is more usual to speak of (training or test) patterns.

  2. In general AI parlance, an instance frame is a frame representing a particular individual, as opposed to a generic frame.




Laplace error estimate
this is described in the article on expected error estimates.

layer in a neural network
see article on feedforward networks.

learning program
Normal programs P produce the same output y each time they receive a particular input x. Learning programs are capable of improving their performance so that they may produce different (better) results on second or later times that they receive the same input x.

They achieve this by being able to alter their internal state, q. In effect, they are computing a function of two arguments, P(x | q) = y. When the program is in learning mode, the program computes a new state q' as well as the output y, as it executes.

In the case of supervised learning, in order to construct q', one needs a set of inputs xi and corresponding target outputs zi (i.e. you want P(xi | q) = zi when learning is complete). The new state function L is computed as:

L(P, q, ((x1,z1), ..., (xn, zn))) = q'

See also unsupervised learning, observation language, and hypothesis language.

learning rate
a constant used in error backpropagation learning and other artificial neural network learning algorithms to affect the speed of learning. The mathematics of e.g. backprop are based on small changes being made to the weights at each step: if the changes made to weights are too large, the algorithm may "bounce around" the error surface in a counter-productive fashion. In this case, it is necessary to reduce the learning rate. On the other hand, the small the learning rate, the more steps it takes to get to the stopping criterion. See also momentum.

linear threshold unit (LTU)
A linear threshold unit is a simple artificial neuron whose output is its thresholded total net input. That is, an LTU with threshold T calculates the weighted sum of its inputs, and then outputs 0 if this sum is less than T, and 1 if the sum is greater than T. LTU's form the basis of perceptrons.

local minimum
Understanding this term depends to some extent on the error surface metaphor.

When an artificial neural network learning algorithm causes the total error of the net to descend into a valley of the error surface, that valley may or may not lead to the lowest point on the entire error surface. If it does not, the minimum into which the total error will eventually fall is termed a local minimum. The learning algorithm is sometimes referred to in this case as "trapped in a local minimum."

In such cases, it usually helps to restart the algorithm with a new, randomly chosen initial set of weights - i.e. at a new random point in weight space. As this means a new starting point on the error surface, it is likely to lead into a different valley, and hopefully this one will lead to the true (absolute) minimum error, or at least a better minimum error.

logistic function
The function
φ(x) = 1/(1 + exp(–x))
which, when graphed, looks rather like a smoothed version of the step function
step(x) = 0 if x < 0, = 1 if x ≥ 0

graph of logistic function

It is used to transform the
total net input of an artificial neuron in some implementations of backprop-trained networks.

A related function, also sometimes used in backprop-trained networks, is 2φ(x)–1, which can also be expressed as tanh(x/2). tanh(x/2) is, of course, a smoothed version of the step function which jumps from –1 to 1 at x = 0, i.e. the function which = –1 if x < 0, and = 1 if x ≥ 0.


machine learning
Machine learning is said to occur in a program that can modify some aspect of itself, often referred to as its state, so that on a subsequent execution with the same input, a different (hopefully better) output is produced. See unsupervised learning and supervised learning, and also function approximation algorithms and symbolic learning algorithms.

momentum in backprop
See article on generalized delta rule.

multilayer perceptron (MLP)
See also feedforward network. Such a neural network differs from earlier perceptron-based models in two respects:


neural network
An artificial neural network is a collection of simple artificial neurons connected by directed weighted connections. When the system is set running, the activation levels of the input units is clamped to desired values. After this the activation is propagated, at each time step, along the directed weighted connections to other units. The activations of non-input neurons are computing using each neuron's activation function. The system might either settle into a stable state after a number of time steps, or in the case of a feedforward network, the activation might flow through to output units.

Learning might or might not occur, depending on the type of neural network and the mode of operation of the network.

see neuron.

neuron (artificial)
A simple model of a biological neuron used in neural networks to perform a small part of some overall computational problem. It has inputs from other neurons, with each of which is associated a weight - that is, a number which indicates the degree of importance which this neuron attaches to that input. It also has an activation function, and a bias. The bias acts like a threshold in a perceptron.

Also referred to a neurode, node, or unit.

see neuron, for a node in a neural network.
See graph for a node in a graph.

noisy data in machine learning
The term "noise" in this context refers to errors in the training data for machine learning algorithms. If a problem is difficult enough and complicated enough to be worth doing with machine learning techniques, then any reasonable training set is going to be large enough that there are likely to be errors in it. This will of course cause problems for the learning algorithm.

See also decision tree pruning and generalization in backprop.


observation language
Term used in analysing machine learning methods. The observation language refers to the notation used by the learning method to represent the data it uses for training. For example, in ID3, the observation language would be the notation used to represent the training instances, including attributes and their allowable values, and the way instances are described using attributes. In backprop, the observation language would be the notation used to represent the training patterns. In Aq, the observation language would again be the notation used to represent the instances, much as in ID3.

See also hypothesis language.

output unit
An output unit in a neural network is a neuron with no output connections of its own. Its activation thus serves as one of the output values of the neural net.

see article on generalization in backprop


A perceptron is a simple artificial neuron whose activation function consists of taking the total net input and ouputting 1 if this is above a threshold T, and 0 otherwise.

Perceptrons were originally used as pattern classifiers, where the term pattern is here used not in the sense of training pattern, but just in the sense of an input pattern that is to be put into on of several classes. Perceptual pattern classifiers of this sort (not based on perceptrons!) occur in simple animal visual systems, which can distinguish between prey, predators, and neutral environmental objects.

See also perceptron learning, and the XOR problem.

perceptron learning
The perceptron learning algorithm:
  1. All weights are initially set to zero.
  2. For each training example:
    • if the perceptron outputs 0 when it should output 1, then add the input vector to the weight vector.
    • if the perceptron outputs 1 when it should output 0, then subtract the input vector to the weight vector.
  3. Repeat step 2 until the perceptron yields the correct result for each training example.

propositional learning systems
A propositional learning system represents what it learns about the training instances by expressions equivalent to sentences in some form of logic (e.g.
class1 ← size=large and colour in {red, orange})

. See Aq and covering algorithm.

pruning decision trees
The data used to generate a decision tree, using an algorithm like the ID3 tree induction algorithm, can include noise - that is, instances that contain errors, either in the form of a wrong classification, or in the form of a wrong attribute value. There could also be instances whose classification is problematical because the attributes available are not sufficient to discriminate between some cases.

If for example, a node of the tree contains, say, 99 items in class C1 and 1 in class C2, it is plausible that the 1 item in class C2 is there because of an error either of classification or of feature value. There can thus be an argument for regarding this node as a leaf node of class C1. This termed pruning the decision tree.

The algorithm given in lectures for deciding when to prune is as follows:
At a branch node that is a candidate for pruning:

  1. Approximate expected error for the node.
  2. Approximate backed-up error from the children assuming that we do not prune.
  3. If expected error is less than backed-up error, then prune.



recurrent network
A recurrent network is a neural network in which there is at least one cycle of activation flow. To put it another way, underlying any neural network there is a directed graph, obtained by regarding the neurons as nodes in the graph and the weights as directed edges in the graph. If this graph is not acyclic, the network is recurrent.

A recurrent connection is one that is part of a directed cycle, although term is sometimes reserved for a connection which is clearly going in the "wrong" direction in an otherwise feedforward network.

Recurrent networks include fully recurrent networks in which each neuron is connected to every other neuron, and partly recurrent networks in which greater or lesser numbers of recurrent connections exist. See also simple recurrent network.

This article is included for general interest - recurrent networks are not part of the syllabus of COMP9414 Artificial Intelligence.


sequence prediction tasks
Feedforward networks are fine for classifying objects, but their units (as distinct from their weights) have no memory of previous inputs. Consequently they are unable to cope with sequence prediction tasks - tasks like predicting, given a sequence of sunspot activity counts, what the sunspot activity for the next time period will be, and financial prediction tasks (e.g. given share prices for the last n days, and presumably other economic data, etc., predict tomorrow's share price).

Simple recurrent nets can tackle tasks like this, because they do have a kind of memory for recording information derived from activation values from previous time steps.

This article is included for general interest - sequence prediction tasks are not part of the syllabus of COMP9414 Artificial Intelligence.

sigmoidal nonlinearity
Another name for the logistic function and certain related functions (such as tanh(x)). Sigmoidal functions are a type of squashing function. They are called because sigma is the Greek letter "s", and the logistic functions look somewhat like a sloping letter "s" when graphed.

simple recurrent network
A simple recurrent network is like a feedforward network with an input layer, and output layer, and a single hidden layer, except that there is a further group of units called state units or context units. There is one state unit for each hidden unit. The activation function of the state unit is as follows: the activation of a state unit in time step n is the same of that of the corresponding hidden unit in time step n–1. That is, the state unit activations are copies of the hidden unit activations from the previous time step. Each state unit is also connected to each hidden unit by a trainable weight - the direction of this connection is from the state unit to the hidden unit.

SRN diagram

Simple recurrent networks can learn sequence prediction tasks.

See also recurrent networks.

This article is included for general interest - simple recurrent networks are not part of the syllabus of COMP9414 Artificial Intelligence.

splitting criterion in ID3
The point of the ID3 algorithm is to decide the best attribute, out of those not already used, on which to split the training instances that are classified to a particular branch node.

The algorithm, in outline, is as follows:

  1. if all the instances belong to a single class, there is nothing to do (except create a leaf node labelled with the name of that class).
  2. otherwise, for each attribute that has not already been used, calculate the information gain that would be obtained by using that attribute on the particular set of instances classified to this branch node.
  3. use the attribute with the greatest information gain.

This leaves the question of how to calculate the information gain associated with using a particular attribute A. Suppose that there are k classes C1, C2, ..., Ck, and that of the N instances classified to this node, I1 belong to class C1, I2 belong to class C2, ..., and Ik belong to class Ck.
Let p1 = I1/N, p2 = I2/N, ..., and pk = Ik/N.
The initial entropy E at this node is:

p1log2(p1) –p2log2(p2) ... –pklog2(pk).
Now split the instances on each value of the chosen attribute A. Suppose that there are r attribute values for A, namely a1, a2, ..., ar.
For a particular value aj, say, suppose that there are Jj,1 instances in class C1, Jj,2 instances in class C2, ..., and Jj,k instances in class Ck, for a total of Jj instances having attribute value aj.
Let qj,1 = Jj,1/Jj, qj,2 = Jj,2/Jj, ..., and qj,k = Jj,k/Jj.
The entropy Ej associated with this attribute value aj this position is:
qj,1log2(qj,1) –qj,2log2(qj,2) ... –qj,klog2(qj,k).
Now compute:
E – ((J1/N).E1 + (J2/N).E2 + ... + (Jr/N).Er).
This is the information gain for attribute A.
Note that Jj/N is the estimated probability that an instance classified to this node will have value aj for attribute A. Thus we are weighting the entropy estimates Ej by the estimated probability that an instance has the associated attribute value.

In terms of the example used in the lecture notes, (see also calculations in lecture notes), k = 2 since there are just two classes, positive and negative. I1 = 4 and I2 = 3, and N =7, and so p1 = 4/7 and p2 = 3/7, and E = –p1log2(p1) –p2log2(p2) = –(4/7)×log2(4/7) – (3/7)×log2(3/7). In the example, the first attribute A considered is size, and the first value of size considered, large, corresponds to a1, in the example in the lecture notes, so J1,1 = 2 = J1,2, and J1 = 4. Thus q1,1 = J1,1/J1 = 2/4 = ½, and q1,2 = J1,2/J1 = 2/4 = ½, and E1 = –q1,1log2(q1,1) –q1,2log2(q1,2) = -½×log2(½) – ½×log2(½) = 1.
Similarly E2 = 1 and J2 = 2 (size = small), and E3 = 0 and J3 = 1 (size = medium) so the final information gain,

E – ((J1/N).E1 + (J2/N).E2 + ... + (Jr/N).Er)
= E – ((4/7)×E1 + (2/7)×E2 + (1/7)×E3)
which turned out to be about 0.13 in the example.

squashing function
One of several functions, including sigmoid and step functions (see threshold used to transform the total net input to a neuron to obtain the final output of the neuron.

stopping criterion in backprop
Possible stopping criteria in error backpropagation learning include: Combinations of the two (e.g. whichever of the two occurs first) and other stopping conditions are possible. See the reference by Haykin (Neural networks: a comprehensive foundation p. 153) for more details.

supervised learning
Supervised learning is a kind of machine learning where the learning algorithm is provided with a set of inputs for the algorithm along with the corresponding correct outputs, and learning involves the algorithm comparing its current actual output with the correct or target outputs, so that it knows what its error is, and modify things accordingly.

Contrast unsupervised learning.

symbolic learning algorithms
Symbolic learning algorithms learn concepts by constructing a symbolic expression (such as a decision tree, as in Aq ) that describes a class (or classes) of objects. Many such systems work with representations equivalent to first order logic.

Such learning algorithms have the advantage that their internal representations and their final output can be inspected and relatively easily understood by humans.

See also function approximation algorithms.

A synapse, in a biological neuron, is a connection between the axon of one neuron and the dendrite of another. It corresponds to a weight in an artificial neuron. Synapses have varying strengths and can be changed by learning (like weights). The operation mechanism of synapses is biochemical in nature, with transmitter substances crossing the tiny "synaptic gap" between axon and dendrite when the axon fires.

synchronous vs asynchronous
See asynchronous


target output
The target output of an artificial neuron is the output provided with the training pattern. The difference between this and the actual output of the neuron is the pattern error. This is used in .


Machine learning algorithms are trained on a collection of instances or patterns. Once training is complete, it is usual to test the trained system on a collection of test instances or test patterns which were not used when training the system, in order to find out to what degree the system is able to generalise beyond its training data.

One of the parameters of a perceptron . The output of the perceptron is 0 if the total net input of the perceptron is less than its threshold T, otherwise it is 1.

Compare bias.

total net input
The total net input to a artificial neuron refers to the weighted sum of the neuron's inputs (that is, the sum of the product of each input with the weight associated with that input. This quantity is then usually transformed in some way (see squashing function) to obtain the neuron's output.

See also activation function.

total sum-squared error (TSS)
A measure of how well a backprop-trained neural network is doing at a particular point during its learning.

Given a training pattern, its squared error is obtained by squaring the difference between the target output of an output neuron and the actual output. The sum-squared error, or pattern sum-squared error (PSS), is obtained by adding up the sum-squared errors for each output neuron. The total sum-squared error is obtained by adding up the PSS for each training pattern.

trainable weight
In artificial neural networks that use weighted connections, some of those connections may have fixed values - i.e. they are specified as not being subject to change when the network is trained, e.g. using the error backpropagation learning algorithm.

The other weights, the ones that are subject to change when the net is learning, are referred to as trainable weights.

training pattern

This term is used to refer to a set of input values and the corresponding set of desired or target output values for use in training an artificial neural network. Usually, a largish number of training patterns would be used in the training of any genuinely useful neural network.

In toy problems like the XOR problem, only a few training patterns (in the case of XOR, just 4) may be used.

Patterns used for testing the trained network are referred to as test patterns.

Compare instance.

tree induction algorithm
This article describes the basic tree induction algorithm used by ID3 and successors. The basic idea is to pick an attribute A with values a1, a2, ..., ar, split the training instances into subsets Sa1, Sa2, ..., Sar consisting of those instances that have the corresponding attribute value. Then if a subset has only instances in a single class, that part of the tree stops with a leaf node labelled with the single class. If not, then the subset is split again, recursively, using a different attribute.

This leaves the question of how to choose the best attribute to split on at any branch node. This issue is handled in the article on splitting criterion in ID3.


see neuron.

unsupervised learning
Unsupervised learning signifies a mode of machine learning where the system is not told the "right answer" - for example, it is not trained on pairs consisting of an input and the desired output. Instead the system is given the input patterns and is left to find interesting patterns, regularities, or clusterings among them. To be contrasted to supervised learning, as in error backpropagation and ID3, where the system is told the desired or target outputs to associate with each input pattern used for training.



A weight, in a artificial neural network, is a parameter associated with a connection from one neuron, M, to another neuron N. It corresponds to a synapse in a biological neuron, and it determines how much notice the neuron N pays to the activation it receives from neuron N. If the weight is positive, the connection is called excitatory, while if the weight is negative, the connection is called inhibitory. See also to a neuron.

weight space
See the article on error surfaces in weight space.

windowing in ID3
Windowing, in ID3, and similar algorithms, is a way of dealing with really large sets of training instances.

Without windowing, such an algorithm can be really slow, as it needs to do its information gain calculations (see tree induction algorithms) over huge amounts of data.

With windowing, training is done on a relatively small sample of the data, and then checked against the full set of training data.

Here is the windowing algorithm:

  1. Select a sample S of the training instances at random - say 10% of them. The actual proportion chosen would need to be small enough that ID3 could run fairly fast on them, but large enough to be representative of the whole set of examples.
  2. Run the ID3 algorithm on the set of training instances to obtain a decision tree.
  3. Check the decision tree on the full data set, to obtain a set E of training instances that are misclassified by the tree obtained.
  4. If E is empty, stop.
  5. Let the new set of training instances S' be the union of S and E.
  6. Go to step 2 and run the ID3 calculations with S'.


XOR problem
p XOR q is a binary logical operator which can be defined either as not (p equiv q), or as not p and q or p and not q. It rose to fame in the late 60's when Minsky and Papert demonstrated that perceptron-based systems cannot learn to compute the XOR function. Subsequently, when backprogation-trained multi-layer perceptrons were introduced, one of the first tests performed was to demonstrate that they could in fact learn to compute XOR. No one in their right mind would probably ever have cause to compute XOR in this way in practice, but XOR and generalizations of XOR remain popular as tests of backpropagation-type systems.