Copyright © Bill Wilson, 1998 - 2012
You should use The Machine Learning Dictionary to clarify or revise concepts that you have already met. The Machine Learning Dictionary is not a suitable way to begin to learn about Machine Learning.
Further information on Machine Learning can be found in the class web page lecture notes section. Other places to find out about machine learning would be the AAAI (American Association for Artificial Intelligence) Machine Learning page, and their AI Reference Shelf for more general information.
Other related dictionaries:
The Prolog Dictionary -
URL: http://www.cse.unsw.edu.au/~billw/prologdict.html
The Artificial Intelligence Dictionary -
URL: http://www.cse.unsw.edu.au/~billw/aidict.html
The NLP Dictionary (Natural Language Processing) -
URL: http://www.cse.unsw.edu.au/~billw/nlpdict.html
The URL of this Machine Learning Dictionary is http://www.cse.unsw.edu.au/~billw/dictionaries/mldict.html
This dictionary is limited to the ML concepts covered in COMP9414 Artificial Intelligence, at the University of New South Wales, Sydney:
activation level | activation function | Aq | asynchronous | attributes | axon | backed-up error estimate | backpropagation | backward pass in backpropagation | bias | biological neuron | C4.5 | C5 | cell body | clamping | classes in classification tasks | concept learning system (CLS) | conjunctive expressions | connectionism | covering algorithm | decision trees | delta rule | dendrite | entropy | epoch | error backpropagation | error surface | excitatory connection | expected error estimate | feature | feedforward networks | firing | forward pass in backpropagation | function approximation algorithms | generalization in backprop | generalized delta rule | gradient descent | hidden layer | hidden unit / node | hypothesis language | ID3 | inhibitory connection | input unit | instances | Laplace error estimate | layer in a neural network | learning program | learning rate | linear threshold unit | local minimum | logistic function | machine learning | momentum in backprop | multilayer perceptron (MLP) | neural network | neurode | neuron (artificial) | node | noisy data in machine learning | observation language | output unit | over-fitting | perceptron | perceptron learning | propositional learning systems | pruning decision trees | recurrent network | sequence prediction tasks | sigmoidal nonlinearity | simple recurrent network | splitting criterion in ID3 | squashing function | stopping criterion in backprop | supervised learning | symbolic learning algorithms | synapse | synchronous | target output | testing | threshold | total net input | total sum-squared error | trainable weight | training pattern | tree induction algorithm | unit | unsupervised learning | weight | weight space | windowing in ID3 | XOR problem
In summary, the activation level of a unit is the result of applying a squashing function to the unit's total net input.
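This relationship can be sketched in a few lines of Python (the logistic squashing function and the particular weight values below are illustrative assumptions):

```python
import math

def total_net_input(inputs, weights, bias_weight=0.0):
    """Weighted sum of the unit's inputs, plus the weight from an
    always-on bias unit."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias_weight

def squash(net):
    """Logistic squashing function: maps any net input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))

def activation(inputs, weights, bias_weight=0.0):
    """Activation level = squashing function applied to total net input."""
    return squash(total_net_input(inputs, weights, bias_weight))

# A unit whose total net input is zero sits at the midpoint of the
# logistic curve:
print(activation([1.0, 1.0], [0.5, -0.5]))  # → 0.5
```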
In the asynchronous case, if the yellow node fires first, it uses the then-current value of its input from the red node to determine its output in time step 2; if the red node fires next, it will use the updated output from the yellow node to compute its new output in time step 3. In summary, the output values of the red and yellow nodes in time step 3 depend on the outputs of the yellow and red nodes in time steps 2 and 1, respectively.
In the synchronous case, each node obtains the current output of the other node at the same time, and uses the value obtained to compute its new output (in time step 2). In summary, the output values of the red and yellow nodes in time step 2 depend on the outputs of the yellow and red nodes in time step 1. This can produce a different result from the asynchronous method.
Some neural network algorithms are firmly tied to synchronous updates, and some can be operated in either mode. Biological neurons normally fire asynchronously.
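The difference between the two modes can be seen with a toy two-node net. The update rule below (each node outputs the logical NOT of the other, as if through an inhibitory connection) is an illustrative assumption, not a standard model:

```python
def update(other_output):
    """Each node computes the logical NOT of the other node's output
    (an inhibitory connection -- an illustrative assumption)."""
    return 1 - other_output

# Start with both nodes firing.
red, yellow = 1, 1

# Synchronous: both nodes read the other's CURRENT output, then both
# update at once.
sync_red, sync_yellow = update(yellow), update(red)
print(sync_red, sync_yellow)  # → 0 0  (and the pair will oscillate)

# Asynchronous: yellow fires first using red's current output; red
# then fires using yellow's freshly updated output.
async_yellow = update(red)         # 1 - 1 = 0
async_red = update(async_yellow)   # 1 - 0 = 1
print(async_red, async_yellow)  # → 1 0  (a stable state)
```

The two modes reach genuinely different states here, which is the point of the distinction.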
Attributes are sometimes also called features.
See also expected error estimate.
The backward pass starts at the output layer of the feedforward network, and updates the incoming weights to units in that layer using the delta rule. Then it works backward, starting with the penultimate layer (the last hidden layer), updating the incoming weights to units in each layer in turn.
Statistics collected during the forward pass are used during the backward pass in updating the weights.
This has the effect of giving each hidden or output unit a trainable threshold, equal to the value of the weight from the bias unit to the unit.
This is a very much simplified diagram of a biological neuron. Biological neurons come in a variety of types. There is a lot of further structure and physiology that could be considered. The features of a neuron shown above are those of most interest for those constructing artificial neural networks (other than spiking-neuron-based models, and those relying on synchronous activation, such as the Shastri and Ajjanagadde model: see L.Shastri and V. Ajjanagadde: Behavioral and Brain Sciences (1993) 16, 417-494).
However, from the artificial neural network point of view, a biological neuron operates as follows: electrical pulses from other neurons cause the transfer of substances called neurotransmitters (of which there are several varieties) from the synaptic terminals of a neuron's axon (think "output") across a structure called a synapse to the dendrites of other neurons (call them downstream neurons). The arrival of the neurotransmitter in the dendrite of the downstream neuron increases the tendency of the downstream neuron to send an electrical pulse itself ("fire"). If enough dendrites of a neuron receive neurotransmitters in a short enough period of time, the neuron will fire.
Caveat: neurotransmitter substances may be excitatory or inhibitory. The text above assumes that only excitatory neurotransmitters are involved. Inhibitory neurotransmitters, as the name suggests, reduce the tendency of a neuron to fire. Some neurons have a mixture of excitatory synapses and inhibitory synapses (i.e. synapses using inhibitory neurotransmitters) and will only fire if there is enough additional excitatory input to overcome the effect of the inhibitory synapses.
See also propositional learning systems and covering algorithm.
The algorithm - given a set of examples:
The local gradient, for an output node j, is the product of the derivative of the squashing function, evaluated at the total net input to node j, and the error signal (i.e. the difference between the target output and the actual output). In the case of a hidden node j, the local gradient is the product of the derivative of the squashing function (as above) and the weighted sum of the local gradients of the nodes to which node j is connected in subsequent layers of the net. Got it?
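As a hedged sketch in Python, assuming a logistic squashing function (any differentiable squashing function would do):

```python
import math

def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))

def logistic_deriv(net):
    """Derivative of the logistic function: phi'(net) = y(1 - y)."""
    y = logistic(net)
    return y * (1.0 - y)

def output_delta(net_j, target, output):
    """Local gradient for an output node: derivative of the squashing
    function at the node's net input, times the error signal."""
    return logistic_deriv(net_j) * (target - output)

def hidden_delta(net_j, downstream_deltas, downstream_weights):
    """Local gradient for a hidden node: derivative of the squashing
    function, times the weighted sum of the local gradients of the
    nodes it feeds in the next layer."""
    weighted = sum(d * w for d, w in zip(downstream_deltas, downstream_weights))
    return logistic_deriv(net_j) * weighted

print(output_delta(0.0, 1.0, 0.5))  # → 0.125
```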
The rationale for this is as follows: –log_{2}(p) is the amount of information in bits associated with an event of probability p - for example, with an event of probability ½, like flipping a fair coin, –log_{2}(p) is –log_{2}(½) = 1, so there is one bit of information. This should coincide with our intuition of what a bit means (if we have one). If there is a range of possible outcomes with associated probabilities, then to work out the average number of bits, we need to multiply the number of bits for each outcome (–log_{2}(p)) by the probability p and sum over all the outcomes. This is where the formula comes from.
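A minimal Python sketch of that calculation:

```python
import math

def entropy(probabilities):
    """Average information in bits: the sum of -p * log2(p) over all
    outcomes. Outcomes with probability 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))   # → 1.0  (one fair coin flip = 1 bit)
print(entropy([0.25] * 4))   # → 2.0  (four equally likely outcomes)
```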
Entropy is used in the ID3 decision tree induction algorithm.
Error backpropagation learning is often familiarly referred to just as backprop.
The "point" defined by the current set of weights is termed a point in weight space. Thus weight space is the set of all possible values of the weights.
See also local minimum and gradient descent.
S | is the set of instances in a node |
k | is the number of classes (e.g. 2 if instances are just being classified into 2 classes: say positive and negative) |
N | is the number of instances in S |
C | is the majority class in S |
n | out of N examples in S belong to C |
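These quantities combine in the Laplace error estimate formula as it is usually given, E(S) = (N – n + k – 1) / (N + k); a minimal sketch:

```python
def laplace_error_estimate(N, n, k):
    """Laplace error estimate for a decision-tree node:
    N = number of instances at the node,
    n = number of those belonging to the majority class,
    k = number of classes."""
    return (N - n + k - 1) / (N + k)

# 99 of 100 instances in the majority class, with 2 classes:
print(laplace_error_estimate(100, 99, 2))  # → 2/102 ≈ 0.0196
```

Note how the estimate stays above zero even for a pure node, which is what makes it useful for comparing candidate prunes.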
In practice, the nodes of most feedforward nets are partitioned into layers - that is, sets of nodes, and the layers may be numbered in such a way that the nodes in each layer are connected only to nodes in the next layer - that is, the layer with the next higher number. Commonly successive layers are totally interconnected - each node in the earlier layer is connected to every node in the next layer.
The first layer has no input connections, so consists of input units and is termed the input layer (yellow nodes in the diagram below).
The last layer has no output connections, so consists of output units and is termed the output layer (maroon nodes in the diagram below).
The layers in between the input and output layers are termed hidden layers, and consist of hidden units (light blue nodes and brown nodes in the diagram below).
When the net is operating, the activations of non-input neurons are computed using each neuron's activation function.
Feedforward network. All connections (arrows) are in one direction; there are
no cycles of activation flow (cyclic subgraphs). Each colour identifies
a different layer in the network. Layers 1 and 2 are fully interconnected,
and so are layers 3 and 4. Layers 2 and 3 are only partly interconnected.
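A forward pass through such a layered net can be sketched as follows. The logistic activation function and the particular weights are illustrative assumptions, and absent connections are represented by zero weights:

```python
import math

def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))

def forward_pass(input_activations, layers):
    """Propagate activations through a layered feedforward net.
    `layers` is a list of weight matrices; layers[l][j][i] is the
    weight from unit i of one layer to unit j of the next."""
    activations = input_activations
    for weight_matrix in layers:
        activations = [
            logistic(sum(w * a for w, a in zip(row, activations)))
            for row in weight_matrix
        ]
    return activations

# A 2-2-1 net with illustrative weights (no bias units, for brevity):
hidden_w = [[1.0, -1.0], [0.5, 0.5]]
output_w = [[1.0, 1.0]]
print(forward_pass([1.0, 1.0], [hidden_w, output_w]))
```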
See also symbolic learning algorithms.
Moreover, with large complex sets of training patterns, it is likely that some errors may occur, either in the inputs or in the outputs. In that case, and again particularly in the later parts of the learning process, it is likely that backprop will be contorting the weights so as to fit precisely around training patterns that are actually erroneous! This phenomenon is known as over-fitting.
This problem can to some extent be avoided by stopping learning early. How does one tell when to stop? One method is to partition the training patterns into two sets (assuming that there are enough of them). The larger part of the training patterns, say 80% of them, chosen at random, form the training set, and the remaining 20% are referred to as the test set. Every now and again during training, one measures the performance of the current set of weights on the test set. One normally finds that the error on the training set drops monotonically (that's what a gradient descent algorithm is supposed to do, after all). However, error on the test set (which will be larger, per pattern, than the error on the training set) will fall at first, then start to rise as the algorithm begins to overtrain. Best generalization performance is gained by stopping the algorithm at the point where error on the test set starts to rise.
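The stopping rule described above can be sketched in Python. The error curves below are made-up illustrative numbers, not output from a real training run:

```python
import random

def early_stopping_split(patterns, test_fraction=0.2):
    """Randomly partition patterns into a training set and a test set."""
    shuffled = patterns[:]
    random.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def best_stopping_epoch(test_errors):
    """Stop at the epoch where test-set error is lowest -- i.e. just
    before it starts to rise as the net begins to over-fit."""
    return min(range(len(test_errors)), key=lambda e: test_errors[e])

# Training error falls monotonically; test error falls, then rises again:
train_err = [0.90, 0.50, 0.30, 0.20, 0.15, 0.12, 0.10]
test_err  = [0.95, 0.60, 0.40, 0.35, 0.38, 0.45, 0.60]
print(best_stopping_epoch(test_err))  # → 3
```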
Δw_{ji}(n) = α Δw_{ji}(n–1) + η δ_{j}(n) y_{i}(n)
in the notation of Haykin's text (Neural Networks - A Comprehensive Foundation). The constant α is termed the momentum constant and can be adjusted to achieve the best effect. The second summand corresponds to the standard delta rule, while the first summand says "add α × the previous change to this weight."
This new rule is called the generalized delta rule. The effect is that if the basic delta rule would be consistently pushing a weight in the same direction, then it gradually gathers "momentum" in that direction.
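A minimal sketch of the generalized delta rule update (the learning-rate and momentum values are illustrative assumptions):

```python
def generalized_delta_update(prev_change, delta_j, y_i,
                             learning_rate=0.1, momentum=0.9):
    """Weight change under the generalized delta rule:
    new change = momentum * previous change
               + learning_rate * local gradient * input activation."""
    return momentum * prev_change + learning_rate * delta_j * y_i

# If the basic delta rule keeps pushing in the same direction, the
# change "gathers momentum" from step to step:
change = 0.0
for _ in range(3):
    change = generalized_delta_update(change, delta_j=1.0, y_i=1.0)
print(round(change, 4))  # → 0.271
```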
When an artificial neural network learning algorithm causes the weights of the net to change, it will do so in such a way that the current point on the error surface will descend into a valley of the error surface, in a direction that corresponds to the steepest (downhill) gradient or slope at the current point on the error surface. For this reason, backprop is said to be a gradient descent method, and to perform gradient descent in weight space.
See also local minimum.
In layered nets, each neuron in a given layer is connected by trainable weights to each neuron in the next layer.
See also observation language.
They achieve this by being able to alter their internal state, q. In effect, they are computing a function of two arguments, P(x | q) = y. When the program is in learning mode, the program computes a new state q' as well as the output y, as it executes.
In the case of supervised learning, in order to construct q', one needs a set of inputs x_{i} and corresponding target outputs z_{i} (i.e. you want P(x_{i} | q) = z_{i} when learning is complete). The new state function L is computed as:
L(P, q, ((x_{1},z_{1}), ..., (x_{n}, z_{n}))) = q'
See also unsupervised learning, observation language, and hypothesis language.
When an artificial neural network learning algorithm causes the total error of the net to descend into a valley of the error surface, that valley may or may not lead to the lowest point on the entire error surface. If it does not, the minimum into which the total error will eventually fall is termed a local minimum. The learning algorithm is sometimes referred to in this case as "trapped in a local minimum."
In such cases, it usually helps to restart the algorithm with a new, randomly chosen initial set of weights - i.e. at a new random point in weight space. As this means a new starting point on the error surface, it is likely to lead into a different valley, and hopefully this one will lead to the true (absolute) minimum error, or at least a better minimum error.
A related function, also sometimes used in backprop-trained networks, is 2φ(x)–1, which can also be expressed as tanh(x/2). tanh(x/2) is, of course, a smoothed version of the step function which jumps from –1 to 1 at x = 0, i.e. the function which = –1 if x < 0, and = 1 if x ≥ 0.
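The identity 2φ(x) – 1 = tanh(x/2) can be checked numerically in a couple of lines:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# 2*phi(x) - 1 equals tanh(x/2) for every x: both are smoothed step
# functions running from -1 to 1 rather than from 0 to 1.
for x in [-3.0, -0.5, 0.0, 1.0, 4.0]:
    assert abs((2 * logistic(x) - 1) - math.tanh(x / 2)) < 1e-12
print("identity holds")
```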
Learning might or might not occur, depending on the type of neural network and the mode of operation of the network.
Also referred to as a neurode, node, or unit.
See also decision tree pruning and generalization in backprop.
See also hypothesis language.
Perceptrons were originally used as pattern classifiers, where the term pattern is here used not in the sense of training pattern, but just in the sense of an input pattern that is to be put into one of several classes. Perceptual pattern classifiers of this sort (not based on perceptrons!) occur in simple animal visual systems, which can distinguish between prey, predators, and neutral environmental objects.
See also perceptron learning, and the XOR problem.
If, for example, a node of the tree contains, say, 99 items in class C1 and 1 in class C2, it is plausible that the 1 item in class C2 is there because of an error either of classification or of feature value. There can thus be an argument for regarding this node as a leaf node of class C1. This is termed pruning the decision tree.
The algorithm given in lectures for deciding when to prune is as
follows:
At a branch node that is a candidate for pruning:
A recurrent connection is one that is part of a directed cycle, although the term is sometimes reserved for a connection which is clearly going in the "wrong" direction in an otherwise feedforward network.
Recurrent networks include fully recurrent networks in which each neuron is connected to every other neuron, and partly recurrent networks in which greater or lesser numbers of recurrent connections exist. See also simple recurrent network.
This article is included for general interest - recurrent networks are not part of the syllabus of COMP9414 Artificial Intelligence.
Simple recurrent nets can tackle tasks like this, because they do have a kind of memory for recording information derived from activation values from previous time steps.
This article is included for general interest - sequence prediction tasks are not part of the syllabus of COMP9414 Artificial Intelligence.
Simple recurrent networks can learn sequence prediction tasks.
See also recurrent networks.
This article is included for general interest - simple recurrent networks are not part of the syllabus of COMP9414 Artificial Intelligence.
The algorithm, in outline, is as follows:
This leaves the question of how to calculate the information
gain associated with using a particular attribute A. Suppose
that there are k classes C_{1},
C_{2}, ..., C_{k},
and that of the N instances
classified to this node,
I_{1} belong to class C_{1},
I_{2} belong to class C_{2}, ..., and
I_{k} belong to class C_{k}.
Let p_{1} = I_{1}/N,
p_{2} = I_{2}/N, ..., and
p_{k} = I_{k}/N.
The initial entropy E at this node is:
–p_{1}log_{2}(p_{1}) – p_{2}log_{2}(p_{2}) – ... – p_{k}log_{2}(p_{k}).
Now split the instances on each value of the chosen attribute A. Suppose that there are r attribute values for A, namely a_{1}, a_{2}, ..., a_{r}.
For each value a_{j}, let J_{j} be the number of instances with A = a_{j}, and let q_{j,i} be the proportion of those J_{j} instances that belong to class C_{i}. The entropy E_{j} of the set of instances with A = a_{j} is then:
–q_{j,1}log_{2}(q_{j,1}) – q_{j,2}log_{2}(q_{j,2}) – ... – q_{j,k}log_{2}(q_{j,k}).
Now compute:
E – ((J_{1}/N).E_{1} + (J_{2}/N).E_{2} + ... + (J_{r}/N).E_{r}).
This is the information gain for attribute A.
In terms of the example used in the lecture notes (see also calculations in lecture notes),
k = 2 since
there are just two classes, positive and negative.
I_{1} = 4 and I_{2} = 3, and N = 7,
and so p_{1} = 4/7 and p_{2} = 3/7, and E =
–p_{1}log_{2}(p_{1})
–p_{2}log_{2}(p_{2}) = –(4/7)×log_{2}(4/7) – (3/7)×log_{2}(3/7).
In the example, the first attribute A considered is size, and the first value of size considered, large, corresponds to a_{1}; so J_{1,1} = 2 = J_{1,2}, and J_{1} = 4.
Thus q_{1,1} = J_{1,1}/J_{1} = 2/4 = ½, and
q_{1,2} = J_{1,2}/J_{1} = 2/4 = ½, and
E_{1} =
–q_{1,1}log_{2}(q_{1,1})
–q_{1,2}log_{2}(q_{1,2}) = -½×log_{2}(½) – ½×log_{2}(½)
= 1.
Similarly E_{2} = 1 and J_{2} = 2
(size = small),
and E_{3} = 0 and J_{3} = 1
(size = medium)
so the final information gain is
E – ((J_{1}/N).E_{1} + (J_{2}/N).E_{2} + ... + (J_{r}/N).E_{r})
= E – ((4/7)×E_{1} + (2/7)×E_{2} + (1/7)×E_{3}),
which turned out to be about 0.13 in the example.
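The whole worked example can be checked with a few lines of Python:

```python
import math

def entropy(counts):
    """Entropy of a node from its class counts: sum of -p * log2(p)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# The worked example: 4 positive and 3 negative instances at the node,
# split by size into large (2 pos, 2 neg), small (1 pos, 1 neg), and
# medium (1 pos, 0 neg).
E = entropy([4, 3])
subsets = [[2, 2], [1, 1], [1, 0]]   # class counts for each size value
N = 7
gain = E - sum((sum(s) / N) * entropy(s) for s in subsets)
print(round(gain, 2))  # → 0.13
```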
Contrast unsupervised learning.
Such learning algorithms have the advantage that their internal representations and their final output can be inspected and relatively easily understood by humans.
See also function approximation algorithms.
Compare bias.
See also activation function.
Given a training pattern, the squared error for a single output neuron is obtained by squaring the difference between that neuron's target output and its actual output. The sum-squared error, or pattern sum-squared error (PSS), is obtained by adding up the squared errors for each output neuron. The total sum-squared error is obtained by adding up the PSS for each training pattern.
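A minimal sketch of these quantities (the target and output values below are illustrative):

```python
def pattern_sum_squared_error(targets, outputs):
    """Sum, over the output neurons, of the squared difference between
    target output and actual output for one training pattern."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs))

def total_sum_squared_error(target_patterns, output_patterns):
    """Sum of the pattern sum-squared errors over all training patterns."""
    return sum(pattern_sum_squared_error(t, o)
               for t, o in zip(target_patterns, output_patterns))

targets = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[0.9, 0.2], [0.1, 0.7]]
print(round(total_sum_squared_error(targets, outputs), 2))  # → 0.15
```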
The other weights, the ones that are subject to change when the net is learning, are referred to as trainable weights.
In toy problems like the XOR problem, only a few training patterns (in the case of XOR, just 4) may be used.
Patterns used for testing the trained network are referred to as test patterns.
Compare instance.
This leaves the question of how to choose the best attribute to split on at any branch node. This issue is handled in the article on splitting criterion in ID3.
Without windowing, such an algorithm can be really slow, as it needs to do its information gain calculations (see tree induction algorithms) over huge amounts of data.
With windowing, training is done on a relatively small sample of the data, and then checked against the full set of training data.
Here is the windowing algorithm:
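In outline: induce a tree from a small random window, test it against all the training data, add the misclassified instances to the window, and repeat. A hedged Python sketch of that loop follows; the `induce`/`classify` stand-ins (a majority-class "learner" that ignores attributes) are illustrative assumptions, not real tree induction:

```python
import random

def windowing(instances, induce, classify, window_size):
    """Sketch of ID3-style windowing: train on a small random window,
    check against ALL the training instances, add the misclassified
    ones to the window, and repeat until none remain to add.
    Each instance ends with its class label."""
    window = random.sample(instances, window_size)
    while True:
        tree = induce(window)
        missed = [x for x in instances
                  if classify(tree, x) != x[-1] and x not in window]
        if not missed:
            return tree
        window.extend(missed)

# Stand-in "learner" (an illustrative assumption, not real induction):
# predict the majority class seen in the window, ignoring attributes.
def induce(window):
    labels = [x[-1] for x in window]
    return max(set(labels), key=labels.count)

def classify(tree, x):
    return tree

# Instances are (id, attribute, class); 8 of class 1, 2 of class 0.
data = [(i, "a", 1) for i in range(8)] + [(8, "b", 0), (9, "b", 0)]
print(windowing(data, induce, classify, window_size=3))  # → 1
```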