## Induction of Decision Trees

Reference: Bratko sections 18.5, 18.6

Image above only viewable during lecture

 Aim: To describe an algorithm whose input is a collection of instances and their correct classification and whose output is a decision tree that can be used to classify each instance. Keywords: attributes, backed-up error estimate, C4.5, C5, concept learning system (CLS), decision trees, entropy, expected error estimate, feature, ID3, instances, Laplace error estimate, pruning decision trees, splitting criterion in ID3, tree induction algorithm, windowing in ID3 Plan: What is a decision tree? Building a decision tree Which attribute should we split on? Information theory and the splitting criterion ID3 example & ID3 symbolic version ID3 enhancements: windowing Dealing with noisy data - expected error pruning

### Decision Trees

• A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a classification or decision.

• For example, a decision tree might help a bank decide whether a person should be offered a loan:

• We wish to be able to induce a decision tree from a set of data about instances together with the decisions or classifications for those instances.

### Example Instance Data

```size:	small medium large
colour:	red blue green
shape:	brick wedge sphere pillar

%% yes

medium	blue	brick
small	red	sphere
large	green	pillar
large	green	sphere

%% no

small	red	wedge
large	red	wedge
large	red	pillar
```

• In this example, there are 7 instances, described in terms of three features or attributes (size, colour, and shape), and the instances are classified into two classes %% yes and %% no.

• We shall now describe an algorithm for inducing a decision tree from such a collection of classified instances.

• Originally termed CLS (Concept Learning System) it has been successively enhanced.

• At the highest level of enhancement that we shall describe in these notes, the system is known as ID3 - later versions include C4, C4.5 and See5/C5.0 (latest version release seems to be 2.07, 2008).

### Originator of the ID3 Algorithm

 ID3 and its successors have been developed by Ross Quinlan, who discovered it while working with Earl Hunt in the 1970s. He subsequently worked at Sydney Uni, Rand Corporation in California, UTS, back to Sydney Uni, and several years at UNSW. He now runs his own company, Rulequest (www.rulequest.com). You can find out more about ID3 and its later development in C4.5 in his book: Quinlan, J. R. C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993, or via links at http://www.rulequest.com/Personal/ Photo of Ross Quinlan from his personal webpage at www.rulequest.com

### Tree Induction Algorithm

• The algorithm operates over a set of training instances, C.
• If all instances in C are in class P, create a node P and stop, otherwise select a feature or attribute F and create a decision node.
• Partition the training instances in C into subsets according to the values of V.
• Apply the algorithm recursively to each of the subsets C.

### Output of Tree Induction Algorithm

This can easily be expressed as a nested if-statement
```if (shape == wedge)
return no;
if (shape == brick)
return yes;
if (shape == pillar)
{
if (colour == red)
return no;
if (colour == green)
return yes;
}
if (shape == sphere)
return yes;
```

### Choosing Attributes and ID3

• The order in which attributes are chosen determines how complicated the tree is.

• ID3 uses information theory to determine the most informative attribute.

• A measure of the information content of a message is the inverse of the probability of receiving the message:

information1(M) = 1/probability(M)

• Taking logs (base 2) makes information correspond to the number of bits required to encode a message:

information(M) = –log2(probability(M))

### Information

• Information content of a message measures the degree of surprise on receiving the message.

• Messages with a high probability are less informative than low probability messages.

• Learning aims to predict accurately, i.e. reduce surprise.

• Probabilities are multiplied to get the probability of two or more things both/all happening. Taking logarithms of the probabilities allows information to be added instead of multiplied.

### Entropy

• Different messages have different probabilities of arrival.

• Overall level of uncertainty (termed entropy) is:

–Σi Pi log2Pi

• Frequency can be used as a probability estimate.

• E.g. if a node has 5 positive examples and 3 negative examples, the estimated probability of positive is 5/8 = 0.625.

### Information and Learning

• We can think of learning as building many-to-one mappings between input and output.

• Learning tries to reduce the information content of the inputs by mapping them to fewer outputs.

• Hence we try to minimise entropy.

• The simplest mapping is to map everything to one output.

• We seek a trade-off between accuracy and simplicity.

### Splitting Criterion

• Work out entropy based on distribution of classes.

• Trying splitting on each attribute.

• Work out expected information gain for each attribute.

• Choose best attribute.

### Example

• Initial decision tree is one node with all examples.

• There are 4 positive examples and 3 negative examples

• i.e. probability of positive is 4/7 = 0.57; probability of negative is 3/7 = 0.43

• Entropy is Eoverall = – (0.5715 × log20.5715) – (0.4285 × log20.4285) = 0.985

• Evaluate possible ways of splitting.

### Example Part 2

• Try split on size which has three values: large, medium and small.

• There are four instances with size = large.

• There are two large positives examples and two large negative examples.

• The probability of positive is 2/4 and so is the probabilty of negative

• The entropy is Elarge = – (2/4 × log2(2/4)) – (2/4 × log2(2/4)) = 1

### Example Part 3

• There is one small positive and one small negative
Entropy is Esmall = – (1/2 × log2(1/2)) – (1/2 × log2(1/2)) = 1

• There is only one medium positive and no medium negatives, so entropy is 0:
Entropy is Emedium = – (1/1 × log2(1/1)) – (0/1 × log2(0/1)) = 0

• Expected information for a split on size Esize is:

Elarge×plarge + Esmall×psmall + Emedium×pmedium
= 1×(4/7) + 1×(2/7) + 0×(1/7) = 0.857

### Example Part 4

• The expected information gain is: Eoverall – Esize = 0.985 – 0.857 = 0.128

• Now try splitting on colour and shape.

• Colour has an information gain of about 0.52

• Shape has an information gain of about 0.7

• Therefore split on shape.

• Repeat for all subtrees.

### Summary of Splitting Criterion

Some people learn best from an example; others like to see the most general formulation of an algorithm. If you are an "examples" person, don't let the following subscript-studded presentation panic you.

Assume there are k classes C1, ... Ck (k = 2 in our example).

to decide which attribute to split on:

• for each attribute that has not already been used
• Calculate the information gain that results from splitting on that attribute
• Split on the attribute that gives the greatest information gain.

### Summary of Splitting Criterion 2

to calculate the information gain from splitting N instances on attribute A:

• Calculate the entropy E of the current set of instances.
• for each value aj of the attribute A(j = 1, ..., r)
• Suppose that there are
Jj,1 instances in class C1,
...,
Jj,k instances in class Ck,
for a total of Jjinstances with A = aj.
• Let qj,1 = Jj,1/Jj, ..., qj,k = Jj,k/Jj;
• The entropy Ej associated with A = aj is
qj,1 log2(qj,1) ... –qj,k log2(qj,k)
• Now compute E – (J1/N)E1 ... –(Jr/N)Er - this is the information gain associated with a split on attribute A.

### Summary of Splitting Criterion 3

to calculate the entropy E of the current set of instances
• Suppose that of the N instances in this node, I1 are in class C1, ..., Ik are in class Ck,
• Let p1 = I1/N, ..., pk = Ik/N,
• Then the initial entropy E is –p1 log2(p1) –p2 log2(p2) ... –pk log2(pk).

### Windowing

• ID3 can deal with very large data sets by performing induction on subsets or windows onto the data.

1. Select a random subset of the whole set of training instances.

2. Use the induction algorithm to form a rule to explain the current window.

3. Scan through all of the training instances looking for exceptions to the rule.

4. Add the exceptions to the window

• Repeat steps 2 to 4 until there are no exceptions left.

### Noisy Data

• Frequently, training data contains "noise" - i.e. examples which are misclassified, or where one or more of the attribute values is wrong.

• In such cases, one is like to end up with a part of the decision tree which considers say 100 examples, of which 99 are in class C1 and the other is apparently in class C2 (because it is misclassified).

• If there are any unused attributes, we might be able to use them to elaborate the tree to take care of this one case, but the subtree we would be building would in fact be wrong, and would likely misclassify real data.

• Thus, particularly if we know there is noise in the training data, it may be wise to "prune" the decision tree to remove nodes which, statistically speaking, seem likely to arise from noise in the training data.

• A question to consider: How fiercely should we prune?

### Expected Error Pruning

• Approximate expected error assuming that we prune at a particular node.

• Approximate backed-up error from children assuming we did not prune.

• If expected error is less than backed-up error, prune.

(Static) Expected Error

• If we prune a node, it becomes a leaf labelled C.

• What will be the expected classification error at this leaf?

 E(S) = N – n + k – 1 N + k

(This is called the Laplace error estimate - it is based on the assumption that the distribution of probabilities that examples will belong to different classes is uniform.)

 S is the set of examples in a node k is the number of classes N examples in S n out of N examples in S belong to the majority class in S

E.g., if k = 2, and the examples in S come out as [p, q], with p positive and q negative, p + q = N, then n = max(p, q).

Backed-up Error

• For a non-leaf node, let children of Node be Node1, Node2, etc

BackedUpError(Node) = Σi Pi×Error(Nodei)

• Probabilities can be estimated by relative frequencies of attribute values in sets of examples that fall into child nodes.

Error(Node) = min(E(Node), BackedUpError(Node))

### Error Calculation for Pruning Example

• Left child of b has class frequencies [3, 2]

 E = N – n + k – 1 = 5 – 3 + 2 – 1 = 0.429 N + k 5 + 2

• Right child has error of 0.333, calculated the same way

• Static error estimate E(b) is 0.375, again calculated using the Laplace error estimate formula, with N=6, n=4, and k=2.

• Backed-up error is:

BackedUpError(b) = (5/6)×0.429 + (1/6)×0.333 = 0.413

(5/6 and 1/6 because there are 4+2=6 examples handled by node b, of which 3+2=5 go to the left subtree and 1 to the right subtree.

• Since backed-up estimate of 0.413 is greater than static estimate of 0.375, we prune the tree and use the static error of 0.375

 Summary: Induction of Decision Trees The ID3 family of decision tree induction algorithms use information theory to decide which attribute shared by a collection of instances to split the data on next. Attributes are chosen repeatedly in this way until a complete decision tree that classifies every input is obtained. If the data is noisy, some of the original instances may be misclassified. It may be possible to prune the decision tree in order to reduce classification errors in the presence of noisy data. The speed of this learning algorithm is reasonably high, as is the speed of the resulting decision tree classification system. Generalisation ability can be reasonable too.

Copyright © Bill Wilson, 2012, except to the extent that other sources are acknowledged. Based on an earlier version by Claude Sammut, and material by Ivan Bratko.

Bill Wilson's contact info

UNSW's CRICOS Provider No. is 00098G