Induction of Decision Trees

Reference: Bratko sections 18.5, 18.6

Aim:
To describe an algorithm whose input is a collection of instances together with their correct classifications, and whose output is a decision tree that can be used to classify each instance.
Keywords: attributes, backed-up error estimate, C4.5, C5, concept learning system (CLS), decision trees, entropy, expected error estimate, feature, ID3, instances, Laplace error estimate, pruning decision trees, splitting criterion in ID3, tree induction algorithm, windowing in ID3
Plan:
  • What is a decision tree?
  • Building a decision tree
  • Which attribute should we split on?
  • Information theory and the splitting criterion
  • ID3 example & ID3 symbolic version
  • ID3 in iProlog
  • ID3 enhancements: windowing
  • Dealing with noisy data - expected error pruning


Decision Trees


Example Instance Data

size:	small medium large
colour:	red blue green
shape:	brick wedge sphere pillar

%% yes

medium	blue	brick
small	red	sphere
large	green	pillar
large	green	sphere

%% no

small	red	wedge
large	red	wedge
large	red	pillar

Originator of the ID3 Algorithm

  • ID3 and its successors were developed by Ross Quinlan, who devised ID3 in the 1970s while working with Earl Hunt. He subsequently worked at Sydney Uni, the Rand Corporation in California, UTS, back at Sydney Uni, and for several years at UNSW. He now runs his own company, Rulequest (www.rulequest.com).

  • You can find out more about ID3 and its later development in C4.5 in his book: Quinlan, J. R. C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993, or via links at http://www.rulequest.com/Personal/

Photo of Ross Quinlan, from his personal webpage at www.rulequest.com

Tree Induction Algorithm
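
A minimal sketch, in Python, of the recursive tree-induction scheme that ID3 follows: if all the instances in the current set belong to one class, stop with a leaf; otherwise choose an attribute, partition the instances by its values, and build a subtree for each partition. The data layout (a list of (attribute-value dictionary, class) pairs) and the function names are illustrative only; how to choose the attribute to split on is the subject of the later sections.

from collections import Counter

def induce_tree(examples, attributes, choose_attribute):
    # examples   : list of (attribute -> value dict, class label) pairs
    # attributes : attribute names still available for splitting
    # choose_attribute : picks the attribute to split on (e.g. by information gain)
    classes = [c for _, c in examples]
    if len(set(classes)) == 1:          # all instances in one class: make a leaf
        return classes[0]
    if not attributes:                  # nothing left to split on: majority class
        return Counter(classes).most_common(1)[0][0]
    best = choose_attribute(examples, attributes)
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    for value in {e[best] for e, _ in examples}:
        subset = [(e, c) for e, c in examples if e[best] == value]
        tree[best][value] = induce_tree(subset, remaining, choose_attribute)
    return tree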


Output of Tree Induction Algorithm

This can easily be expressed as a nested if-statement
if (shape == wedge)
        return no;
if (shape == brick)
        return yes;
if (shape == pillar)
{
        if (colour == red)
                return no;
        if (colour == green)
                return yes;
}
if (shape == sphere)
        return yes;

Choosing Attributes and ID3


Information


Entropy


Information and Learning


Splitting Criterion


Example


Example Part 2


Example Part 3


Example Part 4


Summary of Splitting Criterion

Some people learn best from an example; others like to see the most general formulation of an algorithm. If you are an "examples" person, don't let the following subscript-studded presentation panic you.

Assume there are k classes C1, ... Ck (k = 2 in our example).

to decide which attribute to split on:
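
As a sketch (using the same illustrative Python representation as in the earlier tree-induction sketch, and an information_gain function spelled out in the next section): split on whichever remaining attribute gives the largest information gain.

def choose_attribute(examples, attributes):
    # Split on the attribute with the highest information gain
    # (equivalently, the lowest expected entropy after the split).
    return max(attributes, key=lambda a: information_gain(examples, a))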


Summary of Splitting Criterion 2

to calculate the information gain from splitting N instances on attribute A:
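
A sketch of that calculation, under the same assumed representation: the gain for attribute A is the entropy of the whole set of instances minus the weighted average of the entropies of the subsets produced by splitting on A's values. The entropy function itself is sketched in the next section.

def information_gain(examples, attribute):
    # gain(A) = E(all examples)
    #           - sum over values v of A of (N_v / N) * E(examples with A = v)
    n = len(examples)
    remainder = 0.0
    for value in {e[attribute] for e, _ in examples}:
        subset = [(e, c) for e, c in examples if e[attribute] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(examples) - remainder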


Summary of Splitting Criterion 3

to calculate the entropy E of the current set of instances
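
With pi the proportion of the instances that belong to class Ci, the entropy is E = - Σi pi log2 pi. A sketch in the same style:

from collections import Counter
from math import log2

def entropy(examples):
    # E = - sum over classes i of p_i * log2(p_i),
    # where p_i is the proportion of the examples that are in class C_i
    n = len(examples)
    counts = Counter(c for _, c in examples)
    return -sum((m / n) * log2(m / n) for m in counts.values())

For the seven instances in the example data near the start of these notes (4 yes, 3 no), this gives about 0.985 bits.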

Using the iProlog Implementation of ID3

% cat example.data
table object(
                texture(smooth, wavy, rough),
                temperature(cold, cool, warm, hot),
                size(small, medium, large),
                class(yes, no)
)!

object(smooth, cold, large, yes).
object(smooth, cold, small, no).
object(smooth, cool, large, yes).
object(smooth, cool, small, yes).
object(smooth, hot, small, yes).
object(wavy, cold, medium, no).
object(wavy, hot, large, yes).
object(rough, cold, large, no).
object(rough, cool, large, yes).
object(rough, hot, small, no).
object(rough, warm, medium, yes).

Using the iProlog Implementation of ID3 - ctd

% prolog example.data
iProlog ML (21 March 2003)
: id(object)?
id0
: pp id0!

object(Texture, Temperature, Size, Class) :-
   (Temperature = cold ->
       (Texture = smooth ->
           (Size = small -> Class = no
           | Size = large -> Class = yes)
       | Texture = wavy -> Class = no
       | Texture = rough -> Class = no)
   | Temperature = cool -> Class = yes
   | Temperature = warm -> Class = yes
   | Temperature = hot ->
       (Texture = smooth -> Class = yes
       | Texture = wavy -> Class = yes
       | Texture = rough -> Class = no)).

:

Windowing
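
Windowing, introduced with ID3 for large training sets, builds the tree from a subset of the instances (the "window"), tests the tree on the remaining instances, adds any misclassified instances to the window, and repeats until the tree is consistent with all of the training data. A minimal sketch, reusing the induce_tree and choose_attribute functions sketched earlier and assuming a hypothetical classify helper:

def id3_with_windowing(examples, attributes, window_size):
    # classify(tree, instance) is assumed to exist: it walks the tree
    # and returns the predicted class for one instance.
    window = list(examples[:window_size])    # initial window (often a random sample)
    rest = list(examples[window_size:])
    while True:
        tree = induce_tree(window, attributes, choose_attribute)
        exceptions = [(e, c) for e, c in rest if classify(tree, e) != c]
        if not exceptions:
            return tree                      # the tree agrees with all the data
        window.extend(exceptions)            # add the misclassified instances
        rest = [x for x in rest if x not in exceptions]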


Noisy Data


Expected Error Pruning


(Static) Expected Error

E(S) = (N - n + k - 1) / (N + k)

(This is called the Laplace error estimate - it is based on the assumption that the distribution of probabilities that examples will belong to different classes is uniform.)

S	is the set of examples in a node
k	is the number of classes
N	is the number of examples in S
C	is the majority class in S
n	is the number of examples in S that belong to C
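
As a small sketch with made-up numbers (they are illustrative, not taken from the pruning example below):

def laplace_error(N, n, k):
    # E(S) = (N - n + k - 1) / (N + k)
    # N: examples in the node S, n: how many of them are in the majority class,
    # k: number of classes
    return (N - n + k - 1) / (N + k)

# e.g. a node with N = 6 examples, n = 4 in the majority class, k = 2 classes:
# laplace_error(6, 4, 2) = (6 - 4 + 2 - 1) / (6 + 2) = 3/8 = 0.375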


Backed-up Error

BackedUpError(Node) = Σi Pi × Error(Nodei)

where Nodei is the i-th child of Node, and Pi is the proportion of the examples in Node that are passed down to Nodei.

Error(Node) = min(E(Node), BackedUpError(Node))
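
A sketch of how the two estimates drive pruning, reusing the laplace_error idea from the previous section: if a node's static expected error is no larger than the backed-up error from its children, the subtree below it is not reducing the estimated error, so it can be pruned back to a leaf.

def backed_up_error(child_errors, child_weights):
    # weighted sum of the children's error estimates; each weight P_i is the
    # proportion of the node's examples that reach child i
    return sum(p * e for p, e in zip(child_weights, child_errors))

def node_error(static_error, child_errors=None, child_weights=None):
    # Error(Node) = min(E(Node), BackedUpError(Node)); prune the subtree
    # whenever the static estimate is the minimum (the split does not help).
    if not child_errors:                     # leaf: only the static estimate applies
        return static_error
    return min(static_error, backed_up_error(child_errors, child_weights))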


Pruning Example

ID3 tree pruning diagram


Error Calculation for Pruning Example


Summary: Induction of Decision Trees
  • The ID3 family of decision tree induction algorithms uses information theory to decide which of the attributes shared by a collection of instances to split the data on next.
  • Attributes are chosen repeatedly in this way until a complete decision tree that classifies every input is obtained. If the data is noisy, some of the original instances may be misclassified by the resulting tree.
  • It may be possible to prune the decision tree in order to reduce classification errors in the presence of noisy data.
  • The learning algorithm is reasonably fast, and so is classification using the resulting decision tree.
  • The resulting trees can also generalise reasonably well to instances outside the training set.

Copyright © Bill Wilson, 2008, except to the extent that other sources are acknowledged. Based on an earlier version by Claude Sammut, and material by Ivan Bratko.
