This section continues with the Tech Support domain described in Section 2.1.1. For convenience, Table 4.1 replicates the data of Table 2.1, but using our terminology for channels and streams. To recap, each stream represents the volume levels of a phone conversation to the Tech Support line.
The channel in the Tech Support dataset is discrete and, in fact,
binary: the value at each time frame is either H
(high volume) or L (low volume).
How could a classifier be built for the Tech Support domain? One
expert advises that ``runs'' of high-volume conversation (continuous
periods where the conversation stays at a high volume level) are
important for classification purposes. A run of loud volume could be
represented as a tuple (start time, duration), consisting of the time
at which the run begins and the number of timesteps it lasts.
This is our first metafeature, called LoudRun.
Each instance can now be characterised as having a set of
LoudRun events. These can be extracted simply by looking for
sequences of high-volume conversation. For example, one training
instance has one run of highs starting at time 3 and lasting for 1
timestep, a high run starting at time 6 and lasting for 1 timestep,
and a high run starting at time 9 and lasting for 4 timesteps. Hence
the set of LoudRuns produced from this training instance is
{(3,1), (6,1), (9,4)}. These
tuples are examples of instantiated features.
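The extraction step can be sketched in a few lines of Python. This is only an illustrative sketch, not the TClass implementation: the function name `loud_runs`, the string encoding of a stream, and the example stream itself are assumptions, and time is assumed to start at 0. The example stream was chosen so that it reproduces the three runs described above.

```python
from itertools import groupby


def loud_runs(stream, high="H"):
    """Return a (start, duration) tuple for each maximal run of `high`.

    `stream` is assumed to be a string of 'H'/'L' symbols, one per
    timestep, with the first timestep at time 0 (an assumption).
    """
    runs = []
    t = 0  # start time of the current group
    for value, group in groupby(stream):
        length = len(list(group))
        if value == high:
            runs.append((t, length))
        t += length
    return runs


# Hypothetical stream with highs at time 3, time 6 and times 9-12.
example_stream = "LLLHLLHLLHHHH"
print(loud_runs(example_stream))  # [(3, 1), (6, 1), (9, 4)]
```

Each tuple returned is one instantiated LoudRun feature, i.e. one point in the (start time, duration) space.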
These instantiated features can be plotted in the two-dimensional space shown in Figure 4.1. This is the parameter space: one axis is the start time and the other is the duration, and each LoudRun is a point within it. Figure 4.2 is the same as Figure 4.1, but each instantiated feature is marked with the class of the instance it came from.
Once the points are in parameter space, ``typical examples'' of LoudRuns can be selected. In this case, the points labelled A, B and C are selected as typical examples, as shown in Figure 4.3. These typical examples are synthetic events. A synthetic event may or may not coincide with an observed event: point A corresponds to a real event (the instantiated event (3,3) was actually observed in the data), whereas B and C do not.
These synthetic events can be used to segment the parameter space into regions in the following way: for each point in the parameter space, the nearest synthetic event is found. The set of points associated with each synthetic event forms a region. The boundaries of each region can be calculated without too much trouble; they are shown as dotted lines in Figure 4.3.
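The nearest-synthetic-event rule can be illustrated with a short Python sketch. Several details here are assumptions rather than anything stated in the text: Euclidean distance is used as the notion of ``nearest'', the name `nearest_region` is made up for illustration, and the coordinates of B are a hypothetical placeholder. Only A = (3, 3) and C at roughly (10, 3.33) come from the surrounding discussion.

```python
import math

# Synthetic events in (start time, duration) space.
# A and C follow the text; B is a HYPOTHETICAL placeholder.
SYNTHETIC_EVENTS = {
    "A": (3.0, 3.0),
    "B": (4.0, 1.0),   # assumed for illustration only
    "C": (10.0, 3.33),
}


def nearest_region(event, synthetic=SYNTHETIC_EVENTS):
    """Return the label of the synthetic event closest to `event`.

    Euclidean distance is an assumption; the text only says "nearest".
    """
    return min(synthetic, key=lambda name: math.dist(event, synthetic[name]))


print(nearest_region((9, 4)))  # C
print(nearest_region((3, 3)))  # A
```

Assigning every point to its nearest synthetic event in this way partitions the plane into one cell per synthetic event, which is what the dotted boundaries depict.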
The next step is to make use of these regions. Questions like ``does a given stream have an instantiated feature that belongs in the area around A?'' can be asked. If the question is repeated for B and C, the result is Table 4.3. To construct this table, Table 4.2 is examined: for each stream and each region, if the stream has an instantiated feature that lies within that region, the synthetic feature corresponding to that region's synthetic event is marked as ``yes''; otherwise it is marked as ``no''.
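A minimal sketch of how one row of such a table could be computed is given below. As before, this is illustrative only: Euclidean distance, the function name `synthetic_features`, and the coordinates of B are assumptions, and the input is a hypothetical instance whose LoudRuns are (3,1), (6,1) and (9,4). Because B is a placeholder, the resulting row should not be read as an entry of Table 4.3.

```python
import math

# A and C follow the text; B is a HYPOTHETICAL placeholder.
SYNTHETIC_EVENTS = {
    "A": (3.0, 3.0),
    "B": (4.0, 1.0),   # assumed for illustration only
    "C": (10.0, 3.33),
}


def synthetic_features(instantiated, synthetic=SYNTHETIC_EVENTS):
    """Map a set of instantiated features to one yes/no flag per region.

    A region's flag is True if at least one instantiated feature has
    that region's synthetic event as its nearest (Euclidean) neighbour.
    """
    row = {name: False for name in synthetic}
    for event in instantiated:
        nearest = min(synthetic, key=lambda n: math.dist(event, synthetic[n]))
        row[nearest] = True
    return row


row = synthetic_features([(3, 1), (6, 1), (9, 4)])
print(row)  # {'A': False, 'B': True, 'C': True}
```

Each stream, however many LoudRuns it contains, is thereby reduced to a fixed-length vector of boolean attributes, which is the form a propositional learner expects.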
Table 4.3 is now in a learner-friendly format. In fact, if it is fed to C4.5, the simple tree in Figure 4.4 results.
This tree says that if the training instance has an instantiated feature that lies within region C (i.e. a run of high values that starts around time t=10 and lasts for approximately 3.33 timesteps), then its class is Angry. Conversely, if it has no such feature, then its classification is Happy. In other words, as long as there is not a long high-volume run towards the end of the conversation, the customer is likely to be happy.
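Written as code, the tree described above is a single test on the synthetic feature for region C. The sketch below assumes, purely for illustration, that a stream's synthetic features are held in a dictionary keyed by region label; the function name `classify` is likewise hypothetical.

```python
def classify(synthetic_row):
    """Apply the one-split tree: a feature in region C means Angry."""
    return "Angry" if synthetic_row["C"] else "Happy"


print(classify({"A": False, "B": True, "C": True}))   # Angry
print(classify({"A": True, "B": True, "C": False}))   # Happy
```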
This is the core of TClass and also shows the application of metafeatures to temporal classification tasks. Now, let's take a more in-depth look at metafeatures.