Next: DETAILED DESCRIPTION Up: paper Previous: INTRODUCTION

OVERVIEW

A summary of how metafeatures are applied is:

To explain the application of metafeatures, we present a simple pedagogical domain. Suppose there is a mythical company called SoftCorp that develops and provides technical support for software. Tech Support calls are recorded for later analysis. SoftCorp wants to find the critical difference between happy and angry customers.

An engineer suggests that the volume level of the conversation indicates the customer's frustration level. Each call is therefore divided into 30-second segments, and the average volume in each segment is calculated. A high-volume segment is marked ``H'', while a segment at a reasonable volume is marked ``L''. For a small subset of the data (six calls), SoftCorp determines by some independent means whether each call resulted in a happy or an angry customer. These calls are shown in Table 1.


Table 1: The training set for the Tech Support domain.
Call | Loudness (over time)          | Class
     |                     1 1 1 1 1 |
     | 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 |
1    | L L L H H H L L L L L L       | Happy
2    | L L L H L L H L L H H H H     | Angry
3    | L L H L L H L L L L L L H H H | Angry
4    | L L L L H H H H L L L L L     | Happy
5    | L L L H H H L L L L           | Happy
6    | L L H H L L H L L H H H       | Angry


One expert advises that ``runs'' of high volume conversation - continuous periods where the conversation runs at a high volume level - are important for classification purposes. A run of loud volume can be represented as a tuple $(t,d)$ consisting of the time $t$ at which the run starts and the duration $d$ of the run, both measured in timesteps.

This is our first metafeature, called LoudRun.

Each instance can now be characterised as having a set of LoudRun events - the LoudRun events are the substructures appropriate for this domain. These can be extracted simply by looking for sequences of high-volume conversation. For example, $s_2$ has a high run starting at time 3 and lasting for 1 timestep, a high run starting at time 6 and lasting for 1 timestep, and a high run starting at time 9 and lasting for 4 timesteps. Hence the set of LoudRuns produced from the training instance $s_2$ is $\{(3,1), (6,1), (9,4)\}$. These tuples are examples of instantiated features.


Table 2: Instantiated LoudRun features for the Tech Support domain.
Stream | Instantiated features
$s_1$  | $\{(3,3)\}$
$s_2$  | $\{(3,1), (6,1), (9,4)\}$
$s_3$  | $\{(2,1), (5,1), (12,3)\}$
$s_4$  | $\{(4,4)\}$
$s_5$  | $\{(3,3)\}$
$s_6$  | $\{(2,2), (6,1), (9,3)\}$
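
The extraction step described above can be sketched in Python. This is a minimal illustration, not the author's implementation: it assumes each call in Table 1 is transcribed as a string with one character per 30-second segment, and the names `CALLS` and `loud_runs` are hypothetical.

```python
from itertools import groupby

# Table 1 transcribed as strings, one character per 30-second segment.
CALLS = {
    1: "LLLHHHLLLLLL",
    2: "LLLHLLHLLHHHH",
    3: "LLHLLHLLLLLLHHH",
    4: "LLLLHHHHLLLLL",
    5: "LLLHHHLLLL",
    6: "LLHHLLHLLHHH",
}


def loud_runs(segments):
    """Return the set of (start, duration) tuples, one per maximal run of 'H'."""
    runs = set()
    t = 0  # index of the first segment in the current group
    for level, group in groupby(segments):
        length = len(list(group))
        if level == "H":
            runs.add((t, length))
        t += length
    return runs


instantiated = {call: loud_runs(seq) for call, seq in CALLS.items()}
print(sorted(instantiated[2]))  # [(3, 1), (6, 1), (9, 4)]
```

Because the runs are collected in a set, two identical runs within one call would collapse into a single tuple; that cannot happen here, since runs in the same call start at different times.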


These instantiated features can be plotted in the two-dimensional space shown in Figure 1. This is the parameter space: one axis represents the start time and the other the duration.

Figure 1: Parameter space for the LoudRun metafeature in the Tech Support domain, but this time showing class information. Note that the point (3,3) is a ``double-up''. [Image: blues-class.eps]
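
As a rough sketch of how the points in the parameter space might be pooled, the Python fragment below tags every instantiated feature from Table 2 with the class of its call from Table 1 and counts repeated points. The names `FEATURES`, `CLASSES` and `points` are illustrative only and do not come from the paper.

```python
from collections import Counter

# Instantiated LoudRun features (Table 2) and class labels (Table 1).
FEATURES = {
    1: [(3, 3)],
    2: [(3, 1), (6, 1), (9, 4)],
    3: [(2, 1), (5, 1), (12, 3)],
    4: [(4, 4)],
    5: [(3, 3)],
    6: [(2, 2), (6, 1), (9, 3)],
}
CLASSES = {1: "Happy", 2: "Angry", 3: "Angry", 4: "Happy", 5: "Happy", 6: "Angry"}

# Each point in parameter space is (start time, duration), tagged with its class.
points = [(t, d, CLASSES[s]) for s, runs in FEATURES.items() for (t, d) in runs]

# A point that occurs more than once is a "double-up".
counts = Counter(points)
double_ups = [p for p, n in counts.items() if n > 1]

print(len(points))  # 12
print(double_ups)   # [(3, 3, 'Happy'), (6, 1, 'Angry')]
```

Besides the (3,3) double-up mentioned in the caption, the same count shows that (6,1) is produced by two different calls.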

Once the points are in parameter space, ``typical examples'' of LoudRuns can be selected. In this case, the points labelled A, B and C are selected, as shown in Figure 2. These are termed synthetic features. A synthetic feature may or may not coincide with an observed event: point A corresponds to a real event (the instantiated feature (3,3) was observed in the data), whereas B and C do not.

Figure 2: Three synthetic features and the regions around them for LoudRuns in the Tech Support domain. [Image: blues-centkmeans.eps]

These synthetic features can be used to segment the parameter space into regions by computing the Voronoi tiling: for each point in the parameter space, the nearest synthetic feature is found. The set of points associated with each synthetic feature forms a region, and the boundaries of each region can be calculated. These boundaries are shown as dotted lines in Figure 2.
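
The nearest-feature lookup that defines the Voronoi regions can be sketched as follows. Several things here are assumptions rather than details from the paper: Euclidean distance is used as the notion of ``nearest''; A and C take the coordinates quoted in the text ((3,3) and roughly (10, 3.33)); and the coordinates of B, which the text does not give, are a hypothetical stand-in chosen for illustration. The names `SYNTHETIC` and `region_of` are likewise illustrative.

```python
import math

# Synthetic features as (start time, duration).
SYNTHETIC = {
    "A": (3.0, 3.0),
    "B": (4.0, 1.0),    # hypothetical: B's coordinates are not stated in the text
    "C": (10.0, 3.33),  # "around t=10", "approximately 3.33 timesteps"
}


def region_of(point, synthetic=SYNTHETIC):
    """Return the label of the synthetic feature nearest to point (Euclidean)."""
    return min(synthetic, key=lambda name: math.dist(point, synthetic[name]))


print(region_of((9, 4)))  # C
print(region_of((3, 1)))  # B
print(region_of((4, 4)))  # A
```

Every point has exactly one nearest synthetic feature (ties aside), so the regions partition the parameter space without the boundaries having to be computed explicitly.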

The next step is to make use of these regions. Questions like ``does this training instance have an instantiated feature in A's region?'' can be asked. Repeating the question for B and C gives Table 3. To construct this table, Table 2 is examined: for each training instance and each region, if an instantiated feature lies within the region, the synthetic attribute corresponding to that region is marked ``yes'', and ``no'' otherwise. The data is now in a learner-friendly format. In fact, if it is fed to C4.5, the simple tree in Figure 3 results.


Table 3: Attribution of synthetic features for the Tech Support domain.
Stream | Class | Synthetic attribute
       |       | A   | B   | C
1      | Happy | Yes | No  | No
2      | Angry | No  | Yes | Yes
3      | Angry | No  | Yes | Yes
4      | Happy | Yes | No  | No
5      | Happy | Yes | No  | No
6      | Angry | Yes | Yes | Yes
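
The attribution step can be sketched in Python as below. As before, this is an illustration under stated assumptions: squared Euclidean distance decides the nearest synthetic feature, A and C use the coordinates quoted in the text, and B's coordinates are hypothetical because the text does not state them. The function names are not from the paper.

```python
# Instantiated LoudRun features per stream (Table 2).
FEATURES = {
    1: [(3, 3)],
    2: [(3, 1), (6, 1), (9, 4)],
    3: [(2, 1), (5, 1), (12, 3)],
    4: [(4, 4)],
    5: [(3, 3)],
    6: [(2, 2), (6, 1), (9, 3)],
}

# A and C as quoted in the text; B is a hypothetical stand-in.
SYNTHETIC = {"A": (3.0, 3.0), "B": (4.0, 1.0), "C": (10.0, 3.33)}


def nearest(point):
    """Label of the synthetic feature closest to point (squared Euclidean)."""
    t, d = point
    return min(
        SYNTHETIC,
        key=lambda n: (t - SYNTHETIC[n][0]) ** 2 + (d - SYNTHETIC[n][1]) ** 2,
    )


def synthetic_attributes(runs):
    """Map each region label to True if any instantiated feature falls in it."""
    hit = {nearest(p) for p in runs}
    return {name: name in hit for name in SYNTHETIC}


for stream, runs in FEATURES.items():
    row = synthetic_attributes(runs)
    print(stream, ["Yes" if row[name] else "No" for name in "ABC"])
```

For stream 2, for instance, (3,1) and (6,1) both fall in B's region and (9,4) falls in C's region, giving No, Yes, Yes.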


Figure 3: Rule for telling happy and angry customers apart.
rgnC = yes: Ang...
... Happy (3.0)

This tree says that if the training instance has an instantiated feature that lies within region C (i.e. a run of high values that starts around time t=10 and lasts for approximately 3.33 timesteps), then its class is Angry. In other words, as long as there is not a long high-volume run towards the end of the conversation, the customer is likely to be happy.
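
The rule can be checked against the training set with a few lines of Python. This sketch hard-codes the rule as described in the paragraph above rather than learning it with C4.5, and it uses only the region-C attribute and class of each stream from Table 3; the names `TABLE3_REGION_C` and `classify` are illustrative.

```python
# (stream, class, has an instantiated feature in region C), from Table 3.
TABLE3_REGION_C = [
    (1, "Happy", False),
    (2, "Angry", True),
    (3, "Angry", True),
    (4, "Happy", False),
    (5, "Happy", False),
    (6, "Angry", True),
]


def classify(in_region_c):
    """Rule from the text: a LoudRun in region C means Angry, otherwise Happy."""
    return "Angry" if in_region_c else "Happy"


correct = sum(classify(c) == label for _, label, c in TABLE3_REGION_C)
print(f"{correct}/{len(TABLE3_REGION_C)} training calls classified correctly")
# 6/6 training calls classified correctly
```

Agreement on six training calls says nothing about accuracy on unseen calls; the check only confirms that the rule is consistent with the data it was derived from.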


Mohammed Waleed Kadous 2002-02-12