Next: Inspiration for metafeatures Up: Metafeatures: A Novel Feature Previous: TClass Overview   Contents


Tech Support revisited

This section continues with the Tech Support domain described in Section 2.1.1. For convenience, Table 4.1 replicates the data of Table 2.1, but using our terminology for channels and streams. To recap, each stream represents the volume levels of a phone conversation to the Tech Support line.


Table 4.1: The training set for the Tech Support domain.
Stream   $ v(0)...v(t_{max})$                        Class
         0                   1                   2
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0
$ s_1$   L L L H H H L L L L L L                     Happy
$ s_2$   L L L H L L H L L H H H H                   Angry
$ s_3$   L L H L L H L L L L L L H H H               Angry
$ s_4$   L L L L H H H H L L L L L                   Happy
$ s_5$   L L L H H H L L L L                         Happy
$ s_6$   L L H H L L H L L H H H                     Angry


The channel $ v$ in the Tech Support dataset is discrete, and in fact, binary - i.e., the value at each time frame is either H (high volume) or L (low volume).

How could a classifier be built for the Tech Support domain? One expert advises that ``runs'' of high volume conversation - continuous periods where the conversation runs at a high volume level - are important for classification purposes. A run of loud volume could be represented as a tuple $ (t,d)$ consisting of the time $ t$ at which the run starts and its duration $ d$ in timesteps.

This is our first metafeature, called LoudRun.

Each instance can now be characterised as having a set of LoudRun events. These can be extracted simply by looking for sequences of high-volume conversation. For example, $ s_2$ has a run of highs starting at time 3 and lasting for 1 timestep, a run starting at time 6 and lasting for 1 timestep, and a run starting at time 9 and lasting for 4 timesteps. Hence the set of LoudRuns produced from the training instance $ s_2$ is $ \{(3,1), (6,1),(9,4)\}$. These tuples are examples of instantiated features.


Table 4.2: Instantiated LoudRun features for the Tech Support domain.
Stream Instantiated features
$ s_1$ $ \{(3,3)\}$
$ s_2$ $ \{(3,1), (6,1),(9,4)\}$
$ s_3$ $ \{(2,1), (5,1), (12,3)\}$
$ s_4$ $ \{(4,4)\}$
$ s_5$ $ \{(3,3)\}$
$ s_6$ $ \{(2,2),(6,1), (9,3)\}$
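The extraction step can be sketched in a few lines of Python. This is only an illustration, not the TClass implementation: the names `STREAMS` and `loud_runs` are hypothetical, and the streams are transcribed by hand from Table 4.1 as strings with one character per time frame.

```python
from itertools import groupby

# Training streams transcribed from Table 4.1 (one character per time frame).
STREAMS = {
    "s1": "LLLHHHLLLLLL",
    "s2": "LLLHLLHLLHHHH",
    "s3": "LLHLLHLLLLLLHHH",
    "s4": "LLLLHHHHLLLLL",
    "s5": "LLLHHHLLLL",
    "s6": "LLHHLLHLLHHH",
}


def loud_runs(stream):
    """Return a (start, duration) tuple for every maximal run of 'H'."""
    runs = []
    t = 0  # index of the first frame of the current group
    for value, group in groupby(stream):
        length = len(list(group))
        if value == "H":
            runs.append((t, length))
        t += length
    return runs


for name, stream in STREAMS.items():
    print(name, loud_runs(stream))
```

Applied to the six training streams, this yields the sets of instantiated features listed in Table 4.2.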


These instantiated features can be plotted in the two-dimensional space shown in Figure 4.1. This is the parameter space: one axis is the start time and the other is the duration, and each LoudRun is a point within it. Figure 4.2 is the same as Figure 4.1, but each instantiated feature is marked with its original class.

Figure 4.1: Parameter space for the LoudRun metafeature applied to the Tech Support domain.

Figure 4.2: Parameter space for the LoudRun metafeature in the Tech Support domain, but this time showing class information. Note that the point (3,3) is a ``double-up''.

Once the points are in parameter space, ``typical examples'' of LoudRuns can be selected. In this case, the points labelled A, B and C are selected as typical examples, as shown in Figure 4.3. These typical examples are synthetic events. They may or may not be the same as an observed event - so for example, point A actually corresponds to a real event (the instantiated event (3,3) actually was observed in the data), whereas B and C do not.

Figure 4.3: Three synthetic events and the regions around them for LoudRuns in the Tech Support domain.

These synthetic events can be used to segment the parameter space into different regions in the following way: for each point in the parameter space, the nearest synthetic event is found. The set of points associated with each synthetic event form a region. Without too much trouble, the boundaries of each region can be calculated. These are shown as dotted lines in Figure 4.3.
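As a rough sketch of this nearest-event assignment, the following Python fragment maps a point in parameter space to the label of its closest synthetic event. Several things here are assumptions rather than details given in the text: Euclidean distance is used as the notion of ``nearest'', the coordinates of B are a hypothetical choice, and the names `SYNTHETIC_EVENTS` and `nearest_region` are made up for illustration. Only A at (3,3) and C at roughly (10, 3.33) follow the description in this section.

```python
import math

# Synthetic events as (start, duration) points in parameter space.
# A and C follow the text; B's coordinates are a HYPOTHETICAL choice.
SYNTHETIC_EVENTS = {
    "A": (3.0, 3.0),
    "B": (4.0, 1.0),          # assumption, not taken from the source
    "C": (10.0, 10.0 / 3.0),
}


def nearest_region(point, events=SYNTHETIC_EVENTS):
    """Return the label of the synthetic event closest to `point`.

    Euclidean distance is assumed; the text only says "nearest".
    """
    return min(events, key=lambda label: math.dist(point, events[label]))


print(nearest_region((9, 4)))   # a late, long run
print(nearest_region((3, 3)))   # coincides with synthetic event A
```

The region boundaries drawn as dotted lines are exactly the places where this function changes its answer.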

The next step is to make use of these regions. Questions like: ``does a given stream have an instantiated feature that belongs in the area around A?'' can be asked. If the question is repeated for B and C, the result is Table 4.3. To construct this table, Table 4.2 is examined: for each stream and each region, if the stream has an instantiated feature that lies within that region, the synthetic feature corresponding to that region is marked as a ``yes''.


Table 4.3: Attribution of synthetic features for the Tech Support domain.
Stream   Class   Synthetic Events
                 Region A   Region B   Region C
1        Happy   Yes        Yes        No
2        Angry   No         Yes        Yes
3        Angry   No         Yes        Yes
4        Happy   Yes        Yes        No
5        Happy   Yes        Yes        No
6        Angry   Yes        Yes        Yes
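The attribution step for a single stream might look as follows. Again this is a sketch under stated assumptions: Euclidean distance, a hypothetical location for B, and made-up names (`SYNTHETIC_EVENTS`, `synthetic_features`). Because the exact position of B is not given in the text, only the row for $ s_2$ is worked through here; rows for other streams depend on where the boundaries actually fall.

```python
import math

# A and C follow the text; B's coordinates are a HYPOTHETICAL choice.
SYNTHETIC_EVENTS = {
    "A": (3.0, 3.0),
    "B": (4.0, 1.0),          # assumption, not taken from the source
    "C": (10.0, 10.0 / 3.0),
}


def synthetic_features(runs, events=SYNTHETIC_EVENTS):
    """Map each region label to True if any run of the stream falls in it."""
    hit = {
        min(events, key=lambda label: math.dist(run, events[label]))
        for run in runs
    }
    return {label: label in hit for label in events}


# Instantiated features of s_2, taken from Table 4.2.
row = synthetic_features([(3, 1), (6, 1), (9, 4)])
print(row)  # {'A': False, 'B': True, 'C': True}
```

For $ s_2$ this reproduces the No / Yes / Yes entries of the corresponding row in Table 4.3.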


This is now in a learner-friendly format. In fact, if it is fed to C4.5, the simple tree in Figure 4.4 results.

Figure 4.4: Rule for telling happy and angry customers apart.
    rgnC = yes: Angry (3.0)
    rgnC = no: Happy (3.0)

This tree says that if the training instance has an instantiated feature that lies within region C (i.e. a run of high values that starts around time t=10 and goes for approximately 3.33 timesteps), then its class is Angry. Conversely, if it doesn't have such a synthetic feature, then its classification is Happy. In other words, as long as there is not a long high-volume run towards the end of the conversation, the customer is likely to be happy.
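The rule is simple enough to check by hand against the training set. The sketch below does not run C4.5; it merely hard-codes the learned rule as a hypothetical function `classify` and applies it to the Region C column and class labels transcribed from Table 4.3.

```python
# (stream, has a LoudRun in region C?, class) transcribed from Table 4.3.
TRAINING = [
    ("s1", False, "Happy"),
    ("s2", True, "Angry"),
    ("s3", True, "Angry"),
    ("s4", False, "Happy"),
    ("s5", False, "Happy"),
    ("s6", True, "Angry"),
]


def classify(has_run_in_region_c):
    """Apply the single-split rule: rgnC = yes -> Angry, otherwise Happy."""
    return "Angry" if has_run_in_region_c else "Happy"


# Collect every training stream that the rule gets wrong.
misclassified = [
    name for name, rgn_c, label in TRAINING if classify(rgn_c) != label
]
print(misclassified)  # []
```

The empty list confirms that the rule separates the three Angry streams from the three Happy ones, in line with the (3.0) counts on each leaf.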

This is the core of TClass and also shows the application of metafeatures to temporal classification tasks. Now, let's take a more in-depth look at metafeatures.


Mohammed Waleed Kadous 2002-12-10