
TTest

To fulfil this role, the artificial dataset TTest has been designed. TTest is a learning problem with three classes and three channels. It has a total of five parameters controlling various characteristics.

There are three classes, imaginatively called A, B and C. The three channels, likewise, are imaginatively termed alpha, beta and gamma. For each of the three classes, there is a prototype, as shown in Figures 6.7, 6.8 and 6.9. All the prototypes are exactly a hundred units of time long.

Figure 6.7: Prototype for class A
\begin{figure}\begin{center}
\leavevmode \epsfxsize =5in \epsfbox{perfA.eps}\par\centering\centering\end{center}\end{figure}

Figure 6.8: Prototype for class B
\begin{figure}\begin{center}
\leavevmode \epsfxsize =5in \epsfbox{perfB.eps}\par\centering\centering\end{center}\end{figure}

Figure 6.9: Prototype for class C
\begin{figure}\begin{center}
\leavevmode \epsfxsize =5in \epsfbox{perfC.eps}\par\centering\centering\end{center}\end{figure}

These prototypes can be mathematically defined as follows:

\begin{displaymath}
\begin{array}{lll}
A\alpha(t) & = & \left\{ \begin{array}{ll...
...nd{array}\right. \\  *[1cm]
A\gamma(t) & = & 0 \\
\end{array}\end{displaymath}

\begin{displaymath}
\begin{array}{lll}
B\alpha(t) & = & \left\{ \begin{array}{ll...
...
\end{array}\right.\\  *[0.5cm]
B\gamma(t) & = & 0
\end{array}\end{displaymath}

\begin{displaymath}
\begin{array}{lll}
C\alpha(t) & = & \left\{ \begin{array}{ll...
...0 $} \\
0 & \mbox{otherwise}
\end{array}\right.
\end{array}\end{displaymath}


Now, on their own, these prototypes are not particularly interesting as a learning task. Even a traditional learner such as C4.5, fed each channel/frame combination as a single feature ($ 3 \times 100$ frames, hence 300 features), would work, except that the definitions produced would be really difficult to read. So, we've got to spice it up somehow.
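As a rough illustration of the flattening mentioned above, the following Python sketch turns one instance (three channels of 100 frames each) into 300 named features. The function name `flatten_instance`, the dictionary layout and the feature names such as `alpha_0` are assumptions made for this example, not part of TTest itself.

```python
def flatten_instance(instance, channels=("alpha", "beta", "gamma")):
    """Map each channel/frame pair to one feature (hypothetical naming)."""
    features = {}
    for channel in channels:
        for frame, value in enumerate(instance[channel]):
            # One feature per channel/frame combination, e.g. "beta_42".
            features[f"{channel}_{frame}"] = value
    return features


# A placeholder instance: three channels, 100 frames each, all zeros.
example = {name: [0.0] * 100 for name in ("alpha", "beta", "gamma")}
flat = flatten_instance(example)
```

With three channels of 100 frames, `flat` holds 300 entries, which is the feature count quoted in the text.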

Obviously, the solution lies in randomisation. For the purposes of this thesis, let us define a function $ \ensuremath{\mathit{unif}}()$, which returns a uniformly distributed real number in the range $ [-1,1]$. Also, let us define a function $ \ensuremath{\epsilon}()$, which returns a normally distributed real number with a mean of 0 and a variance of 1.
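A minimal Python sketch of these two functions, using only the standard `random` module, might look as follows. The optional `rng` argument is an addition for reproducibility and is not part of the definitions above.

```python
import random


def unif(rng=random):
    """Uniformly distributed real number in the range [-1, 1]."""
    return rng.uniform(-1.0, 1.0)


def epsilon(rng=random):
    """Normally distributed real number, mean 0 and variance 1."""
    # A variance of 1 corresponds to a standard deviation of 1.
    return rng.gauss(0.0, 1.0)
```

Passing a seeded `random.Random` instance as `rng` gives repeatable draws, which is convenient when regenerating a dataset.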

One form of variation is temporal stretching. In general, temporal stretching is non-linear: the signal need not be uniformly slowed down or sped up; rather, some parts can be sped up while others are slowed down. However, for simplicity, we will assume linear temporal stretching. The amount of (linear) temporal stretching for the TTest dataset is controlled by the parameter $ d$ (short for overall duration). The duration is computed as:

$\displaystyle \ensuremath{\mathit{dur}} = (1 + d * \ensuremath{\mathit{unif}}()) * 100$

That is to say, $ d$ specifies the fractional variation of the duration around the prototype length of 100. For example, if $ d=0.1$, the duration in frames would vary from 90 to 110.
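The duration formula can be sketched in Python as below. The `stretch` helper shows one possible way of applying a linear stretch, by sampling a prototype defined on 0 to 100 at rescaled time points; that helper, and the rounding of the duration to whole frames, are assumptions made for illustration.

```python
import random


def duration(d, rng=random):
    """Duration in frames: (1 + d * unif()) * 100, with unif() in [-1, 1]."""
    return (1.0 + d * rng.uniform(-1.0, 1.0)) * 100.0


def stretch(prototype, dur):
    """Linearly stretch a prototype defined on [0, 100] to `dur` frames.

    `prototype` is any function of time; frame t of the stretched signal
    is read from the prototype at time t * 100 / dur.
    """
    n_frames = int(round(dur))
    return [prototype(t * 100.0 / dur) for t in range(n_frames)]
```

For instance, `stretch(f, duration(0.1))` would give an instance of between 90 and 110 frames for a prototype function `f`.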

To illustrate this, consider a prototype A signal. Figure 6.10 shows a variety of different instances of A, with $ d=0.2$. Similar effects can be observed on the other classes.

Figure 6.10: Effect of adding duration variation to prototypes of class A.
\begin{figure}\begin{center}
\leavevmode \epsfxsize =5in \epsfbox{classawithd.eps}\par\centering\centering\end{center}\end{figure}

Another form of variation is random noise. For this problem, we will assume that such noise is Gaussian. This is not an unreasonable assumption, as Gaussian noise is typical of many types of sensors and measurements that are used in temporal data. For the TTest dataset, the amount of noise is controlled by the parameter $ g$; the noise on all channels is obtained by multiplying the parameter $ g$ by the random Gaussian noise function; i.e.,

$\displaystyle \mathit{noise}() = g* \ensuremath{\epsilon}()
$

This can then be added to the signal. The more noise, the more the underlying signal is obscured. The effect of adding Gaussian noise with $ g = 0.1$ can be seen in Figure 6.11.
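As a sketch of this step in Python, assuming a channel is stored as a plain list of samples and that an independent noise value is drawn for every frame (the text does not say so explicitly):

```python
import random


def add_noise(signal, g, rng=random):
    """Add g * epsilon() to every sample, epsilon() being standard normal."""
    return [x + g * rng.gauss(0.0, 1.0) for x in signal]
```

Setting `g = 0` leaves the signal unchanged, while larger values of `g` increasingly obscure the underlying prototype.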

Figure 6.11: Effect of adding Gaussian noise to prototypes of class A.
\begin{figure}\begin{center}
\leavevmode \epsfxsize =5in \epsfbox{classawithg.eps}\par\centering\centering\end{center}\end{figure}

As mentioned before, linear temporal stretching is only an approximation of a non-linear process. Hence, we can also modify the times of events within each instance relative to one another. In TTest, this is controlled by the parameter $ c$. Within the dataset, the startpoints and endpoints of increases and decreases, as well as the timing of local maxima and minima, are all randomly offset in time. Each of these temporal events is offset by a random value $ c * \ensuremath{\mathit{unif}}() * 100$. The effect of setting $ c=0.1$ on three instances of class A is shown in Figure 6.12.
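The event offsets can be sketched in Python as follows, assuming the event times of a prototype are available as a list of frame positions. The function name and the example times are hypothetical, and the sketch does not deal with offsets that would reorder neighbouring events, since the text does not say how such cases are handled.

```python
import random


def jitter_event_times(times, c, rng=random):
    """Offset each event time by an independent c * unif() * 100."""
    return [t + c * rng.uniform(-1.0, 1.0) * 100.0 for t in times]


# Hypothetical event times (in frames) for some prototype.
jittered = jitter_event_times([20, 40, 60], c=0.1)
```

With `c = 0.1`, each event moves by at most 10 frames in either direction.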

Figure 6.12: Effect of adding sub-event variation to prototypes of class A.
\begin{figure}\begin{center}
\leavevmode \epsfxsize =5in \epsfbox{classawithc.eps}\par\centering\centering\end{center}\end{figure}

Also, as with real datasets, sometimes the amplitudes of the various events vary. In TTest, the parameter $ h$ determines the amount of variation in the amplitude of the local maxima of the signal. Each of these maxima is perturbed by an amount $ h*\ensuremath{\mathit{unif}}()$. The effect of setting $ h=0.2$ on three instances of class A can be seen in Figure 6.13.

Figure 6.13: Effect of adding amplitude variation to prototypes of class A.
\begin{figure}\begin{center}
\leavevmode \epsfxsize =5in \epsfbox{classawithh.eps}\par\centering\centering\end{center}\end{figure}

Figure 6.14: Effect of replacing gamma channel with irrelevant signal to class A.
\begin{figure}\begin{center}
\leavevmode \epsfxsize =5in \epsfbox{classawithirrel.eps}\par\centering\centering\end{center}\end{figure}

Finally, one thing that often occurs in real datasets is that there is some data on a channel that looks useful, but in fact is irrelevant. For this reason, for classes A and B, the gamma channel was replaced with something that looks plausible as a signal: a sequence of between 2 and 9 random line segments whose endpoints are randomly generated. For class C, the beta channel is replaced with a similarly generated sequence of random line segments. Note that this is meant to explore a different issue to that of the Gaussian noise above; this is meant to test the learner's ability to cope with data that is irrelevant.

Three instances of class A are shown in Figure 6.14, with the irrelevant features added.
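One way such an irrelevant channel could be generated is sketched below in Python. Only the number of segments (between 2 and 9) comes from the description above; the value range `[lo, hi]`, the uniform choice of breakpoint frames and the function name are assumptions made for this example.

```python
import random


def irrelevant_channel(n_frames=100, lo=0.0, hi=1.0, rng=random):
    """Piecewise-linear signal made of 2 to 9 random line segments.

    Assumes n_frames >= 10 so that up to 8 distinct interior
    breakpoints can be chosen.
    """
    n_segments = rng.randint(2, 9)  # both bounds inclusive
    # Distinct interior breakpoints, sorted in time.
    interior = sorted(rng.sample(range(1, n_frames - 1), n_segments - 1))
    knot_times = [0] + interior + [n_frames - 1]
    knot_values = [rng.uniform(lo, hi) for _ in knot_times]

    signal = []
    seg = 0
    for t in range(n_frames):
        # Advance to the segment that contains frame t.
        while t > knot_times[seg + 1]:
            seg += 1
        t0, t1 = knot_times[seg], knot_times[seg + 1]
        y0, y1 = knot_values[seg], knot_values[seg + 1]
        # Linear interpolation between the segment's endpoints.
        signal.append(y0 + (y1 - y0) * (t - t0) / (t1 - t0))
    return signal
```

Because the endpoints are drawn independently of the class label, a signal produced this way looks structured but carries no information about the class.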

Having included all of these factors, we can now redefine our original signals in terms of the parameters $ g, c, d, h$, as well as whether we choose to have irrelevant features or not. The definitions hence become:

\begin{displaymath}
\begin{array}{lll}
\mathit{dur} & = & (1+ d * \ensuremath{\...
...f \emph{irrel} is on} \\
\end{array} \right.\\
\end{array}\end{displaymath}

\begin{displaymath}
\begin{array}{lll}
\mathit{dur} & = & (1+ d * \ensuremath{\...
...f \emph{irrel} is on} \\
\end{array} \right.\\
\end{array}\end{displaymath}

\begin{displaymath}
\begin{array}{lll}
\mathit{dur} & = & (1+ d * \ensuremath{\...
...) & \mbox{otherwise}
\end{array}\right. \\  [1cm]
\end{array}\end{displaymath}

This dataset has some useful properties. Firstly, it is truly multivariate. Secondly, it has variation in the length of streams and the onset and duration of sub-events. Thirdly, it has variation in the amplitudes of events as well as Gaussian noise (as does CBF). Fourthly, it has several sub-events and a complicated relation between sub-events which is typical of real-world domains. CBF only had one of these properties.


Mohammed Waleed Kadous 2002-12-10