To fulfil this role, the artificial dataset TTest has been designed. TTest is a learning problem with three classes and three channels. It has a total of five parameters controlling various characteristics.
There are three classes, imaginatively called A, B and C. The three channels, likewise, are imaginatively termed alpha, beta and gamma. For each of the three classes, there is a prototype, as shown in Figures 6.7, 6.8 and 6.9. The prototypes are all exactly one hundred units of time long.
These prototypes can be mathematically defined as follows:
Now, these are not particularly interesting as a learning task. Even feeding the features into a traditional learner, such as C4.5, treating each channel/frame combination as a single feature (hence generating 300 features), would work, except that the definitions produced would be really difficult to read. So, we've got to spice it up somehow.
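The flattening described above can be sketched as follows. This is a minimal illustration, not the thesis's own code: the array layout (channels by frames), the names `N_CHANNELS`, `N_FRAMES` and `flatten_instance`, and the zero-filled placeholder instance are all assumptions made for the example.

```python
import numpy as np

N_CHANNELS = 3   # alpha, beta, gamma
N_FRAMES = 100   # prototype length in time units


def flatten_instance(instance: np.ndarray) -> np.ndarray:
    """Turn a (channels, frames) array into one flat feature vector.

    Each channel/frame combination becomes a single feature.
    """
    if instance.shape != (N_CHANNELS, N_FRAMES):
        raise ValueError(f"expected shape {(N_CHANNELS, N_FRAMES)}, got {instance.shape}")
    return instance.reshape(-1)


# Placeholder values only; this is not one of the real prototypes.
example = np.zeros((N_CHANNELS, N_FRAMES))
features = flatten_instance(example)
print(features.shape)  # (300,)
```

A learner such as C4.5 would then see 300 anonymous columns, which is why the resulting definitions would be hard to read.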
Obviously, the solution lies in randomisation. For the purposes of
this thesis, let us define a function
, which returns a
uniformly distributed real number in the range
. Also, let us
define a function
, which returns a normally distributed real number
with a mean of 0 and a variance of 1.
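As a rough sketch, the two random sources could be written in Python with NumPy as below. The function names `uniform_sample` and `gaussian_sample` are hypothetical, and the bounds of the uniform range are left as arguments because they are an assumption of this example rather than something fixed here.

```python
import numpy as np


def uniform_sample(rng: np.random.Generator, low: float, high: float) -> float:
    """Uniformly distributed real number in [low, high); the bounds are caller-supplied."""
    return float(rng.uniform(low, high))


def gaussian_sample(rng: np.random.Generator) -> float:
    """Normally distributed real number with mean 0 and variance 1."""
    return float(rng.normal(loc=0.0, scale=1.0))


rng = np.random.default_rng(seed=0)      # seeded so runs are repeatable
u = uniform_sample(rng, 0.0, 1.0)        # placeholder bounds, chosen for illustration
g = gaussian_sample(rng)
```

Passing the generator explicitly keeps every source of randomness in a dataset generator reproducible from a single seed.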
One form of variation is temporal stretching. In general, temporal
stretching is non-linear; in other words, temporal stretching need
not be of the kind where the signal is uniformly slowed down or sped
up; rather, some parts can be sped up while others can be slowed
down. However, for simplicity, we will assume linear temporal
stretching. The amount of temporal (linear) stretching for the TTest
dataset is controlled by the parameter
(short for overall
duration). The duration is computed as:
To illustrate this, consider a prototype A signal. Figure 6.10 shows
a variety of different instances of A, with
. Similar effects
can be observed on the other classes.
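Linear temporal stretching can be sketched as resampling a signal onto a new number of frames. The sketch below assumes linear interpolation between samples and takes the target duration as an input; how that duration is drawn from the duration parameter is not reproduced. The name `stretch_linear` is hypothetical.

```python
import numpy as np


def stretch_linear(signal, new_length: int) -> np.ndarray:
    """Resample `signal` to `new_length` frames, uniformly speeding it up or slowing it down."""
    signal = np.asarray(signal, dtype=float)
    if signal.size < 2 or new_length < 2:
        raise ValueError("need at least two frames before and after stretching")
    # Map both the old and the new frame indices onto the same [0, 1] time axis.
    old_positions = np.linspace(0.0, 1.0, signal.size)
    new_positions = np.linspace(0.0, 1.0, new_length)
    return np.interp(new_positions, old_positions, signal)


ramp = np.linspace(0.0, 1.0, 100)    # stand-in for one channel of a prototype
longer = stretch_linear(ramp, 120)   # slowed down
shorter = stretch_linear(ramp, 80)   # sped up
```

Because every part of the signal is scaled by the same factor, the shape is preserved and only its duration changes.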
Another form of variation is random noise. For this problem, we will
assume that such noise is Gaussian. This is not an unreasonable
assumption, as Gaussian noise is typical of many types of sensors and
measurements that are used in temporal data. For the TTest dataset,
the amount of noise is controlled by the parameter
; the noise on
all channels is obtained by multiplying the parameter
by the random
Gaussian noise function; i.e.,
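A minimal Python sketch of this scaling follows. It assumes that the noise is added to the clean signal independently at every frame; `noise_level` is a hypothetical name standing in for the noise parameter, and `add_gaussian_noise` is likewise an illustrative name.

```python
import numpy as np


def add_gaussian_noise(signal, noise_level: float, rng: np.random.Generator) -> np.ndarray:
    """Return `signal` plus noise_level times a standard normal draw per frame.

    Additive, per-frame noise is an assumption of this sketch.
    """
    signal = np.asarray(signal, dtype=float)
    return signal + noise_level * rng.standard_normal(signal.shape)


rng = np.random.default_rng(seed=1)
clean = np.zeros(100)                         # placeholder channel, not a real prototype
noisy = add_gaussian_noise(clean, 0.1, rng)   # 0.1 is an arbitrary illustrative level
```

With `noise_level` set to zero the signal is returned unchanged, so the parameter acts directly as the standard deviation of the noise.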
As mentioned before, temporal stretching is only a linear
approximation of a non-linear process. Hence, we can also modify the
times of events within each instance relative to one another. In
TTest, this is controlled by the parameter
. Within the dataset,
the startpoints and endpoints of increases and decreases, as well as
the timing of local maxima and minima, are all randomly offset in time.
The amount of variation of these temporal events is modified by a
random value
. The effect of setting
on three instances of class A is shown in Figure
6.12.
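One way to picture this offsetting is to treat a channel as a piecewise-linear signal defined by event times and levels, and to shift each event time independently. The sketch below is an illustration under several assumptions: the offsets are drawn uniformly from [-1, 1] and scaled by a hypothetical `time_jitter` parameter, shifted times are clipped to the signal and re-sorted so that events stay in order, and the event list used is invented for the example.

```python
import numpy as np


def jitter_event_times(event_times, time_jitter: float,
                       rng: np.random.Generator, duration: int) -> np.ndarray:
    """Offset each event time by time_jitter * U(-1, 1), keeping times ordered and in range."""
    times = np.asarray(event_times, dtype=float)
    offsets = time_jitter * rng.uniform(-1.0, 1.0, size=times.shape)
    shifted = np.clip(times + offsets, 0.0, duration - 1)
    return np.sort(shifted)  # assumption: events may not overtake one another


def render_piecewise_linear(event_times, levels, duration: int) -> np.ndarray:
    """Sample the piecewise-linear signal through (time, level) points at integer frames."""
    return np.interp(np.arange(duration), event_times, levels)


rng = np.random.default_rng(seed=0)
events = [20.0, 40.0, 70.0]   # invented: start of rise, local maximum, end of fall
levels = [0.0, 1.0, 0.0]
shifted = jitter_event_times(events, 5.0, rng, duration=100)
signal = render_piecewise_linear(shifted, levels, duration=100)
```

Each instance generated this way keeps the same sequence of events, but the gaps between them differ, which is what linear stretching alone cannot produce.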
Also, as with real datasets, sometimes the amplitudes of the various
events vary. In TTest the parameter
determines the amount of
variation in the amplitude of the local maxima of the signal. Each of
these maxima is perturbed by an amount
. The effect of
setting
on three instances of class A can be seen in Figure
6.13.
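Amplitude variation can be sketched in the same piecewise-linear picture by perturbing only the levels that correspond to local maxima. The distribution of the perturbation is an assumption here (a standard normal draw scaled by a hypothetical `amp_jitter` parameter), as are the function name and the example levels.

```python
import numpy as np


def perturb_maxima(levels, is_maximum, amp_jitter: float,
                   rng: np.random.Generator) -> np.ndarray:
    """Add amp_jitter times a standard normal draw to each level flagged as a local maximum."""
    levels = np.asarray(levels, dtype=float)
    mask = np.asarray(is_maximum, dtype=bool)
    perturbed = levels.copy()
    # Only the flagged maxima move; all other levels are left untouched.
    perturbed[mask] += amp_jitter * rng.standard_normal(int(mask.sum()))
    return perturbed


rng = np.random.default_rng(seed=0)
levels = [0.0, 1.0, 0.0, 0.8, 0.0]              # invented levels for illustration
is_maximum = [False, True, False, True, False]
new_levels = perturb_maxima(levels, is_maximum, 0.2, rng)
```

The perturbed levels can then be rendered into a signal in the usual way, so that peak heights differ between instances of the same class.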
Finally, one thing that often occurs in real datasets is that there is some data on a channel that looks useful, but is in fact irrelevant. For this reason, for classes A and B, the gamma channel is replaced with something that looks plausible as a signal: a sequence of between 2 and 9 random line segments whose endpoints are randomly generated. For class C, the beta channel is replaced with a similarly generated sequence of random line segments. Note that this is meant to explore a different issue to that of the Gaussian noise above; this is meant to test the learner's ability to cope with data that is irrelevant.
Three instances of class A are shown in Figure 6.14, with the irrelevant features added.
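A plausible-looking but irrelevant channel of this kind might be generated as in the sketch below. Only the count of 2 to 9 segments with random endpoints comes from the description; joining the segments end to end, drawing breakpoint times without replacement, and the `value_range` for the endpoint heights are assumptions of this example.

```python
import numpy as np


def random_segments_channel(length: int, rng: np.random.Generator,
                            value_range=(-1.0, 1.0)):
    """Build a channel from 2 to 9 joined random line segments; returns (channel, n_segments)."""
    n_segments = int(rng.integers(2, 10))  # upper bound is exclusive, so this gives 2..9
    # n_segments joined segments need n_segments - 1 interior breakpoints.
    interior = np.sort(rng.choice(np.arange(1, length - 1),
                                  size=n_segments - 1, replace=False))
    knots = np.concatenate(([0], interior, [length - 1]))
    heights = rng.uniform(value_range[0], value_range[1], size=knots.size)
    channel = np.interp(np.arange(length), knots, heights)
    return channel, n_segments


rng = np.random.default_rng(seed=0)
irrelevant, n_segments = random_segments_channel(100, rng)
```

Since the breakpoints and heights are drawn independently of the class, the channel carries no information about the label, however signal-like it appears.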
Having included all of these factors, we can now redefine our original
signals, in terms of the parameters
; as well as whether we
choose to have irrelevant features or not. The definitions hence become:
This dataset has some useful properties. Firstly, it is truly multivariate. Secondly, it has variation in the length of streams and in the onset and duration of sub-events. Thirdly, it has variation in the amplitudes of events as well as Gaussian noise (as does CBF). Fourthly, it has several sub-events and a complicated relation between sub-events, which is typical of real-world domains. CBF only had one of these properties.