In many subareas of AI, interest in empirical evaluation
has increased significantly in recent years.
This tendency contrasts with the kind of research paper
often found in earlier AI publications,
in which largely ideas and perspectives were
presented and demonstrated using a toy domain.
While it seems a healthy development to scrutinise the validity of new
ideas and claims more closely, it is also important not to lose sight
of what we want to achieve with our research.
For the engineering approach to AI
we can roughly state our research objective as finding ways to develop,
more effectively, systems which perform certain intelligent tasks.
Arguably, the ideal would be a bootstrap system which learns on its own
to do any task it is put to.
However, this seems out of the question for the foreseeable future. Consequently,
we are interested in techniques which make the life of
the human system developer easier.
Whether a given technique serves this purpose depends on at least the
following factors (besides the technique itself): the human system developer(s);
the nature of the task; the system developer's understanding
of the nature of the task.
Unfortunately, the factors mentioned above do not provide very solid grounds
for conducting reliable empirical studies with regard
to which technique best serves the purpose.
The critical factors are too ill-defined to give us a good handle on the question.
Besides the factors mentioned, there is another difficulty which sets
AI apart from most other sciences: the class of tasks we
want to have techniques for is extremely diverse.
Indeed, ideally we want, for instance, a learning technique
which is capable of learning anything, which includes the acquisition
of all the scientific knowledge mankind has acquired
and will acquire in the future.
The trouble is that the different scientific disciplines have emerged
because they were found to be different from other disciplines in important
ways, such as in the ontology of the discipline,
its key concepts, its research methodology,
its implicit underlying assumptions, etc.
If we do not integrate all this information into the learning technique
it seems extremely unlikely that it could work
well across the range of disciplines.
As a consequence, the learning technique would need to be tailored
to each domain.
What is then left for more generally applicable research is essentially
the development of tools and techniques which make
the tailoring easier.
If we accept that, what is then the role of empirical studies in AI? At least the following questions can be addressed by empirical studies:
The first question can in theory be answered
analytically because we would usually expect our systems
to function deterministically. Even if there should be an
element of chance involved from an outside source,
we may in principle be able to model all interesting scenarios.
However, such an analysis is usually not feasible in practice. So it is
in many cases much more sensible to run algorithms on data and see
how they perform.
Still, we are usually not really interested in the numbers themselves
which we obtain from experiments comparing the performance
of multiple techniques on some benchmarks.
Since in many areas we cannot reasonably hope to find
a single master technique which can be applied to all tasks,
we are rather interested in an improved understanding of why
and under what circumstances
certain techniques perform in the way they do.
Such an improved understanding requires suitable
concepts to characterise our experimental settings
and to formulate our conclusions with regard to where a certain
technique is most suitably applied, etc.
Indeed, empirical research can and should
help a great deal in developing
a finer and more suitable conceptual framework
which will allow us
to characterise the applicability of techniques as well as
to classify application domains.
In contrast to the first question, the second question, concerning the kind
of scenario which we face in real applications, is truly empirical in nature.
There is no way that we could think up all the conditions of real
applications.
Here, field studies are
required which tell us about typical features
of applications.
This is a particularly difficult type of research, as, again, we largely lack
suitable concepts to describe our findings.
We often don't know whether two application domains are similar
in important respects or not.
That is, we don't know whether they can be handled
by the same technique.
The lack of understanding of the differences and commonalities
between application domains can be blamed for many of
the disappointments in AI.
The typical experience of finding substantial problems when we
try to scale up a technique which was successful in a toy domain
demonstrates that we do not know enough about the differences
between the toy domain and the larger domain.
It may be a misconception after all to believe
that the toy domain is somehow
representative of a large class of domains.
To alleviate this situation,
we need to better understand our application domains.
This understanding has to rely on empirical studies but must
also concentrate on the development of new and
more suitable concepts to describe the findings.
Finally, the third question
addresses the psychology of the application developer.
This question resembles to an extent
the questions in software engineering about
the suitability of programming languages etc.
It requires us to study human psychology
with regard to the task the application developer has
to perform.
This seems to be very difficult, as it is very expensive to conduct
a sufficient number of controlled studies of how system developers
cope with a certain technique.
Furthermore, there is continuous change
in the nature of the applications to be developed.
As a consequence, we cannot
be sure how reliable findings from previous studies are.
Can they safely be carried over to new application domains?
Maybe certain aspects
of the new domains are more difficult to articulate
or to understand in the first place. This may result in
unexpectedly
poor performance with a technique, e.g. a certain programming language,
which worked fine before.
For example, the development of graphical user interfaces would
indeed be difficult to do
in COBOL or Pascal.
After having posed so many questions, let us now turn to the papers
of this workshop.
The workshop brings together experiences and views from a variety of
angles, which will hopefully advance our understanding of how to do
better research in AI, and in particular of how and where
to do empirical studies.