Computer science in general, as well as Artificial Intelligence more specifically, are young disciplines in which research methods are not yet clearly established. While empirical methods are part of daily practice in many scientific disciplines, including the traditional empirical sciences, such as physics and chemistry, as well as the more recent social sciences, empirical methods do not (yet) belong to the standard toolkit of a computer scientist or AI researcher.
Indeed, only after the systematic use of experimental strategies, as proposed by Francis Bacon in his Novum Organum Scientiarum in the 17th century, could the growth rate of our collective (scientific) knowledge increase so dramatically, enabling the amazing scientific progress that followed.
Despite the virtues of empirical methods for science in general, there are also a number of scholarly disciplines which use no empirical methods, or at least no systematic ones. Among these disciplines are, for instance, Mathematics, Philosophy, Literature, and History. Each discipline has its own methodology, tailored to the nature of its field of study. In mathematics, for instance, we study the relationships between abstract objects, such as numbers, sets, and relations, which exist only in our mental life.
Artificial Intelligence, being a comparably young discipline, lacks a tradition of a commonly accepted research methodology. Moreover, the objectives of researchers in AI are not uniform at all: not only are the techniques for developing intelligent computer systems of interest, but also an advancement in the understanding of human cognition and intelligent behaviour.
Restricting our considerations to research aiming at methods for building intelligent computer systems, we can distinguish the engineering approach from the cognitive approach to building AI systems. The engineering approach is primarily interested in developing effective methods for creating machine intelligence - disregarding the way humans achieve intelligent behaviour.
Even if we focus here on the engineering approach to AI, the subject matter remains remarkably difficult. In more traditional empirical disciplines, such as physics, researchers aim at the development of models which describe certain aspects of nature.
Compared to that, the subject matter of AI is less clear: do we aim at modelling certain aspects of nature? Or do we rather aim at developing - at the meta level - methods which allow an easier modelling of certain natural phenomena as well as artifacts? Or do we try to discover the general principles of intelligence, which would allow us to build an initial system with basic intelligence which then evolves on its own into a more complex and intelligent system?\footnote{For the time being it remains unclear whether such principles exist at all.}
There is a strong case for targeting the development of specialised AI systems which can at least perform a restricted set of intelligent tasks. The currently established division of AI research into subareas, such as knowledge representation, knowledge acquisition, theorem proving, search, vision, robotics, expert systems, machine learning, neural networks, etc., suggests that the various subareas have much stronger common ground internally than the field of AI in its entirety. Roughly speaking, research in these subareas falls into three categories:
  1. theoretical research, essentially mathematical studies in which theorems are formulated and proved.
  2. conceptual research, which develops new concepts for categorising problems, tasks, and algorithms, or more general frameworks which provide new perspectives on problem domains or on approaches to solutions.
  3. experimental studies in which algorithms are demonstrated to solve certain problems or to show a certain performance in problem solving.

In many subareas of AI, a significantly increased interest in empirical evaluations has emerged in recent years. This tendency stands in contrast to the kind of research papers often found in earlier AI publications, where largely ideas and perspectives were presented and demonstrated using a toy domain.
While it seems a healthy development to scrutinise the validity of new ideas and claims more closely, it is also important not to lose sight of what we want to achieve with our research.
For the engineering approach to AI, we can roughly state our research objective as finding ways to develop (more effectively) systems which perform certain intelligent tasks.
Arguably, the ideal would be a bootstrap system which learns on its own to do any task it is given. However, this seems out of the question for the foreseeable future. Consequently, we are interested in techniques which make the life of the human system developer easier.
Whether a given technique serves this purpose depends on at least the following factors (besides the technique itself): the human system developer(s); the nature of the task; and the system developer's understanding of the nature of the task.
Unfortunately, these factors do not provide very solid grounds for conducting reliable empirical studies of which technique best serves the purpose: the critical factors are too ill-defined to give us a good handle on the question.
Besides the factors mentioned above, there is another difficulty which sets AI apart from most other sciences: the class of tasks we want techniques for is extremely diverse. Indeed, ideally we want, for instance, a learning technique which is capable of learning anything - which includes the acquisition of all the scientific knowledge mankind has acquired and will acquire in the future. The trouble is that the different scientific disciplines have emerged precisely because they were found to differ from other disciplines in important ways, such as in their ontology, key concepts, research methodology, implicit underlying assumptions, etc. If we do not integrate all this information into the learning technique, it seems extremely unlikely that it could work well across the range of disciplines. As a consequence, the learning technique would need to be tailored to each domain.
What is then left for more generally applicable research is essentially the development of tools and techniques which make this tailoring easier.

If we accept that, what is then the role of empirical studies in AI? At least the following questions can be addressed by empirical studies:

  1. What is the exact behaviour of an algorithm when run on various kinds of input data (e.g. benchmark problems)?
  2. What does the data in real applications look like?
  3. What approaches and techniques are most suitable for the application developer to provide domain-dependent knowledge, parameters, etc. to the system, so that it will work well in the individual application domain?

The first question can in theory be answered analytically, because we would usually expect our systems to function deterministically. Even if an element of chance is involved from an outside source, we may be able to model all interesting scenarios in principle. However, such an analysis is usually not feasible in practice, so in many cases it is much more sensible to run algorithms on data and observe how they perform.
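As a minimal, hypothetical sketch of this kind of experiment (the instrumented binary search and the chosen input sizes are illustrative assumptions, not drawn from any particular benchmark suite): since the algorithm is deterministic, counting its comparisons on systematically varied inputs yields exact behavioural data which an analytical treatment would otherwise have to derive.

```python
# Illustrative sketch: an instrumented binary search whose exact,
# deterministic behaviour is measured by counting comparisons.
def binary_search(items, target):
    comparisons = 0
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        comparisons += 1                 # one three-way comparison per step
        if items[mid] == target:
            return mid, comparisons
        elif items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, comparisons

# Run the algorithm on benchmark inputs of growing size and record
# the comparison count for a worst-case query (target absent).
for n in (10, 100, 1000, 10000):
    _, comps = binary_search(list(range(n)), n)
    print(n, comps)
```

Because the algorithm is deterministic, each (input, count) pair is exactly reproducible; the empirically interesting question is how the counts grow with the input size.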
However, we are usually not really interested in the raw numbers obtained from experiments which compare the performance of multiple techniques on some benchmarks. Since in many areas we cannot reasonably hope to find a single master technique applicable to all tasks, we are rather interested in an improved understanding of why and under what circumstances certain techniques perform the way they do.
Such an improved understanding requires suitable concepts to characterise our experimental settings and to formulate our conclusions in regard to where a certain technique is most suitably applied, etc. Indeed, empirical research can and should help a great deal in developing a finer and more suitable conceptual framework which will allow us to characterise the applicability of techniques as well as to classify application domains.
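To make this concrete with a hypothetical example (the algorithm and the input classes below are my own illustrative choices, not drawn from the text): insertion sort does far more work on random input than on nearly sorted input, so a single aggregate benchmark number would hide exactly the circumstance-dependence our concepts need to capture.

```python
import random

def insertion_sort(items):
    """Return a sorted copy together with the number of element moves."""
    a = list(items)
    moves = 0
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0 and a[j] > key:  # shift larger elements to the right
            a[j + 1] = a[j]
            moves += 1
            j -= 1
        a[j + 1] = key
    return a, moves

random.seed(0)
n = 500
random_input = [random.random() for _ in range(n)]
nearly_sorted = sorted(random_input)
# Perturb the sorted input slightly: swap the two end elements.
nearly_sorted[0], nearly_sorted[-1] = nearly_sorted[-1], nearly_sorted[0]

_, moves_random = insertion_sort(random_input)
_, moves_nearly = insertion_sort(nearly_sorted)
print(moves_random, moves_nearly)  # the same technique, very different cost
```

Characterising input classes such as "nearly sorted" is precisely the kind of concept that experimental findings need in order to state where a technique is suitably applied.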

In contrast to the first question, the second question - on the kind of scenario we face in real applications - is truly empirical in nature. There is no way we could think up all the conditions of real applications. Here, field studies are required which tell us about typical features of applications. This is a particularly difficult type of research as, again, we largely lack suitable concepts to describe our findings. We often do not know whether two application domains are similar in important respects, i.e. whether they can be handled by the same technique we have in mind.
Our lack of understanding of the differences and commonalities of application domains can be blamed for many of the disappointments in AI. The typical experience of encountering substantial problems when trying to scale up a technique which was successful in a toy domain demonstrates that we do not know enough about the differences between the toy domain and the larger domain.
It may be a misconception after all to believe that the toy domain is somehow representative of a large class of domains.
To alleviate this situation, we need to better understand our application domains. This understanding has to rely on empirical studies but must also concentrate on the development of new and more suitable concepts to describe the findings.
Finally, the third question addresses the psychology of the application developer. It resembles, to an extent, questions in software engineering about the suitability of programming languages and similar tools. It requires us to study human psychology with regard to the task the application developer has to perform. This seems very difficult, as it is very expensive to conduct a sensible number of controlled studies of how system developers cope with a certain technique. Furthermore, the nature of the applications to be developed changes continuously. As a consequence, we cannot be sure how reliable findings from previous studies are. Can they safely be carried over to new application domains? Perhaps certain aspects of the new domains are more difficult to articulate or to understand in the first place, which may result in unexpectedly poor performance with a technique, e.g. a certain programming language, which worked fine before. For example, the development of graphical user interfaces would indeed be difficult to do in COBOL or Pascal.
After having posed so many questions, let us now turn to the papers of this workshop. The workshop brings together experiences and views from a variety of angles, which will hopefully advance our understanding of how to do better research in AI, and in particular of how and where to do empirical studies.