Machine Learning and Sisyphus III - Introduction to the task and IGBA Database


Abstract

This document is designed to introduce the reader to the machine learning component of Sisyphus III, to familiarise them with the IGBA database and to present support tools we have developed to browse the database. We begin by briefly outlining the motivation behind Sisyphus III and the nature of the task and then proceed to discuss the database within this context. In conclusion we discuss the possible usage of the data within machine-learning oriented solutions to the task.

Sisyphus III - the inclusion of machine learning

Previous Sisyphus tasks lacked a serious machine-learning component. Machine learning has long been hypothesised as a solution to the knowledge acquisition bottleneck that places less reliance upon the existence of human expertise and the competence of knowledge engineers in analysing this. No relevant data-sets were, however, made available within the contexts of these tasks. The machine learning community has thus been unable to fully benefit through active participation.

For example, we, ourselves, have employed the Sisyphus I room allocation problem as the focus for knowledge-intensive data-mining, by regarding the initial problem description provided by Linster as an organisational database of employees over which both knowledge elicitation and machine learning tools can be applied (Cupit and Shadbolt, 1996). We see our experience of this as providing us with a deeper understanding of the problems involved in data-mining and a proof of concept for the representational redescription of data-sets prior to learning as a means of overcoming these issues. Our analytic methodology allowed machine learning tools to be applied over this database and resulted in the discovery of a high-level teleologically significant semantics that was overlooked by previous solutions.

We were, however, constantly aware of the limitations of this endeavour. The initial Sisyphus I problem statement contained enough detail to specify only a small number (fifteen) of cases. Moreover, this description was not based upon observations of the world but invented. We also possessed previous experience of the problem prior to analysis. Such points are not criticisms as such. Analysing small data-sets raises its own problems and there is nothing, in principle, to stop machine learning tools being applied to imaginary data by people who are already familiar with the domain. Indeed, this and past experience suggests that there is much to be gained from it. Such points do, however, spoil the pretence that this could have been "real" data to be analysed in the context of a "real" task about which we have little prior knowledge, the type of task our approach is designed for.

Finding such data is hard. Whilst many data-sets have been made available as analytic resources within machine learning repositories, these are now well understood and documented. New data quickly becomes stale through repeated analysis. Indeed, the definitive account of such data provided by such widespread and repeated analysis is a necessary benchmark against which the effectiveness of various learning algorithms can be compared. This approach to evaluating machine learning research, however, can be criticised, like the Sisyphus I and II experiments, as lacking in external validity.

When evaluating machine learning tools against a standard benchmark data-set, one places a focus specifically upon the empirical performance of the learning algorithm and attempts to hold other factors constant. It intuitively seems unfair to compare the performance of two algorithms if each had experienced different data, were provided with differing degrees of background knowledge or were parametrically biased in different ways. Such methodological issues, concerning how machine learning tools are used in practice, are, however, of increasing concern. Indeed, the raison d'etre behind many modern learning tools is supporting such practice. The objective of evaluation might thus be perceived as having widened to include the practice of using machine learning algorithms as well as the properties of these. We have begun to ask a new set of questions.

For example, do machine learning techniques naturally fit within wider knowledge acquisition methodologies, such as generic tasks and KADS? Do machine learning tools naturally operate in synergy with each other and with non-learning acquisition techniques? Are different machine learning tools associated with different patterns of usage, for example different pre- and post-processing steps, or different problems? How does a tool user configure algorithmic parameters and decide what learning biases to supply? How much reliance does a particular method place upon the provision of background knowledge and how does its performance degrade in its absence? How much extra knowledge is required to prepare data for analysis or make sense of what is learnt? If such knowledge is necessary, what is its form - does it relate to an understanding of the learning algorithm, the task or the specific domain of enquiry? Does a machine learning approach rely upon data collected specifically within the context of analysis, or can such tools be used over relevant preexisting data collected for different reasons? Such questions are of obvious importance given a general assumption that ML tools possess utility only when sensibly applied.

This provides a context for the principal objectives of the machine learning component of Sisyphus III. These are

 

 

How should such objectives be met? The context of Sisyphus III places strong requirements upon a suitable data-set. It must be considered relevant to the overall problem domain - if this were not the case we would be discussing Sisyphus IV rather than III and such a task would preclude many solutions employing the synergetic usage of learning and non-learning tools in a realistic fashion. The data should be sufficiently realistic and complex to warrant methodological concerns throughout the fields of machine learning and knowledge discovery in databases. This means that it must be large-scale actual data already in existence. More importantly, there needs to be a general lack of information concerning what knowledge might be discovered within, or theories grounded upon, it. It must not be possible to adjust the learning tool or the practice of using it with respect to a known benchmark or classification of the task as an instance of a generic class of learning problem. Whilst a gold standard is bound to emerge through analysis, the task of deciding upon such a standard, and of operating in its absence, is important in its own right.

With these points in mind, we turn attention to the IGBA database. We believe this dataset provides the perfect opportunity to play out these objectives in a highly realistic setting.

The IGBA Database - form and content

 

The IGBA base has been constructed under the auspices of the International Union of Geological Sciences, as an experimental attempt to digitize chemical, petrographic, mineralogical and stratigraphic information about igneous rocks in such a fashion that it is both relatively compact and readily accessible for subsequent selective retrieval and/or reduction. It contains 19519 specimen descriptions, drawn from 1357 source references. These specimens are represented in the form of FORTRAN record cards, the structure of which is described in the structure.txt file provided with the database. [1] Instances are clustered into a large number of small groups relating to specimens originating from particular locations and supplied by the same contributor. The content of these descriptions relates to seven general categories, outlined below.

These properties are considered basic components of a specimen description. These, excluding the specimen's specific rock name, its place of origin and numeric information, are recorded in a coded form. Some 880 codes are specified. In addition to these basic components, extra information concerning the rock is present in a non-coded form. This information relates to detailed analytic results, such as the specimen's specific gravity and refractive index, as well as textual fields containing comments made by authors, contributors and the editor of the database. This data set is thus very large in comparison with most others employed as ML benchmarks.
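
To make the mixture of coded and numeric fields concrete, the following is a minimal Python sketch of reading one fixed-width record card and decoding a coded field. The field names, column positions, code values and the example card are all invented for illustration; the actual layout is the one given in structure.txt and is not reproduced here.

```python
# Hypothetical layout: field name -> (start, end), 0-based, end-exclusive columns.
# These positions are NOT taken from structure.txt.
LAYOUT = {"specimen_id": (0, 6), "form_code": (6, 9), "sio2_pct": (9, 15)}

# Hypothetical code table; the real database specifies some 880 codes.
CODES = {"EUH": "euhedral", "ALL": "allotriomorphic"}


def parse_card(card):
    """Slice one fixed-width card into named fields and decode the coded one."""
    fields = {name: card[start:end].strip() for name, (start, end) in LAYOUT.items()}
    # An unknown or blank code decodes to None rather than raising an error.
    fields["form"] = CODES.get(fields["form_code"])
    # Blank numeric fields are kept as None, since many descriptors are missing.
    fields["sio2_pct"] = float(fields["sio2_pct"]) if fields["sio2_pct"] else None
    return fields


CARD = "000042EUH 49.20"   # invented example card
RECORD = parse_card(CARD)
```

Blank fields are mapped to `None` here only as one possible convention for representing absent descriptors.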

The frequency of rock types found within the database should not be taken as representative of the domain. The relative occurrence of a rock type may be biased by many factors other than the frequency of the rock within the domain. It might, for example, relate to physical access to these rocks, an analytic focus upon rocks of a certain form and the frequency with which the term is employed within the geological community itself. It seems geology is drowning in terms resulting from the excesses of turn-of-the-century geologists in attempting to discover and name new rock and mineral types as opposed to ensuring the future utility of these terms.

Whilst a database is essentially a "flat" structure, there are implicit hierarchical relationships between certain descriptors. For example, certain age nouns seem to subsume others. These relationships are recorded implicitly within the structure.txt file by indentation. The usage of abstract descriptors means that different specimens might be represented at different levels of granularity across the database. The specificity of a particular descriptor can be very high - it is thus used very infrequently. Again, however, it cannot be assumed that the frequency of usage of a term within the database relates to its frequency of occurrence within the domain.
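
A hierarchy expressed by indentation can be made explicit with very little code. The sketch below, in Python, recovers parent-child links from indented lines. It assumes that deeper indentation always means "child of the nearest shallower line above"; the sample descriptor names are illustrative and are not copied from structure.txt, whose actual conventions may differ.

```python
def parse_indented(lines):
    """Return (name, parent_name) pairs from indentation-nested lines."""
    stack = []   # (indent_width, name) for the current chain of ancestors
    pairs = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        indent = len(raw) - len(raw.lstrip(" "))
        name = raw.strip()
        # Discard stack entries at the same or deeper indentation:
        # they are siblings or descendants of siblings, not ancestors.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1] if stack else None
        pairs.append((name, parent))
        stack.append((indent, name))
    return pairs


# Illustrative input only, not taken from structure.txt.
SAMPLE = [
    "AGE",
    "  CENOZOIC",
    "    TERTIARY",
    "    QUATERNARY",
    "  MESOZOIC",
]
HIERARCHY = parse_indented(SAMPLE)
```

With explicit links of this kind, a specimen described with a specific term can be related to one described only with a more abstract ancestor of that term.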

There also seem to be significant differences in information content between specimen descriptions. Certain specimens, for example, come with a complete listing of trace elements whilst in others, no trace element information is provided. Whilst certain properties of an igneous rock specimen are of interest to all, certain information, for example, physical dating procedures and trace element analysis, is costly to obtain and of specific interest to certain research practices. Only a few rocks, for example, can be expected to possess five physical ages and a complete trace element analysis. The number of missing descriptors within the database is thus very high.

These properties appear due to the context surrounding the database's construction. IGBA does not represent a database intended for one specific usage or analysis. It represents a serious undertaking to archive igneous rock descriptions in such a way that they may be used and re-used for different purposes in different contexts in future times. The form and content of the database have been constructed with this teleology in mind. Being able to employ IGBA as a resource for machine learning is, however, an important goal. As interest in the fields of knowledge re-usage and data mining grows, this is exactly the sort of data we can expect to see more of. We will be increasingly asked to apply our technology in a post-hoc fashion over previously-collected data.

 

Employing IGBA as a machine learning resource.

 

These database properties - alternatively, the teleological orientation - make machine learning over it problematic. Such concerns were raised at the 1996 European Knowledge Acquisition Workshop within the Sisyphus-III working group session. Should we attempt to supply the data in a sanitized form more amenable to the community? Should we translate the syntactic structure of the data from FORTRAN record cards to a representation more amenable to the ML community, such as a propositional list of attributes with associated values? Should we semantically transform the data to be expressed within a different ontology of expression, such as the removal of synonyms or raising the level of semantic granularity? Should we select the training and evaluation data sets to be used?

Our specific methodology of using machine learning tools (Cupit and Shadbolt, 1994; Cupit and Shadbolt, 1996) places emphasis upon the sanitisation of data prior to analysis through representational and ontological redescription. We possess the software technology to transform the database into a form that might be regarded as "fit for function". Determining exactly what constitutes fitness is, however, difficult - it is a highly contextualised metric. Our experience in attempting to redescribe the data reveals that it is almost impossible to standardise in such a way that would satisfy all interests within the machine learning community.

If we semantically transformed the data prior to its release, it might be claimed that our transformation was inappropriate - that it suited only one particular usage, or that more appropriate transformations exist with respect to a specific goal. Likewise, it seems almost impossible to translate the data into a different representation language without making certain ontological decisions determined by the structurality of this and the structures that may be formulated within it. For example, redescribing each rock specimen as a propositional list of mutually exclusive attributes denies the possibility of a structured object representation. Indeed, the mineralogical description provided seems to naturally suit a representation of this type. If, however, we decide upon an object-oriented language as our target structure, we must also decide what objects are present. Do we assert that the different minerals found within a sample represent instances of each class of mineral (e.g. hornblende, sanidine), possessing properties relating to their location (e.g. groundmass, xenocryst) and form (e.g. euhedral, allotriomorphic), or do we decompose a description into objects representing the components of a specimen (groundmass, xenocryst) possessing properties relating to mineralogical content (hornblende, sanidine) and form (euhedral, allotriomorphic)?
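
The two object-oriented readings can be set side by side in a short Python sketch. The class names, field names and the example specimen are hypothetical; they are chosen only to show that the same description can be organised around minerals or around components, and neither is proposed as the correct ontology for IGBA.

```python
from dataclasses import dataclass, field


# Reading A: minerals are the objects; location and form are their properties.
@dataclass
class MineralInstance:
    mineral: str    # e.g. "hornblende"
    location: str   # e.g. "groundmass"
    form: str       # e.g. "euhedral"


# Reading B: components of the specimen are the objects; minerals are contents.
@dataclass
class Component:
    kind: str                                      # e.g. "groundmass"
    minerals: dict = field(default_factory=dict)   # mineral name -> form


def to_components(instances):
    """Regroup a mineral-centred description (A) into component objects (B)."""
    by_kind = {}
    for inst in instances:
        comp = by_kind.setdefault(inst.location, Component(inst.location))
        comp.minerals[inst.mineral] = inst.form
    return list(by_kind.values())


# Invented specimen, for illustration only.
SPECIMEN_A = [
    MineralInstance("hornblende", "groundmass", "euhedral"),
    MineralInstance("sanidine", "xenocryst", "allotriomorphic"),
    MineralInstance("sanidine", "groundmass", "euhedral"),
]
SPECIMEN_B = to_components(SPECIMEN_A)
```

Even in this small case the choice is not neutral: reading B, as written, can hold only one form per mineral within a component, whereas reading A can list the same mineral in the same location more than once.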

Ontology, in this context, might be regarded as an iron cage - man-made, useful for certain purposes, yet constraining and limiting. Whilst certain transformations seem more plausible than others, we would not wish to constrain or limit the usage of the data in any way. We also admit that our methodology - the redescription of teleologically unsuited data prior to analysis - is still in its infancy. We believe that data of this form provides an example of the necessity of our idea and that the small-scale empirical work we have performed provides support for the hypothesis that such an activity can be usefully performed. Work conducted under Sisyphus III might disprove either claim, by showing that no redescription is necessary or that such a process is not, in fact, possible and practical.

Transforming the data prior to its usage also seems to go against the motivation behind the Sisyphus III project. The machine learning component of Sisyphus III places emphasis upon the life-cycle of using ML techniques during knowledge acquisition. To pre-process the data would remove an important activity from this life-cycle.

We have thus decided to release the data, as found, in its rawest form. To compensate for this we provide two extra resources. Knowledge pertaining to the specific task of Sisyphus III is available from our resources page. This knowledge is, however, task-specific in nature. It relates to only a small number of the rocks and rock descriptors present within the database. The database, aside from a partial and somewhat implicit hierarchical decomposition of rocks and minerals, is relatively undocumented. To aid in the usage of the data and dissemination of this knowledge, we provide an IGBA-specific data viewer and sampler. This tool is documented in the following section.

How, then, should the researcher proceed? Two courses of action appear appropriate. One is to fully subscribe to the Sisyphus III project and the rules of its game. In this context, the researcher should aim to employ the data within the specific context of the task - constructing an igneous rock classification system based upon information available in hand samples. The other is to simply pursue one's own agenda. We obviously do not wish to prohibit the usage of the data for any academic purposes [2].

The IGBA Database Browser and Sampler.

The IGBA database browser and sampler has been designed to ease the process of getting to grips with the data. It is not designed as a serious pre-processing tool for machine learning. It runs under Windows 3.1 or Windows 95 and requires a minimum of 16 MB of RAM. It specifically supports the following functions.

Opening the tool

On opening the tool, the descriptor hierarchy, FORTRAN record templates and value codes are automatically loaded and the descriptor hierarchy is displayed.[4] To load a rock specimen file, the user should select "load" from the FILES menu. Rock specimen files possess the suffix ".rck". Having loaded a file, the tool will display the first specimen of the first case. A new specimen file can be loaded at any point. Information currently within the database is, however, lost. One cannot piece together sample files incrementally by repeated loading. This is to ensure each specimen and group possesses a unique reference.

Tool interaction

Aside from loading files, all the tool's functions are accessed via the button bar at the top of the main window. Holding the mouse cursor over a button will cause the name of the function to be displayed. Clicking on a button performs the associated function. Functions that are not applicable are displayed in gray.

Navigating the database - the transport buttons

The database may be navigated in three different ways using seven transport control buttons. Transport buttons are those possessing a left and/or right arrow. The user may move from case to case in the following fashion.

Constructing a sample.

The user may construct a sample composed of all rocks of specified types, by clicking the dropper button. The screen will change to display two list browsers, representing rock types included in, and excluded from, the sample. Initially all rocks are excluded. To move a rock from window to window, use the up and down arrow buttons. Once you have constructed a set of rock types to sample, click the disk button. You will be asked to provide a filename to which the sample data is saved.

Browsing the descriptor hierarchy.

The descriptor hierarchy may be visualised at any time by clicking the tree button. When a specimen file is not loaded, the hierarchy will display all rock types coded within IGBA. When data is present within the tool, the hierarchy will display the types of rock present within the sample in cyan. Selecting such a node will cause the number of instances of this type to be displayed in the status bar at the bottom of the tool. Note the number displayed relates specifically to that type alone; it does not include all the sub-type instances. All nodes within the hierarchy are initially collapsed. To expand or collapse a node, select it by clicking it with the mouse cursor and then use the tree expand button. To enlarge or decrease the size of the tree, use the magnifying glass buttons. The semantics of this hierarchy are deliberately vague - there is no distinction between nodes of different semantic types, such as objects, attributes and values. However, nodes displayed in capital letters - aside from the alphabetical indexing of rock type names - are those that are codified within the database. If a node possesses a suffix "*", it indicates that the name is used more than once in different contexts. No code appears to be used more than twice.
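
The difference between a count for a type alone and a count that also includes its sub-types can be illustrated with a short Python sketch. The parent links and specimen labels below are hypothetical and stand in for whatever the loaded sample contains; the sketch does not reproduce how the tool itself computes the figure in the status bar.

```python
from collections import Counter

# Hypothetical child -> parent links and specimen labels, for illustration only.
PARENT = {"BASALT": "VOLCANIC", "ANDESITE": "VOLCANIC", "VOLCANIC": None}
LABELS = ["BASALT", "BASALT", "ANDESITE", "VOLCANIC"]


def own_counts(labels):
    """Count specimens labelled with exactly this type (what the status bar shows)."""
    return Counter(labels)


def subtree_counts(labels, parent):
    """Count specimens labelled with this type or any of its sub-types."""
    totals = Counter()
    for label in labels:
        node = label
        # Credit the specimen to its own type and to every ancestor.
        while node is not None:
            totals[node] += 1
            node = parent.get(node)
    return totals


OWN = own_counts(LABELS)
TOTAL = subtree_counts(LABELS, PARENT)
```

In this invented sample the abstract type has one instance of its own but four once its sub-types are included, so the two figures should not be confused when judging how well a type is represented.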

Software status

The software is freely available for usage with the IGBA dataset and is contained in this resource release. Whilst every effort has been made to ensure the code is bug-free, it is relatively untested. Any bugs can be reported to the email address included in the "about" window, displayed by selecting about from the files menu.

[1] Various FORTRAN routines are also included with the database for those that use this language.

[2] The database has been made available for academic usage and should not be used commercially.

[3] Future versions of the tool will also support the visualisation of the extra information fields.

[4] These are specified in files named igbavw.hry, igbavw.att and igbavw.cde respectively. igbavw.hry and igbavw.cde contain sanitised versions of the descriptor names, in that the usage of abbreviations has been removed. The attribute value positions recorded within the igbavw.att file are not those recorded within the structure.txt file provided with the database. The string index positions given within the igbavw.att file are 7 less than the index positions within structure.txt.
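
The relationship in footnote [4] amounts to a constant shift between the two files. A minimal Python sketch follows, assuming the footnote is read as "position in igbavw.att = position in structure.txt minus 7"; the function names and the example position are hypothetical.

```python
# Stated difference between index positions in structure.txt and igbavw.att.
OFFSET = 7


def to_att_position(structure_position):
    """Convert a structure.txt index position to the igbavw.att convention."""
    return structure_position - OFFSET


def to_structure_position(att_position):
    """Convert an igbavw.att index position back to the structure.txt convention."""
    return att_position + OFFSET


# Invented example: a field listed at position 20 in structure.txt.
EXAMPLE_ATT = to_att_position(20)
```

Applying one conversion and then the other returns the original position, which gives a simple check when moving between the two files.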

 
