Extracting Knowledge from Biological Descriptions

Andrew Taylor
Department of Computer Science and Engineering
University of New South Wales
Sydney, Australia
andrewt@cse.unsw.edu.au

Abstract:

We describe a system which performs biological identification on the basis of natural language descriptions. The system parses texts containing large sets of biological descriptions in restricted natural language and constructs a knowledge base. The system can semi-automatically adapt to a text by extending its lexicon and perhaps its grammar. The constructed knowledge bases are used to perform interactive identification of specimens. The system automatically constructs HTML forms to provide a World Wide Web identification interface which can be integrated with hypermedia resources. We describe the system's implementation and its performance on two large botany texts.

Introduction

This work addresses a very old problem - given a set of biological taxonomic descriptions, how do we efficiently find the description corresponding to a specimen? As the first systematic biologists developed classifications they quickly realised they also needed to develop methods for finding the appropriate classification for a specimen. The obvious approach is a linear search of classifications. A good example of this is the common approach of novices, who leaf through a book until they find a picture they think matches their specimen. This, unfortunately, is unreliable and too time-consuming for all but the smallest sets of taxa.

The first remedy - construction of decision trees - appeared in the seventeenth century in the works of Morison and Ray [19] with classifications presented diagrammatically as trees. The great taxonomist Linnaeus [15] gave these trees the name keys by which biologists have known them since. The key as a method of identification was fully realised in Lamarck's Flora Francoise [14]. The modern key in Figure 1 (paraphrased from [5]) differs little from the keys Lamarck constructed two centuries ago.

Not surprisingly given their long and widespread use, keys are effective tools but they have weaknesses. Some of the decision questions may be unanswerable for a particular specimen. For example, flower characters are commonly used in botanical keys but plants are often not in flower. This will make use of the key difficult or impossible but in many cases a diagnosis could quickly be made with characters present in the particular specimen but not employed in the key.

Occasionally specimens will be atypical in some character causing diagnosis to follow a fruitless path through the key. An error in determining whether a specimen meets a decision criterion produces the same result. In both cases backtracking is difficult when the user is uncertain at which node the diagnosis diverged from the correct path. Both these problems are exacerbated if the person attempting the diagnosis is inexpert.

1 Scales in 17 rows ................................................. 2
  Scales in 19 rows ................................................. 4

2 Black or brown above, without pale spots on scales ................ 3
  Blackish above, scales with pale spots ............... Butler's Snake

3 Immaculate black above, crimson below ....... Red Bellied Black Snake
  Brown above, white to orange below ................. King Brown Snake

4 170-200 ventral scale rows ...................... Spotted Black Snake
  210-240 ventral scale rows .......................... Collett's Snake

Figure 1 - Key to Australian Snakes of Genus Pseudechis
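
A key such as Figure 1 is in effect a small decision tree. The sketch below is an illustration only: the dictionary encoding and the names KEY and follow_key are hypothetical and are not taken from any existing identification system.

```python
# Hypothetical encoding of the key in Figure 1: each couplet number maps to
# two leads, and each lead ends in either another couplet number or a taxon.
KEY = {
    1: [("Scales in 17 rows", 2), ("Scales in 19 rows", 4)],
    2: [("Black or brown above, without pale spots on scales", 3),
        ("Blackish above, scales with pale spots", "Butler's Snake")],
    3: [("Immaculate black above, crimson below", "Red Bellied Black Snake"),
        ("Brown above, white to orange below", "King Brown Snake")],
    4: [("170-200 ventral scale rows", "Spotted Black Snake"),
        ("210-240 ventral scale rows", "Collett's Snake")],
}


def follow_key(key, choices, start=1):
    """Follow a key given the index (0 or 1) of the lead chosen at each couplet.

    Returns the taxon name reached, or None if the choices run out first.
    """
    node = start
    for choice in choices:
        _, outcome = key[node][choice]
        if isinstance(outcome, str):
            return outcome      # reached a leaf: a taxon name
        node = outcome          # otherwise move on to the next couplet
    return None
```

The fixed order of couplets is what makes a key rigid: a user who cannot answer couplet 1 has no way to reach couplets 2 to 4.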

Computer-based Identification

Biologists' first attempts to construct more flexible identification methods which would allow the user to choose the characters used in a diagnosis were largely frustrated by the media available. Typically these methods were based on punched cards but other even more ingenious mechanical techniques were tried. Unfortunately all of these were expensive to produce and most suffered from a degree of clumsiness.

The arrival of computers brought a much more suitable medium which was quickly employed in the development of identification systems such as [12]. The most successful of current computer-based identification systems is DELTA [7]. The heart of DELTA is a language into which taxonomic descriptions must be encoded to use the system. One of the components of DELTA is INTKEY, an interactive identification program which given a set of descriptions in DELTA format can efficiently and flexibly perform identifications.

However, DELTA's success depends more on its integration with traditional paper-based technology than on INTKEY's virtues for identification. DELTA can generate traditional keys, and its language is carefully designed so that DELTA descriptions can be converted to natural language if suitable care is taken in their construction. This has allowed large books to be generated from DELTA descriptions [23]. Biologists' reactions to DELTA and other computer-based identification systems are mixed. It is clear that natural language will dominate new taxonomic work for some time to come and there is, of course, a vast existing body of such descriptions.

Taxonomic Descriptions As Natural Language

Our approach is in an important sense the inverse of DELTA. We have built an identification system based on automatically building a knowledge base from natural language taxonomic descriptions. If you examine taxonomic descriptions, such as Figure 2 (taken from [13]), it is clear they are written in a distinctive sublanguage. The descriptions contain few verbs and are mainly a terse sequence of noun phrases which are rich in adjectives and adverbs. Many of the nouns and adjectives are peculiar to this domain; others may have a more specialised meaning within this domain.

Shrubs with few erect, slender, few-branched stems to 3 m high; glabrous except at apex of bracts or sparsely to moderately rusty-hairy on axes and lower surface of leaves. Adult leaves narrow-obovate to narrow-spathulate, 8-28 cm long, 20-65 mm wide, apex usually acute to truncate, margins rarely entire or more often toothed with 0-3 pairs of teeth in lower half, base very attenuate, leathery but usually not harsh, minutely granulate when dried; both surfaces with prominently raised veins. Conflorescences few, 90-250-flowered; basal flowers opening first; involucral bracts mostly 5-9 cm long, bright red.

Figure 2 - Taxonomic description of the Waratah

This sublanguage is clearly much more amenable to computer analysis than unrestricted English. Even so, many of the problems that make natural language understanding so difficult persist to some degree. Another important aspect is that our purpose of identification does not require complete understanding of the text. Our system can operate successfully even with incomplete knowledge extraction.

Although our system is intended to handle taxonomic descriptions of any group, so far our efforts have been restricted to two botany texts with usefully disparate characteristics. The Flora of New South Wales [13] is a four volume 2,000 page text containing descriptions of the over 6,000 plant species known from the state of New South Wales. There is great variety in the plants involved from tiny aquatic plants a few millimetres across to towering forest trees over 50 metres tall. There is a corresponding variety in the nature of the descriptions. This is compounded by the text having over twenty authors.

The Flora of Australia - Volume 19 [9] is a 540 page volume describing the almost 600 species in the genera Eucalyptus and Angophora. In contrast to the previous text, the species described are closely related and quite similar and there is a single main author. As a result its descriptions are much more consistent in nature than those in [13].

Ontology

Taxonomists consider the essence of a taxonomic description to be a list of properties possessed by the taxon. Each property in the list is considered to be composed of a character and one or more states. A character describes the property that varies between taxa, for example - petal colour or leaf width. A state is the instance of the property for the taxon being described - for example red or 7cm. Commonly multiple states of a character will occur in one taxon - for example, petal colour may be red or pink, leaf width may be between 8cm and 12cm. The terminology is not universal with some authors meaning by character the combination of what we have termed character and state [16].

A character may be an arbitrarily complicated statement with the possible states being true or false. This would require a very expressive knowledge representation. We however wanted a simple knowledge representation to make the identification process very efficient even if this was at the expense of some characters being unrepresentable. Fortunately most characters have a simple form just referring to a simple attribute such as colour, shape, length or hairiness of a particular part of the specimen. We have relied on this using a triple consisting of specimen part, attribute and value/s as the basis of our representation. The specimen part can be a simple name e.g. petal or a compound indicating a subpart, e.g. leaf hairs or subset of parts e.g. female flowers. Attributes are simple names. Values have a simple syntax allowing representation of sets and ranges. The syntax for values also allows qualifiers e.g. usually, often, sometimes and rarely to be represented.
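
The triple representation can be sketched as a small data structure. This is a minimal illustration under our own assumptions, not the system's actual encoding: the class names Range, Value and Character are hypothetical, and treating "acute to truncate" as a set of two qualified states is one possible reading.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Range:
    """A numeric range such as 8-28 cm."""
    low: float
    high: float
    unit: str = ""


@dataclass(frozen=True)
class Value:
    """One state, optionally qualified by usually, often, sometimes or rarely."""
    state: Union[str, Range]
    qualifier: Optional[str] = None


@dataclass(frozen=True)
class Character:
    """Triple of specimen part, attribute and value/s."""
    part: Tuple[str, ...]       # ("leaf",) or a compound such as ("leaf", "apex")
    attribute: str
    values: Tuple[Value, ...]


# Two properties taken from the description in Figure 2.
leaf_length = Character(("leaf",), "length", (Value(Range(8, 28, "cm")),))
apex_shape = Character(("leaf", "apex"), "shape",
                       (Value("acute", "usually"), Value("truncate", "usually")))
```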

Our intention was to extend this simple representation only as strictly necessary for overall system performance. The only extension that has so far been necessary is for relational characters. For example this allows the representation of leaf length is more than twice leaf width. There are certainly other forms of characters that can not be represented but this has not so far been important. In many cases these characters can not be easily employed in the identification process.

Parsing

The idiomatic nature of the taxonomic sublanguage and the relative simplicity of its structure have led us to construct a grammar rather than adopt an existing grammar. A particular concern was that a general-coverage parser would have an unacceptable error rate in determining the attachment of adjectives because of the dense nature of the sublanguage. This was certainly a problem in cursory trials of general-coverage parsers.

The sublanguage was sufficiently restricted that construction of a special purpose parser with useful coverage was not difficult. Several man-days were invested in constructing incrementally about 70 definite clause grammar rules [21]. These were combined with an initial phase which split a sentence into components allowing information to be obtained when only part of a sentence can be parsed. Unfortunately we found proceeding past an unparsed sentence component resulted in an unacceptable error rate but information from preceding components can be safely used.
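
The policy of using only the components that precede the first parse failure can be sketched as follows. The sketch assumes components are delimited by semicolons and commas, which the text does not state, and parse_component is a hypothetical stand-in for the real grammar.

```python
import re


def split_components(sentence):
    """Split a description sentence into components at semicolons and commas."""
    parts = re.split(r"[;,]", sentence.rstrip(". "))
    return [p.strip() for p in parts if p.strip()]


def extract_until_failure(sentence, parse_component):
    """Parse components left to right, stopping at the first one that fails.

    parse_component is a hypothetical function that returns a list of
    extracted facts, or None when the component cannot be parsed.
    """
    facts = []
    for component in split_components(sentence):
        result = parse_component(component)
        if result is None:
            break               # discard this and every later component
        facts.extend(result)
    return facts
```

Stopping at the first failure trades coverage for accuracy: later components often depend on context set by the one that could not be parsed.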

The construction of the lexicon was more problematical than the grammar. A sizable fraction of the sublanguage is not even found in general dictionaries nor was a suitable on-line compilation of biological terms available. Instead as the basis of our lexicon we used Radford's textbook work on plant systematics [22]. This was more valuable than a dictionary because it provided an extensive listing of characters and their possible states. This is useful because the character is often partly or completely implicit in the text and must be inferred from its state and the context.

It was necessary to accommodate sloppiness in the texts in the definition of characters. Radford quite logically treats two-dimensional shape and three-dimensional shape as separate characters. However it is not uncommon for two-dimensional and three-dimensional terms to be used interchangeably or mixed improperly, but not ambiguously in the text. The easiest way to accommodate this was to merge these two characters into a single character shape.

Radford's text provided 1500 entries for the lexicon which were supplemented by a further 500 words added manually as they were encountered during parses of the texts. These were mainly words which see wide use outside botany and thus were not present in the lists extracted from [22].

The base lexicon is supplemented by a number of lexical rules. Some of these implement general English morphology; for example, almost all plurals of nouns are generated this way rather than by being explicitly included in the lexicon. Other rules are specific to the domain; for example, botanists often use the prefix sub- to mean almost, using subglobose to mean almost globose. Botanists also commonly construct terms using Latin or Greek prefixes indicating size, number or symmetry and Latin or Greek suffixes indicating plant parts, e.g. tri-foliate or micro-phyllous.
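
Two such lexical rules can be sketched as a fallback applied when a word is missing from the base lexicon. The entry format and the name lookup are hypothetical, and only regular plurals in -s and the prefix sub- are covered.

```python
# Hypothetical base lexicon; entries are tuples tagged as part or state.
BASE_LEXICON = {
    "leaf": ("part", "leaf"),
    "petal": ("part", "petal"),
    "globose": ("state", "shape", "globose"),
}


def lookup(word, lexicon=BASE_LEXICON):
    """Return a lexicon entry for word, applying lexical rules if needed."""
    if word in lexicon:
        return lexicon[word]
    # Domain rule: the prefix sub- means "almost", as in subglobose.
    if word.startswith("sub") and word[3:] in lexicon:
        entry = lexicon[word[3:]]
        if entry[0] == "state":
            return entry + ("almost",)
    # General English morphology: regular plurals of nouns.
    if word.endswith("s") and word[:-1] in lexicon:
        entry = lexicon[word[:-1]]
        if entry[0] == "part":
            return entry + ("plural",)
    return None
```

Irregular plurals such as leaves are not handled by this sketch and would need either an explicit entry or a further rule.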

Coverage

Our parsers provide a useful but far from complete coverage of the texts. It is difficult to provide a precise assessment of coverage because both texts contain varying amounts of information apart from plant descriptions such as details of a plant's commercial value. These sentences contain words that we are not interested in providing in the lexicon and often follow forms we do not need to parse. Furthermore, there is a tendency for the sentences containing the most important descriptive information to follow typical forms and thus they are more easily parsed. Thus raw statistics of sentence coverage which could be easily gathered are not useful.

Our manually constructed parser extracted 100,000 character/state pairs from [13] and 20,000 from [9]. Unfortunately we have no way to estimate how many character/state pairs are present in the texts. Our only useful assessment method is manual examination of randomly sampled taxa. Our manually constructed parser typically yields between 60 and 80% of the character/state pairs we would like to extract from each description. This is sufficient for the system to function but overall system performance at identification fell short of desired levels.

If our objective was to only parse the two texts mentioned above then we would have continued extending our grammar and lexicon by hand. As we hope to parse many texts, we instead examined techniques to automatically assist in the extension of the lexicon and grammar.

Extending the Lexicon and Grammar

Categorising unknown words so they can be added to the lexicon appears to be tractable in many cases. It can be apparent even from a single occurrence of an unknown word. For example, suppose the word puce is unknown. If we encounter the phrase - leaves puce to red then it is likely puce should be in the same category as red, a state of the character colour.

The reversibility of Prolog DCG clauses means this can be implemented elegantly and trivially by leaving the category of the word uninstantiated and examining how it is instantiated by legal parses.

Useful negative information can also be discerned from single occurrences of unknown words. If we see the phrase - circular puce leaves then it is unlikely puce belongs in the same category as circular. We also exploit this but unfortunately the non-logical nature of negation in standard Prolog means it must be done less elegantly than in the positive case.
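
The idea of leaving a word's category open and seeing which assignments permit a legal parse can be imitated outside Prolog by trying each candidate in turn. The following Python sketch is a stand-in for the DCG approach, not the system's code; the two-rule toy grammar, the character list and all names are assumptions made for illustration.

```python
# Hypothetical mini-lexicon: word -> (kind, character).
LEXICON = {
    "leaves": ("part", None),
    "red": ("state", "colour"),
    "circular": ("state", "shape"),
    "to": ("to", None),
}
CHARACTERS = ["colour", "shape", "texture"]


def legal(tags):
    """Toy grammar accepting 'part state to state' or 'state state part'."""
    kinds = [kind for kind, _ in tags]
    if kinds == ["part", "state", "to", "state"]:
        return tags[1][1] == tags[3][1]   # a range joins states of one character
    if kinds == ["state", "state", "part"]:
        return tags[0][1] != tags[1][1]   # stacked adjectives differ in character
    return False


def candidate_characters(phrase, unknown):
    """Characters the unknown word could be a state of, judged by legal parses."""
    found = []
    for character in CHARACTERS:
        tags = [("state", character) if word == unknown else LEXICON[word]
                for word in phrase.split()]
        if legal(tags):
            found.append(character)
    return found
```

Here "leaves puce to red" leaves colour as the only candidate, while "circular puce leaves" merely rules out shape, which mirrors the positive and negative cases described above.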

We also attempt to use definitions from a general dictionary, an on-line version of Collins English Dictionary [10], when available in classification of unknown words. Our system does not parse the definitions but rather examines them for words in our lexicon. For example, the dictionary definition for puce is adj, a dark purplish brown suggesting that puce might belong to the same category as brown and purple.

Thirdly, we use statistical information for unknown words which occur a sufficient number of times. This is done by extracting bigrams from the text and looking for known words which have similar associations [8]. As yet the error rates on classification of unknown words using the above three methods are too high for the classifications to be used without manual checking; nonetheless they are useful.
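
One simple way to compare the associations of words is to count each word's left and right neighbours and compare the resulting count vectors. The sketch below uses cosine similarity for this, which is a substitute chosen for brevity and is not the statistic of [8]; all function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def neighbour_counts(tokens):
    """Map each word to a Counter of its left and right neighbours."""
    contexts = {}
    for left, right in zip(tokens, tokens[1:]):
        contexts.setdefault(left, Counter())[("R", right)] += 1
        contexts.setdefault(right, Counter())[("L", left)] += 1
    return contexts


def cosine(a, b):
    """Cosine similarity between two neighbour-count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def most_similar_known(unknown, known, tokens):
    """Return the known word whose bigram contexts best match the unknown word."""
    contexts = neighbour_counts(tokens)
    target = contexts.get(unknown, Counter())
    scored = [(cosine(target, contexts.get(word, Counter())), word)
              for word in known]
    return max(scored)[1] if scored else None
```

With very few occurrences such scores are unreliable, which is consistent with restricting this method to words that occur a sufficient number of times.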

More useful and unfortunately more difficult is automatically or semi-automatically adding grammar rules to improve the grammar's coverage. The results of Zelle and Mooney [24] using machine learning to construct parsers are very promising. Unfortunately their methods probably can not be directly applied in our context because of the difficulty in obtaining a training set. We are hopeful inductive logic programming techniques can at least be used to suggest possible new grammar rules.

Identification

The task of the identification engine is, for each taxon in the text, to determine the likelihood that the specimen the user has described belongs to this taxon. The identification engine has a knowledge base containing the list of character/state pairs extracted for each taxon. Similarly it is given a list of character/state pairs obtained from the user's description of the specimen. The engine compares the specimen to every taxon. This non-hierarchical approach is alien to biologists not only because they are accustomed to the hierarchical structure of keys but also because hierarchy is very important in taxonomy.

The user's description is compared to that of a taxon by comparing the states of characters which are known for both. The comparison of two states is not treated as a boolean operation. For example, suppose a taxon is described in the text as having leaves 10-15mm long and a specimen is described as having leaves 17mm long. This is only a minor discrepancy which could easily result from an aberrant specimen, an error in the text or an error in measurement of the specimen. If instead the specimen were described as having leaves 50mm long, this is a major discrepancy and it is much less likely that the specimen belongs to the described taxon. At least notionally when comparing states the identification engine produces the conditional probability that if the specimen belongs to the taxon the user would describe it as having the given state.
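
A graded comparison of a measurement against a described range might look like the following. The formula and the tolerance parameter are our assumptions, chosen only to reproduce the behaviour in the example above; the result is an unnormalised score standing in for the conditional probability, not the formula the system uses.

```python
from math import exp


def range_match(low, high, observed, tolerance=0.25):
    """Score between 0 and 1 for an observed value against a described range.

    Values inside [low, high] score 1. Outside the range the score decays
    with the distance to the nearest bound, measured relative to that bound.
    tolerance is a hypothetical parameter controlling how fast it decays.
    """
    if low <= observed <= high:
        return 1.0
    bound = low if observed < low else high
    scale = abs(bound) or 1.0           # avoid dividing by a zero bound
    relative_gap = abs(observed - bound) / scale
    return exp(-(relative_gap / tolerance) ** 2)
```

With these settings a 17mm leaf still scores well against a range of 10-15mm, while a 50mm leaf scores close to zero.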

This concerns many biologists because identification is commonly viewed as a determinate rather than probabilistic process. This is fostered at least partly by specialists typically dealing with identifications made with a high degree of certainty. They not only have the skills and resources to make identifications with a high degree of certainty but they also often need such a degree of certainty. Although it is undeniable that a fraction of even specialists' identifications are incorrect, there is considerable resistance to viewing identification as a probabilistic process [20]. We believe accommodating the probabilistic nature of the process becomes more important when dealing with non-specialists who are both less skilled and demand less certainty. We hope, in future work, by instrumenting our system to demonstrate that a probabilistic approach has significant benefits. Certainly solid evidence will be needed to convince many biologists that a probabilistic approach is appropriate.

For numeric characters it is easy to develop a formula that estimates a conditional probability in some reasonable and well-behaved way. Non-numeric characters are more problematic. The only exception is colours where Euclidean distance between their coordinates in a metric space designed to match human perceptions can be used as an estimate of their similarity. For other non-numeric characters there is a facility to indicate some degree of similarity between states in the lexicon. For example, the lexicon indicates the shapes hastiform and triangular are similar. This similarity is applied transitively so if the lexicon indicates state X is similar to state Y and Y is similar to Z then the identification engine will assume there is a lesser degree of similarity between X and Z. In many cases states are not just similar but exact synonyms and this can be indicated using the same facility.
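
Transitive similarity with a lesser degree at each step can be sketched by multiplying the direct similarities along a chain of states. The multiplicative rule and the numeric weights are assumptions made for illustration; the text does not say how the lesser degree is computed.

```python
from collections import deque


def similarity(a, b, pairs):
    """Best product of direct similarities along any chain from a to b.

    pairs maps (state, state) to a similarity in (0, 1]; 1.0 marks exact
    synonyms. States with no connecting chain score 0.
    """
    if a == b:
        return 1.0
    graph = {}
    for (x, y), s in pairs.items():
        graph.setdefault(x, []).append((y, s))
        graph.setdefault(y, []).append((x, s))
    best = {a: 1.0}
    queue = deque([a])
    while queue:
        node = queue.popleft()
        for neighbour, s in graph.get(node, []):
            score = best[node] * s
            if score > best.get(neighbour, 0.0):
                best[neighbour] = score     # found a stronger chain
                queue.append(neighbour)
    return best.get(b, 0.0)
```

Because every weight is at most 1, longer chains can never score higher than their strongest link, and synonyms pass similarity on undiminished.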

Synonyms also occur in the naming of specimen parts and this can be indicated in the lexicon. More problematic is where specimen parts can be easily confused. For example, non-expert users will often use the term leaves when describing Australian Acacias when the structure they are referring to is actually a phyllode and described as such in botanical texts. A solution in this case is easy because indicating that phyllode is a synonym for leaf does not cause other conflicts. Other cases are more difficult to handle. For example, non-experts will often confuse the green branchlets of a Casuarina with cylindrical leaves. Treating branchlet and leaf as synonyms will cause other problems. We do not have a good solution to this problem.

Our system can automatically obtain similarity information where two or more texts are available describing common taxa. For example, if one text describes a taxon as having fusiform buds and another text describes it as conical then this suggests that these are similar terms. If this occurs in multiple cases then the assumption of similarity seems safe. Comparing the results for texts with common taxa is also valuable for detecting errors in the parser and in the texts. The combination of information from multiple texts may be useful in many situations.
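
Collecting such evidence from two texts can be sketched as counting how often two different states are given for the same character of the same taxon. The data layout and the min_support threshold are hypothetical.

```python
from collections import Counter
from itertools import product


def suggest_similar_states(text_a, text_b, min_support=2):
    """Suggest state pairs that two texts use for the same taxon and character.

    text_a and text_b map (taxon, character) to a set of states. Returns the
    pairs of differing states seen together at least min_support times.
    """
    support = Counter()
    for key in text_a.keys() & text_b.keys():
        only_a = text_a[key] - text_b[key]
        only_b = text_b[key] - text_a[key]
        for pair in product(only_a, only_b):
            support[tuple(sorted(pair))] += 1
    return {pair for pair, n in support.items() if n >= min_support}
```

Pairs that fall below the threshold are not discarded as useless; a single disagreement may instead point to an error in the parser or in one of the texts.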

The conditional probabilities obtained for each character common to the specimen description and taxon description are combined using Bayes' Rule to give, at least notionally, an overall probability that the specimen belongs to the taxon. In practice, the presence of unknown distributions and invalid assumptions of independence mean that the results are best viewed as ad-hoc estimates. Nonetheless, this method seems to perform well. There are some similarities with the ranking algorithms of information retrieval [11] and we hope to explore if these can offer better performance.
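
Under the independence assumption, combining the per-character probabilities reduces to multiplying them and normalising over all taxa. The sketch below assumes equal priors unless others are supplied and is an illustration of that calculation, not the system's ranking code.

```python
from math import prod


def rank_taxa(likelihoods, priors=None):
    """Combine per-character likelihoods into a normalised score per taxon.

    likelihoods maps taxon -> list of P(observed state | taxon), one entry
    per character known for both specimen and taxon. Characters are treated
    as independent. Returns (taxon, score) pairs, best first.
    """
    scores = {}
    for taxon, values in likelihoods.items():
        prior = 1.0 if priors is None else priors[taxon]
        scores[taxon] = prior * prod(values)
    total = sum(scores.values())
    if total == 0:
        return [(taxon, 0.0) for taxon in scores]
    return sorted(((taxon, score / total) for taxon, score in scores.items()),
                  key=lambda item: item[1], reverse=True)
```

Note that a taxon with few characters in common with the specimen has few factors below 1 and is therefore not penalised in this sketch; a practical system would need some correction for that.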

Taxonomists consider some characters more important than others. This can be because they are less variable, more easily determined correctly or better known. If giving all characters equal weight proves a weakness it may be possible to remedy this where a text contains keys by examining the keys to see which characters are preferred.

User Interface

Our identification system has been designed for users who have some knowledge of the domain but are not experts. In the case of our two botanical texts, such users might include farmers, park rangers, gardeners, bush regenerators and biologists from other disciplines. We feel there is much more scope to improve the efficiency of the identification process for such users than for experts. We believe computer-based identification could also considerably assist users with no knowledge of the domain. However this will require considerable support at the interface level which is beyond the scope of our current work.

The identification process is not independent of the text which supplied the identification database. Rather we expect the user to supply information about the specimen until the set of likely candidates is sufficiently small, then the user will make a final determination by examining the textual descriptions of these candidates and any accompanying material such as figures or photographs.

We have explored several different user interfaces. As we have a parser for the sublanguage, it was easy to construct an interface where the user describes the specimen in similar natural language to that of the texts. This yields a very simple interface and it has the advantage that the user can choose which characters of the specimen to describe. This is not only convenient but if the user is aware that some of the specimen's character states are unusual they can narrow the candidate set very quickly. The primary disadvantage of this interface is that it provides no cues to the user about how to describe the specimen or which characters will be most useful in narrowing the specimen set.

Another possibility is to provide a menu for each non-numeric character, with a menu entry for each state of that character, and entry fields for numeric characters. This has the advantages of the natural language interface and it provides cues to the user for describing the specimen. Unfortunately it also can produce a very crowded interface. The diversity of the species described in [13] and the resulting number of characters and states make it difficult to provide such an interface. The much more homogeneous descriptions of [9] and the resulting smaller number of characters and character states make it more suitable but the resulting interface is still very crowded.
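
Generating such a menu interface from the knowledge base can be sketched as below. The form layout, the action URL and the function name build_form are placeholders of ours; the sketch only shows the mapping from characters to menus and entry fields.

```python
from html import escape


def build_form(characters, action="/identify"):
    """Build an HTML form from a mapping of character name to its states.

    A list of states yields a menu; None marks a numeric character and
    yields a text field. The action URL is a placeholder.
    """
    rows = []
    for name, states in characters.items():
        field = escape(name, quote=True)
        if states is None:
            control = f'<input type="text" name="{field}">'
        else:
            # The leading empty option lets the user leave a character unset.
            options = "".join(f"<option>{escape(s)}</option>"
                              for s in ["", *states])
            control = f'<select name="{field}">{options}</select>'
        rows.append(f"<p>{escape(name)}: {control}</p>")
    body = "\n".join(rows)
    return (f'<form method="post" action="{escape(action, quote=True)}">\n'
            f'{body}\n<input type="submit" value="Identify">\n</form>')
```

Every character adds a row to the form, which makes plain why a text with many characters and states yields a crowded page.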

A third alternative is to mimic the use of a key, incrementally querying the user for the state of characters. Some flexibility could be provided to avoid the disadvantages of keys described in the introduction. For example, the user could refuse a query and a query based on another character would then be provided. However, when the user is aware that some character states are unusual, the previously described methods will narrow the candidate set much more quickly. Conversely, when the user is left with a group of largely similar candidates and is unaware of appropriate discriminating characters, this alternative may be more efficient at obtaining the information necessary to reduce the set of candidates.
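
One way to choose the next query is to prefer the character whose states divide the remaining candidates most evenly. The selection rule below (minimise the largest remaining group) is our assumption, as is the data layout; it also shows how a refused query leads to a different character being offered.

```python
from collections import Counter


def next_query(candidates, refused=()):
    """Pick the character whose states split the candidates most evenly.

    candidates maps taxon -> {character: state}. Characters in refused are
    skipped, so the user can decline a query and be offered another.
    Returns None when no character is left to ask about.
    """
    best, best_size = None, None
    characters = sorted({c for states in candidates.values() for c in states})
    for character in characters:
        if character in refused:
            continue
        groups = Counter(states.get(character, "unknown")
                         for states in candidates.values())
        largest = max(groups.values())      # worst case left after the answer
        if best_size is None or largest < best_size:
            best, best_size = character, largest
    return best
```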

We believe our eventual interface will be a hybrid of all three approaches but so far integrating them has proved troublesome. We have implemented a natural language-based interface and a menu-based interface as HTML forms [1]. This allows identification of eucalypts using information extracted from [9] to be conducted via the World Wide Web [3]. Both interfaces display the current candidate set as hypertext links to an HTML version of the source text. Thus by clicking on a candidate taxon's name, the user can see the description of the species from the text complete with figures and links to distribution maps.

The centralisation of the knowledge base that this use of the World Wide Web provides offers a very important potential for on-going maintenance of the knowledge base. Although only published six years ago [9] is already seriously out of date because a hundred or more new taxa have since been described by taxonomists. It may be many years before this is remedied by a new edition. A central electronic knowledge base can be much more cheaply and frequently maintained than a paper text.

We have also implemented a natural language-based interface in X. This uses the tool Wafe [17] which allows the X windows calls to be embedded in a small external Tcl program [18]. This is an extremely convenient way to build an X interface into a Prolog program. This allows use of the system in the field on a notebook computer.

Performance

The performance of our system is already surprisingly good. External users successfully employed even alpha versions of the system. We are currently obtaining quantitative performance statistics, including speed and accuracy, for identifications made with our system and for identifications made with the original texts.

Our system when using the knowledge base extracted from [13] is dealing successfully with an order of magnitude more taxa than any computer-based identification system we are aware of. Perhaps more importantly it is also dealing successfully with taxa of far more disparate character than any computer-based identification system we are aware of.

Implementation

3,000 lines of SICStus Prolog [4] make up almost the entirety of our system. Apart from the Prolog, 120 lines of Tcl are used to implement an X interface and a number of small shell scripts are used in the initial processing of texts and as wrappers for the HTML-based interfaces.

Many think that Prolog is well-suited for natural language parsing applications [6]. This certainly proved true in our case. Not only was the top-down backtracking approach of DCGs a comfortable formalism for our main grammar but their reversibility proved very useful when extending the lexicon automatically. It is impressive that Prolog was well-suited to implementation of not just part but the entirety of a system of diverse components. We feel that only other logic programming languages could match Prolog for its suitability for this system.

Conclusions and Future Work

We believe we have more than established the viability of basing a biological identification system on natural language descriptions.

In upcoming work we will build an identification system for amphipods (small crustaceans) based on [2]. The small degree of lexical overlap between this and our current texts and the grammatical differences should provide an interesting test of the generality of our methods.

Acknowledgements

I would like to thank Gwen Harden and the Royal Botanic Gardens Sydney for their vital cooperation. I would also like to thank the Environmental Resources Information Network for making [9] electronically available.

REFERENCES

  1. M. Andreessen, "A HTML Primer", http://www.ncsa.uiuc.edu/demoweb/html-primer.html.
  2. J. L. Barnard and G. S. Karaman, "The Families and Genera of Marine Gammaridean Amphipoda", Records of the Australian Museum, Supplement 13, 1991.
  3. T. J. Berners-Lee, R. Cailliau and J.-F. Groff, "The World-Wide Web", Computer Networks and ISDN Systems 25 (1992) 454-459, North-Holland.
  4. M. Carlsson and J. Widen, SICStus Prolog Users Manual, SICS Research Report R88007B, October 1988.
  5. H. Cogger, Reptiles and Amphibians of Australia, Reed Books, 1992.
  6. M. A. Covington, Natural Language Processing for Prolog Programmers, Prentice-Hall, 1994.
  7. M. J. Dallwitz, "DELTA and INTKEY", Advances in Computer Methods for Systematic Biology, R. Fortuner (Ed.), Johns Hopkins University Press, 1993.
  8. T. Dunning, "Accurate Methods for the Statistics of Surprise and Coincidence", Computational Linguistics, 19(1), March 1993.
  9. Flora of Australia Volume 19, Myrtaceae, Eucalyptus, Angophora, Australian Government Publishing Service, Canberra, 1988.
  10. E. A. Fox, "Development of the CODER system", Information Processing and Management, 23(4), 1987.
  11. W. B. Frakes and R. Baeza-Yates (Eds), Information Retrieval Data Structures and Algorithms, Prentice-Hall, 1992.
  12. H. G. Gyllenberg, "A General Method for Deriving Determination Schemes for Random Collections of Microbial Isolates", Annals of the Finnish Academy of Sciences, ser. A, IV Biology, 69, 1-23, 1963.
  13. G. Harden (Ed.), Flora of New South Wales, University of New South Wales Press, 4 Volumes, 1991-1994.
  14. J. B. P. Lamarck, Flora Francoise, 1st edition, Paris, Imprimerie Royale, 1778.
  15. C. Linnaeus, Clavis Classium in Systemate Phytologorum in Bibliotheca Botanica, Amsterdam, 1736.
  16. E. Mayr and P. D. Ashlock, Principles of Systematic Zoology, McGraw-Hill, 1991.
  17. G. Neumann and S. Nusser, "Wafe - An X Toolkit Based Front end for Application Programs in Various Programming Languages", USENIX Winter 1993 Technical Conference, San Diego, California, January 25-29, 1993, see also ftp.wu-wien.ac.at:pub/src/X11/wafe.
  18. J. K. Ousterhout, "Tcl: An Embeddable Command Language", USENIX Winter 1990 Conference, January 1990.
  19. R. J. Pankhurst, Biological Identification, Edward Arnold, 1978.
  20. R. J. Pankhurst, "Principles and Problems of Identifications", Advances in Computer Methods for Systematic Biology, R. Fortuner (Ed.), Johns Hopkins University Press, 1993.
  21. F. C. N. Pereira and D. H. D. Warren, "Definite Clause Grammars for Language Analysis", Artificial Intelligence, 13:231-278, 1980.
  22. A. E. Radford, Fundamentals of Plant Systematics, Harper & Row, 1986.
  23. L. Watson and M. J. Dallwitz, Grass Genera of the World, C.A.B. International, Wallingford England, 1992.
  24. J. M. Zelle and R. J. Mooney, "Inducing Deterministic Prolog Parsers from Treebanks: A Machine Learning Approach", AAAI-94.