Relations and Background Knowledge

Induction systems, as we have seen so far, might be described as `what you see is what you get'. That is, the output class descriptions use the same vocabulary as the input examples. However, as we will see in this section, it is often useful to incorporate background knowledge into learning.

We use a simple example from Banerji (1980) to illustrate the use of background knowledge. There is a language for describing instances of a concept and another for describing concepts. Suppose we wish to represent the binary number 10 by a left-recursive binary tree of the digits `0' and `1':

	[head: [head: 1; tail: nil]; tail: 0]
`head' and `tail' are the names of attributes. Their values follow the colon. The concepts of binary digit and binary number are defined as:
	x ∈ digit ≡ x = 0 ∨ x = 1
	x ∈ num ≡ (head(x) ∈ digit ∧ tail(x) = nil) ∨ (head(x) ∈ num ∧ tail(x) ∈ digit)
Thus, an object belongs to a particular class or concept if it satisfies the logical expression in the body of the description. Predicates in the expression may test the membership of an object in a previously learned concept.
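
Anticipating the Horn-clause representation discussed below, these definitions can be rendered as a short, runnable logic program. The following sketch is illustrative only: the predicate names digit/1 and num/1, and the tree(Head, Tail) encoding of the [head: ...; tail: ...] records, are our own choices rather than Banerji's notation.

	% A sketch of the digit and num concepts in Prolog, assuming a
	% tree(Head, Tail) encoding of [head: ...; tail: ...] records.
	digit(0).
	digit(1).

	num(tree(H, nil)) :- digit(H).          % a single digit
	num(tree(H, T))   :- num(H), digit(T).  % a number followed by a digit

	% ?- num(tree(tree(1, nil), 0)).        % the tree for the binary number 10
	% true.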

Banerji always emphasised the importance of a description language that could `grow'. That is, its descriptive power should increase as new concepts are learned. This can clearly be seen in the example above. Having learned to describe binary digits, the concept of digit becomes available for use in the description of more complex concepts such as binary number.

Extensibility is a natural and easily implemented feature of Horn-clause logic. In addition, a description in Horn-clause logic is a logic program and can be executed. For example, to recognise an object, a Horn clause can be interpreted in a forward-chaining manner. Suppose we have a set of clauses:

(3)		C1 ← F1 ∧ F2
(4)		C2 ← C1 ∧ F3
and an instance:
(5)		F1 ∧ F2 ∧ F3
Clause (3) recognises the first two terms in expression (5), reducing it to
		C1 ∧ F3
Clause (4) reduces this to C2. That is, clauses (3) and (4) recognise expression (5) as the description of an instance of concept C2.
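
As a concrete, if schematic, illustration, these clauses can be run by a minimal propositional forward chainer. In the sketch below, the predicates rule/2, known/1 and forward/0, and the symbols f1, f2, f3, c1 and c2 standing for F1, F2, F3, C1 and C2, are our own illustrative choices.

	:- dynamic known/1.

	% rule(Head, Body): Head holds when every literal in Body is known.
	rule(c1, [f1, f2]).    % clause (3): C1 <- F1 and F2
	rule(c2, [c1, f3]).    % clause (4): C2 <- C1 and F3

	% Fire any rule whose body is fully known, until nothing new can be added.
	forward :-
	    rule(Head, Body),
	    \+ known(Head),
	    forall(member(B, Body), known(B)),
	    assertz(known(Head)),
	    forward.
	forward.

	% ?- assertz(known(f1)), assertz(known(f2)), assertz(known(f3)),
	%    forward, known(c2).
	% true.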

When clauses are executed in a backward-chaining manner, they can either verify that the input object belongs to a concept or produce instances of concepts. In other words, we attempt to prove that an assertion is true with respect to a background theory. Resolution (Robinson, 1965) provides an efficient means of deriving a solution to a problem, given a set of axioms which define the task environment. The algorithm takes two clauses containing complementary literals and resolves them on the most general unifier of those literals, as illustrated in Figure 10 by the execution of a simple Prolog program.

Figure 10: A resolution proof tree from Muggleton and Feng (1990).

The box in the figure contains the clauses that make up the theory, or knowledge base, and the question to be answered, namely, "is it true that a hammer is heavier than a feather?" A resolution proof is a proof by refutation. That is, to answer the question, we assume that it is false and then see if adding this negative statement to the theory results in a contradiction.

The literals on the left-hand side of a Prolog clause are positive. Those on the right-hand side are negative. The proof procedure looks for complementary literals in two clauses, i.e. literals of opposite sign that unify. In the example in Figure 10,

		heavier(X, Y) ← denser(X, Y) ∧ larger(X, Y)
and
		¬heavier(hammer, feather)
unify to create the first resolvent,
		¬denser(hammer, feather) ∨ ¬larger(hammer, feather).
A side effect of unification is to create the variable substitutions {X/hammer, Y/feather}. By continued application of resolution, we can eventually derive the empty clause, which indicates a contradiction.
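
The theory in the figure can be executed directly as a Prolog program. A minimal sketch follows; since the figure itself is not reproduced here, the clause bodies (denser/2 and larger/2) are assumed from the example.

	% Knowledge base: a hammer is heavier than a feather because it is
	% both denser and larger.
	heavier(X, Y) :- denser(X, Y), larger(X, Y).
	denser(hammer, feather).
	larger(hammer, feather).

	% ?- heavier(hammer, feather).
	% true.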

Plotkin's (1970) work "originated with a suggestion of R.J. Popplestone that since unification is useful in automatic deduction by the resolution method, its dual might prove helpful for induction. The dual of the most general unifier of two literals is called the least general generalisation". At about the same time that Plotkin took up this idea, J.C. Reynolds was also developing the use of least general generalisations. Reynolds (1970) also recognised the connection between deductive theorem proving and inductive learning:

Robinson's Unification Algorithm allows the computation of the greatest common instance of any finite set of unifiable atomic formulas. This suggests the existence of a dual operation of `least common generalization'. It turns out that such an operation exists and can be computed by a simple algorithm.
The method of least general generalisations is based on subsumption. A clause C1 subsumes, or is more general than, another clause C2 if there is a substitution σ such that C1σ ⊆ C2.

The least general generalisation of

(6)		p(g(a), a)
(7)		and	p(g(b), b)
(8)		is	p(g(X), X).


Under the substitution {a/X}, (8) is equivalent to (6), and under the substitution {b/X}, (8) is equivalent to (7). Therefore, the least general generalisation of p(g(a), a) and p(g(b), b) is p(g(X), X), with the inverse substitution {X/{a, b}}.
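
As Reynolds notes above, the least general generalisation of two terms can be computed by a simple recursive algorithm. The sketch below is our own illustrative Prolog implementation; the predicate names lgg/3, lgg/5 and lgg_args/5 are not from Plotkin or Reynolds.

	% lgg(+T1, +T2, -G): G is the least general generalisation of T1 and T2.
	% Mismatched pairs of subterms map to shared variables, recorded in a
	% list of (T1, T2)-Var entries threaded through the recursion.
	lgg(T1, T2, G) :-
	    lgg(T1, T2, G, [], _).

	lgg(T1, T2, T1, S, S) :-                 % identical terms generalise
	    T1 == T2, !.                         % to themselves
	lgg(T1, T2, G, S0, S) :-                 % same functor and arity:
	    nonvar(T1), nonvar(T2),              % generalise argument-wise
	    T1 =.. [F|Args1], T2 =.. [F|Args2],
	    same_length(Args1, Args2), !,
	    lgg_args(Args1, Args2, GArgs, S0, S),
	    G =.. [F|GArgs].
	lgg(T1, T2, V, S, S) :-                  % reuse the variable assigned to
	    member((X, Y)-V, S),                 % a previously seen pair
	    X == T1, Y == T2, !.
	lgg(T1, T2, V, S, [(T1, T2)-V|S]).       % otherwise invent a new one

	lgg_args([], [], [], S, S).
	lgg_args([A|As], [B|Bs], [G|Gs], S0, S) :-
	    lgg(A, B, G, S0, S1),
	    lgg_args(As, Bs, Gs, S1, S).

	% ?- lgg(p(g(a), a), p(g(b), b), G).
	% G = p(g(X), X).    (up to variable renaming)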

Buntine (1988) pointed out that simple subsumption is unable to take advantage of background information which may assist generalisation. Suppose we are given two instances of a concept cuddly_pet:

(9)		cuddly_pet(X) ← fluffy(X) ∧ dog(X)
(10)		cuddly_pet(X) ← fluffy(X) ∧ cat(X)
Suppose we also know the following:
(11)		pet(X) ← dog(X)
(12)		pet(X) ← cat(X)
According to subsumption, the least general generalisation of (9) and (10) is:
(13)		cuddly_pet(X) ← fluffy(X)
since unmatched literals are dropped from the clause. However, given the background knowledge, we can see that this is an over-generalisation. A better one is:
(14)		cuddly_pet(X) ← fluffy(X) ∧ pet(X)
The moral is that a generalisation should only be made when the relevant background knowledge suggests it. So, observing (9), we can use clause (11) as a rewrite rule to produce the generalisation (14), which also subsumes clause (10).
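
Rendered as Prolog, the generalisation (14) together with the background clauses (11) and (12) covers both original instances. The ground facts about fido and tom below are hypothetical, added only to make the sketch runnable.

	% Background knowledge: clauses (11) and (12).
	pet(X) :- dog(X).
	pet(X) :- cat(X).

	% The generalisation (14) licensed by the background knowledge.
	cuddly_pet(X) :- fluffy(X), pet(X).

	% Hypothetical ground instances corresponding to (9) and (10).
	fluffy(fido).  dog(fido).
	fluffy(tom).   cat(tom).

	% ?- cuddly_pet(fido).    % true, via pet(fido) <- dog(fido)
	% ?- cuddly_pet(tom).     % true, via pet(tom) <- cat(tom)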

Buntine drew on earlier work by Sammut (Sammut and Banerji, 1986) in constructing his generalised subsumption. Muggleton and Buntine (1988) took this approach a step further and realised that, through the application of a few simple rules, they could invert resolution as Plotkin and Reynolds had wished. Here are two of the rewrite rules, in propositional form:

Given a set of clauses, the body of one of which is completely contained in the bodies of the others, such as:

	X ← A ∧ B ∧ C ∧ D ∧ E
	Y ← A ∧ B ∧ C
the absorption operation results in:
	X ← Y ∧ D ∧ E
	Y ← A ∧ B ∧ C
Intra-construction takes a group of rules all having the same head, such as:
	X ← B ∧ C ∧ D ∧ E
	X ← A ∧ B ∧ D ∧ F
and replaces them with:
	X ← B ∧ D ∧ Z
	Z ← C ∧ E
	Z ← A ∧ F
These two operations can be interpreted in terms of the proof tree shown in Figure 10. Resolution accepts two clauses and applies unification to find the most general unifier of their complementary literals. In the diagram, two clauses at the top of a "V" are resolved to produce the resolvent at the apex of the "V". Absorption accepts the resolvent and one of the other two clauses to produce the third. Thus, it inverts the resolution step.

Intra-construction automatically creates a new term in its attempt to simplify descriptions. This is an essential feature of inverse resolution since there may be terms in a theory that are not explicitly shown in an example and may have to be invented by the learning program.
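
The result of the intra-construction step above is itself a runnable propositional logic program. In this sketch the proposition names a to f, the invented symbol z and the asserted facts are all illustrative.

	:- dynamic a/0, f/0.   % declared so the unused second clause for z
	                       % does not raise an existence error

	x :- b, d, z.          % the common literals, plus the invented z
	z :- c, e.             % remainder of the first original rule for x
	z :- a, f.             % remainder of the second original rule for x

	% Facts matching the body of the first original rule, X <- B, C, D, E:
	b.  c.  d.  e.

	% ?- x.
	% true.                % proved via z <- c, e, just as x <- b, c, d, e would be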

DISCUSSION

These methods and others (Muggleton and Feng, 1990; Quinlan, 1990) have made relational learning quite efficient. Because Horn-clause logic is more expressive than the other concept description languages we have seen, it is now possible to learn far more complex concepts than before. A particularly important application of this style of learning is knowledge discovery. There are now vast databases accumulating information on the genetic structure of human beings, aircraft accidents, company inventories, pharmaceuticals and countless other subjects. Powerful induction programs that use expressive languages may be a vital aid in discovering useful patterns in all these data.

For example, the realities of drug design require descriptive powers that encompass stereo-spatial and other long-range relations between different parts of a molecule, and that can generate, in effect, new theories. The pharmaceutical industry spends over $250 million for each new drug released onto the market. The greater part of this expenditure reflects today's unavoidably "scatter-gun" synthesis of compounds which might possess biological activity. Even a limited capability to construct predictive theories from data promises high returns.

The relational learning program Golem was applied to the drug design problem of modelling structure-activity relations (King et al., 1992). The training data for the program consisted of 44 trimethoprim analogues and their observed inhibition of E. coli dihydrofolate reductase. A further 11 compounds were used as unseen test data. Golem obtained rules that were statistically more accurate on the training data, and also better on the test data, than a Hansch linear regression model. Importantly, relational learning yields understandable rules that characterise the stereochemistry of the interaction of trimethoprim with dihydrofolate reductase, as observed crystallographically. In this domain, relational learning thus offers a new approach that complements other methods: it can direct the time-consuming design of potent pharmacological agents from a lead compound, whose variants need to be characterised for likely biological activity before resources are committed to their synthesis.
