Hierarchical Reinforcement Learning: A Hybrid Approach, Ryan M.R.K., PhD Thesis, University of New South Wales, School of Computer Science and Engineering, 2004

In this thesis we investigate the relationships between the symbolic and sub-symbolic methods used in artificial intelligence for controlling agents, focusing in particular on methods that learn. In light of the strengths and weaknesses of each approach, we propose a hybridisation of symbolic and sub-symbolic methods to capitalise on the best features of each.

We implement such a hybrid system, called Rachel, which incorporates techniques from Teleo-Reactive Planning, Hierarchical Reinforcement Learning and Inductive Logic Programming. Rachel uses a novel representation of behaviours, Reinforcement-Learnt Teleo-operators (RL-Tops), which defines a behaviour in terms of its desired consequences but leaves the implementation of the policy to be learnt by reinforcement learning. An RL-Top is an abstract, symbolic description of the purpose of a behaviour, and is used by Rachel both as a planning operator and as the definition of a reward function by which the behaviour can be learnt.
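The dual role of an RL-Top can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the thesis: the `RLTop` class, its `pre` and `post` fields, the dictionary-of-booleans state, the `go_to_door` example, and the reward values of +1, -1 and 0.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical state representation: named boolean fluents.
State = Dict[str, bool]


@dataclass(frozen=True)
class RLTop:
    """Hypothetical container for a Reinforcement-Learnt Teleo-operator."""
    name: str
    pre: Callable[[State], bool]   # condition under which the behaviour may run
    post: Callable[[State], bool]  # desired consequence of the behaviour

    def applicable(self, state: State) -> bool:
        # Planning-operator view: can the behaviour be started here?
        return self.pre(state)

    def reward(self, next_state: State) -> float:
        # Reward-function view (assumed values, not from the thesis):
        # success when the post-condition holds, failure when the behaviour
        # leaves its pre-condition without succeeding, zero otherwise.
        if self.post(next_state):
            return 1.0
        if not self.pre(next_state):
            return -1.0
        return 0.0


# Illustrative behaviour: reach the door while staying inside the room.
go_to_door = RLTop(
    name="go_to_door",
    pre=lambda s: s["in_room"],
    post=lambda s: s["at_door"],
)
```

A planner would consult only `applicable` and `post`, while the reinforcement learner would see only `reward`; the policy that actually achieves the post-condition is left unspecified, as in the description above.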

Two new hierarchical reinforcement learning algorithms are introduced: Planned Hierarchical Semi-Markov Q-Learning (P-HSMQ) and Teleo-Reactive Q-Learning (TRQ). The former extends the Hierarchical Semi-Markov Q-Learning algorithm to use computer-generated plans in place of task hierarchies (which are commonly provided by the trainer). The latter elaborates the algorithm further to include more intelligent behaviour termination: the knowledge contained in the plan is used to determine when an executing behaviour is no longer appropriate and can be terminated early, resulting in more efficient policies.
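As a rough illustration of the two ideas, the sketch below shows a generic semi-Markov Q-learning update in which the candidate behaviours come from a plan, followed by a plan-based interruption test. It is not the thesis's P-HSMQ or TRQ algorithm: the function names, the `plan` callable that maps a state to a set of behaviour names, and the values of `alpha` and `gamma` are all assumptions.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, Set, Tuple

QTable = Dict[Tuple[Hashable, str], float]
Plan = Callable[[Hashable], Set[str]]  # hypothetical: state -> behaviours the plan allows


def smdp_q_update(q: QTable, state: Hashable, behaviour: str,
                  total_reward: float, steps: int, next_state: Hashable,
                  next_behaviours: Iterable[str],
                  alpha: float = 0.1, gamma: float = 0.9) -> float:
    """One semi-Markov Q-learning update for a behaviour that ran `steps` steps.

    `total_reward` is the discounted reward accumulated while it ran.
    """
    # Only behaviours proposed for the next state are considered.
    best_next = max((q[(next_state, b)] for b in next_behaviours), default=0.0)
    target = total_reward + (gamma ** steps) * best_next
    q[(state, behaviour)] += alpha * (target - q[(state, behaviour)])
    return q[(state, behaviour)]


def should_interrupt(behaviour: str, state: Hashable, plan: Plan) -> bool:
    """Stop a running behaviour once the plan no longer lists it (assumed rule)."""
    return behaviour not in plan(state)


# Toy plan: head for the door while in the room, then open it.
def plan(state: Hashable) -> Set[str]:
    return {"go_to_door"} if state == "room" else {"open_door"}


q: QTable = defaultdict(float)
new_value = smdp_q_update(q, "room", "go_to_door", total_reward=1.0,
                          steps=3, next_state="door",
                          next_behaviours=plan("door"))
```

Restricting `next_behaviours` to what the plan proposes stands in for replacing a hand-written task hierarchy with a generated plan, and `should_interrupt` stands in for plan-informed termination; how the thesis actually realises either step is not specified in this abstract.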

Incomplete descriptions of the effects of behaviours can lead the planner to make false assumptions when building plans. Because behaviours are learnt rather than implemented by hand, not every effect of their actions can be known in advance. Rachel implements a "reflector" which monitors for such unexpected and unwanted side-effects. Using Inductive Logic Programming (ILP), it learns to predict when they will occur, and so repairs its plans to avoid them.
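The monitoring half of this idea can be sketched as a comparison between the changes a behaviour was declared to make and the changes actually observed. This is only an illustration under stated assumptions: the `unexpected_side_effects` function, the boolean-fluent state, and the example fluents are hypothetical, and the ILP learning step is not shown.

```python
from typing import Dict, Set

# Hypothetical state representation: named boolean fluents.
State = Dict[str, bool]


def unexpected_side_effects(before: State, after: State,
                            declared_effects: Set[str]) -> Set[str]:
    """Return fluents that changed although the behaviour did not declare them."""
    changed = {f for f in before.keys() | after.keys()
               if before.get(f) != after.get(f)}
    return changed - declared_effects


# Illustrative execution: reaching the door also dropped the cup.
before = {"at_door": False, "holding_cup": True}
after = {"at_door": True, "holding_cup": False}
surprises = unexpected_side_effects(before, after, declared_effects={"at_door"})
```

Each execution paired with its set of surprises could then serve as a labelled example for a learner that predicts when a side-effect will occur.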

Together, the components of Rachel form a learning system which is able to receive abstract descriptions of behaviours, build plans to discover which of them may be useful to achieve its goals, learn concrete policies and optimal choices of behaviour through trial and error, discover and predict any unwanted side-effects that result, and repair its plans to avoid them. It is a demonstration that different approaches to AI, symbolic and sub-symbolic, can be elegantly combined into a single agent architecture.


Malcolm Ryan - malcolmr@cse.unsw.edu.au