A variety of methods have been put forward to counteract this problem. Among these, one of the most successful is the hierarchical decomposition of the problem into a set of smaller sub-problems, each of which is defined on a subset of the state space. Because these sub-problems are more limited in scope than their parent problem, they are generally simpler to learn. Given a suitable decomposition, these sub-problems can be learnt relatively quickly, and then can be combined to form a solution to the parent problem.
Several techniques of hierarchical decomposition of reinforcement learning problems have been proposed. My own, called RL-TOPs, is unusual in that it is a hybrid of reinforcement learning and symbolic planning techniques. It is inspired by the theories of learning in cognitive psychology.
Human learning appears to go through several stages:
Presently the system consists of two parts: the Planner and the Actor/Learner.
The Planner handles the declarative knowledge side of the problem. It is provided with a high-level language for describing the agent's state, with rules defining the relationships between those predicates. Also provided is a set of behaviour descriptions in the form of teleo-operators (or tops). These tops describe possibly useful behaviours, in terms of their pre- and post-conditions. The implementation of these behaviours may be provided, but for the most part they are left for the agent to learn.
Tops come in a variety of granularities. Coarse grained tops are defined on large areas of the state space, and have very specific goals. Finer grained tops are defined on progressively smaller subsets of the state space, and have more general goals. Thus fine-grained tops are easier to learn than coarser ones. The role of the planner is to use this declarative knowledge to decompose coarse grained behaviours into plans of fine grained behaviours, which are in turn decomposed into even finer grained behaviours and so on, to the limit of the given behaviours.
The Actor/Learner runs concurrently with the Planner. It selects a coarse grained behaviour which achieves the agent's current goal and proceeds to execute it. Unless the behaviour has already been implemented, this will involve doing some learning, by standard reinforcement learning techniques. As long as the behaviour is viable (that is, its precondition is satisfied) and has not yet succeeded (its postcondition is not satisfied), then the behaviour continues to execute, without reward. When one of these conditions no longer holds, the behaviour terminates with a reward (if it achieved its postcondition) or a penalty (if it left its precondition).
While the Actor is executing, the Planner is searching for a decomposition of the coarse grained behaviour. As it builds its plan, the Actor will from time to time check whether then plan covers the agent's current situation. If one is available, the Actor may choose to begin executing a finer grained behaviour offered by the plan. This behaviour is executed, and if necessary learnt, in the same way as its parent behaviour. It too can be supplanted by a sub-sub-behaviour, if the planner is able to find one.
While the Actor is executing and learning finer grained sub-problems, it does not give up on the parent problems. Even though the parents are not being executed themselves, they are still able to learn from the experiences of their sub-behaviours, by the use of off-policy reinforcement learning algorithms. Each active behaviour observes the same state and action data, but calculates its own reinforcement signals based on its own pre- and post-conditions. In this way, it is possible for the coarser grained behaviours to learn refinements to their plans, finding ways to "cut corners". In this way they can compile the plans into behaviours of their own. When these behaviours become sufficiently confident, by some heuristic measure, they can be executed directly, without the need for decomposition.
Still to be implemented are the processes of reflection and creation of new teleo-operators. It is expected that both of these problems will be made easier by the agent's prior knowledge in terms of the state-description language, and the already provided behaviour descriptions.
Currently the system is being tested on a flight simulator, produced by the School of Computer Science at Curtin University, in collaboration with the Aeronautical and Maritime Research Laboratories in Perth and Melbourne, Australia. The system is learning to takeoff and fly over a sequence of landmarks, with the prior knowledge available from a standard Flight Training Manual, and some demonstration flights. Results will be published as soon as they become available.