The display below shows the BOXES algorithm learning to balance a pole on a cart that runs on a fixed length track. The histogram in the bottom half of the display show the number of trials so far. The three numbers are: the number of runs; the number of trials needed in the last run before success; and the average number of trials needed to succeed. Successes is defined as being able to balance the pole for 10,000 time steps.
Trial-and-Error Learning
- Conventional control theory requires a mathematical model to predict the behaviour of a process so that appropriate control decisions can be made.
- Many processes are too complicated to model accurately.
- Often, not enough information is available about the process' environment.
- When the system is too complicated or the environment is not well understood, an adaptive controller may work.
- An adaptive controller learns how to use the control actions available to meet the system's objective.
- The process is treated as a 'black box' and the program interacts with it by conditioned response.
The Pole and Cart
|
|
Equations of Motion
|
||||||||||||||||||
|
|||||||||||||||||||
BOXES
The BOXES learning algorithm partitions the state space into regions according to how each dimension of the space is discretised. Each box represents a region of the problem space. In the pole and cart problem, there are four dimensions, one for each state variable (position and velocity of the cart, angle and angular velocity of the pole).
Each box contains the following information:
- an action setting (left or right)
- the weighted sum of lifetimes after a left decision (left life)
- the weighted number of times a left decision has been made (left usage)
- the weighted sum of lifetimes after a right decision (right life)
- the weighted number of times a right decision has been made (right usage)
These numbers decay after each trial. That is, before a new value is added to the old value, the old value is multiplied by a factor between 0 and 1 (usually around 0.99). Thus, old experiences have less effect on box settings.
The BOXES Algorithm
boxes
loop
randomly set starting position
put trial into t
if t > 10,000 then exit
for each box, b
if number of entries into b != 0
check_box(b)
trial
put 0 into t
find the current box
if pole has fallen then return t
add one to t
if t > 10,000 then return t
add one to number of entries into current box
add t to time sum of current box
make move according to setting of box
check_box(b)
multiply left life by decay
multiply left usage by decay
multiply right life by decay
multiply left usage by decay
if action setting is LEFT
add no. of entries - time sum to left life
add no. of entries to left usage
if action setting is RIGHT
add no. of entries - time sum to right life
add no. of entries to right usage
put 0 into the no. of entries
put zero into time sum
if LeftValue > RightValue
set action to LEFT
else if LeftValue < RightValue
set action to RIGHT
else
make random choice
Software
Publications
Sammut, C. (1988). Experimental Results from an Evaluation of Algorithms that Learn to Control Dynamic Systems. Proceedings of the Fifth International Conference on Machine Learning, Ann Arbor, Michigan: Morgan Kaufmann, pp. 437-443.
Sammut, C. and Cribb, J. (1990). Is Learning Rate a Good Performance Criterion of Learning? In B. W. Porter & R. J. Mooney (Eds.), Proceedings of the Seventh International Machine Learning Conference, Austin, Texas: Morgan Kaufmann, pp. 170-178.
Sammut, C. and Michie, D. (1991). Controlling a 'Black Box' Simulation of a Space Craft. AI Magazine, 12(1), pp 56-63.
Sammut, C. A. (1994). Recent Progress with BOXES. In K. Furakawa, Michie, D. & S. Muggleton (Eds.), Machine Intelligence 13. Oxford: The Clarendon Press, OUP, pp 363-384.
McGarity, M., Clements, D. and Sammut, C. (1995). Controlling a Steel Mill with BOXES. In S. Muggleton, K. Furakawa, & D. Michie (Eds.), Machine Intelligence 14. Oxford University Press.