The display below shows the BOXES algorithm learning to balance a pole on a cart that runs on a fixed length track. The histogram in the bottom half of the display show the number of trials so far. The three numbers are: the number of runs; the number of trials needed in the last run before success; and the average number of trials needed to succeed. Successes is defined as being able to balance the pole for 10,000 time steps.
TrialandError Learning
 Conventional control theory requires a mathematical model to predict the behaviour of a process so that appropriate control decisions can be made.
 Many processes are too complicated to model accurately.
 Often, not enough information is available about the process' environment.
 When the system is too complicated or the environment is not well understood, an adaptive controller may work.
 An adaptive controller learns how to use the control actions available to meet the system's objective.
 The process is treated as a 'black box' and the program interacts with it by conditioned response.
The Pole and Cart



BOXES
The BOXES learning algorithm partitions the state space into regions according to how each dimension of the space is discretised. Each box represents a region of the problem space. In the pole and cart problem, there are four dimensions, one for each state variable (position and velocity of the cart, angle and angular velocity of the pole).
Each box contains the following information:
 an action setting (left or right)
 the weighted sum of lifetimes after a left decision (left life)
 the weighted number of times a left decision has been made (left usage)
 the weighted sum of lifetimes after a right decision (right life)
 the weighted number of times a right decision has been made (right usage)
These numbers decay after each trial. That is, before a new value is added to the old value, the old value is multiplied by a factor between 0 and 1 (usually around 0.99). Thus, old experiences have less effect on box settings.
The BOXES Algorithm
boxes loop randomly set starting position put trial into t if t > 10,000 then exit for each box, b if number of entries into b != 0 check_box(b) trial put 0 into t find the current box if pole has fallen then return t add one to t if t > 10,000 then return t add one to number of entries into current box add t to time sum of current box make move according to setting of box check_box(b) multiply left life by decay multiply left usage by decay multiply right life by decay multiply left usage by decay if action setting is LEFT add no. of entries  time sum to left life add no. of entries to left usage if action setting is RIGHT add no. of entries  time sum to right life add no. of entries to right usage put 0 into the no. of entries put zero into time sum
if LeftValue > RightValue set action to LEFT else if LeftValue < RightValue set action to RIGHT else make random choice
Software
Publications
Sammut, C. (1988). Experimental Results from an Evaluation of Algorithms that Learn to Control Dynamic Systems. Proceedings of the Fifth International Conference on Machine Learning, Ann Arbor, Michigan: Morgan Kaufmann, pp. 437443.
Sammut, C. and Cribb, J. (1990). Is Learning Rate a Good Performance Criterion of Learning? In B. W. Porter & R. J. Mooney (Eds.), Proceedings of the Seventh International Machine Learning Conference, Austin, Texas: Morgan Kaufmann, pp. 170178.
Sammut, C. and Michie, D. (1991). Controlling a 'Black Box' Simulation of a Space Craft. AI Magazine, 12(1), pp 5663.
Sammut, C. A. (1994). Recent Progress with BOXES. In K. Furakawa, Michie, D. & S. Muggleton (Eds.), Machine Intelligence 13. Oxford: The Clarendon Press, OUP, pp 363384.
McGarity, M., Clements, D. and Sammut, C. (1995). Controlling a Steel Mill with BOXES. In S. Muggleton, K. Furakawa, & D. Michie (Eds.), Machine Intelligence 14. Oxford University Press.