BOXES Learning to Control a Pole and Cart

BOXES

The display below shows the BOXES algorithm learning to balance a pole on a cart that runs on a fixed length track. The histogram in the bottom half of the display show the number of trials so far. The three numbers are: the number of runs; the number of trials needed in the last run before success; and the average number of trials needed to succeed. Successes is defined as being able to balance the pole for 10,000 time steps.

Trial-and-Error Learning

  • Conventional control theory requires a mathematical model to predict the behaviour of a process so that appropriate control decisions can be made.
  • Many processes are too complicated to model accurately.
  • Often, not enough information is available about the process' environment.
  • When the system is too complicated or the environment is not well understood, an adaptive controller may work.
  • An adaptive controller learns how to use the control actions available to meet the system's objective.
  • The process is treated as a 'black box' and the program interacts with it by conditioned response.

The Pole and Cart

Equations of Motion
    mc = mass of cart
mp = mass of pole
l = distance of centre of mass of pole from the pivot
g = acceleration due to gravity
F = force applied to cart
t = time interval of simulation

BOXES

The BOXES learning algorithm partitions the state space into regions according to how each dimension of the space is discretised. Each box represents a region of the problem space. In the pole and cart problem, there are four dimensions, one for each state variable (position and velocity of the cart, angle and angular velocity of the pole).

BOXES array

Each box contains the following information:

  • an action setting (left or right)
  • the weighted sum of lifetimes after a left decision (left life)
  • the weighted number of times a left decision has been made (left usage)
  • the weighted sum of lifetimes after a right decision (right life)
  • the weighted number of times a right decision has been made (right usage)

These numbers decay after each trial. That is, before a new value is added to the old value, the old value is multiplied by a factor between 0 and 1 (usually around 0.99). Thus, old experiences have less effect on box settings.

The BOXES Algorithm

boxes
    loop
        randomly set starting position
        put trial into t
        if t > 10,000 then exit
        for each box, b
            if number of entries into b != 0
                check_box(b)

trial
    put 0 into t
    find the current box
    if pole has fallen then return t
    add one to t
    if t > 10,000 then return t
    add one to number of entries into current box
    add t to time sum of current box
    make move according to setting of box

check_box(b)
    multiply left life by decay
    multiply left usage by decay
    multiply right life by decay
    multiply left usage by decay

    if action setting is LEFT
        add no. of entries - time sum to left life
        add no. of entries to left usage

    if action setting is RIGHT
        add no. of entries - time sum to right life
        add no. of entries to right usage

    put 0 into the no. of entries
    put zero into time sum
        
        
    if LeftValue > RightValue
        set action to LEFT
    else if LeftValue < RightValue
        set action to RIGHT
    else
        make random choice

Software

Publications

Sammut, C. (1988). Experimental Results from an Evaluation of Algorithms that Learn to Control Dynamic Systems. Proceedings of the Fifth International Conference on Machine Learning, Ann Arbor, Michigan: Morgan Kaufmann, pp. 437-443.

Sammut, C. and Cribb, J. (1990). Is Learning Rate a Good Performance Criterion of Learning? In B. W. Porter & R. J. Mooney (Eds.), Proceedings of the Seventh International Machine Learning Conference, Austin, Texas: Morgan Kaufmann, pp. 170-178.

Sammut, C. and Michie, D. (1991). Controlling a 'Black Box' Simulation of a Space Craft. AI Magazine, 12(1), pp 56-63.

Sammut, C. A. (1994). Recent Progress with BOXES. In K. Furakawa, Michie, D. & S. Muggleton (Eds.), Machine Intelligence 13. Oxford: The Clarendon Press, OUP, pp 363-384.

McGarity, M., Clements, D. and Sammut, C. (1995). Controlling a Steel Mill with BOXES. In S. Muggleton, K. Furakawa, & D. Michie (Eds.), Machine Intelligence 14. Oxford University Press.