Tensor Product Networks

Aims
  • to study a neural network with linear activation, multiple activation modes, and one-shot learning
  • to see tensor product networks in action, modelling human analogical reasoning
  • to explore robustness properties using a range of distributed representation strategies
  • Reference Smolensky, Paul: Tensor product variable binding and the representation of symbolic structures in connectionist systems, Artificial Intelligence 46 (1990) 159-216. Other references later.
    Keywords tensor product network, variable binding problem, rank, one-shot learning, orthonormality, relational memory, teaching and retrieval modes, proportional analogies, lesioning a network, random representations, sparse random representations, fact recognition scores, representable non-facts


    Plan
     
  • variable binding / rank 2 tensor product net
  • teaching and retrieving from a TPN
  • orthonormal / distributed representations
  • TPNs as relational memories
  • rank 3 tensor product nets
  • modelling proportional analogy-solving with TPNs
  • graceful degradation in TPNs
  • random representations
  • sparse random representations
  • fact and non-fact recognition scores

    Network Topology and Activation

    There are several ways to connect mini-processing units together and several models of computation for the units. A common model of computation is to assume weighted connections wij from input units with activation xj to unit i, and to define the output (activation function) for unit i to be:

    σ(Σj wijxj)

    where σ is a `squashing' function such as tanh. This is the model used in feedforward networks which learn by error backpropagation.


    Feedforward Nets

    [A net is a feedforward net if the the graph consisting of the neurons as nodes and connections as directed edges is a directed acyclic graph. One can then find nodes with in-degree zero (no connections coming into them) = input nodes; and nodes with out-degree zero (no connections leaving them) = output nodes; anything else is termed a hidden node or unit. Nets like this are often drawn as follows:

    where each rectangle represents a group of neurons/nodes: the input nodes are on the left, output nodes on the right, and hidden units in the middle. The arrows labelled with ω signify that there are connections, usually with `trainable' weights between neurons in one "layer" and those in the next.]


    Fully Recurrent Nets

    Another possibility for connectivity (also called topology) of networks is total interconnection - every neuron has a weighted connected to every other neuron. The connections may or may not be trainable. Such nets are called fully recurrent.


    Tensor Product Nets

    Other possibilities for the activation function include linear networks (where σ is the identity function). One particular kind of network with a linear activation function and a special topology is the rank 2 tensor product network.

    Tensor product networks were applied early to the task of dealing with symbolic structures in connectionist systems, by Paul Smolensky. Earlier papers by Dolan and Smolensky[1] dealt with implementing connectionist production systems (like rule-based expert systems). Smolensky (1990) deals with the "variable binding problem" - how to represent concepts such as x  =  3 in a connectionist system. This can be achieved with a rank 2 tensor product network.


    Rank 2 Tensor Product Nets

    Tensor product nets come with different numbers of dimensions, or rank. The first interesting case is rank 2, where the topology is that of a matrix:

    The network is shown above in teaching mode. There is also a retrieval mode, where you feed the net (the representation of) a variable, and it outputs the value of the symbol (the `filler').


    Teaching Mode

    In teaching mode, when (vectors representing) a variable and a filler are presented to the two sides of the network, the fact that the variable has that filler is learned by the network.

    The teaching is one-shot, as opposed to the iterative learning used by backpropagation networks, and the settling schemes used by other classes of neural network. Nothing is annealed or repeatedly adjusted, and no stopping criterion applies.

    Teaching is accomplished by adjusting the value of the binding unit memory. Specifically, if the i-th component of the filler vector is fi and the j-th component of the variable vector is vj, then fivj is added to bij, the (i,j)-th binding unit memory, for each i and j.

    Another way to look at this is to regard the binding units as a matrix B, and the filler and variable as column vectors f and v. Then what we are doing is forming the outer product fvT and adding it to B:

    B' = B + fvT


    Retrieval Mode

    For exact retrieval, the vectors used to represent variables must be orthogonal to each other (i.e. any two of them should have the dot product equal to zero) and the same must be true for the vectors used to represent the fillers. Each representation vector should also be of length 1 (i.e. the dot product of each vector with itself should be 1).

    It is common to refer to a set of vectors with these properties (orthogonality and length 1) as an orthonormal set.

    Orthonormality entails that the representation vectors are linearly independent, and in particular, if the matrix/tensor has m rows and n columns, then it can represent at most m fillers and n variables.


    Orthonormal Bases

    It is easy to find an orthonormal set of vectors: the standard basis of Rn is orthonormal:

    (1 0 0 ... 0)
    (0 1 0 ... 0)
    (0 0 1 ... 0)
      :  :  :     :
    (0 0 0 ... 1)

    This basis has an advantage and a disadvantage. The advantage is that, using it, it is easy to see what is going on in the tensor. If there is a 1 in row i and column j, then variable j has value equal to filler i. There can be at most one 1 in each column (a variable has at most one filler) but there can be several 1s in a row (in which case several variables have the same filler).

    The disadvantage is that it corresponds to a localist representation - where the responsibility for knowing that variable j has filler i resides in the (i,j)th cell of the tensor alone. If the (i,j)th cell were damaged, the fact would be lost from memory.


    Distributed Representations in TP Nets

    We shall call a basis a distributed representation if its vectors have no zeroes or few zeroes in them.

    If a distributed representation is used, then the memory for the fact that variable j has filler i is distributed all over the tensor.

    Here is a small, simple example of a distributed representation:

    (0.50.50.50.5)
    (0.5-0.50.5-0.5)
    (0.50.5-0.5-0.5)
    (0.5-0.5-0.50.5)

    Retrieval from a TP Net

    Retrieval is accomplished by computing dot products. To retrieve the value/filler for a variable v = (vj) from a rank 2 tensor with binding unit values bij, compute fi = Σ j bijvj, for each i. The resulting vector (fi) represents the filler.

    Notice that the activation function for fi is a regular-type linear activation function.

    To decide whether variable v has filler f, compute D = Σ iΣ j  bijvjfi. D will be either 1 or 0. If it is 1, then variable v has filler f, otherwise not.


    Hadamard Matrices and Orthonormal Bases


    Hadamard Matrices and Orthonormal Bases 2


    Hadamard Matrices 3


    Hadamard Matrices 4


    Learning with Hadamard Representations

    Suppose we are using representations as follows:

    (0.5, 0.5, 0.5, 0.5)Tto representrabbit
    (0.5, -0.5, 0.5, -0.5)Tto representmouse
    (0.5, 0.5, -0.5, -0.5)Tto representcarrot
    (0.5, -0.5, -0.5, 0.5)Tto representcat
    and we want to build a tensor to represent the pairs

    (rabbit, carrot) and (cat, mouse)


    Learning with Hadamard Representations 2

    We check that we can recover carrot from this by unbinding with rabbit: we must compute fi = Σ j bijvj where bij is the matrix, and (vj) is the rabbit vector:

    f1 = b11v1 + b12v2 + b13v3 + b14v4
     = ½ × ¼ × (1×1 + 1×1 + 1×1 + 1×1)
     = 0.5

    and similarly, f2 = 0.5, f3 = -0.5, and f4 = -0.5, so that f represents carrot.


    Learning with Hadamard Representations 3

    For (cat, mouse), we compute:

    The tensor representing both of these is the sum of the two matrices:


    Learning with Hadamard Representations 4

    We check that we can still recover carrot from this by unbinding with rabbit: we must compute fi = Σj bijvj where bij is the (new) matrix, and (vj) is the rabbit vector:

    f1 = b11v1 + b12v2 + b13v3 + b14v4
     = ½ × ¼ × (2×1 + 0×1 + 0×1 + 2×1)
     = 0.5

    and similarly, f2= 0.5, f3= -0.5, and f4= -0.5, so that f represents carrot as before.


    TP Nets as Relational Memories

    So far we have been using the tensor product network to store a particular kind of relational information: variable binding. In variable binding, each variable has a unique filler (at any given time). This restriction on the kind of information stored in the tensor is unnecessary. A rank 2 tensor will store an arbitrary binary relation. For example, suppose I have a set of facts about what, say, animals eat:

    Animal  Food
    rabbit  carrot
    mouse  cheese
    crocodile  student
    rabbit  lettuce
    guineapig  lettuce
    crocodile  lecturer


    TP Nets as Relational Memories 2

    This information can be represented, and stored in the tensor in the usual way, (putting the `animal' in the side we have been calling `variable', and the `food' in the side we have been calling `filler') and then we can try retrieving it. For example, we can present the vector representing rabbit to the variable/animal side of the tensor. What we get out of the filler/food side of the tensor will be the sum of the vectors representing the foods that the tensor has been taught that rabbit eats: in this case carrot + lettuce.

    Checking a particular fact, like that (mouse, cheese) is in the relation, is done just as before: we compute D = ΣiΣj bijvjfi: where v is for varmint and f for food, and if D = 1 then the varmint eats the food.


    Rank 3 Tensors

    From what we've seen so far, we could better have called these nets `matrix nets'. The tensor aspect of things comes in when we generalise to enable us to store ternary (or higher rank) relations.

    Suppose we want to store either a ternary relation (like give, where we record who gives, to whom, and what), or a group of binary relations, where we need to record which relation, and the first and second arguments of each relational instance, like:

    kiss(frank, betty)
    hit(max, frank).

    Now we need a tensor net with three sides: say a REL side, an ARG1 side and an ARG2 side, or more generally a u side, a v side and a w side.


    Gross Structure of a Rank 3 TP Net

    This shows binding units, some activated (shaded), and neurons in the u, v, & w vectors, but not interconnections. The net is ready for retrieval from the u side, given v & w. There are 27 binding units, 3 × 3 × 3. In general, if the u, v, and w sides use vectors with q components, there are q3 binding units.


    Connections of a Rank 3 TP Net

    The binding units are labelled tijk - t for tensor. Each component of a side is connected to a hyperplane[2] of binding units. E.g. v1 is connected to ti1k, for all i and k.


    Retrieval in a Rank 3 Tensor

    If we have concepts (or their representations) for any two sides of the tensor, then we can retrieve something from the third side. For example, if we have u = (ui) and v = (vj), then we can compute wk = Σij tijk uivj, for each value of k, and the result will be the sum of the vectors representing concepts w such that u(v,w) is stored in the tensor.

    This time the activation function for wk is not linear but multilinear. It roughly corresponds to the activation function for higher-order neural networks, where (biologically speaking) two (or more) axons synapse together (axono-axono-dendritic synapse).

    As usual, one can check facts, too. D = Σijktijkuivjwk is 1 exactly when u(v,w) is stored in the tensor, and zero otherwise.


    Teaching in a Rank 3 Tensor

    The internal layout of the binding unit is more complex in this case, because of the extra input.

    To teach the network the fact u(v,w), present u, v and w to the net. In teaching mode, this causes the content of each binding unit memory tijk to be altered by adding uivjwk to it.


    Higher Rank Tensor Product Networks

    For a rank r tensor product network:

    This rapidly becomes impractical, as the size of the network (number of binding units) grows as nr, and it is desirable to have n fairly large in practice, since n is the largest number of concepts that can be represented (per side of the tensor). For example, with a rank 6 tensor, with 64 concepts per side, we would need 646 = 236 ~ 64 billion binding units.


    Gross Topology of a rank 4 tensor product network

    This one has 3 components for each of the 4 `directions', so has a total of 34 = 81 binding units. If you want a challenge exercise, you could try drawing all the connections in a manner analogous to figure some pages back labelled "Network Connectivity for a 3×3×3 Tensor Product Network".


    Applications of Tensor Product Networks

    We'll look at a connectionist model of human analogical reasoning. First some terminology: detailed diagrams of tensor product networks are complicated, so we'll use the following symbol for a rank 3 tensor product network:

    Here v & w are inputs and u is output, but we could make any side the output.


    Solving Proportional Analogy Problems
    using Tensor Product Networks

    The aim is to simulate simple human analogical reasoning. The tensor product network is used to store facts relevant to the analogical reasoning problem.

    Proportional analogy problems are sometimes used in psychological testing. They are fairly easy for a human over a certain age, but it is not particularly clear how to solve them on a machine. A typical example is

    dog : kennel :: rabbit : what?

    The aim is to find the best replacement for the what?. Here the answer is burrow (or maybe hutch, if we are talking about pet rabbits).

    A human usually responds to this problem by saying something like: "The dog lives-in the kennel - what does the rabbit live in? - a burrow." In other words, the human names a relationship between dog and kennel, and then proceeds from there.

    However, the human does not pick just any relation between dog and kennel (like smaller-than(dog, kennel)): they pick the most salient relation. How? And how could we do this with a machine?

    The tensor product network approach actually finesses this question.


    A set of "facts"

    woman
    woman
    woman
    woman
    mare
    mare
    mare
    mare

    woman
    woman
    woman
    baby
    mare
    foal
    rabbit
    barn
    barn
    barn
    barn
    barn
    loves
    mother-of
    larger
    feeds
    feeds
    mother-of
    larger
    larger

    larger
    larger
    lives-in
    lives-in
    lives-in
    lives-in
    lives-in
    larger
    larger
    larger
    larger
    larger
    baby
    baby
    baby
    baby
    foal
    foal
    foal
    rabbit

    rabbit
    foal
    house
    house
    barn
    barn
    burrow
    woman
    baby
    mare
    foal
    rabbit


    Steps in the Simple Analogical Reasoning Algorithm


    Running the net (a Lisp implementation):

    The simulation output below demonstrates four things:

    1. A normal network solving woman : baby :: mare : what?

    2. A network which has had 50% of its binding units `killed' tackling the same problem. The binding units are `killed' by altering them so that they always produce zero output.

    3. A network which has had 90% of its binding units `killed' solving the same problem.

    4. A network which has had 20% of its binding units `randomized' tackling the same problem. The binding units are `randomized' by altering them so that they produce produce uniform random output.

    The point of 2-4 is to demonstrate the robustness of tensor product networks. In all 3 cases, the coefficients of foal and rabbit are reduced, and other concepts may show up with non-zero coefficients, but we can always tell that foal is the correct answer, as it has the largest coefficient.


    ? (solve-simple-analogy 'woman 'baby 'mare :verbose)

    Tensor product net analogy solver
    Network size: 320 nodes (5×8×8).
    Dead so far: 0 nodes (0.0%). Randomized so far: 0 nodes (0.0%)
    Total damaged nodes: 0 (0.0%) ...
    Problem: WOMAN : BABY :: MARE : ?

    Recognizable patterns in solution vector ...
    FOAL = 3.000
    RABBIT = 1.000

    ? (load "lobotomy.lisp")
    ? (nkill 160)
    Neurons dead: 50.00%


    ? (solve-simple-analogy 'woman 'baby 'mare :verbose)

    Network size: 320 nodes (5×8×8).
    Dead so far: 160 nodes (50.0%). Randomized so far: 0 nodes (0.0%)
    Total damaged nodes: 160 (50.0%) ...

    Problem: WOMAN : BABY :: MARE : ?

    Recognizable patterns in solution vector ...
    BABY
    FOAL
    RABBIT
    =
    =
    =
    0.0005
    0.854
    0.137


    ? (nkill 128) ; kill another 40%
    Neurons dead: 90.00%
    ? (solve-simple-analogy 'woman 'baby 'mare :verbose)
    Network size: 320 nodes (5×8×8).
    Dead so far: 288 nodes (90.0%). Randomized so far: 0 nodes (0.0%)
    Total damaged nodes: 288 (90.0%)

    Problem: WOMAN : BABY :: MARE : ?

    Recognizable patterns in solution vector ...
    WOMAN
    FOAL
    RABBIT
    =
    =
    =
    0.001
    0.069
    0.015


    ? (reset-net) ; restore net to original state
    ; - all binding units alive
    ? (nrandomize 64)
    Neurons random: 20.00%

    ? (solve-simple-analogy 'woman 'baby 'mare :verbose)
    Network size: 320 nodes (5×8×8).
    Dead so far: 0 nodes (0.0%). Randomized so far: 64 nodes (20.0%)
    Total damaged nodes: 64 (20.0%)

    Problem: WOMAN : BABY :: MARE : ?
    Volume setting for randomized nodes: 0.10

    Recognizable patterns in solution vector ...
    WOMAN
    MARE
    BABY
    FOAL
    RABBIT
    HOUSE
    BURROW
    =
    =
    =
    =
    =
    =
    =
    0.511
    0.058
    0.456
    2.200
    0.899
    0.312
    0.067


    For more details...

    Halford, Wilson, Guo, Gayler, Wiles, and Stewart, Connectionist implications for processing capacity limitations in analogies, pages 363-415 in K.J. Holyoak and J. Barnden (editors) Advances in Connectionist and Neural Computation Theory: Volume 2: Analogical Connections, Norwood, NJ: Ablex, 1994. ISBN: 1-56750-039-0

    This work led to a relational view of cognition which is described in:

    G.S. Halford, W.H. Wilson, and S. Phillips, Processing capacity defined by relational complexity: Implications for comparative, developmental and cognitive psychology, Behavioral and Brain Sciences 21(6) (1998) 803-831. ISSN 0140-525X.


    Omnidirectional Access

    In solving the analogy problem, the TPN was accessed in two different ways:

    (1) ARG1 and ARG2 in, REL out
    (2) ARG1 and REL in, ARG2 out

    This would not have been possible with a backprop net - the input/output structure is "hard-wired" in backprop nets. In the TPN, the same information in the tensor supports both these modes of operation.

    This is like when kids learn addition/subtraction - you learn that 9 + 7 = 16, and from this you also know that 16 - 7 = 9. We learn addition tables, but not subtraction tables.

    An obvious third access mode: ARG2 and REL in, ARG1 out, is possible. And of course, you can have and ARG1, ARG2, and REL in, YES/NO out access mode.

    Less obviously, you can have access modes like: REL in, ARG1⊗ ARG2 out.
    In fact there are a total of 7 access modes to a rank 3 tensor. There are 2k-1 access modes for a rank k tensor. This property is referred to as omnidirectional access.

    Example: Given a rank 5 tensor, with axes REL, ARG1, ARG2, ARG3, and ARG4, one possible access mode would be: ARG1, ARG2 in, REL⊗ ARG3⊗ ARG4 out


    What else can tensor product nets do?

    Smolensky (AI 46 (1990) full citation above) reviews this. In particular, he presents a lot of theoretical analysis, and inter alia shows that stacks can be implemented in tensor product networks, and indeed that general lists can be. He does this by showing how to implement the basic Lisp primitives car, cdr, and cons (head, tail and : for Haskell hackers).


    Approximate unbinding and random representations

    [Reference: Wilson, Street, and Halford, Solving Proportional Analogy Problems using Tensor Product Networks with Random Representations, 1995 IEEE International Conference on Neural Networks Proceedings, ICNN'95, 2971-2975. http://www.cse.unsw.edu.au/~billw/reprints/randomtensor.ps]

    We have seen that we can kill 90% of binding units and still solve problems with the tensor product networks. When we kill a binding unit, this affects the computation of inner products in the retrieval process - it is if the representation vectors are not exactly orthogonal anymore.

    So perhaps exact orthogonality is not critical - perhaps approximate orthogonality would be enough. It is also implausible that biological neural networks have orthonormal sets of vectors representing concepts.

    What sort of vectors would be `approximately orthogonal'? Answer: vectors with uniform random components chosen from the interval [-r, r] have inner products with mean 0 and variance which depends on r.


    Experiments with `dense' random representations

    Wilson, Street and Halford also tried using as representation vectors, normalised random vectors generated by choosing each component to be a uniform random number from the range [-1,+1], and normalising the resulting vector.

    Normalisation has the effect that vectors with more components will (on average) have smaller components.

    For this reason, they investigated the effect of increasing the number of components on network performance.

    Experiments were done with the analogy problem: woman is to baby as mare is to what? and a particular set of facts.

    We classified runs as good or poor or bad.

    good:score(foal) > 1.2 * score(rabbit) and
    score(rabbit) >1.2 * next highest score
    fair:score(foal) > score(rabbit) > next highest score
    bad:not (good or fair)

    The factor 1.2 was an arbitrarily chosen cutoff. [It was not "tuned".]


    Dense random representation results

    Table 1: Classification of solutions produced by dense random vector experiments (1000 runs for each number of components; 20 facts; solving woman:baby::mare:what?)

    components% good% fair% bad
    25 40.9 10.3 48.8
    40 55.5 7.5 37.0
    70 73.5 9.3 18.2
    100 84.8 4.3 10.9
    130 92.6 3.5 3.9
    160 94.5 2.5 3.0
    190 96.7 1.8 1.5
    220 97.4 1.4 1.2
    250 98.7 1.2 0.1


    Sparse Random Representations

    A distributed representation does not need all or even most nodes to have non-zero activations, but any concept should be represented by a significant number of active nodes.

    Indeed, neurophysiological data indicates that percent activity of neurons in the different regions of the human hippocampal system ranges from about 0.5% to about 9%.


    Experiments with Sparse Random Representations

    Accordingly, Wilson, Street and Halford constructed a version of their model which uses sparse random representations: that is representations are generated with most components zero, and the remainder chosen from a uniform random distribution.

    Again, the inner products of such vectors have mean zero, and so can be considered quasi-orthogonal.

    With small numbers of non-zero components, one would expect freaky behaviour - these cases do not correspond to distributed representations.

    In the experiments, vectors were of length 500 (and so the tensor had 125 million binding units).

    Each experimental run used a fixed number of non-zero components, ranging from 1 to 50, and chose the components initially as uniform random numbers in [-1,+1], but then normalized the resulting vectors.

    One hundred runs were done with each number of non-zero components.


    Results for Sparse Random Representations


    Fact and Non-Fact Recognition Scores

    Further experiments on random representations investigated the ability of the networks to recognize individual facts (as opposed to solving analogy problems, which is a more complex task which involves accessing information relating to a large number of facts).

    These experiments confirmed that random representations, with enough components in the vectors, permit one to decide whether a proposition is a fact or a non-fact (where we define a fact to be something that the network has been taught.)


    Fact and Non-Fact Recognition Results


    What about higher ranks?

    The experiments reported here were done with rank 3 networks. More recent experiments on fact and non-fact retrieval with networks of ranks 4, 5, 6, and 7 with Hadamard bases and large numbers of neurons "killed" (rather than with random representations) have shown that one can kill around 90% of the neurons and still be able to distinguish facts from non-facts.

    Above rank 7, it begins to become difficult to complete simulations with reasonable amounts of computing resources.


    Summary