| Aims |
|
|---|---|
| Reference | Smolensky, Paul: Tensor product variable binding and the representation of symbolic structures in connectionist systems, Artificial Intelligence 46 (1990) 159-216. Other references later. |
| Keywords | tensor product network, variable binding problem, rank, one-shot learning, orthonormality, relational memory, teaching and retrieval modes, proportional analogies, lesioning a network, random representations, sparse random representations, fact recognition scores, representable non-facts |
There are several ways to connect mini-processing units together and several models of computation for the units. A common model of computation is to assume weighted connections wij from input units with activation xj to unit i, and to define the output (activation function) for unit i to be:
σ(Σj wijxj)
where σ is a `squashing' function such as tanh. This is the model used in feedforward networks which learn by error backpropagation.
[A net is a feedforward net if the the graph consisting of the neurons as nodes and connections as directed edges is a directed acyclic graph. One can then find nodes with in-degree zero (no connections coming into them) = input nodes; and nodes with out-degree zero (no connections leaving them) = output nodes; anything else is termed a hidden node or unit. Nets like this are often drawn as follows:

where each rectangle represents a group of neurons/nodes: the input nodes are on the left, output nodes on the right, and hidden units in the middle. The arrows labelled with ω signify that there are connections, usually with `trainable' weights between neurons in one "layer" and those in the next.]
Another possibility for connectivity (also called topology) of networks is total interconnection - every neuron has a weighted connected to every other neuron. The connections may or may not be trainable. Such nets are called fully recurrent.
Other possibilities for the activation function include linear networks (where σ is the identity function). One particular kind of network with a linear activation function and a special topology is the rank 2 tensor product network.
Tensor product networks were applied early to the task of dealing with symbolic structures in connectionist systems, by Paul Smolensky. Earlier papers by Dolan and Smolensky[1] dealt with implementing connectionist production systems (like rule-based expert systems). Smolensky (1990) deals with the "variable binding problem" - how to represent concepts such as x = 3 in a connectionist system. This can be achieved with a rank 2 tensor product network.
Tensor product nets come with different numbers of dimensions, or rank. The first interesting case is rank 2, where the topology is that of a matrix:
The network is shown above in teaching mode. There is also a retrieval mode, where you feed the net (the representation of) a variable, and it outputs the value of the symbol (the `filler').
In teaching mode, when (vectors representing) a variable and a filler are presented to the two sides of the network, the fact that the variable has that filler is learned by the network.
The teaching is one-shot, as opposed to the iterative learning used by backpropagation networks, and the settling schemes used by other classes of neural network. Nothing is annealed or repeatedly adjusted, and no stopping criterion applies.
Teaching is accomplished by adjusting the value of the binding unit memory. Specifically, if the i-th component of the filler vector is fi and the j-th component of the variable vector is vj, then fivj is added to bij, the (i,j)-th binding unit memory, for each i and j.
Another way to look at this is to regard the binding units as a matrix B, and the filler and variable as column vectors f and v. Then what we are doing is forming the outer product fvT and adding it to B:
B' = B + fvT
For exact retrieval, the vectors used to represent variables must be orthogonal to each other (i.e. any two of them should have the dot product equal to zero) and the same must be true for the vectors used to represent the fillers. Each representation vector should also be of length 1 (i.e. the dot product of each vector with itself should be 1).
It is common to refer to a set of vectors with these properties (orthogonality and length 1) as an orthonormal set.
Orthonormality entails that the representation vectors are linearly independent, and in particular, if the matrix/tensor has m rows and n columns, then it can represent at most m fillers and n variables.
It is easy to find an orthonormal set of vectors: the standard basis of Rn is orthonormal:
(1 0 0 ... 0)
(0 1 0 ... 0)
(0 0 1 ... 0)
: : : :
(0 0 0 ... 1)
This basis has an advantage and a disadvantage. The advantage is that, using it, it is easy to see what is going on in the tensor. If there is a 1 in row i and column j, then variable j has value equal to filler i. There can be at most one 1 in each column (a variable has at most one filler) but there can be several 1s in a row (in which case several variables have the same filler).
The disadvantage is that it corresponds to a localist representation - where the responsibility for knowing that variable j has filler i resides in the (i,j)th cell of the tensor alone. If the (i,j)th cell were damaged, the fact would be lost from memory.
We shall call a basis a distributed representation if its vectors have no zeroes or few zeroes in them.
If a distributed representation is used, then the memory for the fact that variable j has filler i is distributed all over the tensor.
Here is a small, simple example of a distributed representation:
( 0.5 0.5 0.5 0.5 ) ( 0.5 -0.5 0.5 -0.5 ) ( 0.5 0.5 -0.5 -0.5 ) ( 0.5 -0.5 -0.5 0.5 )
Retrieval is accomplished by computing dot products. To retrieve the value/filler for a variable v = (vj) from a rank 2 tensor with binding unit values bij, compute fi = Σ j bijvj, for each i. The resulting vector (fi) represents the filler.
Notice that the activation function for fi is a regular-type linear activation function.
To decide whether variable v has filler f, compute D = Σ iΣ j bijvjfi. D will be either 1 or 0. If it is 1, then variable v has filler f, otherwise not.
Another example can be found by doubling the vectors on the previous page and using them as the rows of a matrix.
Suppose we are using representations as follows:
and we want to build a tensor to represent the pairs
(0.5, 0.5, 0.5, 0.5)T to represent rabbit (0.5, -0.5, 0.5, -0.5)T to represent mouse (0.5, 0.5, -0.5, -0.5)T to represent carrot (0.5, -0.5, -0.5, 0.5)T to represent cat
(rabbit, carrot) and (cat, mouse)
We check that we can recover carrot from this by unbinding with rabbit: we must compute fi = Σ j bijvj where bij is the matrix, and (vj) is the rabbit vector:
| f1 | = b11v1 + b12v2 + b13v3 + b14v4 |
| = ½ × ¼ × (1×1 + 1×1 + 1×1 + 1×1) | |
| = 0.5 |
and similarly, f2 = 0.5, f3 = -0.5, and f4 = -0.5, so that f represents carrot.
For (cat, mouse), we compute:
The tensor representing both of these is the sum of the two matrices:
| f1 | = b11v1 + b12v2 + b13v3 + b14v4 |
| = ½ × ¼ × (2×1 + 0×1 + 0×1 + 2×1) | |
| = 0.5 |
and similarly, f2= 0.5, f3= -0.5, and f4= -0.5, so that f represents carrot as before.
So far we have been using the tensor product network to store a particular kind of relational information: variable binding. In variable binding, each variable has a unique filler (at any given time). This restriction on the kind of information stored in the tensor is unnecessary. A rank 2 tensor will store an arbitrary binary relation. For example, suppose I have a set of facts about what, say, animals eat:
| Animal | Food | |
rabbit | ||
mouse | ||
crocodile | ||
rabbit | ||
guineapig | ||
crocodile |
This information can be represented, and stored in the tensor in the usual way, (putting the `animal' in the side we have been calling `variable', and the `food' in the side we have been calling `filler') and then we can try retrieving it. For example, we can present the vector representing rabbit to the variable/animal side of the tensor. What we get out of the filler/food side of the tensor will be the sum of the vectors representing the foods that the tensor has been taught that rabbit eats: in this case carrot + lettuce.
Checking a particular fact, like that (mouse, cheese) is in the relation, is done just as before: we compute D = ΣiΣj bijvjfi: where v is for varmint and f for food, and if D = 1 then the varmint eats the food.
From what we've seen so far, we could better have called these nets `matrix nets'. The tensor aspect of things comes in when we generalise to enable us to store ternary (or higher rank) relations.
Suppose we want to store either a ternary relation (like give, where we record who gives, to whom, and what), or a group of binary relations, where we need to record which relation, and the first and second arguments of each relational instance, like:
kiss(frank, betty)
hit(max, frank).
Now we need a tensor net with three sides: say a REL side, an ARG1 side and an ARG2 side, or more generally a u side, a v side and a w side.
This shows binding units, some activated (shaded), and neurons in the u, v, & w vectors, but not interconnections. The net is ready for retrieval from the u side, given v & w. There are 27 binding units, 3 × 3 × 3. In general, if the u, v, and w sides use vectors with q components, there are q3 binding units.
The binding units are labelled tijk - t for tensor. Each component of a side is connected to a hyperplane[2] of binding units. E.g. v1 is connected to ti1k, for all i and k.
If we have concepts (or their representations) for any two sides of the tensor, then we can retrieve something from the third side. For example, if we have u = (ui) and v = (vj), then we can compute wk = Σij tijk uivj, for each value of k, and the result will be the sum of the vectors representing concepts w such that u(v,w) is stored in the tensor.
This time the activation function for wk is not linear but multilinear. It roughly corresponds to the activation function for higher-order neural networks, where (biologically speaking) two (or more) axons synapse together (axono-axono-dendritic synapse).
As usual, one can check facts, too. D = Σijktijkuivjwk is 1 exactly when u(v,w) is stored in the tensor, and zero otherwise.
The internal layout of the binding unit is more complex in this case, because of the extra input.
To teach the network the fact u(v,w), present u, v and w to the net. In teaching mode, this causes the content of each binding unit memory tijk to be altered by adding uivjwk to it.
For a rank r tensor product network:
This rapidly becomes impractical, as the size of the network (number of binding units) grows as nr, and it is desirable to have n fairly large in practice, since n is the largest number of concepts that can be represented (per side of the tensor). For example, with a rank 6 tensor, with 64 concepts per side, we would need 646 = 236 ~ 64 billion binding units.
This one has 3 components for each of the 4 `directions', so has a total of 34 = 81 binding units. If you want a challenge exercise, you could try drawing all the connections in a manner analogous to figure some pages back labelled "Network Connectivity for a 3×3×3 Tensor Product Network".
We'll look at a connectionist model of human analogical reasoning. First some terminology: detailed diagrams of tensor product networks are complicated, so we'll use the following symbol for a rank 3 tensor product network:
Here v & w are inputs and u is output, but we could make any side the output.
The aim is to simulate simple human analogical reasoning. The tensor product network is used to store facts relevant to the analogical reasoning problem.
Proportional analogy problems are sometimes used in psychological testing. They are fairly easy for a human over a certain age, but it is not particularly clear how to solve them on a machine. A typical example is
dog : kennel :: rabbit : what?
The aim is to find the best replacement for the what?. Here the answer is burrow (or maybe hutch, if we are talking about pet rabbits).
A human usually responds to this problem by saying something like: "The dog lives-in the kennel - what does the rabbit live in? - a burrow." In other words, the human names a relationship between dog and kennel, and then proceeds from there.
However, the human does not pick just any relation between dog and kennel (like smaller-than(dog, kennel)): they pick the most salient relation. How? And how could we do this with a machine?
The tensor product network approach actually finesses this question.
| woman woman woman woman mare mare mare mare woman woman woman baby mare foal rabbit barn barn barn barn barn |
loves mother-of larger feeds feeds mother-of larger larger larger larger lives-in lives-in lives-in lives-in lives-in larger larger larger larger larger |
baby baby baby baby foal foal foal rabbit rabbit foal house house barn barn burrow woman baby mare foal rabbit |
The simulation output below demonstrates four things:
The point of 2-4 is to demonstrate the robustness of tensor product networks. In all 3 cases, the coefficients of foal and rabbit are reduced, and other concepts may show up with non-zero coefficients, but we can always tell that foal is the correct answer, as it has the largest coefficient.
? (solve-simple-analogy 'woman 'baby 'mare :verbose)
Tensor product net analogy solver
Network size: 320 nodes (5×8×8).
Dead so far: 0 nodes (0.0%). Randomized so far: 0 nodes (0.0%)
Total damaged nodes: 0 (0.0%) ...
Problem: WOMAN : BABY :: MARE : ?
Recognizable patterns in solution vector ...
FOAL = 3.000
RABBIT = 1.000
? (load "lobotomy.lisp")
? (nkill 160)
Neurons dead: 50.00%
? (solve-simple-analogy 'woman 'baby 'mare :verbose)
Network size: 320 nodes (5×8×8).
Dead so far: 160 nodes (50.0%). Randomized so far: 0 nodes (0.0%)
Total damaged nodes: 160 (50.0%) ...
Problem: WOMAN : BABY :: MARE : ?
Recognizable patterns in solution vector ...
| BABY FOAL RABBIT | = = = | 0.0005 0.854 0.137 |
? (nkill 128) ; kill another 40%
Neurons dead: 90.00%
? (solve-simple-analogy 'woman 'baby 'mare :verbose)
Network size: 320 nodes (5×8×8).
Dead so far: 288 nodes (90.0%). Randomized so far: 0 nodes (0.0%)
Total damaged nodes: 288 (90.0%)
Problem: WOMAN : BABY :: MARE : ?
Recognizable patterns in solution vector ...
| WOMAN FOAL RABBIT | = = = | 0.001 0.069 0.015 |
? (reset-net) ; restore net to original state
; - all binding units alive
? (nrandomize 64)
Neurons random: 20.00%
? (solve-simple-analogy 'woman 'baby 'mare :verbose)
Network size: 320 nodes (5×8×8).
Dead so far: 0 nodes (0.0%). Randomized so far: 64 nodes (20.0%)
Total damaged nodes: 64 (20.0%)
Problem: WOMAN : BABY :: MARE : ?
Volume setting for randomized nodes: 0.10
Recognizable patterns in solution vector ...
| WOMAN MARE BABY FOAL RABBIT HOUSE BURROW | = = = = = = = | 0.511 0.058 0.456 2.200 0.899 0.312 0.067 |
For more details...
Halford, Wilson, Guo, Gayler, Wiles, and Stewart, Connectionist implications for processing capacity limitations in analogies, pages 363-415 in K.J. Holyoak and J. Barnden (editors) Advances in Connectionist and Neural Computation Theory: Volume 2: Analogical Connections, Norwood, NJ: Ablex, 1994. ISBN: 1-56750-039-0
This work led to a relational view of cognition which is described in:
G.S. Halford, W.H. Wilson, and S. Phillips, Processing capacity defined by relational complexity: Implications for comparative, developmental and cognitive psychology, Behavioral and Brain Sciences 21(6) (1998) 803-831. ISSN 0140-525X.In solving the analogy problem, the TPN was accessed in two different ways:
(1) ARG1 and ARG2 in, REL out
(2) ARG1 and REL in, ARG2 out
This would not have been possible with a backprop net - the input/output structure is "hard-wired" in backprop nets. In the TPN, the same information in the tensor supports both these modes of operation.
This is like when kids learn addition/subtraction - you learn that 9 + 7 = 16, and from this you also know that 16 - 7 = 9. We learn addition tables, but not subtraction tables.
An obvious third access mode: ARG2 and REL in, ARG1 out, is possible. And of course, you can have and ARG1, ARG2, and REL in, YES/NO out access mode.
Less obviously, you can have access modes like: REL in, ARG1⊗ ARG2 out.
In fact there are a total of 7 access modes to a rank 3 tensor.
There are 2k-1 access modes for a rank k tensor.
This property is referred to as omnidirectional access.
Example: Given a rank 5 tensor, with axes REL, ARG1, ARG2, ARG3, and ARG4, one possible access mode would be: ARG1, ARG2 in, REL⊗ ARG3⊗ ARG4 out
Smolensky (AI 46 (1990) full citation above) reviews this. In particular, he presents a lot of theoretical analysis, and inter alia shows that stacks can be implemented in tensor product networks, and indeed that general lists can be. He does this by showing how to implement the basic Lisp primitives car, cdr, and cons (head, tail and : for Haskell hackers).
[Reference: Wilson, Street, and Halford, Solving Proportional Analogy Problems using Tensor Product Networks with Random Representations, 1995 IEEE International Conference on Neural Networks Proceedings, ICNN'95, 2971-2975. http://www.cse.unsw.edu.au/~billw/reprints/randomtensor.ps]
We have seen that we can kill 90% of binding units and still solve problems with the tensor product networks. When we kill a binding unit, this affects the computation of inner products in the retrieval process - it is if the representation vectors are not exactly orthogonal anymore.
So perhaps exact orthogonality is not critical - perhaps approximate orthogonality would be enough. It is also implausible that biological neural networks have orthonormal sets of vectors representing concepts.
What sort of vectors would be `approximately orthogonal'? Answer: vectors with uniform random components chosen from the interval [-r, r] have inner products with mean 0 and variance which depends on r.
Wilson, Street and Halford also tried using as representation vectors, normalised random vectors generated by choosing each component to be a uniform random number from the range [-1,+1], and normalising the resulting vector.
Normalisation has the effect that vectors with more components will (on average) have smaller components.
For this reason, they investigated the effect of increasing the number of components on network performance.
Experiments were done with the analogy problem: woman is to baby as mare is to what? and a particular set of facts.
We classified runs as good or poor or bad.
| good: | score(foal) > 1.2 * score(rabbit) and score(rabbit) >1.2 * next highest score |
| fair: | score(foal) > score(rabbit) > next highest score |
| bad: | not (good or fair) |
The factor 1.2 was an arbitrarily chosen cutoff. [It was not "tuned".]
Table 1: Classification of solutions produced by dense random vector experiments (1000 runs for each number of components; 20 facts; solving woman:baby::mare:what?)
| components | % good | % fair | % bad |
| 25 | 40.9 | 10.3 | 48.8 |
| 40 | 55.5 | 7.5 | 37.0 |
| 70 | 73.5 | 9.3 | 18.2 |
| 100 | 84.8 | 4.3 | 10.9 |
| 130 | 92.6 | 3.5 | 3.9 |
| 160 | 94.5 | 2.5 | 3.0 |
| 190 | 96.7 | 1.8 | 1.5 |
| 220 | 97.4 | 1.4 | 1.2 |
| 250 | 98.7 | 1.2 | 0.1 |
A distributed representation does not need all or even most nodes to have non-zero activations, but any concept should be represented by a significant number of active nodes.
Indeed, neurophysiological data indicates that percent activity of neurons in the different regions of the human hippocampal system ranges from about 0.5% to about 9%.
Accordingly, Wilson, Street and Halford constructed a version of their model which uses sparse random representations: that is representations are generated with most components zero, and the remainder chosen from a uniform random distribution.
Again, the inner products of such vectors have mean zero, and so can be considered quasi-orthogonal.
With small numbers of non-zero components, one would expect freaky behaviour - these cases do not correspond to distributed representations.
In the experiments, vectors were of length 500 (and so the tensor had 125 million binding units).
Each experimental run used a fixed number of non-zero components, ranging from 1 to 50, and chose the components initially as uniform random numbers in [-1,+1], but then normalized the resulting vectors.
One hundred runs were done with each number of non-zero components.
Further experiments on random representations investigated the ability of the networks to recognize individual facts (as opposed to solving analogy problems, which is a more complex task which involves accessing information relating to a large number of facts).
These experiments confirmed that random representations, with enough components in the vectors, permit one to decide whether a proposition is a fact or a non-fact (where we define a fact to be something that the network has been taught.)
The experiments reported here were done with rank 3 networks. More recent experiments on fact and non-fact retrieval with networks of ranks 4, 5, 6, and 7 with Hadamard bases and large numbers of neurons "killed" (rather than with random representations) have shown that one can kill around 90% of the neurons and still be able to distinguish facts from non-facts.
Above rank 7, it begins to become difficult to complete simulations with reasonable amounts of computing resources.