Attractor networks can encourage generalization by enforcing "well formedness" constraints on the intermediate and output representations produced by an otherwise feed-forward process (Mathis & Mozer, 1995). Such constraints are embodied in these networks as distinct fixed-point attractors for every possible well formed representation. Patterns may be "cleaned up" by such a network via a process of settling over time to one of these meaningful, well formed, and stable activation states. The potentially combinatoric space of valid attractor basins need not be explicitly trained, however, but may arise in the compositional interaction of trained attractors (Plaut & McClelland, 1993). It has long been known that the training of recurrent networks may result in spurious attractor basins: fixed-point attractors which are not explicitly trained (Hopfield, 1982). Under appropriate conditions, however, these spurious attractors may actually arise in a systematic manner, producing serendipitous basins which encode novel but meaningful patterns of activation. We refer to the dynamics of such networks as containing articulated attractors - meaningful attractor basins arising from the compositional interaction of explicitly trained attractors.
This paper provides an empirical analysis of the conditions under which articulated attractors form in recurrent neural networks trained using various versions of backpropagation (Rumelhart et al., 1986a). This work stemmed from our initial attempts to incorporate an attractor network of this kind into our connectionist model of instruction following (Noelle & Cottrell, 1995), a model which develops an internal representation of verbal instructions in the service of a task (St. John, 1992). We discovered that articulated attractors did not appear in this model, and this paper sprang from our attempt to explain why. In hopes of acquiring a deep understanding of the learning difficulties experienced by our model, we began with the simplest attractor network architecture possible - a single recurrent layer of processing elements. We then incrementally augmented this network with further layers of units, expanding the complexity of the architecture towards the configuration of our instruction following model. This investigation revealed that articulated attractors form readily when the network's recurrent layer is directly provided with a teaching signal, but such systematic dynamics do not appear when recurrent weights are shaped by backpropagated error. In addition to demonstrating this finding, this paper presents some possible explanations for why this is so.
We begin by describing the simple structured memory task which we used to examine attractor formation in a number of recurrent network architectures. We then present simulation results for three successively more complex architectures, and we close with a discussion of these results.
To be specific, each pattern represented a structure containing two slots, each holding exactly one of five distinct fillers. The contents of the slots were considered independent, with the specific filler in one slot in no way constraining the filler for the other. The whole was encoded as a 10 element binary vector, divided into two groups of five. Since each slot could contain only a single filler, exactly one element in each group of five was turned "on" in each valid pattern. The networks, then, were to learn an attractor for each input pattern involving exactly one of the first five elements "on" and exactly one of the last five elements "on". Thus, with five possibilities for each of two slots, there were only 5^2 = 25 patterns considered "well formed" out of the 2^10 = 1024 possible binary input vectors. This task is depicted schematically in Figure 1. The diagram on the left side of that figure depicts the mapping being performed as the network settles, and the table on the right provides an example of the time course of input activity, expected output activity, and the target output. Note that the input pattern is made available to the network for the first time step only, requiring the network to both "clean up" and remember the pattern over time.
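The encoding just described can be made concrete with a short sketch. The function names (`encode`, `is_well_formed`) are illustrative and do not come from the original simulations; the code only enumerates the pattern space to confirm the counts given above.

```python
from itertools import product

N_SLOTS = 2
N_FILLERS = 5

def encode(fillers):
    """Return the 10-element binary vector for a tuple of filler indices, one per slot."""
    vec = [0] * (N_SLOTS * N_FILLERS)
    for slot, filler in enumerate(fillers):
        vec[slot * N_FILLERS + filler] = 1
    return vec

def is_well_formed(vec):
    """True when exactly one element is "on" in each group of five."""
    groups = [vec[i:i + N_FILLERS] for i in range(0, len(vec), N_FILLERS)]
    return all(sum(group) == 1 for group in groups)

# Every valid slot-filler structure, and every possible binary input vector.
valid_patterns = [encode(f) for f in product(range(N_FILLERS), repeat=N_SLOTS)]
all_vectors = [list(v) for v in product((0, 1), repeat=N_SLOTS * N_FILLERS)]
n_well_formed = sum(is_well_formed(v) for v in all_vectors)
```

Here `valid_patterns` holds the 25 well formed vectors, and `n_well_formed` counts how many of the 1024 binary vectors pass the same test.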
Systematic generalization was the focus of these experiments. The goal was to produce a fixed-point attractor for every valid slot-filler structure, given training on only a fraction of these valid patterns. To this end, each network was explicitly trained on some subset of the allowable input patterns, encouraging the formation of attractors for these patterns by the presentation of an error signal on every time step for a fixed settling period. Once trained, each network was then tested on all valid slot-filler representations, and the number of fixed-point attractors corresponding to these valid patterns was determined. The dynamic behavior of each trained network was also examined to locate any spurious attractors corresponding to ill formed patterns.
For small training sets, consisting of only a few valid patterns, we expected the networks to simply memorize the training instances - to build attractors only for the presented patterns. We predicted, however, that beyond some threshold in training set size the networks would generalize to all valid structures. In order to test this hypothesis, we trained each network architecture on multiple training sets, varying in size. Each trained network was examined to determine the attractor structure resulting from its training. At least five patterns were present in each training set, as this was the minimum number needed to turn each input element "on" at least once over a training set. The largest training set consisted of all 25 well formed patterns. The frequency of each filler in each training set was balanced as much as was possible given the small size of the training sets. Noise was added to input elements during training, but this noise never exceeded 5% of the activation range of the elements (i.e., 0.05 for binary units and 0.10 for bipolar units). Network output targets consisted of the "clean" patterns over the entire time course of network settling, as shown in Figure 1. A settling period of 10 time steps was used during training, and 100 time steps were used during testing.
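One way to build training sets with these properties is sketched below. The construction (stepping through the diagonals of the 5 x 5 grid of filler pairs) and the uniform noise model are assumptions made for illustration; the text does not say how the original sets were chosen or how the noise was distributed.

```python
import random

N_FILLERS = 5

def balanced_training_set(size):
    """Pick `size` distinct (slot-1 filler, slot-2 filler) pairs with balanced filler counts.

    Pairs are taken diagonal by diagonal from the 5 x 5 grid, so in any prefix
    the filler counts within a slot differ by at most one.
    """
    if not 5 <= size <= 25:
        raise ValueError("size must be between 5 and 25")
    pairs = [(i, (i + k) % N_FILLERS)
             for k in range(N_FILLERS) for i in range(N_FILLERS)]
    return pairs[:size]

def encode(pair):
    """Localist 10-element code: one unit "on" per slot."""
    vec = [0.0] * (2 * N_FILLERS)
    vec[pair[0]] = 1.0
    vec[N_FILLERS + pair[1]] = 1.0
    return vec

def add_noise(vec, max_noise=0.05, rng=random):
    """Perturb each element by uniform noise in [-max_noise, max_noise] (assumed noise model)."""
    return [v + rng.uniform(-max_noise, max_noise) for v in vec]
```

With `size=5` the set is the main diagonal of the grid, which turns each of the ten input elements "on" exactly once. For bipolar units the same sketch would use `max_noise=0.10`.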
A summary of the results over all training set sizes is shown on the right side of Figure 2. That graph displays the number of well formed attractors found in a trained network as a function of the training set size. The plot also includes a reference line which depicts the hypothetical case of no generalization outside of the training set. Notice that small training sets resulted in the simple memorization of the trained attractors, but networks that saw at least half of all valid patterns consistently generalized to all 25 allowable structures. Furthermore, none of these networks constructed spurious attractor basins corresponding to ill formed patterns. The weights of these successful networks took the unsurprising form of two uncoupled winner-take-all networks. Each unit had a highly weighted self-connection and inhibited the other four units in its group of five. Weights on connections between units for different slots (i.e., between the two groups of five units) remained close to zero.
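The weight structure described above can be written down by hand and checked. The sketch below assumes logistic units with synchronous updates, a self-connection weight of 8, within-group inhibition of -8, and a bias of -4; none of these values is taken from the trained networks. It also starts the network in the clean pattern rather than presenting the pattern through an input, and it checks only that each valid pattern is retained after settling. It makes no claim about spurious attractors.

```python
import math
from itertools import product

N = 5  # fillers per slot

def wta_weights(self_w=8.0, inhib_w=-8.0):
    """10 x 10 recurrent weights: two uncoupled winner-take-all groups of five."""
    size = 2 * N
    w = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i // N == j // N:  # units in the same slot
                w[i][j] = self_w if i == j else inhib_w
    return w

def settle(state, w, bias=-4.0, steps=100):
    """Synchronously update logistic units for a fixed number of time steps."""
    for _ in range(steps):
        nets = [sum(wij * s for wij, s in zip(row, state)) + bias for row in w]
        state = [1.0 / (1.0 + math.exp(-net)) for net in nets]
    return state

def encode(pair):
    """Localist 10-element code: one unit "on" per slot."""
    vec = [0.0] * (2 * N)
    vec[pair[0]] = 1.0
    vec[N + pair[1]] = 1.0
    return vec

w = wta_weights()
# Count the valid patterns still intact after 100 settling steps.
retained = sum(
    [round(a) for a in settle(encode(p), w)] == encode(p)
    for p in product(range(N), repeat=2)
)
```

Under these assumed values, `retained` counts every one of the 25 valid patterns, because the cross-slot weights are zero and each group of five settles independently of the other.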
Given a sufficiently large training set, these networks consistently exhibited an emergence of systematicity. Generalization was perfect, with a fixed-point attractor formed for every valid pattern.
It is fairly clear why the inclusion of input weights introduced no additional difficulty for the learning of articulated attractors. The training of the input weights was essentially decoupled in time from the training of the recurrent weights. During the initial time step, output activity was at the initialized level of zero, which implied no change to the recurrent weights since this activity plays a multiplicative role in the backpropagation weight update equation. In other words, only the input weights could be updated on the first time step. On the other hand, on every subsequent time step the input layer activity was clamped to zero, directing all weight updates to the recurrent connections. In short, each of the weight matrices was provided with its own direct error signal at regular times during training.
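The multiplicative role of presynaptic activity can be seen in a few lines. The sketch below uses the standard gradient-descent update, in which the change to a weight is the product of the learning rate, the error term at the receiving unit, and the activity of the sending unit. The error terms and activity values are hypothetical numbers chosen only for illustration.

```python
def weight_update(deltas, presynaptic, lr=0.1):
    """Gradient-descent step for one weight matrix: dW[i][j] = -lr * delta[i] * a[j]."""
    return [[-lr * d * a for a in presynaptic] for d in deltas]

deltas = [0.2, -0.1]  # hypothetical error terms at two receiving units

# Time step 1: input pattern present, prior output activity still zero.
d_input_t1 = weight_update(deltas, presynaptic=[1.0, 0.0])
d_recur_t1 = weight_update(deltas, presynaptic=[0.0, 0.0])

# Later time steps: input clamped to zero, output activity nonzero.
d_input_t2 = weight_update(deltas, presynaptic=[0.0, 0.0])
d_recur_t2 = weight_update(deltas, presynaptic=[0.9, 0.1])
```

On the first time step only `d_input_t1` contains nonzero entries, and on later steps only `d_recur_t2` does, which is the decoupling in time described above.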
The performance of this configuration, shown in the center of Figure 4, is grim. These networks not only failed to generalize, but they often failed to form attractors for training set patterns. Also, several spurious attractors (as many as 8 for some training set sizes) arose for ill formed patterns. The introduction of an indirect error signal presented a serious obstacle to the formation of articulated attractors.
In hopes of remedying this situation, the training procedure for this architecture was modified in a number of substantial ways. The first modification involved the number of time steps experienced by the network during training. An examination of the dynamics associated with well formed patterns revealed that, when presented with a valid pattern, the activation state of the network often drifted away from that well formed configuration, but it did so only very slowly. After 10 time steps of settling (which was the settling period during network training) almost all training set patterns appeared intact at the output layer. This observation suggested that the number of settling time steps experienced by the network during training was sufficient to keep the network from drifting away from the training patterns too quickly but was insufficient to construct the needed stable fixed-point attractors. To correct for this problem, we retrained these networks using incrementally larger settling times during training. In other words, whenever a network successfully retained the training patterns for t time steps during training, the settling time was advanced to (t+1) for the next training epoch. Unfortunately, this strategy did not work. Invariably, some settling time threshold would be reached, past which the networks would not learn.
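The incremental schedule can be summarized as a short loop. This is a sketch only: `train_epoch` and `retains_patterns` are hypothetical callbacks standing in for the actual training and testing code, and the starting period and epoch budget are assumed parameters.

```python
def train_with_growing_settling(train_epoch, retains_patterns,
                                start_steps=10, max_epochs=1000):
    """Advance the settling period by one step each time the patterns are retained.

    train_epoch(steps): hypothetical callback that runs one training epoch
        with `steps` settling time steps.
    retains_patterns(steps): hypothetical callback that returns True if every
        training pattern survives `steps` time steps.
    Returns the settling period in force when the epoch budget runs out.
    """
    steps = start_steps
    for _ in range(max_epochs):
        train_epoch(steps)
        if retains_patterns(steps):
            steps += 1  # patterns held for t steps, so train with t + 1 next epoch
    return steps
```

A network that stops learning past some threshold shows up here as a returned value that no longer grows as `max_epochs` is increased.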
Our next modification involved using a more robust estimate of the error gradient by backpropagating error through time all the way to the first time step. Using complete BPTT instead of the SRN training method showed no significant improvement by itself, but when coupled with a switch to a bipolar activation function (units which ranged in activation between -1 and 1) and with a reduced learning rate (0.001), this architecture began to successfully memorize the training set attractors. Systematic generalization remained elusive, however. This performance is shown in Figure 4, on the right.
In the previous two network architectures, the pattern of activation at the recurrent layer was consistently both polarized and sparse. Units tended to be either all the way "on" or all the way "off", and only two of the ten units were "on" for any given valid pattern. These properties of the recurrent layer activation patterns were directly enforced by the error signal provided at the output. In the case of an indirect error signal, however, these properties are no longer directly determined by training. Since the recurrent layer is a hidden layer in these networks, other patterns of activation are free to arise there. Indeed, the activation patterns at the recurrent hidden layers of these networks were quite distributed, with approximately half of the hidden units being highly positive for any given training pattern. These recurrent layer patterns still tended to be polarized, however, presumably because it is easier to construct stable fixed-point attractors in the corners of activation space. [1] Still, these networks apparently included a sufficient number of free parameters (weights) to associate a fairly arbitrary distributed hidden layer attractor with each training pattern. Unlike the dual winner-take-all structure learned by the previous two architectures, these attractors showed few signs of compositionality.
This problem of hidden layer representation is serious. It is quite possible for a network to learn a hidden layer encoding of input patterns which is consistent with the training items but is inherently incapable of generalizing to other valid patterns. This problem may be illustrated by the simple example shown in Figure 5. This diagram displays a small piece of a network, including two hidden units and two output units. Two possible configurations of weights between these processing elements are shown, with the output bias weights always being slightly negative. Both configurations can produce the given training set targets at their outputs, but only the configuration on the left is capable of producing the generalization target. The weights in the right network fragment fail because they collapse too many distinct hidden layer patterns to single output patterns. For generalization to have any hope of occurring, hidden layer activation space must retain distinct correlates to the entire range of valid outputs.
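The collapsing problem can be reproduced in a toy setting. The weights, bias, threshold units, and training items below are assumptions invented for this sketch and are not the values drawn in Figure 5; they only give one concrete instance of two weight configurations that agree on the training items and disagree on a novel pattern.

```python
from itertools import product

BIAS = -0.5  # "slightly negative" output bias (value assumed)

def outputs(hidden, weights):
    """Threshold output units: a unit is on when its net input exceeds zero."""
    return tuple(int(sum(w * h for w, h in zip(row, hidden)) + BIAS > 0)
                 for row in weights)

copy_weights = [[1, 0], [0, 1]]           # each output copies one hidden unit
collapsing_weights = [[1, -1], [-1, 1]]   # each output is inhibited by the other hidden unit

# Assumed training items (hidden pattern -> output target) and one novel item.
training = {(1, 0): (1, 0), (0, 1): (0, 1)}
novel_hidden, novel_target = (1, 1), (1, 1)

def reachable(weights):
    """Distinct output patterns reachable from the four binary hidden states."""
    return {outputs(h, weights) for h in product((0, 1), repeat=2)}
```

Both weight sets reproduce the two training items, but only `copy_weights` maps `novel_hidden` to `novel_target`. With `collapsing_weights` the hidden states (0, 0) and (1, 1) produce the same output, so one of the four output patterns can never be reached.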
One way to avoid this "collapsing" of hidden layer space is to drive the weight vectors coming out of each hidden unit towards mutual orthogonality. This constraint makes the contribution of each hidden unit to the formation of an output pattern orthogonal to the contributions of the other hidden units. Note that the weight set schematically shown on the left in Figure 5, which effectively copies the hidden layer activation pattern to the output layer, is one example of a set of orthogonal outgoing weight vectors which is capable of appropriate generalization. To test this idea of an orthogonality constraint, we added a term to our squared error objective function of the form:
... where a and b are hidden unit indices and theta is the angle between their outgoing weight vectors. Unfortunately, depending on the proportion with which this error term was mixed with squared error, orthogonalization either interfered with the learning of even training set patterns or had little effect at all. We noticed that the orthogonalization term often moved the hidden layer representations away from the corners of activation space, where attractors were typically constructed, so we also added a polarization error term which encouraged bipolar vectors at the hidden layer. This term took the form of:
... where o_a is the activation level of hidden unit a. Even when the objective function was augmented with both of these terms, the best networks still did little more than memorize training patterns.
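The exact expressions for the two penalty terms are not reproduced in this text. The sketch below shows one plausible pair, assumed purely for illustration: a sum of squared cosines of the angles between outgoing weight vectors, and a sum of (1 - o_a^2) over hidden units with bipolar activations. Neither should be read as the form actually used in the simulations.

```python
import math

def orthogonality_penalty(outgoing):
    """Sum of squared cosines between every pair of outgoing weight vectors.

    `outgoing[a]` is the vector of weights leaving hidden unit a.
    The penalty is zero exactly when all pairs are orthogonal (assumed form).
    """
    total = 0.0
    for a in range(len(outgoing)):
        for b in range(a + 1, len(outgoing)):
            dot = sum(x * y for x, y in zip(outgoing[a], outgoing[b]))
            norm_a = math.sqrt(sum(x * x for x in outgoing[a]))
            norm_b = math.sqrt(sum(y * y for y in outgoing[b]))
            total += (dot / (norm_a * norm_b)) ** 2
    return total

def polarization_penalty(activations):
    """Sum of (1 - o_a^2): zero when every bipolar activation sits at -1 or +1 (assumed form)."""
    return sum(1.0 - o * o for o in activations)
```

Either term would be scaled by a mixing coefficient and added to the squared error, and the results above suggest that the choice of that proportion is delicate.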
This problem may be viewed as one of finding a way to bias the learning of a multi-layer network in a way which encourages the general formation of articulated attractors without essentially "hard wiring" the structure of the input patterns. The main question that has yet to be answered is: What is the correct inductive bias for this task? We suggest that this bias should encourage recurrent hidden layer representations which use polarized activation levels and should drive hidden to output weights towards configurations which preserve, as much as possible, accessibility to the whole range of potential output patterns. Polarization is taken as a goal for the sake of the stability of attractor learning. Even with "corner attractors", however, these networks still need to avoid hidden to output mappings which restrict generalization. A technique such as activation sharpening (French, 1991) could potentially produce the kinds of representations needed, but this would require an a priori specification of the number of hidden elements "on" for each pattern. Still, an inductive bias of this sort may be the best that is possible under an indirect error signal.
Our future work will focus on solving this indirect error signal problem using two distinct approaches: by modifying the input pattern encoding and by modifying the network architecture. The first of these approaches involves encoding slot fillers in a non-localist fashion. Rather than assigning a single input and output unit to each filler, a more distributed representation could be used for filler values. This might involve a less sparse binary code in which different fillers share "on" elements, or it might involve a real vector encoding which retains the orthogonality of filler representations present in our localist code. Using a more distributed representation would cause the weights from individual input units and to individual output units to play a significant processing role over multiple filler values. The additional utilization of these weights may facilitate generalization to novel slot-filler patterns.
We will also consider encouraging articulated attractors by constraining the network architecture. In particular, we plan to investigate the possibility of initializing weights at the recurrent layer to a configuration which embodies a collection of winner-take-all networks. These will be implemented using a softmax constraint (Bridle, 1990), so backpropagated error can still successfully reach weights feeding into the attractor network. Also, restricted receptive fields among the hidden to output connections might be used to approximate an orthogonality constraint on this mapping. Such strong architectural constraints may be necessary to consistently produce articulated attractors from distal error.
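As a minimal sketch of the softmax idea, the function below normalizes activity separately within each group of five units, so that each group behaves like a soft winner-take-all competition while remaining differentiable. It shows only the normalization step; how such a constraint would be wired into the recurrent layer is not specified here, and the function name and grouping argument are illustrative.

```python
import math

def groupwise_softmax(net_inputs, group_size=5):
    """Apply softmax separately within each consecutive group of units."""
    out = []
    for start in range(0, len(net_inputs), group_size):
        group = net_inputs[start:start + group_size]
        peak = max(group)  # subtract the maximum for numerical stability
        exps = [math.exp(x - peak) for x in group]
        total = sum(exps)
        out.extend(e / total for e in exps)
    return out

# Example: ten net inputs, one clear winner per slot.
acts = groupwise_softmax([2.0, 0.0, 0.0, 0.0, 0.0,
                          0.0, 0.0, 0.0, 3.0, 0.0])
```

Within each group the activations sum to one and the unit with the largest net input receives the largest share.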
If further investigation reveals that the learning of systematic attractor structures from a distal teaching signal requires specific constraints on network architecture, cognitive models which utilize such attractor networks will need to assume some significant innate constraints on learning. This does not mean that an architecture specifically tuned to a particular task, such as reading aloud or proper production of verb tense, is necessary. The required innate constraints may simply involve the early presence of lateral inhibition between processing elements grouped into clusters or the existence of map-like structures arising from topologically regular connection patterns. The learning bias introduced by such general connection patterns may be all that is needed. Still, the work presented in this paper suggests that the simple presence of recurrent connections is not enough to produce systematic attractor dynamics. Learning to enforce "well formedness" constraints on internal representations may require somewhat structured network architectures.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179-211.