85-419/719 Intro PDP: Homework 2 Feedback


1. Hebb and Delta Rules [50 points]

1A [5 points]: Explain why the weight from input:0 to output:0 is equal to 0.375, and why the weight from input:1 to output:4 is equal to -0.25.
The first pass of the Delta rule (starting with weights of zero) is equivalent to the Hebb rule (with an extra factor of 2.0 due to the exponent of 2.0 in the error function), so the change in weight w_ij for each pattern is equal to 2.0 ε a_i t_j (where ε is the learning rate). After all three patterns (a, b and c), w_ij = 2.0 ε (a_i^a t_j^a + a_i^b t_j^b + a_i^c t_j^c).

Here are the input and output (target) patterns (and unit numbers, for reference):
pattern   Inputs                    Targets
          0  1  2  3  4  5  6  7    0  1  2  3  4  5  6  7
   a      1 -1  1 -1  1  1 -1 -1    1  1  0  0  1  1  0  0
   b      1  1 -1 -1  1 -1  1 -1    1  1  1  1  0  0  0  0
   c      1 -1 -1  1  1  1  1  1    1  0  0  1  1  0  0  1

For i (input) = 0 and j (target) = 0, w_00 = 2.0 x 0.0625 x (1x1 + 1x1 + 1x1) = 0.375. For i = 1 and j = 4, w_14 = 2.0 x 0.0625 x (-1x1 + 1x0 + -1x1) = -0.25.
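
As a sanity check, here is a minimal sketch of this calculation in plain Python/NumPy (an illustration only; the homework's own simulator is not needed for it):

    import numpy as np

    # First pass of the Delta rule from zero weights == Hebb rule, with the
    # extra factor of 2.0 coming from the exponent of 2.0 in the error function.
    lrate = 0.0625
    inputs = np.array([[ 1, -1,  1, -1,  1,  1, -1, -1],   # pattern a
                       [ 1,  1, -1, -1,  1, -1,  1, -1],   # pattern b
                       [ 1, -1, -1,  1,  1,  1,  1,  1]])  # pattern c
    targets = np.array([[1, 1, 0, 0, 1, 1, 0, 0],
                        [1, 1, 1, 1, 0, 0, 0, 0],
                        [1, 0, 0, 1, 1, 0, 0, 1]])

    # w_ij = 2.0 * lrate * sum over patterns of a_i * t_j
    w = 2.0 * lrate * inputs.T @ targets
    print(w[0, 0])   # 0.375
    print(w[1, 4])   # -0.25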

1B [5 points]: If, at this point, you apply the Delta rule by clicking on "Train Network", the weights remain unchanged. Why? What would have happened if the Hebb rule had been applied instead?
Because the training patterns are orthogonal, the Hebb rule (and, hence, the Delta rule for the first pass) works perfectly; that is, after training on the patterns once, a_j = t_j for each pattern. [Strictly speaking, getting this to come out exactly depends on the learning rate; each pattern is composed of 8 ±1's so that its dot-product with itself is 8, and 8 x 2 (factor from exponent) x 0.0625 (learning rate) = 1.0.] On the second pass, the Delta rule is no longer the same as the Hebb rule (because the weights are not zero), but it changes weights according to ε (t_j - a_j) a_i. Because a_j = t_j, t_j - a_j = 0 and so the weight changes are all equal to zero. If the Hebb rule were reapplied, exactly the same weight changes would be applied a second time, and since the weights started at zero, they would double.
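
Continuing the sketch from 1A (same arrays and learning rate), both points can be checked directly:

    # After the first pass the outputs already match the targets exactly.
    outputs = inputs @ w
    print(np.allclose(outputs, targets))       # True: a_j = t_j for every pattern

    # Second Delta-rule pass: changes are proportional to (t_j - a_j), hence zero.
    delta_w = 2.0 * lrate * inputs.T @ (targets - outputs)
    print(np.allclose(delta_w, 0.0))           # True: the weights stay put

    # Re-applying the Hebb rule adds the same increment again, doubling the weights.
    hebb_increment = 2.0 * lrate * inputs.T @ targets
    print(np.allclose(w + hebb_increment, 2.0 * w))   # True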

1C [10 points]: Describe and explain the similarities and differences among the weights produced by the Hebb rule (hebb-li.wt) and those produced by the Delta rule (delta-li.wt) when training on the linearly independent set.
Patterns "a" and "c" are each orthogonal to pattern "b" but not to each other. This means that "b" can be learned perfectly by either learning rule without affecting the others, so that the differences between the Delta and Hebb rules reflect the relationship of "a" and "c". These two patterns differ in inputs 6 and 7 (counting from 0) and targets 1, 3, 5, and 7. The eight weights between these two inputs and four outputs change the most from epoch 1 (Hebb) to epoch 10 (Delta), because these are the weights that allow the unique aspects of each pattern (inputs 6 and 7) to produce the necessary output differences (over outputs 1, 3, 5, and 7). More generally, the weights generally have the same sign, although the Delta rule has fewer non-zero weights.

1D [10 points]: Why was training with the intermediate learning rate more effective than either the higher or lower rate?
Noise is added to both the inputs a_i and the targets t_j, and the noisy inputs cause noisy outputs a_j. As a result, the weight changes themselves (based on the difference between t_j and a_j) are noisy and may even change sign as a result of the noise. Moreover, even accumulated changes over the four patterns that are in the correct direction may be too large, overshooting the target by more than the original discrepancy. Consequently, the weight changes do not always reduce error and may even increase it. If the learning rate is too large, the resulting weight changes can cause performance to jump around rather wildly. When the learning rate is small, temporary increases in error can still occur, but they are fairly small and quickly reversed by subsequent weight changes. If the learning rate is very small, however, the weight changes are too small to improve performance sufficiently given the number of epochs.
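
The overshoot part of this explanation can be made concrete with a toy single-output calculation (the pattern and rates below are illustrative assumptions, not the exercise's actual settings):

    import numpy as np

    # For a linear unit, one Delta-rule update changes the output by
    # 2 * lrate * ||x||^2 * (t - a); if that factor exceeds 2, the update
    # overshoots the target by more than the original discrepancy.
    x = np.ones(8)                       # an input pattern with ||x||^2 = 8
    t = 1.0                              # target for a single output unit
    for lrate in (0.01, 0.0625, 0.5):
        w = np.zeros(8)
        a = x @ w                        # output before the update (0.0)
        w += 2.0 * lrate * (t - a) * x   # one Delta-rule step
        print(lrate, abs(t - x @ w))     # remaining error after the step
    # 0.01   -> 0.84  (small step: slow but safe progress)
    # 0.0625 -> 0.00  (step of exactly 1.0 for these patterns)
    # 0.5    -> 7.00  (overshoot: the error grows instead of shrinking)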

1E [10 points]: Why is learning so much slower using sigmoid units than when using linear units? Describe and explain the similarities and differences in the resulting sets of weights.
There are two contributing factors. The first is that, with linear units, output activations change in a way that is directly proportional to changes in the weights during learning. With sigmoid units, the amount of change in output activation that results from a change in net input gets smaller and smaller as the net input gets larger and larger (either positively or negatively), due to the asymptotic shape of the sigmoid function. The second contributing factor is that the changes themselves are smaller because of the extra factor of a_j(1 - a_j) (equal to the slope of the sigmoid function) that is applied to the error derivatives and which never exceeds 0.25. Thus, it can take a very long time to accumulate enough weight changes to generate net inputs large enough to produce output activations near 0.0 or 1.0. The weights produced by the two unit functions generally agree in sign, but the weights for sigmoid units are very much larger. Another difference is that, with sigmoid units, all the weights are different from 0, whereas with linear units several weights are 0. This is because, in the case of linear units, when the target is 0 in all training examples, the error is always 0 and the weight never changes from 0. With a sigmoid unit, on the other hand, the output is never exactly 0; thus, even if the target is 0, there is a weight change (so long as the input unit is on in at least one training pattern, which in this example is always true).
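
The second factor is easy to verify numerically: the slope of the sigmoid, a(1 - a), peaks at 0.25 when the net input is 0 and shrinks rapidly as the net input grows (a small illustrative sketch, not part of the exercise):

    import numpy as np

    def sigmoid(net):
        return 1.0 / (1.0 + np.exp(-net))

    # The error derivative is scaled by a * (1 - a), which is largest (0.25)
    # at net = 0 and approaches 0 just where large net inputs are needed to
    # drive activations near 0 or 1.
    for net in (0.0, 1.0, 2.0, 4.0, 8.0):
        a = sigmoid(net)
        print(f"net={net:4.1f}  a={a:.4f}  slope={a * (1 - a):.4f}")
    # slopes: 0.2500, 0.1966, 0.1050, 0.0177, 0.0003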

1F [10 points]: Explain why learning fails here [with the "imposs" set], even with the Delta rule.
The Delta rule fails because the target values for at least one of the output units are not linearly separable given the eight inputs. To see this, note that inputs 0, 1, 4 and 5 are identical across the four training patterns. Moreover, input 2 is just the negative of input 3 and input 6 is the negative of input 7, so these don't provide any additional information. So effectively there are two unique input values, creating four input patterns (-1 -1), (-1 1), (1 -1) and (1 1). The targets differ only in positions 2, 3, 4 and 5, but targets 2 and 3 are the same and targets 4 and 5 are the same, so again there are only two unique values, and each of these forms an exclusive-or (XOR) with the input values, and hence is not linearly separable. As a result, all of the weights between inputs 2, 3, 6, 7 and outputs 2, 3, 4, 5 end up being zero (because the weight changes caused by two of the patterns are exactly canceled by the weight changes caused by the other two). The only weights that build up are those that generate the constant outputs from the constant inputs.
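
The cancellation can be demonstrated on the embedded XOR itself. The sketch below assumes a single linear output unit, no bias, and per-epoch (batch) weight updates; these are assumptions of the illustration rather than the exercise's exact configuration:

    import numpy as np

    # XOR over the two informative input values: no single-layer network can
    # separate it, and with batch updates the four patterns' weight changes
    # cancel exactly, so the weights never leave zero.
    x = np.array([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])
    t = np.array([0., 1., 1., 0.])            # XOR target
    w = np.zeros(2)
    lrate = 0.0625

    for epoch in range(100):
        a = x @ w                             # linear outputs
        w += 2.0 * lrate * x.T @ (t - a)      # summed Delta-rule change

    print(w)        # still [0. 0.]
    print(x @ w)    # all outputs 0; the error never decreases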


2. Learning and Generalization [50 points]

[We can't give specific answers here as they depend on the patterns you create. In general, the best answers are those that explain the observations---failure to learn, early vs. late learning, good or bad generalization---in terms of similarities among the input patterns and the consistency with which these patterns "vote" for various output features.]
2A [5 points]: Hand in a table displaying the set of patterns you have constructed, and explain briefly how you designed them.
For illustration purposes, I investigated the "Rule of 78" problem (discussed in McClelland & Rumelhart, 1988, PDP Handbook). This problem involves input patterns which each have three bits on: one among (1 2 3), one among (4 5 6) and one among (7 8). This defines 18 possible inputs (3 x 3 x 2). For each of these, the correct output pattern also has three bits on: the corresponding bits from (1 2 3) and from (4 5 6), and the opposite bit from (7 8). Thus 247 => 248. There is one exception pattern: 147 => 147. This is supposed to be analogous (at a very small scale) to forming the past tense of verbs: for most verbs, you introduce a slight modification at the end by adding "-ed" (e.g., FIT => FITTED), but for some you do nothing (e.g., HIT => HIT). This problem is interesting because the same set of weights must learn to handle this exception while also encoding the "rule" sufficiently well to support generalization (n.b., when we consider language processing later in the course, we will see a number of arguments that rules and exceptions must be handled separately). I trained on the 6 patterns on the left (including the exception 147) and tested for generalization to the remaining 12 patterns on the right (a short sketch for generating the full pattern set follows the table):
                 TRAINED                                         TESTED
         inputs           outputs                       inputs           outputs
     1 2 3 4 5 6 7 8  1 2 3 4 5 6 7 8               1 2 3 4 5 6 7 8  1 2 3 4 5 6 7 8
     ---------------  ---------------               ---------------  ---------------
p147 1 0 0 1 0 0 1 0  1 0 0 1 0 0 1 0          p148 1 0 0 1 0 0 0 1  1 0 0 1 0 0 1 0
p167 1 0 0 0 0 1 1 0  1 0 0 0 0 1 0 1          p157 1 0 0 0 1 0 1 0  1 0 0 0 1 0 0 1
p247 0 1 0 1 0 0 1 0  0 1 0 1 0 0 0 1          p158 1 0 0 0 1 0 0 1  1 0 0 0 1 0 1 0
p258 0 1 0 0 1 0 0 1  0 1 0 0 1 0 1 0          p168 1 0 0 0 0 1 0 1  1 0 0 0 0 1 1 0
p357 0 0 1 0 1 0 1 0  0 0 1 0 1 0 0 1          p248 0 1 0 1 0 0 0 1  0 1 0 1 0 0 1 0
p368 0 0 1 0 0 1 0 1  0 0 1 0 0 1 1 0          p257 0 1 0 0 1 0 1 0  0 1 0 0 1 0 0 1
                                               p267 0 1 0 0 0 1 1 0  0 1 0 0 0 1 0 1
                                               p268 0 1 0 0 0 1 0 1  0 1 0 0 0 1 1 0
                                               p347 0 0 1 1 0 0 1 0  0 0 1 1 0 0 0 1
                                               p348 0 0 1 1 0 0 0 1  0 0 1 1 0 0 1 0
                                               p358 0 0 1 0 1 0 0 1  0 0 1 0 1 0 1 0
                                               p367 0 0 1 0 0 1 1 0  0 0 1 0 0 1 0 1
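
For reference, the full pattern set can be generated mechanically; here is a small sketch in plain Python (the simulator's own pattern-file format is not reproduced here):

    from itertools import product

    def bits(on_units):
        """8-element 0/1 vector with the listed units (numbered 1-8) turned on."""
        return [1 if u in on_units else 0 for u in range(1, 9)]

    # One unit from (1 2 3), one from (4 5 6), one from (7 8); the output keeps
    # the first two choices and flips the (7 8) choice -- except for 147 -> 147.
    patterns = {}
    for a, b, c in product((1, 2, 3), (4, 5, 6), (7, 8)):
        out_c = c if (a, b, c) == (1, 4, 7) else 15 - c   # 15 - 7 = 8, 15 - 8 = 7
        patterns[f"p{a}{b}{c}"] = (bits((a, b, c)), bits((a, b, out_c)))

    print(len(patterns))       # 18 input/output pairs
    print(patterns["p147"])    # exception: output identical to input
    print(patterns["p247"])    # regular:   247 -> 248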

2B [10 points]: Examine how well the network does at the end of training in learning the training examples using each unit activation function (linear and sigmoid), and explain the successes and failures (based on the relationships among the trained patterns). Are there any differences between the two functions in how well the patterns can be learned?
Figure 1 shows the total error for 20 epochs of training with either linear units or sigmoid units, using a learning rate of 0.25. Learning with linear units is much faster than with sigmoid units, because the latter require much larger net inputs to achieve activations near 0 or 1, due to the asymptotic nature of the sigmoid function. This can be seen clearly by examining the weights that result from learning in the two cases (see Figure 2 for linear units and Figure 3 for sigmoid units)---note that training was continued for a total of 100 epochs when using sigmoid units.
Figure 1: Learning with linear vs. sigmoid units.

LINEAR UNITS                   outputs
              1      2      3      4      5      6      7      8
       1   0.76  -0.23  -0.23   0.09   0.09   0.09   0.71  -0.42   
       2  -0.23   0.76  -0.23   0.09   0.09   0.09  -0.28   0.57    
       3  -0.23  -0.23   0.76   0.09   0.09   0.09   0.21   0.07    
inputs 4   0.09   0.09   0.09   0.76  -0.23  -0.23   0.71  -0.42   
       5   0.09   0.09   0.09  -0.23   0.76  -0.23   0.21   0.07    
       6   0.09   0.09   0.09  -0.23  -0.23   0.76  -0.28   0.57    
       7   0.14   0.14   0.14   0.14   0.14   0.14  -0.42   0.85    
       8   0.14   0.14   0.14   0.14   0.14   0.14   1.07  -0.64   
	
Figure 2: Learned weights after 20 epochs of training with linear units. Weights are also displayed as text to make it easier to compare values and magnitudes.

The solution in the linear case is very clear and elegant. Each set of mutually exclusive bits, (1 2 3) and (4 5 6), forms a block in which each bit in the input supports itself in the output and inhibits its competitors. The same is broadly true for the (7 8) group except that the competition is reversed (since generally 7 => 8 and 8 => 7). Note, however, that for these last two output bits, the network listens strongly to inputs 1 and 4 (each with a weight of 0.71 to output 7 and -0.42 to output 8), which together are sufficient to override input 7's own weights (-0.42 to output 7 and 0.85 to output 8) and so produce an output of 7 rather than 8 for the exception 147. Also note that the weights to the (7 8) group from the other inputs have to compensate for these weights when only one of 1 or 4 is present (e.g., 167), overriding the override, so to speak. The solution in the sigmoid case has the same basic flavor but is a bit less clear, and involves much larger weights (see Figure 3).
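
To make the override arithmetic concrete, the net inputs to outputs 7 and 8 can be read straight off the two rightmost columns of Figure 2 (the dictionaries below simply transcribe those rounded values):

    # Weights from each input unit to outputs 7 and 8, copied from Figure 2.
    w7 = {1: 0.71, 2: -0.28, 3: 0.21, 4: 0.71, 5: 0.21, 6: -0.28, 7: -0.42, 8: 1.07}
    w8 = {1: -0.42, 2: 0.57, 3: 0.07, 4: -0.42, 5: 0.07, 6: 0.57, 7: 0.85, 8: -0.64}

    for name, active in [("147 (exception)", (1, 4, 7)),
                         ("167 (regular)",   (1, 6, 7))]:
        net7 = sum(w7[i] for i in active)
        net8 = sum(w8[i] for i in active)
        print(f"{name}: net input to 7 = {net7:.2f}, to 8 = {net8:.2f}")
    # 147 (exception): net input to 7 = 1.00, to 8 = 0.01  -> outputs 7, not 8
    # 167 (regular):   net input to 7 = 0.01, to 8 = 1.00  -> the rule survives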

SIGMOID UNITS                  outputs
              1      2      3      4      5      6      7      8
       1   4.68  -3.27  -3.30  -0.41  -1.72   0.75   3.73  -3.73   
       2  -3.68   5.02  -3.65   0.27  -0.06  -2.14  -2.43   2.43    
       3  -3.12  -3.20   5.26  -1.99   0.08  -0.07   0.26  -0.26    
inputs 4  -0.40   0.72  -1.71   4.66  -3.28  -3.29   3.76  -3.76   
       5  -1.96  -0.07   0.07  -3.12   5.25  -3.18   0.22  -0.22    
       6   0.24  -2.10  -0.06  -3.67  -3.66   5.00  -2.43   2.43    
       7  -0.03  -1.66  -1.00  -0.02  -1.02  -1.67  -4.70   4.70    
       8  -2.08   0.20  -0.69  -2.11  -0.67   0.20   6.26  -6.26   
	
Figure 3: Learned weights after 100 epochs of training with sigmoid units.

2C [15 points]: Choose one of the two activation functions and examine the time course of learning. Try to identify what aspects of your patterns the network learns first and what aspects it learns only later. Describe what you observe and try to explain why it happens.
I chose sigmoid units because they generalize better (see below). To illustrate the time-course of learning, Figure 4 shows the error for three informative patterns over the course of training: the exception (147), a similar "regular" item (167), and a regular item with no similarity to the exception (258). Not surprisingly, the exception produces greater error and takes much longer to learn than the regular items. What is interesting, though, is that, even among the regulars, those that are similar to the exception are learned somewhat slower than the rest. This effect arises because the compensatory weights mentioned above---overriding the override---take a while to develop and are not completely successful. The contaminating influence of exceptions on similar regular items is also found in empirical data from language tasks such as forming the past tense of verbs or reading words aloud.
Figure 4: Time-course of learning with sigmoid units.

2D [20 points]: Now, for both the linear and sigmoid network, consider how well the trained network generalizes to the two patterns that you set aside. Report what happens when you test with these, and explain the results in terms of the relationship between the test patterns and the trained patterns.
Table 1 shows the outputs generated by the network, after 20 epochs of training with linear units, in response to the 12 patterns withheld from training. In general, performance is not particularly good---the total error across these patterns is 14.5. Predictably, the particular examples that cause the most difficulty (148, 158, 267, 348) are those that are somewhat similar to the exception 147. In fact, the one that is most similar, 148, is also the one that produces the most error by far, 4.5. Thus, with linear units, the network more directly reflects the similarity of items within the training corpus. Also, not surprisingly, the output values that produce the most error are the 7 and 8 positions.

Table 1: Linear Units
Pattern Error Outputs (Targets)
148 4.501 1.0 (1.0) 0.0 (0.0) 0.0 (0.0) 1.0 (1.0) 0.0 (0.0) 0.0 (0.0) 2.5 (1.0) -1.5 (0.0)
157 0.499 1.0 (1.0) 0.0 (0.0) 0.0 (0.0) 0.0 (0.0) 1.0 (1.0) 0.0 (0.0) 0.5 (0.0) 0.5 (1.0)
158 2.001 1.0 (1.0) 0.0 (0.0) 0.0 (0.0) 0.0 (0.0) 1.0 (1.0) 0.0 (0.0) 2.0 (1.0) -1.0 (0.0)
168 0.500 1.0 (1.0) 0.0 (0.0) 0.0 (0.0) 0.0 (0.0) 0.0 (0.0) 1.0 (1.0) 1.5 (1.0) -0.5 (0.0)
248 0.500 0.0 (0.0) 1.0 (1.0) 0.0 (0.0) 1.0 (1.0) 0.0 (0.0) 0.0 (0.0) 1.5 (1.0) -0.5 (0.0)
257 0.501 0.0 (0.0) 1.0 (1.0) 0.0 (0.0) 0.0 (0.0) 1.0 (1.0) 0.0 (0.0) -0.5 (0.0) 1.5 (1.0)
267 2.003 0.0 (0.0) 1.0 (1.0) 0.0 (0.0) 0.0 (0.0) 0.0 (0.0) 1.0 (1.0) -1.0 (0.0) 2.0 (1.0)
268 0.500 0.0 (0.0) 1.0 (1.0) 0.0 (0.0) 0.0 (0.0) 0.0 (0.0) 1.0 (1.0) 0.5 (1.0) 0.5 (0.0)
347 0.500 0.0 (0.0) 0.0 (0.0) 1.0 (1.0) 1.0 (1.0) 0.0 (0.0) 0.0 (0.0) 0.5 (0.0) 0.5 (1.0)
348 2.002 0.0 (0.0) 0.0 (0.0) 1.0 (1.0) 1.0 (1.0) 0.0 (0.0) 0.0 (0.0) 2.0 (1.0) -1.0 (0.0)
358 0.501 0.0 (0.0) 0.0 (0.0) 1.0 (1.0) 0.0 (0.0) 1.0 (1.0) 0.0 (0.0) 1.5 (1.0) -0.5 (0.0)
367 0.501 0.0 (0.0) 0.0 (0.0) 1.0 (1.0) 0.0 (0.0) 0.0 (0.0) 1.0 (1.0) -0.5 (0.0) 1.5 (1.0)
Total 14.509

Table 2 shows the equivalent data after 100 epochs of training using sigmoid units. Here, generalization is much better---the total error is only 0.98. In fact, all output units are on the correct side of 0.5 for all patterns. Again, the patterns that produce the most error (157, 158, 347, 348) are similar to 147, although they are not exactly the same patterns as were difficult for linear units. In particular, it is not always the items which are most similar to 147 which produce the most error---the nonlinearity of the sigmoid activation function allows the network to ignore certain types of similarity (e.g., with 148) when it is beneficial to do so. Also, it is not only positions 7 and 8 that cause problems---sometimes positions 1 or 4 are the most problematic.

Table 2: Sigmoid Units
Pattern Error Outputs (Targets)
148 0.037 0.900 (1.0) 0.087 (0.0) 0.003 (0.0) 0.894 (1.0) 0.003 (0.0) 0.088 (0.0) 1.000 (1.0) 0.000 (0.0)
157 0.219 0.936 (1.0) 0.007 (0.0) 0.014 (0.0) 0.028 (0.0) 0.925 (1.0) 0.016 (0.0) 0.322 (0.0) 0.678 (1.0)
158 0.133 0.655 (1.0) 0.041 (0.0) 0.019 (0.0) 0.003 (0.0) 0.946 (1.0) 0.097 (0.0) 1.000 (1.0) 0.000 (0.0)
168 0.003 0.945 (1.0) 0.006 (0.0) 0.017 (0.0) 0.002 (0.0) 0.002 (0.0) 0.997 (1.0) 0.999 (1.0) 0.001 (0.0)
248 0.003 0.002 (0.0) 0.997 (1.0) 0.002 (0.0) 0.944 (1.0) 0.018 (0.0) 0.005 (0.0) 0.999 (1.0) 0.001 (0.0)
257 0.005 0.003 (0.0) 0.964 (1.0) 0.010 (0.0) 0.054 (0.0) 0.985 (1.0) 0.001 (0.0) 0.001 (0.0) 0.999 (1.0)
267 0.106 0.030 (0.0) 0.778 (1.0) 0.009 (0.0) 0.032 (0.0) 0.009 (0.0) 0.767 (1.0) 0.000 (0.0) 1.000 (1.0)
268 0.083 0.004 (0.0) 0.958 (1.0) 0.012 (0.0) 0.004 (0.0) 0.012 (0.0) 0.955 (1.0) 0.801 (1.0) 0.199 (0.0)
347 0.239 0.027 (0.0) 0.016 (0.0) 0.927 (1.0) 0.934 (1.0) 0.014 (0.0) 0.006 (0.0) 0.338 (0.0) 0.662 (1.0)
348 0.146 0.004 (0.0) 0.093 (0.0) 0.946 (1.0) 0.636 (1.0) 0.020 (0.0) 0.041 (0.0) 1.000 (1.0) 0.000 (0.0)
358 0.004 0.001 (0.0) 0.044 (0.0) 0.990 (1.0) 0.001 (0.0) 0.991 (1.0) 0.045 (0.0) 0.999 (1.0) 0.001 (0.0)
367 0.004 0.051 (0.0) 0.001 (0.0) 0.985 (1.0) 0.003 (0.0) 0.010 (0.0) 0.963 (1.0) 0.001 (0.0) 0.999 (1.0)
Total 0.982

There are two major conclusions to draw from these results:

  1. Both regular and exception items can be processed successfully within a single system; moreover, the pattern of contamination from the exceptions to the regulars matches (qualitatively) the corresponding empirical findings.
  2. Despite having learned an exceptional item, the system can nonetheless apply the "rule" successfully in generalizing to novel inputs (e.g., the past tense of DIT is DITTED rather than DIT; MAVE is pronounced to rhyme with GAVE rather than HAVE). This is true even when the system has only been trained on a relatively small proportion of the regular items (5 of 17 in the current simulation). Again, there is some degree of contamination for items which are similar to the exception, but this contamination is not so strong as to prevent correct responding (at least for sigmoid units), although it might be expected to make responses a bit slower in these cases. This also seems to be a property of human performance in these domains.