85-419/719 Intro PDP: Homework 3 Feedback

1. XOR [40 points]

1A [15 pts.] Explain, based on the values of the weights in the network and the details of the back-propagation learning procedure, why the output derivative of the left hidden unit (hidden:0) is larger than that of the right hidden unit (hidden:1), and why both values are very much smaller than that of the output unit.

According to the back-propagation equations, the output derivative of hidden unit i, ∂E/∂ai, is equal to the sum over all output units j of the product of the input derivative of that output unit, ∂E/∂nj, and the weight wij from hidden unit i to output unit j. The sum is irrelevant here as there is only one output unit, and ∂E/∂nj is the same factor for both hidden units. Thus the difference in magnitude of ∂E/∂ai for the two hidden units must be due to differences in the magnitudes of their outgoing weights. And, indeed, the weight from the left hidden unit (hidden:0) to the output unit (0.272) is much larger than that from the right hidden unit (hidden:1; 0.082). Both values of ∂E/∂ai for the hidden units are much smaller than the value for the output unit, ∂E/∂aj, because the former are derived from the latter by multiplying by several factors that are all less than 1.0: namely aj, 1-aj, and wij.

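To make the chain of factors concrete, here is a minimal Python sketch of this calculation using the hidden-to-output weights quoted above (0.272 and 0.082). The output activation and its output derivative are purely illustrative assumptions, not values from the homework network; the point is only the relative magnitudes.

```python
# Backprop chain discussed above, with the hidden-to-output weights quoted
# in the answer.  a_out and dE_da_out are illustrative assumptions.

a_out = 0.55           # assumed output activation for one training case
dE_da_out = 0.9        # assumed output derivative of the output unit, dE/daj

# Input derivative of the output unit: dE/dnj = dE/daj * aj * (1 - aj)
dE_dn_out = dE_da_out * a_out * (1.0 - a_out)

# Output derivative of each hidden unit: dE/dai = sum over j of dE/dnj * wij
w_h0_out, w_h1_out = 0.272, 0.082   # weights quoted in the answer
dE_da_h0 = dE_dn_out * w_h0_out
dE_da_h1 = dE_dn_out * w_h1_out

print(f"output unit  dE/da = {dE_da_out:.4f}")
print(f"hidden:0     dE/da = {dE_da_h0:.4f}")   # larger, because its outgoing weight is larger
print(f"hidden:1     dE/da = {dE_da_h1:.4f}")   # smaller, because its outgoing weight is smaller
```
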
[Many people had trouble with this question. Many talked about the size of the input-to-hidden weights, rather than the hidden-to-output weights, and were unclear about which derivative was being discussed (with respect to output, net input, etc.) and which formulas were relevant. Some explained why the output derivative of the output unit is large, but not specifically why it is larger than the hidden unit output derivatives (i.e., didn't mention the additional terms in the derivative).]

 

1B [25 pts.] Explain how the network has solved the XOR problem. Specifically, for each hidden unit, describe how it is behaving across the four cases and explain how it accomplishes this given its bias, its weights from the input units, and the sigmoid activation function. Then explain how the output unit behaves in accordance with XOR based on its bias, its weights from these hidden units (and their activations for the four cases), and the sigmoid function.

Both hidden units have large positive (and roughly equal) weights from each input line, and a large negative bias. For hidden:0, the bias (-2.36) is less in magnitude than either input weight in isolation (5.82), causing it to be off for the {0 0} case but on if either input line is active (which is true for all the other cases: {0 1}, {1 0}, and {1 1}). This corresponds to the logical function OR. In contrast, the bias of hidden:1 (-5.21) is much larger in magnitude than either input weight in isolation (3.40) but not larger than the sum of the two (6.80). This means that the unit is off except in the case where both input lines are active, {1 1}, corresponding to the logical function AND. For both units, the lower and upper asymptotes of the sigmoid function help make their activations close to zero or one (across the various cases), and thus all the cases producing a zero (or a one) can be treated as equivalent by the output unit.

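You can verify this OR/AND division of labor with a short Python sketch using the biases and weights quoted above; note that it treats the two input weights of each hidden unit as exactly equal, which is only approximately true of the trained network.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hidden-unit weights and biases as quoted above (both input weights of each
# unit taken to be equal).
hidden0 = {"w": (5.82, 5.82), "bias": -2.36}   # behaves like OR
hidden1 = {"w": (3.40, 3.40), "bias": -5.21}   # behaves like AND

for i1, i2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    a0 = sigmoid(hidden0["w"][0] * i1 + hidden0["w"][1] * i2 + hidden0["bias"])
    a1 = sigmoid(hidden1["w"][0] * i1 + hidden1["w"][1] * i2 + hidden1["bias"])
    print(f"input {{{i1} {i2}}}: hidden:0 = {a0:.3f}  hidden:1 = {a1:.3f}")
```
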
The output unit has a negative bias (-2.97) that keeps it off in the {0 0} case (because both OR and AND produce zeros in this case). It has a positive weight from hidden:0 (6.73) that is larger in magnitude than this bias, so it dominates the bias and activates the output unit in the {0 1} and {1 0} cases (for which OR is 1 but AND is 0). But the output unit also has an even larger negative weight from hidden:1 (-7.42) which, when combined with the negative bias, overrides the positive contribution of the OR unit in the {1 1} case and shuts off the output unit. (Note that it's important here that, due to the sigmoid function, the OR unit doesn't respond twice as strongly to {1 1} as it does to {0 1} or {1 0}; otherwise the AND unit couldn't dominate it.) And finally, the sigmoid function of the output unit serves to eliminate differences in net input magnitudes across the cases, producing values near zero or one.

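Feeding idealized OR and AND values into the output unit's quoted weights and bias shows the XOR pattern directly. The sketch below assumes the hidden units produce exactly 0 or 1, which the sigmoid asymptotes only approximate.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Output-unit parameters as quoted above.
w_from_h0, w_from_h1, bias = 6.73, -7.42, -2.97

for i1, i2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h_or = 1 if (i1 or i2) else 0     # idealized hidden:0 (OR)
    h_and = 1 if (i1 and i2) else 0   # idealized hidden:1 (AND)
    out = sigmoid(w_from_h0 * h_or + w_from_h1 * h_and + bias)
    print(f"input {{{i1} {i2}}}: output = {out:.3f}")   # follows XOR
```

Substituting the less extreme hidden activations from the previous sketch in place of the 0/1 values gives outputs that are less extreme but, with these rounded weights, still fall on the correct side of 0.5 for every case.
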
[People did better with this question. A common problem was to give a good description but not a good explanation. Two almost opposite problems were vagueness and showing detailed calculations without a conceptual explanation. In the first case, some people vaguely described that somehow the bias and weights combine to solve the problem, but without details as to how that happens. In the latter case, a few people manually calculated the activations of the hidden units and the output, but did not explain verbally or conceptually what the solution is. There was also some confusion about the role of the sigmoid function. A few people said that it allows us to treat anything above 0.5 as 1 and anything below 0.5 as 0, and that it doesn't matter how far the activation value is from the target as long as it's on the same side; this is only true for very large net inputs (positive or negative).]

 

Extra Credit [10 pts.] The very long plateau in error at the beginning of training is not typical of training the XOR problem when starting from different initial random weights (although a brief plateau is not uncommon). Explain why this long plateau occurs for the initial weights used in the homework (XOR.init.wt).

In the absence of information about the input, the best thing an output unit can do (i.e., the behavior that produces the lowest error) is to adopt an activation level which is the expected value (average) of its target values over all the cases (which, in this case, is 0.5). This is exactly what the output unit does after the initial epochs of training, and it does so largely by reducing its positive incoming weights and eliminating its positive bias (from 0.279 to 0.017 after 30 epochs). This produces a squared error of 0.25 for each case, or a total error of 1.0, which is exactly where the plateau is.

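The arithmetic behind the 1.0 plateau is easy to check. The sketch below simply evaluates the summed squared error for a constant output and confirms that a constant of 0.5 does better than nearby alternatives.

```python
# Summed squared error when the output unit gives the same activation to
# every case (the XOR targets are 0, 1, 1, 0), as described above.
targets = [0, 1, 1, 0]

for a in (0.3, 0.5, 0.7):
    total = sum((t - a) ** 2 for t in targets)
    print(f"constant output {a}: total error = {total:.2f}")
# A constant output of 0.5 gives 0.25 per case, i.e. a total of 1.00,
# which is the level of the plateau; 0.3 and 0.7 both do worse (1.16).
```
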
Looking back at the initial weights, the incoming weights of the right hidden unit, hidden:1, are very small, so that its net input, and hence its activation, is nearly identical across the four cases (activations range from 0.392 to 0.401; they're a bit below 0.5 due to hidden:1's negative bias of -0.402). But the main point is that the unit provides almost no information about which input case is presented, so we're effectively trying to solve XOR with one hidden unit (hidden:0).

The initial incoming weights for hidden:0 are larger but nearly identical (0.432 and 0.449). Combined with its negative bias (-0.277), hidden:0's activations vary a bit more, but mostly linearly as a function of the number of active input lines (0.431 for zero, 0.543 and 0.539 for one, 0.647 for two). The same is true after 30 epochs of training (0.408, 0.508 and 0.504, 0.604). In back-propagation, the derivative for the weight from hidden unit i to output unit j is ∂E/∂wij = ∂E/∂nj · ai. But note that the sum of ai for hidden:0 in the target=0 cases almost exactly equals the sum in the target=1 cases. This means that the positive derivatives for the target=0 cases almost exactly cancel the negative derivatives for the target=1 cases, and the weight from hidden:0 to the output unit doesn't change. Moreover, the output (and input) derivative of hidden:0 will be nearly zero, and so its incoming weights also won't change. So basically we're stuck until very tiny weight changes accumulate enough to differentiate some of the cases, and this is why the plateau lasts a very long time.

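The near-perfect cancellation can be checked with the hidden:0 activations quoted above for epoch 30. The sketch below assumes the output activation is 0.5 for every case (as at the plateau) and that the per-case error is (t - a)^2, consistent with the total-error calculation above.

```python
# Cancellation of the hidden:0 -> output weight derivative at the plateau.
# Hidden:0 activations after 30 epochs and the XOR targets are as quoted above;
# the output activation is taken to be 0.5 for every case.
cases = [
    # (hidden:0 activation, target)
    (0.408, 0),   # {0 0}
    (0.508, 1),   # {0 1}
    (0.504, 1),   # {1 0}
    (0.604, 0),   # {1 1}
]
a_out = 0.5
total = 0.0
for a_h0, target in cases:
    dE_da_out = -2.0 * (target - a_out)              # +1 for target 0, -1 for target 1
    dE_dn_out = dE_da_out * a_out * (1.0 - a_out)    # +/- 0.25
    total += dE_dn_out * a_h0                        # this case's contribution to dE/dw
print(f"summed dE/dw for the hidden:0 -> output weight: {total:.4f}")  # approximately 0
```
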
[As expected, this was very difficult and only a few people gave good answers.]

 

2. Implementing Another Feedforward Problem [60 points]

As was the case for the second half of Homework 2, we can't give specific feedback as people chose different problems. In general, we were looking for you to carry out an exploration similar to what you did in Homework 2, but with some sensitivity to the particular characteristics of your problem. As always, we wanted you not only to describe the successes, difficulties, and failures of your network, but also to try to explain them in terms of the underlying operation of the network (e.g., sensitivity to the consistency of the outputs produced by training patterns that are similar to a given test pattern).