1A [15 pts.] Explain, based on the values of the weights in the network and the details of the back-propagation learning procedure, why the output derivative of the left hidden unit (hidden:0) is larger than that of the right hidden unit (hidden:1), and why both values are very much smaller than that of the output unit.

According to the back-propagation equations, the output derivative of hidden unit *i*, *∂E/∂a_{i}*, is equal to the sum over all output units *j* of *∂E/∂a_{j} · σ′(n_{j}) · w_{ij}*: that is, the output unit's own output derivative multiplied by two additional terms, the sigmoid derivative of the output unit and the hidden-to-output weight. These additional terms are what make both hidden derivatives very much smaller than the output unit's derivative: the output unit's activation lies near an asymptote of the sigmoid, so its sigmoid derivative *σ′(n_{j})* is very small. And because the multiplying weight is the hidden-to-output weight (not the input-to-hidden weight), the relative sizes of the two hidden units' derivatives are determined by the magnitudes of their weights to the output unit, which is why hidden:0's output derivative is larger than hidden:1's.

*[Many people had trouble with this question. Many talked about the size of the input-to-hidden weights, rather than the hidden-to-output weights, and were unclear about which derivative is discussed (with respect to output, net input, etc.) and which formulas are relevant. Some explained why the output derivative of the output unit is large, but not specifically why it is larger than the hidden unit output derivatives (i.e., didn't mention the additional terms in the derivative).]*

1B [25 pts.] Explain how the network has solved the XOR problem. Specifically, for each hidden unit, describe how it is behaving across the four cases and explain how it accomplishes this given its bias, its weights from the input units, and the sigmoid activation function. Then explain how the output unit behaves in accordance with XOR based on its bias, its weights from these hidden units (and their activations for the four cases), and the sigmoid function.

Both hidden units have large positive (and roughly equal) weights from each input line, and a large negative bias. For hidden:0, the bias (-2.36) is less in magnitude than either input weight in isolation (5.82), causing it to be off for the {0 0} case but on if either input line is active (which is true for all the other cases: {0 1}, {1 0}, and {1 1}). This corresponds to the logical function OR. In contrast, the bias of hidden:1 (-5.21) is much larger in magnitude than either input weight in isolation (3.40) but not larger than the sum of the two (6.80). This means that the unit is off except in the case where both input lines are active, {1 1}, corresponding to the logical function AND. For both units, the lower and upper asymptotes of the sigmoid function help make their activations close to zero or one (across the various cases), and thus all the cases producing a zero (or a one) can be treated as equivalent by the output unit.

The output unit has a negative bias (-2.97) that keeps it off in the {0 0} case (because both OR and AND produce zeros in this case). It has a larger positive weight from hidden:0 (6.73) which dominates this bias and activates the output unit in the {0 1} and {1 0} cases (for which OR is 1 but AND is 0). But it also has an even larger negative weight from hidden:1 (-7.42) which, when combined with the negative bias, overrides the positive contribution of the OR unit in the {1 1} case and shuts off the output unit. (Note that it's important here that, due to the sigmoid function, the OR unit doesn't respond twice as strongly to {1 1} as it does to {0 1} or {1 0}, or the AND unit wouldn't dominate it.) And finally, the sigmoid function of the output unit serves to eliminate differences in net input magnitudes across the cases to produce values near zero or one.
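The OR-minus-AND division of labor described above can be checked directly. Here is a minimal Python sketch (not the simulator's code) that runs the quoted trained weights forward through the sigmoid; the output comes out near 0 for {0 0} and {1 1} and near 1 for {0 1} and {1 0}:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Biases and weights quoted in the answer above.
B_H0, W_H0 = -2.36, (5.82, 5.82)    # hidden:0 -- behaves like OR
B_H1, W_H1 = -5.21, (3.40, 3.40)    # hidden:1 -- behaves like AND
B_OUT, W_OUT = -2.97, (6.73, -7.42) # output -- roughly OR minus AND

def forward(i0, i1):
    h0 = sigmoid(B_H0 + W_H0[0] * i0 + W_H0[1] * i1)
    h1 = sigmoid(B_H1 + W_H1[0] * i0 + W_H1[1] * i1)
    out = sigmoid(B_OUT + W_OUT[0] * h0 + W_OUT[1] * h1)
    return h0, h1, out

for (i0, i1), target in [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]:
    h0, h1, out = forward(i0, i1)
    print(f"{i0} {i1} -> h0={h0:.3f} h1={h1:.3f} out={out:.3f} (target {target})")
```

Note that hidden:0's activation for {1 1} (about 0.9999) is only slightly larger than for {0 1} (about 0.970), which is the saturation effect that lets the AND unit dominate in the {1 1} case.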

*[People did better with this question. A common problem was to give a good description but not a good explanation. Two almost opposite problems were vagueness and showing detailed calculations without a conceptual explanation. In the first case, some people vaguely described that somehow the bias and weights combine to solve the problem, but without details as to how that happens. In the latter case, a few people calculated the activations of the hidden units and the output by hand, but did not explain verbally or conceptually what the solution is. There was some confusion about the role of the sigmoid function. A few people said that it allows us to treat anything above 0.5 as 1 and anything below it as 0, and that it doesn't matter how far the activation value is from the target as long as it's on the same side; this is only true for very large net inputs (positive or negative).]*

Extra Credit [10 pts.] The very long plateau in error at the beginning of training is not typical of training the XOR problem when starting from different initial random weights (although a brief plateau is not uncommon). Explain why this long plateau occurs for the initial weights used in the homework (XOR.init.wt).

In the absence of information about the input, the best thing an output unit can do (i.e., the behavior that produces the lowest error) is to adopt an activation level which is the expected value (average) of its target values over all the cases (which, in this case, is 0.5). This is exactly what the output unit does after the initial epochs of training, and it does so largely by reducing its positive incoming weights and eliminating its positive bias (from 0.279 to 0.017 after 30 epochs). This produces a squared error of 0.25 for each case, or a total error of 1.0, which is exactly where the plateau is.
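The arithmetic here is easy to verify: for a constant output *a*, the total squared error over the four XOR cases, Σ(t − a)², is minimized at the mean of the targets, 0.5, where it equals exactly 1.0. A minimal Python check:

```python
# Total squared error over the four XOR cases for a constant output `a`.
TARGETS = [0, 1, 1, 0]

def total_error(a):
    return sum((t - a) ** 2 for t in TARGETS)

# Scan candidate constant outputs on a fine grid; the minimum lands at
# the mean target value, 0.5, with total error 1.0 (0.25 per case).
best = min((a / 100 for a in range(101)), key=total_error)
print(best, total_error(best))   # 0.5 1.0
```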

Looking back at the initial weights, the incoming weights of the right hidden unit, hidden:1, are very small, so that its net input, and hence its activation, is nearly identical across the four cases (activations range from 0.392 to 0.401; they're a bit below 0.5 due to hidden:1's negative bias of -0.402). But the main point is that the unit provides almost no information about which input case is presented, so we're effectively trying to solve XOR with one hidden unit (hidden:0).

The initial incoming weights for hidden:0 are larger but nearly identical (0.432 and 0.449). Combined with its negative bias (-0.277), hidden:0's activations vary a bit more, but mostly linearly as a function of the number of active input lines (0.431 for zero, 0.543 and 0.539 for one, 0.647 for two). The same is true after 30 epochs of training (0.408, 0.508 and 0.504, 0.604). In back-propagation, *∂E/∂w_{ij} = ∂E/∂n_{j} · a_{i}*. But note that the sum of these weight derivatives over the four cases is nearly zero: with the output unit stuck at 0.5, the error signals for the cases with a target of 1 ({0 1} and {1 0}) are almost exactly canceled by those for the cases with a target of 0 ({0 0} and {1 1}), and because hidden:0's activations are nearly linear in the number of active input lines, the contributions through *a_{i}* cancel in the same way. With essentially no net gradient on its weights, hidden:0 remains a nearly linear unit for a long time, and a linear hidden representation cannot reduce the error on XOR below the plateau value.
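To see how linear hidden:0's initial response really is, the following Python sketch (using the initial bias and weights quoted above) compares the sigmoid to its linear approximation around zero net input, σ(net) ≈ 0.5 + net/4, for the four cases:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Initial bias and input weights for hidden:0, as quoted above.
BIAS, W0, W1 = -0.277, 0.432, 0.449

for i0, i1 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    net = BIAS + W0 * i0 + W1 * i1
    # Near net = 0 the sigmoid is approximately linear in its net input.
    print(i0, i1, round(sigmoid(net), 3), round(0.5 + net / 4, 3))
```

The sigmoid values reproduce the activations quoted in the text (0.431, 0.543, 0.539, 0.647), and each is within about 0.005 of the linear approximation.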

*[As expected, this was very difficult and only a few people gave good answers.]*

As was the case for the second half of Homework 2, we can't give specific feedback as people chose different problems. In general, we were looking for you to carry out a similar exploration to what you did in Homework 2, but with some sensitivity to the particular characteristics of your problem. As always, we wanted you not only to describe the successes, difficulties, and failures of your network, but also try to explain them in terms of the underlying operation of the network (e.g., sensitivity to the consistency of the outputs produced by training patterns that are similar to a given test pattern).

- As mentioned in the instructions to the problem, not all problems should be expected to support effective generalization. In general, tasks exhibit generalization when input similarity is at least somewhat predictive of output similarity (even if only parts of the input predict parts of the output). Tasks like parity are exactly **not** like this, and thus it is extremely hard to train a network on parity in a way that will give rise to generalization (typically this requires multiple hidden layers). In addition, good generalization usually requires that the training set contain a representative sample of the relevant similarities. Thus, for some problems (e.g., symmetry, at least at a small scale) it is possible to get good generalization but only if you train on close to all of the remaining possible patterns. Finally, tasks in which only one input unit is active for each example (e.g., the standard 8-3-8 encoder) **cannot** generalize because the unit that is active in any withheld pattern will *never* have been active during training, and so that unit's outgoing weights will never have changed.
- With tasks for which generalization is expected to be difficult, it is often useful, as a point of comparison, to train a version of the network on all or nearly all possible training patterns. This provides a kind of "upper bound" on how well the network can be expected to perform the task.
- (This was mentioned in class but is worth repeating.) Some tasks (e.g., 8-3-8 encoder, or symmetry with many input units) involve target values that are off for most of the patterns and on for only a few patterns (sometimes only one). This means that the network can eliminate most of the task error by turning all of the units completely off. (You'll see a very rapid drop in error in this situation.) The problem is, if you use a substantial amount of momentum at the beginning of training, the initial weight changes that turn units off will get so exaggerated that the network will never recover to eventually turn on the few correct units. A good practical approach is to use steepest descent (i.e., no momentum) until the network reaches its first initial plateau in error, and then switch to momentum descent. [This problem is somewhat mitigated by using "Doug's Momentum" from the beginning of training, as it clips the integrated gradient to be no larger than 1, thereby preventing the network from pinning output units completely off (or on).]
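The momentum recipe in the last point can be sketched abstractly. Everything below (the function name, the momentum value, and the exact clipping rule) is a hypothetical illustration of the description above, not the simulator's implementation; in particular, clipping the integrated gradient by magnitude at 1 is an assumption based on the bracketed note about "Doug's Momentum":

```python
def momentum_step(velocity, gradient, momentum=0.9, clip=1.0):
    """One momentum update of the integrated gradient, clipped to magnitude
    `clip` so that a huge early error derivative cannot pin units off."""
    velocity = momentum * velocity + gradient
    return max(-clip, min(clip, velocity))

# A huge early gradient (e.g., from many off-target output units) is capped:
print(momentum_step(0.0, 50.0))                 # 1.0
# ...while ordinary gradients accumulate as plain momentum descent would:
print(round(momentum_step(0.2, 0.1), 2))        # 0.28
```

With plain (unclipped) momentum, the first update here would be 50 times larger, which is exactly the kind of exaggerated early weight change the bullet above warns about.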
- If you observe something unusual or unexpected (e.g., a plateau in the learning curve in the middle of training), it's a good idea to figure out what is going on (e.g., which aspects of the task have been learned and which are still incorrect, and why).
- In very small networks (like XOR), it can be useful to examine individual hidden units, but in larger networks, the exact pattern of weights for any given hidden unit is less important (and less interesting); what matters more is the overall degree of similarity between hidden patterns (as compared with the similarities of the corresponding input patterns).
- Networks for homework problems have typically started from all-zero weights or loaded in a set of initial weights. This is mostly so I know that everyone is starting with the same exact configuration. When training a network with hidden units on your own problems, though, it is better to start with random initial weights (just use the "resetNet" command or control-panel button). It's fine to then save these weights if you want to be able to re-create a particular training run. You can also use the "seed" command, which takes an integer argument (e.g., seed 26973), just before the "resetNet" command. This initializes the random number generator, so exactly the same "random" weights will be generated. It is also a good idea, when possible, to carry out multiple runs with different initial random weights to be sure that your results are not due to idiosyncrasies of a particular set of initial weights.
- When creating weight displays---particularly near the end of training---be sure to rescale the displayed values (with the slider on the right) so that only the largest weight is completely black/white.
- Finally, when making claims about how the model behaves (e.g., whether it does or doesn't learn or generalize well), be sure to **provide evidence** (e.g., a graph or table of data, or a figure that shows unit activations) that supports your claims.