1A [15 pts.] Consider the two examples "0-train5" and "0-train24" in particular. Describe the differences between the inputs and outputs for these examples, and how the outputs in particular are related to other instances of zeros. Try to explain the results in terms of the general properties of learning in PDP networks.
Both of these examples are somewhat idiosyncratic. "0-train5" has much stronger activation on the left vertical stroke of the zero than on the right (and both strokes slant a bit left-to-right), and "0-train24" has stronger activation at the top, and its right vertical stroke is a bit wider and weaker than is typical of other zeros. In both cases, the output generated by the model has reduced these idiosyncrasies and is much closer to a typical zero.
The reason this occurs is that the network must recode each input through a much smaller number of hidden units (10 rather than 64). This "bottleneck" does not have the representational capacity to preserve the details (idiosyncrasies) of every image, and so it learns features that contribute to successful reconstruction of as many digits as possible (to minimize the total error on the task). These features are those that are shared by most, but not all, of the individual inputs. The idiosyncratic aspects of inputs such as "0-train5" and "0-train24" are not represented in the hidden layer, and so these inputs are reconstructed using a combination of more typical features (i.e., they are "regularized").
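As a rough illustration of this bottleneck (a sketch, not the actual simulator code; the weights here are random rather than trained), the following shows a 64 → 10 → 64 auto-encoder forward pass with sigmoid units, making explicit that every 64-pixel image must be squeezed through a 10-dimensional hidden code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Hypothetical (untrained) weights for a 64 -> 10 -> 64 auto-encoder.
W_enc = rng.normal(0, 0.1, size=(10, 64))   # input -> hidden
W_dec = rng.normal(0, 0.1, size=(64, 10))   # hidden -> output

x = rng.random(64)                # one 8x8 digit image, flattened
h = sigmoid(W_enc @ x)            # the 10-unit "bottleneck" code
x_hat = sigmoid(W_dec @ h)        # reconstruction from the code alone

# The code has far fewer dimensions than the image, so idiosyncratic
# detail cannot all survive the round trip; training drives the weights
# toward features shared by many inputs.
print(h.shape, x_hat.shape)       # (10,) (64,)
```

Because the reconstruction depends only on `h`, any aspect of the input not captured by the 10 hidden units is necessarily replaced by whatever the (typical) features reproduce.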
1B [15 pts.] Characterize the kinds of features that the hidden units have learned and how they might combine to reconstruct specific digits.
The network learns fairly large-scale features that span most of the image. Typically the units will have positive weights from a particular elongated region of the input, with some inhibitory "flanks" that prevent the unit from becoming active for other strokes (that don't co-occur with its preferred one). Some units have multiple, disconnected positive regions. For example, hidden:3 responds to the upper right corner and a middle horizontal region, which are both active during most 7's (the database is Canadian, and many Canadians cross their 7's). Some units have more curved positive regions. None of the units seem entirely dedicated to specific digits (although hidden:7 and maybe hidden:9 look particularly good for 2's). Rather, any given digit activates three or four of the features fairly strongly, and these combine to reconstruct it (or, rather, a "regularized" version of it). As one student put it, the network can "draw" each digit by combining the positive regions of a number of hidden units (and by using their negative regions to cancel the extraneous positive parts of other hidden units).
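This "drawing" idea can be made concrete in a small sketch (the feature vectors below are made up for illustration, not taken from the trained network): the output's net input is just a weighted superposition of each hidden unit's outgoing weight vector, with the weights given by the hidden activations.

```python
import numpy as np

# Hypothetical outgoing weight vectors ("features") of three hidden
# units, each covering an elongated region of a flattened 8x8 image.
features = np.zeros((3, 64))
features[0, :8] = 1.0      # e.g., a top horizontal stroke
features[1, ::8] = 1.0     # e.g., a left vertical stroke
features[2, 7::8] = 1.0    # e.g., a right vertical stroke

h = np.array([0.9, 0.8, 0.1])  # hidden activations for some input

# The reconstruction's net input superimposes the active units'
# positive regions -- the network "draws" with its features.
net = h @ features
print(net.shape)  # (64,)
```

A unit's negative weight regions enter this same sum with a minus sign, which is how they cancel the extraneous positive parts contributed by other active units.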
2A [20 pts.] Contrast the kinds of features learned by the recognition network with those learned by the auto-encoder network. Explain the differences in terms of the distinct demands of the two tasks.
In general, the "features" here (the regions of positive input) are smaller than in the auto-encoder, and more of the input has weights that are near zero. In this case, the outgoing weights show how each hidden unit "votes" for digit identities (positive weights), and which it specifically tries to rule out (negative weights). (Rescaling the weight display can help in seeing some of these distinctions.) As in the auto-encoder, the units typically have adjacent inhibitory regions to draw sharper contrasts, but these too tend to be more restricted in size. Unlike the auto-encoder, none of the features look anything like an entire digit, or even a large part of one. It is still the case, though, that individual inputs activate three or four of the hidden units, and these combine to activate the correct identity (and keep all the others off).
These differences make sense given the nature of the two tasks. Recognition requires being sensitive to just those parts of the input that distinguish one digit (or a few of them) from the others. Parts of inputs that are common to many digits are less useful for distinguishing among them and so are under-represented. By contrast, for reconstruction, even the shared features need to be reconstructed --- in fact, they are particularly important to represent exactly because they are so common. The network must represent the entirety of the input (as best it can), not just those aspects that discriminate among digits.
2B [15 pts.] Test the performance of the network on both the training and testing sets in the same way as you did after 150 epochs. Report the results for both 150 and 1150 epochs and explain them.
After 150 epochs, the network is correct on 97.8% of the training cases, with an average error per example of 0.218. Performance on the test examples is slightly worse (94.3% correct, error per example of 0.401). This makes sense because, even at this early phase, the network has learned some aspects of the training set that are not completely shared by the test set --- that is, each set has some idiosyncrasies that are not shared by the other. To put it another way, the weight changes caused by training on a pattern are almost always better for itself than are the weight changes caused by training on other patterns.
By 1150 epochs, performance on the training patterns is essentially perfect (100% correct performance; error per example of only 0.010). However, performance on the test set has deteriorated: now only 92.8% of examples are correct, and the error-per-example has more than doubled, to 0.876. This is a clear example of over-fitting. By continuing to train on the training set, the network has succeeded in learning to categorize idiosyncratic cases (e.g., a 2 that looks like a 3) at the expense of more general properties. This over-fitting causes problems for test cases that are similar to these idiosyncratic training cases but require a different classification (e.g., other 3s that really are 3s). The training idiosyncrasies apparently affect only a relatively small number of test cases (which is why generalization performance in terms of percent correct only drops a small amount), but when the network mis-classifies these cases it does so quite strongly.
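The two measures reported above (percent correct and error per example) can be computed as in the following sketch. This assumes, as is standard for this kind of exercise (the grader's criterion may differ in detail), that an example counts as correct when the most active output unit matches the target digit, and that "error per example" is the summed squared output error averaged over examples:

```python
import numpy as np

def evaluate(outputs, targets):
    """Percent correct (most active output unit matches the target) and
    mean summed-squared error per example. Arrays: (n_examples, 10)."""
    pct = 100.0 * (outputs.argmax(axis=1) == targets.argmax(axis=1)).mean()
    err = ((outputs - targets) ** 2).sum(axis=1).mean()
    return pct, err

# Tiny made-up batch: first example classified correctly, second not.
targets = np.eye(10)[[3, 7]]
outputs = np.array([np.eye(10)[3], np.eye(10)[2]]) * 0.9
pct, err = evaluate(outputs, targets)
print(pct)  # 50.0
```

Note how the mis-classified example contributes far more error than the correct one, which is why a few strongly mis-classified test cases can double the error per example while percent correct drops only slightly.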
3A [10 pts.] Why is performance of the deep sigmoid network so much worse than that of the standard sigmoid network? Try to relate your answer to the nature of the learning curve shown in the error graph for the deep network.
Both the standard (3-layer) and deep (4-layer) networks reach the same initial plateau in error (just above 12000), but the deep network stays on this plateau for much longer (until about 80 epochs, as compared with about 40 for the standard network). Also, the descent from this plateau toward the final low level of error reached by the standard network is much slower. Both of these characteristics are caused by the fact that error derivatives get much smaller as you move from the output toward the input across multiple layers of hidden units. The weight changes on connections from the input units to the first hidden units are much, much smaller in the deep vs. standard network, and hence the network is much slower in learning how to use differences in the input to produce differences in the output.
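To see why the derivatives shrink, recall that back-propagating through a sigmoid unit multiplies the error signal by the unit's slope, a_j(1 - a_j), which is at most 0.25. A quick numeric check (an illustration, not the simulator's code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_slope(a):
    return a * (1.0 - a)   # derivative expressed via the activation a

# The slope peaks at 0.25 (when a = 0.5, i.e., net input = 0), so each
# additional sigmoid layer scales the error signal by at most 0.25.
a = sigmoid(np.linspace(-5, 5, 101))
print(sigmoid_slope(a).max())  # 0.25

# After k extra layers the signal shrinks by at best 0.25 ** k:
for k in (1, 2, 3):
    print(k, 0.25 ** k)
```

And 0.25 is the best case; units with activations near 0 or 1 contribute much smaller factors, so the earliest weights in the deep sigmoid network receive tiny derivatives.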
3B [10 pts.] Why is performance of the deep RELU network better than that of the deep sigmoid network? How much additional training of the deep sigmoid network is needed to match the performance of the deep RELU network?
The deep network with RELUs (rectified linear units) learns much faster because, as one back-propagates from the output towards the input, the magnitudes of the error derivatives are maintained to a much greater extent than with sigmoid units. This is because the slope of the RELU function is 1.0 (when net input is above zero), whereas for sigmoid units it is never greater than 0.25 (and typically much smaller, being equal to a_j(1 - a_j), where a_j is the unit's activation). The deep RELU network achieves total error of 293.1 in 150 epochs; the deep sigmoid network needs about 260 epochs to achieve comparable performance (error = 296.7).
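The contrast between the two slopes can be checked directly (a minimal sketch over a range of net inputs):

```python
import numpy as np

# Slope of each activation function as a function of net input.
net = np.linspace(-3, 3, 7)
relu_slope = (net > 0).astype(float)    # exactly 1.0 whenever net > 0
a = 1.0 / (1.0 + np.exp(-net))
sig_slope = a * (1.0 - a)               # never above 0.25

print(relu_slope.max(), sig_slope.max())
```

A back-propagated signal passing through several active RELUs is multiplied by 1.0 at each layer, whereas through sigmoids it is multiplied by at most 0.25 per layer, so the deep RELU network's early weights get usable derivatives from the start of training.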
3C [15 pts.] Why is performance of the deep RELU network even better than that of the standard sigmoid network after 150 epochs of training for each? Be sure to consider all of the differences between the two networks.
The deep RELU network has two advantages over the standard (3-layer) sigmoid network. The first is that the derivatives are larger (as explained in 3A and 3B) and so the weights change more quickly. [By the way, you can't just increase the learning rate in the sigmoid network to compensate for this, because what's relevant is the relative magnitudes of the earlier vs. later weight changes. Give it a try.]
The second advantage is that the hidden units that connect directly to the output units can compute much more complicated functions than those in the standard network. In the standard network, the incoming weights of each hidden unit separate the input space linearly into those inputs that produce a net input greater than vs. less than 0.0. The sigmoid function then simply "squashes" the extremes so that it doesn't matter so much if a pattern is near vs. far from this linear division. By contrast, in a deep network, the first hidden layer does this but the second (and later) layers can then compute recombinations of these, resulting in much more complicated (and useful) distinctions.
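The classic illustration of such a recombination (not part of the digit task itself, just the standard textbook case) is XOR: no single linear division of the input space computes it, but a second layer combining two first-layer features does. Here is a minimal hand-wired sketch using threshold units:

```python
import numpy as np

def step(x):
    return (x > 0).astype(float)

# XOR cannot be computed by one linear division of the input space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

h1 = step(X @ np.array([1.0, 1.0]) - 0.5)   # "at least one input on"
h2 = step(X @ np.array([1.0, 1.0]) - 1.5)   # "both inputs on"
out = step(h1 - h2 - 0.5)                   # recombination: XOR

print(out)  # [0. 1. 1. 0.]
```

Each hidden unit still makes only a linear cut, but the second layer's cut operates on the hidden representation, yielding a distinction (XOR) that no single cut on the raw input can make --- the same kind of advantage the deep network's later hidden layers have over the standard network's single layer.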
A single RELU is simpler than a single sigmoid unit, so the RELU network suffers a bit from this (you can see this by comparing the 3-layer sigmoid network to a 3-layer RELU network), but the gain from having multiple hidden layers more than compensates for having simpler units.