This homework constitutes 10% of your course grade.

Download and unzip the file http://www.cnbc.cmu.edu/~plaut/IntroPDP/networks/digits.zip, which contains the files defining four different networks that learn from handwritten digits.

- **digits-enc.in**: A network that is trained as an *auto-encoder* to take each digit as input and to regenerate it over the output units via a much smaller number of hidden units.
- **digits-rec.in**: A network that is trained to *recognize* each digit by activating one of 10 localist output units.
- **digits-deep.in**: A version of `digits-rec.in` with two sigmoidal hidden layers instead of one.
- **digits-deep-RELU.in**: A version of `digits-deep.in` in which the hidden units are rectified linear units (RELUs) instead of sigmoid units.
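For orientation, the auto-encoder idea can be sketched in a few lines of plain Python. This is only an illustration, not the Lens implementation: the input size (`N_PIXELS = 64`), the random weight range, and the helper names `make_weights` and `layer` are assumptions made for the example. The 10 hidden units match the figure given later in this handout.

```python
import math
import random

def sigmoid(x):
    """Logistic activation, bounded between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """Compute the sigmoid activations of one fully connected layer."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def make_weights(n_out, n_in, rng):
    """Small random weights (hypothetical initialization range)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]

N_PIXELS = 64   # assumption: the real input size is set in digits-enc.in
N_HIDDEN = 10   # number of hidden units stated in the handout

rng = random.Random(0)
w_enc = make_weights(N_HIDDEN, N_PIXELS, rng)   # input -> hidden
w_dec = make_weights(N_PIXELS, N_HIDDEN, rng)   # hidden -> output

digit = [rng.random() for _ in range(N_PIXELS)]      # stand-in for one digit
hidden = layer(digit, w_enc, [0.0] * N_HIDDEN)       # compressed code
recon = layer(hidden, w_dec, [0.0] * N_PIXELS)       # attempted regeneration

print(len(digit), len(hidden), len(recon))
```

The point of the sketch is the bottleneck: the output has as many units as the input, but everything the network knows about the digit must pass through the much smaller hidden layer.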

First, start up Lens and load `digits-enc.in`. Click on "Train" to
train the network for 300 epochs (i.e., 300 sweeps through all 3823 patterns;
depending on how fast your laptop is, this may take a while). You'll
notice that the total error reaches a plateau of about 50,000 at about 20
epochs (where the network has only learned the general size and position of
the digits, but is not differentiating among them much), but that the error
starts dropping again at about 60 epochs and that it reaches just over 15,500
by epoch 300. (These error values are *much* larger than in the XOR
problem because the error measure sums over output units and over examples,
and there are *many* more of each in the current simulation.)
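To see why the totals grow so large, the following minimal sketch sums an error over every output unit of every example. It assumes a sum-squared error for illustration; the error function actually used is set in the network script and may differ. The function name `total_error` and the toy values are made up for the example.

```python
def total_error(outputs, targets):
    """Sum squared error over every output unit of every example."""
    total = 0.0
    for out_vec, tgt_vec in zip(outputs, targets):
        for o, t in zip(out_vec, tgt_vec):
            total += (o - t) ** 2
    return total

# Two toy examples with three output units each (made-up values).
outputs = [[0.9, 0.1, 0.2], [0.3, 0.8, 0.1]]
targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

print(total_error(outputs, targets))
```

Because each example and each output unit adds its own term, a training set with thousands of examples and many output units will report a much larger total than a four-pattern, one-output problem, even when the per-unit error is similar.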

Now open the Unit Viewer. We don't really need to see target values (since they're the same as the inputs) so, under the menu "Value", select "Outputs" (rather than "Outputs and Targets"). Click on various examples of the digit zero to try to get an impression of how well the network is able to regenerate each input.

1A [15 pts.] Consider the two examples "0-train5" and "0-train24" in particular. Describe the differences between the inputs and outputs for these examples, and how the outputs in particular are related to other instances of zeros. Try to explain the results in terms of the general properties of learning in PDP networks.

Now, right click on each of the hidden units in turn, to display their incoming and outgoing weights. You will probably want to adjust the slider on the right-hand side to see the weight values well, and you may want to change to the Black-Grey-White or Hinton Diagram palette.

1B [15 pts.] Characterize the kinds of features that the hidden units have learned and how they might combine to reconstruct specific digits.

From the main panel, click on "Run Script" and select
`digits-rec.in`. This will load in the second network, which will
learn to recognize each digit rather than to regenerate it. This network has
the same number of hidden units (10) as the auto-encoder version. Note that
you can switch back and forth between the two networks by clicking on
"Network" in the upper left of the main panel and selecting the relevant
network, which will help in comparing the two.

Train the recognition network for 150 epochs by clicking on the "Train" button. At this point, the network is producing relatively low error on the training set (just under 850). Now, right click on each of the hidden units in turn, to display their incoming and outgoing weights (just as you did for the auto-encoder network).

2A [20 pts.] Contrast the kinds of features learned by the recognition network with those learned by the auto-encoder network. Explain the differences in terms of the distinct demands of the two tasks.

Click on "Testing Set" at the top right of the main panel, select "digits-rec.train" and click "Test" (near the bottom) to test performance on the training set. Note the error per example and the percent of examples for which the output unit criterion is met. (For this network, the latter number is the percent of examples for which the correct output unit is the most active one.) Then set "Testing Set" to "digits-rec.test" and click "Test" to test generalization performance on the (untrained) testing set, noting the same performance measures.
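The "most active output unit" criterion described above can be sketched as follows. This is an illustration only, not how Lens computes it internally; the function name `percent_correct` and the toy activations are assumptions, and only three output units are used instead of ten to keep the example short.

```python
def percent_correct(outputs, target_indices):
    """Percent of examples whose most active output unit is the correct one."""
    hits = 0
    for out_vec, target in zip(outputs, target_indices):
        # Index of the output unit with the highest activation.
        predicted = max(range(len(out_vec)), key=lambda i: out_vec[i])
        if predicted == target:
            hits += 1
    return 100.0 * hits / len(outputs)

# Four toy examples with three output units each (made-up values).
outputs = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.4, 0.5, 0.1],   # unit 1 wins, but the correct unit is 0
    [0.1, 0.2, 0.9],
]
target_indices = [0, 1, 0, 2]

print(percent_correct(outputs, target_indices))
```

Note that this measure ignores how confident the winning unit is, so the percent correct and the error per example can move somewhat independently of each other.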

Now set "Weight Updates" to 1000 and click "Train Network" to train to a total of 1150 epochs. This may take a while.

2B [15 pts.] Test the performance of the network on both the training and testing sets in the same way as you did after 150 epochs. Report the results for both 150 and 1150 epochs and explain them.

Click on "Run Script" in the main panel and select `digits-deep.in`.
This will load in the third network, which is also trained to recognize digits
but now has an additional hidden layer of 40 units between the input and what
is now the second hidden layer of 10 units. Apart from having different
initial random weights, all other aspects of the simulation, including
learning parameters, are the same as for the standard recognition network
(`digits-rec.in`).

Click on "Train" to train for 150 epochs. Test the performance of the deep
network on both the training and testing sets, and compare the results to
those for the standard network after 150 epochs of training. (You may want to
use "Load Weights" to reload the latter's initial weights,
`digits-rec.init.wt`, and train for 150 epochs again.)

3A [10 pts.] Why is performance of the deep sigmoid network so much worse than that of the standard sigmoid network? Try to relate your answer to the nature of the learning curve shown in the error graph for the deep network.

Finally, use "Run Script" to load in `digits-deep-RELU.in`. This
network is identical to `digits-deep.in` except that the hidden units
are rectified linear units (RELUs) instead of sigmoid units. (The network
also includes a cost function for the hidden units that weakly penalizes large
activation values.)
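As a reminder of the two unit types, here is a minimal sketch of the sigmoid and rectified linear activation functions, together with one possible form of an activation cost. The quadratic form of `activation_cost` and its `strength` value are assumptions for illustration only; the actual cost function and its strength are defined in `digits-deep-RELU.in`.

```python
import math

def sigmoid(x):
    """Logistic activation: always between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Rectified linear activation: zero for negative input, linear otherwise."""
    return max(0.0, x)

def activation_cost(activations, strength=0.001):
    """Hypothetical weak penalty that grows with the size of the activations."""
    return strength * sum(a * a for a in activations)

# Compare the two activation functions on a few net inputs.
for net_input in (-2.0, 0.0, 2.0, 5.0):
    print(net_input, round(sigmoid(net_input), 3), relu(net_input))

print(activation_cost([1.0, 2.0]))
```

Unlike sigmoid units, rectified linear units have no upper bound on their activation, which is one reason a network might include a penalty of this general kind.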

Click on "Train" to train for 150 epochs. Test the performance of the deep
RELU network on both the training and testing sets, and compare the results to
those for the deep sigmoid (`digits-deep.in`) and standard sigmoid
(`digits-rec.in`) networks after 150 epochs of training.

3B [10 pts.] Why is performance of the deep RELU network better than that of the deep sigmoid network? How much additional training of the deep sigmoid network is needed to match the performance of the deep RELU network?

3C [15 pts.] Why is performance of the deep RELU network even better than that of the standard sigmoid network after 150 epochs of training for each? Be sure to consider all of the differences between the two networks.