85-419/719 Intro PDP: Homework 4

Due Thursday, March 9, 10:30am


This homework constitutes 10% of your course grade.

Reconstruction (auto-encoders) vs. discrimination (recognition); deep networks

Download and unzip the file http://www.cnbc.cmu.edu/~plaut/IntroPDP/networks/digits.zip, which contains files defining four different networks that learn from hand-written digits.

All four networks use the same sets of hand-written digits for training and testing; these come from hand-digits.in in the Examples folder (but don't use that version of the network for this homework).

1. Auto-encoder network

First, start up lens and load digits-enc.in. Click on "Train" to train the network for 300 epochs (i.e., 300 sweeps through all 3823 training patterns; depending on how fast your laptop is, this may take a while). You'll notice that the total error reaches a plateau of about 50,000 at about 20 epochs (where the network has learned only the general size and position of the digits but is not differentiating among them much), that it starts dropping again at about 60 epochs, and that it falls to just over 15,500 by epoch 300. (These error values are much larger than in the XOR problem because the error measure sums over output units and over examples, and there are many more of each in the current simulation.)
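
To see why summing over units and examples inflates these totals, here is a minimal sketch of such an error measure, assuming a plain sum-squared error (illustrative Python, not Lens's actual code; the error function Lens reports may differ in its details):

    import numpy as np

    # Total error summed over every output unit and every example.
    # With 3823 training patterns and many output units per pattern,
    # even a small per-unit error adds up to tens of thousands,
    # whereas XOR has only 4 examples and 1 output unit.
    def total_error(outputs, targets):
        # outputs, targets: arrays of shape (n_examples, n_output_units)
        return np.sum((outputs - targets) ** 2)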

Now open the Unit Viewer. We don't really need to see target values (since they're the same as the inputs), so, under the "Value" menu, select "Outputs" (rather than "Outputs and Targets"). Click on various examples of the digit zero to get an impression of how well the network is able to regenerate each input.

1A [15 pts.] Consider the two examples "0-train5" and "0-train24" in particular. Describe the differences between the inputs and outputs for these examples, and how the outputs in particular are related to other instances of zeros. Try to explain the results in terms of the general properties of learning in PDP networks.

Now right-click on each of the hidden units in turn to display its incoming and outgoing weights. You will probably want to adjust the slider on the right-hand side to see the weight values clearly, and you may want to switch to the Black-Grey-White or Hinton Diagram palette.

1B [15 pts.] Characterize the kinds of features that the hidden units have learned and how they might combine to reconstruct specific digits.

2. Recognition network

From the main panel, click on "Run Script" and select digits-rec.in. This will load in the second network, which will learn to recognize each digit rather than to regenerate it. This network has the same number of hidden units (10) as the auto-encoder version. Note that you can switch back and forth between the two networks by clicking on "Network" in the upper left of the main panel and selecting the relevant network, which will help in comparing the two.
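
The two tasks can be pictured with the same forward pass; only the output layer and the training targets differ. Here is a hedged sketch (illustrative Python with hypothetical weight names, not the contents of the .in files):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Both networks compute: input -> 10 hidden units -> output layer.
    def forward(x, W1, b1, W2, b2):
        h = sigmoid(x @ W1 + b1)     # 10 hidden units in both networks
        return sigmoid(h @ W2 + b2)  # output layer

    # Auto-encoder: the output layer matches the input in size, and the
    # target for each example is the input pattern itself.
    # Recognition:  the output layer has 10 units (one per digit), and
    # the target is all zeros except for the correct digit's unit.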

Train the recognition network for 150 epochs by clicking on the "Train" button. At this point, the network is producing relatively low error on the training set (just under 850). Now right-click on each of the hidden units in turn to display its incoming and outgoing weights (just as you did for the auto-encoder network).

2A [20 pts.] Contrast the kinds of features learned by the recognition network with those learned by the auto-encoder network. Explain the differences in terms of the distinct demands of the two tasks.

Click on "Testing Set" at the top right of the main panel, select "digits-rec.train" and click "Test" (near the bottom) to test performance on the training set. Note the error per example and the percent of examples for which the output unit criterion is met. (For this network, the latter number is the percent of examples for which the correct output unit is the most active one.) Then set "Testing Set" to "digits-rec.test" and click "Test" to test generalization performance on the (untrained) testing set, noting the same performance measures.

Now set "Weight Updates" to 1000 and click "Train Network" in order to train to a total of 1150 epochs. This may take a while....

2B [15 pts.] Test the performance of the network on both the training and testing sets in the same way as you did after 150 epochs. Report the results for both 150 and 1150 epochs and explain them.

3. Deep networks (sigmoid and RELU)

Click on "Run Script" in the main panel and select digits-deep.in. This will load in the third network, which is also trained to recognize digits but now has an additional hidden layer of 40 units between the input and what is now the second hidden layer of 10 units. Apart from having different initial random weights, all other aspects of the simulation, including learning parameters, are the same as for the standard recognition network (digits-rec.in).

Click on "Train" to train for 150 epochs. Test the performance of the deep network on both the training and testing sets, and compare the results to those for the standard network after 150 epochs of training. (You may want to use "Load Weights" to reload the latter's initial weights, digits-rec.init.wt, and train for 150 epochs again.)

3A [10 pts.] Why is performance of the deep sigmoid network so much worse than that of the standard sigmoid network? Try to relate your answer to the nature of the learning curve shown in the error graph for the deep network.

Finally, use "Run Script" to load in digits-deep-RELU.in. This network is identical to digits-deep.in except that the hidden units are rectified linear units (RELUs) instead of sigmoid units. (The network also includes a cost function for the hidden units that weakly penalizes large activation values.)

Click on "Train" to train for 150 epochs. Test the performance of the deep RELU network on both the training and testing sets, and compare the results to those for the deep sigmoid (digits-deep.in) and standard sigmoid (digits-rec.in) networks after 150 epochs of training.

3B [10 pts.] Why is performance of the deep RELU network better than that of the deep sigmoid network? How much additional training of the deep sigmoid network is needed to match the performance of the deep RELU network?

3C [15 pts.] Why is performance of the deep RELU network even better than that of the standard sigmoid network after 150 epochs of training for each? Be sure to consider all of the differences between the two networks.