85-419/719 Intro PDP: Homework 3

Due Tuesday, Feb 28, 10:30am

Some questions have multiple parts or ask you to hand in additional things; be careful not to leave out anything in preparing your responses. Although the assignment will be graded out of 100 points, it constitutes 15% of your course grade.

1. XOR [40 points]

Download and unzip the file http://www.cnbc.cmu.edu/~plaut/IntroPDP/networks/XOR.zip which contains XOR.in, XOR.ex, and XOR.init.wt. These files define a network to solve the XOR problem. (Although the Lens Examples folder contains a version of the XOR network, xor.in, don't use that version for this homework.)

Open the Link Viewer to display the values of the initial weights. Note the values from each input unit to each hidden unit (the 2x2 square), and from each hidden unit to the output unit (the top 2x1 rectangle). If you prefer, you can also examine these values by right-clicking on units in the Unit Viewer.

Now click on "Train" in the main panel. Because "Batch Size" and "Weight Updates" are both set to 1, this will train the network once on only the first pattern ("0 0 => 0"). Note the activation values of the hidden and output units. Then, in the Unit Display, select "Output Derivatives" from the "Value" menu, in order to display the derivative of the error with respect to the output (activation) of each unit in the network (dE/da_j for each unit j).

1A [15 pts.] Based on the values of the weights in the network and the details of the back-propagation learning procedure, explain why the output derivative of the left hidden unit (hidden:0) is larger than that of the right hidden unit (hidden:1), and why both values are very much smaller than that of the output unit.
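For intuition about where these derivatives come from, here is a hand-computed sketch of the chain rule in a 2-2-1 sigmoid network. The weight values below are hypothetical stand-ins, not the values in XOR.init.wt (read those off the Link Viewer yourself), but the qualitative pattern is the same for any small weights:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical small weights and zero biases -- NOT the values in
# XOR.init.wt; read the actual values off the Link Viewer.
w_ih = [[0.2, -0.1],   # input 0 -> hidden 0, hidden 1
        [0.1, 0.05]]   # input 1 -> hidden 0, hidden 1
w_ho = [0.3, -0.2]     # hidden 0, hidden 1 -> output

x, target = (0.0, 0.0), 0.0          # the first pattern, "0 0 => 0"

# Forward pass.
a_h = [sigmoid(sum(x[i] * w_ih[i][j] for i in range(2))) for j in range(2)]
a_o = sigmoid(sum(a_h[j] * w_ho[j] for j in range(2)))

# dE/da at the output, for squared error E = 1/2 (a_o - t)^2.
d_out = a_o - target
# Going back one layer multiplies by the sigmoid slope at the output
# (at most 0.25) and by a single hidden-to-output weight, so the hidden
# derivatives are necessarily much smaller in magnitude, and the hidden
# unit with the larger |weight| to the output gets the larger derivative.
d_hidden = [d_out * a_o * (1.0 - a_o) * w_ho[j] for j in range(2)]
```

Your answer should tie the same two multiplicative factors to the actual values in XOR.init.wt.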
Select "Outputs and Targets" from the "Value" menu to go back to displaying unit activations and targets. Set "Batch Size" to 0 in the main panel. This causes Lens to run all four training patterns, accumulating error derivatives as it goes, before actually using them to change the weights---so-called "batch" learning. Also set "Weight Updates" to 300 in the main panel. Then click on "Train" to train for 300 epochs (i.e., 300 sweeps through the four patterns).

Notice in the error graph that, after a very long initial flat period, the error finally drops quickly and reaches a value near zero. Use the Unit Viewer to examine the activation values of the hidden and output units for the four cases, and use the Link Viewer (or right-click on various units in the Unit Viewer) to examine the final weights.
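The batch regime Lens is using can be sketched in a few lines of Python: derivatives are accumulated over all four patterns, and the weights change once per sweep. This is an illustrative re-implementation, not Lens itself; the seed, learning rate, and weight range here are arbitrary choices:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# The four XOR patterns: inputs and target.
patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

random.seed(0)                     # arbitrary small random initial weights
w_ih = [[random.uniform(-0.5, 0.5) for j in range(2)] for i in range(2)]
b_h = [random.uniform(-0.5, 0.5) for j in range(2)]
w_ho = [random.uniform(-0.5, 0.5) for j in range(2)]
b_o = random.uniform(-0.5, 0.5)
lr = 0.3                           # illustrative learning rate

def total_error():
    err = 0.0
    for x, t in patterns:
        a_h = [sigmoid(b_h[j] + sum(x[i] * w_ih[i][j] for i in range(2)))
               for j in range(2)]
        a_o = sigmoid(b_o + sum(a_h[j] * w_ho[j] for j in range(2)))
        err += 0.5 * (a_o - t) ** 2
    return err

initial_error = total_error()
for epoch in range(300):
    # Accumulate error derivatives over all four patterns ("Batch Size" 0)...
    g_ih = [[0.0, 0.0], [0.0, 0.0]]
    g_bh, g_ho = [0.0, 0.0], [0.0, 0.0]
    g_bo = 0.0
    for x, t in patterns:
        a_h = [sigmoid(b_h[j] + sum(x[i] * w_ih[i][j] for i in range(2)))
               for j in range(2)]
        a_o = sigmoid(b_o + sum(a_h[j] * w_ho[j] for j in range(2)))
        d_o = (a_o - t) * a_o * (1 - a_o)                # dE/dnet at output
        g_bo += d_o
        for j in range(2):
            g_ho[j] += d_o * a_h[j]
            d_h = d_o * w_ho[j] * a_h[j] * (1 - a_h[j])  # dE/dnet at hidden j
            g_bh[j] += d_h
            for i in range(2):
                g_ih[i][j] += d_h * x[i]
    # ...then change the weights once per sweep.
    b_o -= lr * g_bo
    for j in range(2):
        w_ho[j] -= lr * g_ho[j]
        b_h[j] -= lr * g_bh[j]
        for i in range(2):
            w_ih[i][j] -= lr * g_ih[i][j]
final_error = total_error()
```

With a different seed the error curve will typically show at most a brief plateau; the very long plateau you see in Lens is a property of the particular weights in XOR.init.wt.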

1B [25 pts.] Explain how the network has solved the XOR problem. Specifically, for each hidden unit, describe how it is behaving across the four cases and explain how it accomplishes this given its bias, its weights from the input units, and the sigmoid activation function. Then explain how the output unit behaves in accordance with XOR based on its bias, its weights from these hidden units (and their activations for the four cases), and the sigmoid function.
In addition to your answers, hand in a print-out of the final weights as displayed in the Link Viewer (being sure to use "Hinton Diagram" under "Palette" and to rescale the display appropriately, using the slider on the right).
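If it helps to have a concrete reference point for your explanation, here is one classic hand-wired solution to XOR (hypothetical weights chosen by hand; the solution your network learns from XOR.init.wt will generally be different, and your write-up must describe the learned one): hidden 0 acts roughly like OR, hidden 1 roughly like AND, and the output turns on for OR-but-not-AND.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hand-set weights (hypothetical): hidden 0 ~ OR, hidden 1 ~ AND.
w_ih = [[10.0, 10.0],   # input 0 -> hidden 0, hidden 1
        [10.0, 10.0]]   # input 1 -> hidden 0, hidden 1
b_h = [-5.0, -15.0]     # OR turns on with one active input; AND needs both
w_ho = [10.0, -15.0]    # output excited by OR, strongly inhibited by AND
b_o = -5.0

outputs = {}
for x, target in [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]:
    a_h = [sigmoid(b_h[j] + sum(x[i] * w_ih[i][j] for i in range(2)))
           for j in range(2)]
    outputs[x] = sigmoid(b_o + sum(a_h[j] * w_ho[j] for j in range(2)))
```

Note how each claim about a unit's role rests on its bias, its incoming weights, and the saturating shape of the sigmoid; that is the style of explanation being asked for.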
Extra Credit [10 pts.] The very long plateau in error at the beginning of training is not typical of training the XOR problem when starting from different initial random weights (although a brief plateau is not uncommon). Explain why this long plateau occurs for the initial weights used in the homework (XOR.init.wt).

2. Implementing Another Feedforward Problem [60 points]

Study a feedforward problem of your choosing (other than XOR). Be sure to stick with a smallish problem, involving no more than a dozen or so units in total and maybe 15-20 training patterns. Take a look at the back-propagation chapter from the first PDP volume (PDP1 Chapter 8) for some ideas (e.g., parity, encoder, symmetry, negation), but feel free to come up with something more interesting (e.g., you could train on an "encoder" task using inputs/outputs where more than one unit is on per pattern, and where there is some "structure" or systematic relationship in which units are active together). Binary addition is tricky to make work and should be attempted only if you are interested in a challenge. Stay away from the T-C problem and other things later in the chapter. If you do choose one of the problems from the chapter, you may want to use more hidden units and train on more patterns than they do (particularly for symmetry or addition).

You will need to create new script and example files, which is most easily done by copying and renaming existing files (e.g., XOR.in and XOR.ex) and modifying them appropriately (making sure to save the new versions as plain-text files). Also note that you will need to reduce the learning rate because you are training on many more patterns (and might have many more output units). Finally, networks for other homework problems have started from all-zero weights (8x8 associator) or loaded in a set of initial weights (XOR). When training a network with hidden units on your own problem, though, it is better to start with small random initial weights (just use the resetNet command or control-panel button). You can then save these initial weights (with "Save Weights") so you can load them in later if you want to be able to re-create a particular training run.
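For reference, an entry in a Lens example (.ex) file looks roughly like the fragment below, with `I:` listing input values, `T:` listing targets, and `;` ending the example. This is from memory, so treat the exact syntax as an assumption and confirm it against the XOR.ex you downloaded (and the Lens manual) before relying on it:

```
name: pat01
I: 0 1
T: 1;
```

Whatever the details, the safest route is still the one described above: copy XOR.ex, keep its layout, and change only the names and values.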

Carry out learning experiments in which you examine what aspects of the training set the network finds easy or more difficult to learn, and how well it generalizes to patterns that are withheld from training (much like you did for your pattern association problem in the second half of Homework 2). Explore the impact of either 1) changing the number of hidden units, or 2) adding weight decay, on the speed and efficacy of learning, and on generalization. (If you add weight decay, be sure to use a small value (e.g., 0.001) or the network will not be able to learn the training patterns sufficiently well.)
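To see what weight decay does arithmetically, here is a minimal sketch of a single weight update with a decay term added (illustrative values throughout; Lens's own decay implementation may differ in detail):

```python
lr = 0.1       # illustrative learning rate
decay = 0.001  # keep weight decay small, as noted above

w = 0.8        # some weight in the network (hypothetical value)
grad = 0.05    # dE/dw accumulated over the batch (hypothetical value)

w_plain = w - lr * grad                # plain gradient descent step
w_decayed = w - lr * grad - decay * w  # the decay term pulls w toward zero
```

Because the decay term is proportional to w itself, large weights shrink fastest; with too large a decay value, this pull toward zero overwhelms the error gradient and the network cannot learn the training patterns well.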

Note that many of the problems mentioned above (e.g., parity) are ones where the most similar inputs map to different outputs and so are not expected to give rise to effective generalization (although it's still worth testing). For problems like these (and maybe others), you might try training on all but one of the set of possible patterns (e.g., 31 patterns for 5-bit parity) and see if the network can generalize to the remaining pattern.
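If you go the leave-one-out route, a few lines of Python can enumerate the full pattern set and split off the held-out case. This sketch assumes 5-bit parity as the task; substitute your own problem and then transcribe the patterns into your .ex file:

```python
from itertools import product

n = 5  # 5-bit parity, as an example task
# Target is 1 when an odd number of input bits is on.
patterns = [(bits, sum(bits) % 2) for bits in product((0, 1), repeat=n)]

held_out = patterns[-1]    # withhold (1, 1, 1, 1, 1) -> 1 from training
training = patterns[:-1]   # train on the remaining 31 patterns
```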

In writing up your results, include a description of the problem you have chosen and why you find it interesting, and displays of the weights of your network before and after training. Be sure to try to explain your results based on your understanding of the properties of PDP networks, the back-propagation learning procedure, and the structure among your training patterns.