Some questions have multiple parts or ask you to hand in additional things;
**be careful not to leave out anything** in preparing your responses.
Although the assignment will be graded out of 100 points, it constitutes 15%
of your course grade.

Open the Link Viewer to display the values of the initial weights. Note the values from each input unit to each hidden unit (the 2x2 square), and from each hidden unit to the output unit (the top 2x1 rectangle). If you prefer, you can also examine these values by right-clicking on units in the Unit Viewer.

Now click on "Train" in the main panel. Because "Batch Size" and "Weight
Updates" are both set to 1, this will train the network once on only the first
pattern ("0 0 => 0"). Note the activation values of the hidden and output
units. Then, in the Unit Display, select "Output Derivatives" from the "Value"
menu, in order to display the derivative of the error with respect to the
output (activation) of each unit in the network (dE/da_{j} for each unit j).
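The quantities Lens displays here can be reproduced by hand. The sketch below uses made-up illustrative weights (not the actual values in `XOR.init.wt`) and assumes a squared-error measure, E = (a_out - t)^2 / 2; it shows how the derivative dE/da for each hidden unit is the output unit's derivative scaled by the hidden-to-output weight and by the slope of the output unit's sigmoid:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative weights for a 2-2-1 network (NOT the actual XOR.init.wt values):
# w_ih[i][h] = weight from input i to hidden h; b_h = hidden biases
w_ih = [[0.4, -0.2], [0.3, 0.1]]
b_h  = [0.1, -0.1]
w_ho = [0.5, -0.3]          # hidden -> output weights
b_o  = 0.2                  # output bias

def forward(inputs):
    hid = [sigmoid(sum(inputs[i] * w_ih[i][h] for i in range(2)) + b_h[h])
           for h in range(2)]
    out = sigmoid(sum(hid[h] * w_ho[h] for h in range(2)) + b_o)
    return hid, out

def output_derivatives(inputs, target):
    """dE/da for each unit, assuming squared error E = (a_out - t)^2 / 2."""
    hid, out = forward(inputs)
    dE_da_out = out - target
    # The derivative reaching each hidden unit passes back through the output
    # unit's sigmoid slope, out * (1 - out), and the hidden-to-output weight:
    dE_da_hid = [w_ho[h] * out * (1.0 - out) * dE_da_out for h in range(2)]
    return dE_da_hid, dE_da_out

dh, do = output_derivatives([0.0, 0.0], 0.0)
print("dE/da hidden:", dh, " dE/da output:", do)
```

Because the sigmoid slope is at most 0.25 and the weights start small, the hidden units' derivatives come out much smaller in magnitude than the output unit's.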

Select "Outputs and Targets" from the "Value" menu to go back to displaying unit activations and targets. Set "Batch Size" to 0 in the main panel. This causes Lens to run all four training patterns, accumulating error derivatives as it goes, before actually using them to change the weights---so-called "batch" learning. Also set "Weight Updates" to 300 in the main panel. Then click on "Train" to train for 300 epochs (i.e., 300 sweeps through the four patterns). Notice in the error graph that, after a very long initial flat period, the error finally drops quickly and reaches a value near zero. Use the Unit Viewer to examine the activation values of the hidden and output units for the four cases, and use the Link Viewer (or right-click on various units in the Unit Viewer) to examine the final weights.

**1A [15 pts.]** Based on the values of the weights in the network and the details of the back-propagation learning procedure, explain why the output derivative of the left hidden unit (hidden:0) is larger than that of the right hidden unit (hidden:1), and why both values are very much smaller than that of the output unit.
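The full-batch regime described above (gradients accumulated over all four patterns, then one weight update per epoch) can be sketched in plain Python. This is a generic emulation, not Lens itself, and it starts from small random weights rather than `XOR.init.wt`, so the exact error curve will differ:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# The four XOR training patterns: (inputs, target)
patterns = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

random.seed(1)
# A 2-2-1 network with small random initial weights
w_ih = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(2)]
b_h  = [random.uniform(-0.5, 0.5) for _ in range(2)]
w_ho = [random.uniform(-0.5, 0.5) for _ in range(2)]
b_o  = random.uniform(-0.5, 0.5)
lr   = 0.5

def forward(x):
    hid = [sigmoid(x[0] * w_ih[0][h] + x[1] * w_ih[1][h] + b_h[h])
           for h in range(2)]
    out = sigmoid(hid[0] * w_ho[0] + hid[1] * w_ho[1] + b_o)
    return hid, out

def epoch():
    """One sweep through all four patterns, accumulating error derivatives,
    followed by a single weight update.  Returns the summed squared error."""
    global b_o
    g_ih = [[0.0, 0.0], [0.0, 0.0]]
    g_bh = [0.0, 0.0]
    g_ho = [0.0, 0.0]
    g_bo = 0.0
    err = 0.0
    for x, t in patterns:
        hid, out = forward(x)
        err += 0.5 * (out - t) ** 2
        d_out = (out - t) * out * (1 - out)        # delta at the output unit
        g_bo += d_out
        for h in range(2):
            d_hid = d_out * w_ho[h] * hid[h] * (1 - hid[h])
            g_ho[h] += d_out * hid[h]
            g_bh[h] += d_hid
            for i in range(2):
                g_ih[i][h] += d_hid * x[i]
    # Apply the accumulated derivatives in one "batch" update
    for h in range(2):
        w_ho[h] -= lr * g_ho[h]
        b_h[h]  -= lr * g_bh[h]
        for i in range(2):
            w_ih[i][h] -= lr * g_ih[i][h]
    b_o -= lr * g_bo
    return err

errors = [epoch() for _ in range(3000)]
print(f"summed squared error: {errors[0]:.3f} -> {errors[-1]:.3f}")
```

Plotting `errors` against epoch number reproduces the kind of error curve the Lens graph shows, flat stretches included.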

**1B [25 pts.]** In addition to your answers to 1A, explain how the network has solved the XOR problem. Specifically, for each hidden unit, describe how it is behaving across the four cases and explain how it accomplishes this given its bias, its weights from the input units, and the sigmoid activation function. Then explain how the output unit behaves in accordance with XOR based on its bias, its weights from these hidden units (and their activations for the four cases), and the sigmoid function.
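To see what a division of labor between the two hidden units can look like, here is one well-known hand-crafted solution to XOR, with hypothetical weights chosen by hand rather than learned. Your trained network's weights will differ, but checking a solution like this against the four cases is a useful warm-up for the question above:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# A hand-crafted XOR solution (illustrative, not learned weights):
# hidden 0 acts like OR, hidden 1 acts like AND, and the output computes
# "OR and not AND" -- which is exactly XOR.
w_ih = [[10.0, 10.0], [10.0, 10.0]]   # w_ih[i][h]: input i -> hidden h
b_h  = [-5.0, -15.0]                  # OR-like and AND-like thresholds
w_ho = [10.0, -10.0]                  # output: excited by OR, inhibited by AND
b_o  = -5.0

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    hid = [sigmoid(x[0] * w_ih[0][h] + x[1] * w_ih[1][h] + b_h[h])
           for h in range(2)]
    out = sigmoid(hid[0] * w_ho[0] + hid[1] * w_ho[1] + b_o)
    print(x, [round(h, 3) for h in hid], round(out, 3))
```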

**Extra Credit [10 pts.]** The very long plateau in error at the beginning of training is not typical of training the XOR problem when starting from different initial random weights (although a brief plateau is not uncommon). Explain why this long plateau occurs for the initial weights used in the homework (XOR.init.wt).

Study a feedforward problem of your choosing (other than XOR). Be sure to stick with a smallish problem, involving no more than a dozen or so units in total and maybe 15-20 training patterns. Take a look at the back-propagation chapter from the first PDP volume (PDP1 Chapter 8) for some ideas (e.g., parity, encoder, symmetry, negation), but feel free to come up with something more interesting (e.g., you could train on an "encoder" task using inputs/outputs where more than one unit is on per pattern, and where there is some "structure" or systematic relationship in which units are active together). Binary addition is tricky to make work and should be attempted only if you are interested in a challenge. Stay away from the T-C problem and other things later in the chapter. If you do choose one of the problems from the chapter, you may want to use more hidden units and train on more patterns than they do (particularly for symmetry or addition).

You will need to create new script and example files, which is most easily
done by copying and renaming existing files (e.g., `XOR.in` and
`XOR.ex`) and modifying them appropriately (making sure to save the new
versions as plain-text files). Also note that you will need to reduce the
learning rate because you are training on many more patterns (and might have
many more output units). Finally, networks for other homework problems have
started from all-zero weights (8x8 associator) or loaded in a set of initial
weights (XOR). When training a network with hidden units on your own
problem, though, it is better to start with small random initial weights (just
use the `resetNet` command or control-panel button). You can then save
these initial weights (with "Save Weights") so you can load them in later if
you want to be able to re-create a particular training run.

**Carry out learning experiments** in which you examine what aspects of
the training set the network finds easy or more difficult to learn, and how
well it generalizes to patterns that are withheld from training (much like you
did for your pattern association problem in the second half of Homework 2).
Explore the impact of **either** 1) changing the *number of hidden
units*, or 2) adding *weight decay*, on the speed and efficacy of
learning, and on generalization. (If you add weight decay, be sure to use a
small value (e.g., 0.001) or the network will not be able to learn the
training patterns sufficiently well.)

Note that many of the problems mentioned above (e.g., parity) are ones
where the most similar inputs map to different outputs and so are not expected
to give rise to effective generalization (although it's still worth testing).
For problems like these (and maybe others), you might try training on *all
but one* of the set of possible patterns (e.g., 31 patterns for 5-bit
parity) and see if the network can generalize to the remaining pattern.
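Generating such a training set is straightforward. The sketch below builds all 32 patterns for 5-bit parity and withholds one for a generalization test (which pattern to withhold is up to you; the last one is used here purely for illustration):

```python
from itertools import product

# All 32 patterns for 5-bit parity: (input bits, parity of the bits)
patterns = [(bits, sum(bits) % 2) for bits in product([0, 1], repeat=5)]

# Hold one pattern out of training to test generalization later
held_out = patterns[-1]                    # here: (1, 1, 1, 1, 1) -> 1
training = [p for p in patterns if p != held_out]

print(len(training), "training patterns; held out:", held_out)
```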

In writing up your results, include a description of the problem you have
chosen and why you find it interesting, and displays of the weights of your
network before and after training. Be sure to try to *explain* your
results based on your understanding of the properties of PDP networks, the
back-propagation learning procedure, and the structure among your training
patterns.