Distributed neural networks and backpropagation

Why use distributed representations?

1. The brain does.

2. Distributed representations can have powerful statistical properties, e.g. approximating nonlinear regression and Bayesian inference.

3. Distributed representations can be learned rather than hand-coded.

4. Distributed representations are more robust, in that you can delete some of the units without destroying the representation (graceful degradation).

5. Distributed representations can model psychological phenomena that localist representations cannot.

Distributed representations via backpropagation

Structures

Units are typically organized in 3 layers:

• Input layer: units activated or deactivated by external input.
• Hidden layer: a small number of units between the input and output layers.
• Output layer: units that produce an answer that can be corrected.

If there are too few units in the hidden layer, the network will not learn complex patterns. If there are too many, the network will become a lookup table and will not generalize to new cases.

Unlike the symmetric links in the coherence networks in week 8, the links are unidirectional, so that activation flows from the input layer to the hidden layer to the output layer. These are called feedforward networks.

The links have weights that are adjusted as the network is trained to avoid errors.

Procedures

The network makes inferences by having units in the input layer activated, leading to activation of units in the output layer, which represent the network's conclusion.

The network learns by being told when it has made errors (supervised learning). If an output unit gives the wrong answer, the error is propagated back down the network, producing weight changes in all the links that contributed to the wrong answer.

Algorithms

(From P. Johnson-Laird, The Computer and the Mind, 1988, p. 187). In his terms, strengths of connections are weights of links.

After assigning initial random values (between +1 and -1) to the strengths of connections, the main function loops through each input-output pair calling three main functions: propagation of activation up the network, calculation of error, and backpropagation of change in strengths. There are five main calculations.

1. For the propagation of activation:

Input to a unit = (Sum of activations X strengths from units below) - Threshold bias

The threshold of a unit is created by another unit that is connected to it and that is always active.

2. For the error signal of an output unit:

Output error = (target - activation) X activation X (1 - activation)

where "target" refers to the required activity of the output unit, and "activation" refers to its actual activity.

3. For the error signal of a hidden unit:

Hidden error = (Sum of strengths of connections to each output unit X that output unit's error) X activation X (1 - activation)

where "activation" refers to the hidden unit's own activation.

4. For the change in the strength of connection from a hidden unit to an output unit:

Change = (Learning rate X output error X activation of hidden unit) + (momentum proportion X previous change)

where learning rate is a global variable (usually set between 0.3 and 0.7) and momentum proportion is another global variable (usually set at around 0.9) that smooths out changes.

5. For the change in strength of connection from an input (or lower level) unit to a hidden unit:

Change = (Learning rate X hidden error X activation of lower unit) + (momentum proportion X previous change)

The change in the strength of a connection from a unit acting as a threshold bias is made according to calculation 4 or 5, except that the bias unit's activation is always 1.
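The five calculations above can be sketched in plain Python. This is a minimal illustration, not Johnson-Laird's program: the logistic activation function is an assumption (suggested by the activation (1 - activation) term), the threshold is folded into the weighted sum as the weight from an always-active bias unit, and all function names are hypothetical.

```python
import math
import random


def logistic(x):
    """Squash a net input into (0, 1). Assumed activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def propagate(lower_acts, weights):
    """Calculation 1. The last row of `weights` comes from an always-active
    bias unit, so the threshold is folded into the weighted sum."""
    lower = list(lower_acts) + [1.0]
    return [logistic(sum(a * row[j] for a, row in zip(lower, weights)))
            for j in range(len(weights[0]))]


def output_error(target, activation):
    """Calculation 2."""
    return (target - activation) * activation * (1.0 - activation)


def hidden_error(strengths, output_errors, activation):
    """Calculation 3. `strengths` are this hidden unit's links to the outputs."""
    back = sum(s * e for s, e in zip(strengths, output_errors))
    return back * activation * (1.0 - activation)


def weight_change(rate, error, lower_act, momentum, previous):
    """Calculations 4 and 5."""
    return rate * error * lower_act + momentum * previous


def adjust(weights, previous, lower_acts, errors, rate, momentum):
    """Apply calculation 4 or 5 to every link into one layer (bias row too)."""
    for i, act in enumerate(list(lower_acts) + [1.0]):
        for j, err in enumerate(errors):
            previous[i][j] = weight_change(rate, err, act, momentum, previous[i][j])
            weights[i][j] += previous[i][j]


def train(pairs, n_hidden, epochs=2000, rate=0.5, momentum=0.9, seed=0):
    """Loop over input-output pairs: propagate, compute errors, backpropagate."""
    rng = random.Random(seed)
    n_in, n_out = len(pairs[0][0]), len(pairs[0][1])
    # Initial random strengths between -1 and +1; one extra row for the bias unit.
    w_ih = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_in + 1)]
    w_ho = [[rng.uniform(-1, 1) for _ in range(n_out)] for _ in range(n_hidden + 1)]
    p_ih = [[0.0] * n_hidden for _ in range(n_in + 1)]
    p_ho = [[0.0] * n_out for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        for inputs, targets in pairs:
            hidden = propagate(inputs, w_ih)
            outputs = propagate(hidden, w_ho)
            out_err = [output_error(t, o) for t, o in zip(targets, outputs)]
            # Hidden errors use the strengths as they were before this update.
            hid_err = [hidden_error(w_ho[h], out_err, hidden[h])
                       for h in range(n_hidden)]
            adjust(w_ho, p_ho, hidden, out_err, rate, momentum)
            adjust(w_ih, p_ih, inputs, hid_err, rate, momentum)
    return w_ih, w_ho


def predict(net, inputs):
    """Run the trained network forward on one input pattern."""
    w_ih, w_ho = net
    return propagate(propagate(inputs, w_ih), w_ho)


def total_error(net, pairs):
    """Sum of squared differences between targets and outputs."""
    return sum((t - o) ** 2
               for inputs, targets in pairs
               for t, o in zip(targets, predict(net, inputs)))
```

A call such as net = train(pairs, n_hidden=2), where pairs is a list of (inputs, targets) tuples, returns the two weight tables, and predict(net, inputs) then gives the output activations. Comparing total_error before and after training shows whether the strengths have moved in a useful direction.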

Limitations of backpropagation learning

1. Requires a supervisor to train the system.

2. Does not allow feedback from the output layer to the input layer. (Networks with loops are called recurrent networks.)

3. Requires a very large number of trials to train the network effectively - no one-trial learning.

4. Not biologically plausible: real neurons lack connections to do backpropagation. Hebbian learning, in which the weight between two units is increased if the two units are simultaneously active, is more neurologically realistic.

5. Networks trained intensively by backpropagation can become incapable of flexibly learning from examples different from the ones that they were originally trained on.

Unsupervised learning

No input and output layers, and no error signal. Hebbian learning is a simple kind of unsupervised learning.
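A Hebbian update can be sketched in a few lines. The product rule used here (change = rate X activation of one unit X activation of the other) is one common way to formalize "increase the weight when both units are active"; it is an assumption rather than something specified in these notes, and the function names are hypothetical.

```python
def hebbian_step(weights, activations, rate=0.1):
    """Strengthen the link between every pair of co-active units.

    Uses the product rule (change = rate * a_i * a_j), which is an
    assumed formalization; no target or error signal is involved.
    """
    n = len(activations)
    for i in range(n):
        for j in range(n):
            if i != j:  # no self-connections
                weights[i][j] += rate * activations[i] * activations[j]
    return weights


def train_hebbian(patterns, n_units, rate=0.1):
    """Present each activation pattern once, starting from zero weights."""
    weights = [[0.0] * n_units for _ in range(n_units)]
    for pattern in patterns:
        hebbian_step(weights, pattern, rate)
    return weights
```

Units that are often active together end up with stronger links, while units that never fire together keep a weight of zero.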

In place of a supervisor, specify a generative model of the way in which the environment is assumed to generate data. This model is a neural network that can be used to correct the neural network that models the environment. Overall, the neural network trains itself, with only minor feedback from the environment.

Course on unsupervised learning.

There are good introductory articles on neural networks, supervised learning, unsupervised learning, and recurrent networks in the MIT Encyclopedia of the Cognitive Sciences, available in the Porter library reference section.

Structured distributed representations

Representational limitations of simple neural networks

The localist networks described in week 7 and the distributed networks produced by backpropagation are fine for representing simple associations, e.g. between the concepts cat and furry.

But they lack the representational power to convey relational information, as in: Because the cat scratched the dog, the dog chased the cat. In logical symbolism, this is something like: (cause ( (scratch (cat dog)) (chase (dog cat)) ) ).

To model high level cognition, a neural network must be able to distinguish between a dog chasing a cat and a cat chasing a dog, and also be able to represent the higher level relation between scratching and chasing.

In current research, there are two general ways of capturing relational information in distributed representations:

• vector models that build distributed representations algebraically
• neural synchrony models that use time as an extra component

Logical aside: in formal systems, you can prove that any system with n-place relations, e.g. (gives (donor recipient gift)), can be reduced to a system with 2-place relations, but not to a system with only 1-place predicates.
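One familiar way to picture such a reduction (not necessarily the construction used in the formal proof) is to introduce a new term for the whole fact and link it to each argument with a 2-place relation. The sketch below does this for (gives (donor recipient gift)); the names reify, restore, instance-of and arg1, arg2, ... are hypothetical.

```python
def reify(relation, arguments, event_id):
    """Rewrite an n-place fact as 2-place facts about a new event term."""
    facts = [("instance-of", event_id, relation)]
    for position, argument in enumerate(arguments, start=1):
        facts.append((f"arg{position}", event_id, argument))
    return facts


def restore(facts):
    """Recover (relation, arguments) from the 2-place facts."""
    relation = next(obj for name, _, obj in facts if name == "instance-of")
    numbered = sorted((int(name[3:]), obj)
                      for name, _, obj in facts if name.startswith("arg"))
    return relation, tuple(obj for _, obj in numbered)
```

Each resulting fact relates exactly two things, yet restore shows that no information about the original 3-place fact has been lost. The same role-by-role decomposition reappears in the vector and synchrony models, which bind each filler to a role.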

Vector models of distributed representations

History

Fodor and Pylyshyn (1988) argued that artificial neural networks are inherently limited in their ability to represent complex information.

Paul Smolensky (1990) proposed a tensor-product technique.

Tony Plate (1993) developed holographic reduced representations - HRRs.

Chris Eliasmith (2001) developed an HRR-based model of analogical mapping, DRAMA. (See Eliasmith & Thagard (2001), which has extensive references.)

Structures

A vector is an ordered set of real numbers, e.g. (.2 .34 .9). We can think of a vector of n numbers as representing the activations (firing rates) of a set of n neurons.

A concept or other mental representation can be modelled as a vector of n numbers, i.e. as a distributed representation involving n artificial neurons.

HRRs use 512-dimensional vectors as distributed representations of concepts and propositions. They are "holographic" in that the encoding and decoding operations on them are those used in explanations of holography. They are "reduced" in that encoding operations can involve loss of information.

To represent (chase (dog cat)), we need six vectors, for chase, dog, cat, relation, agent, and object. Each of these is a randomly chosen 512-dimensional vector.

To bind them up, we use the holographic operation of convolution (CONV), producing new vectors:

• relation CONV chase
• agent CONV dog
• object CONV cat

Then these vectors can be combined by superposition (+) to produce a vector V1 that represents (chase (dog cat)), and in the same way a vector V2 that represents (scratch (cat dog)):

• V1 = relation CONV chase + agent CONV dog + object CONV cat
• V2 = relation CONV scratch + agent CONV cat + object CONV dog

Higher order relations such as cause can be represented too, with (cause ( (scratch (cat dog)) (chase (dog cat)) ) ) becoming:

• V3 = relation CONV cause + agent CONV V2 + object CONV V1.

Procedures

Convolution, to bind concepts to their roles.

Superposition, to combine concepts into propositions.

Correlation (decoding), to extract part of a proposition, e.g. finding out what the relation is.

Similarity, to compare vectors using their dot products.

Analog retrieval: find similar vectors.

Analogical mapping: use similarity as a strong guide to what corresponds to what.

For details, see Eliasmith & Thagard (2001).
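The encoding and decoding procedures can be illustrated with a small standard-library sketch. It assumes that CONV is circular convolution and that decoding is circular correlation, as in Plate's HRRs; drawing vector elements from a normal distribution with variance 1/n follows Plate's convention and is an assumption here, as are all function names. It is far slower than an FFT-based implementation and is meant only to show the idea.

```python
import math
import random


def random_vector(n, rng):
    """Random concept vector; elements drawn from N(0, 1/n) (an assumption)."""
    sd = 1.0 / math.sqrt(n)
    return [rng.gauss(0.0, sd) for _ in range(n)]


def conv(a, b):
    """Circular convolution: binds role a to filler b."""
    n = len(a)
    return [sum(a[k] * b[(i - k) % n] for k in range(n)) for i in range(n)]


def corr(a, trace):
    """Circular correlation: a noisy estimate of what was bound to a in trace."""
    n = len(a)
    return [sum(a[k] * trace[(i + k) % n] for k in range(n)) for i in range(n)]


def superpose(*vectors):
    """Elementwise sum of vectors."""
    return [sum(xs) for xs in zip(*vectors)]


def cosine(a, b):
    """Similarity: dot product normalized by the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


def demo(n=512, seed=0):
    """Encode (chase (dog cat)) and ask which concept fills the agent role."""
    rng = random.Random(seed)
    names = ["relation", "agent", "object", "chase", "dog", "cat"]
    v = {name: random_vector(n, rng) for name in names}
    v1 = superpose(conv(v["relation"], v["chase"]),
                   conv(v["agent"], v["dog"]),
                   conv(v["object"], v["cat"]))
    decoded = corr(v["agent"], v1)
    return {name: cosine(decoded, v[name]) for name in ("chase", "dog", "cat")}
```

Calling demo() returns the similarity of the decoded vector to chase, dog, and cat. The decoded vector is only a noisy copy of dog, which is why decoding is followed by a similarity comparison against the known concept vectors (a clean-up step) rather than used directly.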

Synchrony models of distributed representations

History

Many neuroscientists have suggested that synchronous activity is an important part of neural processing.

Hummel and Biederman (1992) used dynamic binding in a neural network for shape recognition.

Shastri and Ajjanagadde (1993) developed a synchrony model of inference.

Hummel and Holyoak (1997) proposed a synchrony model (LISA) of analogical mapping and retrieval.

Structures

A concept or object is represented by a (localist) unit.

To represent (chase (dog cat)), we need units for dog, cat, chase, chase-agent, and chase-object.

To bind them up, we use neural synchrony, i.e. units firing at the same rhythm:

• chase-agent is synchronized with dog
• chase-object is synchronized with cat
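Binding by synchrony can be pictured with a toy sketch in which each role-filler pair is given its own firing phase within a repeating cycle. Everything here is an illustrative assumption (the period, the number of cycles, the function names); models such as LISA are far richer, and the sketch covers only the two bindings listed above, not the chase unit itself.

```python
def encode(bindings, period=10, cycles=4):
    """Give each role-filler pair its own phase; both units spike at those times.

    Assumes fewer bindings than time steps in a period, so phases stay distinct.
    """
    trains = {}
    for phase, (role, filler) in enumerate(bindings):
        times = frozenset(phase + c * period for c in range(cycles))
        trains[role] = times
        trains[filler] = times
    return trains


def decode(trains, roles):
    """Read bindings back: a role is bound to whichever unit fires in synchrony."""
    return {role: [unit for unit, times in trains.items()
                   if unit != role and times == trains[role]]
            for role in roles}
```

Encoding [("chase-agent", "dog"), ("chase-object", "cat")] and then swapping the fillers yields the same set of active units but different synchrony patterns, which is how a dog chasing a cat is distinguished from a cat chasing a dog.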

Procedures

Encoding of propositions to produce synchronized networks.

Memory retrieval by temporal pattern matching.

Inference by spreading activation through the network.

Analogical mapping by detecting synchronized bindings.

Issues

Computational power: what are the comparative strengths and weaknesses of vector and synchrony models?

Psychological power: which kind of model provides a better account of experimental results, e.g. people's strengths and weaknesses in analogical mapping?

Neurological power: how important is synchrony in neural processing?

Are vector models and synchrony models mathematically equivalent?

Pulsed (spiking) neural networks

What are pulsed neural networks?

Human thought depends on the behavior of billions of neurons, nerve cells that signal each other. How does thinking emerge from this complex system?

With the exception of neural synchrony models, the artificial neural networks discussed so far ignore the firing pattern of neurons. They look only at the average rate of firing, represented by a real number, which is the activation of the artificial neuron.

But for real neurons it is important to consider not only how fast they fire, but the specific pattern with which they fire (spike, pulse). One neuron signals another by propagating an all-or-none electrical signal called an action potential.

Pulsed (spiking) neural networks are ones that take into account the pulsing behavior of individual neurons.

See W. Maass and C. M. Bishop (eds.), Pulsed Neural Networks, MIT Press, 1999.

W. Gerstner and W. M. Kistler, Spiking Neuron Models, 2002.

Representation: pulse vs. rate codes

Consider a neuron that can fire at most once every 10 ms or so. Then there are around 100 different per-second firing rates it could have, firing anywhere from 0 to 100 times each second. This is a rate code.

But if we consider patterns of firing, i.e. ways in which a neuron can either fire or not fire in a second, then there are 2 to the power 100 different possibilities, which is enormously more than a rate code can handle. This is a pulse code. Maass shows that for some computational tasks, a single pulsing neuron has more computational power than a large network of conventional artificial neurons.
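The arithmetic behind this comparison can be checked directly. The sketch assumes a one-second window divided into 10 ms time bins, each of which either contains a spike or does not; the function name is hypothetical.

```python
import math


def code_capacity(window_ms=1000, bin_ms=10):
    """Count distinguishable messages under a rate code and a pulse code."""
    bins = window_ms // bin_ms
    rate_messages = bins + 1   # spike counts 0, 1, ..., bins
    pulse_messages = 2 ** bins  # each bin independently spikes or not
    return rate_messages, pulse_messages


rate, pulse = code_capacity()
print("rate code: ", rate, "messages, about", round(math.log2(rate), 2), "bits")
print("pulse code:", pulse, "messages, exactly", int(math.log2(pulse)), "bits")
```

Measured in bits, the rate code carries fewer than 7 bits per second in this idealization, while the pulse code could in principle carry 100.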

The timing of spikes is established as a means of encoding information in the:

• electrosensory system of electric fish
• auditory system of echolocating bats
• visual system of flies.

Spiking patterns are useful both for a single neuron and for a population of neurons, in which different neurons are synchronized or correlated.

Spike trains

A spike train is a chain of pulses (action potentials) emitted by a single neuron. A spike usually lasts 1-2 ms. A spike train can be characterized by a set of times, t1 ... tn, at which the neuron spikes, or by a series of ones and zeros, where 1 means that the neuron spikes at a particular time and 0 means that it is not spiking.
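The two descriptions of a spike train are interchangeable, as this small sketch shows. It assumes time is divided into bins of equal width (1 ms by default) with at most one spike per bin; the function names are hypothetical.

```python
def times_to_bits(spike_times_ms, duration_ms, bin_ms=1):
    """Convert a list of spike times into a series of ones and zeros."""
    bits = [0] * (duration_ms // bin_ms)
    for t in spike_times_ms:
        bits[t // bin_ms] = 1
    return bits


def bits_to_times(bits, bin_ms=1):
    """Convert a series of ones and zeros back into spike times."""
    return [i * bin_ms for i, bit in enumerate(bits) if bit]
```

For example, a neuron that spikes at 2, 5, and 9 ms in a 10 ms window is described equally well by the list [2, 5, 9] or by ten binary values with ones at positions 2, 5, and 9.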

Whether a neuron N spikes at time t is a function of:

• the magnitude of the input to N, i.e. the electrical signals that N is receiving from other neurons
• the firing threshold, which is the minimal amount of input needed to fire
• refractoriness: after a neuron has fired, it takes a while before it can spike again.
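These three factors can be combined in a toy leaky integrate-and-fire neuron. This is a sketch under simplifying assumptions, not a model from these notes: time advances in discrete steps, the leak factor, threshold, and refractory length are illustrative values, and input arriving during the refractory period is simply ignored.

```python
def simulate(inputs, threshold=1.0, refractory=2, leak=0.9):
    """Return a spike train (ones and zeros) for a sequence of input magnitudes."""
    potential = 0.0
    wait = 0  # time steps left before the neuron can spike again
    spikes = []
    for x in inputs:
        if wait > 0:
            # Refractoriness: the neuron cannot fire and ignores its input.
            wait -= 1
            spikes.append(0)
            continue
        # Input magnitude: accumulate input, with some decay of earlier input.
        potential = leak * potential + x
        if potential >= threshold:
            # Firing threshold reached: spike, reset, and start recovering.
            spikes.append(1)
            potential = 0.0
            wait = refractory
        else:
            spikes.append(0)
    return spikes
```

Strong constant input makes the neuron fire as fast as its refractory period allows, weaker input makes it fire more slowly because the potential takes several steps to reach threshold, and sufficiently weak input never produces a spike because the leak outpaces the accumulation.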

A background signal (e.g. from the brain stem) may provide a phase pattern that allows different neurons to become synchronized or correlated with each other.

Cognitive modelling using pulsing neurons?

Is synchrony useful to model interconnections between different representations?

Do correlated neurons allow us to recognize patterns? Note that correlation is more flexible than synchrony: one neuron may be correlated with another by systematically firing just after it.
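The difference between synchrony and correlation can be made concrete by counting coincident spikes at different time lags. The sketch assumes spike trains given as equal-length lists of ones and zeros; the function names are hypothetical.

```python
def coincidences(a, b, lag=0):
    """Count spikes in b that occur `lag` time steps after a spike in a."""
    return sum(1 for t in range(len(a))
               if a[t] and 0 <= t + lag < len(b) and b[t + lag])


def best_lag(a, b, max_lag):
    """Return the lag in [-max_lag, max_lag] with the most coincidences."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda lag: coincidences(a, b, lag))
```

Two synchronized neurons have their coincidences concentrated at lag 0. A neuron that systematically fires just after another shows no coincidences at lag 0 but many at a small positive lag, so it is correlated without being synchronized.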

Do spiking and neural synchrony help us to understand consciousness? What is the nature of emergence?

Is learning more powerful and biologically realistic in spiking neurons? Hebbian learning has been modelled in spiking neurons.

Are pulsed networks still much too simple to model the chemical properties of real neurons, e.g. more than 80 different neurotransmitters? See P. Thagard, How Molecules Matter to Mental Computation.

Phil/Psych 446

Computational Epistemology Laboratory.

Paul Thagard