Background

Naive Discriminative Learning

Terminology

Before explaining Naive Discriminative Learning (NDL) in detail, we give a brief overview of the most important notions:

cue :

A cue is something that gives a hint about something else; that something else is called the outcome. Examples of cues in a text corpus are trigraphs or preceding words that point to a word or to the meaning of a word.

outcome :

The outcome is the result of an event. Examples are words, word meanings, or lexomes.

event :

An event connects cues with outcomes. In any event, one or more unordered cues are present together with one or more outcomes.

weights :

The weights represent the learned experience, i.e. the association strengths between all cues and outcomes of interest. Usually, some metadata is stored alongside the learned weights.
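To make these notions concrete, here is a minimal sketch in plain Python (purely illustrative, not pyndl's event file format): a single event with made-up trigraph cues for the word "band", together with a weight mapping from cue-outcome pairs to association strengths.

    # Illustrative only: one event with unordered cues and outcomes.
    event = {
        "cues": {"#ba", "ban", "and", "nd#"},  # trigraphs of "band" (toy example)
        "outcomes": {"band"},                  # the word itself as outcome
    }

    # Weights map cue-outcome pairs to association strengths; unseen pairs count as zero.
    weights = {}  # (cue, outcome) -> float
    for cue in event["cues"]:
        for outcome in event["outcomes"]:
            weights.setdefault((cue, outcome), 0.0)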

Rescorla-Wagner learning rule

In order to update the association strengths (weights) between cues and outcomes, we do the following for each event:

We calculate the activation (prediction) \(a_j\) for each outcome \(o_j\) by using all present cues \(C_\text{PRESENT}\):

\[a_j = \sum_{i \text{ for } c_i \in C_\text{PRESENT}} w_{ij}\]

After that, we calculate the update \(\Delta w_{ij}\) for every cue-outcome-combination:

\[\begin{split}\Delta w_{ij} = \begin{cases} 0 & \text{if cue } c_i \text{ is absent}\\ \alpha_i \beta_1 \cdot (\lambda - a_j ) & \text{if outcome } o_j \text{ and cue } c_i \text{ are present}\\ \alpha_i \beta_2 \cdot (0 - a_j ) & \text{if outcome } o_j \text{ is absent and cue } c_i \text{ is present} \end{cases}\end{split}\]

In the end, we update all weights according to \(w_{ij} = w_{ij} + \Delta w_{ij}\).
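As a minimal sketch (assuming the plain dictionary representation from above, one fixed \(\alpha\) for all cues, and fixed \(\beta_1\), \(\beta_2\), \(\lambda\); all names are made up), a single Rescorla-Wagner update for one event could look like this:

    def update_weights(weights, all_outcomes, event,
                       alpha=0.1, beta1=0.1, beta2=0.1, lambda_=1.0):
        """Apply one Rescorla-Wagner update for a single event (in place).

        `weights` maps (cue, outcome) pairs to floats; missing pairs count as zero.
        `all_outcomes` is the set of all outcomes of interest.
        """
        cues = event["cues"]                  # only present cues are updated
        present_outcomes = event["outcomes"]
        for outcome in all_outcomes:
            # activation a_j: sum over the weights of all present cues
            activation = sum(weights.get((cue, outcome), 0.0) for cue in cues)
            if outcome in present_outcomes:
                delta = alpha * beta1 * (lambda_ - activation)
            else:
                delta = alpha * beta2 * (0.0 - activation)
            for cue in cues:
                weights[(cue, outcome)] = weights.get((cue, outcome), 0.0) + delta

Cues that are absent from the event receive no update, which corresponds to the first case of the equation above.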

Note

If we set all the \(\alpha\)’s and \(\beta\)’s to a fixed value, we can replace them in the equation with a general learning parameter \(\eta = \alpha \cdot \beta\).

In matrix notation

We can rewrite the Rescorla-Wagner learning rule in matrix notation with a binary cue (input) vector \(\vec{c}\), which is one for each cue present in the event and zero for all other cues. Analogously, we define a binary outcome (output) vector \(\vec{o}\), which is one for each outcome present in the event and zero for each outcome that is not present. In order to stay close to the definition above, we define the activation vector as \(\vec{a} = W^T \vec{c}\), where \(W^T\) denotes the transpose of the weight matrix \(W\).
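For illustration, and assuming toy index mappings that assign every cue and every outcome a fixed vector position, the binary vectors and the activation could be built with numpy like this:

    import numpy as np

    cue_index = {"#ba": 0, "ban": 1, "and": 2, "nd#": 3}  # assumed toy mapping
    outcome_index = {"band": 0, "hand": 1}                # assumed toy mapping

    def binary_vector(present, index):
        """Column vector with ones at the positions of the present items."""
        vec = np.zeros((len(index), 1))
        for item in present:
            vec[index[item]] = 1.0
        return vec

    c = binary_vector({"#ba", "ban", "and", "nd#"}, cue_index)  # cue vector, n x 1
    o = binary_vector({"band"}, outcome_index)                  # outcome vector, m x 1

    W = np.zeros((len(cue_index), len(outcome_index)))  # weight matrix, n x m
    a = W.T @ c                                         # activation vector, m x 1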

For simplicity, let us assume we have a fixed learning rate \(\eta = \alpha \beta\); we will come back to this simplification at the end. We can rewrite the above rule as:

\[\begin{split}\Delta &= \eta \vec{c} \cdot (\lambda \vec{o} - \vec{a})^T \\ &= \eta \vec{c} \cdot (\lambda \vec{o} - W^T \cdot \vec{c})^T\end{split}\]

Let us first check the dimensionality of the matrices:

\(\Delta\) is the update of the weight matrix \(W\) and therefore needs to have the same dimensions \(n \times m\) where \(n\) denotes the number of cues (inputs) and \(m\) denotes the number of outcomes (outputs).

The cue vector \(\vec{c}\) can be seen as a matrix with dimensions \(n \times 1\) and the outcome vector can be seen as a matrix with dimensions \(m \times 1\). Let us tabulate the dimensions:

\(\lambda \vec{o}\) : \(m \times 1\)

\(W^T\) : \(m \times n\)

\(\vec{c}\) : \(n \times 1\)

\(W^T \cdot \vec{c}\) : \(m \times 1 = (m \times n) \cdot (n \times 1)\)

\(\lambda \vec{o} - W^T \cdot \vec{c}\) : \(m \times 1 = (m \times 1) - (m \times 1)\)

\((\lambda \vec{o} - W^T \cdot \vec{c})^T\) : \(1 \times m = (m \times 1)^T\)

\(\eta \vec{c} \cdot (\lambda \vec{o} - W^T \cdot \vec{c})^T\) : \(n \times m = (n \times 1) \cdot (1 \times m)\)

We therefore end up with the right dimensions. We can now simplify and rewrite the equation:

\[\begin{split}\Delta &= \eta \vec{c} \cdot ((\lambda \vec{o})^T - (W^T \cdot \vec{c})^T) \\ &= \eta \vec{c} \cdot (\lambda \vec{o}^T - \vec{c}^T \cdot W) \\ &= \eta \lambda \vec{c} \cdot \vec{o}^T - \eta \vec{c} \cdot \vec{c}^T \cdot W \\\end{split}\]

If we now look at the full update:

\[\begin{split}W_{t + 1} &= W_t + \Delta_t \\ &= W + \Delta \\ &= W + \eta \lambda \vec{c} \cdot \vec{o}^T - \eta \vec{c} \cdot \vec{c}^T \cdot W \\ &= \eta \lambda \vec{c} \cdot \vec{o}^T + W - \eta \vec{c} \cdot \vec{c}^T \cdot W \\ &= \eta \lambda \vec{c} \cdot \vec{o}^T + (\mathbb{1} - \eta \vec{c} \cdot \vec{c}^T) \cdot W \\\end{split}\]

We therefore see that the Rescorla-Wagner update is an affine transformation [1] in the weights \(W\) with an intercept of \(\eta \lambda \vec{c} \cdot \vec{o}^T\) and a slope of \((\mathbb{1} - \eta \vec{c} \cdot \vec{c}^T)\), where \(\mathbb{1}\) denotes the \(n \times n\) identity matrix.

In index notation we can write:

\[\begin{split}W^{t + 1} &= W^{t} + \eta \vec{c} \cdot (\lambda \vec{o}^T - \vec{c}^T \cdot W) \\ W^{t + 1}_{ij} &= W^{t}_{ij} + \eta c_i (\lambda o_j - \sum_k c_k W^{t}_{kj}) \\\end{split}\]
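The following numpy sketch (with arbitrary toy sizes, a random weight matrix, and assumed values for \(\eta\) and \(\lambda\)) checks numerically that the matrix form, the affine form, and the index form of the update all agree:

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 4, 2                   # toy numbers of cues and outcomes
    eta, lam = 0.01, 1.0          # assumed learning rate and lambda

    W = rng.normal(size=(n, m))                 # current weight matrix
    c = np.array([[1.0], [1.0], [0.0], [1.0]])  # binary cue vector, n x 1
    o = np.array([[1.0], [0.0]])                # binary outcome vector, m x 1

    # matrix form: Delta = eta * c (lambda o - W^T c)^T
    delta = eta * c @ (lam * o - W.T @ c).T

    # affine form: W_new = eta lambda c o^T + (I - eta c c^T) W, with I the identity
    W_affine = eta * lam * c @ o.T + (np.eye(n) - eta * c @ c.T) @ W

    # index form: W_ij += eta c_i (lambda o_j - sum_k c_k W_kj)
    W_index = W.copy()
    for i in range(n):
        for j in range(m):
            W_index[i, j] += eta * c[i, 0] * (lam * o[j, 0]
                                              - sum(c[k, 0] * W[k, j] for k in range(n)))

    assert np.allclose(W + delta, W_affine)
    assert np.allclose(W + delta, W_index)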

Note

Properties of the transpose [4], where \(A\) and \(B\) are matrices and \(\alpha\) is a scalar:

\[(A^T)^T = A\]
\[(A + B)^T = A^T + B^T\]
\[(\alpha A)^T = \alpha A^T\]
\[(A \cdot B)^T = B^T \cdot A^T\]
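These identities can be checked numerically, for example with numpy on random matrices of compatible shapes:

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.normal(size=(2, 3))
    B = rng.normal(size=(3, 4))
    C = rng.normal(size=(2, 3))
    alpha = 2.5

    assert np.allclose(A.T.T, A)                    # (A^T)^T = A
    assert np.allclose((A + C).T, A.T + C.T)        # (A + B)^T = A^T + B^T
    assert np.allclose((alpha * A).T, alpha * A.T)  # (alpha A)^T = alpha A^T
    assert np.allclose((A @ B).T, B.T @ A.T)        # (A B)^T = B^T A^T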

Other Learning Algorithms

The delta rule [2] is a gradient descent learning rule for updating the weights of the inputs to artificial neurons in a single-layer neural network. It is a special case of the more general backpropagation algorithm [3].

The delta rule can be expressed as:

\[\Delta_{ij} = \alpha (t_j - y_j) \partial_{h_j} g(h_j) x_i\]

In the terminology above we can identify the actual output with \(y_j = g(h_j) = g\left(\sum_k w_{kj} c_k\right)\), the cues with \(x_i = c_i\), and, under the assumption that \(o_j\) is binary (i.e. either zero or one), the targets with \(t_j = \lambda o_j\); the delta rule's learning rate \(\alpha\) corresponds to our fixed learning rate \(\eta = \alpha \beta\). Substituting these equalities results in:

\[\Delta_{ij} = \eta \left(\lambda o_j - g\left(\sum_k w_{kj} c_k\right)\right) \partial_{h_j} g(h_j) c_i\]

In order to end up with the Rescorla-Wagner learning rule, we need to set the neuron’s activation function \(g(h_j)\) to the identity function, i.e. \(g(h_j) = 1 \cdot h_j + 0 = h_j = \sum_k w_{kj} c_k\). The derivative with respect to \(h_j\) is \(\partial_{h_j} g(h_j) = 1\) for any input \(h_j\).

We now have:

\[\begin{split}\Delta_{ij} &= \eta (\lambda o_j - \sum_k w_{kj} c_k) \cdot 1 \cdot c_i \\ &= \eta (\lambda o_j - \sum_k w_{kj} c_k) c_i \\ &= \eta c_i (\lambda o_j - \sum_k w_{kj} c_k)\end{split}\]

Assuming the cue vector is binary, the factor \(c_i\) effectively keeps the updates of the present cues and sets the updates of all cues that are not present to zero. Additionally, we can rewrite the equation above in matrix notation (without indices):

\[\Delta = \eta \vec{c} \cdot (\lambda \vec{o} - W^T \cdot \vec{c})^T\]

This is exactly the form of the Rescorla-Wagner rule rewritten in matrix notation.
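To illustrate this correspondence, the following sketch (toy vectors and an assumed learning rate; all names are made up) implements the delta rule for an arbitrary activation function \(g\) and checks that the identity activation reproduces the Rescorla-Wagner matrix update:

    import numpy as np

    def delta_rule_update(W, c, t, eta, g, g_prime):
        """One delta-rule step: Delta_ij = eta * (t_j - g(h_j)) * g'(h_j) * c_i."""
        h = W.T @ c                                  # net input h_j, m x 1
        return eta * c @ ((t - g(h)) * g_prime(h)).T

    rng = np.random.default_rng(2)
    n, m = 4, 2
    eta, lam = 0.01, 1.0

    W = rng.normal(size=(n, m))
    c = np.array([[1.0], [0.0], [1.0], [1.0]])       # binary cue vector
    o = np.array([[0.0], [1.0]])                     # binary outcome vector

    identity = lambda h: h                           # g(h) = h
    one = lambda h: np.ones_like(h)                  # g'(h) = 1

    delta_dr = delta_rule_update(W, c, lam * o, eta, identity, one)
    delta_rw = eta * c @ (lam * o - W.T @ c).T       # Rescorla-Wagner matrix form

    assert np.allclose(delta_dr, delta_rw)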

Conclusion

In conclusion, the Rescorla-Wagner learning rule, which only allows for one \(\alpha\) and one \(\beta\) and therefore one learning rate \(\eta = \alpha \beta\), is exactly the same as a single-layer gradient descent method (the delta rule) in which the neuron’s activation function \(g(h_j)\) is set to the identity \(g(h_j) = h_j\), the inputs are the binary cues \(x_i = c_i\), and the targets are the scaled binary outcomes \(t_j = \lambda o_j\).

References

[1] https://en.wikipedia.org/wiki/Affine_transformation

[2] https://en.wikipedia.org/wiki/Delta_rule

[3] https://en.wikipedia.org/wiki/Backpropagation

[4] https://en.wikipedia.org/wiki/Transpose