Usage Examples
Lexical example
The lexical example illustrates the Rescorla-Wagner equations [1]. This example is taken from Baayen, Milin, Đurđević, Hendrix and Marelli [2].
Premises
Cues are associated with outcomes and both can be present or absent
Cues are segment (letter) unigrams, bigrams, …
Outcomes are meanings (word meanings, inflectional meanings, affixal meanings), …
\(\textrm{PRESENT}(X, t)\) denotes the presence of cue or outcome \(X\) at time \(t\)
\(\textrm{ABSENT}(X, t)\) denotes the absence of cue or outcome \(X\) at time \(t\)
The association strength \(V_{i}^{t+1}\) of cue \(C_{i}\) with outcome \(O\) at time \(t+1\) is defined as \(V_{i}^{t+1} = V_{i}^{t} + \Delta V_{i}^{t}\)
The change in association strength \(\Delta V_{i}^{t}\) is defined as in (1) with
\(\alpha_{i}\) being the salience of the cue \(i\)
\(\beta_{1}\) being the salience of the situation in which the outcome occurs
\(\beta_{2}\) being the salience of the situation in which the outcome does not occur
\(\lambda\) being the maximum level of associative strength possible
Default settings for the parameters are: \(\alpha_{i} = \alpha_{j} \: \forall i, j\), \(\beta_{1} = \beta_{2}\) and \(\lambda = 1\)
See comparison_of_algorithms for alternative formulations of the Rescorla-Wagner learning rule.
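The update rule above can be sketched in a few lines of plain Python, using the default parameter settings just given. This is only an illustration; the helper name `rescorla_wagner_update` and the weight bookkeeping are not part of pyndl:

```python
# Minimal sketch of one Rescorla-Wagner learning step with the default
# parameter settings given above; not part of pyndl itself.
ALPHA = 0.1   # cue salience (equal for all cues)
BETA1 = 0.1   # salience of the situation in which the outcome occurs
BETA2 = 0.1   # salience of the situation in which the outcome does not occur
LAMBDA = 1.0  # maximum level of associative strength

def rescorla_wagner_update(weights, present_cues, present_outcomes, all_outcomes):
    """Apply one learning event in place; weights maps (outcome, cue) to V."""
    for outcome in all_outcomes:
        # Total activation of this outcome from the cues present in the event.
        v_total = sum(weights.get((outcome, cue), 0.0) for cue in present_cues)
        if outcome in present_outcomes:
            delta = ALPHA * BETA1 * (LAMBDA - v_total)
        else:
            delta = ALPHA * BETA2 * (0.0 - v_total)
        for cue in present_cues:  # absent cues are left unchanged
            weights[(outcome, cue)] = weights.get((outcome, cue), 0.0) + delta

weights = {}
rescorla_wagner_update(weights, {'#h', 'ha', 'an', 'nd', 'd#'},
                       {'hand'}, {'hand', 'plural'})
# weights[('hand', '#h')] is now approximately 0.01 (= alpha * beta1 * lambda),
# while all weights towards the absent outcome 'plural' stay at 0.
```

After many such events the weights approach the association strengths that pyndl computes below.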
Data
Table 1

Word | Frequency | Lexical Meaning | Number
---|---|---|---
hand | 10 | HAND |
hands | 20 | HAND | PLURAL
land | 8 | LAND |
lands | 3 | LAND | PLURAL
and | 35 | AND |
sad | 18 | SAD |
as | 35 | AS |
lad | 102 | LAD |
lads | 54 | LAD | PLURAL
lass | 134 | LASS |
Table 1 shows some words, their frequencies of occurrence and their meanings as an artificial lexicon in the wide format. In the following, the letters (unigrams and bigrams) of the words constitute the cues, the meanings represent the outcomes.
Analyzing any data using pyndl requires them to be in the long format: a UTF-8 encoded, tab-delimited, gzipped text file with a header in the first line and two columns:
the first column contains an underscore delimited list of all cues
the second column contains an underscore delimited list of all outcomes
each line therefore represents one event, i.e. a pairing of cues and outcomes (occurring one time)
the events (lines) are ordered chronologically
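Such a long-format event file could be produced from Table 1 with a short script like the following sketch (the file name lexample_sketch.tab.gz is illustrative; the actual lexample.tab.gz used below ships with the documentation):

```python
import gzip
import random

# Table 1 as (word, frequency, outcomes).
lexicon = [
    ('hand', 10, ['hand']), ('hands', 20, ['hand', 'plural']),
    ('land', 8, ['land']), ('lands', 3, ['land', 'plural']),
    ('and', 35, ['and']), ('sad', 18, ['sad']), ('as', 35, ['as']),
    ('lad', 102, ['lad']), ('lads', 54, ['lad', 'plural']),
    ('lass', 134, ['lass']),
]

def bigram_cues(word):
    """Letter bigram cues over the word with '#' as boundary marker."""
    bounded = '#' + word + '#'
    return [bounded[i:i + 2] for i in range(len(bounded) - 1)]

# Expand each word according to its frequency and shuffle the events.
events = [(bigram_cues(word), outcomes)
          for word, freq, outcomes in lexicon
          for _ in range(freq)]
random.shuffle(events)

with gzip.open('lexample_sketch.tab.gz', 'wt', encoding='utf-8') as event_file:
    event_file.write('Cues\tOutcomes\n')
    for cues, outcomes in events:
        event_file.write('_'.join(cues) + '\t' + '_'.join(outcomes) + '\n')
```

The resulting file has 420 lines: one event per token (419 = sum of the frequencies) plus the header.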
As the data in Table 1 are artificial, we can generate such a file for this example by expanding Table 1 randomly according to the frequency of occurrence of each event. The resulting event file lexample.tab.gz consists of 420 lines (419 events, i.e. the sum of the frequencies, plus 1 header line) and looks like the following (nevertheless you are encouraged to take a closer look at this file using any text editor of your choice):
Cues | Outcomes
---|---
#h_ha_an_nd_ds_s# | hand_plural
#l_la_ad_d# | lad
#l_la_as_ss_s# | lass
pyndl.ndl module
We can now compute the strength of associations (weights or weight matrix) after the presentation of the 419 tokens of the 10 words using pyndl.ndl. pyndl.ndl provides the two functions pyndl.ndl.ndl and pyndl.ndl.dict_ndl to calculate the weights for all outcomes over all events. pyndl.ndl.ndl itself provides two estimation methods, openmp and threading. We have to specify the path of our event file lexample.tab.gz and for this example set \(\alpha = 0.1\), \(\beta_{1} = 0.1\), \(\beta_{2} = 0.1\), leaving \(\lambda = 1.0\) at its default value. You can use pyndl directly in a Python3 shell or write an executable script; this is up to you. For educational purposes we use a Python3 shell in this example.
pyndl.ndl.ndl
pyndl.ndl.ndl is a parallel Python implementation using numpy, multithreading, and a binary format which is created automatically. It allows you to choose between the two methods openmp and threading, with the former using openMP and therefore being expected to be faster when analyzing larger data. Unfortunately, openmp is only available on Linux right now, therefore all examples use threading. Besides, you can set three technical arguments which we will not change here:
n_jobs (int): the number of threads in which the job should be executed (default=2)
sequence (int): the length of the sublists generated from all outcomes (default=10)
remove_duplicates (logical): make cues and outcomes unique (default=None, which raises an error if the same cue is present multiple times in the same event)
Let’s start:
>>> from pyndl import ndl
>>> weights = ndl.ndl(events='docs/data/lexample.tab.gz', alpha=0.1,
... betas=(0.1, 0.1), method='threading')
>>> weights
<xarray.DataArray (outcomes: 8, cues: 15)>
...
weights is an xarray.DataArray with the dimensions (len(outcomes), len(cues)). Our unique, chronologically ordered outcomes are ‘hand’, ‘plural’, ‘lass’, ‘lad’, ‘land’, ‘as’, ‘sad’, ‘and’. Our unique, chronologically ordered cues are ‘#h’, ‘ha’, ‘an’, ‘nd’, ‘ds’, ‘s#’, ‘#l’, ‘la’, ‘as’, ‘ss’, ‘ad’, ‘d#’, ‘#a’, ‘#s’, ‘sa’. Therefore all three indexing methods
>>> weights[1, 5]
<xarray.DataArray ()>
...
>>> weights.loc[{'outcomes': 'plural', 'cues': 's#'}]
<xarray.DataArray ()>
array(0.076988...)
Coordinates:
outcomes <U6 'plural'
cues <U2 's#'
...
>>> weights.loc['plural'].loc['s#']
<xarray.DataArray ()>
array(0.076988...)
Coordinates:
outcomes <U6 'plural'
cues <U2 's#'
...
return the weight of the cue ‘s#’ (the unigram ‘s’ being word-final) for the outcome ‘plural’ (remember that counting in Python starts at 0) as ca. 0.077 and hence indicate that ‘s#’ is a marker for plurality.
pyndl.ndl.ndl also allows you to continue learning from a previous weight matrix by specifying the weights argument:
>>> weights2 = ndl.ndl(events='docs/data/lexample.tab.gz', alpha=0.1,
... betas=(0.1, 0.1), method='threading', weights=weights)
>>> weights2
<xarray.DataArray (outcomes: 8, cues: 15)>
array([[ 0.24...
...
...]])
Coordinates:
* outcomes (outcomes) <U6 'hand' 'plural'...
* cues (cues) <U2 '#h' 'ha' 'an' 'nd'...
Attributes:...
date:...
event_path:...
...
As you may have noticed already, pyndl.ndl.ndl provides you with meta data, organized in a dict, that was collected during your calculations. Each entry of each list of this meta data references one specific moment of your calculations:
>>> print('Attributes: ' + str(weights2.attrs))
Attributes: ...
pyndl.ndl.dict_ndl
pyndl.ndl.dict_ndl is a pure Python implementation; however, it differs from pyndl.ndl.ndl in the following respects:
there are only two technical arguments: remove_duplicates (logical) and make_data_array (logical)
by default, no xarray.DataArray is returned but a dict of dicts
however, you can still get an xarray.DataArray by setting make_data_array=True
the case \(\alpha_{i} \neq \alpha_{j}\) can be handled by specifying a dict consisting of the cues as keys and the corresponding \(\alpha\)'s as values
Therefore
>>> weights = ndl.dict_ndl(events='docs/data/lexample.tab.gz',
... alphas=0.1, betas=(0.1, 0.1))
>>> weights['plural']['s#'] # doctest: +ELLIPSIS
0.076988227...
yields approximately the same results as before; however, you can now specify different \(\alpha\)'s per cue and, as in pyndl.ndl.ndl, continue learning, or do both:
>>> alphas_cues = dict(zip(['#h', 'ha', 'an', 'nd', 'ds', 's#', '#l', 'la', 'as', 'ss', 'ad', 'd#', '#a', '#s', 'sa'],
... [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3, 0.1, 0.2, 0.1, 0.2, 0.1, 0.3, 0.1, 0.2]))
>>> weights = ndl.dict_ndl(events='docs/data/lexample.tab.gz',
... alphas=alphas_cues, betas=(0.1, 0.1))
>>> weights2 = ndl.dict_ndl(events='docs/data/lexample.tab.gz',
... alphas=alphas_cues, betas=(0.1, 0.1),
... weights=weights)
If you prefer to get an xarray.DataArray returned, you can set the flag make_data_array=True:
>>> weights = ndl.dict_ndl(events='docs/data/lexample.tab.gz',
... alphas=alphas_cues, betas=(0.1, 0.1),
... make_data_array=True)
>>> weights
<xarray.DataArray (outcomes: 8, cues: 15)>
...
A minimal workflow example
As you should have a basic understanding of pyndl.ndl
by now, the
following example will show you how to:
generate an event file based on a raw corpus file
count cues and outcomes
filter the events
learn the weights as already shown in the lexical learning example
save and load a weight matrix (netCDF format)
load a weight matrix (netCDF format) into R for further analyses
Generate an event file based on a raw corpus file
Suppose you have a raw utf-8 encoded corpus file (by the way,
pyndl.corpus
allows you to generate such a corpus file from a bunch of
gunzipped xml subtitle files filled with words, which we will not cover here).
For example, take a look at lcorpus.txt. To analyse the data, you need to convert this file into an event file similar to lexample.tab.gz from our lexical learning example, as currently there is only one word per line and there are no columns for cues and outcomes:
hand
foot
hands
The pyndl.preprocess
module (besides other things) allows you to
generate an event file based on a raw corpus file and filter it:
>>> import pyndl
>>> from pyndl import preprocess
>>> preprocess.create_event_file(corpus_file='docs/data/lcorpus.txt',
... event_file='docs/data/levent.tab.gz',
... allowed_symbols='a-zA-Z',
... context_structure='document',
... event_structure='consecutive_words',
... event_options=(1, ),
... cue_structure='bigrams_to_word')
The function pyndl.preprocess.create_event_file has several arguments which you might have to change to suit your data, so you are strongly recommended to read its documentation. We set context_structure='document' as in this case the context is the whole document, event_structure='consecutive_words' as these are our events, event_options=(1, ) as we define an event to be one word, and cue_structure='bigrams_to_word' to set the cues to be bigrams.
There are also several technical arguments you can specify, which we will not
change here. Our generated event file levent.tab.gz
now looks
(uncompressed) like this:
Cues | Outcomes
---|---
an_#h_ha_d#_nd | hand
ot_fo_oo_#f_t# | foot
ds_s#_an_#h_ha_nd | hands
Count cues and outcomes
We can now count the cues and outcomes in our event file using the
pyndl.count
module and also generate id maps for cues and outcomes:
>>> from pyndl import count
>>> freq, cue_freq_map, outcome_freq_map = count.cues_outcomes(event_file_name='docs/data/levent.tab.gz')
>>> freq
12
>>> cue_freq_map
Counter({...})
>>> outcome_freq_map
Counter({...})
>>> cues = list(cue_freq_map.keys())
>>> cues.sort()
>>> cue_id_map = {cue: ii for ii, cue in enumerate(cues)}
>>> cue_id_map
{...}
>>> outcomes = list(outcome_freq_map.keys())
>>> outcomes.sort()
>>> outcome_id_map = {outcome: nn for nn, outcome in enumerate(outcomes)}
>>> outcome_id_map
{...}
Filter the events
As we do not want to include the outcomes ‘foot’ and ‘feet’ in this example, nor their cues ‘#f’, ‘fo’, ‘oo’, ‘ot’, ‘t#’, ‘fe’, ‘ee’, ‘et’, we use the pyndl.preprocess module again, filtering our event file and updating the id maps for cues and outcomes:
>>> preprocess.filter_event_file(input_event_file='docs/data/levent.tab.gz',
... output_event_file='docs/data/levent.tab.gz.filtered',
... remove_cues=('#f', 'fo', 'oo', 'ot', 't#', 'fe', 'ee', 'et'),
... remove_outcomes=('foot', 'feet'))
>>> freq, cue_freq_map, outcome_freq_map = count.cues_outcomes(event_file_name='docs/data/levent.tab.gz.filtered')
>>> cues = list(cue_freq_map.keys())
>>> cues.sort()
>>> cue_id_map = {cue: ii for ii, cue in enumerate(cues)}
>>> cue_id_map
{...}
>>> outcomes = list(outcome_freq_map.keys())
>>> outcomes.sort()
>>> outcome_id_map = {outcome: nn for nn, outcome in enumerate(outcomes)}
>>> outcome_id_map
{...}
Alternatively, using pyndl.preprocess.filter_event_file you can also specify which cues and outcomes to keep (keep_cues and keep_outcomes) or remap cues and outcomes (cue_map and outcomes_map). Besides, there are also some technical arguments you can specify, which we will not discuss here. Last but not least, pyndl.preprocess provides some other very useful preprocessing functions which we did not use here, so make sure to go through its documentation.
Learn the weights
Computing the strength of associations for the data is now easy, using for example pyndl.ndl.ndl from the pyndl.ndl module as in the lexical learning example:
>>> from pyndl import ndl
>>> weights = ndl.ndl(events='docs/data/levent.tab.gz.filtered',
... alpha=0.1, betas=(0.1, 0.1), method="threading")
Save and load a weight matrix
Saving and loading a weight matrix is straightforward using the netCDF format [3]:
>>> import xarray
>>> weights.to_netcdf('docs/data/weights.nc')
>>> with xarray.open_dataarray('docs/data/weights.nc') as weights_read:
... weights_read
In order to keep everything clean we might want to remove all the files we created in this tutorial:
>>> import os
>>> os.remove('docs/data/levent.tab.gz')
>>> os.remove('docs/data/levent.tab.gz.filtered')
Widrow-Hoff (WH) learning
pyndl now contains a Widrow-Hoff learning module called wh, which uses the same event files and nearly the same function parameters as the ndl.ndl function. The main function to call is wh.wh. Compared to ndl.ndl, the wh.wh function adds two look-up tables, one for cues and one for outcomes, to its keyword arguments. Each of these look-up tables maps each cue and/or outcome in your event file to a vector. A look-up table has to be an instance of xarray.DataArray and is passed with the keyword argument cue_vectors or outcome_vectors. The second dimension of the look-up table needs to be named cue_vector_dimensions or outcome_vector_dimensions, respectively. For more information have a look at the function's doc string.
WH example
This example shows that WH learning mimics RW learning if the cue and outcome vectors are unit vectors. Note that WH learning, in contrast to RW learning, has only one learning parameter, which is called eta. The assumption is that beta1 equals beta2.
>>> from pyndl import wh, ndl
>>> import xarray as xr
>>> import numpy as np
>>> events = 'docs/data/event_file_wh.tab.gz'
>>> eta = 0.01 # learning rate
>>> cue_vectors = xr.DataArray(np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float),
... dims=('cues', 'cue_vector_dimensions'),
... coords={'cues': ['a', 'b', 'c'], 'cue_vector_dimensions': ['dim1', 'dim2', 'dim3']})
>>> outcome_vectors = xr.DataArray(np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float),
... dims=('outcomes', 'outcome_vector_dimensions'),
... coords={'outcomes': ['A', 'B', 'C', 'D'],
... 'outcome_vector_dimensions': ['dim1', 'dim2', 'dim3', 'dim4']})
>>> weights_wh = wh.wh(events, eta, cue_vectors=cue_vectors, outcome_vectors=outcome_vectors, method='numpy')
>>> weights_ndl = ndl.ndl(events, alpha=1.0, betas=(eta, eta), method='threading')
The weights returned by wh.wh have the dimensions outcome_vector_dimensions and cue_vector_dimensions. Therefore, a direct comparison is not possible. But as the vectors used are unit vectors, the first cue vector dimension ‘dim1’ corresponds to the first cue ‘a’, the second vector dimension corresponds to the second cue, etc. If the dimensions are ordered by their names, the equality becomes apparent.
>>> weights_wh = weights_wh.loc[{'outcome_vector_dimensions': ['dim1', 'dim2', 'dim3', 'dim4'],
... 'cue_vector_dimensions': ['dim1', 'dim2', 'dim3']}]
>>> weights_ndl = weights_ndl.loc[{'outcomes': ['A', 'B', 'C', 'D'], 'cues': ['a', 'b', 'c']}]
>>> print(weights_wh)
<xarray.DataArray (outcome_vector_dimensions: 4, cue_vector_dimensions: 3)>
array([[0.06706..., 0. , 0. ],
[0. , 0.03940..., 0. ],
[0.0094... , 0. , 0.03940...],
[0.01 , 0. , 0. ]])
Coordinates:
* outcome_vector_dimensions (outcome_vector_dimensions) <U4 'dim1' ... 'dim4'
* cue_vector_dimensions (cue_vector_dimensions) <U4 'dim1' 'dim2' 'dim3'
outcomes <U1 'A'
cues <U1 'a'
Attributes: (12/15)
...
>>> print(weights_ndl)
<xarray.DataArray (outcomes: 4, cues: 3)>
array([[0.06706..., 0. , 0. ],
[0. , 0.03940..., 0. ],
[0.0094... , 0. , 0.03940...],
[0.01 , 0. , 0. ]])
Coordinates:
* outcomes (outcomes) <U1 'A' 'B' 'C' 'D'
* cues (cues) <U1 'a' 'b' 'c'
Attributes: (12/17)
...
Furthermore, it is possible to use only the cue_vectors or only the outcome_vectors. This functionality is Linux-only at the moment.
>>> weights_wh_cv_only = wh.wh(events, eta, cue_vectors=cue_vectors, method='openmp')
>>> weights_wh_ov_only = wh.wh(events, eta, outcome_vectors=outcome_vectors, method='openmp')
For this example, the content of the resulting weights matches the content of weights_wh and weights_ndl.
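The equivalence shown above rests on the Widrow-Hoff (delta) rule itself, which for an event with cue vector \(\mathbf{c}\) and outcome vector \(\mathbf{o}\) updates the weight matrix as \(W \leftarrow W + \eta \, (\mathbf{o} - W\mathbf{c}) \, \mathbf{c}^{\top}\). A minimal numpy sketch of one such update (an illustration, not pyndl's implementation):

```python
import numpy as np

eta = 0.01  # the single WH learning rate

# One event: cue 'a' (first unit cue vector), outcome 'A' (first unit
# outcome vector), as in the example above.
W = np.zeros((4, 3))                 # outcome dimensions x cue dimensions
c = np.array([1.0, 0.0, 0.0])        # unit cue vector
o = np.array([1.0, 0.0, 0.0, 0.0])   # unit outcome vector

# Widrow-Hoff update: prediction error (o - W c) times the cue vector.
W += eta * np.outer(o - W @ c, c)

# With unit vectors only the ('A', 'a') cell changes, by eta * (1 - 0),
# which is exactly the Rescorla-Wagner update with alpha = 1 and
# beta1 = beta2 = eta.
```

Iterating this update over the events in the event file reproduces the weight matrices printed above.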
Load a weight matrix into R [4]
We can load a matrix saved in netCDF format into R:
> #install.packages("ncdf4") # uncomment to install
> library(ncdf4)
> weights_nc <- nc_open(filename = "docs/data/weights.nc")
> weights_read <- t(as.matrix(ncvar_get(nc = weights_nc, varid = "__xarray_dataarray_variable__")))
> rownames(weights_read) <- ncvar_get(nc = weights_nc, varid = "outcomes")
> colnames(weights_read) <- ncvar_get(nc = weights_nc, varid = "cues")
> nc_close(nc = weights_nc)
> rm(weights_nc)
[1] Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and non-reinforcement. Classical conditioning II: Current research and theory, 2, 64-99.
[2] Baayen, R. H., Milin, P., Đurđević, D. F., Hendrix, P., & Marelli, M. (2011). An amorphous model for morphological processing in visual comprehension based on naive discriminative learning. Psychological Review, 118(3), 438.
[3] Unidata (2012). NetCDF. doi:10.5065/D6H70CW6. Retrieved from http://doi.org/10.5065/D6RN35XM
[4] R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.