Pyndl - Naive Discriminative Learning in Python

pyndl implements Naïve Discriminative Learning (NDL) in Python. NDL is an incremental learning algorithm grounded in the principles of discrimination learning and motivated by animal and human learning research. Lately, NDL has become a popular tool in language research for analysing large corpora and vocabularies, for example corpora with 750,000 spoken word tokens and a vocabulary size of 52,402 word types. In contrast to previous implementations, pyndl allows for a broader range of analyses, including non-English languages, adds further learning rules and provides better maintainability while keeping the same fast processing speed. As of today, it supports multiple research groups in their work and has led to several scientific publications.

Quickstart

Installation

First, you need to install pyndl. The easiest way to do this is using pip:

pip install --user pyndl

Warning

If you are using an operating system other than Linux, this process can be more difficult. Check out Installation for more detailed installation instructions. Currently, we can only ensure the expected behaviour on Linux systems; be aware that some functionality may not work on other operating systems.

Naive Discriminative Learning

Naive Discriminative Learning, henceforth NDL, is an incremental learning algorithm based on the learning rule of Rescorla and Wagner 1, which describes the learning of direct associations between cues and outcomes. Learning is structured in events, where each event consists of a set of cues that hint at a set of outcomes. Outcomes can be seen as the result of an event, and each outcome can be either present or absent. NDL is naive in the sense that cue-outcome associations are estimated separately for each outcome.

The Rescorla-Wagner learning rule describes how the association strength \(V_{i}^{t}\) between a cue \(i\) and an outcome changes over time \(t\). Time is measured here in learning events: after each event the association strength is updated as

\[V_{i}^{t+1} = V_{i}^{t} + \Delta V_{i}^{t}\]

Thereby, the change in association strength \(\Delta V_{i}^{t}\) is defined as

\[\begin{split}\Delta V_{i}^{t} = \begin{cases} \displaystyle 0 & \: \textrm{if ABSENT}(C_{i}, t)\\ \alpha_{i}\beta_{1} \: (\lambda - \sum_{\textrm{PRESENT}(C_{j}, t)} \: V_{j}) & \: \textrm{if PRESENT}(C_{i}, t) \: \& \: \textrm{PRESENT}(O, t)\\ \alpha_{i}\beta_{2} \: (0 - \sum_{\textrm{PRESENT}(C_{j}, t)} \: V_{j}) & \: \textrm{if PRESENT}(C_{i}, t) \: \& \: \textrm{ABSENT}(O, t) \end{cases}\end{split}\]

with

  • \(\alpha_{i}\) being the salience of the cue \(i\)

  • \(\beta_{1}\) being the salience of the situation in which the outcome occurs

  • \(\beta_{2}\) being the salience of the situation in which the outcome does not occur

  • \(\lambda\) being the maximum level of associative strength possible

Note

Usually, the parameters are set to \(\alpha_{i} = \alpha_{j} \: \forall i, j\), \(\beta_{1} = \beta_{2}\) and \(\lambda = 1\).
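
To make the update rule concrete, here is a minimal sketch of a single learning event in plain Python. It is purely illustrative and not part of the pyndl API; the toy cues and outcomes are the boundary-marked bigrams and meaning of the word "hand" used later in this tutorial.

>>> alpha, beta1, beta2, lam = 0.1, 0.1, 0.1, 1.0
>>> cues = ['#h', 'ha', 'an', 'nd', 'd#']            # cues present in the event
>>> outcomes = {'hand'}                              # outcomes present in the event
>>> weights = {(o, c): 0.0 for o in ('hand', 'plural') for c in cues}
>>> for outcome in ('hand', 'plural'):
...     # summed association strength of all present cues for this outcome
...     v_total = sum(weights[(outcome, cue)] for cue in cues)
...     if outcome in outcomes:
...         delta = alpha * beta1 * (lam - v_total)
...     else:
...         delta = alpha * beta2 * (0.0 - v_total)
...     for cue in cues:                             # absent cues stay unchanged
...         weights[(outcome, cue)] += delta
>>> round(weights[('hand', 'ha')], 4)                # strengthened towards lambda
0.01
>>> round(weights[('plural', 'ha')], 4)              # absent outcome: no change yet
0.0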

Usage

Analysing data with pyndl involves three steps:

  1. The data has to be preprocessed into the correct format

  2. One of the learning methods of pyndl is used to learn the desired associations

  3. The learned associations (commonly also called weights) can be stored or directly analysed further.

In the following, a usage example of pyndl is provided, in which the first two of the three steps are described for learning the associations between bigrams and meanings. The first section of this example focuses on the correct preparation of the data with inbuilt methods. It is worth noting, however, that the learning algorithm itself does not require the data to be preprocessed by pyndl, nor is it limited to it. The pyndl.preprocess module should rather be seen as a collection of established and commonly used preprocessing methods within the context of NDL; custom preprocessing can be used as long as the created event files follow the structure outlined in the next section. The second section describes how the associations can be learned using pyndl, while the last section describes how they can be exported and, for instance, loaded in R for further investigation.

Data Preparation

Analysing data with pyndl requires the data to be in the long format: a utf-8 encoded, tab-delimited, gzipped text file with a header in the first line and two columns, where

  1. the first column contains an underscore delimited list of all cues

  2. the second column contains an underscore delimited list of all outcomes

  3. each line therefore represents a single occurrence of an event, i.e. one pairing of cues and outcomes

  4. the events (lines) are ordered chronologically

The algorithm itself is agnostic to the actual domain as long as the data is tokenized as Unicode character strings. While pyndl provides some basic preprocessing for grapheme tokenization (see for instance the following examples), the tokenization of ideograms, pictograms, logograms, and speech has to be implemented manually. However, generic implementations are welcome as a contribution.
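
If you implement your own preprocessing, the only requirement is that the resulting event file has exactly this structure. As a minimal sketch (using only the Python standard library; the file name and the two events are made up for illustration), such a file could be written like this:

>>> import gzip
>>> with gzip.open('docs/data/custom_events.tab.gz', 'wt', encoding='utf-8') as event_file:
...     # header line, then one tab-separated pair of cues and outcomes per event
...     event_file.write('cues\toutcomes\n')
...     event_file.write('#h_ha_an_nd_d#\thand\n')             # "hand"  -> HAND
...     event_file.write('#h_ha_an_nd_ds_s#\thand_plural\n')   # "hands" -> HAND, PLURAL

A file created this way can be passed directly to the learning functions described below.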

Creating Grapheme Clusters From Wide Format Data

Often the data which should be analysed are not in the right format to be processed with pyndl. To illustrate how to get the data into the right format, we use data from Baayen, Milin, Đurđević, Hendrix & Marelli 2 as an example:

Table 1

Word    Frequency   Lexical Meaning   Number
hand    10          HAND
hands   20          HAND              PLURAL
land    8           LAND
lands   3           LAND              PLURAL
and     35          AND
sad     18          SAD
as      35          AS
lad     102         LAD
lads    54          LAD               PLURAL
lass    134         LASS

Table 1 shows some words, their frequencies of occurrence and their meanings as an artificial lexicon in the wide format. In the following, the letters (unigrams and bigrams) of the words constitute the cues, whereas the meanings represent the outcomes.

As the data in Table 1 are artificial, we can generate an event file for this example by expanding Table 1 randomly according to the frequency of occurrence of each event. The resulting event file lexample.tab.gz consists of 420 lines (419 events, the sum of the frequencies, plus one header line) and looks like the following (nevertheless, you are encouraged to take a closer look at this file using any text editor of your choice):

Cues                 Outcomes
#h_ha_an_nd_ds_s#    hand_plural
#l_la_ad_d#          lad
#l_la_as_ss_s#       lass
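
The expansion itself is easy to script. The following sketch shows one way to do it with the standard library; it assumes boundary-marked bigram cues as in the file above, writes to a made-up file name my_lexample.tab.gz, and is not the exact code that produced lexample.tab.gz.

>>> import gzip
>>> import random
>>> lexicon = [('hand', 10, 'hand'), ('hands', 20, 'hand_plural'),
...            ('land', 8, 'land'), ('lands', 3, 'land_plural'),
...            ('and', 35, 'and'), ('sad', 18, 'sad'), ('as', 35, 'as'),
...            ('lad', 102, 'lad'), ('lads', 54, 'lad_plural'),
...            ('lass', 134, 'lass')]
>>> def bigram_cues(word):
...     """Boundary-marked bigrams, e.g. 'lad' -> '#l_la_ad_d#'."""
...     marked = '#' + word + '#'
...     return '_'.join(marked[i:i + 2] for i in range(len(marked) - 1))
>>> events = [(bigram_cues(word), outcomes)
...           for word, frequency, outcomes in lexicon
...           for _ in range(frequency)]      # repeat each word by its frequency
>>> random.shuffle(events)                    # 419 events in random order
>>> with gzip.open('docs/data/my_lexample.tab.gz', 'wt', encoding='utf-8') as event_file:
...     event_file.write('cues\toutcomes\n')
...     for cues, outcomes in events:
...         event_file.write(cues + '\t' + outcomes + '\n')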

Creating Grapheme Clusters From Corpus Data

Often the corpus which should be analysed is only a raw utf-8 encoded text file that contains large amounts of text. From here on we will refer to such a file as a corpus file. A corpus file can contain several documents, with a ---end.of.document--- or ---END.OF.DOCUMENT--- string marking where one document ends and the next one starts.

The pyndl.preprocess module (besides other things) provides the functionality to directly generate an event file based on a raw corpus file and filter it:

>>> from pyndl import preprocess
>>> preprocess.create_event_file(corpus_file='docs/data/lcorpus.txt',
...                              event_file='docs/data/levent.tab.gz',
...                              allowed_symbols='a-zA-Z',
...                              context_structure='document',
...                              event_structure='consecutive_words',
...                              event_options=(1, ),
...                              cue_structure='bigrams_to_word')

Here we use the example corpus lcorpus.txt to produce an event file levent.tab.gz which (uncompressed) looks like this:

Cues                 Outcomes
an_#h_ha_d#_nd       hand
ot_fo_oo_#f_t#       foot
ds_s#_an_#h_ha_nd    hands
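
If you want to double-check the result, the generated event file can be inspected without pyndl, for example with the standard library (the exact lines depend on the contents of lcorpus.txt):

>>> import gzip
>>> with gzip.open('docs/data/levent.tab.gz', 'rt', encoding='utf-8') as event_file:
...     print(event_file.readline().rstrip())   # header line
...     print(event_file.readline().rstrip())   # first event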

Note

pyndl.corpus allows you to generate such a corpus file from a bunch of gunzipped xml subtitle files filled with words.

Learn the associations

The strength of the associations for the data can now easily be computed using the pyndl.ndl.ndl function from the pyndl.ndl module:

>>> from pyndl import ndl
>>> weights = ndl.ndl(events='docs/data/levent.tab.gz',
...                   alpha=0.1, betas=(0.1, 0.1), method="threading")
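
The returned weights are an xarray.DataArray; judging from the netCDF variables used in the R example below, its dimensions are named outcomes and cues, so single associations can be looked up with .sel. The cue 'ha' and the outcome 'hand' are just examples from the data above:

>>> weights.dims                               # dimension names, expected to be ('outcomes', 'cues')
>>> weights.sel(outcomes='hand', cues='ha')    # look up a single association strength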

Save and load a weight matrix

To save time in the future, we recommend saving the weights. For compatibility reasons we recommend saving the weight matrix in the netCDF format 3:

>>> weights.to_netcdf('docs/data/weights.nc')  

Now, the saved weights can later be reused or be analysed in Python or R. In Python the weights can simply be loaded with the xarray module:

>>> import xarray  
>>> with xarray.open_dataarray('docs/data/weights.nc') as weights_read:  
...     weights_read

In R you need the ncdf4 package to load a matrix saved in the netCDF format:

> #install.packages("ncdf4") # uncomment to install
> library(ncdf4)
> weights_nc <- nc_open(filename = "docs/data/weights.nc")
> weights_read <- t(as.matrix(ncvar_get(nc = weights_nc, varid = "__xarray_dataarray_variable__")))
> rownames(weights_read) <- ncvar_get(nc = weights_nc, varid = "outcomes")
> colnames(weights_read) <- ncvar_get(nc = weights_nc, varid = "cues")
> nc_close(nc = weights_nc)
> rm(weights_nc)

Clean up

In order to keep everything clean we might want to remove all the files we created in this tutorial:

>>> import os
>>> os.remove('docs/data/levent.tab.gz')

1

Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and non-reinforcement. Classical conditioning II: Current research and theory, 2, 64-99.

2

Baayen, R. H., Milin, P., Đurđević, D. F., Hendrix, P., & Marelli, M. (2011). An amorphous model for morphological processing in visual comprehension based on naive discriminative learning. Psychological review, 118(3), 438.

3

Unidata (2012). NetCDF. doi:10.5065/D6H70CW6. Retrieved from http://doi.org/10.5065/D6RN35XM