Quick Intro to NLP with `R`

SciencesPo Intro To Programming 2024

Florian Oswald

11 October, 2024

Intro

In this lecture we will introduce the most basic language models with R.

This is based on a nice introduction by Valerio Gherardi, author of the kgrams package for R.

Natural Language Processing (NLP) basics

We are all familiar with large language models (LLMs for short) by now.
ChatGPT (short for Chat Generative Pre-trained Transformer) is a proprietary solution, but there are by now many open source alternatives.
We will not be able to go into the details of those, but we will look at some of their simpler cousins.

\(k\)-gram language models

Let \(w_i\) be the \(i\)-th word in a sentence, i.e. \(s = w_1 w_2 \dots w_n\).
An NLP model gives the probability of observing this sentence, i.e. \(\Pr(s)\).
As usual, we can sample from \(\Pr(s)\) to obtain random sentences.
In general all the \(s\) at our disposal come from a certain corpus, i.e. a collection of sentences/words.

Continuation Probabilities

Define a sequence of words as the context: \(c = w_1 w_2 \dots w_m\).
We can predict the next word in the sequence by computing \(\Pr(w|c)\), the probability that the next word is \(w\), given context \(c\).
That is in a nutshell what ChatGPT computes for you.
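Continuation probabilities and sentence probabilities are two views of the same object. By the chain rule of probability (a standard identity, not specific to any package), the probability of a sentence factorizes into continuation probabilities:

\[\Pr(s) = \Pr(w_1 w_2 \cdots w_n) = \prod_{i=1}^{n} \Pr(w_i | w_1 w_2 \cdots w_{i-1})\]

For a three-word sentence this reads \(\Pr(w_1 w_2 w_3) = \Pr(w_1)\,\Pr(w_2|w_1)\,\Pr(w_3|w_1 w_2)\).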

Dictionaries

The list of known words in an NLP model is called the dictionary.
This also tells us how to deal with unknown words - those are mapped to the UNK (unknown word token).
It also tells us how to deal with the end of sentences, by introducing an EOS (end of sentence) token.
\(k\)-gram models (below) also include a BOS (beginning of sentence) token. Each sentence is left-padded with \(N-1\) BOS tokens (\(N\) being the order of the model). This helps to predict the first word of the next sentence from the preceding \(N-1\) tokens.
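As a rough illustration of these three special tokens, here is a minimal base-R sketch. The helper `pad_sentence`, the toy dictionary and the token spellings `<BOS>`, `<EOS>`, `<UNK>` are assumptions made for this example; a package may represent them differently.

```r
# Toy dictionary; real dictionaries are built from the training corpus.
dictionary <- c("i", "have", "a", "letter")

# Hypothetical helper: pad one tokenized sentence for a model of order N.
pad_sentence <- function(tokens, N, dict = dictionary) {
  # Replace out-of-dictionary words with the unknown-word token
  tokens[!tokens %in% dict] <- "<UNK>"
  # Left-pad with N - 1 BOS tokens and close with a single EOS token
  c(rep("<BOS>", N - 1), tokens, "<EOS>")
}

padded <- pad_sentence(c("i", "have", "an", "email"), N = 3)
print(padded)
```

With \(N = 3\) the sentence gets two BOS tokens in front, and the two words missing from the toy dictionary are mapped to the unknown-word token.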

\(k\)-gram Models

A \(k\)-gram model makes a Markovian assumption on continuation probabilities.
We assume that the next word depends only on the last \(N-1\) words, where \(N\) is the order of the model.
We have

\[\begin{align} \Pr(w|c) &= \Pr(w|w_1 w_2 \cdots w_{N-1})\\ c &= \cdots w_{-1} w_0 w_1 w_2 \cdots w_{N-1} \end{align}\]

We call the \(k\)-tuples of words \((w_1, w_2,\dots, w_k)\) \(k\)-grams.
You can see that we can only capture relatively short-range dependencies.
As \(N\) grows, memory requirements explode, because the number of possible \(N\)-grams grows exponentially in \(N\).
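To make the notion of a \(k\)-gram concrete, the sketch below lists the \(k\)-grams of a short token sequence in base R. The helper `kgrams_of` is hypothetical and written only for this illustration. The last lines compute the upper bound \(V^k\) on the number of distinct \(k\)-grams for a vocabulary of size \(V\), which is where the memory problem comes from.

```r
# Hypothetical helper: list all k-grams (as strings) in a token vector.
kgrams_of <- function(tokens, k) {
  n <- length(tokens)
  if (n < k) return(character(0))
  starts <- seq_len(n - k + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + k - 1)], collapse = " "),
         character(1))
}

tokens <- c("a", "pound", "of", "flesh")
bigrams  <- kgrams_of(tokens, 2)
trigrams <- kgrams_of(tokens, 3)
print(bigrams)
print(trigrams)

# Upper bound on the number of distinct k-grams: V^k.
# V is the vocabulary size reported by summary(model) later in this lecture.
V <- 27133
possible <- V^(1:5)
print(possible)
```

Only a tiny fraction of these possible \(k\)-grams ever occurs in a real corpus, which is why frequency tables store positive counts only.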

Estimating Continuation Probabilities

We can make a table from our corpus, counting how many times each \(k\)-gram occurs.
While this is simple, we need a smoothing technique to account for the fact that many potentially sensible sentences are never observed in our corpus.
The smoothing will take some probability from the very frequently observed sequences and give some to the rarer ones, simply speaking.

\[\hat{\Pr}_{MLE}(w|c) = \frac{C(w_1 w_2 \cdots w_{k} w)}{C(w_1 w_2 \cdots w_{k})}\]

Our data is sparse: many sequences are not in our corpus, hence the above estimator incorrectly assigns zero probability to them.
If the context \(w_1 w_2 \cdots w_{k}\) does not occur in the data, the estimator is not even defined, because the denominator is zero.
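Both failure modes of the maximum-likelihood estimator are easy to reproduce on a toy corpus. The following base-R sketch uses made-up sentences and two hypothetical helpers (`count_gram`, `mle`); it only mirrors the count ratio in the formula above and is not how a package would implement it.

```r
# Toy corpus: each element is one already-tokenized sentence.
corpus <- list(
  c("a", "pound", "of", "flesh"),
  c("a", "pound", "of", "gold"),
  c("a", "pound", "of", "flesh")
)

# Count how often `gram` (a character vector) occurs contiguously.
count_gram <- function(gram, corpus) {
  k <- length(gram)
  total <- 0
  for (s in corpus) {
    if (length(s) < k) next
    for (i in seq_len(length(s) - k + 1)) {
      if (identical(s[i:(i + k - 1)], gram)) total <- total + 1
    }
  }
  total
}

# Maximum-likelihood estimate: C(context, word) / C(context)
mle <- function(word, context, corpus) {
  count_gram(c(context, word), corpus) / count_gram(context, corpus)
}

p_flesh   <- mle("flesh",   c("pound", "of"), corpus)  # seen continuation
p_bananas <- mle("bananas", c("pound", "of"), corpus)  # unseen word: zero
p_unseen  <- mle("flesh",   c("kilo", "of"),  corpus)  # unseen context: 0 / 0
print(c(p_flesh, p_bananas, p_unseen))
```

The second value is exactly zero and the third is `NaN`, which is what smoothing is meant to repair.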

Training and Testing NLP Models

We need an evaluation metric: how good is this model?
Widely used is perplexity: the larger the perplexity of a discrete probability distribution, the harder it is for an observer to guess the next value drawn from it.
We will evaluate the cross-entropy per word \(H=-\frac{1}{W} \sum_s \ln \Pr(s)\), where the sum runs over the sentences \(s\) of the corpus we evaluate on and \(W\) is its total number of words. The perplexity is then \(\exp(H)\), so lower is better.
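The arithmetic behind this metric fits in a few lines of base R. The sentence probabilities and lengths below are invented for illustration, and the sketch assumes the usual convention that perplexity equals \(\exp(H)\).

```r
# Hypothetical sentence probabilities from some model, and the number of
# words in each sentence (made-up numbers).
sentence_probs <- c(0.01, 0.002, 0.0005)
sentence_words <- c(4, 6, 7)

W <- sum(sentence_words)             # total number of words
H <- -sum(log(sentence_probs)) / W   # cross-entropy per word
perplexity_value <- exp(H)
print(perplexity_value)

# Sanity check: a model assigning probability 1/V to every word
# has perplexity V.
V <- 50
uniform_probs <- (1 / V)^sentence_words
uniform_perplexity <- exp(-sum(log(uniform_probs)) / W)
print(uniform_perplexity)
```

The sanity check gives an intuition for the scale: a perplexity of \(P\) means the model is, on average, as uncertain as if it had to pick uniformly among \(P\) words.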

Training a k-gram model in R

library(kgrams)

We can get the spoken text from the following Shakespeare plays:

playcodes <- c(
        "All's Well That Ends Well" = "AWW",
        "Antony and Cleopatra" = "Ant",
        "As You Like It" = "AYL",
        "The Comedy of Errors" = "Err",
        "Coriolanus" = "Cor",
        "Cymbeline" = "Cym",
        "Hamlet" = "Ham",
        "Henry IV, Part 1" = "1H4",
        "Henry IV, Part 2" = "2H4",
        "Henry V" = "H5",
        "Henry VI, Part 1" = "1H6",
        "Henry VI, Part 2" = "2H6",
        "Henry VI, Part 3" = "3H6",
        "Henry VIII" = "H8",
        "Julius Caesar" = "JC",
        "King John" = "Jn",
        "King Lear" = "Lr",
        "Love's Labor's Lost" = "LLL",
        "Macbeth" = "Mac",
        "Measure for Measure" = "MM",
        "The Merchant of Venice" = "MV",
        "The Merry Wives of Windsor" = "Wiv",
        "A Midsummer Night's Dream" = "MND",
        "Much Ado About Nothing" = "Ado",
        "Othello" = "Oth",
        "Pericles" = "Per",
        "Richard II" = "R2",
        "Richard III" = "R3",
        "Romeo and Juliet" = "Rom",
        "The Taming of the Shrew" = "Shr",
        "The Tempest" = "Tmp",
        "Timon of Athens" = "Tim",
        "Titus Andronicus" = "Tit",
        "Troilus and Cressida" = "Tro",
        "Twelfth Night" = "TN",
        "Two Gentlemen of Verona" = "TGV",
        "Two Noble Kinsmen" = "TNK",
        "The Winter's Tale" = "WT"
        )

Estimating 2

We could get the text from “Much Ado about Nothing” as follows:

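get_url_con <- function(playcode) {
        # Only accept play codes listed in `playcodes`
        stopifnot(playcode %in% playcodes)
        url <- paste0("https://www.folgerdigitaltexts.org/", playcode, "/text")
        # Return an (unopened) connection to the text of the play
        return(url(url))
}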

con <- get_url_con("Ado")
open(con)
readLines(con, 10)

 [1] "<br/>"                                              
 [2] "I learn in this letter that Don<br/>"               
 [3] "Pedro of Aragon comes this night to Messina.<br/>"  
 [4] "He is very near by this. He was not three<br/>"     
 [5] "leagues off when I left him.<br/>"                  
 [6] "How many gentlemen have you lost in this<br/>"      
 [7] "action?<br/>"                                       
 [8] "But few of any sort, and none of name.<br/>"        
 [9] "A victory is twice itself when the achiever<br/>"   
[10] "brings home full numbers. I find here that Don<br/>"

close(con)

Defining Training and Testing Data

We will use all plays but “Hamlet” as training data, and reserve this last one for testing our model.

train_playcodes <- playcodes[names(playcodes) != c("Hamlet")]
test_playcodes <- playcodes[names(playcodes) == c("Hamlet")]

We want to pre-process the text data: here we remove the HTML tags and make everything lower-case.

.preprocess <- function(x) {
        # Remove html tags
        x <- gsub("<[^>]+>", "", x)
        # Lower-case and remove characters not alphanumeric or punctuation
        x <- kgrams::preprocess(x)
        return(x)
}
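To see what the tag-removal step does, here is a small base-R check on lines like the ones printed earlier for “Much Ado About Nothing”. `tolower()` is used as a stand-in for the lower-casing part of `kgrams::preprocess()`; the character filtering done by the real function is not reproduced here.

```r
raw_lines <- c(
  "<br/>",
  "I learn in this letter that Don<br/>",
  "Pedro of Aragon comes this night to Messina.<br/>"
)

# Step 1 of .preprocess: drop anything that looks like an HTML tag
no_tags <- gsub("<[^>]+>", "", raw_lines)

# Base-R stand-in for lower-casing (not the full kgrams::preprocess())
cleaned <- tolower(no_tags)
print(cleaned)
```

The first line becomes an empty string, which is why empty sentences are filtered out in the tokenization step.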

Preprocessing Text

We need to split sentences at sensible punctuation marks .!?:; and insert EOS and BOS tokens into the data.
This will treat .!?:; as regular words, hence the model will be able to predict those.

.tknz_sent <- function(x) {
        # Collapse everything to a single string
        x <- paste(x, collapse = " ")
        # Tokenize sentences
        x <- kgrams::tknz_sent(x, keep_first = TRUE)
        # Remove empty sentences
        x <- x[x != ""]
        return(x)
}
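The effect of this tokenization can be approximated in base R. The function `split_sentences` below is a hypothetical stand-in written for this illustration; it is assumed, not verified, to behave like `kgrams::tknz_sent(x, keep_first = TRUE)` on simple input such as this.

```r
# Base-R approximation: split after . ! ? : ; and keep the punctuation
# mark as a separate token at the end of each sentence.
split_sentences <- function(x) {
  # Collapse everything to a single string
  x <- paste(x, collapse = " ")
  # Put a space before each punctuation mark so it becomes its own token
  x <- gsub("([.!?:;])", " \\1", x)
  # Split on whitespace that follows a punctuation mark
  sents <- strsplit(x, "(?<=[.!?:;])\\s+", perl = TRUE)[[1]]
  # Collapse repeated spaces and trim
  sents <- trimws(gsub("\\s+", " ", sents))
  sents[sents != ""]
}

sents <- split_sentences(c("he is very near by this. he was not three",
                           "leagues off when i left him."))
print(sents)
```

Note that the line break inside the second sentence disappears, because the lines are collapsed into one string before splitting.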

Making \(k\)-gram frequency counts

Let us now make a table of occurrences of all \(k\)-grams in our corpus.
We set an order:

N <- 5
freqs <- kgram_freqs(N, .preprocess = .preprocess, .tknz_sent = .tknz_sent)
summary(freqs)

A k-gram frequency table.

Parameters:
* N: 5
* V: 0

Number of words in training corpus:
* W: 0

Number of distinct k-grams with positive counts:
* 1-grams:0
* 2-grams:0
* 3-grams:0
* 4-grams:0
* 5-grams:0

So, for now this is an empty model as you can see. Let’s train it on our corpus!

Training the NLP model

invisible(lapply(train_playcodes,
       function(playcode) {
               con <- get_url_con(playcode)
               process_sentences(text = con, freqs = freqs, verbose = FALSE)
       }))

Each call returns the updated frequency table, so without invisible() the console would print one “A k-gram frequency table.” entry per training play.

Checking the Frequency tables

The freqs object was modified in place during the previous call.
Let’s check it quickly:

query(freqs, c("leonato", "pound of flesh", "smartphones"))

[1] 23  6  0

Last thing to do: choose a smoother.

smoothers()

[1] "ml"    "add_k" "abs"   "kn"    "mkn"   "sbo"   "wb"

Let’s choose the modified Kneser-Ney smoother and set some default parameters:

info("mkn")

Interpolated modified Kneser-Ney
 * code: 'mkn'
 * parameters: D1, D2, D3
 * constraints: 0 <= Di <= 1

Building the model

model <- language_model(freqs, smoother = "mkn", D1 = 0.5, D2 = 0.5, D3 = 0.5)
summary(model)

A k-gram language model.

Smoother:
* 'mkn'.

Parameters:
* N: 5
* V: 27133
* D1: 0.5
* D2: 0.5
* D3: 0.5

Number of words in training corpus:
* W: 955351

Number of distinct k-grams with positive counts:
* 1-grams:27135
* 2-grams:296762
* 3-grams:631164
* 4-grams:767563
* 5-grams:800543

Making Predictions with the model

Now we can compute probabilities for given sentences:

sentences <- c(
        "I have a letter from monsieur Berowne to one lady Rosaline.",
        "I have an email from monsieur Valerio to one lady Judit."
)
probability(sentences, model)

[1] 2.407755e-06 3.768346e-40

Or we can get the continuation probability for a context:

context <- "pound of"
words <- c("flesh", "bananas")
probability(words %|% context, model)

[1] 3.930320e-01 5.866444e-08

Tuning our models

Remember we held out “Hamlet” from our training data. Let’s use it to test performance now!

con <- get_url_con(test_playcodes)
perplexity(text = con, model = model)

[1] 328.5284

This applies the same transformations and tokenization to the test data as to the training data (which is important).

Tuning More

We could now create a grid over the parameters of the model (D1, D2 etc) as well as the order of the model.
We would then choose those parameters for which the perplexity is smallest.
Suppose we find that the \(k=4\) model works best.
Let’s use it to create some random sentences!

param(model, "N") <- 4
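The grid search described on this slide can be sketched as follows. The helper `tune_grid` and the stand-in objective `fake_perplexity` are hypothetical and exist only so that the sketch runs without a trained model; a single discount `D` is used for brevity.

```r
# Hypothetical helper: evaluate every (N, D) combination and return the best.
# `evaluate` is any function(N, D) returning a perplexity on held-out text.
tune_grid <- function(orders, discounts, evaluate) {
  grid <- expand.grid(N = orders, D = discounts)
  grid$perplexity <- mapply(evaluate, grid$N, grid$D)
  grid[which.min(grid$perplexity), ]
}

# Stand-in objective, minimized at N = 4 and D = 0.75 by construction.
fake_perplexity <- function(N, D) 300 + (N - 4)^2 + (D - 0.75)^2

best <- tune_grid(orders = 2:5,
                  discounts = c(0.25, 0.5, 0.75, 1),
                  evaluate = fake_perplexity)
print(best)
```

With the objects from this lecture, `evaluate` would instead set the parameters with `param(model, ...) <-` and return `perplexity()` on the held-out text. In that case it is probably safer to read the text once into a character vector with `readLines()` rather than passing the same connection repeatedly.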

Random Text generation

set.seed(840)
sample_sentences(model, 10, max_length = 20)

 [1] "hum ! <EOS>"                                                                                                  
 [2] "helen herself . <EOS>"                                                                                        
 [3] "come come and learn of us not ; <EOS>"                                                                        
 [4] "this kindness . <EOS>"                                                                                        
 [5] "rise marcus rise you froward and blister ! <EOS>"                                                             
 [6] "whats here ? <EOS>"                                                                                           
 [7] "kneel ? <EOS>"                                                                                                
 [8] "kill the poys and the purpose of his most precious queen of thee . <EOS>"                                     
 [9] "i pray thee do nothing of a sycamore i did say so and so may you for a servant comes [...] (truncated output)"
[10] "to me and i have merited either in my mind . <EOS>"

Temperature

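The temperature parameter t makes the sampling distribution flatter or more peaked. Without it (normal temperature), the model samples from its implied distribution. Lower values concentrate sampling on the most likely continuations, so the output becomes repetitive; higher values mean there will be much more randomness in the output.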
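A common way to implement temperature is to raise each probability to the power \(1/t\) and renormalize. The base-R sketch below applies this transform to made-up continuation probabilities; it is assumed, not verified, that kgrams uses this same transform internally.

```r
# Hypothetical continuation probabilities for four candidate words
p <- c(flesh = 0.6, gold = 0.25, silver = 0.1, bananas = 0.05)

# Usual temperature transform: raise to the power 1/t and renormalize
apply_temperature <- function(p, t) {
  q <- p^(1 / t)
  q / sum(q)
}

normal <- apply_temperature(p, t = 1)    # unchanged
hot    <- apply_temperature(p, t = 10)   # close to uniform
cold   <- apply_temperature(p, t = 0.1)  # almost all mass on "flesh"
print(round(rbind(normal, hot, cold), 3))
```

At high temperature even very unlikely words are drawn almost as often as likely ones, while at low temperature the single most likely word dominates.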

set.seed(841)
sample_sentences(model, 10, max_length = 20) # Normal temperature

 [1] "thou rt not on him yet . <EOS>"                                                                                       
 [2] "o lord sir ! <EOS>"                                                                                                   
 [3] "first mend our dinner here in vienna . <EOS>"                                                                         
 [4] "shall s go hear me sir thomas and leave out the color of pity . <EOS>"                                                
 [5] "my cherry lips to move and with a green and pale at parting when our quick winds shut up his [...] (truncated output)"
 [6] "what a mere child is fancy that it alone thou fool . <EOS>"                                                           
 [7] "i would your grace to wear a heart replete with felt absence of all other how bright bay . <EOS>"                     
 [8] "methinks youre better spoken . <EOS>"                                                                                 
 [9] "i will not rest but mocks peter with our course where is warwick . <EOS>"                                             
[10] "tis the cardinal ; <EOS>"

High temperature

set.seed(841)
sample_sentences(model, 10, max_length = 20, t = 10)

 [1] "servanted position exampled becomd fideles throws videsne containing wharfs apply nag hung helen phrasethat oppressd swelling sluts lings errest kerns [...] (truncated output)"                                        
 [2] "accomplishment bragged freshness cheerful cheaply afterlove debauched honorswhich wretched sounda twelve puzel marriedhath worshipped frankness walters untalked dancing countersealed corners [...] (truncated output)"
 [3] "cocksure temptings contradiction poised mutualities veil liberties firmament wellsteeled angelical leg olivia besiege hardiness lackbeard shut envied silvervoiced givest puddle [...] (truncated output)"              
 [4] "cornwalls rolls unnoble unassailed bliss clutched happy feebled augers isabel thee fabulous je summered faiths mulberry revive crazy request celerity [...] (truncated output)"                                         
 [5] "imbar felt progress streak perishing poorer frenzys glideth beset ascanius irresolute misguide forgd odors allwatchd jacknape redoubled continent nestorlike sirrahthe [...] (truncated output)"                        
 [6] "ahold pottlepots liker peter shrieking correctioner illfavoredly unbraided visage brisk greatsized navys forth bonos altheas goaded heras cookery highblown oerstare [...] (truncated output)"                          
 [7] "holidam bellypinchd careers horsemans needy divinely exits calendars id benevolence plumd sadhearted eaux level league perverse resolve accouterments luggage amort [...] (truncated output)"                           
 [8] "cherishes venom shouldnotwithstanding doomsday swell elseof aloft furrowweeds dercetus pitythey nutshell poll scorpions presents pericles scythes placeth potent drooping botcher [...] (truncated output)"             
 [9] "perversely body ulcerous circumstance whispers sightless reliances parricides pragging piglike oneandtwenty illfaced apparel biggen masteri counterfeit uncivil vouchsafed unforced planks [...] (truncated output)"    
[10] "sag hellbroth holdeth cocklight uproar eclipses bastardizing cojoin antonioo stricken disloyal almain forerun reverted gothe prone branched spleeny towards upon [...] (truncated output)"

Low temperature

set.seed(841)
sample_sentences(model, 10, max_length = 20, t = 0.1)

 [1] "i am not in the world . <EOS>"                                                              
 [2] "i will not be entreated . <EOS>"                                                            
 [3] "i am not in the world . <EOS>"                                                              
 [4] "i am not . <EOS>"                                                                           
 [5] "i am not in the world . <EOS>"                                                              
 [6] "i am not in the world . <EOS>"                                                              
 [7] "i am not in the world . <EOS>"                                                              
 [8] "i am not in the night and tempt the rheumy and unpurgd air to add unto his sickness ? <EOS>"
 [9] "i am not to be a man . <EOS>"                                                               
[10] "i am not . <EOS>"
