library(kgrams)
SciencesPo Intro To Programming 2024
11 October, 2024
In this lecture we will introduce the most basic language models with R.
This is based on a nice introduction by Valerio Gherardi, author of the kgrams
package for R.
A k-gram (or N-gram) language model predicts each word from the context of the \(N-1\) words preceding it. Words that are not in the model's dictionary are mapped to a special UNK (unknown word) token, and each sentence is terminated by an EOS (end of sentence) token. kgram models (below) also include a BOS (beginning of sentence) token: each sentence is left-padded with \(N-1\) BOS tokens (\(N\) being the order of the model). This makes it possible to predict the first word of a sentence from the preceding \(N-1\) tokens.\[\begin{align} \Pr(w|c) &= \Pr(w|w_1 w_2 \cdots w_{N-1})\\ c &= \cdots w_{-1} w_0 w_1 w_2 \cdots w_{N-1} \end{align}\]
The maximum likelihood estimate of these conditional probabilities is a ratio of k-gram counts \(C(\cdot)\) in the training corpus (with \(k = N-1\)):\[\hat{\Pr}_{MLE}(w|c) = \frac{C(w_1 w_2 \cdots w_{k} w)}{C(w_1 w_2 \cdots w_{k})}\]
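For example, in a bigram model (\(N = 2\)) the estimated probability of the word "night" following "this" is simply \(C(\text{this night}) / C(\text{this})\): the number of times the bigram "this night" occurs in the training corpus, divided by the number of times "this" occurs.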
We can get the spoken text from the following Shakespeare plays:
playcodes <- c(
"All's Well That Ends Well" = "AWW",
"Antony and Cleopatra" = "Ant",
"As You Like It" = "AYL",
"The Comedy of Errors" = "Err",
"Coriolanus" = "Cor",
"Cymbeline" = "Cym",
"Hamlet" = "Ham",
"Henry IV, Part 1" = "1H4",
"Henry IV, Part 2" = "2H4",
"Henry V" = "H5",
"Henry VI, Part 1" = "1H6",
"Henry VI, Part 2" = "2H6",
"Henry VI, Part 3" = "3H6",
"Henry VIII" = "H8",
"Julius Caesar" = "JC",
"King John" = "Jn",
"King Lear" = "Lr",
"Love's Labor's Lost" = "LLL",
"Macbeth" = "Mac",
"Measure for Measure" = "MM",
"The Merchant of Venice" = "MV",
"The Merry Wives of Windsor" = "Wiv",
"A Midsummer Night's Dream" = "MND",
"Much Ado About Nothing" = "Ado",
"Othello" = "Oth",
"Pericles" = "Per",
"Richard II" = "R2",
"Richard III" = "R3",
"Romeo and Juliet" = "Rom",
"The Taming of the Shrew" = "Shr",
"The Tempest" = "Tmp",
"Timon of Athens" = "Tim",
"Titus Andronicus" = "Tit",
"Troilus and Cressida" = "Tro",
"Twelfth Night" = "TN",
"Two Gentlemen of Verona" = "TGV",
"Two Noble Kinsmen" = "TNK",
"The Winter's Tale" = "WT"
)
We could get the text from “Much Ado About Nothing” as follows:
get_url_con <- function(playcode) {
  stopifnot(playcode %in% playcodes)
  url <- paste0("https://www.folgerdigitaltexts.org/", playcode, "/text")
  url(url)  # return an (unopened) connection to the play's text
}
con <- get_url_con("Ado")
open(con)
readLines(con, 10)
[1] "<br/>"
[2] "I learn in this letter that Don<br/>"
[3] "Pedro of Aragon comes this night to Messina.<br/>"
[4] "He is very near by this. He was not three<br/>"
[5] "leagues off when I left him.<br/>"
[6] "How many gentlemen have you lost in this<br/>"
[7] "action?<br/>"
[8] "But few of any sort, and none of name.<br/>"
[9] "A victory is twice itself when the achiever<br/>"
[10] "brings home full numbers. I find here that Don<br/>"
We will use all plays but “Hamlet” as training data, and reserve this last one for testing our model.
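The split itself is not shown above; a one-line sketch, assuming the name train_playcodes used further below (test_playcodes is introduced here only for symmetry):
# hold out "Hamlet" for testing; train on all other plays
train_playcodes <- playcodes[names(playcodes) != "Hamlet"]
test_playcodes  <- playcodes[names(playcodes) == "Hamlet"]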
We want to pre-process the text data: here we remove some HTML tags and make everything lower-case. We also tokenize the text into sentences, splitting at the characters .!?:; and inserting EOS and BOS tokens into the data. The characters .!?:; are kept as regular words, hence the model will be able to predict those.
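The pre-processing and sentence-tokenization functions, and the call that creates the (initially empty) frequency table printed below, are not shown in the extract. A minimal sketch, assuming kgrams' built-in helpers preprocess() and tknz_sent() are used:
.preprocess <- function(x) {
  x <- gsub("<[^>]*>", "", x)   # strip html tags such as <br/>
  kgrams::preprocess(x)         # lower-case and standard cleanup
}

.tknz_sent <- function(x) {
  # split into sentences at .!?:; and keep the delimiter as its own token
  x <- kgrams::tknz_sent(x, keep_first = TRUE)
  x[x != ""]                    # drop empty sentences
}

freqs <- kgram_freqs(N = 5, .preprocess = .preprocess, .tknz_sent = .tknz_sent)
freqs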
A k-gram frequency table.
Parameters:
* N: 5
* V: 0
Number of words in training corpus:
* W: 0
Number of distinct k-grams with positive counts:
* 1-grams:0
* 2-grams:0
* 3-grams:0
* 4-grams:0
* 5-grams:0
lapply(train_playcodes,
       function(playcode) {
         con <- get_url_con(playcode)
         process_sentences(text = con, freqs = freqs, verbose = FALSE)
       })
$`All's Well That Ends Well`
A k-gram frequency table.
$`Antony and Cleopatra`
A k-gram frequency table.
$`As You Like It`
A k-gram frequency table.
$`The Comedy of Errors`
A k-gram frequency table.
$Coriolanus
A k-gram frequency table.
$Cymbeline
A k-gram frequency table.
$`Henry IV, Part 1`
A k-gram frequency table.
$`Henry IV, Part 2`
A k-gram frequency table.
$`Henry V`
A k-gram frequency table.
$`Henry VI, Part 1`
A k-gram frequency table.
$`Henry VI, Part 2`
A k-gram frequency table.
$`Henry VI, Part 3`
A k-gram frequency table.
$`Henry VIII`
A k-gram frequency table.
$`Julius Caesar`
A k-gram frequency table.
$`King John`
A k-gram frequency table.
$`King Lear`
A k-gram frequency table.
$`Love's Labor's Lost`
A k-gram frequency table.
$Macbeth
A k-gram frequency table.
$`Measure for Measure`
A k-gram frequency table.
$`The Merchant of Venice`
A k-gram frequency table.
$`The Merry Wives of Windsor`
A k-gram frequency table.
$`A Midsummer Night's Dream`
A k-gram frequency table.
$`Much Ado About Nothing`
A k-gram frequency table.
$Othello
A k-gram frequency table.
$Pericles
A k-gram frequency table.
$`Richard II`
A k-gram frequency table.
$`Richard III`
A k-gram frequency table.
$`Romeo and Juliet`
A k-gram frequency table.
$`The Taming of the Shrew`
A k-gram frequency table.
$`The Tempest`
A k-gram frequency table.
$`Timon of Athens`
A k-gram frequency table.
$`Titus Andronicus`
A k-gram frequency table.
$`Troilus and Cressida`
A k-gram frequency table.
$`Twelfth Night`
A k-gram frequency table.
$`Two Gentlemen of Verona`
A k-gram frequency table.
$`Two Noble Kinsmen`
A k-gram frequency table.
$`The Winter's Tale`
A k-gram frequency table.
The list returned by lapply() is not of interest here; the important point is that the freqs object was modified during the previous call. Let’s choose the modified Kneser-Ney smoother and set some default parameters:
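The call that creates the model is not shown in the extract; a minimal sketch consistent with the summary printed below (modified Kneser-Ney smoother with discount parameters D1 = D2 = D3 = 0.5):
model <- language_model(freqs, smoother = "mkn", D1 = 0.5, D2 = 0.5, D3 = 0.5)
model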
A k-gram language model.
Smoother:
* 'mkn'.
Parameters:
* N: 5
* V: 27133
* D1: 0.5
* D2: 0.5
* D3: 0.5
Number of words in training corpus:
* W: 955351
Number of distinct k-grams with positive counts:
* 1-grams:27135
* 2-grams:296762
* 3-grams:631164
* 4-grams:767563
* 5-grams:800543
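We can now use the model to compute the probability of whole sentences: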
sentences <- c(
  "I have a letter from monsieur Berowne to one lady Rosaline.",
  "I have an email from monsieur Valerio to one lady Judit."
)
probability(sentences, model)
[1] 2.407755e-06 3.768346e-40
Or we can get the continuation probability of a word given a context:
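The corresponding call is not shown above; a sketch using the %|% operator from kgrams, which expresses a word (or phrase) given a context — the particular words below are purely illustrative:
probability("rosaline" %|% "i have a letter from monsieur berowne to one lady", model)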
This applies the same transformations and tokenization to the test data as were used for the training data (which is important).
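Although the corresponding code is not visible in the extract, the natural way to evaluate the model on the held-out play is to compute its perplexity on “Hamlet” (lower perplexity means better predictions); a minimal sketch:
con <- get_url_con("Ham")
perplexity(text = con, model = model)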
We could further tune the parameters of the smoother (D1, D2, etc.) as well as the order of the model, for instance by minimizing the perplexity on held-out text. Finally, let’s generate some random sentences from the model:
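The calls producing the four blocks of output below are not shown in the extract; a minimal sketch, assuming sample_sentences() and its temperature parameter t (the exact seed and settings are guesses):
# two draws at the default temperature (t = 1)
sample_sentences(model, n = 10, max_length = 20)
sample_sentences(model, n = 10, max_length = 20)

# a high temperature flattens the distribution: output is close to random words
sample_sentences(model, n = 10, max_length = 20, t = 10)

# a low temperature sharpens the distribution: short, repetitive, high-probability sentences
sample_sentences(model, n = 10, max_length = 20, t = 0.1)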
[1] "hum ! <EOS>"
[2] "helen herself . <EOS>"
[3] "come come and learn of us not ; <EOS>"
[4] "this kindness . <EOS>"
[5] "rise marcus rise you froward and blister ! <EOS>"
[6] "whats here ? <EOS>"
[7] "kneel ? <EOS>"
[8] "kill the poys and the purpose of his most precious queen of thee . <EOS>"
[9] "i pray thee do nothing of a sycamore i did say so and so may you for a servant comes [...] (truncated output)"
[10] "to me and i have merited either in my mind . <EOS>"
[1] "thou rt not on him yet . <EOS>"
[2] "o lord sir ! <EOS>"
[3] "first mend our dinner here in vienna . <EOS>"
[4] "shall s go hear me sir thomas and leave out the color of pity . <EOS>"
[5] "my cherry lips to move and with a green and pale at parting when our quick winds shut up his [...] (truncated output)"
[6] "what a mere child is fancy that it alone thou fool . <EOS>"
[7] "i would your grace to wear a heart replete with felt absence of all other how bright bay . <EOS>"
[8] "methinks youre better spoken . <EOS>"
[9] "i will not rest but mocks peter with our course where is warwick . <EOS>"
[10] "tis the cardinal ; <EOS>"
[1] "servanted position exampled becomd fideles throws videsne containing wharfs apply nag hung helen phrasethat oppressd swelling sluts lings errest kerns [...] (truncated output)"
[2] "accomplishment bragged freshness cheerful cheaply afterlove debauched honorswhich wretched sounda twelve puzel marriedhath worshipped frankness walters untalked dancing countersealed corners [...] (truncated output)"
[3] "cocksure temptings contradiction poised mutualities veil liberties firmament wellsteeled angelical leg olivia besiege hardiness lackbeard shut envied silvervoiced givest puddle [...] (truncated output)"
[4] "cornwalls rolls unnoble unassailed bliss clutched happy feebled augers isabel thee fabulous je summered faiths mulberry revive crazy request celerity [...] (truncated output)"
[5] "imbar felt progress streak perishing poorer frenzys glideth beset ascanius irresolute misguide forgd odors allwatchd jacknape redoubled continent nestorlike sirrahthe [...] (truncated output)"
[6] "ahold pottlepots liker peter shrieking correctioner illfavoredly unbraided visage brisk greatsized navys forth bonos altheas goaded heras cookery highblown oerstare [...] (truncated output)"
[7] "holidam bellypinchd careers horsemans needy divinely exits calendars id benevolence plumd sadhearted eaux level league perverse resolve accouterments luggage amort [...] (truncated output)"
[8] "cherishes venom shouldnotwithstanding doomsday swell elseof aloft furrowweeds dercetus pitythey nutshell poll scorpions presents pericles scythes placeth potent drooping botcher [...] (truncated output)"
[9] "perversely body ulcerous circumstance whispers sightless reliances parricides pragging piglike oneandtwenty illfaced apparel biggen masteri counterfeit uncivil vouchsafed unforced planks [...] (truncated output)"
[10] "sag hellbroth holdeth cocklight uproar eclipses bastardizing cojoin antonioo stricken disloyal almain forerun reverted gothe prone branched spleeny towards upon [...] (truncated output)"
[1] "i am not in the world . <EOS>"
[2] "i will not be entreated . <EOS>"
[3] "i am not in the world . <EOS>"
[4] "i am not . <EOS>"
[5] "i am not in the world . <EOS>"
[6] "i am not in the world . <EOS>"
[7] "i am not in the world . <EOS>"
[8] "i am not in the night and tempt the rheumy and unpurgd air to add unto his sickness ? <EOS>"
[9] "i am not to be a man . <EOS>"
[10] "i am not . <EOS>"