This tutorial is a companion to the talk:
Nini, A. ‘Examining an author’s individual grammar’. Comparative Literature Goes Digital Workshop, Digital Humanities 2025. Universidade Nova de Lisboa, Lisbon, Portugal. 14/07/2025.
The tutorial explains how to replicate the analysis that was
presented as part of this talk. A more general tutorial on how to use
idiolect
can be found on its website here.
You can install idiolect from CRAN like any other package.
install.packages("idiolect")
idiolect depends on quanteda, which is loaded at the same time.
library(idiolect)
## Loading required package: quanteda
## Package version: 4.0.1
## Unicode version: 14.0
## ICU version: 71.1
## Parallel computing: disabled
## See https://quanteda.io for tutorials and examples.
This case study uses the English component of the refcor
corpus, which is available here. To replicate this
analysis in full, you should download just the folder of English texts.
Then, to load this folder of texts as a corpus you can use the
create_corpus()
function.
full.corpus <- create_corpus("path/to/folder")
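The corpus_subset() calls below rely on the author and textname document variables, which appear to be derived from the file names (e.g. dickens_bleak.txt); this is an assumption about how create_corpus() names its metadata, so it is worth inspecting the document variables with standard quanteda tools before filtering:
docvars(full.corpus) |> head() #check the document variables created when loading the corpus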
Our case study is the analysis of a random 1,000-token sample from Dickens’s novel Bleak House. This is our \(Q\) sample.
q.sample <- corpus_subset(
  full.corpus,
  textname == "bleak" #select just Bleak House
) |>
  chunk_texts(size = 1000) |> #break it into 1,000-token chunks
  corpus_sample(size = 1) #randomly select one chunk

docnames(q.sample) <- gsub( #remove the chunk number from the sample name
  "\\.\\d+",
  "",
  docnames(q.sample)
)
The candidate author’s data, the \(K\) sample, is instead made up of two random 40,000-token samples, one taken from each of the other two Dickens novels.
k.samples <- corpus_subset(
  full.corpus,
  textname != "bleak" & author == "dickens" #select the other novels by Dickens
) |>
  chunk_texts(size = 40000) |> #break them into 40,000-token chunks
  corpus_sample(size = 1, by = textname) #randomly select one chunk per novel

docnames(k.samples) <- gsub( #remove the chunk number from the sample names
  "\\.\\d+",
  "",
  docnames(k.samples)
)
Finally, for the reference corpus, we are going to use two random 40,000-token samples per author, one taken from each of two randomly selected novels, for every author in the corpus other than Dickens.
reference <- corpus_subset(
  full.corpus,
  author != "dickens" #select novels not by Dickens
) |>
  corpus_sample(2, by = author) |> #sample two novels per author
  chunk_texts(size = 40000) |> #break them into 40,000-token chunks
  corpus_sample(size = 1, by = textname) #randomly select one chunk per novel

docnames(reference) <- gsub( #remove the chunk number from the sample names
  "\\.\\d+",
  "",
  docnames(reference)
)
The three sets of samples have to be pre-processed with the POSnoise algorithm before applying LambdaG. This is done with the contentmask() function, which requires spacyr to be installed, together with the standard spaCy model for English. If you encounter any problems with the installation, check the spacyr documentation here.
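If spacyr and the English model are not yet set up, one possible way to install them is sketched below. This sketch uses spacyr’s own installation helpers and assumes that the standard small English model (en_core_web_sm) is the one required; if anything fails, follow the spacyr documentation instead.
install.packages("spacyr")                         #install spacyr from CRAN
spacyr::spacy_install()                            #install spaCy into a dedicated Python environment
spacyr::spacy_download_langmodel("en_core_web_sm") #download the standard small English model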
q.sample.pos <- contentmask(q.sample)
k.samples.pos <- contentmask(k.samples)
reference.pos <- contentmask(reference)
The samples also need to be tokenised into sentences.
q.sents <- tokenize_sents(q.sample.pos)
k.sents <- tokenize_sents(k.samples.pos)
ref.sents <- tokenize_sents(reference.pos)
Because these steps can take a long time depending on your computer, pre-processed samples are provided in this repository for the next steps. They can be loaded as follows.
k.sents <- readRDS("data/posnoised_K_sample.rds")
q.sents <- readRDS("data/posnoised_Q_sample.rds")
ref.sents <- readRDS("data/posnoised_ref_samples.rds")
A fully pre-processed sample (POSnoised and tokenised) looks like this:
q.sents
## Tokens consisting of 1 document and 2 docvars.
## dickens_bleak.txt :
## [1] "; i was V there because the N V me and would let me go nowhere B ."
## [2] "D N were made N to that J N !"
## [3] "it first came on after two N ."
## [4] "it was then V for another two N while the N ( may his N V off !"
## [5] ") V whether i was my N 's N , about which there was no N at all with any J N ."
## [6] "he then found out that there were not N enough - - V , there were only V as yet !"
## [7] "- - but that we must have another who had been V out and must begin all over again ."
## [8] "the N at that N - - before the N was begun !"
## [9] "- - were three N the N ."
## [10] "my N would have given up the N , and J , to V more N ."
## [11] "my J N , V to me in that will of my N 's , has gone in N ."
## [12] "the N , still J , has V into N , and N , and N , with everything B - - and here i V , this N !"
## [ ... and 64 more ]
There are some mistakes in the sentence tokenisation; these could be fixed by using a different spaCy model or by adding some additional pre-processing steps. For the sake of this demonstration, however, we are going to use the samples as they are.
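The heatmap mentioned in the next paragraph is produced by the LambdaG analysis itself, which is not reproduced step by step in this tutorial. Purely as a rough sketch (the argument order and default settings below are assumptions, so check ?lambdaG and ?lambdaG_visualize before running them), the LambdaG score and the per-sentence heatmap could be obtained from the sentence-tokenised samples along these lines:
#sketch only: argument order (Q, K, reference) and defaults are assumptions
lg.results <- lambdaG(q.sents, k.sents, ref.sents) #LambdaG score of the Q sample against the K samples
lambdaG_visualize(q.sents, k.sents, ref.sents)     #heatmap of each Q sentence's contribution to the score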
Using the heatmap produced by the LambdaG analysis, the analyst can now identify constructions worth exploring with a concordance. In the upcoming version of idiolect it will be possible to produce a concordance directly from the same sentence-tokenised objects used above. In the current version, however, this is not possible, so we need to ‘untokenise’ the samples by recombining their sentences into full texts.
recombine <- function(sentences){
  recombined <- lapply(sentences, paste0, collapse = " \n ") |> #paste the sentences of each text back together
    unlist() |>
    corpus()
  docvars(recombined) <- docvars(sentences) #keep the original document variables
  return(recombined)
}
q.samples <- recombine(q.sents)
k.samples <- recombine(k.sents)
ref.samples <- recombine(ref.sents)
It is now possible to create concordances and explore the patterns in their context.
concordance(
  q.data = q.samples,
  k.data = k.samples,
  reference.data = ref.samples,
  search = "never to have"
) |>
  dplyr::select(-from, -to) #this simply removes some unnecessary columns