This tutorial is a companion to the talk:

Nini, A. ‘Examining an author’s individual grammar’. Comparative Literature Goes Digital Workshop, Digital Humanities 2025. Universidade Nova de Lisboa, Lisbon, Portugal. 14/07/2025.

This tutorial explains how to replicate the analysis presented in the talk. A more general tutorial on how to use idiolect can be found on its website here.

1 Installation and loading

You can install idiolect from CRAN like any other package.

install.packages("idiolect")

idiolect depends on quanteda, which is loaded alongside it.

library(idiolect)
## Loading required package: quanteda
## Package version: 4.0.1
## Unicode version: 14.0
## ICU version: 71.1
## Parallel computing: disabled
## See https://quanteda.io for tutorials and examples.

2 Corpus preparation

This case study uses the English component of the refcor corpus, which is available here. To replicate this analysis in full, you should download just the folder of English texts. Then, to load this folder of texts as a corpus you can use the create_corpus() function.

full.corpus <- create_corpus("path/to/folder")

Our case study is the analysis of a random 1,000-token sample from Dickens's novel Bleak House. This is our \(Q\) sample.

q.sample <- corpus_subset(                          
  full.corpus,                                      
  textname == "bleak"              #select just Bleak House
) |> 
  chunk_texts(size = 1000) |>      #break it into 1,000-token chunks
  corpus_sample(size = 1)          #randomly select one

docnames(q.sample) <- gsub(        #remove the number from the sample name
  "\\.\\d+", 
  "", 
  docnames(q.sample)
) 

The candidate author’s data, the \(K\) sample, is instead made up of two random 40,000-token samples from the other two Dickens novels.

k.samples <- corpus_subset(
  full.corpus,
  textname != "bleak" & author == "dickens"   #select other novels by Dickens
) |> 
  chunk_texts(size = 40000) |>                #break them into 40,000-token chunks
  corpus_sample(size = 1, by = textname)      #randomly select one chunk per novel

docnames(k.samples) <- gsub(                  #remove the number from the sample names
  "\\.\\d+", 
  "", 
  docnames(k.samples)
) 

Finally, for the reference corpus, we are going to use two random 40,000-token samples, taken from two random novels by each author in the corpus other than Dickens.

reference <- corpus_subset(
  full.corpus,
  author != "dickens"                      #select novels not by Dickens
) |> 
  corpus_sample(2, by = author) |>         #sample two texts per author
  chunk_texts(size = 40000) |>             #break them into 40,000-token chunks
  corpus_sample(size = 1, by = textname)   #randomly select one chunk per novel

docnames(reference) <- gsub(               #remove the number from the sample names
  "\\.\\d+", 
  "", 
  docnames(reference)
) 

The three sets of samples have to be pre-processed with the POSnoise algorithm before applying LambdaG. To run the contentmask() function you need to have spacyr installed, as well as the standard model for English. If you encounter any problems with the installation, check the spacyr documentation here.
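
If spacyr is not yet set up on your machine, a minimal installation sketch is shown below (this assumes the default settings, under which spacy_install() also downloads an English model; see the spacyr documentation for alternatives and troubleshooting).

install.packages("spacyr")
spacyr::spacy_install()   #installs spaCy and, by default, an English language model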

q.sample.pos <- contentmask(q.sample)
k.samples.pos <- contentmask(k.samples)
reference.pos <- contentmask(reference)

The samples also need to be tokenised into sentences.

q.sents <- tokenize_sents(q.sample.pos)
k.sents <- tokenize_sents(k.samples.pos)
ref.sents <- tokenize_sents(reference.pos)

Because these steps can take a long time depending on your computer, pre-processed samples are provided in this repository so that you can move straight to the next steps. These can be loaded as follows.

k.sents <- readRDS("data/posnoised_K_sample.rds")
q.sents <- readRDS("data/posnoised_Q_sample.rds")
ref.sents <- readRDS("data/posnoised_ref_samples.rds")

A fully pre-processed sample (POSnoised and tokenised) looks like this:

q.sents
## Tokens consisting of 1 document and 2 docvars.
## dickens_bleak.txt :
##  [1] "; i was V there because the N V me and would let me go nowhere B ."                            
##  [2] "D N were made N to that J N !"                                                                 
##  [3] "it first came on after two N ."                                                                
##  [4] "it was then V for another two N while the N ( may his N V off !"                               
##  [5] ") V whether i was my N 's N , about which there was no N at all with any J N ."                
##  [6] "he then found out that there were not N enough - - V , there were only V as yet !"             
##  [7] "- - but that we must have another who had been V out and must begin all over again ."          
##  [8] "the N at that N - - before the N was begun !"                                                  
##  [9] "- - were three N the N ."                                                                      
## [10] "my N would have given up the N , and J , to V more N ."                                        
## [11] "my J N , V to me in that will of my N 's , has gone in N ."                                    
## [12] "the N , still J , has V into N , and N , and N , with everything B - - and here i V , this N !"
## [ ... and 64 more ]

There are some mistakes in the sentence tokenisation; these could be fixed by using a different model or by adding additional pre-processing steps. However, for the sake of this demonstration we are going to use the samples as they are.
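
If you want to inspect more of the tokenised sentences than the preview above shows, one option is to ask quanteda's print method for a longer preview (argument names as in recent quanteda versions); since each 'token' in these objects is a sentence, max_ntoken controls how many sentences per document are displayed.

print(q.sents, max_ntoken = 30)   #show up to 30 sentences of the Q sample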

3 LambdaG authorship verification analysis

Let’s run an authorship verification analysis with LambdaG.

lambdaG(q.data = q.sents, k.data = k.sents, ref.data = ref.sents)

The output is a single \(\lambda_G\) score. The score is positive, which indicates support for the same-author hypothesis. However, we cannot quantify the strength of this support: to do so we would need a calibration dataset to turn this raw score into a calibrated log-likelihood ratio.
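
To make the idea of calibration more concrete, below is a generic sketch of score calibration via logistic regression. This illustrates the principle rather than idiolect's own calibration interface, and calib.scores, calib.labels, and q.score are hypothetical objects standing for a labelled set of \(\lambda_G\) scores and the raw score obtained above.

#calib.scores: lambdaG scores from comparisons with known ground truth (hypothetical)
#calib.labels: 1 = same author, 0 = different author (hypothetical)
#q.score: the raw lambdaG score obtained above (hypothetical)
calib <- data.frame(score = calib.scores, same = calib.labels)
fit <- glm(same ~ score, data = calib, family = binomial)
p <- predict(fit, newdata = data.frame(score = q.score), type = "response")
#posterior log-odds minus the calibration set's prior log-odds gives a (natural-log) LLR
llr <- log(p / (1 - p)) - log(mean(calib$same) / (1 - mean(calib$same)))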

4 Analysing the language of the author

LambdaG can also be used to create a text heatmap showing which patterns contribute to the evidence of same authorship, or which patterns are most idiosyncratic of the candidate author, in this case Dickens. To do so, we use lambdaG_visualize() instead of lambdaG().

rlv <- lambdaG_visualize(
  q.data = q.sents,
  k.data = k.sents, 
  ref.data = ref.sents,
  print = "heatmap.html",   #this argument should contain the path to the heatmap output
  cores = NULL              #you can specify how many cores to use for parallel processing 
  )

This function will produce a heatmap similar to the one presented during the talk. The heatmap might look slightly different because LambdaG is a stochastic algorithm.
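
The heatmap is written to the HTML file passed to the print argument. One simple way to open it from R, assuming the relative path used above, is shown below.

utils::browseURL("heatmap.html")   #open the generated heatmap in the default browser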

5 Exploring the patterns in the text using a concordance

With the heatmap, the analyst can now identify constructions worth exploring using a concordance. In the upcoming version of idiolect it will be possible to produce a concordance directly from the same sentence-tokenised objects used above. However, in the current version this is not possible, so we need to ‘untokenise’ the samples by recombining all the sentences into full texts.

recombine <- function(sentences){
  
  #paste the sentences of each document back together into one text per document
  recombined <- lapply(sentences, paste0, collapse = " \n ") |> 
    unlist() |> 
    corpus()
  
  #carry over the document variables from the sentence-tokenised object
  docvars(recombined) <- docvars(sentences)
  
  return(recombined)
  
}

q.samples <- recombine(q.sents)
k.samples <- recombine(k.sents)
ref.samples <- recombine(ref.sents)

It is now possible to create concordances and explore the patterns in their context.

concordance(
  q.data = q.samples,
  k.data = k.samples,
  reference.data = ref.samples, 
  search = "never to have"
) |> 
  dplyr::select(-from, -to)   #this simply removes some unnecessary columns