Visualize the output of the LambdaG algorithm — lambdaG

This function outputs a colour-coded list of the sentences in the input Q text, ordered from highest to lowest \(\lambda_G\), as shown in Nini et al. (under review).

Usage

lambdaG_visualize(
  q.data,
  k.data,
  ref.data,
  N = 10,
  r = 30,
  output = "html",
  print = "",
  scale = "absolute",
  cores = NULL
)

Arguments

q.data: A single questioned or disputed text as a quanteda tokens object with the tokens being sentences (e.g. the output of tokenize_sents()).
k.data: A known or undisputed corpus containing exclusively a single candidate author's texts as a quanteda tokens object with the tokens being sentences (e.g. the output of tokenize_sents()).
ref.data: The reference dataset as a quanteda tokens object with the tokens being sentences (e.g. the output of tokenize_sents()).
N: The order of the model. Default is 10.
r: The number of iterations. Default is 30.
output: A string detailing the file type of the colour-coded text output. Either "html" (default) or "latex".
print: A string indicating the path to the folder in which to save the colour-coded text file. If left empty (default), no file is saved.
scale: A string indicating what scale to use to colour-code the text file. If "absolute" (default) then the raw \(\lambda_G\) is used; if "relative", then the z-score of \(\lambda_G\) over the Q data is used instead, thus showing relative importance.
cores: The number of cores to use for parallel processing. Defaults to NULL, in which case a single core is used.
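
The "relative" scale described for the scale argument can be illustrated with a small sketch. It assumes that the z-score is the usual (x - mean) / sd computed over all tokens of the Q text, with sd being the sample standard deviation; this is an assumption made for illustration, not a statement about the package internals. The values are the rounded token-level \(\lambda_G\) figures printed in the Examples section of this page, and the variable names are hypothetical.

```r
# Token-level lambdaG values copied from the example output on this page
# (rounded as printed, so all results are approximate).
lambda_g <- c(0.771, 0.220, -1.16, 0.326, -0.0955, 0.991, -0.0210,
              0.0194, 0.0259, -0.252, -0.102, 0.0365, -0.0385)

# Assumption: z-score over the Q data = (x - mean) / sample sd.
z_lambda_g <- (lambda_g - mean(lambda_g)) / sd(lambda_g)

# Compare with the zlambdaG column of the example table.
round(z_lambda_g, 2)
```

Under this assumption the values come out close to the zlambdaG column in the example table, so tokens far from the average of the Q text stand out regardless of the raw magnitude of \(\lambda_G\).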

Value

The function returns a list of two objects. The first is a data frame with one row per token in the Q text, giving the \(\lambda_G\) of the token and of its sentence, sorted in decreasing order of sentence \(\lambda_G\), together with the relative contribution, as a percentage, of each token and each sentence to the final \(\lambda_G\). The second is the raw html or LaTeX code that generates the colour-coded file. If a path is provided for the print argument, the function also saves the colour-coded text as an html or plain text file.
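
As a rough check on how the token and sentence values in the data frame relate, the sketch below sums the token-level \(\lambda_G\) values within a sentence. It uses the rounded figures printed in the Examples section of this page and assumes that the sentence value is the sum of its token values, which the example output appears to bear out; the object names tokens_tbl and sentence_totals are hypothetical.

```r
# Rounded token-level values for sentence 1, copied from the example output.
tokens_tbl <- data.frame(
  sentence_id = 1L,
  lambdaG = c(0.771, 0.220, -1.16, 0.326, -0.0955, 0.991, -0.0210,
              0.0194, 0.0259, -0.252, -0.102, 0.0365, -0.0385)
)

# Assumption: sentence lambdaG is the sum of its token lambdaG values.
sentence_totals <- aggregate(lambdaG ~ sentence_id, data = tokens_tbl, FUN = sum)

# Compare with the sentence_lambdaG column of the example table.
sentence_totals
```

Because the inputs are rounded as printed, the total only approximately matches the sentence_lambdaG column.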

References

Nini, A., Halvani, O., Graner, L., Gherardi, V., & Ishihara, S. (2024). Authorship Verification based on the Likelihood Ratio of Grammar Models. https://arxiv.org/abs/2403.08462v1

Examples

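# Q: the first text, keeping only sentences of at most 10 tokens
q.data <- quanteda::corpus_trim(enron.sample[1], "sentences", max_ntoken = 10) |>
  quanteda::tokens("sentence")
# K: texts 2 to 5; reference: all remaining texts
k.data <- enron.sample[2:5] |> quanteda::tokens("sentence")
ref.data <- enron.sample[6:quanteda::ndoc(enron.sample)] |> quanteda::tokens("sentence")
# Only 2 iterations here instead of the default 30
outputs <- lambdaG_visualize(q.data, k.data, ref.data, r = 2)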
outputs$table
#> # A tibble: 13 × 8
#>    sentence_id token_id t         lambdaG sentence_lambdaG zlambdaG
#>          <int>    <int> <chr>       <dbl>            <dbl>    <dbl>
#>  1           1        1 J          0.771             0.726    1.40 
#>  2           1        2 N          0.220             0.726    0.323
#>  3           1        3 ,         -1.16              0.726   -2.38 
#>  4           1        4 but        0.326             0.726    0.531
#>  5           1        5 that      -0.0955            0.726   -0.297
#>  6           1        6 's         0.991             0.726    1.84 
#>  7           1        7 just      -0.0210            0.726   -0.151
#>  8           1        8 the        0.0194            0.726   -0.072
#>  9           1        9 N          0.0259            0.726   -0.059
#> 10           1       10 it        -0.252             0.726   -0.605
#> 11           1       11 works     -0.102             0.726   -0.31 
#> 12           1       12 .          0.0365            0.726   -0.038
#> 13           1       13 ___EOS___ -0.0385            0.726   -0.185
#> # ℹ 2 more variables: token_contribution <dbl>, sent_contribution <dbl>