Select the most similar texts to a specific text


most_similar(sample, pool, coefficient, n)



sample: A single row of a quanteda dfm representing the sample to match.


pool: A dfm containing all candidate samples from which to select the top n.


coefficient: The similarity coefficient to use. One of "minmax", "cosine", or "Phi".


n: The number of rows to extract from the pool of candidate samples.


The function returns a dfm containing the n rows of the pool that are most similar to the input sample, as measured by the chosen coefficient.
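
To make the selection step concrete, here is a minimal base R sketch of how a top-n choice by similarity could work. It is not the package's implementation: `minmax_sim` is a hypothetical helper, and it assumes that "minmax" means the sum of element-wise minima divided by the sum of element-wise maxima (the Ruzicka similarity). The vectors are hand-built relative frequencies rather than a quanteda dfm.

```r
# Hypothetical helper (an assumption, not part of the package):
# min-max similarity = sum of element-wise minima / sum of element-wise maxima.
minmax_sim <- function(x, y) {
  sum(pmin(x, y)) / sum(pmax(x, y))
}

# Relative frequencies over a 13-word vocabulary, in this order:
# the, cat, sat, on, mat, dog, chair, violence, is, last, refuge, of, incompetent
sample_vec <- c(2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0) / 6
pool <- rbind(
  text2 = c(2, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0) / 6,
  text3 = c(2, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1) / 8
)

# Score every pool row against the sample.
scores <- apply(pool, 1, minmax_sim, y = sample_vec)

# Keep the n highest-scoring rows; drop = FALSE preserves the matrix shape.
n <- 1
top_n <- pool[order(scores, decreasing = TRUE)[seq_len(n)], , drop = FALSE]
rownames(top_n)
```

Under this assumed definition, the row sharing more of the sample's vocabulary scores higher, so it is the one retained when n = 1.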


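# Three short texts: the first is the sample, the other two form the pool
text1 <- "The cat sat on the mat"
text2 <- "The dog sat on the chair"
text3 <- "Violence is the last refuge of the incompetent"
corp <- quanteda::corpus(c(text1, text2, text3))
# Tokenize and build a dfm of relative (proportional) frequencies
d <- quanteda::tokens(corp) |> quanteda::dfm() |> quanteda::dfm_weight(scheme = "prop")
# Select the single pool row most similar to the sample
most_similar(d[1,], d[-1,], coefficient = "minmax", n = 1)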
#> Document-feature matrix of: 1 document, 13 features (61.54% sparse) and 0 docvars.
#>        features
#> docs          the cat       sat        on mat       dog     chair violence is
#>   text2 0.3333333   0 0.1666667 0.1666667   0 0.1666667 0.1666667        0  0
#>        features
#> docs    last
#>   text2    0
#> [ reached max_nfeat ... 3 more features ]