This function offers three algorithms for topic/content masking. To run the masking algorithms, a spacy tokenizer or POS-tagger has to be run first (via spacyr). For more information about the masking algorithms, see Details below.
Usage
contentmask(
  corpus,
  model = "en_core_web_sm",
  algorithm = "POSnoise",
  fw_list = "eng_halvani",
  replace_non_ascii = TRUE
)
Arguments
- corpus
A quanteda corpus object, typically the output of the create_corpus() function.
- model
The spacy model to use. The default is "en_core_web_sm".
- algorithm
A string, either "POSnoise" (default), "frames", or "textdistortion".
- fw_list
The list of function words to use for the textdistortion algorithm. This is either the default ("eng_halvani"), which is the same list of function words used for POSnoise, or a vector of strings where each string is a function word to keep.
- replace_non_ascii
A logical value indicating whether to remove non-ASCII characters (including emojis). The default is TRUE.
Value
A quanteda corpus object containing only functional tokens, depending on the algorithm chosen. The corpus contains the same docvars as the input. Email addresses and URLs are treated like nouns.
Details
The default content masking algorithm applied by this function is POSnoise (Halvani and Graner 2021). This algorithm only works for English. It transforms a text by masking a token with its POS tag if the token is a noun, verb, adjective, adverb, digit, or symbol, and it leaves all other tokens unchanged. POSnoise uses a list of function words for English that also includes frequent words belonging to the masked parts of speech that tend to be mostly functional (e.g. make, recently, well).
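The POSnoise idea can be sketched in base R on a hand-tagged toy sentence. This is only an illustration under stated assumptions, not the package's implementation: the data frame, the mapping from the categories above to Universal POS tags, the placeholder labels, the three-word keep list, and the helper mask_posnoise() are all hypothetical, and the sketch assumes that words on the function-word list are kept rather than masked.

```r
# Toy tokens with hand-written Universal POS tags (illustration only);
# in practice the tags would come from a spacy parse.
tokens <- data.frame(
  token = c("The", "cat", "was", "on", "the", "chair", "and", "it",
            "recently", "moved"),
  pos   = c("DET", "NOUN", "AUX", "ADP", "DET", "NOUN", "CCONJ", "PRON",
            "ADV", "VERB"),
  stringsAsFactors = FALSE
)

# Hypothetical placeholder labels for the masked categories.
placeholders <- c(NOUN = "N", VERB = "V", ADJ = "J", ADV = "B",
                  NUM = "D", SYM = "S")

# Hypothetical stand-in for the function-word list: words listed here are
# kept even when their POS tag belongs to a masked category.
keep_words <- c("make", "recently", "well")

mask_posnoise <- function(tokens, placeholders, keep_words) {
  # A token is masked if its tag is in a masked category
  # and the word itself is not on the keep list.
  maskable <- tokens$pos %in% names(placeholders) &
    !(tolower(tokens$token) %in% keep_words)
  out <- tokens$token
  out[maskable] <- unname(placeholders[tokens$pos[maskable]])
  paste(out, collapse = " ")
}

masked_posnoise <- mask_posnoise(tokens, placeholders, keep_words)
print(masked_posnoise)
```

In this toy run "recently" survives although it is tagged as an adverb, because it is on the keep list, while "cat", "chair", and "moved" are replaced by their labels.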
Another algorithm implemented is Nini's (2023) frames, or frame n-grams. This algorithm does not involve a special list of tokens and can therefore potentially work for any language, provided that the correct spacy model is loaded. It consists in masking tokens using their POS tag only when they are nouns, verbs, or personal pronouns.
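A comparable base R sketch for frames follows, again on hand-tagged toy data. The placeholder labels, the helper mask_frames(), and the use of the English fine-grained tag "PRP" to identify personal pronouns are assumptions made for illustration; the package may identify and label these categories differently.

```r
# Toy tokens with hand-written coarse (pos) and fine-grained (tag) labels;
# in practice these would come from a spacy parse.
tokens <- data.frame(
  token = c("He", "did", "n't", "move", "the", "chair", "quickly"),
  pos   = c("PRON", "AUX", "PART", "VERB", "DET", "NOUN", "ADV"),
  tag   = c("PRP", "VBD", "RB", "VB", "DT", "NN", "RB"),
  stringsAsFactors = FALSE
)

mask_frames <- function(tokens) {
  out <- tokens$token
  # Only nouns, verbs, and personal pronouns are masked; no word list is used.
  out[tokens$pos == "NOUN"] <- "N"
  out[tokens$pos == "VERB"] <- "V"
  # Assumption: personal pronouns are pronouns carrying the fine-grained tag "PRP".
  out[tokens$pos == "PRON" & tokens$tag == "PRP"] <- "P"
  paste(out, collapse = " ")
}

masked_frames <- mask_frames(tokens)
print(masked_frames)
```

Unlike the POSnoise sketch, the adverb "quickly" is left in place here, since adverbs are not among the categories that frames masks.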
Finally, the last algorithm implemented is a version of textdistortion, as originally proposed by Stamatatos (2017). This version of the algorithm is essentially POSnoise without the POS tag information. The default implementation uses the same list of function words that is used for POSnoise. In addition to the function words provided, the function treats all punctuation marks and new line breaks as function words to keep. The basic tokenization is done using spacyr, so the right model for the language being analysed should be selected.
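The list-based logic of textdistortion can be sketched as follows. The token vector, the short function-word list, the "*" placeholder, and the helper mask_textdistortion() are hypothetical choices for illustration and are not taken from the package.

```r
# Toy tokens, as a tokenizer might return them (illustration only).
tokens <- c("The", "cat", "was", "on", "the", "chair", ".", "\n",
            "He", "did", "n't", "move", ".")

# Hypothetical function-word list, in the form accepted by fw_list:
# a vector of strings where each string is a function word to keep.
fw_list <- c("the", "was", "on", "he", "did", "n't")

mask_textdistortion <- function(tokens, fw_list, placeholder = "*") {
  is_fw <- tolower(tokens) %in% tolower(fw_list)
  # Punctuation marks and new line breaks are treated as tokens to keep.
  is_punct_or_newline <- grepl("^[[:punct:]]+$", tokens) | tokens == "\n"
  # Everything else is replaced by a single placeholder, with no POS information.
  ifelse(is_fw | is_punct_or_newline, tokens, placeholder)
}

masked_td <- mask_textdistortion(tokens, fw_list)
print(masked_td)
```

Because no POS tags are involved, every masked token receives the same placeholder, which is the main difference from the POSnoise sketch.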
If you have never used spacyr before, please follow its instructions to set it up and install a model before using this function.
The removal of non-ASCII characters is done using the textclean package.
References
Halvani, Oren & Lukas Graner. 2021. POSNoise: An Effective Countermeasure Against Topic Biases in Authorship Analysis. In Proceedings of the 16th International Conference on Availability, Reliability and Security, 1–12. Vienna, Austria: Association for Computing Machinery. https://doi.org/10.1145/3465481.3470050.
Nini, Andrea. 2023. A Theory of Linguistic Individuality for Authorship Analysis (Elements in Forensic Linguistics). Cambridge, UK: Cambridge University Press.
Stamatatos, Efstathios. 2017. Masking topic-related information to enhance authorship attribution. Journal of the Association for Information Science and Technology. https://doi.org/10.1002/asi.23968.
Examples
if (FALSE) { # \dontrun{
  # A toy text with a new line break, an email address, and a URL
  text <- "The cat was on the chair. He didn't move\ncat@pets.com;\nhttp://quanteda.io/. i.e. a test "
  toy.corpus <- quanteda::corpus(text)
  # Mask content with the default algorithm, then with textdistortion
  contentmask(toy.corpus, algorithm = "POSnoise")
  contentmask(toy.corpus, algorithm = "textdistortion")
} # }