Content masking — contentmask • idiolect

This function offers three algorithms for topic/content masking. To run the masking algorithms, a spacy tokenizer or POS-tagger has to be run first (via spacyr). For more information about the masking algorithms, see Details below.

Usage

contentmask(
  corpus,
  model = "en_core_web_sm",
  algorithm = "POSnoise",
  fw_list = "eng_halvani",
  replace_non_ascii = TRUE
)

Arguments

corpus: A quanteda corpus object, typically the output of the create_corpus() function.
model: The spacy model to use. The default is "en_core_web_sm".
algorithm: A string, either "POSnoise" (default), "frames", or "textdistortion".
fw_list: The list of function words to use for the textdistortion algorithm. This is either the default ("eng_halvani"), which is the same list of function words used for POSnoise, or a vector of strings where each string is a function word to keep.
replace_non_ascii: A logical value indicating whether to remove non-ASCII characters (including emojis). The default is TRUE.

Value

A quanteda corpus object containing only functional tokens, as determined by the algorithm chosen. The corpus contains the same docvars as the input. Email addresses and URLs are treated like nouns.

Details

The default algorithm for content masking that this function applies is POSnoise (Halvani and Graner 2021). This algorithm only works for English. It transforms a text by masking tokens using their POS tag when they are nouns, verbs, adjectives, adverbs, digits, or symbols, while leaving all other tokens unchanged. POSnoise uses a list of function words for English that also includes frequent words belonging to the masked Part of Speech tags that tend to be mostly functional (e.g. make, recently, well).
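The masking rule described above can be illustrated with a minimal base-R sketch. It is not the package's implementation: the tokens and POS tags are hand-assigned rather than produced by spacyr, `function_words` is a tiny hypothetical stand-in for the real list, and the placeholder symbols in `mask_symbols` are assumptions chosen for illustration.

```r
# Toy tokens with hand-assigned Universal POS tags (in practice the tags
# would come from a spacy model via spacyr).
tokens <- data.frame(
  token = c("The", "cat", "recently", "sat", "on", "the", "red", "chair", "."),
  pos   = c("DET", "NOUN", "ADV", "VERB", "ADP", "DET", "ADJ", "NOUN", "PUNCT"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the function word list (the real list is longer).
function_words <- c("the", "on", "recently", "make", "well")

# Content POS tags mapped to illustrative placeholder symbols (assumed).
mask_symbols <- c(NOUN = "N", VERB = "V", ADJ = "J", ADV = "B", NUM = "D", SYM = "S")

posnoise_sketch <- function(token, pos, keep, symbols) {
  is_content <- pos %in% names(symbols)      # token has a maskable POS tag
  is_kept    <- tolower(token) %in% keep     # token is on the function word list
  to_mask    <- is_content & !is_kept
  out <- token
  out[to_mask] <- symbols[pos[to_mask]]      # replace token with its POS placeholder
  unname(out)
}

masked <- posnoise_sketch(tokens$token, tokens$pos, function_words, mask_symbols)
paste(masked, collapse = " ")
# "The N recently V on the J N ."
```

Here recently is tagged as an adverb but survives because it is on the keep list, while red is masked as an adjective.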

Another algorithm implemented is Nini's (2023) frames or frame n-grams. This algorithm does not involve a special list of tokens and can therefore potentially work for any language, provided that the correct spacy model is loaded. It masks tokens using their POS tag only when they are nouns, verbs, or personal pronouns.
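As a rough illustration of how frames differ from POSnoise, the following base-R sketch masks only three categories and uses no function word list. The POS tags are hand-assigned, the placeholder symbols are assumptions, and two simplifications are made for the toy data: every token tagged PRON is treated as a personal pronoun, and auxiliaries are left unmasked. How the package handles these cases may differ.

```r
# Toy tokens with hand-assigned Universal POS tags.
tokens <- data.frame(
  token = c("He", "did", "not", "move", "the", "old", "chair", "."),
  pos   = c("PRON", "AUX", "PART", "VERB", "DET", "ADJ", "NOUN", "PUNCT"),
  stringsAsFactors = FALSE
)

# Only nouns, verbs, and (personal) pronouns are masked; symbols are assumed.
frame_symbols <- c(NOUN = "N", VERB = "V", PRON = "P")

to_mask <- tokens$pos %in% names(frame_symbols)
frame <- tokens$token
frame[to_mask] <- frame_symbols[tokens$pos[to_mask]]
frame <- unname(frame)

paste(frame, collapse = " ")
# "P did not V the old N ."
```

Unlike in the POSnoise rule, the adjective old is left in place because adjectives are not among the masked categories.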

Finally, the last algorithm implemented is a version of textdistortion, as originally proposed by Stamatatos (2017). This version of the algorithm is essentially POSnoise but without POS tag information. The default implementation uses the same list of function words that is used for POSnoise. In addition to the function words provided, the function treats all punctuation marks and line breaks as function words to keep. The basic tokenization is done using spacyr, so the right model for the language being analysed should be selected.
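The keep-or-mask decision described above can be sketched in base R without any POS information. This is an illustration under stated assumptions, not the package's code: a simple regular expression stands in for the spacyr tokenizer, `function_words` is a short hypothetical list, and the single `*` placeholder for masked tokens is an assumption.

```r
text <- "The cat was on the chair.\nHe made 2 moves!"

# Simplified tokenizer: runs of letters/digits, single punctuation marks,
# and line breaks each become one token.
tokens <- regmatches(text, gregexpr("[[:alnum:]]+|[[:punct:]]|\n", text))[[1]]

# Hypothetical function word list; a custom vector like this is the kind of
# value that could be supplied as fw_list.
function_words <- c("the", "was", "on", "he", "made")

# Keep function words, punctuation marks, and line breaks; mask the rest.
keep <- tolower(tokens) %in% function_words |
  grepl("^[[:punct:]]$", tokens) |
  tokens == "\n"
distorted <- ifelse(keep, tokens, "*")

distorted
```

Because no POS tags are consulted, a token is masked purely on the basis of whether it appears in the list, so the choice of fw_list fully determines what survives.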

If you have never used spacyr before, please follow its setup instructions and install a model before using this function.

The removal of non-ASCII characters is done using the textclean package.

References

Halvani, Oren & Lukas Graner. 2021. POSNoise: An Effective Countermeasure Against Topic Biases in Authorship Analysis. In Proceedings of the 16th International Conference on Availability, Reliability and Security, 1–12. Vienna, Austria: Association for Computing Machinery. https://doi.org/10.1145/3465481.3470050.
Nini, Andrea. 2023. A Theory of Linguistic Individuality for Authorship Analysis (Elements in Forensic Linguistics). Cambridge, UK: Cambridge University Press.
Stamatatos, Efstathios. 2017. Masking topic-related information to enhance authorship attribution. Journal of the Association for Information Science and Technology. https://doi.org/10.1002/asi.23968.

Examples

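if (FALSE) { # \dontrun{
library(idiolect)
# Toy text with an email address, a URL, an abbreviation, and line breaks
text <- "The cat was on the chair. He didn't move\ncat@pets.com;\nhttp://quanteda.io/. i.e. a test "
toy.corpus <- quanteda::corpus(text)
# Mask content tokens using POS tags (the default algorithm)
contentmask(toy.corpus, algorithm = "POSnoise")
# Mask content tokens using only the function word list
contentmask(toy.corpus, algorithm = "textdistortion")
} # }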