This function turns a corpus of texts into a quanteda
tokens object of sentences.
Arguments
- corpus
A quanteda corpus object, typically the output of the create_corpus() function or the output of contentmask().
- model
The spacy model to use. The default is "en_core_web_sm".
Details
The function first splits each text into paragraphs at new line markers and then uses spacy to tokenize each paragraph into sentences. It accepts either a plain text corpus or the output of contentmask(). This step is required to prepare the data for lambdaG().
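
The two steps described above can be illustrated with a minimal sketch. This is not the function's actual implementation: the toy texts are hypothetical, and quanteda's built-in rule-based sentence segmenter is used as a stand-in for the spacy model so that the example runs without a spacy installation.

```r
library(quanteda)

# Hypothetical toy input: two short texts, the first with two paragraphs.
texts <- c(
  doc1 = "This is one sentence. Here is another.\nA second paragraph starts here.",
  doc2 = "Only one line here."
)

# Step 1: split each text into paragraphs at new line markers.
paragraphs <- unlist(strsplit(texts, "\n+"))

# Step 2: tokenize each paragraph into sentences.
# Assumption: quanteda's segmenter replaces the spacy model used by the function.
sents <- tokens(paragraphs, what = "sentence")

# Each element of the resulting tokens object holds the sentences of one paragraph.
print(sents)
```

Splitting at new lines first means that a sentence is never allowed to span two paragraphs, whichever segmenter is applied in the second step.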