Creates a corpus object from available sources. The currently available sources are:
a character vector, consisting of one document per element; if the elements are named, these names will be used as document names.
a data.frame (or a tibble tbl_df), whose default document id is a variable identified by docid_field; the text of the document is a variable identified by text_field; and other variables are imported as document-level meta-data. This matches the format of data.frames constructed by the readtext package.
a tm VCorpus or SimpleCorpus class object, with the fixed metadata fields imported as docvars and corpus-level metadata imported as metacorpus information.
a corpus object.
corpus(x, ...)

# S3 method for corpus
corpus(
  x,
  docnames = quanteda::docnames(x),
  docvars = quanteda::docvars(x),
  meta = quanteda::meta(x),
  ...
)

# S3 method for character
corpus(
  x,
  docnames = NULL,
  docvars = NULL,
  meta = list(),
  unique_docnames = TRUE,
  ...
)

# S3 method for data.frame
corpus(
  x,
  docid_field = "doc_id",
  text_field = "text",
  meta = list(),
  unique_docnames = TRUE,
  ...
)

# S3 method for kwic
corpus(x, split_context = TRUE, extract_keyword = TRUE, meta = list(), ...)

# S3 method for Corpus
corpus(x, ...)
Argument | Description |
---|---|
x | a valid corpus source object |
... | not used directly |
docnames | Names to be assigned to the texts. Defaults to the names of the character vector (if any); |
docvars | a data.frame of document-level variables associated with each text |
meta | a named list that will be added to the corpus as corpus-level, user meta-data. This can later be accessed or updated using meta(). |
unique_docnames | logical; if |
docid_field | optional column index of a document identifier; defaults to "doc_id", but if this is not found, then will use the rownames of the data.frame; if the rownames are not set, it will use the default sequence based on |
text_field | the character name or numeric index of the source |
split_context | logical; if |
extract_keyword | logical; if |
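As a minimal sketch of the data.frame defaults described above, the following builds a small data.frame whose columns use the default names "doc_id" and "text". The data and the extra column `year` are hypothetical and only serve the illustration; the example assumes quanteda >= 2.0 is installed.

```r
library(quanteda)

# Hypothetical data: column names follow the documented defaults
# docid_field = "doc_id" and text_field = "text"
dat <- data.frame(
  doc_id = c("speech_a", "speech_b"),
  text = c("First example text.", "Second example text."),
  year = c(2019L, 2020L),            # hypothetical extra variable
  stringsAsFactors = FALSE
)

corp <- corpus(dat)

docnames(corp)   # document names taken from the doc_id column
docvars(corp)    # remaining column(s) imported as document variables
```

If the text lives in a differently named column, it can be pointed to with `text_field`, as the data.frame example further down this page does.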
A corpus class object containing the original texts, document-level variables, document-level metadata, corpus-level metadata, and default settings for subsequent processing of the corpus.
For quanteda >= 2.0, this is a specially classed character vector. It has many additional attributes but you should not access these attributes directly, especially if you are another package author. Use the extractor and replacement functions instead, or else your code is not only going to be uglier, but also likely to break should the internal structure of a corpus object change. Using the accessor and replacement functions ensures that future code to manipulate corpus objects will continue to work.
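A short sketch of what using the extractor and replacement functions can look like, rather than reaching into attributes. The texts, the docvar `grp`, and the meta field `source` are hypothetical names chosen for the illustration; quanteda >= 2.0 is assumed.

```r
library(quanteda)

# Hypothetical two-document corpus with one docvar and one meta field
corp <- corpus(
  c(d1 = "A short text.", d2 = "Another short text."),
  docvars = data.frame(grp = c("x", "y"), stringsAsFactors = FALSE),
  meta = list(source = "hypothetical example")
)

# Extractors
docnames(corp)         # document names
docvars(corp, "grp")   # one document variable
meta(corp, "source")   # one corpus-level meta-data field

# Replacement functions
docvars(corp, "grp") <- c("u", "v")
meta(corp, "source") <- "updated"
```

Code written this way does not depend on how the corpus stores these values internally.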
The texts and document variables of corpus objects can also be accessed using index notation and the $ operator for accessing or assigning docvars. For details, see [.corpus().
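The index notation and `$` operator mentioned above might be used as in the following sketch. The corpus, the docvar `grp`, and the new docvar `flag` are hypothetical; quanteda >= 2.0 is assumed.

```r
library(quanteda)

# Hypothetical three-document corpus with one docvar
corp <- corpus(
  c(d1 = "One.", d2 = "Two.", d3 = "Three."),
  docvars = data.frame(grp = c("a", "b", "a"), stringsAsFactors = FALSE)
)

corp$grp                            # access a docvar with $
corp$flag <- c(TRUE, FALSE, TRUE)   # assign a new docvar with $

sub <- corp[c(1, 3)]                # index notation returns a corpus subset
docnames(sub)
```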
# create a corpus from texts
corpus(data_char_ukimmig2010)
#> Corpus consisting of 9 documents.
#> BNP :
#> "IMMIGRATION: AN UNPARALLELED CRISIS WHICH ONLY THE BNP CAN S..."
#>
#> Coalition :
#> "IMMIGRATION.  The Government believes that immigration has e..."
#>
#> Conservative :
#> "Attract the brightest and best to our country. Immigration h..."
#>
#> Greens :
#> "Immigration. Migration is a fact of life.  People have alway..."
#>
#> Labour :
#> "Crime and immigration The challenge for Britain We will cont..."
#>
#> LibDem :
#> "firm but fair immigration system Britain has always been an ..."
#>
#> [ reached max_ndoc ... 3 more documents ]

# create a corpus from texts and assign meta-data and document variables
summary(corpus(data_char_ukimmig2010,
               docvars = data.frame(party = names(data_char_ukimmig2010))), 5)
#> Corpus consisting of 9 documents, showing 5 documents:
#>
#>          Text Types Tokens Sentences        party
#>           BNP  1125   3280        88          BNP
#>     Coalition   142    260         4    Coalition
#>  Conservative   251    499        15 Conservative
#>        Greens   322    677        21       Greens
#>        Labour   298    680        29       Labour
#>

# import a tm VCorpus
if (requireNamespace("tm", quietly = TRUE)) {
    data(crude, package = "tm")    # load in a tm example VCorpus
    vcorp <- corpus(crude)
    summary(vcorp)

    data(acq, package = "tm")
    summary(corpus(acq), 5)

    vcorp2 <- tm::VCorpus(tm::VectorSource(data_char_ukimmig2010))
    corp <- corpus(vcorp2)
    summary(corp)
}
#> Corpus consisting of 9 documents, showing 9 documents:
#>
#>          Text Types Tokens Sentences author       datetimestamp description
#>           BNP  1125   3280        88     NA 2020-07-27 12:28:04          NA
#>     Coalition   142    260         4     NA 2020-07-27 12:28:04          NA
#>  Conservative   251    499        15     NA 2020-07-27 12:28:04          NA
#>        Greens   322    677        21     NA 2020-07-27 12:28:04          NA
#>        Labour   298    680        29     NA 2020-07-27 12:28:04          NA
#>        LibDem   251    483        14     NA 2020-07-27 12:28:04          NA
#>            PC    77    114         5     NA 2020-07-27 12:28:04          NA
#>           SNP    88    134         4     NA 2020-07-27 12:28:04          NA
#>          UKIP   346    722        26     NA 2020-07-27 12:28:04          NA
#>  heading id language origin
#>       NA  1       en     NA
#>       NA  2       en     NA
#>       NA  3       en     NA
#>       NA  4       en     NA
#>       NA  5       en     NA
#>       NA  6       en     NA
#>       NA  7       en     NA
#>       NA  8       en     NA
#>       NA  9       en     NA
#>

# construct a corpus from a data.frame
dat <- data.frame(letter_factor = factor(rep(letters[1:3], each = 2)),
                  some_ints = 1L:6L,
                  some_text = paste0("This is text number ", 1:6, "."),
                  stringsAsFactors = FALSE,
                  row.names = paste0("fromDf_", 1:6))
dat
#>          letter_factor some_ints              some_text
#> fromDf_1             a         1 This is text number 1.
#> fromDf_2             a         2 This is text number 2.
#> fromDf_3             b         3 This is text number 3.
#> fromDf_4             b         4 This is text number 4.
#> fromDf_5             c         5 This is text number 5.
#> fromDf_6             c         6 This is text number 6.
summary(corpus(dat, text_field = "some_text",
               meta = list(source = "From a data.frame called mydf.")))
#> Corpus consisting of 6 documents, showing 6 documents:
#>
#>      Text Types Tokens Sentences letter_factor some_ints
#>  fromDf_1     6      6         1             a         1
#>  fromDf_2     6      6         1             a         2
#>  fromDf_3     6      6         1             b         3
#>  fromDf_4     6      6         1             b         4
#>  fromDf_5     6      6         1             c         5
#>  fromDf_6     6      6         1             c         6
#>

# construct a corpus from a kwic object
kw <- kwic(data_corpus_inaugural, "southern")
summary(corpus(kw))
#> Corpus consisting of 28 documents, showing 28 documents:
#>
#>                    Text Types Tokens Sentences from   to  keyword context
#>        1797-Adams.1.pre     5      5         1 1802 1802 southern     pre
#>        1825-Adams.1.pre     4      5         1 2427 2427 southern     pre
#>      1861-Lincoln.1.pre     4      5         1   96   96 Southern     pre
#>      1865-Lincoln.1.pre     5      5         1  278  278 southern     pre
#>        1877-Hayes.1.pre     5      5         1  376  376 Southern     pre
#>        1877-Hayes.2.pre     5      5         1  946  946 Southern     pre
#>        1877-Hayes.3.pre     5      5         1 1238 1238 Southern     pre
#>     1881-Garfield.1.pre     5      5         1  988  988 Southern     pre
#>         1909-Taft.1.pre     4      5         1 4026 4026 Southern     pre
#>         1909-Taft.2.pre     5      5         1 4227 4227 Southern     pre
#>         1909-Taft.3.pre     5      5         1 4347 4347 Southern     pre
#>         1909-Taft.4.pre     5      5         1 4532 4532 Southern     pre
#>         1909-Taft.5.pre     5      5         1 4592 4592 Southern     pre
#>   1953-Eisenhower.1.pre     5      5         1 1221 1221 southern     pre
#>       1797-Adams.1.post     5      5         1 1802 1802 southern    post
#>       1825-Adams.1.post     5      5         1 2427 2427 southern    post
#>     1861-Lincoln.1.post     5      5         1   96   96 Southern    post
#>     1865-Lincoln.1.post     5      5         2  278  278 southern    post
#>       1877-Hayes.1.post     5      5         2  376  376 Southern    post
#>       1877-Hayes.2.post     5      5         1  946  946 Southern    post
#>       1877-Hayes.3.post     5      5         1 1238 1238 Southern    post
#>    1881-Garfield.1.post     5      5         2  988  988 Southern    post
#>        1909-Taft.1.post     5      5         2 4026 4026 Southern    post
#>        1909-Taft.2.post     5      5         1 4227 4227 Southern    post
#>        1909-Taft.3.post     5      5         1 4347 4347 Southern    post
#>        1909-Taft.4.post     5      5         1 4532 4532 Southern    post
#>        1909-Taft.5.post     5      5         1 4592 4592 Southern    post
#>  1953-Eisenhower.1.post     5      5         1 1221 1221 southern    post
#>

# from a kwic
kw <- kwic(data_char_sampletext, "econom*", separator = "",
           remove_separators = FALSE) # keep original separators
summary(corpus(kw))
#> Corpus consisting of 10 documents, showing 10 documents:
#>
#>          Text Types Tokens Sentences from  to keyword context
#>   text1.1.pre     2      2         1  313 313 economy     pre
#>   text1.2.pre     2      2         1  390 390 economy     pre
#>   text1.3.pre     2      2         1  516 516 economy     pre
#>   text1.4.pre     2      2         1  941 941 economy     pre
#>   text1.5.pre     2      2         1  976 976 economy     pre
#>  text1.1.post     2      2         1  313 313 economy    post
#>  text1.2.post     3      3         2  390 390 economy    post
#>  text1.3.post     2      2         1  516 516 economy    post
#>  text1.4.post     3      3         2  941 941 economy    post
#>  text1.5.post     3      3         1  976 976 economy    post
#>

#> Corpus consisting of 5 documents, showing 5 documents:
#>
#>        Text Types Tokens Sentences keyword
#>  text1.L313     5      5         1 economy
#>  text1.L390     6      6         2 economy
#>  text1.L516     4      5         1 economy
#>  text1.L941     6      6         2 economy
#>  text1.L976     6      6         1 economy
#>

#> text1.L313
#> " the Irish economy in pursuit "
#> text1.L390
#> " the domestic economy? As we"
#> text1.L516
#> " the domestic economy show the "
#> text1.L941
#> " dislocates the economy. Otherwise those"
#> text1.L976
#> " the domestic economy, stimulating demand"