Take a random sample of documents of the specified size from a corpus or
document-feature matrix, with or without replacement. Works just as sample
works, applied to the documents and their associated document-level variables.
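As a rough base R analogy for the behaviour described above, the sketch below draws document positions with sample and applies the same positions to both the texts and their document-level variables. The objects txt and docvars_df are hypothetical toy stand-ins, not quanteda objects, and this is not quanteda's implementation.

```r
# Toy stand-in for a corpus: named texts plus document-level variables.
# (Hypothetical data for illustration only.)
txt <- c(d1 = "First text.", d2 = "Second text.",
         d3 = "Third text.", d4 = "Fourth text.")
docvars_df <- data.frame(year = c(2001, 2002, 2003, 2004),
                         row.names = names(txt))

set.seed(42)  # make the draw reproducible
idx <- sample(length(txt), size = 2, replace = FALSE)

# Use the same sampled positions for the texts and the document variables,
# so each selected document keeps its own variables.
sampled_txt <- txt[idx]
sampled_docvars <- docvars_df[idx, , drop = FALSE]
```

Because both objects are indexed by the same positions, the names of sampled_txt and the row names of sampled_docvars stay aligned.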
corpus_sample(x, size = ndoc(x), replace = FALSE, prob = NULL, by = NULL, ...)
Argument | Description |
---|---|
x | a corpus object whose documents will be sampled |
size | a positive number, the number of documents to select |
replace | Should sampling be with replacement? |
prob | A vector of probability weights for obtaining the elements of the vector being sampled. |
by | a grouping variable for sampling. Useful for resampling sub-document units such as sentences, for instance by specifying |
... | unused |
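One plausible reading of grouped sampling with by is that units are resampled separately within each group and then recombined. The base R sketch below illustrates that idea for sentences grouped by their source document. The vectors sentences and doc_id are hypothetical, and the approach is an assumption for illustration, not a description of quanteda's internals.

```r
# Hypothetical sentence-level units and the document each one came from
sentences <- c(one.1 = "Sentence one.",
               one.2 = "Sentence two.",
               one.3 = "Third sentence.",
               two.1 = "First sentence, doc2.",
               two.2 = "Second sentence, doc2.")
doc_id <- c("one", "one", "one", "two", "two")

set.seed(1)  # make the draw reproducible

# Split positions by group, resample with replacement inside each group,
# then recombine. sample.int() on the group length avoids the surprise of
# sample(i) treating a single number i as the range 1:i.
groups <- split(seq_along(sentences), doc_id)
idx <- unlist(
  lapply(groups, function(i) i[sample.int(length(i), replace = TRUE)]),
  use.names = FALSE
)
resampled <- sentences[idx]
```

Each group contributes as many units as it started with, so the resampled set keeps the original number of sentences per document.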
A corpus object with number of documents equal to size, drawn from the
corpus x. The returned corpus object will contain all of the meta-data of
the original corpus, and the same document variables for the documents
selected.
# sampling from a corpus
summary(corpus_sample(data_corpus_inaugural, 5))
#> Corpus consisting of 5 documents:
#>
#>           Text Types Tokens Sentences Year President FirstName
#>   1833-Jackson   499   1269        29 1833   Jackson    Andrew
#>  1881-Garfield  1021   3212       111 1881  Garfield  James A.
#> 1801-Jefferson   717   1927        41 1801 Jefferson    Thomas
#>     1873-Grant   552   1475        43 1873     Grant Ulysses S.
#>   1997-Clinton   773   2449       111 1997   Clinton      Bill
#>
#> Source: Gerhard Peters and John T. Woolley. The American Presidency Project.
#> Created: Tue Jun 13 14:51:47 2017
#> Notes: http://www.presidency.ucsb.edu/inaugurals.php

summary(corpus_sample(data_corpus_inaugural, 10, replace = TRUE))
#> Corpus consisting of 10 documents:
#>
#>             Text Types Tokens Sentences Year  President      FirstName
#>      1853-Pierce  1165   3641       104 1853     Pierce       Franklin
#>   1945-Roosevelt   275    647        26 1945  Roosevelt    Franklin D.
#>  1957-Eisenhower   621   1931        92 1957 Eisenhower      Dwight D.
#>     1833-Jackson   499   1269        29 1833    Jackson         Andrew
#>       2013-Obama   814   2317        88 2013      Obama         Barack
#>     1961-Kennedy   566   1566        52 1961    Kennedy        John F.
#>       1797-Adams   826   2578        37 1797      Adams           John
#>        1909-Taft  1437   5822       159 1909       Taft William Howard
#>      1949-Truman   781   2513       116 1949     Truman       Harry S.
#> 1945-Roosevelt.1   275    647        26 1945  Roosevelt    Franklin D.
#>
#> Source: Gerhard Peters and John T. Woolley. The American Presidency Project.
#> Created: Tue Jun 13 14:51:47 2017
#> Notes: http://www.presidency.ucsb.edu/inaugurals.php

# sampling sentences within document
doccorpus <- corpus(c(one = "Sentence one. Sentence two. Third sentence.",
                      two = "First sentence, doc2. Second sentence, doc2."))
sentcorpus <- corpus_reshape(doccorpus, to = "sentences")
texts(sentcorpus)
#>                    one.1                    one.2                    one.3
#>          "Sentence one."          "Sentence two."        "Third sentence."
#>                    two.1                    two.2
#>  "First sentence, doc2." "Second sentence, doc2."

#>                    one.1                  one.1.1                  one.1.2
#>          "Sentence one."          "Sentence one."          "Sentence one."
#>                    two.2                    two.1
#> "Second sentence, doc2."  "First sentence, doc2."
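
When sampling with replacement, the same document can be drawn more than once, and the output above shows repeated documents with a numeric suffix (1945-Roosevelt.1, one.1.1, one.1.2). That naming pattern is consistent with what base R's make.unique produces. The sketch below only demonstrates make.unique on hypothetical name vectors. Whether quanteda uses this function internally is an assumption, not something the text above confirms.

```r
# Hypothetical document names as they might be drawn with replacement
sampled_names <- c("1945-Roosevelt", "1833-Jackson", "1945-Roosevelt")

# make.unique() leaves the first occurrence alone and appends .1, .2, ...
# to later duplicates
unique_names <- make.unique(sampled_names)
print(unique_names)
# [1] "1945-Roosevelt"   "1833-Jackson"     "1945-Roosevelt.1"

# The same rule applied to a sentence-level name drawn three times
sentence_names <- make.unique(c("one.1", "one.1", "one.1"))
print(sentence_names)
# [1] "one.1"   "one.1.1" "one.1.2"
```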