corpus_sample.Rd
Take a random sample of documents of the specified size from a corpus, with
or without replacement. Works just as sample works for the documents and
their associated document-level variables.
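The analogy with sample can be sketched in base R. This is only an illustration of the idea, not quanteda's implementation: the named character vector `texts` and the data frame `docvars` below are hypothetical stand-ins for a corpus and its document-level variables.

```r
# Hypothetical stand-ins for a corpus: texts plus document-level variables.
texts <- c(d1 = "First text.", d2 = "Second text.",
           d3 = "Third text.", d4 = "Fourth text.")
docvars <- data.frame(year = c(2001, 2002, 2003, 2004),
                      row.names = names(texts))

set.seed(42)  # make the draw reproducible

# Sample document positions, as sample() would for any vector.
idx <- sample(seq_along(texts), size = 2, replace = FALSE)

# Apply the same index to the texts and to the document variables,
# so that each sampled text keeps its own metadata.
sampled_texts <- texts[idx]
sampled_docvars <- docvars[idx, , drop = FALSE]
```

Because one index vector drives both subsets, the sampled texts and their variables stay aligned row for row.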
corpus_sample(x, size = ndoc(x), replace = FALSE, prob = NULL, by = NULL, ...)
Argument | Description |
---|---|
x | a corpus object whose documents will be sampled |
size | a positive number, the number of documents to select |
replace | Should sampling be with replacement? |
prob | A vector of probability weights for obtaining the elements of the vector being sampled. |
by | a grouping variable for sampling. Useful for resampling sub-document units such as sentences, for instance by specifying |
... | unused |
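What sampling within a grouping variable means can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's own code: `units` and `group` are hypothetical vectors standing in for sentence-level documents and the parent document each one belongs to.

```r
# Hypothetical sentence-level units and the document each one came from.
units <- c("one.1", "one.2", "one.3", "two.1", "two.2")
group <- c("one", "one", "one", "two", "two")

set.seed(1)  # make the draw reproducible

# Split unit positions by group, then resample within each group.
# sample.int() on the group size avoids sample()'s special case for a
# single number, where sample(5) means sample(1:5).
resampled <- unlist(lapply(split(seq_along(units), group), function(i) {
  i[sample.int(length(i), replace = TRUE)]
}), use.names = FALSE)

boot_units <- units[resampled]
boot_group <- group[resampled]
```

Each group contributes as many units after resampling as it had before, so a resampled corpus of sentences keeps the same number of sentences per original document.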
A corpus object with number of documents equal to size, drawn from the
corpus x. The returned corpus object will contain all of the meta-data of
the original corpus, and the same document variables for the documents
selected.
#> Corpus consisting of 5 documents:
#>
#>             Text Types Tokens Sentences Year  President FirstName
#>   1805-Jefferson   804   2381        45 1805  Jefferson    Thomas
#>      1977-Carter   527   1376        52 1977     Carter     Jimmy
#>     1921-Harding  1169   3721       148 1921    Harding Warren G.
#>      1821-Monroe  1259   4886       129 1821     Monroe     James
#>  1789-Washington   625   1538        23 1789 Washington    George
#>
#> Source: Gerhard Peters and John T. Woolley. The American Presidency Project.
#> Created: Tue Jun 13 14:51:47 2017
#> Notes: http://www.presidency.ucsb.edu/inaugurals.php

#> Corpus consisting of 10 documents:
#>
#>             Text Types Tokens Sentences Year  President     FirstName
#>    1897-McKinley  1232   4361       130 1897   McKinley       William
#>    1901-McKinley   854   2437       100 1901   McKinley       William
#>      1853-Pierce  1165   3641       104 1853     Pierce      Franklin
#>  1957-Eisenhower   621   1931        92 1957 Eisenhower     Dwight D.
#>     1965-Johnson   568   1723        93 1965    Johnson Lyndon Baines
#>        1989-Bush   795   2681       141 1989       Bush        George
#>     1829-Jackson   517   1210        25 1829    Jackson        Andrew
#>  1793-Washington    96    147         4 1793 Washington        George
#>     1861-Lincoln  1075   4006       135 1861    Lincoln       Abraham
#>    1881-Garfield  1021   3212       111 1881   Garfield      James A.
#>
#> Source: Gerhard Peters and John T. Woolley. The American Presidency Project.
#> Created: Tue Jun 13 14:51:47 2017
#> Notes: http://www.presidency.ucsb.edu/inaugurals.php

# sampling sentences within document
corp <- corpus(c(one = "Sentence one. Sentence two. Third sentence.",
                 two = "First sentence, doc2. Second sentence, doc2."))
corpsent <- corpus_reshape(corp, to = "sentences")
texts(corpsent)
#>                    one.1                    one.2                    one.3
#>          "Sentence one."          "Sentence two."        "Third sentence."
#>                    two.1                    two.2
#>  "First sentence, doc2." "Second sentence, doc2."

#>                    one.1                    one.2                  one.1.1
#>          "Sentence one."          "Sentence two."          "Sentence one."
#>                    two.1                    two.2
#>  "First sentence, doc2." "Second sentence, doc2."