quanteda/quanteda: CRAN v4.0
Authors/Creators
- Kenneth Benoit (1)
- Kohei Watanabe
- Haiyan Wang (2)
- Paul Nulty (3)
- Adam Obeng (4)
- Stefan Müller (5)
- Jiong Wei Lua (6)
- Aki Matsuo (7)
- José Tomás Atria
- Olivier Delmarcelle
- Will Lowe (8)
- Pablo Barberá (9)
- Tyler Rinker (10)
- mark padgham (11)
- Christopher Gandrud (12)
- Alec L. Robitaille (13)
- chainsawriot (14)
- Michael Chirico
- Tom Paskhalis (15)
- nicmer
- joh-b
- lindbrook
- hofaichan
- etienne-s
- Thomas J. Leeper
- Stas Malavin (16)
- Michael W. Kearney (17)
- Kevin Reuning
- Keith Hughitt (18)
- Katrin Leinweber (19)
- 1. London School of Economics and Political Science
- 2. Tracr
- 3. Birkbeck, University of London
- 4. Columbia University, London School of Economics
- 5. University College Dublin
- 6. MIT
- 7. Department of Government, University of Essex
- 8. Hertie School
- 9. University of Southern California
- 10. Kangarootime
- 11. @rOpenSci
- 12. @spotify
- 13. @wildlifeevoeco
- 14. @gesistsa
- 15. TCD/LSE/NYU
- 16. Israel Oceanographic and Limnological Research
- 17. Meijer
- 18. National Institutes of Health
- 19. @gitlabhq
Description
quanteda 4.0.0
Changes and additions
- Introduces `tokens_xptr` objects, which extend `tokens` objects with external pointers for greater efficiency. Once `tokens` objects are converted to `tokens_xptr` objects using `as.tokens_xptr()`, the `tokens_*.tokens_xptr()` methods are called automatically.
- Improved C++ functions to allow users to change the number of threads for parallel computing in a more flexible manner using `quanteda_options()`. The value of `threads` can be changed in the middle of an analysis pipeline.
- Makes `"word4"` the default (word) tokeniser, with improved efficiency, language handling, and customisation options.
- Replaced all occurrences of the magrittr `%>%` pipe with the R pipe `|>` introduced in R 4.1, although the `%>%` pipe is still re-exported and therefore available to all users of quanteda without loading any additional packages.
- Added `min_ntoken` and `max_ntoken` to `tokens_subset()` and `dfm_subset()` to extract documents easily based on their number of tokens. This is equivalent to selecting documents using `ntoken()`.
- Added a new argument `apply_if` that allows a tokens-based operation to apply only to documents that meet a logical condition. This argument has been added to `tokens_select()`, `tokens_compound()`, `tokens_replace()`, `tokens_split()`, and `tokens_lookup()`. This is similar to applying `purrr::map_if()` to a tokens object, but is implemented within the function so that it can be performed efficiently in C++.
- Added new arguments `append_key`, `separator`, and `concatenator` to `tokens_lookup()`. These allow tokens matched by dictionary values to be retained with their keys appended to them, separated by `separator`. The `concatenator` argument allows additional control at the lookup stage over tokens concatenated from having matched multi-word dictionary values. (#2324)
- Added a new argument `remove_padding` to `ntoken()` and `ntype()` that allows padding left over from `tokens_remove(x, padding = TRUE)` to be excluded from counts. This changes the number of types previously reported by `ntype()` when pads exist, since pads are counted by default. (#2336)
- Removed the dependency on RcppParallel to improve the stability of the C++ code. This change requires users on Linux-like operating systems to install the Intel TBB library manually to enable parallel computing.
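The new workflow and arguments above can be sketched together as follows. This is a minimal illustration, assuming the quanteda 4.0 API described in this changelog; the document texts and thresholds are invented for the example.

```r
library(quanteda)

corp <- corpus(c(d1 = "Quanteda is fast and flexible for text analysis.",
                 d2 = "Short text."))
toks <- tokens(corp)  # "word4" is now the default word tokeniser

# Convert to an external-pointer tokens object; subsequent tokens_*()
# calls dispatch to the tokens_xptr methods automatically.
xtoks <- as.tokens_xptr(toks)

# The number of threads can now be changed in the middle of a pipeline.
quanteda_options(threads = 2)

# min_ntoken / max_ntoken: keep only documents with at least 3 tokens,
# equivalent to subsetting on ntoken().
xtoks <- tokens_subset(xtoks, min_ntoken = 3)

# apply_if: remove stopwords only in documents with more than 5 tokens.
xtoks <- tokens_select(xtoks, stopwords("en"), selection = "remove",
                       apply_if = ntoken(xtoks) > 5)
```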
Removals
- `bootstrap_dfm()` was removed for character and corpus objects. The correct way to bootstrap sentences is now to tokenise them as sentences and then bootstrap them from the dfm. This is consistent with requiring the user to tokenise objects prior to forming dfms or other "downstream" objects.
- `dfm()` no longer works on character or corpus objects, only on tokens or other dfm objects. This was deprecated in v3 and removed in v4.
- Very old arguments to `dfm()` that were not visible but worked with warnings (such as `stem = TRUE`) are removed.
- Deprecated or renamed arguments formerly passed to `tokens()` that mapped to the v3 arguments with a warning are removed.
- Methods for readtext objects are removed, since these are data.frame objects that are straightforward to convert into a `corpus` object.
- `topfeatures()` no longer works on an fcm object. (#2141)
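Under these removals, the tokenise-first workflow looks like this (a minimal sketch with invented texts):

```r
library(quanteda)

corp <- corpus(c(d1 = "one two three", d2 = "four five"))

# dfm(corp)  # no longer works in v4: dfm() requires a tokens object
dfmat <- dfm(tokens(corp))  # tokenise first, then construct the dfm
```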
Deprecations
Some on-the-fly calculations applied to character or corpus objects that require a temporary tokenisation are now deprecated. This includes:

- `nsentence()` -- use `lengths(tokens(x, what = "sentence"))` instead;
- `ntype()` -- use `ntype(tokens(x))` instead;
- `ntoken()` -- use `ntoken(tokens(x))` instead; and
- `char_ngrams()` -- use `tokens_ngrams(tokens(x))` instead.
- `corpus.kwic()` is deprecated, with the suggestion to form a corpus using `tokens_select(x, window = ...)` instead.
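The deprecated shortcuts above map onto their tokens-based replacements like so (a brief sketch with an invented text):

```r
library(quanteda)

txt <- c(d1 = "First sentence here. Second sentence here.")

lengths(tokens(txt, what = "sentence"))  # replaces nsentence(txt)
ntoken(tokens(txt))                      # replaces ntoken(txt) on characters
ntype(tokens(txt))                       # replaces ntype(txt) on characters
tokens_ngrams(tokens(txt))               # replaces char_ngrams(txt)
```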
Files
- quanteda/quanteda-v4.0.zip (31.4 MB; md5: cbd65406d2a3fe4575d94d924148da31)
Additional details
Related works
- Is supplement to
- Software: https://github.com/quanteda/quanteda/tree/v4.0 (URL)