quanteda/quanteda: CRAN v3.0.0
Creators
- Kenneth Benoit (1)
- Kohei Watanabe
- Haiyan Wang (2)
- Paul Nulty (3)
- Adam Obeng (4)
- Stefan Müller (3)
- Jiong Wei Lua (5)
- Aki Matsuo (6)
- Christian Mueller (1)
- José Tomás Atria
- Will Lowe (7)
- Pablo Barberá (8)
- Christopher Gandrud (9)
- mark padgham (10)
- Tyler Rinker (11)
- Johannes Gruber (12)
- Katrin Leinweber (13)
- Michael Chirico
- Michael W. Kearney (14)
- Stas Malavin (15)
- Thomas J. Leeper
- 1. London School of Economics and Political Science
- 2. Tracr
- 3. University College Dublin
- 4. Columbia University, London School of Economics
- 5. MIT
- 6. Institute for Analytics and Data Science, University of Essex
- 7. Hertie School
- 8. University of Southern California
- 9. @zalando
- 10. @rOpenSci
- 11. Campus Labs
- 12. University of Glasgow
- 13. @gitlabhq
- 14. @MUDSA
- 15. Soil Cryology Lab
Description
Summary
quanteda 3.0 is a major release that improves functionality, completes the modularisation of the package begun in v2.0, further improves function consistency by removing previously deprecated functions, and enhances workflow stability and consistency by deprecating some shortcut steps built into some functions.
Changes and additions
- Modularisation: We have now separated the `textplot_*()` functions from the main package into a separate package quanteda.textplots, and the `textstat_*()` functions from the main package into a separate package quanteda.textstats. This completes the modularisation begun in v2 with the move of the `textmodel_*()` functions to the separate package quanteda.textmodels. quanteda now consists of core functions for textual data processing and management.
- The package dependency structure is now greatly reduced, by eliminating some unnecessary package dependencies, through modularisation, and by addressing complex downstream dependencies in packages such as stopwords. v3 should serve as a more lightweight and more consistent platform for other text analysis packages to build on.
- We have added non-standard evaluation for the `by` and `groups` arguments to access object docvars:
  - The `*_sample()` functions' argument `by`, and `groups` in the `*_group()` functions, now take unquoted document variable (docvar) names directly, similar to the way the `subset` argument works in the `*_subset()` functions.
  - Quoted docvar names no longer work, as these will be evaluated literally.
  - The option `by = "document"` formerly sampled from `docid(x)`, but this functionality is now removed. Instead, use `by = docid(x)` to replicate this functionality.
  - For `groups`, the default is now `docid(x)`, which is now documented more completely. See `?groups` and `?docid`.
- `dfm()` has a new argument, `remove_padding`, for removing the "pads" left behind after removing tokens with `padding = TRUE`. (For other extensive changes to `dfm()`, see "Deprecated" below.)
- `tokens_group()`, formerly internal-only, is now exported.
- `corpus_sample()`, `dfm_sample()`, and `tokens_sample()` now work consistently (#2023).
- The `kwic()` return object structure has been redefined, and built with an option to use a new function `index()` that returns token spans following a pattern search. (#2045 and #2065)
- The punctuation regular expression and that for matching social media usernames have been redefined so that the valid Twitter username `@_` is now counted as a "tag" rather than as "punctuation". (#2049)
- The data object `data_corpus_inaugural` has been updated to include the Biden 2021 inaugural address.
- A new system of validators for input types now provides better argument type and value checking, with more consistent error messages for invalid types or values.
- Upon startup, we now message the console with the Unicode and ICU version information. Because we removed our redefinition of `View()` (see below), the former conflict warning is now gone.
- `as.character.corpus()` now has a `use.names = TRUE` argument, similar to `as.character.tokens()` (but with a different default value).
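The additions above can be sketched in a few lines. This is a minimal illustration and not taken from the release notes: the toy corpus, the docvar name `party`, and the object names are all assumptions made for the example.

```r
library(quanteda)

# Illustrative data: four short documents and one docvar, `party`.
corp <- corpus(
  c(d1 = "Taxes should fall.", d2 = "Taxes should rise.",
    d3 = "Schools need funding.", d4 = "Roads need repair."),
  docvars = data.frame(party = c("A", "B", "A", "B"))
)

# `by` takes the unquoted docvar name: draw one document within each party.
set.seed(42)
samp <- corpus_sample(corp, size = 1, by = party)

# `groups` takes the unquoted docvar name in the same way.
toks <- tokens(corp, remove_punct = TRUE)
toks_grouped <- tokens_group(toks, groups = party)

# Removing tokens with padding = TRUE leaves empty "pads" behind;
# remove_padding = TRUE drops them when the dfm is built.
toks_padded <- tokens_remove(toks, pattern = "should", padding = TRUE)
dfmat <- dfm(toks_padded, remove_padding = TRUE)

# use.names = FALSE returns the texts without document names.
txt_plain <- as.character(corp, use.names = FALSE)
```

Passing `by = "party"` (quoted) would not select the docvar, since quoted values are now evaluated literally.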
Deprecated
The main potentially breaking changes in version 3 relate to the deprecation or elimination of shortcut steps that allowed functions requiring tokens inputs to skip the tokens creation step. We did this to require users to take more direct control of tokenisation options, or to substitute the alternative tokeniser of their choice (and then coerce its output to tokens via `as.tokens()`). This also allows our function behaviour to be more consistent, with each function performing a single task, rather than combining functions (such as tokenisation and constructing a matrix).
The most common example involves constructing a dfm directly from a character or corpus object. Formerly, this would construct a tokens object internally before creating the dfm, and allowed passing arguments to `tokens()` via `...`. This is now deprecated, although still functional with a warning. We strongly encourage either creating a tokens object first, or piping the tokens return to `dfm()` using `%>%`. (See examples below.)
We have also deprecated direct character or corpus inputs to `kwic()`, since this also requires a tokenised input.
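A minimal sketch of the recommended workflow follows. The example texts and object names are illustrative assumptions. The native pipe `|>` (R 4.1 or later) is used here in place of `%>%` so that the block does not depend on how the pipe operator is made available.

```r
library(quanteda)

# Illustrative input texts.
txt <- c(doc1 = "The quick brown fox.", doc2 = "The lazy dog sleeps.")

# Deprecated in v3 (still functional, with a warning):
#   dfm(txt, remove_punct = TRUE)

# Recommended: tokenise explicitly, then build the dfm from the tokens.
toks <- tokens(txt, remove_punct = TRUE)
dfmat <- dfm(toks)

# The same steps written as a pipeline.
dfmat_piped <- txt |>
  tokens(remove_punct = TRUE) |>
  dfm()

# kwic() likewise takes a tokens object rather than character or corpus input.
kw <- kwic(toks, pattern = "fox", window = 2)
```

Keeping tokenisation as its own step makes the tokeniser options explicit and lets the same tokens object feed both `dfm()` and `kwic()`.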
The full listing of deprecations is:
- `dfm.character()` and `dfm.corpus()` are deprecated. Users should create a tokens object first, and input that to `dfm()`.
- `dfm()`: As of version 3, only tokens objects are supported as inputs to `dfm()`. Calling `dfm()` for character or corpus objects is still functional, but issues a warning. Convenience passing of arguments to `tokens()` via `...` for `dfm()` is also deprecated; it is no longer documented, and functions only with a warning. Users should now create a tokens object (using `tokens()`) from character or corpus inputs before calling `dfm()`.
- `kwic()`: As of version 3, only tokens objects are supported as inputs to `kwic()`. Calling `kwic()` for character or corpus objects is still functional, but issues a warning. Passing arguments to `tokens()` via `...` in `kwic()` is now disabled. Users should now create a tokens object (using `tokens()`) from character or corpus inputs before calling `kwic()`.
- Shortcut arguments to `dfm()` are now deprecated. These are still active, with a warning, although they are no longer documented. These are:
  - `stem` -- use `tokens_wordstem()` or `dfm_wordstem()` instead.
  - `select`, `remove` -- use `tokens_select()` / `dfm_select()` or `tokens_remove()` / `dfm_remove()` instead.
  - `dictionary`, `thesaurus` -- use `tokens_lookup()` or `dfm_lookup()` instead.
  - `valuetype`, `case_insensitive` -- these are disabled; for the deprecated arguments that take these qualifiers, they are fixed to the defaults `"glob"` and `TRUE`.
  - `groups` -- use `tokens_group()` or `dfm_group()` instead.
- `texts()` and `texts<-` are deprecated.
  - Use `as.character.corpus()` to turn a corpus into a simple named character vector.
  - Use `corpus_group()` instead of `texts(x, groups = ...)` to aggregate texts by a grouping variable.
  - Use `[<-` instead of `texts()<-` for replacing texts in a corpus object.
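The replacements named in the list above might be combined as in the following sketch. The corpus, the docvar name `origin`, the dictionary key `animal`, and all object names are assumptions made for illustration only.

```r
library(quanteda)

# Illustrative corpus with one docvar, `origin`.
corp <- corpus(
  c(d1 = "Running dogs chase the cats.", d2 = "The cat runs from dogs."),
  docvars = data.frame(origin = c("web", "web"))
)
toks <- tokens(corp, remove_punct = TRUE)

# Instead of dfm(x, remove = ...): remove tokens explicitly.
toks <- tokens_remove(toks, pattern = c("the", "from"))

# Instead of dfm(x, stem = TRUE): stem the tokens first.
toks_stem <- tokens_wordstem(toks)

# Instead of dfm(x, dictionary = ...): look up dictionary keys on the tokens.
dict <- dictionary(list(animal = c("dog*", "cat*")))
dfmat_dict <- dfm(tokens_lookup(toks, dictionary = dict))

# Instead of dfm(x, groups = ...): group after the dfm is built.
dfmat_grouped <- dfm_group(dfm(toks), groups = origin)

# Instead of texts(x) and texts(x, groups = ...).
txt <- as.character(corp)
corp_grouped <- corpus_group(corp, groups = origin)
```

Each deprecated shortcut maps to one explicit call, so the order of operations (removal, stemming, lookup, grouping) is visible in the code rather than fixed inside `dfm()`.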
See note above under "Changes" about the `textplot_*()` and `textstat_*()` functions.
The following functions have been removed:
- all methods for defunct `corpuszip` objects.
- `View()` functions
- `as.wfm()` and `as.DocumentTermMatrix()` (the same functionality is available via `convert()`)
- `metadoc()` and `metacorpus()`
- `corpus_trimsentences()` (replaced by `corpus_trim()`)
- all of the `tortl` functions
- all legacy functions related to the ancient "corpuszip" corpus variant.
- `dfm` objects can no longer be used as a `pattern` in `dfm_select()` (formerly deprecated).
- `dfm_sample()`:
  - no longer has a `margin` argument. Instead, `dfm_sample()` now samples only on documents, the same as `corpus_sample()` and `tokens_sample()`; and
  - no longer works with `by = "document"` -- use `by = docid(x)` instead.
- `dictionary_edit()`, `char_edit()`, and `list_edit()` are removed.
- `dfm_weight()` - formerly deprecated `"scheme"` options are now removed.
- `tokens()` - formerly deprecated options `remove_hyphens` and `remove_twitter` are now removed. (Use `split_hyphens` instead, and the default tokenizer always now preserves Twitter and other social media tags.)
- Special versions of `head()` and `tail()` for corpus, dfm, and fcm objects are now removed, since the base methods work fine for these objects. The main consequence was the removal of the `nf` option from the methods for dfm and fcm objects, which limited the number of features. This can be accomplished using the index operator `[` instead, or for printing, by specifying `print(x, max_nfeat = 6L)` (for instance).
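As a rough sketch of how code that relied on the removed `nf` option or on `dfm_sample()` with `margin` could be adapted, assuming a small made-up dfm (the texts and object names are illustrative, not from the release notes):

```r
library(quanteda)

# Illustrative dfm with three documents.
toks <- tokens(c(d1 = "a b c d e f g h", d2 = "a b c", d3 = "g h i j"))
dfmat <- dfm(toks)

# In place of the removed `nf` option: index the feature dimension with `[`.
dfmat_small <- dfmat[, 1:3]

# For printing only, cap the number of features displayed.
print(dfmat, max_nfeat = 3L)

# dfm_sample() samples documents only; there is no `margin` argument.
set.seed(1)
dfmat_samp <- dfm_sample(dfmat, size = 2)

# In place of by = "document": pass the document ids explicitly.
dfmat_bydoc <- dfm_sample(dfmat, size = 1, by = docid(dfmat))
```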
Bug fixes and stability enhancements
- Fixed a bug causing `topfeatures(x, group = something)` to fail with weighted dfms (#2032).
- `kwic()` is more stable and does not crash when a vector is supplied as the `window` argument (#2008).
- Allow use of multi-threading with more than two threads by fixing `quanteda_options()`.
- Mentions of the now-removed `ngrams` option in `dfm(x, ...)` have been removed from the dfm documentation. (#1990)
- Handling for some early-cycle v2 dfm objects is improved, to ensure that they are updated to the latest object format. (#2097)
Files
Name | Size | Checksum |
---|---|---|
quanteda/quanteda-v3.0.0.zip | 37.4 MB | md5:4258eabbcf76f8d601ebc03721d26e95 |
Additional details
Related works
- Is supplement to
- https://github.com/quanteda/quanteda/tree/v3.0.0 (URL)