Published August 16, 2017 | Version v0.99
Software
Open
kbenoit/quanteda: CRAN release v0.99
Creators
- 1. London School of Economics and Political Science
- 2. University of Cambridge
- 3. LSE
- 4. Columbia University, London School of Economics
- 5. London School of Economics
- 6. Trinity College Dublin
- 7. Department of Methodology, London School of Economics
- 8. University of Southern California
- 9. Harvard IQSS (@IQSS)
- 10. University at Buffalo
- 11. @ZGIS
- 12. Soil Cryology Lab
- 13. University of Pennsylvania
Description
New features
- Improvements and consolidation of methods for detecting multi-word expressions, now active only through textstat_collocations(), which computes only the lambda method for now, but does so accurately and efficiently (#753, #803). This function is still under development and likely to change further.
- Added new quanteda_options that affect the maximum documents and features displayed by the dfm print method (#756).
- ngram formation is now significantly faster, including with skips (skipgrams).
- Improvements to topfeatures():
  - now accepts a groups argument that can be used to generate lists of top (or bottom) features in a group of texts, including by document (#336).
  - new argument scheme that takes the default of (frequency) "count" but also a new "docfreq" value (#408).
- New wrapper phrase() converts whitespace-separated multi-word patterns into a list of patterns. This affects the feature/pattern matching in tokens/dfm_select/remove, tokens_compound, tokens/dfm_lookup, and kwic. phrase() and the associated changes also make the behaviour of using character vectors, lists of characters, dictionaries, and collocation objects for pattern matches far more consistent. (See #820, #787, #740, #837, #836, #838)
- corpus.Corpus() for creating a corpus from a tm Corpus now works with more complex objects that include document-level variables, such as data from the manifestoR package (#849).
- New plot function textplot_keyness() plots term "keyness", the association of words with contrasting classes as measured by textstat_keyness().
- Added corpus constructor for corpus objects (#690).
- Added dictionary constructor for dictionary objects (#690).
- Added a tokens constructor for tokens objects (#690), including updates to tokens() that improve the consistency and efficiency of the tokenization.
- Added new quanteda_options(): language_stemmer and language_stopwords, now used as the defaults in the *_wordstem functions and stopwords(), respectively. dfm() also uses the language_stemmer option when stem = TRUE, rather than hard-wiring in the "english" stemmer (#386).
- Added a new function textstat_frequency() to compile feature frequencies, possibly by groups (#825).
- Added nomatch option to tokens_lookup() and dfm_lookup(), to provide tokens or feature counts for categories not matched to any dictionary key (#496).
- The functions sequences() and collocations() have been removed and replaced by textstat_collocations().
- (Finally) we added "will" to the list of English stopwords (#818).
- dfm objects with one or both dimensions having zero length, and empty kwic objects, now display more appropriately in their print methods (per #811).
- Pattern matches are now implemented more consistently across functions. In functions such as *_select, *_remove, and tokens_compound, features has been replaced by pattern, and in kwic, keywords has been replaced by pattern. These all behave consistently with respect to pattern, which now has a unified single help page and parameter description (#839). See also the new features related to phrase() above.
- We have improved the performance of the C++ routines that handle many of the tokens_* functions using hashed tokens, making some of them 10x faster (#853).
- Upgrades to the dfm_group() function now allow "empty" documents to be created using the fill = TRUE option, for making documents conform to a selection (similar to how dfm_select() works for features, when supplied a dfm as the pattern argument). The groups argument now behaves consistently across the functions where it is used (#854).
- dictionary() now requires its main argument to be a list, not a series of elements that can be used to build a list.
- Some changes to the behaviour of tokens() have improved the behaviour of remove_hyphens = FALSE, which now behaves more correctly regardless of the setting of remove_punct (#887).
- Improved cbind.dfm() function allows cbinding vectors, matrices, and (recyclable) scalars to dfm objects.
- For the underlying methods behind textstat_collocations(), we corrected the word matching, and lambda and z calculation methods, which were slightly incorrect before. We also removed the chi2, G2, and pmi statistics, because these were incorrectly calculated for size > 2.
- LIWC-formatted dictionary import is now robust to terms assigned to missing categories.
- textmodel_NB(x, y, distribution = "Bernoulli") was previously inactive even when this option was set. It has now been fully implemented and tested (#776, #780).
- Separators including rare spacing characters are now handled more robustly by the remove_separators argument in tokens(). See #796.
- Improved memory usage when computing ntoken() and ntype(). (#795)
- Improvements to quanteda_options(), which no longer throws an error when quanteda functions are called directly without attaching the package. In addition, quanteda options can now be set in .Rprofile and will not be overwritten when the options are initialized on attaching the package.
- Fixed a bug in textstat_readability() that wrongly computed the number of words with fewer than 3 syllables in a text; this affected the FOG.NRI and the Linsear.Write measures only.
- Fixed mistakes in the computation of two docfreq schemes: "logave" and "inverseprob".
- Fixed a bug in the handling of multi-thread options where the settings using quanteda_options() did not actually set the number of threads. In addition, we fixed a bug that caused threading to be turned off on macOS (due to a check for a gcc version that is not used for compiling the macOS binaries), which prevented multi-threading from being used at all on that platform.
- Fixed a bug causing failure when functions that use quanteda_options() are called without the namespace or package being attached or loaded (#864).
- Fixed a bug in overloading the View method that caused all named objects in the RStudio/Source pane to be named "x". (#893)
Files
Name | Size |
---|---|
kbenoit/quanteda-v0.99.zip (md5:02d2da0dceea64e3d5e4804e3143798d) | 17.0 MB |
Additional details
Related works
- Is supplement to
- https://github.com/kbenoit/quanteda/tree/v0.99 (URL)