Published July 6, 2020 | Version v2.1.0
Software
Open
quanteda/quanteda: CRAN v2.1.0
Creators
- Kenneth Benoit (1)
- Kohei Watanabe (2)
- Haiyan Wang (3)
- Paul Nulty (4)
- Adam Obeng (5)
- Stefan Müller (4)
- Jiong Wei Lua (6)
- Aki Matsuo (7)
- Christian Mueller (1)
- Will Lowe (8)
- Pablo Barberá (9)
- Christopher Gandrud (10)
- mark padgham (11)
- Tyler Rinker (12)
- José Tomás Atria
- Johannes Gruber (13)
- Katrin Leinweber (14)
- Michael Chirico (15)
- Michael W. Kearney (16)
- Stas Malavin (17)
- Thomas J. Leeper
- hotzeplotz
- Chung-hong Chan (18)
- etienne-s
- hofaichan
- lindbrook
- nicmer
- Tom Paskhalis (19)
- 1. London School of Economics and Political Science
- 2. University of Innsbruck
- 3. Tracr
- 4. University College Dublin
- 5. Columbia University, London School of Economics
- 6. MIT
- 7. Institute for Analytics and Data Science, University of Essex
- 8. Hertie School of Governance
- 9. University of Southern California
- 10. @zalando
- 11. @rOpenSci
- 12. Campus Labs
- 13. University of Glasgow
- 14. @gitlabhq
- 15. @myteksi
- 16. @MUDSA
- 17. Soil Cryology Lab
- 18. MZES, University of Mannheim
- 19. NYU/LSE
Description
Changes
- Added `block_size` to `quanteda_options()` to control the number of documents in blocked tokenization.
- Fixed `print.dictionary2()` to control the printing of nested levels with `max_nkey`. (#1967)
- Added `textstat_summary()` to provide detailed information about dfm, tokens and corpus objects. It will replace `summary()` in future versions.
- Fixed a performance issue causing slowdowns in tokenizing (using the default `what = "word"`) corpora with large numbers of documents that contain social media tags and URLs that needed to be preserved (such as a large corpus of Tweets).
- Updated the (default) "word" tokenizer to preserve hashtags and usernames better with non-ASCII text, and made these patterns user-configurable in `quanteda_options()`. The following are now preserved: "#政治" as well as Weibo-style hashtags such as "#英国首相#".
- `convert(x, to = "data.frame")` now outputs the first column as "doc_id" rather than "document" since "document" is a commonly occurring term in many texts. (#1918)
- Added new methods `char_select()`, `char_keep()`, and `char_remove()` for easy manipulation of character vectors.
- Added `dictionary_edit()` for easy, interactive editing of dictionaries, plus the functions `char_edit()` and `list_edit()` for editing character and list of character objects.
- Added a method to `textplot_wordcloud()` that plots objects from `textstat_keyness()`, to visualize keywords either by comparison or for the target category only.
- Improved the performance of `kwic()`. (#1840)
- Added new `logsmooth` scheme to `dfm_weight()`.
- Added new `textstat_summary()` method, which returns summary information about the tokens/types/features etc. in an object. It also caches summary information so that this can be retrieved on subsequent calls, rather than re-computed.
- Stopped returning `NA` for non-existent features when `n` > `nfeat(x)` in `textstat_frequency(x, n)`. (#1929)
- Fixed a problem in `dfm_lookup()` and `tokens_lookup()` in which an error was caused when no dictionary key returned a single match. (#1946)
- Fixed a bug that caused a `textstat_simil/dist` object converted to a data.frame to drop its `document2` labels. (#1939)
- Fixed a bug causing `dfm_match()` to fail on a dfm that included "pads" (`""`). (#1960)
- Updated the `data_dfm_lbgexample` object using more modern dfm internals.
- Updated `textstat_readability()`, `textstat_lexdiv()`, and `nscrabble()` so that empty texts are not dropped in the result. (#1976)
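
As a rough illustration of two of the changes listed above, the following R sketch exercises the new `char_keep()` and `char_remove()` helpers and checks the renamed first column produced by `convert(x, to = "data.frame")`. It assumes quanteda 2.1.0 or later is installed; the example texts and variable names are made up for illustration and do not come from the release notes.

```r
# Sketch only: assumes quanteda >= 2.1.0 is installed.
library(quanteda)

# Hypothetical example data (not taken from the release notes)
words <- c("apple", "banana", "cherry")

# char_keep() and char_remove() wrap char_select(); the default
# valuetype is "glob", spelled out here for clarity
kept    <- char_keep(words, "a*", valuetype = "glob")
dropped <- char_remove(words, "a*", valuetype = "glob")

# Hypothetical two-document corpus, tokenized with the default "word" tokenizer
txt  <- c(d1 = "A short example text", d2 = "Another short text")
toks <- tokens(txt)

# Converting a dfm to a data.frame: the first column holds the document
# identifiers and, per the change above, is named "doc_id"
df <- convert(dfm(toks), to = "data.frame")
names(df)[1]
```

Code that previously relied on a column named "document" in the converted data.frame would need to refer to "doc_id" instead.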
Files
Name | Size |
---|---|
quanteda/quanteda-v2.1.0.zip (md5:d47f7a5422db03c600904e1c7e9a4828) | 38.4 MB |
Additional details
Related works
- Is supplement to: https://github.com/quanteda/quanteda/tree/v2.1.0 (URL)