quanteda/quanteda: CRAN v2.1.0

doi:10.5281/zenodo.3931217

Published July 6, 2020 | Version v2.1.0

Software Open

1. London School of Economics and Political Science
2. University of Innsbruck
3. Tracr
4. University College Dublin
5. Columbia University, London School of Economics
6. MIT
7. Institute for Analytics and Data Science, University of Essex
8. Hertie School of Governance
9. University of Southern California
10. @zalando
11. @rOpenSci
12. Campus Labs
13. University of Glasgow
14. @gitlabhq
15. @myteksi
16. @MUDSA
17. Soil Cryology Lab
18. MZES, University of Mannheim
19. NYU/LSE

Changes

Added block_size to quanteda_options() to control the number of documents in blocked tokenization.
Fixed print.dictionary2() to control the printing of nested levels with max_nkey. (#1967)
Added textstat_summary() to provide detailed information about dfm, tokens and corpus objects. It will replace summary() in future versions.
Fixed a performance issue that slowed tokenization (using the default what = "word") of corpora with large numbers of documents containing social media tags and URLs that needed to be preserved (such as a large corpus of Tweets).
Updated the (default) "word" tokenizer to preserve hashtags and usernames better with non-ASCII text, and made these patterns user-configurable in quanteda_options(). The following are now preserved: "#政治" as well as Weibo-style hashtags such as "#英国首相#".
convert(x, to = "data.frame") now outputs the first column as "doc_id" rather than "document" since "document" is a commonly occurring term in many texts. (#1918)
Added new methods char_select(), char_keep(), and char_remove() for easy manipulation of character vectors.
Added dictionary_edit() for easy, interactive editing of dictionaries, plus the functions char_edit() and list_edit() for editing character and list of character objects.
Added a method to textplot_wordcloud() that plots objects from textstat_keyness(), to visualize keywords either by comparison or for the target category only.
Improved the performance of kwic() (#1840).
Added new logsmooth scheme to dfm_weight().
Added new textstat_summary() method, which returns summary information about the tokens, types, features, etc. in an object. It also caches this summary information so that it can be retrieved on subsequent calls rather than recomputed.
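
A few of the changes listed above can be sketched in R. This is only an illustrative example, assuming quanteda 2.1.0 or later is installed; the object names (fruits, dfmat, df) and the toy texts are made up for the illustration and do not come from the release notes.

```r
# Illustrative sketch only; assumes quanteda >= 2.1.0 is installed.
library(quanteda)

# char_keep() / char_remove(): filter a character vector by pattern.
# The toy vector is hypothetical; glob matching is the default valuetype.
fruits  <- c("apple", "banana", "cherry")
kept    <- char_keep(fruits, "a*")    # elements matching the glob pattern
removed <- char_remove(fruits, "a*")  # elements not matching the pattern

# convert(): the document identifier column is now "doc_id", not "document".
dfmat <- dfm(tokens(c(d1 = "a document about text", d2 = "another text")))
df <- convert(dfmat, to = "data.frame")
names(df)[1]

# dfm_weight(): apply the new "logsmooth" weighting scheme.
dfmat_weighted <- dfm_weight(dfmat, scheme = "logsmooth")
```

Because the identifier column is called "doc_id", a feature that happens to be named "document" (as in the toy text above) no longer clashes with it.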

Bug fixes and stability enhancements

Stopped returning NA for non-existent features when n > nfeat(x) in textstat_frequency(x, n). (#1929)
Fixed a problem in dfm_lookup() and tokens_lookup() in which an error was caused when no dictionary key returned a single match (#1946).
Fixed a bug that caused a textstat_simil/dist object converted to a data.frame to drop its document2 labels (#1939).
Fixed a bug causing dfm_match() to fail on a dfm that included "pads" (""). (#1960)
Updated the data_dfm_lbgexample object using more modern dfm internals.
Updated textstat_readability(), textstat_lexdiv(), and nscrabble() so that empty texts are not dropped from the result. (#1976)

Files

Files (38.4 MB)

Name	Size
quanteda/quanteda-v2.1.0.zip (md5:d47f7a5422db03c600904e1c7e9a4828)	38.4 MB

Additional details

Is supplement to: https://github.com/quanteda/quanteda/tree/v2.1.0 (URL)

	All versions	This version
Views	4,264	59
Downloads	326	2
Data volume	11.0 GB	76.8 MB
