Published July 6, 2020 | Version v2.1.0
Software Open

quanteda/quanteda: CRAN v2.1.0

  • 1. London School of Economics and Political Science
  • 2. University of Innsbruck
  • 3. Tracr
  • 4. University College Dublin
  • 5. Columbia University, London School of Economics
  • 6. MIT
  • 7. Institute for Analytics and Data Science, University of Essex
  • 8. Hertie School of Governance
  • 9. University of Southern California
  • 10. @zalando
  • 11. @rOpenSci
  • 12. Campus Labs
  • 13. University of Glasgow
  • 14. @gitlabhq
  • 15. @myteksi
  • 16. @MUDSA
  • 17. Soil Cryology Lab
  • 18. MZES, University of Mannheim
  • 19. NYU/LSE

Description

Changes

  • Added block_size to quanteda_options() to control the number of documents in blocked tokenization.
  • Fixed print.dictionary2() to control the printing of nested levels with max_nkey. (#1967)
  • Added textstat_summary() to provide detailed information about dfm, tokens and corpus objects. It will replace summary() in future versions.
  • Fixed a performance issue that slowed tokenization (using the default what = "word") of corpora with large numbers of documents containing social media tags and URLs that need to be preserved (such as a large corpus of Tweets).
  • Updated the (default) "word" tokenizer to preserve hashtags and usernames better with non-ASCII text, and made these patterns user-configurable in quanteda_options(). The following are now preserved: "#政治" as well as Weibo-style hashtags such as "#英国首相#".
  • convert(x, to = "data.frame") now outputs the first column as "doc_id" rather than "document" since "document" is a commonly occurring term in many texts. (#1918)
  • Added new methods char_select(), char_keep(), and char_remove() for easy manipulation of character vectors.
  • Added dictionary_edit() for easy, interactive editing of dictionaries, plus the functions char_edit() and list_edit() for editing character and list of character objects.
  • Added a method to textplot_wordcloud() that plots objects from textstat_keyness(), to visualize keywords either by comparison or for the target category only.
  • Improved the performance of kwic() (#1840).
  • Added new logsmooth scheme to dfm_weight().
  • Added new textstat_summary() method, which returns summary information about the tokens, types, features, etc. in an object. It also caches this summary information so that subsequent calls can retrieve it rather than recompute it.

Bug fixes and stability enhancements

  • Stopped returning NA for non-existent features when n > nfeat(x) in textstat_frequency(x, n). (#1929)
  • Fixed a problem in dfm_lookup() and tokens_lookup() in which an error was caused when no dictionary key returned a single match (#1946).
  • Fixed a bug that caused a textstat_simil/dist object converted to a data.frame to drop its document2 labels (#1939).
  • Fixed a bug causing dfm_match() to fail on a dfm that included "pads" (""). (#1960)
  • Updated the data_dfm_lbgexample object using more modern dfm internals.
  • Updated textstat_readability(), textstat_lexdiv(), and nscrabble() so that empty texts are not dropped in the result. (#1976)

Files

quanteda/quanteda-v2.1.0.zip

Files (38.4 MB)

Name: quanteda/quanteda-v2.1.0.zip
Size: 38.4 MB
md5:d47f7a5422db03c600904e1c7e9a4828

Additional details

Related works