Published April 7, 2023 | Version v3.3
Software
Open
quanteda/quanteda: CRAN v3.3.0
Creators
- Kenneth Benoit (1)
- Kohei Watanabe
- Haiyan Wang (2)
- Paul Nulty (3)
- Adam Obeng (4)
- Stefan Müller (5)
- Jiong Wei Lua (6)
- Aki Matsuo (7)
- Christian Mueller (1)
- José Tomás Atria
- odelmarcelle
- Will Lowe (8)
- Pablo Barberá (9)
- Christopher Gandrud (10)
- mark padgham (11)
- Tyler Rinker (12)
- James Baird
- Katrin Leinweber (13)
- Kevin Reuning
- Michael Chirico
- Michael W. Kearney (14)
- Stas Malavin (15)
- Thomas J. Leeper
- hotzeplotz
- chainsawriot (16)
- etienne-s
- hofaichan
- lindbrook
- mmzmm
- nicmer
- 1. London School of Economics and Political Science
- 2. Tracr
- 3. Birkbeck, University of London
- 4. Columbia University, London School of Economics
- 5. University College Dublin
- 6. MIT
- 7. Institute for Analytics and Data Science, University of Essex
- 8. Hertie School
- 9. University of Southern California
- 10. @spotify
- 11. @rOpenSci
- 12. Anthology
- 13. @gitlabhq
- 14. @AwareHQ
- 15. Israel Oceanographic and Limnological Institute
- 16. GESIS
Description
Changes and additions
- Implements a `"word4"` tokeniser based on new RBBI (RuleBasedBreakIterator) rules. The rules are defined in a new .yml file that users can edit, and the defaults represent a significant improvement in pattern handling for words, sentences, and other forms of patterns. The rules are customised from the ICU rules for breaks; both the standard and the customised rules are now found in the `breakrules/` system folder, so that they could, in principle, be modified by the user.

Other minor changes:
- changes how elapsed time is recorded, by creating a global environment to record it in (aaa.R)
- improves several of the R-coded patterns that apply to `"word2"`:
  - the hashtag pattern (`pattern_hashtag`)
  - the separator pattern (by adding `\\p{M}`)
  - the URL pattern
- creates a new `tokens_restore()`, implemented in C++, to replace the older `preserve_special()` that rejoined splits created by the default stringi tokeniser machinery
- makes some technical improvements to internal tokenisation functions, such as moving the ellipsis to the end of the function, to allow more modularity in developing future tokenisers

- `dfm_group()` now works correctly with an empty dfm (#2225).
- `convert(x, to = "stm")` is no longer vulnerable to large numbers of removed features, as in #2189.
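To see what the `"word4"` tokeniser changes in practice, one option is to tokenise the same text with it and with the default `what = "word"` setting and compare the results. The snippet below is a minimal sketch rather than an excerpt from the package documentation: the sample sentence is made up, and it assumes an installed quanteda version that accepts `what = "word4"` (v3.3 or later). In later releases the default may already use the same rules, in which case the two results will simply match.

```r
library(quanteda)

# Made-up sample text containing a URL, a hashtag and an abbreviation,
# which are the kinds of patterns the release notes mention.
txt <- c(d1 = "Visit https://quanteda.io for #textanalysis tips, e.g. on tokenisation.")

# Tokenise with the default word tokeniser of the installed version.
toks_default <- tokens(txt, what = "word")

# Tokenise with the RBBI-based tokeniser named in the release notes.
toks_word4 <- tokens(txt, what = "word4")

# Inspect both results side by side.
print(as.list(toks_default))
print(as.list(toks_word4))

# List tokens that appear under one tokeniser but not the other.
print(setdiff(as.list(toks_word4)$d1, as.list(toks_default)$d1))
print(setdiff(as.list(toks_default)$d1, as.list(toks_word4)$d1))
```

If both `setdiff()` calls return empty vectors, the two settings split this particular sentence identically in the installed version.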
Files
Name | Size | Checksum |
---|---|---|
quanteda/quanteda-v3.3.zip | 37.7 MB | md5:180aa094148fc8269cb4d4f7627451c8 |
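After downloading the archive, its integrity can be checked against the MD5 checksum listed above. The following is a sketch only: `verify_md5()` is a hypothetical helper, not part of quanteda, and the local file name `quanteda-v3.3.zip` is an assumption about where the download was saved. It relies on `tools::md5sum()`, which ships with base R.

```r
# Hypothetical helper (not part of quanteda): returns TRUE when the file's
# MD5 digest matches the expected value, with or without an "md5:" prefix.
verify_md5 <- function(path, expected) {
  actual <- unname(tools::md5sum(path))
  expected <- sub("^md5:", "", expected)
  identical(tolower(actual), tolower(expected))
}

# Assumed local file name; adjust to wherever the archive was saved.
archive <- "quanteda-v3.3.zip"

if (file.exists(archive)) {
  print(verify_md5(archive, "md5:180aa094148fc8269cb4d4f7627451c8"))
}
```

A result of `FALSE` would suggest an incomplete or altered download.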
Additional details
Related works
- Is supplement to: https://github.com/quanteda/quanteda/tree/v3.3 (URL)