3688825
doi
10.5281/zenodo.3688825
oai:zenodo.org:3688825
Kohei Watanabe
University of Innsbruck
Haiyan Wang
Tracr
Paul Nulty
University College Dublin
Adam Obeng
Columbia University, London School of Economics
Stefan Müller
University College Dublin
Jiong Wei Lua
MIT
Aki Matsuo
Institute for Analytics and Data Science, University of Essex
Christian Mueller
London School of Economics and Political Science
Will Lowe
Hertie School of Governance
Pablo Barberá
University of Southern California
Tyler Rinker
Campus Labs
Mark Padgham
@rOpenSci
Christopher Gandrud
@zalando
José Tomás Atria
Tom Paskhalis
NYU/LSE
nicmer
lindbrook
hofaichan
etienne-s
Chung-hong Chan
MZES, University of Mannheim
hotzeplotz
Thomas J. Leeper
Stas Malavin
Soil Cryology Lab
Michael W. Kearney
@MUDSA
Michael Chirico
@myteksi
Katrin Leinweber
@TIBHannover
Johannes Gruber
University of Glasgow
quanteda/quanteda: CRAN v2.0.0
Kenneth Benoit
London School of Economics and Political Science
url:https://github.com/quanteda/quanteda/tree/v2.0.0
info:eu-repo/semantics/openAccess
Other (Open)
<p><strong>quanteda</strong> 2.0 introduces some major changes, detailed here.</p>
What's new in v2.0
<ol>
<li><p>New corpus object structure.</p>
<p>The internals of the corpus object have been redesigned and are now based around a character vector, with meta- and system-data stored in attributes. The existing extractor and replacement functions have all been updated to work with the new structure, so if you were already using them you should not notice the change. Docvars are now handled separately from the texts, in the same way that docvars are handled for tokens objects.</p>
</li>
<li><p>New metadata handling.</p>
<p>Corpus-level metadata is now inserted in a user metadata list via <code>meta()</code> and <code>meta&lt;-()</code>. <code>metacorpus()</code> is kept as a synonym for <code>meta()</code>, for backwards compatibility. Additional system-level corpus information is also recorded, but automatically when an object is created.</p>
<p>Document-level metadata is deprecated, and now all document-level information is simply a "docvar". For backward compatibility, <code>metadoc()</code> is kept and will insert document variables (docvars) with the name prefixed by an underscore.</p>
</li>
<li><p>Corpus objects now store default summary statistics for efficiency. When these are present, <code>summary.corpus()</code> retrieves them rather than computing them on the fly.</p>
</li>
<li><p>New index operators for core objects. The main change here is to redefine the <code>$</code> operator for corpus, tokens, and dfm objects (all objects that retain docvars) to allow this operator to access single docvars by name. Some other index operators have been redefined as well, such as <code>[.corpus</code> returning a slice of a corpus, and <code>[[.corpus</code> returning the texts from a corpus.</p>
<p>See the full details at <a href="https://github.com/quanteda/quanteda/wiki/indexing_core_objects">https://github.com/quanteda/quanteda/wiki/indexing_core_objects</a>.</p>
</li>
<li><p><code>*_subset()</code> functions.</p>
<p>The <code>subset</code> argument now must be logical, and the <code>select</code> argument has been removed. (This is part of <code>base::subset()</code> but has never made sense, either in <strong>quanteda</strong> or <strong>base</strong>.)</p>
</li>
<li><p>Return format from <code>textstat_simil()</code> and <code>textstat_dist()</code>.</p>
<p>Now defaults to a sparse matrix from the <strong>Matrix</strong> package, but coercion methods are provided for <code>as.data.frame()</code>, to make these functions return a data.frame just like the other textstat functions. Additional coercion methods are provided for <code>as.dist()</code>, <code>as.simil()</code>, and <code>as.matrix()</code>.</p>
</li>
<li><p>Settings functions (and related slots and object attributes) are gone. They are replaced by a new <code>meta(x, type = "object")</code> that records object-specific meta-data, including settings such as the <code>n</code> for tokens (to record the <code>ngrams</code>).</p>
</li>
<li><p>All included data objects are upgraded to the new formats. This includes the three corpus objects, the single dfm data object, and the LSD 2015 dictionary object.</p>
</li>
<li><p>New print methods for core objects (corpus, tokens, dfm, dictionary) now exist, each with new global options to control the number of documents shown, as well as the length of a text snippet (corpus), the tokens (tokens), dfm cells (dfm), or keys and values (dictionary). As with the extended printing options for dfm objects, printing a corpus now shows brief previews of the texts.</p>
</li>
<li><p>All textmodels and related functions have been moved to a new package <strong>quanteda.textmodels</strong>. This makes them easier to maintain and update, and keeps the size of the core package down.</p>
</li>
<li><p><strong>quanteda</strong> v2 implements major changes to the <code>tokens()</code> constructor. These are designed to simplify the code and its maintenance in <strong>quanteda</strong>, to allow users to work with other (external) tokenizers, and to improve consistency across the tokens processing options. Changes include:</p>
<ul>
<li><p>A new method <code>tokens.list(x, ...)</code> constructs a <code>tokens</code> object from a named list of character vectors, allowing users to tokenize texts using some other function (or package) such as <code>tokenize_words()</code>, <code>tokenize_sentences()</code>, or <code>tokenize_tweets()</code> from the <strong>tokenizers</strong> package, or the list returned by <code>spacyr::spacy_tokenize()</code>. Any tokenizer can be used, as long as it returns a named list of character vectors. With <code>tokens.list()</code>, all tokens processing (<code>remove_*</code>) options can be applied, or the list can be converted directly to a <code>tokens</code> object without processing using <code>as.tokens.list()</code>.</p>
</li>
<li><p>All tokens options are now <em>intervention</em> options, to split or remove things that by default are not split or removed. All <code>remove_*</code> options to <code>tokens()</code> now remove them from tokens objects by calling <code>tokens.tokens()</code>, after constructing the object. "Pre-processing" is now actually post-processing using <code>tokens_*()</code> methods internally, after a conservative tokenization on token boundaries. This both improves performance and improves consistency in handling special characters (e.g. Twitter characters) across different tokenizer engines. (#1503, #1446, #1801)</p>
</li>
</ul>
<p>Note that <code>tokens.tokens()</code> will remove what is found, but cannot "undo" a removal -- for instance it cannot replace missing punctuation characters if these have already been removed.</p>
<ul>
<li><p>The option <code>remove_hyphens</code> is removed and deprecated, but replaced by <code>split_hyphens</code>. This preserves infix (internal) hyphens rather than splitting them. This behaviour is implemented in both the <code>what = "word"</code> and <code>what = "word2"</code> tokenizer options. This option is <code>FALSE</code> by default.</p>
</li>
<li><p>The option <code>remove_twitter</code> has been removed. The new <code>what = "word"</code> is a smarter tokenizer that preserves social media tags, URLs, and email addresses. "Tags" are defined as valid social media hashtags and usernames (using Twitter rules for validity). For these, the <code>#</code> and <code>@</code> characters are kept rather than removed as punctuation, even if <code>remove_punct = TRUE</code>.</p>
</li>
</ul>
</li>
</ol>
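<p>The changes above can be illustrated with a minimal sketch. It assumes <strong>quanteda</strong> 2.0 or later is installed; the texts, the <code>year</code> docvar, the <code>"source"</code> metadata field, and the pre-tokenized list <code>ext</code> are all hypothetical and chosen only for illustration.</p>

```r
library(quanteda)

# Hypothetical two-document corpus with one document variable.
corp <- corpus(
  c(d1 = "A well-known example.",
    d2 = "Write to someone@example.com about #rstats."),
  docvars = data.frame(year = c(2019, 2020))
)

# `$` reads a single docvar by name.
years <- corp$year

# Corpus-level user metadata goes through meta() and meta<-().
meta(corp, "source") <- "hypothetical example"

# The subset argument of *_subset() must evaluate to a logical vector.
corp_recent <- corpus_subset(corp, year > 2019)

# tokens.list(): build a tokens object from a named list of character
# vectors, such as the output of an external tokenizer.
ext <- list(d1 = c("A", "well-known", "example"),
            d2 = c("#rstats", "rocks"))
toks_ext <- tokens(ext)

# Infix hyphens are preserved by default; split_hyphens = TRUE splits them.
toks_split <- tokens("A well-known example.", split_hyphens = TRUE)
```

<p>Printing <code>toks_ext</code> and <code>toks_split</code> side by side is a quick way to see how hyphenated words are handled with and without <code>split_hyphens</code>.</p>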
New features
<ul>
<li>Changed the default value of the <code>size</code> argument in <code>dfm_sample()</code> to the number of features, not the number of documents. (#1643)</li>
<li>Fixed a few CRAN-related issues (compiler warnings on Solaris and encoding warnings on r-devel-linux-x86_64-debian-clang).</li>
<li>Added <code>startpos</code> and <code>endpos</code> arguments to <code>tokens_select()</code>, for selecting on token positions relative to the start or end of the tokens in each document. (#1475)</li>
<li>Added a <code>convert()</code> method for corpus objects, to convert them into data.frame or json formats.</li>
<li>Added a <code>spacy_tokenize()</code> method for corpus objects, to provide direct access via the <strong>spacyr</strong> package.</li>
</ul>
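<p>As a rough sketch of two of the features listed above, the following selects tokens by position and converts a corpus to a data.frame. It assumes <strong>quanteda</strong> 2.0 or later; the example text and object names are hypothetical.</p>

```r
library(quanteda)

# Hypothetical one-document example.
corp <- corpus(c(d1 = "one two three four five"))
toks <- tokens(corp)

# Keep every token ("*") from the second position up to the
# second-to-last position; negative positions count from the end.
toks_mid <- tokens_select(toks, "*", startpos = 2, endpos = -2)

# convert() on a corpus returns the texts and docvars as a data.frame.
corp_df <- convert(corp, to = "data.frame")
```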
Behaviour changes
<ul>
<li>Added a <code>force = TRUE</code> option and error checking for the situations of applying <code>dfm_weight()</code> or <code>dfm_group()</code> to a dfm that has already been weighted. (#1545) The function <code>textstat_frequency()</code> now allows passing this argument to <code>dfm_group()</code> via <code>...</code>. (#1646)</li>
<li><code>textstat_frequency()</code> now has a new argument for resolving ties when ranking term frequencies, defaulting to the "min" method. (#1634)</li>
<li>New docvars accessor and replacement functions are available for corpus, tokens, and dfm objects via <code>$</code>. (See Index Operators for Core Objects above.)</li>
<li><code>textstat_entropy()</code> now produces a data.frame that is more consistent with other <code>textstat</code> methods. (#1690)</li>
</ul>
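<p>The new check on re-weighting can be sketched as follows, assuming <strong>quanteda</strong> 2.0 or later. The tiny dfm is hypothetical, and the second weighting scheme is arbitrary; it only demonstrates that an already weighted dfm is refused unless <code>force = TRUE</code> is given.</p>

```r
library(quanteda)

# Hypothetical two-document dfm of raw counts.
dfmat <- dfm(tokens(c(d1 = "a b b c", d2 = "a c c c")))

# First weighting: convert counts to within-document proportions.
dfmat_prop <- dfm_weight(dfmat, scheme = "prop")

# Weighting an already weighted dfm raises an error by default;
# capture the message instead of stopping.
refused <- tryCatch(
  dfm_weight(dfmat_prop, scheme = "boolean"),
  error = function(e) conditionMessage(e)
)

# force = TRUE overrides the check.
dfmat_forced <- dfm_weight(dfmat_prop, scheme = "boolean", force = TRUE)
```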
Bug fixes and stability enhancements
<ul>
<li>docnames are now enforced to be character (formerly, they could be numeric for some objects).</li>
<li>docnames are now enforced to be strictly unique for all object classes.</li>
<li>Grouping operations in <code>tokens_group()</code> and <code>dfm_group()</code> are more robust to using multiple grouping variables, and preserve these correctly as docvars in the new dfm. (#1809)</li>
<li>Some fixes to documented ... objects in two functions that were previously causing CRAN check failures on the release of 1.5.2.</li>
</ul>
Other improvements
<ul>
<li>All of the (three) included corpus objects have been cleaned up and augmented with improved meta-data and docvars. The inaugural speech corpus, for instance, now includes the President's political party affiliation.</li>
</ul>
Zenodo
2020-02-26
info:eu-repo/semantics/other
596731
v2.0.0
1680907735.417881
35880621
md5:f2e3a90141e116350930a63a0703fa04
https://zenodo.org/records/3688825/files/quanteda/quanteda-v2.0.0.zip
public
https://github.com/quanteda/quanteda/tree/v2.0.0
Is supplement to
url
10.5281/zenodo.596731
isVersionOf
doi