rspeer/wordfreq: v3.0

Robyn Speer

doi:10.5281/zenodo.7199437

Published September 26, 2022 | Version v3.0.2

Software Open

Affiliation: Elemental Cognition

v3.0: The "handle numbers better" release

Previously, wordfreq grouped all digit sequences of the same 'shape', with
length 2 or more, into a single token and returned the frequency of that
token, which vastly overestimated the frequency of any individual number.

Now it distributes the frequency over all numbers of that shape, with an
estimated distribution that allows for Benford's law (lower numbers are more
frequent) and a special frequency distribution for 4-digit numbers that look
like years (2010 is more frequent than 1020).
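
As a rough illustration of the idea, the sketch below splits the frequency of one digit 'shape' across the numbers that share it, weighting the leading digit by Benford's law and treating the remaining digits as equally likely. It is not wordfreq's actual code: the function names are hypothetical, the uniform treatment of trailing digits is an assumption, and the special distribution for year-like 4-digit numbers is not modeled.

```python
import math


def benford_first_digit(d: int) -> float:
    """Benford's-law probability that a number's leading digit is d (1-9)."""
    return math.log10(1 + 1 / d)


def estimate_number_frequency(token: str, shape_frequency: float) -> float:
    """Estimate the frequency of one number from the frequency of its shape.

    Hypothetical helper for illustration only. `shape_frequency` is the
    combined frequency of all digit sequences with the same length as `token`.
    """
    if not (token.isascii() and token.isdigit()) or len(token) < 2:
        raise ValueError("expected an ASCII digit sequence of length 2 or more")
    if token[0] == "0":
        raise ValueError("leading zeros are not modeled in this sketch")
    # Leading digit follows Benford's law; the other digits are assumed uniform.
    share = benford_first_digit(int(token[0])) / 10 ** (len(token) - 1)
    return shape_frequency * share


if __name__ == "__main__":
    total = 1e-3  # made-up combined frequency for the two-digit shape
    for number in ("10", "50", "90"):
        print(number, estimate_number_frequency(number, total))
```

Under these assumptions the shares of all numbers of a given length sum to 1, so the shape's total frequency is preserved while lower leading digits receive more of it.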

More changes related to digits:

- Functions such as iter_wordlist and top_n_list no longer return
  multi-digit numbers (they used to return them in their "smashed" form, such
  as "0000").
- lossy_tokenize no longer replaces digit sequences with 0s. That happens
  instead in a place that's internal to the word_frequency function, so we can
  look at the values of the digits before they're replaced.
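
The ordering described above can be sketched with a toy model: tokenization keeps the digits, "smashing" happens only at lookup time, and the word-list function filters out multi-digit entries. Everything here is hypothetical (the toy_* names, the frequency table, and the regular expression); it mimics the described behavior and does not reproduce wordfreq's internals.

```python
import re
from typing import Dict, List

# Runs of two or more ASCII digits; single digits are left alone.
MULTI_DIGIT = re.compile(r"[0-9]{2,}")

# Made-up frequency table keyed by smashed shapes, for illustration only.
TOY_FREQS: Dict[str, float] = {"the": 5e-2, "0000": 1e-3, "00": 8e-4, "cat": 2e-5}


def smash_digits(token: str) -> str:
    """Replace each multi-digit run with the same number of zeros."""
    return MULTI_DIGIT.sub(lambda m: "0" * len(m.group()), token)


def toy_tokenize(text: str) -> List[str]:
    """Tokenize without touching digits, so their values survive."""
    return text.lower().split()


def toy_shape_frequency(token: str) -> float:
    """Smash digits only here, inside the lookup.

    The original digits are still available in `token` at this point, which
    is what lets a caller spread the shape's frequency over specific numbers.
    """
    return TOY_FREQS.get(smash_digits(token), 0.0)


def toy_top_n(n: int) -> List[str]:
    """Most frequent entries, skipping smashed multi-digit shapes."""
    ranked = sorted(TOY_FREQS, key=lambda w: TOY_FREQS[w], reverse=True)
    return [w for w in ranked if not MULTI_DIGIT.search(w)][:n]


if __name__ == "__main__":
    print(toy_tokenize("Born in 1987"))
    print(toy_shape_frequency("1987"))
    print(toy_top_n(3))
```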

Other changes:

- wordfreq is now developed using poetry as its package manager, and with
  pyproject.toml as the source of configuration instead of setup.py.
- The minimum version of Python supported is 3.7.
- Type information is exported using py.typed.

Files

Name	Size	Checksum
rspeer/wordfreq-v3.0.2.zip	56.8 MB	md5:22a8337dd5dc94350b6c4f7b93ffb9e9

Additional details

Is supplement to: https://github.com/rspeer/wordfreq/tree/v3.0.2 (URL)

	All versions	This version
Views	3,240	1,063
Downloads	298	93
Data volume	15.0 GB	6.3 GB
