Published September 26, 2022 | Version v3.0.2

rspeer/wordfreq: v3.0

Creators

  • 1. Elemental Cognition

Description

v3.0: The "handle numbers better" release

 

Previously, wordfreq would group all digit sequences of the same 'shape',
with length 2 or more, into a single token and return the frequency of that
token, which vastly overestimated the frequency of any individual number.

Now it distributes the frequency over all numbers of that shape, with an
estimated distribution that accounts for Benford's law (numbers with smaller
leading digits are more frequent) and a special frequency distribution for
4-digit numbers that look like years (2010 is more frequent than 1020).
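As a rough illustration of the idea, here is a minimal Python sketch that splits one shape's frequency across individual numbers according to their leading digit. The function names are hypothetical and this is not wordfreq's actual code; it assumes a leading digit of 1 to 9 and leaves out the special handling of year-like numbers.

```python
import math


def benford_weight(first_digit: int) -> float:
    """Benford's law: probability that a number's leading digit is `first_digit` (1-9)."""
    return math.log10(1 + 1 / first_digit)


def estimate_number_frequency(number: str, shape_frequency: float) -> float:
    """
    Hypothetical helper: estimate the frequency of one number, given the
    combined frequency of every number with the same digit count.

    The leading digit gets its Benford share, and that share is divided
    evenly among the 10 ** (length - 1) numbers starting with that digit.
    """
    if not (number.isascii() and number.isdigit()) or len(number) < 2:
        raise ValueError("expected a sequence of two or more ASCII digits")
    first_digit = int(number[0])
    if first_digit == 0:
        raise ValueError("this sketch does not model a leading zero")
    share = benford_weight(first_digit) / 10 ** (len(number) - 1)
    return shape_frequency * share
```

Under these assumptions the shares of all numbers of one length add up to 1, so the shape's total frequency is preserved while a number such as 100 receives a larger estimate than 900.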

More changes related to digits:

  • Functions such as iter_wordlist and top_n_list no longer return
    multi-digit numbers (they used to return them in their "smashed" form, such
    as "0000").

  • lossy_tokenize no longer replaces digit sequences with 0s. That happens
    instead in a place that's internal to the word_frequency function, so we can
    look at the values of the digits before they're replaced.
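The "smashed" form mentioned in these two items can be sketched in Python as follows. This is an illustration only: the regular expression and function names are assumptions rather than wordfreq's internals, and a multi-digit sequence is taken here to mean two or more consecutive ASCII digits.

```python
import re
from typing import List, Tuple

# Assumption: a multi-digit sequence is two or more consecutive ASCII digits.
MULTI_DIGIT = re.compile(r"[0-9]{2,}")


def digit_shape(token: str) -> str:
    """Replace every digit of each multi-digit run with '0', keeping the length."""
    return MULTI_DIGIT.sub(lambda match: "0" * len(match.group()), token)


def shape_and_digits(token: str) -> Tuple[str, List[str]]:
    """
    Hypothetical lookup helper: return the smashed form (usable as a key
    into a frequency table) together with the original digit runs, so the
    caller can still inspect the digit values after the shape is computed.
    """
    return digit_shape(token), MULTI_DIGIT.findall(token)
```

For example, digit_shape("2010") gives "0000", while a single digit such as "7" is left unchanged. Keeping the original digit runs alongside the shape is what lets a lookup function adjust the estimate per number.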

Other changes:

  • wordfreq is now developed using poetry as its package manager, and with
    pyproject.toml as the source of configuration instead of setup.py.

  • The minimum version of Python supported is 3.7.

  • Type information is exported using py.typed.

Files

rspeer/wordfreq-v3.0.2.zip (56.8 MB)
md5:22a8337dd5dc94350b6c4f7b93ffb9e9
