Data sets used mainly for internal purposes by the quanteda package.

data_int_syllables

data_char_stopwords

data_char_wordlists

Format

An object of class integer of length 133245.

Source

The English stopwords are taken from the SMART information retrieval system (obtained from Lewis, David D., et al. "RCV1: A new benchmark collection for text categorization research." Journal of machine learning research (2004, 5 April): 361-397). Stopword lists for the other languages are taken from the Snowball stemmer project (see http://snowballstem.org/projects.html). The Greek stopwords were supplied by Carsten Schwemmer (see GitHub issue #282).

Details

data_int_syllables provides an English-language syllables dictionary; it is an integer vector whose element names correspond to English words and whose values are the corresponding syllable counts. It was built from the freely available CMU pronunciation dictionary at http://www.speech.cs.cmu.edu/cgi-bin/cmudict.

data_char_stopwords provides stopword lists in multiple languages; it is a named list of character vectors, with the lowercase language name (in English) as the name of each list element. Supported languages are Arabic, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish, and Swedish.

data_char_wordlists provides word lists used in some readability indexes; it is a named list of character vectors where each list element corresponds to a different readability index.

These are:

DaleChall

The long Dale-Chall list of 3,000 familiar (English) words needed to compute the Dale-Chall Readability Formula.

Spache

The revised Spache word list (see Klare 1975, 73) needed to compute the Spache Revised Formula of readability (Spache 1974).
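
The shapes of the three objects described above can be illustrated with a minimal sketch. It deliberately uses small toy stand-ins (toy_syllables, toy_stopwords, toy_wordlists are hypothetical names, and their entries are placeholders, not actual contents of the packaged data) so that it runs without quanteda installed. Assuming a quanteda version that exports the real objects, the same indexing should apply to data_int_syllables, data_char_stopwords, and data_char_wordlists.

```r
# Toy stand-ins mimicking the documented structures (NOT the real data).
# Named integer vector: names are words, values are syllable counts.
toy_syllables <- c(data = 2L, syllable = 3L, readability = 5L)

# Named list of character vectors keyed by lowercase language name.
toy_stopwords <- list(
  english = c("the", "of", "and"),
  german  = c("der", "die", "und")
)

# Named list of character vectors keyed by readability index name.
# The entries are placeholders, not claimed to be on the actual lists.
toy_wordlists <- list(
  DaleChall = c("data", "word"),
  Spache    = c("word")
)

words <- c("data", "zzz", "readability")

# Look up syllable counts by name; words absent from the vector yield NA.
counts <- toy_syllables[words]

# Total syllables, ignoring words that are not in the dictionary.
total <- sum(counts, na.rm = TRUE)

# Test membership in one language's stopword list.
is_stop <- "the" %in% toy_stopwords[["english"]]

# Flag which words appear on a given readability word list.
familiar <- words %in% toy_wordlists[["DaleChall"]]
```

Because lookups by name return NA for missing words, any code that sums or averages syllable counts probably needs an explicit na.rm = TRUE or a fallback for out-of-dictionary words.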

References

Chall, J. S., & Dale, E. 1995. Readability Revisited: The New Dale-Chall Readability Formula. Brookline Books.

Klare, G. R. 1975. "Assessing readability." Reading Research Quarterly 10(1): 62-102.

Spache, G. 1953. "A new readability formula for primary-grade reading materials." The Elementary School Journal 53: 410-413.

See also

stopwords