Dataset Open Access

Programming language keyword frequencies extracted from 16,000,000 public GitHub repositories (October 2016)

Markovtsev Vadim


The dataset covers 16,000,000 public repositories on GitHub as of October 2016. Their contents were classified by language with github/linguist and parsed with Pygments. Tokens of type Token.Keyword were filtered out and their frequencies aggregated with MapReduce. Fuzzy duplicate repositories were discarded.
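As an illustration of the keyword-filtering step, here is a minimal sketch for a single source string. It is not the pipeline used to build the dataset: the function name `count_keywords` is hypothetical, the lexer is chosen by name rather than by github/linguist, subtypes such as Token.Keyword.Namespace are assumed to count as keywords, and there is no MapReduce stage.

```python
from collections import Counter

from pygments.lexers import get_lexer_by_name
from pygments.token import Token


def count_keywords(source: str, lexer_name: str) -> Counter:
    """Count Token.Keyword tokens in one source string (hypothetical helper)."""
    lexer = get_lexer_by_name(lexer_name)
    counts = Counter()
    for token_type, value in lexer.get_tokens(source):
        # `in` is true for Token.Keyword itself and for its subtypes;
        # including the subtypes is an assumption, not a documented choice.
        if token_type in Token.Keyword:
            counts[value] += 1
    return counts
```

Summing such counters over all files of one language would give per-language keyword frequencies of the kind stored in the dataset.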

Some languages, e.g. Haskell, are parsed incorrectly, which results in many keywords being reported for them. These entries were nevertheless kept, since we are not familiar with such languages.


Format: triples of [language name]\t[keyword]\t[frequency]

Tabs and newlines inside keywords are escaped as \t and \n, respectively.
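A minimal sketch of reading the file in this format follows. The function names are hypothetical, and several details are assumptions rather than documented facts: each triple sits on its own line, the frequency is an integer, and \t and \n are the only escape sequences (the description does not say how a literal backslash is stored).

```python
from typing import Iterable, Iterator, Tuple


def unescape_keyword(raw: str) -> str:
    # Assumption: only the two-character sequences \t and \n are escapes.
    return raw.replace("\\t", "\t").replace("\\n", "\n")


def parse_triples(lines: Iterable[str]) -> Iterator[Tuple[str, str, int]]:
    """Yield (language, keyword, frequency) from tab-separated lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            # Skip lines that do not match the three-field layout.
            continue
        language, keyword, frequency = parts
        yield language, unescape_keyword(keyword), int(frequency)


def load_keywords(path: str) -> Iterator[Tuple[str, str, int]]:
    # newline="" keeps line endings untouched so rstrip handles them.
    with open(path, encoding="utf-8", newline="") as handle:
        yield from parse_triples(handle)
```

For example, `load_keywords("keywords.tsv")` can be iterated once to build a per-language dictionary without holding the raw text in memory.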

Files (47.4 MB)
Name Size
keywords.tsv md5:0202530008708c280cc7e641f6754596 47.4 MB


Cite as