Dataset Open Access
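Keyword frequencies extracted from 16,000,000 GitHub repositories as of October 2016. The repositories were classified by language with github/linguist and parsed with Pygments. Only tokens of type Token.Keyword were kept, and their occurrences were counted with MapReduce. Fuzzy duplicate repositories were discarded.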
Some languages, e.g. Haskell, are parsed incorrectly, which results in many spurious keywords. These entries were nevertheless kept, since we are not familiar enough with such languages to clean them up.
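The filter-and-count step described above can be sketched roughly as follows. This is only an illustration, not the pipeline that produced the dataset: the function names and the input layout (one list of (token type, value) pairs per file, with the token type given as a string) are hypothetical, the sample tokens are made up, and counting subtypes such as Token.Keyword.Namespace as keywords is an assumption.

```python
from collections import Counter
from itertools import chain


def map_keywords(language, tokens):
    """Map step: emit ((language, keyword), 1) for every keyword token."""
    for token_type, value in tokens:
        # Assumption: subtypes like Token.Keyword.Namespace also count as keywords.
        if token_type == "Token.Keyword" or token_type.startswith("Token.Keyword."):
            yield (language, value), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per (language, keyword) key."""
    counts = Counter()
    for key, n in pairs:
        counts[key] += n
    return counts


# Hypothetical, made-up token streams standing in for lexer output.
files = [
    ("Python", [("Token.Keyword", "def"), ("Token.Name", "f"), ("Token.Keyword", "return")]),
    ("Python", [("Token.Keyword", "def"), ("Token.Keyword.Namespace", "import")]),
    ("Go", [("Token.Keyword", "func")]),
]

counts = reduce_counts(
    chain.from_iterable(map_keywords(lang, toks) for lang, toks in files)
)
```

Here `counts` maps each (language, keyword) pair to its total number of occurrences; non-keyword tokens never reach the reduce step.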
Format: one triple per line, [language name]\t[keyword]\t[frequency]
Tabs and newlines inside keywords are escaped as \t and \n, respectively.
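A minimal sketch of reading this format in Python is given below. The function names are hypothetical and the sample lines are made up rather than taken from the dataset. It also assumes that a literal backslash is not itself escaped, so a keyword that originally contained a backslash followed by "t" or "n" cannot be told apart from an escaped tab or newline.

```python
def unescape(keyword):
    """Turn the two-character sequences \\t and \\n back into a tab and a newline."""
    return keyword.replace("\\t", "\t").replace("\\n", "\n")


def parse_line(line):
    """Split one record into (language, keyword, frequency)."""
    # Tabs inside keywords are escaped, so a plain split yields exactly three fields.
    language, keyword, frequency = line.rstrip("\r\n").split("\t")
    return language, unescape(keyword), int(frequency)


def parse_lines(lines):
    """Yield parsed triples from an iterable of lines, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield parse_line(line)


# Made-up sample records; the second keyword contains an escaped tab.
sample = ["Python\tdef\t42\n", "Foo\ta\\tb\t7\n"]
records = list(parse_lines(sample))
```

The same generator accepts an open file object, so the whole dataset can be streamed line by line without loading it into memory.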