This data and material support the paper "Content classification of development emails" published in the proceedings of the 34th International Conference on Software Engineering (ICSE 2012).
Every software system has a history. We find traces of a system's history in software repositories, which developers use when building and maintaining their systems. Each repository tells us part of that history from its own perspective: issue repositories murmur of dark events involving defective and flawed entities; versioning system repositories tell of restless artifacts and of classes that nobody would ever touch; mailing list archives report unexpected stories about developers’ interactions and opinions.
But... can we seriously trust these repositories? Can we just listen to what they tell us and behave accordingly? Many wise researchers have warned us about the risks of such naive faith in data repositories: versioning system repositories might seduce us with enchanting stories of ever-changing entities, when in reality many of these entities merely change their makeup and keep their old behaviour; issue repositories might tell us only a partial truth about certain very special entities, or developers. We agree with these researchers. Natural language documents in particular contain information in different languages, surrounded by much noise, and we must pay special attention when using them.
We created MUCCA, a classification method to use when dealing with natural language documents. It recognizes source code fragments, patches, stack traces, noise, and natural language with high accuracy. This allows one to subsequently apply analysis techniques tailored to each category, exploit its peculiarities, and extract reliable information.
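To illustrate what classified content makes possible downstream, here is a minimal Python sketch that groups already-labeled email fragments by category so that each group can be handed to its own analysis. It is not MUCCA's implementation (MUCCA is written in Smalltalk and uses Weka classifiers); the label strings, the `group_by_category` helper, and the sample email are all hypothetical.

```python
from collections import defaultdict

# The five categories named in the text; the exact label strings are assumptions.
CATEGORIES = ("source_code", "patch", "stack_trace", "noise", "natural_language")


def group_by_category(labeled_fragments):
    """Collect fragment texts under their category, preserving their order."""
    groups = defaultdict(list)
    for text, category in labeled_fragments:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        groups[category].append(text)
    return dict(groups)


# Hypothetical, hand-labeled fragments of one email (not taken from the benchmark).
email = [
    ("Hi all, the parser crashes on empty input.", "natural_language"),
    ("java.lang.NullPointerException\n\tat Parser.parse(Parser.java:42)", "stack_trace"),
    ("if (input == null) return;", "source_code"),
    ("-- \nSent from my phone", "noise"),
]

groups = group_by_category(email)
for category, fragments in groups.items():
    print(category, len(fragments))
```

Each entry of `groups` could then go to a category-specific tool, for example a stack trace parser for `stack_trace` and a text-mining pipeline for `natural_language`.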
This Zenodo upload supports the paper that describes our work on this topic.
1. Source code & Virtual Image
MUCCA is written in Cincom VisualWorks Smalltalk and is composed of several components.
You can download the source code of the following MUCCA components from this upload (
In addition, we use the Weka workbench for the machine learning tasks. You can download the two trained classifiers that compose MUCCA (mucca-classifiers folder): a Naive Bayes based classifier (classifier1-nb.model) and a Decision Tree based classifier (
Alternatively, we created a VirtualBox image with a pre-configured VisualWorks environment that includes all the MUCCA components and prerequisites (both Smalltalk and Java):
MUCCA.ova (Both user and password are
To train machine-learning classifiers and evaluate the effectiveness of the different approaches, we manually created a benchmark in which emails are classified at character granularity.
Given the time and effort needed to create such a benchmark, we humbly think it is a valuable contribution to the community. With the help of this benchmark, other researchers can reproduce our experiments and devise new classification methods, which can be immediately compared to ours.
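As a rough illustration of how a new method could be compared against a character-granularity benchmark, the following Python sketch expands labeled spans into one label per character and computes per-character accuracy. The `(start, end, label)` span representation, the function names, and the sample data are assumptions made for this example; they do not reflect the actual schema of the benchmark database.

```python
def expand(spans, length):
    """Turn (start, end, label) spans (end exclusive) into one label per character."""
    labels = [None] * length
    for start, end, label in spans:
        for i in range(start, end):
            labels[i] = label
    return labels


def char_accuracy(gold, predicted):
    """Fraction of characters whose predicted label equals the benchmark label."""
    if len(gold) != len(predicted):
        raise ValueError("label sequences must have the same length")
    if not gold:
        raise ValueError("cannot score an empty document")
    matches = sum(g == p for g, p in zip(gold, predicted))
    return matches / len(gold)


# Hypothetical 10-character document: the benchmark says the first 6 characters
# are natural language and the last 4 are source code.
gold = expand([(0, 6, "natural_language"), (6, 10, "source_code")], 10)

# A hypothetical classifier places the boundary two characters too early.
predicted = expand([(0, 4, "natural_language"), (4, 10, "source_code")], 10)

accuracy = char_accuracy(gold, predicted)
print(accuracy)
```

In this example the two labelings disagree on characters 4 and 5 only, so 8 of the 10 characters match.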
You can download the dataset from the GitHub repository (a dump of the GitHub repository is uploaded here (benchmark/githubDump.zip)), or download the full database dump in PostgreSQL format (