Dataset Open Access

Data and material for: "Content classification of development emails"

Alberto Bacchelli; Tommaso Dal Sasso; Marco D'Ambros; Michele Lanza

This data and material support the paper "Content classification of development emails" published in the proceedings of the 34th International Conference on Software Engineering (ICSE 2012).

Every software system has a history. We find traces of a system's history in software repositories, which are used by developers when building and maintaining their systems. Each repository tells us a part of the history, from its perspective: Issue repositories murmur of dark events involving defective and flawed entities; versioning system repositories tell of restless artifacts and of classes that nobody would ever touch; mailing list archives report unexpected stories about developers’ interactions and opinions.

But... can we seriously trust these repositories? Can we just listen to what they tell us and behave accordingly? Many wise researchers have warned us about the risks of such naive faith in data repositories: Versioning system repositories might seduce us with enchanting stories of ever-changing entities, when in reality many of these entities merely change their makeup and keep their old behaviour; issue repositories might tell us only a partial truth about certain very special entities or developers. We agree with these researchers: Natural language documents, in particular, contain information in different languages, surrounded by much noise. We must pay special attention when using them.

We created MUCCA, a classification method for natural language documents. It recognizes source code fragments, patches, stack traces, noise, and natural language with high accuracy. This makes it possible to subsequently apply ad hoc analysis techniques that exploit the peculiarities of each category and extract reliable information.

This Zenodo upload supports the paper that describes our work on this topic.

 

1. Source code & Virtual Image

MUCCA is written in Cincom VisualWorks Smalltalk and is composed of several components.
You can download the source code of the following MUCCA components from this upload (mucca-source_code folder):

  • Miler2, the core of MUCCA, including metamodels, importers, classification engine, etc.;
  • MailPeek, our web application for the manual classification of email content;
  • PetitIsland, our grammar to generate island parsers;
  • PetitJava, the grammar of Java, which we implemented for PetitParser;
  • PetitSTrace, our island parser for Java stack traces.
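
To give a rough idea of what recognizing stack-trace "islands" inside free-form email text involves, here is a minimal Python sketch. It is not PetitSTrace and does not use a grammar: it approximates the idea with a single regular expression for Java stack frames. The pattern `FRAME` and the function `find_stack_trace_islands` are hypothetical names introduced only for this illustration.

```python
import re

# Assumed shape of one Java stack frame line, for example:
#     at org.example.Foo.bar(Foo.java:42)
# This pattern is an illustration only, not the grammar used by PetitSTrace.
FRAME = re.compile(
    r"^\s*at\s+[\w.$<>]+\((?:[\w$]+\.java:\d+|Native Method|Unknown Source)\)\s*$"
)


def find_stack_trace_islands(text):
    """Return (first_line, last_line) index pairs for runs of frame lines.

    Everything that is not a frame line is treated as "water" and skipped.
    """
    islands = []
    start = None
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if FRAME.match(line):
            if start is None:
                start = index  # a new island begins here
        elif start is not None:
            islands.append((start, index - 1))  # the island ended on the previous line
            start = None
    if start is not None:
        islands.append((start, len(lines) - 1))  # island runs to the end of the text
    return islands
```

Calling `find_stack_trace_islands(email_body)` returns zero-based line ranges; a real island parser would also capture the exception header line and the internal structure of each frame.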

Note that, in order to make Miler2 run, you will also need the following external Smalltalk components: Moose, Glorp, Seaside, TwoFlower, MetaDB, PetitParser.

In addition, we use the Weka workbench for the machine learning tasks. You can download the two trained classifiers that compose MUCCA (mucca-classifiers folder): a Naive Bayes based classifier (classifier1-nb.model) and a Decision Tree based classifier (classifier2-dt.model).

Alternatively, we created a VirtualBox image with a pre-configured VisualWorks environment, which includes all the MUCCA components and prerequisites (both Smalltalk and Java): MUCCA.ova (both the user name and the password are muccauser).

 

2. Benchmark

To train machine-learning classifiers and evaluate the effectiveness of the different approaches, we manually created a benchmark in which emails are classified at character granularity.
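
As an illustration of what classification at character granularity means, the following Python sketch expands labelled spans into one label per character and compares two labelings character by character. The category names, the `(start, end, category)` span representation, and both function names are assumptions made for this example; the benchmark's actual encoding may differ.

```python
# Hypothetical category names, loosely following the kinds of content
# MUCCA distinguishes; the benchmark's real label encoding may differ.
CATEGORIES = ("natural_language", "source_code", "patch", "stack_trace", "noise")


def spans_to_char_labels(text_length, spans, default="natural_language"):
    """Expand (start, end, category) spans, end exclusive, to one label per character."""
    labels = [default] * text_length
    for start, end, category in spans:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        # Clamp the span to the text so out-of-range offsets are ignored.
        for position in range(max(start, 0), min(end, text_length)):
            labels[position] = category
    return labels


def char_accuracy(gold, predicted):
    """Fraction of characters whose predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("label sequences must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy of an empty text")
    matches = sum(1 for g, p in zip(gold, predicted) if g == p)
    return matches / len(gold)
```

With labels stored per character, a classifier that misplaces a fragment boundary by a few characters is penalized only for those characters, not for the whole line or message.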

Given the time and effort needed to create such a benchmark, we humbly think it is a valuable contribution to the community. With the help of this benchmark, other researchers can reproduce our experiments and devise new classification methods, which can be immediately compared to ours.

You can download the dataset from the GitHub repository (a dump of the GitHub repository is included in this upload: benchmark/githubDump.zip), or download the full database dump in PostgreSQL format (benchmark/benchmarkDump.tar.bz2).

Files (2.9 GB)
Name: mucca.zip
Size: 2.9 GB
md5: 68b33509211eb59abcdb0a2c2ddd7cb5
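
The MD5 checksum listed for mucca.zip can be used to check that a download completed without corruption. The sketch below uses Python's standard `hashlib` module and reads the file in chunks, since the archive is several gigabytes; the function names and the local file path are assumptions for illustration.

```python
import hashlib

# Checksum listed on this page for mucca.zip.
EXPECTED_MD5 = "68b33509211eb59abcdb0a2c2ddd7cb5"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `is_intact("mucca.zip")` (assuming the archive was saved under that name in the current directory) returns `True` only if the local copy matches the published checksum.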