Webis Gmane Email Corpus 2019

doi:10.5281/zenodo.3766985

Published June 3, 2020 | Version v1

Dataset Restricted

1. Bauhaus-Universität Weimar
2. Leipzig University

The Webis Gmane Email Corpus 2019 is a dataset of more than 153 million parsed and segmented emails crawled between February and May 2019 from gmane.io covering more than 20 years of public mailing lists. The dataset has been published as a resource at ACL 2020.

The dataset comes as a set of Gzip-compressed files containing line-based JSON in the Elasticsearch bulk format. Each data record consists of two lines:

{"index": {"_id": "<urn:uuid:c1d95e4b-0f43-46c7-a99e-c575d1d8e1ce>"}}
{"headers": {"header name": "header value", ...}, "text_plain": "plaintext body", "lang": "en", "segments": [{"end": 99, "label": "paragraph", "begin": 0}, ...], "group": "gmane group name"}

The first line is the Elasticsearch index action with a document UUID. The second line is the parsed email itself: a reduced and anonymized set of headers, the plaintext body, the detected language, the original Gmane group name, and the predicted content segments as character spans. The Gzip files can be split every 1,000 records (line pairs) for parallel processing, e.g., in Hadoop.
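The two-line record layout described above can be read with a few lines of Python. This is a minimal sketch, not an official loader: the function names `iter_records` and `iter_gzip_file` are hypothetical, and the demo record is made up rather than taken from the corpus. It assumes UTF-8 encoded files and that every index action line is immediately followed by its document line.

```python
import gzip
import io
import json


def iter_records(lines):
    """Yield (document id, email dict) pairs from bulk-format lines.

    `lines` is any iterable of text lines. Blank lines are skipped, and
    a dangling action line without a document line is silently dropped.
    """
    non_empty = (line for line in lines if line.strip())
    # zip(it, it) pairs each action line with the line that follows it.
    for action_line, doc_line in zip(non_empty, non_empty):
        action = json.loads(action_line)
        email = json.loads(doc_line)
        yield action["index"]["_id"], email


def iter_gzip_file(path):
    """Stream records from one Gzip-compressed corpus file (path is an assumption)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from iter_records(handle)


# In-memory demo with a made-up record in the documented layout.
sample = io.StringIO(
    '{"index": {"_id": "<urn:uuid:00000000-0000-0000-0000-000000000000>"}}\n'
    '{"headers": {"subject": "test"}, "text_plain": "Hello", "lang": "en", '
    '"segments": [{"end": 5, "label": "paragraph", "begin": 0}], '
    '"group": "gmane.example"}\n'
)
for doc_id, email in iter_records(sample):
    print(doc_id, email["lang"], email["group"])
```

Streaming the pairs keeps memory use flat regardless of file size. Python's `gzip` module also reads files made of several concatenated Gzip members in sequence, which should cover the splittable layout if that is how the files are built.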

Available email headers are:

message_id
date (yyyy-MM-dd HH:mm:ssZZ)
subject
from
to
cc
in_reply_to
references
list_id
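
The date pattern listed above can be parsed with Python's standard library. The sketch below assumes that `ZZ` denotes a numeric UTC offset such as `+0100` or `+01:00`; this reading is not confirmed by the dataset description. The helper name `parse_date` and the sample value are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_date(value):
    """Parse a 'yyyy-MM-dd HH:mm:ssZZ' header value into an aware datetime.

    Assumes ZZ is a numeric UTC offset; %z accepts both +0100 and +01:00
    on Python 3.7 and later.
    """
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S%z")


# Made-up example value, not taken from the corpus.
sent = parse_date("2019-03-01 12:30:00+0100")
print(sent.astimezone(timezone.utc).isoformat())
```

Normalizing to UTC before sorting makes it easier to order messages within a thread when senders are in different time zones.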

Available segment classes are:

paragraph
closing
inline_headers
log_data
mua_signature
patch
personal_signature
quotation
quotation_marker
raw_code
salutation
section_heading
tabular
technical
visual_separator
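
Because segments are stored as character spans, the text of each segment can be recovered by slicing `text_plain`. The sketch below assumes that `begin` is inclusive, `end` is exclusive, and that offsets count characters as Python strings do; none of this is stated explicitly in the description. The function name `segment_texts` and the sample email are hypothetical.

```python
def segment_texts(email, label=None):
    """Return (label, text) pairs for an email's segments.

    If `label` is given, only segments of that class are returned.
    Assumes half-open spans [begin, end) over `text_plain`.
    """
    text = email["text_plain"]
    return [
        (seg["label"], text[seg["begin"]:seg["end"]])
        for seg in email["segments"]
        if label is None or seg["label"] == label
    ]


# Made-up email in the documented record layout.
example = {
    "text_plain": "Hi all,\nThe patch works.",
    "segments": [
        {"begin": 0, "end": 7, "label": "salutation"},
        {"begin": 8, "end": 24, "label": "paragraph"},
    ],
}
for seg_label, seg_text in segment_texts(example):
    print(seg_label, repr(seg_text))
```

Filtering by label, for example dropping `quotation` and `mua_signature` segments, is one way to isolate the text an author newly wrote in a reply.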

Find more information about the dataset and the segmentation model at webis.de.

If you are using this resource in your work, please cite it as:

@InProceedings{stein:2020o,
  author =              {Janek Bevendorff and Khalid Al-Khatib and Martin Potthast and Benno Stein},
  booktitle =           {58th Annual Meeting of the Association for Computational Linguistics (ACL 2020)},
  month =               jul,
  publisher =           {Association for Computational Linguistics},
  site =                {Seattle, USA},
  title =               {{Crawling and Preprocessing Mailing Lists At Scale for Dialog Analysis}},
  year =                2020
}

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Request access

If you would like to request access to these files, please fill out the form below.

The dataset is available only to individual researchers and research institutions. If you qualify for either one, we are happy to share the data with you under the following conditions:

Any non-academic use and redistribution of the data are prohibited. By downloading the dataset, you agree to these terms. We request you be responsible in your research and in your handling of the data and adhere to ethical standards and privacy regulations.

Although email addresses and headers have been anonymized and all data comes from a readily available online source, we take this step as a measure to protect the privacy of users whose data can be found in the corpus.
