Dataset Restricted Access

Webis Gmane Email Corpus 2019

Janek Bevendorff; Khalid Al-Khatib; Martin Potthast; Benno Stein


MARC21 XML Export

<?xml version='1.0' encoding='UTF-8'?>
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nmm##2200000uu#4500</leader>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2020-06-03</subfield>
  </datafield>
  <controlfield tag="005">20200604101820.0</controlfield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">user-webis</subfield>
  </datafield>
  <controlfield tag="001">3766985</controlfield>
  <datafield tag="909" ind1="C" ind2="O">
    <subfield code="p">openaire_data</subfield>
    <subfield code="p">user-webis</subfield>
    <subfield code="o">oai:zenodo.org:3766985</subfield>
  </datafield>
  <datafield tag="520" ind1=" " ind2=" ">
    <subfield code="a">&lt;p&gt;The Webis Gmane Email Corpus 2019 is a dataset of more than 153 million parsed and segmented emails&amp;nbsp;crawled between February and May 2019 from gmane.io covering more than 20 years&amp;nbsp;of public mailing lists. The dataset has been published as a resource at ACL 2020.&lt;/p&gt;

&lt;p&gt;The dataset comes as a set of Gzip-compressed files containing line-based JSON&amp;nbsp;in the&amp;nbsp;&lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html"&gt;Elasticsearch bulk format&lt;/a&gt;. Each data record&amp;nbsp;consists of two lines:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-json"&gt;{"index": {"_id": "&amp;lt;urn:uuid:c1d95e4b-0f43-46c7-a99e-c575d1d8e1ce&amp;gt;"}}
{"headers": {"header name": "header value", ...}, "text_plain": "plaintext body", "lang": "en", "segments": [{"end": 99, "label": "paragraph", "begin": 0}, ...], "group": "gmane group name"}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first line is the Elasticsearch index action with a document UUID; the second is the parsed email with a reduced and anonymized set of headers, the detected language, the original Gmane group name, and the predicted content segments as character spans. The Gzip files can be split every 1,000 records (line pairs) for parallel processing with, for example, Hadoop.&lt;/p&gt;

&lt;p&gt;Available email headers are:&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;message_id&lt;/li&gt;
	&lt;li&gt;date (yyyy-MM-dd HH:mm:ssZZ)&lt;/li&gt;
	&lt;li&gt;subject&lt;/li&gt;
	&lt;li&gt;from&lt;/li&gt;
	&lt;li&gt;to&lt;/li&gt;
	&lt;li&gt;cc&lt;/li&gt;
	&lt;li&gt;in_reply_to&lt;/li&gt;
	&lt;li&gt;references&lt;/li&gt;
	&lt;li&gt;list_id&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Available segment classes are:&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;paragraph&lt;/li&gt;
	&lt;li&gt;closing&lt;/li&gt;
	&lt;li&gt;inline_headers&lt;/li&gt;
	&lt;li&gt;log_data&lt;/li&gt;
	&lt;li&gt;mua_signature&lt;/li&gt;
	&lt;li&gt;patch&lt;/li&gt;
	&lt;li&gt;personal_signature&lt;/li&gt;
	&lt;li&gt;quotation&lt;/li&gt;
	&lt;li&gt;quotation_marker&lt;/li&gt;
	&lt;li&gt;raw_code&lt;/li&gt;
	&lt;li&gt;salutation&lt;/li&gt;
	&lt;li&gt;section_heading&lt;/li&gt;
	&lt;li&gt;tabular&lt;/li&gt;
	&lt;li&gt;technical&lt;/li&gt;
	&lt;li&gt;visual_separator&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Find more information about the dataset and the segmentation model at&amp;nbsp;&lt;a href="https://webis.de/data#webis-gmane-19"&gt;webis.de&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you are using this resource in your work, please cite it&amp;nbsp;as:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;@InProceedings{stein:2020o,
  author =              {Janek Bevendorff and Khalid Al-Khatib and Martin Potthast and Benno Stein},
  booktitle =           {58th Annual Meeting of the Association for Computational Linguistics (ACL 2020)},
  month =               jul,
  publisher =           {Association for Computational Linguistics},
  site =                {Seattle, USA},
  title =               {{Crawling and Preprocessing Mailing Lists At Scale for Dialog Analysis}},
  year =                2020
}
&lt;/code&gt;&lt;/pre&gt;
</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Bauhaus-Universität Weimar</subfield>
    <subfield code="a">Khalid Al-Khatib</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Leipzig University</subfield>
    <subfield code="0">(orcid)0000-0003-2451-0665</subfield>
    <subfield code="a">Martin Potthast</subfield>
  </datafield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="u">Bauhaus-Universität Weimar</subfield>
    <subfield code="0">(orcid)0000-0001-9033-2217</subfield>
    <subfield code="a">Benno Stein</subfield>
  </datafield>
  <datafield tag="542" ind1=" " ind2=" ">
    <subfield code="l">restricted</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">dataset</subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="u">Bauhaus-Universität Weimar</subfield>
    <subfield code="0">(orcid)0000-0002-3797-0559</subfield>
    <subfield code="a">Janek Bevendorff</subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.5281/zenodo.3766985</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">Webis Gmane Email Corpus 2019</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">isVersionOf</subfield>
    <subfield code="a">10.5281/zenodo.3766984</subfield>
  </datafield>
</record>
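
Reading the corpus files (illustrative sketch)

The description in the record above says each data record is a pair of JSON lines inside a Gzip-compressed file: an Elasticsearch index action followed by the parsed email. The sketch below shows one way such a file might be read with the Python standard library. The function names iter_records and segment_texts are hypothetical and not part of the dataset. The record does not say whether the "end" offset of a segment is inclusive or exclusive, so the half-open [begin, end) interpretation used here is an assumption.

```python
import gzip
import json


def iter_records(path):
    """Yield (doc_id, email) pairs from one Gzip-compressed bulk file.

    Hypothetical helper: the file layout (action line, then document line)
    follows the dataset description; everything else is an assumption.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        while True:
            action_line = f.readline()
            if not action_line:
                break  # end of file
            if not action_line.strip():
                continue  # tolerate stray blank lines
            email_line = f.readline()
            if not email_line:
                raise ValueError("action line without a document line")
            action = json.loads(action_line)
            yield action["index"]["_id"], json.loads(email_line)


def segment_texts(email):
    """Return (label, text) pairs for the predicted segments of one email.

    Assumes spans are half-open [begin, end) character offsets into
    "text_plain"; the dataset description does not state this explicitly.
    """
    body = email["text_plain"]
    return [
        (seg["label"], body[seg["begin"]:seg["end"]])
        for seg in email.get("segments", [])
    ]
```

A caller could loop over iter_records("part.json.gz"), where the file name is a placeholder, and pass each email to segment_texts to keep, for example, only the spans labeled "paragraph".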