Dataset Open Access
Volodymyr Miz; Benjamin Ricaud; Pierre Vandergheynst
We use the Enron email dataset to build a network of email addresses. The dataset contains 614 586 emails sent between 6 January 1998 and 4 February 2004. During pre-processing, we remove the periods of low activity and keep only the emails from 1 January 1999 until 31 July 2002, which gives 1448 days of email records in total. We also remove email addresses that sent fewer than three emails over that period. In total, the resulting Enron email network contains 6 600 nodes and 50 897 edges.
To build a graph G = (V, E), we use email addresses as the nodes V. Every node v_i has an attribute: a time-varying signal that gives the number of emails sent from that address on each day. We draw an edge e_ij between two nodes i and j if there is at least one email exchange between the corresponding addresses.
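The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used to produce the dataset: the record format (sender, recipient, day index), the helper name `build_graph`, and the choice to treat edges as unordered pairs are all hypothetical.

```python
from collections import defaultdict

NUM_DAYS = 1448  # length of the observation window given in the description


def build_graph(emails, num_days=NUM_DAYS):
    """Build node signals and an edge set from (sender, recipient, day) records.

    Assumption: `day` is a 0-based index into the observation window.
    """
    signals = defaultdict(lambda: [0] * num_days)  # node -> emails sent per day
    edges = set()                                  # unordered pairs of addresses
    for sender, recipient, day in emails:
        signals[sender][day] += 1
        signals[recipient]  # accessing the key registers the recipient as a node
        if sender != recipient:
            edges.add(frozenset((sender, recipient)))
    return dict(signals), edges


# Hypothetical toy records, not taken from the dataset.
toy_emails = [
    ("a@x.com", "b@x.com", 0),
    ("a@x.com", "b@x.com", 0),
    ("b@x.com", "c@x.com", 2),
]
signals, edges = build_graph(toy_emails, num_days=3)
```

An edge is added once however many emails the two addresses exchange, which matches the "at least one email exchange" rule; the per-day counts only track emails sent, not received.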
The 'Count' column in the 'edges.csv' file gives the number of 'From'->'To' email exchanges between the two addresses. This column can be used as an edge weight.
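A minimal sketch of reading the counts as edge weights, using only the Python standard library. Only the 'Count' column name is confirmed by the description; the header names 'From' and 'To', the use of integer node ids, and the inline sample rows are assumptions made for illustration and may need adjusting to the actual file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample standing in for 'edges.csv'; the real header may differ.
SAMPLE_EDGES_CSV = """From,To,Count
0,1,4
1,0,2
1,2,1
"""


def load_weighted_edges(handle):
    """Map each directed (source, target) pair to its email count."""
    weights = defaultdict(int)
    for row in csv.DictReader(handle):
        weights[(int(row["From"]), int(row["To"]))] += int(row["Count"])
    return dict(weights)


def symmetrize(weights):
    """Sum the counts of both directions to get undirected edge weights."""
    undirected = defaultdict(int)
    for (src, dst), count in weights.items():
        undirected[tuple(sorted((src, dst)))] += count
    return dict(undirected)


directed = load_weighted_edges(io.StringIO(SAMPLE_EDGES_CSV))
undirected = symmetrize(directed)
```

To read the real file, replace the `io.StringIO(...)` argument with an open file handle, for example `open("edges.csv", newline="")`. Summing both directions is one possible way to obtain weights for an undirected graph; keeping the directed counts is equally valid.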
The file 'nodes.csv' contains, for each node, a dictionary that is a compressed representation of its time series. Each dictionary maps a day to the number of emails sent by the address during that day (Day -> Number of emails sent). The total number of days is 1448.
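The dictionary can be expanded into a fixed-length signal as sketched below. The description does not specify how the dictionary is serialized, so this sketch assumes that it is stored as a Python-style literal such as "{0: 3, 5: 1}", that days are 0-based indices, and that days missing from the dictionary mean zero emails sent. The helper name `expand_series` is hypothetical.

```python
import ast

NUM_DAYS = 1448  # total number of days stated in the description


def expand_series(serialized, num_days=NUM_DAYS):
    """Turn a serialized Day -> Count dictionary into a dense list of counts.

    Assumptions: Python-literal serialization, 0-based day indices,
    and missing days meaning zero emails sent.
    """
    sparse = ast.literal_eval(serialized)
    dense = [0] * num_days
    for day, count in sparse.items():
        dense[int(day)] = int(count)
    return dense


# Hypothetical dictionary string, not taken from the dataset.
dense = expand_series("{0: 3, 5: 1, 1447: 2}")
```

If the actual day keys turn out to be 1-based or stored as dates, the index computation inside the loop is the only part that should need to change.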
The file 'id-email.csv' contains the actual email addresses.