Social Network Analysis of the Enron Corpus
Description
In major litigation and regulatory investigations, electronic document review traditionally relies on brute-force keyword searches, a method that is often inefficient and haphazard when applied to massive datasets like the 500,000-email Enron Corpus. This paper investigates whether social network analysis (SNA) can serve as an effective preprocessing triage tool to distill large corpora and rapidly identify discrete subgroups with email content of potential interest.
After filtering the Enron Corpus to remove duplicate messages, external communications, and administrative broadcasts, the remaining internal communication network was analyzed for user prominence. Users were ranked by "stress centrality" (the number of shortest paths between other pairs of vertices that pass through a given vertex), and the top 100 were retained, reducing the dataset to a graph suitable for latent network identification. A latent cluster random effects model fitted with a Markov chain Monte Carlo (MCMC) algorithm was then applied to this reduced graph.
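The centrality-based reduction step can be illustrated with a minimal sketch. It assumes the filtered network is available as a directed adjacency mapping from each sender to the set of their recipients, and it counts, for every vertex, the shortest paths between other vertices that pass through it, using a Brandes-style breadth-first accumulation. The function names `stress_centrality` and `top_k_users` and the input format are hypothetical and are not taken from the paper.

```python
from collections import defaultdict, deque


def stress_centrality(adj):
    """Count shortest paths passing through each vertex of a directed graph.

    adj maps each sender to a set of recipients (assumed input format).
    """
    nodes = set(adj)
    for targets in adj.values():
        nodes.update(targets)
    stress = {v: 0 for v in nodes}

    for s in nodes:
        sigma = {s: 1}              # number of shortest paths from s to each vertex
        dist = {s: 0}               # hop distance from s
        preds = defaultdict(list)   # predecessors on shortest paths from s
        order = []                  # vertices in order of non-decreasing distance
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj.get(v, ()):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    sigma[w] = 0
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)

        # tau[v]: number of shortest-path continuations from v to later vertices
        tau = {v: 0 for v in order}
        for w in reversed(order):
            for v in preds[w]:
                tau[v] += 1 + tau[w]
            if w != s:
                stress[w] += sigma[w] * tau[w]
    return stress


def top_k_users(adj, k=100):
    """Return the k users with the highest stress centrality (ties by name)."""
    scores = stress_centrality(adj)
    return sorted(scores, key=lambda v: (-scores[v], v))[:k]
```

For example, `top_k_users(adj, k=100)` would return the 100 most central users, and the reduced graph is then the subgraph induced on those users. Pairs are counted per direction here, so an undirected network encoded with edges in both directions yields values twice as large as the undirected convention.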
Relying solely on basic sender and recipient metadata, the latent network model successfully partitioned the users into three distinct structural clusters. To validate the efficacy of this metadata-driven classification, a natural language processing (NLP) word frequency analysis was conducted on the clusters' contents. The analysis revealed that the socially distinct groups also possessed markedly differing vocabularies, with two of the sub-graphs containing 13.36% and 23.01% strictly unique terms, respectively.
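The vocabulary comparison can be sketched as follows. This is a simplified illustration, not the paper's NLP pipeline: it assumes that a "strictly unique" term is one that appears in a cluster's vocabulary and in no other cluster's, and it uses a naive lowercase regex tokenizer. The helper names `vocabulary` and `unique_term_share` are hypothetical.

```python
import re


def vocabulary(messages):
    """Collect the set of lowercase word tokens used in a list of message bodies."""
    vocab = set()
    for text in messages:
        vocab.update(re.findall(r"[a-z']+", text.lower()))
    return vocab


def unique_term_share(cluster_messages):
    """Return, per cluster, the percentage of its terms found in no other cluster.

    cluster_messages maps a cluster label to a list of message bodies
    (assumed input format).
    """
    vocabs = {name: vocabulary(msgs) for name, msgs in cluster_messages.items()}
    shares = {}
    for name, vocab in vocabs.items():
        others = set().union(*(v for n, v in vocabs.items() if n != name))
        unique = vocab - others
        shares[name] = 100.0 * len(unique) / len(vocab) if vocab else 0.0
    return shares
```

A real analysis would likely also remove stop words, signatures, and quoted reply text before counting, since these inflate the shared vocabulary.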
These findings demonstrate that graph-based SNA preprocessing is a highly feasible and computationally efficient method for e-discovery. By reducing dataset size through centrality metrics and clustering users into discrete social networks, investigators can strategically prioritize email review based on the relative positions and distinct vernaculars of an organization's participants.
Files
Enron SNA.pdf (1.3 MB), md5:dcfdbbe7e6ec2870bafe8d6a1ef6edaf