Published January 1, 2025 | Version v1
Dataset Open

Persuasion Sentences in Spam Email (PerSentSE)

Description

How to Access:

To access this dataset, please contact Francisco Janez via email at francisco.janez@unileon.es. Access is granted on a per-request basis.

Purpose:
The PerSentSE corpus was developed to study persuasive techniques in spam emails. It includes 130 emails randomly selected from the SpamArchive2122 dataset, which contains over 20,000 spam emails in English.

Methodology:

  • Segmentation: Emails were divided into sentences using the NLTK library.
  • Annotation: Sentences were annotated with eight persuasive techniques plus a "non-persuasion" class. Two expert annotators labeled an initial subset of emails to measure inter-annotator agreement, reaching an acceptable final agreement of γ = 0.63.
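
As an illustration of the segmentation step, the following is a minimal sketch of sentence splitting with NLTK's `sent_tokenize`. The record does not state which NLTK tokenizer or settings were used, so the default English Punkt tokenizer is an assumption here. The helper name `split_sentences`, the sample text, and the regex fallback (used only when NLTK or its Punkt model is not installed) are hypothetical and not part of the dataset's pipeline.

```python
import re


def split_sentences(text):
    """Split text into sentences, preferring NLTK's Punkt tokenizer.

    Assumption: the default English Punkt model; the dataset record
    does not say which tokenizer settings the authors used.
    """
    try:
        from nltk.tokenize import sent_tokenize
        return sent_tokenize(text)
    except (ImportError, LookupError):
        # Hypothetical fallback for environments without NLTK or its
        # Punkt data: split after sentence-final punctuation.
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Invented sample text, not taken from the corpus.
email_body = "Your account has been suspended. Click here to restore access now!"
sentences = split_sentences(email_body)
for sentence in sentences:
    print(sentence)
```

Real spam text often contains URLs, abbreviations, and irregular punctuation, so the output of any automatic splitter would likely need manual review before annotation.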

Corpus Statistics:

  • Total sentences: 1,075
  • Persuasive sentences: 216 (20.1%)

Persuasion Distribution by Email Sections (Table 7):

  • Subject lines: 35.59% persuasive, with an average of 1.62 techniques.
  • Greeting section: 54.17% persuasive, averaging 1.46 techniques.
  • Email body: 82.46% persuasive, with 5.51 techniques on average.
  • Farewell section: 31.43% persuasive, averaging 1.45 techniques.

Co-occurrence of Techniques (Figure 2):
Some persuasive techniques frequently appeared together:

  • Appeal to Fear/Prejudice with Loaded Language: 25 instances.
  • Exaggeration/Minimization with Loaded Language: 24 instances.
  • Appeal to Fear/Prejudice with Exaggeration/Minimization: 20 instances.
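
Pairwise counts of this kind could be computed along the following lines. This is a sketch under stated assumptions: it treats each annotated unit as a set of technique labels and counts every unordered pair once per unit. Whether the published figures were counted per sentence or per email is not stated in this record, and the function name `count_cooccurrences` and the toy annotations are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(annotated_units):
    """Count how often each unordered pair of techniques shares a unit."""
    counts = Counter()
    for labels in annotated_units:
        # Sort so that each pair always appears in the same order.
        for pair in combinations(sorted(set(labels)), 2):
            counts[pair] += 1
    return counts


# Toy annotations for illustration only; not taken from the corpus.
toy_units = [
    {"Loaded Language", "Appeal to Fear/Prejudice"},
    {"Loaded Language", "Exaggeration/Minimization"},
    {"Loaded Language", "Appeal to Fear/Prejudice", "Exaggeration/Minimization"},
    {"Loaded Language"},
]

pair_counts = count_cooccurrences(toy_units)
for pair, n in pair_counts.most_common():
    print(pair, n)
```

Units carrying a single label contribute no pairs, so the totals reflect only units where at least two techniques were annotated together.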

Findings:
The email body contains the highest concentration of persuasive elements, in contrast to earlier studies that focused on subject lines alone. This suggests that spam emails rely heavily on persuasive content in their main text.

Files

Files (167.6 kB)

md5:9ccd95a1d5e21a3e3f2f5eb8c40a2294 (167.6 kB)

Additional details

Related works

Is published in
Publication: 10.1016/j.eswa.2024.125767 (DOI)