Published May 15, 2025 | Version 1.0.0
Dataset Open

Spam Images in Messaging - Annotated Set (SIMAS)

  • 1. Infobip
  • 2. Graz University of Technology

Description

SIMAS Dataset

This archive includes the SIMAS dataset for fine-tuning models for MMS (Multimedia Messaging Service) image moderation. SIMAS is a balanced collection of publicly available images, manually annotated in accordance with a specialized taxonomy designed for identifying visual spam in MMS messages.

Taxonomy for MMS Visual Spam

The following table presents the definitions of categories used for classifying MMS images.

Table 1: Category definitions

Category    Description 
Alcohol*    Content related to alcoholic beverages, including advertisements and consumption. 
Drugs*      Content related to the use, sale, or trafficking of narcotics (e.g., cannabis, cocaine,
Firearms*   Content involving guns, pistols, knives, or military weapons. 
Gambling*   Content related to gambling (casinos, poker, roulette, lotteries). 
Sexual      Content involving nudity, sexual acts, or sexually suggestive material. 
Tobacco*    Content related to tobacco use and advertisements. 
Violence    Content showing violent acts, self-harm, or injury. 
Safe        All other content, including neutral depictions, products, or harmless cultural symbols

Note: Categories marked with an asterisk are regulated in some jurisdictions and may not be universally restricted.
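For use in code, the taxonomy in Table 1 can be encoded as a small enumeration. The following is a minimal sketch: the names `Category`, `REGULATED`, and `is_regulated` are illustrative and not part of the dataset, and the regulated set simply mirrors the asterisks in Table 1.

```python
from enum import Enum


class Category(Enum):
    """The eight labels from Table 1 (identifier names are illustrative)."""
    ALCOHOL = "Alcohol"
    DRUGS = "Drugs"
    FIREARMS = "Firearms"
    GAMBLING = "Gambling"
    SEXUAL = "Sexual"
    TOBACCO = "Tobacco"
    VIOLENCE = "Violence"
    SAFE = "Safe"


# Categories marked with an asterisk in Table 1
# (regulated in some jurisdictions, not universally restricted).
REGULATED = frozenset({
    Category.ALCOHOL,
    Category.DRUGS,
    Category.FIREARMS,
    Category.GAMBLING,
    Category.TOBACCO,
})


def is_regulated(category: Category) -> bool:
    """Return True if the category carries an asterisk in Table 1."""
    return category in REGULATED
```

Keeping the regulated flag separate from the label lets a deployment decide per jurisdiction whether an asterisked category should be treated as restricted.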

Dataset Collection and Annotation

Data Sources

The SIMAS dataset combines publicly available images from multiple sources, selected to reflect the categories defined in our content taxonomy. Each image was manually reviewed by three independent annotators, with final labels assigned when at least two annotators agreed.
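The two-of-three agreement rule described above can be sketched as a simple majority vote. This is an illustration only: the function name `majority_label` is hypothetical, and returning `None` when all three annotators disagree is an assumption, since the description does not say how such cases were handled.

```python
from collections import Counter


def majority_label(votes, min_agreement=2):
    """Return the label chosen by at least `min_agreement` annotators.

    Returns None when no label reaches the threshold (assumed behaviour;
    the dataset description does not specify it).
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agreement else None


# Three independent annotations for one image.
final = majority_label(["Tobacco", "Tobacco", "Safe"])
```

With three annotators and a threshold of two, at most one label can satisfy the rule, so the result is unambiguous whenever it is not `None`.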

The largest portion of the dataset (30.4%) originates from LAION-400M, a large-scale image-text dataset. To identify relevant content, we first selected a list of ImageNet labels that semantically matched our taxonomy. These labels were generated using GPT-4o in a zero-shot setting, using separate prompts per category. This resulted in 194 candidate labels, of which 88.7% were retained after manual review. The structure of the prompts used in this process is shown in the file gpt4o_imagenet_prompting_scheme.png, which illustrates a shared base prompt template applied across all categories. The fields category_definition, file_examples, and exceptions are specified per category. Definitions align with the taxonomy, while the file_examples column includes sample labels retrieved from the ImageNet label list. The exceptions field contains category-specific filtering instructions; a dash indicates no exceptions were specified.
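To make the structure of the shared prompt template concrete, here is a hedged sketch of how the three per-category fields could be filled into one base prompt. Only the field names `category_definition`, `file_examples`, and `exceptions` come from the description; the template wording, the helper `build_prompt`, and the example labels are assumptions, and the actual prompt text is the one shown in gpt4o_imagenet_prompting_scheme.png.

```python
# Hypothetical template wording. The real prompt is documented in
# gpt4o_imagenet_prompting_scheme.png and is not reproduced here.
BASE_PROMPT = (
    "Category definition: {category_definition}\n"
    "Example labels: {file_examples}\n"
    "Exceptions: {exceptions}\n"
    "From the ImageNet label list, return the labels that match this category."
)


def build_prompt(category_definition, file_examples, exceptions="-"):
    """Fill the shared template with the per-category fields.

    A dash in `exceptions` means no exceptions were specified.
    """
    return BASE_PROMPT.format(
        category_definition=category_definition,
        file_examples=", ".join(file_examples),
        exceptions=exceptions,
    )


prompt = build_prompt(
    category_definition=(
        "Content related to alcoholic beverages, "
        "including advertisements and consumption."
    ),
    # Illustrative labels only, not necessarily the authors' examples.
    file_examples=["beer bottle", "wine bottle"],
)
```

One prompt per category keeps each zero-shot request focused on a single definition, which matches the per-category prompting described above.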

Another 25.1% of images were sourced from Roboflow, using open datasets such as:

The NudeNet dataset contributes 11.4% of the dataset. We sampled 1,000 images from the “porn” category to provide visual coverage of explicit sexual content.

Another 11.0% of images were collected from Kaggle, including:

An additional 9.9% of images were retrieved from Unsplash, using keyword-based search queries aligned with each category in our taxonomy.

Images from UnsafeBench make up 8.0% of the dataset. Since its original binary labels did not match our taxonomy, all samples were manually reassigned to the most appropriate category.

Finally, 4.2% of images were gathered from various publicly accessible websites. These were primarily used to improve category balance and model generalization, especially in safe classes.

All images collected from the listed sources were manually reviewed by three independent annotators, and each image was assigned to a category when at least two annotators agreed.

Table 2: Distribution of images per public source and category in SIMAS dataset

Type    Category  LAION  Roboflow  NudeNet  Kaggle  Unsplash  UnsafeBench  Other  Total
Unsafe  Alcohol   29     0         3        267     0         1            0      300
Unsafe  Drugs     17     211       0        0       13        8            1      250
Unsafe  Firearms  0      59        0        229     0         62           0      350
Unsafe  Gambling  132    38        0        0       73        39           18     300
Unsafe  Sexual    2      0         421      0       3         68           6      500
Unsafe  Tobacco   0      446       0        0       43        11           0      500
Unsafe  Violence  0      289       0        0       0         11           0      300
Safe    Alcohol   140    35        0        0       16        13           96     300
Safe    Drugs     67     49        0        15      72        17           30     250
Safe    Firearms  173    15        0        3       144       8            7      350
Safe    Gambling  164    2         0        1       121       12           0      300
Safe    Sexual    235    22        139      2       0         94           8      500
Safe    Tobacco   351    67        5        13      8         16           40     500
Safe    Violence  212    20        3        21      0         42           2      300
All     All       1,522  1,253     571      551     493       402          208    5,000
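The per-source percentages quoted in the Data Sources text follow directly from the "All" row of Table 2. A short check, with variable names chosen for illustration:

```python
# Per-source image counts from the "All" row of Table 2.
source_totals = {
    "LAION": 1522,
    "Roboflow": 1253,
    "NudeNet": 571,
    "Kaggle": 551,
    "Unsplash": 493,
    "UnsafeBench": 402,
    "Other": 208,
}

total_images = sum(source_totals.values())

# Share of the dataset contributed by each source, in percent.
shares = {
    source: round(100 * count / total_images, 1)
    for source, count in source_totals.items()
}

for source, share in shares.items():
    print(f"{source}: {share}%")
```

The counts sum to 5,000 and the rounded shares agree with the figures given in the text (30.4%, 25.1%, 11.4%, 11.0%, 9.9%, 8.0%, and 4.2%).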

Balancing

To ensure semantic diversity and dataset balance, overrepresented categories were undersampled using CLIP-based embeddings and k-means clustering. The final dataset contains 2,500 unsafe (spam) and 2,500 safe images, with equal numbers of unsafe and safe images within each category.
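The description does not give the details of the undersampling procedure, so the following is only one plausible reading, sketched under stated assumptions: embeddings are precomputed (toy 2-D vectors stand in for CLIP embeddings here), the number of clusters equals the target size, Euclidean distance is used, and the image closest to each centroid is kept. The function names are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans(embeddings, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(embeddings, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for vec in embeddings:
            nearest = min(
                range(k), key=lambda c: squared_distance(vec, centroids[c])
            )
            clusters[nearest].append(vec)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [
            mean_vector(members) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids


def undersample(embeddings, target_size, seed=0):
    """Return sorted indices of the items to keep.

    Assumed strategy: one representative per cluster, namely the
    not-yet-selected item closest to each centroid.
    """
    if len(embeddings) <= target_size:
        return list(range(len(embeddings)))
    centroids = kmeans(embeddings, target_size, seed=seed)
    taken = set()
    for centroid in centroids:
        candidates = [i for i in range(len(embeddings)) if i not in taken]
        best = min(
            candidates, key=lambda i: squared_distance(embeddings[i], centroid)
        )
        taken.add(best)
    return sorted(taken)


# Toy stand-ins for CLIP embeddings: three loose groups of three points.
toy_embeddings = [
    [0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
    [5.0, 5.1], [5.1, 5.0], [5.05, 5.05],
    [9.0, 0.1], [9.1, 0.0], [9.05, 0.05],
]
kept = undersample(toy_embeddings, target_size=3)
```

Selecting per cluster rather than uniformly at random is what preserves semantic diversity: near-duplicate images fall into the same cluster and contribute a single representative.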

Table 3: Distribution of images per category in SIMAS dataset

Type    Alcohol  Drugs  Firearms  Gambling  Sexual  Tobacco  Violence  Total
Unsafe  300      250    350       300       500     500      300       2,500
Safe    300      250    350       300       500     500      300       2,500
All     600      500    700       600       1,000   1,000    600       5,000

SIMAS+ Dataset

For researchers interested in a more realistic deployment setting, we also curate a complementary dataset called SIMAS+. It is a benchmarking dataset containing publicly accessible images extracted from real-world MMS traffic, specifically from external URLs embedded in messages. Manual annotation was conducted by three independent raters, with a category label assigned when at least two annotators agreed. The dataset was then balanced across spam categories using the same semantic grouping strategy as in SIMAS, ensuring equal representation of safe and unsafe examples per class. The final version of SIMAS+ contains 700 images, with the category distribution presented in the table below.
 
Table 4: Distribution of images per category in SIMAS+ dataset

Type    Alcohol  Drugs  Firearms  Gambling  Sexual  Tobacco  Violence  Total
Unsafe  100      50     80        50        50      10       10        350
Safe    100      50     80        50        50      10       10        350
All     200      100    160       100       100     20       20        700
 
Note: Due to regulatory and privacy considerations, SIMAS+ is not included in this archive. To obtain access to the SIMAS+ dataset for research purposes, please contact the dataset authors directly.
 

License

This dataset is licensed under the CC BY-NC 4.0 license and may be used for non-commercial research purposes.
 

Files

gpt4o_imagenet_prompting_scheme.png

Files (1.7 GB)

Checksum                                Size
md5:b135698203998df869e91f9022fef314    125.5 kB
md5:dd59d800cec45fd0a935ad2316667996    7.9 kB
md5:5640868df5f8b73043224ff9a5b34214    1.7 GB
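The file listing above gives an MD5 checksum for each file, which can be used to confirm that a download is intact. A minimal sketch using Python's standard `hashlib` module follows; the helper names are illustrative, and the file is read in chunks so that the 1.7 GB archive does not need to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum("path/to/downloaded_file", "md5:5640868df5f8b73043224ff9a5b34214")` should return `True` for an intact copy of the 1.7 GB file; the path is a placeholder, since the listing does not show file names.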