Spam Images in Messaging - Annotated Set (SIMAS)

Ljubičić, Maria; Lacić, Emanuel; Helic, Denis

doi:10.5281/zenodo.15423637

Published May 15, 2025 | Version 1.0.0

1. Infobip
2. Graz University of Technology

SIMAS Dataset

This archive includes the SIMAS dataset for fine-tuning models for MMS (Multimedia Messaging Service) image moderation. SIMAS is a balanced collection of publicly available images, manually annotated in accordance with a specialized taxonomy designed for identifying visual spam in MMS messages.

Taxonomy for MMS Visual Spam

The following table presents the definitions of categories used for classifying MMS images.

Table 1: Category definitions

Category	Description
Alcohol*	Content related to alcoholic beverages, including advertisements and consumption.
Drugs*	Content related to the use, sale, or trafficking of narcotics (e.g., cannabis, cocaine,
Firearms*	Content involving guns, pistols, knives, or military weapons.
Gambling*	Content related to gambling (casinos, poker, roulette, lotteries).
Sexual	Content involving nudity, sexual acts, or sexually suggestive material.
Tobacco*	Content related to tobacco use and advertisements.
Violence	Content showing violent acts, self-harm, or injury.
Safe	All other content, including neutral depictions, products, or harmless cultural symbols

Note: Categories marked with an asterisk are regulated in some jurisdictions and may not be universally restricted.
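The taxonomy in Table 1 can be encoded as a small lookup structure for use in annotation or training scripts. The following is a minimal sketch, not part of the released archive: the names `Category`, `REGULATED`, and `is_regulated` are assumptions introduced here for illustration, while the category names and asterisk markings are taken from Table 1.

```python
from enum import Enum


class Category(Enum):
    """The eight categories listed in Table 1."""
    ALCOHOL = "Alcohol"
    DRUGS = "Drugs"
    FIREARMS = "Firearms"
    GAMBLING = "Gambling"
    SEXUAL = "Sexual"
    TOBACCO = "Tobacco"
    VIOLENCE = "Violence"
    SAFE = "Safe"


# Categories marked with an asterisk in Table 1 (regulated in some jurisdictions).
REGULATED = frozenset({
    Category.ALCOHOL,
    Category.DRUGS,
    Category.FIREARMS,
    Category.GAMBLING,
    Category.TOBACCO,
})


def is_regulated(label: str) -> bool:
    """Return True if the label names an asterisk-marked category."""
    return Category(label) in REGULATED
```

An unknown label raises `ValueError` from the `Category(label)` lookup, which helps catch typos in annotation files early.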

Dataset Collection and Annotation

Data Sources

The SIMAS dataset combines publicly available images from multiple sources, selected to reflect the categories defined in our content taxonomy. Each image was manually reviewed by three independent annotators, with final labels assigned when at least two annotators agreed.

The largest portion of the dataset (30.4%) originates from LAION-400M, a large-scale image-text dataset. To identify relevant content, we first compiled a list of ImageNet labels that semantically matched our taxonomy. Candidate labels were proposed by GPT-4o in a zero-shot setting, with a separate prompt for each category. This yielded 194 candidate labels, of which 88.7% (172) were retained after manual review.

The structure of the prompts is shown in the file gpt4o_imagenet_prompting_scheme.png, which illustrates a shared base prompt template applied across all categories. The fields category_definition, file_examples, and exceptions are specified per category. Definitions align with the taxonomy, while the file_examples field contains sample labels retrieved from the ImageNet label list. The exceptions field contains category-specific filtering instructions; a dash indicates that no exceptions were specified.
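As a rough illustration of how a shared base template with per-category fields can be filled in, consider the sketch below. Only the three field names (category_definition, file_examples, exceptions) come from the description above. The template wording, the `build_prompt` helper, and the example labels are hypothetical; the actual prompt text is the one shown in gpt4o_imagenet_prompting_scheme.png.

```python
from string import Template

# Hypothetical wording: the real base prompt is in
# gpt4o_imagenet_prompting_scheme.png and may differ substantially.
BASE_PROMPT = Template(
    "Category definition: $category_definition\n"
    "Example labels: $file_examples\n"
    "Exceptions: $exceptions\n"
    "From the ImageNet label list, return the labels that match this category."
)


def build_prompt(category_definition, file_examples, exceptions="-"):
    """Fill the shared template with one category's fields.

    A dash for `exceptions` means no exceptions were specified.
    """
    return BASE_PROMPT.substitute(
        category_definition=category_definition,
        file_examples=", ".join(file_examples),
        exceptions=exceptions,
    )


# Example call; the two sample labels are placeholders, not the authors' list.
prompt = build_prompt(
    "Content related to gambling (casinos, poker, roulette, lotteries).",
    ["slot", "roulette wheel"],
)
```

Keeping the per-category values outside the template makes it easy to review or revise one category without touching the shared instructions.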

Another 25.1% of images were sourced from open datasets hosted on Roboflow.

NudeNet contributes 11.4% of the images. We sampled 1,000 images from its “porn” category to provide visual coverage of explicit sexual content.

Another 11.0% of images were collected from datasets on Kaggle.

An additional 9.9% of images were retrieved from Unsplash, using keyword-based search queries aligned with each category in our taxonomy.

Images from UnsafeBench make up 8.0% of the dataset. Since its original binary labels did not match our taxonomy, all samples were manually reassigned to the most appropriate category.

Finally, 4.2% of images were gathered from various publicly accessible websites. These were primarily used to improve category balance and model generalization, especially in safe classes.

All images collected from the listed sources were manually reviewed by three independent annotators. Each image was assigned to a category when at least two annotators reached consensus.
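The two-of-three agreement rule can be sketched as a simple majority vote. The function name `majority_label` is an assumption, and so is returning `None` when all three annotators disagree: the source does not say how such images were handled.

```python
from collections import Counter


def majority_label(votes, min_agree=2):
    """Return the label chosen by at least `min_agree` annotators, else None.

    With three votes and min_agree=2, at most one label can qualify.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agree else None


print(majority_label(["Alcohol", "Safe", "Alcohol"]))  # Alcohol
print(majority_label(["Alcohol", "Safe", "Drugs"]))    # None
```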

Table 2: Distribution of images per public source and category in SIMAS dataset

Type	Category	LAION	Roboflow	NudeNet	Kaggle	Unsplash	UnsafeBench	Other	Total
Unsafe	Alcohol	29	0	3	267	0	1	0	300
Unsafe	Drugs	17	211	0	0	13	8	1	250
Unsafe	Firearms	0	59	0	229	0	62	0	350
Unsafe	Gambling	132	38	0	0	73	39	18	300
Unsafe	Sexual	2	0	421	0	3	68	6	500
Unsafe	Tobacco	0	446	0	0	43	11	0	500
Unsafe	Violence	0	289	0	0	0	11	0	300
Safe	Alcohol	140	35	0	0	16	13	96	300
Safe	Drugs	67	49	0	15	72	17	30	250
Safe	Firearms	173	15	0	3	144	8	7	350
Safe	Gambling	164	2	0	1	121	12	0	300
Safe	Sexual	235	22	139	2	0	94	8	500
Safe	Tobacco	351	67	5	13	8	16	40	500
Safe	Violence	212	20	3	21	0	42	2	300
All	All	1,522	1,253	571	551	493	402	208	5,000
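The figures in Table 2 can be cross-checked against the source percentages quoted in the text. The snippet below is a small consistency check: the counts are copied from Table 2, and the variable names are chosen here for illustration.

```python
SOURCES = ["LAION", "Roboflow", "NudeNet", "Kaggle", "Unsplash", "UnsafeBench", "Other"]

# (type, category) -> image counts per source, in the column order of Table 2.
ROWS = {
    ("Unsafe", "Alcohol"):  [29, 0, 3, 267, 0, 1, 0],
    ("Unsafe", "Drugs"):    [17, 211, 0, 0, 13, 8, 1],
    ("Unsafe", "Firearms"): [0, 59, 0, 229, 0, 62, 0],
    ("Unsafe", "Gambling"): [132, 38, 0, 0, 73, 39, 18],
    ("Unsafe", "Sexual"):   [2, 0, 421, 0, 3, 68, 6],
    ("Unsafe", "Tobacco"):  [0, 446, 0, 0, 43, 11, 0],
    ("Unsafe", "Violence"): [0, 289, 0, 0, 0, 11, 0],
    ("Safe", "Alcohol"):    [140, 35, 0, 0, 16, 13, 96],
    ("Safe", "Drugs"):      [67, 49, 0, 15, 72, 17, 30],
    ("Safe", "Firearms"):   [173, 15, 0, 3, 144, 8, 7],
    ("Safe", "Gambling"):   [164, 2, 0, 1, 121, 12, 0],
    ("Safe", "Sexual"):     [235, 22, 139, 2, 0, 94, 8],
    ("Safe", "Tobacco"):    [351, 67, 5, 13, 8, 16, 40],
    ("Safe", "Violence"):   [212, 20, 3, 21, 0, 42, 2],
}

# Row totals should match the "Total" column; column sums the "All" row.
row_totals = {key: sum(counts) for key, counts in ROWS.items()}
source_totals = dict(zip(SOURCES, (sum(col) for col in zip(*ROWS.values()))))
grand_total = sum(source_totals.values())

# Share of the dataset per source, in percent, rounded to one decimal place.
shares = {s: round(100 * n / grand_total, 1) for s, n in source_totals.items()}
print(grand_total)  # 5000
print(shares)
```

The computed shares (30.4, 25.1, 11.4, 11.0, 9.9, 8.0, and 4.2 percent) agree with the percentages given in the Data Sources section.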

Balancing

To ensure semantic diversity and dataset balance, overrepresented categories were undersampled using CLIP-based embeddings and k-means clustering. This resulted in a final dataset of 2,500 unsafe (spam) and 2,500 safe images, with equal numbers of unsafe and safe images in each category.
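One plausible way to undersample with embeddings and k-means is to cluster a category's images into as many clusters as the target size and keep the image nearest each centroid. The sketch below works under that assumption; the source does not state the exact selection rule, the number of clusters, or the CLIP model used. It takes precomputed embedding vectors as input and uses a small pure-Python k-means so that it runs without dependencies. In practice a library implementation such as scikit-learn's `KMeans` would be the usual choice. The names `kmeans`, `nearest`, and `undersample` are introduced here for illustration.

```python
import math
import random


def nearest(points, centroids):
    """Index of the closest centroid (Euclidean distance) for every point."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on a list of equal-length float vectors."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        assign = nearest(points, centroids)
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, nearest(points, centroids)


def undersample(embeddings, target, seed=0):
    """Return sorted indices of up to `target` items, one per cluster.

    Assumption: the representative of a cluster is the item closest to its
    centroid. Empty clusters contribute nothing, so fewer than `target`
    indices may be returned.
    """
    if target >= len(embeddings):
        return list(range(len(embeddings)))
    centroids, assign = kmeans(embeddings, target, seed=seed)
    keep = []
    for c, centroid in enumerate(centroids):
        members = [i for i, a in enumerate(assign) if a == c]
        if members:
            keep.append(min(members, key=lambda i: math.dist(embeddings[i], centroid)))
    return sorted(keep)


# Toy example: two tight groups of 2-D "embeddings", reduced to one item each.
toy = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
kept = undersample(toy, target=2)
```

Picking one representative per cluster removes near-duplicates first, which is what preserves semantic diversity while the category shrinks to its target size.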

Table 3: Distribution of images per category in SIMAS dataset

Type	Alcohol	Drugs	Firearms	Gambling	Sexual	Tobacco	Violence	Total
Unsafe	300	250	350	300	500	500	300	2,500
Safe	300	250	350	300	500	500	300	2,500
All	600	500	700	600	1,000	1,000	600	5,000

SIMAS+ Dataset

For researchers interested in a more realistic deployment setting, we also curate a complementary dataset called SIMAS+. It is a benchmarking dataset containing publicly accessible images extracted from real-world MMS traffic, specifically from external URLs embedded in messages. Manual annotation was conducted by three independent raters, with a category label assigned when at least two annotators agreed. The dataset was then balanced across spam categories using the same semantic grouping strategy as in SIMAS, ensuring equal representation of safe and unsafe examples per class. The final version of SIMAS+ contains 700 images, with the category distribution presented in the table below.

Table 4: Distribution of images per category in SIMAS+ dataset

Type	Alcohol	Drugs	Firearms	Gambling	Sexual	Tobacco	Violence	Total
Unsafe	100	50	80	50	50	10	10	350
Safe	100	50	80	50	50	10	10	350
All	200	100	160	100	100	20	20	700

Note: Due to regulatory and privacy considerations, SIMAS+ is not included in this archive. To obtain access to the SIMAS+ dataset for research purposes, please contact the dataset authors directly.

License

This dataset is licensed under the CC BY-NC 4.0 license and may be used for non-commercial research purposes.

Files (1.7 GB)

Name	MD5	Size
gpt4o_imagenet_prompting_scheme.png	b135698203998df869e91f9022fef314	125.5 kB
README.md	dd59d800cec45fd0a935ad2316667996	7.9 kB
simas.tar.gz	5640868df5f8b73043224ff9a5b34214	1.7 GB
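After downloading, the archive can be checked against the MD5 value listed for simas.tar.gz. The following is a minimal sketch using Python's standard `hashlib`. The helper name `md5_of_file` and the assumption that the archive sits in the current directory are illustrative.

```python
import hashlib

# MD5 checksum listed for simas.tar.gz on the record page.
EXPECTED_MD5 = "5640868df5f8b73043224ff9a5b34214"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

With the archive in the working directory, `md5_of_file("simas.tar.gz") == EXPECTED_MD5` should evaluate to `True` for an intact download. Reading in chunks avoids loading the 1.7 GB file into memory at once.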

