Spam Images in Messaging - Annotated Set (SIMAS)
SIMAS Dataset
This archive includes the SIMAS dataset for fine-tuning models for MMS (Multimedia Messaging Service) image moderation. SIMAS is a balanced collection of publicly available images, manually annotated in accordance with a specialized taxonomy designed for identifying visual spam in MMS messages.
Taxonomy for MMS Visual Spam
The following table presents the definitions of categories used for classifying MMS images.
Table 1: Category definitions
| Category | Description |
|---|---|
| Alcohol* | Content related to alcoholic beverages, including advertisements and consumption. |
| Drugs* | Content related to the use, sale, or trafficking of narcotics (e.g., cannabis, cocaine, |
| Firearms* | Content involving guns, pistols, knives, or military weapons. |
| Gambling* | Content related to gambling (casinos, poker, roulette, lotteries). |
| Sexual | Content involving nudity, sexual acts, or sexually suggestive material. |
| Tobacco* | Content related to tobacco use and advertisements. |
| Violence | Content showing violent acts, self-harm, or injury. |
| Safe | All other content, including neutral depictions, products, or harmless cultural symbols |
Note: Categories marked with an asterisk are regulated in some jurisdictions and may not be universally restricted.
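For training or evaluation scripts, the taxonomy can be held in a small lookup structure. The sketch below is illustrative only: the category names and the asterisk flag come from Table 1, while the names `JURISDICTION_DEPENDENT`, `regulated_categories`, and `is_valid_label` are hypothetical and not part of the SIMAS release.

```python
# Category names are taken from Table 1. The boolean mirrors the asterisk,
# i.e. "regulated in some jurisdictions, not universally restricted".
# All identifiers here are hypothetical helpers, not part of the dataset.
JURISDICTION_DEPENDENT = {
    "Alcohol": True,
    "Drugs": True,
    "Firearms": True,
    "Gambling": True,
    "Sexual": False,
    "Tobacco": True,
    "Violence": False,
    "Safe": False,
}


def regulated_categories():
    """Return the asterisk-marked categories in alphabetical order."""
    return sorted(name for name, flag in JURISDICTION_DEPENDENT.items() if flag)


def is_valid_label(label):
    """Check that a label belongs to the taxonomy."""
    return label in JURISDICTION_DEPENDENT


print(regulated_categories())
```

A deployment that only restricts some of the asterisk-marked categories could filter on this flag before mapping model predictions to a block or allow decision.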
Dataset Collection and Annotation
Data Sources
The SIMAS dataset combines publicly available images from multiple sources, selected to reflect the categories defined in our content taxonomy. Each image was manually reviewed by three independent annotators, with final labels assigned when at least two annotators agreed.
The largest portion of the dataset (30.4%) originates from LAION-400M, a large-scale image-text dataset. To identify relevant content, we first compiled a list of ImageNet labels that semantically match our taxonomy. The candidate labels were generated with GPT-4o in a zero-shot setting, using a separate prompt per category. This produced 194 candidate labels, of which 88.7% (172 labels) were retained after manual review. The structure of the prompts is shown in the file gpt4o_imagenet_prompting_scheme.png, which illustrates a shared base prompt template applied across all categories. The fields category_definition, file_examples, and exceptions are specified per category: definitions align with the taxonomy, file_examples contains sample labels retrieved from the ImageNet label list, and exceptions contains category-specific filtering instructions (a dash indicates that no exceptions were specified).
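The per-category prompt construction described above can be sketched as a template filled from three fields. This is a minimal illustration under stated assumptions: the field names category_definition, file_examples, and exceptions come from the text, but the template wording, the function `build_prompt`, and the example labels are hypothetical. The actual prompt is the one shown in gpt4o_imagenet_prompting_scheme.png.

```python
# Hypothetical base template; the real wording is in
# gpt4o_imagenet_prompting_scheme.png and is not reproduced here.
BASE_TEMPLATE = (
    "Category definition: {category_definition}\n"
    "Example labels: {file_examples}\n"
    "{exceptions_line}"
    "List the ImageNet labels that match this category."
)


def build_prompt(category_definition, file_examples, exceptions="-"):
    """Fill the shared template for one category.

    A dash (or empty string) in `exceptions` means no exceptions were
    specified, so the corresponding line is left out of the prompt.
    """
    has_exceptions = exceptions.strip() not in {"-", ""}
    exceptions_line = f"Exceptions: {exceptions}\n" if has_exceptions else ""
    return BASE_TEMPLATE.format(
        category_definition=category_definition,
        file_examples=", ".join(file_examples),
        exceptions_line=exceptions_line,
    )


# The definition is taken from Table 1; the example labels are illustrative.
prompt = build_prompt(
    "Content related to gambling (casinos, poker, roulette, lotteries).",
    ["slot", "roulette wheel"],
)
print(prompt)
```

Keeping one shared template and varying only the three fields makes the label-generation step reproducible across categories.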
Another 25.1% of images were sourced from Roboflow, using open datasets such as:
- Marijuana and Hemp 200
- Drug Detection
- Plants Classification
- Weapon Detection
- Suicide Detection
- Violence Detection
- Clasificacionimagenes
- Waste Recognition
The NudeNet dataset accounts for 11.4% of the images. We sampled 1,000 images from the “porn” category to provide visual coverage of explicit sexual content.
Another 11.0% of images were collected from Kaggle, including:
- National Flowers
- Weapon Dataset for YOLOv5
- GUIE Toys
- Alcohol Bottle Images
- Smoking & Drinking Dataset
An additional 9.9% of images were retrieved from Unsplash, using keyword-based search queries aligned with each category in our taxonomy.
Images from UnsafeBench make up 8.0% of the dataset. Since its original binary labels did not match our taxonomy, all samples were manually reassigned to the most appropriate category.
Finally, 4.2% of images were gathered from various publicly accessible websites. These were primarily used to improve category balance and model generalization, especially in safe classes.
All images collected from the listed sources were manually reviewed by three independent annotators, and each image was assigned to a category when at least two annotators agreed.
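The two-of-three agreement rule can be expressed as a short majority vote. The sketch below assumes each image has exactly three annotator labels. The function name `consensus_label` is hypothetical, and returning `None` when all three annotators disagree is an assumption, since the text does not say how such cases were handled.

```python
from collections import Counter


def consensus_label(votes, min_agreement=2):
    """Return the label chosen by at least `min_agreement` annotators.

    Returns None when no label reaches the threshold (an assumption:
    the handling of such images is not described in the source).
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agreement else None


print(consensus_label(["Tobacco", "Tobacco", "Safe"]))  # Tobacco
print(consensus_label(["Tobacco", "Safe", "Drugs"]))    # None
```

With three votes and a threshold of two, at most one label can reach the threshold, so ties cannot occur.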
Table 2: Distribution of images per public source and category in SIMAS dataset
| Type | Category | LAION | Roboflow | NudeNet | Kaggle | Unsplash | UnsafeBench | Other | Total |
|---|---|---|---|---|---|---|---|---|---|
| Unsafe | Alcohol | 29 | 0 | 3 | 267 | 0 | 1 | 0 | 300 |
| Unsafe | Drugs | 17 | 211 | 0 | 0 | 13 | 8 | 1 | 250 |
| Unsafe | Firearms | 0 | 59 | 0 | 229 | 0 | 62 | 0 | 350 |
| Unsafe | Gambling | 132 | 38 | 0 | 0 | 73 | 39 | 18 | 300 |
| Unsafe | Sexual | 2 | 0 | 421 | 0 | 3 | 68 | 6 | 500 |
| Unsafe | Tobacco | 0 | 446 | 0 | 0 | 43 | 11 | 0 | 500 |
| Unsafe | Violence | 0 | 289 | 0 | 0 | 0 | 11 | 0 | 300 |
| Safe | Alcohol | 140 | 35 | 0 | 0 | 16 | 13 | 96 | 300 |
| Safe | Drugs | 67 | 49 | 0 | 15 | 72 | 17 | 30 | 250 |
| Safe | Firearms | 173 | 15 | 0 | 3 | 144 | 8 | 7 | 350 |
| Safe | Gambling | 164 | 2 | 0 | 1 | 121 | 12 | 0 | 300 |
| Safe | Sexual | 235 | 22 | 139 | 2 | 0 | 94 | 8 | 500 |
| Safe | Tobacco | 351 | 67 | 5 | 13 | 8 | 16 | 40 | 500 |
| Safe | Violence | 212 | 20 | 3 | 21 | 0 | 42 | 2 | 300 |
| All | All | 1,522 | 1,253 | 571 | 551 | 493 | 402 | 208 | 5,000 |
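The source percentages quoted in the text follow directly from the bottom row of Table 2. The check below recomputes them; the counts are copied from the table, and the names `SOURCE_TOTALS` and `source_shares` are illustrative.

```python
# Per-source totals copied from the "All" row of Table 2.
SOURCE_TOTALS = {
    "LAION": 1522,
    "Roboflow": 1253,
    "NudeNet": 571,
    "Kaggle": 551,
    "Unsplash": 493,
    "UnsafeBench": 402,
    "Other": 208,
}


def source_shares(totals):
    """Return each source's share of the dataset in percent, to one decimal."""
    grand_total = sum(totals.values())
    return {name: round(100 * count / grand_total, 1) for name, count in totals.items()}


shares = source_shares(SOURCE_TOTALS)
for name, share in shares.items():
    print(f"{name}: {share}%")
```

The totals sum to 5,000, and the resulting shares (30.4, 25.1, 11.4, 11.0, 9.9, 8.0, and 4.2 percent) match the figures given for each source.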
Balancing
To ensure semantic diversity and dataset balance, overrepresented categories were undersampled using CLIP-based embeddings and k-means clustering. This resulted in a final dataset of 2,500 unsafe (spam) and 2,500 safe images, with the same number of unsafe and safe images in each category.
Table 3: Distribution of images per category in SIMAS dataset
| Type | Alcohol | Drugs | Firearms | Gambling | Sexual | Tobacco | Violence | Total |
|---|---|---|---|---|---|---|---|---|
| Unsafe | 300 | 250 | 350 | 300 | 500 | 500 | 300 | 2,500 |
| Safe | 300 | 250 | 350 | 300 | 500 | 500 | 300 | 2,500 |
| All | 600 | 500 | 700 | 600 | 1,000 | 1,000 | 600 | 5,000 |
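One way to realize the clustering-based undersampling described in this section is to cluster the image embeddings into as many groups as images to keep, then retain the image nearest to each centroid. The sketch below is an assumption-laden illustration, not the authors' procedure: the text only names CLIP embeddings and k-means, so the selection rule, the pure-Python k-means, and the names `kmeans` and `undersample` are hypothetical. Random 2-D points stand in for CLIP embeddings.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of its members (unchanged if empty).
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


def undersample(embeddings, target_size, seed=0):
    """Return sorted indices of `target_size` items, one per cluster."""
    if target_size >= len(embeddings):
        return list(range(len(embeddings)))
    centroids, assignments = kmeans(embeddings, target_size, seed=seed)
    kept = []
    for c, centroid in enumerate(centroids):
        members = [i for i, a in enumerate(assignments) if a == c]
        if members:
            # Keep the member closest to the cluster centroid.
            kept.append(min(members, key=lambda i: math.dist(embeddings[i], centroid)))
    # Top up at random in the rare case that a cluster ended up empty.
    remaining = [i for i in range(len(embeddings)) if i not in kept]
    kept += random.Random(seed).sample(remaining, target_size - len(kept))
    return sorted(kept)


# Stand-in data: 60 random 2-D vectors instead of real CLIP embeddings.
rng = random.Random(1)
embeddings = [[rng.random(), rng.random()] for _ in range(60)]
print(undersample(embeddings, 10))
```

Picking one representative per cluster spreads the retained images across the embedding space, which is the intuition behind using clustering rather than uniform random sampling to preserve semantic diversity.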
SIMAS+ Dataset
Table 4: Distribution of images per category in SIMAS+ dataset
| Type | Alcohol | Drugs | Firearms | Gambling | Sexual | Tobacco | Violence | Total |
|---|---|---|---|---|---|---|---|---|
| Unsafe | 100 | 50 | 80 | 50 | 50 | 10 | 10 | 350 |
| Safe | 100 | 50 | 80 | 50 | 50 | 10 | 10 | 350 |
| All | 200 | 100 | 160 | 100 | 100 | 20 | 20 | 700 |