Published August 9, 2022 | Version v1
Dataset Open

Recod.ai Scientific Image Integrity Dataset (RSIID)

  • 1. Universidade Estadual de Campinas (UNICAMP)

Description

The Recod.ai Scientific Image Integrity Dataset (RSIID) is a benchmark dataset designed for evaluating forgery detection methods in scientific images. This dataset comprises 39,423 synthetically tampered figures, derived from 2,923 pristine scientific images sourced from Creative Commons repositories. The dataset is divided into training (26,496 figures) and testing (12,927 figures) sets, all licensed under Creative Commons Attribution (CC-BY).

The RSIID is structured by forgery modality and figure complexity, categorized into "Simple" and "Compound" figures:

  • Simple Scientific Figures: These include forgeries created through Retouching (Blurring, Contrast adjustments), Cleaning (Inpainting, Brute-force removal), and Duplication (Copy-Move, Splicing, Overlap).
  • Compound Scientific Figures: These figures consist of multiple panels, where forgeries can occur within a single panel (intra-forgery) or between panels (inter-forgery).
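The taxonomy above can be written down as a small lookup structure, which is handy when grouping results by forgery family. This is a minimal sketch: the names `SIMPLE_FORGERIES`, `COMPOUND_LOCATIONS`, and `category_of` are illustrative and are not part of the dataset or its accompanying library.

```python
# Forgery families and modalities for simple figures, transcribed from the
# description above. All identifiers here are hypothetical.
SIMPLE_FORGERIES = {
    "Retouching": ["Blurring", "Contrast"],
    "Cleaning": ["Inpainting", "Brute-force"],
    "Duplication": ["Copy-Move", "Splicing", "Overlap"],
}

# Where a forgery can occur in a compound (multi-panel) figure.
COMPOUND_LOCATIONS = ("intra-forgery", "inter-forgery")


def category_of(modality):
    """Return the forgery family that a modality belongs to (case-insensitive)."""
    wanted = modality.lower()
    for category, modalities in SIMPLE_FORGERIES.items():
        if wanted in (m.lower() for m in modalities):
            return category
    raise KeyError(f"unknown modality: {modality}")
```

For example, `category_of("Splicing")` returns `"Duplication"`.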

Additional Resources:

This repository also includes:

  • Source Images and Metadata: The original, untampered images used to create the dataset, along with a spreadsheet (artificial_forgery_src_data.zip) detailing the source of each image.
  • Compound Forgery Templates: Templates used for creating the compound forgeries (template.zip).

 

Related Content:

Research Article - Benchmarking Scientific Image Forgery Detectors 

GitHub Repository - Recod.ai Scientific Image Integrity Library

 

Citation

The dataset is distributed under the Attribution 4.0 International (CC BY 4.0) license.

If you use any content from this repository, please cite:

 @article{cardenuto_2022,
 title={Benchmarking scientific image forgery detectors},
 volume={28},
 number={4},
 DOI={10.1007/s11948-022-00391-4},
 journal={Science and Engineering Ethics},
 author={Cardenuto, João P. and Rocha, Anderson},
 year={2022}
 }

Table of contents

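Dataset Breakdown (Simple Forgeries):

| Data Type                 | Modality    | Train (number of figures) | Test (number of figures) |
|---------------------------|-------------|---------------------------|--------------------------|
| Source of forgery figures | -           | 1932                      | 991                      |
| Pristine                  | -           | 1932                      | 991                      |
| Duplication               | Copy–Move   | 3761                      | 1629                     |
|                           | Splicing    | 604                       | 274                      |
|                           | Overlap     | 0                         | 660                      |
|                           | Total       | 4365                      | 2563                     |
| Cleaning                  | Inpainting  | 275                       | 117                      |
|                           | Brute-force | 957                       | 412                      |
|                           | Total       | 1232                      | 529                      |
| Retouching                | Blurring    | 961                       | 414                      |
|                           | Contrast    | 966                       | 415                      |
|                           | Total       | 1927                      | 829                      |
| Total figures             |             | 9456                      | 4912                     |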

 

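Dataset Breakdown (Compound Forgeries):

| Forgery Location          | Data Type   | Modality    | Train (number of figures) | Test (number of figures) |
|---------------------------|-------------|-------------|---------------------------|--------------------------|
| Source of forgery figures |             | -           | 1932                      | 991                      |
| Inter-panel               | Duplication | Copy–Move   | 9516                      | 4094                     |
|                           |             | Splicing    | 604                       | 274                      |
|                           |             | Overlap     | 0                         | 660                      |
|                           |             | Total       | 10120                     | 5028                     |
| Intra-panel               | Duplication | Copy–Move   | 3761                      | 1629                     |
|                           |             | Total       | 3761                      | 1629                     |
|                           | Cleaning    | Inpainting  | 275                       | 117                      |
|                           |             | Brute-force | 957                       | 412                      |
|                           |             | Total       | 1232                      | 529                      |
|                           | Retouching  | Blurring    | 961                       | 414                      |
|                           |             | Contrast    | 966                       | 415                      |
|                           |             | Total       | 1927                      | 829                      |
| Total figures             |             |             | 17040                     | 8015                     |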
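The compound-forgery counts can be cross-checked against the headline figures in the description with a few lines of arithmetic. This is a minimal sketch: the counts are transcribed by hand from the compound table, the simple-forgery totals (9456 train, 4912 test) are taken from the simple table's final row, and all variable names are illustrative.

```python
# (train, test) counts per modality, transcribed from the compound table.
compound = {
    "Inter-panel duplication": {
        "Copy-Move": (9516, 4094),
        "Splicing": (604, 274),
        "Overlap": (0, 660),
    },
    "Intra-panel duplication": {"Copy-Move": (3761, 1629)},
    "Cleaning": {"Inpainting": (275, 117), "Brute-force": (957, 412)},
    "Retouching": {"Blurring": (961, 414), "Contrast": (966, 415)},
}


def column_sums(rows):
    """Sum the train and test columns of one table group."""
    train = sum(t for t, _ in rows.values())
    test = sum(s for _, s in rows.values())
    return train, test


# Subtotal per group, then the compound grand total per split.
subtotals = {name: column_sums(rows) for name, rows in compound.items()}
compound_total = tuple(sum(col) for col in zip(*subtotals.values()))

# Totals from the last row of the simple-forgery table.
simple_total = (9456, 4912)
train, test = (c + s for c, s in zip(compound_total, simple_total))

print(subtotals)
print(compound_total)            # (17040, 8015)
print(train, test, train + test)  # 26496 12927 39423
```

The subtotals reproduce the table's "Total" rows, and the combined splits match the 26,496 training and 12,927 testing figures (39,423 overall) stated in the description.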

 

 

Files

artificial_forgery_src_data.zip

Files (57.3 GB)

| MD5                              | Size     |
|----------------------------------|----------|
| 73580eaec4b17dd738fe6bed4f66dd67 | 430.3 MB |
| 586a84709b349122b3753e0142951a57 | 85.7 MB  |
| c3a4a68b7db4c3d550f3cacbe796e702 | 19.8 GB  |
| 7f3dd5edd3aa52dc0d30889c428938f1 | 36.9 GB  |
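Because the archives are large, it is worth verifying each download against the MD5 digests listed above. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of` and `is_listed` are illustrative. The listing here does not pair digests with file names, so the check only tests whether a file's digest is one of the four published values.

```python
import hashlib

# MD5 digests copied from the file listing above.
KNOWN_MD5 = {
    "73580eaec4b17dd738fe6bed4f66dd67",
    "586a84709b349122b3753e0142951a57",
    "c3a4a68b7db4c3d550f3cacbe796e702",
    "7f3dd5edd3aa52dc0d30889c428938f1",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks
    so multi-gigabyte archives are never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_listed(path):
    """Return True if the file's digest matches one of the published values."""
    return md5_of(path) in KNOWN_MD5
```

For example, `is_listed("template.zip")` should return `True` for an intact download of that archive; a `False` result suggests a truncated or corrupted file.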

Additional details

Funding

Fundação de Amparo à Pesquisa do Estado de São Paulo
2020/02211-2
Fundação de Amparo à Pesquisa do Estado de São Paulo
2017/12646-3

Dates

Accepted
2022-08-09