Published July 24, 2024 | Version v1
Dataset Restricted

UPM - DATASET

  • 1. Universidade Estadual de Campinas (UNICAMP)
  • 2. Loyola University Chicago

Description

 

Dataset Overview

This dataset is associated with the article "Unveiling Scientific Articles from Paper Mills with Provenance Analysis." It is designed to support the development of new methods for identifying systematically produced articles. The dataset includes all image panels from the Stock Photo Paper Mill (SPP) [1] and its extended versions.

Dataset Composition

The SPP dataset comprises 121 biomedical articles focused on cancer types and cell tissue samples. Bik [1] has annotated instances of potentially similar images across these papers, and these annotations are publicly available on Bik's website [1] in spreadsheet format.

To enhance the SPP dataset, we introduced distractor documents that do not contain known issues. We created two versions of the SPP extension to study the challenge across increasingly large sets of articles:

  • Version 1 (v1): Includes 969 additional papers containing biomedical figures.

  • Version 2 (v2): Expands further with 3,635 additional papers, similar in nature to those in the first version.

Image Panel Distribution

The following table shows the number of image panels of each type extracted from the original articles, for each version of the dataset:

Panel Type          SPP      Extended SPP (v1)   Extended SPP (v2)
Microscopy          925      4,227               14,083
Blots               278      1,298               9,810
Body Imaging        0        573                 10,715
Graphs and Plots    1,317    3,620               9,879
Flow Cytometry      63       427                 3,053
Total               2,583    10,145              47,540
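The Total row can be checked by summing each column. The sketch below copies the counts from the table and also computes the share of each panel type within a version; the names `PANEL_COUNTS`, `column_totals`, and `shares` are illustrative and not part of the dataset.

```python
# Panel counts copied from the table: (SPP, Extended SPP v1, Extended SPP v2).
PANEL_COUNTS = {
    "Microscopy": (925, 4227, 14083),
    "Blots": (278, 1298, 9810),
    "Body Imaging": (0, 573, 10715),
    "Graphs and Plots": (1317, 3620, 9879),
    "Flow Cytometry": (63, 427, 3053),
}
COLUMNS = ("SPP", "Extended SPP (v1)", "Extended SPP (v2)")


def column_totals(counts):
    """Sum the counts of all panel types for each dataset version."""
    return tuple(sum(column) for column in zip(*counts.values()))


def shares(counts, column_index):
    """Return each panel type's fraction of the panels in one version."""
    total = sum(row[column_index] for row in counts.values())
    return {name: row[column_index] / total for name, row in counts.items()}


totals = column_totals(PANEL_COUNTS)
for name, total in zip(COLUMNS, totals):
    print(f"{name}: {total:,} panels")
```

The computed sums agree with the Total row: 2,583, 10,145, and 47,540 panels.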

 

Dataset Structure

The dataset.zip file contains three directories, each corresponding to a different version of the dataset:

  1. spm/: Contains image panels from the original Stock Photo Paper Mill (SPP) set.

  2. extracted_panels/: Includes panels related to the Extended SPP (v1).

  3. annotated_panels/: Contains panels related to the Extended SPP (v2).
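For users who have been granted access to the files, a quick sanity check after extracting dataset.zip is to count the image files under each of the three directories. This is a minimal sketch only: the internal layout of each directory and the image file formats are not documented here, so the code walks each directory recursively and assumes common raster extensions. The function name `count_panels` and the extension list are assumptions.

```python
from pathlib import Path

# Directory names as listed in the dataset description.
VERSION_DIRS = ("spm", "extracted_panels", "annotated_panels")

# Assumption: panels are stored as common raster image files.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def count_panels(root):
    """Count image files under each version directory of the extracted archive."""
    root = Path(root)
    counts = {}
    for dirname in VERSION_DIRS:
        version_dir = root / dirname
        if not version_dir.is_dir():
            continue  # directory not present; omit it from the result
        counts[dirname] = sum(
            1
            for path in version_dir.rglob("*")
            if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts
```

Calling `count_panels("dataset")` on a hypothetical extraction folder named `dataset` returns a dictionary mapping each directory found to its image count, which can then be compared against the panel totals reported for each version.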

Annotations

The dataset includes two types of annotation files:

  • document-level-annotation.json: This file provides annotations detailing how each article reuses content from other articles.

  • image-level-annotation.json: This file includes annotations about groups of images that share similar content.

Please refer to these files for detailed information on the dataset's contents and the relationships between the images and articles.
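Because the schema of the two annotation files is not described here, a reasonable first step is to load each file and inspect its top-level structure before writing any parsing logic. The sketch below uses only the Python standard library and makes no assumption about keys or nesting; the function name `summarize_json` and the location of the files inside the archive are assumptions.

```python
import json

# File names as listed in the dataset description.
ANNOTATION_FILES = (
    "document-level-annotation.json",
    "image-level-annotation.json",
)


def summarize_json(path):
    """Load a JSON file and describe its top-level structure."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    summary = {"type": type(data).__name__}
    if isinstance(data, (dict, list)):
        summary["size"] = len(data)
    if isinstance(data, dict):
        # Show a few keys so the schema can be explored by hand.
        summary["sample_keys"] = list(data)[:5]
    return summary
```

For example, `summarize_json("document-level-annotation.json")` reports whether the file holds an object or an array, how many entries it has, and a handful of keys to guide further exploration.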

 

References:

[1] Bik E. The Stock Photo Paper Mill; 2020. Available from https://scienceintegritydigest.com/2020/07/05/the-stock-photo-paper-mill/

 

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Additional details

Additional titles

Subtitle
Dataset of Unveiling Scientific Articles from Paper Mills with Provenance Analysis