UPM - DATASET
Description
Dataset Overview
This dataset is associated with the article "Unveiling Scientific Articles from Paper Mills with Provenance Analysis." It is designed to support the development of new methods for identifying systematically produced articles. The dataset includes all image panels from the Stock Photo Paper Mill (SPP) [1] and its extended versions.
Dataset Composition
The SPP dataset comprises 121 biomedical articles focused on cancer types and cell tissue samples. Bik [1] has annotated instances of potentially similar images across these papers, and these annotations are publicly available on Bik's website [1] in spreadsheet format.
To enhance the SPP dataset, we introduced distractor documents that do not contain known issues. We created two versions of the SPP extension to study the challenge across increasingly large sets of articles:
- Version 1 (v1): Includes 969 additional papers containing biomedical figures.
- Version 2 (v2): Expands further with 3,635 additional papers, similar in nature to those in the first version.
Image Panel Distribution
The following table shows the distribution of each image panel type after extraction from their original articles:
| Panel Type | SPP | Extended SPP (v1) | Extended SPP (v2) |
|---|---|---|---|
| Microscopy | 925 | 4,227 | 14,083 |
| Blots | 278 | 1,298 | 9,810 |
| Body Imaging | 0 | 573 | 10,715 |
| Graphs and Plots | 1,317 | 3,620 | 9,879 |
| Flow Cytometry | 63 | 427 | 3,053 |
| Total | 2,583 | 10,145 | 47,540 |
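As a quick consistency check, the per-type counts can be summed and compared against the reported totals. The sketch below only transcribes the figures from the table; the names `PANEL_COUNTS`, `REPORTED_TOTALS`, and `column_totals` are illustrative and not part of the dataset.

```python
# Panel counts transcribed from the table, in the order
# (SPP, Extended SPP v1, Extended SPP v2).
PANEL_COUNTS = {
    "Microscopy": (925, 4227, 14083),
    "Blots": (278, 1298, 9810),
    "Body Imaging": (0, 573, 10715),
    "Graphs and Plots": (1317, 3620, 9879),
    "Flow Cytometry": (63, 427, 3053),
}

# Totals as reported in the last row of the table.
REPORTED_TOTALS = (2583, 10145, 47540)


def column_totals(counts):
    """Sum each dataset version's column across all panel types."""
    return tuple(sum(column) for column in zip(*counts.values()))


if __name__ == "__main__":
    print(column_totals(PANEL_COUNTS) == REPORTED_TOTALS)
```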
Dataset Structure
The dataset.zip file contains three directories, each corresponding to a different version of the dataset:
- spm/: Contains image panels from the original Stock Photo Paper Mill (SPP) set.
- extracted_panels/: Includes panels related to the Extended SPP (v1).
- annotated_panels/: Contains panels related to the Extended SPP (v2).
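Once dataset.zip has been extracted, the panels in each directory can be counted, for example to compare them with the Image Panel Distribution table. The following is a minimal sketch under stated assumptions: the directory names come from the list above, but the extraction folder passed as `root` and the set of image file extensions are assumptions, since the file format of the panels is not documented here. The function name `count_panels` is illustrative.

```python
from pathlib import Path

# Directory names are taken from the dataset description.
VERSION_DIRS = ("spm", "extracted_panels", "annotated_panels")

# Assumption: panels are stored in one of these common image formats.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def count_panels(root):
    """Return {directory name: number of image files found recursively}."""
    counts = {}
    for name in VERSION_DIRS:
        directory = Path(root) / name
        if not directory.is_dir():
            counts[name] = 0  # directory not present in this extraction
            continue
        counts[name] = sum(
            1
            for path in directory.rglob("*")
            if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts
```

For instance, `count_panels("dataset")` would report the counts for an archive extracted into a local folder named `dataset` (a hypothetical path).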
Annotations
The dataset includes two types of annotation files:
- document-level-annotation.json: This file provides annotations detailing how each article reuses content from other articles.
- image-level-annotation.json: This file includes annotations about groups of images that share similar content.
Please refer to these files for detailed information on the dataset's contents and the relationships between the images and articles.
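Because the schema of the annotation files is not described here, a reasonable first step is to load each file and inspect its top-level shape before relying on any particular keys. This sketch uses only the Python standard library; the helper names `load_annotation` and `summarize` are illustrative, and the location of the files on disk is an assumption.

```python
import json
from pathlib import Path


def load_annotation(path):
    """Parse one annotation file and return the decoded JSON value."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def summarize(value):
    """Describe the top-level shape of a decoded JSON value without assuming a schema."""
    if isinstance(value, dict):
        return f"object with {len(value)} keys"
    if isinstance(value, list):
        return f"array with {len(value)} items"
    return type(value).__name__


if __name__ == "__main__":
    # File names come from the dataset description; adjust the paths
    # to wherever the files were downloaded or extracted.
    for name in ("document-level-annotation.json", "image-level-annotation.json"):
        if Path(name).is_file():
            print(name, "->", summarize(load_annotation(name)))
```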
References:
[1] Bik E. The Stock Photo Paper Mill; 2020. Available from https://scienceintegritydigest.com/2020/07/05/the-stock-photo-paper-mill/
Additional titles
- Subtitle
- Dataset of Unveiling Scientific Articles from Paper Mills with Provenance Analysis