
Published September 24, 2024 | Version 1.2.0

Generative AI aids the publication of fake articles: Methods and materials package

  • Athens University of Economics and Business

Description

This package contains the Python, shell, and awk scripts, together with the data, used to obtain the curated table associated with the above-named article. It also contains (in this file) a description of the methods employed to obtain the curated table, with details regarding the published articles.

Contents

The following items are included.

  • README.md: This file
  • article-details.xlsx: Curated table with details of published articles in Microsoft Excel file format
  • index.html: HTML document with
    • links to GIJIR materials saved in the Internet Archive
    • a list of all the GIJIR articles’ citation data according to Crossref, with links to each article’s locally available landing page and full-text PDF, plus links to the Crossref metadata and to the article via its DOI and original journal URL. (Note that non-local, non-archived links may rot over time.)
  • Makefile: Commands that orchestrate the articles’ analysis
  • get-metadata.sh: Obtain article metadata pages from the journal’s web site
  • apply-to-pdfs.sh: Apply the specified Python script to all article PDFs
  • extract-citations-emails.py: Extract number of probable in-text citations and corresponding author email from article PDF
  • extract-doi-affiliations.py: Extract article DOI and affiliations from an article’s metadata
  • extract-all-doi-affiliations.sh: Extract article DOI and affiliations from all articles’ metadata
  • emails-to-csv.awk: Convert emails and article numbers to CSV with URL for sending emails
  • ybs-works.json: Results of a Crossref query, made on 2024-09-22, to obtain all the publisher’s works
  • ChatGPT: Prompts and responses associated with the generation of a fake article in one of the journal’s topics.
  • global-us/metadata/: Article metadata as HTML files collected on 2024-09-10
  • global-us/global-us.mellbaou.com/index.php/global/article/download/: A copy of the journal’s article PDFs as crawled on 2024-09-10
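As an illustration of the pipeline’s final conversion step, the emails-to-csv.awk transformation can be sketched in Python as follows. This is a minimal sketch, not the actual awk script: the column layout and the mailto URL format are assumptions.

```python
import csv
import io
import urllib.parse


def emails_to_csv(rows):
    """Convert (article_number, email) pairs to CSV rows that include a
    URL usable for sending notification emails.  Illustrative sketch of
    what emails-to-csv.awk does; column names and the mailto format are
    assumptions, not taken from the published script."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["article", "email", "url"])
    for article, email in rows:
        # Percent-encode the address so it is safe inside a URL.
        writer.writerow([article, email, "mailto:" + urllib.parse.quote(email)])
    return buf.getvalue()


print(emails_to_csv([(172, "author@example.org")]))
```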

Methods (English)

Data were gathered on 10 and 11 September 2024 on a host running an Anaconda Python environment version 1.12.3 and Cygwin Bash version 5.2.15(3). The journal site global-us.mellbaou.com was crawled in its entirety with the wget command in order to obtain the article PDFs. Article metadata were retrieved separately using the get-metadata.sh shell script.

Citation counts and contact emails were extracted from the article PDF files with the extract-citations-emails.py and apply-to-pdfs.sh scripts. Article DOIs and author affiliations were extracted from the article metadata HTML files with the extract-doi-affiliations.py and extract-all-doi-affiliations.sh scripts. The two result sets were joined on the article number key and used to create the initial version of the article-details.xlsx Excel file. A list of contact author emails and URLs was created with emails-to-csv.awk and then used to inform the articles’ authors of the findings.

Articles with a low citation count heuristic (measured through the number of brackets and braces appearing before each article’s References section) were manually inspected for signs of being entirely AI-authored (mainly formulaic content and a lack of citations, tables, and figures). A subset of those was also submitted to Turnitin for AI scoring on 2024-09-24.
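The citation-count heuristic described above can be sketched as follows. This is an illustrative reconstruction, not the published extract-citations-emails.py script: the function name and the regular expression used to locate the References heading are assumptions.

```python
import re


def citation_count_heuristic(text: str) -> int:
    """Count bracket and brace characters appearing before the article's
    References section, as a rough proxy for in-text citations.
    Illustrative sketch; the heading regex is an assumption."""
    # Truncate the text at the first line starting with "References".
    match = re.search(r"^\s*References\b", text,
                      flags=re.MULTILINE | re.IGNORECASE)
    body = text[:match.start()] if match else text
    # Bracketed citations such as [12] or [3, 7] contribute two
    # characters each; braces are counted the same way.
    return sum(body.count(ch) for ch in "[]{}")


sample = "Prior work [1] and [2, 3] shows this.\nReferences\n[1] A. Author."
print(citation_count_heuristic(sample))  # → 4
```

A very low count flags an article as a candidate for manual inspection, since entirely AI-generated texts in this corpus tended to lack in-text citations.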

The provided Microsoft Excel document, based on the automatically generated article-details.tsv file, was hand-curated as follows.

  • Four duplicate entries, caused by the erroneous extraction of multiple contact emails, were removed (articles 172 and 248).
  • Contact emails were obfuscated to comply with personal data protection regulations.
  • Documents with a low citation count ranked 50 or lower, as well as the ten highest-ranked ones, were hand-verified regarding their AI content.
  • Turnitin AI generation scores were added for one in every ten of the above low-citation-count documents and for one in every two of the above high-count documents. The Turnitin AI scores were obtained using the web-based service on 2024-09-24.
  • Email domains were extracted from emails and listed in a separate column.
  • A column with undeliverable emails was added and hand-filled based on failed delivery reports regarding the sent notification emails.
  • Affiliations of authors of publications that were unlikely to have been submitted by them (mainly evidenced by wrong contact emails) were marked in bold.
  • Notes with email communications and other provenance details were added to substantiate the preceding actions.
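The domain-extraction step listed above can be sketched as follows. This is a minimal illustration of the idea; the actual extraction may have been performed within the spreadsheet itself, and the function name is an assumption.

```python
def email_domain(email: str) -> str:
    """Return the domain part of an email address, normalised to
    lower case.  Illustrative sketch of the curation step that filled
    the email-domain column."""
    # rsplit with maxsplit=1 handles the (invalid but possible) case
    # of multiple "@" characters by taking the final component.
    return email.rsplit("@", 1)[-1].lower()


print(email_domain("Author@Example.edu"))  # → example.edu
```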

Files

replication.zip (70.9 MB)

md5:5f1df9021b508aba42318628609c00fc