Complete Rxivist dataset of scraped bioRxiv data

Abdill, Richard J.; Blekhman, Ran

doi:10.5281/zenodo.3382325

Published August 30, 2019 | Version 2019-08-30

Dataset Open

1. University of Minnesota

rxivist.org allows readers to sort and filter the tens of thousands of preprints posted to bioRxiv. Rxivist uses a custom web crawler to index all papers on biorxiv.org; this is a snapshot of the Rxivist production database. The version number indicates the date on which the snapshot was taken. See the included "README.md" file for instructions on how to use the "rxivist.backup" file to import data into a PostgreSQL database server.

Please note that this repository is different from the one used for the Rxivist manuscript, which is in a separate Zenodo repository. You're welcome (and encouraged!) to use this data in your research, but please cite our paper, now published in eLife.

Going forward, this information will also be available pre-loaded into Docker images, available at blekhmanlab/rxivist_data.

Version notes:

2019-08-30
- The Crossref Event Data API, which provides the data used to populate the table of tweet counts, has not been fully functional since early July. While we are optimistic that accurate tweet counts will be available at some point, the sparse values currently in the "crossref_daily" table for July and August should not be considered reliable.
2019-07-01
- A new "institution" field has been added to the "article_authors" table that stores each author's institutional affiliation as listed on that paper. The "authors" table still has each author's most recently observed institution.
  - We began collecting this data in the middle of May, but it has not been applied to older papers yet.
2019-05-11
- The README was updated to correct a link to the Docker repository used for the pre-built images.
2019-03-21
- The license for this dataset has been changed to CC-BY, which allows use for any purpose and requires only attribution.
- A new table, "publication_dates," has been added and will be continually updated. This table will include an entry for each preprint that has been published externally for which we can determine a date of publication, based on data from Crossref. (This table was previously included in the "paper" schema but was not updated after early December 2018.)
- Foreign key constraints have been added to almost every table in the database. This should not impact any read behavior, but anyone writing to these tables will encounter constraints on existing fields that refer to other tables. Most frequently, this means the "article" field in a table will need to refer to an ID that actually exists in the "articles" table.
- The "author_translations" table has been removed. This was used to redirect incoming requests for outdated author profile pages and was likely not of any functional use to others.
- The "README.md" file has been renamed "1README.md" because Zenodo only displays a preview for the file that appears first in the list alphabetically.
- The "article_ranks" and "article_ranks_working" tables have been removed as well; they were unused.
2019-02-13.1
- After consultation with bioRxiv, the "fulltext" table will not be included in further snapshots unless and until concerns about licensing and copyright can be resolved.
- The "docker-compose.yml" file was added, with corresponding instructions in the README to streamline deployment of a local copy of this database.
2019-02-13
- The redundant "paper" schema has been removed.
- BioRxiv has begun making the full text of preprints available online. Beginning with this version, a new table ("fulltext") is available that contains the text of preprints that have been processed already. The format in which this information is stored may change in the future; any changes will be noted here.
- This is the first version that has a corresponding Docker image.
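
The effect of the foreign key constraints described in the 2019-03-21 notes can be sketched as follows. This is only an illustration, not the Rxivist schema: it uses an in-memory SQLite database instead of PostgreSQL, and the table definitions are hypothetical apart from the "articles" table name and the "article" field mentioned above (the "id", "title" and "author" columns are assumptions).

```python
import sqlite3


def demo_foreign_key():
    """Show that a row pointing at a missing article is rejected.

    Hypothetical two-table schema; SQLite stands in for PostgreSQL.
    """
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute(
        "CREATE TABLE article_authors ("
        " id INTEGER PRIMARY KEY,"
        " article INTEGER NOT NULL REFERENCES articles(id),"
        " author INTEGER)"
    )
    conn.execute("INSERT INTO articles (id, title) VALUES (1, 'Example preprint')")

    # This write succeeds: article 1 exists in "articles".
    conn.execute("INSERT INTO article_authors (article, author) VALUES (1, 10)")

    # This write fails: there is no article with ID 999.
    try:
        conn.execute("INSERT INTO article_authors (article, author) VALUES (999, 10)")
        rejected = False
    except sqlite3.IntegrityError:
        rejected = True

    row_count = conn.execute("SELECT COUNT(*) FROM article_authors").fetchone()[0]
    conn.close()
    return rejected, row_count


if __name__ == "__main__":
    print(demo_foreign_key())
```

Reads are unaffected; only the second insert, which refers to a nonexistent article ID, is refused.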

Files (117.8 MB)

Name	MD5	Size
docker-compose.yml	2c16fd8916db8910b951af1b99503970	305 Bytes
README.md	8d36af44691c4c9a464ec4e6fba219cb	9.0 kB
rxivist.backup	86e466db1755f81000b14c29a7d6a9e1	117.8 MB
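
The MD5 values listed for each file can be used to check a download before importing it. The following is a minimal sketch using Python's standard library; the function names and the assumption that all three files sit in one local directory are illustrative and are not part of the dataset or its README.

```python
import hashlib
from pathlib import Path

# MD5 checksums as listed for this version of the dataset.
EXPECTED_MD5 = {
    "docker-compose.yml": "2c16fd8916db8910b951af1b99503970",
    "README.md": "8d36af44691c4c9a464ec4e6fba219cb",
    "rxivist.backup": "86e466db1755f81000b14c29a7d6a9e1",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify_downloads(".").items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

Reading in chunks keeps memory use low for the 117.8 MB backup file.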

Additional details

Is referenced by: 10.7554/eLife.45133 (DOI)
Is supplement to: 10.5281/zenodo.2465688 (DOI)

	All versions	This version
Views	12,292	821
Downloads	3,766	145
Data volume	429.6 GB	7.9 GB
