Dataset Open Access

Complete Rxivist dataset of scraped biology preprint data

Abdill, Richard J.; Blekhman, Ran

rxivist.org allows readers to sort and filter the tens of thousands of preprints posted to bioRxiv and medRxiv. Rxivist uses a custom web crawler to index all papers posted to those two websites; this is a snapshot of the Rxivist production database. The version number indicates the date on which the snapshot was taken. See the included "README.md" file for instructions on how to use the "rxivist.backup" file to import data into a PostgreSQL database server.
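As a rough illustration of the import step, the sketch below assembles a `pg_restore` command for the "rxivist.backup" file. It assumes the file is a pg_dump archive that `pg_restore` can read and that an empty target database already exists; the database name, host, and user shown are hypothetical placeholders, and the instructions in the README take precedence over anything here.

```python
import shlex


def build_restore_command(backup_path="rxivist.backup",
                          dbname="rxivist",    # hypothetical database name
                          host="localhost",    # hypothetical host
                          user="postgres"):    # hypothetical role
    """Return the argument list for a pg_restore invocation.

    Only standard pg_restore options are used: --host, --username,
    --dbname, and --no-owner (skip restoring original object ownership).
    """
    return [
        "pg_restore",
        "--host", host,
        "--username", user,
        "--dbname", dbname,
        "--no-owner",
        backup_path,
    ]


if __name__ == "__main__":
    # Print the command for review; pass the list to subprocess.run
    # once the values match your own PostgreSQL server.
    print(shlex.join(build_restore_command()))
```

Building the command as a list rather than a single string keeps file paths with spaces intact if the list is later handed to `subprocess.run`.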

Please note that this repository is different from the one used for the Rxivist manuscript, which is in a separate Zenodo repository. You're welcome (and encouraged!) to use this data in your research, but please cite our paper, now published in eLife.

Going forward, this information will also be available pre-loaded into Docker images, available at blekhmanlab/rxivist_data.

Version notes:

  • 2020-12-07
    • In addition to bioRxiv preprints, the database now includes all medRxiv preprints as well.
      • The website where a preprint was posted is now recorded in a new field in the "articles" table, called "repo".
    • We've significantly refactored the web crawler to take advantage of developments with the bioRxiv API.
      • The main difference is that preprints flagged as "published" by bioRxiv are no longer recorded on the same schedule that download metrics are updated: The Rxivist database should now record published DOI entries the same day bioRxiv detects them.
    • Twitter metrics have returned, for the most part. Improvements with the Crossref Event Data API mean we can once again tally daily Twitter counts for all bioRxiv DOIs.
      • The "crossref_daily" table remains where these are recorded, and daily numbers are now up to date.
      • Historical daily counts have also been re-crawled to fill in the empty space that started in October 2019.
      • There are still several gaps that are more than a week long due to missing data from Crossref.
      • We have recorded available Crossref Twitter data for all papers with DOI numbers starting with "10.1101", which includes all medRxiv preprints. However, there appears to be almost no Twitter data available for medRxiv preprints.
    • The download metrics for article id 72514 (DOI 10.1101/2020.01.30.927871) were found to be out of date for February 2020 and are now correct. This is notable because article 72514 is the most downloaded preprint of all time; we're still looking into why this wasn't updated after the month ended.
  • 2020-11-18
    • Publication checks should be back on schedule.
  • 2020-10-26
    • This snapshot fixes most of the data issues found in the previous version. Indexed papers are now up to date, and download metrics are back on schedule. The check for publication status remains behind schedule, however, and the database may not include published DOIs for papers that have been flagged on bioRxiv as "published" over the last two months. Another snapshot will be posted in the next few weeks with updated publication information.
  • 2020-09-15
    • A crawler error caused this snapshot to exclude all papers posted after about August 29, with some papers having download metrics that were more out of date than usual. The "last_crawled" field is accurate.
  • 2020-09-08
    • This snapshot is misconfigured and will not work without modification; it has been replaced with version 2020-09-15.
  • 2019-12-27
    • Several dozen papers did not have dates associated with them; that has been fixed.
    • Some authors have had two entries in the "authors" table for portions of 2019, one profile that was linked to their ORCID and one that was not, occasionally with almost identical "name" strings. This happened after bioRxiv began changing author names to reflect the names in the PDFs, rather than the ones manually entered into their system. These database records are mostly consolidated now, but some may remain.
  • 2019-11-29
    • The Crossref Event Data API remains down; Twitter data is unavailable for dates after early October.
  • 2019-10-31
    • The Crossref Event Data API is still experiencing problems; the Twitter data for October is incomplete in this snapshot.
    • The README file has been modified to reflect changes in the process for creating your own DB snapshots if using the newly released PostgreSQL 12.
  • 2019-10-01
    • The Crossref API is back online, and the "crossref_daily" table should now include up-to-date tweet information for July through September.
    • About 40,000 authors were removed from the author table because the name had been removed from all preprints they had previously been associated with, likely because their name changed slightly on the bioRxiv website ("John Smith" to "J Smith" or "John M Smith"). The "author_emails" table was also modified to remove entries referring to the deleted authors. The web crawler is being updated to clean these orphaned entries more frequently.
  • 2019-08-30
    • The Crossref Event Data API, which provides the data used to populate the table of tweet counts, has not been fully functional since early July. While we are optimistic that accurate tweet counts will be available at some point, the sparse values currently in the "crossref_daily" table for July and August should not be considered reliable.
  • 2019-07-01
    • A new "institution" field has been added to the "article_authors" table that stores each author's institutional affiliation as listed on that paper. The "authors" table still has each author's most recently observed institution.
      • We began collecting this data in the middle of May, but it has not been applied to older papers yet.
  • 2019-05-11
    • The README was updated to correct a link to the Docker repository used for the pre-built images.
  • 2019-03-21
    • The license for this dataset has been changed to CC-BY, which allows use for any purpose and requires only attribution.
    • A new table, "publication_dates", has been added and will be continually updated. This table will include an entry for each preprint that has been published externally for which we can determine a date of publication, based on data from Crossref. (This table was previously included in the "paper" schema but was not updated after early December 2018.)
    • Foreign key constraints have been added to almost every table in the database. This should not impact any read behavior, but anyone writing to these tables will encounter constraints on existing fields that refer to other tables. Most frequently, this means the "article" field in a table will need to refer to an ID that actually exists in the "articles" table.
    • The "author_translations" table has been removed. This was used to redirect incoming requests for outdated author profile pages and was likely not of any functional use to others.
    • The "README.md" file has been renamed "1README.md" because Zenodo only displays a preview for the file that appears first in the list alphabetically.
    • The "article_ranks" and "article_ranks_working" tables have been removed as well; they were unused.
  • 2019-02-13.1
    • After consultation with bioRxiv, the "fulltext" table will not be included in further snapshots until (and if) concerns about licensing and copyright can be resolved.
    • The "docker-compose.yml" file was added, with corresponding instructions in the README to streamline deployment of a local copy of this database.
  • 2019-02-13
    • The redundant "paper" schema has been removed.
    • BioRxiv has begun making the full text of preprints available online. Beginning with this version, a new table ("fulltext") is available that contains the text of preprints that have been processed already. The format in which this information is stored may change in the future; any changes will be noted here.
    • This is the first version that has a corresponding Docker image.
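
The foreign key behavior described in the 2019-03-21 notes can be sketched with a small stand-in database. The example below uses Python's built-in `sqlite3` module rather than PostgreSQL so that it runs anywhere. The "articles" table, its "repo" field, the "crossref_daily" table, and the "article" field are named in the notes above; the "id", "source_date", and "count" columns and the sample values are assumptions made for illustration and may not match the real schema.

```python
import sqlite3


def demo_foreign_key():
    """Show that a row whose "article" value has no match in "articles" is rejected."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    # Simplified, hypothetical versions of two Rxivist tables.
    conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, repo TEXT)")
    conn.execute(
        "CREATE TABLE crossref_daily ("
        " article INTEGER REFERENCES articles(id),"
        " source_date TEXT,"
        " count INTEGER)"
    )
    # The repo value 'biorxiv' is an assumed example, not a confirmed value.
    conn.execute("INSERT INTO articles (id, repo) VALUES (1, 'biorxiv')")
    # Article 1 exists, so this insert succeeds.
    conn.execute("INSERT INTO crossref_daily VALUES (1, '2020-12-01', 3)")
    try:
        # Article 999 does not exist, so the constraint blocks this insert.
        conn.execute("INSERT INTO crossref_daily VALUES (999, '2020-12-01', 5)")
        rejected = False
    except sqlite3.IntegrityError:
        rejected = True
    rows = conn.execute("SELECT COUNT(*) FROM crossref_daily").fetchone()[0]
    conn.close()
    return rejected, rows


if __name__ == "__main__":
    print(demo_foreign_key())
```

Anyone writing to a restored copy of the database would therefore need to insert the parent row in "articles" before adding rows that refer to it.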

Files (316.7 MB)

  • docker-compose.yml (305 Bytes), md5:2c16fd8916db8910b951af1b99503970
  • README.md (9.2 kB), md5:831ad1fb826095c3edcf2be49773e340
  • rxivist.backup (316.7 MB), md5:2a750c38c34037375cc41701dc717245
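
The MD5 checksums listed for these files can be used to confirm that a download completed intact. The following is a minimal sketch using only the Python standard library; the checksum values are copied from the file list, while the function names and the assumption that all three files sit in one directory are illustrative.

```python
import hashlib
from pathlib import Path

# Checksums as published in the file list for this version.
EXPECTED_MD5 = {
    "docker-compose.yml": "2c16fd8916db8910b951af1b99503970",
    "README.md": "831ad1fb826095c3edcf2be49773e340",
    "rxivist.backup": "2a750c38c34037375cc41701dc717245",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so the 316.7 MB backup is never fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify_downloads().items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

A mismatch usually means the download was truncated or that the file belongs to a different version of the dataset, since each snapshot has its own checksums.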

Statistics (all versions / this version)

  • Views: 3,445 / 529
  • Downloads: 997 / 100
  • Data volume: 65.9 GB / 9.8 GB
  • Unique views: 2,661 / 439
  • Unique downloads: 572 / 53
