Supplementary files for bioRxiv preprint "Public human microbiome data dominated by highly developed countries," by Abdill et al. 2021.

* supp_table01.csv - Supplementary table 1. Samples per tag.
    * tag: The name of a single tag that appears on at least one BioSample entry.
    * samples: The number of human microbiome samples with a value for the tag.
    * coverage: The fraction of human microbiome samples with a value for the tag.
* supp_table02.csv - Supplementary table 2. Country-level data.
    * alpha2: The country's two-letter country code, as defined in ISO 3166-1.
    * country: The name of the country.
    * region: The country's region, of those defined in the United Nations Sustainable Development Goals.
    * samples: The total number of human microbiome samples attributed to the country.
    * LDC: Whether the country is classified as a United Nations "least developed country."
    * population: The estimated 2020 population of the country, in thousands.
    * perc_sample: The proportion of all world samples from this country.
    * perc_population: The proportion of world population in this country.
    * unscaled_diff: A ratio calculated using perc_sample and perc_population, as described in the Methods section.
    * scaled_diff: The values in unscaled_diff, rescaled so that positive values range from 0 to 100 and negative values range from 0 to -100. (See Methods.)
* supp_table03.csv - Supplementary table 3. Samples by NCBI taxon.
    * code: The NCBI identifier for a single taxon.
    * taxname: The name of the taxon.
    * count: The total number of human microbiome samples classified within each taxon.
* country_counts.csv - Sample counts by country.
    * code: The country's two-letter country code, as defined in ISO 3166-1.
    * samples: The total number of human microbiome samples associated with that country.
* region_years.csv - Samples per region per year.
    * region: The name of a geographic region.
    * year: A single year in which samples were released.
    * samples: The number of human microbiome samples released in that year that were associated with a country or territory within the region.
    * running total: The cumulative total of samples associated with the region, up to and including the specified year.
* samples.csv - A list of all samples evaluated in the study.
    * srs: The unique ID assigned to this BioSample.
    * project: The ID of the BioProject in which this BioSample is filed.
    * host: The inferred host from which the sample was taken. (See Methods.)
    * srr: The ID of one of the sequencing runs associated with this BioSample.
    * library_strategy: Mirrors an attribute retrieved from NCBI regarding the sequencing run.
    * library_source: Mirrors an attribute retrieved from NCBI regarding the sequencing run.
    * taxon: The ID of the NCBI taxon in which the BioSample is classified.
    * pubdate: The date on which this sample was released.
    * geo_loc_name: The value of the "geo_loc_name" tag associated with this BioSample.
* acceptable_hosts.csv - A list of all "host" values observed in BioSample entries that were manually flagged as indicating the sample was from a human.
* figures.md - R code used to generate the figures in the manuscript, plus the SQL queries used to generate the data files used in the figures.
* biosample_data.zip - An archive containing a directory of XML files as they were exported from the BioSample website. Each file contains the search results for a single NCBI taxon; the file name indicates the taxon ID.
* code.zip - An archive containing the Python 3 scripts used to query the NCBI APIs for information related to the BioSample entries defined in the files in the biosample_data directory.
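
As a rough illustration of two derived columns described above ("running total" in region_years.csv and scaled_diff in supp_table02.csv), the following Python sketch shows one way such values could be computed. It is not the code used in the study (see figures.md and code.zip for that). The function names `running_totals` and `scale_diff` are hypothetical, the input rows are made-up examples, and the scaling rule (dividing positive values by the largest positive value and negative values by the magnitude of the most negative value) is an assumption based only on the column description, not on the Methods.

```python
from collections import defaultdict


def running_totals(rows):
    """Add a cumulative per-region total to (region, year, samples) rows.

    Rows are sorted by region and then year, so each running total covers
    all years up to and including the row's year.
    """
    totals = defaultdict(int)
    out = []
    for region, year, samples in sorted(rows):
        totals[region] += samples
        out.append((region, year, samples, totals[region]))
    return out


def scale_diff(values):
    """Rescale values so positives span 0..100 and negatives span 0..-100.

    ASSUMPTION: positives are divided by the largest positive value and
    negatives by the magnitude of the most negative value. The exact
    procedure used in the study is described in its Methods.
    """
    max_pos = max((v for v in values if v > 0), default=0)
    min_neg = min((v for v in values if v < 0), default=0)
    scaled = []
    for v in values:
        if v > 0:
            scaled.append(100 * v / max_pos)
        elif v < 0:
            scaled.append(-100 * v / min_neg)
        else:
            scaled.append(0.0)
    return scaled


# Made-up example rows, for illustration only (not values from the data files).
example_rows = [("Asia", 2019, 5), ("Asia", 2018, 3), ("Europe", 2018, 10)]
for row in running_totals(example_rows):
    print(row)

print(scale_diff([2.0, 1.0, 0.0, -0.5, -2.0]))
```

To apply this to the real files, the rows could be read with Python's standard `csv` module. Column values would need converting from strings to numbers first.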