This Gass_Iceland_IAV_Phylodynamics_Dataset_README.txt file was generated on 2022-06-01 by Jonathon Gass

GENERAL INFORMATION

1. Title of Dataset: Dataset for "Global dissemination of Influenza A virus is driven by wild bird migration through arctic and subarctic zones" / DOI: 10.5061/dryad.m37pvmd2m

2. Author Information
    A. Principal Investigator Contact Information
        Name: Jonathon Gass
        Institution: Tufts University
        Email: jonathon.gass@tufts.edu

    B. Alternate Contact Information
        Name: Nichola Hill
        Institution: UMass, Boston
        Email: Nichola.Hill@umb.edu

3. Date of data collection (single date, range, approximate date): Outgroup downloaded 2020-02-12, Ingroup samples collected 2010-05 through 2018-02

4. Geographic location of data collection: Ingroup data collected in Iceland

5. Information about funding sources that supported the collection of the data: This project was funded through the National Institute of Allergy and Infectious Diseases (grant #s: HHSN272201400006C, HHSN272201400008C)

SHARING/ACCESS INFORMATION

1. Licenses/restrictions placed on the data: N/A

2. Links to publications that cite or use the data: https://www.authorea.com/doi/full/10.22541/au.164372348.88612449/v1#

3. Links to other publicly accessible locations of the data: N/A

4. Links/relationships to ancillary data sets: N/A

5. Was data derived from another source? NO
    A. If yes, list source(s): N/A

6. Recommended citation for this dataset: Gass J.D. (2021). Global dissemination of Influenza A virus is driven by wild bird migration through arctic and subarctic zones [Dataset]. Dryad Digital Repository. https://doi.org/10.5061/dryad.m37pvmd2m

DATA & FILE OVERVIEW

1. File List:
    Global_Dataset_394_20210809.csv - Data file contains all data (394 influenza A virus sequences and associated epidemiological meta-data) used in the Global analysis described in Methods of above publication
    NorthAtlantic_Dataset_866_20210809.csv - Data file contains all data (866 influenza A virus sequences and associated epidemiological meta-data) used in the North Atlantic analysis as described in Methods of above publication

2. Relationship between files, if important: N/A

3. Additional related data collected that was not included in the current data package: N/A

4. Are there multiple versions of the dataset? NO
    A. If yes, name of file(s) that was updated: N/A
        i. Why was the file updated? N/A
        ii. When was the file updated? N/A

METHODOLOGICAL INFORMATION

1. Description of methods used for collection/generation of data: All globally available avian and marine mammal IAV PB2 genes sequenced between 2009 and 2019, excluding those from Iceland, were downloaded from the National Center for Biotechnology Information Influenza Virus Resource database (NCBI IVR) (Bao et al., 2008) on February 12, 2020, resulting in 13,434 sequences.

2. Methods for processing the data: Duplicate sequences (based on collection date, location, and nucleotide content) and sequences with less than 75% unambiguous bases were removed, and all vaccine derivative and laboratory-synthesized recombinant sequences were excluded. Sequences in the dataset were only included if isolation dates, location, and host species were available, resulting in 7,210 remaining sequences. The subsequent downsampling strategy aimed to reduce the number of sequence taxa for computational efficiency and to mitigate sampling bias while maintaining genetic diversity in the dataset.

Downsampling of global taxa outside of Iceland (i.e. the ‘outgroup’): Four variables were considered important for explaining genetic diversity in the global outgroup IAV sequence dataset: geographic region, host taxa, sampling year, and hemagglutinin (HA) subtype. Five geographic region categories included North America, Europe, Asia, Africa, and South America (Australia and Antarctica were removed due to insufficient sequence counts). Fourteen HA subtype categories included H1, H2, H3, H4, H6, H8, H10, H11, H12, H13, H14, H15, H16, and pooled H5/7/9. H5, H7, and H9 were combined to reduce bias, as these were over-represented in the global dataset. Four host categories included Anseriformes, Charadriiformes, Galliformes, and Other, which comprised all other avian taxa and marine mammals.

To inform the downsampling strategy for the outgroup and evaluate if any of the four variables were correlated, a multiple correspondence analysis (MCA) was performed (JMP Pro v.14.0.0 (JMP Version 14.0.0, 1989-2019)). The MCA uses categorical data as input, which for this study included the sampling metadata associated with each sequence (region, host taxa, year, and HA subtype). Through representation of the variables in two-dimensional Euclidean space, significant clustering of HA subtypes with host taxa was detected (Supplementary figure 2), indicative of host-specific subtypes that are a well-known feature of influenza (Olson et al., 2014). These findings, confirmed by previously published data on species-specificity of HA subtypes (Byrd-Leotis, Cummings, & Steinhauer, 2017; Long, Mistry, Haslam, & Barclay, 2019; Verhagen et al., 2015), led us to downsample the dataset stratifying taxa by two non-overlapping variables: geographic region and HA subtype. Downsampling of the global data resulted in 21-75 taxa per five geographic region categories and 6-30 taxa per 14 HA subtype categories, resulting in a total of 301 sequences in the global outgroup, with relative evenness across sampling years.
This step was performed to mitigate sampling bias resulting in over-representation of species or viral strains, while accounting for genetic diversity in the dataset.

Downsampling of Iceland-derived taxa (i.e. the ‘ingroup’): Next, virus sequences from Iceland (n=93, including 35 downloaded from NCBI IVR and 58 novel viruses isolated and recently sequenced by our group) (Dusek et al., 2022) were downsampled by stratifying taxa by HA subtype (generating 1-15 sequences per 14 HA subtype categories), resulting in 63 sequences. These 63 sequences were used for global and local discrete trait analyses, and the ingroup dataset reflected the underlying composition of host-specific subtypes present in this localized system.

To assist with rooting and time-calibration of the tree, historical avian sequences from NCBI IVR were downloaded for the years 1979-2008. These were downsampled by year to ensure one sequence per year, resulting in 30 historic sequences. The total downsampled dataset, comprising the outgroup (n=301), ingroup (n=63), and historic sequences (n=30), contained 394 sequences (Gass J.D., 2021).

Europe-Iceland-North America Datasets: Following analyses which identified the most significant geographic regions acting as sources of IAVs to Iceland, a second dataset was constructed, restricted in scale to Europe, Iceland, and North America between 2009 and 2019. First, the cleaned global dataset described above (n=7245) was downsampled to include significant source regions of North America (n=3222) and Europe (n=407), totaling 3629 sequences.

To identify at lower spatial resolution the source/sink locations relevant to Iceland, a K-means cluster analysis was performed (JMP Pro v.14.0.0 (JMP Version 14.0.0, 1989-2019)) using latitude/longitude coordinates for each of the 3629 sequences (obtained by extracting sampling location from the strain name of each sequence and searching in www.geonames.org). A total of 20 intraregional clusters resulted.
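The K-means step above was run in JMP Pro, and this README does not record its settings. For readers unfamiliar with the technique, the following is a minimal Python sketch of Lloyd's K-means on (latitude, longitude) pairs. Everything in it is an illustrative assumption rather than the authors' procedure: the function names, the deterministic farthest-first starting centroids, the use of plain Euclidean distance on raw degrees (which ignores the curvature of the Earth), and the nine example coordinates.

```python
import math


def farthest_first(points, k):
    """Deterministic starting centroids (an assumption, not JMP's method):
    begin with the first point, then repeatedly add the point that lies
    farthest from all centroids chosen so far."""
    centroids = [points[0]]
    for _ in range(1, k):
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm on (latitude, longitude) tuples.

    Returns (labels, centroids). Distances are Euclidean on raw degrees.
    """
    if k not in range(1, len(points) + 1):
        raise ValueError("k must be between 1 and the number of points")
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable, so converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels, centroids


# Hypothetical coordinates: three near Iceland, three in western Europe,
# three in eastern North America.
example_points = [
    (64.1, -21.9), (65.7, -18.1), (63.4, -19.0),
    (52.4, 4.9), (51.5, -0.1), (48.9, 2.4),
    (45.4, -75.7), (43.7, -79.4), (40.7, -74.0),
]
example_labels, example_centroids = kmeans(example_points, 3)
```

With the real data, the input would be the Latitude and Longitude columns, and the result could be compared against the Cluster_20, Cluster_20_Lat, and Cluster_20_Long columns of NorthAtlantic_Dataset_866_20210809.csv. Exact agreement should not be expected, since K-means results depend on initialization.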
Identified clusters with fewer than 50 sequences were combined with geographically proximal clusters to increase evenness of within-cluster sequence counts for discrete trait analyses, resulting in 13 intraregional cluster locations within North America, Iceland, and the rest of Europe (Supplementary figure 3). Viral sequences were then downsampled from 3629 to 743 taxa stratifying by intraregional cluster groupings and HA subtype.

Two datasets were formed, both of which included the 743 downsampled sequences (557 from North America, 229 from Europe excluding Iceland), 30 historic sequences (same as the global analysis), and: (i) for discrete trait phylogeographic and phylodynamic analyses (non-geographic analyses of host transmission and subtype reassortment patterns using discrete diffusion models): 63 downsampled Iceland-derived sequences, totaling 836 sequences, and (ii) for continuous trait phylogeographic analysis, all 93 Iceland-derived sequences were included, totaling 866 sequences (Gass J.D., 2021). For purposes of clarity, we use the term ‘phylogeographic’ to refer to analyses that are geographic in nature and ‘phylodynamic’ to refer to analyses that focus on viral diffusion between non-geographic traits, namely transmission between species and reassortment of HA subtypes.

The two datasets were formed because (a) heterogeneity in sampling among locations and host taxa can bias results of discrete trait analyses, therefore downsampling to ensure relative homogeneity of trait groupings (while striving to preserve host and pathogen population diversity across space and time) is required, and (b) continuous trait analyses are robust against heterogeneity in sampling (Baele, Suchard, Rambaut, & Lemey, 2017; Lemey, Rambaut, Welch, & Suchard, 2010), thus the full Iceland dataset was included for continuous analyses (Gass J.D., 2021).
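The stratified downsampling described above can be illustrated with a short Python sketch. This README does not state the per-stratum quotas or how sequences were chosen within a stratum, so the fixed cap (max_per_stratum), the seeded random selection, the function name, and the placeholder cluster names below are all assumptions made for illustration. Only the column names Cluster_13_locationNames and HA_subtype come from the variable list of NorthAtlantic_Dataset_866_20210809.csv.

```python
import random
from collections import defaultdict


def stratified_downsample(records, strata_keys, max_per_stratum, seed=0):
    """Keep at most max_per_stratum records per combination of strata_keys.

    records: list of dicts (one per sequence).
    strata_keys: column names that define a stratum.
    The cap and the random choice within a stratum are hypothetical.
    """
    rng = random.Random(seed)  # seeded so the selection is reproducible
    strata = defaultdict(list)
    for rec in records:
        strata[tuple(rec[key] for key in strata_keys)].append(rec)

    kept = []
    for stratum in sorted(strata):  # sorted for a stable processing order
        group = strata[stratum]
        n_keep = min(len(group), max_per_stratum)
        kept.extend(rng.sample(group, n_keep))
    return kept


# Hypothetical records; "cluster_A" and "cluster_B" are placeholder names.
example_records = (
    [{"Cluster_13_locationNames": "cluster_A", "HA_subtype": "H3"}] * 10
    + [{"Cluster_13_locationNames": "cluster_A", "HA_subtype": "H5/7/9"}] * 2
    + [{"Cluster_13_locationNames": "cluster_B", "HA_subtype": "H3"}] * 5
)
example_kept = stratified_downsample(
    example_records, ("Cluster_13_locationNames", "HA_subtype"), 3
)
```

Capping each stratum trims over-represented combinations while leaving rare ones intact, which is the general idea behind downsampling for evenness. The sequences actually retained by the authors are identified by the Local_downsample_743_63_30 and Local_downsample_743_93_30 columns.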
Multiple sequence alignments of PB2 sequences were performed using MUSCLE in Geneious Prime 2020.01.02 (https://www.geneious.com) and trimmed to the open reading frame. Maximum-likelihood phylogenies of PB2 segments in the downsampled datasets were reconstructed using RAxML v8.2.12 (Stamatakis, 2006) and temporal signal was investigated using TempEst v1.5.3 (Rambaut, Lam, Max Carvalho, & Pybus, 2016) (Supplementary figure 1).

3. Instrument- or software-specific information needed to interpret the data: This analysis was conducted using BEAST v.1.10.4 (Suchard et al., 2018)

4. Standards and calibration information, if appropriate: N/A

5. Environmental/experimental conditions: N/A

6. Describe any quality-assurance procedures performed on the data: Data were cleaned and duplicates removed.

7. People involved with sample collection, processing, analysis and/or submission: Jonathon Gass

DATA-SPECIFIC INFORMATION FOR: Global_Dataset_394_20210809.csv

1. Number of variables: 23

2. Number of cases/rows: 394

3. Variable List:
    All: Contains all sequence and associated meta-data for each sequenced virus
    Strain_name: Contains just the strain name associated with each sequence
    Accessions: GenBank Accession #, unique for each virus sequence. Can be used to look up more information for each sequence in GenBank
    Location: Geographic location where virus was sampled
    Host: Species name
    Order: Taxonomic Order group for each host species
    Subtype: Combined HA and NA (HANA) subtype of each sequenced virus
    NA_subtype: Neuraminidase subtype
    HA_subtype: Hemagglutinin subtype
    CountryFIX: Country of origin
    Region_6categories: Global regions split into 6 categories
    Continent: Continent
    yyyy: Year of sampling
    Month: Month of sampling
    Day: Day of sampling
    Nucs: PB2 segment nucleotide sequence, prior to alignment and trimming to open reading frame
    Latitude: Latitude of sampling location
    Longitude: Longitude of sampling location
    Fulloutgroup_vs_Iceland: Identifies sequences in the outgroup (i.e. all non-Iceland-derived sequences) and those sequences isolated in Iceland
    Outgroup301_Ingroup93: Identifies sequences in the outgroup (i.e. all non-Iceland-derived sequences) and those sequences isolated in Iceland
    Downsampled_ingroup63: For discrete trait phylogeographic and phylodynamic analyses (non-geographic analyses of host transmission and subtype reassortment patterns using discrete diffusion models), 63 downsampled Iceland-derived sequences were included. This variable identifies which of the total 93 Iceland sequences these 63 sequences are.
    Ingroup63_Outgroup301_Historic30: Identifies the ingroup of 63 sequences, the outgroup of 301 sequences, and the 30 historic sequences that were included to root the phylogenetic tree
    Trimmed_Nucs_394: These are the aligned and trimmed sequences that were used in all phylogenetic analyses

DATA-SPECIFIC INFORMATION FOR: NorthAtlantic_Dataset_866_20210809.csv

1. Number of variables: 32

2. Number of cases/rows: 866

3. Variable List:
    All: Contains all sequence and associated meta-data for each sequenced virus
    Strain_name: Contains just the strain name associated with each sequence
    Accessions: GenBank Accession #, unique for each virus sequence. Can be used to look up more information for each sequence in GenBank
    Location: Geographic location where virus was sampled
    Host_GenCat_Local: General host species category
    Host: Species name
    Order_All: Taxonomic Order group for each host species
    Order_AnsCharGallOth: Taxonomic Orders grouped as Anseriformes, Charadriiformes, Galliformes, and Other (comprising all other Orders in the dataset)
    Subtype: Combined HA and NA (HANA) subtype of each sequenced virus
    NA_subtype: Neuraminidase subtype
    HA_subtype: Hemagglutinin subtype
    Country: Country of origin
    Country_2: Country of origin, with masked historical sequences coded as 'mask'
    Region_6categories: Global regions split into 6 categories
    NA_Eur_Ice: Specifies location where sample was derived: North America, Europe, or Iceland
    Continent: Continent
    yyyy: Year of sampling
    Month: Month of sampling
    Day: Day of sampling
    Nucs: PB2 segment nucleotide sequence, prior to alignment and trimming to open reading frame
    Latitude: Latitude of sampling location
    Longitude: Longitude of sampling location
    Source: Source of sequence information:
        NCBI 1979-2008: Historic sequences used to root the tree
        NCBI 2009-2019: Global outgroup sequences
        Icland seqs: Sequences sampled by our research group in Iceland
    Cluster_20: For North Atlantic analysis, sequences were clustered using a K-means clustering approach, which identified 20 clusters based on geographic proximity
    Cluster_20_Lat: Latitude of the cluster centroid
    Cluster_20_Long: Longitude of the cluster centroid
    Cluster20_mod13: The 20 clusters were further grouped into 13 clusters, used in the final analysis
    Cluster_13_locationNames: Names of each of the 13 clusters
    Local_downsample_743_93_30: Identifies all sequences used in the continuous analysis (outgroup of 743, ingroup of 93, and 30 historic sequences)
    Local_downsample_743_63_30: Identifies all sequences used in the discrete analysis (outgroup of 743, ingroup of 63, and 30 historic sequences)
    Trimmed_Nucs_394: These are the aligned and trimmed sequences that were used in all phylogenetic analyses

4. Missing data codes: All missing data are coded as 'null'

5. Specialized formats or other abbreviations used: N/A
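Because missing values are coded as the literal string 'null', they need to be converted before numeric columns such as Latitude and Longitude are parsed. The following is a minimal Python sketch using only the standard library. The helper names (read_rows, to_float) and the three-row sample are hypothetical, and the accession values in the sample are placeholders rather than real GenBank accessions. Only the column names and the 'null' code come from this README.

```python
import csv
import io
from collections import Counter

MISSING_CODE = "null"  # missing-data code stated in this README


def read_rows(handle):
    """Read a CSV file object into a list of dicts, mapping 'null' to None."""
    reader = csv.DictReader(handle)
    return [
        {name: (None if value == MISSING_CODE else value)
         for name, value in row.items()}
        for row in reader
    ]


def to_float(value):
    """Convert a coordinate string to float, passing None through."""
    return None if value is None else float(value)


# Hypothetical three-row sample that mimics a few of the documented columns.
SAMPLE = """Accessions,HA_subtype,Latitude,Longitude
ACC0001,H3,64.1,-21.9
ACC0002,H5,null,null
ACC0003,H3,52.4,4.9
"""

sample_rows = read_rows(io.StringIO(SAMPLE))
subtype_counts = Counter(row["HA_subtype"] for row in sample_rows)
coordinates = [
    (to_float(row["Latitude"]), to_float(row["Longitude"]))
    for row in sample_rows
]
```

For the real files, the same function could be applied to a handle opened with open("Global_Dataset_394_20210809.csv", newline=""). The file encoding is not stated in this README, so an explicit encoding argument may be needed.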