
README for Genomic associations with poxvirus across divergent island populations in Berthelot’s pipit
Eleanor C. Sheppard, Claudia A. Martin, Claire Armstrong, Catalina González-Quevedo, Juan Carlos Illera, Alexander Suh, Lewis G. Spurgin & David S. Richardson

2021-11-17

This readme file describes the data files accompanying the above publication. For any further queries please contact e.sheppard@uea.ac.uk or david.richardson@uea.ac.uk

The original raw RAD reads for each individual sample and the Berthelot's pipit draft genome are not supplied here, as they are already available in the dataset previously published by Armstrong et al. 2018 at https://doi.org/10.5061/dryad.9642b

The files available there include:

1) pipit_qseq_files.zip
This zip file contains a separate .qseq file of raw RAD reads for each individual sample.

2) Anthus_berthelotii_PS_816_genome.zip
This zip file contains a BLAST database of the draft Berthelot's pipit genome as described in the Supplementary Methods of Armstrong et al. 2018. This genome was sequenced from sample 816 from Porto Santo.


The following files, which are required to run the analyses detailed in this paper, are provided here:

*NOTE! For samples collected prior to 2019, ‘XX’ has been added to the beginning of the sample identifier throughout this study. This has been done to match the format of the identifier of samples collected from 2019 onwards. Despite the change in format, the integers in the sample identifiers match those used in previous datasets (Armstrong et al. 2018; Martin et al. 2021).
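
When matching samples against the earlier datasets, the added prefix may need to be removed again. The following is a minimal Python sketch, assuming the prefix is literally the two characters 'XX' placed directly in front of the original identifier. The function name and the example identifier are hypothetical and are not taken from the data files.

```python
def strip_xx_prefix(sample_id: str) -> str:
    """Return the identifier without a leading 'XX', if one is present."""
    # Assumption: the prefix is exactly the two characters 'XX'.
    if sample_id.startswith("XX"):
        return sample_id[2:]
    return sample_id


# Hypothetical identifier, for illustration only.
original_id = strip_xx_prefix("XX123")
print(original_id)
```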

1) R script for individual-based GLM and GLMM models: "GLM-Ms.R". This single script produces the outputs of the TLR4 and MHC association analyses: Figure 2, Tables 1-3, the statistics in the main text, and Supplementary Tables S5 & S6.

2) SCRIPTS.sh file. Code used to undertake the Bayenv analyses outlined in this paper, as detailed in the Methods section 'Identification of SNPs correlated with population-level pox prevalence' of the current manuscript.
*NOTE! These scripts need to be run before the R scripts, as they produce the following outputs:
	matrix_single'X'.txt (x 10)
	bf_environ.pox_100_000_run'X'.txt (x 5)
	bf_environ.pox_200_000_run'X'.txt (x 5)
	bf_environ.pox_500_000_run'X'.txt (x 5)
	MAF_pox_candidateSNPs.csv

3) R script to retrieve the data for the environmental and covariance matrix Bayenv input files: "Bayenv_input_script.R". This script creates Supplementary Tables S3 & S4, and its output is used to create the following files:
	ENVIRONFILE_pox.txt 
	mean_matrix.txt 

3) Bayenv R script for further analyses and plotting: "Bayenv_output_script.R". This script produces the statistics reported in the text, Figures 3 & 4, and some of the statistics in Table 4.

4) pipit_sample_information.csv
This file contains phenotypic and population information for each pipit sample.
Sample: unique identifier per sample
Archipelago: 'CI' (Canary Islands), 'M' (Madeira), or 'S' (Selvagens)
Island: abbreviation for each island (12)
Population: abbreviation for each population (13)
Population_code: numbered code for each population
Sex: 'M' male, 'F' female
Age_code: Birds were classified as juvenile (EURING age code 3; born this calendar year) or adult (EURING age codes 4–6; born before this calendar year), based on feather moult pattern (Cramp 1988), with the exception of samples collected in 2009, where age codes 3 and 5 were classified as juvenile
Age: 'A' adult, 'J' juvenile
Malaria: presence (Y) or absence (N) of avian malaria in blood sample
Pox: presence (Y) or absence (N) of avian pox lesions on bird
Year: sample collection year
RAD: sample included in RAD dataset (Y) or not (N)
MHC: sample included in MHC dataset (Y) or not (N)
TLR4: sample included in TLR4 dataset (Y) or not (N)
RAD_Library: library number the sample was sequenced within
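
As an illustration of how the columns above can be used, the following Python sketch computes the proportion of pox-positive samples per population. It is not the authors' code (their analyses are in the R scripts listed above). It assumes the column headers are spelled exactly as listed here, and the example rows, identifiers and population abbreviations are made up.

```python
import csv
import io
from collections import defaultdict


def pox_prevalence(handle):
    """Return {population: proportion of samples scored Pox == 'Y'}."""
    infected = defaultdict(int)
    total = defaultdict(int)
    for row in csv.DictReader(handle):
        if row["Pox"] not in ("Y", "N"):
            continue  # assumption: any other value is a missing score
        total[row["Population"]] += 1
        if row["Pox"] == "Y":
            infected[row["Population"]] += 1
    return {pop: infected[pop] / total[pop] for pop in total}


# Made-up rows using three of the documented columns; identifiers and
# population abbreviations are hypothetical. For the real file, pass
# open("pipit_sample_information.csv", newline="") instead.
example = io.StringIO(
    "Sample,Population,Pox\n"
    "XX001,POP_A,Y\n"
    "XX002,POP_A,N\n"
    "XX003,POP_B,N\n"
)
prevalence = pox_prevalence(example)
print(prevalence)
```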

5) Pipit_MHC.csv
This file contains MHC genotype data for each pipit sample.
Sample: unique identifier per sample
Presence (1) or absence (0) of each allele:
	ANBE10
	ANBE2
	ANBE8
	ANBE4
	ANBE43
	ANBE1
	ANBE44
	ANBE45
	ANBE7
	ANBE13
	ANBE9
	ANBE46
	ANBE16
	ANBE47
	ANBE11
	ANBE28
	ANBE6
	ANBE48
	ANBE49
	ANBE38
	ANBE3
	ANBE31
Nalleles: number of MHC alleles
Nalleles_without_3_31: number of MHC alleles excluding 'low efficiency alleles' (ANBE3 and ANBE31)
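
The two summary columns can be checked against the presence/absence columns. The sketch below, in Python, assumes each allele column header is exactly the allele name listed above. The example row is made up and shows only four of the allele columns.

```python
import csv
import io

# The 'low efficiency alleles' named in this readme.
LOW_EFFICIENCY = {"ANBE3", "ANBE31"}


def count_alleles(row, exclude=()):
    """Sum the presence (1) / absence (0) scores over the ANBE columns."""
    return sum(
        int(value)
        for column, value in row.items()
        if column.startswith("ANBE") and column not in exclude
    )


# Made-up row for illustration; the real file has one column per allele.
example = io.StringIO(
    "Sample,ANBE1,ANBE3,ANBE31,ANBE38\n"
    "XX001,1,1,0,1\n"
)
row = next(csv.DictReader(example))
n_all = count_alleles(row)                              # compare with Nalleles
n_reduced = count_alleles(row, exclude=LOW_EFFICIENCY)  # compare with Nalleles_without_3_31
print(n_all, n_reduced)
```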

6) Pipit_TLR4.csv
This file contains TLR4 genotype data for each pipit sample.
Sample: unique identifier per sample
Number of copies of each SNP allele, e.g. TLR4_1_A = 2 is a homozygote AA:
	TLR4_1_A
	TLR4_1_G
	TLR4_2_A
	TLR4_2_G
	TLR4_3_C
	TLR4_3_T
	TLR4_4_A
	TLR4_4_C
Number of copies of each protein haplotype:
	TLR4_Prot_1
	TLR4_Prot_2
	TLR4_Prot_3
	TLR4_Prot_4
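
The per-allele copy counts can be converted back into genotype strings. The following Python sketch assumes the two allele columns for a SNP sum to 2 and contain no missing values. The function name and the example row are hypothetical.

```python
def snp_genotype(row, snp, alleles):
    """Build a genotype string such as 'AG' from per-allele copy counts."""
    # Each allele letter is repeated by its copy number (0, 1 or 2).
    return "".join(allele * int(row[f"TLR4_{snp}_{allele}"]) for allele in alleles)


# Made-up counts for two of the four SNPs.
row = {"TLR4_1_A": "1", "TLR4_1_G": "1", "TLR4_3_C": "0", "TLR4_3_T": "2"}
genotype_1 = snp_genotype(row, 1, "AG")
genotype_3 = snp_genotype(row, 3, "CT")
print(genotype_1, genotype_3)
```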

7) "Berthelots" dataset.
This comprises the zipped .bed file and its associated .bim and .fam files. The dataset includes loci with 3 or fewer ambiguous genotypes, and up to 10% missing/ambiguous genotypes.
For more information on .bed file formats, see https://www.cog-genomics.org/plink2/formats#bed.
For the "Berthelots" .bim file, the first column gives the chromosome as a number between 0 and 35. The corresponding chromosome names can be found in chromosome_codes.txt.
In the .fam file, the first column contains the population code.
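
Because the first column of the .fam file holds the population code, the number of genotyped samples per population can be tallied directly from it. This Python sketch relies on the standard PLINK .fam layout (six whitespace-separated columns: family ID, individual ID, father, mother, sex, phenotype). The example lines are made up, and the file name "Berthelots.fam" in the comment is an assumption.

```python
import io
from collections import Counter


def samples_per_population(handle):
    """Count .fam lines per value of the first column (population code)."""
    counts = Counter()
    for line in handle:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts


# Made-up .fam lines; for the real data, pass open("Berthelots.fam")
# (assumed file name) instead.
example = io.StringIO(
    "1 XX001 0 0 0 -9\n"
    "1 XX002 0 0 0 -9\n"
    "2 XX003 0 0 0 -9\n"
)
population_counts = samples_per_population(example)
print(population_counts)
```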

8) chromosome_codes.txt
This file contains zebra finch (Taeniopygia guttata) chromosome names, and their equivalent numeric codes used in the .bim files.
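
The numeric codes in the .bim file can be translated back to chromosome names with a simple lookup. The Python sketch below relies on the standard PLINK .bim layout (chromosome code, variant ID, centimorgan position, base-pair position, allele 1, allele 2). The layout of chromosome_codes.txt is not described in this readme, so the sketch assumes two whitespace-separated columns (name, then code); adjust name_col and code_col if the file differs. All example values are made up.

```python
import io


def read_chromosome_codes(handle, name_col=0, code_col=1):
    """Return {numeric code: chromosome name} from a two-column text file."""
    codes = {}
    for line in handle:
        fields = line.split()
        if len(fields) > max(name_col, code_col):
            codes[fields[code_col]] = fields[name_col]
    return codes


def variant_chromosomes(bim_handle, codes):
    """Yield (variant ID, chromosome name) for each line of a .bim file."""
    for line in bim_handle:
        fields = line.split()
        if fields:
            # Fall back to the raw code if it is not in the lookup table.
            yield fields[1], codes.get(fields[0], fields[0])


# Made-up file contents; the column order of the codes file is an assumption.
codes_file = io.StringIO("1A 2\nZ 35\n")
bim_file = io.StringIO("2\tsnp_a\t0\t1000\tA\tG\n35\tsnp_b\t0\t2000\tC\tT\n")

codes = read_chromosome_codes(codes_file)
named = list(variant_chromosomes(bim_file, codes))
print(named)
```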