VCF file containing the final dataset of 4596 SNPs used for emperor penguin population genomics.

RADseq:
RAD libraries were prepared using the SbfI restriction enzyme. RADseq for all individuals was performed at the Edinburgh Genomics Facility, University of Edinburgh (https://genomics.ed.ac.uk/). Briefly, 250 ng of DNA per individual was digested with SbfI-HF (NEB), followed by ligation to barcoded P1 adapters. The uniquely barcoded individuals were pooled into multiplexed libraries and each library was sheared into fragments of 300-400 bp. Fragments were size-selected using gel electrophoresis. The libraries were blunt-ended (NEB Quick Blunting Kit) and A-tailed prior to ligation with P2 adapters (IDT). Enrichment PCR was performed to increase yield, followed by product purification with AMPure beads. The pooled, enriched libraries were checked for size and quantity using Qubit and a qPCR assay. Each library was then sequenced in a lane of the Illumina HiSeq 2500 using 125-base paired-end reads in high output mode (v4 chemistry).

Bioinformatics, SNP calling and filtering:
FastQC was used to assess read quality and check for adapter contamination. We used process_radtags within the Stacks pipeline v1.35 to de-multiplex, trim and clean reads. We then truncated reads to 113 bp and excluded read pairs in which either read had uncalled bases, a low quality score and/or a barcode or cut-site with more than one mismatch. The remaining paired reads were aligned to the emperor penguin reference genome (http://gigadb.org/dataset/100005) using bwa-mem. We prevented terminal alignments by enforcing a clipping penalty of 100. Reads with more than five mismatches, multiple alignments and/or more than two indels were removed using a custom Python script (filter.py). We removed PCR duplicates with Picard tools (http://broadinstitute.github.io/picard).
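
The alignment filter described above (more than five mismatches, multiple alignments and/or more than two indels) could be sketched roughly as follows. This is not the filter.py used for this dataset; it is a minimal illustration under stated assumptions: the function names are hypothetical, indels are counted as CIGAR insertion/deletion events rather than bases, mismatches are estimated as the NM edit distance minus indel bases, and a read is treated as multiply aligned if it carries a secondary/supplementary flag or an XA or SA tag.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def alignment_stats(sam_line):
    """Return (mismatches, indel_events, is_multi) for one mapped SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    cigar = fields[5]
    # Optional fields look like TAG:TYPE:VALUE; the value may itself contain colons.
    tags = {}
    for item in fields[11:]:
        tag, _type, value = item.split(":", 2)
        tags[tag] = value

    ops = CIGAR_OP.findall(cigar)
    indel_events = sum(1 for _, op in ops if op in "ID")
    indel_bases = sum(int(n) for n, op in ops if op in "ID")

    # NM is the edit distance (mismatches + inserted + deleted bases).
    # Assumption: the aligner wrote an NM tag for every mapped read.
    mismatches = int(tags["NM"]) - indel_bases

    # Assumption: secondary (0x100) or supplementary (0x800) flags, or an
    # XA/SA tag, mark a read with more than one alignment.
    is_multi = bool(flag & 0x900) or "XA" in tags or "SA" in tags
    return mismatches, indel_events, is_multi


def keep_alignment(sam_line, max_mismatches=5, max_indels=2):
    """True if the record passes the thresholds given in the text."""
    flag = int(sam_line.split("\t")[1])
    if flag & 0x4:  # unmapped reads are dropped
        return False
    mismatches, indel_events, is_multi = alignment_stats(sam_line)
    return (mismatches <= max_mismatches
            and indel_events <= max_indels
            and not is_multi)
```

In practice such a filter would be applied to both reads of a pair, and whether a pair is dropped when only one mate fails is a further choice not specified here.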

We used the Stacks pipeline (pstacks, cstacks, sstacks, rxstacks, cstacks, sstacks, populations) to prepare a dataset of unlinked, filtered SNPs from the RAD reads. In pstacks we selected a minimum stack depth of six reads mapping to the same location and used the bounded SNP model with a significance level of α = 0.05, an upper bound of 0.1 and a lower bound of 0.0041 (corresponding to the highest sequencing error rate recorded by phiX spikes in the sequencing lanes). All 110 individuals were used to build the catalog in cstacks. In rxstacks we removed confounded loci with a conservative confidence limit of 0.25. Also in rxstacks, we removed excess haplotypes from individuals as well as any loci with a mean log likelihood < -10. Further filtering was conducted in the populations module. We removed SNPs with a minor allele frequency (MAF) < 0.01 and removed loci with a heterozygosity > 0.5, as these could be paralogs. A single SNP per RADtag was chosen at random in order to remove tightly linked SNPs from the dataset. We also specified that a locus must be present in all colonies to be included in the final dataset, as well as genotyped in at least 80% of individuals from each colony. We then removed any SNPs with a mean coverage exceeding 100× to avoid SNPs from repetitive regions of the genome. We also removed SNPs that were out of Hardy-Weinberg equilibrium (HWE) at p < 0.01 in > 50% of the colonies. BayeScan was used to identify four SNPs putatively under selection, which were removed from the dataset.
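
As an illustration of the MAF, heterozygosity and per-colony genotyping-rate thresholds described above, the following minimal sketch applies them to a single SNP. It is not the Stacks populations implementation. The function name and the genotype encoding (0, 1 or 2 copies of the alternate allele, None for a missing call) are assumptions made for the example, and MAF and observed heterozygosity are computed here across all individuals pooled, which may differ from how Stacks evaluates them.

```python
def snp_passes(genotypes_by_colony, min_maf=0.01, max_het=0.5, min_call_rate=0.8):
    """Return True if one SNP passes the three thresholds.

    genotypes_by_colony maps a colony name to a list of genotypes for this
    SNP, one per individual: 0, 1 or 2 alternate alleles, or None if missing.
    """
    called = []
    for colony, genotypes in genotypes_by_colony.items():
        present = [g for g in genotypes if g is not None]
        # The locus must be genotyped in at least min_call_rate of the
        # individuals in every colony (and so be present in all colonies).
        if not genotypes or len(present) / len(genotypes) < min_call_rate:
            return False
        called.extend(present)

    # Alternate allele frequency over all called diploid genotypes.
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1 - alt_freq)

    # Observed heterozygosity: fraction of called genotypes that are 0/1.
    het = sum(1 for g in called if g == 1) / len(called)

    return maf >= min_maf and het <= max_het


# Hypothetical colony names and genotypes, for illustration only.
example = {
    "colony_A": [0, 0, 1, 0, None],
    "colony_B": [0, 1, 0, 0, 0],
}
print(snp_passes(example))
```

The coverage, HWE and BayeScan steps are not covered by this sketch; they would be applied as additional, separate filters.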