spflanagan/SCA: selection components analysis v1.0

Sarah P. Flanagan

doi:10.5281/zenodo.200439

Published December 12, 2016 | Version v1.0

Software Open

spflanagan/SCA: selection components analysis v1.0

Sarah P. Flanagan

This release contains the scripts and programs used for the analysis contained in the Flanagan & Jones Evolution paper.

See README_SCA for a full description of each program and script:

SCA Analysis README

This file contains information about each program used in the selection components analysis paper.

GENOME-WIDE SELECTION COMPONENTS ANALYSIS PROGRAMS infer_mat_vcf

PURPOSE

This program takes a list of fathers and their paired offspring to infer the maternal allele at compatible loci.

NOTES

Variant Call Format usually has more entries per individual than just genotype (GT), but this program ignores those. Any pruning for allele depth etc. must occur before running this program.

INPUT

The path for output files
A tab-delimited file with the fathers and their paired offspring. One line per pair, with the father in the first column and the offspring in the second column.
A vcf file with all of the fathers and their offspring in it, with the same name as the names found in the father-offspring pair file.
Output name

OUTPUT

A vcf file with only the GT format. The INFO field is no longer relevant but is retained from the original vcf.

HOW TO RUN

The program currently (5 May 2016) cannot be run interactively or on the command line. I plan to change this in the near future to make it more user-friendly. You must go into the source file and edit the path (where things are), dad_kid_name (name of the father-offspring pairs list file), vcf_name (vcf file name), and summary_name (output vcf) settings.

gwsca_biallelic_vcf

PURPOSE

This program calculates all pairwise FST estimates between groups within a population.

NOTES

You must provide the program with the assignment of each individual to the groups. This program does not do any filtering, although it does output summary files that will help with filterings steps.

INPUT

A vcf file
A file with each individual's assignment to groups. The file must be tab-delimited and contain six columns: individual ID in the sumstats file, an individual ID (can be the same as the first column), the sex of the individual, the age or other grouping factor, and two columns with phenotype/status of that individual. Assignments should be string factors (e.g. "MALE" or "FEMALE") and non-assignment is a 0. Here are some examples:

FileIndID IndID Sex Age Status Phenotype sample_MOM001_align MOM001 0 0 0 MOM sample_FEM087_align FEM087 FEM ADULT 0 0 sample_NPM005_align NPM005 MAL ADULT NONPREG 0 sample_OFF015_align OFF015 0 JUVIE 0 0 sample_PRM085_align PRM085 MAL ADULT PREGNANT 0

OUTPUT

File with the Fsts, with a column describing the locus and then columns for each pairwise Fst comparison
Summary file containing columns PopID, Chromosome (Chrom), Locus position on chromosome (Pos), Locus Name, Number of individuals genotyped at the locus (N), expected heterozygosity (Hs), observed heterozygosity (Ho), Allele frequency in group 1 (Allele1Freq), Allele frequency in group 2 (Allele2Freq), the frequency of homozygotes for major allele (AA), frequency of heterozygotes (Aa), and frequency of homozygotes for minor allele (aa).
A file containing the index (IndCount), individual ID (IndID), population ID (PopIndex) and Pop ID (PopName) for each individual. Individuals belonging to multiple groups will be listed multiple times, once for each group.

HOW TO RUN

The program currently (12 December 2016) cannot be run interactively or on the command line. I plan to change this in the near future to make it more user-friendly.
You must go into the source file and edit the filename settings.

maternal_alleles_sim

PURPOSE

To simulate the inference of maternal alleles and identify bias.

NOTES

This is based on actual allele frequencies from a vcf file. The results of the simulation are analyzed in gwsca_biallelic_analysis.R and reported in the supporting information file of the Flanagan & Jones Evolution paper.

INPUT

A vcf file
You must specify the number of offspring to simulate and an error rate.

OUTPUT

A tab-delimited file containing the locus ID (Locus), reference allele (Ref), alternative allele (Alt), female allele frequency (FemaleAF), male allele frequency (MaleAF), actual maternal allele frequency (ActualMomAF), and the inferred maternal allele frequency (InferredMomAF)

HOW TO RUN

The program currently (12 December 2016) cannot be run interactively or on the command line. You must go into the source file and edit the filename settings.

vcf_to_sca_monnahan

PURPOSE

The purpose of vcf_to_sca_monnahan is to convert a vcf file into input files for the Monnahan/Kelly python script.

NOTES

You must provide the file with parent-offspring matches. You can provide a whitelist if there is a subset of loci you would like to convert (rather than the entire vcf file).

INPUT

A vcf file
A text file containing two tab-separated columns, the first containing the parent ID and the second containing the offspring ID.
[Optional] A text file containing a list of Locus IDs to convert. All other loci will not be included in the output.

OUTPUT

A file compatible with ML.2016.pipefish.py

HOW TO RUN

The program currently (12 December 2016) cannot be run interactively or on the command line. You must go into the source file and edit the filename settings.

extract_sequence_part

PURPOSE

This program removes a portion of a fasta sequence based

NOTES

The program assumes that there is only one sequence in the fasta file. This is what was used to extract the 5kb regions surrounding RAD loci.

INPUT

A fasta file containing the sequences
A tab-delimited file containing the name of sequence and the start and end basepair of the sequence to extract.
Output file name

OUTPUT

A fasta file containing all of the extracted sequences. Each sequence will be re-named as >seqname_start-end

HOW TO RUN

This file can be run interactively or can be run at the command line with flags. -f: fasta file input -i: Input file -o: Output file name. -h: Prints this message no arguments: interactive mode

SCRIPTS convert_snps.R

PURPOSE

This script converts *.snps.tsv files output from Stacks into a more user-friendly format

NOTES

This contains a function, reformat.snp(), which does the reformatting and re-writing of files.

INPUT

You must specify a list of snps files to convert.

OUTPUT

For each file in the list of snps files, a tab-delimited file with the locus ID, major allele, and alternative allele are output.

HOW TO RUN

Run it line-by-line in R to specify the working directory, etc.

merge_vcfs.R

PURPOSE

This script merges two vcf files.

NOTES

This was used to combine the inferred maternal alleles and the genotypes from ddRAD-seq in the Flanagan & Jones paper. This script also removes one redundant individual from the analysis.

INPUT

You need to specify the working directory, the input vcf files, and output file names and locations.

OUTPUT

A single vcf file.

HOW TO RUN

Run it line-by-line in R to specify the working directory, etc.

gwsca_biallelic_analysis.R

PURPOSE

This script contains most of the analysis for the Flanagan & Jones paper in Evolution. NOTES
This was used to combine the inferred maternal alleles and the genotypes from ddRAD-seq in the Flanagan & Jones paper. This script also removes one redundant individual from the analysis.
It contains the following functions:
- extract.info() combines information from a summary file, Stacks tags files, and Stacks snps files to output a dataframe with the columns: "LocusID","NumSNPsOut","BPs","Pop1AF","Pop1Ho","Pop1Hs","Pop1N","Pop2AF","Pop2Ho","Pop2Hs","Pop2N"
- parse.snps() extracts all of the SNPs from a Stacks snps file.
- sem() calculates the standard error of the mean
- vcf.cov.loc() calculates the coverage per locus from a vcf file
The file is split into the following sections:
- PRUNING
  - loci with low coverage, genotyped in few individuals, and deviating significantly from hardy weinberg are excluded
- ANALYSIS
  - Create an initial plot, identify outliers and significant loci
  - Investigate allele frequency changes
  - Compare to Monnahan/Kelly approach
- PLOT
  - Where Figures 1 and 2 are made.
- WRITE TO FILE
  - Outliers are written to file, along with their surrounding 5kb regions for blastx.
- LOOK INTO EXTREME OUTLIERS
  - Investigate coverage etc. in the outliers with extremely high FST values
- BLAST2GO ANNOTATIONS
  - Match Blast2Go results back up with their loci
- COMPARE TO PSTFST SIG LOCI
  - This compared the outliers in this analysis to the significant Pst-Fst loci from Flanagan et al. (2016) in Molecular Ecology. Not included in the Flanagan & Jones Evolution paper.
- MATERNAL ALLELE FREQS SIMULATION
  - This is an analysis of the results from the maternal allele frequency simulation presented in the supplementary material.

INPUT *At the top of the file is a section (FILES) where the major files used can be specified. Some other files are read in during specific analyses.

OUTPUT

A single vcf file.

HOW TO RUN

Run it line-by-line in R to specify the working directory, etc.

plotting_functions.R

PURPOSE

This script contains several functions required to plot genome-wide summary statistics like Fst.

NOTES

plot.genome.wide() and fst.plot() both plot genome-wide statistics. fst.plot() is what was used in Flanagan & Jones to make figures 1 and 2.

INPUT

Each function has its own input requirements.

OUTPUT

genome-wide graphs of summary statistics.

HOW TO RUN

I use this through source() in other R codes.

GO_plot.R

PURPOSE

Create a bar plot summarizing the gene ontology results

NOTES

This was used to create figure 3.
This script contains a function, go.plot(), which does all the hard work of plotting and creating a file.

INPUT

You must provide the blast2go results.

OUTPUT

A barplot summarizing the blast2go results.

HOW TO RUN

Run it line-by-line in R to specify the working directory, etc.

ML.2016.pipefish.py

PURPOSE

Run the analysis from Monnahan et al. (2015), but adapted for the pipefish case.

NOTES

This should be run in the same directory as where you want the output to go.

INPUT

Run it with the output from vcf_to_sca_monnahan

OUTPUT

ML.2016.pipes.calling1.txt with the parameter estimates, maximum log-likelihoods, and likelihood ratio tests.
Detail.output.txt for diagnostics

HOW TO RUN

On the command line: python ML.2016.pipefish.py input_file.txt

SHELL SCRIPTS

run_bowtie.sh and bowtie_rerun_duplicates.sh: Runs bowtie2 on all of the samples
run_refmap_ddrad.sh: Runs Stacks on the SCA samples.
extract_sequence_parts.sh: Runs extract_sequence_parts on 5kb regions surrounding significant loci

SCA_scripts.zip

Files

spflanagan/SCA-v1.0.zip

Files (85.8 MB)

Name	Size	Download all
spflanagan/SCA-v1.0.zip md5:ce0048c1d62b49756fe8a99aaa14b226	85.8 MB	Preview Download

Additional details

Is supplement to: https://github.com/spflanagan/SCA/tree/v1.0 (URL)

Citations

Oops! Something went wrong while fetching results.

	All versions	This version
Views	199	196
Downloads	17	17
Data volume	1.6 GB	1.6 GB

spflanagan/SCA: selection components analysis v1.0

Creators

Description

Files

spflanagan/SCA-v1.0.zip

Files (85.8 MB)

Additional details

Related works