Testcrosses are an efficient strategy for identifying cis-regulatory variation: Bayesian analysis of allele-specific expression (BayesASE)


	Materials and methods



	BayesASE modules

BayesASE consists of four main modules: Genotype Specific References, Alignment and SAM Compare, Prior Calculation, and Bayesian Model (Figure 2). Flexibility, generality and process checking are present in each of the four modules that make up the BayesASE pipeline. Workflows for each module are coded according to SLURM workflow manager, Nextflow and Galaxy specifications, and all scripts are available on Github ( https://github.com/McIntyre-Lab/BayesASE). The Galaxy package BayesASE has a detailed User Guide (Supplementary File S1) that describes the structure of the workflow and all input and output requirements for each of the individual scripts. The code is also available as a PyPI package, a bioconda package and in the Galaxy Toolshed (see Data Availability Section).
The Genotype Specific References module (Supplementary Figure S1) requires as input the reference genome as FASTA file, and genotype specific SNP variants as VCF files, and returns as output a set of genotype specific references obtained incorporating SNP variants into the reference genome. Genotype specific reference reduce mapping bias that occurs when a common reference is used (Skelly et al. 2011;Turro et al. 2011;Graze et al. 2012;Satya et al. 2012;León-Novelo et al. 2014;Munger et al. 2014;Fear et al. 2016). Input VCF files can be generated using the GATK (DePristo et al. 2011), and index input genotype VCF files. BWA index (Li and Durbin 2010) is used to create index files needed for downstream alignment.
The Alignment and SAM Compare module (Supplementary Figure S2) quantifies alignment counts for each input file for each of the two genotype specific genomes of the parents of the F1, compares the alignment files in SAM format, and outputs count tables of reads aligned to each parental genome. In this module, input experimental reads for each F1 sample are aligned to each of their updated parental reference genomes with BWA-MEM (Li 2013). Alignment output counts are compared and reads are designated as mapping to one of the parental genomes (or to both when reads mapping equally well to the two genomes). Technical replicates are summed together, and coverage metrics are calculated. The data are then flagged for low coverage (user defined).
The Bayesian model requires an estimate of a prior to specify the probability that a read generated from the gene in genome 1 maps better to genome 1 than to genome 2 or is mapping equally well on both genomes for each condition. In this implementation, the prior is specified for each gene/cross and each condition separately allowing for maximum flexibility. For each gene/cross/condition, q1 is the prior probability that a read originating from genome 1 maps better to genome 1 and q2 is the prior probability that a read originated from genome 2 maps better to genome 2, with the following constraint: 0 < q1, q2 < 1. The Prior Calculation module (Supplementary Figure S3) estimates priors from DNA data, the RNA data, or simulated data. This module receives count tables for read counts for both informative (mapping to a particular parent) and uninformative reads as input and uses them to estimate a prior probability distribution for each given feature. Priors can also derive independently from this tool and supplied directly to the model. Priors provide information on differences in mapping on the two genomes in absence of AI (Graze et al. 2012;Fear et al. 2016;León-Novelo et al. 2018).
The Bayesian Model module (Supplementary Figure S4) requires as input a design file to identify comparisons to be performed, alignment count tables and priors, either from the previous module or as direct input. The Bayesian model itself is the STAN (Carpenter et al. 2017) implementation of a previously published model for the detection of AI in one or between multiple conditions (León-Novelo et al. 2018) with an extension allowing the priors to be independently specified between conditions. Model output files include values for estimates of levels of AI and their 95% central credible intervals. Output also includes the Bayesian measure of evidence against allelic balance (De Bragança Pereira and Stern 1999;Thulin 2014), defined as the smallest number ev such that the 1-ev central credible interval for estimate of AI does not contain the null value indicating allelic balance. Values of ev can be used in a decision theory context to make decisions about rejecting the null hypothesis.


	Workflow deployment platforms and protocol

To facilitate its use, BayesASE is available for SLURM schedulers, as workflows on the Nextflow platform, and as a Galaxy Tool. There is an example set of sbatch scripts for the SLURM implementation on GitHub ( https://github.com/McIntyre-Lab/BayesASE). Source code is available under the MIT license. BayesASE has a modular structure; flowcharts of the individual modules outline the logic (Supplementary Figures S1–S4) and detailed information about input/output is described in the Galaxy User Guide available on GitHub and as Supplemental File S1. Names are consistent with the flow charts and standard naming conventions for sbatch and Nextflow are used.
The most direct option that is provided for users is through several templates for sbatch scripts; they can be modified as needed with the input and output file names and submitted to a SLURM scheduler. SLURM is an open-source cluster management and job scheduling system ( https://slurm.schedmd.com/); while our work has been focused on SLURM scheduler, our scripts can be used as templates for deployment using a different scheduler. Nextflow (Di Tommaso et al. 2017) is a portable, parallelizable and reproducible framework available in the cluster environment. It defines data workflows that can be executed on diverse portable batch system schedulers such as SGE, SLURM, or Cloud platforms. Nextflow pipelines consist of a configuration file and a series of processes that define the major steps in the pipeline. Processes are independently executed from one another and the platform supports a variety of languages such as Bash, Python, R, and so on. Processes are connected through input or output channels, allowing data to be passed through the pipeline. Nextflow also provides an automatic caching mechanism for identifying and skipping successfully completed tasks and using previously cached results for downstream tasks. Each BasyesASE module has been coded in Nextflow and is available on GitHub ( https://github.com/McIntyre-Lab/BayesASE). Galaxy is an open-source project designed to be deployed in a web browser, providing a user-friendly platform that can be configured to run on global servers in large universities or on the local individual machines (Goecks et al. 2010). Reproducibility in Galaxy is accomplished via histories and the creation of workflows that can be shared among collaborators. The BayesASE modules are developed for use with Galaxy, and have been deposited in the Galaxy Toolshed. Details about how to deploy these tools are in the User Guide.


	Testcrosses can be used to efficiently estimate cis effects

To demonstrate that the testcross can be used to predict allelic differences between nontester alleles we analyzed a publicly available data set of male and female F1 Mus musculus brain RNA-seq from three different inbred mouse strains, CAST/EiJ, PWK/PhJ, and WSB/EiJ (Zou et al. 2014;Crowley et al. 2015). The authors provided two alternative notations for the inbred strains: CAST/EiJ = F, PWK/PhJ = G, and WSB/EiJ = H, or alternatively CAST/EiJ = CAST, PWK/PhJ = PWK, and WSB/EiJ= WSB. We denote each F1 sample by its maternal strain × paternal strain. For example, a CAST × WSB mouse is an offspring of a CAST/EiJ female that is mated with a WSB/EiJ male. This same F1 sample can be denoted as FH (short for F × H) in which an F male is mated with an H female. This shorter notation is used to label some tests/figures and the model results. For the original study, the main hypothesis of interest was a test of the null AIFG =AIGF, AIGH =AIHG, AIFH =AIHF, i.e. the absence of a parent-of-origin effect. We repeated the analysis with BayesASE, using the same read counts as the original work, and male and female offspring of each cross were analyzed separately as a series of pairwise analyses using the Bayesian model (León-Novelo et al. 2018) and not jointly (Zou et al. 2014).
The option of performing pairwise analysis of AI is a novel feature of our Bayesian model (León-Novelo et al. 2018), in which we introduced the possibility of analyzing AI in two different conditions or genotypes and at the same time testing the hypothesis of difference in the levels of AI between the two conditions. We indicate with H1 the hypothesis of allelic balance in condition 1, with H2 the hypothesis of allelic balance in condition 2, and with H3 the hypothesis of no difference between the levels of AI in conditions 1 and 2 (Figure 1).
Our main interest was to assess whether the theoretical prediction that the testcross can be used to compare different nontester alleles. In the mouse data, the offspring of the crosses CAST × PWK and CAST × WSB, represent a testcross of the PWK and WSB alleles with CAST as the tester allele. Using the flexible design in BayesASE, we estimated AI and tested whether the AI was significant between alleles CAST and PWK in the offspring of CAST × PWK (H 1 inFigure 1; red boxes), alleles CAST and WSB in the offspring of CAST × WSB (H2 inFigure 1, red boxes), and the difference in AI between the two offspring (H3 inFigure 1, red boxes). The null hypothesis H3 is that the expression of PWK in CAST × PWK is the same as the expression of WSB in CAST × WSB. Because CAST is common to both crosses (and the maternal allele, in this case), testing H3 is a test between PWK and WSB. We also compared the test crosses PWK × CAST and WSB × CAST in this case CAST is still the tester allele but here it is inherited paternally. The other two testcross combinations were tested in the same manner. Male and female offspring were evaluated separately. We focus only on the autosomal effects and do not consider the X, Y, or mitochondrial loci.
To verify the predictions made in the testcross, the reciprocal crosses were used (Figure 1, blue boxes). For example, to verify differences in the PWK and WSB alleles predicted by the CAST × PWK and CAST × WSB cross we examined the PWK × WSB and WSB × PWK crosses. Because the alleles in the testcrosses are compared with the same parental inheritance and the alleles in the direct test have different parental inheritance, we do not expect that 100% of the predictions will be realized. In addition, the trans environments are different between the testcrosses, with trans effects of the tester allele shared but differing in the nontester allele. The Spearman coefficient of correlation was used to compare the estimates for AI estimated from BayesASE with the estimates of AI used in the original work (Zou et al. 2014;Crowley et al. 2015) for the 95 imprinted genes and all data were analyzed using the same counts from the original analysis.
