MESA PredictDB Models - README ================================= Lauren S. Mogil 2018-05-28 ## Overview This directory contains monocyte gene expression prediction models for 3 populations from MESA (AFA, HIS, CAU) and 2 combined populations (AFHI and ALL) for use with PrediXcan and S-PrediXcan. These models were trained using MESA populations. Genotype data were imputed using Michigan Imputation Server and 1000 genomes phase 3 v5 reference panel. For each model, there is also a corresponding file with SNP covariance data needed for S-PrediXcan. ## Methods Genotype data - Data were obtained from dbGaP accession phs000209.v13.p3. SNPs were filtered by call rates less than 99%. One of a pair of related individuals (IBD >0.05) were removed. Final sample sizes for each population post quality control are AFA = 233, HIS =352, and CAU = 578. MESA populations were imputed using the Michigan Imputation Server and 1000 genomes phase 3 v5 reference panel and Eagle v2.3. CAU used EUR and AFA and HIS used mixed population for QC during imputation. The results were filtered by R2 < 0.8, MAF < 0.01, and removed ambiguous SNPs. For analysis, genotypes were encoded on a continuous range from 0 to 2, with the value denoting the estimated count of the second, or effect, allele. This leaves 9,352,383 SNPs for AFA, 7,201,805 SNPs for HIS, and 5,559,636 SNPs for CAU non-LD pruned SNPs per population post quality controls. Expression data - Data were obtained from GEO accession GSE56045. Illumina IDs were converted to Ensembl IDs using the RefSeq IDs from MESA and gencode.v18 (gtf and metadata files) to match Illumina IDs to Ensembl IDs. If there were multiple Illumina IDs corresponding to an Ensembl ID, the average of those values was used as the expression level. Covariates - Before training models to predict gene expression, expression values were adjusted for the following covariates: 3 genotypic principal components and 10 PEER factors. Expression was adjusted by performing a multivariate linear regression with all covariates, pulling the residual values, and then assigning the residuals to be the new expression values. Annotations - Gene annotation was derived from gencode v18. SNP annotation was generated by running github.com/lmogil/MESA/06_make_files_predictDB.py. Nested Cross Validated Elastic-Net - We used the following "nested" cross validation procedure available at https://github.com/hakyimlab/PredictDBPipeline: 1. Randomly split the data into 5 folds. 2. For each fold: a. Remove the fold from the data. b. Use the remaining data to train an elastic-net model using 10-fold cross-validation to tune the lambda parameter. c. With the trained model, predict on the hold out fold, and get various test statistics for how the model performs. 3. Calculate the average and standard deviation of each of the significance statistics, where applicable. This should provide a reasonable estimate for how well the model will generalize to new data. 4. Train a new elastic-net model using all of the data. Again, use 10-fold cross validation to tune the lambda parameter. The non-zero weights from this final model are what are saved in the database, provided the model meets significance criteria. A model was determined to be "significant" if the average pearson correlation between predicted and observed during nested cross validation was greater than 0.1 (equivalent to R2 > 0.01) and the estimated p-value for this statistic was less than 0.05. See below for how the p-value was calculated. ## Schema ### Weights Table gene - Ensembl ID of the gene rsid - rsid of the SNP from MESA data varID - String label of the format chromosome_position_allele1_allele2_build. All varIDs are from build 37 of the Human Reference Genome. ref_allele - Allele which appears on Reference Genome eff_allele - Alternative/Effect allele. weight - The weight for this SNP in predicting expression for the gene. In predicting the expression for the gene, the weight is multiplied by the count, or estimated count, of the effect allele in the individual. This value is added to all other weighted SNPs in the model. ### Model Summaries Table This table contains summary statistics about each gene model for the tissue gene - Ensembl ID of the gene genename - HUGO symbol for the gene gene_type - Type of the gene according to Gencode v18. We have only included models for protein coding genes. alpha - The alpha parameter used with the R package glmnet to train the elastic net model. Here, we used 0.5 for every gene. n_snps_in_window - The number of cis-SNPs examined for model training. This is the number of SNPs found within 1 million base pairs upstream of the transcription start site and 1 million base pairs downstream of the transcription end site. Sites are determined from Gencode v18. n.snps.in.model - The number of SNPs within the cis window that have non-zero weights, as found by elastic-net. lambda_min_mse - The lambda parameter for elastic-net found to have the minimum average mean square error on the hold out folds during cross validation. test_R2_avg - The average coefficient of determination when predicting values of the hold out fold during nested cross validation. R2 is defined as 1 - sum((y_observed-y_predicted)**2) / sum((y_observed-mean(y_observed)**2) NB: It is possible for this value to be negative and still achieve a positive correlation between predicted and observed. test_R2_sd - The standard deviation of the coefficient of determination for each of the five hold out folds during nested cross-validation. cv_R2_avg - The average coefficient of determination for each of the hold out folds when cross-validation was performed on the entire data set. cv_R2_sd - The standard deviation of the coefficient of determination for each of the 10 hold out folds when doing cross- validation on the entire data set. in_sample_R2 - After an optimal lambda parameter has been chosen, cv.glmnet trains an elastic net model with all of the data provided, i.e. without any hold out folds. Expression is then predicted using this model, and the coefficient of determination is calculated. nested_cv_fisher_pval - Our first attempt at calculating a p-value during nested cross-validation. We calculated a p-value using a correlation test using on each of the five hold out folds, and combined the five p-values using Fisher's method. Since some models yielded predictions with a variance of 0, we were unable to perform a correlation test, and the p-value we reported for that fold was a random number from the uniform distribution on the interval (0,1), which would be the correct distribution under the assumption of the null hypothesis. Under simulated null results, we still found this method did not produce a uniform p-value distribution, with values under 0.05 and over 0.95 being under-represented. We decided not to use this statistic because it did not quite produce the desired results and the random element involved when there was prior knowledge that a model did not predict well. rho_avg - Average correlation between predicted and observed on the hold out folds when doing nested cross-validation. rho_se - Standard deviation of correlation between predicted and observed on the hold out folds when doing nested cross-validation. rho_zscore - Transformation of rho_avg into a z-score using Stouffer's Method. pred.perf.R2 - rho_avg squared. pred.perf.pval - p-value for rho_zscore. pred.perf.qval - Deprecated. Previously held q-values calculated from the distribution of p-values, but it was later deemed this analysis was inappropriate. Kept for reasons of compatibility with PrediXcan and S-PrediXcan software. ### Construction Table This table contains info solely meant for reproducibility purposes. chromosome - Chromosome being analyzed cv_seed - seed that was set at start of script for randomly splitting samples into different folds ### Sample Info Table Basic info on samples used for training. Contains a single row: n_samples - Number of samples population - The population studied (MESA in these .db files) tissue - The tissue from which RNA was sequenced (population in these .db files) The the *v2.db files are filtered by rho_avg> 0.1, pred.perf.pval <0.05, and only contain coding genes. ## Citations BIORXIV/2018/245761 Genetic architecture of gene expression traits across diverse populations Lauren S. Mogil, Angela Andaleon, Alexa Badalamenti, Scott P. Dickinson, Xiuqing Guo, Jerome I. Rotter, W. Craig Johnson, Hae Kyung Im, Yongmei Liu, and Heather E. Wheeler. https://doi.org/10.1101/245761