MESA PredictDB Models - README
=================================

Lauren S. Mogil
2018-05-28

## Overview

This directory contains monocyte gene expression prediction models for 3 
populations from MESA (AFA, HIS, CAU) and 2 combined populations (AFHI and ALL) 
for use with PrediXcan and S-PrediXcan. These models were trained using MESA 
populations. Genotype data were imputed using Michigan Imputation Server and 
1000 genomes phase 3 v5 reference panel. For each model, there is also a corresponding 
file with SNP covariance data needed for S-PrediXcan.

## Methods

Genotype data - Data were obtained from dbGaP accession phs000209.v13.p3. SNPs were 
filtered by call rates less than 99%. One of a pair of related individuals (IBD >0.05) 
were removed. Final sample sizes for each population post quality control are AFA = 233, 
HIS =352, and CAU = 578. MESA populations were imputed using the Michigan Imputation Server and 
1000 genomes phase 3 v5 reference panel and Eagle v2.3. CAU used EUR and AFA and HIS used 
mixed population for QC during imputation. The results were filtered by R2 < 0.8, MAF < 0.01, 
and removed ambiguous SNPs. For analysis, genotypes were encoded on a continuous range from 
0 to 2, with the value denoting the estimated count of the second, or effect, allele. This 
leaves 9,352,383 SNPs for AFA, 7,201,805 SNPs for HIS, and 5,559,636 SNPs for CAU non-LD
pruned SNPs per population post quality controls.   
   

Expression data - Data were obtained from GEO accession GSE56045. Illumina IDs 
were converted to Ensembl IDs using the RefSeq IDs from MESA and gencode.v18 
(gtf and metadata files) to match Illumina IDs to Ensembl IDs. If there were 
multiple Illumina IDs corresponding to an Ensembl ID, the average of those 
values was used as the expression level.


Covariates - Before training models to predict gene expression, expression
values were adjusted for the following covariates: 3 genotypic principal 
components and 10 PEER factors. Expression was adjusted by performing a 
multivariate linear regression with all covariates, pulling the residual values, 
and then assigning the residuals to be the new expression values.

Annotations - Gene annotation was derived from gencode v18. SNP annotation 
was generated by running github.com/lmogil/MESA/06_make_files_predictDB.py.

Nested Cross Validated Elastic-Net - We used the following "nested" cross validation procedure available at 
https://github.com/hakyimlab/PredictDBPipeline:

1. Randomly split the data into 5 folds.

2. For each fold:

    a. Remove the fold from the data.

    b. Use the remaining data to train an elastic-net model using 10-fold
    cross-validation to tune the lambda parameter.

    c. With the trained model, predict on the hold out fold, and get various
    test statistics for how the model performs.

3. Calculate the average and standard deviation of each of the significance
statistics, where applicable. This should provide a reasonable estimate for how
well the model will generalize to new data.

4. Train a new elastic-net model using all of the data. Again, use 10-fold cross
validation to tune the lambda parameter. The non-zero weights from this final
model are what are saved in the database, provided the model meets significance
criteria.

A model was determined to be "significant" if the average pearson correlation
between predicted and observed during nested cross validation was greater than
0.1 (equivalent to R2 > 0.01) and the estimated p-value for this statistic was
less than 0.05. See below for how the p-value was calculated.

## Schema

### Weights Table

gene - Ensembl ID of the gene

rsid - rsid of the SNP from MESA data

varID - String label of the format chromosome_position_allele1_allele2_build.
All varIDs are from build 37 of the Human Reference Genome.

ref_allele - Allele which appears on Reference Genome

eff_allele - Alternative/Effect allele.

weight - The weight for this SNP in predicting expression for the gene. In
predicting the expression for the gene, the weight is multiplied by the count,
or estimated count, of the effect allele in the individual. This value is added
to all other weighted SNPs in the model.

### Model Summaries Table

This table contains summary statistics about each gene model for the tissue

gene - Ensembl ID of the gene

genename - HUGO symbol for the gene

gene_type - Type of the gene according to Gencode v18. We have only included
models for protein coding genes.

alpha - The alpha parameter used with the R package glmnet to train the elastic
net model. Here, we used 0.5 for every gene.

n_snps_in_window - The number of cis-SNPs examined for model training.  This is
the number of SNPs found within 1 million base pairs upstream of the
transcription start site and 1 million base pairs downstream of the
transcription end site. Sites are determined from Gencode v18.

n.snps.in.model - The number of SNPs within the cis window that have
non-zero weights, as found by elastic-net.

lambda_min_mse - The lambda parameter for elastic-net found to have the minimum
average mean square error on the hold out folds during cross validation.

test_R2_avg - The average coefficient of determination when predicting values of
the hold out fold during nested cross validation.  R2 is defined as

  1 - sum((y_observed-y_predicted)**2) / sum((y_observed-mean(y_observed)**2)

NB: It is possible for this value to be negative and still achieve a positive
correlation between predicted and observed.

test_R2_sd - The standard deviation of the coefficient of determination for each
of the five hold out folds during nested cross-validation.

cv_R2_avg - The average coefficient of determination for each of the hold out
folds when cross-validation was performed on the entire data set.

cv_R2_sd - The standard deviation of the coefficient of determination for each
of the 10 hold out folds when doing cross- validation on the entire data set.

in_sample_R2 - After an optimal lambda parameter has been chosen, cv.glmnet
trains an elastic net model with all of the data provided, i.e. without any hold
out folds. Expression is then predicted using this model, and the coefficient of
determination is calculated.

nested_cv_fisher_pval - Our first attempt at calculating a p-value during nested
cross-validation. We calculated a p-value using a correlation test using on each
of the five hold out folds, and combined the five p-values using Fisher's
method. Since some models yielded predictions with a variance of 0, we were
unable to perform a correlation test, and the p-value we reported for that fold
was a random number from the uniform distribution on the interval (0,1), which
would be the correct distribution under the assumption of the null hypothesis.
Under simulated null results, we still found this method did not produce a
uniform p-value distribution, with values under 0.05 and over 0.95 being
under-represented. We decided not to use this statistic because it did not quite
produce the desired results and the random element involved when there was prior
knowledge that a model did not predict well.

rho_avg - Average correlation between predicted and observed on the hold out
folds when doing nested cross-validation.

rho_se - Standard deviation of correlation between predicted and observed on the
hold out folds when doing nested cross-validation.

rho_zscore - Transformation of rho_avg into a z-score using Stouffer's Method.

pred.perf.R2 - rho_avg squared.

pred.perf.pval - p-value for rho_zscore.

pred.perf.qval - Deprecated. Previously held q-values calculated from the
distribution of p-values, but it was later deemed this analysis was
inappropriate. Kept for reasons of compatibility with PrediXcan and S-PrediXcan
software.

### Construction Table

This table contains info solely meant for reproducibility purposes.

chromosome - Chromosome being analyzed

cv_seed - seed that was set at start of script for randomly splitting samples
into different folds


### Sample Info Table

Basic info on samples used for training. Contains a single row:

n_samples - Number of samples

population - The population studied (MESA in these .db files)

tissue - The tissue from which RNA was sequenced (population in these .db files)


The the *v2.db files are filtered by rho_avg> 0.1, pred.perf.pval <0.05, and only contain coding genes. 
 

## Citations

BIORXIV/2018/245761
Genetic architecture of gene expression traits across diverse populations
Lauren S. Mogil, Angela Andaleon, Alexa Badalamenti, Scott P. Dickinson, Xiuqing 
Guo, Jerome I. Rotter, W. Craig Johnson, Hae Kyung Im, Yongmei Liu, and 
Heather E. Wheeler. 
https://doi.org/10.1101/245761