Program : Prot_evol
Author  : Ugo Bastolla Centro de Biologia Molecular Severo Ochoa 
          (CSIC-UAM)
e-mail  : ubastolla@cbm.csic.es

==================================
GENERAL DESCRIPTION
==================================

The program Prot_Evol performs two kinds of computation.

(1) It computes the mean-field site-specific amino acid distributions 
    that have minimal differences with respect to the background 
    distribution and that constraint the average stability of the native
    state of the protein against both unfolding and misfolding.
    
    The program also computes an exchangeability matrix derived from an 
    empirical substitution model or from a mutation model that can be
    used together with the site-specific distributions for applications in
    phylogenetic inference.
    
    Citation: Arenas, Sanchez-Cobos and Bastolla, Maximum likelihood 
    phylogenetic inference with selection on protein folding stability, 
    preprint.
    
(2) It simulates protein evolution subject to the constraint of 
    selection on the folding stability of the native state of the 
    protein against both unfolding and misfolding. It implements three 
    selection models:
    
        (a) Neutral;
        (b) Based on the fixation probability of the Moran process, 
            which depends on the difference of logarithmis fitness and 
            on effective population size;
        (c) Based on the mean-field stationary distributions computed at
            the previous point.

==================================
COMPILE AND RUN:
==================================

To compile and run the program execute the followin commands:

    unzip Prot_evol.zip
    make
    ./Prot_evol # Without parameter file, it will print an help screen
    ./Prot_evol Input_Prot_evol.in

The attached parameter file contains the input parameters that can be 
changed by modifying the file.

IMPORTANT: 
 - Modify the line PDB=... with the local path to your target PDB file
 - Modify the line FILE_STR=... with the local path to the file 
   structures.in, which is included in the package.

==================================
DETAILED DESCRIPTION
==================================

    The commands mentioned in this section can be modified by editing 
the attached input file Input_Prot_evol.in

A) MEAN_FIELD model with independent sites.
    
    This computation is performed if MEANFIELD=1. 
    
    L independent mean-field amino acid distributions (P^MF,i)_a are 
computed by minimizing the Kullback-Leibler divergence with respect to 
the background distribution (P^mut)_a for fixed value of the average 
stability, ave(Delta G). This constraint it imposed through a Lagrange 
multiplier Lambda, whose value is fixed by imposing that the sequences 
in the PDB plus the optional input protein family has maximum likelihood
with respect to the model.

    1) The BACKGROUND distribution may be computed in three different 
       ways.

        1a) As the amino acid distribution observed across all protein 
    sites (command GET_FREQ=2, 19 degrees of freedom). If one of the 
    amino acids is not present, this distribution is combined with the 
    one at point (1b).
     
        1b) Fitted from a codon based mutation model, whose parameters 
    are the nucleotide frequencies pi_A, pi_T, pi_G, pi_C and the 
    enhancement of the mutation rate at CpG dinucleotides. These 
    parameters are fitted imposing that the background distribution is 
    as similar as possible to the observed one (1a) (command GET_FREQ=1,
    4 degrees of freedom). 
     
        1c) The parameters of the mutation model can be input manually 
    from the input file (command GET_FREQ=0, 0 degrees of freedom).

    2) The program outputs the EXCHANGEABILITY matrix E_ab that allows
    computing the mutation rate between amino acids a and b at site i as
    (Q^i)_ab=E_ab*(P^MF,i)_b
    
        We impose that E_ab is symmetric, so that detailed balance is 
    fulfilled and the stationary distributions coincide with the 
    mean-field distributions. Four kinds of computations are 
    implemented. Three are based on an empirical substitution model that
    can be chosen with the commend MATRIX=... (for the moment, available
    options are WAG and JTT).

        2a) EXCHANGE=MUT The exchangeability matrix is computed from the
    same mutation process that generates the background distribution.
        
        2b) EXCHANGE=EXCH The exchangeability matrix is equal to the 
    empirical one. The two former choices yield poor results.
    
        2c) EXCHANGE=RATE The exchangeability matrix is computed 
    imposing that the average rate of amino acid substitutions across 
    all sites is most similar to the one generated by the empirical 
    rates, which is Q_ab=(E^emp)_ab*(f^emp)_b
        
        2d) EXCHANGE=FLUX (default) The exchangeability matrix is 
    computed imposing that the average flux of amino acids across all 
    sites is equal to the one generated by the empirical rates, which is 
    F_ab=(E^emp)_ab*(f^emp)_a*(f^emp)_b.

B) Simulations of protein evolution with selection on folding stability.
    
    If only the mean-field distributions are of interest, this 
computation can be switched off with the command TMAX=0.

    At each time step, one mutated sequence is generated, its folding 
stability against unfolding and misfolding DG_mut is estimated with the 
model of Minning, Porto and Bastolla (Proteins 2013, 81:1102-112), and 
the mutation is fixed or rejected with one of three possible SELECTION 
models:

    3a) Neutral: fixation if DG_mut < DG_thr, rejected otherwise.
    
    DG_thr= 0.95*DG_PDB
    Commands NEUTRAL=1 MEANFIELD=0

    3b) Fixation probability of the Moran model with very low mutation 
    rate, P_fix=(1-f_wt/f_mut)/[1-(f_wt/f_mut)^N], where f_wt and f_mut 
    are the wild type and mutant fitness, respectively, 
    f=exp(-DG/T)/[1+exp(-DG/T)], and N is the effective population size,
    which is input with the command NPOP=... (default is 100).
    
    Commands NEUTRAL=0 MEANFIELD=0

    3c) Mean-field model: We impose that the stationary distribution at 
    each site coincides with the mean-field distribution (P^MF,i)_a.
        
        Since the mutation process is assumed to be symmetric, it is 
    sufficient that the acceptance probability follows the Metropolis 
    criterion:
        P_fix=1 if (P^MF,i)_a_mut > (P^MF,i)_a_wt,
        P_fix= (P^MF,i)_a_mut / (P^MF,i)_a_wt otherwise.
        
    CommandS MEANFIELD=1

==================================
INPUT FILE
==================================
    
    A detailled documentation of the input parameters needed will be found on 
Input_Prot_evol.in file included in this package.

==================================
OUTPUT FILES:
==================================
   
   There are several output files that can be classified in two 
different groups, depending on the meaning of their content.
    

A) Mean-field model files

    a1) <PDB>_<parameters>_profiles.txt 
    Mean-field amino acid distributions and entropy (one site per line)

    a2) <PDB>_<parameters>_AA_profile_global.txt
    Average amino acid distribution across all sites
    
    a3) <PDB>_<parameters>_exchangeability_<EXCH_MODEL>.txt
    Exchangeability matrix (valid for all sites)
    
    a4) <PDB>_<parameters>_summary.dat
    Performances of the mean-field model evaluated with the PDB 
    sequence. This is a tabular file with the following fields:
    
    Protein length; Fitted nucleotide frequencies; PDB code (last column);
    The following measures are given for various types of models: 
    fitted Lambda; KL divergence from background distribution;
    log-likelihood per site of the PDB sequence with respect to the model; 
    average hydrophobicity h;
    freezing temperature of the misfolded ensemble, Tf; 
    average free energy difference between native and misfolded state, DG; 
    	    Models:
	    mut: background distribution at all sites.
            nat: mean-field model constraining only the native energy.
            all: mean-field model constraining DG 
            wt:  sequence in the PDB.

    a5) <PDB>_<parameters>_likelihood.dat
    For each tested value of Lambda, DG, likelihood, hydrophobicity h,
    freezing temperature Tf and convergenge of the nat and all models
    are reported (the mut model coincides with Lambda=0). Conv=0 means 
    that the iterations did not converge.
    
    a6) <PDB>_<parameters>_rate_profile.dat 
    For each protein size we report:
    (1) the amino acid in the PDB,
    (2) the number of native contacts,
    (3) the substitution rate for the chosen exchangeability matrix,
    (4) the entropy of the amino acid distribution,
    (5) the hydrophobicity of the amino acid in the PDB and
    (6) the average hydrophobicity of the mean-field distribution.
    Pearson correlation coefficients between these variables are 
    reported at the end of the file.
    
B) Simulations of evolution files

    b1) Contact_statistics_<L>.dat
    Statistics of alternative contact matrices of length L
    
    b2) <PDB>_<parameters>_stab.dat
    For each amino acid substitution, it reports (1) the mutated site,
    (2) the mutated amino acid, (3) the native energy, (4) DG, (5) fitness, 
    (6) number of synonymous substitutions since the previous aa substitution
    (7) number of attempted mutations.
    This file can be used to reconstruct the protein sequences generated
    by the simulated evolution.

    b3) <PDB>_<parameters>_ave.dat
    Every it_print (internal parameter) substitutions, it prints average
    and standard error of: (1,2) fitness, (3,4) native energy, 
    (5,6) DeltaG, (7) difference between amino acid entropy of mutation and 
    selection, (8) non_synonymos/synonymous subst. rate, (9) acceptance rate.

    b4) <PDB>_<parameters>_final.dat
    Same information as in stab file, but printed at the end of the 
    simulation.













