This file describes the columns of MAF files generated by maf2maf or vcf2maf,
and how to interpret them, especially in the context of cancer genetics.

The first 34 columns are NCI's standard TCGA MAF format, and are described here:
https://wiki.nci.nih.gov/x/eJaPAQ

The subsequent 11 columns are relevant to most analyses.

35. HGVSc - the coding sequence of the variant in HGVS recommended format
36. HGVSp - the protein sequence of the variant in HGVS recommended format
37. HGVSp_Short - Same as HGVSp, but using 1-letter amino-acid codes
38. Transcript_ID - transcript onto which the consequence of the variant has been mapped
39. Exon_Number - the exon number (out of total number)
40. t_depth - read depth across this locus in tumor BAM
41. t_ref_count - read depth supporting the reference allele in tumor BAM
42. t_alt_count - read depth supporting the variant allele in tumor BAM
43. n_depth - read depth across this locus in normal BAM
44. n_ref_count - read depth supporting the reference allele in normal BAM
45. n_alt_count - read depth supporting the variant allele in normal BAM
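
As an illustration of how the depth and count columns above can be combined, here is a minimal
Python sketch that computes the fraction of reads supporting the variant allele. The function
name allele_fraction and the sample values are hypothetical, and using t_depth/n_depth as the
denominator (rather than ref + alt counts) is an assumption of this sketch.

```python
def allele_fraction(alt_count, depth):
    """Return alt_count / depth as a float, or None if a value is missing or depth is zero."""
    try:
        alt = int(alt_count)
        total = int(depth)
    except (TypeError, ValueError):
        return None  # blank, ".", or otherwise non-numeric field
    if total <= 0:
        return None
    return alt / total


# Hypothetical row, with MAF fields read as strings
row = {"t_depth": "80", "t_ref_count": "60", "t_alt_count": "20",
       "n_depth": "50", "n_ref_count": "50", "n_alt_count": "0"}

tumor_vaf = allele_fraction(row["t_alt_count"], row["t_depth"])
normal_vaf = allele_fraction(row["n_alt_count"], row["n_depth"])
```

Returning None for missing or zero depth keeps such rows distinguishable from a true fraction of 0.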

The next column is relevant to analyses that consider the effect of the variant on all alternate
isoforms of the gene, or on non-coding/regulatory transcripts. The effects are sorted first by
transcript biotype priority, then by effect severity, and finally by decreasing order of transcript
length. Each effect in the list is in the format [SYMBOL,Consequence,HGVSp,Transcript_ID,RefSeq].

46. all_effects - a semicolon delimited list of all possible variant effects, sorted by priority
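
The all_effects format described above could be unpacked along these lines. This is only a
sketch: parse_all_effects is a hypothetical helper, the example values are placeholders, and it
assumes that no individual field contains a comma. Entries that break that assumption are kept
as raw text rather than guessed at.

```python
EFFECT_FIELDS = ("SYMBOL", "Consequence", "HGVSp", "Transcript_ID", "RefSeq")


def parse_all_effects(all_effects):
    """Split an all_effects string into a list of dicts, preserving the original order."""
    effects = []
    for entry in all_effects.split(";"):
        if not entry:
            continue  # skip empty pieces, e.g. after a trailing semicolon
        parts = entry.split(",")
        if len(parts) != len(EFFECT_FIELDS):
            # Assumption violated (some field contained a comma): keep the raw text
            effects.append({"raw": entry})
            continue
        effects.append(dict(zip(EFFECT_FIELDS, parts)))
    return effects


# Placeholder values, not real identifiers
example = ("GENE1,missense_variant,p.Ala10Val,TRANSCRIPT_A,REFSEQ_A;"
           "GENE1,intron_variant,,TRANSCRIPT_B,;")
effects = parse_all_effects(example)
```

Because the list is already sorted by priority, effects[0] is the highest-priority effect.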

All remaining columns are straight out of Ensembl's VEP annotator, as described here:
http://useast.ensembl.org/info/docs/tools/vep/vep_formats.html#output

47. Allele - the variant allele used to calculate the consequence
48. Gene - stable Ensembl ID of affected gene
49. Feature - stable Ensembl ID of feature
50. Feature_type - type of feature. Currently one of Transcript, RegulatoryFeature, MotifFeature
51. Consequence - consequence type of this variation; comma-delimited if more than one
52. cDNA_position - relative position of base pair in cDNA sequence
53. CDS_position - relative position of base pair in coding sequence
54. Protein_position - relative position of amino acid in protein
55. Amino_acids - only given if the variation affects the protein-coding sequence
56. Codons - the alternative codons with the variant base in upper case
57. Existing_variation - known identifier of existing variation
58. ALLELE_NUM - allele number from input; 0 is reference, 1 is first alternate, etc.
59. DISTANCE - shortest distance from variant to transcript
60. STRAND_VEP - the DNA strand (1 or -1) on which the transcript/feature lies
61. SYMBOL - the gene symbol
62. SYMBOL_SOURCE - the source of the gene symbol
63. HGNC_ID - gene identifier from the HUGO Gene Nomenclature Committee
64. BIOTYPE - biotype of transcript
65. CANONICAL - a flag indicating that the VEP-based canonical transcript was used for this gene
66. CCDS - the CCDS identifier for this transcript, where applicable
67. ENSP - the Ensembl protein identifier of the affected transcript
68. SWISSPROT - UniProtKB/Swiss-Prot accession
69. TREMBL - UniProtKB/TrEMBL identifier of protein product
70. UNIPARC - UniParc identifier of protein product
71. RefSeq - RefSeq identifier for this transcript
72. SIFT - the SIFT prediction and/or score, with both given as prediction (score)
73. PolyPhen - the PolyPhen prediction and/or score
74. EXON - the exon number (out of total number)
75. INTRON - the intron number (out of total number)
76. DOMAINS - the source and identifier of any overlapping protein domains
77. AF - Non-reference allele and frequency of existing variant in 1000 Genomes
78. AFR_AF - Non-reference allele and frequency of existing variant in 1000 Genomes combined African population
79. AMR_AF - Non-reference allele and frequency of existing variant in 1000 Genomes combined American population
80. ASN_AF - Non-reference allele and frequency of existing variant in 1000 Genomes combined Asian population
81. EAS_AF - Non-reference allele and frequency of existing variant in 1000 Genomes combined East Asian population
82. EUR_AF - Non-reference allele and frequency of existing variant in 1000 Genomes combined European population
83. SAS_AF - Non-reference allele and frequency of existing variant in 1000 Genomes combined South Asian population
84. AA_AF - Non-reference allele and frequency of existing variant in NHLBI-ESP African American population
85. EA_AF - Non-reference allele and frequency of existing variant in NHLBI-ESP European American population
86. CLIN_SIG - clinical significance of variant from dbSNP
87. SOMATIC - somatic status of each ID reported under Existing_variation
88. PUBMED - pubmed ID(s) of publications that cite existing variant
89. MOTIF_NAME - the source and identifier of a transcription factor binding profile aligned at this position
90. MOTIF_POS - the relative position of the variation in the aligned TFBP
91. HIGH_INF_POS - a flag indicating if the variant falls in a high information position of a transcription factor binding profile (TFBP)
92. MOTIF_SCORE_CHANGE - the difference in motif score of the reference and variant sequences for the TFBP
93. IMPACT - the impact modifier for the consequence type
94. PICK - indicates if this block of consequence data was picked by VEP's pick feature
95. VARIANT_CLASS - Sequence Ontology variant class
96. TSL - Transcript support level
97. HGVS_OFFSET - Indicates by how many bases the HGVS notations for this variant have been shifted
98. PHENO - Indicates if existing variant is associated with a phenotype, disease or trait
99. MINIMISED - Alleles in this variant have been converted to minimal representation before consequence calculation
100. ExAC_AF - Global Allele Frequency from ExAC
101. ExAC_AF_AFR - African/African American Allele Frequency from ExAC
102. ExAC_AF_AMR - American Allele Frequency from ExAC
103. ExAC_AF_EAS - East Asian Allele Frequency from ExAC
104. ExAC_AF_FIN - Finnish Allele Frequency from ExAC
105. ExAC_AF_NFE - Non-Finnish European Allele Frequency from ExAC
106. ExAC_AF_OTH - Other Allele Frequency from ExAC
107. ExAC_AF_SAS - South Asian Allele Frequency from ExAC
108. GENE_PHENO - Indicates if gene that the variant maps to is associated with a phenotype, disease or trait
109. FILTER - Copied from input MAF/VCF, with ExAC-based common_variant tag added, as explained below
110. flanking_bps - The reference allele per VCF specs, and its 2 flanking base pairs
111. variant_id - The ID from an input VCF, or the variant_id from an input MAF
112. variant_qual - The QUAL from an input VCF, or the variant_qual from an input MAF
113. ExAC_AF_Adj - Global Adjusted Allele frequency from ExAC
114. ExAC_AC_AN_Adj - Global Adjusted Allele Count and Number from ExAC
115. ExAC_AC_AN - Global Allele Count and Number from ExAC
116. ExAC_AC_AN_AFR - African/African American Allele Count and Number from ExAC
117. ExAC_AC_AN_AMR - American Allele Count and Number from ExAC
118. ExAC_AC_AN_EAS - East Asian Allele Count and Number from ExAC
119. ExAC_AC_AN_FIN - Finnish Allele Count and Number from ExAC
120. ExAC_AC_AN_NFE - Non-Finnish European Allele Count and Number from ExAC
121. ExAC_AC_AN_OTH - Other Allele Count and Number from ExAC
122. ExAC_AC_AN_SAS - South Asian Allele Count and Number from ExAC
123. ExAC_FILTER - FILTER tags retrieved from ExAC VCF; PASS means ExAC thinks it's germline
124. gnomAD_AF - Frequency of existing variant in gnomAD exomes combined population
125. gnomAD_AFR_AF - Frequency of existing variant in gnomAD exomes African/African American population
126. gnomAD_AMR_AF - Frequency of existing variant in gnomAD exomes American population
127. gnomAD_ASJ_AF - Frequency of existing variant in gnomAD exomes Ashkenazi Jewish population
128. gnomAD_EAS_AF - Frequency of existing variant in gnomAD exomes East Asian population
129. gnomAD_FIN_AF - Frequency of existing variant in gnomAD exomes Finnish population
130. gnomAD_NFE_AF - Frequency of existing variant in gnomAD exomes Non-Finnish European population
131. gnomAD_OTH_AF - Frequency of existing variant in gnomAD exomes other combined populations
132. gnomAD_SAS_AF - Frequency of existing variant in gnomAD exomes South Asian population
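
The SIFT and PolyPhen columns above hold a prediction, a score, or both as prediction (score).
A minimal Python sketch for separating the two parts follows. parse_prediction is a
hypothetical helper, and it assumes that a score-only value is written as a bare number.

```python
import re

# Optional prediction text, optionally followed by a parenthesized score
_PRED = re.compile(r"^(?P<pred>[^()]*?)\s*(?:\((?P<score>[^()]*)\))?$")


def parse_prediction(value):
    """Return (prediction, score); either part is None when absent or unparseable."""
    text = (value or "").strip()
    match = _PRED.match(text)
    if match is None:
        return (None, None)
    prediction = match.group("pred") or None
    score_text = match.group("score")
    if score_text is None and prediction is not None:
        # No parentheses: decide whether the value is a bare score or a bare prediction
        try:
            return (None, float(prediction))
        except ValueError:
            return (prediction, None)
    try:
        score = float(score_text) if score_text is not None else None
    except ValueError:
        score = None
    return (prediction, score)


sift = parse_prediction("deleterious(0.01)")
polyphen = parse_prediction("probably_damaging (0.998)")
```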

To distinguish driver mutations from passenger mutations, the most relevant columns are:

51. Consequence - consequence of this variant (http://useast.ensembl.org/info/genome/variation/predicted_data.html#consequences).
      This may contain multiple terms that are comma-separated. For example, a synonymous mutation might be close enough to an intron
      to alter splicing. VEP will report both "synonymous_variant" and "splice_region_variant" in this column, for such variants.
93. IMPACT - the severity of the consequence (http://useast.ensembl.org/Help/Glossary?id=535). "HIGH" means severe effect on gene.
86. CLIN_SIG - clinical significance of variant per ClinVar (http://www.ncbi.nlm.nih.gov/clinvar/docs/clinsig/). Find important
      variants by looking for tags like pathogenic, likely_pathogenic, or drug_response.
88. PUBMED - pubmed ID(s) of publications that cite existing variant. Some of these publications are unreliable, but a few are gold mines.
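
A rough Python sketch of how the four columns above might be used to flag variants for manual
review. The helper names, the example row, and the choice of delimiters for multi-valued fields
(comma, "&" or "/") are assumptions of this sketch, not something vcf2maf prescribes.

```python
import re

NOTABLE_CLIN_SIG = {"pathogenic", "likely_pathogenic", "drug_response"}


def split_terms(value):
    """Split a multi-valued field into a list of non-empty, stripped terms."""
    return [term.strip() for term in re.split(r"[,&/]", value or "") if term.strip()]


def review_reasons(row):
    """Return a list of human-readable reasons why a variant may deserve a closer look."""
    reasons = []
    if row.get("IMPACT") == "HIGH":
        reasons.append("IMPACT is HIGH")
    tags = sorted(set(split_terms(row.get("CLIN_SIG"))) & NOTABLE_CLIN_SIG)
    if tags:
        reasons.append("CLIN_SIG: " + ",".join(tags))
    if row.get("PUBMED"):
        reasons.append("has PUBMED citations")
    return reasons


# Hypothetical row: a synonymous variant that may also alter splicing
row = {"Consequence": "synonymous_variant,splice_region_variant",
       "IMPACT": "LOW", "CLIN_SIG": "likely_pathogenic", "PUBMED": ""}

consequences = split_terms(row["Consequence"])
reasons = review_reasons(row)
```

Checking membership in the split Consequence list, rather than comparing the whole string,
avoids missing variants that carry more than one consequence term.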

These are some other columns to help shortlist variants worth looking into:

57. Existing_variation - known identifier of existing variation. If the variant was seen in some
      other somatic/germline DB, its ID will be listed here.
72. SIFT - the SIFT prediction and/or score, with both given as prediction (score).
73. PolyPhen - the PolyPhen prediction and/or score.
109. FILTER - False-positive filtering status, copied from the input MAF/VCF. An additional filter
      named common_variant is appended if the allele count in at least one ExAC subpopulation
      is >10 (this default cutoff can be changed when running vcf2maf). If you're handling
      somatic variants, the common_variant tag means the call is likely a false-positive: a
      legit somatic variant is less likely at a site that ExAC classifies as germline or artifact.
123. ExAC_FILTER - FILTER tags copied from the ExAC VCF. Differentiates between what ExAC classifies
      as germline (tagged as "PASS") or artifact (one or more tags, but not "PASS").
113. ExAC_AF_Adj - Global allele frequency across the ExAC population, adjusted for samples where
      this position could be genotyped at high quality. If you're handling germline variants, then
      this tells you how common or rare the variant is.
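
The FILTER and ExAC_FILTER logic described above might be applied as in the following Python
sketch. The helper names and example rows are hypothetical, and splitting tags on ";" or ","
is an assumption about how multiple tags are joined.

```python
import re


def filter_tags(value):
    """Split a FILTER-style value into a set of tags, ignoring blanks and "."."""
    return {tag.strip() for tag in re.split(r"[;,]", value or "")
            if tag.strip() not in ("", ".")}


def is_common_variant(row):
    """True if the common_variant tag was appended to FILTER."""
    return "common_variant" in filter_tags(row.get("FILTER"))


def exac_classification(row):
    """Return "germline", "artifact", or None when ExAC_FILTER is empty."""
    tags = filter_tags(row.get("ExAC_FILTER"))
    if not tags:
        return None
    return "germline" if "PASS" in tags else "artifact"


# Hypothetical rows; tag names other than PASS and common_variant are placeholders
rows = [
    {"FILTER": "PASS", "ExAC_FILTER": ""},
    {"FILTER": "common_variant", "ExAC_FILTER": "PASS"},
    {"FILTER": "some_filter;common_variant", "ExAC_FILTER": "SOME_EXAC_TAG"},
]

# For somatic analyses, drop calls tagged as common variants
somatic_shortlist = [row for row in rows if not is_common_variant(row)]
```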
