RefSeq_Protein_Variant_Database_Readme.txt

Author list.
Glendon J. Parker1,2*¦, Tami Leppert2,3, Deon S. Anex4, Jonathan K. Hilmer5, Nori Matsunami3, Lisa Baird3, Jeffery Stevens3, Krishna Parsawar6, Blythe P. Durbin-Johnson7, David M. Rocke7, Chad Nelson6, Daniel J. Fairbanks1, Andrew S. Wilson8, Robert H. Rice9, Scott R. Woodward10, Brian Bothner5, Bradley R. Hart4, and Mark Leppert3

1 Department of Biology, Utah Valley University, Orem, Utah, United States of America.
2 Protein-Based Identification Technologies L.L.C., Orem, Utah, United States of America.
3 Department of Human Genetics, University of Utah, Salt Lake City, Utah, United States of America.
4 Forensic Science Center, Lawrence Livermore National Laboratory, Livermore, California, United States of America.
5 Department of Chemistry and Biochemistry, Montana State University, Bozeman, Montana, United States of America.
6 Mass Spectrometry and Proteomics Core Facility, University of Utah, Salt Lake City, Utah, United States of America.
7 Department of Public Health Sciences, University of California, Davis, California, United States of America.
8 School of Archaeological Sciences, University of Bradford, United Kingdom.
9 Department of Environmental Toxicology, University of California, Davis, California, United States of America.
10 Sorenson Molecular Genealogical Foundation, Salt Lake City, Utah, United States of America.
¦ Current Address: Forensic Science Center, Lawrence Livermore National Laboratory, Livermore, California, United States of America. Phone: 925-423-2318

Introduction.
The RefSeq Protein Variant Database is a unique protein sequence database, developed for the express purpose of defining variant peptides that can be detected and used to identify individuals. The database is in Mascot-compatible FASTA format and can be used with proteomic mass spectrometry search tools such as X!Tandem, SEQUEST, PEAKS, and Mascot.
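For users who want to inspect the file outside a search engine, a minimal Python sketch of a FASTA reader is given here. It relies only on the general FASTA convention (a header line starting with ">" followed by one or more sequence lines). The exact header layout of this database is not described in this readme, so the function name `read_fasta` and the sample records are illustrative assumptions.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without the leading '>') to its sequence."""
    records: Dict[str, str] = {}
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)  # close the previous record
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # a sequence may wrap over several lines
    if header is not None:
        records[header] = "".join(chunks)  # close the last record
    return records


# Made-up records, used only to show the call (not taken from the database).
sample = [
    ">entry_A hypothetical description",
    "MKTAYIAK",
    "QRQISFVK",
    ">entry_B hypothetical description",
    "MKSAYIAK",
]
records = read_fasta(sample)
print(len(records), "sequences read")
```

To read the database itself, pass an open file handle instead of the sample list, for example `with open("RefSeq_Protein_Variant_Database.txt") as handle: records = read_fasta(handle)`, using the file name given in the License section.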
Creation of RefSeq Protein Variant Database(1).
The RefSeq protein database was used as the starting point for the PBIT protein reference database. The RefSeq protein sequence file human.protein.gpff.gz (NCBI ftp site ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/) contains all known amino acid (aa) variant information, but is not in a format readily usable as a database for mass spectrometry search engines. The file snp137Common.txt.gz, which contains all of the common variants with a minor allele frequency (MAF) >= 1%, was downloaded from UCSC (http://genome.ucsc.edu/cgi-bin/hgTables; Human, assembly: Feb. 2009 (GRCh37/hg19)).

The human.protein.gpff file contains reference sequences, but not necessarily unique sequences. First, 4,817 duplicated sequences were removed from the database. Then, for each sequence, the list of variants was gathered from two sources: the snp137Common.txt.gz file and the ESP6500 database, which contains SNPs, INDELs, and coverage data for the ESP6500 exomes (chromosomes 1-22, X, and Y). The file ESP6500SI-V2-SSA137.dbSNP138-rsIDs.snps_indels.txt.tar.gz comes from the NHLBI Exome Variant Server site http://evs.gs.washington.edu/EVS/. The ESP6500 database contains data from 6,503 samples from European American (EA) and African American (AA) individuals, contributed by various collaborators. The ESP6500 database, with 3.47 million variants, was filtered to pull all variants with either EA or AA MAF >= 0.5%.

All unique variants from these two sources were then used to create the variant sequences used in the PBIT database. Each reference sequence was duplicated once, and the copy was labeled the same as the reference sequence except for the string ".v1" appended to the NM number. The position of the variants in the sequence and their proximity to one another were not a factor.
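The frequency filter and the construction of a ".v1" entry can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' pipeline. The record layouts, the field and function names, and the sample values are hypothetical. MAF values are assumed to be percentages, positions are assumed to be 1-based, and all usable variants are assumed to be applied to the single copy. Only single amino acid substitutions are handled. The handling of shared positions and of stop variants follows the rules stated in the text that follows.

```python
from typing import Iterable, NamedTuple, Optional, Tuple


class EspRecord(NamedTuple):
    rsid: str      # hypothetical field names
    ea_maf: float  # European American MAF, in percent (assumed unit)
    aa_maf: float  # African American MAF, in percent (assumed unit)


class AaVariant(NamedTuple):
    position: int  # 1-based residue position (assumed convention)
    alt: str       # replacement residue; "*" is a stop, "-" a deletion


def passes_maf(record: EspRecord, threshold_percent: float = 0.5) -> bool:
    """Keep a variant if either population reaches the MAF threshold."""
    return (record.ea_maf >= threshold_percent
            or record.aa_maf >= threshold_percent)


def build_variant_entry(
    name: str, sequence: str, variants: Iterable[AaVariant]
) -> Optional[Tuple[str, str]]:
    """Return (name + '.v1', variant sequence), or None if nothing applies."""
    residues = list(sequence)
    used_positions = set()
    for variant in variants:
        if "*" in variant.alt or variant.alt == "-":
            continue  # stop variants and deletions are not used
        if len(variant.alt) != 1:
            continue  # this sketch covers single substitutions only
        if not 1 <= variant.position <= len(residues):
            continue  # ignore positions outside the sequence
        if variant.position in used_positions:
            continue  # the first usable variant listed at a position wins
        residues[variant.position - 1] = variant.alt
        used_positions.add(variant.position)
    if not used_positions:
        return None  # no variant sequence is written for this entry
    return name + ".v1", "".join(residues)


# Made-up example data.
kept = [r.rsid for r in (EspRecord("rs_example_1", 0.2, 0.7),
                         EspRecord("rs_example_2", 0.1, 0.4)) if passes_maf(r)]
entry = build_variant_entry(
    "NM_000000",  # hypothetical accession
    "MKTAYIAK",
    [AaVariant(3, "S"), AaVariant(3, "A"), AaVariant(5, "*"), AaVariant(8, "R")],
)
print(kept, entry)
```

In this sketch a stop variant listed first at a position does not block a later substitution at the same position. The readme does not say how that case was handled, so this is a design choice of the example.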
If, however, two or more variants occurred at the same position, the first variant in the list at that position was used. Stop variants were not used. The final PBIT database contains, for each protein sequence, a reference sequence and, if one or more variants exist in the sequence, a variant sequence. There are 34,383 NP_ loci, 1,833 XP_ loci, and 13 YP_ loci in human.protein.gpff, for a total of 36,229 unique locus names for Homo sapiens. The NM numbers are identifiers that differentiate between multiple assignments to the same gene, and are used in the PBIT database as a way of identifying sequences. Large proteins presumably not involved in hair were removed from the file to reduce run time (gene names: TTN, MUC16, OBSCN, NEB, MUC19, AHNAK, AHNAK2, MUC5B, MUC4, FCGBP, MUC12, LOC100289142, USH2A, MUC2, SSPO, HYDIN, RYR1). The database was written in FASTA format.

Database file summary.
A characterization of the database includes: 37% of proteins do not have variants with a frequency > 0.5%, 91% of proteins have 0 or 1 variants > 0.5%, and 99% of proteins have 0, 1, or 2 variants > 0.5%. There are 36,229 protein sequences, each represented by a unique NM number; 350 protein sequences have at least one peptide with three variants, and 169 protein sequences have at least one peptide with four or more variants.

The number of variants in the ESP6500 database with a MAF > 0.5% was 106,000. The unique list of these variants contains 67,250 entries. Of these, 31,230 were not identified in the snp137Common.txt file. There are 13,585,949 rs numbers in the snp137Common.txt file and 31,230 new ESP6500 rs numbers with a frequency > 0.5%, so the total number of rs numbers to use is 13,617,179. The human.protein.gpff file, which contains 36,229 genes and 732,776 variants, was compared to each of the 13,617,179 variants in the combined ESP6500 and snp137Common file. Some of these variants are not necessarily associated with genes.
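The bookkeeping in this summary can be checked with a short Python sketch. The totals are taken from the text above. The set-based helper `count_combined` and its toy rs identifiers are illustrative assumptions, not the authors' code.

```python
from typing import Iterable, Tuple


def count_combined(common_ids: Iterable[str],
                   esp_ids: Iterable[str]) -> Tuple[int, int]:
    """Return (number of new ESP ids, size of the combined unique list)."""
    common = set(common_ids)
    new_ids = set(esp_ids) - common  # ESP ids absent from the common file
    return len(new_ids), len(common) + len(new_ids)


# Totals quoted in the readme.
COMMON_RS = 13_585_949   # rs numbers in snp137Common.txt
NEW_ESP_RS = 31_230      # ESP rs numbers not found in snp137Common.txt
TOTAL_RS = COMMON_RS + NEW_ESP_RS
LOCUS_TOTAL = 34_383 + 1_833 + 13  # NP_ + XP_ + YP_ loci

# Toy identifiers, used only to show the set logic.
toy_new, toy_total = count_combined({"rs1", "rs2", "rs3"}, {"rs3", "rs4"})

print(TOTAL_RS, LOCUS_TOTAL, toy_new, toy_total)
```

With the quoted figures the sums come to 13,617,179 rs numbers and 36,229 locus names, matching the totals stated in the text.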
The human.protein.gpff file matched 80,598 variants with the variants in the combined snp137Common/ESP6500 file. There were 9,491 genes with no variants. There were 19,614 genes with a maximum of one variant in any peptide. There were 5,772 genes with one or more peptides with two variants. There were 996 genes with one or more peptides with three variants. This left 356 genes with one or more peptides with four or more variants per peptide.

Although there are 80,598 unique variants, the database file contains a total of 127,099 variants because of isoforms. Of these 127,099 variants, 1,518 are simple stop codons ("*"); 47 more are not simple stops, e.g., "y*w". Six variants are replaced by a "-", which is a deletion; these were not included in the variant database. A total of 126,477 variants have a single amino acid replacement, and 622 have multiple amino acid insertions. The new variant database contains 53,476 sequences and the reference database contains 36,229 sequences. Of the 7,124 genes with two or more variants in a peptide, there were 1,097 variants that share their position with only one other variant, 37 that share it with two other variants, and three that share it with three other variants.

License.
The RefSeq_Protein_Variant_Database.txt file is released under a Creative Commons Attribution-NonCommercial-NoDerivs license.

Disclaimer.
The Lawrence Livermore National Laboratory, Office of Scientific and Technical Information, Information Management (IM) number is: LLNL-MI-696826. This document was prepared as an account of work sponsored by an agency of the United States government.
Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.

1. Parker GJ, Anex DS, Leppert T, Hilmer JK, Matsunami N, Baird L, et al. Demonstration of Protein-Based Human Identification Using the Hair Shaft Proteome. PLoS ONE. 2016;PONE-D-15-37076.