Enhanced Protein Isoform Characterization Through Long-Read Proteogenomics - Jurkat Samples and Reference Data

Miller, Rachel; Jordan, Ben; Mehlferber, Madison; Jeffery, Erin; Chatzipantsiou, Christina; Kaur, Simran; Millikin, Robert; Shortreed, Michael; Tiberi, Simone; Conesa, Ana; Smith, Lloyd; Deslattes Mays, Anne; Sheynkman, Gloria

doi:10.5281/zenodo.5703754

Published July 5, 2021 | Version v6

Journal article Open

1. University of Wisconsin - Madison
2. University of Virginia
3. Lifebit Biotech Ltd.
4. University of Zurich
5. University of Florida
6. Science and Technology Consulting LLC

The detection of physiologically relevant protein isoforms encoded by the human genome is critical to biomedicine. Mass spectrometry (MS)-based proteomics is the preeminent method for protein detection, but isoform-resolved proteomic analysis relies on accurate reference databases that match the sample; neither a subset nor a superset database is ideal. Long-read RNA sequencing (e.g., PacBio, Oxford Nanopore) provides full-length transcript sequencing, which can be used to predict full-length proteins. Here, we describe a long-read proteogenomics approach for integrating matched long-read RNA-seq and MS-based proteomics data to enhance isoform characterization. We introduce a classification scheme for protein isoforms, discover novel protein isoforms, and present the first protein inference algorithm for the direct incorporation of long-read transcriptome data in protein inference to enable detection of protein isoforms that are intractable to MS detection. We have released an open-source Nextflow pipeline that integrates long-read RNA sequencing in a proteomic workflow for isoform-resolved analysis.

Companion Repositories:

Companion Datasets

This repository contains the Jurkat samples and reference data.

Files (49.6 GB)

Name	MD5	Size
gencode.v35.annotation.canonical.gtf	3e7e167cf2a1756280a12e2c731613de	1.4 GB
Human_Hexamer.tsv	06386647ccb0e9942208a659ca761ee1	125.0 kB
Human_logitModel.RData.gz	d6bfd335a049ce7173ba7366dc0d48bc	3.1 MB
jurkat_classification.txt	55eb7d15f2b68b460a6b784b6baf9306	57.4 MB
jurkat_corrected.fasta.gz	423b8fdf5e45c857d12411ede6e008c0	71.2 MB
jurkat_corrected.gtf	19fce83d8361f6e0116500ddb723c3d0	188.9 MB
jurkat_gene_kallisto.tsv	0f3a0a1525ece57a15ad053674c88c1f	362.6 kB
jurkat_merged.ccs.bam	4358213de663b01e20c606d0b772c2aa	7.4 GB
jurkat_r1.fastq.gz	2d369ab988f06df9af7b1250e2751219	3.6 GB
jurkat_r2.fastq.gz	86657ef4692edee4029a59a196dcca36	3.8 GB
kallist_table_rdeplete_jurkat.tsv	0b6ec27c462889cb8854e129c3420441	1.4 MB
mass_spec.tar.gz	222d467d8ef8d532be30b29b25472740	5.9 GB
NEB_primers.fasta	1ef7d3d031b223776fca759f1e16df2e	70 Bytes
SpritzRunResultsNovember15.2021.tar.gz	1d6945c27e7207ff3a74039c7b92b3e2	2.2 GB
star_genome.tar.gz	7450045ac7d583dea6345143e0826a14	24.9 GB
Task1SearchTaskconfig_orf.toml	64aebe205d4ef6b1f33a50cd22ecbef9	2.5 kB
uniprot_reviewed_canonical_and_isoform.fasta.gz	0dbedc2a724f50b4a19b6a7a625f3c2b	9.2 MB
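
Because several of these files are multiple gigabytes, it may be worth checking each download against the MD5 checksum listed above before use. The following is a minimal sketch, not part of the record or its pipeline: the function names md5_of_file and verify_downloads are hypothetical, and the script assumes the files were saved under their listed names in the current directory. Only two of the smaller files are included in the example dictionary; the other checksums can be added from the listing in the same way.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(expected, directory="."):
    """Map each file name to True (match), False (mismatch), or None (missing).

    `expected` is a dict of file name -> expected hex MD5 (hypothetical helper).
    """
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = None
        else:
            results[name] = md5_of_file(path) == checksum.lower()
    return results


# Checksums copied from the file listing in this record (two small files shown).
EXPECTED_MD5 = {
    "NEB_primers.fasta": "1ef7d3d031b223776fca759f1e16df2e",
    "Human_Hexamer.tsv": "06386647ccb0e9942208a659ca761ee1",
}

if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "missing"}
    for name, ok in verify_downloads(EXPECTED_MD5).items():
        print(f"{name}: {labels[ok]}")
```

Reading in fixed-size chunks keeps memory use flat even for the largest archive in the listing; a file reported as MISMATCH is most likely a truncated or corrupted download and should be fetched again.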

	All versions	This version
Views	949	449
Downloads	1,477	375
Data volume	5.5 TB	1.0 TB
