There is a newer version of the record available.

Published January 30, 2022 | Version v1
Dataset Open

Enhanced Protein Isoform Characterization Through Long-Read Proteogenomics - Workflow Results

  • 1. University of Wisconsin - Madison
  • 2. University of Virginia
  • 3. Lifebit Biotech Ltd.
  • 4. University of Zurich
  • 5. University of Florida
  • 6. Science and Technology Consulting LLC

Description

 

The detection of physiologically relevant protein isoforms encoded by the human genome is critical to biomedicine. Mass spectrometry (MS)-based proteomics is the preeminent method for protein detection, but isoform-resolved proteomic analysis relies on accurate reference databases that match the sample; neither a subset nor a superset database is ideal. Long-read RNA sequencing (e.g. PacBio, Oxford Nanopore) provides full-length transcript sequencing, which can be used to predict full-length proteins. Here, we describe a long-read proteogenomics approach for integrating matched long-read RNA-seq and MS-based proteomics data to enhance isoform characterization. We introduce a classification scheme for protein isoforms, discover novel protein isoforms, and present the first protein inference algorithm for the direct incorporation of long-read transcriptome data in protein inference to enable detection of protein isoforms that are intractable to MS detection. We have released an open-source Nextflow pipeline that integrates long-read sequencing in a proteomic workflow for isoform-resolved analysis.

Companion Repositories:

  1. Long-Read-Proteogenomics Workflow GitHub Repository Release
  2. Long-Read-Proteogenomics Analysis GitHub Repository Release

Companion Datasets

  1. Long-Read-Proteogenomics Workflow Sample and Reference Data
  2. TEST Data for Long-Read-Proteogenomics Workflow GitHub Actions

This Repository contains the complete output from the execution of the Long-Read-Proteogenomics Workflow, using the input from Jurkat Samples and Reference Data.   

The file jurkat.flnc.bam was 6.5 GB and had to be split into 13 separate files; it should be rejoined before use. Here are the steps that were used to split the file up.

1. Convert jurkat.flnc.bam (binary format) to sam file (text format) without header:  samtools view jurkat.flnc.bam > jurkat.flnc.sam

2. Capture the header: samtools view -H jurkat.flnc.bam > jurkat.flnc.header.sam

3. Split jurkat.flnc.sam into smaller files (aim to get final size under 2GB): split -l 400000 jurkat.flnc.sam jurkat.flnc.chunk.

4. Convert each of these files back to bam for uploading, running the command once per chunk: samtools view -b jurkat.flnc.chunk.a* -o jurkat.flnc.chunk.a*.bam (where * is replaced by each of a,b,c,d,e,f,g,h,i,j,k,l,m)
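The per-chunk conversion in step 4 can be sketched as a loop. This is a minimal sketch, not the commands the authors actually ran: the function names and the DRY_RUN switch are hypothetical, and the chunk names assume the default two-letter suffixes that split appends (aa through am). With 4,956,761 lines and 400,000 lines per chunk, 13 chunks are expected. By default the loop only prints the samtools commands.

```shell
#!/usr/bin/env bash
# DRY_RUN is a hypothetical switch for this sketch: 1 (default) prints the
# commands, 0 runs them (samtools must then be on PATH).
DRY_RUN="${DRY_RUN:-1}"

# Number of chunks for a given line count and chunk size (ceiling division).
expected_chunks() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Step 4, once per chunk: jurkat.flnc.chunk.aa ... jurkat.flnc.chunk.am
convert_chunks_to_bam() {
    local suffix chunk
    for suffix in a b c d e f g h i j k l m; do
        chunk="jurkat.flnc.chunk.a${suffix}"
        if [ "$DRY_RUN" = "1" ]; then
            echo "samtools view -b ${chunk} -o ${chunk}.bam"
        else
            samtools view -b "${chunk}" -o "${chunk}.bam"
        fi
    done
}

expected_chunks 4956761 400000
convert_chunks_to_bam
```

Setting DRY_RUN=0 in the environment runs the conversions instead of printing them.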

After downloading, reverse this process, including using the header file, which is found in the LRPG-Manuscript-Results-results-results-jurkat-isoseq3-companion-files.tar.gz file:

1. Convert the bam files back to sam files, running the command once per chunk: samtools view jurkat.flnc.chunk.a*.bam > jurkat.flnc.chunk.a*.sam (where * is replaced by each of a,b,c,d,e,f,g,h,i,j,k,l,m)

2. Combine the sam files (the header is added back in step 4): cat jurkat.flnc.chunk.a*sam > jurkat.flnc.sam (verified that the number of lines in the combined sam file is identical to the number of lines in the original without header: 4,956,761. The header file is 13 lines.)

3. Convert to bam files if desired: samtools view -b jurkat.flnc.sam -o jurkat.flnc.bam

4. Rehead with the header file: samtools reheader -P -i jurkat.flnc.header.sam jurkat.flnc.bam
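The concatenation and line-count check from step 2 can be sketched as below. This is a sketch under stated assumptions: the helper names chunk_sam_files and check_line_count are hypothetical, the chunk sam files are assumed to sit in the current directory, and the expected count of 4,956,761 is the figure quoted above. The concatenation only runs when the first chunk file is present.

```shell
#!/usr/bin/env bash
# Second suffix letters of the 13 chunks (aa, ab, ..., am).
SUFFIXES="a b c d e f g h i j k l m"

# Print the chunk sam file names in their original order.
chunk_sam_files() {
    local s
    for s in $SUFFIXES; do
        echo "jurkat.flnc.chunk.a${s}.sam"
    done
}

# Compare a file's line count with an expected value; non-zero exit on mismatch.
check_line_count() {
    local file="$1" expected="$2" actual
    actual=$(wc -l < "$file" | tr -d '[:space:]')
    if [ "$actual" -eq "$expected" ]; then
        echo "OK: $file has $actual lines"
    else
        echo "MISMATCH: $file has $actual lines, expected $expected" >&2
        return 1
    fi
}

# Only act when the chunk files are actually present in this directory.
if [ -f jurkat.flnc.chunk.aa.sam ]; then
    cat $(chunk_sam_files) > jurkat.flnc.sam
    check_line_count jurkat.flnc.sam 4956761
fi
```

Listing the chunks explicitly makes the concatenation order visible instead of depending on how the shell sorts the wildcard matches.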

Files (9.2 GB)

MD5 checksum                          Size
md5:8dc8d53c326a65f004800683681686ea  66.7 kB
md5:6489c530fe074dd076bc9d81e340c07e  1.3 MB
md5:259ef22adacb8e28e8071ef10be93751  17.1 MB
md5:0da4c58c3c241d9dfe6c1097544b0ea4  83.6 MB
md5:27cc7e90fa1e950d920214d2f7a7ed65  8.3 MB
md5:d53328337f101b66217bc0a0cb224d4a  14.4 MB
md5:833924c146b065fe1a492cca4bc27eb3  472.9 MB
md5:241a66e03b6946f3194b353e98789f85  484.4 MB
md5:576002fd4d27b21b7b162f7e53f60e87  483.2 MB
md5:176da136ab70783a73b25be59c0d3d4c  480.3 MB
md5:69817bf14fee8ce59c9f2cfea5e6f859  484.6 MB
md5:8b47f168b968a34f2373948c25fd9898  478.0 MB
md5:106ce494899d39dfc4cf3e257faacbd2  473.5 MB
md5:de17a3a63e91af996253148f2d29d1bb  477.7 MB
md5:2a7a3e091dfc2efe40c368a228231ee1  517.3 MB
md5:514a85a08db12deeff5ec0cf2cf781dd  679.7 MB
md5:23ab86b122ef0152f2df9f13a344ab94  679.4 MB
md5:af65f666f9acfea4bccbf145bff0e945  675.4 MB
md5:8183a9d448684854c9b2a2fc77bcb51b  254.2 MB
md5:7c259b75794fbf4907282202025b9e27  351.1 MB
md5:fe935448548c16e8d0f69635935751cd  1.3 GB
md5:c0734ed3335b025ccc9e42f1c8da2e3c  369.3 MB
md5:14fd33f997e53ef92f27e5e185ad2efa  11.1 kB
md5:4a4d194ae1d45cafdcdc060aeedf3a77  6.5 MB
md5:36b9b726295265a8d19e4e1b8b75da12  82.3 MB
md5:42ce13524be256b3e941583fac2f76e2  18.6 MB
md5:3d17495d733c05cb08bcbafb28dfeaf5  708.4 kB
md5:54204a905fb4350c3535f304eb88b915  2.9 MB
md5:b91f8dfd267719bd85f2f67de8856794  9.8 MB
md5:a569edbfcb8bdd85c43bfe8d2ff10c61  14.3 MB
md5:b44dabd795ca9f5d89a23c36f93bc9f0  5.3 MB
md5:1459be6757b5b78595e6480d194feb63  135.2 MB
md5:67377088286909877879093fd5e196ab  99.7 MB
md5:3b0ccf78df2d63c3f969d6869586d82d  52.6 MB
md5:d9159de2953c699ad992751325e0c503  2.4 MB
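
Each file in the listing carries an md5 checksum, which can be compared against the downloaded copy. The following is a minimal sketch, assuming md5sum (GNU coreutils) is available, with a fallback to the BSD/macOS md5 tool. The function name verify_md5 is hypothetical, and because the listing does not show file names, the path in the usage comment is a placeholder.

```shell
#!/usr/bin/env bash
# Compare a file's md5 digest with an expected value.
# The expected value may be given with or without the "md5:" prefix.
verify_md5() {
    local file="$1" expected="${2#md5:}" actual
    if command -v md5sum >/dev/null 2>&1; then
        actual=$(md5sum "$file" | awk '{print $1}')
    else
        # Assumption: BSD/macOS `md5` is available when md5sum is not.
        actual=$(md5 -q "$file")
    fi
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "FAILED: $file (got $actual)" >&2
        return 1
    fi
}

# Usage (placeholder path; pair each checksum with the file it was listed for):
#   verify_md5 path/to/downloaded_file md5:8dc8d53c326a65f004800683681686ea
```

A mismatch returns a non-zero status, so the check can gate the rejoin steps in a script.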