Design of a generalized platform for gathering protein sequence → function datasets at scale
Creators
- Cortade, Dana (Researcher)1
- d'Oelsnitz, Simon (Researcher)2
- Chadha, Anjali (Researcher)3
- Hayes, Oliver (Researcher)3
- Taghon, Geoffrey (Researcher)4
- Doerr, Mark (Researcher)5
- Born, Stefan (Researcher)6
- Kelly, Peter (Researcher)1, 7
- Ross, David (Researcher)4
- DeBenedictis, Erika (Researcher)3
- Align to Innovate
Description
This article proposes a high-throughput experimental platform for collecting large-scale protein sequence → function datasets. The platform uses a pooled, growth-based assay to measure protein function quantitatively, analyzing up to 500,000 protein variants per experiment at a cost of approximately $0.05 per sequence. The method is designed to adapt to a wide variety of protein functions by validating gene circuits and establishing calibration variants. The workflow has four steps: creating barcoded libraries of protein variants, transforming them into bacteria, growing the bacteria under selective conditions, and sequencing the barcodes to quantify differential growth rates. After an embargo period, the collected data will populate an open dataset intended to support machine learning models that predict protein function from DNA sequence. The platform aims to standardize data collection across labs and protein families, contributing to a generalizable predictive model of protein function, a step that could significantly advance biology.
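The last step of the workflow, turning barcode read counts into differential growth rates, can be illustrated with a minimal sketch. The description does not specify the analysis method, so the approach here is an assumption: each barcode's read count is divided by that of a reference barcode at every time point, and the slope of the log ratio against time is taken as the growth rate relative to the reference. The function name `relative_growth_rates`, its parameters, and the pseudocount are hypothetical and not taken from the article.

```python
import math


def relative_growth_rates(counts, times, reference, pseudocount=0.5):
    """Estimate each barcode's growth rate relative to a reference barcode.

    counts:      dict mapping barcode -> list of read counts, one per time point
    times:       list of sampling times, same length as each count list
    reference:   barcode used as the baseline (an assumed analysis choice)
    pseudocount: small value added to counts to avoid log(0) (also assumed)
    """
    ref_reads = counts[reference]
    t_mean = sum(times) / len(times)
    t_var = sum((t - t_mean) ** 2 for t in times)

    rates = {}
    for barcode, reads in counts.items():
        # The ratio to the reference cancels differences in sequencing depth
        # between time points.
        log_ratio = [
            math.log((r + pseudocount) / (q + pseudocount))
            for r, q in zip(reads, ref_reads)
        ]
        y_mean = sum(log_ratio) / len(log_ratio)
        # Ordinary least-squares slope of the log ratio against time.
        slope = sum(
            (t - t_mean) * (y - y_mean) for t, y in zip(times, log_ratio)
        ) / t_var
        rates[barcode] = slope
    return rates


# Made-up counts for three barcodes sampled at three time points.
example_counts = {
    "ref": [100, 200, 400],
    "fast": [100, 400, 1600],
    "slow": [100, 100, 100],
}
example_rates = relative_growth_rates(example_counts, [0, 1, 2], "ref")
```

In this sketch a positive rate means the variant outgrew the reference under selection and a negative rate means it fell behind. Converting such rates into a quantitative function measurement would rely on the calibration variants mentioned above, which this example does not model.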
Files
Name | Size | md5 |
---|---|---|
20240806_SequenceToFunction_Main.pdf | 6.2 MB | 34aba387718ea820be43a016142d2cf0 |
Additional details
Related works
- Is continued by
- Publication: 10.5281/zenodo.12819116 (DOI)
- Publication: 10.5281/zenodo.12819109 (DOI)