A roadmap for AI-driven protein design
Description
I created this free course consisting of 10 lectures to introduce you to AI-driven protein design.
See these pages for a more detailed description of this resource:
- https://miangoaren.github.io/teaching/proteins
- https://github.com/miangoar/AI-driven-protein-design
- Spanish version: https://miangoar.github.io/teaching/proteins
General description
I want more people to learn how to design proteins using Artificial Intelligence (AI). However, I have encountered three main problems:
- There is a large amount of information, and it is not clear where to start or which topics are necessary.
- There are no comprehensive online courses on this topic in Spanish.
- Courses related to this field are often too expensive for most students in Latin America.
To address these issues, I created this free 37-hour course, distributed across 10 lectures, to introduce you to AI-driven protein design. The course includes two main resources:
- The 10 lectures on YouTube
- A GitHub repository with the following resources:
  - Tools: libraries organized into 25 categories
  - Learning resources: courses, tutorials and useful publications organized into nine categories
  - Databases: resources to download genomic and protein data organized into 12 categories
  - Lectures: links to each lecture and to download the slides
  - YouTube: recommended channels and videos to learn about proteins, mathematics, and data science
Course organization
The lectures are organized from 01 to 10 to facilitate conceptual understanding. For example, reviewing AlphaFold requires knowledge of structural biology and deep learning, which are covered in detail in their respective lectures. Below is a brief description of each lecture and its topics:
- Basic computing concepts: how CPUs and GPUs work, as well as the essential software for data analysis.
- Where does your journey begin?
- Hardware
- CPU
- GPU
- Software
- Linux/Bash and GitHub
- Python
- Machine learning: what AI is and its subfields, the current capabilities of algorithms, and how a model is trained.
- Current state of AI
- How AI learns
- Patterns
- Machine learning operations (MLOps)
- Learning paradigms
- How to train a model
- Data processing
- How to choose a model
- Training process
- Deep learning: how neural networks work, the different types of neural networks, and the software used to work with them.
- Neural networks
- Neurons
- Deep learning
- Loss functions
- Backpropagation
- Optimizers
- Architectures
- Explainability (why) and Interpretability (how)
- Deep learning frameworks
- Transformers and language models: how Transformers and modern language models work.
- Language models
- Transformers
- Original architecture
- BERT and GPT architectures
- Scaling laws
- Pre-training and post-training
- Reinforcement learning
- Performance and generalization
- Benchmark saturation
- Hype
- How to work with LLMs
- Optimization techniques (for GPU-poors like us)
- Hugging Face and Software 2.0
- Protein structure: principles of structural biology and organization.
- Structural organization
- Amino acids
- Secondary and tertiary structure
- Experimental workflow for structure determination
- Structure viewers
- Classifications
- Folds and domains
- First classification schemes
- Similarity metrics
- Sequence and structural divergence
- Current classification schemes
- The shape of the protein universe
- Uneven distribution
- Complex homologous relationships
- Switch folds
- Protein function: how proteins adopt their structure and how function is regulated.
- Protein folding
- Cellular environment
- Thermodynamics and conformational entropy
- Protein function
- Diffusion
- Molecular dynamics and energy functions
- Enzymes
- Functional annotation
- Functional regulation
- Allosterism
- Transcriptional regulation
- Post-translational modifications
- Proteostasis and host physiology
- Protein evolution: origin and diversification from simpler peptides.
- Levels of biological organization
- Evolution across spatio-temporal scales
- Chemical evolution
- Biological evolution
- RNA world hypothesis and ribosome evolution
- Ancestral proteins
- Protein diversification
- The sequence space
- Mutations
- Robustness, evolvability and promiscuity
- Evolution of protein function
- Epistasis: how interactions shape evolution
- Residue-residue and protein-protein interactions
- Randomness of mutations
- AlphaFold: overview of AF2 and AF3 architectures and impact.
- The impact of AlphaFold
- AlphaFoldmania
- Protein structure prediction before AlphaFold
- AlphaFold
- AlphaFold2
- Protein language models
- Architecture
- Post-AlphaFold2 era
- AlphaFold3
- Diffusion models for macromolecular modeling
- Architecture
- Post-AlphaFold3 era
- AI-driven protein design: motivations and modern AI methods.
- Protein design
- AI in the biotech market
- Advances from classical methods to AI-driven methods
- Basic considerations to increase the success of a design
- Rational design
- Classic experimental and bioinformatic approaches
- Macromolecular modeling and recombineering
- Evolutionary design
- Directed evolution, ancestral sequence reconstruction and consensus design
- Representation learning
- (Macro)Molecular representations
- Protein language models and ESMFold
- Explainability and interpretability of protein language models
- Scaling laws and multimodality in protein language models
- Generative AI
- Integration of multimodal data
- Sequence generation
- Generalization and fitness prediction with protein language models
- Inverse folding and ProteinMPNN
- Structure generation with diffusion models
- Model selection and computational scoring of candidates
- Model generalization and synthetic data
- Summary
- Data and biases: relevant databases and data processing.
- Big data is Omics
- Properties of a good dataset
- Main datasets
- PDB
- UniProt
- NCBI datasets
- Other interesting datasets
- Data processing
- Data cleaning in biology
- Basic tools for biological data manipulation
- Data splitting
- Generalization in (protein) biology
- Data leakage and other inherent issues
- Biases in the data
- A roadmap for AI-driven protein design
Access to the slides
This course includes more than 800 slides. The notes section of each slide lists image sources, citations, and recommended resources for deeper study. I recommend reviewing the slides in PowerPoint. You can download the slides from Zenodo and Google Drive:
By releasing these slides, my goal is to provide access to information for deeper learning. If you are a teacher and have adopted this material for your lectures, please let me know. I would love to learn how you improved the course and to know that more people have learned about protein science.
However, if you find that someone has plagiarized this course in whole or in part and is charging money for access, please notify me. Developing this material required a lot of time and effort, and plagiarism is a serious breach of professionalism and ethics.
How to support this project
If you found this course useful and would like to support it financially, you can donate via PayPal. You can donate any amount; the suggested amounts are USD 12, 30, or 45 (based on the economic reality of students in Latin America). Go to the following link to donate: https://www.paypal.com/donate/?hosted_button_id=AG42EZTZW9AJN
If you do not have financial flexibility, but would like to express your gratitude, you can send me your comments by email: gamamiguelangel@gmail.com
Finally, I would appreciate it if you shared this course with interested colleagues or reposted the official course announcement on my social media:
- X (Twitter): https://x.com/miangoar/status/2014455014626865427
- Instagram: https://www.instagram.com/miangoar/#
- BlueSky: https://bsky.app/profile/miangoar.bsky.social/post/3md24ah4poc2c
- LinkedIn: https://www.linkedin.com/in/miguel-angel-gonzalez-arias-61b04b2ba/
About me
I’m Miguel Angel González Arias. I’m a Mexican biologist interested in proteins, microbes, and computation. For more details about me, my social networks, and other contact information, please visit the following page:
Files
(1.4 GB)
| File (MD5 checksum) | Size |
|---|---|
| md5:d764b8163aad895eb9a2f765b87552dc | 69.9 MB |
| md5:5b41adb439176f755b348a670c4508cc | 96.1 MB |
| md5:1281b64f2c41bde96f4712aabe2bec9f | 47.0 MB |
| md5:ea91d24f04b2687440e70bf874d48218 | 104.9 MB |
| md5:180e0c1c81cca6361a03dfc62c4da015 | 275.1 MB |
| md5:2c243e96e61dd112a12bf636a5f07772 | 213.3 MB |
| md5:768880321b85324b4720549f0e7e1494 | 144.7 MB |
| md5:0ffa5d0da1242e9697612718c3616098 | 144.8 MB |
| md5:7d83a97960709d11a8988d0dc40be849 | 189.9 MB |
| md5:896bdaf67d33f13c6d53b38eb1c100f5 | 73.3 MB |
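Each file above is listed with an MD5 checksum, which can be used to confirm that a download finished without corruption. The following is a minimal sketch of such a check, using only the Python standard library. The function names `md5_of_file` and `matches_checksum` are hypothetical helpers written for this illustration; they are not part of the course materials.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large slide decks are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Compare a file's MD5 digest with a listed value such as 'md5:d764...'."""
    expected = expected_md5.strip().lower()
    # The table shows values with an 'md5:' prefix; drop it if present.
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, assuming a downloaded file saved under the hypothetical name `lecture_01.pptx`, calling `matches_checksum("lecture_01.pptx", "md5:d764b8163aad895eb9a2f765b87552dc")` would return `True` only if the local copy matches the listed checksum. Which checksum belongs to which file is not recoverable from the table above, so check it against the record page.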
Additional details
Software
- Repository URL: https://github.com/miangoar/AI-driven-protein-design
- Programming language: Markdown