Published February 5, 2026 | Version v1
Presentation Open

A roadmap for AI-driven protein design

Description

I created this free course consisting of 10 lectures to introduce you to AI-driven protein design.

See these pages for a more detailed description of this resource:

  • https://miangoaren.github.io/teaching/proteins
  • https://github.com/miangoar/AI-driven-protein-design 
  • Spanish version: https://miangoar.github.io/teaching/proteins 

General description

I want more people to learn how to design proteins using Artificial Intelligence (AI). However, I have encountered three main problems:

  1. There is a large amount of information, and it is not clear where to start or which topics are necessary.
  2. There are no comprehensive online courses on this topic in Spanish.
  3. Courses related to this field are usually expensive for most students in Latin America.

To address these issues, I created this free 37-hour course, distributed across 10 lectures, to introduce you to AI-driven protein design. The course includes two main resources:

  1. The 10 lectures on YouTube
  2. A GitHub repository with the following resources:
    1. Tools: libraries organized into 25 categories
    2. Learning resources: courses, tutorials and useful publications organized into nine categories
    3. Databases: resources to download genomic and protein data organized into 12 categories
    4. Lectures: links to each lecture and to download the slides
    5. YouTube: recommended channels and videos to learn about proteins, mathematics, and data science

Course organization

The lectures are numbered 01 to 10 in an order meant to build conceptual understanding step by step. For example, understanding AlphaFold requires knowledge of structural biology and deep learning, which are covered in detail in their respective lectures. Below is a brief description of each lecture and its topics:

  1. Basic computing concepts: how CPUs and GPUs work, as well as the essential software for data analysis.
    1. Where does your journey begin?
    2. Hardware
      • CPU
      • GPU
    3. Software
      • Linux/Bash and GitHub
      • Python
  2. Machine learning: what AI is and its subfields, the current capabilities of algorithms, and how a model is trained.
    1. Current state of AI
    2. How AI learns
      • Patterns
      • Machine learning operations (MLOps)
      • Learning paradigms
    3. How to train a model
      • Data processing
      • How to choose a model
      • Training process
  3. Deep learning: how neural networks work, the different types of neural networks, and the software used to work with them.
    1. Neural networks
      • Neurons
      • Deep learning
      • Loss functions
      • Backpropagation
      • Optimizers
      • Architectures
      • Explainability (why) and Interpretability (how)
    2. Deep learning frameworks
  4. Transformers and language models: how Transformers and modern language models work.
    1. Language models
    2. Transformers
      • Original architecture
      • BERT and GPT architectures
      • Scaling laws
      • Pre-training and post-training
      • Reinforcement learning
    3. Performance and generalization
      • Benchmark saturation
      • Hype
    4. How to work with LLMs
      • Optimization techniques (for GPU-poors like us)
      • Hugging Face and Software 2.0
  5. Protein structure: principles of structural biology and organization.
    1. Structural organization
      • Amino acids
      • Secondary and tertiary structure
      • Experimental workflow for structure determination
      • Structure viewers
    2. Classifications
      • Folds and domains
      • First classification schemes
      • Similarity metrics
      • Sequence and structural divergence
      • Current classification schemes
    3. The shape of the protein universe
      • Uneven distribution
      • Complex homologous relationships
      • Switch folds
  6. Protein function: how proteins adopt their structure and how function is regulated.
    1. Protein folding
      • Cellular environment
      • Thermodynamics and conformational entropy
    2. Protein function
      • Diffusion
      • Molecular dynamics and energy functions
      • Enzymes
      • Functional annotation
    3. Functional regulation
      • Allosterism
      • Transcriptional regulation
      • Post-translational modifications
      • Proteostasis and host physiology
  7. Protein evolution: origin and diversification from simpler peptides.
    1. Levels of biological organization
      • Evolution across spatio-temporal scales
      • Chemical evolution
    2. Biological evolution
      • RNA world hypothesis and ribosome evolution
      • Ancestral proteins
      • Protein diversification
    3. The sequence space
      • Mutations
      • Robustness, evolvability and promiscuity
      • Evolution of protein function
    4. Epistasis: How interactions shape evolution
      • Residue-residue and protein-protein interactions
      • Randomness of mutations
  8. AlphaFold: overview of AF2 and AF3 architectures and impact.
    1. The impact of AlphaFold
      • AlphaFoldmania
      • Protein structure prediction before AlphaFold
    2. AlphaFold
    3. AlphaFold2
      • Protein language models
      • Architecture
      • Post-AlphaFold2 era
    4. AlphaFold3
      • Diffusion models for macromolecular modeling
      • Architecture
      • Post-AlphaFold3 era
  9. AI-driven protein design: motivations and modern AI methods.
    1. Protein design
      • AI in the biotech market
      • Advances from classical methods to AI-driven methods
      • Basic considerations to increase the success of a design
    2. Rational design
      • Classic experimental and bioinformatic approaches
      • Macromolecular modeling and recombineering
    3. Evolutionary design
      • Directed evolution, ancestral sequence reconstruction and consensus design
    4. Representation learning
      • (Macro)Molecular representations
      • Protein language models and ESMFold
      • Explainability and interpretability of protein language models
      • Scaling laws and multimodality in protein language models
    5. Generative AI
      • Integration of multimodal data
      • Sequence generation
      • Generalization and fitness prediction with protein language models
      • Inverse folding and ProteinMPNN
      • Structure generation with diffusion models
      • Model selection and computational scoring of candidates
      • Model generalization and synthetic data
    6. Summary
  10. Data and biases: relevant databases and data processing.
    1. Big data is Omics
      • Properties of a good dataset
    2. Main datasets
      • PDB
      • UniProt
      • NCBI datasets
      • Other interesting datasets
    3. Data processing
      • Data cleaning in biology
      • Basic tools for biological data manipulation
      • Data splitting
    4. Generalization in (protein) biology
      • Data leakage and other inherent issues
    5. Biases in the data
    6. A roadmap for AI-driven protein design

Access to the slides

This course includes more than 800 slides. The notes section of each slide lists image sources, citations, and recommended resources for deeper study. I recommend reviewing the slides in PowerPoint. You can download the slides from Zenodo and Google Drive.

I am releasing these slides to provide access to information for deeper learning. If you are a teacher and have adopted this material for your lectures, please let me know. I would love to learn how you improved the course and to know that more people have learned about protein science.

However, if you identify that someone has plagiarized this course in whole or in part and is charging money for access, I would appreciate being notified, as developing this material required a lot of time and effort, and plagiarism is a serious breach of professionalism and ethics.

How to support this project

If you found this course useful and would like to support it financially, you can donate via PayPal. You can donate any amount; the suggested amounts are USD 12, 30, or 45 (based on the economic reality of students in Latin America). Go to the following link to donate: https://www.paypal.com/donate/?hosted_button_id=AG42EZTZW9AJN

If you do not have financial flexibility, but would like to express your gratitude, you can send me your comments by email: gamamiguelangel@gmail.com

Finally, I would appreciate it if you shared this course with interested colleagues or reposted the official course announcement on my social media:

  • X (Twitter): https://x.com/miangoar/status/2014455014626865427
  • Instagram: https://www.instagram.com/miangoar/#
  • BlueSky: https://bsky.app/profile/miangoar.bsky.social/post/3md24ah4poc2c 
  • LinkedIn: https://www.linkedin.com/in/miguel-angel-gonzalez-arias-61b04b2ba/ 

About me

I’m Miguel Angel González Arias. I’m a Mexican biologist interested in proteins, microbes, and computation. For more details about me, my social networks, and other contact information, please visit the following page:

Files

Files (1.4 GB)

md5:d764b8163aad895eb9a2f765b87552dc  69.9 MB
md5:5b41adb439176f755b348a670c4508cc  96.1 MB
md5:1281b64f2c41bde96f4712aabe2bec9f  47.0 MB
md5:ea91d24f04b2687440e70bf874d48218  104.9 MB
md5:180e0c1c81cca6361a03dfc62c4da015  275.1 MB
md5:2c243e96e61dd112a12bf636a5f07772  213.3 MB
md5:768880321b85324b4720549f0e7e1494  144.7 MB
md5:0ffa5d0da1242e9697612718c3616098  144.8 MB
md5:7d83a97960709d11a8988d0dc40be849  189.9 MB
md5:896bdaf67d33f13c6d53b38eb1c100f5  73.3 MB

Additional details

Software