FANTASIA V4.1 – LookUp Table – UniProt December 2025 – Experimental Evidence Code (Early Layers and Final Layers)
Description
Release: December 2025
System: Protein Information System (PIS v3.1.0)
Compatibility: FANTASIA V4.1
Overview
This PostgreSQL database backup uses the pgvector extension to store high-dimensional protein embeddings.
It contains precomputed embeddings and functional annotations from the UniProt December 2025 release, restricted to entries with experimental evidence codes.
The lookup table was generated with PIS v3.1.0 (Protein Information System), an integrated platform for automated extraction, processing, and management of protein-related data.
PIS consolidates information from UniProt, PDB, and GOA, enabling efficient retrieval of sequences, structures, and annotations.
This release is designed for direct use with FANTASIA V4.1, an advanced pipeline for high-confidence functional annotation using Protein Language Models (PLMs).
Unlike earlier releases, this dataset includes the Early Layers (0-2) and the Final Layers (the last three) for each PLM, providing embeddings at both ends of the network for similarity search and GO term transfer.
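As a rough illustration of how a pgvector-backed table can be queried for nearest neighbors, the sketch below builds a parameterized SQL statement using pgvector's `<->` (Euclidean distance) operator. The table name `sequence_embeddings` and the columns `sequence_id`, `embedding`, `model`, and `layer` are hypothetical; inspect the restored schema for the actual names.

```python
from typing import Sequence, Tuple


def to_pgvector_literal(vector: Sequence[float]) -> str:
    """Format floats as a pgvector text literal, e.g. '[0.5,1.0]'."""
    return "[" + ",".join(repr(float(x)) for x in vector) + "]"


# Hypothetical table and column names, used only for illustration.
NEAREST_SQL = (
    "SELECT sequence_id, embedding <-> %s::vector AS distance "
    "FROM sequence_embeddings "
    "WHERE model = %s AND layer = %s "
    "ORDER BY distance "
    "LIMIT %s"
)


def build_nearest_query(vector: Sequence[float], model: str, layer: int, k: int = 5) -> Tuple[str, tuple]:
    """Return the SQL text and its parameters for a k-nearest-neighbor lookup."""
    return NEAREST_SQL, (to_pgvector_literal(vector), model, layer, k)
```

The returned pair can be passed to a PostgreSQL driver that uses `%s` placeholders, for example `cursor.execute(sql, params)` with psycopg.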
Compatibility Notice
This database is not compatible with FANTASIA versions earlier than v4.1 or with PIS versions earlier than v3.1.0.
A tokenization inconsistency affecting the ProtT5-XL-UniRef50 model was corrected in this release.
Because of this fix:
- ProtT5 embeddings produced with FANTASIA versions earlier than v4.1 will not match those stored in this lookup table.
- The mismatch affects only workflows that use the ProtT5 model.
- However, we highly recommend updating all components (FANTASIA, PIS, database) to ensure consistent behavior across all PLMs.
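A minimal sketch of a pre-flight version check based on the minimum versions stated above. The helper names are hypothetical and are not part of FANTASIA or PIS.

```python
def parse_version(text):
    """Turn 'v4.1' or '3.1.0' into a tuple of ints, padded to three parts."""
    parts = [int(p) for p in text.lstrip("vV").split(".")]
    return tuple(parts + [0] * (3 - len(parts)))


# Minimum versions taken from the compatibility notice.
MIN_FANTASIA = (4, 1, 0)
MIN_PIS = (3, 1, 0)


def is_compatible(fantasia_version, pis_version):
    """Return True when both components meet the stated minimum versions."""
    return (
        parse_version(fantasia_version) >= MIN_FANTASIA
        and parse_version(pis_version) >= MIN_PIS
    )
```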
This lookup table serves as a ready-to-use reference for large-scale protein function transfer. In a typical workflow, it is used to:
- Load multi-layer embeddings into memory
- Perform high-speed nearest-neighbor search in embedding space
- Transfer experimentally supported GO terms from annotated UniProt proteins
It provides a stable, optimized, and fully curated base for reproducible annotation workflows within the FANTASIA ecosystem.
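The workflow above (load embeddings, search for nearest neighbors, transfer GO terms) can be sketched on toy data as follows. This is not FANTASIA's implementation: the choice of cosine distance, the function names, the accessions `TOY_A` and `TOY_B`, and the three-dimensional vectors are all assumptions made for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def transfer_go_terms(query, reference, annotations, k=1):
    """Return (accession, distance, go_terms) for the k nearest reference proteins."""
    ranked = sorted(
        (cosine_distance(query, emb), acc) for acc, emb in reference.items()
    )
    return [(acc, dist, annotations.get(acc, [])) for dist, acc in ranked[:k]]


# Toy reference set: made-up accessions and tiny embeddings.
reference = {"TOY_A": [1.0, 0.0, 0.0], "TOY_B": [0.0, 1.0, 0.0]}
annotations = {"TOY_A": ["GO:0003824"], "TOY_B": ["GO:0005515"]}

hits = transfer_go_terms([0.9, 0.1, 0.0], reference, annotations, k=1)
```

Here the query vector lies closest to `TOY_A`, so the GO terms of `TOY_A` are the ones transferred.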
Embedding Coverage and Dataset Generation Details
Layer Coverage by Model
Each of the five protein language models in this release includes six embedding layers: the Early Layers (0-2) and the Final Layers (the last three). This configuration provides both low-level and high-level representational information.
- ESM — 33 layers → included: 0, 1, 2, 31, 32, 33
- ProtT5 — 24 layers → included: 0, 1, 2, 22, 23, 24
- ProstT5 — 24 layers → included: 0, 1, 2, 22, 23, 24
- ANKH3-Large — 48 layers → included: 0, 1, 2, 46, 47, 48
- ESM3c — 33 layers → included: 0, 1, 2, 31, 32, 33
This standardized multi-layer extraction ensures balanced coverage for downstream comparative analysis.
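The selection rule can be written down compactly. The sketch below assumes that layer indices run from 0 to N for a model with N layers, which is inferred from the indices listed above and not stated explicitly; the function name is hypothetical.

```python
def selected_layers(num_layers, n_early=3, n_final=3):
    """Return the first n_early and last n_final layer indices (0..num_layers)."""
    early = list(range(n_early))
    final = list(range(num_layers - n_final + 1, num_layers + 1))
    return early + final


# Layer counts as listed for each model in this release.
MODEL_DEPTHS = {"ESM": 33, "ProtT5": 24, "ProstT5": 24, "ANKH3-Large": 48, "ESM3c": 33}

layers_by_model = {name: selected_layers(depth) for name, depth in MODEL_DEPTHS.items()}
```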
Core Dataset Statistics
- UniProt accessions: 127,546
- Protein records: 127,546
- Unique sequences: 124,397
- Total embeddings (5 models): 124,397
- Protein records whose sequence duplicates another record (isoforms or redundancy): 3,149
- Experimental GO annotations: 627,932
- Sequence redundancy: 2.47%
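The redundancy figure follows directly from the accession and unique-sequence counts, as this short check shows.

```python
accessions = 127_546
unique_sequences = 124_397

# Records whose sequence is already represented by another record.
duplicates = accessions - unique_sequences

# Share of records that are redundant, as a percentage.
redundancy_pct = round(100 * duplicates / accessions, 2)
```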
Sequence Length Distribution (Unique Sequences)
The 124,397 unique sequences span a wide range:
- Minimum: 3 aa
- Maximum: 35,375 aa
- Mean: 587.44 aa
- Q1: 262 aa
- Median: 431 aa
- Q3: 694 aa
Computational Infrastructure
All embeddings were generated on an NVIDIA GeForce RTX 3090 Ti (24 GB VRAM) hosted at the Computational Biology and Bioinformatics group (CABD).
Previous lookup tables were created on CESGA Finisterrae III using A100 40 GB GPUs, which encountered memory limitations when processing long sequences (especially under shared-resource conditions).
Missing Embeddings Overview
| model | percent_covered | num_missing_emb | min_length | max_length | avg_length |
|---|---|---|---|---|---|
| esm2 | 99.97% | 35 | 9,563 | 35,375 | 18,095.40 |
| esm3c | 100% | - | - | - | - |
| prott5 | 99.66% | 420 | 1,557 | 35,375 | 6,018.19 |
| prostt5 | 99.18% | 1,022 | 1,557 | 35,375 | 4,568.18 |
| ankh3 | 99.95% | 61 | 7,962 | 35,375 | 14,091.56 |
This table summarizes, for each model:
- percentage of successfully covered proteins
- number of sequences that could not be embedded
- minimum, maximum, and average lengths of the problematic sequences
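The coverage percentages can be reproduced from the missing counts. The sketch below assumes the denominator is the 124,397 unique sequences; the table does not state this, but that choice matches the reported values. The `-` entry for esm3c is treated as zero missing embeddings.

```python
UNIQUE_SEQUENCES = 124_397

# Number of sequences without an embedding, per model, from the table above.
missing = {"esm2": 35, "esm3c": 0, "prott5": 420, "prostt5": 1022, "ankh3": 61}


def coverage_percent(num_missing, total=UNIQUE_SEQUENCES):
    """Percentage of sequences that were embedded, rounded to two decimals."""
    return round(100 * (1 - num_missing / total), 2)


coverage = {model: coverage_percent(count) for model, count in missing.items()}
```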
Commentary on the Missing Embeddings
The table shows that:
- Missing embeddings represent less than 1% of the dataset for every model (from 0% for ESM3c to 0.82% for ProstT5).
- ESM3c achieved full coverage (100%) for all sequences, including the longest.
- ProtT5-based models (prott5, prostt5) show the highest failure rates, due to the substantial memory requirements of long transformer contexts.
- All failures involve long sequences: the shortest failing sequence is 1,557 aa (prott5, prostt5), while esm2 and ankh3 fail only on sequences longer than 7,900 aa, up to 35,375 aa.
- These lengths exceed the practical VRAM capacity of the GPU used for inference in this release (24 GB).
Additional Files Included
A companion file missing_embeddings_per_model.csv is provided, containing:
- affected UniProt accessions
- full sequence lengths
- model-specific missing status
This file allows users to regenerate these embeddings on GPUs with more memory (48–80 GB) or with architectures that support efficient chunked attention.
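A minimal sketch of how such a file could be filtered to pick sequences for re-embedding. The column names `accession`, `length`, and `model`, as well as the sample rows, are made up for illustration; check the header of the actual CSV before adapting this.

```python
import csv
import io

# Made-up sample in a hypothetical layout (one row per accession and model).
SAMPLE = """accession,length,model
ACC_1,35375,esm2
ACC_1,35375,prott5
ACC_2,1557,prott5
"""


def missing_for_model(csv_text, model, max_length=None):
    """Return (accession, length) pairs missing for a model, optionally capped by length."""
    selected = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        length = int(row["length"])
        if row["model"] == model and (max_length is None or length <= max_length):
            selected.append((row["accession"], length))
    return selected
```

For a file on disk, the text can be read with `open(path, newline="").read()` and passed to the same function.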
Included GO Evidence Codes (Experimental Only)
Only experimental evidence GO annotations are included:
- EXP — Inferred from Experiment
- IDA — Inferred from Direct Assay
- IPI — Inferred from Physical Interaction
- IMP — Inferred from Mutant Phenotype
- IGI — Inferred from Genetic Interaction
- IEP — Inferred from Expression Pattern
- TAS — Traceable Author Statement
- IC — Inferred by Curator
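These evidence codes restrict downstream analyses to annotations backed by experiments or by reviewed literature and curator judgment. Note that the GO Consortium classes TAS and IC as author and curator statements rather than as experimental evidence.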
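Filtering annotations down to the codes listed above amounts to a set-membership test. In this sketch the tuple layout `(accession, go_id, evidence_code)` and the sample rows are assumptions, not the format used by PIS.

```python
# Evidence codes included in this release, as listed above.
INCLUDED_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "TAS", "IC"}


def keep_included(annotations):
    """Keep only annotations whose evidence code is in INCLUDED_CODES."""
    return [a for a in annotations if a[2] in INCLUDED_CODES]


# Made-up sample rows: (accession, GO ID, evidence code).
sample = [
    ("ACC_1", "GO:0005515", "IPI"),
    ("ACC_1", "GO:0003824", "IEA"),  # electronic annotation, excluded
    ("ACC_2", "GO:0005634", "TAS"),
]

filtered = keep_included(sample)
```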
Files
missing_embeddings_per_model_filtered.csv
| MD5 | Size |
|---|---|
| b63de361211735457c1f45c5b7167764 | 17.1 GB |
| 273eb999a44dd75deaa06d80fa5e3fd7 | 47.9 kB |
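After downloading, a file can be checked against its listed MD5 checksum. The sketch below hashes a binary stream in chunks so that the 17.1 GB backup does not need to fit in memory; the function name is hypothetical, and the demo hashes a small in-memory payload, not the real file.

```python
import hashlib
import io


def md5_of_stream(handle, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a binary file-like object, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: handle.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Checksum listed above for the 17.1 GB file.
EXPECTED_BACKUP_MD5 = "b63de361211735457c1f45c5b7167764"

# Demo on an in-memory payload.
demo_digest = md5_of_stream(io.BytesIO(b"hello"))
```

For the real download, open the file with `open(path, "rb")`, pass the handle to `md5_of_stream`, and compare the result with `EXPECTED_BACKUP_MD5`.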
Additional details
Software
- Repository URL: https://github.com/CBBIO/FANTASIA
- Programming language: Python
- Development status: Active