FANTASIA V4.1 – LookUp Table – UniProt December 2025 – Experimental Evidence Code (Early Layers and Final Layers)

Rojas-Mendoza, Ana M.; Perez-Canales, Francisco M.; Dominguez-Rodriguez, Àlex

doi:10.5281/zenodo.17793273

Published December 2, 2025 | Version v1

Dataset Open

FANTASIA V4.1 – LookUp Table

UniProt December 2025 – Experimental Evidence Code (Early Layers and Final Layers)

Release: December 2025

System: Protein Information System (PIS v3.1.0)

Compatibility: FANTASIA V4.1

📖 Overview

This PostgreSQL database backup uses the pgvector extension to store high-dimensional protein embeddings.

It contains precomputed embeddings and functional annotations from the UniProt November 2025 release, restricted to entries with experimental evidence codes only.

The lookup table was generated with PIS v3.1.0 (Protein Information System), an integrated platform for automated extraction, processing, and management of protein-related data.

PIS consolidates information from UniProt, PDB, and GOA, enabling efficient retrieval of sequences, structures, and annotations.

This release is designed for direct use with FANTASIA V4.1, an advanced pipeline for high-confidence functional annotation using Protein Language Models (PLMs).

Unlike earlier releases, this dataset includes both the Early Layers (the first three) and the Final Layers (the last three) of each PLM, so that similarity search and GO term transfer can draw on both low-level and high-level representations.
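
Because the release ships as a PostgreSQL backup that depends on pgvector, a manual restore outside FANTASIA might be sketched as follows. The sketch only assembles the command lines and runs nothing. It assumes the file is a pg_dump archive that pg_restore can read; the database name, user, host, port, and job count are placeholders, not values documented for this release.

```python
import shlex


def build_restore_commands(backup_path, dbname="biodata_lookup", user="postgres",
                           host="localhost", port=5432, jobs=4):
    """Return argument lists for a manual restore; nothing is executed here.

    All default values are hypothetical placeholders.
    """
    conn = ["--host", host, "--port", str(port), "--username", user]
    # 1) Create an empty target database.
    create_db = ["createdb", *conn, dbname]
    # 2) Make sure the pgvector extension (named "vector") is available.
    enable_vector = ["psql", *conn, "--dbname", dbname,
                     "--command", "CREATE EXTENSION IF NOT EXISTS vector;"]
    # 3) Restore the archive. --jobs assumes a custom-format dump.
    restore = ["pg_restore", *conn, "--dbname", dbname,
               "--no-owner", "--jobs", str(jobs), backup_path]
    return [create_db, enable_vector, restore]


commands = build_restore_commands(
    "BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup"
)
for command in commands:
    print(shlex.join(command))
```

Printing the commands rather than running them makes it easy to adapt them to a local PostgreSQL setup before restoring the 17.1 GB file.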

🚫 Compatibility Notice

This database is not compatible with FANTASIA versions earlier than v4.1 or with PIS versions earlier than v3.1.0.

A tokenization inconsistency affecting the ProtT5-XL-UniRef50 model was corrected in this release.

Because of this fix:

ProtT5 embeddings produced with FANTASIA versions earlier than v4.1 will not match those stored in this lookup table.
The incompatibility affects only workflows that use the ProtT5 model.
Even so, we strongly recommend updating all components (FANTASIA, PIS, and the database) to ensure consistent behavior across all PLMs.

This lookup table serves as a ready-to-use reference for large-scale protein function transfer. A FANTASIA run that uses it:

Loads multi-layer embeddings into memory
Performs high-speed nearest-neighbor search in embedding space
Transfers experimentally supported GO terms from annotated UniProt proteins

It provides a stable, optimized, and fully curated base for reproducible annotation workflows within the FANTASIA ecosystem.
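
The search-and-transfer idea described above can be illustrated with a minimal sketch. It is not FANTASIA's implementation: cosine distance, the record layout, the toy three-dimensional vectors, and the placeholder accessions are all assumptions chosen for illustration.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def transfer_go_terms(query, reference, k=1):
    """Transfer GO terms from the k nearest reference proteins.

    reference: list of (accession, embedding, go_terms) tuples (hypothetical layout).
    Returns (go_term, source_accession, distance) tuples, nearest first.
    """
    ranked = sorted(reference, key=lambda rec: cosine_distance(query, rec[1]))
    hits = []
    for accession, embedding, go_terms in ranked[:k]:
        distance = cosine_distance(query, embedding)
        for term in go_terms:
            hits.append((term, accession, distance))
    return hits


# Toy reference set: placeholder accessions and made-up embeddings.
reference = [
    ("ACC_A", [1.0, 0.0, 0.0], ["GO:0003677"]),
    ("ACC_B", [0.0, 1.0, 0.0], ["GO:0005524", "GO:0016301"]),
]
hits = transfer_go_terms([0.9, 0.1, 0.0], reference, k=1)
for term, accession, distance in hits:
    print(term, accession, f"{distance:.4f}")
```

A real run would compare vectors with thousands of dimensions against the whole lookup table, typically with a distance threshold in addition to k.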

📦 Embedding Coverage and Dataset Generation Details

📊 Layer Coverage by Model

Each of the five protein language models in this release contributes six embedding layers: the Early Layers (the first three) and the Final Layers (the last three). This configuration captures both low-level and high-level representational information.

ESM — 33 layers → included: 0, 1, 2, 31, 32, 33
ProtT5 — 24 layers → included: 0, 1, 2, 22, 23, 24
ProstT5 — 24 layers → included: 0, 1, 2, 22, 23, 24
ANKH3-Large — 48 layers → included: 0, 1, 2, 46, 47, 48
ESM3c — 33 layers → included: 0, 1, 2, 31, 32, 33

This standardized multi-layer extraction ensures balanced coverage for downstream comparative analysis.
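
The selection rule behind the list above can be reproduced with a short sketch. It assumes that layer indices run from 0 to the stated layer count inclusive, which is what the listed values suggest; the function name is hypothetical.

```python
def selected_layers(num_layers, n_early=3, n_final=3):
    """Return the first n_early and last n_final indices of 0..num_layers."""
    indices = list(range(num_layers + 1))  # 0 through num_layers, inclusive
    return indices[:n_early] + indices[-n_final:]


# Layer counts as stated in this release's description.
LAYER_COUNTS = {
    "ESM": 33,
    "ProtT5": 24,
    "ProstT5": 24,
    "ANKH3-Large": 48,
    "ESM3c": 33,
}

for model, num_layers in LAYER_COUNTS.items():
    print(model, selected_layers(num_layers))
```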

📌 Core Dataset Statistics

UniProt accessions: 127,546
Protein records: 127,546
Unique sequences: 124,397
Total embeddings (5 models): 124,397
- (Includes 3,149 proteins with identical sequences due to isoforms/redundancy)
Experimental GO annotations: 627,932
Sequence redundancy: 2.47%
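
The redundancy figures follow directly from the record and sequence counts, as this small check shows. It uses only the numbers listed above.

```python
protein_records = 127_546
unique_sequences = 124_397

# Records whose sequence duplicates that of another record.
duplicate_records = protein_records - unique_sequences
redundancy_pct = 100 * duplicate_records / protein_records

print(duplicate_records)          # 3149
print(f"{redundancy_pct:.2f}%")   # 2.47%
```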

📈 Sequence Length Distribution (Unique Sequences)

The 124,397 unique sequences span a wide range of lengths:

Minimum: 3 aa
Maximum: 35,375 aa
Mean: 587.44 aa
Q1: 262 aa
Median: 431 aa
Q3: 694 aa

🖥️ Computational Infrastructure

All embeddings were generated on an NVIDIA GeForce RTX 3090 Ti (24 GB VRAM) hosted at the Computational Biology and Bioinformatics group (CABD).

Previous lookup tables were created on CESGA Finisterrae III using A100 40 GB GPUs, which encountered memory limitations when processing long sequences (especially under shared-resource conditions).

🚫 Missing Embeddings Overview

model	percent_covered	num_missing_emb	min_length	max_length	avg_length
esm2	99.97%	35	9,563	35,375	18,095.40
esm3c	100%	-	-	-	-
prott5	99.66%	420	1,557	35,375	6,018.19
prostt5	99.18%	1,022	1,557	35,375	4,568.18
ankh3	99.95%	61	7,962	35,375	14,091.56

This table summarizes, for each model:

percentage of successfully covered proteins
number of sequences that could not be embedded
minimum, maximum, and average lengths of the problematic sequences
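
The coverage percentages in the table can be recomputed from the missing counts. The sketch below assumes the denominator is the 124,397 unique sequences; that assumption reproduces the reported values, but the description does not state it explicitly.

```python
UNIQUE_SEQUENCES = 124_397  # assumed denominator

# Missing-embedding counts per model, taken from the table above.
missing = {"esm2": 35, "esm3c": 0, "prott5": 420, "prostt5": 1_022, "ankh3": 61}


def coverage_pct(num_missing, total=UNIQUE_SEQUENCES):
    """Percentage of sequences that received an embedding."""
    return 100 * (total - num_missing) / total


for model, num_missing in missing.items():
    print(f"{model}: {coverage_pct(num_missing):.2f}%")
```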

📌 Commentary on the Missing Embeddings

The table shows that:

Missing embeddings account for at most 0.82% of the dataset, depending on the model.
ESM3c achieved full coverage (100%) for all sequences, including the longest.
The ProtT5-based models (prott5, prostt5) show the highest failure rates, owing to the substantial memory requirements of long transformer contexts.
All failures are due to long sequences: at least 1,557 aa for the ProtT5-based models and at least 7,962 aa for esm2 and ankh3, up to the 35,375 aa maximum.
These lengths exceed the practical VRAM capacity of most PLM inference pipelines (24 GB in this release).
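
A rough back-of-the-envelope estimate shows why very long sequences strain GPU memory. The sketch assumes naive self-attention that materializes a full length-by-length score matrix in 16-bit floats; the head count of 32 is a hypothetical value, and the models in this release may use different settings or more memory-efficient attention.

```python
GIB = 1024 ** 3


def attention_scores_bytes(seq_len, num_heads=1, bytes_per_value=2):
    """Bytes for one layer's attention score matrices (seq_len x seq_len per head)."""
    return seq_len * seq_len * num_heads * bytes_per_value


# Lengths taken from the missing-embeddings table.
for length in (1_557, 9_563, 35_375):
    per_head = attention_scores_bytes(length) / GIB
    all_heads = attention_scores_bytes(length, num_heads=32) / GIB  # hypothetical head count
    print(f"{length:>6} aa: {per_head:.3f} GiB per head, {all_heads:.1f} GiB for 32 heads")
```

Because the cost grows with the square of the sequence length, a 35,375 aa protein needs roughly 500 times more score-matrix memory than a 1,557 aa one.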

📄 Additional Files Included

A companion file, missing_embeddings_per_model_filtered.csv, is provided, containing:

affected UniProt accessions
full sequence lengths
model-specific missing status

This file allows users to regenerate these embeddings on hardware with more GPU memory (48–80 GB) or with architectures that use efficient chunked attention.
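
Selecting the sequences to regenerate for one model might look like the sketch below. The column names (accession, length, model), the sample rows, and the placeholder accessions are hypothetical; check the header of the actual CSV before adapting it.

```python
import csv
import io

# Hypothetical sample in an assumed layout; not rows from the real file.
SAMPLE_CSV = """accession,length,model
ACC_A,35375,esm2
ACC_A,35375,prott5
ACC_B,1557,prostt5
ACC_C,6018,prott5
"""


def accessions_to_regenerate(csv_text, model, max_length=None):
    """Return (accession, length) pairs missing for `model`, shortest first.

    max_length optionally skips sequences too long for the available GPU.
    """
    selected = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        length = int(row["length"])
        if row["model"] == model and (max_length is None or length <= max_length):
            selected.append((row["accession"], length))
    return sorted(selected, key=lambda item: item[1])


print(accessions_to_regenerate(SAMPLE_CSV, "prott5"))
print(accessions_to_regenerate(SAMPLE_CSV, "prott5", max_length=10_000))
```

Sorting by length lets the shortest failures be retried first, which are the most likely to fit on a larger GPU.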

🔬 Included GO Evidence Codes (Experimental Only)

Only experimental evidence GO annotations are included:

EXP — Inferred from Experiment
IDA — Inferred from Direct Assay
IPI — Inferred from Physical Interaction
IMP — Inferred from Mutant Phenotype
IGI — Inferred from Genetic Interaction
IEP — Inferred from Expression Pattern
TAS — Traceable Author Statement
IC — Inferred by Curator

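These evidence codes restrict downstream analyses to manually curated annotations. In the Gene Ontology classification, EXP, IDA, IPI, IMP, IGI, and IEP are experimental evidence codes, while TAS and IC are author-statement and curator-statement codes.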
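
Applying the same evidence-code restriction to other annotation sets can be sketched as a simple filter. The tuple layout and the sample annotations (placeholder accessions) are assumptions for illustration; only the set of codes comes from the list above.

```python
# Evidence codes included in this release, as listed above.
INCLUDED_EVIDENCE_CODES = frozenset(
    {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "TAS", "IC"}
)


def filter_annotations(annotations):
    """Keep (accession, go_term, evidence_code) tuples with an included code."""
    return [a for a in annotations if a[2] in INCLUDED_EVIDENCE_CODES]


# Hypothetical annotations; IEA (electronic annotation) is not in the included set.
annotations = [
    ("ACC_A", "GO:0003677", "IDA"),
    ("ACC_A", "GO:0005524", "IEA"),
    ("ACC_B", "GO:0016301", "TAS"),
]
kept = filter_annotations(annotations)
print(kept)
```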

Files (17.1 GB)

Name	Size
BioData_Dec25_esm2_prott5_prostt5_ankh3_large_esm3c_Layers_3Frist_3Last.backup (md5:b63de361211735457c1f45c5b7167764)	17.1 GB
missing_embeddings_per_model_filtered.csv (md5:273eb999a44dd75deaa06d80fa5e3fd7)	47.9 kB

Additional details

Funding: European Commission, OSCARS (Open Science Clusters’ Action for Research and Society), grant 101129751

Repository URL: https://github.com/CBBIO/FANTASIA
Programming language: Python
Development Status: Active
