Training and test data for antibody humanness evaluation
Description
### Training and test data for humanness evaluation
This data was collected for, and used for training and testing in,
Parkinson / Wang et al. 2024. The data is organized as follows:
- Heavy chain training and multispecies test data (under the heavy chain folder)
  - The consolidated cAb rep file contains the human training sequences
  - The test sample sequences folder contains one fasta file of test sequences per species
- Light chain training and multispecies test data (under the light chain folder)
  - The consolidated cAb rep file contains the human training sequences
  - The test sample sequences folder contains one fasta file of test sequences per species
- Abybank data (under the abybank compiled data folder)
  - This folder contains separate subfolders for heavy and light chains
  - Each subfolder contains test data for a more diverse set of species, with one fasta file per species
- Humanization test data (under the humanization test data folder)
  - The sequences in the parental.fa file were originally humanized as part of drug discovery programs
  - The experimental.fa file contains the humanization results
- IMGT and ADA data (under the imgt test data folder)
  - The imgt mab db fa and tsv files contain sequences and species assignments from IMGT mAb DB
  - The thera ada fa file contains sequences of antibodies evaluated in the clinic
  - The Therapeutic ADA txt file contains anti-drug antibody results for those antibodies
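Since most folders hold one fasta file per species, a small loader can collect them into a dictionary keyed by species. The following is a minimal sketch using only the Python standard library. It assumes that each file's stem names the species and that the files end in `.fa` or `.fasta`; the function names `read_fasta` and `load_species_folder` are hypothetical and are not part of the dataset or of AntPack.

```python
from pathlib import Path
from typing import Dict, Iterator, List, Tuple


def read_fasta(path: Path) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA file."""
    header = None
    chunks: List[str] = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts; emit the previous one first.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                # Sequences may be wrapped over several lines.
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def load_species_folder(
    folder: Path, suffixes: Tuple[str, ...] = (".fa", ".fasta")
) -> Dict[str, List[str]]:
    """Map each file stem (assumed to name the species) to its sequences."""
    sequences: Dict[str, List[str]] = {}
    for fasta_path in sorted(Path(folder).iterdir()):
        if fasta_path.suffix in suffixes:
            sequences[fasta_path.stem] = [seq for _, seq in read_fasta(fasta_path)]
    return sequences
```

Calling `load_species_folder` on one of the test sample sequences folders would then return one list of sequences per species file, provided the extracted folder follows the assumed naming.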
The data was retrieved from the following sources.
1. All heavy and light chain training data is from the cAb-Rep database from [Guo et al.](https://pubmed.ncbi.nlm.nih.gov/31649674/)
2. All testing data is from the Observed Antibody Space [(OAS) database](https://opig.stats.ox.ac.uk/webapps/oas/)
The training and test data provided here has already been filtered for quality. The testing data was additionally randomly sampled to yield a set of 50,000 sequences for each species, then filtered to remove duplicates. The human test data was checked to ensure no overlap with the human training set.
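The sampling, deduplication and overlap check described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' actual pipeline: the function names, the fixed random seed and the use of exact string matching to define duplicates and overlap are all hypothetical choices.

```python
import random
from typing import Iterable, List, Set


def sample_and_deduplicate(
    sequences: List[str], sample_size: int = 50_000, seed: int = 0
) -> List[str]:
    """Randomly sample up to sample_size sequences, then drop exact duplicates."""
    rng = random.Random(seed)  # seed is an assumption, used for reproducibility
    k = min(sample_size, len(sequences))
    sampled = rng.sample(sequences, k)
    # dict.fromkeys keeps the first occurrence of each sequence, in order.
    return list(dict.fromkeys(sampled))


def overlap_with_training(
    test_sequences: Iterable[str], training_sequences: Iterable[str]
) -> Set[str]:
    """Return sequences present in both sets; an empty set means no overlap."""
    return set(test_sequences) & set(training_sequences)
```

Because duplicates are removed after sampling, the final set for a species can contain fewer than 50,000 sequences.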
The IMGT, ADA and humanization test data was retrieved from Prihoda et al. and
the associated [GitHub repo](https://github.com/Merck/BioPhi-2021-publication).
See Parkinson et al. 2024 and the associated GitHub repos for more details on how models other than
SAM / AntPack were evaluated on this data.
Files
| Checksum | Size |
|---|---|
| md5:03e810fa8f9dfde8b37fee05a82874c4 | 1.3 GB |
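The md5 checksum listed above can be used to confirm that the download is complete. Below is a minimal sketch using Python's standard `hashlib` module; the helper names are hypothetical, and the path to the downloaded archive must be supplied by the user, since the file name is not given here.

```python
import hashlib

# Checksum as listed for the 1.3 GB download.
EXPECTED_MD5 = "03e810fa8f9dfde8b37fee05a82874c4"


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so the whole archive never sits in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's md5 digest equals the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

If `matches_expected` returns `False` for the downloaded archive, the file is probably truncated or corrupted and should be downloaded again.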
Additional details
Dates
- Available: 2024-01-23