Published December 22, 2022 | Version v2022.12.14
Dataset Open

AntiRef: reference clusters of human antibody sequences

  • 1. Scripps Research

Description

Motivation: Biases in the human antibody repertoire result in publicly available antibody sequence datasets containing many duplicate or highly similar sequences. These redundant sequences are a barrier to rapid similarity searches and reduce the efficiency with which these datasets can be used to train statistical or machine learning models of human antibodies. Identity-based clustering provides a solution; however, the extremely large size of available antibody repertoire datasets makes such clustering operations computationally intensive and potentially out of reach for many scientists and researchers who would benefit from such data.

Results: AntiRef (Antibody Reference Clusters), which is modeled after UniRef, provides clustered datasets of filtered human antibody sequences. Starting from a dataset of ~335M unique, full-length, productive human antibody sequences from the Observed Antibody Space repository, several AntiRef cluster sets were generated. Due to the modular nature of recombined antibody genes, the clustering thresholds used by UniRef (100, 90 and 50 percent identity) to cluster general protein sequences are suboptimal for antibody clustering. AntiRef provides reference antibody sequence datasets clustered at a range of relevant identity thresholds: 100, 99, 98, 96, 94, 92 and 90 percent. AntiRef90, which uses the lowest clustering threshold of any AntiRef dataset, is roughly one-third the size of the filtered input dataset and less than half the size of the non-redundant AntiRef100.

Datasets: AntiRef comprises a series of datasets, each representing one of several clustering thresholds. AntiRef datasets were generated by a nested clustering procedure similar to that used for UniRef: proceeding in order of decreasing stringency, each round clusters the representative sequences from the preceding round. AntiRef datasets can be found at the following links:

  • AntiRef100: representative sequences resulting from clustering all filtered AntiRef input sequences at 100% identity.
  • AntiRef99: representative sequences resulting from clustering AntiRef100 at 99% identity.
  • AntiRef98: representative sequences resulting from clustering AntiRef99 at 98% identity.
  • AntiRef96: representative sequences resulting from clustering AntiRef98 at 96% identity.
  • AntiRef94: representative sequences resulting from clustering AntiRef96 at 94% identity.
  • AntiRef92: representative sequences resulting from clustering AntiRef94 at 92% identity.
  • AntiRef90: representative sequences resulting from clustering AntiRef92 at 90% identity.
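The nested procedure described above can be illustrated with a small toy sketch. This is not the AntiRef pipeline: the identity function (position-wise matches over the longer length) and the greedy clustering are simplifying assumptions standing in for the actual clustering tool, and all function names are hypothetical. It only shows how each round clusters the representatives of the previous round while every input sequence keeps a traceable assignment.

```python
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Toy identity (assumption): matching positions divided by the longer length."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs: Dict[str, str], threshold: float) -> Dict[str, str]:
    """Assign each sequence ID to the first representative it matches at >= threshold."""
    reps: List[str] = []
    assignment: Dict[str, str] = {}
    # Visit longest sequences first (ties broken by ID) so they become representatives.
    for sid in sorted(seqs, key=lambda s: (-len(seqs[s]), s)):
        for rep in reps:
            if identity(seqs[sid], seqs[rep]) >= threshold:
                assignment[sid] = rep
                break
        else:
            # No existing representative is close enough: start a new cluster.
            reps.append(sid)
            assignment[sid] = sid
    return assignment


def nested_cluster(seqs: Dict[str, str], thresholds: List[float]) -> Dict[float, Dict[str, str]]:
    """Return {threshold: {sequence_id: representative_id}} covering every input sequence."""
    lineage: Dict[float, Dict[str, str]] = {}
    current = dict(seqs)                 # sequences clustered in this round
    prev = {sid: sid for sid in seqs}    # each input's representative so far
    for t in thresholds:                 # thresholds in order of decreasing stringency
        round_assign = greedy_cluster(current, t)
        # Follow each input sequence through its previous representative.
        prev = {sid: round_assign[rep] for sid, rep in prev.items()}
        lineage[t] = prev
        # Only this round's representatives are carried into the next round.
        current = {sid: seqs[sid] for sid in set(round_assign.values())}
    return lineage


toy = {"a": "AAAAAAAAAA", "b": "AAAAAAAAAA", "c": "AAAAAAAAAC", "d": "CCCCCCCCCC"}
result = nested_cluster(toy, [1.0, 0.9])
```

In this toy input, the exact duplicate `b` collapses into `a` at 100% identity, and `c` joins the same cluster only at the looser 90% round, while `d` remains its own representative throughout.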

Files: The following files are included in the primary AntiRef data repository:

  • antiref_cluster-manifest.csv.gz: A compressed CSV file containing the cluster assignments for every sequence in the AntiRef input dataset. For each AntiRef round, cluster names correspond to the sequence ID of the representative sequence (as determined by MMseqs2). The nested clustering process conserves cluster names between iterations, meaning the clustering lineage of any sequence can easily be traced across all AntiRef datasets.
  • download_heavy.txt: A plain text file (generated by the Observed Antibody Space) containing the commands necessary to download all antibody heavy chain sequences used to create AntiRef.
  • download_light.txt: A plain text file (generated by the Observed Antibody Space) containing the commands necessary to download all antibody light chain sequences used to create AntiRef.
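As a rough illustration of tracing a sequence through the cluster manifest, the sketch below streams a gzip-compressed CSV row by row and returns the cluster assignments for one sequence. The column names (`sequence_id`, `antiref100`, `antiref99`) are assumptions made for the example; the actual header of antiref_cluster-manifest.csv.gz should be checked before use. The function names are likewise hypothetical.

```python
import csv
import gzip
import io
from typing import Dict, Iterable, Iterator, Optional


def open_manifest(path: str) -> Iterator[Dict[str, str]]:
    """Stream rows from a gzip-compressed CSV without loading the file into memory."""
    with gzip.open(path, "rt", newline="") as handle:
        yield from csv.DictReader(handle)


def trace_lineage(
    rows: Iterable[Dict[str, str]],
    sequence_id: str,
    id_column: str = "sequence_id",  # assumed column name, not confirmed by the source
) -> Optional[Dict[str, str]]:
    """Return the cluster name per clustering round for one sequence, or None if absent."""
    for row in rows:
        if row[id_column] == sequence_id:
            return {k: v for k, v in row.items() if k != id_column}
    return None


# Small in-memory stand-in for the manifest, using the assumed column names.
demo_csv = (
    "sequence_id,antiref100,antiref99\n"
    "seq1,seq1,seq1\n"
    "seq2,seq1,seq1\n"
    "seq3,seq3,seq1\n"
)
lineage = trace_lineage(csv.DictReader(io.StringIO(demo_csv)), "seq3")
```

With the real file, the rows would come from `open_manifest("antiref_cluster-manifest.csv.gz")` instead of the in-memory demo. Streaming is used here because the manifest is large; a single linear scan avoids holding it in memory, at the cost of rereading the file for each lookup.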

Code: All code used to generate AntiRef (data download, filtering, and clustering) is available under the MIT license on GitHub.

Files (14.1 GB)

  • md5:18fcf6376702a2d23e2acdc32f1fd40a (14.1 GB)
  • md5:adcd9860c554bcb2236bf6228bdec9ed (297.3 kB)
  • md5:d3400adb720639b42b817b1c812c283e (44.1 kB)
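A downloaded file can be checked against the md5 values listed above. The sketch below is one way to do this with Python's standard library, reading in chunks so that a multi-gigabyte file does not need to fit in memory. The helper names are hypothetical, and the expected checksum must be copied from the listing entry for the file being checked.

```python
import hashlib
import io
from typing import BinaryIO


def md5_of_stream(handle: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Compute the md5 hex digest of a binary stream, reading 1 MiB at a time."""
    digest = hashlib.md5()
    for chunk in iter(lambda: handle.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_listing(handle: BinaryIO, listed: str) -> bool:
    """Compare a stream's md5 to a listed value such as 'md5:<hex digest>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_stream(handle) == expected


# In-memory demonstration; for a real download, open the file with open(path, "rb").
demo_ok = matches_listing(io.BytesIO(b"abc"), "md5:900150983cd24fb0d6963f7d28e17f72")
```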