PlasmoFAB
Creators
- 1. University of Tübingen
Description
PlasmoFAB is a curated dataset containing amino acid sequences of proteins expressed by Plasmodium falciparum (Pf). Sequences are separated into antigen candidates and intracellular proteins. PlasmoFAB was created to provide a high-quality training set for machine learning models that will be used for Pf antigen exploration.
We provide PlasmoFAB in the form of two separate CSV files. One file, named PlasmoFAB_pos.csv, contains the positive set, i.e., all protein sequences that are antigen candidates. The other file, named PlasmoFAB_neg.csv, contains the negative set, i.e., all protein sequences that are intracellular proteins. Each sequence has a flag in the data field "test" that indicates whether or not the sequence was used in the test set of the machine learning experiments performed in the corresponding manuscript.
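As an illustration of how the two files and the "test" flag could be used together, the following minimal sketch reads both CSV files, labels rows from the positive file with 1 and rows from the negative file with 0, and separates held-out test rows from the rest. Only the file names and the "test" field come from the description above. The function names, the added "label" key, and the assumption that the flag is stored as a boolean-like string (for example "True" or "1") are hypothetical and should be checked against the actual files.

```python
import csv
from pathlib import Path

# Assumption: the "test" flag is stored as a boolean-like string.
# Adjust this set if the files encode the flag differently.
TRUTHY = {"1", "true", "yes"}


def load_plasmofab(pos_path, neg_path):
    """Read both CSV files and return a list of row dicts.

    Each row gets an extra 'label' key (hypothetical name):
    1 for the positive file, 0 for the negative file.
    """
    rows = []
    for path, label in ((pos_path, 1), (neg_path, 0)):
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                row["label"] = label
                rows.append(row)
    return rows


def is_test_row(row):
    """Return True if the row's 'test' flag marks it as a test sequence."""
    return str(row["test"]).strip().lower() in TRUTHY


def split_by_test_flag(rows):
    """Split rows into (train, test) lists according to the 'test' flag."""
    train = [row for row in rows if not is_test_row(row)]
    test = [row for row in rows if is_test_row(row)]
    return train, test


if __name__ == "__main__":
    pos = Path("PlasmoFAB_pos.csv")
    neg = Path("PlasmoFAB_neg.csv")
    # Only run when both files are present in the working directory.
    if pos.exists() and neg.exists():
        train, test = split_by_test_flag(load_plasmofab(pos, neg))
        print(len(train), "training rows;", len(test), "test rows")
```

The sketch uses only the Python standard library, so it does not depend on any particular data-frame package. Keeping the split tied to the "test" flag, rather than drawing a new random split, is what would make results comparable with the experiments in the corresponding manuscript.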
Additionally, the file PlasmoFAB_datasheet.md contains a datasheet for PlasmoFAB as introduced in Gebru, Timnit, et al. "Datasheets for datasets." Communications of the ACM 64.12 (2021): 86-92.
Files
PlasmoFAB_Datasheet.md
Additional details
Related works
- Is compiled by
- Software: https://github.com/msmdev/PlasmoFAB
- Is published in
- Preprint: https://arxiv.org/abs/2301.06454