Published January 19, 2023 | Version 1.0.0
Dataset Open

PlasmoFAB

Description

PlasmoFAB is a curated dataset containing amino acid sequences of proteins expressed by Plasmodium falciparum (Pf). Sequences are separated into antigen candidates and intracellular proteins. PlasmoFAB was created to provide a high-quality training set for machine learning models used for Pf antigen exploration.

We provide PlasmoFAB in the form of two separate CSV files. One file, named PlasmoFAB_pos.csv, contains the positive set, i.e., all protein sequences that are antigen candidates. The other file, named PlasmoFAB_neg.csv, contains the negative set, i.e., all protein sequences that are intracellular proteins. Each sequence has a flag in the data field "test" that indicates whether the sequence was used in the test set of the machine learning experiments performed in the corresponding manuscript.
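As an illustration of how the two files and the "test" flag might be used together, the following minimal sketch reads both CSV files with the Python standard library, attaches a class label (1 for the positive set, 0 for the negative set), and splits the rows by the "test" field. Only the file names and the "test" field come from the description above. The function name load_plasmofab, the added "label" key, and the accepted spellings of the flag (such as "1" or "True") are assumptions for illustration and should be checked against the actual files and the datasheet.

```python
import csv
from pathlib import Path

# Assumption: the "test" flag is stored as one of these strings (case-insensitive).
# Check the actual encoding in the CSV files before relying on this.
TRUTHY = {"1", "true", "t", "yes"}


def load_plasmofab(data_dir="."):
    """Return (train_rows, test_rows) as lists of dicts.

    Each row keeps the original CSV columns and gains a hypothetical
    'label' key: 1 for PlasmoFAB_pos.csv, 0 for PlasmoFAB_neg.csv.
    """
    train_rows, test_rows = [], []
    files = (("PlasmoFAB_pos.csv", 1), ("PlasmoFAB_neg.csv", 0))
    for filename, label in files:
        path = Path(data_dir) / filename
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                row["label"] = label
                # Rows whose "test" flag is truthy go to the test split.
                flag = str(row.get("test", "")).strip().lower()
                if flag in TRUTHY:
                    test_rows.append(row)
                else:
                    train_rows.append(row)
    return train_rows, test_rows
```

Calling load_plasmofab("path/to/download") returns the two splits. Reusing the published "test" flag rather than drawing a new random split keeps results comparable with the experiments in the corresponding manuscript.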

Additionally, the file PlasmoFAB_datasheet.md contains a datasheet for PlasmoFAB as introduced in Gebru, Timnit, et al. "Datasheets for datasets." Communications of the ACM 64.12 (2021): 86-92.

Notes

Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy – EXC number 2064/1 – Project number 390727645. This research was supported by the German Federal Ministry of Education and Research (BMBF) project 'Training Center Machine Learning, Tübingen' with grant number 01IS17054. This work was supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A.

Files

PlasmoFAB_Datasheet.md

Files (727.7 kB)

Checksum | Size
md5:6f7f7976982969b0be1e2f4250889194 | 13.7 kB
md5:54e74c7aa3724efe11d0cd4b35c2f722 | 389.5 kB
md5:df336d7f355162288429e338ca9f8473 | 324.5 kB
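
The MD5 checksums listed above can be used to verify a download. The sketch below is one possible way to do this with Python's hashlib. The function name md5_matches is hypothetical, and because the listing does not show which checksum belongs to which file, the pairing of file and checksum is left to the user to confirm on the record page.

```python
import hashlib


def md5_matches(path, expected, chunk_size=1 << 20):
    """Return True if the MD5 digest of the file at `path` equals `expected`.

    `expected` may be given with or without the 'md5:' prefix used in the listing.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    expected_hex = expected.strip().lower().split(":")[-1]
    return digest.hexdigest() == expected_hex
```

For example, md5_matches("PlasmoFAB_pos.csv", "md5:...") with the checksum copied from the listing returns True only if the local file is byte-identical to the published one.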

Additional details

Related works

Is compiled by
Software: https://github.com/msmdev/PlasmoFAB (URL)
Is published in
Preprint: https://arxiv.org/abs/2301.06454 (URL)