Published July 10, 2023 | Version 1.0.0
Dataset
Open
PFAM Protein Families Dataset for Machine Learning
Description
A cleaned dataset of protein sequences and protein families for classification. The dataset is exported from PFAM as of June 2023 and curated to achieve the following characteristics:
- only protein families with at least 100 sequences are included
- families with more than 2000 sequences are truncated to 2000 randomly chosen sequences
- only proteins with sequence lengths between 100 and 1000 are included
- amino acid sequences are from PDB; chains are concatenated only if not similar
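The family-size and length rules above can be sketched in a few lines of Python. This is a minimal illustration, not the script used to build the dataset: the function name `curate`, the `(family, sequence)` record layout, the inclusive length bounds, and the order of the steps (length filter first, then family-size rules) are all assumptions.

```python
import random
from collections import defaultdict

MIN_PER_FAMILY = 100   # families with fewer sequences are dropped
MAX_PER_FAMILY = 2000  # larger families are randomly subsampled to this size
MIN_LEN, MAX_LEN = 100, 1000  # allowed sequence length (assumed inclusive)


def curate(records, seed=0):
    """Filter (family, sequence) pairs by the rules listed above.

    Hypothetical helper: the record layout and the step order are assumptions.
    """
    rng = random.Random(seed)
    by_family = defaultdict(list)
    for family, seq in records:
        # Keep only sequences within the allowed length range.
        if MIN_LEN <= len(seq) <= MAX_LEN:
            by_family[family].append(seq)

    curated = []
    for family, seqs in sorted(by_family.items()):
        if len(seqs) < MIN_PER_FAMILY:
            continue  # family too small, skip it entirely
        if len(seqs) > MAX_PER_FAMILY:
            # Random subsample without replacement.
            seqs = rng.sample(seqs, MAX_PER_FAMILY)
        curated.extend((family, s) for s in seqs)
    return curated
```

Fixing the seed makes the random truncation reproducible; the source does not state which seed or sampling routine was used.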
The dataset is not balanced; the numbers of sequences per family in PFAM and in the dataset are:
families: 62, sequences: 46872
total (in PFAM) -> included (in dataset)
Number in family ALLERGEN: 122 -> 122
Number in family APOPTOSIS: 381 -> 381
Number in family BIOSYNTHETIC PROTEIN: 346 -> 346
Number in family BIOTIN BINDING PROTEIN: 165 -> 165
Number in family BLOOD CLOTTING: 138 -> 138
Number in family CALCIUM BINDING PROTEIN: 135 -> 135
Number in family CELL ADHESION: 1116 -> 1116
Number in family CELL CYCLE: 511 -> 511
Number in family CHAPERONE: 964 -> 964
Number in family CONTRACTILE PROTEIN: 158 -> 158
Number in family CYTOKINE: 191 -> 191
Number in family DE NOVO PROTEIN: 253 -> 253
Number in family DNA BINDING PROTEIN: 1008 -> 1008
Number in family ELECTRON TRANSPORT: 841 -> 841
Number in family FLUORESCENT PROTEIN: 348 -> 348
Number in family GENE REGULATION: 607 -> 607
Number in family HORMONE: 272 -> 272
Number in family HORMONE GROWTH FACTOR: 159 -> 159
Number in family HORMONE RECEPTOR: 121 -> 121
Number in family HYDROLASE: 19551 -> 2000
Number in family HYDROLASE ANTIBIOTIC: 120 -> 120
Number in family HYDROLASE HYDROLASE INHIBITOR: 2890 -> 2000
Number in family HYDROLASE INHIBITOR: 315 -> 315
Number in family IMMUNE SYSTEM: 3333 -> 2000
Number in family IMMUNOGLOBULIN: 155 -> 155
Number in family ISOMERASE: 2457 -> 2000
Number in family ISOMERASE ISOMERASE INHIBITOR: 139 -> 139
Number in family LECTIN: 139 -> 139
Number in family LIGASE: 1780 -> 1780
Number in family LIGASE LIGASE INHIBITOR: 163 -> 163
Number in family LIPID BINDING PROTEIN: 421 -> 421
Number in family LIPID TRANSPORT: 115 -> 115
Number in family LUMINESCENT PROTEIN: 221 -> 221
Number in family LYASE: 4150 -> 2000
Number in family LYASE LYASE INHIBITOR: 298 -> 298
Number in family MEMBRANE PROTEIN: 1338 -> 1338
Number in family METAL BINDING PROTEIN: 951 -> 951
Number in family METAL TRANSPORT: 409 -> 409
Number in family MOTOR PROTEIN: 195 -> 195
Number in family OXIDOREDUCTASE: 11531 -> 2000
Number in family OXIDOREDUCTASE OXIDOREDUCTASE INHIBITOR: 766 -> 766
Number in family OXYGEN STORAGE: 127 -> 127
Number in family OXYGEN STORAGE TRANSPORT: 260 -> 260
Number in family OXYGEN TRANSPORT: 414 -> 414
Number in family PHOTOSYNTHESIS: 173 -> 173
Number in family PLANT PROTEIN: 255 -> 255
Number in family PROTEIN BINDING: 1613 -> 1613
Number in family PROTEIN TRANSPORT: 693 -> 693
Number in family RECEPTOR: 108 -> 108
Number in family REPLICATION: 161 -> 161
Number in family RNA BINDING PROTEIN: 546 -> 546
Number in family SIGNALING PROTEIN: 2312 -> 2000
Number in family STRUCTURAL PROTEIN: 869 -> 869
Number in family SUGAR BINDING PROTEIN: 1250 -> 1250
Number in family TOXIN: 546 -> 546
Number in family TRANSCRIPTION REGULATION: 3283 -> 2000
Number in family TRANSFERASE: 14724 -> 2000
Number in family TRANSFERASE INHIBITOR: 126 -> 126
Number in family TRANSFERASE TRANSFERASE INHIBITOR: 2465 -> 2000
Number in family TRANSLATION: 370 -> 370
Number in family TRANSPORT PROTEIN: 2782 -> 2000
Number in family VIRAL PROTEIN: 2150 -> 2000
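Family sizes in the dataset range from 108 (RECEPTOR) to 2000, and the balanced files in this record are described as created by oversampling. The sketch below shows one common form, random oversampling with replacement up to the size of the largest class. The function name `oversample` and the choice of target size are assumptions; the source does not state the exact procedure.

```python
import random
from collections import defaultdict


def oversample(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until every class
    has as many samples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for x, y in zip(samples, labels):
        by_label[y].append(x)

    # Target size: the largest class (an assumption, see lead-in).
    target = max(len(xs) for xs in by_label.values())

    out_x, out_y = [], []
    for y, xs in by_label.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y
```

The output is grouped by class, so it would typically be shuffled before training.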
Files:
- families.csv: list of protein families with frequencies
- pfam_46872x62.csv: full dataset with amino acid sequences as strings (one-letter code)
- pfam-trn-xy.csv: training dataset with amino acid sequences encoded as tokens (1..25) and padded to a common length of 1000 with padding token 0:
Amino acid | Token | Description
---|---|---
C | 1 | Cysteine
S | 2 | Serine
T | 3 | Threonine
A | 4 | Alanine
G | 5 | Glycine
P | 6 | Proline
D | 7 | Aspartic acid
E | 8 | Glutamic acid
Q | 9 | Glutamine
N | 10 | Asparagine
H | 11 | Histidine
R | 12 | Arginine
K | 13 | Lysine
M | 14 | Methionine
I | 15 | Isoleucine
L | 16 | Leucine
V | 17 | Valine
W | 18 | Tryptophan
Y | 19 | Tyrosine
F | 20 | Phenylalanine
B | 21 | Aspartic acid or Asparagine
Z | 22 | Glutamic acid or Glutamine
J | 23 | Leucine or Isoleucine
U | 24 | Selenocysteine
X | 25 | Unknown amino acid
. | 0 | padding token
- pfam-trn-labels.csv: plain-text labels for training data
- pfam-tst-xy.csv, pfam-tst-labels.csv: test data
- pfam-balanced-trn-xy.csv, pfam-balanced-trn-labels.csv, pfam-balanced-tst-xy.csv, pfam-balanced-tst-labels.csv: balanced datasets, created by oversampling
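The token table given for pfam-trn-xy.csv can be applied to a raw one-letter sequence as in the following sketch. The token values and the padded length of 1000 come from the description; the names `encode` and `decode`, and the choice to append padding on the right, are assumptions, since the source does not say on which side the padding is placed.

```python
# Amino acid letters in token order: index 0 -> token 1, ..., index 24 -> token 25.
ALPHABET = "CSTAGPDEQNHRKMILVWYFBZJUX"
TOKEN = {aa: i for i, aa in enumerate(ALPHABET, start=1)}
PAD = 0
MAX_LEN = 1000


def encode(seq, max_len=MAX_LEN):
    """Map a one-letter sequence to tokens and pad with 0 (right padding assumed)."""
    tokens = [TOKEN[aa] for aa in seq.upper()]  # KeyError on letters outside the table
    if len(tokens) > max_len:
        raise ValueError(f"sequence longer than {max_len}")
    return tokens + [PAD] * (max_len - len(tokens))


def decode(tokens):
    """Inverse of encode: drop padding and map tokens back to letters."""
    return "".join(ALPHABET[t - 1] for t in tokens if t != PAD)


if __name__ == "__main__":
    encoded = encode("MKV")
    print(encoded[:5])      # first tokens of the padded vector
    print(decode(encoded))  # round trip back to the string
```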
Files (490.5 MB)
Preview: aminoacids.csv
MD5 | Size
---|---
322d38a6e5f9d89661e2ba3107ff9835 | 591 Bytes
1237abb122c6c313e1f4d66f2a3fca6f | 1.5 kB
d7489b425f70839d153e768778c29dd6 | 1.2 MB
b246df9277286d2efe3cd1116dd96316 | 101.3 MB
734ccdf1b1a861a4b2d059c231444af5 | 2.8 MB
d2efd6967e652d02678caf4fe1951583 | 240.0 MB
78a84c18cd8fead883ab999a4c5c44b2 | 310.4 kB
08ef8c6776ddbd77906f3e52343e55b2 | 26.7 MB
6a9f7528421ce5bf1545d05836f87b6b | 1.1 MB
6d04d5bc40a86d3200f0002d23ecef43 | 91.2 MB
6af03c7343fed9cc63afb40aceb77d3a | 117.7 kB
b24222d1669b58948b5dfb9e5926fb0b | 10.1 MB
830a426398e6632cd07dfac5c21ba682 | 15.8 MB
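The MD5 checksums listed above can be used to check a download for corruption. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of` and `verify` are hypothetical, and because file names do not appear next to the checksums in this listing, the pairing of file and checksum in the usage comment is a placeholder rather than a confirmed match.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True if the file matches the checksum.

    Accepts either a bare hex digest or the 'md5:<hex>' form used in the listing.
    """
    expected = expected_md5.strip().lower().split(":")[-1]
    return md5_of(path) == expected


# Usage (placeholder path and checksum; substitute the pair for your file):
# ok = verify("families.csv", "md5:<checksum from the listing>")
```

Reading in chunks keeps memory use constant, which matters for the larger files in this record (up to 240.0 MB).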
Additional details
References
- Pfam: The protein families database in 2021. J. Mistry, S. Chuguransky, L. Williams, M. Qureshi, G.A. Salazar, E.L.L. Sonnhammer, S.C.E. Tosatto, L. Paladin, S. Raj, L.J. Richardson, R.D. Finn, A. Bateman. Nucleic Acids Research (2020). doi: 10.1093/nar/gkaa913