FEDKEA: Enzyme function prediction with a large pretrained protein language model and distance-weighted k-nearest neighbor
Description
Recent advancements in sequencing technologies have led to the identification of a vast number of new protein sequences, surpassing current experimental capabilities for annotation. Enzymes, crucial for diverse biological functions, have garnered significant attention; however, accurately predicting enzyme EC numbers for proteins with unknown functions remains challenging. Here, we introduce FEDKEA, a novel computational model that integrates ESM-2 embeddings and a distance-weighted k-nearest neighbor (KNN) classifier to enhance enzyme function annotation. FEDKEA employs a fine-tuned ESM-2 model with four layers for binary enzyme classification. For predicting EC numbers, it adopts a hierarchical approach, utilizing distinct models and training strategies across the four EC number levels. Specifically, the first EC number level classification utilizes a fine-tuned ESM-2 model with three layers, while transfer learning with embeddings from this model supports second and third-level tasks. The fourth-level classification employs a distance-weighted KNN model. Compared to existing tools such as CLEAN and ECRECer, FEDKEA demonstrates superior performance. We anticipate that FEDKEA will significantly advance the prediction of enzyme functions for uncharacterized proteins, thereby impacting fields such as genomics, physiology and medicine.
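The fourth-level step described above relies on a distance-weighted KNN vote over protein embeddings. The following is a minimal sketch of that general technique, not FEDKEA's actual implementation: the Euclidean metric, the inverse-distance weighting, the value of `k`, the function name `distance_weighted_knn`, and the two-dimensional toy embeddings and EC labels are all assumptions made for illustration.

```python
from collections import defaultdict

import numpy as np


def distance_weighted_knn(query, train_embeddings, train_labels, k=5, eps=1e-8):
    """Predict a label for `query` by inverse-distance-weighted voting.

    Assumptions (not taken from the FEDKEA description): Euclidean distance
    and a 1 / (distance + eps) weight for each neighbor.
    """
    # Distance from the query to every training embedding.
    dists = np.linalg.norm(train_embeddings - query, axis=1)
    # Indices of the k closest training points.
    k = min(k, len(train_labels))
    nearest = np.argsort(dists)[:k]
    # Each neighbor votes for its label; closer neighbors count more.
    votes = defaultdict(float)
    for i in nearest:
        votes[train_labels[i]] += 1.0 / (dists[i] + eps)
    return max(votes, key=votes.get)


# Toy data: 2-D vectors stand in for real protein embeddings,
# and the EC numbers are placeholders chosen for the example.
train_embeddings = np.array(
    [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0], [0.9, 1.0]]
)
train_labels = ["1.1.1.1", "1.1.1.1", "1.1.1.2", "1.1.1.2", "1.1.1.2"]

prediction = distance_weighted_knn(
    np.array([0.05, 0.0]), train_embeddings, train_labels, k=5
)
print(prediction)
```

With `k=5`, a plain majority vote would pick the label held by three of the five neighbors, whereas the distance weighting favors the two much closer neighbors. This is the usual motivation for weighting votes when fine-grained classes are unevenly represented.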
Files
(7.6 GB)
| Checksum | Size |
|---|---|
| md5:c2c3ed8c457b6f44eba81d4e8a4bc848 | 5.6 GB |
| md5:40e7985a6a799596d6351efef8aa173e | 2.0 GB |
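The md5 values listed for the files can be used to confirm that a multi-gigabyte download completed without corruption. The sketch below shows one way to do this with Python's standard `hashlib`; the helper names `file_md5` and `matches_checksum` and the path in the usage comment are hypothetical, since the file names are not given here.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex md5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until the file is exhausted.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's md5 with a listed value such as 'md5:c2c3...'."""
    # Strip the optional 'md5:' prefix used in the file listing.
    expected_hex = expected.strip().lower()
    if expected_hex.startswith("md5:"):
        expected_hex = expected_hex[len("md5:"):]
    return file_md5(path) == expected_hex


# Usage (the path is a placeholder for wherever the file was saved):
# ok = matches_checksum("path/to/downloaded_file",
#                       "md5:c2c3ed8c457b6f44eba81d4e8a4bc848")
```

A mismatch typically means the download was truncated or altered, in which case fetching the file again is the simplest remedy.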