Published December 22, 2023 | Version v1
Dataset Open

Enhancing Recognition and Interpretation of Functional Phenotypic Sequences through Fine-Tuning Pre-Trained Genomic Models

Authors/Creators

Description

Decoding human genomic sequences requires comprehensive analysis of DNA sequence functionality. Through computational and experimental approaches, researchers study the genotype-phenotype relationship and generate important datasets that help unravel complicated genetic blueprints. This study explores the use of deep learning, particularly pre-trained models such as DNA_bert_6 and human_gpt2-v1, in interpreting and representing human genome sequences. We construct multiple datasets linking genotypes and phenotypes and use them to fine-tune the pre-trained models for DNA sequence classification. We focus in particular on the human endogenous retrovirus (HERV) dataset, on which the fine-tuned models classify well (binary and multi-class classification accuracy and F1 values above 0.935 and 0.888, respectively). Using the HERV dataset, we evaluate the influence of sequence length on classification results and analyze the impact of feature extraction in the model's hidden layers. To further understand the phenotype-specific patterns learned by the model, we perform enrichment, pathogenicity and conservation analyses of specific motifs in the HERV sequences with high average local representation weight (ALRW) scores. Overall, the results provide numerous benchmark genotype-phenotype datasets for evaluating the performance of genomic models. The findings highlight the potential of large models in learning DNA sequence representations, particularly on the HERV dataset, and provide insights for future research. This work combines pre-trained model representations with classical omics methods to analyze the functionality of genome sequences, fostering cross-fertilization between genomics and artificial intelligence.
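The description reports accuracy and F1 values for the binary and multi-class HERV classifiers. As a minimal sketch of how such scores can be computed from predicted and true labels, the pure-Python functions below implement accuracy and a macro-averaged F1. The record does not state which F1 averaging was used, so macro averaging is an assumption here, and the function names and the example labels are hypothetical, not taken from the dataset.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is an assumption)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN)
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for illustration only; not data from this record.
example_true = ["HERV", "HERV", "non-HERV", "non-HERV"]
example_pred = ["HERV", "non-HERV", "non-HERV", "non-HERV"]
example_accuracy = accuracy(example_true, example_pred)
example_macro_f1 = macro_f1(example_true, example_pred)
```

The same functions work unchanged for the multi-class case, since the set of labels is derived from the inputs rather than fixed to two classes.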

Files

Files (2.4 GB)

Name: 001.Result.zip
Size: 2.4 GB
md5:c834c895e5aeaaeebd652b7c1f4cbd9c
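Because the archive is large (2.4 GB), it can be worth checking a download against the md5 value listed above before unpacking it. The following is a minimal sketch using Python's standard `hashlib` module; the function names are hypothetical, and the local path passed to them is whatever location the archive was saved to.

```python
import hashlib

# md5 value copied from the file listing of this record.
EXPECTED_MD5 = "c834c895e5aeaaeebd652b7c1f4cbd9c"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, expected=EXPECTED_MD5):
    """True if the file's md5 equals the expected digest (case-insensitive)."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_listing("001.Result.zip")` would return `True` for an intact copy saved under that name in the current directory, and `False` for a truncated or corrupted one.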