Published May 30, 2026 | Version v1
Software Open

Assessing the feasibility of machine learning for ancient DNA age prediction: limitations and insights. Code and data.

Description

DNA Age Prediction: Machine Learning Analysis of Ancient DNA Damage Profiles

This repository contains the source code, processed datasets, and supplementary materials associated with the manuscript:

"Assessing the feasibility of machine learning for ancient DNA age prediction: limitations and insights"

Overview

The project investigates whether the chronological age of ancient biological samples can be predicted directly from ancient DNA (aDNA) damage signatures using supervised machine learning. DNA damage statistics were extracted from sequencing alignments using DamageProfiler and used as input features for regression models, including XGBoost, Random Forest, Support Vector Regression, Elastic Net, Ridge Regression, Lasso, Bayesian Ridge, k-Nearest Neighbors, and Decision Trees.

The study evaluates the predictive value of DNA damage patterns while controlling for technical batch effects associated with different sequencing studies and laboratories.
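One common way to control for batch effects during evaluation is to hold out every sample from one study at a time, so that a model is never tested on a batch it saw during training. The sketch below illustrates that idea in plain Python; it is not taken from the repository, and the function name `leave_one_batch_out` and the batch labels are hypothetical. scikit-learn's `LeaveOneGroupOut` splitter provides the same behavior for array inputs.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def leave_one_batch_out(
    batches: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_batch, train_indices, test_indices) once per batch.

    All samples from the held-out batch go to the test set, so training
    and test samples never share a source study.
    """
    # Group sample indices by their batch label.
    by_batch: Dict[str, List[int]] = defaultdict(list)
    for index, batch in enumerate(batches):
        by_batch[batch].append(index)

    # Hold out each batch in turn (sorted for a deterministic order).
    for held_out in sorted(by_batch):
        test_idx = by_batch[held_out]
        train_idx = sorted(
            i for batch, idx in by_batch.items() if batch != held_out for i in idx
        )
        yield held_out, train_idx, test_idx


# Hypothetical batch labels: one entry per sample, naming its source study.
example_batches = ["study_A", "study_A", "study_B", "study_C", "study_B"]

for held_out, train_idx, test_idx in leave_one_batch_out(example_batches):
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Compared with a random K-fold split, this scheme tends to give a more conservative error estimate, because any signal that merely identifies the source study cannot help on the held-out batch.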

Repository Contents

Source Code

  • Data preprocessing and normalization scripts
  • Feature engineering and filtering pipelines
  • Machine learning training and evaluation code
  • Cross-validation and batch-aware validation procedures
  • PCA dimensionality reduction workflows
  • Feature importance analysis
  • Figure generation scripts
  • Supplementary analyses used in the manuscript

Processed Data

The repository includes processed feature tables generated from DamageProfiler outputs. These tables contain DNA damage statistics and misincorporation profiles used as machine-learning input features in the study.

The processed datasets include:

  • DamageProfiler-derived feature matrices
  • Sample metadata
  • Chronological age labels
  • Batch/publication assignments
  • Intermediate analysis tables required to reproduce the reported results
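As an illustration of how these tables fit together, the sketch below joins a feature matrix to its metadata on a shared sample identifier. The column names (`sample_id`, `ct_5p_pos1`, `age_years_bp`, `batch`), the values, and the helper `join_on_sample_id` are hypothetical placeholders; the actual file names and schema are defined by the files in the archive.

```python
import csv
import io
from typing import Dict, List

# Hypothetical in-memory tables standing in for the processed CSV files.
FEATURES_CSV = """sample_id,ct_5p_pos1
S1,0.21
S2,0.08
S3,0.15
"""

METADATA_CSV = """sample_id,age_years_bp,batch
S1,4500,study_A
S2,900,study_B
"""


def join_on_sample_id(
    features_csv: str, metadata_csv: str, key: str = "sample_id"
) -> List[Dict[str, str]]:
    """Inner-join two CSV tables on a shared sample identifier column."""
    # Index the metadata rows by sample identifier for constant-time lookup.
    metadata = {row[key]: row for row in csv.DictReader(io.StringIO(metadata_csv))}

    joined = []
    for row in csv.DictReader(io.StringIO(features_csv)):
        meta = metadata.get(row[key])
        if meta is None:
            continue  # drop samples that lack an age label or batch assignment
        joined.append({**row, **meta})
    return joined


for record in join_on_sample_id(FEATURES_CSV, METADATA_CSV):
    print(record)
```

Values stay as strings here; a real workflow would convert the feature and age columns to numeric types before model fitting.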

Raw Data

The original BAM files and sequencing reads are not redistributed through this repository because they are large and already available from public repositories.

The underlying ancient DNA sequencing data can be obtained from:

  • Allen Ancient Genome Diversity Project (AAGDP)
  • European Nucleotide Archive (ENA)
  • NCBI Sequence Read Archive (SRA)

Accession numbers for all samples are provided in the manuscript and supplementary materials.

Reproducibility

The repository contains all processed data and source code necessary to reproduce the machine learning experiments, statistical analyses, tables, and figures presented in the associated publication. Reproduction of the analyses does not require access to the original BAM files, as the processed DamageProfiler feature tables used for model development and evaluation are included.

License

See the repository license for terms of use and redistribution.

Files

GEN_AGE.zip (560.3 MB)
md5:76dfe940b2849fb6dc16129616ca0f7b
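After downloading, the archive can be checked against the MD5 checksum listed above. The following is a minimal sketch using only the Python standard library; the local path `GEN_AGE.zip` in the current directory is an assumption about where the file was saved, and `md5_of_file` is an illustrative helper name.

```python
import hashlib
from pathlib import Path

# Checksum published for GEN_AGE.zip on this record.
EXPECTED_MD5 = "76dfe940b2849fb6dc16129616ca0f7b"


def md5_of_file(path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a large archive never sits fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    archive = Path("GEN_AGE.zip")  # assumed download location
    if archive.exists():
        actual = md5_of_file(archive)
        status = "OK" if actual == EXPECTED_MD5 else "MISMATCH"
        print(f"{archive}: {status} ({actual})")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the archive should be fetched again.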

Additional details

Software

Programming language
Python