Assessing the feasibility of machine learning for ancient DNA age prediction: limitations and insights. Code and data.
Description
DNA Age Prediction: Machine Learning Analysis of Ancient DNA Damage Profiles
This repository contains the source code, processed datasets, and supplementary materials associated with the manuscript:
"Assessing the feasibility of machine learning for ancient DNA age prediction: limitations and insights"
Overview
The project investigates whether the chronological age of ancient biological samples can be predicted directly from ancient DNA (aDNA) damage signatures using supervised machine learning. DNA damage statistics were extracted from sequencing alignments using DamageProfiler and used as input features for regression models, including XGBoost, Random Forest, Support Vector Regression, Elastic Net, Ridge Regression, Lasso, Bayesian Ridge, k-Nearest Neighbors, and Decision Trees.
The study evaluates the predictive value of DNA damage patterns while controlling for technical batch effects associated with different sequencing studies and laboratories.
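One common way to keep batch effects out of performance estimates is to hold out all samples from one study at a time, so that a model is never tested on a batch it saw during training. The following is a minimal sketch of that idea using only the standard library. It is not the repository's own validation code, and the batch labels and the function name `leave_one_batch_out` are hypothetical. In a scikit-learn workflow, `LeaveOneGroupOut` or `GroupKFold` would play the same role.

```python
from collections import defaultdict


def leave_one_batch_out(batches):
    """Yield (held_out_batch, train_idx, test_idx) for each distinct batch.

    `batches` holds one batch/publication label per sample, in sample order.
    """
    by_batch = defaultdict(list)
    for i, label in enumerate(batches):
        by_batch[label].append(i)

    for held_out in sorted(by_batch):
        test_idx = by_batch[held_out]
        # Training indices are all samples whose batch differs from the held-out one.
        train_idx = sorted(
            i for label, idx in by_batch.items() if label != held_out for i in idx
        )
        yield held_out, train_idx, test_idx


# Hypothetical batch assignments for six samples (illustration only).
batches = ["study_A", "study_A", "study_B", "study_C", "study_B", "study_C"]
splits = list(leave_one_batch_out(batches))

for held_out, train_idx, test_idx in splits:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Each split can then be used to fit a regressor on the training indices and score it on the held-out study, which gives an estimate of how well age predictions transfer to data from an unseen laboratory.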
Repository Contents
Source Code
- Data preprocessing and normalization scripts
- Feature engineering and filtering pipelines
- Machine learning training and evaluation code
- Cross-validation and batch-aware validation procedures
- PCA dimensionality reduction workflows
- Feature importance analysis
- Figure generation scripts
- Supplementary analyses used in the manuscript
Processed Data
The repository includes processed feature tables generated from DamageProfiler outputs. These tables contain DNA damage statistics and misincorporation profiles used as machine-learning input features in the study.
The processed datasets include:
- DamageProfiler-derived feature matrices
- Sample metadata
- Chronological age labels
- Batch/publication assignments
- Intermediate analysis tables required to reproduce the reported results
Raw Data
The original BAM files and sequencing reads are not redistributed through this repository because they are already available from public repositories and can be very large.
The underlying ancient DNA sequencing data can be obtained from:
- Allen Ancient Genome Diversity Project (AAGDP)
- European Nucleotide Archive (ENA)
- NCBI Sequence Read Archive (SRA)
Accession numbers for all samples are provided in the manuscript and supplementary materials.
Reproducibility
The repository contains all processed data and source code necessary to reproduce the machine learning experiments, statistical analyses, tables, and figures presented in the associated publication. Reproduction of the analyses does not require access to the original BAM files, as the processed DamageProfiler feature tables used for model development and evaluation are included.
License
See the repository license for terms of use and redistribution.
Files
| Name | Size | MD5 checksum |
|---|---|---|
| GEN_AGE.zip | 560.3 MB | 76dfe940b2849fb6dc16129616ca0f7b |
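The MD5 checksum listed for GEN_AGE.zip can be used to confirm that a download is complete before unpacking it. Below is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_of_file` and `verify_archive` are illustrative, and the local path passed to them is an assumption about where the archive was saved.

```python
import hashlib

# Checksum as listed for GEN_AGE.zip on this record.
EXPECTED_MD5 = "76dfe940b2849fb6dc16129616ca0f7b"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Chunked reads keep memory use flat for a multi-hundred-megabyte archive.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify_archive("GEN_AGE.zip")` returns `True` only if the file in the current directory matches the listed checksum. A `False` result usually indicates a truncated or corrupted download.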
Additional details
Software
- Programming language: Python