Published March 5, 2026 | Version 2
Dataset Open

Tree reconstruction guarantees from CRISPR-Cas9 lineage tracing data using Neighbor-Joining

Authors/Creators

Description

 

<https://www.biorxiv.org/content/10.1101/2024.08.27.610007v1>

 

Our simulated trees paired with lineage tracing data encompass a large number of lineage tracing regimes, which are used to assess the performance of our distance correction method.
 
Compared with version 1 of the dataset, this version increases the number of trees (i.e. repetitions) per regime from 50 to 250 and adds trees with noisy character matrices, which allow probing the robustness of algorithms to technical effects such as sequencing errors; see below for details.

 

## Description of the data and file structure

 

For each lineage tracing regime, 250 simulations are performed. All trees have exactly 400 leaves and were simulated as described in the manuscript. The 'default' regime consists of:

 

* 40 characters.
* Mutation rate adjusted to obtain an expected 50% mutated entries in the character matrix.
* 100 indel states.
* 20% missing data, with 10% coming from heritable epigenetic silencing and 10% coming from sequencing dropouts. (This does not include missing data further introduced by double-resection events, which we also simulate.)
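
As a rough illustration of how a mutation rate could be tuned to reach a target mutated fraction, the sketch below assumes that each character mutates irreversibly after an exponentially distributed waiting time and that all leaves sit at the same depth `tree_depth`. The model, the function name `mutation_rate_for_target`, and its parameters are assumptions made for illustration only; the procedure actually used is the one described in the manuscript.

```python
import math


def mutation_rate_for_target(expected_mutated: float, tree_depth: float = 1.0) -> float:
    """Return the rate r solving 1 - exp(-r * tree_depth) = expected_mutated.

    Assumption (not taken from the dataset description): mutations are
    irreversible and arrive after an exponential waiting time with rate r,
    so a character at depth tree_depth is mutated with probability
    1 - exp(-r * tree_depth).
    """
    if not 0.0 <= expected_mutated < 1.0:
        raise ValueError("expected_mutated must lie in [0, 1)")
    if tree_depth <= 0.0:
        raise ValueError("tree_depth must be positive")
    # log1p(-x) computes log(1 - x) accurately for small x.
    return -math.log1p(-expected_mutated) / tree_depth


# Under these assumptions, the default 50% target corresponds to a rate of ln(2).
default_rate = mutation_rate_for_target(0.5)
```

Under this assumed model, higher targets such as 90% require disproportionately larger rates, since the rate grows like `-log(1 - target)`.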

 

Each lineage tracing regime is obtained by perturbing this 'default' lineage tracing regime by varying one of the above parameters. Specifically, we consider varying:

 

* number of characters (a.k.a. barcodes) in the set {10, 20, 40, 60, 90, 150} (with 40 being the default)
* number of states in the set {5, 10, 25, 50, 100, 500, 1000} (with 100 being the default)
* expected proportion mutated in the set {10%, 30%, 50%, 70%, 90%} (with 50% being the default)
* percent missing from epigenetic silencing and sequencing dropouts in the set {0%, 10%, 20%, 30%, 40%, 50%, 60%}, with the percent coming from sequencing dropouts fixed to 10% (except when the total is 0%, in which case it is set to 0%)
* "sequencing error fraction" $p$ in the set {0, 0.001, 0.003, 0.01, 0.03, 0.1} (with 0 being the default). These simulations add noise to the character matrix to probe the algorithm's robustness to effects such as sequencing errors and other artifacts. To simulate this noise, each entry $X_{ij} \ge 1$ in the character matrix is replaced, with probability $p$, by a different state drawn uniformly from the set $\{1, 2, \dots, \text{number\_of\_states}\} \setminus \{X_{ij}\}$.
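
The noise model in the last item can be sketched as follows. This is a minimal illustration rather than the code used to generate the dataset: the function name `add_sequencing_noise`, the list-of-lists matrix representation, and the convention that entries below 1 (for example an unmutated or missing marker) are left untouched are assumptions made here.

```python
import random


def add_sequencing_noise(matrix, number_of_states, p, seed=None):
    """Return a noisy copy of an integer character matrix.

    Each entry x >= 1 is replaced, with probability p, by a state drawn
    uniformly from {1, ..., number_of_states} excluding x. Entries below 1
    are copied unchanged (assumed to encode unmutated or missing data).
    Assumes every entry x >= 1 satisfies x <= number_of_states.
    """
    if number_of_states < 2:
        raise ValueError("need at least two states to substitute a different one")
    rng = random.Random(seed)
    noisy = []
    for row in matrix:
        new_row = []
        for x in row:
            if x >= 1 and rng.random() < p:
                # Draw from the number_of_states - 1 remaining states:
                # sample 1..n-1, then shift values at or above x up by one.
                y = rng.randint(1, number_of_states - 1)
                if y >= x:
                    y += 1
                new_row.append(y)
            else:
                new_row.append(x)
        noisy.append(new_row)
    return noisy
```

For example, `add_sequencing_noise(matrix, number_of_states=100, p=0.01, seed=0)` would perturb roughly 1% of the mutated entries while leaving the input matrix itself unmodified.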

 

The data for each simulation are stored in a directory named after the parameter that was varied and its value; for example, the simulated data when the number of barcodes is 30 is stored under "trees/number_of_cassettes/30/". In this directory, for each repetition, we have three files:

 

* tree_{repetition}_character_matrix.csv : Contains the lineage tracing data in csv format.
* tree_{repetition}_newick.txt : Contains the tree in newick format, with branch lengths.
* tree_{repetition}_CassiopeiaTree.pkl : Contains the pickled CassiopeiaTree object from the simulation, which in particular contains the fitness of different nodes in the tree, ancestral lineage tracing barcodes, etc. It is not necessary for reproducing any of our results, but we provide it in case it is convenient.
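
The layout above can be navigated with a small helper such as the one sketched below. The function names `repetition_files` and `read_character_matrix` are hypothetical, and several details are assumptions: the dataset root location, whether repetitions are numbered from 0 or 1, and the directory names for parameters other than `number_of_cassettes` (the only one given above). The CSV reader returns raw strings and does not interpret the header or any index column, since the exact column layout is not specified here.

```python
import csv
from pathlib import Path


def repetition_files(root, parameter, value, repetition):
    """Build the three expected file paths for one simulated tree.

    root is wherever the archive was extracted (an assumption); parameter
    is a directory name such as "number_of_cassettes".
    """
    directory = Path(root) / "trees" / parameter / str(value)
    prefix = f"tree_{repetition}"
    return {
        "character_matrix": directory / f"{prefix}_character_matrix.csv",
        "newick": directory / f"{prefix}_newick.txt",
        "cassiopeia_tree": directory / f"{prefix}_CassiopeiaTree.pkl",
    }


def read_character_matrix(path):
    """Read a CSV file into (header, rows) as lists of strings."""
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle))
    if not rows:
        return [], []
    return rows[0], rows[1:]


# Example: paths for one repetition of the regime used as an example above.
files = repetition_files("data", "number_of_cassettes", 30, 0)
```

The Newick file can be read as plain text. Unpickling the `.pkl` file requires the class definition of the pickled object to be importable, so the Cassiopeia package presumably needs to be installed first.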

 

## Code/Software

 

We have additionally open-sourced a repository allowing seamless reproduction of all results in our paper, here:

 

https://github.com/songlab-cal/nj-theory



Files (607.0 MB)

md5:ba1b90ab4a860eddd6cc84d3715275dd

Additional details

Software

* Repository URL: https://github.com/songlab-cal/nj-theory
* Programming language: Python
* Development Status: Active