Supplementary data for CRISPR spacer-protospacer matching benchmarks

Neri, Uri; Pedro Camargo, Antônio; Roux, Simon; Bushnell, Brian

doi:10.5281/zenodo.15171878

Published April 8, 2025 | Version v1

Dataset Open

Raw Outputs for CRISPR Spacer Matching Benchmark Study

This dataset contains the raw outputs from the sequence alignment tools used in benchmarking protospacer identification. The data is organized into two main categories: simulated data (random sequences) and real data (IMG/VRv4). These are split into two tar.zst files for ease of download (see their directory trees below).

For more information or for the exact commands used to run the tools, please see the git repository folder "tool_configs".

Simulated Run Directory Naming Convention:

`run_t_{threads}_nc_{n_contigs}_ns_{n_spacers}_ir_{min_insertions}_{max_insertions}_lm_{min_mismatches}_{max_mismatches}_prc_{prop_rc}`

Where:
- t: Number of threads used
- nc: Number of contigs generated
- ns: Number of spacers generated
- ir: Insertion range (min and max insertions per spacer, i.e. the range used to simulate the total number of occurrences of each spacer in the reference (contig) file)
- lm: Length/mismatch range (min and max mismatches allowed)
- prc: Proportion of reverse complement insertions
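
The naming convention above can be unpacked programmatically. The following is a minimal sketch, assuming every run directory follows the template exactly; the function name `parse_run_dir` and the dictionary keys are illustrative and not taken from the repository.

```python
import re

# Regex mirroring the run directory template; group names are illustrative.
RUN_DIR_PATTERN = re.compile(
    r"^run_t_(?P<threads>\d+)"
    r"_nc_(?P<n_contigs>\d+)"
    r"_ns_(?P<n_spacers>\d+)"
    r"_ir_(?P<min_insertions>\d+)_(?P<max_insertions>\d+)"
    r"_lm_(?P<min_mismatches>\d+)_(?P<max_mismatches>\d+)"
    r"_prc_(?P<prop_rc>\d+(?:\.\d+)?)$"
)


def parse_run_dir(name: str) -> dict:
    """Split a run directory name into its simulation parameters."""
    match = RUN_DIR_PATTERN.match(name.rstrip("/"))
    if match is None:
        raise ValueError(f"not a recognised run directory name: {name!r}")
    fields = match.groupdict()
    # prop_rc is a proportion (float); every other field is an integer count.
    prop_rc = float(fields.pop("prop_rc"))
    parsed = {key: int(value) for key, value in fields.items()}
    parsed["prop_rc"] = prop_rc
    return parsed


if __name__ == "__main__":
    print(parse_run_dir("run_t_25_nc_40000_ns_1030_ir_1_2205_lm_0_0_prc_0.5"))
```

Parsing the names this way makes it straightforward to group runs by mismatch range or insertion range when iterating over the `sims` folder.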

Compressed Formats:

FASTA files (.fa.gz): Compressed with bgzip. To decompress, use `gunzip -c file.fa.gz > file.fa`.
SAM files (.sam.gz): Compressed with bgzip. To decompress, use `gunzip -c file.sam.gz > file.sam`. Note that not all tools tested conformed to SAM v1.4 or produced the extended CIGAR string.
TSV files (.tsv.zst): Compressed with zstd. To decompress, use `zstd -d file.tsv.zst`. Note that the blastn and mmseqs results use a slightly modified "m6" output format with the following columns: "qaccver", "saccver", "nident", "length", "mismatch", "qlen", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore".
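
The modified "m6" column layout listed above can be read with the Python standard library once a file has been decompressed. This is a minimal sketch: it assumes tab-separated rows in the listed column order and skips a header line if one is present (whether the files carry a header is not stated here). The function name `read_m6` and the example row are made up for illustration.

```python
import csv
import io

# Column order of the modified "m6" format, as listed above.
M6_COLUMNS = [
    "qaccver", "saccver", "nident", "length", "mismatch", "qlen",
    "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_COLUMNS = ("nident", "length", "mismatch", "qlen", "gapopen",
               "qstart", "qend", "sstart", "send")
FLOAT_COLUMNS = ("evalue", "bitscore")


def read_m6(handle):
    """Yield one dict per alignment row from an open text handle."""
    reader = csv.DictReader(handle, fieldnames=M6_COLUMNS, delimiter="\t")
    for row in reader:
        if row["qaccver"] == "qaccver":
            continue  # assumption: tolerate an optional header line
        for column in INT_COLUMNS:
            row[column] = int(row[column])
        for column in FLOAT_COLUMNS:
            row[column] = float(row[column])
        yield row


if __name__ == "__main__":
    # Made-up row, for illustration only (not taken from the dataset).
    example = "spacer_1\tcontig_7\t30\t30\t0\t30\t0\t1\t30\t101\t130\t1e-10\t56.5\n"
    for hit in read_m6(io.StringIO(example)):
        print(hit["qaccver"], hit["saccver"], hit["mismatch"])
```

For real files, the same function can read from standard input, for example by piping the output of `zstd -dc blastn_output.tsv.zst` into a script that calls `read_m6(sys.stdin)`.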

General results/performance files:

tools_results.tsv.zst: Raw alignment results from each tool.
hyperfine_results.tsv.zst: Per-tool runtime as captured by hyperfine.
performance_results.tsv.zst: Precision, recall, and F1 scores. Note that these values might differ slightly from the ones in the manuscript or the Jupyter notebooks, because they use a modified, more permissive definition of true positives. That definition is not used in the manuscript or the notebooks; it serves only as a quick and dirty proxy.
tool_performance_stats_mismatches_*.tsv: Performance breakdown by mismatch level. Note that a level can mean either exactly n mismatches or a cumulative range of up to n mismatches.
tool_performance_by_mismatches.json: The same breakdown, usually after some aggregation (into one file, in a narrow table format).
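
For orientation, precision, recall, and F1 can be computed from a set of reported matches and a set of ground-truth matches as sketched below. This is a generic illustration using exact set membership; it is not the permissive true-positive definition behind performance_results.tsv.zst, nor necessarily the one used in the manuscript. The tuple layout (spacer, contig, start, strand) and the example values are assumptions.

```python
def precision_recall_f1(predicted: set, truth: set) -> tuple:
    """Return (precision, recall, F1) for two sets of hashable match records."""
    true_positives = len(predicted & truth)
    false_positives = len(predicted - truth)
    false_negatives = len(truth - predicted)

    reported = true_positives + false_positives
    expected = true_positives + false_negatives
    precision = true_positives / reported if reported else 0.0
    recall = true_positives / expected if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up matches: (spacer id, contig id, start position, strand).
    truth = {("s1", "c1", 10, "+"), ("s1", "c2", 5, "-"), ("s2", "c1", 40, "+")}
    predicted = {("s1", "c1", 10, "+"), ("s2", "c1", 40, "+"), ("s2", "c3", 7, "+")}
    print(precision_recall_f1(predicted, truth))
```

A more permissive definition would change only how `predicted` and `truth` records are compared (for example, by tolerating small coordinate offsets), not the formulas themselves.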

Tools tested for all data

Bowtie1 v1.3.1 (64-bit, gcc 13.3.0)
Bowtie2 v2.5.4 (64-bit, gcc 13.3.0)
BBTools (bbmap-skimmer) v39.13
StrobeAlign v0.15.0
BLASTN v2.16.0 (build Dec 14 2024 23:05:40)
MMseqs2 db8ad2d14d0a285ce0ad62bbefd0dce927663315
MUMMER v4.0.1
minimap2 2.28-r1209
spacer-containment v0.1.0
LexicMap v0.5.0 (06741c8)

Tools tested for simulated data only

BWA 0.7.19-r1273
HISAT2 v2.2.1 (64-bit)

Directory Structure

Note - once decompressed, the structure of the different simulation runs is the same, so the tree below includes the subdirectory tree for only one such run.
Note2 - the simulated data contains a "combined_sims" folder - this is an aggregation of the individual runs, and is the main data used in the Performance_simulated.ipynb jupyter/python notebook in the git repo.
Note3 - the most up to date and correct versions of the plots/figures are on the repo as well.

Simulated Data:

simulated
├── Runs
│   ├── combined_sims
│   │   ├── simulated_data
│   │   │   ├── ground_truth.tsv.zst
│   │   │   ├── simulated_contigs.fa.gz
│   │   │   └── simulated_spacers.fa.gz
│   │   └── tools_results.tsv.zst
│   └── sims
│       ├── run_t_25_nc_40000_ns_1030_ir_1_2205_lm_0_0_prc_0.5
│       │   ├── bash_scripts
│       │   │   ├── bbmap_skimmer.sh
│       │   │   ├── bbmapskimmermod.sh
│       │   │   ├── blastn.sh
│       │   │   ├── bowtie1.sh
│       │   │   ├── bowtie2.sh
│       │   │   ├── bwa_mem.sh
│       │   │   ├── hisat2.sh
│       │   │   ├── lexicmap.sh
│       │   │   ├── minimap2.sh
│       │   │   ├── minimap2_mod.sh
│       │   │   ├── minimap2_og.sh
│       │   │   ├── mmseqs.sh
│       │   │   ├── mmseqs2_map.sh
│       │   │   ├── mummer4.sh
│       │   │   ├── spacer_containment.sh
│       │   │   └── strobealign.sh
│       │   ├── hyperfine_results.tsv
│       │   ├── performance_results.tsv
│       │   ├── raw_outputs
│       │   │   ├── bbmap_skimmer.sh.json
│       │   │   ├── bbmap_skimmer_mod_output.sam.gz
│       │   │   ├── bbmap_skimmer_output.sam.gz
│       │   │   ├── bbmapskimmermod.sh.json
│       │   │   ├── blastn.sh.json
│       │   │   ├── blastn_output.tsv.zst
│       │   │   ├── bowtie1.sh.json
│       │   │   ├── bowtie1_output.sam.gz
│       │   │   ├── bowtie2.sh.json
│       │   │   ├── bowtie2_output.sam.gz
│       │   │   ├── bwa_mem.sh.json
│       │   │   ├── bwa_mem_output.sam.gz
│       │   │   ├── hisat2.sh.json
│       │   │   ├── hisat2_output.sam.gz
│       │   │   ├── hyperfine_output_bbmap_skimmer.sh.txt
│       │   │   ├── hyperfine_output_bbmapskimmermod.sh.txt
│       │   │   ├── hyperfine_output_blastn.sh.txt
│       │   │   ├── hyperfine_output_bowtie1.sh.txt
│       │   │   ├── hyperfine_output_bowtie2.sh.txt
│       │   │   ├── hyperfine_output_bwa_mem.sh.txt
│       │   │   ├── hyperfine_output_hisat2.sh.txt
│       │   │   ├── hyperfine_output_lexicmap.sh.txt
│       │   │   ├── hyperfine_output_minimap2.sh.txt
│       │   │   ├── hyperfine_output_minimap2_mod.sh.txt
│       │   │   ├── hyperfine_output_minimap2_og.sh.txt
│       │   │   ├── hyperfine_output_mmseqs.sh.txt
│       │   │   ├── hyperfine_output_mmseqs2_map.sh.txt
│       │   │   ├── hyperfine_output_mummer4.sh.txt
│       │   │   ├── hyperfine_output_spacer_containment.sh.txt
│       │   │   ├── hyperfine_output_strobealign.sh.txt
│       │   │   ├── lexicmap.sh.json
│       │   │   ├── lexicmap_output.tsv.zst
│       │   │   ├── minimap2.sh.json
│       │   │   ├── minimap2_mod.sh.json
│       │   │   ├── minimap2_mod_output.sam.gz
│       │   │   ├── minimap2_og.sh.json
│       │   │   ├── minimap2_og_output.sam.gz
│       │   │   ├── minimap2_output.sam.gz
│       │   │   ├── mmseqs.sh.json
│       │   │   ├── mmseqs2_map.sh.json
│       │   │   ├── mmseqs_output.tsv.zst
│       │   │   ├── mmseqsmap_output.tsv.zst
│       │   │   ├── mummer4.sh.json
│       │   │   ├── mummer4_output.sam.gz
│       │   │   ├── spacer_containment.sh.json
│       │   │   ├── spacer_containment_output.tsv.zst
│       │   │   ├── strobealign.sh.json
│       │   │   └── strobealign_output.sam.gz
│       │   ├── simulated_data
│       │   │   ├── ground_truth.tsv.zst
│       │   │   ├── simulated_contigs.fa.gz
│       │   │   └── simulated_spacers.fa.gz
│       │   └── tools_results.tsv.zst
│       ├── run_t_25_nc_40000_ns_1030_ir_1_2205_lm_1_1_prc_0.5/...
│       ├── run_t_25_nc_40000_ns_1030_ir_1_2205_lm_2_2_prc_0.5/...
│       ├── run_t_25_nc_40000_ns_1030_ir_1_2205_lm_3_3_prc_0.5/...
│       ├── run_t_25_nc_40000_ns_1225_ir_1_1225_lm_0_0_prc_0.5/...
│       ├── run_t_25_nc_40000_ns_1225_ir_1_1225_lm_1_1_prc_0.5/...
│       ├── run_t_25_nc_40000_ns_1225_ir_1_1225_lm_2_2_prc_0.5/...
│       └── run_t_25_nc_40000_ns_1225_ir_1_1225_lm_3_3_prc_0.5/...
├── plots
│   ├── matrix_0.html
│   ├── matrix_0.svg
│   ├── matrix_1.html
│   ├── matrix_1.svg
│   ├── matrix_2.html
│   ├── matrix_2.svg
│   ├── matrix_3.html
│   ├── matrix_3.svg
│   ├── matrix_4.html
│   ├── matrix_4.svg
│   ├── matrix_5.html
│   ├── matrix_5.svg
│   ├── tool_performance_by_mismatches.html
│   ├── tool_performance_by_mismatches.json
│   ├── tool_performance_grid.html
│   ├── tool_performance_grid.svg
│   ├── tool_performance_mismatches_0.pdf
│   ├── tool_performance_mismatches_1.pdf
│   ├── tool_performance_mismatches_2.pdf
│   ├── tool_performance_mismatches_3.pdf
│   ├── tool_performance_stats_mismatches_0.tsv
│   ├── tool_performance_stats_mismatches_1.tsv
│   ├── tool_performance_stats_mismatches_2.tsv
│   ├── tool_performance_stats_mismatches_3.tsv
│   ├── tool_performance_vs_mismatches.html
│   ├── tool_performance_vs_mismatches.svg
│   ├── tool_recall_per_spacer_contig_grid_log.pdf
│   └── tool_recall_per_spacer_only_grid_log.pdf
└── results
    ├── aggregated_ground_truth.parquet
    ├── aggregated_performance_runtime.parquet
    ├── aggregated_runtimes.parquet
    ├── aggregated_tool_results.parquet
    ├── matrix_0.tsv
    ├── matrix_1.tsv
    ├── matrix_2.tsv
    ├── matrix_3.tsv
    ├── matrix_4.tsv
    ├── matrix_5.tsv
    └── tool_performance_by_mismatches.tsv

Real Data:

real_data
├── bash_scripts
│   ├── bbmap_skimmer.sh
│   ├── blastn.sh
│   ├── bowtie1.sh
│   ├── bowtie2.sh
│   ├── lexicmap.sh
│   ├── minimap2.sh
│   ├── mmseqs.sh
│   ├── mummer4.sh
│   ├── spacer_containment.sh
│   └── strobealign.sh
├── job_scripts
│   ├── bbmap_skimmer.sh
│   ├── blastn.sh
│   ├── bowtie1.sh
│   ├── bowtie2.sh
│   ├── lexicmap.sh
│   ├── minimap2.sh
│   ├── mmseqs.sh
│   ├── mummer4.sh
│   ├── spacer_containment.sh
│   └── strobealign.sh
├── plots
│   ├── matrix_0.html
│   ├── matrix_0.svg
│   ├── matrix_1.html
│   ├── matrix_1.svg
│   ├── matrix_2.html
│   ├── matrix_2.svg
│   ├── matrix_3.html
│   ├── matrix_3.svg
│   ├── tool_performance_detailed_3bins.pdf
│   ├── tool_performance_detailed_stats.tsv
│   ├── tool_performance_max_mm_0_detailed_3bins.pdf
│   ├── tool_performance_max_mm_0_detailed_stats.tsv
│   ├── tool_performance_max_mm_1_detailed_3bins.pdf
│   ├── tool_performance_max_mm_1_detailed_stats.tsv
│   ├── tool_performance_max_mm_2_detailed_3bins.pdf
│   ├── tool_performance_max_mm_2_detailed_stats.tsv
│   ├── tool_performance_max_mm_3_detailed_3bins.pdf
│   ├── tool_performance_max_mm_3_detailed_stats.tsv
│   ├── tool_performance_mm_0_detailed_stats.tsv
│   ├── tool_performance_mm_1_detailed_stats.tsv
│   ├── tool_performance_mm_2_detailed_stats.tsv
│   ├── tool_performance_mm_3_detailed_stats.tsv
│   ├── tool_performance_panel.pdf
│   ├── tool_performance_panel.svg
│   ├── tool_performance_perfect_detailed_3bins.pdf
│   ├── tool_performance_perfect_detailed_stats.tsv
│   ├── tool_performance_vs_mismatches.pdf
│   ├── tool_performance_vs_occurrences_detailed.pdf
│   ├── tool_performance_vs_occurrences_detailed_3bins.pdf
│   ├── upset_0.pdf
│   ├── upset_1.pdf
│   ├── upset_2.pdf
│   └── upset_3.pdf
├── raw_outputs
│   ├── bbmap_skimmer_output.sam.gz
│   ├── blastn_output.tsv.zst
│   ├── bowtie1_output.sam.gz
│   ├── bowtie2_output.sam.gz
│   ├── lexicmap_output.tsv.zst
│   ├── minimap2_output.sam.gz
│   ├── mmseqs_output.tsv.zst
│   ├── mummer4_output.sam.gz
│   ├── spacer_containment_output.tsv.zst
│   └── strobealign_output.sam.gz
├── results
│   ├── Tool_exclusivity.tsv
│   ├── deviation_counts.csv
│   ├── matrix_0.tsv
│   ├── matrix_1.tsv
│   ├── matrix_2.tsv
│   ├── matrix_3.tsv
│   ├── matrix_4.tsv
│   ├── matrix_5.tsv
│   ├── spacer_counts_with_tools.parquet
│   ├── summary_stats.parquet
│   ├── tool_performance_by_mismatches.tsv
│   ├── tool_performance_vs_occurrences_detailed_stats.tsv
│   └── tools_results_mm_recalced.parquet
├── sacct.out
└── slurm_logs
    ├── bbmap_skimmer-15192666.err
    ├── bbmap_skimmer-15192666.out
    ├── bowtie1-15296994.err
    ├── bowtie1-15296994.out
    ├── bowtie2-15192707.err
    ├── bowtie2-15192707.out
    ├── lexicmap-15192729.err
    ├── lexicmap-15192729.out
    ├── minimap2-15192728.err
    ├── minimap2-15192728.out
    ├── mmseqs-15192703.err
    ├── mmseqs-15192703.out
    ├── mummer4-15192721.err
    ├── mummer4-15192721.out
    ├── spacer_containment-15224132.err
    ├── spacer_containment-15224132.out
    ├── strobealign-15192702.err
    ├── strobealign-15192702.out
    ├── vsearch-15258853.err
    └── vsearch-15258853.out

Files (41.2 GB)

Name	MD5	Size
real_data.tar.zst	c11c8e6fe687b70cd9bab0112f5bbc66	20.8 GB
simulated_data.tar.zst	7fa7e8a8770ea7ca316de20897647ae4	20.4 GB
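
Given the size of the archives, it may be worth checking a download against the MD5 checksums listed above before extracting it. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify_download` are illustrative, and the archives are assumed to sit in the current working directory.

```python
import hashlib
import os

# MD5 checksums as listed in the file table above.
EXPECTED_MD5 = {
    "real_data.tar.zst": "c11c8e6fe687b70cd9bab0112f5bbc66",
    "simulated_data.tar.zst": "7fa7e8a8770ea7ca316de20897647ae4",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a ~20 GB archive never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_md5: str) -> bool:
    """Return True when the file's MD5 matches the expected hex digest."""
    return md5_of_file(path) == expected_md5.lower()


if __name__ == "__main__":
    for name, expected in EXPECTED_MD5.items():
        if os.path.exists(name):
            status = "OK" if verify_download(name, expected) else "MISMATCH"
            print(f"{name}: {status}")
        else:
            print(f"{name}: not found in the current directory")
```

On systems with GNU coreutils, `md5sum real_data.tar.zst` produces the same hex digest for comparison.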

Additional details

Repository URL: https://code.jgi.doe.gov/spacersdb/spacer_matching_bench
Programming language: Python, Rust
