Published April 8, 2025 | Version v1
Dataset Open

Supplementary data for CRISPR spacer-protospacer matching benchmarks

Description

Raw Outputs for CRISPR Spacer Matching Benchmark Study

This dataset contains the raw outputs of the sequence alignment tools benchmarked for protospacer identification. The data are organized into two main categories: simulated data (random sequences) and real data (IMG/VRv4). Each category is packaged as a separate tar.zst file for ease of download (see the directory trees below).
For more information or for the exact commands used to run the tools, please see the git repository folder "tool_configs".

Simulated Run Directory Naming Convention:

 `run_t_{threads}_nc_{n_contigs}_ns_{n_spacers}_ir_{min_insertions}_{max_insertions}_lm_{min_mismatches}_{max_mismatches}_prc_{prop_rc}`

Where:
- t: Number of threads used
- nc: Number of contigs generated
- ns: Number of spacers generated
- ir: Insertion range (min and max insertions per spacer, i.e. the range used to simulate the total number of occurrences of each spacer in the reference (contig) file)
- lm: Length/mismatch range (min and max mismatches allowed)
- prc: Proportion of reverse complement insertions
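
The naming convention above can be unpacked programmatically. The following is a minimal sketch, assuming that every field except prc is a non-negative integer and that prc is a decimal number; the names `RunParams` and `parse_run_dir` are hypothetical helpers and are not part of the repository.

```python
import re
from typing import NamedTuple


class RunParams(NamedTuple):
    """Fields encoded in a simulated run directory name (hypothetical container)."""
    threads: int
    n_contigs: int
    n_spacers: int
    min_insertions: int
    max_insertions: int
    min_mismatches: int
    max_mismatches: int
    prop_rc: float


# Mirrors run_t_{threads}_nc_{n_contigs}_ns_{n_spacers}_ir_{min}_{max}_lm_{min}_{max}_prc_{prop_rc}
_RUN_DIR_RE = re.compile(
    r"^run_t_(\d+)_nc_(\d+)_ns_(\d+)_ir_(\d+)_(\d+)_lm_(\d+)_(\d+)_prc_([0-9.]+)$"
)


def parse_run_dir(name: str) -> RunParams:
    """Parse a run directory name into its parameters; raise ValueError if it does not match."""
    match = _RUN_DIR_RE.match(name.rstrip("/"))
    if match is None:
        raise ValueError(f"not a simulated run directory name: {name!r}")
    *int_fields, prop_rc = match.groups()
    return RunParams(*map(int, int_fields), float(prop_rc))
```

For example, `parse_run_dir("run_t_25_nc_40000_ns_1030_ir_1_2205_lm_0_0_prc_0.5")` yields 25 threads, 40000 contigs, 1030 spacers, an insertion range of 1 to 2205, a mismatch range of 0 to 0, and a reverse complement proportion of 0.5.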

Compressed Formats:

  • FASTA files (.fa.gz): Compressed with bgzip. To decompress: gunzip -c file.fa.gz > file.fa
  • SAM files (.sam.gz): Compressed with bgzip. To decompress: gunzip -c file.sam.gz > file.sam (note that not all tools tested conformed to SAM v1.4 or emitted the extended CIGAR string).
  • TSV files (.tsv.zst): Compressed with zstd. To decompress: zstd -d file.tsv.zst (note that blastn and mmseqs results use a slightly modified "m6" output format with the columns "qaccver", "saccver", "nident", "length", "mismatch", "qlen", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore").
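
As an illustration of the modified "m6" layout, here is a minimal sketch for reading an already decompressed blastn or mmseqs TSV file. It assumes the files carry no header row and that the columns appear in the order listed above; neither is confirmed here, so check one file before relying on it. The function name `read_m6` and the choice of which columns to convert to numbers are assumptions made for illustration.

```python
import csv
from typing import Iterable, Iterator

# Column order as listed in the description of the modified "m6" format.
M6_COLUMNS = [
    "qaccver", "saccver", "nident", "length", "mismatch", "qlen", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
# Assumed types: counts and coordinates are integers, scores are floats.
INT_COLUMNS = ("nident", "length", "mismatch", "qlen", "gapopen",
               "qstart", "qend", "sstart", "send")
FLOAT_COLUMNS = ("evalue", "bitscore")


def read_m6(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per hit from tab-separated lines (assumes no header row)."""
    reader = csv.DictReader(
        lines, fieldnames=M6_COLUMNS, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        for key in INT_COLUMNS:
            row[key] = int(row[key])
        for key in FLOAT_COLUMNS:
            row[key] = float(row[key])
        yield row
```

After running `zstd -d` on a file, an open text file handle can be passed directly as `lines`, since iterating over it yields one line at a time.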

General results/performance files:

  • tools_results.tsv.zst: Raw alignment results from each tool.
  • hyperfine_results.tsv.zst: Per-tool runtime as captured by hyperfine.
  • performance_results.tsv.zst: Precision, recall, and F1 scores. Note that these values might differ slightly from those in the manuscript or the Jupyter notebooks: they use a modified, more permissive definition of true positives, which is not used in the manuscript or the notebooks and serves only as a quick-and-dirty proxy.
  • tool_performance_stats_mismatches_*.tsv: Performance breakdown by mismatch level (note that a level may mean exactly n mismatches or a cumulative threshold, i.e. up to n mismatches).
  • tool_performance_by_mismatches.json: The same per-mismatch breakdown, usually after some aggregation (into one file, in a narrow table format).
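
For orientation, the sketch below computes precision, recall, and F1 using the textbook definitions, with hits compared by exact equality of hashable keys. This is an assumption for illustration only: the permissive true-positive definition used for performance_results.tsv.zst is not specified here and is not reproduced, and the function name `precision_recall_f1` is hypothetical.

```python
from typing import Hashable, Set, Tuple


def precision_recall_f1(
    predicted: Set[Hashable], truth: Set[Hashable]
) -> Tuple[float, float, float]:
    """Textbook precision, recall, and F1 for two sets of hit keys.

    A key could be, for example, a (spacer_id, contig_id, start, end) tuple;
    that choice is an assumption, not the dataset's definition.
    """
    true_pos = len(predicted & truth)   # reported and in the ground truth
    false_pos = len(predicted - truth)  # reported but not in the ground truth
    false_neg = len(truth - predicted)  # in the ground truth but not reported
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Empty inputs return 0.0 for the affected score rather than raising, which is a design choice of this sketch.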

Tools tested for all data

  • Bowtie1 v1.3.1 (64-bit, gcc 13.3.0)
  • Bowtie2 v2.5.4 (64-bit, gcc 13.3.0)
  • BBTools (bbmap-skimmer) v39.13
  • StrobeAlign v0.15.0
  • BLASTN v2.16.0 (build Dec 14 2024 23:05:40)
  • MMseqs2 db8ad2d14d0a285ce0ad62bbefd0dce927663315
  • MUMMER v4.0.1
  • minimap2 2.28-r1209
  • spacer-containment v0.1.0
  • LexicMap v0.5.0 (06741c8)

Tools tested for simulated data only

  • BWA 0.7.19-r1273
  • HISAT2 v2.2.1 (64-bit)

Directory Structure

Note - once decompressed, the structure of the different simulation runs is the same, so the tree below includes the subdirectory tree for only one such run.
Note2 - the simulated data contains a "combined_sims" folder - this is an aggregation of the individual runs, and is the main data used in the Performance_simulated.ipynb jupyter/python notebook in the git repo.
Note3 - the most up to date and correct versions of the plots/figures are on the repo as well.

 Simulated Data:

simulated
├── Runs
│   ├── combined_sims
│   │   ├── simulated_data
│   │   │   ├── ground_truth.tsv.zst
│   │   │   ├── simulated_contigs.fa.gz
│   │   │   └── simulated_spacers.fa.gz
│   │   └── tools_results.tsv.zst
│   └── sims
│       ├── run_t_25_nc_40000_ns_1030_ir_1_2205_lm_0_0_prc_0.5
│       │   ├── bash_scripts
│       │   │   ├── bbmap_skimmer.sh
│       │   │   ├── bbmapskimmermod.sh
│       │   │   ├── blastn.sh
│       │   │   ├── bowtie1.sh
│       │   │   ├── bowtie2.sh
│       │   │   ├── bwa_mem.sh
│       │   │   ├── hisat2.sh
│       │   │   ├── lexicmap.sh
│       │   │   ├── minimap2.sh
│       │   │   ├── minimap2_mod.sh
│       │   │   ├── minimap2_og.sh
│       │   │   ├── mmseqs.sh
│       │   │   ├── mmseqs2_map.sh
│       │   │   ├── mummer4.sh
│       │   │   ├── spacer_containment.sh
│       │   │   └── strobealign.sh
│       │   ├── hyperfine_results.tsv
│       │   ├── performance_results.tsv
│       │   ├── raw_outputs
│       │   │   ├── bbmap_skimmer.sh.json
│       │   │   ├── bbmap_skimmer_mod_output.sam.gz
│       │   │   ├── bbmap_skimmer_output.sam.gz
│       │   │   ├── bbmapskimmermod.sh.json
│       │   │   ├── blastn.sh.json
│       │   │   ├── blastn_output.tsv.zst
│       │   │   ├── bowtie1.sh.json
│       │   │   ├── bowtie1_output.sam.gz
│       │   │   ├── bowtie2.sh.json
│       │   │   ├── bowtie2_output.sam.gz
│       │   │   ├── bwa_mem.sh.json
│       │   │   ├── bwa_mem_output.sam.gz
│       │   │   ├── hisat2.sh.json
│       │   │   ├── hisat2_output.sam.gz
│       │   │   ├── hyperfine_output_bbmap_skimmer.sh.txt
│       │   │   ├── hyperfine_output_bbmapskimmermod.sh.txt
│       │   │   ├── hyperfine_output_blastn.sh.txt
│       │   │   ├── hyperfine_output_bowtie1.sh.txt
│       │   │   ├── hyperfine_output_bowtie2.sh.txt
│       │   │   ├── hyperfine_output_bwa_mem.sh.txt
│       │   │   ├── hyperfine_output_hisat2.sh.txt
│       │   │   ├── hyperfine_output_lexicmap.sh.txt
│       │   │   ├── hyperfine_output_minimap2.sh.txt
│       │   │   ├── hyperfine_output_minimap2_mod.sh.txt
│       │   │   ├── hyperfine_output_minimap2_og.sh.txt
│       │   │   ├── hyperfine_output_mmseqs.sh.txt
│       │   │   ├── hyperfine_output_mmseqs2_map.sh.txt
│       │   │   ├── hyperfine_output_mummer4.sh.txt
│       │   │   ├── hyperfine_output_spacer_containment.sh.txt
│       │   │   ├── hyperfine_output_strobealign.sh.txt
│       │   │   ├── lexicmap.sh.json
│       │   │   ├── lexicmap_output.tsv.zst
│       │   │   ├── minimap2.sh.json
│       │   │   ├── minimap2_mod.sh.json
│       │   │   ├── minimap2_mod_output.sam.gz
│       │   │   ├── minimap2_og.sh.json
│       │   │   ├── minimap2_og_output.sam.gz
│       │   │   ├── minimap2_output.sam.gz
│       │   │   ├── mmseqs.sh.json
│       │   │   ├── mmseqs2_map.sh.json
│       │   │   ├── mmseqs_output.tsv.zst
│       │   │   ├── mmseqsmap_output.tsv.zst
│       │   │   ├── mummer4.sh.json
│       │   │   ├── mummer4_output.sam.gz
│       │   │   ├── spacer_containment.sh.json
│       │   │   ├── spacer_containment_output.tsv.zst
│       │   │   ├── strobealign.sh.json
│       │   │   └── strobealign_output.sam.gz
│       │   ├── simulated_data
│       │   │   ├── ground_truth.tsv.zst
│       │   │   ├── simulated_contigs.fa.gz
│       │   │   └── simulated_spacers.fa.gz
│       │   └── tools_results.tsv.zst
│       ├── run_t_25_nc_40000_ns_1030_ir_1_2205_lm_1_1_prc_0.5/...
│       ├── run_t_25_nc_40000_ns_1030_ir_1_2205_lm_2_2_prc_0.5/...
│       ├── run_t_25_nc_40000_ns_1030_ir_1_2205_lm_3_3_prc_0.5
│       ├── run_t_25_nc_40000_ns_1225_ir_1_1225_lm_0_0_prc_0.5
│       ├── run_t_25_nc_40000_ns_1225_ir_1_1225_lm_1_1_prc_0.5
│       ├── run_t_25_nc_40000_ns_1225_ir_1_1225_lm_2_2_prc_0.5
│       └── run_t_25_nc_40000_ns_1225_ir_1_1225_lm_3_3_prc_0.5
├── plots
│   ├── matrix_0.html
│   ├── matrix_0.svg
│   ├── matrix_1.html
│   ├── matrix_1.svg
│   ├── matrix_2.html
│   ├── matrix_2.svg
│   ├── matrix_3.html
│   ├── matrix_3.svg
│   ├── matrix_4.html
│   ├── matrix_4.svg
│   ├── matrix_5.html
│   ├── matrix_5.svg
│   ├── tool_performance_by_mismatches.html
│   ├── tool_performance_by_mismatches.json
│   ├── tool_performance_grid.html
│   ├── tool_performance_grid.svg
│   ├── tool_performance_mismatches_0.pdf
│   ├── tool_performance_mismatches_1.pdf
│   ├── tool_performance_mismatches_2.pdf
│   ├── tool_performance_mismatches_3.pdf
│   ├── tool_performance_stats_mismatches_0.tsv
│   ├── tool_performance_stats_mismatches_1.tsv
│   ├── tool_performance_stats_mismatches_2.tsv
│   ├── tool_performance_stats_mismatches_3.tsv
│   ├── tool_performance_vs_mismatches.html
│   ├── tool_performance_vs_mismatches.svg
│   ├── tool_recall_per_spacer_contig_grid_log.pdf
│   └── tool_recall_per_spacer_only_grid_log.pdf
└── results
    ├── aggregated_ground_truth.parquet
    ├── aggregated_performance_runtime.parquet
    ├── aggregated_runtimes.parquet
    ├── aggregated_tool_results.parquet
    ├── matrix_0.tsv
    ├── matrix_1.tsv
    ├── matrix_2.tsv
    ├── matrix_3.tsv
    ├── matrix_4.tsv
    ├── matrix_5.tsv
    └── tool_performance_by_mismatches.tsv

 Real Data:

real_data
├── bash_scripts
│   ├── bbmap_skimmer.sh
│   ├── blastn.sh
│   ├── bowtie1.sh
│   ├── bowtie2.sh
│   ├── lexicmap.sh
│   ├── minimap2.sh
│   ├── mmseqs.sh
│   ├── mummer4.sh
│   ├── spacer_containment.sh
│   └── strobealign.sh
├── job_scripts
│   ├── bbmap_skimmer.sh
│   ├── blastn.sh
│   ├── bowtie1.sh
│   ├── bowtie2.sh
│   ├── lexicmap.sh
│   ├── minimap2.sh
│   ├── mmseqs.sh
│   ├── mummer4.sh
│   ├── spacer_containment.sh
│   └── strobealign.sh
├── plots
│   ├── matrix_0.html
│   ├── matrix_0.svg
│   ├── matrix_1.html
│   ├── matrix_1.svg
│   ├── matrix_2.html
│   ├── matrix_2.svg
│   ├── matrix_3.html
│   ├── matrix_3.svg
│   ├── tool_performance_detailed_3bins.pdf
│   ├── tool_performance_detailed_stats.tsv
│   ├── tool_performance_max_mm_0_detailed_3bins.pdf
│   ├── tool_performance_max_mm_0_detailed_stats.tsv
│   ├── tool_performance_max_mm_1_detailed_3bins.pdf
│   ├── tool_performance_max_mm_1_detailed_stats.tsv
│   ├── tool_performance_max_mm_2_detailed_3bins.pdf
│   ├── tool_performance_max_mm_2_detailed_stats.tsv
│   ├── tool_performance_max_mm_3_detailed_3bins.pdf
│   ├── tool_performance_max_mm_3_detailed_stats.tsv
│   ├── tool_performance_mm_0_detailed_stats.tsv
│   ├── tool_performance_mm_1_detailed_stats.tsv
│   ├── tool_performance_mm_2_detailed_stats.tsv
│   ├── tool_performance_mm_3_detailed_stats.tsv
│   ├── tool_performance_panel.pdf
│   ├── tool_performance_panel.svg
│   ├── tool_performance_perfect_detailed_3bins.pdf
│   ├── tool_performance_perfect_detailed_stats.tsv
│   ├── tool_performance_vs_mismatches.pdf
│   ├── tool_performance_vs_occurrences_detailed.pdf
│   ├── tool_performance_vs_occurrences_detailed_3bins.pdf
│   ├── upset_0.pdf
│   ├── upset_1.pdf
│   ├── upset_2.pdf
│   └── upset_3.pdf
├── raw_outputs
│   ├── bbmap_skimmer_output.sam.gz
│   ├── blastn_output.tsv.zst
│   ├── bowtie1_output.sam.gz
│   ├── bowtie2_output.sam.gz
│   ├── lexicmap_output.tsv.zst
│   ├── minimap2_output.sam.gz
│   ├── mmseqs_output.tsv.zst
│   ├── mummer4_output.sam.gz
│   ├── spacer_containment_output.tsv.zst
│   └── strobealign_output.sam.gz
├── results
│   ├── Tool_exclusivity.tsv
│   ├── deviation_counts.csv
│   ├── matrix_0.tsv
│   ├── matrix_1.tsv
│   ├── matrix_2.tsv
│   ├── matrix_3.tsv
│   ├── matrix_4.tsv
│   ├── matrix_5.tsv
│   ├── spacer_counts_with_tools.parquet
│   ├── summary_stats.parquet
│   ├── tool_performance_by_mismatches.tsv
│   ├── tool_performance_vs_occurrences_detailed_stats.tsv
│   └── tools_results_mm_recalced.parquet
├── sacct.out
└── slurm_logs
    ├── bbmap_skimmer-15192666.err
    ├── bbmap_skimmer-15192666.out
    ├── bowtie1-15296994.err
    ├── bowtie1-15296994.out
    ├── bowtie2-15192707.err
    ├── bowtie2-15192707.out
    ├── lexicmap-15192729.err
    ├── lexicmap-15192729.out
    ├── minimap2-15192728.err
    ├── minimap2-15192728.out
    ├── mmseqs-15192703.err
    ├── mmseqs-15192703.out
    ├── mummer4-15192721.err
    ├── mummer4-15192721.out
    ├── spacer_containment-15224132.err
    ├── spacer_containment-15224132.out
    ├── strobealign-15192702.err
    ├── strobealign-15192702.out
    ├── vsearch-15258853.err
    └── vsearch-15258853.out

Files (41.2 GB)

md5:c11c8e6fe687b70cd9bab0112f5bbc66 (20.8 GB)
md5:7fa7e8a8770ea7ca316de20897647ae4 (20.4 GB)
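
Given the archive sizes, it may be worth checking a download against the listed md5 checksum before extracting it. The following is a minimal sketch using Python's standard hashlib module; the function name `md5_of_file` and the chunk size are illustrative choices, and the path passed in is whatever local name the archive was saved under.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string can be compared with the hexadecimal part of the matching md5 entry above (the text after "md5:").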

Additional details

Software

Repository URL
https://code.jgi.doe.gov/spacersdb/spacer_matching_bench
Programming language
Python, Rust