Supplementary data for CRISPR spacer-protospacer matching benchmarks
Description
Raw Outputs for CRISPR Spacer Matching Benchmark Study
(See the directory tree below.)
Simulated Run Directory Naming Convention:
`run_t_{threads}_nc_{n_contigs}_ns_{n_spacers}_ir_{min_insertions}_{max_insertions}_lm_{min_mismatches}_{max_mismatches}_prc_{prop_rc}`
Where:
- `t`: Number of threads used
- `nc`: Number of contigs generated
- `ns`: Number of spacers generated
- `ir`: Insertion range (min and max insertions per spacer; this range simulates the total number of occurrences of each spacer in the reference (contig) file)
- `lm`: Length/mismatch range (min and max mismatches allowed)
- `prc`: Proportion of reverse complement insertions
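The naming convention above can be unpacked programmatically. The following is a minimal sketch, not code from the repository: the `RunParams` type, the `RUN_DIR_PATTERN` regex, and the `parse_run_dir` helper are hypothetical names introduced here for illustration, and the pattern assumes every field is a plain non-negative number.

```python
import re
from typing import NamedTuple


class RunParams(NamedTuple):
    """Hypothetical container for the fields encoded in a run directory name."""
    threads: int
    n_contigs: int
    n_spacers: int
    min_insertions: int
    max_insertions: int
    min_mismatches: int
    max_mismatches: int
    prop_rc: float


# One named group per field of the naming convention.
RUN_DIR_PATTERN = re.compile(
    r"^run_t_(?P<threads>\d+)"
    r"_nc_(?P<n_contigs>\d+)"
    r"_ns_(?P<n_spacers>\d+)"
    r"_ir_(?P<min_insertions>\d+)_(?P<max_insertions>\d+)"
    r"_lm_(?P<min_mismatches>\d+)_(?P<max_mismatches>\d+)"
    r"_prc_(?P<prop_rc>\d+(?:\.\d+)?)$"
)


def parse_run_dir(name: str) -> RunParams:
    """Split a run directory name into typed parameters."""
    match = RUN_DIR_PATTERN.match(name.rstrip("/"))
    if match is None:
        raise ValueError(f"not a run directory name: {name!r}")
    fields = match.groupdict()
    prop_rc = float(fields.pop("prop_rc"))  # the only non-integer field
    return RunParams(**{key: int(value) for key, value in fields.items()}, prop_rc=prop_rc)


params = parse_run_dir("run_t_25_nc_40000_ns_1030_ir_1_2205_lm_0_0_prc_0.5")
print(params)
```

Parsing the names this way makes it straightforward to group runs by mismatch level or spacer count without opening any of the result files.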
Compressed Formats:
- FASTA files (.fa.gz): Compressed with bgzip. To decompress, you can use `gunzip -c file.fa.gz > file.fa`
- SAM files (.sam.gz): Compressed with bgzip. To decompress, you can use `gunzip -c file.sam.gz > file.sam` (note that not all tools tested conformed to SAM v1.4 or produced the extended CIGAR string).
- TSV files (.tsv.zst): Compressed with zstd. To decompress, use `zstd -d file.tsv.zst` (note that blastn and mmseqs results used a slightly modified "m6" output format with the columns "qaccver", "saccver", "nident", "length", "mismatch", "qlen", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore").
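Once a blastn or mmseqs TSV has been decompressed, the column list quoted above can be attached to the rows. The sketch below uses only the Python standard library and assumes the files are tab-separated with no header row, which is not confirmed by this record. `read_m6` is a hypothetical helper, and the sample row is made up for illustration.

```python
import csv
import io

# Column order quoted in the note above for the blastn and mmseqs outputs.
M6_COLUMNS = [
    "qaccver", "saccver", "nident", "length", "mismatch", "qlen",
    "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_COLUMNS = ["nident", "length", "mismatch", "qlen", "gapopen",
               "qstart", "qend", "sstart", "send"]
FLOAT_COLUMNS = ["evalue", "bitscore"]


def read_m6(handle):
    """Yield one dict per alignment from a headerless, tab-separated stream.

    Assumption: no header line is present; if there is one, skip it first.
    """
    reader = csv.DictReader(
        handle, fieldnames=M6_COLUMNS, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        for column in INT_COLUMNS:
            row[column] = int(row[column])
        for column in FLOAT_COLUMNS:
            row[column] = float(row[column])
        yield row


# Made-up example row, only to show the shape of the data.
sample = "spacer_1\tcontig_7\t31\t32\t1\t32\t0\t1\t32\t105\t136\t2.5e-09\t52.8\n"
hits = list(read_m6(io.StringIO(sample)))
print(hits[0]["qaccver"], hits[0]["saccver"], hits[0]["mismatch"])
```

For a real file, the same function can be fed an open handle, for example `with open("blastn_output.tsv", newline="") as fh: hits = list(read_m6(fh))`.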
General results/performance files:
- `tools_results.tsv.zst`: Raw alignment results from each tool
- `hyperfine_results.tsv.zst`: Per-tool runtime as captured by hyperfine.
- `performance_results.tsv.zst`: Precision, recall, and F1 scores. Note that these values might differ slightly from the ones in the manuscript or the Jupyter notebooks: they use a modified, more permissive definition of true positives. That definition is not used in the manuscript or the notebooks; it served only as a quick-and-dirty proxy.
- `tool_performance_stats_mismatches_*.tsv`: Performance breakdown by mismatch level (note that, depending on the file, a level n may mean either exactly n mismatches or up to n mismatches).
- `tool_performance_by_mismatches.json`: Same breakdown as the previous item, but usually after some aggregation (into one file, in a narrow table format).
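For readers who want to recompute scores from the raw results, the standard set-based definitions of precision, recall, and F1 can be sketched as follows. This is the textbook formulation, not the modified true-positive definition used for the performance results file, and the `(spacer, contig, start, strand)` match key is a hypothetical choice made here for illustration.

```python
def precision_recall_f1(predicted: set, truth: set) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from two sets of match keys."""
    true_positives = len(predicted & truth)
    false_positives = len(predicted - truth)
    false_negatives = len(truth - predicted)

    reported = true_positives + false_positives
    expected = true_positives + false_negatives
    precision = true_positives / reported if reported else 0.0
    recall = true_positives / expected if expected else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical match keys: (spacer_id, contig_id, start, strand).
truth = {
    ("s1", "c1", 10, "+"),
    ("s1", "c2", 50, "-"),
    ("s2", "c1", 99, "+"),
    ("s3", "c4", 7, "+"),
}
predicted = {
    ("s1", "c1", 10, "+"),
    ("s2", "c1", 99, "+"),
    ("s9", "c9", 1, "-"),  # not in the ground truth, counts as a false positive
}

precision, recall, f1 = precision_recall_f1(predicted, truth)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Using sets keeps the comparison independent of row order and collapses duplicate hits, which matters when a tool reports the same location more than once.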
Tools tested for all data
- Bowtie1 v1.3.1 (64-bit, gcc 13.3.0)
- Bowtie2 v2.5.4 (64-bit, gcc 13.3.0)
- BBTools (bbmap-skimmer) v39.13
- StrobeAlign v0.15.0
- BLASTN v2.16.0 (build Dec 14 2024 23:05:40)
- MMseqs2 db8ad2d14d0a285ce0ad62bbefd0dce927663315
- MUMMER v4.0.1
- minimap2 2.28-r1209
- spacer-containment v0.1.0
- LexicMap v0.5.0 (06741c8)
Tools tested for simulated data only
- BWA 0.7.19-r1273
- HISAT2 v2.2.1 (64-bit)
Directory Structure
Note - once decompressed, the structure of the different simulation runs is the same, so the tree below includes the subdirectory tree for only one such run.
Note2 - the simulated data contains a "combined_sims" folder. This is an aggregation of the individual runs and is the main data used in the Performance_simulated.ipynb Jupyter/Python notebook in the git repo.
Note3 - the most up-to-date and correct versions of the plots/figures are also in the repo.
Simulated Data:
simulated
├── Runs
│ ├── combined_sims
│ │ ├── simulated_data
│ │ │ ├── ground_truth.tsv.zst
│ │ │ ├── simulated_contigs.fa.gz
│ │ │ └── simulated_spacers.fa.gz
│ │ └── tools_results.tsv.zst
│ └── sims
│   ├── run_t_25_nc_40000_ns_1030_ir_1_2205_lm_0_0_prc_0.5
│   │ ├── bash_scripts
│   │ │ ├── bbmap_skimmer.sh
│   │ │ ├── bbmapskimmermod.sh
│   │ │ ├── blastn.sh
│   │ │ ├── bowtie1.sh
│   │ │ ├── bowtie2.sh
│   │ │ ├── bwa_mem.sh
│   │ │ ├── hisat2.sh
│   │ │ ├── lexicmap.sh
│   │ │ ├── minimap2.sh
│   │ │ ├── minimap2_mod.sh
│   │ │ ├── minimap2_og.sh
│   │ │ ├── mmseqs.sh
│   │ │ ├── mmseqs2_map.sh
│   │ │ ├── mummer4.sh
│   │ │ ├── spacer_containment.sh
│   │ │ └── strobealign.sh
│   │ ├── hyperfine_results.tsv
│   │ ├── performance_results.tsv
│   │ ├── raw_outputs
│   │ │ ├── bbmap_skimmer.sh.json
│   │ │ ├── bbmap_skimmer_mod_output.sam.gz
│   │ │ ├── bbmap_skimmer_output.sam.gz
│   │ │ ├── bbmapskimmermod.sh.json
│   │ │ ├── blastn.sh.json
│   │ │ ├── blastn_output.tsv.zst
│   │ │ ├── bowtie1.sh.json
│   │ │ ├── bowtie1_output.sam.gz
│   │ │ ├── bowtie2.sh.json
│   │ │ ├── bowtie2_output.sam.gz
│   │ │ ├── bwa_mem.sh.json
│   │ │ ├── bwa_mem_output.sam.gz
│   │ │ ├── hisat2.sh.json
│   │ │ ├── hisat2_output.sam.gz
│   │ │ ├── hyperfine_output_bbmap_skimmer.sh.txt
│   │ │ ├── hyperfine_output_bbmapskimmermod.sh.txt
│   │ │ ├── hyperfine_output_blastn.sh.txt
│   │ │ ├── hyperfine_output_bowtie1.sh.txt
│   │ │ ├── hyperfine_output_bowtie2.sh.txt
│   │ │ ├── hyperfine_output_bwa_mem.sh.txt
│   │ │ ├── hyperfine_output_hisat2.sh.txt
│   │ │ ├── hyperfine_output_lexicmap.sh.txt
│   │ │ ├── hyperfine_output_minimap2.sh.txt
│   │ │ ├── hyperfine_output_minimap2_mod.sh.txt
│   │ │ ├── hyperfine_output_minimap2_og.sh.txt
│   │ │ ├── hyperfine_output_mmseqs.sh.txt
│   │ │ ├── hyperfine_output_mmseqs2_map.sh.txt
│   │ │ ├── hyperfine_output_mummer4.sh.txt
│   │ │ ├── hyperfine_output_spacer_containment.sh.txt
│   │ │ ├── hyperfine_output_strobealign.sh.txt
│   │ │ ├── lexicmap.sh.json
│   │ │ ├── lexicmap_output.tsv.zst
│   │ │ ├── minimap2.sh.json
│   │ │ ├── minimap2_mod.sh.json
│   │ │ ├── minimap2_mod_output.sam.gz
│   │ │ ├── minimap2_og.sh.json
│   │ │ ├── minimap2_og_output.sam.gz
│   │ │ ├── minimap2_output.sam.gz
│   │ │ ├── mmseqs.sh.json
│   │ │ ├── mmseqs2_map.sh.json
│   │ │ ├── mmseqs_output.tsv.zst
│   │ │ ├── mmseqsmap_output.tsv.zst
│   │ │ ├── mummer4.sh.json
│   │ │ ├── mummer4_output.sam.gz
│   │ │ ├── spacer_containment.sh.json
│   │ │ ├── spacer_containment_output.tsv.zst
│   │ │ ├── strobealign.sh.json
│   │ │ └── strobealign_output.sam.gz
│   │ ├── simulated_data
│   │ │ ├── ground_truth.tsv.zst
│   │ │ ├── simulated_contigs.fa.gz
│   │ │ └── simulated_spacers.fa.gz
│   │ └── tools_results.tsv.zst
│   ├── run_t_25_nc_40000_ns_1030_ir_1_2205_lm_1_1_prc_0.5/...
│   ├── run_t_25_nc_40000_ns_1030_ir_1_2205_lm_2_2_prc_0.5/...
│   ├── run_t_25_nc_40000_ns_1030_ir_1_2205_lm_3_3_prc_0.5
│   ├── run_t_25_nc_40000_ns_1225_ir_1_1225_lm_0_0_prc_0.5
│   ├── run_t_25_nc_40000_ns_1225_ir_1_1225_lm_1_1_prc_0.5
│   ├── run_t_25_nc_40000_ns_1225_ir_1_1225_lm_2_2_prc_0.5
│   └── run_t_25_nc_40000_ns_1225_ir_1_1225_lm_3_3_prc_0.5
├── plots
│ ├── matrix_0.html
│ ├── matrix_0.svg
│ ├── matrix_1.html
│ ├── matrix_1.svg
│ ├── matrix_2.html
│ ├── matrix_2.svg
│ ├── matrix_3.html
│ ├── matrix_3.svg
│ ├── matrix_4.html
│ ├── matrix_4.svg
│ ├── matrix_5.html
│ ├── matrix_5.svg
│ ├── tool_performance_by_mismatches.html
│ ├── tool_performance_by_mismatches.json
│ ├── tool_performance_grid.html
│ ├── tool_performance_grid.svg
│ ├── tool_performance_mismatches_0.pdf
│ ├── tool_performance_mismatches_1.pdf
│ ├── tool_performance_mismatches_2.pdf
│ ├── tool_performance_mismatches_3.pdf
│ ├── tool_performance_stats_mismatches_0.tsv
│ ├── tool_performance_stats_mismatches_1.tsv
│ ├── tool_performance_stats_mismatches_2.tsv
│ ├── tool_performance_stats_mismatches_3.tsv
│ ├── tool_performance_vs_mismatches.html
│ ├── tool_performance_vs_mismatches.svg
│ ├── tool_recall_per_spacer_contig_grid_log.pdf
│ └── tool_recall_per_spacer_only_grid_log.pdf
└── results
  ├── aggregated_ground_truth.parquet
  ├── aggregated_performance_runtime.parquet
  ├── aggregated_runtimes.parquet
  ├── aggregated_tool_results.parquet
  ├── matrix_0.tsv
  ├── matrix_1.tsv
  ├── matrix_2.tsv
  ├── matrix_3.tsv
  ├── matrix_4.tsv
  ├── matrix_5.tsv
  └── tool_performance_by_mismatches.tsv
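The `*.sh.json` files listed under `raw_outputs` in the tree above sit next to the hyperfine text logs, so they are presumably hyperfine JSON exports; this record does not state that explicitly. Under that assumption, the sketch below pulls the mean runtime out of one such file. `summarize_hyperfine` is a hypothetical helper, and the sample JSON uses made-up values that follow the layout hyperfine writes with `--export-json`.

```python
import json


def summarize_hyperfine(json_text: str) -> list[dict]:
    """Reduce a hyperfine JSON export to a few per-command runtime fields.

    Assumption: the input has a top-level "results" list whose entries carry
    "command", "mean", "stddev", and "times" (all timings in seconds).
    """
    data = json.loads(json_text)
    summary = []
    for result in data["results"]:
        summary.append(
            {
                "command": result["command"],
                "mean_s": result["mean"],
                # stddev is null in the export when only one run was timed.
                "stddev_s": result.get("stddev"),
                "runs": len(result.get("times", [])),
            }
        )
    return summary


# Made-up example in the shape of a hyperfine export.
sample_json = """
{
  "results": [
    {
      "command": "bash bash_scripts/bowtie1.sh",
      "mean": 12.5,
      "stddev": 0.5,
      "median": 12.5,
      "user": 40.0,
      "system": 2.0,
      "min": 12.0,
      "max": 13.0,
      "times": [12.0, 12.5, 13.0]
    }
  ]
}
"""

rows = summarize_hyperfine(sample_json)
print(rows[0]["command"], rows[0]["mean_s"], rows[0]["runs"])
```

For a file on disk, pass the file contents instead, for example `summarize_hyperfine(open("raw_outputs/bowtie1.sh.json").read())`.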
Real Data:
├── bash_scripts
│ ├── bbmap_skimmer.sh
│ ├── blastn.sh
│ ├── bowtie1.sh
│ ├── bowtie2.sh
│ ├── lexicmap.sh
│ ├── minimap2.sh
│ ├── mmseqs.sh
│ ├── mummer4.sh
│ ├── spacer_containment.sh
│ └── strobealign.sh
├── job_scripts
│ ├── bbmap_skimmer.sh
│ ├── blastn.sh
│ ├── bowtie1.sh
│ ├── bowtie2.sh
│ ├── lexicmap.sh
│ ├── minimap2.sh
│ ├── mmseqs.sh
│ ├── mummer4.sh
│ ├── spacer_containment.sh
│ └── strobealign.sh
├── plots
│ ├── matrix_0.html
│ ├── matrix_0.svg
│ ├── matrix_1.html
│ ├── matrix_1.svg
│ ├── matrix_2.html
│ ├── matrix_2.svg
│ ├── matrix_3.html
│ ├── matrix_3.svg
│ ├── tool_performance_detailed_3bins.pdf
│ ├── tool_performance_detailed_stats.tsv
│ ├── tool_performance_max_mm_0_detailed_3bins.pdf
│ ├── tool_performance_max_mm_0_detailed_stats.tsv
│ ├── tool_performance_max_mm_1_detailed_3bins.pdf
│ ├── tool_performance_max_mm_1_detailed_stats.tsv
│ ├── tool_performance_max_mm_2_detailed_3bins.pdf
│ ├── tool_performance_max_mm_2_detailed_stats.tsv
│ ├── tool_performance_max_mm_3_detailed_3bins.pdf
│ ├── tool_performance_max_mm_3_detailed_stats.tsv
│ ├── tool_performance_mm_0_detailed_stats.tsv
│ ├── tool_performance_mm_1_detailed_stats.tsv
│ ├── tool_performance_mm_2_detailed_stats.tsv
│ ├── tool_performance_mm_3_detailed_stats.tsv
│ ├── tool_performance_panel.pdf
│ ├── tool_performance_panel.svg
│ ├── tool_performance_perfect_detailed_3bins.pdf
│ ├── tool_performance_perfect_detailed_stats.tsv
│ ├── tool_performance_vs_mismatches.pdf
│ ├── tool_performance_vs_occurrences_detailed.pdf
│ ├── tool_performance_vs_occurrences_detailed_3bins.pdf
│ ├── upset_0.pdf
│ ├── upset_1.pdf
│ ├── upset_2.pdf
│ └── upset_3.pdf
├── raw_outputs
│ ├── bbmap_skimmer_output.sam.gz
│ ├── blastn_output.tsv.zst
│ ├── bowtie1_output.sam.gz
│ ├── bowtie2_output.sam.gz
│ ├── lexicmap_output.tsv.zst
│ ├── minimap2_output.sam.gz
│ ├── mmseqs_output.tsv.zst
│ ├── mummer4_output.sam.gz
│ ├── spacer_containment_output.tsv.zst
│ └── strobealign_output.sam.gz
├── results
│ ├── Tool_exclusivity.tsv
│ ├── deviation_counts.csv
│ ├── matrix_0.tsv
│ ├── matrix_1.tsv
│ ├── matrix_2.tsv
│ ├── matrix_3.tsv
│ ├── matrix_4.tsv
│ ├── matrix_5.tsv
│ ├── spacer_counts_with_tools.parquet
│ ├── summary_stats.parquet
│ ├── tool_performance_by_mismatches.tsv
│ ├── tool_performance_vs_occurrences_detailed_stats.tsv
│ └── tools_results_mm_recalced.parquet
├── sacct.out
└── slurm_logs
  ├── bbmap_skimmer-15192666.err
  ├── bbmap_skimmer-15192666.out
  ├── bowtie1-15296994.err
  ├── bowtie1-15296994.out
  ├── bowtie2-15192707.err
  ├── bowtie2-15192707.out
  ├── lexicmap-15192729.err
  ├── lexicmap-15192729.out
  ├── minimap2-15192728.err
  ├── minimap2-15192728.out
  ├── mmseqs-15192703.err
  ├── mmseqs-15192703.out
  ├── mummer4-15192721.err
  ├── mummer4-15192721.out
  ├── spacer_containment-15224132.err
  ├── spacer_containment-15224132.out
  ├── strobealign-15192702.err
  ├── strobealign-15192702.out
  ├── vsearch-15258853.err
  └── vsearch-15258853.out
Files
(41.2 GB)
Checksum | Size |
---|---|
md5:c11c8e6fe687b70cd9bab0112f5bbc66 | 20.8 GB |
md5:7fa7e8a8770ea7ca316de20897647ae4 | 20.4 GB |
Additional details
Software
- Repository URL: https://code.jgi.doe.gov/spacersdb/spacer_matching_bench
- Programming language: Python, Rust