Seq is a programming language for computational genomics and bioinformatics. With a Python-compatible syntax and a host of domain-specific features and optimizations, Seq makes writing high-performance genomics software as easy as writing Python code, and achieves performance comparable to (and in many cases better than) C/C++.
This document describes the artifact submitted alongside the Seq paper and how to use it to run the benchmarks given in Section 6, for each of the various evaluated languages/compilers.
While it is difficult to accurately reproduce performance results inside a VM (although in our experience, performance in the VM has been roughly comparable to the paper's results), we hope that this artifact will at least allow users to get familiar with Seq, play around with some real-world code examples, and in general see how it compares to the alternatives.
First download the Vagrant (https://www.vagrantup.com) image at http://cb.csail.mit.edu/cb/seq/34.tgz (3 GB), which contains the image file
seq.box. This image requires working VirtualBox and Vagrant installations.
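Before adding the box, it can help to confirm that both prerequisites are actually on the PATH. The following is a minimal sketch, not part of the artifact: missing_tools is a hypothetical helper name, and it assumes the VirtualBox command-line tool is installed under its usual name, VBoxManage.

```shell
# Hypothetical helper (not shipped with the artifact):
# print each named command that is not found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumption: VirtualBox exposes its CLI as VBoxManage.
missing=$(missing_tools vagrant VBoxManage)
if [ -n "$missing" ]; then
    echo "Install before continuing:" $missing
fi
```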
Once it is downloaded, run the following commands to start the Seq VM:
# Add the downloaded box to Vagrant. Takes a few minutes.
vagrant init seq /path/to/downloaded/seq.box
# Edit Vagrantfile to set up the VM parameters. Details below.
vim Vagrantfile
# Start the VM
vagrant up
# Connect to the VM
vagrant ssh
The Seq VM requires at least 2 GB of RAM to run. For experiments such as fasta and knucleotide, you should grant the VM at least 6 GB of RAM. snap requires 32 GB of RAM, while sga requires 64 GB of RAM. Also, give the VM at least two cores if you wish to run parallel Seq code.
An example Vagrantfile with 64 GB of RAM is given below:
Vagrant.configure("2") do |config|
  config.vm.box = "seq"
  config.vm.box_url = "/path/to/downloaded/seq.box"
  config.vm.provider "virtualbox" do |vb|
    # Set up the VM RAM in MB
    vb.memory = "65536"
    # Make sure to disable serial ports for successful booting
    vb.customize [ "modifyvm", :id, "--uartmode1", "disconnected" ]
  end
end
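If you plan to run only some of the benchmarks, the memory guidance above can be turned into a small lookup that prints the value to put in vb.memory. This is a sketch only: min_ram_mb is a hypothetical helper, and the 2 GB fallback is the stated minimum for the VM itself, so treating it as sufficient for benchmarks not named above is an assumption.

```shell
# Hypothetical helper (not part of the artifact): print the minimum VM RAM
# in MB for a benchmark, using the figures quoted in this document.
min_ram_mb() {
    case "$1" in
        sga)               gb=64 ;;
        snap)              gb=32 ;;
        fasta|knucleotide) gb=6 ;;
        *)                 gb=2 ;;  # assumption: VM minimum; other benchmarks may need more
    esac
    echo $((gb * 1024))
}

min_ram_mb sga   # the value used for vb.memory in the example Vagrantfile
```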
Several benchmarks require large datasets in order to reproduce the results from the paper. Specifically, 16mer, rc, cpg, snap and sga require the whole-genome HG00123 reads available at http://cb.csail.mit.edu/cb/seq/HG00123.tar.bz2 (4.5 GB; 24 GB uncompressed). The VM already comes with the truncated HG00123 toy dataset that allows the benchmarks to run; download the full dataset only if you wish to reproduce the exact experiments outlined in the paper.
Additionally, the snap benchmark requires a large human genome index available at http://cb.csail.mit.edu/cb/seq/snap.tar.bz2 (22 GB; 29 GB uncompressed).
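Because these archives are large once unpacked, it may be worth checking free disk space before downloading. The sketch below is not part of the artifact: has_free_gb is a hypothetical helper, and using $HOME as the target is an assumption; substitute the directory where you intend to place the data.

```shell
# Hypothetical helper: succeed if the filesystem holding DIR has at least
# NEED_GB gigabytes available.
has_free_gb() {
    dir=$1
    need_gb=$2
    # df -Pk prints one POSIX-format line per filesystem, sizes in 1024-byte
    # blocks; field 4 of the second line is the available space.
    free_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$free_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# 24 GB is the uncompressed size of the HG00123 reads quoted above.
# Assumption: the data will live somewhere under $HOME.
has_free_gb "$HOME" 24 || echo "Not enough free space for the uncompressed HG00123 reads"
```

Keep in mind that the compressed archive also takes space until you delete it.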
Note: All files are tarred and bzipped by default. Remember to untar them before running the experiments via
tar jxvf archive.tar.bz2
Note: All data files must be placed (or untarred) in the
The source code of all benchmarks is located in the
benchmarks/<benchmark_name> directory. For example, a C++ implementation of the cpg benchmark is available in
benchmarks/cpg/cpg.cc; an idiomatic Seq implementation would be
benchmarks/cpg/cpg.id.seq, while a parallel Seq version is located in
Clang and g++ share the same implementations, as do Python, PyPy and Nuitka. In some cases, the Shedskin implementations differ slightly from the Python ones due to the feature gap between Python and Shedskin. Shedskin implementations can be found in
You can also download the benchmark code separately at http://cb.csail.mit.edu/cb/seq/benchmarks.tgz.
Aside from the datasets/indices described above, the VM comes preloaded with everything needed to run the benchmarks from Section 6 of the Seq paper. Once the VM boots, you can use the
run.sh script to run any of the experiments:
./run.sh <experiment_name> <compiler>
experiment_name is one of the benchmarks presented in the paper, namely:
compiler is one of the tested compilers, namely:
As noted in the paper, seq-id refers to idiomatic Seq code (i.e. code that uses features that are not Python-compatible). seq-par is for parallel Seq runs for the cpg, 16mer and sga benchmarks. For the snap and sga benchmarks, seq-id indicates the use of prefetching.
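To run one experiment under several compilers in a row, a small loop around run.sh is enough. The following is a minimal sketch under stated assumptions: run_for_compilers and the RUN_SH variable are hypothetical names, and the sketch assumes run.sh exits with a non-zero status when a run fails or a combination is unavailable, which this document does not confirm.

```shell
# Path to the artifact's runner; override RUN_SH to point elsewhere.
RUN_SH=${RUN_SH:-./run.sh}

# Hypothetical helper: run EXPERIMENT once per listed compiler and keep going
# when a run fails, reporting the failures at the end.
run_for_compilers() {
    experiment=$1
    shift
    failed=""
    for compiler in "$@"; do
        echo "== $experiment / $compiler =="
        # Assumption: a non-zero exit status means the run did not succeed.
        if ! "$RUN_SH" "$experiment" "$compiler"; then
            failed="$failed $compiler"
        fi
    done
    [ -z "$failed" ] || echo "skipped or failed:$failed"
}

# Example (inside the VM): run_for_compilers cpg seq-id seq-par
```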
Importantly, not all combinations of experiment+compiler exist! In particular (as also shown in the paper's results):
In the home directory, there is a
reproduce/ folder containing scripts to reproduce the tables in the paper. If you don't see this folder (perhaps due to using an older VM instance), you can download it by running the following in the home directory:
wget -c http://cb.csail.mit.edu/cb/seq/reproduce.tgz -O - | tar -xz
For example, to reproduce Table 1, simply use
You can use
reproduce/table3, for Tables 2 and 3 respectively. Note that Figure 15 simply shows the results from Tables 1 and 2 as bar charts.
If you wish to play with the Seq compiler, you can run Seq code through the JIT by typing:
seq myfile.seq <program args>
Alternatively, you can produce a compiled executable by running
seq-compile myfile.seq myexec
and then run the executable via
./myexec <program args>
All other tools (g++, clang, nuitka, pypy, julia, shedskin and python) are already loaded in the default
Note: Parallelism will not work with the default Seq compiler. Parallel Seq is provided separately (executables are located in the
$HOME/seq-tapir directory) due to ongoing dependency issues (LLVM+Tapir) that we hope to resolve in the near future. The environment variable
OMP_NUM_THREADS controls the number of threads that Seq programs are allowed to consume. For more details, consult the
Note: The Seq build provided in the VM is a debug build, so you may see warnings when e.g. using atomic operations or shadowing variables/functions. It is safe to ignore these!
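Returning to the thread-count setting for parallel Seq programs: the variable can be set for a single run rather than for the whole session. The wrapper below is a sketch only; with_threads is a hypothetical helper, and ./my_parallel_exec stands in for any program built with the parallel Seq toolchain.

```shell
# Hypothetical wrapper: run a command with OMP_NUM_THREADS set to the given
# count, leaving the variable unset for everything else in the session.
with_threads() {
    threads=$1
    shift
    OMP_NUM_THREADS=$threads "$@"
}

# Example (placeholder executable name):
#   with_threads 2 ./my_parallel_exec <program args>
# To use every core visible to the VM:
#   with_threads "$(getconf _NPROCESSORS_ONLN)" ./my_parallel_exec <program args>
```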
I see an error message about temporary space/directories/files.
The VM is probably out of space. You can remove output files from previous runs via
rm -rf ~/out/*; this should free up some space.
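If you want to see how much space the cleanup recovers, the same removal can be wrapped with before and after reports. This is a sketch, not part of the artifact: clear_outputs is a hypothetical helper that defaults to the ~/out directory named above.

```shell
# Hypothetical helper: report the space used by old outputs, remove them,
# then report the free space on that filesystem.
clear_outputs() {
    out_dir=${1:-$HOME/out}
    [ -d "$out_dir" ] || return 0   # nothing to clean
    du -sh "$out_dir"               # space currently used by old outputs
    rm -rf "${out_dir:?}"/*         # same removal as above; aborts if the path is empty
    df -h "$out_dir"                # free space after cleanup
}

# Example: clear_outputs            (cleans ~/out)
```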
I see a Seq assertion failure in
The most likely reason for this is not passing enough arguments to a program, and getting an exception when the
argv list is accessed.