Seq Artifact -- Overview and Guide
==================================

About
-----

Seq is a programming language for computational genomics and bioinformatics. With a Python-compatible syntax and a host of domain-specific features and optimizations, Seq makes writing high-performance genomics software as easy as writing Python code, and achieves performance comparable to (and in many cases better than) C/C++.

Read more about Seq in our [paper](http://cb.csail.mit.edu/cb/seq/oopsla19-paper34.pdf), conditionally accepted to [OOPSLA 2019](https://conf.researchr.org/track/splash-2019/splash-2019-oopsla).

Overview
--------

This document describes the artifact submitted alongside the Seq paper and how to use it to run the benchmarks given in Section 6, for each of the various evaluated languages/compilers.

While it is difficult to accurately reproduce performance results inside a VM (although in our experience, performance in the VM has been roughly comparable to the paper's results), we hope that this artifact will at least allow users to get familiar with Seq, play around with some real-world code examples, and in general see how it compares to the alternatives.

Getting Started
---------------

#### VM Installation

First download the Vagrant ([https://www.vagrantup.com](https://www.vagrantup.com)) image at [http://cb.csail.mit.edu/cb/seq/34.tgz](http://cb.csail.mit.edu/cb/seq/34.tgz) (3 GB), which contains the image file `seq.box`. This image requires working VirtualBox and Vagrant installations. Once it is downloaded, run the following commands to start the Seq VM:

```
# Add the downloaded box to Vagrant. Takes a few minutes.
vagrant init seq /path/to/downloaded/seq.box
# Edit Vagrantfile to set up the VM parameters. Details below.
vim Vagrantfile
# Start the VM
vagrant up
# Connect to the VM
vagrant ssh
```

#### Vagrantfile Configuration

The Seq VM requires at least 2 GB of RAM to run.
For experiments such as _fasta_ and _knucleotide_, you should grant the VM at least 6 GB of RAM. _snap_ requires 32 GB of RAM, while _sga_ requires 64 GB of RAM. Also, assign the VM at least two cores if you wish to run parallel Seq code. An example Vagrantfile with 64 GB of RAM is given below:

```ruby
Vagrant.configure("2") do |config|
  config.vm.box = "seq"
  config.vm.box_url = "/path/to/downloaded/seq.box"
  config.vm.provider "virtualbox" do |vb|
    # Set up the VM RAM in MB
    vb.memory = "65536"
    # Make sure to disable serial ports for successful booting
    vb.customize [ "modifyvm", :id, "--uartmode1", "disconnected" ]
  end
end
```

#### Datasets

Several benchmarks require large datasets in order to reproduce the results from the paper. Specifically, _16mer_, _rc_, _cpg_, _snap_ and _sga_ require the whole-genome HG00123 reads available at [http://cb.csail.mit.edu/cb/seq/HG00123.tar.bz2](http://cb.csail.mit.edu/cb/seq/HG00123.tar.bz2) (4.5 GB; 24 GB uncompressed). The VM already comes with a truncated HG00123 toy dataset that allows the benchmarks to run; download the full dataset only if you wish to reproduce the exact experiments outlined in the paper.

Additionally, the _snap_ benchmark requires a large human genome index available at [http://cb.csail.mit.edu/cb/seq/snap.tar.bz2](http://cb.csail.mit.edu/cb/seq/snap.tar.bz2) (22 GB; 29 GB uncompressed).

> Note: All files are tarred and bzipped by default. Remember to untar them via `tar jxvf archive.tar.bz2` before running the experiments!

> Note: All data files must be placed (or untarred) in the `$HOME/data` directory.

#### Code

The source code of all benchmarks is located in the `benchmarks/` directory. For example, a C++ implementation of the _cpg_ benchmark is available in `benchmarks/cpg/cpg.cc`; an idiomatic Seq implementation is in `benchmarks/cpg/cpg.id.seq`, while a parallel Seq version is located in `benchmarks/cpg/cpg.par.seq`. Clang and g++ share the same implementations, as do Python, PyPy and Nuitka.
In some cases, the Shedskin implementations differ slightly from the Python ones due to the feature gap between Python and Shedskin. Shedskin implementations can be found in `_shed.py`.

You can also download the benchmark code separately at [http://cb.csail.mit.edu/cb/seq/benchmarks.tgz](http://cb.csail.mit.edu/cb/seq/benchmarks.tgz).

Step by Step Instructions
-------------------------

Aside from the datasets/indices described above, the VM comes preloaded with everything needed to run the benchmarks from Section 6 of the Seq paper. Once the VM boots, you can use the `run.sh` script to run any of the experiments:

    ./run.sh <experiment_name> <compiler>

`experiment_name` is one of the benchmarks presented in the paper, namely:

- fasta
- revcomp
- knucleotide
- cpg
- 16mer
- rc
- sga
- snap

`compiler` is one of the tested compilers, namely:

- all (runs the experiment with all compilers)
- g++
- clang
- seq
- seq-id
- seq-par
- python
- nuitka
- shedskin
- pypy
- julia

As noted in the paper, _seq-id_ refers to idiomatic Seq code (i.e. code using features that are not Python-compatible). _seq-par_ is for parallel Seq runs of the _cpg_, _16mer_ and _sga_ benchmarks. For the _snap_ and _sga_ benchmarks, _seq-id_ indicates the use of prefetching.

Importantly, **not all combinations of experiment+compiler exist!** In particular (as also shown in the paper's results):

- For _fasta_, _revcomp_ and _knucleotide_, we do not have separate idiomatic Seq implementations, so these all run with regular _seq_.
- _snap_ and _sga_ only have Seq (_seq_, _seq-id_) and C++ (_clang_, _g++_) implementations.
- Parallel implementations (_seq-par_) only exist for _cpg_, _16mer_ and _sga_.

#### Reproducing the Tables

In the home directory, there is a `reproduce/` folder containing scripts to reproduce the tables in the paper.
If you don't see this folder (perhaps due to using an older VM instance), you can download it by running the following in the home directory:

```
wget -c http://cb.csail.mit.edu/cb/seq/reproduce.tgz -O - | tar -xz
```

For example, to reproduce Table 1, simply use

```
reproduce/table1
```

You can use `reproduce/table2` and `reproduce/table3` for Tables 2 and 3, respectively. Note that Figure 15 simply shows the results from Tables 1 and 2 as bar charts.

#### Playing with Seq

If you wish to play with the Seq compiler, you can run Seq code through the JIT by typing:

```
seq myfile.seq
```

Alternatively, you can produce a compiled executable by running

```
seq-compile myfile.seq myexec
```

and then run the executable via

```
./myexec
```

All other tools (g++, clang, nuitka, pypy, julia, shedskin and python) are already available on the default `PATH`.

> Note: Parallelism will not work with the default Seq compiler. Parallel Seq is provided separately (executables are located in the `$HOME/seq-tapir` directory) due to ongoing dependency issues (LLVM+Tapir) that we hope to resolve in the near future. The environment variable `OMP_NUM_THREADS` controls the number of threads that Seq programs are allowed to use. For more details, consult the `run_seq_par` function in `run.sh`.

> Note: The Seq build provided in the VM is a debug build, so you may see warnings when e.g. using atomic operations or shadowing variables/functions. It is safe to ignore these!

#### Troubleshooting

> I see an error message about temporary space/directories/files.

The VM is probably out of space. You can remove output files from previous runs via `rm -rf ~/out/*`; this should free up some space.

> I see a Seq assertion failure in `list.seq`.

The most likely cause is passing too few arguments to a program, which triggers an exception when the `argv` list is accessed.
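
One way to avoid the `argv` failure described above is to check the argument count before indexing into the list. The sketch below is plain Python, not code taken from the benchmarks; the helper name `require_args` and the `input_file` label are hypothetical. Because Seq's syntax is Python-compatible, a similar guard may carry over to Seq programs, but that has not been verified here.

```python
def require_args(argv, names):
    """Return the expected positional arguments or exit with a usage message.

    argv  -- full argument list, with the program name at index 0
    names -- labels for the expected arguments (hypothetical, for the message)
    """
    prog = argv[0] if argv else "program"
    if len(argv) - 1 < len(names):
        usage = " ".join("<" + n + ">" for n in names)
        # Fail early with a readable message instead of an index error later.
        raise SystemExit("usage: " + prog + " " + usage)
    return argv[1:len(names) + 1]


# Example with an explicit list; a real program would pass sys.argv instead.
args = require_args(["cpg", "reads.txt"], ["input_file"])
print(args)  # prints ['reads.txt']
```

Calling `require_args(["cpg"], ["input_file"])` would instead exit with `usage: cpg <input_file>`, which is easier to diagnose than an assertion failure inside the list implementation.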