Published November 8, 2024 | Version v1
Software Open

Domainator: A flexible and modular software suite for domain-based gene neighborhood and protein search, extraction, and clustering.

Description

Archive snapshot of the Domainator GitHub page:

nebiolabs/domainator: A flexible and modular software suite for domain-based gene neighborhood and protein search, extraction, and clustering.

 

Domainator provides several dozen discrete, flexible programs that can be composed into a broad range of workflows via the command line or Python scripting. Domainator also treats HMM-profiles as first-class citizens, supporting subsetting of .hmm files and comparison of HMM-profiles, including the construction of profile similarity networks and trees.

Domainator uses the GenBank file format as a carrier of both sequence and annotation data. Independence from a fixed set of sequence sources and the co-location of sequences and all their annotation data in a single file increase data portability and decrease complexity for end-users. Domainator can add functional annotations to sequences by comparison to HMM-profiles, protein sequences, or both. For example, in a single call to domainate, a set of genome or metagenome contigs can be annotated with hits to Pfam HMM-profiles and hits to REBASE Gold Standard protein sequences at the same time.

The individual programs that make up Domainator can be roughly classified into six categories corresponding to their typical roles in workflows. The first steps in most workflows typically involve passing sequence data through one or more editors.

Editors are programs whose output format is the same as their input format. Each individual editor performs a simple task, but they can be combined in arbitrarily long chains to accomplish complex transformations. Examples are domain_search, which outputs a subset of the input sequences based on the presence of a hit to a reference sequence or profile; domainate, which outputs all the input sequences but adds domain annotations based on hits to reference sequences; deduplicate_genbank, which performs similarity clustering using CD-HIT or usearch and outputs only the cluster representatives; and select_by_cds, which extracts genome neighborhoods around domains of interest.

In Domainator programs, perhaps unlike in other software, the file being edited is supplied via the -i argument, and the editing criteria are supplied via other arguments, such as the -r argument for reference sequences or HMM-profiles. For example, searching for hits to a query in the UniProt database may be accomplished via domain_search.py -i uniprot.fasta -r query.hmm -o hits.gb. Similarly, annotating a set of contigs with Pfam annotations may be accomplished with domainate.py -i unannotated.gb -r pfam-A.hmm -o annotated.gb.
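
The two invocations above can be chained from Python. The following is a minimal sketch and not part of Domainator: it only assembles argument lists from the -i, -r, and -o options quoted above. The helper names build_step, build_chain, and run_chain are hypothetical, and feeding hits.gb from the first step into the second is an illustrative choice. Running the commands assumes the Domainator scripts are installed and on PATH.

```python
import subprocess


def build_step(program, input_path, reference_path, output_path):
    """Return the argv list for one editor call (hypothetical helper)."""
    return [program, "-i", input_path, "-r", reference_path, "-o", output_path]


def build_chain():
    """Build a two-step chain: search first, then annotate the hits."""
    # Step 1: keep only the UniProt sequences with a hit to the query profile.
    search = build_step("domain_search.py", "uniprot.fasta", "query.hmm", "hits.gb")
    # Step 2: annotate those hits with Pfam domains. Because editors read and
    # write the same format, the output of step 1 is the input of step 2.
    annotate = build_step("domainate.py", "hits.gb", "pfam-A.hmm", "annotated.gb")
    return [search, annotate]


def run_chain(steps):
    """Run each step in order, stopping at the first failure."""
    # Assumption: the Domainator scripts are installed and on PATH.
    for argv in steps:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Print the commands rather than running them.
    for argv in build_chain():
        print(" ".join(argv))
```

Keeping command construction separate from execution lets the chain be inspected or logged before anything is run.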

Summary report programs summarize data into graphs and statistics, for example, the number of sequences in a file and the count of each kind of domain. Record-wise report programs produce tab-separated files in which each row corresponds to, for example, a genome contig, a protein, or a domain, and the values are data such as length, taxonomy ID, and domain content.

Record-wise reports are useful for exporting data to programs, such as Excel, that can’t read GenBank or .hmm files, and they also serve as intermediate files between some programs in Domainator.
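
Because record-wise reports are tab-separated, they can be read with standard tools. The sketch below uses Python's csv module on made-up data; the column names (contig, length, domains) and the values are hypothetical, and real reports will have whatever columns the report program was asked to produce.

```python
import csv
import io

# Made-up report content for illustration only; real column names
# depend on the report program and its options.
EXAMPLE_REPORT = (
    "contig\tlength\tdomains\n"
    "contig_1\t5230\tdomA;domB\n"
    "contig_2\t1810\tdomA\n"
)


def read_report(handle):
    """Parse a tab-separated report into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = read_report(io.StringIO(EXAMPLE_REPORT))

# Example downstream use: names of contigs longer than 2000 bases.
long_contigs = [row["contig"] for row in rows if int(row["length"]) > 2000]
```

For a report on disk, the same function can be given a file opened with open(path, newline="").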

Comparison programs generate pairwise score or distance matrices between contigs or HMM-profiles. The compare_contigs program uses the Jaccard and adjacency indexes to compare proteins or gene neighborhoods based on their domain content, whereas seq_dist uses local alignment scores, for example via phmmer, Diamond, hmmsearch, or the Viterbi profile-comparison algorithm. Comparison programs output data matrices.
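
To illustrate the Jaccard index mentioned above, here is a minimal sketch of the standard set-based definition (size of the intersection divided by size of the union) applied to domain names. It is not Domainator's implementation; the function name, the made-up domain labels, and the choice to return 0.0 for two empty sets are assumptions of this example.

```python
def jaccard_index(domains_a, domains_b):
    """Jaccard index of two collections of domain names, treated as sets."""
    a, b = set(domains_a), set(domains_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets score 0.0.
        return 0.0
    return len(a & b) / len(union)


# Two hypothetical gene neighborhoods described by their domain content.
neighborhood_1 = ["domA", "domB", "domC"]
neighborhood_2 = ["domB", "domC", "domD"]
similarity = jaccard_index(neighborhood_1, neighborhood_2)  # 2 shared / 4 total
```

Because the inputs are converted to sets, repeated domains and domain order do not affect the score.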

Plotting programs convert data into formats appropriate for graphical visualization, for example converting score matrices and tabular metadata into trees or similarity networks which can be viewed in Cytoscape or other external visualization tools, depending on the data type.
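
The idea of turning a score matrix into a similarity network can be sketched as follows. This is an illustrative example under stated assumptions, not Domainator's code: the function names, the threshold value, and the labels are hypothetical, and the matrix is assumed to be square and symmetric. The output is a tab-separated edge list, a generic format that network tools can typically import.

```python
import csv
import io


def matrix_to_edges(labels, scores, threshold):
    """Return (source, target, score) for each pair scoring at or above threshold."""
    edges = []
    for i in range(len(labels)):
        # Only the upper triangle is visited, so each pair appears once
        # and self-comparisons on the diagonal are skipped.
        for j in range(i + 1, len(labels)):
            if scores[i][j] >= threshold:
                edges.append((labels[i], labels[j], scores[i][j]))
    return edges


def write_edges(edges, handle):
    """Write the edge list as tab-separated text with a header row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["source", "target", "score"])
    writer.writerows(edges)


# Made-up symmetric score matrix for three records.
labels = ["a", "b", "c"]
scores = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.5],
    [0.1, 0.5, 1.0],
]
edges = matrix_to_edges(labels, scores, threshold=0.5)

buffer = io.StringIO()
write_edges(edges, buffer)
edge_table = buffer.getvalue()
```

Raising the threshold removes weaker edges and tends to split the network into smaller clusters.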

Finally, there are a few programs that defy categorization. These programs perform functions such as downloading data from NCBI or UniProt, converting files between formats, or generating profile-profile alignments.

Files

Files (4.5 MB)

Name: domainator-main_November_08_2024.zip
Size: 4.5 MB
Checksum: md5:7eab790d256ade7d434206715b3507a7

Additional details

Related works

Is described by
Preprint: 10.1101/2024.04.23.590562 (DOI)

Software

Repository URL
https://github.com/nebiolabs/domainator
Programming language
Python
Development Status
Active