Clean FASTA Files
The FASTA file cleaner modifies FASTA files so that they are compatible with SEISMIC-RNA.
Background about cleaning FASTA files
Purpose of cleaning FASTA files
Variants of FASTA format exist that are not compatible with SEISMIC-RNA.
FASTA files from databases such as NCBI and Gencode often encode
metadata in their header lines, which SEISMIC-RNA cannot handle.
Additionally, some FASTA files may contain characters besides the four
nucleotides and N, such as the IUPAC extended nucleotide codes.
For details on the valid FASTA format for SEISMIC-RNA and the rationale
behind it, see FASTA: Reference sequences.
Algorithm for cleaning FASTA files
The FASTA file cleaner of SEISMIC-RNA follows the following procedure:
If a line begins with
>, then it is assumed to be a header line; otherwise, a body (i.e. sequence-encoding) line.If the line is a header line, then it is trimmed from the right side until all characters in the line (except the initial
>character) are valid path characters.If the line is a body line, then all whitespace characters are removed and the remaining characters are converted to uppercase. If the FASTA sequence type is DNA, then every
Uis replaced withT; and if RNA, then everyTis replaced withU. All remaining illegal characters that belong to the set of standard IUPAC extended nucleotide codes are replaced withN. If any illegal characters remain after that, then an error is raised.
How to clean FASTA files
Command line for cleaning FASTA files
Clean the FASTA file refs.fa:
seismic +cleanfa refs.fa
This command will output a cleaned FASTA file with the same name as the
input FASTA file, located in the output directory (default ./out).
In this case: out/refs.fa.
To change the output directory, use the --out-dir/-o option:
seismic +cleanfa -o clean refs.fa
will output the cleaned FASTA file clean/refs.fa.
By default, this command will not overwrite existing FASTA files.
To force overwriting, use the option --force:
seismic +cleanfa --force refs.fa
Troubleshooting cleaning FASTA files
Errors while cleaning FASTA files
Line contains sequence characters other than whitespace or IUPAC codes
This error means that the sequence contains characters that cannot be cleaned because they do not belong in FASTA files, so it is unclear how they should be handled automatically. The only characters that can be cleaned automatically are as follows:
whitespace: deleted
lowercase: converted to uppercase
U(for DNA): replaced withTT(for RNA): replaced withUBDHKMRSVWY(IUPAC extended nucleotide codes): replaced withNstandard four nucleotides, plus
N: unchanged
If the FASTA file contains any other characters, then first remove them
manually and then rerun +cleanfa.
The error message will indicate which line(s) have invalid characters.