Published April 5, 2024 | Version v1.0.2
Software Open

Fasta_Seq_Prepare_Bash

Description

AramayoLab

Motivation

The main motivation for writing this script was to pre-process transcriptome and/or proteome fasta files before subjecting them to computational analysis.
By default, the script (-d flag, 'Dust Gremlings') replaces characters that, if present in a sequence, can interfere with or abort the processing
of the file by certain software packages. This is particularly true for InterProScan, which is extremely sensitive to the presence of the character '*', which is often
found in ENSEMBL protein sequences. When found, these characters are replaced by the IUPAC 'X' (unknown) in proteins and by 'n' in transcripts.
The '-r' and '-s' flags control the sorting of the transcriptome and/or proteome fasta sequences from large to small (-r flag) or from small to large (-s flag).
By default, the script (-l flag) removes transcripts of 150 nucleotides or fewer and proteins of 50 amino-acid residues or fewer, unless other values are assigned.
If desired, the script (-u flag) can also remove sequences whose length is equal to or larger than a given size.
If activated, the script (-f flag) removes sequences belonging to the following biotypes: IG_D_gene, IG_J_gene, TR_D_gene, and TR_J_gene, as defined by the fasta headers of
ENSEMBL transcriptome and/or protein sequences.
A fasta file can also be split into files containing a given number of sequences (e.g., 50000). This behavior (suppressed by default) is controlled by the (-i flag).
A transcriptome or proteome file can also be clustered using CD-HIT-EST or CD-HIT, respectively, as controlled by the (-c flag). This behavior is suppressed by default.
If both splitting and clustering of a fasta file are requested, the file is first clustered and the resulting clustered file is then split.
The (-x flag) controls the number of cores used in the clustering process. Finally, the (-z flag) selects whether the data are processed in $TMPDIR (the default) or in the
directory where the script is being executed.
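
The dusting, biotype filtering, and length filtering steps described above can be illustrated with a few lines of awk. This is a minimal sketch under stated assumptions, not the code used by the script: the function names (dust_fasta, filter_biotypes, filter_by_length) are hypothetical, headers are assumed to carry an ENSEMBL-style "gene_biotype:<name>" token, and the length filter writes each sequence on a single line rather than re-wrapping it.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: function names and logic are assumptions,
# not code taken from Fasta_Seq_Prepare_v1.0.2.sh.

# Replace '*' in sequence lines; header lines (starting with '>') are left alone.
# $1 = replacement character ('X' for proteins, 'n' for transcripts).
dust_fasta() {
    awk -v repl="$1" '/^>/ { print; next } { gsub(/\*/, repl); print }'
}

# Drop records whose header names one of the four biotypes listed above.
# Assumes ENSEMBL-style headers containing a "gene_biotype:<name>" token.
filter_biotypes() {
    awk '/^>/ { skip = ($0 ~ /gene_biotype:(IG_D_gene|IG_J_gene|TR_D_gene|TR_J_gene)( |$)/) }
         !skip'
}

# Keep records whose length is > $1 and, when $2 is non-zero, < $2.
# Wrapped sequences are joined; output has one line per sequence.
filter_by_length() {
    awk -v lo="$1" -v hi="$2" '
        function emit() {
            if (hdr != "" && length(seq) > lo && (hi == 0 || length(seq) < hi))
                print hdr "\n" seq
        }
        /^>/ { emit(); hdr = $0; seq = ""; next }
        { seq = seq $0 }
        END { emit() }'
}

# Toy proteome: p1 has 4 residues, p2 has 8 (wrapped over two lines).
printf '>p1\nMK*L\n>p2\nMKVL\nAAG*\n' | dust_fasta X | filter_by_length 5 0
```

In this sketch a lower bound of 5 drops p1 and keeps p2, mirroring the "equal to, or shorter than" rule of the -l flag; an upper bound of 0 means no upper limit, as with the -u default.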

Documentation

########################################################################################################################################################################################################
ARAMAYO_LAB

This program is free software: you can redistribute it and/or modify it under the terms of the GNU
General Public License as published by the Free Software Foundation, either version 3 of the
License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without
even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not,
see <https://www.gnu.org/licenses/>.

SCRIPT_NAME:                       Fasta_Seq_Prepare_v1.0.2.sh
SCRIPT_VERSION:                    1.0.2

USAGE: Fasta_Seq_Prepare_v1.0.2.sh
       -p Homo_sapiens.GRCh38.pep.all.fa               # REQUIRED if -t Not Provided (Proteins File - Proteome)
       -t Homo_sapiens.GRCh38.cds.all.fa               # REQUIRED if -p Not Provided (Transcripts File - Transcriptome)
       -l Sequences Lower Size                         # OPTIONAL (default = 50 - proteins | 150 - transcripts)
       -u Sequences Upper Size                         # OPTIONAL (default = No Limit)
       -f Filter Biotypes                              # OPTIONAL (default = No)
       -d Dust Gremlings                               # OPTIONAL (default = Yes)(Yes: Converts * to X - proteins and * to n - nucleotides)
       -s Sort File From Shorter to Larger             # OPTIONAL (default = No) (Yes: Sequences will be sorted from shorter to larger)
       -r Sort File From Larger to Shorter             # OPTIONAL (default = No) (Yes: Sequences will be sorted from larger to shorter)
       -c Cluster Sequences                            # OPTIONAL (default = No) (Yes: Requires cd-hit Installed)
       -w Fasta Width                                  # OPTIONAL (default = 80)
       -i Split fasta file                             # OPTIONAL (default = No = 1)
       -x Number of Cores                              # OPTIONAL (default = 2)
       -z TMPDIR Location                              # OPTIONAL (default=0='TMPDIR Run')

TYPICAL COMMANDS:
                                   Fasta_Seq_Prepare_v1.0.2.sh -p Homo_sapiens.GRCh38.pep.all.fa
                                   Fasta_Seq_Prepare_v1.0.2.sh -p Homo_sapiens.GRCh38.pep.all.fa -c yes
                                   Fasta_Seq_Prepare_v1.0.2.sh -t Homo_sapiens.GRCh38.cdna.all.fa
                                   Fasta_Seq_Prepare_v1.0.2.sh -t Homo_sapiens.GRCh38.cdna.all.fa -c yes

INPUT01:          -p FLAG          REQUIRED input ONLY if the '-t' flag associated file is not provided
INPUT01_FORMAT:                    Proteome Fasta File
INPUT01_DEFAULT:                   No default

INPUT02:          -t FLAG          REQUIRED input ONLY if the '-p' flag associated file is not provided
INPUT02_FORMAT:                    Transcriptome Fasta File
INPUT02_DEFAULT:                   No default

INPUT03:          -l FLAG          OPTIONAL input. Sequence Lower Size Filtering Value
INPUT03_FORMAT:                    Numeric
INPUT03_DEFAULT:                   50 (proteins) | 150 (transcripts) | 1 = No Limit (Do Not Filter)
INPUT03_NOTES:                     If provided, this number will be used to reject sequences whose length is equal to, or shorter than, the number provided
INPUT03_NOTES:                     Note that the default value is a function of the class of file provided (i.e., proteome or transcriptome)
INPUT03_NOTES:                     Also note that any numerical value will be accepted

INPUT04:          -u FLAG          OPTIONAL input. Sequence Upper Size Filtering Value
INPUT04_FORMAT:                    Numeric
INPUT04_DEFAULT:                   0 = No Limit (Do Not Filter)
INPUT04_NOTES:                     If provided, this number will be used to reject sequences whose length is equal to, or larger than, the number provided
INPUT04_NOTES:                     Note that any numerical value will be accepted

INPUT05:          -f FLAG          OPTIONAL input. Filter Biotype
INPUT05_FORMAT:                    yes | no
INPUT05_DEFAULT:                   no (Do Not Filter)
INPUT05_NOTES:                     If Activated, sequences belonging to the following biotypes will be filtered out: IG_D_gene, IG_J_gene, TR_D_gene, and TR_J_gene

INPUT06:          -d FLAG          OPTIONAL input. Dust Gremlings
INPUT06_FORMAT:                    yes | no
INPUT06_DEFAULT:                   yes (Dust Gremlings)
INPUT06_NOTES:                     Dusting will convert '*' to 'X' (for proteins), and '*' to 'n' (for transcripts)

INPUT07:          -s FLAG          OPTIONAL input. Sort file
INPUT07_FORMAT:                    yes | no
INPUT07_DEFAULT:                   no (Do Not Sort)
INPUT07_NOTES:                     If provided, sequences will be sorted from shorter to larger

INPUT08:          -r FLAG          OPTIONAL input. Sort file
INPUT08_FORMAT:                    yes | no
INPUT08_DEFAULT:                   no (Do Not Sort)
INPUT08_NOTES:                     If provided, sequences will be sorted from larger to shorter

INPUT09:          -c FLAG          OPTIONAL input. Cluster Sequences
INPUT09_FORMAT:                    yes | no
INPUT09_DEFAULT:                   no (Do Not Cluster)
INPUT09_NOTES:                     If provided, sequences will be clustered by cd-hit according to stringent parameters

INPUT10:          -w FLAG          OPTIONAL input
INPUT10_FORMAT:                    Numeric
INPUT10_DEFAULT:                   80
INPUT10_NOTES:                     The number sets the width of the output fasta file

INPUT11:          -i FLAG          OPTIONAL input
INPUT11_FORMAT:                    Numeric
INPUT11_DEFAULT:                   1 (Do Not Split)
INPUT11_NOTES:                     Determines the number of fasta sequences to be written to each resulting split file
INPUT11_NOTES:                     This option is only valid for fasta files containing >= 2 sequences

INPUT12:          -x FLAG          OPTIONAL input
INPUT12_FORMAT:                    Numeric
INPUT12_DEFAULT:                   2
INPUT12_NOTES:                     Number of cores used in the clustering process
INPUT12_NOTES:                     The maximum number of cores requested should be equal to N-1, where N is the total number of cores available in the computer performing the analysis

INPUT13:          -z FLAG          OPTIONAL input
INPUT13_FORMAT:                    Numeric: 0 == TMPDIR Run | 1 == Normal Run
INPUT13_DEFAULT:                   0 == TMPDIR Run
INPUT13_NOTES:                     0 Processes the data in the $TMPDIR directory of the computer used or of the node assigned by the SuperComputer scheduler
INPUT13_NOTES:                     Processing the data in the $TMPDIR directory of the node assigned by the SuperComputer scheduler reduces the possibility of file error generation due to network traffic
INPUT13_NOTES:                     1 Processes the data in the same directory where the script is being run

DEPENDENCIES:                      GNU AWK:       Required (https://www.gnu.org/software/gawk/)
                                   GNU COREUTILS: Required (https://www.gnu.org/software/coreutils/)
                                   CD-HIT:        Required if clustering is invoked (https://github.com/weizhongli/cdhit; PMID: 16731699; PMID: 23060610)

Author:                            Rodolfo Aramayo
WORK_EMAIL:                        raramayo@tamu.edu
PERSONAL_EMAIL:                    rodolfo@aramayo.org
########################################################################################################################################################################################################
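
The sorting (-s, -r), line width (-w), and splitting (-i) options documented above can be sketched with awk and GNU sort. This is only an illustration under stated assumptions, not the script's implementation: the function names and the output file naming scheme (prefix_0001.fa, prefix_0002.fa, ...) are hypothetical, headers are assumed to contain no tab characters, and the input to split_fasta is assumed to start with a header line.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: function names and logic are assumptions,
# not code taken from Fasta_Seq_Prepare_v1.0.2.sh.

# Sort FASTA records read from stdin by sequence length.
# $1 = "asc" (shorter to larger, like -s) or "desc" (larger to shorter, like -r)
# $2 = line width of the output sequences (like -w), default 80
# Assumes headers contain no tab characters.
sort_fasta_by_length() {
    local key="1,1n" width="${2:-80}"
    if [ "$1" = "desc" ]; then key="1,1nr"; fi
    # Step 1: one record per line as "length<TAB>header<TAB>sequence".
    awk '
        function emit() { if (hdr != "") print length(seq) "\t" hdr "\t" seq }
        /^>/ { emit(); hdr = $0; seq = ""; next }
        { seq = seq $0 }
        END { emit() }' |
    # Step 2: stable numeric sort on the length column.
    sort -s -t $'\t' -k "$key" |
    # Step 3: drop the length column and re-wrap the sequence.
    awk -F '\t' -v w="$width" '{
        print $2
        for (i = 1; i <= length($3); i += w) print substr($3, i, w)
    }'
}

# Write at most $2 sequences per file, named <prefix>_0001.fa, <prefix>_0002.fa, ...
# $1 = input FASTA, $2 = sequences per file (>= 1), $3 = output prefix (hypothetical scheme).
split_fasta() {
    awk -v n="$2" -v prefix="$3" '
        /^>/ {
            if (count % n == 0) {
                if (out != "") close(out)
                out = sprintf("%s_%04d.fa", prefix, ++part)
            }
            count++
        }
        { print > out }' "$1"
}

# Toy run: sort three sequences from larger to shorter, then split two per file.
workdir=$(mktemp -d)
printf '>a\nAAAA\n>b\nAA\n>c\nAAAAAA\n' | sort_fasta_by_length desc 4 > "$workdir/sorted.fa"
split_fasta "$workdir/sorted.fa" 2 "$workdir/chunk"
ls "$workdir"
rm -rf "$workdir"
```

Linearizing each record to a single tab-separated line before sorting keeps the header and its sequence together, which is why the sort step only needs to look at the first column.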

Development/Testing Environment:

Distributor ID:       Apple, Inc.
Description:          Apple M1 Max
Release:              14.4.1
Codename:             Sonoma

Distributor ID:       Ubuntu
Description:          Ubuntu 22.04.3 LTS
Release:              22.04
Codename:             jammy

Required Script Dependencies:

GNU AWK (https://www.gnu.org/software/gawk/)

Version Number: 5.3.0, API 4.0

GNU Awk 5.3.0, API 4.0
Copyright (C) 1989, 1991-2023 Free Software Foundation.

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see http://www.gnu.org/licenses/.

GNU COREUTILS (https://www.gnu.org/software/coreutils/)

Version Number: 9.4

(GNU coreutils) 9.4
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Richard M. Stallman and David MacKenzie.

CDHIT (https://github.com/weizhongli/cdhit)

Version Number: 4.8.1

		====== CD-HIT version 4.8.1 (built on Jan 21 2024) ======

Usage: cd-hit [Options]

Options

   -i	input filename in fasta format, required, can be in .gz format
   -o	output filename, required
   -c	sequence identity threshold, default 0.9
 	this is the default cd-hit's "global sequence identity" calculated as:
 	number of identical amino acids or bases in alignment
 	divided by the full length of the shorter sequence
   -G	use global sequence identity, default 1
 	if set to 0, then use local sequence identity, calculated as :
 	number of identical amino acids or bases in alignment
 	divided by the length of the alignment
 	NOTE!!! don't use -G 0 unless you use alignment coverage controls
 	see options -aL, -AL, -aS, -AS
   -b	band_width of alignment, default 20
   -M	memory limit (in MB) for the program, default 800; 0 for unlimitted;
   -T	number of threads, default 1; with 0, all CPUs will be used
   -n	word_length, default 5, see user's guide for choosing it
   -l	length of throw_away_sequences, default 10
   -t	tolerance for redundance, default 2
   -d	length of description in .clstr file, default 20
 	if set to 0, it takes the fasta defline and stops at first space
   -s	length difference cutoff, default 0.0
 	if set to 0.9, the shorter sequences need to be
 	at least 90% length of the representative of the cluster
   -S	length difference cutoff in amino acid, default 999999
 	if set to 60, the length difference between the shorter sequences
 	and the representative of the cluster can not be bigger than 60
   -aL	alignment coverage for the longer sequence, default 0.0
 	if set to 0.9, the alignment must covers 90% of the sequence
   -AL	alignment coverage control for the longer sequence, default 99999999
 	if set to 60, and the length of the sequence is 400,
 	then the alignment must be >= 340 (400-60) residues
   -aS	alignment coverage for the shorter sequence, default 0.0
 	if set to 0.9, the alignment must covers 90% of the sequence
   -AS	alignment coverage control for the shorter sequence, default 99999999
 	if set to 60, and the length of the sequence is 400,
 	then the alignment must be >= 340 (400-60) residues
   -A	minimal alignment coverage control for the both sequences, default 0
 	alignment must cover >= this value for both sequences
   -uL	maximum unmatched percentage for the longer sequence, default 1.0
 	if set to 0.1, the unmatched region (excluding leading and tailing gaps)
 	must not be more than 10% of the sequence
   -uS	maximum unmatched percentage for the shorter sequence, default 1.0
 	if set to 0.1, the unmatched region (excluding leading and tailing gaps)
 	must not be more than 10% of the sequence
   -U	maximum unmatched length, default 99999999
 	if set to 10, the unmatched region (excluding leading and tailing gaps)
 	must not be more than 10 bases
   -B	1 or 0, default 0, by default, sequences are stored in RAM
 	if set to 1, sequence are stored on hard drive
 	!! No longer supported !!
   -p	1 or 0, default 0
 	if set to 1, print alignment overlap in .clstr file
   -g	1 or 0, default 0
 	by cd-hit's default algorithm, a sequence is clustered to the first
 	cluster that meet the threshold (fast cluster). If set to 1, the program
 	will cluster it into the most similar cluster that meet the threshold
 	(accurate but slow mode)
 	but either 1 or 0 won't change the representatives of final clusters
   -sc	sort clusters by size (number of sequences), default 0, output clusters by decreasing length
 	if set to 1, output clusters by decreasing size
   -sf	sort fasta/fastq by cluster size (number of sequences), default 0, no sorting
 	if set to 1, output sequences by decreasing cluster size
 	this can be very slow if the input is in .gz format
   -bak	write backup cluster file (1 or 0, default 0)
   -h	print this help

   Questions, bugs, contact Weizhong Li at liwz@sdsc.edu
   For updated versions and information, please visit: http://cd-hit.org
                                                    or https://github.com/weizhongli/cdhit

   cd-hit web server is also available from http://cd-hit.org

   If you find cd-hit useful, please kindly cite:

   "CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li & Adam Godzik. Bioinformatics, (2006) 22:1658-1659
   "CD-HIT: accelerated for clustering the next generation sequencing data", Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu & Weizhong Li. Bioinformatics, (2012) 28:3150-3152
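
As a rough illustration of how the -c and -x flags of the script could map onto the CD-HIT options listed above, the sketch below builds (and only prints) a clustering command. It is an assumption-laden example, not the script's actual invocation: the function names are hypothetical, and the identity threshold of 0.95 and the settings -M 0 and -d 0 are placeholder choices, since the "stringent parameters" the script uses are not stated here.

```shell
#!/usr/bin/env bash
# Illustrative sketch only. The threshold and other option values below are
# assumptions; the parameters the script really passes to cd-hit are not listed here.

# Suggest N-1 cores (minimum 1), following the note on the -x flag.
# nproc is part of GNU coreutils; fall back to 2 if it is unavailable.
pick_cores() {
    local n
    n=$(nproc 2>/dev/null || echo 2)
    echo $(( n > 1 ? n - 1 : 1 ))
}

# Print (do not run) a clustering command.
# $1 = "protein" (uses cd-hit) or "transcript" (uses cd-hit-est)
# $2 = input FASTA, $3 = output FASTA, $4 = number of threads
build_cluster_cmd() {
    local tool="cd-hit"
    if [ "$1" = "transcript" ]; then tool="cd-hit-est"; fi
    # -c identity threshold, -T threads, -M 0 unlimited memory,
    # -d 0 keep the defline up to the first space in the .clstr file
    printf '%s -i %s -o %s -c 0.95 -T %s -M 0 -d 0\n' "$tool" "$2" "$3" "$4"
}

build_cluster_cmd protein Homo_sapiens.GRCh38.pep.all.fa clustered.fa "$(pick_cores)"
```

Printing the command instead of executing it makes it easy to inspect the options before committing to a long clustering run; the word length (-n) is left at each program's default here.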

Notes

If you use this software, please cite it as below.

Files

raramayo/Fasta_Seq_Prepare_Bash-v1.0.2.zip (25.2 kB)
md5:24639fc53f20daa8283f49c196e20720

Additional details

Software

Repository URL: https://github.com/raramayo/Fasta_Seq_Prepare_Bash
Programming language: Shell, Awk
Development Status: Active