Sars-Cov-2 and Mers sequences from human host with no unknown characters

Formentin, Marco; Favretti, Marco; Chignola, Roberto

doi:10.5281/zenodo.8362885

Published January 2, 2024 | Version v1

Software Open

1. University of Padua
2. University of Verona

The datasets are organized as follows: first column, number of bases in a given sequence; second, third, fourth and fifth columns, number of bases of type A, C, G and T, respectively, in the same sequence.

1) Sars-Cov-2 dataset: This dataset contains the base counts for complete genome sequences from a human host, with no unknown characters. The NCBI database contains about 950,000 sequences with these characteristics.

2) Restricted Sars-Cov-2 dataset: This dataset contains the base counts for complete genome sequences from a human host, with no unknown characters, that are exactly 29903 bases long, the same length as the reference sequence NC_045512.2. We obtained about 5600 such sequences from the NCBI database.

3) Mers dataset: This dataset contains the base counts for about 200 complete genome sequences from a human host, with no unknown characters.
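The five-column layout described above can be read back with a few lines of C++. The following is a minimal sketch, not the authors' code: the names BaseCounts, readCountTable and isConsistent are hypothetical, and the columns are assumed to be separated by whitespace, which the record does not state.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One row of the count table: sequence length, then counts of A, C, G and T.
struct BaseCounts {
    long total = 0;
    long a = 0, c = 0, g = 0, t = 0;
};

// Reads rows of five whitespace-separated integers (assumed delimiter).
// Lines that do not start with five integers are skipped.
std::vector<BaseCounts> readCountTable(std::istream& in) {
    std::vector<BaseCounts> rows;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        BaseCounts r;
        if (fields >> r.total >> r.a >> r.c >> r.g >> r.t) {
            rows.push_back(r);
        }
    }
    return rows;
}

// With no unknown characters, the four counts should add up to the length.
bool isConsistent(const BaseCounts& r) {
    return r.a + r.c + r.g + r.t == r.total;
}
```

Because the sequences are stated to contain no unknown characters, checking isConsistent on every row is a cheap sanity test after loading a dataset.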

Methods

Raw datasets are genome sequences retrieved from the National Center for Biotechnology Information (NCBI) database (https://www.ncbi.nlm.nih.gov). The sequences were filtered according to the following criteria:

1) Sars-Cov-2 dataset: This dataset contains complete genome sequences from a human host, with no unknown characters. The NCBI database contains about 950,000 sequences with these characteristics.

2) Restricted Sars-Cov-2 dataset: This dataset contains complete genome sequences from a human host, with no unknown characters, that are exactly 29903 bases long, the same length as the reference sequence NC_045512.2. We obtained about 5600 such sequences from the NCBI database.

3) Mers dataset: We selected about 200 complete genome sequences from a human host, with no unknown characters.
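The record does not say how the filtering was carried out (it may have been done through the NCBI interface). As an illustration only, the two sequence-level criteria, no unknown characters and a length of 29903 bases, could be checked as below. The function names and the constant kReferenceLength are hypothetical, and the sketch assumes upper-case sequences in which any character other than A, C, G or T counts as unknown.

```cpp
#include <cstddef>
#include <string>

// Length of the reference sequence NC_045512.2, as stated in the record.
constexpr std::size_t kReferenceLength = 29903;

// True if the sequence is non-empty and consists only of A, C, G and T.
// Assumption: upper-case input; N, other ambiguity codes and gaps are "unknown".
bool hasNoUnknownCharacters(const std::string& seq) {
    return !seq.empty() &&
           seq.find_first_not_of("ACGT") == std::string::npos;
}

// Criterion for the restricted Sars-Cov-2 dataset: clean and reference-length.
bool passesRestrictedFilter(const std::string& seq) {
    return seq.size() == kReferenceLength && hasNoUnknownCharacters(seq);
}
```

The host and completeness criteria depend on record metadata rather than on the sequence itself, so they are not covered by this sketch.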

Raw data were processed with C++ code (provided with the datasets) that reads a set of nucleic acid sequences in FASTA format and returns the number of bases in each sequence. The output file seqcount.txt contains a table organized as follows: first column, number of bases in a given sequence; second, third, fourth and fifth columns, number of bases of type A, C, G and T, respectively, in the same sequence. Each row reports the counts for one sequence, and the rows follow the same order as the sequences in the raw dataset.

Files

Name	Size	md5
code.pdf	99.2 kB	2b159f3558d62794560ef23b52559a50

Additional details

Is source of: 10.5061/dryad.9s4mw6mp2 (DOI)

	All versions	This version
Views	74	74
Downloads	23	23
Data volume	2.3 MB	2.3 MB
