Design of Effective Lossless Data Compression Technique for Multiple Genomic DNA Sequences

In recent years, genomic DNA sequences have been generated in massive volumes, which calls for new storage and archiving methods. Processing, storing, and transmitting this volume of DNA sequence data is a major challenge. Data compression (DC) techniques reduce the number of bits needed to store and transmit data. DC has recently grown in popularity, and a large number of techniques have been proposed with applications in several domains. In this paper, a lossless compression technique, Arithmetic coding, is employed to compress DNA sequences. To validate the performance of the proposed model, experiments were performed on an artificial genome dataset, and the results are examined in terms of several evaluation parameters. The compression performance of Arithmetic coding is compared with that of Huffman coding, LZW coding, and LZMA. The simulation results show that Arithmetic coding achieves significantly better compression, with a compression ratio of 0.261 at a bit rate of 2.16 bpc.


Introduction
Because DNA sequences are highly useful in fields such as biology, medicine, and genetics, large amounts of DNA sequence data have been generated rapidly [1][2][3], and the volume is expected to grow enormously in the future. It is therefore essential to handle DNA sequences efficiently when storing or transmitting them. DNA is the fundamental unit of living things, since the information needed to create a living organism is stored entirely in it [4,5]. DNA is a long sequence composed of four types of bases: Cytosine (C), Adenine (A), Thymine (T), and Guanine (G). The combination of C, A, T, and G in a long DNA sequence is shown in Fig. 1. The alphabet is commonly written as {A, C, G, T}, and each element is called a nucleotide. Adenine pairs with Thymine, and Cytosine pairs with Guanine.
During DNA replication, the two strands separate to act as templates for producing copies, and an enzyme known as DNA polymerase reads each template strand in the 3' to 5' direction, adding the complementary nucleotide at each position [6][7][8].
The massive amount of DNA sequence data poses a new challenge for storage space and bandwidth resources. Publicly available gene datasets are generally archived as plain text files, which increases the burden on storage and transmission [9,10]. Without high-speed internet, it is difficult or sometimes impossible to share genetic information across some parts of the world. Several research efforts have addressed the storage of massive genomic datasets. One way to deal with this large amount of genomic information is to compress it. DC techniques reduce the quantity of information that is transmitted or stored [11]. They are suitable for compressing images, text, video, and audio. The information may consist of alphanumeric characters in text documents, numbers that represent samples in an image or audio waveform, sequences of numbers generated by some process, and so on. DC can also be described as a way of representing information in a compact form. It is employed in applications ranging from satellite imaging and medical imaging to wireless sensor networks (WSNs).

Fig. 1. DNA sequence in text format
Due to the unique nature of DNA, conventional compression algorithms cannot be applied effectively to DNA sequences [12,13]. Only a few studies have addressed the compression of genome datasets. Existing techniques typically operate by identifying repeated patterns in the sequence; examples include BioCompress, Cfact, DNACompress, and DNAPack. Although various techniques are available, they do not attain a significantly higher compression ratio. DNA sequences can be compressed efficiently only by exploiting the special properties of DNA [14]. The absence of an efficient compression technique designed specifically for DNA sequences motivated this work.
In this paper, a lossless compression technique, Arithmetic coding, is employed to compress DNA sequences. To validate the performance of the presented method, experiments were performed on an artificial genome dataset, and the results are examined in terms of several evaluation parameters. The compression performance of Arithmetic coding is compared with that of Huffman coding, LZW coding, and LZMA.

Review of Existing Compression Standards
Rashid [15] proposed a new combination of cryptography and steganography methods. The method consists of two phases: an encrypt-and-hide phase and a message extraction phase. The encrypt-and-hide phase combines a cryptography stage and a steganography stage in six steps: first, a Caesar cipher is applied to encrypt the message; second, the ciphertext is converted to DNA sequences; third, the DNA characters are converted to their ASCII equivalents; fourth, the ASCII values are converted to binary; fifth, the binary data is shifted according to a specific key; and sixth, the ciphertext is hidden in the cover image. In [16], features are extracted from protein sequences by extending the concepts of DPC, EDF, and KSB to the PSSM. The essential information is retained through DCT-based compression, and the model is trained with an SVM. The predictive accuracy is further improved with a GA approach.
Karmakar et al. [17] proposed an efficient sparse representation based on spatial compression of video signals, combined with an encryption model based on hyperchaotic DNA coding, which provides high performance. Compression efficiency is improved by applying sparse coding to the video frames, and high security is obtained by applying five-dimensional hyperchaotic DNA coding to the sparse-coded frames. The method was tested on large sets of video signals and compared with recently presented video coding and encryption systems. Alsaffar et al. [18] integrated several stages of encryption: image steganography, DNA, GZIP, and AES. The output of the DNA encryption stage is compressed with GZIP, which transforms the message into a new form and reduces its size to 75%, and the message is encrypted with AES to raise the security level. The LSB image steganography method is then used to hide the encrypted message in a high-quality image.
Afify et al. [19] discussed the applications of DNA coding, including its features, service models, and security problems. They proposed a low-cost method to secure information during transfer and storage using bio-computation techniques. The tools combine DNA, ML, steganography, BD methods, and binary coding rules to add an extra layer of bio-security, which is reported to be more efficient than traditional cryptographic techniques. An algorithm for building the extended BWT (eBWT) of a string collection from its grammar-compressed representation is presented in [20]. The technique exploits the string repetitions captured by the grammar to speed up the computation of the eBWT; thus, the more repetitive the collection, the fewer resources are used per input symbol. It relies on a grammar recently proposed at DCC'21 whose nonterminals serve as building blocks for inducing the eBWT. A relevant application is the construction of self-indexes for analyzing sequencing reads, which are massive and repetitive string collections of raw genomic data.
Yang et al. [21] presented a two-image compression and encryption system based on DNA coding and a fractional hyperchaotic system. First, the two images are transformed using the DCT. Next, the spectra of the two images are arranged by Z-scan, so that the two images are mixed and compressed into a new image. Finally, the resulting image is encrypted with DNA coding. In [22], new hash functions were introduced that eliminate hash collisions for DNA sequences; they provide accurate hashes and produce hash values in an appropriate time. The authors proposed two exact string matching methods based on the presented hash function: the first replaces the conventional Hash-q algorithm, and the second enhances the first with the shift size.

Lossless Compression Algorithm for DNA Sequences
DNA sequences can be compressed efficiently only by exploiting the special properties of DNA. Here, Arithmetic coding is employed to compress the DNA sequences. It generates a variable-length code and is superior to Huffman coding in several respects. It is particularly helpful when the source has a small alphabet with skewed probabilities. When a string is encoded with Arithmetic coding, frequently occurring symbols are coded with fewer bits than infrequently occurring ones. The technique converts the input data into a single fractional number in the range [0, 1). It works by splitting [0, 1) into segments whose lengths are proportional to the probabilities of the symbols; each input symbol selects its corresponding segment, which is then subdivided in the same way for the next symbol. The output is identified from the final segment reached in this manner. Arithmetic coding is not as easy to implement as some other techniques. Its advantage over Huffman coding is the ability to separate the modeling and coding components of the compression technique. The Arithmetic encoder and decoder are given in Algorithm 1 and Algorithm 2.
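The interval-splitting idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `build_intervals` and `narrow`, the toy message, and the use of exact fractions instead of finite-precision arithmetic are all assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction


def build_intervals(message):
    """Map each symbol to a sub-interval [low, high) of [0, 1) sized by its probability."""
    counts = Counter(message)
    total = len(message)
    intervals = {}
    low = Fraction(0)
    for symbol in sorted(counts):
        width = Fraction(counts[symbol], total)
        intervals[symbol] = (low, low + width)
        low += width
    return intervals


def narrow(message, intervals):
    """Return the final [low, high) interval after processing every symbol."""
    low, high = Fraction(0), Fraction(1)
    for symbol in message:
        span = high - low
        sym_low, sym_high = intervals[symbol]
        # Shrink the current interval to the slice that belongs to this symbol.
        low, high = low + span * sym_low, low + span * sym_high
    return low, high


message = "AAAAAACG"  # hypothetical toy input with a skewed base distribution
intervals = build_intervals(message)
low, high = narrow(message, intervals)
print(intervals)
print(low, high, high - low)
```

The width of the final interval equals the product of the symbol probabilities, so frequent symbols shrink the interval less and cost fewer bits. For this toy input, roughly -log2(width), about 8.5 bits plus a small constant overhead, would be needed, against 16 bits for a fixed code of 2 bits per base.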

Algorithm 1: Arithmetic Encoding Procedure
Step 1: Call encode-symbol repeatedly for every symbol in the message.
Step 2: Ensure that a distinguished "terminator" symbol is encoded last, then transmit any value from the range [LL, HH].

Algorithm 2: Arithmetic Decoding Procedure
Step 1: "Value" is the number that has been received.
Step 2: Continue calling decode-symbol until the terminator symbol is returned.
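The encoding and decoding procedures can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the implementation evaluated in the paper: it uses exact `Fraction` arithmetic instead of the bit-level renormalization a practical coder needs, and the terminator character `$`, the frequency table, and the function names are hypothetical.

```python
from fractions import Fraction

TERMINATOR = "$"  # hypothetical end-of-message marker, not taken from the paper


def make_model(freqs):
    """Build cumulative [low, high) ranges from a symbol -> count table."""
    total = sum(freqs.values())
    model, low = {}, Fraction(0)
    for symbol, count in freqs.items():
        high = low + Fraction(count, total)
        model[symbol] = (low, high)
        low = high
    return model


def encode(message, model):
    """Encode every symbol, then the terminator; return one value in the final range."""
    ll, hh = Fraction(0), Fraction(1)
    for symbol in list(message) + [TERMINATOR]:
        span = hh - ll
        s_low, s_high = model[symbol]
        ll, hh = ll + span * s_low, ll + span * s_high
    return (ll + hh) / 2  # any value inside the final range identifies the message


def decode(value, model):
    """Decode symbols one at a time until the terminator is returned."""
    out = []
    ll, hh = Fraction(0), Fraction(1)
    while True:
        span = hh - ll
        scaled = (value - ll) / span  # position of value inside the current range
        for symbol, (s_low, s_high) in model.items():
            if s_low <= scaled < s_high:
                break
        if symbol == TERMINATOR:
            return "".join(out)
        out.append(symbol)
        ll, hh = ll + span * s_low, ll + span * s_high


# Assumed static frequency table shared by encoder and decoder.
model = make_model({"A": 5, "C": 2, "G": 1, "T": 1, TERMINATOR: 1})
code = encode("AACGTAAAC", model)
print(code, decode(code, model))
```

Both sides must share the same model; here it is a fixed table, which keeps the sketch short and shows how the model stays separate from the coding step.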

Performance Evaluation
To assess the efficacy of Arithmetic coding for DNA sequence compression, a comparative analysis is made with Huffman coding, LZW, and LZMA. A publicly available artificial genome dataset is used for the experiments [24]. The dataset contains six sequences implanted with exact subsequences of length 100. The characters are randomly generated from the four bases A, T, C, and G. Inter-sequence similarities are included as identical subsequences across the six sequences. In particular, the i-th sequence was generated with i groups of subsequences, each composed of 100 non-overlapping subsequences interleaved with one base symbol. The subsequences in the k-th group of the i-th sequence were randomly matched with those in the (k+1)-th group of the (i+1)-th sequence. Table 1 offers a comprehensive analysis of the results of the proposed technique and the existing compression techniques.
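As a rough illustration of how such a comparison can be measured, the sketch below compresses a randomly generated DNA string with two compressors from the Python standard library and reports two metrics. Several things here are assumptions: the random sequence is not the dataset of [24], `zlib` (DEFLATE) stands in for the dictionary and Huffman based baselines because LZW and plain Huffman coding are not in the standard library, and the metric definitions (compressed size over original size, and compressed bits per base) are common conventions that may differ from those used in the paper.

```python
import lzma
import random
import zlib


def random_dna(length, seed=0):
    """Generate a hypothetical random DNA string over the four bases."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


def compression_stats(sequence, compress):
    """Return (compression ratio, bits per character) under the assumed definitions."""
    raw = sequence.encode("ascii")  # one byte per base in a plain text file
    packed = compress(raw)
    ratio = len(packed) / len(raw)
    bpc = 8 * len(packed) / len(sequence)
    return ratio, bpc


sequence = random_dna(20000)
for name, fn in (("zlib (DEFLATE)", zlib.compress), ("LZMA", lzma.compress)):
    ratio, bpc = compression_stats(sequence, fn)
    print(f"{name}: ratio={ratio:.3f}, bpc={bpc:.2f}")
```

A purely random sequence has no implanted repeats, so general-purpose compressors gain little beyond the reduction from 8 bits to about 2 bits per base; sequences with shared subsequences, as in the dataset described above, would behave differently.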

Fig. 3. CF analysis of different compression models on DNA sequences
Next, a comprehensive CF analysis of the proposed model against state-of-the-art compression techniques is provided in Fig. 3.

Conclusion
DC is used to reduce the number of bits required to store and transmit DNA sequences. It is the process of reducing the amount of data without compromising data quality. In this paper, a lossless compression technique, Arithmetic coding, is employed to compress DNA sequences. It is a variable-length coding technique that is highly useful when the source has a small alphabet with skewed probabilities. Experiments were performed on artificial datasets, and the compression performance of Arithmetic coding is compared with that of Huffman coding, LZW coding, and LZMA. The simulation results show that Arithmetic coding attains significantly better compression, with a compression ratio of 0.261 at a bit rate of 2.16 bpc.