DNA barcoding is the process of using a specific region of DNA to identify a given species. These barcodes are analogous to industrial barcodes and are publicly available in order to support conservation and biological studies of various plants and micro-organisms. The human haploid genome contains approximately 2.9 billion base pairs, which corresponds to approximately 725 megabytes given that each base can be stored in 2 bits. An enormous amount of raw data is therefore produced whenever DNA is extracted. The basic utility of DNA barcoding is to provide seamless retrieval of the nearest match for a DNA sequence, which is only possible if the extracted data can be processed at considerable speed. We aim to compare the available lossless algorithms and create a comparative study in order to identify the most effective and apt algorithm for the process of DNA compression.
A compression algorithm encodes large amounts of data into fewer bits. Applying compression to hefty amounts of data can lead to a dramatic reduction in file size and an improvement in the speed with which the data can be rendered. Encoding and compressing DNA can nevertheless be a tiresome task. A DNA sequence consists of four bases: Adenine, Thymine, Cytosine and Guanine. The purine Adenine pairs with the pyrimidine Thymine, whereas the pyrimidine Cytosine pairs with the purine Guanine. Each DNA sub-string contains combinations of three bases known as codons. Some DNA characteristics are repeated and form sets of repetitive occurrences. The complement and palindrome of each such codon can be identified with the help of readily available tools.
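The 2-bit representation and the complement relationship described above can be sketched in a few lines; the specific 2-bit code assignment below is an illustrative choice, not a standard one.

```python
# Sketch: packing a DNA sequence at 2 bits per base.
# The particular base-to-code assignment is an assumption for illustration.
# At this density, 2.9e9 bases * 2 bits / 8 = ~725 MB, matching the figure above.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, 4 bases per byte (2 bits each)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | CODE[base]
        out.append(byte)
    return bytes(out)

def reverse_complement(seq: str) -> str:
    """Complement each base and read in reverse (palindrome check helper)."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

print(len(pack("GATTACA")))          # 7 bases fit in 2 bytes
print(reverse_complement("GAATTC"))  # equals itself: a palindromic site
```

A sequence whose reverse complement equals itself, such as GAATTC above, is exactly the kind of palindrome the text refers to.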
Compression algorithms are an effective way to reduce memory storage and increase the transmission speed of any data. These algorithms are closely linked with machine learning and data differencing. Data compression reduces the consumption of resources such as transmission bandwidth and storage capacity by representing the original information in fewer bits than the actual representation. Compression algorithms are broadly classified into lossless and lossy algorithms.
Lossless algorithms exploit the wasted space used to represent a given set of data without losing any of the authentic information; the process is therefore reversible. Lossless compression is feasible because the majority of real-world data displays the property of statistical redundancy. The most effective lossless algorithms incorporate randomized techniques, which rely on a certain degree of randomness in their results, such as pattern detection and prediction by partial matching. One of the primary lossless algorithms in wide-scale use is Huffman coding, upon which we shall dwell further in this research.
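The role of statistical redundancy can be demonstrated with Python's standard zlib module (a DEFLATE implementation): a repetitive sequence compresses far better than the same symbols drawn uniformly at random. The sequences below are invented for illustration.

```python
import random
import zlib

random.seed(0)

# A highly redundant sequence: one motif repeated 2,500 times (10,000 bytes).
repetitive = b"ACGT" * 2500
# The same alphabet with no structure: 10,000 bases chosen at random.
uniform = bytes(random.choice(b"ACGT") for _ in range(10000))

# The redundant input shrinks dramatically; the random one barely does.
print(len(zlib.compress(repetitive, 9)))
print(len(zlib.compress(uniform, 9)))
```

This is also why DNA is hard to compress with general-purpose tools: real sequences sit closer to the second case than the first.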
Lossy algorithms are based on the opposite conceptual structure to lossless algorithms: the loss of non-essential data is acceptable. Dropping the non-elementary data from the original source helps reduce the memory storage required. Though the effective use of lossy algorithms in real-time problems is limited, they are most commonly used to compress video and images. For example, the fragments of an audio segment whose frequencies are inaudible to the human ear can be discarded.
The present system implemented in the compression of DNA barcoding utilizes the well-known lossless algorithms. General-purpose compression algorithms were adapted to compress English predictive text; since the regularity in DNA codons is minimal, the compression of DNA can be a difficult task to undertake. Several lossless tools for text compression are readily available, such as GZIP, Lempel-Ziv (LZ) and Lempel-Ziv-Welch (LZW). The most commonly implemented technique is the Huffman code. Compression with the help of Huffman coding proceeds as follows. The algorithm counts the frequency of Adenine, Cytosine, Thymine and Guanine. It then constructs a table listing the frequency of each base present and treats each entry as a new tree node. Each initial node is marked as an unprocessed node. A binary tree is then constructed by repeating the following iterations: the two unprocessed nodes with the lowest frequencies are removed, a new parent node is created whose frequency is their sum, and the parent is marked as unprocessed. This continues until a single root remains, after which each base's code is read off the path from the root, appending 0 for a left branch and 1 for a right branch.
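These iterations can be sketched with a heap-based construction. The frequencies (9, 5, 4, 3) are reconstructed from the bit arithmetic in the worked example below; which base carries which count is an assumption made for illustration.

```python
import heapq
from itertools import count

def huffman_codes(freqs: dict) -> dict:
    """Build Huffman codes by repeatedly merging the two least-frequent nodes."""
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), sym) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # two lowest-frequency unprocessed nodes
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, str):           # leaf: record the accumulated path
            codes[node] = prefix or "0"
            return
        walk(node[0], prefix + "0")         # left branch appends 0
        walk(node[1], prefix + "1")         # right branch appends 1
    walk(heap[0][2], "")
    return codes

freqs = {"A": 9, "C": 5, "G": 4, "T": 3}    # assumed base-to-count assignment
codes = huffman_codes(freqs)
total_bits = sum(len(codes[s]) * f for s, f in freqs.items())
print(codes, total_bits)  # code lengths 1, 2, 3, 3 -> 40 bits in total
```

With these frequencies the code lengths come out as 1, 2, 3 and 3 bits, reproducing the 40-bit total computed in the next paragraph.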
Following the above example, the new bit count after encoding would be (9 × 1) + (4 × 3) + (5 × 2) + (3 × 3) = 40 bits which, relative to an 8-bit-per-symbol representation of the 21 symbols (168 bits), gives a compression ratio of 40/168 ≈ 23.8%. The adaptive and static methods of Huffman coding, though highly relevant, fail miserably when it comes to encoding DNA, since the probabilities of occurrence of the four symbols do not vary much. Prediction by partial matching is unable to compress a DNA codon sequence to less than 2 bits per symbol. While arithmetic coders in the likes of Context Tree Weighting are able to overcome this shortcoming, these algorithms suffer from low decoding speed.
It is evident that Huffman coding is indeed the most feasible solution to DNA compression, but it lacks accuracy in compression ratio. We therefore carry out a head-to-head comparison between two adaptive versions of this lossless algorithm in order to identify the most appropriate solution for DNA compression and encoding. The G-SQZ algorithm first scans and counts each base–quality pair and then constructs a Huffman tree over these specific codes: low-frequency pairs are encoded as longer fragments and high-frequency pairs as shorter fragments. This is followed by a secondary scan which records a header and an encoded read block into a binary output file. The binary output file contains additional information such as the meta characters, the number of occurrences of each pair and the platform. The binary encoded file also contains identifiers so that the original dataset can be retrieved, along with the fixed length of the header and the specified sequence of the block. The information stored in the header blocks can be accessed by simple query statements which can display the number of citations and the statistics of each pair, hence saving the time of traversing a large set of data.
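The first-pass counting that G-SQZ performs can be sketched as follows. This is a simplified illustration only: the read and quality strings are invented, and the real tool's binary header and record format is not reproduced here.

```python
from collections import Counter

# Simplified sketch of the G-SQZ first pass: count each <base, quality>
# pair so that frequent pairs can later receive short Huffman codes.
# Both strings below are hypothetical FASTQ-style example data.
bases = "ACGTACGGACGT"
qualities = "IIIHHIIIHIII"

pair_counts = Counter(zip(bases, qualities))

# Most frequent pairs first: these would be assigned the shortest codes.
for pair, n in pair_counts.most_common():
    print(pair, n)
```

Because only pair counts are needed (no string matching), this pass is a single linear scan over the read data, which is consistent with the design described above.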
The average number of bases per run is approximately 4 billion (2^32 bases). Keeping these statistics in view, G-SQZ is designed as a 64-bit application. Table 2 displays the comparative analysis between the results of G-SQZ and the Huffman-based tools gzip v1.3.5 and bzip2 v1.0.5 on open-source data from the 1000 Genomes Project.
After comparing the modes of compression suitable for DNA barcoding, we can conclude that G-SQZ is one of the most effective methods for this process. G-SQZ performs exceedingly well, surpassing the native Huffman code. The algorithm is specifically designed for sequencing-read data in a known format. It does not employ string matching but instead relies on counting base–quality pairs. It also provides the option to retrieve specified data from the header block with the help of queries, and the order of the base–quality pairs remains constant.