In this paper, we explore a representation methodology for the compression of DNA isolates. Using lossless string compression via tokenization of frequently repeated segments of DNA, we reduce the length of the isolates to be counted as k-mers for classification. With this new representation, we apply a previously established feature sampling method to dramatically reduce the feature space. In understanding the genetic diversity, we also look at conserving biological function across these spaces. Using a random forest model we were able to predict the resistance or susceptibility of bacteria with 85-90\% accuracy, with a 30-50\% reduction in overall isolate length, and an 80-90\% reduction in the feature space over baseline. Significant con...
The genome of an organism contains all hereditary information encoded in DNA.Genome databases are ra...
Genome data are becoming increasingly important for modern medicine. As the rate of increase in DNA ...
The increase in memory and in network traffic used and caused by new sequenced biological data has r...
In this paper, we explore a representation methodology for the compression of DNA isolates. Using lo...
Rapid advancements in research in the field of DNA sequence discovery has led to a vast range of com...
The increasing volume of biological data requires finding new ways to save these data in genetic ban...
Biological data mainly comprises of Deoxyribonucleic acid (DNA) and protein sequences. These arethe ...
With increasing number of DNA sequences being discovered the problem of storing and using genomic da...
International audienceEfficiently managing vast DNA datasets necessitates the development of highly ...
The exponential growth of high-throughput DNA sequence data has posed great challenges to genomic da...
Abstract — In this paper we show that the normalized compression distance can be applied to gene exp...
Genome data are becoming increasingly important for modern medicine. As the rate of increase in DNA ...
DNA compression has been a subject of great interest since the availability of genomic databases. Al...
As the usage of technology increases rapidly today, the amount of data created also increases expone...
<div><p>Genome data are becoming increasingly important for modern medicine. As the rate of increase...
The genome of an organism contains all hereditary information encoded in DNA.Genome databases are ra...
Genome data are becoming increasingly important for modern medicine. As the rate of increase in DNA ...
The increase in memory and in network traffic used and caused by new sequenced biological data has r...
In this paper, we explore a representation methodology for the compression of DNA isolates. Using lo...
Rapid advancements in research in the field of DNA sequence discovery has led to a vast range of com...
The increasing volume of biological data requires finding new ways to save these data in genetic ban...
Biological data mainly comprises of Deoxyribonucleic acid (DNA) and protein sequences. These arethe ...
With increasing number of DNA sequences being discovered the problem of storing and using genomic da...
International audienceEfficiently managing vast DNA datasets necessitates the development of highly ...
The exponential growth of high-throughput DNA sequence data has posed great challenges to genomic da...
Abstract — In this paper we show that the normalized compression distance can be applied to gene exp...
Genome data are becoming increasingly important for modern medicine. As the rate of increase in DNA ...
DNA compression has been a subject of great interest since the availability of genomic databases. Al...
As the usage of technology increases rapidly today, the amount of data created also increases expone...
<div><p>Genome data are becoming increasingly important for modern medicine. As the rate of increase...
The genome of an organism contains all hereditary information encoded in DNA.Genome databases are ra...
Genome data are becoming increasingly important for modern medicine. As the rate of increase in DNA ...
The increase in memory and in network traffic used and caused by new sequenced biological data has r...