Accepted to Bioinformatics. Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via “seeds”: simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence. Here we study a simple sparse-seeding method: using seeds at positions of certain “words” (e.g. ac, at, gc, or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally-overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed “minimizer” sparse-seeding methods. Our approach can b...
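To make the word-based idea concrete, here is a minimal sketch, assuming seeds are simply anchored wherever one of the chosen words starts. The function name `word_positions` and the example sequence are hypothetical, chosen for illustration only, and this is not the authors' implementation.

```python
def word_positions(seq, words):
    """Return the start positions in seq where any of the given words occurs."""
    seq = seq.lower()
    wordset = {w.lower() for w in words}
    lengths = {len(w) for w in wordset}
    return [
        i
        for i in range(len(seq))
        if any(seq[i:i + n] in wordset for n in lengths)
    ]


# Example word set from the abstract: first letter in {a, g}, second in {c, t}.
# No word's last letter can be another word's first letter, so two occurrences
# can never overlap, and selected positions are always at least 2 apart.
WORDS = ("ac", "at", "gc", "gt")

# Hypothetical toy sequence, not taken from the paper.
positions = word_positions("ttacgtgcatta", WORDS)
gaps = [b - a for a, b in zip(positions, positions[1:])]
print(positions, gaps)
```

In this toy case the selected positions never fall next to each other, which is one way to picture the "anti-clumped" behaviour the abstract attributes to minimally-overlapping words.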
Motivation: Most sequence alignment techniques make use of exact k-mer hits, called seeds, as anchor...
We apply the concept of subset seeds to similarity search in protein sequences. The main question st...
We study the problem of computing optimal spaced seeds for identifying homologous coding DNA sequen...
Motivation: Analysis of genetic sequences is usually based on finding similar ...
Genomics studies routinely depend on similarity searches based on the strategy of finding sh...
The challenge of similarity search in massive DNA sequence databases has inspired major changes in B...
Homology search finds similar segments between two biological sequences, such as DNA or protein sequ...
Motivation: Comparison of nucleic acid and protein sequences is a fundamental tool of modern bioinfo...
Most commonly used similarity search methods in genomic sequences are heuristic ones. These are base...
The primary goal of bioinformatics is to increase an understanding in the biology of organisms. Comp...
Large-scale comparison of genomic DNA is of fundamental importance in annotating functional ...
Summary: Multiple spaced seeds represent the current state-of-the-art for similarity search in bioin...
Many algorithms for sequence analysis rely on word matching or word statistics. Often, these approac...
The minimal-length encoding approach is applied to define concept of sequence similarity. A sequence...