In modern column-oriented databases, compression is important for improving I/O throughput and overall database performance. Many string columns cannot be compressed by special-purpose schemes such as run-length encoding or dictionary compression, so the typical choice for them is an LZ77-based algorithm such as GZIP or Snappy. These algorithms treat the data as a block of bytes and do not exploit its columnar nature. In this thesis, we develop a compression algorithm that uses frequent string patterns mined directly from a sample of a string column as its dictionary phrases. We discuss some interesting properties of frequent patterns in the context of compression, and develop a...
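The general idea of using mined patterns as dictionary phrases can be illustrated with a minimal sketch. This is not the thesis's algorithm: the function names, the length bounds, the "bytes saved" ranking, and the greedy longest-match encoder are all assumptions chosen to keep the example short.

```python
from collections import Counter


def mine_frequent_substrings(sample, min_len=3, max_len=8,
                             min_count=2, max_patterns=255):
    """Hypothetical miner: count every substring of length
    min_len..max_len in the sample and keep the most useful ones."""
    counts = Counter()
    for value in sample:
        for n in range(min_len, max_len + 1):
            for i in range(len(value) - n + 1):
                counts[value[i:i + n]] += 1
    frequent = [(p, c) for p, c in counts.items() if c >= min_count]
    # Assumed ranking: rough savings if each occurrence of the
    # pattern is replaced by a single dictionary code.
    frequent.sort(key=lambda pc: (-(len(pc[0]) - 1) * pc[1], pc[0]))
    return [p for p, _ in frequent[:max_patterns]]


def compress(value, patterns):
    """Greedy longest-match encoding. A token is either an int
    (index into patterns) or a single literal character."""
    by_len = sorted(range(len(patterns)), key=lambda k: -len(patterns[k]))
    tokens = []
    i = 0
    while i < len(value):
        for k in by_len:
            if value.startswith(patterns[k], i):
                tokens.append(k)
                i += len(patterns[k])
                break
        else:
            # No dictionary phrase matches here: emit a literal.
            tokens.append(value[i])
            i += 1
    return tokens


def decompress(tokens, patterns):
    """Invert compress(): expand codes, pass literals through."""
    return "".join(patterns[t] if isinstance(t, int) else t for t in tokens)


if __name__ == "__main__":
    column = ["http://example.com/a", "http://example.com/b",
              "http://example.org/c"]
    phrases = mine_frequent_substrings(column)
    encoded = compress(column[0], phrases)
    assert decompress(encoded, phrases) == column[0]
    print(len(column[0]), "characters ->", len(encoded), "tokens")
```

Because the dictionary is built once from a sample and then shared by every value, each value can be decoded on its own, which is one way a column-aware scheme can differ from block-oriented LZ77 compressors. A real implementation would also need a compact binary encoding for the tokens, which this sketch omits.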
The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes ba...
The need to store and query a set of strings – a string dictionary – arises in many kinds of applica...
Abstract. Motivated by the imminent growth of massive, highly redundant genomic databases we study ...
Domain encoding is a common technique to compress the columns of a column store and to accelerate ma...
Columnar databases have dominated the data analysis market for their superior performance in query p...
Abstract. Text mining from large-scale data is of great importance in computer science. In this pa...
A pattern database (PDB) is a heuristic function implemented as a lookup table that stores the lengt...
We propose a streaming algorithm, based on the minimal description length (MDL) principle, for extra...
With the widening gap between processor and memory speeds, memory system designers may find cache co...
Compression based pattern mining has been successfully applied to many data mining tasks. We propose...
The use of data mining is increasing rapidly as daily analysis of transaction database consisti...
Abstract. A pattern database (PDB) is a heuristic function implemented as a lookup table. It stores ...
We present an algorithm for compressing pattern databases (PDBs) and a method for fast random access...
The need to store and query a set of strings – a string dictionary – arises in many kinds of applica...
Pattern mining is one of the best-known concepts in Data Mining. A big problem in pattern mining is ...