The size of data stored and processed by data science applications is increasing steadily, leading to bottlenecks and inefficiencies at various layers of the stack. One way to reduce such bottlenecks and increase energy efficiency is to tailor the underlying distributed storage solution to the application domain so that resources are used more efficiently. We explore this idea in the context of Apache Parquet, a popular column-oriented storage format used in big data workloads. Our prototype uses an FPGA-based storage node that offers high-bandwidth data deduplication and a companion software library that exposes an API for Parquet file access. This way, the storage node remains general purpose and could be shared by appli...
While in-memory databases have largely removed I/O as a bottleneck for database operations, loading ...
Apache Parquet is a columnar storage format for the Hadoop ecosystem. The technology has become alm...
FPGA-based accelerators are becoming first-class citizens in data centers. Adding FPGAs in data cen...
Distributed storage in the cloud needs to offer both low latency and high bandwidth access to data a...
In order to keep up with big data workloads, distributed storage needs to offer low latency, high ba...
In the domain of big data analytics, the bottleneck of converting storage-focused file formats to in...
With the advent of high-bandwidth non-volatile storage devices, the classical assumption that databa...
Big data applications are becoming more commonplace due to an abundance of digital data and increasi...
A columnar data representation is known to be an efficient way for data storage, specifically in cas...
Many applications make extensive use of various forms of compression techniques for storing and comm...
With the increasing number of connected devices, it becomes essential to find novel data management ...
The ever-increasing amount of data being handled in data centers causes an intrinsic inefficiency: m...
Because of fundamental limitations of CMOS technology, computing researchers and the computing indus...
Summarization: Important design considerations for the cost-effective employment of hardware acceler...