We develop a new strategy for processing a collection of documents on a cluster of multicore processors to build inverted files at close to the peak I/O throughput of the underlying system. Our algorithm is based on a number of novel techniques, including: (i) a high-throughput pipelined strategy that produces parallel parsed streams that are consumed at the same rate by parallel indexers; (ii) a hybrid trie and B-tree dictionary data structure that enables efficient parallel construction of the global dictionary; and (iii) a strategy for partitioning the work of the indexers using random sampling, which achieves extremely good load balancing with minimal communication overhead. We have performed extensive tests of our algorithm on a cluster...
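As an illustration of technique (iii), the following is a minimal sketch of how random sampling can split indexing work into balanced term ranges. The abstract does not describe the actual procedure, so the sample-sort-style splitter selection shown here is an assumption, and the names `choose_splitters`, `assign_indexer`, and `partition` are hypothetical.

```python
import bisect
import random


def choose_splitters(terms, num_indexers, sample_size, seed=0):
    """Pick num_indexers - 1 splitter terms from a sorted random sample.

    Assumption: splitters are evenly spaced quantiles of the sample,
    as in sample sort; the paper's own scheme may differ.
    """
    rng = random.Random(seed)
    sample = sorted(rng.sample(terms, min(sample_size, len(terms))))
    step = len(sample) / num_indexers
    return [sample[int(i * step)] for i in range(1, num_indexers)]


def assign_indexer(term, splitters):
    """Return the index of the indexer whose term range contains `term`."""
    return bisect.bisect_right(splitters, term)


def partition(terms, splitters):
    """Group terms into one bucket per indexer using the splitters."""
    buckets = [[] for _ in range(len(splitters) + 1)]
    for term in terms:
        buckets[assign_indexer(term, splitters)].append(term)
    return buckets


if __name__ == "__main__":
    vocabulary = [f"term{i:05d}" for i in range(10000)]
    splitters = choose_splitters(vocabulary, num_indexers=4, sample_size=400)
    for i, bucket in enumerate(partition(vocabulary, splitters)):
        print(i, len(bucket))
```

Only the small sample has to be sorted and shared to agree on the splitters, which is one way such a scheme can keep communication low; after that, each parser can route a term to its indexer locally with a binary search.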
Many scientific applications are I/O intensive and have tremendous I/O requirements, including check...
Information retrieval over clustered document collections has two successive stages: first identifyi...
An inverted index stores, for each term that appears in a collection of documents, a list of docum...
Current high-throughput algorithms for constructing inverted files all follow the MapReduce framewo...
Current trends in processor architectures increasingly include more cores on a single chip and more ...
This paper introduces research on the parallelization of an entire application of Document-Categori...
Advances in cloud computing, 64-bit architectures and huge RAMs enable performing many search relate...
We present a general method of parallel query processing that allows scalable performance on distrib...
The growing amount of on-line data demands efficient parallel and distributed indexing mechanisms to...
Parallel input/output in high performance computing is a field of increasing importance. In particul...
Multiple-disk I/O systems (Disk Arrays) have been an attractive approach to meet high performance I/...