In this paper we address the issue of efficient doall workload distribution on an embedded 3D MPSoC. 3D stacking technology enables low-latency, high-bandwidth access to multiple large memory banks in close spatial proximity. In our implementation, one silicon layer contains multiple processors, whereas one or more DRAM layers on top host a NUMA memory subsystem. To obtain high locality and a balanced workload, we consider a two-step approach. First, a compiler pass analyzes memory references in a loop and schedules each iteration to the processor owning the most frequently accessed data. Second, if locality-aware loop parallelization has generated an unbalanced workload, we allow idle processors to execute part of the remaining work from neighb...
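The two-step approach described in the abstract can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not the paper's compiler pass or runtime: the address-to-owner mapping (`owner_of`, `bank_size`), the unit cost per iteration, the linear neighbour topology, and the steal-from-the-tail policy are all hypothetical choices made for illustration.

```python
from collections import Counter, deque


def owner_of(address, bank_size, num_procs):
    """Hypothetical mapping: contiguous address ranges belong to successive processors."""
    return (address // bank_size) % num_procs


def locality_schedule(accesses_per_iter, num_procs, bank_size):
    """Step 1: give each iteration to the processor owning most of the data it touches."""
    queues = [deque() for _ in range(num_procs)]
    for it, addrs in enumerate(accesses_per_iter):
        counts = Counter(owner_of(a, bank_size, num_procs) for a in addrs)
        if counts:
            # Ties are broken in favour of the lowest processor id.
            best = max(sorted(counts), key=lambda p: counts[p])
        else:
            # Assumption: iterations without memory references are spread round-robin.
            best = it % num_procs
        queues[best].append(it)
    return queues


def run_with_stealing(queues):
    """Step 2: simulate rounds of unit-cost work; an idle processor may steal
    one iteration from the tail of its most loaded direct neighbour."""
    n = len(queues)
    executed = [[] for _ in range(n)]
    while any(queues):
        for p in range(n):
            if not queues[p]:
                neighbours = [q for q in (p - 1, p + 1) if 0 <= q < n]
                victim = max(neighbours, key=lambda q: len(queues[q]), default=None)
                # Steal only if the victim keeps at least one iteration for itself.
                if victim is not None and len(queues[victim]) > 1:
                    queues[p].append(queues[victim].pop())
            if queues[p]:
                executed[p].append(queues[p].popleft())
    return executed


if __name__ == "__main__":
    # Six iterations mostly touching bank 0, one touching bank 1, one touching bank 3.
    accesses = [[i, i + 1, 100 + i] for i in range(6)] + [[120, 130], [310]]
    queues = locality_schedule(accesses, num_procs=4, bank_size=100)
    print("after step 1:", [list(q) for q in queues])
    print("after step 2:", run_with_stealing(queues))
```

In this toy input, step 1 places six of the eight iterations on processor 0, and step 2 lets processor 1 take over part of that backlog once its own queue is empty. Restricting steals to direct neighbours mirrors the abstract's idea of taking work from nearby processors, so that stolen iterations still access memory that is spatially close.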
This paper demonstrates a fully functional hardware and software design for a 3D stacked m...
Hardware design is evolving towards manycore processors that will be used in large clusters to achie...
This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlle...
While past research discussed several advantages of multiprocessor-system-on-a-chip (MPSoC) arc...
During the past few years, embedded digital systems have been requested to pro...
With the emergence of manycore architectures, the need for on-chip memories suc...
Computational task DAGs are executed on parallel computers by a task scheduling algorithm. Intellige...
In this chapter, we present a methodology for efficient load balancing of computational problems tha...
Sub-50nm CMOS technologies are affected by significant variability which causes pow...
Emerging TSV-based 3D integration technologies have shown great promise to overcome scalability limi...
Multi-Processor Systems on a Chip (MPSoCs) are suitable platforms for the implementation of...
This paper aims to address the issue of CPU-memory intercommunication latency with the help of 3D st...
This study presents the results of research in dynamic load balancing for Continuous Colli...
Future multi- and many-core processors are likely to have tens of cores arranged in a tiled archite...