Holistic tuning and optimization of hybrid MPI and OpenMP applications is becoming focus for parallel code developers as the number of cores and hardware threads in processing nodes of high-end systems continue to increase. For example, there is support for 32 hardware threads on a Cray XE6 node with Interlagos processors while the IBM Blue Gene/Q system could support up to 64 threads per node. Note that, by default, OpenMP threads and MPI tasks are pinned to processor cores on these high-end systems and throughout the paper we assume fix bindings of threads to physical cores for the discussion. A number of OpenMP runtimes also support user specified bindings of threads to physical cores. Parallel and node efficiencies on these high-end sys...