Compiled low-level languages, such as C/C++ and Fortran, have long been the programming tools of choice for implementing applications that exploit GPU devices. As a counterpoint to that trend, this paper presents a performance and programming effort analysis of Python, an interpreted high-level language, applied to develop the kernels and applications of the NAS Parallel Benchmarks targeting GPUs. We used the Numba environment to enable CUDA support in Python, a tool that allows us to implement the GPU programs in pure Python code. Our experimental results showed that the Python applications reached performance similar to C++ programs employing CUDA and better than C++ using OpenACC for most NPB benchmarks. Furthermore, the Python codes demanded fewer operations related to the GPU framework than CUDA, mainly because Python needs fewer statements to manage memory allocations and data transfers. Despite that, our Python implementations required more operations than the OpenACC ones.
@article{DiDomenico2023,author={Di Domenico, Daniel and Lima, Jo{\~a}o V. F. and Cavalheiro, Gerson G. H.},title={NAS Parallel Benchmarks with Python: a performance and programming effort analysis focusing on GPUs},journal={The Journal of Supercomputing},year={2023},month=may,day={01},note={CAPES Qualis A2},volume={79},number={8},pages={8890--8911},issn={1573-0484},doi={10.1007/s11227-022-04932-3},url={https://doi.org/10.1007/s11227-022-04932-3}}
An evaluation of relational and NoSQL distributed databases on a low-power cluster
The constant growth of social media, unconventional web technologies, mobile applications, and Internet of Things (IoT) devices creates challenges for cloud data systems to support huge datasets and very high request rates. NoSQL databases, such as Cassandra and HBase, and relational SQL databases with replication, such as Citus/PostgreSQL, have been used to increase the horizontal scalability and high availability of data store systems. In this paper, we evaluated three distributed databases on a low-power, low-cost cluster of commodity Single-Board Computers (SBC): the relational Citus/PostgreSQL and the NoSQL databases Cassandra and HBase. The cluster has 15 Raspberry Pi 3 nodes with the Docker Swarm orchestration tool for service deployment and ingress load balancing over the SBCs. We believe that a low-cost SBC cluster can support cloud serving goals such as scale-out, elasticity, and high availability. Experimental results clearly demonstrated that there is a trade-off between performance and replication, which provides availability and partition tolerance; both properties are essential in the context of distributed systems with low-power boards. Cassandra attained better results with its consistency levels specified by the client. Both Citus and HBase enable consistency, but it penalizes performance as the number of replicas increases.
@article{daSilva2023,author={da Silva, Lucas Ferreira and Lima, Jo{\~a}o V. F.},title={An evaluation of relational and NoSQL distributed databases on a low-power cluster},journal={The Journal of Supercomputing},year={2023},dimensions={true},note={CAPES Qualis A2},month=aug,day={01},volume={79},number={12},pages={13402--13420},issn={1573-0484},doi={10.1007/s11227-023-05166-7},url={https://doi.org/10.1007/s11227-023-05166-7}}
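The client-specified consistency levels mentioned above trade latency for consistency: with replication factor N, a read level R and a write level W overlap on at least one replica whenever R + W > N. A small illustrative sketch of that arithmetic (the helper names are mine, not from the paper or the Cassandra API):

```python
# Illustrative sketch of Cassandra-style tunable consistency.
# With replication factor n, reads and writes overlap on at least one
# replica (read-your-writes) whenever R + W > n. Names are hypothetical.
def replicas_for(level, n):
    """Number of replicas that must acknowledge for a given consistency level."""
    table = {"ONE": 1, "TWO": 2, "THREE": 3, "QUORUM": n // 2 + 1, "ALL": n}
    return table[level]

def strongly_consistent(read_level, write_level, n):
    return replicas_for(read_level, n) + replicas_for(write_level, n) > n

# With 3 replicas, QUORUM reads + QUORUM writes overlap (2 + 2 > 3) ...
assert strongly_consistent("QUORUM", "QUORUM", 3)
# ... while ONE + ONE does not (1 + 1 <= 3): lower latency, weaker consistency.
assert not strongly_consistent("ONE", "ONE", 3)
```

This is the trade-off the experiments expose: more acknowledging replicas means stronger guarantees but lower throughput as the replica count grows.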
2022
NAS Parallel Benchmark Kernels with Python: A performance and programming effort analysis focusing on GPUs
Daniel Di Domenico, Gerson G. H. Cavalheiro, and João V. F. Lima
In 2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), Aug 2022
GPU devices are currently seen as one of the trending topics in parallel computing. Commonly, GPU applications are developed with programming tools based on compiled languages, like C/C++ and Fortran. This paper presents a performance and programming effort analysis employing the Python high-level language to implement the NAS Parallel Benchmark kernels targeting GPUs. We used the Numba environment to enable CUDA support in Python, a tool that allows us to implement a GPU application in pure Python code. Our experimental results showed that the Python applications reached performance similar to C++ programs employing CUDA and better than C++ using OpenACC for most NPB kernels. Furthermore, the Python codes required fewer operations related to the GPU framework than CUDA, mainly because Python needs fewer statements to manage memory allocations and data transfers. However, our Python versions demanded more operations than the OpenACC implementations.
@inproceedings{9756706,author={Di Domenico, Daniel and Cavalheiro, Gerson G. H. and Lima, João V. F.},booktitle={2022 30th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)},title={NAS Parallel Benchmark Kernels with Python: A performance and programming effort analysis focusing on GPUs},year={2022},volume={},number={},note={CAPES Qualis A4},pages={26-33},doi={10.1109/PDP55904.2022.00013}}
2021
Collaborative execution of fluid flow simulation using non-uniform decomposition on heterogeneous architectures
Gabriel Freytag, Matheus S. Serpa, João V.F. Lima, and
2 more authors
Journal of Parallel and Distributed Computing, Aug 2021
The demand for computing power, along with the diversity of computational problems, has culminated in a variety of heterogeneous architectures. Among them, hybrid architectures combine different specialized hardware into a single chip, composing a System-on-Chip (SoC). Since these architectures usually have limited resources, efficiently splitting data and tasks between the different hardware is paramount to improving performance. In this context, we explore non-uniform decomposition of the data domain to improve fluid flow simulation performance on heterogeneous architectures. We evaluate two hybrid architectures: one comprising a general-purpose x86 CPU and a graphics processing unit (GPU) integrated into a single chip (AMD Kaveri SoC), and another comprising a general-purpose ARM CPU and a Field-Programmable Gate Array (FPGA) integrated into the same chip (Intel Arria 10 SoC). We investigate the effects of data decomposition on the performance and energy efficiency of each platform's devices in a collaborative execution. Our case study is the well-known Lattice Boltzmann Method (LBM), where we apply the technique and analyze the performance and energy efficiency of five kernels on both devices of each platform. Our experimental results show that non-uniform partitioning improves the performance of the LBM kernels by up to 11.40% and 15.15% on the AMD Kaveri and Intel Arria 10, respectively. AMD's Kaveri platform reaches up to 10.809 MLUPS with an energy efficiency of 142.881 MLUPKJ, while Intel's Arria 10 reaches up to 1.12 MLUPS and 82.272 MLUPKJ.
@article{FREYTAG202111,title={Collaborative execution of fluid flow simulation using non-uniform decomposition on heterogeneous architectures},journal={Journal of Parallel and Distributed Computing},volume={152},pages={11-20},year={2021},issn={0743-7315},note={CAPES Qualis A1},doi={https://doi.org/10.1016/j.jpdc.2021.02.006},url={https://www.sciencedirect.com/science/article/pii/S0743731521000277},author={Freytag, Gabriel and Serpa, Matheus S. and Lima, João V.F. and Rech, Paolo and Navaux, Philippe O.A.},keywords={System-on-Chip, FPGA, GPU, Non-uniform partitioning, Lattice Boltzmann Method}}
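The two LBM metrics quoted above are throughput (MLUPS, million lattice updates per second) and energy efficiency (MLUPKJ, million lattice updates per kilojoule). A small sketch of how such figures are computed (function names and the example sizes are mine, not the paper's):

```python
# Sketch of the LBM throughput and energy-efficiency metrics:
# MLUPS  = lattice sites * iterations / seconds / 1e6
# MLUPKJ = lattice sites * iterations / kilojoules / 1e6
def mlups(lattice_sites, iterations, seconds):
    return lattice_sites * iterations / seconds / 1e6

def mlupkj(lattice_sites, iterations, kilojoules):
    return lattice_sites * iterations / kilojoules / 1e6

# Hypothetical run: a 128^3 lattice, 1000 time steps, 200 s wall time.
throughput = mlups(128 ** 3, 1000, 200.0)   # ~10.49 MLUPS
efficiency = mlupkj(128 ** 3, 1000, 15.0)   # updates per kJ of energy consumed
```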
A Memory Affinity Analysis of Scientific Applications on NUMA Platforms
Rafael Gauna Trindade, João V. F. Lima, and Andrea Schwertner Charão
In 2021 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW), Aug 2021
Understanding the underlying architecture is essential for scientific applications in general. One example of such a computing environment is Non-Uniform Memory Access (NUMA) systems, which enable a large amount of shared main memory. Nevertheless, NUMA systems can impose significant access latencies on data communications between distant memory nodes, and parallel applications with a naïve design may suffer significant performance penalties due to the lack of locality mechanisms. In this paper, we present performance metrics on scientific applications to identify locality problems in NUMA systems and show data and thread mapping strategies to mitigate them. Our experiments were performed with four well-known scientific applications: CoMD, LBM, LULESH, and Ondes3D. Experimental results demonstrate that the scientific applications had significant locality problems and that data and thread mapping strategies improved performance on all four applications.
@inproceedings{9651715,author={Trindade, Rafael Gauna and Lima, João V. F. and Charão, Andrea Schwertner},booktitle={2021 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)},title={A Memory Affinity Analysis of Scientific Applications on NUMA Platforms},year={2021},volume={},number={},pages={1-8},doi={10.1109/SBAC-PADW53941.2021.00011}}
An evaluation of Cassandra NoSQL database on a low-power cluster
Lucas Ferreira Da Silva, and João V. F. Lima
In 2021 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW), Aug 2021
The constant growth of social media, unconventional web technologies, mobile applications, and Internet of Things (IoT) devices creates challenges for cloud data systems to support huge datasets and very high request rates. NoSQL distributed databases such as Cassandra have been used for unstructured data storage and to increase horizontal scalability and high availability. In this paper, we evaluated Cassandra on a low-power, low-cost cluster of commodity Single-Board Computers (SBC). The cluster has 15 Raspberry Pi 3 nodes with the Docker Swarm orchestration tool for Cassandra service deployment and ingress load balancing over the SBCs. Experimental results demonstrated that hardware limitations impacted workload throughput, but read and write latencies were comparable to results from other works on high-end or virtualized platforms. Despite the observed limitations, the results show that a low-cost SBC cluster can support cloud serving goals such as scale-out, elasticity, and high availability.
@inproceedings{9651694,author={Da Silva, Lucas Ferreira and Lima, João V. F.},booktitle={2021 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)},title={An evaluation of Cassandra NoSQL database on a low-power cluster},year={2021},volume={},number={},pages={9-14},doi={10.1109/SBAC-PADW53941.2021.00012}}
2020
XKBlas: a High Performance Implementation of BLAS-3 Kernels on Multi-GPU Server
Thierry Gautier, and João V. F. Lima
In 2020 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), Aug 2020
In the last ten years, GPUs have dominated the market in terms of the computing/power metric, and numerous research works have provided Basic Linear Algebra Subprograms (BLAS) implementations accelerated on GPUs. Several software libraries have been developed to exploit the performance of systems with accelerators, but their real performance may be far from the platform's peak. This paper presents XKBlas, which aims to improve the performance of BLAS-3 kernels on multi-GPU systems. At the low level, we model computation as a set of tasks accessing data on different resources. At the high level, the API design favors non-blocking calls as a uniform concept to overlap latency, even for fine-grain computation. Unit benchmarks of BLAS-3 kernels showed that XKBlas outperformed most implementations, even including the overhead of dynamic task creation and scheduling. XKBlas outperformed BLAS implementations such as cuBLAS-XT, PaRSEC, BLASX, and Chameleon/StarPU.
@inproceedings{9092322,author={Gautier, Thierry and Lima, João V. F.},booktitle={2020 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)},title={XKBlas: a High Performance Implementation of BLAS-3 Kernels on Multi-GPU Server},year={2020},volume={},number={},dimensions={true},note={CAPES Qualis A4},pages={1-8},doi={10.1109/PDP50117.2020.00008}}
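The low-level model described above, computation as a set of tasks over data on different resources, can be pictured by tiling the output matrix of a BLAS-3 GEMM and treating each tile update as a task. A toy pure-Python sketch of that decomposition (my illustration, not the XKBlas API):

```python
# Toy sketch of the tasks-on-tiles view of a BLAS-3 GEMM: each task performs
# C[i,j] += A[i,k] * B[k,j] on ts x ts tiles. Plain Python for illustration
# only; a runtime like XKBlas would schedule these tasks across GPUs.
def tile_gemm_tasks(n, ts):
    """Yield (i, j, k) task indices over an (n/ts) x (n/ts) tile grid."""
    nt = n // ts
    for i in range(nt):
        for j in range(nt):
            for k in range(nt):
                yield (i, j, k)

def run_task(a, b, c, ts, task):
    i, j, k = task
    for ii in range(i * ts, (i + 1) * ts):
        for jj in range(j * ts, (j + 1) * ts):
            acc = 0.0
            for kk in range(k * ts, (k + 1) * ts):
                acc += a[ii][kk] * b[kk][jj]
            c[ii][jj] += acc

n, ts = 4, 2
a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # identity
b = [[float(i * n + j) for j in range(n)] for i in range(n)]
c = [[0.0] * n for _ in range(n)]
for task in tile_gemm_tasks(n, ts):
    run_task(a, b, c, ts, task)
assert c == b  # I * B == B
```

Tasks that touch different C tiles are independent, which is what lets a runtime execute them concurrently and overlap transfers with compute.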
2019
A Dynamic Task-Based D3Q19 Lattice-Boltzmann Method for Heterogeneous Architectures
João V. F. Lima, Gabriel Freytag, Vinicius Garcia Pinto, and
2 more authors
In 2019 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), Aug 2019
Nowadays, computing platforms expose a significant number of heterogeneous processing units such as multicore processors and accelerators. The task-based programming model has become a de facto standard for such architectures, since it simplifies programming by unfolding parallelism at runtime based on data-flow dependencies between tasks. Many studies have proposed parallel strategies over heterogeneous platforms with accelerators. However, to the best of our knowledge, no dynamic task-based strategy for the Lattice-Boltzmann Method (LBM) has been proposed to exploit CPU+GPU computing nodes. In this paper, we present a dynamic task-based D3Q19 LBM implementation using three runtime systems for heterogeneous architectures: OmpSs, StarPU, and XKaapi. We detail our implementations and compare performance over two heterogeneous platforms. Experimental results demonstrate that our task-based approach attained a speedup of up to 8.8 over an OpenMP parallel loop version.
@inproceedings{8671583,author={Lima, João V. F. and Freytag, Gabriel and Pinto, Vinicius Garcia and Schepke, Claudio and Navaux, Philippe O. A.},booktitle={2019 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)},title={A Dynamic Task-Based D3Q19 Lattice-Boltzmann Method for Heterogeneous Architectures},year={2019},volume={},number={},note={CAPES Qualis A4},pages={108-115},dimensions={true},doi={10.1109/EMPDP.2019.8671583}}
Performance and energy analysis of OpenMP runtime systems with dense linear algebra algorithms
João Vicente Ferreira Lima, Issam Raïs, Laurent Lefèvre, and
1 more author
The International Journal of High Performance Computing Applications, Aug 2019
In this article, we analyze the performance and energy consumption of five OpenMP runtime systems on a non-uniform memory access (NUMA) platform. We also selected three CPU-level optimizations or techniques to evaluate their impact on the runtime systems: the processor features Turbo Boost and C-States, and CPU Dynamic Voltage and Frequency Scaling through Linux CPUFreq governors. We present an experimental study that characterizes the OpenMP runtime systems on the three main kernels of dense linear algebra algorithms (Cholesky, LU, and QR) in terms of performance and energy consumption. Our experimental results suggest that OpenMP runtime systems can be considered a new energy leverage, and that Turbo Boost, as well as C-States, significantly impacted performance and energy. CPUFreq governors had more impact with Turbo Boost disabled, since both optimizations reduce performance due to CPU thermal limits. An LU factorization with the concurrent-write extension from libKOMP achieved up to 63% performance gain and a 29% energy decrease over the original PLASMA algorithm using the GNU C compiler (GCC) libGOMP runtime.
@article{doi:10.1177/1094342018792079,author={Lima, João Vicente Ferreira and Raïs, Issam and Lefèvre, Laurent and Gautier, Thierry},title={Performance and energy analysis of OpenMP runtime systems with dense linear algebra algorithms},journal={The International Journal of High Performance Computing Applications},volume={33},number={3},pages={431-443},year={2019},dimensions={true},note={CAPES Qualis A2},doi={10.1177/1094342018792079},url={https://doi.org/10.1177/1094342018792079},eprint={https://doi.org/10.1177/1094342018792079}}
Non-uniform Partitioning for Collaborative Execution on Heterogeneous Architectures
Gabriel Freytag, Matheus S. Serpa, João Vicente Ferreira Lima, and
2 more authors
In 2019 31st International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Aug 2019
As the demand for computing power increases, new architectures arise to obtain better performance. An important class of integrated devices is heterogeneous architectures, which join different specialized hardware into a single chip, composing a System on Chip (SoC). In this context, effectively splitting tasks between the different architectures is paramount to obtaining efficiency and performance. In this work, we evaluate two heterogeneous architectures: one composed of a general-purpose CPU and a graphics processing unit (GPU) integrated into a single chip (AMD Kaveri SoC), and another composed of a general-purpose CPU and a Field-Programmable Gate Array (FPGA) integrated into a single chip (Intel Arria 10 SoC). We investigate how data partitioning affects the performance of each device in a collaborative execution through decomposition of the data domain. As a case study, we apply the technique to the well-known Lattice Boltzmann Method (LBM), analyzing the performance of five kernels on both architectures. Our experimental results show that non-uniform partitioning improves LBM kernel performance by up to 11.40% and 15.15% on the AMD Kaveri and Intel Arria 10, respectively.
@inproceedings{8924156,author={Freytag, Gabriel and Serpa, Matheus S. and Lima, João Vicente Ferreira and Rech, Paolo and Navaux, Philippe O. A.},booktitle={2019 31st International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)},title={Non-uniform Partitioning for Collaborative Execution on Heterogeneous Architectures},year={2019},volume={},number={},note={CAPES Qualis A4},dimensions={true},pages={128-135},keywords={Computer architecture;Kernel;Field programmable gate arrays;Performance evaluation;Collaboration;Central Processing Unit;Graphics processing units;Heterogeneous Architectures;Collaborative Execution;Non-Uniform Partitioning;FPGA;GPU;Lattice Boltzmann Method},doi={10.1109/SBAC-PAD.2019.00031}}
2013
XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures
Thierry Gautier, João V.F. Lima, Nicolas Maillard, and
1 more author
In 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, Aug 2013
Most recent HPC platforms have heterogeneous nodes composed of multi-core CPUs and accelerators, like GPUs. Programming such nodes is typically based on a combination of OpenMP and CUDA/OpenCL codes; scheduling relies on a static partitioning and cost model. We present the XKaapi runtime system for data-flow task programming on multi-CPU and multi-GPU architectures, which supports a data-flow task model and a locality-aware work stealing scheduler. XKaapi enables task multi-implementation on CPU or GPU and multi-level parallelism with different grain sizes. We show performance results on two dense linear algebra kernels, matrix product (GEMM) and Cholesky factorization (POTRF), to evaluate XKaapi on a heterogeneous architecture composed of two hexa-core CPUs and eight NVIDIA Fermi GPUs. Our conclusion is two-fold. First, fine grained parallelism and online scheduling achieve performance results as good as static strategies, and in most cases outperform them. This is due to an improved work stealing strategy that includes locality information; a very light implementation of the tasks in XKaapi; and an optimized search for ready tasks. Next, the multi-level parallelism on multiple CPUs and GPUs enabled by XKaapi led to a highly efficient Cholesky factorization. Using eight NVIDIA Fermi GPUs and four CPUs, we measure up to 2.43 TFlop/s on double precision matrix product and 1.79 TFlop/s on Cholesky factorization; and respectively 5.09 TFlop/s and 3.92 TFlop/s in single precision.
@inproceedings{6569905,author={Gautier, Thierry and Lima, João V.F. and Maillard, Nicolas and Raffin, Bruno},booktitle={2013 IEEE 27th International Symposium on Parallel and Distributed Processing},title={XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures},year={2013},volume={},number={},pages={1299-1308},dimensions={true},note={CAPES Qualis A1},doi={10.1109/IPDPS.2013.66}}
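The work-stealing discipline XKaapi builds on can be shown with a minimal sketch: each worker pops tasks from the tail of its own deque, and an idle worker steals from the head of a victim's deque. This is a sequential toy simulation of my own (not XKaapi code, which adds locality-aware victim selection and GPU task variants):

```python
# Minimal sequential simulation of work stealing: owners pop from the tail
# of their own deque; an idle worker steals from the head of a random victim.
# Illustrative only -- XKaapi's scheduler is locality-aware and concurrent.
import random
from collections import deque

def work_steal_run(task_lists, seed=0):
    rng = random.Random(seed)
    deques = [deque(ts) for ts in task_lists]
    done = []
    while any(deques):
        for dq in deques:
            if dq:
                done.append(dq.pop())                 # owner: LIFO pop at tail
            else:
                victims = [v for v in deques if v]
                if victims:
                    done.append(rng.choice(victims).popleft())  # thief: FIFO steal at head
    return done

tasks = [[f"t{w}_{i}" for i in range(3)] for w in range(2)]
executed = work_steal_run(tasks)
assert sorted(executed) == sorted(t for ts in tasks for t in ts)  # every task runs exactly once
```

Stealing from the head tends to take the oldest (often largest-granularity) work, which is part of why work stealing balances load well with little coordination.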
2012
Exploiting Concurrent GPU Operations for Efficient Work Stealing on Multi-GPUs
João V.F. Lima, Thierry Gautier, Nicolas Maillard, and
1 more author
In 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing, Aug 2012
The race for Exascale computing has naturally led current technologies to converge to multi-CPU/multi-GPU computers, based on thousands of CPUs and GPUs interconnected by PCI-Express buses or interconnection networks. To exploit this high computing power, programmers have to solve the issue of scheduling parallel programs on hybrid architectures, and, since the performance of a GPU increases at a much faster rate than the throughput of a PCI bus, data transfers must be managed efficiently by the scheduler. This paper targets multi-GPU compute nodes, where several GPUs are connected to the same machine. To overcome the data transfer limitations on such platforms, available software usually computes, before the execution, a mapping of the tasks that respects their dependencies and minimizes the global data transfers. Such an approach is too rigid and cannot adapt the execution to variations of the system or of the application's load. We propose a solution that is orthogonal to the above-mentioned one: extensions of the XKaapi software stack that exploit the full performance of a multi-GPU system through asynchronous GPU tasks. XKaapi schedules tasks using a standard work-stealing algorithm, and the runtime efficiently exploits concurrent GPU operations. The runtime extensions make it possible to overlap data transfers and task executions on the current generation of GPUs. We demonstrate that this overlapping capability is at least as important as computing a scheduling decision to reduce the completion time of a parallel program. Our experiments on two dense linear algebra problems (matrix product and Cholesky factorization) show that our solution is highly competitive with other software based on static scheduling. Moreover, we are able to sustain the peak performance (310 GFlop/s) on DGEMM, even for matrices that cannot be stored entirely in one GPU memory. With eight GPUs, we achieve a speed-up of 6.74 with respect to a single GPU. The performance of our Cholesky factorization, with more complex dependencies between tasks, outperforms the state-of-the-art single-GPU MAGMA code.
@inproceedings{6374774,author={Lima, João V.F. and Gautier, Thierry and Maillard, Nicolas and Danjean, Vincent},booktitle={2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing},title={Exploiting Concurrent GPU Operations for Efficient Work Stealing on Multi-GPUs},year={2012},dimensions={true},note={CAPES Qualis A4},volume={},number={},pages={75-82},doi={10.1109/SBAC-PAD.2012.28}}