Supported platforms: Linux / arm64, Linux / amd64.
The NVIDIA HPC-Benchmarks collection provides four benchmarks (HPL, HPL-MxP, HPCG, and STREAM) that are widely used in the HPC community, optimized for performance on NVIDIA accelerated HPC systems.
NVIDIA's HPL and HPL-MxP benchmarks solve a (random) dense linear system on distributed-memory computers equipped with NVIDIA GPUs, in double-precision (64-bit) arithmetic and in mixed-precision arithmetic using Tensor Cores, respectively. They are based on the Netlib HPL benchmark and the HPL-MxP benchmark.
NVIDIA's HPCG benchmark accelerates the High Performance Conjugate Gradients (HPCG) Benchmark. HPCG is a software package that performs a fixed number of multigrid preconditioned (using a symmetric Gauss-Seidel smoother) conjugate gradient (PCG) iterations using double precision (64 bit) floating point values.
NVIDIA's STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth. The NVIDIA HPC-Benchmarks container includes STREAM benchmarks optimized for the NVIDIA Ampere GPU architecture (sm80), the NVIDIA Hopper GPU architecture (sm90), and NVIDIA Grace CPU.
NVIDIA HPC-Benchmarks provides the multiplatform (x86 and aarch64) container image hpc-benchmarks:24.03, which is based on the NVIDIA Optimized Frameworks 24.01 container image. In addition to the contents of the NVIDIA Optimized Frameworks 24.01 container image, hpc-benchmarks:24.03 embeds the following packages:
HPL-NVIDIA 24.03
HPL-MxP-NVIDIA 24.03
HPCG-NVIDIA 24.03
STREAM-NVIDIA 24.03
NVIDIA NVSHMEM 2.11
NVIDIA NCCL 2.20.3
NVIDIA HPC-X 2.18
NVIDIA NVPL 23.11
Intel MKL 2020.4-912
Using the NVIDIA HPC-Benchmarks container requires the host system to have a supported NVIDIA driver and the NVIDIA Container Toolkit installed. For supported versions, see the Framework Containers Support Matrix and the NVIDIA Container Toolkit documentation.
NVIDIA's HPL benchmark requires GDRCopy installed on the system. Please visit https://developer.nvidia.com/gdrcopy and https://github.com/NVIDIA/gdrcopy#build-and-installation for more information. In addition, please be aware that GDRCopy requires an extra kernel-mode driver to be installed and loaded on the target machine.
The NVIDIA HPC-Benchmarks container supports the NVIDIA Ampere GPU architecture (sm80) and the NVIDIA Hopper GPU architecture (sm90). The current container version is aimed at clusters of DGX A100, DGX H100, NVIDIA Grace Hopper, and NVIDIA Grace CPU nodes (previous GPU generations are not expected to work).
The hpc-benchmarks:24.03 container provides the HPL-NVIDIA, HPL-MxP-NVIDIA, HPCG-NVIDIA, and STREAM-NVIDIA benchmarks in the following folder structure:
- hpl.sh script in the folder /workspace to invoke the xhpl executable.
- hpl-mxp.sh script in the folder /workspace to invoke the xhpl-mxp executable.
- hpcg.sh script in the folder /workspace to invoke the xhpcg executable.
- stream-gpu-test.sh script in the folder /workspace to invoke the stream_test executable for NVIDIA H100 or A100 GPUs.
- HPL-NVIDIA in the folder /workspace/hpl-linux-x86_64 contains:
  - xhpl executable.
  - sample-slurm directory.
  - sample-dat directory.
- HPL-MxP-NVIDIA in the folder /workspace/hpl-mxp-linux-x86_64 contains:
  - xhpl_mxp executable.
  - sample-slurm directory.
- HPCG-NVIDIA in the folder /workspace/hpcg-linux-x86_64 contains:
  - xhpcg executable.
  - sample-slurm directory.
  - sample-dat directory.
- STREAM-NVIDIA in the folder /workspace/stream-gpu-linux-x86_64 contains:
  - stream_test executable: GPU STREAM benchmark with double-precision elements.
  - stream_test_fp32 executable: GPU STREAM benchmark with single-precision elements.

In the aarch64 container image:

- hpl-aarch64.sh script in the folder /workspace to invoke the xhpl executable for NVIDIA Grace CPU.
- hpl.sh script in the folder /workspace to invoke the xhpl executable for NVIDIA Grace Hopper.
- hpl-mxp-aarch64.sh script in the folder /workspace to invoke the xhpl-mxp executable for NVIDIA Grace CPU.
- hpl-mxp.sh script in the folder /workspace to invoke the xhpl-mxp executable for NVIDIA Grace Hopper.
- hpcg-aarch64.sh script in the folder /workspace to invoke the xhpcg executables for NVIDIA Grace Hopper and Grace CPU.
- stream-cpu-test.sh script in the folder /workspace to invoke the stream_test executable for NVIDIA Grace CPU.
- stream-gpu-test.sh script in the folder /workspace to invoke the stream_test executable for NVIDIA Grace Hopper.
- HPL-NVIDIA in the folder /workspace/hpl-linux-aarch64 contains:
  - xhpl executable for NVIDIA Grace CPU.
  - sample-slurm directory.
  - sample-dat directory.
- HPL-NVIDIA in the folder /workspace/hpl-linux-aarch64-gpu contains:
  - xhpl executable for NVIDIA Grace Hopper.
  - sample-slurm directory.
  - sample-dat directory.
- HPL-MxP-NVIDIA in the folder /workspace/hpl-mxp-linux-aarch64 contains:
  - xhpl_mxp executable for NVIDIA Grace CPU.
  - sample-slurm directory.
- HPL-MxP-NVIDIA in the folder /workspace/hpl-mxp-linux-aarch64-gpu contains:
  - xhpl_mxp executable for NVIDIA Grace Hopper.
  - sample-slurm directory.
- HPCG-NVIDIA in the folder /workspace/hpcg-linux-aarch64 contains:
  - xhpcg executable for NVIDIA Grace Hopper.
  - xhpcg-cpu executable for NVIDIA Grace CPU.
  - sample-slurm directory.
  - sample-dat directory.
- STREAM-NVIDIA in the folder /workspace/stream-gpu-linux-aarch64 contains:
  - stream_test executable: GPU STREAM benchmark with double-precision elements.
  - stream_test_fp32 executable: GPU STREAM benchmark with single-precision elements.
- STREAM-NVIDIA in the folder /workspace/stream-cpu-linux-aarch64 contains:
  - stream_test executable: NVIDIA Grace CPU STREAM benchmark with double-precision elements.

The HPL-NVIDIA benchmark uses the same input format as the standard Netlib HPL benchmark. Please see the Netlib HPL benchmark for getting started with the HPL software concepts and best practices.
The HPCG-NVIDIA benchmark uses the same input format as the standard HPCG-Benchmark. Please see the HPCG-Benchmark for getting started with the HPCG software concepts and best practices.

The HPL-MxP-NVIDIA benchmark accepts a list of parameters to describe the input tasks and set additional tuning settings. The parameters are described in the README and TUNING files.
The HPL-NVIDIA, HPL-MxP-NVIDIA, and HPCG-NVIDIA with GPU support expect one GPU per MPI process. As such, set the number of MPI processes to match the number of available GPUs in the cluster.
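With one GPU per MPI process, the rank count follows directly from the node and GPU counts; a trivial sketch (helper name is ours):

```python
def total_ranks(nodes: int, gpus_per_node: int) -> int:
    """One MPI process per GPU, so pass:
    srun -N <nodes> --ntasks-per-node=<gpus_per_node>."""
    return nodes * gpus_per_node

# 16 nodes with 4 GPUs each and 8 nodes with 8 GPUs each
# both launch the same 64-rank job.
```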
Version 24.03 of the HPL-NVIDIA benchmark introduced a new 'out-of-core' mode. This is an opt-in feature; the default mode remains the 'in-core' mode.

The HPL-NVIDIA out-of-core mode allows the use of larger matrix sizes: unlike the in-core mode, any matrix data that does not fit within GPU memory is stored in host CPU memory. This happens automatically and only requires the user to turn on the feature with an environment variable (HPL_OOC_MODE=1) and use a larger matrix (for example, through the N parameter in an input file).

Performance will depend on host-device transfer speeds. For best performance, try to keep the amount of host memory used for the matrix to around 6-16 GiB on platforms where the CPU and GPU are connected via PCIe (such as x86). On systems with a faster CPU-GPU interconnect (such as Grace Hopper), sizes greater than 16 GiB may be beneficial. One way to estimate the matrix size for this feature is to take the largest per-GPU memory size used with the HPL-NVIDIA in-core mode, add the target amount of host data, and then work out the new matrix size from this total size.

All the new environment variables needed by the HPL-NVIDIA out-of-core mode can be found in the provided /workspace/hpl-linux-x86_64/TUNING and /workspace/hpl-linux-aarch64-gpu/TUNING files.

If HPL-NVIDIA out-of-core mode is enabled, it is highly recommended to pass the CPU, GPU, and memory affinity arguments to hpl.sh.
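The sizing method described above can be put into numbers: the N-by-N double-precision matrix occupies N² × 8 bytes across all GPUs, so adding host memory to the per-GPU budget yields a larger feasible N. A rough sketch (helper name and rounding to the blocking factor are our assumptions, not part of the benchmark):

```python
import math

def estimate_ooc_n(num_gpus: int, in_core_bytes_per_gpu: int,
                   host_bytes_per_gpu: int, nb: int = 1024) -> int:
    """Estimate an out-of-core HPL matrix size N.

    Total matrix storage is N^2 * 8 bytes spread over all GPUs, so
    N = sqrt(num_gpus * bytes_per_gpu / 8), rounded down here to a
    multiple of the blocking factor NB.
    """
    total_bytes = num_gpus * (in_core_bytes_per_gpu + host_bytes_per_gpu)
    n = math.isqrt(total_bytes // 8)
    return (n // nb) * nb
```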
The scripts hpl.sh and hpcg.sh can be invoked on a command line or through a Slurm batch script to launch the HPL-NVIDIA and HPCG-NVIDIA benchmarks, respectively. The scripts hpl.sh and hpcg.sh accept the following parameters:

- --dat: path to HPL.dat.

Optional parameters:

- --gpu-affinity <string>: colon-separated list of GPU indices
- --cpu-affinity <string>: colon-separated list of CPU index ranges
- --mem-affinity <string>: colon-separated list of memory indices
- --ucx-affinity <string>: colon-separated list of UCX devices
- --ucx-tls <string>: UCX transport to use
- --exec-name <string>: HPL executable file
- --no-multinode: enable flags for no-multinode (no-network) execution

In addition, instead of an input file, the script hpcg.sh accepts the following parameters:
- --nx: specifies the local (to an MPI process) X dimension of the problem
- --ny: specifies the local (to an MPI process) Y dimension of the problem
- --nz: specifies the local (to an MPI process) Z dimension of the problem
- --rt: specifies how many seconds the timed portion of the benchmark should run
- --b: activates benchmarking mode to bypass CPU reference execution when set to one (--b 1)
- --l2cmp: activates compression in the GPU L2 cache when set to one (--l2cmp 1)

The script hpl-mxp.sh can be invoked on a command line or through a Slurm batch script to launch the HPL-MxP-NVIDIA benchmark. The script hpl-mxp.sh requires the following parameters:
- --gpu-affinity <string>: colon-separated list of GPU indices
- --nprow <int>: number of rows in the processor grid
- --npcol <int>: number of columns in the processor grid
- --nporder <string>: "row" or "column" major layout of the processor grid
- --n <int>: size of the N-by-N matrix
- --nb <int>: the blocking constant (panel size)

The full list of accepted parameters can be found in the README and TUNING files.

Note: it is recommended to pass CPU and memory affinity arguments to the HPL-MxP-NVIDIA benchmark. Below are examples for DGX A100 and DGX H100:

--mem-affinity 0:0:0:0:1:1:1:1
--cpu-affinity 0-13:14-27:28-41:42-55:56-69:70-83:84-97:98-111

--mem-affinity 2:3:0:1:6:7:4:5
--cpu-affinity 32-47:48-63:0-15:16-31:96-111:112-127:64-79:80-95
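Since --nprow × --npcol must equal the total number of MPI ranks, and near-square grids are usually a reasonable starting point, a small helper can pick a grid (our illustration, not part of the benchmark):

```python
import math

def pick_grid(ranks: int) -> tuple[int, int]:
    """Return (nprow, npcol) with nprow * npcol == ranks, choosing the
    factor pair closest to square, with nprow <= npcol."""
    for p in range(math.isqrt(ranks), 0, -1):
        if ranks % p == 0:
            return p, ranks // p
    return 1, ranks

# 8 ranks -> (2, 4), i.e. --nprow 2 --npcol 4
```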
The script stream-gpu-test.sh can be invoked on a command line or through a Slurm batch script to launch the STREAM-NVIDIA benchmark. The script stream-gpu-test.sh accepts the following optional parameters:

- --d <int>: device number
- --n <int>: number of elements in the arrays
- --dt fp32: enable the fp32 stream test

The HPL-NVIDIA, HPCG-NVIDIA, HPL-MxP-NVIDIA, and STREAM-NVIDIA benchmarks for GPU can be run in the same way as from the x86_64 container image (see details in the x86 container image section).
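When choosing --n for the STREAM tests, note that STREAM operates on three arrays, so the memory footprint is roughly 3 × n × element size; a quick estimate (helper name is ours):

```python
def stream_bytes(n: int, fp32: bool = False) -> int:
    """Approximate memory used by the three STREAM arrays (a, b, c)."""
    elem_size = 4 if fp32 else 8  # bytes per element
    return 3 * n * elem_size

# One billion double-precision elements -> 24 GB across the three arrays.
```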
This section provides sample runs of the HPL-NVIDIA, HPL-MxP-NVIDIA, and HPCG-NVIDIA benchmarks for NVIDIA Grace CPU.
The scripts hpl-aarch64.sh and hpcg-aarch64.sh can be invoked on a command line or through a Slurm batch script to launch the HPL-NVIDIA and HPCG-NVIDIA benchmarks for NVIDIA Grace CPU, respectively. The scripts hpl-aarch64.sh and hpcg-aarch64.sh accept the following parameters:

- --dat: path to HPL.dat.

Optional parameters:

- --cpu-affinity <string>: colon-separated list of CPU index ranges
- --mem-affinity <string>: colon-separated list of memory indices
- --ucx-affinity <string>: colon-separated list of UCX devices
- --ucx-tls <string>: UCX transport to use
- --exec-name <string>: HPL executable file
Note: it is recommended to bind each MPI process to a NUMA node on NVIDIA Grace CPU, for example:

./hpl-aarch64.sh --dat /my-dat-files/HPL.dat --cpu-affinity 0-71:72-143 --mem-affinity 0:1
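The affinity strings in the note above follow a simple pattern: one colon-separated entry per rank. They can be generated for other core/NUMA layouts with a small helper (our illustration):

```python
def numa_affinity(numa_nodes: int, cores_per_numa: int) -> tuple[str, str]:
    """Build --cpu-affinity / --mem-affinity strings assuming one MPI
    rank per NUMA node and contiguous core numbering."""
    cpu = ":".join(
        f"{i * cores_per_numa}-{(i + 1) * cores_per_numa - 1}"
        for i in range(numa_nodes)
    )
    mem = ":".join(str(i) for i in range(numa_nodes))
    return cpu, mem

# Two 72-core NUMA nodes -> ("0-71:72-143", "0:1"), as in the example above.
```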
In addition, instead of an input file, the script hpcg-aarch64.sh accepts the following parameters:

- --nx: specifies the local (to an MPI process) X dimension of the problem
- --ny: specifies the local (to an MPI process) Y dimension of the problem
- --nz: specifies the local (to an MPI process) Z dimension of the problem
- --rt: specifies how many seconds the timed portion of the benchmark should run
- --b: activates benchmarking mode to bypass CPU reference execution when set to one (--b=1)
- --l2cmp: activates compression in the GPU L2 cache when set to one (--l2cmp=1)
The following parameters control the NVIDIA HPCG benchmark on Grace Hopper systems:

- --exm: specifies the execution mode. 0 is GPU-only, 1 is Grace-only, and 2 is GPU-Grace. Default is 0.
- --pby: specifies the dimension in which the GPU and Grace local problems differ. 0 is auto, 1 is X, 2 is Y, and 3 is Z. Default is 0. Note that the GPU and Grace local problems can differ in one dimension only.
- --lpm: controls the meaning of the value provided for the --g2c parameter. Applicable when --exm is 2; depends on the differing local dimension specified by --pby. Value explanation:
  - --nx 128 --ny 128 --nz 128 --pby 2 --g2c 8 means the differing Grace dimension (Y in this example) is 1/8 of the differing GPU dimension: the GPU local problem is 128x128x128 and the Grace local problem is 128x16x128.
  - --nx 128 --ny 128 --nz 128 --pby 3 --g2c 64 means the differing Grace dimension (Z in this example) is 64: the GPU local problem is 128x128x128 and the Grace local problem is 128x128x64.
  - --g2c is the ratio. For example, with --pby 1, --nx 1024, and --g2c 8, the GPU X dimension is 896 and the Grace X dimension is 128.
  - --g2c is absolute. For example, with --pby 1, --nx 1024, and --g2c 96, the GPU X dimension is 928 and the Grace X dimension is 96.
- --g2c: specifies the value of the differing dimension of the GPU and Grace local problems. Its meaning depends on the --pby and --lpm values.

Optional parameters of the hpcg-aarch64.sh script:
- --npx: specifies the process grid X dimension of the problem
- --npy: specifies the process grid Y dimension of the problem
- --npz: specifies the process grid Z dimension of the problem

The script hpl-mxp-aarch64.sh can be invoked on a command line or through a Slurm batch script to launch the HPL-MxP-NVIDIA benchmark for NVIDIA Grace CPU. The script hpl-mxp-aarch64.sh requires the following parameters:
- --nprow <int>: number of rows in the processor grid
- --npcol <int>: number of columns in the processor grid
- --nporder <string>: "row" or "column" major layout of the processor grid
- --n <int>: size of the N-by-N matrix
- --nb <int>: the blocking constant (panel size)

The full list of accepted parameters can be found in the README and TUNING files.

The script stream-cpu-test.sh can be invoked on a command line or through a Slurm batch script to launch the STREAM-NVIDIA benchmark. The script stream-cpu-test.sh accepts the following optional parameters:

- --n <int>: number of elements in the arrays
- --t <int>: number of threads

For a general guide on pulling and running containers, see the Running A Container chapter in the NVIDIA Containers For Deep Learning Frameworks User's Guide. For more information about using NGC, refer to the NGC Container User Guide.
The examples below use Pyxis/enroot from NVIDIA to facilitate running HPC-Benchmarks containers. Note that an enroot .credentials file is necessary to use these NGC containers.
To copy and customize the sample Slurm scripts and/or sample HPL.dat/hpcg.dat files from the containers, run the container in interactive mode, while mounting a folder outside the container, and copy the needed files, as follows:
CONT='nvcr.io#nvidia/hpc-benchmarks:24.03'
MOUNT="$PWD:/home_pwd"
srun -N 1 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
--container-mounts="${MOUNT}" \
--pty bash
Once inside the container, copy the needed files to /home_pwd
.
HPL-NVIDIA run

Several sample Slurm scripts and several sample input files are available in the container at /workspace/hpl-linux-x86_64 or /workspace/hpl-linux-aarch64-gpu.
To run HPL-NVIDIA
on a single node with 4 GPUs using your custom HPL.dat file:
CONT='nvcr.io#nvidia/hpc-benchmarks:24.03'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 1 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
--container-mounts="${MOUNT}" \
./hpl.sh --dat /my-dat-files/HPL.dat
To run HPL-NVIDIA on 16 nodes with 4 GPUs each (or 8 nodes with 8 GPUs each) using the provided sample HPL-64GPUs.dat file:
CONT='nvcr.io#nvidia/hpc-benchmarks:24.03'
srun -N 16 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
./hpl.sh --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-64GPUs.dat
CONT='nvcr.io#nvidia/hpc-benchmarks:24.03'
srun -N 8 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
./hpl.sh --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-64GPUs.dat
HPL-MxP-NVIDIA run

Several sample Slurm scripts are available in the container at /workspace/hpl-mxp-linux-x86_64 or /workspace/hpl-mxp-linux-aarch64-gpu.
To run HPL-MxP-NVIDIA on a single node with 8 GPUs:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
srun -N 1 --ntasks-per-node=8 \
--container-image="${CONT}" \
./hpl-mxp.sh --n 380000 --nb 2048 --nprow 2 --npcol 4 --nporder row --gpu-affinity 0:1:2:3:4:5:6:7
To run HPL-MxP-NVIDIA on 4 nodes, each node with 4 GPUs:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
srun -N 4 --ntasks-per-node=4 \
--container-image="${CONT}" \
./hpl-mxp.sh --n 280000 --nb 2048 --nprow 4 --npcol 4 --nporder row --gpu-affinity 0:1:2:3
Pay special attention to CPU cores affinity/binding, as it greatly affects the performance of the HPL benchmarks.
HPCG-NVIDIA run

Several sample Slurm scripts and a sample input file are available in the container at /workspace/hpcg-linux-x86_64 or /workspace/hpcg-linux-aarch64.
To run HPCG-NVIDIA on a single node with 8 GPUs using your custom hpcg.dat file on x86:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 1 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
--container-mounts="${MOUNT}" \
./hpcg.sh --dat /my-dat-files/hpcg.dat
To run HPCG-NVIDIA on 16 nodes with 4 GPUs each (or 8 nodes with 8 GPUs each) using script parameters on x86:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 16 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
--container-mounts="${MOUNT}" \
./hpcg.sh --nx 256 --ny 256 --nz 256 --rt 2
To run HPCG-NVIDIA on 16 nodes with 4 GPUs each (or 8 nodes with 8 GPUs each) using script parameters on aarch64:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 16 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
--container-mounts="${MOUNT}" \
./hpcg-aarch64.sh --nx 256 --ny 256 --nz 256 --rt 2
HPL-NVIDIA run

Several sample input files are available in the container at /workspace/hpl-linux-aarch64.
To run HPL-NVIDIA
on two nodes of NVIDIA Grace CPU using your custom HPL.dat file:
CONT='nvcr.io#nvidia/hpc-benchmarks:24.03'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 2 --ntasks-per-node=2 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
--container-mounts="${MOUNT}" \
./hpl-aarch64.sh --dat /my-dat-files/HPL.dat --cpu-affinity 0-71:72-143 --mem-affinity 0:1
where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.
HPL-MxP-NVIDIA run

To run HPL-MxP-NVIDIA on a single NVIDIA Grace Hopper x4 node:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
srun -N 1 --ntasks-per-node=16 \
--container-image="${CONT}" \
./hpl-mxp-aarch64.sh --n 380000 --nb 2048 --nprow 2 --npcol 4 --nporder row \
--cpu-affinity 0-71:72-143:144-215:216-287 \
--mem-affinity 0:1:2:3
where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.
HPCG-NVIDIA run

A sample input file is available in the container at /workspace/hpcg-linux-aarch64.
To run HPCG-NVIDIA
on two nodes of NVIDIA Grace CPU using your custom parameters:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 2 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
--container-mounts="${MOUNT}" \
./hpcg-aarch64.sh --exm 1 --nx 512 --ny 512 --nz 288 --rt 30 --cpu-affinity 0-35:36-71:72-107:108-143 --mem-affinity 0:0:1:1
To run HPCG-NVIDIA
on NVIDIA Grace Hopper x2 using script parameters on aarch64:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
#GPU+Grace (Heterogeneous execution)
#GPU rank has 8 OpenMP threads and Grace rank has 64 OpenMP threads
srun -N 2 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
--container-mounts="${MOUNT}" \
./hpcg-aarch64.sh --nx 256 --ny 1024 --nz 288 --rt 2 \
--exm 2 --pby 2 --lpm 1 --g2c 64 \
--npx 4 --npy 4 --npz 1 \
--cpu-affinity 0-7:8-71:72-79:80-143:144-151:152-215:216-223:224-287 \
--mem-affinity 0:0:1:1:2:2:3:3
The instructions below assume Singularity 3.4.1 or later.
Save the HPC-Benchmark container as a local Singularity image file:
$ singularity pull --docker-login hpc-benchmarks:24.03.sif docker://nvcr.io/nvidia/hpc-benchmarks:24.03
This command saves the container in the current directory as hpc-benchmarks:24.03.sif.
HPL-NVIDIA run

Several sample Slurm scripts and several sample input files are available in the container at /workspace/hpl-linux-x86_64 or /workspace/hpl-linux-aarch64-gpu.
To run HPL-NVIDIA
on a single node with 4 GPUs using your custom HPL.dat file:
CONT='/path/to/hpc-benchmarks:24.03.sif'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 1 --ntasks-per-node=4 singularity run --nv \
-B "${MOUNT}" "${CONT}" \
./hpl.sh --dat /my-dat-files/HPL.dat
To run HPL-NVIDIA on 16 nodes with 4 GPUs each (or 8 nodes with 8 GPUs each) using the provided sample HPL-64GPUs.dat file:
CONT='/path/to/hpc-benchmarks:24.03.sif'
srun -N 16 --ntasks-per-node=4 singularity run --nv \
"${CONT}" \
./hpl.sh --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-64GPUs.dat
CONT='/path/to/hpc-benchmarks:24.03.sif'
srun -N 8 --ntasks-per-node=8 singularity run --nv \
"${CONT}" \
./hpl.sh --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-64GPUs.dat
HPL-MxP-NVIDIA run

Several sample Slurm scripts are available in the container at /workspace/hpl-mxp-linux-x86_64 or /workspace/hpl-mxp-linux-aarch64-gpu.
To run HPL-MxP-NVIDIA on a single node with 8 GPUs:
CONT='/path/to/hpc-benchmarks:24.03.sif'
srun -N 1 --ntasks-per-node=8 singularity run --nv \
"${CONT}" \
./hpl-mxp.sh --n 380000 --nb 2048 --nprow 2 --npcol 4 --nporder row --gpu-affinity 0:1:2:3:4:5:6:7
To run HPL-MxP-NVIDIA on 4 nodes, each with 4 GPUs:
CONT='/path/to/hpc-benchmarks:24.03.sif'
srun -N 4 --ntasks-per-node=4 singularity run --nv \
"${CONT}" \
./hpl-mxp.sh --n 280000 --nb 2048 --nprow 4 --npcol 4 --nporder row --gpu-affinity 0:1:2:3
Pay special attention to CPU cores affinity/binding, as it greatly affects the performance of the HPL benchmarks.
HPCG-NVIDIA run

Several sample Slurm scripts and a sample input file are available in the container at /workspace/hpcg-linux-x86_64 or /workspace/hpcg-linux-aarch64-gpu.
To run HPCG-NVIDIA on a single node with 8 GPUs using your custom hpcg.dat file on x86:

CONT='/path/to/hpc-benchmarks:24.03.sif'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"

srun -N 1 --ntasks-per-node=8 singularity run --nv \
     -B "${MOUNT}" "${CONT}" \
     ./hpcg.sh --dat /my-dat-files/hpcg.dat
To run HPCG-NVIDIA on 16 nodes with 4 GPUs each (or 8 nodes with 8 GPUs each) using script parameters on x86:

CONT='/path/to/hpc-benchmarks:24.03.sif'

srun -N 16 --ntasks-per-node=4 singularity run --nv \
     "${CONT}" \
     ./hpcg.sh --nx 256 --ny 256 --nz 256 --rt 2
To run HPCG-NVIDIA on a single node with 4 GPUs using your custom hpcg.dat file on aarch64:

CONT='/path/to/hpc-benchmarks:24.03.sif'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"

srun -N 1 --ntasks-per-node=4 singularity run --nv \
     -B "${MOUNT}" "${CONT}" \
     ./hpcg-aarch64.sh --dat /my-dat-files/hpcg.dat
HPL-NVIDIA run

Several sample input files are available in the container at /workspace/hpl-linux-aarch64.
To run HPL-NVIDIA
on two nodes of NVIDIA Grace CPU using your custom HPL.dat file:
CONT='/path/to/hpc-benchmarks:24.03.sif'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 2 --ntasks-per-node=2 singularity run \
-B "${MOUNT}" "${CONT}" \
./hpl-aarch64.sh --dat /my-dat-files/HPL.dat --cpu-affinity 0-71:72-143 --mem-affinity 0:1
where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.
HPL-MxP-NVIDIA run

To run HPL-MxP-NVIDIA on a single NVIDIA Grace Hopper x4 node:
CONT='/path/to/hpc-benchmarks:24.03.sif'
srun -N 1 --ntasks-per-node=16 singularity run \
"${CONT}" \
./hpl-mxp-aarch64.sh --n 380000 --nb 2048 --nprow 2 --npcol 4 --nporder row \
--cpu-affinity 0-71:72-143:144-215:216-287 \
--mem-affinity 0:1:2:3
where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.
HPCG-NVIDIA run

A sample input file is available in the container at /workspace/hpcg-linux-aarch64.
To run HPCG-NVIDIA
on two nodes of NVIDIA Grace CPU using your custom hpcg.dat file:
CONT='/path/to/hpc-benchmarks:24.03.sif'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 2 --ntasks-per-node=4 singularity run \
-B "${MOUNT}" "${CONT}" \
./hpcg-aarch64.sh --exm 1 --nx 512 --ny 512 --nz 288 --rt 10 --cpu-affinity 0-35:36-71:72-107:108-143 --mem-affinity 0:0:1:1
To run HPCG-NVIDIA
on NVIDIA Grace Hopper x2 using script parameters on aarch64:
CONT='/path/to/hpc-benchmarks:24.03.sif'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
#GPU+Grace (Heterogeneous execution)
srun -N 2 --ntasks-per-node=8 singularity run \
-B "${MOUNT}" "${CONT}" \
./hpcg-aarch64.sh --nx 256 --ny 1024 --nz 288 --rt 2 \
--exm 2 --pby 2 --lpm 1 --g2c 64 \
--npx 4 --npy 4 --npz 1 \
--cpu-affinity 0-7:8-71:72-79:80-143:144-151:152-215:216-223:224-287 \
--mem-affinity 0:0:1:1:2:2:3:3
The examples below are for single-node runs with Docker; Docker is not recommended for multi-node runs.

Download the HPC-Benchmarks container as a local Docker image:
$ docker pull nvcr.io/nvidia/hpc-benchmarks:24.03
NOTE: you may want to add the --privileged flag to your docker command to avoid a "set_mempolicy" error.
HPL-NVIDIA run

To run HPL-NVIDIA on a single node with 4 GPUs using your custom HPL.dat file:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
MOUNT="/full-path/to/your/custom/dat-files:/my-dat-files"
docker run --gpus all --shm-size=1g -v ${MOUNT} \
${CONT} \
mpirun --bind-to none -np 4 \
./hpl.sh --dat /my-dat-files/HPL.dat
HPL-MxP-NVIDIA run

To run HPL-MxP-NVIDIA on a single node with 8 GPUs:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
docker run --gpus all --shm-size=1g \
${CONT} \
mpirun --bind-to none -np 8 \
./hpl-mxp.sh --n 380000 --nb 2048 --nprow 2 --npcol 4 --nporder row --gpu-affinity 0:1:2:3:4:5:6:7
HPCG-NVIDIA run

To run HPCG-NVIDIA on a single node with 8 GPUs using your custom hpcg.dat file on x86:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
MOUNT="/full-path/to/your/custom/dat-files:/my-dat-files"
docker run --gpus all --shm-size=1g -v ${MOUNT} \
${CONT} \
mpirun --bind-to none -np 8 \
./hpcg.sh --dat /my-dat-files/hpcg.dat
HPL-NVIDIA run

Several sample docker run scripts are available in the container at /workspace/hpl-linux-aarch64.

To run HPL-NVIDIA on a single NVIDIA Grace CPU node using your custom HPL.dat file:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
docker run -v ${MOUNT} \
"${CONT}" \
mpirun --bind-to none -np 2 \
./hpl-aarch64.sh --dat /my-dat-files/HPL.dat --cpu-affinity 0-71:72-143 --mem-affinity 0:1
where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.
HPL-MxP-NVIDIA run

Several sample docker run scripts are available in the container at /workspace/hpl-mxp-linux-aarch64.

To run HPL-MxP-NVIDIA on a single NVIDIA Grace Hopper x4 node:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
docker run \
"${CONT}" \
mpirun --bind-to none -np 4 \
./hpl-mxp-aarch64.sh --n 380000 --nb 2048 --nprow 2 --npcol 4 --nporder row \
--cpu-affinity 0-71:72-143:144-215:216-287 --mem-affinity 0:1:2:3
where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.
HPCG-NVIDIA run

Several sample docker run scripts are available in the container at /workspace/hpcg-linux-aarch64.

To run HPCG-NVIDIA on a single node of NVIDIA Grace CPU using your custom parameters:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
docker run -v ${MOUNT} \
"${CONT}" \
mpirun --bind-to none -np 4 \
./hpcg-aarch64.sh --exm 1 --nx 512 --ny 512 --nz 288 --rt 10 --cpu-affinity 0-35:36-71:72-107:108-143 --mem-affinity 0:0:1:1
To run HPCG-NVIDIA
on NVIDIA Grace Hopper x2 using script parameters on aarch64:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
#GPU+Grace (Heterogeneous execution)
docker run -v ${MOUNT} \
"${CONT}" \
mpirun --bind-to none -np 16 \
./hpcg-aarch64.sh --nx 256 --ny 1024 --nz 288 --rt 2 \
--exm 2 --pby 2 --lpm 1 --g2c 64 \
--npx 4 --npy 4 --npz 1 \
--cpu-affinity 0-7:8-71:72-79:80-143:144-151:152-215:216-223:224-287 \
--mem-affinity 0:0:1:1:2:2:3:3