Supported platforms: Linux / arm64, Linux / amd64.
The NVIDIA HPC-Benchmarks collection provides four benchmarks (HPL, HPL-MxP, HPCG, and STREAM) that are widely used in the HPC community, optimized for performance on NVIDIA accelerated HPC systems.
NVIDIA's HPL and HPL-MxP benchmarks solve a (random) dense linear system on distributed-memory computers equipped with NVIDIA GPUs, in double-precision (64-bit) arithmetic and in mixed-precision arithmetic using Tensor Cores, respectively. They are based on the Netlib HPL benchmark and the HPL-MxP benchmark.
NVIDIA's HPCG benchmark accelerates the High Performance Conjugate Gradients (HPCG) Benchmark. HPCG is a software package that performs a fixed number of multigrid preconditioned (using a symmetric Gauss-Seidel smoother) conjugate gradient (PCG) iterations using double precision (64 bit) floating point values.
NVIDIA's STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth. The NVIDIA HPC-Benchmarks container includes STREAM benchmarks optimized for the NVIDIA Ampere GPU architecture (sm80), the NVIDIA Hopper GPU architecture (sm90), and NVIDIA Grace CPU.
NVIDIA HPC-Benchmarks provides the multiplatform (x86 and aarch64) container image hpc-benchmarks:24.03, which is based on the NVIDIA Optimized Frameworks 24.01 container image. In addition to the contents of the NVIDIA Optimized Frameworks 24.01 container image, hpc-benchmarks:24.03 embeds the following packages:
HPL-NVIDIA 24.03
HPL-MxP-NVIDIA 24.03
HPCG-NVIDIA 24.03
STREAM-NVIDIA 24.03
NVIDIA NVSHMEM 2.11
NVIDIA NCCL 2.20.3
NVIDIA HPC-X 2.18
NVIDIA NVPL 23.11
Intel MKL 2020.4-912
Using the NVIDIA HPC-Benchmarks container requires the host system to have a supported NVIDIA driver and the NVIDIA Container Toolkit installed. For supported versions, see the Framework Containers Support Matrix and the NVIDIA Container Toolkit documentation.
NVIDIA's HPL benchmark requires GDRCopy installed on the system. Please visit https://developer.nvidia.com/gdrcopy and https://github.com/NVIDIA/gdrcopy#build-and-installation for more information. In addition, please be aware that GDRCopy requires an extra kernel-mode driver to be installed and loaded on the target machine.
The NVIDIA HPC-Benchmarks container supports the NVIDIA Ampere GPU architecture (sm80) and the NVIDIA Hopper GPU architecture (sm90). The current container version is aimed at clusters of DGX A100, DGX H100, NVIDIA Grace Hopper, and NVIDIA Grace CPU nodes (previous GPU generations are not expected to work).
The hpc-benchmarks:24.03 container provides the HPL-NVIDIA, HPL-MxP-NVIDIA, HPCG-NVIDIA, and STREAM-NVIDIA benchmarks in the following folder structure:
- hpl.sh script in the folder /workspace to invoke the xhpl executable.
- hpl-mxp.sh script in the folder /workspace to invoke the xhpl-mxp executable.
- hpcg.sh script in the folder /workspace to invoke the xhpcg executable.
- stream-gpu-test.sh script in the folder /workspace to invoke the stream_test executable for NVIDIA H100 or A100 GPUs.
- HPL-NVIDIA in the folder /workspace/hpl-linux-x86_64 contains:
  - xhpl executable.
  - sample-slurm directory.
  - sample-dat directory.
- HPL-MxP-NVIDIA in the folder /workspace/hpl-mxp-linux-x86_64 contains:
  - xhpl_mxp executable.
  - sample-slurm directory.
- HPCG-NVIDIA in the folder /workspace/hpcg-linux-x86_64 contains:
  - xhpcg executable.
  - sample-slurm directory.
  - sample-dat directory.
- STREAM-NVIDIA in the folder /workspace/stream-gpu-linux-x86_64 contains:
  - stream_test executable: GPU STREAM benchmark with double-precision elements.
  - stream_test_fp32 executable: GPU STREAM benchmark with single-precision elements.

In the aarch64 container image:

- hpl-aarch64.sh script in the folder /workspace to invoke the xhpl executable for NVIDIA Grace CPU.
- hpl.sh script in the folder /workspace to invoke the xhpl executable for NVIDIA Grace Hopper.
- hpl-mxp-aarch64.sh script in the folder /workspace to invoke the xhpl-mxp executable for NVIDIA Grace CPU.
- hpl-mxp.sh script in the folder /workspace to invoke the xhpl-mxp executable for NVIDIA Grace Hopper.
- hpcg-aarch64.sh script in the folder /workspace to invoke the xhpcg executables for NVIDIA Grace Hopper and Grace CPU.
- stream-cpu-test.sh script in the folder /workspace to invoke the stream_test executable for NVIDIA Grace CPU.
- stream-gpu-test.sh script in the folder /workspace to invoke the stream_test executable for NVIDIA Grace Hopper.
- HPL-NVIDIA in the folder /workspace/hpl-linux-aarch64 contains:
  - xhpl executable for NVIDIA Grace CPU.
  - sample-slurm directory.
  - sample-dat directory.
- HPL-NVIDIA in the folder /workspace/hpl-linux-aarch64-gpu contains:
  - xhpl executable for NVIDIA Grace Hopper.
  - sample-slurm directory.
  - sample-dat directory.
- HPL-MxP-NVIDIA in the folder /workspace/hpl-mxp-linux-aarch64 contains:
  - xhpl_mxp executable for NVIDIA Grace CPU.
  - sample-slurm directory.
- HPL-MxP-NVIDIA in the folder /workspace/hpl-mxp-linux-aarch64-gpu contains:
  - xhpl_mxp executable for NVIDIA Grace Hopper.
  - sample-slurm directory.
- HPCG-NVIDIA in the folder /workspace/hpcg-linux-aarch64 contains:
  - xhpcg executable for NVIDIA Grace Hopper.
  - xhpcg-cpu executable for NVIDIA Grace CPU.
  - sample-slurm directory.
  - sample-dat directory.
- STREAM-NVIDIA in the folder /workspace/stream-gpu-linux-aarch64 contains:
  - stream_test executable: GPU STREAM benchmark with double-precision elements.
  - stream_test_fp32 executable: GPU STREAM benchmark with single-precision elements.
- STREAM-NVIDIA in the folder /workspace/stream-cpu-linux-aarch64 contains:
  - stream_test executable: NVIDIA Grace CPU STREAM benchmark with double-precision elements.

The HPL-NVIDIA benchmark uses the same input format as the standard Netlib HPL benchmark. Please see the Netlib HPL benchmark for getting started with the HPL software concepts and best practices.
The HPCG-NVIDIA benchmark uses the same input format as the standard HPCG-Benchmark. Please see the HPCG-Benchmark for getting started with the HPCG software concepts and best practices.

The HPL-MxP-NVIDIA benchmark accepts a list of parameters to describe the input tasks and set additional tuning settings. The parameters are described in the README and TUNING files.
The HPL-NVIDIA, HPL-MxP-NVIDIA, and HPCG-NVIDIA with GPU support expect one GPU per MPI process. As such, set the number of MPI processes to match the number of available GPUs in the cluster.
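With one GPU per MPI process, the rank count follows directly from the node and GPU counts; a trivial sketch (helper name is ours):

```python
def total_ranks(nodes: int, gpus_per_node: int) -> int:
    """One MPI process per GPU, so pass:
    srun -N <nodes> --ntasks-per-node=<gpus_per_node>."""
    return nodes * gpus_per_node

# 16 nodes with 4 GPUs each and 8 nodes with 8 GPUs each
# both launch the same 64-rank job.
```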
Version 24.03 of the HPL-NVIDIA benchmark introduced a new 'out-of-core' mode. This is an opt-in feature; the default mode remains the 'in-core' mode.

The HPL-NVIDIA out-of-core mode allows the use of larger matrix sizes: unlike the in-core mode, any matrix data that does not fit within GPU memory is stored in host CPU memory. This happens automatically and only requires the user to turn on the feature with an environment variable (HPL_OOC_MODE=1) and use a larger matrix (for example, through the N parameter in an input file).

Performance will depend on host-device transfer speeds. For best performance, try to keep the amount of host memory used for the matrix to around 6-16 GiB on platforms where the CPU and GPU are connected via PCIe (such as x86). On systems with a faster CPU-GPU interconnect (such as Grace Hopper), sizes greater than 16 GiB may be beneficial. One way to estimate the matrix size for this feature is to take the largest per-GPU memory size used with the HPL-NVIDIA in-core mode, add the target amount of host data, and then work out the new matrix size from this total size.

All the new environment variables needed by the HPL-NVIDIA out-of-core mode can be found in the provided /workspace/hpl-linux-x86_64/TUNING and /workspace/hpl-linux-aarch64-gpu/TUNING files.

If HPL-NVIDIA out-of-core mode is enabled, it is highly recommended to pass the CPU, GPU, and memory affinity arguments to hpl.sh.
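The sizing method described above can be put into numbers: the N-by-N double-precision matrix occupies N² × 8 bytes across all GPUs, so adding host memory to the per-GPU budget yields a larger feasible N. A rough sketch (helper name and rounding to the blocking factor are our assumptions, not part of the benchmark):

```python
import math

def estimate_ooc_n(num_gpus: int, in_core_bytes_per_gpu: int,
                   host_bytes_per_gpu: int, nb: int = 1024) -> int:
    """Estimate an out-of-core HPL matrix size N.

    Total matrix storage is N^2 * 8 bytes spread over all GPUs, so
    N = sqrt(num_gpus * bytes_per_gpu / 8), rounded down here to a
    multiple of the blocking factor NB.
    """
    total_bytes = num_gpus * (in_core_bytes_per_gpu + host_bytes_per_gpu)
    n = math.isqrt(total_bytes // 8)
    return (n // nb) * nb
```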
The scripts hpl.sh and hpcg.sh can be invoked on a command line or through a Slurm batch script to launch the HPL-NVIDIA and HPCG-NVIDIA benchmarks, respectively. The scripts hpl.sh and hpcg.sh accept the following parameters:

- --dat: path to HPL.dat.

Optional parameters:

- --gpu-affinity <string>: colon-separated list of GPU indices
- --cpu-affinity <string>: colon-separated list of CPU index ranges
- --mem-affinity <string>: colon-separated list of memory indices
- --ucx-affinity <string>: colon-separated list of UCX devices
- --ucx-tls <string>: UCX transport to use
- --exec-name <string>: HPL executable file
- --no-multinode: enable flags for no-multinode (no-network) execution

In addition, instead of an input file, the script hpcg.sh accepts the following parameters:
- --nx: specifies the local (to an MPI process) X dimension of the problem
- --ny: specifies the local (to an MPI process) Y dimension of the problem
- --nz: specifies the local (to an MPI process) Z dimension of the problem
- --rt: specifies how many seconds the timed portion of the benchmark should run
- --b: activates benchmarking mode to bypass CPU reference execution when set to one (--b 1)
- --l2cmp: activates compression in the GPU L2 cache when set to one (--l2cmp 1)

The script hpl-mxp.sh can be invoked on a command line or through a Slurm batch script to launch the HPL-MxP-NVIDIA benchmark. The script hpl-mxp.sh requires the following parameters:
- --gpu-affinity <string>: colon-separated list of GPU indices
- --nprow <int>: number of rows in the processor grid
- --npcol <int>: number of columns in the processor grid
- --nporder <string>: "row" or "column" major layout of the processor grid
- --n <int>: size of the N-by-N matrix
- --nb <int>: the blocking constant (panel size)

The full list of accepted parameters can be found in the README and TUNING files.

Note: it is recommended to pass CPU and memory affinity arguments to the HPL-MxP-NVIDIA benchmark. Below are examples for DGX A100 and DGX H100:

--mem-affinity 0:0:0:0:1:1:1:1
--cpu-affinity 0-13:14-27:28-41:42-55:56-69:70-83:84-97:98-111

--mem-affinity 2:3:0:1:6:7:4:5
--cpu-affinity 32-47:48-63:0-15:16-31:96-111:112-127:64-79:80-95
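Since --nprow × --npcol must equal the total number of MPI ranks, and near-square grids are usually a reasonable starting point, a small helper can pick a grid (our illustration, not part of the benchmark):

```python
import math

def pick_grid(ranks: int) -> tuple[int, int]:
    """Return (nprow, npcol) with nprow * npcol == ranks, choosing the
    factor pair closest to square, with nprow <= npcol."""
    for p in range(math.isqrt(ranks), 0, -1):
        if ranks % p == 0:
            return p, ranks // p
    return 1, ranks

# 8 ranks -> (2, 4), i.e. --nprow 2 --npcol 4
```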
The script stream-gpu-test.sh can be invoked on a command line or through a Slurm batch script to launch the STREAM-NVIDIA benchmark. The script stream-gpu-test.sh accepts the following optional parameters:

- --d <int>: device number
- --n <int>: number of elements in the arrays
- --dt fp32: enable the fp32 stream test

The HPL-NVIDIA, HPCG-NVIDIA, HPL-MxP-NVIDIA, and STREAM-NVIDIA benchmarks for GPU can be run in the same way as from the x86_64 container image (see details in the x86 container image section).
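When choosing --n for the STREAM tests, note that STREAM operates on three arrays, so the memory footprint is roughly 3 × n × element size; a quick estimate (helper name is ours):

```python
def stream_bytes(n: int, fp32: bool = False) -> int:
    """Approximate memory used by the three STREAM arrays (a, b, c)."""
    elem_size = 4 if fp32 else 8  # bytes per element
    return 3 * n * elem_size

# One billion double-precision elements -> 24 GB across the three arrays.
```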
This section provides sample runs of the HPL-NVIDIA, HPL-MxP-NVIDIA, and HPCG-NVIDIA benchmarks for NVIDIA Grace CPU.
The scripts hpl-aarch64.sh and hpcg-aarch64.sh can be invoked on a command line or through a Slurm batch script to launch the HPL-NVIDIA and HPCG-NVIDIA benchmarks for NVIDIA Grace CPU, respectively. The scripts hpl-aarch64.sh and hpcg-aarch64.sh accept the following parameters:

- --dat: path to HPL.dat.

Optional parameters:

- --cpu-affinity <string>: colon-separated list of CPU index ranges
- --mem-affinity <string>: colon-separated list of memory indices
- --ucx-affinity <string>: colon-separated list of UCX devices
- --ucx-tls <string>: UCX transport to use
- --exec-name <string>: HPL executable file
Note: it is recommended to bind each MPI process to a NUMA node on NVIDIA Grace CPU, for example:

./hpl-aarch64.sh --dat /my-dat-files/HPL.dat --cpu-affinity 0-71:72-143 --mem-affinity 0:1
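The affinity strings in the note above follow a simple pattern: one colon-separated entry per rank. They can be generated for other core/NUMA layouts with a small helper (our illustration):

```python
def numa_affinity(numa_nodes: int, cores_per_numa: int) -> tuple[str, str]:
    """Build --cpu-affinity / --mem-affinity strings assuming one MPI
    rank per NUMA node and contiguous core numbering."""
    cpu = ":".join(
        f"{i * cores_per_numa}-{(i + 1) * cores_per_numa - 1}"
        for i in range(numa_nodes)
    )
    mem = ":".join(str(i) for i in range(numa_nodes))
    return cpu, mem

# Two 72-core NUMA nodes -> ("0-71:72-143", "0:1"), as in the example above.
```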
In addition, instead of an input file, the script hpcg-aarch64.sh accepts the following parameters:

- --nx: specifies the local (to an MPI process) X dimension of the problem
- --ny: specifies the local (to an MPI process) Y dimension of the problem
- --nz: specifies the local (to an MPI process) Z dimension of the problem
- --rt: specifies how many seconds the timed portion of the benchmark should run
- --b: activates benchmarking mode to bypass CPU reference execution when set to one (--b=1)
- --l2cmp: activates compression in the GPU L2 cache when set to one (--l2cmp=1)
The following parameters control the NVIDIA HPCG benchmark on Grace Hopper systems:

- --exm: specifies the execution mode. 0 is GPU-only, 1 is Grace-only, and 2 is GPU-Grace. Default is 0.
- --pby: specifies the dimension in which the GPU and Grace local problems differ. 0 is auto, 1 is X, 2 is Y, and 3 is Z. Default is 0. Note that the GPU and Grace local problems can differ in one dimension only.
- --lpm: controls the meaning of the value provided for the --g2c parameter. Applicable when --exm is 2; depends on the differing local dimension specified by --pby. Value explanation:
  - --nx 128 --ny 128 --nz 128 --pby 2 --g2c 8 means the differing Grace dimension (Y in this example) is 1/8 of the differing GPU dimension: the GPU local problem is 128x128x128 and the Grace local problem is 128x16x128.
  - --nx 128 --ny 128 --nz 128 --pby 3 --g2c 64 means the differing Grace dimension (Z in this example) is 64: the GPU local problem is 128x128x128 and the Grace local problem is 128x128x64.
  - --g2c is the ratio. For example, with --pby 1, --nx 1024, and --g2c 8, the GPU X dimension is 896 and the Grace X dimension is 128.
  - --g2c is absolute. For example, with --pby 1, --nx 1024, and --g2c 96, the GPU X dimension is 928 and the Grace X dimension is 96.
- --g2c: specifies the value of the differing dimension of the GPU and Grace local problems. Its meaning depends on the --pby and --lpm values.

Optional parameters of the hpcg-aarch64.sh script:
- --npx: specifies the process grid X dimension of the problem
- --npy: specifies the process grid Y dimension of the problem
- --npz: specifies the process grid Z dimension of the problem

The script hpl-mxp-aarch64.sh can be invoked on a command line or through a Slurm batch script to launch the HPL-MxP-NVIDIA benchmark for NVIDIA Grace CPU. The script hpl-mxp-aarch64.sh requires the following parameters:
- --nprow <int>: number of rows in the processor grid
- --npcol <int>: number of columns in the processor grid
- --nporder <string>: "row" or "column" major layout of the processor grid
- --n <int>: size of the N-by-N matrix
- --nb <int>: the blocking constant (panel size)

The full list of accepted parameters can be found in the README and TUNING files.

The script stream-cpu-test.sh can be invoked on a command line or through a Slurm batch script to launch the STREAM-NVIDIA benchmark. The script stream-cpu-test.sh accepts the following optional parameters:

- --n <int>: number of elements in the arrays
- --t <int>: number of threads

For a general guide on pulling and running containers, see the Running A Container chapter in the NVIDIA Containers For Deep Learning Frameworks User's Guide. For more information about using NGC, refer to the NGC Container User Guide.
The examples below use Pyxis/enroot from NVIDIA to facilitate running HPC-Benchmarks containers. Note that an enroot .credentials file is necessary to use these NGC containers.
To copy and customize the sample Slurm scripts and/or sample HPL.dat/hpcg.dat files from the containers, run the container in interactive mode, while mounting a folder outside the container, and copy the needed files, as follows:
CONT='nvcr.io#nvidia/hpc-benchmarks:24.03'
MOUNT="$PWD:/home_pwd"
srun -N 1 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
--container-mounts="${MOUNT}" \
--pty bash
Once inside the container, copy the needed files to /home_pwd
.
HPL-NVIDIA run

Several sample Slurm scripts and several sample input files are available in the container at /workspace/hpl-linux-x86_64 or /workspace/hpl-linux-aarch64-gpu.
To run HPL-NVIDIA
on a single node with 4 GPUs using your custom HPL.dat file:
CONT='nvcr.io#nvidia/hpc-benchmarks:24.03'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 1 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
--container-mounts="${MOUNT}" \
./hpl.sh --dat /my-dat-files/HPL.dat
To run HPL-NVIDIA on 16 nodes with 4 GPUs each (or 8 nodes with 8 GPUs each) using the provided sample HPL-64GPUs.dat file:
CONT='nvcr.io#nvidia/hpc-benchmarks:24.03'
srun -N 16 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
./hpl.sh --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-64GPUs.dat
CONT='nvcr.io#nvidia/hpc-benchmarks:24.03'
srun -N 8 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
./hpl.sh --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-64GPUs.dat
HPL-MxP-NVIDIA run

Several sample Slurm scripts are available in the container at /workspace/hpl-mxp-linux-x86_64 or /workspace/hpl-mxp-linux-aarch64-gpu.
To run HPL-MxP-NVIDIA on a single node with 8 GPUs:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
srun -N 1 --ntasks-per-node=8 \
--container-image="${CONT}" \
./hpl-mxp.sh --n 380000 --nb 2048 --nprow 2 --npcol 4 --nporder row --gpu-affinity 0:1:2:3:4:5:6:7
To run HPL-MxP-NVIDIA on 4 nodes, each node with 4 GPUs:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
srun -N 4 --ntasks-per-node=4 \
--container-image="${CONT}" \
./hpl-mxp.sh --n 280000 --nb 2048 --nprow 4 --npcol 4 --nporder row --gpu-affinity 0:1:2:3
Pay special attention to CPU cores affinity/binding, as it greatly affects the performance of the HPL benchmarks.
HPCG-NVIDIA run

Several sample Slurm scripts and a sample input file are available in the container at /workspace/hpcg-linux-x86_64 or /workspace/hpcg-linux-aarch64.
To run HPCG-NVIDIA on a single node with 8 GPUs using your custom hpcg.dat file on x86:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 1 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
--container-mounts="${MOUNT}" \
./hpcg.sh --dat /my-dat-files/hpcg.dat
To run HPCG-NVIDIA on 16 nodes with 4 GPUs each (or 8 nodes with 8 GPUs each) using script parameters on x86:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 16 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
--container-mounts="${MOUNT}" \
./hpcg.sh --nx 256 --ny 256 --nz 256 --rt 2
To run HPCG-NVIDIA on 16 nodes with 4 GPUs each (or 8 nodes with 8 GPUs each) using script parameters on aarch64:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 16 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
--container-mounts="${MOUNT}" \
./hpcg-aarch64.sh --nx 256 --ny 256 --nz 256 --rt 2
HPL-NVIDIA run

Several sample input files are available in the container at /workspace/hpl-linux-aarch64.
To run HPL-NVIDIA
on two nodes of NVIDIA Grace CPU using your custom HPL.dat file:
CONT='nvcr.io#nvidia/hpc-benchmarks:24.03'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 2 --ntasks-per-node=2 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
--container-mounts="${MOUNT}" \
./hpl-aarch64.sh --dat /my-dat-files/HPL.dat --cpu-affinity 0-71:72-143 --mem-affinity 0:1
where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.
HPL-MxP-NVIDIA run

To run HPL-MxP-NVIDIA on a single NVIDIA Grace Hopper x4 node:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
srun -N 1 --ntasks-per-node=16 \
--container-image="${CONT}" \
./hpl-mxp-aarch64.sh --n 380000 --nb 2048 --nprow 2 --npcol 4 --nporder row \
--cpu-affinity 0-71:72-143:144-215:216-287 \
--mem-affinity 0:1:2:3
where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.
HPCG-NVIDIA run

A sample input file is available in the container at /workspace/hpcg-linux-aarch64.
To run HPCG-NVIDIA
on two nodes of NVIDIA Grace CPU using your custom parameters:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 2 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
--container-mounts="${MOUNT}" \
./hpcg-aarch64.sh --exm 1 --nx 512 --ny 512 --nz 288 --rt 30 --cpu-affinity 0-35:36-71:72-107:108-143 --mem-affinity 0:0:1:1
To run HPCG-NVIDIA
on NVIDIA Grace Hopper x2 using script parameters on aarch64:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
#GPU+Grace (Heterogeneous execution)
#GPU rank has 8 OpenMP threads and Grace rank has 64 OpenMP threads
srun -N 2 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
--container-mounts="${MOUNT}" \
./hpcg-aarch64.sh --nx 256 --ny 1024 --nz 288 --rt 2 \
--exm 2 --pby 2 --lpm 1 --g2c 64 \
--npx 4 --npy 4 --npz 1 \
--cpu-affinity 0-7:8-71:72-79:80-143:144-151:152-215:216-223:224-287 \
--mem-affinity 0:0:1:1:2:2:3:3
The instructions below assume Singularity 3.4.1 or later.
Save the HPC-Benchmark container as a local Singularity image file:
$ singularity pull --docker-login hpc-benchmarks:24.03.sif docker://nvcr.io/nvidia/hpc-benchmarks:24.03
This command saves the container in the current directory as hpc-benchmarks:24.03.sif.
HPL-NVIDIA run

Several sample Slurm scripts and several sample input files are available in the container at /workspace/hpl-linux-x86_64 or /workspace/hpl-linux-aarch64-gpu.
To run HPL-NVIDIA
on a single node with 4 GPUs using your custom HPL.dat file:
CONT='/path/to/hpc-benchmarks:24.03.sif'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 1 --ntasks-per-node=4 singularity run --nv \
-B "${MOUNT}" "${CONT}" \
./hpl.sh --dat /my-dat-files/HPL.dat
To run HPL-NVIDIA on 16 nodes with 4 GPUs each (or 8 nodes with 8 GPUs each) using the provided sample HPL-64GPUs.dat file:
CONT='/path/to/hpc-benchmarks:24.03.sif'
srun -N 16 --ntasks-per-node=4 singularity run --nv \
"${CONT}" \
./hpl.sh --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-64GPUs.dat
CONT='/path/to/hpc-benchmarks:24.03.sif'
srun -N 8 --ntasks-per-node=8 singularity run --nv \
"${CONT}" \
./hpl.sh --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-64GPUs.dat
HPL-MxP-NVIDIA run

Several sample Slurm scripts are available in the container at /workspace/hpl-mxp-linux-x86_64 or /workspace/hpl-mxp-linux-aarch64-gpu.
To run HPL-MxP-NVIDIA on a single node with 8 GPUs:
CONT='/path/to/hpc-benchmarks:24.03.sif'
srun -N 1 --ntasks-per-node=8 singularity run --nv \
"${CONT}" \
./hpl-mxp.sh --n 380000 --nb 2048 --nprow 2 --npcol 4 --nporder row --gpu-affinity 0:1:2:3:4:5:6:7
To run HPL-MxP-NVIDIA on 4 nodes, each with 4 GPUs:
CONT='/path/to/hpc-benchmarks:24.03.sif'
srun -N 4 --ntasks-per-node=4 singularity run --nv \
"${CONT}" \
./hpl-mxp.sh --n 280000 --nb 2048 --nprow 4 --npcol 4 --nporder row --gpu-affinity 0:1:2:3
Pay special attention to CPU cores affinity/binding, as it greatly affects the performance of the HPL benchmarks.
HPCG-NVIDIA run

Several sample Slurm scripts and a sample input file are available in the container at /workspace/hpcg-linux-x86_64 or /workspace/hpcg-linux-aarch64-gpu.
To run HPCG-NVIDIA on a single node with 8 GPUs using your custom hpcg.dat file on x86:

CONT='/path/to/hpc-benchmarks:24.03.sif'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"

srun -N 1 --ntasks-per-node=8 singularity run --nv \
     -B "${MOUNT}" "${CONT}" \
     ./hpcg.sh --dat /my-dat-files/hpcg.dat
To run HPCG-NVIDIA on 16 nodes with 4 GPUs each (or 8 nodes with 8 GPUs each) using script parameters on x86:

CONT='/path/to/hpc-benchmarks:24.03.sif'

srun -N 16 --ntasks-per-node=4 singularity run --nv \
     "${CONT}" \
     ./hpcg.sh --nx 256 --ny 256 --nz 256 --rt 2
To run HPCG-NVIDIA on a single node with 4 GPUs using your custom hpcg.dat file on aarch64:

CONT='/path/to/hpc-benchmarks:24.03.sif'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"

srun -N 1 --ntasks-per-node=4 singularity run --nv \
     -B "${MOUNT}" "${CONT}" \
     ./hpcg-aarch64.sh --dat /my-dat-files/hpcg.dat
HPL-NVIDIA run

Several sample input files are available in the container at /workspace/hpl-linux-aarch64.
To run HPL-NVIDIA
on two nodes of NVIDIA Grace CPU using your custom HPL.dat file:
CONT='/path/to/hpc-benchmarks:24.03.sif'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 2 --ntasks-per-node=2 singularity run \
-B "${MOUNT}" "${CONT}" \
./hpl-aarch64.sh --dat /my-dat-files/HPL.dat --cpu-affinity 0-71:72-143 --mem-affinity 0:1
where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.
HPL-MxP-NVIDIA run

To run HPL-MxP-NVIDIA on a single NVIDIA Grace Hopper x4 node:
CONT='/path/to/hpc-benchmarks:24.03.sif'
srun -N 1 --ntasks-per-node=16 singularity run \
"${CONT}" \
./hpl-mxp-aarch64.sh --n 380000 --nb 2048 --nprow 2 --npcol 4 --nporder row \
--cpu-affinity 0-71:72-143:144-215:216-287 \
--mem-affinity 0:1:2:3
where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.
HPCG-NVIDIA run

A sample input file is available in the container at /workspace/hpcg-linux-aarch64.
To run HPCG-NVIDIA
on two nodes of NVIDIA Grace CPU using your custom hpcg.dat file:
CONT='/path/to/hpc-benchmarks:24.03.sif'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 2 --ntasks-per-node=4 singularity run \
-B "${MOUNT}" "${CONT}" \
./hpcg-aarch64.sh --exm 1 --nx 512 --ny 512 --nz 288 --rt 10 --cpu-affinity 0-35:36-71:72-107:108-143 --mem-affinity 0:0:1:1
To run HPCG-NVIDIA
on NVIDIA Grace Hopper x2 using script parameters on aarch64:
CONT='/path/to/hpc-benchmarks:24.03.sif'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
#GPU+Grace (Heterogeneous execution)
srun -N 2 --ntasks-per-node=8 singularity run \
-B "${MOUNT}" "${CONT}" \
./hpcg-aarch64.sh --nx 256 --ny 1024 --nz 288 --rt 2 \
--exm 2 --pby 2 --lpm 1 --g2c 64 \
--npx 4 --npy 4 --npz 1 \
--cpu-affinity 0-7:8-71:72-79:80-143:144-151:152-215:216-223:224-287 \
--mem-affinity 0:0:1:1:2:2:3:3
The examples below are for single-node runs with Docker; Docker is not recommended for multi-node runs.

Download the HPC-Benchmarks container as a local Docker image:
$ docker pull nvcr.io/nvidia/hpc-benchmarks:24.03
NOTE: you may want to add the --privileged flag to your docker command to avoid a "set_mempolicy" error.
HPL-NVIDIA run

To run HPL-NVIDIA on a single node with 4 GPUs using your custom HPL.dat file:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
MOUNT="/full-path/to/your/custom/dat-files:/my-dat-files"
docker run --gpus all --shm-size=1g -v ${MOUNT} \
${CONT} \
mpirun --bind-to none -np 4 \
./hpl.sh --dat /my-dat-files/HPL.dat
HPL-MxP-NVIDIA run

To run HPL-MxP-NVIDIA on a single node with 8 GPUs:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
docker run --gpus all --shm-size=1g \
${CONT} \
mpirun --bind-to none -np 8 \
./hpl-mxp.sh --n 380000 --nb 2048 --nprow 2 --npcol 4 --nporder row --gpu-affinity 0:1:2:3:4:5:6:7
HPCG-NVIDIA run

To run HPCG-NVIDIA on a single node with 8 GPUs using your custom hpcg.dat file on x86:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
MOUNT="/full-path/to/your/custom/dat-files:/my-dat-files"
docker run --gpus all --shm-size=1g -v ${MOUNT} \
${CONT} \
mpirun --bind-to none -np 8 \
./hpcg.sh --dat /my-dat-files/hpcg.dat
HPL-NVIDIA run

Several sample docker run scripts are available in the container at /workspace/hpl-linux-aarch64.

To run HPL-NVIDIA on a single NVIDIA Grace CPU node using your custom HPL.dat file:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
docker run -v ${MOUNT} \
"${CONT}" \
mpirun --bind-to none -np 2 \
./hpl-aarch64.sh --dat /my-dat-files/HPL.dat --cpu-affinity 0-71:72-143 --mem-affinity 0:1
where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.
HPL-MxP-NVIDIA run

Several sample docker run scripts are available in the container at /workspace/hpl-mxp-linux-aarch64.

To run HPL-MxP-NVIDIA on a single NVIDIA Grace Hopper x4 node:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
docker run \
"${CONT}" \
mpirun --bind-to none -np 4 \
./hpl-mxp-aarch64.sh --n 380000 --nb 2048 --nprow 2 --npcol 4 --nporder row \
--cpu-affinity 0-71:72-143:144-215:216-287 --mem-affinity 0:1:2:3
where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.
HPCG-NVIDIA run

Several sample docker run scripts are available in the container at /workspace/hpcg-linux-aarch64.

To run HPCG-NVIDIA on a single node of NVIDIA Grace CPU using your custom parameters:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
docker run -v ${MOUNT} \
"${CONT}" \
mpirun --bind-to none -np 4 \
./hpcg-aarch64.sh --exm 1 --nx 512 --ny 512 --nz 288 --rt 10 --cpu-affinity 0-35:36-71:72-107:108-143 --mem-affinity 0:0:1:1
To run HPCG-NVIDIA
on NVIDIA Grace Hopper x2 using script parameters on aarch64:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.03'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
#GPU+Grace (Heterogeneous execution)
docker run -v ${MOUNT} \
"${CONT}" \
mpirun --bind-to none -np 16 \
./hpcg-aarch64.sh --nx 256 --ny 1024 --nz 288 --rt 2 \
--exm 2 --pby 2 --lpm 1 --g2c 64 \
--npx 4 --npy 4 --npz 1 \
--cpu-affinity 0-7:8-71:72-79:80-143:144-151:152-215:216-223:224-287 \
--mem-affinity 0:0:1:1:2:2:3:3