Analysis and Solution for Model-Parallel Execution on GCP Batch with Torchrun
Situation and the Problem
Running a model-parallel step on Google Cloud Platform’s (GCP) Batch service with torchrun presents a challenge: the script, which runs successfully on a multi-GPU Linux developer machine, fails on GCP Batch under the same configuration.
The error log suggests an issue with locating libcuda.so:
File "/opt/conda/lib/python3.10/site-packages/triton/compiler/compiler.py", line 425, in compile
so_path = make_stub(name, signature, constants)
File "/opt/conda/lib/python3.10/site-packages/triton/compiler/make_launcher.py", line 39, in make_stub
so = _build(name, src_path, tmpdir)
File "/opt/conda/lib/python3.10/site-packages/triton/common/build.py", line 61, in _build
cuda_lib_dirs = libcuda_dirs()
File "/opt/conda/lib/python3.10/site-packages/triton/common/build.py", line 30, in libcuda_dirs
assert any(os.path.exists(os.path.join(path, 'libcuda.so')) for path in dirs), msg
AssertionError: libcuda.so cannot found!
This problem did not occur with single-GPU PyTorch scripts, likely because the driver is mounted into the container correctly, following Google’s official Large Language Model (LLM) example:
volumes:
- /var/lib/nvidia/lib64:/usr/local/nvidia/lib64
- /var/lib/nvidia/bin:/usr/local/nvidia/bin
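As a quick sanity check (a rough sketch assuming the volume mapping above), from inside the container you can see that the driver libraries exist under the mounted path even though ldconfig does not list them; a plain single-GPU PyTorch script never goes through the ldconfig lookup, which is presumably why it still works:
ls /usr/local/nvidia/lib64 | grep libcuda     # the mounted driver's libcuda.so* files are there
/sbin/ldconfig -p | grep libcuda              # ...but ldconfig knows nothing about them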
In addition, I don’t know exactly why torch depends on triton. It seems to be related to how PyTorch’s new TorchInductor module manages distributed CUDA graphs.
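For reference, here is a minimal sketch of my own (assuming PyTorch 2.1 with a CUDA device) that exercises the same Triton compilation path without any model parallelism; compiling even a simple pointwise op with TorchInductor is enough to trigger the libcuda.so lookup:
python - <<'EOF'
import torch

@torch.compile  # default TorchInductor backend; emits Triton kernels for CUDA tensors
def f(x):
    return torch.relu(x * 2.0)

# In a container where ldconfig cannot see libcuda.so, this raises Triton's
# "libcuda.so cannot found!" AssertionError shown above.
print(f(torch.randn(16, device="cuda")))
EOF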
Triton’s Handling of libcuda.so
Triton 2.1.0 uses /sbin/ldconfig -p to find a linkable libcuda.so (Source). Unfortunately, it cannot find libcuda.so if your Docker runtime is not set to nvidia-container-runtime.
- Default docker runtime output:
$ docker run -it pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime bash
root@2d251c5d518b:/workspace# /sbin/ldconfig -p | grep cuda
# nothing
- nvidia docker runtime output:
$ docker run -it --runtime=nvidia pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime bash
root@5b77c2521774:/workspace# /sbin/ldconfig -p | grep cuda
        libcudadebugger.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcudadebugger.so.1
        libcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1
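For context, Triton's libcuda_dirs() essentially parses this same ldconfig -p output and collects the directories that contain libcuda.so; the one-liner below is my rough approximation of that lookup, not Triton's exact code:
/sbin/ldconfig -p | grep 'libcuda\.so' | awk '{print $NF}' | xargs -r -n1 dirname | sort -u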
Role of Nvidia Container Toolkit
The ldconfig command relies on the /etc/ld.so.conf.d/*.conf files for its search directories. Even if the driver/CUDA files, including libcuda.so, are mounted, ldconfig cannot locate them when those configuration files do not include the mounted path (/usr/local/nvidia/lib64). The Nvidia Container Toolkit, by contrast, loads all the necessary libraries into the container correctly, enabling proper GPU access for scripts.
Here is an example of ldconfig’s conf file:
# libc default configuration
/usr/local/lib
# Multiarch support
/usr/local/lib/x86_64-linux-gnu
/lib/x86_64-linux-gnu
/usr/lib/x86_64-linux-gnu
It does not include the mounted path /usr/local/nvidia/lib64 used in the Google official example, so ldconfig cannot find libcuda.so.
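An alternative workaround (a sketch only; the actual fix I used is in the next section) would be to make ldconfig aware of the mounted path inside the container, assuming libcuda.so really is present there as described above:
# Hypothetical workaround, not the approach taken below: register the mounted
# driver path with ldconfig so Triton's lookup can succeed.
echo "/usr/local/nvidia/lib64" > /etc/ld.so.conf.d/nvidia-mounted.conf   # hypothetical file name
ldconfig                                      # rebuild the linker cache
/sbin/ldconfig -p | grep libcuda              # the mounted libraries should now be listed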
Solution Implementation
To resolve this, we first configure the nvidia runtime on the host machine (assuming the GCP G2 machine has nvidia-container-toolkit installed but not configured for Docker). Then, we execute the torchrun script with nvidia-runtime options. The setup involves running scripts on the host to configure the runtime and restart Docker, followed by executing the container with specific options, including --privileged, --shm-size=4gb, --runtime=nvidia, and --gpus all.
- script:
    text: "sudo nvidia-ctk runtime configure --runtime=docker"
- script:
    text: "sudo systemctl restart docker"
- container:
    # ... other fields specifying command arguments and imageUri
    entrypoint: torchrun
    options: --privileged --shm-size=4gb --runtime=nvidia --gpus all
    # `torchrun` usually requires a huge shm size (Docker defaults to 64MB).
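These steps can also be verified by hand on the host VM before submitting the Batch job (a sketch reusing the image and flags from above); once the runtime is configured, ldconfig inside the container should resolve libcuda.so:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# libcuda.so.1 (and libcuda.so) should now be listed inside the container
docker run --rm --runtime=nvidia --gpus all \
  pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime \
  bash -c "/sbin/ldconfig -p | grep libcuda"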
This approach ensures that torchrun has the necessary resources and environment configuration to run correctly on the GCP Batch service.
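For completeness, the runnables shown above live under taskGroups[].taskSpec.runnables in the Batch job config, and the job can then be submitted with gcloud. The job name, region, and config file name below are placeholders (the official examples use a JSON config file):
gcloud batch jobs submit torchrun-model-parallel \
  --location=us-central1 \
  --config=job.json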