Situation and the problem

Running a model-parallel training step with torchrun on Google Cloud Platform's (GCP) Batch service presents a challenge: the script, which runs successfully on a multi-GPU Linux developer machine, fails on GCP Batch under the same configuration. The error log points to Triton being unable to locate libcuda.so:

  File "/opt/conda/lib/python3.10/site-packages/triton/compiler/compiler.py", line 425, in compile
    so_path = make_stub(name, signature, constants)
  File "/opt/conda/lib/python3.10/site-packages/triton/compiler/make_launcher.py", line 39, in make_stub
    so = _build(name, src_path, tmpdir)
  File "/opt/conda/lib/python3.10/site-packages/triton/common/build.py", line 61, in _build
    cuda_lib_dirs = libcuda_dirs()
  File "/opt/conda/lib/python3.10/site-packages/triton/common/build.py", line 30, in libcuda_dirs
    assert any(os.path.exists(os.path.join(path, 'libcuda.so')) for path in dirs), msg
AssertionError: libcuda.so cannot found!

This problem did not occur with single-GPU PyTorch scripts, most likely because the driver libraries are mounted into the container as shown in Google's official Large Language Model (LLM) example:

volumes:
- /var/lib/nvidia/lib64:/usr/local/nvidia/lib64
- /var/lib/nvidia/bin:/usr/local/nvidia/bin

As an aside, I don't know exactly why torch depends on triton here. It appears to be related to how the new TorchInductor backend manages distributed CUDA graphs.
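
For reference, the dependency is easy to trigger: any torch.compile call on a CUDA model goes through TorchInductor, which generates and JIT-compiles Triton kernels on first use. A minimal sketch, assuming PyTorch 2.x and a visible GPU:

    import torch

    # torch.compile uses the TorchInductor backend by default; on CUDA it
    # code-generates Triton kernels, which is where the libcuda.so lookup happens.
    model = torch.nn.Linear(16, 16).cuda()
    compiled = torch.compile(model)

    # The first forward pass triggers Triton compilation; in a container where
    # ldconfig cannot see libcuda.so, this is where the AssertionError above appears.
    out = compiled(torch.randn(8, 16, device="cuda"))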

Triton’s Handling of libcuda.so

Triton 2.1.0 uses /sbin/ldconfig -p to find a linkable libcuda.so (Source). Unfortunately, that lookup fails if your Docker runtime is not set to nvidia-container-runtime.

  • Default docker runtime output
      $ docker run -it pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime bash
      root@2d251c5d518b:/workspace# /sbin/ldconfig -p | grep cuda
      # nothing
    
  • nvidia docker runtime output
      $ docker run -it --runtime=nvidia pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime bash
      root@5b77c2521774:/workspace# /sbin/ldconfig -p | grep cuda
      libcudadebugger.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcudadebugger.so.1
      libcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1
    
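In Python terms, the check Triton performs at compile time is roughly the following (a simplified sketch of the linked build.py logic; the function name here is mine):

    import os
    import subprocess

    def find_libcuda_dirs():
        # Approximation of triton/common/build.py: parse `/sbin/ldconfig -p`
        # and keep the directories of any cached libcuda.so entries.
        cache = subprocess.check_output(["/sbin/ldconfig", "-p"]).decode()
        return [
            os.path.dirname(line.split("=>")[-1].strip())
            for line in cache.splitlines()
            if "libcuda.so" in line
        ]

    # Under the default Docker runtime this prints [], which is exactly why
    # the assertion in the traceback above fires.
    print(find_libcuda_dirs())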

Role of Nvidia Container Toolkit

The ldconfig command builds its search path from the files under /etc/ld.so.conf.d/ (included via /etc/ld.so.conf). Even if the driver and CUDA files, including libcuda.so, are mounted into the container, ldconfig cannot locate them when no configuration file lists the mounted path (/usr/local/nvidia/lib64). The NVIDIA Container Toolkit, by contrast, injects the necessary driver libraries into the container and registers them with the dynamic linker, which is what gives scripts proper GPU access.

Here is an example of ldconfig's configuration files:

# libc default configuration
/usr/local/lib
# Multiarch support
/usr/local/lib/x86_64-linux-gnu
/lib/x86_64-linux-gnu
/usr/lib/x86_64-linux-gnu

It does not include the mounted path /usr/local/nvidia/lib64 used in the official Google example, so ldconfig cannot find libcuda.so there.
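
You can confirm the mismatch from inside such a container: the library files are visible on disk under the mounted path, but ldconfig's cache (which is what Triton consults) does not know about them. A quick check, assuming the driver mount from the Google example above:

    import os
    import subprocess

    mounted = "/usr/local/nvidia/lib64"

    # The driver mount makes the files visible on disk...
    print(any(f.startswith("libcuda.so") for f in os.listdir(mounted)))

    # ...but the ldconfig cache, built only from the directories listed in
    # /etc/ld.so.conf.d/*.conf, still has no libcuda entry.
    cache = subprocess.check_output(["/sbin/ldconfig", "-p"]).decode()
    print("libcuda.so" in cache)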

Solution Implementation

To resolve this, we first configure the NVIDIA runtime on the host machine (assuming the GCP G2 machine has nvidia-container-toolkit installed but not yet configured for Docker), and then run the torchrun container with the NVIDIA runtime options. In the Batch job this means two host-level script steps that configure the runtime and restart Docker, followed by the container step with the options --privileged, --shm-size=4gb, --runtime=nvidia, and --gpus all:

- script:
    text: "sudo nvidia-ctk runtime configure --runtime=docker"
- script:
    text: "sudo systemctl restart docker"
- container:
    # ... other fields specifying command arguments and imageUri 
    entrypoint: torchrun
    options: --privileged --shm-size=4gb --runtime=nvidia --gpus all
    # `torchrun` usually requires a huge shm size (the Docker default is only 64MB).

This approach gives torchrun the shared memory and driver environment it needs to run correctly on the GCP Batch service.
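
After the job starts with the NVIDIA runtime, a quick sanity check inside the container is to call the same helper that failed in the traceback (module path as shown there, for Triton 2.1.0):

    # Run inside the container; should print the driver library directory
    # (e.g. /usr/lib/x86_64-linux-gnu) instead of raising the AssertionError.
    from triton.common.build import libcuda_dirs
    print(libcuda_dirs())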