AWS DL AMI for GPU PyTorch

Sungho Kim · September 6, 2023

The AWS Deep Learning AMI (DLAMI) is your one-stop shop for deep learning in the cloud. This customized machine instance is available in most Amazon EC2 regions for a variety of instance types, from a small CPU-only instance to the latest high-powered multi-GPU instances. It comes preconfigured with NVIDIA CUDA and NVIDIA cuDNN, as well as the latest releases of the most popular deep learning frameworks.

AMI Name format:

  • Deep Learning AMI GPU PyTorch 2.0.${PATCH_VERSION} (Amazon Linux 2) ${YYYY-MM-DD}

The AMI includes the following:

  • Supported AWS Service: EC2
  • Operating System: Amazon Linux 2
  • Compute Architecture: x86
  • Supported EC2 Instances: P5, P4de, P4d, P3, G5, G3, G4dn (P2 is not supported)
  • Python: /opt/conda/envs/pytorch/bin/python
  • NVIDIA Driver: 535.54.03
  • NVIDIA CUDA 12.1 stack:
  • CUDA, NCCL and cuDNN installation path: /usr/local/cuda-12.1/
  • Default CUDA: 12.1
  • Path /usr/local/cuda points to /usr/local/cuda-12.1/

The following environment variables are updated:

  • LD_LIBRARY_PATH includes /usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/local/cuda:/usr/local/cuda/targets/x86_64-linux/lib
  • PATH includes /usr/local/cuda/bin/:/usr/local/cuda/include/
  • Compiled NCCL version for CUDA 12.1: 2.18.3
  • Note: The PyTorch package ships with its own statically linked NCCL 2.18.3 and does not use the system NCCL.
  • NCCL tests location:
    all_reduce, all_gather and reduce_scatter: /usr/local/cuda-xx.x/efa/test-cuda-xx.x/

LD_LIBRARY_PATH already contains the paths needed to run the NCCL tests.

  • Common PATHs are already added to LD_LIBRARY_PATH:
    /opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:/usr/local/lib:/usr/lib
  • LD_LIBRARY_PATH is updated with the CUDA version paths:
    /usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/local/cuda:/usr/local/cuda/targets/x86_64-linux/lib
  • EFA Installer: 1.24.1
  • AWS OFI NCCL: 1.7.1-aws
  • Installation path: /opt/aws-ofi-nccl/. The path /opt/aws-ofi-nccl/lib is added to LD_LIBRARY_PATH.
  • Tests path for ring, message_transfer: /opt/aws-ofi-nccl/tests

Note: The PyTorch package also comes with a dynamically linked AWS OFI NCCL plugin, installed as the conda package aws-ofi-nccl-dlc. PyTorch uses that package instead of the system AWS OFI NCCL.
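
The library paths listed above can be spot-checked from a shell on the instance. The following is a minimal sketch, assuming a POSIX-compatible shell; path_list_contains is a hypothetical helper name, and the directories checked are only a sample taken from the lists above.

```shell
# Hypothetical helper: succeed if the colon-separated list $1
# contains the exact directory $2.
path_list_contains() {
  case ":$1:" in
    *":$2:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Sample of directories the notes above say are on LD_LIBRARY_PATH.
for dir in /opt/amazon/efa/lib /opt/amazon/openmpi/lib \
           /opt/aws-ofi-nccl/lib /usr/local/cuda/lib64; do
  if path_list_contains "${LD_LIBRARY_PATH:-}" "$dir"; then
    echo "ok       $dir"
  else
    echo "missing  $dir"
  fi
done
```

Wrapping the list in colons before matching avoids false positives from prefixes, so /opt/aws-ofi-nccl/lib64 does not count as /opt/aws-ofi-nccl/lib.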

  • GDRCopy: 2.3
  • EBS volume type: gp3
  • Python version: 3.10
  • Query the AMI ID with the AWS CLI (example region is us-east-1):

aws ec2 describe-images --region us-east-1 --owners amazon --filters 'Name=name,Values=Deep Learning AMI GPU PyTorch 2.0.? (Amazon Linux 2) ????????' 'Name=state,Values=available' --query 'reverse(sort_by(Images, &CreationDate))[:1].ImageId' --output text
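
For scripting, the query above can be wrapped so that the AMI ID ends up in a variable. This is a minimal sketch, assuming a configured AWS CLI with credentials; dlami_name_filter and latest_dlami_id are hypothetical helper names, not AWS CLI commands, and the filter string simply reuses the wildcard pattern from the query above.

```shell
# Hypothetical helper: build the name filter for a PyTorch minor version
# (for example 2.0), using the same wildcards as the query above.
dlami_name_filter() {
  printf 'Name=name,Values=Deep Learning AMI GPU PyTorch %s.? (Amazon Linux 2) ????????' "$1"
}

# Hypothetical helper: print the newest matching AMI ID for a region.
# Needs the AWS CLI and valid credentials, so it is not called here.
latest_dlami_id() {
  aws ec2 describe-images --region "$1" --owners amazon \
    --filters "$(dlami_name_filter "$2")" 'Name=state,Values=available' \
    --query 'reverse(sort_by(Images, &CreationDate))[:1].ImageId' \
    --output text
}

# Show the filter that would be sent; this part runs without AWS access.
dlami_name_filter 2.0; echo

# Example use on a machine with AWS access:
# AMI=$(latest_dlami_id us-east-1 2.0)
```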

Note

  • P5 Instance: DeviceIndex is unique to each NetworkCard and must be a non-negative integer less than the limit of ENIs per NetworkCard. On P5, the number of ENIs per NetworkCard is 2, so the only valid values for DeviceIndex are 0 and 1. Below is an example EC2 P5 instance launch command using the AWS CLI, with NetworkCardIndex running from 0 to 31, DeviceIndex set to 0 for the first interface, and DeviceIndex set to 1 for the remaining 31 interfaces.

aws ec2 run-instances --region $REGION \
--instance-type $INSTANCETYPE \
--image-id $AMI --key-name $KEYNAME \
--iam-instance-profile "Name=dlami-builder" \
--tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=$TAG}]" \
--network-interfaces "NetworkCardIndex=0,DeviceIndex=0,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
"NetworkCardIndex=1,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
"NetworkCardIndex=2,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
"NetworkCardIndex=3,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
"NetworkCardIndex=4,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa" \
....
....
....
"NetworkCardIndex=31,DeviceIndex=1,Groups=$SG,SubnetId=$SUBNET,InterfaceType=efa"

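Rather than typing all 32 interface entries by hand, they can be generated in a loop. This is a minimal sketch of the rule described in the note above (DeviceIndex 0 on card 0, DeviceIndex 1 on cards 1 to 31); p5_network_interfaces is a hypothetical helper name, and the security group and subnet IDs are placeholder values.

```shell
# Hypothetical helper: print one EFA interface spec per network card (0..31).
# $1 = security group ID, $2 = subnet ID.
p5_network_interfaces() {
  card=0
  while [ "$card" -lt 32 ]; do
    # Card 0 carries DeviceIndex 0; every other card uses DeviceIndex 1.
    if [ "$card" -eq 0 ]; then device=0; else device=1; fi
    echo "NetworkCardIndex=${card},DeviceIndex=${device},Groups=$1,SubnetId=$2,InterfaceType=efa"
    card=$((card + 1))
  done
}

# Print the generated specs with placeholder IDs.
p5_network_interfaces sg-EXAMPLE subnet-EXAMPLE

# Example use in bash on a machine with AWS access:
# mapfile -t NICS < <(p5_network_interfaces "$SG" "$SUBNET")
# aws ec2 run-instances ... --network-interfaces "${NICS[@]}"
```

Each printed line is one argument to --network-interfaces, so the bash array form in the comment keeps the entries separate without relying on word splitting.
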
Horovod:

  • Horovod is supported in the current pytorch conda environment on the DLAMI. However, Horovod will be removed from the conda environment in the upcoming PyTorch v2.1 release. Customers will be able to install the Horovod libraries on their DLAMIs for distributed training jobs by following the Horovod guidelines.