# Quick run-through: AWS server setup, LLM deployment, and serving with Docker

Ahn Kyuwon · September 7, 2024

## Launching the instance and initial setup


### Starting the EC2 instance

This post uses a g4dn.xlarge instance, which carries one of the more affordable GPUs, but you first need to go through the procedure of raising the vCPU quota to 4. Follow the post below (approval can take more than a day).

https://velog.io/@ahn_kyuwon/%EA%B8%B0%EC%88%A0-ec2-instance-GPU
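
If you prefer the CLI, the quota increase can also be requested with the AWS CLI. A sketch only: the quota code below is the one commonly listed for "Running On-Demand G and VT instances", so verify it with list-service-quotas before submitting.

$ aws service-quotas list-service-quotas \
    --service-code ec2 \
    --query "Quotas[?contains(QuotaName, 'G and VT')].[QuotaCode,QuotaName,Value]"

$ aws service-quotas request-service-quota-increase \
    --service-code ec2 \
    --quota-code L-DB2E81BA \
    --desired-value 4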

Once the g4dn.xlarge instance is up, proceed with the steps below.


# Check the OS
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"

# Check storage
$ df -h
Filesystem       Size  Used Avail Use% Mounted on
/dev/root         29G  1.6G   27G   6% /
tmpfs            7.8G     0  7.8G   0% /dev/shm

# Check memory
$ free -h    
               total        used        free      shared  buff/cache   available
Mem:            15Gi       519Mi        14Gi       2.7Mi       311Mi        14Gi
Swap:             0B          0B          0B

# Update the package list
$ sudo apt-get update

### Installing Docker


# Install prerequisite packages
$ sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    gnupg \
    lsb-release \
    software-properties-common

# Add Docker's official GPG key (create the keyrings directory first, per the official docs)
$ sudo install -m 0755 -d /etc/apt/keyrings
$ sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
$ sudo chmod a+r /etc/apt/keyrings/docker.asc

# Set up the Docker apt repository
$ echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

# Update the package list again
$ sudo apt-get update

# Install Docker Engine and the plugins
$ sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

# Add the current user to the docker group
$ sudo usermod -aG docker $USER

# Refresh group membership for the current session
$ newgrp docker

# Verify
$ docker run hello-world

Hello from Docker!
This message shows that your installation appears to be working correctly.

### Installing Git


$ sudo apt install git

# Verify
$ git --version

git version 2.43.0

### Installing Python


# Install Python 3
$ sudo apt install python3

# Verify
$ python3 --version
Python 3.12.3

# Install pip
$ sudo apt install python3-pip

# Install the venv module
$ sudo apt install python3-venv

# Create a virtual environment
$ python3 -m venv myenv

### Installing llama.cpp


$ git clone https://github.com/ggerganov/llama.cpp
$ cd llama.cpp

# Build the executables (llama-cli, llama-quantize) that are used later
$ make -j
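
Note: newer llama.cpp revisions have replaced the Makefile build with CMake, so make -j may fail on a fresh clone. A rough CMake equivalent (binaries end up under build/bin/):

$ cmake -B build
$ cmake --build build --config Release -j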

### Installing the NVIDIA driver

https://documentation.ubuntu.com/aws/en/latest/aws-how-to/instances/install-nvidia-drivers/


# Check the GPU before installing
$ lspci | grep -i nvidia
00:1e.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)

# Check the kernel / OS
$ uname -a && cat /etc/os-release
Linux ip-172-31-33-86 6.8.0-1012-aws #13-Ubuntu SMP Mon Jul 15 13:40:27 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
PRETTY_NAME="Ubuntu 24.04 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"

# Install the NVIDIA driver
$ sudo apt install -y ubuntu-drivers-common
$ sudo ubuntu-drivers install

# Reboot
$ sudo reboot

# Verify
$ nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:1E.0 Off |                    0 |
| N/A   32C    P0              26W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

### Installing the CUDA toolkit


https://developer.nvidia.com/cuda-12-2-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_network

$ sudo apt install nvidia-cuda-toolkit

# Verify
$ nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

## Loading the model


### Setting environment variables


$ echo 'export MY_HUGGINGFACE_TOKEN="{token id}"' >> ~/.bashrc
$ source ~/.bashrc

### Installing the required packages


# Activate the virtual environment
$ source myenv/bin/activate

# Install the Hugging Face Hub package
$ pip install huggingface_hub

# Install any additional packages as needed
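
As an aside, pip install huggingface_hub also provides the huggingface-cli command, so instead of exporting a token you could log in once and let the library pick up the stored token automatically (optional; the script below works either way):

$ huggingface-cli login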

### Downloading the model


download.py

import os
from huggingface_hub import snapshot_download

# Read the token exported in ~/.bashrc above
token = os.getenv("MY_HUGGINGFACE_TOKEN")

# Download the gemma-2-2b-it repository into a local directory
snapshot_download(
    repo_id="google/gemma-2-2b-it",
    local_dir="models/gemma-2-2b-it",
    token=token,
    local_dir_use_symlinks=False,  # copy real files instead of symlinking into the cache
    ignore_patterns=["original/*"],
)
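
Run the script inside the activated virtual environment (download.py is simply the name the file is saved under here):

$ python3 download.py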

Check the downloaded files:

$ ls -alh models/gemma-2-2b-it/

total 4.9G
drwxrwxr-x 3 ubuntu ubuntu 4.0K Sep  7 03:47 .
drwxrwxr-x 3 ubuntu ubuntu 4.0K Sep  7 03:47 ..
drwxrwxr-x 3 ubuntu ubuntu 4.0K Sep  7 03:46 .cache
-rw-rw-r-- 1 ubuntu ubuntu 1.6K Sep  7 03:46 .gitattributes
-rw-rw-r-- 1 ubuntu ubuntu  29K Sep  7 03:46 README.md
-rw-rw-r-- 1 ubuntu ubuntu  838 Sep  7 03:46 config.json
-rw-rw-r-- 1 ubuntu ubuntu  187 Sep  7 03:46 generation_config.json
-rw-rw-r-- 1 ubuntu ubuntu 4.7G Sep  7 03:47 model-00001-of-00002.safetensors
-rw-rw-r-- 1 ubuntu ubuntu 230M Sep  7 03:46 model-00002-of-00002.safetensors
-rw-rw-r-- 1 ubuntu ubuntu  24K Sep  7 03:46 model.safetensors.index.json
-rw-rw-r-- 1 ubuntu ubuntu  636 Sep  7 03:46 special_tokens_map.json
-rw-rw-r-- 1 ubuntu ubuntu  17M Sep  7 03:46 tokenizer.json
-rw-rw-r-- 1 ubuntu ubuntu 4.1M Sep  7 03:46 tokenizer.model
-rw-rw-r-- 1 ubuntu ubuntu  46K Sep  7 03:46 tokenizer_config.json

### Growing the EBS volume

By this point storage is getting tight.
We have not even converted the model to GGUF yet, and less than half of the disk is left.

$ df -h

Filesystem       Size  Used Avail Use% Mounted on
/dev/root         29G   17G   12G  59% /
tmpfs            7.8G     0  7.8G   0% /dev/shm

# After growing the volume in the AWS console
$ df -h

Filesystem       Size  Used Avail Use% Mounted on
/dev/root         96G   17G   80G  17% /
tmpfs            7.8G     0  7.8G   0% /dev/shm
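
If the root filesystem does not grow on its own after the console change, the partition and filesystem can be extended manually. A sketch, assuming the usual layout with the root filesystem on partition 1 of /dev/nvme0n1 (check with lsblk first):

$ lsblk
$ sudo growpart /dev/nvme0n1 1
$ sudo resize2fs /dev/nvme0n1p1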

Feeling much more comfortable now.

## Converting the model


gemma-2 uses BF16 tensors by default.
The GGUF conversion offers several output types, but for now we keep the same precision.
For inference on the GPU the candidates are:

  • FP32 (32-bit float): too large, ruled out.
  • FP16, BF16
  • INT8 (8-bit integer): supported on T4-class GPUs and newer
  • INT4: precision is too low, not considered here

# Install required packages
$ pip install numpy torch sentencepiece safetensors

# convert.sh
$ python3 $HOME/llama.cpp/convert_hf_to_gguf.py \
        $HOME/models/gemma-2-2b-it \
        --outtype bf16
        
# Check the GGUF file
$ ls -alh ~/models/gemma-2-2b-it/
total 9.8G

drwxrwxr-x 3 ubuntu ubuntu 4.0K Sep  7 09:58 .
drwxrwxr-x 3 ubuntu ubuntu 4.0K Sep  7 03:47 ..
drwxrwxr-x 3 ubuntu ubuntu 4.0K Sep  7 03:46 .cache
-rw-rw-r-- 1 ubuntu ubuntu 1.6K Sep  7 03:46 .gitattributes
-rw-rw-r-- 1 ubuntu ubuntu  29K Sep  7 03:46 README.md
-rw-rw-r-- 1 ubuntu ubuntu  838 Sep  7 03:46 config.json
-rw-rw-r-- 1 ubuntu ubuntu 4.9G Sep  7 09:59 gemma-2-2B-it-BF16.gguf
-rw-rw-r-- 1 ubuntu ubuntu  187 Sep  7 03:46 generation_config.json
-rw-rw-r-- 1 ubuntu ubuntu 4.7G Sep  7 03:47 model-00001-of-00002.safetensors
-rw-rw-r-- 1 ubuntu ubuntu 230M Sep  7 03:46 model-00002-of-00002.safetensors
-rw-rw-r-- 1 ubuntu ubuntu  24K Sep  7 03:46 model.safetensors.index.json
-rw-rw-r-- 1 ubuntu ubuntu  636 Sep  7 03:46 special_tokens_map.json
-rw-rw-r-- 1 ubuntu ubuntu  17M Sep  7 03:46 tokenizer.json
-rw-rw-r-- 1 ubuntu ubuntu 4.1M Sep  7 03:46 tokenizer.model
-rw-rw-r-- 1 ubuntu ubuntu  46K Sep  7 03:46 tokenizer_config.json

Because we exported at the same 16-bit floating-point precision, the GGUF file is almost the same size as the original safetensors files.

## Quantization


https://medium.com/@ingridwickstevens/quantization-of-llms-with-llama-cpp-9bbf59deda35

Loaded as BF16, the GGUF file comes to roughly 15 GB (for Llama 3), which will not fit in our memory, so we quantize it further. (The command and figures below use Meta-Llama-3.1-8B-Instruct.)

$HOME/llama.cpp/llama-quantize \
    $HOME/models/Meta-Llama-3.1-8B-Instruct/Meta-Llama-3.1-8B-Instruct-BF16.gguf \
    $HOME/models/Meta-Llama-3.1-8B-Instruct/Meta-Llama-3.1-8B-Instruct-Q5_K_S.gguf \
    q5_k_s
    
llama_model_quantize_internal: model size  = 15317.02 MB
llama_model_quantize_internal: quant size  =  5332.43 MB

main: quantize time = 414849.30 ms
main:    total time = 414849.30 ms    
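
The size figures above are for Llama 3.1 8B; applying the same step to the gemma model converted earlier would look roughly like this (the output file name is just illustrative):

$HOME/llama.cpp/llama-quantize \
    $HOME/models/gemma-2-2b-it/gemma-2-2B-it-BF16.gguf \
    $HOME/models/gemma-2-2b-it/gemma-2-2B-it-Q5_K_S.gguf \
    q5_k_s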

## Model inference


### llama.cpp CLI (local GGUF)

https://dytis.tistory.com/72

CPU

# llama.cpp.gguf.infer.sh
$HOME/llama.cpp/llama-cli \
        -m $HOME/models/gemma-2-2b-it/gemma-2-2B-it-BF16.gguf \
        -p "Please tell me about Docker in 10 sentences." \
        -n 400 \
        -e \
        --log-disable
        
1. Docker is a software platform that enables developers to package, distribute, and run applications in standardized units called containers.
2. Containers are lightweight, isolated environments that share the host operating system kernel.
3. Docker containers provide a consistent experience across different environments, including development, testing, and production.
4. Developers can use Docker to build, push, and pull Docker images, which are essentially blueprints for creating containers.
5. Docker Hub is a repository for storing and sharing Docker images, making it easy for developers to find and use pre-built images.
6. Docker Swarm is a tool for orchestrating and managing multiple containers.
7. Docker Compose simplifies the definition of multi-container applications by allowing developers to define them in a single file.
8. Docker simplifies application deployment by eliminating the need for complex infrastructure setup and configuration.
9. Docker's popularity has surged due to its portability, scalability, and ease of use.
10. Docker is widely used in various industries, including software development, web development, and cloud computing.

Even on the CPU, inference is reasonably fast.

GPU

# The earlier make build was CPU-only, so remove those binaries first
$ make clean

# Rebuild with CUDA support (this takes a very long time)
$ make GGML_CUDA=1

# Run the inference again
$ sh llama.cpp.gguf.infer.sh

Please tell me about Docker in 10 sentences.

1. Docker is an open-source platform that allows developers to package and run applications in isolated containers.
2. Containers are lightweight, portable, and self-contained units of software that share the host operating system kernel.
3. Docker simplifies the development, deployment, and management of applications by standardizing the environment.
4. Docker containers provide consistent environments, regardless of the underlying infrastructure.
5. The Docker Engine, a crucial component of Docker, manages the creation, execution, and communication between containers.
6. Docker images are blueprints that define the software components, dependencies, and configurations of a container.
7. Docker Hub is a repository where developers can access and share Docker images.
8. Docker Compose allows developers to orchestrate the deployment of multiple applications in containers.
9. Docker offers various tools for monitoring, logging, and managing containerized applications.
10. Docker has become an essential tool for modern software development and deployment due to its ability to simplify and streamline the process.

**Summary:**
Docker simplifies the development, deployment, and management of applications by providing a standardized environment and offering tools for containerization, image management, and orchestration. This platform allows developers to package and run applications in isolated containers, ensuring consistent environments and portability.

# Check which processes hold GPU memory
$ nvidia-smi

Sat Sep  7 11:04:01 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:1E.0 Off |                    0 |
| N/A   25C    P0              25W /  70W |    121MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     35318      C   /home/ubuntu/llama.cpp/llama-cli            114MiB |
+---------------------------------------------------------------------------------------+
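
Note that nvidia-smi reports only about 114 MiB of GPU memory in use: llama-cli only offloads model layers when asked to. If you want more of the model on the GPU, the -ngl (--n-gpu-layers) option controls how many layers are offloaded; a sketch based on the script above:

$HOME/llama.cpp/llama-cli \
        -m $HOME/models/gemma-2-2b-it/gemma-2-2B-it-BF16.gguf \
        -p "Please tell me about Docker in 10 sentences." \
        -n 400 \
        -ngl 99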

### (Optional) Swap memory

When running inference with Llama 3, however, the following error occurs:

ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 17179869216
llama_init_from_gpt_params: error: failed to create context with model '/home/ubuntu/models/Meta-Llama-3.1-8B-Instruct/Meta-Llama-3.1-8B-Instruct-Q5_K_S.gguf'

Even at 5-bit quantization the file is only about 5.3 GB, yet the instance still cannot hold it; the failed 16 GiB allocation above appears to be the KV cache for Llama 3.1's default 131072-token context (a smaller -c would also avoid it). Here we add swap memory instead.

$ sudo fallocate -l 16G /swapfile
$ sudo chmod 600 /swapfile
$ sudo mkswap /swapfile
$ sudo swapon /swapfile
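
The swap file above disappears after a reboot; to keep it, it can optionally be registered in /etc/fstab as well:

$ echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab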

# Memory usage during inference
$ free -h
               total        used        free      shared  buff/cache   available
Mem:            15Gi        12Gi       200Mi       277Mi       3.1Gi       2.6Gi
Swap:           15Gi       5.0Gi        11Gi

It works, but the inference still had not finished by the time I came back from a meal. Llama 3 is more than I can run on this instance, so I go back to gemma.

After one more meal the inference did finish. Still, the answer quality is quite good.

Please tell me about Docker in 10 sentences. I'd like to know the basics, what it does, and how it works.
Here are the basics of Docker in 10 sentences:
Docker is a containerization platform that allows developers to package, ship, and run applications in containers. Containers are lightweight and portable, allowing for consistent and repeatable deployment across different environments. Docker uses a layered file system, allowing for efficient and fast deployment. The Docker engine runs on the host operating system and creates a new layer for each application, allowing for easy management and isolation. Docker images are a read-only template that contains the application code and dependencies. When a new container is created, a new writable layer is added on top of the read-only image. This writable layer is used to store changes made to the application. Docker provides a command-line interface (CLI) for interacting with containers and images. Docker has a large community and a wide range of tools and plugins available for extending its functionality. Overall, Docker provides a simple and efficient way to package and deploy applications, allowing for faster development and deployment cycles.
Let me know if you'd like me to expand on any of these points! I'm happy to help. 

### Additional Information

If you want to dive deeper, here are some additional resources:

*   [Docker Official Documentation](https://docs.docker.com/)
*   [Docker Tutorial by Docker](https://www.docker.com/what-docker)
*   [Docker Containers: A Guide to Containers by Red Hat](https://www.redhat.com/en/topics/containers)

Let me know if you have any specific questions or topics you'd like me to expand on. 

### Related Topics

If you're interested in learning more about containerization and DevOps, here are some related topics:

*   Kubernetes
*   Container orchestration
*   Microservices architecture
*   Continuous Integration and Continuous Deployment (CI/CD)

## Running the Docker containers


### Inference server container

# The image is pulled automatically on first run
$ docker run \
        -p 8080:8080 \
        -v $HOME/models/:/models \
        --gpus all \
        ghcr.io/ggerganov/llama.cpp:server-cuda \
        -m models/gemma-2-2b-it/gemma-2-2B-it-BF16.gguf \
        -c 512 --host 0.0.0.0 --port 8080

# Docker cannot access the GPU yet; the NVIDIA Container Toolkit still needs to be installed.
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

# Add the GPG key and set up the repository
$ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Update the package list
$ sudo apt-get update

# Install the toolkit
$ sudo apt-get install -y nvidia-container-toolkit

# Restart the Docker daemon
$ sudo systemctl restart docker

# Start the container
$ sh container.infer.sh

# Send an inference request
$ curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
    
Docker is an open-source platform that uses containers to package and run applications.\n2. Containers provide a lightweight, portable way to deploy applications.\n3. They isolate the application from the host system, ensuring it runs consistently.\n4. Docker images are immutable, meaning they cannot be changed after creation.\n5. Docker Hub is a central repository for downloading pre-built images and building your own.\n6. Docker compose enables you to define and manage multiple containers in a single file.\n7. Docker Swarm is a tool for orchestrating multiple Docker containers into a cluster.\n8. Docker provides tools
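
Recent llama.cpp server builds also expose an OpenAI-compatible chat endpoint, which can be handy for clients that already speak that API; a sketch (field names follow the OpenAI chat format):

$ curl --request POST \
    --url http://localhost:8080/v1/chat/completions \
    --header "Content-Type: application/json" \
    --data '{"messages": [{"role": "user", "content": "What is Docker?"}], "max_tokens": 128}'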

### Web server container

run_gradio.py

import gradio as gr
import requests

def generate_text(prompt):
    # Forward the prompt to the llama.cpp server running in the other container
    url = "http://{private IPv4}:8080/completion"
    headers = {"Content-Type": "application/json"}
    data = {
        "prompt": prompt,
        "n_predict": 256
    }
    response = requests.post(url, headers=headers, json=data)
    return response.json().get('content')

# Simple text-in / text-out UI
iface = gr.Interface(
    fn=generate_text,
    inputs="text",
    outputs="text",
    title="Model Demo",
    description="Enter a prompt to generate text using the fine-tuned model.",
)

# Listen on all interfaces so the app is reachable from outside the container
iface.launch(server_name="0.0.0.0")

Dockerfile

FROM python:3.9-slim

WORKDIR /usr/src/app
COPY . .

RUN pip install --no-cache-dir gradio requests

EXPOSE 7860
ENV GRADIO_SERVER_NAME="0.0.0.0"

CMD ["python", "run_gradio.py"]

$ docker build -t gradio-app .

$ docker images

REPOSITORY                    TAG           IMAGE ID       CREATED          SIZE
gradio-app                    latest        179287c2056c   10 seconds ago   467MB

$ docker run --rm -d -p 7860:7860 gradio-app
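
To reach the Gradio UI from a browser, the instance's security group needs an inbound rule for port 7860 (open 8080 as well only if the inference API itself should be reachable from outside); the app is then available at http://{public IPv4}:7860. A quick check from the instance itself:

$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:7860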
