This post uses a g4dn.xlarge instance, one of the cheaper GPU-equipped options. Before you can launch it, the vCPU service quota must be raised to 4; follow the post below (approval can take more than a day).
https://velog.io/@ahn_kyuwon/%EA%B8%B0%EC%88%A0-ec2-instance-GPU
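If you prefer the CLI, the same request can be filed through AWS Service Quotas. A minimal sketch, assuming the AWS CLI is configured and that L-DB2E81BA is the quota code for Running On-Demand G and VT instances (verify the code in the Service Quotas console first):
# Request the G/VT-instance vCPU quota be raised to 4
$ aws service-quotas request-service-quota-increase \
    --service-code ec2 \
    --quota-code L-DB2E81BA \
    --desired-value 4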
Once the g4dn.xlarge instance is running, proceed as follows.
# Check the OS
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
# Check storage
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/root 29G 1.6G 27G 6% /
tmpfs 7.8G 0 7.8G 0% /dev/shm
# Check memory (host RAM)
$ free -h
total used free shared buff/cache available
Mem: 15Gi 519Mi 14Gi 2.7Mi 311Mi 14Gi
Swap: 0B 0B 0B
# Update the package list
$ sudo apt-get update
# Install prerequisite packages
$ sudo apt-get install \
apt-transport-https \
ca-certificates \
curl \
gnupg \
lsb-release \
software-properties-common
# Add Docker's official GPG key (create the keyrings directory first, per Docker's install docs)
$ sudo install -m 0755 -d /etc/apt/keyrings
$ sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
$ sudo chmod a+r /etc/apt/keyrings/docker.asc
# Set up the Docker repository
$ echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
# Update the package list again
$ sudo apt-get update
# Install Docker
$ sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
# Add the current user to the docker group
$ sudo usermod -aG docker $USER
# Refresh the docker group in the current session
$ newgrp docker
# Verify
$ docker run hello-world
Hello from Docker!
This message shows that your installation appears to be working correctly.
# Install git
$ sudo apt install git
# Verify
$ git --version
git version 2.43.0
# Install Python 3
$ sudo apt install python3
# Verify
$ python3 --version
Python 3.12.3
# Install pip
$ sudo apt install python3-pip
# Install venv
$ sudo apt install python3-venv
# Create a virtual environment
$ python3 -m venv myenv
$ git clone https://github.com/ggerganov/llama.cpp
$ cd llama.cpp
# Build now; the resulting binaries (llama-cli, llama-quantize) are used later
$ make -j
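A quick sanity check that the binaries used later were actually produced; this assumes the make-based build, which drops them in the repository root:
# llama-cli and llama-quantize should both exist
$ ls -lh llama-cli llama-quantize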
Next, install the NVIDIA driver, following Ubuntu's guide:
https://documentation.ubuntu.com/aws/en/latest/aws-how-to/instances/install-nvidia-drivers/
# Check the GPU before installing
$ lspci | grep -i nvidia
00:1e.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
# Check kernel/OS
$ uname -a && cat /etc/os-release
Linux ip-172-31-33-86 6.8.0-1012-aws #13-Ubuntu SMP Mon Jul 15 13:40:27 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
PRETTY_NAME="Ubuntu 24.04 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
# Install the NVIDIA driver
$ sudo apt install -y ubuntu-drivers-common
$ sudo ubuntu-drivers install
# Reboot
$ sudo reboot
# Verify
$ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:1E.0 Off | 0 |
| N/A 32C P0 26W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
# Install the CUDA toolkit
$ sudo apt install nvidia-cuda-toolkit
# Verify
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0
# Register the Hugging Face token (the variable name must match what download.py reads below)
$ echo 'export HF_TOKEN="{token id}"' >> ~/.bashrc
$ source ~/.bashrc
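To confirm the variable is exported (this prints the raw token, so be mindful of shared screens):
$ echo $HF_TOKEN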
# Activate the virtual environment
$ source myenv/bin/activate
# Install the Hugging Face package
$ pip install huggingface_hub
# Install any additional packages as needed
download.py
import os
from huggingface_hub import snapshot_download

# Token exported in ~/.bashrc above
token = os.getenv("HF_TOKEN")

# Download the model into ./models/gemma-2-2b-it
snapshot_download(
    repo_id="google/gemma-2-2b-it",
    local_dir="models/gemma-2-2b-it",
    token=token,
    local_dir_use_symlinks=False,
    ignore_patterns=["original/*"],
)
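Run the script inside the activated virtual environment:
$ python3 download.py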
Check the directory:
$ ls -alh models/gemma-2-2b-it/
total 4.9G
drwxrwxr-x 3 ubuntu ubuntu 4.0K Sep 7 03:47 .
drwxrwxr-x 3 ubuntu ubuntu 4.0K Sep 7 03:47 ..
drwxrwxr-x 3 ubuntu ubuntu 4.0K Sep 7 03:46 .cache
-rw-rw-r-- 1 ubuntu ubuntu 1.6K Sep 7 03:46 .gitattributes
-rw-rw-r-- 1 ubuntu ubuntu 29K Sep 7 03:46 README.md
-rw-rw-r-- 1 ubuntu ubuntu 838 Sep 7 03:46 config.json
-rw-rw-r-- 1 ubuntu ubuntu 187 Sep 7 03:46 generation_config.json
-rw-rw-r-- 1 ubuntu ubuntu 4.7G Sep 7 03:47 model-00001-of-00002.safetensors
-rw-rw-r-- 1 ubuntu ubuntu 230M Sep 7 03:46 model-00002-of-00002.safetensors
-rw-rw-r-- 1 ubuntu ubuntu 24K Sep 7 03:46 model.safetensors.index.json
-rw-rw-r-- 1 ubuntu ubuntu 636 Sep 7 03:46 special_tokens_map.json
-rw-rw-r-- 1 ubuntu ubuntu 17M Sep 7 03:46 tokenizer.json
-rw-rw-r-- 1 ubuntu ubuntu 4.1M Sep 7 03:46 tokenizer.model
-rw-rw-r-- 1 ubuntu ubuntu 46K Sep 7 03:46 tokenizer_config.json
By this point storage is running low.
The GGUF conversion hasn't even happened yet, and less than half the disk remains.
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/root 29G 17G 12G 59% /
tmpfs 7.8G 0 7.8G 0% /dev/shm
# After resizing the volume in the AWS console
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/root 96G 17G 80G 17% /
tmpfs 7.8G 0 7.8G 0% /dev/shm
Much more breathing room now.
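For reference: if the root filesystem does not grow by itself after the volume resize, extend the partition and filesystem manually. The device names below are assumptions (g4dn instances expose NVMe devices); check yours with lsblk first:
# Identify the root device
$ lsblk
# Grow partition 1, then the ext4 filesystem on it
$ sudo growpart /dev/nvme0n1 1
$ sudo resize2fs /dev/nvme0n1p1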
Gemma 2's tensors are BF16 by default.
The conversion script offers several output types (f32, f16, bf16, q8_0, auto, depending on the llama.cpp version), but for now we keep the same precision; GPU inference comes later.
# Install the packages the conversion script needs
$ pip install numpy torch sentencepiece safetensors
# convert.sh
$ python3 $HOME/llama.cpp/convert_hf_to_gguf.py \
$HOME/models/gemma-2-2b-it \
--outtype bf16
# Check the GGUF file
$ ls -alh ~/models/gemma-2-2b-it/
total 9.8G
drwxrwxr-x 3 ubuntu ubuntu 4.0K Sep 7 09:58 .
drwxrwxr-x 3 ubuntu ubuntu 4.0K Sep 7 03:47 ..
drwxrwxr-x 3 ubuntu ubuntu 4.0K Sep 7 03:46 .cache
-rw-rw-r-- 1 ubuntu ubuntu 1.6K Sep 7 03:46 .gitattributes
-rw-rw-r-- 1 ubuntu ubuntu 29K Sep 7 03:46 README.md
-rw-rw-r-- 1 ubuntu ubuntu 838 Sep 7 03:46 config.json
-rw-rw-r-- 1 ubuntu ubuntu 4.9G Sep 7 09:59 gemma-2-2B-it-BF16.gguf
-rw-rw-r-- 1 ubuntu ubuntu 187 Sep 7 03:46 generation_config.json
-rw-rw-r-- 1 ubuntu ubuntu 4.7G Sep 7 03:47 model-00001-of-00002.safetensors
-rw-rw-r-- 1 ubuntu ubuntu 230M Sep 7 03:46 model-00002-of-00002.safetensors
-rw-rw-r-- 1 ubuntu ubuntu 24K Sep 7 03:46 model.safetensors.index.json
-rw-rw-r-- 1 ubuntu ubuntu 636 Sep 7 03:46 special_tokens_map.json
-rw-rw-r-- 1 ubuntu ubuntu 17M Sep 7 03:46 tokenizer.json
-rw-rw-r-- 1 ubuntu ubuntu 4.1M Sep 7 03:46 tokenizer.model
-rw-rw-r-- 1 ubuntu ubuntu 46K Sep 7 03:46 tokenizer_config.json
Because it was written with the same 16-bit floating-point type, it is almost the same size as the original safetensors files.
Reference on quantization with llama.cpp:
https://medium.com/@ingridwickstevens/quantization-of-llms-with-llama-cpp-9bbf59deda35
A BF16 GGUF of Llama 3.1 8B is 15 GB, too big to load into our memory, so we quantize it again (Llama 3 shown below).
$ $HOME/llama.cpp/llama-quantize \
$HOME/models/Meta-Llama-3.1-8B-Instruct/Meta-Llama-3.1-8B-Instruct-BF16.gguf \
$HOME/models/Meta-Llama-3.1-8B-Instruct/Meta-Llama-3.1-8B-Instruct-Q5_K_S.gguf \
q5_k_s
llama_model_quantize_internal: model size = 15317.02 MB
llama_model_quantize_internal: quant size = 5332.43 MB
main: quantize time = 414849.30 ms
main: total time = 414849.30 ms
CPU
# llama.cpp.gguf.infer.sh
$HOME/llama.cpp/llama-cli \
-m $HOME/models/gemma-2-2b-it/gemma-2-2B-it-BF16.gguf \
-p "Please tell me about Docker in 10 sentences." \
-n 400 \
-e \
--log-disable
1. Docker is a software platform that enables developers to package, distribute, and run applications in standardized units called containers.
2. Containers are lightweight, isolated environments that share the host operating system kernel.
3. Docker containers provide a consistent experience across different environments, including development, testing, and production.
4. Developers can use Docker to build, push, and pull Docker images, which are essentially blueprints for creating containers.
5. Docker Hub is a repository for storing and sharing Docker images, making it easy for developers to find and use pre-built images.
6. Docker Swarm is a tool for orchestrating and managing multiple containers.
7. Docker Compose simplifies the definition of multi-container applications by allowing developers to define them in a single file.
8. Docker simplifies application deployment by eliminating the need for complex infrastructure setup and configuration.
9. Docker's popularity has surged due to its portability, scalability, and ease of use.
10. Docker is widely used in various industries, including software development, web development, and cloud computing.
Inference is quite fast even on the CPU.
GPU
# We built with plain make earlier, so remove the CPU-only build
$ make clean
# Rebuild with CUDA enabled (takes absolutely forever)
$ make GGML_CUDA=1
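If the rebuild drags on, it accepts parallel jobs just like the first build did; a minimal tweak:
# Same target, parallelized across all cores
$ make GGML_CUDA=1 -j$(nproc)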
# Run inference again
$ sh llama.cpp.gguf.infer.sh
Please tell me about Docker in 10 sentences.
1. Docker is an open-source platform that allows developers to package and run applications in isolated containers.
2. Containers are lightweight, portable, and self-contained units of software that share the host operating system kernel.
3. Docker simplifies the development, deployment, and management of applications by standardizing the environment.
4. Docker containers provide consistent environments, regardless of the underlying infrastructure.
5. The Docker Engine, a crucial component of Docker, manages the creation, execution, and communication between containers.
6. Docker images are blueprints that define the software components, dependencies, and configurations of a container.
7. Docker Hub is a repository where developers can access and share Docker images.
8. Docker Compose allows developers to orchestrate the deployment of multiple applications in containers.
9. Docker offers various tools for monitoring, logging, and managing containerized applications.
10. Docker has become an essential tool for modern software development and deployment due to its ability to simplify and streamline the process.
**Summary:**
Docker simplifies the development, deployment, and management of applications by providing a standardized environment and offering tools for containerization, image management, and orchestration. This platform allows developers to package and run applications in isolated containers, ensuring consistent environments and portability.
# Check the process using the GPU
$ nvidia-smi
Sat Sep 7 11:04:01 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:1E.0 Off | 0 |
| N/A 25C P0 25W / 70W | 121MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 35318 C /home/ubuntu/llama.cpp/llama-cli 114MiB |
+---------------------------------------------------------------------------------------+
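Note the GPU is holding only ~114 MiB: llama-cli does not offload model layers unless asked. A sketch of the same run with offloading enabled (-ngl / --n-gpu-layers is a standard llama.cpp flag; 99 just means "as many layers as fit"):
$ $HOME/llama.cpp/llama-cli \
    -m $HOME/models/gemma-2-2b-it/gemma-2-2B-it-BF16.gguf \
    -p "Please tell me about Docker in 10 sentences." \
    -n 400 -e --log-disable \
    -ngl 99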
With Llama 3, however, inference fails like this:
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 17179869216
llama_init_from_gpt_params: error: failed to create context with model '/home/ubuntu/models/Meta-Llama-3.1-8B-Instruct/Meta-Llama-3.1-8B-Instruct-Q5_K_S.gguf'
Even at roughly 5.3 GB after 5-bit quantization, the instance can't take it: the failed allocation above is about 16 GiB, most likely the context (KV cache) buffer at Llama 3.1's long default context (reducing -c would also help). We add swap memory.
# Create and enable a 16 GB swap file
$ sudo fallocate -l 16G /swapfile
$ sudo chmod 600 /swapfile
$ sudo mkswap /swapfile
$ sudo swapon /swapfile
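To keep the swap across reboots, register it in /etc/fstab as well:
$ echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab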
# Check memory during inference
$ free -h
total used free shared buff/cache available
Mem: 15Gi 12Gi 200Mi 277Mi 3.1Gi 2.6Gi
Swap: 15Gi 5.0Gi 11Gi
It runs, but the inference still isn't done when I get back from a meal. Llama 3 is more than I can operate on this instance; back to Gemma.
After one more meal the inference finished. The answer quality is really good, though.
Please tell me about Docker in 10 sentences. I'd like to know the basics, what it does, and how it works.
Here are the basics of Docker in 10 sentences:
Docker is a containerization platform that allows developers to package, ship, and run applications in containers. Containers are lightweight and portable, allowing for consistent and repeatable deployment across different environments. Docker uses a layered file system, allowing for efficient and fast deployment. The Docker engine runs on the host operating system and creates a new layer for each application, allowing for easy management and isolation. Docker images are a read-only template that contains the application code and dependencies. When a new container is created, a new writable layer is added on top of the read-only image. This writable layer is used to store changes made to the application. Docker provides a command-line interface (CLI) for interacting with containers and images. Docker has a large community and a wide range of tools and plugins available for extending its functionality. Overall, Docker provides a simple and efficient way to package and deploy applications, allowing for faster development and deployment cycles.
Let me know if you'd like me to expand on any of these points! I'm happy to help.
### Additional Information
If you want to dive deeper, here are some additional resources:
* [Docker Official Documentation](https://docs.docker.com/)
* [Docker Tutorial by Docker](https://www.docker.com/what-docker)
* [Docker Containers: A Guide to Containers by Red Hat](https://www.redhat.com/en/topics/containers)
Let me know if you have any specific questions or topics you'd like me to expand on.
### Related Topics
If you're interested in learning more about containerization and DevOps, here are some related topics:
* Kubernetes
* Container orchestration
* Microservices architecture
* Continuous Integration and Continuous Deployment (CI/CD)
# container.infer.sh (the image is pulled automatically on first run)
$ docker run \
-p 8080:8080 \
-v $HOME/models/:/models \
--gpus all \
ghcr.io/ggerganov/llama.cpp:server-cuda \
-m models/gemma-2-2b-it/gemma-2-2B-it-BF16.gguf \
-c 512 --host 0.0.0.0 --port 8080
# At first glance this looks like a driver/Docker compatibility problem, but the actual cause is the missing NVIDIA Container Toolkit.
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
# Add the GPG key and configure the repository
$ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Update the package list
$ sudo apt-get update
# Install the toolkit
$ sudo apt-get install -y nvidia-container-toolkit
# Configure Docker to use the NVIDIA runtime (per the guide above)
$ sudo nvidia-ctk runtime configure --runtime=docker
# Restart the Docker daemon
$ sudo systemctl restart docker
# Launch
$ sh container.infer.sh
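Before sending prompts, the server can be probed; llama.cpp server builds from around this time expose a /health endpoint:
# Should return a small JSON status once the model is loaded
$ curl http://localhost:8080/health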
# Run an inference
$ curl --request POST \
--url http://localhost:8080/completion \
--header "Content-Type: application/json" \
--data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
Docker is an open-source platform that uses containers to package and run applications.\n2. Containers provide a lightweight, portable way to deploy applications.\n3. They isolate the application from the host system, ensuring it runs consistently.\n4. Docker images are immutable, meaning they cannot be changed after creation.\n5. Docker Hub is a central repository for downloading pre-built images and building your own.\n6. Docker compose enables you to define and manage multiple containers in a single file.\n7. Docker Swarm is a tool for orchestrating multiple Docker containers into a cluster.\n8. Docker provides tools
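The response is JSON with the generated text in its content field (the same field run_gradio.py reads below). To pull out just the text, assuming jq is installed (sudo apt install jq):
$ curl -s --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}' \
    | jq -r '.content'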
run_gradio.py
import gradio as gr
import requests

# Call the llama.cpp server's /completion endpoint
def generate_text(prompt):
    url = "http://{private IPv4}:8080/completion"
    headers = {"Content-Type": "application/json"}
    data = {
        "prompt": prompt,
        "n_predict": 256
    }
    response = requests.post(url, headers=headers, json=data)
    return response.json().get('content')

iface = gr.Interface(
    fn=generate_text,
    inputs="text",
    outputs="text",
    title="Model Demo",
    description="Enter a prompt to generate text using the fine-tuned model.",
)
iface.launch(server_name="0.0.0.0")
Dockerfile
FROM python:3.9-slim
WORKDIR /usr/src/app
COPY . .
RUN pip install --no-cache-dir gradio requests
EXPOSE 7860
ENV GRADIO_SERVER_NAME="0.0.0.0"
CMD ["python", "run_gradio.py"]
$ docker build -t gradio-app .
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
gradio-app latest 179287c2056c 10 seconds ago 467MB
$ docker run --rm -d -p 7860:7860 gradio-app
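To confirm the demo is up, check the container and then browse to port 7860 on the instance's public IP (the port must also be open in the security group):
# The gradio-app container should be listed as running
$ docker ps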