[MLOps] MinIO - Part2

RCC.AI · December 13, 2024


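Continuing from the first MinIO post, let's keep exploring MinIO with the help of ChatGPT. The rest of this post is written as a back-and-forth chat with ChatGPT.

Me : Show me some examples of computing statistics on MinIO data.

ChatGPT : To manage or analyze data stored in MinIO, you can generate statistics with Python and the boto3 library. The examples below show how to compute statistics for the objects in a bucket.

Prerequisites

1. Install the Python library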

pip install boto3

2. Initialize the MinIO client

import boto3

# MinIO server connection settings
MINIO_URL = "http://127.0.0.1:9000"  # MinIO server address
ACCESS_KEY = "your-access-key"
SECRET_KEY = "your-secret-key"

# Initialize the MinIO client
s3_client = boto3.client(
    "s3",
    endpoint_url=MINIO_URL,
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY
)

Example 1: Counting files and total size in a bucket

def calculate_bucket_stats(bucket_name):
    total_size = 0
    total_files = 0

    try:
        # list_objects_v2 returns at most 1,000 keys per call,
        # so use a paginator to walk every object in the bucket
        paginator = s3_client.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket_name):
            for obj in page.get("Contents", []):
                total_files += 1
                total_size += obj["Size"]

        print(f"Bucket '{bucket_name}' Statistics:")
        print(f"- Total Files: {total_files}")
        print(f"- Total Size: {total_size / (1024**2):.2f} MB")
    except Exception as e:
        print(f"Error calculating stats: {e}")

# Usage example
calculate_bucket_stats("mybucket")
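One caveat worth knowing when listing a bucket: `list_objects_v2` returns at most 1,000 keys per response. For larger buckets the response is truncated and the listing has to be continued with a continuation token (boto3's built-in paginator does this for you). The following is a minimal sketch of that loop. `FakeS3Client` is a hypothetical stand-in that serves tiny pages so the sketch runs without a MinIO server; with a real client you would pass `s3_client` instead.

```python
def iter_objects(client, bucket_name):
    """Yield every object dict in a bucket, following continuation tokens."""
    kwargs = {"Bucket": bucket_name}
    while True:
        response = client.list_objects_v2(**kwargs)
        yield from response.get("Contents", [])
        if not response.get("IsTruncated"):
            break
        kwargs["ContinuationToken"] = response["NextContinuationToken"]


class FakeS3Client:
    """Hypothetical stub (not part of boto3) that serves objects in small pages."""

    def __init__(self, objects, page_size=2):
        self.objects = objects
        self.page_size = page_size

    def list_objects_v2(self, Bucket, ContinuationToken=None):
        start = int(ContinuationToken or 0)
        end = start + self.page_size
        truncated = end < len(self.objects)
        response = {"Contents": self.objects[start:end], "IsTruncated": truncated}
        if truncated:
            response["NextContinuationToken"] = str(end)
        return response


fake = FakeS3Client([{"Key": f"file{i}.jpg", "Size": 100 * i} for i in range(5)])
objects = list(iter_objects(fake, "mybucket"))
print(len(objects), sum(obj["Size"] for obj in objects))
```

Because `iter_objects` is a generator, statistics can be accumulated while pages are still being fetched, without holding the whole listing in memory.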

Example 2: Counting files and total size for a specific extension

def calculate_file_type_stats(bucket_name, file_extension):
    total_size = 0
    total_files = 0

    try:
        # Walk every page of results (at most 1,000 keys per call)
        paginator = s3_client.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket_name):
            for obj in page.get("Contents", []):
                # Note: endswith is case-sensitive, so ".jpg" does not match ".JPG"
                if obj["Key"].endswith(file_extension):
                    total_files += 1
                    total_size += obj["Size"]

        print(f"Bucket '{bucket_name}' {file_extension} File Statistics:")
        print(f"- Total {file_extension} Files: {total_files}")
        print(f"- Total Size: {total_size / (1024**2):.2f} MB")
    except Exception as e:
        print(f"Error calculating stats: {e}")

# Usage example
calculate_file_type_stats("mybucket", ".jpg")
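If you want the breakdown for every extension at once instead of one extension per call, the same idea can be sketched as a grouping step. This is an illustrative helper, not part of boto3: `stats_by_extension` is a hypothetical name, and it assumes it receives object dicts shaped like the `Contents` entries returned by `list_objects_v2` (with `Key` and `Size`). Extensions are lowercased here, which is a design choice that differs from the case-sensitive `endswith` check.

```python
import os
from collections import defaultdict


def stats_by_extension(objects):
    """Group object dicts (with "Key" and "Size") by lowercase file extension."""
    stats = defaultdict(lambda: {"count": 0, "bytes": 0})
    for obj in objects:
        ext = os.path.splitext(obj["Key"])[1].lower() or "(none)"
        stats[ext]["count"] += 1
        stats[ext]["bytes"] += obj["Size"]
    return dict(stats)


# Sample data standing in for a real bucket listing
sample = [
    {"Key": "images/a.jpg", "Size": 2048},
    {"Key": "images/b.JPG", "Size": 1024},
    {"Key": "labels/a.txt", "Size": 64},
]
extension_stats = stats_by_extension(sample)
for ext, entry in sorted(extension_stats.items()):
    print(f"{ext}: {entry['count']} files, {entry['bytes']} bytes")
```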

Example 3: Finding the most recently modified file

def find_latest_file(bucket_name):
    latest_file = None
    latest_date = None

    try:
        # Walk every page of results (at most 1,000 keys per call)
        paginator = s3_client.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket_name):
            for obj in page.get("Contents", []):
                if latest_date is None or obj["LastModified"] > latest_date:
                    latest_date = obj["LastModified"]
                    latest_file = obj["Key"]

        print(f"Latest File in Bucket '{bucket_name}':")
        print(f"- File Name: {latest_file}")
        print(f"- Last Modified: {latest_date}")
    except Exception as e:
        print(f"Error finding latest file: {e}")

# Usage example
find_latest_file("mybucket")

Example 4: File size distribution

def calculate_file_size_distribution(bucket_name):
    size_distribution = {"small": 0, "medium": 0, "large": 0}

    try:
        # Walk every page of results (at most 1,000 keys per call)
        paginator = s3_client.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket_name):
            for obj in page.get("Contents", []):
                size = obj["Size"]
                if size < 1024 * 1024:  # under 1 MB
                    size_distribution["small"] += 1
                elif size < 10 * 1024 * 1024:  # 1 MB up to (not including) 10 MB
                    size_distribution["medium"] += 1
                else:  # 10 MB or more
                    size_distribution["large"] += 1

        print(f"File Size Distribution in Bucket '{bucket_name}':")
        print(f"- Small (<1MB): {size_distribution['small']}")
        print(f"- Medium (1MB-10MB): {size_distribution['medium']}")
        print(f"- Large (>=10MB): {size_distribution['large']}")
    except Exception as e:
        print(f"Error calculating file size distribution: {e}")

# Usage example
calculate_file_size_distribution("mybucket")
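The 1 MB and 10 MB cut-offs are hard-coded above. If you expect to tune them, one option is to look up the category with `bisect`. The sketch below is an illustration under that assumption: `size_histogram`, `boundaries`, and `labels` are hypothetical names, and it takes a plain list of sizes in bytes so it runs without a MinIO server.

```python
from bisect import bisect_right

MB = 1024 * 1024


def size_histogram(sizes, boundaries=(1 * MB, 10 * MB), labels=("small", "medium", "large")):
    """Count sizes per category; each boundary is the inclusive lower bound of the next category."""
    counts = {label: 0 for label in labels}
    for size in sizes:
        # bisect_right returns how many boundaries are <= size, which is the category index
        counts[labels[bisect_right(boundaries, size)]] += 1
    return counts


histogram = size_histogram([500, 1 * MB, 5 * MB, 10 * MB, 50 * MB])
print(histogram)
```

A file of exactly 1 MB lands in "medium" and one of exactly 10 MB lands in "large", matching the `<` comparisons in the example above.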

Example 5: Visualizing the statistics (optional)

You can also visualize the statistics with Python's matplotlib.

Install

pip install matplotlib

Example code

import matplotlib.pyplot as plt

def visualize_file_size_distribution(bucket_name):
    size_distribution = {"small": 0, "medium": 0, "large": 0}

    try:
        # Walk every page of results (at most 1,000 keys per call)
        paginator = s3_client.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket_name):
            for obj in page.get("Contents", []):
                size = obj["Size"]
                if size < 1024 * 1024:
                    size_distribution["small"] += 1
                elif size < 10 * 1024 * 1024:
                    size_distribution["medium"] += 1
                else:
                    size_distribution["large"] += 1

        # Plot the distribution as a bar chart
        labels = list(size_distribution.keys())
        sizes = list(size_distribution.values())
        plt.bar(labels, sizes)
        plt.title(f"File Size Distribution in Bucket '{bucket_name}'")
        plt.xlabel("Size Category")
        plt.ylabel("Number of Files")
        plt.show()
    except Exception as e:
        print(f"Error visualizing file size distribution: {e}")

# Usage example
visualize_file_size_distribution("mybucket")

These examples are useful for analyzing MinIO data and generating statistics, and you can extend them as needed. 😊

Me : What should I do if I want to add or delete data in the local path?

ChatGPT : MinIO organizes data into buckets and objects. Rather than adding or deleting files directly in the local data path (such as /data), it is safer, and recommended, to go through the MinIO API or the MinIO client (mc).
Modifying the local directory directly can damage data integrity, so always perform these operations through MinIO.

Recommended ways to add/delete data in MinIO

1. Using the MinIO client (`mc`)

1.1. Set up the MinIO client

Install:

wget https://dl.min.io/client/mc/release/linux-amd64/mc
chmod +x mc
sudo mv mc /usr/local/bin/

Connect to the MinIO server:

mc alias set myminio http://127.0.0.1:9000 your-access-key your-secret-key

1.2. Add data

Upload a file to MinIO:

mc cp /path/to/local-file.jpg myminio/mybucket/

1.3. Delete data

Delete a file from MinIO:

mc rm myminio/mybucket/file.jpg

1.4. Delete a bucket

The --force flag removes the bucket even if it still contains objects:

mc rb myminio/mybucket --force

2. Using Python and boto3

2.1. Add a file

Upload a file with Python:

import boto3

# MinIO connection settings
s3_client = boto3.client(
    's3',
    endpoint_url='http://127.0.0.1:9000',
    aws_access_key_id='your-access-key',
    aws_secret_access_key='your-secret-key'
)

# Upload a file
bucket_name = "mybucket"
file_path = "/path/to/local-file.jpg"
object_name = "uploaded-file.jpg"

s3_client.upload_file(file_path, bucket_name, object_name)
print(f"{object_name} uploaded successfully.")
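To upload a whole local folder rather than a single file, you first need to decide which object key each file gets. The sketch below only builds that plan, so it runs without a MinIO server. `plan_uploads` and the `dataset/` prefix are hypothetical names chosen for illustration.

```python
import tempfile
from pathlib import Path


def plan_uploads(local_dir, prefix=""):
    """Return (local_path, object_key) pairs for every file under local_dir."""
    root = Path(local_dir)
    plan = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Object keys use forward slashes regardless of the local OS
            key = prefix + path.relative_to(root).as_posix()
            plan.append((str(path), key))
    return plan


# Demo on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "images").mkdir()
    (Path(tmp) / "images" / "image1.jpg").write_bytes(b"fake image bytes")
    (Path(tmp) / "labels").mkdir()
    (Path(tmp) / "labels" / "image1.txt").write_text("0 0.5 0.5 0.2 0.3\n")
    upload_plan = plan_uploads(tmp, prefix="dataset/")
    keys = [key for _, key in upload_plan]
    print(keys)
    # With a live client, each pair could be passed to upload_file:
    # for path, key in upload_plan:
    #     s3_client.upload_file(path, "mybucket", key)
```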

2.2. Delete a file

# Delete the file
s3_client.delete_object(Bucket=bucket_name, Key=object_name)
print(f"{object_name} deleted successfully.")
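When many objects need to go at once, calling `delete_object` per key is slow. boto3 also offers `delete_objects`, which accepts up to 1,000 keys per request. The sketch below only prepares the request payloads, so it runs offline. `build_delete_batches` is a hypothetical helper name, and the key list is made up for the demo.

```python
def build_delete_batches(keys, batch_size=1000):
    """Split keys into delete_objects payloads of at most batch_size keys each."""
    keys = list(keys)
    return [
        {"Objects": [{"Key": key} for key in keys[i:i + batch_size]], "Quiet": True}
        for i in range(0, len(keys), batch_size)
    ]


batches = build_delete_batches([f"images/image{i}.jpg" for i in range(2500)])
print([len(batch["Objects"]) for batch in batches])
# With a live client, each payload could be sent like this:
# for batch in batches:
#     s3_client.delete_objects(Bucket="mybucket", Delete=batch)
```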

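Adding/deleting data directly in the local path (not recommended)

Adding or deleting data directly under the local path (e.g. /data) is not recommended. If MinIO is not running, or you only need this for testing, the operations would look like this:

1. Add a file

cp /path/to/local-file.jpg /data/mybucket/

2. Delete a file

rm /data/mybucket/file.jpg

Caution: MinIO manages its own metadata. Recent MinIO releases store each object together with an xl.meta metadata file, so a file copied straight into the data directory will typically not show up as an object at all. Editing the local directory directly puts it out of sync with what MinIO tracks, which can corrupt the bucket state or cause abnormal behavior.

Summary

To add or delete data, use the MinIO API or the mc client.
Working directly in the local directory (/data) should be limited to test environments and is not recommended in production.

Using MinIO's own management tools lets you work safely while preserving data integrity. 😊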

Me : A YOLO dataset is split into images and annotation files. What is a good way to find, fix, or delete bad data in that case?

ChatGPT : When images and annotation files (e.g. bounding box information in .txt files) are stored separately, as in a YOLO dataset, you can use Python to find, fix, or delete bad data in MinIO. The approach below assumes the usual YOLO directory layout.

1. YOLO dataset structure

A YOLO dataset typically has the following structure:

dataset/
├── images/
│   ├── image1.jpg
│   ├── image2.jpg
│   └── ...
├── labels/
│   ├── image1.txt
│   ├── image2.txt
│   └── ...

images/: directory holding the image files
labels/: annotation files for the images (bounding boxes, class information, etc.)
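Because each image and its annotation share a file name and differ only in folder and extension, the label key can be derived from the image key. A minimal sketch of that mapping follows; `label_key_for` is a hypothetical helper, and the `images/` and `labels/` prefixes are assumed to match the layout above.

```python
import posixpath


def label_key_for(image_key, image_folder="images/", label_folder="labels/"):
    """Map an image object key to the key of its expected YOLO label file."""
    if not image_key.startswith(image_folder):
        raise ValueError(f"{image_key!r} is not under {image_folder!r}")
    # Drop the folder prefix, then swap the extension for .txt
    stem, _ext = posixpath.splitext(image_key[len(image_folder):])
    return f"{label_folder}{stem}.txt"


print(label_key_for("images/image1.jpg"))
print(label_key_for("images/train/cat.01.png"))
```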

2. Finding, fixing, and deleting bad data

2.1. Checking the data

Goal: verify that every image has a matching annotation file, and vice versa
Example code:

import boto3

# MinIO settings
s3_client = boto3.client(
    "s3",
    endpoint_url="http://127.0.0.1:9000",
    aws_access_key_id="your-access-key",
    aws_secret_access_key="your-secret-key"
)

def check_yolo_dataset(bucket_name, image_folder="images/", label_folder="labels/"):
    image_files = set()
    label_files = set()

    # Walk every page of results (list_objects_v2 returns at most 1,000 keys per call)
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name):
        # Sort objects into images and labels
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.startswith(image_folder):
                image_files.add(key[len(image_folder):].rsplit(".", 1)[0])  # name without folder or extension
            elif key.startswith(label_folder):
                label_files.add(key[len(label_folder):].rsplit(".", 1)[0])  # name without folder or extension

    if not image_files and not label_files:
        print("No image or label objects found in bucket.")
        return

    # Find files that have no counterpart
    unmatched_images = image_files - label_files
    unmatched_labels = label_files - image_files

    if unmatched_images:
        print("Images without labels:", unmatched_images)
    if unmatched_labels:
        print("Labels without images:", unmatched_labels)

# Usage example
check_yolo_dataset("mybucket")
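Matching file names only tells you that a label file exists, not that its contents are sane. Each line of a YOLO label is expected to hold five fields (class_id, x_center, y_center, width, height), and the check below is a minimal sketch of validating one line. It assumes coordinates are normalized to the range 0 to 1, which is the common YOLO convention. `validate_yolo_line` is a hypothetical helper name.

```python
def validate_yolo_line(line):
    """Return a list of problems found in one YOLO annotation line (empty if valid)."""
    parts = line.split()
    if len(parts) != 5:
        return [f"expected 5 fields, got {len(parts)}"]
    problems = []
    if not parts[0].isdigit():
        problems.append(f"class_id {parts[0]!r} is not a non-negative integer")
    for name, raw in zip(("x_center", "y_center", "width", "height"), parts[1:]):
        try:
            value = float(raw)
        except ValueError:
            problems.append(f"{name} {raw!r} is not a number")
            continue
        # Assumption: coordinates are normalized to [0, 1]
        if not 0.0 <= value <= 1.0:
            problems.append(f"{name}={value} is outside [0, 1]")
    return problems


for sample_line in ["0 0.5 0.5 0.2 0.3", "0 0.5 0.5 0.2", "1 1.5 0.5 0.2 0.3"]:
    print(repr(sample_line), "->", validate_yolo_line(sample_line) or "ok")
```

Running this over the text of every label object (read with `get_object`) gives a list of files that need a closer look.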

2.2. Modifying data

To modify data, read the annotation file with Python and write the corrected content back.

Each line of a YOLO .txt file has the following format:

<class_id> <x_center> <y_center> <width> <height>

Modification example (note that this overwrites the entire file with new_content):

def update_annotation(bucket_name, label_file, new_content):
    try:
        # Read the existing file
        response = s3_client.get_object(Bucket=bucket_name, Key=label_file)
        old_content = response["Body"].read().decode("utf-8")
        print("Old Content:", old_content)

        # Upload the new content under the same key
        s3_client.put_object(
            Bucket=bucket_name,
            Key=label_file,
            Body=new_content.encode("utf-8")
        )
        print(f"Updated annotation file: {label_file}")
    except Exception as e:
        print(f"Error updating annotation: {e}")

# Usage example
update_annotation("mybucket", "labels/image1.txt", "0 0.5 0.5 0.2 0.3")
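Often the fix is not a full rewrite but a small change to the existing lines, for example correcting a wrong class ID. The sketch below shows one way to produce the new content from the old content. `remap_class_ids` and the example mapping are hypothetical; the result could then be passed as `new_content` to `update_annotation`.

```python
def remap_class_ids(annotation_text, mapping):
    """Rewrite the class_id of every annotation line according to mapping."""
    new_lines = []
    for line in annotation_text.splitlines():
        parts = line.split()
        if not parts:
            continue  # drop blank lines
        class_id = int(parts[0])
        # IDs missing from the mapping are left unchanged
        parts[0] = str(mapping.get(class_id, class_id))
        new_lines.append(" ".join(parts))
    return "".join(line + "\n" for line in new_lines)


old_text = "0 0.5 0.5 0.2 0.3\n\n2 0.1 0.1 0.05 0.05\n"
new_text = remap_class_ids(old_text, {2: 1})
print(new_text, end="")
```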

2.3. Deleting data

To delete bad data (an image or an annotation), use the following code. Delete the image and its annotation together so that no orphaned file is left behind.

def delete_file(bucket_name, file_key):
    try:
        s3_client.delete_object(Bucket=bucket_name, Key=file_key)
        print(f"Deleted: {file_key}")
    except Exception as e:
        print(f"Error deleting file: {e}")

# Usage example
delete_file("mybucket", "images/image1.jpg")
delete_file("mybucket", "labels/image1.txt")

2.4. Visualizing and inspecting data

If an annotation file is wrong or its bounding box information is off, you can draw the boxes on the image to check them visually.

Visualization code:

import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image
import io

def visualize_annotation(bucket_name, image_file, label_file):
    try:
        # Download the image
        response = s3_client.get_object(Bucket=bucket_name, Key=image_file)
        image_data = response["Body"].read()
        image = Image.open(io.BytesIO(image_data))

        # Download the annotation
        response = s3_client.get_object(Bucket=bucket_name, Key=label_file)
        annotations = response["Body"].read().decode("utf-8").splitlines()

        # Show the image
        fig, ax = plt.subplots(1)
        ax.imshow(image)

        # Draw each annotation
        for ann in annotations:
            if not ann.strip():
                continue  # skip blank lines
            class_id, x_center, y_center, width, height = map(float, ann.split())
            # YOLO stores normalized center coordinates; convert to the top-left corner
            x_min = x_center - width / 2
            y_min = y_center - height / 2

            # Draw the bounding box in pixel coordinates
            rect = patches.Rectangle(
                (x_min * image.width, y_min * image.height),
                width * image.width,
                height * image.height,
                linewidth=2,
                edgecolor="r",
                facecolor="none"
            )
            ax.add_patch(rect)

        plt.show()
    except Exception as e:
        print(f"Error visualizing annotation: {e}")

# Usage example
visualize_annotation("mybucket", "images/image1.jpg", "labels/image1.txt")
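The coordinate conversion inside the drawing loop can also be pulled out into a small function, which makes it possible to flag boxes that spill outside the image without opening a plot. This is a sketch under the same assumption as above (normalized center-based YOLO coordinates). `yolo_to_pixel_box`, `is_inside_image`, and the tolerance value are hypothetical.

```python
def yolo_to_pixel_box(x_center, y_center, width, height, image_width, image_height):
    """Convert a normalized YOLO box to (x_min, y_min, box_width, box_height) in pixels."""
    x_min = (x_center - width / 2) * image_width
    y_min = (y_center - height / 2) * image_height
    return x_min, y_min, width * image_width, height * image_height


def is_inside_image(box, image_width, image_height, tolerance=1e-6):
    """Check that a pixel box lies within the image, allowing for float rounding."""
    x_min, y_min, box_width, box_height = box
    return (
        x_min >= -tolerance
        and y_min >= -tolerance
        and x_min + box_width <= image_width + tolerance
        and y_min + box_height <= image_height + tolerance
    )


good_box = yolo_to_pixel_box(0.5, 0.5, 0.2, 0.3, 640, 480)
bad_box = yolo_to_pixel_box(0.95, 0.5, 0.2, 0.3, 640, 480)
print([round(v, 3) for v in good_box], is_inside_image(good_box, 640, 480))
print([round(v, 3) for v in bad_box], is_inside_image(bad_box, 640, 480))
```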

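3. Recommended workflow for managing a YOLO dataset

Data validation:
- Check that every image has a matching annotation file
- Find and fix incorrect bounding boxes
Modification and deletion:
- Use Python scripts to correct bad data
- Delete unneeded images or annotation files
Visual confirmation:
- Use a visualization tool to confirm that the corrections were applied correctly

With this process you can manage bad data in a YOLO dataset effectively. 😊

Me : What does the MinIO client (mc) do?

ChatGPT : The MinIO client (mc) is a CLI tool for managing MinIO servers and other Amazon S3-compatible object storage. It is useful for managing data held in MinIO and for automating operational tasks, and because it speaks the S3 API it also works with other storage services.

Main roles of the MinIO client (mc)

1. Storage management

Creating and deleting buckets
You can create or delete buckets, the containers that hold data in MinIO or S3.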
```
mc alias set myminio http://127.0.0.1:9000 access-key secret-key
mc mb myminio/mybucket
mc rb myminio/mybucket --force
```

2. Uploading and downloading files

You can upload files to the MinIO server or download them from it.

mc cp /local/path/to/file.txt myminio/mybucket
mc cp myminio/mybucket/file.txt /local/path/to/

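3. Copying and synchronizing data

You can copy or synchronize data between storage systems, or between the local system and storage.

# Copy a file between buckets
mc cp myminio/source-bucket/file.txt myminio/destination-bucket/

# Synchronize a local folder to a bucket
mc mirror /local/folder myminio/mybucket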

4. Deleting objects

You can delete a specific file (object) or clear out all the data in a bucket.
```
mc rm myminio/mybucket/file.txt
mc rm --recursive --force myminio/mybucket/
```

5. Access control

Set access permissions on buckets and objects.

# Make the bucket public
mc anonymous set public myminio/mybucket

# Make the bucket private
mc anonymous set private myminio/mybucket

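6. Generating URLs

You can generate a presigned URL to share a specific file (object).
`mc share download` creates a time-limited download URL, and `mc share upload` creates a presigned upload for the given object path.
```
mc share download myminio/mybucket/file.txt
mc share upload myminio/mybucket/file.txt
```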

7. Status and statistics

Check the status and usage of the storage.
```
mc admin info myminio
mc du myminio/mybucket
```

8. Server management

You can inspect and adjust the MinIO server's configuration.

mc admin config get myminio logger_webhook
mc admin config set myminio logger_webhook endpoint=https://example.com/log

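Advantages of mc

Multi-platform support: available on Linux, macOS, and Windows.
S3 compatibility: works with Amazon S3 and any storage that supports the S3 API.
Simple usage: CLI-based, with an intuitive command structure.
Automation: data operations and server management can be automated with scripts.
Security: supports HTTPS and AWS-style access keys and secret keys for secure data management.

Installation and basic setup

1. Install

wget https://dl.min.io/client/mc/release/linux-amd64/mc
chmod +x mc
sudo mv mc /usr/local/bin/

2. Connect to the MinIO server

Connect to the MinIO server (set an alias):

mc alias set myminio http://127.0.0.1:9000 access-key secret-key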

Example workflow with mc

Create a bucket
```
mc mb myminio/mybucket
```

Upload a file

mc cp /path/to/local/file.jpg myminio/mybucket/

Check the data
```
mc ls myminio/mybucket
```
Generate a share URL
```
mc share download myminio/mybucket/file.jpg
```
Delete the file
```
mc rm myminio/mybucket/file.jpg
```
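Since mc is a plain CLI, a workflow like this can be scripted. The Python sketch below only builds and prints the command lines (a dry run), so it runs without mc installed. `mc_workflow_commands` is a hypothetical helper, and the alias, bucket, and path are the placeholder values used throughout this post.

```python
import shlex


def mc_workflow_commands(alias, bucket, local_file):
    """Build the mc command lines for: create bucket, upload, list, share, delete."""
    target = f"{alias}/{bucket}"
    object_name = local_file.rsplit("/", 1)[-1]
    return [
        ["mc", "mb", target],
        ["mc", "cp", local_file, f"{target}/"],
        ["mc", "ls", target],
        ["mc", "share", "download", f"{target}/{object_name}"],
        ["mc", "rm", f"{target}/{object_name}"],
    ]


commands = mc_workflow_commands("myminio", "mybucket", "/path/to/local/file.jpg")
for argv in commands:
    print(shlex.join(argv))
    # To actually execute a step: subprocess.run(argv, check=True)
```

Keeping each command as an argument list rather than one string avoids shell quoting problems when file names contain spaces.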

The MinIO client (mc) lets you manage data efficiently and carry out the day-to-day tasks of an operational environment from the command line. 😊
