Creating a Collection in Milvus DB

하나둘셋 · July 28, 2024

RAG & Vector Database study (using Milvus), part 2 of 4

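These are notes from the testing I did while building a RAG system.

This post uses Milvus Lite to set up a simple database and create collections, which play the role of tables.

What is Milvus Lite?

Milvus Lite is a lightweight version of Milvus, an open-source database built for vector embeddings and similarity search.
It provides Milvus's core vector search functionality and includes most of its features, which makes it a good fit for quick demos or prototypes with fewer than a million vectors.

For details on running Milvus Lite, see the official documentation below.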

Run Milvus Lite

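Input data attributes

Place ID (primary key)
text: a document containing the place's information (place name, location, category, place keywords)

	text example (translated from Korean)
	Place name: 오일차
	Category: cafe/keyword
	Place keywords: nostalgia trip, dessert
	Location: 1F, 680-219 Seongsu-dong 1-ga, Seongdong-gu, Seoul

embedding vector: the document above converted to a vector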
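The text field can be assembled from a place record along the following lines. This is only a minimal sketch: the function name build_place_text and the dictionary keys are hypothetical, and the labels simply mirror the four attributes listed for the text field.

```python
# Hypothetical helper: the function name and dictionary keys are assumptions,
# not part of the original pipeline.
def build_place_text(place: dict) -> str:
    """Join a place's attributes into one newline-separated document."""
    lines = [
        f"Place name: {place['name']}",
        f"Category: {place['category']}",
        f"Place keywords: {', '.join(place['keywords'])}",
        f"Location: {place['location']}",
    ]
    return "\n".join(lines)


example_place = {
    "name": "오일차",
    "category": "cafe/keyword",
    "keywords": ["nostalgia trip", "dessert"],
    "location": "1F, 680-219 Seongsu-dong 1-ga, Seongdong-gu, Seoul",
}
print(build_place_text(example_place))
```

The resulting string is what would be stored in the text field and passed to the embedding model.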

1. Installing Milvus and importing modules

# Install pymilvus
!pip install -U pymilvus

# Import modules
import pandas as pd
import numpy as np
import time
import os

from openai import OpenAI
from pymilvus import MilvusClient
from pymilvus import FieldSchema, CollectionSchema, DataType

I used Milvus in Google Colab through pymilvus, Milvus's Python SDK.
Milvus Lite is packaged together with pymilvus, so installing pymilvus is what makes Milvus Lite available.

2. Setting up the vector database

client = MilvusClient("milvus_demo.db")

Output:
DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: 117e983a7c4f4e0eb
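The DEBUG line above shows up because the notebook's logging lets DEBUG-level records through. If these messages get too noisy, Python's standard logging module can raise the threshold. A small sketch, assuming the logger name prefix "pymilvus" seen in the output and that pymilvus does not set explicit levels on its child loggers:

```python
import logging

# The logger name comes from the "DEBUG:pymilvus.milvus_client.milvus_client:..."
# line; setting the level on the parent "pymilvus" logger also covers children
# that have no level of their own.
logging.getLogger("pymilvus").setLevel(logging.WARNING)

child = logging.getLogger("pymilvus.milvus_client.milvus_client")
print(child.isEnabledFor(logging.DEBUG))    # False: DEBUG messages are now dropped
print(child.isEnabledFor(logging.WARNING))  # True: warnings still get through
```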

The MilvusClient class creates a local Milvus Lite database.
A database file named "milvus_demo.db" is created, and all data is stored in it.

3. Creating the collections

# Index type, vector dimension, and similarity metric
INDEX_TYPE = "FLAT"
DIMENSION = 768
METRIC_TYPE = "IP"

# Class that creates collections
class MakeCollections:
  def __init__(self, client, index_type, metric_type, dimension):
    self.client = client
    self.index_type = index_type
    self.metric_type = metric_type
    self.dimension = dimension


  # Create the schema
  def create_schema(self):
    fields = [
      FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
      FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=500, description="elements of travel sites"),
      FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=self.dimension, description="vector")
    ]
    schema = CollectionSchema(fields=fields, auto_id=True, description="travel sites")
    return schema

  # Build the index parameters
  def create_index(self):
    index_params = self.client.prepare_index_params()

    index_params.add_index(
      field_name="embedding",
      index_type=self.index_type,
      metric_type=self.metric_type
    )
    return index_params

  # Create the collection
  def create_collection(self, collection_name):
    self.client.create_collection(
      collection_name=collection_name,
      schema=self.create_schema(),
      index_params=self.create_index()
    )

    time.sleep(2)

    res = self.client.get_load_state(
      collection_name=collection_name
    )
    print(res)

    return self.client
    


collection = MakeCollections(client, INDEX_TYPE, METRIC_TYPE, DIMENSION)
kstartup_collection = collection.create_collection("myStartup_travel_sites")
nowlocal_collection = collection.create_collection("nowlocal_travel_sites")
nature_collection = collection.create_collection("nature_travel_sites")

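The code above is the full collection-creation code; it creates three collections that share the same schema and index settings. Note that create_collection() returns the client itself, so the three variables on the left all refer to the same MilvusClient.
A collection is the counterpart of a table in a conventional database. When creating one, you can specify the dimension of the stored vectors, the index type, and the similarity metric type.
The schema and index parameters are set up first, and the collection is then created from those settings.

Let's look at the class's methods one by one.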

3-1. Schema creation method

  # Create the schema
  def create_schema(self):
    fields = [
      FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
      FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=500, description="elements of travel sites"),
      FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=self.dimension, description="vector")
    ]
    schema = CollectionSchema(fields=fields, auto_id=True, description="travel sites")
    return schema

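Fields created: id, text, embedding
auto_id is enabled, so id values are generated automatically at insert time.
Dynamic fields are not used, which prevents fields that are not defined in the schema from being added when data is inserted.

The schema defines the structure of a collection: the data attributes and their types, the primary key, whether dynamic fields are enabled, and so on. There are several ways to create a schema, so see the document below for details.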

Manage Schema
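To make the schema constraints concrete, here is a plain-Python sketch of the checks a row would have to pass: only the declared fields, a text value within max_length, and an embedding of the right dimension. The function validate_row is hypothetical and not part of pymilvus. It also assumes that max_length for VARCHAR is counted in UTF-8 bytes rather than characters, which is worth confirming in the Milvus documentation, since Korean text takes about three bytes per character.

```python
# Hypothetical pre-insert check; not a pymilvus API.
MAX_TEXT_BYTES = 500   # max_length of the "text" field
DIMENSION = 768        # dim of the "embedding" field
ALLOWED_FIELDS = {"text", "embedding"}  # "id" is omitted because auto_id=True


def validate_row(row: dict) -> list[str]:
    """Return a list of problems; an empty list means the row fits the schema."""
    problems = []
    extra = set(row) - ALLOWED_FIELDS
    if extra:
        problems.append(f"unexpected fields: {sorted(extra)}")
    missing = ALLOWED_FIELDS - set(row)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    text = row.get("text", "")
    # Assumption: the limit applies to the UTF-8 encoded byte length.
    if len(text.encode("utf-8")) > MAX_TEXT_BYTES:
        problems.append("text exceeds max_length")
    if len(row.get("embedding", [])) != DIMENSION:
        problems.append("embedding has the wrong dimension")
    return problems


good = {"text": "장소명: 오일차", "embedding": [0.0] * DIMENSION}
bad = {"text": "가" * 200, "embedding": [0.0] * 3, "rating": 5}
print(validate_row(good))  # []
print(validate_row(bad))
```

Under the byte-length assumption, the 200-character Korean string in the second row is already over the limit, so a max_length of 500 leaves less room for Korean documents than the number suggests.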

3-2. Index parameter method

  # Build the index parameters
  def create_index(self):
    index_params = self.client.prepare_index_params()

    index_params.add_index(
      field_name="embedding",
      index_type=self.index_type,
      metric_type=self.metric_type
    )
    return index_params

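Indexed field: embedding
- index type: FLAT
  Chosen because the collection's dataset is small and high search accuracy is wanted; FLAT compares the query against every vector, so results are exact
- metric type: IP
  The vectors are not normalized, and IP (inner product) is widely used for text similarity search in natural language processing; note that with non-normalized vectors IP is influenced by vector length as well as direction

Specifying the index and metric types determines how the vectors are organized and compared, so that the needed data can be retrieved quickly.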
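To illustrate what these two settings mean, the sketch below does in plain Python what a FLAT index does conceptually: it scores the query against every stored vector and ranks by inner product. It is not Milvus code; the function names and the toy 3-dimensional vectors are made up for illustration. It also shows that with non-normalized vectors IP rewards magnitude, whereas cosine similarity only looks at direction.

```python
import math


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


def flat_search(query, vectors, top_k=2):
    """Exhaustive (FLAT-style) search: score every vector, keep the top_k by IP."""
    scored = [(inner_product(query, v), name) for name, v in vectors.items()]
    return sorted(scored, reverse=True)[:top_k]


# Toy, non-normalized vectors (illustrative values only).
vectors = {
    "a": [1.0, 0.0, 0.0],
    "b": [3.0, 3.0, 0.0],   # points less toward the query, but is longer
    "c": [0.0, 0.0, 1.0],
}
query = [1.0, 0.0, 0.0]

print(flat_search(query, vectors))            # [(3.0, 'b'), (1.0, 'a')]
print(round(cosine(query, vectors["a"]), 3))  # 1.0
print(round(cosine(query, vectors["b"]), 3))  # 0.707
```

Vector "a" points in exactly the same direction as the query, yet "b" ranks first under IP because it is longer. If that is not the intended behavior, normalizing the embeddings before insertion makes IP rank the same way as cosine similarity.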

Milvus provides a variety of index types, including FLAT, IVF_FLAT, IVF_SQ8, and IVF_PQ. For the specific index algorithms and metrics, see other blog posts or the official documentation.

In-memory Index
Similarity Metrics

3-3. Collection creation method

  # Create the collection
  def create_collection(self, collection_name):
    self.client.create_collection(
      collection_name=collection_name,
      schema=self.create_schema(),
      index_params=self.create_index()
    )

    time.sleep(2)

    res = self.client.get_load_state(
      collection_name=collection_name
    )
    print(res)

    return self.client

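MilvusClient.create_collection() takes the name of the collection to create, along with the schema and index parameters built in the previous steps.
MilvusClient.get_load_state() shows whether the collection has been loaded after creation.
Example output after creating a collection with the method above:

Output: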

DEBUG:pymilvus.milvus_client.milvus_client:Successfully created collection: nowlocal_travel_sites
DEBUG:pymilvus.milvus_client.milvus_client:Successfully created an index on collection: nowlocal_travel_sites
{'state': <LoadState: Loaded>}
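The method above waits a fixed two seconds before checking the load state. An alternative is to poll until the expected state is reached or a timeout expires. The helper below is a generic sketch using only the standard library; wait_until is a hypothetical name, and the commented usage at the end assumes the client and collection name from this post.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.2):
    """Call predicate() repeatedly until it returns True or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for a real check so the sketch runs on its own:
# it reports "loaded" on the third call.
calls = {"n": 0}


def fake_is_loaded():
    calls["n"] += 1
    return calls["n"] >= 3


print(wait_until(fake_is_loaded, timeout=2.0, interval=0.01))  # True

# With Milvus, the predicate could wrap get_load_state, for example
# (assuming the state's string form contains "Loaded", as in the output above):
# wait_until(lambda: "Loaded" in str(
#     client.get_load_state(collection_name="nowlocal_travel_sites")["state"]))
```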

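4. Checking the created collections

4-1. Viewing detailed information

Passing a collection name to MilvusClient.describe_collection() returns detailed information about that collection, such as its fields and settings.

# Inspect the created collection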
res = client.describe_collection(
    collection_name="nowlocal_travel_sites"
)
res


Output:
{'collection_name': 'nowlocal_travel_sites',
 'auto_id': True,
 'num_shards': 0,
 'description': 'travel sites',
 'fields': [{'field_id': 100,
   'name': 'id',
   'description': '',
   'type': <DataType.INT64: 5>,
   'params': {},
   'auto_id': True,
   'is_primary': True},
  {'field_id': 101,
   'name': 'text',
   'description': 'elements of travel sites',
   'type': <DataType.VARCHAR: 21>,
   'params': {'max_length': 500}},
  {'field_id': 102,
   'name': 'embedding',
   'description': 'vector',
   'type': <DataType.FLOAT_VECTOR: 101>,
   'params': {'dim': 768}}],
 'aliases': [],
 'collection_id': 0,
 'consistency_level': 0,
 'properties': {},
 'num_partitions': 0,
 'enable_dynamic_field': False}
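Since the result is an ordinary dictionary, the settings can also be checked programmatically instead of by eye. The sketch below works on an abridged copy of the output above; the enum values under 'type' are replaced with plain strings so it runs without pymilvus, and the variable names are only illustrative.

```python
# Abridged copy of the describe_collection() result shown above; the enum
# values in 'type' are replaced by plain strings so the sketch runs without pymilvus.
description = {
    "collection_name": "nowlocal_travel_sites",
    "auto_id": True,
    "fields": [
        {"name": "id", "type": "INT64", "params": {}, "auto_id": True, "is_primary": True},
        {"name": "text", "type": "VARCHAR", "params": {"max_length": 500}},
        {"name": "embedding", "type": "FLOAT_VECTOR", "params": {"dim": 768}},
    ],
    "enable_dynamic_field": False,
}

# Map each field name to its parameters for quick checks.
field_params = {f["name"]: f["params"] for f in description["fields"]}
primary_keys = [f["name"] for f in description["fields"] if f.get("is_primary")]

print(field_params["embedding"]["dim"])  # 768
print(primary_keys)                      # ['id']
```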

4-2. Listing the created collections

MilvusClient.list_collections() shows the collections that have been created.

# List the created collections
client.list_collections()

Output:
['myStartup_travel_sites', 'nature_travel_sites', 'nowlocal_travel_sites']

In the next post, I'll insert data that matches the collections created here.
