[Claude AI Agent] Multimodal requests

DH.J · January 28, 2025

Series: Building Your Own AI Agents with Claude

In the previous post, we built a chatbot that runs a prompt through the API.

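from anthropic import Anthropic

# Placeholder: set this to the Claude model ID used in the previous post
MODEL_NAME = "<model-name>"

messages = [
    {
        "role": "user",
        "content": "tell me a joke"
    }
]
client = Anthropic(
    api_key="<api-key>",
)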

response = client.messages.create(
    messages=messages,
    model=MODEL_NAME,
    max_tokens=100,
)
print(response.content[0].text)

Here's a classic:
Why don't scientists trust atoms?
Because they make up everything! 😄

The answer isn't very satisfying, but the model does generate a response.

Multimodal

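Multimodal processing handles several kinds of data formats together.
A modality is a sensory channel,
so "multimodal" means using modalities such as images, video, audio, and text side by side.
Examples include image captioning, video question answering, and multimodal chatbots.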

In this post, we'll learn how to handle image data in addition to text.

Content block

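The messages list can also be written in another form.
Instead of a single string, content can be a list of content blocks, which lets you split a message into several parts.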

messages = [
    {
        "role": "user",
        # Content blocks: a list of typed parts instead of a single string
        "content": [
            {"type": "text", "text": "Who"},
            {"type": "text", "text": "made"},
            {"type": "text", "text": "you?"}
        ]
    }
]
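For text-only messages, the string form and the block form carry the same information: a plain string is shorthand for a single text block. The sketch below makes that relationship explicit. The helper name to_content_blocks is hypothetical and not part of the SDK, and it assumes a list argument already holds valid block dicts.

```python
def to_content_blocks(content):
    """Return content as a list of content blocks.

    Hypothetical helper: a plain string becomes one text block,
    and a list is assumed to already hold block dicts.
    """
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    return list(content)


# The same user turn, written with the block form
messages = [{"role": "user", "content": to_content_blocks("Who made you?")}]
print(messages[0]["content"])
```

Normalizing to the list form early makes it easy to append other blocks, such as images, to the same message later.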

So how do we handle image data?

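An image goes into its own content block. The following is the format generally used when sending a message that includes an image through an API or chat system.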

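{
  "type": "image",  # content type (e.g. text, image)
  "source": {       # object holding the image source information
    "type": "base64",     # encoding type
    "media_type": "image/png",  # image media type (e.g. image/jpeg, image/png, image/gif); must match the data
    "data": "iVBORw0KGgoAAAANSUhEUgAA..." # the actual image data, Base64-encoded (this prefix is a PNG header)
  }
}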

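Let's apply it in practice. Here base64_string stands for the Base64-encoded contents of a PNG file; the next section shows how to produce it.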

messages = [
    {
        "role": "user",
        "content": [{
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": base64_string
            },
        }]
    }
]
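The media_type has to describe the bytes that are actually in data, and a mismatch is an easy mistake to make. As a rough check, the sketch below decodes the first few bytes of a Base64 string and compares them with well-known file signatures. The function name sniff_media_type is hypothetical, and it assumes the input is valid, padded Base64.

```python
import base64


def sniff_media_type(base64_string):
    """Guess an image media type from the first decoded bytes, or return None."""
    # 16 Base64 characters decode to exactly 12 bytes, enough for these signatures
    head = base64.b64decode(base64_string[:16])
    if head.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if head.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if head.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    if head[:4] == b"RIFF" and head[8:12] == b"WEBP":
        return "image/webp"
    return None


# The visible prefix of the sample data above decodes to a PNG header
print(sniff_media_type("iVBORw0KGgoAAAANSUhEUgAA"))  # → image/png
```

A check like this only inspects the header, so it can catch a wrong label but does not prove the rest of the file is a valid image.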

Image messages

Let's write a function that builds the image content block for a message.

import base64
import mimetypes

def create_image_message(image_path):
    # Open the image file in "read binary" mode
    with open(image_path, "rb") as image_file:
        # Read the contents of the image as a bytes object
        binary_data = image_file.read()
    # Encode the binary data using Base64 encoding
    base64_encoded_data = base64.b64encode(binary_data)
    # Decode base64_encoded_data from bytes to a string
    base64_string = base64_encoded_data.decode('utf-8')
    # Get the MIME type of the image based on its file extension
    mime_type, _ = mimetypes.guess_type(image_path)
    # Create the image block
    image_block = {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": mime_type,
            "data": base64_string
        }
    }
    
    return image_block
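One gap in the function above: mimetypes.guess_type returns None for an unknown extension, and it happily returns non-image types such as application/pdf. A minimal guard, assuming the API accepts JPEG, PNG, GIF, and WebP images (an assumption to verify against the official documentation), might look like this. The names SUPPORTED_IMAGE_TYPES and guess_image_media_type are hypothetical.

```python
import mimetypes

# Assumed set of image types the API accepts; check the official docs
SUPPORTED_IMAGE_TYPES = {"image/jpeg", "image/png", "image/gif", "image/webp"}


def guess_image_media_type(image_path):
    """Return the MIME type for image_path, or raise if it is not a supported image."""
    # guess_type looks only at the file extension, not at the file contents
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type not in SUPPORTED_IMAGE_TYPES:
        raise ValueError(
            f"Unsupported or unknown image type for {image_path!r}: {mime_type}"
        )
    return mime_type


print(guess_image_media_type("./images/plant.png"))  # → image/png
```

Failing early with a clear error is usually easier to debug than an API error about an invalid media_type.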

Pass in the path of a real image, then add the prompt ("What species is this?") as a text block.

messages = [
    {
        "role": "user",
        "content": [
            create_image_message("./images/plant.png"),
            {"type": "text", "text": "What species is this?"}
        ]
    }
]

response = client.messages.create(
    model=MODEL_NAME,
    max_tokens=2048,
    messages=messages
)
print(response.content[0].text)

This appears to be a Monstera deliciosa, commonly known as the Swiss cheese plant or split-leaf philodendron. This identification is based on its distinctive features:

Large, glossy green leaves with natural leaf splits (fenestrations)
Heart-shaped leaf structure
Aerial roots visible at the base
Characteristic perforations in mature leaves

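The answer came back successfully.
If the file is a PDF rather than an image, set "media_type": "application/pdf" (a file name such as "image/invoice.pdf" is not a valid media type) and send it in a "document" content block instead of an "image" block.

"text": "Generate a JSON object representing the contents of this invoice."
With a text block like this, you can also have the contents of the PDF converted into a JSON object.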
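A minimal sketch of the PDF case follows. It assumes the API accepts a content block of type "document" with a Base64 source and media type application/pdf, and that the chosen model supports PDF input; both should be checked against the official documentation. The function names create_pdf_message and build_invoice_messages are hypothetical.

```python
import base64


def create_pdf_message(pdf_path):
    """Build a document content block from a PDF file (assumed block shape)."""
    with open(pdf_path, "rb") as pdf_file:
        base64_string = base64.b64encode(pdf_file.read()).decode("utf-8")
    return {
        "type": "document",
        "source": {
            "type": "base64",
            "media_type": "application/pdf",
            "data": base64_string,
        },
    }


def build_invoice_messages(pdf_path):
    """Pair the PDF block with the JSON-extraction prompt in one user turn."""
    prompt = "Generate a JSON object representing the contents of this invoice."
    return [
        {
            "role": "user",
            "content": [create_pdf_message(pdf_path), {"type": "text", "text": prompt}],
        }
    ]
```

The resulting list can be passed as messages to client.messages.create in the same way as the image example.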

Stream

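Streaming delivers generated text in real time.
Instead of generating a long response and delivering it all at once,
the response is generated and delivered sequentially in chunks.
Because each chunk is delivered as soon as it is generated, you don't have to wait for the whole response.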

with client.messages.stream(
    max_tokens=1024,
    messages=[{"role": "user", "content": "write an essay for kids."}],
    model=MODEL_NAME,
) as stream:
  for text in stream.text_stream:
    print(text, end="", flush=True)

The with statement opens the stream, and the prompt goes in content.
for text in stream.text_stream: iterates over the streamed text,
and print(text, end="", flush=True) prints each piece of text as it arrives.
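To see the loop in isolation, the sketch below replaces stream.text_stream with an ordinary iterator of strings, so it runs without an API call. The function consume_stream and the sample fragments are hypothetical stand-ins; real fragments will be split differently.

```python
def consume_stream(text_stream):
    """Print each chunk as it arrives and return the full text."""
    parts = []
    for text in text_stream:
        # end="" keeps the chunks on one line; flush=True shows them immediately
        print(text, end="", flush=True)
        parts.append(text)
    print()
    return "".join(parts)


# Stand-in for stream.text_stream: any iterable of string fragments works
fake_stream = iter(["Once ", "upon ", "a ", "time."])
full_text = consume_stream(fake_stream)
```

Collecting the fragments while printing them gives you the complete response for later use, for example to append it to the conversation history.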

What if you want to control how much text is printed at a time?
One option is to use a buffer and print a fixed number of words at once.

with client.messages.stream(
    max_tokens=1024,
    messages=[{"role": "user", "content": "write an essay for kids."}],
    model=MODEL_NAME,
) as stream:
    buffer = ""
    word_count = 0
    words_per_chunk = 3  # number of words to print at a time

    for text in stream.text_stream:
        buffer += text
        # Count each space as a word boundary (a chunk can hold several words)
        word_count += text.count(" ")

        # Once words_per_chunk words are buffered, print them and reset
        if word_count >= words_per_chunk:
            print(buffer, end="", flush=True)
            buffer = ""
            word_count = 0

    # Print whatever is left in the buffer
    if buffer:
        print(buffer, flush=True)
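The buffering logic can be checked without calling the API by moving it into a generator and feeding it a list of strings. This is a sketch under two assumptions: each space is treated as one word boundary, and the sample fragments are made up for illustration. The name group_words is hypothetical.

```python
def group_words(text_stream, words_per_chunk=3):
    """Regroup streamed fragments so each yielded piece holds about words_per_chunk words."""
    buffer = ""
    word_count = 0
    for text in text_stream:
        buffer += text
        # Count every space, so a fragment with several words is counted fully
        word_count += text.count(" ")
        if word_count >= words_per_chunk:
            yield buffer
            buffer = ""
            word_count = 0
    # Yield whatever is left once the stream ends
    if buffer:
        yield buffer


fragments = ["The ", "quick ", "brown ", "fox ", "jumps ", "over ", "the ", "lazy dog."]
for piece in group_words(fragments):
    print(repr(piece))
# → 'The quick brown '
# → 'fox jumps over '
# → 'the lazy dog.'
```

With a real stream, the same generator could wrap stream.text_stream, and each yielded piece would be printed with end="" and flush=True.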
