AI Docent System for Art Exhibitions (6-2)

윤서 · May 21, 2025

CapstoneProject


Object Detection -> Embedding -> FAISS -> LLaMA Explanation pipeline

In our project, we aim to develop an AI docent system that automatically provides descriptions for specific objects detected within artwork images.

Pipeline Overview

[ full artwork image ]
-> YOLOv8 segmentation ->
[ object detection & crop ]
-> CLIP embedding ->
[ vectorized image ]
-> FAISS indexing ->
[ query -> object retrieval ]
-> LLaMA-based generation ->
[ visitor-friendly docent explanation ]

  1. Object Detection and Cropping with YOLOv8

We use the yolov8-seg.pt model to detect objects within a given artwork image.
Unlike simple bounding-box detection, this model produces segmentation masks, allowing us to crop each object precisely along its actual shape.
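A minimal sketch of this step, assuming the ultralytics package and OpenCV; the image path, checkpoint variant, and the way crops are collected are illustrative, not taken from the project code:

```python
import cv2
import numpy as np
from ultralytics import YOLO

# Load a YOLOv8 segmentation checkpoint (the post refers to it as yolov8-seg.pt;
# the official nano variant is yolov8n-seg.pt).
model = YOLO("yolov8n-seg.pt")
image = cv2.imread("artwork.jpg")           # hypothetical input image

result = model(image)[0]                     # single-image inference
crops = []
if result.masks is not None:
    for box, polygon in zip(result.boxes, result.masks.xy):
        # Build a binary mask from the segmentation polygon (original-image coordinates)
        mask = np.zeros(image.shape[:2], dtype=np.uint8)
        cv2.fillPoly(mask, [polygon.astype(np.int32)], 255)
        masked = cv2.bitwise_and(image, image, mask=mask)
        # Crop to the bounding box so each object becomes its own small image
        x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
        crops.append(masked[y1:y2, x1:x2])
```

Using the mask polygon instead of only the box keeps background pixels out of the crop, which is the point of choosing a segmentation model here.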

  2. Embedding Cropped Objects Using CLIP

Each cropped object image is passed through CLIP to convert it into a semantic vector.
This enables us to later retrieve semantically similar objects based on the visitor's query.
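A minimal sketch of this step, assuming the Hugging Face transformers implementation of CLIP (the checkpoint name is an assumption, not necessarily the one used in the project):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; any CLIP variant with its matching processor works the same way.
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image(pil_image: Image.Image) -> torch.Tensor:
    """Turn one cropped object into a unit-length semantic vector."""
    inputs = processor(images=pil_image, return_tensors="pt")
    with torch.no_grad():
        features = clip_model.get_image_features(**inputs)
    # Normalize so that inner product == cosine similarity during retrieval
    return features / features.norm(dim=-1, keepdim=True)
```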

What is a Semantic Vector?

A semantic vector is a numerical representation that captures the meaning of text, images, or other human-understandable content in a form that machines can interpret.
In simpler terms: it's an array of numbers that represents meaning.

For example:

CAT : [0.8, 0.2, 0.5]
DOG : [0.79, 0.21, 0.52]
CAR : [0.1, 0.9, 0.3]

Cat and Dog have similar vectors because they are semantically related.
Car is conceptually different, so its vector is farther away.

The similarity between these vectors tells us how closely related the meanings are.
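With the toy vectors above, a quick cosine-similarity check makes this concrete:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction (same meaning), lower means less related."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cat, dog, car = [0.8, 0.2, 0.5], [0.79, 0.21, 0.52], [0.1, 0.9, 0.3]
print(cosine(cat, dog))  # very close to 1.0 -> semantically similar
print(cosine(cat, car))  # noticeably lower -> less related
```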

CLIP is trained to embed both images and text into the same semantic space.

Example:
"an apple on the table" -> [text vector]
image of an actual apple -> [image vector]

These are trained to be close in vector space.
So when a user asks, "where is the apple?", we convert the question into a vector and use FAISS to find the closest image embedding, and then explain it.
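For the text side, a sketch with the same assumed CLIP checkpoint as in the image-embedding example above:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Encode the visitor's question into the same semantic space as the image crops
inputs = processor(text=["where is the apple?"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_vec = clip_model.get_text_features(**inputs)
text_vec = text_vec / text_vec.norm(dim=-1, keepdim=True)

# Cosine similarity against a normalized image embedding from the previous step
# would then rank objects: higher score = closer meaning, e.g.
# score = (text_vec @ image_vec.T).item()
query_vec = text_vec.numpy().astype("float32")  # ready to be searched in FAISS later
```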

  3. Indexing Semantic Vectors with FAISS

The image embeddings obtained from CLIP are indexed using FAISS. This enables fast, approximate nearest neighbor search to retrieve the most semantically similar objects later.
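A minimal sketch of the indexing step, assuming faiss-cpu and the 512-dimensional, L2-normalized CLIP embeddings from above. IndexFlatIP is an exact baseline; one of FAISS's approximate indexes (e.g. IndexIVFFlat) can be swapped in as the collection grows:

```python
import faiss
import numpy as np

dim = 512                                    # embedding size of CLIP ViT-B/32
index = faiss.IndexFlatIP(dim)               # inner product on unit vectors = cosine similarity

# Stand-in for the real CLIP embeddings of the cropped objects
object_vectors = np.random.rand(10, dim).astype("float32")
faiss.normalize_L2(object_vectors)
index.add(object_vectors)

# Later: search with an encoded query to get the top-k most similar objects
query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 3)         # ids index back into the object metadata
```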

  4. Query-Based Retrieval and Description Generation via LLaMA

When a user submits a natural-language query, the following steps occur (a code sketch follows the list):
1. The query is converted into a semantic vector using CLIP's text encoder.
2. FAISS searches for the most similar object vectors.
3. Metadata for the top result (label, description, etc.) is retrieved.
4. This information is passed into a prompt, which is then fed into LLaMA to generate a natural, human-friendly explanation.
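A minimal end-to-end sketch of these four steps. The inputs query_vec, index, and metadata stand in for the CLIP text vector, FAISS index, and per-object records built in the earlier steps, and the LLaMA checkpoint name is only one possible choice, not necessarily the one used in the project:

```python
from transformers import pipeline

# Assumed LLaMA backend via a Hugging Face text-generation pipeline (checkpoint is illustrative)
generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

def answer_query(query: str, query_vec, index, metadata: list, k: int = 1) -> str:
    """query_vec: the question encoded by CLIP's text encoder (step 1), as a
    float32 NumPy array of shape (1, dim); metadata: one dict per indexed object."""
    # 2. Retrieve the most similar object vectors from FAISS
    scores, ids = index.search(query_vec, k)
    # 3. Look up the top result's metadata (label, description, ...)
    top = metadata[int(ids[0][0])]
    # 4. Build a prompt and let LLaMA phrase a visitor-friendly explanation
    prompt = (
        "You are a museum docent. Explain the following object in the artwork "
        "to a visitor in a friendly tone.\n"
        f"Object: {top['label']}\nNotes: {top['description']}\n"
        f"Visitor question: {query}\nAnswer:"
    )
    return generator(prompt, max_new_tokens=200)[0]["generated_text"]
```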

This creates a dynamic docent experience where visitors can ask questions or click on an object and receive personalized explanations generated on the spot.
