AI Docent System for Art Exhibitions (6-2)

윤서 · May 21, 2025

CapstoneProject


Object Detection -> Embedding -> FAISS -> LLaMA Explanation pipeline

In our project, we aim to develop an AI docent system that automatically provides descriptions for specific objects detected within artwork images.

Pipeline Overview

[ full artwork image ]
-> YOLOv8 segmentation ->
[ object detection & crop ]
-> CLIP embedding ->
[ vectorized image ]
-> FAISS indexing ->
[ query -> object retrieval ]
-> LLaMA-based generation ->
[ visitor-friendly docent explanation ]

  1. Object Detection and Cropping with YOLOv8

We use the yolov8-seg.pt model to detect objects within a given artwork image.
Unlike simple bounding-box detection, this model produces segmentation masks, allowing us to crop each object precisely along its actual shape.
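A minimal sketch of this step, assuming the ultralytics package and OpenCV; the image path, checkpoint variant, and the way crops are collected are illustrative, not taken from the project code:

```python
import cv2
import numpy as np
from ultralytics import YOLO

# Load a YOLOv8 segmentation checkpoint (the post refers to it as yolov8-seg.pt;
# the official nano variant is yolov8n-seg.pt).
model = YOLO("yolov8n-seg.pt")
image = cv2.imread("artwork.jpg")           # hypothetical input image

result = model(image)[0]                     # single-image inference
crops = []
if result.masks is not None:
    for box, polygon in zip(result.boxes, result.masks.xy):
        # Build a binary mask from the segmentation polygon (original-image coordinates)
        mask = np.zeros(image.shape[:2], dtype=np.uint8)
        cv2.fillPoly(mask, [polygon.astype(np.int32)], 255)
        masked = cv2.bitwise_and(image, image, mask=mask)
        # Crop to the bounding box so each object becomes its own small image
        x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
        crops.append(masked[y1:y2, x1:x2])
```

Using the mask polygon instead of only the box keeps background pixels out of the crop, which is the point of choosing a segmentation model here.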

  2. Embedding Cropped Objects Using CLIP

Each cropped object image is passed through CLIP to convert it into a semantic vector.
This enables us to later retrieve semantically similar objects based on the visitor's query.
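A minimal sketch of this step, assuming the Hugging Face transformers implementation of CLIP (the checkpoint name is an assumption, not necessarily the one used in the project):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; any CLIP variant with its matching processor works the same way.
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image(pil_image: Image.Image) -> torch.Tensor:
    """Turn one cropped object into a unit-length semantic vector."""
    inputs = processor(images=pil_image, return_tensors="pt")
    with torch.no_grad():
        features = clip_model.get_image_features(**inputs)
    # Normalize so that inner product == cosine similarity during retrieval
    return features / features.norm(dim=-1, keepdim=True)
```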

What is a Semantic Vector?

A semantic vector is a numerical representation that captures the meaning of text, images, or other human-understandable content in a form that machines can interpret.
In simpler terms: it's an array of numbers that represents meaning.

For example:

CAT : [0.8, 0.2, 0.5]
DOG : [0.79, 0.21, 0.52]
CAR : [0.1, 0.9, 0.3]

Cat and Dog have similar vectors because they are semantically related.
Car is conceptually different, so its vector is farther away.

The similarity between these vectors tells us how closely related the meanings are.
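With the toy vectors above, a quick cosine-similarity check makes this concrete:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction (same meaning), lower means less related."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cat, dog, car = [0.8, 0.2, 0.5], [0.79, 0.21, 0.52], [0.1, 0.9, 0.3]
print(cosine(cat, dog))  # very close to 1.0 -> semantically similar
print(cosine(cat, car))  # noticeably lower -> less related
```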

CLIP is trained to embed both images and text into the same semantic space.

Example:
"an apple on the table" -> [text vector]
image of an actual apple -> [image vector]

These are trained to be close in vector space.
So when a user asks, "where is the apple?", we convert the question into a vector and use FAISS to find the closest image embedding, and then explain it.
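For the text side, a sketch with the same assumed CLIP checkpoint as in the image-embedding example above:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Encode the visitor's question into the same semantic space as the image crops
inputs = processor(text=["where is the apple?"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_vec = clip_model.get_text_features(**inputs)
text_vec = text_vec / text_vec.norm(dim=-1, keepdim=True)

# Cosine similarity against a normalized image embedding from the previous step
# would then rank objects: higher score = closer meaning, e.g.
# score = (text_vec @ image_vec.T).item()
query_vec = text_vec.numpy().astype("float32")  # ready to be searched in FAISS later
```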

  3. Indexing Semantic Vectors with FAISS

The image embeddings obtained from CLIP are indexed using FAISS. This enables fast, approximate nearest neighbor search to retrieve the most semantically similar objects later.
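A minimal sketch of the indexing step, assuming faiss-cpu and the 512-dimensional, L2-normalized CLIP embeddings from above. IndexFlatIP is an exact baseline; one of FAISS's approximate indexes (e.g. IndexIVFFlat) can be swapped in as the collection grows:

```python
import faiss
import numpy as np

dim = 512                                    # embedding size of CLIP ViT-B/32
index = faiss.IndexFlatIP(dim)               # inner product on unit vectors = cosine similarity

# Stand-in for the real CLIP embeddings of the cropped objects
object_vectors = np.random.rand(10, dim).astype("float32")
faiss.normalize_L2(object_vectors)
index.add(object_vectors)

# Later: search with an encoded query to get the top-k most similar objects
query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 3)         # ids index back into the object metadata
```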

  4. Query-Based Retrieval and Description Generation via LLaMA

When a user submits a natural-language query, the following steps occur (a code sketch follows the list):
1. The query is converted into a semantic vector using CLIP's text encoder.
2. FAISS searches for the most similar object vectors.
3. Metadata for the top result (label, description, etc.) is retrieved.
4. This information is passed into a prompt, which is then fed into LLaMA to generate a natural, human-friendly explanation.
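A minimal end-to-end sketch of these four steps. The inputs query_vec, index, and metadata stand in for the CLIP text vector, FAISS index, and per-object records built in the earlier steps, and the LLaMA checkpoint name is only one possible choice, not necessarily the one used in the project:

```python
from transformers import pipeline

# Assumed LLaMA backend via a Hugging Face text-generation pipeline (checkpoint is illustrative)
generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

def answer_query(query: str, query_vec, index, metadata: list, k: int = 1) -> str:
    """query_vec: the question encoded by CLIP's text encoder (step 1), as a
    float32 NumPy array of shape (1, dim); metadata: one dict per indexed object."""
    # 2. Retrieve the most similar object vectors from FAISS
    scores, ids = index.search(query_vec, k)
    # 3. Look up the top result's metadata (label, description, ...)
    top = metadata[int(ids[0][0])]
    # 4. Build a prompt and let LLaMA phrase a visitor-friendly explanation
    prompt = (
        "You are a museum docent. Explain the following object in the artwork "
        "to a visitor in a friendly tone.\n"
        f"Object: {top['label']}\nNotes: {top['description']}\n"
        f"Visitor question: {query}\nAnswer:"
    )
    return generator(prompt, max_new_tokens=200)[0]["generated_text"]
```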

This creates a dynamic docent experience where visitors can ask questions or click on an object and receive personalized explanations generated on the spot.
