[Langchain] Retriever

📌 Vector Store-Based RAG Retriever

This post uses the same transformer.pdf file that was used in the Text Splitter post.

Document loading

from langchain_community.document_loaders import PyPDFLoader

# Initialize the PDF loader
pdf_loader = PyPDFLoader('./data/transformer.pdf')

# Load synchronously
pdf_docs = pdf_loader.load()
print(f'Number of PDF documents: {len(pdf_docs)}')
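
load() reads every page into memory at once. For large PDFs, the loader also exposes a lazy alternative; below is a minimal sketch using the loader's lazy_load() generator, which yields one page Document at a time.

# Lazy loading: pages are yielded one by one instead of all at once
page_count = 0
for doc in pdf_loader.lazy_load():
    page_count += 1  # process each page Document here
print(f'Pages streamed lazily: {page_count}')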

Text splitting

To split on token counts rather than character counts, the embedding model is loaded first and its tokenizer is used as the splitter's length function.

from langchain_huggingface.embeddings import HuggingFaceEmbeddings

# Create a Hugging Face embedding model
embeddings_huggingface = HuggingFaceEmbeddings(model_name="BAAI/bge-m3")

# Access the tokenizer directly (via the private _client attribute)
tokenizer = embeddings_huggingface._client.tokenizer

# Tokenize a sample sentence
text = "테스트 텍스트입니다."  # "This is a test text."
tokens = tokenizer(text)
print(tokens)

# Check the tokenizer settings
print(tokenizer.model_max_length)  # maximum token length
print(tokenizer.vocab_size)        # vocabulary size

- Output

{'input_ids': [0, 153924, 239355, 5826, 5, 2], 'attention_mask': [1, 1, 1, 1, 1, 1]}
8192
250002

A token-counting function (used as the splitter's length criterion)

# Function that counts tokens
def count_tokens(text):
    return len(tokenizer(text)['input_ids'])

# Count the tokens of the sample sentence
text = "테스트 텍스트입니다."
print(count_tokens(text))

- Output

6
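
Note that the count of 6 includes the special tokens the tokenizer adds automatically (the leading 0 and trailing 2 in the input_ids above). If only content tokens should be counted, Hugging Face tokenizers accept add_special_tokens=False; a small sketch:

# Count tokens without the special BOS/EOS tokens the tokenizer adds
def count_content_tokens(text):
    return len(tokenizer(text, add_special_tokens=False)['input_ids'])

print(count_content_tokens("테스트 텍스트입니다."))  # should print 4 for the sample sentence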

Creating the text splitter and splitting the documents

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Create the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    length_function=count_tokens,         # split based on token count
    separators=["\n\n", "\n", " ", ""],   # separators, tried recursively in order
)

# Split the documents
chunks = text_splitter.split_documents(pdf_docs)
print(f"Number of text chunks created: {len(chunks)}")
print(f"Length of each chunk: {[len(chunk.page_content) for chunk in chunks]}")
print(f"Token count of each chunk: {[count_tokens(chunk.page_content) for chunk in chunks]}")

- Output

์ƒ์„ฑ๋œ ํ…์ŠคํŠธ ์ฒญํฌ ์ˆ˜: 38
๊ฐ ์ฒญํฌ์˜ ๊ธธ์ด: [1378, 1796, 1831, 1857, 1292, 1609, 503, 1554, 1278, 1362, 1608, 833, 1418, 1680, 999, 1764, 1604, 539, 1219, 1645, 926, 1213, 1688, 716, 1409, 1626, 624, 1411, 1437, 913, 1493, 1337, 845, 812, 470, 438, 470, 441]
๊ฐ ์ฒญํฌ์˜ ํ† ํฐ ์ˆ˜: [336, 415, 405, 419, 327, 424, 127, 388, 294, 384, 411, 204, 419, 417, 226, 419, 395, 149, 390, 400, 221, 356, 411, 181, 394, 405, 188, 424, 399, 277, 420, 409, 250, 190, 131, 117, 131, 113]

# Inspect one chunk's text
print(chunks[2].page_content)

- Output

1 Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks
in particular, ...
The Transformer allows for significantly more parallelization and can reach a new state of the art in

1. Initializing the vector store

  • uses Chroma
  • indexes with cosine distance

from langchain_chroma import Chroma

# Create the Chroma vector store
chroma_db = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings_huggingface,    # use the Hugging Face embeddings
    collection_name="db_transformer",    # collection name
    persist_directory="./chroma_db",
    collection_metadata={'hnsw:space': 'cosine'},  # choose among l2, ip, cosine
)

# Inspect the data currently stored in the collection
chroma_db.get()

- Output

{'ids': ['...' ,
	...],
 'embeddings': None,
 'documents': ['Provided ...',
 	...],
 'uris': None,
 'data': None,
 'metadatas': [
 {'page': 0, 'page_label': 1, 'source': './data/transformer.pdf'},
 ...],
 ...
}
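
Because persist_directory is set, the collection is written to disk and can be reopened in a later session without re-embedding the documents. A minimal sketch, assuming the same embedding model is available:

# Reopen the persisted collection (no re-embedding needed)
chroma_db = Chroma(
    collection_name="db_transformer",
    embedding_function=embeddings_huggingface,
    persist_directory="./chroma_db",
)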

2. Top K

  • k : number of documents to return

chroma_k_retriever = chroma_db.as_retriever(
    search_kwargs={"k": 2},
)

query = "대표적인 시퀀스 모델은 어떤 것들이 있나요?"  # "What are the representative sequence models?"
retrieved_docs = chroma_k_retriever.invoke(query)

print(f"์ฟผ๋ฆฌ: {query}")
print("๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ:")
for i, doc in enumerate(retrieved_docs, 1):
    print(f"-{i}-\n{doc.page_content[:100]}...{doc.page_content[-100:]} [์ถœ์ฒ˜: {doc.metadata['source']}]")
    print("-" * 100)

- Output

Query: 대표적인 시퀀스 모델은 어떤 것들이 있나요?
Search results:
-1-
1 Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural...he Transformer allows for significantly more parallelization and can reach a new state of the art in [source: ./data/transformer.pdf]
----------------------------------------------------------------------------------------------------
-2-
In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the Berkeley-
Parse... 2016.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly [source: ./data/transformer.pdf]
----------------------------------------------------------------------------------------------------
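
The retriever interface hides the ranking scores. To inspect the raw cosine distances Chroma ranks by (lower means closer), the vector store can be queried directly; a short sketch using similarity_search_with_score:

# Same top-k search, but returning (document, cosine distance) pairs
for doc, dist in chroma_db.similarity_search_with_score(query, k=2):
    print(f"distance={dist:.4f} | {doc.page_content[:60]}...")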

3. Setting a score threshold

  • search_type : search strategy (similarity, mmr, similarity_score_threshold)
  • search_kwargs={'score_threshold': 0.5} : minimum score (only documents scoring 0.5 or higher are returned)

from langchain_community.utils.math import cosine_similarity

chroma_threshold_retriever = chroma_db.as_retriever(
    search_type='similarity_score_threshold',       # threshold on the cosine relevance score
    search_kwargs={'score_threshold': 0.5, 'k': 2},  # return only documents scoring 0.5 or higher
)

query = "대표적인 시퀀스 모델은 어떤 것들이 있나요?"
retrieved_docs = chroma_threshold_retriever.invoke(query)

print(f"์ฟผ๋ฆฌ: {query}")
print("๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ:")
for i, doc in enumerate(retrieved_docs, 1):
    score = cosine_similarity(
        [embeddings_huggingface.embed_query(query)], 
        [embeddings_huggingface.embed_query(doc.page_content)]
        )[0][0]
    print(f"-{i}-\n[์œ ์‚ฌ๋„: {score}]\n{doc.page_content[:100]}...{doc.page_content[-100:]}")
    print("-" * 100)

- Output

Query: 대표적인 시퀀스 모델은 어떤 것들이 있나요?
Search results:
-1-
[similarity: 0.5069071561705342]
1 Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural...he Transformer allows for significantly more parallelization and can reach a new state of the art in
----------------------------------------------------------------------------------------------------
-2-
[similarity: 0.5020665604450864]
In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the Berkeley-
Parse... 2016.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly
----------------------------------------------------------------------------------------------------
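
For a cosine-indexed collection, LangChain converts the distance into a relevance score in [0, 1] (roughly 1 - distance) before applying the threshold. The same thresholded scores can be obtained directly from the vector store; a sketch using similarity_search_with_relevance_scores:

# Thresholded search returning (document, relevance score) pairs
results = chroma_db.similarity_search_with_relevance_scores(
    query, k=2, score_threshold=0.5,
)
for doc, score in results:
    print(f"relevance={score:.4f} | {doc.page_content[:60]}...")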

4. MMR (Maximal Marginal Relevance) search

  • fetch_k : number of documents passed to the MMR algorithm
  • lambda_mult : diversity of the results (1 : minimum diversity, 0 : maximum diversity)

# MMR - considers diversity (the smaller lambda_mult, the more diverse the results)
chroma_mmr = chroma_db.as_retriever(
    search_type='mmr',
    search_kwargs={
        'k': 3,                 # number of documents to return
        'fetch_k': 8,           # number of documents passed to the MMR algorithm (fetch_k > k)
        'lambda_mult': 0.5,     # degree of diversity (1 = minimum, 0 = maximum diversity; default 0.5)
        },
)

query = "대표적인 시퀀스 모델은 어떤 것들이 있나요?"
retrieved_docs = chroma_mmr.invoke(query)

print(f"์ฟผ๋ฆฌ: {query}")
print("๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ:")
for i, doc in enumerate(retrieved_docs, 1):
    score = cosine_similarity(
        [embeddings_huggingface.embed_query(query)], 
        [embeddings_huggingface.embed_query(doc.page_content)]
        )[0][0]
    print(f"-{i}-\n[์œ ์‚ฌ๋„: {score}]\n{doc.page_content[:100]}...{doc.page_content[-100:]}")
    print("-" * 100)

- Output

Query: 대표적인 시퀀스 모델은 어떤 것들이 있나요?
Search results:
-1-
[similarity: 0.5069071561705342]
1 Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural...he Transformer allows for significantly more parallelization and can reach a new state of the art in
----------------------------------------------------------------------------------------------------
-2-
[similarity: 0.47915489021788504]
Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations
for ...ng
corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We
----------------------------------------------------------------------------------------------------
-3-
[similarity: 0.4709169133567156]
from our models and present and discuss examples in the appendix. Not only do individual attention
h..., according to the formula:
lrate = d−0.5
model · min(step_num−0.5, step_num · warmup_steps−1.5) (3)
----------------------------------------------------------------------------------------------------
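
The same MMR search is available directly on the vector store, which makes it easy to compare lambda_mult settings without rebuilding the retriever; a minimal sketch:

# Direct MMR call: compare a relevance-heavy and a diversity-heavy setting
for lam in (0.9, 0.1):
    docs = chroma_db.max_marginal_relevance_search(
        query, k=3, fetch_k=8, lambda_mult=lam,
    )
    print(f"lambda_mult={lam}: {[d.page_content[:40] for d in docs]}")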

5. Filtering by metadata

# Inspect the metadata
chunks[0].metadata

- Output

{'source': './data/transformer.pdf', 'page': 0, 'page_label': '1'}

# Filtering with the Document objects' metadata
chroma_metadata = chroma_db.as_retriever(
    search_kwargs={
        'filter': {'source': './data/transformer.pdf'},
        'k': 5, 
        }
)

query = "대표적인 시퀀스 모델은 어떤 것들이 있나요?"
retrieved_docs = chroma_metadata.invoke(query)

print(f"์ฟผ๋ฆฌ: {query}")
print("๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ:")
for i, doc in enumerate(retrieved_docs, 1):
    print(f"-{i}-\n{doc.page_content[:100]}...{doc.page_content[-100:]}\n[์ถœ์ฒ˜: {doc.metadata['source']}]")
    print("-" * 100)

- Output

Query: 대표적인 시퀀스 모델은 어떤 것들이 있나요?
Search results:
-1-
1 Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural...he Transformer allows for significantly more parallelization and can reach a new state of the art in
[source: ./data/transformer.pdf]
----------------------------------------------------------------------------------------------------
-2-
In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the Berkeley-
Parse... 2016.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly
[source: ./data/transformer.pdf]
----------------------------------------------------------------------------------------------------
...
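
Chroma's filter syntax also supports comparison operators ($eq, $gte, $lte, ...), so the search can be restricted by metadata values rather than exact matches. A sketch, assuming the integer page field shown above:

# Restrict the search to the first three pages with a comparison operator
chroma_pages = chroma_db.as_retriever(
    search_kwargs={
        'filter': {'page': {'$lte': 2}},
        'k': 3,
        }
)
print([d.metadata['page'] for d in chroma_pages.invoke(query)])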

6. Filtering on the page_content body

  • 'where_document': {'$contains': 'recurrent'} : search only among documents whose page_content contains 'recurrent'

# Filtering on page_content
chroma_content = chroma_db.as_retriever(
    search_kwargs={
        'k': 2,
        'where_document': {'$contains': 'recurrent'},
        }
)

query = "대표적인 시퀀스 모델은 어떤 것들이 있나요?"
retrieved_docs = chroma_content.invoke(query)

print(f"์ฟผ๋ฆฌ: {query}")
print("๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ:")
for i, doc in enumerate(retrieved_docs, 1):
    print(f"-{i}-\n{doc.page_content} [์ถœ์ฒ˜: {doc.metadata['source']}]")
    print("-" * 100)

- Output

Query: 대표적인 시퀀스 모델은 어떤 것들이 있나요?
Search results:
-1-
1 Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural...he Transformer allows for significantly more parallelization and can reach a new state of the art in
[source: ./data/transformer.pdf]
----------------------------------------------------------------------------------------------------
-2-
In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the Berkeley-
Parse... 2016.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly
[source: ./data/transformer.pdf]
----------------------------------------------------------------------------------------------------
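
Metadata filters and body-content filters can also be combined in a single retriever, and Chroma accepts logical operators such as $and inside where_document. A minimal sketch under those assumptions:

# Combine a metadata filter with two body-content conditions
chroma_combined = chroma_db.as_retriever(
    search_kwargs={
        'k': 2,
        'filter': {'source': './data/transformer.pdf'},
        'where_document': {'$and': [
            {'$contains': 'recurrent'},
            {'$contains': 'attention'},
        ]},
        }
)
print(len(chroma_combined.invoke(query)))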
