[Langchain] Retriever

📌 Vector Store-Based RAG Retriever

This post uses the same transformer.pdf file that was used in the Text Splitter post.

Document loading

from langchain_community.document_loaders import PyPDFLoader

# Initialize the PDF loader
pdf_loader = PyPDFLoader('./data/transformer.pdf')

# Load synchronously
pdf_docs = pdf_loader.load()
print(f'Number of PDF documents: {len(pdf_docs)}')
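
load() reads every page into memory at once. For large PDFs, the loader also exposes a lazy alternative; below is a minimal sketch using the loader's lazy_load() generator, which yields one page Document at a time.

# Lazy loading: pages are yielded one by one instead of all at once
page_count = 0
for doc in pdf_loader.lazy_load():
    page_count += 1  # process each page Document here
print(f'Pages streamed lazily: {page_count}')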

Text splitting

To split on token counts rather than character counts, the embedding model is loaded first and its tokenizer is used as the splitter's length function.

from langchain_huggingface.embeddings import HuggingFaceEmbeddings

# Create a Hugging Face embedding model
embeddings_huggingface = HuggingFaceEmbeddings(model_name="BAAI/bge-m3")

# Access the tokenizer directly (via the private _client attribute)
tokenizer = embeddings_huggingface._client.tokenizer

# Tokenize a sample sentence
text = "테스트 텍스트입니다."  # "This is a test text."
tokens = tokenizer(text)
print(tokens)

# Check the tokenizer settings
print(tokenizer.model_max_length)  # maximum token length
print(tokenizer.vocab_size)        # vocabulary size

- Output

{'input_ids': [0, 153924, 239355, 5826, 5, 2], 'attention_mask': [1, 1, 1, 1, 1, 1]}
8192
250002

A token-counting function (used as the splitter's length criterion)

# Function that counts tokens
def count_tokens(text):
    return len(tokenizer(text)['input_ids'])

# Count the tokens of the sample sentence
text = "테스트 텍스트입니다."
print(count_tokens(text))

- Output

6
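
Note that the count of 6 includes the special tokens the tokenizer adds automatically (the leading 0 and trailing 2 in the input_ids above). If only content tokens should be counted, Hugging Face tokenizers accept add_special_tokens=False; a small sketch:

# Count tokens without the special BOS/EOS tokens the tokenizer adds
def count_content_tokens(text):
    return len(tokenizer(text, add_special_tokens=False)['input_ids'])

print(count_content_tokens("테스트 텍스트입니다."))  # should print 4 for the sample sentence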

Creating the text splitter and splitting the documents

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Create the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    length_function=count_tokens,         # split based on token count
    separators=["\n\n", "\n", " ", ""],   # separators, tried recursively in order
)

# Split the documents
chunks = text_splitter.split_documents(pdf_docs)
print(f"Number of text chunks created: {len(chunks)}")
print(f"Length of each chunk: {[len(chunk.page_content) for chunk in chunks]}")
print(f"Token count of each chunk: {[count_tokens(chunk.page_content) for chunk in chunks]}")

- Output

์ƒ์„ฑ๋œ ํ…์ŠคํŠธ ์ฒญํฌ ์ˆ˜: 38
๊ฐ ์ฒญํฌ์˜ ๊ธธ์ด: [1378, 1796, 1831, 1857, 1292, 1609, 503, 1554, 1278, 1362, 1608, 833, 1418, 1680, 999, 1764, 1604, 539, 1219, 1645, 926, 1213, 1688, 716, 1409, 1626, 624, 1411, 1437, 913, 1493, 1337, 845, 812, 470, 438, 470, 441]
๊ฐ ์ฒญํฌ์˜ ํ† ํฐ ์ˆ˜: [336, 415, 405, 419, 327, 424, 127, 388, 294, 384, 411, 204, 419, 417, 226, 419, 395, 149, 390, 400, 221, 356, 411, 181, 394, 405, 188, 424, 399, 277, 420, 409, 250, 190, 131, 117, 131, 113]

# Inspect one chunk's text
print(chunks[2].page_content)

- Output

1 Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks
in particular, ...
The Transformer allows for significantly more parallelization and can reach a new state of the art in

1. Initializing the vector store

  • uses Chroma
  • indexes with cosine distance

from langchain_chroma import Chroma

# Create the Chroma vector store
chroma_db = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings_huggingface,    # use the Hugging Face embeddings
    collection_name="db_transformer",    # collection name
    persist_directory="./chroma_db",
    collection_metadata={'hnsw:space': 'cosine'},  # choose among l2, ip, cosine
)

# Inspect the data currently stored in the collection
chroma_db.get()

- Output

{'ids': ['...' ,
	...],
 'embeddings': None,
 'documents': ['Provided ...',
 	...],
 'uris': None,
 'data': None,
 'metadatas': [
 {'page': 0, 'page_label': 1, 'source': './data/transformer.pdf'},
 ...],
 ...
}
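
Because persist_directory is set, the collection is written to disk and can be reopened in a later session without re-embedding the documents. A minimal sketch, assuming the same embedding model is available:

# Reopen the persisted collection (no re-embedding needed)
chroma_db = Chroma(
    collection_name="db_transformer",
    embedding_function=embeddings_huggingface,
    persist_directory="./chroma_db",
)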

2. Top K

  • k : number of documents to return

chroma_k_retriever = chroma_db.as_retriever(
    search_kwargs={"k": 2},
)

query = "대표적인 시퀀스 모델은 어떤 것들이 있나요?"  # "What are the representative sequence models?"
retrieved_docs = chroma_k_retriever.invoke(query)

print(f"์ฟผ๋ฆฌ: {query}")
print("๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ:")
for i, doc in enumerate(retrieved_docs, 1):
    print(f"-{i}-\n{doc.page_content[:100]}...{doc.page_content[-100:]} [์ถœ์ฒ˜: {doc.metadata['source']}]")
    print("-" * 100)

- Output

Query: 대표적인 시퀀스 모델은 어떤 것들이 있나요?
Search results:
-1-
1 Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural...he Transformer allows for significantly more parallelization and can reach a new state of the art in [source: ./data/transformer.pdf]
----------------------------------------------------------------------------------------------------
-2-
In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the Berkeley-
Parse... 2016.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly [source: ./data/transformer.pdf]
----------------------------------------------------------------------------------------------------
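
The retriever interface hides the ranking scores. To inspect the raw cosine distances Chroma ranks by (lower means closer), the vector store can be queried directly; a short sketch using similarity_search_with_score:

# Same top-k search, but returning (document, cosine distance) pairs
for doc, dist in chroma_db.similarity_search_with_score(query, k=2):
    print(f"distance={dist:.4f} | {doc.page_content[:60]}...")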

3. Setting a score threshold

  • search_type : search strategy (similarity, mmr, similarity_score_threshold)
  • search_kwargs={'score_threshold': 0.5} : minimum score (only documents scoring 0.5 or higher are returned)

from langchain_community.utils.math import cosine_similarity

chroma_threshold_retriever = chroma_db.as_retriever(
    search_type='similarity_score_threshold',       # threshold on the cosine relevance score
    search_kwargs={'score_threshold': 0.5, 'k': 2},  # return only documents scoring 0.5 or higher
)

query = "대표적인 시퀀스 모델은 어떤 것들이 있나요?"
retrieved_docs = chroma_threshold_retriever.invoke(query)

print(f"์ฟผ๋ฆฌ: {query}")
print("๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ:")
for i, doc in enumerate(retrieved_docs, 1):
    score = cosine_similarity(
        [embeddings_huggingface.embed_query(query)], 
        [embeddings_huggingface.embed_query(doc.page_content)]
        )[0][0]
    print(f"-{i}-\n[์œ ์‚ฌ๋„: {score}]\n{doc.page_content[:100]}...{doc.page_content[-100:]}")
    print("-" * 100)

- Output

Query: 대표적인 시퀀스 모델은 어떤 것들이 있나요?
Search results:
-1-
[similarity: 0.5069071561705342]
1 Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural...he Transformer allows for significantly more parallelization and can reach a new state of the art in
----------------------------------------------------------------------------------------------------
-2-
[similarity: 0.5020665604450864]
In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the Berkeley-
Parse... 2016.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly
----------------------------------------------------------------------------------------------------
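
For a cosine-indexed collection, LangChain converts the distance into a relevance score in [0, 1] (roughly 1 - distance) before applying the threshold. The same thresholded scores can be obtained directly from the vector store; a sketch using similarity_search_with_relevance_scores:

# Thresholded search returning (document, relevance score) pairs
results = chroma_db.similarity_search_with_relevance_scores(
    query, k=2, score_threshold=0.5,
)
for doc, score in results:
    print(f"relevance={score:.4f} | {doc.page_content[:60]}...")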

4. MMR (Maximal Marginal Relevance) search

  • fetch_k : number of documents passed to the MMR algorithm
  • lambda_mult : diversity of the results (1 : minimum diversity, 0 : maximum diversity)

# MMR - considers diversity (the smaller lambda_mult, the more diverse the results)
chroma_mmr = chroma_db.as_retriever(
    search_type='mmr',
    search_kwargs={
        'k': 3,                 # number of documents to return
        'fetch_k': 8,           # number of documents passed to the MMR algorithm (fetch_k > k)
        'lambda_mult': 0.5,     # degree of diversity (1 = minimum, 0 = maximum diversity; default 0.5)
        },
)

query = "대표적인 시퀀스 모델은 어떤 것들이 있나요?"
retrieved_docs = chroma_mmr.invoke(query)

print(f"์ฟผ๋ฆฌ: {query}")
print("๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ:")
for i, doc in enumerate(retrieved_docs, 1):
    score = cosine_similarity(
        [embeddings_huggingface.embed_query(query)], 
        [embeddings_huggingface.embed_query(doc.page_content)]
        )[0][0]
    print(f"-{i}-\n[์œ ์‚ฌ๋„: {score}]\n{doc.page_content[:100]}...{doc.page_content[-100:]}")
    print("-" * 100)

- Output

Query: 대표적인 시퀀스 모델은 어떤 것들이 있나요?
Search results:
-1-
[similarity: 0.5069071561705342]
1 Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural...he Transformer allows for significantly more parallelization and can reach a new state of the art in
----------------------------------------------------------------------------------------------------
-2-
[similarity: 0.47915489021788504]
Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations
for ...ng
corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We
----------------------------------------------------------------------------------------------------
-3-
[similarity: 0.4709169133567156]
from our models and present and discuss examples in the appendix. Not only do individual attention
h..., according to the formula:
lrate = d−0.5
model · min(step_num−0.5, step_num · warmup_steps−1.5) (3)
----------------------------------------------------------------------------------------------------
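
The same MMR search is available directly on the vector store, which makes it easy to compare lambda_mult settings without rebuilding the retriever; a minimal sketch:

# Direct MMR call: compare a relevance-heavy and a diversity-heavy setting
for lam in (0.9, 0.1):
    docs = chroma_db.max_marginal_relevance_search(
        query, k=3, fetch_k=8, lambda_mult=lam,
    )
    print(f"lambda_mult={lam}: {[d.page_content[:40] for d in docs]}")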

5. Filtering by metadata

# Inspect the metadata
chunks[0].metadata

- Output

{'source': './data/transformer.pdf', 'page': 0, 'page_label': '1'}

# Filtering with the Document objects' metadata
chroma_metadata = chroma_db.as_retriever(
    search_kwargs={
        'filter': {'source': './data/transformer.pdf'},
        'k': 5, 
        }
)

query = "대표적인 시퀀스 모델은 어떤 것들이 있나요?"
retrieved_docs = chroma_metadata.invoke(query)

print(f"์ฟผ๋ฆฌ: {query}")
print("๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ:")
for i, doc in enumerate(retrieved_docs, 1):
    print(f"-{i}-\n{doc.page_content[:100]}...{doc.page_content[-100:]}\n[์ถœ์ฒ˜: {doc.metadata['source']}]")
    print("-" * 100)

- Output

Query: 대표적인 시퀀스 모델은 어떤 것들이 있나요?
Search results:
-1-
1 Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural...he Transformer allows for significantly more parallelization and can reach a new state of the art in
[source: ./data/transformer.pdf]
----------------------------------------------------------------------------------------------------
-2-
In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the Berkeley-
Parse... 2016.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly
[source: ./data/transformer.pdf]
----------------------------------------------------------------------------------------------------
...
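
Chroma's filter syntax also supports comparison operators ($eq, $gte, $lte, ...), so the search can be restricted by metadata values rather than exact matches. A sketch, assuming the integer page field shown above:

# Restrict the search to the first three pages with a comparison operator
chroma_pages = chroma_db.as_retriever(
    search_kwargs={
        'filter': {'page': {'$lte': 2}},
        'k': 3,
        }
)
print([d.metadata['page'] for d in chroma_pages.invoke(query)])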

6. Filtering on the page_content body

  • 'where_document': {'$contains': 'recurrent'} : search only among documents whose page_content contains 'recurrent'

# Filtering on page_content
chroma_content = chroma_db.as_retriever(
    search_kwargs={
        'k': 2,
        'where_document': {'$contains': 'recurrent'},
        }
)

query = "대표적인 시퀀스 모델은 어떤 것들이 있나요?"
retrieved_docs = chroma_content.invoke(query)

print(f"์ฟผ๋ฆฌ: {query}")
print("๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ:")
for i, doc in enumerate(retrieved_docs, 1):
    print(f"-{i}-\n{doc.page_content} [์ถœ์ฒ˜: {doc.metadata['source']}]")
    print("-" * 100)

- Output

Query: 대표적인 시퀀스 모델은 어떤 것들이 있나요?
Search results:
-1-
1 Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural...he Transformer allows for significantly more parallelization and can reach a new state of the art in
[source: ./data/transformer.pdf]
----------------------------------------------------------------------------------------------------
-2-
In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the Berkeley-
Parse... 2016.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly
[source: ./data/transformer.pdf]
----------------------------------------------------------------------------------------------------
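
Metadata filters and body-content filters can also be combined in a single retriever, and Chroma accepts logical operators such as $and inside where_document. A minimal sketch under those assumptions:

# Combine a metadata filter with two body-content conditions
chroma_combined = chroma_db.as_retriever(
    search_kwargs={
        'k': 2,
        'filter': {'source': './data/transformer.pdf'},
        'where_document': {'$and': [
            {'$contains': 'recurrent'},
            {'$contains': 'attention'},
        ]},
        }
)
print(len(chroma_combined.invoke(query)))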
