6_RAG

Jacob Kim·2024년 2월 2일

Naver project

Naver Project Week5

목록 보기

7/12

Retrieval-augmented generation (RAG)

Overview

What is RAG?

RAG는 비공개 또는 실시간 데이터를 추가하여 학습자의 지식을 보강하는 기법입니다.

인공신경망은 광범위한 주제에 대해 추론할 수 있지만, 그 지식은 학습된 특정 시점까지의 공개 데이터로 제한됩니다. 비공개 데이터나 모델의 마감일 이후에 도입된 데이터에 대해 추론할 수 있는 AI 애플리케이션을 구축하려면 모델에 필요한 특정 정보로 모델에 대한 지식을 보강해야 합니다. 적절한 정보를 가져와서 모델 프롬프트에 삽입하는 프로세스를 검색 증강 생성(RAG)이라고 합니다.

What's in this guide?

LangChain에는 RAG 애플리케이션을 구축하는 데 도움이 되도록 특별히 설계된 여러 구성 요소가 있습니다. 이러한 구성 요소에 익숙해지기 위해 텍스트 데이터 소스에 대한 간단한 질문-답변 애플리케이션을 구축해 보겠습니다. 특히, 릴리안 웡의 블로그 게시물 LLM 기반 자율 에이전트를 통해 QA 봇을 구축해 보겠습니다. 그 과정에서 일반적인 QA 아키텍처를 살펴보고, 관련 LangChain 구성 요소에 대해 논의하며, 고급 QA 기술을 위한 추가 리소스를 강조할 것입니다. 또한, 애플리케이션을 추적하고 이해하는 데 LangSmith가 어떻게 도움이 되는지 살펴볼 것입니다. 애플리케이션의 복잡성이 증가함에 따라 LangSmith는 점점 더 유용해질 것입니다.

Note 여기서는 비정형 데이터를 위한 RAG에 초점을 맞춥니다.

Architecture

일반적인 RAG 애플리케이션에는 두 가지 주요 구성 요소가 있습니다:

Indexing: 소스에서 데이터를 수집하고 인덱싱하기 위한 파이프라인. *이 작업은 보통 오프라인에서 이루어집니다.

Retrieval and generation: 런타임에 사용자 쿼리를 받아 인덱스에서 관련 데이터를 검색한 다음 이를 모델로 전달하는 실제 RAG 체인.

Raw data에서 답변에 이르는 가장 일반적인 전체 시퀀스는 다음과 같습니다:

Indexing

Load: 먼저 데이터를 로드해야 합니다. 이를 위해 문서 로더를 사용하겠습니다.
Split: 텍스트 분할기는 큰 '문서'를 작은 덩어리로 나눕니다. 큰 덩어리는 검색하기 어렵고 모델의 한정된 컨텍스트 창에서는 검색되지 않으므로 이 기능은 데이터를 색인하고 모델에 전달할 때 유용합니다.
Store: 나중에 검색할 수 있도록 분할을 저장하고 색인할 곳이 필요합니다. 이 작업은 대개 벡터스토어와 임베딩 모델을 사용해 수행합니다.

Retrieval and generation

Retrieve: 사용자 입력이 주어지면 Retriever를 사용하여 스토리지에서 관련 스플릿을 검색합니다.
Generate: ChatModel / LLM은 질문과 검색된 데이터가 포함된 프롬프트를 사용하여 답변을 생성합니다.

Setup

Dependencies

이 단계별 안내에서는 OpenAI 채팅 모델과 임베딩, Chroma 벡터 스토어를 사용하지만, 여기에 표시된 모든 내용은 모든 ChatModel 또는 LLM에서 작동합니다, 임베딩, 벡터스토어 또는 retrievers를 사용합니다.

필요 패키지 설치:

!pip install -U langchain openai chromadb langchainhub bs4 tiktoken

Collecting langchain
  Downloading langchain-0.0.348-py3-none-any.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 25.5 MB/s eta 0:00:00
Collecting openai
  Downloading openai-1.3.7-py3-none-any.whl (221 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 221.4/221.4 kB 18.3 MB/s eta 0:00:00
Collecting chromadb
  Downloading chromadb-0.4.18-py3-none-any.whl (502 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 502.4/502.4 kB 44.2 MB/s eta 0:00:00

OPENAI_API_KEY 입력하기

import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass()

Quickstart

릴리안 웡의 LLM 기반 자율 에이전트 블로그 게시물을 통해 QA 앱을 구축하고자 한다고 가정해 보겠습니다.

이를 위한 간단한 파이프라인을 약 20줄의 코드로 만들 수 있습니다:

import bs4
from langchain import hub # prompt examples
from langchain.chat_models import ChatOpenAI # LLM
from langchain.document_loaders import WebBaseLoader
from langchain.embeddings import OpenAIEmbeddings # load -> embedding
from langchain.schema import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

prompt

ChatPromptTemplate(input_variables=['context', 'question'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"))])

rag_chain.invoke("What is Task Decomposition?")

Task decomposition is a technique used to break down complex tasks into smaller and simpler steps. It can be done through various methods such as using prompting techniques, task-specific instructions, or human inputs. The goal is to make the task more manageable and facilitate the interpretation of the model's thinking process.

# cleanup
vectorstore.delete_collection()

Detailed walkthrough

위의 코드를 단계별로 살펴보고 무슨 일이 일어나고 있는지 실제로 이해해 보겠습니다.

Step 1. Load

먼저 블로그 게시물 콘텐츠를 로드해야 합니다. 이를 위해 소스에서 데이터를 Documents로 로드하는 객체인 DocumentLoader를 사용할 수 있습니다. Documents는 page_content(문자열) 및 metadata(딕셔너리) 속성을 가진 객체입니다.

이 경우 urllib와 BeautifulSoup을 사용하여 전달된 웹 URL을 로드하고 구문 분석하여 URL당 하나의 Document를 반환하는 WebBaseLoader를 사용하겠습니다. 우리는 bs_kwargs를 통해 BeautifulSoup 구문 분석기에 매개변수를 전달하여 html -> 텍스트 구문 분석을 사용자 정의할 수 있습니다(BeautifulSoup 문서 참조). 이 경우 클래스가 "post-content", "post-title" 또는 "post-header"인 HTML 태그만 관련이 있으므로 다른 태그는 모두 제거합니다.

from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs={
        "parse_only": bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    },
)
docs = loader.load()

len(docs[0].page_content)

# 42824

print(docs[0].page_content[:500])



      LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng


Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview#
In

Step 2. Split

로드된 문서의 길이가 42,000자가 넘습니다. 이는 많은 모델의 컨텍스트 창에 맞추기에는 너무 깁니다. 또한 컨텍스트 창에 전체 게시물을 넣을 수 있는 모델의 경우에도 경험적으로 모델은 매우 긴 프롬프트에서 관련 컨텍스트를 찾는 데 어려움을 겪습니다.

그래서 우리는 'Document'를 임베딩과 벡터 저장을 위해 청크로 분할할 것입니다. 이렇게 하면 런타임에 블로그 게시물에서 가장 관련성이 높은 부분만 검색하는 데 도움이 됩니다.

이 경우 문서를 1000자의 청크로 분할하고 청크 간에 200자의 겹침이 있습니다. 겹침은 문장과 관련된 중요한 문맥에서 문장이 분리될 가능성을 줄이는 데 도움이 됩니다. 각 청크가 적절한 크기가 될 때까지 공통 구분 기호(예: 줄 바꿈)를 사용하여 문서를 (재귀적으로) 분할하는 RecursiveCharacterTextSplitter를 사용합니다.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

len(all_splits)

# 66

len(all_splits[0].page_content)

# 969

all_splits[10].metadata

{'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/',
 'start_index': 7056}

Step 3. Store

이제 66개의 텍스트 청크가 메모리에 저장되었으므로 나중에 RAG 앱에서 검색할 수 있도록 이를 저장하고 색인을 생성해야 합니다. 이를 수행하는 가장 일반적인 방법은 각 문서 분할의 내용을 임베드하고 해당 임베딩을 벡터 스토어에 업로드하는 것입니다.

그런 다음, 분할을 검색하고 싶을 때 검색 쿼리를 가져와서 임베딩하고 일종의 '유사성' 검색을 수행하여 쿼리 임베딩과 가장 유사한 임베딩을 가진 저장된 분할을 식별합니다. 가장 간단한 유사성 측정은 코사인 유사성으로, 각 임베딩 쌍 사이의 각도의 코사인(매우 높은 차원의 벡터)을 측정합니다.

Chroma 벡터 저장소(vector store)와 OpenAIEmbeddings 모델을 사용하여 단일 명령으로 모든 문서 분할을 임베드하고 저장할 수 있습니다.

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())

Step 4. Retrieve

이제 실제 애플리케이션 로직을 작성해 보겠습니다. 사용자가 질문을 하고, 그 질문과 관련된 문서를 검색하고, 검색된 문서와 초기 질문을 모델에 전달하고, 마지막으로 답변을 반환하는 간단한 애플리케이션을 만들고자 합니다.

LangChain은 문자열 쿼리가 주어지면 관련 문서를 반환할 수 있는 인덱스를 래핑하는 Retriever 인터페이스를 정의합니다. 모든 리트리버는 공통 메서드인 get_relevant_documents()(및 그 비동기 변형인 aget_relevant_documents())를 구현합니다.

Retriever의 가장 일반적인 유형은 벡터 저장소의 유사성 검색 기능을 사용해 검색을 용이하게 하는 VectorStoreRetriever입니다. 모든 VectorStore는 Retriever로 쉽게 변환할 수 있습니다:

retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})

retrieved_docs = retriever.get_relevant_documents(
    "What are the approaches to Task Decomposition?"
)

len(retrieved_docs)

# 6

print(retrieved_docs[0].page_content)

Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote.
Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.

Step 5. Generate

질문을 받고, 관련 문서를 검색하고, 프롬프트를 구성하고, 이를 모델에 전달하고, 출력을 파싱하는 체인으로 이 모든 것을 합쳐 보겠습니다.

여기서는 gpt-3.5-turbo OpenAI 채팅 모델을 사용하겠지만, 어떤 LangChain 'LLM' 또는 'ChatModel'로 대체할 수 있습니다.

from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

We'll use a prompt for RAG that is checked into the LangChain prompt hub (here).

from langchain import hub

prompt = hub.pull("rlm/rag-prompt")

prompt

ChatPromptTemplate(input_variables=['context', 'question'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"))])

print(
    prompt.invoke(
        {"context": "filler context", "question": "filler question"}
    ).to_string()
)

Human: You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: filler question 
Context: filler context 
Answer:

우리는 체인을 정의하기 위해 LCEL Runnable 프로토콜을 사용하여 다음과 같이 할 수 있습니다.

컴포넌트와 함수를 투명한 방식으로 파이프하고
스트리밍, 비동기, 일괄 호출을 바로 사용할 수 있습니다.

from langchain.schema import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

for chunk in rag_chain.stream("What is Task Decomposition?"):
    print(chunk, end="", flush=True)

Task decomposition is a technique used to break down complex tasks into smaller and simpler steps. It involves transforming big tasks into multiple manageable tasks, allowing for easier interpretation and execution by autonomous agents or models. Task decomposition can be done through various methods, such as using prompting techniques, task-specific instructions, or human inputs.

Customizing the prompt

위와 같이 프롬프트 허브에서 프롬프트(예: 이 RAG 프롬프트)를 로드할 수 있습니다. 프롬프트는 쉽게 사용자 지정할 수도 있습니다:

from langchain.prompts import PromptTemplate

template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""
rag_prompt_custom = PromptTemplate.from_template(template)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt_custom
    | llm
    | StrOutputParser()
)

rag_chain.invoke("What is Task Decomposition?")

Task decomposition is a technique used to break down complex tasks into smaller and simpler steps. It involves transforming big tasks into multiple manageable tasks, allowing for a more systematic and organized approach to problem-solving. Thanks for asking!

Adding sources

LCEL을 사용하면 검색된 문서 또는 문서에서 특정 소스 메타데이터를 쉽게 반환할 수 있습니다:

from operator import itemgetter

from langchain.schema.runnable import RunnableParallel

rag_chain_from_docs = (
    {
        "context": lambda input: format_docs(input["documents"]),
        "question": itemgetter("question"),
    }
    | rag_prompt_custom
    | llm
    | StrOutputParser()
)
rag_chain_with_source = RunnableParallel(
    {"documents": retriever, "question": RunnablePassthrough()}
) | {
    "documents": lambda input: [doc.metadata for doc in input["documents"]],
    "answer": rag_chain_from_docs,
}

rag_chain_with_source.invoke("What is Task Decomposition")

{'documents': [{'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/',
   'start_index': 1585},
  {'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/',
   'start_index': 2192},
  {'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/',
   'start_index': 17804},
  {'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/',
   'start_index': 17414},
  {'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/',
   'start_index': 29630},
  {'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/',
   'start_index': 19373}],
 'answer': 'Task decomposition is a technique used to break down complex tasks into smaller and simpler steps. It involves transforming big tasks into multiple manageable tasks, allowing for a more systematic and organized approach to problem-solving. Thanks for asking!'}

Adding memory

과거 사용자 입력을 기억하는 상태 저장 애플리케이션을 만들고 싶다고 가정해 보겠습니다. 이를 지원하기 위해 필요한 작업은 크게 두 가지입니다.

과거 메시지를 전달할 수 있는 메시지 플레이스홀더를 체인에 추가합니다.
최신 사용자 쿼리를 가져와 채팅 기록의 맥락에서 리트리버에 전달할 수 있는 독립형 질문으로 재형성하는 체인을 추가합니다.

2부터 시작하겠습니다. 다음과 같이 condense question 체인을 구축할 수 있습니다:

from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder

condense_q_system_prompt = """Given a chat history and the latest user question \
which might reference the chat history, formulate a standalone question \
which can be understood without the chat history. Do NOT answer the question, \
just reformulate it if needed and otherwise return it as is."""
condense_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", condense_q_system_prompt),
        MessagesPlaceholder(variable_name="chat_history"),
        ("human", "{question}"),
    ]
)
condense_q_chain = condense_q_prompt | llm | StrOutputParser()

from langchain.schema.messages import AIMessage, HumanMessage

condense_q_chain.invoke(
    {
        "chat_history": [
            HumanMessage(content="What does LLM stand for?"),
            AIMessage(content="Large language model"),
        ],
        "question": "What is meant by large",
    }
)

What is the definition of "large" in the context of a language model?

condense_q_chain.invoke(
    {
        "chat_history": [
            HumanMessage(content="What does LLM stand for?"),
            AIMessage(content="Large language model"),
        ],
        "question": "How do transformers work",
    }
)

How do transformer models function?

이제 전체 QA 체인을 구축할 수 있습니다. 채팅 기록이 비어 있지 않을 때만 condense question chain을 실행하도록 라우팅 기능을 추가한 것을 주목하세요.

qa_system_prompt = """You are an assistant for question-answering tasks. \
Use the following pieces of retrieved context to answer the question. \
If you don't know the answer, just say that you don't know. \
Use three sentences maximum and keep the answer concise.\

{context}"""
qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", qa_system_prompt),
        MessagesPlaceholder(variable_name="chat_history"),
        ("human", "{question}"),
    ]
)


def condense_question(input: dict):
    if input.get("chat_history"):
        return condense_q_chain
    else:
        return input["question"]


rag_chain = (
    RunnablePassthrough.assign(context=condense_question | retriever | format_docs)
    | qa_prompt
    | llm
)

chat_history = []

question = "What is Task Decomposition?"
ai_msg = rag_chain.invoke({"question": question, "chat_history": chat_history})
chat_history.extend([HumanMessage(content=question), ai_msg])

second_question = "What are common ways of doing it?"
rag_chain.invoke({"question": second_question, "chat_history": chat_history})

AIMessage(content='Common ways of task decomposition include:\n\n1. Using Chain of Thought (CoT): CoT is a prompting technique that instructs the model to "think step by step" and decompose complex tasks into smaller and simpler steps. This approach utilizes more computation at test-time and sheds light on the model\'s thinking process.\n\n2. Prompting with LLM: Language Model (LLM) can be used to prompt the model with specific instructions, such as asking for the steps or subgoals for achieving a particular task. This simple prompting technique helps in breaking down the task into manageable components.\n\n3. Task-specific instructions: For certain tasks, task-specific instructions can be provided to guide the model in decomposing the task. For example, when writing a novel, the instruction "Write a story outline" can be given to help the model break down the task into smaller writing tasks.\n\n4. Human inputs: In some cases, human inputs can be used for task decomposition. Humans can provide their expertise and knowledge to identify the steps or subtasks involved in a complex task, which can then be used to guide the model\'s decomposition process.')

연습문제

WebBaseLoader 기반 QA
주어진 사이트의 정보를 이용해 QA Agent를 구성해보자
https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/

from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/",),
    bs_kwargs={
        "parse_only": bs4.SoupStrainer(
            class_=(# TODO)
        )
    },
)
docs = loader.load()

Jacob Kim

AI, Information and Communication, Electronics, Computer Science, Bio, Algorithms