[LLM] LangChain Chat with Your Data - (5) Question Answering

gunny · January 25, 2024

This post is my personal summary of the DeepLearning.AI course LangChain Chat with Your Data.

LangChain Chat with Your Data

(5) Question Answering

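So far we have looked at how to retrieve the documents relevant to a given question.
The next step is to take those documents together with the original question, pass both to a language model, and ask it to answer the question.

Question answering over retrieved documents comes after the earlier stages: the data has been loaded and stored, split into chunks, and the chunks relevant to the question have been retrieved.
Now the retrieved documents are passed to the language model to get an answer.

The general flow is as follows.
When a question comes in, we look up the relevant documents, then pass those splits to the language model along with a system prompt and the question, and get back an answer.

By default, all chunks are passed into the same context window in a single call to the language model. This approach has pros and cons; the main drawback is that when there are too many documents, they may not all fit into one context window.

There are three methods that work around this limit: MapReduce, Refine, and MapRerank. MapReduce and Refine are covered below.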

◼︎ Question Answering


import os
import openai


from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

openai.api_key = os.environ['OPENAI_API_KEY']

This loads the environment variables.

from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
persist_directory = 'data/chroma/'

embedding = OpenAIEmbeddings()
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)

print(vectordb._collection.count())

##output
152

This loads the vector database and the embedding model.

The collection holds the same 152 documents as before.

question = 'What are major topics for this class?'
docs = vectordb.similarity_search(question, k=3)
print(len(docs))
print(docs)

##output
3
[Document(page_content="statistics for a while or maybe algebra, we'll go over those in the discussion sections as a \nrefresher for those of you that want one.  \nLater in this quarter, we'll also use the disc ussion sections to go over extensions for the \nmaterial that I'm teaching in the main lectur es. So machine learning is a huge field, and \nthere are a few extensions that we really want  to teach but didn't have time in the main \nlectures for.", metadata={'page': 8, 'source': 'data/MachineLearning-Lecture01.pdf'}),
 Document(page_content="middle of class, but because there won't be video you can safely sit there and make faces \nat me, and that won't show, okay?  \nLet's see. I also handed out this — ther e were two handouts I hope most of you have, \ncourse information handout. So let me just sa y a few words about parts of these. On the \nthird page, there's a section that says Online Resources.  \nOh, okay. Louder? Actually, could you turn up the volume? Testing. Is this better? \nTesting, testing. Okay, cool. Thanks.", metadata={'page': 4, 'source': 'data/MachineLearning-Lecture01.pdf'}),
 Document(page_content="So all right, online resources. The class has a home page, so it's in on the handouts. I \nwon't write on the chalkboard — http:// cs229.stanford.edu. And so when there are \nhomework assignments or things like that, we  usually won't sort of — in the mission of \nsaving trees, we will usually not give out many handouts in class. So homework \nassignments, homework solutions will be posted online at the course home page.  \nAs far as this class, I've also written, a nd I guess I've also revised every year a set of \nfairly detailed lecture notes that cover the te chnical content of this  class. And so if you \nvisit the course homepage, you'll also find the detailed lecture notes that go over in detail \nall the math and equations and so on  that I'll be doing in class.  \nThere's also a newsgroup, su.class.cs229, also written on the handout. This is a \nnewsgroup that's sort of a forum for people in  the class to get to  know each other and \nhave whatever discussions you want to ha ve amongst yourselves. So the class newsgroup \nwill not be monitored by the TAs and me. But this is a place for you to form study groups \nor find project partners or discuss homework problems and so on, and it's not monitored \nby the TAs and me. So feel free to ta lk trash about this class there.  \nIf you want to contact the teaching staff, pl ease use the email address written down here, \ncs229-qa@cs.stanford.edu. This goes to an acc ount that's read by all the TAs and me. So", metadata={'page': 5, 'source': 'data/MachineLearning-Lecture01.pdf'})]

As a first check that retrieval works, we run a similarity search for the question 'What are major topics for this class?'.
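To make the idea of a top-k similarity search concrete, here is a minimal sketch. It assumes cosine similarity as the metric and uses made-up 2-dimensional vectors in place of real embeddings; the metric and index structure the vector store actually uses may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k document vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical 2-d "embeddings"; real embeddings have far more dimensions.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
query = [1.0, 0.05]
print(top_k(query, docs, k=3))
```

The vector pointing in a very different direction from the query (index 2) is the one left out of the top three.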

import datetime

current_date = datetime.datetime.now().date()
if current_date < datetime.date(2023, 9,2):
    llm_name = 'gpt-3.5-turbo-0301'
else:
    llm_name = 'gpt-3.5-turbo'
    
    
print(llm_name)

#output 
gpt-3.5-turbo

Next we initialize the language model used to answer the question. Here we use the chat model GPT-3.5 with temperature set to 0.

from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model_name=llm_name, temperature=0)

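The OpenAI API accepts temperature values from 0 to 2. The closer the value is to 0, the less the output varies, and it generally gives the most faithful and reliable answers, so set it to 0 when you want answers grounded in facts.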
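The effect of temperature can be illustrated with a small sketch. It assumes the common formulation in which raw scores (logits) are divided by the temperature before a softmax; the logits below are made up, and temperature 0 is the limiting case in which the top-scoring token is always chosen.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops, the probability mass concentrates on the highest-scoring token, which is why low temperatures give less varied output.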

from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm, 
    retriever = vectordb.as_retriever()
)

Next we import the RetrievalQA chain, which performs question answering backed by a retrieval step.

result = qa_chain({'query' : question})
result

#output

{'query': 'What are major topics for this class?',
 'result': 'The major topics for this class include machine learning, statistics, algebra, linear algebra, and extensions of machine learning.'}

We create the chain by passing in the language model and the vector database as a retriever, then call it with the question we want to ask as the query.

result['result']

##output
'The major topics for this class include machine learning, statistics, algebra, linear algebra, and extensions of machine learning.'

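The result shows that the main topic of this class is machine learning.
It also appears that statistics, algebra, and extensions of machine learning will be covered in the class's discussion sections.

To better understand what is happening under the hood, and to see which parts can be adjusted, we can look at the prompt that takes the documents and the question and passes them to the language model.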

◼︎ prompt - RetrievalQA chain types

from langchain.prompts import PromptTemplate

template = """Use the following pieces of context to answer the question at the end.\
    If you don't know the answer, just say that you don't know, don't try to make up an answer. \
    Use three sentences maximum. Keep the answer as concise as possible. \
    Always say "thanks for asking!" at the end of the answer. 
{context}
Question: {question}
Helpful Answer:"""

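First, we define a prompt template.

It contains some instructions on how to use the following pieces of context, plus two placeholders: {context}, where the retrieved documents go, and {question}, where the question goes.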
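How the two placeholders get filled can be sketched with plain Python string formatting. This is only a stand-in for what the prompt template does inside the chain; the shortened template and the two chunks below are made up for illustration.

```python
# Shortened, hypothetical version of the template with the same placeholders.
template = (
    "Use the following pieces of context to answer the question at the end.\n"
    "{context}\n"
    "Question: {question}\n"
    "Helpful Answer:"
)

# Hypothetical retrieved chunks; in the chain these come from the retriever.
chunks = [
    "I also assume familiarity with basic probability and statistics.",
    "We'll go over it in the review sections.",
]

# The chunks are joined into one string and substituted for {context}.
prompt = template.format(
    context="\n\n".join(chunks),
    question="Is probability a class topic?",
)
print(prompt)
```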

QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

With this template we can create a new RetrievalQA chain.
We use the same language model and the same vector database as before, but pass a few new arguments.

# Run_chain

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

Setting return_source_documents to True makes the chain return the source documents,
which makes it easy to inspect what was retrieved. We also pass in the QA_CHAIN_PROMPT defined above.

question = 'Is probability a class topic?'

Now let's ask a new question: is probability a class topic?

result = qa_chain({'query' : question})
result

##output
{'query': 'Is probability a class topic?',
 'result': 'Yes, probability is a class topic. Thanks for asking!',
 'source_documents': [Document(page_content="of this class will not be very program ming intensive, although we will do some \nprogramming, mostly in either MATLAB or Octa ve. I'll say a bit more about that later.  \nI also assume familiarity with basic proba bility and statistics. So most undergraduate \nstatistics class, like Stat 116 taught here at Stanford, will be more than enough. I'm gonna \nassume all of you know what ra ndom variables are, that all of you know what expectation \nis, what a variance or a random variable is. And in case of some of you, it's been a while \nsince you've seen some of this material. At some of the discussion sections, we'll actually \ngo over some of the prerequisites, sort of as  a refresher course under prerequisite class. \nI'll say a bit more about that later as well.  \nLastly, I also assume familiarity with basi c linear algebra. And again, most undergraduate \nlinear algebra courses are more than enough. So if you've taken courses like Math 51, \n103, Math 113 or CS205 at Stanford, that would be more than enough. Basically, I'm \ngonna assume that all of you know what matrix es and vectors are, that you know how to \nmultiply matrices and vectors and multiply matrix and matrices, that you know what a matrix inverse is. If you know what an eigenvect or of a matrix is, that'd be even better. \nBut if you don't quite know or if you're not qu ite sure, that's fine, too. We'll go over it in \nthe review sections.", metadata={'page': 4, 'source': 'data/MachineLearning-Lecture01.pdf'}),
  Document(page_content='Instructor (Andrew Ng) :Yeah, yeah. I mean, you’re asking about overfitting, whether \nthis is a good model. I thi nk let’s – the thing’s you’re mentioning are maybe deeper \nquestions about learning algorithms  that we’ll just come back to later, so don’t really \nwant to get into that right now. Any more questions? Okay.  \nSo this endows linear regression with a proba bilistic interpretati on. I’m actually going to \nuse this probabil – use this, sort of, probabilist ic interpretation in order to derive our next \nlearning algorithm, which will be our first classification algorithm. Okay? So you’ll recall \nthat I said that regression problems are where the variable Y that you’re trying to predict \nis continuous values. Now I’m actually gonna ta lk about our first cl assification problem, \nwhere the value Y you’re trying to predict will be discreet value. You can take on only a \nsmall number of discrete values and in th is case I’ll talk about binding classification \nwhere Y takes on only two values, right? So you  come up with classi fication problems if \nyou’re trying to do, say, a medical diagnosis and try to decide based on some features that \nthe patient has a disease or does not have a di sease. Or if in the housing example, maybe \nyou’re trying to decide will this house sell in the next six months or not and the answer is \neither yes or no. It’ll either be  sold in the next six months or it won’t be. Other standing', metadata={'page': 10, 'source': 'data/MachineLearning-Lecture03.pdf'}),
  Document(page_content="statistics for a while or maybe algebra, we'll go over those in the discussion sections as a \nrefresher for those of you that want one.  \nLater in this quarter, we'll also use the disc ussion sections to go over extensions for the \nmaterial that I'm teaching in the main lectur es. So machine learning is a huge field, and \nthere are a few extensions that we really want  to teach but didn't have time in the main \nlectures for.", metadata={'page': 8, 'source': 'data/MachineLearning-Lecture01.pdf'}),
  Document(page_content='come back to this again. Any questions a bout this? Actually, let me clean up another \ncouple of boards and then I’ll see what questions you have.  \nOkay. Any questions? Yeah?  \nStudent: You are, I think here you try to measure the likelihood of your nice of theta by a \nfraction of error, but I think it’s that you measure because it depends on the family of \ntheta too, for example. If you have a lot of  parameters [inaudible] or fitting in?', metadata={'page': 9, 'source': 'data/MachineLearning-Lecture03.pdf'})]}

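Looking at what is inside the result,

'Yes, probability is a class topic. Thanks for asking!'
the answer is that yes, probability is a class topic, and it ends with the closing phrase the prompt asked for.

To get a better intuition for where this answer comes from, let's look at some of the returned source documents.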

result['source_documents'][0]

##output

Document(page_content="of this class will not be very program ming intensive, although we will do some \nprogramming, mostly in either MATLAB or Octa ve. I'll say a bit more about that later.  \nI also assume familiarity with basic proba bility and statistics. So most undergraduate \nstatistics class, like Stat 116 taught here at Stanford, will be more than enough. I'm gonna \nassume all of you know what ra ndom variables are, that all of you know what expectation \nis, what a variance or a random variable is. And in case of some of you, it's been a while \nsince you've seen some of this material. At some of the discussion sections, we'll actually \ngo over some of the prerequisites, sort of as  a refresher course under prerequisite class. \nI'll say a bit more about that later as well.  \nLastly, I also assume familiarity with basi c linear algebra. And again, most undergraduate \nlinear algebra courses are more than enough. So if you've taken courses like Math 51, \n103, Math 113 or CS205 at Stanford, that would be more than enough. Basically, I'm \ngonna assume that all of you know what matrix es and vectors are, that you know how to \nmultiply matrices and vectors and multiply matrix and matrices, that you know what a matrix inverse is. If you know what an eigenvect or of a matrix is, that'd be even better. \nBut if you don't quite know or if you're not qu ite sure, that's fine, too. We'll go over it in \nthe review sections.", metadata={'page': 4, 'source': 'data/MachineLearning-Lecture01.pdf'})

◼︎ RetrievalQA chain type - MapReduce

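So far we have used the default technique, stuff,
which simply stuffs all the documents into the final prompt.
This is attractive because it needs only a single call to the language model.

Its limitation is that when there are too many documents, they may not all fit in the context window. Another technique for question answering over documents is MapReduce.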
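The stuff approach and its size limit can be sketched as follows. The function name and the word-count budget are hypothetical; a real model limits tokens rather than words, and the chain itself does not necessarily perform a check like this.

```python
def build_stuff_prompt(docs, question, max_words=50):
    """Join all docs into one prompt, or raise if it exceeds the budget.

    max_words is a hypothetical stand-in for the model's context limit.
    """
    context = "\n\n".join(docs)
    prompt = f"{context}\n\nQuestion: {question}\nHelpful Answer:"
    n_words = len(prompt.split())
    if n_words > max_words:
        raise ValueError(f"prompt has {n_words} words, budget is {max_words}")
    return prompt

docs = [
    "Probability is assumed as a prerequisite.",
    "Linear algebra is reviewed in the discussion sections.",
]
print(build_stuff_prompt(docs, "Is probability a class topic?"))

# With many more documents the single prompt no longer fits the budget.
try:
    build_stuff_prompt(docs * 20, "Is probability a class topic?")
except ValueError as err:
    print("too long:", err)
```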

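In the MapReduce technique, each individual document is first sent to the language model on its own to get an initial answer. Those answers are then combined into a final answer by one last call to the language model.

This involves more calls to the language model, but it has the advantage of working over arbitrarily many documents.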
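The MapReduce flow can be sketched in plain Python. The prompts are simplified and `fake_llm` is a hypothetical stand-in for a real model, so this shows only the shape of the calls, not the prompts the library actually sends.

```python
def map_reduce_answer(llm, docs, question):
    """Ask the question of each doc separately, then combine the answers."""
    # Map step: one LLM call per document.
    partial = [llm(f"Context: {doc}\nQuestion: {question}") for doc in docs]
    # Reduce step: one final call over the partial answers.
    summaries = "\n".join(partial)
    return llm(f"Summaries:\n{summaries}\nQuestion: {question}")

calls = []

def fake_llm(prompt):
    """Hypothetical stand-in for a model: records the prompt, returns a stub."""
    calls.append(prompt)
    return f"answer #{len(calls)}"

docs = ["chunk A", "chunk B", "chunk C", "chunk D"]
final = map_reduce_answer(fake_llm, docs, "Is probability a class topic?")
print(final, "after", len(calls), "calls")
```

In this sketch, four documents lead to four map calls plus one reduce call, and no map call ever sees more than one document.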


qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever = vectordb.as_retriever(),
    chain_type= 'map_reduce'
)

However, running the previous question through this chain shows that it is much slower and that the result is actually worse.


result = qa_chain_mr({'query' : question})
result

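The chain answers that, based on the given portions of the document, there is no clear answer to this question.

This can happen because each document is answered individually. If the information is spread across two documents, it is never all in the same context.

The MapReduce chain actually involves four separate calls to the language model, one per retrieved document. Clicking into any of these calls shows the input and output for each document.

After each document has been processed, the outputs are combined in a final chain, the Stuffed Documents chain, which puts all of the responses into one last call. Its system message contains the four summaries from the previous documents, followed by the user question, and the answer is produced from that.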


qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type='refine'
)

Setting the chain type to refine does a similar job. This time the chain invokes the Refine Documents chain, which involves four sequential calls to an LLM chain.

result = qa_chain_mr({'query':question})
result

##output

{'query': 'Is probability a class topic?',
 'result': 'Based on the new context provided, it is not clear whether probability is a class topic. The conversation between the instructor and the student does not directly mention probability. Therefore, the original answer remains unchanged.'}

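Here we can see the prompt just before it is sent to the language model, with a system message made up of a few pieces.

The next part, all of the text here, is one of the documents we retrieved. After that comes the user question, and then the answer.

Going back up, we can look at the next call to the language model. The final prompt sent to the model here is a sequence that combines the previous response with new data and then asks for an improved response.

It contains the original user question, then the existing answer from the previous call, and gives the model the opportunity to refine that answer only if needed.

That much is part of the prompt template and its instructions; the rest is a retrieved document, the second one in the list. At the end there are additional instructions asking the model to improve the original answer in light of the new context so that it answers the question better.

This is only the answer after the second step, though. The chain runs four times and goes over all of the documents before arriving at the final answer.

The final answer described here is that the class assumes familiarity with basic probability and statistics but also has review sections to refresh the prerequisites, which is a better result than the MapReduce chain gave. The output reproduced above is less conclusive.

This is because the refine chain, although sequential, lets information be combined and carries more information forward from step to step than the MapReduce chain does.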
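The sequential refine loop can be sketched like this. The prompt wording is simplified and `fake_llm` is a hypothetical stand-in for a real model; the point is only that each call receives the previous answer together with one new document.

```python
def refine_answer(llm, docs, question):
    """Answer from the first doc, then refine doc by doc (docs must be non-empty)."""
    answer = llm(f"Context: {docs[0]}\nQuestion: {question}")
    for doc in docs[1:]:
        answer = llm(
            f"Question: {question}\n"
            f"Existing answer: {answer}\n"
            f"New context: {doc}\n"
            "Refine the existing answer only if the new context helps."
        )
    return answer

prompts = []

def fake_llm(prompt):
    """Hypothetical stand-in for a model: records the prompt, returns a stub."""
    prompts.append(prompt)
    return f"draft {len(prompts)}"

docs = ["chunk A", "chunk B", "chunk C", "chunk D"]
final = refine_answer(fake_llm, docs, "Is probability a class topic?")
print(final)
print(prompts[1])  # the second call carries the first draft forward
```

Unlike the map step of MapReduce, every call after the first sees the answer built from the earlier documents, at the cost of the calls having to run one after another.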

◼︎ RetrievalQA limitations

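The QA chain as it stands cannot remember the conversation.
The answer mentions that probability is a prerequisite, so let's ask a follow-up question about why such prerequisites are needed.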


qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever = vectordb.as_retriever()
)

question = 'Is probability a class topic?'

result =  qa_chain({'query': question})
result 

##output

{'query': 'Is probability a class topic?',
 'result': 'Yes, probability is mentioned as a prerequisite for the class. The instructor assumes familiarity with basic probability and statistics.'}

Now let's ask the follow-up question.


question = 'whey are those prerequesites needed?'
result = qa_chain({'query' : question})
result

##output

{'query': 'whey are those prerequesites needed?',
 'result': 'The prerequisites are needed because the course assumes familiarity with basic probability and statistics, as well as basic linear algebra. These concepts are fundamental to understanding and applying machine learning algorithms. Without a solid understanding of probability, statistics, and linear algebra, it would be difficult to grasp the concepts and techniques taught in the course.'}

This is the answer we get.

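The answer described here is about basic knowledge of computer science and basic computer skills and principles, with no connection to the earlier exchange about probability. The output reproduced above does mention probability, but nothing in the chain carries it over from the previous question. This is because the chain we are using has no concept of state.

It does not remember what the previous question or the previous answer was.
To make it remember previous questions and answers, we need to introduce memory.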
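The idea of adding memory can be sketched as a thin wrapper that replays earlier turns in each new prompt. The class name, prompt layout, and `fake_llm` are all hypothetical; this is not how any particular library implements memory, only an illustration of keeping state outside a stateless call.

```python
class ConversationalQA:
    """Wraps a stateless answer function and replays earlier turns to it."""

    def __init__(self, llm):
        self.llm = llm
        self.history = []  # list of (question, answer) pairs

    def ask(self, question):
        past = "\n".join(f"Q: {q}\nA: {a}" for q, a in self.history)
        prompt = f"{past}\nQ: {question}\nA:" if past else f"Q: {question}\nA:"
        answer = self.llm(prompt)
        self.history.append((question, answer))
        return answer

def fake_llm(prompt):
    """Hypothetical stand-in for a model: reports how many questions it saw."""
    return f"saw {prompt.count('Q:')} question(s)"

chat = ConversationalQA(fake_llm)
print(chat.ask("Is probability a class topic?"))
print(chat.ask("Why are those prerequisites needed?"))
```

On the second call the prompt includes the first question and its answer, so a real model would have what it needs to resolve "those prerequisites".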