랭체인(LangChain)을 활용한 SQL 자동 생성기 구축

ddurru·2024년 6월 24일

랭체인 (LangChain)

목록 보기

2/2

⭐️ 목표 : 자연어를 통해 원하는 요청을 수행한 후, LLM이 자연어 요청을 SQL로 자동 변환하여 DB와 상호작용을 할 수 있는 SQL Agent 구현하기 ⭐️

01. Q&A 체인과 SQL 에이전트란 무엇일까?

	Q&A 체인 (Chain)	SQL 에이전트 (Agent)
기능	사용자가 SQL DB에 질문을 하면 DB 조회를 통해 답변을 제공하며, 특정 데이터에 대한 단순한 질문에 대한 답변을 얻고자 할 때 유용함	사용자의 질문에 답변하기 위해 필요할 때마다 DB를 여러 번 조회할 수 있으며, 복잡한 질문에 답변을 얻고자 할 때 유용함 (특히 여러 단계의 데이터 조회와 처리 과정이 필요할 경우)
과정	예시 질문 : "가장 많이 팔린 제품은 무엇인가요?"	예시 질문 : "지난 해 월별로 가장 많이 팔린 제품은 무엇인가요?"
	(1) 질문을 SQL 쿼리로 변환	(1) 질문을 이해하고 필요한 정보를 얻기 위해 여러 개의 SQL 쿼리를 작성
	(2) DB를 조회하여 답변 받음	(2) 여러 번 DB 조회
	(3) 사용자에게 답변 제공	(3) 각 조회 결과를 종합하여 최종 답변 만들기
		(4) 사용자에게 답변 제공
결과	"가장 많이 팔린 제품은 ABC입니다."	"지난 해 월별로 가장 많이 팔린 제품은 1월에 ABC, 2월에 DEF, ... 입니다."

결국 Q&A 체인은 단일 조회로 간단한 질문에 답변하고, 에이전트는 여러 번 조회를 통해 복잡한 질문에 답변을 하는 것으로 이를 활용하게 된다면 SQL DB에 있는 데이터에 대해 질문하고 자연어로 된 답변을 받을 수 있을 것이다.

02. SQL 체인(Chain) 및 에이전트(Agent)의 동작 과정

과정은 다음과 같다.

Convert question to SQL query : LLM 모델이 사용자의 입력을 바탕으로 한 질문을 SQL 쿼리로 변환
Execute SQL query : SQL 쿼리를 실행
Answer the question : LLM 모델은 쿼리 결과를 사용하여 사용자의 입력에 응답

03. [실습] 랭체인을 활용한 SQL 쿼리 자동 생성기 만들기 (Google Colab 환경)

Google Colab 환경에서 진행되었으며 과정을 다음과 같이 작성해 두었다!

1. Sample SQLite DB 다운로드

실습을 위해 샘플 DB 파일을 다운로드 하여 구축해 보도록 하자. 
References : https://www.sqlitetutorial.net/sqlite-sample-database/

* 위 데이터베이스 다이어그램은 DB 테이블과 관계에 대한 이미지

# 파일 다운로드
!wget https://www.sqlitetutorial.net/wp-content/uploads/2018/03/chinook.zip -O chinook.zip
# 압축 풀기  
!unzip chinook.zip

2. Q&A 체인 생성

a. 랭체인 라이브러리 설치 
우선 랭체인을 실행하기 위한 라이브러리 환경을 구축해 두자.

pip install langchain langchain-community langchain-openai

b. DB 불러오기
정상적으로 데이터가 로드 되었는지 환경 내 파이썬 버전을 확인한 후, 쿼리를 통해 데이터를 조회해 보자. 

DB 확인을 위한 코드 작성 부분이다.
chinook DB와 SQLite를 연결한 후, 쿼리 실행 결과를 pandas 데이터프레임으로 변환하여 출력을 해 보도록 하자.

import pandas as pd
import sqlite3
from langchain_community.utilities import SQLDatabase
from langchain.chains import create_sql_query_chain
from langchain_openai import ChatOpenAI
 
OPENAI_KEY = "_YOUR_OPENAI_API_KEY_"
 
# 데이터베이스 연결
db = SQLDatabase.from_uri("sqlite:///chinook.db")
 
# 데이터베이스 정보 출력
print(db.dialect)
print(db.get_usable_table_names())
 
# sqlite3 연결 설정
conn = sqlite3.connect('chinook.db')
 
# 쿼리 실행 및 결과를 pandas 데이터프레임으로 변환
query = "SELECT * FROM artists LIMIT 10;"
df = pd.read_sql_query(query, conn)
 
# 결과 출력
print(df)
 
# 연결 종료
conn.close()

c. LLM과의 연동 및 SQL 쿼리 변환  
이제 환경을 세팅을 하였으니 LLM에 연결하여 질문을 SQL로 변환해보도록 하자. 

SQL 체인 또는 에이전트의 첫 번째 단계는 사용자의 입력을 가져와 SQL의 쿼리로 변환하는 과정이 필요하다. 
이를 위해 랭체인은내장 체인인 create_sql_query_chain이 제공된다. 

더 세부적으로 설명을 하자면 사용자가 제공한 프롬프트와 SQL DB를 사용하여 SQL 쿼리를 생성하는 체인을 만드는 데 사용이 된다. 
create_sql_query_chain 함수를 사용할 경우, LLM과 SQL DB를 인수로 체인을 만들며, 해당 체인은 사용자의 질문을 분석하여 SQL 질의를 생성하고, 생성된 질의를 기반으로 SQL DB에서 실행하여 결과를 반환한다.

import sqlite3
from langchain_community.utilities import SQLDatabase
from langchain.chains import create_sql_query_chain
from langchain_openai import ChatOpenAI
 
OPENAI_KEY = "_YOUR_OPENAI_API_KEY_"
 
# 데이터베이스 연결
db = SQLDatabase.from_uri("sqlite:///chinook.db")
 
# 언어 모델 설정
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0, openai_api_key=OPENAI_KEY)
# 체인 생성
chain = create_sql_query_chain(llm, db)
 
# 질문 결과 및 실행
response = chain.invoke({"question": "employees는 몇명이 있어?"})
print(response)
 
# SQL 쿼리 결과 실행 및 결과 출력
result = db.run(response)
print(result)

LangSmith 추적을 통해 체인이 수행하는 과정을 이해하고, response 쿼리를 만들어 낸 실제 프롬프트를 확인해 보도록 하자. 

위 코드에 다음과 같은 코드를 추가해 보자.

chain.get_prompts()[0].pretty_print()

"""
You are a SQLite expert. Given an input question, first create a syntactically correct SQLite query to run, then look at the results of the query and return the answer to the input question.
Unless the user specifies in the question a specific number of examples to obtain, query for at most 5 results using the LIMIT clause as per SQLite. You can order the results to return the most informative data in the database.
Never query for all columns from a table. You must query only the columns that are needed to answer the question. Wrap each column name in double quotes (") to denote them as delimited identifiers.
Pay attention to use only the column names you can see in the tables below. Be careful to not query for columns that do not exist. Also, pay attention to which column is in which table.
Pay attention to use date('now') function to get the current date, if the question involves "today".

Use the following format:

Question: Question here
SQLQuery: SQL Query to run
SQLResult: Result of the SQLQuery
Answer: Final answer here

Only use the following tables:
{table_info}

Question: {input}
""" 

이를 통해 질문에 답변하기 위해 어떤 절차를 따르는지 설명하는 가이드라인을 확인할 수 있었다.
해당 가이드라인은 입력된 질문에 따라 적절한 SQL 쿼리를 생성하고, 쿼리 결과를 바탕으로 질문에 대한 답변을 제공하는 방법을 제시하는 것이다. 


d. SQL 쿼리 실행 
해당 부분은 SQL 쿼리 체인 생성에 있어서 가장 중요한 부분이다. 만약 실제로 DB와 연동을 했을 경우, 체인이 실제로 쿼리를 DB로부터 실행하고자 하기 때문에 주의해서 사용해야 과정이다!  

QuerySQLDatabaseTool을 활용하여 체인이 쿼리 실행을 쉽게 추가할 수 있도록 한다.

import sqlite3
from langchain_community.utilities import SQLDatabase
from langchain.chains import create_sql_query_chain
from langchain_openai import ChatOpenAI
from langchain_community.tools.sql_database.tool import QuerySQLDataBaseTool
 
OPENAI_KEY = "_YOUR_OPENAI_API_KEY_"
 
db = SQLDatabase.from_uri("sqlite:///chinook.db")
 
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0, openai_api_key=OPENAI_KEY)
 
execute_query = QuerySQLDataBaseTool(db=db)
write_query = create_sql_query_chain(llm, db)
chain = write_query | execute_query
result = chain.invoke({"question": "employees는 몇명이 있어?"})
 
print(result)

e. 질문에 답하기
d번까지의 과정을 통해 쿼리를 자동으로 생성하고 실행할 수 있는 방법을 공부해 왔다.

이제 질문과 SQL 쿼리 결과를 결합하여 LLM에 한 번 더 전달하는 과정을 통해 최종 답변을 생성하는 과정을 만들어 보자.

from operator import itemgetter
 
import sqlite3
from langchain_community.utilities import SQLDatabase
from langchain.chains import create_sql_query_chain
from langchain_openai import ChatOpenAI
from langchain_community.tools.sql_database.tool import QuerySQLDataBaseTool
 
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
 
OPENAI_KEY = "_YOUR_OPENAI_API_KEY_"
 
db = SQLDatabase.from_uri("sqlite:///chinook.db")
 
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0, openai_api_key=OPENAI_KEY)
 
execute_query = QuerySQLDataBaseTool(db=db)
write_query = create_sql_query_chain(llm, db)
 
answer_prompt = PromptTemplate.from_template(
    """Given the following user question, corresponding SQL query, and SQL result, answer the user question.
 
Question: {question}
SQL Query: {query}
SQL Result: {result}
Answer: """
)
 
answer = answer_prompt | llm | StrOutputParser()
chain = (
    RunnablePassthrough.assign(query=write_query).assign(
        result=itemgetter("query") | execute_query
    )
    | answer
)
 
result = chain.invoke({"question": "employees는 몇명이 있어?"})
print(result)

3. SQL Agent 생성

랭체인은 SQL DB와의 상호 작용을 위해 유연한 방법을 제공하는 SQL Agent를 추가할 수 있다.


SQL Agent의 장점은 다음과 같다!

- 데이터베이스의 스키마뿐만 아니라 데이터베이스의 내용을 바탕으로 한 질문에도 답할 수 있음
- (특정 테이블을 설명하는 것과 같은) 생성된 쿼리를 실행하여 오류를 복구하고, 추적하여 정확하게 다시 생성할 수 있음
- 여러 의존 쿼리가 필요한 질문에 대한 답변이 가능함
- 관련 테이블의 스키마만 고려하여 토큰을 절약할 수 있음


따라서 복잡한 SQL DB 작업을 보다 효율적이고 유연하게 처리할 수 있도록 SQL Agent를 추가해 보도록 하자. 
에이전트를 초기화하기 위해 create_sql_agent 함수를 사용하도록 한다.

import re
import io
import contextlib
from langchain_community.utilities import SQLDatabase
from langchain_openai import ChatOpenAI
from langchain_community.agent_toolkits import create_sql_agent
 
OPENAI_KEY = "_YOUR_OPENAI_API_KEY_"
 
db = SQLDatabase.from_uri("sqlite:///chinook.db")
 
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0, openai_api_key=OPENAI_KEY)
 
agent_executor = create_sql_agent(llm, db=db, agent_type="openai-tools", verbose=True)
 
question = "_YOUR_QUESTION_"
 
# 표준 출력을 StringIO 객체로 리디렉션하여 로그 캡처
log_stream = io.StringIO()
with contextlib.redirect_stdout(log_stream):
    result = agent_executor.invoke({"input": question})
 
# 로그 내용 가져오기
log_contents = log_stream.getvalue()
 
# 로그 메시지에서 SQL 쿼리를 추출하기 위해 정규 표현식 사용
query_match = re.search(r"Invoking: `sql_db_query` with `\{'query': '(.*?)'\}`", log_contents)
 
if query_match:
    query = query_match.group(1)
else:
    query = "SQL 쿼리를 추출할 수 없습니다."
 
# 결과 출력 형식 설정
output = result['output']
formatted_output = (
    f"전체 로그:\n{log_contents}\n"
    f"질문: {question}\n"
    f"SQL 쿼리: {query}\n"
    f"답변: {output}"
)
 
print(formatted_output)

💬 Q1 = "가장 많은 앨범을 발매한 아티스트는 누구인가요?"

💬 Q2 = "playlisttrack 테이블을 한국어로 설명하시오."

위 과정을 통해 자연어 요청을 받아 이를 SQL로 자동 변환하여 DB와의 상호작용을 할 수 있는 SQL Agent를 구현할 수 있었다!

이를 활용하여 서비스에 적용할 수 있는 방안에 대해서 생각해 볼 수 있도록 하자 🙃

🔗

langchain quickstart
[실습] SQL Quickstart 예제 (Text to SQL)
Langchain을 이용한 LLM 애플리케이션 구현
create_sql_query_chain_api
QuerySQLDatabaseTool
create_sql_agent_api

ddurru

2024.04.15 ~

이전 포스트