LangChain | PandasDataFrameOutputParser | Output Parsers

박성문 · January 20, 2025

LangChain | Output Parsers


What is PandasDataFrameOutputParser?

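A Pandas DataFrame is a data structure widely used in Python
for data manipulation and analysis. It provides a comprehensive set of tools for working with structured data,
which makes it useful for tasks such as data cleaning, transformation, and analysis.
This output parser lets the user specify an arbitrary Pandas DataFrame and prompt an LLM to query it;
the data extracted from that DataFrame comes back as a formatted dictionary.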

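The core purpose of this tool is to "automatically convert a natural-language question into a pandas command": the LLM writes a short query string such as mean:num_legs[1..3], and the parser runs the matching operation on the DataFrame.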

Basic structure

import

import pprint
Imports pprint (Pretty Print), a module from Python's standard library.
It prints complex data structures in an easy-to-read layout.
from typing import Any, Dict
Imports the tools used for Python type hinting.
Dict: used to annotate a dictionary type
Any: used when any type is allowed
import pandas as pd
Imports the Pandas library under the alias 'pd'.

Function declaration

parser_output: a dictionary whose values are Pandas DataFrames (or Series)
Dict[str, Any]: keys are strings and values may be of any type
-> None: the function returns nothing (it only prints)

Iterates over every key in the dictionary
Converts each DataFrame into dictionary form

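width = 4 : sets the target maximum line width of the output to 4 characters, so almost every item is printed on its own line (indentation is controlled by the separate indent parameter)
compact = True : packs as many sequence items as fit within that width onto each line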
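
The function body itself does not appear in the text, so here is a minimal sketch reconstructed from the description above. The name format_parser_output comes from later in this post; the sample input is made up for illustration, and the real function may differ in detail.

```python
import pprint
from typing import Any, Dict

import pandas as pd


def format_parser_output(parser_output: Dict[str, Any]) -> None:
    # Walk every key and replace each pandas object with a plain dict.
    for key in parser_output.keys():
        parser_output[key] = parser_output[key].to_dict()
    # width=4 is the target line width (not the indentation);
    # compact=True packs sequence items onto one line when they fit.
    pprint.PrettyPrinter(width=4, compact=True).pprint(parser_output)


# Hypothetical input: one Series stored under the key "Age".
sample = {"Age": pd.Series([22.0, 38.0], name="Age")}
format_parser_output(sample)
```

Note that this sketch converts the values in place, so the dictionary passed in holds plain dicts after the call.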

Loading the data

Use pandas to load a CSV file into a DataFrame.
Because pandas was imported earlier under the alias pd, the call is written as pd.read_csv.
The variable df holds the loaded data.

Reading the data with df = pd.read_csv() gives a DataFrame.
Operations on this DataFrame return either a Series or a DataFrame.
format_parser_output converts these pandas objects into readable dictionaries and prints them.
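
To make the DataFrame versus Series distinction concrete, here is a small sketch. The path passed to pd.read_csv is not shown in this post, so the example reads an in-memory CSV instead; the rows are made up for illustration and only the column names are borrowed from the column list that appears later.

```python
import io

import pandas as pd

# Made-up rows for illustration only; the post loads a real CSV file instead.
csv_text = """PassengerId,Age,Fare
1,22.0,7.25
2,38.0,71.28
3,26.0,7.92
"""

df = pd.read_csv(io.StringIO(csv_text))  # read_csv returns a DataFrame

ages = df["Age"]                # selecting one column yields a Series
subset = df[["Age", "Fare"]]    # selecting a list of columns yields a DataFrame

print(type(df).__name__, type(ages).__name__, type(subset).__name__)
```

Both kinds of object expose to_dict(), which is why a single formatting helper can handle either result.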

Parser and format instruction setup

PandasDataFrameOutputParser requires the dataframe parameter, passed here as dataframe=df.
It tells the parser which DataFrame to work with.

The output should be formatted as a string as the operation, followed by a colon, followed by the column or row to be queried on, followed by optional array parameters.
1. The column names are limited to the possible columns below.
2. Arrays must either be a comma-separated list of numbers formatted as [1,3,5], or it must be in range of numbers formatted as [0..4].
3. Remember that arrays are optional and not necessarily required.
4. If the column is not in the possible columns or the operation is not a valid Pandas DataFrame operation, return why it is invalid as a sentence starting with either "Invalid column" or "Invalid operation".

As an example, for the formats:
1. String "column:num_legs" is a well-formatted instance which gets the column num_legs, where num_legs is a possible column.
2. String "row:1" is a well-formatted instance which gets row 1.
3. String "column:num_legs[1,2]" is a well-formatted instance which gets the column num_legs for rows 1 and 2, where num_legs is a possible column.
4. String "row:1[num_legs]" is a well-formatted instance which gets row 1, but for just column num_legs, where num_legs is a possible column.
5. String "mean:num_legs[1..3]" is a well-formatted instance which takes the mean of num_legs from rows 1 to 3, where num_legs is a possible column and mean is a valid Pandas DataFrame operation.
6. String "do_something:num_legs" is a badly-formatted instance, where do_something is not a valid Pandas DataFrame operation.
7. String "mean:invalid_col" is a badly-formatted instance, where invalid_col is not a possible column.

Here are the possible columns:

PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked

The text above is what gets printed; it describes the command format that PandasDataFrameOutputParser understands.
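
To show how a string in this format breaks down, here is a simplified sketch that splits a query into its operation, target, and optional array. It is an illustration only, not LangChain's implementation: Query and parse_query are hypothetical names, and treating a range such as 1..3 as inclusive is an assumption based on the "rows 1 to 3" wording in the instructions.

```python
import re
from typing import List, NamedTuple, Optional, Union


class Query(NamedTuple):
    operation: str                           # e.g. "column", "row", "mean"
    target: str                              # column name or row label
    params: Optional[List[Union[int, str]]]  # array items, or None if absent


# operation ':' target, optionally followed by '[...]'
_QUERY_RE = re.compile(r"^(\w+):(\w+)(?:\[(.*)\])?$")


def _item(token: str) -> Union[int, str]:
    """Turn a numeric token into an int; leave column names as strings."""
    token = token.strip()
    return int(token) if token.isdigit() else token


def parse_query(text: str) -> Query:
    """Split a query string into operation, target, and optional array."""
    match = _QUERY_RE.match(text.strip())
    if match is None:
        raise ValueError(f"badly formatted query: {text!r}")
    operation, target, array = match.groups()
    if array is None:
        return Query(operation, target, None)
    # Assumption: a range like 1..3 covers every integer from 1 to 3.
    range_match = re.fullmatch(r"(\d+)\.\.(\d+)", array)
    if range_match:
        start, stop = map(int, range_match.groups())
        return Query(operation, target, list(range(start, stop + 1)))
    return Query(operation, target, [_item(tok) for tok in array.split(",")])


for example in ["column:num_legs", "row:1[num_legs]", "mean:num_legs[1..3]"]:
    print(parse_query(example))
```

This sketch only checks the shape of the string. Deciding whether the column exists or the operation is valid, as rule 4 of the instructions requires, still needs the DataFrame itself.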