5_Data_generation

Jacob Kim · February 1, 2024

Naver project

Naver Project Week5


Use case

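Synthetic data is artificially generated data rather than data collected from real-world events. It is used to simulate real data without compromising privacy or running into real-world constraints.

Benefits of synthetic data:

Privacy and Security: No real personal data is at risk of being leaked.
Data Augmentation: Expands datasets for machine learning.
Flexibility: Lets you create specific or rare scenarios.
Cost-effective: Often cheaper than collecting real-world data.
Regulatory Compliance: Helps you comply with strict data protection laws.
Model Robustness: Can help AI models generalize better.
Rapid Prototyping: Enables quick testing without real data.
Controlled Experimentation: Simulates specific conditions.
Access to Data: An alternative when real data is not available.

Note: Despite these benefits, synthetic data may not always capture real-world complexity, so use it with care.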

Quickstart

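In this notebook, we take a closer look at generating synthetic medical billing records with the LangChain library. This is especially useful when you want to develop or test algorithms but do not want to use real patient data because of privacy concerns or data availability issues.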

Setup

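First, the LangChain library must be installed along with its dependencies. Because we use the OpenAI generator chain, we install the openai package as well. The synthetic data module is experimental, so langchain_experimental must be included in the install. Then we import the required modules.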

!pip install -U langchain langchain_experimental openai

Collecting langchain
  Downloading langchain-0.0.348-py3-none-any.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 9.6 MB/s eta 0:00:00
Collecting langchain_experimental
  Downloading langchain_experimental-0.0.45-py3-none-any.whl (162 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 162.8/162.8 kB 16.3 MB/s eta 0:00:00

import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass()

from langchain.chat_models import ChatOpenAI
from langchain.prompts import FewShotPromptTemplate, PromptTemplate
from langchain.pydantic_v1 import BaseModel
from langchain_experimental.tabular_synthetic_data.openai import (
    OPENAI_TEMPLATE,
    create_openai_data_generator,
)
from langchain_experimental.tabular_synthetic_data.prompts import (
    SYNTHETIC_FEW_SHOT_PREFIX,
    SYNTHETIC_FEW_SHOT_SUFFIX,
)

OPENAI_TEMPLATE

PromptTemplate(input_variables=['example'], template='{example}')

SYNTHETIC_FEW_SHOT_PREFIX

This is a test about generating synthetic data about {subject}. Examples below:

SYNTHETIC_FEW_SHOT_SUFFIX

Now you generate synthetic data about {subject}. Make sure to {extra}:

1. Define Your Data Model

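Every dataset has a structure, or "schema". The MedicalBilling class below serves as the schema for our synthetic data. By defining it, we tell the synthetic data generator the shape and characteristics of the data we expect.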

class MedicalBilling(BaseModel):
    patient_id: int
    patient_name: str
    diagnosis_code: str
    procedure_code: str
    total_charge: float
    insurance_claim_amount: float

For example, every record has a 'patient_id' that is an integer, a 'patient_name' that is a string, and so on.
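To make the role of the schema more concrete, here is a minimal stdlib-only sketch of what "coercing a raw record to the declared types" can look like. It is not how the Pydantic model or LangChain works internally; `MedicalBillingRow` and `coerce_row` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class MedicalBillingRow:
    """Hypothetical stdlib stand-in mirroring the MedicalBilling fields."""
    patient_id: int
    patient_name: str
    diagnosis_code: str
    procedure_code: str
    total_charge: float
    insurance_claim_amount: float


def coerce_row(raw: dict) -> MedicalBillingRow:
    """Convert raw values to the declared field types, raising on bad input."""
    converted = {}
    for f in fields(MedicalBillingRow):
        if f.name not in raw:
            raise ValueError(f"missing field: {f.name}")
        # f.type is the annotated class (int, str, float), used here as a converter
        converted[f.name] = f.type(raw[f.name])
    return MedicalBillingRow(**converted)


row = coerce_row(
    {
        "patient_id": "123456",
        "patient_name": "John Doe",
        "diagnosis_code": "J20.9",
        "procedure_code": "99203",
        "total_charge": "500",
        "insurance_claim_amount": "350",
    }
)
print(row)
```

A value such as "abc" for patient_id would raise a ValueError here, which is the kind of guarantee a typed schema gives downstream code.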

2. Sample Data

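To guide the synthetic data generator, it helps to provide a few realistic examples. These examples act as "seeds" that represent the kind of data you want, and the generator uses them to produce more data like them.

Here are a few fictional medical billing records: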

examples = [
    {
        "example": """Patient ID: 123456, Patient Name: John Doe, Diagnosis Code:
        J20.9, Procedure Code: 99203, Total Charge: $500, Insurance Claim Amount: $350"""
    },
    {
        "example": """Patient ID: 789012, Patient Name: Johnson Smith, Diagnosis
        Code: M54.5, Procedure Code: 99213, Total Charge: $150, Insurance Claim Amount: $120"""
    },
    {
        "example": """Patient ID: 345678, Patient Name: Emily Stone, Diagnosis Code:
        E11.9, Procedure Code: 99214, Total Charge: $300, Insurance Claim Amount: $250"""
    },
]
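Before handing seed examples to the generator, it can be worth checking that each one mentions every field in the schema. The following is a small sketch under the assumption that the examples keep the "Label: value, Label: value" layout used above; `FIELD_LABELS`, `parse_example`, and `missing_labels` are hypothetical helpers, not part of LangChain.

```python
FIELD_LABELS = [
    "Patient ID",
    "Patient Name",
    "Diagnosis Code",
    "Procedure Code",
    "Total Charge",
    "Insurance Claim Amount",
]


def parse_example(text: str) -> dict:
    """Split a 'Label: value, Label: value' seed string into a dict."""
    # Collapse the line breaks and indentation left by triple-quoted strings
    flat = " ".join(text.split())
    record = {}
    for part in flat.split(", "):
        label, _, value = part.partition(": ")
        record[label] = value
    return record


def missing_labels(text: str) -> list:
    """Return the schema labels that the seed string does not mention."""
    record = parse_example(text)
    return [label for label in FIELD_LABELS if label not in record]


sample = """Patient ID: 123456, Patient Name: John Doe, Diagnosis Code:
        J20.9, Procedure Code: 99203, Total Charge: $500, Insurance Claim Amount: $350"""

record = parse_example(sample)
print(record["Diagnosis Code"])
print(missing_labels(sample))
```

This simple split would break if a value itself contained ", ", so treat it as a sanity check rather than a parser.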

3. Craft a Prompt Template

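The generator does not know how to produce our data on its own, so we have to guide it. We do this by creating a prompt template, which instructs the underlying language model to generate synthetic data in the format we want.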


OPENAI_TEMPLATE = PromptTemplate(input_variables=["example"], template="{example}")

prompt_template = FewShotPromptTemplate(
    prefix=SYNTHETIC_FEW_SHOT_PREFIX,
    examples=examples,  # attach the sample records defined above
    suffix=SYNTHETIC_FEW_SHOT_SUFFIX,
    input_variables=["subject", "extra"],
    example_prompt=OPENAI_TEMPLATE,
)

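The FewShotPromptTemplate takes the following:

prefix and suffix: These hold the guiding context or instructions placed before and after the examples.
examples: The sample data we defined earlier.
input_variables: These variables ("subject", "extra") are placeholders that can be filled in dynamically later. For example, "subject" can be set to "medical_billing" to guide the model more precisely.
example_prompt: The prompt template that defines the format each example row takes in the prompt.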
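To see how these pieces fit together, the sketch below assembles a comparable prompt with plain string formatting. It only approximates what FewShotPromptTemplate renders: `build_few_shot_prompt` is a hypothetical helper, and the blank-line separator between parts is an assumption.

```python
# Prefix and suffix text as printed earlier in this post
PREFIX = "This is a test about generating synthetic data about {subject}. Examples below:"
SUFFIX = "Now you generate synthetic data about {subject}. Make sure to {extra}:"


def build_few_shot_prompt(examples, subject, extra, separator="\n\n"):
    """Join prefix, example rows, and suffix into one prompt string."""
    # The example prompt is just "{example}", so each row is used verbatim
    example_rows = [ex["example"] for ex in examples]
    parts = [
        PREFIX.format(subject=subject),
        *example_rows,
        SUFFIX.format(subject=subject, extra=extra),
    ]
    return separator.join(parts)


prompt = build_few_shot_prompt(
    [
        {"example": "Patient ID: 1, Patient Name: A"},
        {"example": "Patient ID: 2, Patient Name: B"},
    ],
    subject="medical_billing",
    extra="choose the name at random",
)
print(prompt)
```

The printed text shows the instruction, the seed rows, and the final request in the order the language model would read them.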

4. Creating the Data Generator

With the schema and the prompt ready, the next step is to create the data generator. This object knows how to communicate with the underlying language model to obtain synthetic data.

synthetic_data_generator = create_openai_data_generator(
    output_schema=MedicalBilling,
    llm=ChatOpenAI(
        temperature=1
    ),  # You'll need to replace with your actual Language Model instance
    prompt=prompt_template,
)

5. Generate Synthetic Data

Finally, let's generate the synthetic data!

synthetic_results = synthetic_data_generator.generate(
    subject="medical_billing",
    extra="the name must be chosen at random. Make it something you wouldn't normally choose.",
    runs=10,
)

synthetic_results

[MedicalBilling(patient_id=987654, patient_name='Jennifer Parker', diagnosis_code='C50.9', procedure_code='99204', total_charge=400.0, insurance_claim_amount=320.0),
 MedicalBilling(patient_id=123456, patient_name='Sophia Johnson', diagnosis_code='A09.9', procedure_code='99205', total_charge=500.0, insurance_claim_amount=400.0),
 MedicalBilling(patient_id=543210, patient_name='Oliver Wilson', diagnosis_code='G20.9', procedure_code='99213', total_charge=250.0, insurance_claim_amount=200.0),
 MedicalBilling(patient_id=789012, patient_name='Xavier Rodriguez', diagnosis_code='F32.9', procedure_code='99214', total_charge=350.0, insurance_claim_amount=280.0),
 MedicalBilling(patient_id=987654, patient_name='Amelia Thompson', diagnosis_code='R07.9', procedure_code='99204', total_charge=400.0, insurance_claim_amount=320.0),
 MedicalBilling(patient_id=123456, patient_name='Elijah Parker', diagnosis_code='A09.9', procedure_code='99215', total_charge=300.0, insurance_claim_amount=240.0),
 MedicalBilling(patient_id=456789, patient_name='Olivia Johnson', diagnosis_code='G44.1', procedure_code='99213', total_charge=250.0, insurance_claim_amount=200.0),
 MedicalBilling(patient_id=987654, patient_name='Isabella Rodriguez', diagnosis_code='S62.309A', procedure_code='99203', total_charge=350.0, insurance_claim_amount=280.0),
 MedicalBilling(patient_id=246813, patient_name='Liam Thompson', diagnosis_code='F41.1', procedure_code='99214', total_charge=275.0, insurance_claim_amount=220.0),
 MedicalBilling(patient_id=123456, patient_name='Sophia Smith', diagnosis_code='M25.511', procedure_code='99212', total_charge=200.0, insurance_claim_amount=150.0)]

This command asks the generator to produce 10 synthetic medical billing records. The results are stored in synthetic_results, and the output is a list of MedicalBilling Pydantic models.
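Generated records are worth a quick sanity check before use. The sketch below copies the patient_id, total_charge, and insurance_claim_amount values from the output above into plain tuples (so it runs without calling the model) and looks for repeated IDs and a suspiciously uniform claim ratio. The variable names are illustrative only.

```python
from collections import Counter

# (patient_id, total_charge, insurance_claim_amount) copied from the output above
rows = [
    (987654, 400.0, 320.0),
    (123456, 500.0, 400.0),
    (543210, 250.0, 200.0),
    (789012, 350.0, 280.0),
    (987654, 400.0, 320.0),
    (123456, 300.0, 240.0),
    (456789, 250.0, 200.0),
    (987654, 350.0, 280.0),
    (246813, 275.0, 220.0),
    (123456, 200.0, 150.0),
]

# Count how often each patient_id occurs and keep only the repeated ones
id_counts = Counter(pid for pid, _, _ in rows)
duplicate_ids = {pid: n for pid, n in id_counts.items() if n > 1}

# Ratio of claim amount to total charge for every record
ratios = [claim / charge for _, charge, claim in rows]
share_at_80 = sum(1 for r in ratios if abs(r - 0.8) < 1e-9) / len(rows)

print(duplicate_ids)
print(share_at_80)
```

In this run, two IDs occur three times each (one of them, 123456, also appears in the seed examples), and nine of the ten records have a claim amount that is exactly 80% of the charge. This illustrates the earlier note that synthetic data may not capture real-world variety.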

Other implementations

from langchain.chat_models import ChatOpenAI
from langchain_experimental.synthetic_data import (
    DatasetGenerator,
    create_data_generation_chain,
)

# LLM
model = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7)
chain = create_data_generation_chain(model)

chain({"fields": ["blue", "yellow"], "preferences": {}})

{'fields': ['blue', 'yellow'],
 'preferences': {},
 'text': 'The vibrant blue sky contrasted beautifully with the golden yellow sunflowers, creating a mesmerizing scene that seemed to capture the essence of a perfect summer day.'}

chain(
   {
       "fields": {"colors": ["blue", "yellow"]},
       "preferences": {"style": "Make it in a style of a weather forecast."},
   }
)

{'fields': {'colors': ['blue', 'yellow']},
 'preferences': {'style': 'Make it in a style of a weather forecast.'},
 'text': "In today's weather forecast, we can expect a vibrant display of colors with a stunning blend of blue and yellow, reminiscent of a picturesque sunset on a summer evening."}

chain(
   {
       "fields": {"actor": "Tom Hanks", "movies": ["Forrest Gump", "Green Mile"]},
       "preferences": None,
   }
)

{'fields': {'actor': 'Tom Hanks', 'movies': ['Forrest Gump', 'Green Mile']},
 'preferences': None,
 'text': 'Tom Hanks, a legendary actor known for his exceptional talent, has graced the silver screen with his remarkable performances in movies such as "Forrest Gump" and "Green Mile", captivating audiences worldwide with his versatility and magnetic presence.'}

chain(
   {
       "fields": [
           {"actor": "Tom Hanks", "movies": ["Forrest Gump", "Green Mile"]},
           {"actor": "Mads Mikkelsen", "movies": ["Hannibal", "Another round"]},
       ],
       "preferences": {"minimum_length": 200, "style": "gossip"},
   }
)

{'fields': [{'actor': 'Tom Hanks', 'movies': ['Forrest Gump', 'Green Mile']},
  {'actor': 'Mads Mikkelsen', 'movies': ['Hannibal', 'Another round']}],
 'preferences': {'minimum_length': 200, 'style': 'gossip'},
 'text': 'In a surprising turn of events, the illustrious Hollywood actor Tom Hanks, renowned for his exceptional performances in iconic movies such as "Forrest Gump" and "Green Mile," shares the limelight with the enigmatic Mads Mikkelsen, known for his captivating portrayal of the infamous Hannibal Lecter in the thrilling television series and his recent triumph in the critically acclaimed film "Another round." These two incredible actors, each with their own unique style and mesmerizing screen presence, have captured the hearts of audiences worldwide, cementing their status as true legends in the realm of cinema.'}

As you can see, the generated examples are varied and contain the information we asked for. Their style also reflects the given preferences well.
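The claim that the text "contains the information we asked for" can be checked mechanically. Below is a rough sketch that walks the nested fields structure and reports any value that does not appear in the generated text; `flatten_values` and `missing_mentions` are hypothetical helpers, and plain case-insensitive substring matching is an assumption that will miss paraphrases.

```python
def flatten_values(fields):
    """Yield every string leaf from nested lists and dicts of fields."""
    if isinstance(fields, str):
        yield fields
    elif isinstance(fields, dict):
        for value in fields.values():
            yield from flatten_values(value)
    elif isinstance(fields, (list, tuple)):
        for value in fields:
            yield from flatten_values(value)


def missing_mentions(fields, text):
    """Return the field values that do not occur in the text (case-insensitive)."""
    lowered = text.lower()
    return [v for v in flatten_values(fields) if v.lower() not in lowered]


fields = {"actor": "Tom Hanks", "movies": ["Forrest Gump", "Green Mile"]}
text = 'Tom Hanks starred in "Forrest Gump" and "Green Mile".'
print(missing_mentions(fields, text))
print(missing_mentions(fields, "Tom Hanks starred in Forrest Gump."))
```

An empty list means every requested value was mentioned; any returned values point at records worth regenerating or reviewing by hand.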

Generating exemplary dataset for extraction benchmarking purposes

inp = [
    {
        "Actor": "Tom Hanks",
        "Film": [
            "Forrest Gump",
            "Saving Private Ryan",
            "The Green Mile",
            "Toy Story",
            "Catch Me If You Can",
        ],
    },
    {
        "Actor": "Tom Hardy",
        "Film": [
            "Inception",
            "The Dark Knight Rises",
            "Mad Max: Fury Road",
            "The Revenant",
            "Dunkirk",
        ],
    },
]

generator = DatasetGenerator(model, {"style": "informal", "minimal length": 500})
dataset = generator(inp)

dataset

[{'fields': {'Actor': 'Tom Hanks',
   'Film': ['Forrest Gump',
    'Saving Private Ryan',
    'The Green Mile',
    'Toy Story',
    'Catch Me If You Can']},
  'preferences': {'style': 'informal', 'minimal length': 500},
  'text': 'Tom Hanks, the beloved actor known for his roles in iconic films such as "Forrest Gump," "Saving Private Ryan," "The Green Mile," "Toy Story," and "Catch Me If You Can," effortlessly captivates audiences with his unmatched talent and versatility. Whether he is running across the country as the endearing Forrest Gump or tugging at our heartstrings as the compassionate prison guard in "The Green Mile," Hanks consistently delivers performances that leave a lasting impact. With his charming demeanor and incredible acting skills, it is no wonder that Tom Hanks has become a household name and a true legend in the film industry.'},
 {'fields': {'Actor': 'Tom Hardy',
   'Film': ['Inception',
    'The Dark Knight Rises',
    'Mad Max: Fury Road',
    'The Revenant',
    'Dunkirk']},
  'preferences': {'style': 'informal', 'minimal length': 500},
  'text': 'Tom Hardy, the versatile actor known for his roles in films such as "Inception," "The Dark Knight Rises," "Mad Max: Fury Road," "The Revenant," and "Dunkirk," captivates audiences with his raw talent, effortlessly transitioning between complex characters and showcasing his ability to immerse himself in diverse roles across various genres. From his charismatic portrayal of the enigmatic Eames in the mind-bending thriller "Inception" to his intense and physically demanding performance as the formidable Bane in Christopher Nolan\'s "The Dark Knight Rises," Hardy\'s on-screen presence is nothing short of mesmerizing. In "Mad Max: Fury Road," he embodies the iconic character of Max Rockatansky, delivering a gritty and riveting performance that perfectly captures the essence of the post-apocalyptic world. His portrayal of the treacherous John Fitzgerald in "The Revenant" showcases his ability to portray morally complex characters with depth and nuance. Lastly, in "Dunkirk," Hardy\'s portrayal of Farrier, a courageous RAF pilot, exemplifies his dedication to his craft as he flawlessly conveys the bravery, resilience, and unwavering determination of a hero amidst the chaos of war. With each film, Tom Hardy continues to push the boundaries of his craft, leaving a lasting impact on the cinematic world and solidifying his status as one of the most talented actors of his generation.'}]

Extraction from generated examples

Now let's see whether we can extract structured output from this generated data, and how it compares with our original inputs!

from typing import List

from langchain.chains import create_extraction_chain_pydantic
from langchain.llms import OpenAI
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from pydantic import BaseModel, Field

class Actor(BaseModel):
    Actor: str = Field(description="name of an actor")
    Film: List[str] = Field(description="list of names of films they starred in")

Parsers

llm = OpenAI()
parser = PydanticOutputParser(pydantic_object=Actor)

prompt = PromptTemplate(
    template="Extract fields from a given text.\n{format_instructions}\n{text}\n",
    input_variables=["text"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

_input = prompt.format_prompt(text=dataset[0]["text"])
output = llm(_input.to_string())

parsed = parser.parse(output)
parsed

Actor(Actor='Tom Hanks', Film=['Forrest Gump', 'Saving Private Ryan', 'The Green Mile', 'Toy Story', 'Catch Me If You Can'])

parsed.Actor == inp[0]["Actor"] and parsed.Film == inp[0]["Film"]

True

Extractors

extractor = create_extraction_chain_pydantic(pydantic_schema=Actor, llm=model)
extracted = extractor.run(dataset[1]["text"])
extracted

[Actor(Actor='Tom Hardy', Film=['Inception', 'The Dark Knight Rises', 'Mad Max: Fury Road', 'The Revenant', 'Dunkirk'])]

extracted[0].Actor == inp[1]["Actor"] and extracted[0].Film == inp[1]["Film"]

True
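The equality checks above require the extracted film list to match the input exactly, including order and capitalization. For benchmarking, a slightly more forgiving comparison may be useful. The following is a minimal sketch with hypothetical helpers `normalize` and `same_record`; it compares names case-insensitively and treats films as a set.

```python
def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so minor formatting differences are ignored."""
    return " ".join(name.lower().split())


def same_record(expected: dict, actor: str, films: list) -> bool:
    """Compare an input record with extracted values, ignoring film order and case."""
    same_actor = normalize(expected["Actor"]) == normalize(actor)
    same_films = {normalize(f) for f in expected["Film"]} == {normalize(f) for f in films}
    return same_actor and same_films


expected = {"Actor": "Tom Hardy", "Film": ["Inception", "Dunkirk", "The Revenant"]}

# Same films in a different order and with different capitalization
print(same_record(expected, "Tom Hardy", ["dunkirk", "The Revenant", "Inception"]))
# One film missing from the extraction
print(same_record(expected, "Tom Hardy", ["Inception", "Dunkirk"]))
```

Whether order should matter depends on the benchmark; strict equality, as used in the cells above, is the right choice when the extraction is expected to preserve the input order.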
