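In this example, we summarize the book 'The Catcher in the Rye', identify its key entities with spaCy LLM and GPT-4, and then generate and run Cypher queries in Memgraph to build a knowledge graph centered on the book's themes and characters.
First, a Memgraph instance must be running in the background.
If you are trying Memgraph Platform (Memgraph database + MAGE library + Memgraph Lab) for the first time, make sure Docker is running in the background and run the command for your operating system:
Linux/macOS:
curl https://install.memgraph.com | sh
Windows:
iwr https://windows.memgraph.com | iex
Creating a KG by extracting information from text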
Tools:
Run Memgraph with docker-compose
Install the required packages
%pip install openai neo4j
%pip install spacy
%pip install spacy_llm
!python -m spacy download en_core_web_md
import os
from wasabi import msg
os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"
# Check for OpenAI API key
if not os.getenv("OPENAI_API_KEY"):
    msg.fail("OPENAI_API_KEY environment variable not set. Please set it to proceed.", exits=1)
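The first step is to extract entities from the summary using a spaCy language model.
spaCy is an advanced Python NLP (natural language processing) library designed for tasks such as named entity recognition, part-of-speech tagging, and dependency parsing. It is widely used for its speed and accuracy in text processing.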
# Sample text summary for processing
# (two adjacent string literals are concatenated, so the text can span lines)
summary = (
    "'The Catcher in the Rye' by J.D. Salinger follows Holden Caulfield, a troubled teenager who narrates his experiences over a few days after being expelled from his elite boarding school, Pencey Prep. Set in post-World War II New York City, the story revolves around Holden’s encounters with various characters, reflecting his disillusionment with the adult world and his search for identity and meaning. The novel begins with Holden being expelled due to poor academic performance, which sets the stage for his wandering through New York City. His isolation becomes a central theme, symbolizing his struggle with mental health and alienation. Throughout the book, Holden interacts with multiple characters, including teachers, former classmates, strangers, and his younger sister, Phoebe. Each interaction reveals his distrust of adults and his disdain for what he calls phoniness. He idolizes Phoebe as a symbol of innocence and sincerity, which stands in contrast to his views on the rest of society. Holden’s fixation on preserving innocence is symbolized by his dream of being the catcher in the rye, a protector who saves children from losing their innocence. Key symbols also include his red hunting hat, which represents Holden's uniqueness and desire for protection, and the Museum of Natural History, a place he values for its permanence in contrast to life’s constant change and unpredictability. Holden’s narrative reveals symptoms of depression and lingering trauma from the death of his younger brother, Allie, which complicates his ability to cope with the challenges of adulthood. His internal struggles suggest unresolved grief and a fear of growing up. The climax of the story occurs when Holden, overwhelmed, plans to run away but has a meaningful encounter with Phoebe that changes his mind. Her innocence and love provide him with a sense of purpose, grounding him and encouraging him to continue facing his reality. "
    "By the novel’s end, Holden reluctantly begins to accept life’s imperfections and complexities. The main characters include Holden Caulfield, who is marked by cynicism, vulnerability, and compassion; Phoebe Caulfield, his younger sister who represents innocence and serves as an emotional anchor for Holden; Mr. Antolini, a former teacher who offers him guidance and represents an adult Holden partially trusts; and Allie Caulfield, Holden’s deceased younger brother, whose memory profoundly impacts him. The novel is set primarily in New York City, with scenes at Pencey Prep and various urban locations, emphasizing Holden's sense of disorientation and social critique. Themes of alienation, innocence, identity, and the challenges of adolescence permeate the novel, creating a poignant exploration of a young person grappling with mental health and the transition to adulthood."
)
Load the en_core_web_md model and run NER
Split the text into sentences → extract PERSON, ORG, DATE, GPE, and other entities from each sentence
import json
from collections import Counter
from pathlib import Path
import spacy
from spacy_llm.util import assemble
# load the spaCy model
nlp = spacy.load("en_core_web_md")

# split document into sentences
def split_document_sent(text):
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents]

# define custom relationship extraction and text processing
def process_text(text, verbose=False):
    doc = nlp(text)
    if verbose:
        msg.text(f"Text: {doc.text}")
        msg.text(f"Entities: {[(ent.text, ent.label_) for ent in doc.ents]}")
    # Relations extraction logic can be added here
    return doc

# Pipeline to run entity extraction
def extract_entities(text, verbose=False):
    processed_data = []
    entity_counts = Counter()
    sentences = split_document_sent(text)
    for sent in sentences:
        doc = process_text(sent, verbose)
        entities = [(ent.text, ent.label_) for ent in doc.ents]
        # Store processed data for each sentence
        processed_data.append({'text': doc.text, 'entities': entities})
        # Update counters
        entity_counts.update([ent[1] for ent in entities])
    # Export to JSON
    with open('processed_data.json', 'w') as f:
        json.dump(processed_data, f)
    # Display summary
    msg.text(f"Entity counts: {entity_counts}")

# Run the pipeline on the summary text
verbose = True
extract_entities(summary, verbose)
Example output:
Text: 'The Catcher in the Rye' by J.D. Salinger follows Holden Caulfield, a
troubled teenager who narrates his experiences over a few days after being
expelled from his elite boarding school, Pencey Prep.
Entities: [('J.D. Salinger', 'PERSON'), ('Holden Caulfield', 'PERSON'), ('a few
days', 'DATE'), ('Pencey', 'GPE')]
Text: Set in post-World War II New York City, the story revolves around Holden’s
encounters with various characters, reflecting his disillusionment with the
adult world and his search for identity and meaning.
Entities: [('post-World War II', 'EVENT'), ('New York City', 'GPE'), ('Holden',
'PERSON')]
Text: The novel begins with Holden being expelled due to poor academic
performance, which sets the stage for his wandering through New York City.
Entities: [('Holden', 'PERSON'), ('New York City', 'GPE')]
Text: His isolation becomes a central theme, symbolizing his struggle with
mental health and alienation.
Entities: []
...
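The output above lists 'Holden' and 'Holden Caulfield' as separate PERSON entities, which later become separate graph nodes. As a rough sketch of one way to reconcile them before building the graph, the hypothetical helper below (not part of the original pipeline) maps a short PERSON mention onto a longer mention whose words contain it. It assumes whole-word overlap is a good enough signal, which will not hold for ambiguous mentions such as a bare family name.

```python
def merge_person_mentions(entities):
    """Map each PERSON mention to a canonical (longest matching) mention.

    `entities` is a list of (text, label) pairs, as stored in processed_data.json.
    This helper is an illustrative assumption, not part of spaCy.
    """
    # Longest names first, so full names become canonical before short forms are seen
    persons = sorted(
        {text for text, label in entities if label == "PERSON"},
        key=len,
        reverse=True,
    )
    canonical = {}
    for name in persons:
        words = set(name.split())
        # A strict subset of an existing canonical name's words counts as a short form
        match = next(
            (full for full in canonical.values() if words < set(full.split())),
            None,
        )
        canonical[name] = match or name
    return canonical


mentions = [
    ("Holden Caulfield", "PERSON"),
    ("Holden", "PERSON"),
    ("Phoebe", "PERSON"),
    ("Phoebe Caulfield", "PERSON"),
    ("New York City", "GPE"),
]
mapping = merge_person_mentions(mentions)
print(mapping["Holden"])
```

Non-PERSON entities are left out of the mapping, so the caller decides how to treat them.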
The JSON data above is passed to the LLM:
The LLM prompt is designed to return structured relationships
import json
import openai
from pathlib import Path
# Load processed data from JSON
json_path = Path("processed_data.json")
with open(json_path, "r") as f:
processed_data = json.load(f)
# Prepare nodes and relationships
nodes = []
relationships = []
# Formulate a prompt for GPT-4
prompt = (
    "Extract entities and relationships from the following JSON data. For each entry in data['entities'], "
    "create a 'node' dictionary with fields 'id' (unique identifier), 'name' (entity text), and 'type' (entity label). "
    "For entities that have meaningful connections, define 'relationships' as dictionaries with 'source' (source node id), "
    "'target' (target node id), and 'relationship' (type of connection). Create max 30 nodes, format relationships in the format of capital letters and _ in between words and format the entire response in the JSON output containing only variables nodes and relationships without any text in between. "
    "JSON data:\n"
    f"{json.dumps(processed_data)}"
)
response = openai.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that structures data into nodes and relationships."},
        {"role": "user", "content": prompt}
    ],
    max_tokens=1000
)
output = response.choices[0].message.content
print(output)
structured_data = json.loads(output)  # Assuming GPT-4 outputs structured JSON
# Populate nodes and relationships lists
nodes.extend(structured_data.get("nodes", []))
relationships.extend(structured_data.get("relationships", []))
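The code above assumes the model replies with nothing but JSON. In practice a reply may be wrapped in explanatory text or a Markdown fence, and a reply cut off by `max_tokens` will not parse at all. The following is a minimal sketch of a more tolerant parser; `parse_llm_json` is a hypothetical helper name, and the approach (take everything between the first `{` and the last `}`) is a simple heuristic rather than a full solution.

```python
import json


def parse_llm_json(raw):
    """Return (nodes, relationships) from a model reply that contains one JSON object.

    Raises ValueError if no object is found; json.JSONDecodeError (a ValueError
    subclass) is raised if the object is malformed or truncated.
    """
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(raw[start:end + 1])
    return data.get("nodes", []), data.get("relationships", [])


# Illustrative reply with stray text around the JSON payload
reply = (
    "Here is the result:\n"
    '{"nodes": [{"id": 1, "name": "Holden Caulfield", "type": "PERSON"}], '
    '"relationships": []}\n'
    "Let me know if you need anything else."
)
parsed_nodes, parsed_relationships = parse_llm_json(reply)
print(parsed_nodes)
```

Catching `ValueError` around the call covers both the missing-object and the truncated-reply case.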
{
"nodes": [
{"id": 1, "name": "J.D. Salinger", "type": "PERSON"},
{"id": 2, "name": "Holden Caulfield", "type": "PERSON"},
{"id": 3, "name": "a few days", "type": "DATE"},
{"id": 4, "name": "Pencey", "type": "GPE"},
{"id": 5, "name": "post-World War II", "type": "EVENT"},
{"id": 6, "name": "New York City", "type": "GPE"},
{"id": 7, "name": "Holden", "type": "PERSON"},
{"id": 8, "name": "Phoebe", "type": "PERSON"},
{"id": 9, "name": "the Museum of Natural History", "type": "ORG"},
{"id": 10, "name": "Allie", "type": "PERSON"},
{"id": 11, "name": "Phoebe Caulfield", "type": "PERSON"},
{"id": 12, "name": "Antolini", "type": "PERSON"},
{"id": 13, "name": "Allie Caulfield", "type": "PERSON"},
{"id": 14, "name": "Pencey Prep", "type": "ORG"}
],
"relationships": [
{"source": 1, "target": 2, "relationship": "AUTHORED_BY"},
{"source": 2, "target": 3, "relationship": "NARRATION_DURATION"},
{"source": 2, "target": 4, "relationship": "STUDENT_OF"},
{"source": 2, "target": 6, "relationship": "LOCATED_IN"},
{"source": 2, "target": 5, "relationship": "EVENT_OCCURED_IN"},
{"source": 2, "target": 8, "relationship": "SIBLING"},
{"source": 8, "target": 2, "relationship": "SIBLING"},
{"source": 2, "target": 9, "relationship": "VISITED"},
{"source": 2, "target": 10, "relationship": "SIBLING"},
{"source": 10, "target": 2, "relationship": "SIBLING"},
{"source": 2, "target": 14, "relationship": "STUDIED_AT"},
{"source": 11, "target": 2, "relationship": "SIBLING"},
{"source": 2, "target": 12, "relationship": "STUDENT_OF"},
{"source": 2, "target": 13, "relationship": "SIBLING"},
{"source": 13, "target": 2, "relationship": "SIBLING"}
]
}
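Because the relationship types and node ids in this JSON come from the model, it can help to check them before turning them into Cypher. The sketch below is an assumption about what such a check might look like: it keeps only relationships whose `source` and `target` refer to existing node ids and whose type matches the capital-letters-and-underscores format requested in the prompt. `validate_relationships` and `REL_NAME` are hypothetical names.

```python
import re

# Relationship types as requested in the prompt: capital letters joined by underscores
REL_NAME = re.compile(r"^[A-Z][A-Z0-9_]*$")


def validate_relationships(nodes, relationships):
    """Split relationships into (valid, rejected) lists."""
    node_ids = {node["id"] for node in nodes}
    valid, rejected = [], []
    for rel in relationships:
        ok = (
            rel.get("source") in node_ids
            and rel.get("target") in node_ids
            and REL_NAME.match(str(rel.get("relationship", ""))) is not None
        )
        (valid if ok else rejected).append(rel)
    return valid, rejected


sample_nodes = [
    {"id": 2, "name": "Holden Caulfield", "type": "PERSON"},
    {"id": 8, "name": "Phoebe", "type": "PERSON"},
]
sample_rels = [
    {"source": 2, "target": 8, "relationship": "SIBLING"},
    {"source": 2, "target": 99, "relationship": "VISITED"},      # unknown target id
    {"source": 8, "target": 2, "relationship": "looks up to"},   # not a valid type name
]
valid_rels, rejected_rels = validate_relationships(sample_nodes, sample_rels)
print(len(valid_rels), len(rejected_rels))
```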
Generate Cypher queries from the structured data above:
def generate_cypher_queries(nodes, relationships):
    queries = []
    # Create nodes
    for node in nodes:
        query = f"CREATE (n:{node['type']} {{id: '{node['id']}', name: '{node['name']}'}})"
        queries.append(query)
    # Create relationships
    for rel in relationships:
        query = f"MATCH (a {{id: '{rel['source']}'}}), (b {{id: '{rel['target']}'}}) " \
                f"CREATE (a)-[:{rel['relationship']}]->(b)"
        queries.append(query)
    return queries

cypher_queries = generate_cypher_queries(nodes, relationships)
print(cypher_queries)
["CREATE (n:PERSON {id: '1', name: 'J.D. Salinger'})", "CREATE (n:PERSON {id: '2', name: 'Holden Caulfield'})", "CREATE (n:DATE {id: '3', name: 'a few days'})", "CREATE (n:GPE {id: '4', name: 'Pencey'})", "CREATE (n:EVENT {id: '5', name: 'post-World War II'})", "CREATE (n:GPE {id: '6', name: 'New York City'})", "CREATE (n:PERSON {id: '7', name: 'Holden'})", "CREATE (n:PERSON {id: '8', name: 'Phoebe'})", "CREATE (n:ORG {id: '9', name: 'the Museum of Natural History'})", "CREATE (n:PERSON {id: '10', name: 'Allie'})", "CREATE (n:PERSON {id: '11', name: 'Phoebe Caulfield'})", "CREATE (n:PERSON {id: '12', name: 'Antolini'})", "CREATE (n:PERSON {id: '13', name: 'Allie Caulfield'})", "CREATE (n:ORG {id: '14', name: 'Pencey Prep'})", "MATCH (a {id: '1'}), (b {id: '2'}) CREATE (a)-[:AUTHORED_BY]->(b)", "MATCH (a {id: '2'}), (b {id: '3'}) CREATE (a)-[:NARRATION_DURATION]->(b)", "MATCH (a {id: '2'}), (b {id: '4'}) CREATE (a)-[:STUDENT_OF]->(b)", "MATCH (a {id: '2'}), (b {id: '6'}) CREATE (a)-[:LOCATED_IN]->(b)", "MATCH (a {id: '2'}), (b {id: '5'}) CREATE (a)-[:EVENT_OCCURED_IN]->(b)", "MATCH (a {id: '2'}), (b {id: '8'}) CREATE (a)-[:SIBLING]->(b)", "MATCH (a {id: '8'}), (b {id: '2'}) CREATE (a)-[:SIBLING]->(b)", "MATCH (a {id: '2'}), (b {id: '9'}) CREATE (a)-[:VISITED]->(b)", "MATCH (a {id: '2'}), (b {id: '10'}) CREATE (a)-[:SIBLING]->(b)", "MATCH (a {id: '10'}), (b {id: '2'}) CREATE (a)-[:SIBLING]->(b)", "MATCH (a {id: '2'}), (b {id: '14'}) CREATE (a)-[:STUDIED_AT]->(b)", "MATCH (a {id: '11'}), (b {id: '2'}) CREATE (a)-[:SIBLING]->(b)", "MATCH (a {id: '2'}), (b {id: '12'}) CREATE (a)-[:STUDENT_OF]->(b)", "MATCH (a {id: '2'}), (b {id: '13'}) CREATE (a)-[:SIBLING]->(b)", "MATCH (a {id: '13'}), (b {id: '2'}) CREATE (a)-[:SIBLING]->(b)"]
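The queries above embed names directly in the query text, so an entity name containing a single quote (a hypothetical "O'Brien", for instance) would produce invalid Cypher. One common alternative is to pass values as query parameters (`$id`, `$name`) and keep only the label and relationship type in the text, since those cannot be parameterized. The sketch below assumes that approach; `build_parameterized_queries` and `IDENTIFIER` are hypothetical names, and ids are converted to strings to match the queries shown above.

```python
import re

# Labels and relationship types are interpolated into the query, so restrict them
IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_parameterized_queries(nodes, relationships):
    """Return a list of (query, params) pairs instead of fully formatted strings."""
    queries = []
    for node in nodes:
        if not IDENTIFIER.match(node["type"]):
            raise ValueError(f"unsafe label: {node['type']!r}")
        queries.append((
            f"CREATE (n:{node['type']} {{id: $id, name: $name}})",
            {"id": str(node["id"]), "name": node["name"]},
        ))
    for rel in relationships:
        if not IDENTIFIER.match(rel["relationship"]):
            raise ValueError(f"unsafe relationship type: {rel['relationship']!r}")
        queries.append((
            "MATCH (a {id: $source}), (b {id: $target}) "
            f"CREATE (a)-[:{rel['relationship']}]->(b)",
            {"source": str(rel["source"]), "target": str(rel["target"])},
        ))
    return queries


example = build_parameterized_queries(
    [{"id": 1, "name": "O'Brien", "type": "PERSON"}],  # hypothetical name with a quote
    [{"source": 1, "target": 1, "relationship": "KNOWS"}],
)
for query, params in example:
    print(query, params)
```

Each pair can then be passed to the driver as `session.run(query, params)`.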
from neo4j import GraphDatabase

# Initialize the Neo4j driver for Memgraph (modify the URI if necessary)
uri = "bolt://localhost:7687"
user = ""
password = ""
driver = GraphDatabase.driver(uri, auth=(user, password))

# Function to execute Cypher queries in Memgraph
def execute_cypher_queries(queries):
    with driver.session() as session:
        session.run("MATCH (n) DETACH DELETE n;")
        for query in queries:
            try:
                session.run(query)
                msg.good(f"Executed query: {query}")
            except Exception as e:
                msg.fail(f"Error executing query: {query}. Error: {e}")

# Execute the generated Cypher queries
execute_cypher_queries(cypher_queries)
✔ Executed query: CREATE (n:PERSON {id: '1', name: 'J.D.
Salinger'})
✔ Executed query: CREATE (n:PERSON {id: '2', name: 'Holden
Caulfield'})
✔ Executed query: CREATE (n:DATE {id: '3', name: 'a few days'})
✔ Executed query: CREATE (n:GPE {id: '4', name: 'Pencey'})
✔ Executed query: CREATE (n:EVENT {id: '5', name: 'post-World War
II'})
✔ Executed query: CREATE (n:GPE {id: '6', name: 'New York City'})
✔ Executed query: CREATE (n:PERSON {id: '7', name: 'Holden'})
✔ Executed query: CREATE (n:PERSON {id: '8', name: 'Phoebe'})
✔ Executed query: CREATE (n:ORG {id: '9', name: 'the Museum of Natural
History'})
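The log above reports each query as it runs, but the caller gets nothing back to act on. As a small sketch of an alternative, the hypothetical function below runs a list of `(query, params)` pairs and returns the ones that failed. It only assumes that the session object has a `run(query, params)` method, as the neo4j driver's session does; `FakeSession` is a stand-in used here so the example runs without a database.

```python
def run_queries(session, queries):
    """Run (query, params) pairs; return a list of (query, params, error) for failures."""
    failures = []
    for query, params in queries:
        try:
            session.run(query, params)
        except Exception as exc:  # collect the error instead of stopping at the first one
            failures.append((query, params, str(exc)))
    return failures


class FakeSession:
    """Stand-in for a driver session (illustration only): rejects empty queries."""

    def __init__(self):
        self.executed = []

    def run(self, query, params=None):
        if not query.strip():
            raise ValueError("empty query")
        self.executed.append((query, params))


fake = FakeSession()
failed = run_queries(
    fake,
    [
        ("CREATE (n:PERSON {id: $id, name: $name})", {"id": "1", "name": "J.D. Salinger"}),
        ("", {}),
    ],
)
print(len(fake.executed), len(failed))
```

With a real connection, the same function would be called with the session opened by `driver.session()`.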
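// Delete every node and relationship in the database
MATCH (n) DETACH DELETE n
// Return every relationship path in the graph
MATCH p=()-[r]->() RETURN p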
