YouTube Crawling + Labeling

Taixi · October 21, 2024

Generative AI Training


import json
import pandas as pd
from konlpy.tag import Mecab

# Load the KNU Korean sentiment lexicon
with open('SentiWord_info.json', encoding='utf-8-sig', mode='r') as f:
    SentiWord_info = json.load(f)

sentiword_dic = pd.DataFrame(SentiWord_info)

# Sentiment score function: sum the polarity of every token found in the lexicon
def calculate_sentiment_score(sentence, sentiword_dic):
    score = 0
    for word in sentence.split():
        if word in sentiword_dic['word'].values:
            word_score = int(sentiword_dic[sentiword_dic['word'] == word]['polarity'].values[0])
            score += word_score
    return score

# Load the data ("개인주소" is the author's placeholder for a private file path)
community_raw = pd.read_csv("개인주소")

# Drop rows whose comment is null
community_df = community_raw.dropna(subset=['processed_comment']).copy()

# Morphological analyzer
mecab = Mecab(dicpath='/usr/local/lib/mecab/dic/mecab-ko-dic')

# Morphological analysis: join morphemes with spaces so the scorer can split on whitespace
def preprocess_text(text):
    return " ".join(mecab.morphs(text))

community_df["tagged_str"] = community_df["processed_comment"].apply(preprocess_text)

# Compute the sentiment score
community_df["sentiment_score"] = community_df["tagged_str"].apply(lambda x: calculate_sentiment_score(x, sentiword_dic))

# Add a sentiment label
def label_sentiment(score):
    if score > 0:
        return "긍정"  # positive
    elif score < 0:
        return "부정"  # negative
    else:
        return "중립"  # neutral

community_df["sentiment_label"] = community_df["sentiment_score"].apply(label_sentiment)

# Save the results (again, "개인주소" stands in for a private file path)
community_df.to_csv('개인주소', index=False)
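The scoring function above scans the whole `word` column once per token, which gets slow on a large comment set. A minimal sketch of the same scoring with a plain dict lookup is below. The names `toy_dic`, `build_polarity_lookup`, and `score_with_lookup` are hypothetical, and the three lexicon entries are made up for illustration; the sketch assumes the lexicon has `word` and `polarity` columns with polarity stored as strings, as the `int(...)` conversion in the original code suggests.

```python
import pandas as pd

# Hypothetical stand-in for the lexicon; real entries come from SentiWord_info.json.
toy_dic = pd.DataFrame({
    "word": ["좋", "싫", "최고"],
    "polarity": ["2", "-2", "2"],
})

def build_polarity_lookup(dic_df):
    """Map each word to its integer polarity, keeping the first entry for duplicates."""
    deduped = dic_df.drop_duplicates(subset="word", keep="first")
    return dict(zip(deduped["word"], deduped["polarity"].astype(int)))

def score_with_lookup(sentence, lookup):
    """Sum polarities of whitespace-separated tokens; unknown tokens count as 0."""
    return sum(lookup.get(token, 0) for token in sentence.split())

lookup = build_polarity_lookup(toy_dic)
print(score_with_lookup("영상 최고 좋 아요", lookup))
```

Keeping the first entry for a duplicated word mirrors the `.values[0]` choice in the original function, so the scores should match while each token costs a single dict lookup.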

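Evaluation: most comments seem to come out neutral rather than positive or negative. In my opinion, a sentiment lexicon alone is not enough. I need to think about a different approach.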
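One way to put a number on that impression is to count the labels. The sketch below uses a made-up five-row frame (`toy_df`, hypothetical, not real results) in place of `community_df`; only the `sentiment_label` column name is taken from the code above.

```python
import pandas as pd

# Toy labels standing in for community_df["sentiment_label"]; these are not real results.
toy_df = pd.DataFrame({"sentiment_label": ["중립", "중립", "긍정", "중립", "부정"]})

# Absolute counts per label
counts = toy_df["sentiment_label"].value_counts()

# Share of each label, as a fraction of all rows
shares = toy_df["sentiment_label"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

Running the same two `value_counts` calls on `community_df` would show how large the neutral share really is before deciding on another method.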

Available for download from the Notion page.
