🧭 Python Basics · Text Data Analysis (Supplement)

okorion · October 29, 2025

1. Text Data Meets Pandas

1.1 Overview

Unlike structured data (numbers, dates), text data needs processing before it can be analyzed.
Pandas string methods (.str) combined with core Python syntax make cleaning and analysis possible.

import pandas as pd

df = pd.read_csv('text_data.csv')
df.head()

1.2 Exploring Text Data

df.info()
df['text'].head()
df['text'].describe()

Key exploration metrics

  • len(df['text']): number of sentences (rows)
  • df['text'].str.len().mean(): average character count
  • df['text'].isnull().sum(): missing-value count
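As a sketch, the three metrics computed on a tiny hypothetical DataFrame (real data would come from text_data.csv):

```python
import pandas as pd

# Hypothetical stand-in for the 'text' column of text_data.csv
df = pd.DataFrame({'text': ['AI drives innovation', 'Data is everywhere', None]})

sentence_count = len(df['text'])          # number of rows, including missing ones
mean_chars = df['text'].str.len().mean()  # average character count; NaN rows are skipped
missing = df['text'].isnull().sum()       # how many rows are missing text
```

Note that `.str.len()` propagates NaN for missing entries, so the mean is computed over valid rows only.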

2. Text Normalization

2.1 Case Conversion

df['text_lower'] = df['text'].str.lower()
df['text_upper'] = df['text'].str.upper()

→ Prevents the same word from being counted as two distinct values, e.g. 'Apple' vs 'apple'.


2.2 Basic String Operations

df['word_count'] = df['text'].str.split().str.len()
df['contains_ai'] = df['text'].str.contains('AI', case=False)
df['replaced'] = df['text'].str.replace('data', 'information', regex=False)
ํ•จ์ˆ˜๊ธฐ๋Šฅ
.str.len()๋ฌธ์ž์—ด ๊ธธ์ด
.str.split()๋‹จ์–ด ๋ถ„ํ• 
.str.contains()ํŠน์ • ํŒจํ„ด ํฌํ•จ ์—ฌ๋ถ€
.str.replace()ํŠน์ • ๋‹จ์–ด ๋Œ€์ฒด
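A minimal illustration of the four methods on a hypothetical two-row Series; note that `.str.replace` with `regex=False` is case-sensitive:

```python
import pandas as pd

# Hypothetical mini-series to illustrate the table above
s = pd.Series(['AI and data', 'Big Data era'])

lengths = s.str.len()                          # character counts per row
words = s.str.split().str.len()                # word counts per row
has_data = s.str.contains('data', case=False)  # case-insensitive match
swapped = s.str.replace('data', 'info', regex=False)  # 'Big Data era' is untouched
```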

3. Removing Punctuation

A basic step in text preprocessing is removing unneeded symbols.

import string

def remove_punct(text):
    return text.translate(str.maketrans('', '', string.punctuation))

df['clean_text'] = df['text'].apply(remove_punct)

Examples:

  • "Hello, world!" → "Hello world"
  • "AI-driven, data-based." → "AIdriven databased"
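Under the hood, remove_punct relies on str.maketrans: when given three arguments, the third maps every listed character to None, i.e. deletion:

```python
import string

# A translation table that deletes every character in string.punctuation
table = str.maketrans('', '', string.punctuation)

print("Hello, world!".translate(table))           # Hello world
print("AI-driven, data-based.".translate(table))  # AIdriven databased
```

Building the table once and reusing it is cheaper than recreating it per call.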

4. Removing Stopwords

Removing semantically light words (e.g. the, and, is) reduces statistical distortion.

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stopword corpus

stop_words = set(stopwords.words('english'))
df['no_stopwords'] = df['clean_text'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in stop_words])
)

5. Text Tokenization

Split sentences into word or morpheme units.

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # one-time download of tokenizer models
df['tokens'] = df['clean_text'].apply(word_tokenize)

Example output

["Artificial", "Intelligence", "drives", "future", "innovation"]

For Korean text, use a morphological analyzer such as Okt or Mecab from the konlpy package.


6. Visualizing Text Data

6.1 Word Frequency Analysis

from collections import Counter
word_counts = Counter(" ".join(df['no_stopwords']).split())
pd.DataFrame(word_counts.most_common(10), columns=['word', 'count'])

6.2 Word Cloud

from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = " ".join(df['no_stopwords'])
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

plt.figure(figsize=(10,5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

WordCloud Tips

  • Generate after stopword removal to reduce visual noise
  • Customizable via options such as colormap='coolwarm' and max_words=100

7. Python Syntax Review

7.1 Variables and Types

x = 10
name = "Python"
is_active = True
์ž๋ฃŒํ˜•์˜ˆ์‹œ
์ •์ˆ˜(int)5
์‹ค์ˆ˜(float)3.14
๋ฌธ์ž์—ด(str)"text"
๋ถˆ๋ฆฌ์–ธ(bool)True, False
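For instance, type() confirms the class behind each literal:

```python
x = 10
pi = 3.14
name = "text"
flag = True

# type() returns the class object; __name__ gives its short name
print(type(x).__name__, type(pi).__name__, type(name).__name__, type(flag).__name__)
```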

7.2 Arithmetic, Comparison, and Logical Operators

a, b = 5, 3
a + b, a - b, a * b, a / b
a > b, a == b
a > 2 and b < 5

7.3 Conditionals

score = 85

if score >= 90:
    print("A")
elif score >= 80:
    print("B")
else:
    print("C")

7.4 Loops

for i in range(5):
    print(i)

n = 0
while n < 3:
    print("Loop", n)
    n += 1

List comprehension

squares = [x**2 for x in range(5)]

7.5 Functions

def greet(name):
    return f"Hello, {name}!"

add = lambda x, y: x + y
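Calling both definitions (repeated here so the snippet stands alone):

```python
def greet(name):
    return f"Hello, {name}!"   # f-string interpolates the argument

add = lambda x, y: x + y       # anonymous one-expression function

print(greet("Python"))  # Hello, Python!
print(add(2, 3))        # 5
```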

7.6 Using Built-in Functions

| Function | Purpose |
|---|---|
| len() | length |
| sum() | total |
| sorted() | sorting |
| map(), filter() | functional data processing |
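A quick sketch of the four built-ins on a sample list:

```python
nums = [3, 1, 4, 1, 5]

print(len(nums))                            # 5
print(sum(nums))                            # 14
print(sorted(nums))                         # [1, 1, 3, 4, 5]
print(list(map(lambda x: x * 2, nums)))     # [6, 2, 8, 2, 10]
print(list(filter(lambda x: x > 2, nums)))  # [3, 4, 5]
```

map and filter return lazy iterators, hence the list() wrappers to materialize the results.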

7.7 Collection Types

| Type | Example | Properties |
|---|---|---|
| list | [1, 2, 3] | ordered, mutable |
| tuple | (1, 2, 3) | ordered, immutable |
| dict | {'a': 1, 'b': 2} | key-value pairs |
| set | {1, 2, 3} | no duplicates |
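The properties in the table can be checked directly:

```python
lst = [1, 2, 3]
lst.append(4)            # mutable: now [1, 2, 3, 4]

tup = (1, 2, 3)          # immutable: tup[0] = 9 would raise TypeError

dct = {'a': 1, 'b': 2}
dct['c'] = 3             # key-value pairs: {'a': 1, 'b': 2, 'c': 3}

st = {1, 2, 2, 3}        # duplicates collapse on construction: {1, 2, 3}
```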

7.8 File I/O

# Text file
with open('sample.txt', 'w') as f:
    f.write('Hello World')

# CSV file
import pandas as pd
df.to_csv('output.csv', index=False)
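A round-trip sketch for the text-file case, using a temporary directory so no file is left behind:

```python
import os
import tempfile

# Write then read back inside a throwaway directory
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'sample.txt')
    with open(path, 'w') as f:
        f.write('Hello World')
    with open(path) as f:      # default mode is 'r'
        content = f.read()

print(content)  # Hello World
```

The `with` statement closes the file automatically, even if an exception occurs.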

8. NumPy Basics (Supplement)

import numpy as np

arr = np.array([1, 2, 3, 4])
print(arr.shape, arr.dtype)
print(arr + 10)
| Feature | Example |
|---|---|
| array creation | np.array([1,2,3]) |
| slicing | arr[1:3] |
| broadcasting | arr * 2 |
| math operations | np.mean(arr) |
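The table rows map onto a short sketch:

```python
import numpy as np

arr = np.array([1, 2, 3, 4])

print(arr[1:3])      # slicing: [2 3]
print(arr * 2)       # broadcasting a scalar: [2 4 6 8]
print(np.mean(arr))  # elementwise math reduction: 2.5
```

Broadcasting applies the scalar to every element without an explicit loop, which is why NumPy operations outrun plain Python lists on large arrays.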

9. Overall Summary

| Topic | Key content | Core code |
|---|---|---|
| Text normalization | case, punctuation, stopword handling | .str.lower(), translate(), stopwords |
| Tokenization | word-level splitting | word_tokenize() |
| Visualization | word frequency, word cloud | Counter, WordCloud |
| Python basics | variables, conditionals, loops, functions | if, for, def, lambda |
| Data type review | list, dict, tuple, set | [ ], { }, ( ), set() |
| NumPy | numeric array operations | np.array, np.mean, np.shape |

10. Practical Takeaways

  • Build the processing pipeline in the order: text cleaning → tokenization → visualization
  • .str methods are vectorized within Pandas, so they stay efficient even on large datasets
  • Word clouds are useful for surfacing keywords during EDA (exploratory data analysis)
  • Solid Python fundamentals underpin text analysis, data preprocessing, and AI/NLP work in general
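The pipeline order in the first bullet can be sketched with the standard library alone; the stopword set here is a small hypothetical stand-in for NLTK's list:

```python
import string
from collections import Counter

# Hypothetical stand-in for NLTK's English stopword list
STOPWORDS = {'the', 'and', 'is', 'of', 'a'}

def pipeline(texts):
    counts = Counter()
    for text in texts:
        # 1) normalize case  2) strip punctuation  3) tokenize  4) drop stopwords
        clean = text.lower().translate(str.maketrans('', '', string.punctuation))
        counts.update(w for w in clean.split() if w not in STOPWORDS)
    return counts

freq = pipeline(["AI is the future.", "The future of AI is data!"])
print(freq.most_common(2))  # [('ai', 2), ('future', 2)]
```

Swapping in NLTK's stopword list and tokenizer turns this sketch into the full pipeline from sections 3–5.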