🧭 Python Basics · Text Data Analysis (Supplement)

okorion · October 29, 2025

1. Text Data Meets Pandas

1.1 Overview

Unlike structured data (numbers, dates), text data needs processing before it can be analyzed.
Pandas string methods (.str) combined with core Python syntax make cleaning and analysis possible.

import pandas as pd

df = pd.read_csv('text_data.csv')
df.head()

1.2 Exploring Text Data

df.info()
df['text'].head()
df['text'].describe()

Key exploration metrics

  • len(df['text']): number of sentences (rows)
  • df['text'].str.len().mean(): average character count
  • df['text'].isnull().sum(): missing-value count
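As a sketch, the three metrics computed on a tiny hypothetical DataFrame (real data would come from text_data.csv):

```python
import pandas as pd

# Hypothetical stand-in for the 'text' column of text_data.csv
df = pd.DataFrame({'text': ['AI drives innovation', 'Data is everywhere', None]})

sentence_count = len(df['text'])          # number of rows, including missing ones
mean_chars = df['text'].str.len().mean()  # average character count; NaN rows are skipped
missing = df['text'].isnull().sum()       # how many rows are missing text
```

Note that `.str.len()` propagates NaN for missing entries, so the mean is computed over valid rows only.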

2. Text Normalization

2.1 Case Conversion

df['text_lower'] = df['text'].str.lower()
df['text_upper'] = df['text'].str.upper()

→ Prevents the same word from being counted as two distinct values, e.g. 'Apple' vs 'apple'.


2.2 Basic String Operations

df['word_count'] = df['text'].str.split().str.len()
df['contains_ai'] = df['text'].str.contains('AI', case=False)
df['replaced'] = df['text'].str.replace('data', 'information', regex=False)
ํ•จ์ˆ˜๊ธฐ๋Šฅ
.str.len()๋ฌธ์ž์—ด ๊ธธ์ด
.str.split()๋‹จ์–ด ๋ถ„ํ• 
.str.contains()ํŠน์ • ํŒจํ„ด ํฌํ•จ ์—ฌ๋ถ€
.str.replace()ํŠน์ • ๋‹จ์–ด ๋Œ€์ฒด
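A minimal illustration of the four methods on a hypothetical two-row Series; note that `.str.replace` with `regex=False` is case-sensitive:

```python
import pandas as pd

# Hypothetical mini-series to illustrate the table above
s = pd.Series(['AI and data', 'Big Data era'])

lengths = s.str.len()                          # character counts per row
words = s.str.split().str.len()                # word counts per row
has_data = s.str.contains('data', case=False)  # case-insensitive match
swapped = s.str.replace('data', 'info', regex=False)  # 'Big Data era' is untouched
```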

3. Removing Punctuation

A basic step in text preprocessing is removing unneeded symbols.

import string

def remove_punct(text):
    return text.translate(str.maketrans('', '', string.punctuation))

df['clean_text'] = df['text'].apply(remove_punct)

Examples:

  • "Hello, world!" → "Hello world"
  • "AI-driven, data-based." → "AIdriven databased"
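Under the hood, remove_punct relies on str.maketrans: when given three arguments, the third maps every listed character to None, i.e. deletion:

```python
import string

# A translation table that deletes every character in string.punctuation
table = str.maketrans('', '', string.punctuation)

print("Hello, world!".translate(table))           # Hello world
print("AI-driven, data-based.".translate(table))  # AIdriven databased
```

Building the table once and reusing it is cheaper than recreating it per call.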

4. Removing Stopwords

Removing semantically light words (e.g. the, and, is) reduces statistical distortion.

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stopword corpus

stop_words = set(stopwords.words('english'))
df['no_stopwords'] = df['clean_text'].apply(
    lambda x: ' '.join([word for word in x.split() if word not in stop_words])
)

5. Text Tokenization

Split sentences into word or morpheme units.

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # one-time download of tokenizer models
df['tokens'] = df['clean_text'].apply(word_tokenize)

Example output

["Artificial", "Intelligence", "drives", "future", "innovation"]

For Korean text, use a morphological analyzer such as Okt or Mecab from the konlpy package.


6. Visualizing Text Data

6.1 Word Frequency Analysis

from collections import Counter
word_counts = Counter(" ".join(df['no_stopwords']).split())
pd.DataFrame(word_counts.most_common(10), columns=['word', 'count'])

6.2 Word Cloud

from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = " ".join(df['no_stopwords'])
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

plt.figure(figsize=(10,5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

WordCloud Tips

  • Generate after stopword removal to reduce visual noise
  • Customizable via options such as colormap='coolwarm' and max_words=100

7. Python Syntax Review

7.1 Variables and Types

x = 10
name = "Python"
is_active = True
์ž๋ฃŒํ˜•์˜ˆ์‹œ
์ •์ˆ˜(int)5
์‹ค์ˆ˜(float)3.14
๋ฌธ์ž์—ด(str)"text"
๋ถˆ๋ฆฌ์–ธ(bool)True, False
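For instance, type() confirms the class behind each literal:

```python
x = 10
pi = 3.14
name = "text"
flag = True

# type() returns the class object; __name__ gives its short name
print(type(x).__name__, type(pi).__name__, type(name).__name__, type(flag).__name__)
```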

7.2 Arithmetic, Comparison, and Logical Operators

a, b = 5, 3
a + b, a - b, a * b, a / b
a > b, a == b
a > 2 and b < 5

7.3 Conditionals

score = 85

if score >= 90:
    print("A")
elif score >= 80:
    print("B")
else:
    print("C")

7.4 Loops

for i in range(5):
    print(i)

n = 0
while n < 3:
    print("Loop", n)
    n += 1

List comprehension

squares = [x**2 for x in range(5)]

7.5 Functions

def greet(name):
    return f"Hello, {name}!"

add = lambda x, y: x + y
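Calling both definitions (repeated here so the snippet stands alone):

```python
def greet(name):
    return f"Hello, {name}!"   # f-string interpolates the argument

add = lambda x, y: x + y       # anonymous one-expression function

print(greet("Python"))  # Hello, Python!
print(add(2, 3))        # 5
```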

7.6 Using Built-in Functions

| Function | Purpose |
|---|---|
| len() | length |
| sum() | total |
| sorted() | sorting |
| map(), filter() | functional data processing |
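A quick sketch of the four built-ins on a sample list:

```python
nums = [3, 1, 4, 1, 5]

print(len(nums))                            # 5
print(sum(nums))                            # 14
print(sorted(nums))                         # [1, 1, 3, 4, 5]
print(list(map(lambda x: x * 2, nums)))     # [6, 2, 8, 2, 10]
print(list(filter(lambda x: x > 2, nums)))  # [3, 4, 5]
```

map and filter return lazy iterators, hence the list() wrappers to materialize the results.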

7.7 Collection Types

| Type | Example | Properties |
|---|---|---|
| list | [1, 2, 3] | ordered, mutable |
| tuple | (1, 2, 3) | ordered, immutable |
| dict | {'a': 1, 'b': 2} | key-value pairs |
| set | {1, 2, 3} | no duplicates |
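The properties in the table can be checked directly:

```python
lst = [1, 2, 3]
lst.append(4)            # mutable: now [1, 2, 3, 4]

tup = (1, 2, 3)          # immutable: tup[0] = 9 would raise TypeError

dct = {'a': 1, 'b': 2}
dct['c'] = 3             # key-value pairs: {'a': 1, 'b': 2, 'c': 3}

st = {1, 2, 2, 3}        # duplicates collapse on construction: {1, 2, 3}
```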

7.8 File I/O

# Text file
with open('sample.txt', 'w') as f:
    f.write('Hello World')

# CSV file
import pandas as pd
df.to_csv('output.csv', index=False)
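A round-trip sketch for the text-file case, using a temporary directory so no file is left behind:

```python
import os
import tempfile

# Write then read back inside a throwaway directory
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'sample.txt')
    with open(path, 'w') as f:
        f.write('Hello World')
    with open(path) as f:      # default mode is 'r'
        content = f.read()

print(content)  # Hello World
```

The `with` statement closes the file automatically, even if an exception occurs.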

8. NumPy Basics (Supplement)

import numpy as np

arr = np.array([1, 2, 3, 4])
print(arr.shape, arr.dtype)
print(arr + 10)
| Feature | Example |
|---|---|
| array creation | np.array([1,2,3]) |
| slicing | arr[1:3] |
| broadcasting | arr * 2 |
| math operations | np.mean(arr) |
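The table rows map onto a short sketch:

```python
import numpy as np

arr = np.array([1, 2, 3, 4])

print(arr[1:3])      # slicing: [2 3]
print(arr * 2)       # broadcasting a scalar: [2 4 6 8]
print(np.mean(arr))  # elementwise math reduction: 2.5
```

Broadcasting applies the scalar to every element without an explicit loop, which is why NumPy operations outrun plain Python lists on large arrays.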

9. Overall Summary

| Topic | Key content | Core code |
|---|---|---|
| Text normalization | case, punctuation, stopword handling | .str.lower(), translate(), stopwords |
| Tokenization | word-level splitting | word_tokenize() |
| Visualization | word frequency, word cloud | Counter, WordCloud |
| Python basics | variables, conditionals, loops, functions | if, for, def, lambda |
| Data type review | list, dict, tuple, set | [ ], { }, ( ), set() |
| NumPy | numeric array operations | np.array, np.mean, np.shape |

10. Practical Takeaways

  • Build the processing pipeline in the order: text cleaning → tokenization → visualization
  • .str methods are vectorized within Pandas, so they stay efficient even on large datasets
  • Word clouds are useful for surfacing keywords during EDA (exploratory data analysis)
  • Solid Python fundamentals underpin text analysis, data preprocessing, and AI/NLP work in general
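The pipeline order in the first bullet can be sketched with the standard library alone; the stopword set here is a small hypothetical stand-in for NLTK's list:

```python
import string
from collections import Counter

# Hypothetical stand-in for NLTK's English stopword list
STOPWORDS = {'the', 'and', 'is', 'of', 'a'}

def pipeline(texts):
    counts = Counter()
    for text in texts:
        # 1) normalize case  2) strip punctuation  3) tokenize  4) drop stopwords
        clean = text.lower().translate(str.maketrans('', '', string.punctuation))
        counts.update(w for w in clean.split() if w not in STOPWORDS)
    return counts

freq = pipeline(["AI is the future.", "The future of AI is data!"])
print(freq.most_common(2))  # [('ai', 2), ('future', 2)]
```

Swapping in NLTK's stopword list and tokenizer turns this sketch into the full pipeline from sections 3–5.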