🧭 Time Series Data Analysis with Python & Pandas

okorion · October 29, 2025

1. Understanding Python's Built-in datetime Module

1.1 What Is datetime?

datetime is the Python standard library module for handling dates and times together.
It lets you work precisely with year, month, day, hour, minute, and second values.

import datetime as dt

# Define a date
my_date = dt.date(2020, 3, 22)
print(my_date)          # 2020-03-22
print(type(my_date))    # <class 'datetime.date'>

# Date + time
my_datetime = dt.datetime(2020, 3, 22, 8, 20, 50)
print(my_datetime)      # 2020-03-22 08:20:50
print(my_datetime.hour) # 8
print(my_datetime.minute) # 20

Key concepts

  • date → holds the date only
  • datetime → holds both the date and the time
  • Components are accessible through attributes such as .year, .month, .day, .hour, and .minute

1.2 Converting Between Strings and datetime

# datetime → string
str(my_datetime)  # '2020-03-22 08:20:50'

# string → datetime
converted = dt.datetime.strptime('2020-03-22', '%Y-%m-%d')
print(converted)  # 2020-03-22 00:00:00
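
As a small supplementary sketch (not part of the original notes), strftime gives explicit control over the output format, and strptime with the same format string reverses it. The format string here is just an illustrative choice.

```python
import datetime as dt

my_datetime = dt.datetime(2020, 3, 22, 8, 20, 50)

# datetime -> string with an explicit format
formatted = my_datetime.strftime('%Y/%m/%d %H:%M')
print(formatted)  # 2020/03/22 08:20

# string -> datetime with the same format
# (seconds were not written out, so they come back as 0)
parsed = dt.datetime.strptime(formatted, '%Y/%m/%d %H:%M')
print(parsed)  # 2020-03-22 08:20:00
```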

1.3 Viewing Calendar Information

import calendar
print(calendar.month(2021, 3))
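
Besides printing a text calendar, the calendar module can answer simple questions about a month. A minimal sketch with two of its standard functions:

```python
import calendar

# monthrange returns (weekday of the 1st, number of days); Monday is 0
first_weekday, days_in_month = calendar.monthrange(2021, 3)
print(first_weekday, days_in_month)  # 0 31

# isleap reports whether February has 29 days in that year
print(calendar.isleap(2020), calendar.isleap(2021))  # True False
```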

2. Handling Datetime in Pandas

2.1 Converting Strings to Datetime

import pandas as pd

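dates = pd.Series(['2020/03/22', '2020-08-25', 'March 22nd, 2020'])
# pandas 2.0+ infers a single format from the first element,
# so a Series with mixed formats needs format='mixed'
pd.to_datetime(dates, format='mixed')

With format='mixed' (pandas 2.0 and later), each string is parsed individually and the result is a consistent datetime64 Series. Older pandas versions performed this per-element parsing automatically and do not accept the argument.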
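
When the format is known in advance, passing it explicitly is stricter and usually faster. The sketch below uses a hypothetical Series with one bad value to show how errors='coerce' turns unparseable entries into NaT instead of raising.

```python
import pandas as pd

# Hypothetical input: two valid dates and one invalid entry
raw = pd.Series(['2020-03-22', '2020-08-25', 'not a date'])

# An explicit format is strict; errors='coerce' converts failures to NaT
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')
print(parsed)
print(parsed.isna().sum())  # 1
```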


2.2 Timestamp and DatetimeIndex

# Define a Timestamp
ts = pd.Timestamp(2020, 3, 22, 10)
print(ts)  # 2020-03-22 10:00:00

# Compute the difference between two dates
day_1 = pd.Timestamp(1998, 3, 22)
day_2 = pd.Timestamp(2021, 3, 22)
print(day_2 - day_1)  # 8401 days 00:00:00
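
Subtracting two Timestamps yields a pd.Timedelta. A short sketch of how it can be inspected and reused:

```python
import pandas as pd

day_1 = pd.Timestamp(1998, 3, 22)
day_2 = pd.Timestamp(2021, 3, 22)

gap = day_2 - day_1           # a pd.Timedelta
print(type(gap).__name__)     # Timedelta
print(gap.days)               # 8401

# Adding a Timedelta moves a Timestamp forward by an exact duration
one_week_later = day_1 + pd.Timedelta(days=7)
print(one_week_later)         # 1998-03-29 00:00:00
```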

Creating a DatetimeIndex

dates_list = [
    dt.date(2020, 3, 22),
    dt.date(2020, 4, 22),
    dt.date(2020, 5, 22)
]
date_index = pd.DatetimeIndex(dates_list)

2.3 A Series Indexed by a DatetimeIndex

sales = [50000, 65000, 72000]
sales_series = pd.Series(data=sales, index=date_index)
print(sales_series)

The data is now in a form that supports time series analysis.
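
One immediate benefit of a DatetimeIndex is date-string selection. The sketch below rebuilds the same three-row Series and selects rows by month and by date range.

```python
import datetime as dt
import pandas as pd

date_index = pd.DatetimeIndex([
    dt.date(2020, 3, 22),
    dt.date(2020, 4, 22),
    dt.date(2020, 5, 22),
])
sales_series = pd.Series([50000, 65000, 72000], index=date_index)

# Partial string indexing: every row that falls in April 2020
april = sales_series.loc['2020-04']
print(april)

# Slicing by date strings; both endpoints are inclusive
spring_total = sales_series.loc['2020-04-01':'2020-05-31'].sum()
print(spring_total)  # 137000
```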


3. Generating Date Ranges and Resample Basics

3.1 Generating a Date Range

my_days = pd.date_range(start='2020-01-01', end='2020-04-01', freq='D')
print(len(my_days))  # 92 days

Common options

  • 'D': daily
  • 'M': month end ('ME' in pandas 2.2 and later)
  • 'B': business days (weekdays) only
  • 'W': weekly
  • 'Q': quarter end ('QE' in pandas 2.2 and later)
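
To see how the frequency changes the result, here is a sketch that builds the same window at several frequencies. Passing an offset object such as pd.offsets.MonthEnd() is one way to sidestep the alias differences between pandas versions.

```python
import pandas as pd

start, end = '2020-01-01', '2020-04-01'

# Same window, different frequencies
daily = pd.date_range(start, end, freq='D')
weekly = pd.date_range(start, end, freq='W')   # 'W' anchors on Sunday by default
print(len(daily))   # 92
print(len(weekly))  # 13

# An offset object instead of a string alias
month_ends = pd.date_range(start, end, freq=pd.offsets.MonthEnd())
print(month_ends)   # 2020-01-31, 2020-02-29, 2020-03-31
```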

3.2 Dates Containing Only Business Days

business_days = pd.date_range('2020-01-01', '2020-04-01', freq='B')
print(business_days)
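
A quick sanity check on the result, as a sketch: every date should be a weekday. Note that 'B' skips weekends only, so public holidays are still included.

```python
import pandas as pd

business_days = pd.date_range('2020-01-01', '2020-04-01', freq='B')

# dayofweek: Monday=0 ... Sunday=6, so business days are all below 5
print(business_days.dayofweek.max())  # 4
print(len(business_days))             # 66

# New Year's Day 2020 fell on a Wednesday, so it is still in the range
has_new_year = pd.Timestamp('2020-01-01') in business_days
print(has_new_year)                   # True
```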

4. Real-World Data: Avocado Price Analysis

4.1 Loading the Data

avocado_df = pd.read_csv('avocado.csv')
avocado_df['Date'] = pd.to_datetime(avocado_df['Date'])
avocado_df.set_index('Date', inplace=True)
avocado_df.info()

Sample output

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 18249 entries
Columns: [AveragePrice, Total Volume, type, region]
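
The three loading steps can also be folded into read_csv itself. The sketch below uses a tiny inline CSV with hypothetical rows (only the column names are taken from these notes) so it runs without avocado.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for avocado.csv
csv_text = """Date,AveragePrice,Total Volume,type,region
2015-01-11,1.24,41195.08,conventional,Albany
2015-01-04,1.22,40873.28,conventional,Albany
2015-01-04,1.79,1373.95,organic,Albany
"""

# parse_dates + index_col build the DatetimeIndex while reading;
# sort_index puts the rows in chronological order
df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=['Date'],
    index_col='Date',
).sort_index()

print(type(df.index).__name__)            # DatetimeIndex
print(df.index.is_monotonic_increasing)   # True
```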

4.2 Accessing Specific Dates

# Single date
avocado_df.loc['2018-01-21']

# Filter by period (both endpoints are inclusive)
avocado_df.loc['2015-01-04':'2015-01-25']
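
Selection also works at coarser resolutions. A sketch on hypothetical weekly prices (standing in for the avocado data) that selects a whole month and a whole year:

```python
import pandas as pd

# Hypothetical weekly prices, one row per Sunday
idx = pd.date_range('2018-01-07', periods=8, freq='W')
df = pd.DataFrame(
    {'AveragePrice': [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8]},
    index=idx,
)

# A year-month string selects every row in that month
january = df.loc['2018-01']
print(len(january))  # 4

# A year string selects the whole year
year_2018 = df.loc['2018']
print(len(year_2018))  # 8
```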

4.3 Adjusting the Date Range and Sorting

avocado_df.sort_index(inplace=True)

# Extract only a specific period (truncate expects a sorted index)
trimmed = avocado_df.truncate(before='2017-01-01', after='2018-02-01')
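
On a sorted index, truncate and a .loc label slice select the same rows. A sketch on hypothetical weekly data:

```python
import pandas as pd

# Hypothetical weekly prices, one row per Sunday
idx = pd.date_range('2016-12-25', periods=6, freq='W')
df = pd.DataFrame({'AveragePrice': [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]}, index=idx)

# before/after are inclusive bounds
trimmed = df.truncate(before='2017-01-01', after='2017-01-15')

# The same rows via a label slice
sliced = df.loc['2017-01-01':'2017-01-15']

print(len(trimmed))            # 3
print(trimmed.equals(sliced))  # True
```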

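4.4 Shifting the Index with DateOffset

avocado_df.index = avocado_df.index + pd.DateOffset(months=12, days=30)

→ Shifts the entire time series by a fixed period.
→ It can usually be moved back with - pd.DateOffset(...), but month-based offsets follow calendar rules, so the round trip is not guaranteed to be exact for every date.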
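
A sketch of that round-trip caveat with two hypothetical dates: the months are applied before the days, so when the shift crosses a leap-year February the subtraction lands one day off.

```python
import pandas as pd

offset = pd.DateOffset(months=12, days=30)

# Two hypothetical dates from the first weeks of 2015
idx = pd.DatetimeIndex(['2015-01-04', '2015-02-15'])

shifted = idx + offset
restored = shifted - offset

print(shifted.strftime('%Y-%m-%d').tolist())   # ['2016-02-03', '2016-03-16']
print(restored.strftime('%Y-%m-%d').tolist())  # ['2015-01-04', '2015-02-14']
```

If an exact undo matters, keeping a copy of the original index is the safer option.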


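5. Resampling and Aggregation

Rule    Meaning      Example
'A'     Yearly       .resample('A').mean()
'Q'     Quarterly    .resample('Q').mean()
'M'     Monthly      .resample('M').mean()
'W'     Weekly       .resample('W').mean()

# In pandas 2.2 and later, 'A', 'Q', and 'M' are spelled 'YE', 'QE', and 'ME'.
# Select the price column first: in pandas 2.0 and later, mean() raises
# on text columns such as type and region.

# Yearly average price
avocado_df['AveragePrice'].resample('A').mean()

# Quarterly maximum price
avocado_df['AveragePrice'].resample('Q').max()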
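
Resampling can also produce several statistics at once through agg. A sketch on hypothetical daily values, using the weekly rule:

```python
import pandas as pd

# Hypothetical daily prices for two full weeks (2020-01-06 is a Monday)
idx = pd.date_range('2020-01-06', periods=14, freq='D')
prices = pd.Series(range(1, 15), index=idx, dtype='float64')

# Weekly bins ending on Sunday, with several statistics per bin
weekly = prices.resample('W').agg(['mean', 'min', 'max'])
print(weekly)
```

Each row is labeled with the Sunday that closes the bin, so the two rows here are 2020-01-12 and 2020-01-19.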

5.1 Conditional Filtering

# where() keeps the original shape: prices of 1.2 or more become NaN
low_price = avocado_df['AveragePrice'].where(avocado_df['AveragePrice'] < 1.2)
# Number of rows per month with an average price below 1.5
(avocado_df['AveragePrice'] < 1.5).resample('M').sum()
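
The counting trick works because True sums as 1 and False as 0. A sketch on hypothetical daily prices, using weekly bins:

```python
import pandas as pd

# Hypothetical daily prices for two full weeks (Monday to Sunday)
idx = pd.date_range('2020-01-06', periods=14, freq='D')
prices = pd.Series(
    [1.0, 1.6, 1.4, 1.7, 1.2, 1.9, 1.1,
     1.8, 1.3, 1.6, 1.7, 1.5, 1.9, 2.0],
    index=idx,
)

is_cheap = prices < 1.5

# Summing booleans counts the True values in each bin
cheap_counts = is_cheap.resample('W').sum()
print(cheap_counts.tolist())  # [4, 1]

# Averaging booleans gives the share of cheap days per bin
cheap_share = is_cheap.resample('W').mean()
print(cheap_share.round(2).tolist())  # [0.57, 0.14]
```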

6. Extracting Date Components

avocado_df.reset_index(inplace=True)
avocado_df['Day'] = avocado_df['Date'].dt.day
avocado_df['Month'] = avocado_df['Date'].dt.month
avocado_df['Year'] = avocado_df['Date'].dt.year
avocado_df.set_index('Date', inplace=True)
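
As an alternative sketch, a DatetimeIndex exposes the same components directly, so the reset_index / set_index round trip can be skipped. The data and the Weekday column are hypothetical additions.

```python
import pandas as pd

# Hypothetical weekly prices, one row per Sunday
idx = pd.date_range('2018-01-07', periods=3, freq='W')
df = pd.DataFrame({'AveragePrice': [1.1, 1.2, 1.3]}, index=idx)

# Components are attributes of the index itself (no .dt accessor needed)
df['Day'] = df.index.day
df['Month'] = df.index.month
df['Year'] = df.index.year
df['Weekday'] = df.index.day_name()
print(df)
```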

7. Time Series Visualization

7.1 Monthly Average Price

# Select the price column before resampling so mean() only sees numeric data
avocado_df['AveragePrice'].resample('M').mean().plot(
    figsize=(10,5),
    marker='o',
    color='r',
    title='Monthly Average Avocado Price Trend'
)

7.2 Quarterly Average Price

avocado_df['AveragePrice'].resample('Q').mean().plot(
    figsize=(10,5),
    marker='o',
    color='r',
    title='Quarterly Average Avocado Price'
)

7.3 Yearly Average Price

avocado_df['AveragePrice'].resample('A').mean().plot(
    figsize=(10,5),
    marker='o',
    color='b',
    title='Yearly Average Avocado Price'
)

8. Visualizing Trends with Seaborn

8.1 Violin Plot: Price Distribution by Type

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10,7))
sns.violinplot(
    x='type', 
    y='AveragePrice', 
    data=avocado_df,
    palette='Set2'
)

Result:
The plot makes it easy to see that organic avocados have a higher average price than conventional ones.
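
A visual impression like this can be cross-checked numerically with groupby. The sketch below uses hypothetical prices only to show the shape of the check; on the real data, df would be avocado_df.

```python
import pandas as pd

# Hypothetical prices, not taken from the avocado dataset
df = pd.DataFrame({
    'type': ['conventional', 'conventional', 'organic', 'organic'],
    'AveragePrice': [1.10, 1.30, 1.60, 1.80],
})

# Mean and median price per avocado type
summary = df.groupby('type')['AveragePrice'].agg(['mean', 'median'])
print(summary)
```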


8.2 Histplot: Visualizing the Overall Distribution

plt.figure(figsize=(13,6))
sns.histplot(avocado_df['AveragePrice'], kde=True, color='steelblue')

8.3 Catplot: Comparing Prices by Region and Year

sns.catplot(
    x='AveragePrice',
    y='region',
    hue='Year',
    data=avocado_df[avocado_df['type']=='conventional'],
    height=10
)

→ Price differences by region and year are visible at a glance; comparing this plot with the organic one in 8.5 adds the type dimension.
The price structure of major cities such as San Francisco and Chicago can be compared visually.


8.4 Weekly Trend

avocado_df['AveragePrice'].resample('W').mean().plot(
    figsize=(10,5),
    marker='o',
    color='r',
    title='Weekly Average Price Trend'
)

8.5 Catplot for Organic Avocados

sns.catplot(
    x='AveragePrice',
    y='region',
    hue='Year',
    data=avocado_df[avocado_df['type']=='organic'],
    height=10
)

9. Summary

Section  Key Topic                          Core Code
1        Python datetime basics             datetime.date(), datetime.datetime()
2        Pandas Timestamp / DatetimeIndex   pd.to_datetime, pd.DatetimeIndex
3        Date ranges and frequencies        pd.date_range(freq='M')
4        Loading real data                  pd.read_csv, .set_index('Date')
5        Resample-based aggregation         .resample('Q').mean()
6        Splitting out date components      .dt.day, .dt.month, .dt.year
7        Matplotlib visualization           .plot(marker='o')
8        Advanced Seaborn visualization     sns.violinplot, sns.catplot

Takeaways

  • Handling dates as a DatetimeIndex rather than as strings makes resampling, date-based filtering, and aggregation possible.
  • Resample lets you analyze trends per time unit (month, quarter, year).
  • Visualization makes patterns and outliers easy to spot.
  • The same principles apply in practice to time-based analysis of sales, traffic, and log data.