โœ๏ธ 24.12.24 TIL

Dadaยท2024๋…„ 12์›” 24์ผ

๐Ÿ“… TO DO


  • Pandas ๊ฐœ์ธ๊ณผ์ œ 6๋ฒˆ ํ’€์ด
  • Pandas ๊ฐœ์ธ ๊ณผ์ œ ํ•ด์„ค ๋ผ์ด๋ธŒ์„ธ์…˜
  • Pandas ๋ผ์ด๋ธŒ์„ธ์…˜ 3ํšŒ์ฐจ ๋ณต์Šต

โ–ถ Pandas ๊ฐœ์ธ ๊ณผ์ œ

6๋ฒˆ ๋ฌธ์ œ. SQL ๋ฌธ๋ฒ• โ†’ ํŒŒ์ด์ฌ ๋ฌธ๋ฒ•

  • ์ƒ˜ํ”Œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์›”๋ณ„ ์ฃผ๋ฌธ ๊ฑด์ˆ˜ ๊ณ„์‚ฐ
  • ์ฃผ๋ฌธ์ด ์—†๋Š” ๋‹ฌ๋„ 0๊ฑด์œผ๋กœ ํฌํ•จ
  • ์ง€๋‚œ๋‹ฌ ๋Œ€๋น„ ์ฃผ๋ฌธ ๊ฑด์ˆ˜ ์ฆ๊ฐ์œจ ๊ณ„์‚ฐ
  • ๊ฒฐ๊ณผ๋Š” ์—ฐ-์›” ์ˆœ ์ •๋ ฌ + ์ฆ๊ฐ์œจ ์†Œ์ˆ˜์  ๋‘˜์งธ ์ž๋ฆฌ๊นŒ์ง€ ๋ฐ˜์˜ฌ๋ฆผ
# Skeleton ์ฝ”๋“œ
import pandas as pd
orders = pd.read_csv(""orders.csv")

# ์ปฌ๋Ÿผ๋ช… ์งง๊ฒŒ ์ˆ˜์ • (order_purchase_timestamp โ†’ order_time)
# df.rename() ํ™œ์šฉ
orders.rename(columns={'order_purchase_timestamp':'order_time'}, inplace=True)

# order_time ์ปฌ๋Ÿผ dtype ๋ณ€๊ฒฝ (object โ†’ datetime)
# pandas.to_datetime() ํ™œ์šฉ
orders['order_time'] = pd.to_datetime(orders['order_time'])

# Date_Format ์ˆ˜์ • (์˜ˆ: '2026-10') + month ์ปฌ๋Ÿผ์— ์ถ”๊ฐ€
# Series.dt.strftime() ํ™œ์šฉ
orders['months'] = orders['order_time'].dt.strftime('%Y-%m')

# ๋ชจ๋“  ์—ฐ์›”์ด ๋‹ด๊ธด ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ์ƒ์„ฑ 
# pd.date_range() ํ™œ์šฉ
all_months = pd.date_range(start = orders['order_time'].min(), end = orders['order_time'].max(),freq="MS")
all_months = pd.DataFrame(all_months, columns=['months'])

# Date_Format ์ˆ˜์ • 
all_months['months'] = all_months['months'].dt.strftime('%Y-%m')

# orders ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ๊ธฐ๋ฐ˜ ์›”๋ณ„ ์ฃผ๋ฌธ๋Ÿ‰ count
# df.groupby().count() ํ™œ์šฉ
orders_month = orders.groupby(['months'])['order_id'].count().reset_index()

# column๋ช… ์ˆ˜์ • (order_id โ†’ order_count)
orders_month = orders_month.rename(columns={'order_id':'order_count'})

# all_months + orders ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ํ•ฉ์น˜๊ธฐ
# pd.merge() ํ™œ์šฉ
all_months_table = pd.merge(all_months, orders_month, how='left', on = 'months')

# ๊ฒฐ์ธก๊ฐ’ 0์ฒ˜๋ฆฌ
# df.fillna() ํ™œ์šฉ
all_months_table['order_count'] = all_months_table['order_count'].fillna(value=0)

# ์ „์›” ์ฃผ๋ฌธ๊ฑด์ˆ˜ ์ปฌ๋Ÿผ ์ถ”๊ฐ€
# df.shift() ํ™œ์šฉ
all_months_table['before_order_count'] = all_months_table['order_count'].shift(1)

# ์ฆ๊ฐ์œจ ๊ณ„์‚ฐ - ์กฐ๊ฑด์— ๋”ฐ๋ผ ๋‹ค๋ฅธ ๊ณ„์‚ฐ์‹ ์ ์šฉ
# df.loc[์กฐ๊ฑด์‹, column] = value ํ™œ์šฉ
all_months_table['growth_rate'] = 0
all_months_table.loc[all_months_table['before_order_count'] == 0, "growth_rate"] = all_months_table['order_count'] * 100
all_months_table.loc[all_months_table['before_order_count'] != 0, "growth_rate"] = (all_months_table['order_count']-all_months_table['before_order_count']) / all_months_table['before_order_count'] * 100

# ์†Œ์ˆ˜์  ๋‘˜์งธ ์ž๋ฆฌ๊นŒ์ง€ ๋ฐ˜์˜ฌ๋ฆผ
# round() ํ™œ์šฉ
all_months_table['growth_rate'] = round(all_months_table['growth_rate'],2)

# ์ปฌ๋Ÿผ๋ช… ์ˆ˜์ •(order_count โ†’ cnt_orders)
all_months_table.rename(columns={'order_count':'cnt_orders'}, inplace=True)

# ํ•„์š”ํ•œ ์ปฌ๋Ÿผ๋งŒ ์ถ”์ถœ
all_months_table[['months', 'cnt_orders','growth_rate']]

0๊ฐœ์˜ ๋Œ“๊ธ€