[Week 1] Pandas

ํ˜œ ์ฝฉยท2022๋…„ 9์›” 26์ผ
0
post-thumbnail

๐Ÿšฉ Pandas


๐Ÿ Groupby

df.groupby("Team")["Points"].mean()
# Team ๋ณ„๋กœ Points ์นผ๋Ÿผ ๋ฐ์ดํ„ฐ๋“ค์˜ ํ‰๊ท ๊ฐ’๋“ค์„ ๊ณ„์‚ฐ
h_index = df.groupby(["Team", "Year")["Points"].sum()
# Team๋ณ„๋กœ Year๋ณ„๋กœ Points ์นผ๋Ÿผ ๋ฐ์ดํ„ฐ๋“ค์˜ ํ•ฉ์„ ๊ณ„์‚ฐ
# groupby์— ์˜ํ•ด split๋œ ์ƒํƒœ ์ถ”์ถœ
grouped = df.groupby("Team")
for name, group in grouped:
	print(name)
    print(group)

๐ŸŸจ Aggregation

  • ์š”์•ฝ๋œ ํ†ต๊ณ„์ •๋ณด๋ฅผ ์ถ”์ถœ
  • grouped.agg(max)

๐ŸŸจ Transformation

  • ํ•ด๋‹น ์ •๋ณด๋ฅผ ๋ณ€ํ™˜ (์•„๋ž˜ ์ฝ”๋“œ: ๊ทธ๋ฃน๋ณ„๋กœ ํ•จ์ˆ˜(lambda) ์ ์šฉ)
  • ๊ทธ๋ฃน๋ณ„๋กœ ์—ฐ์‚ฐ์„ ํ•ด์ฃผ๊ณ  ๊ฐ๊ฐ์˜ ๊ฐ’์— ์˜ํ–ฅ์„ ์ค„ ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์ค€๋‹ค.

๐ŸŸจ Filteration

  • ํŠน์ • ์ •๋ณด๋ฅผ ์ œ๊ฑฐํ•˜์—ฌ ํ•„ํ„ฐ๋ง ๊ธฐ๋Šฅ

๐Ÿ Pivot Table

  • ๋ฐ์ดํ„ฐ ๊ฐ’๋“ค: duration ์นผ๋Ÿผ๊ฐ’๋“ค
  • index ์„ค์ •: month, item ์นผ๋Ÿผ๋ณ„๋กœ
  • ํ…Œ์ด๋ธ”์˜ ์นผ๋Ÿผ(์†์„ฑ)๋“ค: network ์นผ๋Ÿผ๋“ค๋กœ
  • aggregation function: sum() ํ•ด์„œ ๊ฐ’๋“ค ํ‘œํ˜„
  • NaN์ด ์—†๋„๋ก 0์œผ๋กœ ์ฑ„์›Œ์ฃผ๊ธฐ
df_phone.groupby(['month','item','network'])['duration'].sum().unstack()
ํ•˜๋ฉด ๊ฐ™๊ฒŒ ๋‚˜์˜จ๋‹ค.

๐Ÿ Crosstab

  • ๋‘ ์นผ๋Ÿผ์˜ ๊ต์ฐจ ๋นˆ๋„, ๋น„์œจ, ๋ง์…ˆ ๋“ฑ์„ ๊ตฌํ•  ๋•Œ ์‚ฌ์šฉ
  • Pivot table์˜ ํŠน์ˆ˜ํ•œ ํ˜•ํƒœ
  • user-item Rating matrix ๋“ฑ์„ ๋งŒ๋“ค ๋•Œ ์‚ฌ์šฉ

๐Ÿ merge & concat

๐ŸŸจ merge

  • 2๊ฐœ์˜ ๋ฐ์ดํ„ฐ(ex. ํ…Œ์ด๋ธ”)๋ฅผ ํ•˜๋‚˜๋กœ ํ•ฉ์น  ๋•Œ
  • ์œ„์˜ ์ด๋ฏธ์ง€์—์„œ subject_id๊ฐ€ ๊ฒน์น˜๋ฏ€๋กœ ๊ฒน์น˜๋Š” 4,5,7,8๋งŒ ํ•ฉ์ณ์„œ subject_id๋ฅผ ๊ธฐ์ค€์œผ๋กœ ํ…Œ์ด๋ธ” ๋ณ‘ํ•ฉ์ด ๊ฐ€๋Šฅํ•˜๋‹ค.
    pd.merge(df_a, df_b, on='subject_id')
  • on์„ ์“ฐ๋ฉด ์–‘์ชฝ ๋ฐ์ดํ„ฐ์— ๊ฐ™์€ ๊ธฐ์ค€ ์นผ๋Ÿผ์ด ์กด์žฌํ•ด์•ผํ•œ๋‹ค.
    pd.merge(df_a, df_b, left_on='subject_id', right_on='subject_id')
  • ๋‘ dataframe์ด column๋ช…์ด ๋‹ค๋ฅผ ๋•Œ

๐Ÿ join

  • inner join: A์™€ B์˜ ๊ณตํ†ต๋œ ๋ฐ์ดํ„ฐ๋“ค๋งŒ
  • full join: ๊ณตํ†ต๋œ ๋ฐ์ดํ„ฐ๋“ค์€ ๋ถ™์ด๊ณ  ์•„๋‹Œ ๊ฒƒ๋“ค๋„ ์ „๋ถ€ ๋ณด์—ฌ์ฃผ๊ธฐ
  • left join: ์™ผ์ชฝ์— ์žˆ๋Š” ๋ฐ์ดํ„ฐ๋“ค์„ ์ „๋ถ€ ๋ณด์—ฌ์ฃผ๊ณ  ๊ฐ™์€ ์นผ๋Ÿผ๋ช…์ด์ง€๋งŒ ์˜ค๋ฅธ์ชฝ์—์„œ ์—†๋Š” ๋ฐ์ดํ„ฐ๋“ค์€ NaN์œผ๋กœ ํ‘œ์‹œ (๊ณตํ†ต๋œ ๋ถ€๋ถ„์€ ํ•ฉ์นจ)
  • right join: ์˜ค๋ฅธ์ชฝ์— ์žˆ๋Š” ๋ฐ์ดํ„ฐ๋“ค์„ ์ „๋ถ€ ๋ณด์—ฌ์ฃผ๊ณ  ๊ฐ™์€ ์นผ๋Ÿผ๋ช…์ด์ง€๋งŒ ์™ผ์ชฝ์—์„œ ์—†๋Š” ๋ฐ์ดํ„ฐ๋“ค์€ NaN์œผ๋กœ ํ‘œ์‹œ (๊ณตํ†ต๋œ ๋ถ€๋ถ„์€ ํ•ฉ์นจ)
    pd.merge(df_a, df_b, on='subject_id', how='left / right / inner / outer')

๐ŸŸจ index๋ฅผ ๊ธฐ์ค€์œผ๋กœ joinํ•˜๊ธฐ

pd.merge(df_a, df_b, right_index=True, left_index=True)

๐Ÿ concat

  • ๋ฐ์ดํ„ฐ ์ด์–ด๋ถ™์ด๊ธฐ
    df_new = pd.concat([df_a, df_b])
    df_a.append(df_b)
  • axis๋ฅผ 1๋กœ ์ง€์ •ํ•˜๋ฉด ๊ฐ€๋กœ๋กœ ๋ถ™์ด๊ณ  0์œผ๋กœ ์ง€์ •ํ•˜๋ฉด ์„ธ๋กœ๋กœ ๋ถ™์ธ๋‹ค.
profile
๋ฐฐ์šฐ๊ณ  ์‹ถ์€๊ฒŒ ๋งŽ์€ ๊ฐœ๋ฐœ์ž๐Ÿ“š

0๊ฐœ์˜ ๋Œ“๊ธ€