[EDA/Python] Playing with Pandas 📊🐼 Part 1

SengMin Youn (윤성민) · October 21, 2023

๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ๐Ÿผ

Tidy Data

Tidy data means the data is in a format that suits its purpose. According to Hadley Wickham, statistician and master R programmer, tidy data is a 2-D table that satisfies the following conditions:

1. each column represents a variable;
2. each row represents an observation;
3. each entry of the table represents a single value, which may come from either a categorical (discrete) or a continuous space.


We also sometimes call a 'tidy' table a 'tibble'.
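To make the three conditions concrete, here is a minimal sketch of turning an untidy table into a tidy one. The `wide` layout is a hypothetical illustration (it is not part of the original post); it stores the year values in the column headers, so each row holds two observations. `DataFrame.melt()` is one way to reshape it, not the only one.

```python
import pandas as pd

# Hypothetical "wide" table: the year values are spread across column headers,
# so a single row holds two observations instead of one.
wide = pd.DataFrame({
    "country": ["Afghanistan", "Brazil", "China"],
    "1999": [745, 37737, 212258],
    "2000": [2666, 80488, 213766],
})

# melt() moves the year headers into a 'year' column: one observation per row.
tidy = wide.melt(id_vars="country", var_name="year", value_name="cases")
tidy["year"] = tidy["year"].astype(int)  # the headers were strings
tidy = tidy.sort_values(["year", "country"]).reset_index(drop=True)

print(tidy)
```

After reshaping, each column is a variable (country, year, cases) and each row is a single country-year observation.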

import pandas as pd 
from io import StringIO
from IPython.display import display  # handy for showing plots and DataFrames

A_csv = """country,year,cases
Afghanistan,1999,745
Brazil,1999,37737
China,1999,212258
Afghanistan,2000,2666
Brazil,2000,80488
China,2000,213766"""

with StringIO(A_csv) as fp:
    A = pd.read_csv(fp)
print("=== A ===")
display(A)

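# Assumption: the merge below needs a second DataFrame B with the same keys.
# The population figures are taken from the tidyr `table1` example data.
B_csv = """country,year,population
Afghanistan,1999,19987071
Brazil,1999,172006362
China,1999,1272915272
Afghanistan,2000,20595360
Brazil,2000,174504898
China,2000,1280428583"""

with StringIO(B_csv) as fp:
    B = pd.read_csv(fp)
print("\n=== B ===")
display(B)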

With the merge() function, these two DataFrames can be combined easily.

C = A.merge(B, on=['country', 'year'])
print("\n=== C = merge(A, B) ===")
display(C)
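As a rough, self-contained sketch of what `merge()` does with the key columns: the frames `cases` and `population` below are hypothetical stand-ins (the population figures are assumed from the tidyr `table1` example data). The keys named in `on` appear once in the result, the remaining columns are placed side by side, and the optional `validate` argument raises an error if the keys turn out not to be unique.

```python
import pandas as pd

# Hypothetical tables that share the key columns 'country' and 'year'.
cases = pd.DataFrame({
    "country": ["Brazil", "Brazil", "China"],
    "year": [1999, 2000, 1999],
    "cases": [37737, 80488, 212258],
})
population = pd.DataFrame({
    "country": ["Brazil", "Brazil", "China"],
    "year": [1999, 2000, 1999],
    "population": [172006362, 174504898, 1272915272],
})

# validate="one_to_one" checks that each key pair is unique on both sides.
merged = cases.merge(population, on=["country", "year"], validate="one_to_one")
print(merged.columns.tolist())

# Without `on`, merge() falls back to the columns both frames have in common.
implicit = cases.merge(population)
print(merged.equals(implicit))
```

Spelling out `on` is usually the safer habit, since the implicit form silently changes behavior if the two frames later gain another column with the same name.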

Joins

Put simply, they work as follows:

  • Inner-join(A,B) (default): keeps only the rows whose keys appear in both A and B and drops the rest
  • Outer-join(A,B): keeps the union of the rows from A and B, filling the non-matching entries with NaN
  • Left-join(A,B): keeps every row of A, plus only the rows of B that match A
  • Right-join(A,B): the opposite of a left-join

with StringIO("""x,y,z
bug,1,d
rug,2,d
lug,3,d
mug,4,d""") as fp:
    D = pd.read_csv(fp)
print("=== D ===")
display(D)

with StringIO("""x,y,w
hug,-1,e
smug,-2,e
rug,-3,e
tug,-4,e
bug,1,e""") as fp:
    E = pd.read_csv(fp)
print("\n=== E ===")
display(E)

print("\n=== Outer-join (D, E) ===")
display(D.merge(E, on=['x', 'y'], how='outer'))

print("\n=== Left-join (D, E) ===")
display(D.merge(E, on=['x', 'y'], how='left'))

print("\n=== Right-join (D, E) ===")
display(D.merge(E, on=['x', 'y'], how='right'))


print("\n=== Inner-join (D, E) ===")
display(D.merge(E, on=['x', 'y']))
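To double-check which rows each join keeps, one option is the `indicator=True` argument of `merge()`, which adds a `_merge` column with the values `left_only`, `right_only`, or `both`. The sketch below rebuilds the same D and E as above so that it runs on its own; the `row_counts` dictionary is just an illustrative helper.

```python
import pandas as pd

# Same data as D and E above, rebuilt so this snippet is self-contained.
D = pd.DataFrame({"x": ["bug", "rug", "lug", "mug"],
                  "y": [1, 2, 3, 4],
                  "z": ["d"] * 4})
E = pd.DataFrame({"x": ["hug", "smug", "rug", "tug", "bug"],
                  "y": [-1, -2, -3, -4, 1],
                  "w": ["e"] * 5})

# indicator=True records where each row of the outer join came from.
outer = D.merge(E, on=["x", "y"], how="outer", indicator=True)
print(outer["_merge"].value_counts())

# Number of rows that each join type returns.
row_counts = {how: len(D.merge(E, on=["x", "y"], how=how))
              for how in ["inner", "left", "right", "outer"]}
print(row_counts)  # {'inner': 1, 'left': 4, 'right': 5, 'outer': 8}
```

Only ('bug', 1) matches on both keys. The row ('rug', 2) in D and the row ('rug', -3) in E share x but differ in y, so they count as non-matches.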



Pretty easy, right?
