Pandas - Series and DataFrame

seulzzang · October 6, 2022


My course is now moving on to machine learning, so I am writing this post to organize what I know about Pandas.


Pandas

  • A library built around two data types, DataFrame and Series, that provides a wide range of features for data analysis
  • It uses NumPy internally, so the two are usually imported together
import numpy as np
import pandas as pd

Series

  • Similar to a one-dimensional NumPy array.
  • It is a one-dimensional array; picturing a single column of an Excel sheet makes it easy to understand.
x = [1, 2, 3, 4, 5]
pd.Series(x)
0    1
1    2
2    3
3    4
4    5
dtype: int64

The printed result also looks like a single column of an Excel sheet.
A Series can also be created from a NumPy array.
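As a minimal sketch of building a Series from a NumPy array (the values below are illustrative and not from the original post), the Series keeps the array's dtype and can be converted back at any time:

```python
import numpy as np
import pandas as pd

# Illustrative values (an assumption, not data from the post)
arr = np.array([1.5, 2.5, 3.5])
s = pd.Series(arr)

print(s.dtype)        # the Series keeps the array's float dtype
print(s.to_numpy())   # convert back to a NumPy array

# Arithmetic is element-wise, just like on the underlying array
doubled = s * 2
print(doubled.tolist())
```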

Series Index

  • A Series basically consists of an index and values.
x = [1, 2, 3, 4, 5]
x = pd.Series(x)
print(x.index)
print(x.values)
RangeIndex(start=0, stop=5, step=1)
[1 2 3 4 5]

Printing list(x.index) shows the index as [0, 1, 2, 3, 4].

  • Index labels can be changed, and there are several ways to access values through the index.
x = [1, 2, 3, 4, 5]
x = pd.Series(x, index=['a','b','c','d','e']) # set the index labels

print(x['a']) # explicit (label) index access
print(x[0]) # implicit (positional) index access; newer pandas versions emit a FutureWarning here, x.iloc[0] is the explicit form
print(x[['a','e']]) # fancy indexing: access several values at once
print(x.a) # can also be accessed like an attribute
1
1
a    1
e    5
dtype: int64
1
  • If the index labels are numbers, implicit index access is not possible. In that case, use .iloc and .loc!
x = [1, 2, 3, 4, 5]
x = pd.Series(x, index=[1, 2, 3, 4, 5])
print(x)
# x[0] raises an error (the index is numeric, so implicit positional access is not possible)
print(x.iloc[0]) # implicit (positional) index only
print(x.loc[1]) # explicit (label) index only
1    1
2    2
3    3
4    4
5    5
dtype: int64
1
1
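The difference between .iloc and .loc is easiest to see when slicing. A small sketch with illustrative values (10 to 50 are my own choice, so that values and labels are easy to tell apart):

```python
import pandas as pd

# Integer labels 1..5 on purpose, so labels and positions differ
x = pd.Series([10, 20, 30, 40, 50], index=[1, 2, 3, 4, 5])

# Plain x[0] is looked up as a label, and there is no label 0
try:
    x[0]
    raised = False
except KeyError:
    raised = True
print("x[0] raised KeyError:", raised)

# .iloc slices by position and excludes the end: positions 1 and 2
pos_slice = x.iloc[1:3]
print(pos_slice.tolist())

# .loc slices by label and includes the end: labels 1, 2 and 3
label_slice = x.loc[1:3]
print(label_slice.tolist())
```

The same slice bounds give different results, which is why it helps to always spell out .iloc or .loc when the index is numeric.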

Dictionary to Series

  • When a Series is built from a dictionary, the keys are used as the index.
x = {"math":90, "english":80, "science":95, "art":80}
x = pd.Series(x)
x
math       90
english    80
science    95
art        80
dtype: int64
  • Indexing works much like it does on a dictionary.
print(x['math']) # output: 90
  • Slicing is possible too
print(x['english':])
english    80
science    95
art        80
dtype: int64
  • By specifying index values, a Series can also be created from only some of the entries.
x = {"math":90, "english":80, "science":95, "art":80}
x = pd.Series(x, index=["math", "english", "science"])
x
math       90
english    80
science    95
dtype: int64
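One thing worth knowing about passing index= together with a dictionary: a label that is not a key in the dictionary does not raise an error, it becomes a missing value. A short sketch, where "music" is a hypothetical key chosen only for illustration:

```python
import pandas as pd

scores = {"math": 90, "english": 80, "science": 95, "art": 80}

# "music" is a hypothetical label that does not exist in the dictionary
picked = pd.Series(scores, index=["math", "music"])

print(picked)          # the missing entry shows up as NaN
print(picked.isna())   # True where no value was found
print(picked.dtype)    # the dtype becomes float so it can hold NaN
```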

Multi Index

  • A Series can have more than one index level.
student_1 = {"math":90, "english":80, "science":95, "art":80}
student_2 = {"math":70, "english":90, "science":100, "art":70}

#index_1 = ['Hong Gildong','Hong Gildong','Hong Gildong','Hong Gildong','Lee Mongryong','Lee Mongryong','Lee Mongryong','Lee Mongryong']
index_1 = ['Hong Gildong' for i in range(len(student_1))] + ['Lee Mongryong' for i in range(len(student_2))]

#index_2 = ['math','english','science','art','math','english','science','art']
index_2 = [key for key in student_1] + [key for key in student_2] # collect the keys

value_all = list(student_1.values()) + list(student_2.values())

students = pd.Series(value_all, index=[index_1, index_2]) # outer index and inner index
students
Hong Gildong   math        90
               english     80
               science     95
               art         80
Lee Mongryong  math        70
               english     90
               science    100
               art         70
dtype: int64
  • students['Hong Gildong'] returns only the scores that belong to Hong Gildong.
math       90
english    80
science    95
art        80
dtype: int64
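Beyond selecting by the outer level, a multi-index Series can also be queried by both levels or by the inner level alone. The following is a sketch on a smaller, illustrative version of the data (two subjects per student), using .loc with a tuple, xs, and unstack:

```python
import pandas as pd

# Smaller illustrative version of the students Series
students = pd.Series(
    [90, 80, 70, 90],
    index=[
        ["Hong Gildong", "Hong Gildong", "Lee Mongryong", "Lee Mongryong"],
        ["math", "english", "math", "english"],
    ],
)

# A tuple of (outer, inner) labels selects a single value
one_score = students.loc[("Hong Gildong", "math")]
print(one_score)

# xs selects on the inner level: every student's math score
math_scores = students.xs("math", level=1)
print(math_scores)

# unstack moves the inner level into columns, giving a DataFrame
table = students.unstack()
print(table)
```

unstack is a handy bridge to the next topic, since the result is a DataFrame with one row per student and one column per subject.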

DataFrame

  • A two-dimensional table data structure; think of an entire Excel spreadsheet.
sales_data = {
    'year':[2015, 2016, 2017, 2018, 2019, 2020],
    'units_sold':[103, 70, 130, 160, 190, 230],
    'revenue':[500000, 300000, 400000, 550000, 700000, 680000],
    'net_profit':[370000, 190000, 300000, 480000, 600000, 590000]
}

sales_data = pd.DataFrame(sales_data)
sales_data


(If there is a way to display a DataFrame on velog, please let me know in the comments..)

sales_data = {
    'year':[2015, 2016, 2017, 2018, 2019, 2020],
    'units_sold':[103, 70, 130, 160, 190, 230],
    'revenue':[500000, 300000, 400000, 550000, 700000, 680000],
    'net_profit':[370000, 190000, 300000, 480000, 600000, 590000]
}

# keep three columns and use the year list as the row index
temp_df = pd.DataFrame(sales_data, columns=['units_sold','revenue','net_profit'], index=sales_data['year'])
temp_df


The index can also be specified separately.
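Since the rendered table is not shown here, a quick way to check what a custom index did is to inspect the frame's index, columns, and shape. A minimal sketch on a shortened, illustrative version of the data (the column names are my own English stand-ins):

```python
import pandas as pd

# Shortened illustrative data; column names are assumptions for this sketch
sales = {
    "year": [2015, 2016, 2017],
    "units_sold": [103, 70, 130],
    "revenue": [500000, 300000, 400000],
}
df = pd.DataFrame(sales, columns=["units_sold", "revenue"], index=sales["year"])

print(df.index.tolist())    # row labels taken from the year list
print(df.columns.tolist())  # only the columns that were requested
print(df.shape)             # (rows, columns)
print(df.to_string())       # plain-text rendering of the whole table
```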

  • Like a Series, a DataFrame also supports .loc and .iloc.
  • A file can be read in as a DataFrame. (pd.read_csv is the most commonly used)
sales_data = pd.read_csv('sales_data.csv', index_col='year', header=0, sep=',') # assumes the CSV has a 'year' column
sales_data


Like this..
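To try this without a file on disk, the sketch below feeds read_csv an in-memory string through io.StringIO and then selects rows with .loc and .iloc. The CSV contents and column names are assumptions made up for the example, not the actual sales_data.csv:

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for a real file
csv_text = """year,units_sold,revenue
2015,103,500000
2016,70,300000
2017,130,400000
"""

# index_col makes the 'year' column the row index, as in the call above
df = pd.read_csv(io.StringIO(csv_text), index_col="year", header=0, sep=",")

row_2016 = df.loc[2016]          # select a row by its label
first_row = df.iloc[0]           # select a row by its position
cell = df.loc[2017, "revenue"]   # row label plus column name

print(row_2016)
print(first_row)
print(cell)
```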


That covers Series and DataFrame, the data structures of Pandas. Next time, I will write up ways of working with Pandas using Kaggle's Titanic survivor data!
