[python] Pandas

장동균 · April 21, 2024

https://www.youtube.com/watch?v=BO6JTVBjVQ4&list=PLNPt2ycoheHrQHSg7MqTELiWUmieIxH-5

These are my notes summarizing the lecture series linked above.

NumPy

data = [1, 2, 3, 4]
result = []

for i in data:
    result.append(i * 10)

print(result)

import numpy as np

arr = np.array([1, 2, 3, 4])
arr10 = arr * 10

print(arr10)

With NumPy, an operation can be applied to every element of an array at once, without looping over it.

import numpy as np

array = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

# All rows, column 0
print(array[:, 0])

A plain Python list has no built-in way to select a column from nested data, but NumPy indexing provides it.
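To make the contrast concrete, here is a minimal sketch that pulls the first column out of the same data both ways. The variable names (`rows`, `first_col_list`, `first_col_np`) are my own and not from the lecture.

```python
import numpy as np

rows = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]

# Plain nested list: pick the first element of every row by hand.
first_col_list = [row[0] for row in rows]

# NumPy: the same column in a single indexing expression.
first_col_np = np.array(rows)[:, 0]

print(first_col_list)  # [1, 6]
print(first_col_np)    # [1 6]
```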

Pandas

A Python library for data analysis, built on top of NumPy.

Pandas provides two core data structures: Series and DataFrame.

Series

A data structure for one-dimensional data, built on NumPy.

DataFrame

A data structure for two-dimensional data, built on NumPy.

Import styles

from pandas import Series, DataFrame

Series()

DataFrame()

import pandas as pd

pd.Series()

pd.DataFrame()

There are two ways to import, and the second one is the more common convention.

Data in a Series can be looked up by index label, and also by row number (position).

from pandas import Series

data = [100, 200, 300]
Series(data)

Row number	Index	Data
0	0	100
1	1	200
2	2	300

An index can be specified when creating a Series.

from pandas import Series

data = [100, 200, 300]
index = ["월", "화", "수"]
Series(data, index)

Row number	Index	Data
0	"월"	100
1	"화"	200
2	"수"	300

data = [100, 200, 300]
index = ["월", "화", "수"]

s = Series(data, index)

s.index   # the index labels: "월", "화", "수"
s.values  # the data as a NumPy array: 100, 200, 300
s.array   # the data as a pandas array object: 100, 200, 300
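As a small sketch of how these pieces can be turned into ordinary Python and NumPy objects, `Index.tolist()` and `Series.to_numpy()` are used below. The names `labels` and `values` are only illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([100, 200, 300], index=["월", "화", "수"])

labels = s.index.tolist()  # plain Python list of the index labels
values = s.to_numpy()      # NumPy array holding the data

print(labels)
print(isinstance(values, np.ndarray))  # True
```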

Indexing

Lookup with iloc (integer location) => access data by row number

data = [100, 200, 300]
index = ["월", "화", "수"]

s = Series(data, index)

s.iloc[0] # 100
s.iloc[2] # 300
s.iloc[-1] # 300

Lookup with loc (location) => access data by index label

data = [100, 200, 300]
index = ["월", "화", "수"]

s = Series(data, index)

s.loc["월"] # 100
s.loc["수"] # 300

Lookup with square brackets => access data by index label

data = [100, 200, 300]
index = ["월", "화", "수"]

s = Series(data, index)

s["월"] # 100
s["수"] # 300

Several non-contiguous values can also be selected at once by passing a list.

data = [100, 200, 300]
index = ["월", "화", "수"]

s = Series(data, index)

target = [0, 2]
s.iloc[target]
# 월    100
# 수    300
# dtype: int64

target = ["월", "수"]
s.loc[target]
# 월    100
# 수    300
# dtype: int64

Slicing

Slicing with iloc (integer location) => slice data by row number

import pandas as pd

data = [100, 200, 300]
index = ["월", "화", "수"]

s = pd.Series(data, index)

s.iloc[0:2]
# 월    100
# 화    200
# dtype: int64

Slicing with loc => slice data by index label

import pandas as pd

data = [100, 200, 300]
index = ["월", "화", "수"]

s = pd.Series(data, index)

s.loc["월":"화"]
# 월    100
# 화    200
# dtype: int64

An iloc slice excludes its end position, whereas a loc slice includes its end label.
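A quick way to see the difference is to slice the same Series both ways and compare the lengths. This is just a sketch with variable names of my own choosing.

```python
import pandas as pd

s = pd.Series([100, 200, 300], index=["월", "화", "수"])

by_position = s.iloc[0:1]    # end position 1 is excluded
by_label = s.loc["월":"화"]   # end label "화" is included

print(len(by_position))  # 1
print(len(by_label))     # 2
```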

Adding values to a Series

data = [100, 200, 300]
index = ["월", "화", "수"]

s = Series(data, index)

s.loc["목"] = 400
s["금"] = 500

iloc cannot be used to add a value, because it only addresses positions that already exist.
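The sketch below tries both approaches on the same Series. The flag `iloc_failed` is only there for illustration, and the example assumes pandas raises `IndexError` when iloc is asked to write past the end.

```python
import pandas as pd

s = pd.Series([100, 200, 300], index=["월", "화", "수"])

s.loc["목"] = 400  # assigning to a new label enlarges the Series

try:
    s.iloc[4] = 500  # position 4 does not exist yet
    iloc_failed = False
except IndexError:
    iloc_failed = True

print(len(s), iloc_failed)  # 4 True
```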

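Deleting from a Series

The drop method

Returns a new Series with the values removed and leaves the original unchanged

s.drop('label')

s.drop(['label1', 'label2'])

Passing the keyword argument inplace=True modifies the original instead.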

data = [100, 200, 300]
index = ["월", "화", "수"]

s = Series(data, index)

s1 = s.drop("월")

s1
# 화    200
# 수    300
# dtype: int64

s
# 월    100
# 화    200
# 수    300
# dtype: int64
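For comparison, here is a minimal sketch of the inplace=True variant, where the original Series itself loses the entry. The `remaining` variable is only for illustration.

```python
import pandas as pd

s = pd.Series([100, 200, 300], index=["월", "화", "수"])

# With inplace=True the original Series is modified directly.
s.drop("월", inplace=True)

remaining = s.index.tolist()
print(remaining)  # ['화', '수']
```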

Modifying a Series

data = [100, 200, 300]
index = ["월", "화", "수"]

s = Series(data, index)

s.iloc[0] = 1000
s.loc["화"] = 2000

s
# 월    1000
# 화    2000
# 수     300
# dtype: int64

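An assignment such as s[0] = 1000 also works, but the console shows a warning recommending iloc, since the integer 0 is being treated as a position on an index made of string labels.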

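Broadcasting

An operation is applied to the whole Series without writing a loop.

When two Series are combined, the elements to operate on are paired by index label.

In other words, the operation is carried out label by label, regardless of the order of the elements.

If a label exists in only one of the two Series, the result for that label is NaN.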

s = Series([100, 200, 300])
s10 = s + 10

s10
# 0 110
# 1 210
# 2 310
# dtype: int64

high = Series([51500, 51200, 52500, 51500, 51500])
low = Series([50700, 50500, 50500, 50800, 50700])
diff = high - low

diff
# 0 800
# 1 700
# 2 2000
# 3 700
# 4 800
# dtype: int64

high = pd.Series([51500, 51200, 52500], ["5/1", "5/2", "5/3"])
low = pd.Series([50700, 50500, 50500], ["5/1", "5/2", "5/4"])
diff = high - low

diff
# 5/1 800.0
# 5/2 700.0
# 5/3 NaN
# 5/4 NaN
# dtype: float64
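To check that alignment really goes by label rather than by position, the sketch below subtracts two Series whose labels are in opposite order, then drops the NaN rows from a mismatched pair with `dropna()`. The numbers reuse the ones above; the variable names are my own.

```python
import pandas as pd

high = pd.Series([51500, 51200], index=["5/1", "5/2"])
low = pd.Series([50500, 50700], index=["5/2", "5/1"])  # reversed order

diff = high - low  # paired by label, not by position
print(diff["5/1"], diff["5/2"])  # 800 700

# Labels present on only one side become NaN and can be dropped afterwards.
high2 = pd.Series([51500, 52500], index=["5/1", "5/3"])
low2 = pd.Series([50700, 50500], index=["5/1", "5/4"])
valid = (high2 - low2).dropna()
print(valid.index.tolist())  # ['5/1']
```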

s = pd.Series([100, 200, 300, 400])
cond = s > 300

cond
# 0 False
# 1 False
# 2 False
# 3 True
# dtype: bool

Filtering

Indexing with a sequence of True/False values keeps only the entries where the value is True.

s = pd.Series([100, 200, 300, 400, 500])
cond = [False, False, False, True, True]

s[cond]
# 3 400
# 4 500
# dtype: int64

s = pd.Series([100, 200, 300, 400])
cond = s > 300

s[cond]
# 3 400
# dtype: int64
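Conditions can also be combined before filtering. A small sketch, assuming the usual pandas idiom of `&` with each comparison wrapped in parentheses:

```python
import pandas as pd

s = pd.Series([100, 200, 300, 400, 500])

# Parentheses are required because & binds tighter than the comparisons.
middle = s[(s > 100) & (s < 500)]

print(middle.tolist())  # [200, 300, 400]
```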

Creating a DataFrame

Describing the two-dimensional table column by column

data = {
    "종가": [157000, 51300, 6880, 1400],
    "PER": [39, 28, 10, 229],
    "PBR": [4, 1, 0, 2],
}

index = ["naver", "samsung", "LG", "kakao"]

df = pd.DataFrame(data, index)

df
#              종가  PER  PBR
# naver    157000   39    4
# samsung   51300   28    1
# LG         6880   10    0
# kakao      1400  229    2

Describing the table row by row, with each row as a list

data = [
  [157000, 39, 4],
  [51300, 28, 1],
  [6880, 10, 0],
  [1400, 229, 2]
]

index = ["naver", "samsung", "LG", "kakao"]
columns = ["종가", "PER", "PBR"]

df = DataFrame(data, index, columns)

df
#              종가  PER  PBR
# naver    157000   39    4
# samsung   51300   28    1
# LG         6880   10    0
# kakao      1400  229    2

Describing the table row by row, with each row as a dictionary

data = [
  {"종가": 157000, "PER": 39, "PBR": 4},
  {"종가": 51300, "PER": 28, "PBR": 1},
  {"종가": 6880, "PER": 10, "PBR": 0},
  {"종가": 1400, "PER": 229, "PBR": 2}
]

index = ["naver", "samsung", "LG", "kakao"]

df = DataFrame(data, index)

df
#              종가  PER  PBR
# naver    157000   39    4
# samsung   51300   28    1
# LG         6880   10    0
# kakao      1400  229    2
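The three construction styles should describe the same table. As a rough check on a shortened two-row version of the data, the sketch below compares the results with `DataFrame.equals`. The variable names are my own.

```python
import pandas as pd

index = ["naver", "samsung"]
columns = ["종가", "PER"]

# Column by column (dict of lists)
by_column = pd.DataFrame({"종가": [157000, 51300], "PER": [39, 28]}, index=index)

# Row by row (list of lists, column names given separately)
by_row_list = pd.DataFrame([[157000, 39], [51300, 28]], index=index, columns=columns)

# Row by row (list of dicts, column names taken from the keys)
by_row_dict = pd.DataFrame(
    [{"종가": 157000, "PER": 39}, {"종가": 51300, "PER": 28}], index=index
)

print(by_column.equals(by_row_list))
print(by_column.equals(by_row_dict))
```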

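DataFrame indexing

Square brackets with a column name, ['column name'], select a single column. Passing a list of column names selects several columns.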

df["종가"]
# naver      157000
# samsung     51300
# LG           6880
# kakao        1400
# Name: 종가, dtype: int64

df["PER"]
# naver       39
# samsung     28
# LG          10
# kakao      229
# Name: PER, dtype: int64

df["PBR"]
# naver      4
# samsung    1
# LG         0
# kakao      2
# Name: PBR, dtype: int64

df[["PER", "PBR"]]
#          PER  PBR
# naver     39    4
# samsung   28    1
# LG        10    0
# kakao    229    2

To select rows, use the iloc or loc attribute.

df.iloc[1]
# 종가     51300
# PER       28
# PBR        1
# Name: samsung, dtype: int64

df.loc["samsung"]
# 종가     51300
# PER       28
# PBR        1
# Name: samsung, dtype: int64

# Selecting multiple rows
df.iloc[[0, 1]]
df.loc[["naver", "samsung"]]

Slicing

As with a Series, a DataFrame can be sliced through the iloc and loc attributes.

df.iloc[0:2]
#              종가  PER  PBR
# naver    157000   39    4
# samsung   51300   28    1

df.loc["naver":"LG"]
#              종가  PER  PBR
# naver    157000   39    4
# samsung   51300   28    1
# LG         6880   10    0
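loc also accepts a column selection after the row slice, so rows and columns can be narrowed in one step. A minimal sketch on a three-row version of the table (the name `subset` is only illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"종가": [157000, 51300, 6880], "PER": [39, 28, 10], "PBR": [4, 1, 0]},
    index=["naver", "samsung", "LG"],
)

# Rows from "naver" through "samsung" (inclusive), columns 종가 and PER only.
subset = df.loc["naver":"samsung", ["종가", "PER"]]

print(subset.shape)  # (2, 2)
```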

Retrieving a single value

data = [
  [157000, 39, 4],
  [51300, 28, 1],
  [6880, 10, 0],
  [1400, 229, 2]
]

index = ["naver", "samsung", "LG", "kakao"]
columns = ["종가", "PER", "PBR"]

df = DataFrame(data, index, columns)

df.iloc[1, 1] # 28
df.loc["naver", "종가"] # 157000
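pandas also has the `iat` and `at` accessors, which are meant specifically for single-value access. They were not covered in the notes above, so treat this as a side sketch on a two-row version of the table:

```python
import pandas as pd

df = pd.DataFrame(
    [[157000, 39, 4], [51300, 28, 1]],
    index=["naver", "samsung"],
    columns=["종가", "PER", "PBR"],
)

per = df.iat[1, 1]             # by row and column position
close = df.at["naver", "종가"]  # by row and column label

print(per, close)  # 28 157000
```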
