[Big Data/pandas/Series]

안지은 · December 4, 2022

import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from numpy.random import randn
from pandas import Series, DataFrame

💡 Series: a one-dimensional labeled array

- Creating a one-dimensional array (Series) from values

obj = Series([4, 7, -5, 3])

0 4
1 7
2 -5
3 3

- Getting the values and the index

obj.values

array([ 4, 7, -5, 3], dtype=int64)

obj.index

RangeIndex(start=0, stop=4, step=1)

- Renaming the index labels

obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']  # rename the index labels

Bob 4
Steve 7
Jeff -5
Ryan 3

- Creating a Series with explicit index labels

obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

d 4
b 7
a -5
c 3

obj = Series(range(3), index=['a', 'b', 'c'])
print(obj)

a 0
b 1
c 2

- Getting or changing Series values (elements) by index label

obj2['a']
obj2[['c', 'a', 'd']]

-5

c 3
a -5
d 4

obj2['d'] = 6

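- Getting labels and values by integer position

# obj here is assumed to be Series(np.arange(4.), index=['a', 'b', 'c', 'd']), which matches the output below
obj.iloc[[1, 3]]  # rows at integer positions 1 and 3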

b 1.0
d 3.0
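
Position-based and label-based selection can be written explicitly with iloc and loc, which avoids relying on how a bare integer key is interpreted on a string-labeled Series. A minimal sketch, assuming a four-element Series like the one in the output above (the name s is only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical Series matching the output shown above
s = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_position = s.iloc[[1, 3]]   # rows at integer positions 1 and 3
by_label = s.loc[['b', 'd']]   # the same rows, selected by label

print(by_position)
```

Both selections return the rows labeled b and d.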

- Extracting only the data that meets a condition (together with its index)

Show only the values greater than 0

obj2[obj2 > 0]

d 6
b 7
c 3

Multiply each value by 2 (the Series itself is not modified)

obj2 * 2

e raised to the power of each value

np.exp(obj2)

Check whether an index label exists

'b' in obj2   # returns True or False
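
The four operations above can be tried together in one place. This is a rough sketch that assumes obj2 holds the values from earlier after obj2['d'] = 6 was applied; the name s and the intermediate variable names are mine, not from the original notes.

```python
import numpy as np
import pandas as pd

# Assumed state of obj2 after obj2['d'] = 6
s = pd.Series([6, 7, -5, 3], index=['d', 'b', 'a', 'c'])

mask = s > 0            # boolean Series sharing the same index
positive = s[mask]      # keeps only the rows where mask is True
doubled = s * 2         # returns a new Series; s is left unchanged
exponent = np.exp(s)    # NumPy ufuncs keep the index labels
has_b = 'b' in s        # membership checks the index labels
has_7 = 7 in s          # False: 7 is a value, not a label

print(positive)
```

Note that the in operator looks at labels, so a value such as 7 is not found.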

- Creating a Series while specifying both values and index (from a dict)

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3

Ohio 35000
Texas 71000
Oregon 16000
Utah 5000

states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)  # only the labels listed in index are kept; labels missing from sdata become NaN
obj4

California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0

- Index objects

Index object: an object that stores the names (labels) of each row and column in tabular data, along with other metadata.
An Index object cannot be modified.
Being immutable makes an Index object safer to share between data structures.

labels = pd.Index(np.arange(3))
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2

0 1.5
1 -2.5
2 0.0
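
To see what "cannot be modified" means in practice, the sketch below tries to assign to one element of an Index and catches the error. The flag name assignment_failed is only illustrative.

```python
import numpy as np
import pandas as pd

labels = pd.Index(np.arange(3))

try:
    labels[1] = 10              # item assignment on an Index is rejected
    assignment_failed = False
except TypeError:
    assignment_failed = True    # pandas raises TypeError here

print(assignment_failed)
```

Replacing the whole index (as in obj.index = [...] earlier) still works, because that swaps in a new Index object rather than mutating the existing one.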

- Handling NaN data: checking for null (True or False)

pd.isnull(obj4)  # True only where the value is NaN
# obj4.isnull()  # equivalent

California True
Ohio False
Oregon False
Texas False

pd.notnull(obj4)   # True where the value is not NaN

California False
Ohio True
Oregon True
Texas True
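
The boolean results above are typically used as masks or summed up. A small sketch, rebuilding obj4 from the same sdata and states as before (n_missing and present are names I made up for the example):

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)

n_missing = int(obj4.isnull().sum())   # True counts as 1, so this counts NaN entries
present = obj4[obj4.notnull()]         # keep only the rows that have a value

print(n_missing)
print(present)
```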

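- Arithmetic between two Series

The operation is computed only for labels where both sides are non-null.
A label that exists on only one side yields NaN.
If the value on either side is NaN, the result is NaN as well.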

obj3 + obj4
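
The alignment rules can be checked directly. The sketch below rebuilds obj3 and obj4 from the same dict as earlier; total is a name chosen for the example.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=states)

# Values are matched by label, not by position
total = obj3 + obj4

print(total)
```

Here California is NaN because obj4 holds NaN for it and obj3 lacks the label, and Utah is NaN because only obj3 has it; the other labels are summed normally.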

- Setting the Series name / the index name

obj4.name = 'population'
obj4.index.name = 'state'
obj4

state
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
Name: population, dtype: float64

obj4.index  # shows each index label and the name of the index as a whole

Index(['California', 'Ohio', 'Oregon', 'Texas'], dtype='object', name='state')

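- Reindexing

Rearranges the data (rows and columns) to match a new index
Labels that did not exist before become NaN, or the value given by the fill_value option

obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])  # start from this Series
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])  # rearrange to this index ('e' becomes NaN)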

a -5.3
b 7.2
c 3.6
d 4.5
e NaN

Fill labels that did not exist with 0

obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

a -5.3
b 7.2
c 3.6
d 4.5
e 0.0

Fill the gaps by repeating the existing Series values (forward fill)

obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill')

0 blue
2 purple
4 yellow
↓
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
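
For comparison, here is the same reindex with and without method='ffill'. This is just a sketch using the data above; plain and filled are illustrative names.

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

plain = obj3.reindex(range(6))                   # new labels 1, 3, 5 are left missing
filled = obj3.reindex(range(6), method='ffill')  # each gap copies the previous value

print(filled)
```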

- Dropping specific rows or columns

obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')

obj.drop(['d', 'c'])  # drop more than one entry

In-place operation (changes the object itself without making a copy)

obj.drop('c', inplace=True)  # removes the 'c' label from obj
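
The difference between the two forms of drop is easy to miss, so here is a short sketch (still_there, result and gone are names I added for the check):

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

new_obj = obj.drop('c')                # returns a new Series
still_there = 'c' in obj               # the original still has 'c'

result = obj.drop('c', inplace=True)   # modifies obj and returns None
gone = 'c' not in obj

print(still_there, result, gone)
```

Because the in-place form returns None, writing obj = obj.drop('c', inplace=True) would lose the data.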

- Indexing

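# the output below assumes obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj.iloc[1]    # value at integer position 1 -> e.g. 1.0
obj.iloc[2:4]  # labels and values at positions 2 and 3 (the end position is excluded)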

c 2.0
d 3.0

- Slicing

Note: slicing by label name also includes the endpoint of the range.

obj['b':'c']

b 1.0
c 2.0

obj['b':'c'] = 5

a 0.0
b 5.0
c 5.0
d 3.0
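
To make the endpoint difference concrete, the sketch below slices the same Series once by label and once by position. It assumes the four-element obj shown in the output above.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c']      # label slice: 'c' is included
by_position = obj.iloc[1:2]      # position slice: position 2 is excluded

label_rows = list(by_label.index)        # ['b', 'c']
position_rows = list(by_position.index)  # ['b']

obj.loc['b':'c'] = 5             # assignment through a label slice also covers 'c'
print(obj)
```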

- Sorting

sort_index: sorts the row (axis=0) or column (axis=1) index in lexicographic order and returns a new object
Descending order: pass the ascending=False option

obj = Series(range(4), index=['d', 'a', 'b', 'c'])
print(obj)
obj.sort_index()  # sorts the index into a, b, c, d

obj = Series([4, 7, -3, 2, np.nan])
obj.sort_values()  # values are sorted as -3, 2, 4, 7, NaN (NaN goes last by default)
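
A small sketch of the sorting options mentioned above, including ascending=False. The na_position argument of sort_values is not covered in these notes; it is included here only to show where NaN ends up.

```python
import numpy as np
import pandas as pd

obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
asc = obj.sort_index()                   # a, b, c, d
desc = obj.sort_index(ascending=False)   # d, c, b, a

vals = pd.Series([4, 7, -3, 2, np.nan])
nan_last = vals.sort_values()                       # NaN placed at the end
nan_first = vals.sort_values(na_position='first')   # NaN placed at the start

print(desc)
print(nan_first)
```

All of these return new objects, so obj and vals keep their original order.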

obj = Series([7, -5, 7, 4, 2, 0, 4])
obj.rank()

# tied values are ranked in the order they appear in the data
obj.rank(method='first')

# rank in descending order: ascending=False option
# method='max': tied values all get the highest rank in their group
obj.rank(ascending=False, method='max')
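
Since no output is shown for the three rank calls, here is a sketch that collects the results as plain lists so the tie handling can be compared side by side (the variable names are mine):

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

average = obj.rank().tolist()
# [6.5, 1.0, 6.5, 4.5, 3.0, 2.0, 4.5]  ties share the mean of their ranks

first = obj.rank(method='first').tolist()
# [6.0, 1.0, 7.0, 4.0, 3.0, 2.0, 5.0]  ties broken by order of appearance

desc_max = obj.rank(ascending=False, method='max').tolist()
# [2.0, 7.0, 2.0, 4.0, 5.0, 6.0, 4.0]  ties take the largest rank in the group

print(average, first, desc_max, sep='\n')
```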

- Duplicate index labels

Selecting a label that has several entries returns a Series
Selecting a label that has only one entry returns a scalar

obj.index.is_unique  # returns False if any label is duplicated
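
The notes above do not define a Series with repeated labels, so the sketch below builds a hypothetical one (dup is a name made up for this example) to show both return types.

```python
import pandas as pd

# Hypothetical Series with repeated labels 'a' and 'b'
dup = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

unique = dup.index.is_unique   # False, because 'a' and 'b' repeat
several = dup['a']             # two entries -> a Series
single = dup['c']              # one entry -> a scalar

print(unique)
print(several)
print(single)
```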

- describe on non-numeric data

obj = Series(['a', 'a', 'b', 'c'] * 4)
obj.describe()
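
For string data, describe reports counts rather than mean and standard deviation. A quick sketch with the same Series (summary is an illustrative name):

```python
import pandas as pd

obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
summary = obj.describe()

print(summary['count'])    # 16 entries in total
print(summary['unique'])   # 3 distinct values
print(summary['top'])      # 'a' is the most frequent value
print(summary['freq'])     # 'a' appears 8 times
```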
