Pandas

Jaho · December 2, 2021

What is Pandas?

pandas is a software library written for the Python programming language for data manipulation and analysis.
Source: Wikipedia

Convert a DataFrame to a JSON string and save it to a file

pandas.DataFrame.to_json() / pandas.read_json()

1. Create a DataFrame

import pandas as pd
import pprint
import json

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'x', '가']},
                  index=['row1', 'row2', 'row3'])
df

col1 col2
row1 1 a
row2 2 x
row3 3 가

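The DataFrame is built from a dict of key/value pairs (column name to list of values), and the row labels can be set with index.

2. Print the DataFrame as JSON.

print(df.to_json())
print(type(df.to_json()))  # str
# the row3 value '가' is written as the Unicode escape \uac00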

{"col1":{"row1":1,"row2":2,"row3":3},"col2":{"row1":"a","row2":"x","row3":"\uac00"}}
<class 'str'>
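The escaping seen above comes from the default force_ascii=True of to_json(). As a minimal sketch (rebuilding the same small frame so it runs on its own), passing force_ascii=False keeps the character as-is:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'x', '가']},
                  index=['row1', 'row2', 'row3'])

escaped = df.to_json()                # default: non-ASCII becomes \uXXXX
raw = df.to_json(force_ascii=False)   # keep non-ASCII characters unchanged

print(escaped)
print(raw)
```

Both strings describe the same data; only the way the non-ASCII character is written differs.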

3. Pass a path as the first argument of to_json() to write the DataFrame to a file.

from google.colab import drive
drive.mount('/content/gdrive')

I am using Colab, so Google Drive has to be mounted first.
Output: Mounted at /content/gdrive

%cd /content/gdrive/My Drive/test

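Change the working directory to the test folder in My Drive.

path = 'path/mydf.json'  # 'path' is a placeholder for your own directory
df.to_json(path)

This writes the DataFrame to path as a JSON file (mydf.json).

4. Read the saved file back.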

with open(path) as f:
  s = f.read()
  print(s)
  print(type(s))

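This reads the contents of path. The row3 value is printed as the Unicode escape \uac00.

5. When the file is opened with encoding='unicode-escape', Unicode escape sequences are converted to the corresponding characters.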

# encoding = 'unicode-escape'
with open(path,encoding = 'unicode-escape') as f:
  s = f.read()
  print(s)
  print(type(s))

The row3 value is now printed as the actual character: "row3":"가"
<class 'str'>
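One caveat worth sketching: the unicode-escape codec treats any byte that is not part of an escape as Latin-1, so it only gives the right result when the file is pure ASCII (which is what to_json() writes by default). The byte strings below are made-up stand-ins for file contents, not the actual file:

```python
import json

# Bytes as to_json() writes them by default: pure ASCII with a \uXXXX escape.
ascii_bytes = b'{"row3":"\\uac00"}'
decoded = ascii_bytes.decode('unicode-escape')
print(decoded)

# Bytes of a file that already holds the raw character encoded as UTF-8.
utf8_bytes = '{"row3":"가"}'.encode('utf-8')
garbled = utf8_bytes.decode('unicode-escape')  # UTF-8 bytes are misread as Latin-1
print(garbled == '{"row3":"가"}')

# Parsing with the json module works for both forms.
from_ascii = json.loads(ascii_bytes.decode('ascii'))
from_utf8 = json.loads(utf8_bytes.decode('utf-8'))
print(from_ascii, from_utf8)
```

For that reason, parsing with json (or pd.read_json) is usually the safer way to get the characters back.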

6. Use the standard library json module to read the file as a dict.

with open(path) as f:
  s = json.load(f)  # parse the JSON into Python objects (here, a dict)
  print(s)
  print(type(s))

'row3' : '가'
<class 'dict'>

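Key point!
f.read() returns the file contents as a str. It is JSON text, so the strings inside it use double quotes ("").
json.load(f) parses the file into a dict, so print() shows Python's repr with single quotes ('').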
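The same difference can be checked without a file. This is a small sketch that uses json.loads() on the string returned by to_json(), with the same example frame rebuilt so it runs on its own:

```python
import json
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'x', '가']},
                  index=['row1', 'row2', 'row3'])

text = df.to_json()        # JSON text: a plain str
parsed = json.loads(text)  # parsed JSON: nested Python dicts

print(type(text).__name__, type(parsed).__name__)
print(parsed['col2']['row3'])  # the escape is decoded while parsing
```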

7. Read the JSON file back with pandas.

df_read = pd.read_json(path)  # read the JSON file into a DataFrame
print(df_read)
print(type(df_read))

col1 col2
row1 1 a
row2 2 x
row3 3 가
<class 'pandas.core.frame.DataFrame'>
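A self-contained round trip might look like the sketch below. It writes to a temporary directory instead of Google Drive, which is an assumption made only so the example runs anywhere:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'x', '가']},
                  index=['row1', 'row2', 'row3'])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'mydf.json')  # temporary stand-in for the real path
    df.to_json(path)                       # write the JSON file
    df_read = pd.read_json(path)           # read it back as a DataFrame

print(df_read)
```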

8. Write a compressed file with pandas (gzip)

df.to_json('path/mydf_res.gz', compression='gzip')

Compression formats
compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}

The compression= argument specifies the compression format.
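As a rough sketch of the compressed round trip (again using a temporary directory as an assumption), the file can be read back without naming the format, because read_json() defaults to compression='infer' and picks gzip from the .gz extension:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'x', '가']},
                  index=['row1', 'row2', 'row3'])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'mydf_res.gz')
    df.to_json(path, compression='gzip')

    with open(path, 'rb') as f:
        magic = f.read(2)         # gzip files start with the bytes 1f 8b

    df_read = pd.read_json(path)  # compression inferred from '.gz'

print(magic)
print(df_read)
```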

9. Split the DataFrame into parts

# orient='split': dict like {index -> [index], columns -> [columns], data -> [values]}
print(df.to_json(orient='split'))
pprint.pprint(json.loads(df.to_json(orient='split')))

{"columns":["col1","col2"],"index":["row1","row2","row3"],"data":[[1,"a"],[2,"x"],[3,"\uac00"]]}
{'columns': ['col1', 'col2'],
'data': [[1, 'a'], [2, 'x'], [3, '가']],
'index': ['row1', 'row2', 'row3']}

The frame is split into columns, index, and data.
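Because the three parts are stored separately, they can be passed straight back to the DataFrame constructor. A minimal sketch, with the example frame rebuilt so the block runs on its own:

```python
import json

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'x', '가']},
                  index=['row1', 'row2', 'row3'])

parts = json.loads(df.to_json(orient='split'))

# Rebuild the frame from the three parts.
rebuilt = pd.DataFrame(parts['data'],
                       index=parts['index'],
                       columns=parts['columns'])
print(rebuilt)
```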

9-1 Use the orient parameter to output records.

# allowed values are: {'split', 'records', 'index', 'columns', 'values', 'table'}.
print(df.to_json(orient='records'))
pprint.pprint(json.loads(df.to_json(orient='records')), width=40)

[{"col1":1,"col2":"a"},{"col1":2,"col2":"x"},{"col1":3,"col2":"\uac00"}]
[{'col1': 1, 'col2': 'a'},
 {'col1': 2, 'col2': 'x'},
 {'col1': 3, 'col2': '가'}]

The width argument can be ignored here (it only makes the output easier to read).
Each record (row) is output as one object; the row labels are not included.
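The sketch below checks that the records output is a list without row labels, and also shows the lines=True option of to_json(), which writes one record per line (JSON Lines). The lines=True part goes beyond what the post covers and is included only as an illustration:

```python
import json

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'x', '가']},
                  index=['row1', 'row2', 'row3'])

text = df.to_json(orient='records')
records = json.loads(text)  # a list of dicts, one per row
print(records)

# One JSON object per line; lines=True requires orient='records'.
lines = df.to_json(orient='records', lines=True).splitlines()
print(lines)
```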

9-2 Output by index

print(df.to_json(orient='index'))
pprint.pprint(json.loads(df.to_json(orient='index')), width=40)

{"row1":{"col1":1,"col2":"a"},"row2":{"col1":2,"col2":"x"},"row3":{"col1":3,"col2":"\uac00"}}
{'row1': {'col1': 1, 'col2': 'a'},
'row2': {'col1': 2, 'col2': 'x'},
'row3': {'col1': 3, 'col2': '가'}}

9-3 Output as values

print(df.to_json(orient='values'))
pprint.pprint(json.loads(df.to_json(orient='values')), width=40)

[[1,"a"],[2,"x"],[3,"\uac00"]]
[[1, 'a'], [2, 'x'], [3, '가']]

9-4 Output as a table

print(df.to_json(orient='table'))
pprint.pprint(json.loads(df.to_json(orient='table')), width=40)

{"schema":{"fields":[{"name":"index","type":"string"},{"name":"col1","type":"integer"},{"name":"col2","type":"string"}],"primaryKey":["index"],"pandas_version":"0.20.0"},"data":[{"index":"row1","col1":1,"col2":"a"},{"index":"row2","col1":2,"col2":"x"},{"index":"row3","col1":3,"col2":"\uac00"}]}
{'data': [{'col1': 1,
           'col2': 'a',
           'index': 'row1'},
          {'col1': 2,
           'col2': 'x',
           'index': 'row2'},
          {'col1': 3,
           'col2': '가',
           'index': 'row3'}],
 'schema': {'fields': [{'name': 'index',
                        'type': 'string'},
                       {'name': 'col1',
                        'type': 'integer'},
                       {'name': 'col2',
                        'type': 'string'}],
            'pandas_version': '0.20.0',
            'primaryKey': ['index']}}
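The table output is long, so it can help to pull out only the schema. This sketch maps each field name to its type; the variable names are only illustrative:

```python
import json

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'x', '가']},
                  index=['row1', 'row2', 'row3'])

table = json.loads(df.to_json(orient='table'))
schema = table['schema']

# Map each field name to the type recorded in the schema.
field_types = {field['name']: field['type'] for field in schema['fields']}
print(field_types)
print(schema['primaryKey'])
print(len(table['data']))
```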

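Pandas with Kaggle data

Load the data and display the dataset.

df = pd.read_csv('path/file.csv')  # placeholder path to a CSV file
df

The CSV file is displayed as a DataFrame.

Check the type.

type(df)

Print the first rows (5 by default).

df.head()

Print only the first 3 values of the OO column.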

df['OO'].head(3)

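# check the shape (rows, columns)
df.shape

# summary statistics for the numeric columns
df.describe()

# correlation coefficients between columns
# (pandas 2.0+: pass numeric_only=True if the frame has non-numeric columns)
df.corr()

# sort by a column with sort_values(column)
# sort by the OO column in ascending order
res = df.sort_values('OO')

# sort by a column with sort_values(column)
# sort by the OO column in descending order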
res = df.sort_values('OO', ascending=False)
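Since the Kaggle file is not part of this post, here is a small sketch with made-up numbers. The column names OO and XX are the same placeholders used in the text, and the values are hypothetical:

```python
import pandas as pd

# Hypothetical data; 'OO' and 'XX' stand in for real column names.
df = pd.DataFrame({'OO': [3, 1, 2], 'XX': [65000, 90000, 75000]})

print(df.shape)       # (rows, columns)
print(df.describe())  # count, mean, std, min, quartiles, max
print(df.corr())      # pairwise correlation of the numeric columns

asc = df.sort_values('OO')                    # ascending (default)
desc = df.sort_values('OO', ascending=False)  # descending

print(asc['OO'].tolist())
print(desc['OO'].tolist())
print(df['OO'].tolist())  # sort_values returns a new frame; df is unchanged
```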

Extract data. Use integer positions to select rows and columns with df.iloc[].

df.head()
df.iloc[0,0] # row 0, column 0
df.iloc[1,1] # row 1, column 1

From all rows, print the first 3 values of the last column.

df.iloc[:,-1].head(3)
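To make the difference between position and label concrete, this sketch uses a hypothetical frame with string row labels and compares iloc (integer position) with loc (label):

```python
import pandas as pd

# Hypothetical data with string row labels.
df = pd.DataFrame({'OO': [3, 1, 2], 'XX': [65000, 90000, 75000]},
                  index=['a', 'b', 'c'])

first = df.iloc[0, 0]              # row position 0, column position 0
second = df.iloc[1, 1]             # row position 1, column position 1
last_col = df.iloc[:, -1].head(2)  # all rows of the last column, first 2 values
by_label = df.loc['b', 'XX']       # same cell as iloc[1, 1], selected by label

print(first, second, by_label)
print(last_col)
```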

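Find the elements where XX is greater than 70000. The comparison returns a boolean Series; index with it (df[result]) to extract the matching rows.

result = df['XX'] > 70000
result

Find the elements where XX is less than 70000 or greater than 80000.

result2 = (df['XX'] < 70000) | (df['XX'] > 80000)  # | is the element-wise OR operator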
result2.head()
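The comparisons above produce boolean masks. A minimal sketch of using such masks to pull out the matching rows, with hypothetical values:

```python
import pandas as pd

# Hypothetical values for the placeholder column 'XX'.
df = pd.DataFrame({'XX': [65000, 90000, 75000, 72000]})

mask = df['XX'] > 70000  # boolean Series, one value per row
high = df[mask]          # keep only the rows where the mask is True

# Each condition needs its own parentheses when combined with | or &.
outside = df[(df['XX'] < 70000) | (df['XX'] > 80000)]

print(mask.tolist())
print(high)
print(outside)
```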

Handling missing data

df.iloc[0,0] = None
df.head()
df_drop = df.dropna()
df_drop.head()

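Position (0, 0) of df is set to None.
dropna() removes the rows that contain missing values.

Fill missing values with the mean: fillna(value)

fill_value = df.mean()  # per-column means (pandas 2.0+: pass numeric_only=True if there are non-numeric columns)
print(fill_value)

df_fillna = df.fillna(fill_value)  # fill each missing value with its column mean
df_fillna.head()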
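Here is a small runnable sketch of both approaches on made-up data. It assumes numeric columns and passes numeric_only=True to mean() so it would also work if a text column were added:

```python
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({'OO': [1.0, None, 3.0], 'XX': [10.0, 20.0, None]})

dropped = df.dropna()               # drop every row that has a missing value
means = df.mean(numeric_only=True)  # per-column means, missing values skipped
filled = df.fillna(means)           # fill each gap with its own column's mean

print(dropped)
print(means)
print(filled)
```

dropna() loses two of the three rows here, while fillna() keeps all of them, which is the usual trade-off between the two.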

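pd.DataFrame() : creates a pandas DataFrame
to_json() : converts to JSON
json.load() : parses JSON from a file into Python objects
read_json() : reads JSON into a DataFrame
head() : prints the first 5 rows by default / as many rows as the number in ()
to_csv() : converts to CSV
describe() : shows summary statistics
corr() : shows correlation coefficients
sort_values() : sorts in ascending order by the column in ()
sort_values('column', ascending=False) : sorts in descending order by the column
loc : access by row/column label or boolean array
iloc[] : access by integer position
dropna() : drops missing values (missing data handling)
fillna() : fills missing values with the value (or variable) in ()

compression options : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}
allowed orient values : {'split', 'records', 'index', 'columns', 'values', 'table'}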
