Python - Day 4 (CDS1 Day 1)

Junyong-Ahn · March 11, 2024

Python + Visualization


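Excel I/O

import pandas as pd
import numpy as np

# Load the Excel file. With sheet_name=None, every sheet is loaded
# and the result is a dict that maps each sheet name to a DataFrame.
excel = pd.read_excel('data/seoul_transportation.xlsx', sheet_name=None)
# Access a specific sheet by name
sheet1 = excel['지하철']
# Save to Excel. The dict itself has no to_excel(), so save a single sheet's DataFrame.
sheet1.to_excel('sample.xlsx', index=True)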
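Since sheet_name=None returns a dict of DataFrames, writing all of the sheets back into one workbook needs a writer object. The following is a minimal sketch, assuming openpyxl is installed; the sheet names, columns, and values are made-up placeholders, and an in-memory buffer stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical stand-in for the dict returned by read_excel(..., sheet_name=None)
sheets = {
    "subway": pd.DataFrame({"line": [1, 2], "riders": [10, 20]}),
    "bus": pd.DataFrame({"route": [100, 200], "riders": [5, 15]}),
}

buffer = io.BytesIO()  # in-memory target; a path such as 'sample.xlsx' also works
# ExcelWriter keeps one workbook open so each DataFrame can go to its own sheet
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    for name, frame in sheets.items():
        frame.to_excel(writer, sheet_name=name, index=False)

# Read every sheet back to confirm the round trip
buffer.seek(0)
loaded = pd.read_excel(buffer, sheet_name=None)
print(list(loaded.keys()))
```

Calling to_excel with a plain path each time would overwrite the file, which is why the writer is shared across the loop.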

CSV I/O

# Load a CSV file
df = pd.read_csv('data/seoul_population.csv')
# Save as a CSV file (index=False leaves the row index out of the file)
df.to_csv('sample.csv', index=False)
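The effect of the index argument is easier to see with a tiny round trip. This is a sketch on made-up sample values, using io.StringIO instead of a file on disk; the column names are placeholders, not the columns of seoul_population.csv.

```python
import io
import pandas as pd

# Made-up sample data for illustration only
df_demo = pd.DataFrame({"district": ["A", "B"], "population": [100, 200]})

# index=False: only the columns are written
without_index = df_demo.to_csv(index=False)
# index=True (the default): the row labels become an unnamed first column
with_index = df_demo.to_csv(index=True)

# Reading the index-less text back gives the same two columns
restored = pd.read_csv(io.StringIO(without_index))

# Reading the indexed text back as-is adds an extra "Unnamed: 0" column
restored_extra = pd.read_csv(io.StringIO(with_index))
# index_col=0 tells read_csv to use that first column as the index instead
restored_indexed = pd.read_csv(io.StringIO(with_index), index_col=0)

print(list(restored_extra.columns))
```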

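JSON I/O

import requests
import json

# url is the address of a JSON endpoint; the original notes do not show it,
# so it must be defined before this code runs.
ret = requests.get(url)

# Option 1: parse the response text into Python objects
json_data = json.loads(ret.text)
# Option 2: let pandas fetch and parse the JSON straight into a DataFrame
df = pd.read_json(url)
# Save in JSON format
df.to_json('currency.json')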
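The two loading routes can be compared without a network call. The sketch below assumes a response body that is a list of flat records; the text value and its field names are hypothetical placeholders standing in for ret.text.

```python
import io
import json
import pandas as pd

# Hypothetical response body standing in for ret.text
text = '[{"currency": "AAA", "rate": 1.5}, {"currency": "BBB", "rate": 2.5}]'

# json.loads turns the text into plain Python objects (here, a list of dicts)
records = json.loads(text)
df_from_records = pd.DataFrame(records)

# pd.read_json parses the same text directly into a DataFrame
df_from_text = pd.read_json(io.StringIO(text))

# orient="records" writes one JSON object per row, matching the input layout
round_trip = df_from_records.to_json(orient="records")
print(round_trip)
```

json.loads is the more flexible route when the payload is nested and needs reshaping first; pd.read_json is shorter when the payload is already table-shaped.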

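Examples using loc

# Here df holds data with Titanic-style columns (survived, pclass, age, fare).
# Rows where age is strictly between 20 and 40
cond1 = (df['age'] > 20) & (df['age'] < 40)
# Rows where pclass is 1 or 2
cond2 = (df['pclass'] == 1) | (df['pclass'] == 2)
# Select rows matching both conditions, keep four columns, sort by pclass descending, show the first 10
df.loc[cond1 & cond2, ['survived', 'pclass', 'age', 'fare']].sort_values(by='pclass', ascending=False).head(10)


# Here df holds a different dataset, with sex, children, region, bmi, and age columns.
# Keep only rows where the sex column is male.
cond1 = df['sex'] == 'male'
# Keep only rows where children is 2 or more.
cond2 = df['children'] >= 2
# Keep only rows where region is northeast or northwest.
cond3 = (df['region'] == 'northeast') | (df['region'] == 'northwest')
# Sort by bmi and then age in descending order, and output only the rows at positions 3 to 5.
df.loc[cond1 & cond2 & cond3].sort_values(by=['bmi', 'age'], ascending=False).iloc[3:6]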
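The same masks can be written more compactly with between() and isin(). This is a small sketch on made-up rows (the demo frame and its values are placeholders), showing that both spellings select the same rows and that iloc slices by position after sorting.

```python
import pandas as pd

# Made-up rows for illustration only
demo = pd.DataFrame({
    "age": [18, 25, 33, 40, 52, 29],
    "pclass": [1, 2, 3, 1, 2, 1],
    "fare": [80.0, 20.0, 8.0, 60.0, 15.0, 90.0],
})

# Explicit comparisons combined with & and | (each comparison needs parentheses)
cond_age = (demo["age"] > 20) & (demo["age"] < 40)
cond_class = (demo["pclass"] == 1) | (demo["pclass"] == 2)

# Equivalent masks written with between() and isin()
cond_age_alt = demo["age"].between(20, 40, inclusive="neither")
cond_class_alt = demo["pclass"].isin([1, 2])

# loc takes the row mask first, then the list of columns to keep
selected = demo.loc[cond_age & cond_class, ["pclass", "age", "fare"]]

# After sorting, iloc[1:3] picks the 2nd and 3rd rows by position, not by label
ranked = demo.sort_values(by=["fare", "age"], ascending=False).iloc[1:3]

print(selected)
print(ranked)
```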

Using copy()

To keep changes from affecting the original data, create a duplicate with df_copy = df.copy() and work on that instead.
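A short sketch of the difference between plain assignment and copy(); the variable names and values here are illustrative only.

```python
import pandas as pd

original = pd.DataFrame({"age": [22, 35]})

alias = original            # plain assignment: a second name for the same object
df_copy = original.copy()   # copy(): an independent DataFrame

# Editing the copy leaves the original untouched
df_copy.loc[0, "age"] = 99
# Editing through the alias changes the original as well
alias.loc[1, "age"] = 0

print(original)
```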

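Handling missing values

import seaborn as sns

df1 = sns.load_dataset('titanic')
# Mask of missing ages: True for rows where age is NA
cond_age = df1['age'].isna()
# Mean of the age column, computed from the rows where age is not NA
age_mean = df1.loc[~cond_age, 'age'].mean()
# Fill the missing ages with that mean
df1.loc[cond_age, 'age'] = df1.loc[cond_age, 'age'].fillna(age_mean)

result_age = df1['age']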
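Series.mean() skips NaN by default, so the ~cond_age filter above is not strictly required and the fill can be written in one step. A minimal sketch on a made-up series:

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing value
ages = pd.Series([20.0, np.nan, 40.0])

# mean() ignores NaN by default (skipna=True), so this averages 20 and 40
mean_skipping_nan = ages.mean()

# fillna only touches the missing entries and leaves the rest as they are
filled = ages.fillna(mean_skipping_nan)

print(filled.tolist())
```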

##########################################################################################
# - For rows where who is man, fill missing age values with the median age of men
# - For rows where who is woman, fill missing age values with the 25% quantile of women's ages
# - For rows where who is child, fill missing age values with the mean age of children
##########################################################################################
# Load a fresh copy of the data (df1 above no longer has any missing ages)
titanic = sns.load_dataset('titanic')

# Filter for man
cond_man = titanic['who'] == 'man'
# Median age of men (median() ignores NA values)
median_man = titanic.loc[cond_man, 'age'].median()
# Fill the missing ages in the rows matching cond_man with median_man
titanic.loc[cond_man, 'age'] = titanic.loc[cond_man, 'age'].fillna(median_man)


# Filter for woman
cond_woman = titanic['who'] == 'woman'  # titanic.loc[cond_woman] -> rows where 'who' is woman
# 25% quantile of women's ages (quantile() ignores NA values)
quan_woman = titanic.loc[cond_woman, 'age'].quantile(0.25)
# Fill the missing ages in the rows matching cond_woman with quan_woman
titanic.loc[cond_woman, 'age'] = titanic.loc[cond_woman, 'age'].fillna(quan_woman)

# Filter for child
cond_child = titanic['who'] == 'child'
# Mean age of children
mean_child = titanic.loc[cond_child, 'age'].mean()
# Fill the missing ages in the rows matching cond_child with mean_child
titanic.loc[cond_child, 'age'] = titanic.loc[cond_child, 'age'].fillna(mean_child)
##########################################################################################
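The three blocks above repeat one pattern: filter a group, compute a statistic, fill. One way to fold them into a loop is sketched below, on a small made-up table rather than the real Titanic data; the people frame and the fill_rules mapping are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing age per group
people = pd.DataFrame({
    "who": ["man", "man", "man", "woman", "woman", "woman", "child", "child", "child"],
    "age": [30.0, np.nan, 40.0, 20.0, 28.0, np.nan, 4.0, np.nan, 8.0],
})

# Which statistic to use for each group, mirroring the three rules above
fill_rules = {
    "man": lambda s: s.median(),
    "woman": lambda s: s.quantile(0.25),
    "child": lambda s: s.mean(),
}

filled = people.copy()  # work on a copy so the original stays unchanged
for group, stat in fill_rules.items():
    mask = filled["who"] == group
    # Each statistic ignores NaN, so it is computed from the known ages only
    fill_value = stat(filled.loc[mask, "age"])
    filled.loc[mask, "age"] = filled.loc[mask, "age"].fillna(fill_value)

print(filled)
```

When every group uses the same statistic, groupby('who')['age'].transform(...) is a shorter alternative; the explicit loop is used here because each group needs a different one.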
