[Section1 Sprint1] EDA

Kyungtaek Oh · June 14, 2022

AI Bootcamp


https://github.com/KYOH95/ds-section1-sprint1-new

EDA
- Data Preprocessing
- Feature Engineering
- Business Insight
Data Wrangling
- Data collection
- Data exploration
- Data cleaning

Data Preprocessing & Exploratory Data Analysis

1. Load and Explore the Data

Loading the data

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Google Colab only: upload the CSV files from the local machine
from google.colab import files
uploaded = files.upload()

df1 = pd.read_csv("data1.csv")
df2 = pd.read_csv("data2.csv")
df3 = pd.read_csv("data3.csv")
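
Outside Colab the upload step is not needed, and a quick look at the shape and column types right after loading helps catch parsing problems early. A minimal sketch, assuming a small in-memory CSV (the `csv_text` contents are made up and only stand in for a file such as data1.csv):

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = "geo,time,cell_phones_total\nusa,2000,100\nchn,2000,200\n"

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # inferred type of each column
print(sample.head())  # first rows of the table
```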

Sum of missing values for every dataset

# df stands for any of the loaded DataFrames (df1, df2, df3)
df.isnull().sum()
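
To run the same check over several DataFrames at once, a loop over a dictionary works. This is only a sketch on toy data; the names `toy_a`, `toy_b`, and `datasets` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames; the real ones come from the CSV files.
toy_a = pd.DataFrame({
    "geo": ["usa", "chn", None],
    "time": [2000, 2001, 2002],
    "cell_phones_total": [np.nan, 10.0, np.nan],
})
toy_b = pd.DataFrame({
    "geo": ["usa", "chn"],
    "population": [280.0, np.nan],
})

datasets = {"toy_a": toy_a, "toy_b": toy_b}

# isnull() marks each missing cell as True; sum() counts them per column.
missing = {name: frame.isnull().sum() for name, frame in datasets.items()}

for name, counts in missing.items():
    print(name)
    print(counts)
```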

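Build a DataFrame with only the rows matching certain column values and only the columns you want

# 나라 = country, 돈 = money, 시간 = time, 날짜 = date
condition = (df.나라 == 'usa') | (df.나라 == 'chn')
df_clean = df.loc[condition, ['나라', '돈', '시간', '날짜']]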
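
When the list of accepted values grows, chaining `|` comparisons gets long, and `Series.isin` expresses the same filter more compactly. A small sketch on made-up data (the column names here are assumptions, not the ones in the original files):

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "geo": ["usa", "chn", "kor", "usa"],
    "time": [2000, 2000, 2000, 2001],
    "cell_phones_total": [100, 200, 50, 120],
    "extra": ["a", "b", "c", "d"],
})

# True for rows whose geo is one of the listed values.
condition = toy["geo"].isin(["usa", "chn"])

# loc takes the row mask first, then the list of columns to keep.
subset = toy.loc[condition, ["geo", "time", "cell_phones_total"]]
print(subset)
```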

Get the row at index position 123

# iloc is position-based and zero-based, so this returns the 124th row
df.iloc[123]
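
`iloc` selects by position while `loc` selects by index label, and the two only agree when the index is the default 0, 1, 2, ... range. A short sketch with a deliberately shuffled index (toy data, made up for illustration):

```python
import pandas as pd

# Hypothetical frame whose index labels are not in positional order.
toy = pd.DataFrame({"value": [10, 20, 30]}, index=[102, 100, 101])

by_position = toy.iloc[1]  # second row, whatever its label is
by_label = toy.loc[101]    # row whose index label is 101

print(by_position["value"])
print(by_label["value"])
```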

scatter plot

sns.scatterplot(data=df, x="time", y="cell_phones_total", hue="geo")

2. Join data

inner join

df_join = pd.merge(df1, df2, how='inner', on=['geo', 'time'])
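
An inner join keeps only the key combinations that appear in both tables. The sketch below shows this on two tiny made-up frames (`phones` and `population` are hypothetical names), joined on the same `geo` and `time` keys:

```python
import pandas as pd

# Hypothetical toy tables sharing the keys geo and time.
phones = pd.DataFrame({
    "geo": ["usa", "usa", "chn"],
    "time": [2000, 2001, 2000],
    "cell_phones_total": [100, 120, 200],
})
population = pd.DataFrame({
    "geo": ["usa", "chn", "kor"],
    "time": [2000, 2000, 2000],
    "population": [280, 1260, 47],
})

# (usa, 2001) exists only in phones and (kor, 2000) only in population,
# so neither survives the inner join.
joined = pd.merge(phones, population, how="inner", on=["geo", "time"])
print(joined)
```

Comparing `len(joined)` with the lengths of the inputs is a quick way to see how many rows the join dropped.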

Applying a condition to the DataFrame

condition = (df_join.country == "United States") & (df_join.population < df_join.cell_phones_total)
df_join[condition]
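
Each comparison needs its own parentheses because `&` binds more tightly than `==` and `<` in Python. A minimal sketch of the same kind of combined filter on made-up numbers:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the joined frame.
toy = pd.DataFrame({
    "country": ["United States", "United States", "China"],
    "population": [300, 320, 1400],
    "cell_phones_total": [250, 400, 1500],
})

# Both sub-conditions must hold for a row to be kept.
condition = (toy["country"] == "United States") & (
    toy["population"] < toy["cell_phones_total"]
)
result = toy[condition]
print(result)
```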

3. Feature Engineering

Adding a new column and filling it with values

# "새로운 컬럼" means "new column"; here it holds cell phones per person
df_join["새로운 컬럼"] = df_join.cell_phones_total / df_join.population
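
The division is applied element-wise, row by row, so no loop is needed. A small sketch with made-up numbers (the column name `phones_per_person` is an assumption chosen for readability):

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "population": [100, 200],
    "cell_phones_total": [50, 300],
})

# Assigning to a new key creates the column; the ratio is computed per row.
toy["phones_per_person"] = toy["cell_phones_total"] / toy["population"]
print(toy)
```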

sort_values

df_new = df.sort_values(by=['PPP'], ascending=False)
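
`sort_values` returns a new DataFrame and keeps the original index labels, so the labels end up out of order after sorting. A sketch on toy data (the `PPP` values are made up) that also resets the index:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "geo": ["a", "b", "c"],
    "PPP": [10, 30, 20],
})

# Largest PPP first; the original frame is left unchanged.
ranked = toy.sort_values(by=["PPP"], ascending=False)

# drop=True discards the old labels instead of keeping them as a column.
ranked_clean = ranked.reset_index(drop=True)
print(ranked_clean)
```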
