📕 Week4 day3 (EDA)

박준희 · September 13, 2023


EDA (Exploratory Data Analysis)


The EDA Process

  1. Identify the goal of the analysis and check the variables
  2. Look over the data as a whole
  3. Examine the individual attributes of the data

1. Identify the goal of the analysis and check the variables

First, load the data to analyze.

  • Analysis goal: what characteristics did the survivors have?

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the data
titanic_df = pd.read_csv("./train.csv")
# Check the first 5 rows
titanic_df.head(5)


This shows each column.

Column descriptions

  • PassengerId: passenger ID (a unique number)
  • Survived: whether the passenger survived - 0 = No, 1 = Yes
  • Pclass: ticket class - 1 = 1st, 2 = 2nd, 3 = 3rd
  • SibSp: number of siblings/spouses aboard with the passenger (0 if none)
  • Parch: number of parents/children aboard with the passenger (0 if none)
  • Ticket: ticket number
  • Fare: how much the passenger paid
  • Cabin: cabin number
  • Embarked: port where the passenger boarded - C = Cherbourg, Q = Queenstown, S = Southampton
# Check the data type of each column
titanic_df.dtypes
>>>PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

2. Look over the data as a whole

Now take one overall look at the loaded data.

# Get summary information for the whole dataset: .describe()

titanic_df.describe()  # summarizes numeric columns only


PassengerId is just an assigned number, so its summary carries no information.
For Age, the mean and the median are both in the 20s and the third quartile is 38, so most passengers were not very old.
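
By default describe() skips text columns, so columns such as Sex or Embarked get no summary above. Below is a minimal sketch of one way to summarize them as well. toy_df and its values are made up for illustration and are not the Titanic data.

```python
import pandas as pd

# Hypothetical miniature frame standing in for titanic_df; the values are made up.
toy_df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "S"],
})

# Default: only numeric columns are summarized (count, mean, std, quartiles, ...).
numeric_summary = toy_df.describe()

# Excluding numeric columns summarizes the remaining ones instead:
# count, unique, top (most frequent value), freq (how often top occurs).
text_summary = toy_df.describe(exclude=["number"])

print(numeric_summary.columns.tolist())
print(text_summary.loc[["unique", "top", "freq"], "Embarked"])
```

On the real data the same call would be titanic_df.describe(exclude=["number"]).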

# Check the correlation coefficients!
titanic_df.corr(numeric_only=True)
# correlation is not causation
# Correlation: when A goes up, B goes up
# Causation: A -> B


corr() returns the correlation coefficients between numeric variables as a DataFrame. The point to keep in mind is that a correlation coefficient does not imply a causal relationship.

Looking at the corr values, Pclass and Survived have a negative correlation. A better class has a lower Pclass value, so this means the better the class, the higher the Survived value tends to be.
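
The sign of this coefficient is easy to misread because a smaller Pclass number means a better class. Here is a small sketch with made-up numbers (toy_df is hypothetical, not the Titanic file) that puts the correlation next to the per-class survival rate:

```python
import pandas as pd

# Hypothetical toy data: a lower Pclass number means a better class.
toy_df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0, 0, 1],
})

# Pearson correlation between the two columns.
r = toy_df["Pclass"].corr(toy_df["Survived"])

# Survival rate per class, to read the sign in plain terms.
rate_by_class = toy_df.groupby("Pclass")["Survived"].mean()

print(r < 0)  # True: as the Pclass number goes up (worse class), Survived goes down
print(rate_by_class)
```

A negative r here goes together with class 1 having the highest survival rate and class 3 the lowest.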

Next, check the missing values in the dataset.

# Check missing values
titanic_df.isnull().sum()
# Missing values found in Age, Cabin, and Embarked
>>>PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
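
Raw counts are easier to judge as a share of all rows: Cabin's 687 missing values out of 891 rows are roughly 77%, while Age's 177 are roughly 20%. A minimal sketch of computing that share directly, using a made-up frame (toy_df is hypothetical):

```python
import pandas as pd

# Hypothetical toy frame with gaps; None becomes a missing value.
toy_df = pd.DataFrame({
    "Age":   [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Fare":  [7.25, 71.28, 7.92, 8.05],
})

missing_count = toy_df.isnull().sum()   # number of missing values per column
missing_ratio = toy_df.isnull().mean()  # fraction of rows missing per column

print(missing_ratio.sort_values(ascending=False))
```

isnull() yields True/False per cell, so taking the mean of each column gives the fraction of missing rows.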

3. Examine the individual attributes of the data

Having looked over the data as a whole, this time examine the attributes column by column.

  1. Survived Column
# How many survived, and how many died?

titanic_df["Survived"].value_counts()
>>>Survived
0    549
1    342
Name: count, dtype: int64

549 passengers died and 342 survived, so there were more deaths than survivors.

# Draw the survivor and death counts as a bar plot with sns.countplot()

sns.countplot(x ='Survived', data = titanic_df)
plt.show()


The visualization makes the result easier to see at a glance.

  2. Pclass
# Number of passengers per Pclass

titanic_df[['Pclass','Survived']].groupby(['Pclass']).count()
# Number of survivors
titanic_df[['Pclass','Survived']].groupby(['Pclass']).sum()

Survivors have the value 1, so sum() gives the number of survivors.

The total number of passengers differs by Pclass, so I expressed the result as a survival rate.

# Survival rate
titanic_df[['Pclass','Survived']].groupby(['Pclass']).mean()

The better the class, the higher the survival rate.
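
The three groupby calls above can also be combined into one table. A minimal sketch on made-up data (toy_df and the renamed column labels are hypothetical), chosen so that every class has the same number of survivors but a different rate:

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file.
toy_df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
})

# One groupby, three aggregations: passengers, survivors, survival rate.
summary = toy_df.groupby("Pclass")["Survived"].agg(["count", "sum", "mean"])
summary.columns = ["passengers", "survivors", "survival_rate"]

print(summary)
```

Each class has 2 survivors here, yet the rates are 1.0, 0.5, and 0.25, which is why the rate is the fairer comparison when group sizes differ.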
  3. Sex
    I used groupby to check the number of survivors and deaths by sex.
titanic_df.groupby(['Survived', 'Sex'])['Survived'].count()
>>>Survived  Sex   
0         female     81
          male      468
1         female    233
          male      109
Name: Survived, dtype: int64

I then visualized this.

sns.catplot(x = 'Sex', col = 'Survived', kind = 'count',data = titanic_df)
plt.show()


The survival rate of women is far higher than that of men.
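
To put numbers on that, the rates can be worked out from the groupby counts shown above. This is plain arithmetic on those four counts; the helper name survival_rate is only for illustration.

```python
# Counts copied from the groupby output above: (Survived, Sex) -> passengers
counts = {
    (0, "female"): 81,
    (0, "male"): 468,
    (1, "female"): 233,
    (1, "male"): 109,
}

def survival_rate(sex):
    """Survivors of the given sex divided by all passengers of that sex."""
    survived = counts[(1, sex)]
    total = counts[(0, sex)] + counts[(1, sex)]
    return survived / total

for sex in ("female", "male"):
    print(sex, round(survival_rate(sex), 3))
# female 0.742
# male 0.189
```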

  4. Age
# Trend of Age for Survived = 1 and 0

# Create a single Axes so that both curves are drawn on the same axis
fig, ax = plt.subplots(1,1,figsize = (10,5))
sns.kdeplot(x = titanic_df[titanic_df['Survived']==1]['Age'],ax = ax)
sns.kdeplot(x = titanic_df[titanic_df['Survived']==0]['Age'],ax = ax)
plt.legend(['Survived','Dead'])  # legend to tell the two curves apart

plt.show()

Age is a continuous variable, so I used a KDE plot to visualize it.
In the graph, passengers younger than 20 show a higher survival rate, while passengers in their 20s and 30s show a higher death rate.
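
Each KDE curve is normalized on its own, so the curves compare the age distribution inside each group rather than the survival rate itself. One way to check the rate directly is to bin the ages. The sketch below uses made-up data, and the bin edges (0, 20, 40, 80) are an assumption chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data; None marks a missing age.
toy_df = pd.DataFrame({
    "Age":      [5, 15, 18, 25, 30, 35, 50, None],
    "Survived": [1,  1,  1,  0,  1,  0,  0,  1],
})

# Bin ages into (0, 20], (20, 40], (40, 80]; rows with a missing Age get no bin.
age_group = pd.cut(toy_df["Age"], bins=[0, 20, 40, 80],
                   labels=["0-20", "21-40", "41-80"])

# Survival rate per age bin (rows without a bin are dropped by groupby).
rate_by_age = toy_df.groupby(age_group, observed=True)["Survived"].mean()

print(rate_by_age)
```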

Appendix

As seen above, Sex, Pclass, and Age all appear to be related to Survived.
So this time I combine two variables at a time and visualize what effect they have.

  1. Sex+Pclass vs survived
sns.catplot(x = "Pclass", y = 'Survived', hue = 'Sex',kind = 'point', data = titanic_df)
plt.show()

The plot puts Pclass on the x-axis and Survived on the y-axis, with the hue argument set to Sex, so the Survived value of each class is shown split by sex.
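
With kind = 'point', seaborn by default plots the mean of the y variable for each category, so each point is a survival rate. The same numbers can be read off as a table. A minimal sketch on made-up data (toy_df is hypothetical):

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file.
toy_df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "male"] * 2,
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Mean of Survived for every (Pclass, Sex) pair: rows = class, columns = sex.
rate_table = toy_df.pivot_table(index="Pclass", columns="Sex",
                                values="Survived", aggfunc="mean")

print(rate_table)
```

Each cell of rate_table corresponds to one point in the plot.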

  2. Age+Pclass
# Age graph with Pclass

titanic_df['Age'][titanic_df.Pclass == 1].plot(kind = 'kde')
titanic_df['Age'][titanic_df.Pclass == 2].plot(kind = 'kde')
titanic_df['Age'][titanic_df.Pclass == 3].plot(kind = 'kde')
plt.legend(['1st class','2nd class','3rd class'])
plt.show()

This time I visualized how Age is distributed within each Pclass.
I took the Age column from titanic_df and drew a KDE plot for each Pclass.

In general, the better the class, the higher the average age.

The end~!

