[Kaggle][Competitions] Following the Titanic Tutorial

켈로그 · November 19, 2023

Why Titanic?

  • The data has already been cleaned once
  • You get to practice on real data
  • It's a good way to learn how to participate on Kaggle
  • You get to try out Kaggle notebooks

Titanic Goal

  • Among the people aboard the Titanic, which attributes affected their survival rate?

➿Code to follow along➿

Titanic Tutorial

Titanic - Machine Learning from Disaster | Kaggle

```python
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
```

```
/kaggle/input/titanic/train.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/gender_submission.csv
```

```python
train = pd.read_csv('/kaggle/input/titanic/train.csv')
train.head()
```


|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |

```python
train.columns
```

```
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
```

```python
test = pd.read_csv('/kaggle/input/titanic/test.csv')
test.head()
```

|  | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
```python
test.columns
```

```
Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
```

```python
women = train.loc[train["Sex"] == "female"]["Survived"]
women
```

```
1      1
2      1
3      1
8      1
9      1
      ..
880    1
882    0
885    0
887    1
888    0
Name: Survived, Length: 314, dtype: int64
```

```python
# Since the values are only 1 or 0, sum can be used instead of count
# Lets us see what percentage survived
rate = (sum(women) / len(women)) * 100
rate
```

```
74.20382165605095
```

```python
men = train.loc[train["Sex"] == "male"]["Survived"]
men
```

```
0      0
4      0
5      0
6      0
7      0
      ..
883    0
884    0
886    0
889    1
890    0
Name: Survived, Length: 577, dtype: int64
```

```python
rate_men = (sum(men) / len(men)) * 100
rate_men
```

```
18.890814558058924
```
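Since Survived only holds 0s and 1s, the mean of the column is itself the survival rate, so both numbers can be computed at once. A minimal equivalent sketch:

```python
# Equivalent to the sum/len computations above: the mean of a 0/1 column
# is the fraction of 1s, i.e. the survival rate.
print(train.groupby("Sex")["Survived"].mean() * 100)
# female    74.203822
# male      18.890815
```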

Random Forest Model

Builds many decision trees from random subsets of the data and features
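As a rough intuition for how the forest combines its trees, here is a toy sketch with made-up votes (not the tutorial's code; scikit-learn actually averages predicted probabilities rather than counting hard votes, but the idea is the same):

```python
import numpy as np

# Made-up predictions from 3 hypothetical trees for 4 passengers.
tree_votes = np.array([
    [0, 1, 1, 0],  # tree 1
    [0, 1, 0, 0],  # tree 2
    [1, 1, 0, 0],  # tree 3
])

# The forest's output is the majority vote across the trees.
majority = (tree_votes.mean(axis=0) >= 0.5).astype(int)
print(majority)  # [0 1 0 0]
```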

```python
# RandomForestClassifier: combines the votes of many decision trees
from sklearn.ensemble import RandomForestClassifier

y = train["Survived"]
y
```

```
0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64
```

```python
train.columns
```

```
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
```

```python
# Pick out the features to train on
# features = ["Pclass", "Sex", "SibSp", "Parch"]
features = "Pclass, Sex, SibSp, Parch".split(", ")
features
```

```
['Pclass', 'Sex', 'SibSp', 'Parch']
```

```python
# One-hot encode the feature columns of train (Sex becomes Sex_female / Sex_male)
X = pd.get_dummies(train[features])
X
```
|  | Pclass | SibSp | Parch | Sex_female | Sex_male |
| --- | --- | --- | --- | --- | --- |
| 0 | 3 | 1 | 0 | False | True |
| 1 | 1 | 1 | 0 | True | False |
| 2 | 3 | 0 | 0 | True | False |
| 3 | 1 | 1 | 0 | True | False |
| 4 | 3 | 0 | 0 | False | True |
| ... | ... | ... | ... | ... | ... |
| 886 | 2 | 0 | 0 | False | True |
| 887 | 1 | 0 | 0 | True | False |
| 888 | 3 | 1 | 2 | True | False |
| 889 | 1 | 0 | 0 | False | True |
| 890 | 3 | 0 | 0 | False | True |

891 rows × 5 columns
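pd.get_dummies is what turned the text column Sex into the two boolean columns Sex_female / Sex_male above; a tiny standalone demonstration on hypothetical data:

```python
import pandas as pd

demo = pd.DataFrame({"Sex": ["male", "female", "female"]})
print(pd.get_dummies(demo))
#    Sex_female  Sex_male
# 0       False      True
# 1        True     False
# 2       False      True
# (booleans in pandas >= 2.0; 0/1 integers in older versions)
```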

```python
X_test = pd.get_dummies(test[features])
X_test
```

|  | Pclass | SibSp | Parch | Sex_female | Sex_male |
| --- | --- | --- | --- | --- | --- |
| 0 | 3 | 0 | 0 | False | True |
| 1 | 3 | 1 | 0 | True | False |
| 2 | 2 | 0 | 0 | False | True |
| 3 | 3 | 0 | 0 | False | True |
| 4 | 3 | 1 | 1 | True | False |
| ... | ... | ... | ... | ... | ... |
| 413 | 3 | 0 | 0 | False | True |
| 414 | 1 | 0 | 0 | True | False |
| 415 | 3 | 0 | 0 | False | True |
| 416 | 3 | 0 | 0 | False | True |
| 417 | 3 | 1 | 1 | False | True |

418 rows × 5 columns

```python
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
# max_depth equals the number of feature columns -> probably no duplicate splits
model
```

```
RandomForestClassifier(max_depth=5, random_state=1)
```

```python
model.__dict__
```

```
{'estimator': DecisionTreeClassifier(),
 'n_estimators': 100,
 'estimator_params': ('criterion',
  'max_depth',
  'min_samples_split',
  'min_samples_leaf',
  'min_weight_fraction_leaf',
  'max_features',
  'max_leaf_nodes',
  'min_impurity_decrease',
  'random_state',
  'ccp_alpha'),
 'base_estimator': 'deprecated',
 'bootstrap': True,
 'oob_score': False,
 'n_jobs': None,
 'random_state': 1,
 'verbose': 0,
 'warm_start': False,
 'class_weight': None,
 'max_samples': None,
 'criterion': 'gini',
 'max_depth': 5,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'min_weight_fraction_leaf': 0.0,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'ccp_alpha': 0.0}
```

```python
# Fit the forest: train every tree on the training data
model.fit(X, y)
```

```
RandomForestClassifier(max_depth=5, random_state=1)
```
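The tutorial never checks accuracy against real labels (test.csv has no Survived column), so as a side note, here is one way you could estimate it on the training data, assuming scikit-learn's cross_val_score:

```python
# Not part of the tutorial: 5-fold cross-validation gives a rough
# accuracy estimate before submitting.
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())  # mean accuracy across the 5 folds
```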

```python
# Try the model out on the test set
predicts = model.predict(X_test)
predicts  # predicted survived (1) / died (0) for each test passenger
```

```
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
       0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0])
```

```python
output = pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": predicts})
output
```
|  | PassengerId | Survived |
| --- | --- | --- |
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
| ... | ... | ... |
| 413 | 1305 | 0 |
| 414 | 1306 | 1 |
| 415 | 1307 | 0 |
| 416 | 1308 | 0 |
| 417 | 1309 | 0 |

418 rows × 2 columns

```python
output.to_csv("submit.csv", index=False)
```
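Coming back to the opening question (which attributes affected the survival rate), the fitted forest exposes per-feature importances; a minimal sketch:

```python
# Rank the one-hot-encoded features by how much the forest relied on them.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```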