[Hands-on Quiz] AI Modeling with Python - Preprocessing Part
• In this session we practice preprocessing for AI modeling with Python.
• Preprocessing accounts for roughly 60-70% of machine learning and AI modeling work.
• It takes a great deal of time and effort, and can be difficult.
• If the data is not cleaned properly, ML/AI performance cannot be guaranteed, so put real care into data preprocessing.
• One piece of advice: "one keystroke is worth a hundred explanations" (백문이불여일타).
• Practice deserves more of your time and effort than theory.
Contents
1. Overview of the exercise
2. Import required libraries and load the file
3. EDA (Exploratory Data Analysis)
4. Data preprocessing
• Drop unnecessary columns
• Change column contents
• Handle nulls
• Change column types
5. Visualization
6. Save the results
Predicting telecom service churn with machine learning and deep learning
Analyze all relevant customer data and develop a robust, accurate churn-prediction model to support strategies for retaining customers and reducing churn.
Churn refers to customers or users who stop using a service or move to a competitor in the industry. Retaining existing customers and attracting new ones is critical for every organization; failing at either is bad for business. The goal is to explore the potential of machine learning and deep learning for churn prediction in order to maintain a competitive edge.
Numpy
[Quiz] Import the numpy library under the alias np.
[1]:
import numpy as np
Pandas
[Quiz] Import the pandas library under the alias pd.
[2]:
import pandas as pd
Data file to load: data_v1.csv
Telco Customer Churn Dataset columns
1. CustomerID: Customer ID unique for each customer
2. gender: Whether the customer is a male or a female
3. SeniorCitizen: Whether the customer is a senior citizen or not (1, 0)
4. Partner: Whether the customer has a partner or not (Yes, No)
5. Dependents: Whether the customer has dependents or not (Yes, No)
6. Tenure: Number of months the customer has stayed with the company
7. PhoneService: Whether the customer has a phone service or not (Yes, No)
8. MultipleLines: Whether the customer has multiple lines or not (Yes, No, No phone service)
9. InternetService: Customer’s internet service provider (DSL, Fiber optic, No)
10. OnlineSecurity: Whether the customer has online security or not (Yes, No, No internet service)
11. OnlineBackup: Whether the customer has an online backup or not (Yes, No, No internet service)
12. DeviceProtection: Whether the customer has device protection or not (Yes, No, No internet service)
13. TechSupport: Whether the customer has tech support or not (Yes, No, No internet service)
14. StreamingTV: Whether the customer has streaming TV or not (Yes, No, No internet service)
15. StreamingMovies: Whether the customer has streaming movies or not (Yes, No, No internet service)
16. Contract: The contract term of the customer (Month-to-month, One year, Two years)
17. PaperlessBilling: Whether the customer has paperless billing or not (Yes, No)
18. PaymentMethod: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
19. MonthlyCharges: The amount charged to the customer monthly
20. TotalCharges: The total amount charged to the customer
21. Churn: Whether the customer churned or not (Yes or No)
Reading data from the CSV file
[Quiz] Read the data_v1.csv file with the pandas read_csv function and store it in the variable df.
[3]:
df = pd.read_csv('(라이브교육)data_v1.csv')
4:
df
4:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG NaN 0.0 Yes No 1 No No phone service DSL No ... No No No No NaN Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0.0 No No 34 Yes No DSL Yes ... Yes No No No One year No Mailed check 56.95 1889.5 No
2 3668-QPYBK Male 0.0 No No 2 Yes No DSL Yes ... NaN No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0.0 No No 45 No No phone service DSL Yes ... NaN Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0.0 No No 2 Yes No Fiber optic No ... NaN No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7038 6840-RESVB Male 0.0 Yes Yes 24 Yes Yes DSL Yes ... Yes Yes Yes Yes One year Yes Mailed check 84.80 1990.5 No
7039 2234-XADUH Female 0.0 Yes Yes 72 Yes Yes Fiber optic No ... Yes No Yes Yes One year Yes Credit card (automatic) 103.20 7362.9 No
7040 4801-JZAZL Female 0.0 Yes Yes 11 No No phone service DSL Yes ... No No No No Month-to-month Yes Electronic check 29.60 346.45 No
7041 8361-LTMKD Male 1.0 Yes No 4 Yes Yes Fiber optic No ... No No No No Month-to-month Yes Mailed check 74.40 306.6 Yes
7042 3186-AJIEK NaN 0.0 No No 66 Yes No Fiber optic Yes ... Yes Yes Yes Yes Two year Yes Bank transfer (automatic) 105.65 6844.5 No
7043 rows × 21 columns
Exploring the data
[5]:
df.head()
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG NaN 0.0 Yes No 1 No No phone service DSL No ... No No No No NaN Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0.0 No No 34 Yes No DSL Yes ... Yes No No No One year No Mailed check 56.95 1889.5 No
2 3668-QPYBK Male 0.0 No No 2 Yes No DSL Yes ... NaN No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0.0 No No 45 No No phone service DSL Yes ... NaN Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0.0 No No 2 Yes No Fiber optic No ... NaN No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
5 rows × 21 columns
[6]:
df.tail()
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
7038 6840-RESVB Male 0.0 Yes Yes 24 Yes Yes DSL Yes ... Yes Yes Yes Yes One year Yes Mailed check 84.80 1990.5 No
7039 2234-XADUH Female 0.0 Yes Yes 72 Yes Yes Fiber optic No ... Yes No Yes Yes One year Yes Credit card (automatic) 103.20 7362.9 No
7040 4801-JZAZL Female 0.0 Yes Yes 11 No No phone service DSL Yes ... No No No No Month-to-month Yes Electronic check 29.60 346.45 No
7041 8361-LTMKD Male 1.0 Yes No 4 Yes Yes Fiber optic No ... No No No No Month-to-month Yes Mailed check 74.40 306.6 Yes
7042 3186-AJIEK NaN 0.0 No No 66 Yes No Fiber optic Yes ... Yes Yes Yes Yes Two year Yes Bank transfer (automatic) 105.65 6844.5 No
5 rows × 21 columns
Checking the data structure
[7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
0 customerID 7043 non-null object
1 gender 7034 non-null object
2 SeniorCitizen 7042 non-null float64
3 Partner 7043 non-null object
4 Dependents 7041 non-null object
5 tenure 7043 non-null int64
6 PhoneService 7040 non-null object
7 MultipleLines 7043 non-null object
8 InternetService 7043 non-null object
9 OnlineSecurity 7043 non-null object
10 OnlineBackup 7043 non-null object
11 DeviceProtection 3580 non-null object
12 TechSupport 7043 non-null object
13 StreamingTV 7043 non-null object
14 StreamingMovies 7043 non-null object
15 Contract 7042 non-null object
16 PaperlessBilling 7043 non-null object
17 PaymentMethod 7042 non-null object
18 MonthlyCharges 7042 non-null float64
19 TotalCharges 7043 non-null object
20 Churn 7043 non-null object
dtypes: float64(2), int64(1), object(18)
memory usage: 1.1+ MB
Checking dtypes, index, column names, and values
8:
df.index
8:
RangeIndex(start=0, stop=7043, step=1)
[9]:
df.columns
[9]:
Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
dtype='object')
11:
df.values
11:
array([['7590-VHVEG', nan, 0.0, ..., 29.85, '29.85', 'No'],
['5575-GNVDE', 'Male', 0.0, ..., 56.95, '1889.5', 'No'],
['3668-QPYBK', 'Male', 0.0, ..., 53.85, '108.15', 'Yes'],
...,
['4801-JZAZL', 'Female', 0.0, ..., 29.6, '346.45', 'No'],
['8361-LTMKD', 'Male', 1.0, ..., 74.4, '306.6', 'Yes'],
['3186-AJIEK', nan, 0.0, ..., 105.65, '6844.5', 'No']],
dtype=object)
Checking null data
[10]:
df.isnull().sum()
customerID 0
gender 9
SeniorCitizen 1
Partner 0
Dependents 2
tenure 0
PhoneService 3
MultipleLines 0
InternetService 0
OnlineSecurity 0
OnlineBackup 0
DeviceProtection 3463
TechSupport 0
StreamingTV 0
StreamingMovies 0
Contract 1
PaperlessBilling 0
PaymentMethod 1
MonthlyCharges 1
TotalCharges 0
Churn 0
dtype: int64
Summary statistics
[12]:
df.describe()
[12]:
SeniorCitizen tenure MonthlyCharges
count 7042.000000 7043.000000 7042.000000
mean 0.162170 32.371149 64.763256
std 0.368633 24.559481 30.091898
min 0.000000 0.000000 18.250000
25% 0.000000 9.000000 35.500000
50% 0.000000 29.000000 70.350000
75% 0.000000 55.000000 89.850000
max 1.000000 72.000000 118.750000
Checking the data structure
[Quiz] Use a df DataFrame method to check the data structure (rows, columns, non-null counts, dtypes).
14:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
0 customerID 7043 non-null object
1 gender 7034 non-null object
2 SeniorCitizen 7042 non-null float64
3 Partner 7043 non-null object
4 Dependents 7041 non-null object
5 tenure 7043 non-null int64
6 PhoneService 7040 non-null object
7 MultipleLines 7043 non-null object
8 InternetService 7043 non-null object
9 OnlineSecurity 7043 non-null object
10 OnlineBackup 7043 non-null object
11 DeviceProtection 3580 non-null object
12 TechSupport 7043 non-null object
13 StreamingTV 7043 non-null object
14 StreamingMovies 7043 non-null object
15 Contract 7042 non-null object
16 PaperlessBilling 7043 non-null object
17 PaymentMethod 7042 non-null object
18 MonthlyCharges 7042 non-null float64
19 TotalCharges 7043 non-null object
20 Churn 7043 non-null object
dtypes: float64(2), int64(1), object(18)
memory usage: 1.1+ MB
Dropping a column
[Quiz] Drop the 'customerID' column from the df DataFrame.
15:
df.drop('customerID', axis=1, inplace=True)
[16]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
0 gender 7034 non-null object
1 SeniorCitizen 7042 non-null float64
2 Partner 7043 non-null object
3 Dependents 7041 non-null object
4 tenure 7043 non-null int64
5 PhoneService 7040 non-null object
6 MultipleLines 7043 non-null object
7 InternetService 7043 non-null object
8 OnlineSecurity 7043 non-null object
9 OnlineBackup 7043 non-null object
10 DeviceProtection 3580 non-null object
11 TechSupport 7043 non-null object
12 StreamingTV 7043 non-null object
13 StreamingMovies 7043 non-null object
14 Contract 7042 non-null object
15 PaperlessBilling 7043 non-null object
16 PaymentMethod 7042 non-null object
17 MonthlyCharges 7042 non-null float64
18 TotalCharges 7043 non-null object
19 Churn 7043 non-null object
dtypes: float64(2), int64(1), object(17)
memory usage: 1.1+ MB
Changing column contents
Converting categorical string data to numbers has a big impact on model performance, so be sure to convert it.
Before modeling, replace problematic values such as nulls or stray characters ('_') with other values, or drop them if they are not needed.
Changing the TotalCharges column type
[18]:
df['TotalCharges']
[18]:
0 29.85
1 1889.5
2 108.15
3 1840.75
4 151.65
...
7038 1990.5
7039 7362.9
7040 346.45
7041 306.6
7042 6844.5
Name: TotalCharges, Length: 7043, dtype: object
[19]:
df['TotalCharges'].astype(float)
ValueError Traceback (most recent call last)
in
2 # error raised: the string cannot be converted to a number
3
----> 4 df['TotalCharges'].astype(float)
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
5546 else:
5547 # else, only a single dtype is given
-> 5548 new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors,)
5549 return self._constructor(new_data).finalize(self, method="astype")
5550
/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
602 self, dtype, copy: bool = False, errors: str = "raise"
603 ) -> "BlockManager":
--> 604 return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
605
606 def convert(
/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, **kwargs)
407 applied = b.apply(f, **kwargs)
408 else:
--> 409 applied = getattr(b, f)(**kwargs)
410 result_blocks = _extend_blocks(applied, result_blocks)
411
/usr/local/lib/python3.6/dist-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
593 vals1d = values.ravel()
594 try:
--> 595 values = astype_nansafe(vals1d, dtype, copy=True)
596 except (ValueError, TypeError):
597 # e.g. astype_nansafe can fail on object-dtype of strings
/usr/local/lib/python3.6/dist-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
995 if copy or is_object_dtype(arr) or is_object_dtype(dtype):
996 # Explicit copy, or required since NumPy can't view from / to object.
--> 997 return arr.astype(dtype, copy=True)
998
999 return arr.view(dtype)
ValueError: could not convert string to float:
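The traceback above is caused by a blank string in TotalCharges that astype(float) cannot parse. The notebook fixes it below with replace; as a hedged alternative sketch (not the approach used here), pd.to_numeric with errors='coerce' turns unparseable strings into NaN in one step. The toy Series is hypothetical:

```python
import pandas as pd

# Toy stand-in for TotalCharges: numbers stored as strings, plus the
# blank ' ' that astype(float) cannot parse
s = pd.Series(['29.85', '1889.5', ' '])

# errors='coerce' converts unparseable entries to NaN instead of raising
converted = pd.to_numeric(s, errors='coerce')
```

The resulting NaN values can then be filled (e.g. with 0) or dropped before modeling.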
[20]:
(df['TotalCharges'] == '') | (df['TotalCharges'] == ' ')
[20]:
0 False
1 False
2 False
3 False
4 False
...
7038 False
7039 False
7040 False
7041 False
7042 False
Name: TotalCharges, Length: 7043, dtype: bool
[23]:
cond = (df['TotalCharges'] == '') | (df['TotalCharges'] == ' ')
df[cond]
[23]:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
[Quiz] In the df DataFrame, change the value ' ' to '0' in the 'TotalCharges' column.
21:
df['TotalCharges'].replace([' '], ['0'], inplace = True )
[Quiz] Change the 'TotalCharges' column of df from object to float.
[29]:
df['TotalCharges']=df['TotalCharges'].astype(float)
25:
cond = (df['TotalCharges'] == '') | (df['TotalCharges'] == ' ')
df[cond]
25:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
30:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
0 gender 7034 non-null object
1 SeniorCitizen 7042 non-null float64
2 Partner 7043 non-null object
3 Dependents 7041 non-null object
4 tenure 7043 non-null int64
5 PhoneService 7040 non-null object
6 MultipleLines 7043 non-null object
7 InternetService 7043 non-null object
8 OnlineSecurity 7043 non-null object
9 OnlineBackup 7043 non-null object
10 DeviceProtection 3580 non-null object
11 TechSupport 7043 non-null object
12 StreamingTV 7043 non-null object
13 StreamingMovies 7043 non-null object
14 Contract 7042 non-null object
15 PaperlessBilling 7043 non-null object
16 PaymentMethod 7042 non-null object
17 MonthlyCharges 7042 non-null float64
18 TotalCharges 7043 non-null float64
19 Churn 7043 non-null object
dtypes: float64(3), int64(1), object(16)
memory usage: 1.1+ MB
Converting the Churn column's string values to numbers
31:
df['Churn'].value_counts()
31:
No 5174
Yes 1869
Name: Churn, dtype: int64
[32]:
df['Churn'].replace(['Yes', 'No'], [1, 0], inplace=True)
33:
df['Churn'].value_counts()
33:
0 5174
1 1869
Name: Churn, dtype: int64
Checking null data
[Quiz] For df, list the number of nulls in each column.
34:
df.isnull().sum()
34:
gender 9
SeniorCitizen 1
Partner 0
Dependents 2
tenure 0
PhoneService 3
MultipleLines 0
InternetService 0
OnlineSecurity 0
OnlineBackup 0
DeviceProtection 3463
TechSupport 0
StreamingTV 0
StreamingMovies 0
Contract 1
PaperlessBilling 0
PaymentMethod 1
MonthlyCharges 1
TotalCharges 0
Churn 0
dtype: int64
Handling missing values
Missing values in the data can cause obscure errors during modeling, so they must be removed or replaced.
To remove them, use the dropna() function.
When replacing them, no single method is always correct; it takes judgment and deliberation.
Commonly, categorical columns are filled with the mode and numeric columns with the median.
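The fill strategy just described (mode for categorical columns, median for numeric ones) can be sketched as follows; the toy DataFrame is hypothetical:

```python
import numpy as np
import pandas as pd

# Toy frame with one categorical and one numeric column (made-up data)
toy = pd.DataFrame({'grade': ['A', 'B', None, 'A'],
                    'score': [10.0, np.nan, 30.0, 50.0]})

# Categorical column: fill with the mode (most frequent value)
toy['grade'] = toy['grade'].fillna(toy['grade'].mode()[0])
# Numeric column: fill with the median
toy['score'] = toy['score'].fillna(toy['score'].median())
```

After the two fillna calls the frame contains no missing values.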
[Quiz] In df, drop the column with many missing values, then drop the remaining rows that contain nulls.
35:
df.drop('DeviceProtection', axis=1, inplace = True)
df.dropna(inplace=True)
# dropping several columns at once also works
#df.drop(['DeviceProtection', '~', '~~'], axis=1, inplace = True)
36:
df.isnull().sum()
36:
gender 0
SeniorCitizen 0
Partner 0
Dependents 0
tenure 0
PhoneService 0
MultipleLines 0
InternetService 0
OnlineSecurity 0
OnlineBackup 0
TechSupport 0
StreamingTV 0
StreamingMovies 0
Contract 0
PaperlessBilling 0
PaymentMethod 0
MonthlyCharges 0
TotalCharges 0
Churn 0
dtype: int64
39:
df2 = df.copy()
41:
df.info()
df2.reset_index(drop = True)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7027 entries, 1 to 7041
Data columns (total 19 columns):
0 gender 7027 non-null object
1 SeniorCitizen 7027 non-null float64
2 Partner 7027 non-null object
3 Dependents 7027 non-null object
4 tenure 7027 non-null int64
5 PhoneService 7027 non-null object
6 MultipleLines 7027 non-null object
7 InternetService 7027 non-null object
8 OnlineSecurity 7027 non-null object
9 OnlineBackup 7027 non-null object
10 TechSupport 7027 non-null object
11 StreamingTV 7027 non-null object
12 StreamingMovies 7027 non-null object
13 Contract 7027 non-null object
14 PaperlessBilling 7027 non-null object
15 PaymentMethod 7027 non-null object
16 MonthlyCharges 7027 non-null float64
17 TotalCharges 7027 non-null float64
18 Churn 7027 non-null int64
dtypes: float64(3), int64(2), object(14)
memory usage: 1.1+ MB
41:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 Male 0.0 No No 34 Yes No DSL Yes No No No No One year No Mailed check 56.95 1889.50 0
1 Male 0.0 No No 2 Yes No DSL Yes Yes No No No Month-to-month Yes Mailed check 53.85 108.15 1
2 Male 0.0 No No 45 No No phone service DSL Yes No Yes No No One year No Bank transfer (automatic) 42.30 1840.75 0
3 Female 0.0 No No 2 Yes No Fiber optic No No No No No Month-to-month Yes Electronic check 70.70 151.65 1
4 Female 0.0 No No 8 Yes Yes Fiber optic No No No Yes Yes Month-to-month Yes Electronic check 99.65 820.50 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7022 Female 0.0 No No 72 Yes No No No internet service No internet service No internet service No internet service No internet service Two year Yes Bank transfer (automatic) 21.15 1419.40 0
7023 Male 0.0 Yes Yes 24 Yes Yes DSL Yes No Yes Yes Yes One year Yes Mailed check 84.80 1990.50 0
7024 Female 0.0 Yes Yes 72 Yes Yes Fiber optic No Yes No Yes Yes One year Yes Credit card (automatic) 103.20 7362.90 0
7025 Female 0.0 Yes Yes 11 No No phone service DSL Yes No No No No Month-to-month Yes Electronic check 29.60 346.45 0
7026 Male 1.0 Yes No 4 Yes Yes Fiber optic No No No No No Month-to-month Yes Mailed check 74.40 306.60 1
7027 rows × 19 columns
Importing libraries
42:
import matplotlib.pyplot as plt
%matplotlib inline
Bar chart
43:
df['gender'].value_counts()
43:
Male 3550
Female 3477
Name: gender, dtype: int64
44:
df['gender'].value_counts().plot(kind='bar')
[Quiz] Compute the value distribution of the 'Partner' column in df and draw a bar chart.
45:
df['Partner'].value_counts().plot(kind = 'bar')
Let's check distribution bar charts for all the Object columns at once.
47:
df.select_dtypes('O').head(3)
47:
gender Partner Dependents PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod
1 Male No No Yes No DSL Yes No No No No One year No Mailed check
2 Male No No Yes No DSL Yes Yes No No No Month-to-month Yes Mailed check
3 Male No No No No phone service DSL Yes No Yes No No One year No Bank transfer (automatic)
[48]:
df.select_dtypes('O').columns.values
[48]:
array(['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
'InternetService', 'OnlineSecurity', 'OnlineBackup', 'TechSupport',
'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
'PaymentMethod'], dtype=object)
[49]:
object_list = df.select_dtypes('object').columns.values
for col in object_list:
df[col].value_counts().plot(kind='bar')
plt.title(col)
plt.show()
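The loop above opens one figure per column. As a sketch of an alternative layout (toy data with two assumed object columns, not the real df), plt.subplots can place all the bar charts in a single figure:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for df's object columns
toy = pd.DataFrame({'gender': ['Male', 'Female', 'Male'],
                    'Partner': ['Yes', 'No', 'No']})
cols = toy.select_dtypes('object').columns

# One figure with one subplot (axis) per column
fig, axes = plt.subplots(1, len(cols), figsize=(8, 3))
for ax, col in zip(axes, cols):
    toy[col].value_counts().plot(kind='bar', ax=ax, title=col)
plt.tight_layout()
```

This keeps all distributions visible side by side instead of in separate windows.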
Dropping the heavily imbalanced PhoneService column
[55]:
df.drop('PhoneService', axis=1, inplace=True)
Visualizing numeric columns
[50]:
df.select_dtypes( 'number').head(3)
[50]:
SeniorCitizen tenure MonthlyCharges TotalCharges Churn
1 0.0 34 56.95 1889.50 0
2 0.0 2 53.85 108.15 1
3 0.0 45 42.30 1840.75 0
Churn column
[51]:
df['Churn'].value_counts()
[51]:
0 5161
1 1866
Name: Churn, dtype: int64
52:
df['Churn'].value_counts().plot(kind='bar')
SeniorCitizen column
53:
df['SeniorCitizen'].value_counts()
53:
0.0 5885
1.0 1142
Name: SeniorCitizen, dtype: int64
54:
df['SeniorCitizen'].value_counts().plot(kind='bar')
[Quiz] Drop the heavily imbalanced 'SeniorCitizen' column.
[56]:
df.drop('SeniorCitizen', axis=1, inplace=True)
[57]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7027 entries, 1 to 7041
Data columns (total 17 columns):
0 gender 7027 non-null object
1 Partner 7027 non-null object
2 Dependents 7027 non-null object
3 tenure 7027 non-null int64
4 MultipleLines 7027 non-null object
5 InternetService 7027 non-null object
6 OnlineSecurity 7027 non-null object
7 OnlineBackup 7027 non-null object
8 TechSupport 7027 non-null object
9 StreamingTV 7027 non-null object
10 StreamingMovies 7027 non-null object
11 Contract 7027 non-null object
12 PaperlessBilling 7027 non-null object
13 PaymentMethod 7027 non-null object
14 MonthlyCharges 7027 non-null float64
15 TotalCharges 7027 non-null float64
16 Churn 7027 non-null int64
dtypes: float64(2), int64(2), object(13)
memory usage: 988.2+ KB
Histogram
60:
#!pip install seaborn
import seaborn as sns
tenure column
61:
sns.histplot(data=df, x='tenure')
63:
sns.histplot(data=df, x='tenure', hue='Churn')
64:
sns.kdeplot(data=df, x='tenure', hue='Churn')
TotalCharges column
65:
sns.histplot(data=df, x='TotalCharges')
66:
sns.kdeplot(data=df, x='TotalCharges', hue='Churn')
Countplot
67:
sns.countplot(data=df, x='MultipleLines', hue='Churn')
heatmap
68:
df[['tenure','MonthlyCharges','TotalCharges']].corr()
68:
tenure MonthlyCharges TotalCharges
tenure 1.000000 0.247630 0.826172
MonthlyCharges 0.247630 1.000000 0.651049
TotalCharges 0.826172 0.651049 1.000000
69:
sns.heatmap(df[['tenure','MonthlyCharges','TotalCharges']].corr(), annot=True)
boxplot
70:
sns.boxplot(data=df, x='Churn', y='TotalCharges')
Saving the result to a CSV file
[71]:
df.to_csv('data_v1_save.csv', index=False)
72:
gender Partner Dependents tenure MultipleLines InternetService OnlineSecurity OnlineBackup TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 Male No No 34 No DSL Yes No No No No One year No Mailed check 56.95 1889.50 0
1 Male No No 2 No DSL Yes Yes No No No Month-to-month Yes Mailed check 53.85 108.15 1
2 Male No No 45 No phone service DSL Yes No Yes No No One year No Bank transfer (automatic) 42.30 1840.75 0
3 Female No No 2 No Fiber optic No No No No No Month-to-month Yes Electronic check 70.70 151.65 1
4 Female No No 8 Yes Fiber optic No No No Yes Yes Month-to-month Yes Electronic check 99.65 820.50 1
Recap
1. Import required libraries and load the file: pd.read_csv()
2. EDA (Exploratory Data Analysis): df.info(), df.head(), df.tail()
3. Data preprocessing
• Drop unnecessary columns: df.drop()
• Change column contents: df.replace()
• Handle nulls: df.replace(), df.fillna()
• Change column types: df['col'].astype(int)
4. Visualization
• matplotlib, seaborn
• bar, scatter, countplot, boxplot
5. Save the results
• to_csv()
[Hands-on Quiz] AI Modeling with Python - Machine Learning Part
· In this session we practice the machine learning side of AI modeling with Python.
· Machine learning offers models such as the following:
· Single classification models: LogisticRegression, KNN, DecisionTree
· Ensemble models: RandomForest, XGBoost, LGBM, Stacking, Weighted Blending
· Honestly, machine learning is easier to code than deep learning, because you just follow a four-line template.
· One piece of advice: "one keystroke is worth a hundred explanations" (백문이불여일타).
· Practice deserves more of your time and effort than theory.
Contents
Machine learning model process
· Load the data
· Preprocess the data
· Split into Train and Test datasets
· Normalize the data
· Single classification models: LogisticRegression, KNN, DecisionTree
· Ensemble models: RandomForest, XGBoost, LGBM
Recall is far too low. How can we fix it?
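One common answer to this recall question is class weighting: penalizing minority-class errors more heavily. A minimal sketch on synthetic imbalanced data (make_classification merely stands in for the churn set; the 9:1 split is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic 9:1 imbalanced binary data (stand-in for Churn 0/1)
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# class_weight='balanced' reweights errors by inverse class frequency,
# which typically raises minority-class recall (at some precision cost)
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000,
                              class_weight='balanced').fit(X_tr, y_tr)

rec_plain = recall_score(y_te, plain.predict(X_te))
rec_weighted = recall_score(y_te, weighted.predict(X_te))
```

Other options include resampling (over/under-sampling) and tuning the decision threshold.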
Machine learning model process
① Import libraries
② Load the data (Loading the data)
③ Exploratory Data Analysis
④ Data preprocessing (Data PreProcessing): type conversion, null handling, missing-data handling, dummy features, feature engineering, etc.
⑤ Split into Train and Test datasets
⑥ Normalize the data (Normalizing the Data)
⑦ Create the model (Creating the Model)
⑧ Evaluate model performance
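Steps ⑦ and ⑧ usually reduce to the "four-line template" mentioned in the intro: create, fit, predict, score. A sketch with a DecisionTree on the iris toy dataset (illustrative only, not the churn data):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # toy data for illustration

model = DecisionTreeClassifier(random_state=42)  # 1. create the model
model.fit(X, y)                                  # 2. train it
pred = model.predict(X)                          # 3. predict
acc = model.score(X, y)                          # 4. evaluate
```

The same four lines apply to any scikit-learn classifier; only the constructor changes.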
① Importing libraries
Import the required libraries
[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
② Loading the data
Read the data_v1_save.csv file
[2]:
df = pd.read_csv('(라이브교육)data_v1_save.csv',sep = ",")
③ Data analysis
[3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7027 entries, 0 to 7026
Data columns (total 17 columns):
0 gender 7027 non-null object
1 Partner 7027 non-null object
2 Dependents 7027 non-null object
3 tenure 7027 non-null int64
4 MultipleLines 7027 non-null object
5 InternetService 7027 non-null object
6 OnlineSecurity 7027 non-null object
7 OnlineBackup 7027 non-null object
8 TechSupport 7027 non-null object
9 StreamingTV 7027 non-null object
10 StreamingMovies 7027 non-null object
11 Contract 7027 non-null object
12 PaperlessBilling 7027 non-null object
13 PaymentMethod 7027 non-null object
14 MonthlyCharges 7027 non-null float64
15 TotalCharges 7027 non-null float64
16 Churn 7027 non-null int64
dtypes: float64(2), int64(2), object(13)
memory usage: 933.4+ KB
[4]:
df.tail()
[4]:
gender Partner Dependents tenure MultipleLines InternetService OnlineSecurity OnlineBackup TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
7022 Female No No 72 No No No internet service No internet service No internet service No internet service No internet service Two year Yes Bank transfer (automatic) 21.15 1419.40 0
7023 Male Yes Yes 24 Yes DSL Yes No Yes Yes Yes One year Yes Mailed check 84.80 1990.50 0
7024 Female Yes Yes 72 Yes Fiber optic No Yes No Yes Yes One year Yes Credit card (automatic) 103.20 7362.90 0
7025 Female Yes Yes 11 No phone service DSL Yes No No No No Month-to-month Yes Electronic check 29.60 346.45 0
7026 Male Yes No 4 Yes Fiber optic No No No No No Month-to-month Yes Mailed check 74.40 306.60 1
5 rows × 17 columns
5:
df['Churn'].value_counts()
5:
0 5161
1 1866
Name: Churn, dtype: int64
7:
df['Churn'].value_counts()[:].plot(kind='bar')
④ Data preprocessing
All values must be numeric; that is, every Object-typed column needs to be converted to numbers.
Instead of the replace calls used in the preprocessing session, encode with Label Encoding and One-Hot Encoding functions.
One-Hot-Encode the Object columns with the pandas get_dummies function.
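The two encodings just mentioned differ in shape: Label Encoding yields a single integer column, while One-Hot Encoding yields one 0/1 column per category. A sketch on a toy Contract column (the values are hypothetical, mirroring this dataset's categories):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({'Contract': ['Month-to-month', 'One year',
                                 'Two year', 'One year']})

# Label encoding: one integer per category, assigned in sorted order.
# The integers look ordinal, so use with care for nominal data.
le = LabelEncoder()
labels = le.fit_transform(toy['Contract'])

# One-hot encoding: one 0/1 column per category (what this notebook uses)
onehot = pd.get_dummies(toy, columns=['Contract'])
```

One-hot avoids implying an order between categories, at the cost of more columns.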
8:
df[['MultipleLines']].head()
MultipleLines
0 No
1 No
2 No phone service
3 No
4 Yes
[9]:
df['MultipleLines'].value_counts()
[9]:
No 3380
Yes 2966
No phone service 681
Name: MultipleLines, dtype: int64
10:
pd.get_dummies(data=df, columns=['MultipleLines'])
gender Partner Dependents tenure InternetService OnlineSecurity OnlineBackup TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn MultipleLines_No MultipleLines_No phone service MultipleLines_Yes
0 Male No No 34 DSL Yes No No No No One year No Mailed check 56.95 1889.50 0 1 0 0
1 Male No No 2 DSL Yes Yes No No No Month-to-month Yes Mailed check 53.85 108.15 1 1 0 0
2 Male No No 45 DSL Yes No Yes No No One year No Bank transfer (automatic) 42.30 1840.75 0 0 1 0
3 Female No No 2 Fiber optic No No No No No Month-to-month Yes Electronic check 70.70 151.65 1 1 0 0
4 Female No No 8 Fiber optic No No No Yes Yes Month-to-month Yes Electronic check 99.65 820.50 1 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7022 Female No No 72 No No internet service No internet service No internet service No internet service No internet service Two year Yes Bank transfer (automatic) 21.15 1419.40 0 1 0 0
7023 Male Yes Yes 24 DSL Yes No Yes Yes Yes One year Yes Mailed check 84.80 1990.50 0 0 0 1
7024 Female Yes Yes 72 Fiber optic No Yes No Yes Yes One year Yes Credit card (automatic) 103.20 7362.90 0 0 0 1
7025 Female Yes Yes 11 DSL Yes No No No No Month-to-month Yes Electronic check 29.60 346.45 0 0 1 0
7026 Male Yes No 4 Fiber optic No No No No No Month-to-month Yes Mailed check 74.40 306.60 1 0 0 1
7027 rows × 19 columns
11:
df.select_dtypes('object').head(3)
gender Partner Dependents MultipleLines InternetService OnlineSecurity OnlineBackup TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod
0 Male No No No DSL Yes No No No No One year No Mailed check
1 Male No No No DSL Yes Yes No No No Month-to-month Yes Mailed check
2 Male No No No phone service DSL Yes No Yes No No One year No Bank transfer (automatic)
14:
cal_cols = df.select_dtypes('object').columns.values
cal_cols
14:
array(['gender', 'Partner', 'Dependents', 'MultipleLines',
'InternetService', 'OnlineSecurity', 'OnlineBackup', 'TechSupport',
'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
'PaymentMethod'], dtype=object)
[Quiz] One-Hot-Encode the Object columns and store the result in the variable df1.
15:
df1 = pd.get_dummies(data = df, columns=cal_cols)
[16]:
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7027 entries, 0 to 7026
Data columns (total 40 columns):
0 tenure 7027 non-null int64
1 MonthlyCharges 7027 non-null float64
2 TotalCharges 7027 non-null float64
3 Churn 7027 non-null int64
4 gender_Female 7027 non-null uint8
5 gender_Male 7027 non-null uint8
6 Partner_No 7027 non-null uint8
7 Partner_Yes 7027 non-null uint8
8 Dependents_No 7027 non-null uint8
9 Dependents_Yes 7027 non-null uint8
10 MultipleLines_No 7027 non-null uint8
11 MultipleLines_No phone service 7027 non-null uint8
12 MultipleLines_Yes 7027 non-null uint8
13 InternetService_DSL 7027 non-null uint8
14 InternetService_Fiber optic 7027 non-null uint8
15 InternetService_No 7027 non-null uint8
16 OnlineSecurity_No 7027 non-null uint8
17 OnlineSecurity_No internet service 7027 non-null uint8
18 OnlineSecurity_Yes 7027 non-null uint8
19 OnlineBackup_No 7027 non-null uint8
20 OnlineBackup_No internet service 7027 non-null uint8
21 OnlineBackup_Yes 7027 non-null uint8
22 TechSupport_No 7027 non-null uint8
23 TechSupport_No internet service 7027 non-null uint8
24 TechSupport_Yes 7027 non-null uint8
25 StreamingTV_No 7027 non-null uint8
26 StreamingTV_No internet service 7027 non-null uint8
27 StreamingTV_Yes 7027 non-null uint8
28 StreamingMovies_No 7027 non-null uint8
29 StreamingMovies_No internet service 7027 non-null uint8
30 StreamingMovies_Yes 7027 non-null uint8
31 Contract_Month-to-month 7027 non-null uint8
32 Contract_One year 7027 non-null uint8
33 Contract_Two year 7027 non-null uint8
34 PaperlessBilling_No 7027 non-null uint8
35 PaperlessBilling_Yes 7027 non-null uint8
36 PaymentMethod_Bank transfer (automatic) 7027 non-null uint8
37 PaymentMethod_Credit card (automatic) 7027 non-null uint8
38 PaymentMethod_Electronic check 7027 non-null uint8
39 PaymentMethod_Mailed check 7027 non-null uint8
dtypes: float64(2), int64(2), uint8(36)
memory usage: 466.8 KB
[17]:
df1.head(3)
[17]:
tenure MonthlyCharges TotalCharges Churn gender_Female gender_Male Partner_No Partner_Yes Dependents_No Dependents_Yes ... StreamingMovies_Yes Contract_Month-to-month Contract_One year Contract_Two year PaperlessBilling_No PaperlessBilling_Yes PaymentMethod_Bank transfer (automatic) PaymentMethod_Credit card (automatic) PaymentMethod_Electronic check PaymentMethod_Mailed check
0 34 56.95 1889.50 0 0 1 1 0 1 0 ... 0 0 1 0 1 0 0 0 0 1
1 2 53.85 108.15 1 0 1 1 0 1 0 ... 0 1 0 0 0 1 0 0 0 1
2 45 42.30 1840.75 0 0 1 1 0 1 0 ... 0 0 1 0 1 0 1 0 0 0
3 rows × 40 columns
⑤ Splitting into Train and Test datasets
Separating inputs (X) and labels (y)
[Quiz] Store everything except the 'Churn' column of the df1 DataFrame in X.
[18]:
X = df1.drop('Churn', axis=1).values
[Quiz] Store the 'Churn' column of the df1 DataFrame in y.
[20]:
y = df1['Churn'].values
21:
X.shape, y.shape
21:
((7027, 39), (7027,))
Splitting into Train and Test datasets
22:
from sklearn.model_selection import train_test_split
[Quiz] Split the data into Train and Test datasets.
[23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, stratify = y, random_state = 42)
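The stratify=y argument in the split above keeps the Churn class ratio identical in the train and test sets. A sketch with toy 80/20 labels (made-up data) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 80 zeros and 20 ones (4:1 imbalance)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy allocates positives proportionally to both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42)
```

With a 30% test split, exactly 6 of the 20 positives land in the test set, preserving the 4:1 ratio on both sides.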
[24]:
X_train.shape
[24]:
(4918, 39)
⑥ Data normalization/scaling (Normalizing/Scaling)
26:
df1.tail()
tenure MonthlyCharges TotalCharges Churn gender_Female gender_Male Partner_No Partner_Yes Dependents_No Dependents_Yes ... StreamingMovies_Yes Contract_Month-to-month Contract_One year Contract_Two year PaperlessBilling_No PaperlessBilling_Yes PaymentMethod_Bank transfer (automatic) PaymentMethod_Credit card (automatic) PaymentMethod_Electronic check PaymentMethod_Mailed check
7022 72 21.15 1419.40 0 1 0 1 0 1 0 ... 0 0 0 1 0 1 1 0 0 0
7023 24 84.80 1990.50 0 0 1 0 1 0 1 ... 1 0 1 0 0 1 0 0 0 1
7024 72 103.20 7362.90 0 1 0 0 1 0 1 ... 1 0 1 0 0 1 0 1 0 0
7025 11 29.60 346.45 0 1 0 0 1 0 1 ... 0 1 0 0 0 1 0 0 1 0
7026 4 74.40 306.60 1 0 1 0 1 1 0 ... 0 1 0 0 0 1 0 0 0 1
5 rows × 40 columns
27:
from sklearn.preprocessing import MinMaxScaler
[문제] MinMaxScaler를 'scaler'로 정의하세요.
28:
scaler = MinMaxScaler()
[29]:
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
30:
X_train[:2], y_train[:2]
30:
(array([[0.65277778, 0.56851021, 0.40877722, 1. , 0. ,
1. , 0. , 1. , 0. , 1. ,
0. , 0. , 0. , 1. , 0. ,
0. , 0. , 1. , 1. , 0. ,
0. , 1. , 0. , 0. , 1. ,
0. , 0. , 1. , 0. , 0. ,
1. , 0. , 0. , 1. , 0. ,
0. , 1. , 0. , 0. ],
[0.27777778, 0.00498256, 0.04008671, 1. , 0. ,
1. , 0. , 1. , 0. , 1. ,
0. , 0. , 0. , 0. , 1. ,
0. , 1. , 0. , 0. , 1. ,
0. , 0. , 1. , 0. , 0. ,
1. , 0. , 0. , 1. , 0. ,
1. , 0. , 0. , 0. , 1. ,
0. , 1. , 0. , 0. ]]),
array([0, 0]))
(참고) 스케일링 이후 X_train은 numpy 배열이므로 DataFrame 메서드인 tail()을 호출하면 다음과 같은 오류가 발생한다.
AttributeError Traceback (most recent call last)
----> 1 X_train.tail()
AttributeError: 'numpy.ndarray' object has no attribute 'tail'
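스케일링 결과를 DataFrame 형태로 다시 확인하고 싶다면 배열을 감싸 주면 됩니다. 아래는 설명용 가상 데이터와 가상의 컬럼명(feat_a, feat_b)을 쓴 스케치입니다.

```python
# 스케일링 결과(numpy 배열)를 DataFrame으로 감싸면 tail() 등을 쓸 수 있다
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # 가상 데이터
scaled = MinMaxScaler().fit_transform(X_demo)               # 0~1로 변환

df_scaled = pd.DataFrame(scaled, columns=['feat_a', 'feat_b'])
print(df_scaled.tail(2))   # DataFrame이므로 tail() 사용 가능
```

실제 노트북에서는 columns에 df1.drop('Churn', axis=1).columns를 넘기면 원래 컬럼명을 그대로 쓸 수 있습니다.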
⑦ 모델 개발
(참고) 모델별 바차트 그려주고 성능 확인을 위한 함수
44:
import matplotlib.pyplot as plt
from sklearn.metrics import recall_score

my_predictions = {}

colors = ['r', 'c', 'm', 'y', 'k', 'khaki', 'teal', 'orchid', 'sandybrown',
          'greenyellow', 'dodgerblue', 'deepskyblue', 'rosybrown', 'firebrick',
          'deeppink', 'crimson', 'salmon', 'darkred', 'olivedrab', 'olive',
          'forestgreen', 'royalblue', 'indigo', 'navy', 'mediumpurple', 'chocolate',
          'gold', 'darkorange', 'seagreen', 'turquoise', 'steelblue', 'slategray',
          'peru', 'midnightblue', 'slateblue', 'dimgray', 'cadetblue', 'tomato']

def recall_eval(name, pred, actual):
    global my_predictions
    global colors

    # 재현율(recall)을 계산해 모델 이름과 함께 기록한다
    acc = recall_score(actual, pred)
    my_predictions[name] = acc * 100

    # 재현율 내림차순으로 정렬해 표로 출력한다
    y_value = sorted(my_predictions.items(), key=lambda x: x[1], reverse=True)
    df = pd.DataFrame(y_value, columns=['model', 'recall'])
    print(df)

    # 모델별 재현율을 가로 막대 그래프로 그린다
    length = len(df)
    plt.figure(figsize=(10, length))
    ax = plt.subplot()
    ax.set_yticks(np.arange(len(df)))
    ax.set_yticklabels(df['model'], fontsize=15)
    bars = ax.barh(np.arange(len(df)), df['recall'])

    for i, v in enumerate(df['recall']):
        idx = np.random.choice(len(colors))
        bars[i].set_color(colors[idx])
        ax.text(v + 2, i, str(round(v, 3)), color='k', fontsize=15, fontweight='bold')

    plt.title('recall', fontsize=18)
    plt.xlim(0, 100)
    plt.show()
1) 로지스틱 회귀 (LogisticRegression, 분류)
[32]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report
[문제] LogisticRegression 모델 정의하고 학습시키세요.
33:
lg = LogisticRegression()
lg.fit(X_train, y_train)
34:
lg.score(X_test, y_test)
· 분류기 성능 평가 지표
35:
lg_pred = lg.predict(X_test)
36:
lg_pred
array([0, 0, 0, ..., 1, 1, 0])
[37]:
confusion_matrix(y_test, lg_pred)
[37]:
array([[1386, 163],
[ 246, 314]])
38:
accuracy_score(y_test, lg_pred)
39:
precision_score(y_test, lg_pred)
40:
recall_score(y_test, lg_pred)
41:
f1_score(y_test, lg_pred)
42:
print(classification_report(y_test, lg_pred))
precision recall f1-score support
0 0.85 0.89 0.87 1549
1 0.66 0.56 0.61 560
accuracy 0.81 2109
macro avg 0.75 0.73 0.74 2109
weighted avg 0.80 0.81 0.80 2109
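위 confusion_matrix 결과 [[1386, 163], [246, 314]]에서 각 평가 지표가 어떻게 계산되는지 직접 확인해 보는 예시입니다.

```python
# confusion_matrix의 네 값으로 accuracy/precision/recall/f1을 직접 계산
tn, fp, fn, tp = 1386, 163, 246, 314

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # 전체 중 맞춘 비율
precision = tp / (tp + fp)                    # 1로 예측한 것 중 실제 1의 비율
recall    = tp / (tp + fn)                    # 실제 1(이탈) 중 1로 맞춘 비율
f1        = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
# 0.81 0.66 0.56 0.61 — classification_report의 1(이탈) 클래스 행과 같은 값
```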
45:
recall_eval('LogisticRegression', lg_pred, y_test)
model recall
0 LogisticRegression 56.071429
2) KNN (K-Nearest Neighbor)
[46]:
from sklearn.neighbors import KNeighborsClassifier
47:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
[48]:
knn_pred = knn.predict(X_test)
[49]:
recall_eval('K-Nearest Neighbor', knn_pred, y_test)
model recall
0 LogisticRegression 56.071429
1 K-Nearest Neighbor 52.142857
3) 결정트리(DecisionTree)
[50]:
from sklearn.tree import DecisionTreeClassifier
[51]:
dt = DecisionTreeClassifier(max_depth=10, random_state=42)
dt.fit(X_train, y_train)
[51]:
DecisionTreeClassifier(max_depth=10, random_state=42)
[문제] 학습된 DecisionTreeClassifier 모델로 예측해 보기
52:
dt_pred = dt.predict(X_test)
53:
recall_eval('DecisionTree', dt_pred, y_test)
model recall
0 LogisticRegression 56.071429
1 DecisionTree 55.714286
2 K-Nearest Neighbor 52.142857
앙상블 기법의 종류
· 배깅 (Bagging): 여러 개의 DecisionTree를 활용하고 샘플 중복 생성을 통해 결과를 도출하는 방식. 예: RandomForest
· 부스팅 (Boosting): 약한 학습기를 순차적으로 학습하되, 이전 학습에서 잘못 예측된 데이터에 가중치를 부여해 오차를 보완해 나가는 방식. 예: XGBoost, LGBM
앙상블
4) 랜덤포레스트(RandomForest)
· Bagging의 대표적인 모델로서, 훈련 세트를 무작위로 나눈 각기 다른 서브셋으로 데이터셋을 만들고
· 여러 개의 DecisionTree로 학습한 뒤 다수결로 결정하는 모델
주요 Hyperparameter
· random_state: 랜덤 시드 고정 값. 고정해두고 튜닝할 것!
· n_jobs: CPU 사용 갯수
· max_depth: 깊어질 수 있는 최대 깊이. 과대적합 방지용
· n_estimators: 앙상블하는 트리의 갯수
· max_features: 최대로 사용할 feature의 갯수. 과대적합 방지용
· min_samples_split: 트리가 분할할 때 최소 샘플의 갯수. default=2. 과대적합 방지용
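위 하이퍼파라미터들을 실제로 지정해 보는 간단한 스케치입니다. 아래 값들과 가상 데이터(make_classification)는 설명용 예시이며, 실제로는 데이터에 맞게 튜닝이 필요합니다.

```python
# RandomForest 주요 하이퍼파라미터를 지정해 보는 예시 (값은 설명용)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

rfc_demo = RandomForestClassifier(
    n_estimators=100,      # 앙상블하는 트리의 개수
    max_depth=5,           # 과대적합 방지: 트리 최대 깊이 제한
    max_features='sqrt',   # 분할 시 고려할 feature 수 제한
    min_samples_split=4,   # 분할에 필요한 최소 샘플 수
    n_jobs=-1,             # 모든 CPU 코어 사용
    random_state=42)       # 재현성을 위한 시드 고정
rfc_demo.fit(X_demo, y_demo)
print(rfc_demo.score(X_demo, y_demo))
```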
54:
from sklearn.ensemble import RandomForestClassifier
[55]:
rfc = RandomForestClassifier(n_estimators=3, random_state=42)
rfc.fit(X_train, y_train)
[55]:
RandomForestClassifier(n_estimators=3, random_state=42)
[56]:
rfc_pred = rfc.predict(X_test)
[57]:
recall_eval('RandomForest Ensemble', rfc_pred, y_test)
model recall
0 LogisticRegression 56.071429
1 DecisionTree 55.714286
2 K-Nearest Neighbor 52.142857
3 RandomForest Ensemble 52.142857
5) XGBoost
· 여러개의 DecisionTree를 결합하여 Strong Learner 만드는 Boosting 앙상블 기법
· Kaggle 대회에서 자주 사용하는 모델이다.
5-1) Boosting 기본 개념
5-2) Boosting 상세
주요 특징
· scikit-learn 패키지가 아닙니다.
· 성능이 우수함
· GBM보다는 빠르고 성능도 향상되었습니다.
· 다만 학습시간이 여전히 느린 편입니다
주요 Hyperparameter
· random_state: 랜덤 시드 고정 값. 고정해두고 튜닝할 것!
· n_jobs: CPU 사용 갯수
· learning_rate: 학습률. 너무 큰 학습률은 성능을 떨어뜨리고, 너무 작은 학습률은 학습을 느리게 한다. 적절한 값을 찾아야 하며 n_estimators와 같이 튜닝. default=0.1
· n_estimators: 부스팅 스테이지 수. (랜덤포레스트 트리의 갯수 설정과 비슷한 개념). default=100
· max_depth: 트리의 깊이. 과대적합 방지용. default=3.
· subsample: 샘플 사용 비율. 과대적합 방지용. default=1.0
· colsample_bytree: 트리별로 사용할 feature의 비율 (scikit-learn의 max_features와 비슷한 개념). 과대적합 방지용. default=1.0
[58]:
!pip install xgboost
Requirement already satisfied: xgboost in /usr/local/lib/python3.6/dist-packages (0.90)
59:
from xgboost import XGBClassifier
60:
xgb = XGBClassifier(n_estimators=3, random_state=42)
xgb.fit(X_train, y_train)
60:
XGBClassifier(n_estimators=3, random_state=42)
61:
xgb_pred = xgb.predict(X_test)
[62]:
recall_eval('XGBoost', xgb_pred, y_test)
model recall
0 LogisticRegression 56.071429
1 DecisionTree 55.714286
2 K-Nearest Neighbor 52.142857
3 RandomForest Ensemble 52.142857
4 XGBoost 48.214286
6) Light GBM
· XGBoost와 함께 주목받는 DecisionTree 알고리즘 기반의 Boosting 앙상블 기법
· XGBoost에 비해 학습시간이 짧은 편이다.
주요 특징
· scikit-learn 패키지가 아닙니다.
· 성능이 우수함
· 속도도 매우 빠릅니다.
주요 Hyperparameter
· random_state: 랜덤 시드 고정 값. 고정해두고 튜닝할 것!
· n_jobs: CPU 사용 갯수
· learning_rate: 학습률. 너무 큰 학습률은 성능을 떨어뜨리고, 너무 작은 학습률은 학습을 느리게 한다. 적절한 값을 찾아야 하며 n_estimators와 같이 튜닝. default=0.1
· n_estimators: 부스팅 스테이지 수. (랜덤포레스트 트리의 갯수 설정과 비슷한 개념). default=100
· max_depth: 트리의 깊이. 과대적합 방지용. default=-1 (제한 없음)
· colsample_bytree: 샘플 사용 비율 (max_features와 비슷한 개념). 과대적합 방지용. default=1.0
63:
!pip install lightgbm
Requirement already satisfied: lightgbm in /usr/local/lib/python3.6/dist-packages (2.3.0)
64:
from lightgbm import LGBMClassifier
65:
lgbm = LGBMClassifier(n_estimators=3, random_state=42)
lgbm.fit(X_train, y_train)
65:
LGBMClassifier(n_estimators=3, random_state=42)
66:
lgbm_pred = lgbm.predict(X_test)
67:
recall_eval('LGBM', lgbm_pred, y_test)
model recall
0 LogisticRegression 56.071429
1 DecisionTree 55.714286
2 K-Nearest Neighbor 52.142857
3 RandomForest Ensemble 52.142857
4 XGBoost 48.214286
5 LGBM 0.000000
68:
lgbm.score(X_test, y_test)
69:
recall_score(y_test, lgbm_pred)
배운 내용 정리
머신러닝 모델 프로세스
① 라이브러리 임포트(import)
② 데이터 가져오기(Loading the data)
③ 탐색적 데이터 분석(Exploratory Data Analysis)
④ 데이터 전처리(Data PreProcessing) : 데이터타입 변환, Null 데이터 처리, 누락데이터 처리, 더미특성 생성, 특성 추출 (feature engineering) 등
⑤ Train, Test 데이터셋 분할
⑥ 데이터 정규화(Normalizing the Data)
⑦ 모델 개발(Creating the Model)
⑧ 모델 성능 평가
평가 지표 활용 : 모델별 성능 확인을 위한 함수 (가져다 쓰면 된다)
단일 분류 모델 : LogisticRegression, KNN, DecisionTree
앙상블 (Ensemble) : RandomForest, XGBoost, LGBM
재현율 성능이 너무 안 나온다. 어떻게 해결할 수 있을까?
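재현율을 올리는 한 가지 간단한 방법의 스케치입니다: 클래스 불균형을 보정하는 class_weight='balanced' 옵션. 아래는 가상의 불균형 데이터로 만든 예시이며, 본문 데이터에 그대로 적용했을 때의 수치를 보장하지는 않습니다.

```python
# 불균형 데이터에서 class_weight='balanced'가 소수 클래스 재현율에 주는 효과
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# 0:1 = 8:2 비율의 가상 불균형 데이터
X_demo, y_demo = make_classification(n_samples=2000, weights=[0.8, 0.2],
                                     random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo,
                                          random_state=42)

base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X_tr, y_tr)

print('기본 모델 recall    :', recall_score(y_te, base.predict(X_te)))
print('balanced 모델 recall:', recall_score(y_te, balanced.predict(X_te)))
```

소수 클래스에 더 큰 가중치를 주면 결정 경계가 소수 클래스 쪽으로 이동해 재현율이 올라가는 대신 정밀도가 떨어지는 트레이드오프가 있습니다. 이어지는 딥러닝 파트의 SMTE 계열 OverSampling도 같은 문제를 겨냥한 방법입니다.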
[실습-퀴즈] Python을 활용한 AI 모델링 - 딥러닝 파트
· 이번시간에는 Python을 활용한 AI 모델링에서 딥러닝에 대해 실습해 보겠습니다.
· 여기서는 딥러닝 모델 DNN에 대해 코딩하여 모델 구축해 보겠습니다.
· 한가지 당부 드리고 싶은 말은 "백문이불여일타" 입니다.
· 이론보다 실습이 더 많은 시간과 노력이 투자 되어야 합니다.
학습목차
딥러닝 심층신경망(DNN) 모델 프로세스
· 데이터 가져오기
· 데이터 전처리
· Train, Test 데이터셋 분할
· 데이터 정규화
· DNN 딥러닝 모델
재현율 성능이 좋지 않다. 어떻게 성능을 향상할 수 있을까?
딥러닝 심층신경망(DNN) 모델 프로세스
① 라이브러리 임포트(import)
② 데이터 가져오기(Loading the data)
③ 탐색적 데이터 분석(Exploratory Data Analysis)
④ 데이터 전처리(Data PreProcessing) : 데이터타입 변환, Null 데이터 처리, 누락데이터 처리, 더미특성 생성, 특성 추출 (feature engineering) 등
⑤ Train, Test 데이터셋 분할
⑥ 데이터 정규화(Normalizing the Data)
⑦ 모델 개발(Creating the Model)
⑧ 모델 성능 평가
① 라이브러리 임포트
필요 라이브러리 임포트
[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
② 데이터 로드
[문제] 같은 폴더내에 있는 data_v1_save.csv 파일을 Pandas read_csv 함수를 이용하여 읽어 df 변수에 저장하세요.
[2]:
df = pd.read_csv('data_v1_save.csv')
③ 데이터 분석
[3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7027 entries, 0 to 7026
Data columns (total 17 columns):
0 gender 7027 non-null object
1 Partner 7027 non-null object
2 Dependents 7027 non-null object
3 tenure 7027 non-null int64
4 MultipleLines 7027 non-null object
5 InternetService 7027 non-null object
6 OnlineSecurity 7027 non-null object
7 OnlineBackup 7027 non-null object
8 TechSupport 7027 non-null object
9 StreamingTV 7027 non-null object
10 StreamingMovies 7027 non-null object
11 Contract 7027 non-null object
12 PaperlessBilling 7027 non-null object
13 PaymentMethod 7027 non-null object
14 MonthlyCharges 7027 non-null float64
15 TotalCharges 7027 non-null float64
16 Churn 7027 non-null int64
dtypes: float64(2), int64(2), object(13)
memory usage: 933.4+ KB
df.tail() 출력 (5 rows × 17 columns):

인덱스 | gender | Partner | Dependents | tenure | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn
7022 | Female | No | No | 72 | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | Yes | Bank transfer (automatic) | 21.15 | 1419.40 | 0
7023 | Male | Yes | Yes | 24 | Yes | DSL | Yes | No | Yes | Yes | Yes | One year | Yes | Mailed check | 84.80 | 1990.50 | 0
7024 | Female | Yes | Yes | 72 | Yes | Fiber optic | No | Yes | No | Yes | Yes | One year | Yes | Credit card (automatic) | 103.20 | 7362.90 | 0
7025 | Female | Yes | Yes | 11 | No phone service | DSL | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.60 | 346.45 | 0
7026 | Male | Yes | No | 4 | Yes | Fiber optic | No | No | No | No | No | Month-to-month | Yes | Mailed check | 74.40 | 306.60 | 1
5:
df['Churn'].value_counts().plot(kind='bar')
④ 데이터 전처리
· 모든 데이터 값은 숫자형이어야 한다. 즉, Object 타입을 모두 숫자형으로 변경 필요
· Object 컬럼에 대해 Pandas get_dummies 함수 활용하여 One-Hot-Encoding
6:
cal_cols = df.select_dtypes('object').columns.values
cal_cols
6:
array(['gender', 'Partner', 'Dependents', 'MultipleLines',
'InternetService', 'OnlineSecurity', 'OnlineBackup', 'TechSupport',
'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
'PaymentMethod'], dtype=object)
[문제] Object 컬럼에 대해 One-Hot-Encoding 수행하고 그 결과를 df1 변수에 저장하세요.
7:
df1 = pd.get_dummies(data = df, columns = cal_cols)
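get_dummies가 Object 컬럼을 어떻게 One-Hot-Encoding 하는지 작은 가상 예시로 확인해 봅니다. (아래 demo 데이터는 설명용입니다.)

```python
# 문자열 컬럼 하나가 카테고리별 0/1 컬럼들로 펼쳐지는 과정
import pandas as pd

demo = pd.DataFrame({'gender': ['Male', 'Female', 'Male']})
encoded = pd.get_dummies(data=demo, columns=['gender'])
print(encoded.columns.tolist())   # ['gender_Female', 'gender_Male']
```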
8:
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7027 entries, 0 to 7026
Data columns (total 40 columns):
0 tenure 7027 non-null int64
1 MonthlyCharges 7027 non-null float64
2 TotalCharges 7027 non-null float64
3 Churn 7027 non-null int64
4 gender_Female 7027 non-null uint8
5 gender_Male 7027 non-null uint8
6 Partner_No 7027 non-null uint8
7 Partner_Yes 7027 non-null uint8
8 Dependents_No 7027 non-null uint8
9 Dependents_Yes 7027 non-null uint8
10 MultipleLines_No 7027 non-null uint8
11 MultipleLines_No phone service 7027 non-null uint8
12 MultipleLines_Yes 7027 non-null uint8
13 InternetService_DSL 7027 non-null uint8
14 InternetService_Fiber optic 7027 non-null uint8
15 InternetService_No 7027 non-null uint8
16 OnlineSecurity_No 7027 non-null uint8
17 OnlineSecurity_No internet service 7027 non-null uint8
18 OnlineSecurity_Yes 7027 non-null uint8
19 OnlineBackup_No 7027 non-null uint8
20 OnlineBackup_No internet service 7027 non-null uint8
21 OnlineBackup_Yes 7027 non-null uint8
22 TechSupport_No 7027 non-null uint8
23 TechSupport_No internet service 7027 non-null uint8
24 TechSupport_Yes 7027 non-null uint8
25 StreamingTV_No 7027 non-null uint8
26 StreamingTV_No internet service 7027 non-null uint8
27 StreamingTV_Yes 7027 non-null uint8
28 StreamingMovies_No 7027 non-null uint8
29 StreamingMovies_No internet service 7027 non-null uint8
30 StreamingMovies_Yes 7027 non-null uint8
31 Contract_Month-to-month 7027 non-null uint8
32 Contract_One year 7027 non-null uint8
33 Contract_Two year 7027 non-null uint8
34 PaperlessBilling_No 7027 non-null uint8
35 PaperlessBilling_Yes 7027 non-null uint8
36 PaymentMethod_Bank transfer (automatic) 7027 non-null uint8
37 PaymentMethod_Credit card (automatic) 7027 non-null uint8
38 PaymentMethod_Electronic check 7027 non-null uint8
39 PaymentMethod_Mailed check 7027 non-null uint8
dtypes: float64(2), int64(2), uint8(36)
memory usage: 466.8 KB
⑤ Train, Test 데이터셋 분할
10:
from sklearn.model_selection import train_test_split
11:
X = df1.drop('Churn', axis=1).values
y = df1['Churn'].values
[12]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.3,
stratify=y,
random_state=42)
13:
X_train.shape
(4918, 39)
⑥ 데이터 정규화/스케일링(Normalizing/Scaling)
15:
df1.tail()
5 rows × 40 columns (중간 컬럼 생략):

인덱스 | tenure | MonthlyCharges | TotalCharges | Churn | gender_Female | gender_Male | Partner_No | Partner_Yes | Dependents_No | Dependents_Yes | ... | StreamingMovies_Yes | Contract_Month-to-month | Contract_One year | Contract_Two year | PaperlessBilling_No | PaperlessBilling_Yes | PaymentMethod_Bank transfer (automatic) | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check
7022 | 72 | 21.15 | 1419.40 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0
7023 | 24 | 84.80 | 1990.50 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1
7024 | 72 | 103.20 | 7362.90 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0
7025 | 11 | 29.60 | 346.45 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0
7026 | 4 | 74.40 | 306.60 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1
[16]:
from sklearn.preprocessing import MinMaxScaler
17:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
[18]:
X_train[:2]
[18]:
array([[0.65277778, 0.56851021, 0.40877722, 1. , 0. ,
1. , 0. , 1. , 0. , 1. ,
0. , 0. , 0. , 1. , 0. ,
0. , 0. , 1. , 1. , 0. ,
0. , 1. , 0. , 0. , 1. ,
0. , 0. , 1. , 0. , 0. ,
1. , 0. , 0. , 1. , 0. ,
0. , 1. , 0. , 0. ],
[0.27777778, 0.00498256, 0.04008671, 1. , 0. ,
1. , 0. , 1. , 0. , 1. ,
0. , 0. , 0. , 0. , 1. ,
0. , 1. , 0. , 0. , 1. ,
0. , 0. , 1. , 0. , 0. ,
1. , 0. , 0. , 1. , 0. ,
1. , 0. , 0. , 0. , 1. ,
0. , 1. , 0. , 0. ]])
⑦ 딥러닝 심층신경망(DNN) 모델 구현
라이브러리 임포트
[19]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
tf.random.set_seed(100)
하이퍼파라미터 설정 : batch_size, epochs
[20]:
batch_size = 16
epochs = 20
모델 입력(features) 갯수 확인
21:
X_train.shape
(4918, 39)
모델 출력(label) 갯수 확인
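노트북에는 이 단계의 코드가 빠져 있어, 라벨의 종류와 개수를 확인하는 방법을 가상 배열로 보충한 예시입니다. 실제로는 y_demo 대신 본문의 y(Churn 레이블 배열)를 넣으면 됩니다.

```python
# 출력 라벨의 클래스 종류와 클래스별 개수 확인
import numpy as np

y_demo = np.array([0, 1, 0, 0, 1, 0])   # Churn 레이블 형태의 가상 예시
classes, counts = np.unique(y_demo, return_counts=True)
print(classes, counts)   # 클래스 종류와 클래스별 개수
```

이진분류이므로 클래스가 2개(0, 1)임을 확인한 뒤 출력층 구성을 결정합니다.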
A. 이진분류 DNN모델 구성
hidden Layer
· [출처] https://subscription.packtpub.com/book/data/9781788995207/1/ch01lvl1sec03/deep-learning-intuition
[문제] 요구사항대로 Sequential 모델을 만들어 보세요.
[23]:
model = Sequential()
model.add(Dense(4, activation = 'relu', input_shape = (39,)))
model.add(Dense(3, activation = 'relu'))
model.add(Dense(1, activation = 'sigmoid'))
모델 확인
Model: "sequential"
dense (Dense) (None, 4) 160
dense_1 (Dense) (None, 3) 15
dense_2 (Dense) (None, 1) 4
Total params: 179
Trainable params: 179
Non-trainable params: 0
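위 모델 요약의 파라미터 수가 어떻게 계산되는지 확인해 보는 예시입니다. Dense 층의 파라미터 수는 "입력 수 × 유닛 수 + 유닛 수(bias)"입니다.

```python
# Dense 층의 파라미터 수 = 입력 수 * 유닛 수 + 유닛 수(bias)
def dense_params(n_in, n_units):
    return n_in * n_units + n_units

p1 = dense_params(39, 4)   # 첫 번째 은닉층: 39*4 + 4 = 160
p2 = dense_params(4, 3)    # 두 번째 은닉층: 4*3 + 3 = 15
p3 = dense_params(3, 1)    # 출력층: 3*1 + 1 = 4
print(p1, p2, p3, p1 + p2 + p3)   # 160 15 4 179 (Total params: 179)
```

Dropout 층은 학습 파라미터가 없으므로 Param # 이 0입니다.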
모델 구성 - 과적합 방지
dropout
25:
model = Sequential()
model.add(Dense(4, activation='relu', input_shape=(39,)))
model.add(Dropout(0.3))
model.add(Dense(3, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(1, activation='sigmoid'))
과적합 방지 모델 확인
Model: "sequential_1"
dense_3 (Dense) (None, 4) 160
dropout (Dropout) (None, 4) 0
dense_4 (Dense) (None, 3) 15
dropout_1 (Dropout) (None, 3) 0
dense_5 (Dense) (None, 1) 4
Total params: 179
Trainable params: 179
Non-trainable params: 0
모델 컴파일 – 이진 분류 모델
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
· 모델 컴파일 – 다중 분류 모델 (Y값을 One-Hot-Encoding 한경우)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
· 모델 컴파일 – 다중 분류 모델 (Y값을 One-Hot-Encoding 하지 않은 경우)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
· 모델 컴파일 – 예측 모델 model.compile(optimizer='adam', loss='mse')
모델 학습
[문제] 요구사항대로 DNN 모델을 학습시키세요.
· 모델 이름 : model
· epoch : 10번
· batch_size : 10
28:
model.fit(X_train, y_train, validation_data = (X_test, y_test), epochs = 10, batch_size = 10)
Epoch 1/10
492/492 [==============================] - 2s 3ms/step - loss: 0.6015 - accuracy: 0.7318 - val_loss: 0.5247 - val_accuracy: 0.7345
Epoch 2/10
492/492 [==============================] - 2s 3ms/step - loss: 0.5439 - accuracy: 0.7416 - val_loss: 0.4802 - val_accuracy: 0.7345
Epoch 3/10
492/492 [==============================] - 1s 3ms/step - loss: 0.5225 - accuracy: 0.7548 - val_loss: 0.4734 - val_accuracy: 0.7345
Epoch 4/10
492/492 [==============================] - 2s 4ms/step - loss: 0.5139 - accuracy: 0.7623 - val_loss: 0.4646 - val_accuracy: 0.7364
Epoch 5/10
492/492 [==============================] - 2s 3ms/step - loss: 0.5153 - accuracy: 0.7554 - val_loss: 0.4651 - val_accuracy: 0.7368
Epoch 6/10
492/492 [==============================] - 1s 3ms/step - loss: 0.5016 - accuracy: 0.7674 - val_loss: 0.4550 - val_accuracy: 0.7653
Epoch 7/10
492/492 [==============================] - 1s 3ms/step - loss: 0.5042 - accuracy: 0.7593 - val_loss: 0.4532 - val_accuracy: 0.7496
Epoch 8/10
492/492 [==============================] - 1s 3ms/step - loss: 0.4989 - accuracy: 0.7633 - val_loss: 0.4536 - val_accuracy: 0.7620
Epoch 9/10
492/492 [==============================] - 1s 3ms/step - loss: 0.5035 - accuracy: 0.7609 - val_loss: 0.4553 - val_accuracy: 0.7539
Epoch 10/10
492/492 [==============================] - 1s 3ms/step - loss: 0.5023 - accuracy: 0.7588 - val_loss: 0.4567 - val_accuracy: 0.7544
B. 다중 분류 DNN 구성
· 39개 feature를 받는 input layer
· unit 5개 hidden layer
· dropout
· unit 4개 hidden layer
· dropout
· unit 2개 output layer : 이진분류를 softmax 다중분류 형태로 구성
다중분류
· [출처] https://www.educba.com/dnn-neural-network/
[29]:
model = Sequential()
model.add(Dense(5, activation='relu', input_shape=(39,)))
model.add(Dropout(0.3))
model.add(Dense(4, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(2, activation='softmax'))
모델 확인
Model: "sequential_2"
dense_6 (Dense) (None, 5) 200
dropout_2 (Dropout) (None, 5) 0
dense_7 (Dense) (None, 4) 24
dropout_3 (Dropout) (None, 4) 0
dense_8 (Dense) (None, 2) 10
Total params: 234
Trainable params: 234
Non-trainable params: 0
모델 컴파일 – 다중 분류 모델
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
모델 학습
[32]:
history = model.fit(X_train, y_train,
validation_data=(X_test, y_test),
epochs=20,
batch_size=16)
Epoch 1/20
308/308 [==============================] - 2s 4ms/step - loss: 0.5507 - accuracy: 0.7322 - val_loss: 0.4708 - val_accuracy: 0.7345
Epoch 2/20
308/308 [==============================] - 1s 4ms/step - loss: 0.5011 - accuracy: 0.7351 - val_loss: 0.4540 - val_accuracy: 0.7345
Epoch 3/20
308/308 [==============================] - 1s 4ms/step - loss: 0.4894 - accuracy: 0.7338 - val_loss: 0.4482 - val_accuracy: 0.7345
Epoch 4/20
308/308 [==============================] - 1s 4ms/step - loss: 0.4916 - accuracy: 0.7344 - val_loss: 0.4455 - val_accuracy: 0.7345
Epoch 5/20
308/308 [==============================] - 1s 4ms/step - loss: 0.4850 - accuracy: 0.7340 - val_loss: 0.4420 - val_accuracy: 0.7345
Epoch 6/20
308/308 [==============================] - 1s 4ms/step - loss: 0.4899 - accuracy: 0.7342 - val_loss: 0.4447 - val_accuracy: 0.7345
Epoch 7/20
308/308 [==============================] - 1s 4ms/step - loss: 0.4749 - accuracy: 0.7344 - val_loss: 0.4360 - val_accuracy: 0.7345
Epoch 8/20
308/308 [==============================] - 1s 4ms/step - loss: 0.4779 - accuracy: 0.7342 - val_loss: 0.4374 - val_accuracy: 0.7345
Epoch 9/20
308/308 [==============================] - 1s 4ms/step - loss: 0.4744 - accuracy: 0.7340 - val_loss: 0.4358 - val_accuracy: 0.7345
Epoch 10/20
308/308 [==============================] - 1s 4ms/step - loss: 0.4808 - accuracy: 0.7344 - val_loss: 0.4379 - val_accuracy: 0.7345
Epoch 11/20
308/308 [==============================] - 1s 4ms/step - loss: 0.4761 - accuracy: 0.7344 - val_loss: 0.4379 - val_accuracy: 0.7345
Epoch 12/20
308/308 [==============================] - 1s 4ms/step - loss: 0.4679 - accuracy: 0.7344 - val_loss: 0.4340 - val_accuracy: 0.7345
Epoch 13/20
308/308 [==============================] - 1s 3ms/step - loss: 0.4712 - accuracy: 0.7617 - val_loss: 0.4358 - val_accuracy: 0.7824
Epoch 14/20
308/308 [==============================] - 1s 3ms/step - loss: 0.4790 - accuracy: 0.7489 - val_loss: 0.4388 - val_accuracy: 0.7691
Epoch 15/20
308/308 [==============================] - 1s 3ms/step - loss: 0.4760 - accuracy: 0.7637 - val_loss: 0.4360 - val_accuracy: 0.7345
Epoch 16/20
308/308 [==============================] - 1s 3ms/step - loss: 0.4759 - accuracy: 0.7527 - val_loss: 0.4391 - val_accuracy: 0.7710
Epoch 17/20
308/308 [==============================] - 1s 3ms/step - loss: 0.4739 - accuracy: 0.7562 - val_loss: 0.4375 - val_accuracy: 0.7904
Epoch 18/20
308/308 [==============================] - 1s 3ms/step - loss: 0.4668 - accuracy: 0.7605 - val_loss: 0.4348 - val_accuracy: 0.7985
Epoch 19/20
308/308 [==============================] - 1s 3ms/step - loss: 0.4695 - accuracy: 0.7674 - val_loss: 0.4330 - val_accuracy: 0.7899
Epoch 20/20
308/308 [==============================] - 1s 3ms/step - loss: 0.4758 - accuracy: 0.7684 - val_loss: 0.4345 - val_accuracy: 0.7876
Callback : 조기종료, 모델 저장
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
early_stop = EarlyStopping(monitor='val_loss', mode='min',
verbose=1, patience=5)
check_point = ModelCheckpoint('best_model.h5', verbose=1,
monitor='val_loss', mode='min', save_best_only=True)
모델 학습
history = model.fit(x=X_train, y=y_train,
epochs=50 , batch_size=20,
validation_data=(X_test, y_test), verbose=1,
callbacks=[early_stop, check_point])
Epoch 1/50
231/246 [===========================>..] - ETA: 0s - loss: 0.4702 - accuracy: 0.7712
Epoch 1: val_loss improved from inf to 0.43752, saving model to best_model.h5
246/246 [==============================] - 1s 2ms/step - loss: 0.4726 - accuracy: 0.7686 - val_loss: 0.4375 - val_accuracy: 0.7956
Epoch 2/50
241/246 [============================>.] - ETA: 0s - loss: 0.4734 - accuracy: 0.7587
Epoch 2: val_loss improved from 0.43752 to 0.43419, saving model to best_model.h5
246/246 [==============================] - 1s 2ms/step - loss: 0.4727 - accuracy: 0.7603 - val_loss: 0.4342 - val_accuracy: 0.7966
Epoch 3/50
239/246 [============================>.] - ETA: 0s - loss: 0.4754 - accuracy: 0.7692
Epoch 3: val_loss did not improve from 0.43419
246/246 [==============================] - 1s 2ms/step - loss: 0.4733 - accuracy: 0.7698 - val_loss: 0.4343 - val_accuracy: 0.7923
Epoch 4/50
241/246 [============================>.] - ETA: 0s - loss: 0.4699 - accuracy: 0.7645
Epoch 4: val_loss improved from 0.43419 to 0.43329, saving model to best_model.h5
246/246 [==============================] - 1s 2ms/step - loss: 0.4688 - accuracy: 0.7647 - val_loss: 0.4333 - val_accuracy: 0.7975
Epoch 5/50
218/246 [=========================>....] - ETA: 0s - loss: 0.4628 - accuracy: 0.7743
Epoch 5: val_loss improved from 0.43329 to 0.43201, saving model to best_model.h5
246/246 [==============================] - 1s 2ms/step - loss: 0.4633 - accuracy: 0.7694 - val_loss: 0.4320 - val_accuracy: 0.7980
Epoch 6/50
234/246 [===========================>..] - ETA: 0s - loss: 0.4799 - accuracy: 0.7650
Epoch 6: val_loss did not improve from 0.43201
246/246 [==============================] - 1s 2ms/step - loss: 0.4783 - accuracy: 0.7629 - val_loss: 0.4376 - val_accuracy: 0.7980
Epoch 7/50
240/246 [============================>.] - ETA: 0s - loss: 0.4704 - accuracy: 0.7675
Epoch 7: val_loss did not improve from 0.43201
246/246 [==============================] - 1s 2ms/step - loss: 0.4691 - accuracy: 0.7676 - val_loss: 0.4331 - val_accuracy: 0.7961
Epoch 8/50
226/246 [==========================>...] - ETA: 0s - loss: 0.4679 - accuracy: 0.7710
Epoch 8: val_loss did not improve from 0.43201
246/246 [==============================] - 1s 3ms/step - loss: 0.4706 - accuracy: 0.7700 - val_loss: 0.4369 - val_accuracy: 0.7985
Epoch 9/50
222/246 [==========================>...] - ETA: 0s - loss: 0.4734 - accuracy: 0.7716
Epoch 9: val_loss did not improve from 0.43201
246/246 [==============================] - 1s 2ms/step - loss: 0.4734 - accuracy: 0.7674 - val_loss: 0.4357 - val_accuracy: 0.7980
Epoch 10/50
221/246 [=========================>....] - ETA: 0s - loss: 0.4707 - accuracy: 0.7667
Epoch 10: val_loss did not improve from 0.43201
246/246 [==============================] - 1s 2ms/step - loss: 0.4718 - accuracy: 0.7670 - val_loss: 0.4337 - val_accuracy: 0.7999
Epoch 10: early stopping
⑧ 모델 성능 평가
33:
losses = pd.DataFrame(model.history.history)
losses.head()

       loss  accuracy  val_loss  val_accuracy
0  0.550660  0.732208  0.470834      0.734471
1  0.501130  0.735055  0.453986      0.734471
2  0.489383  0.733835  0.448154      0.734471
3  0.491561  0.734445  0.445530      0.734471
4  0.484967  0.734038  0.441958      0.734471
성능 시각화
35:
losses[['loss','val_loss']].plot()
36:
losses[['loss','val_loss', 'accuracy','val_accuracy']].plot()
[37]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend(['acc', 'val_acc'])
plt.show()
성능 평가
38:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report
39:
pred = model.predict(X_test)
40:
pred.shape
(2109, 2)
41:
y_pred = np.argmax(pred, axis=1)
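softmax 출력(클래스별 확률)에서 argmax로 예측 클래스를 얻는 과정을 작은 가상 배열로 확인해 보는 예시입니다.

```python
# 클래스별 확률에서 확률이 가장 큰 클래스의 인덱스를 예측값으로 선택
import numpy as np

pred_demo = np.array([[0.9, 0.1],    # 0번 클래스 확률이 더 큼 -> 0
                      [0.3, 0.7]])   # 1번 클래스 확률이 더 큼 -> 1
y_pred_demo = np.argmax(pred_demo, axis=1)
print(y_pred_demo)   # [0 1]
```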
42:
accuracy_score(y_test, y_pred)
43:
recall_score(y_test, y_pred)
44:
print(classification_report(y_test, y_pred))
precision recall f1-score support
0 0.80 0.95 0.87 1549
1 0.70 0.35 0.46 560
accuracy 0.79 2109
macro avg 0.75 0.65 0.67 2109
weighted avg 0.77 0.79 0.76 2109
성능을 향상할 수 있는 방법은 여러 가지가 있습니다.
· DNN 하이퍼파라미터를 수정하면서 성능이 향상되는지 확인
· 데이터를 줄이거나 늘리고, Feature(컬럼)를 늘리거나 줄이는 식의 Feature Engineering 방법
Feature Engineering을 통한 성능향상
· 불균형 Churn 데이터 균형 맞추기 : OverSampling, UnderSampling
· OverSampling 기법 : SMOTE(Synthetic Minority Over-sampling Technique)
SMOTE
imbalanced-learn 패키지 설치
· imbalanced data 문제를 해결하기 위한 다양한 샘플링 방법을 구현한 파이썬 패키지
45:
!pip install -U imbalanced-learn
Successfully installed imbalanced-learn-0.8.1
SMOTE 함수 이용하여 Oversampling
[46]:
from imblearn.over_sampling import SMOTE
47:
smote = SMOTE(random_state=0)
X_train_over, y_train_over = smote.fit_resample(X_train, y_train)
[48]:
print('SMOTE 적용 전 학습용 피처/레이블 데이터 세트: ', X_train.shape, y_train.shape)
print('SMOTE 적용 후 학습용 피처/레이블 데이터 세트: ', X_train_over.shape, y_train_over.shape)
SMOTE 적용 전 학습용 피처/레이블 데이터 세트: (4918, 39) (4918,)
SMOTE 적용 후 학습용 피처/레이블 데이터 세트: (7224, 39) (7224,)
[49]:
pd.Series(y_train_over).value_counts()
[49]:
1 3612
0 3612
dtype: int64
데이터 정규화
[50]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_over = scaler.transform(X_train_over)
X_test = scaler.transform(X_test)
[51]:
X_train_over.shape, y_train_over.shape, X_test.shape, y_test.shape
[51]:
((7224, 39), (7224,), (2109, 39), (2109,))
모델 개발(Creating the Model)
52:
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(39,)))
model.add(Dropout(0.3))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(16, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(2, activation='softmax'))
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
[55]:
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_accuracy', mode='max',
verbose=1, patience=5)
[56]:
from tensorflow.keras.callbacks import ModelCheckpoint
check_point = ModelCheckpoint('best_model.h5', verbose=1,
monitor='val_loss', mode='min',
save_best_only=True)
[57]:
history = model.fit(x=X_train_over, y=y_train_over,
epochs=50 , batch_size=32,
validation_data=(X_test, y_test), verbose=1,
callbacks=[early_stop, check_point])
Epoch 1/50
226/226 [==============================] - 2s 6ms/step - loss: 0.5762 - accuracy: 0.7006 - val_loss: 0.4876 - val_accuracy: 0.7307
Epoch 00001: val_loss improved from inf to 0.48763, saving model to best_model.h5
Epoch 2/50
226/226 [==============================] - 1s 5ms/step - loss: 0.5148 - accuracy: 0.7546 - val_loss: 0.4987 - val_accuracy: 0.7250
Epoch 00002: val_loss did not improve from 0.48763
Epoch 3/50
226/226 [==============================] - 1s 5ms/step - loss: 0.5035 - accuracy: 0.7625 - val_loss: 0.4921 - val_accuracy: 0.7297
Epoch 00003: val_loss did not improve from 0.48763
Epoch 4/50
226/226 [==============================] - 1s 5ms/step - loss: 0.4912 - accuracy: 0.7667 - val_loss: 0.4960 - val_accuracy: 0.7283
Epoch 00004: val_loss did not improve from 0.48763
Epoch 5/50
226/226 [==============================] - 1s 6ms/step - loss: 0.4875 - accuracy: 0.7655 - val_loss: 0.4844 - val_accuracy: 0.7468
Epoch 00005: val_loss improved from 0.48763 to 0.48436, saving model to best_model.h5
Epoch 6/50
226/226 [==============================] - 1s 5ms/step - loss: 0.4808 - accuracy: 0.7744 - val_loss: 0.4664 - val_accuracy: 0.7525
Epoch 00006: val_loss improved from 0.48436 to 0.46640, saving model to best_model.h5
Epoch 7/50
226/226 [==============================] - 1s 5ms/step - loss: 0.4730 - accuracy: 0.7825 - val_loss: 0.5007 - val_accuracy: 0.7255
Epoch 00007: val_loss did not improve from 0.46640
Epoch 8/50
226/226 [==============================] - 1s 5ms/step - loss: 0.4731 - accuracy: 0.7777 - val_loss: 0.4724 - val_accuracy: 0.7530
Epoch 00008: val_loss did not improve from 0.46640
Epoch 9/50
226/226 [==============================] - 1s 5ms/step - loss: 0.4676 - accuracy: 0.7781 - val_loss: 0.4657 - val_accuracy: 0.7639
Epoch 00009: val_loss improved from 0.46640 to 0.46568, saving model to best_model.h5
Epoch 10/50
226/226 [==============================] - 1s 5ms/step - loss: 0.4566 - accuracy: 0.7879 - val_loss: 0.5155 - val_accuracy: 0.7141
Epoch 00010: val_loss did not improve from 0.46568
Epoch 11/50
226/226 [==============================] - 1s 5ms/step - loss: 0.4582 - accuracy: 0.7883 - val_loss: 0.5009 - val_accuracy: 0.7283
Epoch 00011: val_loss did not improve from 0.46568
Epoch 12/50
226/226 [==============================] - 1s 5ms/step - loss: 0.4528 - accuracy: 0.7926 - val_loss: 0.4852 - val_accuracy: 0.7425
Epoch 00012: val_loss did not improve from 0.46568
Epoch 13/50
226/226 [==============================] - 1s 5ms/step - loss: 0.4476 - accuracy: 0.7928 - val_loss: 0.4655 - val_accuracy: 0.7520
Epoch 00013: val_loss improved from 0.46568 to 0.46549, saving model to best_model.h5
Epoch 14/50
226/226 [==============================] - 1s 5ms/step - loss: 0.4446 - accuracy: 0.7935 - val_loss: 0.4678 - val_accuracy: 0.7492
Epoch 00014: val_loss did not improve from 0.46549
Epoch 00014: early stopping
Evaluating Model Performance
[58]:
losses = pd.DataFrame(model.history.history)
losses.head()
[58]:
       loss  accuracy  val_loss  val_accuracy
0  0.576232  0.700581  0.487635      0.730678
1  0.514756  0.754568  0.498698      0.724988
2  0.503462  0.762458  0.492146      0.729730
3  0.491154  0.766750  0.496050      0.728307
4  0.487461  0.765504  0.484363      0.746799
Visualizing Performance
[60]:
losses[['loss','val_loss']].plot()
[61]:
losses[['loss','val_loss', 'accuracy','val_accuracy']].plot()
[62]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend(['acc', 'val_acc'])
plt.show()
Performance Evaluation
[63]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report
[64]:
pred = model.predict(X_test)
[65]:
pred.shape
[65]:
(2109, 2)
[66]:
y_pred = np.argmax(pred, axis=1)
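Because the output layer is a 2-unit softmax, `model.predict` returns one probability per class for each row — hence the `(2109, 2)` shape — and `np.argmax(..., axis=1)` picks the column with the higher probability as the predicted label. A toy illustration with hypothetical probabilities:

```python
import numpy as np

# Hypothetical softmax outputs for 3 customers: columns are P(class 0), P(class 1)
pred = np.array([[0.80, 0.20],
                 [0.30, 0.70],
                 [0.55, 0.45]])

# Index of the larger probability in each row becomes the predicted class
y_pred = np.argmax(pred, axis=1)
print(y_pred)  # [0 1 0]
```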
[67]:
accuracy_score(y_test, y_pred)
[68]:
recall_score(y_test, y_pred)
[69]:
print(classification_report(y_test, y_pred))
precision recall f1-score support
0 0.90 0.74 0.81 1549
1 0.52 0.76 0.62 560
accuracy 0.75 2109
macro avg 0.71 0.75 0.72 2109
weighted avg 0.80 0.75 0.76 2109
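The class-1 recall of 0.76 in the report means that 76% of the actual churners were caught. Per class, recall = TP / (TP + FN), which can be read off the confusion matrix. A small sketch with made-up labels (the arrays here are illustrative, not the notebook's data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical true and predicted labels (1 = churn)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0])

cm = confusion_matrix(y_true, y_pred)   # rows: true class, cols: predicted class
tn, fp, fn, tp = cm.ravel()

recall_manual = tp / (tp + fn)          # 3 / (3 + 1) = 0.75
print(recall_manual, recall_score(y_true, y_pred))
```

In a churn setting, recall on the positive class is usually the metric to optimize: a false negative (a churner the model misses) costs more than a false positive that triggers an unnecessary retention offer.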
Summary of What We Learned
Deep Neural Network (DNN) modeling process:
· Load the data
· Preprocess the data
· Split into train and test sets
· Normalize the data
· Build and train the DNN model
Recall is still poor. How can we improve it?
· Feature engineering: reshape the data so the model can perform better
· Fix the class-imbalance problem: under-sampling, over-sampling
· Over-sampling technique: SMOTE