[Hands-on Quiz] AI Modeling with Python - Preprocessing Part
• In this session we practice preprocessing for AI modeling with Python.
• Preprocessing accounts for roughly 60-70% of machine learning and AI modeling work.
• It takes a great deal of time and effort, and can be difficult.
• If the data is not cleaned properly, ML/AI performance cannot be guaranteed, so put real care into data preprocessing.
• One piece of advice: "one keystroke is worth a hundred explanations" (백문이불여일타).
• Practice deserves more of your time and effort than theory.
Contents
1. Overview of the exercise
2. Import required libraries and load the file
3. EDA (Exploratory Data Analysis)
4. Data preprocessing
• Drop unnecessary columns
• Change column contents
• Handle nulls
• Change column types
5. Visualization
6. Save the results
Predicting telecom service churn with machine learning and deep learning
Analyze all relevant customer data and develop a robust, accurate churn-prediction model to support strategies for retaining customers and reducing churn.
Churn refers to customers or users who stop using a service or move to a competitor in the industry. Retaining existing customers and attracting new ones is critical for every organization; failing at either is bad for business. The goal is to explore the potential of machine learning and deep learning for churn prediction in order to maintain a competitive edge.
Numpy
[Quiz] Import the numpy library under the alias np.
[1]:
import numpy as np
Pandas
[Quiz] Import the pandas library under the alias pd.
[2]:
import pandas as pd
Data file to load: data_v1.csv
Telco Customer Churn Dataset columns
1. CustomerID: Customer ID unique for each customer
2. gender: Whether the customer is a male or a female
3. SeniorCitizen: Whether the customer is a senior citizen or not (1, 0)
4. Partner: Whether the customer has a partner or not (Yes, No)
5. Dependents: Whether the customer has dependents or not (Yes, No)
6. Tenure: Number of months the customer has stayed with the company
7. PhoneService: Whether the customer has a phone service or not (Yes, No)
8. MultipleLines: Whether the customer has multiple lines or not (Yes, No, No phone service)
9. InternetService: Customer’s internet service provider (DSL, Fiber optic, No)
10. OnlineSecurity: Whether the customer has online security or not (Yes, No, No internet service)
11. OnlineBackup: Whether the customer has an online backup or not (Yes, No, No internet service)
12. DeviceProtection: Whether the customer has device protection or not (Yes, No, No internet service)
13. TechSupport: Whether the customer has tech support or not (Yes, No, No internet service)
14. StreamingTV: Whether the customer has streaming TV or not (Yes, No, No internet service)
15. StreamingMovies: Whether the customer has streaming movies or not (Yes, No, No internet service)
16. Contract: The contract term of the customer (Month-to-month, One year, Two years)
17. PaperlessBilling: Whether the customer has paperless billing or not (Yes, No)
18. PaymentMethod: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
19. MonthlyCharges: The amount charged to the customer monthly
20. TotalCharges: The total amount charged to the customer
21. Churn: Whether the customer churned or not (Yes or No)
Reading data from the CSV file
[Quiz] Read the data_v1.csv file with the pandas read_csv function and store it in the variable df.
[3]:
df = pd.read_csv('(라이브교육)data_v1.csv')
4:
df
4:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG NaN 0.0 Yes No 1 No No phone service DSL No ... No No No No NaN Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0.0 No No 34 Yes No DSL Yes ... Yes No No No One year No Mailed check 56.95 1889.5 No
2 3668-QPYBK Male 0.0 No No 2 Yes No DSL Yes ... NaN No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0.0 No No 45 No No phone service DSL Yes ... NaN Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0.0 No No 2 Yes No Fiber optic No ... NaN No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7038 6840-RESVB Male 0.0 Yes Yes 24 Yes Yes DSL Yes ... Yes Yes Yes Yes One year Yes Mailed check 84.80 1990.5 No
7039 2234-XADUH Female 0.0 Yes Yes 72 Yes Yes Fiber optic No ... Yes No Yes Yes One year Yes Credit card (automatic) 103.20 7362.9 No
7040 4801-JZAZL Female 0.0 Yes Yes 11 No No phone service DSL Yes ... No No No No Month-to-month Yes Electronic check 29.60 346.45 No
7041 8361-LTMKD Male 1.0 Yes No 4 Yes Yes Fiber optic No ... No No No No Month-to-month Yes Mailed check 74.40 306.6 Yes
7042 3186-AJIEK NaN 0.0 No No 66 Yes No Fiber optic Yes ... Yes Yes Yes Yes Two year Yes Bank transfer (automatic) 105.65 6844.5 No
7043 rows × 21 columns
Exploring the data
[5]:
df.head()
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG NaN 0.0 Yes No 1 No No phone service DSL No ... No No No No NaN Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0.0 No No 34 Yes No DSL Yes ... Yes No No No One year No Mailed check 56.95 1889.5 No
2 3668-QPYBK Male 0.0 No No 2 Yes No DSL Yes ... NaN No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0.0 No No 45 No No phone service DSL Yes ... NaN Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0.0 No No 2 Yes No Fiber optic No ... NaN No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
5 rows × 21 columns
[6]:
df.tail()
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
7038 6840-RESVB Male 0.0 Yes Yes 24 Yes Yes DSL Yes ... Yes Yes Yes Yes One year Yes Mailed check 84.80 1990.5 No
7039 2234-XADUH Female 0.0 Yes Yes 72 Yes Yes Fiber optic No ... Yes No Yes Yes One year Yes Credit card (automatic) 103.20 7362.9 No
7040 4801-JZAZL Female 0.0 Yes Yes 11 No No phone service DSL Yes ... No No No No Month-to-month Yes Electronic check 29.60 346.45 No
7041 8361-LTMKD Male 1.0 Yes No 4 Yes Yes Fiber optic No ... No No No No Month-to-month Yes Mailed check 74.40 306.6 Yes
7042 3186-AJIEK NaN 0.0 No No 66 Yes No Fiber optic Yes ... Yes Yes Yes Yes Two year Yes Bank transfer (automatic) 105.65 6844.5 No
5 rows × 21 columns
Checking the data structure
[7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
0 customerID 7043 non-null object
1 gender 7034 non-null object
2 SeniorCitizen 7042 non-null float64
3 Partner 7043 non-null object
4 Dependents 7041 non-null object
5 tenure 7043 non-null int64
6 PhoneService 7040 non-null object
7 MultipleLines 7043 non-null object
8 InternetService 7043 non-null object
9 OnlineSecurity 7043 non-null object
10 OnlineBackup 7043 non-null object
11 DeviceProtection 3580 non-null object
12 TechSupport 7043 non-null object
13 StreamingTV 7043 non-null object
14 StreamingMovies 7043 non-null object
15 Contract 7042 non-null object
16 PaperlessBilling 7043 non-null object
17 PaymentMethod 7042 non-null object
18 MonthlyCharges 7042 non-null float64
19 TotalCharges 7043 non-null object
20 Churn 7043 non-null object
dtypes: float64(2), int64(1), object(18)
memory usage: 1.1+ MB
Checking dtypes, index, column names, and values
8:
df.index
8:
RangeIndex(start=0, stop=7043, step=1)
[9]:
df.columns
[9]:
Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
dtype='object')
11:
df.values
11:
array([['7590-VHVEG', nan, 0.0, ..., 29.85, '29.85', 'No'],
['5575-GNVDE', 'Male', 0.0, ..., 56.95, '1889.5', 'No'],
['3668-QPYBK', 'Male', 0.0, ..., 53.85, '108.15', 'Yes'],
...,
['4801-JZAZL', 'Female', 0.0, ..., 29.6, '346.45', 'No'],
['8361-LTMKD', 'Male', 1.0, ..., 74.4, '306.6', 'Yes'],
['3186-AJIEK', nan, 0.0, ..., 105.65, '6844.5', 'No']],
dtype=object)
Checking null data
[10]:
df.isnull().sum()
customerID 0
gender 9
SeniorCitizen 1
Partner 0
Dependents 2
tenure 0
PhoneService 3
MultipleLines 0
InternetService 0
OnlineSecurity 0
OnlineBackup 0
DeviceProtection 3463
TechSupport 0
StreamingTV 0
StreamingMovies 0
Contract 1
PaperlessBilling 0
PaymentMethod 1
MonthlyCharges 1
TotalCharges 0
Churn 0
dtype: int64
Summary statistics
[12]:
df.describe()
[12]:
SeniorCitizen tenure MonthlyCharges
count 7042.000000 7043.000000 7042.000000
mean 0.162170 32.371149 64.763256
std 0.368633 24.559481 30.091898
min 0.000000 0.000000 18.250000
25% 0.000000 9.000000 35.500000
50% 0.000000 29.000000 70.350000
75% 0.000000 55.000000 89.850000
max 1.000000 72.000000 118.750000
Checking the data structure
[Quiz] Use a df DataFrame method to check the data structure (rows, columns, non-null counts, dtypes).
14:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
0 customerID 7043 non-null object
1 gender 7034 non-null object
2 SeniorCitizen 7042 non-null float64
3 Partner 7043 non-null object
4 Dependents 7041 non-null object
5 tenure 7043 non-null int64
6 PhoneService 7040 non-null object
7 MultipleLines 7043 non-null object
8 InternetService 7043 non-null object
9 OnlineSecurity 7043 non-null object
10 OnlineBackup 7043 non-null object
11 DeviceProtection 3580 non-null object
12 TechSupport 7043 non-null object
13 StreamingTV 7043 non-null object
14 StreamingMovies 7043 non-null object
15 Contract 7042 non-null object
16 PaperlessBilling 7043 non-null object
17 PaymentMethod 7042 non-null object
18 MonthlyCharges 7042 non-null float64
19 TotalCharges 7043 non-null object
20 Churn 7043 non-null object
dtypes: float64(2), int64(1), object(18)
memory usage: 1.1+ MB
Dropping a column
[Quiz] Drop the 'customerID' column from the df DataFrame.
15:
df.drop('customerID', axis=1, inplace=True)
[16]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
0 gender 7034 non-null object
1 SeniorCitizen 7042 non-null float64
2 Partner 7043 non-null object
3 Dependents 7041 non-null object
4 tenure 7043 non-null int64
5 PhoneService 7040 non-null object
6 MultipleLines 7043 non-null object
7 InternetService 7043 non-null object
8 OnlineSecurity 7043 non-null object
9 OnlineBackup 7043 non-null object
10 DeviceProtection 3580 non-null object
11 TechSupport 7043 non-null object
12 StreamingTV 7043 non-null object
13 StreamingMovies 7043 non-null object
14 Contract 7042 non-null object
15 PaperlessBilling 7043 non-null object
16 PaymentMethod 7042 non-null object
17 MonthlyCharges 7042 non-null float64
18 TotalCharges 7043 non-null object
19 Churn 7043 non-null object
dtypes: float64(2), int64(1), object(17)
memory usage: 1.1+ MB
Changing column contents
Converting categorical string data to numbers has a big impact on model performance, so be sure to convert it.
Before modeling, replace problematic values such as nulls or stray characters ('_') with other values, or drop them if they are not needed.
Changing the TotalCharges column type
[18]:
df['TotalCharges']
[18]:
0 29.85
1 1889.5
2 108.15
3 1840.75
4 151.65
...
7038 1990.5
7039 7362.9
7040 346.45
7041 306.6
7042 6844.5
Name: TotalCharges, Length: 7043, dtype: object
[19]:
df['TotalCharges'].astype(float)
ValueError Traceback (most recent call last)
in
2 # error raised: the string cannot be converted to a number
3
----> 4 df['TotalCharges'].astype(float)
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
5546 else:
5547 # else, only a single dtype is given
-> 5548 new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors,)
5549 return self._constructor(new_data).finalize(self, method="astype")
5550
/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
602 self, dtype, copy: bool = False, errors: str = "raise"
603 ) -> "BlockManager":
--> 604 return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
605
606 def convert(
/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, **kwargs)
407 applied = b.apply(f, **kwargs)
408 else:
--> 409 applied = getattr(b, f)(**kwargs)
410 result_blocks = _extend_blocks(applied, result_blocks)
411
/usr/local/lib/python3.6/dist-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
593 vals1d = values.ravel()
594 try:
--> 595 values = astype_nansafe(vals1d, dtype, copy=True)
596 except (ValueError, TypeError):
597 # e.g. astype_nansafe can fail on object-dtype of strings
/usr/local/lib/python3.6/dist-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
995 if copy or is_object_dtype(arr) or is_object_dtype(dtype):
996 # Explicit copy, or required since NumPy can't view from / to object.
--> 997 return arr.astype(dtype, copy=True)
998
999 return arr.view(dtype)
ValueError: could not convert string to float:
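The traceback above is caused by a blank string in TotalCharges that astype(float) cannot parse. The notebook fixes it below with replace; as a hedged alternative sketch (not the approach used here), pd.to_numeric with errors='coerce' turns unparseable strings into NaN in one step. The toy Series is hypothetical:

```python
import pandas as pd

# Toy stand-in for TotalCharges: numbers stored as strings, plus the
# blank ' ' that astype(float) cannot parse
s = pd.Series(['29.85', '1889.5', ' '])

# errors='coerce' converts unparseable entries to NaN instead of raising
converted = pd.to_numeric(s, errors='coerce')
```

The resulting NaN values can then be filled (e.g. with 0) or dropped before modeling.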
[20]:
(df['TotalCharges'] == '') | (df['TotalCharges'] == ' ')
[20]:
0 False
1 False
2 False
3 False
4 False
...
7038 False
7039 False
7040 False
7041 False
7042 False
Name: TotalCharges, Length: 7043, dtype: bool
[23]:
cond = (df['TotalCharges'] == '') | (df['TotalCharges'] == ' ')
df[cond]
[23]:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
[Quiz] In the df DataFrame, change the value ' ' to '0' in the 'TotalCharges' column.
21:
df['TotalCharges'].replace([' '], ['0'], inplace = True )
[Quiz] Change the 'TotalCharges' column of df from object to float.
[29]:
df['TotalCharges']=df['TotalCharges'].astype(float)
25:
cond = (df['TotalCharges'] == '') | (df['TotalCharges'] == ' ')
df[cond]
25:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
30:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
0 gender 7034 non-null object
1 SeniorCitizen 7042 non-null float64
2 Partner 7043 non-null object
3 Dependents 7041 non-null object
4 tenure 7043 non-null int64
5 PhoneService 7040 non-null object
6 MultipleLines 7043 non-null object
7 InternetService 7043 non-null object
8 OnlineSecurity 7043 non-null object
9 OnlineBackup 7043 non-null object
10 DeviceProtection 3580 non-null object
11 TechSupport 7043 non-null object
12 StreamingTV 7043 non-null object
13 StreamingMovies 7043 non-null object
14 Contract 7042 non-null object
15 PaperlessBilling 7043 non-null object
16 PaymentMethod 7042 non-null object
17 MonthlyCharges 7042 non-null float64
18 TotalCharges 7043 non-null float64
19 Churn 7043 non-null object
dtypes: float64(3), int64(1), object(16)
memory usage: 1.1+ MB
Converting the Churn column's string values to numbers
31:
df['Churn'].value_counts()
31:
No 5174
Yes 1869
Name: Churn, dtype: int64
[32]:
df['Churn'].replace(['Yes', 'No'], [1, 0], inplace=True)
33:
df['Churn'].value_counts()
33:
0 5174
1 1869
Name: Churn, dtype: int64
Checking null data
[Quiz] For df, list the number of nulls in each column.
34:
df.isnull().sum()
34:
gender 9
SeniorCitizen 1
Partner 0
Dependents 2
tenure 0
PhoneService 3
MultipleLines 0
InternetService 0
OnlineSecurity 0
OnlineBackup 0
DeviceProtection 3463
TechSupport 0
StreamingTV 0
StreamingMovies 0
Contract 1
PaperlessBilling 0
PaymentMethod 1
MonthlyCharges 1
TotalCharges 0
Churn 0
dtype: int64
Handling missing values
Missing values in the data can cause obscure errors during modeling, so they must be removed or replaced.
To remove them, use the dropna() function.
When replacing them, no single method is always correct; it takes judgment and deliberation.
Commonly, categorical columns are filled with the mode and numeric columns with the median.
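The fill strategy just described (mode for categorical columns, median for numeric ones) can be sketched as follows; the toy DataFrame is hypothetical:

```python
import numpy as np
import pandas as pd

# Toy frame with one categorical and one numeric column (made-up data)
toy = pd.DataFrame({'grade': ['A', 'B', None, 'A'],
                    'score': [10.0, np.nan, 30.0, 50.0]})

# Categorical column: fill with the mode (most frequent value)
toy['grade'] = toy['grade'].fillna(toy['grade'].mode()[0])
# Numeric column: fill with the median
toy['score'] = toy['score'].fillna(toy['score'].median())
```

After the two fillna calls the frame contains no missing values.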
[Quiz] In df, drop the column with many missing values, then drop the remaining rows that contain nulls.
35:
df.drop('DeviceProtection', axis=1, inplace = True)
df.dropna(inplace=True)
# dropping several columns at once also works
#df.drop(['DeviceProtection', '~', '~~'], axis=1, inplace = True)
36:
df.isnull().sum()
36:
gender 0
SeniorCitizen 0
Partner 0
Dependents 0
tenure 0
PhoneService 0
MultipleLines 0
InternetService 0
OnlineSecurity 0
OnlineBackup 0
TechSupport 0
StreamingTV 0
StreamingMovies 0
Contract 0
PaperlessBilling 0
PaymentMethod 0
MonthlyCharges 0
TotalCharges 0
Churn 0
dtype: int64
39:
df2 = df.copy()
41:
df.info()
df2.reset_index(drop = True)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7027 entries, 1 to 7041
Data columns (total 19 columns):
0 gender 7027 non-null object
1 SeniorCitizen 7027 non-null float64
2 Partner 7027 non-null object
3 Dependents 7027 non-null object
4 tenure 7027 non-null int64
5 PhoneService 7027 non-null object
6 MultipleLines 7027 non-null object
7 InternetService 7027 non-null object
8 OnlineSecurity 7027 non-null object
9 OnlineBackup 7027 non-null object
10 TechSupport 7027 non-null object
11 StreamingTV 7027 non-null object
12 StreamingMovies 7027 non-null object
13 Contract 7027 non-null object
14 PaperlessBilling 7027 non-null object
15 PaymentMethod 7027 non-null object
16 MonthlyCharges 7027 non-null float64
17 TotalCharges 7027 non-null float64
18 Churn 7027 non-null int64
dtypes: float64(3), int64(2), object(14)
memory usage: 1.1+ MB
41:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 Male 0.0 No No 34 Yes No DSL Yes No No No No One year No Mailed check 56.95 1889.50 0
1 Male 0.0 No No 2 Yes No DSL Yes Yes No No No Month-to-month Yes Mailed check 53.85 108.15 1
2 Male 0.0 No No 45 No No phone service DSL Yes No Yes No No One year No Bank transfer (automatic) 42.30 1840.75 0
3 Female 0.0 No No 2 Yes No Fiber optic No No No No No Month-to-month Yes Electronic check 70.70 151.65 1
4 Female 0.0 No No 8 Yes Yes Fiber optic No No No Yes Yes Month-to-month Yes Electronic check 99.65 820.50 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7022 Female 0.0 No No 72 Yes No No No internet service No internet service No internet service No internet service No internet service Two year Yes Bank transfer (automatic) 21.15 1419.40 0
7023 Male 0.0 Yes Yes 24 Yes Yes DSL Yes No Yes Yes Yes One year Yes Mailed check 84.80 1990.50 0
7024 Female 0.0 Yes Yes 72 Yes Yes Fiber optic No Yes No Yes Yes One year Yes Credit card (automatic) 103.20 7362.90 0
7025 Female 0.0 Yes Yes 11 No No phone service DSL Yes No No No No Month-to-month Yes Electronic check 29.60 346.45 0
7026 Male 1.0 Yes No 4 Yes Yes Fiber optic No No No No No Month-to-month Yes Mailed check 74.40 306.60 1
7027 rows × 19 columns
Importing libraries
42:
import matplotlib.pyplot as plt
%matplotlib inline
Bar chart
43:
df['gender'].value_counts()
43:
Male 3550
Female 3477
Name: gender, dtype: int64
44:
df['gender'].value_counts().plot(kind='bar')
[Quiz] Compute the value distribution of the 'Partner' column in df and draw a bar chart.
45:
df['Partner'].value_counts().plot(kind = 'bar')
Let's check distribution bar charts for all the Object columns at once.
47:
df.select_dtypes('O').head(3)
47:
gender Partner Dependents PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod
1 Male No No Yes No DSL Yes No No No No One year No Mailed check
2 Male No No Yes No DSL Yes Yes No No No Month-to-month Yes Mailed check
3 Male No No No No phone service DSL Yes No Yes No No One year No Bank transfer (automatic)
[48]:
df.select_dtypes('O').columns.values
[48]:
array(['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
'InternetService', 'OnlineSecurity', 'OnlineBackup', 'TechSupport',
'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
'PaymentMethod'], dtype=object)
[49]:
object_list = df.select_dtypes('object').columns.values
for col in object_list:
df[col].value_counts().plot(kind='bar')
plt.title(col)
plt.show()
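The loop above opens one figure per column. As a sketch of an alternative layout (toy data with two assumed object columns, not the real df), plt.subplots can place all the bar charts in a single figure:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for df's object columns
toy = pd.DataFrame({'gender': ['Male', 'Female', 'Male'],
                    'Partner': ['Yes', 'No', 'No']})
cols = toy.select_dtypes('object').columns

# One figure with one subplot (axis) per column
fig, axes = plt.subplots(1, len(cols), figsize=(8, 3))
for ax, col in zip(axes, cols):
    toy[col].value_counts().plot(kind='bar', ax=ax, title=col)
plt.tight_layout()
```

This keeps all distributions visible side by side instead of in separate windows.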
Dropping the heavily imbalanced PhoneService column
[55]:
df.drop('PhoneService', axis=1, inplace=True)
Visualizing numeric columns
[50]:
df.select_dtypes( 'number').head(3)
[50]:
SeniorCitizen tenure MonthlyCharges TotalCharges Churn
1 0.0 34 56.95 1889.50 0
2 0.0 2 53.85 108.15 1
3 0.0 45 42.30 1840.75 0
Churn column
[51]:
df['Churn'].value_counts()
[51]:
0 5161
1 1866
Name: Churn, dtype: int64
52:
df['Churn'].value_counts().plot(kind='bar')
SeniorCitizen column
53:
df['SeniorCitizen'].value_counts()
53:
0.0 5885
1.0 1142
Name: SeniorCitizen, dtype: int64
54:
df['SeniorCitizen'].value_counts().plot(kind='bar')
[Quiz] Drop the heavily imbalanced 'SeniorCitizen' column.
[56]:
df.drop('SeniorCitizen', axis=1, inplace=True)
[57]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7027 entries, 1 to 7041
Data columns (total 17 columns):
0 gender 7027 non-null object
1 Partner 7027 non-null object
2 Dependents 7027 non-null object
3 tenure 7027 non-null int64
4 MultipleLines 7027 non-null object
5 InternetService 7027 non-null object
6 OnlineSecurity 7027 non-null object
7 OnlineBackup 7027 non-null object
8 TechSupport 7027 non-null object
9 StreamingTV 7027 non-null object
10 StreamingMovies 7027 non-null object
11 Contract 7027 non-null object
12 PaperlessBilling 7027 non-null object
13 PaymentMethod 7027 non-null object
14 MonthlyCharges 7027 non-null float64
15 TotalCharges 7027 non-null float64
16 Churn 7027 non-null int64
dtypes: float64(2), int64(2), object(13)
memory usage: 988.2+ KB
Histogram
60:
#!pip install seaborn
import seaborn as sns
tenure column
61:
sns.histplot(data=df, x='tenure')
63:
sns.histplot(data=df, x='tenure', hue='Churn')
64:
sns.kdeplot(data=df, x='tenure', hue='Churn')
TotalCharges column
65:
sns.histplot(data=df, x='TotalCharges')
66:
sns.kdeplot(data=df, x='TotalCharges', hue='Churn')
Countplot
67:
sns.countplot(data=df, x='MultipleLines', hue='Churn')
heatmap
68:
df[['tenure','MonthlyCharges','TotalCharges']].corr()
68:
tenure MonthlyCharges TotalCharges
tenure 1.000000 0.247630 0.826172
MonthlyCharges 0.247630 1.000000 0.651049
TotalCharges 0.826172 0.651049 1.000000
69:
sns.heatmap(df[['tenure','MonthlyCharges','TotalCharges']].corr(), annot=True)
boxplot
70:
sns.boxplot(data=df, x='Churn', y='TotalCharges')
Saving the result to a CSV file
[71]:
df.to_csv('data_v1_save.csv', index=False)
72:
gender Partner Dependents tenure MultipleLines InternetService OnlineSecurity OnlineBackup TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 Male No No 34 No DSL Yes No No No No One year No Mailed check 56.95 1889.50 0
1 Male No No 2 No DSL Yes Yes No No No Month-to-month Yes Mailed check 53.85 108.15 1
2 Male No No 45 No phone service DSL Yes No Yes No No One year No Bank transfer (automatic) 42.30 1840.75 0
3 Female No No 2 No Fiber optic No No No No No Month-to-month Yes Electronic check 70.70 151.65 1
4 Female No No 8 Yes Fiber optic No No No Yes Yes Month-to-month Yes Electronic check 99.65 820.50 1
Recap
1. Import required libraries and load the file: pd.read_csv()
2. EDA (Exploratory Data Analysis): df.info(), df.head(), df.tail()
3. Data preprocessing
• Drop unnecessary columns: df.drop()
• Change column contents: df.replace()
• Handle nulls: df.replace(), df.fillna()
• Change column types: df['col'].astype(int)
4. Visualization
• matplotlib, seaborn
• bar, scatter, countplot, boxplot
5. Save the results
• to_csv()
[Hands-on Quiz] AI Modeling with Python - Machine Learning Part
· In this session we practice the machine learning side of AI modeling with Python.
· Machine learning offers models such as the following:
· Single classification models: LogisticRegression, KNN, DecisionTree
· Ensemble models: RandomForest, XGBoost, LGBM, Stacking, Weighted Blending
· Honestly, machine learning is easier to code than deep learning, because you just follow a four-line template.
· One piece of advice: "one keystroke is worth a hundred explanations" (백문이불여일타).
· Practice deserves more of your time and effort than theory.
Contents
Machine learning model process
· Load the data
· Preprocess the data
· Split into Train and Test datasets
· Normalize the data
· Single classification models: LogisticRegression, KNN, DecisionTree
· Ensemble models: RandomForest, XGBoost, LGBM
Recall is far too low. How can we fix it?
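One common answer to this recall question is class weighting: penalizing minority-class errors more heavily. A minimal sketch on synthetic imbalanced data (make_classification merely stands in for the churn set; the 9:1 split is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic 9:1 imbalanced binary data (stand-in for Churn 0/1)
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# class_weight='balanced' reweights errors by inverse class frequency,
# which typically raises minority-class recall (at some precision cost)
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000,
                              class_weight='balanced').fit(X_tr, y_tr)

rec_plain = recall_score(y_te, plain.predict(X_te))
rec_weighted = recall_score(y_te, weighted.predict(X_te))
```

Other options include resampling (over/under-sampling) and tuning the decision threshold.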
Machine learning model process
① Import libraries
② Load the data (Loading the data)
③ Exploratory Data Analysis
④ Data preprocessing (Data PreProcessing): type conversion, null handling, missing-data handling, dummy features, feature engineering, etc.
⑤ Split into Train and Test datasets
⑥ Normalize the data (Normalizing the Data)
⑦ Create the model (Creating the Model)
⑧ Evaluate model performance
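Steps ⑦ and ⑧ usually reduce to the "four-line template" mentioned in the intro: create, fit, predict, score. A sketch with a DecisionTree on the iris toy dataset (illustrative only, not the churn data):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # toy data for illustration

model = DecisionTreeClassifier(random_state=42)  # 1. create the model
model.fit(X, y)                                  # 2. train it
pred = model.predict(X)                          # 3. predict
acc = model.score(X, y)                          # 4. evaluate
```

The same four lines apply to any scikit-learn classifier; only the constructor changes.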
① Importing libraries
Import the required libraries
[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
② Loading the data
Read the data_v1_save.csv file
[2]:
df = pd.read_csv('(라이브교육)data_v1_save.csv',sep = ",")
③ Data analysis
[3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7027 entries, 0 to 7026
Data columns (total 17 columns):
0 gender 7027 non-null object
1 Partner 7027 non-null object
2 Dependents 7027 non-null object
3 tenure 7027 non-null int64
4 MultipleLines 7027 non-null object
5 InternetService 7027 non-null object
6 OnlineSecurity 7027 non-null object
7 OnlineBackup 7027 non-null object
8 TechSupport 7027 non-null object
9 StreamingTV 7027 non-null object
10 StreamingMovies 7027 non-null object
11 Contract 7027 non-null object
12 PaperlessBilling 7027 non-null object
13 PaymentMethod 7027 non-null object
14 MonthlyCharges 7027 non-null float64
15 TotalCharges 7027 non-null float64
16 Churn 7027 non-null int64
dtypes: float64(2), int64(2), object(13)
memory usage: 933.4+ KB
[4]:
df.tail()
[4]:
gender Partner Dependents tenure MultipleLines InternetService OnlineSecurity OnlineBackup TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
7022 Female No No 72 No No No internet service No internet service No internet service No internet service No internet service Two year Yes Bank transfer (automatic) 21.15 1419.40 0
7023 Male Yes Yes 24 Yes DSL Yes No Yes Yes Yes One year Yes Mailed check 84.80 1990.50 0
7024 Female Yes Yes 72 Yes Fiber optic No Yes No Yes Yes One year Yes Credit card (automatic) 103.20 7362.90 0
7025 Female Yes Yes 11 No phone service DSL Yes No No No No Month-to-month Yes Electronic check 29.60 346.45 0
7026 Male Yes No 4 Yes Fiber optic No No No No No Month-to-month Yes Mailed check 74.40 306.60 1
5 rows × 17 columns
5:
df['Churn'].value_counts()
5:
0 5161
1 1866
Name: Churn, dtype: int64
7:
df['Churn'].value_counts()[:].plot(kind='bar')
④ Data preprocessing
All values must be numeric; that is, every Object-typed column needs to be converted to numbers.
Instead of the replace calls used in the preprocessing session, encode with Label Encoding and One-Hot Encoding functions.
One-Hot-Encode the Object columns with the pandas get_dummies function.
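The two encodings just mentioned differ in shape: Label Encoding yields a single integer column, while One-Hot Encoding yields one 0/1 column per category. A sketch on a toy Contract column (the values are hypothetical, mirroring this dataset's categories):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({'Contract': ['Month-to-month', 'One year',
                                 'Two year', 'One year']})

# Label encoding: one integer per category, assigned in sorted order.
# The integers look ordinal, so use with care for nominal data.
le = LabelEncoder()
labels = le.fit_transform(toy['Contract'])

# One-hot encoding: one 0/1 column per category (what this notebook uses)
onehot = pd.get_dummies(toy, columns=['Contract'])
```

One-hot avoids implying an order between categories, at the cost of more columns.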
8:
df[['MultipleLines']].head()
MultipleLines
0 No
1 No
2 No phone service
3 No
4 Yes
[9]:
df['MultipleLines'].value_counts()
[9]:
No 3380
Yes 2966
No phone service 681
Name: MultipleLines, dtype: int64
10:
pd.get_dummies(data=df, columns=['MultipleLines'])
gender Partner Dependents tenure InternetService OnlineSecurity OnlineBackup TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn MultipleLines_No MultipleLines_No phone service MultipleLines_Yes
0 Male No No 34 DSL Yes No No No No One year No Mailed check 56.95 1889.50 0 1 0 0
1 Male No No 2 DSL Yes Yes No No No Month-to-month Yes Mailed check 53.85 108.15 1 1 0 0
2 Male No No 45 DSL Yes No Yes No No One year No Bank transfer (automatic) 42.30 1840.75 0 0 1 0
3 Female No No 2 Fiber optic No No No No No Month-to-month Yes Electronic check 70.70 151.65 1 1 0 0
4 Female No No 8 Fiber optic No No No Yes Yes Month-to-month Yes Electronic check 99.65 820.50 1 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7022 Female No No 72 No No internet service No internet service No internet service No internet service No internet service Two year Yes Bank transfer (automatic) 21.15 1419.40 0 1 0 0
7023 Male Yes Yes 24 DSL Yes No Yes Yes Yes One year Yes Mailed check 84.80 1990.50 0 0 0 1
7024 Female Yes Yes 72 Fiber optic No Yes No Yes Yes One year Yes Credit card (automatic) 103.20 7362.90 0 0 0 1
7025 Female Yes Yes 11 DSL Yes No No No No Month-to-month Yes Electronic check 29.60 346.45 0 0 1 0
7026 Male Yes No 4 Fiber optic No No No No No Month-to-month Yes Mailed check 74.40 306.60 1 0 0 1
7027 rows × 19 columns
11:
df.select_dtypes('object').head(3)
gender Partner Dependents MultipleLines InternetService OnlineSecurity OnlineBackup TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod
0 Male No No No DSL Yes No No No No One year No Mailed check
1 Male No No No DSL Yes Yes No No No Month-to-month Yes Mailed check
2 Male No No No phone service DSL Yes No Yes No No One year No Bank transfer (automatic)
14:
cal_cols = df.select_dtypes('object').columns.values
cal_cols
14:
array(['gender', 'Partner', 'Dependents', 'MultipleLines',
'InternetService', 'OnlineSecurity', 'OnlineBackup', 'TechSupport',
'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
'PaymentMethod'], dtype=object)
[Quiz] One-Hot-Encode the Object columns and store the result in the variable df1.
15:
df1 = pd.get_dummies(data = df, columns=cal_cols)
[16]:
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7027 entries, 0 to 7026
Data columns (total 40 columns):
0 tenure 7027 non-null int64
1 MonthlyCharges 7027 non-null float64
2 TotalCharges 7027 non-null float64
3 Churn 7027 non-null int64
4 gender_Female 7027 non-null uint8
5 gender_Male 7027 non-null uint8
6 Partner_No 7027 non-null uint8
7 Partner_Yes 7027 non-null uint8
8 Dependents_No 7027 non-null uint8
9 Dependents_Yes 7027 non-null uint8
10 MultipleLines_No 7027 non-null uint8
11 MultipleLines_No phone service 7027 non-null uint8
12 MultipleLines_Yes 7027 non-null uint8
13 InternetService_DSL 7027 non-null uint8
14 InternetService_Fiber optic 7027 non-null uint8
15 InternetService_No 7027 non-null uint8
16 OnlineSecurity_No 7027 non-null uint8
17 OnlineSecurity_No internet service 7027 non-null uint8
18 OnlineSecurity_Yes 7027 non-null uint8
19 OnlineBackup_No 7027 non-null uint8
20 OnlineBackup_No internet service 7027 non-null uint8
21 OnlineBackup_Yes 7027 non-null uint8
22 TechSupport_No 7027 non-null uint8
23 TechSupport_No internet service 7027 non-null uint8
24 TechSupport_Yes 7027 non-null uint8
25 StreamingTV_No 7027 non-null uint8
26 StreamingTV_No internet service 7027 non-null uint8
27 StreamingTV_Yes 7027 non-null uint8
28 StreamingMovies_No 7027 non-null uint8
29 StreamingMovies_No internet service 7027 non-null uint8
30 StreamingMovies_Yes 7027 non-null uint8
31 Contract_Month-to-month 7027 non-null uint8
32 Contract_One year 7027 non-null uint8
33 Contract_Two year 7027 non-null uint8
34 PaperlessBilling_No 7027 non-null uint8
35 PaperlessBilling_Yes 7027 non-null uint8
36 PaymentMethod_Bank transfer (automatic) 7027 non-null uint8
37 PaymentMethod_Credit card (automatic) 7027 non-null uint8
38 PaymentMethod_Electronic check 7027 non-null uint8
39 PaymentMethod_Mailed check 7027 non-null uint8
dtypes: float64(2), int64(2), uint8(36)
memory usage: 466.8 KB
[17]:
df1.head(3)
[17]:
tenure MonthlyCharges TotalCharges Churn gender_Female gender_Male Partner_No Partner_Yes Dependents_No Dependents_Yes ... StreamingMovies_Yes Contract_Month-to-month Contract_One year Contract_Two year PaperlessBilling_No PaperlessBilling_Yes PaymentMethod_Bank transfer (automatic) PaymentMethod_Credit card (automatic) PaymentMethod_Electronic check PaymentMethod_Mailed check
0 34 56.95 1889.50 0 0 1 1 0 1 0 ... 0 0 1 0 1 0 0 0 0 1
1 2 53.85 108.15 1 0 1 1 0 1 0 ... 0 1 0 0 0 1 0 0 0 1
2 45 42.30 1840.75 0 0 1 1 0 1 0 ... 0 0 1 0 1 0 1 0 0 0
3 rows × 40 columns
⑤ Splitting into Train and Test datasets
Separating inputs (X) and labels (y)
[Quiz] Store everything except the 'Churn' column of the df1 DataFrame in X.
[18]:
X = df1.drop('Churn', axis=1).values
[Quiz] Store the 'Churn' column of the df1 DataFrame in y.
[20]:
y = df1['Churn'].values
21:
X.shape, y.shape
21:
((7027, 39), (7027,))
Splitting into Train and Test datasets
22:
from sklearn.model_selection import train_test_split
[Quiz] Split the data into Train and Test datasets.
[23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, stratify = y, random_state = 42)
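The stratify=y argument in the split above keeps the Churn class ratio identical in the train and test sets. A sketch with toy 80/20 labels (made-up data) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 80 zeros and 20 ones (4:1 imbalance)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy allocates positives proportionally to both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42)
```

With a 30% test split, exactly 6 of the 20 positives land in the test set, preserving the 4:1 ratio on both sides.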
[24]:
X_train.shape
[24]:
(4918, 39)
⑥ Data normalization/scaling (Normalizing/Scaling)
26:
df1.tail()
tenure MonthlyCharges TotalCharges Churn gender_Female gender_Male Partner_No Partner_Yes Dependents_No Dependents_Yes ... StreamingMovies_Yes Contract_Month-to-month Contract_One year Contract_Two year PaperlessBilling_No PaperlessBilling_Yes PaymentMethod_Bank transfer (automatic) PaymentMethod_Credit card (automatic) PaymentMethod_Electronic check PaymentMethod_Mailed check
7022 72 21.15 1419.40 0 1 0 1 0 1 0 ... 0 0 0 1 0 1 1 0 0 0
7023 24 84.80 1990.50 0 0 1 0 1 0 1 ... 1 0 1 0 0 1 0 0 0 1
7024 72 103.20 7362.90 0 1 0 0 1 0 1 ... 1 0 1 0 0 1 0 1 0 0
7025 11 29.60 346.45 0 1 0 0 1 0 1 ... 0 1 0 0 0 1 0 0 1 0
7026 4 74.40 306.60 1 0 1 0 1 1 0 ... 0 1 0 0 0 1 0 0 0 1
5 rows × 40 columns
27:
from sklearn.preprocessing import MinMaxScaler
[문제] MinMaxScaler를 'scaler'로 정의하세요.
28:
scaler = MinMaxScaler()
[29]:
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
30:
X_train[:2], y_train[:2]
30:
(array([[0.65277778, 0.56851021, 0.40877722, 1. , 0. ,
1. , 0. , 1. , 0. , 1. ,
0. , 0. , 0. , 1. , 0. ,
0. , 0. , 1. , 1. , 0. ,
0. , 1. , 0. , 0. , 1. ,
0. , 0. , 1. , 0. , 0. ,
1. , 0. , 0. , 1. , 0. ,
0. , 1. , 0. , 0. ],
[0.27777778, 0.00498256, 0.04008671, 1. , 0. ,
1. , 0. , 1. , 0. , 1. ,
0. , 0. , 0. , 0. , 1. ,
0. , 1. , 0. , 0. , 1. ,
0. , 0. , 1. , 0. , 0. ,
1. , 0. , 0. , 1. , 0. ,
1. , 0. , 0. , 0. , 1. ,
0. , 1. , 0. , 0. ]]),
array([0, 0]))
(참고) 스케일링 이후 X_train은 numpy 배열이므로 DataFrame 메서드인 tail()을 호출하면 다음과 같은 오류가 발생한다.
AttributeError Traceback (most recent call last)
----> 1 X_train.tail()
AttributeError: 'numpy.ndarray' object has no attribute 'tail'
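스케일링 결과를 DataFrame 형태로 다시 확인하고 싶다면 배열을 감싸 주면 됩니다. 아래는 설명용 가상 데이터와 가상의 컬럼명(feat_a, feat_b)을 쓴 스케치입니다.

```python
# 스케일링 결과(numpy 배열)를 DataFrame으로 감싸면 tail() 등을 쓸 수 있다
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # 가상 데이터
scaled = MinMaxScaler().fit_transform(X_demo)               # 0~1로 변환

df_scaled = pd.DataFrame(scaled, columns=['feat_a', 'feat_b'])
print(df_scaled.tail(2))   # DataFrame이므로 tail() 사용 가능
```

실제 노트북에서는 columns에 df1.drop('Churn', axis=1).columns를 넘기면 원래 컬럼명을 그대로 쓸 수 있습니다.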
⑦ 모델 개발
(참고) 모델별 바차트 그려주고 성능 확인을 위한 함수
44:
import matplotlib.pyplot as plt
from sklearn.metrics import recall_score

my_predictions = {}

colors = ['r', 'c', 'm', 'y', 'k', 'khaki', 'teal', 'orchid', 'sandybrown',
          'greenyellow', 'dodgerblue', 'deepskyblue', 'rosybrown', 'firebrick',
          'deeppink', 'crimson', 'salmon', 'darkred', 'olivedrab', 'olive',
          'forestgreen', 'royalblue', 'indigo', 'navy', 'mediumpurple', 'chocolate',
          'gold', 'darkorange', 'seagreen', 'turquoise', 'steelblue', 'slategray',
          'peru', 'midnightblue', 'slateblue', 'dimgray', 'cadetblue', 'tomato']

def recall_eval(name, pred, actual):
    global my_predictions
    global colors

    # 재현율(recall)을 계산해 모델 이름과 함께 기록한다
    acc = recall_score(actual, pred)
    my_predictions[name] = acc * 100

    # 재현율 내림차순으로 정렬해 표로 출력한다
    y_value = sorted(my_predictions.items(), key=lambda x: x[1], reverse=True)
    df = pd.DataFrame(y_value, columns=['model', 'recall'])
    print(df)

    # 모델별 재현율을 가로 막대 그래프로 그린다
    length = len(df)
    plt.figure(figsize=(10, length))
    ax = plt.subplot()
    ax.set_yticks(np.arange(len(df)))
    ax.set_yticklabels(df['model'], fontsize=15)
    bars = ax.barh(np.arange(len(df)), df['recall'])

    for i, v in enumerate(df['recall']):
        idx = np.random.choice(len(colors))
        bars[i].set_color(colors[idx])
        ax.text(v + 2, i, str(round(v, 3)), color='k', fontsize=15, fontweight='bold')

    plt.title('recall', fontsize=18)
    plt.xlim(0, 100)
    plt.show()
1) 로지스틱 회귀 (LogisticRegression, 분류)
[32]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report
[문제] LogisticRegression 모델 정의하고 학습시키세요.
33:
lg = LogisticRegression()
lg.fit(X_train, y_train)
34:
lg.score(X_test, y_test)
· 분류기 성능 평가 지표
35:
lg_pred = lg.predict(X_test)
36:
lg_pred
array([0, 0, 0, ..., 1, 1, 0])
[37]:
confusion_matrix(y_test, lg_pred)
[37]:
array([[1386, 163],
[ 246, 314]])
38:
accuracy_score(y_test, lg_pred)
39:
precision_score(y_test, lg_pred)
40:
recall_score(y_test, lg_pred)
41:
f1_score(y_test, lg_pred)
42:
print(classification_report(y_test, lg_pred))
precision recall f1-score support
0 0.85 0.89 0.87 1549
1 0.66 0.56 0.61 560
accuracy 0.81 2109
macro avg 0.75 0.73 0.74 2109
weighted avg 0.80 0.81 0.80 2109
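위 confusion_matrix 결과 [[1386, 163], [246, 314]]에서 각 평가 지표가 어떻게 계산되는지 직접 확인해 보는 예시입니다.

```python
# confusion_matrix의 네 값으로 accuracy/precision/recall/f1을 직접 계산
tn, fp, fn, tp = 1386, 163, 246, 314

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # 전체 중 맞춘 비율
precision = tp / (tp + fp)                    # 1로 예측한 것 중 실제 1의 비율
recall    = tp / (tp + fn)                    # 실제 1(이탈) 중 1로 맞춘 비율
f1        = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
# 0.81 0.66 0.56 0.61 — classification_report의 1(이탈) 클래스 행과 같은 값
```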
45:
recall_eval('LogisticRegression', lg_pred, y_test)
model recall
0 LogisticRegression 56.071429
2) KNN (K-Nearest Neighbor)
[46]:
from sklearn.neighbors import KNeighborsClassifier
47:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
[48]:
knn_pred = knn.predict(X_test)
[49]:
recall_eval('K-Nearest Neighbor', knn_pred, y_test)
model recall
0 LogisticRegression 56.071429
1 K-Nearest Neighbor 52.142857
3) 결정트리(DecisionTree)
[50]:
from sklearn.tree import DecisionTreeClassifier
[51]:
dt = DecisionTreeClassifier(max_depth=10, random_state=42)
dt.fit(X_train, y_train)
[51]:
DecisionTreeClassifier(max_depth=10, random_state=42)
[문제] 학습된 DecisionTreeClassifier 모델로 예측해 보기
52:
dt_pred = dt.predict(X_test)
53:
recall_eval('DecisionTree', dt_pred, y_test)
model recall
0 LogisticRegression 56.071429
1 DecisionTree 55.714286
2 K-Nearest Neighbor 52.142857
앙상블 기법의 종류
· 배깅 (Bagging): 여러 개의 DecisionTree를 활용하고 샘플 중복 생성을 통해 결과를 도출하는 방식. 예: RandomForest
· 부스팅 (Boosting): 약한 학습기를 순차적으로 학습하되, 이전 학습에서 잘못 예측된 데이터에 가중치를 부여해 오차를 보완해 나가는 방식. 예: XGBoost, LGBM
앙상블
4) 랜덤포레스트(RandomForest)
· Bagging의 대표적인 모델로서, 훈련 세트를 무작위로 나눈 각기 다른 서브셋으로 데이터셋을 만들고
· 여러 개의 DecisionTree로 학습한 뒤 다수결로 결정하는 모델
주요 Hyperparameter
· random_state: 랜덤 시드 고정 값. 고정해두고 튜닝할 것!
· n_jobs: CPU 사용 갯수
· max_depth: 깊어질 수 있는 최대 깊이. 과대적합 방지용
· n_estimators: 앙상블하는 트리의 갯수
· max_features: 최대로 사용할 feature의 갯수. 과대적합 방지용
· min_samples_split: 트리가 분할할 때 최소 샘플의 갯수. default=2. 과대적합 방지용
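위 하이퍼파라미터들을 실제로 지정해 보는 간단한 스케치입니다. 아래 값들과 가상 데이터(make_classification)는 설명용 예시이며, 실제로는 데이터에 맞게 튜닝이 필요합니다.

```python
# RandomForest 주요 하이퍼파라미터를 지정해 보는 예시 (값은 설명용)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

rfc_demo = RandomForestClassifier(
    n_estimators=100,      # 앙상블하는 트리의 개수
    max_depth=5,           # 과대적합 방지: 트리 최대 깊이 제한
    max_features='sqrt',   # 분할 시 고려할 feature 수 제한
    min_samples_split=4,   # 분할에 필요한 최소 샘플 수
    n_jobs=-1,             # 모든 CPU 코어 사용
    random_state=42)       # 재현성을 위한 시드 고정
rfc_demo.fit(X_demo, y_demo)
print(rfc_demo.score(X_demo, y_demo))
```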
54:
from sklearn.ensemble import RandomForestClassifier
[55]:
rfc = RandomForestClassifier(n_estimators=3, random_state=42)
rfc.fit(X_train, y_train)
[55]:
RandomForestClassifier(n_estimators=3, random_state=42)
[56]:
rfc_pred = rfc.predict(X_test)
[57]:
recall_eval('RandomForest Ensemble', rfc_pred, y_test)
model recall
0 LogisticRegression 56.071429
1 DecisionTree 55.714286
2 K-Nearest Neighbor 52.142857
3 RandomForest Ensemble 52.142857
5) XGBoost
· 여러개의 DecisionTree를 결합하여 Strong Learner 만드는 Boosting 앙상블 기법
· Kaggle 대회에서 자주 사용하는 모델이다.
5-1) Boosting 기본 개념
5-2) Boosting 상세
주요 특징
· scikit-learn 패키지가 아닙니다.
· 성능이 우수함
· GBM보다는 빠르고 성능도 향상되었습니다.
· 다만 학습시간이 여전히 느린 편입니다
주요 Hyperparameter
· random_state: 랜덤 시드 고정 값. 고정해두고 튜닝할 것!
· n_jobs: CPU 사용 갯수
· learning_rate: 학습률. 너무 큰 학습률은 성능을 떨어뜨리고, 너무 작은 학습률은 학습을 느리게 한다. 적절한 값을 찾아야 하며 n_estimators와 같이 튜닝. default=0.1
· n_estimators: 부스팅 스테이지 수. (랜덤포레스트 트리의 갯수 설정과 비슷한 개념). default=100
· max_depth: 트리의 깊이. 과대적합 방지용. default=3.
· subsample: 샘플 사용 비율. 과대적합 방지용. default=1.0
· colsample_bytree: 트리별로 사용할 feature의 비율 (scikit-learn의 max_features와 비슷한 개념). 과대적합 방지용. default=1.0
[58]:
!pip install xgboost
Requirement already satisfied: xgboost in /usr/local/lib/python3.6/dist-packages (0.90)
59:
from xgboost import XGBClassifier
60:
xgb = XGBClassifier(n_estimators=3, random_state=42)
xgb.fit(X_train, y_train)
60:
XGBClassifier(n_estimators=3, random_state=42)
61:
xgb_pred = xgb.predict(X_test)
[62]:
recall_eval('XGBoost', xgb_pred, y_test)
model recall
0 LogisticRegression 56.071429
1 DecisionTree 55.714286
2 K-Nearest Neighbor 52.142857
3 RandomForest Ensemble 52.142857
4 XGBoost 48.214286
6) Light GBM
· XGBoost와 함께 주목받는 DecisionTree 알고리즘 기반의 Boosting 앙상블 기법
· XGBoost에 비해 학습시간이 짧은 편이다.
주요 특징
· scikit-learn 패키지가 아닙니다.
· 성능이 우수함
· 속도도 매우 빠릅니다.
주요 Hyperparameter
· random_state: 랜덤 시드 고정 값. 고정해두고 튜닝할 것!
· n_jobs: CPU 사용 갯수
· learning_rate: 학습률. 너무 큰 학습률은 성능을 떨어뜨리고, 너무 작은 학습률은 학습을 느리게 한다. 적절한 값을 찾아야 하며 n_estimators와 같이 튜닝. default=0.1
· n_estimators: 부스팅 스테이지 수. (랜덤포레스트 트리의 갯수 설정과 비슷한 개념). default=100
· max_depth: 트리의 깊이. 과대적합 방지용. default=-1 (제한 없음)
· colsample_bytree: 샘플 사용 비율 (max_features와 비슷한 개념). 과대적합 방지용. default=1.0
63:
!pip install lightgbm
Requirement already satisfied: lightgbm in /usr/local/lib/python3.6/dist-packages (2.3.0)
64:
from lightgbm import LGBMClassifier
65:
lgbm = LGBMClassifier(n_estimators=3, random_state=42)
lgbm.fit(X_train, y_train)
65:
LGBMClassifier(n_estimators=3, random_state=42)
66:
lgbm_pred = lgbm.predict(X_test)
67:
recall_eval('LGBM', lgbm_pred, y_test)
model recall
0 LogisticRegression 56.071429
1 DecisionTree 55.714286
2 K-Nearest Neighbor 52.142857
3 RandomForest Ensemble 52.142857
4 XGBoost 48.214286
5 LGBM 0.000000
68:
lgbm.score(X_test, y_test)
69:
recall_score(y_test, lgbm_pred)
배운 내용 정리
머신러닝 모델 프로세스
① 라이브러리 임포트(import)
② 데이터 가져오기(Loading the data)
③ 탐색적 데이터 분석(Exploratory Data Analysis)
④ 데이터 전처리(Data PreProcessing) : 데이터타입 변환, Null 데이터 처리, 누락데이터 처리, 더미특성 생성, 특성 추출 (feature engineering) 등
⑤ Train, Test 데이터셋 분할
⑥ 데이터 정규화(Normalizing the Data)
⑦ 모델 개발(Creating the Model)
⑧ 모델 성능 평가
평가 지표 활용 : 모델별 성능 확인을 위한 함수 (가져다 쓰면 된다)
단일 분류 모델 : LogisticRegression, KNN, DecisionTree
앙상블 (Ensemble) : RandomForest, XGBoost, LGBM
재현율 성능이 너무 안 나온다. 어떻게 해결할 수 있을까?
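재현율을 올리는 한 가지 간단한 방법의 스케치입니다: 클래스 불균형을 보정하는 class_weight='balanced' 옵션. 아래는 가상의 불균형 데이터로 만든 예시이며, 본문 데이터에 그대로 적용했을 때의 수치를 보장하지는 않습니다.

```python
# 불균형 데이터에서 class_weight='balanced'가 소수 클래스 재현율에 주는 효과
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# 0:1 = 8:2 비율의 가상 불균형 데이터
X_demo, y_demo = make_classification(n_samples=2000, weights=[0.8, 0.2],
                                     random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo,
                                          random_state=42)

base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X_tr, y_tr)

print('기본 모델 recall    :', recall_score(y_te, base.predict(X_te)))
print('balanced 모델 recall:', recall_score(y_te, balanced.predict(X_te)))
```

소수 클래스에 더 큰 가중치를 주면 결정 경계가 소수 클래스 쪽으로 이동해 재현율이 올라가는 대신 정밀도가 떨어지는 트레이드오프가 있습니다. 이어지는 딥러닝 파트의 SMTE 계열 OverSampling도 같은 문제를 겨냥한 방법입니다.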
[실습-퀴즈] Python을 활용한 AI 모델링 - 딥러닝 파트
· 이번시간에는 Python을 활용한 AI 모델링에서 딥러닝에 대해 실습해 보겠습니다.
· 여기서는 딥러닝 모델 DNN에 대해 코딩하여 모델 구축해 보겠습니다.
· 한가지 당부 드리고 싶은 말은 "백문이불여일타" 입니다.
· 이론보다 실습이 더 많은 시간과 노력이 투자 되어야 합니다.
학습목차
딥러닝 심층신경망(DNN) 모델 프로세스
· 데이터 가져오기
· 데이터 전처리
· Train, Test 데이터셋 분할
· 데이터 정규화
· DNN 딥러닝 모델
재현율 성능이 좋지 않다. 어떻게 성능을 향상할 수 있을까?
딥러닝 심층신경망(DNN) 모델 프로세스
① 라이브러리 임포트(import)
② 데이터 가져오기(Loading the data)
③ 탐색적 데이터 분석(Exploratory Data Analysis)
④ 데이터 전처리(Data PreProcessing) : 데이터타입 변환, Null 데이터 처리, 누락데이터 처리, 더미특성 생성, 특성 추출 (feature engineering) 등
⑤ Train, Test 데이터셋 분할
⑥ 데이터 정규화(Normalizing the Data)
⑦ 모델 개발(Creating the Model)
⑧ 모델 성능 평가
① 라이브러리 임포트
필요 라이브러리 임포트
[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
② 데이터 로드
[문제] 같은 폴더내에 있는 data_v1_save.csv 파일을 Pandas read_csv 함수를 이용하여 읽어 df 변수에 저장하세요.
[2]:
df = pd.read_csv('data_v1_save.csv')
③ 데이터 분석
[3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7027 entries, 0 to 7026
Data columns (total 17 columns):
0 gender 7027 non-null object
1 Partner 7027 non-null object
2 Dependents 7027 non-null object
3 tenure 7027 non-null int64
4 MultipleLines 7027 non-null object
5 InternetService 7027 non-null object
6 OnlineSecurity 7027 non-null object
7 OnlineBackup 7027 non-null object
8 TechSupport 7027 non-null object
9 StreamingTV 7027 non-null object
10 StreamingMovies 7027 non-null object
11 Contract 7027 non-null object
12 PaperlessBilling 7027 non-null object
13 PaymentMethod 7027 non-null object
14 MonthlyCharges 7027 non-null float64
15 TotalCharges 7027 non-null float64
16 Churn 7027 non-null int64
dtypes: float64(2), int64(2), object(13)
memory usage: 933.4+ KB
df.tail() 출력 (5 rows × 17 columns):

인덱스 | gender | Partner | Dependents | tenure | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn
7022 | Female | No | No | 72 | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | Yes | Bank transfer (automatic) | 21.15 | 1419.40 | 0
7023 | Male | Yes | Yes | 24 | Yes | DSL | Yes | No | Yes | Yes | Yes | One year | Yes | Mailed check | 84.80 | 1990.50 | 0
7024 | Female | Yes | Yes | 72 | Yes | Fiber optic | No | Yes | No | Yes | Yes | One year | Yes | Credit card (automatic) | 103.20 | 7362.90 | 0
7025 | Female | Yes | Yes | 11 | No phone service | DSL | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.60 | 346.45 | 0
7026 | Male | Yes | No | 4 | Yes | Fiber optic | No | No | No | No | No | Month-to-month | Yes | Mailed check | 74.40 | 306.60 | 1
5:
df['Churn'].value_counts().plot(kind='bar')
④ 데이터 전처리
· 모든 데이터 값은 숫자형이어야 한다. 즉, Object 타입을 모두 숫자형으로 변경 필요
· Object 컬럼에 대해 Pandas get_dummies 함수 활용하여 One-Hot-Encoding
6:
cal_cols = df.select_dtypes('object').columns.values
cal_cols
6:
array(['gender', 'Partner', 'Dependents', 'MultipleLines',
'InternetService', 'OnlineSecurity', 'OnlineBackup', 'TechSupport',
'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
'PaymentMethod'], dtype=object)
[문제] Object 컬럼에 대해 One-Hot-Encoding 수행하고 그 결과를 df1 변수에 저장하세요.
7:
df1 = pd.get_dummies(data = df, columns = cal_cols)
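get_dummies가 Object 컬럼을 어떻게 One-Hot-Encoding 하는지 작은 가상 예시로 확인해 봅니다. (아래 demo 데이터는 설명용입니다.)

```python
# 문자열 컬럼 하나가 카테고리별 0/1 컬럼들로 펼쳐지는 과정
import pandas as pd

demo = pd.DataFrame({'gender': ['Male', 'Female', 'Male']})
encoded = pd.get_dummies(data=demo, columns=['gender'])
print(encoded.columns.tolist())   # ['gender_Female', 'gender_Male']
```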
8:
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7027 entries, 0 to 7026
Data columns (total 40 columns):
0 tenure 7027 non-null int64
1 MonthlyCharges 7027 non-null float64
2 TotalCharges 7027 non-null float64
3 Churn 7027 non-null int64
4 gender_Female 7027 non-null uint8
5 gender_Male 7027 non-null uint8
6 Partner_No 7027 non-null uint8
7 Partner_Yes 7027 non-null uint8
8 Dependents_No 7027 non-null uint8
9 Dependents_Yes 7027 non-null uint8
10 MultipleLines_No 7027 non-null uint8
11 MultipleLines_No phone service 7027 non-null uint8
12 MultipleLines_Yes 7027 non-null uint8
13 InternetService_DSL 7027 non-null uint8
14 InternetService_Fiber optic 7027 non-null uint8
15 InternetService_No 7027 non-null uint8
16 OnlineSecurity_No 7027 non-null uint8
17 OnlineSecurity_No internet service 7027 non-null uint8
18 OnlineSecurity_Yes 7027 non-null uint8
19 OnlineBackup_No 7027 non-null uint8
20 OnlineBackup_No internet service 7027 non-null uint8
21 OnlineBackup_Yes 7027 non-null uint8
22 TechSupport_No 7027 non-null uint8
23 TechSupport_No internet service 7027 non-null uint8
24 TechSupport_Yes 7027 non-null uint8
25 StreamingTV_No 7027 non-null uint8
26 StreamingTV_No internet service 7027 non-null uint8
27 StreamingTV_Yes 7027 non-null uint8
28 StreamingMovies_No 7027 non-null uint8
29 StreamingMovies_No internet service 7027 non-null uint8
30 StreamingMovies_Yes 7027 non-null uint8
31 Contract_Month-to-month 7027 non-null uint8
32 Contract_One year 7027 non-null uint8
33 Contract_Two year 7027 non-null uint8
34 PaperlessBilling_No 7027 non-null uint8
35 PaperlessBilling_Yes 7027 non-null uint8
36 PaymentMethod_Bank transfer (automatic) 7027 non-null uint8
37 PaymentMethod_Credit card (automatic) 7027 non-null uint8
38 PaymentMethod_Electronic check 7027 non-null uint8
39 PaymentMethod_Mailed check 7027 non-null uint8
dtypes: float64(2), int64(2), uint8(36)
memory usage: 466.8 KB
⑤ Train, Test 데이터셋 분할
10:
from sklearn.model_selection import train_test_split
11:
X = df1.drop('Churn', axis=1).values
y = df1['Churn'].values
[12]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.3,
stratify=y,
random_state=42)
13:
X_train.shape
(4918, 39)
⑥ 데이터 정규화/스케일링(Normalizing/Scaling)
15:
df1.tail()
5 rows × 40 columns (중간 컬럼 생략):

인덱스 | tenure | MonthlyCharges | TotalCharges | Churn | gender_Female | gender_Male | Partner_No | Partner_Yes | Dependents_No | Dependents_Yes | ... | StreamingMovies_Yes | Contract_Month-to-month | Contract_One year | Contract_Two year | PaperlessBilling_No | PaperlessBilling_Yes | PaymentMethod_Bank transfer (automatic) | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check
7022 | 72 | 21.15 | 1419.40 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0
7023 | 24 | 84.80 | 1990.50 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1
7024 | 72 | 103.20 | 7362.90 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0
7025 | 11 | 29.60 | 346.45 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0
7026 | 4 | 74.40 | 306.60 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1
[16]:
from sklearn.preprocessing import MinMaxScaler
17:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
[18]:
X_train[:2]
[18]:
array([[0.65277778, 0.56851021, 0.40877722, 1. , 0. ,
1. , 0. , 1. , 0. , 1. ,
0. , 0. , 0. , 1. , 0. ,
0. , 0. , 1. , 1. , 0. ,
0. , 1. , 0. , 0. , 1. ,
0. , 0. , 1. , 0. , 0. ,
1. , 0. , 0. , 1. , 0. ,
0. , 1. , 0. , 0. ],
[0.27777778, 0.00498256, 0.04008671, 1. , 0. ,
1. , 0. , 1. , 0. , 1. ,
0. , 0. , 0. , 0. , 1. ,
0. , 1. , 0. , 0. , 1. ,
0. , 0. , 1. , 0. , 0. ,
1. , 0. , 0. , 1. , 0. ,
1. , 0. , 0. , 0. , 1. ,
0. , 1. , 0. , 0. ]])
⑦ 딥러닝 심층신경망(DNN) 모델 구현
라이브러리 임포트
[19]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
tf.random.set_seed(100)
하이퍼파라미터 설정 : batch_size, epochs
[20]:
batch_size = 16
epochs = 20
모델 입력(features) 갯수 확인
21:
X_train.shape
(4918, 39)
모델 출력(label) 갯수 확인
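노트북에는 이 단계의 코드가 빠져 있어, 라벨의 종류와 개수를 확인하는 방법을 가상 배열로 보충한 예시입니다. 실제로는 y_demo 대신 본문의 y(Churn 레이블 배열)를 넣으면 됩니다.

```python
# 출력 라벨의 클래스 종류와 클래스별 개수 확인
import numpy as np

y_demo = np.array([0, 1, 0, 0, 1, 0])   # Churn 레이블 형태의 가상 예시
classes, counts = np.unique(y_demo, return_counts=True)
print(classes, counts)   # 클래스 종류와 클래스별 개수
```

이진분류이므로 클래스가 2개(0, 1)임을 확인한 뒤 출력층 구성을 결정합니다.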
A. 이진분류 DNN모델 구성
hidden Layer
· [출처] https://subscription.packtpub.com/book/data/9781788995207/1/ch01lvl1sec03/deep-learning-intuition
[문제] 요구사항대로 Sequential 모델을 만들어 보세요.
[23]:
model = Sequential()
model.add(Dense(4, activation = 'relu', input_shape = (39,)))
model.add(Dense(3, activation = 'relu'))
model.add(Dense(1, activation = 'sigmoid'))
모델 확인
Model: "sequential"
dense (Dense) (None, 4) 160
dense_1 (Dense) (None, 3) 15
dense_2 (Dense) (None, 1) 4
Total params: 179
Trainable params: 179
Non-trainable params: 0
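위 모델 요약의 파라미터 수가 어떻게 계산되는지 확인해 보는 예시입니다. Dense 층의 파라미터 수는 "입력 수 × 유닛 수 + 유닛 수(bias)"입니다.

```python
# Dense 층의 파라미터 수 = 입력 수 * 유닛 수 + 유닛 수(bias)
def dense_params(n_in, n_units):
    return n_in * n_units + n_units

p1 = dense_params(39, 4)   # 첫 번째 은닉층: 39*4 + 4 = 160
p2 = dense_params(4, 3)    # 두 번째 은닉층: 4*3 + 3 = 15
p3 = dense_params(3, 1)    # 출력층: 3*1 + 1 = 4
print(p1, p2, p3, p1 + p2 + p3)   # 160 15 4 179 (Total params: 179)
```

Dropout 층은 학습 파라미터가 없으므로 Param # 이 0입니다.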
모델 구성 - 과적합 방지
dropout
25:
model = Sequential()
model.add(Dense(4, activation='relu', input_shape=(39,)))
model.add(Dropout(0.3))
model.add(Dense(3, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(1, activation='sigmoid'))
과적합 방지 모델 확인
Model: "sequential_1"
dense_3 (Dense) (None, 4) 160
dropout (Dropout) (None, 4) 0
dense_4 (Dense) (None, 3) 15
dropout_1 (Dropout) (None, 3) 0
dense_5 (Dense) (None, 1) 4
Total params: 179
Trainable params: 179
Non-trainable params: 0
모델 컴파일 – 이진 분류 모델
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
· 모델 컴파일 – 다중 분류 모델 (Y값을 One-Hot-Encoding 한경우)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
· 모델 컴파일 – 다중 분류 모델 (Y값을 One-Hot-Encoding 하지 않은 경우)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
· 모델 컴파일 – 예측 모델 model.compile(optimizer='adam', loss='mse')
모델 학습
[문제] 요구사항대로 DNN 모델을 학습시키세요.
· 모델 이름 : model
· epoch : 10번
· batch_size : 10
28:
model.fit(X_train, y_train, validation_data = (X_test, y_test), epochs = 10, batch_size = 10)
Epoch 1/10
492/492 [==============================] - 2s 3ms/step - loss: 0.6015 - accuracy: 0.7318 - val_loss: 0.5247 - val_accuracy: 0.7345
Epoch 2/10
492/492 [==============================] - 2s 3ms/step - loss: 0.5439 - accuracy: 0.7416 - val_loss: 0.4802 - val_accuracy: 0.7345
Epoch 3/10
492/492 [==============================] - 1s 3ms/step - loss: 0.5225 - accuracy: 0.7548 - val_loss: 0.4734 - val_accuracy: 0.7345
Epoch 4/10
492/492 [==============================] - 2s 4ms/step - loss: 0.5139 - accuracy: 0.7623 - val_loss: 0.4646 - val_accuracy: 0.7364
Epoch 5/10
492/492 [==============================] - 2s 3ms/step - loss: 0.5153 - accuracy: 0.7554 - val_loss: 0.4651 - val_accuracy: 0.7368
Epoch 6/10
492/492 [==============================] - 1s 3ms/step - loss: 0.5016 - accuracy: 0.7674 - val_loss: 0.4550 - val_accuracy: 0.7653
Epoch 7/10
492/492 [==============================] - 1s 3ms/step - loss: 0.5042 - accuracy: 0.7593 - val_loss: 0.4532 - val_accuracy: 0.7496
Epoch 8/10
492/492 [==============================] - 1s 3ms/step - loss: 0.4989 - accuracy: 0.7633 - val_loss: 0.4536 - val_accuracy: 0.7620
Epoch 9/10
492/492 [==============================] - 1s 3ms/step - loss: 0.5035 - accuracy: 0.7609 - val_loss: 0.4553 - val_accuracy: 0.7539
Epoch 10/10
492/492 [==============================] - 1s 3ms/step - loss: 0.5023 - accuracy: 0.7588 - val_loss: 0.4567 - val_accuracy: 0.7544
B. 다중 분류 DNN 구성
· 39개 feature를 받는 input layer
· unit 5개 hidden layer
· dropout
· unit 4개 hidden layer
· dropout
· unit 2개 output layer : 이진분류를 softmax 다중분류 형태로 구성
다중분류
· [출처] https://www.educba.com/dnn-neural-network/
[29]:
model = Sequential()
model.add(Dense(5, activation='relu', input_shape=(39,)))
model.add(Dropout(0.3))
model.add(Dense(4, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(2, activation='softmax'))
모델 확인
Model: "sequential_2"
dense_6 (Dense) (None, 5) 200
dropout_2 (Dropout) (None, 5) 0
dense_7 (Dense) (None, 4) 24
dropout_3 (Dropout) (None, 4) 0
dense_8 (Dense) (None, 2) 10
Total params: 234
Trainable params: 234
Non-trainable params: 0
모델 컴파일 – 다중 분류 모델
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
모델 학습
[32]:
history = model.fit(X_train, y_train,
validation_data=(X_test, y_test),
epochs=20,
batch_size=16)
Epoch 1/20
308/308 [==============================] - 2s 4ms/step - loss: 0.5507 - accuracy: 0.7322 - val_loss: 0.4708 - val_accuracy: 0.7345
Epoch 2/20
308/308 [==============================] - 1s 4ms/step - loss: 0.5011 - accuracy: 0.7351 - val_loss: 0.4540 - val_accuracy: 0.7345
Epoch 3/20
308/308 [==============================] - 1s 4ms/step - loss: 0.4894 - accuracy: 0.7338 - val_loss: 0.4482 - val_accuracy: 0.7345
Epoch 4/20
308/308 [==============================] - 1s 4ms/step - loss: 0.4916 - accuracy: 0.7344 - val_loss: 0.4455 - val_accuracy: 0.7345
Epoch 5/20
308/308 [==============================] - 1s 4ms/step - loss: 0.4850 - accuracy: 0.7340 - val_loss: 0.4420 - val_accuracy: 0.7345
Epoch 6/20
308/308 [==============================] - 1s 4ms/step - loss: 0.4899 - accuracy: 0.7342 - val_loss: 0.4447 - val_accuracy: 0.7345
Epoch 7/20
308/308 [==============================] - 1s 4ms/step - loss: 0.4749 - accuracy: 0.7344 - val_loss: 0.4360 - val_accuracy: 0.7345
Epoch 8/20
308/308 [==============================] - 1s 4ms/step - loss: 0.4779 - accuracy: 0.7342 - val_loss: 0.4374 - val_accuracy: 0.7345
Epoch 9/20
308/308 [==============================] - 1s 4ms/step - loss: 0.4744 - accuracy: 0.7340 - val_loss: 0.4358 - val_accuracy: 0.7345
Epoch 10/20
308/308 [==============================] - 1s 4ms/step - loss: 0.4808 - accuracy: 0.7344 - val_loss: 0.4379 - val_accuracy: 0.7345
Epoch 11/20
308/308 [==============================] - 1s 4ms/step - loss: 0.4761 - accuracy: 0.7344 - val_loss: 0.4379 - val_accuracy: 0.7345
Epoch 12/20
308/308 [==============================] - 1s 4ms/step - loss: 0.4679 - accuracy: 0.7344 - val_loss: 0.4340 - val_accuracy: 0.7345
Epoch 13/20
308/308 [==============================] - 1s 3ms/step - loss: 0.4712 - accuracy: 0.7617 - val_loss: 0.4358 - val_accuracy: 0.7824
Epoch 14/20
308/308 [==============================] - 1s 3ms/step - loss: 0.4790 - accuracy: 0.7489 - val_loss: 0.4388 - val_accuracy: 0.7691
Epoch 15/20
308/308 [==============================] - 1s 3ms/step - loss: 0.4760 - accuracy: 0.7637 - val_loss: 0.4360 - val_accuracy: 0.7345
Epoch 16/20
308/308 [==============================] - 1s 3ms/step - loss: 0.4759 - accuracy: 0.7527 - val_loss: 0.4391 - val_accuracy: 0.7710
Epoch 17/20
308/308 [==============================] - 1s 3ms/step - loss: 0.4739 - accuracy: 0.7562 - val_loss: 0.4375 - val_accuracy: 0.7904
Epoch 18/20
308/308 [==============================] - 1s 3ms/step - loss: 0.4668 - accuracy: 0.7605 - val_loss: 0.4348 - val_accuracy: 0.7985
Epoch 19/20
308/308 [==============================] - 1s 3ms/step - loss: 0.4695 - accuracy: 0.7674 - val_loss: 0.4330 - val_accuracy: 0.7899
Epoch 20/20
308/308 [==============================] - 1s 3ms/step - loss: 0.4758 - accuracy: 0.7684 - val_loss: 0.4345 - val_accuracy: 0.7876
Callback : 조기종료, 모델 저장
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
early_stop = EarlyStopping(monitor='val_loss', mode='min',
verbose=1, patience=5)
check_point = ModelCheckpoint('best_model.h5', verbose=1,
monitor='val_loss', mode='min', save_best_only=True)
모델 학습
history = model.fit(x=X_train, y=y_train,
epochs=50 , batch_size=20,
validation_data=(X_test, y_test), verbose=1,
callbacks=[early_stop, check_point])
Epoch 1/50
231/246 [===========================>..] - ETA: 0s - loss: 0.4702 - accuracy: 0.7712
Epoch 1: val_loss improved from inf to 0.43752, saving model to best_model.h5
246/246 [==============================] - 1s 2ms/step - loss: 0.4726 - accuracy: 0.7686 - val_loss: 0.4375 - val_accuracy: 0.7956
Epoch 2/50
241/246 [============================>.] - ETA: 0s - loss: 0.4734 - accuracy: 0.7587
Epoch 2: val_loss improved from 0.43752 to 0.43419, saving model to best_model.h5
246/246 [==============================] - 1s 2ms/step - loss: 0.4727 - accuracy: 0.7603 - val_loss: 0.4342 - val_accuracy: 0.7966
Epoch 3/50
239/246 [============================>.] - ETA: 0s - loss: 0.4754 - accuracy: 0.7692
Epoch 3: val_loss did not improve from 0.43419
246/246 [==============================] - 1s 2ms/step - loss: 0.4733 - accuracy: 0.7698 - val_loss: 0.4343 - val_accuracy: 0.7923
Epoch 4/50
241/246 [============================>.] - ETA: 0s - loss: 0.4699 - accuracy: 0.7645
Epoch 4: val_loss improved from 0.43419 to 0.43329, saving model to best_model.h5
246/246 [==============================] - 1s 2ms/step - loss: 0.4688 - accuracy: 0.7647 - val_loss: 0.4333 - val_accuracy: 0.7975
Epoch 5/50
218/246 [=========================>....] - ETA: 0s - loss: 0.4628 - accuracy: 0.7743
Epoch 5: val_loss improved from 0.43329 to 0.43201, saving model to best_model.h5
246/246 [==============================] - 1s 2ms/step - loss: 0.4633 - accuracy: 0.7694 - val_loss: 0.4320 - val_accuracy: 0.7980
Epoch 6/50
234/246 [===========================>..] - ETA: 0s - loss: 0.4799 - accuracy: 0.7650
Epoch 6: val_loss did not improve from 0.43201
246/246 [==============================] - 1s 2ms/step - loss: 0.4783 - accuracy: 0.7629 - val_loss: 0.4376 - val_accuracy: 0.7980
Epoch 7/50
240/246 [============================>.] - ETA: 0s - loss: 0.4704 - accuracy: 0.7675
Epoch 7: val_loss did not improve from 0.43201
246/246 [==============================] - 1s 2ms/step - loss: 0.4691 - accuracy: 0.7676 - val_loss: 0.4331 - val_accuracy: 0.7961
Epoch 8/50
226/246 [==========================>...] - ETA: 0s - loss: 0.4679 - accuracy: 0.7710
Epoch 8: val_loss did not improve from 0.43201
246/246 [==============================] - 1s 3ms/step - loss: 0.4706 - accuracy: 0.7700 - val_loss: 0.4369 - val_accuracy: 0.7985
Epoch 9/50
222/246 [==========================>...] - ETA: 0s - loss: 0.4734 - accuracy: 0.7716
Epoch 9: val_loss did not improve from 0.43201
246/246 [==============================] - 1s 2ms/step - loss: 0.4734 - accuracy: 0.7674 - val_loss: 0.4357 - val_accuracy: 0.7980
Epoch 10/50
221/246 [=========================>....] - ETA: 0s - loss: 0.4707 - accuracy: 0.7667
Epoch 10: val_loss did not improve from 0.43201
246/246 [==============================] - 1s 2ms/step - loss: 0.4718 - accuracy: 0.7670 - val_loss: 0.4337 - val_accuracy: 0.7999
Epoch 10: early stopping
⑧ 모델 성능 평가
33:
losses = pd.DataFrame(model.history.history)
losses.head()

       loss  accuracy  val_loss  val_accuracy
0  0.550660  0.732208  0.470834      0.734471
1  0.501130  0.735055  0.453986      0.734471
2  0.489383  0.733835  0.448154      0.734471
3  0.491561  0.734445  0.445530      0.734471
4  0.484967  0.734038  0.441958      0.734471
성능 시각화
35:
losses[['loss','val_loss']].plot()
36:
losses[['loss','val_loss', 'accuracy','val_accuracy']].plot()
[37]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend(['acc', 'val_acc'])
plt.show()
성능 평가
38:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report
39:
pred = model.predict(X_test)
40:
pred.shape
(2109, 2)
41:
y_pred = np.argmax(pred, axis=1)
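softmax 출력(클래스별 확률)에서 argmax로 예측 클래스를 얻는 과정을 작은 가상 배열로 확인해 보는 예시입니다.

```python
# 클래스별 확률에서 확률이 가장 큰 클래스의 인덱스를 예측값으로 선택
import numpy as np

pred_demo = np.array([[0.9, 0.1],    # 0번 클래스 확률이 더 큼 -> 0
                      [0.3, 0.7]])   # 1번 클래스 확률이 더 큼 -> 1
y_pred_demo = np.argmax(pred_demo, axis=1)
print(y_pred_demo)   # [0 1]
```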
42:
accuracy_score(y_test, y_pred)
43:
recall_score(y_test, y_pred)
44:
print(classification_report(y_test, y_pred))
precision recall f1-score support
0 0.80 0.95 0.87 1549
1 0.70 0.35 0.46 560
accuracy 0.79 2109
macro avg 0.75 0.65 0.67 2109
weighted avg 0.77 0.79 0.76 2109
성능을 향상할 수 있는 방법은 여러 가지가 있습니다.
· DNN 하이퍼파라미터를 수정하면서 성능이 향상되는지 확인
· 데이터를 줄이거나 늘리고, Feature(컬럼)를 늘리거나 줄이는 식의 Feature Engineering 방법
Feature Engineering을 통한 성능향상
· 불균형 Churn 데이터 균형 맞추기 : OverSampling, UnderSampling
· OverSampling 기법 : SMOTE(Synthetic Minority Over-sampling Technique)
SMOTE
imbalanced-learn 패키지 설치
· imbalanced data 문제를 해결하기 위한 다양한 샘플링 방법을 구현한 파이썬 패키지
45:
!pip install -U imbalanced-learn
Successfully installed imbalanced-learn-0.8.1
SMOTE 함수 이용하여 Oversampling
[46]:
from imblearn.over_sampling import SMOTE
47:
smote = SMOTE(random_state=0)
X_train_over, y_train_over = smote.fit_resample(X_train, y_train)
[48]:
print('SMOTE 적용 전 학습용 피처/레이블 데이터 세트: ', X_train.shape, y_train.shape)
print('SMOTE 적용 후 학습용 피처/레이블 데이터 세트: ', X_train_over.shape, y_train_over.shape)
SMOTE 적용 전 학습용 피처/레이블 데이터 세트: (4918, 39) (4918,)
SMOTE 적용 후 학습용 피처/레이블 데이터 세트: (7224, 39) (7224,)
[49]:
pd.Series(y_train_over).value_counts()
[49]:
1 3612
0 3612
dtype: int64
데이터 정규화
[50]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_over = scaler.transform(X_train_over)
X_test = scaler.transform(X_test)
[51]:
X_train_over.shape, y_train_over.shape, X_test.shape, y_test.shape
[51]:
((7224, 39), (7224,), (2109, 39), (2109,))
모델 개발(Creating the Model)
52:
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(39,)))
model.add(Dropout(0.3))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(16, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(2, activation='softmax'))
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
[55]:
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_accuracy', mode='max',
verbose=1, patience=5)
[56]:
from tensorflow.keras.callbacks import ModelCheckpoint
check_point = ModelCheckpoint('best_model.h5', verbose=1,
monitor='val_loss', mode='min',
save_best_only=True)
[57]:
history = model.fit(x=X_train_over, y=y_train_over,
epochs=50 , batch_size=32,
validation_data=(X_test, y_test), verbose=1,
callbacks=[early_stop, check_point])
Epoch 1/50
226/226 [==============================] - 2s 6ms/step - loss: 0.5762 - accuracy: 0.7006 - val_loss: 0.4876 - val_accuracy: 0.7307
Epoch 00001: val_loss improved from inf to 0.48763, saving model to best_model.h5
Epoch 2/50
226/226 [==============================] - 1s 5ms/step - loss: 0.5148 - accuracy: 0.7546 - val_loss: 0.4987 - val_accuracy: 0.7250
Epoch 00002: val_loss did not improve from 0.48763
Epoch 3/50
226/226 [==============================] - 1s 5ms/step - loss: 0.5035 - accuracy: 0.7625 - val_loss: 0.4921 - val_accuracy: 0.7297
Epoch 00003: val_loss did not improve from 0.48763
Epoch 4/50
226/226 [==============================] - 1s 5ms/step - loss: 0.4912 - accuracy: 0.7667 - val_loss: 0.4960 - val_accuracy: 0.7283
Epoch 00004: val_loss did not improve from 0.48763
Epoch 5/50
226/226 [==============================] - 1s 6ms/step - loss: 0.4875 - accuracy: 0.7655 - val_loss: 0.4844 - val_accuracy: 0.7468
Epoch 00005: val_loss improved from 0.48763 to 0.48436, saving model to best_model.h5
Epoch 6/50
226/226 [==============================] - 1s 5ms/step - loss: 0.4808 - accuracy: 0.7744 - val_loss: 0.4664 - val_accuracy: 0.7525
Epoch 00006: val_loss improved from 0.48436 to 0.46640, saving model to best_model.h5
Epoch 7/50
226/226 [==============================] - 1s 5ms/step - loss: 0.4730 - accuracy: 0.7825 - val_loss: 0.5007 - val_accuracy: 0.7255
Epoch 00007: val_loss did not improve from 0.46640
Epoch 8/50
226/226 [==============================] - 1s 5ms/step - loss: 0.4731 - accuracy: 0.7777 - val_loss: 0.4724 - val_accuracy: 0.7530
Epoch 00008: val_loss did not improve from 0.46640
Epoch 9/50
226/226 [==============================] - 1s 5ms/step - loss: 0.4676 - accuracy: 0.7781 - val_loss: 0.4657 - val_accuracy: 0.7639
Epoch 00009: val_loss improved from 0.46640 to 0.46568, saving model to best_model.h5
Epoch 10/50
226/226 [==============================] - 1s 5ms/step - loss: 0.4566 - accuracy: 0.7879 - val_loss: 0.5155 - val_accuracy: 0.7141
Epoch 00010: val_loss did not improve from 0.46568
Epoch 11/50
226/226 [==============================] - 1s 5ms/step - loss: 0.4582 - accuracy: 0.7883 - val_loss: 0.5009 - val_accuracy: 0.7283
Epoch 00011: val_loss did not improve from 0.46568
Epoch 12/50
226/226 [==============================] - 1s 5ms/step - loss: 0.4528 - accuracy: 0.7926 - val_loss: 0.4852 - val_accuracy: 0.7425
Epoch 00012: val_loss did not improve from 0.46568
Epoch 13/50
226/226 [==============================] - 1s 5ms/step - loss: 0.4476 - accuracy: 0.7928 - val_loss: 0.4655 - val_accuracy: 0.7520
Epoch 00013: val_loss improved from 0.46568 to 0.46549, saving model to best_model.h5
Epoch 14/50
226/226 [==============================] - 1s 5ms/step - loss: 0.4446 - accuracy: 0.7935 - val_loss: 0.4678 - val_accuracy: 0.7492
Epoch 00014: val_loss did not improve from 0.46549
Epoch 00014: early stopping
Evaluating Model Performance
[58]:
losses = pd.DataFrame(model.history.history)
losses.head()
[58]:
       loss  accuracy  val_loss  val_accuracy
0  0.576232  0.700581  0.487635      0.730678
1  0.514756  0.754568  0.498698      0.724988
2  0.503462  0.762458  0.492146      0.729730
3  0.491154  0.766750  0.496050      0.728307
4  0.487461  0.765504  0.484363      0.746799
Visualizing Performance
[60]:
losses[['loss','val_loss']].plot()
[61]:
losses[['loss','val_loss', 'accuracy','val_accuracy']].plot()
[62]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend(['acc', 'val_acc'])
plt.show()
Performance Evaluation
[63]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report
[64]:
pred = model.predict(X_test)
[65]:
pred.shape
[65]:
(2109, 2)
[66]:
y_pred = np.argmax(pred, axis=1)
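Because the output layer is a 2-unit softmax, `model.predict` returns one probability per class for each row — hence the `(2109, 2)` shape — and `np.argmax(..., axis=1)` picks the column with the higher probability as the predicted label. A toy illustration with hypothetical probabilities:

```python
import numpy as np

# Hypothetical softmax outputs for 3 customers: columns are P(class 0), P(class 1)
pred = np.array([[0.80, 0.20],
                 [0.30, 0.70],
                 [0.55, 0.45]])

# Index of the larger probability in each row becomes the predicted class
y_pred = np.argmax(pred, axis=1)
print(y_pred)  # [0 1 0]
```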
[67]:
accuracy_score(y_test, y_pred)
[68]:
recall_score(y_test, y_pred)
[69]:
print(classification_report(y_test, y_pred))
precision recall f1-score support
0 0.90 0.74 0.81 1549
1 0.52 0.76 0.62 560
accuracy 0.75 2109
macro avg 0.71 0.75 0.72 2109
weighted avg 0.80 0.75 0.76 2109
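The class-1 recall of 0.76 in the report means that 76% of the actual churners were caught. Per class, recall = TP / (TP + FN), which can be read off the confusion matrix. A small sketch with made-up labels (the arrays here are illustrative, not the notebook's data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical true and predicted labels (1 = churn)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0])

cm = confusion_matrix(y_true, y_pred)   # rows: true class, cols: predicted class
tn, fp, fn, tp = cm.ravel()

recall_manual = tp / (tp + fn)          # 3 / (3 + 1) = 0.75
print(recall_manual, recall_score(y_true, y_pred))
```

In a churn setting, recall on the positive class is usually the metric to optimize: a false negative (a churner the model misses) costs more than a false positive that triggers an unnecessary retention offer.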
Summary of What We Learned
Deep Neural Network (DNN) modeling process:
· Load the data
· Preprocess the data
· Split into train and test sets
· Normalize the data
· Build and train the DNN model
Recall is still poor. How can we improve it?
· Feature engineering: reshape the data so the model can perform better
· Fix the class-imbalance problem: under-sampling, over-sampling
· Over-sampling technique: SMOTE