Credit Card User Delinquency Prediction EDA_Part3

지리산근육곰 · August 30, 2021


Dacon Credit Card User Delinquency Prediction (series, part 3 of 4)

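7. Recoding Variable Values

Among the current variables, setting aside credit grade 2, female customers have higher credit grades than male customers.

Women are coded as 1 and men as 0.
For cars and real estate, owning one is coded as 1.
For children, it seems best to code four or more children as 4.
Age groups are ordered as an ordinal variable, from the 20s through the 60s.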

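7.1 Setting the women flag

# Start from all zeros, then set 1 for female customers.
# .loc is used instead of chained indexing so the assignment reliably writes to train.
train['women'] = np.zeros(len(train))
train.loc[train['gender']=='F', 'women'] = 1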

7.2 Setting the real estate and car flags

# Car
train['yesCar'] = np.zeros(len(train))
train.loc[train['car']=='Y', 'yesCar'] = 1
# Real estate
train['yesReality'] = np.zeros(len(train))
train.loc[train['reality']=='Y', 'yesReality'] = 1
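
The three flag columns in 7.1 and 7.2 follow the same pattern, so they could also be built in one pass. The following is a minimal sketch on a toy frame, not the notebook's own code: the `FLAG_RULES` mapping and the `add_binary_flags` helper are hypothetical names introduced here, and the flags come out as integers rather than the floats that `np.zeros` produces.

```python
import pandas as pd

# Hypothetical mapping: new column -> (source column, value that means "yes").
FLAG_RULES = {
    "women": ("gender", "F"),
    "yesCar": ("car", "Y"),
    "yesReality": ("reality", "Y"),
}


def add_binary_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with one 0/1 flag column per entry in FLAG_RULES."""
    out = df.copy()
    for new_col, (src_col, yes_value) in FLAG_RULES.items():
        # The comparison yields a boolean Series; astype(int) turns it into 0/1.
        out[new_col] = (out[src_col] == yes_value).astype(int)
    return out


# Toy data standing in for the real training set.
toy = pd.DataFrame({
    "gender": ["F", "M", "F"],
    "car": ["N", "Y", "Y"],
    "reality": ["Y", "Y", "N"],
})
flagged = add_binary_flags(toy)
print(flagged[["women", "yesCar", "yesReality"]])
```

Because the helper returns a copy, the same function can be applied to both the training and testing frames without touching the originals.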

7.2.1 Dropping the original variables

train.drop(columns=['gender', 'car', 'reality'], inplace=True)

7.3 Handling four or more children

# Cap child_num at 4: any value above 3 becomes 4.
train.loc[train['child_num'] > 3, 'child_num'] = 4
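
The same cap can be expressed with `Series.clip`, which may read more directly as "no value above 4". A small sketch on made-up values (the `child_num_capped` column name is only for illustration):

```python
import pandas as pd

# Toy values, including a few implausibly large counts.
toy = pd.DataFrame({"child_num": [0, 1, 3, 4, 7, 19]})

# clip(upper=4) leaves values up to 4 unchanged and replaces anything larger with 4.
toy["child_num_capped"] = toy["child_num"].clip(upper=4)
print(toy)
```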

7.4 Encoding age groups numerically

# Under 30 -> 0, 30s -> 1, 40s -> 2, 50s -> 3, 60 and over -> 4
train.loc[train['age']<30, 'age_group'] = 0

train.loc[(train['age']>=30) & (train['age']<40), 'age_group'] = 1

train.loc[(train['age']>=40) & (train['age']<50), 'age_group'] = 2

train.loc[(train['age']>=50) & (train['age']<60), 'age_group'] = 3

train.loc[train['age']>=60, 'age_group'] = 4

train.describe(include='all')
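
The five age comparisons above amount to binning with left-closed intervals, which `pd.cut` can do in a single call. This is a sketch on toy ages, assuming the same boundaries (30, 40, 50, 60); it is an alternative formulation, not the code used in this series.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"age": [22, 29, 30, 39, 40, 59, 60, 68]})

# Left-closed bins: [-inf, 30), [30, 40), [40, 50), [50, 60), [60, inf).
edges = [-np.inf, 30, 40, 50, 60, np.inf]

# right=False makes each bin include its lower edge and exclude its upper edge;
# labels=False returns the integer bin position (0..4) instead of interval labels.
toy["age_group"] = pd.cut(toy["age"], bins=edges, right=False, labels=False)
print(toy)
```

Setting `right=False` matters here: with the default right-closed bins, an age of exactly 30 would fall into group 0 instead of group 1.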

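8. Transforming the Testing Set

8.1 Loading the testing set

# `files` is assumed to be google.colab.files, imported in an earlier part of this series
myfile2 = files.upload()

test = pd.read_csv('test.csv')

# Drop unneeded variables
test.drop(columns=['index','FLAG_MOBIL','phone','email','work_phone', 'edu_type'], inplace=True)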

8.2 Setting income-based variables

# When assigning quartiles, quintiles, and deciles, use cut points from the training set, not from the test set.
# Create variable 'income_quartile'
test['income_quartile'] = np.zeros(len(test))
# Create variable 'income_quintile'
test['income_quintile'] = np.zeros(len(test))
# Create variable 'income_decile'
test['income_decile'] = np.zeros(len(test))

# Assign values to income_quartile (.loc avoids chained assignment)
test.loc[test['income_total'] < train['income_total'].quantile(0.25), 'income_quartile'] = 1

test.loc[(test['income_total'] >= train['income_total'].quantile(0.25)) &
         (test['income_total'] < train['income_total'].quantile(0.5)), 'income_quartile'] = 2

test.loc[(test['income_total'] >= train['income_total'].quantile(0.5)) &
         (test['income_total'] < train['income_total'].quantile(0.75)), 'income_quartile'] = 3

test.loc[test['income_total'] >= train['income_total'].quantile(0.75), 'income_quartile'] = 4

# Assign values to income_quintile (.loc avoids chained assignment)
test.loc[test['income_total'] < train['income_total'].quantile(0.2), 'income_quintile'] = 1

test.loc[(test['income_total'] >= train['income_total'].quantile(0.2)) &
         (test['income_total'] < train['income_total'].quantile(0.4)), 'income_quintile'] = 2

test.loc[(test['income_total'] >= train['income_total'].quantile(0.4)) &
         (test['income_total'] < train['income_total'].quantile(0.6)), 'income_quintile'] = 3

test.loc[(test['income_total'] >= train['income_total'].quantile(0.6)) &
         (test['income_total'] < train['income_total'].quantile(0.8)), 'income_quintile'] = 4

test.loc[test['income_total'] >= train['income_total'].quantile(0.8), 'income_quintile'] = 5

# Assign values to income_decile (.loc avoids chained assignment)
test.loc[test['income_total'] < train['income_total'].quantile(0.1), 'income_decile'] = 1

test.loc[(test['income_total'] >= train['income_total'].quantile(0.1)) &
         (test['income_total'] < train['income_total'].quantile(0.2)), 'income_decile'] = 2

test.loc[(test['income_total'] >= train['income_total'].quantile(0.2)) &
         (test['income_total'] < train['income_total'].quantile(0.3)), 'income_decile'] = 3

test.loc[(test['income_total'] >= train['income_total'].quantile(0.3)) &
         (test['income_total'] < train['income_total'].quantile(0.4)), 'income_decile'] = 4

test.loc[(test['income_total'] >= train['income_total'].quantile(0.4)) &
         (test['income_total'] < train['income_total'].quantile(0.5)), 'income_decile'] = 5

test.loc[(test['income_total'] >= train['income_total'].quantile(0.5)) &
         (test['income_total'] < train['income_total'].quantile(0.6)), 'income_decile'] = 6

test.loc[(test['income_total'] >= train['income_total'].quantile(0.6)) &
         (test['income_total'] < train['income_total'].quantile(0.7)), 'income_decile'] = 7

test.loc[(test['income_total'] >= train['income_total'].quantile(0.7)) &
         (test['income_total'] < train['income_total'].quantile(0.8)), 'income_decile'] = 8

test.loc[(test['income_total'] >= train['income_total'].quantile(0.8)) &
         (test['income_total'] < train['income_total'].quantile(0.9)), 'income_decile'] = 9

test.loc[test['income_total'] >= train['income_total'].quantile(0.9), 'income_decile'] = 10
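
The quartile, quintile, and decile assignments all follow one rule: compute cut points on the training incomes, then count how many cut points each test income reaches. A compact sketch of that rule is below. The helper name `bin_by_train_quantiles` and the toy incomes are assumptions for illustration, and the result is an integer array rather than the float columns created above.

```python
import numpy as np
import pandas as pd


def bin_by_train_quantiles(train_values, test_values, n_bins):
    """Label test_values 1..n_bins using cut points computed from train_values."""
    # Interior quantile levels, e.g. 0.25, 0.5, 0.75 for n_bins=4.
    probs = np.arange(1, n_bins) / n_bins
    edges = np.quantile(np.asarray(train_values, dtype=float), probs)
    # side="right" sends a value equal to a cut point into the higher bin,
    # which matches the ">= lower and < upper" comparisons.
    positions = np.searchsorted(edges, np.asarray(test_values, dtype=float), side="right")
    return positions + 1


# Toy data: the cut points come from toy_train only, never from toy_test.
toy_train = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90], name="income_total")
toy_test = pd.Series([5, 30, 50, 70, 200], name="income_total")

quartile = bin_by_train_quantiles(toy_train, toy_test, 4)
decile = bin_by_train_quantiles(toy_train, toy_test, 10)
print(quartile)
print(decile)
```

Taking the cut points from the training set keeps a given bin label tied to the same income range in both sets, which is the point the comment at the top of 8.2 makes.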

8.3 Setting age groups

# Create the age variable
# Multiply DAYS_BIRTH by -1, floor-divide by 365, then add 1 to the quotient
test['age'] = (test['DAYS_BIRTH'] * (-1))//365+1
# Create the age_group variable
test['age_group'] = np.zeros(len(test))

# Split into under 30, 30-39, 40-49, 50-59, and 60+ (the same cut points as the training set)
test.loc[test['age']<30, 'age_group'] = 0

test.loc[(test['age']>=30) & (test['age']<40), 'age_group'] = 1

test.loc[(test['age']>=40) & (test['age']<50), 'age_group'] = 2

test.loc[(test['age']>=50) & (test['age']<60), 'age_group'] = 3

test.loc[test['age']>=60, 'age_group'] = 4
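
A short worked example may help with the age formula. Because of the `+1`, the result is the year of life a person is currently in rather than the number of completed years. The function name `age_from_days_birth` and the input values are made up for illustration.

```python
def age_from_days_birth(days_birth: int) -> int:
    """Apply the same arithmetic as the age column: flip sign, floor-divide by 365, add 1."""
    # DAYS_BIRTH is negative in the data, so multiplying by -1 gives days lived.
    return (days_birth * -1) // 365 + 1


# 10949 days is one day short of 30 * 365, so the quotient is 29 and the age is 30.
print(age_from_days_birth(-10949))  # 30
# Exactly 30 * 365 days gives quotient 30, so the age is 31.
print(age_from_days_birth(-10950))  # 31
```

With this definition, someone one day short of 30 × 365 days already has an age of 30 and so falls into group 1, not group 0.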

8.4 Years of credit card use

# Create years of credit card use
# Multiply begin_month by -1, divide by 12, and round down
test['used_years'] = test['begin_month']*(-1)//12

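8.5 Setting years worked

# Create the worked_year variable
test['worked_year'] = test['DAYS_EMPLOYED']*(-1)

# Assign -365 to people who are not employed (their value is negative after the sign flip)
test.loc[test['worked_year']<0, 'worked_year'] = -365

# Floor-divide worked_year by 365 to get years worked
# People with no work history end up at -1
test['worked_year'] = test['worked_year']//365

# Label the unemployed, including pensioners who are not working, as 'Unempolyed'
# (the label's spelling is kept exactly as written in the original code;
#  the second line already covers every row matched by the first)
test.loc[(test.worked_year==-1)&(test.income_type=='Pensioner'), 'occyp_type'] = 'Unempolyed'
test.loc[test.worked_year==-1, 'occyp_type'] = 'Unempolyed'
# Drop rows with missing values
test.dropna(axis=0, inplace=True)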
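
To see how the `-365` step turns non-employed rows into `-1`, here is a small sketch on toy values. The positive input `365243` is used only as an illustrative "not employed" value and is an assumption here, and `Series.mask` stands in for the boolean-indexed assignment.

```python
import pandas as pd

# Toy data: negative values are days employed; the positive value marks "not employed".
toy = pd.DataFrame({"DAYS_EMPLOYED": [-3650, -364, -365, 365243]})

# Flip the sign so that employed rows become positive day counts.
worked = toy["DAYS_EMPLOYED"] * -1

# mask replaces entries where the condition is True, here the non-employed rows.
worked = worked.mask(worked < 0, -365)

# Floor division: 3650 // 365 = 10, 364 // 365 = 0, 365 // 365 = 1, -365 // 365 = -1.
toy["worked_year"] = worked // 365
print(toy)
```

Note that anyone with less than a full year of employment gets `0`, which stays distinct from the `-1` reserved for people with no employment.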

8.6 Recoding gender, car, and real estate, then dropping

# Gender
test['women'] = np.zeros(len(test))
test.loc[test['gender']=='F', 'women'] = 1
# Car
test['yesCar'] = np.zeros(len(test))
test.loc[test['car']=='Y', 'yesCar'] = 1
# Real estate
test['yesReality'] = np.zeros(len(test))
test.loc[test['reality']=='Y', 'yesReality'] = 1

# Drop the original columns
test.drop(columns=['gender', 'car', 'reality'], inplace=True)

8.7 Handling four or more children

# Cap child_num at 4, as was done for the training set
test.loc[test['child_num'] > 3, 'child_num'] = 4

9. Saving the Training & Testing Sets

train.to_csv("newTrain.csv", sep=',',na_rep='NaN')
test.to_csv("newTest.csv", sep=',',na_rep='NaN')
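
Since `to_csv` is called without `index=False`, the row index is written as the first column of each file. A sketch of a round trip, using an in-memory buffer and a toy frame instead of the real files, shows how `index_col=0` restores it when reading the data back:

```python
import io

import pandas as pd

# Toy frame with one missing value, standing in for the processed data.
toy = pd.DataFrame({"women": [1.0, 0.0], "occyp_type": ["Unempolyed", None]})

# Write with the same options as above; the index becomes the first CSV column.
buffer = io.StringIO()
toy.to_csv(buffer, sep=",", na_rep="NaN")

# Read it back, telling pandas that the first column is the index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

Without `index_col=0`, the saved index would come back as an extra unnamed column.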
