2025.04.22 Main Camp, Day 44

민동 · April 22, 2025

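Customer Finances (Financial Behavior)

DTI: debt level relative to income → good for separating clusters by financial soundness
Credit card utilization (CU): whether the credit limit is overused → effective for distinguishing credit risk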

# 1. Credit card utilization (CU): whether the credit limit is overused → effective for distinguishing credit risk
# Total transaction amount per customer
amount_sum = df_transactions.groupby('client_id')['amount'].sum()
# Total credit card limit per customer
limit_sum = cards_df.groupby('client_id')['credit_limit'].sum()
cu_df = pd.merge(amount_sum, limit_sum, left_index=True, right_index=True)
cu_df['credit_utilization'] = (cu_df['amount']/cu_df['credit_limit']).round(2)
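
DTI is listed above but no code is shown for it. Here is a minimal sketch on toy data, assuming the user table has `total_debt` and `yearly_income` columns (both column names and the `users_demo` frame are assumptions for illustration, not taken from the notes):

```python
import numpy as np
import pandas as pd

# Toy user table; `total_debt` and `yearly_income` are assumed column names.
users_demo = pd.DataFrame({
    "id": [1, 2, 3],
    "total_debt": [20000.0, 90000.0, 5000.0],
    "yearly_income": [40000.0, 60000.0, 0.0],
})

# Treat zero income as missing so the division yields NaN instead of inf.
income = users_demo["yearly_income"].replace(0, np.nan)

# DTI = total debt / yearly income
users_demo["dti"] = (users_demo["total_debt"] / income).round(2)
```

The same zero-denominator guard would probably be worth applying to `credit_limit` in the CU calculation as well.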

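Average transaction amount: reveals spending tendency (frugal type, big-spender type, etc.)

This one already seems to be covered by the grouping step.

Transaction volatility: spending consistency (stable vs. impulsive consumer groups)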

# 2. Transaction volatility: spending consistency (stable vs. impulsive consumer groups)
df2_transactions = df_transactions.groupby('client_id')['amount'].agg(['mean','std']).reset_index().round(2)
# Coefficient of variation = std / mean
df2_transactions['trans_stats'] = (df2_transactions['std']/df2_transactions['mean']).round(2)
df2_transactions

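What is transaction volatility?
Transaction volatility ( txn_volatility = txn_amt_std / txn_amt_mean ) is the coefficient of variation (CV) of a customer's transaction amounts.

It normalizes the "spread" of the data: not the absolute amount, but how much spending fluctuates per 1 won of average spend.

Why does it matter for credit scoring?
- Even for two customers with the same average spending and payment size, erratic monthly or per-transaction amounts carry a higher risk of a sudden spending spike → cash shortage → delinquency
- In finance, the CV is a classic measure of "volatility risk per unit of return"
  
  (https://www.investopedia.com/terms/c/coefficientofvariation.asp?utm_source)
- The higher the cash-flow volatility, the higher the probability of default
  
  (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2649084)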
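
To make the "same average, different volatility" point concrete, here is a small sketch with two made-up customers (the `txns` frame and its values are invented for illustration only):

```python
import pandas as pd

# Two hypothetical customers with the same mean spend but different spread.
txns = pd.DataFrame({
    "client_id": ["A"] * 4 + ["B"] * 4,
    "amount": [100.0, 100.0, 100.0, 100.0, 10.0, 190.0, 20.0, 180.0],
})

stats = txns.groupby("client_id")["amount"].agg(["mean", "std"])

# Coefficient of variation: standard deviation per unit of mean spend.
stats["txn_volatility"] = stats["std"] / stats["mean"]
```

Both customers average 100 per transaction, but A has zero volatility while B's is close to 1, so an average-only feature could not tell them apart.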

Card Profile

card_age: long-term holders vs. new sign-ups

# Convert acct_open_date to datetime
cards['acct_open_date'] = pd.to_datetime(cards['acct_open_date'], format='%m/%Y', errors='coerce')

# Set the reference date (as of January 2021)
reference_date = pd.Timestamp('2021-01-01')

# Compute how long the card has been held (in years)
cards['card_age'] = ((reference_date - cards['acct_open_date']) / pd.Timedelta(days=365)).round(1)

# Card holder segments:
# 2 years or less = New,
# more than 2 and less than 5 years = regular Mid-Term customer,
# 5 years or more = Long-Term

def classify_card_age(age):
    if pd.isna(age):
        return 'Unknown'
    elif age >= 5:
        return 'Long-Term'
    elif age <= 2:
        return 'New'
    else:
        return 'Mid-Term'

cards['card_holder_type'] = cards['card_age'].apply(classify_card_age)

print(cards[['client_id', 'acct_open_date', 'card_age', 'card_holder_type']].head(30))
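
The same segmentation can also be done without a row-wise `apply`. The sketch below uses `np.select` on a toy series of ages (`ages` and `holder_type` are illustrative names), keeping the same boundaries as `classify_card_age`:

```python
import numpy as np
import pandas as pd

# Toy card ages in years, including both boundary values and a missing value.
ages = pd.Series([0.5, 2.0, 3.7, 5.0, 12.3, np.nan])

# Conditions are checked in order, mirroring the if/elif chain.
conditions = [ages.isna(), ages >= 5, ages <= 2]
labels = ["Unknown", "Long-Term", "New"]

holder_type = pd.Series(
    np.select(conditions, labels, default="Mid-Term"),
    index=ages.index,
)
```

Exactly 2 years is classified as New and exactly 5 years as Long-Term, the same as in the function version.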

~~is_dark_web_flag: whether the card was exposed to fraud~~
days_to_expire: identifies customers who need attention in the short term

# Load the card data
cards = pd.read_csv("cards_data.csv")

# Convert the expires column to datetime (aligned to the last day of the month)
cards['expires_date'] = pd.to_datetime(cards['expires'], format='%m/%Y') + pd.offsets.MonthEnd(0)

# Today's date at midnight (pandas only, so no datetime import is needed here)
today = pd.Timestamp.today().normalize()

# Compute days_to_expire (negative for cards that have already expired)
cards['days_to_expire'] = (cards['expires_date'] - today).dt.days

# Example output
cards[['card_number', 'expires', 'expires_date', 'days_to_expire']]

# Per customer, the card that expires soonest
min_expiry = cards.groupby('client_id')['days_to_expire'].min().reset_index()
min_expiry.rename(columns={'days_to_expire': 'min_days_to_expire'}, inplace=True)
min_expiry
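
One thing to note: card_age uses a fixed reference date (2021-01-01), while days_to_expire uses today's date, so its values change every time the notebook is rerun. A minimal sketch of the same calculation against a fixed snapshot date, on toy data (`cards_demo` and the choice of snapshot date are assumptions):

```python
import pandas as pd

cards_demo = pd.DataFrame({
    "client_id": [1, 1, 2],
    "expires": ["12/2022", "03/2021", "07/2024"],
})

# Assumed snapshot date, matching the one used for card_age.
reference_date = pd.Timestamp("2021-01-01")

# Parse MM/YYYY and move to the last day of that month.
cards_demo["expires_date"] = (
    pd.to_datetime(cards_demo["expires"], format="%m/%Y") + pd.offsets.MonthEnd(0)
)
cards_demo["days_to_expire"] = (cards_demo["expires_date"] - reference_date).dt.days

# Soonest expiry per customer.
min_expiry_demo = cards_demo.groupby("client_id")["days_to_expire"].min()
```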

pin_change_gap: security sensitivity or habits

# 5. pin_change_gap: security sensitivity or habits
from datetime import datetime
# cards_df.info()  # year_pin_last_changed : int

# Years since the PIN was last changed, then the per-customer average
cards['pin_age'] = datetime.today().year - cards['year_pin_last_changed']
avg_pin_age = cards.groupby('client_id')['pin_age'].mean().round(2).reset_index(name='pin_age')

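What is the PIN change gap?
- A measure of security habits; the longer the interval between changes, the higher the risk of information leakage
  
  (https://www.emvco.com/wp-content/uploads/2024/01/EMVCo-Annual-Report23_FINAL.pdf)
- EMV/PCI guidelines recommend a 12-24 month change cycle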

Usage Pattern

chip_use_ratio: preference for secure payment methods

# Derived feature chip_use_ratio: preference for secure payment methods
# 1. Check: the unique values of 'use_chip' in [df_transactions] are
# 'Swipe Transaction' (magnetic stripe), 'Online Transaction' (online payment), 'Chip Transaction' (IC chip)
# 'Chip Transaction' (IC chip) is the safest payment method in that the card is hard to clone!
# Add chip_use_ratio to check, per customer, how often the safe payment method is used

<'chip_use_ratio'>
df_transactions['is_chip'] = (df_transactions['use_chip'] == 'Chip Transaction').astype(int)
chip_ratio = df_transactions.groupby('client_id').agg(
    chip_use_ratio=('is_chip', 'mean')
).reset_index()
chip_ratio.rename(columns={'client_id': 'id'}, inplace=True)
#df3 = pd.merge(df3, chip_ratio, on='id', how='left')    run the merge only once, on the first execution!
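
To see the share of all three payment methods at once, a cross-tabulation is one option. A minimal sketch on toy data (`tx_demo` is made up; only the `use_chip` values come from the notes above):

```python
import pandas as pd

tx_demo = pd.DataFrame({
    "client_id": [1, 1, 1, 1, 2, 2],
    "use_chip": [
        "Chip Transaction", "Chip Transaction", "Swipe Transaction",
        "Online Transaction", "Online Transaction", "Online Transaction",
    ],
})

# Row-normalized counts: each row is one customer's payment-method mix.
method_share = pd.crosstab(tx_demo["client_id"], tx_demo["use_chip"], normalize="index")

# The chip column is the same quantity as chip_use_ratio.
chip_use_ratio_demo = method_share["Chip Transaction"]
```

Customer 1 pays by chip half the time, while customer 2 is online-only and gets a ratio of 0. This shows that a low chip_use_ratio can mean "mostly online" rather than "mostly magnetic stripe".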

error_rate: poor user experience or systemic problems > reworked into yearly/monthly insufficient-balance rates

# 1. First flag the transactions that have an error
# 2. Compute the error rate per customer
# 3. Merge into df3

df_transactions['is_error'] = df_transactions['errors'].notna().astype(int)

# Compute the error rate per customer
error_rate_df = df_transactions.groupby('client_id').agg(
    error_rate=('is_error', 'mean')  # error rate: errored transactions / total transactions
).reset_index()

# Rename client_id → id to match df3
error_rate_df.rename(columns={'client_id': 'id'}, inplace=True)

# Merge with df3
# df3 = pd.merge(df3, error_rate_df, on='id', how='left')

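The error rate comes out too low.
# It is the error rate relative to the total number of payments... and it will get even smaller if insufficient-balance cases are also removed
df3['error_rate'].max() : 0.1464
df3['error_rate'].min() : 0.0015
# Let's change direction instead: the number of insufficient-balance events per customer, relative to normal transactions, on a yearly and monthly basis
##################################################################################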

<New yearly/monthly insufficient-balance rate. Note: this needs date data (the per-customer date column of transactions).>
# Yearly/monthly insufficient-balance rate relative to normal transactions
# Add the date column (only once)
df_transactions['trans_date'] = transactions['date']
df_transactions['trans_date'] = pd.to_datetime(df_transactions['trans_date'])
df_transactions['trans_date']

# First and last transaction date per customer
transaction_period = df_transactions.groupby('client_id').agg(
    min_date=('trans_date', 'min'),
    max_date=('trans_date', 'max')
).reset_index()

# Compute months active (date difference divided by pd.Timedelta(days=30), i.e. a 30-day month)
transaction_period['months_active'] = ((transaction_period['max_date'] - transaction_period['min_date']) / pd.Timedelta(days=30)).round(1)
transaction_period['years_active'] = (transaction_period['months_active'] / 12).round(2)
# Converted to XX.xx months and YY.yy years, respectively
# Flag transactions whose error includes an insufficient balance
df_transactions['insufficient_flag'] = df_transactions['errors'].fillna('').str.contains('Insufficient Balance', regex=False)

# Insufficient-balance count & total transaction count per customer
insufficient_df = df_transactions.groupby('client_id').agg(
    insufficient_cnt=('insufficient_flag', 'sum'),
    total_txn=('insufficient_flag', 'size')  # number of rows (transactions) per customer
).reset_index()

# Normal transactions = total - insufficient-balance transactions
insufficient_df['normal_txn'] = insufficient_df['total_txn'] - insufficient_df['insufficient_cnt']

# Merge: add the transaction-period information
insufficient_df = pd.merge(insufficient_df, transaction_period[['client_id', 'months_active', 'years_active']], on='client_id', how='left')

# Rates per unit period (insufficient-balance events per month / per year)
insufficient_df['monthly_insufficient_rate'] = (insufficient_df['insufficient_cnt'] / insufficient_df['months_active']).round(4)
insufficient_df['yearly_insufficient_rate'] = (insufficient_df['insufficient_cnt'] / insufficient_df['years_active']).round(4)

# Merge into df3; merge only once!
insufficient_df.rename(columns={'client_id': 'id'}, inplace=True)
#df3 = pd.merge(df3, insufficient_df[['id', 'monthly_insufficient_rate', 'yearly_insufficient_rate']], on='id', how='left')

#df3[['id', 'monthly_insufficient_rate', 'yearly_insufficient_rate']].head()

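# Overall, this shows how many insufficient-balance events a customer experienced between their first and last transaction dates.
# Think of the monthly rate as "total insufficient-balance events / total months".
# In the end it gives the average number of insufficient-balance events per month [the denominator is large, e.g. 517.02 months, so the rate is small]
# Think of the yearly rate as "total insufficient-balance events / total years".
# In the end it gives the average number of insufficient-balance events per year [the denominator is small, e.g. 17.20 years, so the rate comes out larger.]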
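
A quick worked example of the monthly vs. yearly arithmetic, using one made-up customer who is active for exactly 720 days and has 10 insufficient-balance events (the `period` frame and all of its values are invented for illustration):

```python
import pandas as pd

period = pd.DataFrame({
    "client_id": [1],
    "min_date": pd.to_datetime(["2019-01-01"]),
    "max_date": pd.to_datetime(["2020-12-21"]),  # 720 days after min_date
    "insufficient_cnt": [10],
})

# Same convention as above: one month = 30 days, one year = 12 such months.
months_active = (period["max_date"] - period["min_date"]) / pd.Timedelta(days=30)
years_active = months_active / 12

period["monthly_insufficient_rate"] = (period["insufficient_cnt"] / months_active).round(4)
period["yearly_insufficient_rate"] = (period["insufficient_cnt"] / years_active).round(4)
```

With 24 months (2 years) active, the monthly rate is 10 / 24 ≈ 0.4167 and the yearly rate is 10 / 2 = 5.0. The yearly rate is just the monthly rate times 12, so the two columns carry the same information on different scales.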

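mcc_diversity: diverse spending vs. spending concentrated in specific areas

# Number of transactions per category for each customer ('상위카테고리' is the top-level category column)
client_trans_cate = trans_copy.groupby(['client_id', '상위카테고리'])['amount'].count().unstack(fill_value=0)

# Keep the category columns so the derived columns added below are excluded from later aggregates
cate_cols = client_trans_cate.columns.tolist()

# Transaction count in the most-used category
client_trans_cate['max_cate'] = client_trans_cate[cate_cols].max(axis=1)

# Total number of transactions per customer
client_trans_cate['trans_cnt'] = client_trans_cate[cate_cols].sum(axis=1)

# Share of the most-used category (%)
client_trans_cate['max_cate_ratio'] = ((client_trans_cate['max_cate'] / client_trans_cate['trans_cnt']) * 100).round(3)

# Name of each customer's most-used category
client_trans_cate['max_cate_name'] = client_trans_cate[cate_cols].idxmax(axis=1)

client_trans_cate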
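
The code above measures concentration (the share of the top category) rather than diversity as such. As a possible complement, and not something from the original notes, the sketch below computes the number of distinct categories used and the Shannon entropy of the category shares on a toy count table (`counts` and its category names are made up):

```python
import numpy as np
import pandas as pd

# Toy per-customer transaction counts by category.
counts = pd.DataFrame(
    {"food": [10, 5], "travel": [0, 5], "retail": [0, 5], "fuel": [0, 5]},
    index=pd.Index([1, 2], name="client_id"),
)

# Share of each category within a customer's transactions.
shares = counts.div(counts.sum(axis=1), axis=0)

# Number of categories with at least one transaction.
n_categories = (counts > 0).sum(axis=1)

# Shannon entropy; zero shares are replaced by 1 before the log so they contribute 0.
log_shares = np.log(shares.where(shares > 0, 1.0))
entropy = -(shares * log_shares).sum(axis=1)
```

Customer 1 spends in a single category (entropy 0), while customer 2 spreads evenly over four (entropy ln 4, the maximum for four categories).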

top_merchant_state: enables region-based cluster analysis

# Drop missing values (transactions with no state)
transactions = transactions.dropna(subset=['merchant_state'])

# Most frequent transaction state per customer
top_state = transactions.groupby(['client_id', 'merchant_state']).size().reset_index(name='count')
top_merchant_state = top_state.sort_values(['client_id', 'count'], ascending=[True, False]) \
.drop_duplicates(subset='client_id')

# Tidy the result columns
top_merchant_state = top_merchant_state[['client_id', 'merchant_state']].rename(columns={'merchant_state': 'top_merchant_state'})

# Preview
print(top_merchant_state.head())

high_trans_ratio: share of high-value payments

# Total number of transactions
total_trans = trans_copy.groupby('client_id')['amount'].count().reset_index()

# Number of high-value payments (threshold undecided: 500, 300, or something else...)
high_trans = trans_copy[trans_copy['amount'] >= 300].groupby('client_id')['amount'].count().reset_index()

# Merge total and high-value counts
high_trans_ratio = pd.merge(total_trans, high_trans, on='client_id', how='left')

# Share of high-value payments (%); customers with no high-value payment are NaN after the left merge, so fill with 0
high_trans_ratio['high_amount_ratio'] = (high_trans_ratio['amount_y'].fillna(0) / high_trans_ratio['amount_x']) * 100
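
Since the threshold is still undecided, it may help to wrap the calculation in a small function and compare candidates side by side. A minimal sketch on toy data (`tx_demo` and the helper `high_amount_ratio` are illustrative names, not part of the project code):

```python
import pandas as pd

tx_demo = pd.DataFrame({
    "client_id": [1, 1, 1, 1, 2, 2],
    "amount": [20.0, 350.0, 600.0, 30.0, 10.0, 15.0],
})

def high_amount_ratio(df, threshold):
    """Percentage of each client's transactions at or above `threshold`."""
    is_high = df["amount"] >= threshold
    # The mean of a boolean flag is the share of True values.
    return is_high.groupby(df["client_id"]).mean() * 100

ratio_300 = high_amount_ratio(tx_demo, 300)
ratio_500 = high_amount_ratio(tx_demo, 500)
```

Customer 1 drops from 50% to 25% when the threshold moves from 300 to 500, and customer 2 gets 0 directly, without needing a merge or fillna step.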

Behavioral / Socioeconomic

card_per_income: number of cards relative to income → overspending risk

test = df3.copy()

# Only compute for customers with income greater than 0 (dividing by 0 would break the ratio)
test = test[test['yearly_income'] > 0]

# Create the card_per_income feature
test['card_per_income'] = test['num_credit_cards'] / test['yearly_income']

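avg_cards_issued_per_year: tendency to have new cards issued

# Get the current year
current_year = datetime.now().year

cards = pd.read_csv('cards_data.csv')

# Account opening year (acct_open_date is in MM/YYYY format)
cards['acct_open_year'] = pd.to_datetime(cards['acct_open_date'], format='%m/%Y', errors='coerce').dt.year

# Average number of cards issued per year since the account was opened
cards['avg_cards_issued_per_year'] = cards['num_cards_issued'] / (current_year - cards['acct_open_year'] + 1)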

night_txn_ratio: nighttime activity → fraud detection or identifying specific occupational groups

# night_txn_ratio: nighttime activity → fraud detection or identifying specific occupational groups
df_transactions['hour'] = pd.to_datetime(df_transactions['date']).dt.hour
df_transactions['is_night'] = ((df_transactions['hour'] >= 22) | (df_transactions['hour'] < 6)).astype(int)

# Calculation
# Total transaction amount per customer
total_amount = df_transactions.groupby('client_id')['amount'].sum().reset_index(name='total_amount')
# Nighttime transaction amount per customer
night_amount = df_transactions[df_transactions['is_night']==1].groupby('client_id')['amount'].sum().reset_index(name='night_amount')
# Merge the two (a left merge keeps customers with no nighttime transactions; their night amount is 0)
df_time = pd.merge(total_amount,night_amount,on='client_id',how='left')
df_time['night_ratio'] = (df_time['night_amount'].fillna(0)/df_time['total_amount']).round(2)

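What is the night transaction ratio?
- The share of transaction amount (or count) occurring between 22:00 and 06:00
- Why use it?
  - Both users and banks are less vigilant during these hours, so smishing and account-takeover transactions concentrate there
  - Pix instant payments: the 22:00-06:00 limit was lowered to $210 → fraud actually decreased → a regulatory response to the surge in nighttime fraud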
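
The definition above allows either amount or count. The project code uses amount, so here is a minimal count-based sketch on toy data for comparison (`tx_demo` is made up; the 22:00-06:00 window is the one defined above):

```python
import pandas as pd

tx_demo = pd.DataFrame({
    "client_id": [1, 1, 1, 1, 2, 2],
    "date": pd.to_datetime([
        "2020-01-01 23:10", "2020-01-02 05:59", "2020-01-02 06:00",
        "2020-01-02 14:30", "2020-01-03 09:00", "2020-01-03 21:59",
    ]),
})

hour = tx_demo["date"].dt.hour

# Night = 22:00 up to (but not including) 06:00.
tx_demo["is_night"] = (hour >= 22) | (hour < 6)

# The mean of a boolean flag is the share of nighttime transactions by count.
night_txn_ratio = tx_demo.groupby("client_id")["is_night"].mean().round(2)
```

The count-based version is less sensitive to a single large purchase. It also keeps customers with no nighttime transactions (ratio 0) without a merge.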

credit_score_range: credit score band (in units of 100)

# Define the credit score range function
def score_range(score):
    if pd.isna(score):
        return 'Unknown'  # handle missing values
    else:
        return f"{int(score) // 100 * 100}s"  # e.g. 742 → "700s"

# Create the derived feature
users['credit_score_range'] = users['credit_score'].apply(score_range)

print(users[['credit_score', 'credit_score_range']].head())