n232_data-wrangling

ssu_hyun · August 25, 2021

[codestates] AI Bootcamp

Learning objectives

Create training data for fitting a supervised machine learning model.
Understand data engineering methods for supervised learning and build appropriate features.

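Data wrangling

Data wrangling is the process of transforming or mapping data into an easier-to-use form before analysis or modeling. It is usually the most time-consuming stage of the modeling workflow.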

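0. Preview

# Check each CSV file's shape and first rows together
from glob import glob

import pandas as pd
from IPython.display import display  # available inside Jupyter/IPython

def preview():
    for filename in glob('*.csv'):
        df = pd.read_csv(filename)
        print(filename, df.shape)
        display(df.head())
        print('\n')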
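A variant of the preview helper can also return the loaded frames so later steps can reuse them. This is only a sketch: `load_csvs` and its `pattern` parameter are hypothetical names that do not appear in the original post, the demo CSV is made up, and `print` replaces `display` so the code also runs outside a notebook.

```python
import glob
import os
import tempfile

import pandas as pd


def load_csvs(pattern='*.csv'):
    """Read every CSV matching `pattern` and return {filename: DataFrame}."""
    frames = {}
    for path in sorted(glob.glob(pattern)):
        df = pd.read_csv(path)
        name = os.path.basename(path)
        print(name, df.shape)  # same quick shape check as preview()
        frames[name] = df
    return frames


# Demo on a throwaway directory holding one tiny, made-up CSV.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({'aisle_id': [1, 2], 'aisle': ['a', 'b']}).to_csv(
        os.path.join(tmp, 'aisles.csv'), index=False)
    frames = load_csvs(os.path.join(tmp, '*.csv'))
```

Keeping the frames in a dict keyed by filename avoids re-reading large files such as the order_products tables.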

1. Analyze and understand the data files and each feature

From jeremystan

orders (3.4m rows, 206k users):
- order_id: order identifier
- user_id: customer identifier
- eval_set: which evaluation set this order belongs in (see SET described below)
- order_number: the order sequence number for this user (1 = first, n = nth)
- order_dow: the day of the week the order was placed on
- order_hour_of_day: the hour of the day the order was placed on
- days_since_prior: days since the last order, capped at 30 (with NAs for order_number = 1)
products (50k rows):
- product_id: product identifier
- product_name: name of the product
- aisle_id: foreign key
- department_id: foreign key
aisles (134 rows):
- aisle_id: aisle identifier
- aisle: the name of the aisle
departments (21 rows):
- department_id: department identifier
- department: the name of the department
order_products__SET (30m+ rows):
- order_id: foreign key
- product_id: foreign key
- add_to_cart_order: order in which each product was added to cart
- reordered: 1 if this product has been ordered by this user in the past, 0 otherwise
where SET is one of the three following evaluation sets (eval_set in orders):
- "prior": orders prior to that user's most recent order (~3.2m orders)
- "train": training data supplied to participants (~131k orders)
- "test": test data reserved for machine learning competitions (~75k orders)

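2. Analyze the relationships between DataFrames

e.g., every customer's sequence of orders (order_id, user_id, etc.) is in orders, while prior and train hold the product information linked to each order_id (product_id, add-to-cart order, reordered flag).

For test (the submission set), only order_id is available; there is no product_id.

Checking the train/test split and overlap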
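The relationship described above can be sketched on toy tables. All values below are made up, and `order_products` is a stand-in name for the combined prior/train product rows; the real files are far larger.

```python
import pandas as pd

# Toy stand-ins for the real tables (made-up values).
orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'user_id': [10, 10, 20, 20],
    'eval_set': ['prior', 'train', 'prior', 'test'],
})
order_products = pd.DataFrame({
    'order_id': [1, 1, 2, 3],
    'product_id': [100, 200, 100, 300],
})

# A left join keeps every order; orders without product rows get NaN.
joined = orders.merge(order_products, on='order_id', how='left')

# count() skips NaN, so this is the number of known products per eval_set.
products_per_set = joined.groupby('eval_set')['product_id'].count()
print(products_per_set)
```

In this toy setup the test order survives the join but contributes zero known products, which mirrors why the test set has nothing to learn labels from.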

# set1.isdisjoint(set2) returns True when the two sets share no elements
set(orders[orders['eval_set']=='test']['user_id'])\
    .isdisjoint(set(orders[orders['eval_set']=='train']['user_id']))
    
>>> True

# Each customer has exactly one train or test order
latest = orders[orders['eval_set'].isin(['train', 'test'])]
len(latest), latest['user_id'].nunique()

>>> (206209, 206209)
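The two checks can also be tried on a small made-up `orders` table. This sketch swaps the length comparison for `Series.is_unique`, which answers the same question (one row per user) in a single call; the variable names are illustrative.

```python
import pandas as pd

# Made-up orders: each user has some prior orders plus one train or test order.
orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4, 5],
    'user_id': [10, 10, 20, 20, 30],
    'eval_set': ['prior', 'train', 'prior', 'test', 'train'],
})

latest = orders[orders['eval_set'].isin(['train', 'test'])]
train_users = set(latest.loc[latest['eval_set'] == 'train', 'user_id'])
test_users = set(latest.loc[latest['eval_set'] == 'test', 'user_id'])

# True when no user appears in both train and test.
no_overlap = train_users.isdisjoint(test_users)

# True when every user has exactly one train-or-test row.
one_row_per_user = latest['user_id'].is_unique
print(no_overlap, one_row_per_user)
```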

3. Simplify the problem to binary classification

Which products will each customer reorder?
-> Will a customer buy a specific product or not (binary classification)?
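One way to turn this into a binary target is to build one label per train order. The sketch below uses made-up rows and assumes the product of interest is 'Banana' and that product names have already been merged in; the column name `banana` and the variable `y` are illustrative.

```python
import pandas as pd

# Made-up train rows: one row per (order, product).
train = pd.DataFrame({
    'order_id': [2, 2, 5, 6],
    'product_name': ['Banana', 'Milk', 'Bread', 'Banana'],
})

# Flag each row, then collapse to a single 0/1 label per order.
train['banana'] = train['product_name'] == 'Banana'
y = train.groupby('order_id')['banana'].any().astype(int)
print(y)
```

Each order now has exactly one label: 1 if it contains the product at least once, 0 otherwise.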

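4. Design questions that lead to the answer

Which product do customers order most often?
How recently have customers purchased this product?
Which customers purchased this product before (prior)?
What does the dataset of customers with a purchase history for this product look like?
Which features should we engineer to predict that a customer will reorder this product?
There are many candidate features to consider:

Average number of products per order for each customer
Time of the order
Number and frequency of banana purchases
Whether other fruit is bought together with bananas
Days between banana repurchases
How many days ago bananas were last purchased, ...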
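A few of the candidate features above can be sketched with groupby on a toy table. The data is made up and assumed to be already joined with orders and products. Recency is measured here in orders elapsed rather than days, which is a simplification of the day-based features listed above.

```python
import pandas as pd

# Made-up prior purchases, already joined with orders and products.
prior = pd.DataFrame({
    'user_id':      [10, 10, 10, 10, 20, 20],
    'order_id':     [1, 1, 2, 3, 4, 5],
    'order_number': [1, 1, 2, 3, 1, 2],
    'product_name': ['Banana', 'Milk', 'Banana', 'Bread', 'Milk', 'Banana'],
})

# Average number of products per order, per user.
basket_size = prior.groupby(['user_id', 'order_id']).size()
avg_basket = basket_size.groupby(level='user_id').mean().rename('avg_basket_size')

# Number of distinct orders per user that contained a banana.
bananas = prior[prior['product_name'] == 'Banana']
banana_orders = (bananas.groupby('user_id')['order_id']
                 .nunique().rename('banana_orders'))

# Orders elapsed since the user's last banana purchase (0 = latest order).
last_order = prior.groupby('user_id')['order_number'].max()
last_banana = bananas.groupby('user_id')['order_number'].max()
orders_since_banana = (last_order - last_banana).rename('orders_since_banana')

features = pd.concat([avg_basket, banana_orders, orders_since_banana], axis=1)
print(features)
```

The result has one row per user, which is the shape a per-customer classifier would need.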

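5. Functions and methods used to combine data to answer the questions

mode: most frequent value
prior['product_id'].mode() 

value_counts: unique values and their counts

# The five most frequently ordered product_ids
top5_products = prior['product_id'].value_counts()[:5]
top5_products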
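The difference between the two methods is easier to see on a tiny Series. The product IDs below are made up for illustration.

```python
import pandas as pd

# Made-up product IDs: 1 appears three times, 2 twice, 3 once.
product_ids = pd.Series([1, 2, 1, 3, 1, 2], name='product_id')

# mode() returns a Series, because ties would produce several values.
most_common = product_ids.mode()

# value_counts() returns every unique value with its count, largest first.
counts = product_ids.value_counts()
top2 = counts.head(2)
print(most_common.iloc[0], top2.to_dict())
```

`mode()` answers only "which value is most frequent", while `value_counts()` also gives how often, so it is the more useful starting point for ranking products.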

merge: combine DataFrames on a shared key

# Merge prior with products on product_id
prior = prior.merge(products, on='product_id')

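# Attach order-level columns (user_id, etc.); the left join keeps every prior row
prior = prior.merge(orders, how='left', on='order_id')

# Row counts per (user, order), then each user's average across their orders
prior.groupby(['user_id','order_id']).count()
prior.groupby(['user_id','order_id']).count().reset_index().groupby('user_id').mean()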
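The two merges above use different join types: the first relies on the default inner join, the second on a left join. A minimal sketch with made-up rows shows how that choice changes the row count. The `validate` argument is an optional pandas safeguard added here for illustration; it is not used in the original code.

```python
import pandas as pd

# Made-up rows; product_id 999 has no match in products.
prior = pd.DataFrame({'order_id': [1, 1, 2], 'product_id': [100, 200, 999]})
products = pd.DataFrame({'product_id': [100, 200],
                         'product_name': ['Banana', 'Milk']})

# Default (inner) join drops rows whose key has no match.
inner = prior.merge(products, on='product_id', validate='many_to_one')

# Left join keeps every row of prior and fills unmatched columns with NaN.
left = prior.merge(products, on='product_id', how='left',
                   validate='many_to_one')
print(len(inner), len(left))
```

`validate='many_to_one'` raises an error if `products` ever contains a duplicated `product_id`, which would otherwise silently multiply rows.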

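groupby: group by category

# List of products for each order_id
train.groupby('order_id')['product_id'].apply(list)

# any(): True if the order (order_id) contains at least one Banana
# (assumes train already has a boolean 'banana' column; its creation is not shown here)
train.groupby('order_id')['banana'].any().value_counts(normalize=True)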

# Mean beer_servings for each continent
drinks.groupby('continent').beer_servings.mean()

# Only 'Africa'
drinks[drinks.continent=='Africa'].beer_servings.mean()

# agg: apply several aggregation functions at once
drinks[drinks.continent=='Africa'].beer_servings.agg(['count', 'min', 'max', 'mean'])

# Mean of every numeric column per continent
# (numeric_only=True skips text columns, which recent pandas versions refuse to average)
drinks.groupby('continent').mean(numeric_only=True)

# Visualization
%matplotlib inline
drinks.groupby('continent').mean(numeric_only=True).plot(kind='bar')
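The filter-then-aggregate calls above can be folded into a single groupby that covers every continent at once. This sketch uses a made-up `drinks` table with the same column names; the values are not from the real dataset.

```python
import pandas as pd

# Made-up stand-in for the drinks table.
drinks = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'continent': ['Africa', 'Africa', 'Europe', 'Europe'],
    'beer_servings': [10, 30, 200, 300],
    'wine_servings': [0, 4, 100, 200],
})

# Several aggregations per continent in one call.
beer_stats = (drinks.groupby('continent')['beer_servings']
              .agg(['count', 'min', 'max', 'mean']))

# Per-continent means of numeric columns only; 'country' is skipped.
numeric_means = drinks.groupby('continent').mean(numeric_only=True)
print(beer_stats)
print(numeric_means)
```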
