[Kaggle Notebook Transcription] Instacart Market Basket Analysis

배서연 · November 7, 2022

AI/SW Developers


Which products do Instacart shoppers reorder?

Instacart is a grocery ordering and delivery app: shoppers pick products in the app, review their cart, and place the order.

Data Description

Data

  • df_order_products_train : line items for the train (current) orders
  • df_order_products_prior : line items for the prior (past) orders
  • df_orders : order metadata
  • df_products : product information
  • df_aisles : product subcategories (aisles)
  • df_departments : product categories (departments)

Data Columns

  • order_id : unique order ID
  • user_id : unique customer ID
  • product_id : unique product ID
  • product_name : product name
  • order_number : sequence number of the order for that user
  • aisle : product subcategory
  • department : product category
  • aisle_id : unique subcategory ID
  • department_id : unique category ID
  • order_dow : day of the week
  • order_hour_of_day : hour of the day
  • add_to_cart_order : position at which the product was added to the cart
  • reordered : 1 if the user has ordered this product before, 0 otherwise
  • eval_set : which set the order belongs to (prior/train/test)

1) Loading Libraries and the Dataset

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import zipfile
import math
import random
from pandasql import sqldf
import time
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/instacart-market-basket-analysis/departments.csv.zip
/kaggle/input/instacart-market-basket-analysis/sample_submission.csv.zip
/kaggle/input/instacart-market-basket-analysis/order_products__train.csv.zip
/kaggle/input/instacart-market-basket-analysis/order_products__prior.csv.zip
/kaggle/input/instacart-market-basket-analysis/orders.csv.zip
/kaggle/input/instacart-market-basket-analysis/products.csv.zip
/kaggle/input/instacart-market-basket-analysis/aisles.csv.zip
!ls
__notebook__.ipynb
Our dataset, containing billions of rows, will be created in this location.
!mkdir /kaggle/working/Dataset
zf = zipfile.ZipFile('/kaggle/input/instacart-market-basket-analysis/products.csv.zip') 
df_products = pd.read_csv(zf.open('products.csv'))
df_products.head()

A quick look at the product information.

product_id product_name aisle_id department_id
0 1 Chocolate Sandwich Cookies 61 19
1 2 All-Seasons Salt 104 13
2 3 Robust Golden Unsweetened Oolong Tea 94 7
3 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1
4 5 Green Chile Anytime Sauce 5 13
df_products.shape
(49688, 4)
There are 49,688 unique items/products in total.

The dataset has two sets of files, prior and train orders, because the actual task during the competition was different.
We will concatenate the two sets to create a single dataset with all the order details and the products bought in those orders.
orders_path = '/kaggle/input/instacart-market-basket-analysis/order_products__train.csv.zip'
zf = zipfile.ZipFile(orders_path) 
df_orders_train = pd.read_csv(zf.open('order_products__train.csv'))

df_orders_train.head()
order_id product_id add_to_cart_order reordered
0 1 49302 1 1
1 1 11109 2 1
2 1 10246 3 0
3 1 49683 4 0
4 1 43633 5 1
orders_path = '/kaggle/input/instacart-market-basket-analysis/order_products__prior.csv.zip'
zf = zipfile.ZipFile(orders_path) 
df_orders_prior = pd.read_csv(zf.open('order_products__prior.csv'))

df_orders_prior.head()
order_id product_id add_to_cart_order reordered
0 2 33120 1 1
1 2 28985 2 1
2 2 9327 3 0
3 2 45918 4 1
4 2 30035 5 0
df_orders = pd.concat([df_orders_train, df_orders_prior])

df_orders.head()
order_id product_id add_to_cart_order reordered
0 1 49302 1 1
1 1 11109 2 1
2 1 10246 3 0
3 1 49683 4 0
4 1 43633 5 1
df_orders.shape
(33819106, 4)
tot_orders_cnt = len(df_orders.order_id.unique())
print(tot_orders_cnt)
3346083
There are 3.34M unique orders in total
#Getting a rough idea of average number of products per order 
sampling_orders = 5000
df_orders_set = df_orders[df_orders.order_id.isin(df_orders.order_id.unique()[:sampling_orders])].groupby("order_id").agg({
    "product_id": lambda x: set(x)
}).reset_index(drop=True)

display(df_orders_set.head())

df_orders_set['items_count'] = df_orders_set.product_id.apply(lambda x : len(x))

np.mean(df_orders_set.items_count)
product_id
0 {11109, 10246, 47209, 43633, 49683, 22035, 493...
1 {34497, 46979, 48679, 19660, 43086, 49235, 396...
2 {42625, 23622, 11913, 28842, 4461, 39693, 1815...
3 {40706, 24489, 25610, 39275, 30391, 20574, 27966}
4 {46720, 24964, 4357, 43654, 18441, 36364, 3406...
10.4968

On average there are ~10.5 products per order.

Approach

  • There are 3.34M unique orders in total. We will process 50K orders at a time, breaking the embedding training dataset into multiple files.
  • Like word2vec, we will use context information to learn product embeddings.
  • In our case, the items ordered together (part of the same order) play the role of context.
  • Two products (P1 & P2) that are part of the same order go in as an input with target 1.
  • Similarly, a randomly drawn pair of products, say Px and Py, goes in as an input with target 0 (negative sampling). Better negative-sampling approaches exist as well.
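Before the full batch loop below, the pair construction can be illustrated on a single toy order (the product IDs here are hypothetical, not from the real data):

```python
import random
import pandas as pd

# One toy order (order_id 100) containing products 1, 2, 3
df = pd.DataFrame({'order_id': [100, 100, 100],
                   'product_id': [1, 2, 3]})

# Negative sampling: shuffle the product column so each row also
# carries a (likely unrelated) complement product
random.seed(0)
complement = list(df.product_id)
random.shuffle(complement)
df['product_id_complement'] = complement

# Positive pairs: self-join on order_id, then drop self-pairs
pos = pd.merge(df.rename(columns={'product_id': 'product_id1'}),
               df.rename(columns={'product_id': 'product_id2'}),
               on='order_id')
pos = pos[pos.product_id1 != pos.product_id2]
pos['label'] = 1
print(len(pos))  # 3 products -> 3*3 - 3 = 6 ordered positive pairs
```

Each unordered co-occurring pair appears twice (as (P1, P2) and (P2, P1)), which is why the self-join yields 6 rows for 3 products.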
random.seed(42)
process_orders_per_batch = 50000
orders_list = list(df_orders.order_id.unique())
random.shuffle(orders_list)
start_time = time.time()
records_cnt = 0
for i in range(math.ceil(tot_orders_cnt/process_orders_per_batch)):
    #Take the i-th batch of 50K orders
    orders = orders_list[i * process_orders_per_batch : min((i+1)* process_orders_per_batch, tot_orders_cnt) ]
    cols = ['order_id', 'product_id']
    df_orders_compressed = df_orders[df_orders.order_id.isin(orders)][cols].reset_index(drop=True)
    
    
    # For negative sampling, pair each row with a randomly chosen product
    # (the product column shuffled across rows) kept in a separate column.
    lst_product_ids = list(df_orders_compressed.product_id)
    random.shuffle(lst_product_ids)
    df_orders_compressed['product_id_complement'] = lst_product_ids
    

    #Self-join for positive sampling
    df_positive = pd.merge(df_orders_compressed.rename(columns = {'product_id':'product_id1'}), df_orders_compressed.rename(columns = {'product_id':'product_id2'}), on = 'order_id')
    df_positive = df_positive[df_positive.product_id1 != df_positive.product_id2]
    df_positive['label'] = 1
    
    cols = ['pid1','pid2','label']
    
    #Self-join for negative sampling
    df_negative = pd.merge(df_orders_compressed.rename(columns = {'product_id':'product_id1'}), df_orders_compressed.rename(columns = {'product_id_complement':'product_id2'}), on = 'order_id')
    df_negative = df_negative[df_negative.product_id1 != df_negative.product_id2]
    df_negative['label'] = 0
    

    df_tmp = pd.concat([df_positive,df_negative])
    df_tmp = df_tmp.sample(frac = 1).rename(columns = {'product_id1':'pid1', 'product_id2':'pid2'})
    records_cnt += df_tmp.shape[0]
    df_tmp[cols].to_csv('Dataset/file_{}.csv'.format(i), index = False, header = True)
    
print("--- {} minutes ---".format((time.time() - start_time)/60.0))
--- 24.300829911231993 minutes ---
df_orders_compressed.columns
Index(['order_id', 'product_id', 'product_id_complement'], dtype='object')
print(records_cnt)
1029938249
We have created a massive embedding-training dataset of 1.02B rows, totaling 14+ GB.
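Since 1.02B rows will not fit in memory, a downstream embedding model would have to stream these files rather than load them. A minimal sketch of such a reader, assuming the `Dataset/file_*.csv` naming produced by the loop above (the `chunksize`-based consumption is an assumption, not part of the original notebook):

```python
import glob
import pandas as pd

def pair_batches(path_pattern='Dataset/file_*.csv', chunksize=100_000):
    """Yield ((pid1, pid2) array, label array) batches lazily,
    so the full 14+ GB dataset never has to be loaded at once."""
    for path in sorted(glob.glob(path_pattern)):
        for chunk in pd.read_csv(path, chunksize=chunksize):
            yield chunk[['pid1', 'pid2']].to_numpy(), chunk['label'].to_numpy()
```

A generator like this can feed a Keras `fit` loop one batch at a time.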

Results

Prior customers order most on Saturdays from 1–3 PM and on Sundays at 10 AM.

Train customers order most on Wednesdays from 6–7 AM and on Saturdays from 6–7 AM.

Train customers show the highest reorder rates at 6, 7, and 8 AM, with order rates tending to decline from 9 AM to noon.
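The hourly reorder pattern can be computed with a simple merge and groupby, joining the line items with `orders.csv` (which carries `order_hour_of_day`). A sketch; the helper name and demo values are hypothetical:

```python
import pandas as pd

def reorder_rate_by_hour(df_line_items, df_order_meta):
    """Mean of the binary `reordered` flag per order_hour_of_day.

    df_line_items : order_id / reordered (from order_products__*.csv)
    df_order_meta : order_id / order_hour_of_day (from orders.csv)
    """
    return (df_line_items
            .merge(df_order_meta[['order_id', 'order_hour_of_day']], on='order_id')
            .groupby('order_hour_of_day')['reordered']
            .mean())

# Tiny synthetic demo (hypothetical values, not the real data)
items = pd.DataFrame({'order_id': [1, 1, 2], 'reordered': [1, 0, 1]})
meta = pd.DataFrame({'order_id': [1, 2], 'order_hour_of_day': [6, 7]})
print(reorder_rate_by_hour(items, meta))
```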


Final Results

Sales by hour : highest at 9 AM / lowest at 3 AM.

Sales from 9 AM to 5 PM account for about 70% of total sales.

Sales by day : highest on day 30 / lowest on day 25.

Sales trend upward over days 1–7 and downward over days 8–29.

Order share by department : produce (29.2%) > dairy eggs (16.7%) > snacks (8.9%) > beverages (8.3%) ...

Order share by aisle : most organic products rank among the best sellers.
Reorder rate by category : the dairy eggs department has the highest reorder rate / the personal care department has the lowest.

Cart position vs. reorder : products added to the cart first are more likely to be reordered than those added later -> shoppers tend to add their staple products to the cart first, then add other items.
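The cart-position effect can be checked directly from the concatenated line items, since `add_to_cart_order` records the position at which each product entered the cart. A sketch with a hypothetical helper and demo data:

```python
import pandas as pd

def reorder_rate_by_cart_position(df_line_items, max_position=10):
    """Mean of the `reordered` flag per add_to_cart_order position."""
    subset = df_line_items[df_line_items.add_to_cart_order <= max_position]
    return subset.groupby('add_to_cart_order')['reordered'].mean()

# Hypothetical demo: the first-added item is reordered more often
items = pd.DataFrame({'add_to_cart_order': [1, 1, 2, 2],
                      'reordered':         [1, 1, 1, 0]})
print(reorder_rate_by_cart_position(items))
```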

References
  1. http://sites.nd.edu/munira-syed/2019/10/25/word2vec-implementation-with-keras-2-0/
  2. https://www.kaggle.com/competitions/instacart-market-basket-analysis/data?select=orders.csv.zip
  3. https://stackoverflow.com/a/42123881/5473201
  4. https://faroit.com/keras-docs/1.2.2/layers/embeddings/
  5. https://github.com/keras-team/keras/issues/2495#issuecomment-602092838
  6. https://stackoverflow.com/questions/50283844/in-keras-how-to-get-the-layer-name-associated-with-a-model-object-contained-i
