Instacart is a grocery ordering and delivery app: users browse products in the app, review their cart, and place an order.
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import zipfile
import math
import random
from pandasql import sqldf
import time
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/instacart-market-basket-analysis/departments.csv.zip
/kaggle/input/instacart-market-basket-analysis/sample_submission.csv.zip
/kaggle/input/instacart-market-basket-analysis/order_products__train.csv.zip
/kaggle/input/instacart-market-basket-analysis/order_products__prior.csv.zip
/kaggle/input/instacart-market-basket-analysis/orders.csv.zip
/kaggle/input/instacart-market-basket-analysis/products.csv.zip
/kaggle/input/instacart-market-basket-analysis/aisles.csv.zip
!ls
__notebook__.ipynb
Our dataset, containing over a billion rows, will be created in this location.
!mkdir /kaggle/working/Dataset
zf = zipfile.ZipFile('/kaggle/input/instacart-market-basket-analysis/products.csv.zip')
df_products = pd.read_csv(zf.open('products.csv'))
df_products.head()
A quick look at the product information:
| | product_id | product_name | aisle_id | department_id |
|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 |
| 1 | 2 | All-Seasons Salt | 104 | 13 |
| 2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
| 3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 |
| 4 | 5 | Green Chile Anytime Sauce | 5 | 13 |
df_products.shape
(49688, 4)
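As an aside, pandas can read a zipped CSV directly, without opening the archive through `zipfile` first, since compression is inferred from the `.zip` extension. A minimal sketch, using a hypothetical toy file that mirrors the `products.csv` schema:

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a tiny zipped CSV on disk (toy data mirroring products.csv).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'products.csv.zip')
with zipfile.ZipFile(path, 'w') as zf:
    zf.writestr('products.csv',
                'product_id,product_name,aisle_id,department_id\n'
                '1,Chocolate Sandwich Cookies,61,19\n'
                '2,All-Seasons Salt,104,13\n')

# Compression is inferred from the .zip extension; no ZipFile needed.
df = pd.read_csv(path)
print(df.shape)  # (2, 4)
```

On Kaggle this means `pd.read_csv('/kaggle/input/instacart-market-basket-analysis/products.csv.zip')` would work as-is; the explicit `zipfile` route used above is equivalent.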
There are ~49.7K unique products in total.
orders_path = '/kaggle/input/instacart-market-basket-analysis/order_products__train.csv.zip'
zf = zipfile.ZipFile(orders_path)
df_orders_train = pd.read_csv(zf.open('order_products__train.csv'))
df_orders_train.head()
| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 1 | 49302 | 1 | 1 |
| 1 | 1 | 11109 | 2 | 1 |
| 2 | 1 | 10246 | 3 | 0 |
| 3 | 1 | 49683 | 4 | 0 |
| 4 | 1 | 43633 | 5 | 1 |
orders_path = '/kaggle/input/instacart-market-basket-analysis/order_products__prior.csv.zip'
zf = zipfile.ZipFile(orders_path)
df_orders_prior = pd.read_csv(zf.open('order_products__prior.csv'))
df_orders_prior.head()
| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 |
| 1 | 2 | 28985 | 2 | 1 |
| 2 | 2 | 9327 | 3 | 0 |
| 3 | 2 | 45918 | 4 | 1 |
| 4 | 2 | 30035 | 5 | 0 |
df_orders = pd.concat([df_orders_train, df_orders_prior])
df_orders.head()
| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 1 | 49302 | 1 | 1 |
| 1 | 1 | 11109 | 2 | 1 |
| 2 | 1 | 10246 | 3 | 0 |
| 3 | 1 | 49683 | 4 | 0 |
| 4 | 1 | 43633 | 5 | 1 |
df_orders.shape
(33819106, 4)
tot_orders_cnt = len(df_orders.order_id.unique())
print(tot_orders_cnt)
3346083
There are ~3.35M unique orders in total.
# Getting a rough idea of the average number of products per order
sampling_orders = 5000
df_orders_set = df_orders[df_orders.order_id.isin(df_orders.order_id.unique()[:sampling_orders])].groupby("order_id").agg({
    "product_id": lambda x: set(x)
}).reset_index(drop=True)
display(df_orders_set.head())
df_orders_set['items_count'] = df_orders_set.product_id.apply(lambda x : len(x))
np.mean(df_orders_set.items_count)
| | product_id |
|---|---|
| 0 | {11109, 10246, 47209, 43633, 49683, 22035, 493... |
| 1 | {34497, 46979, 48679, 19660, 43086, 49235, 396... |
| 2 | {42625, 23622, 11913, 28842, 4461, 39693, 1815... |
| 3 | {40706, 24489, 25610, 39275, 30391, 20574, 27966} |
| 4 | {46720, 24964, 4357, 43654, 18441, 36364, 3406... |
10.4968
On average there are ~10.5 products per order.
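The sampled estimate above builds Python sets per order, which is slow at scale. If the exact figure over the whole frame were wanted, a single `groupby(...).size()` pass would avoid the per-order set construction. A minimal sketch on hypothetical toy data with the same columns:

```python
import pandas as pd

# Toy stand-in for df_orders (the real frame has ~33.8M rows).
df_orders = pd.DataFrame({
    'order_id':   [1, 1, 1, 2, 2, 3],
    'product_id': [10, 11, 12, 10, 13, 14],
})

# size() counts rows per order in one vectorized pass; its mean is the
# average basket size.
items_per_order = df_orders.groupby('order_id').size()
print(items_per_order.mean())  # 2.0
```

Note `size()` counts rows rather than distinct products, so it matches the set-based count only when an order never repeats a product id, which holds for this dataset.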
random.seed(42)
process_orders_per_batch = 50000
orders_list = list(df_orders.order_id.unique())
random.shuffle(orders_list)
start_time = time.time()
records_cnt = 0
for i in range(math.ceil(tot_orders_cnt / process_orders_per_batch)):
    # Take the next batch of 50K orders
    orders = orders_list[i * process_orders_per_batch : min((i + 1) * process_orders_per_batch, tot_orders_cnt)]
    cols = ['order_id', 'product_id']
    df_orders_compressed = df_orders[df_orders.order_id.isin(orders)][cols].reset_index(drop=True)
    # For negative sampling, pair each order row with a randomly chosen product
    # in a separate column.
    lst_product_ids = list(df_orders_compressed.product_id)
    # print("Before Shuffling")
    # print(lst_product_ids[:10])
    random.shuffle(lst_product_ids)
    # print("After Shuffling")
    # print(lst_product_ids[:10])
    df_orders_compressed['product_id_complement'] = lst_product_ids
    # Self-join for positive sampling
    df_positive = pd.merge(df_orders_compressed.rename(columns={'product_id': 'product_id1'}),
                           df_orders_compressed.rename(columns={'product_id': 'product_id2'}), on='order_id')
    df_positive = df_positive[df_positive.product_id1 != df_positive.product_id2]
    df_positive['label'] = 1
    cols = ['pid1', 'pid2', 'label']
    # Self-join for negative sampling
    df_negative = pd.merge(df_orders_compressed.rename(columns={'product_id': 'product_id1'}),
                           df_orders_compressed.rename(columns={'product_id_complement': 'product_id2'}), on='order_id')
    df_negative = df_negative[df_negative.product_id1 != df_negative.product_id2]
    df_negative['label'] = 0
    df_tmp = pd.concat([df_positive, df_negative])
    df_tmp = df_tmp.sample(frac=1).rename(columns={'product_id1': 'pid1', 'product_id2': 'pid2'})
    records_cnt += df_tmp.shape[0]
    df_tmp[cols].to_csv('Dataset/file_{}.csv'.format(i), index=False, header=True)
print("--- {} minutes ---".format((time.time() - start_time) / 60.0))
--- 24.300829911231993 minutes ---
df_orders_compressed.columns
Index(['order_id', 'product_id', 'product_id_complement'], dtype='object')
print(records_cnt)
1029938249
We have created a massive embedding-training dataset of ~1.03B rows, totaling 14+ GB.
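The sampling mechanism in the batch loop above can be seen end-to-end on a toy batch. This sketch uses two hypothetical orders; the column names match the loop, but the data is made up:

```python
import random

import pandas as pd

random.seed(0)

# Toy batch: two orders with two products each.
df = pd.DataFrame({
    'order_id':   [1, 1, 2, 2],
    'product_id': [10, 11, 20, 21],
})

# Positives: a self-join on order_id pairs every product with every other
# product bought in the same order.
pos = pd.merge(df.rename(columns={'product_id': 'pid1'}),
               df.rename(columns={'product_id': 'pid2'}), on='order_id')
pos = pos[pos.pid1 != pos.pid2].assign(label=1)

# Negatives: shuffle the product column across the whole batch, then join the
# shuffled entries back to each order's products.
shuffled = list(df.product_id)
random.shuffle(shuffled)
df['product_id_complement'] = shuffled
neg = pd.merge(df.rename(columns={'product_id': 'pid1'}),
               df[['order_id', 'product_id_complement']]
                 .rename(columns={'product_id_complement': 'pid2'}), on='order_id')
neg = neg[neg.pid1 != neg.pid2].assign(label=0)

print(len(pos))  # 4 positive pairs: (10,11), (11,10), (20,21), (21,20)
```

One caveat worth noting: because negatives come from a shuffle of the same batch, a "negative" pair can occasionally be a product that genuinely co-occurs with the anchor in some other order; at this scale that label noise is a small fraction of the data.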
prior customers order most on Saturday 1–3 PM and Sunday at 10 AM.
train customers order most on Wednesday 6–7 AM and Saturday 6–7 AM.
For train customers, the reorder rate peaks at 6, 7, and 8 AM and tends to decline between 9 AM and noon.
Sales by hour: highest at 9 AM, lowest at 3 AM.
Sales between 9 AM and 5 PM account for about 70% of total sales.
Sales by day of the month: highest on the 30th, lowest on the 25th.
Sales trend upward from the 1st through the 7th and downward from the 8th through the 29th.
Order share by department: produce (29.2%) > dairy eggs (16.7%) > snacks (8.9%) > beverages (8.3%) ...
Order share by aisle: most organic products rank among the best sellers.
Reorder rate by department: dairy eggs has the highest reorder rate; personal care has the lowest.
Cart position vs. purchase: products added to the cart first are more likely to be reordered than those added later -> shoppers tend to add their frequently used staples first, then browse for other products.
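The last observation, that earlier cart positions are reordered more often, can be checked directly by grouping on `add_to_cart_order`: the mean of the 0/1 `reordered` flag at each position is the reorder rate for that position. A minimal sketch on hypothetical toy data with the notebook's columns:

```python
import pandas as pd

# Toy stand-in for df_orders; the real frame has ~33.8M rows.
df = pd.DataFrame({
    'order_id':          [1, 1, 1, 2, 2, 2],
    'add_to_cart_order': [1, 2, 3, 1, 2, 3],
    'reordered':         [1, 1, 0, 1, 0, 0],
})

# Mean of the binary reordered flag per cart position = reorder rate there.
rate_by_position = df.groupby('add_to_cart_order')['reordered'].mean()
print(rate_by_position.to_dict())  # {1: 1.0, 2: 0.5, 3: 0.0}
```

On the real data, a monotonically decreasing series here would confirm the claim quantitatively rather than from plots alone.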