[Paper Review] DCN

승민 · 6 days ago

This post summarizes the 2017 paper Deep & Cross Network for Ad Click Predictions.
Paper
My DCN model implementation (PyTorch)
My DCN training code on MovieLens 100K (Jupyter)

0. Summary

Item	Wide & Deep	DeepFM	DCN
Architecture	Wide (linear model) + Deep (MLP)	FM (low-order interactions) + Deep (high-order interactions)	Cross Network (feature crossing) + Deep (MLP)
Feature Engineering	Manual feature crosses required	Automatic; the FM component learns 2nd-order interactions	Automatic; cross layers learn feature crosses whose degree grows with depth
Embedding Sharing	❌ (Wide and Deep take separate inputs)	✅ (FM and Deep share the embedding layer)	✅ (Cross and Deep share the embeddings)
Interaction Modeling	- Wide: 1st order (linear), hand-crafted combinations	- FM: 2nd-order pairwise interactions, learned automatically	- Cross: explicit crossing built from the outer product of the input and the previous layer
Model Equation	$\hat{y} = \sigma(W_{wide}^T x + W_{deep}^T a^{(L)} + b)$	$\hat{y} = \sigma(\hat{y}_{FM} + \hat{y}_{Deep})$	$x^{(l+1)} = x^{(0)} (x^{(l)})^T w^{(l)} + b^{(l)} + x^{(l)}$ Final: $\hat{y} = \sigma(W_{out}^T [x_{cross}^{(L)}; a^{(L)}_{deep}] + b)$

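Aspect	Cross Component	Deep Component
Role	Explicit low-order feature crossing	Learning high-order nonlinear relationships
Goal	Learn low-order interactions (feature crosses) automatically	Learn complex high-order nonlinear representations
Characteristics	Captures cross features through inner/outer products of the original input and the previous layer	Captures nonlinear features
Model Equation	$\mathbf{x}_{l+1} = \mathbf{x}_0 \mathbf{x}_l^\top \mathbf{w}_l + \mathbf{b}_l + \mathbf{x}_l$	$a^{(l+1)} = \sigma(W^{(l)} a^{(l)} + b^{(l)})$ $\hat{y}_{deep} = \sigma(W_{out} a^{(L)} + b_{out})$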

Web-scale systems mostly work with sparse categorical features.
- Embedding procedure
  - $\mathbf{x}_{embed,i} = W_{embed,i}\mathbf{x}_i$, where $\mathbf{x}_i$ is the one-hot input of the $i$-th categorical field
  - $\mathbf{x}_0 = \left[ \mathbf{x}^\top_{embed,1}, \ldots, \mathbf{x}^\top_{embed,k}, \mathbf{x}^\top_{dense} \right]$
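The embedding-and-stacking step above can be sketched in a few lines of plain Python. This is only an illustration: the helper names `embed` and `stack`, the two toy embedding tables, and every number in them are made up here and do not come from the paper or from the linked implementation.

```python
# Minimal sketch of the embedding-and-stacking step (pure Python, no framework).
# All names, sizes, and values below are hypothetical.

def embed(w_embed, index):
    """Multiplying W_embed by a one-hot vector simply selects one column.

    w_embed is stored as a list of columns, so the lookup is w_embed[index].
    """
    return list(w_embed[index])


def stack(embedded_fields, dense_features):
    """Concatenate every embedding vector and the dense features into x_0."""
    x0 = []
    for vec in embedded_fields:
        x0.extend(vec)
    x0.extend(dense_features)
    return x0


# Two toy categorical fields: vocabulary size 3 with 2-dim embeddings,
# and vocabulary size 2 with 3-dim embeddings.
W_EMBED_1 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
W_EMBED_2 = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]

x0 = stack([embed(W_EMBED_1, 2), embed(W_EMBED_2, 0)], dense_features=[0.7])
print(x0)  # [0.5, 0.6, 1.0, 0.0, -1.0, 0.7]
```

Each field can use its own embedding size, since only the concatenated length of $\mathbf{x}_0$ matters to the layers that follow.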

cross layer
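- Applies explicit feature crossing efficiently
- $\mathbf{x}_{l+1} = \mathbf{x}_0 \mathbf{x}^\top_l \mathbf{w}_l + \mathbf{b}_l + \mathbf{x}_l = f(\mathbf{x}_l,\mathbf{w}_l,\mathbf{b}_l)+\mathbf{x}_l$
  - Weight, Bias: ${\mathbf{w}_l, \mathbf{b}_l \in \mathbb{R}^d}$
  - Residual: each layer adds its own input $\mathbf{x}_l$ back, so $f$ only has to fit the residual $\mathbf{x}_{l+1} - \mathbf{x}_l$; $\mathbf{x}_0$ re-enters every layer through the cross term $\mathbf{x}_0 \mathbf{x}^\top_l \mathbf{w}_l$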
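The cross-layer formula can be sketched in plain Python as follows. The function names and the toy weights are assumptions made for this example only. The sketch evaluates $\mathbf{x}_0 \mathbf{x}^\top_l \mathbf{w}_l$ as $\mathbf{x}_0$ times the scalar $\mathbf{x}^\top_l \mathbf{w}_l$, so the $d \times d$ outer product is never built.

```python
# Minimal sketch of a cross network (pure Python). Names and values are
# hypothetical and chosen only to make the arithmetic easy to follow.

def cross_layer(x0, xl, w, b):
    """One cross layer: x_{l+1} = x_0 * (x_l . w_l) + b_l + x_l."""
    scalar = sum(xi * wi for xi, wi in zip(xl, w))  # x_l^T w_l
    return [x0_i * scalar + b_i + xl_i for x0_i, b_i, xl_i in zip(x0, b, xl)]


def cross_network(x0, weights, biases):
    """Apply the cross layers in sequence; x_0 is fed into every layer."""
    xl = x0
    for w, b in zip(weights, biases):
        xl = cross_layer(x0, xl, w, b)
    return xl


x0 = [1.0, 2.0]
weights = [[0.5, 0.5], [1.0, 0.0]]
biases = [[0.0, 0.0], [0.0, 1.0]]

print(cross_layer(x0, x0, weights[0], biases[0]))  # [2.5, 5.0]
print(cross_network(x0, weights, biases))          # [5.0, 11.0]
```

Working with the scalar first is what keeps the per-layer cost linear in $d$.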
High-degree interaction
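- The degree of the cross terms grows with layer depth: an $l$-layer cross network reaches polynomial degree $l+1$ in $\mathbf{x}_0$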
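To see why depth raises the degree, it helps to unroll the recursion with the biases set to zero (a simplification made here only for illustration, not the general case):

$$
\begin{aligned}
\mathbf{x}_1 &= \mathbf{x}_0 (\mathbf{x}_0^\top \mathbf{w}_0) + \mathbf{x}_0 = \mathbf{x}_0 (\mathbf{x}_0^\top \mathbf{w}_0 + 1) \\
\mathbf{x}_2 &= \mathbf{x}_0 (\mathbf{x}_1^\top \mathbf{w}_1) + \mathbf{x}_1 = \mathbf{x}_0 (\mathbf{x}_0^\top \mathbf{w}_0 + 1)(\mathbf{x}_0^\top \mathbf{w}_1 + 1) \\
\mathbf{x}_l &= \mathbf{x}_0 \prod_{i=0}^{l-1} (\mathbf{x}_0^\top \mathbf{w}_i + 1)
\end{aligned}
$$

Every extra layer multiplies in one more factor that is linear in $\mathbf{x}_0$, so $\mathbf{x}_1$ has degree 2, $\mathbf{x}_2$ has degree 3, and $\mathbf{x}_l$ has degree $l+1$.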
Complexity Analysis
- $L_c$: number of cross layers
- The cross network has $d \times L_c \times 2$ parameters, where $d$ is the input dimension
- Time and space cost are linear in the input dimension
- This cost is small enough to be negligible next to the DNN's

DNN
- $\mathbf{h}_{l+1} = f(W_l\mathbf{h}_l + \mathbf{b}_l)$
Complexity Analysis
- Number of parameters (assuming all deep layers have the same size): $d \times m + m + (m^2 + m) \times (L_d-1)$
- $L_d$: number of deep layers
- $m$: deep layer size
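To get a feel for the gap between the two parameter counts, the formulas can be evaluated directly. In the sketch below, $d = 1026$ is the concatenated input size mentioned under Implementation Details; the layer size of 1024 and the depths of 6 cross layers and 2 deep layers are illustrative assumptions, and the function names are made up for this example.

```python
# Minimal sketch: evaluate the two parameter-count formulas from this post.

def cross_param_count(d, num_cross_layers):
    """d x L_c x 2: one weight vector and one bias vector per cross layer."""
    return d * num_cross_layers * 2


def deep_param_count(d, m, num_deep_layers):
    """d*m + m for the first layer, then (m^2 + m) for each remaining layer."""
    return d * m + m + (m * m + m) * (num_deep_layers - 1)


d, m = 1026, 1024  # m and the depths below are illustrative assumptions
print(cross_param_count(d, 6))    # 12312
print(deep_param_count(d, m, 2))  # 2101248
```

With these values the cross network accounts for well under 1% of the deep network's parameters.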

Joint Train
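- Output of the cross network
- Output of the deep network
- The two outputs are concatenated into $x_{stack}$ and passed through a sigmoid
- $p = \mathrm{sigmoid}(W_{logit}x_{stack} + b_{logit})$
- Both networks are trained jointly with log loss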
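The combination step and the loss can be sketched in plain Python as below. The function names, the toy inputs, and the clipping constant `eps` are assumptions for this example; a real implementation would typically rely on a framework's numerically stable loss instead.

```python
import math

# Minimal sketch of the combination layer and log loss (pure Python).
# Names, inputs, and eps are hypothetical.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x_cross, h_deep, w_logit, b_logit):
    """Concatenate both network outputs, then apply a logistic layer."""
    x_stack = list(x_cross) + list(h_deep)
    logit = sum(w * x for w, x in zip(w_logit, x_stack)) + b_logit
    return sigmoid(logit)


def log_loss(labels, probs, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


p = predict([5.0, 11.0], [0.25], w_logit=[0.5, -0.25, 1.0], b_logit=0.0)
print(p)                 # 0.5
print(log_loss([1], [p]))
```

Because the loss is computed on the single stacked output, its gradient flows into the cross network and the deep network at the same time.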

Advantages of FM
- Parameter sharing
  - Efficiency: well suited to the sparse features of recommender systems
  - Generalization: covers unseen or rarely seen feature interactions
DCN
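- FM is a shallow structure limited to cross terms of degree 2
- DCN builds all cross terms $x_1^{\alpha_1} x_2^{\alpha_2} \cdots x_d^{\alpha_d}$ whose degree $|\alpha|$ is bounded by a constant
  - Per Theorem 3.1 of the paper, the bound is set by the number of cross layers
- Unlike higher-order FMs, the number of parameters still grows only linearly with the input dimension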

Dataset
- Criteo Display Ads Data
  - Continuous features: 13
  - Categorical features: 26
- Train: 6 days
- Validation & Test: 1 day
Implementation Details
- Continuous features $\rightarrow$ log transform
- Categorical features $\rightarrow$ embedding $\rightarrow$ concatenation (1026 dimensions)
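The log transform for continuous features might look like the sketch below. The post only says "log transform", so the exact variant is an assumption here: negative values are clipped to 0 and $\log(1 + x)$ is used so that zero counts stay finite. The function name is made up for this example.

```python
import math

# Minimal sketch of a log transform for continuous features.
# Assumption: clip negatives to 0 and use log(1 + x); the paper's exact
# variant is not stated in this post.

def log_transform(values):
    return [math.log1p(max(v, 0.0)) for v in values]


raw = [0.0, 1.0, 99.0, -1.0]
transformed = log_transform(raw)
print(transformed)
```

Compressing heavy-tailed counts this way keeps a few very large values from dominating the dense part of $\mathbf{x}_0$.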