📄 Factorization Machine Paper Review

์„œ์€์„œยท2023๋…„ 8์›” 14์ผ
0. Abstract

This paper introduces the Factorization Machine (FM), which combines the advantages of Support Vector Machines (SVMs) and factorization models. Like SVMs, FMs are general predictors that work with any real-valued feature vector as input. In contrast to SVMs, FMs model the interactions between variables using factorized parameters. Because of this, FMs can estimate interactions even under huge sparsity, where SVMs fail.

general predictor
A model that is not strongly constrained by the form of the input data and can handle a variety of tasks such as classification and regression.

The paper shows that the FM model equation can be computed in linear time, so FMs can be optimized directly. Thus, unlike nonlinear SVMs, no transformation into the dual form is needed, and the model parameters can be estimated directly without relying on any support vectors.

Other factorization models include specialized models such as matrix factorization, parallel factor analysis, SVD++, PITF, and FPMC. These models cannot handle general prediction tasks and only work for particular kinds of input data, and their model equations and optimization algorithms have to be derived individually for each task. The paper shows that an FM can mimic these models simply by specifying the input feature vectors appropriately. Therefore, FMs can be used even without expert knowledge of factorization models.

1. Introduction

SVM์€ ๋จธ์‹ ๋Ÿฌ๋‹๊ณผ ๋ฐ์ดํ„ฐ ๋งˆ์ด๋‹์—์„œ ๊ฐ€์žฅ ์œ ๋ช…ํ•œ predictor ์ค‘ ํ•˜๋‚˜์ด๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ํ˜‘์—…ํ•„ํ„ฐ๋ง๊ณผ ๊ฐ™์€ ํ™˜๊ฒฝ์—์„œ๋Š” SVM์€ ์ค‘์š”ํ•œ ์—ญํ• ์„ ํ•˜์ง€ ๋ชปํ•œ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ํ‘œ์ค€ SVM predictor๊ฐ€ ์ด ์ž‘์—…์„ ์„ฑ๊ณตํ•˜์ง€ ๋ชปํ•˜๋Š” ์ด์œ ๊ฐ€ ๋งค์šฐ sparseํ•œ ๋ฐ์ดํ„ฐ์˜ ๋ณต์žกํ•œ(๋น„์„ ํ˜•์ ) kernel ๊ณต๊ฐ„์—์„œ reliabel parameter(hyperplane: ์ดˆํ‰๋ฉด)๋ฅผ ํ•™์Šตํ•  ์ˆ˜ ์—†๊ธฐ ๋•Œ๋ฌธ์ž„์„ ๋ณด์—ฌ์ค€๋‹ค. ๋ฐ˜๋ฉด์— tensor factorization model๋“ค๊ณผ ํŠน์ˆ˜ factorization model๋“ค์€ ์ผ๋ฐ˜์ ์ธ ๋ฌธ์ œ์— ์ ์šฉํ•  ์ˆ˜ ์—†๋‹ค.

The paper therefore introduces the FM, a general predictor like the SVM that can nevertheless estimate reliable parameters from very sparse data.

FM์€ ๋ชจ๋“  nested(์ค‘์ฒฉ๋œ) ๋ณ€์ˆ˜๋“ค์˜ ์ƒํ˜ธ์ž‘์šฉ์„ ๋ชจ๋ธํ™”ํ•˜์ง€๋งŒ, SVM์ฒ˜๋Ÿผ dense parametrization์ด ์•„๋‹Œ factorized parametrization์„ ์‚ฌ์šฉํ•œ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” FM์˜ ๋ชจ๋ธ ๋ฐฉ์ •์‹์„ linear time์œผ๋กœ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ parameter์˜ ์ˆ˜์— ๋”ฐ๋ผ ๊ฒฐ์ ๋จ์„ ๋ณด์—ฌ์ค€๋‹ค. ์ด๋Š” ์ฆ‰์‹œ ์ตœ์ ํ™”์™€ ์˜ˆ์ธก์„ ์œ„ํ•œ ๋ณ„๋„์˜ ํ•™์Šต ๋ฐ์ดํ„ฐ์˜ ์ €์žฅ ์—†์ด ๋ชจ๋ธ parameter์˜ ์ €์žฅ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•œ๋‹ค. ๋ฐ˜๋Œ€๋กœ nonlinear SVM์€ ๋ณดํ†ต dual form์œผ๋กœ ์ตœ์ ํ™”ํ•˜๋ฉฐ ํ•™์Šต๋ฐ์ดํ„ฐ(support vectors)์— ์˜์กดํ•˜์—ฌ ์˜ˆ์ธก์„ ๊ณ„์‚ฐํ•œ๋‹ค.

2. Prediction Under Sparsity

▶︎ The most common prediction task is to estimate a function y : R^n → T that maps a real-valued feature vector x to a target domain T (e.g. T = R for regression or T = {+, −} for classification). In the supervised setting, we assume a training dataset D = {(x^(1), y^(1)), (x^(2), y^(2)), ...} of examples of the target function y. In ranking tasks, the function y can be used to score and sort feature vectors x, and the scoring function can be learned from pairwise training data, i.e. feature tuples (x^(A), x^(B)) meaning that x^(A) should be ranked higher than x^(B). Since the pairwise ranking relation is antisymmetric, it is sufficient to use only positive training instances.

The paper deals with problems where the feature vector x is highly sparse, i.e. almost all of its elements x_i are zero.

  • m(x) : the number of non-zero elements in the feature vector x

  • m̄_D : the average of m(x) over all feature vectors x ∈ D

    Huge sparsity appears in many real-world problems, for example in feature vectors built from transaction events (e.g. purchase histories in recommender systems) or in text analysis. One reason for this is that the underlying problems deal with huge categorical variable domains.
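
As a quick illustration of these two quantities, here is a minimal sketch (assuming the feature vectors are stored row-wise in a SciPy CSR matrix; the toy matrix and variable names are mine, not from the paper):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy design matrix: each row is one feature vector x.
X = csr_matrix(np.array([
    [1.0, 0.0, 0.0, 0.3, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.7],
    [1.0, 0.0, 1.0, 0.0, 0.0],
]))

m_x = X.getnnz(axis=1)   # m(x): number of non-zero elements of each feature vector
m_D = m_x.mean()         # m̄_D: average m(x) over the whole dataset D
print(m_x, m_D)          # [2 2 2] 2.0
```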

Example

  • user u โˆˆ U

  • movie(item) i โˆˆ I

    • U = {Alice (A), Bob (B), Charlie (C), ...}
    • I = {Titanic (TI), Notting Hill (NH), Star Wars (SW), Star Trek (ST), ...}
  • certain time t โˆˆ R

  • rating r โˆˆ {1,2,3,4,5}

๐Ÿ‘‰๐Ÿป Observed data S

  • S = {(A, TI, 2010-1, 5), (A, NH, 2010-2, 3), (A, SW, 2010-4, 1), (B, SW, 2009-5, 4), (B, ST, 2009-8, 5), (C, TI, 2009-9, 1), (C, SW, 2009-12, 5)}
    ▶︎ (A, TI, 2010-1, 5) : user A rated the movie TI (Titanic) 5 stars in January 2010

๐ŸŽฌ ์˜ํ™” ๋ฆฌ๋ทฐ ์‹œ์Šคํ…œ์˜ ๊ฑฐ๋ž˜ ๋ฐ์ดํ„ฐ

Figure 1 shows an example of feature vectors created from S (a small code sketch of this construction follows the list below):

  • |U| binary indicator variables representing the active user (blue box)
  • |I| binary indicator variables representing the active item (orange box)
  • indicator variables for all other movies the user has rated, normalized so that they sum to 1 (yellow box)
  • the month in which the rating was given, encoded as a number starting at 1 for January 2009 (green box)
  • indicator variables for the movie the user rated right before the active one (brown box)

3. Factorization Machines (FM)

A. Factorization Machine Model

1) ๋ชจ๋ธ ๋ฐฉ์ •์‹ (degree๊ฐ€ d = 2์ผ ๋•Œ)

  • w_0 : the global bias
  • w_i : models the strength of the i-th variable
  • ŵ_{i,j} := ⟨v_i, v_j⟩ : models the interaction between the i-th and j-th variables
    👉🏻 Instead of using an independent parameter w_{i,j} for every interaction, the FM models the interaction through this factorization. This is the key point that allows good parameter estimates of higher-order interactions even under sparse data.

2) Expressiveness

It is well known that for any positive definite matrix W there exists a matrix V such that W = V·Vᵗ, provided k is sufficiently large. So if k is chosen large enough, an FM can express any interaction matrix W. In sparse settings, however, there is usually not enough data to estimate complex interactions, so a small k should typically be chosen. Restricting k (and thus the expressiveness of the FM) leads to better generalization and therefore better interaction matrices under sparsity.

3) Parameter Estimation Under Sparsity

In sparse settings, there is generally not enough data to estimate the interactions between variables directly and independently. FMs can estimate interactions well even in these settings because factorizing the interaction parameters breaks their independence (the parameters become related to one another).

Example

Let's walk through this using Figure 1!

Q : We want to estimate the interaction between Alice (A) and Star Trek (ST) in order to predict the target y (the rating).
👉🏻 There is no training example in which both x_A and x_ST are non-zero, so a direct estimate would conclude that there is no interaction (w_{A,ST} = 0).
👉🏻 With the factorized interaction parameters ⟨v_A, v_ST⟩, however, the interaction can be estimated even in this situation.

  1. Bob (B) and Charlie (C) will have similar factor vectors v_B and v_C (they gave similar ratings to Star Wars, so ⟨v_B, v_SW⟩ and ⟨v_C, v_SW⟩ have to be similar).
  2. Alice (A) will likely have a factor vector v_A different from Charlie's v_C (her interactions with the factors of Titanic and Star Wars when predicting ratings are different).
  3. The factor vector of Star Trek is likely to be similar to that of Star Wars (Bob (B) shows similar interactions with both movies).

💡 Answer
The dot product (interaction) of the factor vectors of Alice (A) and Star Trek will be similar to the interaction between Alice (A) and Star Wars.

4) Computation

Because all pairwise interactions have to be computed, the complexity of equation (1) is O(kn²). By reformulating the equation, however, this can be reduced to linear time.

Lemma 3.1) The model equation of an FM can be computed in linear time O(kn).
Proof : Because the pairwise interactions are factorized, no model parameter depends directly on two variables at once. The pairwise interaction term can therefore be rewritten as follows.

The complexity of this reformulated equation is O(kn), i.e. linear time.
Under sparsity most elements of x are zero, so only the non-zero elements need to be summed over.
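
A minimal NumPy sketch of this linear-time prediction (my own illustration of the reformulation above, not a reference implementation; `w0`, `w`, `V` follow the parameter names from Section 3.A):

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """2-way FM prediction in O(kn) using the reformulated pairwise term.
    x: (n,) feature vector, w0: scalar bias, w: (n,) linear weights,
    V: (n, k) factor matrix with one row v_i per feature."""
    linear = w0 + w @ x
    s  = V.T @ x                 # sum_i v_{i,f} x_i       for each factor f
    s2 = (V ** 2).T @ (x ** 2)   # sum_i v_{i,f}^2 x_i^2   for each factor f
    pairwise = 0.5 * np.sum(s ** 2 - s2)
    return linear + pairwise

# With a sparse representation of x, only its non-zero entries contribute,
# so the cost drops further from O(kn) to O(k * m(x)).
```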

B. Factorization Machines as Predictors

FM์€ Regression, Binary classification, Ranking ๋“ฑ ๋‹ค์–‘ํ•œ ์˜ˆ์ธก ์—…๋ฌด์—์„œ ์ ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.
์ด ๋ชจ๋“  ๊ฒฝ์šฐ์—์„œ L2์™€ ๊ฐ™์€ ์ •๊ทœํ™” term์€ ์˜ค๋ฒ„ํ”ผํŒ…์„ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด์„œ ์ถ”๊ฐ€๋œ๋‹ค.
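
As a sketch (in my own notation, not copied from the paper), the regularized training objective for any of these tasks can be written as

$$
\operatorname*{argmin}_{\Theta} \sum_{(\mathbf{x},\, y) \in D} \ell\big(\hat{y}(\mathbf{x} \mid \Theta),\, y\big) \;+\; \sum_{\theta \in \Theta} \lambda_{\theta}\, \theta^{2}
$$

where ℓ is a task-specific loss (e.g. squared loss for regression, hinge or logit loss for binary classification) and λ_θ are the L2 regularization strengths.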

C. Learning Factorization Machines

FM์˜ ๋ชจ๋ธ ํŒŒ๋ผ๋ฏธํ„ฐ(W_0,w, V)๋Š” gradient descent methods(์˜ˆ๋ฅผ ๋“ค์–ด SGD)๋ฅผ ํ†ตํ•ด ํšจ๊ณผ์ ์œผ๋กœ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๋‹ค. FM์˜ gradient๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

The sum Σ_j v_{j,f} x_j is independent of i, so it can be precomputed once and reused.
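
A minimal SGD sketch for squared loss that reuses that precomputed sum (my own illustration under the parameter names above, not the paper's reference code):

```python
import numpy as np

def fm_sgd_step(x, y, w0, w, V, lr=0.01, reg=1e-4):
    """One SGD update for a 2-way FM with squared loss.
    Gradients follow the case formula above; s[f] = sum_j v_{j,f} x_j
    is computed once and reused for every v_{i,f}."""
    s = V.T @ x                                   # precomputed sums, one per factor f
    y_hat = w0 + w @ x + 0.5 * np.sum(s ** 2 - (V ** 2).T @ (x ** 2))
    err = y_hat - y                               # d(loss)/d(y_hat), constant factor folded into lr

    w0 -= lr * err                                # gradient w.r.t. w0 is 1
    w  -= lr * (err * x + reg * w)                # gradient w.r.t. w_i is x_i, plus L2 term
    grad_V = err * (np.outer(x, s) - V * (x ** 2)[:, None])   # x_i * s_f - v_{i,f} * x_i^2
    V  -= lr * (grad_V + reg * V)
    return w0, w, V
```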

D. d-way Factorization Machine

The 2-way FM can easily be extended to a d-way FM.
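
Written out, the d-way model equation (equation (5) in the paper) is

$$
\hat{y}(\mathbf{x}) := w_0 + \sum_{i=1}^{n} w_i x_i
+ \sum_{l=2}^{d}\; \sum_{i_1=1}^{n} \cdots \sum_{i_l = i_{l-1}+1}^{n}
\left(\prod_{j=1}^{l} x_{i_j}\right)
\left(\sum_{f=1}^{k_l} \prod_{j=1}^{l} v^{(l)}_{i_j, f}\right)
$$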

The parameters of the l-th order interaction term are factorized with the PARAFAC (parallel factor analysis) model.
The straightforward complexity of equation (5) is O(k_d n^d), but with an argument analogous to Lemma 3.1 it can also be computed in linear time.

E. Summary

FM์€ full feature vectore๋“ค ๋Œ€์‹  factorized interaction๋“ค์„ ์‚ฌ์šฉํ•˜์—ฌ feature vector x ์˜ ๊ฐ€๋Šฅํ•œ ๋ชจ๋“  ์ƒํ˜ธ์ž‘์šฉ์„ ๋ชจ๋ธํ™”ํ•œ๋‹ค.
์ด๋Š” ๋‘ ๊ฐ€์ง€์˜ ์žฅ์ ์„ ๊ฐ€์ง€๊ณ  ์žˆ๋Š”๋ฐ

  • Interactions between values can be estimated even under high sparsity, and the model can even generalize to interactions that were never observed.
  • Both prediction and learning take time linear in the number of parameters. This makes direct optimization with SGD possible and allows a variety of loss functions to be used.