[SSM Series]HiPPO: Recurrent Memory with Optimal Polynomial Projections

์ง„์„œ์—ฐ ยท2024๋…„ 3์›” 5์ผ
0

Paper Reading

๋ชฉ๋ก ๋ณด๊ธฐ
24/25

๐Ÿ‘€ย ์ด SSM ์‹œ๋ฆฌ์ฆˆ๋Š” State-space model์— ๋Œ€ํ•œ survey๋ฅผ ์ง„ํ–‰ํ•˜๊ธฐ ์œ„ํ•ด์„œ Albert Gu์˜ ๋…ผ๋ฌธ๋“ค์„ HiPPO๋ถ€ํ„ฐ ์ตœ๊ทผ์˜ MAMBA๊นŒ์ง€ ๋ฆฌ๋ทฐํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ๋…ผ๋ฌธ์˜ SSM๊ณผ ๊ด€๋ จ๋œ ๋ถ€๋ถ„์— ๋Œ€ํ•ด์„œ ๊นŠ์ด ์žˆ๊ฒŒ ์ •๋ฆฌํ•  ์˜ˆ์ •์ด๋ฉฐ, SSM๊ณผ ๊ด€๋ จ์ด ์ ์€ ๋ถ€๋ถ„๋“ค์€ ๊ฐ„๋‹จํ•˜๊ฒŒ ์ •๋ฆฌ๋˜๋Š” ์  ์ฐธ๊ณ  ๋ถ€ํƒ๋“œ๋ฆฝ๋‹ˆ๋‹ค.

Motivation & Contributions

  • HiPPO : ๋Œ€๋Ÿ‰์˜ ๋ˆ„์  ๋ฐ์ดํ„ฐ๋ฅผ polynomial bases์— projectionํ•˜๋Š” ๊ฒƒ.
  • measures : ๊ณผ๊ฑฐ ์‹œ์ ์˜ ์ค‘์š”์„ฑ์„ ์ง€์ •ํ•˜๋Š” ์ฒ™๋„
  • HiPPO๋Š” measures๋ฅผ ์ด์šฉํ•ด์„œ ์ตœ์ ์˜ function approximation๋ฅผ ํ‘ผ๋‹ค.
  • ํŠน์ˆ˜ํ•œ ๊ฒฝ์šฐ๋กœ, HiPPO๋Š” GRU์˜ ์ผ๋ฐ˜ํ™”์ด๋‹ค. (์ด๋Š” SSM ๋ชจ๋ธ์˜ ๊ณตํ†ต์ ์ธ ํŠน์ง•์ด๋‹ค. )

๊ธฐ์กด์˜ sequential forecasting ์—์„œ์˜ ํ•œ๊ณ„์ 

  • ๊ธฐ์กด์˜ ์‹œ๊ณ„์—ด ์˜ˆ์ธก ๋ฐฉ๋ฒ•์€ ์ผ๋ฐ˜์ ์œผ๋กœ sequence length๋‚˜ time scale์— ๋Œ€ํ•œ prior๊ฐ€ ํ•„์š”ํ•˜๋ฉฐ ์ด ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚˜๋ฉด ํšจ๊ณผ์ ์ด์ง€ ์•Š๋‹ค. ์ด๋Š” distribution shift๊ฐ€ ์žˆ๋Š” ์„ค์ •์—์„œ๋Š” ๋ฌธ์ œ๊ฐ€ ๋  ์ˆ˜ ์žˆ๋‹ค.
    โ†’ ๋Œ€๋‹ค์ˆ˜์˜ ๋ชจ๋ธ๋“ค์€ long-term dependency๋ฅผ ์ž˜ ํฌ์ฐฉํ•˜๋Š”์ง€์— ๋Œ€ํ•œ ์ด๋ก ์  ๋ณด์žฅ์ด ๋ถ€์กฑํ•จ.
  • ์ด๋Ÿฌํ•œ ๋ฌธ์ œ์ ์„ ํ•ด๊ฒฐํ•˜๊ธฐ์œ„ํ•ด์„œ HiPPO์—์„œ๋Š” ๊ธฐ์กด ๋ฐฉ๋ฒ•๋“ค์— ๋Œ€ํ•œ ํ†ตํ•ฉ๋œ ๊ด€์ ์„ ๊ฐ€์ง€๊ณ , time scale์— ๋Œ€ํ•œ prior ์—†์ด ๋ชจ๋“  ๊ธธ์ด์˜ dependencies๋ฅผ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ํ•ด๋‹น ๋ฐฉ๋ฒ•์— ๋Œ€ํ•œ ์ด๋ก ์  ๋ณด์žฅ์ด ๊ฐ€๋Šฅํ•จ.
    โ†’ ํ†ตํ•ฉ๋œ ๊ด€์  : SSM์€ recurrent model, Temporal convolution differential equation model์˜ ์ผ๋ฐ˜ํ™”์ด๋‹ค. (์ด๋Ÿฐ ํ†ตํ•ฉ๋œ ๊ด€์ ์œผ๋กœ์˜ ํ•ด์„์€ ํ›„์†์—ฐ๊ตฌ์ธ LSSL์—์„œ ์ž์„ธํ•˜๊ฒŒ ์„ค๋ช…๋˜์–ด์žˆ์Œ.)

Methods

  • ๋ˆ„์ ๋˜๋Š” ๋ฐ์ดํ„ฐ๋ฅผ polynomial basis์— projection ํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š”
    1. a way to quantify the approximation
    2. suitable subspace( projection it onto a subspace of bounded dimension.)
  1. HiPPO builds upon a rich history of the well-studied orthogonal polynomial and related transforms in the signal processing literature.

์ •์˜ 1์— ๋”ฐ๋ฅธ HiPPO์˜ ์„ค๋ช…

HiPPO๋Š” ์—ฐ์† ํ•จ์ˆ˜ f:Rโ‰ฅ0โ†’Rf: Rโ‰ฅ0 โ†’ R์ด ์ฃผ์–ด์กŒ์„ ๋•Œ (ground-truth)๋ฅผ ๋ชจ๋“  ์‹œ๊ฐ„ t์— ๋Œ€ํ•ด ํˆฌ์˜ ์—ฐ์‚ฐ์ž proj(t)proj(t)์™€ ๊ณ„์ˆ˜ ์ถ”์ถœ ์—ฐ์‚ฐ์ž coef(t)coef(t)๋ฅผ ์ •์˜ํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ proj(t)proj(t)๋Š” ์‹œ๊ฐ„ tt๊นŒ์ง€์˜ ํ•จ์ˆ˜ ff๋ฅผ ๋‹คํ•ญ์‹ g(t)โˆˆGg(t) โˆˆ G๋กœ ๋งคํ•‘ํ•˜์—ฌ ๊ทผ์‚ฌ ์˜ค๋ฅ˜ โ€–fโ‰คtโˆ’g(t)โ€–L2(ฮผ(t))โ€–fโ‰คt โˆ’ g(t)โ€–L2(ฮผ(t))๋ฅผ ์ตœ์†Œํ™”ํ•˜๊ณ , coeff๋Š” ๋‹คํ•ญ์‹ g(t)g(t)๋ฅผ ฮผ(t)ฮผ(t)์— ๋Œ€ํ•ด ์ •์˜๋œ ์ง๊ต ๋‹คํ•ญ์‹์˜ ๊ธฐ์ € ๊ณ„์ˆ˜ c(t)โˆˆRNc(t) โˆˆ RN์œผ๋กœ ๋งคํ•‘ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ํ•จ์ˆ˜๋ฅผ ์‹œ๊ฐ„์— ๋”ฐ๋ผ ๋ณ€ํ™”ํ•˜๋Š” ์ธก์ •์— ๊ธฐ๋ฐ˜ํ•œ ์ง๊ต ๋‹คํ•ญ์‹์œผ๋กœ ๊ทผ์‚ฌํ•˜๋Š” ๋ฐฉ๋ฒ•๋ก ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

(hippo(f))(t)=coeft(projt(f))(hippo(f))(t) = coef_t(proj_t(f))

์ „๋ฐ˜์ ์œผ๋กœ HiPPO์—์„œ๋Š” OP๋กœ๋ถ€ํ„ฐ basis๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค.

์–ด๋–ค OP๋ฅผ ์„ ํƒํ•˜๋Š”์ง€์— ๋”ฐ๋ผ์„œ ๊ทผ์‚ฌ๋˜๋Š” ํ•จ์ˆ˜๊ฐ€ ๋‹ค๋ฅผ ๊ฒƒ.

์ •๋ฆฌ

  • HiPPO: Recurrent Memory with Optimal Polynomial Projections
  • ์œ ํ•œํ•œ ์šฉ๋Ÿ‰์˜ ๋ฌธ์ œ๋กœ ๋ˆ„์ ๋œ ๋ฐ์ดํ„ฐ๋กœ representation์„ ์ƒ์„ฑํ•˜๊ธฐ ํž˜๋“ค๋‹ค.
  • HiPPO๋Š” continuous signal๊ณผ discrete time-series๋ฅผ polynomial basis ์œ„์— projectionํ•จ์œผ๋กœ์จ online ์••์ถ•์„ ํ•  ์ˆ˜ ์žˆ๋Š” framework.
  • ๊ณผ๊ฑฐ์˜ ๊ฐ ์‹œ์ ์˜ ์ค‘์š”์„ฑ์„ ์ง€์ •ํ•˜๋Š” observation์ด ์ฃผ์–ด์ง€๋ฉด, HiPPO๋Š” ์ž์—ฐ์Šค๋Ÿฌ์šด online function approximation ๋ฌธ์ œ์— ๋Œ€ํ•œ optimal solution์„ ๋งŒ๋“ ๋‹ค.
  • GRU์™€ ๊ฐ™์€ recurrent network์˜ gating mechanism์˜ ์ผ๋ฐ˜ํ™”์ด๋‹ค.
  • ์—„์ฒญ๋‚œ continuous series data n-dimensional polynomial๋กœ ์ •์˜.

HiPPO ๋…ผ๋ฌธ์€ ์ˆ˜๋งŽ์€ ๋ฐ์ดํ„ฐ, long-sequence data๋ฅผ g(t)g(t)๋กœ ์ถ•์•ฝํ•œ๋‹ค. g(t)g(t)๋กœ ์ถ•์•ฝํ•˜๊ธฐ ์œ„ํ•ด์„œ orthogonal polynomial์„ ์‚ฌ์šฉํ•œ๋‹ค. ์ด orthogonal polynomial์˜ ๊ฐ ํ•ญ ์•ž์—์„œ ๋“ค์–ด๊ฐ€๋Š” ๊ณ„์ˆ˜ c(t)c(t)๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ด ์ด ๋…ผ๋ฌธ์˜ ์ฃผ์š” contribution์ด๋‹ค.

Appendix๋ฅผ ๋ณด๋ฉด, ์ด ๋…ผ๋ฌธ์—์„œ๋Š” ์ผ๋ จ์˜ ์ฆ๋ช…๊ณผ์ •์„ ํ†ตํ•ด Orthogonal polynomial์˜ ๊ณ„์ˆ˜์˜ ๋ฏธ๋ถ„๊ฐ’์ด ์•„๋ž˜์™€ ๊ฐ™์€ ์ˆ˜์‹์œผ๋กœ ์ •๋ฆฌ๋จ์„ ์ฆ๋ช…ํ•œ๋‹ค.
dc(t)dt=A(t)c(t)+B(t)f(t)\frac{dc(t)}{dt} = A(t)c(t)+B(t)f(t)

๋ณต์žกํ•œ orthogonal polynomial ๋กœ์˜ ๊ทผ์‚ฌ๋ฅผ ์œ„์˜ ์‹๊ณผ ๊ฐ™์ด ๊ฐ„๋‹จํ•œ ์ˆ˜์‹์œผ๋กœ ์ •๋ฆฌํ•˜๋Š” ๊ฒƒ์€ ์—„์ฒญ๋‚œ contribution
(์ฆ๋ช… ๊ณผ์ •์€ ์ƒ๋žต )์ด ๋˜๋ฉฐ, ํ›„์† ์—ฐ๊ตฌ๋“ค์ธ LSSL, S4, MAMBA๊นŒ์ง€ ์ด์–ด์ง„๋‹ค.
์—ฌ๊ธฐ์„œ, A,BA,B๋Š” ํ•™์Šต๊ฐ€๋Šฅํ•˜์ง€ ์•Š๋‹ค.
Legendre orthogonal polynomial๋“ฑ์„ ํ’€๋ฉด์„œ A,B๋“ฑ์„ ๊ทผ์‚ฌํ•ด์„œ ์‚ฌ์šฉํ•จ.

0๊ฐœ์˜ ๋Œ“๊ธ€