Anomaly Dectection์„ ์š”์ฆ˜์— ์ข€ ์•Œ์•„๋ณด๋ฉด์„œ, Time Series Forecasting ๋ถ„์•ผ๋ฅผ ๊ณ„์† ์ ‘ํ•  ์ˆ˜ ๋ฐ–์— ์—†์—ˆ๊ณ , Transformer์˜ sequence ์ ์ธ ํŠน์ง•์„ TSF์— ์‚ฌ์šฉํ•œ ์˜ˆ์‹œ๊ฐ€ ์—†์„๊นŒ ํ•˜์—ฌ์„œ ์ด๋ ‡๊ฒŒ ์ฐพ์•„๋ณด๋˜ ์ค‘ ์ข‹์€ ๋…ผ๋ฌธ์„ ์ฐพ๊ฒŒ ๋˜์—ˆ๋‹ค.
๋†€๋ž๊ฒŒ๋„,,, ์ด ๋…ผ๋ฌธ์€ ์œ ๋ช…ํ•œ ๋ชจ๋ธ์ธ Transformer๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋Š” ๋ชจ๋ธ๋“ค์ด ์‹œ๊ณ„์—ด ์˜ˆ์ธก์— ์žˆ์–ด์„œ ๊ณผ์—ฐ ํšจ๊ณผ์ ์ธ์ง€ ์˜๋ฌธ์„ ๊ฐ€์ง€๊ณ  ๊ฐ„๋‹จํ•œ ๊ตฌ์กฐ์˜ ๋ชจ๋ธ๊ณผ ๋น„๊ตํ•˜๋ฉฐ transformers๊ฐ€ ์‹œ๊ฐ„ ์ •๋ณด๋ฅผ ํ•™์Šตํ•˜์ง€ ๋ชปํ•จ์„ ์ฆ๋ช…ํ•˜๋Š” ๋…ผ๋ฌธ์ด๋‹ค...
์ฒ˜์Œ ์ด ๋…ผ๋ฌธ์„ ์ฝ๋Š” ๋ถ„๋“ค์ด๋ผ๋ฉด ์œ ํŠœ๋ธŒ ์ฑ„๋„์— ๋จผ์ € ๋“ค์–ด๊ฐ€์„œ ์ด ๋…ผ๋ฌธ์— ๋Œ€ํ•ด์„œ ํ•œ ๋ฒˆ ๋จผ์ € ๋“ค์–ด๋ณด๊ธธ ๋ฐ”๋ž€๋‹ค.

0. Abstract

[์ƒํ™ฉ]

  • Transformer-based models have surged as solutions to the Long-term Time Series Forecasting (LTSF) problem
  • Transformers are arguably the most successful solution for extracting semantic correlations among the elements of a long sequence
    ⇒ However, time series modeling requires extracting temporal relations from an ordered set of continuous points

[Hypothesis and Experiment]

  • Transformers embed sub-series using positional encoding and tokens, which help preserve ordering information
    ⇒ Even so, the permutation-invariant nature of the self-attention mechanism inevitably causes a loss of temporal information
    ⇒ So Transformers are not expected to show outstanding performance on the LTSF task
    ⇒ To test this claim, the authors compare against an extremely simple one-layer linear model named LTSF-Linear

[Results]

  • 9๊ฐœ์˜ real-life ๋ฐ์ดํ„ฐ์…‹์„ ํ†ตํ•œ ์‹คํ—˜ ๊ฒฐ๊ณผ์—์„œ ํ˜„์กดํ•˜๋Š” ์ •๊ตํ•œ Transformer ๊ธฐ๋ฐ˜ LTSF ๋ชจ๋ธ๋“ค๋ณด๋‹ค ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ž„
  • ์ถ”๊ฐ€์ ์œผ๋กœ LTSF ๋ชจ๋ธ์˜ ๊ตฌ์„ฑ ์š”์†Œ๋“ค์˜ temporal relation ์ถ”์ถœ ๋Šฅ๋ ฅ์— ๋Œ€ํ•œ ์˜ํ–ฅ๋ ฅ์„ ๋น„๊ต

๐Ÿค” Transformer ๊ธฐ๋ฐ˜์˜ TSF ๋ชจ๋ธ??

  1. Informer (AAAI 2021)
  2. Autoformer (NeurIPS 2021)
  3. Pyraformer (ICLR 2022)
  4. FEDformer (ICML 2022)
  5. EarthFormer (NeurIPS 2022)
  6. Non-Stationary Transformer (NeurIPS 2022)
  7. ...

There is plenty of research on Transformer models for TSF,,, but there are many doubts, and the common sentiment is that all this machinery does not translate directly into performance...

1. Introduction

[Transformer?]

  • The Transformer is the most successful sequence-modeling architecture in fields such as NLP, speech recognition, and computer vision

  • ์ตœ๊ทผ์—๋Š” ์‹œ๊ณ„์—ด ๋ถ„์„์—๋„ Transformer ๊ธฐ๋ฐ˜ ์†”๋ฃจ์…˜๋“ค์ด ๋งŽ์ด ์—ฐ๊ตฌ๋˜์—ˆ์Œ
    (Ex.) LongTrans, Informer, Autoformer, Pyraformer, FED-former ๋“ฑ์ด LTSF ๋ฌธ์ œ์—์„œ ์ฃผ๋ชฉํ• ๋งŒํ•œ ๋ชจ๋ธ

  • Transformer ์˜ ๊ฐ€์žฅ ์ฃผ์š”ํ•œ ๋ถ€๋ถ„ : multi-head self-attention (long sequence์˜ ์š”์†Œ๋“ค ๊ฐ„์˜ semantic correlations ์„ ํšจ๊ณผ์ ์œผ๋กœ ์ถ”์ถœ)
    โœจ self-attention ์˜ ํŠน์ง•
    1) permutation-invariant (์ž…๋ ฅ ๋ฒกํ„ฐ ์š”์†Œ์˜ ์ˆœ์„œ์™€ ์ƒ๊ด€์—†์ด ๊ฐ™์€ ์ถœ๋ ฅ์„ ์ƒ์„ฑ)
    2) anti-order ํ•˜์—ฌ temporal information loss๋ฅผ ํ”ผํ•  ์ˆ˜ ์—†์Œ
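
A quick numerical check of the permutation point: with no positional encoding, permuting the input rows of plain scaled dot-product self-attention just permutes the output rows the same way, so nothing about the original ordering survives the layer. A minimal numpy sketch (generic single-head attention, not any specific paper's implementation):

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Plain single-head scaled dot-product self-attention, no positional encoding."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over keys
    return weights @ v

rng = np.random.default_rng(0)
L, d = 8, 4                                             # sequence length, model dim
x = rng.normal(size=(L, d))
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(L)                               # shuffle the time order
out_original = self_attention(x, wq, wk, wv)
out_shuffled = self_attention(x[perm], wq, wk, wv)

# The output rows are permuted in exactly the same way: the layer itself
# carries no notion of temporal order.
print(np.allclose(out_original[perm], out_shuffled))    # True
```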

  • Various positional encodings can preserve some ordering information, but it is still inevitably lost once self-attention is applied on top
    🤔 In NLP, where the semantic meaning of a sentence is largely preserved even if word order is shuffled, this is not much of a problem... but in TSF it becomes one...

    So then,,,
    Are Transformers really effective for long-term time series forecasting?

[Order: the essence of time series data]

  • When analyzing time series data, the numerical values themselves carry little meaning
    ⇒ We mostly care about modeling the temporal changes within a continuous set of points
    ⇒ The order itself plays the most important role!!

[A flaw in previous experiments]

  • Transformer-based LTSF solutions report improved prediction accuracy over prior methods
    ⇒ However, in those experiments the non-Transformer baselines were autoregressive or Iterated Multi-Step (IMS) forecasting models, which are known to suffer from error accumulation on LTSF problems... (IMS vs. DMS is sketched below)
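
For reference, the IMS vs. DMS distinction in code: IMS feeds each one-step prediction back in as input, so errors compound over the horizon, while DMS maps the look-back window to the whole horizon in one forward pass. A rough sketch with hypothetical placeholder models:

```python
import numpy as np

def ims_forecast(history, one_step_model, horizon):
    """Iterated Multi-Step: predict one step, append it, repeat.
    Later predictions are conditioned on earlier predictions, so errors accumulate."""
    window = list(history)
    preds = []
    for _ in range(horizon):
        next_val = one_step_model(np.array(window))
        preds.append(next_val)
        window = window[1:] + [next_val]           # slide the window over its own output
    return np.array(preds)

def dms_forecast(history, multi_step_model):
    """Direct Multi-Step: one forward pass maps the whole window to the whole horizon."""
    return multi_step_model(np.array(history))

# toy placeholder models (hypothetical): naive persistence forecasters
one_step = lambda w: w[-1]
multi_step = lambda w: np.repeat(w[-1], 24)

history = np.sin(np.arange(96) / 8.0)
print(ims_forecast(history, one_step, 24).shape)   # (24,)
print(dms_forecast(history, multi_step).shape)     # (24,)
```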

[Experimental setup]
⇒ To check the real performance, this paper compares against Direct Multi-Step (DMS) forecasting

  • Hypothesis: not every time series is predictable, let alone over long horizons, so long-term forecasting is only feasible for series with a relatively clear trend and periodicity
  • New model: since a linear model can already extract such information, the paper proposes the very simple LTSF-Linear model as a new baseline for comparison
  • LTSF-Linear model: directly predicts the future time series by regressing on the past series with nothing but a one-layer linear model
  • Datasets: widely used benchmark datasets covering traffic, energy, economics, weather, disease prediction, etc.
  • Results: LTSF-Linear outperforms the complex Transformer-based models in all cases, in some cases by a large margin (20~50%)
  • A problem with Transformer-based models: contrary to their claims, prediction errors do not decrease as look-back window sizes grow, showing that they fail to extract temporal relations from long sequences

[contributions]

โœ… LSTF task์—์„œ์˜ Transformers์˜ ํšจ๊ณผ์— ๋Œ€ํ•œ ์ฒซ ๋ฒˆ์งธ ์˜๋ฌธ์„ ์ œ๊ธฐํ•œ ์—ฐ๊ตฌ
โœ… ๊ฐ„๋‹จํ•œ one-layer linear models์ธ LTSF-Linear์™€ Transformer ๊ธฐ๋ฐ˜ LTSF ์†”๋ฃจ์…˜๋“ค์„ 9๊ฐœ์˜ ๋ฒค์น˜๋งˆํฌ ๋ฐ์ดํ„ฐ์…‹์„ ํ†ตํ•ด ๋น„๊ต
โœ… LTSF-Linear๊ฐ€ LTSF ๋ฌธ์ œ์˜ ์ƒˆ๋กœ์šด baseline์ด ๋  ์ˆ˜ ์žˆ์Œ
โœ… ๊ธฐ์กด Transformer ๊ธฐ๋ฐ˜ ์†”๋ฃจ์…˜์˜ ๋‹ค์–‘ํ•œ ์ธก๋ฉด์— ๋Œ€ํ•œ ์—ฐ๊ตฌ ์ˆ˜ํ–‰
1. long inputs์„ ๋ชจ๋ธ๋งํ•˜๋Š” ๋Šฅ๋ ฅ
2. ์‹œ๊ณ„์—ด order์— ๋Œ€ํ•œ sensitivity
3. positional encoding๊ณผ sub-series embedding์˜ ์˜ํ–ฅ๋ ฅ ํšจ์œจ์„ฑ ๋น„๊ต
โœ… ๊ฒฐ๋ก ์ ์œผ๋กœ, ์‹œ๊ณ„์—ด์— ๋Œ€ํ•œ Transformer์˜ temporal modeling ๊ธฐ๋Šฅ์€ ์ ์–ด๋„ ๊ธฐ์กด LTSF ๋ฒค์น˜๋งˆํฌ์—์„œ๋Š” ๊ณผ์žฅ๋จ

2. Preliminaries: TSF Problem Formulation
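
(A paraphrase of the standard multivariate setup, not a quote from the paper: given a look-back window of L past time steps over C variates, the model directly outputs the next T steps.)

```latex
% Multivariate TSF setup (paraphrased): C variates, look-back size L, horizon T
% History:   X = [x_1, \dots, x_L],  x_t \in \mathbb{R}^C
% Forecast:  \hat{X} = [\hat{x}_{L+1}, \dots, \hat{x}_{L+T}] = f(X)
f : \mathbb{R}^{L \times C} \rightarrow \mathbb{R}^{T \times C}
```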

3. Transformer-Based LTSF Solutions

  • Applying the vanilla Transformer model to the LTSF problem runs into two limitations:
    1) the quadratic time/memory complexity of the original self-attention
    2) error accumulation caused by the autoregressive decoder design
  • Informer proposes a new Transformer architecture that reduces the complexity and uses a DMS forecasting strategy to address these issues
  • Since then, many Transformer-based models have improved performance and efficiency; the design elements of current Transformer-based LTSF solutions can be summarized as follows

[1] Time series decomposition

  • zero-mean normalization is commonly used during data preprocessing
  • Autoformer is the first to apply seasonal-trend decomposition before each neural block
    + a standard method in time series analysis that makes raw data more predictable
    + it extracts the trend-cyclical component of the series with moving average kernels over the input sequence
    + the difference between the original sequence and the trend component is regarded as the seasonal component (see the sketch below)
  • FEDformer goes further with a mixture-of-experts strategy that mixes trend components extracted by moving average kernels with various kernel sizes
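
A minimal sketch of that seasonal-trend decomposition idea (moving-average trend plus remainder), using simple edge padding rather than any particular paper's exact padding scheme:

```python
import numpy as np

def decompose(x, kernel_size=25):
    """Split a 1-D series into a trend (moving average) and a seasonal (remainder) part."""
    pad = kernel_size // 2
    # replicate the edges so the moving average keeps the original length
    padded = np.concatenate([np.repeat(x[:1], pad), x, np.repeat(x[-1:], pad)])
    kernel = np.ones(kernel_size) / kernel_size
    trend = np.convolve(padded, kernel, mode="valid")
    seasonal = x - trend                             # remainder after removing the trend
    return trend, seasonal

t = np.arange(200)
series = 0.05 * t + np.sin(2 * np.pi * t / 24)       # toy series: linear trend + daily cycle
trend, seasonal = decompose(series)
print(trend.shape, seasonal.shape)                   # (200,) (200,)
```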

[2] Input embedding strategies

  • Transformer ์•„ํ‚คํ…์ฒ˜์˜ self-attention layer๋Š” ์‹œ๊ณ„์—ด์˜ position information ์„ ๋ณด์กดํ•˜์ง€ ๋ชปํ•จ
    โ‡’ ๊ทธ๋Ÿฌ๋‚˜ ์‹œ๊ณ„์—ด์˜ local positional information ์ฆ‰ ์‹œ๊ณ„์—ด์˜ ordering์€ ๋งค์šฐ ์ค‘์š” (+ hierarchial timestamps (week, month, year), agnostic timestamps (holidays and events)์™€ ๊ฐ™์€ global temporal information ๋˜ํ•œ ์œ ์ตํ•œ ์ •๋ณด)

  • ์‹œ๊ณ„์—ด inputs์˜ temporal context ๋ฅผ ๊ฐ•ํ™”ํ•˜๊ธฐ ์œ„ํ•ด SOTA Transformer ๊ธฐ๋ฐ˜ ๋ชจ๋ธ๋“ค์€ ์—ฌ๋Ÿฌ embedding์„ input sequence์— ํ™œ์šฉ
    + fixed positional encoding channel projection embedding learnable temporal embeddings
    + temporal convolution layer๋ฅผ ํ†ตํ•œ temporal embeddings learnable timestamps

[3] Self-attention schemes

  • Transformers use the self-attention mechanism to extract semantic dependencies between paired elements
  • Recent works propose two strategies for reducing the O(L^2) time/memory complexity of the vanilla Transformer
    1. LogTrans and Pyraformer introduce a sparsity bias into the self-attention mechanism
    ⇒ LogTrans uses a Logsparse mask to reduce the computational complexity to O(L log L)
    ⇒ Pyraformer reduces the time/memory complexity to O(L) with pyramidal attention that captures hierarchically multi-scale temporal dependencies
    2. Informer and FEDformer exploit the low-rank property of the self-attention matrix
    ⇒ Informer reduces the complexity to O(L log L) with the ProbSparse self-attention mechanism and a self-attention distilling operation
    ⇒ FEDformer reduces the complexity to O(L) by designing a Fourier enhanced block and a wavelet enhanced block with random selection
    ⇒ Autoformer designs a series-wise auto-correlation mechanism that replaces the original self-attention layer

[4] Decoders

  • The vanilla Transformer decoder generates outputs autoregressively, which causes slow inference and error accumulation, especially for long-term predictions
    - Informer designs a generative-style decoder for DMS forecasting
    - Pyraformer uses a fully-connected layer concatenating the spatio-temporal axes as its decoder
    - Autoformer sums up the trend-cyclical components and the decomposed seasonal features refined by its stacked auto-correlation mechanism to produce the final prediction
    - FEDformer uses a decomposition scheme with the proposed frequency attention block to decode the final results

  • Transformer ๋ชจ๋ธ์˜ ํ•ต์‹ฌ ์ „์ œ๋Š” paired elements ๊ฐ„์˜ semantic correlations
    โœ”๏ธ self-attention ์ž์ฒด๋Š” permutation-invariantํ•˜๋ฉฐ temproal relations์„ ๋ชจ๋ธ๋งํ•˜๋Š” ๋Šฅ๋ ฅ์€ input tokens๊ณผ ๊ด€๋ จ๋œ positional encoding์— ํฌ๊ฒŒ ์ขŒ์šฐ๋จ
    โœ”๏ธ ์‹œ๊ณ„์—ด์˜ numerical data๋ฅผ ๊ณ ๋ คํ•ด๋ณด๋ฉด, ๋ฐ์ดํ„ฐ ์‚ฌ์ด์—๋Š” point-wise semantic correlations ๊ฐ€ ๊ฑฐ์˜ ์—†์Œ

  • ์‹œ๊ณ„์—ด ๋ชจ๋ธ๋ง์—์„œ ๊ฐ€์žฅ ์ค‘์š”ํ•œ ๋ถ€๋ถ„์€ ์—ฐ์†์ ์ธ ๋ฐ์ดํ„ฐ๋“ค์˜ ์ง‘ํ•ฉ์—์„œ์˜ temporal relations ์ด๋ฉฐ, ๋ฐ์ดํ„ฐ ๊ฐ„์˜ ์ˆœ์„œ๊ฐ€ Transformer์˜ ํ•ต์‹ฌ์ธ paired
    relationship๋ณด๋‹ค ์ค‘์š”ํ•œ ์—ญํ• ์„ ์ˆ˜ํ–‰

  • Embedding sub-series with positional encodings and tokens preserves some ordering information, but the permutation-invariant nature of the self-attention mechanism inevitably causes temporal information loss

4. An Embarrassingly Simple Baseline

LTSF-Linear์˜ ๊ธฐ์ดˆ ์ˆ˜์‹์€ weighted sum ์—ฐ์‚ฐ์„ ํ†ตํ•ด ๋ฏธ๋ž˜ ์˜ˆ์ธก์„ ์œ„ํ•ด ๊ณผ๊ฑฐ ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ๋ฅผ ์ง์ ‘ ํšŒ๊ท€ํ•˜๋Š” ๊ฒƒ

5. Experiments

5.1 Experimental Settings

| Dataset

  • 9๊ฐœ์˜ ๋‹ค๋ณ€๋Ÿ‰ real-world ๋ฐ์ดํ„ฐ์…‹ ํ™œ์šฉ
  • ETTh1, ETTh2, ETTm1 ETTm2, Traffic, Electricity, Weather, ILI, Exchange-Rate

| Evaluation Metric

  • Mean Squared Error (MSE) and Mean Absolute Error (MAE)
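
Both are simply averages over all predicted steps and variates; a tiny sketch for reference:

```python
import numpy as np

def mse(pred, true):
    return np.mean((pred - true) ** 2)      # Mean Squared Error

def mae(pred, true):
    return np.mean(np.abs(pred - true))     # Mean Absolute Error
```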

| Compared Methods

  • 5๊ฐœ์˜ Transformer ๊ธฐ๋ฐ˜ ๋ฐฉ๋ฒ•๋ก  : FEDformer, Autoformer, Informer, Pyraformer, LogTrans
  • naive DMS ๋ฐฉ๋ฒ•๋ก 
    Closest Repeat : look-back window์˜ ๋งˆ์ง€๋ง‰ ๊ฐ’์„ ๋ฐ˜๋ณต
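
The Repeat baseline is essentially a one-liner: tile the last observed value across the whole horizon. A rough sketch:

```python
import numpy as np

def repeat_forecast(window, pred_len):
    """Naive DMS baseline: repeat the last observed value over the whole horizon."""
    return np.repeat(window[..., -1:], pred_len, axis=-1)

window = np.random.randn(7, 336)            # [channels, look-back]
print(repeat_forecast(window, 96).shape)     # (7, 96)
```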

5.2 Comparison with Transformers


โœ”๏ธ LSTF-Linear๋Š” ๋ณ€์ˆ˜ ๊ฐ„์˜ correlations์„ ๋ชจ๋ธ๋งํ•˜์ง€ ์•Š์•˜์Œ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ , SOTA ๋ชจ๋ธ์ธ FEDformer๋ฅผ ๋Œ€๋ถ€๋ถ„์˜ ๊ฒฝ์šฐ multivariate forecasting์—์„œ ์•ฝ
20%~50% ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์ž„
โœ”๏ธ NLinear์™€ DLinear๋Š” distribution shift์™€ trend-seasonality features๋ฅผ ๋‹ค๋ฃจ๋Š” ๋Šฅ๋ ฅ์—์„œ ์šฐ์„ธ
โœ”๏ธ univariate forecasting์˜ ๊ฒฐ๊ณผ์—์„œ๋„ LTSF-Linear๊ฐ€ ์—ฌ์ „ํžˆ Transformer ๊ธฐ๋ฐ˜ LTSF ์†”๋ฃจ์…˜๋“ค๊ณผ ํฐ ์ฐจ์ด๋ฅผ ๋ณด์ž„
โœ”๏ธ Repeat ๋ชจ๋ธ์€ long-term seasonal data(e.g, Electricity and Traffic)์—์„œ ๊ฐ€์žฅ ์ข‹์ง€ ์•Š์€ ์„ฑ๋Šฅ์„ ๋ณด์˜€์ง€๋งŒ, Exchange-Rate ๋ฐ์ดํ„ฐ์…‹์—์„  ๋ชจ๋“  Transformer
๊ธฐ๋ฐ˜ ๋ชจ๋ธ๋“ค๋ณด๋‹ค ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋ณด์ž„
++++ ์ด๋Š” Transformer ๊ธฐ๋ฐ˜ ๋ชจ๋ธ๋“ค์ด ํ•™์Šต ๋ฐ์ดํ„ฐ์˜ ๊ฐ‘์ž‘์Šค๋Ÿฌ์šด change noises์— overfitํ•˜์—ฌ ์ž˜๋ชป๋œ trend ์˜ˆ์ธก์œผ๋กœ ์ด์–ด์ ธ ์ •ํ™•๋„๊ฐ€ ํฌ๊ฒŒ ์ €ํ•˜๋  ์ˆ˜ ์žˆ์Œ
++++ Repeat์€ bias๊ฐ€ ์กด์žฌ X

โœ”๏ธ 3๊ฐœ์˜ ๋ฐ์ดํ„ฐ์…‹์— ๋Œ€ํ•œ Transformer ๊ธฐ๋ฐ˜ ๋ชจ๋ธ๋“ค๊ณผ LTSF-Linear ๋ชจ๋ธ์˜ ์˜ˆ์ธก ๊ฒฐ๊ณผ
โœ”๏ธ Electricity(Sequence 1951, Variate 36), Exchange-Rate(Sequence 676, Variate 3), ETTh2(Sequence 1241, Variate 2)
โœ”๏ธ ํ•ด๋‹น ๋ฐ์ดํ„ฐ์…‹์€ ๊ฐ๊ธฐ ๋‹ค๋ฅธ temporal patterns์„ ๋ณด์ž„
โœ”๏ธ input์˜ ๊ธธ์ด๊ฐ€ 96 steps์ด๊ณ , output horizon์ด 336 steps์ผ ๋•Œ Transformer๋Š” Electricity์™€ ETTh2 ๋ฐ์ดํ„ฐ์…‹์—์„œ ๋ฏธ๋ž˜ ๋ฐ์ดํ„ฐ์˜ scale๊ณผ bias๋ฅผ ํฌ์ฐฉํ•˜๋Š”๋ฐ ์‹คํŒจ
โœ”๏ธ ๋˜ํ•œ Exchange-Rate ๋ฐ์ดํ„ฐ์…‹์—์„œ๋„ ์ ์ ˆํ•œ trend๋ฅผ ์˜ˆ์ธกํ•˜์ง€ ๋ชปํ•จ

This indicates that existing Transformer-based solutions are not suitable for the LTSF task

5.3 More Analyses on LTSF-Transformers

๐Ÿ’ก Can existing LTSF-Transformers extract temporal relations well from longer input sequences?

  • The look-back window size determines how much can be learned from historical data, so it strongly affects forecasting accuracy
  • A powerful TSF model with a strong temporal relation extraction ability should be able to achieve better results with larger look-back window sizes

โœ”๏ธ Transformer ๊ธฐ๋ฐ˜ ๋ชจ๋ธ๋“ค์˜ ์„ฑ๋Šฅ์€ ๊ธฐ์กด ์—ฐ๊ตฌ์˜ ๊ฒฐ๊ณผ์™€ ๋™์ผํ•˜๊ฒŒ look-back window size๊ฐ€ ์ปค์ง€๋ฉด์„œ ์„ฑ๋Šฅ์ด ์•…ํ™”๋˜๊ฑฐ๋‚˜ ์•ˆ์ •์ ์œผ๋กœ ์œ ์ง€
โœ”๏ธ ๋ฐ˜๋ฉด LTSF-Linear ๋ชจ๋ธ์€ look-back windows sizes๊ฐ€ ์ปค์ง์— ๋”ฐ๋ผ ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ

๐Ÿ’ก What can be learned for long-term forecasting?

Experimental results,,
✔️ The performance of the SOTA Transformers drops slightly in the Far setting, which implies that these models only capture similar temporal information from the adjacent time series sequence
✔️ Since capturing the intrinsic characteristics of a dataset generally does not require a large number of parameters, even a single parameter can represent the periodicity
✔️ Using too many parameters invites overfitting, which partly explains why LTSF-Linear performs better than the Transformers

๐Ÿ’ก Are the self-attention scheme effective for LTSF?

โœ”๏ธ Informer์˜ ์„ฑ๋Šฅ์€ ์ ์ง„์ ์œผ๋กœ ๋‹จ์ˆœํ™”ํ• ์ˆ˜๋ก ํ–ฅ์ƒ๋˜์–ด LTSF ๋ฒค์น˜๋งˆํฌ์—์„œ๋Š” self-attention ์ฒด๊ณ„ ๋ฐ ๊ธฐํƒ€ ๋ณต์žกํ•œ ๋ชจ๋“ˆ์ด ํ•„์š”ํ•˜์ง€ ์•Š์Œ์„ ๋‚˜ํƒ€๋ƒ„

๐Ÿ’ก Can existing LTSF-Transformers preserve temporal order well?

โœ”๏ธ ์ „์ฒด์ ์œผ๋กœ LTSF-Linear ๋ชจ๋ธ๋“ค์ด Transformer ๊ธฐ๋ฐ˜ ๋ชจ๋ธ๋“ค๋ณด๋‹ค ํ‰๊ท ์ ์ธ ์„ฑ๋Šฅ ํ•˜๋ฝ์ด ๋ชจ๋“  ๊ฒฝ์šฐ์— ์ปธ์œผ๋ฉฐ, ์ด๋Š” Transformers ๊ธฐ๋ฐ˜ ๋ชจ๋ธ๋“ค์ด temporal order
๋ฅผ ์ž˜ ๋ณด์กดํ•˜์ง€ ์•Š๋Š” ๊ฒƒ์„ ๋‚˜ํƒ€๋ƒ„
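
The order-sensitivity test amounts to corrupting the temporal order of the look-back window (e.g. shuffling it) and measuring how much the error grows; a model that truly relies on temporal order should degrade sharply. A rough sketch of such an ablation with toy data and a hypothetical last-value forecaster:

```python
import numpy as np

def order_sensitivity(model, windows, targets):
    """Compare forecast MSE on intact vs. randomly shuffled look-back windows.
    A large gap means the model actually relies on temporal order."""
    rng = np.random.default_rng(0)
    shuffled = windows[:, rng.permutation(windows.shape[1]), :]   # permute the time axis
    mse = lambda pred: np.mean((pred - targets) ** 2)
    return mse(model(windows)), mse(model(shuffled))

# toy check on random data with a hypothetical last-value forecaster
windows = np.random.randn(32, 96, 3)                  # [batch, seq_len, channels]
targets = np.random.randn(32, 24, 3)                  # [batch, pred_len, channels]
last_value_model = lambda w: np.repeat(w[:, -1:, :], 24, axis=1)
print(order_sensitivity(last_value_model, windows, targets))
```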

๐Ÿ’ก How effective are different embedding strategies?

  • Transformer ๊ธฐ๋ฐ˜ ๋ชจ๋ธ๋“ค์—์„œ ์‚ฌ์šฉ๋œ position & timestamp embeddings์˜ ์ด์ ์— ๋Œ€ํ•ด ํ™•์ธ
    โœ”๏ธ Informer๋Š” positional embeddings๊ฐ€ ์—†์„ ๊ฒฝ์šฐ ์˜ˆ์ธก ์˜ค๋ฅ˜๊ฐ€ ํฌ๊ฒŒ ์ฆ๊ฐ€
    ++++timestamp embeddings๊ฐ€ ์—†๋Š” ๊ฒฝ์šฐ์—๋Š” ์˜ˆ์ธก ๊ธธ์ด๊ฐ€ ๊ธธ์–ด์ง์— ๋”ฐ๋ผ ์„ฑ๋Šฅ์ด ์ ์ฐจ ํ•˜๋ฝ
    ++++ Informer๊ฐ€ ๊ฐ ํ† ํฐ์— ๋Œ€ํ•ด ๋‹จ์ผ time step์„ ์‚ฌ์šฉํ•˜๊ธฐ ๋•Œ๋ฌธ์— temporal information์„ ํ† ํฐ์— ๋„์ž…ํ•ด์•ผ ํ•จ
    โœ”๏ธ FEDformer์™€ Autoformer๋Š” ๊ฐ ํ† ํฐ๋งˆ๋‹ค ๋‹จ์ผ time step์„ ์‚ฌ์šฉํ•˜์ง€ ์•Š๊ณ  temporal information์„ ๋„์ž…ํ•˜๊ธฐ ์œ„ํ•ด timestamps์˜ ์‹œํ€€์Šค๋ฅผ ์ž…๋ ฅ
    ++++ ๊ณ ์ •๋œ positional embeddings ์—†์ด๋„ ๋น„์Šทํ•˜๊ฑฐ๋‚˜ ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑ
    ++++ global temporal information loss ๋•Œ๋ฌธ์— timestamp embeddings์ด ์—†์œผ๋ฉด Autoformer์˜ ์„ฑ๋Šฅ์€ ๋น ๋ฅด๊ฒŒ ํ•˜๋ฝ
    ++++ FEDformer๋Š” temporal inductive bias๋ฅผ ๋„์ž…ํ•˜๊ธฐ ์œ„ํ•œ frequency-enhanced module ๋•๋ถ„์— position/timestamp embeddings์„ ์ œ๊ฑฐํ•ด๋„ ์„ฑ๋Šฅ์ด ๋œ
    ํ•˜๋ฝ

๐Ÿ’ก Is training data size a limiting factor for existing LTSF-Transformers?

โœ”๏ธ ๊ธฐ๋Œ€์™€๋Š” ๋‹ฌ๋ฆฌ ์‹คํ—˜ ๊ฒฐ๊ณผ ๋” ์ž‘์€ ํฌ๊ธฐ์˜ training data์—์„œ์˜ ์˜ˆ์ธก ์˜ค๋ฅ˜๊ฐ€ ๋” ์ž‘๊ฒŒ ๋‚˜์˜ด
โœ”๏ธ whole-year data๊ฐ€ ๋” ๊ธธ์ง€๋งŒ ๋ถˆ์™„์ „ํ•œ data size๋ณด๋‹ค ๋” ๋ถ„๋ช…ํ•œ temporal features๋ฅผ ์œ ์ง€ํ•˜๊ธฐ ๋•Œ๋ฌธ์œผ๋กœ ๋ณด์ž„
โœ”๏ธ training์„ ์œ„ํ•ด ๋” ์ ์€ ๋ฐ์ดํ„ฐ๋ฅผ ์จ์•ผ ํ•œ๋‹ค๊ณ  ๊ฒฐ๋ก ์ง€์„ ์ˆ˜๋Š” ์—†์ง€๋งŒ, ์ด๋Š” Autoformer์™€ FEDformer์˜ training data scale์ด ์„ฑ๋Šฅ์— ์ œํ•œ์„ ์ฃผ๋Š” ์š”์ธ์€ ์•„๋‹ˆ๋ž€ ๊ฒƒ์„ ์ฆ๋ช…

๐Ÿ’ก Is efficiency really a top-level priority?

โœ”๏ธ ํฅ๋ฏธ๋กญ๊ฒŒ๋„ vanilla Transformer(๋™์ผํ•œ DMS decoder)์™€ ๋น„๊ตํ–ˆ์„ ๋•Œ, ๋Œ€๋ถ€๋ถ„์˜ Transformer๋ฅผ ๋ณ€ํ˜•ํ•œ ๋ชจ๋ธ๋“ค์˜ ์‹ค์ œ ์ถ”๋ก  ์‹œ๊ฐ„๊ณผ ํŒŒ๋ผ๋ฏธํ„ฐ์˜ ๊ฐœ์ˆ˜๋Š” ๋น„์Šทํ•˜๊ฑฐ๋‚˜ ๋” ๋‚˜์จ
โœ”๏ธ ๊ฒŒ๋‹ค๊ฐ€ vanilla Transformer์˜ memory cost๋Š” output length L = 720์—์„œ๋„ ์‹ค์งˆ์ ์œผ๋กœ ํ—ˆ์šฉ ๊ฐ€๋Šฅํ•œ ์ˆ˜์ค€์ด๊ธฐ ๋•Œ๋ฌธ์— ์ ์–ด๋„ ๊ธฐ์กด ๋ฒค์น˜๋งˆํฌ์—์„œ๋Š” ๋ฉ”๋ชจ๋ฆฌ ํšจ์šธ์ด ๋†’์€ Transformer์˜ ๊ฐœ๋ฐœ์˜ ์ค‘์š”์„ฑ์ด ์•ฝํ™”

6. Conclusion and Future Work

Conclusion

ยท ๋ณธ ๋…ผ๋ฌธ์€ long-term time series forecasting ๋ฌธ์ œ์—์„œ Transformer ๊ธฐ๋ฐ˜ ๋ชจ๋ธ๋“ค์˜ ํšจ๊ณผ์— ๋Œ€ํ•œ ์˜๋ฌธ์„ ์ œ์‹œ
ยท ๋†€๋ผ์šธ๋งŒํผ ๊ฐ„๋‹จํ•œ linear model์ธ LTSF-Linear ๋ฅผ DMS forecasting baseline์œผ๋กœ ์‚ผ์•„ ๋ณธ ๋…ผ๋ฌธ์˜ ์ฃผ์žฅ์„ ๊ฒ€์ฆ

Future work

ยท LSTF-Linear๋Š” ๋ชจ๋ธ ์šฉ๋Ÿ‰์ด ์ œํ•œ๋˜์–ด ์žˆ์–ด ์—ฌ๋Ÿฌ ๋ฌธ์ œ์ ์ด ๋ฐœ์ƒํ•˜๋ฉฐ, ํ–ฅํ›„ ์—ฐ๊ตฌ์˜ ๊ธฐ์ค€์„  ์—ญํ• ์„ ํ•  ๋ฟ์ž„
ยท one-layer linear network๋Š” change points์— ์˜ํ•ด ๋ฐœ์ƒํ•˜๋Š” temporal dynamics๋ฅผ ํฌ์ฐฉํ•˜๋Š” ๋ฐ ์–ด๋ ค์›€์ด ์žˆ์Œ
ยท ์ƒˆ๋กœ์šด ๋ชจ๋ธ ์„ค๊ณ„์™€ ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ, ๋ฒค์น˜๋งˆํฌ ๋“ฑ์„ ํ†ตํ•ด ๊นŒ๋‹ค๋กœ์šด LTSF ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•  ๊ฐ€๋Šฅ์„ฑ์ด ์žˆ์Œ

๐Ÿ”– Reference
Paper review
Types of Transformer-based TSF models
