Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting

Southgiri · January 21, 2025

LLM Time Series

Paper Review


General purpose foundation model for univariate probabilistic time series forecasting

Decoder-only transformer architecture

Use lags as covariates

Lag-Llama is pretrained on a large corpus of diverse time series data from several domains

1. Introduction

General purpose foundation model for univariate probabilistic time series forecasting

Demonstrates few-shot performance on previously unseen datasets

2. Related Work

Statistical models

Their shortfall lies in the inherent assumption of linear relationships and stationarity
They may require extensive manual tuning and domain knowledge to select appropriate models and parameters

Foundation models

Time-LLM, LLM4TS, and GPT2 freeze the LLM encoder backbone while fine-tuning the input and distribution heads for forecasting
The main goal of the paper is to apply the foundation model approach to time series data

3. Probabilistic Time Series Forecasting

Univariate time series dataset

$D_{train} = \{x^i_{1:T^i}\}^D_{i=1}$
$t \in \{1,\dots,T^i\}$
$T^i$ : length of time series $i$
Predict the values at the next $P$ future time points
$D_{test}=\{x^i_{T^i+1:T^i+P}\}^D_{i=1}$

Univariate probabilistic time series forecasting problem

= Modelling an unknown joint distribution of the P future values

$\phi$ : parameters of a parametric distribution
Rather than considering the whole history of each time series i,
we can instead sub-sample fixed context windows of size $C$

Predictions are conditioned on these learned parameters $\theta$
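
Written out, the conditioning above corresponds to an autoregressive factorization over the $P$ future steps. The form below is only a sketch: it omits covariates, uses the context window $x^i_{1:C}$ in place of the full history, and introduces $f_\theta$ and the per-step subscript $\phi_t$ as illustrative notation that does not appear in these notes.

$$
p\left(x^i_{C+1:C+P} \mid x^i_{1:C}\right) = \prod_{t=C+1}^{C+P} p\left(x^i_t \mid \phi_t\right), \qquad \phi_t = f_\theta\left(x^i_{1:t-1}\right)
$$

Here $\theta$ are the learned network parameters and $\phi_t$ are the distribution parameters the network outputs for step $t$.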

4. Lag-Llama

When training on heterogeneous univariate time series corpora, the frequency of the time series in the corpus varies
When adapting the model to downstream datasets, it may encounter new frequencies and combinations of seen frequencies

→ General method for tokenizing series from such a dataset, without directly relying on the frequency of any specific dataset

4.1. Tokenization : Lag Features

Construct lagged features from the prior values of the time series
Lag indices include quarterly, monthly, weekly, daily, hourly and second-level frequencies

To create lag features for some context-length window $x_{1:C}$ ,

Sample a larger window with $L$ more historical points, where $L$ is the largest lag index
To these lagged features, add date-time features of all the frequencies,
from second of minute, hour of day, etc. (the actual timestamp values of the data) up to quarter of year, taken from the time index $t$
All except one date-time feature remain constant from one time step to the next, and from this the model can make sense of the frequency of the time series
$F$ : total number of date-time features
→ Each token is of size $|\mathcal{L}| + F$, where $\mathcal{L}$ is the set of lag indices
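
The lag part of the tokenization can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `lag_tokens`, the toy series, and the lag set `[1, 2, 7]` are assumptions made here, and the date-time features are left out.

```python
from typing import List, Sequence


def lag_tokens(history: Sequence[float], context_length: int,
               lags: Sequence[int]) -> List[List[float]]:
    """Build one lag-feature vector per time step of the context window.

    `history` must contain the context window plus max(lags) earlier points.
    """
    max_lag = max(lags)
    if len(history) < context_length + max_lag:
        raise ValueError("history must cover the context window plus the largest lag")
    start = len(history) - context_length  # index of the first context step
    tokens = []
    for t in range(start, len(history)):
        # Entry j of the token at step t is the value `lags[j]` steps earlier.
        tokens.append([history[t - lag] for lag in lags])
    return tokens


# Toy series 0, 1, ..., 19 (hypothetical data).
series = [float(v) for v in range(20)]
tokens = lag_tokens(series, context_length=4, lags=[1, 2, 7])
```

Each of the 4 tokens has one entry per lag index; in the full scheme the $F$ date-time features would be appended to every token.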

4.2. Lag-Llama Architecture

A univariate sequence of length $C + L$ (the context window plus the $L$ extra historical points), along with its covariates, is tokenized by concatenating the covariate vectors into a sequence of $C$ tokens
Tokens are passed through a shared linear projection layer
RMSNorm is used for normalization, and Rotary Positional Encoding is applied to each attention layer’s query and key representations

RMSNorm
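
As a reminder of what RMSNorm computes, here is a minimal pure-Python sketch. It assumes the usual definition (divide by the root mean square of the features, then apply a learned per-feature gain); the function name `rms_norm` and the `eps` value are choices made here, not taken from the paper.

```python
import math
from typing import List, Sequence


def rms_norm(x: Sequence[float], gain: Sequence[float],
             eps: float = 1e-6) -> List[float]:
    """Scale `x` by the inverse of its root mean square, then by `gain`."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [g * v * inv_rms for g, v in zip(gain, x)]


# With unit gain, the output has a mean square of roughly 1.
out = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
```

Unlike LayerNorm, there is no mean subtraction and no bias term, only the rescaling.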

Rotary Positional Encoding

The attention score depends only on relative position: it stays the same as long as the distance between the two tokens stays the same
The rotation matrix is applied after multiplying by the query and key weight matrices
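
The relative-position property can be checked numerically. The sketch below assumes the standard rotary formulation (consecutive feature pairs rotated by angles `position * base ** (-k / d)`); the vectors `q` and `k` stand for already-projected query and key vectors and are made-up values.

```python
import math
from typing import List, Sequence


def rope(vec: Sequence[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("dimension must be even")
    out: List[float] = []
    for k in range(0, d, 2):
        theta = position * base ** (-k / d)  # rotation angle for this pair
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[k], vec[k + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Hypothetical query/key vectors, taken after the weight-matrix projection.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

score_a = dot(rope(q, 5), rope(k, 2))    # positions 5 and 2, distance 3
score_b = dot(rope(q, 13), rope(k, 10))  # positions 13 and 10, distance 3
```

Both scores agree up to floating-point error because only the position difference enters the dot product.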

Model predicts the parameters $\phi$ of the forecast distribution of the next timestep
Parameters are output by a parametric distribution head

4.3. Choice of Distribution Head

Last layer is the distribution head
which projects the model’s features to the parameters of a probability distribution
Different distribution heads can be combined with the model
The paper adopts a Student’s t-distribution and outputs its three parameters:
degrees of freedom, mean, and scale
To ensure the parameters that must be positive stay positive, an appropriate non-linearity is used
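
A distribution head of this kind can be sketched as follows. The notes do not say which non-linearity is used, so softplus is an assumption here, as is the shift `2 + softplus(...)` that keeps the degrees of freedom above 2 (so the variance is finite). The function names are illustrative.

```python
import math
from typing import Tuple


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z)); positive for moderate inputs."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def student_t_params(raw_df: float, raw_loc: float,
                     raw_scale: float) -> Tuple[float, float, float]:
    """Map unconstrained network outputs to valid Student's t parameters."""
    df = 2.0 + softplus(raw_df)   # assumption: keep df > 2
    loc = raw_loc                 # the mean is unconstrained
    scale = softplus(raw_scale)   # scale must be positive
    return df, loc, scale


def student_t_nll(x: float, df: float, loc: float, scale: float) -> float:
    """Negative log-likelihood of x under a location-scale Student's t."""
    z = (x - loc) / scale
    log_pdf = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
               - 0.5 * math.log(df * math.pi) - math.log(scale)
               - (df + 1) / 2 * math.log1p(z * z / df))
    return -log_pdf


# Hypothetical raw outputs of the last layer for one time step.
df, loc, scale = student_t_params(raw_df=-1.0, raw_loc=0.5, raw_scale=-3.0)
loss_near = student_t_nll(0.5, df, loc, scale)
loss_far = student_t_nll(5.0, df, loc, scale)
```

Training would minimize this negative log-likelihood, so observations far from the predicted mean (`loss_far`) cost more than ones at the mean (`loss_near`).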

4.4. Value Scaling

Utilize a scaling heuristic
For each univariate window, calculate its mean $\mu^i$ and variance (standard deviation $\sigma^i$)
Scale $x^i_{1:C}$ → $\{(x^i_t-\mu^i)/\sigma^i\}^C_{t=1}$
Also incorporate $\mu^i$ and $\sigma^i$ as covariates for each token,
which are called summary statistics
During training, values are transformed using the mean and variance,
while during sampling, every sampled timestep is de-standardized using the same mean and variance
In practice, Robust Standardization is used instead of the standard scaler

Robust Standardization
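
Robust standardization, as usually defined, removes the median and scales by the interquartile range (IQR), which makes it less sensitive to outliers than mean and variance. The sketch below assumes that definition; the function names, the `eps` guard, and the toy window are made up for illustration.

```python
import statistics
from typing import List, Sequence, Tuple


def robust_standardize(window: Sequence[float],
                       eps: float = 1e-6) -> Tuple[List[float], float, float]:
    """Center by the median and scale by the IQR of the window."""
    med = statistics.median(window)
    q1, _, q3 = statistics.quantiles(window, n=4, method="inclusive")
    scale = (q3 - q1) + eps  # eps guards against a zero IQR
    scaled = [(v - med) / scale for v in window]
    return scaled, med, scale


def destandardize(values: Sequence[float], med: float, scale: float) -> List[float]:
    """Invert the scaling, as done for sampled values."""
    return [v * scale + med for v in values]


# Toy window with one outlier (hypothetical data).
window = [10.0, 12.0, 11.0, 13.0, 500.0]
scaled, med, scale = robust_standardize(window)
restored = destandardize(scaled, med, scale)
```

The outlier barely affects the median and IQR, whereas it would dominate a mean and variance computed on the same window.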

4.5. Training Strategies

Datasets in the corpus are weighted by their total number of series when sampling training windows
Augment with Freq-Mix and Freq-Mask
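
The weighting by number of series can be illustrated with a small sampling sketch. The dataset names and counts below are hypothetical, and the code covers only the choice of source dataset for each training window, not the augmentations.

```python
import random

# Hypothetical corpus: dataset name -> number of series it contains.
series_counts = {"dataset_a": 300, "dataset_b": 100, "dataset_c": 600}

total = sum(series_counts.values())
weights = {name: count / total for name, count in series_counts.items()}

# Draw the source dataset for each of 8 training windows.
rng = random.Random(0)
names = list(weights)
batch_sources = rng.choices(names, weights=[weights[n] for n in names], k=8)
```

Datasets with more series are drawn proportionally more often.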

5. Experiment Setup

5.1. Datasets

Corpus of 27 time series datasets from six different domains
Leave out a few datasets from each domain to test few-shot performance
7965 univariate time series

5.3. Model Training Setup

Each epoch consists of 100 randomly sampled windows
Since the model is decoder-only and the prediction length is not fixed,
the model can work for any downstream prediction length
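
Why any prediction length works can be seen from an autoregressive rollout: each sampled value is appended to the window and fed back in. In this sketch `toy_predict_next` is a hypothetical stand-in for the trained model, and a Gaussian is sampled instead of a Student's t to keep the example in the standard library.

```python
import random
from typing import Callable, List, Sequence, Tuple


def rollout(context: Sequence[float],
            predict_next: Callable[[Sequence[float]], Tuple[float, float]],
            prediction_length: int,
            rng: random.Random) -> List[float]:
    """Sample one future path by feeding each sample back into the window."""
    window = list(context)
    path: List[float] = []
    for _ in range(prediction_length):
        loc, scale = predict_next(window)   # parameters for the next step
        sample = rng.gauss(loc, scale)      # assumption: Gaussian stand-in
        path.append(sample)
        window.append(sample)               # the sample becomes new context
    return path


def toy_predict_next(window: Sequence[float]) -> Tuple[float, float]:
    """Hypothetical model: predict the last value with a small spread."""
    return window[-1], 0.1


rng = random.Random(0)
short_path = rollout([1.0, 2.0, 3.0], toy_predict_next, prediction_length=5, rng=rng)
long_path = rollout([1.0, 2.0, 3.0], toy_predict_next, prediction_length=12, rng=rng)
```

The same function produces 5-step and 12-step paths; repeating the rollout many times gives an empirical forecast distribution.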

6. Results

CRPS

$\mu =20,\sigma=2$
Observed value = 22
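
For the numbers above, the CRPS can be computed in closed form if the forecast is assumed to be Gaussian. That assumption is made here only to work the example; the notes do not state which distribution the example uses. The function name `crps_gaussian` is illustrative.

```python
import math


def crps_gaussian(mu: float, sigma: float, y: float) -> float:
    """Closed-form CRPS of a N(mu, sigma^2) forecast for observation y."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal pdf
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal cdf
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))


score = crps_gaussian(mu=20.0, sigma=2.0, y=22.0)
score_at_mean = crps_gaussian(mu=20.0, sigma=2.0, y=20.0)
```

Under this assumption the score for the observed value 22 is about 1.2, and it would be lower had the observation landed on the mean. Lower CRPS is better.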

6.1. Zero-shot

Exchange-rate is an entirely new domain
Inductive bias
- Vanilla decoder-only transformers outperform other transformer architectures
- Models with stronger inductive bias perform well on smaller datasets
- Simple models with smaller inductive bias perform better when trained on large datasets.
Compared to OneFitsAll, which adapts a pretrained LLM for forecasting, Lag-Llama achieves better performance
- This demonstrates the potential of foundation models trained from scratch compared to adapting a pretrained LLM

6.2. Few-shot

Exchange-rate is an entirely new domain with a new, unseen frequency
- the most dissimilar dataset compared to the pretraining corpus

Figure

Visualization
