Variational transformer-based anomaly detection approach for multivariate time series

Southgiri · January 19, 2025

Paper Review


1. Introduction

Traditional Limitation

  • Need to rely on expert experience to build complex features

Previous Study & Necessity of this study

  • Multivariate time-series data has potential correlations between its series
  • These correlations must be considered
  • GNNs extract the potential correlations by constructing a feature relationship graph
  • However, such algorithms depend too heavily on the size of the feature relationship graph
  • When the graph is too sparse or the number of feature nodes is too small, it causes a performance bottleneck in the model
    → The authors propose a Transformer-based model that captures the correlations through the self-attention mechanism
  • Why? → It reduces the impact of the dimensionality of the data

Main contributions

  1. Global temporal encoding (positional encoding)
  2. Multi-scale feature fusion algorithm
    • Obtain robust feature expression
  3. Residual Variational AutoEncoder
    • Can alleviate the KLD vanishing problem

2. Related work

2.1 Multivariate time-series anomaly detection

Challenging points

  • Data occur in periodic or seasonal modes
  • Complex correlations exist between the sequences

The authors use the self-attention mechanism to capture the correlation in the feature dimension

2.3 Variational AutoEncoder

Limitation of AutoEncoder for anomaly detection

  • An AE is trained only to minimize the reconstruction loss, regardless of how the hidden space is encoded
    • In my opinion: the loss term of an AE consists only of the reconstruction loss
  • This leads to a risk of overfitting = a lack of regularity in the hidden space

VAE

  • forms robust local features

3. Multiscale transformer-based residual VAE

Limitation of traditional Transformer

  • Only extracts local sequential information
  • Cannot extract global time-series information
  • In the process of upsampling, the mapping between the raw input and the high-dimensional vector (input data) is not accurate

Explanation of difference between the local and global information

For example, a traditional transformer model might effectively capture that a stock has been increasing for the last few days (local sequential information),

but it may not as easily recognize that the stock is in a long-term downward trend (global information), which is more complex and requires analyzing data over a longer period.

Model Parts

1. Positional encoding module

  • Provides the model with local sequential and global time-series information

2. Multi-scale feature fusion module

  • Obtain more robust feature expression

3. Feature-learning module

  • Learn both temporal and feature dimensions
  • through Transformer and VAE

4. Data reconstruction module

  • Conv1d layer and FC layer

3.1 Positional encoding

Global temporal encoding

  1. Time-series encoding

    • Decompose the time stamp information

  2. Periodic encoding

    • Fourier transform to analyze the period of the time-series data from the frequency domain

    • The main period = the component with the most significant impact on the time-series data

      • $T$ : the period of the data
      • $t$ : the timestamp
    • If the time-series data has no pattern or is close to a constant value, then after the Fourier transform it is a group of sine waves with tiny amplitudes

    • In that case, the DC component is the most influential on the time series

    • DC is a sine wave with an infinite period: its frequency → 0, so the main period of the data → $\infty$

    • Therefore, no extra information is added for non-periodic time-series data
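The main-period rule above can be sketched with a discrete Fourier transform. This is an illustrative reconstruction (the function name and tolerance are my own, not the paper's exact procedure):

```python
import numpy as np

def main_period(x, eps=1e-6):
    """Estimate the main period of a series via the FFT.

    Returns None when the dominant component is (close to) DC,
    i.e. the series is non-periodic or near-constant -- mirroring
    the rule of adding no periodic encoding in that case.
    """
    spectrum = np.abs(np.fft.rfft(x - np.mean(x)))  # drop the mean (DC)
    if spectrum.max() < eps * len(x):               # near-constant series
        return None
    k = int(np.argmax(spectrum))                    # dominant frequency bin
    if k == 0:
        return None                                 # DC dominates: period -> infinity
    return len(x) / k                               # period in samples

t = np.arange(1000)
assert main_period(np.sin(2 * np.pi * t / 50)) == 50.0  # true period recovered
assert main_period(np.zeros(1000)) is None              # no periodic encoding added
```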

3.2 Multiscale feature fusion

  • An FC layer or a Conv1d layer maps the input into a high-dimensional vector
  • FC or Conv1d alone is not accurate
  • Feature pyramid structure
    • originally designed to handle image data only
  • The proposed module only convolves and up-samples in the time dimension

  • $Src$ : input data
  • $m$ : number of conv layers
    • Hyperparameter

  • $F^m_{upsample}$ : transposed Conv1d

  • Transposed convolution has learnable parameters
    • interpolation has no learnable parameters, so it is not optimal
  • The size of the transposed conv kernel must be divisible by the conv stride
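A minimal numpy sketch of the time-dimension down/upsampling pair (the weights are placeholders, not the paper's learned multi-scale fusion). It illustrates why the transposed-conv kernel size should be divisible by the stride: kernel 4 with stride 2 maps the 7 downsampled steps back to the original 16.

```python
import numpy as np

def conv1d(x, w, stride):
    """Valid 1-D convolution along the time axis with the given stride."""
    k = len(w)
    out = [np.dot(x[i:i + k], w) for i in range(0, len(x) - k + 1, stride)]
    return np.array(out)

def transposed_conv1d(y, w, stride):
    """Transposed 1-D convolution: learnable upsampling back to full length."""
    k = len(w)
    out = np.zeros((len(y) - 1) * stride + k)
    for i, v in enumerate(y):
        out[i * stride:i * stride + k] += v * w  # overlapping kernel placements
    return out

x = np.arange(16.0)
y = conv1d(x, np.ones(4), stride=2)             # (16 - 4) // 2 + 1 = 7 steps
z = transposed_conv1d(y, np.ones(4), stride=2)  # (7 - 1) * 2 + 4 = 16 steps
assert len(y) == 7 and len(z) == 16
```

Because the kernel size (4) is a multiple of the stride (2), every output position receives the same number of kernel contributions, avoiding checkerboard artifacts.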

3.3 Residual Variational AutoEncoder

Loss term

  • $\delta_i$ and $\mu_i$
    • learnable parameters
  • $\theta$ controls the value range of the variance to $[1, +\infty)$
    • Prevents the loss from becoming negative
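One way to realize the variance constraint (an assumption on my part; the paper's exact mapping via $\theta$ is not reproduced here) is to shift a positive activation by 1, so the variance lies in $[1, +\infty)$ and the Gaussian reconstruction loss stays non-negative:

```python
import math

def softplus(x):
    """Numerically stable softplus, log(1 + e^x)."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def constrained_var(raw):
    """Map an unconstrained encoder output to a variance in [1, +inf)."""
    return 1.0 + softplus(raw)

def gaussian_nll(x, mean, var):
    """Per-point Gaussian negative log-likelihood (reconstruction term).

    With var >= 1 the log term is >= 0.5 * log(2*pi) > 0, so the loss
    cannot go negative -- the motivation stated for the constraint.
    """
    return 0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

assert constrained_var(-50.0) >= 1.0          # variance never drops below 1
assert gaussian_nll(0.3, 0.0, constrained_var(0.0)) > 0.0
```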

Reconstruction

  • $g$ : decoder function

  • $\zeta$ : noise sampled from the standard normal distribution

  • When the decoder is too powerful or the weight of the $Loss_{kl}$ term is too high, the model encodes all the input data into the standard normal distribution in order to minimize the loss function

  • As a result, the decoder relies only on the noise data

    • KL divergence vanishing problem
      • $p(Z_i|X_i)$ degenerates to the standard normal distribution
  • The Transformer's decoder is too powerful

    → A Transformer-based VAE is therefore more prone to KL divergence vanishing

Traditional Solution

  1. Set dynamic coefficients for the $Loss_{kl}$ term
    • Requires a long process of finding a suitable coefficient
  2. Reduce the decoder's performance, increasing the contribution of the reconstruction error term
    • But a low-performance decoder reduces the model's generative ability

Proposed Solution

  • Residual VAE
  • Does not connect the encoder and the decoder directly
  • Combines the residuals separately
  • Prevents encoder information from leaking to the decoder
  • Adds a time window-based attention to the encoder

Residual structure

  • $\hat{\mu}$ : constant term, the sum of the outputs of the encoders of each layer

  • $k$ : number of encoder layers

  • When divergence vanishing occurs, the mean and variance outputs collapse toward 0

    • The residual sum prevents $p(Z_i|X_i)$ from degenerating to the standard normal distribution
    • → Prevents the decoder from reconstructing only from the noise data
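A toy illustration of the residual mean: summing the per-layer encoder outputs keeps $\hat{\mu}$ informative even if one layer collapses during divergence vanishing. The collapse here is simulated, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_mean(layer_outputs):
    """mu_hat = sum of the per-layer encoder outputs (k layers).

    Even if the top layer collapses to zero, the summed mean stays
    away from 0, so p(Z|X) cannot fully degenerate to N(0, I).
    """
    return np.sum(layer_outputs, axis=0)

# Hypothetical outputs of k = 3 encoder layers for one window.
layers = [rng.normal(size=4) for _ in range(3)]
layers[-1] = np.zeros(4)          # simulate a collapsed top layer
mu_hat = residual_mean(layers)
assert np.any(mu_hat != 0)        # the residual sum is still informative
```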

Time dimension attention mechanism

  • Transformer-based models capture only the feature dimension with self-attention
  • They do not consider the autocorrelation of data features in the time dimension
  • Data at different moments have different effects on the data at the current moment
  • Therefore, weight information is added to the data within the same time window


  • $\lambda^m = (\lambda^m_1, \dots, \lambda^m_l)$ : a feature of the input data $\lambda$ in a time window
  • $l$ : width of the time window
  • To prevent divergence vanishing, the attention is added only to the encoder
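A sketch of attention over one time window. The score function (dot product with the window mean) is a stand-in of my own, since the paper's exact formulation is not reproduced here:

```python
import numpy as np

def window_attention(lam):
    """Re-weight the l steps of one time window (lam: shape [l, d]).

    Each time step gets a softmax weight, so different moments
    contribute differently to the current representation.
    """
    query = lam.mean(axis=0)                # stand-in query vector
    scores = lam @ query                    # one score per time step
    scores -= scores.max()                  # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights[:, None] * lam           # re-weighted window

lam = np.arange(12.0).reshape(4, 3)         # l = 4 steps, d = 3 features
out = window_attention(lam)
assert out.shape == (4, 3)
```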

3.4 Data reconstruction and anomaly detection

  • Conv1d and FC layers adjust the output size
  • An upper threshold is set based on the reconstruction loss on normal data
  • Due to the changeable operating environment, it is difficult to apply the same upper threshold everywhere

Dynamic threshold method

  • Record the reconstruction error at each moment as a vector $R = (R_1, \dots, R_h)$
    • $h$ : total time of the detection data
  • To reduce the influence of normal fluctuations in the data, use the exponentially weighted moving average (EWMA) algorithm to smooth the errors

  • $V_t$ : smoothed reconstruction error
  • $R_t$ : actual reconstruction error at the current time
  • $\eta$ : weight coefficient
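The EWMA recursion described above, as a minimal sketch (initialization $V_0 = R_0$ and $\eta = 0.3$ are my own choices, not from the paper):

```python
def ewma(errors, eta=0.3):
    """V_t = eta * R_t + (1 - eta) * V_{t-1}, with V_0 = R_0.

    eta (the weight coefficient) is a hyperparameter; 0.3 is an
    arbitrary illustrative value.
    """
    smoothed = [errors[0]]
    for r in errors[1:]:
        smoothed.append(eta * r + (1 - eta) * smoothed[-1])
    return smoothed

V = ewma([1.0, 1.0, 10.0, 1.0])   # a single spike gets damped
assert V[2] < 10.0 and V[2] > V[1]
```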

  • $Z$ : hyperparameter
  • $n$ : size of the sliding window
  • $\mu$ : mean value of the error vector $V$ in sliding window group $i$

Algorithm steps

  1. Use the trained model to reconstruct the data and calculate the reconstruction error
  2. Use EWMA to smooth the reconstruction error vector
  3. Calculate the error threshold at each moment
  4. Judge whether each point is abnormal or normal
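The four steps can be sketched end-to-end. The threshold form $\mu + Z\sigma$ over each trailing window is my reading of the symbols $Z$, $n$, and $\mu$, not guaranteed to be the paper's exact formula:

```python
import statistics

def detect_anomalies(errors, eta=0.3, n=3, Z=2.0):
    """Steps 1-4: smooth the errors with EWMA, then flag points whose
    smoothed error exceeds mu + Z * sigma of the trailing window.

    The threshold form mu + Z * sigma is an assumption based on the
    symbols Z (hyperparameter), n (window size) and mu (window mean).
    """
    V = [errors[0]]                       # step 2: EWMA smoothing
    for r in errors[1:]:
        V.append(eta * r + (1 - eta) * V[-1])
    flags = []
    for t, v in enumerate(V):             # steps 3-4: per-moment threshold
        window = V[max(0, t - n):t] or [v]
        mu = statistics.fmean(window)
        sigma = statistics.pstdev(window)
        flags.append(v > mu + Z * sigma)
    return flags

flags = detect_anomalies([1.0, 1.1, 0.9, 1.0, 9.0, 1.0])
assert flags[4] and not flags[0]          # only the spike is flagged
```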



4. Experiments

4.1 Dataset

SKAB

  • IIoT data

SAT

  • satellite data

4.2 Baseline methods and model setting

Local outlier factor

Isolation forest

LSTM AE

Fault prediction based on LSTM

SOTA models

  • GDN : multivariate, graph-based
  • MTAD-GAT : multivariate, graph attention
  • LSCP : parallel integration
  • OmniAnomaly : VAE and GRU
  • CPA-TCN : temporal CNN
  • TCN-AE : temporal CNN and AE
  • The parameters of the above methods were tuned to obtain their best results

Model parameter

4.3 Comparison

  • ML algorithms that cannot capture temporal dependency perform poorly on NAB-MT
  • MT-RVAE and GNNs, which focus on multiple dimensions, are not as good as TCN-AE on one-dimensional data

  • The relationships between the different attributes of SKAB are sparse

  • The real dataset SAT has stronger relationships between its attributes

  • → The performance of the GNN models on SAT is therefore higher

  • However, SAT has only 9 dimensions, so the upper limit is low, which causes a bottleneck for GNNs

  • OmniAnomaly (VAE + GRU) does not consider the correlation between sequences

  • → Lower performance than MT-RVAE

  • MT-RVAE can capture the correlation between different sequences through self-attention

  • → No need to extract information through a feature relationship graph

  • → Avoids the information bottleneck caused by node sparseness

  • Global temporal encoding and the residual VAE extract the temporal dependence and local features

Conclusions

  1. A model that relies only on the temporal dependence in the data will not improve accuracy on multivariate time series

  2. The performance of GNNs on SAT is higher than that of RNNs and TCNs, which proves the importance of capturing the sequence correlation

  3. The tightness of the data's feature relationships affects the performance of GNN algorithms

  4. The OmniAnomaly results prove that the number of data features can cause a bottleneck

  5. MT-RVAE can effectively capture both the temporal dependence and the sequence correlation

  6. Unidimensional data, or data with no correlation between sequences, are more suitable for TCN or RNN models

  7. Multidimensional data, or data with correlation between sequences, are more suitable for GNN or Transformer models

4.4 Ablation Study

  1. The residual structure can alleviate KL divergence vanishing
  2. The hidden space encoded by the VAE improves performance
  3. Multi-scale features help the model better distinguish abnormal points
  4. Global temporal encoding captures long-term dependencies

CT-RVAE

  • Conv1D instead of Multi-scale feature fusion

LMT-RVAE

  • Bilinear interpolation instead of Transposed convolution for upsampling

MT-NVAE

  • Ordinary VAE structure instead of the residual structure

MT-RAE

  • Residual AE instead of RVAE
