Timeseries Anomaly Detection (1): Introduction

Yunkun · May 27, 2022

Team Member


Lee Jimin, Dept. of Information Systems '18, Hanyang Univ. syjmlove@hanyang.ac.kr
Kim Joonhee, Dept. of Information Systems '18, Hanyang Univ. pjoonheeq@hanyang.ac.kr
Oh Yunseok, Dept. of Information Systems '17, Hanyang Univ. grade854@hanyang.ac.kr


  • Github

Introduction


Anomaly Detection

  • Anomaly detection refers to finding objects or data that show a pattern different from what is expected. In other words, it is a method of building a model, based on training data, that finds data whose characteristics differ from the existing data.
  • Timeseries anomaly detection flags cases where the value predicted for the future deviates from the value actually observed. The learning steps are as follows.

    Using the historical values of the time series, training data is created to predict the value one step ahead. The prediction errors are modeled with a multivariate Gaussian distribution. If the error between the predicted and actual value falls in the tail of that distribution, the value is considered to lie outside the predictable range.
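The steps above can be sketched in Python. This is a minimal, univariate simplification: the post describes a multivariate Gaussian over the error vector, but here the error is a single number, and the residuals, predictor values, and 3-sigma threshold are all illustrative assumptions.

```python
import statistics

def fit_error_model(errors):
    # Fit a Gaussian to the one-step-ahead prediction errors seen on
    # training data (a univariate stand-in for the multivariate case).
    return statistics.mean(errors), statistics.stdev(errors)

def is_anomalous(predicted, actual, mu, sigma, k=3.0):
    # A point is anomalous if its error falls in the tail of the fitted
    # distribution, i.e. more than k standard deviations from the mean.
    return abs((actual - predicted) - mu) > k * sigma

# toy residuals from a hypothetical one-step-ahead predictor
mu, sigma = fit_error_model([0.1, -0.2, 0.05, 0.0, -0.1, 0.15, -0.05, 0.02])
print(is_anomalous(predicted=10.0, actual=10.05, mu=mu, sigma=sigma))  # False
print(is_anomalous(predicted=10.0, actual=12.0, mu=mu, sigma=sigma))   # True
```

In practice the predictor would be a trained model (e.g. an LSTM), and the error would be a vector scored against the full multivariate Gaussian.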

Distinguishing Outliers

  • Outlier = Extreme outliers + Novelties

    Extreme outliers must be removed for the model to benefit.
    Novelties benefit the model only if they are left in.

  • Extreme Outlier
    Observations whose values are extremely small or large, far outside the range of the rest of the data. They distort estimates of the population mean or sum and excessively inflate the variance, reducing the accuracy of analysis or modeling, so removing them is recommended.
    Correction methods

    Replace with the true value
    Replace with an interpolated value
    Elimination
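One way to implement the interpolation-based correction is sketched below. The modified z-score (median/MAD) criterion and the 3.5 threshold are common rules of thumb chosen here for illustration, not something the post prescribes; a robust criterion is used because the extreme value itself would inflate an ordinary mean/standard deviation.

```python
import statistics

def remove_extreme_outliers(series, threshold=3.5):
    # Flag points using the modified z-score (based on the median and the
    # median absolute deviation, which the outliers themselves cannot
    # inflate), then replace them with the average of their neighbours.
    med = statistics.median(series)
    mad = statistics.median(abs(x - med) for x in series)
    cleaned = list(series)
    for i, x in enumerate(series):
        score = 0.6745 * abs(x - med) / mad if mad else 0.0
        if score > threshold:
            left = cleaned[i - 1] if i > 0 else med
            right = series[i + 1] if i + 1 < len(series) else med
            cleaned[i] = (left + right) / 2  # linear interpolation
    return cleaned

print(remove_extreme_outliers([1, 2, 1, 2, 100, 2, 1, 2, 1]))
# → [1, 2, 1, 2, 2.0, 2, 1, 2, 1]
```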

  • Novelties
    Outliers produced by the normal collection process (patterns or data not previously seen).
    Unlike extreme outliers, novelties do not lead us to modify the original data: if another novelty appears later, the model will have to handle it. For example, in the KOSPI data, the fall in stocks caused by COVID-19 can be seen as a kind of novelty.

How to Find Outliers

Premise: most of the data is normal, and only a small fraction is anomalous.

  • Remeasurement
    Due to the nature of time series data, it is impossible to go back to the past and measure again.
  • Supervised Learning

    Find different sources of the same data and compare them.
    Data that agree across sources are classified as normal.
    Data that look 'weirder' than the rest are labeled as outliers.
    Train an outlier detection model on the labeled results.

  • Unsupervised Learning

    Discover outliers by leveraging the characteristics of the data itself.
    Assume that most of the data is normal.
    Analyze the data's own characteristics.
    Classify data 'weirder' than a given criterion as outliers.

💡 Supervised learning: normal data + abnormal data + correct answers (labels) exist

💡 Unsupervised learning: normal data + no correct answers (labels) -> the model learns the features by itself
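A minimal sketch of the unsupervised premise: without labels, we derive a criterion from the data itself and call anything beyond it an outlier. The IQR fence used here is one common such criterion, chosen as an illustrative assumption rather than the post's own method.

```python
import statistics

def iqr_outliers(data, k=1.5):
    # Points outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR] are
    # classified as outliers -- no labels needed, only the data itself.
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lo or x > hi]

print(iqr_outliers([10, 12, 11, 13, 12, 11, 50]))  # → [50]
```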

  • Supervised learning includes algorithms such as CNNs, while unsupervised learning includes algorithms such as autoencoders and GANs. Although a supervised model is usually more accurate than an unsupervised one, the unsupervised field is being studied relatively actively because supervised learning has the following disadvantages.

    1. It is difficult to obtain abnormal samples.
    2. Whenever a new abnormal pattern appears, the model must be retrained.
  • Autoencoder

    An autoencoder consists of an encoder and a decoder.
    The encoder extracts the important information (a compressed feature vector) from the input data.
    In this process, a representation more compressed than the input data is obtained.
    The decoder generates a form similar to the input data from that important information.

The autoencoder is an algorithm that reconstructs (restores) its input data. It learns the features of normal data; when data is fed into the trained model, the reconstruction is compared with the input, and the size of the difference determines whether the data is abnormal. If the encoder extracts the important information well, the decoder can generate output almost identical to normal input data, while abnormal data remains just as difficult for the decoder to reconstruct.
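The reconstruction-error rule just described can be sketched as follows. The trained autoencoder is stubbed out with a toy function, and the MSE metric and threshold value are illustrative assumptions, not the post's actual model.

```python
def reconstruction_error(x, x_hat):
    # Mean squared error between the input and its reconstruction.
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def detect(model, x, threshold):
    # The autoencoder reproduces normal inputs well, so a large
    # reconstruction error suggests the input is anomalous.
    x_hat = model(x)  # model: a trained encoder+decoder (assumed)
    return reconstruction_error(x, x_hat) > threshold

# toy stand-in for a model trained only on values near 1.0
toy_model = lambda x: [1.0 for _ in x]
print(detect(toy_model, [1.0, 1.1, 0.9], threshold=0.05))  # False (normal)
print(detect(toy_model, [5.0, 4.8, 5.2], threshold=0.05))  # True (anomalous)
```

In a real setup the threshold is usually chosen from the distribution of reconstruction errors on held-out normal data.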

We decided to practice with an algorithm called the LSTM Autoencoder, one of the unsupervised learning techniques.
