Robust Random Cut Forest Based Anomaly Detection On Streams

똑딱뚝딱 · January 18, 2023

Tags: 2016, Anomaly Detection, Conference, ICML, Isolation Forest, Time Series, decision tree, forest, stream

Robust Random Cut Forest Based Anomaly Detection On Streams

2016 Proceedings of The 33rd International Conference on Machine Learning

A binary-search-tree-based algorithm whose goal is to detect anomalies in stream data
A variant of Isolation Forest adapted so that it can be applied in a real-time streaming setting

Its significance is that it applies a tree structure to stream data

Key questions of the paper
1) How do we define anomalies?
2) What data structure do we use to efficiently detect anomalies over dynamic data streams?

Differences between IF and RCF

Where it differs from the existing Isolation Forest (IF)

Feature selection
Anomaly score

Feature Selection

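Isolation Forest
The feature used for a split is selected uniformly at random
Extended Isolation Forest
Random as in IF, but each split is a randomly oriented hyperplane rather than a cut on a single feature
Random Cut Forest
Each feature is selected with a probability proportional to its range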

Anomaly Score

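Isolation Forest
Uses the average path length over all trees as the basis of the anomaly score
Separates normal points from anomalies at a score of 0.5

Extended Isolation Forest
Same as IF
Random Cut Forest
Defines a new anomaly score in terms of the change in depth of the remaining data when a data point is removed from the dataset
Takes a model-complexity perspective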
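
For reference, the 0.5 threshold above makes sense once the average path length is normalized. The notes do not spell the formula out; the form usually given for the IF score is:

$s(x, n) = 2^{-\frac{\mathbb{E}[h(x)]}{c(n)}}, \quad c(n) = 2H(n-1) - \frac{2(n-1)}{n}$

Here $h(x)$ is the path length of $x$ in one tree, $n$ is the sub-sample size, and $H(i)$ is the $i$-th harmonic number. When $\mathbb{E}[h(x)] = c(n)$ the score is exactly $0.5$, and as $\mathbb{E}[h(x)] \to 0$ (a point that is isolated almost immediately) the score approaches $1$.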

Robust Random Cut Tree

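Robust random cut tree on a point set $S$

$T(S)$: the tree generated from $S$

Randomly choose a feature $p$
Probability that the $i$-th feature is selected: $\frac{l_i}{\sum_j l_j}$
$l_i = \max_{x\in S} x_i - \min_{x\in S} x_i$
➜ the probability of selecting a feature is determined by the range of that feature's values

Randomly select a split value $q$
choose $q \sim \mathrm{Uniform}[\min_{x\in S} x_p, \max_{x\in S} x_p]$ for the selected feature $p$
Points whose value on feature $p$ is at most the split point $q$ go to the left branch; points with a larger value go to the right branch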
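
The two random choices above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names (`choose_cut`, `build_tree`, `leaves`), the dict-based tree layout, and the toy point set are all assumptions made for the example.

```python
import random

def choose_cut(points, rng):
    """Pick feature p with probability l_p / sum_j l_j, then q ~ Uniform[min, max]."""
    dims = range(len(points[0]))
    lo = [min(x[i] for x in points) for i in dims]
    hi = [max(x[i] for x in points) for i in dims]
    ranges = [h - l for l, h in zip(lo, hi)]        # l_i = max - min
    p = rng.choices(list(dims), weights=ranges)[0]  # a zero-range feature is never picked
    q = rng.uniform(lo[p], hi[p])
    return p, q

def build_tree(points, rng):
    """Cut recursively until a node holds only identical points."""
    if len(set(points)) == 1:
        return {"points": points}                   # leaf
    while True:
        p, q = choose_cut(points, rng)
        left = [x for x in points if x[p] <= q]
        right = [x for x in points if x[p] > q]
        if left and right:                          # retry the rare case q == max
            break
    return {"p": p, "q": q,
            "left": build_tree(left, rng),
            "right": build_tree(right, rng)}

def leaves(tree):
    """Collect the point lists stored in the leaves."""
    if "points" in tree:
        return [tree["points"]]
    return leaves(tree["left"]) + leaves(tree["right"])

rng = random.Random(0)
S = [(0.0, 5.0), (1.0, 5.0), (10.0, 5.0)]           # feature 1 is constant
tree = build_tree(S, rng)
```

In this toy set the second feature has range 0, so every cut falls on the first feature, which is exactly the effect of weighting features by their range.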

Anomaly Score

IF measures the anomaly score using the property that an anomaly is isolated early in the tree ➜ average path length

RRCF measures the anomaly score from the model-complexity perspective
➜ an abnormal point increases model complexity

Displacement(DISP)

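$DISP(x, Z)$: the sum of the depth changes of the remaining data when a data point $x$ in dataset $Z$ is removed
➜ the expected value of the depth change that occurs in each tree

Figure: (a) the tree before deleting $x$; (b) the tree after deleting $x$

Removing $x$ decreases the depth of every node in sub-tree c by 1
The depth of sub-tree b, which is not directly connected to $x$, does not change
➜ the total depth change caused by $x$ == the number of data points under the sibling node of $x$
➜ the more anomalous $x$ is, the larger the total depth change it causes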
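
The claim that the total depth change equals the size of the sibling sub-tree can be checked on a small hand-built tree. The sketch below is only an illustration for a single tree (the expectation over trees is not taken); the nested-tuple layout, the leaf labels, and the helper names are assumptions made for the example.

```python
def depths(tree, d=0):
    """Map every leaf label to its depth; an internal node is a (left, right) tuple."""
    if not isinstance(tree, tuple):
        return {tree: d}
    left, right = tree
    return {**depths(left, d + 1), **depths(right, d + 1)}

def delete(tree, x):
    """Remove leaf x: its parent disappears and its sibling takes the parent's place."""
    if not isinstance(tree, tuple):
        return tree
    left, right = tree
    if left == x:
        return right
    if right == x:
        return left
    return (delete(left, x), delete(right, x))

def disp_one_tree(tree, x):
    """Sum of depth changes of the remaining leaves after deleting x."""
    before = depths(tree)
    after = depths(delete(tree, x))
    return sum(before[y] - after[y] for y in after)

# Sub-tree b is unrelated to x; sub-tree c is the sibling of x.
b = ("b1", "b2")
c = (("c1", "c2"), "c3")
tree = (b, ("x", c))

print(disp_one_tree(tree, "x"))   # 3, the number of leaves in sub-tree c
```

Only the three leaves of sub-tree c move up by one level, so the sum is 3, while the leaves of sub-tree b keep their depth.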

Collusive Displacement(CODISP)

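Because DISP suffers from the masking problem, the paper proposes an anomaly score that also takes the surroundings of an anomaly into account

masking: the problem where outliers cluster together and therefore look as if they were normal

Because of masking, if $q$ sits right next to an abnormal data point $p$, the $DISP$ of $p$ will be very small
The anomaly score is computed by also taking into account the colluders that hide abnormal data
Consider the total depth change that occurs when the collusive cluster $C$ around $x$ is removed
➜ but the larger the size of $C$, the larger the depth change
➜ to reduce the effect of the size of $C$, the final score $CODISP$ divides $DISP$ by the size of $C$
➜ but the size of $C$ cannot be determined exactly, so the maximum value over the candidates that can be considered is used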

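$x$: data point
$Z$: dataset
$S$: subset of $Z$ (the sample a tree is built on)
$C$: candidate collusive cluster, a subset of $S$ that contains $x$

$CODISP(x, Z, |S|) = \mathbb{E}_{S \subseteq Z,\, T}\left[\max_{x\in C \subseteq S} \frac{1}{|C|} DISP(C, S)\right]$
Here $DISP(C, S)$ extends $DISP$ from a single point to a set: the total depth change of the points left in $S$ when all of $C$ is removed from $T(S)$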
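
A common way to evaluate the maximum in practice is to restrict the candidates for $C$ to the sub-trees on the path from the root to $x$. Removing such a sub-tree lifts every point under its sibling by one level, so the depth change divided by $|C|$ is simply the sibling size over $|C|$. The sketch below follows that assumption for a single tree; the nested-tuple layout, the labels, and the function names are hypothetical.

```python
def size(tree):
    """Number of leaves; an internal node is a (left, right) tuple."""
    return 1 if not isinstance(tree, tuple) else size(tree[0]) + size(tree[1])

def contains(tree, x):
    if not isinstance(tree, tuple):
        return tree == x
    return contains(tree[0], x) or contains(tree[1], x)

def codisp_one_tree(tree, x):
    """Max over sub-trees C on the root-to-x path of |sibling(C)| / |C|.

    Assumes x is a leaf of the tree.
    """
    best = 0.0
    node = tree
    while isinstance(node, tuple):
        left, right = node
        inside, sibling = (left, right) if contains(left, x) else (right, left)
        best = max(best, size(sibling) / size(inside))
        node = inside                      # descend towards x
    return best

# Two outliers p and q sit together, away from six normal points.
normal = ((("n1", "n2"), ("n3", "n4")), ("n5", "n6"))
tree = (("p", "q"), normal)

print(codisp_one_tree(tree, "p"))    # 3.0: removing {p, q} shifts six points
print(codisp_one_tree(tree, "n1"))   # 1.0
```

Removing $p$ alone shifts only $q$, so the single-point displacement of $p$ is 1, the same as for a normal point. Taking the colluder $q$ into account raises the score of $p$ to 3.0, which is how the masking effect is undone.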

Algorithm

Forget Point

Find the node $v$ that corresponds to $p$ in tree $T$
Remove the parent of node $v$ and put the sibling $u$ of $v$ in the parent's place (the path from the root to $u$ gets shorter)
Update every sub-tree (its bounding box) on the way up, starting from the new parent $u'$ of $u$
Return the modified tree $T'$
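
The four steps can be sketched with a small node class that keeps parent pointers and a bounding box per node. This is a simplified illustration under stated assumptions: the `Node` class, its `lo`/`hi` box fields, and the exhaustive `find_leaf` search are hypothetical, and a real tree would follow its stored cuts to locate $p$.

```python
class Node:
    def __init__(self, point=None, left=None, right=None):
        self.point, self.left, self.right, self.parent = point, left, right, None
        for child in (left, right):
            if child is not None:
                child.parent = self
        self.refresh_box()

    def refresh_box(self):
        if self.point is not None:                 # leaf: the box is the point itself
            self.lo = self.hi = self.point
        else:                                      # internal: union of the child boxes
            self.lo = tuple(map(min, self.left.lo, self.right.lo))
            self.hi = tuple(map(max, self.left.hi, self.right.hi))

def find_leaf(node, p):
    """Exhaustive search for the leaf that stores p (kept simple on purpose)."""
    if node.point is not None:
        return node if node.point == p else None
    return find_leaf(node.left, p) or find_leaf(node.right, p)

def forget_point(root, p):
    v = find_leaf(root, p)                         # step 1: locate v
    if v is None:
        raise KeyError(p)
    parent = v.parent
    if parent is None:
        return None                                # the tree held only p
    u = parent.right if parent.left is v else parent.left
    grand = parent.parent
    u.parent = grand                               # step 2: u replaces the parent
    if grand is None:
        return u                                   # u becomes the new root
    if grand.left is parent:
        grand.left = u
    else:
        grand.right = u
    node = grand                                   # step 3: refresh boxes upwards
    while node is not None:
        node.refresh_box()
        node = node.parent
    return root                                    # step 4: the modified tree

root = Node(left=Node(left=Node(point=(0, 0)), right=Node(point=(1, 5))),
            right=Node(point=(3, 1)))
root = forget_point(root, (1, 5))
print(root.lo, root.hi)                            # (0, 0) (3, 1)
```

Before the call the root box spans (0, 0) to (3, 5); after forgetting (1, 5) the leaf (0, 0) hangs directly under the root and the box shrinks to (0, 0) to (3, 1).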