Anomaly Detection - Isolation Forest

์•ˆ์„ฑ์ธยท2022๋…„ 6์›” 30์ผ

๐Ÿ“– ์ด๋ฒˆ ํฌ์ŠคํŒ…์€ ์ด์ „ ํฌ์ŠคํŠธ์ฒ˜๋Ÿผ ๊ณ ๋ ค๋Œ€ํ•™๊ต ๊น€์„ฑ๋ฒ” ๊ต์ˆ˜๋‹˜์˜ ๊ฐ•์˜์˜์ƒ์„ ์ฐธ๊ณ ํ•˜์—ฌ ๋ชจ๋ธ ๊ธฐ๋ฐ˜ ์ด์ƒ์น˜ ํƒ์ง€ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ค‘ Isolation Forest์— ๋Œ€ํ•ด ๋‹ค๋ฃจ๊ณ ์ž ํ•ฉ๋‹ˆ๋‹ค.


Isolation Forest

Notion

  • ์ •์ƒ ๋ฐ์ดํ„ฐ๋กœ๋ถ€ํ„ฐ ํ•™์Šตํ•œ ๋ชจ๋ธ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ฐ ๊ฐ์ฒด์˜ ์ •์ƒ/์ด์ƒ ์—ฌ๋ถ€๋ฅผ ํŒ๋‹จํ•˜๋Š” ๋ฐฉ๋ฒ•๋ก ์ž…๋‹ˆ๋‹ค.
  • ์—ฌ๋Ÿฌ ๊ฐœ์˜ ์˜์‚ฌ๊ฒฐ์ •๋‚˜๋ฌด๋ฅผ ์ข…ํ•ฉํ•œ ์•™์ƒ๋ธ” ๊ธฐ๋ฐ˜์˜ ์ด์ƒํƒ์ง€ ๊ธฐ๋ฒ•์œผ๋กœ ์˜์‚ฌ๊ฒฐ์ •๋‚˜๋ฌด๋ฅผ ์ง€์†์ ์œผ๋กœ ๋ถ„๊ธฐ์‹œํ‚ค๋ฉด์„œ ๋ชจ๋“  ๋ฐ์ดํ„ฐ ๊ด€์ธก์น˜์˜ ๊ณ ๋ฆฝ ์ •๋„ ์—ฌ๋ถ€์— ๋”ฐ๋ผ ์ด์ƒ์น˜๋ฅผ ํŒ๋ณ„ํ•ฉ๋‹ˆ๋‹ค.

  • ์ •์ƒ ๊ด€์ธก์น˜๋Š” ๊ณ ๋ฆฝ๋˜๊ธฐ ์–ด๋ ค์šธ ๊ฒƒ์ด๋‹ค๋ผ๋Š” ๊ฐ€์ •์„ ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ์ •์ƒ ๊ด€์ธก์น˜๋ฅผ ๊ณ ๋ฆฝ์‹œํ‚ค๊ธฐ ์œ„ํ•ด์„œ๋Š” 2์ง„ ๋ถ„ํ• ์„ ์—ฌ๋Ÿฌ ๋ฒˆ ์ˆ˜ํ–‰ํ•ด์•ผ ํ•œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

  • ์ด์ƒ์น˜๋Š” ์‰ฝ๊ฒŒ ๊ณ ๋ฆฝ๋  ๊ฒƒ์ด๋‹ค๋ผ๋Š” ๊ฐ€์ •์„ ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š”, ์ด์ƒ์น˜๋ฅผ ๊ณ ๋ฆฝ์‹œํ‚ค๊ธฐ ์œ„ํ•ด์„œ๋Š” 2์ง„ ๋ถ„ํ• ์„ ์ ๊ฒŒ ์ˆ˜ํ–‰ํ•ด์•ผ ํ•œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

  • ์ง๊ด€์ ์œผ๋กœ ๋น„์ •์ƒ ๋ฐ์ดํ„ฐ๋ผ๋ฉด ์˜์‚ฌ๊ฒฐ์ •๋‚˜๋ฌด์˜ ๋ฃจํŠธ ๋…ธ๋“œ์—์„œ ๊ฐ€๊นŒ์šด ๊นŠ์ด์—์„œ ๊ณ ๋ฆฝ๋  ๊ฒƒ์ด๊ณ , ์ •์ƒ ๋ฐ์ดํ„ฐ๋ผ๋ฉด ๋ฃจํŠธ ๋…ธ๋“œ์—์„œ ๋จผ ๊นŠ์ด์—์„œ ๊ณ ๋ฆฝ๋  ๊ฒƒ์ž…๋‹ˆ๋‹ค.
  • ๊ฐ ๊ด€์ธก์น˜์˜ Path length(๊ด€์ธก์น˜ x๊ฐ€ ๊ณ ๋ฆฝ๋  ๋•Œ๊นŒ์ง€ ํ•„์š”ํ•œ ๋ถ„ํ•  ํšŸ์ˆ˜)๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ Anomaly score๋ฅผ ์ •์˜ํ•˜์—ฌ ๋ถ€์—ฌํ•ฉ๋‹ˆ๋‹ค.
  • 2008๋…„์— ๋ฐœํ‘œ๋œ ๋ชจ๋ธ์ด์ง€๋งŒ ์•„์ง๊นŒ์ง€๋„ ์ด์ƒ์น˜ ํƒ์ง€๋ฅผ ํ•˜๋Š”๋ฐ ์žˆ์–ด ์œ ์šฉํ•˜๊ฒŒ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.
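
To make the intuition concrete, here is a minimal, hypothetical sketch (not the library implementation): splits_to_isolate is a helper defined only for this illustration, counting how many random splitting points are needed before a given 1-D value ends up alone in its interval. A far-away point is typically isolated after only a few splits, while a point inside the dense cluster needs many more.

import numpy as np

rng = np.random.default_rng(42)

def splits_to_isolate(values, target):
    """Randomly split a 1-D sample until `target` is alone in its interval."""
    count = 0
    while len(values) > 1:
        thresh = rng.uniform(values.min(), values.max())   # random splitting point
        # keep only the side of the split that still contains `target`
        values = values[values <= thresh] if target <= thresh else values[values > thresh]
        count += 1
    return count

sample = np.append(rng.normal(0, 1, 500), 10.0)   # dense normal cluster plus one far-away point

print(np.mean([splits_to_isolate(sample, 10.0) for _ in range(100)]))        # few splits -> anomaly
print(np.mean([splits_to_isolate(sample, sample[0]) for _ in range(100)]))   # many splits -> normal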

Algorithm

1. ์ „์ฒด ๋ฐ์ดํ„ฐ์—์„œ ์ผ๋ถ€ ๊ด€์ธก์น˜๋ฅผ ๋žœ๋คํ•˜๊ฒŒ ์„ ํƒ

2. ๋žœ๋คํ•˜๊ฒŒ ์„ ํƒ๋œ ๊ด€์ธก์น˜์— ๋Œ€ํ•ด ์ž„์˜์˜ ๋ณ€์ˆ˜(splitting variable)์™€ ๋ถ„ํ• ์ (splitting point)์„ ์‚ฌ์šฉํ•˜์—ฌ ๋‹ค์Œ ์กฐ๊ฑด์„ ๋งŒ์กฑํ•  ๋•Œ๊นŒ์ง€ ์ด์ง„๋ถ„ํ•  ์ง„ํ–‰

  • ์˜์‚ฌ๊ฒฐ์ •๋‚˜๋ฌด๋ชจ๋ธ์ด ์‚ฌ์ „ ์ •์˜๋œ ๊นŠ์ด์— ๋„๋‹ฌ
  • ๋ชจ๋“  ํ„ฐ๋ฏธ๋„ ๋…ธ๋“œ์— ์กด์žฌํ•˜๋Š” ๊ด€์ธก์น˜๊ฐ€ 1๊ฐœ์”ฉ ์กด์žฌ
  • ๋ชจ๋“  ํ„ฐ๋ฏธ๋„ ๋…ธ๋“œ์— ์กด์žฌํ•˜๋Š” ๊ด€์ธก์น˜๋“ค์ด ๊ฐ™์€ ์ž…๋ ฅ๋ณ€์ˆ˜

3. ์œ„์™€ ๊ฐ™์€ ๊ณผ์ •์œผ๋กœ ์—ฌ๋Ÿฌ ๊ฐœ์˜ ITree๋ฅผ ๊ตฌ์ถ•

4. For each iTree, record every observation's path length (number of splits).

5. ๊ฐ ๊ด€์ธก์น˜์˜ ํ‰๊ท  Path length๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ Anomaly score(์ด์ƒ์น˜ ์Šค์ฝ”์–ด)๋ฅผ ๊ณ„์‚ฐ ๋ฐ ์ด์ƒ์น˜ ํŒ๋ณ„


Anomaly score ์ •์˜

  • Defined based on the path length.

  • Path length $= h(x)$

  • Normalized $h(x)$: $c(n) = 2H(n-1) - \frac{2(n-1)}{n}$ (the average path length over all observations)

    • $H(i) = \ln(i) + 0.5772156649$ (Euler's constant)
  • Anomaly score: $s(x,n) = 2^{-\frac{E(h(x))}{c(n)}}$, where $E(h(x))$ is the average of $h(x)$ over the iTrees

    • = ๋ชจ๋“  ๊ด€์ธก์น˜๋“ค์˜ ํ‰๊ท  Path length ๋Œ€๋น„ ๋‚ด๊ฐ€ ๊ด€์‹ฌ์žˆ๋Š” ๊ด€์ธก์น˜ ํ‰๊ท  Path length ๋„์ถœ
    • E(h(x))โ†’0,โ€…โ€Šs(x,n)โ†’20โ†’1E(h(x)) โ†’ 0, \;s(x,n) โ†’ 2^{0} โ†’ 1
      (Path length์˜ ํ‰๊ท ๊ฐ’์ด 0์— ๊ฐ€๊นŒ์›€์œผ๋กœ ์ด์ƒ ๋ฐ์ดํ„ฐ)
    • E(h(x))โ†’nโˆ’1(=๊ด€์ธก์น˜์˜โ€…โ€Š๊ฐœ์ˆ˜โˆ’1=์ตœ๋Œ€โ€…โ€Š๋ถ„๋ฆฌโ€…โ€ŠํšŸ์ˆ˜),โ€…โ€Šs(x,n)โ†’2โˆ’ๅคงโ†’0์—๊ฐ€๊นŒ์šด์ˆ˜E(h(x)) โ†’ n-1 (=๊ด€์ธก์น˜์˜\;๊ฐœ์ˆ˜ - 1 = ์ตœ๋Œ€\;๋ถ„๋ฆฌ\;ํšŸ์ˆ˜), \;s(x,n) โ†’ 2^{-ๅคง} โ†’ 0์— ๊ฐ€๊นŒ์šด ์ˆ˜
      (Path length์˜ ํ‰๊ท ๊ฐ’์ด n์— ๊ฐ€๊นŒ์›€์œผ๋กœ ์ •์ƒ ๋ฐ์ดํ„ฐ)
    • E(h(x))โ†’c(n),โ€…โ€Šs(x,n)โ†’2โˆ’1โ†’0.5E(h(x)) โ†’ c(n),\;s(x,n) โ†’ 2^{-1} โ†’ 0.5
      (Path length์˜ ํ‰๊ท ๊ฐ’์ด normalized h(x)h(x)์™€ ๋น„์Šทํ•จ์œผ๋กœ ์ •์ƒ ๋ฐ์ดํ„ฐ)

  • Anomaly score์˜ ๋ฒ”์œ„๋Š” 0~1์ด๋ฉฐ, 1์— ๊ฐ€๊นŒ์šฐ๋ฉด ์ด์ƒ ๋ฐ์ดํ„ฐ, 0.5์ดํ•˜๋ฉด ์ •์ƒ ๋ฐ์ดํ„ฐ๋ผ๊ณ  ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Code

  • ํŒŒ์ด์ฌ์—์„œ sklearn.ensemble ๋ชจ๋“ˆ์˜ IsolationForestํ•จ์ˆ˜๊ฐ€ ํ•ด๋‹น ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค.
  • ์ผ๋ฐ˜์ ์ธ sklearn์˜ ๋‹ค๋ฅธ ํ•จ์ˆ˜์™€ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ๋ฐ์ดํ„ฐ์— 'fit()', 'predict()'๋ฅผ ํ†ตํ•ด ๊ฒฐ๊ณผ๋ฅผ ์•Œ ์ˆ˜ ์žˆ์œผ๋ฉฐ, 'predict()'์˜ ๊ฒฐ๊ณผ๋Š” ์ •์ƒ(1), ์ด์ƒ(-1)๋กœ ํ‘œํ˜„๋ฉ๋‹ˆ๋‹ค.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style("darkgrid")
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# Generating training data 
X_train = 0.2 * rng.randn(1000, 2)
X_train = np.r_[X_train + 3, X_train]
X_train = pd.DataFrame(X_train, columns = ['x1', 'x2'])

# Generating new, 'normal' observation
X_test = 0.2 * rng.randn(200, 2)
X_test = np.r_[X_test + 3, X_test]
X_test = pd.DataFrame(X_test, columns = ['x1', 'x2'])

# Generating outliers
X_outliers = rng.uniform(low=-1, high=5, size=(50, 2))
X_outliers = pd.DataFrame(X_outliers, columns = ['x1', 'x2'])

# data scatter
plt.rcParams['figure.figsize'] = [10, 10]
p1 = plt.scatter(X_train.x1, X_train.x2, c='white', s=20*4, edgecolor='k', label='training observations')
# p2 = plt.scatter(X_test.x1, X_test.x2, c='green', s=20*4, edgecolor='k', label='new regular obs.')
p3 = plt.scatter(X_outliers.x1, X_outliers.x2, c='red', s=20*4, edgecolor='k', label='new abnormal obs.')

plt.legend()

# predict
clf = IsolationForest(max_samples=100, contamination = 0.1, random_state=42)
## ์ „์ฒด ๋ฐ์ดํ„ฐ ๋น„์œจ์˜ 10%๋ฅผ ์ด์ƒ์น˜๋กœ ์ง€์ •
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)

# Attach the outlier predictions to X_outliers and store the result in X_outliers2
X_outliers2 = X_outliers.assign(y = y_pred_outliers)
X_outliers2

# ์ด์ƒ ํƒ์ง€ scatter
plt.figure(figsize = (12,10))
p1 = plt.scatter(X_train.x1, X_train.x2, c='white',
                 s=20*4, edgecolor='k', label="training observations")
p2 = plt.scatter(X_outliers2.loc[X_outliers2.y == -1, ['x1']], 
                 X_outliers2.loc[X_outliers2.y == -1, ['x2']], 
                 c='red', s=20*4, edgecolor='k', label="detected outliers")
p3 = plt.scatter(X_outliers2.loc[X_outliers2.y == 1, ['x1']], 
                 X_outliers2.loc[X_outliers2.y == 1, ['x2']], 
                 c='green', s=20*4, edgecolor='k', label="detected regular obs")
plt.legend()

print("ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์…‹(normal data)์—์„œ ์ •ํ™•๋„:", list(y_pred_test).count(1)/y_pred_test.shape[0])
print("์ด์ƒ์น˜ ๋ฐ์ดํ„ฐ์…‹์—์„œ ์ •ํ™•๋„:", list(y_pred_outliers).count(-1)/y_pred_outliers.shape[0])

