Anomaly Detection - Local Outlier Factor (LOF)

์•ˆ์„ฑ์ธยท2022๋…„ 6์›” 24์ผ
0

๐Ÿ“– ์ด๋ฒˆ ํฌ์ŠคํŒ…์€ ๊ณ ๋ ค๋Œ€ํ•™๊ต ๊น€์„ฑ๋ฒ” ๊ต์ˆ˜๋‹˜์˜ ๊ฐ•์˜์˜์ƒ์„ ์ฐธ๊ณ ํ•˜์—ฌ ๋ฐ€๋„๊ธฐ๋ฐ˜ ์ด์ƒ์น˜ ํƒ์ง€ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ค‘ LOF์— ๋Œ€ํ•ด ๋‹ค๋ฃจ๊ณ ์ž ํ•ฉ๋‹ˆ๋‹ค.
์šฐ์„  Anomaly Detection์ด ๋ฌด์—‡์ธ์ง€์— ๋Œ€ํ•ด ์•Œ์•„๋ด…์‹œ๋‹ค.


Anomaly Detection (์ด์ƒ์น˜ ํƒ์ง€)

Notion

  • ์‰ฝ๊ฒŒ ๋งํ•ด, ๋ฐ์ดํ„ฐ ์•ˆ์—์„œ Normal(์ •์ƒ) sample๊ณผ Abnormal(๋น„์ •์ƒ, ์ด์ƒ์น˜, ํŠน์ด์น˜) sample์„ ๊ตฌ๋ณ„ํ•ด๋‚ด๋Š” ๋ฌธ์ œ๋ฅผ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค. ์ฆ‰, ๋งค์šฐ ๋งŽ์€ ์ •์ƒ ๋ฐ์ดํ„ฐ์—์„œ ๊ทน์†Œ์ˆ˜์˜ ๋น„์ •์ƒ ๋ฐ์ดํ„ฐ๋ฅผ ๊ตฌ๋ณ„ํ•˜๋Š” ๊ฒƒ์ด ๋ชฉ์ ์ž…๋‹ˆ๋‹ค.
  • ์ฃผ๋กœ ์ด๋ฒคํŠธ ์‚ฌ๊ฑด์ด ์ •์ƒ ๋ฐ์ดํ„ฐ์— ๋น„ํ•ด ์ ์€ ๋ฐ์ดํ„ฐ๋ฅผ ๋‹ค๋ฃจ๋Š” ๋ถ„์•ผ์—์„œ ๋งŽ์ด ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
  • Anomaly Detection์€ ํ•™์Šต ๋ฐ์ดํ„ฐ ์…‹์— ๋น„์ •์ƒ์ ์ธ sample์ด ํฌํ•จ๋˜๋Š”์ง€, ๊ฐ sample์˜ label์ด ์กด์žฌํ•˜๋Š”์ง€, ๋น„์ •์ƒ์ ์ธ sample์˜ ์„ฑ๊ฒฉ์ด ์ •์ƒ sample๊ณผ ์–ด๋–ป๊ฒŒ ๋‹ค๋ฅธ์ง€, ์ •์ƒ sample์˜ class๊ฐ€ ๋‹จ์ผ class ์ธ์ง€ Multi-class ์ธ์ง€ ๋“ฑ์— ๋”ฐ๋ผ ๋‹ค๋ฅธ ์šฉ์–ด๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
    ์˜ˆ๋ฅผ ๋“ค์–ด, Novelty Detection๋ผ๋Š” ์šฉ์–ด๋„ ์กด์žฌํ•˜๋‚˜ ์ด๋Š” ์ง€๊ธˆ๊นŒ์ง€ ๋“ฑ์žฅํ•˜์ง€ ์•Š์•˜์ง€๋งŒ ์ถฉ๋ถ„ํžˆ ๋“ฑ์žฅํ•  ์ˆ˜ ์žˆ๋Š” sample์„ ์ฐพ์•„๋‚ด๋Š” ์—ฐ๊ตฌ, ์ฆ‰ ๋ฐ์ดํ„ฐ๊ฐ€ ์˜ค์—ผ์ด ๋˜์ง€ ์•Š์€ ์ƒํ™ฉ์„ ๊ฐ€์ •ํ•˜๋Š” ์—ฐ๊ตฌ์™€ ๊ด€๋ จ๋œ ์šฉ์–ด๋ผ๊ณ  ํ•  ์ˆ˜ ์žˆ๊ณ , Outlier Detection์€ ๋“ฑ์žฅํ•  ๊ฐ€๋Šฅ์„ฑ์ด ๊ฑฐ์˜ ์—†๋Š”, ๋ฐ์ดํ„ฐ์— ์˜ค์—ผ์ด ๋ฐœ์ƒํ–ˆ์„ ๊ฐ€๋Šฅ์„ฑ์ด ์žˆ๋Š” sample์„ ์ฐพ์•„ ๋‚ด๋Š” ์—ฐ๊ตฌ์™€ ๊ด€๋ จ๋œ ์šฉ์–ด ์ •๋„๋กœ ๊ตฌ๋ถ„ํ•˜์—ฌ ์ •๋ฆฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Local Outlier Factor (LOF)

Notion

  • ๊ฐ๊ฐ์˜ ๊ด€์ธก์น˜๊ฐ€ ๋ฐ์ดํ„ฐ ์•ˆ์—์„œ ์–ผ๋งˆ๋‚˜ ๋ฒ—์–ด๋‚˜ ์žˆ๋Š”๊ฐ€์— ๋Œ€ํ•œ ์ •๋„(์ด์ƒ์น˜ ์ •๋„)๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. LOF์˜ ๊ฐ€์žฅ ์ค‘์š”ํ•œ ํŠน์ง•์€ ๋ชจ๋“  ๋ฐ์ดํ„ฐ๋ฅผ ์ „์ฒด์ ์œผ๋กœ ๊ณ ๋ คํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ, ํ•ด๋‹น ๊ด€์ธก์น˜์˜ ์ฃผ๋ณ€ ๋ฐ์ดํ„ฐ(neighbor)๋ฅผ ์ด์šฉํ•˜์—ฌ ๊ตญ์†Œ์ (local) ๊ด€์ ์œผ๋กœ ์ด์ƒ์น˜ ์ •๋„๋ฅผ ํŒŒ์•…ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.
  • ์ •์ƒ๊ฐ์ฒด๋Š” ์ฃผ๋กœ ์ฃผ๋ณ€์— ๋ฐ์ดํ„ฐ๊ฐ€ ๋งŽ์ด ์กด์žฌํ•˜๋ฉฐ, ๋ถˆ๋Ÿ‰ ๊ฐ์ฒด๋Š” ์ฃผ๋กœ ๋‹จ๋…์œผ๋กœ ์กด์žฌํ•จ์„ ๊ฐ€์ •ํ•ฉ๋‹ˆ๋‹ค.
  • ์ฃผ๋ณ€ ๋ฐ์ดํ„ฐ๋ฅผ ๋ช‡๊ฐœ๊นŒ์ง€ ๊ณ ๋ คํ•  ๊ฒƒ์ธ๊ฐ€๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” k๋ผ๋Š” ํ•˜์ดํผ-ํŒŒ๋ผ๋ฏธํ„ฐ(hyper-parameter)๋งŒ ๊ฒฐ์ •ํ•˜๋ฉด ๋˜๋Š” ์žฅ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๋ฐ์ดํ„ฐ๊ฐ€ ๋งŽ์•„์งˆ์ˆ˜๋ก ๊ณ„์‚ฐ๋ณต์žก๋„๊ฐ€ ์˜ฌ๋ผ๊ฐ€๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์ž…๋‹ˆ๋‹ค.
  • ๋ฐ์ดํ„ฐ๊ฐ€ ๊ณ ์ฐจ์›์ธ ๊ฒฝ์šฐ, ์ €์ฐจ์›์œผ๋กœ ์ฐจ์› ์ถ•์†Œํ•˜์—ฌ ์‹œํ–‰ํ•˜๋Š” ๊ฒƒ์ด ์ข‹์Šต๋‹ˆ๋‹ค.

Motivation

  • ์ง‘๋‹จ C1 ๊ณผ ์ง‘๋‹จ C2์˜ density๊ฐ€ ๋‹ค๋ฅด๋ฉฐ, ๋ฐ์ดํ„ฐ o1์€ ๋ˆˆ์— ๋„๊ฒŒ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์™€์˜ ๊ฑฐ๋ฆฌ๊ฐ€ ๋ฉ€๊ธฐ์— ๊ฑธ๋Ÿฌ๋‚ด๊ธฐ ์‰ฝ์ง€๋งŒ ๋ฐ์ดํ„ฐ o2๋Š” ๊ฑธ๋Ÿฌ๋‚ด๊ธฐ๊ฐ€ ์–ด๋ ต์Šต๋‹ˆ๋‹ค.
    ์ผ์ • ๊ฑฐ๋ฆฌ๋กœ ๊ธฐ์ค€์„ ์‚ผ์„ ๊ฒฝ์šฐ, C1ํ˜น์€ C2์—๋งŒ ํŠนํ™”๋œ outlier detction์„ ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.
    ์ด๋Ÿฌํ•œ ๋ฌธ์ œ์˜์‹์—์„œ, local์˜ ์ƒ๋Œ€์ ์ธ dense๋ฅผ ๋น„๊ตํ•˜์—ฌ outlier๋ฅผ ์ •ํ•˜์ž๋Š” lof๊ฐ€ ๋“ฑ์žฅํ–ˆ์Šต๋‹ˆ๋‹ค.
    ํฐ ํ‹€์€, neighbor๋“ค์˜ dense๋ฅผ ๊ณ ๋ คํ•˜์—ฌ ๋น„๊ตํ•œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

Algorithm

1. K-distance of object p

  • ์ž๊ธฐ ์ž์‹ (pp)๋ฅผ ์ œ์™ธํ•˜๊ณ  k๋ฒˆ์งธ๋กœ ๊ฐ€๊นŒ์šด ์ด์›ƒ๊ณผ์˜ ๊ฑฐ๋ฆฌ

  • ์˜ˆ์‹œ : 4-distance of object p : ์ž๊ธฐ ์ž์‹ ์„ ์ œ์™ธํ•œ 4๋ฒˆ์งธ๋กœ ๊ฐ€๊นŒ์šด ์ด์›ƒ๊ณผ์˜ ๊ฑฐ๋ฆฌ = 2.0

  • ์ฃผ์˜์‚ฌํ•ญ : distance๊ฐ€ continuous๋ผ๋ฉด 3-distance๋‚ด์— ์ •ํ™•ํžˆ 3๊ฐœ์˜ neighbor๊ฐ€ ๋“ค์–ด์žˆ๊ฒ ์ง€๋งŒ, ๊ฑฐ๋ฆฌ๊ฐ€ 1,2,3,3,3,3๊ฐ™์ด discreteํ•ด์„œ ๊ฒน์น˜๋Š” ๊ฒฝ์šฐ๋ผ๋ฉด 3-distance๋‚ด์— 5๊ฐœ๋“  10๊ฐœ๋“  neighbor๋กœ ๋“ค์–ด์žˆ์„ ์ˆ˜๋Š” ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ๋”ฐ๋กœ ๋‚˜ํƒ€๋‚ด์ฃผ๊ธฐ ์œ„ํ•ด k-distance(p)์•ˆ์— ๋“ค์–ด์˜จ ๋ฐ์ดํ„ฐ๊ฐฏ์ˆ˜๋ฅผ Nk,(p)N_k,(p)๋ผ๊ณ  ๋ถ€๋ฅธ๋‹ค.


2. K-distance neighborhood of object p ((Nk,(p))(N_k,(p)))

  • k๋ฒˆ์งธ๋กœ ๊ฐ€๊นŒ์šด ์ด์›ƒ๊ณผ์˜ ๊ฑฐ๋ฆฌ๋ฅผ ์›์œผ๋กœ ํ‘œํ˜„ํ•  ๋•Œ, ์› ์•ˆ์— ํฌํ•จ๋˜๋Š” ๋ชจ๋“  ๊ฐ์ฒด๋“ค์˜ ๊ฐœ์ˆ˜
  • ์˜ˆ์‹œ : k = 4์ผ ๋•Œ, N4(p)N_4(p) = 4๊ฐœ ํฌํ•จ

3. Reachabillity Distance (Reachabillityโ€…โ€Šdistancek(p,o)Reachabillity\;distance_k(p,o))

  • ์ฃผ๋ณ€ ๋ฐ์ดํ„ฐ o๋ฅผ ๊ธฐ์ค€์œผ๋กœ k๋ฒˆ์งธ ๊ฐ€๊นŒ์šด ์ด์›ƒ๊ณผ์˜ ๊ฑฐ๋ฆฌ(k-distance of o)์™€ o์™€ p์‚ฌ์ด ๊ฑฐ๋ฆฌ ๊ฐ„์˜ ์ตœ๋Œ€๊ฐ’
    • Reachabillityโ€…โ€Šdistancek(p,o)=max(kโˆ’distanceโ€…โ€Šofโ€…โ€Šoโ€…โ€Š,d(p,o))Reachabillity\;distance_k(p,o) = max(k-distance\;of\;o\;, d(p,o))
  • ์˜ˆ์‹œ : k = 5์ผ ๋•Œ, k-distance of o๋Š” 1.2, d(p,o)d(p,o)๋Š” 2.0์ด๋‹ค.
    Reachabillityโ€…โ€Šdistance5(p,o)=max(1.2,2.0)=2.0Reachabillity\;distance_5(p,o) = max(1.2,2.0) = 2.0

4. Local reachabillity density of object p (lrdk(p)lrd_k(p))

  • p์ฃผ๋ณ€์˜ k_neighbor๋“ค๊ณผ์˜ reach_dist์˜ ํ‰๊ท ์„ inverse(์—ญ์ˆ˜)์ทจํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค.
    ์ฃผ๋ณ€์˜ dense๋ฅผ ๊ณ ๋ คํ•œ p์ ์—์„œ์˜ โ€˜neighbor๋“ค๊ณผ์˜ ์ ๋‹นํ•œ ๊ฑฐ๋ฆฌโ€™๋ฅผ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  • lrd๋Š” inverse์ž…๋‹ˆ๋‹ค.

    • lrdk(p)=โˆฃNk(p)โˆฃโˆ‘oโˆˆNk(p)Reachabillityโ€…โ€Šdistancek(p,o)lrd_k(p) = \frac{\vert N_k(p) \vert}{\displaystyle\sum_{o \in N_k(p)}^{}{Reachabillity\;distance_k(p,o)}}
  • ์ž๊ธฐ ์ž์‹ (p) ์ฃผ๋ณ€์˜ ๋ฐ€๋„๊ฐ€ ๋†’์€ ๊ฒฝ์šฐ, ์œ„ ์‹์˜ ๋ถ„๋ชจ๊ฐ€ ์ž‘์•„์ง€๊ฒŒ ๋˜์–ด lrdk(p)lrd_k(p)๊ฐ’์ด ์ปค์ง€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. = ์ •์ƒ
  • ๋ฐ˜๋Œ€๋กœ ์ž๊ธฐ ์ž์‹ (p) ์ฃผ๋ณ€์˜ ๋ฐ€๋„๊ฐ€ ๋‚ฎ์€ ๊ฒฝ์šฐ, ์œ„ ์‹์˜ ๋ถ„๋ชจ๊ฐ€ ์ปค์ง€๊ฒŒ ๋˜์–ด lrdk(p)lrd_k(p)๊ฐ’์ด ์ž‘์•„์ง€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. = ๋น„์ •์ƒ

5. Local Outlier Factor of object p (LOFk(p)LOF_k(p))

  • ์ž๊ธฐ ์ž์‹ (p)์˜ ์ตœ์ข… Local Outlier Factor๊ฐ’์„ ๊ณ„์‚ฐ

    • LOFk(p)=โˆ‘oโˆˆNk(p)lrdk(o)/lrdk(p)โˆฃNk(p)โˆฃLOF_k(p) = \frac{\displaystyle\sum_{o \in N_k(p)}^{}{lrd_k(o)/lrd_k(p)}}{\vert N_k(p) \vert} = lrdk(o)lrd_k(o)์™€ lrdk(p)lrd_k(p)์˜ ๋น„์œจ์˜ ํ‰๊ท 

  • ์ž๊ธฐ ์ž์‹ (p) ์ฃผ๋ณ€์˜ ๋ฐ€๋„๊ฐ€ ๋†’์€ ๊ฒฝ์šฐ, ํฐ lrdk(o)lrd_k(o), ํฐ lrdk(p)lrd_k(p) = 1๊ทผ๋ฐฉ์˜ LOFk(p)LOF_k(p)๊ฐ’์„ ๊ฐ–์Šต๋‹ˆ๋‹ค.

  • ์ž๊ธฐ ์ž์‹ (p) ์ฃผ๋ณ€์˜ ๋ฐ€๋„๊ฐ€ ๋‚ฎ์€ ๊ฒฝ์šฐ, ํฐ lrdk(o)lrd_k(o), ์ž‘์€ lrdk(p)lrd_k(p) = 1๋ณด๋‹ค ํฐ LOFk(p)LOF_k(p)๊ฐ’์„ ๊ฐ–์Šต๋‹ˆ๋‹ค.
  • ํ•˜์ง€๋งŒ ์ž๊ธฐ ์ž์‹ (p) ์ฃผ๋ณ€์˜ ๋ฐ€๋„๊ฐ€ ๋‚ฎ์ง€๋งŒ ์ „์ฒด์ ์œผ๋กœ ๋ฐ€๋„๊ฐ€ ๋‚ฎ์€ ๊ฒฝ์šฐ, ์ž‘์€ lrdk(o)lrd_k(o), ์ž‘์€ lrdk(p)lrd_k(p) = 1๊ทผ๋ฐฉ์˜ LOFk(p)LOF_k(p)๊ฐ’์„ ๊ฐ–๊ฒŒ๋ฉ๋‹ˆ๋‹ค.

  • ๊ฒฐ๋ก ์ ์œผ๋กœ, 1๊ทผ๋ฐฉ์˜ ๊ฐ’์„ ๊ฐ€์ง„๋‹ค๋ฉด, ์˜ค๋ฐ€์กฐ๋ฐ€ํ•œ ๊ตฐ์ง‘์ด๋“  ํฌ๊ฒŒ ํŽผ์ณ์ง„ ๊ตฐ์ง‘์ด๋“  ๋ฐ์ดํ„ฐ๊ฐ€ ๊ทธ ์•ˆ์— ํฌํ•จ๋˜์–ด ์žˆ๋‹ค๊ณ  ๊ฐ„์ฃผํ•˜์—ฌ ์ •์ƒ์ด๋ผ๊ณ  ํ•  ์ˆ˜ ์žˆ๋‹ค.

Code

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
from matplotlib.pyplot import figure
figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')


np.random.seed(123)

# Generate train data
X_inliers = 0.3 * np.random.randn(100, 2) ## ์ •๊ทœ๋ถ„ํฌ์—์„œ 100*2๋งŒ๋“ค๊ณ 
X_inliers = np.r_[X_inliers + 2, 2*X_inliers - 2] ## ๊ฐ๊ฐ 2,2 ํ˜น์€ -2,-2๋งŒํผ ํ‰ํ–‰์ด๋™ํ•œ๊ฑฐ๋ฅผ vstack. ์ฆ‰ cluster 2๊ฐœ

# Generate some outliers
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X_inliers, X_outliers] ##-4,4์—์„œ ๋ฝ‘์€ outlier์™€ inlier๋ฅผ vstack

n_outliers = len(X_outliers)
ground_truth = np.ones(len(X), dtype=int)
ground_truth[-n_outliers:] = -1

# fit the model for outlier detection (default)
clf = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
## use fit_predict to compute the predicted labels of the training samples
## (when LOF is used for outlier detection, the estimator has no predict, decision_function and score_samples methods).
y_pred = clf.fit_predict(X) # 1,-1๋กœ ๋‚˜์˜จ๋‹ค.
n_errors = (y_pred != ground_truth).sum()
X_scores = clf.negative_outlier_factor_ ## ์Œ์ˆ˜ LOF scsore

plt.title("Local Outlier Factor (LOF)")
plt.scatter(X[:, 0], X[:, 1], color='black', s=3., label='Data points')
# plot circles with radius proportional to the outlier scores
radius = (X_scores.max() - X_scores) / (X_scores.max() - X_scores.min())
plt.scatter(X[:, 0], X[:, 1], s=500 * radius, edgecolors='g',
            facecolors='none', label='Outlier scores')
n = np.copy(X_scores)
n[n>-2] = np.nan ## LOF scsore(์Œ์ˆ˜)์˜ threshold๋Š” 2
n = np.round(n,2)
n = -1 * n ## ์–‘์ˆ˜๋ณ€ํ™˜

# LOF๋งŒ txtํ‘œ๊ธฐ
for i, txt in enumerate(n):
    if np.isnan(txt):continue
    plt.annotate(txt, (X[i,0], X[i,1]))
legend = plt.legend(loc='upper left')
plt.show()
profile
ํ•จ๊ป˜ ๊ณต๋ถ€ํ•ด์š”!

0๊ฐœ์˜ ๋Œ“๊ธ€