[Paper Review] DeepShield: Fortifying Deepfake Video Detection with Local and Global Forgery Analysis

a9umon · February 2, 2026

paper_survey

1. Introduction

Let's efficiently find both local forgery details and global subtle artifacts.

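Existing strategies that raise generalization through augmentation have the drawback of a high computing cost.
To address this, part of a large pre-trained model (CLIP-ViT, see 3. Method) is used to raise generalization.
Such pre-trained models are trained to focus on global features. Let's design a module that addresses this.
- Reason: focusing on global features tends to cause overfitting to a specific forgery method, and subtle artifacts tend to be overlooked.
The model is composed as LPG → GFD.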

contribution point

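Bring in a ViT and try to raise generalization performance.
LPG : attends to artifacts in local regions
- SAM (spatiotemporal artifact modeling) : a new augmentation that prevents overfitting in deepfake detection - seems similar in concept to blending
GFD : mixes different forgery methods to synthesize new forgery data → raises generalization performance
- DFG : to reduce the domain gap, generates new forgery data that lies between different forgery domains → learns the essential features common to forgeries
- BFG : extends features beyond the boundary of the existing forgery domains
How LPG and GFD are connected → the loss function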

2. Related Work

1. Deepfake Detection

At the video level, both spatial and temporal artifacts have to be found.
Prior work usually combines temporal and spatial information, but this is hard to apply to new forgery patterns.

2. Deepfake Detection Using Synthetic Data

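Existing blending methods have trouble modeling how forgeries actually look in the real world.
To deal with this, the forgery patterns are diversified.
The forgery patterns are designed on a distribution basis, which produces more natural forgeries.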

3. Method

1. Input & Output configuration

$\mathbf{V} \in \mathbb{R}^{T\times H\times W\times3}$ : Video clip
$\left\{ f^{\mathrm{cls}}_t,\; f^{\mathrm{patch}}_{t,p} \right\} = E(\mathbf{V}), \quad t = 1, \ldots, T,\; p = 1, \ldots, P$
$f^{\mathrm{cls}}_t \in \mathbb{R}^{C}, \quad f^{\mathrm{patch}}_{t,p} \in \mathbb{R}^{C}$

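Note: the encoder does not extract a single cls token for the whole video; it extracts a class token for every frame.
The encoder is CLIP-ViT, with ST-Adapter added so that it can be tuned with a PEFT (parameter-efficient fine-tuning) scheme.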

2. Local Patch Guidance (LPG)

Existing datasets annotate each video as real/fake, but not which patches were forged.
So there is no patch-level GT, and classic supervised learning cannot be used for patch-level training.
Add the SAM module, which applies forgery patch by patch, to create new forged data that does come with patch-level GT!

Generating Patch-level Annotations

Use SAM to estimate the forged region (mask) of the fake video, and build P patch-level labels from the result.
- For an $H \times W$ frame, $H_p = \frac{H}{\sqrt{P}}, W_p = \frac{W}{\sqrt{P}}$

$\begin{aligned}\operatorname{PatchMaskScore}(\mathbf{M}_{t,p}) &= \sum_{h=1}^{H_p} \sum_{w=1}^{W_p} I[\mathbf{M}^{(h,w)}_{t,p} > 0]\\ y_{t,p}^{\text{patch}} &= \begin{cases} 1, & \text{if PatchMaskScore}(\mathbf{M}_{t,p}) \ge \theta, \\ 0, & \text{otherwise}. \end{cases} \end{aligned}$

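The formula above assigns a real/fake value to each patch.
- It counts the pixels of patch p in frame t that the SAM mask covers, i.e. how much of the patch the mask occupies.
- If that count is at or above the threshold $\theta$, the patch is labeled fake.
- In other words, $y^{\text{patch}}_{t,p}$ is a pseudo GT obtained at the patch level.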
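
The labeling rule above can be sketched in NumPy as follows. This is only a minimal illustration, not the paper's code: the function name `patch_labels`, the `grid` argument (standing for $\sqrt{P}$) and the toy mask are all assumptions made for the example.

```python
import numpy as np

def patch_labels(mask: np.ndarray, grid: int, theta: int) -> np.ndarray:
    """Turn a (T, H, W) forgery mask into (T, grid*grid) binary patch labels.

    grid is sqrt(P), the number of patches per side; theta is the minimum
    number of masked pixels for a patch to be labeled fake.
    """
    T, H, W = mask.shape
    hp, wp = H // grid, W // grid  # patch height H_p and width W_p
    # Split every frame into a grid x grid layout of (hp, wp) patches.
    blocks = mask[:, :hp * grid, :wp * grid].reshape(T, grid, hp, grid, wp)
    # PatchMaskScore: number of pixels with mask value > 0 inside each patch.
    score = (blocks > 0).sum(axis=(2, 4))
    return (score >= theta).astype(np.int64).reshape(T, grid * grid)

# Toy clip: 2 frames of 4x4 pixels, 2x2 patches per frame (P = 4).
mask = np.zeros((2, 4, 4))
mask[0, :2, :2] = 1.0   # frame 0: top-left patch fully masked (score 4)
mask[1, 3, 3] = 0.5     # frame 1: a single masked pixel (score 1)
labels = patch_labels(mask, grid=2, theta=2)
print(labels)
```

With `theta=2`, only the fully masked patch in frame 0 is labeled fake; the single masked pixel in frame 1 stays below the threshold.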

Learning Forgery Features at the Patch Level

Once patch-level annotation is done, the patch feature $f^{\text{patch}}_{t,p}$ is passed through a binary classifier $\phi$ to get $\text{Prob}^{\text{patch}}_{t,p}$.
Below is the BCE loss computed from this $\text{Prob}^{\text{patch}}_{t,p}$.

$\mathcal{L}_{\text{LPG}} = -\frac{1}{T \cdot P} \sum_{t,p} \left[ y_{t,p}^{\text{patch}} \cdot \log(\text{Prob}_{t,p}^{\text{patch}}) + (1 - y_{t,p}^{\text{patch}}) \cdot \log(1 - \text{Prob}_{t,p}^{\text{patch}}) \right]$

I recall a paper saying that BCE loss causes overfitting... designing a new loss function might be worth a try?
Here $\text{Prob}^{\text{patch}}_{t,p}$ is obtained from the patch feature and is the probability that the patch is fake.
The formula computes a loss for every patch, which pushes each patch feature to carry the forgery signal.
Through self-attention the model then refers to the patch tokens, so it picks up local characteristics well.
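
A small NumPy sketch of the patch-level BCE loss, to make the averaging over $T \cdot P$ concrete. The linear-plus-sigmoid `patch_probs` is a hypothetical stand-in for $\phi$ (the paper's actual classifier head is not described in these notes), and the `eps` clipping is added only for numerical safety.

```python
import numpy as np

def patch_probs(feat: np.ndarray, w: np.ndarray, b: float) -> np.ndarray:
    """Hypothetical classifier phi: (T, P, C) patch features -> (T, P) fake probabilities."""
    return 1.0 / (1.0 + np.exp(-(feat @ w + b)))

def lpg_loss(prob: np.ndarray, y: np.ndarray, eps: float = 1e-7) -> float:
    """Binary cross-entropy averaged over all T*P patches."""
    p = np.clip(prob, eps, 1.0 - eps)  # avoid log(0)
    bce = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return float(bce.mean())

rng = np.random.default_rng(0)
feat = rng.normal(size=(2, 4, 8))      # T=2 frames, P=4 patches, C=8 channels
y_patch = np.array([[1, 0, 0, 0], [0, 0, 0, 0]])
w, b = np.zeros(8), 0.0                # untrained classifier: every patch gets 0.5
loss = lpg_loss(patch_probs(feat, w, b), y_patch)
print(loss)                            # equals log(2) when every probability is 0.5
```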

SAM - spatial & temporal blending

1. Spatial artifact generating

An extension of the SBI module - the key question is how the temporal forgery is implemented.
Each frame's image is split as below and a blended frame is generated.

$\begin{aligned} \{I_t \mid t = 1, \dots, T\} &: \text{video clip frames}\\ I_t^{\text{inner}}, I_t^{\text{outer}} &: \text{inner and outer frames, relative to the face}\\ I_t^{\text{blend}} &= M_t^{\text{blend}} \odot I_t^{\text{inner}} + (1 - M_t^{\text{blend}}) \odot I_t^{\text{outer}} \end{aligned}$

The formula above adds forgery to the face or background region according to the SAM mask → this is what yields the patch-level pseudo labels.
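
The blending formula is a per-pixel convex combination, which can be sketched as below. The function name `blend_frames` and the toy frames are assumptions for illustration; how the inner and outer frames are actually prepared is not covered here.

```python
import numpy as np

def blend_frames(inner: np.ndarray, outer: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """I_blend = M * I_inner + (1 - M) * I_outer, applied frame by frame.

    inner, outer: (T, H, W, 3) float frames; mask: (T, H, W) with values in [0, 1].
    """
    m = mask[..., None]  # add a channel axis so the mask broadcasts over RGB
    return m * inner + (1.0 - m) * outer

T, H, W = 2, 4, 4
inner = np.full((T, H, W, 3), 1.0)   # toy "inner" frames (all ones)
outer = np.zeros((T, H, W, 3))       # toy "outer" frames (all zeros)
mask = np.zeros((T, H, W))
mask[:, 1:3, 1:3] = 1.0              # centre region comes fully from `inner`
mask[:, 0, 0] = 0.25                 # a soft mask value mixes both sources
blended = blend_frames(inner, outer, mask)
print(blended[0, :, :, 0])
```

Because the same mask that drives the blend is known exactly, it can be reused directly as the source of the patch-level labels.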

2. Temporal Artifact Generating

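A consistent forgery pattern is maintained across the T consecutive frames.
A single base mask is set, and its shape or blur is changed little by little following a consistent rule, so that temporal artifacts can be observed.
How this is implemented in code still needs checking.. the paper describes it far too briefly.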
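
Since the paper is brief here, the following is only one possible reading of "a base mask changed by a consistent rule", not the authors' implementation. It assumes a soft circular mask whose radius drifts linearly over the clip; `temporal_masks`, `r_start`, `r_end` and `softness` are all made-up names for this sketch.

```python
import numpy as np

def temporal_masks(T: int, H: int, W: int, r_start: float, r_end: float,
                   softness: float = 1.5) -> np.ndarray:
    """Return (T, H, W) soft circular masks whose radius changes linearly over time."""
    ys, xs = np.mgrid[0:H, 0:W]
    cy, cx = (H - 1) / 2.0, (W - 1) / 2.0
    dist = np.sqrt((ys - cy) ** 2 + (xs - cx) ** 2)   # distance to the frame centre
    radii = np.linspace(r_start, r_end, T)            # one radius per frame
    # Soft edge: 1 inside the radius, falling linearly to 0 over `softness` pixels.
    return np.clip((radii[:, None, None] - dist[None]) / softness + 1.0, 0.0, 1.0)

masks = temporal_masks(T=4, H=16, W=16, r_start=3.0, r_end=6.0)
areas = masks.sum(axis=(1, 2))
print(areas)   # the masked area grows smoothly from frame to frame
```

Each frame's mask could then be used as $M_t^{\text{blend}}$ in the spatial blending step, so the blended region evolves coherently instead of jumping between frames.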

3. Global Forgery Diversification (GFD)

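Here DFG is the step that creates the triangles lying between the already-separated clusters, and BFG is the step that widens the boundary so that it also covers the X marks pushed outside it (the triangles and X marks are the markers in the figure).
ViT-based models tend to overfit to local artifacts through their token-to-token computation.
This part is designed to reduce that bias.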

Domain Feature Augmentation (DFA)

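Forgery patterns are learned through their distribution statistics.
N samples are drawn at random, each consisting of T consecutive frames (the statistics below sum over both n and t).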

$\begin{aligned} \mu_c &= \frac{1}{N \cdot T} \sum_{n=1}^{N} \sum_{t=1}^{T} f_{n,t,c}^{\text{cls}}\\ \sigma_c &= \sqrt{\frac{1}{N \cdot T} \sum_{n=1}^{N} \sum_{t=1}^{T} \left(f^{\text{cls}}_{n,t,c} - \mu_c\right)^2} \end{aligned}$

In the formula above, c is the index of the element (channel) within each frame's feature vector.
If a mean and standard deviation are computed for every element, isn't the computing cost huge..?
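
On the cost question: each channel needs only one mean and one standard deviation over its $N \cdot T$ values, so the work is linear in $N \cdot T \cdot C$. A sketch, assuming the cls tokens are stacked into an `(N, T, C)` array (the name `channel_stats` is hypothetical):

```python
import numpy as np

def channel_stats(f_cls: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Per-channel mean and standard deviation of (N, T, C) cls features.

    Both reductions run over the sample axis n and the frame axis t,
    matching the 1 / (N * T) normalisation in the formula.
    """
    mu = f_cls.mean(axis=(0, 1))      # shape (C,)
    sigma = f_cls.std(axis=(0, 1))    # population std (ddof=0), shape (C,)
    return mu, sigma

f_cls = np.arange(24, dtype=float).reshape(2, 3, 4)   # N=2, T=3, C=4
mu, sigma = channel_stats(f_cls)
print(mu)      # one mean per channel
print(sigma)   # one standard deviation per channel
```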

$\begin{aligned} \mu_c^{\text{mix}} &= \lambda\mu_c + (1 - \lambda)\tilde{\mu}_c, \\ \sigma_c^{\text{mix}} &= \lambda\sigma_c + (1 - \lambda)\tilde{\sigma}_c. \end{aligned}$

Here the tilde denotes the mean and standard deviation of the real video that is paired with the fake one.

$\hat{f}_{n,t,c}^{\text{cls}} = \sigma_{c}^{\text{mix}} \cdot \left( \frac{f_{n,t,c}^{\text{cls}} - \mu_{c}}{\sigma_{c}} \right) + \mu_{c}^{\text{mix}}$

This gives each element of the final cls feature.
The domain bridge mentioned here serves to create yet another forged sample that lies somewhere between the existing forged ones.
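
The three steps (compute statistics, mix them with $\lambda$, re-normalise) can be sketched together as below. This is a hedged illustration: `mix_statistics` is a made-up name, the random features are toy data, and the small `eps` in the denominator is an addition for numerical safety that does not appear in the formula.

```python
import numpy as np

def mix_statistics(f: np.ndarray, f_tilde: np.ndarray, lam: float,
                   eps: float = 1e-6) -> np.ndarray:
    """Re-style (N, T, C) cls features f using statistics mixed with those of f_tilde."""
    mu, sigma = f.mean(axis=(0, 1)), f.std(axis=(0, 1))
    mu_t, sigma_t = f_tilde.mean(axis=(0, 1)), f_tilde.std(axis=(0, 1))
    mu_mix = lam * mu + (1.0 - lam) * mu_t            # mixed mean
    sigma_mix = lam * sigma + (1.0 - lam) * sigma_t   # mixed standard deviation
    # Normalise with the original statistics, then re-scale with the mixed ones.
    return sigma_mix * (f - mu) / (sigma + eps) + mu_mix

rng = np.random.default_rng(0)
f = rng.normal(0.0, 1.0, size=(4, 8, 16))         # toy features of one group
f_tilde = rng.normal(3.0, 2.0, size=(4, 8, 16))   # toy features of the other group
f_hat = mix_statistics(f, f_tilde, lam=0.5)
print(f_hat.mean(), f_hat.std())   # statistics land between the two groups
```

With `lam=1` the features are returned essentially unchanged, and with `lam=0` they take on the statistics of `f_tilde`, so intermediate values of $\lambda$ place the synthesized features between the two.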

Global Clip Feature Representation

$f_v = \frac{1}{T} \sum_{t=1}^{T} f_{v,t}^{\text{cls}}$

For each video clip, the frame-level class embeddings are averaged to compute the global feature.

Dedicated Training Objective

$\mathcal{L}_{\text{GFD}} = \mathcal{L}^{\text{cls}} + \upsilon \mathcal{L}^{\text{supCon}}$

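During training the model sees the original training videos, the videos generated by SAM, and the features synthesized by DFA.
With this loss design, the model separates real/fake well while also generalizing to new types of forgery patterns.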

1. $\mathcal{L}^{\text{cls}}$ - Cross Entropy Loss

$\mathcal{L}^{\text{cls}} = \frac{1}{B} \sum_{v=1}^{B} H(\text{Prob}_v, y_v)$

$\text{Prob}_v$ : the probability predicted for sample $v$ - the module that outputs this probability needs to be checked in the code
$y_v$ : ground truth label
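
A minimal sketch of the batch-averaged cross-entropy, assuming the classifier outputs two logits per clip and that label 0 means real and 1 means fake. Both are assumptions: as noted above, the module that produces $\text{Prob}_v$ still has to be checked in the code.

```python
import numpy as np

def cls_loss(logits: np.ndarray, y: np.ndarray) -> float:
    """Mean cross-entropy H(Prob_v, y_v) over a batch.

    logits: (B, 2) real/fake scores; y: (B,) integer labels (0 = real, 1 = fake, assumed).
    """
    z = logits - logits.max(axis=1, keepdims=True)            # stabilise the softmax
    log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(y)), y].mean())

y = np.array([0, 1])
uninformed = cls_loss(np.zeros((2, 2)), y)                    # equal logits for both classes
confident = cls_loss(np.array([[10.0, -10.0], [-10.0, 10.0]]), y)
print(uninformed, confident)
```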

2. $\mathcal{L}^{\text{supCon}}$ - Supervised Contrastive Loss

$\begin{aligned} \mathcal{L}^{\text{supCon}} &= \frac{1}{B} \sum_{v=1}^{B} - \frac{1}{|J(v)|} \sum_{i \in J(v)} L(v, i)\\ L(v, i) &= \log \frac{\exp(f_v \cdot f_i / \tau)}{\sum_{j \in J(v) \setminus \{v\}} \exp(f_v \cdot f_j / \tau)} \end{aligned}$

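It needs checking whether $J(v)$, the set of samples in the same class, means clips of the same real/fake class, or different forged videos of the same person.
Also need to check what value of $\tau$ was chosen as optimal.
The expression takes, for each video, the average log-softmax term over $J(v)$, and then averages it over the batch.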
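
A sketch of a supervised contrastive loss on the clip features, under several stated assumptions: $J(v)$ is read as "other clips with the same real/fake label" (one of the two readings raised above), features are L2-normalised, `tau=0.1` is a placeholder rather than the paper's value, and the denominator runs over every other clip in the batch as in the standard SupCon formulation, which differs from the denominator as transcribed above.

```python
import numpy as np

def supcon_loss(f_cls: np.ndarray, y: np.ndarray, tau: float = 0.1) -> float:
    """Supervised contrastive loss on clip features.

    f_cls: (B, T, C) frame-level cls tokens; y: (B,) class labels.
    """
    f = f_cls.mean(axis=1)                              # f_v: average over the T frames
    f = f / np.linalg.norm(f, axis=1, keepdims=True)    # L2-normalise (assumption)
    sim = f @ f.T / tau                                 # pairwise f_v . f_i / tau
    not_self = ~np.eye(len(y), dtype=bool)
    # Log-softmax over all other clips in the batch (standard SupCon denominator).
    sim_max = sim.max(axis=1, keepdims=True)
    exp = np.exp(sim - sim_max) * not_self
    log_prob = sim - sim_max - np.log(exp.sum(axis=1, keepdims=True))
    pos = (y[:, None] == y[None, :]) & not_self         # J(v): same label, excluding v
    n_pos = pos.sum(axis=1)
    valid = n_pos > 0                                   # skip clips without any positive
    per_clip = -(log_prob * pos).sum(axis=1)[valid] / n_pos[valid]
    return float(per_clip.mean())

# Toy batch: clips 0-1 point one way, clips 2-3 the other (B=4, T=3, C=2).
base = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
f_cls = np.repeat(base[:, None, :], 3, axis=1)
y_good = np.array([0, 0, 1, 1])   # labels agree with the feature clusters
y_bad = np.array([0, 1, 0, 1])    # labels cut across the clusters
loss_good = supcon_loss(f_cls, y_good)
loss_bad = supcon_loss(f_cls, y_bad)
print(loss_good, loss_bad)
```

The loss is small when same-label clips already sit close together and large when they do not, which is the pull-together, push-apart behaviour the term adds on top of the cross-entropy. Note that this sketch averages over clips that have at least one positive rather than dividing by $B$ unconditionally.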

Model Optimization

$\mathcal{L}^{\text{overall}} = \omega \mathcal{L}_{\text{LPG}} + \mathcal{L}^{\text{cls}} + \upsilon \mathcal{L}^{\text{supCon}}$
