Binary Level Toolchain Provenance Identification with Graph Neural Network

nananana · September 19, 2022

Aim:

Devise a machine learning (ML) based solution to the toolchain provenance identification problem for stripped binaries.

Background:

Toolchain provenance: the compiler family, compiler version, and optimization level used to build a binary.

Its importance seems questionable. The stated motivations are:
1. It helps determine security flaws.
2. It helps in identifying functions.

However, security flaws are usually introduced by the programmer, not the toolchain.
It is also hard to see how knowing the toolchain helps with identifying functions.

Site Neural Network (SNN)

A GNN-based framework used to determine the compiler toolchain.
It uses a hierarchy of SNNs, each making a binary decision.
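
The hierarchy of binary decisions can be pictured as a decision tree whose internal nodes are SNNs. The sketch below is only an illustration: the tree shape, the `Expert` class, the `classify` function, and the threshold rules standing in for trained SNNs are all assumptions, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

Sample = Dict[str, float]  # hypothetical per-binary scores, for illustration only


@dataclass
class Expert:
    """One local expert: makes a binary decision and routes the sample onward."""
    decide: Callable[[Sample], bool]   # stand-in for a trained SNN
    if_true: Union["Expert", str]      # next expert, or a final label
    if_false: Union["Expert", str]


def classify(node: Union[Expert, str], sample: Sample) -> str:
    """Walk the hierarchy until a leaf label is reached."""
    while isinstance(node, Expert):
        node = node.if_true if node.decide(sample) else node.if_false
    return node


# Assumed tree shape: first split GNU-style toolchains from the rest,
# then let one expert per branch pick the family.
gnu_branch = Expert(lambda s: s["gcc"] > 0.5, "GCC", "MinGW")
other_branch = Expert(lambda s: s["clang"] > 0.5, "Clang", "Visual Studio")
root = Expert(lambda s: s["gnu_like"] > 0.5, gnu_branch, other_branch)

print(classify(root, {"gnu_like": 0.9, "gcc": 0.2, "clang": 0.0}))
```

Each expert only has to separate two groups, so a hard multi-class problem is reduced to a chain of easier binary ones.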

Site

A subgraph of the forgetful CFG (fCFG), obtained when the fCFG is chopped in the chopping phase.
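
To make the idea of a site concrete, here is a minimal sketch that cuts a graph into connected pieces of at most α nodes. The notes do not describe the paper's actual chopping algorithm, so the breadth-first strategy, the `chop_into_sites` function, and the toy graph are all assumptions.

```python
from collections import deque
from typing import Dict, Hashable, List


def chop_into_sites(adjacency: Dict[Hashable, List[Hashable]],
                    alpha: int) -> List[List[Hashable]]:
    """Split a graph into connected node groups ("sites") of at most alpha nodes.

    Assumed strategy (not from the paper): grow each site by BFS from the
    first unassigned node and stop once alpha nodes have been claimed.
    """
    if alpha < 1:
        raise ValueError("alpha must be at least 1")
    unassigned = set(adjacency)
    sites = []
    for start in adjacency:              # dict order keeps the result deterministic
        if start not in unassigned:
            continue
        unassigned.discard(start)
        site, queue = [], deque([start])
        while queue:
            node = queue.popleft()
            site.append(node)
            for nxt in adjacency[node]:
                # Claim a successor only if the site still has room for it.
                if nxt in unassigned and len(site) + len(queue) < alpha:
                    unassigned.discard(nxt)
                    queue.append(nxt)
        sites.append(site)
    return sites


# Toy graph: node -> list of successor nodes.
toy_fcfg = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
print(chop_into_sites(toy_fcfg, alpha=3))
```

Every node ends up in exactly one site, and a larger α yields fewer but larger sites.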

Previous Work

Rosenblum et al. used a Support Vector Machine (SVM).
More recent works rely on neural networks (CNN, RNN).
Massarelli et al. extract the Control Flow Graph (CFG) of a binary and process each block with Natural Language Processing (NLP) techniques.

Difference

The paper works on program-level binaries instead of function-level binaries.

It uses a forgetful CFG (basically a simplified CFG).

It uses a Graph Neural Network (GNN) based solution: the Site Neural Network (SNN).

Method

Setup

23 different compiler versions (Clang, GCC, MinGW, Visual Studio)
4 optimization levels
92 different compiler configurations (23 versions × 4 optimization levels)
36,272 C/C++ source programs solving 91 problems from Codeforces
Clang (3.9.1, 4.0.1, 5.0.1, 6.0.0, 7.0.0, 8.0)
GCC (4.8.5, 5.5.0, 6.5.0, 7.5.0, 8.4.0, 9.3.0)
MinGW (3.4.5, 4.4.1, 4.7.1, 4.9.2, 5.11.0, 8.1.1)
Visual Studio (10.0, 12.0, 14.0, 2017, 2019)
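
As a quick sanity check, the configuration count can be recomputed from the version lists above. The variable names are illustrative only, and the optimization levels are counted rather than named because the notes do not list them.

```python
# Version lists copied from the setup above.
versions = {
    "Clang": ["3.9.1", "4.0.1", "5.0.1", "6.0.0", "7.0.0", "8.0"],
    "GCC": ["4.8.5", "5.5.0", "6.5.0", "7.5.0", "8.4.0", "9.3.0"],
    "MinGW": ["3.4.5", "4.4.1", "4.7.1", "4.9.2", "5.11.0", "8.1.1"],
    "Visual Studio": ["10.0", "12.0", "14.0", "2017", "2019"],
}
num_optimization_levels = 4

num_versions = sum(len(v) for v in versions.values())
num_configurations = num_versions * num_optimization_levels

print(num_versions, num_configurations)  # prints "23 92"
```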

Method

Binary Code Preprocessing: binary => CFG
Forgetful Phase: CFG => forgetful CFG
Chopping Phase: fCFG => Set of sites
Training and Testing: Set of sites => SNN model
Multiple SNNs (local experts) arranged in a hierarchy

Result

RQ1. How does our framework evolve when the site size α increases in terms of running time performance?

As α increases, the volume of data grows, so the running time per element also increases.

RQ2. How does our framework evolve when the site size α increases in terms of accuracy?

Accuracy does not necessarily increase as α increases.

RQ3. Does our framework have the capacity to predict the compiler and optimization level of binary codes?

Compiler family prediction: macro-average F1 score = 0.9950

Optimization level prediction: macro-average F1 score = 0.7549

RQ4. Does our framework have the capacity to predict the compiler version of binary codes?

Compiler version prediction: macro-average F1 score = 0.6475

Compiler version prediction (excluding Clang): macro-average F1 score = 0.8167
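
The results above are reported as macro-average F1 scores: the F1 score is computed separately for each class and the per-class scores are averaged with equal weight, so rare classes count as much as common ones. A minimal sketch of the metric follows; the `macro_f1` helper and the toy labels are made up for illustration and are not data from the paper.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores: List[float] = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: one Clang binary is mistaken for GCC.
y_true = ["GCC", "GCC", "Clang", "Clang", "VS", "VS"]
y_pred = ["GCC", "GCC", "Clang", "GCC", "VS", "VS"]
print(round(macro_f1(y_true, y_pred), 4))
```

Here the per-class F1 scores are 0.8 (GCC), 2/3 (Clang), and 1.0 (VS), so a single mistake on a small class lowers the macro average noticeably.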

Limitation

The dataset is composed of small programs.

Implementation
