Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection

nananana · September 26, 2022

To be fair, I am not great with neural networks, so this paper was difficult for me to understand.

Basic Idea?

Binary executables are everywhere, but they are hard to read. For obvious reasons, vendors do not ship source code alongside their binaries, and they often strip the symbols to make the executables harder to reverse engineer.

How should we handle these binaries? Vendor firmware often uses libraries with known vulnerabilities. In the age of IoT, the growing number of devices running potentially vulnerable firmware is a problem that needs to be addressed. The trouble is that these firmware images are built for different platforms, so it is infeasible for an expert to examine all of them by hand.

So what is the solution?

Binary code similarity detection points out candidate functions that are similar to a given vulnerable function. The problem lies in how that similarity is computed.

Problem?

Implementations at the time relied on graph matching algorithms, which are slow and potentially inaccurate. This was not the only approach, though: Feng et al. developed a method using embeddings that seemed promising to the team.

A binary function can be transformed into a control flow graph. Feng et al. extracted an Attributed Control Flow Graph (ACFG) with IDA Pro and, from the ACFG, generated an embedding, basically a vector. This embedding was then used to measure similarity.
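
To make this concrete, here is a minimal sketch of what an ACFG could look like in code. The attribute order roughly follows the block-level features described in Genius/Gemini (counts of string constants, numeric constants, transfer instructions, calls, instructions, and arithmetic instructions); the function name and the specific numbers are my own illustration, not from the paper.

```python
import networkx as nx

def example_acfg():
    """A toy ACFG: a directed CFG whose basic blocks each carry a
    fixed-length attribute vector 'x'. Attribute order (illustrative):
    [string consts, numeric consts, transfer instrs, calls,
     instructions, arithmetic instrs]."""
    g = nx.DiGraph()
    g.add_node(0, x=[2, 1, 0, 1, 12, 3])
    g.add_node(1, x=[0, 3, 1, 0, 7, 2])
    g.add_node(2, x=[1, 0, 1, 0, 5, 1])
    g.add_edge(0, 1)  # control-flow edges between basic blocks
    g.add_edge(0, 2)
    g.add_edge(1, 2)
    return g
```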

However, Genius, Feng et al.'s demonstration of this idea, had its limitations. First, computing certain attributes (connectivity-related features like betweenness) was too expensive. Furthermore, Genius still relied on a graph matching algorithm. Hence the team proposed a neural network-based method to bypass these limitations.

Neural Networks?

The neural network is used to transform the ACFG into an embedding. This is done with the graph embedding network by Dai et al., known as Structure2vec, which transforms a graph into a vector. Since the neural network itself captures the connectivity information of the graph, the ACFG no longer needs expensive attributes describing connections between nodes.

Genius had to compute this information explicitly, since connectivity is quite important when extracting features from a graph.
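
Here is a minimal sketch of the Structure2vec-style propagation, assuming the update rule described in the Gemini paper: each node's embedding is refreshed from its own attributes plus a nonlinear transform of the sum of its neighbors' embeddings, and the graph embedding aggregates the final node embeddings. The parameter names, shapes, and the two-layer sigma are my reading of the paper, not verified code.

```python
import numpy as np

def structure2vec_embed(g, W1, P1, P2, W2, T=5):
    """Sketch of Structure2vec-style propagation as used in Gemini.
    g:  networkx DiGraph whose nodes carry an attribute vector 'x'
    W1: (p, d) matrix mapping node attributes into embedding space
    P1, P2: (p, p) matrices forming the nonlinear aggregation sigma
    W2: (p, p) matrix for the final graph-level transform
    T:  number of propagation rounds
    """
    nodes = list(g.nodes())
    p = W1.shape[0]
    mu = {v: np.zeros(p) for v in nodes}  # node embeddings start at zero

    for _ in range(T):
        new_mu = {}
        for v in nodes:
            # Sum of neighbor embeddings; the CFG is treated as
            # undirected here by combining predecessors and successors.
            s = np.zeros(p)
            for u in g.predecessors(v):
                s = s + mu[u]
            for u in g.successors(v):
                s = s + mu[u]
            # sigma: fully connected layers with ReLU, per the paper.
            sigma = P1 @ np.maximum(P2 @ s, 0.0)
            x_v = np.asarray(g.nodes[v]["x"], dtype=float)
            new_mu[v] = np.tanh(W1 @ x_v + sigma)
        mu = new_mu

    # Graph embedding: linear transform of the summed node embeddings.
    return W2 @ sum(mu.values())
```

With, say, an embedding size of p = 64 and d = 6 attributes per block, W1 would be (64, 6) and the other matrices (64, 64); those sizes are just an example.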

However, Structure2vec was originally used as a classification model, not a similarity detection model. Hence, Dai et al.'s original work required labeled input data for training (since the aim was to classify).

To address this, the team proposed using a Siamese architecture. The graph embedding network is embedded within the architecture as two parameter-sharing copies, allowing two graph inputs and a similarity score output.
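
A minimal sketch of that wiring, assuming cosine similarity as the comparison function (which is what the paper uses); `structure2vec_embed` is the sketch from above.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def siamese_similarity(g1, g2, params):
    # Both branches use the SAME parameters (W1, P1, P2, W2) -- the
    # weight sharing is what makes the architecture "Siamese".
    e1 = structure2vec_embed(g1, *params)
    e2 = structure2vec_embed(g2, *params)
    return cosine_similarity(e1, e2)  # score in [-1, 1]
```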

Siamese Architecture

The use of the Siamese architecture is actually a very important aspect of this paper. Later works picked up this idea of combining a Siamese architecture with neural networks, indicating it was an interesting and promising approach.

Training

Training was done in two phases: pre-training and retraining. The reason is the lack of a large labeled dataset; the paper notes that training the Siamese architecture requires a lot of data.

In pre-training, OpenSSL functions were compiled at different optimization levels (and for different platforms). Functions compiled from the same source code were labeled similar, and functions from different source code dissimilar.

This pre-training lets the model learn general function characteristics. Afterwards, the model can be retrained to specialize in a specific task. A sketch of how the pairs and the training objective could look follows below.
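
This is a minimal sketch assuming the squared-error objective on cosine similarity described in the paper (target +1 for similar pairs, -1 for dissimilar); the pair-generation helper and its data layout are my own illustration, not the paper's code.

```python
import random

def make_pairs(functions):
    """functions: dict mapping a source-function name to the list of its
    ACFGs (the same function compiled at different optimization levels
    or for different platforms). Returns (g1, g2, label) triples with
    label in {+1, -1}. Assumes at least two distinct source functions."""
    names = list(functions)
    pairs = []
    for name, variants in functions.items():
        for i in range(len(variants) - 1):
            # Same source code -> similar (+1).
            pairs.append((variants[i], variants[i + 1], +1))
            # A variant of a different function -> dissimilar (-1).
            other = random.choice([n for n in names if n != name])
            pairs.append((variants[i], random.choice(functions[other]), -1))
    return pairs

def pair_loss(g1, g2, label, params):
    # Gemini's objective: (cos(e1, e2) - y)^2 with y in {+1, -1},
    # summed over all pairs and minimized by gradient descent.
    return (siamese_similarity(g1, g2, params) - label) ** 2
```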

Evaluation?

The proposed model, Gemini, demonstrated good performance. Results in the paper show that Gemini outperformed Genius and other binary similarity detection methods.

Unfortunately, the test dataset is unclear, and no table showing all the results is given in the paper. More specifically, the paper never explicitly mentions recall or F1 score; it only discusses accuracy and precision.

This is odd, since recall, precision, and F1 score are usually reported in studies on neural networks.
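
For reference, all three metrics come from the same confusion-matrix counts, so reporting precision while omitting recall and F1 is a notable choice. A quick sketch:

```python
def metrics(tp, fp, fn):
    precision = tp / (tp + fp)  # of the flagged matches, how many are real
    recall = tp / (tp + fn)     # of the real matches, how many are flagged
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```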

The professor did comment that the model seemed to have low recall when his lab tested it. After this paper, the Siamese architecture seems to have been adopted often, and other in-class presentations on later papers referenced Gemini.

I actually had a lot of trouble understanding this paper. First, I am not familiar with AI. Second, so little clear data was given: plenty of implementation and algorithm detail, but no tabulated data on the trials and datasets.
