SymLM: Predicting Function Names in Stripped Binaries via Context-Sensitive Execution-Aware Code Embeddings

Xin Jin; Kexin Pei; Jun Yeon Won; Zhiqiang Lin

doi:10.1145/3548606.3560612

Published August 31, 2023 | Version v1

Conference paper | Open


1. Ohio State University
2. Columbia University

Predicting function names in stripped binaries is an extremely useful but challenging task, as it requires summarizing the execution behavior and semantics of a function in human language. Recently, there has been significant progress in this direction with machine learning. However, existing approaches fail to model the exhaustive function behavior and thus suffer from poor generalizability to unseen binaries. To advance the state of the art, we present a function Symbol name prediction and binary Language Modeling (SymLM) framework, with a novel neural architecture that learns comprehensive function semantics by jointly modeling the execution behavior of the calling context and instructions via a novel fusing encoder. We have evaluated SymLM with 1,431,169 binary functions from 27 popular open source projects, compiled with 4 optimization levels (O0-O3) for 4 different architectures (i.e., x64, x86, ARM, and MIPS) and 4 obfuscations. SymLM outperforms the state-of-the-art function name prediction tools by up to 15.4%, 59.6%, and 35.0% in precision, recall, and F1 score, respectively, with significantly better generalizability and obfuscation resistance. Ablation studies also show that our design choices (e.g., fusing components of the calling context and execution behavior) substantially boost the performance of function name prediction. Finally, our case studies further demonstrate the practical use cases of SymLM in analyzing firmware images.

This repository contains the dataset used to train the SymLM model. It includes binaries compiled for different architectures (i.e., x64, x86, ARM, and MIPS) at different optimization levels (i.e., O0, O1, O2, O3).

Files (9.4 GB)

Name	MD5	Size
arm.zip	aa35d14da31d7a4e769dc8d128e9ce9e	606.9 MB
mips.zip	71524714dd00362617499490cc6e64c9	548.8 MB
x64.zip	c629bfbf804ad047724693bba7ccfc4c	4.4 GB
x86.zip	0804d80c2275bf7ea246065bb1a75d8a	3.8 GB
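
Because the archives are large, it can be worth checking each download against the MD5 digests in the listing above before unpacking. The following is a minimal sketch, not part of the dataset itself: it assumes the four archives were saved under their listed names in a single directory, and the helper names `md5_of_file` and `verify_downloads` are hypothetical.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the file listing above.
EXPECTED_MD5 = {
    "arm.zip": "aa35d14da31d7a4e769dc8d128e9ce9e",
    "mips.zip": "71524714dd00362617499490cc6e64c9",
    "x64.zip": "c629bfbf804ad047724693bba7ccfc4c",
    "x86.zip": "0804d80c2275bf7ea246065bb1a75d8a",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives are never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each archive name to True if it exists in `directory` (an assumed
    download location) and its digest matches the listing, else False."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify_downloads().items():
        print(f"{name}: {'OK' if ok else 'MISSING OR MISMATCH'}")
```

A file reported as missing or mismatched should be downloaded again before use. MD5 here only guards against corrupted or truncated transfers, not against tampering.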

