TL;DR: Statically linked binaries can carry vulnerabilities when the libraries embedded in them are not updated to their latest versions. Embedding libraries in the binary also reduces its dependency on the environment in which it runs. Both properties make identifying the libraries linked into a binary, malware included, essential for effective analysis. To help in this process, we present MANTILLA, a tool designed to identify runtime libraries in statically linked Linux binaries using static analysis and machine learning. MANTILLA extracts architecture-independent features from the binaries and uses a K-Nearest Neighbors (KNN) model to determine which runtime library is linked. In our talk at NoConName 2024 (on Nov 19, 2024), we will share deeper insights into how MANTILLA works, its architecture-agnostic features, and how it can be used for both malware analysis and vulnerability detection. Our evaluation shows high accuracy across multiple architectures. If this post leaves you wanting more and you want to delve deeper into our research, we recommend reading our recently published scientific article (here).
For an attacker, a vulnerable and unpatched application is an irresistible target. Vulnerabilities often persist in applications due to outdated third-party dependencies, especially when binaries are statically linked. Static linking makes binaries self-contained and portable, but it also complicates updating the embedded libraries and makes reverse engineering more challenging. Interestingly, these properties are precisely why malware authors prefer static linking: it ensures compatibility across target platforms and adds complexity to analysis efforts.
To help address these challenges, we developed MANTILLA, a tool to identify runtime libraries in statically linked Linux binaries. This identification helps filter out library functions, allowing analysts to focus on the core behavior of the malware and detect vulnerabilities in outdated libraries.
What is MANTILLA?
MANTILLA, a system for runtiMe librAries ideNtification in sTatIcally-Linked Linux binAries, is specially designed to automatically identify runtime libraries within a binary using static analysis and KNN classification. Figure 1 shows a high-level overview of MANTILLA.
It relies on the radare2 reverse engineering framework to extract a variety of features that are independent of the binary’s architecture, such as cyclomatic complexity, instruction count, and entropy.
The system then uses these features to classify the binary through a supervised machine learning model. Specifically, MANTILLA uses K-Nearest Neighbors (KNN) to predict the runtime library linked in the binary, with final decisions made using a majority voting system across all functions in the binary.
How does it work?
MANTILLA operates in two phases:
- Feature extraction: We extract features from each function in a given binary. These features include metrics such as cyclomatic complexity, number of basic blocks, function size, entropy, and more. Importantly, the features are chosen to be architecture-independent, allowing MANTILLA to work on different CPU architectures.
- Prediction: Using the extracted features, we apply a KNN model to predict the runtime library for each function. A majority voting mechanism is used to determine the final prediction for the entire binary, ensuring robust classification even when individual functions may have ambiguous results.
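To make the feature-extraction phase more concrete, here is a minimal Python sketch of how two of the metrics mentioned above, entropy and cyclomatic complexity, can be computed for a single function. It is an illustration, not MANTILLA's actual code: the function names, the inputs (raw function bytes plus block, edge, and instruction counts that a disassembler such as radare2 would provide), and the ordering of the feature vector are all assumptions.

```python
import math
from collections import Counter
from typing import List


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def cyclomatic_complexity(num_edges: int, num_blocks: int) -> int:
    """McCabe complexity of one function's control-flow graph: E - N + 2."""
    return num_edges - num_blocks + 2


def feature_vector(func_bytes: bytes, num_blocks: int, num_edges: int,
                   num_instructions: int) -> List[float]:
    """Hypothetical feature layout; the real feature set may differ."""
    return [
        float(cyclomatic_complexity(num_edges, num_blocks)),
        float(num_blocks),
        float(num_instructions),
        float(len(func_bytes)),  # function size in bytes
        shannon_entropy(func_bytes),
    ]
```

None of these values depends on a particular instruction encoding being decoded in a particular way, which is the property that lets the same feature layout be reused across CPU architectures.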
Evaluation and results
We evaluated MANTILLA on a dataset of binaries built for different architectures: MIPSeb, ARMel, Intel x86, and Intel x86-64. These binaries were linked with different runtime libraries: uClibc, glibc, and musl. Additionally, we tested MANTILLA on real-world binaries, including IoT malware samples. In all tests, MANTILLA achieved very high accuracy, with results of over 95% in runtime library identification and almost 100% in architecture identification.
We also evaluated the performance of MANTILLA using K-fold cross-validation on this dataset of statically linked binaries, after stripping their symbols. Specifically, we examined how well MANTILLA can identify runtime libraries in binaries compiled for different architectures and linked with different libraries. In particular, we focused on the KNN classification model, tuning key parameters to optimize performance.
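For readers unfamiliar with K-fold cross-validation, the sketch below shows the general idea in plain Python: the samples are split into folds, and each fold is held out once for testing while the rest are used for training. The helper name `k_fold_indices` is hypothetical, and the number of folds and how they were formed in our experiments are not covered in this post, so treat this as a generic illustration only.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size
```

In practice the samples would usually be shuffled first, and when each sample is a function rather than a whole binary, it is safer to assign all functions of one binary to the same fold so that the test fold contains no code seen during training.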
We first computed distances to the K nearest neighbors using the Euclidean distance metric. To fine-tune the model, we set a distance threshold (d) and tried various settings for the number of neighbors K. We tried values of K in {1, …, 5} and d in {1, …, 7} to explore trade-offs between the number of neighbors and the threshold distance. The results, shown in Figure 2, reveal a clear trend: the system performs better when more neighbors are considered and a lower distance threshold is applied. As the threshold is increased, the model starts classifying unrelated features as part of the same runtime library, negatively impacting overall performance.
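The interplay between K and d can be sketched in a few lines of Python. This is a simplified illustration that assumes the threshold acts as a cutoff discarding any neighbor farther than d from the query; the function name `knn_predict` and its parameters are hypothetical and do not correspond to MANTILLA's code.

```python
import math
from collections import Counter
from typing import List, Optional, Sequence


def knn_predict(query: Sequence[float],
                train_points: List[Sequence[float]],
                train_labels: List[str],
                k: int,
                max_dist: float) -> Optional[str]:
    """Label a feature vector from its k nearest neighbors within max_dist."""
    # Euclidean distance from the query to every training vector.
    ranked = sorted(
        (math.dist(query, point), label)
        for point, label in zip(train_points, train_labels)
    )
    # Keep at most k neighbors, and only those inside the threshold.
    votes = [label for dist, label in ranked[:k] if dist <= max_dist]
    if not votes:
        return None  # no neighbor is close enough to decide
    return Counter(votes).most_common(1)[0][0]
```

With a small `max_dist`, distant training vectors never vote, so a function that resembles nothing in the training set is left unlabeled instead of being forced into a class. Raising `max_dist` lets those distant vectors vote, which matches the degradation described above.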
Based on these findings, we conclude that a configuration with K > 1 and a low distance threshold provides the best results. In particular, the optimal configuration achieved a 100% hit rate, thanks to the majority voting rule, and the best performance was observed for K = 5. This configuration maintained high accuracy even with a more relaxed threshold value for the distance metric, ensuring that MANTILLA could consistently predict the correct runtime library across all test cases.
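The majority voting rule itself is simple to express. The following Python sketch assumes that each function in a binary has already received a per-function label (or `None` when no decision could be made); the function name `vote_binary_label` is hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def vote_binary_label(function_predictions: Iterable[Optional[str]]) -> Optional[str]:
    """Return the most frequent per-function label, ignoring undecided ones."""
    votes = Counter(p for p in function_predictions if p is not None)
    if not votes:
        return None  # no function produced a usable prediction
    return votes.most_common(1)[0][0]


# Example: a few misclassified functions do not change the final answer.
print(vote_binary_label(["uClibc", "uClibc", "glibc", None, "uClibc"]))
```

Because the final label depends on the whole set of functions, a handful of ambiguous or wrong per-function predictions is outvoted, which helps explain how a per-binary hit rate can be higher than the per-function accuracy.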
Furthermore, we applied MANTILLA to a dataset containing thousands of Linux-based IoT malware samples. The results showed that the majority of these malware binaries were linked against uClibc, a lightweight C library often used in embedded systems. MANTILLA's analysis thus helped confirm runtime library usage trends across the IoT malware landscape. More experimental results and discussion are given in our paper.
Real-World Impact
MANTILLA can help malware and forensic analysts understand the libraries used within a binary, filter out library functions, and focus efforts on analyzing malware-specific code. This is especially useful in malware reverse engineering, where distinguishing between benign and malicious functionality can be extremely difficult due to static linking.
Additionally, identifying the runtime library is critical for detecting vulnerabilities. If a binary contains an outdated version of a library, it may be susceptible to known attacks. By identifying the specific runtime library, MANTILLA helps assess whether a statically linked binary is at risk.
Future Work
Although MANTILLA provides good accuracy, further improvements are possible. We plan to extend the system to identify runtime libraries on other operating systems and support additional architectures, such as PowerPC or SPARC. Additionally, we will explore the possibility of providing MANTILLA as a software-as-a-service.
And that’s all, guys & gals! In this blog post, we have summarized our paper on MANTILLA, a system for identifying runtime libraries in statically linked Linux binaries. The paper provides a detailed overview of the system’s features, methodology, and evaluation. We hope that MANTILLA can serve as a useful tool for researchers and analysts working in the fields of binary analysis and malware forensics. Feel free to explore the tool, check out the source code on GitHub, and contribute to the ongoing efforts to improve static binary analysis. Thanks for reading!
Declaration of Generative AI Technologies in the Writing Process
During the preparation of this post, the author used ChatGPT (GPT-4o model) to improve readability and language. After using this tool, the author reviewed and edited the content as necessary and takes full responsibility for the content of this publication.