GTC Disassembling and analyzing malware to see how it works, what it’s designed to do, and how to protect against it is usually a time-consuming, manual task that requires a solid understanding of assembly code and programming, the techniques and exploits used by criminals, and other skills that are hard to find.
With the rise of deep learning and other AI research, IT security specialists are investigating ways to use machine learning to increase the speed, efficiency and automation of this process. These automated systems have to contend with devilishly obfuscated malicious code designed to evade detection. One of the primary goals is to allow AI systems to perform more routine tasks, freeing up reverse engineers to focus on more important tasks.
Mandiant is one company that sees where neural networks and related technologies can change the way malware is broken down and analyzed. This week at Nvidia’s GTC 2022 event, Sunil Vasisht, a data scientist at the infosec firm, presented one such initiative: a neural machine translation (NMT) model capable of annotating functions.
This prediction model, we understand, can take decompiled code – machine language instructions converted back into corresponding high-level language code – and use it to suggest appropriate descriptive names for each of the building blocks. This is useful when function or symbol names have been stripped from a binary or obfuscated, and is an alternative to signature-based tools, such as IDA FLIRT.
If you’re reverse-engineering a sample, you can skip functions that, for example, ask the operating system to handle a printf() call, and jump directly to functions identified as performing encryption or escalating privileges. You can skip a block labeled by the model as tolower() and move on to one labeled inject_into_process(). That way you avoid wasting time on dead ends or inconsequential functions.
Specifically, the model works by predicting function name keywords (e.g., ‘get’, ‘registry’, ‘value’) from Abstract Syntax Tree (AST) tokens taken from decompiled executable files. The model was shown to be able to label a function as ‘des’, ‘encrypt’, ‘openssl’, ‘i386’, ‘libeay32’, while an analyst involved in the experiment could only suggest encode(). Mandiant also built a second NMT that made predictions from control flow graphs and API calls.
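To make the idea of keyword labels concrete, here is a minimal sketch of how a known function name might be broken into lowercase keywords to serve as a training target. The splitting rules and the helper name `name_to_keywords` are our own assumptions for illustration; the talk did not describe Mandiant's actual tokenization.

```python
import re

def name_to_keywords(name: str) -> list[str]:
    """Split a function name into lowercase keyword sub-tokens.

    Hypothetical helper: handles snake_case, camelCase and leading
    acronyms. Not Mandiant's tokenizer, just an illustration.
    """
    keywords = []
    # First break on underscores and any other non-alphanumeric characters.
    for part in re.split(r"[^A-Za-z0-9]+", name):
        # Then split on case boundaries, keeping acronyms and digit runs whole.
        keywords.extend(
            re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", part)
        )
    return [k.lower() for k in keywords]

print(name_to_keywords("GetRegistryValue"))
print(name_to_keywords("inject_into_process"))
print(name_to_keywords("DESEncrypt"))
```

A sequence model can then be trained to emit these keywords one at a time, which is why a prediction can come out as a list of terms rather than a single identifier.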
Vasisht described the typical methods used to reverse engineer malware and the myriad challenges that come with it, including the techniques malware creators use to make their code harder for threat hunters to find and disassemble. This creates what becomes an untenable situation.
“Reversing is extremely difficult work and devoting more analyst hours to the problem is not sustainable,” he said during his presentation.
By automating function annotations, Mandiant aims to address the major challenges faced by most reverse engineers when analyzing modern malware. The vendor, acquired by Google for $5.4 billion, wants to step up reporting on malware features and capabilities, reduce the challenges facing its analysts and make reversing more efficient. In other words, it wants to make it easier to get to the core of tricky malicious code. We imagine this could also be useful for comparing malware strains.
“We hope to tackle the easy cases so analysts can spend their valuable time on bigger cases,” Vasisht said. “At Mandiant, these are the challenges we set out to address with a unified machine learning approach. Our problem statement is: how can we increase the coverage of function names in binary disassembly in order to speed up triaging malware?”
Malware analysts use a number of techniques that fall under static and dynamic analysis: the former involves studying the executable code, the latter executing it and observing its behavior. There are tools like IDA Pro, Binary Ninja, Ghidra, debuggers, emulators and hypervisors to help out. Even so, decompiled and disassembled functions can be difficult to follow, requiring reverse engineers to spend hours figuring out what a section of code does, and many samples are far too large for full analysis. The code can also be encrypted, which makes static analysis cumbersome.
Additionally, malware can be written to self-terminate or act harmlessly if it detects that it is running under dynamic analysis. “Malware can detect when it’s running inside a virtual machine and mask its true behavior. It might be able to check the operating system or even check the CPU temperature and determine whether to run or just hide,” he said.
Vasisht detailed two ways to turn binary code into inputs for a predictive NMT model. One is to use code2seq, which breaks down source code and decompiled code into an AST of representative tokens. The other is Nero, which describes the control flow graph (CFG) of the code.
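For readers unfamiliar with the AST approach, the sketch below shows the general idea of code2seq-style path contexts: pairs of leaf tokens joined by the chain of syntax node types between them. It is an illustration under our own assumptions, using Python's standard `ast` module on Python source as a stand-in for decompiler output; the function names and the `max_contexts` cap are hypothetical and not taken from code2seq or Mandiant's pipeline.

```python
import ast
from itertools import combinations

def terminals(node, ancestors=()):
    """Yield (token, root-to-leaf node chain) for each Name or Constant leaf."""
    chain = ancestors + (node,)
    if isinstance(node, ast.Name):
        yield node.id, chain
    elif isinstance(node, ast.Constant):
        yield repr(node.value), chain
    else:
        for child in ast.iter_child_nodes(node):
            yield from terminals(child, chain)

def path_contexts(source, max_contexts=200):
    """Return (left token, node-type path, right token) triples."""
    leaves = list(terminals(ast.parse(source)))
    contexts = []
    for (tok_a, chain_a), (tok_b, chain_b) in combinations(leaves, 2):
        # Count the ancestors both leaves share, starting from the root.
        shared = 0
        for x, y in zip(chain_a, chain_b):
            if x is not y:
                break
            shared += 1
        # Walk up from the left leaf to the lowest common ancestor,
        # then down to the right leaf, recording node type names.
        up = [type(n).__name__ for n in reversed(chain_a[shared:])]
        top = [type(chain_a[shared - 1]).__name__]
        down = [type(n).__name__ for n in chain_b[shared:]]
        contexts.append((tok_a, tuple(up + top + down), tok_b))
    return contexts[:max_contexts]

for context in path_contexts("x = a + 1"):
    print(context)
```

Each triple becomes one input item for the model. The number of triples grows quickly with function size, which is why a cap on contexts per function is a natural tuning knob.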
Mandiant engineers looked at both approaches to create their function-naming models, he said. As described above, one focused on ASTs and the other on CFGs.
“Drawing inspiration from code2seq and Nero-like architectures, we sought to see if we could apply these techniques to malware disassembly using AST and CFG representations to predict meaningful function names, and in the process hope to reduce the effort surrounding a tedious reverse engineering workflow,” Vasisht said.
The engineers used a Linux server with 48 CPU cores, 500GB of system RAM, and eight Nvidia Tesla M40 GPUs with 24GB of memory. The platform was used to run multiple hyperparameter searches simultaneously (covering settings from the maximum number of AST contexts to the maximum number of sub-tokens in output labels) and to train the final model, he said. They used an input dataset of over 360,000 disassembled functions and annotations pulled from 4,000 malicious Windows PE files, some generated automatically by IDA’s FLIRT and others drawn from a decade of annotations handwritten by Mandiant’s reverse engineers.
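As a rough picture of what running several searches at once across eight GPUs can look like, the sketch below enumerates a small grid over the two settings named in the talk and spreads the trials across GPUs round-robin. The parameter names, candidate values and scheduling scheme are all hypothetical; the talk did not say which values or search tooling Mandiant used.

```python
from itertools import product

# Hypothetical search space: the two knobs come from the talk,
# the candidate values are invented for illustration.
SEARCH_SPACE = {
    "max_ast_contexts": [100, 200, 400],
    "max_label_subtokens": [4, 6, 8],
}
NUM_GPUS = 8  # the server described above had eight GPUs

def build_trials(space):
    """Expand the search space into one config dict per combination."""
    keys = list(space)
    return [dict(zip(keys, values))
            for values in product(*(space[k] for k in keys))]

def assign_round_robin(trials, num_gpus):
    """Map each GPU index to the list of trials it should run."""
    schedule = {gpu: [] for gpu in range(num_gpus)}
    for i, trial in enumerate(trials):
        schedule[i % num_gpus].append(trial)
    return schedule

trials = build_trials(SEARCH_SPACE)
schedule = assign_round_robin(trials, NUM_GPUS)
for gpu, assigned in schedule.items():
    print(gpu, assigned)
```

In practice each entry would launch a training run pinned to that GPU, with the best configuration then used to train the final model.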
Mandiant’s automated and scalable analysis pipeline showed improvements over code2seq and Nero models, he said. The company must now think about how it will deploy the model.
“This includes using these model predictions with IDA Pro and [the NSA’s open-source] Ghidra plug-ins,” Vasisht said. “We also plan to deploy this model in the pipeline of malware analysts. Additionally, it will allow us to gather feedback on the predictions, as well as gather new annotations so that we can iterate and improve this model in the future.”
Future work includes improving labeling and data quality; using a combined AST and CFG model; and using different mixes of binaries to train the model, he said. ®