Authors:
(1) Maria Rigaki, Faculty of Electrical Engineering, Czech Technical University in Prague, Czech Republic and maria.rigaki@fel.cvut.cz;
(2) Sebastian Garcia, Faculty of Electrical Engineering, Czech Technical University in Prague, Czech Republic and sebastian.garcia@agents.fel.cvut.cz.
5 Experiments Setup
To evaluate MEME, several experiments were conducted in different configurations. First, four different malware detection solutions were selected as targets to evade. Second, MEME was compared with four other evasion techniques against these four targets.
5.1 Targets
The targets were selected to include three highly cited malware detection models together with a real implementation of a popular free antivirus solution.
1. Ember. A LightGBM [20] model released as part of the Ember dataset [3], which was also used to train it. The decision threshold was set to 0.8336, which corresponds to a 1% false positive rate (FPR) on the Ember 2018 test set (see the query sketch after this list).
2. Sorel-LGB. A LightGBM model distributed as part of the Sorel-20M [15] dataset, which was also used to train it. The decision threshold was set to 0.5, which corresponds to a 0.2% false positive rate (FPR) on the Sorel-20M test set.
3. Sorel-FFNN. A feed-forward neural network (FFNN) also released as part of the Sorel-20M dataset and trained using the same data. The decision threshold was set to 0.5, which corresponds to a 0.6% false positive rate (FPR) on the Sorel-20M test set.
4. Microsoft Defender. An antivirus product that comes pre-installed with the Windows operating system. According to [36], it is the most used free antivirus product for personal computers. All tests were performed using a virtual machine (VM) running an updated version of the product. The VM had no internet connectivity during the binary file scanning.
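To make the decision rule concrete, the following is a minimal sketch of how a model-based target such as Ember can be queried, assuming the released LightGBM model file and a pre-extracted 2,381-dimensional feature vector. The file name and the is_detected helper are illustrative assumptions, not the authors' code.

```python
import lightgbm as lgb
import numpy as np

EMBER_THRESHOLD = 0.8336  # corresponds to a 1% FPR on the Ember 2018 test set

# Hypothetical path to the LightGBM model file released with the Ember dataset.
booster = lgb.Booster(model_file="ember_model_2018.txt")

def is_detected(features: np.ndarray) -> bool:
    """Return True if the target flags the 2,381-dim feature vector as malicious."""
    score = booster.predict(features.reshape(1, -1))[0]
    return score >= EMBER_THRESHOLD
```

The Sorel-LGB target follows the same pattern with its own model file and a 0.5 threshold; the Defender target is instead queried by scanning the modified binary inside the VM.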
5.2 Datasets
Our experiments required the use of the following datasets:
1. Ember 2018. A dataset that consists of features extracted from one million Windows Portable Executable (PE) files [3]. The dataset is split into training, testing, and "unlabeled" sets. The training set consists of 300,000 clean samples, 300,000 malicious samples, and 200,000 "unlabeled" samples. The so-called unlabeled part of the dataset was truly unlabeled in the first version of the dataset; however, in the 2018 release, the authors provided an avclass label for all malicious samples, including those in the unlabeled set. Each sample has 2,381 static features related to byte and entropy histograms, PE header information, strings, imports, data directories, etc.
2. Sorel-20M. The Sorel dataset [15] was released in 2020 and contains the extracted features of 20 million binary files (malicious and benign). The feature set used was the same as the one from the Ember dataset.
3. Malware Binary Files. In addition to the Ember features used for training the surrogate, we also obtained 1,000 malicious binary files whose hashes were part of Ember 2018; these were used to generate the evasive malware with all the methods.
4. Benign Binary Files. All methods require a set of benign files from which they extract benign strings, sections, and other elements used for the binary modifications. The same set of 100 benign binaries was used in all experiments. The files were obtained from a Windows 10 virtual machine after installing known benign software.
MEME created two versions of the D_aux dataset to train the surrogate models. For the Ember and AV surrogates, D_aux contains the unlabeled part of the Ember dataset. For the Sorel surrogates, D_aux contains 200,000 samples from the Sorel-20M validation set. These datasets were chosen to create the surrogates because they were not used to train the corresponding targets. During the evaluation, a subset of the Sorel-20M test set was used to evaluate the performance of the Sorel surrogate models, and the Ember test set was used to evaluate the Ember and AV surrogates. The 1,000 malware binaries were split into training and test sets with a 70-30% ratio using five different seeds (see the sketch below). The test sets were used to test all the methods, while the 700 binaries in the training set were used to train each of the RL policies for PPO and MEME (these were the binaries to which the modifier actions were applied).
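The splitting procedure can be sketched as follows. The use of scikit-learn's train_test_split and the specific seed values are assumptions for illustration; the paper only states that five different seeds were used.

```python
from sklearn.model_selection import train_test_split

# Five seeds are used in the paper; the exact values are an assumption.
SEEDS = [0, 1, 2, 3, 4]

def make_splits(binary_paths):
    """Yield one (seed, train, test) triple per seed: 700 train / 300 test."""
    for seed in SEEDS:
        train, test = train_test_split(
            binary_paths, test_size=0.3, random_state=seed, shuffle=True
        )
        yield seed, train, test
```

Because the split is keyed on the seed, every method sees exactly the same 300 test binaries for a given seed, which is what makes the cross-method comparison fair.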
5.3 Adversarial Malware Generation Comparison
MEME is compared with four algorithms in total: two baseline reinforcement learning algorithms that use the Malware-Gym environment (a random agent and an agent that learns a policy using the vanilla Proximal Policy Optimization (PPO) algorithm [35]) and two state-of-the-art (SOTA) algorithms, MAB [38] and GAMMA [9]. The two SOTA algorithms were selected because they were released relatively recently, they perform well in the malware evasion task, and their source code is available. The detailed setup used for each of the algorithms, as well as any modifications, is presented below:
1. Random Agent. The random agent is the simplest baseline used in our experiments. It uses the Malware-Gym environment and randomly samples the next modification action from the available action space. The agent is evaluated in the test environments using the 300 test malware samples.
2. PPO. An agent that uses the PPO [35] algorithm as implemented in the Stable-Baselines3 software package. The agent was trained for 2,048 steps on the malware training set and evaluated on the malware test set. To select the hyper-parameters related to PPO training, we used the Tree-structured Parzen Estimator (TPE) method [4] as implemented in the Optuna software package [1] (see the tuning sketch after this list). The TPE algorithm was executed with the Ember dataset as the target, but the resulting settings performed well on the other targets. The tuned hyper-parameters were γ, the learning rate, the maximum gradient norm, the activation function, and the neural network size for the actor and critic models. The search space of each parameter and the final values are presented in the Appendix.
3. MAB. A reinforcement learning algorithm that treats evasive malware generation as a multi-armed bandit problem. It operates in two stages: evasion and minimization. It samples the action space, which includes generic actions (similar to Malware-Gym) and any successful evasive actions along with their specific modifiers, e.g., appending a specific benign section. MAB directly manipulates each binary without generating a learned policy; therefore, all experiments were conducted directly on the malicious binary test set.
4. GAMMA. An algorithm that injects benign binary sections into malicious PE files while preserving their functionality. It modifies features such as the section count, byte histograms, and strings, leaving features related to, e.g., certificates and debugging data unaffected. Using genetic algorithms, GAMMA searches for the optimal benign sections to inject, reducing the target model's confidence while minimizing the content and location of the injected sections. The attack is implemented in the SecML library. Although effective, it has a significantly longer runtime than the other tested methods. The attack uses a restricted set of 30 available benign sections and a population size of 20. The λ parameter, which affects the injected data size, was set to 10⁻⁶. GAMMA operates directly on each binary and does not generate a learned policy; hence, all experiments were conducted on the malicious binary test set.
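The PPO hyper-parameter tuning mentioned in item 2 can be sketched as follows. The search-space bounds and trial count are illustrative (the actual ranges are in the Appendix), and make_malware_gym_env and validation_evasion_rate are hypothetical helpers standing in for the environment factory and the validation metric.

```python
import optuna
from stable_baselines3 import PPO

def objective(trial: optuna.Trial) -> float:
    # Illustrative search space; the paper's exact ranges are in the Appendix.
    gamma = trial.suggest_float("gamma", 0.8, 0.999)
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
    max_grad_norm = trial.suggest_float("max_grad_norm", 0.3, 5.0)

    env = make_malware_gym_env()  # hypothetical factory wrapping Malware-Gym
    model = PPO("MlpPolicy", env, gamma=gamma,
                learning_rate=learning_rate, max_grad_norm=max_grad_norm)
    model.learn(total_timesteps=2048)  # training budget used in the paper
    return validation_evasion_rate(model)  # hypothetical validation metric

# TPE is Optuna's default sampler; it is made explicit here for clarity.
study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=50)  # trial count is an assumption
```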
5.4 MEME Experimental Setup
The initial training steps n in Algorithm 1 were set to 1,024, and the total number of loops k was two. For evaluation, the test set of 300 malware binaries was used. The surrogate training steps m were set to 2,048 (step 6 of Algorithm 1). MEME uses PPO for training and updating the policy π_θ. The PPO settings remained the same as in the baseline experiments, enabling a comparison of the impact of using a surrogate model for additional PPO training. In total, there were 2,048 queries to the target and 4,096 training steps using the surrogate environment. The surrogate was always a LightGBM model. Surrogate training involved two datasets: D_aux, drawn from an external dataset (e.g., Ember 2018 or Sorel-20M), and D_sur, generated during lines 3 and 7 of Algorithm 1. These datasets were mixed with a ratio α, a hyper-parameter tuned for the LGB surrogate (see the sketch below). Other hyper-parameters, such as the number of boosting trees, the learning rate, the tree depth, the minimum child samples, and the feature fraction, were also tuned separately for each target using TPE and Optuna. Appendix A provides the detailed search space and the selected values. Surrogate models were evaluated using the respective target test sets, and a decision threshold matching the target's FPR level was calculated. For the AV target, which has no representative dataset, the surrogate's decision threshold was set to 0.5.
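As a minimal sketch of the mixing step, assuming α denotes the fraction of D_aux samples retained in the mix (the paper describes α only as a mixing ratio), the LightGBM surrogate could be fit as follows. Array names and the sampling scheme are assumptions.

```python
import numpy as np
import lightgbm as lgb

def train_surrogate(X_aux, y_aux, X_sur, y_sur, alpha, lgb_params):
    """Fit a LightGBM surrogate on an alpha-weighted mix of D_aux and D_sur."""
    # Interpreting alpha as the sampled fraction of D_aux is an assumption.
    n_aux = int(alpha * len(X_aux))
    idx = np.random.choice(len(X_aux), size=n_aux, replace=False)
    X = np.vstack([X_aux[idx], X_sur])
    y = np.concatenate([y_aux[idx], y_sur])
    return lgb.train(lgb_params, lgb.Dataset(X, label=y))
```

Here X_sur and y_sur would hold the feature vectors queried from the target and the target's hard labels, so the surrogate gradually aligns with the target's decision boundary as the loop in Algorithm 1 progresses.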
5.5 General Experiment Settings
All the algorithms were tested under a common set of constraints. The maximum number of allowed modifications to a binary file was set to 15 for the RL-based algorithms. Similarly, for GAMMA, the number of iterations was set to 15, and for MAB, the number of "pulls" was also 15. The second constraint was a maximum running time of 4 hours for all experiments. For MAB and GAMMA, this means that the algorithms must handle as many malicious binaries as possible in that time, while for PPO and MEME, this time included both the policy training time and the evaluation time.
For PPO and MEME, we set the query budget to 2,048. This budget does not include the final evaluation queries on the test set. MAB and GAMMA were not constrained in the total number of queries because query budgets are not supported by their respective frameworks. Finally, all experiments were run with five different seeds. The random seeds controlled the split of the 1,000 malicious binaries into train and test sets; therefore, all methods were tested on the same files (see the evaluation sketch below).
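For the RL-based methods, the evaluation under the 15-modification constraint can be sketched with a Gym-style loop. The env.reset(binary) signature and the info["evaded"] flag are assumptions about the Malware-Gym interface, not its documented API; agent.predict follows the Stable-Baselines3 convention.

```python
MAX_MODS = 15  # common constraint for the RL-based methods

def evaluate_evasion_rate(agent, env, test_binaries):
    """Fraction of test binaries that evade the target within MAX_MODS steps."""
    evaded = 0
    for binary in test_binaries:
        obs = env.reset(binary)  # assumed: load the binary into the environment
        for _ in range(MAX_MODS):
            action, _ = agent.predict(obs)
            obs, reward, done, info = env.step(action)
            if done and info.get("evaded"):  # assumed evasion flag
                evaded += 1
                break
    return evaded / len(test_binaries)
```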
This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.