TL;DR: Detecting Algorithmically Generated Domains (AGDs) is essential for stopping malware that uses Domain Generation Algorithms (DGAs) for command-and-control (C2) resilience. However, the research field is fragmented: each proposed model uses different datasets, metrics, and configurations, making fair comparisons difficult. To address this problem, we present RAMPAGE, a reproducible framework for evaluating AGD detectors. Developed in Python and based on Keras, RAMPAGE facilitates the evaluation and comparison of AGD classifiers under consistent, real-world conditions. Our framework includes benchmark datasets, standard metrics, and even a meta-classifier that outperforms many state-of-the-art models.
The Problem: Apples to Oranges in AGD Detection
In the world of malware detection and analysis, AGD detection is a very active area of research. However, there is a major problem: no two works use the same setup. Each author uses different datasets, preprocessing methods, metrics, and validation strategies.
This lack of standardization leads to two major problems:
- Poor reproducibility: Researchers cannot easily replicate or verify each other’s work.
- Misleading performance claims: A model may appear state-of-the-art only because it was tested on an easier, less representative dataset.
We created RAMPAGE to fix this problem.
What is RAMPAGE?
RAMPAGE (fRAMework to comPAre aGd dEtectors) is an open-source Python framework that provides:
- Standardized training and testing processes
- Pre-processed reference datasets
- Evaluation under realistic conditions
- A modular, plug-and-play architecture for your own models
It allows researchers and practitioners to compare AGD classifiers fairly, under the same conditions. RAMPAGE includes reference implementations of seven popular deep learning models, as well as our own meta-classifier, which combines their predictions using logistic regression.
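To give a flavour of the kind of base model RAMPAGE wraps, here is a minimal character-level classifier in Keras. The encoding scheme, layer sizes, and helper names (`encode`, `build_agd_classifier`) are our own illustrative assumptions, not RAMPAGE's actual API.

```python
# A minimal, illustrative sketch of a character-level AGD classifier in Keras.
# Layer sizes, encoding scheme, and function names are assumptions for
# illustration; they are not RAMPAGE's actual API.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

MAX_LEN = 63                                    # maximum number of characters kept per domain
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-._"

def encode(domain: str) -> np.ndarray:
    """Map each character to an integer id (0 = padding/unknown) and pad to MAX_LEN."""
    ids = [ALPHABET.find(c) + 1 for c in domain.lower()[:MAX_LEN]]
    return np.array(ids + [0] * (MAX_LEN - len(ids)))

def build_agd_classifier() -> keras.Model:
    """Character-level LSTM that outputs P(domain is algorithmically generated)."""
    model = keras.Sequential([
        layers.Embedding(input_dim=len(ALPHABET) + 1, output_dim=32),
        layers.LSTM(64),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Usage: X = np.stack([encode(d) for d in domains]); build_agd_classifier().fit(X, y)
```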
How does it work?
RAMPAGE supports two workflows:
Single Model Evaluation
You can train and test a model on multiple reference datasets using predefined or custom parameters. Results include standard metrics such as precision, recall, F1-score, and ROC-AUC, and you can plug in your own user-defined metrics.
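As a rough illustration of what that evaluation step computes, the snippet below uses scikit-learn for the standard metrics and adds a hypothetical `false_positive_rate` helper as an example of a user-defined metric; the function names are ours, not the framework's.

```python
# Sketch of a per-model evaluation step: standard metrics plus one
# user-defined metric. Names are illustrative, not RAMPAGE's API.
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

def false_positive_rate(y_true, y_pred):
    """Example user-defined metric: share of benign domains flagged as AGDs."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    benign = sum(1 for t in y_true if t == 0)
    return fp / benign if benign else 0.0

def evaluate(y_true, y_prob, threshold=0.5):
    """Turn model scores into hard labels and report the metric suite."""
    y_pred = [int(p >= threshold) for p in y_prob]
    return {
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred),
        "roc_auc":   roc_auc_score(y_true, y_prob),   # uses scores, not hard labels
        "fpr":       false_positive_rate(y_true, y_pred),
    }
```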
Metamodel Evaluation
RAMPAGE combines multiple base classifiers (e.g., CNN, LSTM, GRU) into a meta-classifier trained with logistic regression. The idea is that no single model provides a complete picture, but together they do. Furthermore, the meta-classifier is interpretable with SHAP, which also yields the importance of each base classifier's contribution.
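Conceptually, this is a standard stacking setup: each base model's predicted probability becomes one input feature of a logistic regression. A minimal sketch with scikit-learn follows; the variable and function names are illustrative assumptions, not RAMPAGE's API.

```python
# Sketch of the stacking idea behind the meta-classifier: each base model's
# predicted probability becomes one feature of a logistic regression.
# Variable and function names are illustrative, not RAMPAGE's actual API.
import numpy as np
from sklearn.linear_model import LogisticRegression

def build_meta_features(base_models, X):
    """One column per base classifier, holding its predicted probability."""
    return np.column_stack([m.predict(X).ravel() for m in base_models])

def train_meta_classifier(base_models, X_holdout, y_holdout):
    """Fit the logistic-regression meta-classifier on a held-out split
    (distinct from the data the base models were trained on)."""
    X_meta = build_meta_features(base_models, X_holdout)
    meta_clf = LogisticRegression()
    meta_clf.fit(X_meta, y_holdout)
    return meta_clf

# Final scores on new domains:
# scores = meta_clf.predict_proba(build_meta_features(base_models, X_test))[:, 1]
```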
Realistic, Real-World Datasets
To validate RAMPAGE, we collected and curated real DNS records from our university network (University of Zaragoza, Spain): over 7.5 million queries. Unlike synthetic datasets often used in DGA research, our data reflects real-world noise, benign domain structure, and class imbalance.
Our evaluation shows that models trained solely on artificial DGA datasets perform poorly on real-world data. Some state-of-the-art classifiers misclassify over 40% of benign domains!
Evaluation and Results
We evaluated 17 models with RAMPAGE. Key findings:
- Our meta-classifier consistently outperformed all individual models in accuracy and robustness.
- Simpler architectures sometimes outperform complex ones, especially in noisy real-world scenarios.
- Interpretability is important: SHAP scores helped us understand which features the models rely on (a minimal sketch follows this list).
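To illustrate that last point, here is a hedged sketch of how SHAP can be applied to a logistic-regression meta-classifier like ours. The toy data, base-model names, and variables are ours for illustration only; they are not RAMPAGE output.

```python
# Sketch: using SHAP to explain a logistic-regression meta-classifier.
# Toy data and model names below are illustrative, not RAMPAGE output.
import numpy as np
import shap
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_meta = rng.random((500, 3))             # toy stand-in: probabilities from 3 base models
y = (X_meta.mean(axis=1) > 0.5).astype(int)

meta_clf = LogisticRegression().fit(X_meta, y)

explainer = shap.LinearExplainer(meta_clf, X_meta)   # exact SHAP values for linear models
shap_values = explainer.shap_values(X_meta)

# Mean |SHAP| per column = global importance of each base classifier.
for name, score in zip(["CNN", "LSTM", "GRU"], np.abs(shap_values).mean(axis=0)):
    print(f"{name}: {score:.3f}")
```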
All tests were run under the same conditions: same data splits, same preprocessing steps, and same metrics. That’s the RAMPAGE difference.
Real-World Impact and What’s Next?
RAMPAGE enables:
- Researchers: Fairly compare new models against existing benchmarks
- Security analysts: Test AGD detectors on real DNS records
- Tool developers: Integrate reproducible AGD detection into pipelines
- Everyone: Connect academic research with operational cybersecurity
And what’s next? Well, at the moment we are actively working to:
- Expand the benchmark with additional datasets, including multilingual and evolving DGA families.
- Add support for transformer-based models and ensemble learning strategies.
- Package RAMPAGE as a Dockerized service for easy deployment in labs and security operations centers (SOCs).
Are you ready? Get started
If you work in AGD detection or DNS-based threat intelligence, try RAMPAGE and let’s make reproducibility the norm, not the exception.
You can access the full paper here. This work has been a collaboration with Tomás Pelayo-Benedet (UNIZAR), Ricardo J. Rodríguez (UNIZAR), and Carlos H. Gañán (TU Delft).
Funding Acknowledgments
This research was supported in part by grant PID2023-151467OA-I00 (CRAPER), funded by MICIU/AEI/10.13039/501100011033 and by ERDF/EU, by grant TED2021-131115A-I00 (MIMFA), funded by MICIU/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR, by grant Ayudas para la recualificación del sistema universitario español 2021-2023, funded by the European Union NextGenerationEU/PRTR, the Spanish Ministry of Universities, and the University of Zaragoza, by grant Proyecto Estratégico Ciberseguridad EINA UNIZAR, funded by the Spanish National Cybersecurity Institute (INCIBE) and the European Union NextGenerationEU/PRTR, by grant Programa de Proyectos Estratégicos de Grupos de Investigación (DisCo research group, ref. T21-23R), funded by the University, Industry and Innovation Department of the Aragonese Government, and by the RAPID project (Grant No. CS.007) financed by the Dutch Research Council (NWO).

That’s it, folks! Whether you’re a malware researcher, a data scientist, or just tired of irreproducible AGD articles, RAMPAGE is here to help. It brings much-needed clarity, fairness, and realism to a chaotic field. Try it, test your models, break them if necessary, but do so in a reproducible way. And if you find ways to improve it, fork it, star it, or send us a pull request. Let’s raise the bar on AGD detection, together!
Declaration of Generative AI Technologies in the Writing Process
During the preparation of this post, the author used ChatGPT (GPT-4o model) to improve readability and language. After using this tool, the author reviewed and edited the content as necessary and takes full responsibility for the content of this publication.