TL;DR: Detecting Algorithmically Generated Domains (AGDs) is essential for stopping malware that uses Domain Generation Algorithms (DGAs) for command-and-control (C2) resilience. However, the research field is fragmented: each proposed model uses different datasets, metrics, and configurations, making fair comparisons difficult. To address this problem, we present RAMPAGE, a reproducible framework for evaluating AGD detectors. Developed in Python and based on Keras, RAMPAGE facilitates the evaluation and comparison of AGD classifiers under consistent, real-world conditions. Our framework includes benchmark datasets, standard metrics, and even a meta-classifier that outperforms many state-of-the-art models.
The Problem: Apples to Oranges in AGD Detection
In the world of malware detection and analysis, AGD detection is a very active area of research. However, there is a major problem: no two works use the same setup. Each author uses different datasets, preprocessing methods, metrics, and validation strategies.
This lack of standardization leads to two major problems:
- Poor reproducibility: Researchers cannot easily replicate or verify each other’s work.
- Misleading performance claims: A model may appear state-of-the-art only because it was tested on an easier, less representative dataset.
We created RAMPAGE to fix this problem.
What is RAMPAGE?
RAMPAGE (fRAMework to comPAre aGd dEtectors) is an open-source Python framework that provides:
- Standardized training and testing processes
- Pre-processed reference datasets
- Evaluation under realistic conditions
- A modular, plug-and-play architecture for your own models
It allows researchers and practitioners to compare AGD classifiers fairly, under the same conditions. RAMPAGE includes reference implementations of seven popular deep learning models, as well as our own meta-classifier, which combines their predictions using logistic regression.
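To give a flavour of the kind of base model RAMPAGE wraps, here is a minimal character-level classifier in Keras. The encoding scheme, layer sizes, and helper names (`encode`, `build_agd_classifier`) are our own illustrative assumptions, not RAMPAGE's actual API.

```python
# A minimal, illustrative sketch of a character-level AGD classifier in Keras.
# Layer sizes, encoding scheme, and function names are assumptions for
# illustration; they are not RAMPAGE's actual API.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

MAX_LEN = 63                                    # maximum number of characters kept per domain
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-._"

def encode(domain: str) -> np.ndarray:
    """Map each character to an integer id (0 = padding/unknown) and pad to MAX_LEN."""
    ids = [ALPHABET.find(c) + 1 for c in domain.lower()[:MAX_LEN]]
    return np.array(ids + [0] * (MAX_LEN - len(ids)))

def build_agd_classifier() -> keras.Model:
    """Character-level LSTM that outputs P(domain is algorithmically generated)."""
    model = keras.Sequential([
        layers.Embedding(input_dim=len(ALPHABET) + 1, output_dim=32),
        layers.LSTM(64),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Usage: X = np.stack([encode(d) for d in domains]); build_agd_classifier().fit(X, y)
```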
How does it work?
RAMPAGE supports two workflows:
Single Model Evaluation
You can train and test a model on multiple reference datasets using predefined or custom parameters. Results include standard metrics such as precision, recall, F1-score, and ROC-AUC, and you can plug in your own user-defined metrics.
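As a rough illustration of what that evaluation step computes, the snippet below uses scikit-learn for the standard metrics and adds a hypothetical `false_positive_rate` helper as an example of a user-defined metric; the function names are ours, not the framework's.

```python
# Sketch of a per-model evaluation step: standard metrics plus one
# user-defined metric. Names are illustrative, not RAMPAGE's API.
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

def false_positive_rate(y_true, y_pred):
    """Example user-defined metric: share of benign domains flagged as AGDs."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    benign = sum(1 for t in y_true if t == 0)
    return fp / benign if benign else 0.0

def evaluate(y_true, y_prob, threshold=0.5):
    """Turn model scores into hard labels and report the metric suite."""
    y_pred = [int(p >= threshold) for p in y_prob]
    return {
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred),
        "roc_auc":   roc_auc_score(y_true, y_prob),   # uses scores, not hard labels
        "fpr":       false_positive_rate(y_true, y_pred),
    }
```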
Metamodel Evaluation
RAMPAGE combines multiple base classifiers (e.g., CNN, LSTM, GRU) into a meta-classifier trained with logistic regression. The idea is that no single model provides a complete picture, but together they do. Furthermore, the meta-classifier is interpretable with SHAP, which also yields the importance of each base classifier's contribution.
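Conceptually, this is a standard stacking setup: each base model's predicted probability becomes one input feature of a logistic regression. A minimal sketch with scikit-learn follows; the variable and function names are illustrative assumptions, not RAMPAGE's API.

```python
# Sketch of the stacking idea behind the meta-classifier: each base model's
# predicted probability becomes one feature of a logistic regression.
# Variable and function names are illustrative, not RAMPAGE's actual API.
import numpy as np
from sklearn.linear_model import LogisticRegression

def build_meta_features(base_models, X):
    """One column per base classifier, holding its predicted probability."""
    return np.column_stack([m.predict(X).ravel() for m in base_models])

def train_meta_classifier(base_models, X_holdout, y_holdout):
    """Fit the logistic-regression meta-classifier on a held-out split
    (distinct from the data the base models were trained on)."""
    X_meta = build_meta_features(base_models, X_holdout)
    meta_clf = LogisticRegression()
    meta_clf.fit(X_meta, y_holdout)
    return meta_clf

# Final scores on new domains:
# scores = meta_clf.predict_proba(build_meta_features(base_models, X_test))[:, 1]
```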
Realistic, Real-World Datasets
To validate RAMPAGE, we collected and curated real DNS records from our university network (University of Zaragoza, Spain): over 7.5 million queries. Unlike synthetic datasets often used in DGA research, our data reflects real-world noise, benign domain structure, and class imbalance.
Our evaluation shows that models trained solely on artificial DGA datasets perform poorly on real-world data. Some state-of-the-art classifiers misclassify over 40% of benign domains!
Evaluation and Results
We evaluated 17 models with RAMPAGE. Key findings:
- Our meta-classifier consistently outperformed all individual models in accuracy and robustness.
- Simpler architectures sometimes outperform complex ones, especially in noisy real-world scenarios.
- Interpretability is important: SHAP scores helped us understand which features the models rely on (a minimal sketch follows this list).
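To illustrate that last point, here is a hedged sketch of how SHAP can be applied to a logistic-regression meta-classifier like ours. The toy data, base-model names, and variables are ours for illustration only; they are not RAMPAGE output.

```python
# Sketch: using SHAP to explain a logistic-regression meta-classifier.
# Toy data and model names below are illustrative, not RAMPAGE output.
import numpy as np
import shap
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_meta = rng.random((500, 3))             # toy stand-in: probabilities from 3 base models
y = (X_meta.mean(axis=1) > 0.5).astype(int)

meta_clf = LogisticRegression().fit(X_meta, y)

explainer = shap.LinearExplainer(meta_clf, X_meta)   # exact SHAP values for linear models
shap_values = explainer.shap_values(X_meta)

# Mean |SHAP| per column = global importance of each base classifier.
for name, score in zip(["CNN", "LSTM", "GRU"], np.abs(shap_values).mean(axis=0)):
    print(f"{name}: {score:.3f}")
```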
All tests were run under the same conditions: same data splits, same preprocessing steps, and same metrics. That’s the RAMPAGE difference.
Real-World Impact and What’s Next?
RAMPAGE enables:
- Researchers: Fairly compare new models against existing benchmarks
- Security analysts: Test AGD detectors on real DNS records
- Tool developers: Integrate reproducible AGD detection into pipelines
- Everyone: Connect academic research with operational cybersecurity
And what’s next? Well, at the moment we are actively working to:
- Expand the benchmark with additional datasets, including multilingual and evolving DGA families.
- Add support for transformer-based models and ensemble learning strategies.
- Package RAMPAGE as a Dockerized service for easy deployment in labs and security operations centers (SOCs).
Are you ready? Get started
If you work in AGD detection or DNS-based threat intelligence, try RAMPAGE and let’s make reproducibility the norm, not the exception.
You can access the full paper here. This work has been a collaboration with Tomás Pelayo-Benedet (UNIZAR), Ricardo J. Rodríguez (UNIZAR), and Carlos H. Gañán (TU Delft).
Funding Acknowledgments
This research was supported in part by grant PID2023-151467OA-I00 (CRAPER), funded by MICIU/AEI/10.13039/501100011033 and by ERDF/EU, by grant TED2021-131115A-I00 (MIMFA), funded by MICIU/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR, by grant Ayudas para la recualificación del sistema universitario español 2021-2023, funded by the European Union NextGenerationEU/PRTR, the Spanish Ministry of Universities, and the University of Zaragoza, by grant Proyecto Estratégico Ciberseguridad EINA UNIZAR, funded by the Spanish National Cybersecurity Institute (INCIBE) and the European Union NextGenerationEU/PRTR, by grant Programa de Proyectos Estratégicos de Grupos de Investigación (DisCo research group, ref. T21-23R), funded by the University, Industry and Innovation Department of the Aragonese Government, and by the RAPID project (Grant No. CS.007) financed by the Dutch Research Council (NWO).

That’s it, folks! Whether you’re a malware researcher, a data scientist, or just tired of irreproducible AGD articles, RAMPAGE is here to help. It brings much-needed clarity, fairness, and realism to a chaotic field. Try it, test your models, break them if necessary, but do so in a reproducible way. And if you find ways to improve it, fork it, star it, or send us a pull request. Let’s raise the bar on AGD detection, together!
Declaration of Generative AI Technologies in the Writing Process
During the preparation of this post, the author used ChatGPT (GPT-4o model) to improve readability and language. After using this tool, the author reviewed and edited the content as necessary and takes full responsibility for the content of this publication.