{"id":836,"date":"2025-06-30T01:49:11","date_gmt":"2025-06-30T01:49:11","guid":{"rendered":"https:\/\/reversea.me\/?p=836"},"modified":"2025-07-03T14:23:38","modified_gmt":"2025-07-03T14:23:38","slug":"rampage-reproducible-evaluation-of-agd-detection-models","status":"publish","type":"post","link":"https:\/\/reversea.me\/index.php\/rampage-reproducible-evaluation-of-agd-detection-models\/","title":{"rendered":"RAMPAGE: Reproducible Evaluation of AGD Detection Models"},"content":{"rendered":"<span class=\"span-reading-time rt-reading-time\" style=\"display: block;\"><span class=\"rt-label rt-prefix\">Reading Time: <\/span> <span class=\"rt-time\"> 4<\/span> <span class=\"rt-label rt-postfix\">minutes<\/span><\/span>\n<p class=\"wp-block-paragraph\"><strong>TL;DR<\/strong>: Detecting Algorithmically Generated Domains (AGDs) is essential for stopping malware that uses Domain Generation Algorithms (DGAs) for command and control (C2) resilience. However, the research field is complex: each proposed model uses different datasets, metrics, and configurations, making fair comparisons difficult. To address this problem, we present <a href=\"https:\/\/github.com\/reverseame\/RAMPAGE\"><code>RAMPAGE<\/code><\/a>, a reproducible framework for evaluating AGD detectors. Developed in Python and based on Keras, <code>RAMPAGE<\/code> facilitates the evaluation and comparison of AGD classifiers under consistent real-world conditions. Our framework includes benchmark datasets, standard metrics, and even a meta-classifier that outperforms many state-of-the-art models.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The Problem: Apples to Oranges in AGD Detection<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In the world of malware detection and analysis, AGD detection is a very active area of research. However, there is a major problem: no two works use the same setup. 
Each author uses different datasets, preprocessing methods, metrics, and validation strategies.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This lack of standardization leads to two major problems:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Poor reproducibility<\/strong>: Researchers cannot easily replicate or verify each other&#8217;s work.<\/li>\n\n\n\n<li><strong>Misleading performance claims<\/strong>: A model may appear state-of-the-art only because it was evaluated on an easier dataset.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">We created <a href=\"https:\/\/github.com\/reverseame\/RAMPAGE\"><code>RAMPAGE<\/code><\/a> to fix this problem.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What is RAMPAGE?<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\"><code>RAMPAGE<\/code> (<strong>fRAMework to comPAre aGd dEtectors<\/strong>) is an open-source Python framework that provides:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Standardized training and testing processes<\/li>\n\n\n\n<li>Pre-processed reference datasets<\/li>\n\n\n\n<li>Evaluation under realistic conditions<\/li>\n\n\n\n<li>A modular architecture that lets you plug in your own models<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">It allows researchers and practitioners to compare AGD classifiers fairly, under the same conditions. <code>RAMPAGE<\/code> includes reference implementations of seven popular deep learning models, as well as our own meta-classifier, which combines their predictions using logistic regression.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">How does it work?<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\"><code>RAMPAGE<\/code> supports two workflows:<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Single Model Evaluation<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">You can train and test a model on multiple reference datasets using predefined or custom parameters. Results include standard metrics such as precision, recall, F1-score, and ROC-AUC. 
Users can also define their own metrics.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Metamodel Evaluation<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\"><code>RAMPAGE<\/code> combines multiple base classifiers (e.g., CNN, LSTM, GRU) into a meta-classifier: a logistic regression model trained on their predictions. The idea is that no single model provides a complete picture, but together they do. The meta-classifier is also interpretable through SHAP, which additionally yields feature-importance scores.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Real-World Datasets<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">To validate <code>RAMPAGE<\/code>, we collected and curated real DNS records from our university network (<a href=\"https:\/\/www.unizar.es\/\">University of Zaragoza, Spain<\/a>): over 7.5 million queries. Unlike the synthetic datasets often used in DGA research, our data reflects real-world noise, the structure of benign domains, and class imbalance.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Our evaluation shows that models trained solely on artificial DGA datasets perform poorly on real-world data. Some state-of-the-art classifiers misclassify over 40% of benign domains!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Evaluation and Results<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">We evaluated 17 models with <code>RAMPAGE<\/code>. Key findings:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Our meta-classifier consistently outperformed all individual models in accuracy and robustness.<\/li>\n\n\n\n<li>Simpler architectures sometimes outperform complex ones, especially in noisy real-world scenarios.<\/li>\n\n\n\n<li>Interpretability is important: SHAP values helped us understand which features the models rely on.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">All tests were run under the same conditions: same data splits, same preprocessing steps, and same metrics. 
That&#8217;s the <code>RAMPAGE<\/code> difference.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Real-World Impact and What&#8217;s Next?<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\"><code>RAMPAGE<\/code> enables:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Researchers: Fairly compare new models against existing benchmarks<\/li>\n\n\n\n<li>Security analysts: Test AGD detectors on real DNS records<\/li>\n\n\n\n<li>Tool developers: Integrate reproducible AGD detection into pipelines<\/li>\n\n\n\n<li>The community at large: Bridge the gap between academic research and operational cybersecurity<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">So, what&#8217;s next? We are currently working to:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Expand the benchmark with additional datasets, including multilingual and evolving DGA families.<\/li>\n\n\n\n<li>Add support for transformer-based models and ensemble learning strategies.<\/li>\n\n\n\n<li>Package <code>RAMPAGE<\/code> as a Dockerized service for easy deployment in labs and security operations centers (SOCs).<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Are you ready? Get started<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">If you work in AGD detection or DNS-based threat intelligence, try <code>RAMPAGE<\/code> and let&#8217;s make reproducibility the norm, not the exception.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>GitHub: <a href=\"https:\/\/github.com\/reverseame\/RAMPAGE\">https:\/\/github.com\/reverseame\/RAMPAGE<\/a><\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">You can access the full paper here. This work has been a collaboration with Tom\u00e1s Pelayo-Benedet (UNIZAR), Ricardo J. Rodr\u00edguez (UNIZAR), and <a href=\"https:\/\/www.tudelft.nl\/staff\/c.hernandezganan\/\" data-type=\"link\" data-id=\"https:\/\/www.tudelft.nl\/staff\/c.hernandezganan\/\">Carlos H. 
Ga\u00f1\u00e1n (TU Delft)<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Funding Acknowledgments<\/strong><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">This research was supported in part by grant PID2023-151467OA-I00 (CRAPER), funded by MICIU\/AEI\/10.13039\/501100011033 and by ERDF\/EU, by grant TED2021-131115A-I00 (MIMFA), funded by MICIU\/AEI\/10.13039\/501100011033 and by the European Union NextGenerationEU\/PRTR, by grant <em>Ayudas para la recualificaci\u00f3n del sistema universitario espa\u00f1ol 2021-2023<\/em>, funded by the European Union NextGenerationEU\/PRTR, the Spanish Ministry of Universities, and the University of Zaragoza, by grant <em>Proyecto Estrat\u00e9gico Ciberseguridad EINA UNIZAR<\/em>, funded by the Spanish National Cybersecurity Institute (INCIBE) and the European Union NextGenerationEU\/PRTR, by grant <em>Programa de Proyectos Estrat\u00e9gicos de Grupos de Investigaci\u00f3n<\/em> (DisCo research group, ref. T21-23R), funded by the University, Industry and Innovation Department of the Aragonese Government, and by the RAPID project (Grant No. 
CS.007) financed by the Dutch Research Council (NWO).<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"58\" src=\"https:\/\/reversea.me\/wp-content\/uploads\/2025\/06\/BandaINCIBEcolor-1024x58.png\" alt=\"\" class=\"wp-image-838\" srcset=\"https:\/\/reversea.me\/wp-content\/uploads\/2025\/06\/BandaINCIBEcolor-1024x58.png 1024w, https:\/\/reversea.me\/wp-content\/uploads\/2025\/06\/BandaINCIBEcolor-300x17.png 300w, https:\/\/reversea.me\/wp-content\/uploads\/2025\/06\/BandaINCIBEcolor-768x43.png 768w, https:\/\/reversea.me\/wp-content\/uploads\/2025\/06\/BandaINCIBEcolor-1536x87.png 1536w, https:\/\/reversea.me\/wp-content\/uploads\/2025\/06\/BandaINCIBEcolor-2048x116.png 2048w, https:\/\/reversea.me\/wp-content\/uploads\/2025\/06\/BandaINCIBEcolor-1440x82.png 1440w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p class=\"wp-block-paragraph\"><em>That&#8217;s all, folks! Whether you&#8217;re a malware researcher, a data scientist, or just tired of irreproducible AGD papers, <a href=\"https:\/\/github.com\/reverseame\/RAMPAGE\"><code>RAMPAGE<\/code><\/a> is here to help. It brings much-needed clarity, fairness, and realism to a chaotic field. Try it, test your models, break them if necessary, but do so in a reproducible way. And if you find it useful or see ways to improve it, star it, fork it, or send us a pull request. Let&#8217;s raise the bar on AGD detection, together!<\/em><\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h4 class=\"wp-block-heading\">Declaration of Generative AI Technologies in the Writing Process<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">During the preparation of this post, the author used ChatGPT (GPT-4o model) to improve readability and language. 
After using this tool, the author reviewed and edited the content as necessary and takes full responsibility for the content of this publication.<\/p>\n","protected":false},"excerpt":{"rendered":"<p><span class=\"span-reading-time rt-reading-time\" style=\"display: block;\"><span class=\"rt-label rt-prefix\">Reading Time: <\/span> <span class=\"rt-time\"> 4<\/span> <span class=\"rt-label rt-postfix\">minutes<\/span><\/span>TL;DR: Detecting Algorithmically Generated Domains (AGDs) is essential for stopping malware that uses Domain Generation Algorithms (DGAs) for command and control (C2) resilience. However, the research field is complex: each proposed model uses different datasets, metrics, and configurations, making fair comparisons difficult. To address this problem, we present RAMPAGE, a reproducible framework for evaluating AGD [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[49,17,40,48,15],"tags":[50,51,53,52],"class_list":["post-836","post","type-post","status-publish","format-standard","hentry","category-ai-in-cybersecurity","category-malware","category-network","category-threat-detection","category-tools","tag-agd","tag-dga","tag-dns","tag-reproducibility","no-featured-image"],"_links":{"self":[{"href":"https:\/\/reversea.me\/index.php\/wp-json\/wp\/v2\/posts\/836","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/reversea.me\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/reversea.me\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/reversea.me\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/reversea.me\/index.php\/wp-json\/wp\/v2\/comments?post=836"}],"version-history":[{"count":4,"href":"https:\/\/reversea.me\/index.php\/wp-json\/wp\/v2\/posts\/
836\/revisions"}],"predecessor-version":[{"id":848,"href":"https:\/\/reversea.me\/index.php\/wp-json\/wp\/v2\/posts\/836\/revisions\/848"}],"wp:attachment":[{"href":"https:\/\/reversea.me\/index.php\/wp-json\/wp\/v2\/media?parent=836"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/reversea.me\/index.php\/wp-json\/wp\/v2\/categories?post=836"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/reversea.me\/index.php\/wp-json\/wp\/v2\/tags?post=836"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}