{"id":932,"date":"2026-07-01T19:42:08","date_gmt":"2026-07-01T19:42:08","guid":{"rendered":"https:\/\/reversea.me\/?p=932"},"modified":"2026-07-06T15:32:47","modified_gmt":"2026-07-06T15:32:47","slug":"the-simpler-the-stealthier-benchmarking-adversarial-dga-models-in-a-unified-framework","status":"publish","type":"post","link":"https:\/\/reversea.me\/index.php\/the-simpler-the-stealthier-benchmarking-adversarial-dga-models-in-a-unified-framework\/","title":{"rendered":"The Simpler, the Stealthier: Benchmarking Adversarial DGA Models in a Unified Framework"},"content":{"rendered":"<span class=\"span-reading-time rt-reading-time\" style=\"display: block;\"><span class=\"rt-label rt-prefix\">Reading Time: <\/span> <span class=\"rt-time\"> 6<\/span> <span class=\"rt-label rt-postfix\">minutes<\/span><\/span>\n<p class=\"wp-block-paragraph\"><strong>TL;DR<\/strong> Adversarial Domain Generation Algorithms (DGAs) let botnets craft domain names that evade Machine Learning (ML)-based detectors. The field lacks a common evaluation setup: models are tested on different datasets and metrics, and their code is rarely published, which makes comparing them hard. In our <a href=\"https:\/\/webdiis.unizar.es\/~ricardo\/files\/papers\/PelayoBenedetR-DIMVA-26.pdf\" data-type=\"link\" data-id=\"https:\/\/webdiis.unizar.es\/~ricardo\/files\/papers\/PelayoBenedetR-DIMVA-26.pdf\">poster paper at DIMVA 2026, <em>&#8220;The Simpler, the Stealthier: A Framework for Evaluating Adversarial Domain Generation Algorithm Models<\/em>&#8221;<\/a>, presented tomorrow, we release an <a href=\"https:\/\/github.com\/reverseame\/adversarial-dga-framework\">open-source benchmarking framework<\/a> that evaluates four adversarial DGA models (<code>DeepDGA<\/code>, <code>CharBot<\/code>, <code>Deception<\/code>, and <code>MaskDGA<\/code>) in a single environment across three dimensions: lexical characteristics, evasion against LSTM and CNN classifiers, and computational cost. 
In our experiments, the two simplest models, <code>CharBot<\/code> and <code>Deception<\/code>, evade detection over 75% of the time while training in under two seconds, while the two more complex models stay below 21% evasion despite needing hours of training. Interested? Keep reading this briefing or <a href=\"https:\/\/github.com\/reverseame\/adversarial-dga-framework\">grab the code on GitHub<\/a>.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Why Comparing Adversarial DGAs Is So Hard<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Modern malware relies on DGAs to dynamically generate large volumes of pseudo-random domain names, giving botnets resilient command-and-control (C2) channels that resist static blocklists. Defenders responded with ML-based detectors that learn character-level patterns to flag Algorithmically Generated Domains (AGDs). Attackers then started crafting domains that mimic legitimate traffic to evade those detectors.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In our recent <a href=\"https:\/\/doi.org\/10.1016\/j.mlwa.2026.100888\">systematic literature review of this arms race<\/a>, we found two obstacles that hold back progress in the field:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>No common ground for comparison.<\/strong> Most adversarial models are evaluated in isolation, on heterogeneous datasets and with inconsistent metrics, so direct comparison between studies is hard.<\/li>\n\n\n\n<li><strong>Low reproducibility.<\/strong> Only 12.50% of the reviewed studies released public artifacts, which makes independent validation difficult.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Without a shared, open evaluation harness, it is hard to tell whether a new adversarial DGA is a real improvement or an artifact of a favorable dataset.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The Solution: A Unified Benchmarking Framework<\/h2>\n\n\n\n<p 
class=\"wp-block-paragraph\">To address both problems, we release an <a href=\"https:\/\/github.com\/reverseame\/adversarial-dga-framework\">open-source benchmarking framework<\/a> that evaluates adversarial DGA models in a single environment. The framework does not propose a new evasion technique. It provides a shared setup in which representative models receive the same data and are measured with the same metrics.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The framework is modular, with three decoupled layers coordinated by an orchestrator:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A <strong>generation layer<\/strong> that groups the adversarial DGA models.<\/li>\n\n\n\n<li>A <strong>detection layer<\/strong> with the trained ML classifiers.<\/li>\n\n\n\n<li>An <strong>analysis layer<\/strong> that computes lexical statistics, detection metrics, and timing.<\/li>\n<\/ul>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"994\" height=\"456\" src=\"https:\/\/reversea.me\/wp-content\/uploads\/2026\/07\/fig1_architecture.png\" alt=\"\" class=\"wp-image-933\" srcset=\"https:\/\/reversea.me\/wp-content\/uploads\/2026\/07\/fig1_architecture.png 994w, https:\/\/reversea.me\/wp-content\/uploads\/2026\/07\/fig1_architecture-300x138.png 300w, https:\/\/reversea.me\/wp-content\/uploads\/2026\/07\/fig1_architecture-768x352.png 768w\" sizes=\"auto, (max-width: 994px) 100vw, 994px\" \/><figcaption class=\"wp-element-caption\"><em>Figure 1: Architecture of the adversarial DGA benchmarking framework. 
The orchestrator drives the generation, detection, and analysis layers.<\/em><\/figcaption><\/figure>\n<\/div>\n\n\n<h2 class=\"wp-block-heading\">Inside the Framework<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Datasets.<\/strong> We build three disjoint datasets from the <a href=\"https:\/\/tranco-list.eu\/\">Tranco list<\/a> (benign domains) and <a href=\"https:\/\/dgarchive.caad.fkie.fraunhofer.de\/\"><code>DGArchive<\/code><\/a> (malicious AGDs), each with a distinct role: <code>D1<\/code> trains the adversarial models, <code>D2<\/code> trains the detectors, and <code>D3<\/code> acts as the control and evaluation set. <code>D1<\/code> and <code>D2<\/code> draw equally from the 58 DGA families with at least 50,000 samples each (100,000 malicious and 100,000 benign domains per dataset), so models and detectors see the full diversity of DGAs. <code>D3<\/code> is restricted to four families representing distinct generation schemes, which enables a per-family comparison: <code>Qakbot<\/code> and <code>Rovnix<\/code> (arithmetic-based), <code>Suppobox<\/code> (dictionary-based), and <code>Dyre<\/code> (hash-based).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Adversarial models.<\/strong> We implement four representative models, <code>DeepDGA<\/code>, <code>CharBot<\/code>, <code>Deception<\/code>, and <code>MaskDGA<\/code>, plus two control groups: a <em>malicious baseline<\/em> of real AGDs from the four <code>D3<\/code> families, and a <em>benign baseline<\/em> of real Tranco domains. 
The baselines are the reference points the adversarial models try to avoid (malicious) or mimic (benign).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Detectors.<\/strong> We train two ML detectors on <code>D2<\/code>: an <code>LSTM<\/code>, a character-level recurrent classifier, and a <code>CNN<\/code>, a convolutional classifier over character n-gram embeddings.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Metrics.<\/strong> For every model and control group, the analysis layer computes six lexical features (Shannon entropy, vowel ratio, digit ratio, unique-character ratio, maximum number of consecutive consonants, and length) along with training and generation times. Detection is measured with the evasion rate: the fraction of adversarial domains incorrectly classified as benign by each detector.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Key Findings: The Simpler, the Stealthier<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">We evaluated each model across three dimensions.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Lexical features.<\/strong> <code>CharBot<\/code> and <code>Deception<\/code> closely match the benign baseline across all metrics, particularly Shannon entropy, vowel ratio, and domain length, which suggests they learn the lexical distribution of legitimate domains. <code>MaskDGA<\/code> shows a profile closer to arithmetic DGA families, with lower vowel ratios and longer domains. 
<code>DeepDGA<\/code> is a clear outlier, with a mean length of 62 characters, a near-zero vowel ratio, and the highest digit ratio of any group.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"679\" src=\"https:\/\/reversea.me\/wp-content\/uploads\/2026\/07\/fig2_lexical_heatmap-1024x679.png\" alt=\"\" class=\"wp-image-934\" srcset=\"https:\/\/reversea.me\/wp-content\/uploads\/2026\/07\/fig2_lexical_heatmap-1024x679.png 1024w, https:\/\/reversea.me\/wp-content\/uploads\/2026\/07\/fig2_lexical_heatmap-300x199.png 300w, https:\/\/reversea.me\/wp-content\/uploads\/2026\/07\/fig2_lexical_heatmap-768x509.png 768w, https:\/\/reversea.me\/wp-content\/uploads\/2026\/07\/fig2_lexical_heatmap-1536x1018.png 1536w, https:\/\/reversea.me\/wp-content\/uploads\/2026\/07\/fig2_lexical_heatmap-1440x955.png 1440w, https:\/\/reversea.me\/wp-content\/uploads\/2026\/07\/fig2_lexical_heatmap.png 1858w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\"><em>Figure 2: Mean lexical statistics by model and control group (min-max normalized). The horizontal line separates the baselines (above) from the adversarial models (below).<\/em><\/figcaption><\/figure>\n<\/div>\n\n\n<p class=\"wp-block-paragraph\"><strong>Detection evasion.<\/strong> Against benign domains, both detectors reach evasion rates above 96%. On real-world families, results depend on the generation scheme: hash-based <code>Dyre<\/code> is detected almost completely (evasion close to 0%), arithmetic <code>Rovnix<\/code> and <code>Qakbot<\/code> stay below 13%, and dictionary-based <code>Suppobox<\/code> is the hardest real family (64% against the CNN, 94% against the LSTM). Among the adversarial models, <code>CharBot<\/code> and <code>Deception<\/code> reach the highest evasion rates, above 75% and 84% respectively against both detectors. 
<code>MaskDGA<\/code> and <code>DeepDGA<\/code> stay below 21% in all cases, consistent with their divergent lexical profiles.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"370\" src=\"https:\/\/reversea.me\/wp-content\/uploads\/2026\/07\/fig3_evasion_rate-1024x370.png\" alt=\"\" class=\"wp-image-935\" srcset=\"https:\/\/reversea.me\/wp-content\/uploads\/2026\/07\/fig3_evasion_rate-1024x370.png 1024w, https:\/\/reversea.me\/wp-content\/uploads\/2026\/07\/fig3_evasion_rate-300x108.png 300w, https:\/\/reversea.me\/wp-content\/uploads\/2026\/07\/fig3_evasion_rate-768x278.png 768w, https:\/\/reversea.me\/wp-content\/uploads\/2026\/07\/fig3_evasion_rate-1536x555.png 1536w, https:\/\/reversea.me\/wp-content\/uploads\/2026\/07\/fig3_evasion_rate-2048x740.png 2048w, https:\/\/reversea.me\/wp-content\/uploads\/2026\/07\/fig3_evasion_rate-1440x521.png 1440w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\"><em>Figure 3: Evasion rate (%) of each model and control group against the CNN and LSTM detectors. The dashed line separates baselines from the adversarial models.<\/em><\/figcaption><\/figure>\n<\/div>\n\n\n<p class=\"wp-block-paragraph\"><strong>Computational cost.<\/strong> <code>CharBot<\/code> and <code>Deception<\/code> train in under two seconds and generate domains in microseconds. 
<code>MaskDGA<\/code> and <code>DeepDGA<\/code> need 2.61 and 21.11 hours of training respectively, and reach the lowest evasion rates in the study.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th><\/th><th><code>CharBot<\/code><\/th><th><code>Deception<\/code><\/th><th><code>MaskDGA<\/code><\/th><th><code>DeepDGA<\/code><\/th><\/tr><\/thead><tbody><tr><td><strong>Training time (s)<\/strong><\/td><td>0.91<\/td><td>1.52<\/td><td>9,409<\/td><td>75,994<\/td><\/tr><tr><td><strong>Generation time (s\/domain)<\/strong><\/td><td>10\u207b\u2075<\/td><td>10\u207b\u2074<\/td><td>0.056<\/td><td>0.11<\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\"><em>Table 1: Training and generation times for each adversarial model.<\/em><\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Across the three dimensions, lexical similarity to benign domains, rather than model complexity, is what tracks with evasion success in our experiments. In this set of models, higher computational cost did not translate into higher evasion. This is the observation behind the paper&#8217;s title.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Takeaways for Defenders<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Model complexity is not a good proxy for risk. In our results, the two models that were cheapest to train were also the most evasive.<\/li>\n\n\n\n<li>Dictionary-based and benign-mimicking generators are the ones that evade detectors relying on surface statistics such as entropy or n-gram distributions.<\/li>\n\n\n\n<li>Shared benchmarks and open artifacts make it possible to compare techniques fairly and to reproduce results. We release the framework as open source for that reason.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Ongoing Work<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">This poster is a proof of concept for a broader study. 
We are extending the framework to cover all adversarial models identified in our <a href=\"https:\/\/doi.org\/10.1016\/j.mlwa.2026.100888\">systematic literature review<\/a>, with additional lexical features (n-gram distributions), more detectors, and memory-usage profiling. We also plan to compare the numbers reported in the original papers with those obtained under our framework, and we are exploring a new adversarial model based on lightweight LLMs.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Funding Acknowledgments<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">This research was supported in part by grant PID2023-151467OA-I00 (CRAPER), funded by MICIU\/AEI\/10.13039\/501100011033 and by ERDF\/EU, and by grant <em>Programa de Proyectos Estrat\u00e9gicos de Grupos de Investigaci\u00f3n<\/em> (DisCo research group, ref. T21-23R), funded by the University, Industry and Innovation Department of the Aragonese Government. The work of Tom\u00e1s Pelayo-Benedet was supported by the Government of Arag\u00f3n through the Diputaci\u00f3n General de Arag\u00f3n (DGA) Predoctoral Grant, during 2025-2029.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"89\" src=\"https:\/\/reversea.me\/wp-content\/uploads\/2025\/06\/BandaINCIBEcolor-1-1024x89.png\" alt=\"\" class=\"wp-image-844\" srcset=\"https:\/\/reversea.me\/wp-content\/uploads\/2025\/06\/BandaINCIBEcolor-1-1024x89.png 1024w, https:\/\/reversea.me\/wp-content\/uploads\/2025\/06\/BandaINCIBEcolor-1-300x26.png 300w, https:\/\/reversea.me\/wp-content\/uploads\/2025\/06\/BandaINCIBEcolor-1-768x67.png 768w, https:\/\/reversea.me\/wp-content\/uploads\/2025\/06\/BandaINCIBEcolor-1-1536x134.png 1536w, https:\/\/reversea.me\/wp-content\/uploads\/2025\/06\/BandaINCIBEcolor-1-2048x179.png 2048w, https:\/\/reversea.me\/wp-content\/uploads\/2025\/06\/BandaINCIBEcolor-1-1440x126.png 1440w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n<div 
class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"201\" src=\"https:\/\/reversea.me\/wp-content\/uploads\/2026\/07\/logos-1024x201.jpeg\" alt=\"\" class=\"wp-image-944\" srcset=\"https:\/\/reversea.me\/wp-content\/uploads\/2026\/07\/logos-1024x201.jpeg 1024w, https:\/\/reversea.me\/wp-content\/uploads\/2026\/07\/logos-300x59.jpeg 300w, https:\/\/reversea.me\/wp-content\/uploads\/2026\/07\/logos-768x151.jpeg 768w, https:\/\/reversea.me\/wp-content\/uploads\/2026\/07\/logos.jpeg 1200w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/div>\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p class=\"wp-block-paragraph\"><em>And that&#8217;s all, folks! In the meantime, we encourage you to <a href=\"https:\/\/webdiis.unizar.es\/~ricardo\/files\/papers\/PelayoBenedetR-DIMVA-26.pdf\">read the full poster paper at DIMVA 2026<\/a>, <a href=\"https:\/\/github.com\/reverseame\/adversarial-dga-framework\">explore and contribute to the framework on GitHub<\/a>, or <a href=\"mailto:reverseame@unizar.es\">reach out to us if you&#8217;re interested in collaborating on adversarial DGA evaluation, botnet detection, or reproducible security benchmarking<\/a>. Thanks for reading, and stay secure!<\/em><\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Declaration of Generative AI Technologies in the Writing Process<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">During the preparation of this post, the authors used Claude (Claude Opus 4.8 model) to improve readability and language. 
After using this tool, the authors reviewed and edited the content as necessary and take full responsibility for the content of this publication.<\/p>\n","protected":false},"excerpt":{"rendered":"<p><span class=\"span-reading-time rt-reading-time\" style=\"display: block;\"><span class=\"rt-label rt-prefix\">Reading Time: <\/span> <span class=\"rt-time\"> 6<\/span> <span class=\"rt-label rt-postfix\">minutes<\/span><\/span>TL;DR Adversarial Domain Generation Algorithms (DGAs) let botnets craft domain names that evade Machine Learning (ML)-based detectors. The field lacks a common evaluation setup: models are tested on different datasets and metrics, and their code is rarely published, which makes comparing them hard. In our poster paper at DIMVA 2026, &#8220;The Simpler, the Stealthier: A [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[49,17,40,48,15],"tags":[71,50,72,73,51,53,52],"class_list":["post-932","post","type-post","status-publish","format-standard","hentry","category-ai-in-cybersecurity","category-malware","category-network","category-threat-detection","category-tools","tag-adversarial-machine-learning","tag-agd","tag-benchmarking","tag-botnet","tag-dga","tag-dns","tag-reproducibility","no-featured-image"],"_links":{"self":[{"href":"https:\/\/reversea.me\/index.php\/wp-json\/wp\/v2\/posts\/932","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/reversea.me\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/reversea.me\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/reversea.me\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/reversea.me\/index.php\/wp-json\/wp\/v2\/comments?post=932"}],"version-history":[{"count":4,"href":"https:\/\/reversea.me\/index.php\/wp-json\/wp\/v2\/posts\/932\/revision
s"}],"predecessor-version":[{"id":945,"href":"https:\/\/reversea.me\/index.php\/wp-json\/wp\/v2\/posts\/932\/revisions\/945"}],"wp:attachment":[{"href":"https:\/\/reversea.me\/index.php\/wp-json\/wp\/v2\/media?parent=932"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/reversea.me\/index.php\/wp-json\/wp\/v2\/categories?post=932"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/reversea.me\/index.php\/wp-json\/wp\/v2\/tags?post=932"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}