Reading Time: 4 minutes

TL;DR: Algorithmically Generated Domains (AGDs) help malware maintain stealthy and resilient communication by generating pseudo-random domains that evade traditional detection. But what if we could detect them using general-purpose Large Language Models (LLMs), without any training or tuning? That’s precisely what we explored in our DIMVA 2025 poster, “Exploring the Zero-Shot Potential of LLMs for Algorithmically Generated Domain Detection.” (you can read the poster paper here or here) We evaluated nine popular LLMs from the GPT-4o, Claude 3.5, Gemini 1.5, and Mistral families and showed that they can indeed detect AGDs surprisingly well in a zero-shot setting. However, we also uncovered critical limitations, such as overconfidence and high false positive rates. Read on to learn why LLMs offer both promise and peril in malware detection.

Why AGD Detection Matters

Malware uses Domain Generation Algorithms (DGAs) to periodically create new pseudo-random domain names for its command-and-control (C2) servers. These DGAs make it difficult for defenders to block communication: even if one domain is taken down, another appears moments later. Traditional detection methods rely on hand-crafted features or trained classifiers, but these are fragile, especially when attackers change their generation patterns.
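To make the idea concrete, here is a toy, date-seeded DGA in Python. It only illustrates the general mechanism (a deterministic seed producing pseudo-random domains that both the malware and its operator can compute); it is not the algorithm of any real malware family.

```python
import hashlib
from datetime import date, timedelta

def toy_dga(seed_date: date, count: int = 5, tld: str = ".com") -> list[str]:
    """Generate pseudo-random domain names from a date-based seed.

    Malware and its operator run the same algorithm, so they rendezvous
    on the same domains without hard-coding any of them.
    """
    domains = []
    for i in range(count):
        seed = f"{seed_date.isoformat()}-{i}".encode()
        digest = hashlib.md5(seed).hexdigest()
        # Map the hex digest to lowercase letters to get a DGA-looking label.
        label = "".join(chr(ord("a") + int(c, 16) % 26) for c in digest[:12])
        domains.append(label + tld)
    return domains

# Each day yields a fresh set of candidate C2 domains.
print(toy_dga(date.today()))
print(toy_dga(date.today() + timedelta(days=1)))
```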

In contrast, Large Language Models (LLMs) come pre-trained with extensive linguistic knowledge. Could they detect malicious domains simply by “understanding” the domain’s structure and randomness? That is the hypothesis we tested.

What Did We Do?

We evaluated nine LLMs from OpenAI (GPT-4o, GPT-4o-mini), Anthropic (Claude 3.5 Sonnet & Haiku), Google (Gemini 1.5 Pro, Flash, and Flash-8B), and Mistral (Large & Small), all accessible through public APIs. Without any fine-tuning or labeled training data, we gave them domain names and asked a single question: is this domain malicious or legitimate?

We used a balanced dataset:

  • 25,000 malicious domains from 25 real-world malware families (via DGArchive).
  • 25,000 legitimate domains from the Tranco list.

We evaluated each LLM with two different prompts (sketched below):

  • P1: A minimal prompt with simple, direct instructions.
  • P2: A more comprehensive prompt that spells out explicit lexical cues (such as randomness, character ratios, or similarity to known domains).
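The exact prompt wording and output parsing from the paper are not reproduced here; the snippet below is a minimal sketch of the general zero-shot setup, assuming the OpenAI Python SDK and a hypothetical P1-style prompt.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical P1-style prompt: minimal instructions, no lexical hints.
P1 = (
    "You are a security analyst. Classify the following domain name as "
    "'malicious' (algorithmically generated) or 'legitimate'. "
    "Answer with a single word.\n\nDomain: {domain}"
)

def classify_domain(domain: str, model: str = "gpt-4o-mini") -> str:
    """Ask the model, zero-shot, whether a domain looks like an AGD."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": P1.format(domain=domain)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

print(classify_domain("google.com"))        # expected: legitimate
print(classify_domain("xjkqpwvzrtmb.com"))  # expected: malicious
```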

Key Findings

LLMs Can Detect DGAs

Overall, the LLMs showed strong baseline performance. Claude 3.5 Sonnet achieved an F1 score of 90.1%, and Mistral Large reached 88.9%. Matthews Correlation Coefficients (MCCs) reached up to 79.7%, demonstrating strong agreement with the ground-truth labels.
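For readers who want to run this kind of evaluation on their own predictions, the metrics we report can be computed directly with scikit-learn. The labels below are placeholders, not our actual results.

```python
from sklearn.metrics import f1_score, matthews_corrcoef, confusion_matrix

# Placeholder labels: 1 = malicious (AGD), 0 = legitimate.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]  # hypothetical LLM verdicts

f1 = f1_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)

# False positive rate: legitimate domains flagged as malicious.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fpr = fp / (fp + tn)

print(f"F1 = {f1:.3f}, MCC = {mcc:.3f}, FPR = {fpr:.3f}")
```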

More Features ≠ Better Performance

The lexically rich prompt P2 offered only marginal improvements, often less than 1% in accuracy. In some cases, it even led to more formatting errors or unclassified results. Keeping the prompt simple (P1) was often the better choice.

High False Positive Rates

All models showed a consistent bias toward classifying domains as malicious, especially with P1. For example, GPT-4o-mini had a false positive rate of 31.9%, while even the best model, Mistral Large, had an FPR of 13.2%. This is a red flag in the real world, where false alarms can cripple operations.

Errors from Overconfidence

All models reported very high confidence, even when they were wrong. The median confidence score exceeded 85%, regardless of whether the classification was correct. This lack of calibration means that when the models are wrong, they are confidently wrong.
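One simple way to surface this kind of miscalibration is to bin the model's self-reported confidence scores and compare each bin's average confidence with its actual accuracy, in the style of a reliability diagram. The sketch below assumes you have already collected per-domain verdicts with confidence scores; the data shown is hypothetical.

```python
import numpy as np

# Hypothetical records: (self-reported confidence, was the verdict correct?)
confidences = np.array([0.95, 0.90, 0.88, 0.92, 0.97, 0.86, 0.91, 0.93])
correct     = np.array([1,    1,    0,    0,    1,    0,    1,    0])

bins = np.linspace(0.5, 1.0, 6)  # five confidence bins from 50% to 100%
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (confidences >= lo) & (confidences < hi)
    if mask.any():
        print(f"confidence {lo:.0%}-{hi:.0%}: "
              f"avg confidence {confidences[mask].mean():.0%}, "
              f"accuracy {correct[mask].mean():.0%}")
# A well-calibrated model would show accuracy close to confidence in every bin.
```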

Real-World Relevance

Using LLMs for AGD detection may sound like magic, and in some ways it almost is: without any training, LLMs can distinguish AGDs from legitimate domains with impressive accuracy. However, their weaknesses should not be ignored.

Their bias toward flagging domains as malicious could lead to benign domains being blocked, causing outages. Furthermore, their poorly calibrated confidence undermines their usefulness in high-stakes scenarios such as incident response.

What’s Next?

We are expanding our work in two interesting directions:

  • Multiclass detection: Can LLMs distinguish different DGA families (e.g., Conficker vs. Matsnu)?
  • Robustness testing: How do they perform when benign domains resemble DGAs (such as “x1x2.com” or “fb-login.net”)?

This will help us understand whether LLMs can generalize beyond test datasets and into the complex battlefield of real-world DNS traffic.

Funding Acknowledgments

This research was supported in part by grant PID2023-151467OA-I00 (CRAPER), funded by MICIU/AEI/10.13039/501100011033 and by ERDF/EU, by grant TED2021-131115A-I00 (MIMFA), funded by MICIU/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR, and the University of Zaragoza, by grant Proyecto Estratégico Ciberseguridad EINA UNIZAR, funded by the Spanish National Cybersecurity Institute (INCIBE) and the European Union NextGenerationEU/PRTR, by grant Programa de Proyectos Estratégicos de Grupos de Investigación (DisCo research group, ref. T21-23R), funded by the University, Industry and Innovation Department of the Aragonese Government, and by the RAPID project (Grant No. CS.007) financed by the Dutch Research Council (NWO).


And that’s all, folks! In this post, we’ve shared the highlights from our DIMVA 2025 poster on using LLMs to detect AGDs in a zero-shot setting. Our findings show that while LLMs can detect algorithmically generated domains with considerable accuracy (without any tuning), they also carry real risks: high false positive rates, overconfident errors, and sensitivity to prompt design. This means they are not yet ready for frontline deployment, but they are promising tools for experimentation, prototyping, and expanding the horizons of AI-based malware detection.

If you’re at DIMVA 2025, visit our poster for a chat! And if you’re curious to dig deeper, we invite you to read the full article. We hope this work will spur further research on integrating general-purpose LLMs into cybersecurity processes, in a responsible and rigorous manner. Thanks for reading!


Declaration of Generative AI Technologies in the Writing Process

During the preparation of this post, the author used ChatGPT (GPT-4o model) to improve readability and language. After using this tool, the author reviewed and edited the content as necessary and takes full responsibility for the content of this publication.