Scalable Similarity Detection in Digital Forensics? Meet APOTHEOSIS


TL;DR: In digital forensics, finding “similar” files—whether plagiarized source code, near-identical documents, or any other artifact—is often more important than finding identical files. Traditional hashing? Too rigid. Brute-force matching? Too slow. That’s where APOTHEOSIS, our new, extensible, fast, and forensics-friendly approximate similarity search system, comes in. Think radix trees, HNSW graphs, and similarity digest algorithms (SDAs) in a single, ready-to-use open-source package. Interested? We’re unveiling more details this Thursday at DFRWS USA 2025.

Why is similarity detection important in forensic investigation?

Digital forensic analysts routinely handle millions of files on devices, logs, memory dumps, and documents. But exact duplicates aren’t the only threat: attackers modify binaries, rename libraries, or insert obfuscated code. To detect these modified artifacts, we need to go beyond simple hash matches.

This is where similarity digest algorithms (SDAs) come into play. They generate digests that can be compared to estimate how similar two inputs are, not just whether they are identical. Combined with approximate nearest neighbor search, they give us a powerful tool for detecting manipulated, duplicated, or modified files at scale.
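To make the contrast with cryptographic hashing concrete, here is a minimal Python sketch. The `toy_digest` and `similarity` functions are hypothetical helpers written for this post; they are not ssdeep or TLSH and not part of APOTHEOSIS. The toy digest is simply the set of CRC32 values of every 5-byte window, and two digests are compared with Jaccard similarity.

```python
import hashlib
import zlib


def toy_digest(data: bytes, n: int = 5) -> frozenset:
    """Toy similarity digest (an assumption for illustration, NOT ssdeep/TLSH):
    the set of CRC32 values of every n-byte sliding window."""
    windows = max(1, len(data) - n + 1)
    return frozenset(zlib.crc32(data[i:i + n]) for i in range(windows))


def similarity(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity between two toy digests, in [0, 1]."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = b'int main(void) { puts("hello forensic world"); return 0; }'
modified = original.replace(b"hello", b"HELLO")  # a small edit
unrelated = b"SELECT name FROM users WHERE id = 42 ORDER BY name;"

# A cryptographic hash only tells us the two inputs are not identical.
same_sha256 = hashlib.sha256(original).digest() == hashlib.sha256(modified).digest()

# The toy digest still sees that most of the content is shared.
score_modified = similarity(toy_digest(original), toy_digest(modified))
score_unrelated = similarity(toy_digest(original), toy_digest(unrelated))

print(f"sha256 equal: {same_sha256}")
print(f"original vs modified:  {score_modified:.2f}")
print(f"original vs unrelated: {score_unrelated:.2f}")
```

Real SDAs produce compact, fixed-format digests and define their own comparison functions, but the idea is the same: a small edit should move the score only a little.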

So, what did we build?

We built APOTHEOSIS, a system that performs two powerful functions:

Exact hash lookups using a memory-efficient radix tree.
Approximate similarity search using a custom implementation of HNSW (Hierarchical Navigable Small World) graphs tailored for discrete hash values.
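The exact-lookup side can be sketched with a plain character trie. This is a simplification and not the APOTHEOSIS implementation: a real radix tree additionally merges chains of single-child nodes to save memory. The `DigestTrie` class and the digest strings below are made-up placeholders that are only shaped like ssdeep output. Note that a lookup walks one node per character, so its cost depends on the digest length, not on how many digests are stored.

```python
_END = ""  # sentinel key marking "a digest ends here"; cannot clash with a 1-char key


class DigestTrie:
    """Plain character trie for exact digest lookups (a simplified radix tree)."""

    def __init__(self):
        self._root = {}
        self._size = 0

    def __len__(self):
        return self._size

    def insert(self, digest: str, payload) -> None:
        node = self._root
        for ch in digest:
            node = node.setdefault(ch, {})
        if _END not in node:
            self._size += 1
        node[_END] = payload

    def lookup(self, digest: str):
        """Return the stored payload, or None if the digest is not present."""
        node = self._root
        for ch in digest:
            node = node.get(ch)
            if node is None:
                return None
        return node.get(_END)


index = DigestTrie()
index.insert("3:abcDEFghiJKL:mnoPQR", "sample_a.bin")  # placeholder digest
index.insert("3:abcDEFxyzJKL:mnoXYZ", "sample_b.bin")  # placeholder digest

print(index.lookup("3:abcDEFghiJKL:mnoPQR"))  # exact hit
print(index.lookup("3:abcDEFghiJKL:mnoPQX"))  # one character off: miss
```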

It’s like Elasticsearch for similarity digests, optimized for forensic use cases. In particular, our system:

Supports multiple SDAs (such as ssdeep and TLSH).
Works via local deployment or remote REST API.
Handles millions of records with real-time queries.
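To give a feel for the approximate side, the sketch below builds a single-layer navigable graph and runs a best-first search over it. It is a heavily simplified, hypothetical stand-in: HNSW stacks several such layers and prunes links, and our implementation uses the comparison function of each SDA rather than the Hamming distance assumed here. The class and parameter names (`ToyProximityGraph`, `max_links`, `ef`) are ours for illustration only.

```python
import heapq


def hamming(a: str, b: str) -> int:
    """Stand-in discrete distance between equal-length digests (an assumption;
    real SDAs such as ssdeep or TLSH define their own comparison scores)."""
    if len(a) != len(b):
        raise ValueError("digests must have equal length")
    return sum(x != y for x, y in zip(a, b))


class ToyProximityGraph:
    """Single-layer navigable graph; HNSW stacks several such layers."""

    def __init__(self, max_links: int = 3, ef: int = 8):
        self.max_links = max_links  # links created per inserted node
        self.ef = ef                # size of the result list kept during search
        self.edges = {}             # digest -> set of neighbouring digests
        self.entry = None           # fixed entry point for every search

    def _best_first(self, query: str, ef: int):
        """Return up to `ef` closest visited nodes as sorted (distance, digest)."""
        if self.entry is None:
            return []
        d0 = hamming(query, self.entry)
        visited = {self.entry}
        candidates = [(d0, self.entry)]  # min-heap: closest candidate first
        best = [(-d0, self.entry)]       # max-heap: worst kept result on top
        while candidates:
            dist, node = heapq.heappop(candidates)
            if len(best) >= ef and dist > -best[0][0]:
                break  # no remaining candidate can improve the result list
            for nb in self.edges[node]:
                if nb in visited:
                    continue
                visited.add(nb)
                d = hamming(query, nb)
                if len(best) < ef or d < -best[0][0]:
                    heapq.heappush(candidates, (d, nb))
                    heapq.heappush(best, (-d, nb))
                    if len(best) > ef:
                        heapq.heappop(best)
        return sorted((-neg, node) for neg, node in best)

    def insert(self, digest: str) -> None:
        if digest in self.edges:
            return
        nearest = self._best_first(digest, self.ef)[: self.max_links]
        self.edges[digest] = set()
        if self.entry is None:
            self.entry = digest
        for _, nb in nearest:  # bidirectional links, no pruning in this sketch
            self.edges[digest].add(nb)
            self.edges[nb].add(digest)

    def search(self, query: str, k: int = 1):
        return self._best_first(query, max(self.ef, k))[:k]


graph = ToyProximityGraph()
for d in ["AAAAAAAA", "AAAAAAAB", "AAAABBBB", "CCCCCCCC", "CCCCCCDD", "ZZZZZZZZ"]:
    graph.insert(d)

print(graph.search("AAAAAAAC", k=2))
```

The search only computes distances for nodes it reaches by following links, which is what lets graph-based indexes avoid comparing the query against every stored digest.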

What did we test?

We evaluated APOTHEOSIS in two use cases:

Use Case 1: Source Code Plagiarism Detection / Evaluation Metrics

We processed approximately 44,000 source code files from a university dataset and tested the effectiveness of APOTHEOSIS in detecting duplicate or modified submissions using ssdeep, TLSH, and MinHashLSH.

Our key findings are as follows:

Zero false positives with APOTHEOSIS using both ssdeep and TLSH.
Recall and F1 score improve as we increase the similarity threshold.
Compared to MinHashLSH, our approach is more reliable at high similarity thresholds and does not fail when finding exact matches.
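For readers less familiar with these metrics, here is how precision, recall, and F1 follow from raw match counts. The counts in the example are hypothetical and are not results from our evaluation; they only show that zero false positives means a precision of 1, so F1 is then driven entirely by recall.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Return (precision, recall, F1) from raw match counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, NOT results from our evaluation:
# 80 true matches found, 0 false alarms, 20 true matches missed.
precision, recall, f1 = precision_recall_f1(tp=80, fp=0, fn=20)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```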

Use Case 2: Searching for Memory Artifacts at Scale / Performance at Large

We collected over 4.2 million memory pages from Windows system libraries (from Windows 7 to Server 2019) and tested the scalability of APOTHEOSIS.

The key takeaways from this experiment are:

Even with large datasets, insertion and search times scale well.
We confirmed logarithmic times for approximate insertions and searches, and exact hash lookups whose time grows near-linearly with the hash length rather than with the number of stored records.
Our system enables rapid allow-listing of known files, a major advantage during incident response.

Real-World Impact

Imagine using APOTHEOSIS during a forensic case:

Extract your artifacts.
Calculate similarity digests and insert them into the system.
Instantly check if they match any other file (exactly or approximately).
If you have a previous allow-list of artifacts, you can also check against it.

This helps you quickly exclude benign data and focus only on suspicious artifacts. Whether filtering out noise, detecting tampering, or confirming duplicates, APOTHEOSIS makes the job easier.
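The workflow above can be sketched in a few lines of Python. Everything here is a hypothetical illustration rather than the APOTHEOSIS API: `toy_digest` is a toy stand-in for a real SDA, a linear scan replaces the radix tree and HNSW lookups, and the file names and the 0.6 threshold are made up.

```python
import zlib


def toy_digest(data: bytes, n: int = 5) -> frozenset:
    """Toy similarity digest (illustrative assumption, NOT ssdeep/TLSH)."""
    windows = max(1, len(data) - n + 1)
    return frozenset(zlib.crc32(data[i:i + n]) for i in range(windows))


def jaccard(a: frozenset, b: frozenset) -> float:
    return len(a & b) / len(a | b) if (a or b) else 1.0


def triage(artifacts: dict, allow_list: dict, threshold: float = 0.6) -> dict:
    """Label each artifact as 'known', 'similar', or 'review'.

    A linear scan stands in for the radix-tree and HNSW lookups.
    """
    known = {name: toy_digest(data) for name, data in allow_list.items()}
    exact = {digest: name for name, digest in known.items()}
    verdicts = {}
    for name, data in artifacts.items():
        digest = toy_digest(data)
        if digest in exact:                    # step 1: exact digest match
            verdicts[name] = ("known", exact[digest])
            continue
        best_name, best_score = None, 0.0      # step 2: closest allow-listed digest
        for known_name, known_digest in known.items():
            score = jaccard(digest, known_digest)
            if score > best_score:
                best_name, best_score = known_name, score
        if best_score >= threshold:
            verdicts[name] = ("similar", best_name)
        else:
            verdicts[name] = ("review", None)  # nothing close: needs an analyst
    return verdicts


baseline = b"The quick brown fox jumps over the lazy dog near the riverbank at dawn."
allow_list = {"baseline.bin": baseline}
artifacts = {
    "copy.bin": baseline,
    "patched.bin": baseline.replace(b"lazy", b"LAZY"),
    "other.bin": b"completely different payload 0123456789",
}

verdicts = triage(artifacts, allow_list)
for name, verdict in verdicts.items():
    print(name, verdict)
```

Only the artifacts labeled for review would need an analyst's attention; the exact and near matches can be set aside or inspected for tampering.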

Limitations? We’ve Got You (Almost) Covered

No system is perfect. Here’s where APOTHEOSIS still has room for improvement:

Like all graph-based approaches, HNSW needs tuning and can be memory-intensive.
It can’t work if your SDA doesn’t return a valid digest (e.g., a file that’s too small).
REST APIs need protection against abuse (but we already include safeguards such as rate limiting).

We’re actively exploring:

More SDA types (and better fallback strategies)
Other (approximate) nearest neighbor search techniques
Real-world adversarial testing for evasion resilience

Try It Yourself

APOTHEOSIS is free and open source, released under the GNU GPLv3 license. Integrate it into your workflow as a library or API, swap in your favorite SDA, and start searching!

Want to learn more? Visit us at DFRWS USA 2025 or read the full article here (or here).

One Last Thing

Whether you’re investigating cyberattacks, searching for malware in memory, or detecting code plagiarism, similarity matters. APOTHEOSIS puts scalable similarity search in your hands. It’s fast. It’s forensics-friendly. And it’s open to collaboration.

Let’s raise the bar for digital forensic tools, together.

Funding Acknowledgments

This research was supported in part by grants PID2020-113903RB-I00 (KIT-IA) and PID2023-151467OA-I00 (CRAPER), funded by MICIU/AEI/10.13039/501100011033 and by ERDF/EU, by grant TED2021-131115A-I00 (MIMFA), funded by MICIU/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR, and the University of Zaragoza, by grant Proyecto Estratégico Ciberseguridad EINA UNIZAR, funded by the Spanish National Cybersecurity Institute (INCIBE) and the European Union NextGenerationEU/PRTR, and by grant Programa de Proyectos Estratégicos de Grupos de Investigación (refs. T21-23R and T42-23R, respectively), funded by the University, Industry and Innovation Department of the Aragonese Government.

And that’s all, folks! In this post, we introduced APOTHEOSIS, our scalable and extensible system for rapid similarity digest lookup and approximate search in digital forensics. Whether you’re reviewing memory dumps or hunting for source code clones, APOTHEOSIS is designed to help you find what’s similar enough to be relevant, quickly and reliably. If you’re attending DFRWS USA 2025 this week, don’t miss our presentation; we’d love to talk with you about radix trees, HNSW graphs, or anything else that sparks your interest. Until then, check out the article (in this link or this other link), dig into the code, or contact us if you’re interested in collaborating. Thanks for reading and happy hunting!

Declaration of Generative AI Technologies in the Writing Process

During the preparation of this post, the author used ChatGPT (GPT-4o model) to improve readability and language. After using this tool, the author reviewed and edited the content as necessary and takes full responsibility for the content of this publication.

About the author

Ricardo J. Rodríguez
