Smudge release notes

Dennis Andriesse
December 2025

Smudge is a taint engine, or rather, a taint engine generator. It can automatically learn taint propagation rules (how the taint should propagate from input to output operands when a particular type of instruction executes) for a platform. It attempts to do this by fuzzing instructions, meaning it runs each type of instruction lots of times with many different inputs, and then observes which output bits are affected. Note that Smudge is a binary-level taint engine generator, meant for doing taint analysis on binaries, not for the far easier problem of adding taint analysis instructions at the source code level.

This is inspired by TaintInduce¹, but unlike TaintInduce the aim is to generate a stand-alone, reusable taint engine with “low” overhead, rather than an on-the-fly engine that needs to be regenerated in real time on each run of a program. I have decided to finally release the code, even though Smudge is unfinished and not in active development, and it is unlikely that I will ever finish it.

This has been sitting half-forgotten on my hard drive for a long time, and I figure at least this way, someone might get some use out of it. I will include a few more details about the project below, but please note that although the code is free to use (see license in the repository), I do not intend to provide detailed support for it. So, be prepared to figure out the code on your own. That said, I am of course happy to answer simple questions. So if you want to take this and play around with it, then please do. If you do, I would love to hear from you.

If you are still not discouraged, obtain the code using the following command.

git clone https://github.com/dennisaa/smudge.git

Now, some more background. I started this project towards the end of my time as a postdoc in the VUSec research group at Vrije Universiteit Amsterdam. I was annoyed at the lack of a good, mature taint engine for x86-64. The idea (inspired, as I mentioned above, by TaintInduce) was that part of the problem is the amount of work it takes to write taint propagation rules for an instruction set, especially one as huge as x86, and to keep up with changes over time. For instance, libdft was a pretty usable taint engine for a while (my favorite, in fact), but became outdated because it was never extended with official 64-bit support. (There have been a few ad-hoc attempts to do this, but these were implemented in the service of other research projects and not as a goal in their own right. Hence, they are more like quick and dirty prototypes and not very usable in practice. I believe Triton also features some taint support, but last time I checked it was not mature, either.) It would be much better if we could automatically infer those rules.

Whereas TaintInduce does this in a sort of on-the-fly transient way, I wanted to create an explicit taint engine generator that can output a stand-alone taint engine for a given platform, without the overhead of rule creation at runtime. Moreover, I wanted it to be as agnostic as possible of both the front-end (the instruction set of your choice) and the back-end (the binary instrumentation platform you use to instrument your programs and propagate taint at runtime, such as Intel Pin).

Here’s a diagram of how it works.

[Figure: Smudge design diagram]

Essentially, Smudge takes as input a list of instructions for which to generate taint propagation rules. This list can be generated automatically by harvesting opcodes from a set of programs. The script scanner/mine_insns.sh included with the code does this.

The Smudge fuzzer then takes this list of instructions as input and fuzzes each instruction, executing it many times while flipping bits in the input operands and keeping track of how this affects the output operands. It logs these observations, and then uses a rule engine to infer taint propagation rules. As a trivial example, for the mov instruction, the taint from a bit in the input operand should propagate to the corresponding bit in the output operand. Smudge also keeps track of effects on the EFLAGS register and can infer rules at different granularities (bit, byte, word).
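
To make the bit-flipping idea concrete, here is a minimal sketch that simulates it in Python. It is not Smudge's fuzzer: plain Python functions stand in for real machine instructions (the real fuzzer executes them in a child process), operands are 8 bits wide, and the names infer_bit_rule, mov, and add are made up for this illustration.

```python
import random

def infer_bit_rule(op, width=8, trials=200, seed=0):
    """Observe which output bits change when each bit of the first operand flips.

    Returns a dict mapping input bit index -> set of affected output bit indices.
    """
    rng = random.Random(seed)
    mask = (1 << width) - 1
    rule = {i: set() for i in range(width)}
    for _ in range(trials):
        a = rng.getrandbits(width)
        b = rng.getrandbits(width)
        base = op(a, b) & mask
        for i in range(width):
            # Flip one input bit and record every output bit that differs.
            diff = base ^ (op(a ^ (1 << i), b) & mask)
            rule[i].update(o for o in range(width) if (diff >> o) & 1)
    return rule

def mov(a, b):
    # Stand-in for "mov dst, src": the result is just the source operand.
    return a

def add(a, b):
    # Stand-in for "add dst, src": carries can spread a flip to higher bits.
    return a + b

mov_rule = infer_bit_rule(mov)
add_rule = infer_bit_rule(add)
```

In this simulation, mov_rule maps every input bit to exactly the same output bit. For add, each input bit always affects its own output bit, and which higher bits show up depends on whether the random inputs happened to trigger the right carry chains.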

The end result produced by the rule engine is a database of taint propagation rules in an intermediate language (taint IR). This is then fed into a backend generator that translates these IR rules into concrete taint propagation functions for a DBI (dynamic binary instrumentation) tool of choice, such as Intel Pin or DynamoRIO. There is a general purpose shadow memory library included that can be used by any C++-based DBI to keep track of a shadow memory to store taint state, so that you don’t need to reinvent the wheel every time you add a new backend.
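
For readers unfamiliar with shadow memory, the following is a rough sketch of the concept at byte granularity. It is written in Python for brevity and does not reflect the API of the C++ library shipped with Smudge. The class name, method names, and page size are assumptions chosen for illustration.

```python
PAGE_SIZE = 4096  # assumed shadow page size, for illustration only

class ShadowMemory:
    """Toy byte-granular shadow memory: one taint tag per application byte."""

    def __init__(self):
        # Shadow pages are allocated lazily: page number -> bytearray of tags.
        self._pages = {}

    def set_taint(self, addr, size, tag=1):
        for a in range(addr, addr + size):
            page = self._pages.setdefault(a // PAGE_SIZE, bytearray(PAGE_SIZE))
            page[a % PAGE_SIZE] = tag

    def get_taint(self, addr):
        page = self._pages.get(addr // PAGE_SIZE)
        return page[addr % PAGE_SIZE] if page is not None else 0

    def copy_taint(self, dst, src, size):
        # Read all source tags first so overlapping ranges behave predictably.
        tags = [self.get_taint(src + i) for i in range(size)]
        for i, tag in enumerate(tags):
            self.set_taint(dst + i, 1, tag)

shadow = ShadowMemory()
shadow.set_taint(0x7fff0d72e390, 5)               # e.g. a 5-byte network buffer
shadow.copy_taint(0x1000, 0x7fff0d72e390, 5)      # a mov-like copy carries the taint
```

A generated propagation function for a mov-like instruction would boil down to something like copy_taint, while untouched addresses simply read back as untainted.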

You can then use your preferred backend (DBI) along with the provided taint library to instrument and track taint in whatever binary programs you want.

In theory, to support a new ISA, all you need to do is mine a list of instructions and have Smudge fuzz it, and to support a new DBI, you simply implement a new backend generator to translate taint IR rules to a form usable by the new DBI. In practice, some platform specific work is needed to support a new ISA in the fuzzer, but I have tried (and perhaps partly succeeded) to keep this to a minimum.

I even got as far as creating an end-to-end proof of concept of the idea. To reproduce it, download the code, install the dependencies outlined in the README.md file, and build the various components with their respective Makefiles. I tested this on a somewhat older laptop with a Core i7-1185G7 CPU running Ubuntu 22.04 with kernel 6.8.0-90-generic. Some of the dependencies are included in the bin and lib directories for archival reasons.

First, navigate to the fuzzer directory, make, and run the command sudo ./insfuzz -r smudge.poc.rules -v. This invokes the fuzzer, which will sit idle until you feed it some instructions to fuzz. (Note that there will be a message about a stopped process. This is normal. It is a child process that is host to the instructions that are being fuzzed.)

To feed instructions to the fuzzer, open another terminal and navigate to the scanner directory, where you run cat x86.db > /tmp/smudge_insfuzz_<pid> where <pid> is the process ID of the fuzzer. This writes an instruction database of opcodes to fuzz into the FIFO opened by the fuzzer process. Note that the database used in this example (x86.db) is far from exhaustive and is meant for testing purposes only. You can also create your own instruction database using the mine_insns.sh script in the scanner directory if you want.

The fuzzer may take a few minutes to complete and should yield a file called smudge.poc.rules, which is the taint rules file (in taint IR format). This is a text-based format that is somewhat human readable, so you can open the rules file and try to make sense of it. Note that the fuzzed instructions may cause various signals and memory access failures which will be reported by the fuzzer. This is normal and to be expected, and won’t invalidate the generated taint rules.

After running the fuzzer, you could conceptually run an optimizer that refines the taint rules generated by the fuzzer. There is an optimizer directory reserved for this, but no optimizer is currently implemented.

Once you have a taint rules file you can run one of the DBI tools to see the taint tracking at work. The current proof of concept is based on Pin 3.17. The taint propagation rules for Pin are in the file dbi/pin/lib/taint.cc. This file is automatically generated by the Makefile from the taint rules file emitted by the fuzzer. The program that translates the rules is called ir2dbi-pin. To add a new backend, you would create an analogous program that translates the rules for use by the new DBI.

You can run one of the POC tests as follows.

$ cd smudge/dbi/pin
$ make
$ ./run-pin.sh tools/obj-intel64/smudge-test-pin-syscall-hooks.so tests/udp-loopback
Running command 'pin -t tools/obj-intel64/smudge-test-pin-syscall-hooks.so -- tests/udp-loopback'
Smudge test (Pin)
Initializing Smudge
Inserting syscall hooks
Starting program
Test init
Running test
Message sent
entry: recvfrom (thread_id=0, syscall_id=45, buf_addr=0x7fff0d72e390)
exit: recvfrom (thread_id=0, syscall_id=45, ret=0x5, buf=0x7fff0d72e390)
Tainting 5 bytes @ addr 0x7fff0d72e390
Message received (buffer@0x7fff0d72e390): Hello
Test fini
Terminating

This runs a target program (in this case udp-loopback) instrumented with a taint tool (in this case smudge-test-pin-syscall-hooks.so) that performs taint tracking in the target program. The udp-loopback program sends itself a message via UDP. Any data received over the network is automatically tainted by the taint tool, which hooks the recvfrom system call, as you can tell from the output in the above example. The idea is that we can now track this taint and raise an alert if tainted data is used in a dangerous way, for example as a return address or as a parameter to execve. (To implement this you would extend the example taint tool to hook the execve system call as well as recvfrom.)
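
The source/sink policy described above can be sketched as follows. This is a toy model in Python, not the Pin tool from the repository: the class, the hook names, and the exception are hypothetical, and taint is kept in a plain set of addresses instead of a real shadow memory.

```python
class TaintAlert(Exception):
    """Raised when tainted data reaches a sensitive sink."""

class ToyTaintTool:
    def __init__(self):
        self.tainted = set()  # addresses of tainted bytes

    def on_recvfrom_exit(self, buf_addr, ret):
        # Taint source: mark the bytes actually received (ret > 0) as tainted.
        if ret > 0:
            self.tainted.update(range(buf_addr, buf_addr + ret))

    def on_execve_entry(self, arg_addr, arg_len):
        # Taint sink: refuse to continue if any argument byte is tainted.
        hits = [a for a in range(arg_addr, arg_addr + arg_len) if a in self.tainted]
        if hits:
            raise TaintAlert(f"execve argument overlaps {len(hits)} tainted byte(s)")

tool = ToyTaintTool()
tool.on_recvfrom_exit(0x7fff0d72e390, 5)   # mirrors "Tainting 5 bytes" in the log above
```

Calling tool.on_execve_entry with a range that overlaps the received buffer raises TaintAlert, while an unrelated range passes silently. A real tool would additionally rely on the instruction-level propagation rules to follow the taint as the data is copied around.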

Smudge is able to install hooks for taint tracking at the system call level and at the instruction level. The example tools and tests are fairly straightforward, so you can have a look at the code to see how this is achieved and how it can be extended to create more complex cases.

Anyway. The idea was nice (I still think it is). But I left academia for Intel, and my priorities changed. By now this has been sitting on my hard drive for at least five years (I think). And now my focus has changed once more, from computer science to physics. So, I seriously doubt I will ever get around to finishing this.

But hey, an end-to-end proof of concept is better than nothing. And who knows, maybe someone will even finish this, or put it to some other use. In case you want to go for it, here are some ideas:

Part of the problem with the taint propagation rules is that it’s very hard to get reliable coverage of all input/output operand relations with the fuzzer. The inferred rule will almost always miss some bit mappings (especially for instructions where the mapping is non-trivial, such as add or mul). I think this is probably fixable with some kind of AI (I know, I know) that’s trained to recognize “fuzzy” rules like this and transform them into exact rules. After all, I think there’s only a relatively limited number of instruction categories as far as operand relationships go, so this is likely relatively straightforward for someone who’s good with AI. I would probably implement this as an optimizer pass that runs after the fuzzer.
I wanted to make everything as platform-agnostic as possible, but in the end I did take shortcuts here and there. This can probably be improved.
I’m sure the backend taint propagation/shadow memory library can use some optimization to make things go faster at runtime (though this should be of little concern as long as the emitted taint engine itself is not complete).
I’m sure some of the assumptions and libraries I used have become outdated in all the years I didn’t work on this.
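
As a rough illustration of the first idea in the list above, here is what a very simple optimizer pass could look like. Note that this is plain template matching and not the AI approach suggested there, and that the rule representation (a dict from input bit to a set of output bits), the function name, and the two templates are assumptions made for this sketch only.

```python
def generalize(rule, width):
    """Map an observed, possibly incomplete bit rule onto an exact template.

    Returns (template_name, exact_rule). Unrecognized rules are returned as-is.
    """
    # Template 1: every input bit affects only the same output bit (mov-like).
    if all(rule[i] == {i} for i in range(width)):
        return "identity", {i: {i} for i in range(width)}
    # Template 2: every input bit affects itself and only higher bits, as
    # carries do in add. Fill in all higher bits, observed or not.
    if all(i in rule[i] and min(rule[i]) == i for i in range(width)):
        return "carry-up", {i: set(range(i, width)) for i in range(width)}
    return "unknown", rule

# A 4-bit rule with gaps, as a fuzzer with limited coverage might produce.
observed = {0: {0, 1, 2}, 1: {1, 2, 3}, 2: {2}, 3: {3}}
name, exact = generalize(observed, 4)
```

Here the gaps in the observed rule are filled in conservatively, so the result may overtaint but does not miss carry propagation. One obvious limitation is that a carry-style rule whose carries were never observed at all would be misclassified as identity, which is where a smarter classifier would come in.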

-DAA

¹ One Engine To Serve 'em All: Inferring Taint Rules Without Architectural Semantics, Chua et al., NDSS 2019, https://taintinduce.github.io/