
1 Introduction

During the last decade, embedded systems have not only gained tremendously in number, capabilities, and connectivity, but also in their relevance for safety and security within the physical world [13, 21]. Cyber-physical systems are an integral part of safety- and security-critical applications such as health care, automotive, production, and infrastructure. As part of the Internet of Things (IoT), they have become increasingly interconnected, often using wireless interfaces that are particularly exposed to potential attacks [20]. In addition to this high exposure and security impact, rolling out security updates to embedded systems in the field is in many cases complicated, if possible at all. It is therefore necessary to identify security vulnerabilities in embedded systems before deployment.

1.1 Fuzz Testing Embedded Systems

Fuzz testing is one of the major methods used to identify security vulnerabilities [8, 17]. A central factor for the effectiveness of a fuzzer is how quickly good code coverage of the tested software can be achieved. In practice, not only the bare run-time of the fuzzer is relevant, but also the time a human security tester spends setting up and adapting the fuzzer to the system and interface under test. “Coverage-guided greybox fuzzing” promises to cut down on both the manual setup time and the run-time of the fuzzer [16, 22, 27]. By utilizing feedback collected at run-time, it adapts the fuzzer to the current target automatically and on the fly. Manual preparation and reverse-engineering effort is thereby minimized, and the fuzzer’s effectiveness improves. A coverage-guided fuzzer measures the code coverage achieved by each test case in order to detect exercised application behavior. Test cases that trigger yet unseen application functionality are selected as seeds for the corpus of input values used to generate subsequent test cases. This results in a feedback loop, leading to a rapid exploration of the software under test. As a result, coverage-guided fuzzers appear to “learn” the input structure expected by their test subjects, even without any prior information about their target.
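The core of this feedback loop fits in a few lines. The following Python sketch is purely illustrative; the helper names run_and_measure and mutate are placeholders, not part of any particular fuzzer:

```python
import random

def coverage_guided_loop(corpus, run_and_measure, mutate, iterations):
    """Minimal coverage-guided fuzzing loop (illustrative sketch).

    corpus          -- list of seed inputs (bytes)
    run_and_measure -- executes an input, returns the set of covered code locations
    mutate          -- derives a new input from a seed
    """
    global_coverage = set()
    # Bootstrap: measure the coverage of the initial seeds.
    for seed in corpus:
        global_coverage |= run_and_measure(seed)

    for _ in range(iterations):
        seed = random.choice(corpus)            # naive seed scheduling
        test_case = mutate(seed)                # generate a new test case
        coverage = run_and_measure(test_case)
        if coverage - global_coverage:          # yet unseen application behavior?
            corpus.append(test_case)            # promote the input to a seed
            global_coverage |= coverage
    return corpus, global_coverage
```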

Currently, coverage-guided fuzzing is applied with great success to conventional software running on a PC, where it is relatively easy to obtain coverage information. A natural approach to coverage-guided fuzzing of an embedded system would therefore be to move its code from the embedded platform to a PC. This can be achieved through rehosting [1, 9], i.e. cross-compiling the embedded system’s source code for the host architecture. Emulation of the embedded platform [9, 18, 23, 26] is an option if the source code is unavailable (as is often the case in practice [25]). A major hurdle for rehosting and emulation, however, are the hardware dependencies of embedded software. In fact, all hardware interactions of an application either need to be mocked up or omitted sensibly to maintain testable functionality. While emulation platforms support an impressive number of embedded CPU architectures, they frequently lack out-of-the-box support for the overwhelming diversity of peripheral interfaces incorporated by modern MCU architectures, and even more so for the external peripheral chips and sensors found in embedded systems. This is underlined by recent research activity seeking to reduce the manual effort involved in addressing hardware dependencies. These works propose to handle peripheral interaction by forwarding communication to the real device [12, 14, 18], by automated modeling of peripheral behavior [11], or by inferring expected access results from firmware inspection [6]. Nevertheless, these approaches may not completely eliminate the involved development effort and are frequently limited to non-real-time or simple patterns of interaction. Hence, a substantial gap remains between the hardware architectures that can be sufficiently replicated and those deployed in practice.

1.2 Hardware Tracing as Feedback Channel

Monitoring the behavior of embedded software at run-time is also desirable during regular firmware development and debugging [19]. To this end, most modern embedded platforms support a form of hardware instrumentation through so-called “tracing interfaces”. The widely used ARMv7-M architecture we chose for our study supports multiple tracing interfaces:

  • The “parallel trace port” utilizes five pins as a clocked parallel interface for so-called “ETM” trace transmission. This interface provides high bandwidth and is capable of delivering the full sequence of executed instruction addresses in real time. Its use in coverage-guided fuzzing has been proposed and demonstrated previously [7, 15]. In general, the applicability of such hardware tracing as coverage feedback for fuzzers is not new and has a well-proven track record in the x86 software domain [7, 10, 24].

    Unfortunately, the parallel tracing interface is rarely available in finished embedded products. The required pins are usually omitted in smaller-footprint chip packages. In many of the remaining cases, at least some of the required pins are used by the application as general-purpose inputs/outputs (GPIOs).

  • The single wire output (SWO) interface is an alternative tracing interface that requires only a single pin, making it widely available and usable in many ARM Cortex-M3, -M33, -M4, and -M7 based devices [2, 3]. However, its bandwidth is orders of magnitude lower than that of the parallel port. This limitation is accommodated by the dedicated “periodic program counter (PC) sampling” trace mode: the PC of the monitored central processing unit (CPU) is sampled at a configurable clock interval, resulting in a sparse and erratic view of the traced application behavior.

If security analysts could choose the MCU package and pin layout, ETM tracing via the parallel trace port would certainly be the favorable option [15]. In practice, however, testers are rarely able to influence these decisions, as we have experienced throughout our work as embedded security testers. Consequently, SWO is typically the only interface available in existing applications, especially during third-party evaluations.

1.3 Contributions

We argue that even severely restricted hardware tracing may constitute a valuable feedback channel that enables effective coverage-guided fuzz testing. To this end, we provide the first demonstration of coverage-guided fuzz testing of embedded applications using heavily constrained tracing information. In particular, we show how coverage-guided fuzzing can be enabled to detect novel application behaviors based on the erratic information available from the SWO tracing interface of the widespread ARMv7-M architecture. We established this mutation-based coverage-guided fuzzing algorithm by making the following three contributions:

  1. First, we devised a heuristic to approximate the instruction coverage exercised by a test case based on periodic PC sampling. Beyond repeated tracing, we conduct static analysis of the target application’s binary and combine this with hardware breakpoints to facilitate fast and reliable coverage estimation.

  2. Based on the instruction coverage, we present a seed selection strategy that leverages this heuristic to select fuzz inputs that exercise new areas of code. Seed inputs identified in this manner improve subsequent test case generation and augment a corpus of initially provided example inputs.

  3. Finally, the selection strategy was integrated with a seed scheduling and mutation strategy to obtain a full coverage-guided fuzzing algorithm. The result is benchmarked against a similarly configured blackbox mutation-based fuzzer. Empirical evidence indicates that the coverage-guided fuzzer consistently outperforms the blackbox approach, both in the total number of vulnerabilities found and in the coverage achieved over time.

Our overall goal is to demonstrate the feasibility of coverage-guided fuzz testing based on sparse and erratic feedback from constrained hardware tracing interfaces, and to encourage further research.

2 Setup for “Bare-Metal” Coverage-Guided Fuzz Testing

To conduct fuzz testing, a “test host” running the fuzzer is connected to the “target” embedded system via three interfaces as depicted in Fig. 1.

First, the test host must provide a counterpart for the peripheral interface that is to be fuzz tested. In a simple case, this may be a common feature of desktop systems, such as a Bluetooth or USB interface. Other cases may require the addition of a dedicated hardware adapter to interact with a specific target interface, such as CAN or ZigBee.

A debug probe connects the host to the serial wire debug (SWD) or Joint Test Action Group (JTAG) interface of the target MCU. This enables the fuzzer to configure the tracing interface, reset the target if required, read fault registers, or set breakpoints.

Finally, the trace data must be captured via the SWO interface of the target. We used a Segger J-Trace debug and trace probe that conveniently combines the debug and trace interfaces in a single device and is capable of receiving data from the SWO interface at up to 50 MBaud. As SWO is essentially a universal asynchronous receiver/transmitter (UART) interface, a simple debug probe combined with a separate fast UART adapter can also be used.

Fig. 1. Block diagram of the components involved in the execution and tracing of a test case.

Fig. 2. NRF board connected to the debug and trace probe. STM board on a custom break-out board.

2.1 Evaluation of a Test Case

In the described setup, the application and tracing of a test case proceed as follows:

  • Preparation of the target. The application is set up for testing. If the MCU has not already been started, it is powered up and allowed to execute its initialization routine. Then, peripherals are set up and the control flow enters an event loop, usually performing periodic tasks or waiting for events requiring further action. At this point, the application is ready to accept input data.

  • Start tracing. The tracing interface is configured and enabled to monitor the CPU’s execution. From this point on, trace data is generated.

  • Execution of the test case. The test input is supplied to the peripheral interface and processed by the tested application. Simultaneously, it is monitored for output data. The end of a test run is determined by either exceeding a time limit or receiving an expected output value.

  • Stop tracing. The tracing interface is disabled to stop the generation of further trace data. All pending trace packets are collected.

  • Checking for faults. Finally, the status register of the CPU is checked. In this way, a possible crash and the fault that caused it are detected.

As a result, for every test run the fuzzer either receives a crash notification or a trace of the most recent execution.
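The five steps above translate into a simple driver routine. The sketch below is illustrative only; probe and target_if stand for hypothetical wrappers around the debug/trace probe and the fuzzed peripheral interface, not our actual implementation:

```python
def run_test_case(probe, target_if, test_input, timeout=1.0):
    """Execute one test case and return either a fault report or the raw trace.

    `probe` and `target_if` are assumed helper objects for the debug/trace probe
    and the fuzzed peripheral interface (hypothetical, for illustration only).
    """
    probe.prepare_target()                  # power up / wait for the event loop if needed
    probe.start_tracing()                   # enable periodic PC sampling via SWO

    target_if.send(test_input)              # apply the input to the peripheral interface
    target_if.wait_for_response(timeout)    # expected output value or time limit

    probe.stop_tracing()                    # stop trace generation
    trace = probe.collect_trace()           # drain pending trace packets

    fault = probe.read_fault_registers()    # check the CPU status/fault registers
    if fault:
        return {"crash": fault}
    return {"trace": trace}
```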

2.2 Specific Challenges of Traces Obtained via Periodic PC Sampling

With the described setup, we are able to receive limited coverage feedback from almost any ARM Cortex-M MCU. In this setup, however, repeatedly executing the same test case may lead to different coverage views, even though it always triggers the same application behavior. Therefore, traces generated via periodic PC sampling cannot be used directly as feedback for coverage-guided fuzzing. In this section, we discuss the effects that degrade the quality of the available feedback.

First, tracing is not perfectly aligned with the processing of a test case. Following the tracing procedure described above, traces offer insight into the execution taking place between the “start” and “stop” tracing steps. Tracing is disabled between individual test runs, started a moment before the input data is applied, and disabled explicitly after the processing finishes. Consequently, every trace covers portions of the execution immediately preceding and following the actual test execution. This is necessary because the production of trace packets may not be reliably synchronized with the execution of a test case; the described approach ensures that the entire processing of a test input is covered by the recorded trace. As a side effect though, tracing the same test input several times may result in different traces, each covering more or less “idle” execution time at the start and end of tracing.

Furthermore, tracing covers execution unrelated to a test case. An application may implement various exception handlers, which allow for asynchronous execution of code interrupting the regular application control flow [4, pp. 513–515]. These “interrupts” may trigger independently of the processing of an input value. Likewise, concurrent threads in an embedded OS may perform operations unrelated to the current test case. In either case, the addresses of unrelated executions can end up in the trace. Varying occurrences of such concurrency therefore introduce further variability into the tracing results.

These caveats are further aggravated by the severe information loss incurred by periodic PC sampling. The trace consists of samples of the PC taken every n-th clock cycle, triggered by a countdown timer with n steps. However, the countdown is not synchronized with the test case execution. Hence, processing of an input value can begin at any countdown value in the range \([0, n-1]\), multiplying the number of possible outcomes by n.

Finally, the execution of instructions may require a variable number of clock cycles to finish. A common example is memory operations, either working on cached memory addresses or waiting for slower flash memory.
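For concreteness, the countdown-driven sampling described above is configured through a handful of DWT register writes on ARMv7-M. The following sketch assumes a hypothetical write32() memory-write helper and the register layout documented in the ARMv7-M reference manual; the exact bit assignments should be verified against the target MCU:

```python
# Core debug / trace registers of the ARMv7-M architecture (memory-mapped).
DEMCR    = 0xE000EDFC   # Debug Exception and Monitor Control Register
DWT_CTRL = 0xE0001000   # DWT control register

def enable_pc_sampling(write32, postpreset=1, trace_exceptions=False):
    """Enable periodic PC sampling (sketch; bit positions per the ARMv7-M ARM).

    With CYCTAP = 0 the sampling interval is 64 * (postpreset + 1) CPU cycles,
    so postpreset = 1 corresponds to the 128-cycle interval used in our experiments.
    write32(addr, value) is a hypothetical debug-probe memory-write helper;
    a real implementation would read-modify-write these registers.
    """
    write32(DEMCR, 1 << 24)              # TRCENA: enable the DWT and ITM units
    ctrl = 1 << 0                        # CYCCNTENA: run the cycle counter
    ctrl |= (postpreset & 0xF) << 1      # POSTPRESET: reload value of the countdown
    # CYCTAP (bit 9) left at 0: tap the cycle counter at bit 6 (64-cycle steps)
    ctrl |= 1 << 12                      # PCSAMPLENA: emit periodic PC sample packets
    if trace_exceptions:
        ctrl |= 1 << 16                  # EXCTRCENA: exception entry/exit packets
    write32(DWT_CTRL, ctrl)
```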

3 Enabling Instruction Coverage-Guided Fuzzing

A single periodic PC trace allows neither the identification of a particular control flow nor a reliable differentiation of behaviors. Nevertheless, each run of a test case provides a small glimpse of the application behavior in the form of a changing subset of the executed PCs. We may therefore still attempt to extract valuable features for seed selection, such as the achieved “instruction coverage”.

Instruction coverage, i.e. the set of instructions executed by a test case, may be approximated by repeatedly tracing the test case and joining the results. Due to the nature of periodic PC sampling though, this process is highly erratic: On the one hand, instruction addresses actually exercised by a test case may be repeatedly omitted during tracing. On the other hand, sporadic observation of interfering concurrent execution may lead to the reporting of additional, unrelated instruction addresses. This can lead both to false rejection of inputs that actually achieve new coverage, and to false-positive detection of merely supposed new behavior.

False positives in particular pose a serious issue for the sought improvement of test case generation. By definition, a false positive exercises the same control flow as at least one other entry in the corpus and hence features a semantic structure that the tested application treats as identical. Assuming a fuzzer always applies the same set of manipulations to its seeds, fuzzing both values arguably results in a comparable probability distribution over the application behaviors triggered.

A sustained incidence of false-positive discoveries quickly results in a large corpus in which most seeds produce similar test cases. This degrades the quality of the generated test cases, which ideally should exhibit as much diversity as possible [5]. In an extreme case, assuming “naive seed scheduling”, i.e. fuzzing all seeds with equal probability, this might even lead to the “starvation” of valuable seeds: intuitively, most time would be spent mutating semantically similar seeds, instead of the few with a larger likelihood of producing further discoveries.

3.1 Combating False-Positive Detection

In the following, we devise a set of heuristics that help to reduce the residual risk of erroneous seed selection to an acceptable level.

Branch Instruction Mapping. “Branch instruction mapping” is a mechanism that reduces the number of traces needed to fully approximate the instruction coverage of a test case. To this end, the instruction addresses observed by tracing are mapped to the next succeeding branch instruction. Formally, a branch instruction is any operation that might change the PC in another way than simply incrementing it. The observation of any other type of instruction consequently implies the execution of all subsequent instructions, at least up to and including the next branch instruction.

Locating branch instructions in the raw binary of the tested application is accomplished by identifying eligible opcodes; consequently, no disassembly or source code is required. The identified instruction addresses are then added to a sorted list referred to as the “branch instruction map”. Upon tracing a certain instruction, it can be mapped to the closest succeeding branch instruction by looking up the next higher value in the branch instruction map. As a result, all instructions lying between two branch instructions are translated into the instruction address at the end of such a “block”.

Intuitively, only one instruction needs to be observed to assume execution of the entire code block between the two surrounding branch instructions. This allows us to arrive at a consistent instruction coverage estimate with far fewer tracing runs. We therefore save time and are less likely to overlook genuinely achieved coverage, which might otherwise cause a false-positive detection later in the campaign.
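The lookup itself reduces to a binary search over the sorted branch instruction map, e.g. using Python's bisect module (illustrative sketch with made-up addresses):

```python
from bisect import bisect_left

def build_branch_map(branch_addresses):
    """Sorted list of branch instruction addresses found in the raw binary."""
    return sorted(set(branch_addresses))

def map_to_branch(branch_map, pc):
    """Map an observed PC to the closest succeeding branch instruction.

    Returns None if the PC lies beyond the last known branch instruction.
    """
    idx = bisect_left(branch_map, pc)
    return branch_map[idx] if idx < len(branch_map) else None

# Example: any PC observed inside the block ending at 0x800010A maps to that
# terminating branch instruction.
branch_map = build_branch_map([0x80000F6, 0x800010A, 0x8000140])
assert map_to_branch(branch_map, 0x8000102) == 0x800010A
```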

Ignoring Samples from Interrupt Handlers. A further opportunity to eliminate unrelated observations is provided by a configuration option of the tracing interface [4, p. 731]. The tracing logic can emit notifications upon entry into and exit from interrupt handlers. This allows PC samples within a trace to be attributed to interrupt handler execution. Handler code triggering independently of test cases is a major possible cause of unrelated observations. Using the handler entry and exit notifications to ignore the respective samples can consequently reduce the risk of false-positive classifications.
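With entry and exit notifications interleaved into the decoded packet stream, the filtering amounts to tracking the current exception nesting depth. The sketch below assumes a simplified (kind, value) packet representation, which is not necessarily the format our decoder uses:

```python
def filter_handler_samples(packets):
    """Drop PC samples recorded while an exception handler was active (sketch).

    `packets` is assumed to be a decoded trace stream of (kind, value) tuples,
    where kind is "pc_sample", "exc_entry", or "exc_exit" (hypothetical format).
    """
    pcs = []
    depth = 0                        # current exception nesting depth
    for kind, value in packets:
        if kind == "exc_entry":
            depth += 1
        elif kind == "exc_exit":
            depth = max(0, depth - 1)
        elif kind == "pc_sample" and depth == 0:
            pcs.append(value)        # keep only thread-mode samples
    return pcs
```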

However, a disadvantage is the loss of observations from handler executions that actually relate to a test case. In fact, a test case may deterministically trigger certain interrupt handlers. Observations from the respective handler code would be ignored, just as for every other interrupt. Accordingly, no coverage feedback is available to guide further fuzzing towards testing the handler routine. Whether this is an acceptable blind spot needs to be decided on a per-target basis. Clearly, this is an unacceptable condition when testing hardware-related code, but it might eliminate false positives when aiming for higher, application-level logic.

Triaging Observations with Breakpoints. Another option for addressing unrelated observations is an additional investigation that leverages hardware breakpoints. The ARMv7-M architecture features a limited set of PC comparators that halt the CPU when it is about to execute a specified instruction address. These hardware breakpoints can be set up programmatically via the respective configuration registers.

Accordingly, breakpoints can be used to test whether a test case covers a certain instruction address. If the test case exercises the instruction address, processing of the test case will trigger the breakpoint and consequently halt the CPU. This condition is easily noticed by the fuzzer.

Therefore, we may use breakpoints to reliably sieve out observations of unrelated execution from traces. To this end, a hitherto unseen instruction address, reported for the most recently executed test case, is equipped with a breakpoint. Then, the test case is re-executed. The breakpoint is hit if and only if the newly observed instruction address is exercised during processing of the respective input value. Next, we re-execute all seeds once and require that the breakpoint does not trigger on any of them. This rules out that some condition unrelated to the processed input value persisted and led to the new observation. Applying this procedure to every new observation results in a reduced set of instructions that are very likely to be genuinely novel.
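A possible shape of this triage step is sketched below; probe and run stand for hypothetical helpers (setting hardware breakpoints, re-executing an input, and resetting the target), not the exact interface of our implementation:

```python
def verify_observation(probe, run, candidate, corpus, new_pc):
    """Confirm that `new_pc` is genuinely exercised by `candidate` (sketch).

    `probe` offers hypothetical helpers set_breakpoint(), was_hit(),
    reset_target(), and clear_breakpoints(); `run` re-executes one input.
    """
    probe.set_breakpoint(new_pc)
    try:
        run(candidate)
        if not probe.was_hit():           # candidate does not reach the address
            return False
        probe.reset_target()              # recover from the inadvertent halt
        for seed in corpus:               # the address must be new w.r.t. all seeds
            run(seed)
            if probe.was_hit():
                probe.reset_target()
                return False
        return True
    finally:
        probe.clear_breakpoints()
```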

As a downside however, the tested application must be reset if a breakpoint was hit, to revert potential corruption caused by the inadvertent halt of the application. For applications with a time-consuming initialization routine this may induce a severe time overhead. It is worth mentioning though, that new observations tend to become rarer the longer a fuzzing campaign is running [5]. Therefore, the additional time spent on breakpoint triage not only pays off through a more pristine corpus but also becomes less of a penalty the longer a campaign runs.

3.2 A Time-Efficient Seed Selection Strategy

The presented heuristics provide a thorough approximation of the instruction coverage achieved by a test case. A drawback not yet addressed is the time this requires: approximating the coverage of every test case exhaustively takes too long for coverage-guided fuzzing to remain competitive. Therefore, we switched to using the pooled “corpus coverage” for the detection of novel instruction addresses.

Formally, corpus coverage is the union of the instruction coverage of all seeds combined. This allows us to detect the observation of completely new instruction addresses using a simple membership test against the corpus coverage. Most notably, we may already do so with only a single trace of a test case: if any observed instruction address is not contained in the corpus coverage, we already know that there is new application behavior. In this way, we can decide how much time we want to spend on each individual fuzz input before generating and applying a new one. Clearly, tracing a test case too few times imposes a substantial risk of overlooking present but yet unseen code and hence of false rejection. As we found during our investigations though, the increase in test-case throughput outweighs the cost of such missed observations, especially since faster test execution provides earlier chances to rectify a missed discovery in a later run.

Putting all parts together, the result is the seed selection strategy described by the pseudo-code in Fig. 3. To instantiate a full-fledged fuzzer, it must be embedded in a fuzzing loop in which each iteration generates a new test case by mutation and applies it to the tested application. The presented function takes the most recently executed fuzz input, the obtained trace, and the currently known corpus coverage as arguments.

Seed selection then proceeds as follows: First, the observed instruction addresses are extracted from the trace in line 3. If indicated by ignore_exceptions, the function get_observations() ignores samples from exception handler executions. Subsequently, if indicated by use_breakpoint_triage, observations not yet contained in the corpus coverage are verified. To this end, verify_observation() is called for every new observation in line 7 and performs breakpoint triage.

Fig. 3. Seed selection leveraging SWO-based periodic PC sampling.

Observations withstanding the described filtering routine almost certainly indicate novel instruction coverage. Accordingly, the causing fuzz input is traced additional times, as indicated by trace_repetitions, and the additional observations are added to the known corpus coverage. As before, the instruction addresses observed in this manner are extracted from the trace by calling get_observations(). Finally, the addresses of contiguous instructions can be inferred using branch instruction mapping, as described in Sect. 3.1: when the corresponding use_branch_mapping flag is set, a call to lookup_instructions() is made in line 24.
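Read together, the selection routine can be approximated by the Python sketch below. The helper names mirror those used in the text (get_observations(), verify_observation(), lookup_instructions()), but this is only an approximation; the actual pseudo-code in Fig. 3 may differ in structure and line numbering.

```python
def select_seed(fuzz_input, trace, corpus, corpus_coverage, cfg, probe, run):
    """Decide whether `fuzz_input` becomes a new seed (approximation of Fig. 3).

    get_observations(), verify_observation(), and lookup_instructions() are the
    routines described in the text; `run` re-executes an input and returns its trace.
    """
    observed = get_observations(trace, ignore_exceptions=cfg.ignore_exceptions)
    new_pcs = observed - corpus_coverage
    if cfg.use_breakpoint_triage:
        new_pcs = {pc for pc in new_pcs
                   if verify_observation(probe, run, fuzz_input, corpus, pc)}
    if not new_pcs:
        return False                              # nothing genuinely novel observed

    # Novel behavior: approximate the input's coverage more thoroughly by re-tracing.
    coverage = set(observed)
    for _ in range(cfg.trace_repetitions):
        retrace = run(fuzz_input)                 # re-execute and collect another trace
        coverage |= get_observations(retrace, ignore_exceptions=cfg.ignore_exceptions)
    if cfg.use_branch_mapping:
        coverage = {lookup_instructions(pc) for pc in coverage}   # branch instruction map

    corpus.append(fuzz_input)                     # promote the input to a seed
    corpus_coverage |= coverage
    return True
```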

4 Implementation and Evaluation

We implemented the previously devised algorithm as part of a Python-based fuzzing framework to evaluate it under realistic conditions. The target setup necessary to enable tracing was achieved using the pylink library and a Segger J-Trace debug probe. We also implemented decoders for the data watchpoint and trace (DWT) trace protocol used to encode the periodic PC sampling data received over the SWO interface [4, pp. 780–796].

4.1 Target System Hardware

As test targets we selected two ARM Cortex-M4 development boards from different manufacturers (Fig. 2):

  • A Nordic Semiconductor nRF52840 DK (“NRF”) featuring an nRF52 single-core MCU at 64 MHz system clock. The firmware was a minimal bare-metal system communicating over the integrated interrupt-driven UART interface.

  • An ST Microelectronics NUCLEO-F429ZI (“STM”) with an STM32F429ZI single-core MCU, also clocked at 64 MHz. This system runs the multi-threading capable mbedOS real-time operating system (RTOS) and is representative of more complex devices. It implements a TCP/IP stack and a simple HTTP server that communicates over an Ethernet interface.

Both setups were prepared to make both the SWO interface and the parallel trace port available. Accordingly, tests and benchmarks could be conducted using periodic PC sampling via SWO, whereas the full instruction trace mode could serve as a reference for evaluating the actually achieved coverage.

4.2 Example 1: Deeply Nested Conditional Statements

Reaching deeply nested conditions is a primary challenge in which coverage information can be beneficial during seed selection. To test this, we created a minimal example containing eight levels of nested conditions, each checking four bits of a 32-bit input value against a constant. If all conditions are passed, execution enters the innermost code section and triggers a crash. This is the case exactly if the 32-bit input value is 0xdeadbeef (in hexadecimal notation). In all other cases, the processing of the input aborts and the application returns to its initial state.
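The checked condition can be modeled in a few lines of Python (the actual target is C firmware flashed onto the boards; the nibble order checked here is an assumption made for illustration):

```python
def parse(value: int) -> None:
    """Model of the example target: eight nested 4-bit checks against 0xdeadbeef."""
    expected = 0xDEADBEEF
    for level in range(8):                       # one nesting level per nibble
        shift = 28 - 4 * level                   # most significant nibble first (assumed order)
        if (value >> shift) & 0xF != (expected >> shift) & 0xF:
            return                               # abort; application returns to its initial state
    raise RuntimeError("crash")                  # innermost block triggers the crash

# parse(0xDEADBEEF) raises; every other 32-bit input returns without effect.
```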

Despite its simplicity, this example closely resembles the problem faced when testing any application that processes structured input. Our example sequentially checks the input data for certain features, based on which subsequent actions are determined. More complex applications operate in the same manner, just with potentially more alternative branches for the continuation of the processing at each stage.

As a mutation strategy, we selected one of the following two randomized editing operations with equal likelihood:

  (1) the full seed is replaced with a uniform random 32-bit pattern, or

  (2) a uniformly chosen byte of the seed is replaced by a uniform random 8-bit pattern.

A round-robin scheme served as the scheduling strategy to arbitrate between the seeds in the corpus, and we started with a single all-zero seed 0x00000000.
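A sketch of this mutation and scheduling strategy, assuming seeds are handled as 4-byte big-endian values (the byte order is our assumption):

```python
import random

def mutate(seed: bytes) -> bytes:
    """Apply one of the two described mutations with equal likelihood."""
    if random.random() < 0.5:
        # (1) replace the full seed with a uniform random 32-bit pattern
        return random.getrandbits(32).to_bytes(4, "big")
    # (2) replace a uniformly chosen byte with a uniform random 8-bit pattern
    data = bytearray(seed)
    data[random.randrange(len(data))] = random.getrandbits(8)
    return bytes(data)

def round_robin(corpus):
    """Yield seeds in round-robin order, also covering seeds added later on."""
    i = 0
    while True:
        yield corpus[i % len(corpus)]
        i += 1

corpus = [bytes(4)]                  # the single all-zero seed 0x00000000
scheduler = round_robin(corpus)
test_case = mutate(next(scheduler))  # one generated test case
```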

In this setup, a coverage-guided fuzzer should certainly be able to produce the crash-inducing input within a reasonable time. If the feedback mechanism works reliably, the fuzzer should accumulate more and more feature-complete seeds in the corpus, each of which provides a one in \(4 \cdot 2^8 = 2^{10}\) chance of passing the next condition with a suitable mutation from (2). Otherwise, the only option would be to hope for a very lucky outcome (one in \(2^{32}\)) from mutation (1), or for randomly added seeds to eventually arrive at an almost feature-complete input. Randomly added seeds, however, bloat the corpus and thus cause “starvation” of actually valuable seeds, as discussed in Sect. 3.

We conducted experiments using three different configurations of our proposed fuzzer, on both tested platforms:

  • using only branch instruction mapping,

  • additionally ignoring samples from exception handler execution (\({ignore\_exceptions} = 1\)), and

  • additionally using breakpoint triage (\({use\_breakpoint\_triage} = 1\)).

The tracing logic was configured to conduct periodic PC sampling at a 128-clock-cycle interval. Before addition to the corpus, seeds were traced 500 times to approximate their achieved instruction coverage, using branch instruction mapping (\({trace\_repetitions} = 500, {use\_branch\_mapping} = 1\)). Every tested setup ran until either four hours had elapsed (failure) or a fuzz input resulted in a reproducible crash (success). Twenty instances of this experiment were conducted for every configuration.

The fuzzing algorithm proved successful on both tested platforms. For the bare-metal NRF target, success rates of 90% (no optimization), 90% (ignoring exceptions), and 85% (breakpoint triage) were achieved. On the STM platform, only the breakpoint-triage configuration achieved a 95% success rate, while there was only one successful run when ignoring exceptions and none without optimizations. Investigating the individual runs, the root cause for this discrepancy appears to be the concurrent RTOS execution, which only breakpoint triage was able to eliminate reliably.

4.3 Example 2: Fuzz Testing a JSON Parser

For a more representative evaluation, we next chose “cJSON” as the target. In contrast to our first example, JSON strings comprise more complex features in manifold compositions. Accordingly, there is a more diverse feature set to be explored by the fuzzer, measured in the overall instruction coverage achievable. Presumably, a well-configured coverage-guided fuzzer should reliably achieve more coverage in the same time than a simple blackbox counterpart.

For the benchmark, we used the same NRF board and its UART interface as before. To fuzz-test the library, a dedicated “fuzzing harness” based on the Nordic nRF5 SDK v16.0.0 was created. The resulting setup accepted strings via the UART peripheral and fed them to the cJSON parsing logic. The empty string was used as the initial seed. The popular “Radamsa” mutation engine served as the mutation strategy, and as before, the individual seeds were scheduled in a round-robin fashion.
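Radamsa is a stand-alone command-line tool that reads an input on stdin and emits a mutated variant on stdout, so driving it from Python amounts to a subprocess call (sketch; error handling omitted and the radamsa binary is assumed to be in PATH):

```python
import subprocess

def radamsa_mutate(seed: bytes) -> bytes:
    """Return one Radamsa mutation of `seed`."""
    result = subprocess.run(["radamsa"], input=seed,
                            stdout=subprocess.PIPE, check=True)
    return result.stdout

# Example: derive a candidate input from the empty initial seed.
candidate = radamsa_mutate(b"")
```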

As a baseline, we used a simple blackbox fuzzer not leveraging any execution feedback. The coverage-guided fuzzer in turn was configured to process periodic PC sampling with a sampling interval of 128 clock cycles, using branch instruction mapping, and 500 repeated tracings of seeds (\({use\_branch\_mapping} = 1, {trace\_repetitions} = 500\)). We then ran experiments with each of the following optimizations:

  • using only branch instruction mapping,

  • additionally ignoring samples from exception handler execution (\({ignore\_exceptions} = 1\)), and

  • additionally using breakpoint triage (\({use\_breakpoint\_triage} = 1\)).

The blackbox and the three coverage-guided setups were each benchmarked five times, with twelve hours per run. The executed test cases were recorded together with a timestamp. This allowed us to obtain the exact instruction coverage achieved over time by replaying the series of tests after each run while tracing in ETM full-trace mode. Figure 4 shows the results of the described series of benchmarks, averaged over the five runs per configuration. The x-axis refers to the time in seconds, whereas the y-axis represents the number of different instructions exercised.

Fig. 4. Instruction coverage achieved by different configurations of the coverage-guided fuzzer over time, plotted against the results obtained from a similar blackbox fuzzer.

Fig. 5. Instruction coverage achieved by the coverage-guided fuzzer using breakpoint triage per number of applied test cases, plotted against the results obtained from a similar blackbox fuzzer.

The plots show that two out of three coverage-guided fuzzer setups reliably outperformed the blackbox approach. Immediately after the start, all fuzzers made similar discoveries. However, while the blackbox instances slowed down, the coverage-guided fuzzers continued to make substantial discoveries. The blackbox fuzzer eventually made a large fraction of the discoveries too, but ultimately stalled before even half of the allotted run time had passed. The coverage-guided instances using breakpoint triage and ignoring exceptions, in turn, maintained their advantage and continued to make discoveries until the runs ended. Only the coverage-guided configuration without optimizations failed to beat the blackbox baseline.

How substantial the benefit of proper coverage feedback is, however, is best seen in Fig. 5. The plot shows the number of test cases executed on the x-axis versus the total number of instructions observed on the y-axis. The coverage-guided setup achieved its superior results with roughly a quarter of the number of test cases issued by the blackbox setup. This underlines the improvement in test case quality achievable using the presented feedback mechanism.

During our experiments, we further identified two potential weaknesses within the cJSON parser:

  • First, a rather simple bug was triggered by an input comprising approximately 200 opening square brackets. The result was a stack overflow caused by an unbounded recursion taking a new step for every additional bracket found in the input.

  • Another, more complex bug was triggered by certain malformed strings. JSON allows the specification of 16-bit Unicode characters within strings in the form of escaped hexadecimal digits. A flaw in the corresponding parsing routine caused the parser to skip the evaluation of the character following such an escaped sequence if, for instance, the hexadecimal value was invalid. This includes the delimiter that would otherwise terminate the string. Consequently, the parser continues to process the input as a string, which essentially results in copying the input data into an allocated heap buffer. The copied string then no longer matches the previously calculated buffer size, resulting in an out-of-bounds write on the heap.

The first bug was consistently found by all instances. The second bug, conversely, was found by a single coverage-guided instance leveraging breakpoint triage. Although the frequency of observation of these two bugs does not permit much judgment regarding the qualities of the benchmarked algorithms, they demonstrate what results can be expected from fuzz testing. All of the found bugs were already known and fixed in more recent upstream versions of the cJSON library.

5 Conclusion

We presented a coverage-guided, mutation-based fuzzing algorithm designed to automatically test applications running on ARMv7-M MCUs. To this end, we leverage the sparse information provided by the “periodic PC sampling” trace mode as coverage feedback. The resulting methodology can be applied to any ARM Cortex-M device that provides an SWO trace port and DWT support. To our knowledge, at the time of writing this is the first attempt to use this interface to enable coverage-guided fuzz testing of MCU-based embedded systems.

Our algorithm outperforms blackbox fuzzing, which so far was the only available option if firmware needs to be fuzz tested in place on its target hardware. Accordingly, our results underline the feasibility of this new direction for the unobtrusive fuzz testing of embedded applications running on MCUs.

The described approach was implemented and evaluated under realistic conditions on two ARM Cortex-M4 platforms. Tests with a “nested conditions” example application showed that the approach is capable of reaching deeply nested conditions consisting of small code blocks. Benchmarks conducted on the popular “cJSON” library further demonstrated its applicability to real-world, complex input structures. In both experiments, the devised fuzzer demonstrated its “learning” ability, observable in the improved quality of the generated test cases. At the same time, the run-time overhead introduced by tracing remains at an acceptable level, yielding a reasonable trade-off between the volume of applied inputs and their individual effectiveness. Most notably, this remains true under aggravated conditions, when the code sections to be observed are rather small or a highly concurrent execution environment obscures the available insights.