Reconfigurable field-programmable gate arrays (FPGAs) offer high processing throughput at low power consumption, as well as flexibility through reconfiguration, which makes them widely used in today's embedded systems with high processing demands. However, SRAM-based FPGAs are particularly susceptible to the radiation effects encountered in space applications because single event upsets (SEUs) in the configuration memory can cause a reconfiguration of the device and hence an undesired modification of the circuit. This problem is usually addressed by adding spatial redundancy, i.e. duplicating or triplicating the processing units in the FPGA fabric, in combination with scrubbing, i.e. reprogramming the configuration bitstream after a fault has been detected [1]. Repair via scrubbing potentially causes considerable down-time of the processing system [2], which can lead to the loss of payload data or affect onboard control tasks. This work addresses this issue and focuses on maximizing the availability of the onboard processing system. In particular, we consider the system availability as a third performance metric alongside the processing throughput and power/energy consumption.

This work describes a novel fault detection, isolation, and recovery (FDIR) strategy for onboard satellite payloads which utilizes commercial-off-the-shelf (COTS) reconfigurable FPGAs. Our scheme leverages heterogeneous systems-on-chips (SoCs), such as Xilinx's Zynq chip [3] or Altera's Cyclone V SoC modules [4], which combine tightly coupled reconfigurable hardware with hard-wired processor cores. These SoCs embed one or multiple hard processor cores alongside programmable FPGA logic, enabling low-latency and power-efficient communication between the two computing devices. Compared to the FPGA fabric, we consider the hard-wired cores to be more reliable with respect to radiation-induced SEUs, and standard fault mitigation techniques can be applied to them; for this reason these cores are assigned the role of a hypervisor in charge of fault detection and scrubbing in the fabric. We define Quality-of-Service (QoS) as the rate at which payload data is processed; for example, this could be the frame or pixel rate of an image processing application. A novelty of the proposed concept is that the adaptive framework aims to maintain a constant processing rate of the QoS application during repair of the FPGA configuration memory. To this end, when a fault is detected in the hardware, the system is rolled back to the last known acceptable state and the hardware task is migrated to software running on a hard processor core. The task then continues in a software thread while the FPGA device is reprogrammed, and once scrubbing is completed the task is migrated back to hardware.

Our rollback and repair can be applied very effectively to applications whose internal state is regularly reset to a known state, such as image processing applications. This reset interval also sets the amount of rollback time required, since the computation has to start over from the last point where the state was reset. The work presented in this chapter contains a case study with a HW/SW application that performs K-means clustering on the frames of a video. In this example, processing each video frame takes a variable amount of time and may require multiple iterations of the algorithm, which means that the reset interval can vary depending on how long each frame takes to process. Many common onboard processing tasks stem from signal and image processing [1, 57] or data compression [6, 8, 9] applications which are normally stream-based. Such applications have therefore been the primary focus of our work to date.

Besides the optimization of system availability, our FDIR framework also ensures that constraints on the power consumption are maintained. Such constraints could be violated when the payload task is migrated to software and the core frequency is ramped up in order to meet the QoS requirement. The adaptive frequency scaling uses online power monitoring to dynamically trade off and manage the QoS and power constraints during system runtime.

This chapter describes our FDIR framework implemented on a Xilinx Zynq device and presents experimental results obtained under fault injection. This work represents a precursor system for a subsequent implementation in the payload computer onboard the OPS-SAT satellite [10]. OPS-SAT is a nano-satellite devoted to demonstrating novel mission concepts that become possible when more powerful computers are available on satellites. The OPS-SAT mission is led by the European Space Agency (ESA) and is set to launch in 2016. OPS-SAT is the first spacecraft to fly COTS Altera Cyclone V SoCs built on 28 nm technology. The hardware is not space qualified and hence our experiment setup focuses on very high SEU-induced fault rates. In summary, our contributions are:

  • We present a novel FDIR scheme for FPGA-based space-borne processors. The scheme autonomously migrates processing tasks between the reprogrammable logic and hard processor cores, so as to maximize the availability of the processing system in the presence of SEU-induced faults.

  • We extend the FDIR framework by utilizing frequency scaling to create an adaptive, fine-grain optimization of power consumption and processing throughput.

  • We present measurements of the system availability, power consumption and processing throughput for different fault rates. We demonstrate that our technique maintains nearly full availability even under harsh conditions where faults occur as frequently as once per second.

  • We compare our results to ‘traditional’ fault handling approaches based on fault detection and scrubbing.

Section 6.1 discusses related work and highlights the differences to previous work. Section 6.2 describes our adaptive FDIR framework. Section 6.3 briefly outlines our benchmark application. Sections 6.4 and 6.5 present our experiments and results, and Sect. 6.6 concludes the chapter.

1 Related Work

Repairing the configuration memory of reconfigurable SRAM-based FPGAs in high-radiation environments is usually based on scrubbing, i.e. periodic rewriting of the FPGA configuration memory, while faults in the form of radiation-induced bit flips in the configuration memory are detected by including redundant modules and comparators or majority voters in the circuit. A common approach is to use triple modular redundancy (TMR) at the netlist level in combination with periodic scrubbing of the entire configuration memory [11]. The drawbacks of this strategy are large overheads in terms of chip area and power consumption, and long scrubbing times, especially for large COTS FPGAs. As an alternative to blind scrubbing at a fixed rate, the time spent on repair can be reduced by triggered scrubbing, which is performed only when a fault has been detected.

Instead of adding spatial redundancy at netlist level, alternative approaches implement a module replication at coarse-grained unit level. Azambuja et al. [12] describe an approach where faulty modules are detected with unit-level TMR and repaired with selective partial scrubbing using dynamic partial reconfiguration (DPR). Their approach is notable in that it further reduces the scrubbing time and energy spent in the repair process compared to the netlist-level TMR approach [11] while keeping the resource overhead similar. Nazar et al. [2] propose an alternative approach to reduce the scrubbing time by leveraging DPR and applying the repair only to critical configuration bits that are used by the logic configuration and by determining the optimal starting point for the scrubbing process.

Several recent approaches address a reduction of the resource overhead in terms of chip area and power consumption caused by the redundancy scheme, in particular TMR. Jacobs et al. [13] propose a framework that, instead of using TMR by default, can adapt the amount of redundancy needed according to the degree of required protection and changing failure rates using DPR. The authors integrate three redundancy schemes in their framework: TMR with voting, self-checking pairs (module duplication with comparison, DWC [14]), and a high-performance mode without module-level replication. Siegle et al. [1] present a comprehensive framework that allows the designer to select and analyze different redundancy and repair schemes: netlist-level TMR with scrubbing, no redundancy, duplication with comparison, and module-level TMR with partial scrubbing to speed up repair. In line with this work they focus on maximizing the availability of the processing system. However, we completely abandon the expensive TMR approach and propose a reliable onboard processor based on the more economic DWC strategy which involves hard on-chip processor cores in addition to SRAM-configurable logic.

Ilias et al. [7] propose an FDIR strategy which is similar to this work in that it uses DWC and migrates the processing task to embedded hard PowerPCs during scrubbing. The hard processor core has a smaller cross section and is less susceptible to radiation-induced faults than the reconfigurable logic. The authors demonstrate their framework with a finite impulse response (FIR) filter application and report a 40 % area reduction compared to a standard TMR implementation, at the cost of a reduction in the processing throughput. This work builds on the same basic idea, but we extend the approach to address the drawback of a drop in processing rate by adding adaptive frequency scaling. The adaptive availability optimization is a distinguishing feature of the proposed technique compared to the FDIR approaches discussed above. Additionally, we include a fine-grain online optimization of the power consumption.

Re-synchronization of state-dependent logic after repair is a crucial task in the fault mitigation strategies discussed above. We choose a checkpoint and rollback approach [15] where the system is rolled back to the last known acceptable state before the task migration. This approach works particularly well for processing tasks that can be split into small independent chunks, such as stream-based processing tasks. The hardware/software task migration and slicing of the processing task is more difficult to implement for other types of applications which exhibit many dependencies between the processed data items. However, many typical onboard processing tasks, such as image or signal processing applications, are stream-based, which makes this approach applicable to a wide range of onboard processing applications.

2 A Workload-Adaptive FDIR Framework

Our management system is divided into two main components: the fault recovery system (FRS), used to manage the repair of the system once a fault has been detected; and the adaptive management system (AMS), which dynamically monitors the processing progress of the system and scales the performance while still meeting power constraints. The AMS is built upon previous work known as Heterogeneous Heartbeats, a framework for adaptive reconfigurable SoCs, which is discussed in Sect. 6.2.2. This section discusses the implementation details of the FRS, then outlines the Heterogeneous Heartbeats framework, and finally discusses the details of the adaptation management system.

2.1 Fault Recovery Management System

The goal of the fault recovery system (FRS) is to detect and recover from errors that arise in the system's hardware task. Figure 6.1 shows a flowchart of both the recovery process and task execution, indicating whether each stage is executed in the hard processor system (PS) or within the programmable logic (PL). To demonstrate this, an image processing case study is presented in which frames of an input video are iteratively processed. At the start of each iteration the PS is responsible for capturing the frame data and storing it in memory accessible by both the PS and the PL. It then checks whether the PL is fully configured and the HW task and its duplicate are available. If they are available, they are sent a signal indicating that the input frame is present in memory and that they can start processing; if not, a software version of the task is started instead.

Fig. 6.1 Flowchart of the FDIR scrubbing system

Both the hardware task and its duplicate process their data in lock step, and every time they complete a frame their outputs are compared. If there is no difference in the outputs then we assume that no error has occurred and the process continues as normal. However, if a difference is detected between the outputs of the hardware tasks then we assume that an error has occurred and an exception is thrown. This exception triggers the FRS, running on the PS, to start reconfiguring the FPGA fabric. While the reconfiguration process is taking place, the same input frame is recomputed, this time using a software instance of the task instead of a hardware instance. Once the frame has been successfully processed, computation of the next frames continues in software until the hardware has been reprogrammed and validated.
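The lock-step comparison and software fallback described above can be sketched as follows. This is a minimal illustration only: the task instances are modeled as plain callables, and all names are hypothetical rather than the framework's actual interfaces.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

using Frame = std::vector<std::uint8_t>;
using Task  = std::function<Frame(const Frame&)>;

// Duplication with comparison (DWC): both hardware instances process the
// same frame; disagreement raises the flag that triggers the software
// fallback and a scrub of the configuration memory.
Frame process_frame(const Frame& frame, const Task& hw0, const Task& hw1,
                    const Task& sw, bool& scrub_requested) {
    Frame r0 = hw0(frame);
    Frame r1 = hw1(frame);
    if (r0 == r1) {               // outputs match: assume no SEU occurred
        scrub_requested = false;
        return r0;
    }
    scrub_requested = true;       // mismatch: assume an SEU, repair the fabric
    return sw(frame);             // recompute the same frame in software
}
```

In the real system the two results come from the duplicated IP cores in the PL and the scrub request drives reconfiguration of the fabric; here ordinary functions stand in for all three task instances.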

2.2 Heterogeneous Heartbeats

Heterogeneous Heartbeats is the basis for the satellite's adaptive management system. It aims to facilitate chip-level adaptation, focusing on systems contained within a single package, such as the Altera SoC or Xilinx Zynq devices. The Heterogeneous Heartbeats framework is an enhanced version of preliminary work presented in [16]. In the development of such systems it is becoming increasingly common to use large collections of intellectual property (IP) packages with different characteristics and from different sources, such as IP vendors or high-level synthesis (HLS) tools. As the amount and variety of IP increases, the interactions between sub-components become increasingly complex, potentially increasing the run-time dynamics of the system. These dynamics make it difficult to statically optimize and tune parameters offline to meet constraints such as temperature, power, or frame rate, and necessitate online adaptive approaches. Heterogeneous Heartbeats extends the Heartbeats Application Programming Interface (API) [17], a standardized interface to monitor task progress, by allowing the seamless addition of both hardware (FPGA) resident heartbeat producers and heartbeat consumers.

The Heterogeneous Heartbeats framework considers three separate portions: sensors, adaptive engines, and actuators. Sensors collect data on the current state of the system; examples are the application's progress via the Heartbeats API or a device's power consumption. These are heartbeat producers. Adaptive engines use the collected sensor data, together with predictions of how changes in the system will alter future sensor readings, to make decisions on how the system should alter its behavior; these are heartbeat consumers. Actuators change the behavior of the system; examples are the frequency multiplier value in the phase-locked loop (PLL) for the system's clock or the cache replacement policies.

The Heartbeats API is used as the basis for the interaction between the heartbeat producers (sensors) and the heartbeat consumers (adaptive engines). Application developers use the Heartbeats API by first calling an initialization function at the start of their application. This function sets up a publicly available heartbeat record, which can be generated and accessed by either software or hardware, in which individual heartbeat entries are stored and the goals of the application are set. The goals of the application are expressed in terms of the sensors that the application is interested in. For example, in a video processing application one goal might be to maintain a particular frame rate and power consumption, which would require the availability of a timer and a power monitor on the sensor side.

A heartbeat function is then called at important milestones of the application's progress. This function creates a sensor-stamped heartbeat which is then saved as an entry in the publicly available heartbeat record. In our image processing example the sensor stamps would be a timestamp from an internal or external system clock and a power stamp from a power monitoring unit. Further operations are provided for external heartbeat consumer applications to query the heartbeat record. These functions perform operations such as fetching the current heartrate, the history of the last n heartbeats, the average heartrate, and the application's goals.
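As an illustration of this producer/consumer interface, a minimal heartbeat record might look like the following sketch. The real Heartbeats API [17] is a C library whose actual signatures differ; every name here is illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative heartbeat entry: each beat is stamped with the sensor
// readings that the application's goals refer to (here time and power).
struct Heartbeat {
    double time_s;
    double power_w;
};

class HeartbeatRecord {
public:
    // Initialization registers the application's goals, which consumers
    // (adaptive engines) can later query alongside the beat history.
    HeartbeatRecord(double target_rate, double power_cap_w)
        : target_rate_(target_rate), power_cap_w_(power_cap_w) {}

    // Called at application milestones, e.g. once per processed frame.
    void heartbeat(double time_s, double power_w) {
        log_.push_back({time_s, power_w});
    }

    // Instantaneous heartrate (beats/second) from the last two entries.
    double current_heartrate() const {
        if (log_.size() < 2) return 0.0;
        double dt = log_.back().time_s - log_[log_.size() - 2].time_s;
        return dt > 0.0 ? 1.0 / dt : 0.0;
    }

    // Average heartrate over the last n entries.
    double average_heartrate(std::size_t n) const {
        if (log_.size() < 2 || n < 2) return 0.0;
        std::size_t first = log_.size() > n ? log_.size() - n : 0;
        double dt = log_.back().time_s - log_[first].time_s;
        double beats = static_cast<double>(log_.size() - 1 - first);
        return dt > 0.0 ? beats / dt : 0.0;
    }

    double target_rate() const { return target_rate_; }
    double power_cap() const { return power_cap_w_; }

private:
    double target_rate_;
    double power_cap_w_;
    std::vector<Heartbeat> log_;
};
```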

On the other side of the adaptive engines is the actuator portion. Actuators are methods that cause changes in the system's behavior. Examples are the frequency of the PS or of hardware tasks in the PL, the cache replacement policy, or which version of a particular algorithm is running.

2.3 Adaptation Management System

The adaptive management system (AMS) dynamically tries to maintain an overall system goal while subject to certain constraints. In this particular case, the goal of the AMS is to maintain a particular QoS deadline (frame rate) while using as little power as possible and always ensuring that a system-wide power constraint is met. This adaptation needs to be performed in two cases: as the application workload varies, and while the FRS is repairing faults in the system. Figure 6.2 shows the controller's architecture in (a) and (b), and the algorithmic flow in (c). In (a) and (b) we can see that the deviation between the application's ideal heartrate and the current heartrate is turned into an error signal which is used to drive the controller. This is the signal that the controller will attempt to minimize in the presence of disturbances.

Fig. 6.2 Diagram of the QoS adaptation controller setup: (a) architecture view where the controller is configured to scale the clock frequency of FPGA-resident hardware tasks during normal, error-free operation; (b) architecture view where the controller is configured to scale the hard processor system's (PS) clock frequency when errors are detected in hardware; and (c) algorithmic view showing the flow of the execution and reconfiguration of the controller

When the system is being repaired due to faults, the structure of the system changes dramatically along with its behavior. In control theory, an approach known as gain scheduling, where a suitable linear controller is selected depending on the current operating region, is used to handle such non-linear effects. We adopt a similar approach here, adapting the controller and scheduling different models and parameters based on the current configuration of the system. The process of adapting the controller can be divided into three coarse stages, labeled in Fig. 6.2c and described below.

  1. Initially the control algorithm determines whether the application is currently running in hardware or in software.

  2. Based on this information the parameters and state of the controller are scheduled; the parameters and control models are selected depending on whether the application is resident in hardware or software.

  3. Finally the control action is executed and a new frequency is calculated. This is then used to update the clock controllers for both the software and the hardware systems.

This preliminary work uses a simple heuristic to control the various components of the system; however, work is underway to develop a more sophisticated controller in which each control action consists of the following stages. Firstly, a learning algorithm takes the error signal and generates a performance scaling factor, which is the multiple of the current heartbeat rate that we require in the future to maintain our QoS. Secondly, this is fed into a model that determines the frequency required to achieve the required increase or decrease in performance. Thirdly, the new frequency value is fed into another model that predicts the power consumption that the change in frequency will cause. Finally, this predicted power consumption is used along with the current power consumption to determine whether the power constraint will be satisfied. If it is not, the controller iteratively searches for the next highest frequency value that gives the best performance while still meeting the constraint.
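The four stages of such a control action can be sketched as follows, assuming simple linear frequency-to-performance and frequency-to-power models; the learning algorithm and the actual models of our controller are not reproduced here.

```cpp
#include <algorithm>
#include <cassert>

// Assumed linear models for the sketch: performance scales with clock
// frequency, and power is a static floor plus a frequency-dependent term.
struct PowerModel {
    double idle_w;       // static power floor
    double w_per_mhz;    // dynamic power per MHz
    double power_at(double f_mhz) const { return idle_w + w_per_mhz * f_mhz; }
};

// One control action: close the QoS error, then back the frequency off
// in 10 MHz steps until the predicted power meets the constraint.
double control_step(double target_rate, double current_rate,
                    double current_f_mhz, double power_cap_w,
                    const PowerModel& m) {
    // Stage 1: error signal -> performance scaling factor.
    double scale = current_rate > 0.0 ? target_rate / current_rate : 1.0;
    // Stage 2: performance model (assumed linear in frequency).
    double f = current_f_mhz * scale;
    // Stages 3 and 4: predict the power and enforce the constraint.
    while (m.power_at(f) > power_cap_w && f > 0.0)
        f = std::max(0.0, f - 10.0);
    return f;
}
```

When the budget is generous the controller simply doubles the clock to double the heartrate; when the power cap binds, it settles on the highest frequency whose predicted power still fits the budget.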

3 Benchmark Applications

We demonstrate the adaptive FDIR system using a benchmark application which processes image data from the high-resolution camera onboard the satellite. A software implementation and an FPGA implementation are uploaded onto the SoC, while the FDIR system automatically schedules the execution of either the software or the hardware task. The following briefly describes our benchmark application.

3.1 K-Means Clustering

A common remote sensing application is the creation of maps of vegetation type or land cover, for instance used in crop/forestation monitoring such as the objective of the Planet Labs Dove fleet [18]. A central component of these image processing systems is the unsupervised classification of image data based on the pixel values. K-means clustering is among the most popular machine learning techniques for assigning observations (in this case pixels) to classes (clusters). Clustering is also often used for analyzing multi- and hyperspectral imagery. K-means algorithms partition the D-dimensional point set \( X=\left\{{x}_1, \dots,\;{x}_N\right\} \) into clusters \( \left\{{S}_1, \dots,\;{S}_K\right\} \) where K is provided as a parameter. The goal is to find the optimal partitioning which minimizes the objective function given in (6.1), where \( {\mu}_i \) is the geometric center (centroid) of \( {S}_i \).

$$ J\left(\left\{{S}_i\right\}\right)={\displaystyle \sum}_{i=1}^K{\displaystyle \sum}_{x_j\in {S}_i}{\left\Vert {x}_j-{\mu}_i\right\Vert}^2 $$
(6.1)

Finding optimal solutions to this problem is NP-hard [19]. A popular heuristic, known as Lloyd's algorithm, uses an iterative refinement scheme which, for every data point, computes the nearest cluster center based on the smallest squared Euclidean distance and then updates each cluster center position according to the data points assigned to it.
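For illustration, the two alternating steps of one Lloyd iteration might look like the following 1-D sketch; the onboard implementations (C++ and VHDL) operate on D-dimensional pixel data, and all names here are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assignment step: label every point with its nearest centroid (for 1-D
// data the smallest squared Euclidean distance reduces to |x - mu|).
std::vector<std::size_t> assign_step(const std::vector<double>& x,
                                     const std::vector<double>& mu) {
    std::vector<std::size_t> label(x.size());
    for (std::size_t j = 0; j < x.size(); ++j) {
        std::size_t best = 0;
        for (std::size_t i = 1; i < mu.size(); ++i)
            if (std::fabs(x[j] - mu[i]) < std::fabs(x[j] - mu[best]))
                best = i;
        label[j] = best;
    }
    return label;
}

// Update step: move each centroid to the mean of its assigned points;
// empty clusters keep their previous centroid.
std::vector<double> update_step(const std::vector<double>& x,
                                const std::vector<std::size_t>& label,
                                std::vector<double> mu) {
    std::vector<double> sum(mu.size(), 0.0);
    std::vector<std::size_t> cnt(mu.size(), 0);
    for (std::size_t j = 0; j < x.size(); ++j) {
        sum[label[j]] += x[j];
        ++cnt[label[j]];
    }
    for (std::size_t i = 0; i < mu.size(); ++i)
        if (cnt[i] > 0) mu[i] = sum[i] / static_cast<double>(cnt[i]);
    return mu;
}
```

Iterating the two steps until the labels stop changing yields a locally optimal partitioning with respect to (6.1).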

Clustering produces an output image with each pixel assigned to a cluster. Apart from classification, clustering provides a locally optimal solution to color quantization, which results in a reduction of the data volume as a pixel can be represented with \( {\log}_2(K) \) bits in the new image. Remote sensing systems that include a cluster analysis of satellite imagery usually perform the clustering offline after reception of the original image by the ground station. In this experiment we perform the clustering step onboard and benefit from the data volume reduction prior to downlinking telemetry. We use a software implementation of Lloyd's algorithm for K-means clustering in C++ and an FPGA implementation in VHDL which builds on the work described in [20].

4 Experiments

Our measurements focus on three sub-experiments:

  • A ‘naive’ fault recovery mechanism where the task only runs in hardware and the entire system is stalled while the hard processing system performs scrubbing of the fabric's configuration memory. This is the traditional approach to fault mitigation for FPGAs and is used as the baseline for comparison with the second and third experiments, where the task is automatically migrated to software so as to maintain the availability of the payload processor.

  • The second experiment then includes the use of the FRS to migrate the task from hardware to software while the recovery process is taking place, demonstrating that the availability of the system can be improved through the use of the heterogeneous platform. We compare the FRS-based fault handling to the baseline on the basis of system availability and processed blocks per time interval.

  • Finally, we combine the FRS with the AMS to automatically manage the QoS deadline and power constraints and to demonstrate that fault tolerance is achieved while maintaining a particular QoS via frequency scaling. The AMS controller monitors the instantaneous power consumption and adapts the frequency according to the allowable power budget.

For the in-orbit experiment, all three sub-experiments are packaged up in a single image which is uploaded onto the onboard SoC. A thread running on the PS is in charge of scheduling the three experiment phases. The payload processor requires access to a high-resolution camera in order to retrieve input data for the image processing benchmark application. In addition to the processed image data, the experiment setup collects and downlinks information about the system availability (i.e. down-time during scrubbing and violation of QoS deadlines), the number and time stamps of faults that occurred, the selected frequency scalar values, and the power consumption (drawn from online power monitoring sensors), as well as several status indicators such as the presence of permanent circuit failures (e.g. due to latch-ups in the reconfigurable logic).

4.1 Prototype Test Setup

The prototype system used for the measurements presented in this chapter was developed on a Xilinx ZC702 Zynq development board, a device very similar to the Altera Cyclone V SoC used in the payload onboard OPS-SAT. Like the Cyclone V SoC, this device contains a dual-core ARM processing system (PS) tightly coupled to an FPGA fabric (PL) in a single package. Two identical K-means clustering IP cores were implemented using the Xilinx Vivado HLS high-level synthesis tool. The clustering cores are placed in the PL and connected to the PS via various AXI busses. The outputs of the identical clustering cores are compared and an error is flagged if they do not match. In order to configure and control the IP cores from the PS, their AXI locations were memory-mapped and Linux drivers were developed. A configuration bitstream was then generated for use both in the initial configuration of the device and in reprogramming the device during repair.

The PS runs PetaLinux, a Linux kernel distribution developed by Xilinx, and on top of this OpenCV is used to manage the image data sent to the device and to check for errors in the output. Getting data from the PS to the clustering cores required a portion of the DDR memory to be reserved; input frames were obtained in OpenCV and then passed to this reserved memory. AXI masters within the clustering cores were then used to fetch the input frame without any intervention from the PS, and the same AXI masters were used to send the output to a different reserved memory location that can be read by the OpenCV application. To reconfigure the hardware during repair, drivers provided by Xilinx allowed us to write the bitstream to the device file xdevcfg, which connects to the PCAP and allows us to reconfigure the PL from within the embedded Linux environment.

For the current implementation of the FRS & AMS system, a preliminary frequency controller is used in which we distinguish between two states: the state where the task has been migrated to the PS and the state where the task is running in the PL. The former state is given a high PS frequency and the latter a low one. Future work will explore more sophisticated controllers that scale the frequencies of hardware and software separately in order to meet constraints. To change the frequency of the PS, the System Level Control Registers (SLCR) were used, where the clock multiplier values of the PS clock's PLL can be edited.
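The resulting two-state policy can be summarized by the following sketch; the frequency values are illustrative placeholders, not the multiplier values actually written to the SLCR.

```cpp
#include <cassert>

// The two controller states: the payload task is either resident in the
// programmable logic (PL) or migrated to the processor system (PS).
enum class TaskSite { PL, PS };

// Illustrative placeholder frequencies; the real values are applied by
// editing the PS PLL clock multipliers through the SLCR.
constexpr double kLowMHz  = 333.0;  // PS clock while the task runs in the PL
constexpr double kHighMHz = 667.0;  // PS clock during the software fallback

double select_ps_frequency_mhz(TaskSite site) {
    return site == TaskSite::PS ? kHighMHz : kLowMHz;
}
```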

4.2 Experiment Setting

For each experiment, stock ESA/NASA footage of Earth from a low Earth orbit was streamed into the Zynq device over a network connection. OpenCV then sent the frames of this video to the reserved memory where the hardware could access them. In order to emulate SEU-induced faults, a separate software thread running on the PS randomly injects faults by corrupting a bit in one of the hardware blocks' outputs. This causes a discrepancy between the outputs of the two cores, which causes the fault detection to trigger, initiating the repair process. The application is instrumented with the Heartbeats API, a power monitor, and timers, and each of the three fault handling methods above is tested for a certain number of frames at various error rates. For each experiment three metrics are evaluated:

  • The availability AV of the system is defined as

    $$ AV=\frac{t_{up}}{t_{up}+{t}_{down}\;} $$
    (6.2)

    where \( {t}_{up} \) is the amount of time that the system's objective task is active, and \( {t}_{down} \) is the time spent on repair after a fault has occurred.

  • The heartrate of the system, which is generated using the Heterogeneous Heartbeats API and represents the processing performance. In this case we use the heartrate to measure the QoS of the system in terms of image blocks processed per second.

  • The power consumption of the system, which is measured using the ZC702 development board's inbuilt power monitoring, allowing us to measure the power consumed by the PS and PL portions of the device separately.
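The SEU emulation used in this setup, a software thread corrupting one bit of a hardware block's output buffer, can be sketched as follows; the function name and buffer layout are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Fault-injector sketch: flip one randomly chosen bit in one hardware
// block's output buffer so that the duplicate comparison detects a
// mismatch. A seeded generator stands in for the injector thread's
// random source and makes the example reproducible.
void inject_seu(std::vector<std::uint8_t>& output, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> byte_at(0, output.size() - 1);
    std::uniform_int_distribution<int> bit_at(0, 7);
    output[byte_at(rng)] ^= static_cast<std::uint8_t>(1u << bit_at(rng));
}
```

Because the corruption is a single XOR, exactly one bit differs between the two block outputs afterwards, which is sufficient to trip the comparator.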

5 Results

In Fig. 6.3 we examine the availability of the two recovery systems under different fault rates. Firstly, we observe that the average availability of the naive system, where the entire system stalls during repair, is always lower than the availability of the FRS system, where the task is migrated to the PS when a fault occurs. At high error rates the gap between the average FRS availability and that of the conventional approach becomes increasingly large. This runaway effect is due to an increased probability of errors occurring during the repair process, causing the system to make less and less progress. This problem potentially becomes more significant as the recovery time increases due to the larger configuration memories of larger FPGA devices, increasing the need for more intelligent scrubbing techniques that make use of partial reconfiguration, as pointed out in [1].

Fig. 6.3 Average and worst case availability of the conventional fault recovery (naive), task migration (FRS), and combined task migration and adaptive frequency scaling (FRS & AMS) systems

Secondly, we show the worst case availability of the different recovery systems. Critical onboard processing tasks can require hard task completion deadlines and hence we consider the worst case metric as ultimately more important for mission criticality. It can be seen from the graph that for the naive implementation the worst case availability deteriorates to values below 0.2, while with our FRS the worst case availability remains at nearly 100 % and is never lower than the average availability of the naive case.

Figure 6.4 shows how the worst case performance of the different systems changes as the error rate is increased. In a similar fashion to the availability, we can see that the worst case heartrate of the naive recovery method is poor; the FRS improves on this by ensuring that a certain level of QoS can always be maintained no matter what the current error rate is. However, maintaining this level of QoS is not free: in Fig. 6.5 we observe an increased power consumption for the FRS. This increased power consumption can be significantly improved by combining the FRS with the AMS, which scales up the frequency of the PS only when it is required because a task has been assigned to it. Figure 6.4 also shows that, when we combine the FRS with the AMS, we obtain a higher worst case heartrate than using the FRS alone, and Fig. 6.5 demonstrates a power consumption comparable to the naive implementation and a significant saving over using the FRS alone.

Fig. 6.4 Average heartrate (QoS) achieved by the conventional fault recovery (naive), task migration (FRS), and combined task migration and adaptive frequency scaling (FRS & AMS) systems

Fig. 6.5 Instantaneous power consumption of the conventional fault recovery (naive), FRS, and FRS & AMS systems

6 Conclusion and Outlook

We present a prototype implementation of a novel FDIR scheme for FPGA-based space-borne processors that will undergo an in-orbit test and validation campaign onboard the ESA OPS-SAT satellite, set to launch in 2016. A distinguishing feature of our technique is the autonomous migration of processing tasks between the reprogrammable logic and hard processor cores in heterogeneous SoCs, which maximizes the availability of the processing system in the presence of SEU-induced faults and the required scrubbing of the programmable logic. Our measurement results show that our task migration technique maintains nearly full availability of the processor at all times. We compare these results to the mean and worst-case availability of a conventional fault detection and repair approach, which degrades to 20 % availability in the worst case in scenarios with frequently occurring SEU-induced faults. In addition, our FDIR framework features frequency scaling for an adaptive, fine-grain optimization of power consumption and processing throughput.

There are three major directions we plan to explore in future work. Firstly, we will investigate more sophisticated controller implementations for the adaptive power and throughput management. Our current prototype implementation switches between high and low clock frequency according to the current state of the task migration. More complex controllers provide a better quality of the frequency adaptation at the expense of an increased inherent power and resource consumption, and we plan to explore this trade-off in future work.

Our task migration effectively mitigates the degradation of the system availability during FPGA repair. However, high-throughput applications may still experience a significant drop in heartrate when the task is migrated to software, which is especially true as the recovery time increases due to larger FPGA devices being repaired. Partial reconfiguration combined with a more fine-grain error detection and localization will lead to faster recovery times. Hence, we plan to integrate partial reconfiguration into our FDIR framework.

A third aspect of future work is to maintain the operability of the onboard processor in the presence of permanent faults caused by radiation-induced latch-ups, by leveraging the reprogrammability of SRAM-based FPGAs. Over time, parts of the FPGA configuration memory may become permanently damaged, especially when COTS FPGAs are used in long-lasting missions. We plan to address this issue by storing multiple pre-mapped copies of the same circuit, which will allow us to ‘re-place’ the circuit around the damaged area.