1 Introduction

The last two decades have seen the introduction of high-quality imaging sensors based on Complementary Metal Oxide Semiconductor (CMOS) technology [1,2,3]. This development has resulted in large sensing arrays with square pixels in the range of 1–10 μm, which are ideal for various metrology and sensing applications for which point or linear array detectors are typically used. Through combination of individually captured pixel intensities, sensing areas of arbitrary shape are possible. With the older Charge Coupled Device (CCD) technology it was possible to combine adjacent pixels through binning of the analogue signals, but with severe restrictions on the possible combinations. Unlike CCD sensors, their CMOS counterparts allow for integration on a single silicon die of the sensor together with peripherals such as analogue-to-digital converters and fast differential data channels, allowing for high frame rates [4]. Frame rates can be increased further by limiting the number of pixels that are read out.

There are, however, two main hurdles to overcome in any advanced sensing application using CMOS imagers. First, the interface of advanced or scientific CMOS sensors is complex and requires several phase-related, high-speed clock signals to transfer pixel intensities. In some cases each pixel is read twice using different amplifiers to allow for enhanced dynamic range readings. Such interfaces are usually non-standard and differ between CMOS imagers even if sourced from the same manufacturer. Providing flexible interfaces that allow quick adaptation of the readout is crucial, especially because the sensors have short life cycles.

Second, the data bandwidth required to handle pixel intensities is significant. With millions of pixels to be processed per image frame and anywhere from 10 to 1000 frames per second, the mere transmission and storage of the frames becomes a significant task for any processing system. Often simple readout, pixel summing and storage is not sufficient, and more complex operations are required to extract salient information from the raw data.

In order to address these two challenges, we have developed various generations of heterogeneous systems using a field programmable gate array (FPGA) interfaced to a System-on-a-Chip (SoC). The development started around 2005 with the affordable Cyclone® range of FPGAs, which featured sufficiently large on-chip memory blocks. An FPGA is an ideal and flexible way of creating fast sensor interfaces and introducing parallel image operations. The SoC of choice integrates a digital signal processor (DSP) and a graphics processing unit (GPU) with an ARM® central processing unit, and provides a freely programmable platform that handles high-level system tasks and communication with external systems. We have opted for an ARM® based SoC because it allows the creation of a compact, low-power solution that does not need active cooling. For a complete 'smart' camera system that is able to interface with other systems, a central processing unit is needed that runs a generic operating system.

FPGAs are reprogrammable through a dedicated Joint Test Action Group (JTAG) port, which is traditionally driven either by a dedicated programmer module interfaced to the development system via USB or by a special non-volatile memory chip. In order to allow the functionality of the FPGA to be changed on demand, we have added the option of reprogramming the FPGA using a couple of general purpose input/output pins of the ARM processor.
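At its core this approach amounts to bit-banging the four JTAG signals from software. The sketch below illustrates the idea for shifting one byte through the scan chain; the pin numbers and GPIO helper functions are hypothetical placeholders rather than our actual implementation, which must also drive the full JTAG state machine and the Cyclone-specific programming sequence.

#include <stdint.h>

/* Hypothetical GPIO accessors; on the AM3358 these would resolve to
   reads/writes of the memory-mapped GPIO bank registers. */
void gpio_set(int pin, int level);
int  gpio_get(int pin);

enum { TCK = 48, TMS = 49, TDI = 50, TDO = 51 };  /* assumed pin numbers */

/* Shift one byte through the scan chain, LSB first, capturing TDO.
   TMS is raised on the last bit so the TAP leaves the shift state. */
uint8_t jtag_shift_byte(uint8_t out, int last)
{
    uint8_t in = 0;
    for (int i = 0; i < 8; i++) {
        gpio_set(TDI, (out >> i) & 1);
        gpio_set(TMS, last && i == 7);
        gpio_set(TCK, 1);               /* target samples TDI on this edge */
        in |= (uint8_t)(gpio_get(TDO) << i);
        gpio_set(TCK, 0);               /* target updates TDO on this edge */
    }
    return in;
}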

For all the flexibility FPGAs offer, their programming is notoriously time consuming and requires additional development tools. Their use has, however, benefitted from ready-made functional units, often called intellectual property (IP) cores. Intel, the supplier of the Cyclone family of FPGAs, for example provides the Video and Image Processing Suite of IP cores that use the Avalon streaming video standard of data transmission [5]. This standard is well documented and uses video packets, control packets and/or user packets, making it possible to add dedicated image processing IP cores into the processing chain. In order to handle the large amount of raw data collected from the sensor, a high-bandwidth, large-capacity memory bank that is tightly coupled to the FPGA is paramount. In addition, the data path between the FPGA and the other processing units of the system should offer high bandwidth as well.

In the following sections, we will describe our embedded vision platform, which is based on commercially available components, in detail. This platform has been used to create systems with a range of sensors spanning the electromagnetic spectrum from long wave infrared to X-rays [6,7,8,9,10]. Over the last 15 years this platform has evolved as new processors and FPGAs have become available, but all versions adhere to a similar heterogeneous design. With the ever increasing capability of modern FPGAs, some of which have integrated hard processors, it is now attractive to create fully integrated camera designs based on just a single FPGA. For a truly versatile system that can be adapted quickly to changing image processing tasks it is, however, still advantageous to integrate multiple separate processor types and interface these to an FPGA, and not just from a cost and power budget point of view. We have used low-cost FPGAs from Intel and will describe development tools that are specific to this supplier. Several other suppliers (e.g. Xilinx, Lattice) offer similar devices and development tools.

As an application we describe in detail an X-ray beam diagnostics instrument developed particularly for synchrotron radiation sources. More details about the measuring method can be found elsewhere [11].

2 ARMflash System Overview

The ARMflash is a platform developed in-house that uses the same Texas Instruments SoC that features on the popular BeagleBone® community-supported development board series [12]. Over the last 15 years or so the platform hardware has changed. Initially we relied on Gumstix Overo modules coupled to either a Cyclone I or Cyclone III FPGA. More recently, we have done away with the Gumstix modules and integrated discrete components directly, using a Cyclone V FPGA (see Fig. 1). The form factor of 66 × 100 mm² and the connector layout have remained the same throughout. At its core, the ARMflash has a Texas Instruments ARM® based SoC and an Intel Cyclone V FPGA. Both devices have dedicated 512 Mbytes DDR3 SDRAM memory banks attached, providing storage for large data sets. The amount of data taken from the sensor is too big to be stored in the memory blocks on the FPGA; the dedicated 512 Mbytes DDR3 bank creates sufficient storage to handle several images. This memory is connected only to the FPGA, but its contents can be accessed by the processor using the FPGA as a bridge.

In terms of connectivity, the platform exposes most of the FPGA input/output (I/O) pins via two high-density connectors (AMP and ERNI). Several of the communication and debug interfaces of the SoC, such as the Gigabit Ethernet, USB and serial ports, are made available through separate connectors.

Fig. 1

Photograph of an ARMflash board with a Cyclone V FPGA. Top view: (1) FPGA, (2) AM3358 SoC, (3) & (4) 512 Mbytes DDR3 memory, (5) & (6) LCD screen/debug connectors, (7) 1G Ethernet port, (8) USB OTG port, (9) UART port, (10) 50-pin ERNI FPGA I/O right-angled connector, (11) FPGA JTAG port, (12) flex cable FPGA I/O connector, (13) eMMC NAND flash memory. Bottom view: (1) micro SD card holder, (2) 80-pin AMP and (3) 60-pin AMP FPGA I/O connectors

The platform provides both a micro secure digital (SD) card slot and NAND flash memory as non-volatile storage options. The system is powered from a single 12 V DC supply, with all voltages required to power individual components generated by efficient step-down converters. Typically a module requires 6 W or less; this loading depends on the configuration and clock speed of the FPGA.

A UART to Universal Serial Bus (USB) converter chip is used to access the serial terminal of the processor from a USB connection on a computer. Accessing the processor through this interface is required in the early stages of development; once an OS has been booted and the network interface configured, it is possible to use the Secure Shell (SSH) protocol to access the system.

The system usually runs an OpenEmbedded Linux distribution as its operating system (OS) [13]. During development, the OS is typically stored on an SD card in order to have a faster way of changing the contents of the root file system and of the Linux kernel image. The possibility of booting from a system on an SD card also provides a simple and fast way to restore the system to a previously saved working state. On a final working system, the OS and the server program are stored in the NAND flash memory, so that an external SD card is not needed. Full schematic diagrams are published through the GitHub link provided in the supporting data section of this paper.

2.1 ARM SoC

We have opted to use a Texas Instruments SoC device from the OMAP or Sitara family. This SoC combines a Cortex A8 ARM® processor, a PowerVR GPU and two Programmable Real-time Units (PRUs) (see Fig. 2). The Cortex A8 is a dual-issue superscalar processor, which means it can execute two instructions simultaneously most of the time. It is also equipped with the Advanced Single Instruction Multiple Data (SIMD) Extension (NEON), which can handle a combined maximum data width of 128 bits. The processor performs computation with a 13-stage integer pipeline and a 10-stage NEON pipeline. It is configured with a 32 kB 4-way set associative level 1 (L1) cache and a 256 kB 8-way set associative level 2 (L2) cache for efficient data fetching from memory. The PowerVR SGX530 is a 3D hardware accelerator supporting OpenGL ES 2.0 with a computation capability of 1.6 GFLOPS (giga floating-point operations per second) when running at 200 MHz. Lastly, the two PRU subsystems offer substantial DSP capability due to their support of the multiply-accumulate instruction. The PRU cores are able to toggle general purpose input/output (GPIO) pins at a significantly faster rate than can be achieved by programming the ARM processor.
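As an illustration of how the NEON extension can be used in this context, the sketch below sums 16-bit pixel intensities eight at a time using compiler intrinsics. The flat buffer layout is an assumption, and the routine is merely indicative of how region-of-interest summing can be vectorised on the Cortex A8.

#include <arm_neon.h>
#include <stdint.h>
#include <stddef.h>

/* Sum n 16-bit pixel values, processing eight pixels per iteration. */
uint64_t sum_pixels(const uint16_t *px, size_t n)
{
    uint32x4_t acc = vdupq_n_u32(0);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint16x8_t v = vld1q_u16(px + i);  /* load 8 pixels (128 bits) */
        acc = vpadalq_u16(acc, v);         /* pairwise add into 32-bit lanes */
    }
    uint64x2_t wide = vpaddlq_u32(acc);    /* widen to two 64-bit lanes */
    uint64_t total = vgetq_lane_u64(wide, 0) + vgetq_lane_u64(wide, 1);
    for (; i < n; i++)                     /* scalar tail */
        total += px[i];
    return total;
}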

Fig. 2

Block diagram of the Texas Instruments Sitara SoC featuring an ARM Cortex A8 processor, a digital signal processor and a dedicated graphics processing unit (highlighted in red). Two additional Programmable Real-time Unit subsystems are also available on the L3 interconnect fabric

Supporting the processors, the SoC has an enhanced direct memory access (EDMA) engine to offload the CPU from large data transfer transactions. It also integrates many peripheral modules including a Gigabit Ethernet MAC, USB 2 and a general-purpose memory controller (GPMC). Further commonly used interfaces such as the Serial Peripheral Interface bus (SPI), the Universal Asynchronous Receiver-Transmitter (UART) and the Inter-Integrated Circuit (I2C) communication interface are also supported.

2.2 Intel Cyclone® V FPGA

The Cyclone® V FPGA is the most recent addition to the Cyclone® series of low-cost FPGAs, which started with the Cyclone I variant introduced in 2002. The fifth generation was introduced in 2011 and is fabricated with a 28 nm process technology. An important feature of this generation is a hard external memory controller for attaching external DDR3 memory banks, providing 800 megabits per second (Mbps) per pin of external memory bandwidth. The FPGA fabric comprises up to 300,000 logic elements arranged in vertical columns of adaptive logic modules, 12 Mbit of embedded memory and dedicated DSP blocks. All of these logic resources are interconnected through a clocking network with several phase-locked loops (PLLs) providing plenty of programmable clock generators. The GX version of the Cyclone V features hard IP blocks for fast transceivers. The I/O elements provide 840 Mbps Low Voltage Differential Signalling (LVDS) and support all mainstream differential and single-ended I/O standards, including 3.3 V.

The FPGA can be configured using one of three different methods. During development, the standard JTAG interface is coupled through a USB interface to a development PC that runs the Quartus® FPGA design software. For functioning systems the FPGA programming can be taken over by a dedicated non-volatile memory chip or, alternatively, be done through the ARM processor. The latter method is implemented using general-purpose I/O lines of the SoC, which are multiplexed with the external JTAG interface. In principle the JTAG protocol can be mimicked by the ARM processor itself, but it can be done faster using one of the two PRU cores, which allow higher I/O switching rates. Over 120 FPGA I/O pins are exposed via two high-density connectors: a 50-pin ERNI and an 80-pin AMP/TYCO connector. Some versions of the system have an additional 60-pin AMP/TYCO connector, further expanding the number of I/O pins that can be used. These connectors also carry power and a couple of serial buses (SPI/I2C) and allow daughter boards to be attached for extended functionality and connection to a range of image sensors. The serial buses allow for automatic discovery and/or configuration of the attached sensors.

2.3 SoC – FPGA Bridge

2.3.1 SoC side

Within the SoC, the GPMC is connected to the L3 interconnect, which operates at 100 MHz, using a 32-bit wide data port, and has direct connections to the EDMA engine and the interrupt controller. Externally, the GPMC implements a simple memory interface consisting of address and data buses with various control signals. It is capable of interfacing to non-volatile memory devices such as NAND and NOR flash, and to volatile memory like SRAM. It can address up to 2^28 bytes per individual chip-select (3 available) and perform burst accesses of up to 32 bytes with a parallel data bus up to 16 bits wide. Although both asynchronous and synchronous types of data transfer are supported, the choice between them is simple from the point of view of significant data exchanges: only when using the synchronous burst mode can 16 bits be moved on every clock cycle of a burst read/write. The price to pay is that setting up each transfer takes a few extra clock cycles, but since most data transfers consist of more than a single read or write this is of no further concern.

Configuration of the GPMC interface takes place through a bank of dedicated registers; details of these can be found in the technical reference manual of the AM3358. Each burst read/write transaction can be conducted using the EDMA controller. Note that transactions on the GPMC are conducted with the SoC as the master and the FPGA as a slave. In the ever-growing range of TI ARM-based processors the GPMC is a constant; the use of this high-bandwidth port therefore provides our interface of choice for exchanging image data with the FPGA.

2.3.2 FPGA side

A designer has the choice of standard and non-standard solutions to create an interface between the SoC and the FPGA using the GPMC bus. Intel provides a convenient system integration tool as part of its Quartus® development software called Qsys (recently renamed Platform Designer). Qsys allows users to choose the components, such as processing modules and input/output interfaces, needed to build a system by selecting them in a graphical user interface. In creating the complete system, Qsys relies on the Avalon switch fabric to connect the desired modules and automatically generates the hardware system. Several other options exist. For example, within Qsys one could also use ARM AMBA AXI interfaces, but these would be converted to Avalon interfaces during implementation. Of course, one could opt not to use Qsys at all and develop bespoke solutions, without the benefit of reusing the many available IP cores offered through Qsys; one could, for example, build the system around the Opencores Wishbone interconnect [14]. Since we would like to use the Avalon video streaming library, the use of the available Avalon® Memory Mapped interface is beneficial. In order to create the SoC-FPGA bridge we use an Avalon® Memory Mapped (AVMM) master. This synchronous interface offers configurations for both pipelined and burst transactions. Its data bus width is configurable up to 1024 bits, with a variable burst size set by a 32-bit wide counter. The AVMM interface is able to handle one transfer per clock cycle and uses a slave-side arbitration scheme to allow multiple masters to perform transactions at the same time.

A dedicated EVS-MUX IP core was developed for interfacing the GPMC and the FPGA as implemented on the ARMflash platform. In the following section we will describe this IP core and the other EVS cores that are part of the library, which allow users to create complete imaging systems. The library can be imported directly into Quartus for use with the Qsys tool by simply adding the path of the directory where the EVS library is stored to the options settings of Qsys.

The EVS-MUX interface consists of three modules that form a bridge linking the ARM SoC and the FPGA, providing means of communication for different scenarios. Effectively, by using this interface the FPGA behaves as a memory mapped device. The first of these modules is the multiplexer (EVS-MUX) module, which interfaces directly with the GPMC and acts as a signal router to the different bridge modules. Of the 26 available address bus lines (the least significant address line becomes redundant with 16-bit word data transfers), the most significant two bits are used as the select signals for the multiplexing, allowing a maximum of four bridge modules to be attached using GPMC-link conduits. Only a single EVS-MUX module is needed to connect up to four bridge modules. The MUX module also provides the option of issuing a software reset to the FPGA by writing to a reserved address from the SoC side. The resulting address decoding is illustrated below.
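The macros below restate this decoding rule in C for clarity; the slot assignment shown for the PIO and DMA modules is illustrative rather than fixed.

#define GPMC_ADDR_BITS  26                    /* word address lines */
#define MUX_SEL_SHIFT   (GPMC_ADDR_BITS - 2)  /* bits 25:24 select a module */
#define MUX_MODULE(a)   (((a) >> MUX_SEL_SHIFT) & 0x3u)
#define MUX_OFFSET(a)   ((a) & ((1u << MUX_SEL_SHIFT) - 1u))

/* Hypothetical assignment of two of the four bridge slots. */
enum { EVS_SLOT_PIO = 0, EVS_SLOT_DMA = 1 };  /* slots 2 and 3 unused */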

Two bridge modules are available: a parallel input/output (PIO) module and a DMA module. The PIO module allows fixed-size (2, 4 or 8 byte) transactions between the SoC and the FPGA. This type of transfer is programmed I/O using a non-burst access mode. The address for each AVMM bus transaction is derived from the GPMC bus address. The typical use of this module is random access to the control & status registers of the various modules implemented on the FPGA; most FPGA modules have one or more registers that control their operation or show their status. For some modules, such as pulse width modulators or I2C interfaces, this method can also be used for passing data.
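From Linux user space, programmed I/O through the PIO module boils down to mapping the GPMC chip-select window and performing 16-bit accesses. The sketch below shows the idea; the physical base address and window size are placeholders, and a production system would normally hide the mapping behind a kernel driver.

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define GPMC_CS0_BASE  0x01000000UL   /* placeholder chip-select base */
#define WINDOW_SIZE    0x01000000UL   /* placeholder window size */

/* Map the GPMC window so the FPGA appears as an array of 16-bit words. */
volatile uint16_t *evs_map(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0)
        return 0;
    void *p = mmap(0, WINDOW_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, GPMC_CS0_BASE);
    close(fd);
    return p == MAP_FAILED ? 0 : (volatile uint16_t *)p;
}

/* 16-bit register accessors; offsets are in words on the AVMM side. */
static inline void evs_wr(volatile uint16_t *base, unsigned reg, uint16_t v)
{
    base[reg] = v;
}
static inline uint16_t evs_rd(volatile uint16_t *base, unsigned reg)
{
    return base[reg];
}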

The DMA module is a simple engine performing direct memory access (DMA) transfers using burst transfers. The EVS DMA module operates according to pre-configured transaction parameters, in a way similar to a conventional DMA engine in an SoC. The module has a read DMA controller and FIFO with read-ahead capability, and a write DMA controller and FIFO with write-posting capability. It can support concurrent streaming of data in both directions between the GPMC and the AVMM bus. Sequential AVMM addresses are generated for each burst transaction, making the module suited to sequential access to large memory blocks. The available bandwidth of the bridge has a theoretical upper limit of 200 Mbyte/s, but due to the overhead of setting up burst transfers the practical limit is about 150 Mbyte/s [15].
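These limits follow directly from the bus parameters: the synchronous burst mode moves one 16-bit word per 100 MHz clock, so

B_peak = 100 MHz × 2 bytes = 200 Mbyte/s

If, as an illustrative assumption, each 16-word (32-byte) burst costs about five extra clock cycles of setup, the sustained rate becomes 200 × 16/(16 + 5) ≈ 152 Mbyte/s, in line with the measured practical limit.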

Figure 3 shows a screenshot of the Qsys tool with part of a complete vision system featuring an EVS-MUX module wired to a PIO and a DMA module using two GPMC-link conduits. Data transfers to the DDR3 memory that is directly attached to the FPGA are handled through a universal SDRAM controller attached to the Avalon fabric as an AVMM slave module. The SYSRST line from the EVS-MUX module that drives the FPGA reset input is also shown.

Fig. 3

Screenshot of the Qsys tool. The EVS library modules that create an interface between the SoC and the FPGA, and their connections to the Avalon fabric, are shown

3 The Intel Avalon® Interface

The Avalon switch fabric is at the basis of the Qsys system builder tool and provides a means of interfacing both high-speed parallel and serial data flows inside a system created with Intel FPGAs. Avalon interfaces are an open standard [16]. Intel supports users with a set of libraries containing IP cores that adhere to the Avalon standard. Several interfaces are defined as part of the standard:

  • Avalon Streaming Interface (AVST) - an interface that supports the unidirectional flow of data, including multiplexed streams, packets, and DSP data.

  • Avalon Memory Mapped Interface (AVMM) - an address-based read/write interface typical of master–slave connections.

  • Avalon Conduit Interface (AVCI) - an interface type that accommodates individual signals or groups of signals that do not fit into any of the other Avalon types.

  • Avalon Tri-State Conduit Interface (AVTC) - an interface to support connections to off-chip peripherals. Multiple peripherals can share pins through signal multiplexing, reducing the pin count of the FPGA and the number of traces on the PCB.

  • Avalon Interrupt Interface - an interface that allows components to signal events to other components.

AVST interfaces are intended for components that drive high-bandwidth, low-latency, unidirectional data. Typical applications include multiplexed streams, packets and DSP data. The Avalon-ST interface signals can describe traditional streaming interfaces supporting a single stream of data without knowledge of channels or packet boundaries. The interface can also support more complex protocols capable of burst and packet transfers, with packets interleaved across multiple channels.

Figure 4 shows a very simple example of a system based on the Avalon interface in which both Avalon streaming and memory mapped interfaces are used. This diagram does not show the required conduits (apart from the two GPMC links), reset/clock and interrupt interfaces, but instead highlights the overall structure with the various master and slave AVMM components.

Fig. 4

Block diagram of the FPGA part of a simple imaging system as implemented with EVS modules. The processor GPMC bus is interfaced to the Avalon Memory Mapped bus using three modules from the EVS IP library. These modules act as master controllers in order to configure and control pixel data flowing via the Avalon Stream interface to a local DDR3 memory that is attached to the FPGA

The sensor control module hands raw image data to an AVST interface that is received by a frame store module, which uses a standard Intel library SDRAM interface through the AVMM interface to store frame data in the off-chip DDR3 memory bank. Through its AVMM slave interface the sensor control module can be configured and triggered. Operations on the raw image data can be accomplished either by inserting AVST modules between the sensor control module and the frame store module, by replacing the frame store module itself, or by handling them outside the immediate pixel data stream using separate modules that read stored frame data from memory.

Adhering to the AVMM and AVST open standards brings the obvious benefit of being able to draw upon a library of modules developed by others. Intel provides a dedicated library holding many IP cores that handle all the usual image processing tasks, including conversion between various video standards.

3.1 Avalon Video and Image Processing

The Video and Image Processing (VIP) IP cores conform to the Avalon streaming video standard of data transmission. This standard is a configurable protocol layer that sits on top of the Intel Avalon streaming standard. The VIP protocol consists of control packets, which contain information about image size, and video packets, which contain the pixel data. Packets are marked by assertion of the start_of_packet (SOP) and end_of_packet (EOP) signals. Downstream flow control is provided by the valid signal from the source; upstream flow control is provided by the ready signal from the sink. The individual video formats supported (e.g. NTSC, 1080p, UHD 4K) depend primarily on the configuration of the Avalon streaming video standard and the clock frequency. The IP cores may transmit pixel information either in sequence or in parallel, in RGB or YCbCr colour spaces, and under a variety of different chroma samplings and bit depths, depending on which is the most suitable for a particular application.
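This handshake can be modelled in a few lines of software. The toy model below is not part of the IP library; it merely illustrates the rule described above that a beat transfers only when the source's valid and the sink's ready coincide, with SOP/EOP framing the packets.

#include <stdbool.h>
#include <stdint.h>

struct avst_beat {
    uint32_t data;   /* one or more symbols per beat; width is illustrative */
    bool     sop;    /* start_of_packet marker */
    bool     eop;    /* end_of_packet marker */
    bool     valid;  /* driven by the source */
};

/* A sink consumes a beat only when it is ready; returning false models
   upstream backpressure stalling the source for this clock cycle. */
bool sink_consume(const struct avst_beat *b, bool sink_ready)
{
    if (!b->valid || !sink_ready)
        return false;                /* no transfer this clock cycle */
    if (b->sop) { /* begin a new packet, e.g. latch control information */ }
    if (b->eop) { /* packet complete */ }
    return true;
}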

The VIP Suite of IP cores is available in the DSP library of the Quartus software. The IP cores can be configured for the required number of bits per symbol, symbols per pixel, symbols in sequence or parallel, and pixels in parallel. The VIP Suite IP cores that permit run-time control of some aspects of their behaviour use a common type of Avalon-MM slave interface, consisting of a set of control registers that must be set by external hardware. The set of available control registers and the width in bits of each register vary with each control interface.

The first two registers of every control interface perform the following two functions (the others vary with each control interface):

  • Register 0 is the Go register. Bit zero of this register is the Go bit. A few cycles after the function comes out of reset, it writes a zero in the Go bit (remember that all registers in Avalon-MM control slaves power up in an undefined state).

  • Although there are a few exceptions, most Video and Image Processing Suite IP cores stop at the beginning of an image data packet if the Go bit is set to 0. This allows you to stop the IP core and to program run-time control data before the processing of the image data begins. A few cycles after the Go bit is set by external logic connected to the control port, the IP core begins processing image data. If the Go bit is unset while data is being processed, then the IP core stops processing data again at the beginning of the next image data packet and waits until the Go bit is set by external logic.

  • Register 1 is the Status register. Bit zero of this register is the Status bit; the function does not use the other bits. The function sets the Status bit to 1 when it is running, and to zero otherwise. External logic attached to the control port must not attempt to write to the Status register. A minimal sketch of this Go/Status handshake is shown after this list.
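Using the hypothetical evs_rd/evs_wr accessors from the earlier PIO sketch, the Go/Status protocol can be driven as follows; the word indices assume the 16-bit view of the registers through our bridge, which is a simplification of the 32-bit VIP register map.

#define REG_GO      0   /* Go register, bit 0 = Go bit */
#define REG_STATUS  1   /* Status register, bit 0 = Status bit */

/* Stop a VIP core, reprogram it, and restart it. */
void vip_reprogram(volatile uint16_t *core)
{
    evs_wr(core, REG_GO, 0);               /* request a stop */
    while (evs_rd(core, REG_STATUS) & 1)
        ;                                  /* core halts at the next packet */
    /* ... write run-time control registers here ... */
    evs_wr(core, REG_GO, 1);               /* resume processing */
}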

3.2 The EVS IP core Library

A set of flexible Embedded Vision Systems (EVS) IP cores was developed in our research group for creating control and image capture/processing systems. Interfacing of the sensors to the ARMflash platform is accomplished using bespoke PCBs that are plugged into one of the connectors exposing the FPGA pins. Working with state-of-the-art sensors often requires a completely new sensor control and readout module, since most of these sensors do not use the standard interfaces (e.g. USB, MIPI CSI2 [17]) used by sensors for consumer markets. Sensors are usually configured using a set of module registers that hold the multitude of settings needed to operate the sensor, thus giving the system full control. In order to capture raw images from the sensor, the module generates the required clock and control signals needed to extract image data. These signals are generally high frequency and can be single-ended or differential in nature. Often the actual sensor read-out is simply a case of providing a pixel clock signal and horizontal & vertical synchronisation signals after a short frame integration time has passed. Some sensors, however, require very complex operations where, for example, data needs to be descrambled and reformatted before images can be captured. An example is the Bayer de-mosaicing that is needed to obtain images from colour sensors.

Image data correction/calibration is another important task that is often required. Many sensors have a temperature-dependent background that must be removed carefully. Generally this is accomplished by collecting many image frames with identical integration times and calculating a frame with average pixel values. This so-called dark image is then automatically subtracted, inside the FPGA, from any collected image. For convenience this function can simply be enabled or disabled by setting the appropriate bit in the control register of the module. For measurement systems that use image sensors there are various image processing tasks that can be implemented in FPGAs, thus releasing the other processing elements in the system to focus on other tasks. Examples of frame-based processing tasks are histogramming of pixel intensities, calculation of column/row image profiles and region-of-interest pixel summing/binning.
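In software terms the correction amounts to the following minimal sketch, assuming 16-bit pixels in flat buffers; the FPGA implementation applies the same arithmetic per pixel on the streaming data rather than looping over frames in memory.

#include <stdint.h>
#include <stddef.h>

/* Average nframes dark frames of npix pixels each into dark[]. */
void dark_average(const uint16_t *frames, size_t nframes, size_t npix,
                  uint16_t *dark)
{
    for (size_t p = 0; p < npix; p++) {
        uint32_t acc = 0;
        for (size_t f = 0; f < nframes; f++)
            acc += frames[f * npix + p];
        dark[p] = (uint16_t)(acc / nframes);
    }
}

/* Subtract the dark image from a captured frame, clamping at zero. */
void dark_subtract(uint16_t *frame, const uint16_t *dark, size_t npix)
{
    for (size_t p = 0; p < npix; p++)
        frame[p] = frame[p] > dark[p] ? frame[p] - dark[p] : 0;
}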

4 Medipix3RX Sensor Based System

We will now discuss an application of the ARMflash platform that highlights the benefits of the technology. The application of choice is in situ X-ray beam monitoring. Our research group has patented an innovative way of recording the beam position, shape and intensity without placing a sensor directly into the X-ray beam [18]. To establish the vertical and horizontal beam position with sub-micrometre precision, the Gaussian-shaped beam is profiled and its centre of gravity determined by fitting a Gaussian curve to the measured vertical and horizontal profiles of the beam, obtained by summing all columns and rows of the sensor respectively. Through this process of adding all the columns and rows of the X-ray imaging sensor the signal-to-noise ratio is improved tremendously, and after fitting the beam position is established with sub-pixel accuracy.
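A minimal sketch of the profile-and-centroid step is shown below, assuming 16-bit pixels in a flat row-major buffer; the production system fits a Gaussian to the profiles rather than stopping at the centre of gravity, but the centroid already illustrates how sub-pixel positions emerge from column sums.

#include <stddef.h>
#include <stdint.h>

/* Collapse a frame into column sums (the horizontal profile) and return
   the intensity-weighted centre of gravity as a sub-pixel column index. */
double column_centroid(const uint16_t *frame, size_t rows, size_t cols)
{
    double num = 0.0, den = 0.0;
    for (size_t c = 0; c < cols; c++) {
        uint64_t colsum = 0;                 /* one point of the profile */
        for (size_t r = 0; r < rows; r++)
            colsum += frame[r * cols + c];
        num += (double)c * (double)colsum;
        den += (double)colsum;
    }
    return den > 0.0 ? num / den : -1.0;     /* -1 flags an empty frame */
}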

The Medipix is a family of photon-counting hybrid pixel detectors used for particle tracking and detection. The Medipix detectors have been developed by the Medipix Collaboration, an international collaboration based at the European Organisation for Nuclear Research (CERN) [19]. The Medipix1 chip was the first version of this detector, developed for high-energy physics to address the needs of particle tracking at the CERN Large Hadron Collider (LHC). To expand the use of this technology to new scientific fields, new versions of the detector were developed by the Medipix2 collaboration, started in 1999, and the Medipix3 collaboration, started in 2005. The Medipix3RX is the latest revision of the resulting hybrid pixel detector chip. Through a collaboration with Diamond Light Source we developed an X-ray diagnostics module that collects full information on the beam shape and centroid position on a frame-by-frame basis [6, 20].

Each pixel of the Medipix3RX contains the electronics required to detect the signal generated by the interaction of X-rays with the sensing layer placed above it, amplify and shape the signal, compare it with one or two thresholds, and store the event in registers. To equalise the response of each pixel, there are therefore several gain factors and offsets that must be tweaked by uploading different values to each pixel before measurements can be made.

The operation modes of the Medipix3RX represent the different ways in which the incoming photons are processed. The operation modes can be divided into two categories: the number of energy thresholds available per pixel input, and the use of shared charge reconstruction. Each category has two options, generating four possible combinations or operation modes. The Medipix3RX uses an Operation Mode Register (OMR) to determine the operation to be realised by the chip. Every action of the Medipix3RX starts with the control electronics sending the 32-bit OMR, followed by a 'Start Operation' command to execute the action selected in the OMR.
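The control flow is easy to express schematically. In the sketch below the OMR bit placement and the helper functions are hypothetical; the real field assignments are defined in the Medipix3RX documentation.

#include <stdint.h>

/* Assumed helpers provided by the FPGA sensor-control module. */
void send_omr(uint32_t omr);          /* shift the 32-bit OMR into the chip */
void send_start_operation(void);      /* issue the 'Start Operation' command */

enum mpx3_mode {                      /* the four combinations described above */
    MPX3_1TH     = 0,                 /* one threshold, no charge summing */
    MPX3_2TH     = 1,                 /* two thresholds, no charge summing */
    MPX3_1TH_CSM = 2,                 /* one threshold, charge summing */
    MPX3_2TH_CSM = 3                  /* two thresholds, charge summing */
};

void mpx3_start(enum mpx3_mode m)
{
    uint32_t omr = (uint32_t)m;       /* hypothetical placement of mode bits */
    send_omr(omr);
    send_start_operation();
}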

Fig. 5

Block diagram of the Medipix3RX FPGA modules and their interconnections. Not shown are the EVS-MUX, EVS-GPIO and EVS-DMA modules that allow collected image and profile data to be transmitted back to the SoC

The image data can be sent from the Medipix3RX using 1, 2, 4 or 8 data lines in a scrambled fashion. Reducing the number of data lines used lowers the I/O requirements on the control logic but also increases the readout time. The time required for the readout of a frame depends on the number of data lines used, the clock frequency sent to the Medipix3RX and the depth selected for the pixel counters. The Medipix3RX sensor module (see Fig. 5) handles pixel matrix configuration by uploading a multidimensional matrix of equalisation values for each pixel, unscrambles the data values read from the sensor and passes on the pixel data in the form of an Avalon stream.
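The frame readout time follows directly from these parameters. As a worked example, for the 256 × 256 pixel matrix of the Medipix3RX read over 8 data lines with 12-bit counters and an assumed 100 MHz readout clock:

t_read = (256 × 256 × 12 bits) / (8 lines × 100 MHz) ≈ 1 ms

Each halving of the number of data lines doubles this figure; the clock frequency and counter depth used here are example values only.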

In order to determine the beam position as fast as possible, we have developed a profiler module on the FPGA that is fed directly by an Avalon stream of pixel data. For each frame this module produces a vertical and a horizontal profile of the X-ray beam, which are stored in the DDR3 memory bank of the FPGA ready for transmission to the SoC, which is charged with fitting each profile and determining the beam position from the fit. In order not to disturb the full image acquisition, the pixel stream is split by a standard VIP module. All of this takes place at the full frame rate of the sensor, something that is only possible with front-end processing at the FPGA level.

Several systems based on the Medipix3RX active pixel sensor have been developed for Diamond Light Source Ltd, including one that operates in ultra-high vacuum. The method of measurement is licensed to FMB Oxford Ltd, who have developed the NanoBPM product based on a CMOS imaging sensor.

5 Discussion

The presented image processing platform has allowed us to develop various imaging systems based on a variety of sensors and to perform significant levels of image processing on the platform itself. One of our criteria has been that the platform is energy efficient and can run without fans and/or heat sinks. This demand allows us to integrate the system closely with the image sensor without creating a large thermal background and temporal variation in the pixel intensities; it also allows designers to fully shield the system from damaging X-rays and electronic noise. We have followed the integration of ARM processors and FPGA devices (e.g. the Cyclone® V SoC) with interest, especially because these provide a faster bridge between the ARM processor and the FPGA. However, these solutions do not offer the close integration with DSP and GPU co-processors. NVIDIA, for example, offers the Jetson™ series of embedded systems [21], which are very popular but lack FPGA options and require large heat sinks. Their power consumption is listed as between 5 and 10 W depending on computation performance. Note that this figure does not include FPGA devices, which tend to dominate power consumption when driving the high-speed signals needed to control and read out image sensors.

An interesting alternative to the GPMC bridge between processor and FPGA would be to use the PCIe interface. This multi-lane high-speed serial interface is integrated on the latest TI SoC processors and is also implemented on the GX variant of the Cyclone V FPGA series. Although the serial encoding reduces the bandwidth somewhat, a single-lane Gen 2 implementation should achieve a bandwidth similar to 150 Mbyte/s. Again, these higher performance processors will require either passive or even active cooling. If one considers that the platform is often linked to external systems via the Gigabit Ethernet port, it is clear that this port constitutes the limiting factor in terms of bandwidth. In many applications the FPGA performs a large reduction in the amount of image data that has to be transferred across the SoC-FPGA bridge.

The effort of developing embedded vision systems around devices that do not use common interfaces such as USB and MIPI CSI2 [17], using heterogeneous systems, is significant and requires multiple development tools on either side of the processor-to-FPGA bridge. On the processor side our systems have benefited significantly from the large amount of open source code available. Modern and compact versions of the Linux operating system are available that provide an easy way to implement fast communication with host systems over the Gigabit Ethernet link using either a TCP/IP or UDP based protocol. For the TI ARM based processors a lot of open source code is available to tailor the integration of the various processing engines and peripheral devices, such as the EDMA for fast data transfers. Without the benefit of available open source code and tools, the programming and development of any system that relies on multiple processing engines, such as those available on our platforms, would become prohibitive.