Main

Image sensors1 play a crucial part in a broad range of applications2,3,4,5. However, with the increasing prevalence of open-world applications such as autonomous machines6, robotics7 and artificial intelligence8, current image sensors face substantial challenges. Despite the development of algorithms tailored for open-world applications9, image sensors have difficulty in handling dynamic, diverse and unpredictable corner cases10 beyond their sensing range, causing algorithm failures11. Some of these challenges are aliasing and quantization error, data redundancy and limited dynamic range at the sensing level11, as well as semantic misalignment, latency problem12 and domain shift13 at the perception level14 (Fig. 1a). To effectively address corner cases, image sensors must possess exceptional performance in spatial resolution, speed, precision and dynamic range simultaneously. Yet, achieving this goal is impeded by the power and bandwidth walls15. Traditional sensors exhibit escalated power and bandwidth requirements with higher spatial resolution, speed and precision, leading to constrained capture abilities and excessive data burdens. Moreover, given the rarity of corner cases16, these sensors are prone to generating redundant data, consequently wasting bandwidth and power resources.

Fig. 1: The challenges of open-world visual sensing and the solution with the complementary vision paradigm.
figure 1

a, Corner cases of the open world, such as dynamic, diverse and unpredictable situations, pose various challenges across both sensing and perception levels. b, The complementary visual paradigm requires the parsing of visual inputs into primitive-based representations, the assembly of these primitives into multiple pathways and their subsequent reorganization to accommodate different scenarios in the open world. Our vision sensor integrates these primitives into the COP and AOP, similar to the ventral and dorsal pathways, respectively, in the human visual system. TD, temporal difference with polarities (⊗ for negative, ⊙ for positive); SD, spatial difference with directions (↗, ↘, ↙, ↖).

In contrast to the existing image sensors, the human visual system (HVS) stands out for its versatility, adaptability and robustness in open-world environments. The HVS interprets visual stimuli into multiple visual primitives, such as colour, orientation and motion, and allocates them to the ventral and dorsal pathways in a complementary manner17. The cooperation between the two pathways efficiently provides a unified representation of visual scenes18 (Methods). Various endeavours have been made to replicate specific features of the HVS, including silicon retinas19,20, neuromorphic vision sensors21,22,23,24,25,26,27,28, pulse frequency modulation29,30,31,32, time-to-first-spike33,34,35 and near-sensor-computing chips36,37,38,39. However, challenges persist in achieving image sensors with high spatial resolution, high speed, high precision and large dynamic range within the constraints of limited power and bandwidth.

Here we report a complementary sensing paradigm inspired by the multi-level characteristics of the HVS, along with a vision chip designed based on this paradigm, named Tianmouc. Our paradigm consists of a primitive-based representation and two complementary visual pathways (CVPs) that enable the parsing of a visual scene into primitives and their subsequent assembly into corresponding pathways. As shown in Fig. 1b, these primitives encompass colour, precision, sensitivity, spatial resolution, speed, absolute intensity, spatial difference (SD) and temporal difference (TD), serving as the foundational elements for a comprehensive representation of the scene (Supplementary Note 2). The CVP consists of two distinct pathways: the cognition-oriented pathway (COP) and the action-oriented pathway (AOP), analogous to the ventral and dorsal pathways of the HVS, respectively (Extended Data Table 1). The COP uses primitives of colour, intensity, high spatial resolution and high precision to achieve accurate cognition, minimizing spatial aliasing and quantization errors. By contrast, the AOP uses primitives of SD, TD and high speed to achieve fast responses with robustness and high sparsity, addressing data redundancy and latency problems. The two pathways complement each other in constructing representations of normal and corner cases, thereby achieving high dynamic range and mitigating semantic misalignment and domain shift problems.

Design of the complementary vision chip

Implementing the complementary sensing paradigm in a physical sensing system poses several challenges that must be addressed. It is essential to design the pixel array such that the optical-to-electrical information conversion can be simultaneously parsed into the corresponding primitives on the same focal plane. Furthermore, the readout architectures of the two pathways must incorporate heterogeneous building blocks that can encode electrical information with different data distributions and formats.

As shown in Fig. 2a, the Tianmouc chip, fabricated using a 90-nm CMOS back-side illuminated technology, consists of two centrepieces: a hybrid pixel array for converting optical information into electrical signals and a parallel-and-heterogeneous readout architecture for constructing two CVPs. Inspired by photoreceptor cells, the hybrid pixel array comprises cone-inspired and rod-inspired pixels with varying characteristics such as colours, response modes, resolutions and sensitivities. These pixels can parse visual information into specific colours (red, green and blue) and a white spectrum, serving as colour-opponent primitives. They can also be adjusted for four different sensitivities using high and low charge-to-electrical conversion gain, thereby achieving a high dynamic range by using the low noise of the high-gain mode and the high saturation capacity of the low-gain mode. The cone-inspired pixels are designed with a fine-grained 4-µm pitch for absolute intensity sensing, whereas the rod-inspired pixels feature two larger receptive fields, 8 µm and 16 µm, for sensing TD and SD, respectively. A spatiotemporal consecutive pixel architecture is used to facilitate TD and SD computation through the use of high-density in-pixel memory. Specifically, the rod-inspired pixels buffer historical voltage signals in a ping-pong behaviour to enable continuous computation of TD in the AOP readout. The same memory in rod-inspired pixels across a block can be reorganized to compute the SD, as shown by the operational phase in Fig. 2b. The full hybrid pixel array comprises 320 × 320 cone-inspired pixels and 160 × 160 rod-inspired pixels. Further details about these two types of pixel are provided in the Methods and Extended Data Fig. 2a,b.
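To make the TD and SD primitives concrete, the following is a minimal numerical sketch of how signed temporal and spatial differences could be formed from two consecutive rod-pixel frames. It models the computation digitally for illustration only; on Tianmouc these differences are computed in the analog domain using the in-pixel ping-pong buffers, and the specific diagonal kernels used here are an assumption based on the SD directions indicated in Fig. 1b.

```python
import numpy as np

def temporal_difference(frame_t, frame_prev):
    """TD primitive: signed change between the current rod-pixel frame and the
    buffered previous frame (analog and in-pixel on Tianmouc; digital here)."""
    return frame_t.astype(np.int32) - frame_prev.astype(np.int32)

def spatial_difference(frame_t):
    """SD primitive: signed differences between diagonally neighbouring rod
    pixels (the two diagonal directions are an assumption based on Fig. 1b)."""
    f = frame_t.astype(np.int32)
    sd_diag1 = f[1:, 1:] - f[:-1, :-1]   # one diagonal direction
    sd_diag2 = f[1:, :-1] - f[:-1, 1:]   # the other diagonal direction
    return sd_diag1, sd_diag2

# Example: two consecutive 160 x 160 rod-pixel frames (random stand-ins)
prev = np.random.randint(0, 255, (160, 160), dtype=np.uint8)
curr = np.random.randint(0, 255, (160, 160), dtype=np.uint8)
td = temporal_difference(curr, prev)
sd1, sd2 = spatial_difference(curr)
```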

Fig. 2: The architecture of the Tianmouc chip.
figure 2

a, Schematic of the chip architecture, including the hybrid pixel array and its interaction with multiple pathways. Various primitives such as colour, intensity, TD, SD, sensitivity, spatial resolution, speed and precision are implemented at different levels within the chip. b, In phases 1 and 2, photogenerated electrons are integrated into photodiodes and then converted to voltages with high- or low-gain amplification. In phase 3, the voltages buffered in the pixel are read out for spatiotemporal difference computation, whereas the electrons are integrated for the next sampling time. In phase 4, the in-pixel buffer behaves in a ping-pong manner. c, Electrical schematics of the AOP readout circuits. The TD and SD are first calculated by a subtractor circuit, filtered by a programmable threshold and then processed by a multi-precision convertor. The sparse TD and SD values are packetized at the final stage to achieve maximum bandwidth reduction. d, A microscopic image of the Tianmouc chip.

The electrical signals travelling along the two pathways exhibit distinct characteristics, including differences in data distributions and sparsity. These disparities require the use of different methods for encoding signals into digital data with appropriate speed and precision. To address this challenge, the chip adopts parallel-and-heterogeneous readout architectures. For the COP, accurate conversion of absolute intensity signals to dense matrixes is essential. This is achieved through a single-slope analog-to-digital architecture. By contrast, the AOP requires rapid encoding of signals with spatiotemporal differences characterized by symmetrical Laplacian-like distribution40 and high sparsity. Therefore, a specialized readout architecture is used (Fig. 2c), in which a programmable threshold filter is used to minimize redundancy and noise in the computed TD and SD signals while retaining key information. Subsequently, these signals are quantized using a fast polarity-adaptive digital-to-analog converter with configurable precision. Moreover, a data packetizer is used to achieve lossless compression of the sparse variable-precision TD and SD signals in a unified protocol (Extended Data Fig. 2d). This approach offers adaptive abilities to reduce bandwidth and further enhance the operation speed of the AOP. More details on the readout architectures can be found in the Methods and Extended Data Fig. 2c,d. An optical micrograph depicting the overall layout of the Tianmouc chip is presented in Fig. 2d.
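The following is a minimal functional sketch of the AOP readout chain, assuming a digital stand-in for the analog circuit: spatiotemporal differences below a programmable threshold are zeroed, and the survivors are quantized with their polarity preserved to a precision between ±1 bit and ±7 bits. The full-scale reference and the rounding scheme are assumptions for illustration, not chip parameters.

```python
import numpy as np

def aop_readout(diff, threshold, n_bits, full_scale=None):
    """Functional sketch of the AOP readout: programmable threshold filter
    followed by signed (polarity-preserving) quantization to +/- n_bits."""
    d = diff.astype(np.float64)
    d[np.abs(d) < threshold] = 0.0            # programmable threshold filter
    if full_scale is None:                    # assumed full-scale reference
        full_scale = max(np.max(np.abs(d)), 1e-9)
    levels = 2 ** n_bits - 1                  # e.g. +/-7 bits -> 127 levels per polarity
    return np.round(d / full_scale * levels).astype(np.int32)

# Example: quantize a sparse TD patch to +/-1 bit (output values in {-1, 0, +1})
td = np.array([[0.0, 0.08, -0.3], [0.01, 0.0, 0.4]])
print(aop_readout(td, threshold=0.05, n_bits=1))
```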

Characterization of Tianmouc

The performance metrics of the Tianmouc chip, including quantum efficiency, dynamic range, response speed, power and bandwidth, have been evaluated comprehensively. The chip demonstrates high quantum efficiency in both COP and AOP, achieving maxima of 72% for the AOP and 69% for the COP at 530 nm (Fig. 3a). It accomplishes a high dynamic range by leveraging the dynamic ranges of different gain modes in the complementary COP and AOP. As plotted in Fig. 3b, an overall dynamic range of 130 dB is achieved by detecting the lowest power density of 2.71 × 10−3 μW cm−2 and the highest power density of 8.04 × 103 μW cm−2, in accordance with a well-established standard (Methods and Extended Data Fig. 4).
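As a quick consistency check (assuming that the dynamic range is taken as 20 log10 of the ratio between the highest and lowest detectable power densities), these two limits give

$$20{\log }_{10}\left(\frac{8.04\times {10}^{3}\,\mu {\rm{W}}\,{{\rm{cm}}}^{-2}}{2.71\times {10}^{-3}\,\mu {\rm{W}}\,{{\rm{cm}}}^{-2}}\right)\approx 129.4\,{\rm{dB}}\approx 130\,{\rm{dB}}.$$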

Fig. 3: Summary of chip evaluation.
figure 3

a, Measured quantum efficiency curves show that both pathways exhibit a broad spectrum of response while maintaining high spectral sensitivities. b, Measured signal-to-noise ratios against optical power density show the high dynamic range ability achieved by combining high gain and low gain from the AOP and COP. c, Tianmouc eliminates spatial aliasing of the AOP through high spatial resolution and precision of the COP. d, Robust reconstruction of high-speed video for scenes with high-speed movements under light flash interference. e, Leveraging the high speed of the AOP, Tianmouc can promptly respond to unpredictable and rapidly occurring lightning. The bottom portion shows that the bandwidth consumption of the AOP remains lower than that of the COP, except during lightning events. f, A comparison between the Tianmouc chip and traditional image sensors41,42,43,44,45,46,47,48, as well as neuromorphic vision sensors21,25,49, on FOM with respect to power (top) and bandwidth consumption (bottom).

Source Data

The complementary pathways of the Tianmouc chip enable high spatial resolution and precision (Fig. 3c) and high robustness in unpredictable environments (Fig. 3d). To eliminate spatial aliasing and quantization errors caused by the AOP, the Tianmouc chip complementarily uses the high spatial resolution and precision of the COP. Although the standard Siemens star chart captured by the AOP-SD in Fig. 3c may appear distorted because of its low resolution, the COP accurately records it. As shown in Fig. 3d, in a scene with horizontally fast-moving and rotating objects, and changing illumination conditions, a sudden flash of light disturbs the AOP-TD, but the AOP-SD remains unaffected. By using the COP image with the AOP-TD and AOP-SD, a frame-by-frame reconstruction of high-speed video (Methods) enables the recovery of high-speed motion.

Using the AOP, Tianmouc demonstrates rapid response with reconfigurable speeds ranging from 757 frames per second (fps) to 10,000 fps and precisions varying from ±7 bits to ±1 bit. This complements the comparatively slower speed of the COP, which maintains a sustained response at 30 fps and a 10-bit resolution. The high-speed ability of Tianmouc is assessed by a transient lightning test. As shown in Fig. 3e, Tianmouc can operate at 10,000 fps with ±1 bit at a threshold level of 50 mV to capture fast lightning bolts. It is worth noting that, owing to high sparsity, the AOP consumes only about 50 megabytes per second (MB s−1) of peak bandwidth during transient phenomena, representing a 90% reduction compared with traditional cameras with equivalent spatiotemporal resolution and precision (640 × 320 × 10,000 × 2). More demonstrations of high-speed responses and temporal anti-aliasing can be found in Extended Data Fig. 5.
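A back-of-the-envelope check of the quoted reduction, assuming the equivalent dense stream is 640 × 320 pixels at 10,000 fps with 2-bit samples as stated above:

```python
# Quick arithmetic check of the ~90% peak-bandwidth reduction quoted in the text.
dense_bits_per_s = 640 * 320 * 10_000 * 2            # equivalent dense stream (bits per second)
dense_mb_per_s = dense_bits_per_s / 8 / 1e6          # ~512 MB/s
aop_peak_mb_per_s = 50                                # measured AOP peak bandwidth from the text
reduction = 1 - aop_peak_mb_per_s / dense_mb_per_s    # ~0.90, i.e. about a 90% reduction
print(f"dense: {dense_mb_per_s:.0f} MB/s, reduction: {reduction:.0%}")
```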

We use a comprehensive figure of merit (FOM), similar to that proposed in ref. 41, to evaluate the overall performance of the Tianmouc chip. This FOM incorporates key performance indicators for open-world sensing, integrating the maximum sampling rate (Rmax) and dynamic range into a unified metric (Rmax × dynamic range). In Fig. 3f, the FOM is plotted against the power (top) and bandwidth (bottom) consumption of various sensors. The power consumption of Tianmouc varies based on the operation mode (Extended Data Fig. 5b) and averages at 368 mW in a typical mode (±7 bits, 1,515 fps without threshold). As shown in Fig. 3f, Tianmouc achieves an advanced FOM, surpassing the existing neuromorphic sensors and traditional image sensors, while still retaining low power and low bandwidth consumption. Detailed calculations and comparisons can be found in the Methods.

Performance in the open world

The complementary sensing paradigm provides a large space of design possibilities and serves as an exceptional data source for perception algorithms. To assess these abilities in open-world scenarios, we develop an autonomous driving perception system (Fig. 4a) integrated with a Tianmouc chip. The assessment is conducted on open roads, involving encounters with various corner cases, such as flash disturbances, high dynamic range scenes, domain shift problems (anomalous objects) and complex scenes featuring multiple corner cases. To make use of the advantages of the Tianmouc architecture, we design a multi-pathway algorithm specifically tailored to harness the complementary features of the AOP and COP. At the sensing level, the completeness of primitives enables the reconstruction of the original scene and adaptation to extreme illumination. Meanwhile, at the perception level, the AOP provides the immediate perception of variations, textures and motion, whereas the COP offers fine semantic details. By synchronizing these outcomes, we achieve a comprehensive understanding of the scene.

Fig. 4: Open-world perception experiments.
figure 4

a, A long-distance drive is performed to test the performance of the integrated perception system with a Tianmouc, in which four different types of corner case are encountered. An asynchronous multi-tasking algorithm is designed based on the CVP. The results from the COP (blue detection boxes, green and red masks for segmentation) and those from the AOP (dark red) can be seamlessly aligned and synchronized to provide comprehensive outcomes. The mean average precision (mAP0.50) is calculated for the corner cases (Supplementary Note 8). A higher mAP0.50 value indicates better algorithm performance. In instances in which corner cases interfere with only a few frames, performance metrics are also calculated specifically for the influenced frames (only corner), apart from the overall performance. b, The AOP-SD demonstrates rapid response and strong anti-interference characteristics, effectively mitigating flash disturbances. c, Tianmouc supports tasks in high-speed, high-dynamic-range scenes, facilitated by the efficient extension of the dynamic range provided by the CVP. d, The AOP supports a high-speed optical flow estimation, effectively coping with anomalous objects. e, Tianmouc demonstrates proficiency in concurrently handling multiple corner cases through the cooperation of the CVP. Scale bar, 5 km (a).

Source Data

The first scenario shown in Fig. 4b evaluates the sensing abilities involving a sudden light flash that causes rapid changes in illumination, potentially affecting sensor robustness. Tianmouc exhibits remarkable resilience to such light flashes while maintaining high perception performance in normal situations. For real-time high dynamic range perception (Fig. 4c), the complementary sensitivity of the two pathways enables the Tianmouc to sense high-brightness contrast without sacrificing speed. At the perception level, the anomaly detection ability is complemented by a high-speed optical flow filter on the AOP, in which the collaboration between AOP-TD and AOP-SD enables precise calculation of motion direction and speed for identifying anomalies (Fig. 4d). Figure 4e shows a complex scene with dim natural illumination, a chaotic traffic environment and sudden disturbances from artificial light, demanding diverse sensing abilities in terms of sampling speed, resolution and dynamic range. The algorithms on the CVP provide complementary and diverse results, offering ample room for further decisions in these scenarios. According to the mAP0.50 (mean average precision; Supplementary Note 8) bars, the CVP yields superior overall detection performance compared with using only a single pathway across all the cases in Fig. 4. Notably, it achieves this while consuming less than 80 MB s−1 of bandwidth with an average power consumption of 328 mW. The experimental results demonstrate that Tianmouc can efficiently adapt to extreme light environments and provide domain-invariant multi-level perception abilities. Further details of the experimental setup and algorithms are provided in the Methods and Extended Data Figs. 6–8, whereas the performance evaluation of the algorithm is discussed in Supplementary Notes 7 and 8.

Discussion

Tianmouc excels in capturing intricate details for cognition while simultaneously responding rapidly to unpredictable emergencies and motion. It offers high speed, high dynamic range and high precision, while maintaining adaptive low bandwidth. Unlike existing sensing paradigms, our approach overcomes the inefficiencies caused by homogeneous representations and accommodates various corner cases in the open world. Compared with contemporary neuromorphic vision sensors, Tianmouc exhibits superior precision and comprehensive information while maintaining a fast and robust response in extreme environments. Importantly, its high scalability allows for higher spatial resolution through advanced manufacturing, facilitating resolution-sensitive applications with low power and bandwidth requirements. The primitives can also potentially be designed with on-chip reconfiguration and pathway allocation flexibility, enabling active adaptation to different task requirements. The vision sensor with primitive-based complementary pathways provides a unique data source and sensing platform, opening a new avenue for developing advanced computer vision theories, algorithms and systems for open-world applications.

Methods

A brief introduction to HVS

The HVS differs fundamentally from frame-based image sensors, boasting inherent general-purpose functionality, robustness and efficiency in the open world. As shown in Extended Data Fig. 1, the use of primitives and complementary pathways is apparent throughout the HVS, spanning from initial photoreceptor response to high-level visual processing. Notably, the retina exhibits distinct responses to input stimuli: cones detect colour and converge in the fovea with high acuity, whereas rods are colour insensitive and distributed across the retina50. Rather than transmitting the entire signal from the retina to the brain, the lateral geniculate nucleus parses the response from the retina into different characteristics and transmits them to the primary visual cortex through the magnocellular and parvocellular (M and P) pathways. The P-pathway maintains a sustained response to colour information, offering a high spatial resolution but a low temporal resolution. Complementarily, the M-pathway, insensitive to colour, is sensitive to transient information, exhibiting a fast temporal response albeit at the expense of spatial resolution. Subsequently, the primary visual cortex (V1 and V2 regions) interprets these inputs into multiple visual primitives, including colour, orientation, direction and depth, which are then re-composed into ventral and dorsal pathways for high-level perception and immediate action, respectively51. These primitives and pathways enable the HVS to achieve unified and coherent representations of the open world while significantly reducing data redundancy without ignoring emergencies. Moreover, even in scenarios in which certain primitives or pathways are out of operation in some corner cases, their combination can compensate in a complementary manner. In summary, the HVS serves as a rich source of inspiration when tackling open-world challenges. However, harnessing these advantages still requires the development of mathematical models, theoretical analysis and silicon implementation.

A complementary vision paradigm

Here we present a synthesis of neuroscience findings and the key features of the HVS that are implicated in our complementary sensing paradigm. The HVS comprises two main processing pathways: one dedicated to the vision for perception (ventral pathway) and the other to vision for action (dorsal pathway). Although visual perception and cognition are often associated with semantic attributes that require high precision and high spatial resolution, for vision-guided behaviours, the absolute intensity of colours is deemed less important compared with transient and gradient information with higher temporal resolution. Drawing inspiration from the ventral and dorsal pathways, we introduce the COP and the AOP as shown in Extended Data Table 1. Furthermore, we integrate key features from the photoreceptor level, the lateral geniculate nucleus and the high-level vision into each pathway as primitives. Similar to the HVS, our paradigm offers advantages across multiple levels. At the photoreceptor level, incorporating dual photoreceptors greatly extends sensitivity and dynamic range. At the sensor level, incorporating primitive-based representations enables high dynamic range, high precision, high spatiotemporal resolution and low latency sensing without data redundancy. At the perception level, the complementary sensing paradigm provides the potential for fast responses to emergencies while having high-precision cognition for critical objects. To quantitatively assess representation ability, we introduce the concept of completeness of representation and conduct a comprehensive theoretical analysis of the primitive-based representations (Supplementary Notes 1–3). The results show that the primitive-based representation of Tianmouc maintains completeness compared with a high-speed frame-based representation.

Architecture design

The Tianmouc chip adopts a hybrid pixel array and parallel-and-heterogeneous readout architectures to simultaneously implement primitives and complementary pathways on the same focal plane, enabling support for the complementary sensing paradigm. Overall, the COP captures colour intensity accurately through fine-grained cone pixels and high-precision parallel analog-to-digital conversion, whereas the AOP uses rod pixels, which can be reconfigured for different speeds and precisions, to facilitate fast, sparse spatiotemporal difference sensing and further compression.

A schematic of the hybrid pixel array illuminated from the back side is shown in Extended Data Fig. 2a. The colour or white filters and microlens array are fabricated on the cone-inspired and rod-inspired pixels, respectively. The three-dimensional diagram of the pixel structure shows the arrangement of the photodiode, high-density storage, transistors and metal wires. An optical micrograph of the hybrid pixel array is provided on the right. Cone-inspired and rod-inspired pixels use the same sensing frontend (Extended Data Fig. 2b) to convert optical information to charges in the photodiode and transmit these charges through the transfer gate controlled in a global shutter. The charges are converted to electrical signals by high or low gain realized by the reset and low-gain reset transistors, along with an additional capacitor. Keeping the reset transistor always on adds the additional capacitor to the signal path to achieve low gain, while keeping the low-gain reset transistor always on achieves the high-gain mode. Different sensing backends (Extended Data Fig. 2b) in cone-inspired and rod-inspired pixels enable the two pathways to read out the voltage in different ways.

Low-noise readout of intensities in cone-inspired pixels is achieved by correlated double sampling (CDS) circuits (Extended Data Fig. 2c). The value processed by the CDS circuit is further compared with a linear ramp generated by a shared digital-to-analog converter (DAC) and converted to digital format by a high-precision counter. As shown in Extended Data Fig. 2c, the spatiotemporal signals stored in rod-inspired pixels are fed simultaneously into the corresponding analog-to-digital converters (ADCs). The TD is calculated based on the conservation of charges. For SD computing, signals in each rod-inspired pixel are first processed in the CDS circuit to reduce noise, and the results of the two rows are further subtracted to generate SD values. For high-speed conversion of the processed Laplacian-like distributed spatiotemporal difference, we adopt a unified and polarity-adaptive DAC with a programmable threshold that filters the spatiotemporal difference to preserve the critical information. Using the reconfigurable slope generated by the DAC, the digital value of TD and SD is quantized to different precisions from ±7 bits to ±1 bit with various speeds from 757 fps to 10,000 fps (see detailed specification in Extended Data Table 2 and Supplementary Note 4).
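A toy model of the COP conversion chain described above may help to fix ideas: correlated double sampling removes the pixel reset level, and a counter runs while a linear ramp rises until it crosses the CDS output. The ramp step, bit depth and assumed signal polarity here are purely illustrative and are not circuit parameters of the chip.

```python
import numpy as np

def cop_single_slope_adc(v_reset, v_signal, ramp_step=1e-3, n_bits=10):
    """Toy model of CDS followed by single-slope conversion (illustrative only).
    The CDS output is assumed to be the drop from the reset level to the
    integrated signal level; the counter value is proportional to that drop."""
    v_cds = v_reset - v_signal                       # correlated double sampling (assumed polarity)
    max_count = 2 ** n_bits - 1                      # 10-bit counter assumed
    return int(np.clip(np.round(v_cds / ramp_step), 0, max_count))

# Example: a 0.35 V swing digitized with a 1 mV ramp step and a 10-bit counter
print(cop_single_slope_adc(v_reset=1.0, v_signal=0.65))   # -> 350
```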

The data representation of the COP is a dense matrix, whereas the AOP data are sparse with variable length. The different data representations pose challenges for efficient data communication and storage. As shown in Extended Data Fig. 2d, a packetizer uses the unified address difference representation to encode the sparse spatiotemporal differences, assembling the timestamp, the addresses of pixels that generate non-zero values, and the corresponding TD and SD values into compact and unified packets that are compatible with different data types (TD and SD) and various precisions.
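A minimal sketch of such an address-difference packetizer for one row of sparse TD or SD values is shown below. The field layout, header contents and per-row addressing are assumptions for illustration and do not reproduce the on-chip protocol.

```python
def packetize_row(values, data_type="TD", precision_bits=7, timestamp=0):
    """Encode one sparse row with an address-difference representation:
    only non-zero entries are kept, each stored as (address step since the
    previous non-zero pixel, signed value). Hypothetical format for illustration."""
    payload, last_addr = [], 0
    for addr, v in enumerate(values):
        if v != 0:
            payload.append((addr - last_addr, int(v)))
            last_addr = addr
    header = {"type": data_type, "precision": precision_bits, "timestamp": timestamp}
    return header, payload

# Example: a mostly zero row compresses to three (delta address, value) pairs
header, payload = packetize_row([0, 0, 3, 0, 0, 0, -2, 0, 1])
print(header, payload)   # payload -> [(2, 3), (4, -2), (2, 1)]
```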

Experimental setup for chip characterization

The system setup for chip characterization is shown in Extended Data Fig. 3. The test board is shown in Extended Data Fig. 3a. As shown in Extended Data Fig. 3b, the digital output of the chip is processed by a commercial FPGA board (AMD-Xilinx, EK-U1-ZCU106-G-J) and transmitted to the host computer (Nvidia, Jetson AGX Orin) through Peripheral Component Interconnect Express (PCIe) protocol for post-processing. To miniaturize the system size, the FPGA board is replaced with a smaller board (Milianke, MLK-H4-KU040) in the autonomous driving perception system.

The characterization of Tianmouc is conducted in two setups. The first is based on the European Machine Vision Association 1288 (EMVA1288) standard52. As shown in Extended Data Fig. 4a–c, a uniform light is generated by disk-shaped LEDs and monochromators, and then projected on the focal plane of the sensor placed in the machine (Looglook, ez1288-RD95) for quantum efficiency measurement. The digital output of the COP is processed in the host computer. Because the AOP generates only spatiotemporal differences, a high-speed ADC acquisition card (ART Technology, PCIE8914M) is used to record the analog output of the intensity signals of rod pixels to be compatible with EMVA1288. The intensity images of both pathways are analysed by EMVA1288 standard-based algorithms to generate measurement data.

The signal-to-noise ratio (SNR) curve for dynamic range measurement in Fig. 3b is evaluated using a customized optical setup (Extended Data Fig. 4d,e) because the light source in the standard EMVA1288 machine has a limited dynamic range and cannot be programmed. Collimated light from a laser (Fisba, READYBeam) and a collimator are filtered by a high-frequency filter consisting of an objective and a pinhole, and then expanded by a lens to form a uniform light spot. This spot is then projected onto an optical power meter (Thorlabs, PM100D with S120C for high laser power characterization and PM160 for low laser power measurement) and the sensor chip on the same optical path. In both experiments, AOP data are collected by triggering the laser to generate flickers projected on the chip and recording the TD data with ±7-bit precision.

The key to obtaining the sensitivity and SNR curves is the calculation of the SNR from intensity images, which is outlined below

$${\rm{SNR}}=\frac{\mu -{\mu }_{{\rm{dark}}}}{\sigma },$$
(1)

where μ is the mean value of two consecutive normal images \({I}_{{t}_{n}}\) and \({I}_{{t}_{n+1}}\) exposed under a light source, μdark is the mean value of two consecutive dark images \({I}_{{t}_{n}}^{{\rm{dark}}}\) and \({I}_{{t}_{n+1}}^{{\rm{dark}}}\), and σ is the standard deviation of the noise. Considering an image with a resolution of M × N, μ and μdark are calculated by averaging across all rows m and columns n of the two images

$$\mu =\frac{1}{2{MN}}\mathop{\sum }\limits_{m=0}^{M-1}\mathop{\sum }\limits_{n=0}^{N-1}\left({I}_{{t}_{n}}\left(m,n\right)+{I}_{{t}_{n+1}}\left(m,n\right)\right),$$
$${\mu }_{{\rm{dark}}}=\frac{1}{2MN}\mathop{\sum }\limits_{m=0}^{M-1}\mathop{\sum }\limits_{n=0}^{N-1}\left({I}_{{t}_{n}}^{{\rm{dark}}}\left(m,n\right)+{I}_{{t}_{n+1}}^{{\rm{dark}}}\left(m,n\right)\right).$$
(2)

The noise standard deviation σ is calculated from the difference between the two consecutive images as

$${\sigma }^{2}=\frac{1}{2{MN}}\mathop{\sum }\limits_{m=0}^{M-1}\mathop{\sum }\limits_{n=0}^{N-1}{\left({I}_{{t}_{n}}\left(m,n\right)-{I}_{{t}_{n+1}}\left(m,n\right)\right)}^{2}.$$
(3)
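For reference, the following is a direct NumPy transcription of equations (1)–(3) from two bright frames and two dark frames; it is a sketch of the analysis, not the authors' evaluation code.

```python
import numpy as np

def snr_from_frame_pairs(I_t0, I_t1, I_dark0, I_dark1):
    """EMVA1288-style SNR from two bright and two dark frames,
    following equations (1)-(3) above."""
    I_t0, I_t1 = I_t0.astype(np.float64), I_t1.astype(np.float64)
    I_dark0, I_dark1 = I_dark0.astype(np.float64), I_dark1.astype(np.float64)
    mu = 0.5 * (I_t0.mean() + I_t1.mean())             # equation (2), bright frames
    mu_dark = 0.5 * (I_dark0.mean() + I_dark1.mean())  # equation (2), dark frames
    sigma2 = np.mean((I_t0 - I_t1) ** 2) / 2.0         # equation (3)
    return (mu - mu_dark) / np.sqrt(sigma2)            # equation (1)
```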

Chip and system characterization

Extended Data Fig. 5a demonstrates the ability of Tianmouc to effectively capture unpredictable fast-moving ping-pong balls shot by a machine in random directions. Tianmouc accurately captures the static background when no balls are shot. The ping-pong balls are ejected suddenly by the machine and their trajectory can be detected by both AOP-TD and AOP-SD. Moreover, clear textures of other objects are captured by the AOP-SD.

The power consumption of the Tianmouc chip is measured using multimeters (Fluke, 17B+ digital multimeter) for different operating modes, ranging from 328.4 mW to 419.7 mW (Extended Data Fig. 5b). Tianmouc achieves relatively low power consumption compared with traditional high-speed image sensors41. A breakdown of power for a typical operating mode is provided in Extended Data Fig. 5b.

The Tianmouc effectively addresses temporal aliasing by using the high speed of the AOP. As shown in Extended Data Fig. 5c, the COP records the rotation of a car wheel with aliasing, which is then accurately captured by the AOP owing to its fast response. Through a reconstruction algorithm, an anti-aliased reconstructed video is obtained, recovering the actual rotation at a speed of 757 fps.

As a supplement to Fig. 3e, we present the texture of lightning captured by Tianmouc under different operating modes. As shown in Extended Data Fig. 5d, the transient lightning with detailed texture is recorded by AOP at 1,515 fps, ±7 bits and a threshold of 50 mV. By applying an on-chip threshold filter, the peak bandwidth can be reduced to about 50 MB s−1 at a mode of 10,000 fps with ±1-bit precision (Fig. 3e), and 55 MB s−1 at a mode of 1,515 fps with ±7-bit precision (Extended Data Fig. 5d). This presents an 80–90% reduction compared with traditional cameras with equivalent spatiotemporal resolution and precision (640 × 320 × 1,515 × 8). Further details on the calculation of FOM can be found in Supplementary Note 5.

Scene reconstruction based on the CVP

For data reconstruction, a neural-network-based reconstructor is trained based on a self-supervised model. Two adjacent colour images from the COP and the AOP data stream between these images are used to reconstruct the original colourful scene. This process requires the same sampling rate as the AOP-SD and AOP-TD and the same resolution as the COP. The training process is adapted from ref. 53, incorporating the COP image at t0 \(({I}_{{t}_{0}}^{{\rm{C}}})\), the accumulation of AOP-TD from \({t}_{0} \sim {t}_{{\rm{f}}}(\sum _{t={t}_{0}-{t}_{{\rm{f}}}}{I}_{t}^{{\rm{TD}}})\) and the AOP-SD at the initial and final time points \(({I}_{{t}_{0}}^{{\rm{SD}}},{I}_{{t}_{{\rm{f}}}}^{{\rm{SD}}})\) to train a neural network F. The objective of F is to reconstruct the colour image at the target time \({t}_{{\rm{f}}}({I}_{{t}_{{\rm{f}}}}^{{\rm{C}}})\), that is, \(F({I}_{{t}_{0}}^{{\rm{C}}},{\sum }_{{t}_{{\rm{i}}}={t}_{0}}^{{t}_{{\rm{f}}}}{I}_{{t}_{{\rm{i}}}}^{{\rm{TD}}},{I}_{{t}_{0}}^{{\rm{SD}}},{I}_{{t}_{{\rm{f}}}}^{{\rm{SD}}})=\widehat{{I}_{{t}_{{\rm{f}}}}^{{\rm{C}}}}\) and the time-reversed process is also considered, as \(F({I}_{{t}_{{\rm{f}}}}^{{\rm{C}}},-{\sum }_{{t}_{{\rm{i}}}={t}_{0}}^{{t}_{{\rm{f}}}}{I}_{{t}_{{\rm{i}}}}^{{\rm{TD}}},{I}_{{t}_{0}}^{{\rm{SD}}},{I}_{{t}_{{\rm{f}}}}^{{\rm{SD}}})=\widehat{{I}_{{t}_{0}}^{{\rm{C}}}}.\)
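A minimal PyTorch sketch of this self-supervised objective is given below. The placeholder network stands in for F only to illustrate the input and output interface (the actual architecture, described next, uses a CBAM-based tiny-UNet, a SpyNet-style flow estimator and a fusion UNet); the single-channel TD/SD inputs and the L1 loss are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F_nn

class PlaceholderReconstructor(nn.Module):
    """Stand-in for the reconstruction network F (interface illustration only)."""
    def __init__(self, in_ch=6, out_ch=3):   # 3 RGB + 1 TD-sum + 2 SD channels (assumed)
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_ch, 3, padding=1),
        )

    def forward(self, rgb, td_sum, sd_0, sd_f):
        return self.net(torch.cat([rgb, td_sum, sd_0, sd_f], dim=1))

def self_supervised_loss(F, I_c0, I_cf, td_sum, sd_0, sd_f):
    """Forward prediction of the frame at t_f plus the time-reversed prediction
    of the frame at t_0 (with the accumulated TD negated), as described above."""
    pred_f = F(I_c0, td_sum, sd_0, sd_f)
    pred_0 = F(I_cf, -td_sum, sd_0, sd_f)
    return F_nn.l1_loss(pred_f, I_cf) + F_nn.l1_loss(pred_0, I_c0)

# Example with random tensors standing in for one COP/AOP training sample
F = PlaceholderReconstructor()
rgb0, rgbf = torch.rand(1, 3, 320, 640), torch.rand(1, 3, 320, 640)
td, sd0, sdf = torch.rand(1, 1, 320, 640), torch.rand(1, 1, 320, 640), torch.rand(1, 1, 320, 640)
loss = self_supervised_loss(F, rgb0, rgbf, td, sd0, sdf)
```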

The reconstruction network comprises three parts: a convolutional block attention module (CBAM)-based54 tiny-UNet55, an optical flow estimator and a fusion network, as shown in Extended Data Fig. 6a. The CBAM-based tiny-UNet takes the original COP data and \({\sum }_{{t}_{{\rm{i}}}={t}_{0}}^{{t}_{{\rm{f}}}}{I}_{{t}_{{\rm{i}}}}^{{\rm{TD}}}\) as input. The optical flow estimator, based on SpyNet56, is modified by replacing the colour images with \({I}_{{t}_{{\rm{f}}}}^{{\rm{SD}}}\) and \({I}_{{t}_{0}}^{{\rm{SD}}}\) as input. Then, \({\sum }_{{t}_{{\rm{i}}}={t}_{0}}^{{t}_{{\rm{f}}}}{I}_{{t}_{{\rm{i}}}}^{{\rm{TD}}}\) is added as a complement to the motion information, and the network structure is modified corresponding to the number of data channels, as shown in Extended Data Fig. 6b. The fusion network is also a UNet. Further details of the network structure can be found in Supplementary Table 1. The entire network is trained in an end-to-end manner, as shown in Extended Data Fig. 6c. By adjusting tf, the reconstruction result of an arbitrary frame can be obtained, as shown in Extended Data Fig. 6d. Using this approach, at least 757 fps reconstructed colour images with a resolution of 640 × 320 can be achieved. Moreover, the network can be generalized to high dynamic range scenes, as shown in Fig. 4c. Details of the evaluation process of the reconstruction algorithm can be found in Supplementary Note 7.

Perception algorithms used in the experiments

For real-time perceptual tasks, we build a parallel and complementary data-processing pipeline, as shown in Extended Data Fig. 7. The system streams the output of the Tianmouc chip to support CVP-based perception processes. To accomplish detection and tracking tasks in the open world, two neural network (NN)-based algorithms run in parallel in our system: a multi-task network modified from YOLOPv1 (refs. 57,58) and a high-speed detector modified from YOLOv5s, which originated from YOLOv4 (ref. 59); their outputs are combined by a multiple-object tracker (MOT) based on a Kalman filter and an optical flow filter. The complete system contains a Tianmouc chip, an FPGA and a host computer (NVIDIA Jetson AGX Orin). The Tianmouc serves as the core of the whole system: it senses visual signals, converts them into two pathways, encodes the signals from the two pathways in a compressive digital format and outputs the digital data. The output data are transmitted from the chip to the FPGA by parallel or serial interfaces and then forwarded to the host computer through a PCIe interface without modification. All perceptual algorithms are executed on the host computer. The training details and the evaluation method can be found in Supplementary Notes 6 and 7, with detailed evaluation results provided in Supplementary Note 8 using the annotated Tianmouc datasets referenced in Fig. 4.

For complementary perception, the results are synchronized and integrated using a MOT based on a Kalman filter. Detection results from multiple pathways are time-stamped and transmitted through different buffers into a tracking thread. The MOT is set to record the historical results for about 150 ms and updates tracking results at the same speed as the detector on the AOP. It also gives a tracking trace by drawing the centre points of tracked targets. Detection results given by different pathways are synchronized in the same trace.
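The synchronization logic can be pictured with the following sketch, in which detections from either pathway are time-stamped, buffered for roughly 150 ms and merged into a single trace of centre points; the Kalman-filter state update itself is omitted, and the data layout is an assumption.

```python
from collections import deque
import time

class TrackBuffer:
    """Sketch of the cross-pathway synchronization described above: detections
    from the COP and AOP threads are time-stamped, kept for ~150 ms and merged
    into one trace (Kalman-filter update omitted; illustrative only)."""
    def __init__(self, history_s=0.15):
        self.history_s = history_s
        self.buffer = deque()                      # (timestamp, pathway, bounding box)

    def push(self, pathway, detection, timestamp=None):
        t = time.time() if timestamp is None else timestamp
        self.buffer.append((t, pathway, detection))
        while self.buffer and t - self.buffer[0][0] > self.history_s:
            self.buffer.popleft()                  # evict results older than ~150 ms

    def trace(self):
        """Centre points of all buffered detections, ordered in time."""
        return [((x1 + x2) / 2, (y1 + y2) / 2)
                for _, _, (x1, y1, x2, y2) in sorted(self.buffer)]

# Example: one COP and one AOP detection merged into the same trace
buf = TrackBuffer()
buf.push("COP", (100, 50, 140, 90))
buf.push("AOP", (102, 52, 142, 92))
print(buf.trace())   # -> [(120.0, 70.0), (122.0, 72.0)]
```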

Moreover, an optical flow solver and optical flow filter are introduced into the detection-tracking task. We adopt the Horn–Schunck method60 for real-time dense optical flow calculation. After obtaining the dense optical flow (uij, vij) for each point (i, j), we calculate the average of all non-zero optical flow values in the field of view (FOV) to obtain the global optical flow \(({u}_{{\rm{mean}}},{v}_{{\rm{mean}}})=(\frac{{\sum }_{i=1}^{W}{\sum }_{j=1}^{H}{u}_{ij}}{N},\frac{{\sum }_{i=1}^{W}{\sum }_{j=1}^{H}{v}_{ij}}{N})\) to approximate the camera motion, where N is the number of pixels with non-zero optical flow vectors, and W and H are the width and height of the AOP frame, respectively. Based on this, targets whose motion is inconsistent with the global motion can be filtered out as potential out-of-distribution obstacles through morphological operations. For more details, see Supplementary Note 9.
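A simplified version of this filtering step is sketched below: the global flow is the mean of all non-zero flow vectors, and pixels whose flow deviates strongly from it are flagged as candidate anomalies. The deviation test and its threshold are assumptions standing in for the morphological post-processing described in Supplementary Note 9.

```python
import numpy as np

def global_flow(u, v):
    """Average of all non-zero optical-flow vectors in the FOV, used as a
    proxy for camera ego-motion (as in the formula above)."""
    mask = (u != 0) | (v != 0)
    n = max(mask.sum(), 1)
    return u[mask].sum() / n, v[mask].sum() / n

def anomaly_mask(u, v, ratio=2.0):
    """Flag pixels whose flow deviates strongly from the global motion; the
    deviation ratio and this simple magnitude test are illustrative assumptions."""
    u_mean, v_mean = global_flow(u, v)
    dev = np.hypot(u - u_mean, v - v_mean)
    return dev > ratio * max(np.hypot(u_mean, v_mean), 1e-6)
```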

Bandwidth and data visualization in different open-world scenarios

To demonstrate the efficiency of Tianmouc, we calculate the bandwidth consumption for 10 different scenarios, while maintaining consistent precision and speed settings of the sensor. The results are presented in Extended Data Fig. 8a–d, corresponding to Fig. 4d–g, respectively. In Extended Data Fig. 8e, the performance of the algorithm is showcased while travelling on a tree-shaded road. Extended Data Fig. 8f shows a challenging scenario in which our vehicle passes over a speed bump, including notable camera shaking and large-amplitude vibration, a corner case in our evaluation. Despite these challenges, Tianmouc effectively tracks the target object owing to the high response speed of the AOP. In Extended Data Fig. 8g, our vehicle navigates through roads with heavy traffic, resulting in dense AOP-TD, whereas in Extended Data Fig. 8h, the system operates on a highway with almost zero relative speed with respect to other vehicles, leading to very sparse AOP-TD. Extended Data Fig. 8i shows the performance of Tianmouc when entering and leaving a short tunnel with a large variation in light on the same target. Extended Data Fig. 8j simulates an artificial anomaly in which many people are playing basketball at a tunnel exit while the vehicle remains stationary. Here, the AOP-TD does not respond effectively, whereas the AOP-SD still provides a clear description of the bright part.

On the right of each case, we record and average the bandwidth consumption of the AOP-TD, AOP-SD, COP and their combination across the entire sample. The actual average bandwidth of Tianmouc ranges from 50 MB s−1 to 80 MB s−1, with peak bandwidth generally below 80 MB s−1, significantly less than that of traditional high-speed, high-dynamic-range cameras. The main bandwidth consumption of Tianmouc is caused by the AOP-TD and AOP-SD because of their high sampling speed. However, the sparse data distribution of the AOP-TD and AOP-SD, together with efficient coding methods, reduces the bandwidth requirements of the AOP across all tested environments.