Main

Image sensors1 play a crucial part in a broad range of applications2,3,4,5. However, with the increasing prevalence of open-world applications such as autonomous machines6, robotics7 and artificial intelligence8, current image sensors face substantial challenges. Despite the development of algorithms tailored for open-world applications9, image sensors have difficulty in handling dynamic, diverse and unpredictable corner cases10 beyond their sensing range, causing algorithm failures11. Some of these challenges are aliasing and quantization error, data redundancy and limited dynamic range at the sensing level11, as well as semantic misalignment, latency problem12 and domain shift13 at the perception level14 (Fig. 1a). To effectively address corner cases, image sensors must possess exceptional performance in spatial resolution, speed, precision and dynamic range simultaneously. Yet, achieving this goal is impeded by the power and bandwidth walls15. Traditional sensors exhibit escalated power and bandwidth requirements with higher spatial resolution, speed and precision, leading to constrained capture abilities and excessive data burdens. Moreover, given the rarity of corner cases16, these sensors are prone to generating redundant data, consequently wasting bandwidth and power resources.

Fig. 1: The challenges of open-world visual sensing and the solution with the complementary vision paradigm.
figure 1

a, Corner cases of the open world, such as dynamic, diverse and unpredictable situations, pose various challenges across both sensing and perception levels. b, The complementary visual paradigm requires the parsing of visual inputs into primitive-based representations, the assembly of these primitives into multiple pathways and their subsequent reorganization to accommodate different scenarios in the open world. Our vision sensor integrates these primitives into the COP and AOP, similar to the ventral and dorsal pathways, respectively, in the human visual system. TD, temporal difference with polarities (⊗ for negative, ⊙ for positive); SD, spatial difference with directions (↗, ↘, ↙, ↖).

In contrast to the existing image sensors, the human visual system (HVS) stands out for its versatility, adaptability and robustness in open-world environments. The HVS interprets visual stimuli into multiple visual primitives, such as colour, orientation and motion, and allocates them to the ventral and dorsal pathways in a complementary manner17. The cooperation between the two pathways efficiently provides a unified representation of visual scenes18 (Methods). Various endeavours have been made to replicate specific features of the HVS, including silicon retinas19,20, neuromorphic vision sensors21,22,23,24,25,26,27,28, pulse frequency modulation29,30,31,32, time-to-first-spike33,34,35 and near-sensor-computing chips36,37,38,39. However, challenges persist in achieving image sensors with high spatial resolution, high speed, high precision and large dynamic range within the constraints of limited power and bandwidth.

Here we report a complementary sensing paradigm inspired by the multi-level characteristics of the HVS, along with a vision chip designed based on this paradigm, named Tianmouc. Our paradigm consists of a primitive-based representation and two complementary visual pathways (CVPs) that enable the parsing of a visual scene into primitives and their subsequent assembly into corresponding pathways. As shown in Fig. 1b, these primitives encompass colour, precision, sensitivity, spatial resolution, speed, absolute intensity, spatial difference (SD) and temporal difference (TD), serving as the foundational elements for a comprehensive representation of the scene (Supplementary Note 2). The CVP consists of two distinct pathways: the cognition-oriented pathway (COP) and the action-oriented pathway (AOP), analogous to the ventral and dorsal pathways of the HVS, respectively (Extended Data Table 1). The COP uses primitives of colour, intensity, high spatial resolution and high precision to achieve accurate cognition, minimizing spatial aliasing and quantization errors. By contrast, the AOP uses primitives of SD, TD and high speed to achieve fast responses with robustness and high sparsity, addressing data redundancy and latency problems. The two pathways complement each other in constructing representations of normal and corner cases, thereby achieving high dynamic range and mitigating semantic misalignment and domain shift problems.

Design of the complementary vision chip

Implementing the complementary sensing paradigm in a physical sensing system poses several challenges that must be addressed. It is essential to design the pixel array such that the optical-to-electrical information conversion can be simultaneously parsed into the corresponding primitives on the same focal plane. Furthermore, the readout architectures of the two pathways must incorporate heterogeneous building blocks that can encode electrical information with different data distributions and formats.

As shown in Fig. 2a, the Tianmouc chip, fabricated using a 90-nm CMOS back-side illuminated technology, consists of two centrepieces: a hybrid pixel array for converting optical information into electrical signals and a parallel-and-heterogeneous readout architecture for constructing two CVPs. Inspired by photoreceptor cells, the hybrid pixel array comprises cone-inspired and rod-inspired pixels with varying characteristics such as colours, response modes, resolutions and sensitivities. These pixels can parse visual information into specific colours (red, green and blue) and a white spectrum, serving as colour-opponent primitives. They can also be adjusted for four different sensitivities using high and low charge-to-electrical conversion gain, thereby achieving a high dynamic range by using the low noise of the high-gain mode and the high saturation capacity of the low-gain mode. The cone-inspired pixels are designed with a fine-grained 4-µm pitch for absolute intensity sensing, whereas the rod-inspired pixels feature two larger receptive fields, 8 µm and 16 µm, for sensing TD and SD, respectively. A spatiotemporal consecutive pixel architecture is used to facilitate TD and SD computation through the use of high-density in-pixel memory. Specifically, the rod-inspired pixels buffer historical voltage signals in a ping-pong behaviour to enable continuous computation of TD in the AOP readout. The same memory in rod-inspired pixels across a block can be reorganized to compute the SD, as shown by the operational phase in Fig. 2b. The full hybrid pixel array comprises 320 × 320 cone-inspired pixels and 160 × 160 rod-inspired pixels. Further details about these two types of pixel are provided in the Methods and Extended Data Fig. 2a,b.
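To make the TD and SD primitives concrete, the following is a minimal numerical sketch of how signed temporal and spatial differences could be formed from two consecutive rod-pixel frames. It models the computation digitally for illustration only; on Tianmouc these differences are computed in the analog domain using the in-pixel ping-pong buffers, and the specific diagonal kernels used here are an assumption based on the SD directions indicated in Fig. 1b.

```python
import numpy as np

def temporal_difference(frame_t, frame_prev):
    """TD primitive: signed change between the current rod-pixel frame and the
    buffered previous frame (analog and in-pixel on Tianmouc; digital here)."""
    return frame_t.astype(np.int32) - frame_prev.astype(np.int32)

def spatial_difference(frame_t):
    """SD primitive: signed differences between diagonally neighbouring rod
    pixels (the two diagonal directions are an assumption based on Fig. 1b)."""
    f = frame_t.astype(np.int32)
    sd_diag1 = f[1:, 1:] - f[:-1, :-1]   # one diagonal direction
    sd_diag2 = f[1:, :-1] - f[:-1, 1:]   # the other diagonal direction
    return sd_diag1, sd_diag2

# Example: two consecutive 160 x 160 rod-pixel frames (random stand-ins)
prev = np.random.randint(0, 255, (160, 160), dtype=np.uint8)
curr = np.random.randint(0, 255, (160, 160), dtype=np.uint8)
td = temporal_difference(curr, prev)
sd1, sd2 = spatial_difference(curr)
```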

Fig. 2: The architecture of the Tianmouc chip.
figure 2

a, Schematic of the chip architecture, including the hybrid pixel array and its interaction with multiple pathways. Various primitives such as colour, intensity, TD, SD, sensitivity, spatial resolution, speed and precision are implemented at different levels within the chip. b, In phases 1 and 2, photogenerated electrons are integrated into photodiodes and then converted to voltages with high- or low-gain amplification. In phase 3, the voltages buffered in the pixel are read out for spatiotemporal difference computation, whereas the electrons are integrated for the next sampling time. In phase 4, the in-pixel buffer behaves in a ping-pong manner. c, Electrical schematics of the AOP readout circuits. The TD and SD are first calculated by a subtractor circuit, filtered by a programmable threshold and then processed by a multi-precision convertor. The sparse TD and SD values are packetized at the final stage to achieve maximum bandwidth reduction. d, A microscopic image of the Tianmouc chip.

The electrical signals travelling along the two pathways exhibit distinct characteristics, including differences in data distributions and sparsity. These disparities require the use of different methods for encoding signals into digital data with appropriate speed and precision. To address this challenge, the chip adopts parallel-and-heterogeneous readout architectures. For the COP, accurate conversion of absolute intensity signals to dense matrixes is essential. This is achieved through a single-slope analog-to-digital architecture. By contrast, the AOP requires rapid encoding of signals with spatiotemporal differences characterized by symmetrical Laplacian-like distribution40 and high sparsity. Therefore, a specialized readout architecture is used (Fig. 2c), in which a programmable threshold filter is used to minimize redundancy and noise in the computed TD and SD signals while retaining key information. Subsequently, these signals are quantized using a fast polarity-adaptive digital-to-analog converter with configurable precision. Moreover, a data packetizer is used to achieve lossless compression of the sparse variable-precision TD and SD signals in a unified protocol (Extended Data Fig. 2d). This approach offers adaptive abilities to reduce bandwidth and further enhance the operation speed of the AOP. More details on the readout architectures can be found in the Methods and Extended Data Fig. 2c,d. An optical micrograph depicting the overall layout of the Tianmouc chip is presented in Fig. 2d.
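The following is a minimal functional sketch of the AOP readout chain, assuming a digital stand-in for the analog circuit: spatiotemporal differences below a programmable threshold are zeroed, and the survivors are quantized with their polarity preserved to a precision between ±1 bit and ±7 bits. The full-scale reference and the rounding scheme are assumptions for illustration, not chip parameters.

```python
import numpy as np

def aop_readout(diff, threshold, n_bits, full_scale=None):
    """Functional sketch of the AOP readout: programmable threshold filter
    followed by signed (polarity-preserving) quantization to +/- n_bits."""
    d = diff.astype(np.float64)
    d[np.abs(d) < threshold] = 0.0            # programmable threshold filter
    if full_scale is None:                    # assumed full-scale reference
        full_scale = max(np.max(np.abs(d)), 1e-9)
    levels = 2 ** n_bits - 1                  # e.g. +/-7 bits -> 127 levels per polarity
    return np.round(d / full_scale * levels).astype(np.int32)

# Example: quantize a sparse TD patch to +/-1 bit (output values in {-1, 0, +1})
td = np.array([[0.0, 0.08, -0.3], [0.01, 0.0, 0.4]])
print(aop_readout(td, threshold=0.05, n_bits=1))
```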

Characterization of Tianmouc

The performance metrics of the Tianmouc chip, including quantum efficiency, dynamic range, response speed, power and bandwidth, have been evaluated comprehensively. The chip demonstrates high quantum efficiency in both COP and AOP, achieving maxima of 72% for the AOP and 69% for the COP at 530 nm (Fig. 3a). It accomplishes a high dynamic range by leveraging the dynamic ranges of different gain modes in the complementary COP and AOP. As plotted in Fig. 3b, an overall dynamic range of 130 dB is achieved by detecting the lowest power density of 2.71 × 10−3 μW cm−2 and the highest power density of 8.04 × 103 μW cm−2, in accordance with a well-established standard (Methods and Extended Data Fig. 4).
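As a quick consistency check (assuming that the dynamic range is taken as 20 log10 of the ratio between the highest and lowest detectable power densities), these two limits give

$$20{\log }_{10}\left(\frac{8.04\times {10}^{3}\,\mu {\rm{W}}\,{{\rm{cm}}}^{-2}}{2.71\times {10}^{-3}\,\mu {\rm{W}}\,{{\rm{cm}}}^{-2}}\right)\approx 129.4\,{\rm{dB}}\approx 130\,{\rm{dB}}.$$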

Fig. 3: Summary of chip evaluation.
figure 3

a, Measured quantum efficiency curves show that both pathways exhibit a broad spectrum of response while maintaining high spectral sensitivities. b, Measured signal-to-noise ratios against optical power density show the high dynamic range ability achieved by combining high gain and low gain from the AOP and COP. c, Tianmouc eliminates spatial aliasing of the AOP through high spatial resolution and precision of the COP. d, Robust reconstruction of high-speed video for scenes with high-speed movements under light flash interference. e, Leveraging the high speed of the AOP, Tianmouc can promptly respond to unpredictable and rapidly occurring lightning. The bottom portion shows that the bandwidth consumption of the AOP remains lower than that of the COP, except during lightning events. f, A comparison between the Tianmouc chip and traditional image sensors41,42,43,44,45,46,47,48, as well as neuromorphic vision sensors21,25,49, on FOM with respect to power (top) and bandwidth consumption (bottom).

Source Data

The complementary pathways of the Tianmouc chip enable high spatial resolution and precision (Fig. 3c) and high robustness in unpredictable environments (Fig. 3d). To eliminate spatial aliasing and quantization errors caused by the AOP, the Tianmouc chip complementarily uses the high spatial resolution and precision of the COP. Although the standard Siemens star chart captured by the AOP-SD in Fig. 3c may appear distorted because of its low resolution, the COP accurately records it. As shown in Fig. 3d, in a scene with horizontally fast-moving and rotating objects, and changing illumination conditions, a sudden flash of light disturbs the AOP-TD, but the AOP-SD remains unaffected. By using the COP image with the AOP-TD and AOP-SD, a frame-by-frame reconstruction of high-speed video (Methods) enables the recovery of high-speed motion.

Using the AOP, Tianmouc demonstrates rapid response with reconfigurable speeds ranging from 757 frames per second (fps) to 10,000 fps and precisions varying from ±7 bits to ±1 bit. This complements the comparatively slower speed of the COP, which maintains a sustained response at 30 fps and a 10-bit resolution. The high-speed ability of Tianmouc is assessed by a transient lightning test. As shown in Fig. 3e, Tianmouc can operate at 10,000 fps with ±1 bit at a threshold level of 50 mV to capture fast lightning bolts. It is worth noting that, owing to high sparsity, the AOP consumes only about 50 megabytes per second (MB s−1) of peak bandwidth during transient phenomena, representing a 90% reduction compared with traditional cameras with equivalent spatiotemporal resolution and precision (640 × 320 × 10,000 × 2). More demonstrations of high-speed responses and temporal anti-aliasing can be found in Extended Data Fig. 5.
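A back-of-the-envelope check of the quoted reduction, assuming the equivalent dense stream is 640 × 320 pixels at 10,000 fps with 2-bit samples as stated above:

```python
# Quick arithmetic check of the ~90% peak-bandwidth reduction quoted in the text.
dense_bits_per_s = 640 * 320 * 10_000 * 2            # equivalent dense stream (bits per second)
dense_mb_per_s = dense_bits_per_s / 8 / 1e6          # ~512 MB/s
aop_peak_mb_per_s = 50                                # measured AOP peak bandwidth from the text
reduction = 1 - aop_peak_mb_per_s / dense_mb_per_s    # ~0.90, i.e. about a 90% reduction
print(f"dense: {dense_mb_per_s:.0f} MB/s, reduction: {reduction:.0%}")
```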

We use a comprehensive figure of merit (FOM), similar to that proposed in ref. 41, to evaluate the overall performance of the Tianmouc chip. This FOM incorporates key performance indicators for open-world sensing, integrating the maximum sampling rate (Rmax) and dynamic range into a unified metric (Rmax × dynamic range). In Fig. 3f, the FOM is plotted against the power (top) and bandwidth (bottom) consumption of various sensors. The power consumption of Tianmouc varies based on the operation mode (Extended Data Fig. 5b) and averages at 368 mW in a typical mode (±7 bits, 1,515 fps without threshold). As shown in Fig. 3f, Tianmouc achieves an advanced FOM, surpassing the existing neuromorphic sensors and traditional image sensors, while still retaining low power and low bandwidth consumption. Detailed calculations and comparisons can be found in the Methods.

Performance in the open world

The complementary sensing paradigm provides a large space of design possibilities and serves as an exceptional data source for perception algorithms. To assess these abilities in open-world scenarios, we develop an autonomous driving perception system (Fig. 4a) integrated with a Tianmouc chip. The assessment is conducted on open roads, involving encounters with various corner cases, such as flash disturbances, high dynamic range scenes, domain shift problems (anomalous objects) and complex scenes featuring multiple corner cases. To make use of the advantages of the Tianmouc architecture, we design a multi-pathway algorithm specifically tailored to harness the complementary features of the AOP and COP. At the sensing level, the completeness of primitives enables the reconstruction of the original scene and adaptation to extreme illumination. Meanwhile, at the perception level, the AOP provides the immediate perception of variations, textures and motion, whereas the COP offers fine semantic details. By synchronizing these outcomes, we achieve a comprehensive understanding of the scene.

Fig. 4: Open-world perception experiments.
figure 4

a, A long-distance drive is performed to test the performance of the integrated perception system with a Tianmouc, in which four different types of corner case are encountered. An asynchronous multi-tasking algorithm is designed based on the CVP. The results from the COP (blue detection boxes, green and red masks for segmentation) and those from the AOP (dark red) can be seamlessly aligned and synchronized to provide comprehensive outcomes. The mean average precision (mAP0.50) is calculated for the corner cases (Supplementary Note 8). A higher mAP0.50 value indicates better algorithm performance. In instances in which corner cases interfere with only a few frames, performance metrics are also calculated specifically for the influenced frames (only corner), apart from the overall performance. b, The AOP-SD demonstrates rapid response and strong anti-interference characteristics, effectively mitigating flash disturbances. c, Tianmouc supports tasks in high-speed, high-dynamic-range scenes, facilitated by the efficient extension of the dynamic range provided by the CVP. d, The AOP supports a high-speed optical flow estimation, effectively coping with anomalous objects. e, Tianmouc demonstrates proficiency in concurrently handling multiple corner cases through the cooperation of the CVP. Scale bar, 5 km (a).

Source Data

The first scenario shown in Fig. 4b evaluates the sensing abilities involving a sudden light flash that causes rapid changes in illumination, potentially affecting sensor robustness. Tianmouc exhibits remarkable resilience to such light flashes while maintaining high perception performance in normal situations. For real-time high dynamic range perception (Fig. 4c), the complementary sensitivity of the two pathways enables the Tianmouc to sense high-brightness contrast without sacrificing speed. At the perception level, the anomaly detection ability is complemented by a high-speed optical flow filter on the AOP, in which the collaboration between AOP-TD and AOP-SD enables precise calculation of motion direction and speed for identifying anomalies (Fig. 4d). Figure 4e shows a complex scene with dim natural illumination, a chaotic traffic environment and sudden disturbances from artificial light, demanding diverse sensing abilities in terms of sampling speed, resolution and dynamic range. The algorithms on the CVP provide complementary and diverse results, offering ample room for further decisions in these scenarios. According to the mAP0.50 (mean average precision; Supplementary Note 8) bars, the CVP yields superior overall detection performance compared with using only a single pathway across all the cases in Fig. 4. Notably, it achieves this while consuming less than 80 MB s−1 of bandwidth with an average power consumption of 328 mW. The experimental results demonstrate that Tianmouc can efficiently adapt to extreme light environments and provide domain-invariant multi-level perception abilities. Further details of the experimental setup and algorithms are provided in the Methods and Extended Data Figs. 6–8, whereas the performance evaluation of the algorithm is discussed in Supplementary Notes 7 and 8.

Discussion

Tianmouc excels in capturing intricate details for cognition while simultaneously responding rapidly to unpredictable emergencies and motion. It offers high speed, high dynamic range and high precision, while maintaining adaptive low bandwidth. Unlike existing sensing paradigms, our approach overcomes the inefficiencies caused by homogeneous representations and accommodates various corner cases in the open world. Compared with contemporary neuromorphic vision sensors, Tianmouc exhibits superior precision and comprehensive information while maintaining a fast and robust response in extreme environments. Importantly, its high scalability allows for higher spatial resolution through advanced manufacturing, facilitating resolution-sensitive applications with low power and bandwidth requirements. The primitives can also potentially be designed with on-chip reconfiguration and pathway allocation flexibility, enabling active adaptation to different task requirements. The vision sensor with primitive-based complementary pathways provides a unique data source and sensing platform, opening a new avenue for developing advanced computer vision theories, algorithms and systems for open-world applications.

Methods

A brief introduction to HVS

The HVS differs fundamentally from frame-based image sensors, boasting inherent general-purpose functionality, robustness and efficiency in the open world. As shown in Extended Data Fig. 1, the use of primitives and complementary pathways is apparent throughout the HVS, spanning from initial photoreceptor response to high-level visual processing. Notably, the retina exhibits distinct responses to input stimuli: cones detect colour and converge in the fovea with high acuity, whereas rods are colour insensitive and distributed across the retina50. Rather than transmitting the entire signal from the retina to the brain, the lateral geniculate nucleus parses the response from the retina into different characteristics and transmits them to the primary visual cortex through the magnocellular and parvocellular (M and P) pathways. The P-pathway maintains a sustained response to colour information, offering a high spatial resolution but a low temporal resolution. Complementarily, the M-pathway, insensitive to colour, is sensitive to transient information, exhibiting a fast temporal response albeit at the expense of spatial resolution. Subsequently, the primary visual cortex (V1 and V2 regions) interprets these inputs into multiple visual primitives, including colour, orientation, direction and depth, which are then re-composed into ventral and dorsal pathways for high-level perception and immediate action, respectively51. These primitives and pathways enable the HVS to achieve unified and coherent representations of the open world while significantly reducing data redundancy without ignoring emergencies. Moreover, even in scenarios in which certain primitives or pathways are out of operation in some corner cases, their combination can compensate in a complementary manner. In summary, the HVS serves as a rich source of inspiration when tackling open-world challenges. However, harnessing these advantages still requires the development of mathematical models, theoretical analysis and silicon implementation.

A complementary vision paradigm

Here we present a synthesis of neuroscience findings and the key features of the HVS that are implicated in our complementary sensing paradigm. The HVS comprises two main processing pathways: one dedicated to the vision for perception (ventral pathway) and the other to vision for action (dorsal pathway). Although visual perception and cognition are often associated with semantic attributes that require high precision and high spatial resolution, for vision-guided behaviours, the absolute intensity of colours is deemed less important compared with transient and gradient information with higher temporal resolution. Drawing inspiration from the ventral and dorsal pathways, we introduce the COP and the AOP as shown in Extended Data Table 1. Furthermore, we integrate key features from the photoreceptor level, the lateral geniculate nucleus and the high-level vision into each pathway as primitives. Similar to the HVS, our paradigm offers advantages across multiple levels. At the photoreceptor level, incorporating dual photoreceptors greatly extends sensitivity and dynamic range. At the sensor level, incorporating primitive-based representations enables high dynamic range, high precision, high spatiotemporal resolution and low latency sensing without data redundancy. At the perception level, the complementary sensing paradigm provides the potential for fast responses to emergencies while having high-precision cognition for critical objects. To quantitatively assess representation ability, we introduce the concept of completeness of representation and conduct a comprehensive theoretical analysis of the primitive-based representations (Supplementary Notes 1–3). The results show that the primitive-based representation of Tianmouc maintains completeness compared with a high-speed frame-based representation.

Architecture design

The Tianmouc chip adopts a hybrid pixel array and parallel-and-heterogeneous readout architectures to simultaneously implement primitives and complementary pathways on the same focal plane, enabling support for the complementary sensing paradigm. Overall, the COP captures colour intensity accurately through fine-grained cone pixels and high-precision parallel analog-to-digital conversion, whereas the AOP uses rod pixels, which can be reconfigured for different speeds and precisions, to facilitate fast, sparse spatiotemporal difference sensing and further compression.

A schematic of the hybrid pixel array illuminated from the back side is shown in Extended Data Fig. 2a. The colour or white filters and microlens array are fabricated on the cone-inspired and rod-inspired pixels, respectively. The three-dimensional diagram of the pixel structure shows the arrangement of the photodiode, high-density storage, transistors and metal wires. An optical micrograph of the hybrid pixel array is provided on the right. Cone-inspired and rod-inspired pixels use the same sensing frontend (Extended Data Fig. 2b) to convert optical information to charges in the photodiode and transmit these charges through the transfer gate controlled in a global shutter. The charges are converted to electrical signals by high or low gain realized by the reset and low-gain reset transistors, along with an additional capacitor. Keeping the reset transistor always on adds the additional capacitor to the signal path to achieve low gain, while keeping the low-gain reset transistor always on achieves the high-gain mode. Different sensing backends (Extended Data Fig. 2b) in cone-inspired and rod-inspired pixels enable the two pathways to read out the voltage in different ways.

Low-noise readout of intensities in cone-inspired pixels is achieved by correlated double sampling (CDS) circuits (Extended Data Fig. 2c). The value processed by the CDS circuit is further compared with a linear ramp generated by a shared digital-to-analog converter (DAC) and converted to digital format by a high-precision counter. As shown in Extended Data Fig. 2c, the spatiotemporal signals stored in rod-inspired pixels are fed simultaneously into the corresponding analog-to-digital converters (ADCs). The TD is calculated based on the conservation of charges. For SD computing, signals in each rod-inspired pixel are first processed in the CDS circuit to reduce noise, and the results of the two rows are further subtracted to generate SD values. For high-speed conversion of the processed Laplacian-like distributed spatiotemporal difference, we adopt a unified and polarity-adaptive DAC with a programmable threshold that filters the spatiotemporal difference to preserve the critical information. Using the reconfigurable slope generated by the DAC, the digital value of TD and SD is quantized to different precisions from ±7 bits to ±1 bit with various speeds from 757 fps to 10,000 fps (see detailed specification in Extended Data Table 2 and Supplementary Note 4).
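A toy model of the COP conversion chain described above may help to fix ideas: correlated double sampling removes the pixel reset level, and a counter runs while a linear ramp rises until it crosses the CDS output. The ramp step, bit depth and assumed signal polarity here are purely illustrative and are not circuit parameters of the chip.

```python
import numpy as np

def cop_single_slope_adc(v_reset, v_signal, ramp_step=1e-3, n_bits=10):
    """Toy model of CDS followed by single-slope conversion (illustrative only).
    The CDS output is assumed to be the drop from the reset level to the
    integrated signal level; the counter value is proportional to that drop."""
    v_cds = v_reset - v_signal                       # correlated double sampling (assumed polarity)
    max_count = 2 ** n_bits - 1                      # 10-bit counter assumed
    return int(np.clip(np.round(v_cds / ramp_step), 0, max_count))

# Example: a 0.35 V swing digitized with a 1 mV ramp step and a 10-bit counter
print(cop_single_slope_adc(v_reset=1.0, v_signal=0.65))   # -> 350
```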

The data representation of the COP is a dense matrix, whereas the AOP data are sparse with variable length. The different data representations pose challenges for efficient data communication and storage. As shown in Extended Data Fig. 2d, a packetizer uses the unified address difference representation to encode the sparse spatiotemporal differences, assembling the timestamp, the addresses of pixels that generate non-zero values, and the corresponding TD and SD values into compact and unified packets that are compatible with different data types (TD and SD) and various precisions.
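A minimal sketch of such an address-difference packetizer for one row of sparse TD or SD values is shown below. The field layout, header contents and per-row addressing are assumptions for illustration and do not reproduce the on-chip protocol.

```python
def packetize_row(values, data_type="TD", precision_bits=7, timestamp=0):
    """Encode one sparse row with an address-difference representation:
    only non-zero entries are kept, each stored as (address step since the
    previous non-zero pixel, signed value). Hypothetical format for illustration."""
    payload, last_addr = [], 0
    for addr, v in enumerate(values):
        if v != 0:
            payload.append((addr - last_addr, int(v)))
            last_addr = addr
    header = {"type": data_type, "precision": precision_bits, "timestamp": timestamp}
    return header, payload

# Example: a mostly zero row compresses to three (delta address, value) pairs
header, payload = packetize_row([0, 0, 3, 0, 0, 0, -2, 0, 1])
print(header, payload)   # payload -> [(2, 3), (4, -2), (2, 1)]
```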

Experimental setup for chip characterization

The system setup for chip characterization is shown in Extended Data Fig. 3. The test board is shown in Extended Data Fig. 3a. As shown in Extended Data Fig. 3b, the digital output of the chip is processed by a commercial FPGA board (AMD-Xilinx, EK-U1-ZCU106-G-J) and transmitted to the host computer (Nvidia, Jetson AGX Orin) through Peripheral Component Interconnect Express (PCIe) protocol for post-processing. To miniaturize the system size, the FPGA board is replaced with a smaller board (Milianke, MLK-H4-KU040) in the autonomous driving perception system.

The characterization of Tianmouc is conducted in two setups. The first is based on the European Machine Vision Association 1288 (EMVA1288) standard52. As shown in Extended Data Fig. 4a–c, a uniform light is generated by disk-shaped LEDs and monochromators, and then projected on the focal plane of the sensor placed in the machine (Looglook, ez1288-RD95) for quantum efficiency measurement. The digital output of the COP is processed in the host computer. Because the AOP generates only spatiotemporal differences, a high-speed ADC acquisition card (ART Technology, PCIE8914M) is used to record the analog output of the intensity signals of rod pixels to be compatible with EMVA1288. The intensity images of both pathways are analysed by EMVA1288 standard-based algorithms to generate measurement data.

The signal-to-noise ratio (SNR) curve for dynamic range measurement in Fig. 3b is evaluated using a customized optical setup (Extended Data Fig. 4d,e) because the light source in the standard EMVA1288 machine has a limited dynamic range and cannot be programmed. Collimated light from a laser (Fisba, READYBeam) and a collimator are filtered by a high-frequency filter consisting of an objective and a pinhole, and then expanded by a lens to form a uniform light spot. This spot is then projected onto an optical power meter (Thorlabs, PM100D with S120C for high laser power characterization and PM160 for low laser power measurement) and the sensor chip on the same optical path. In both experiments, AOP data are collected by triggering the laser to generate flickers projected on the chip and recording the TD data with ±7-bit precision.

The key to obtaining the sensitivity and SNR curves is the calculation of the SNR from intensity images, which is outlined below

$${\rm{SNR}}=\frac{\mu -{\mu }_{{\rm{dark}}}}{\sigma },$$
(1)

where μ is the mean value of two consecutive normal images \({I}_{{t}_{n}}\) and \({I}_{{t}_{n+1}}\) exposed under a light source, μdark is the mean value of two consecutive dark images \({I}_{{t}_{n}}^{{\rm{dark}}}\) and \({I}_{{t}_{n+1}}^{{\rm{dark}}}\), and σ is the standard deviation of the noise. Considering an image with a resolution of M × N, μ and μdark are calculated by averaging across all rows m and columns n of the two images

$$\mu =\frac{1}{2{MN}}\mathop{\sum }\limits_{m=0}^{M-1}\mathop{\sum }\limits_{n=0}^{N-1}\left({I}_{{t}_{n}}\left(m,n\right)+{I}_{{t}_{n+1}}\left(m,n\right)\right),$$
$${\mu }_{{\rm{dark}}}=\frac{1}{2MN}\mathop{\sum }\limits_{m=0}^{M-1}\mathop{\sum }\limits_{n=0}^{N-1}\left({I}_{{t}_{n}}^{{\rm{dark}}}\left(m,n\right)+{I}_{{t}_{n+1}}^{{\rm{dark}}}\left(m,n\right)\right).$$
(2)

The noise standard deviation σ is calculated from the difference between the two consecutive images as

$${\sigma }^{2}=\frac{1}{2{MN}}\mathop{\sum }\limits_{m=0}^{M-1}\mathop{\sum }\limits_{n=0}^{N-1}{\left({I}_{{t}_{n}}\left(m,n\right)-{I}_{{t}_{n+1}}\left(m,n\right)\right)}^{2}.$$
(3)
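For reference, the following is a direct NumPy transcription of equations (1)–(3) from two bright frames and two dark frames; it is a sketch of the analysis, not the authors' evaluation code.

```python
import numpy as np

def snr_from_frame_pairs(I_t0, I_t1, I_dark0, I_dark1):
    """EMVA1288-style SNR from two bright and two dark frames,
    following equations (1)-(3) above."""
    I_t0, I_t1 = I_t0.astype(np.float64), I_t1.astype(np.float64)
    I_dark0, I_dark1 = I_dark0.astype(np.float64), I_dark1.astype(np.float64)
    mu = 0.5 * (I_t0.mean() + I_t1.mean())             # equation (2), bright frames
    mu_dark = 0.5 * (I_dark0.mean() + I_dark1.mean())  # equation (2), dark frames
    sigma2 = np.mean((I_t0 - I_t1) ** 2) / 2.0         # equation (3)
    return (mu - mu_dark) / np.sqrt(sigma2)            # equation (1)
```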

Chip and system characterization

Extended Data Fig. 5a demonstrates the ability of Tianmouc to effectively capture unpredictable fast-moving ping-pong balls shot by a machine in random directions. Tianmouc accurately captures the static background when no balls are shot. The ping-pong balls are ejected suddenly by the machine and their trajectory can be detected by both AOP-TD and AOP-SD. Moreover, clear textures of other objects are captured by the AOP-SD.

The power consumption of the Tianmouc chip is measured using multimeters (Fluke, 17B+ digital multimeter) for different operating modes, ranging from 328.4 mW to 419.7 mW (Extended Data Fig. 5b). Tianmouc achieves relatively low power consumption compared with traditional high-speed image sensors41. A breakdown of power for a typical operating mode is provided in Extended Data Fig. 5b.

The Tianmouc effectively addresses temporal aliasing by using the high speed of the AOP. As shown in Extended Data Fig. 5c, the COP records the rotation of a car wheel with aliasing, which is then accurately captured by the AOP owing to its fast response. Through a reconstruction algorithm, an anti-aliased reconstructed video is obtained, recovering the actual rotation at a speed of 757 fps.

As a supplement to Fig. 3e, we present the texture of lightning captured by Tianmouc under different operating modes. As shown in Extended Data Fig. 5d, the transient lightning with detailed texture is recorded by AOP at 1,515 fps, ±7 bits and a threshold of 50 mV. By applying an on-chip threshold filter, the peak bandwidth can be reduced to about 50 MB s−1 at a mode of 10,000 fps with ±1-bit precision (Fig. 3e), and 55 MB s−1 at a mode of 1,515 fps with ±7-bit precision (Extended Data Fig. 5d). This presents an 80–90% reduction compared with traditional cameras with equivalent spatiotemporal resolution and precision (640 × 320 × 1,515 × 8). Further details on the calculation of FOM can be found in Supplementary Note 5.

Scene reconstruction based on the CVP

For data reconstruction, a neural-network-based reconstructor is trained based on a self-supervised model. Two adjacent colour images from the COP and the AOP data stream between these images are used to reconstruct the original colourful scene. This process requires the same sampling rate as the AOP-SD and AOP-TD and the same resolution as the COP. The training process is adapted from ref. 53, incorporating the COP image at t0 \(({I}_{{t}_{0}}^{{\rm{C}}})\), the accumulation of AOP-TD from \({t}_{0} \sim {t}_{{\rm{f}}}(\sum _{t={t}_{0}-{t}_{{\rm{f}}}}{I}_{t}^{{\rm{TD}}})\) and the AOP-SD at the initial and final time points \(({I}_{{t}_{0}}^{{\rm{SD}}},{I}_{{t}_{{\rm{f}}}}^{{\rm{SD}}})\) to train a neural network F. The objective of F is to reconstruct the colour image at the target time \({t}_{{\rm{f}}}({I}_{{t}_{{\rm{f}}}}^{{\rm{C}}})\), that is, \(F({I}_{{t}_{0}}^{{\rm{C}}},{\sum }_{{t}_{{\rm{i}}}={t}_{0}}^{{t}_{{\rm{f}}}}{I}_{{t}_{{\rm{i}}}}^{{\rm{TD}}},{I}_{{t}_{0}}^{{\rm{SD}}},{I}_{{t}_{{\rm{f}}}}^{{\rm{SD}}})=\widehat{{I}_{{t}_{{\rm{f}}}}^{{\rm{C}}}}\) and the time-reversed process is also considered, as \(F({I}_{{t}_{{\rm{f}}}}^{{\rm{C}}},-{\sum }_{{t}_{{\rm{i}}}={t}_{0}}^{{t}_{{\rm{f}}}}{I}_{{t}_{{\rm{i}}}}^{{\rm{TD}}},{I}_{{t}_{0}}^{{\rm{SD}}},{I}_{{t}_{{\rm{f}}}}^{{\rm{SD}}})=\widehat{{I}_{{t}_{0}}^{{\rm{C}}}}.\)
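A minimal PyTorch sketch of this self-supervised objective is given below. The placeholder network stands in for F only to illustrate the input and output interface (the actual architecture, described next, uses a CBAM-based tiny-UNet, a SpyNet-style flow estimator and a fusion UNet); the single-channel TD/SD inputs and the L1 loss are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F_nn

class PlaceholderReconstructor(nn.Module):
    """Stand-in for the reconstruction network F (interface illustration only)."""
    def __init__(self, in_ch=6, out_ch=3):   # 3 RGB + 1 TD-sum + 2 SD channels (assumed)
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_ch, 3, padding=1),
        )

    def forward(self, rgb, td_sum, sd_0, sd_f):
        return self.net(torch.cat([rgb, td_sum, sd_0, sd_f], dim=1))

def self_supervised_loss(F, I_c0, I_cf, td_sum, sd_0, sd_f):
    """Forward prediction of the frame at t_f plus the time-reversed prediction
    of the frame at t_0 (with the accumulated TD negated), as described above."""
    pred_f = F(I_c0, td_sum, sd_0, sd_f)
    pred_0 = F(I_cf, -td_sum, sd_0, sd_f)
    return F_nn.l1_loss(pred_f, I_cf) + F_nn.l1_loss(pred_0, I_c0)

# Example with random tensors standing in for one COP/AOP training sample
F = PlaceholderReconstructor()
rgb0, rgbf = torch.rand(1, 3, 320, 640), torch.rand(1, 3, 320, 640)
td, sd0, sdf = torch.rand(1, 1, 320, 640), torch.rand(1, 1, 320, 640), torch.rand(1, 1, 320, 640)
loss = self_supervised_loss(F, rgb0, rgbf, td, sd0, sdf)
```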

The reconstruction network comprises three parts: a convolutional block attention module (CBAM)-based54 tiny-UNet55, an optical flow estimator and a fusion network, as shown in Extended Data Fig. 6a. The CBAM-based tiny-UNet takes the original COP data and \({\sum }_{{t}_{{\rm{i}}}={t}_{0}}^{{t}_{{\rm{f}}}}{I}_{{t}_{{\rm{i}}}}^{{\rm{TD}}}\) as input. The optical flow estimator, based on SpyNet56, is modified by replacing the colour images with \({I}_{{t}_{{\rm{f}}}}^{{\rm{SD}}}\) and \({I}_{{t}_{0}}^{{\rm{SD}}}\) as input. Then, \({\sum }_{{t}_{{\rm{i}}}={t}_{0}}^{{t}_{{\rm{f}}}}{I}_{{t}_{{\rm{i}}}}^{{\rm{TD}}}\) is added as a complement to the motion information, and the network structure is modified corresponding to the number of data channels, as shown in Extended Data Fig. 6b. The fusion network is also a UNet. Further details of the network structure can be found in Supplementary Table 1. The entire network is trained in an end-to-end manner, as shown in Extended Data Fig. 6c. By adjusting tf, the reconstruction result of an arbitrary frame can be obtained, as shown in Extended Data Fig. 6d. Using this approach, at least 757 fps reconstructed colour images with a resolution of 640 × 320 can be achieved. Moreover, the network can be generalized to high dynamic range scenes, as shown in Fig. 4c. Details of the evaluation process of the reconstruction algorithm can be found in Supplementary Note 7.

Perception algorithms used in the experiments

For real-time perceptual tasks, we build a parallel and complementary data-processing pipeline, as shown in Extended Data Fig. 7. The system streams the output of the Tianmouc chip to support CVP-based perception processes. To accomplish detection and tracking tasks in the open world, two neural network (NN)-based algorithms run in parallel in our system: a multi-task network modified from YOLOPv1 (refs. 57,58) and a high-speed detector modified from YOLOv5s, which originated from YOLOv4 (ref. 59); their outputs are combined by a multiple-object tracker (MOT) based on a Kalman filter and an optical flow filter. The complete system contains a Tianmouc chip, an FPGA and a host computer (NVIDIA Jetson AGX Orin). The Tianmouc serves as the core of the whole system: it senses visual signals, converts them into two pathways, encodes the signals from the two pathways in a compressive digital format and outputs the digital data. The output data are transmitted from the chip to the FPGA by parallel or serial interfaces and then forwarded to the host computer through a PCIe interface without modification. All perceptual algorithms are executed on the host computer. The training details and the evaluation method can be found in Supplementary Notes 6 and 7, with detailed evaluation results provided in Supplementary Note 8 using the annotated Tianmouc datasets referenced in Fig. 4.

For complementary perception, the results are synchronized and integrated using a MOT based on a Kalman filter. Detection results from multiple pathways are time-stamped and transmitted through different buffers into a tracking thread. The MOT is set to record the historical results for about 150 ms and updates tracking results at the same speed as the detector on the AOP. It also gives a tracking trace by drawing the centre points of tracked targets. Detection results given by different pathways are synchronized in the same trace.
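The synchronization logic can be pictured with the following sketch, in which detections from either pathway are time-stamped, buffered for roughly 150 ms and merged into a single trace of centre points; the Kalman-filter state update itself is omitted, and the data layout is an assumption.

```python
from collections import deque
import time

class TrackBuffer:
    """Sketch of the cross-pathway synchronization described above: detections
    from the COP and AOP threads are time-stamped, kept for ~150 ms and merged
    into one trace (Kalman-filter update omitted; illustrative only)."""
    def __init__(self, history_s=0.15):
        self.history_s = history_s
        self.buffer = deque()                      # (timestamp, pathway, bounding box)

    def push(self, pathway, detection, timestamp=None):
        t = time.time() if timestamp is None else timestamp
        self.buffer.append((t, pathway, detection))
        while self.buffer and t - self.buffer[0][0] > self.history_s:
            self.buffer.popleft()                  # evict results older than ~150 ms

    def trace(self):
        """Centre points of all buffered detections, ordered in time."""
        return [((x1 + x2) / 2, (y1 + y2) / 2)
                for _, _, (x1, y1, x2, y2) in sorted(self.buffer)]

# Example: one COP and one AOP detection merged into the same trace
buf = TrackBuffer()
buf.push("COP", (100, 50, 140, 90))
buf.push("AOP", (102, 52, 142, 92))
print(buf.trace())   # -> [(120.0, 70.0), (122.0, 72.0)]
```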

Moreover, an optical flow solver and optical flow filter are introduced into the detection-tracking task. We adopt the Horn–Schunck method60 for real-time dense optical flow calculation. After obtaining the dense optical flow (uij, vij) for each point (i, j), we calculate the average of all non-zero optical flow values in the field of view (FOV) to obtain the global optical flow \(({u}_{{\rm{mean}}},{v}_{{\rm{mean}}})=(\frac{{\sum }_{i=1}^{W}{\sum }_{j=1}^{H}{u}_{ij}}{N},\frac{{\sum }_{i=1}^{W}{\sum }_{j=1}^{H}{v}_{ij}}{N})\) to approximate the camera motion, where N is the number of pixels with non-zero optical flow vectors, and W and H are the width and height of the AOP frame, respectively. Based on this, targets whose motion is inconsistent with the global motion can be filtered out as potential out-of-distribution obstacles through morphological operations. For more details, see Supplementary Note 9.
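A simplified version of this filtering step is sketched below: the global flow is the mean of all non-zero flow vectors, and pixels whose flow deviates strongly from it are flagged as candidate anomalies. The deviation test and its threshold are assumptions standing in for the morphological post-processing described in Supplementary Note 9.

```python
import numpy as np

def global_flow(u, v):
    """Average of all non-zero optical-flow vectors in the FOV, used as a
    proxy for camera ego-motion (as in the formula above)."""
    mask = (u != 0) | (v != 0)
    n = max(mask.sum(), 1)
    return u[mask].sum() / n, v[mask].sum() / n

def anomaly_mask(u, v, ratio=2.0):
    """Flag pixels whose flow deviates strongly from the global motion; the
    deviation ratio and this simple magnitude test are illustrative assumptions."""
    u_mean, v_mean = global_flow(u, v)
    dev = np.hypot(u - u_mean, v - v_mean)
    return dev > ratio * max(np.hypot(u_mean, v_mean), 1e-6)
```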

Bandwidth and data visualization in different open-world scenarios

To demonstrate the efficiency of Tianmouc, we calculate the bandwidth consumption for 10 different scenarios, while maintaining consistent precision and speed settings of the sensor. The results are presented in Extended Data Fig. 8a–d, corresponding to Fig. 4d–g, respectively. In Extended Data Fig. 8e, the performance of the algorithm is showcased while travelling on a tree-shaded road. Extended Data Fig. 8f shows a challenging scenario in which our vehicle passes over a speed bump, including notable camera shaking and large-amplitude vibration, a corner case in our evaluation. Despite these challenges, Tianmouc effectively tracks the target object owing to the high response speed of the AOP. In Extended Data Fig. 8g, our vehicle navigates through roads with heavy traffic, resulting in dense AOP-TD, whereas in Extended Data Fig. 8h, the system operates on a highway with almost zero relative speed with respect to other vehicles, leading to very sparse AOP-TD. Extended Data Fig. 8i shows the performance of Tianmouc when entering and leaving a short tunnel with a large variation in light on the same target. Extended Data Fig. 8j simulates an artificial anomaly in which many people are playing basketball at a tunnel exit while the vehicle remains stationary. Here, the AOP-TD does not respond effectively, whereas the AOP-SD still provides a clear description of the bright part.

On the right of each case, we record and average the bandwidth consumption of the AOP-TD, AOP-SD, COP and their combination across the entire sample. The actual average bandwidth of Tianmouc ranges from 50 MB s−1 to 80 MB s−1, with peak bandwidth generally below 80 MB s−1, significantly less than that of traditional high-speed, high-dynamic-range cameras. The main bandwidth consumption of Tianmouc is caused by the AOP-TD and AOP-SD because of their high sampling speed. However, the sparse data distribution of the AOP-TD and AOP-SD, together with efficient coding methods, reduces the bandwidth requirements of the AOP across all tested environments.