
1 Introduction

Artificial intelligence (AI) and machine learning have progressed tremendously over recent years and are now the focus of intense interest worldwide in many fields, with applications ranging from self-driving cars (Huval et al. 2015) to health monitoring systems (Witt et al. 2019). This rapid progress has occurred over only a few years and was driven by algorithmic advances and improvements in computing hardware (LeCun et al. 2015) that have resulted in much shorter training and validation times for AI systems. The expectation that better hardware could contribute to further improving AI systems currently fuels a large research effort to find new “computing substrates” for AI. While conventional AI is implemented with software running on general-purpose computers, it is widely accepted that much more efficient hardware implementations of AI must exist; our brains are an existence proof that some computing architectures can far exceed the density and energy efficiency of current microelectronics technology. We have published the first demonstration that microelectromechanical systems (MEMS) are an appropriate substrate for miniature, low-energy-consumption AI systems (Coulombe et al. 2017; Dion et al. 2018). By exploiting the nonlinearity of microfabricated mechanical oscillators, our approach implements the concept of reservoir computing (RC) (Jaeger and Haas 2004) physically in MEMS. As MEMS can be fabricated to small dimensions and therefore have high resonance frequencies (up to the GHz range; van Beek et al. 2007), our approach has the potential to be used as a highly efficient electrical component to implement reservoir computing.

As MEMS are also the mainstream technology for many modern sensors (Khoshnoud and de Silva 2012), our work further paves the way to the development of a new class of smart sensors with built-in data processing capabilities. As an example, we have demonstrated a MEMS displacement sensor which implements reservoir computing in the mechanical domain (Barazani et al. 2019). As the sensor is moved randomly between two positions separated by \(2\;\upmu \mathrm{m}\) at \(20.8\;\mathrm{Hz}\), it uses the nonlinear dynamics of its resonating mechanical structures to compute, at every timestep at which the position can change, whether it has been in one of the two positions an even or an odd number of times over the last five timesteps. More recently, we have also demonstrated a MEMS accelerometer with similar neuromorphic computing capabilities (Barazani et al. 2020). By using the hardware implementation of reservoir computing in MEMS, these devices offer both sensing and non-trivial computing functions in small, highly integrated structures. We envision a number of applications for MEMS sensors integrating machine learning capabilities through our architecture (Sylvestre et al. 2018), especially in fields where small, energy-efficient systems are required, including the Internet of Things, autonomous systems, as well as mobile and wearable electronic devices.

This chapter provides a general overview of our neuromorphic computing MEMS technology. We start with an introduction to MEMS in Sect. 2, including the unique characteristics of microfabricated devices (relative to conventional devices) which are leveraged to implement computing functionalities. We discuss the modeling and analysis of nonlinear MEMS resonators (Sect. 3), leading to an example of a silicon beam design which has proven to be useful in experiments. Measurements of computing performances are presented in Sect. 4, together with observations on the tuning of the system parameters to optimize performance on different benchmark tasks.

2 Microelectromechanical Systems

Microelectromechanical systems (MEMS) are miniaturized machines able to sense or produce displacements at the micrometer and sub-micrometer scales, typically in the range of 0.1 \(\upmu \)m to 100 \(\upmu \)m. MEMS devices comprise structures such as beams or membranes that are able to move relative to the substrate, providing actuation (MEMS actuators, e.g., micropumps) or detection capabilities (MEMS sensors, e.g., pressure or force meters). However, the design of miniaturized actuators and sensors differs in some respects from the design of conventional machines. At the scale of MEMS structures, surface forces (such as electrostatic and adhesion forces) are dominant compared to volumetric forces (such as gravitational and inertial forces). For instance, water surface tension forces can completely suppress MEMS mobility and are sometimes very difficult to avoid (Van Spengen et al. 2003). On the other hand, the micrometer-scale dimensions of MEMS allow them to be batch produced and assembled in the same chip as the electronic circuits, resulting in cheaper (lower cost per unit), faster, and more compact monolithic devices. Furthermore, MEMS tend to demonstrate higher sensitivity, faster response, and lower energy consumption than conventional mechanisms (Ananthasuresh 2012). MEMS applications are quite diverse and include, for example, ink-jet printer nozzles, airbag sensors, mirror arrays in video projectors, focusing systems in smartphone cameras, and accelerometers in smartphones or personal fitness trackers.

2.1 MEMS Fabrication

In order to manufacture MEMS, traditional fabrication methods such as milling and extrusion are replaced by processes with increased precision and resolution, such as photolithography, chemical etching, and plasma etching. MEMS fabrication utilizes processes adapted from the microelectronics industry, which were mainly developed for the handling and processing of silicon substrates (Madou 1997; Liu 2006). This sort of manufacturing consists of multiple steps of deposition and etching of structural (usually silicon) and sacrificial (usually oxide) thin films. At the end of the process, the sacrificial material is removed to enable the structural parts to move relative to the substrate. One simple MEMS fabrication method is the direct etching of silicon on insulator (SOI) wafers. SOI wafers are standardized stacks composed of a device structural layer on the top, an oxide sacrificial layer in the middle, and a handle substrate layer at the bottom. The SOI MEMS fabrication process, illustrated in Fig. 1, can be roughly summarized into two main steps: (1) etching of the device layer, after it is patterned using photolithography; and (2) partial removal of the oxide layer granting motion to the structural parts, which remain connected to the substrate through the oxide that is not etched away (the anchors). The addition of electrical contacts to the fabrication flow allows the induction of motion by the application of electrical voltages. Likewise, measurements of voltage changes can be used to gage MEMS motion.

Fig. 1

a Initial SOI wafer. b Parts of the device layer are selected to be etched away after photolithography. c SOI wafer after the etching of the device layer. d Oxide areas to be removed in a selective etching procedure. e Final device after the oxide removal; the remaining silicon structure is anchored to the handle and is free to flex or move relative to the handle

2.2 Sensing and Driving Methods

There are several techniques used to provide or detect microscale displacements in MEMS. The most common operating principles include electrostatic, electrothermal, piezoelectric, and piezoresistive (Liu 2006). In the great majority of MEMS devices, energy conversion involves an input or an output electrical signal, typically a voltage difference. The electrostatic and electrothermal phenomena, which produce forces that are usually negligible in conventionally sized mechanisms, are the most traditional configurations for driving and sensing in MEMS. Electrothermal MEMS, for example, produce motion through the thermal expansion of structures (usually beams) caused by Joule heating due to the application of voltage (Lai et al. 2004). In the case of electrostatic MEMS, motion is induced by electrostatic forces between microelectrodes separated by a small gap (Batra et al. 2007). Alternatively, changes in the gaps caused by an external force can be measured by the capacitance change between the electrodes.

MEMS accelerometers, some of the most commercially successful MEMS devices, may present a large variety of design types and working principles (Yazdi et al. 1998). Typically, external inertial forces displace an inertial mass that is suspended by compliant springs. This motion is then converted to an electrical signal that is proportional to the magnitude of this displacement. The transduction principle is usually capacitive (electrostatic) or piezoresistive (changes in the electrical resistance due to mechanical deformations). MEMS accelerometers can detect in-plane or out-of-plane forces depending on their design configurations (Fig. 2). Planar accelerometers commonly use an interdigitated configuration in order to increase the total capacitance and therefore the electrostatic sensitivity of the sensor. Higher sensitivity can also be achieved by reducing the accelerometer’s natural frequency, which could be done by diminishing the suspension’s stiffness. However, this also reduces the frequency response (bandwidth) of the sensor. Another practice to increase the sensitivity is to increase the signal-to-noise ratio by reducing the system’s damping. This is usually done by etching holes along the proof mass or by operating the device under vacuum.

Fig. 2

Schematic illustration of two possible configurations of MEMS accelerometers: a the proof mass moves laterally, enabling the detection of in-plane accelerations, and b the proof mass moves up and down allowing the device to sense out-of-plane accelerations

2.3 MEMS Dynamics and Nonlinearity

MEMS devices are frequently designed to work in their dynamic regime, as it happens for example in MEMS resonators. As vibrating structures, MEMS exhibit much higher resonance frequencies compared to non-miniaturized mechanisms. This is because of their much higher k/m ratio, where k is the device elastic constant and m is its total mass. In MEMS resonators, shifts in the resonance frequency can be used to detect changes of different physical quantities, enabling the manufacturing of a variety of sensors such as pressure, force, and temperature sensors (Tilmans et al. 1992). The resonance frequency of MEMS tends to be very well defined (small bandwidth) due to their typically large quality factor (Q), which is a measure of the energy dissipation of oscillating structures. High values of Q indicate low energy dissipation, which leads to lower energy consumption, higher sensitivity, and lower noise. Energy dissipation can be classified as intrinsic or extrinsic (Ekinci and Roukes 2005). The former is associated with losses due to the material microstructure while the latter is mainly related to losses induced by the media surrounding the device. Extrinsic damping effects such as drag forces or squeezed films (when structures are too close) are usually the dominant sources of energy dissipation.

Another observed characteristic of MEMS resonators is their nonlinearity. Micromechanical oscillating structures demonstrate nonlinear behavior when driven above a certain critical amplitude (Husain et al. 2003; Ekinci and Roukes 2005). Frequently, the Duffing equation for nonlinear oscillators is used to describe the motion of MEMS resonators. Essentially, when oscillating at very large amplitudes (above the critical amplitude), changes in the structure’s stiffness result in nonlinear shifts of the resonance frequency. In the case of a clamped–clamped microbeam (i.e., with both ends anchored) vibrating in its flexural mode, large driving amplitudes generate tensile forces that increase the beam stiffness, resulting in an increase of its resonance frequency (Tilmans et al. 1992). The onset of nonlinearity in microstructures has been explored elsewhere (Buks and Yurke 2006; Tadokoro et al. 2018). In this study, the nonlinearity of a clamped–clamped microbeam is used to set up a reservoir computing system able to perform non-trivial computing tasks.

3 Driven Oscillators with Duffing Nonlinearities

The Duffing model was first introduced to describe the hardening spring effect observed in mechanical systems (Duffing 1918). It is one of the most common models used to describe the jump phenomenon observed in highly deformed mechanical resonators, where a slight change of forcing frequency leads to an abrupt, discontinuous change in the steady-state amplitude (Guckenheimer and Holmes 2002; Kalmar-Nagy and Balachandran 2011). It keeps a simple mathematical form and admits, under some approximations, analytical solutions (Ali 1995; Worden 1996).

3.1 Duffing Oscillator

Several micromechanical structures behave as nonlinear systems for high levels of excitation (Ekinci and Roukes 2005; Zaitsev et al. 2012). The Duffing equation with damping and external harmonic forcing is

$$\begin{aligned} \ddot{x}+\frac{\omega _0}{Q}\dot{x}+\omega _0^2x+\beta x^3 = A\cos (\Omega t) , \end{aligned}$$
(1)

where x, t, \(\omega _0\), Q, A, \(\Omega \), and \(\beta \) are the displacement, time, undamped angular frequency, quality factor, excitation amplitude, angular excitation frequency, and cubic stiffness parameter, respectively. Dots denote derivatives with respect to time. As can be seen, Eq. (1) reduces to the forced damped linear oscillator when the anharmonic term is ignored (\(\beta =0\)). An approximate solution for the position x(t) can be obtained for small \(\omega _0/Q\), \(\beta \), and A values, assuming that the forcing is close to resonance so that \(\Omega - \omega _0\) is also small. Equation (1) can then be viewed as a perturbation of the autonomous harmonic oscillator. The perturbation technique known as “averaging” gives an approximate steady-state solution \(x(t)=r \cos (\Omega t + \phi )\), where r is the oscillation amplitude and \(\phi \) is the phase (see Guckenheimer and Holmes 2002 or Jan 2007 for details). Averaging gives a frequency response curve (Jan 2007),

$$\begin{aligned} \left( -2 \omega _0 \left( \Omega - \omega _0\right) r + \tfrac{3}{4} \beta r^3\right) ^2 + 4 \left( \omega _0^2/Q\right) ^2 r^2 - A^2 = 0 , \end{aligned}$$
(2)

which can be solved for r.

Fig. 3

Amplitude–frequency response curves for the linear system (\(\beta =0 \;\mathrm{Hz}^2/\mathrm{m}^{2}\)), from the exact solution, and for the nonlinear system with stiffness parameter \(\beta =\pm 0.05 \;\mathrm{Hz}^2/\mathrm{m}^{2}\), from averaging. Stable and unstable solutions are denoted as solid and dashed lines, respectively. The parameters used to construct these curves were \(A=2.5 \;\mathrm{m}/\mathrm{s}^2\), \(Q=5\), and \(\omega _0=1 \;\mathrm{rad/s}\)

Fig. 4

Amplitude–frequency response curve obtained numerically by a sweep up followed by a sweep down of \(\Omega \); the jump and the hysteresis are apparent. The parameter values used to construct these response curves were \(A=2.5 \;\mathrm{m}/\mathrm{s}^2\), \(Q=5\), \(\beta =0.05 \;\mathrm{Hz}^2/\mathrm{m}^{2}\), and \(\omega _0=1 \;\mathrm{rad/s}\)

Figure 3 shows the frequency response curve for \(\beta =0\) (from the exact solution of the linear problem) and curves from averaging for \(\beta =\pm 0.05 \;\mathrm{Hz}^2/\mathrm{m}^{2}\). The introduction of the cubic nonlinearity tilts the curve to the right for \(\beta >0\) (hardening spring) and to the left for \(\beta <0\) (softening spring). Furthermore, close to the peak, there are three possible solutions for a given \(\Omega \) (two stable ones and an unstable one, denoted as a dashed line). Figure 4 shows numerical solutions to Eq. (1) for \(\beta =0.05 \;\mathrm{Hz}^2/\mathrm{m}^{2}\), as the forcing angular frequency \(\Omega \) is swept up and down. Once \(\Omega \) is increased above the angular frequency of the peak \(\Omega _\downarrow \), the oscillation amplitude abruptly jumps to the lower branch, which is the only remaining solution. As \(\Omega \) is reduced again, the oscillation amplitude follows the lower stable branch and jumps back to the upper branch once it reaches the unstable solution, at \(\Omega _\uparrow \). Since \(\Omega _\downarrow >\Omega _\uparrow \), the nonlinear system exhibits hysteresis.
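The sweep experiment described above can be sketched numerically. The following is a minimal illustration, not the procedure used to produce Fig. 4: it integrates Eq. (1) with a hand-written fourth-order Runge–Kutta scheme for the parameter values quoted in the caption of Fig. 4. The frequency grid, time step, settling time, and the function names `deriv`, `rk4_step`, and `sweep` are all assumptions made for this example. The downward sweep is restarted from rest at the highest frequency, where only the lower branch should exist, rather than being chained to the upward sweep.

```python
import math

# Parameter values quoted in the caption of Fig. 4.
A, Q, BETA, OMEGA0 = 2.5, 5.0, 0.05, 1.0


def deriv(x, v, phase):
    """Return (dx/dt, dv/dt) for Eq. (1), with the drive written as A*cos(phase)."""
    a = A * math.cos(phase) - (OMEGA0 / Q) * v - OMEGA0**2 * x - BETA * x**3
    return v, a


def rk4_step(x, v, phase, omega, dt):
    """Advance the state by dt with the classical fourth-order Runge-Kutta scheme."""
    half = phase + 0.5 * omega * dt
    full = phase + omega * dt
    k1x, k1v = deriv(x, v, phase)
    k2x, k2v = deriv(x + 0.5 * dt * k1x, v + 0.5 * dt * k1v, half)
    k3x, k3v = deriv(x + 0.5 * dt * k2x, v + 0.5 * dt * k2v, half)
    k4x, k4v = deriv(x + dt * k3x, v + dt * k3v, full)
    x += dt * (k1x + 2.0 * k2x + 2.0 * k3x + k4x) / 6.0
    v += dt * (k1v + 2.0 * k2v + 2.0 * k3v + k4v) / 6.0
    return x, v, full


def sweep(indices, dt=0.025, settle=60.0, measure=20.0):
    """Step Omega = 1.0 + 0.05*i through `indices`, keeping the state between steps.

    Returns {i: amplitude}, where the amplitude is max|x| over the measurement
    window that follows the settling time (all values are illustrative choices).
    """
    x = v = phase = 0.0
    amplitude = {}
    for i in indices:
        omega = 1.0 + 0.05 * i
        for _ in range(round(settle / dt)):
            x, v, phase = rk4_step(x, v, phase, omega, dt)
        peak = 0.0
        for _ in range(round(measure / dt)):
            x, v, phase = rk4_step(x, v, phase, omega, dt)
            peak = max(peak, abs(x))
        amplitude[i] = peak
    return amplitude


up = sweep(range(0, 21))         # Omega from 1.0 up to 2.0 rad/s
down = sweep(range(20, -1, -1))  # Omega from 2.0 back down to 1.0 rad/s
for i in range(0, 21, 2):
    print(f"Omega = {1.0 + 0.05 * i:.2f} rad/s  up: {up[i]:.2f}  down: {down[i]:.2f}")
```

Within the hysteresis window, the two sweeps are expected to report clearly different amplitudes at the same \(\Omega \), while outside of it they should coincide.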

Fig. 5

Phase-space plots obtained from Eq. (1) for three motion conditions: harmonic oscillations with weak forcing \(A=0.2 \mathrm{m}/\mathrm{s}^2\) and cubic stiffness corresponding to zero (blue line), moderate forcing \(A=0.29 \;\mathrm{m}/\mathrm{s}^2\) with \(\beta = 1 \;\mathrm{Hz}^2/\mathrm{m}^{2}\) (black line), and chaotic oscillations at high forcing level \(A=0.5 \;\mathrm{m}/\mathrm{s}^2\), \(\beta = 1 \;\mathrm{Hz}^2/\mathrm{m}^{2}\) (red line). The other parameters used to construct these curves were \(Q=3.33\), \(\omega _0=1 \;\mathrm{rad}/\mathrm{s}\), and \(\Omega =1.2 \;\mathrm{rad/s}\)

Figure 5 shows the phase-space plots of three distinct motion regimes. For low forcing amplitudes, or when the anharmonic term is not taken into account in the Duffing equation (1), the motion of the resonator resembles that of a linear harmonic oscillator, whose response in phase space is an ellipse. At intermediate forcing, the system can have more complex dynamics due to the stiffening characteristic of the resonator: there can be more than one harmonic component in the oscillator motion, as studied in Kalmar-Nagy and Balachandran (2011). Large forcing amplitudes lead to chaotic motion, and the system becomes very sensitive to the initial conditions.

For nonlinear Duffing systems, sudden jumps in the resonance response are observed, as in Fig. 6. The jump frequency depends on the direction of the frequency sweep and the type of nonlinearity (softening or stiffening) (Malatkar and Nayfeh 2002). For a lightly damped Duffing oscillator, Brennan et al. (2008) presented a simple approximate non-dimensional expression which gives the maximum oscillation amplitude \(r_{\max }\) at the jump-down frequency \(\Omega _\downarrow \). The relationship between the jump-down frequency and the cubic stiffness can be written in a dimensional form as (Tang et al. 2016)

$$\begin{aligned} \Omega _\downarrow ^2 = \frac{3}{4} \beta r_{\max }^2 + \omega _0^2. \end{aligned}$$
(3)

Solving for \(r_{\max }\) gives the so-called “backbone curve” represented by the dashed line in Fig. 6. It can be used to predict the frequency response of the system (Cammarano et al. 2014; Arroyo and Zanette 2016).
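As a small illustration of how Eq. (3) is inverted to trace the backbone curve, the sketch below computes the peak amplitude associated with a given jump-down frequency. The function name `backbone_amplitude` and the sample frequencies are assumptions made for this example; the default parameters are those of the caption of Fig. 6, and only the hardening case \(\beta >0\) is handled.

```python
import math


def backbone_amplitude(omega_jump, omega0=1.0, beta=0.05):
    """Invert Eq. (3): peak amplitude r_max reached at the jump-down frequency.

    omega_jump and omega0 are in rad/s and beta in Hz^2/m^2, so r_max is in m.
    """
    r_squared = 4.0 * (omega_jump**2 - omega0**2) / (3.0 * beta)
    if r_squared < 0.0:
        raise ValueError("for beta > 0 the backbone only exists for omega_jump >= omega0")
    return math.sqrt(r_squared)


# Trace a few points of the backbone curve for the parameters of Fig. 6.
for omega in (1.0, 1.2, 1.4, 1.6):
    print(f"Omega = {omega:.1f} rad/s -> r_max = {backbone_amplitude(omega):.3f} m")
```

Substituting the returned amplitude back into Eq. (3) recovers the input frequency, which is a convenient sanity check when fitting \(\beta \) to measured jump frequencies.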

Fig. 6

Example of backbone curve (dashed line) for a Duffing oscillator swept up in forcing frequency. The parameter values used to construct these response curves were \(Q=5\), \(\beta =0.05 \;\mathrm{Hz}^2/\mathrm{m}^2\), and \(\omega _0=1 \;\mathrm{rad/s}\)

3.2 Clamped–Clamped Beams

A clamped–clamped beam is an oscillator exhibiting an anharmonic behavior at higher excitation amplitudes. Multiple studies have demonstrated that the Duffing model may describe the nonlinear behaviors observed in the beam dynamics (Verbridge et al. 2006; Antonio et al. 2012; Abdolvand et al. 2016).

Figure 7 depicts a simplified schematic of a clamped–clamped structure.

Fig. 7

Schematic description of a clamped–clamped beam: l, w, and h correspond to the length, width, and thickness of the beam

3.2.1 Linear Analysis

The mass–damper–spring system represents the simplest model used to describe the linear motion of the resonator. It corresponds to Eq. (1) with the nonlinear coefficient \(\beta \) set to zero. The damper is associated here with energy losses in the system. The fundamental frequency of an excited clamped–clamped beam can be determined by solving the differential equation from Euler–Bernoulli beam theory. We assume that the beam deflection follows the fundamental vibration mode. The expression of the undamped resonance frequency for a clamped–clamped beam subjected to a lateral surface excitation can then be written as (Tilmans et al. 1992; Bao 2005)

$$\begin{aligned} \omega _0 = \frac{\lambda ^2}{l^2}\sqrt{\frac{E I}{\rho w h}}, \end{aligned}$$
(4)

where I, E, \(\rho \), l, w, and h are the second moment of area, Young’s modulus, mass density, length, width, and thickness of the beam, respectively. \(\lambda \) is a constant satisfying \(\mathrm{cosh}(\lambda )\mathrm{cos}(\lambda ) = 1\). Equation (4) indicates that the resonance frequency is closely related to the geometry of the mechanical structure. It corresponds, for instance, to 389 kHz for a 300 \(\upmu \mathrm{m}\) long silicon beam with a width and a thickness of 4 \(\upmu \mathrm{m}\) and 10 \(\upmu \mathrm{m}\), respectively (\(\lambda = 4.73\) in that case).
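The numerical example above can be checked with a short script. This is a sketch under stated assumptions: the material constants (Young’s modulus of 169 GPa and density of 2330 kg/m\(^3\) for silicon) are not given in the text and are assumed here, the beam is taken to bend along its 4 \(\upmu \mathrm{m}\) dimension so that \(I = h w^3/12\), and the quoted 389 kHz is interpreted as \(\omega _0/2\pi \). The function names are illustrative.

```python
import math

E_SI = 169e9     # Pa; assumed Young's modulus of silicon (not given in the text)
RHO_SI = 2330.0  # kg/m^3; assumed mass density of silicon (not given in the text)


def clamped_clamped_lambda(lo=4.0, hi=5.0, iterations=60):
    """First nonzero root of cosh(s)*cos(s) = 1, found by bisection on [lo, hi]."""
    def g(s):
        return math.cosh(s) * math.cos(s) - 1.0

    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if g(lo) * g(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def resonance_frequency_hz(length, width, thickness, young=E_SI, rho=RHO_SI):
    """Eq. (4) divided by 2*pi, assuming bending along the `width` dimension."""
    lam = clamped_clamped_lambda()
    inertia = thickness * width**3 / 12.0  # second moment of area for that bending axis
    omega0 = (lam / length) ** 2 * math.sqrt(young * inertia / (rho * width * thickness))
    return omega0 / (2.0 * math.pi)


f0 = resonance_frequency_hz(300e-6, 4e-6, 10e-6)
print(f"lambda = {clamped_clamped_lambda():.4f}, f0 = {f0 / 1e3:.0f} kHz")
```

With these assumed constants the result should fall within about one percent of the quoted value; a different choice of Young’s modulus (silicon is anisotropic) shifts the frequency as \(\sqrt{E}\).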

3.2.2 Nonlinearity Effects

In a clamped–clamped beam, the nonlinear parameter caused by the elongation of the beam can be approximated from (Postma et al. 2005)

$$\begin{aligned} \beta = \frac{E}{18\rho } \left( \frac{2\pi }{l}\right) ^4. \end{aligned}$$
(5)

For example, the nonlinear coefficient calculated using Eq. (5) for a 300 \(\upmu \mathrm{m}\) long silicon beam is \(7.75\times 10^{23}\) (Hz/m)\(^2\).
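The value above follows directly from Eq. (5). The sketch below evaluates it, again assuming a Young’s modulus of 169 GPa and a density of 2330 kg/m\(^3\) for silicon (values not stated in the text); the function name `duffing_beta` is illustrative.

```python
import math

E_SI = 169e9     # Pa; assumed Young's modulus of silicon (not given in the text)
RHO_SI = 2330.0  # kg/m^3; assumed mass density of silicon (not given in the text)


def duffing_beta(length, young=E_SI, rho=RHO_SI):
    """Eq. (5): cubic stiffness parameter of a clamped-clamped beam, in Hz^2/m^2."""
    return young / (18.0 * rho) * (2.0 * math.pi / length) ** 4


beta_300 = duffing_beta(300e-6)
print(f"beta = {beta_300:.3g} (Hz/m)^2 for a 300 um beam")
```

Because \(\beta \) scales as \(l^{-4}\), modest changes in beam length strongly affect the drive amplitude needed to reach the nonlinear regime.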

Fig. 8

a Displacement mapping of 300 \(\upmu \mathrm{m}\) clamped–clamped silicon beam obtained by ANSYS modal analysis. b Results of the analysis of transient finite elements on the clamped–clamped beam for different force amplitudes. The width w and thickness h of the beam were 4 \(\upmu \mathrm{m}\) and 10 \(\upmu \mathrm{m}\), respectively

To better understand the nonlinear dynamics of a clamped–clamped beam, a finite element model was developed using the ANSYS software (Theory Reference for the Mechanical 2017). Figure 8a presents a deformed silicon beam in its fundamental mode. The anchors, substrate, and gages are also considered in the simulation. An initial modal simulation is used to identify the resonance modes of the beam. Using an explicit time analysis, the system is then excited in the proximity of a resonant peak by a time-varying lateral force applied in the middle of the beam. This analysis takes into account the nonlinear phenomena induced by large geometrical deformations and the mechanical dissipation that occurs during the structure motion.

The simulation results are depicted in Fig. 8b. We first note that the “hardening” phenomenon, characteristic of a Duffing oscillator, is present. Unlike the symmetric response in the linear case, the peak amplitudes shift to higher frequencies when the excitation force increases. Jumps are also observed. The cubic stiffness parameter can be determined from a fit to Eq. (3) and is equal to \((1.87 \pm 0.26) \times 10^{23}\) (Hz/m)\(^2\). This value is of the same order of magnitude as the theoretical estimate from Eq. (5).

3.2.3 Damping Effects

The energy dissipation mechanisms of the mechanical system are associated with damping effects. The parameter indicating the damping and the efficiency of the resonator systems, the so-called quality factor Q, can be defined as \(2\pi \) times the ratio of the energy stored in the oscillator (here, \(kr^2/2\)) to the energy dissipated per period, \(\Delta \) (Tilmans et al. 1992; Bao and Yang 2007)

$$\begin{aligned} Q =2\pi \times \frac{kr^2/2}{\Delta }. \end{aligned}$$
(6)

Figure 9 depicts the amplitude–frequency curve for three damping conditions using numerical solutions of the Duffing equation (Eq. (1)). The largest damping corresponds to the smallest quality factor (black line), while the peak amplitude is highest for the smallest damping (blue line). Note that the peak amplitude would be infinite in the absence of damping.

Fig. 9

Effect of damping on the amplitude–frequency response curve: small damping (blue line), intermediate damping (red line), and high damping (black line). The parameter values used to construct these response curves were \(A=2.5 \;\mathrm{m}/\mathrm{s}^2\), \(\beta =0.05 \;\mathrm{Hz}^2/\mathrm{m}^2\), and \(\omega _0=1 \;\mathrm{rad/s}\)

There are several sources of damping in mechanical structures. A quality factor \(Q_i\) can be attributed to each dissipation mechanism. The total quality factor Q can then be written in a form analogous to Matthiessen’s rule (Matthiessen and Vogt 1864; Naeli and Brand 2009)

$$\begin{aligned} \frac{1}{Q} = \sum _i \frac{1}{Q_i}. \end{aligned}$$
(7)

The extrinsic damping caused by the surrounding air can often be ignored for conventional mechanical systems. However, as air damping is related to the surface area of the resonator, viscous air damping can be significant for micromechanical devices. The first damping mechanism highlighted is the drag force. It represents the effect caused by the surrounding gas on the resonator when the beam is far away from any surrounding object. From Naeli and Brand (2009), the quality factor describing gas dissipation in microbeams is

$$\begin{aligned} Q_d = \frac{\rho h w \omega _0}{3 \pi \left( \mu + w \sqrt{\rho _a \mu \omega _0 / 16} \right) }, \end{aligned}$$
(8)

where \(\mu \) is the dynamic viscosity of air and \(\rho _a\) is its mass density. Drag damping can be reduced experimentally, and \(Q_d\) correspondingly increased, by placing mechanical devices under vacuum (Tilmans and Legtenberg 1994; Gui et al. 1995).

A driving electrode must be close to the beam in order to electrostatically drive the mechanical resonator. If the gap d between the beam and the electrode is small compared to the beam thickness h, the main damping mechanism is the “squeezed-film effect” due to the incompressible character of the gas. This is all the more important when the gap is reduced. The corresponding analytical expression of squeezed-film damping is (Starr 1990; Bao 2005)

$$\begin{aligned} Q_s = \frac{\rho w d^3 \omega _0}{\mu h^2}. \end{aligned}$$
(9)

For a silicon beam with \((w,h,l) = (4,10,300)\;\upmu \mathrm{m}\), where the gap d corresponds to 6 \(\upmu \mathrm{m}\), one has \(Q_d = 529\) and \(Q_s = 2740\). From Eq. (7), the combined quality factor Q is then 457. For additional effects comprising, for instance, the thermoelastic mechanism, we refer the reader to Verbridge et al. (2006), Naeli and Brand (2009), and Younis (2010). Note that the anchors in the clamped–clamped beams can also have a significant effect on the dynamics of the resonator (Lee et al. 2008; Naeli and Brand 2009).
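As a check on the squeezed-film value quoted above, Eq. (9) can be evaluated directly. The sketch below is illustrative only: the silicon density (2330 kg/m\(^3\)) and the dynamic viscosity of air (\(1.8\times 10^{-5}\) Pa s) are assumed values not given in the text, \(\omega _0\) is taken as \(2\pi \times 389\) kHz from Sect. 3.2.1, and the function name `squeezed_film_q` is hypothetical. The drag term of Eq. (8) is not evaluated here.

```python
import math

RHO_SI = 2330.0  # kg/m^3; assumed mass density of silicon (not given in the text)
MU_AIR = 1.8e-5  # Pa s; assumed dynamic viscosity of air at room temperature


def squeezed_film_q(width, thickness, gap, omega0, rho=RHO_SI, mu=MU_AIR):
    """Eq. (9): quality factor associated with squeezed-film damping."""
    return rho * width * gap**3 * omega0 / (mu * thickness**2)


omega0 = 2.0 * math.pi * 389e3  # rad/s, from the 389 kHz example of Sect. 3.2.1
q_s = squeezed_film_q(width=4e-6, thickness=10e-6, gap=6e-6, omega0=omega0)
print(f"Q_s = {q_s:.0f}")
```

The cubic dependence on the gap d means that halving the electrode spacing reduces \(Q_s\) by a factor of eight, which is why squeezed-film damping becomes dominant for closely spaced drive electrodes.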

4 Reservoir Computing in a MEMS

As highlighted in the previous sections, MEMS technology can reliably produce small and energy-efficient devices exhibiting rich dynamical behaviors often not accessible for mechanical structures at larger scales. Exploiting these dynamics for neuromorphic hardware thus seems a promising alternative to computing using conventional electronics, which keep struggling with power dissipation issues. As a result, the following section explores the use of a micromachined clamped–clamped silicon beam as the single dynamical node of a delay-coupled reservoir computer trained to perform simple classification tasks.

4.1 The MEMS Nonlinear Node

Construction of a hardware reservoir computer (RC) begins with the choice of a suitable physical node, which should have a nonlinear activation function in order to be able to model nonlinear processes. The stiffening Duffing behavior of a clamped–clamped silicon beam oscillating at large amplitudes can provide the nonlinearity in MEMS RC. An order of magnitude for the minimum oscillation amplitude to obtain sufficient nonlinear behavior is the amplitude \(r_c\) associated with the onset of bistability (Lifshitz and Cross 2010):

$$\begin{aligned} r_c = \left( \frac{4}{3} \right) ^{3/4} \sqrt{\frac{\omega _0^2}{Q\beta }} . \end{aligned}$$
(10)

For the beam studied in this section, the onset of the nonlinearity is around \(r_c = 150\) nm.

Fig. 10

SEM image of the MEMS

The beam shown in Fig. 10 was microfabricated on a (100) silicon on insulator (SOI) substrate with a nominal resistivity of (0.003 ± 0.002) \( \ \Omega \) m and a sacrificial oxide thickness of 1.5 \(\upmu \)m. It has a length of \(l=500\ \upmu \)m, a width of \(w = 10\,\upmu \)m, corresponding to the SOI device layer thickness, and an in-plane thickness (normal to its displacement) of \(h = 4 \ \upmu \)m. The device was wirebonded to a chip carrier and placed in a Faraday cage for the experiments, but was otherwise unpackaged. This lack of proper packaging makes the beams sensitive to dust in their environment, which has the undesirable effect of modifying their resonant frequency over time. For instance, one beam has had its natural frequency lowered by as much as 20% over the course of one year. The experimental quality factor of the MEMS was 167 ± 2. This value, which is independent of the oscillation amplitude, is comparable to the analytical value of 204 obtained using Eqs. (7)–(9) for the nominal dimensions of the beam. Fabrication tolerances could account for this gap between the two values, as could other dissipation mechanisms such as anchor loss and the proximity of the substrate. In the linear regime, the beam naturally oscillated at \(f_0 = 155\) kHz, compared to a calculated value of 144.2 kHz (Eq. (4)), although the maximum of the frequency response shifted to higher frequencies as the drive amplitude was increased, a behavior which corresponds to a stiffening Duffing oscillator. The Duffing parameter for the beam shown in Fig. 10 was estimated at \(1.9 \times 10^{23}\) Hz\(^2\)m\(^{-2}\) by fitting Eq. (3) to experimental data of the beam’s response. Equation (5) yields a comparable value of \(1.1 \times 10^{23}\) Hz\(^2\)m\(^{-2}\).

Fig. 11

Experimental setup for the reservoir computer. The masking procedure as well as the delayed feedback loop are implemented in the digital domain, while the post-processing (extraction of the displacement amplitude) is carried out through custom analog electronics

Among the many possible transduction methods presented in Sect. 2.2, an appropriate choice for RC MEMS is to drive the beam electrostatically and sense its displacement piezoresistively. By applying a voltage signal of the form \(V_d(t) = V_0 \cos \left( 2\pi f_d t \right) \) to a 300 \(\upmu \)m long drive electrode placed 6 \(\upmu \)m away from the beam in Fig. 10, a force \(F_d \propto V_d^2(t) = \frac{V_0^2}{2}\left( 1 + \cos \left( 4\pi f_d t \right) \right) \) can be applied between the beam and the fixed electrode, such that vibrations of the beam are excited at twice the input voltage frequency \(f_d\). The piezoresistive transduction of the beam motion to an electrical signal, carried out through 12 \(\upmu \)m long by 1.2 \(\upmu \)m wide piezoresistive strain gages patterned on the device, was chosen for its linearity (to ensure that nonlinear mapping comes exclusively from the beam’s displacement) and sensitivity (transduction coefficient of \({\sim }10^2\) V/m). Two external resistors were combined with the two piezoresistive gages, as illustrated in Fig. 11, to form a Wheatstone bridge, allowing for a differential measurement of the beam’s motion. Compared to a single-ended measurement, the differential configuration has the advantage of reducing the system sensitivity to noise in the DC voltage source biasing the Wheatstone bridge, but more importantly, it also cancels the feedthrough drive signal at the readout. This unwanted signal is symmetrically coupled to both readout points (ends of the piezoresistive gages) through parasitic capacitors (much larger than the \({\sim }10\) fF capacitor formed by the beam and drive electrode) present in the device, while the displacement signal is of opposite sign in each branch (one gage stretches when the other gets compressed), so only the latter gets amplified by the instrumentation amplifier. The differential input stage is followed by a bandpass filter with a bandwidth of 80 kHz to further reduce the noise contribution, and a second amplification stage brings the displacement signal, initially of a few tens of \(\upmu \)V, to a level suitable for the envelope detection stage that follows. This last step produces an appropriate output by extracting the amplitude of the beam displacement signal, yielding a signal-to-noise ratio (SNR) of 35 dB, essentially limited by the Johnson noise generated by the resistor bridge.
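The frequency doubling produced by the quadratic electrostatic force can be verified numerically. The sketch below is only an illustration of the identity \(V_d^2(t) = \frac{V_0^2}{2}\left( 1 + \cos (4\pi f_d t)\right) \): it samples \(V_d^2\) over one drive period and projects it onto cosines at \(f_d\) and \(2f_d\). The amplitude \(V_0 = 2\) V, the number of samples, and the function name `harmonic_amplitude` are arbitrary choices made for this example.

```python
import math


def harmonic_amplitude(samples, harmonic):
    """Cosine amplitude of the given harmonic for samples spanning exactly one period."""
    n = len(samples)
    total = sum(s * math.cos(2.0 * math.pi * harmonic * k / n) for k, s in enumerate(samples))
    return 2.0 * total / n


V0 = 2.0  # V; arbitrary illustrative drive amplitude
N = 64    # samples over one period of the drive voltage

# Quantity proportional to the electrostatic force: the square of the drive voltage.
force = [(V0 * math.cos(2.0 * math.pi * k / N)) ** 2 for k in range(N)]

dc_level = sum(force) / N
at_fd = harmonic_amplitude(force, 1)   # component at the drive frequency f_d
at_2fd = harmonic_amplitude(force, 2)  # component at twice the drive frequency

print(f"DC: {dc_level:.3f}, at f_d: {at_fd:.3f}, at 2 f_d: {at_2fd:.3f}")
```

The component at \(f_d\) vanishes while the DC level and the component at \(2f_d\) both equal \(V_0^2/2\), which is why a drive at \(f_d\) excites the beam at \(2f_d\).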

4.2 Training with Delayed Feedback

The use of a single physical node (Appeltant et al. 2011) greatly simplifies hardware implementation of an RC by drastically reducing the number of structures to couple physically, drive, and measure, with the main drawbacks of requiring a more refined preprocessing scheme and a serialization of the network (and thus of the computation). Indeed, since a single physical node is available, the reservoir consists of a virtual network created by time-division multiplexing of the input signal. While a space-coupled network would possess a multitude of physical nodes (typically \(\sim 10^2\)) coupled in space and use the ring-down time of the oscillators as a form of memory (the behavior of the oscillators depends on their history), a delay-coupled reservoir instead uses this decay time to couple adjacent virtual nodes in the time domain: the input signal is masked by a function of period \(\tau \), which in the simplest case is a function alternating randomly between two values after each time interval \(\theta \). \(\tau \) is an integer multiple of \(\theta \) which defines the number of virtual nodes (\(N=\tau /\theta \)). By choosing \(\theta <T\), where \(T = Q/(\pi f_0) = 330 \ \mu \)s is the decay time of the oscillator, the beam response during a given interval \(\theta \) depends on its response during previous intervals. Since the oscillator decay time T is much shorter than the characteristic time \(\tau \) of the input, the reservoir activation does not persist between two timesteps of the input signal, and the virtual network requires an additional feedback loop in order to have access to some form of memory. A feedback signal is thus added, with a delay \(\tau \) and gain \(\alpha \), to the input for the next timestep. As a result, a given virtual node is driven by a superposition of the (masked) input and of its response to the input from the previous timestep:

$$\begin{aligned} V_d(t) = V_0 \left[ u(t) m(t) + \alpha x(t-\tau ) +1 \right] \cos \left( 2 \pi f_d t \right) , \end{aligned}$$
(11)

where x(t) is the displacement amplitude signal at time t, m(t) is the temporal mask, and u(t) is the input signal.

The nonlinear nature of the beam’s amplitude response (Dion et al. 2018) guided the choice of amplitude modulation of the sinusoidal pump for the RC input. In the case of a Duffing oscillator, the nonlinearity can be tuned to a certain extent by adjusting the drive frequency. The resulting system is schematized in Fig. 11. The input u(t) is first scaled so that it is restricted to the empirically determined range [0.60, 0.75], then it is sampled and held for a time \(\tau \) and multiplied by the temporal binary mask of period \(\tau \) and characteristic time \(\theta \). For the MEMS RC, optimization of the mask with respect to the RC success rate yielded mask values of 0.45 and 0.70. The result, \(u(t)\times m(t)\), is used to modulate the amplitude of the sinusoidal pump (Sect. 4.4.2 discusses adjusting the pump in more detail). Sampling the envelope (ENV) of the displacement signal at a rate \(\theta ^{-1}\) with an analog to digital converter (ADC) yields a vector \(\mathbf {x}(t)\), containing the N virtual node states at timestep t. These values are then combined linearly to produce a scalar output:

$$\begin{aligned} y(t) = \mathbf {w}^T \mathbf {x}(t). \end{aligned}$$
(12)
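
As a rough illustration of the preprocessing chain described above, the following sketch builds the bracketed drive envelope of Eq. 11 for one input timestep and applies the linear readout of Eq. 12. It is a minimal sketch and not the implementation used for the experiments: the function names, the use of NumPy, the random seed, and the zero-valued placeholder node states are assumptions. Only the input range [0.60, 0.75], the mask values 0.45 and 0.70, and the gain \(\alpha = 1.1\) are taken from the text.

```python
import numpy as np

def scale_input(u, lo=0.60, hi=0.75):
    """Linearly map the raw input onto [lo, hi] (range quoted in the text)."""
    u = np.asarray(u, dtype=float)
    span = u.max() - u.min()
    if span == 0:
        return np.full_like(u, (lo + hi) / 2)
    return lo + (hi - lo) * (u - u.min()) / span

def make_mask(n_nodes, values=(0.45, 0.70), seed=0):
    """Binary mask of period tau = n_nodes * theta, one value per virtual node."""
    rng = np.random.default_rng(seed)  # seed is an arbitrary choice
    return rng.choice(values, size=n_nodes)

def drive_envelope(u_k, mask, x_prev, alpha=1.1):
    """Bracketed factor of Eq. 11 for one timestep, one value per interval theta.

    u_k    : scaled input value held during the period tau
    mask   : the N mask values
    x_prev : the N node states measured one period tau earlier
    """
    return u_k * np.asarray(mask) + alpha * np.asarray(x_prev) + 1.0

def readout(w, x):
    """Eq. 12: scalar output as a weighted sum of the N virtual node states."""
    return float(np.dot(w, x))

# Example: envelope for the first timestep, with no feedback available yet.
mask = make_mask(n_nodes=400)
u_scaled = scale_input([-1, 1, 1, -1])
envelope = drive_envelope(u_scaled[0], mask, x_prev=np.zeros(400))
```

In the physical system this envelope multiplies \(V_0 \cos (2\pi f_d t)\), and the node states come from the sampled displacement envelope rather than from a software model.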

The goal of the training phase is to compute the appropriate vector \(\mathbf {w}\) of weights by adjusting them so that the response of the RC to a series of training examples approximates as well as possible a known target \(y'(t)\). If the task for which the RC is trained is to process a signal which changes at every time period \(\tau \), for instance, then a series of M training periods can be presented to the system, each with an input value \(u_k=u(k\tau )\) for \(k=1,\ldots ,M\), resulting in M outputs \(y_k=y(k\tau )\) which can be compared to the desired outputs \(y'_k=y'(k\tau )\) with the mean squared error

$$\begin{aligned} \frac{1}{M} \sum _{k=1}^M (y_k - y_k')^2. \end{aligned}$$
(13)

A similar mean squared error can be defined for the classification of input sequences of different lengths (with y(t) sampled at the end of each input sequence).

The training process is done offline and consists in computing the vector \(\mathbf {w}\) minimizing the mean squared error between y(t) and \(y'(t)\). The result is

$$\begin{aligned} \mathbf {w} = \mathbf {y}' \mathrm {X}^T \left( \mathrm {X} \mathrm {X}^T + \gamma \mathrm {I} \right) ^{-1} , \end{aligned}$$
(14)

where \(\mathbf {y}'\) is the vector of desired outputs and \(\mathrm {X}\) is an \(N \times M\) matrix with each column corresponding to the state \(\mathbf {x}\) of the virtual nodes after one of the inputs \(u_k\) from the training set has been processed. \(\gamma \) is a regularization parameter that increases numerical stability and prevents overfitting. A value of \(\gamma = 10^{-4}\) V\(^2\) proved adequate for both benchmarks investigated below.
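
The regularized least-squares solution of Eq. 14 can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions: the function names are hypothetical, and \(\mathrm {X}\) is taken as an \(N \times M\) array holding one node-state vector per training input, so that \(\mathrm {X}\mathrm {X}^T\) is \(N \times N\) as the dimensions in Eq. 14 require.

```python
import numpy as np

def train_readout(X, y_target, gamma=1e-4):
    """Solve Eq. 14, w = y' X^T (X X^T + gamma I)^-1.

    X        : (N, M) array, one column of N node states per training input
    y_target : (M,) array of desired outputs y'
    gamma    : regularization parameter (1e-4 V^2 in the text)
    """
    X = np.asarray(X, dtype=float)
    y_target = np.asarray(y_target, dtype=float)
    n_nodes = X.shape[0]
    gram = X @ X.T + gamma * np.eye(n_nodes)
    # gram is symmetric, so solving gram w = X y' yields the same w as Eq. 14
    # without forming the inverse explicitly.
    return np.linalg.solve(gram, X @ y_target)

def mean_squared_error(w, X, y_target):
    """Eq. 13 evaluated on the outputs y_k = w^T x_k."""
    y = w @ np.asarray(X, dtype=float)
    return float(np.mean((y - np.asarray(y_target, dtype=float)) ** 2))
```

Using a linear solve instead of an explicit matrix inverse is a numerical convenience chosen for this sketch; both give the same weights up to rounding.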

4.3 Performance Metrics

Following the training phase, it is customary to test the performance of the RC with inputs that were not part of the training set, so that the generalization capability of the RC can be assessed. In order to highlight its universal character, the MEMS RC discussed above was tested on two different benchmarks with the same set of hyperparameters: a network of \(N=400\) virtual nodes sampled every \(\theta =0.1\) ms with a feedback gain \(\alpha = 1.1\) and a beam driven at \(f_d=80.3\) kHz, \(V_0=72.5\) V, with the piezoresistive gages biased at 2.5 V.

4.3.1 Parity Benchmark

The parity benchmark is a conceptually simple task that can be nonlinear and requires memory. As such, it is well suited for a first evaluation of the system’s performance. It consists of computing the parity of \(n \ge 1\) successive input bits after an initial delay \(\delta \ge 0\):

$$\begin{aligned} P_{n,\delta }(t) = \prod _{i=0}^{n-1} u\left( t-(i+\delta )\tau \right) . \end{aligned}$$
(15)

\(P_{1,0}\) is linear and does not require memory, but for \(\delta > 0\) or \(n > 1\), the target depends on the history of the input signal, so the system must be able to store a transformed version of the input for a finite time. In this chapter, we will only report results with no delay, i.e., for \(P_n = P_{n,0}\). For this task, the input u(t) is a binary sequence randomly alternating between -1 and +1 at each time \(t=k\tau \). It is thus first shifted and scaled to [0.60, 0.75] before being fed to the RC, as discussed in Sect. 4.2.
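
For a sequence of \(\pm 1\) bits, the product in Eq. 15 equals \(+1\) when the window contains an even number of \(-1\) bits and \(-1\) otherwise. The sketch below generates the target sequence for one input value per period \(\tau \). The function name and the convention of returning `None` while the required history is incomplete are assumptions made for illustration.

```python
def parity_target(u, n, delta=0):
    """Eq. 15 for a sequence u of -1/+1 values, one value per period tau.

    Returns a list with None where the required history is not yet available.
    """
    targets = []
    for k in range(len(u)):
        oldest = k - (n - 1 + delta)  # index of the oldest bit in the window
        if oldest < 0:
            targets.append(None)
            continue
        p = 1
        for i in range(n):
            p *= u[k - (i + delta)]
        targets.append(p)
    return targets
```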

Figure 12a shows the RC output for this task overlaid on the target after a training phase of 2000 samples. The performance is quantified by comparing the signs of the prediction and of the target over all 2000 samples of the testing set. The accuracy of the classification is the same for \(P_2\) to \(P_4\) since the raw RC output is thresholded, but the trace is noisier for \(P_4\). By increasing n or \(\delta \), the complexity of the task is increased and this translates to a decrease in the prediction success rate. This performance drop can be offset to some degree by increasing the number of nodes or the number of training samples, as evidenced by Fig. 14, or by a finer tuning of the nonlinearity (see Fig. 15). For the network of \(N=400\) nodes used to produce Fig. 12a, the mask period is \(\tau = N\theta = 40\) ms, such that the bitstream is processed at a rate of 25 bits/s. On the other hand, a network of 10 virtual nodes is sufficient to process \(P_2\) with less than 1% error, which leads to a classification rate of \(10^3\) bits/s. This means that for a given physical node with immutable characteristics, processing speed can be optimized for a specific task by adjusting the number of virtual nodes.
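
The throughput figures quoted above follow directly from \(\tau = N\theta \), since one input sample is processed per mask period. A one-line helper (hypothetical name, shown only to make the arithmetic explicit) reproduces both values for \(\theta = 0.1\) ms:

```python
def bit_rate(n_nodes, theta_s):
    """Input samples processed per second: 1 / tau, with tau = N * theta."""
    return 1.0 / (n_nodes * theta_s)

rate_400 = bit_rate(400, 1e-4)  # tau = 40 ms
rate_10 = bit_rate(10, 1e-4)    # tau = 1 ms
```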

Fig. 12

a Performance for the parity benchmark. After a training phase (green), the RC response (blue) to the input (black) is compared to the target (red). b Confusion matrix for the spoken digit classification task. Colors indicate the probability that an input digit (columns) is assigned to a given class (rows) by the RC

4.3.2 Spoken Digit Classification

With the same set of hyperparameters, the MEMS RC was also trained to classify the digits zero to nine spoken by sixteen different speakers, male and female, using the TI-46 dataset (Lieberman 1993). Since sounds have an inherent temporal dependence, this task seems well adapted to the RC approach, as evidenced by its predominance as an RC benchmark (Appeltant et al. 2011; Brunner et al. 2013; Coulombe et al. 2017; Dion et al. 2018; Duport et al. 2012; Larger et al. 2012, 2017; Martinenghi et al. 2012; Paquot et al. 2012; Soriano et al. 2015; Torrejon et al. 2017; Verstraeten 2005). Whether it is obtained through RNNs or by using other means, state-of-the-art performance for this task is usually accompanied by spectral preprocessing to model the human ear, such as the Mel-Frequency Cepstral Coefficients (MFCC) or the Lyon Passive Ear model (Lyon 1982). For this study, the preprocessing was kept minimal in anticipation of eventually interfacing the MEMS RC directly with sound pressure, as opposed to feeding samples from recorded waveforms. Each randomly selected utterance is first lowpass filtered at 30 Hz and resampled at 60 samples/s, then it is normalized and scaled so that the complete sequence of waveforms is restricted to the range [0.60,0.75]. In order to save processing time, silences before and after the utterance are cropped, which results in an average of \(\bar{\eta }= 29\) samples per word. After being masked as described in Sect. 4.2, those samples are then fed sequentially to the reservoir without any pause between them. The output of a given virtual node for a given utterance is then the mean of its responses over the whole utterance (i.e., \(x_i = (1/\eta )\sum _{j=0}^{\eta -1} x\left( i\theta + j\tau \right) \) for node i). Ten output layers are trained for the same reservoir activation: one boolean classifier is used for each individual digit. 
Since there are ten different possible classes for this task, the length M of the training sequence was increased to 6000 utterances so that the RC is trained on a sufficient number of examples for each digit.
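
The averaging of node states over an utterance and the ten per-digit output layers can be sketched as follows. This is an illustration under assumptions, not the procedure used for the reported results: the function names are hypothetical, the boolean targets are encoded as \(+1\) and \(-1\), each output layer is trained with the same ridge regression as Eq. 14, and the final digit is chosen by taking the largest of the ten outputs.

```python
import numpy as np

def utterance_features(states):
    """Average each virtual node over the eta samples of one utterance.

    states : (eta, N) array, row j holding the N node states read during
             the j-th period tau of the utterance.
    """
    return np.asarray(states, dtype=float).mean(axis=0)

def train_one_vs_all(features, labels, n_classes=10, gamma=1e-4):
    """Train one ridge readout per digit on targets of +1 (digit) or -1 (other).

    features : (M, N) array, one averaged state vector per utterance
    labels   : (M,) integer digit labels
    Returns an (n_classes, N) weight matrix, one row per digit.
    """
    F = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    targets = np.where(labels[:, None] == np.arange(n_classes), 1.0, -1.0)
    gram = F.T @ F + gamma * np.eye(F.shape[1])
    return np.linalg.solve(gram, F.T @ targets).T

def classify(weights, feature):
    """Pick the digit whose readout gives the largest output.

    This winner-take-all rule is an assumption; the text only states that
    one boolean classifier is trained per digit.
    """
    return int(np.argmax(weights @ np.asarray(feature, dtype=float)))
```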

Figure 12b shows that the confusion matrix for this task is almost diagonal, although some phonetically similar digits such as “1” and “9” or “4” and “5” are more often misclassified by the RC. The global success rate is (70 ± 2) %, and slightly better performance (Dion et al. 2018) could be obtained by optimizing the hyperparameters with respect to this particular task. Despite the fact that the training procedure lasts a few hours, the trained 400 node RC processes words at a rate of 1 per second, fast enough so that one could envision using such a system for real-time speech processing.

4.4 Hyperparameter Optimization

Finding optimal parameters for successful reservoir computing can be a tedious task, as RC performance typically depends on the appropriate combination of multiple hyperparameter values. Moreover, these parameters cannot be tuned independently: modifying one of them can shift the optimal value of other parameters. Choosing a random set of parameters will most often result in no computational success at all, and the accuracy landscape may display multiple local minima, making gradient descent optimization impractical. A gridsearch may seem like a foolproof optimization method, but without any indication of the location of the success region, the search space is vast and of high dimensionality. Besides, the region of non-zero success can be limited to a rather narrow region, as will become apparent later in this section, so that if the gridsearch is too coarse, the optimal parameter set can be missed altogether. Expert knowledge is thus necessary to set bounds for the different parameters of the gridsearch in a principled way or to perform a manual search in order to find a starting point with non-trivial success for optimization. To circumvent this obstacle, different methods are investigated in the RC literature (Bala et al. 2018), such as using genetic algorithms (Dale et al. 2016; Ferreira and Ludermir 2009, 2011), particle swarm optimization (Zhou 2010; Sergio and Ludermir 2012; Jubayer Alam Rabin et al. 2013; Salah et al. 2017), differential evolution (Zhang et al. 2013; Rigamonti et al. 2018; Wang et al. 2018), or hybrid variants thereof which combine different metaheuristics.

Temporal traces of reservoir activation such as those presented in Fig. 13 can also guide the initial optimization. By detuning a single parameter such as the drive frequency \(f_d\), the feedback gain \(\alpha \), or the virtual node duration \(\theta \), the traces for healthy and unhealthy reservoirs can be compared and a few empirical criteria for successful RC can be deduced. Such criteria include the dynamic range and saturation of the response and its correlation with the input signal.

Fig. 13

Normalized drive signal envelope (red) and beam displacement amplitude (black) for a well-performing RC (top panel) and for various detuned configurations (lower panels). Note that the beam response is oversampled compared to normal operation where it is only sampled when the mask value is updated

The optimization of hyperparameters shown below was performed using the parity benchmark, as the total training and testing time is much lower than the spoken word recognition benchmark: a training example for parity is composed of a single sample, while a spoken digit utterance contains tens of samples to feed to the RC. Nevertheless, the resulting parameter set can be used as a starting point for optimization with respect to a different task.

4.4.1 Number of Training and Testing Samples, Reservoir Size

The number of examples used for testing is one parameter that can be chosen in a principled way. Its only effect is on the uncertainty of the performance measurement. Considering that for all the benchmarks investigated here the testing phase is a series of Bernoulli trials (i.e., is the sample correctly classified?), the precision of the obtained success rates can be quantified using a binomial proportion confidence interval, such as the Agresti–Coull interval (Agresti and Coull 1998). In this specific case, the measurement error decreases as the number of trials and success rate are increased. A longer testing phase thus reduces the measurement uncertainty, but it also increases the acquisition time, making the results more susceptible to the effects of parameter drifts in the MEMS. This is where cross-validation becomes relevant: the training data can be reused for testing (and the testing data for training), which improves the measurement precision without increasing the acquisition time. A testing set of 2000 samples was deemed sufficient for the results presented here, as it is a good compromise between acquisition speed (\(\sim \)3 min for one complete training and testing experiment) and measurement uncertainty (<2%).
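
The Agresti–Coull interval mentioned above is simple to evaluate. The sketch below uses the standard form of the interval with \(z = 1.96\) for an approximately 95% confidence level; the confidence level and the function name are assumptions, since the text does not state which level was used.

```python
import math

def agresti_coull(successes, trials, z=1.96):
    """Agresti-Coull interval for a binomial proportion (z = 1.96 for ~95 %)."""
    n_adj = trials + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    half_width = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)

# Example: 1800 correct classifications out of 2000 test samples.
low, high = agresti_coull(1800, 2000)
```

Under these assumptions, a 90% success rate measured on 2000 samples gives a half-width of about 1.3 percentage points, and the interval narrows further as the success rate approaches 100%.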

Figure 14 shows the \(P_3\) to \(P_6\) success rate for different pairs of (N, M) values. For this task, the minimum length of the training set (M) ensuring optimal performance increases with the number of virtual nodes (N) in the explored region, and the number of nodes needs to be increased as the complexity of the task is increased from \(P_3\) to \(P_6\) in order to keep a constant success rate. A narrow region, centered around \(M=N\), seems to prohibit adequate results. This could be due to overfitting, since this region does not respect the rule of thumb stating that N should not exceed M/10 to M/2 (Jaeger 2002). Training another output layer on the same data with \(\gamma = 10^{-2}\) V\(^2\) (to reduce overfitting by increasing regularization) increases performance for \(M=N\) but considerably degrades performance otherwise. Good performance is also possible in a region where \(N > M\), although unless the training set is of limited size, it is advisable to choose \(N < M\) as the speed and energy cost of increasing the number of nodes is generally higher than using a longer training phase.

Fig. 14

Interpolated success rate in the number of nodes (N)—number of training samples (M) plane for \(P_3\) to \(P_6\) (left to right)

4.4.2 Tuning the Nonlinearity

Figure 15 shows that good performance for \(P_3\) to \(P_6\) is limited to a rather narrow, tilted band in the drive frequency—drive amplitude plane. The more nonlinear task \(P_6\) requires higher drive amplitudes for optimal success, corresponding to higher beam oscillation amplitudes and thus a more pronounced impact of the cubic term in the Duffing equation (Eq. 1). Figure 13 shows the effect of operating the system with the wrong combination of drive amplitude and frequency. At 500 Hz below the proper operating frequency, the dynamic range of the readout signal is reduced and its shape more closely resembles the input due to the more linear behavior of the beam. Such detuning can occur for example during the MEMS life if a large enough foreign particle gets attached to (or detached from) the beam, shifting its natural frequency.

Fig. 15

Success rate in the drive frequency—drive amplitude plane for the \(P_3\) to \(P_6\) tasks. Note that this figure was produced earlier than the other figures when the oscillator had a slightly higher natural frequency. This slow drift of \(f_0\) merely translates the features in this figure horizontally

4.4.3 Feedback Strength

By plotting the success rate for the parity benchmark against the feedback strength \(\alpha \) as in Fig. 16a, it can be seen that there is an intermediate value of \(\alpha \) providing optimal results for all the investigated tasks. Below this value of \(\alpha \simeq 1.1\), the system has less memory and success eventually vanishes at \(\alpha =0\). For values of \(\alpha \) which are too large, the RC may not exhibit the fading memory property (Jaeger 2001) (or it may fade too slowly), and the system also tends to saturate (see bottom panel of Fig. 13), negatively impacting performance.

Fig. 16

The success rate for the parity function strongly depends on both the feedback gain \(\alpha \) (a) and the virtual node duration \(\theta \) (b)

4.4.4 Coupling Strength

Figure 13 shows the effect of increasing or decreasing \(\theta \) on the dynamics of the system. For \(\theta =0.05 \ \text {ms} \ll T\), the dynamic range is limited: the beam cannot respond quickly enough to the rapidly alternating low and high mask bits, and only behaves appropriately when there is a succession of identical mask values. This translates into a lower correlation coefficient of 0.05 between the input and output amplitudes, compared to a correlation coefficient of 0.44 for the optimized RC. For the case \(\theta =0.5 \ \text {ms} \gg T\), the response saturates as soon as there are two or more successive identical mask values, such that the readout (points sampled at the end of each period \(\theta \)) essentially only visits two points of the transfer function (low level and high level saturation). The correlation coefficient is 0.60 and feedback has little effect, as the signal is less dynamical and more closely tied to the input due to the weak coupling between adjacent virtual nodes. The weak coupling regime (\(\theta \lesssim T\)), where a given virtual node state is only dependent on the state of its neighbor, is analogous to a linear chain of space-coupled oscillators.
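
A correlation coefficient of the kind quoted above can be computed from the sampled drive envelope and displacement amplitude. The sketch below assumes the Pearson coefficient, which the text does not specify, and uses a hypothetical function name.

```python
import numpy as np

def envelope_correlation(drive_env, response_env):
    """Pearson correlation between drive envelope and response amplitude samples."""
    a = np.asarray(drive_env, dtype=float)
    b = np.asarray(response_env, dtype=float)
    return float(np.corrcoef(a, b)[0, 1])
```

Values near 1 indicate a response that closely follows the drive envelope, while values near 0 indicate a response that is largely decoupled from it.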

Figure 16b shows the success rate for \(P_3\) to \(P_6\) as a function of \(\theta \), which essentially controls the connectivity matrix of the reservoir. While using a value of \(\theta =0.2\) ms gives slightly better results, a virtual node duration of \(\theta = 0.1\) ms \(\simeq T/3\) was used for the results presented here as the computation is two times faster (\(\sim \)2 min). For higher values of \(\theta \), the longer acquisition time increases the effect of medium-term drifts in the system on the results: optimal weights may evolve over time, but our offline training method does not allow adapting them during the acquisition.

5 Conclusion

MEMS devices form the basis of many of today’s sensor technologies and are expected to play an important role in the development of new technologies related to artificial intelligence and machine learning, in the context of producing “big data” from autonomous systems (e.g., self-driving cars) or distributed sensor systems (e.g., the Internet of Things). We have presented in this chapter key concepts for using MEMS to construct neuromorphic computing devices, as well as key experimental results showing that reservoir computing can be implemented efficiently and robustly in MEMS. As MEMS can be small, energy-efficient, and function at high speeds, they could constitute a very attractive hardware substrate for unconventional AI computing. When used as “pure” computing devices (with an analog electrical input and an analog electrical output), they could implement AI functionalities with performance levels exceeding those of conventional electronics (Coulombe et al. 2017). Perhaps more interestingly, our MEMS devices can implement both neuromorphic computing and sensing functionalities in the same device. This is a fairly new idea, which could bring significant gains in system size and energy consumption through integration: instead of building mechatronic systems with a discrete sensor coupled to separate signal processing electronics, one could envision building a trainable sensor which exploits the nonlinearity of its sensing mechanism to implement computing functions on the measured data. We are developing this idea in MEMS, but similar ideas might also be relevant for optical sensors and RC systems, for instance.

Deep learning, as the most productive line of research for artificial intelligence today, relies on training complex systems (artificial neural networks) using large amounts of data. The separation between data generation and data processing has traditionally been very clear in such deep neural networks. One might however consider the example of biological brains, which actually integrate the sensing and computing functionalities in some sensory neurons (Pitkow 2015), perhaps as a strategy to increase efficiency, robustness, or adaptiveness. Nature might have discovered long ago that such integration was an effective way to build faster, smaller, and more energy-efficient intelligent biological systems, which are able to respond efficiently to sensory inputs collected from their environment (i.e., systems which are sophisticated integrated sensing and computing devices).