1 Introduction

Most data communications over networks are transmitted in a serial manner, that is, data bits are transmitted one at a time through a transmission medium such as a copper cable, an optical cable or a wireless path. An NRZ (non-return to zero) signal, or Pulse-Amplitude Modulation 2-Level (PAM2), is a coding scheme that uses two voltage levels to represent logic 0 and logic 1. Another modulation technique, PAM4, uses four voltage levels to represent the four combinations of two logic bits: 11, 10, 01, and 00.

The speed of serial transmission is related to the bit time of the serial data. For a 1 GS/s PAM4 signal, the duration of transmission at each level for 2 bits is 500 ps based on Eq. 1 [1].

$$\begin{aligned} \text {Bit Rate}= \frac{1}{{\text {Bit Duration}}} \end{aligned}$$
(1)

Compared to an NRZ signal, a PAM4 signal has the advantage of half the Nyquist frequency for the same bit rate, or double the throughput for the same Baud rate, as seen in Eq. 2 [1]. Another way to look at it is that it achieves higher resolution using the same sampling rate, as shown in Fig. 1.

$$\begin{aligned} \text {Baud Rate}= \frac{\text {Bit Rate}}{\text {bits per symbol}} \end{aligned}$$
(2)
Fig. 1 PAM4 and NRZ coding
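The relationship between Eqs. 1 and 2 can be checked numerically. The sketch below is only an illustration: it assumes that the 1 GS/s figure denotes the symbol (Baud) rate, and the function names are hypothetical. Under that assumption, a PAM4 signal carries 2 bits per symbol, which is one reading consistent with the 500 ps bit duration quoted above.

```python
def bit_rate(baud_rate_hz: float, bits_per_symbol: int) -> float:
    """Eq. 2 rearranged: bit rate = Baud rate x bits per symbol."""
    return baud_rate_hz * bits_per_symbol


def bit_duration(bit_rate_bps: float) -> float:
    """Eq. 1 rearranged: bit duration = 1 / bit rate."""
    return 1.0 / bit_rate_bps


BAUD = 1e9  # assumed symbol rate of 1 GBd for both coding schemes

nrz_rate = bit_rate(BAUD, 1)   # NRZ: 1 bit per symbol
pam4_rate = bit_rate(BAUD, 2)  # PAM4: 2 bits per symbol

print(f"NRZ : {nrz_rate / 1e9:.1f} Gb/s, bit duration {bit_duration(nrz_rate) * 1e12:.0f} ps")
print(f"PAM4: {pam4_rate / 1e9:.1f} Gb/s, bit duration {bit_duration(pam4_rate) * 1e12:.0f} ps")
```

At the same Baud rate the PAM4 throughput is twice that of NRZ, which is the advantage stated in the text.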

Track and hold (T&H) amplifiers are the basis of data converters [2]. Their applications include but are not limited to ATE (Automated Test Equipment), Digital Sampling Oscilloscopes, jitter measurement, Bit Error Rate (BER), Time-Domain Reflectometers (TDR), RF Demodulation Systems, High Speed Peak Detectors, Software Defined Radio and Gigabit Passive Optical Network Applications [3, 4].

There is always a need to design better T&H circuits that meet the required criteria for specific applications. These criteria include, but are not limited to: long hold period, low input offset, large input bandwidth, high sampling rate, good linearity, high accuracy, low droop rate, and high common mode rejection [5].

Section 2 lists the requirements and targeted specifications followed by an overview of the different T&H architectures in Sect. 3. The design process is discussed in Sect. 4 preceding the implementation details in Sect. 5 followed by the results in Sect. 6. A conclusion in Sect. 7 wraps up this work.

2 Requirements and specifications

Our targeted application is a portable Bit Error Rate (BER) tester for high-speed PAM4 signals operating up to 1 GS/s, where the aim is to generate the eye diagram of the signal and deduce several key parameters such as BER and jitter. In doing so, we determine the quality of the transmission medium through which the signal is being sent. Low power consumption is a key target in this design, so that the device can be easily integrated in the scope of the Internet of Things (IoT).

To achieve this, samples must be taken from the cable and analyzed in the digital domain, with an Analog-to-Digital Converter (ADC) acting as a portal between the analog and digital domains. Having a T&H in front of the ADC relaxes its specifications and provides more flexibility in terms of the mode of operation.

The PAM4 signal entering the chip is differential with 4 levels. The rise and fall time of each level determines the frequency of the signal. The bit period is chosen to be 500 ps.

Our aim is to be able to track our input signal, and hold it, at a sampling rate of 200 MSps. The output is then fed to an ADC that quantizes the discrete points and then feeds them to a digital processor that computes the eye diagram.

Table 1 shows six different marketing requirements we aim to achieve.

Table 1 Marketing requirements

Table 2 shows the engineering requirements that we have set and their relation to Table 1. Justification is also presented for each engineering requirement, giving it a concrete base.

Table 2 Engineering requirements

Based on the previous specifications, performance parameters for T&H circuits such as acquisition time, track bandwidth, hold error, droop rate and pedestal error are briefly explained.

The acquisition time is the measure of how much time it takes for the output signal to track the incoming signal. In our case, three different acquisition times were studied: the time to transition from the first level of the PAM4 signal to the second level of the PAM4 signal, the time to transition from the second level to the third level, and the time it takes to transition from the third level to the fourth level. The track bandwidth is simply equal to:

$$\begin{aligned} BW_{track} = \frac{1}{t_{acq}} \end{aligned}$$
(3)

The acquisition time is directly proportional to the value of the hold capacitor, so to decrease the acquisition time and increase the track bandwidth, a smaller capacitor must be used; however, this comes at the expense of worsening the hold error and the droop rate.

The hold error is a measure of the decrease in the voltage across the hold capacitor during the hold mode. When holding, the switches are OFF and act as a finite-valued resistance. This resistance draws current from the hold capacitor, and thus a voltage drop occurs.

As for the droop rate, it is calculated in the following manner:

$$\begin{aligned} \text {Droop Rate}= \frac{\text {Hold Error}}{\text {Hold Period}} \end{aligned}$$
(4)

The droop rate is a measure of the rate at which the output voltage changes due to leakage from the hold capacitor. It is proportional to the hold error and inversely proportional to the hold period. In our design, the droop rate is calculated by dividing the hold error by the hold period, which is 2.5 ns.
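As a small numerical illustration of Eq. 4, the sketch below derives the 2.5 ns hold period from the 200 MSps sampling rate and then computes a droop rate. The 50 % hold fraction is an assumption chosen because it reproduces the 2.5 ns figure, and the 1 mV hold error is a hypothetical placeholder, not a simulated result.

```python
def hold_period(sample_rate_hz: float, hold_fraction: float = 0.5) -> float:
    """Time spent in hold mode per sampling period (hold_fraction is assumed)."""
    return hold_fraction / sample_rate_hz


def droop_rate(hold_error_v: float, hold_period_s: float) -> float:
    """Eq. 4: droop rate = hold error / hold period, in V/s."""
    return hold_error_v / hold_period_s


T_HOLD = hold_period(200e6)   # 200 MSps with an assumed 50 % duty cycle
EXAMPLE_HOLD_ERROR = 1e-3     # hypothetical 1 mV hold error

rate = droop_rate(EXAMPLE_HOLD_ERROR, T_HOLD)
print(f"hold period = {T_HOLD * 1e9:.2f} ns")
print(f"droop rate  = {rate * 1e-6:.2f} mV/ns")  # V/s -> mV/ns is a factor of 1e-6
```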

The pedestal error is the voltage difference in the output signal between the moment it exits the track mode and the moment it enters the hold mode. This error is caused by the transfer of charges from the switches to the hold capacitor. To decrease this error, a large capacitor can be used, but that comes at the expense of an increase in acquisition time. To circumvent this issue, we use the minimum gate length in order to reduce the charges released by the transistors. Additionally, drain-source-connected transistors are used to absorb some of the excess charges.

3 T&H architectures

Different T&H architectures have been developed over time. These architectures perform the same operations: tracking and then holding a signal. However, some architectures are able to do so more accurately or faster than others. On the other hand, higher-performance architectures are typically more complicated and require additional circuitry.

The open loop T&H circuit, shown in Fig. 2, is a good option for high-speed applications since it does not employ feedback. However, it suffers from poor accuracy when compared to other architectures. The closed loop architecture is more accurate than the open loop architecture since it employs a feedback scheme, as shown in Fig. 3. However, this architecture is slower than the open loop architecture. Figure 4 shows the closed loop architecture with integrator output, where the addition of the capacitor connected to virtual ground keeps the charge transfer encountered during the hold stage constant, and the addition of switches Q1 and Q2 greatly improves the slew time.

Fig. 2 Open loop architecture

Fig. 3 Closed loop architecture

Fig. 4 Improved closed loop architecture

Another architecture, the current-multiplexed architecture shown in Fig. 5 and proposed by Texas Instruments, provides a track bandwidth comparable to that of the open loop configuration, an accuracy comparable to that of the closed loop configuration, and charge injection cancellation [5].

Fig. 5 Current-multiplexed architecture

The base collector diode architecture offers a bandwidth even wider than that of the diode bridge T&H architecture. It also offers good linearity, a large dynamic range and good stability [6].

Another architecture is the switched-emitter follower, shown in Fig. 6, which operates with a wide bandwidth and has good linearity and a large dynamic range. Its shortcoming, however, lies in its stability issues [6, 7].

Fig. 6 Switched-emitter-follower architecture

Figure 7 shows the switched-capacitor-based architecture. NMOS-based switches are used to control the transfer of charge in and out of the hold capacitor; these switches are controlled by two non-overlapping clocks that determine the sampling rate. This architecture is widely used because of its versatility and its ability to be modified to counteract various issues such as charge injection, clock feedthrough and speed requirements. Our choice to go with this architecture is based on its ability to fulfill our requirements, especially area and power consumption.

Fig. 7 Switched-capacitor architecture

Table 3 presents a summary of the previously discussed architectures and other prevalent architectures, listing key aspects of each.

Table 3 Summary of T&H architectures

4 Design

Based on the requirements set before regarding low power consumption and a large input signal range, we chose to proceed with the open loop architecture discussed previously. The essence of this topology lies in the switches implemented as NMOS transistors. The sample switch and the hold capacitor are the building blocks of the track and hold circuit. Assuming a signal-to-noise ratio (SNR) greater than 50 dB, the kT/C noise contributed by the capacitor determines the capacitor value based on Eq. 5 [8]:

$$\begin{aligned} SNR = 10 \log \left( \frac{V_{in,rms}^{2}}{kT/C_{hold}}\right) \end{aligned}$$
(5)

where \(k=1.38 \times 10^{-23} \;\text {J} / \text {K}\) is Boltzmann’s constant and \(T=300\;\text {K}\) is the absolute temperature. Assuming \(V_{IN}\) = 1.5 V and a 60 dB SNR, the value of \(C_{hold}\) is found to be 15 fF.
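The capacitor sizing from Eq. 5 can be reproduced with a few lines of Python. The sketch below rearranges Eq. 5 for \(C_{hold}\); it assumes that the 1.5 V input is the peak-to-peak amplitude of a sinusoid, so that \(V_{in,rms} = 1.5/(2\sqrt{2})\) V. That interpretation is an assumption made here because it lands close to the quoted 15 fF, and the function name is hypothetical.

```python
import math

K_BOLTZMANN = 1.38e-23  # J/K, value used in the text


def hold_capacitance(snr_db: float, v_in_rms: float, temperature_k: float = 300.0) -> float:
    """Eq. 5 solved for C_hold: C = kT * 10^(SNR/10) / V_rms^2."""
    return K_BOLTZMANN * temperature_k * 10 ** (snr_db / 10) / v_in_rms ** 2


V_PP = 1.5                         # assumed peak-to-peak input swing in volts
v_rms = V_PP / (2 * math.sqrt(2))  # RMS value of a sinusoid with that swing

c_hold = hold_capacitance(60.0, v_rms)
print(f"C_hold = {c_hold * 1e15:.1f} fF")
```

Under these assumptions the result is about 14.7 fF, which rounds to the 15 fF used in the design.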

When the gate voltage changes in switching from track to hold mode, an instantaneous drop in the voltage across the hold capacitor occurs, which causes a pedestal error. In addition, the inversion charge \(Q_{gate}\) that forms the conductive layer of the MOS switch flows back into the signal source and into the hold capacitor, as described in Eq. 6.

$$\begin{aligned} V_{pedestal}=\frac{C_{ped}\Delta V_{gate}+Q_{gate}/2}{C_{hold}} \end{aligned}$$
(6)

\(C_{ped}\) is the gate overlap capacitance and is proportional to the transistor width W. The charge injection introduces an amplification at the moment of sampling; a closer look at the signal-dependent part of the output voltage shows a dependency on the length and width of the switch in Eq. 7.

$$\begin{aligned} V_{hold}&=V_{in}+\frac{Q_{gate}(V_{in})}{2C_{hold}}+V_{DC}\nonumber \\ {}&= V_{in}\left( 1+\frac{WLC_{ox}}{2C_{hold}}\right) +V_{DC} \end{aligned}$$
(7)

Based on the previous two equations, it is desirable to set the length of the switch to the smallest value allowed by the process and to keep the width at a reasonable size; this reduces the pedestal step to the minimum achievable while keeping the switch fast enough. In a transistor with a short gate length, the channel contains much less charge, and therefore this amplification is less of an issue.
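The effect of gate length on the pedestal step in Eq. 6 can be illustrated with a short sketch. All device values below are placeholders and are not taken from the process design kit; the channel charge is approximated as \(WLC_{ox}V_{ov}\), a first-order textbook expression assumed here for illustration, and the function names are hypothetical.

```python
def channel_charge(width_m, length_m, c_ox_f_per_m2, v_overdrive_v):
    """First-order inversion charge W * L * C_ox * V_ov (assumed approximation)."""
    return width_m * length_m * c_ox_f_per_m2 * v_overdrive_v


def pedestal_voltage(c_ped_f, delta_v_gate_v, q_gate_c, c_hold_f):
    """Eq. 6: overlap-capacitance feedthrough plus half of the channel charge."""
    return (c_ped_f * delta_v_gate_v + q_gate_c / 2.0) / c_hold_f


# Placeholder values for illustration only, NOT process data.
C_OX = 0.02           # gate oxide capacitance per unit area, F/m^2
V_OV = 0.3            # switch overdrive voltage, V
C_PED = 0.05e-15      # gate overlap capacitance, F
DELTA_V_GATE = 1.5    # gate voltage step at turn-off, V
C_HOLD = 15e-15       # hold capacitor, F
WIDTH = 225e-9        # corresponds to W/L = 5 at L = 45 nm

# Keep the width fixed and compare the minimum length with twice that length.
for length in (45e-9, 90e-9):
    q_gate = channel_charge(WIDTH, length, C_OX, V_OV)
    v_ped = pedestal_voltage(C_PED, DELTA_V_GATE, q_gate, C_HOLD)
    print(f"L = {length * 1e9:.0f} nm -> pedestal = {v_ped * 1e3:.2f} mV")
```

With the width held constant, the shorter gate stores less channel charge and therefore produces the smaller pedestal step, in line with the argument above.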

Another consideration is the droop rate: in the hold phase, charge can leak from the hold capacitor and the signal will show a droop, as given by Eq. 8. To keep this value as small as possible, the gate of the switch must be kept at its minimum size so that the leakage is minimized.

$$\begin{aligned} V_{droop}=-\frac{I_{leak}T_{hold}}{C_{hold}} \end{aligned}$$
(8)
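Eq. 8 can be evaluated directly. In the sketch below the 2.5 ns hold time and the 15 fF capacitor come from the text, while the 1 nA leakage current is a hypothetical placeholder, since no leakage value is given here.

```python
def droop_voltage(i_leak_a: float, t_hold_s: float, c_hold_f: float) -> float:
    """Eq. 8: voltage lost on the hold capacitor during the hold phase."""
    return -i_leak_a * t_hold_s / c_hold_f


I_LEAK = 1e-9     # hypothetical leakage current, A
T_HOLD = 2.5e-9   # hold period from the text, s
C_HOLD = 15e-15   # hold capacitor from the text, F

v_droop = droop_voltage(I_LEAK, T_HOLD, C_HOLD)
print(f"V_droop = {v_droop * 1e3:.3f} mV")
```

Doubling the capacitor halves the droop for the same leakage, which is the trade-off against acquisition time discussed in Sect. 2.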

To circumvent the issue of the pedestal step, it is beneficial to compensate it by means of half-sized transistors whose source and drain are connected. These dummy switches are controlled by a clock inverted with respect to that of the sampling switches, where the charge on the hold node is:

$$\begin{aligned} Q_{s}=V_{in}(t=T_{s})(C_{hold}+C_{ox}WL_{dummy}) \end{aligned}$$
(9)

The last term represents the gate capacitance of the dummy switch. After sampling, the gate of the dummy switch is pulled down, releasing most of the charge on \(C_{hold}\).

The voltage on the hold capacitor must then be buffered before any operation can be performed, taking into account the DC-coupled 100 \(\Omega\) differential load off chip. A differential source follower is used as an output buffer with a supply voltage of 1.7 V. The output signal range is now limited to \(V_{DD}-2V_{T}-2V_{drive}\).

Based on the input/output voltage swing, for a source follower the voltage gain is calculated as:

$$\begin{aligned} G=\frac{R_{L}}{R_{L}+\big (\frac{1}{g_{m}}\big )} \end{aligned}$$
(10)

Based on our requirements, we desire an output voltage of 900 mV for a 1.5 V input, which translates into a gain of 0.6 V/V. Setting \(R_L\)=50 \(\Omega\), we deduce the value of the transconductance required. Using Eq. 11 and taking into account the limited area available, we set \((\frac{W}{L})=40\); knowing \(K'_{n}\) from the process design kit, we deduce the drain current drawn from the 1.7 V supply to be around 25 mA.

Moving on to the input and clock buffers, a common-gate amplifier is used because of its low input resistance, into which the 50 \(\Omega\)-matched input and clock signals are fed. We desire \(R_{in}=50\;\Omega\), where \(R_{in}=\frac{1}{g_{m}}\). Based on Eq. 11, for a fixed transconductance the drain current is inversely proportional to the width, so increasing the width of the transistor decreases the current consumption. We found an aspect ratio of 80 to be a suitable compromise between power consumption and area.

$$\begin{aligned} g_{m}=\sqrt{2k'_{n}\frac{W}{L}I_{D}} \end{aligned}$$
(11)
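The buffer sizing steps can be traced numerically by combining Eqs. 10 and 11. The sketch below solves Eq. 10 for the transconductance needed for a gain of 0.6 into 50 \(\Omega\), then solves Eq. 11 for the drain current. The process transconductance parameter is a placeholder picked only so that the result lands near the 25 mA quoted in the text; it is not a process design kit value, and the function names are hypothetical.

```python
def required_gm(gain: float, r_load_ohm: float) -> float:
    """Eq. 10 solved for g_m: g_m = G / (R_L * (1 - G))."""
    return gain / (r_load_ohm * (1.0 - gain))


def drain_current(gm_s: float, k_n: float, w_over_l: float) -> float:
    """Eq. 11 solved for I_D: I_D = g_m^2 / (2 * k'_n * W/L)."""
    return gm_s ** 2 / (2.0 * k_n * w_over_l)


K_N_PLACEHOLDER = 450e-6  # A/V^2, assumed for illustration, NOT from the PDK

# Output buffer: 900 mV out for 1.5 V in, driving a 50 ohm load.
gm_out = required_gm(0.9 / 1.5, 50.0)
i_out = drain_current(gm_out, K_N_PLACEHOLDER, 40.0)
print(f"output buffer: g_m = {gm_out * 1e3:.1f} mS, I_D = {i_out * 1e3:.1f} mA")

# Input buffer: R_in = 1/g_m = 50 ohm, evaluated at two aspect ratios.
gm_in = 1.0 / 50.0
for ratio in (40.0, 80.0):
    i_in = drain_current(gm_in, K_N_PLACEHOLDER, ratio)
    print(f"input buffer: W/L = {ratio:.0f} -> I_D = {i_in * 1e3:.2f} mA")
```

For a fixed transconductance, doubling the aspect ratio halves the required drain current, which is the power versus area trade-off described for the input and clock buffers.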

5 Implementation

In this section, we break down the block design of our proposed integrated circuit into a set of sub-blocks that aim to familiarize the reader with the targeted application for this particular IC.

Fig. 8 Functional decomposition for the proposed T&H circuit

Figure 8 represents the T&H integrated circuit at its most fundamental level: PAM4 differential input signals \(V_{INP}\) and \(V_{INN}\), differential square wave clock inputs \(V_{CLKP}\) and \(V_{CLKN}\), and differential output signals \(V_{OUTP}\) and \(V_{OUTN}\).

The open-loop switched capacitor architecture is chosen due to its high accuracy and compact design area. The top-level circuit design shown in Fig. 9 consists of the T&H core, input buffer, output buffer, and a clock buffer. The PAM4 input signal and clock are terminated to a 50 \(\Omega\) resistor (single ended), and the output signal is matched to a 50 \(\Omega\) resistor connected to ground, representing the load resistance. All signals in the design are differential for better noise immunity.

Fig. 9 Top-level schematic block design

Two supply domains are needed for proper operation: AVDD_3P1, a 3.1 V supply for the clock buffer, and AVDD_1P7, a 1.7 V supply for the rest of the circuit. Additionally, two biasing voltages are fed to the input and clock buffers in order to maintain their mode of operation. Each block is designed and simulated on its own according to the requirements, while taking into account the preceding and following blocks. This approach reduces the work needed for layout and post-layout parasitic extraction: the individual blocks’ layouts are done separately, and any modifications needed to fix the results are made in the layout with regard to the parasitic resistances and capacitances expected from routing.

5.1 Input buffer

Transistors M1, M2 and resistors R1, R2 make up the input buffer. Due to the relatively high voltage of the input, and the small input impedance of 50 \(\Omega\), a common gate amplifier is used. A common gate amplifier has a small input impedance that can be matched to our signal’s impedance by adjusting the sizing of the transistors. An aspect ratio \((W/L) = 80\) with a multiplier of 30 is used for the transistors. Due to the large input swing, we are not able to set the transistor’s input impedance at exactly 50 \(\Omega\), but rather to a midpoint between the first and fourth level resistance seen by the transistor.

Thick oxide transistors are used because of the large voltage drop across them. This voltage drop is controlled by the resistor connected between the 1.7 V supply and the drain of the transistor. This resistor has a value of 500 \(\Omega\) and is chosen to provide adequate legroom and headroom for the output signal, while keeping the transistors in the saturation region.

5.2 Clock buffer

The clock buffer, made up of transistors M9, M10 and resistors R3, R4, operates similarly to the input buffer. It is a common gate amplifier used to provide good matching with the 50 \(\Omega\) resistance seen from the clock signal's point of view, as shown in Fig. 9. An aspect ratio \((W/L) = 80\) with a multiplier of 5 is used for the transistors. The output of this buffer is directly connected to the gates of the transistors in the T&H core. In order to keep the switching transistors in the core ON, the output of the clock buffer must exceed the highest input signal level applied to their sources by at least \(V_{TH}\).

A separate 3.1 V power supply domain is used in the clock buffer in order to maintain the high output voltage needed for the core. A 2.5 k\(\Omega\) resistor is connected between the supply voltage and the drain of the transistor to maintain the operation of the transistors in the saturation region. A relatively high current is expected to pass in the transistors so the size of the transistors is chosen accordingly.

5.3 T&H core

The core of the chip consists of two differential low-threshold-voltage transistors M3, M4 with an aspect ratio \((W/L) = 5\), acting as switches that are driven by the positive clock cycle, as shown in Fig. 9. The voltage \(V_{GS}\) across the transistors is controlled by the voltage coming from the clock and the voltage of the input signal. When the clock is high and the input signal is at its highest level, \(V_{GS}\) becomes small and a low threshold device is needed for proper functioning of the switch [8].

The drains of the switches are then connected to two dummy transistors M5, M6 that are controlled by the inverse of the clock and have half the aspect ratio of the switches. These transistors are there to compensate for the clock feedthrough. When the switches turn OFF, charges are released from both terminals in an equal manner if the switching speed is high. When \(V_{CLKP}\) transitions from the high state to the low state, the drain-source-connected transistors absorb the extra charges released by the switches. Without this compensation, these charges would be transferred to the hold capacitors \(C_{HOLD1}\) and \(C_{HOLD2}\), where they would alter the hold voltage. This phenomenon is known as the pedestal error, and it is due to switch turn-off non-idealities such as clock feedthrough and charge injection.

Our sampling frequency is relatively low: we wish to undersample our signal at a rate of 200 MSps. In order for the switches to transition quickly from the ON to the OFF state, the minimum gate length (45 nm) is used for the transistors. By using the minimum length, the total gate charge is kept to a minimum, so only a small current is needed to flush the extra charge carried from the junction during switching.

When the clock is high, the switches are ON, and the voltage of the input signal charges the capacitors connected at the output node of the core. This charging phenomenon represents the tracking of the input signal. When the switches are OFF, the capacitors hold the value attained before for a certain period of time determined by the sampling rate and duty cycle.

The value of the hold capacitor was initially set to 15 fF based on Eq. 5. After doing the layout and extracting the parasitics, it was found that the routing from the core to the output stage has a parasitic capacitance of 26 fF to ground. Therefore the explicit capacitors were removed, and the parasitics induced by the routing of Metal6 on top of Metal1 are used as the hold capacitance.
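As a rough check of what replacing the 15 fF capacitor with the 26 fF routing parasitic means for the kT/C noise limit, Eq. 5 can be evaluated for both values. The sketch assumes, as an interpretation not stated in the text, that the 1.5 V input is the peak-to-peak swing of a sinusoid; the function name is hypothetical.

```python
import math

K_BOLTZMANN = 1.38e-23  # J/K, value used in the text


def ktc_snr_db(c_hold_f: float, v_in_rms: float, temperature_k: float = 300.0) -> float:
    """Eq. 5: SNR in dB limited by kT/C noise on the hold capacitor."""
    noise_power = K_BOLTZMANN * temperature_k / c_hold_f
    return 10.0 * math.log10(v_in_rms ** 2 / noise_power)


V_RMS = 1.5 / (2 * math.sqrt(2))  # assumed: 1.5 V peak-to-peak sinusoid

snr_design = ktc_snr_db(15e-15, V_RMS)     # capacitor value from Eq. 5
snr_parasitic = ktc_snr_db(26e-15, V_RMS)  # extracted routing parasitic

print(f"15 fF: {snr_design:.1f} dB")
print(f"26 fF: {snr_parasitic:.1f} dB")
print(f"improvement: {snr_parasitic - snr_design:.1f} dB")
```

Under these assumptions the larger parasitic capacitance does not degrade the noise-limited SNR; it improves it by \(10\log_{10}(26/15)\) dB, at the cost of a longer acquisition time.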

5.4 Output buffer

In order to match the output signal to the 50 \(\Omega\) resistors connected to ground, a common drain amplifier (M7, M8) is used. An aspect ratio \((W/L) = 400\) with a multiplier of 30 is used. A common drain amplifier has a low output resistance, which makes it well suited for matching with the 50 \(\Omega\) load. Low threshold devices are used to increase the swing of the output voltage. When \(V_{TH}\) is decreased, the \(V_{GS}\) requirement is relaxed and the transistors will operate as intended while maintaining relatively good output voltage levels, as shown in Fig. 9.

5.5 Layout

The layout of each block is done individually, and any affecting resistance or capacitance originating from the parasitics is taken into account. For sensitive nets such as the connections from the core to the output buffer, routes of minimum width are used in order to decrease the resulting capacitance. On the other hand, for nets that are affected by resistance such as those at the output stage, the routes are widened in order to decrease the resistance across them. The current across these routes is several mA, so an appropriate width must also be chosen so that it handles the amount of current going through.

Dummy transistors are used alongside the main transistors in all blocks in order to minimize the etch effects during fabrication, especially in differential circuits where symmetry is key. If they are not used, a transistor could exhibit different behavior in fingers that are on the edge compared to fingers in the middle. This behavior shows up as a difference in the threshold voltages of matching transistors. Symmetric connections are enforced at all levels of routing so that the differential signaling of the connections is respected.

P-type substrate guard rings are used to insulate the transistors of each stage from each other for better immunity against noise and other fabrication anomalies. The P-type guard rings are then surrounded with N-type guard rings connected to the higher voltage potential AVDD_1P7 and AVDD_3P1 and separated by the minimum distance required to satisfy the Design Rule Check (DRC) constraints.

The total area of the layout is 150 \(\mu\)m by 150 \(\mu\)m. A portion of the final layout can be seen in Fig. 10.

Fig. 10 Final layout of the chip

6 Simulation results

Tables 4 and 5 present the results of the simulations performed at three different temperatures at the typical process corner.

Table 4 Pre-layout simulation results
Table 5 Post-layout results

The pre-layout acquisition bandwidth is significantly greater than the simulated post-layout acquisition bandwidth. The hold error and droop rate are greater post-layout than they are pre-layout. The power consumption and pedestal error are both smaller post-layout. There is a noticeable decrease in the voltage levels of the output signal from pre-layout to post-layout results. This smaller level explains why the post-layout design consumes less power. Across temperature variations, the specifications do not vary by a significant amount for either pre-layout or post-layout simulation results.

Figure 11 shows the output signal of the T&H IC achieved pre-layout and post-layout, together with the clock and a PAM4 input signal. Figure 12 shows the sampled output signal for an NRZ input signal. Figure 13 shows the generated eye diagram of a pseudorandom binary sequence (PRBS) PAM4 input signal.

Table 6 compares the results presented in this implementation with those of similar publications.

Table 6 Comparison of our results with the latest relevant publications
Fig. 11 PAM4 output signal waveform

Fig. 12 NRZ output signal waveform

Fig. 13 PAM4 eye diagram

7 Conclusion

In this work we presented a T&H circuit that is able to sample a 1 GS/s PAM4 signal at 200 MSps while consuming less than 50 mW in operation. Although this circuit is designed for PAM4 signaling, it also works with NRZ signaling. No publications are available for a T&H circuit targeting PAM4 signaling for comparison, but other publications such as [9], based on a SEF (Switched-Emitter-Follower) architecture in BiCMOS, demonstrate a similar small-signal bandwidth and input voltage, but at the expense of a bigger area and 10 times the power consumption. Publication [10], based on a switched capacitor architecture in CMOS, reports a higher sampling rate and bandwidth, but with a smaller input voltage, a much smaller output voltage and a bigger area. Publication [11], based on a switched capacitor architecture in CMOS, reports a similar bandwidth and power consumption, but with a much larger area and a narrower output voltage range. Most of the publications do not report important specifications such as clock input voltage, hold error, droop rate, pedestal error, acquisition bandwidth and the operating temperature range. It is safe to assume that none of the publications are designed to maintain the mode of operation up to high temperatures as in our design. Our circuit can sample both PAM4 and NRZ signals, which is a first for a T&H publication based on a switched capacitor architecture.

The low power consumption of this block enables it to integrate well into IoT applications such as portable always-on testers for 1 GS/s signals. Another application would be to integrate this circuit into existing Small Form-factor Pluggable (SFP) transceiver modules as a front-end with an ADC, with the purpose of determining the quality of the received signal before it is processed and checked in software, saving the extra overhead in time.