

# **A 32 Gb/s Low Power Little Area Re-timer with PI Based CDR in 65 nm CMOS Technology**

Zhengbin Pang<sup>1</sup>, Fangxu Lv<sup>1(⊠)</sup>, Weiping Tang<sup>2</sup>, Mingche Lai<sup>1</sup>, Kaile Guo<sup>2</sup>, Yuxuan Wu<sup>2</sup>, Tao Liu<sup>2</sup>, Miaomiao Wu<sup>2</sup>, and Dechao Lu<sup>2</sup>

> <sup>1</sup> National University of Defense Technology, Changsha, China lvfangxu1988@163.com <sup>2</sup> Air Force Engineering University, Xi'an, China

**Abstract.** This paper presents a 32 Gb/s low-power, small-area re-timer with Phase Interpolator (PI) based Clock and Data Recovery (CDR). To further ensure signal integrity, both a Continuous Time Linear Equalizer (CTLE) and a Feed Forward Equalizer (FFE) are adopted. To save power, a quarter-rate 3-tap FFE is proposed. To reduce the chip area, a Bang-Bang Phase Discriminator (BBPD) based PI CDR is employed. In addition, a second-order digital filter is adopted to improve the jitter performance of the CDR loop. The re-timer is designed in 65 nm CMOS technology with a 1.1 V supply. Simulation results show that the proposed re-timer operates at 32 Gb/s and consumes 91 mW. It can equalize more than 12 dB of channel attenuation and tolerate a frequency difference of 200 ppm.

**Keywords:** Re-timer · Clock and Data Recovery (CDR) · Phase Interpolator (PI) · Feed Forward Equalizer (FFE)

### **1 Introduction**

The continuously increasing bandwidth demand for data communication in high performance computers (HPC) has pushed wire-line links towards data rates of 25 Gb/s and beyond [\[1\]](#page-11-0). At the same time, low-power, high-density data transceivers are key elements of modern HPC, because systems such as network switches and processor interfaces will employ optical communication [\[2,](#page-11-1) [3\]](#page-11-2). Figure [1](#page-1-0) shows a next-generation switch system with optical communication. The black box in the center of the system, which outputs optical signals directly, usually consists of a switch chip, many re-timer chips, and other optical chips. The bandwidth, power efficiency, and area of the re-timer therefore limit the performance of the switch system. Although many reported CDRs meet the bandwidth requirement, they are power hungry because they are fabricated with III-V materials [\[4\]](#page-11-3). In addition, the large area of these CDRs hinders high-density integration.

To solve these problems, a high-speed, low-power, small-area re-timer in CMOS technology is proposed. To save power, a quarter-rate 3-tap FFE is proposed. To reduce the chip area, a BBPD based PI CDR is employed. In addition, a second-order digital filter is used to improve the high-speed performance.

<sup>©</sup> Springer Nature Singapore Pte Ltd. 2020

D. Dong et al. (Eds.): ACA 2020, CCIS 1256, pp. 31–42, 2020.

[https://doi.org/10.1007/978-981-15-8135-9\\_3](https://doi.org/10.1007/978-981-15-8135-9_3)



**Fig. 1.** System of the next switch chip with optical network.

<span id="page-1-0"></span>This paper is organized as follows. Section [2](#page-1-1) presents the architecture of the re-timer, followed by a description of its building blocks. Section [3](#page-9-0) presents the experimental results, and Section 4 concludes the paper.

### <span id="page-1-1"></span>**2 Architecture and Circuit Design**

Figure [2](#page-2-0) shows the re-timer architecture, which includes a phase-tracking control loop and a data path. In the phase-tracking control loop, the input data are first sampled by quarter-rate, 8-phase clocks. The early/late information between the sampling clocks and the input data is then extracted by the PD circuit. After the voter and filter, the control words generated by the code circuit rotate the external input clock to match the phase of the input data. In the data path, the input data are first equalized by the CTLE, then resampled by the recovered clock, and finally equalized and driven out by the 3-tap FFE with driver.

In the phase-tracking control loop, a quarter-rate BBPD based CDR is used, which consists of a quarter-rate sampler, an 8:32 DEMUX, a phase detector, a voter, a second-order digital filter, a code circuit, and a phase interpolator. The data path consists of a CTLE, a baud-rate sampler, a delay latch array, and a 4:1 MUX based 3-tap FFE.

#### **2.1 BBPD Based PI-CDR with Second-Order Digital Filter**

The clock recovery circuit is the most important module in a high-speed re-timer system. Its main task is to extract clock information from input data corrupted by amplitude noise and phase noise, and then to retime the data. In addition, the CDR can track low-frequency phase jitter in the input data. As shown in Fig. [3,](#page-2-1) a CDR mainly comprises a clock recovery (CR) module and a data recovery (DR) module. The CR module detects the phase information of the data and generates a clock aligned to the input data. The DR module uses the generated clock to retime the data.

Figure [4](#page-2-2) shows the model of the proposed CDR, a Bang-Bang phase discriminator (BBPD) based PI CDR with a second-order digital filter. It consists of a BBPD, a voter, a second-order digital filter, a phase interpolator, and a feedback path. The BBPD extracts the phase error between the input data and the clock generated by the PI. The voter derives a single effective decision from the BBPD outputs. The second-order digital filter is adopted



**Fig. 2.** Re-timer architecture.

<span id="page-2-0"></span>

**Fig. 3.** Basic working principle of CDR.

<span id="page-2-1"></span>to smooth the voter output, which then drives the PI. The PI generates a clock of the desired phase from a fixed input clock (Fig. [5\)](#page-3-0).
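The interaction of the blocks described above can be illustrated with a small behavioral model. The sketch below is not the authors' implementation: the update interval, the filter gains `k_p` and `k_i`, the PI step size, and the jitter level are all hypothetical values, chosen only to show how early/late decisions, majority voting, a proportional-plus-integral filter, and a quantized phase code work together while tracking a 200 ppm frequency offset.

```python
import numpy as np

def simulate_bb_cdr(n_updates=5000, ui_per_update=16, ppm=200.0,
                    k_p=0.5, k_i=0.01, pi_step_ui=1.0 / 64,
                    jitter_ui=0.02, seed=0):
    """Behavioral bang-bang PI-CDR loop; returns the phase error (in UI) per update.

    All parameter defaults are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    drift = ppm * 1e-6 * ui_per_update   # input phase drift per loop update, in UI
    phase_in = 0.0                       # phase of the incoming data, in UI
    code_acc = 0.0                       # fractional PI code accumulator
    integ = 0.0                          # integral-path state, in PI codes
    errors = np.empty(n_updates)

    for n in range(n_updates):
        phase_out = np.floor(code_acc) * pi_step_ui   # quantized PI output phase
        err = phase_in - phase_out
        errors[n] = err

        # BBPD: one early/late (+1/-1) decision per data transition.
        bits = rng.integers(0, 2, ui_per_update)
        n_trans = int(np.count_nonzero(np.diff(bits)))
        noisy = err + jitter_ui * rng.standard_normal(n_trans)
        decisions = np.sign(noisy)

        # Voter: majority of the decisions in this update window (0 on a tie).
        vote = float(np.sign(decisions.sum()))

        # Second-order filter: proportional path plus integral path.
        integ += k_i * vote
        code_acc += k_p * vote + integ

        phase_in += drift

    return errors

errors = simulate_bb_cdr()
print(f"peak |phase error| over the last 1000 updates: {np.abs(errors[-1000:]).max():.3f} UI")
```

In this model the integral path slowly absorbs the constant frequency offset, so the proportional path only has to correct the residual dithering around the lock point.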



**Fig. 4.** BBPD based PI CDR with second-order digital filter.

<span id="page-2-2"></span>To analyze the performance of the CDR, a linearized model with its parameters is shown in Fig. [5](#page-3-0). In the linearized model, $K_{TD}$ is the transition density of the input data, $K_{PD}$ is the phase detector gain, and $K_V$ is the gain of the voter, which accounts for any decimation that takes place. $K_P$ and $K_I$ are the gains of the proportional and integral paths from the voter output to the PI. $K_{PI}$ is the gain of the PI, which corresponds to the PI resolution in unit intervals (UI) per bit of the control code. $z^{-NEL}$ represents the total delay (analog and digital pipeline stages) around the



**Fig. 5.** Linearized model of the CDR.

<span id="page-3-0"></span>loop. Thus, the open-loop transfer function of the linearized CDR can be expressed as

$$
G(z^{-1}) = \frac{\varphi_{out}}{\varphi_{err}} = K_{TD} K_{PD} K_V \left(K_P + K_I \frac{z^{-1}}{1 - z^{-1}}\right) \frac{z^{-1}}{1 - z^{-1}} K_{PI}\, z^{-NEL} \tag{1}
$$

In the z-domain, $z = e^{sT_{DLF}}$, where $s = j2\pi f$ and $T_{DLF}$ is the operating (clock) period of the digital loop filter (DLF). Expanding the exponential gives $e^{-sT_{DLF}} = 1 + (-sT_{DLF}) + \frac{(-sT_{DLF})^2}{2!} + \frac{(-sT_{DLF})^3}{3!} + \cdots$, so for $|sT_{DLF}| \ll 1$ we obtain

$$
z^{-1} = e^{-sT_{DLF}} \approx 1 - sT_{DLF}, \quad |sT_{DLF}| \ll 1 \tag{2}
$$

Therefore the open-loop transfer function can be given by

$$
\begin{aligned}
G(s) &= K_{TD} K_{PD} K_V \left[ \frac{K_P (1 - sT_{DLF})}{sT_{DLF}} + \frac{K_I (1 - sT_{DLF})^2}{s^2 T_{DLF}^2} \right] K_{PI} (1 - sT_{DLF})^{NEL} \\
&\approx K_{TD} K_{PD} K_V \left( \frac{K_P}{sT_{DLF}} + \frac{K_I}{s^2 T_{DLF}^2} \right) K_{PI} (1 - sT_{DLF})^{NEL}
\end{aligned} \tag{3}
$$

The phase transfer function is given by the following well known equation:

$$
H(s) = \frac{\varphi_{out}}{\varphi_{in}} = \frac{G(s)}{1 + G(s)}\tag{4}
$$

Figure [6](#page-4-0) shows the calculated phase transfer function of the proposed CDR. It can be observed that the bandwidth is 1.46 MHz.
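As a rough illustration of how such a curve can be computed, the sketch below evaluates the open-loop gain of Eq. (1) directly on the unit circle and forms the closed-loop response of Eq. (4). Every parameter value in `params` is a hypothetical placeholder, since the paper does not list the loop gains, so the printed bandwidth is not expected to reproduce the 1.46 MHz reported above.

```python
import numpy as np

def open_loop_gain(f, k_td, k_pd, k_v, k_p, k_i, k_pi, t_dlf, n_el):
    """Evaluate Eq. (1) on the unit circle, z^-1 = exp(-j*2*pi*f*T_DLF)."""
    z_inv = np.exp(-1j * 2 * np.pi * np.asarray(f, dtype=float) * t_dlf)
    accum = z_inv / (1 - z_inv)              # discrete-time integrator
    return k_td * k_pd * k_v * (k_p + k_i * accum) * accum * k_pi * z_inv ** n_el

def closed_loop(f, **params):
    """Phase transfer function H = G / (1 + G), Eq. (4)."""
    g = open_loop_gain(f, **params)
    return g / (1 + g)

def bandwidth_3db(params, f_lo=1e3, f_hi=None, n_points=20001):
    """First frequency at which |H| falls below 1/sqrt(2)."""
    if f_hi is None:
        f_hi = 0.4 / params["t_dlf"]         # stay below the DLF Nyquist frequency
    f = np.logspace(np.log10(f_lo), np.log10(f_hi), n_points)
    mag = np.abs(closed_loop(f, **params))
    below = np.nonzero(mag < 1 / np.sqrt(2))[0]
    return float(f[below[0]]) if below.size else float("nan")

# Hypothetical loop parameters (not taken from the paper).
params = dict(k_td=0.5, k_pd=10.0, k_v=1.0, k_p=1.0, k_i=1.0 / 256,
              k_pi=1.0 / 64, t_dlf=2e-9, n_el=4)
print(f"-3 dB bandwidth for these example values: {bandwidth_3db(params) / 1e6:.2f} MHz")
```

Using the exact $z^{-1}$ rather than the first-order approximation of Eq. (2) keeps the loop delay term at unit magnitude, which matters once $sT_{DLF}$ is no longer small.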

#### **2.2 Phase Interpolator**

The PI is the key module in the CDR. Under the control of the input control words, it generates a clock of the desired phase for sampling the input data. The working principle of the basic PI can be explained with a vector diagram and its mathematical model. In Fig. [7,](#page-4-1) two basis vectors $\overrightarrow{V_Q}$ and $\overrightarrow{V_I}$, which are 90° apart in phase, are combined into a new vector. From vector geometry, the phase of the composite vector, i.e., the angle between the composite vector and the horizontal vector, can be controlled by changing the amplitudes of the two basis vectors. The composite vector is expressed by Eq. [\(5\)](#page-4-2).



<span id="page-4-0"></span>**Fig. 6.** Calculated transfer function of the proposed CDR.



**Fig. 7.** Composite vector.

<span id="page-4-4"></span><span id="page-4-3"></span><span id="page-4-2"></span>
$$
\overrightarrow{V_O} = \overrightarrow{V_Q} + \overrightarrow{V_I} \tag{5}
$$

<span id="page-4-1"></span>
$$
V_{out} = \alpha A \sin(\omega t) + (1 - \alpha)A \cos(\omega t), \quad 0 \le \alpha \le 1 \tag{6}
$$

$$
V_{out} = A\sqrt{\alpha^2 + (1 - \alpha)^2}\, \sin(\omega t + \varphi_{out}) \tag{7}
$$

<span id="page-4-5"></span>
$$
\varphi_{out} = \arctan\left(\frac{1-\alpha}{\alpha}\right) \tag{8}
$$

In the actual circuit, the two basis vectors $\overrightarrow{V_Q}$ and $\overrightarrow{V_I}$ are replaced by $\alpha A \sin(\omega t)$ and $(1 - \alpha)A\cos(\omega t)$, so Eq. [\(5\)](#page-4-2) can be written as Eq. [\(6\)](#page-4-3), where $\alpha$ is limited to [0, 1]. The phase difference between $\sin(\omega t)$ and $\cos(\omega t)$ is 90°, and $\alpha A$ and $(1 - \alpha)A$ are their respective amplitudes. As $\alpha$ changes, the phase of $V_{out}$ moves within 0° to 90°, which gives the desired output phase. To calculate the output phase precisely, Eq. (7) is derived from Eq. (6), and the phase of $V_{out}$ is given by Eq. (8). Figure [8](#page-5-0) shows the $V_{out}$ waveforms for different values of $\alpha$.
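Equations (6) to (8) can be checked numerically. The following sketch, with the hypothetical helper name `pi_output` and a normalized amplitude $A = 1$, compares the direct sum of Eq. (6) with the closed form of Eq. (7) and prints the phase from Eq. (8) for a few weights. It uses `arctan2` so that $\alpha = 0$ is handled without a division by zero.

```python
import numpy as np

def pi_output(alpha, amplitude=1.0):
    """Return (magnitude, phase in degrees) of V_out from Eqs. (7) and (8)."""
    magnitude = amplitude * np.hypot(alpha, 1.0 - alpha)
    phase_deg = np.degrees(np.arctan2(1.0 - alpha, alpha))
    return magnitude, phase_deg

# Check Eq. (7) against the direct sum in Eq. (6) for one weight.
alpha = 0.25
wt = np.linspace(0.0, 2.0 * np.pi, 1000)
direct = alpha * np.sin(wt) + (1.0 - alpha) * np.cos(wt)
mag, phase_deg = pi_output(alpha)
closed_form = mag * np.sin(wt + np.radians(phase_deg))
print("Eq. (6) matches Eq. (7):", np.allclose(direct, closed_form))

for a in (0.0, 0.25, 0.5, 0.75, 1.0):
    m, p = pi_output(a)
    print(f"alpha={a:.2f}  amplitude={m:.3f}  phase={p:.1f} deg")
```

The printout also makes visible that the output amplitude is not constant across $\alpha$: from Eq. (7) it drops to $A/\sqrt{2}$ at $\alpha = 0.5$.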

Figure [9](#page-5-1) shows part of the PI circuit. It includes two pull-up loads, two pairs of input transistors, and 16 identical tail current sources under each input pair. The relationship between the input thermometer code and the output clock phase is depicted in Fig. [10.](#page-5-2)

If the two input basis clocks are selected from the 0°, 90°, 180°, and 270° phases, the composite clock can take any desired phase from 0° to 360°, as depicted



**Fig. 8.** Different output clocks with different  $\alpha$  value.

<span id="page-5-0"></span>

<span id="page-5-2"></span><span id="page-5-1"></span>**Fig. 10.** The relationship between input thermometer code and output clock phase.


in Fig. [11.](#page-6-0) Figure [12](#page-6-1) shows the circuit of the complete PI, which consists of four pairs of input transistors, control-word transistors, and tail current sources.
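To make the code-to-phase relationship concrete, the sketch below maps a quadrant selection and a 16-bit thermometer code to an output phase using the ideal vector sum. The mapping itself is an assumption made for illustration (each 1 in the code is taken to steer one of the 16 tail currents from the first selected clock to the second); the actual control-word encoding of the circuit is not given in the text, and `thermometer` and `pi_phase_deg` are hypothetical helper names.

```python
import numpy as np

N_SOURCES = 16  # tail current sources per input pair, as stated in the text

def thermometer(k, n=N_SOURCES):
    """Thermometer code with k ones out of n bits, e.g. k=3 -> 1110...0."""
    return [1] * k + [0] * (n - k)

def pi_phase_deg(quadrant, code):
    """Output phase for a pair of adjacent quadrature clocks.

    Assumed mapping (hypothetical): `quadrant` q selects the clocks at
    90*q and 90*(q+1) degrees; each 1 in `code` steers one tail current
    from the first clock to the second.
    """
    k, n = sum(code), len(code)
    first = np.exp(1j * np.radians(90.0 * quadrant))
    second = np.exp(1j * np.radians(90.0 * (quadrant + 1)))
    vec = (n - k) / n * first + k / n * second
    return float(np.degrees(np.angle(vec)) % 360.0)

phases = [pi_phase_deg(q, thermometer(k)) for q in range(4) for k in range(N_SOURCES)]
steps = np.diff(phases)
print(f"{len(phases)} phase positions, step size {steps.min():.2f} to {steps.max():.2f} deg")
```

With ideal linear current weighting the phase steps are not uniform: they are smallest near the quadrant edges and largest near the middle of a quadrant. This is a property of the arctangent in Eq. (8), not a statement about the linearity of the fabricated circuit.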



**Fig. 11.** 360° output phase of the composite vector.

<span id="page-6-0"></span>

**Fig. 12.** Circuit of the complete PI.

#### <span id="page-6-1"></span>**2.3 4:1 MUX Based 3-Tap FFE**

A dielectric channel usually presents a low-pass characteristic due to dielectric loss and the skin effect. Figure [13\(](#page-7-0)a) shows the S12 curve of a typical backbone channel, which includes a 19 in. PCB channel, 2 via holes, 2 packages, and 2 connectors. The attenuation at the baud-rate frequency is −17.32 dB. When the data rate exceeds the channel bandwidth, a signal transition cannot complete within one unit interval (UI) and spreads into the adjacent symbol intervals, as shown in Fig. [13](#page-7-0) (b); this phenomenon is usually called inter-symbol interference (ISI). ISI degrades the signal integrity of high-speed signals. Figure [14](#page-7-1) presents a 32 Gb/s NRZ eye diagram before this channel and the eye diagram after the channel, which is closed due to ISI.



<span id="page-7-0"></span>**Fig. 13.** (a) S12 curve of a typical channel, (b) unit pulse response before and after the channel.



<span id="page-7-1"></span>**Fig. 14.** (a) Eye diagram before the channel, (b) eye diagram after the channel.

To mitigate this problem, a feed-forward equalizer (FFE) is usually introduced at the output of the re-timer to reduce the ISI. The basic structure of a 3-tap FFE, shown in Fig. [15,](#page-8-0) includes 3 delay units, 3 multipliers with 3 coefficients, and a summer; it is a finite impulse response (FIR) filter. Its time-domain response is given by Eq. (9) and its z-domain transfer function by Eq. (10), where $z = e^{j2\pi fT}$. Figure [16](#page-8-1) shows the frequency-domain responses. The black curve is the channel response without FFE. The blue curve is the high-pass response of the FIR filter with proper 3-tap coefficients. The red curve is the channel response with the FFE, which preserves signal integrity. Figure 17 (a) and (b) show the eye diagrams before and after the channel when the FFE uses proper coefficients.

<span id="page-7-3"></span><span id="page-7-2"></span>
$$
y(t) = c_0\, x(t) + c_1\, x(t - T) + c_2\, x(t - 2T) \tag{9}
$$

$$
H(z) = c_0 z^{0} + c_1 z^{-1} + c_2 z^{-2}, \quad z = e^{j2\pi fT} \tag{10}
$$
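The high-pass behavior of Eq. (10) is easy to see numerically. The sketch below evaluates $H(z)$ at DC and at the Nyquist frequency ($fT = 0.5$) for one set of tap weights. The weights are hypothetical, chosen only to resemble a main tap with small negative neighbors; the paper does not state the coefficients it uses.

```python
import numpy as np

def ffe_response(taps, f_norm):
    """Evaluate H(z) = c0 + c1*z^-1 + c2*z^-2 at z = exp(j*2*pi*f*T).

    `f_norm` is frequency normalized to the baud rate (f*T), so 0.5 is Nyquist.
    """
    z_inv = np.exp(-1j * 2 * np.pi * np.asarray(f_norm, dtype=float))
    return sum(c * z_inv ** k for k, c in enumerate(taps))

# Hypothetical tap weights: small negative neighbors around the main tap.
taps = [-0.1, 0.7, -0.2]

dc_gain = abs(ffe_response(taps, 0.0))
nyquist_gain = abs(ffe_response(taps, 0.5))
boost_db = 20 * np.log10(nyquist_gain / dc_gain)
print(f"|H| at DC = {dc_gain:.2f}, at Nyquist = {nyquist_gain:.2f}, boost = {boost_db:.1f} dB")
```

For these example weights the DC gain is the plain sum of the taps, while at Nyquist the signs alternate, so the filter passes fast transitions at a higher gain than long runs of identical bits.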

Compared with the pre-emphasis based FFE, the de-emphasis based FFE is widely used due to its simple circuit structure. A de-emphasis based FFE equalizes the output signal by reducing the amplitude of the low-frequency components of the original signal while maintaining the amplitude of the high-frequency components, which still follows the principle of the FFE. However, when the data rate exceeds 20 Gb/s, the high-speed delay cells are power hungry and the timing is tightly constrained under PVT variation. To solve these problems, a 4:1 MUX based 3-tap FFE is introduced in this re-timer, as shown in Fig. [18.](#page-9-1) Compared with other FFE circuits, the delay cells in this FFE are implemented with three 4:1 MUX units, which saves power and relaxes the critical-path timing by using the quarter-rate clock and avoiding CML based circuits.

**Fig. 15.** Basic construction of FFE.

<span id="page-8-0"></span>

**Fig. 16.** Frequency domain channel response.

<span id="page-8-1"></span>

<span id="page-8-2"></span>**Fig. 17.** Time domain channel response (a) eye diagram before channel, (b) eye diagram after channel with proper coefficients FIR.

Figure [19](#page-9-2) describes the 4:1 MUX with its timing diagram. The MUX consists of shunt-peaked loads and four identical unit cells, which are activated sequentially by the 2UI-spaced quadrature clock pulses (i.e., CK0, CK90, CK180, and CK270) to combine the four quarter-rate data streams into one serial sequence.
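The idea of building the FFE delays from quarter-rate data can be sketched behaviorally. In the model below, which is an illustration under stated assumptions rather than the authors' circuit, each tap has its own 4:1 MUX, and a one-UI delay is obtained by rotating the four quarter-rate lanes and delaying only the wrapped lane by one quarter-rate clock cycle. The function names and tap weights are hypothetical.

```python
import numpy as np

def demux4(symbols):
    """Split a full-rate sequence into four quarter-rate lanes (length must be a multiple of 4)."""
    s = np.asarray(symbols, dtype=float)
    return [s[k::4] for k in range(4)]

def mux4(lanes):
    """Behavioral 4:1 MUX: one lane is selected per UI, in CK0, CK90, CK180, CK270 order."""
    return np.stack(lanes, axis=1).reshape(-1)

def delay_one_ui(lanes):
    """Return lanes whose serialized stream is the input stream delayed by one UI.

    Lanes 0..2 move to positions 1..3; lane 3 wraps to position 0 and is
    delayed by one quarter-rate clock cycle (a latch in hardware).
    """
    wrapped = np.concatenate(([0.0], lanes[3][:-1]))
    return [wrapped, lanes[0], lanes[1], lanes[2]]

def mux_ffe(symbols, taps):
    """Sum of one MUX output per tap, each fed with progressively delayed lanes."""
    lanes = demux4(symbols)
    out = np.zeros(len(symbols))
    for c in taps:
        out += c * mux4(lanes)
        lanes = delay_one_ui(lanes)
    return out

rng = np.random.default_rng(1)
symbols = rng.choice([-1.0, 1.0], size=32)      # NRZ symbols
taps = [-0.1, 0.7, -0.2]                        # hypothetical tap weights
reference = np.convolve(symbols, taps)[:len(symbols)]
print("MUX-based FFE matches direct FIR:", np.allclose(mux_ffe(symbols, taps), reference))
```

The point of the arrangement is that all delay elements run at the quarter-rate clock; only the MUX outputs and the summing node operate at the full data rate.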



**Fig. 18.** Multiple-MUX based FFE.

<span id="page-9-1"></span>

**Fig. 19.** The 4:1 MUX with its timing diagram.

### <span id="page-9-2"></span><span id="page-9-0"></span>**3 Experimental Results**

The re-timer is designed in 65 nm CMOS technology. The layout of the re-timer is shown in Fig. [20.](#page-9-3) The core area of this chip is 0.11 mm².

<span id="page-9-3"></span>

**Fig. 20.** Layout of the re-timer.



<span id="page-10-0"></span>**Fig. 21.** Eye diagram of equalized signal (a) without FFE, (b) with proper coefficients of FFE.



<span id="page-10-1"></span>**Fig. 22.** Eye diagram of the recovered 1/4 rate clock with 200 ppm frequency difference.

Figure [21](#page-10-0) shows the 32 Gb/s output eye diagram of the re-timer with and without the FFE. When the signal passes through a channel with −12.52 dB attenuation at 16 GHz without the FFE, the output eye diagram is closed, as shown in Fig. [21](#page-10-0) (a). With the 3-tap FFE and proper coefficients, the vertical eye opening is 200 mVpp, as shown in Fig. [21\(](#page-10-0)b). With a 200 ppm frequency difference between the input data and the reference clock, the eye diagram of the recovered 1/4 rate clock is shown in Fig. [22,](#page-10-1) and its total jitter is 7.1 ps. The total power of the re-timer is 91 mW under a 1.1 V supply. Table [1](#page-11-4) compares the performance of this work with similar prior works.

<span id="page-11-4"></span>

|                                       | This work (Simulation)             | [5] (Fabricated)       | [6] (Fabricated)           |
|---------------------------------------|------------------------------------|------------------------|----------------------------|
| Data rate                             | 32 Gb/s                            | 32 Gb/s                | 26.5 Gb/s                  |
| Power                                 | 91 mW                              | 102 mW                 | 254 mW                     |
| CDR technology                        | PI and second-order digital filter | DCO and digital filter | LC-ODCO and digital filter |
| Recovered clock jitter (peak-to-peak) | < 7.1 ps                           | N/A                    | < 8.9 ps                   |
| Core area (mm²)                       | 0.55 × 0.2                         | 0.8 × 0.28             | 1 × 0.75                   |
| Technology                            | 65 nm                              | 28 nm                  | 65 nm                      |

**Table 1.** Performance summary
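For a like-for-like reading of the table, the power and area entries can be normalized. The short sketch below derives energy per bit and core area from the tabulated values; these derived figures are not stated in the paper, and the comparison mixes simulated results (this work) with fabricated ones, so it should be read with that caveat.

```python
# Values copied from Table 1: (data rate in Gb/s, power in mW, core width and height in mm).
designs = {
    "This work": (32.0, 91.0, 0.55, 0.2),
    "[5]": (32.0, 102.0, 0.8, 0.28),
    "[6]": (26.5, 254.0, 1.0, 0.75),
}

def figures_of_merit(rate_gbps, power_mw, width_mm, height_mm):
    """Return (energy per bit in pJ/bit, core area in mm^2)."""
    energy_pj_per_bit = power_mw / rate_gbps   # mW / (Gb/s) = pJ/bit
    return energy_pj_per_bit, width_mm * height_mm

for name, spec in designs.items():
    energy, area = figures_of_merit(*spec)
    print(f"{name:10s} {energy:5.2f} pJ/bit  {area:5.3f} mm^2")
```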

### **4 Conclusion**

To solve the problems of high power consumption and large area of high-speed re-timers in HPC data communication, a 32 Gb/s low-power, small-area re-timer with PI based CDR is proposed. To further ensure signal integrity, both a CTLE and a feed forward equalizer are adopted. To save power, a quarter-rate 3-tap FFE is proposed. To reduce the chip area, a BBPD based PI CDR is employed. In addition, a second-order digital filter is adopted to improve the high-speed performance of the CDR loop. The re-timer is designed in 65 nm CMOS technology with a 1.1 V supply. Simulation results show that the proposed re-timer operates at 32 Gb/s and consumes 91 mW. The 3-tap FFE in the re-timer can equalize more than 12 dB of channel attenuation, and the PI based CDR with second-order digital filter can tolerate a frequency difference of 200 ppm.

### **References**

- <span id="page-11-0"></span>1. [Rupp, K.: 42 years of microprocessor trend data.](https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/) https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/
- <span id="page-11-1"></span>2. Moore, G.E.: Cramming more components onto integrated circuits. Electronics **38**(8), 114–117 (1965)
- <span id="page-11-2"></span>3. Pham, D.: The design and implementation of a first-generation CELL processor-a multi-core SoC. In: 2005 International Conference on Integrated Circuit Design and Technology, Austin, TX, USA, pp. 49–52. IEEE (2005)
- <span id="page-11-3"></span>4. Nagashima, K.: 28-Gb/s × 24-channel CDR-integrated VCSEL-based transceiver module for high-density optical interconnects. In: 2016 Optical Fiber Communications Conference and Exhibition (OFC), Anaheim, CA, pp. 1–3. IEEE (2016)
- 5. Rahman, W.: A 22.5-to-32-Gb/s 3.2-pJ/b Referenceless Baud-Rate Digital CDR With DFE and CTLE in 28-nm CMOS. IEEE J. Solid-State Circ. **52**(12), 3517–3531 (2017)
- 6. Chu, S.-H.: A 22 to 26.5 Gb/s optical receiver with all-digital clock and data recovery in a 65 nm CMOS process. IEEE J. Solid-State Circ. **50**(11), 2603–2612 (2015)