1 Introduction

The continuously increasing bandwidth demand for data communication in high-performance computing (HPC) has pushed wire-line links towards data rates of 25 Gb/s and beyond [1]. Low-power, high-density data transceivers are also key elements of modern HPC, because systems such as network switches and processor interfaces will employ optical communication [2, 3]. Figure 1 shows a next-generation switch system with optical communication. The black box in the center of the system, which outputs optical signals directly, usually consists of a switch chip, many re-timer chips and other optical chips. The bandwidth, power efficiency and area of the re-timer therefore limit the performance of the switch system. Although many reported CDRs meet the bandwidth requirement, they are power hungry because they are fabricated in III-V materials [4]. In addition, the large area of these CDRs hinders high-density integration.

Fig. 1. System of the next-generation switch chip with optical network.

To solve these problems, a high-speed, low-power, small-area re-timer in CMOS technology is proposed. To reduce power dissipation, a quarter-rate 3-tap FFE is proposed. To reduce chip area, a BBPD-based PI CDR is employed. In addition, a second-order digital filter is used to improve high-speed performance.

This paper is organized as follows. Section 2 presents the architecture of the re-timer, followed by a description of its building blocks. Section 3 presents the experimental results, and Section 4 concludes the paper.

2 Architecture and Circuit Design

Figure 2 shows the re-timer architecture, which includes a phase-tracking control loop and a data path. In the phase-tracking control loop, the input data are first sampled by eight-phase quarter-rate clocks. The phase detector (PD) then extracts the early/late information between the sampling clocks and the input data. After the voter and the filter, the control words generated by the code circuit rotate the phase of the external input clock to match the phase of the input data. In the data path, the input data are first equalized by the CTLE and then resampled by the recovered clock. Finally, the data are equalized and driven out by the 3-tap FFE with its driver.

In the phase-tracking control loop, a quarter-rate BBPD-based CDR is introduced, which consists of a quarter-rate sampler, an 8:32 DEMUX, a phase detector, a voter, a second-order digital filter, a code circuit, and a phase interpolator. The data path consists of a CTLE, a baud-rate sampler, a delay latch array, and a 4:1 MUX based 3-tap FFE.
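The early/late extraction and voting steps can be illustrated with a small behavioural sketch. It assumes an Alexander-style decision rule on alternating data and edge samples; the paper does not state the exact PD logic, and the function names `bbpd_decisions` and `majority_vote` are hypothetical.

```python
def bbpd_decisions(data, edges):
    """Alexander-style early/late decisions (assumed logic, not taken from the paper).

    data[k]  : bit sampled at the centre of UI k
    edges[k] : bit sampled at the boundary between UI k and UI k+1
    Returns one value per boundary: +1 = clock late, -1 = clock early, 0 = no transition.
    """
    out = []
    for k in range(len(data) - 1):
        if data[k] == data[k + 1]:
            out.append(0)        # no data transition, so no phase information
        elif edges[k] == data[k + 1]:
            out.append(+1)       # edge sample already shows the new bit: clock is late
        else:
            out.append(-1)       # edge sample still shows the old bit: clock is early
    return out


def majority_vote(decisions):
    """Collapse a block of early/late decisions into a single -1/0/+1 vote."""
    total = sum(decisions)
    return (total > 0) - (total < 0)
```

In such a scheme the voter would run once per demultiplexed block, so the digital filter only has to operate at a fraction of the line rate.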

Fig. 2. Re-timer architecture.

2.1 BBPD Based PI-CDR with Second-Order Digital Filter

The clock recovery circuit is the most important module in a high-speed re-timer system. Its main task is to extract clock information from input data corrupted by amplitude noise and phase noise, and then to retime the data. In addition, the CDR can track low-frequency phase jitter in the input data. As shown in Fig. 3, a CDR mainly comprises a clock recovery (CR) module and a data recovery (DR) module. The CR module detects the phase of the data and generates a clock aligned to the input data. The DR module uses the generated clock to retime the data.

Fig. 3. Basic working principle of a CDR.

Figure 4 shows the model of the proposed CDR, a bang-bang phase detector (BBPD) based PI CDR with a second-order digital filter. It consists of a BBPD, a voter, a second-order digital filter, a phase interpolator (PI), and a feedback path. The BBPD extracts the phase error between the input data and the clock generated by the PI. The voter condenses the BBPD decisions into a single effective early/late result. The second-order digital filter smooths the voter output, and its result drives the PI. The PI generates a clock with the desired phase from a fixed input clock. The corresponding linearized model is shown in Fig. 5.
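The behaviour of such a loop can be sketched with a few lines of simulation. This is a toy model under stated assumptions, not the authors' implementation: the BBPD and voter are reduced to the sign of the phase error, and all gains, the update rate (one update per 32 UI) and the PI step (1/64 UI) are hypothetical values chosen for illustration.

```python
def simulate_cdr(n_updates, drift_ui, kp, ki, kpi):
    """Behavioural bang-bang CDR loop; returns the phase error (UI) at each update.

    drift_ui : input phase drift per loop update (models a frequency offset)
    kp, ki   : proportional / integral gains of the digital filter (code units)
    kpi      : PI resolution in UI per code step
    """
    phi_in = phi_out = integ = 0.0
    errors = []
    for _ in range(n_updates):
        phi_in += drift_ui
        err = phi_in - phi_out
        vote = (err > 0) - (err < 0)          # BBPD + voter: only the sign is used
        integ += ki * vote                    # integral path
        phi_out += kpi * (kp * vote + integ)  # both paths drive the PI
        errors.append(err)
    return errors


# Hypothetical numbers: 200 ppm offset, one loop update per 32 UI, 1/64 UI PI step.
DRIFT = 200e-6 * 32
second_order = simulate_cdr(5000, DRIFT, kp=1.0, ki=1 / 64, kpi=1 / 64)
# Proportional-only loop facing a drift larger than one PI step per update.
first_order = simulate_cdr(5000, 0.02, kp=1.0, ki=0.0, kpi=1 / 64)
```

In this toy model the proportional-only loop cannot follow a drift that exceeds one proportional step per update, whereas the integral path gradually absorbs the offset and keeps the error bounded.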

Fig. 4. BBPD based PI CDR with second-order digital filter.

Fig. 5. Linearized model of the CDR.

To analyze the performance of the CDR, a linearized model is shown in Fig. 5. In this model, \( K_{TD} \) is the transition density of the input data, \( K_{PD} \) is the phase detector gain, and \( K_{V} \) is the gain of the voter, which accounts for any decimation that takes place. \( K_{P} \) and \( K_{I} \) are the gains of the proportional and integral paths from the voter output to the PI. \( K_{PI} \) is the gain of the PI, i.e. its resolution in unit intervals (UI) per bit of the control code. \( z^{ - NEL} \) represents the total delay (analog and digital pipeline stages) around the loop. Thus, the open-loop transfer function of the linearized CDR can be expressed as

$$ G(z^{-1}) = \frac{\varphi_{out}}{\varphi_{err}} = K_{TD} K_{PD} K_{V} \left( K_{P} + K_{I} \frac{z^{-1}}{1 - z^{-1}} \right) \frac{z^{-1}}{1 - z^{-1}} K_{PI}\, z^{-NEL} $$
(1)

In the z-domain, \( z = e^{sT_{DLF}} \), where \( s = j2\pi f \) and \( T_{DLF} \) is the clock period of the digital loop filter (DLF). Expanding \( e^{-sT_{DLF}} = 1 + (-sT_{DLF}) + \frac{(-sT_{DLF})^{2}}{2!} + \frac{(-sT_{DLF})^{3}}{3!} + \cdots \) and keeping only the first-order term when \( |sT_{DLF}| \ll 1 \) gives

$$ z^{-1} = e^{-sT_{DLF}} \approx 1 - sT_{DLF}, \quad (|sT_{DLF}| \ll 1) $$
(2)

Therefore the open-loop transfer function can be given by

$$ \begin{aligned} G(s) &= K_{TD} K_{PD} K_{V} \left[ \frac{K_{P} (1 - sT_{DLF})}{sT_{DLF}} + \frac{K_{I} (1 - sT_{DLF})^{2}}{s^{2} T_{DLF}^{2}} \right] K_{PI} (1 - sT_{DLF})^{NEL} \\ &\approx K_{TD} K_{PD} K_{V} \left( \frac{K_{P}}{sT_{DLF}} + \frac{K_{I}}{s^{2} T_{DLF}^{2}} \right) K_{PI} (1 - sT_{DLF})^{NEL} \end{aligned} $$
(3)

The phase transfer function is given by the following well known equation:

$$ H(s) = \frac{\varphi_{out}}{\varphi_{in}} = \frac{G(s)}{1 + G(s)} $$
(4)

Figure 6 shows the calculated phase transfer function of the proposed CDR. It can be observed that the bandwidth is 1.46 MHz.
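As a numerical companion to Eqs. (1), (3) and (4), the sketch below evaluates the open-loop gain both in the z-domain and with the low-frequency s-domain approximation, and forms the closed-loop response. The parameter values in `PARAMS` are placeholders, since the paper does not list its loop parameters; they will not reproduce the 1.46 MHz bandwidth of Fig. 6.

```python
import numpy as np


def open_loop_z(f, ktd, kpd, kv, kp, ki, kpi, nel, t_dlf):
    """Eq. (1) evaluated on the unit circle, z = exp(j*2*pi*f*T_DLF)."""
    z_inv = np.exp(-2j * np.pi * f * t_dlf)
    acc = z_inv / (1.0 - z_inv)  # digital accumulator z^-1 / (1 - z^-1)
    return ktd * kpd * kv * (kp + ki * acc) * acc * kpi * z_inv**nel


def open_loop_s(f, ktd, kpd, kv, kp, ki, kpi, nel, t_dlf):
    """Second line of Eq. (3): low-frequency s-domain approximation."""
    st = 2j * np.pi * f * t_dlf
    return ktd * kpd * kv * (kp / st + ki / st**2) * kpi * (1.0 - st) ** nel


def closed_loop(g):
    """Eq. (4): phase transfer function H = G / (1 + G)."""
    return g / (1.0 + g)


# Placeholder values only; the paper does not list its loop parameters.
PARAMS = dict(ktd=0.5, kpd=1.0, kv=1.0, kp=1.0, ki=1 / 64, kpi=1 / 64, nel=4, t_dlf=1e-9)
```

Evaluating both functions over a frequency sweep, for example with `np.logspace`, shows where the two expressions agree: they coincide while \( |sT_{DLF}| \ll 1 \) and diverge as the frequency approaches the DLF clock rate.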

Fig. 6. Calculated phase transfer function of the proposed CDR.

2.2 Phase Interpolator

The PI is the key module of the CDR. Under the control of the input control words, it generates a clock with the desired phase for sampling the input data. The working principle of a basic PI can be explained by a vector diagram and its mathematical model. In Fig. 7, two basis vectors \( \overrightarrow {{V_{Q} }} \) and \( \overrightarrow {{V_{I} }} \), separated in phase by 90°, are summed to form a new vector. From vector geometry, the phase of the composite vector, i.e. the angle between the composite vector and the horizontal axis, can be controlled by changing the amplitudes of the two basis vectors. This composition is expressed by Eq. (5).

Fig. 7. Composite vector.

$$ \overrightarrow{V_{O}} = \overrightarrow{V_{Q}} + \overrightarrow{V_{I}} $$
(5)
$$ V_{out} = \alpha A \sin(\omega t) + (1 - \alpha) A \cos(\omega t), \quad (0 \le \alpha \le 1) $$
(6)
$$ V_{out} = A\sqrt{\alpha^{2} + (1 - \alpha)^{2}}\, \sin(\omega t + \varphi_{out}) $$
(7)
$$ \varphi_{out} = \arctan\left( \frac{1 - \alpha}{\alpha} \right) $$
(8)

In the actual circuit, the two basis vectors \( \overrightarrow {{V_{Q} }} \) and \( \overrightarrow {{V_{I} }} \) are realized as \( \alpha A\,\sin (\omega t) \) and \( (1 - \alpha )A\,\cos (\omega t) \), so Eq. (5) can be written as Eq. (6), where \( \alpha \) is restricted to [0, 1]. \( \sin (\omega t) \) and \( \cos (\omega t) \) are 90° apart in phase, and \( \alpha A \) and \( (1 - \alpha )A \) are their respective amplitudes. As \( \alpha \) changes, the phase of \( V_{out} \) sweeps between 0° and 90°, which yields the desired output phase. To calculate the output phase precisely, Eq. (7) is derived from Eq. (6), and the phase of \( V_{out} \) is then given by Eq. (8). Figure 8 shows the \( V_{out} \) waveforms for different values of \( \alpha \).
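Equations (6) to (8) can be checked numerically with a short sketch. The function name `pi_output` is hypothetical; `atan2` is used instead of a plain arctangent so that \( \alpha = 0 \) is handled without a division by zero.

```python
import math


def pi_output(alpha, amplitude=1.0):
    """Amplitude and phase (degrees) of alpha*A*sin(wt) + (1 - alpha)*A*cos(wt)."""
    mag = amplitude * math.hypot(alpha, 1.0 - alpha)              # Eq. (7)
    phase_deg = math.degrees(math.atan2(1.0 - alpha, alpha))      # Eq. (8)
    return mag, phase_deg


def waveform(alpha, wt, amplitude=1.0):
    """Direct evaluation of Eq. (6) at angle wt (radians)."""
    return alpha * amplitude * math.sin(wt) + (1.0 - alpha) * amplitude * math.cos(wt)
```

With these ideal weights the output amplitude dips to about 0.71 A at \( \alpha = 0.5 \), and the phase is not a linear function of \( \alpha \).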

Fig. 8. Output clocks for different values of \( \alpha \).

Figure 9 shows part of the PI circuit. It includes two pull-up loads, two input transistor pairs, and 16 equal tail current sources under each input pair. The relationship between the input thermometer code and the output clock phase is depicted in Fig. 10.

Fig. 9. Partial circuit of the PI.

Fig. 10. Relationship between the input thermometer code and the output clock phase.

If the two input basis clocks are selected from the 0°, 90°, 180° and 270° phases, the composite clock can be set to any desired phase between 0° and 360°, as depicted in Fig. 11. Figure 12 shows the circuit of the complete PI, which consists of four input transistor pairs, control-word transistors and tail current sources.
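The quadrant selection combined with a 16-step weighting can be sketched as follows. The mapping is an assumption made for illustration (the paper does not give the exact code-to-phase assignment): the ones of the code steer current to the later of the two adjacent quadrature clocks, and the weights are ideal.

```python
import math

N_SOURCES = 16  # tail current sources per input pair, as stated in the text


def thermometer(k, n=N_SOURCES):
    """Code with k ones out of n bits in which all ones are contiguous from the start."""
    return [1] * k + [0] * (n - k)


def pi_phase_deg(quadrant, code):
    """Output phase for a quadrant select (0..3) and a weighting code.

    Assumed mapping: the ones steer current to the later of the two adjacent clocks.
    """
    w_late = sum(code)
    w_early = len(code) - w_late
    return (90.0 * quadrant + math.degrees(math.atan2(w_late, w_early))) % 360.0
```

Sweeping `quadrant` over 0 to 3 and `k` over 0 to 16 covers the full circle; with ideal linear weights the resulting phase steps are not uniform across a quadrant.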

Fig. 11. 360° output phase of the composite vector.

Fig. 12. Circuit of the complete PI.

2.3 4:1 MUX Based 3-Tap FFE

It is well known that a wire-line channel usually exhibits a low-pass characteristic due to dielectric loss and the skin effect. Figure 13(a) shows the S12 curve of a typical backbone channel, which includes a 19 in. PCB trace, 2 via holes, 2 packages and 2 connectors. The attenuation at the baud-rate frequency is −17.32 dB. When the data rate exceeds the channel bandwidth, a high-speed symbol cannot complete its transition within one unit interval (UI) and spreads into the adjacent symbol intervals, as shown in Fig. 13(b); this phenomenon is called inter-symbol interference (ISI). ISI degrades the signal integrity of high-speed signals. Figure 14 presents the 32 Gb/s NRZ eye diagram before this channel and the eye diagram after the channel, which is closed due to ISI.
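The effect of ISI on the eye can be illustrated with a toy model. The sketch assumes a one-pole low-pass channel sampled once per UI, with a hypothetical pole parameter `a`; it is not the measured channel of Fig. 13. The eye is estimated with the usual peak-distortion bound, i.e. the main cursor minus the sum of the magnitudes of all other cursors.

```python
def pulse_response(a, n_taps=32):
    """UI-spaced samples of a one-pole low-pass channel's single-pulse response (toy model).

    a in (0, 1): larger values mean a narrower channel bandwidth and longer tails.
    """
    return [(1.0 - a) * a**k for k in range(n_taps)]


def worst_case_eye(h):
    """Peak-distortion estimate: |main cursor| minus the sum of all other |cursors|."""
    main = max(abs(x) for x in h)
    return main - (sum(abs(x) for x in h) - main)
```

For example, `worst_case_eye(pulse_response(0.3))` is positive (open eye), while `worst_case_eye(pulse_response(0.6))` is negative, meaning that the residual cursors can outweigh the main cursor and close the eye.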

Fig. 13. (a) S12 curve of a typical channel, (b) unit pulse response before and after the channel.

Fig. 14. (a) Eye diagram before the channel, (b) eye diagram after the channel.

To mitigate this problem, a feed-forward equalizer (FFE) is usually introduced at the output of the re-timer to reduce the ISI. The basic structure of a 3-tap FFE, shown in Fig. 15, is a finite impulse response (FIR) filter that includes 3 delay units, 3 multipliers with 3 coefficients, and a summer. Its time-domain response is given by Eq. (9) and its z-domain transfer function by Eq. (10), where \( z = e^{j2\pi fT} \). Figure 16 shows the frequency responses. The black curve is the channel response without the FFE. The blue curve is the high-pass response of the FIR filter with properly chosen tap coefficients. The red curve is the channel response with the FFE, which preserves signal integrity. Figure 17(a) and (b) show the eye diagrams before and after the channel when the FFE uses proper coefficients.

Fig. 15. Basic construction of FFE.

$$ y(t) = c_{0}\,x(t) + c_{1}\,x(t - T) + c_{2}\,x(t - 2T) $$
(9)
$$ H(z) = c_{0} + c_{1} z^{-1} + c_{2} z^{-2}, \quad z = e^{j2\pi fT} $$
(10)

Fig. 16. Frequency domain channel response.
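A minimal sketch of the 3-tap FIR of Eq. (10), taking \( z^{-1} \) as a delay of one UI, is given below. The tap values used in the usage note are hypothetical and are not the coefficients of this design.

```python
import cmath
import math


def ffe_output(x, c):
    """y[n] = c0*x[n] + c1*x[n-1] + c2*x[n-2]; samples before the start are taken as 0."""
    return [
        sum(ck * x[n - k] for k, ck in enumerate(c) if n - k >= 0)
        for n in range(len(x))
    ]


def ffe_gain_db(f_norm, c):
    """|H| in dB at normalized frequency f_norm = f*T (0.5 is the Nyquist frequency)."""
    z_inv = cmath.exp(-2j * math.pi * f_norm)
    h = sum(ck * z_inv**k for k, ck in enumerate(c))
    return 20.0 * math.log10(abs(h))
```

With illustrative taps `c = (-0.1, 0.7, -0.2)`, the gain at Nyquist, `ffe_gain_db(0.5, c)`, exceeds the DC gain, `ffe_gain_db(0.0, c)`, which is the high-pass shape needed to compensate a low-pass channel.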

Fig. 17. Time domain channel response: (a) eye diagram before the channel, (b) eye diagram after the channel with proper FIR coefficients.

Compared with the pre-emphasis based FFE, the de-emphasis based FFE is widely used because of its simple circuit structure. A de-emphasis based FFE equalizes the output signal by reducing the amplitude of the low-frequency components of the original signal while maintaining the amplitude of the high-frequency components, which still follows the principle of the FFE. However, when data rates exceed 20 Gb/s, the high-speed delay cells are power hungry and their timing margins become tight under PVT variation. To solve these problems, a 4:1 MUX based 3-tap FFE, shown in Fig. 18, is introduced in this re-timer. Unlike other FFE circuits, the delay cell in this FFE is built from three 4:1 MUX units, which saves power and relaxes the critical-path timing by using the quarter-rate clock and avoiding CML-based circuits.

Fig. 18. Multiple-MUX based FFE.

Fig. 19. The 4:1 MUX with its timing diagram.

Figure 19 shows the 4:1 MUX and its timing diagram. The MUX consists of shunt-peaked loads and four identical unit cells, which are activated sequentially by the 2UI-spaced pulses of the quadrature clocks (i.e., CK0, CK90, CK180, and CK270) to combine the four quarter-rate data streams into one serial sequence.
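At the behavioural level, the multiple-MUX FFE can be sketched as three serializers whose outputs are offset by one UI each, weighted and summed. This is an illustrative model only: the `offset` argument stands in for the delay latch array, and the function names and tap values are hypothetical.

```python
def mux_4to1(lanes, offset=0):
    """Serialize four quarter-rate lanes into one full-rate stream.

    Lane (n - offset) % 4 drives UI n, so `offset` delays the stream by that many UI.
    UIs before the start of the data are filled with 0.
    """
    n_total = 4 * len(lanes[0])
    out = []
    for n in range(n_total):
        m = n - offset
        out.append(lanes[m % 4][m // 4] if m >= 0 else 0)
    return out


def mux_based_ffe(lanes, taps):
    """Sum of len(taps) serialized streams, each delayed by one more UI and weighted."""
    streams = [mux_4to1(lanes, offset=k) for k in range(len(taps))]
    return [sum(c * s[n] for c, s in zip(taps, streams)) for n in range(len(streams[0]))]
```

Because every delayed copy is produced from quarter-rate data, no delay element in this model has to run at the full line rate, which matches the motivation given for the MUX-based structure; the result equals a conventional FIR applied to the serial stream.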

3 Experimental Results

The re-timer is designed in 65 nm CMOS technology. The layout of the re-timer is shown in Fig. 20. The core area of the chip is 0.11 mm².

Fig. 20. Layout of the re-timer.

Figure 21 shows the 32 Gb/s output eye diagrams of the re-timer with and without the FFE. When the output passes through a channel with −12.52 dB attenuation at 16 GHz without the FFE, the eye is closed, as shown in Fig. 21(a). With the 3-tap FFE and proper coefficients, the vertical eye opening is 200 mVpp, as shown in Fig. 21(b). With a 200 ppm frequency offset between the input data and the reference clock, the eye diagram of the recovered quarter-rate clock is shown in Fig. 22; its total jitter is 7.1 ps. The total power of the re-timer is 91 mW from a 1.1 V supply. Table 1 compares the performance of this work with prior similar works.
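For orientation, the reported figures can be converted into a few derived quantities. The sketch below is back-of-the-envelope arithmetic on the numbers quoted above; the derived values are not results reported by the paper, and the function name `link_figures` is hypothetical.

```python
def link_figures(data_rate_bps=32e9, power_w=91e-3, offset_ppm=200.0, clock_tj_s=7.1e-12):
    """Derived quantities from the reported data rate, power, offset and clock jitter."""
    ui_s = 1.0 / data_rate_bps
    offset = offset_ppm * 1e-6
    return {
        "ui_ps": ui_s * 1e12,                                    # one unit interval
        "energy_pj_per_bit": power_w / data_rate_bps * 1e12,     # power / data rate
        "drift_ui_per_bit": offset,                              # phase drift per bit
        "bits_per_ui_of_drift": 1.0 / offset,                    # bits until 1 UI of drift
        "clock_tj_in_clock_periods": clock_tj_s / (4.0 * ui_s),  # quarter-rate clock period
    }
```

With the defaults this gives a UI of 31.25 ps, roughly 2.84 pJ/bit, one full UI of phase drift every 5000 bits at 200 ppm, and a total clock jitter of about 5.7% of the 125 ps quarter-rate clock period.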

Fig. 21. Eye diagram of the output signal (a) without FFE, (b) with proper FFE coefficients.

Fig. 22. Eye diagram of the recovered quarter-rate clock with 200 ppm frequency difference.

Table 1. Performance summary

4 Conclusion

To address the high power consumption and large area of high-speed re-timers in HPC data communication, a 32 Gb/s low-power, small-area re-timer with a PI-based CDR is proposed. To further ensure signal integrity, both a CTLE and a feed-forward equalizer are adopted. To reduce power dissipation, a quarter-rate 3-tap FFE is proposed. To reduce chip area, a BBPD-based PI CDR is employed. In addition, a second-order digital filter is adopted to improve the high-speed performance of the CDR loop. The re-timer is implemented in 65 nm CMOS technology with a 1.1 V supply. The simulation results show that the proposed re-timer operates at 32 Gb/s and consumes 91 mW. The 3-tap FFE in the re-timer can equalize more than 12 dB of channel attenuation. The PI-based CDR with the second-order digital filter can tolerate a frequency difference of 200 ppm.