# A 20-Gb/s/Pin Compact Single-Ended DCC-Less DECS Transceiver With CDR-Less RX Front-End for On-Chip Links

Jaeyoung Seo, Sooeun Lee, Myungguk Lee, Graduate Student Member, IEEE, Changjae Moon, Graduate Student Member, IEEE, and Byungsub Kim, Senior Member, IEEE

Abstract—This article presents a 20-Gb/s/pin 0.0024-mm<sup>2</sup> single-ended data-embedded clock signaling (DECS) transceiver (TRX) for short-reach on-chip links. The receiver (RX) directly recovers (self-slicing) and deserializes (auto-deserialization) the data from the DECS input of the RX front-end without clock and data recovery (CDR) or clock and data alignment (CDA) circuits, while relaxing the timing requirement and improving the tolerance to duty cycle error and supply noise. At 20 Gb/s/pin, the measured horizontal eye was 0.99 UI at nominal conditions and remained equal to or larger than 0.88 UI either when the clock duty cycle changed from 40% to 60% or when a 50-MHz 300-mV<sub>p-p</sub> sinusoidal supply noise was injected into the RX from a printed circuit board (PCB). In addition, the proposed RX could tolerate a 200-MHz 300-mV<sub>p-p</sub> sinusoidal supply noise and a 200-mV<sub>p-p</sub> crest factor 7 (CF7) Gaussian supply noise, while achieving a 0.90-UI horizontal eye size in both cases. Because complex clocking circuits were removed, the active RX core excluding on-die terminations (ODTs) achieved the smallest area occupancy of 0.000058 mm<sup>2</sup> and a decent energy efficiency of 0.18 pJ/b. With the proposed technique, a compact high-speed short-reach on-chip link is feasible without expensive high-speed duty cycle correction (DCC), duty cycle detection (DCD), CDR, or CDA. The proposed TRX was fabricated in 28-nm CMOS low-power performance (LPP) technology.

Manuscript received 28 June 2022; revised 25 November 2022 and 16 March 2023; accepted 8 June 2023. Date of publication 3 July 2023; date of current version 24 October 2023. This article was approved by Associate Editor Daniel Friedman. This work was supported in part by the Commercializations Promotion Agency Outcomes (COMPA) funded by the Ministry of Science and ICT (MSIT), Korea Government, under Grant 20211100; in part by Institute of Information & Communications Technology Planning & Evaluation (IITP) funded by the Korea government (MSIT) (A Development of Intelligent PHY Interface for High-Speed PIM Data Transfer) under Grant 2022-0-01171; in part by Samsung Electronics Co., Ltd., under Grant 10201211-08055-01; in part by BK21 FOUR project of National Research Foundation (NRF) for the Department of Electrical Engineering, POSTECH; and in part by National R&D Program through the NRF of Korea funded by the Korea Government (MSIT) under Grant 2020M3H2A107804514. (Corresponding author: Byungsub Kim.)

Jaeyoung Seo and Sooeun Lee are with Samsung Electronics, Hwaseong 18448, South Korea.

Myungguk Lee and Changjae Moon are with the Department of Electrical Engineering, Pohang University of Science and Technology, Pohang 37673, South Korea.

Byungsub Kim is with the Department of Electrical Engineering, the Department of Convergence IT Engineering, the Department of Semiconductor Engineering, and the Graduate School of Artificial Intelligence, Pohang University of Science and Technology, Pohang-si 37673, South Korea, and also with the Institute for Convergence Research and Education in Advanced Technology, Yonsei University, Seoul 03722, South Korea (e-mail: byungsub@postech.ac.kr).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/JSSC.2023.3287071.

Digital Object Identifier 10.1109/JSSC.2023.3287071



Fig. 1. Application scenarios of the conventional and the proposed transceivers. (a) Conventional transceiver with shared CDR or CDA circuit, (b) conventional clock-forwarding transceiver, and (c) proposed DECS transceiver.

*Index Terms*—Auto-deserialization, data-embedded clock signaling (DECS), on-chip transmission line, self-slicing comparator, short-reach links, single-ended signaling.

## I. INTRODUCTION

SHORT-REACH links [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11] are widely used to provide high bandwidth for high-performance computing systems. For example, streaming microprocessors (SMs) are connected via many on-chip channels in graphics processing units (GPUs). Multiple GPUs are assembled on a multi-chip module (MCM) and connected via many short-reach interconnects. The techniques of 2.5-D/3-D system-in-package (SiP), such as silicon interposers [8] or high bandwidth memory (HBM) [10], [11], are used to connect various heterogeneous dies. In such applications, a higher bandwidth can be achieved with parallelism by incorporating a large number of short-reach channels.

To provide precise clocks to many transceivers (TRXs) of short-reach links, various techniques have been developed. A clock and data recovery (CDR) adopted from long-reach links can precisely recover the sampling clock from the received data signal [5] [see Fig. 1(a)]. The precise clock recovery enables reliable data recovery from the input of the RX front-end even though the eye size is significantly reduced. In general, a CDR can be shared by multiple data lanes to amortize its power and implementation cost. It is also possible to forward a clock from the transmitter (TX) side to the receiver (RX) side through a designated lane (a clock lane) instead of recovering the sampling clock from the received

0018-9200 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

data signal [1], [2], [3], [4], [6], [12], [13] [see Fig. 1(b)]. By using the forwarded clock, the sampling clock can be more easily retrieved by a CDR or a clock and data alignment (CDA) [2]. In order to reduce the power and hardware cost of a CDR/CDA, a small digitally controlled delay line (DCDL) can be used to deskew the forwarded clock and the data signal instead of using a CDR/CDA [12]. If the variation of the skew between the data and the forwarded clock is not large, the DCDL design can be very compact and power efficient [12]. This clock-forwarding technique also has a better jitter-tracking capability than the conventional TRX using a CDR. In clock-forwarding techniques, the clock lane is usually shared by multiple data lanes to amortize the power and area cost of the clock lane [14].

However, precise clocking circuits require significant power dissipation and area cost. To maintain reliable high-speed data communication with parallelism, a large number of precise high-speed clocking circuits, such as duty cycle correction (DCC) and duty cycle detection (DCD) circuits, are required. For example, the duty cycle errors are corrected by adjusting the ratio of PMOS and NMOS transistors in every delay stage [4], while a calibration loop [2] and DCD [3] are used to set the configuration of the circuits. Therefore, to correct the duty cycle error and prevent a decrease in eye width, DCC, DCD, and a calibration loop are required for every TX and every RX. Although these clock circuits might be shared by a few I/Os, they cannot be shared by many because the TRX performance is very sensitive to the front-end clocks, especially if the clock speed is reduced for time-interleaving. Because the cost associated with such clocking circuits is proportional to the I/O count, the power consumption and area cost are significant when counted for many I/Os. CDA [2] or CDR [5] also incurs significant cost because it usually requires complex analog/mixed-signal circuits and often needs additional precise duty and phase control circuits, such as DCC circuits or phase interpolators [see Fig. 1(a)].

Also, the jitter performance of conventional short-reach links is significantly degraded by supply noise due to the asymmetric active paths of the clock and data signals. Regardless of whether the clock is forwarded or recovered by a CDR, prior designs have active circuits in the clock paths and thus suffer from supply-induced jitter. Because the clock and data paths are not matched in prior designs, the supply-induced jitters of the clock and data signals are not matched either. Therefore, it is hard to tolerate large supply noise at the RX in prior designs.

To reduce the cost of high-speed clocking circuits (DCC/DCD and CDR/CDA) and also to improve jitter tracking, a 20-Gb/s/pin single-ended compact data-embedded clock signaling (DECS) TRX that works without these high-speed clocking circuits was proposed [see Fig. 1(c)] [15]. The concept of DECS has been adopted from an off-chip GDDR application [16] and modified for clock-forwarded short-reach link applications. In the prior DECS work [16], data are embedded in the clock signal and forwarded through a single lane for a differential pair of wires. An active input buffer and a DCC are utilized at the RX to generate a large sampling clock with a good duty cycle for half-rate data recovery. However, these clocking



Fig. 2. Concept of the DECS.

circuits require non-negligible area and power if numerous I/Os are considered. Also, the data and clock signals have different supply-induced jitters due to their asymmetric paths at the RX, reducing jitter tracking capability. We adopted and modified the DECS concept for clock-forwarded short-reach TRXs to remove a DCC and a clock buffer and to improve jitter tracking capability.

The RX can directly recover and deserialize the data from the DECS input without a CDR or CDA at the RX front-end. Such recovering and deserializing will be referred to as "self-slicing" and "auto-deserializing" in this article. This technique also improves the tolerance to duty cycle error and RX supply noise and thus allows precise DCC and DCD circuits to be removed. Because neither DCC/DCD nor CDR/CDA is required in the proposed TRX, the power consumption and hardware cost can be significantly reduced. In addition, the proposed DECS RX has good jitter tracking and power supply noise rejection because the data and clock inputs of the proposed DECS RX are better matched in terms of supply noise injection than those of the conventional clock-forwarded RX or the conventional DECS RX.

The rest of this article is organized as follows. Section II introduces the concept of the DECS method. Section III describes the proposed TX design. Section IV describes the proposed RX design. Section V analyzes the tolerance to duty cycle error and supply noise. Section VI analyzes the scalability of the proposed TRX to multi-lane implementation. Section VII shows the measurement results. Finally, Section VIII concludes this article.

## II. DATA-EMBEDDED CLOCK SIGNALING

In DECS, the data are embedded into the voltage level of the clock (see Fig. 2). The TXP voltage is formed by adding a data-dependent voltage to the voltage of the clock ($CK_{weak}$), which is generated by the weak driver. For example, to transmit data "1," the TXP voltage is raised above the reference clock voltage at the TX ($CK_{TX}$). Similarly, to transmit data "0," the TXP voltage stays below the reference clock voltage: the TXP voltage equals the $CK_{weak}$ voltage generated by the weak driver because data "0" adds a voltage of 0 V to the TXP voltage. The embedded data



Fig. 3. Overall architecture of the proposed DECS TRX in an example application scenario.

are transmitted through the data lane, and a clock signal is forwarded to the RX through the clock lane.

Compared with the phase-difference modulation (PDM) signaling in which the data are embedded into the transition edge of the clock [17], [18], [19], DECS is more suitable for high-speed operation. Because signals are modulated by the data with a phase difference of 1/8 UI in PDM signaling [17], [18], [19], the effective Nyquist frequency of PDM signaling is eight times as high as that of non-return-to-zero (NRZ) signaling for the same data rate. The received signal at the RX experiences the channel loss at the 8× Nyquist frequency in PDM signaling, and thus, the PDM signal is attenuated much more than the NRZ signal. Furthermore, the PDM RX requires complex circuits, such as peaking amplifiers and phase-difference amplifiers, to detect a small phase difference of 1/8 UI. Therefore, PDM signaling is not suitable for high-speed short-reach links.
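As a rough worked example of the bandwidth argument above, the following sketch evaluates the Nyquist frequencies at the 20-Gb/s data rate of this work. The function names are illustrative, and the factor of eight is taken directly from the 1/8-UI phase step quoted in the text.

```python
def nrz_nyquist_ghz(data_rate_gbps: float) -> float:
    """Nyquist frequency of NRZ signaling: half of the bit rate."""
    return data_rate_gbps / 2.0


def pdm_effective_nyquist_ghz(data_rate_gbps: float, factor: float = 8.0) -> float:
    """Effective Nyquist frequency of PDM, assumed to be `factor` times that of NRZ."""
    return factor * nrz_nyquist_ghz(data_rate_gbps)


rate_gbps = 20.0
print(f"NRZ Nyquist: {nrz_nyquist_ghz(rate_gbps):.0f} GHz")
print(f"PDM effective Nyquist: {pdm_effective_nyquist_ghz(rate_gbps):.0f} GHz")
```

Under these assumptions, a PDM link at the same data rate would have to pass signal content near 80 GHz instead of 10 GHz, which illustrates why the channel attenuates it far more.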

## III. TRANSMITTER DESIGN

Fig. 3 describes the example usage of the proposed DECS TRX in short-reach link applications [15]. Data lanes are used to transmit single-ended signals, while one clock lane is used to forward the reference clock.

Fig. 4(a) shows a schematic of the proposed TX. The TX core consists of a data driver, DCDLs, and D flip-flops. DCC circuits are employed only for testing purposes. Inverter banks that can be statically enabled are employed as drivers in order to control the driving strength. The TX adopts the relaxed impedance matching technique to improve the voltage swing with a low impedance driver while mitigating the penalty in signal integrity with RX termination [20], [21], [22], [23].

The data driver consists of a data modulation (DM) driver and a weak driver [see Fig. 4(a)]. When the DM driver is turned off ($D_{\rm DM\_even}$ = "0" and $D_{\rm DM\_odd}$ = "1"), the weak driver alone produces the clock signal with a small amplitude. While the weak driver always generates the small clock waveform, the DM driver turns on to increase the amplitude only when necessary: 1) $D_{\rm DM\_even}$ = "1" and $CK_{\rm DM}$ = high and 2) $D_{\rm DM\_odd}$ = "0" and $CK_{\rm DM}$ = low. The DM driver also serializes the half-rate data before transmission. The clock (CK) driver is almost identical to the weak driver, except that it produces the



Fig. 4. (a) Transmitter architecture and (b) simulated output waveforms.

reference clock with a larger amplitude. The strengths of the drivers are digitally controlled. The weak driver consists of smaller driver units that provide 5-bit strength resolution. The DM driver and the CK driver are similarly implemented.
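The DM driver enable conditions and the resulting DECS levels can be illustrated with a small behavioral model. This is only a sketch: the three amplitudes are hypothetical values chosen so that the reference clock level lies between the weak-only and DM-boosted TXP levels, the channel is ignored, and the comparison `TXP > CK` stands in for the RX decision described in Section II.

```python
# Illustrative DECS level model. All amplitudes are hypothetical, in mV
# relative to the common-mode level (they are not values from the article).
A_WEAK_MV = 50.0   # TXP amplitude with only the weak driver on
A_CK_MV = 200.0    # reference clock amplitude (between the two TXP levels)
A_DM_MV = 350.0    # TXP amplitude with the DM driver also on


def dm_driver_on(d_even: int, d_odd: int, ck_high: bool) -> bool:
    """DM driver enable condition as stated in the text."""
    return d_even == 1 if ck_high else d_odd == 0


def txp_level(d_even: int, d_odd: int, ck_high: bool) -> float:
    """TXP voltage: large amplitude if the DM driver is on, small otherwise."""
    amplitude = A_DM_MV if dm_driver_on(d_even, d_odd, ck_high) else A_WEAK_MV
    return amplitude if ck_high else -amplitude


def ck_level(ck_high: bool) -> float:
    """Reference clock voltage for the current clock phase."""
    return A_CK_MV if ck_high else -A_CK_MV


def slice_bit(txp_mv: float, ck_mv: float) -> int:
    """Idealized decision: data '1' when TXP is above the reference clock."""
    return 1 if txp_mv > ck_mv else 0


for d_even in (0, 1):
    for d_odd in (0, 1):
        even_bit = slice_bit(txp_level(d_even, d_odd, True), ck_level(True))
        odd_bit = slice_bit(txp_level(d_even, d_odd, False), ck_level(False))
        print(f"sent even={d_even} odd={d_odd} -> sliced even={even_bit} odd={odd_bit}")
```

In this model the high clock phase carries the even bit and the low clock phase carries the odd bit, so one comparison per phase both slices and deserializes the data.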

For a fixed weak driver strength, the amount of amplitude modulation of TXP can be raised by increasing the strength of the DM driver [see Figs. 4(b) and 5(a)]. For example, when the weak driver is configured for the minimum strength, the DM driver increases the amplitude of TXP from 48 to 366 mV as its strength configuration increases from 1 to 23 in decimal code [see Fig. 5(a)]. The DM and weak drivers always drive current in the same direction. Therefore, there is no fighting between these two as in an inverter-based feed-forward equalization (FFE) TX [23]. In such addition-only inverter-based driver design [23], the amplitude can be easily controlled by adjusting their strengths even though the drivers are non-linear. Because the large DM driver strength makes



Fig. 5. (a) Amount of DM driver modulation versus the DM driver's strength and (b) clock amplitude versus the clock driver's strength.



Fig. 6. Delay increments of the DCDL versus fine and coarse digital codes with 1.1-V supply.

RX comparators work more reliably, we manually set the DM driver to have maximum strength: linear and fine calibration of the amplitude modulation is not required. With a stronger DM driver configuration, the TRX can tolerate more phase errors between RXP and  $CK_{RX}$  as we will explain in Section IV.

Like the amplitude modulation of TXP, the amplitude of the clock signal at node $CK_{TX}$ can be adjusted by controlling the CK driver strength. The configurable range of the clock voltage swing is between 214 and 1034 mV [see Fig. 5(b)]. Because the RX requires an input voltage swing larger than 400 mV for reliable operation of the comparators, the maximum voltage swing provided by the CK driver is close to full swing at the TX output node to overcome the channel loss. This will be further discussed in Section IV.

DCDLs were used to deskew data signals, so that many data lanes can share the clock lane (see Fig. 4). A DCDL consists of four inverter stages with NMOS capacitor banks inserted between the inverter stages. The delays are coarsely and finely controlled in 12 steps and six steps, respectively. These delays are adjusted by controlling the number of connected NMOS capacitors in the DCDLs. Fig. 6 shows the delay increments of the DCDL versus digital codes. The coarse delay control step is less than 3.15 ps. The fine delay control step is less than about 0.72 ps. The minimum and maximum delays are about 0.72 ps and about 27.24 ps, respectively. In the simulation, the skew caused by channel length mismatch is about 8.35 ps/mm. With the designed range of the DCDL, the DCDL can compensate for the skew caused by a channel length mismatch of up to about 3.3 mm. DCDLs allow multiple data lanes to share one clock lane even though the channel length mismatch causes skews. In our proof-of-concept design, we manually set the control bits for the DCDL based on BER measurement without any calibration loops.
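The relation between the DCDL delay range and the compensable channel length mismatch can be checked with the numbers quoted above. The sketch below simply divides a delay range by the simulated skew per millimeter; which delay range the article used for its "about 3.3 mm" figure is an assumption here, so both candidates are printed.

```python
SKEW_PS_PER_MM = 8.35      # simulated skew per mm of channel length mismatch
DCDL_MIN_DELAY_PS = 0.72   # minimum DCDL delay quoted in the text
DCDL_MAX_DELAY_PS = 27.24  # maximum DCDL delay quoted in the text


def compensable_mismatch_mm(delay_range_ps: float,
                            skew_ps_per_mm: float = SKEW_PS_PER_MM) -> float:
    """Channel length mismatch whose skew fits within the given delay range."""
    return delay_range_ps / skew_ps_per_mm


from_max = compensable_mismatch_mm(DCDL_MAX_DELAY_PS)
from_span = compensable_mismatch_mm(DCDL_MAX_DELAY_PS - DCDL_MIN_DELAY_PS)
print(f"using the maximum delay: {from_max:.2f} mm")
print(f"using the max-min span:  {from_span:.2f} mm")
```

Both estimates land near 3.2 to 3.3 mm, which is consistent with the compensation range stated in the text.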



Fig. 7. Receiver architecture.

DCC circuits are added only to test the tolerance to a duty cycle error. A DCC circuit consists of an inverter and fine-controlled and coarse-controlled tri-state inverter banks. The inverter is always turned on. The duty cycles are coarsely and finely controlled by the number of enabled PMOS and NMOS transistors in the tri-state inverter banks. The DCCs provide a duty cycle range from 29% to 70%, which is sufficient to test the tolerance to the duty cycle error in the experiment.

## IV. RECEIVER DESIGN

The RX core consists of only on-die terminations (ODTs), non-clocked self-slicing comparators, and dynamic latches and thus is very area-efficient (see Fig. 7). The termination voltages ($V_{\rm TT}$) of the ODTs are VDD<sub>RX</sub>/2. The RX has N-type and P-type paths. Each type of path selectively reacts depending on the input common-mode voltage level: the N-type path evaluates the data when the input common-mode voltage is low, and the P-type path evaluates the data when the input common-mode voltage is high. The RX front-end does not take any clock input from a CDR or CDA, neither for slicing nor for deserialization. Therefore, slicing and deserialization of the DECS input are triggered by the DECS input itself without any extra sampling clock.

N-type and P-type self-slicing comparators simply consist of pull-down NMOSs and pull-up PMOSs with cross-coupled inverters, respectively (see Fig. 8). A pseudo-differential pair of 4-bit binary-weighted transistor banks at the input of a comparator can adjust the threshold voltage to compensate for the offset voltage variation (see Fig. 8). While one transistor per bank is always enabled, the number of enabled transistors is controlled for the threshold voltage adjustment. Fig. 8 also shows the simulated histograms of the input-referred offset voltages of the comparators when all digitally controlled input transistor banks are turned off. The histograms were acquired by Monte Carlo simulation with 1000 samples. The $3\sigma$ input-referred offset voltages of the N-type and P-type comparators are 89.4 and 57.6 mV, respectively. The offset variations are large due to the small input transistor sizes of the comparators. To provide compensation ranges for more than the $3\sigma$ input-referred offset variations, the N-type and the P-type comparators can increase or decrease the threshold voltages by up to 113 and 114 mV, respectively. The least



Fig. 8. Schematics of self-slicing comparators and their histograms of the input-referred offset voltages. (a) N-type self-slicing comparator and (b) P-type self-slicing comparator.

significant bit (LSB) threshold voltages of the N-type and the P-type comparators are 6.85 and 8.74 mV in simulation, respectively. The maximum differential non-linearities (DNLs) of the threshold voltages of the N-type and the P-type comparators are 1.5 and 2.0 LSB, respectively. In our proof-of-concept design, the input-referred offsets of the comparators are manually compensated based on the BER measurement without any offset cancellation algorithm. In the simulation, by using the method explained in [24], we simulated the input-referred random noise of the comparator: the calculated root mean square (rms) input-referred random noises of the N-type and P-type comparators are 1.168 and 1.39 mV<sub>RMS</sub>, respectively. By multiplying the rms input-referred random noise by 15.88 [25], we can calculate the peak-to-peak random noise associated with the BER of 10<sup>-15</sup>: the calculated peak-to-peak input-referred random noises of the N-type and P-type comparators are 18.55 and 22.07 mV<sub>p-p</sub>, respectively. However, about 22 mV of input-referred random noise is insignificant because the RXP amplitude modulation is about 200 mV in the simulation.
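The rms-to-peak-to-peak conversion above can be reproduced numerically. The sketch below assumes Gaussian noise and the common convention that the peak-to-peak span for a target BER is twice the one-sided Gaussian tail quantile; under that assumption the multiplier for a BER of 10<sup>-15</sup> comes out at about 15.88, matching the factor cited from [25]. The function name is illustrative.

```python
from statistics import NormalDist


def peak_to_peak_factor(ber: float) -> float:
    """Multiplier from rms to peak-to-peak noise for Gaussian noise at a given BER.

    Assumes the convention p-p = 2 * Q^-1(BER), where Q is the Gaussian tail
    probability function.
    """
    return -2.0 * NormalDist().inv_cdf(ber)


factor = peak_to_peak_factor(1e-15)
print(f"multiplier at BER 1e-15: {factor:.2f}")

# rms values (mV) quoted in the text for the two comparator types
for name, rms_mv in (("N-type", 1.168), ("P-type", 1.39)):
    print(f"{name}: {rms_mv * factor:.2f} mV p-p")
```

Compared with the roughly 200-mV RXP modulation quoted in the text, the resulting peak-to-peak noise is about one tenth of the signal, which supports treating it as insignificant.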

Depending on the RXP and $CK_{RX}$ voltage levels, the N-type self-slicing comparator [15] tracks the inputs ("tracking" phase) and evaluates the outputs ("evaluation" phase). Fig. 9 shows the example simulated waveforms of the N-type self-slicing comparator and the N-type dynamic latch [15]. When both RXP and $CK_{RX}$ voltages are high (case 2 in Fig. 9), the output voltages $comp_N$ and $comp_N\_b$ are pulled down low like the reset operation of the conventional comparator [8],



Fig. 9. Simulated timing diagrams of an N-type comparator and an N-type dynamic latch.



Fig. 10. Simulated output waveforms of an example N-type self-slicing comparator with various input configurations (a) with a large  $CK_{RX}$  amplitude and a small RXP modulation, (b) with a moderate  $CK_{RX}$  amplitude and a large RXP modulation, and (c) with a small  $CK_{RX}$  amplitude and a small RXP modulation.

but this operation is slightly different from reset. In our design, the voltage difference between comp<sub>N</sub> and comp<sub>N</sub>\_b is set proportional to the differential DECS input voltage by adopting the concept of soft-decision [8], [26]. This voltage difference between comp<sub>N</sub> and comp<sub>N</sub>\_b during the tracking phase allows the comparator to prepare for evaluation (see Fig. 9). When the input common-mode voltage becomes low, the cross-coupled PMOSs start pulling up comp<sub>N</sub> or comp<sub>N</sub>\_b because the cross-coupled PMOSs become stronger than the pull-down NMOSs. As a result of regeneration by the cross-coupled PMOSs and NMOSs, only one of comp<sub>N</sub> or comp<sub>N</sub>\_b becomes high, depending on the DECS input. It is notable that the soft-decision technique during the tracking phase always helps evaluation whether the input voltage difference is small [see Fig. 10(a)] or large [see Fig. 10(b)] because the polarity of the voltage difference between comp<sub>N</sub> and comp<sub>N</sub>\_b is always set correctly during the tracking phase. In Fig. 10(a), CK<sub>RX</sub> is slightly higher than RXP in the tracking phase, and thus, comp<sub>N</sub> is slightly higher than comp<sub>N</sub>\_b. The small voltage difference between comp<sub>N</sub> and comp<sub>N</sub>\_b in the tracking phase helps the comparator quickly split the output in the evaluation phase [see Fig. 10(a)]. In Fig. 10(b), comp<sub>N</sub> is much larger than comp<sub>N</sub>\_b because the large input difference is amplified during the tracking phase. In this case, the output evaluation is about halfway done during the tracking phase because the polarity of the voltage difference is set appropriately. This also greatly helps evaluation. However, when both input voltages are low [see Fig. 10(c)], the comparator does not work



Fig. 11. Simulated shmoo plots of (a) N-type and (b) P-type self-slicing comparators. In simulation, the clock voltage level is centered between the high- and the low-voltage levels of RXP.



Fig. 12. Simulated input and output waveforms of the proposed N-type comparator (a) without RXP phase error and (b) with RXP phase error.

correctly because the input transistor banks cannot pull down the node voltages of $comp_N$ and $comp_N\_b$ to track the input. The operation of the comparator is more reliable with a large $CK_{RX}$ voltage and a large amount of RXP modulation.

The output results of the N-type comparator are updated in the N-type latch through the connected pull-down NMOS input of the following latch (see Fig. 9). The output, rDe, is pulled up high when the comparator outputs $comp_N$ and $comp_N\_b$ are low and high, respectively, whereas the latch holds the previous results when both $comp_N$ and $comp_N\_b$ are low. Similarly, the P-type self-slicing comparator and the P-type latch make a decision when the common-mode voltage of the DECS input is high.

For a correct decision, the clock amplitude and the amount of amplitude modulation at the RX inputs should be properly selected (see Fig. 11). With a clock amplitude smaller than 80 mV [$V_F$ in Fig. 11(a)], the N-type comparator fails no matter how large the amount of amplitude modulation is. Similarly, with a clock amplitude less than 200 mV [$V_F$ in Fig. 11(b)], the P-type comparator fails no matter how large the amount of amplitude modulation is. Therefore, in our design, the clock amplitude should be larger than 200 mV to ensure successful operation of both N-type and P-type comparators. The maximum channel loss that can be overcome by the proposed TRX is about -4.4 dB in our simulation. Theoretically, the proposed TRX cannot work with a large channel loss due to the reduced amplitudes of the clock and the data. In such a case, we can easily solve the problem by adding identical amplifiers for both data and clock inputs in front of the proposed RX.
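The clock amplitude limits above can be combined with a channel loss figure in a short sanity check. This is a minimal sketch under stated assumptions: the loss is treated as a flat voltage attenuation, the transmitted amplitude passed in is a hypothetical input, and only the clock amplitude condition is checked. The shmoo plots in Fig. 11 also depend on the amount of amplitude modulation, which this sketch does not model.

```python
N_TYPE_MIN_CLOCK_MV = 80.0    # V_F of the N-type comparator, Fig. 11(a)
P_TYPE_MIN_CLOCK_MV = 200.0   # V_F of the P-type comparator, Fig. 11(b)


def attenuate_mv(amplitude_mv: float, loss_db: float) -> float:
    """Apply a flat channel loss (given as a positive dB magnitude) to a voltage."""
    return amplitude_mv * 10.0 ** (-loss_db / 20.0)


def clock_amplitude_sufficient(rx_clock_mv: float) -> bool:
    """Necessary (not sufficient) condition for both comparator types to work."""
    return rx_clock_mv > max(N_TYPE_MIN_CLOCK_MV, P_TYPE_MIN_CLOCK_MV)


tx_clock_mv = 400.0  # hypothetical transmitted clock amplitude
rx_clock_mv = attenuate_mv(tx_clock_mv, loss_db=4.4)
print(f"RX clock amplitude: {rx_clock_mv:.1f} mV")
print(f"clock condition met: {clock_amplitude_sufficient(rx_clock_mv)}")
```

A 4.4-dB loss scales a voltage by roughly 0.6, so the transmitted clock must be well above the 200-mV limit for the received clock to remain above it.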

Furthermore, the phase error between RXP and  $CK_{RX}$  should be appropriately adjusted to prevent decision errors in the comparator [see Fig. 12(a)]. With a large phase error,



Fig. 13. Simulated eye diagrams of the recovered data rDeven and rDodd with the duty cycle of (a) 50% and (b) 30%.

the soft-decision in tracking phase can be inverted in evaluation phase if the polarity of the voltage differences between RXP and CK<sub>RX</sub> is flipped at the end of tracking phase before starting regeneration of output voltages in evaluation phase [see Fig. 12(b)]. When the weak and DM drivers are configured for the minimum and the maximum strengths, respectively, the proposed comparator can tolerate the phase error between about -0.8 and 5 ps with respect to CK<sub>RX</sub>. For a fixed weak driver strength, the comparator can tolerate more phase error as the DM driver strength increases; thus, increasing the DM driver strength helps make the comparator work more reliably. The phase error beyond the phase-error tolerance can be compensated by DCDLs in the TX.

Owing to self-slicing and auto-deserialization at the RX front-end, the recovered horizontal eye size is about 1 UI. The data are automatically deserialized to half-rate data in the proposed RX. Because the eye diagram of the recovered data (rDeven and rDodd) after the dynamic latches becomes very large (rail-to-rail) after self-slicing and auto-deserialization, the timing requirement for a local RX clock to fetch these data is much less stringent than that of a conventional RX: the timing margins are 79 and 84 ps for rDodd and rDeven, respectively (see Fig. 13). In contrast to the proposed RX, a conventional RX front-end employs clocked comparators that require precise timing to recover the data from a small eye diagram.

## V. RELIABILITY

This section qualitatively analyzes and explains the tolerance to duty cycle error and supply noise.

### A. Duty Cycle Error

The proposed DECS TRX is tolerant to duty cycle errors because duty cycle errors and transitions of RXP and  $CK_{RX}$  are coupled in the DECS TRX. Because the same clock source is used to generate RXP and  $CK_{RX}$  clock signal at the TX, duty cycle errors and transitions of RXP and  $CK_{RX}$  are coupled [see Fig. 14(a)]. The proposed RX automatically makes a bit decision based on the coupled transition of RXP and  $CK_{RX}$ , and thus, the duty cycle error does not significantly affect the



Fig. 14. Simulated eye diagrams of the RX front-end inputs with the TX clock's duty cycles of 50% and 30%. (a) Proposed DECS TRX and (b) conventional TRX.

RX operation. In simulation, the RX correctly recovers the received bits at both positive and negative transition edges regardless of a duty cycle error. For example, the proposed RX can tolerate a duty cycle of 30% without horizontal eye size degradation in simulation, while the horizontal eye width of the conventional TRX greatly decreases with the duty cycle error (see Figs. 13 and 14). Ideally, the recovered auto-deserialized data of the proposed RX at the interface with the digital circuits have a 2-UI (100 ps) eye width because the clock period is preserved even with the duty cycle error. For example, rDeven is determined at every falling transition of the clock. However, the eye size of the recovered data is reduced to 79-84 ps by the deterministic jitter that is mainly caused by the different slopes of the positive and negative transition edges (see Fig. 13). These different slopes cause different output delays of the comparators, which appear as data-dependent jitter.
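The eye-width figures in this paragraph follow from simple arithmetic at the 20-Gb/s data rate. The sketch below is only a bookkeeping aid; it attributes the full gap between the ideal and simulated eye widths to deterministic jitter, as the text suggests, and the variable names are illustrative.

```python
DATA_RATE_GBPS = 20.0

ui_ps = 1000.0 / DATA_RATE_GBPS   # one unit interval in ps
ideal_eye_ps = 2.0 * ui_ps        # half-rate deserialized data: 2 UI

# Simulated timing margins quoted in the text (ps)
simulated_eye_ps = {"rDodd": 79.0, "rDeven": 84.0}

# Eye closure relative to the ideal 2-UI width
eye_closure_ps = {name: ideal_eye_ps - eye for name, eye in simulated_eye_ps.items()}

for name, eye in simulated_eye_ps.items():
    print(f"{name}: {eye:.0f} ps = {eye / ui_ps:.2f} UI, "
          f"closure {eye_closure_ps[name]:.0f} ps")
```

Even after this closure, each recovered data stream keeps more than 1.5 UI of margin for the local RX clock.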

In contrast, duty cycle errors significantly reduce the eye size of a conventional TRX. Unlike the proposed TRX, the duty cycles of the TX clock and the RX clock might be significantly different without DCC circuits in a conventional TRX because these clocks are generated by different circuits such as a TX phase-locked loop (PLL), a CDR, or a DCDL. These circuits add uncorrelated duty cycle errors to the TX and the RX clocks, and thus, their duty cycles are significantly different without correction circuits. The sizes of the even and odd eye diagrams of the received signal change with the TX clock's duty cycle error because the transmitted data are serialized with the TX clock. Because the full-rate eye size of the received signal is dominantly determined by the smaller eye size, the full-rate eye size decreases in proportion to the duty cycle error of the TX clock [see Fig. 14(b)]. The RX clock's duty cycle error can make this problem even worse. For example, if the RX utilizes a CDR that sets the RX clock 0.5 UI apart from one of the eye edges, then the RX may fail to sample the data correctly because of the reduced time interval between edges. Therefore, the conventional TRX must have DCC circuits to prevent



Fig. 15. Impact of supply noises in (a) proposed DECS TRX and (b) conventional TRX.

this problem. Such DCC circuits add significant hardware and power cost to the TRX.
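The proportional eye reduction described for the conventional TRX can be sketched with an idealized model. It assumes half-rate serialization in which the even bit lasts for the high phase of the TX clock and the odd bit for the low phase, with no ISI, jitter, or RX clock error; the function name and this model are illustrative, not the article's analysis.

```python
def conventional_full_rate_eye_ui(duty_cycle: float) -> float:
    """Ideal full-rate eye width (in UI) of a half-rate-serialized signal.

    One clock period spans 2 UI, so the two bit slots last 2*d and 2*(1 - d) UI.
    The full-rate eye is limited by the shorter slot.
    """
    even_eye_ui = 2.0 * duty_cycle
    odd_eye_ui = 2.0 * (1.0 - duty_cycle)
    return min(even_eye_ui, odd_eye_ui)


for duty in (0.5, 0.4, 0.3):
    print(f"duty cycle {duty:.0%}: eye <= {conventional_full_rate_eye_ui(duty):.2f} UI")
```

In this ideal model a 30% duty cycle already limits the conventional eye to 0.6 UI before any other impairment is counted, which illustrates why such links need duty cycle correction.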

### B. Supply Noise

The proposed self-slicing RX is much less sensitive to the RX supply noise than a typical RX with conventional sense amplifiers. Because the two inputs of a self-slicing comparator are matched better than those of a usual RX [see Fig. 15(a)], the proposed RX is more tolerant to the RX supply noise than a conventional RX. Due to the good symmetry of the self-slicing comparator, a larger portion of the input-referred RX supply noise is common-mode noise, which can be easily rejected by the nature of the differential design. In addition, the proposed RX rarely suffers from supply-induced clock jitter. Because the RX clock CK<sub>RX</sub> is directly provided from the channel without using circuit elements that can cause supply-induced jitter, the RX supply noise barely induces RX clock jitter. Instead, the TX supply noise induces the RX clock jitter. However, this jitter is coupled with the RXP jitter, which is induced by the same TX supply noise. Because these jitters are coupled, the differential structure of the self-slicing comparator rejects the common jitter component.

However, a conventional RX is more vulnerable to the RX supply noise. A conventional slicer usually takes a reference voltage input, which is locally generated at the RX, while the other input is connected to the channel. This asymmetric structure causes poor supply noise rejection. In addition, the RX supply noise significantly induces RX clock jitter because the RX clock is usually recovered by utilizing a CDR or a CDA as well as clock buffers. These circuit elements add jitter to the RX clock if noise is added to their supply voltages at the RX. In contrast, the proposed RX does not suffer from such jitter because it does not utilize such circuits. The conventional RX has an additional noise source in the reference voltage input that the proposed RX does not have. This noise could be large in single-ended signaling due to simultaneous switching noise and a poor return path. This reference noise directly reduces the voltage margin of the eye diagram. The input data signal also has jitter induced by the supply noise at the TX. All these noise and jitter sources are uncorrelated or only loosely correlated, and thus, they cooperatively reduce the eye size [see Fig. 15(b)].



Fig. 16. Post-layout simulated delay and voltage attenuation at the  $CK_{RX}$  of the farthest RX front-end input. Ten data lanes were driven by one clock lane.

#### VI. MULTI-LANE SCALABILITY

Owing to the compact RX design, ten data lanes can share one clock lane without buffers. With ten data lanes, the RX input capacitance extracted from the post-layout simulation is about 175 fF; such a small load is possible because of the small parasitic capacitance of the comparators. The clock driver can transmit the clock signal to ten RXs without causing significant delay or voltage attenuation. Fig. 16 shows the RX front-end clocks with one data lane and with ten data lanes: the delay difference is about 4.8 ps, which can be compensated by the DCDL at the TX. When we increased the parasitic capacitance to 325 fF to account for potential off-chip applications, the delay increased to 16.1 ps, which is also within the compensation range. In a Monte Carlo simulation, the clock delay histogram was acquired from 1000 samples: the mean and $3\sigma$ of the delay are 4.91 and 2.53 ps, respectively. This mismatch-induced delay can be canceled by using the DCDL at the TX if necessary. In the post-layout simulation, the data are successfully recovered at the farthest RX without a bit error. However, the peak-to-peak jitter of the clock increases from 0.15 to 2.1 ps (nominal). The clock delay also varies with supply voltage and temperature: about 2.5 ps/100 mV with supply voltage and 0.63 ps/100 °C with temperature (at the SS corner). These delays caused by process-voltage-temperature (PVT) variation can be canceled by using small DCDLs at the TX.

Based on the post-layout simulation of ten data lanes, the area, power consumption, and shoreline density were estimated. The estimated total area including ten TXs, ten RXs, and one clock driver is 0.0288 mm<sup>2</sup>: the estimated areas of ten TXs, ten RXs, and one clock driver are 0.01938, 0.0075, and 0.00194 mm<sup>2</sup>, respectively. The estimated total power consumption including ten TXs, ten RXs, and one clock driver is about 208.2 mW: the estimated power consumptions of ten TXs, ten RXs, and one clock driver are 174, 30, and 4.2 mW, respectively. The calculated shoreline density is 0.631 Tb/s/mm. Although the multi-lane implementation was not fabricated for verification because of limited resources, these estimates show that a high-density multi-lane implementation is possible with our DECS TRX.
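The totals quoted above can be cross-checked with a few lines of arithmetic. The sketch below only sums the per-block estimates given in the text; the die-edge length it prints is an inference from the quoted shoreline density and the 10 × 20 Gb/s aggregate rate, not a dimension reported by the authors, and all variable names are illustrative.

```python
# Per-block estimates quoted in the text for one clock lane driving ten data lanes.
area_mm2 = {"ten TXs": 0.01938, "ten RXs": 0.0075, "clock driver": 0.00194}
power_mw = {"ten TXs": 174.0, "ten RXs": 30.0, "clock driver": 4.2}

LANES = 10
RATE_GBPS_PER_PIN = 20.0
SHORELINE_TBPS_PER_MM = 0.631  # calculated density quoted in the text

total_area_mm2 = sum(area_mm2.values())
total_power_mw = sum(power_mw.values())
aggregate_tbps = LANES * RATE_GBPS_PER_PIN / 1000.0

# Inference only: the die-edge length that would be consistent with the quoted density.
implied_edge_mm = aggregate_tbps / SHORELINE_TBPS_PER_MM

print(f"total area  : {total_area_mm2:.5f} mm^2")
print(f"total power : {total_power_mw:.1f} mW")
print(f"aggregate   : {aggregate_tbps:.3f} Tb/s over ~{implied_edge_mm:.3f} mm of edge")
```

The summed area and power agree with the 0.0288 mm<sup>2</sup> and 208.2 mW totals stated above.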



Fig. 17. Chip microphotograph.



Fig. 18. Peripheral and test-support blocks.

#### VII. MEASUREMENT RESULT

For verification, we fabricated the proposed DECS TRX with test-support blocks in 28-nm CMOS low-power performance (LPP) technology (see Fig. 17). The TRX occupies only 0.0024 mm<sup>2</sup>, excluding test-support blocks. A pseudo-random binary sequence 31 (PRBS31) pattern is generated by the on-chip PRBS generator [27] at the TX (see Fig. 18). The errors of the recovered data are detected by the on-chip PRBS checker [27] at the RX. The on-chip PRBS generator and checker are used only for testing purposes. To prevent overflow while counting a long bit stream, 53-bit bit error rate (BER) counters were used (see Fig. 18).
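To give a feel for why a 53-bit counter is long enough, the following sketch estimates how long such a counter can run before wrapping. The counting rates are assumptions: the text does not say whether a counter advances once per bit at the full 20-Gb/s line rate or once per bit of a single deserialized 10-Gb/s stream, so both cases are shown. The function name is hypothetical.

```python
COUNTER_BITS = 53          # counter width stated in the text
SECONDS_PER_DAY = 86_400

def seconds_to_overflow(counter_bits: int, counts_per_second: float) -> float:
    """Time until a counter that starts at zero wraps around."""
    return (2 ** counter_bits) / counts_per_second

# Assumption: one count per received bit at the full 20-Gb/s line rate.
days_at_line_rate = seconds_to_overflow(COUNTER_BITS, 20e9) / SECONDS_PER_DAY
# Assumption: one count per bit of a single deserialized 10-Gb/s stream.
days_at_half_rate = seconds_to_overflow(COUNTER_BITS, 10e9) / SECONDS_PER_DAY

print(f"20 Gcount/s: {days_at_line_rate:.2f} days before overflow")
print(f"10 Gcount/s: {days_at_half_rate:.2f} days before overflow")
```

Under either assumption the counter lasts for more than five days of continuous counting, which is far longer than a BER measurement down to the 10<sup>-10</sup> range requires.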

Fig. 19(a) shows the measurement setup. An external 10-GHz clock is provided by a Keysight N4903A to the TX core, whereas external TX and RX clocks, CK<sub>TXM</sub> and CK<sub>RXM</sub>, are used for the on-chip monitoring circuits on the TX and RX sides, respectively. On-chip bathtub curves are measured by shifting the phase of the external RX clock (CK<sub>in</sub>). A programmable delay line (Colby Instruments PDL-100A-625PS) shifts the CK<sub>in</sub> phase with 0.5-ps resolution. Additional supply noise is applied on the printed circuit board (PCB) by a Keysight 81160A to test the tolerance to supply noise.

The test chip was packaged using a chip-on-a-board (COB) assembly with single-layer capacitors (SLCs) [see Fig. 19(b)].



Fig. 19. (a) Measurement setup and (b) chip-on-board (COB) microphotograph with SLCs.



Fig. 20. Schematic of on-chip monitoring circuit.



Fig. 21. (a) Structure of on-chip transmission line and (b) its simulated channel loss, S21.

The SLCs are used as decoupling capacitors to reduce the noise on the power supply traces, and their capacitance is about 62 pF.

The voltages at nodes $CK_{TX}$, TXP, $CK_{RX}$, and RXP can be measured using on-chip monitoring circuits (see Fig. 20). An on-chip monitoring circuit consists of NMOS- and PMOS-input double-tail latch-type comparators [28], switches, a retiming flip-flop, gating circuits, and 53-bit counters. These on-chip monitoring circuits are used only to support tests.

To emulate an on-chip low-loss short-reach channel, a 1-mm 50-$\Omega$ on-chip transmission line is utilized [see Fig. 21(a)]. The width and the spacing of the channel are 4 and 12 $\mu$m, respectively. Ground shield metals are placed between the data and clock lanes to reduce crosstalk noise. Based on the simulation, the RLGC parameters of the on-chip transmission line are $R=18.9~\mathrm{k}\Omega/\mathrm{m}$, $L=390.5~\mathrm{nH/m}$, $G=0.29~\mathrm{mS/m}$, and $C=0.17~\mathrm{nF/m}$. The channel loss simulated with a 3-D field solver, HFSS, is $-1.5~\mathrm{dB}$ [see Fig. 21(b)], whereas the channel loss measured using the on-chip monitoring circuits is $-2.5~\mathrm{dB}$ at the Nyquist frequency (10 GHz).
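As a rough sanity check, the quoted RLGC values can be plugged into the standard telegrapher's-equation expressions for the characteristic impedance and propagation constant. This is only a sketch under stated assumptions: the RLGC values are treated as frequency-independent (the text does not say at which frequency they were extracted), and the loss is that of a perfectly matched 1-mm line, so it is not expected to reproduce the HFSS or measured figures exactly.

```python
import cmath
import math

# RLGC values quoted in the text (per metre).
R = 18.9e3     # ohm/m
L = 390.5e-9   # H/m
G = 0.29e-3    # S/m
C = 0.17e-9    # F/m

LENGTH_M = 1e-3      # 1-mm on-chip line
F_NYQUIST = 10e9     # Nyquist frequency at 20 Gb/s

def line_constants(f_hz: float) -> tuple[complex, complex]:
    """Return (propagation constant in 1/m, characteristic impedance in ohm)."""
    w = 2.0 * math.pi * f_hz
    z_series = complex(R, w * L)   # series impedance per metre
    y_shunt = complex(G, w * C)    # shunt admittance per metre
    return cmath.sqrt(z_series * y_shunt), cmath.sqrt(z_series / y_shunt)

gamma, z0 = line_constants(F_NYQUIST)
z0_lossless = math.sqrt(L / C)                       # high-frequency limit
loss_db = 20.0 * math.log10(math.e) * gamma.real * LENGTH_M  # nepers to dB

print(f"lossless Z0      : {z0_lossless:.1f} ohm")
print(f"Z0 at 10 GHz     : {z0.real:.1f} {z0.imag:+.1f}j ohm")
print(f"matched-line loss: {loss_db:.2f} dB over 1 mm")
```

With these inputs the lossless impedance comes out near 48 Ω, consistent with the nominal 50-Ω design, and the matched-line loss is roughly 1.6 dB, in the same range as the simulated value quoted above.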

Fig. 22 shows the on-chip measured BER bathtub curves under various conditions: 1) at nominal; 2) with various duty cycle errors; and 3) with RX supply noise additionally injected on the PCB.

With a PRBS31 pattern, the maximum data rate of 20 Gb/s/pin was achieved with a horizontal eye size of 0.99 UI at BER $< 10^{-10}$ at nominal [see Fig. 22(a)]. This wide horizontal eye is made possible by self-slicing and auto-deserialization at the RX front-end. Owing to the wide horizontal eye of 0.99 UI and the rail-to-rail signal swing after self-slicing/auto-deserialization at the RX front-end, the timing requirement for the local RX clock to fetch the deserialized data is much more relaxed than in a usual RX. The rms random jitter can be calculated from the slope of the Q-scale version of the BER curve [29]: the calculated rms jitters of rDeven and rDodd are 1.07 and 1.225 ps<sub>rms</sub>, respectively. The horizontal eye margin at a BER of $10^{-15}$ is extrapolated as 0.93 UI by using the rms jitter value [25], [29]. The proposed TRX could tolerate a clock duty cycle ranging from 40% to 60% while maintaining a horizontal eye size of 0.88 UI at a BER of $10^{-9}$ [see Fig. 22(b)]. This shows that the proposed DECS TRX can work without high-speed DCC and DCD circuits. To test the tolerance to the RX supply noise, additional supply noise is injected on the PCB. An eye width of 0.88 UI was measured with the 50-MHz 300-mV<sub>p-p</sub> sinusoidal RX supply noise measured on the PCB [see Fig. 22(c)]. In the simulation, the on-chip supply noise at node VDD<sub>RX</sub> was 200 mV<sub>p-p</sub>, including switching noise and about 120 mV<sub>p-p</sub> noise, when the 300-mV<sub>p-p</sub> 50-MHz sinusoidal RX noise was applied to the voltage source ($V_{\text{supply}}$) (see Fig. 23). In addition, the proposed RX can tolerate a 200-MHz 300-mV<sub>p-p</sub> sinusoidal supply noise [see Fig. 22(d)] and a 200-mV<sub>p-p</sub> crest factor 7 (CF7) Gaussian RX supply noise [see Fig. 22(e)], while achieving an eye width of 0.90 UI at a BER of $10^{-9}$ in both cases. This result shows that our DECS TRX is also very tolerant to supply noise.
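The Q-scale extrapolation mentioned above can be illustrated with a simplified Gaussian-tail model. The sketch below is not the authors' procedure: it assumes a 50-ps UI (1/20 Gb/s), the same rms jitter on both eye edges, purely Gaussian tails, and no transition-density correction, and the helper names are hypothetical. Each edge is moved inward by the extra number of sigmas needed to go from the measured BER to the target BER.

```python
from statistics import NormalDist

UI_PS = 50.0  # assumption: 1 UI = 1 / (20 Gb/s)

def q_of_ber(ber: float) -> float:
    """Gaussian Q value whose one-sided tail probability equals `ber`."""
    return -NormalDist().inv_cdf(ber)

def extrapolate_eye_ui(eye_ui: float, ber_meas: float, ber_target: float,
                       sigma_ps: float, ui_ps: float = UI_PS) -> float:
    """Shrink a measured eye width to a lower BER, assuming Gaussian edges."""
    extra_sigmas = q_of_ber(ber_target) - q_of_ber(ber_meas)
    return eye_ui - 2.0 * extra_sigmas * sigma_ps / ui_ps

# rms jitter values quoted in the text for the two deserialized streams.
eye_even = extrapolate_eye_ui(0.99, 1e-10, 1e-15, sigma_ps=1.07)
eye_odd = extrapolate_eye_ui(0.99, 1e-10, 1e-15, sigma_ps=1.225)

print(f"rDeven: {eye_even:.3f} UI at BER 1e-15")
print(f"rDodd : {eye_odd:.3f} UI at BER 1e-15")
```

Under these assumptions the estimate lands at roughly 0.91 to 0.92 UI, in the same neighborhood as the 0.93 UI quoted above; exact agreement is not expected because the measured bathtub slopes and fitting details are not reproduced here.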

Fig. 22. On-chip measured BER bathtub curves (a) at nominal, (b) with various duty cycles, (c) with an injected 50-MHz sinusoidal RX supply noise, (d) with an injected 200-MHz sinusoidal RX supply noise, and (e) with an injected CF7 RX Gaussian supply noise.

Pie charts of the power consumption breakdowns are shown in Fig. 24. At the data rate of 20 Gb/s/pin, the energy efficiency is 1.27 pJ/b at nominal. The total energy efficiency of the TX including TX clocking circuits is 1.09 pJ/b. The drivers dominate the total TX power consumption: the clock, DM, and weak drivers dissipate 38.2%, 21.1%, and 18.1%, respectively. The energy efficiency of the TX local clocking circuit is about 0.21 pJ/b, which accounts for 19.5% of the 1.09 pJ/b. The power consumption of the DCC and peripheral circuits that are used only for test purposes is not included in the power consumption calculation. At the RX, the non-clocked self-slicing comparators account for 80.8% of the total RX power consumption because a large static current always flows from the supply. The power consumed by the voltage termination is about 0.6 mW, which is 16.7% of the total RX power consumption. The demultiplexing circuits other than the front-end 1:2 demultiplexer were not implemented in the test chip, and thus, the power consumption of further demultiplexing was not counted. Although most of the power is dissipated in the self-slicing comparators, the total energy efficiency of the RX front-end, which has no RX clocking circuits, is only 0.18 pJ/b owing to the simple RX architecture. In contrast, a conventional RX usually consumes more power because it has precise clocking circuits that consume considerable power and area. Assuming ten data lanes can share one clock, the power consumption of the clock driver can be amortized across ten data lanes: the energy efficiency per lane would be about 0.86 pJ/b.
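The relation between the energy-efficiency and power figures above, and the effect of amortizing a shared clock lane, can be sketched as follows. The conversion pJ/b × Gb/s = mW is exact; the split of the 1.27 pJ/b into a per-lane part and a shared part is only an inference from the two quoted totals (1.27 and 0.86 pJ/b), not a breakdown given by the authors, and the function names are illustrative.

```python
DATA_RATE_GBPS = 20.0

def power_mw(energy_pj_per_bit: float, rate_gbps: float = DATA_RATE_GBPS) -> float:
    """pJ/b multiplied by Gb/s gives mW."""
    return energy_pj_per_bit * rate_gbps

def implied_shared_energy(e_total: float, e_amortized: float, n_lanes: int) -> float:
    """Shared (clock-lane) energy implied by
    e_total = e_lane + e_shared and e_amortized = e_lane + e_shared / n_lanes."""
    return (e_total - e_amortized) * n_lanes / (n_lanes - 1)

def amortized_energy(e_total: float, e_shared: float, n_lanes: int) -> float:
    """Per-lane energy when the shared part is split across n_lanes."""
    return e_total - e_shared * (n_lanes - 1) / n_lanes

rx_power_mw = power_mw(0.18)                  # RX front-end
termination_share = 0.6 / rx_power_mw         # 0.6-mW termination quoted in the text
shared_energy = implied_shared_energy(1.27, 0.86, 10)
per_lane = amortized_energy(1.27, shared_energy, 10)

print(f"RX power          : {rx_power_mw:.2f} mW")
print(f"termination share : {100 * termination_share:.1f} %")
print(f"implied shared    : {shared_energy:.3f} pJ/b")
print(f"amortized per lane: {per_lane:.2f} pJ/b")
```

The RX power of about 3.6 mW obtained this way is consistent with the 0.6-mW termination being 16.7% of the RX total.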

Fig. 23. Simulated on-chip supply noise at node VDD<sub>RX</sub> when the supply noise is injected through $V_{\text{supply}}$.

Power consumptions of the proposed TX driver and RX front-end are plotted for activity factors of 0.5 and 1 in Fig. 25.

Fig. 25. Simulated power consumption with the DM driver activity factor of 0.5 and 1. (a) Power consumption of the TX data driver (the DM driver and the weak driver) and (b) power consumption of the RX core.


When the activity factor is increased from 0.5 to 1, the power consumption of the TX driver rises by only about 17%, while that of the RX front-end falls by only 1%. The power consumptions of these circuits do not change much because the weak driver at the TX always toggles like a clock and the RX comparators always make a transition per clock edge. The 17% power consumption increase of the TX data driver is mainly caused by the DM driver and other digital circuits, which make transitions depending on the input bit pattern and therefore consume more power as the activity factor increases.

TABLE I. PERFORMANCE SUMMARY AND COMPARISON

| | | ISSCC 2018 [1] | JSSC 2020 [2] | JSSC 2021 [3] | ISSCC 2021 [4] | ISSCC 2021 [5] | VLSI 2021 [6] | This work |
|---|---|---|---|---|---|---|---|---|
| Channel type | | MCM* | MCM* | Off-chip | D2D* | Package | CoWoS* | On-chip |
| Technology (nm) | | 16 FinFET | 16 FinFET | 28 CMOS | 7 FinFET | 7 FinFET | 7 FinFET | 28 LPP |
| Data rate/pin (Gb/s/pin) | | 25 | 20.83 | 30 | 40 | 112 | 20 | 20 |
| Shoreline density (Tb/s/mm) | | N/A | 0.4167 | N/A | 0.48 | N/A | 5.31 | 0.631 (Calculated) |
| Signaling method | | GRS | CNRZ-5 | PAM-3 | NRZ | PAM-4 | NRZ | DECS |
| CDR | | Delay line | CDA | N/A | IJL-PI | Digital CDR | Deskew loop | Not required (Auto deserialized) |
| Duty cycle correction | | Required | Required | Required | Required | Required | Required | Not required (for 40%–60%) |
| Supply noise immunity | | N/A | N/A | N/A | N/A | N/A | Noise immunity coding | … on PCB |
| Channel loss (dB) | | −4 | −4.5 | −6.6 | −8 | −3.7 | −3 | −2.5 |
| Horizontal eye size (UI) | | 0.77 @ BER < 10<sup>-15</sup> | 0.475 @ BER < 10<sup>-15</sup> | 0.103 @ BER < … | 0.55 @ BER < 10<sup>-15</sup> | 0.14 @ BER < … | 0.63 @ BER < 10<sup>-12</sup> | 0.99 @ BER < 10<sup>-10</sup> (Measured); 0.93 @ BER < 10<sup>-15</sup> (Extrapolated) |
| Energy efficiency (pJ/b) | TX | 1 | 0.3876<sup>(3)</sup> | 0.26 | N/A | N/A | N/A | 1.09<sup>(8)</sup> / 0.68<sup>(9)</sup> |
| | RX | 0.108 | 0.3672<sup>(3)</sup> | 0.85 | N/A | N/A | N/A | 0.18 / 0.18 |
| | | 0.617 | 0.2652<sup>(4)</sup> | N/A | N/A | N/A | N/A | N/A / N/A |
| | Total | 1.17 | 1.02 | 1.11 | 1.7 | 1.7 | 0.46 | 1.27<sup>(8)</sup> / 0.86<sup>(9)</sup> |
| Area (mm<sup>2</sup>) | TX | 0.000788<sup>(2)</sup> | N/A | 0.00338<sup>(5)</sup> | N/A<sup>(6)</sup> | N/A | N/A | 0.00125<sup>(8)</sup> / 0.00102<sup>(9)</sup> |
| | RX | 0.00097<sup>(2)</sup> | N/A | 0.0106<sup>(5)</sup> | N/A<sup>(6)</sup> | N/A | N/A | Core: 0.000058; Others: 0.001134 |
| | Total | 0.00176<sup>(2)</sup> | N/A | 0.014<sup>(5)</sup> | N/A<sup>(6)</sup> | 0.228<sup>(7)</sup> | N/A | 0.00244<sup>(8)</sup> / 0.0022<sup>(9)</sup> |
| Area norm. w/ tech.<sup>(1)</sup> | TX | 2.37 | N/A | 3.3 | N/A | N/A | N/A | 1 |
| | RX | 2.49 | N/A | 8.9 | N/A | N/A | N/A | 1 |
| | Total | 2.45 | N/A | 6.36 | N/A | 1658 | N/A | 1 |

- (1) Area is normalized with the technology.
- (2) Area of the I/O brick is divided by 8 data lanes. The area of the amortized clock lane is included in the TX area.
- (3) Power consumption includes the internal clock trees.
- (4) Power consumption of TX digital, TX PLL, RX digital, and RX PLL circuits is included.
- (5) Area includes the test blocks.
- (6) Area is not clearly reported.
- (7) Area includes analog and digital blocks.
- (8) Area and power consumption including 1 clock lane and 1 data lane.
- (9) Area and power consumption of the clock lane are amortized across 10 data lanes (1 data lane + 1/10 clock lane), assuming that ten data lanes can share one clock lane.

Pie charts of the area breakdowns are shown in Fig. 26. The data driver occupies most of the TX chip area: the DM, weak, and clock drivers occupy 34.2%, 19.7%, and 19.7% of the total TX area, respectively. Notably, the RX core excluding ODTs occupies only a tiny area of 0.000058 mm<sup>2</sup>, which is only 5% of the total RX area. Owing to the compact TRX architecture, the implemented TRX occupies only 0.0024 mm<sup>2</sup>. The area would be reduced to 0.0022 mm<sup>2</sup> if ten data lanes shared one clock lane. This is the smallest area among TRXs for short-reach interfaces [1], [3], [5].

Fig. 26. Area breakdowns of (a) TX and (b) RX.

<sup>\*</sup> MCM: Multi-chip modules, D2D: Die-to-die, CoWoS: Chip-on-Wafer-on-Substrate

Table I compares the proposed DECS TRX with the previous state-of-the-art designs. Compared to the prior designs [1], [3], [5], the proposed DCC/CDR-less DECS TRX improves the area cost by $2.45\times$, $6.36\times$, and $1658\times$, respectively: the area is normalized with the technology, and we assume that ten data lanes can share one clock lane in our DECS TRX. In comparison with the prior designs [1], [2], [3], [4], [5], [6], the proposed DECS TRX achieved the widest horizontal eye of 0.99 UI at BER of $10^{-10}$ [see Fig. 22(a)], showing that the timing requirement for the local RX clock after the RX front-end was remarkably relaxed. Whereas the NRZ TRX achieved the eye width of 0.63 UI at BER of $10^{-18}$ utilizing noise immunity coding [6], the proposed TRX achieved a wider horizontal eye size of 0.88 UI [see Fig. 22(c)] without using any coding method in a much noisier environment. Although our DECS TRX operates at a lower data rate than the NRZ [4] and the PAM-4 [5] TRXs, the DECS TRX can significantly reduce power and area costs because it does not require the large and power-hungry circuits used in the prior arts [4], [5]: the NRZ TRX [4] required an injection-locked phase-interpolator (IJL-PI), and the PAM-4 TRX [5] utilized CDR, CTLE, and FFE. Likewise, all other prior arts [1], [2], [3], [4], [5], [6] require additional expensive high-speed clock circuits (DCC circuits [1], [2], [3], [4], [5], [6], the delay line [1], CDA [2], IJL-PI [4], digital CDR [5], and de-skew circuits [6]) in comparison with the proposed DECS TRX that demands neither CDR/CDA nor DCC/DCD. The energy efficiency per lane (0.86 pJ/b) of the proposed TRX is better than those of the prior arts [1], [2], [3], [4], [5]. Therefore, the smallest area and good energy efficiency could be achieved with the proposed TRX.
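The area-improvement ratios quoted above can be reproduced with a simple scaling rule. The text only states that the area is "normalized with the technology"; the sketch below assumes that this means scaling each prior area by the square of the feature-size ratio to 28 nm and dividing by the amortized area of this work (0.0022 mm<sup>2</sup>). That quadratic rule is an assumption, chosen because it matches the quoted ratios, and the total areas and nodes are the ones listed in Table I.

```python
THIS_WORK_NODE_NM = 28.0
THIS_WORK_AREA_MM2 = 0.0022   # clock lane amortized across ten data lanes

# (total area in mm^2, technology node in nm) as listed in Table I.
prior_work = {
    "[1]": (0.00176, 16.0),
    "[3]": (0.014, 28.0),
    "[5]": (0.228, 7.0),
}

def normalized_area_ratio(area_mm2: float, node_nm: float) -> float:
    """Scale an area to the 28-nm node (assumed quadratic in feature size)
    and express it relative to this work's area."""
    scaled_mm2 = area_mm2 * (THIS_WORK_NODE_NM / node_nm) ** 2
    return scaled_mm2 / THIS_WORK_AREA_MM2

ratios = {ref: normalized_area_ratio(a, n) for ref, (a, n) in prior_work.items()}
for ref, ratio in ratios.items():
    print(f"{ref}: {ratio:.2f}x")
```

With this rule the three ratios come out at about 2.45, 6.36, and 1658, matching the improvements stated in the text.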

#### VIII. CONCLUSION

In this article, we proposed a DCC-less DECS TRX with a CDR-less RX front-end for single-ended short-reach on-chip links. The proposed DECS TRX successfully operated at 20 Gb/s/pin with an energy efficiency of 1.27 pJ/b at nominal. The energy per bit drops to 0.86 pJ/b if ten data lanes share one clock lane. Owing to self-slicing and auto-deserialization, a wide horizontal eye size of 0.99 UI at BER of $10^{-10}$ was achieved at nominal. This wide horizontal eye relaxes the stringent timing constraint of the RX local clock. DCC and DCD circuits were not required because a wide horizontal eye size of 0.88 UI was achieved for clock duty cycles between 40% and 60%. This shows that our DECS TRX is tolerant to duty cycle error. When the 50-MHz 300-mV<sub>p-p</sub> sinusoidal noise, the 200-MHz 300-mV<sub>p-p</sub> sinusoidal noise, and the 200-mV<sub>p-p</sub> CF7 Gaussian noise were injected into the supply on the PCB, wide horizontal eyes of 0.88, 0.90, and 0.90 UI were measured, respectively. This result shows that our DECS TRX is also tolerant to supply noise. Because neither high-speed DCC/DCD nor CDR/CDA circuits are required in the DECS TRX, the proposed TRX can reduce hardware costs with decent power consumption.

#### ACKNOWLEDGMENT

The authors would like to thank IC Design Education Center (IDEC) and Ansys for valuable tool support.

#### REFERENCES

- [1] J. M. Wilson et al., "A 1.17 pJ/b 25 Gb/s/pin ground-referenced single-ended serial link for off- and on-package communication in 16 nm CMOS using a process- and temperature-adaptive voltage regulator," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2018, pp. 276–278.
- [2] A. Tajalli et al., "A 1.02-pJ/b 20.83-Gb/s/wire USR transceiver using CNRZ-5 in 16-nm FinFET," *IEEE J. Solid-State Circuits*, vol. 55, no. 4, pp. 1108–1123, Apr. 2020.
- [3] H. Park et al., "30-Gb/s 1.11-pJ/bit single-ended PAM-3 transceiver for high-speed memory links," *IEEE J. Solid-State Circuits*, vol. 56, no. 2, pp. 581–590, Feb. 2021.
- [4] K. McCollough, S. D. Huss, J. Vandersand, R. Smith, C. Moscone, and Q. O. Farooq, "A 480 Gb/s/mm 1.7 pJ/b short-reach wireline transceiver using single-ended NRZ for die-to-die applications," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2021, pp. 184–185.
- [5] R. Yousry et al., "A 1.7 pJ/b 112 Gb/s XSR transceiver for intra-package communication in 7 nm FinFET technology," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2021, pp. 180–181.
- [6] Y.-Y. Hsu, P.-C. Kuo, C.-L. Chuang, P.-H. Chang, H.-H. Shen, and C.-F. Chiang, "A 7 nm 0.46 pJ/bit 20 Gbps with BER 1E-25 die-to-die link using minimum intrinsic auto alignment and noise-immunity encode," in *IEEE Symp. VLSI Circuits Dig. Tech. Papers*, Jun. 2021, pp. 1–2.
- [7] B. Dehlaghi and A. C. Carusone, "A 0.3 pJ/bit 20 Gb/s/wire parallel interface for die-to-die communication," *IEEE J. Solid-State Circuits*, vol. 51, no. 11, pp. 2690–2701, Nov. 2016.
- [8] B. Kim, Y. Liu, T. O. Dickson, J. F. Bulzacchelli, and D. J. Friedman, "A 10-Gb/s compact low-power serial I/O with DFE-IIR equalization in 65-nm CMOS," *IEEE J. Solid-State Circuits*, vol. 44, no. 12, pp. 3526–3538, Dec. 2009.
- [9] T. O. Dickson et al., "An 8×10-Gb/s source-synchronous I/O system based on high-density silicon carrier interconnects," *IEEE J. Solid-State Circuits*, vol. 47, no. 4, pp. 884–896, Apr. 2012.
- [10] M.-J. Park et al., "A 192-Gb 12-high 896-GB/s HBM3 DRAM with a TSV auto-calibration scheme and machine-learning-based layout optimization," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2022, pp. 444–445.
- [11] D. U. Lee et al., "A 1.2 V 8 Gb 8-channel 128 GB/s high-bandwidth memory (HBM) stacked DRAM with effective I/O test circuits," *IEEE J. Solid-State Circuits*, vol. 50, no. 1, pp. 191–203, Jan. 2015.
- [12] Y. Nishi et al., "A 0.297-pJ/b 50.4-Gb/s/wire inverter-based short-reach simultaneous bidirectional transceiver for die-to-die interface in 5 nm CMOS," in *IEEE Symp. VLSI Circuits Dig. Tech. Papers*, Jun. 2022, pp. 154–155.
- [13] J. W. Poulton et al., "A 0.54 pJ/b 20 Gb/s ground-referenced single-ended short-reach serial link in 28 nm CMOS for advanced packaging applications," *IEEE J. Solid-State Circuits*, vol. 48, no. 12, pp. 3206–3218, Dec. 2013.
- [14] B. Casper and F. O'Mahony, "Clocking analysis, implementation and measurement techniques for high-speed data links—A tutorial," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 56, no. 1, pp. 17–39, Jan. 2009.
- [15] J. Seo, S. Lee, M. Lee, C. Moon, and B. Kim, "A 20-Gb/s/pin 0.0024-mm<sup>2</sup> single-ended DECS TRX with CDR-less self-slicing/auto-deserialization to improve tolerance on duty cycle error and RX supply noise for DCC/CDR-less short-reach memory interfaces," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2022, pp. 456–457.
- [16] J. Song, Y. Kim, and C. Kim, "A 9 Gb/s/ch transceiver with referenceless data-embedded pseudo-differential clock signaling for graphics memory interfaces," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, vol. 66, no. 12, pp. 1982–1986, Dec. 2019.
- [17] S. Lee et al., "A 7.8 Gb/s/pin 1.96 pJ/b compact single-ended TRX and CDR with phase-difference modulation for highly reflective memory interfaces," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2018, pp. 272–273.
- [18] S. Lee et al., "A 7.8 Gb/s/pin, 1.96 pJ/b transceiver with phase-difference-modulation signaling for highly reflective interconnects," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 67, no. 6, pp. 2114–2127, Jun. 2020.
- [19] S. Lee, J. Seo, C. Han, J. Sim, H. Park, and B. Kim, "A DFE-enhanced phase-difference modulation signaling for multi-drop memory interfaces," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, vol. 68, no. 6, pp. 1862–1866, Jun. 2021.
- [20] M. Choi et al., "An FFE transmitter which automatically and adaptively relaxes impedance matching," *IEEE J. Solid-State Circuits*, vol. 53, no. 6, pp. 1780–1792, Jun. 2018.
- [21] M. Choi et al., "An FFE TX with 3.8x eye improvement by automatic impedance adaptation for universal compatibility with arbitrary channel and RX impedances," in *IEEE Symp. VLSI Circuits Dig. Tech. Papers*, Jun. 2017, pp. 58–59.
- [22] M. Choi, M. Lee, and B. Kim, "A 12-Gb/s AC-coupled FFE TX with adaptive relaxed impedance matching achieving adaptation range of 35–75 Z0 and 30–550 RRX," in *IEEE Asian Solid-State Circuits Conf. (A-SSCC) Dig. Tech. Papers*, Nov. 2018, pp. 209–212.
- [23] C. Moon, J. Seo, M. Lee, I. Jang, and B. Kim, "A 20 Gb/s/pin 1.18 pJ/b 1149 μm² single-ended inverter-based 4-tap addition-only feed-forward equalization transmitter with improved robustness to coefficient errors in 28 nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2022, pp. 450–451.
- [24] B. Razavi, "The design of a comparator [the analog mind]," *IEEE Solid State Circuits Mag.*, vol. 12, no. 4, pp. 8–14, Fall 2020.
- [25] M. A. Kossel and M. L. Schmatz, "Jitter measurements of high-speed serial links," *IEEE Design Test Comput.*, vol. 21, no. 6, pp. 536–543, Nov./Dec. 2004.
- [26] K.-L. J. Wong, A. Rylyakov, and C.-K. K. Yang, "A 5-mW 6-Gb/s quarter-rate sampling receiver with a 2-tap DFE using soft decisions," *IEEE J. Solid-State Circuits*, vol. 42, no. 4, pp. 881–888, Apr. 2007.
- [27] M. Lee, S. Han, J.-Y. Sim, H.-J. Park, and B. Kim, "A 10-GHz multi-purpose reconfigurable built-in self-test circuit for high-speed links," in *IEEE Asian Solid-State Circuits Conf. (A-SSCC) Dig. Tech. Papers*, Nov. 2017, pp. 73–76.
- [28] D. Schinkel, E. Mensink, E. Klumperink, E. van Tuijl, and B. Nauta, "A double-tail latch-type voltage sense amplifier with 18 ps Setup+Hold time," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2007, pp. 314–315.
- [29] R. Stephens, "Jitter analysis: The dual-dirac model, RJ/DJ, and Q-scale," Agilent Technol., Santa Clara, CA, USA, White Paper, 2004. [Online]. Available: https://www.keysight.com/kr/ko/assets/7018-01309/white-papers/5989-3206.pdf



**Jaeyoung Seo** received the B.S., M.S., and Ph.D. degrees in electrical engineering from the Pohang University of Science and Technology (POSTECH), Pohang, South Korea, in 2015, 2017, and 2023, respectively.

In 2023, he became a Staff Engineer with Samsung Electronics, Hwaseong, South Korea. His research interests include high-speed serial/parallel links, signal/power integrity, and interconnect modeling.

Dr. Seo received several honorable awards. He was a co-recipient of the 19th and 23rd Korean Solid-State Circuits Design Competition Awards. He was a recipient of the Kim Bum Man Best Dissertation Award from the Department of Electrical Engineering, POSTECH, in 2023.



**Sooeun Lee** received the B.S., M.S., and Ph.D. degrees in electrical engineering from the Pohang University of Science and Technology (POSTECH), Pohang, South Korea, in 2013, 2015, and 2020, respectively.

Since 2020, she has been a Staff Engineer with Samsung Electronics, Hwaseong, South Korea. Her research interests include high-speed serial and parallel links and signal integrity.



**Myungguk Lee** (Graduate Student Member, IEEE) received the B.S. degree in electronic engineering from the Kumoh National Institute of Technology, Gumi, South Korea, in 2015, and the M.S. degree in electrical engineering from the Pohang University of Science and Technology (POSTECH), Pohang, South Korea, in 2017, where he is currently pursuing the Ph.D. degree.

His research interests include high-speed links and signal integrity.



**Changjae Moon** (Graduate Student Member, IEEE) received the B.S. degree in electronic and electrical engineering from the Pohang University of Science and Technology (POSTECH), Pohang, South Korea, in 2018, where he is currently pursuing the Ph.D. degree.

His research interests include high-speed links and signal integrity.



**Byungsub Kim** (Senior Member, IEEE) received the B.S. degree in electrical engineering from the Pohang University of Science and Technology (POSTECH), Pohang, South Korea, in 2000, and the M.S. and Ph.D. degrees in electrical engineering and computer science from the Massachusetts Institute of Technology (MIT), Cambridge, MA, USA, in 2004 and 2010, respectively.

He was an Analog Design Engineer with Intel Corporation, Hillsboro, OR, USA, from 2010 to 2011. In 2012, he joined the faculty of the Department of Electrical Engineering, POSTECH, where he is currently a Professor.

Dr. Kim received several honorable awards. He was a recipient of the IEEE JOURNAL OF SOLID-STATE CIRCUITS Best Paper Award, in 2009; Analog Device Inc., Award; and the Outstanding Student Designer Award from MIT, in 2009. He was a co-recipient of the Beatrice Winner Award for Editorial Excellence at the 2009 IEEE International Solid-State Circuits Conference. For several years, he served or has been serving as the Technical Program Committee Member of the IEEE International Solid-State Circuits Conference and the IEEE Asian Solid-State Circuit Conference.