Distributed waveform generation and digitization system based on transparent transmission

NUCLEAR ELECTRONICS AND INSTRUMENTATION

Lei Lang, Kai Chen, Dou Zhu, Jing Wang, Yi-Chen Yang

Nuclear Science and Techniques

Vol.36, No.3

Article number 47

Published in print Mar 2025

Available online 29 Jan 2025

DOI: 10.1007/s41365-024-01621-z

CSTR: 32136.14.NST.2025.0347

Waveform generation and digitization play essential roles in numerous physics experiments. In traditional distributed systems for large-scale experiments, each frontend node contains an FPGA for data preprocessing, which interfaces with various data converters and exchanges data with a backend central processor. However, benefiting from advancements in data transmission and computing technologies, the streaming readout architecture has become a new paradigm for several experiments. This paper proposes a scalable distributed waveform generation and digitization system that utilizes fiber-optic connections for data transmission between frontend nodes and a central processor. By utilizing transparent transmission on top of the data link layer, the clock and data ports of the converters in the frontend nodes are directly mapped to the FPGA firmware at the backend. This streaming readout architecture reduces the complexity of frontend development and keeps the data conversion close to the detector. Each frontend node uses a local clock for waveform digitization. To translate the timing information of events in each channel into the system clock domain within the backend central processing FPGA, a novel method is proposed and evaluated using a demonstrator system.

Keywords: Transparent transmission; Waveform generation; Waveform digitization; Distributed system

1 Introduction

Waveform generation and digitization serve as pivotal interfaces between the analog and digital realms. Although the natural world is inherently analog, digital signals offer superior convenience for data transmission and processing. Field-programmable gate arrays (FPGAs) are commonly employed to dynamically steer the output waveforms of digital-to-analog converters (DACs) by sending digital clocks and data signals. In contrast, analog-to-digital converters (ADCs) continuously sample and digitize input analog waveforms from detectors or sensors and transmit the resultant digital data to processors for further analysis.

Analog signals are inherently more vulnerable to noise and crosstalk than digital signals, which makes it important to minimize the signal path between the detector and the data converters. For small-scale experiments or laboratory setups, FPGA, ADC, and DAC chips can typically be colocated on a single board or within a compact crate, thereby simplifying the overall design architecture. However, in larger experiments, the sensors or detectors may be distributed across hundreds or even thousands of nodes in diverse locations [1-4]. In these cases, a distributed system architecture is often employed. Taking the PUMA experiment [2] as an example, the radiofrequency signal is digitized locally at each node. All data are collected by the local processor and streamed to the central processor for further analysis. Experiments such as KM3NeT [3] and TRIDENT [4] use similar approaches: signals from a limited number of detector channels are measured and preprocessed locally to reduce the data volume before transmission to the central processor. KM3NeT and TRIDENT use the White Rabbit precision-timing protocol [5] for clock synchronization. This architecture shares similarities with the setup employed in several modern collider experiments, where digitized data for energy, timing, and position are processed through a series of frontend electronics, backend electronics, and, ultimately, a data acquisition (DAQ) system. In contrast, experiments such as PandaX [6] digitize analog signals outside the detector and transmit data to the DAQ system. Ongoing research and development efforts aim to upgrade the PandaX readout system by integrating waveform digitization into the electronics housed in liquid xenon. Cold electronics [7] have proven to be effective in reducing the overall noise contribution from electronic components. However, maintaining the simplicity of frontend electronics is crucial: complex designs can lead to increased power consumption and higher background event rates due to material radioactivity [8], particularly for low-background rare-event detectors.

As data transmission technologies continue to evolve, an increasing number of large-scale experiments, such as LHCb [9] and ECCE [10], are adopting or considering triggerless streaming readout architectures. This approach involves transferring all the data to the backend for online or offline analysis. In some physics experiments [11-14] and chip evaluation setups [15, 16], acquiring the original digitized waveform data is necessary. This paper proposes a novel design method based on transparent transmission to decouple centralized data processing from the interfaces of the converter chips distributed in the frontend nodes. Figure 1 illustrates the scalable system, which comprises distributed frontend nodes called analog-to-digital and digital-to-analog (ADDA) boards and a backend server with data aggregation module (DAM) PCIe cards for data aggregation. Each DAM card supports communication with multiple ADDA boards, depending on the number of available fiber-optic links. Multiple DAM cards are typically inserted into a server with multiple PCIe slots [17]. For larger systems with multiple servers, network interface cards (NICs) are installed to facilitate data exchange via a commercial network switch. This architecture eliminates the need for FPGAs in the ADDA nodes for data preprocessing. Converter chips with high-speed serial interfaces, such as JESD, can be directly connected to the backend FPGA via optical links. The output data of ADCs with slower parallel digital ports can be aggregated onto a high-speed serial link using transceiver chips, such as the TLK1501 [18] and TLK2501 [19] series. In the reverse direction, the high-speed link from the backend FPGA can be converted to slower parallel ports for the DAC chips. Data from all the frontend nodes are transparently mapped to the firmware in the central FPGA located on the DAM card. This single FPGA performs all data encoding, decoding, and processing tasks. For multichannel slow-speed ADC chips, such as the AD4695 [20] used in the NνDEx experiment [13], a slow control interface can be realized remotely over the transparent transmission links built by the TLK1501 transceivers. Depending on the specific DAQ system architecture in each experiment, the PCIe form factor shown in Fig. 1 can be replaced by other infrastructure, such as the advanced telecom computing architecture (ATCA) standard.

Fig. 1

(Color online) Conceptual diagram of the proposed system

A demonstrator system comprising two ADDA nodes and one server equipped with a DAM card was designed to support 125 Msps waveform digitization and generation. Section 2 delves into the specifics of this design and introduces a novel timing measurement method for the clock domain crossing scenario. Section 3 presents the system evaluation results.

2 Design of the Evaluation System

2.1 ADDA board in the frontend nodes

The ADDA board depicted in Fig. 2 serves as the readout electronics for each frontend node. The TLK2501 transceiver chip supports a link speed of up to 2.5 Gbps, which corresponds to a payload throughput of 2 Gbps after the overhead of 8B10B coding. As illustrated in Fig. 3, for the downlink, the TLK2501 converts the 2.5 Gbps link into 16 parallel 125 Mbps lines and recovers the embedded 125 MHz clock, DAC-CLK. DAC-CLK and DAC-DATA are mapped to TX-CLK and TX-DATA in the backend firmware through the transparent path built with the TLK2501, Small Form-factor Pluggable Plus (SFP+) modules, optical fiber, and the GTX core [21] in the FPGA. The DAC chip on the ADDA board is an AD9744, which supports a rate of up to 210 Msps with 14-bit resolution. The upper 14 bits of DAC-DATA are connected to the DAC data port, and DAC-CLK is connected to the corresponding clock input pin. This mapping and connection enable the FPGA firmware to continuously control the DAC output waveform at a rate of up to 125 Msps. The differential current output of the AD9744 is then fed into an AD8047 operational amplifier, which performs differential-to-single-ended conversion and outputs the voltage signal. For a 50 Ω load, the output range spans approximately 0 mV to 960 mV.

Fig. 2

(Color online) ADDA board with main components labeled: transceiver TLK2501, ADC chip AD9255, and SFP+ cage are on the top side, while the DAC chip AD9744, clock source, and buffer are on the bottom side

Fig. 3

(Color online) Overall signal and clock connections from the backend FPGA to the frontend converters
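The bit mapping on the downlink (a 14-bit DAC code carried in the upper 14 bits of the 16-bit DAC-DATA word) can be illustrated with a minimal Python sketch. The helper names `pack_dac_word` and `unpack_dac_word`, the unsigned code format, and the zero-filling of the two unused low bits are assumptions made for illustration; they are not taken from the firmware.

```python
DAC_BITS = 14    # AD9744 resolution
WORD_BITS = 16   # width of the TLK2501 parallel data port


def pack_dac_word(code: int) -> int:
    """Place a 14-bit DAC code in the upper 14 bits of a 16-bit word."""
    if not 0 <= code < (1 << DAC_BITS):
        raise ValueError("DAC code must fit in 14 bits")
    # Assumption: the two unused low bits are left at zero.
    return code << (WORD_BITS - DAC_BITS)


def unpack_dac_word(word: int) -> int:
    """Return the code seen on bits 15..2, i.e. the bits wired to the DAC."""
    return (word >> (WORD_BITS - DAC_BITS)) & ((1 << DAC_BITS) - 1)


if __name__ == "__main__":
    for code in (0, 1, 0x2000, 0x3FFF):
        word = pack_dac_word(code)
        print(f"code={code:#06x} -> word={word:#06x}")
```

Under these assumptions, the firmware-side word and the code that reaches the DAC pins differ only by a fixed two-bit shift.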

In the uplink direction, the input signal is attenuated using a π-style network to match the allowable input range of the subsequent operational amplifier. The attenuated waveform is then fed into a THS4521 amplifier for single-ended-to-differential conversion [22] with a gain of 6 dB. Subsequently, the differential signal is fed into the ADC chip for digitization. The board supports the AD9255 [23] (14-bit resolution) and AD9265 [24] (16-bit resolution) chips, which are pin-compatible. The ADC sampling clock and the reference clock of the TLK2501 are synchronized with a local 125 MHz clock source. The TLK2501 continuously latches the digitized data from the ADC. After 8B10B encoding and parallel-to-serial conversion, the parallel data are transformed into a 2.5 Gbps serial stream, which is transmitted to the backend FPGA via a fiber-optic link. The transceiver within the backend FPGA recovers both the data and the clock. This entire chain effectively maps the ADC-CLK and ADC-DATA signals to RX-CLK and RX-DATA within the backend firmware, enabling other FPGA firmware blocks to seamlessly access the remotely acquired ADC data.

In previous experiments, some of the TLK2501 parallel ports were used to realize remote JTAG over fiber-optic links [22, 25]. In this setup, a similar design could be used to realize a serial peripheral interface for the ADC. However, to make the system compatible with the 16-bit AD9265 ADC, and considering that no configuration is required for the ADC chip in our use case, all 16 input ports of the TLK2501 are occupied by the ADC parallel data output.

Analog and digital power supplies are isolated to minimize crosstalk between the digital and analog circuits. Digital power rails are used for the SFP+ module, TLK2501, clock oscillator, and clock buffers. The analog power supplies serve the ADC chip, ADC driver THS4521, DAC chip AD9744, and output buffer AD8047.

2.2 DAM module in the backend

2.2.1 Hardware

This project utilizes the commercially available PCIe card AX7325 from ALINX as the DAM module, as shown in Fig. 4. Featuring an 8-lane Gen2 PCIe interface and a Xilinx Kintex-7 XC7K325 FPGA, the card offers high performance and flexibility. The front panel contains a 1×4 SFP+ cage and one Quad SFP (QSFP) cage, enabling support for up to eight pairs of bidirectional fiber-optic links. Owing to the limitations of the SFP+ module and FPGA transceiver, the link speed is capped at 10 Gbps. Nevertheless, a single DAM module can support up to eight frontend ADDA nodes.

Fig. 4

(Color online) Commercial AX7325 PCIe card, which supports 8 pairs of fiber optical links for high-speed data transmission

2.2.2 Firmware & Software

Figure 5 shows the overall design of the FPGA firmware. To facilitate efficient data exchange with the server memory, a Xilinx PCIe Gen2 XDMA core is employed.

Fig. 5

(Color online) Diagram of the firmware in the backend central FPGA

An Advanced eXtensible Interface (AXI) optimized for the rapid transfer of large data volumes handles ADC data reception and DAC data transmission. The AXI-BYPASS interface enables direct data transfer without utilizing Direct Memory Access (DMA), thus preserving the resources of the normal AXI bus. The AXI-LITE interface, which is designed for single-cycle data transfer, facilitates communication with the register array for read and write operations. These registers are responsible for configuring and monitoring all firmware modules and hardware components on the card.

In the ADC data flow direction, the GTX receiver in the FPGA recovers the 2.5 Gbps serial data and RX-CLK. The 8B10B decoder within the GTX decodes the data, ultimately outputting 16-bit RX-DATA synchronized with RX-CLK. These RX signals are derived from the clock and ADC data on the ADDA board, as described in Sect. 2.1. The RX-DATA is buffered in the Read-FIFO before the AXI protocol conversion module encapsulates it in the AXI format and transfers it to the XDMA core.

In the DAC data flow direction, the phase-locked loop (PLL) inside the GTX generates TX-CLK using the reference clock as the input. The GTX transmitter receives 16-bit TX-DATA synchronized with TX-CLK. Subsequently, the TX-DATA undergoes 8B10B encoding, expanding to 20 bits. After parallel-to-serial conversion, the 2.5 Gbps serial data are transmitted to the ADDA board. K28.5 is utilized as the comma character for the GTX transceiver, complying with the definition of the 8B10B core of the TLK2501 chip.
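The link-rate figures quoted above follow directly from the word width and the 8B10B expansion. The short Python sketch below reproduces the arithmetic; the constant and function names are illustrative only.

```python
PARALLEL_CLOCK_HZ = 125_000_000  # TX-CLK / RX-CLK frequency
PAYLOAD_BITS = 16                # width of TX-DATA / RX-DATA
CODED_BITS = 20                  # width after 8B10B encoding (two 10-bit symbols)


def line_rate_bps(clock_hz: int = PARALLEL_CLOCK_HZ, coded_bits: int = CODED_BITS) -> int:
    """Serial line rate: one coded word per parallel clock cycle."""
    return clock_hz * coded_bits


def payload_rate_bps(clock_hz: int = PARALLEL_CLOCK_HZ, payload_bits: int = PAYLOAD_BITS) -> int:
    """Useful data rate: one 16-bit word per parallel clock cycle."""
    return clock_hz * payload_bits


if __name__ == "__main__":
    print(f"line rate    : {line_rate_bps() / 1e9:.1f} Gbps")
    print(f"payload rate : {payload_rate_bps() / 1e9:.1f} Gbps")
    print(f"efficiency   : {payload_rate_bps() / line_rate_bps():.0%}")
```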

The application software, constructed based on the official XDMA driver reference [26], offers flexibility in selecting various character-device interfaces tailored to specific needs. Specifically, a buffer of a dedicated size is allocated after initialization. Depending on the running mode, relevant control registers are used to initiate the DMA transfer. The DMA core generates an interrupt after transmission, and the driver checks the status and releases the buffer if no errors occur. The uplink direction is for waveform data storage, whereas for the downlink direction, three different modes are supported: single analog pulse generation, periodic waveform generation, and long-term waveform generation.

2.3 Timing measurement without clock synchronization

Time-of-arrival measurement is crucial in certain physics experiments for event reconstruction. In numerous scenarios, owing to the time-walk effect, the measured pulse arrival time experiences a delay that depends on the pulse amplitude [27]. Calibration is often required for a readout system to compensate for this effect, and the calibration method is often based on the amplitude and shape of the input signals [28]. Although beyond the scope of this study, such calibration is essential for precise timing measurement. Furthermore, the ADC sampling rate introduces another bias. For instance, the 125 Msps sampling rate on the ADDA board results in a coarse timing resolution of 8 ns. Many physical processes follow Poisson statistics, and the timing of the pulse peak is independent of the sampling clock edges. Therefore, even for pulses with the same shape and amplitude, the sample positions relative to the peak shift on an event-by-event basis. Methods such as parabolic fitting can be used to estimate the peak time using the three samples with maximum amplitudes. The obtained value of $t_{\mathrm{fit\_frac}}$ helps improve the precision of timing measurements in the sampling clock domain.
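A three-point parabolic fit of this kind can be sketched in Python as follows. This is a generic textbook interpolation, not necessarily the exact fitting procedure used in the firmware or analysis; the function name `parabolic_peak` and the choice of the maximum sample plus its two neighbours are assumptions made for illustration.

```python
def parabolic_peak(samples):
    """Estimate the peak position of a sampled pulse.

    Returns (k, frac): k is the index of the largest sample and frac is the
    sub-sample offset of the fitted parabola vertex, in units of the
    sampling period (the role played by t_fit_frac in the text).
    """
    k = max(range(len(samples)), key=samples.__getitem__)
    if k == 0 or k == len(samples) - 1:
        return k, 0.0  # no neighbour on one side: no interpolation possible
    y_prev, y_max, y_next = samples[k - 1], samples[k], samples[k + 1]
    curvature = y_prev - 2 * y_max + y_next
    if curvature == 0:
        return k, 0.0  # flat top: fall back to the sample index
    # Vertex of the parabola through the three points.
    return k, 0.5 * (y_prev - y_next) / curvature


if __name__ == "__main__":
    # Samples of y = 10 - (x - 2.25)**2 at x = 0..4 (true peak at x = 2.25).
    pulse = [10 - (x - 2.25) ** 2 for x in range(5)]
    k, frac = parabolic_peak(pulse)
    print(k, frac, (k + frac) * 8.0, "ns")
```

With an 8 ns sampling period, the fractional offset refines the peak time below the sampling interval, in units of clock cycles.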

Clock synchronization methodology is often employed in systems that require high-precision timing measurements. For example, in accelerator experiments, a master system clock is typically distributed to all nodes for synchronized operations [29-32]. However, the implementation of this approach on an ADDA board can introduce additional complexities. To synchronize the ADC sampling clock with the TX-CLK in Fig. 3, one solution involves the use of two TLK2501 chips: one for data reception and system clock recovery, and the other for data transmission. Alternatively, a clock chip with dynamic automatic switching between local and recovered clocks can be employed [33]. To simplify the ADDA board design, an asynchronous-clock approach is adopted for the uplink. The DAM module uses TX-CLK as its system clock. However, the ADC on the ADDA board operates on a local clock. This discrepancy necessitates aligning the timing information from all the frontend channels with TX-CLK for synchronized operations.

To address this challenge, a method for timing calibration across clock domains is proposed. By comparing the values of two counters running in the two clock domains, the firmware identifies the faster clock and the ratio of the two frequencies. As shown in Fig. 6, the counters $\mathrm{CNT}_{\mathrm{sys}}$ and $\mathrm{CNT}_{\mathrm{fcal\_int}}$ operate in the TX-CLK and RX-CLK domains, respectively. Although both clocks run nominally at 125 MHz, slight discrepancies are observed. In the example of Fig. 6, RX-CLK is faster than TX-CLK. DFF_out is the output of a D flip-flop [5, 34, 35] in which RX-CLK continuously samples the TX-CLK signal. Ideally, DFF_out is periodic with a period of $P = K - J$ cycles, and the frequency difference between the two clocks is approximately $1/P$ of the RX-CLK frequency.

Fig. 6

(Color online) Timing calibration method for asynchronous clock domain crossing
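The relation between the DFF_out period and the frequency offset can be checked with a small idealized simulation. The sketch below assumes jitter-free clocks with a 50% duty cycle and a deliberately exaggerated frequency offset so that the beat period is short; the function names and the chosen frequencies are hypothetical.

```python
from fractions import Fraction


def sample_tx_with_rx(n_samples, f_tx, f_rx):
    """Ideal DFF_out: level of TX-CLK at each rising edge of RX-CLK.

    Exact rational arithmetic avoids rounding artefacts at the edges.
    """
    ratio = Fraction(f_tx, f_rx)  # TX-CLK cycles elapsed per RX-CLK cycle
    half = Fraction(1, 2)
    return [1 if (ratio * n) % 1 < half else 0 for n in range(n_samples)]


def falling_edges(bits):
    """Indices at which the sampled signal goes from 1 to 0."""
    return [i for i in range(1, len(bits)) if bits[i - 1] == 1 and bits[i] == 0]


if __name__ == "__main__":
    F_TX = 125_000_000  # system clock (Hz)
    F_RX = 125_125_000  # local ADC clock, exaggerated offset (assumed value)
    edges = falling_edges(sample_tx_with_rx(3100, F_TX, F_RX))
    periods = [b - a for a, b in zip(edges, edges[1:])]
    print("falling edges:", edges)
    print("DFF_out period P:", periods)
    print("f_rx / (f_rx - f_tx):", Fraction(F_RX, F_RX - F_TX))
```

In this idealized case the spacing of the falling edges equals f_rx / (f_rx - f_tx), which is the statement that the frequency difference is 1/P of the RX-CLK frequency.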

Each time a falling edge appears at DFF_out, or the values of the two counters differ, $\mathrm{CNT}_{\mathrm{fcal\_int}}$ is set to the value of $\mathrm{CNT}_{\mathrm{sys}}$ and the fractional calibration term $\mathrm{CNT}_{\mathrm{fcal\_frac}}$ is reset to 0. Otherwise, $\mathrm{CNT}_{\mathrm{fcal\_frac}}$ changes by $-1/P$ per cycle when RX-CLK is faster, as in Fig. 6, and by $+1/P$ per cycle when TX-CLK is faster.

In a real system, clock jitter introduces glitches at the edges of DFF_out. The bit-value median method described in [5, 35] is one solution for removing glitches and accurately identifying the cycles of the DFF_out transitions. However, unlike in phase measurement systems, both $\mathrm{CNT}_{\mathrm{fcal\_int}}$ and $\mathrm{CNT}_{\mathrm{fcal\_frac}}$ are accumulative variables in this design; therefore, a simpler approach is implemented. Calibration occurs only at the first falling edge. As shown in Fig. 7, the edges of DFF_out trigger a forbidden window. During this window, new incorrect calibrations caused by glitches at the DFF_out falling edges are temporarily disabled. The window duration depends on the value of $P$ and the clock jitter, and typically lasts for tens or hundreds of RX-CLK cycles. In practice, $P$ is not an integer, and clock jitter also causes slight variations in the DFF_out period, both of which can affect the system accuracy. However, this effect is comparable to the clock jitter itself, which is much smaller than the 8 ns sampling interval.

$T = \mathrm{CNT}_{\mathrm{fcal\_int}} + \mathrm{CNT}_{\mathrm{fcal\_frac}} + t_{\mathrm{fit\_frac}}$ (1)

Fig. 7

(Color online) Mechanism for calibration pulse generation
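The forbidden-window idea can be illustrated with a simple software model. The sketch below is an assumption-laden illustration rather than the firmware logic: it assumes that any edge of DFF_out seen outside a window opens a new window of fixed length `window` (in RX-CLK cycles), and that a calibration pulse is issued only when that first edge is a falling edge.

```python
def calibration_pulses(dff_out, window):
    """Return the sample indices at which a calibration would be issued.

    dff_out : sequence of 0/1 values, one per RX-CLK cycle
    window  : forbidden-window length in RX-CLK cycles (assumed fixed)
    """
    pulses = []
    blocked_until = -1  # last index covered by the current forbidden window
    for i in range(1, len(dff_out)):
        if dff_out[i] == dff_out[i - 1]:
            continue  # no edge on this cycle
        if i <= blocked_until:
            continue  # glitch inside the forbidden window: ignored
        blocked_until = i + window  # first edge opens a new window
        if dff_out[i - 1] == 1 and dff_out[i] == 0:
            pulses.append(i)  # calibrate only on a falling edge
    return pulses


if __name__ == "__main__":
    # Two falling transitions and one rising transition, each with a glitch.
    bits = [1] * 10 + [0, 1, 0] + [0] * 7 + [1, 0, 1] + [1] * 7 + [0, 1, 0] + [0] * 7
    naive = [i for i in range(1, len(bits)) if bits[i - 1] == 1 and bits[i] == 0]
    print("all falling edges :", naive)
    print("calibration pulses:", calibration_pulses(bits, window=5))
```

In this toy input, the glitch-induced falling edges near each transition are suppressed and one calibration pulse per DFF_out period remains.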

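As expressed in Eq. (1), the fully calibrated time $T$ is the sum of the calibrated integer count $\mathrm{CNT}_{\mathrm{fcal\_int}}$, the fractional calibration term $\mathrm{CNT}_{\mathrm{fcal\_frac}}$, and the fitting term $t_{\mathrm{fit\_frac}}$. For example, if the timing of an event is $J + 2 + 0.25$ (assuming that $t_{\mathrm{fit\_frac}} = 0.25$), two cycles have elapsed since the calibration at count $J$, so $\mathrm{CNT}_{\mathrm{fcal\_frac}} = -2/P$ and the final result is

$T = J + 2 + 0.25 - 2/P = J + 2.25 - 2/P.$ (2)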
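Equations (1) and (2) can be reproduced numerically with a few lines of Python. The function name, its arguments, and the example values of $J$ and $P$ are hypothetical and chosen only to make the arithmetic concrete.

```python
def calibrated_time(cnt_fcal_int, cycles_since_cal, p, t_fit_frac, rx_faster=True):
    """Evaluate Eq. (1) in units of clock cycles.

    cnt_fcal_int     : calibrated integer counter value at the event
    cycles_since_cal : RX-CLK cycles elapsed since the last calibration
    p                : DFF_out period P in RX-CLK cycles
    t_fit_frac       : sub-sample offset from the pulse fit
    rx_faster        : True if RX-CLK is faster than TX-CLK (step -1/P)
    """
    step = -1.0 / p if rx_faster else 1.0 / p
    cnt_fcal_frac = cycles_since_cal * step
    return cnt_fcal_int + cnt_fcal_frac + t_fit_frac


if __name__ == "__main__":
    J, P = 1000, 500  # illustrative values only
    t = calibrated_time(J + 2, cycles_since_cal=2, p=P, t_fit_frac=0.25)
    print(t)        # J + 2.25 - 2/P, in clock cycles
    print(t * 8.0)  # same time in ns, for a nominal 8 ns cycle
```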

3 Evaluation of the System

3.1 Waveform digitization and generation

A multichannel analyzer (MCA) is implemented based on the waveform digitization path. As shown in Fig. 8, a commercial arbitrary waveform generator (AWG) is configured to output periodic digital pulses to a charge-sensitive amplifier (CSA). After amplification, the CSA converts the injected charges into an exponentially decaying analog pulse. Because capacitor Ci is 1 pF, a 1 mV voltage step corresponds to 1 fC charge injection. The CSA has an output impedance of 50 Ω and a gain of 5 mV/fC.

Fig. 8

(Color online) Test platform for the waveform digitization: commercial AWG, CSA pre-amplifier in yellow box, and the ADDA board

The output of the CSA is fed into a π-style network, low-noise amplifier, and ADC chip on the ADDA board for pulse conditioning and digitization [22].

The purple curve in Fig. 9 is a typical measured waveform, which exhibits a fast rising edge and a slow falling edge governed by the RC constant of the CSA. The orange curve represents the result obtained after triangular filtering. The peak value is approximately 420, corresponding to an amplitude of 25 mV. The equivalent noise charge (ENC), which expresses the noise as an equivalent charge at the input, is used to characterize the performance of the system. The ENC results shown in Fig. 10 are based on Eq. (3), where $V_i$ is the injected voltage step and $C_i$ is the value of the capacitor between the signal source and the CSA input pin. $e_0$ is $1.6 \times 10^{-19}$ C, and $\mu$ and $\sigma$ are the mean value and standard deviation of the measured amplitudes, respectively.

$\mathrm{ENC} = \frac{V_i \cdot C_i}{e_0} \cdot \frac{\sigma}{\mu}$ (3)
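Equation (3) is straightforward to evaluate. In the Python sketch below, the function name and the values of σ and μ are hypothetical and serve only to illustrate the arithmetic; they are not measured data from this work.

```python
E0 = 1.6e-19  # elementary charge in coulombs, as quoted in the text


def enc_electrons(v_i, c_i, sigma, mu):
    """Equivalent noise charge in electrons, following Eq. (3).

    v_i   : injected voltage step (V)
    c_i   : injection capacitance (F)
    sigma : standard deviation of the measured amplitudes
    mu    : mean of the measured amplitudes (same unit as sigma)
    """
    injected_electrons = v_i * c_i / E0
    return injected_electrons * sigma / mu


if __name__ == "__main__":
    # A 5 mV step on 1 pF injects 5 fC; sigma and mu are made-up values.
    print(enc_electrons(v_i=5e-3, c_i=1e-12, sigma=1.68, mu=420.0))
```

Because only the ratio σ/μ enters, the amplitudes can be expressed in any consistent unit, such as filter output counts.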

Fig. 9

(Color online) Waveform measured by the ADDA board and the results after the digital filtering

For comparison purposes, the CSA output is also connected to a commercial digital MCA (DT5780). As shown in Fig. 10, with a 5 fC injection, the ENC measured by the DT5780 is approximately 11 $e^-$ worse than that of the ADDA board. When the injected charge is scanned from 5 fC to 80 fC, the DT5780 exhibits ENC values ranging from 135 $e^-$ to 150 $e^-$, whereas the ADDA board achieves ENC values ranging from 125 $e^-$ to 140 $e^-$.

Fig. 10

(Color online) Comparison of the measured ENC by ADDA Board and the commercial digital MCA DT5780, with different charge injections

To evaluate the waveform generation path, sinusoidal waveforms with different frequencies are generated by the ADDA board and digitized using an R&S MXO 4 Series oscilloscope at a sampling rate of 50 Msps. The results of the spectral analysis are presented in Fig. 11. The harmonic observed at $f_{s}/2 - f_{in}$ is an interleaving spur caused by the oscilloscope and is excluded when calculating the spurious-free dynamic range (SFDR). The spectrum plots show that the SFDR exceeds 73 dB. In addition, the signal-to-noise ratio is greater than 51 dB, the total harmonic distortion is below -68 dB, and the signal-to-noise and distortion ratio is higher than 50 dB.

Fig. 11

(Color online) Spectrum for sinusoidal waveform with different frequencies: (a) 125 kHz; (b) 250 kHz; (c) 500 kHz; (d) 1 MHz
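The quoted bounds on the signal-to-noise ratio (SNR), total harmonic distortion (THD), and signal-to-noise and distortion ratio (SINAD) can be cross-checked with the standard power-sum relation between these quantities. The sketch below uses the bound values from the text as inputs; the function name is illustrative, and the relation assumes that noise and harmonic distortion add in power.

```python
import math


def sinad_db(snr_db, thd_db):
    """Combine SNR and THD (both in dB relative to the carrier) into SINAD.

    snr_db is positive (signal over noise); thd_db is negative
    (harmonics relative to the fundamental).
    """
    noise_power = 10 ** (-snr_db / 10)      # noise relative to signal power
    distortion_power = 10 ** (thd_db / 10)  # harmonics relative to signal power
    return -10 * math.log10(noise_power + distortion_power)


if __name__ == "__main__":
    print(f"SINAD at the quoted bounds: {sinad_db(51.0, -68.0):.2f} dB")
```

Since the text gives SNR and THD as bounds, the result is a lower bound on SINAD, in line with the statement that it is higher than 50 dB.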

3.2 Timing measurement with asynchronous clock

Two tests are conducted to validate the proposed timing calibration method. In the first test, a predefined pulse is periodically generated by the DAC path of the ADDA board and sent to the ADC path for digitization. The pulse period is 2500 cycles of TX-CLK, i.e., 20,000 ns at 125 MHz. The timing information is measured for each pulse; ideally, the measured timing should consistently match the DAC pulse-sending timing with a fixed latency. Figure 12 shows an analysis of the time interval between consecutive pulses. The top plot shows the results of the integer cycle counter before and after calibration, without the fitting and fraction components, with values including 2499, 2500, and 2501 appearing after calibration. The red histogram in the bottom plot shows the results with the fitting component. The green histogram shows the result with the full calibration procedure, whose mean value is approximately 0.3 ps away from the ideal value of 20,000 ns. The σ value of the measured period is 1.37 ns, which is a few times smaller than the 8 ns sampling interval. This spread is caused by the ADC sampling rate, the slew rate of the digitized pulse, clock jitter, and signal noise.

Fig. 12

(Color online) Histogram of the measured pulse period

In the second test, a commercial signal generator (SDG6012X-E) is configured to generate periodic pulses and send them simultaneously to two ADDA boards. The arrival time of each pulse is measured using the backend system. Figure 13 shows histograms of the time differences between these two channels after calibration. With the full calibration procedure, the mean value is approximately 0.75 ns, which is mainly due to differences in cable length and in the latency of the FPGA traces. The σ value is 0.95 ns, indicating that the uncertainty of the single-channel timing measurement is approximately $\frac{0.95}{\sqrt{2}} = 0.67$ ns. The figure also shows that the fraction component improves the precision by more than a factor of two.

Fig. 13

(Color online) Histogram of the time difference measured by two independent ADDA boards
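The single-channel figure quoted for the second test follows from a short error-propagation step. It assumes that the two ADDA channels contribute independent timing errors of equal size $\sigma_{\mathrm{ch}}$, which is implied by the $\sqrt{2}$ factor but not stated explicitly:

```latex
\sigma_{\Delta t}^{2} = \sigma_{1}^{2} + \sigma_{2}^{2} = 2\,\sigma_{\mathrm{ch}}^{2}
\quad\Longrightarrow\quad
\sigma_{\mathrm{ch}} = \frac{\sigma_{\Delta t}}{\sqrt{2}}
= \frac{0.95~\mathrm{ns}}{\sqrt{2}} \approx 0.67~\mathrm{ns}
```

Any jitter common to both channels, for example from the shared signal generator, cancels in the difference and is therefore not included in this estimate.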

4 Conclusion and Outlook

A new system architecture was proposed based on transparent transmission on top of the data link layer, enabling remote waveform generation and digitization in multiple distributed frontend nodes with centralized data processing at the backend. A demonstration system utilizing a TLK2501 chip, enabling a 14-bit ADC and DAC running at 125 Msps, was designed and evaluated. Integration with a charge-sensitive preamplifier was carried out for digital pulse processing, showing slightly better performance than a commercial digital MCA. A new method for timing calibration was proposed to handle the clock-domain crossing issue and can be used in systems without a synchronized clock distribution. A system with a slow multiple-channel ADC AD4695 was built based on the proposed concept for the NνDEx experiment. Future work will involve designing frontend nodes with higher sampling rates using ADC and DAC chips that support the JESD protocol. The JESD core will be implemented in the central FPGA at the backend to enable control and data exchange with the ADC and DAC chips via transparent transmission.

References

[1] R. He, X. Niu, Y. Wang et al., Advances in nuclear detection and readout techniques. Nucl. Sci. Tech. 34, 205 (2023). https://doi.org/10.1007/s41365-023-01359-0