Introduction
The rapid development and deployment of fourth-generation light sources (4GLS) around the world has led to an increasing focus on studying collective beam instabilities in storage and booster rings. These advanced facilities, operating at the forefront of accelerator technology, demand a thorough understanding and mitigation of instabilities to maintain the beam quality and achieve the desired performance. As a result, beam-tracking studies aimed at analyzing these instabilities are essential for supporting the design and operation of such high-performance synchrotron light sources.
To facilitate these studies, various computational tools have been developed, including MBTRACK2 [1-5] and a multi-bunch macro-particle tracking code developed at Synchrotron SOLEIL. Originally written in C as MBTRACK [6], the code was later redesigned in Python with an Object-Oriented Programming (OOP) structure–enhancing usability and extending functionality–and renamed MBTRACK2. MBTRACK2 retains the capability to study both single and multi-bunch collective beam instabilities, with numerous new tracking options added to accommodate the evolving needs of modern accelerator research.
MBTRACK2 utilizes CPU parallel processing with MPI to manage the computational demands of tracking multiple bunches. However, as the number of bunches increases, this approach requires a large number of CPU cores to handle the computational load. The need for such a large number of cores introduces significant challenges, including high costs and complexity of maintaining the necessary hardware infrastructure. These challenges present practical difficulties for several research facilities.
GPU acceleration has emerged as a viable alternative to address these challenges. GPUs designed for high levels of parallelism can significantly enhance the performance of large-scale simulations. A key advantage of GPU acceleration is that it can often be implemented on a single PC, simplifying infrastructure requirements compared with the extensive hardware setups needed for large-scale CPU parallel processing. Various GPU-accelerated tracking simulation programs have been developed to leverage this capability, including MBTRACK-CUDA [7], a GPU-accelerated version of MBTRACK; STABLE [8], a MATLAB-based code with GPU acceleration features; APES [9], which combines GPU and CPU processing with MPI for hybrid parallel acceleration; and GPU implementations for ELEGANT [10], PyHEADTAIL [11, 12], Xsuite [13] and SixTrackLib [14].
However, typical GPU acceleration methods can experience inefficiencies when handling extremely large-scale simulations, primarily because of frequent CPU-GPU data transfers that introduce both latency and synchronization delays. In most GPU-accelerated beam-tracking codes, statistical calculations and tracking-related operations rely on such data movement, leading to communication overhead and hindering the continuous execution of GPU kernels, which ultimately reduces computational efficiency.
This paper presents MBTRACK2-CUDA [15], which performs all the necessary computations entirely on GPU CUDA cores, eliminating the need for frequent data transfers and minimizing synchronization delays. By maintaining both statistical calculations and tracking operations on the GPU, this design avoids unnecessary CPU-GPU interactions, thereby reducing communication overhead and enhancing computational efficiency. As a result, MBTRACK2-CUDA is well suited for large-scale tracking simulations, where minimizing data movement and maintaining continuous GPU execution are critical for performance.
Rapid advancements in GPU technology have played a crucial role in enabling such improvements. Since the mid-2010 s, GPU-accelerated beam tracking codes have been widely adopted, leading to substantial developments. As of the mid-2020s, GPUs have undergone significant advancements, with Dual-ported Video Random-Access Memory (VRAM) capacities reaching tens of gigabytes per GPU, and a dramatic increase in the number of processing cores. For instance, the NVIDIA GeForce RTX 4090, which we used in this research, features 16,384 CUDA cores and 24 gigabytes (GB) of Graphics Double Data Rate 6X (GDDR6X) VRAM, showcasing the immense progress in GPU capabilities that support our enhanced simulation methods. While such advanced capabilities were not feasible in the mid-2010 s, the current generation of GPUs now provides the necessary power to implement our novel approach, which involves executing all stages of the tracking simulation directly on the GPU without requiring data transfer between the CPU and GPU.
The remainder of this paper is organized as follows. In Sect. 2, we describe the specific code architecture and the tracking model of MBTRACK2-CUDA. In Sect. 3, we present the validation of MBTRACK2-CUDA and the benchmarking results against MBTRACK2 to demonstrate the code’s performance. We also analyze tracking simulations and results for single and multi-bunch cases, using the parameters of the High Energy Photon Source (HEPS) storage ring to study resistive-wall (RW) instability as an example of collective effects. This approach illustrates how MBTRACK2-CUDA can be utilized to investigate collective effects while noting that the resistive-wall study is based on assumptions regarding the beam pipe specifications, which may differ from those of the actual HEPS storage ring setup. Finally, Sect.. 4 provides summary and outlook.
Code architecture and tracking model in MBTRACK2-CUDA
Code architecture
MBTRACK2-CUDA is an independent CUDA-ported version of MBTRACK2, designed to harness the power of GPU computing. It provides a comprehensive set of tracking simulation functionalities capable of handling a large number of bunches and macro-particles efficiently on a stand-alone desktop computer. Although MBTRACK2-CUDA is a partial port of MBTRACK2 and does not include all the features of the original code, it effectively supports the core tracking processes needed for studying collective effects.
MBTRACK2 itself is an evolving codebase with ongoing development and the continuous addition of new features. Similarly, MBTRACK2-CUDA is designed to evolve alongside MBTRACK2 the aim of incorporating new advancements and functionalities over time.
Figure 1 illustrates three different computational methods for tracking simulations, each utilizing distinct approaches to parallel processing. Figure 1(a) shows the traditional CPU parallel processing, where each bunch is assigned to a separate CPU core, making it highly efficient for multi-bunch calculations. However, because each core is responsible for processing all macro-particles within its assigned bunch, handling a large number of macro-particles places a significant computational load on all allocated cores. Additionally, the number of CPU cores must be sufficient to cover all bunches, which can be a limiting factor. Depending on the code, if the CPU cores cannot cover all the bunches, the calculations may either not proceed or the remaining bunches will have to wait until the current bunches are processed, leading to potential inefficiencies.
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-F001.jpg)
Depending on the code, the GPU-accelerated method can be implemented in one of two ways. In the first approach, the GPU acceleration method used in MBTRACK-CUDA is shown in Fig. 1(b), which assigns all bunches to the GPU memory but processes only one bunch at a time, utilizing multiple GPU cores to handle macro-particles in parallel within each bunch. Although this method allows for rapid computation of individual bunches, it requires sequential processing for multiple bunches, which may limit the overall performance in large-scale multi-bunch simulations.
In contrast, the approach shown in Fig. 1(c), which was adopted in MBTRACK2-CUDA, enables fully parallel computations by processing all bunches and their respective macro-particles simultaneously on the GPU. This method allows for a more efficient utilization of GPU resources, significantly improving the simulation performance, particularly in large-scale scenarios with many bunches and macro-particles.
A flow chart of MBTRACK2-CUDA is presented in Fig. 2, which outlines the simulation process. The process begins with the CPU, similar to MBTRACK2, to generate bunches based on the machine parameters. However, once the bunch data were prepared, the entire tracking process was handed over to the GPU. From this point onward, the GPU handles all tracking computations, eliminating the need for further CPU-GPU data transfers and minimizing the CPU-GPU interactions. This design avoids the communication overhead and synchronization delays associated with frequent data exchanges, ensuring a more efficient simulation process.
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-F002.jpg)
When using MBTRACK2-CUDA for tracking simulations, it is important to ensure that the VRAM capacity of the GPU is not exceeded. If the GPU’s VRAM is surpassed, the code will resort to using CPU RAM, which can significantly degrade the performance. Therefore, it is advisable to select a model with sufficient GPU memory for tracking simulations.
Moreover, MBTRACK2-CUDA was designed to minimize overhead by performing all computations on a single GPU, and thus did not support multi-GPU setups. However, as of the mid-2020s, high-performance GPUs with ample memory for tracking simulations and a dramatically increased number of CUDA cores per GPU were released. With continued advancements in GPU technology, this limitation is becoming less of a concern as performance improvements continue to grow.
To implement the CUDA-based computation in MBTRACK2-CUDA, we utilized the Numba-CUDA library [16], which supports CUDA programming within the Python environment. Numba-CUDA is a powerful tool that enables high-performance computations by compiling Python code for GPU execution. It simplifies CUDA programming by allowing developers to write CUDA kernels directly in Python, thereby eliminating the need to switch to CUDA C/C++. This seamless integration with the Python ecosystem supports efficient GPU acceleration in Python-based workflow. In addition, Numba-CUDA manages memory allocation and optimization, offering fine-grained control over CUDA kernels, which enhances the efficiency of GPU-accelerated applications.
In MBTRACK2-CUDA, the entire tracking simulation process is implemented using multiple CUDA kernels. Each kernel handles specific aspects of the tracking simulation, such as transforming basic one-turn maps, computing collective effects, and managing macro-particles interactions. These kernels are executed sequentially in a turn-by-turn manner through a single CUDA stream, ensuring a coherent tracking progression.
Beam tracking model
Basic one-turn maps for 6D phase space transformations
The bunches we aim to track in MBTRACK2-CUDA are composed of macro-particles, each represented in a 6-dimensional (6D) phase space by the coordinates:_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M001.png)
The basic one-turn maps consist of a longitudinal map, transverse map, radio frequency (RF) cavity, and radiation damping with quantum excitation. These maps are calculated turn-by-turn (once per turn), and we present the formulae describing the coordinate transformations of the j-th single particle within the i-th bunch at each turn for the ultra-relativistic case.
We begin with a longitudinal map, where the longitudinal coordinates are updated as follows:_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M002.png)
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M003.png)
In the transverse map, we can write the coordinate transport in matrix form as follows:_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M004.png)
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M005.png)
In the longitudinal map, the energy gain from the RF cavities is not considered. To compensate for the energy loss caused by synchrotron radiation, we must calculate the energy gain provided by the RF cavities. If only the main RF cavity is installed, the relative energy deviation is updated as_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M006.png)
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M007.png)
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M008.png)
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M009.png)
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M010.png)
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M011.png)
As the final step of the basic one-turn maps, the radiation damping with quantum excitation is considered as follows [2]:_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M012.png)
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M013.png)
Wakefields of resistive-wall
After establishing the basic one-turn maps, we then incorporate the wakefield effects arising from collective interactions. In this study, we explore the effects of resistive-wall as an example, which is one of the most impactful types of wakefields in synchrotrons. To this end, we introduce the simplest model for a resistive wall: the circular beam-pipe model.
The resistive wall impedances per unit length for the longitudinal and transverse planes in a single-layered infinitely thick circular beam pipe of radius b with direct current (DC) conductivity σc are given for an ultrarelativistic beam as follows [20, 21]:_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M014.png)
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M015.png)
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M016.png)
From the Eqs. (13) and (14), the resistive wall wake functions for the longitudinal and transverse planes can be obtained by performing inverse Fourier transform [22]. The classic resistive wall wake function formulae include integral terms known as diffusion terms [23], which have traditionally been calculated using numerical integration. However, Ivanyan and Tsakanov [24] solved these integrals using the Faddeeva function. They also derived a series of expansion formulas. The wake function formulae discussed thus far are as follows._2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M017.png)
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M018.png)
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M019.png)
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M020.png)
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M021.png)
Calculating the influence of intra-bunch wakefields
A wake function is a normalized potential defined within a two-particle model, which represents the wake kick exerted by a source charge on a target charge. To compute the instabilities within a single bunch using wake functions, the model must be extended to account for a given bunch distribution. The normalized potential in the presence of a given bunch distribution is referred to as wake potential. The wake potentials in the longitudinal and transverse planes are defined as the convolutions of the longitudinal bunch distribution with wake functions, as follows [27]:_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M022.png)
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M023.png)
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M024.png)
In practical computations, continuous convolution must be converted into discrete convolution. To perform these discrete calculations, we sliced the given bunch distribution into M bins, each with a width of Δτ in the τ domain. For the mth bin, let _2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M025.png)
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M026.png)
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M027.png)
Both MBTRACK2 and MBTRACK2-CUDA implemented a dynamic binning scheme that adjusted the bin width according to the bunch length while maintaining a fixed total number of bins. To obtain a reliable longitudinal bunch distribution, both codes perform a first binning with approximately 100 bins, where the distribution is determined by counting the number of macro-particles in each bin. However, because wake functions exhibit nonlinear characteristics on a smaller scale, the resolution provided by the first binning alone is not sufficient for accurate convolution. Therefore, a second binning process was carried out through linear interpolation applied to the longitudinal bunch distribution, thereby increasing the total number of bins.
In this way, the final bin width resolution is determined by the number of interpolated bins rather than the initial bin count. This interpolation ensured that the wake functions could be approximated as linear within each bin. Thus, the number of interpolated bins should be determined by the bunch length and wake function characteristics. In general, MBTRACK2 requires 104 ~ 105 interpolated bins, which can lead to heavy convolution calculations. Because convolution is inherently a series operation, and such heavy computations are not feasible on a GPU, an alternative method was adopted in MBTRACK2-CUDA to reduce the computational load. This method first calculates the mean of the wake functions within each interpolated bin as follows:_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M028.png)
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M029.png)
Finally, from the wake potentials, we calculated the effects of the longitudinal and transverse kicks on single-bunch instabilities. An extra interpolation step during the kick calculation further refines this process by assigning a unique kick to each macro-particles. The changes in the relative energy deviation and transverse momenta due to these kicks are given as_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M030.png)
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M031.png)
For resistive wall wakefields, the means of the wake functions within each interpolated bin can be expressed analytically. These are derived by substituting Eqs. (16) and (17) into Eq. (27) and (28). The resulting expressions are as follows._2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M032.png)
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M033.png)
Special functions such as the Faddeeva function are challenging to compute within a GPU kernel. To handle this, we ported the CERN library’s Fortran90 (F90) implementation, which had already been adapted to CUDA C in the CUDA version of PyHEADTAIL [28], to CUDA Python in MBTRACK2-CUDA for efficient GPU computation.
The series expansion expressions are straightforward to implement within a CUDA kernel and offer a performance advantage, running few times faster than using the Faddeeva function. For double-precision calculations (float64), we found that the optimal number of terms in the series expansion was k=25 for Eq. (31) and up to k=24 for Eq. (32), thereby ensuring an accuracy of up to _2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M034.png)
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M035.png)
MBTRACK2-CUDA is also capable of performing convolution calculations when wake functions are provided as numerical data rather than built-in functions, allowing for arbitrary wake functions to be used. In such cases, the wake functions must be input as arrays, and these arrays should represent the cumulative sum of the wake functions, meaning that they need to be pre-integrated. Additionally, when using this approach, MBTRACK2-CUDA requires the user to specify the total length of the wake functions because this length is not automatically adjusted by the code.
Calculating the influence of coupled-bunch wakefields
To consider the effects of coupled-bunch wakefields in a multi-bunch system, we model each target bunch using a point charge representation, while the source bunch remains represented by macro-particles. This approach is valid when the dominant frequency of the long-range wake is sufficiently low, such that its wavelength is much larger than the bunch length, ensuring that the wake interacts with the bunch as a whole, rather than being sensitive to its internal structure. For instance, this assumption holds true for resistive-wall wakefields. In this model, all macro-particles within the source bunch impart the same kick to the target bunch. Additionally, because wakefields generated by a bunch do not dissipate when the bunch circulates around the synchrotron ring multiple times, the influence of residual wakefields from previous turns, which is often referred to as wake memory or wake history [2, 6], must also be considered.
Let the source bunch index be i, target bunch index be i’, and total number of bunches be Nb. To track the wake memory from the present back to the past, we introduce the turn memory index l, where l=0 represents the current turn, and l=Nt represents the oldest turn considered in the calculations. When calculating up to the turn l=Nt, the kick factors that all macro-particles in i’-th target bunch receive uniformly at the current turn are given by the following:_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M036.png)
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M037.png)
In conclusion, in a coupled-bunch scenario, the changes in the relative energy deviation and transverse momenta of all macro-particles within the i’-th target bunch can be calculated as follows:_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M038.png)
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M039.png)
Beam tracking simulations of resistive-wall instability using the HEPS storage ring parameters
Validation and benchmarking of MBTRACK2-CUDA
Before evaluating the effects of resistive-wall instability using the machine parameters of the HEPS storage ring with MBTRACK2-CUDA, we first validated the code and conducted a benchmark against MBTRACK2. The machine parameters used in this study for the HEPS storage ring are presented in Table 1 [29, 30]. For the resistive-wall instability, we used a simplified model in which the beam pipes were assumed to have a circular geometry and were made entirely of 316LN stainless steel. The analytical expressions of the intra-bunch resistive-wall wake potentials for an ideal Gaussian bunch can be obtained by substituting a Gaussian distribution into the longitudinal bunch distribution of Eqs. (21) and (22) as follows [31]._2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M040.png)
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M041.png)
| Parameters | Values |
|---|---|
| Circumference, L (m) | 1360.4 |
| Beam energy, E0 (GeV) | 6 |
| Bunch length, σz,0 (mm) | 5.02 |
| Energy spread, σδ,0 | 1.02×10-3 |
| Natural emittance, |
34.82/3.48 |
| Betatron tune, |
115.15/104.29 |
| Corrected chromaticity, |
4.95/5.01 |
| Energy loss per turn, U0 (MeV) | 2.64 |
| Damping time, |
10.86/20.62/18.71 |
| Momentum compaction, αc | 1.83×10-5 |
| Harmonic number, h | 756 |
| Main RF frequency, fRF (MHz) | 166.6 |
| Average beta, |
4.24/5.98 |
| Beam pipe conductivity, σc (S/m) | 1.3×106 |
| Beam pipe radius, b (mm) | 11 |
Figure 3 compares the wake potentials calculated by MBTRACK2-CUDA using both the Faddeeva expressions and the series expansion expressions with the analytical results from Eqs. (39) and (40) for an ideal Gaussian bunch. In these simulations, MBTRACK2-CUDA employed two million macro-particles to represent the Gaussian bunch, utilizing 100 initial bins and 6,000 interpolated bins for the wake potential calculations. These results show almost identical agreement, validating the accuracy of MBTRACK2-CUDA in calculating the intra-bunch resistive-wall wakefields.
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-F003.jpg)
We benchmarked MBTRACK2-CUDA against MBTRACK2 for a single bunch with a current of 5 mA in the presence of intra-bunch resistive-wall wakefields above the microwave instability threshold in a system with a single main RF cavity with a peak voltage of 3.176 MV. The simulations were performed for over 30,000 turns, and the results are shown in Fig. 4. Both codes used 2 million macro-particles and 100 initial bins. However, MBTRACK2 utilizes 105 interpolated bins, whereas MBTRACK2-CUDA uses 6×103 interpolated bins. For the intra-bunch resistive-wall wake functions, MBTRACK2 used the Faddeeva function formulae in Eqs. (16) and (17), whereas MBTRACK2-CUDA employed the series expansion formulae in Eqs. (31) and (32). The simulations showed good agreement in terms of the bunch length and energy spread. However, a discrepancy in transverse emittance was observed between the two codes. This difference arises from the use of distinct RNG algorithms for quantum excitation in each code. Despite this variation, the amplitude of the quantum noise was consistent between the simulations, ensuring that the discrepancy did not significantly impact the overall results.
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-F004.jpg)
The system used for the simulations consisted of an Intel Xeon w5-3423 CPU and an NVIDIA GeForce RTX 4090 GPU with 24 GB of GDDR6X VRAM, featuring an error correction code (ECC) enabled. For the simulation of a single bunch with 2×106 macro-particles undergoing an intra-bunch resistive-wall wakefield, MBTRACK2 took 17 h, 38 min, and 25 s (63,505 s), respectively. In contrast, MBTRACK2-CUDA, following the same procedure, completed the simulation in only 35 min and 13 s (2,113 s), demonstrating a remarkable reduction in the computation time. Following this, tracking simulations in the presence of resistive-wall instability for both single- and multi-bunch cases with 105 macro-particles were performed, and the computation times were compared, as shown in Table 2.
| Bunches | MBTRACK2 | MBTRACK2-CUDA |
|---|---|---|
| 1 | 4,079 s | 959 s |
| 2 | 6,920 s | 1,578 s |
| 5 | 7,885 s | 2,041 s |
| 10 | 10,767 s | 2,117 s |
| 100 | – | 6,658 s |
| 340 | – | 19,346 s |
| 680 | – | 36,442 s |
The default data type for beam-tracking computations in MBTRACK2-CUDA was double-precision (float64). However, users can choose to use single-precision (float32) to store the 6D phase-space coordinates and RNG. When the single-precision option is selected, the tracking computations are still performed with double-precision for accuracy, and the 6D phase-space coordinates are updated in double-precision before being converted back to single-precision for storage. This approach helps reduce the VRAM consumption and speed up the process, although it may slightly increase the uncertainty in the results.
To validate the code for simulating the transverse coupled-bunch instability caused by resistive-wall impedance in a multi-bunch environment, we disabled the beam loss caused by the beam pipe aperture and intra-bunch resistive-wall wakefields. In addition, we set the chromaticity to zero and applied a uniform filling pattern for the beam-tracking simulation. To calculate the growth rate in this environment, an exponential fit was applied to the center of mass from the tracking result in the transverse direction, as shown in Fig. 5. The growth rate was determined using the following equation:_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M042.png)
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M043.png)
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-F005.jpg)
Because the growth rate obtained from the simulation was affected by the resistive-wall memory length, the simulation was repeated while varying the number of memory turns. The results are presented in Fig. 6 show that as the number of memory turns increases, the simulated growth rate approaches the theoretical value obtained by Eq. (42) for both the horizontal and vertical planes. Although the results exhibit oscillations rather than perfect convergence, the amplitude of these oscillations remains of the order of 10 s-1, allowing for a reliable determination of the growth rate. A similar discussion on this behavior can be found in Ref. [6].
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-F006.jpg)
Single-bunch tracking results of intra-bunch resistive-wall wakefields
We investigated the impact of intra-bunch resistive-wall wakefields on a single bunch using the machine parameters of the HEPS storage ring through tracking simulations conducted in a system featuring a single main RF cavity. To study the collective effects at high single-bunch currents, simulations often require millions of macro-particles [33, 34]. This is especially true when investigating single-bunch currents from low to high, particularly above the microwave instability threshold, where the simulations become highly sensitive to numerical noise.
To determine the appropriate number of macro-particles, we conducted tracking simulations to compute the energy spread as a function of single-bunch current for different numbers of macro-particles. From this, we identify the convergence point, as shown in Fig. 7(b), where the energy spread converges around 1 million macro-particles. As mentioned earlier, for the purpose of the single-bunch instability study in this study, we utilized two million macro-particles. The mean value and standard deviation over the last 10,000 turns were used to generate the graphs.
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-F007.jpg)
We obtained the equilibrium longitudinal bunch distributions at different single-bunch currents through tracking simulations in the presence of potential-well distortion (PWD). The theoretical equilibrium longitudinal bunch distributions affected by the intra-bunch resistive-wall wakefield can be calculated using the Haïssinski equation [35]:_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M044.png)
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-M045.png)
Figure 8 shows a comparison of the equilibrium longitudinal bunch distributions between the results of the tracking simulations and those obtained from the Haïssinski equation. To provide further insight into the bunch dynamics, the variation in bunch length and energy spread for different single-bunch currents is presented in Fig. 7. From Fig. 7(b), it can be observed that the microwave instability (MWI) threshold occurs at approximately 1.5 mA. Below this threshold, as illustrated in Fig. 8 shows that bunch lengthening is predominantly caused by PWD owing to the intra-bunch resistive-wall wakefield. The equilibrium longitudinal bunch distributions obtained by tracking with MBTRACK2-CUDA and solving the Haïssinski equation using pycolleff demonstrate excellent agreement under these conditions.
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-F008.jpg)
However, at currents exceeding the MWI threshold, a difference began to emerge. The tracking simulations revealed a distorted bunch structure in the head region owing to MWI. Additionally, the MWI causes an overall increase in the bunch length when compared to the Haïssinski equation results. This discrepancy in bunch length is shown in Fig. 7(a), because the Haïssinski equation does not account for the effects of the MWI.
To study the transverse mode-coupling instability (TMCI), we first set the chromaticity to zero and tracked the center of mass while varying the single-bunch current. We identified the threshold single-bunch current where growth occurred and analyzed the turn-by-turn data of the center of mass using fast Fourier transform (FFT). By calculating the signal-to-noise ratios (SNR) of the FFT magnitudes, we observed betatron tuning shifts, as shown in Figs. 9(a) and (b). The corresponding shifts were especially noticeable as the single-bunch current increased in the presence of resistive-wall impedance. However, no changes in the fractional tune were observed in the absence of resistive-wall impedance.
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-F009.jpg)
We normalized these betatron tune shifts by the synchrotron tune, which accounts for the incoherent synchrotron tune shift caused by PWD, and plotted the betatron mode shift spectrograms, presented in Figs. 9(c) and (d). Using the changing bunch length information derived from the Haïssinski equation, we performed a mode-shift analysis with pycolleff and overlaid the results on the same graphs. The mode-shift results obtained from the two approaches matched well in the region below the MWI threshold.
In addition, to conduct in-depth analyses of the effects of transverse instability, the transverse growth rate and single-bunch threshold current were computed using MBTRACK2-CUDA and pycolleff, respectively. The results were in excellent agreement with the vertical growth rate presented in Fig. 10(a) for varying single-bunch current with PWD and zero chromaticity. The single-bunch vertical-threshold current calculations shown in Fig. 10(b) for varying chromaticity, both with and without PWD, also yielded consistent results.
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-F010.jpg)
Simultaneous tracking results of intra-bunch and coupled-bunch resistive-wall wakefields in a multi-bunch system
Tracking simulations were conducted to evaluate the effects of intra-bunch and coupled-bunch resistive-wall wakefields on multi-bunches using the machine parameters of the HEPS storage ring under the high-brightness mode configuration. This configuration consists of a filling pattern with a 680-bunch train and 76 empty bucket gaps out of a total of 756 buckets [29, 39]. In the high-brightness mode configuration, the current per bunch was approximately 0.29 mA. As shown in Fig. 7(b), for 105 macro-particles per bunch, the energy spread converged sufficiently in the region where the current per bunch was below 1 mA. Based on this observation, we set the number of macro-particles per bunch to 105 in the tracking simulation. In addition, 100 initial bins and 6×103 interpolated bins were used to calculate the effects of the intra-bunch wakefields. The horizontal and vertical chromaticities were configured to 4.95 and 5.01, respectively.
With the main RF cavity set to a peak voltage of 3.176 MV in the presence of both intra-bunch and coupled-bunch resistive-wall wake fields, the multi-bunch tracking simulation results are shown in Fig. 11. In equilibrium, the bunch lengths increased by a factor of approximately 1.5, while the transverse emittances for all bunches remained at their initial values, indicating the feasibility of stable operation.
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-F011.jpg)
Subsequently, we conducted additional tracking simulations that included the active third harmonic cavity without considering the beam-loading effect, with the main RF cavity set to a peak voltage of 3.393 MV and the active third harmonic cavity set to a peak voltage of 0.638 MV. These simulations were performed with and without resistive-wall effects, and the results for the case with resistive-wall effects are presented in Fig. 12.
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-F012.jpg)
In terms of bunch length, the simulation without resistive-wall instability showed that all bunches increased in length from the initial value of 5.02 mm to 29.72 mm, approximately 5.92 times longer due to the effect of the active third harmonic cavity, which closely matched the result of Ref. [29].
When both intra-bunch and coupled-bunch resistive-wall wakefields were present, with the active harmonic cavity in operation, as shown in Fig. 12(a), the bunch lengths of all bunches increased to 33.99 mm, approximately 6.77 times longer. Figure 12(b) shows that the energy spreads remained stable in the equilibrium state, and Fig. 12(c, d) demonstrate that the transverse emittances were consistently maintained throughout the tracking process, indicating that the HEPS storage ring with the active harmonic cavity in operation could be stably operated under resistive-wall wakefield conditions.
To investigate the impact of transverse instability when both intra-bunch and coupled-bunch resistive-wall wakefields were present in multi-bunch configurations, we increased the ring current while disabling the beam loss caused by the beam pipe aperture. We then examined the vertical emittance near the threshold at which it started to increase, both with and without an active harmonic cavity, as shown in Fig. 13. The results indicate that the ring current needs to be increased to approximately three times its usual value to observe vertical instability. Furthermore, the presence of an active harmonic cavity slightly increased the instability threshold.
_2026_01/1001-8042-2026-01-6/alternativeImage/1001-8042-2026-01-6-F013.jpg)
Summary and outlook
In this study, we developed MBTRACK2-CUDA, a novel approach that fully parallelizes the entire tracking computation process solely on a GPU, building upon the existing MBTRACK2 framework. The performance of the code was remarkable, demonstrating its effectiveness and practicality through tracking simulations addressing resistive wall impedance by utilizing the machine parameters of the HEPS storage ring.
As GPU architectures continue to evolve, with an increased number of CUDA cores, improved core performance, and expanded memory capacities, fully leveraging these advancements will be beneficial for GPGPU computations. In this context, we believe that executing the entire tracking computation in a fully parallel manner on the GPU is an efficient and future-oriented approach that maximizes GPGPU performance. This approach will pave the way for faster solutions in complex simulations involving large datasets by reducing data transfer overhead.
mbtrack2, a collective effect library in Python
.Study of transverse beam instabilities in the SOLEIL II synchrotron booster
. Thesis,MBTRACK2: A longitudinal RF beam loading tracking code. BNL-224788-2023-TECH
(2023).MBTRACK2 - Application on EIC 5 GeV electron ring reverse phase configuration. BNL-225173-2024-TECH
(2024).MBTRACK2
. https://doi.org/10.5281/zenodo.12749990Simultaneous computation of intrabunch and interbunch collective beam motions in storage rings
. Nucl. Instrum. Methods Phys. Res. Sect. A 806, 221–230 2016). https://doi.org/10.1016/j.nima.2015.10.029Calculation of longitudinal collective instabilities with mbtrack-cuda
. Nucl. Instrum. Methods Phys. Res. Sect. A 922, 345–351 2019). https://doi.org/10.1016/j.nima.2019.01.041Graphics-processing-unit-accelerated simulation for longitudinal beam dynamics of arbitrary bunch trains in electron storage rings
. Phys. Rev. Accel. Beams 24,Strong–strong beam–beam simulations with lattices of circular e+e- colliders
. Nucl. Instrum. Methods Phys. Res. Sect. A 1064,Current status of the GPU-accelerated ELEGANT
.Parallelisation of PyHEADTAIL, a collective beam dynamics code for particle accelerator physics
. arXiv preprint, 2016. https://doi.org/10.48550/arXiv.1610.05801Simulating collective effects on GPUs
. Thesis,Xsuite: An Integrated Beam Physics Simulation Framework
.Optimising and Extending a Single-Particle Tracking Library for High Parallel Performance
.MBTRACK2-CUDA
. https://doi.org/10.5281/zenodo.14802810Numba: A LLVM-based Python JIT compiler
.Robinson instabilities with a higher-harmonic cavity
. Phys. Rev. ST Accel. Beams 4,Further scramblings of Marsaglia’s xorshift generators
. J. Comput. Appl. Math. 315, 175–181 2017). https://doi.org/10.1016/j.cam.2016.11.006Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator
. ACM Trans. Model. Comput. Simul. 8, 3–30 1998). https://doi.org/10.1145/272991.272995Using surface impedance for calculating wakefields in flat geometry
. Phys. Rev. ST Accel. Beams 18,Resistive wall wake fields
. Thesis,The short-range resistive wall wakefields. SLAC-PUB-7074
(1995).Analytical treatment of resistive wake potentials in round pipes
. Nucl. Instrum. Methods Phys. Res. Sect. A 522, 223–229 2004). https://doi.org/10.1016/j.nima.2003.12.006Wake and impedance. SLAC-PUB-8683
(2000).Review of CPU and GPU Faddeeva implementations
.Equilibrium electron beam parameters of the High Energy Photon Source
. Radiat. Detect. Technol. Methods 7, 279–287 2023). https://doi.org/10.1007/s41605-022-00374-wPreliminary design of HEPS storage ring vacuum chambers and components
.Wake fields and ohmic losses in round vacuum chambers. DESY-HERA-92-11
(1992).Collective effects in a diffraction-limited storage ring
. J. Synchrotron Rad. 21, 937–960 2014). https://doi.org/10.1107/S1600577514015215Terahertz scale microbunching instability driven by high resistivity nonevaporable getter coating resistive-wall impedance
. Phys. Rev. Accel. Beams 27,Influence of the coating resistivity on beam dynamics
. Phys. Rev. Accel. Beams 26,Exact longitudinal equilibrium distribution of stored electrons in the presence of self-fields
. Nuov. Cim. B 18, 72–82 1973). https://doi.org/10.1007/BF02832640Equilibrium of longitudinal bunch distributions in electron storage rings with arbitrary impedance sources and generic filling patterns
. Phys. Rev. Accel. Beams 26,Measurements of collective effects related to beam coupling impedance at SIRIUS
.pycolleff and cppcolleff: modules for impedance analysis and wake-field induced instabilities evaluation
. https://doi.org/10.5281/zenodo.8088076Preliminary study on detection and cleaning of parasitic bunches
. Nucl. Sci. Tech. 32, 114 2021). https://doi.org/10.1007/s41365-021-00948-1The authors declare that they have no competing interests.

