I. INTRODUCTION
As technology scales downward in modern integrated circuits, such as SRAM, the minimum charge needed to upset a device within a unit memory cell decreases, while the influence of charge sharing on adjacent unit memory cells increases [1-5]. Therefore, advanced devices (especially deep-submicrometer) are much more sensitive to the energy deposition in the device by heavy ion irradiation, and this critically restricts the devices’ use in space.
Many methods have been proposed to mitigate the single event upsets (SEUs) that occur in advanced devices. Bit interleaving is a commonly accepted approach to mitigating multiple-bit upsets (MBUs) in a data word. In this architecture, the bits of a data word are not physically adjacent but are interleaved with the bits of other data words. In this way, every MBU of physically adjacent memory cells is transformed into multiple single-bit upsets (SBUs) in different memory words.
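The effect of interleaving on an MBU can be illustrated with a minimal sketch. The 4-way column layout, the constants and the function names below are assumptions chosen for illustration; they do not describe the actual physical layout of the devices under test.

```python
# Hypothetical layout: INTERLEAVE logical words share one physical row,
# and neighbouring columns always belong to different words.
INTERLEAVE = 4   # assumed interleaving factor
WORD_BITS = 8    # assumed bits per logical word

def physical_to_logical(column):
    """Map a physical column of a row to (word index, bit index)."""
    return column % INTERLEAVE, column // INTERLEAVE

def words_hit(columns):
    """Group upset physical columns by the logical word they belong to."""
    hits = {}
    for col in columns:
        word, bit = physical_to_logical(col)
        hits.setdefault(word, []).append(bit)
    return hits

# A 3-cell MBU on physically adjacent columns 5, 6 and 7
mbu = words_hit([5, 6, 7])
print(mbu)  # {1: [1], 2: [1], 3: [1]}
```

Each of the three adjacent cells lands in a different logical word, so every word sees only a single-bit upset, which a single-error-correcting code can handle.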
Error correcting code (ECC) based on the Hamming code is common in high-reliability and high-performance applications. As a relatively simple yet powerful ECC, it corrects a single-bit error anywhere within the codeword.
Therefore, MBUs, which are now a major reliability problem in commercial and industrial electronics, can be transformed into multiple SBUs that appear as uncorrelated events to the ECC algorithm and can then be corrected [2, 5-7].
In this hardening approach, an ECC module combined with the bit interleaving architecture can be used to resolve SBUs in advanced process node devices for high-reliability and high-performance applications.
To observe and compare the SEUs induced by heavy ions in SRAMs of different process nodes, and to quantitatively evaluate the applicability and effectiveness of single-bit ECC based on the Hamming code in advanced process SRAMs, we irradiated four SRAMs from ISSI with a 12C ion beam. Two of them, manufactured in 130 nm and 150 nm processes, are the most advanced-process devices among ISSI's SRAMs without an ECC module, while the other two are 65 nm SRAMs with an ECC module.
II. EXPERIMENTAL BACKGROUND
Four industrial SRAMs, produced in high-performance CMOS technology, were irradiated at normal incidence in vacuum by 12C beams from the Heavy Ion Research Facility in Lanzhou (HIRFL). The 12C ions had an effective linear energy transfer (LET) of 1.8 MeV·cm²/mg. Table 1 lists the SRAMs under test. The IS2ME is a 2 Mbit SRAM organized as 131072 words by 16 bits, with ECC, at the 65 nm process node; the IS4ME is a 4 Mbit SRAM organized as 262144 words by 16 bits, with ECC, at the 65 nm process node; the IS2M is a 2 Mbit SRAM organized as 131072 words by 16 bits, without ECC, at the 150 nm process node; and the IS4M is a 4 Mbit SRAM organized as 262144 words by 16 bits, without ECC, at the 130 nm process node. The two SRAMs with ECC are the main objects of observation, and the other two serve as reference devices. All four SRAMs belong to the IS61WV series made by ISSI, and the ECC function in these devices is implemented with a Hamming code, a relatively simple yet powerful ECC that can correct any single-bit error in one codeword.
Device | Process node (nm) | Capacity (Mbits) | ECC | Abbreviation |
---|---|---|---|---|
61WV12816EDBLL-10TLI | 65 | 2 | with | IS2ME |
61WV25616EDBLL-10TLI | 65 | 4 | with | IS4ME |
61WV12816DBLL-10TLI | 150 | 2 | without | IS2M |
61WV25616BLL-10TLI | 130 | 4 | without | IS4M |
The SRAMs were tested with an all-"1" data pattern (blanket pattern) at a supply voltage of 3.3 V, and the operating frequency was fixed at 20 MHz. In the static test mode, the devices were written before the beam shot and read periodically throughout it (a technique often referred to as multiple-read) [1, 8, 9]. The erroneous data found during the test were stored in another RAM in the test system (referred to as the mirrored RAM relative to the SRAM under test), which served as the reference data for the next read cycle. The test flow (Fig. 1) distinguishes SBU, MBU and single event latch-up (SEL). All upset events were recorded with a timestamp and bitmap location.
[Fig. 1: Test flow distinguishing SBU, MBU and SEL]
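The multiple-read bookkeeping with a mirrored RAM can be sketched as follows. This is only an illustration under stated assumptions: the list-based memories, the function name `read_cycle`, and the rule that labels more than one newly flipped bit in a word as an MBU are hypothetical, and SEL detection is not modeled.

```python
WORD_BITS = 16  # word width of the SRAMs under test

def read_cycle(dut, mirror, timestamp, log):
    """Compare one read-out of the device under test with the mirrored RAM.

    Only bits that changed since the previous read are logged; the mirror is
    then updated so that it is the reference for the next read cycle.
    """
    for addr, (read, ref) in enumerate(zip(dut, mirror)):
        diff = read ^ ref
        if diff == 0:
            continue
        flipped = [b for b in range(WORD_BITS) if (diff >> b) & 1]
        kind = "SBU" if len(flipped) == 1 else "MBU"  # assumed criterion
        log.append((timestamp, addr, flipped, kind))
        mirror[addr] = read

# Written once with the all-"1" pattern, then read repeatedly (static mode)
dut = [0xFFFF] * 4
mirror = list(dut)
log = []
dut[2] ^= 1 << 3            # first upset in word 2
read_cycle(dut, mirror, 1, log)
dut[2] ^= 1 << 7            # later upset in the same word
read_cycle(dut, mirror, 2, log)
print(log)  # [(1, 2, [3], 'SBU'), (2, 2, [7], 'SBU')]
```

Because the mirror is refreshed after every read, two upsets that accumulate in the same word at different times are logged as two separate SBUs rather than as one MBU.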
III. RESULTS AND DISCUSSION
A. The high efficiency of the ECC module
SEU cross sections of the four SRAMs are shown in Fig. 2. The SRAMs without an ECC module are much more sensitive to the irradiation than the devices with one: their SEU cross sections are on the order of 10⁻⁹ cm²/bit, compared with 10⁻¹¹ cm²/bit for the SRAMs with an ECC module. The IS2ME and IS4ME, however, are produced in a more advanced 65 nm process than the IS2M (150 nm process node) and IS4M (130 nm process node), and with technology scaling the number of upsets per chip increases because of higher circuit density and sensitivity. Therefore, the sharp contrast between the two groups of data should be attributed to the high efficiency of the ECC module.
[Fig. 2: SEU cross sections of the four SRAMs]
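For readers unfamiliar with the quantity, a per-bit cross section is conventionally the number of upsets divided by the product of ion fluence and number of bits. The sketch below assumes that definition; the upset count and fluence are placeholder values chosen only to land in the 10⁻⁹ cm²/bit range and are not measured data from this experiment.

```python
def seu_cross_section(upsets, fluence, bits):
    """Per-bit SEU cross section in cm^2/bit: upsets / (fluence * bits)."""
    return upsets / (fluence * bits)

BITS_2M = 131072 * 16   # 2 Mbit organization given for the IS2M / IS2ME

# Placeholder inputs (NOT experimental values): 2100 upsets at 1e6 ions/cm^2
sigma = seu_cross_section(upsets=2100, fluence=1.0e6, bits=BITS_2M)
print(f"{sigma:.2e} cm^2/bit")  # 1.00e-09 cm^2/bit
```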
B. The ineffectiveness of ECC module
Only 1-bit upsets in a data word were detected in the devices without an ECC module in this experiment, whereas upset events involving 1-, 2- and 3-bit errors occurred in the devices with an ECC module. Fig. 3 shows the measured and theoretical distributions of bits per upset event (as percentages of total events). In the following discussion, the words "upset" and "error" are used in distinct senses: an "upset" is the real change that occurs in a memory cell, and an "error" is the wrong data finally read out from the memory.
[Fig. 3: Measured and theoretical distributions of bits per upset event]
1. The fundamental reason
To discuss the experimental results, we make the following assumptions:
1. Considering the beam energy of 12C and the bit interleaving architecture, normally incident ions do not affect adjacent memory cells simultaneously. MBUs are therefore not expected to occur within a codeword at any time in this experiment [2, 5-7].
2. The static mode used in this test means that only one write operation is performed per test cycle, while the ECC module does not correct or rewrite the memory itself [1] but only corrects the "error" bit(s) at the output. When the data are read out through the ECC module, the memory remains in its upset state until a new write command arrives with new data. Therefore, if another bit upset occurs in the same word, the ECC module based on the Hamming code, which can correct only a single-bit error, no longer functions correctly. The failure of the ECC module is thus an accumulation effect caused by several SBUs in a word at different times. In addition, as the ECC functional block diagram shows (Fig. 4, taken from the datasheet of the devices with an ECC module), the ECC module uses the (12, 8) Hamming code.
[Fig. 4: ECC functional block diagram, from the datasheet of the devices with ECC module]
Based on the time structure of the cyclotron and the upstream scanning magnets, the incident ions have a uniform temporal and spatial distribution in the flux range used, so each SBU can be regarded as an independent random event.
For independent random events, if the upset probability of a bit is p (p ≪ 1), the probability that r bit upsets occur in an n-bit codeword is

$$P(r) = \binom{n}{r}\, p^{r} (1-p)^{n-r}.$$

From the results for the IS2M and IS4M, about 200 ions cause one bit upset, to order of magnitude. Assuming that this probability also applies to the IS2ME and IS4ME, we have p = 5×10⁻³. Then, the probability of two SBUs occurring at different times in one codeword is

$$P(2) = \binom{n}{2}\, p^{2} (1-p)^{n-2}, \tag{1}$$

and the probability of three SBUs occurring at different times in one codeword is

$$P(3) = \binom{n}{3}\, p^{3} (1-p)^{n-3}. \tag{2}$$

The results of Eq. (1) and Eq. (2) differ by about two orders of magnitude between r = 2 and r = 3. Three or more SBUs occurring at different times in one codeword is therefore of very low probability, and such events are omitted from the analysis of this experiment.
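The comparison between r = 2 and r = 3 can be checked numerically with the binomial distribution. The sketch below assumes n = 12, the length of one (12, 8) codeword, which the text does not state explicitly for this step; the function name `p_upsets` is likewise only illustrative.

```python
from math import comb

def p_upsets(r, n=12, p=5e-3):
    """Binomial probability of exactly r upset bits in an n-bit codeword.

    n = 12 is an assumption (one (12, 8) codeword); p = 5e-3 is from the text.
    """
    return comb(n, r) * p**r * (1 - p) ** (n - r)

p2 = p_upsets(2)
p3 = p_upsets(3)
print(f"P(2) = {p2:.2e}, P(3) = {p3:.2e}, ratio = {p2 / p3:.1f}")
```

Under these assumptions P(2) is of order 10⁻³ and P(3) of order 10⁻⁵, which is why only the two-upset case is carried into the failure-mode analysis.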
Therefore, the fundamental reason for the problem is that a 2-bit upset in a codeword defeats the ECC module based on the (12, 8) Hamming code.
2. Parsing the problem
Figure 5 shows a basic memory architecture with an ECC module based on the Hamming code [10]. Table 2 gives the usual relationship between the syndrome vector and the single-error location.
S3S2S1S0 | Error location | S3S2S1S0 | Error location |
---|---|---|---|
0001 | P0 | 1000 | P3 |
0010 | P1 | 1001 | D4 |
0011 | D0 | 1010 | D5 |
0100 | P2 | 1011 | D6 |
0101 | D1 | 1100 | D7 |
0110 | D2 | — | — |
0111 | D3 | 0000 | No error |
[Fig. 5: Basic memory architecture of an ECC module utilizing Hamming code]
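Assuming the 12-bit codeword is D7D6D5D4P3D3D2D1P2D0P1P0, where the 8-bit data word is the vector D and the 4-bit check word is the vector P, the syndrome vector S can be generated from the data word and the check word as [11]:

$$(S_3\; S_2\; S_1\; S_0)^{T} = H \cdot (D_7\; D_6\; D_5\; D_4\; P_3\; D_3\; D_2\; D_1\; P_2\; D_0\; P_1\; P_0)^{T} \tag{3}$$

which, with addition taken modulo 2, means

$$\begin{aligned}
S_3 &= P_3 \oplus D_4 \oplus D_5 \oplus D_6 \oplus D_7 \\
S_2 &= P_2 \oplus D_1 \oplus D_2 \oplus D_3 \oplus D_7 \\
S_1 &= P_1 \oplus D_0 \oplus D_2 \oplus D_3 \oplus D_5 \oplus D_6 \\
S_0 &= P_0 \oplus D_0 \oplus D_1 \oplus D_3 \oplus D_4 \oplus D_6
\end{aligned} \tag{4}$$

and the corresponding (12, 8) parity matrix, with rows S3 to S0 and columns in codeword order D7 to P0, is

$$H = \begin{pmatrix}
1&1&1&1&1&0&0&0&0&0&0&0 \\
1&0&0&0&0&1&1&1&1&0&0&0 \\
0&1&1&0&0&1&1&0&0&1&1&0 \\
0&1&0&1&0&1&0&1&0&1&0&1
\end{pmatrix} \tag{5}$$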
When an 8-bit data word is written to the SRAM, the ECC module generates a 4-bit check word to form the 12-bit codeword, which is stored in the memory cells. After irradiation, when the data word is read out from the memory cells through the ECC module, the module generates a syndrome vector S = (S3S2S1S0) from the stored codeword.
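The write and read steps can be sketched in a few lines. The sketch relies on the property visible in Table 2 that the syndrome of a single error equals the bit's position (1 to 12, counted from P0) in the codeword; the dictionary representation and the function names `encode` and `syndrome` are assumptions for illustration, not the device's implementation.

```python
# Position (1..12, counted from P0) of each bit in D7 D6 D5 D4 P3 D3 D2 D1 P2 D0 P1 P0
DATA_POS = [3, 5, 6, 7, 9, 10, 11, 12]   # D0..D7
CHECK_POS = [1, 2, 4, 8]                  # P0..P3

def syndrome(code):
    """XOR the positions of all bits that are 1; 0 means "no error"."""
    s = 0
    for pos, bit in code.items():
        if bit:
            s ^= pos
    return s

def encode(data):
    """Build the 12-bit codeword (position -> bit) for an 8-bit data word."""
    code = {pos: 0 for pos in range(1, 13)}
    for i, pos in enumerate(DATA_POS):
        code[pos] = (data >> i) & 1
    s = syndrome(code)                    # contribution of the data bits alone
    for k, pos in enumerate(CHECK_POS):
        code[pos] = (s >> k) & 1          # choose P_k so that S_k becomes 0
    return code

code = encode(0xFF)
print(syndrome(code))       # 0: a freshly written codeword has no error
code[9] ^= 1                # upset D4 (position 9)
print(f"{syndrome(code):04b}")  # 1001, which Table 2 maps to D4
```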
In Eq. (5), each column vector of the parity matrix corresponds to the position of one bit (Du (u = 0, 1, ..., 7) or Pv (v = 0, 1, 2, 3)) in the codeword: 0 means that the bit does not participate in forming Sk (k = 0, 1, 2, 3), and 1 means that it does. How, then, does a 2-bit change in the codeword generate S ≠ (0000), and how does S point to an error in Table 2? The failure modes can be found as follows:
1. If neither of the 2 upset bits participates in Sk, then Sk = 0⊕0 = 0, indicating "no error".
2. If both of the 2 upset bits participate in Sk, then Sk = 1⊕1 = 0, again indicating "no error".
3. If only one upset bit participates in Sk, then Sk = 1⊕0 = 1 or Sk = 0⊕1 = 1; the corresponding Sk is always 1, so the ECC module detects an "error" and performs a "correction".
Consequently, the value of Sk depends on whether the 2 upset bits participate in Sk: it is the XOR of their two participation bits.
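The three cases can be traced bit by bit with a short sketch. The `COLUMN` table below is written out from Table 2 (the syndrome a single error in each bit would produce); the function name `syndrome_bits` and the chosen pair D0, P1 are only illustrative.

```python
# Parity-matrix column (S3 S2 S1 S0) of each codeword bit, taken from Table 2
COLUMN = {
    "P0": 0b0001, "P1": 0b0010, "D0": 0b0011, "P2": 0b0100,
    "D1": 0b0101, "D2": 0b0110, "D3": 0b0111, "P3": 0b1000,
    "D4": 0b1001, "D5": 0b1010, "D6": 0b1011, "D7": 0b1100,
}

def syndrome_bits(a, b):
    """Return [S0, S1, S2, S3] when bits a and b are both upset."""
    result = []
    for k in range(4):
        in_a = (COLUMN[a] >> k) & 1   # does bit a participate in S_k?
        in_b = (COLUMN[b] >> k) & 1   # does bit b participate in S_k?
        result.append(in_a ^ in_b)    # cases 1 and 2 give 0, case 3 gives 1
    return result

print(syndrome_bits("D0", "P1"))  # [1, 0, 0, 0]
```

For the pair D0, P1, only D0 participates in S0 (case 3), both participate in S1 (case 2), and neither participates in S2 or S3 (case 1), so S = (0001), which Table 2 maps to P0.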
For example, if the 2-bit upset occurs in D3 and P0, the value of S0 is not affected (as both bits participate in it), nor is S3 (as neither participates in it). However, S1 = P1⊕D0⊕D2⊕D3'⊕D5⊕D6 and S2 = P2⊕D1⊕D2⊕D3'⊕D7, where D3' denotes the upset value of D3, both change, resulting in S = (S3S2S1S0) = (0110), which can be understood simply as:

$$(0111) \oplus (0001) = (0110) \tag{6}$$

Eq. (6) points to an "error" at position D2 according to Table 2. The ECC module then changes the correct value of D2 to a wrong value, while the bit that was really upset, D3, is read out as if it were "right" data, producing a 2-bit error in D3D2. In other words, the data written is "FF" and the data read out is "F3", which is detected as an error.
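The "FF" to "F3" example can be reproduced with a small read-out model. This is a sketch under stated assumptions: the codeword is held as a position-to-bit dictionary, the function name `read_through_ecc` is hypothetical, and a syndrome that matches no codeword position (1101, 1110, 1111) is assumed to leave the data untouched.

```python
DATA_POS = [3, 5, 6, 7, 9, 10, 11, 12]   # positions of D0..D7 (P0 is position 1)

def read_through_ecc(code):
    """Return the data byte after single-error correction of a stored codeword.

    The stored codeword itself is not modified, as in the static test mode.
    """
    s = 0
    for pos, bit in code.items():
        if bit:
            s ^= pos                      # syndrome = XOR of positions holding 1
    fixed = dict(code)
    if 1 <= s <= 12:                      # assumed: other syndromes are ignored
        fixed[s] ^= 1                     # flip the bit the syndrome points to
    return sum(fixed[pos] << i for i, pos in enumerate(DATA_POS))

# Stored codeword for data 0xFF: all data bits 1, P0 = P1 = 1, P2 = P3 = 0
code = {pos: 1 for pos in DATA_POS}
code.update({1: 1, 2: 1, 4: 0, 8: 0})

code[7] ^= 1                              # first upset: D3
print(f"{read_through_ecc(code):02X}")    # FF, the single upset is corrected
code[1] ^= 1                              # later upset in the same codeword: P0
print(f"{read_through_ecc(code):02X}")    # F3, D2 is wrongly "corrected"
```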
Therefore, the analysis can be reduced to the following procedure: 1) extract two column vectors from the parity matrix of Eq. (5) (two upsets of the same bit in a codeword leave S unchanged, so this case is omitted), 2) XOR them as in Eq. (6), 3) obtain the syndrome vector S, 4) find the "error" position to which S points, 5) analyze the relationship between the "error" and the "upset", and 6) compile the statistics of failure modes, including the 1-bit, 2-bit and 3-bit errors read out from the SRAMs.
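The six steps above can be sketched as a short enumeration. The column values are taken from Table 2; the names `COLUMN`, `BIT_AT` and `errors_read_out` are illustrative, and the sketch assumes that a syndrome pointing to a check bit or to no valid position leaves the data bits as stored.

```python
from collections import Counter
from itertools import combinations

# Parity-matrix column of each codeword bit (equal to its syndrome in Table 2)
COLUMN = {
    "P0": 0b0001, "P1": 0b0010, "D0": 0b0011, "P2": 0b0100,
    "D1": 0b0101, "D2": 0b0110, "D3": 0b0111, "P3": 0b1000,
    "D4": 0b1001, "D5": 0b1010, "D6": 0b1011, "D7": 0b1100,
}
BIT_AT = {col: name for name, col in COLUMN.items()}

def errors_read_out(a, b):
    """Data bits that are wrong at the output after upsets in bits a and b."""
    wrong = {name for name in (a, b) if name.startswith("D")}  # upset data bits
    target = BIT_AT.get(COLUMN[a] ^ COLUMN[b])   # steps 1-4; None if no point
    if target is not None and target.startswith("D"):
        wrong.add(target)                        # a good data bit gets flipped
    return wrong

pairs = list(combinations(COLUMN, 2))            # all 2-bit upset positions
counts = Counter(len(errors_read_out(a, b)) for a, b in pairs)
print(len(pairs), dict(sorted(counts.items())))  # 66 {1: 26, 2: 25, 3: 15}
```

Calling `errors_read_out("D3", "P0")` returns the set containing D2 and D3, matching the worked "FF" to "F3" example.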
3. Analysis results
Extracting two column vectors from the parity matrix of Eq. (5), the total number of error types is $\binom{12}{2} = 66$. Tables 3, 4 and 5 list the details of the failure modes and error types.
Upset position | S3S2S1S0 | “Error” position the S points to | Error read out | Error types (bits involved) |
---|---|---|---|---|
P1P0 | 0011 | D0 | D0 | 1 bit |
P2P0 | 0101 | D1 | D1 | 1 bit |
P3P0 | 1001 | D4 | D4 | 1 bit |
P2P1 | 0110 | D2 | D2 | 1 bit |
P3P1 | 1010 | D5 | D5 | 1 bit |
P3P2 | 1100 | D7 | D7 | 1 bit |
Upset Position | S3S2 S1S0 | "Error" position the S points to | Error read out | Error types (bits involved) |
---|---|---|---|---|
D0P0 | 0010 | P1 | D0 | 1 bit |
D1P0 | 0100 | P2 | D1 | 1 bit |
D2P0 | 0111 | D3 | D3D2 | 2 bit |
D3P0 | 0110 | D2 | D3D2 | 2 bit |
D4P0 | 1000 | P3 | D4 | 1 bit |
D5P0 | 1011 | D6 | D6D5 | 2 bit |
D6P0 | 1010 | D5 | D6D5 | 2 bit |
D7P0 | 1101 | no point | D7 | 1 bit |
D0P1 | 0001 | P0 | D0 | 1 bit |
D1P1 | 0111 | D3 | D3D1 | 2 bit |
D2P1 | 0100 | P2 | D2 | 1 bit |
D3P1 | 0101 | D1 | D3D1 | 2 bit |
D4P1 | 1011 | D6 | D6D4 | 2 bit |
D5P1 | 1000 | P3 | D5 | 1 bit |
D6P1 | 1001 | D4 | D6D4 | 2 bit |
D7P1 | 1110 | no point | D7 | 1 bit |
D0P2 | 0111 | D3 | D3D0 | 2 bit |
D1P2 | 0001 | P0 | D1 | 1 bit |
D2P2 | 0010 | P1 | D2 | 1 bit |
D3P2 | 0011 | D0 | D3D0 | 2 bit |
D4P2 | 1101 | no point | D4 | 1 bit |
D5P2 | 1110 | no point | D5 | 1 bit |
D6P2 | 1111 | no point | D6 | 1 bit |
D7P2 | 1000 | P3 | D7 | 1 bit |
D0P3 | 1011 | D6 | D6D0 | 2 bit |
D1P3 | 1101 | no point | D1 | 1 bit |
D2P3 | 1110 | no point | D2 | 1 bit |
D3P3 | 1111 | no point | D3 | 1 bit |
D4P3 | 0001 | P0 | D4 | 1 bit |
D5P3 | 0010 | P1 | D5 | 1 bit |
D6P3 | 0011 | D0 | D6D0 | 2 bit |
D7P3 | 0100 | P2 | D7 | 1 bit |
Upset position | S3S2 S1S0 | "Error" position the S points to | Error read out | Error types (bits involved) |
---|---|---|---|---|
D1D0 | 0110 | D2 | D2D1D0 | 3 bit |
D2D0 | 0101 | D1 | D2D1D0 | 3 bit |
D3D0 | 0100 | P2 | D3D0 | 2 bit |
D4D0 | 1010 | D5 | D5D4D0 | 3 bit |
D5D0 | 1001 | D4 | D5D4D0 | 3 bit |
D6D0 | 1000 | P3 | D6D0 | 2 bit |
D7D0 | 1111 | no point | D7D0 | 2 bit |
D2D1 | 0011 | D0 | D2D1D0 | 3 bit |
D3D1 | 0010 | P1 | D3D1 | 2 bit |
D4D1 | 1100 | D7 | D7D4D1 | 3 bit |
D5D1 | 1111 | no point | D5D1 | 2 bit |
D6D1 | 1110 | no point | D6D1 | 2 bit |
D7D1 | 1001 | D4 | D7D4D1 | 3 bit |
D3D2 | 0001 | P0 | D3D2 | 2 bit |
D4D2 | 1111 | no point | D4D2 | 2 bit |
D5D2 | 1100 | D7 | D7D5D2 | 3 bit |
D6D2 | 1101 | no point | D6D2 | 2 bit |
D7D2 | 1010 | D5 | D7D5D2 | 3 bit |
D4D3 | 1110 | no point | D4D3 | 2 bit |
D5D3 | 1101 | no point | D5D3 | 2 bit |
D6D3 | 1100 | D7 | D7D6D3 | 3 bit |
D7D3 | 1011 | D6 | D7D6D3 | 3 bit |
D5D4 | 0011 | D0 | D5D4D0 | 3 bit |
D6D4 | 0010 | P1 | D6D4 | 2 bit |
D7D4 | 0101 | D1 | D7D4D1 | 3 bit |
D6D5 | 0001 | P0 | D6D5 | 2 bit |
D7D5 | 0110 | D2 | D7D5D2 | 3 bit |
D7D6 | 0111 | D3 | D7D6D3 | 3 bit |
1. When both upset bits are in the check word (Table 3)
In this case, the ECC module operates incorrectly. The number of error types is $\binom{4}{2} = 6$, and every failure mode is a 1-bit error.
2. When one upset bit is in the check word and one is in the data word (Table 4)
In this case, the ECC module operates incorrectly. The number of error types is $\binom{4}{1}\binom{8}{1} = 32$, of which 20 are 1-bit and 12 are 2-bit errors, so the failure modes include 1-bit and 2-bit errors.
3. When both upset bits are in the data word (Table 5)
In this case, the ECC module operates incorrectly. The number of error types is $\binom{8}{2} = 28$, of which 13 are 2-bit and 15 are 3-bit errors, so the failure modes include 2-bit and 3-bit errors.
Therefore, the total number of 1-bit error types is 6 + 20 = 26, a probability of 26/66 = 39.39% among all error types; the total number of 2-bit error types is 12 + 13 = 25, a probability of 25/66 = 37.88%; and the total number of 3-bit error types is 15, a probability of 15/66 = 22.73%. Table 6 shows that the theoretical probabilities of the 1-, 2- and 3-bit failure modes agree well with the experimental results.
Error types | Number of errors measured | Measured probability | Theoretical probability |
---|---|---|---|
1-bit error | 119/294 | 40.48% | 39.39% |
2-bit error | 111/294 | 37.76% | 37.88% |
3-bit error | 64/294 | 21.77% | 22.73% |
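One possible way to quantify the agreement in Table 6 is a chi-square goodness-of-fit statistic. This check is not part of the original analysis; it is a sketch that uses only the counts in Table 6 and the theoretical shares 26/66, 25/66 and 15/66.

```python
measured = {1: 119, 2: 111, 3: 64}   # events per error type, from Table 6
theory = {1: 26, 2: 25, 3: 15}       # error types out of 66, from the analysis

n_events = sum(measured.values())    # 294
n_types = sum(theory.values())       # 66

chi2 = 0.0
for k in measured:
    expected = n_events * theory[k] / n_types
    chi2 += (measured[k] - expected) ** 2 / expected
    print(f"{k}-bit: measured {100 * measured[k] / n_events:.2f}%, "
          f"theoretical {100 * theory[k] / n_types:.2f}%")

print(f"chi-square = {chi2:.3f}")
```

The statistic is far below 5.991, the 5% critical value for two degrees of freedom, so the measured distribution is statistically consistent with the theoretical one under this test.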
Therefore, the underlying cause of the failure modes of the ECC module in this experiment is the inability of the (12, 8) Hamming code to handle a 2-bit upset in one codeword.
IV. CONCLUSION
The results show both the effectiveness and the ineffectiveness of an ECC module based on the (12, 8) Hamming code in 65 nm process node SRAMs. The ECC module clearly hardens the advanced process node SRAMs. The failure modes, comprising 1-, 2- and 3-bit errors in a data word, have been analyzed, and their essential cause is the inability of the (12, 8) Hamming code to handle a 2-bit upset in one codeword. The measured distribution of bits per upset event agrees well with the theoretical calculation.
Several mitigation approaches are available if much higher reliability is required. Periodic memory scrubbing is often used to improve the performance of the device, and a scrubbing operation will be applied to SRAMs exposed to heavy ions in our laboratory to observe the relationship between the scrub rate and the bit error rate (BER). If more redundancy is acceptable, the triple-bit-correcting Golay code or triple modular redundancy (TMR) may be employed.
The research on 65 nm SRAMs may provide a reference to the manufacturers in their choice of the reinforcement model and algorithm, and to the users in their selection of device application environment and methods.
Study of the Effects of MBUs on the Reliability of a 150 nm SRAM Device
,Measurement and Reporting of Alpha Particle and Terrestrial Cosmic Ray-Induced Soft Errors in Semiconductor Devices, JESD89A, 2006
, p.10.