
Effectiveness and failure modes of error correcting code in industrial 65 nm CMOS SRAMs exposed to heavy ions

NUCLEAR ELECTRONICS AND INSTRUMENTATION

TONG Teng
WANG Xiao-Hui
ZHANG Zhan-Gang
DING Peng-Cheng
LIU Jie
LIU Tian-Qi
SU Hong
Nuclear Science and Techniques, Vol. 25, No. 1, Article number 010405. Published in print 20 Feb 2014; available online 20 Feb 2014.

Single event upsets (SEUs) induced by heavy ions were observed in 65 nm SRAMs to quantitatively evaluate the applicability and effectiveness of single-bit error correcting code (ECC) based on the Hamming code. The results show that the ECC improved the performance dramatically: the SEU cross sections of the SRAMs with ECC are on the order of $10^{-11}$ cm$^2$/bit, two orders of magnitude lower than those without ECC (on the order of $10^{-9}$ cm$^2$/bit). Failures of the ECC module, appearing as 1-, 2- and 3-bit errors in a single word (not Multiple Bit Upsets), were also detected. The ECC modules, which use the (12, 8) Hamming code, fail when two bit upsets accumulate in one codeword. Finally, the probabilities of the failure modes involving 1-, 2- and 3-bit errors were calculated as 39.39%, 37.88% and 22.73%, respectively, which agree well with the experimental results.

Single event upsets (SEU); SRAM; Error correcting code (ECC); Hamming code; Effectiveness; Failure modes

I. INTRODUCTION

As technology scales downward in modern integrated circuits, such as SRAM, the minimum charge needed to upset a device within a unit memory cell decreases, while the influence of charge sharing on adjacent unit memory cells increases [1-5]. Therefore, advanced devices (especially deep-submicrometer) are much more sensitive to the energy deposition in the device by heavy ion irradiation, and this critically restricts the devices’ use in space.

Many methods have been proposed to mitigate the single event upsets (SEUs) occurring in advanced devices. Bit interleaving is a commonly accepted approach to mitigate Multiple Bit Upsets (MBUs) in a data word. In this architecture, the bits of a data word are not physically adjacent, but are interleaved with the bits of other data words. In this way, every MBU of physically adjacent memory cells is transformed into multiple single bit upsets (SBUs) in different memory words.
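As a rough illustration of why interleaving turns one physical MBU into several SBUs, the sketch below maps physically adjacent columns to logical words. The 4-way interleaving factor, the column numbering and the function name `logical_location` are assumptions made for illustration only; the actual layout of the tested devices is not specified in this paper.

```python
def logical_location(column, ways):
    """Map a physical column index to (word index, bit index) in an interleaved row.

    Hypothetical layout: `ways` words share one physical row, and consecutive
    columns belong to consecutive words.
    """
    return column % ways, column // ways

# One ion strike upsetting three physically adjacent cells (columns 8, 9, 10)
hits = [logical_location(c, ways=4) for c in (8, 9, 10)]
print(hits)  # [(0, 2), (1, 2), (2, 2)]
```

Under this assumed layout, each of the three upset cells belongs to a different word, so every affected word carries only a single bit upset.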

Error correcting code (ECC) based on the Hamming code is commonly found in high-reliability and high-performance applications. As a relatively simple yet powerful ECC, it corrects a single bit error anywhere within the codeword.

Therefore, MBUs, which are now a major reliability problem in commercial and industrial electronics, can be transformed into multiple SBUs that appear as uncorrelated events to the ECC algorithm, and can then be corrected [2, 5-7].

In this hardening approach, the ECC module is combined with the bit interleaving architecture to resolve SBUs in advanced process node devices.

To observe and compare the SEUs induced by heavy ions in SRAMs of different process nodes, and to quantitatively evaluate the applicability and effectiveness of single-bit ECC based on the Hamming code in advanced-process SRAMs, we irradiated four SRAMs from ISSI with a $^{12}$C ion beam. Two of them, manufactured in 130 nm and 150 nm processes, are the most advanced devices among the ISSI SRAMs without an ECC module, while the other two are 65 nm SRAMs with an ECC module. The results are reported below.

II. EXPERIMENTAL BACKGROUND

Four industrial SRAMs, produced in high-performance CMOS technology, were irradiated at normal incidence in vacuum by $^{12}$C beams from the Heavy Ion Research Facility in Lanzhou (HIRFL). The $^{12}$C ions had an effective linear energy transfer (LET) value of 1.8 MeV·cm$^2$/mg. Table 1 lists the SRAMs under test. The IS2ME is a 2 Mbit SRAM organized as 131072 words by 16 bits with ECC, 65 nm process node; the IS4ME is a 4 Mbit SRAM organized as 262144 words by 16 bits with ECC, 65 nm process node; the IS2M is a 2 Mbit SRAM organized as 131072 words by 16 bits without ECC, 150 nm process node; and the IS4M is a 4 Mbit SRAM organized as 262144 words by 16 bits without ECC, 130 nm process node. The two SRAMs with ECC are the main objects of observation, and the other two serve as reference devices. All four industrial SRAMs belong to the IS61WV series made by ISSI, and the ECC function in these devices is implemented with a Hamming code, a relatively simple yet powerful ECC that can correct any single bit error in one codeword.

TABLE 1. The information of SRAMs under test

| Device | Process node (nm) | Capacity (Mbits) | ECC | Abbreviation |
|---|---|---|---|---|
| 61WV12816EDBLL-10TLI | 65 | 2 | with | IS2ME |
| 61WV25616EDBLL-10TLI | 65 | 4 | with | IS4ME |
| 61WV12816DBLL-10TLI | 150 | 2 | without | IS2M |
| 61WV25616BLL-10TLI | 130 | 4 | without | IS4M |

The SRAMs were tested with a data pattern of all "1" (blanket pattern) at a supply voltage of 3.3 V, and the operating frequency was kept at 20 MHz throughout. In the static test mode, the devices were written before the beam shot and read periodically during the beam shot (a technique often referred to as multiple-read) [1, 8, 9]. The erroneous data found during the test were stored in another RAM in the test system (referred to as the mirrored RAM relative to the SRAM under test), to serve as the reference data for the next read cycle. The test flow applied (Fig. 1) distinguishes SBU, MBU and single event latchup (SEL). All upset events were recorded with a timestamp and bitmap location.
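The multiple-read procedure with a mirrored RAM can be sketched as follows. This is a minimal model rather than the actual test-system logic: the function `read_cycle`, the list-based memories and the log format are hypothetical, and only the 16-bit word width and the mirror update follow the description above.

```python
def read_cycle(dut, mirror, timestamp, log):
    """Compare the device under test with the mirror and log newly changed bits."""
    for addr, (observed, expected) in enumerate(zip(dut, mirror)):
        diff = observed ^ expected
        if diff:
            bits = [b for b in range(16) if (diff >> b) & 1]
            log.append((timestamp, addr, bits))
            mirror[addr] = observed  # becomes the reference for the next read cycle

dut = [0xFFFF] * 4          # blanket pattern of all "1"
mirror = list(dut)
log = []

dut[2] ^= 1 << 5            # first upset in word 2
read_cycle(dut, mirror, 1, log)
read_cycle(dut, mirror, 2, log)   # nothing new, nothing logged
dut[2] ^= 1 << 9            # later upset in the same word
read_cycle(dut, mirror, 3, log)
print(log)  # [(1, 2, [5]), (3, 2, [9])]
```

Because the mirror is updated after each detection, an upset is logged once, and a later upset in the same word appears as a separate event with its own timestamp.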

Fig. 1. Static test flow.

III. RESULTS AND DISCUSSION

A. The high efficiency of ECC module

SEU cross sections of the four SRAMs are shown in Fig. 2. The SRAMs without an ECC module are much more sensitive to irradiation than the devices with one: the SEU cross sections are on the order of $10^{-9}$ cm$^2$/bit without ECC and $10^{-11}$ cm$^2$/bit with ECC. Yet the 65 nm process of the IS2ME and IS4ME is more advanced than that of the IS2M (150 nm process node) and IS4M (130 nm process node), and with technology scaling the number of upsets per chip increases because of higher circuit density and sensitivity. The sharp contrast between the two groups of data should therefore be attributed to the high efficiency of the ECC module.

Fig. 2. Cross sections of four SRAMs.

B. The ineffectiveness of ECC module

In this experiment, only 1-bit upsets in a data word were detected in the devices without an ECC module, whereas upset events involving 1-, 2- and 3-bit errors occurred in the devices with an ECC module. Fig. 3 shows the measured and theoretical distributions of bits per upset event (as percentages of total events). In the following text, the words "upset" and "error" are used with distinct meanings: an "upset" is the real change that occurred in a memory cell, and an "error" is the wrong data finally read out from the memory.

Fig. 3. The bits per upset event distribution.

1. The fundamental reason

To discuss the experimental results, we make the following assumptions:

1. Considering the beam energy of the $^{12}$C ions and the bit interleaving architecture, the normally incident ions do not affect adjacent memory cells simultaneously. MBUs are therefore not expected to occur in a codeword at any time in this experiment [2, 5-7].

2. The static mode used in this test means that only one write operation is performed in a test cycle, and the ECC module does not correct or re-write the memory itself [1]; it only corrects the "error" bit(s) as the data are read out. After a read through the ECC module, the memory remains in the upset state until a new write command arrives with new data. If another bit of the same word then upsets, the ECC module, whose Hamming code can correct only one bit error, fails. The failure of the ECC module is thus an accumulation effect caused by several SBUs in one word at different times. In addition, the ECC functional block diagram (Fig. 4, taken from the datasheet of the devices with ECC module) shows that the ECC circuit uses the (12, 8) Hamming code.

Fig. 4. Functional block diagram of SRAMs with ECC module.

Given the time structure of the cyclotron beam and the upstream scanning magnets, the incident ions are uniformly distributed in time and space over the flux range used, so each SBU can be treated as an independent random event.

For independent random events with upset probability $p$ ($p \ll 1$), the probability that $r$ bit upsets occur in an $n$-bit codeword is

$$P_n(r) = C_n^r\, p^r (1-p)^{n-r} \approx \frac{n!}{r!\,(n-r)!}\, p^r.$$

From the results of IS2M and IS4M, about 200 ions cause 1 bit upset, to order of magnitude. Assuming that this probability also holds for IS2ME and IS4ME, we have $p = 5\times10^{-3}$. The probability of two SBUs occurring at different times in one codeword is then

$$P_{12}(2) = C_{12}^{2}\, p^{2} (1-p)^{12-2} \approx \frac{12!}{2!\,(12-2)!}\,(5\times10^{-3})^{2} = 1.65\times10^{-3}, \tag{1}$$

and the probability of three SBUs occurring at different times in one codeword is

$$P_{12}(3) = C_{12}^{3}\, p^{3} (1-p)^{12-3} \approx \frac{12!}{3!\,(12-3)!}\,(5\times10^{-3})^{3} = 2.75\times10^{-5}. \tag{2}$$

Eqs. (1) and (2) show that the probability for r=3 is nearly two orders of magnitude lower than that for r=2. Three or more SBUs occurring at different times in one codeword are therefore of very low probability and are neglected in this analysis.
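The binomial expression and its small-p approximation can be evaluated directly. The sketch below is only a numerical check under the stated assumption p = 1/200; the function names `p_upsets` and `p_upsets_approx` are hypothetical.

```python
from math import comb

def p_upsets(n, r, p):
    """Exact binomial probability of r upsets in an n-bit codeword."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

def p_upsets_approx(n, r, p):
    """Small-p approximation that drops the (1 - p)**(n - r) factor."""
    return comb(n, r) * p**r

p = 1 / 200  # about one bit upset per 200 ions, as assumed in the text
for r in (2, 3):
    print(f"r={r}: exact={p_upsets(12, r, p):.3e}, approx={p_upsets_approx(12, r, p):.3e}")
```

The exact values are slightly smaller than the approximate ones, because the dropped factor $(1-p)^{n-r}$ is just below 1.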

Therefore, the fundamental reason for the failures is that a 2-bit upset in one codeword defeats the ECC module based on the (12, 8) Hamming code.

2. Parsing the problem

Figure 5 shows a basic memory architecture of an ECC module using the Hamming code [10]. Table 2 gives the usual relationship between the syndrome vector and the single-error location.

TABLE 2. The relationship between syndrome vector and single-error location

| S3S2S1S0 | Error location |
|---|---|
| 0000 | No error |
| 0001 | P0 |
| 0010 | P1 |
| 0011 | D0 |
| 0100 | P2 |
| 0101 | D1 |
| 0110 | D2 |
| 0111 | D3 |
| 1000 | P3 |
| 1001 | D4 |
| 1010 | D5 |
| 1011 | D6 |
| 1100 | D7 |

Fig. 5. Basic memory architecture for ECC module utilizing Hamming code.

Assume that the 12-bit codeword is D7D6D5D4P3D3D2D1P2D0P1P0, the 8-bit data word is the vector D and the 4-bit check word is the vector P. The syndrome vector S is then generated from the data word and the check word as [11]:

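$$\begin{aligned}
S_0 &= D_0 \oplus D_1 \oplus D_3 \oplus D_4 \oplus D_6 \oplus P_0,\\
S_1 &= D_0 \oplus D_2 \oplus D_3 \oplus D_5 \oplus D_6 \oplus P_1,\\
S_2 &= D_1 \oplus D_2 \oplus D_3 \oplus D_7 \oplus P_2,\\
S_3 &= D_4 \oplus D_5 \oplus D_6 \oplus D_7 \oplus P_3,
\end{aligned} \tag{3}$$

which means

$$\begin{bmatrix} S_0\\ S_1\\ S_2\\ S_3 \end{bmatrix} =
\begin{bmatrix}
0&1&0&1&0&1&0&1&0&1&0&1\\
0&1&1&0&0&1&1&0&0&1&1&0\\
1&0&0&0&0&1&1&1&1&0&0&0\\
1&1&1&1&1&0&0&0&0&0&0&0
\end{bmatrix}
\begin{bmatrix} D_7 & D_6 & D_5 & D_4 & P_3 & D_3 & D_2 & D_1 & P_2 & D_0 & P_1 & P_0 \end{bmatrix}^{T}, \tag{4}$$

and the corresponding (12, 8) parity matrix is

$$H = \begin{bmatrix}
0&1&0&1&0&1&0&1&0&1&0&1\\
0&1&1&0&0&1&1&0&0&1&1&0\\
1&0&0&0&0&1&1&1&1&0&0&0\\
1&1&1&1&1&0&0&0&0&0&0&0
\end{bmatrix}. \tag{5}$$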

When an 8-bit data word is written to the SRAM, the ECC module generates a 4-bit check word, composes the 12-bit codeword and stores it in the memory cells. After irradiation, when the data word is read out from the memory cells through the ECC module, the module generates a syndrome vector S=(S3S2S1S0) from the stored codeword.
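The write and read steps can be sketched in a few lines. This is an illustrative model, not the circuit of the tested devices: the codeword layout follows the one given above, with P0 at position 1 and D7 at position 12, and the names `LAYOUT`, `POS`, `encode` and `syndrome` are hypothetical.

```python
# Codeword positions 1..12, from P0 (position 1) up to D7 (position 12)
LAYOUT = ["P0", "P1", "D0", "P2", "D1", "D2", "D3", "P3", "D4", "D5", "D6", "D7"]
POS = {name: i + 1 for i, name in enumerate(LAYOUT)}

def encode(data):
    """Return {bit name: value} for an 8-bit data word plus its 4 check bits."""
    word = {f"D{i}": (data >> i) & 1 for i in range(8)}
    for k in range(4):
        # P_k is the XOR of every data bit whose position has binary digit k set
        covered = [v for name, v in word.items()
                   if name[0] == "D" and (POS[name] >> k) & 1]
        word[f"P{k}"] = sum(covered) % 2
    return word

def syndrome(word):
    """Return S as an integer whose binary digits are S3 S2 S1 S0."""
    s = 0
    for name, value in word.items():
        if value:
            s ^= POS[name]
    return s

stored = encode(0xFF)
print(syndrome(stored))      # 0: no error
stored["D3"] ^= 1            # a single upset in D3
print(syndrome(stored))      # 7 = 0111, which points to D3
```

XOR-ing the positions of all bits set to 1 is equivalent to evaluating the four parity sums for S0 to S3, because each binary digit of a position records whether that bit participates in the corresponding sum.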

In Eq. (5), each column of the parity matrix corresponds to one bit position (Du, u = 0, 1, ..., 7, or Pv, v = 0, 1, 2, 3) in the codeword: a 0 means that the bit does not participate in forming Sk (k = 0, 1, 2, 3), and a 1 means that it does. How, then, does a 2-bit change in the codeword generate S≠(0000), and which error in Table 2 does S point to? For each Sk there are three cases:

1. Neither of the 2 upset bits participates in Sk: Sk=0⊕0=0, which contributes "no error".

2. Both of the 2 upset bits participate in Sk: Sk=1⊕1=0, which also contributes "no error".

3. Only one upset bit participates in Sk: Sk=1⊕0=1 or Sk=0⊕1=1, so the corresponding Sk is always 1, and the ECC module detects an "error" and performs a "correction".

Consequently, the value of each Sk is the XOR of the participation of the 2 upset bits in that Sk.

For example, if the 2-bit upset is D3P0, the upsets do not affect the value of S0 (both bits participate in it) or S3 (neither participates in it). However, S1=P1⊕D0⊕D2⊕D3’⊕D5⊕D6 and S2=P2⊕D1⊕D2⊕D3’⊕D7, where D3’ is the upset value of D3, both become 1, giving S=(S3S2S1S0)=(0110). This can be understood simply as the XOR of the D3 and P0 columns of the parity matrix:

$$\begin{bmatrix} S_0\\ S_1\\ S_2\\ S_3 \end{bmatrix} =
\begin{bmatrix} 1\\ 1\\ 1\\ 0 \end{bmatrix} \oplus
\begin{bmatrix} 1\\ 0\\ 0\\ 0 \end{bmatrix} =
\begin{bmatrix} 0\\ 1\\ 1\\ 0 \end{bmatrix}. \tag{6}$$

By Table 2, Eq. (6) points to an "error" at D2. The ECC module then changes the correct value of D2 to a wrong one, while the really upset bit D3 is read out as if it were correct, leading to a 2-bit error D3D2. In other words, the data written are "FF" and the data read out are "F3", which is detected as an error.
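The D3P0 example can be reproduced with a short sketch. It assumes, consistently with the "no point" entries of Tables 4 and 5, that the decoder leaves the data untouched when the syndrome matches no codeword position; the function name `read_through_ecc` and the bit-name dictionary are hypothetical.

```python
# Codeword positions 1..12, from P0 (position 1) up to D7 (position 12)
LAYOUT = ["P0", "P1", "D0", "P2", "D1", "D2", "D3", "P3", "D4", "D5", "D6", "D7"]
POS = {name: i + 1 for i, name in enumerate(LAYOUT)}
NAME_AT = {pos: name for name, pos in POS.items()}

def read_through_ecc(data_written, upset_bits):
    """Return (syndrome, data byte read out) after the named bits have upset."""
    flipped = set(upset_bits)
    s = 0
    for name in upset_bits:
        # a valid codeword has S = 0, so S depends only on the upset bits
        s ^= POS[name]
    if s in NAME_AT:
        # the decoder inverts the bit that the syndrome points to
        flipped ^= {NAME_AT[s]}
    data = data_written
    for name in flipped:
        if name.startswith("D"):
            data ^= 1 << int(name[1])
    return s, data

s, data = read_through_ecc(0xFF, ["D3", "P0"])
print(format(s, "04b"), format(data, "02X"))  # 0110 F3
```

The syndrome 0110 selects D2, so the read-out byte has both D3 (the real upset) and D2 (the wrong "correction") inverted.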

The analysis can therefore be reduced to the following procedure: 1) extract two column vectors from the parity matrix of Eq. (5) (two upsets in the same bit of a codeword cancel and do not affect S, so this case is omitted); 2) XOR them as in Eq. (6); 3) obtain the syndrome vector S; 4) find the "error" position that S points to; 5) analyze the relationship between the "error" and the "upsets"; and 6) compile the statistics of the failure modes, including the 1-bit, 2-bit and 3-bit errors read out from the SRAMs.
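The six steps above can be sketched as an exhaustive enumeration over all pairs of codeword bits. The code is a minimal illustration under the same assumption as the analysis (a syndrome that matches no position triggers no correction); the names `data_errors` and `tally` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Codeword positions 1..12, from P0 (position 1) up to D7 (position 12)
LAYOUT = ["P0", "P1", "D0", "P2", "D1", "D2", "D3", "P3", "D4", "D5", "D6", "D7"]
POS = {name: i + 1 for i, name in enumerate(LAYOUT)}
NAME_AT = {pos: name for name, pos in POS.items()}

def data_errors(a, b):
    """Data bits read out wrong when codeword bits a and b have both upset."""
    flipped = {a, b}
    s = POS[a] ^ POS[b]          # steps 1-3: XOR of the two columns
    if s in NAME_AT:             # step 4: position the syndrome points to
        flipped ^= {NAME_AT[s]}  # the decoder inverts that bit
    return sorted(n for n in flipped if n.startswith("D"))  # step 5

# Step 6: statistics over all 2-bit upsets
tally = Counter(len(data_errors(a, b)) for a, b in combinations(LAYOUT, 2))
total = sum(tally.values())
for bits in sorted(tally):
    print(bits, tally[bits], f"{100 * tally[bits] / total:.2f}%")
# 1 26 39.39%
# 2 25 37.88%
# 3 15 22.73%
```

Only errors in data bits are counted, since the check bits are not part of the data word read out.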

3. Analysis results

Extracting two column vectors from the parity matrix of Eq. (5), the total number of error types is $C_{12}^{2}=66$. Tables 3, 4 and 5 list the details of the failure modes and error types.

TABLE 3. Failure modes of the ECC module for a 2-bit upset with both upsets occurring in the check word

| Upset position | S3S2S1S0 | "Error" position the S points to | Error read out | Error type (bits involved) |
|---|---|---|---|---|
| P1P0 | 0011 | D0 | D0 | 1 bit |
| P2P0 | 0101 | D1 | D1 | 1 bit |
| P3P0 | 1001 | D4 | D4 | 1 bit |
| P2P1 | 0110 | D2 | D2 | 1 bit |
| P3P1 | 1010 | D5 | D5 | 1 bit |
| P3P2 | 1100 | D7 | D7 | 1 bit |

TABLE 4. Failure modes of the ECC module with 1 bit upset in the check word and 1 bit upset in the data word

| Upset position | S3S2S1S0 | "Error" position the S points to | Error read out | Error type (bits involved) |
|---|---|---|---|---|
| D0P0 | 0010 | P1 | D0 | 1 bit |
| D1P0 | 0100 | P2 | D1 | 1 bit |
| D2P0 | 0111 | D3 | D3D2 | 2 bit |
| D3P0 | 0110 | D2 | D3D2 | 2 bit |
| D4P0 | 1000 | P3 | D4 | 1 bit |
| D5P0 | 1011 | D6 | D6D5 | 2 bit |
| D6P0 | 1010 | D5 | D6D5 | 2 bit |
| D7P0 | 1101 | no point | D7 | 1 bit |
| D0P1 | 0001 | P0 | D0 | 1 bit |
| D1P1 | 0111 | D3 | D3D1 | 2 bit |
| D2P1 | 0100 | P2 | D2 | 1 bit |
| D3P1 | 0101 | D1 | D3D1 | 2 bit |
| D4P1 | 1011 | D6 | D6D4 | 2 bit |
| D5P1 | 1000 | P3 | D5 | 1 bit |
| D6P1 | 1001 | D4 | D6D4 | 2 bit |
| D7P1 | 1110 | no point | D7 | 1 bit |
| D0P2 | 0111 | D3 | D3D0 | 2 bit |
| D1P2 | 0001 | P0 | D1 | 1 bit |
| D2P2 | 0010 | P1 | D2 | 1 bit |
| D3P2 | 0011 | D0 | D3D0 | 2 bit |
| D4P2 | 1101 | no point | D4 | 1 bit |
| D5P2 | 1110 | no point | D5 | 1 bit |
| D6P2 | 1111 | no point | D6 | 1 bit |
| D7P2 | 1000 | P3 | D7 | 1 bit |
| D0P3 | 1011 | D6 | D6D0 | 2 bit |
| D1P3 | 1101 | no point | D1 | 1 bit |
| D2P3 | 1110 | no point | D2 | 1 bit |
| D3P3 | 1111 | no point | D3 | 1 bit |
| D4P3 | 0001 | P0 | D4 | 1 bit |
| D5P3 | 0010 | P1 | D5 | 1 bit |
| D6P3 | 0011 | D0 | D6D0 | 2 bit |
| D7P3 | 0100 | P2 | D7 | 1 bit |

TABLE 5. Failure modes of the ECC module when both upset bits occur in the data word

| Upset position | S3S2S1S0 | "Error" position the S points to | Error read out | Error type (bits involved) |
|---|---|---|---|---|
| D1D0 | 0110 | D2 | D2D1D0 | 3 bit |
| D2D0 | 0101 | D1 | D2D1D0 | 3 bit |
| D3D0 | 0100 | P2 | D3D0 | 2 bit |
| D4D0 | 1010 | D5 | D5D4D0 | 3 bit |
| D5D0 | 1001 | D4 | D5D4D0 | 3 bit |
| D6D0 | 1000 | P3 | D6D0 | 2 bit |
| D7D0 | 1111 | no point | D7D0 | 2 bit |
| D2D1 | 0011 | D0 | D2D1D0 | 3 bit |
| D3D1 | 0010 | P1 | D3D1 | 2 bit |
| D4D1 | 1100 | D7 | D7D4D1 | 3 bit |
| D5D1 | 1111 | no point | D5D1 | 2 bit |
| D6D1 | 1110 | no point | D6D1 | 2 bit |
| D7D1 | 1001 | D4 | D7D4D1 | 3 bit |
| D3D2 | 0001 | P0 | D3D2 | 2 bit |
| D4D2 | 1111 | no point | D4D2 | 2 bit |
| D5D2 | 1100 | D7 | D7D5D2 | 3 bit |
| D6D2 | 1101 | no point | D6D2 | 2 bit |
| D7D2 | 1010 | D5 | D7D5D2 | 3 bit |
| D4D3 | 1110 | no point | D4D3 | 2 bit |
| D5D3 | 1101 | no point | D5D3 | 2 bit |
| D6D3 | 1100 | D7 | D7D6D3 | 3 bit |
| D7D3 | 1011 | D6 | D7D6D3 | 3 bit |
| D5D4 | 0011 | D0 | D5D4D0 | 3 bit |
| D6D4 | 0010 | P1 | D6D4 | 2 bit |
| D7D4 | 0101 | D1 | D7D4D1 | 3 bit |
| D6D5 | 0001 | P0 | D6D5 | 2 bit |
| D7D5 | 0110 | D2 | D7D5D2 | 3 bit |
| D7D6 | 0111 | D3 | D7D6D3 | 3 bit |

1. When both upset bits are in the check word (Table 3)

In this case, the ECC module performs a wrong operation. The number of error types is $C_4^2=6$, and all of the failure modes are 1-bit errors.

2. When 1 upset bit is in the check word and 1 upset bit is in the data word (Table 4)

In this case, the ECC module performs a wrong operation or finds no position to correct ("no point"). The number of error types is $C_4^1 C_8^1=32$, of which 20 are 1-bit errors and 12 are 2-bit errors, so the failure modes include 1-bit and 2-bit errors.

3. When both upset bits are in the data word (Table 5)

In this case, the ECC module performs a wrong operation or finds no position to correct ("no point"). The number of error types is $C_8^2=28$, of which 13 are 2-bit errors and 15 are 3-bit errors, so the failure modes include 2-bit and 3-bit errors.

Therefore, the total number of 1-bit error types is 6+20=26, a probability of 26/66=39.39% among all error types; the total number of 2-bit error types is 12+13=25, a probability of 25/66=37.88%; and the total number of 3-bit error types is 15, a probability of 15/66=22.73%. Table 6 shows that the theoretical probabilities of the failure modes involving 1-, 2- and 3-bit errors agree well with the experimental results.

TABLE 6. The measured and calculated probabilities of the failure modes involving 1-, 2- and 3-bit errors

| Error type | Number of errors measured | Measured probability | Theoretical probability |
|---|---|---|---|
| 1-bit error | 119/294 | 40.48% | 39.39% |
| 2-bit error | 111/294 | 37.76% | 37.88% |
| 3-bit error | 64/294 | 21.77% | 22.73% |
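The comparison in Table 6 is simple arithmetic and can be checked as follows. The counts are those given in the table and in the analysis above; the variable names are hypothetical, and the last column (difference in percentage points) is an added convenience rather than a quantity reported in the table.

```python
measured = {1: 119, 2: 111, 3: 64}     # error counts from Table 6
theoretical = {1: 26, 2: 25, 3: 15}    # error types out of the 66 possible pairs

n_meas = sum(measured.values())
n_theo = sum(theoretical.values())

rows = []
for bits in sorted(measured):
    m = 100 * measured[bits] / n_meas
    t = 100 * theoretical[bits] / n_theo
    rows.append((bits, m, t, m - t))
    print(f"{bits}-bit: measured {m:.2f}%, theoretical {t:.2f}%, difference {m - t:+.2f}")
```

For all three failure modes the measured and theoretical shares differ by roughly one percentage point or less.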

Therefore, the underlying cause of the failure modes of the ECC module in this experiment is the failure of the (12, 8) Hamming code when 2 bit upsets occur in one codeword.

IV. CONCLUSION

The results show both the effectiveness and the failure modes of the ECC module based on the (12, 8) Hamming code in 65 nm SRAMs. The ECC module is clearly effective in hardening the advanced process node SRAMs. The failure modes, involving 1-, 2- and 3-bit errors in a data word, have been analyzed; their underlying cause is the failure of the (12, 8) Hamming code when 2 bit upsets occur in one codeword. The measured distribution of bits per upset event agrees well with the theoretical calculation.

There are several mitigation approaches if much higher reliability is required. Periodic memory scrubbing is often used to improve the performance of the device, and a scrubbing operation will be conducted on the SRAMs exposed to heavy ions in our lab, so as to observe the relationship between the scrub rate and the bit error rate (BER). If more redundancy is acceptable, the triple-bit-correcting Golay code or Triple Modular Redundancy (TMR) may be employed.

This research on 65 nm SRAMs may serve as a reference for manufacturers in choosing a hardening scheme and algorithm, and for users in selecting the application environment and usage of a device.

References
[1] Lawrence R K and Kelly A T. IEEE Trans Nucl Sci, 2008, 55: 3367-3374.
[2] Heidel D F, Marshall P W, Pellish J A, et al. IEEE Trans Nucl Sci, 2009, 56: 3499-3504.
[3] Schrimpf R D, Weller R A, Mendenhall M H, et al. Nucl Instrum Meth B, 2007, 261: 1133-1136.
[4] Liu J, Duan J L, Hou M D, et al. Nucl Instrum Meth B, 2006, 245: 342-345.
[5] Bajura M A, Boulghassoul Y, Naseer R, et al. IEEE Trans Nucl Sci, 2007, 54: 935-945.
[6] Radaelli D, Puchner H, Wong S, et al. IEEE Trans Nucl Sci, 2005, 52: 2433-2437.
[7] Maestro J A and Reviriego P. Study of the Effects of MBUs on the Reliability of a 150 nm SRAM Device, DAC’08 Proceedings of the 45th annual Design Automation Conference, p.930-935, California, USA, June 8-13, 2008.
[8] Measurement and Reporting of Alpha Particle and Terrestrial Cosmic Ray-Induced Soft Errors in Semiconductor Devices, JESD89A, 2006, p.10.
[9] Palomo F R, Morilla Y, Mogollón J M, et al. Nucl Instrum Meth B, 2011, 269: 2210-2216.
[10] Nicolaidis M. Soft errors in modern electronic systems, Germany, Springer, 2011, p.207.
[11] Tam S. Single error correction and double error detection, Xilinx, XAPP645 (v2.2), 2006.