Effectiveness and failure modes of error correcting code in industrial 65 nm CMOS SRAMs exposed to heavy ions

NUCLEAR ELECTRONICS AND INSTRUMENTATION


TONG Teng, WANG Xiao-Hui, ZHANG Zhan-Gang, DING Peng-Cheng, LIU Jie, LIU Tian-Qi, SU Hong

Nuclear Science and Techniques

Vol.25, No.1

Article number 010405

Published in print 20 Feb 2014

Available online 20 Feb 2014

DOI: 10.13538/j.1001-8042/nst.25.010405


Single event upsets (SEUs) induced by heavy ions were observed in 65 nm SRAMs to quantitatively evaluate the applicability and effectiveness of a single-bit error correcting code (ECC) based on the Hamming code. The results show that the ECC improved the performance dramatically: the SEU cross sections of SRAMs with ECC were on the order of 10^-11 cm²/bit, two orders of magnitude lower than those of SRAMs without ECC (on the order of 10^-9 cm²/bit). Failures of the ECC module, read out as 1-, 2- and 3-bit errors in a single word (not Multiple Bit Upsets), were also detected. The ECC modules, which use a (12, 8) Hamming code, fail when two bit upsets accumulate in one codeword. Finally, the probabilities of the failure modes involving 1-, 2- and 3-bit errors were calculated to be 39.39%, 37.88% and 22.73%, respectively, which agree well with the experimental results.

Single event upsets (SEU); SRAM; Error correcting code (ECC); Hamming code; Effectiveness; Failure modes

I. INTRODUCTION

As technology scales downward in modern integrated circuits, such as SRAM, the minimum charge needed to upset a device within a unit memory cell decreases, while the influence of charge sharing on adjacent unit memory cells increases [1-5]. Therefore, advanced devices (especially deep-submicrometer) are much more sensitive to the energy deposition in the device by heavy ion irradiation, and this critically restricts the devices’ use in space.

Many methods have been proposed to mitigate the single event upsets (SEUs) that occur in advanced devices. The bit interleaving architecture is a commonly accepted approach to mitigating Multiple Bit Upsets (MBUs) in a data word. In this architecture, the bits of a data word are not physically adjacent, but interleaved with bits of other data words. In this way, every MBU of physically adjacent memory cells is transformed into multiple single bit upsets (SBUs) in different memory words.

Error correcting code (ECC) utilizing Hamming code is found commonly in many high-reliability and performance applications. As a relatively simple yet powerful ECC code, it corrects single bit errors anywhere within the codeword.

Therefore, MBUs, which are now a major reliability problem in commercial and industrial electronics, can be transformed into multiple SBUs that appear as uncorrelated events to the ECC algorithm and can then be corrected [2, 5-7].

In this hardening approach, the ECC module is combined with the bit interleaving architecture to resolve SBUs in advanced process node devices for high-reliability and high-performance applications.

To observe and compare the SEUs induced by heavy ions in SRAMs of different processes, and to quantitatively evaluate the applicability and effectiveness of single-bit ECC based on the Hamming code in advanced process SRAMs, we used a ¹²C ion beam to irradiate four SRAMs from ISSI. Two of them, manufactured in 130 nm and 150 nm processes, are the most advanced devices among ISSI's SRAMs without an ECC module, while the other two are 65 nm process SRAMs with an ECC module. Some interesting results were obtained.

II. EXPERIMENTAL BACKGROUND

Four industrial SRAMs, produced in high-performance CMOS technology, were irradiated at normal incidence in vacuum by ¹²C beams from the Heavy Ion Research Facility in Lanzhou (HIRFL). The ¹²C ions had an effective linear energy transfer (LET) of 1.8 MeV·cm²/mg. Table 1 lists the SRAMs under test. The IS2ME is a 2 Mbit SRAM organized as 131072 words by 16 bits with ECC, 65 nm process node; the IS4ME is a 4 Mbit SRAM organized as 262144 words by 16 bits with ECC, 65 nm process node; the IS2M is a 2 Mbit SRAM organized as 131072 words by 16 bits without ECC, 150 nm process node; and the IS4M is a 4 Mbit SRAM organized as 262144 words by 16 bits without ECC, 130 nm process node. The two SRAMs with ECC are the main objects of study, and the other two serve as reference devices. All four SRAMs belong to the IS61WV series made by ISSI, and the ECC function in the first two is implemented with a Hamming code, a relatively simple yet powerful ECC that can correct any single-bit error in one codeword.

Table 1.

Information on the SRAMs under test

Device	Process node (nm)	Capacity (Mbits)	ECC	Abbreviation
61WV12816EDBLL-10TLI	65	2	with	IS2ME
61WV25616EDBLL-10TLI	65	4	with	IS4ME
61WV12816DBLL-10TLI	150	2	without	IS2M
61WV25616BLL-10TLI	130	4	without	IS4M

The SRAMs were tested with an all-"1" data pattern (blanket pattern) at a supply voltage of 3.3 V, and the operating frequency was fixed at 20 MHz. In the static test mode, the devices were written before the beam shot and read periodically throughout it (a technique often referred to as multiple-read) [1, 8, 9]. The erroneous data found during the test were stored in another RAM in the test system (referred to as the mirrored RAM relative to the SRAM under test), which served as the reference data for the next read cycle. The test flow (Fig. 1) distinguishes SBUs, MBUs and single event latchup (SEL). All upset events were recorded with a timestamp and bitmap location.
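The multiple-read bookkeeping with a mirrored RAM can be sketched as follows. This is only an illustration under stated assumptions, not the firmware of the test system: the names `scan_once` and `read_word` are hypothetical, and the SRAM is modelled as a plain Python list.

```python
from typing import Callable, List, Tuple

def scan_once(read_word: Callable[[int], int],
              mirror: List[int],
              cycle: int) -> List[Tuple[int, int, int]]:
    """One read pass over the memory.

    Returns (cycle, address, changed_bits_mask) for every word that differs
    from the mirror, then stores the data just read in the mirror so that the
    next pass reports only new changes.
    """
    events = []
    for addr in range(len(mirror)):
        data = read_word(addr)
        diff = data ^ mirror[addr]
        if diff:
            events.append((cycle, addr, diff))
            mirror[addr] = data  # reference data for the next read cycle
    return events

# Toy stand-in for the device under test: 4 words of 16 bits, all "1".
sram = [0xFFFF] * 4
mirror = list(sram)

sram[2] ^= 0x0008  # simulated upset before read cycle 1
first = scan_once(sram.__getitem__, mirror, cycle=1)
second = scan_once(sram.__getitem__, mirror, cycle=2)  # no new change
```

Because the mirror is updated after each pass, an error that persists in the memory is logged once, and a later change in the same word shows up as a separate event with its own cycle stamp.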

Fig. 1.

Static test flow.

III. RESULTS AND DISCUSSION

A. The high efficiency of ECC module

SEU cross sections of the four SRAMs are shown in Fig. 2. The SRAMs without an ECC module are much more sensitive to the irradiation than the devices with one: their SEU cross sections are on the order of 10^-9 cm²/bit, compared with 10^-11 cm²/bit for the SRAMs with an ECC module. The IS2ME and IS4ME, however, are produced in a more advanced 65 nm process than the IS2M (150 nm process node) and IS4M (130 nm process node), and with technology scaling the number of upsets per chip increases owing to higher circuit density and sensitivity. The 65 nm devices would therefore be expected to upset more readily, so the sharp contrast between the two data sets should be attributed to the high efficiency of the ECC module.

Fig. 2.

Cross sections of four SRAMs.

B. The ineffectiveness of ECC module

Only 1-bit upsets in a data word were detected in the devices without an ECC module in this experiment, whereas upset events involving 1-, 2- and 3-bit errors occurred in the devices with an ECC module. Fig. 3 shows the measured and theoretical distributions of bits per upset event (as percentages of total events). In the following text, the words "upset" and "error" are used in distinct senses: an "upset" is a real change of state in a memory cell, and an "error" is wrong data finally read out from the memory.

Fig. 3.

The bits per upset event distribution.

1. The fundamental reason

To discuss the experimental results, we make the following assumptions:

1. Considering the beam energy of ¹²C and the bit interleaving architecture, the normally incident ions do not affect adjacent memory cells simultaneously. MBUs are therefore not expected to occur in a codeword at any time in this experiment [2, 5-7].

2. The static mode used in this test means that only one write operation is performed in a test cycle, and the ECC module does not correct or rewrite the memory itself [1], but only corrects the "error" bit(s) as the data are read out through it. The memory remains in its upset state until a new write command arrives with new data. Therefore, if another bit upset occurs in the same word, the ECC module, whose Hamming code can correct only one bit error, fails. The failure of the ECC module is thus an accumulation effect caused by several SBUs in a word at different times. In addition, as the ECC functional block diagram (Fig. 4, taken from the datasheet of the devices with an ECC module) shows, the ECC module uses the (12, 8) Hamming code.

Fig. 4.

Functional block diagram of SRAMs with ECC module.

Given the time structure of the cyclotron beam and the upstream scanning magnets, the incident ions are uniformly distributed in time and space over the flux range used, so each SBU can be regarded as an independent random event.

For independent random events with a per-bit upset probability p (p << 1), the probability that r bit upsets occur in an n-bit codeword is $P_{n}(r) = C_{n}^{r} p^{r} (1 - p)^{n - r} \approx \frac{n!}{r! (n - r)!} p^{r}$ . From the results for IS2M and IS4M, about 200 ions cause one bit upset, to order of magnitude. Assuming this probability also applies to IS2ME and IS4ME, we have p = 5 × 10^-3. Then, the probability of two SBUs occurring at different times in one codeword is

\begin{array}{rl} P_{12}(2) & = C_{12}^{2} p^{2} (1 - p)^{12 - 2} \\ & \approx \frac{12!}{2! (12 - 2)!} (5 \times 10^{- 3})^{2} \\ & = 1.65 \times 10^{- 3} . \end{array}

(1)

The probability of three SBUs occurring at different times in one codeword is

\begin{array}{rl} P_{12}(3) & = C_{12}^{3} p^{3} (1 - p)^{12 - 3} \\ & \approx \frac{12!}{3! (12 - 3)!} (5 \times 10^{- 3})^{3} \\ & = 2.75 \times 10^{- 5} . \end{array}

(2)

Eqs. (1) and (2) differ by a factor of about 60, i.e. nearly two orders of magnitude, between r=2 and r=3. Three or more SBUs occurring at different times in one codeword therefore have a very low probability and are neglected in this analysis.
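The binomial estimate can be checked numerically with a few lines of code. The sketch below evaluates both the exact expression and the small-p approximation for n = 12 and p = 5 × 10^-3; the function name `p_upsets` is a hypothetical helper introduced only for this illustration.

```python
from math import comb

def p_upsets(n: int, r: int, p: float, exact: bool = True) -> float:
    """Probability of r upset bits in an n-bit codeword (independent upsets)."""
    if exact:
        return comb(n, r) * p**r * (1 - p)**(n - r)
    return comb(n, r) * p**r  # small-p approximation: (1 - p)^(n - r) ~ 1

p = 5e-3  # per-bit upset probability quoted in the text
for r in (2, 3):
    print(r, p_upsets(12, r, p), p_upsets(12, r, p, exact=False))
```

The exact values are slightly smaller than the approximate ones, since the dropped factor (1 - p)^(n - r) is just below 1.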

Therefore, the fundamental reason for the failures is that a 2-bit upset in one codeword defeats the ECC module based on the (12, 8) Hamming code.

2. Parsing the problem

Figure 5 shows the basic memory architecture of an ECC module based on the Hamming code [10]. Table 2 gives the usual relationship between the syndrome vector and the single-error location.

Table 2.

Relationship between the syndrome vector and the single-error location

S₃S₂S₁S₀	Error location	S₃S₂S₁S₀	Error location
0001	P₀	1000	P₃
0010	P₁	1001	D₄
0011	D₀	1010	D₅
0100	P₂	1011	D₆
0101	D₁	1100	D₇
0110	D₂	—	—
0111	D₃	0000	No error

Fig. 5.

Basic memory architecture for ECC module utilizing Hamming code.

Assuming the 12 bits codeword is D₇D₆D₅D₄P₃D₃D₂D₁P₂D₀P₁P₀, 8 bits data word is vector D and 4 bits check word is vector P, the syndrome vector S can be generated by data word and check word as [11]:

\begin{array}{rl} S_{0} & = D_{0} \oplus D_{1} \oplus D_{3} \oplus D_{4} \oplus D_{6} \oplus P_{0}; \\ S_{1} & = D_{0} \oplus D_{2} \oplus D_{3} \oplus D_{5} \oplus D_{6} \oplus P_{1}; \\ S_{2} & = D_{1} \oplus D_{2} \oplus D_{3} \oplus D_{7} \oplus P_{2}; \\ S_{3} & = D_{4} \oplus D_{5} \oplus D_{6} \oplus D_{7} \oplus P_{3}; \end{array}

(3)

or, in matrix form (with arithmetic modulo 2),

\begin{bmatrix} S_{0} \\ S_{1} \\ S_{2} \\ S_{3} \end{bmatrix} = \begin{bmatrix} 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} D_{7} \\ D_{6} \\ D_{5} \\ D_{4} \\ P_{3} \\ D_{3} \\ D_{2} \\ D_{1} \\ P_{2} \\ D_{0} \\ P_{1} \\ P_{0} \end{bmatrix},

(4)

and the corresponding (12, 8) parity matrix is

H = \begin{bmatrix} 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix} .

(5)

When an 8-bit data word is written to the SRAM, the ECC module generates a 4-bit check word, composes the 12-bit codeword and stores it in the memory cells. After irradiation, when the data word is read out from the memory cells through the ECC module, the module generates a syndrome vector S=(S₃S₂S₁S₀) from the stored codeword.
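The write and read paths can be illustrated with a short sketch built directly on the matrix of Eq. (5). It is a minimal model, not the circuit of the devices under test: the names `ORDER`, `encode` and `syndrome` are hypothetical, and a codeword is represented as a dictionary of named bits.

```python
# Codeword bit order used in the text, left to right.
ORDER = ["D7", "D6", "D5", "D4", "P3", "D3", "D2", "D1", "P2", "D0", "P1", "P0"]

# Rows S0..S3 of the parity matrix H of Eq. (5).
H = [
    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],  # S0
    [0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0],  # S1
    [1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0],  # S2
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],  # S3
]

def syndrome(codeword):
    """Return the string S3S2S1S0 for a codeword given as a dict of bits."""
    vec = [codeword[name] for name in ORDER]
    s = [sum(h * c for h, c in zip(row, vec)) % 2 for row in H]  # S0..S3
    return "".join(str(b) for b in reversed(s))

def encode(data):
    """Write path: choose P0..P3 so that every S_k of Eq. (3) is zero."""
    cw = {f"D{i}": (data >> i) & 1 for i in range(8)}
    cw.update({f"P{k}": 0 for k in range(4)})
    s = syndrome(cw)                 # parity of the data bits alone
    for k in range(4):
        cw[f"P{k}"] = int(s[3 - k])  # s is S3S2S1S0, so S_k is s[3 - k]
    return cw

cw = encode(0xFF)
clean = syndrome(cw)      # "0000": no error
cw["D4"] ^= 1             # a single upset in D4
single = syndrome(cw)     # "1001", which Table 2 maps to D4
```

Each check bit appears in exactly one row of H, which is why setting P_k to the parity of the data bits in row k forces S_k to zero on a clean read.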

In Eq. (5), each column of the parity matrix corresponds to one bit of the codeword (D_u (u = 0, 1, ..., 7) or P_v (v = 0, 1, 2, 3)): a 0 means that the bit does not participate in forming S_k (k = 0, 1, 2, 3), and a 1 means that it does. How, then, does a 2-bit change in the codeword generate S≠(0000), and to which error in Table 2 does S point? The failure modes can be found as follows:

1. If neither of the 2 upset bits participates in S_k, then S_k=0⊕0=0, which indicates "no error".

2. If both of the 2 upset bits participate in S_k, then S_k=1⊕1=0, which also indicates "no error".

3. If only one upset bit participates in S_k, then S_k=1⊕0=1 or S_k=0⊕1=1; the corresponding S_k is always 1, so the ECC module detects an "error" and performs a "correction".

Consequently, each S_k is determined by whether the 2 upset bits participate in it: it is the XOR of their two entries in row k of the parity matrix.

For example, if the 2-bit upset is in D₃ and P₀, it affects neither S₀ (both bits participate in it) nor S₃ (neither bit participates in it). However, S₁=P₁⊕D₀⊕D₂⊕D₃′⊕D₅⊕D₆ and S₂=P₂⊕D₁⊕D₂⊕D₃′⊕D₇, where D₃′ is the upset value of D₃, both become 1, giving S=(S₃S₂S₁S₀)=(0110). This can be understood simply as the XOR of the D₃ and P₀ columns of the parity matrix:

\begin{bmatrix} S_{0} \\ S_{1} \\ S_{2} \\ S_{3} \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 0 \end{bmatrix} \oplus \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix} .

(6)

By Table 2, the syndrome of Eq. (6) points to an "error" at D₂, so the ECC module changes the correct value of D₂ to a wrong one, while the bit that is really upset, D₃, is read out as if it were correct, leading to a 2-bit error in D₃D₂. In other words, the data written are "FF" and the data read out are "F3", which is detected as an error.
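The D₃P₀ example can be reproduced with a small simulation of a write, two accumulated upsets, and a read through the ECC. This is a sketch under stated assumptions: the names `POS`, `encode` and `read_through_ecc` are hypothetical, each bit is identified by its codeword position 1 to 12 (whose binary form is its column of the parity matrix), and a syndrome that points to no bit (1101, 1110, 1111) is assumed to leave the data unchanged.

```python
# Codeword position (1..12) of each bit; in binary it is that bit's column of H.
POS = {"P0": 1, "P1": 2, "D0": 3, "P2": 4, "D1": 5, "D2": 6,
       "D3": 7, "P3": 8, "D4": 9, "D5": 10, "D6": 11, "D7": 12}
NAME = {pos: name for name, pos in POS.items()}

def syndrome(cw):
    """Integer value of S3S2S1S0: XOR of the positions of all bits set to 1."""
    s = 0
    for name, bit in cw.items():
        if bit:
            s ^= POS[name]
    return s

def encode(data):
    """Write path: add check bits so that the syndrome of the codeword is 0."""
    cw = {f"D{i}": (data >> i) & 1 for i in range(8)}
    s = syndrome(cw)
    for k in range(4):
        cw[f"P{k}"] = (s >> k) & 1
    return cw

def read_through_ecc(stored):
    """Read path: single-error correction on a copy; cells are not rewritten."""
    cw = dict(stored)
    s = syndrome(cw)
    if s in NAME:          # syndromes 1..12 point to a codeword bit
        cw[NAME[s]] ^= 1   # the module "corrects" that bit
    return sum(cw[f"D{i}"] << i for i in range(8))

stored = encode(0xFF)
stored["D3"] ^= 1  # first SBU
stored["P0"] ^= 1  # second SBU in the same codeword, at a later time
data_out = read_through_ecc(stored)
print(f"written FF, read {data_out:02X}")  # written FF, read F3
```

With only one of the two upsets present, the same read path returns the written data, which is the normal single-error correction.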

Therefore, the procedure can be summarized as follows: 1) extract two different column vectors from the parity matrix of Eq. (5) (two upsets of the same bit in a codeword restore it and do not affect S, so this case is omitted), 2) XOR them as in Eq. (6), 3) obtain the syndrome vector S, 4) find the "error" position to which S points, 5) analyze the relationship between the "error" and the "upset", and 6) compile the statistics of the failure modes, i.e. the 1-bit, 2-bit and 3-bit errors read out from the SRAMs.

3. Analysis results

Extracting two column vectors from the parity matrix of Eq. (5), the total number of error types is $C_{12}^{2} = 66$ . Tables 3, 4 and 5 list the details of the failure modes and error types.

Table 3.

Failure modes of the ECC module for a 2-bit upset with both upsets in the check word

Upset position	S₃S₂S₁S₀	“Error” position the S points to	Error read out	Error types (bits involved)
P₁P₀	0011	D₀	D₀	1 bit
P₂P₀	0101	D₁	D₁	1 bit
P₃P₀	1001	D₄	D₄	1 bit
P₂P₁	0110	D₂	D₂	1 bit
P₃P₁	1010	D₅	D₅	1 bit
P₃P₂	1100	D₇	D₇	1 bit

Table 4.

Failure modes of the ECC module for a 2-bit upset with 1 upset in the check word and 1 upset in the data word

Upset position	S₃S₂S₁S₀	"Error" position the S points to	Error read out	Error types (bits involved)
D₀P₀	0010	P₁	D₀	1 bit
D₁P₀	0100	P₂	D₁	1 bit
D₂P₀	0111	D₃	D₃D₂	2 bit
D₃P₀	0110	D₂	D₃D₂	2 bit
D₄P₀	1000	P₃	D₄	1 bit
D₅P₀	1011	D₆	D₆D₅	2 bit
D₆P₀	1010	D₅	D₆D₅	2 bit
D₇P₀	1101	no point	D₇	1 bit
D₀P₁	0001	P₀	D₀	1 bit
D₁P₁	0111	D₃	D₃D₁	2 bit
D₂P₁	0100	P₂	D₂	1 bit
D₃P₁	0101	D₁	D₃D₁	2 bit
D₄P₁	1011	D₆	D₆D₄	2 bit
D₅P₁	1000	P₃	D₅	1 bit
D₆P₁	1001	D₄	D₆D₄	2 bit
D₇P₁	1110	no point	D₇	1 bit
D₀P₂	0111	D₃	D₃D₀	2 bit
D₁P₂	0001	P₀	D₁	1 bit
D₂P₂	0010	P₁	D₂	1 bit
D₃P₂	0011	D₀	D₃D₀	2 bit
D₄P₂	1101	no point	D₄	1 bit
D₅P₂	1110	no point	D₅	1 bit
D₆P₂	1111	no point	D₆	1 bit
D₇P₂	1000	P₃	D₇	1 bit
D₀P₃	1011	D₆	D₆D₀	2 bit
D₁P₃	1101	no point	D₁	1 bit
D₂P₃	1110	no point	D₂	1 bit
D₃P₃	1111	no point	D₃	1 bit
D₄P₃	0001	P₀	D₄	1 bit
D₅P₃	0010	P₁	D₅	1 bit
D₆P₃	0011	D₀	D₆D₀	2 bit
D₇P₃	0100	P₂	D₇	1 bit

Table 5.

Failure modes of the ECC module for a 2-bit upset with both upsets in the data word

Upset position	S₃S₂S₁S₀	"Error" position the S points to	Error read out	Error types (bits involved)
D₁D₀	0110	D₂	D₂D₁D₀	3 bit
D₂D₀	0101	D₁	D₂D₁D₀	3 bit
D₃D₀	0100	P₂	D₃D₀	2 bit
D₄D₀	1010	D₅	D₅D₄D₀	3 bit
D₅D₀	1001	D₄	D₅D₄D₀	3 bit
D₆D₀	1000	P₃	D₆D₀	2 bit
D₇D₀	1111	no point	D₇D₀	2 bit
D₂D₁	0011	D₀	D₂D₁D₀	3 bit
D₃D₁	0010	P₁	D₃D₁	2 bit
D₄D₁	1100	D₇	D₇D₄D₁	3 bit
D₅D₁	1111	no point	D₅D₁	2 bit
D₆D₁	1110	no point	D₆D₁	2 bit
D₇D₁	1001	D₄	D₇D₄D₁	3 bit
D₃D₂	0001	P₀	D₃D₂	2 bit
D₄D₂	1111	no point	D₄D₂	2 bit
D₅D₂	1100	D₇	D₇D₅D₂	3 bit
D₆D₂	1101	no point	D₆D₂	2 bit
D₇D₂	1010	D₅	D₇D₅D₂	3 bit
D₄D₃	1110	no point	D₄D₃	2 bit
D₅D₃	1101	no point	D₅D₃	2 bit
D₆D₃	1100	D₇	D₇D₆D₃	3 bit
D₇D₃	1011	D₆	D₇D₆D₃	3 bit
D₅D₄	0011	D₀	D₅D₄D₀	3 bit
D₆D₄	0010	P₁	D₆D₄	2 bit
D₇D₄	0101	D₁	D₇D₄D₁	3 bit
D₆D₅	0001	P₀	D₆D₅	2 bit
D₇D₅	0110	D₂	D₇D₅D₂	3 bit
D₇D₆	0111	D₃	D₇D₆D₃	3 bit

1. When both upset bits are in the check word (Table 3)

In this case, the ECC module performs a wrong operation. The number of error types is $C_{4}^{2} = 6$ , and every failure mode is a 1-bit error.

2. When 1 upset bit is in the check word and 1 is in the data word (Table 4)

In this case, the ECC module performs a wrong operation. The number of error types is $C_{4}^{1} C_{8}^{1} = 32$ , of which 20 are 1-bit errors and 12 are 2-bit errors, so the failure modes include 1-bit and 2-bit errors.

3. When both upset bits are in the data word (Table 5)

In this case, the ECC module performs a wrong operation. The number of error types is $C_{8}^{2} = 28$ , of which 13 are 2-bit errors and 15 are 3-bit errors, so the failure modes include 2-bit and 3-bit errors.

Therefore, the total number of 1-bit error types is 6+20=26, a probability of 26/66=39.39% among all error types; the total number of 2-bit error types is 12+13=25, a probability of 25/66=37.88%; and the total number of 3-bit error types is 15, a probability of 15/66=22.73%. Table 6 shows that the theoretical probabilities of the failure modes with 1-, 2- and 3-bit errors agree well with the experimental results.
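The enumeration of the 66 upset pairs can be reproduced with a short script. The sketch below uses the column-XOR shortcut of Eq. (6), with each bit identified by its codeword position (the binary form of its column in the parity matrix). The names `POS` and `classify` are hypothetical, and a syndrome that points to no codeword bit (1101, 1110, 1111) is assumed to leave the data as stored.

```python
from collections import Counter
from itertools import combinations

# Codeword position (1..12) of each bit; in binary it is that bit's column of H.
POS = {"P0": 1, "P1": 2, "D0": 3, "P2": 4, "D1": 5, "D2": 6,
       "D3": 7, "P3": 8, "D4": 9, "D5": 10, "D6": 11, "D7": 12}
NAME = {pos: name for name, pos in POS.items()}

def classify(a, b):
    """Return (S3S2S1S0, pointed bit or None, set of data bits read out wrong)."""
    s = POS[a] ^ POS[b]       # XOR of the two columns, as in Eq. (6)
    pointed = NAME.get(s)     # None when S points to no codeword bit
    wrong = {x for x in (a, b) if x.startswith("D")}  # upset data bits
    if pointed is not None and pointed.startswith("D"):
        wrong.add(pointed)    # the miscorrection flips one more data bit
    return format(s, "04b"), pointed, wrong

tally = Counter(len(classify(a, b)[2]) for a, b in combinations(POS, 2))
total = sum(tally.values())
for n_bits in sorted(tally):
    print(n_bits, tally[n_bits], f"{100 * tally[n_bits] / total:.2f}%")
# 1 26 39.39%
# 2 25 37.88%
# 3 15 22.73%
```

Calling `classify` on a single pair gives the corresponding table row; for instance, `classify("D3", "P0")` returns the syndrome 0110, the pointed bit D2, and the data bits D3 and D2 in error.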

Table 6.

Measured and calculated probabilities of the failure modes with 1-, 2- and 3-bit errors

Error types	Number of errors measured	Measured probability	Theoretical probability
1 bit error	119/294	40.48%	39.39%
2 bits error	111/294	37.76%	37.88%
3 bits error	64/294	21.77%	22.73%
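One way to put a number on this agreement is a chi-square goodness-of-fit check of the measured counts against the 26 : 25 : 15 split. The sketch below is an illustrative addition rather than an analysis from the original work, and it assumes, as the theoretical column does, that all 66 upset pairs are equally likely.

```python
from math import exp

measured = {1: 119, 2: 111, 3: 64}  # error counts from Table 6
weights = {1: 26, 2: 25, 3: 15}     # failure modes out of 66 upset pairs

total = sum(measured.values())
chi2 = 0.0
for k, count in measured.items():
    expected = total * weights[k] / 66
    chi2 += (count - expected) ** 2 / expected
    print(f"{k}-bit: measured {100 * count / total:.2f}%, "
          f"theory {100 * weights[k] / 66:.2f}%")

# Three categories give 2 degrees of freedom, for which the chi-square
# survival function is exactly exp(-chi2 / 2).
p_value = exp(-chi2 / 2)
print(f"chi2 = {chi2:.3f}, p = {p_value:.2f}")
```

A p-value close to 1 means that the measured counts are statistically indistinguishable from the theoretical split at this sample size.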

Therefore, the underlying cause of the failure modes of the ECC module in this experiment is the failure of the (12, 8) Hamming code when 2 bit upsets occur in one codeword.

IV. CONCLUSION

The results show both the effectiveness and the limits of an ECC module based on the (12, 8) Hamming code in 65 nm process node SRAMs. The ECC module is clearly effective in hardening the advanced process node SRAMs. The failure modes, comprising 1-, 2- and 3-bit errors in a data word, have been analyzed, and their underlying cause is the failure of the (12, 8) Hamming code when 2 bit upsets occur in one codeword. The measured distribution of bits per upset event agrees well with the theoretical calculation.

Several mitigation approaches are available if much higher reliability is required. Periodic memory scrubbing is often used to improve the performance of the device, and a scrubbing operation will be conducted on SRAMs exposed to heavy ions in our lab to observe the relationship between the scrub rate and the bit error rate (BER). If more redundancy is acceptable, the triple-bit-correcting Golay code or Triple Modular Redundancy (TMR) may be employed.

The research on 65 nm SRAMs may provide a reference to the manufacturers in their choice of the reinforcement model and algorithm, and to the users in their selection of device application environment and methods.

References

[1]

Lawrence R K and Kelly A T. IEEE Trans Nucl Sci, 2008, 55: 3367-3374.

[2]

Heidel D F, Marshall P W, Pellish J A, et al. IEEE Trans Nucl Sci, 2009, 56: 3499-3504.

[3]

Schrimpf R D, Weller R A, Mendenhall M H, et al. Nucl Instrum Meth B, 2007, 261: 1133-1136.

[4]

Liu J, Duan J L, Hou M D, et al. Nucl Instrum Meth B, 2006, 245: 342-345.

[5]

Bajura M A, Boulghassoul Y, Naseer R, et al. IEEE Trans Nucl Sci, 2007, 54: 935-945.

[6]

Radaelli D, Puchner H, Wong S, et al. IEEE Trans Nucl Sci, 2005, 52: 2433-2437.

[7]

Maestro J A and Reviriego P. Study of the effects of MBUs on the reliability of a 150 nm SRAM device. DAC'08 Proceedings of the 45th Annual Design Automation Conference, California, USA, June 8–13, 2008, p.930-935.

[8]

Measurement and Reporting of Alpha Particle and Terrestrial Cosmic Ray-Induced Soft Errors in Semiconductor Devices, JESD89A, 2006, p.10.

[9]

Palomo F R, Morilla Y, Mogollón J M, et al. Nucl Instrum Meth B, 2011, 269: 2210-2216.

[10]

Nicolaidis M. Soft errors in modern electronic systems, Germany, Springer, 2011, p.207.

[11]

Tam S. Single error correction and double error detection, Xilinx, XAPP645 (v2.2), 2006.