
FTT-NAS: Discovering Fault-Tolerant Convolutional Neural Architecture

XUEFEI NING, GUANGJUN GE, WENSHUO LI, and ZHENHUA ZHU, Department of Electronic Engineering, Tsinghua University, China
YIN ZHENG, Weixin Group, Tencent, China
XIAOMING CHEN, State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China
ZHEN GAO, School of Electrical and Information Engineering, Tianjin University, China
YU WANG and HUAZHONG YANG, Department of Electronic Engineering, Tsinghua University, China

With the rapid evolution of embedded deep-learning computing systems, applications powered by deep learning are moving from the cloud to the edge. When deploying neural networks (NNs) onto devices operating under complex environments, there are various types of possible faults: soft errors caused by cosmic radiation and radioactive impurities, voltage instability, aging, temperature variations, malicious attackers, etc. Thus the safety risk of deploying NNs is now drawing much attention. In this paper, after analyzing the possible faults in various types of NN accelerators, we formalize and implement various fault models from the algorithmic perspective. We propose Fault-Tolerant Neural Architecture Search (FT-NAS) to automatically discover convolutional neural network (CNN) architectures that are resilient to various faults in today's devices. We then incorporate fault-tolerant training (FTT) into the search process to achieve better results, which is referred to as FTT-NAS. Experiments on CIFAR-10 show that the discovered architectures significantly outperform manually designed baseline architectures, with comparable or fewer floating-point operations (FLOPs) and parameters. Specifically, with the same fault settings, F-FTT-Net discovered under the feature fault model achieves an accuracy of 86.2% (vs. 68.1% achieved by MobileNet-V2), and W-FTT-Net discovered under the weight fault model achieves an accuracy of 69.6% (vs. 60.8% achieved by ResNet-18). By inspecting the discovered architectures, we find that the operation primitives, the weight quantization range, the capacity of the model, and the connection pattern all influence the fault resilience capability of NN models.

CCS Concepts: • Hardware → Fault tolerance; • Computing methodologies → Computer vision.

Additional Key Words and Phrases: neural architecture search, fault tolerance, neural networks

ACM Reference Format:
Xuefei Ning, Guangjun Ge, Wenshuo Li, Zhenhua Zhu, Yin Zheng, Xiaoming Chen, Zhen Gao, Yu Wang, and Huazhong Yang. 0000. FTT-NAS: Discovering Fault-Tolerant Convolutional Neural Architecture. ACM Trans. Des. Autom. Electron. Syst. x, x, Article x (January 0000), 25 pages. https://doi.org/xx.xxxx/xxxxxxx

Authors' addresses: Xuefei Ning, [email protected]; Guangjun Ge; Wenshuo Li; Zhenhua Zhu, Department of Electronic Engineering, Tsinghua University, China; Yin Zheng, Weixin Group, Tencent, China; Xiaoming Chen, State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China; Zhen Gao, School of Electrical and Information Engineering, Tianjin University, China; Yu Wang, [email protected]; Huazhong Yang, [email protected], Department of Electronic Engineering, Tsinghua University, China.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
© 0000 Association for Computing Machinery.
1084-4309/0000/1-ARTx $15.00
https://doi.org/xx.xxxx/xxxxxxx


1 INTRODUCTION

Convolutional Neural Networks (CNNs) have achieved breakthroughs in various tasks, including classification [14], detection [31], and segmentation [32]. Due to their promising performance, CNNs have been utilized in various safety-critical applications, such as autonomous driving, intelligent surveillance, and identification. Meanwhile, driven by recent academic and industrial efforts, neural network accelerators based on various hardware platforms (e.g., Application Specific Integrated Circuits (ASIC) [9], Field Programmable Gate Arrays (FPGA) [37], Resistive Random-Access Memory (RRAM) [10]) have been rapidly evolving.

The robustness and reliability issues of deploying neural networks onto various embedded devices for safety-critical applications are attracting more and more attention. There is a large stream of algorithmic studies on various robustness-related characteristics of NNs, e.g., adversarial robustness [44], data poisoning [41], interpretability [53], and so on. However, no hardware models are taken into consideration in these studies. Besides the issues from the purely algorithmic perspective, there exist hardware-related reliability issues when deploying NNs onto today's embedded devices. With the down-scaling of CMOS technology, circuits become more sensitive to cosmic radiation and radioactive impurities [16]. Voltage instability, aging, and temperature variations are also common effects that could lead to errors. As for the emerging metal-oxide RRAM devices, due to the immature technology, they suffer from many types of device faults [7], among which hard faults such as Stuck-at-Faults (SAFs) damage the computing accuracy severely and cannot be easily mitigated [49]. Moreover, malicious attackers can attack edge devices by embedding hardware Trojans, manipulating back-doors, and doing memory injection [54].

Recently, some studies [28, 40, 46] analyzed the sensitivity of NN models. They proposed to predict whether a layer or a neuron is sensitive to faults and protect the sensitive ones. For fault tolerance, a straightforward way is to introduce redundancy in the hardware. Triple Modular Redundancy (TMR) is a commonly used but expensive method to tolerate a single fault [4, 42, 55]. Studies [28, 49] proposed various redundancy schemes for Stuck-at-Fault tolerance in RRAM-based computing systems. For increasing the algorithmic fault resilience capability, studies [12, 15] proposed to use fault-tolerant training (FTT), in which random faults are injected in the training process.

Although redesigning the hardware for reliability is effective, it is not flexible and inevitably introduces a large overhead. It would be better if the issues could be mitigated as far as possible from the algorithmic perspective. Existing methods are mainly concerned with designing training methods and analyzing the weight distribution [12, 15, 40]. Intuitively, the neural architecture might also be important for the fault tolerance characteristics [1, 25], since it determines the "path" of fault propagation. To verify this intuition, the accuracies of baselines under a random bit-bias feature fault model1 are shown in Table 1, and the results under the SAF weight fault model2 are shown in Table 2. These preliminary experiments on the CIFAR-10 dataset show that the fault tolerance characteristics vary among neural architectures, which motivates the employment of neural architecture search (NAS) techniques in designing fault-tolerant neural architectures. We emphasize that our work is orthogonal to most of the previous methods based on hardware or mapping strategy design. To the best of our knowledge, our work is the first to increase the algorithmic fault resilience capability by optimizing the NN architecture.

In this paper, we employ NAS to discover fault-tolerant neural network architectures against feature faults and weight faults, and demonstrate the effectiveness by experiments. The main contributions of this paper are as follows.

1 The random bit-bias feature fault model is formalized in Sec. 3.4.
2 The SAF weight fault model is formalized in Sec. 3.5.


Table 1. Performance of the baseline models with random bit-bias feature faults. 0/10^-5/10^-4 denotes the per-MAC fault rate.

Model          | Acc (0/10^-5/10^-4) | #Params | #FLOPs
ResNet-18      | 94.7/63.4/10.0      | 11.2M   | 1110M
VGG-16†        | 93.1/21.4/10.0      | 14.7M   | 626M
MobileNet-V2   | 92.3/10.0/10.0      | 2.3M    | 182M

†: For simplicity, we only keep one fully-connected layer of VGG-16.

Table 2. Performance of the baseline models with SAF weight faults. 0/4%/8% denotes the sum of the SAF1 and SAF0 rates.

Model          | Acc (0/4%/8%)   | #Params | #FLOPs
ResNet-18      | 94.7/64.8/17.8  | 11.2M   | 1110M
VGG-16         | 93.1/45.7/14.3  | 14.7M   | 626M
MobileNet-V2   | 92.3/26.2/11.7  | 2.3M    | 182M

• We analyze the possible faults in various types of NN accelerators (ASIC-based, FPGA-based, and RRAM-based), and formalize the statistical fault models from the algorithmic perspective. After the analysis, we adopt the Multiply-Accumulate (MAC)-i.i.d Bit-Bias (MiBB) model and the arbitrary-distributed Stuck-at-Fault (adSAF) model in the neural architecture search for tolerating feature faults and weight faults, respectively.

• We establish a multi-objective neural architecture search framework. On top of this framework, we propose two methods to discover neural architectures with better reliability: FT-NAS (NAS with a fault-tolerant multi-objective), and FTT-NAS (NAS with a fault-tolerant multi-objective and fault-tolerant training (FTT)).

• We employ FT-NAS and FTT-NAS to discover architectures for tolerating feature faults and weight faults. The discovered architectures, F-FTT-Net and W-FTT-Net, have comparable or fewer floating-point operations (FLOPs) and parameters, and achieve better fault resilience capabilities than the baselines. With the same fault settings, F-FTT-Net discovered under the feature fault model achieves an accuracy of 86.2% (vs. 68.1% achieved by MobileNet-V2), and W-FTT-Net discovered under the weight fault model achieves an accuracy of 69.6% (vs. 60.8% achieved by ResNet-18). The ability of W-FTT-Net to defend against several other types of weight faults is also illustrated by experiments.

• We analyze the discovered architectures, and discuss how the weight quantization range, the capacity of the model, and the connection pattern influence the fault resilience capability of a neural network.

The rest of this paper is organized as follows. The related studies and the preliminaries are introduced in Section 2. In Section 3, we conduct a comprehensive analysis of the possible faults and formalize the fault models. In Section 4, we elaborate on the design of the fault-tolerant NAS system. Then in Section 5, the effectiveness of our method is illustrated by experiments, and the insights are also presented. Finally, we discuss and conclude our work in Section 6 and Section 7.

2 RELATED WORK AND PRELIMINARY

2.1 Convolutional Neural Network
Usually, a convolutional neural network is constructed by stacking multiple convolution layers and optional pooling layers, followed by fully-connected layers. Denoting the input feature map (IFM), before-activation output feature map, output feature map (OFM, i.e., activations), weights, and bias of the i-th convolution layer as x^(i), f^(i), y^(i), W^(i), b^(i), the computation can be written as:

f^(i) = W^(i) ⊛ x^(i) + b^(i)
y^(i) = g(f^(i)),                              (1)

where ⊛ is the convolution operator, and g(·) is the activation function, for which the ReLU function (g(x) = max(x, 0)) is the most common choice. From now on, we omit the (i) superscript for simplicity.
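For concreteness, a minimal PyTorch sketch of Eq. 1 is given below; the layer sizes and kernel size are illustrative assumptions, not settings from this paper.

```python
import torch
import torch.nn as nn

# Minimal sketch of Eq. (1): f = W ⊛ x + b, y = g(f), with g = ReLU.
# The channel/kernel sizes here are illustrative only.
conv = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1, bias=True)
relu = nn.ReLU()

x = torch.randn(1, 16, 32, 32)   # input feature map (IFM)
f = conv(x)                      # before-activation output feature map
y = relu(f)                      # output feature map (OFM, i.e., activations)
```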

2.2 NN Accelerators and Fixed-point Arithmetic
With dedicated data flow designs for efficient neural network processing, FPGA-based NN accelerators could achieve at least 10x better energy efficiency than GPUs [11, 37], and ASIC-based accelerators could achieve even higher efficiency [9]. Besides, RRAM-based computing systems are promising solutions for energy-efficient brain-inspired computing [10], due to their capability of performing matrix-vector multiplications (MVMs) in memory. Existing studies have shown that RRAM-based Processing-In-Memory (PIM) architectures can enhance the energy efficiency by over 100× compared with both GPU and ASIC solutions, as they can eliminate the large data movements of bandwidth-bounded NN applications [10]. For detailed and formal hardware architecture descriptions, we refer the readers to the references listed above.

Currently, fixed-point arithmetic units are implemented in most NN accelerators, as 1) they consume much fewer resources and are much more efficient than floating-point ones [11]; 2) NN models are proven to be insensitive to quantization [19, 37]. Consequently, quantization is usually applied before a neural network model is deployed onto edge devices. To keep consistent with the actual deployment scenario, our simulation incorporates 8-bit dynamic fixed-point quantization for the weights and activations. More specifically, independent step sizes are used for the weights and activations of different layers. Denoting the fraction length and bit-width of a tensor as l and Q, the step size (resolution) of the representation is 2^-l. For common CMOS platforms, in which complement representation is used for numbers, the representation range of both weights and features is

[-2^(Q-l), 2^-l (2^Q - 1)].                              (2)

As for RRAM-based NN platforms, two separate crossbars are usually used for storing positive and negative weights [10]. Thus the representation range of the weights (denoted by the w superscript) is

[-R^w, R^w] = [-2^-l (2^(Q+1) - 1), 2^-l (2^(Q+1) - 1)].                              (3)

For the feature representation in RRAM-based platforms, by assuming that the Analog-to-Digital Converters (ADCs) and Digital-to-Analog Converters (DACs) have enough precision, and the CMOS bit-width is Q-bit, the representation range of features (denoted by the f superscript) in CMOS circuits is

[-R^f, R^f] = [-2^(Q-l), 2^-l (2^Q - 1)].                              (4)
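As an illustration, a minimal NumPy sketch of 8-bit dynamic fixed-point quantization is shown below; the function and variable names are our own, the per-tensor fraction length is chosen in a "minimal-overflow" style (cf. Sec. 5.1), and Q is treated here as the total bit-width including the sign bit, which may differ from the bit-counting convention of Eq. 2.

```python
import numpy as np

def quantize_dynamic_fixed_point(x, bitwidth=8):
    """Sketch of per-tensor dynamic fixed-point quantization (two's complement codes)."""
    max_abs = np.max(np.abs(x)) + 1e-12
    int_bits = int(np.ceil(np.log2(max_abs)))   # integer bits needed for the magnitude
    l = (bitwidth - 1) - int_bits               # fraction length (one bit kept for the sign)
    step = 2.0 ** (-l)                          # resolution 2^-l
    qmin, qmax = -2 ** (bitwidth - 1), 2 ** (bitwidth - 1) - 1
    q = np.clip(np.round(x / step), qmin, qmax)
    return q * step, l

w = np.random.randn(32, 16, 3, 3).astype(np.float32)
w_quantized, frac_len = quantize_dynamic_fixed_point(w)
```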

2.3 Fault Resilience for CMOS-based Accelerators
Borkar [5], Henkel et al. [16], and Slayman [43] revealed that advanced nanotechnology makes circuits more vulnerable to soft errors. Unlike hard errors, soft errors do not damage the underlying circuits, but instead trigger an upset of the logic state. The dominant cause of soft errors in CMOS circuits is radioactive events, in which a single particle strikes an electronic device. Arechiga and Michaels [1] and Libano et al. [26] explored how Single-Event Upset (SEU) faults impact FPGA-based CNN computation systems.


TMR is a commonly used approach to mitigate SEUs [4, 42, 55]. Traditional TMR methods are agnostic of the NN applications and introduce large overhead. To exploit the NN applications' characteristics to reduce the overhead, one should understand the behavior of NN models with computational faults. Vialatte and Leduc-Primeau [46] analyzed the layer-wise sensitivity of NN models under two hypothetical feature fault models. Libano et al. [26] proposed to only triplicate the vulnerable layers after layer-wise sensitivity analysis, and reduced the LUT overhead for an NN model on Iris Flower from about 200% (TMR) to 50%. Schorn et al. [40] conducted sensitivity analysis at the individual neuron level. Li et al. [25] found that the impacts and propagation of computational faults in an NN computation system depend on the hardware data path, the model topology, and the type of layers. These methods analyzed the sensitivity of existing NN models at different granularities and exploited the resilience characteristics to reduce the hardware overhead for reliability. Our methods are complementary and discover NN architectures with better algorithmic resilience capability.

To avoid the accumulation of persistent soft errors in FPGA configuration registers, the scrubbing technique is applied by checking and partially reloading the configuration bits [4, 6]. From the algorithmic perspective, Hacene et al. [12] demonstrated the effectiveness of fault-tolerant training in the presence of SRAM bit failures.

2.4 Fault Resilience for RRAM-based Accelerators
RRAM devices suffer from many types of device faults [7], among which the commonly occurring SAFs are shown to cause severe degradation in the performance of mapped neural networks [49]. RRAM cells containing SAF faults get stuck at the high-resistance state (SAF0) or the low-resistance state (SAF1), thereby causing the weight to be stuck at the lowest or highest magnitude of the representation range, respectively. Besides the hard errors, resistance programming variation [24] is another source of faults for NN applications [27].

For the detection of SAFs, Kannan et al. [20, 21] proposed fault detection methods that can provide high fault coverage, and Xia et al. [50] proposed an online fault detection method that can periodically detect the current distribution of faults.

Most of the existing studies on improving the fault resilience ability of RRAM-based neural computation systems focus on designing mapping and retraining methods. Chen et al. [8], Liu et al. [28], and Xia et al. [49, 50] proposed different mapping strategies and the corresponding hardware redundancy designs. After detecting the distribution of the faults and variations, they proposed to retrain (i.e., finetune) the NN model to tolerate the detected faults, which exploits the intrinsic fault resilience capability of NN models. To overcome the programming variations, Liu et al. [27] calculated the calibrated programming target weights with the log-normal resistance variation model, and proposed to map sensitive synapses onto cells with small variations. From the algorithmic perspective, Liu et al. [30] proposed to use error-correcting output codes (ECOC) to improve the NN's resilience capability for tolerating resistance variations and SAFs.

2.5 Neural Architecture Search
Neural Architecture Search (NAS), as an automatic neural network architecture design method, has recently been applied to design model architectures for image classification and language models [29, 36, 57]. The architectures discovered by NAS techniques have demonstrated better performance than manually designed ones. NASNet [57] used a recurrent neural network (RNN) controller to sample architectures, trained them, and used the final validation accuracy to instruct the learning of the controller. Instead of using a reinforcement learning (RL)-learned RNN as the controller, Liu et al. [29] used a relaxed differentiable formulation of the neural architecture search problem, and applied a gradient-based optimizer for optimizing the architecture parameters; Real et al. [39] used evolutionary-based methods for sampling new architectures, by mutating the architectures in the population; recent predictor-based search strategies [34, 35] sample architectures with promising performance predictions, by a gradient-based method [34] or discrete inner search methods [35]. Besides the improvements on the search strategies, many methods have been proposed to speed up the performance evaluation in NAS. Baker et al. [3] incorporated learning curve extrapolation to predict the final performance after a few epochs of training; Real et al. [39] sampled architectures using mutation on existing models and initialized the weights of the sampled architectures by inheriting from the parent model; Pham et al. [36] shared the weights among different sampled architectures, and used the shared weights to evaluate each sampled architecture.

The goal of the NAS problem is to discover the architecture that maximizes some predefined objectives. The process of the original NAS algorithm goes as follows. At each iteration, α is sampled from the architecture search space A. This architecture is then assembled as a candidate network Net(α, w), where w is the weights to be trained. After training the weights w on the training data split D_t, the evaluated reward of the candidate network on the validation data split D_v is used to instruct the sampling process. In its purest form, the NAS problem can be formalized as:

max_{α∈A} E_{x_v∼D_v} [R(x_v, Net(α, w*(α)))]
s.t. w*(α) = argmin_w E_{x_t∼D_t} [L(x_t, Net(α, w))],                              (5)

where A is the architecture search space, ∼ is the sampling operator, and x_t, x_v denote the data sampled from the training and validation data splits D_t, D_v, respectively. E_{x∼D}[·] denotes the expectation with respect to the data distribution D, R denotes the evaluated reward used to instruct the sampling process, and L denotes the loss criterion for back propagation during the training of the weights w.

Originally, for the performance evaluation of each sampled architecture α, one needs to find the corresponding w*(α) by fully training the candidate network from scratch. This process is extremely slow, so shared-weights evaluation is commonly used for accelerating the evaluation. In shared-weights evaluation, each candidate architecture α is a subgraph of a super network and is evaluated using a subset of the super network weights. The shared weights of the super network are updated along the search process.
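The following PyTorch sketch illustrates the shared-weights idea on a toy super network with one choice block per layer; the class and method names are ours (not those of the released aw_nas code). Each layer of the super network holds all candidate operations, and a sampled architecture simply indexes into them, so candidates can be evaluated without being trained from scratch.

```python
import torch
import torch.nn as nn

class SharedOpLayer(nn.Module):
    """One layer of a toy super network: holds all candidate operations."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),   # candidate op 0
            nn.Conv2d(channels, channels, 1),              # candidate op 1
            nn.Identity(),                                 # candidate op 2 (skip)
        ])

    def forward(self, x, op_idx):
        # A sampled architecture selects one op; the op's weights are shared
        # across all candidate architectures that pick it.
        return self.ops[op_idx](x)

class ToySuperNet(nn.Module):
    def __init__(self, channels=16, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([SharedOpLayer(channels) for _ in range(num_layers)])

    def forward(self, x, rollout):
        # `rollout` is a list of op indices, one per layer (the sampled architecture).
        for layer, op_idx in zip(self.layers, rollout):
            x = layer(x, op_idx)
        return x

supernet = ToySuperNet()
rollout = [0, 2, 1, 0]                      # a sampled candidate architecture
out = supernet(torch.randn(2, 16, 32, 32), rollout)
```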

3 FAULT MODELS
In Sec. 3.1, we motivate and discuss the formalization of application-level statistical fault models. Platform-specific analyses are conducted in Sec. 3.2 and Sec. 3.3. Finally, the MAC-i.i.d Bit-Bias (MiBB) feature fault model and the arbitrary-distributed Stuck-at-Fault (adSAF) weight fault model, which are used in the neural architecture search process, are described in Sec. 3.4 and Sec. 3.5. The analyses in this part are summarized in Fig. 4 (a) and Table 3.

3.1 Application-Level Modeling of Computational Faults
Computational faults do not necessarily result in functional errors [16, 25]. For example, a neural network for classification tasks usually outputs a class probability vector, and our work regards it as a functional error only when the top-1 decision becomes different from the golden result. Due to the complexity of the NN computations and this functional error definition, it is very inefficient to incorporate gate-level fault injection or propagation analysis into the training or architecture search process. Therefore, to evaluate and further boost the algorithmic resilience of neural networks to computational faults, application-level fault models should be formalized.

From the algorithmic perspective, the faults fall into two categories: weight faults and feature faults. In this section, we analyze the possible faults in various types of NN accelerators, and formalize the statistical feature and weight fault models. A summary of these fault models is shown in Table 3.


Table 3. Summary of the NN application-level statistical fault models, due to various types of errors on different platforms. Headers: H/S refers to Hard/Soft errors; P/T refers to Persistent/Transient influences; F/W refers to Feature/Weight faults.

Platform       | Error source    | Error position       | Logic component | H/S | P/T | Common mitigation          | F/W | Simplified statistical model
RRAM           | SAF             | SB-cell              | Crossbar        | H   | P   | detection + 3R [8, 28, 50] | W   | w ∼ 1bit-adSAF(w0; p0, p1)
RRAM           | SAF             | MB-cell              | Crossbar        | H   | P   | detection + 3R [8, 28, 50] | W   | w ∼ Qbit-adSAF(w0; p0, p1)
RRAM           | variations      | MB-cell              | Crossbar        | S   | P   | PS loop [17, 27]           | W   | w ∼ LogNormal(w0; σ), w ∼ ReciprocalNormal(w0; σ)
FPGA/ASIC      | SEE, overstress | SRAM                 | Weight buffer   | H   | P   | ECC                        | W   | w ∼ iBF(w0; r_h × M_p(t))
FPGA/ASIC      | SEE, VS         | SRAM                 | Weight buffer   | S   | T   | ECC                        | W   | w ∼ iBF(w0; r_s)
FPGA           | SEE, overstress | LUTs                 | PE              | H   | P   | TMR [4, 42, 55]            | F   | f ∼ iBB(f0; r_h × M_l × M_p(t))
FPGA           | SEE, VS         | LUTs                 | PE              | S   | P   | TMR, Scrubbing [6]         | F   | f ∼ iBB(f0; r_s × M_l × M_p(t))
FPGA/ASIC/RRAM | SEE, overstress | SRAM                 | Feature buffer  | H   | P   | ECC                        | F   | y ∼ iBF(y0; r_h × M_p(t))
FPGA/ASIC/RRAM | SEE, VS         | SRAM                 | Feature buffer  | S   | T   | ECC                        | F   | y ∼ iBF(y0; r_s)
ASIC           | SEE, overstress | CL gates, flip-flops | PE              | H   | P   | TMR, DICE [13]             | F   | f ∼ iBB(f0; r^l_h × M_l × M_p(t))
ASIC           | SEE, VS         | CL gates, flip-flops | PE              | S   | T   | TMR, DICE [13]             | F   | f ∼ iBB(f0; r^l_s × M_l)

Notations: w, f, y refer to the weights, before-activation features, and after-activation features of a convolution; p0, p1 refer to the SAF0 and SAF1 rates of RRAM cells; σ refers to the standard deviation of RRAM programming variations; r_s, r_h refer to the soft and hard error rates of memory elements, respectively; r^l_s, r^l_h refer to the soft and hard error rates of logic elements, respectively; M_l is an amplifying coefficient for the feature error rate due to multiple involved computational components; M_p(t) > 1 is a coefficient that abstracts the error accumulation effects over time.

Abbreviations: SEE refers to Single-Event Errors, including Single-Event Burnout (SEB), Single-Event Upset (SEU), etc.; "overstress" includes conditions such as high temperature, voltage, or physical stress; VS refers to voltage (down)scaling that is used for energy efficiency; SB-cell and MB-cell refer to single-bit and multi-bit memristor cells, respectively; CL gates refer to combinational logic gates; 3R refers to various Redundancy schemes and corresponding Remapping/Retraining techniques; PS loop refers to the programming-sensing loop during memristor programming; TMR refers to Triple Modular Redundancy; DICE refers to Dual Interlocked Cell.

Note that we focus on the computational faults along the datapath inside the NN accelerator that could be modeled and mitigated from the algorithmic perspective. Faults in the control units and other chips in the system are not considered. See more discussion in the "limitation of application-level fault models" section in Sec. 6.4.

3.2 Analysis of CMOS-based Platforms: ASIC and FPGA
The possible errors in CMOS-based platforms are illustrated in Fig. 1. Soft errors that happen in the memory elements or the logic elements could lead to transient faulty outputs in ASICs. Compared with logic elements (e.g., combinational logic gates, flip-flops), memory elements are more susceptible to soft errors [43]. An unprotected SRAM cell usually has a larger bit soft error rate (SER) than flip-flops. Since the occurring probability of hard errors is much smaller than that of soft errors, we focus on the analysis of soft errors, even though hard errors lead to permanent failures.

Fig. 1. The possible error positions in CMOS-based platforms.

The soft errors in the weight buffer could be modeled as i.i.d weight random bit-flips. Given the original value as x0, the distribution of a faulty value x′ under the random bit-flip (BF) model could be written as

x′ ∼ BF(x0; p)  indicates
x′ = 2^-l (2^l x0 ⊕ e),   e = Σ_{q=1}^{Q} e_q 2^(q-1),
e_q ∼ Bernoulli(p),  q = 1, ..., Q,                              (6)

where e_q denotes whether a bit-flip occurs at bit position q, and ⊕ is the XOR operator. By assuming that an error occurs at each bit with an i.i.d bit SER of r_s, we know that each Q-bit weight has an i.i.d probability p_w to encounter an error, and p_w = 1 - (1 - r_s)^Q ≈ r_s × Q, as r_s × Q ≪ 1. It is worth noting that throughout the analysis, we assume that the SERs of all components are ≪ 1, hence the error rate at each level is approximated as the sum of the error rates of the independent sub-components. As each weight encounters errors independently, a weight tensor is distributed as i.i.d random bit-flip (iBF): w ∼ iBF(w0; r_s), where w0 denotes the golden weights. Reagen et al. [38] showed that the iBF model could capture the bit error behavior exhibited by real SRAM hardware.

The soft errors in the feature buffer are modeled similarly as i.i.d random bit-flips, with a fault probability of approximately r_s × Q for Q-bit feature values. The distribution of the output feature map (OFM) values could be written as y ∼ iBF(y0; r_s), where y0 denotes the golden results.
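A minimal sketch of injecting iBF faults into an 8-bit quantized tensor is shown below; the helper is our own and assumes the values are stored as two's-complement integer codes (multiply the result by 2^-l to get back the de-quantized values).

```python
import numpy as np

def inject_ibf(int_codes, ser, bitwidth=8, rng=np.random.default_rng()):
    """Flip each bit of each Q-bit code independently with probability `ser` (iBF model)."""
    codes = int_codes.astype(np.int64) & ((1 << bitwidth) - 1)   # view as unsigned Q-bit codes
    flip_mask = np.zeros_like(codes)
    for q in range(bitwidth):
        flips = rng.random(codes.shape) < ser
        flip_mask |= flips.astype(np.int64) << q
    faulty = codes ^ flip_mask
    # convert back to signed two's-complement values
    return np.where(faulty >= (1 << (bitwidth - 1)), faulty - (1 << bitwidth), faulty)

w_codes = np.random.randint(-128, 128, size=(32, 16, 3, 3))
w_faulty_codes = inject_ibf(w_codes, ser=1e-4)
```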

Actually, FPGA-based implementations are usually more vulnerable to soft errors than their ASIC counterparts [2]. Since the majority of the space of an FPGA chip is filled with memory cells, the overall SER is much higher. Moreover, the soft errors occurring in logic configuration bits would lead to persistent faulty computation, rather than transient faults as in ASIC logic. Persistent errors cannot be mitigated by simple retry methods and would lead to statistically significant performance degradation. Moreover, since the persistent errors would accumulate if no correction is made, the equivalent error rate would keep increasing as time goes on. We abstract this effect with a monotonically increasing function M_p(t) ≥ 1, where the subscript p denotes "persistent", and t denotes the time. For example, if the FPGA weight buffer or LUTs are reloaded every period T in a radioactive environment [4, 6], a multiplier of M_p(T) would be the worst bounding case. Note that the exact choice of t is not important in our experiments, since our work mainly aims at comparing different neural architectures using a certain fault injection pattern and ratio, and the temporal effect modeled by M_p(t) does not influence the architectural preference.

Let us recap how one convolution is mapped onto the FPGA-based accelerator, to see what the configuration bit errors could cause on the OFM values. If the dimension of the convolution kernel is (c, k, k) (channel, kernel height, kernel width, respectively), there are ck^2 - 1 ≈ ck^2 additions needed for the computation of a feature value. We assume that the add operations are spatially expanded onto adder trees constructed by LUTs, i.e., no temporal reuse of adders is used for computing one feature value. That is to say, the add operations are mapped onto different hardware adders3, and encounter errors independently. The per-feature error rate could be approximated by the adder-wise SER times M_l, where M_l ≈ ck^2. Now, let us dive into the adder-level computation: in a 1-bit adder with scale s, a bit-flip in one LUT bit would add a bias ±2^s to the output value, if the input bit signals match the address of this LUT bit. If each LUT cell has an i.i.d SER of r_s, in a Q′-bit adder, denoting the fraction length of the operands and result as l′, the distribution of the faulty output x′ with random bit-bias (BB) faults could be written as

x′ ∼ BB(x0; p, Q′, l′)  indicates
x′ = x0 + e,   e = 2^-l′ Σ_{q=1}^{Q′} (-1)^(β_q) 2^(q-1) e_q,
e_q ∼ Bernoulli(p),  β_q ∼ Bernoulli(0.5),  q = 1, ..., Q′,                              (7)

As for the result of the adder tree constructed by multiple LUT-based adders, since the probability that multiple bit-bias errors co-occur is orders of magnitude smaller, we ignore the accumulation of the biases that are smaller than the OFM quantization resolution 2^-l. Consequently, the OFM feature values before the activation function follow the i.i.d Random Bit-Bias distribution f ∼ iBB(f0; r_s × M_l × M_p(t), Q, l), where Q and l are the bit-width and fraction length of the OFM values, respectively.

We can make an intuitive comparison between the equivalent feature error rates induced by LUT soft errors and feature buffer soft errors. As the majority of FPGAs are SRAM-based, considering the bit SER r_s of LUT cells and BRAM cells to be close, we can see that the feature error rate induced by LUT errors is amplified by M_l × M_p(t). As we have discussed, M_p(t) ≥ 1 and M_l = ck^2 > 1, so the performance degradation induced by LUT errors could be significantly larger than that induced by feature buffer errors.

3.3 Analysis of PIM-based Platforms: RRAM as an Example
In an RRAM-based computing system, compared with the accompanying CMOS circuits, the RRAM crossbar is much more vulnerable to various non-ideal factors. In multi-bit RRAM cells, studies have shown that the distribution of the resistance due to programming variance is either Gaussian or Log-Normal [24]. As each weight is programmed as the conductance of the memristor cell, the weight could be seen as being distributed as Reciprocal-Normal or Log-Normal. Besides the soft errors, common hard errors such as SAFs, caused by fabrication defects or limited endurance, could result in severe performance degradation [49]. SAFs occur frequently in today's RRAM crossbars: as reported by [7], the overall SAF ratio could be larger than 10% (p1 = 9.04% for SAF1 and p0 = 1.75% for SAF0) in a fabricated RRAM device. The statistical model of SAFs in single-bit and multi-bit RRAM devices is formalized in Sec. 3.5.

As the RRAM crossbars also serve as the computation units, some non-ideal factors (e.g., IR-drop, wire resistance) could be abstracted as feature faults. They are not considered in this work since the modeling of these effects highly depends on the implementation (e.g., crossbar dimension, mapping strategy) and hardware-in-the-loop testing [15].

3 See more discussion in Sec. 6.4.


3.4 Feature Fault Model
As analyzed in Sec. 3.2, the soft errors in LUTs are relatively the more pernicious source of feature faults, as 1) the SER is usually much higher than the hard error rate: r_s ≫ r_h; 2) these errors are persistent if no correction is made; 3) the per-feature equivalent error rate is amplified as multiple adders are involved. Therefore, we use the iBB fault model in our exploration of mitigating feature faults.

We have f ∼ iBB(f0; r_s M_l M_p(t)), where M_l = ck^2, and the probability of an error occurring at every position in the OFM is p = r_s M_l M_p(t) Q = p_m M_l, where p_m = r_s Q M_p(t) is defined as the per-MAC error rate. Denoting the dimension of the OFM as (C_o, H, W) (channel, height, and width, respectively) and the dimension of each convolution kernel as (c, k, k), the computation of a convolution layer under this fault model could be written as

y = g(W ⊛ x + b + θ · 2^(α-l) · (-1)^β)
s.t.  θ ∼ Bernoulli(p)^(C_o×H×W)
      α ∼ U{0, ..., Q-1}^(C_o×H×W)
      β ∼ U{0, 1}^(C_o×H×W),                              (8)

where θ is the mask indicating whether an error occurs at each feature map position, α represents the bit position of the bias, and β represents the bias sign. Note that this formulation is not equivalent to the random bit-bias formalization in Eq. 7, and is adopted for efficient simulation. These two formulations are close when the odds that two errors take effect simultaneously are small (p_m / Q ≪ 1). This fault model is referred to as the MAC-i.i.d Bit-Bias model (abbreviated as MiBB). An example of injecting this type of feature fault is illustrated in Fig. 2.

Fig. 2. An example of injecting feature faults under the iBB fault model (soft errors in FPGA LUTs).

Intuitively, convolution computation that needs fewer MACs might be more immune to the faults, as the equivalent error rate at each OFM location is lower.
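A minimal sketch of the MiBB feature fault injection of Eq. 8, written as a PyTorch helper of our own (the released aw_nas implementation may differ in details): it perturbs the before-activation feature map with random-signed, power-of-two biases at i.i.d positions. The bit-width and fraction length below are illustrative values.

```python
import torch

def inject_mibb(f, p_mac, macs_per_output, bitwidth=8, frac_len=4):
    """Inject MiBB feature faults (Eq. 8) into a before-activation feature map `f`.

    p_mac: per-MAC error rate; the per-position rate is p = p_mac * macs_per_output.
    bitwidth / frac_len: Q and l of the feature quantization (illustrative values).
    """
    p = min(1.0, p_mac * macs_per_output)
    theta = (torch.rand_like(f) < p).float()                               # error mask
    alpha = torch.randint(0, bitwidth, f.shape, device=f.device).float()   # bit position
    beta = torch.randint(0, 2, f.shape, device=f.device).float()           # bias sign bit
    bias = theta * torch.pow(2.0, alpha - frac_len) * (1.0 - 2.0 * beta)
    return f + bias

# Usage: for a conv with kernel (c, k, k) = (16, 3, 3), macs_per_output = 16 * 3 * 3.
f = torch.randn(1, 32, 16, 16)
f_faulty = inject_mibb(f, p_mac=1e-4, macs_per_output=16 * 9)
```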

3.5 Weight Fault Model
As RRAM-based accelerators suffer from a much higher weight error rate than CMOS-based ones, the Stuck-at-Faults in RRAM crossbars are mainly considered for the setup of the weight fault model. We assume the underlying platform is RRAM with multi-bit cells, and adopt the commonly-used mapping scheme, in which separate crossbars are used for storing positive and negative weights [10]. That is to say, when an SAF0 fault causes a cell to be stuck at HRS, the corresponding logical weight would be stuck at 0. When an SAF1 fault causes a cell to be stuck at LRS, the weight would be stuck at -R^w if it is negative, or R^w otherwise.

The computation of a convolution layer under the SAF weight fault model could be written as

y = g(W′ ⊛ x + b)
s.t.  W′ = (1 - θ) · W + θ · e
      θ ∼ Bernoulli(p0 + p1)^(C_o×c×k×k)
      m ∼ Bernoulli(p1 / (p0 + p1))^(C_o×c×k×k)
      e = R^w sgn(W) · m,                              (9)

where R^w refers to the representation bound in Eq. 3, θ is the mask indicating whether a fault occurs at each weight position, m is the mask representing the SAF type (SAF0 or SAF1) at faulty weight positions, and e is the mask representing the faulty target values (0 or ±R^w). Every single weight has an i.i.d probability of p0 to be stuck at 0, and p1 to be stuck at the positive or negative bound of the representation range, for positive and negative weights, respectively. An example of injecting this type of weight fault is illustrated in Fig. 3.

Fig. 3. An example of injecting weight faults under the adSAF fault model (SAF errors in RRAM cells).
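A minimal sketch of injecting multi-bit adSAF weight faults following Eq. 9, using a helper of our own; r_w below stands for the representation bound R^w of Eq. 3, and the example value of 2.0 is illustrative only. The p0/p1 values are the SAF rates reported in [7].

```python
import torch

def inject_adsaf(weight, p0, p1, r_w):
    """Inject multi-bit adSAF weight faults (Eq. 9).

    p0: SAF0 rate (weight stuck at 0); p1: SAF1 rate (weight stuck at +/-r_w
    following its sign); r_w: weight representation bound R^w of Eq. 3.
    """
    theta = (torch.rand_like(weight) < (p0 + p1)).float()      # fault mask
    m = (torch.rand_like(weight) < (p1 / (p0 + p1))).float()   # SAF1 vs. SAF0 mask
    e = r_w * torch.sign(weight) * m                           # stuck-at target values
    return (1.0 - theta) * weight + theta * e

w = torch.randn(32, 16, 3, 3)
w_faulty = inject_adsaf(w, p0=0.0175, p1=0.0904, r_w=2.0)
```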

Note that this weight fault model, referred to as the arbitrary-distributed Stuck-at-Fault model (adSAF), is much harder to defend against than SAF faults with a specific known defect map. A neural network model that behaves well under the adSAF model is expected to achieve high reliability across different specific SAF defect maps.

The above adSAF fault model assumes that the underlying hardware consists of multi-bit RRAM devices; adSAFs in single-bit RRAM devices are also of interest. In single-bit RRAM devices, multiple bits of one weight value are mapped onto different crossbars, of which the results are shifted and added together [56]. In this case, an SAF fault that occurs in a cell would cause the corresponding bit of the corresponding weight to be stuck at 0 or 1.


Fig. 4. Illustration of the overall workflow. (a) The setup of the application-level statistical fault models. (b) The FTT-NAS framework. (c) The final fault-tolerant training stage.

The effects of adSAF faults on a weight value in single-bit RRAM devices can be formulated as

w′ = sgn(w) · 2^-l · (((¬θ) ∧ 2^l |w|) ∨ (θ ∧ e))
θ = Σ_{q=1}^{Q} θ_q 2^(q-1),   e = Σ_{q=1}^{Q} m_q 2^(q-1)
θ_q ∼ Bernoulli(p0 + p1)  i.i.d,  q = 1, ..., Q
m_q ∼ Bernoulli(p1 / (p0 + p1))  i.i.d,  q = 1, ..., Q,                              (10)

where the binary representation of θ indicates whether a fault occurs at each bit position, and the binary representation of e represents the target faulty values (0 or 1) at each bit position if a fault occurs. We will demonstrate that the architecture discovered under the multi-bit adSAF fault model can also defend against single-bit adSAF faults and iBF weight faults caused by errors in the weight buffers of CMOS-based accelerators.
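A bitwise sketch of the single-bit adSAF model of Eq. 10, operating on the unsigned Q-bit magnitude codes of the quantized weights (an illustrative helper of our own; the sign of each weight is handled outside this function):

```python
import numpy as np

def inject_adsaf_1bit(mag_codes, p0, p1, bitwidth=8, rng=np.random.default_rng()):
    """Single-bit adSAF (Eq. 10): each bit of the Q-bit magnitude code is independently
    stuck at 0 (prob. p0) or 1 (prob. p1)."""
    theta = np.zeros_like(mag_codes)   # which bits are faulty
    e = np.zeros_like(mag_codes)       # stuck-at target bit values
    for q in range(bitwidth):
        faulty = rng.random(mag_codes.shape) < (p0 + p1)
        stuck_at_one = rng.random(mag_codes.shape) < (p1 / (p0 + p1))
        theta |= faulty.astype(mag_codes.dtype) << q
        e |= (faulty & stuck_at_one).astype(mag_codes.dtype) << q
    return (~theta & mag_codes) | (theta & e)

codes = np.random.randint(0, 256, size=(32, 16, 3, 3))     # |w| * 2^l as integer codes
faulty_codes = inject_adsaf_1bit(codes, p0=0.0175, p1=0.0904)
```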

4 FAULT-TOLERANT NAS
In this section, we present the FTT-NAS framework. We first give the problem formalization and framework overview in Sec. 4.1. Then, the search space and the sampling and assembling processes are described in Sec. 4.2 and Sec. 4.3, respectively. Finally, the search process is elaborated in Sec. 4.4.


4.1 Framework Overview
Denoting the fault distribution characterized by the fault models as F, the neural architecture search for fault tolerance can be formalized as

max_{α∈A} E_{x_v∼D_v} [E_{f∼F} [R(x_v, Net(α, w*(α)), f)]]
s.t. w*(α) = argmin_w E_{x_t∼D_t} [E_{f∼F} [L(x_t, Net(α, w), f)]],                              (11)

where A is the architecture search space, and D_t, D_v denote the training and validation data splits, respectively. R and L denote the reward and loss criterion, respectively. The major difference between Eq. 11 and the vanilla NAS problem in Eq. 5 lies in the introduction of the fault model F.

As the cost of finding the best weights w* for each architecture α is almost unbearable, we use the shared-weights based evaluator, in which shared weights are directly used to evaluate sampled architectures. The resulting method, FTT-NAS, solves this NAS problem approximately. FT-NAS can be viewed as a degraded special case of FTT-NAS, in which no fault is injected in the inner optimization of finding w*(α).

The overall neural architecture search framework is illustrated in Fig. 4 (b). There are multiple components in the framework: a controller that samples different architecture rollouts from the search space; a candidate network that is assembled by taking the corresponding subset of weights from the super-net; and a shared-weights based evaluator that evaluates the performance of different rollouts on the CIFAR-10 dataset, using fault-tolerant objectives.

4.2 Search Space
The design of the search space is as follows: We use a cell-based macro architecture, similar to the one used in [29, 36]. There are two types of cells: the normal cell, and the reduction cell with stride 2. All normal cells share the same connection topology, while all reduction cells share another connection topology. The layout and connections between cells are illustrated in Fig. 5.

In every cell, there are B nodes; node 1 and node 2 are treated as the cell's inputs, which are the outputs of the two previous cells. For each of the other B - 2 nodes, two in-coming connections will be selected and element-wise added. For each connection, the 11 possible operations are: none; skip connect; 3x3 average (avg.) pool; 3x3 max pool; 1x1 Conv; 3x3 ReLU-Conv-BN block; 5x5 ReLU-Conv-BN block; 3x3 SepConv block; 5x5 SepConv block; 3x3 DilConv block; 5x5 DilConv block.

The complexity of the search space can be estimated. For each cell type, there are (11^(B-2) × (B - 1)!)^2 possible choices. As there are two independent cell types, there are (11^(B-2) × (B - 1)!)^4 possible architectures in the search space, which is roughly 9.5 × 10^24 with B = 6 in our experiments.

4.3 Sampling and Assembling Architectures
In our experiments, the controller is an RNN, and the performance evaluation is based on a super network with shared weights, as used by [36].

An example of the sampled cell architecture is illustrated in Fig. 6. Specifically, to sample a cell architecture, the controller RNN samples B - 2 blocks of decisions, one for each node 3, ..., B. In the decision block for node i, M = 2 input nodes are sampled from 1, ..., i - 1, to be connected with node i. Then M operations are sampled from the 11 basic operation primitives, one for each of the M connections. Note that the two sampled input nodes can be the same node j, which will result in two independent connections from node j to node i.

During the search process, the architecture assembling process using the shared-weights super network is straightforward [36]: just take out the weights from the super network corresponding to the connections and operation types of the sampled architecture.
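The decision structure of a sampled cell can be sketched as follows, using a random stand-in for the RNN controller and a data layout of our own (the real controller is learned with REINFORCE as described in Sec. 4.4):

```python
import random

NUM_OPS, B, M = 11, 6, 2   # 11 primitives, 6 nodes per cell, 2 inputs per node

def sample_cell():
    """Sample one cell: for each node 3..B, pick M input nodes and M operations."""
    decisions = []
    for node in range(3, B + 1):
        inputs = [random.randint(1, node - 1) for _ in range(M)]   # input nodes may repeat
        ops = [random.randrange(NUM_OPS) for _ in range(M)]
        decisions.append((node, list(zip(inputs, ops))))
    return decisions

# A rollout consists of one normal cell and one reduction cell.
rollout = {"normal": sample_cell(), "reduce": sample_cell()}
```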


Fig. 5. Illustration of the search space design. Left: The layout and connections between cells. Right: The possible connections in each cell, and the possible operation types on every connection.

Fig. 6. An example of the sampled cell architecture.

4.4 Searching for Fault-Tolerant Architecture
The FTT-NAS algorithm is illustrated in Alg. 1. To search for a fault-tolerant architecture, we use a weighted sum of the clean accuracy acc_c and the accuracy with fault injection acc_f as the reward to instruct the training of the controller:

R = (1 - α_r) · acc_c + α_r · acc_f,                              (12)

where acc_f is calculated by injecting faults following the fault distribution described in Sec. 3. For the optimization of the controller, we employ the Adam optimizer [22] to optimize the REINFORCE [47] objective, together with an entropy encouraging regularization.

In every epoch of the search process, we alternately train the shared weights and the controller on separate data splits D_t and D_v, respectively.


Algorithm 1 FTT-NAS
1: EPOCH: the total search epochs
2: w: shared weights in the super network
3: θ: the parameters of the controller π
4: epoch = 0
5: while epoch < EPOCH do
6:   for all x_t, y_t ∼ D_t do
7:     a ∼ π(a; θ)  # sample an architecture from the controller
8:     f ∼ F(f)  # sample faults from the fault model
9:     L_c = CE(Net(a; w)(x_t), y_t)  # clean cross entropy
10:    L_f = CE(Net(a; w)(x_t, f), y_t)  # faulty cross entropy
11:    L(x_t, y_t, Net(a; w), f) = (1 - α_l) L_c + α_l L_f
12:    w = w - η_w ∇_w L  # for clarity, we omit momentum calculation here
13:   end for
14:   for all x_v, y_v ∼ D_v do
15:     a ∼ π(a; θ)  # sample an architecture from the controller
16:     f ∼ F(f)  # sample faults from the fault model
17:     R_c = Acc(Net(a; w)(x_v), y_v)  # clean accuracy
18:     R_f = Acc(Net(a; w)(x_v, f), y_v)  # faulty accuracy
19:     R(x_v, y_v, Net(a; w), f) = (1 - α_r) R_c + α_r R_f
20:     θ = θ + η_θ (R - b) ∇_θ log π(a; θ)
21:   end for
22:   epoch = epoch + 1
23:   schedule η_w, η_θ
24: end while
25: return a ∼ π(a; θ)

For the training of the shared weights, we carried out experiments under two different settings: without/with FTT. When training with FTT, a weighted sum of the clean cross entropy loss CE_c and the cross entropy loss with fault injection CE_f is used to train the shared weights. The FTT loss can be written as

L = (1 - α_l) · CE_c + α_l · CE_f.                              (13)

As shown in lines 7-12 of Alg. 1, in each step of training the shared weights, we sample an architecture α using the current controller, then backpropagate using the FTT loss to update the parameters of the candidate network. Training without FTT (in FT-NAS) is a special case with α_l = 0.

As shown in lines 15-20 of Alg. 1, in each step of training the controller, we sample an architecture from the controller, assemble this architecture using the shared weights, and then get the reward R on one data batch in D_v. Finally, the reward is used to update the controller by applying the REINFORCE technique [47], with the reward baseline denoted as b.
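A minimal, self-contained PyTorch sketch of one shared-weights FTT update (lines 7-12 of Alg. 1) is given below. It is not the paper's implementation: the candidate network is a toy two-layer model instead of a sampled super-network subgraph, the injected faults are adSAF weight faults (Eq. 9) with illustrative p0/p1/r_w values, and α_l = 0.5 mirrors the setting reported in Sec. 5.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Toy candidate "network": one conv + one linear, kept functional so that the
# faulty weights can be built as differentiable expressions of the shared weights.
w_conv = torch.randn(16, 3, 3, 3, requires_grad=True)
w_fc = torch.randn(10, 16, requires_grad=True)
opt = torch.optim.SGD([w_conv, w_fc], lr=0.05, momentum=0.9)
alpha_l = 0.5

def adsaf(w, p0=0.04, p1=0.04, r_w=2.0):
    """adSAF corruption (Eq. 9); gradients flow through the unfaulty positions."""
    theta = (torch.rand_like(w) < (p0 + p1)).float()
    e = r_w * torch.sign(w.detach()) * (torch.rand_like(w) < p1 / (p0 + p1)).float()
    return (1 - theta) * w + theta * e

def forward(x, conv_w, fc_w):
    h = F.relu(F.conv2d(x, conv_w, padding=1))
    h = F.adaptive_avg_pool2d(h, 1).flatten(1)
    return F.linear(h, fc_w)

x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
loss_clean = F.cross_entropy(forward(x, w_conv, w_fc), y)
loss_faulty = F.cross_entropy(forward(x, adsaf(w_conv), adsaf(w_fc)), y)
loss = (1 - alpha_l) * loss_clean + alpha_l * loss_faulty   # Eq. (13)
opt.zero_grad(); loss.backward(); opt.step()                # line 12 of Alg. 1
```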

5 EXPERIMENTS
In this section, we demonstrate the effectiveness of the FTT-NAS framework and analyze the discovered architectures under different fault models. First, we introduce the experiment setup in Sec. 5.1. Then, the effectiveness under the feature and weight fault models is shown in Sec. 5.2 and Sec. 5.3, respectively. The effectiveness of the learned controller is illustrated in Sec. 5.4. Finally, the analyses and illustrative experiments are presented in Sec. 5.5.


5.1 Setup

Our experiments are carried out on the CIFAR-10 [23] dataset. CIFAR-10 is one of the most commonly used computer vision datasets and contains 60000 32×32 RGB images. Three manually designed architectures, VGG-16, ResNet-18, and MobileNet-V2, are chosen as the baselines. 8-bit dynamic fixed-point quantization is used throughout the search and training process, and the fraction length is determined following the minimal-overflow principle (see the sketch at the end of this subsection).

In the neural architecture search process, we split the training dataset into two subsets: 80% of the training data is used to train the shared weights, and the remaining 20% is used to train the controller. The super network is an 8-cell network with all the possible connections and operations. The channel number of the first cell is set to 20 during the search process, and the channel number increases by 2 upon every reduction cell. The controller network is an RNN with one hidden layer of size 100. The learning rate for training the controller is 1e-3. The reward baseline 𝑏 is updated using a moving average with momentum 0.99. To encourage exploration, we add an entropy-encouraging regularization term to the controller's REINFORCE objective, with a coefficient of 0.01. For training the shared weights, we use an SGD optimizer with momentum 0.9 and weight decay 1e-4; the learning rate is scheduled by a cosine annealing scheduler [33], starting from 0.05. Each architecture search process is run for 100 epochs. Note that these are all typical settings similar to those used in [36].

To conduct the final training of the architectures (Fig. 4(c)), we run fault-tolerant training for 100 epochs. The learning rate is set to 0.1 initially and decayed by a factor of 10 at epochs 40 and 80. We have experimented with one fault-tolerant training choice: whether to mask out the erroneous positions in features/weights during the backpropagation process. If the error positions are masked out, no gradient is backpropagated through the erroneous feature positions, and no gradient is calculated w.r.t. the erroneous weight positions. We find that this choice does not affect the fault-tolerant training result, so we do not use the masking operation in our final experiments.

We build the neural architecture search framework and the fault injection framework upon the PyTorch framework, and all the code is available at https://github.com/walkerning/aw_nas.
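As an illustration of the quantization setting mentioned above, the following is a minimal sketch of one reasonable reading of 8-bit dynamic fixed-point quantization, with the fraction length chosen by the minimal-overflow principle; the actual aw_nas implementation may differ in details.

```python
import math
import torch

def quantize_dynamic_fixed_point(x: torch.Tensor, n_bits: int = 8):
    """Quantize `x` to n-bit dynamic fixed point; returns (quantized tensor, frac_len)."""
    max_abs = float(x.abs().max())
    if max_abs == 0.0:
        return x.clone(), n_bits - 1
    # Minimal-overflow principle: use the smallest number of integer bits that still
    # covers the largest magnitude, leaving the remaining bits as fraction bits.
    int_bits = max(0, math.floor(math.log2(max_abs)) + 1)
    frac_len = n_bits - 1 - int_bits              # one bit is reserved for the sign
    scale = 2.0 ** frac_len
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    q = torch.clamp(torch.round(x * scale), qmin, qmax)
    return q / scale, frac_len
```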

5.2 Defend Against MiBB Feature Faults

Table 4. Comparison of different architectures under the MiBB feature fault model

Arch           Training†    Clean acc.   Accuracy (%) with feature faults            #FLOPs    #Params
                                          3e-6     1e-5     3e-5     1e-4     3e-4
ResNet-18      clean        94.7          89.1     63.4     11.5     10.0     10.0    1110M     11.16M
VGG-16         clean        93.1          78.2     21.4     10.0     10.0     10.0     626M     14.65M
MobileNet-V2   clean        92.3          10.0     10.0     10.0     10.0     10.0     182M      2.30M
F-FT-Net       clean        91.0          71.3     22.8     10.0     10.0     10.0     234M      0.61M
ResNet-18      𝑝𝑚=1e-4      79.2          79.1     79.6     78.9     60.6     11.3    1110M     11.16M
VGG-16         𝑝𝑚=3e-5      83.5          82.4     77.9     50.7     11.1     10.0     626M     14.65M
MobileNet-V2   𝑝𝑚=3e-4      71.2          70.3     69.0     68.7     68.1     47.8     182M      2.30M
F-FTT-Net      𝑝𝑚=3e-4      88.6          88.7     88.5     88.0     86.2     51.0     245M      0.65M

†: As also noted in the main text, for all the FTT-trained models, we successively try per-MAC fault injection probabilities 𝑝𝑚 in {3e-4, 1e-4, 3e-5} and use the largest injection probability with which the model achieves a clean accuracy above 50%.

As described in Sec. 4, we conduct neural architecture search without and with fault-tolerant training (i.e., FT-NAS and FTT-NAS, respectively). The per-MAC injection probability 𝑝𝑚 used in the search process is 1e-4, and the reward coefficient 𝛼𝑟 in Eq. 12 is set to 0.5.


[Figure: normal and reduction cell diagrams, built mainly from sep_conv_5x5 and dil_conv_5x5 blocks, with a few dil_conv_3x3, avg_pool_3x3, and skip_connect edges.]
Fig. 7. The discovered cell architectures under the MiBB feature fault model. (a) Normal cell. (b) Reduction cell.

In FTT-NAS, the loss coefficient 𝛼𝑙 in Eq. 13 is also set to 0.5. As the baselines for FT-NAS and FTT-NAS, we train ResNet-18, VGG-16, and MobileNet-V2 with both normal training and FTT. For each model trained with FTT, we successively try per-MAC fault injection probabilities 𝑝𝑚 in {3e-4, 1e-4, 3e-5} and use the largest injection probability with which the model achieves a clean accuracy above 50%. Consequently, ResNet-18 and VGG-16 are trained with per-MAC fault injection probabilities of 1e-4 and 3e-5, respectively.
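This probability-selection procedure for the FTT-trained baselines can be summarized by the small helper below; `train_with_ftt` and `eval_clean_acc` are hypothetical callables wrapping the training and evaluation pipelines.

```python
def select_ftt_injection_prob(train_with_ftt, eval_clean_acc,
                              candidates=(3e-4, 1e-4, 3e-5), threshold=0.5):
    """Return the largest per-MAC injection probability whose FTT-trained
    model still reaches a clean accuracy above the threshold (50%)."""
    for p_m in candidates:            # candidates ordered from large to small
        model = train_with_ftt(p_m)
        if eval_clean_acc(model) > threshold:
            return p_m, model
    return None, None
```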

The discovered cell architectures are shown in Fig. 7, and the evaluation results are shown in Table 4. The discovered architecture F-FTT-Net outperforms the baselines significantly at various fault ratios. Meanwhile, compared with the most efficient baseline, MobileNet-V2, F-FTT-Net has a comparable FLOPs count and only 28.3% of the parameters (0.65M versus 2.30M). If the accuracy is required to stay above 70%, MobileNet-V2 can only function with a per-MAC error rate of 3e-6, whereas F-FTT-Net can function with a per-MAC error rate larger than 1e-4. That is, while meeting the same accuracy requirement, F-FTT-Net can function in an environment with a much higher SER.

We can see that FTT-NAS is much more effective than its degraded variant, FT-NAS. We conclude that, generally, NAS should be used in conjunction with FTT, as suggested by Eq. 11. Another interesting fact is that, under the MiBB fault model, the relative rankings of the resilience capabilities of different architectures change after FTT: after FTT, MobileNet-V2 suffers the smallest accuracy degradation among the three baselines, whereas it is the most vulnerable one without FTT.

5.3 Defend Against adSAF Weight Faults

We conduct FT-NAS and FTT-NAS under the adSAF model. The overall SAF ratio 𝑝 = 𝑝0 + 𝑝1 is set to 8%, in which the proportions of SAF0 and SAF1 are 83.7% and 16.3%, respectively (SAF0 ratio 𝑝0=6.7%, SAF1 ratio 𝑝1=1.3%). The reward coefficient 𝛼𝑟 is set to 0.2, and the loss coefficient 𝛼𝑙 in FTT-NAS is set to 0.7. After the fault-tolerant training of the discovered architecture, we test the model accuracy with various SAF ratios (from 4% to 12%), while the relative ratio 𝑝0/𝑝1 remains unchanged according to the numbers reported by Chen et al. [7] (a simplified injection sketch is given below).

The discovered cell architectures are shown in Fig. 8. As shown in Table 5, the discovered W-FTT-Net outperforms the baselines significantly at various test SAF ratios, with comparable FLOPs and fewer parameters. We then apply channel augmentation to the discovered architecture to explore the performance of the model at different scales. We can see that models with larger capacity have better reliability under the adSAF weight fault model, e.g., 54.2% (W-FTT-Net-40) vs. 38.4% (W-FTT-Net-20) with 10% adSAF faults.
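As a concrete illustration of this evaluation setting (and of the sweep over SAF ratios in Table 5), the helper below applies a deliberately simplified stuck-at injection to a weight tensor. The exact adSAF semantics (per-cell stuck levels and the weight-to-cell mapping) follow the fault-model section earlier in the paper; the stuck values used here are assumptions for illustration only.

```python
import torch

def inject_saf_weight_faults(w: torch.Tensor, p0: float = 0.067, p1: float = 0.013,
                             saf0_value: float = 0.0, saf1_value: float = None):
    """Simplified stuck-at-fault injection: each weight is independently stuck
    at `saf0_value` with prob. p0 (SAF0) or at `saf1_value` with prob. p1 (SAF1)."""
    if saf1_value is None:
        # Assumption: SAF1 pins the cell at its largest representable level.
        saf1_value = float(w.abs().max())
    r = torch.rand_like(w)
    w_faulty = torch.where(r < p0, torch.full_like(w, saf0_value), w)
    w_faulty = torch.where((r >= p0) & (r < p0 + p1),
                           torch.full_like(w, saf1_value), w_faulty)
    return w_faulty

# Sweeping the overall SAF ratio p from 4% to 12% while keeping p0/p1 fixed:
#   for p in (0.04, 0.06, 0.08, 0.10, 0.12):
#       inject_saf_weight_faults(w, p0=p * 0.837, p1=p * 0.163)
```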


Table 5. Comparison of different architectures under the adSAF weight fault model

Arch            Training   Clean acc.   Accuracy (%) with weight faults             #FLOPs    #Params
                                         0.04     0.06     0.08     0.10     0.12
ResNet-18       clean      94.7          64.8     34.9     17.8     12.4     11.0    1110M     11.16M
VGG-16          clean      93.1          45.7     21.7     14.3     12.6     10.6     626M     14.65M
MobileNet-V2    clean      92.3          26.2     14.3     11.7     10.3     10.5     182M      2.30M
W-FT-Net-20     clean      91.7          54.2     30.7     19.6     15.5     11.9    1020M      3.05M
ResNet-18       𝑝=0.08     92.0          86.4     77.9     60.8     41.6     25.6    1110M     11.16M
VGG-16          𝑝=0.08     91.1          82.6     73.3     58.5     41.7     28.1     626M     14.65M
MobileNet-V2    𝑝=0.08     86.3          76.6     55.9     35.7     18.7     15.1     182M      2.30M
W-FTT-Net-20†   𝑝=0.08     90.8          86.2     79.5     69.6     53.5     38.4     919M      2.71M
W-FTT-Net-40    𝑝=0.08     92.1          88.8     85.5     79.3     69.2     54.2    3655M     10.78M

†: The “-𝑁” suffix means that the base of the channel number is 𝑁.

[Figure: normal and reduction cell diagrams, built mainly from relu_conv_bn_3x3/5x5, conv_1x1, and sep_conv_3x3 blocks, with dil_conv_3x3, max_pool_3x3, and skip_connect edges.]
Fig. 8. The discovered cell architectures under the adSAF weight fault model. (a) Normal cell. (b) Reduction cell.

In order to investigate whether a model FTT-trained under the adSAF fault model can tolerate other types of weight faults, we evaluate the reliability of W-FTT-Net under the 1bit-adSAF model and the iBF model. As shown in Fig. 9 (b)(c), under the 1bit-adSAF and iBF weight fault models, W-FTT-Net outperforms all the baselines consistently at different noise levels.
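For reference, the sketch below shows one straightforward way to apply i.i.d. bit-flip (iBF) faults to integer weight codes for this kind of cross-model evaluation. The exact iBF definition follows the fault-model section of the paper, and the unsigned-code assumption here is purely illustrative.

```python
import torch

def inject_ibf_weight_faults(codes: torch.Tensor, p_flip: float, n_bits: int = 8):
    """Flip each bit of the integer weight codes independently with prob. p_flip.

    `codes` is assumed to hold the n-bit weight codes as non-negative integers
    in [0, 2**n_bits); dequantization back to real values is left to the caller.
    """
    out = codes.to(torch.int64)
    for b in range(n_bits):
        flip = (torch.rand(out.shape, device=out.device) < p_flip).to(torch.int64)
        out = out ^ (flip * (1 << b))     # XOR the sampled flips into bit position b
    return out
```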

[Figure: three panels of accuracy-vs-fault-ratio curves.]
Fig. 9. Accuracy curves under different weight fault models. (a) W-FTT-Net under the 8bit-adSAF model. (b) W-FTT-Net under the 1bit-adSAF model. (c) W-FTT-Net under the iBF model.


5.4 The Effectiveness of the Learned Controller

To demonstrate the effectiveness of the learned controller, we compare the performance of the architectures sampled by the controller with that of architectures randomly sampled from the search space. For both the MiBB feature fault model and the adSAF weight fault model, we randomly sample 5 architectures from the search space and train them with FTT for 100 epochs. A per-MAC fault injection probability of 3e-4 is used for feature faults, and an SAF ratio of 8% (𝑝0=6.7%, 𝑝1=1.3%) is used for weight faults.

As shown in Table 6 and Table 7, the performance of different architectures in the search space varies a lot, and the architectures sampled by the learned controllers, F-FTT-Net and W-FTT-Net, outperform all the randomly sampled architectures. Note that, as we use different preprocess operations for feature faults and weight faults (ReLU-Conv-BN 3x3 and SepConv 3x3, respectively), there exist differences in FLOPs and parameter number even with the same cell architectures.

Table 6. RNN controller VS. random samples under the MiBB feature fault model

Model        Clean acc.   Acc. (%) @ 𝑝𝑚=3e-4   #FLOPs   #Params
sample1      60.2         19.5                  281M     0.81M
sample2      79.7         29.7                  206M     0.58M
sample3      25.0         32.2                  340M     1.09M
sample4      32.9         25.8                  387M     1.23M
sample5      17.4         10.8                  253M     0.77M
F-FTT-Net    88.6         51.0                  245M     0.65M

Table 7. RNN controller VS. random sample under the adSAF weight fault model

Model        Clean acc.   Acc. (%) @ 𝑝=8%   #FLOPs   #Params
sample1      90.7         63.6               705M     1.89M
sample2      84.7         36.7               591M     1.54M
sample3      90.3         60.3               799M     2.33M
sample4      90.5         64.0               874M     2.55M
sample5      85.2         45.6               665M     1.83M
W-FTT-Net    90.7         68.5               919M     2.71M

5.5 Inspection of the Discovered Architectures

Feature faults: From the discovered cell architectures shown in Fig. 7, we can observe that the controller clearly prefers SepConv and DilConv blocks over ReLU-Conv-BN blocks. This observation is consistent with our anticipation, since under the MiBB feature fault model, operations with fewer FLOPs result in a lower equivalent fault rate in the OFM.

Under the MiBB feature fault model, there is a tradeoff between the capacity of the model and the feature error rate. As the number of channels increases, the operations become more expressive, but the equivalent error rates in the OFMs also get higher. Thus there exists a tradeoff point 𝑐∗ for the number of channels. Intuitively, 𝑐∗ depends on the per-MAC error rate 𝑝𝑚: the larger 𝑝𝑚 is, the smaller 𝑐∗ is.
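A rough back-of-the-envelope estimate makes this tradeoff concrete (our own approximation, assuming that the bit-bias faults injected by individual MACs accumulate independently; the exact expression follows the MiBB model definition given earlier in the paper):

```latex
% Equivalent per-element fault rate of an OFM element computed with M MACs,
% each of which independently injects a bit-bias with probability p_m:
\begin{aligned}
  p_{\mathrm{OFM}} &= 1 - (1 - p_m)^{M} \;\approx\; M\, p_m \qquad (M p_m \ll 1),\\
  M &= c_{\mathrm{in}} \cdot k \cdot k \quad \text{for a } k \times k \text{ convolution with } c_{\mathrm{in}} \text{ input channels}.
\end{aligned}
```

For instance, with 𝑝𝑚 = 1e-4, a standard 3×3 convolution over 20 input channels has M = 180 and an equivalent OFM error rate of roughly 1.8%, whereas a depthwise 3×3 convolution has M = 9 and roughly 0.09%; this is consistent both with the preference for SepConv/DilConv blocks and with the existence of a tradeoff channel count 𝑐∗.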

Besides the choice of primitives, the connection pattern and the combination of different primitives also play a role in making an architecture fault-tolerant. To verify this, we first conduct a simple experiment to confirm the preference of primitives:


For each of the 4 different primitives (SepConv 3x3, SepConv 5x5, DilConv 3x3, DilConv 5x5), we stack 5 layers of the primitive and obtain the performance of the stacked NN after FTT training it with 𝑝𝑚=3e-4. The stacked NNs achieve accuracies of 60.0%, 65.1%, 50.0%, and 56.3% with 𝑝𝑚=1e-4, respectively. The stacked NN of SepConv 5x5 blocks achieves the best performance, which is no surprise since the most frequent block in F-FTT-Net is SepConv 5x5. Then, we construct six architectures by randomly sampling five architectures with only SepConv 5x5 connections and by replacing all the primitives in F-FTT-Net with SepConv 5x5 blocks. The best result achieved by these six architectures is 77.5% with 𝑝𝑚=1e-4 (versus 86.2% achieved by F-FTT-Net). These illustrative experiments indicate that the connection pattern and the combination of different primitives both play a role in the fault resilience capability of a neural network architecture.

Weight faults: Under the adSAF fault model, the controller prefers ReLU-Conv-BN blocks over SepConv and DilConv blocks. This preference is not so easy to anticipate. We hypothesize that the weight distribution of different primitives might lead to different behaviors when encountering SAF faults. For example, if the quantization range of a weight value is larger, the value deviation caused by an SAF1 fault would be larger, and we know that a large increase in the magnitude of weights would damage the performance severely [12]. We conduct a simple experiment to verify this hypothesis: We stack several blocks to construct a network, and in each block, one of the three operations (a SepConv 3x3 block, a ReLU-Conv-BN 3x3 block, and a ReLU-Conv-BN 1x1 block) is randomly picked in every training step. The SepConv 3x3 block is constructed with a DepthwiseConv 3x3 and two Conv 1x1, and the ReLU-Conv-BN 3x3 and ReLU-Conv-BN 1x1 contain a Conv 3x3 and a Conv 1x1, respectively. After training, the weight magnitude ranges of Conv 3x3, Conv 1x1, and DepthwiseConv 3x3 are 0.036±0.043, 0.112±0.121, and 0.140±0.094, respectively. Since the magnitude of the weights in 3x3 convolutions is smaller than that of the 1x1 convolutions and the depthwise convolutions, SAF weight faults would cause larger weight deviations in a SepConv or DilConv block than in a ReLU-Conv-BN 3x3 block.
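The measurement itself is easy to reproduce; the utility below (our own sketch, not part of the released code) groups the convolutions of a trained PyTorch model by kernel type and reports the mean ± std of the absolute weight values.

```python
import torch
import torch.nn as nn

def weight_magnitude_stats(model: nn.Module):
    """Return {conv type: (mean |w|, std |w|)} for all Conv2d modules in `model`."""
    grouped = {}
    for _, m in model.named_modules():
        if not isinstance(m, nn.Conv2d):
            continue
        if m.groups == m.in_channels and m.kernel_size == (3, 3):
            key = "depthwise_conv_3x3"
        elif m.kernel_size == (3, 3):
            key = "conv_3x3"
        elif m.kernel_size == (1, 1):
            key = "conv_1x1"
        else:
            key = "conv_{}x{}".format(*m.kernel_size)
        grouped.setdefault(key, []).append(m.weight.detach().abs().flatten())
    return {k: (torch.cat(v).mean().item(), torch.cat(v).std().item())
            for k, v in grouped.items()}
```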

6 DISCUSSION

6.1 Orthogonality

Most previous methods exploit the inherent fault resilience capability of existing NN architectures to tolerate different types of hardware faults. In contrast, our methods improve the inherent fault resilience capability of NN models, thus effectively increasing the algorithmic fault resilience "budget" that can be utilized by hardware-specific methods. Our methods are orthogonal to existing fault-tolerance methods and can be easily integrated with them, e.g., helping hardware-based methods largely reduce their overhead.

6.2 Data representation

In our work, an 8-bit dynamic fixed-point representation is used for the weights and features. As pointed out in Sec. 5.5, the dynamic range has an impact on the resilience characteristics against weight faults. The data format itself obviously decides or affects the data range. Yan et al. [52] found that errors in the exponent bits of 32-bit floating-point weights have large impacts on performance. Li et al. [25] investigated the resilience characteristics of several floating-point and non-dynamic fixed-point representations.

6.3 Other search strategies (controllers)

A NAS system mainly consists of three components that work together: the search space, which defines the architectural decisions to make; the evaluator, which assesses the performance of an architecture; and the search strategy (controller), which explores the search space using the rewards provided by the evaluator.


Apart from a few exceptions such as the differentiable NAS methods [29], the NAS system can be built in a modularized and decoupled way, in which the controller and the evaluator are designed independently. Our major modifications to the NAS framework are that 1) we analyze the different categories of faults existing in various devices and conduct fault injection with the formulated fault model when evaluating the reward; and 2) we incorporate fault-tolerant training when training the supernet weights to reduce the gap between the search and final training stages, since FTT is a common technique for training a fault-tolerant NN model. Note that both of these amendments reside in the evaluator component. Although we choose the popular reinforcement learning-based controller, other controllers such as evolutionary-based [39] and predictor-based ones [35] could easily be incorporated into the FTT-NAS framework. The application of other controllers is outside the scope of this work.
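To make this decoupling concrete, the following sketch (our own illustration, not the aw_nas API) shows a minimal controller/evaluator interface in which both FTT-NAS amendments live inside the evaluator.

```python
from abc import ABC, abstractmethod

class Controller(ABC):
    """Search strategy: proposes architectures and learns from their rewards."""
    @abstractmethod
    def sample(self): ...
    @abstractmethod
    def update(self, arch, reward): ...

class Evaluator(ABC):
    """Assesses architectures; both FTT-NAS amendments live here:
    (1) fault injection when computing the reward, and
    (2) fault-tolerant training of the supernet (shared) weights."""
    @abstractmethod
    def train_supernet_step(self, arch): ...
    @abstractmethod
    def evaluate(self, arch) -> float: ...

def search(controller: Controller, evaluator: Evaluator, n_steps: int):
    """A decoupled search loop: RL, evolutionary [39], or predictor-based [35]
    controllers can be plugged in without touching the fault-aware evaluator."""
    for _ in range(n_steps):
        arch = controller.sample()
        evaluator.train_supernet_step(arch)   # FTT on the shared weights
        reward = evaluator.evaluate(arch)     # fault-injected reward
        controller.update(arch, reward)
```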

6.4 Fault model

Limitation of the application-level fault model: There are faults that are hard or unlikely to model and mitigate with our methods, e.g., timing errors, and routing/DSP errors in FPGAs. A hardware-in-the-loop framework could be established for a thorough evaluation of system-level fault hazards. In any case, since the correspondence between these faults and the application-level elements is subtle, it is more suitable to mitigate them at a lower abstraction layer.

FPGA platform: In the MiBB feature fault model, we assume that the add operations are spatially expanded onto independent hardware adders, which applies to template-based designs [45]. For ISA (Instruction Set Architecture) based accelerators [37], the NN computations are orchestrated using instructions and time-multiplexed onto hardware units. In this case, the accumulation of the faults follows a different model and might show different preferences among architectures. Nevertheless, the FTT-NAS framework is general and can be used with different fault models. We leave the exploration and experiments of other fault models for future work.

RRAM platform: As for the RRAM platform, this paper mainly focuses on discovering fault-tolerant neural architectures to mitigate SAFs, which have significant impacts on computing accuracy. In addition to SAFs, device variation is another typical RRAM non-ideal factor that may lead to inaccurate computation. There exist various circuit-level optimizations that can mitigate the computation error caused by RRAM variation. First, with the development of RRAM device technology, a large on/off ratio of RRAM devices (i.e., the resistance ratio of the high resistance state and the low resistance state) can be obtained (e.g., 10^3 [48]). A large on/off ratio makes the bit-line current difference among different computation results obvious and thus improves the fault tolerance capability against variation. Second, in existing RRAM-based accelerators, the number of RRAM rows activated at one time is limited. For example, only four rows are activated in each cycle, which provides a sufficient signal margin against process variation [51]. In contrast, compared with process variation, it is more costly to mitigate SAFs by circuit-level optimization (e.g., existing work utilizes costly redundant hardware to tolerate SAFs [18]). Thus, we aim at tolerating SAFs from the algorithmic perspective. That said, simulating the variation is a meaningful extension of the general FTT-NAS framework for the RRAM platform, and we leave it for future work.

Combining multiple fault models: We experiment with one fault model at a time and do not combine different fault models. Our experimental results show that the architectural preferences of the adSAF weight fault model and the MiBB feature fault model are distinct (see the discussion in Sec. 5.5). Fortunately, the two types of faults that we experiment with would not co-exist in the same part of an NN model: MiBB feature faults are caused by FPGA LUT errors, whereas adSAF weight faults occur in the RRAM crossbar.


Nevertheless, there indeed exist scenarios in which some weight and feature errors could happen simultaneously on one platform. For example, iBF in the feature buffer and SAF in the crossbar can occur simultaneously in an RRAM-based accelerator. However, on the same platform in the same environment, the influences of different types of errors are usually vastly different. For example, compared with the accuracy degradation caused by SAF errors in the RRAM cells, the influence of iBF errors in the feature buffer can usually be ignored on the same device.

As a future direction, it might be interesting to combine these fault models to search for a neural architecture to be partitioned and deployed onto a heterogeneous hardware system. In that case, the fault patterns, along with the computation and memory access patterns of the multiple platforms, should be considered jointly.

7 CONCLUSION

In this paper, we analyze the possible faults in various types of NN accelerators and formalize statistical fault models from the algorithmic perspective. After the analysis, the MAC-i.i.d Bit-Bias (MiBB) model and the arbitrary-distributed Stuck-at-Fault (adSAF) model are adopted in the neural architecture search for tolerating feature faults and weight faults, respectively. To search for fault-tolerant neural network architectures, we propose the multi-objective Fault-Tolerant NAS (FT-NAS) and Fault-Tolerant Training NAS (FTT-NAS) methods. In FTT-NAS, the NAS technique is employed in conjunction with Fault-Tolerant Training (FTT). The fault resilience capabilities of the discovered architectures, F-FTT-Net and W-FTT-Net, outperform multiple manually designed architecture baselines with comparable or fewer FLOPs and parameters, and W-FTT-Net trained under the 8bit-adSAF model can defend against other types of weight faults. Generally, compared with FT-NAS, FTT-NAS is more effective and should be used. In addition, through the inspection of the discovered architectures, we find that, since operation primitives differ in their MAC counts, expressiveness, and weight distributions, they exhibit different resilience capabilities under different fault models. The connection pattern is also shown to influence the fault resilience capability of NN models.

ACKNOWLEDGMENTS

This work was supported by National Natural Science Foundation of China (No. U19B2019, 61832007, 61621091), National Key R&D Program of China (No. 2017YFA02077600); Beijing National Research Center for Information Science and Technology (BNRist); Beijing Innovation Center for Future Chips; the project of Tsinghua University and Toyota Joint Research Center for AI Technology of Automated Vehicle (TT2020-01); and Beijing Academy of Artificial Intelligence.

REFERENCES
[1] Austin P. Arechiga and Alan J. Michaels. 2018. The Robustness of Modern Deep Learning Architectures against Single Event Upset Errors. 2018 IEEE High Performance Extreme Computing Conference (HPEC), 1–6.
[2] Hossein Asadi and Mehdi B. Tahoori. 2007. Analytical Techniques for Soft Error Rate Modeling and Mitigation of FPGA-Based Designs. IEEE Transactions on Very Large Scale Integration Systems (TVLSI) 15, 12 (Dec 2007), 1320–1331.
[3] Bowen Baker, Otkrist Gupta, R. Raskar, and N. Naik. 2017. Accelerating neural architecture search using performance prediction. arXiv preprint arXiv:1705.10823 (2017).
[4] Cristiana Bolchini, Antonio Miele, and Marco D. Santambrogio. 2007. TMR and Partial Dynamic Reconfiguration to mitigate SEU faults in FPGAs. IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT) (2007), 87–95.
[5] Shekhar Borkar. 2005. Designing reliable systems from unreliable components: the challenges of transistor variability and degradation. IEEE Micro 25, 6 (2005), 10–16.
[6] Carl Carmichael, Michael Caffrey, and Anthony Salazar. 2000. Correcting single-event upsets through Virtex partial configuration. Xilinx Application Notes 216 (2000), v1.


[7] Ching-Yi Chen, Hsiu-Chuan Shih, Cheng-Wen Wu, C. Lin, Pi-Feng Chiu, S. Sheu, and F. Chen. 2015. RRAM Defect Modeling and Failure Analysis Based on March Test and a Novel Squeeze-Search Scheme. IEEE Trans. Comput. 64, 1 (Jan 2015), 180–190.
[8] Lerong Chen, Jiawen Li, Yiran Chen, Qiuping Deng, Jiyuan Shen, X. Liang, and L. Jiang. 2017. Accelerator-friendly neural-network training: Learning variations and defects in RRAM crossbar. In IEEE/ACM Proceedings Design, Automation and Test in Europe (DATE). 19–24.
[9] Tianshi Chen, Zidong Du, Ninghui Sun, J. Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. In ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[10] Ping Chi, Shuangchen Li, C. Xu, Tao Zhang, J. Zhao, Yongpan Liu, Y. Wang, and Yuan Xie. 2016. PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory. In IEEE/ACM International Symposium on Computer Architecture (ISCA). IEEE Press, 27–39.
[11] Kaiyuan Guo, Shulin Zeng, Jincheng Yu, Yu Wang, and Huazhong Yang. 2019. A Survey of FPGA-based Neural Network Inference Accelerators. ACM Transactions on Reconfigurable Technology and Systems (TRETS) 12, 1, Article 2 (March 2019), 26 pages.
[12] Ghouthi Boukli Hacene, François Leduc-Primeau, Amal Ben Soussia, Vincent Gripon, and F. Gagnon. 2019. Training Modern Deep Neural Networks for Memory-Fault Robustness. IEEE International Symposium on Circuits and Systems (ISCAS) (2019), 1–5.
[13] Mahta Haghi and Jeff Draper. 2009. The 90 nm Double-DICE storage element to reduce Single-Event upsets. In IEEE International Midwest Symposium on Circuits and Systems (MWSCAS). IEEE, 463–466.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), 770–778.
[15] Zhezhi He, Jie Lin, Rickard Ewetz, J. Yuan, and Deliang Fan. 2019. Noise Injection Adaption: End-to-End ReRAM Crossbar Non-ideal Effect Adaption for Neural Network Mapping. In ACM/IEEE Design Automation Conference (DAC). ACM.
[16] Jörg Henkel, Lars Bauer, Nikil Dutt, Puneet Gupta, Sani Nassif, Muhammad Shafique, Mehdi Tahoori, and Norbert Wehn. 2013. Reliable on-chip systems in the nano-era: Lessons learnt and future trends. In ACM/IEEE Design Automation Conference (DAC). IEEE, 1–10.
[17] Miao Hu, Hai Li, Yiran Chen, Q. Wu, and G. Rose. 2013. BSB training scheme implementation on memristor-based circuit. In IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA). IEEE, 80–87.
[18] Wenqin Huangfu, Lixue Xia, Ming Cheng, Xiling Yin, Tianqi Tang, Boxun Li, Krishnendu Chakrabarty, Yuan Xie, Yu Wang, and Huazhong Yang. 2017. Computation-oriented fault-tolerance schemes for RRAM computing systems. In IEEE/ACM Asia and South Pacific Design Automation Conference (ASPDAC). IEEE, 794–799.
[19] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2017. Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research (JMLR) 18 (2017).
[20] Sachhidh Kannan, Naghmeh Karimi, Ramesh Karri, and Ozgur Sinanoglu. 2015. Modeling, Detection, and Diagnosis of Faults in Multilevel Memristor Memories. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD) 34 (2015), 822–834.
[21] Sachhidh Kannan, Jeyavijayan Rajendran, Ramesh Karri, and Ozgur Sinanoglu. 2013. Sneak-path Testing of Memristor-based Memories. 2013 26th International Conference on VLSI Design and 2013 12th International Conference on Embedded Systems (2013), 386–391.
[22] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR) (2015).
[23] Alex Krizhevsky. 2009. Learning multiple layers of features from tiny images. (2009).
[24] Binh Q. Le, Alessandro Grossi, Elisa Vianello, Tony Wu, Giusy Lama, Edith Beigne, H.-S. Philip Wong, and Subhasish Mitra. 2018. Resistive RAM with multiple bits per cell: Array-level demonstration of 3 bits per cell. 66, 1 (2018), 641–646.
[25] Guanpeng Li, S. Hari, M. Sullivan, T. Tsai, K. Pattabiraman, J. Emer, and Stephen W. Keckler. 2017. Understanding error propagation in deep learning neural network (DNN) accelerators and applications. In ACM/IEEE Supercomputing Conference (SC). ACM, 8.
[26] F. Libano, B. Wilson, J. Anderson, M. Wirthlin, C. Cazzaniga, C. Frost, and P. Rech. 2019. Selective Hardening for Neural Networks in FPGAs. IEEE Transactions on Nuclear Science (TNS) 66 (2019), 216–222.
[27] Beiye Liu, Hai Li, Yiran Chen, Xin Li, Qing Wu, and Tingwen Huang. 2015. Vortex: Variation-aware training for memristor X-bar. In ACM/IEEE Design Automation Conference (DAC). 1–6.
[28] Chenchen Liu, Miao Hu, John Paul Strachan, and Hai Li. 2017. Rescuing memristor-based neuromorphic design with high defects. In 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC). IEEE, 1–6.


[29] Hanxiao Liu, K. Simonyan, and Yiming Yang. 2019. DARTS: Differentiable Architecture Search. In International Conference on Learning Representations (ICLR).
[30] Tao Liu, Wujie Wen, Lei Jiang, Yanzhi Wang, Chengmo Yang, and Gang Quan. 2019. A Fault-Tolerant Neural Network Architecture. In ACM/IEEE Design Automation Conference (DAC). 55:1–55:6.
[31] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. 2016. SSD: Single Shot MultiBox Detector. In European Conference on Computer Vision (ECCV). Springer, 21–37.
[32] Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015), 3431–3440.
[33] Ilya Loshchilov and Frank Hutter. 2017. SGDR: Stochastic gradient descent with warm restarts. International Conference on Learning Representations (ICLR) (2017).
[34] Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu. 2018. Neural Architecture Optimization. In Annual Conference on Neural Information Processing Systems (NIPS). 7816–7827.
[35] Xuefei Ning, Yin Zheng, Tianchen Zhao, Yu Wang, and Huazhong Yang. 2020. A Generic Graph-based Neural Architecture Encoding Scheme for Predictor-based NAS. In European Conference on Computer Vision (ECCV).
[36] Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. 2018. Efficient Neural Architecture Search via Parameter Sharing. In International Conference on Machine Learning (ICML).
[37] Jiantao Qiu, J. Wang, Song Yao, K. Guo, Boxun Li, Erjin Zhou, J. Yu, T. Tang, N. Xu, S. Song, Yu Wang, and H. Yang. 2016. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In ACM International Symposium on Field-Programmable Gate Arrays (FPGA). ACM, 26–35.
[38] Brandon Reagen, Udit Gupta, L. Pentecost, P. Whatmough, S. Lee, Niamh Mulholland, D. Brooks, and Gu-Yeon Wei. 2018. Ares: A Framework for Quantifying the Resilience of Deep Neural Networks. In ACM/IEEE Design Automation Conference (DAC) (San Francisco, California) (DAC ’18). ACM, Article 17, 6 pages.
[39] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. 2019. Regularized evolution for image classifier architecture search. In AAAI Conference on Artificial Intelligence, Vol. 33. 4780–4789.
[40] Christoph Schorn, Andre Guntoro, and Gerd Ascheid. 2018. Accurate neuron resilience prediction for a flexible reliability management in neural network accelerators. In IEEE/ACM Proceedings Design, Automation and Test in Europe (DATE).
[41] Ali Shafahi, W. Ronny Huang, Mahyar Najibi, Octavian Suciu, Christoph Studer, Tudor Dumitras, and Tom Goldstein. 2018. Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks. Annual Conference on Neural Information Processing Systems (NIPS) (2018), 6103–6113.
[42] Xiaoxuan She and N. Li. 2017. Reducing Critical Configuration Bits via Partial TMR for SEU Mitigation in FPGAs. IEEE Transactions on Nuclear Science (TNS) 64 (2017), 2626–2632.
[43] Charles Slayman. 2011. Soft error trends and mitigation techniques in memory devices. In Proceedings - Annual Reliability and Maintainability Symposium. 1–5.
[44] Christian Szegedy, W. Zaremba, Ilya Sutskever, Joan Bruna, D. Erhan, Ian J. Goodfellow, and R. Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013).
[45] Stylianos I. Venieris and C. Bouganis. 2019. fpgaConvNet: Mapping Regular and Irregular Convolutional Neural Networks on FPGAs. IEEE Transactions on Neural Networks and Learning Systems 30 (2019), 326–342.
[46] Jean-Charles Vialatte and François Leduc-Primeau. 2017. A study of deep learning robustness against computation failures. arXiv:1704.05396 (2017).
[47] Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 3-4 (1992), 229–256.
[48] Jiyong Woo, Tien Van Nguyen, Jeong-Hun Kim, J. Im, Solyee Im, Yeriaron Kim, Kyeong-Sik Min, and S. Moon. 2020. Exploiting defective RRAM array as synapses of HTM spatial pooler with boost-factor adjustment scheme for defect-tolerant neuromorphic systems. Scientific Reports 10 (2020).
[49] Lixue Xia, Wenqin Huangfu, Tianqi Tang, Xiling Yin, K. Chakrabarty, Yuan Xie, Y. Wang, and H. Yang. 2018. Stuck-at Fault Tolerance in RRAM Computing Systems. IEEE Journal on Emerging and Selected Topics in Circuits and Systems (JETCAS) 8 (2018), 102–115.
[50] Lixue Xia, Mengyun Liu, Xuefei Ning, K. Chakrabarty, and Yu Wang. 2017. Fault-tolerant training with on-line fault detection for RRAM-based neural computing systems. ACM/IEEE Design Automation Conference (DAC) (2017), 1–6.
[51] Cheng-Xin Xue, J.-M. Hung, H.-Y. Kao, Y.-H. Huang, S.-P. Huang, F.-C. Chang, P. Chen, T.-W. Liu, C.-J. Jhang, C.-I. Su, W.-S. Khwa, C.-C. Lo, R.-S. Liu, C.-C. Hsieh, K.-T. Tang, Y.-D. Chih, T.-Y. J. Chang, and M.-F. Chang. 2021. A 22nm 4Mb 8b-precision ReRAM computing-in-memory macro with 11.91 to 195.7 TOPS/W for tiny AI edge devices. In IEEE International Solid-State Circuits Conference (ISSCC).
[52] Zheyu Yan, Yiyu Shi, Wang Liao, M. Hashimoto, Xichuan Zhou, and Cheng Zhuo. 2020. When Single Event Upset Meets Deep Neural Networks: Observations, Explorations, and Remedies. IEEE/ACM Asia and South Pacific Design Automation Conference (ASPDAC) (2020), 163–168.


[53] Quanshi Zhang, Ruiming Cao, Feng Shi, Ying Nian Wu, and Song-Chun Zhu. 2018. Interpreting CNN knowledge via an explanatory graph. In AAAI Conference on Artificial Intelligence.
[54] Yang Zhao, X. Hu, Shuangchen Li, Jing Ye, Lei Deng, Y. Ji, Jianyu Xu, Dong Wu, and Yuan Xie. 2019. Memory Trojan Attack on Neural Network Accelerators. IEEE/ACM Proceedings Design, Automation and Test in Europe (DATE) (2019), 1415–1420.
[55] Zhuoran Zhao, D. Agiakatsikas, N. H. Nguyen, E. Çetin, and O. Diessel. 2018. Fine-grained module-based error recovery in FPGA-based TMR systems. ACM Transactions on Reconfigurable Technology and Systems (TRETS) 11, 1 (2018), 4.
[56] Zhenhua Zhu, Hanbo Sun, Yujun Lin, Guohao Dai, L. Xia, Song Han, Yu Wang, and H. Yang. 2019. A Configurable Multi-Precision CNN Computing Framework Based on Single Bit RRAM. In ACM/IEEE Design Automation Conference (DAC). 1–6.
[57] Barret Zoph and Quoc V. Le. 2017. Neural Architecture Search with Reinforcement Learning. International Conference on Learning Representations (ICLR) (2017).
