Reliability Models for the Internet of Things: A Paradigm Shift
Mudasir Ahmad
Distinguished Engineer
Center of Excellence for Numerical Analysis
Cisco Systems, Inc. email: [email protected]
December 5th, 2014
© 2013-2014 Cisco and/or its affiliates. All rights reserved.
Internet of Things (IoT)
http://internetofeverything.cisco.com/
Internet of Things (IoT)
IoT expected to grow exponentially and surpass smartphones
Technology Hype Cycle
Source: Gartner, August 2014
Internet of Things (IoT) Reference Model
“The Internet of Things: Moving Beyond the Hype”, Wim Elfrink, Cisco, IoT World Forum 2014
Networking Devices Coming Closer To You
• Field Area Network Power Grid Example
Networking Devices Coming Closer To You
• Connected Car Example
Cisco ACI – SDN and More
• Transitioning from traditional model to open architecture
• Third party software and apps
• How reliable will the third party software be?
• How to capture uncertainty in software reliability?
• “Black box” approach – from deterministic to probabilistic
Key Challenges of IoT Devices
1. Standardized interfaces - such as IPv6
2. Configuration of a massive number of devices
3. Strong access control and authentication
4. Privacy and Safety
5. Instrumentation and feedback
6. Dealing with software errors, vulnerabilities, and software updates
7. Potential opportunities for third party businesses
Vint Cerf, Federal Trade Commission Workshop on Internet of Things, “Internet of Things - Privacy and Security in a Connected World”, November 19, 2013
Additional Challenges of IoT Network Devices
High Reliability
• Comparable to Telecom (Hardware: 99.999%, Software: 99.95%)
• Limited serviceability options (cannot easily access hardware)
• Long target field life
• High SYSTEM Level Reliability (Hardware + Software)
Uncertain Use Conditions
• Multiple use applications of the same product
• Combination of use conditions
• How to capture uncertainty?
• Segmentation? Ruggedization?
Black Box Software
• Little control over third party software
• Compatibility issues with hardware
• Hardware resource consumption constraints
• Variable software updates on different systems
Requires a paradigm shift: from deterministic to probabilistic
Analysis Overview
Hardware in Uncertain Environments: Case Study 1 & 2
Software in Uncertain Environments: Case Study 4
Hardware + Software in Uncertain Environments: Case Study 3a & 3b
Case Study 1:
[Hardware]
Interconnect Reliability
Effect of Uncertain Use Conditions
Hardware Interconnect Failures
[Figure: FCBGA on interposer mounted on a motherboard. Labels: 1st level interconnect, 2nd level interconnect, 3rd level interconnect, FCBGA, interposer, motherboard/line card]
FCBGA = Flip-Chip Ball Grid Array
IoT Environments – Uncertain Conditions
[Figure: temperature fluctuations vs. product expected lifetime for datacenters, industrial, mobile, wind mills, and ships]
• Different applications and uncertain use conditions
• The higher the temperature fluctuations, the shorter the product expected life.
Hardware Device Reliability Analysis Process Today
[Figure: process flow. Lab testing of interconnects (temperature vs. time profile) and average product use conditions (temperature vs. time profile) lead to a product service life recommendation (years)]
Example of Field Life Prediction

Lab Testing
Parameter                    Sub Parameter                  Value       Units
Thermal Cycling Conditions   Minimum Temperature            0           °C
                             Maximum Temperature            100         °C
                             Delta T                        100         °C
                             Heating and Cooling Rate       10          °C/Minute
                             Dwell Time                     10          Minute
Failure Data                 Samples Tested                 32          N/A
                             Test Standard                  IPC-9701    N/A
                             Characteristic Life            7000        Cycles
                             Weibull Shape Parameter (β)    5           N/A
Failure Mode                 Solder Joint Cracking          -           -

Average Product Use Conditions
Applications                      Telecom (Controlled)
Field Lifetime (Years)            10 years
Environmental Cycles              1 / month
Power Cycles                      1 / day
Operational Temperature Range     Power: 85 °C; Environment: 6 °C
Chip Junction Temperature (Tj)    Typical/Max.: 85 °C / 110 °C

Product Service Life Recommendation (Years)
Calculated Field Life: 9.4 years, assuming data center use conditions
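
The failure data above are given as a Weibull characteristic life and shape parameter. As a minimal sketch of how those two numbers translate into cycles-to-failure at the lab test condition, the following assumes a standard two-parameter Weibull distribution. The function names are illustrative, and this is not the author's field-life model (which is not given on the slides), so it does not reproduce the 9.4-year figure.

```python
import math

# Values from the slide's failure data table.
CHAR_LIFE_CYCLES = 7000.0   # Weibull characteristic life (test cycles)
SHAPE_BETA = 5.0            # Weibull shape parameter

def weibull_fraction_failed(cycles, char_life=CHAR_LIFE_CYCLES, shape=SHAPE_BETA):
    """Cumulative fraction of parts failed after `cycles` test cycles."""
    return 1.0 - math.exp(-((cycles / char_life) ** shape))

def weibull_cycles_to_failure(fraction_failed, char_life=CHAR_LIFE_CYCLES, shape=SHAPE_BETA):
    """Test cycles by which `fraction_failed` of the population has failed."""
    if not 0.0 < fraction_failed < 1.0:
        raise ValueError("fraction_failed must be strictly between 0 and 1")
    return char_life * (-math.log(1.0 - fraction_failed)) ** (1.0 / shape)

# Print the test cycles at a few failure fractions.
for fraction in (0.01, 0.05, 0.50):
    cycles = weibull_cycles_to_failure(fraction)
    print(f"{fraction:.0%} failed after about {cycles:.0f} test cycles")
```

By definition, about 63.2% of parts have failed at the characteristic life, which is a quick sanity check on the two functions.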
What About Uncertainty?
Parameter                   Sub Parameter                     Mean               Std. Dev   Units
Test Condition (IPC-9701)   Minimum Temperature               0                  ±1         °C
                            Maximum Temperature               100                ±5         °C
                            Delta T                           100                ±5         °C
                            Heating and Cooling Rate          10                 ±1         °C / Min
                            Dwell Time                        10                 ±1         Min
Use Condition (JESD49)      Field Lifetime                    10                 NA         Years
                            Environmental Cycles              1 / month          ±0.05      Cycles
                            Power Cycles                      1 / day            ±0.1       Cycles
                            Operational Temperature Range     Power: ∆85         ±5         °C
                                                              Environment: ∆6    ±1         °C
                            Chip Junction Temperature (Tj),   25%: 70            ±5 / ±3    °C
                            Traffic/Temp                      50%: 77
                                                              75%: 85
                                                              100%: 110
10,000 iterations of Monte Carlo simulation run for target life
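
The Monte Carlo step can be sketched as follows. The slides do not state the acceleration model used, so this sketch assumes a simple Coffin-Manson scaling with a hypothetical exponent and Miner's rule for combining power and environmental cycles. Those choices, and all function names, are assumptions made for illustration, so the absolute numbers will not match the slides. What it shows is the mechanics: draw each uncertain input from a normal distribution with the tabulated mean and standard deviation, compute a field life per draw, and read off percentiles.

```python
import random

TEST_CHAR_LIFE = 7000.0  # characteristic life in test cycles (from the slides)
CM_EXPONENT = 2.0        # ASSUMPTION: Coffin-Manson exponent, not from the slides

def field_life_years(dt_test, dt_power, dt_env, power_per_day, env_per_month):
    """Years until accumulated damage reaches 1 (assumed Coffin-Manson + Miner's rule)."""
    cycles_power = TEST_CHAR_LIFE * (dt_test / dt_power) ** CM_EXPONENT
    cycles_env = TEST_CHAR_LIFE * (dt_test / dt_env) ** CM_EXPONENT
    damage_per_year = (365.0 * power_per_day / cycles_power
                       + 12.0 * env_per_month / cycles_env)
    return 1.0 / damage_per_year

def simulate(n_runs=10_000, seed=0):
    """Return sorted field-life samples; inputs drawn as independent normals."""
    rng = random.Random(seed)
    lives = []
    for _ in range(n_runs):
        # Means and standard deviations from the table; floors guard against
        # non-physical (zero or negative) draws.
        dt_test = max(rng.gauss(100.0, 5.0), 1.0)
        dt_power = max(rng.gauss(85.0, 5.0), 1.0)
        dt_env = max(rng.gauss(6.0, 1.0), 0.1)
        power_per_day = max(rng.gauss(1.0, 0.1), 0.01)
        env_per_month = max(rng.gauss(1.0, 0.05), 0.01)
        lives.append(field_life_years(dt_test, dt_power, dt_env,
                                      power_per_day, env_per_month))
    lives.sort()
    return lives

def percentile(sorted_values, pct):
    """Value at the given percentile, by index into a sorted list."""
    return sorted_values[int(pct / 100.0 * (len(sorted_values) - 1))]

lives = simulate()
for pct in (5, 50, 95):
    print(f"{pct}th percentile field life: {percentile(lives, pct):.1f} years")
```

The spread between the 5th and 50th percentiles is the quantity of interest for risk analysis; swapping in a different acceleration model only requires changing `field_life_years`.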
Field Life Results with Uncertainty (Datacenter)
• Typically, a 5% probability is used for risk analysis.
• There is a 5% chance that the product could fail in almost half the time predicted deterministically.
• The difference is almost 45% (9.4 years vs. 5.12 years).
• Deterministic results can be significantly erroneous.
[Figure: field life distribution. 5% Prob = 5.12 years; 50% Prob = 9.4 years; 5% Prob = 17.43 years]
Variations in Ambient Temperature Range
Diurnal temperature range (DTR) is the difference between the daily maximum and daily minimum temperature
• Temperature range reduces in the winter months
• Temperature range increases during summer
• Overall, DTR could be as high as 40 °C in summer months, depending on use location
• DTR values averaged over 6 years
Sun et al., “Seasonal Variations in Diurnal Temperature Range From Satellites and Surface Observations”, IEEE, Oct 2006
Ambient Temperature Uncertainty (Outdoor Applications)
• The average daily temperature across the US is roughly 11.5 °C, with a standard deviation of roughly 8.42 °C
• The results show that the same product used outside (all else kept the same) will fail 2X faster
• How to design to improve product reliability in these uncertain conditions?
• Ruggedize or improve baseline reliability? How much to spend on ruggedizing?
[Figure: field life distribution for outdoor use. 5% Prob = 3.7 years; 50% Prob = 7.1 years]
Case Study 2:
[Hardware]
Fan AND Interconnect Reliability
Effect of Uncertain Use Conditions
Reliability Tradeoffs
Interconnects: The hotter it runs, the earlier it fails
Fans: The more the fan cools the interconnect, the earlier the fan fails!
[Figure: interconnect cooled by fan. Image courtesy of http://www.idac.co.uk/products/products/icepak.html]
Fan Speed vs. Chip Junction Temperature
[Figure: fan speed (rpm, 0 to 7000) vs. chip junction temperature T (°C, 0 to 120), with fitted curve: fan speed = 8040.8·e^(−0.023·T)]
Typical relationship between fan and chip temperature
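
The fitted curve on this slide can be read in both directions. The sketch below evaluates fan speed from junction temperature using the slide's coefficients and inverts the same expression to get the junction temperature associated with a given fan speed. The function names are illustrative, and the fit is only meaningful within the plotted range.

```python
import math

FAN_COEFF_RPM = 8040.8   # pre-exponential coefficient from the slide's fitted curve
FAN_DECAY_PER_C = 0.023  # exponential decay constant per °C from the slide

def fan_speed_rpm(junction_temp_c):
    """Fan speed (rpm) on the fitted curve at a chip junction temperature (°C)."""
    return FAN_COEFF_RPM * math.exp(-FAN_DECAY_PER_C * junction_temp_c)

def junction_temp_c(fan_rpm):
    """Chip junction temperature (°C) on the fitted curve at a fan speed (rpm)."""
    if fan_rpm <= 0:
        raise ValueError("fan_rpm must be positive")
    return math.log(FAN_COEFF_RPM / fan_rpm) / FAN_DECAY_PER_C

# Evaluate the curve at the junction temperatures used elsewhere in the deck.
for temp in (70, 85, 110):
    print(f"Tj = {temp} °C -> about {fan_speed_rpm(temp):.0f} rpm")
```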
Reliability Tradeoff Curve
[Figure: Tradeoff Curve for Product Life, Max Temperature and BGA Fatigue Cycles. Effective product life (years, 0 to 7) vs. max junction temperature (°C, 80 to 130), with curves for 7000, 8000, 9000, and 10000 cycles; a fan limited region and an interconnect limited region are marked]
• Optimize chip target temperature to maximize product reliability
• IoT product could “self optimize” chip temperature to maximize reliability in real time
Case Study 3a:
[Software – Hardware]
Software – Hardware Interaction
Effect of Traffic Load Uncertainty
Effect of Traffic Load
Parameter                   Sub Parameter                     Mean               Std. Dev   Units
Test Condition (IPC-9701)   Minimum Temperature               0                  ±1         °C
                            Maximum Temperature               100                ±5         °C
                            Delta T                           100                ±5         °C
                            Heating and Cooling Rate          10                 ±1         °C / Min
                            Dwell Time                        10                 ±1         Min
Use Condition (JESD49)      Field Lifetime                    15                 NA         Years
                            Environmental Cycles              1 / month          ±0.05      Cycles
                            Power Cycles                      1 / day            ±0.1       Cycles
                            Operational Temperature Range     Power: ∆85         ±5         °C
                                                              Environment: ∆6    ±1         °C
                            Chip Junction Temperature (Tj),   25%: 70            ±5 / ±3    °C
                            Traffic/Temp                      50%: 77
                                                              75%: 85
                                                              100%: 110
Traffic Load Variation Uncertainty
• Internet traffic varies significantly over time
• Load variation means even more temperature fluctuation
S. Gebert et al., “Internet Access Traffic Measurement and Analysis”, Traffic Monitoring and Analysis, Lecture Notes in Computer Science, Volume 7189, 2012, pp. 29-42
Traffic Load Variation Uncertainty
• With higher load variation in traffic, the field life reduces even further
• The model can be used iteratively to try out typical traffic variations and see how sensitive the field life is to traffic variation
Outdoor IoT Application
Predicted Field Life (5% Probability)     Value   Units
Nominal Load (110 °C, 5 °C Std Dev)       3.7     Years
Variable Load (110 °C, 10 °C Std Dev)     2.98    Years
Case Study 3b:
[Software – Hardware]
Software Resource Consumption
Effect of Hardware Resource Availability
Cisco ACI – SDN and More
• Transitioning from traditional model to open architecture
• Third party software and apps
• How reliable will the third party software be?
• How to capture uncertainty in software reliability?
• “Black box” approach – from deterministic to probabilistic
Software Resource Optimization
• OEM Software and Third Party Software running on the same hardware
• Deterministically, not all 5 software applications can run at the same time (mean demand totals 139 CPU units against 100 available, and 205 GB of memory against 200 GB)
• System could be unstable due to resource constraints
• Monte Carlo simulation (10,000 runs)
Software Resources
Software   CPU (Total: 100)         Memory (Total: 200 GB)
           Mean   Std Deviation     Mean   Std Deviation
A          20     10                60     5
B          32     12                25     10
C          25     5                 40     10
D          32     2                 30     4
E          30     10                50     5
Total      139    -                 205    -
Software Resource Consumption
• 95% of the time, the hardware resources will be sufficient
• 5% of the time, the system could be CPU constrained but not memory constrained
• Could consider increasing CPU capability to accommodate software usage
• Proper resource allocation could reduce instability issues with third party software
• Prevent thrashing-related issues (memory constraints) by applying the same technique to individual processes
Software Resource Consumption
Resource   Target    Predicted Usage (5% Probability)
CPU        100       107.3
Memory     200 GB    178.4 GB
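
The 10,000-run simulation behind these numbers can be sketched as below. The sketch assumes each application's CPU and memory demand is an independent normal variable with the tabulated mean and standard deviation, which the slides do not state explicitly. Under that assumption, the 5% quantile of the summed demand (the total that demand falls below in 5% of runs) comes out close to the predicted usage values in the table. Function and variable names are illustrative.

```python
import random

# (mean, standard deviation) per application, from the slide's resource table.
CPU = {"A": (20, 10), "B": (32, 12), "C": (25, 5), "D": (32, 2), "E": (30, 10)}
MEMORY_GB = {"A": (60, 5), "B": (25, 10), "C": (40, 10), "D": (30, 4), "E": (50, 5)}

def simulate_totals(demand, n_runs=10_000, seed=0):
    """Sorted total demand per run; each application drawn as an independent normal.

    ASSUMPTION: draws are not truncated at zero, so rare negative draws are kept.
    """
    rng = random.Random(seed)
    totals = [sum(rng.gauss(mean, std) for mean, std in demand.values())
              for _ in range(n_runs)]
    totals.sort()
    return totals

def quantile(sorted_values, q):
    """Value at quantile q (0..1), by index into a sorted list."""
    return sorted_values[int(q * (len(sorted_values) - 1))]

cpu_5pct = quantile(simulate_totals(CPU), 0.05)
mem_5pct = quantile(simulate_totals(MEMORY_GB), 0.05)
print(f"CPU total at 5% quantile:    {cpu_5pct:.1f} (target 100)")
print(f"Memory total at 5% quantile: {mem_5pct:.1f} GB (target 200 GB)")
```

Changing a capacity target or one application's mean and re-running is a quick way to explore the resource allocation options listed above.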
Case Study 4:
[Software]
Software Update Frequency
Effect of Hardware-Software Incompatibility
Software-Hardware Incompatibility Issues
• Hardware is updated approximately every 18 months
• Moore’s Law – doubling every 18 months
• Software is most stable on the hardware platform it was benchmarked on
• Hardware-compatibility software updates need to be frequent to prevent hardware-software compatibility issues. But how frequent?
• Asynchronous hardware-software updates could lead to higher incompatibility issues
Software-Hardware Incompatibility Issues
[Figure: number of hardware-software compatibility issues vs. time, marking the point of software benchmarking and the points of hardware updates]
• Examples:
• Windows 8 running 3 software applications originally benchmarked on Windows XP
• Windows 7 running 2 software applications originally benchmarked on Windows 8
• How to estimate incompatibility issues and prioritize how to fix them?
Software-Hardware Incompatibility Issues
• Theoretical methodology:
• First establish target lifetime of product
• Enterprise Networking products typically have 5 – 7 years life
• Modular slots can be updated with release of new hardware
• Next, for each software, estimate number of updates done over lifetime-to-date of product (mean and standard deviation)
• Ratio of lifetime to number of updates should be close to 18 months (Moore’s Law)
• The larger the value, the higher the hardware-software incompatibility risk
Software-Hardware Incompatibility Issues
• Example:
Product Lifetime (L): 5 years
Number of Hardware-Compatible Software Updates in Lifetime
Software   Mean   Standard Deviation
A          2      0.4
B          2      0.5
C          2      0.2
D          5      1
E          8      1

$$\text{Hardware–Software Gap} = \frac{1}{n} \sum \frac{L}{f(S_u)}$$

L = Expected lifetime of the product
f(Su) = The instantaneous value of updates, based on the mean and standard deviation of the number of updates done in the lifetime of the product (or the current lifetime), for each software
n = Number of software applications included in the sum (5 in this example)
Software-Hardware Incompatibility Issues
Product Lifetime: 5 years
Number of Hardware-Compatible Software Updates in Lifetime
Software   Mean   Standard Deviation
A          2      0.4
B          2      0.5
C          2      0.2
D          5      1
E          8      1
• 10,000 Monte Carlo iterations to obtain estimated gap:
Hardware-Software Update Gap
Target value: 1.5 (18 months)
Estimated Gap (5% Probability): 1.61
Fail, because update gap is higher than target value
Hardware-Software Update Gap, with A at 3 updates mean
Target value: 1.5 (18 months)
Estimated Gap (5% Probability): 1.48
Pass, because update gap is lower than target value
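
A minimal sketch of this estimate follows. It assumes the gap is the lifetime-to-updates ratio averaged over the five applications, that each update count is an independent normal draw, and that draws are floored at a small positive value to avoid division by zero. The slides do not spell out the sampling scheme, so these are assumptions, as are the function names. Under them, the 5% quantile should land in the neighborhood of the slide's 1.61 and 1.48, though exact agreement is not guaranteed.

```python
import random

LIFETIME_YEARS = 5.0
# (mean, standard deviation) of update counts per application, from the slide.
UPDATES = {"A": (2, 0.4), "B": (2, 0.5), "C": (2, 0.2), "D": (5, 1), "E": (8, 1)}
MIN_UPDATES = 0.25  # ASSUMPTION: floor on sampled update counts, not from the slides

def gap_quantile(updates, q=0.05, n_runs=10_000, seed=0):
    """Quantile q of the average lifetime-to-updates ratio (years per update)."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_runs):
        ratios = [LIFETIME_YEARS / max(rng.gauss(mean, std), MIN_UPDATES)
                  for mean, std in updates.values()]
        gaps.append(sum(ratios) / len(ratios))
    gaps.sort()
    return gaps[int(q * (len(gaps) - 1))]

TARGET_YEARS = 1.5  # 18 months
baseline = gap_quantile(UPDATES)
with_a_at_3 = gap_quantile({**UPDATES, "A": (3, 0.4)})
for label, gap in (("baseline", baseline), ("A at 3 updates", with_a_at_3)):
    verdict = "pass" if gap < TARGET_YEARS else "fail"
    print(f"{label}: gap {gap:.2f} years vs. target {TARGET_YEARS} -> {verdict}")
```

Raising the mean update count of any one application in `UPDATES` shows how much each contributes to closing the gap, which is one way to prioritize which software to update more often.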
Conclusion
• Designing for IoT requires a significant paradigm shift:
• Move from deterministic to probabilistic design to account for uncertainty
• Hardware-Software tightly coupled – require integrated design
• Requires extensive tradeoff analysis to maximize reliability instead of designing for fixed targets
• Build reliability monitoring capabilities into the IoT devices themselves