Failures associated with HCI
Transcript of Failures associated with HCI
IFIP Siena
Failures associated with HCI
But its not the users fault!
Brendan Murphy
MSR Cambridge
IFIP Siena
Talk contents
How do other technologies address this
issue.
Computer behaviour influenced by
humans or technology?
Why HCI is so difficult.
Summary.
IFIP Siena
Complex Systems
Railways
Liverpool & Manchester Railway opened
15th Sep 1830 (Stevenson’s Rocket)
Rapid development, no controls, many
deaths.
E.g. Brunel and Babbage trains narrowly
escaped crashing into each other
Railway Inspectors set up 1840’s
IFIP Siena
Technology Challenges
Train movements based on Time
Leaves Bristol
12 noon
Leaves London
2:10pm
Single Track
London-Bristol 2 hours
What’s the problem?
Solution: Change Time
November 1840 Great Western moved to GMT
Sep 1847 all Railways recommended to use GMT
UK set clocks to GMT 1880
US Standardized time in November 1883
IFIP Siena
Improvements in railway safety
Improving the quality of components that makeup the system
trains, carriages, signalling, tracks etc.
Focus on human elementTraining for all personnel.
Training certification
Railway inspectors review total system after acrash
Analysing the interaction of components
Analysing whether external factors impact accident.
Make mandatory recommendations.
IFIP Siena
Current status of trains
Trains are now very safe
Death Rates in UK >0.5 deaths per 1
billion miles.
But are they inflexible and
expensive to build?
Movement from trains to cars
Death Rates in UK 3,500 deaths per 1
billion miles.
IFIP Siena
Traditional Engineering approach to
safety
Use of models based on
theory verified through
historical validation.
Standardization &
Certification.
Making the problem fit the
solution.
Very conservative,
reluctant to accept new
ideas/implementations.
IFIP SienaProduct BehaviourVAX Behavioural Drivers
Cause Of System Crashes
0
20
40
60
80
100
1985 1994
Changes Over Time
Pe
rce
nta
ge
O
f C
ras
he
s
Others
System Management
Software Failure
Hardware Failure
Impact on Availability?Impact on Availability?
IFIP Siena
Reliability vs. Maintenance EventsDifferentiation by Day?
Distribution of System Crashes
0%
2%
4%
6%
8%
10%
12%
14%
16%
18%
20%
Monday
Tuesday
Wed
nesday
Thursday
Friday
Sat
urday
Sunday
Pe
rce
nta
ge
o
f fi
lte
red
e
ve
nts
NT 4
Win 2k
Win 2k
SP1
Distribution Of System Outages
VAX/VMS
0%
2%
4%
6%
8%
10%
12%
14%
16%
18%
20%
Monday
Tuesday
Wednesday
Thursday
Friday
Saturday
Sunday
Perc
en
tag
e
System
OutagesSystem
Crashes
VAX Servers: ©FTSC 1999 Madison, Murphy, Davies, Compaq.
IFIP Siena
Reliability vs. Maintenance EventsDifferentiation by Day?
Distribution Of System OutagesOpen VMS VAX
0%
1%
2%
3%
4%
5%
6%
7%
8%
9%
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Time Of Day
Monday - Friday
Pe
rce
nta
ge
System
Outages
System
Crashes
Distribution Of System Outages
Windows 2000
0%
1%
2%
3%
4%
5%
6%
7%
8%
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Monday - Friday
Perc
en
tag
e
System Outages
System Crashes
VAX Servers: ©FTSC 1999 Madison, Murphy, Davies, Compaq.
IFIP Siena
System Instability on single servers
A system crash may induce subsequent crashes.
Behaviour first characterized in the early 1980s.
Crash Distribution on VAX systems
0%
5%
10%
15%
20%
25%
0 15 30 45 60 75 90 105
120
135
150
165
180
195
Minutes between crashes
Dis
trib
uti
on
VAX Crashes
timeSystem
Down
Up
Event Clusters
Crash Distribution on VAX and NT systems
0%
5%
10%
15%
20%
25%
0 15 30 45 60 75 90 105
120
135
150
165
180
195
Minutes between crashes
Dis
trib
uti
on
VAX
Crashes
NT
Bluescreens
IFIP Siena
System Instability on distributed
systems
Configuration impacts dependability.
Affected by technical and human factors.Multi-node event clusters
Down
Up System 2
Down
Up System 3
System 1Down
Up
IFIP Siena
System Instability on distributed
systems
Configuration impacts dependability.
Affected by technical and human factors.Multi-node event clusters
Down
Up System 2
Down
Up System 3
System 1Down
Up
Average Downtime of a VMS cluster in Data Centers Environments
0
0.5
1
1.5
2
2.5
3
3.5
1 2 3 4 5 6
Number of Servers in a cluster
Ho
urs
IFIP Siena
Techniques for improving Software
Quality
Formal Specifications.
Use of strong type checking
languages.
Formal development
processes (e.g. SEI 5).
Use of software verification
tools.
User training.
IFIP Siena
Techniques for avoiding HCI errors
Fully understand the users specifications
Bound the product based on those
specifications.
Develop the UI based on User Modelling,
Intelligent User Interfaces, HCI design and
visualization, psychological studies.
Verify acceptability through trials with
target groups.
IFIP Siena
The problem: How to improve the
reliability of software.
Well understood
user profile
Hardware frozenChanging user needs
(home and business)
Infinite hardware options
New usages being
developed
Usage profile unknown
Flexible and adaptable
Rapid development
Media Center.
Standardized interface
Multi Functional
Evolving
Hardware limited but
regional variations.
IFIP Siena
What Microsoft does well
Standardized application interfaceFile always top left, help last in list.
Print always under file etc.
Short cut keys common across applications.
Common programmable interface.
Do not force users to new UI (Classic view).
Installation of OS and products for the homeuser.
Still searching for the optimum solution for installingproducts across multiple systems (same as the restof the industry).
IFIP Siena
Major causes of HCI induced failures
Not following correct management
procedures.
Not correctly configuring their systems.
Not patching their systems.
Using non compliant hardware/software.
Viruses due to non patched and wrongly
configured systems.
IFIP Siena
Bug collection and patch distribution
system an example of HCI problems.
Develop process to collected Windows XPfailures (i.e. to fix problems we create).
Base process on Windows 2000 failure ratesand characteristics.
Initial results of deployment
Failure rates a factor of 10 greater than forecast(HCI failures become prominent).
Failure characteristics different to Windows 2000(lack of complex software bugs).
Most failures not due to Microsoft code.
But still viewed as Windows problems.
IFIP Siena
Bug collection and patch distribution
Work with peripheral device manufacturers to resolveerrors.
Correct all Microsoft bugs and release patches.
Develop Windows Update process as a user invokedservice for privacy reasons.
ProblemsPeripheral device manufacturers assume users know theyalways need the latest driver.
Users blames Microsoft not the device manufacturers.
Users had better things to do than invoke Windows Update.
Not all patches had to go to all users.
The frequency of patch production overwhelmed business users.
Patches were too large for users with 56k modems.
IFIP Siena
Then security issues came along.
Most security issues identified by securityresearchers or firms.
Provides free publicity to the researcher or firm.
Pressure from them to publish.
Obvious way to resolve these issues.Come up with patch for failure and other risk areas.
Test and release patch through Update ASAP.
ResultHackers reverse engineer patch.
Virus only start appearing after patch is released.
Users not updating systems.
IFIP Siena
Solution for Windows XP
Windows update is a push rather than pull process.
Patches are as small as possible
Patches release in monthly batches.
Not all patches are released
Distribute uncommon patches via crash reporting process.
Develop a process for business for patch distribution.
Windows XP SP2 will automatically configure the systemfor security.
May break some applications (e.g. Kazaa).
Architectural changes in Longhorn to assist in patchdistribution.
IFIP Siena
Summary.
Perfect Reliability of complex systems is probably impossible.
Normal accidents, Charles Perrow
The computer is a component of the complex systems.
Becoming smaller all the time.
The system is the interaction of humans and technology.
Wizards and usability have to replace the need for training.
Computer software rarely has a single objective.
Can wizards and usability address near infinite flexibility.
Reliability is one of many system attributes.
The problem should define the relative importance of reliability.
Higher reliability inevitably decreases product flexibility.
Is research focusing on the problem?