Failures associated with HCI

IFIP Siena. Failures associated with HCI: But it’s not the user’s fault! Brendan Murphy, MSR Cambridge.

Transcript of Failures associated with HCI

Page 1: Failures associated with HCI

IFIP Siena

Failures associated with HCI

But it’s not the user’s fault!

Brendan Murphy

MSR Cambridge

Page 2: Failures associated with HCI

Talk contents

How do other technologies address this issue?

Computer behaviour influenced by humans or technology?

Why HCI is so difficult.

Summary.

Page 3: Failures associated with HCI

Complex Systems

Railways

Liverpool & Manchester Railway opened 15th Sep 1830 (Stephenson’s Rocket).

Rapid development, no controls, many deaths.

E.g. trains carrying Brunel and Babbage narrowly escaped crashing into each other.

Railway Inspectors set up in the 1840s.

Page 4: Failures associated with HCI

Technology Challenges

Train movements based on Time

Leaves Bristol: 12 noon

Leaves London: 2:10pm

Single Track

London-Bristol: 2 hours

What’s the problem?

Solution: Change Time

November 1840: Great Western moved to GMT

Sep 1847: all railways recommended to use GMT

1880: UK set clocks to GMT

November 1883: US standardized time
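The timetable puzzle on this slide makes sense once each town is assumed to keep its own local mean time, which differs from GMT by four minutes per degree of longitude. The following is a minimal sketch of that arithmetic, not something taken from the talk. It assumes that "12 noon" is Bristol local time and "2:10pm" is London local time, and it uses rough longitudes (about 2.6° W for Bristol, about 0.1° W for London) that are supplied here for illustration. The names `LONGITUDE_DEG`, `local_offset` and `convert` are hypothetical.

```python
from datetime import datetime, timedelta

# Approximate longitudes in degrees east (west is negative).
# These values are assumptions for illustration, not taken from the slides.
LONGITUDE_DEG = {"Bristol": -2.6, "London": -0.1}

def local_offset(town: str) -> timedelta:
    """Local mean time minus GMT: 4 minutes per degree of longitude."""
    return timedelta(minutes=4 * LONGITUDE_DEG[town])

def convert(local: datetime, from_town: str, to_town: str) -> datetime:
    """Re-express a local mean time in another town's local mean time."""
    return local - local_offset(from_town) + local_offset(to_town)

# Train leaves Bristol at 12 noon Bristol time and takes 2 hours.
depart_bristol = datetime(1840, 6, 1, 12, 0)
arrive_bristol_clock = depart_bristol + timedelta(hours=2)
arrive_london_clock = convert(arrive_bristol_clock, "Bristol", "London")

print(arrive_bristol_clock.strftime("%H:%M"))  # 14:00 on a Bristol clock
print(arrive_london_clock.strftime("%H:%M"))   # 14:10 on a London clock
```

Under these assumptions, the train that left Bristol at noon reaches the London end at 2:10pm by London clocks, the same moment the London train is timetabled to enter the single track. That is one plausible reading of "What’s the problem?", and it shows why a single reference time removes the ambiguity.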

Page 5: Failures associated with HCI

Improvements in railway safety

Improving the quality of components that make up the system: trains, carriages, signalling, tracks etc.

Focus on human element

Training for all personnel.

Training certification

Railway inspectors review total system after a crash

Analysing the interaction of components

Analysing whether external factors contributed to the accident.

Make mandatory recommendations.

Page 6: Failures associated with HCI

Current status of trains

Trains are now very safe

Death Rates in UK >0.5 deaths per 1 billion miles.

But are they inflexible and expensive to build?

Movement from trains to cars

Death Rates in UK 3,500 deaths per 1 billion miles.

Page 7: Failures associated with HCI

Traditional Engineering approach to safety

Use of models based on theory verified through historical validation.

Standardization & Certification.

Making the problem fit the solution.

Very conservative, reluctant to accept new ideas/implementations.

Page 8: Failures associated with HCI

Product Behaviour

VAX Behavioural Drivers

Figure: Cause Of System Crashes, Changes Over Time (1985 vs. 1994). Y-axis: Percentage Of Crashes (0 to 100). Series: Hardware Failure, Software Failure, System Management, Others.

Impact on Availability?

Page 9: Failures associated with HCI

Reliability vs. Maintenance Events

Differentiation by Day?

Figure: Distribution of System Crashes by day of week (Monday to Sunday). Y-axis: percentage of filtered events (0% to 20%). Series: NT 4, Win 2k, Win 2k SP1.

Figure: Distribution Of System Outages, VAX/VMS, by day of week (Monday to Sunday). Y-axis: percentage (0% to 20%). Series: System Outages, System Crashes.

VAX Servers: ©FTSC 1999 Madison, Murphy, Davies, Compaq.

Page 10: Failures associated with HCI

Reliability vs. Maintenance Events

Differentiation by Day?

Figure: Distribution Of System Outages, OpenVMS VAX, Monday - Friday. X-axis: Time Of Day (hours 0 to 23). Y-axis: percentage (0% to 9%). Series: System Outages, System Crashes.

Figure: Distribution Of System Outages, Windows 2000, Monday - Friday. X-axis: time of day (hours 0 to 23). Y-axis: percentage (0% to 8%). Series: System Outages, System Crashes.

VAX Servers: ©FTSC 1999 Madison, Murphy, Davies, Compaq.
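Day-of-week and time-of-day profiles like those charted on these two slides can be derived from raw event timestamps by simple counting. The following is a minimal sketch, assuming the events are available as Python `datetime` values. The function name `distribution` and the sample data are hypothetical and are not taken from the studies cited.

```python
from collections import Counter
from datetime import datetime

def distribution(events, key, buckets):
    """Percentage of events per bucket, where key(event) selects the bucket."""
    counts = Counter(key(e) for e in events)
    total = len(events)
    return {b: (100.0 * counts[b] / total if total else 0.0) for b in buckets}

# Made-up sample: three events on Monday 5 July 2004, one on Saturday 10 July 2004.
sample = [
    datetime(2004, 7, 5, 9, 0),
    datetime(2004, 7, 5, 14, 30),
    datetime(2004, 7, 5, 14, 45),
    datetime(2004, 7, 10, 3, 0),
]

# weekday() returns 0 for Monday through 6 for Sunday.
by_day = distribution(sample, lambda e: e.weekday(), range(7))
by_hour = distribution(sample, lambda e: e.hour, range(24))

print(by_day[0], by_day[5])  # share on Monday and on Saturday
print(by_hour[14])           # share in the 14:00 hour
```

A profile that is flat across days and hours would suggest failures that are independent of human activity, while peaks during the working week or working hours point towards operator or usage effects. That interpretation is offered here as a reading of the slides, not as a result reported by them.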

Page 11: Failures associated with HCI

System Instability on single servers

A system crash may induce subsequent crashes.

Behaviour first characterized in the early 1980s.

Figure: Crash Distribution on VAX systems. X-axis: minutes between crashes (0 to 195, in 15-minute steps). Y-axis: distribution (0% to 25%). Series: VAX Crashes.

Figure: Event Clusters (system up/down state plotted against time).

Figure: Crash Distribution on VAX and NT systems. X-axis: minutes between crashes (0 to 195, in 15-minute steps). Y-axis: distribution (0% to 25%). Series: VAX Crashes, NT Bluescreens.
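The "minutes between crashes" histograms and the event-cluster diagram on this slide can be approximated from a list of crash times for a single server. The following is a minimal sketch. It assumes 15-minute bins to match the chart's axis ticks, and it uses a hypothetical rule that consecutive crashes less than `gap` apart belong to the same cluster. The 60-minute default and all names are illustrative choices, not values from the study.

```python
from collections import Counter
from datetime import datetime, timedelta

def interval_histogram(crashes, bin_minutes=15):
    """Percentage of inter-crash gaps per bin, keyed by bin start in minutes."""
    times = sorted(crashes)
    gaps = [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]
    if not gaps:
        return {}
    counts = Counter(int(g // bin_minutes) * bin_minutes for g in gaps)
    return {start: 100.0 * n / len(gaps) for start, n in sorted(counts.items())}

def event_clusters(crashes, gap=timedelta(minutes=60)):
    """Group crashes so that consecutive crashes closer than `gap` share a cluster."""
    clusters = []
    for t in sorted(crashes):
        if clusters and t - clusters[-1][-1] < gap:
            clusters[-1].append(t)   # close to the previous crash: same cluster
        else:
            clusters.append([t])     # otherwise start a new cluster
    return clusters

# Made-up sample: three crashes within 25 minutes, then one 30 hours later.
t0 = datetime(1999, 1, 4, 8, 0)
sample = [
    t0,
    t0 + timedelta(minutes=5),
    t0 + timedelta(minutes=25),
    t0 + timedelta(hours=30),
]
hist = interval_histogram(sample)
clusters = event_clusters(sample)
print([len(c) for c in clusters])  # sizes of the detected clusters
```

A histogram weighted heavily towards the first few bins is what the slide's claim, that one crash may induce subsequent crashes, would predict. The cluster threshold is a tuning choice and would need to be justified against real data.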

Page 12: Failures associated with HCI

System Instability on distributed systems

Configuration impacts dependability.

Affected by technical and human factors.

Figure: Multi-node event clusters (up/down timelines for System 1, System 2 and System 3).

Page 13: Failures associated with HCI

System Instability on distributed systems

Configuration impacts dependability.

Affected by technical and human factors.

Figure: Multi-node event clusters (up/down timelines for System 1, System 2 and System 3).

Figure: Average Downtime of a VMS cluster in Data Center Environments. X-axis: Number of Servers in a cluster (1 to 6). Y-axis: Hours (0 to 3.5).

Page 14: Failures associated with HCI

Techniques for improving Software Quality

Formal Specifications.

Use of languages with strong type checking.

Formal development processes (e.g. SEI 5).

Use of software verification tools.

User training.

Page 15: Failures associated with HCI

Techniques for avoiding HCI errors

Fully understand the user’s specifications.

Bound the product based on those specifications.

Develop the UI based on User Modelling, Intelligent User Interfaces, HCI design and visualization, psychological studies.

Verify acceptability through trials with target groups.

Page 16: Failures associated with HCI

The problem: How to improve the reliability of software.

Well understood user profile

Hardware frozen

Changing user needs (home and business)

Infinite hardware options

New usages being developed

Usage profile unknown

Flexible and adaptable

Rapid development

Media Center.

Standardized interface

Multi Functional

Evolving

Hardware limited but regional variations.

Page 17: Failures associated with HCI

What Microsoft does well

Standardized application interface

File always top left, help last in list.

Print always under file etc.

Shortcut keys common across applications.

Common programmable interface.

Do not force users to new UI (Classic view).

Installation of OS and products for the home user.

Still searching for the optimum solution for installing products across multiple systems (same as the rest of the industry).

Page 18: Failures associated with HCI

Major causes of HCI induced failures

Not following correct management procedures.

Not correctly configuring their systems.

Not patching their systems.

Using non-compliant hardware/software.

Viruses due to unpatched and wrongly configured systems.

Page 19: Failures associated with HCI

Bug collection and patch distribution system: an example of HCI problems.

Develop a process to collect Windows XP failures (i.e. to fix problems we create).

Base process on Windows 2000 failure rates and characteristics.

Initial results of deployment

Failure rates a factor of 10 greater than forecast (HCI failures become prominent).

Failure characteristics different to Windows 2000 (lack of complex software bugs).

Most failures not due to Microsoft code.

But still viewed as Windows problems.

Page 20: Failures associated with HCI

Bug collection and patch distribution

Work with peripheral device manufacturers to resolve errors.

Correct all Microsoft bugs and release patches.

Develop Windows Update process as a user-invoked service for privacy reasons.

Problems

Peripheral device manufacturers assume users know they always need the latest driver.

Users blame Microsoft, not the device manufacturers.

Users had better things to do than invoke Windows Update.

Not all patches had to go to all users.

The frequency of patch production overwhelmed business users.

Patches were too large for users with 56k modems.

Page 21: Failures associated with HCI

Then security issues came along.

Most security issues identified by security researchers or firms.

Provides free publicity to the researcher or firm.

Pressure from them to publish.

Obvious way to resolve these issues.

Come up with a patch for the failure and other risk areas.

Test and release patch through Update ASAP.

Result

Hackers reverse engineer the patch.

Viruses only start appearing after the patch is released.

Users not updating systems.

Page 22: Failures associated with HCI

Solution for Windows XP

Windows Update is a push rather than a pull process.

Patches are as small as possible.

Patches are released in monthly batches.

Not all patches are released.

Distribute uncommon patches via crash reporting process.

Develop a process for business for patch distribution.

Windows XP SP2 will automatically configure the system for security.

May break some applications (e.g. Kazaa).

Architectural changes in Longhorn to assist in patch distribution.

Page 23: Failures associated with HCI

Summary.

Perfect Reliability of complex systems is probably impossible.

Normal accidents, Charles Perrow

The computer is one component of a complex system.

Becoming smaller all the time.

The system is the interaction of humans and technology.

Wizards and usability have to replace the need for training.

Computer software rarely has a single objective.

Can wizards and usability address near-infinite flexibility?

Reliability is one of many system attributes.

The problem should define the relative importance of reliability.

Higher reliability inevitably decreases product flexibility.

Is research focusing on the problem?