The e-risks of e-commerce Professor Ken Birman Dept. of Computer Science Cornell University.

the e-risks of e-commerce

Professor Ken Birman

Dept. of Computer Science

Cornell University

Reliability

If it stops ticking when it takes a licking… your e-commerce company could tank

So you need to know that your technology base is reliable

It does what it should do, does it when needed, does it correctly, and is accessible to your customers.

A Quiz

Q: When and why did Sun Microsystems have a losing quarter?

Ken Birman:

Mr. Birman,

Sun experienced a loss in Q4FY89 (June 1989). This was the quarter in which we transitioned to a new manufacturing, order processing and inventory control systems.

Andrew Casey

Manager, Investor Relations

Sun Microsystems, Inc.

(650) [email protected]

Typical Web Session

[Figure: the browser issues "get http://www.cs.cornell.edu/People/ken" through a firewall. The URL says where to go and what to fetch.]

[Figure: behind that request sit DNS root, node and leaf servers, caching proxies, a load-balancing proxy, web servers and firewalls. Resolving “www.cs.cornell.edu” yields IP address 128.64.31.77.]
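The "where" and "what" labels on the request can be illustrated with a minimal sketch, assuming Python's standard urllib.parse module. The helper name where_and_what is hypothetical and not part of the lecture. It only splits the URL; the DNS lookup of the "where" part is a separate step that this sketch does not perform.

```python
from typing import Tuple
from urllib.parse import urlsplit


def where_and_what(url: str) -> Tuple[str, str]:
    """Split a URL into the host to contact ("where") and the path to fetch ("what")."""
    parts = urlsplit(url)
    where = parts.hostname or ""   # the name that DNS must resolve
    what = parts.path or "/"       # the document requested from that host
    return where, what


where, what = where_and_what("http://www.cs.cornell.edu/People/ken")
print("where:", where)
print("what:", what)
```

Every box in the session diagram (DNS, proxies, firewalls, servers) serves one of these two halves, which is why a failure in any of them can break the request.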

The Web’s dark side

Netscape error: web server www.cs.cornell.edu ... not responding. Server may have crashed or is overloaded. [OK]

Right URL, but the request times out. Why?

The web server could be down

Your network connection may have failed

There could be a problem in the “DNS”

There could be a network routing problem

The Internet may be experiencing an overload

Your web caching proxy may be down

Your PC might have a problem, or your version of Netscape (or Explorer), or the file system you are using, or your LAN

The URL itself may be wrong

A router or network link may have failed and the Internet may not yet have rerouted around the problem
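A client program sees far less than this list suggests. The sketch below, assuming Python's standard socket module, shows how coarse the evidence is: a handful of exception types must stand in for all of the causes above. The function names likely_cause and probe are hypothetical, and the wording of each diagnosis is an illustration, not a reliable classifier.

```python
import socket


def likely_cause(error: Exception) -> str:
    """Map a low-level network error to the broad family of causes it hints at."""
    if isinstance(error, socket.gaierror):
        return "DNS problem: the host name could not be resolved"
    if isinstance(error, (socket.timeout, TimeoutError)):
        # A timeout cannot tell a crashed server from an overloaded one,
        # a routing failure, or a congested Internet.
        return "timeout: server, routing, or overload problem"
    if isinstance(error, ConnectionRefusedError):
        return "server problem: host reachable but nothing is listening"
    if isinstance(error, OSError):
        return "local problem: network connection, LAN, or PC"
    return "unknown"


def probe(host: str, port: int = 80, timeout_s: float = 5.0) -> str:
    """Try to open a TCP connection and report what the outcome suggests."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "reachable"
    except OSError as error:
        return likely_cause(error)
```

Calling probe("www.cs.cornell.edu") performs a real network connection, so its result depends on the state of the network at that moment.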

E-Trade computers crash again -- and again

The computer system of online securities firm E-Trade crashed on Friday for the third consecutive day. "It was just a software glitch. I think we were all frustrated by it," says an E-Trade executive. Industry analyst James Mark of Deutsche Bank commented “…it's the application on a large scale. As soon as E-Trade's volumes started spiking up, they had the same problems as others…."

Edupage Editors <[email protected]> Sun, 07 Feb 1999 10:28:30 -0500

Reliable Distributed Computing:

Increasingly urgent, yet unsolved

Distributed computing has swept the world

Impact has become revolutionary

Vast wave of applications migrating to networks

Already as critical a national infrastructure as water, electricity, or telephones

Yet distributed systems remain:

Unreliable, prone to inexplicable outages

Insecure, easily attacked

Difficult (and costly) to program, bug-prone

A National Imperative

Potential for catastrophe cited by:

Presidential Commission on Critical Infrastructure Protection (PCCIP)

National Academy of Sciences Study on Trust in Cyberspace

These experts warn that we need a quantum improvement in technologies

Meanwhile, your e-commerce venture is at grave risk of stumbling, just like many others

A Business Imperative

E-projects Often Fail

e-commerce revolves around computing

Even business and marketing people are at the mercy of these systems

When your company’s computing systems aren’t running, you’re out of business

Big and Little Pictures

It is too easy to understand “reliability” as a narrow technical issue

In fact, many systems and companies stumble by building unreliable technologies because of a mixture of poor management and poor technical judgment

Reliable systems demand a balance between good management and good technology

A Glimpse of Some “Unreliable Systems”

Quick review of some failed projects

These were characterized by poor reliability of the final product

But the issues were not really technical

As future managers you need to understand this phenomenon!

Tales from the Software Crypt

NYC Control of 10,000 Traffic Lights

Univac, based on experience in Baltimore and Toronto

Started in late 1960’s

Scrapped 2-3 years later

Spent: ?

Second system effect: new radio control system, new software, algorithms

Earlier systems were 100x smaller

Incommensurate scaling

California Dept. of Motor Vehicles

Vehicle registration and drivers licenses

Started in 1987

Scrapped 1994

Spent: $44M

Underestimated cost by a factor of 3

Slower than 1965 system

Governor fired the whistleblower

DMV blames Tandem, Tandem blames DMV

United Airlines/UNIVAC

Automated reservations, ticketing, flight schedules, fuel delivery, kitchens, and general administration

Started in late 1960’s

Scrapped early 1970’s

Spent: $50M

Second system effect: tried to automate everything, including kitchen sink

Ditto: Burroughs/TWA

Delta currently planning to build something similar. But they will use the web. “Magic bullet” concept…

CONFIRM

Hilton, Marriott, Budget, American Airlines

Hotel res., links to Wizard and Sabre

Started: 1988

Scrapped: 1992

Spent: $125M

Second system

Very dull tools (machine language)

Bad-news diode

See CACM October 1994 for details

Today uses web, works well

Source: Jerry Saltzer, Keynote address, SOSP 1999

Tales from the Software Crypt

SACSS (California)

State-wide system for automated child support tracking

Started 1991 ($99M)

“On hold” 1997

Spent: $300M

Lockheed and HWDC disagree on what the system contains and which part of it isn’t working

“Departments shouldn’t deploy a system to additional users if it is not working”

Taurus (British Stock Exchange)

Replacement for British Stock Exchange

Started 1980’s

Scrapped 1993

Spent $600M

“massive complexity of the back-end settlement systems…”

Delays and cost overruns

IBM Workplace OS for PC

Mach 3.0 + binary compatibility with Pink, AIX, DOS, OS/400 + new clock mgt. + new RPC + new I/O + new CPU

Started in 1991

Scrapped 1996

Spent: $2B

400 staff on kernel, 1500 elsewhere

“sheer complexity of the class structure proved to be overwhelming”

Even question of how to represent numbers wasn’t settled

Early design choices and compatibility decisions doomed the project

Advanced Automation System (AAS)

Replacement for “en route” air traffic control system

Started 1982

Scrapped 1994

Spent more than $6B

Management misestimated size and length of project

Project goals constantly changed

Poor technology choices

Run by gov’t. bureaucrats

Source: Jerry Saltzer, Keynote address, SOSP 1999

1995 Standish Group Study

“Success” 20%: on time, on budget, on function

“Challenged” 50%: over budget, missed schedule, lacks functions (2x budget, 2x completion time, 2/3 planned functionality)

“Impaired” 30%: scrapped


A strange picture

Many technology projects fail, for lots of reasons

But some succeed

Today we do web-based hotel reservations all the time, yet “Confirm” failed

French air traffic project was a success, yet US project lost $6 billion

Is there a pattern?

Recurring Problems

Incommensurate scaling

Too many ideas

Mythical man-month

Bad ideas included

Modularity is hard

Bad-news diode

Best people are far more productive than average employees

New is better, not-even-available yet is best

Magic bullet syndrome


1995 Study of Tandem Computer Systems

77% of failures are software problems.

Software fault-tolerance techniques can overcome about 75% of detected faults.

Loose coupling between primary and backup is important for software fault tolerance.

Over two-thirds (72%) of measured software failures are recurrences of previously reported faults.


A Buggy Aside

Q: What are the two main categories of software bugs called?

A: Bohrbugs and Heisenbugs

Q: Why?

Bohr Model of Atom

Bohr argued that the nucleus was a little ball

Bohr bug is a nasty but well defined thing

Your technical people can reproduce it, so they can nail it

Heisenbug

Heisenberg modeled atom as a cloud of electrons and a cloud-like nucleus

The closer you look, the more it wiggles

A Heisenbug moves when your people try to pin it down. They won’t find it easy to fix.

Why?

Bohrbugs tend to be deterministic errors – outright mistakes in the code

Once you understand what triggers them they are easy to search for and fix

Heisenbugs are often delayed side-effects of an old error. Like a bad tank of gas, effect may happen long after the bug first “occurs”. Hard to fix because at the time the mistake happened, nothing obvious went wrong
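The distinction can be made concrete with a minimal sketch, using the slide's own definitions. Both examples are hypothetical and invented for illustration: average fails every time, at the faulty line, while PriceCache accepts a bad value silently and fails only later, somewhere else.

```python
def average(values):
    # Bohrbug: an outright mistake. An empty list always fails, right here,
    # so the bug is easy to reproduce and to fix.
    return sum(values) / len(values)


class PriceCache:
    """Stores prices as text; the error is made at store time, felt at total time."""

    def __init__(self):
        self.prices = {}

    def store(self, item, price_text):
        # The old error: a malformed price is accepted without validation.
        # Nothing obvious goes wrong at this point.
        self.prices[item] = price_text

    def total(self, items):
        # The delayed side-effect: the failure surfaces only when the bad
        # entry is finally used, possibly long after it was stored.
        return sum(float(self.prices[item]) for item in items)


cache = PriceCache()
cache.store("book", "12.50")
cache.store("lamp", "3O.00")   # letter O instead of zero: the "bad tank of gas"

try:
    cache.total(["book", "lamp"])
except ValueError as error:
    print("failure far from its cause:", error)
```

The stack trace from total points at the conversion, not at the store call that let the bad value in, which is what makes this family of bugs hard to pin down.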

Why Systems fail

Mostly, because something crashes

Usually, software or a human error

Mean time to failure improves with age but software problems remain prevalent

Every kind of software system is prone to failures.

Failure to plan for failures is the most common way for e-systems to fail.

E-reliability

We want e-commerce solutions to be reliable… but what should this mean? Fault-tolerant? Secure? Fast enough? Accessible to customers?

Deliver critical services when needed, where needed, in a correct, timely manner

Costs of a Failure

Minimizing Downtime

Idea is to design critical parts of your system to survive failures

Two basic approaches:

Recoverable systems are designed to restart without human intervention, but may wait until outage is repaired

Highly available systems are designed to keep running during failure

Recoverability

The technology is called “transactions”

We’ll discuss this next time, but…

Main issue is time needed to restart the service

For a large database, half an hour or more is not at all unusual

Faster restart requires a “warm standby”
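As a preview of what "transactions" buy you, here is a minimal sketch assuming Python's built-in sqlite3 module and a made-up accounts table. It shows only the all-or-nothing property that makes automatic recovery possible: a failure halfway through leaves no partial update behind. It does not model restart time, which is the issue this slide raises.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move money between two accounts as one all-or-nothing transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )
        row = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if row is None or row[0] < 0:
            # Simulated failure in the middle of the work.
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )


def balances(conn):
    """Return the current balances as a dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

transfer(conn, "alice", "bob", 30)       # succeeds and is committed
try:
    transfer(conn, "alice", "bob", 500)  # fails midway and is rolled back
except ValueError:
    pass
print(balances(conn))
```

After the failed transfer the debit from the first UPDATE is undone, so the balances are the same as if the second transfer had never been attempted.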

High Availability

Idea is to have a way to keep the system running even while some parts are crashed

For example, a backup that takes over if primary fails

Backup is kept “warm”

This involves replicating information

As changes occur, backup may lag behind
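The lag matters at the moment of failover. The sketch below is a deliberately simplified, in-memory model with hypothetical class names (Replica, PrimaryBackupPair); a real system would ship updates over a network. It shows that whatever the backup had not yet received when the primary failed is lost.

```python
class Replica:
    """Holds a copy of the service state as a key/value map."""

    def __init__(self):
        self.state = {}

    def apply(self, update):
        key, value = update
        self.state[key] = value


class PrimaryBackupPair:
    def __init__(self):
        self.primary = Replica()
        self.backup = Replica()
        self.pending = []  # updates the backup has not applied yet: the lag

    def write(self, key, value):
        """Apply a change at the primary and queue it for the backup."""
        self.primary.apply((key, value))
        self.pending.append((key, value))

    def replicate(self, max_updates=None):
        """Ship up to max_updates queued updates to the backup (all if None)."""
        count = len(self.pending) if max_updates is None else max_updates
        batch, self.pending = self.pending[:count], self.pending[count:]
        for update in batch:
            self.backup.apply(update)

    def fail_over(self):
        """Primary has crashed: promote the backup and report what was lost."""
        lost = self.pending
        self.primary, self.backup, self.pending = self.backup, Replica(), []
        return lost


pair = PrimaryBackupPair()
pair.write("order-1", "paid")
pair.write("order-2", "paid")
pair.replicate(max_updates=1)   # backup is now one update behind
lost = pair.fail_over()
print("state after failover:", pair.primary.state)
print("lost updates:", lost)
```

Keeping the backup fully in step would remove the loss, at the price of making every write wait for the backup.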

Complexity

The looming threat to your e-commerce solution, no matter what it may be

Even simple systems are hard to make reliable

Complex systems are almost impossible to make reliable

Yet innovative e-commerce projects often require fairly complex technologies!

Two Side-by-Side Case Studies

American Advanced Automation System

Intended as replacement for air traffic control system

Needed because Pres. Reagan fired many controllers in 1981

But project was a fiasco, lost $6B

French Phidias System

Similar goals, slightly less ambitious

But rolled out, on time and on budget, in 1999

Background

Air traffic control systems are using 1970’s technology

Extremely costly to maintain and impossible to upgrade

Meanwhile, load on controllers is rising steadily

Can’t easily reduce load

Air Traffic Control system (one site)

[Figure: one site, with components labeled Team of Controllers, Air Traffic Database (flight plans, etc.), X.500 Directory, Radar, Onboard]

Politics

Government wanted to upgrade the whole thing, solve a nagging problem

Controllers demanded various simplifications and powerful new tools

Everyone assumed that what you use at home can be adapted to the demands of an air traffic control center

Technology

IBM bid the project, proposed to use its own workstations

These aren’t super reliable, so they proposed to adapt a new approach to “fault-tolerance”

Idea is to plan for failure

Detect failures when they occur

Automatically switch to backups
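The detect-and-switch idea can be sketched with a heartbeat timeout. This is a generic illustration, not IBM's design, and the class name FailoverMonitor is hypothetical. Time is passed in explicitly so the behavior is easy to follow and to test.

```python
class FailoverMonitor:
    """Declares the primary failed if no heartbeat arrives within timeout_s."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_heartbeat = 0.0
        self.active = "primary"

    def heartbeat(self, now):
        """Record that the primary reported in at time `now` (in seconds)."""
        self.last_heartbeat = now

    def check(self, now):
        """Return which machine should be serving at time `now`."""
        silent_for = now - self.last_heartbeat
        if self.active == "primary" and silent_for > self.timeout_s:
            self.active = "backup"  # automatic switch, no human involved
        return self.active


monitor = FailoverMonitor(timeout_s=2.0)
monitor.heartbeat(now=0.0)
print(monitor.check(now=1.5))   # primary still considered alive
print(monitor.check(now=3.0))   # silence exceeded the timeout
```

A timeout cannot distinguish a crashed primary from one that is merely slow or cut off by the network, so a short timeout risks switching when nothing has failed and a long one delays the switch.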

Core Technical Issue?

Problem revolves around high availability

Waiting for restart not seen as an option: goal is 10sec downtime in 10 years

So IBM proposed a replication scheme much like the “load balancing” approach

IBM had primary and backup simply do the same work, keeping them in the same state
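To see why waiting for a restart was ruled out, it helps to turn the goal into numbers. The calculation below is a small worked example; the 365.25-day year and the function name availability are assumptions made for the arithmetic, and the half-hour figure is the database restart time mentioned on the Recoverability slide.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed average year length


def availability(downtime_s, period_years):
    """Fraction of the period during which the system is up."""
    return 1.0 - downtime_s / (period_years * SECONDS_PER_YEAR)


goal = availability(downtime_s=10, period_years=10)

# One half-hour database restart, expressed in units of the 10-second budget.
restart_s = 30 * 60
restarts_in_budget_units = restart_s / 10

print(f"required availability: {goal:.9f}")
print(f"one restart uses {restarts_in_budget_units:.0f}x the ten-year budget")
```

A single half-hour restart would spend 180 times the entire ten-year downtime allowance, so only an approach that keeps running through failures could meet the goal.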

Technology

[Figure: conceptual flow of system: radar, find tracks, identify flight, lookup record, plan actions, human action]

[Figure: IBM’s fault-tolerant process pair concept: the same flow drawn twice, side by side (radar, find tracks, identify flight, lookup record, plan actions, human action)]

Why is this Hard?

The system has many “real-time” constraints on it

Actions need to occur promptly

Even if something fails, we want the human controller to continue to see updates

IBM’s technology

Based on a research paper by Flaviu Cristian

But had never been used except for proof of concept purposes, on a small scale in the laboratory

Politics

IBM’s proposal sounded good…

… and they were the second lowest bidder

… and they had the most aggressive schedule

So the FAA selected them over alternatives

IBM took on the whole thing all at once

Disaster Strikes

Immediate confusion: all parts of the system seemed interdependent

To design part A I need to know how part B, also being designed, will work

Controllers didn’t like early proposals and insisted on major changes to design

Fault-tolerance idea was one of the reasons IBM was picked, but made the system so complex that it went on the back burner

Summary of Simplifications

Focus on some core components

Postpone worry about fault-tolerance until later

Try to build a simple version that can be fleshed out later

… but the simplification wasn’t enough.

Too many players kept intruding with requirements

Crash and Burn

The technical guys saw it coming

Probably as early as one year into the effort

But they kept it secret (“bad news diode”)

Anyhow, management wasn’t listening (“they’ve heard it all before: whining engineers!”)

The fault-tolerance scheme didn’t work

Many technical issues unresolved

The FAA kept out of the technical issues

But a mixture of changing specifications and serious technical issues was at the root of the problems

What came out?

In the USA, nothing.

The entire system was useless: the technology was of an all-or-nothing style and nothing was ready to deploy

British later rolled out a very limited version of a similar technology, late, with many bugs, but it does work…

Contrast with French

They took a very incremental approach

Early design sought to cut back as much as possible

If it isn’t “mandatory” don’t do it yet

Focus was on console cluster architecture and fault-tolerance

They insisted on using off-the-shelf technology

Contrast with French

Managers intervened in technology choices

For example, the vendor wanted to do a home-brew fault-tolerance technology

French insisted on a specific existing technology and refused to bid out the work until vendors accepted

A critical “good call” as it worked out

Learning by Doing

To gain experience with technology:

They tested, and tested, and tested

Designed simple prototypes and played with them

Discovered that large cluster would perform poorly

But found a “sweet spot” and worked within it

This forced project to cut back on some goals

Testing

9/10th of time and expense on any system is in:

Testing

Debugging

Integration

Many projects overlook this

French planned conservatively

Software Bugs

Figure on bugs in 1/10 lines of new code

But as many as 1/250 lines in old code

Bugs show up under stress

Trick is to run a system in an unstressed mode

French identified “stress points” and designed to steer far from them

Their design also assumed that components would fail and automated the restart
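Automated restart can be sketched as a small supervisor loop. This is a generic illustration, not the Phidias design, and the names supervise and make_flaky_doubler are hypothetical. The component here is built to crash on its third call, standing in for a bug that shows up as state accumulates. The sketch assumes that redoing a work item after a restart is harmless.

```python
def supervise(make_component, work_items, max_restarts=3):
    """Run work through a component, replacing it with a fresh one when it fails."""
    component = make_component()
    restarts = 0
    results = []
    for item in work_items:
        while True:
            try:
                results.append(component(item))
                break
            except Exception:
                if restarts == max_restarts:
                    raise  # give up and let a human look at it
                restarts += 1
                component = make_component()  # restart with clean state
    return results, restarts


def make_flaky_doubler():
    """Build a component that doubles its input but crashes on its third call."""
    calls = 0

    def component(x):
        nonlocal calls
        calls += 1
        if calls == 3:
            raise RuntimeError("component crashed")
        return 2 * x

    return component


results, restarts = supervise(make_flaky_doubler, [1, 2, 3, 4, 5])
print(results, restarts)
```

Failure is treated as a routine event: the caller gets every result, and the restart count is the only trace that anything went wrong.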

All of this worked!

Take-aways from French project?

Complex technical issues at the core of the system

But they managed to break big project into pieces

Do the critical core first, separately, and focus exclusively on it

Test, test, test

Don’t build anything you can possibly buy

Management was technically sophisticated enough to make some critical “calls”

Your Problem

e-commerce systems are at e-risk

These e-risks take many forms:

System complexity

Failure to plan for failures

Poor project management

Ignore this at our peril, as we’ve seen

But how can we learn to do better?

Keys to Reliability

Know the basic technologies

Realize that software is buggy and failures will happen.

Design to treat failure as a mundane event

Failure to plan for failure is the biggest e-risk!

Complexity is a huge threat. Use your naiveté as an advantage: if you can’t understand it, why assume that “they” can understand it?

E-commerce Technologies

The network and associated services

Databases

Web servers

“Scripts”: the glue your people use to tie it all together

Next Lecture

Look at some realistic e-commerce systems

Ask ourselves where to start first, if we need to convince ourselves that the system will be reliable enough

Trick is to balance between system complexity and adequate risk coverage
