The e-risks of e-commerce
Professor Ken Birman
Dept. of Computer Science
Cornell University
Reliability
If it stops ticking when it takes a licking… your e-commerce company could tank
So you need to know that your technology base is reliable
It does what it should do, does it when needed, does it correctly, and is accessible to your customers.
A Quiz
Q: When and why did Sun Microsystems have a losing quarter?
Ken Birman:
Mr. Birman,
Sun experienced a loss in Q4FY89 (June 1989). This was the quarter in which we transitioned to a new manufacturing, order processing and inventory control systems.
Andrew Casey
Manager, Investor Relations
Sun Microsystems, Inc.
(650) [email protected]
Typical Web Session
[Figure: a typical web session. The browser asks the DNS hierarchy (root, node, and leaf servers) to resolve “www.cs.cornell.edu” and receives the IP address 128.64.31.77. Other components shown: caching proxies, a load-balancing proxy, a firewall, and web servers.]
The Web’s dark side
Netscape error: web server www.cs.cornell.edu
... not responding. Server may have crashed or is overloaded.
OK
Right URL, but the request times out. Why?
The web server could be down
Your network connection may have failed
There could be a problem in the “DNS”
There could be a network routing problem
The Internet may be experiencing an overload
Your web caching proxy may be down
Your PC might have a problem, or your version of Netscape (or Explorer), or the file system you are using, or your LAN
The URL itself may be wrong
A router or network link may have failed and the Internet may not yet have rerouted around the problem
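A few of these causes can be told apart mechanically. The sketch below is only an illustration, not a tool from the lecture: the function name `diagnose` and its labels are made up here, and it separates just three stages (DNS lookup, connection timeout, connection refused) using Python's standard `socket` module. The `resolve` and `connect` parameters exist only so the stages can be replaced in tests.

```python
import socket


def diagnose(host, port=80, timeout=5.0,
             resolve=socket.gethostbyname,
             connect=socket.create_connection):
    """Return a short label for the first stage of a web request that fails.

    Hypothetical helper: the labels are illustrative, not a standard.
    """
    try:
        address = resolve(host)              # stage 1: DNS lookup
    except socket.gaierror:
        return "dns-failure"                 # name could not be resolved

    try:
        conn = connect((address, port), timeout=timeout)   # stage 2: TCP connect
    except socket.timeout:
        return "timed-out"                   # overload, routing problem, or dead server
    except ConnectionRefusedError:
        return "refused"                     # host answers, but no server is listening
    except OSError:
        return "network-error"               # e.g. no route to host, local link down

    conn.close()
    return "ok"
```

Calling `diagnose("www.cs.cornell.edu")` returns one of the labels above. A "timed-out" result still covers several items on the list (overloaded server, failed router, slow rerouting), which is exactly why the user-visible error message is so uninformative.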
E-Trade computers crash again -- and again
The computer system of online securities firm E-Trade crashed on Friday for the third consecutive day. "It was just a software glitch. I think we were all frustrated by it," says an E-Trade executive. Industry analyst James Mark of Deutsche Bank commented “…it's the application on a large scale. As soon as E-Trade's volumes started spiking up, they had the same problems as others….”
Edupage Editors <[email protected]> Sun, 07 Feb 1999 10:28:30 -0500
Reliable Distributed Computing:
Increasingly urgent, yet unsolved
Distributed computing has swept the world
  Impact has become revolutionary
  Vast wave of applications migrating to networks
  Already as critical a national infrastructure as water, electricity, or telephones
Yet distributed systems remain
  Unreliable, prone to inexplicable outages
  Insecure, easily attacked
  Difficult (and costly) to program, bug-prone
A National Imperative
Potential for catastrophe cited by
  Presidential Commission on Critical Infrastructure Protection (PCCIP)
  National Academy of Sciences Study on Trust in Cyberspace
These experts warn that we need a quantum improvement in technologies
Meanwhile, your e-commerce venture is at grave risk of stumbling – just like many others
E-projects Often Fail
e-commerce revolves around computing
Even business and marketing people are at the mercy of these systems
When your company’s computing systems aren’t running, you’re out of business
Big and Little Pictures
It is too easy to understand “reliability” as a narrow technical issue
In fact, many systems and companies stumble by building unreliable technologies, because of a mixture of poor management and poor technical judgment
Reliable systems demand a balance between good management and good technology
A Glimpse of Some “Unreliable Systems”
Quick review of some failed projects
These were characterized by poor reliability of the final product
But the issues were not really technical
As future managers you need to understand this phenomenon!
Tales from the Software Crypt
NYC Control of 10,000 Traffic Lights
Univac, based on experience in Baltimore and Toronto
Started in late 1960’s
Scrapped 2-3 years later
Spent: ?
Second system effect: new radio control system, new software, algorithms
Earlier systems were 100x smaller
Incommensurate scaling
California Dept. of Motor Vehicles
Vehicle registration and drivers licenses
Started in 1987
Scrapped 1994
Spent: $44M
Underestimated cost by a factor of 3
Slower than 1965 system
Governor fired the whistleblower
DMV blames Tandem, Tandem blames DMV
United Airlines/UNIVAC
Automated reservations, ticketing, flight schedules, fuel delivery, kitchens, and general administration
Started in late 1960’s
Scrapped early 1970’s
Spent: $50M
Second system effect: tried to automate everything, including kitchen sink
Ditto: Burroughs/TWA.
Delta currently planning to build something similar
But they will use the web. “Magic bullet” concept…
CONFIRM
Hilton, Marriott, Budget, American Airlines
Hotel res., links to Wizard and Sabre
Started: 1988
Scrapped: 1992
Spent: $125M
Second system
Very dull tools (machine language)
Bad-news diode
See CACM October 1994 for details
Today uses web, works well
Source: Jerry Saltzer, Keynote address, SOSP 1999
Tales from the Software Crypt
SACSS (California)
State-wide system for automated child support tracking
Started 1991 ($99M)
“On hold” 1997
Spent: $300M
Lockheed and HWDC disagree on what the system contains and which part of it isn’t working
“Departments shouldn’t deploy a system to additional users if it is not working”
Taurus (British Stock Exchange)
Replacement for British Stock Exchange
Started 1980’s
Scrapped 1993
Spent $600M
“massive complexity of the back-end settlement systems…”
Delays and cost overruns
IBM Workplace OS for PC
Mach 3.0 + binary compatibility with Pink, AIX, DOS, OS/400 + new clock mgt. + new RPC + new I/O + new CPU
Started in 1991
Scrapped 1996
Spent: $2B
400 staff on kernel, 1500 elsewhere
“sheer complexity of the class structure proved to be overwhelming”
Even question of how to represent numbers wasn’t settled
Early design choices and compatibility decisions doomed the project
Advanced Automation System (AAS)
Replacement for “en route” air traffic control system
Started 1982
Scrapped 1994
Spent more than $6B
Management misestimated size and length of project
Project goals constantly changed
Poor technology choices
Run by gov’t. bureaucrats
Source: Jerry Saltzer, Keynote address, SOSP 1999
1995 Standish Group Study
“Success” 20%: on time, on budget, on function
“Challenged” 50%: over budget, missed schedule, lacks functions (2x budget, 2x completion time, 2/3 planned functionality)
“Impaired” 30%: scrapped
Source: Jerry Saltzer, Keynote address, SOSP 1999
A strange picture
Many technology projects fail
  For lots of reasons
But some succeed
  Today we do web-based hotel reservations all the time, yet “Confirm” failed
  French air traffic project was a success yet US project lost $6 billion
Is there a pattern?
Recurring Problems
Incommensurate scaling
Too many ideas
Mythical man-month
Bad ideas included
Modularity is hard
Bad-news diode
Best people are far more productive than average employees
New is better, not-even-available yet is best
Magic bullet syndrome
Source: Jerry Saltzer, Keynote address, SOSP 1999
1995 Study of Tandem Computer Systems
77% of failures are software problems.
Software fault-tolerance techniques can overcome about 75% of detected faults.
Loose coupling between primary and backup is important for software fault tolerance.
Over two-thirds (72%) of measured software failures are recurrences of previously reported faults.
Source: Jerry Saltzer, Keynote address, SOSP 1999
A Buggy Aside
Q: What are the two main categories of software bugs called?
A: Bohrbugs and Heisenbugs
Q: Why?
Bohr Model of Atom
Bohr argued that the nucleus was a little ball
Bohr bug is a nasty but well defined thing
Your technical people can reproduce it, so they can nail it
Heisenbug
Heisenberg modeled atom as a cloud of electrons and a cloud-like nucleus
The closer you look, the more it wiggles
A Heisenbug moves when your people try and pin it down. They won’t find it easy to fix.
Why?
Bohrbugs tend to be deterministic errors – outright mistakes in the code
Once you understand what triggers them they are easy to search for and fix
Heisenbugs are often delayed side-effects of an old error. Like a bad tank of gas, effect may happen long after the bug first “occurs”. Hard to fix because at the time the mistake happened, nothing obvious went wrong
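A small sketch may make the contrast concrete. It is an illustration written for this transcript, not an example from the lecture: `average` and `OrderBook` are hypothetical names. Real Heisenbugs often depend on timing or memory layout as well; this deterministic toy shows only the gap between where the mistake is made and where the failure appears.

```python
def average(values):
    # Bohrbug: fails the same way every time the list is empty,
    # so it is easy to reproduce, locate, and fix.
    return sum(values) / len(values)


class OrderBook:
    """Toy order store used to show a delayed side-effect (hypothetical)."""

    def __init__(self):
        self.quantities = []

    def record(self, quantity):
        # The actual mistake happens here: a quantity taken from a web form
        # may arrive as text ("2") rather than a number, and is stored
        # unchecked. Nothing obvious goes wrong at this point.
        self.quantities.append(quantity)

    def total(self):
        # The failure surfaces here, possibly much later and far from
        # the line that caused it.
        return sum(self.quantities)


book = OrderBook()
book.record(3)
book.record("2")      # the "bad tank of gas": accepted silently
# book.total() would now raise TypeError, long after the bad record() call
```

The stack trace from `total()` points at the summation, not at the `record("2")` call that introduced the bad value, which is what makes this kind of bug hard to pin down.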
Why Systems fail
Mostly, because something crashes
  Usually, software or a human error
Mean time to failure improves with age but software problems remain prevalent
Every kind of software system is prone to failures.
Failure to plan for failures is the most common way for e-systems to fail.
E-reliability
We want e-commerce solutions to be reliable… but what should this mean? Fault-tolerant? Secure? Fast enough? Accessible to customers?
Deliver critical services when needed, where needed, in a correct, timely manner
Minimizing Downtime
Idea is to design critical parts of your system to survive failures
Two basic approaches
  Recoverable systems are designed to restart without human intervention – but may wait until outage is repaired
  Highly available systems are designed to keep running during failure
Recoverability
The technology is called “transactions”
We’ll discuss this next time, but…
Main issue is time needed to restart the service
For a large database, half an hour or more is not at all unusual
Faster restart requires a “warm standby”
High Availability
Idea is to have a way to keep the system running even while some parts are crashed
For example, a backup that takes over if primary fails
Backup is kept “warm”
  This involves replicating information
  As changes occur, backup may lag behind
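The lag matters because whatever the backup has not yet received is lost if the primary crashes. The following is a minimal sketch of that idea under simplifying assumptions (one primary, one backup, updates shipped in order, no network). The class and method names are invented for illustration and do not describe any particular product.

```python
from collections import deque


class Replica:
    """Holds one copy of the data as a simple key/value store."""

    def __init__(self):
        self.state = {}

    def apply(self, key, value):
        self.state[key] = value


class PrimaryBackup:
    """Hypothetical warm-standby pair: the backup trails the primary."""

    def __init__(self):
        self.primary = Replica()
        self.backup = Replica()
        self.pending = deque()      # updates the backup has not applied yet

    def write(self, key, value):
        self.primary.apply(key, value)
        self.pending.append((key, value))

    def ship(self, max_updates=1):
        """Forward up to max_updates queued changes to the backup."""
        for _ in range(min(max_updates, len(self.pending))):
            self.backup.apply(*self.pending.popleft())

    def lag(self):
        return len(self.pending)

    def fail_over(self):
        """Primary has crashed: promote the backup.

        Returns the updates that never reached the backup and are lost.
        """
        lost = list(self.pending)
        self.pending.clear()
        self.primary = self.backup
        self.backup = Replica()
        self.backup.state = dict(self.primary.state)   # fresh standby copy
        return lost
```

For example, after three writes and `ship(2)`, `lag()` is 1, and a `fail_over()` at that moment returns the single update that was still in flight. Shipping every update before acknowledging the write removes the loss but makes each write slower; that trade-off is the design choice behind "backup may lag behind".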
Complexity
The looming threat to your e-commerce solution, no matter what it may be
Even simple systems are hard to make reliable
Complex systems are almost impossible to make reliable
Yet innovative e-commerce projects often require fairly complex technologies!
Two Side-by-Side Case Studies
American Advanced Automation System
  Intended as replacement for air traffic control system
  Needed because Pres. Reagan fired many controllers in 1981
  But project was a fiasco, lost $6B
French Phidias System
  Similar goals, slightly less ambitious
  But rolled out, on time and on budget, in 1999
Background
Air traffic control systems are using 1970’s technology
Extremely costly to maintain and impossible to upgrade
Meanwhile, load on controllers is rising steadily
Can’t easily reduce load
Air Traffic Control system (one site)
[Figure: components at one air traffic control site, labeled: Team of Controllers, Air Traffic Database (flight plans, etc.), X.500 Directory, Radar, Onboard.]
Politics
Government wanted to upgrade the whole thing, solve a nagging problem
Controllers demanded various simplifications and powerful new tools
Everyone assumed that what you use at home can be adapted to the demands of an air traffic control center
Technology
IBM bid the project, proposed to use its own workstations
These aren’t super reliable, so they proposed to adapt a new approach to “fault-tolerance”
Idea is to plan for failure
  Detect failures when they occur
  Automatically switch to backups
Core Technical Issue?
Problem revolves around high availability
Waiting for restart not seen as an option: goal is 10 sec downtime in 10 years
So IBM proposed a replication scheme much like the “load balancing” approach
IBM had primary and backup simply do the same work, keeping them in the same state
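To see how demanding the downtime goal is, it helps to turn it into an availability figure. The arithmetic below is a back-of-the-envelope sketch. The function names are made up here, a year is taken as 365.25 days, and "nines" is the usual shorthand (five nines = 99.999% available).

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60   # about 31.6 million seconds


def unavailability(downtime_seconds, period_years):
    """Fraction of the period during which the system is down."""
    return downtime_seconds / (period_years * SECONDS_PER_YEAR)


def nines(fraction_down):
    """Number of nines of availability, e.g. 1e-5 down means 5 nines."""
    return -math.log10(fraction_down)


def downtime_per_year(availability):
    """Seconds of downtime per year allowed at a given availability."""
    return (1.0 - availability) * SECONDS_PER_YEAR


goal = unavailability(10, 10)               # 10 sec of downtime in 10 years
print(f"fraction of time down: {goal:.2e}")
print(f"nines of availability: {nines(goal):.1f}")
print(f"'five nines' allows {downtime_per_year(0.99999):.0f} sec/year")
```

The goal works out to about one second of downtime per year, roughly seven and a half nines. A "five nines" system, already considered very good, is allowed a little over five minutes per year. A database that needs half an hour to restart cannot come close, which is why restart was ruled out.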
Technology
[Figure: conceptual flow of system. Radar -> find tracks -> identify flight -> look up record -> plan actions -> human action]
[Figure: IBM’s fault-tolerant process pair concept. The same pipeline (radar, find tracks, identify flight, look up record, plan actions, human action) drawn twice.]
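The process pair idea, in which primary and backup both do the same work so that they stay in the same state, can be sketched in a few lines. This is a toy illustration only: the source does not describe IBM's actual design, the names `Tracker`, `ProcessPair`, and the flight id are invented, and the real-time and failure-detection problems that made the real system hard are ignored.

```python
class Tracker:
    """One pipeline stage. It is deterministic, so two copies fed the
    same inputs in the same order end up in the same state."""

    def __init__(self):
        self.tracks = {}
        self.alive = True

    def process(self, flight, position):
        if not self.alive:
            raise RuntimeError("replica has crashed")
        self.tracks[flight] = position
        return f"{flight} at {position}"


class ProcessPair:
    """Hypothetical pair: every input goes to both replicas."""

    def __init__(self):
        self.replicas = [Tracker(), Tracker()]    # [primary, backup]

    def process(self, flight, position):
        outputs = []
        for replica in list(self.replicas):
            try:
                outputs.append(replica.process(flight, position))
            except RuntimeError:
                # Failure detected: drop the crashed replica. The survivor
                # already holds the same state, so no restart is needed.
                self.replicas.remove(replica)
        if not outputs:
            raise RuntimeError("both replicas have failed")
        return outputs[0]     # answer comes from the first surviving replica


pair = ProcessPair()
pair.process("UA101", (10, 20))
pair.replicas[0].alive = False         # simulate a crash of the primary
pair.process("UA101", (11, 21))        # still answered, by the former backup
```

Even in this toy, the scheme only works because `Tracker` is deterministic and both copies see identical inputs in identical order. Guaranteeing that across a whole pipeline, under real-time deadlines, is where the complexity comes from.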
Why is this Hard?
The system has many “real-time” constraints on it
  Actions need to occur promptly
  Even if something fails, we want the human controller to continue to see updates
IBM’s technology
  Based on a research paper by Flaviu Cristian
  But had never been used except for proof of concept purposes, on a small scale in the laboratory
Politics
IBM’s proposal sounded good…
  … and they were the second lowest bidder
  … and they had the most aggressive schedule
So the FAA selected them over alternatives
IBM took on the whole thing all at once
Disaster Strikes
Immediate confusion: all parts of the system seemed interdependent
  To design part A I need to know how part B, also being designed, will work
Controllers didn’t like early proposals and insisted on major changes to design
Fault-tolerance idea was one of the reasons IBM was picked, but made the system so complex that it went on the back burner
Summary of Simplifications
Focus on some core components
Postpone worry about fault-tolerance until later
Try and build a simple version that can be fleshed out later
… but the simplification wasn’t enough. Too many players kept intruding with requirements
Crash and Burn
The technical guys saw it coming
  Probably as early as one year into the effort
  But they kept it secret (“bad news diode”)
  Anyhow, management wasn’t listening (“they’ve heard it all before – whining engineers!”)
The fault-tolerance scheme didn’t work
  Many technical issues unresolved
The FAA kept out of the technical issues
  But a mixture of changing specifications and serious technical issues were at the root of the problems
What came out?
In the USA, nothing.
  The entire system was useless – the technology was of an all-or-nothing style and nothing was ready to deploy
British later rolled out a very limited version of a similar technology, late, with many bugs, but it does work…
Contrast with French
They took a very incremental approach
  Early design sought to cut back as much as possible
  If it isn’t “mandatory” don’t do it yet
  Focus was on console cluster architecture and fault-tolerance
They insisted on using off-the-shelf technology
Contrast with French
Managers intervened in technology choices
  For example, the vendor wanted to do a home-brew fault-tolerance technology
  French insisted on a specific existing technology and refused to bid out the work until vendors accepted
  A critical “good call” as it worked out
Learning by Doing
To gain experience with technology
  They tested, and tested, and tested
  Designed simple prototypes and played with them
  Discovered that large cluster would perform poorly
  But found a “sweet spot” and worked within it
This forced project to cut back on some goals
Testing
9/10th of time and expense on any system is in
  Testing
  Debugging
  Integration
Many projects overlook this
French planned conservatively
Software Bugs
Figure on a bug in 1/10 lines of new code
  But as many as 1/250 lines in old code
Bugs show up under stress
  Trick is to run a system in an unstressed mode
French identified “stress points” and designed to steer far from them
  Their design also assumed that components would fail and automated the restart
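Automating the restart can be as simple as a supervising loop that assumes the component will crash and rebuilds it when it does. The sketch below is a generic illustration, not the French system's mechanism, which the source does not describe. `run_with_restart` and its parameters are hypothetical names.

```python
def run_with_restart(make_component, max_restarts=3):
    """Run a component; if it crashes, build a fresh one and try again.

    make_component: zero-argument factory returning a callable to run.
    Returns (result, number_of_restarts_needed).
    """
    restarts = 0
    while True:
        component = make_component()      # always start from a clean instance
        try:
            return component(), restarts
        except Exception:
            if restarts == max_restarts:
                raise                     # give up: escalate to a human
            restarts += 1                 # failure is treated as routine


# Usage: a component that crashes twice before succeeding.
attempts = {"count": 0}

def make_flaky():
    def run():
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise RuntimeError("crash")
        return "done"
    return run

result, restarts = run_with_restart(make_flaky)
```

The cap on restarts is a deliberate choice: a component that keeps crashing is more likely hitting a Bohrbug than a transient fault, and restarting it forever would only hide the problem.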
All of this worked!
Take-aways from French project?
  Complex technical issues at the core of the system
  But they managed to break big project into pieces
  Do the critical core first, separately, and focus exclusively on it
  Test, test, test
  Don’t build anything you can possibly buy
  Management was technically sophisticated enough to make some critical “calls”
Your Problem
e-commerce systems are at e-risk
These e-risks take many forms:
  System complexity
  Failure to plan for failures
  Poor project management
Ignore this at our peril, as we’ve seen
But how can we learn to do better?
Keys to Reliability
Know the basic technologies
Realize that software is buggy and failures will happen.
  Design to treat failure as a mundane event
  Failure to plan for failure is the biggest e-risk!
Complexity is a huge threat.
Use your naiveté as an advantage: if you can’t understand it, why assume that “they” can understand it?
E-commerce Technologies
The network and associated services
Databases
Web servers
“Scripts” – the glue your people use to tie it all together