Life of a Cell: Woes and Wins

Dexter "Kim" Kimball ([email protected])

Transcript of Life of a Cell

Page 1: Life of a Cell

Life of a Cell

Woes and Wins

Page 2: Life of a Cell


The Conundrum

Distribute, on-line, millions of pages of aircraft maintenance documentation in a system that the FAA requires to be foolproof:
– No downtime
– All data identical for every mechanic worldwide. “Always.”

Page 3: Life of a Cell


Business Risks

An airplane cannot leave the gate if maintenance documentation is unavailable.

An airplane stuck at the gate costs the airline a lot of money, system-wide.

Hasn’t been done before

Page 4: Life of a Cell


Business Drivers

Faster access to documentation translates to millions of dollars a year in recovered revenue
– No such thing as “I did that yesterday, I’ll just wing it”: documents change daily
– New document is printed and carried aboard the aircraft (or you’re busted)
– Search times and print times must be low

Page 5: Life of a Cell


Business Drivers

Consistency of documentation eliminates “flip-flop” maintenance costs
– I use Procedure A and perform X
– Downline, old documents ... “Hey, who did that? But uh oh, I can fix it.” Procedure B
– Downline, new documents, Procedure A ....

Page 6: Life of a Cell


Business Drivers

• Safety
– An incident involving a fatality drops ticket sales by 50% for two weeks.
– If the incident cannot be explained, ticket sales remain off until it is.
– US Airways 737 (1994), Pittsburgh, almost put the airline out of business.
– Airline people really do care about the people they’re responsible for.

Page 7: Life of a Cell


The Plan

Be the first airline to gain competitive advantage by going to 100% online documentation

Retire microfilm/microfiche completely

Don’t lose shirt

Page 8: Life of a Cell


The Technologies

• Excalibur Technologies “EFS” (Electronic File System)

• Transarc AFS 3.3

• HP Servers

• Bunch’o’stuff to convert manuals to TIF

• Windows 3.1 target user platform

Page 9: Life of a Cell


The Process

• Scan microfiche/film manual pages to TIF
• EFS: OCR TIFs
• AFS: Store TIF pages
• EFS: Index TIFs (OCR output), keyword indexes
• AFS: Store index
• AFS: Replicate to strategically placed fileservers (sketch below)
• Mechanics and engineers:
– Click on index icon (file cabinet)
– Keyword search
– EFS client on Windows 3.1 desktop requests data from EFS server running on AFS fileserver
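For the AFS side of this pipeline, a minimal sketch of how one manual’s volume might be created, populated, and replicated. The cell, server, partition, and volume names (airline.com, fs-sfo-1, fs-pit-1, /vicepa, doc.757.ch32) are invented for illustration:

    # Create the RW volume and mount it in the documentation tree.
    vos create fs-sfo-1 /vicepa doc.757.ch32
    fs mkmount /afs/airline.com/docs/757/ch32 doc.757.ch32

    # ... copy OCR'd TIF pages and EFS index output into the mount point ...

    # Define RO sites (a local clone plus remote fileservers), then release.
    vos addsite fs-sfo-1 /vicepa doc.757.ch32
    vos addsite fs-pit-1 /vicepa doc.757.ch32
    vos release doc.757.ch32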

Page 10: Life of a Cell


World wide airline, world wide cell

• Fileserver locations decided by
– Location on corporate backbone
– Connectivity from other linestations (smaller airports)
– Number of linestations that can be served from location
– Paranoia (overdesigned by 2x)

Page 11: Life of a Cell


Domestic Fileserver Locations

[Map: “AFS Fileserver Locations and their Fileservice Regions.” U.S. map marking fileserver locations BOI, PIT, IAD, BWI, MIA, IAH, and IND, with the total number of workstations in each region (189, 96, 75, 373, and others). Legend: large locations (> 50 workstations) host a fileserver; medium locations (8-50 workstations) and small locations (< 8 workstations) run the AFS client only, with no local fileservers. Basic U.S. map with airport codes courtesy of Roger Blundell.]

Page 12: Life of a Cell


End User Workstations

• Every hangar -- many per “dock”

• Every gate – 2x, independent LANs

• Every engineering department

• Facilities for support of in-air aircraft

(World wide)

Page 13: Life of a Cell


AFS Client Locations

• Minimal
– No supported Windows 3.1 AFS client
– EFS client requests data from AFS client

Page 14: Life of a Cell


Number of users

• 40,000 human users
– “I forgot my password” puts the airline out of business
• 1,500 workstations: the workstation hostname is the “user” and is written on the front of the workstation

Page 15: Life of a Cell


Woes and Wins

• Network – shoving data into your LAN

• Replication management
– Who is authorized?
– You want me to release how many volumes?
– vos release times

• FAA – the system will not go down! All replicas will be identical

• Let’s use a really big cache for Seattle!

Page 16: Life of a Cell


Woe: Network

How to get 300-600 GB of data to a fileserver for the initial load of ROs
– Slow links to small airports
– Slow links to international server locations
– Fast links heavily trafficked
– vos release can beat the * out of a network
– An airline is always in operation: no magic window of opportunity

Page 17: Life of a Cell


Win: Network

• Can’t use vos release

• Hey, we have lots of those airplane things
– Load local (SFO) fileserver array with disks, set up vicep’s
– vos addsite to fileserver/array; vos release
– vgexport: OS says bye to volume groups
– vos remsite; remove drives
– Fly to wherever; vgimport, vos addsite / vos release. Rio, anyone? (sketch below)
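A minimal sketch of that sequence, assuming HP-UX LVM volume groups; the server names, volume group, and device path (fs-sfo-1, fs-gig-1, /dev/vg01) are invented:

    # At SFO: stage an RO replica onto the removable array over the local LAN.
    vos addsite fs-sfo-1 /vicepb doc.757.ch32
    vos release doc.757.ch32
    vos remsite fs-sfo-1 /vicepb doc.757.ch32   # drop the staging site from the VLDB

    # Detach the disks: unmount, export the volume group, pull the drives.
    umount /vicepb
    vgexport /dev/vg01

    # At the destination, after the flight: re-import and register the new site.
    vgimport /dev/vg01 /dev/dsk/c1t6d0          # hypothetical device path
    mount /dev/vg01/lvol1 /vicepb
    vos addsite fs-gig-1 /vicepb doc.757.ch32
    vos release doc.757.ch32                    # the data is already on the disks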

Page 18: Life of a Cell


Woes: Replication Management

15000 RW volumes, all replicated

• Who’s authorized to issue vos release?

• Which volumes to release? EFS randomly places data ...

• How many volumes did you say to release?

Page 19: Life of a Cell


Win: Replication Management

• Authorization/automation
– Per-fleet, per-manual “vosrel” PTS group
– PTS group on every relevant volume root node
– User interface writes record to work queue, a file in /afs
• Requester; manual/index; priority
– Fileserver cron job compares requester with vosrel PTS group, figures out volume list, performs vos release -localauth (sketch below)
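A minimal sketch of such a cron job; the queue path, queue record layout, and group- and volume-naming conventions are assumptions:

    #!/bin/sh
    # Hypothetical release worker; each queue line reads: requester manual priority
    QUEUE=/afs/airline.com/admin/vosrel/queue

    while read requester manual priority; do
        group="vosrel:$manual"                   # assumed PTS group naming
        # Authorization: requester must belong to the manual's vosrel group.
        if ! pts membership "$group" | grep -qw "$requester"; then
            echo "denied: $requester may not release $manual" >&2
            continue
        fi
        # Volume names follow the (assumed) convention doc.<manual>.*
        vos listvldb | awk -v p="^doc\\.$manual" '$1 ~ p {print $1}' |
        while read vol; do
            vos release "$vol" -localauth
        done
    done < "$QUEUE"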

Page 20: Life of a Cell


Woe: Replication Management

• Which volumes to release?
– Well-known volume tree and consistent naming conventions
– Release all volumes for the requested manual
– Who cares, really? How many can there be?
• Sometimes 4000+ volumes per night
• vos release is slowish: doesn’t check to see if the volume is unchanged; looks at contents
• Release cycle > 24 hours, queue issue. OW!

Page 21: Life of a Cell


Win: Replication Management

• Filter release requests (sketch below)
– Compare RO dates, RW dates: if RW not changed and all ROs have the same date, skip it
• Filter: 3 seconds; vos release “no op”: 30 seconds
– Small fraction of volumes for a given manual are actually changed
• Sometimes 0 changed; sometimes < 1%; usually a small fraction of the total
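A minimal sketch of the filter, assuming the decision can be made from the “Last Update” lines that vos examine prints (the real filter compared the RW date against all five RO sites):

    #!/bin/sh
    # Skip vos release when the RW volume hasn't changed since the last release.
    needs_release() {
        rw=$(vos examine "$1"          | sed -n 's/^ *Last Update //p' | head -1)
        ro=$(vos examine "$1.readonly" | sed -n 's/^ *Last Update //p' | head -1)
        [ "$rw" != "$ro" ]
    }

    if needs_release doc.757.ch32; then   # ~3 s to decide vs. ~30 s for a no-op release
        vos release doc.757.ch32 -localauth
    fi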

Page 22: Life of a Cell


Woe: FAA – the system will not fail!!

• FAA requires 100% uptime, else won’t approve system and airline can go fish

• Yeah, right!

Page 23: Life of a Cell


Win: FAA – the system will not fail!!

• Data outage vs. system outage

• Replication, of course

• Multiple configurations for EFS client
– Crude failover

• No data outage for six years and counting
– Well, there were a couple of times when ... but we fixed that ...

Page 24: Life of a Cell


Woe: FAA – replicas will be identical

• Several million RW files X 5 replicas

• Have to prove that all files are identical across the 5 ROs for a given volume

Page 25: Life of a Cell


Win: FAA – replicas will be identical

• Tree crawler! (sketch below)

• A little cheesy: “ls -l | cksum” each directory in the volume and compare results

• Known “bad case” looked for 6x per day

• Key “fs setserverprefs” – I prefer you, now you, now you, now you

• Dedicated client, no mounted .backups
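A minimal sketch of the crawler, run on the dedicated client. fs setserverprefs makes each RO site in turn the preferred source; the server names and mount point are invented:

    #!/bin/sh
    VOLROOT=/afs/airline.com/docs/757/ch32       # hypothetical mount point
    for s in fs-sfo-1 fs-pit-1 fs-mia-1 fs-iah-1 fs-bwi-1; do
        fs setserverprefs -servers "$s" 1        # rank 1 = most preferred
        fs flushvolume -path "$VOLROOT"          # force re-fetch from that site
        # "ls -l | cksum" each directory, folded into one checksum per server.
        echo "$s: $(find "$VOLROOT" -type d -exec sh -c 'ls -l "$1" | cksum' sh {} \; | cksum)"
    done
    # The five checksums should match; the odd man out is a bad replica.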

Page 26: Life of a Cell


Woe: Let’s use a really big cache

• It seemed like a really good idea
– 20% of files changed per quarter; < 2%/week
– Average file size 10K
– Oops, the indexes are monolithic and 300 MB ... but don’t change often
– Let’s try a 12 GB cache!

• “Hello? I’ve got twenty minutes to turn the shuttle. It takes fifteen minutes to ...”

Page 27: Life of a Cell


Win: Let’s not use a really big cache

• AFS client (still, I believe?) chokes on a large cache
– 12 GB =~ 1,200,000 cache “Vfiles”
– At garbage collection time, cache purge looks for LRU
– Gee, that takes a long time. Is the machine dead?
– Let’s try a 3 GB cache! (cacheinfo sketch below)

• (Worked indefinitely from 3.3 through 3.6)
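For reference, the disk cache size on a Unix AFS client is set in the cacheinfo file (AFS mount point, cache directory, cache size in 1 KB blocks); a 3 GB cache per the slide:

    # /usr/vice/etc/cacheinfo -- <mount point>:<cache directory>:<size in 1K blocks>
    /afs:/usr/vice/cache:3000000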

Page 28: Life of a Cell


Other smidgeons

• vos release manager (sketch below)
– Does volume need to be released?
– Are all the relevant fileservers available?
– Is there a sync site for the VLDB?
– Do it
– Did it?
• Check VLDB entry
• Compare dates
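A minimal sketch of those checks; the slide doesn’t show the exact probes, so the ones below (rxdebug against each fileserver, udebug against a vlserver assumed to normally hold the sync site) are illustrative:

    #!/bin/sh
    # Are all relevant fileservers answering on the fileserver port (7000)?
    for s in fs-sfo-1 fs-pit-1 fs-mia-1; do
        rxdebug "$s" 7000 -version >/dev/null || { echo "$s down; deferring" >&2; exit 1; }
    done
    # Is there a Ubik sync site for the VLDB? (naive: assumes vldb1 holds it)
    udebug vldb1 7003 | grep -q '^I am sync site' || { echo "no VLDB sync site" >&2; exit 1; }

    vos release doc.757.ch32 -localauth

    # Did it? Re-check the VLDB entry and compare site dates.
    vos examine doc.757.ch32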

Page 29: Life of a Cell


Other smidgeons

• Data reasonableness checks (sketch below)
– Do files pointed to by index actually exist?
– If not, do not vos rel the index
– Avoids the data outage of “empty index”, for example *(bad day)*
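A minimal sketch of such a check, assuming a hypothetical index format with one referenced /afs path per line:

    #!/bin/sh
    INDEX=/afs/airline.com/docs/757/index.dat    # hypothetical path and format
    missing=0
    while read path; do
        [ -f "$path" ] || missing=$((missing + 1))
    done < "$INDEX"

    if [ "$missing" -eq 0 ]; then
        vos release doc.757.index -localauth     # hypothetical index volume name
    else
        echo "index references $missing missing files; release withheld" >&2
    fi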

Page 30: Life of a Cell


Other smidgeons

• popcache (sketch below)
– Index files: monolithic and large
– Fileservers: overseas, slow networks
– Initial search of newly released index could take many minutes
– Cat indexes to /dev/null every five minutes
• If index unchanged, local cached copy is used
• If index changed, pulled from fileserver and the user doesn’t pay the penalty for the first search
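A minimal sketch of popcache, run from cron every five minutes; the index path is invented:

    #!/bin/sh
    # popcache: read every index so the cache manager pre-fetches changed ones
    # before a user's first search has to pay for the transfer.
    for idx in /afs/airline.com/docs/*/index.dat; do
        cat "$idx" > /dev/null     # a no-op when the cached copy is still valid
    done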

Page 31: Life of a Cell


Other smidgeons

• Anyone here ever have these?
– AFS is complaining about the network, so AFS broke the network
• AFS is the network’s canary in a cage
– We could do the whole thing with NFS!
– AFS isn’t POSIX compliant. Yay DFS!
– A file lock resides on disk. A file in an RO volume can’t be locked. (Oh yes it can.)
– HP T500 goes to sleep?
– We could do the whole thing on a Kenmore!

Page 32: Life of a Cell


Outcome: AFS Rules

• The airline became the first airline (and may still be the only) to place 100% of its aircraft maintenance documentation on line

• The system has run reliably for 5+ years
• So of course it’s time to replace it

• There are three server locations in the US, and one each in Europe, Hong Kong, Narita, Sydney, Montevideo, and Rio de Janeiro

• Mechanics no longer mash the microfilm reader

This system was enabled by AFS