
THE MAGAZINE OF USENIX & SAGE • August 2002 • volume 27 • number 5

The Advanced Computing Systems Association & The System Administrators Guild

inside: CONFERENCE REPORTS

USENIX 2002


conference reports

KEYNOTE ADDRESS

THE INTERNET'S COMING SILENT SPRING

Lawrence Lessig, Stanford University

Summarized by David E. Ott

In a talk that received a standing ovation, Lawrence Lessig pointed out the recent legal crisis that is stifling innovation by extending notions of private ownership of technology beyond reasonable limits.

Several lessons from history are instructive: (1) Edwin Armstrong, the creator of FM radio technology, became an enemy to RCA, which launched a legal campaign to suppress the technology; (2) packet-switching networks proposed by Paul Baran were seen by AT&T as a new, competing technology that had to be suppressed; (3) Disney took Grimm tales, by then in the public domain, and retold them in magically innovative ways. Should building upon the past in this way be considered an offense?

"The truth is, architectures can allow": that is, freedom for innovation can be built into architectures. Consider the Internet: a simple core allows for an unlimited number of smart end-to-end applications.

Compare an AT&T proprietary core network to the Internet's end-to-end model. The number of innovators for the former is one company, while for the latter it's potentially the number of people connected to it. Innovations for AT&T are designed to benefit the owner, while the Internet is a wide-open board that allows all kinds of benefits to all kinds of groups. The diversity of contributors in the Internet arena is staggering.

The auction of spectrum by the FCC is another case in point. The spectrum is usually seen as a fixed resource characterized by scarcity. The courts have seen spectrum as something to be owned as property. Technologists, however, have shown that capacity can be a function of architecture.

OUR THANKS TO THE SUMMARIZERS

For the USENIX Annual Technical Conference: Josh Simon, who organized the collecting of the summaries in his usual flawless fashion:

Steve Bauer

Florian Buchholz

Matt Butner

Pradipta De

Xiaobo Fan

Hai Huang

Scott Kilroy

Teri Lampoudi

Josh Lothian

Bosko Milekic

Juan Navarro

David E. Ott

Amit Purohit

Brennan Reynolds

Matt Selsky

J.D. Welch

Li Xiao

Praveen Yalagandula

Haijin Yan

Gary Zacheiss

2002 USENIX ANNUAL TECHNICAL CONFERENCE
MONTEREY, CALIFORNIA, USA
JUNE 10-15, 2002

ANNOUNCEMENTS

Summarized by Josh Simon

The 2002 USENIX Annual Technical Conference was very exciting. The general track had 105 papers submitted (up 28% from 82 in 2001) and accepted 25 (19 from students); the FREENIX track had 53 submitted (up from 52 in 2001) and accepted 26 (7 from students).

The two annual USENIX-given awards were presented by outgoing USENIX Board President Dan Geer. The USENIX Lifetime Achievement Award (also known as the "Flame" because of the shape of the award) went to James Gosling for his contributions, including the Pascal compiler for Multics, Emacs, an early SMP UNIX, work on X11 and Sun's windowing system, the first enscript, and Java. The Software Tools Users Group (STUG) Award was presented to the Apache Foundation and accepted by Rasmus Lerdorf. In addition to the well-known Web server, Apache produces Jakarta, mod_perl, mod_tcl, and an XML parser, with over 80 members in at least 15 countries.


As David Reed argues, capacity can scale with the number of users, assuming an effective technology and an open architecture.

Developments over the last three years are disturbing and can be summarized as two layers of corruption: intellectual-property rights constraining technical innovation, and proprietary hardware platforms constraining software innovation.

Issues have been framed by the courts largely in two mistaken ways: (1) it's their property – a new technology shouldn't interfere with the rights of current technology owners; (2) it's just theft – copyright laws should be upheld by suppressing certain new technologies.

In fact, we must reframe the debate from "it's their property" to a "highways" metaphor that acts as a neutral platform for innovation without discrimination. "Theft" should be reframed as "Walt Disney," who built upon works from the past in richly creative ways that demonstrate the utility of allowing work to reach the public domain.

In the end, creativity depends upon the balance between property and access, public and private, controlled and common access. Free code builds a free culture, and open architectures are what give rise to the freedom to innovate. All of us need to become involved in reframing the debate on this issue; as Rachel Carson's Silent Spring points out, an entire ecology can be undermined by small changes from within.

INVITED TALKS

THE IETF, OR WHERE DO ALL THOSE RFCS COME FROM, ANYWAY?

Steve Bellovin, AT&T Labs – Research

Summarized by Josh Simon

The Internet Engineering Task Force (IETF) is a standards body, but not a legal entity, consisting of individuals (not organizations) and driven by a consensus-based decision model. Anyone who "shows up" – be it at the thrice-annual in-person meetings or on the email lists for the various groups – can join and be a member.


The IETF is concerned with Internet protocols and open standards, not LAN-specific protocols (such as AppleTalk) or layer-1 or -2 issues (like copper versus fiber).

The organizational structure is loose. There are many working groups, each with a specific focus, within several areas. Each area has an area director; the area directors collectively form the Internet Engineering Steering Group (IESG). The six permanent areas are Internet (with working groups for IPv6, DNS, and ICMP), Transport (TCP, QoS, VoIP, and SCTP), Applications (mail, some Web, LDAP), Routing (OSPF, BGP), Operations and Management (SNMP), and Security (IPSec, TLS, S/MIME). There are also two other areas: Sub-IP is a temporary area for things underneath the IP protocol stack (such as MPLS, IP over wireless, and traffic engineering), and there's a General area for miscellaneous and process-based working groups.

Internet Requests for Comments (RFCs) fall into three tracks: Standard, Informational, and Experimental. Note that this means that not all RFCs are standards. The RFCs in the Informational track are generally for proprietary protocols or April first jokes; those in the Experimental track are results, ideas, or theories.

The RFCs in the Standard track come from working groups in the various areas through a time-consuming, complex process. Working groups are created with an agenda, a problem statement, an email list, some draft RFCs, and a chair. They typically start out as a BoF session. The working group and the IESG make a charter to define the scope, milestones, and deadlines; the Internet Advisory Board (IAB) ensures that the working group proposals are architecturally sound. Working groups are narrowly focused and are supposed to die off once the problem is solved and all milestones achieved.

Working groups meet and work mainly through the email list, though there are three in-person, high-bandwidth meetings per year. However, decisions reached in person must be ratified by the mailing list, since not everybody can get to three meetings per year. They produce RFCs, which go through the Standard track; these need to go through the entire working group before being submitted for comment to the entire IETF and then to the IESG. Most RFCs wind up going back to the working group at least once from the area director or IESG level.

The format of an RFC is well defined and requires that it be published in plain 7-bit ASCII. RFCs are freely redistributable, and the IETF reserves the right of change control on all Standard-track RFCs.

The big problems the IETF is currently facing are security, internationalization, and congestion control. Security has to be designed into protocols from the start. Internationalization has shown us that 7-bit-only ASCII is bad and doesn't work, especially for those character sets that require more than 7 bits (like Kanji); UTF-8 is a reasonable compromise. But what about domain names? While not specified as requiring 7-bit ASCII in the specifications, most DNS applications assume a 7-bit character set in the namespace. This is a hard problem. Finally, congestion control is another hard problem, since the Internet is not the same as a really big LAN.

INTRODUCTION TO AIR TRAFFIC MANAGEMENT SYSTEMS

Ron Reisman, NASA Ames Research Center; Rob Savoye, Seneca Software

Summarized by Josh Simon

Air traffic control is organized into four domains: surface, which runs out of the airport control tower and controls the aircraft on the ground (e.g., taxi and takeoff); terminal area, which covers aircraft at 11,000 feet and below, handled by the Terminal Radar Approach Control (TRACON) facilities; en route, which covers between 11,000 and 40,000 feet, including climb, descent, and at-altitude flight, and runs from the 20 Air Route Traffic Control Centers (ARTCC, pronounced "artsy"); and traffic flow management, which is the strategic arm. Each area has sectors for low, high, and very-high flight. Each sector has a controller team, including one person on the microphone, and handles between 12 and 16 aircraft at a time. Since the number of sectors and areas is limited and fixed, there is limited system capacity. The events of September 11, 2001, gave us a respite in terms of system usage, but based on past growth patterns the air traffic system will be oversubscribed within two to three years. How do we handle this oversubscription?

Air Traffic Management (ATM) Decision Support Tools (DST) use physics, aeronautics, heuristics (expert systems), fuzzy logic, and neural nets to help the (human) aircraft controllers route aircraft around. The rest of the talk focused on capacity issues, but the DSTs also handle safety and security issues. The software follows open standards (ISO, POSIX, and ANSI). The team at NASA Ames made the Center-TRACON Automation System (CTAS), which is software for each of the ARTCCs, portable from Solaris to HP-UX and Linux as well. Unlike just about every other major software project, this one really is standard and portable; co-presenter Rob Savoye has experience in maintaining gcc on multiple platforms and is the project lead on the portability and standards issues for the code. CTAS allows the ARTCCs to upgrade and enhance individual aspects or parts of the system; it isn't a monolithic, all-or-nothing entity like the old ATM systems.

Some future areas of research include a head-mounted augmented reality device for tower operators, to improve their situational awareness by automating human factors, and new digital global positioning system (DGPS) technologies, which are accurate within inches instead of feet.


Questions centered around advanced avionics (e.g., getting rid of ground control), cooperation between the US and Europe for software development (we're working together on software development, but the various European countries' controllers don't talk well to each other), and privatization.

ADVENTURES IN DNS

Bill Manning, ISI

Summarized by Brennan Reynolds

Manning began by posing a simple question: is DNS really a critical infrastructure? The answer is not simple. Perhaps seven years ago, when the Internet was just starting to become popular, the answer was a definite no. But today, with IPv6 being implemented and so many transactions being conducted over the Internet, the question does not have a clear-cut answer. The Internet Engineering Task Force (IETF) has several new modifications to the DNS service that may be used to protect and extend its usability.

The first extension Manning discussed was Internationalized Domain Names (IDN). To date, all DNS records are based on the ASCII character set, but many addresses in the Internet's global network cannot be easily written in ASCII characters. The goal of IDN is to provide encoding for hostnames that is fair, efficient, and allows for a smooth transition from the current scheme. The work has resulted in two encoding schemes, ACE and UTF-8. Each encoding is independent of the other, but they can be used together in various combinations. Manning expressed his opinion that while neither is an ideal solution, ACE appears to be the lesser of two evils. A major hindrance to getting IDN rolled out into the Internet's DNS root structure is the increase in zone-file complexity.

Manning's next topic was the inclusion of IPv6 records. In an IPv4 world it is possible for administrators to remember the numeric representation of an address; IPv6 makes the addresses too long and complex to be easily remembered. This means that DNS will play a vital role in getting IPv6 deployed. Several new resource records (RRs) have been proposed to handle the translation, including AAAA, A6, DNAME, and BITSTRING. Manning commented on the difference between IPv4 and v6 as a transport protocol, and noted that systems tuned for v4 traffic will suffer a performance hit when using v6. This is largely attributed to the increase in data transmitted per DNS request.

The final extension Manning discussed was DNSSec. He introduced this extension as a mechanism that protects the system from itself. DNSSec protects against data spoofing and provides authentication between servers. It includes a cryptographic signature of the RR set to ensure authenticity and integrity. By signing the entire set, the amount of computation is kept to a minimum. The information itself is stored in a new RR within each zone file on the DNS server.
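To make the set-signing idea concrete, here is a minimal sketch, not the actual DNSSEC wire format, RRSIG/DNSKEY records, or key management: it simply shows one signature covering a whole (canonicalized) RR set so a verifier can check authenticity and integrity in a single operation. The canonicalization and key handling below are illustrative assumptions.

```python
# Conceptual sketch only: one signature over an entire canonicalized RR set,
# rather than one per record. Real DNSSEC uses its own record types and
# canonical form; the layout here is an assumption for illustration.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

rrset = [
    ("www.example.com.", "A",    "192.0.2.10"),
    ("www.example.com.", "AAAA", "2001:db8::10"),
]
canonical = "\n".join("|".join(rr) for rr in sorted(rrset)).encode()

zone_key = Ed25519PrivateKey.generate()      # stands in for the zone's signing key
signature = zone_key.sign(canonical)         # a single signature covers the whole set

# A verifier holding the zone's public key re-canonicalizes and checks the set:
zone_key.public_key().verify(signature, canonical)   # raises InvalidSignature if tampered with
```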

Manning briefly commented on the use of DNS to provide a PKI infrastructure, stating that that was not the purpose of DNS and therefore it should not be used in that fashion. The signing of the RR sets can be done hierarchically, resulting in the use of a single trusted key at the root of the DNS tree to sign all sets down to the leaves. However, the job of key replacement and rollover is extremely difficult for a system that is distributed across the globe.

Manning stated that in an operational test bed with all of these extensions enabled, the packet size for a single query response grew from 518 bytes to larger than 18,000. This results in a large increase in bandwidth usage for high-volume name servers. Therefore, in Manning's opinion, not all of these features will be deployed in the near future. For those looking for more information, Bill's Web site can be found at http://www.isi.edu/otdr.

THE JOY OF BREAKING THINGS

Pat Parseghian, Transmeta

Summarized by Juan Navarro

Pat Parseghian described her experience in testing the Crusoe microprocessor at the Transmeta Lab for Compatibility (TLC), where the motto is "You make it, we break it" and the goal is to make engineers miserable.

She first gave an overview of Crusoe, whose most distinctive feature is a code-morphing layer that translates x86 instructions into native VLIW. Such peculiarity makes the testing process particularly challenging, because it involves testing two processors (the "external" x86 and the "internal" VLIW) and also because of variations in how the code-morphing layer works (it may interpret code the first few times it sees an instruction sequence, and translate and save in a translation cache afterwards). There are also reproducibility issues, since the VLIW processor may run at different speeds to save energy.

Pat then described the tests that they subject the processor to (hardware compatibility tests, common applications, operating systems, envelope-pushing games, and hard-to-acquire legacy applications) and the issues that must be considered when defining testing policies. She gave some testing tips, including organizational issues like tracking resources and results.

To illustrate the testing process, Pat gave a list of possible causes of a system crash that must be investigated. If it is the silicon, then it might be because it is damaged, or because of a manufacturing problem or a design flaw. If the problem is in the code-morphing layer, is the fault with the interpreter or with the translator? The fault could also be external to the processor: it could be the homemade system BIOS, a faulty mainboard, or an operator error. Or Transmeta might not be at fault at all; the crash might be due to a bug in the OS or the application.


The key to pinpointing the cause of a problem is to isolate it by identifying the conditions that reproduce the problem and repeating those conditions on other test platforms and non-Crusoe systems. Then relevant factors must be identified based on product knowledge, past experience, and common sense.

To conclude, Pat suggested that some of TLC's testing lessons can be applied to products that the audience was involved in, and assured us that breaking things is fun.

TECHNOLOGY, LIBERTY, FREEDOM, AND WASHINGTON

Alan Davidson, Center for Democracy and Technology

Summarized by Steve Bauer

"Experience should teach us to be most on our guard to protect liberty when the Government's purposes are beneficent. Men born to freedom are naturally alert to repel invasion of their liberty by evil-minded rulers. The greatest dangers to liberty lurk in insidious encroachment by men of zeal, well-meaning but without understanding." – Louis Brandeis, Olmstead v. US

Alan Davidson, an associate director at the CDT (http://www.cdt.org) since 1996, concluded his talk with this quote. In many ways it aptly characterizes the importance to the USENIX community of the topics he covered. The major themes of the talk were the impact of law on individual liberty and system architectures, and appropriate responses by the technical community.

The first part of the talk provided the audience with an overview of legislation either already introduced or likely to be introduced in the US Congress. This included various proposals to protect children, such as relegating all sexual content to a .xxx domain or providing kid-safe zones such as .kids.us. Other similar laws discussed were the Children's Online Protection Act and legislation dealing with virtual child pornography.

Other lower-profile pieces covered were laws prohibiting false contact information in emails and domain registration databases. Similar proposals exist that would prohibit misleading subject lines in emails and require messages to clearly identify whether they are advertisements. Briefly mentioned were laws impacting online gambling.

Bills and laws affecting the architectural design of systems and networks, and various efforts to establish boundaries and borders in cyberspace, were then discussed. These included the Consumer Broadband and Digital Television Promotion Act and the French cases against Yahoo and its CEO for allowing the sale of Nazi materials on the Yahoo auction site.

A major part of the talk focused on the technological aspects of the USA PATRIOT Act, passed in the wake of the September 11 attacks. Topics included "pen registers" for the Internet, expanded government power to conduct secret searches, roving wiretaps, nationwide service of warrants, sharing of grand jury and intelligence information, and the establishment of an identification system for visitors to the US.

The technological community needs to be aware of and care about the impact of law on individual liberty and on the systems the community builds. Specific suggestions included designing privacy and anonymity into architectures and limiting information collection, particularly any personally identifiable data, to only what is essential. Finally, Alan emphasized that many people simply do not understand the implications of law. Thus the technology community has an important role in helping make these implications clear.

TAKING AN OPEN SOURCE PROJECT TO MARKET

Eric Allman, Sendmail, Inc.

Summarized by Scott Kilroy

Eric started the talk with the upsides and downsides of open source software. The upsides include rapid feedback, high-quality work (mostly), lots of hands, and an intensely engineering-driven approach. The downsides include no support structure, limited project marketing expertise, and volunteers having limited time.

In 1998, Sendmail was becoming a success disaster. A success disaster includes two key elements: (1) volunteers start spending all their time supporting current functionality instead of enhancing it (this heavy burden can lead to project stagnation and eventually death); (2) competition from better-funded sources can lead to FUD (fear, uncertainty, and doubt) about the project.

Eric then led listeners down the path he took to turn Sendmail into a business. Eric envisioned Sendmail, Inc. as a smallish and close-knit company, but soon realized that the family atmosphere he desired could not last as the company grew. He observed that maybe only the first 10 people will buy into your vision, after which each individual is primarily interested in something else (e.g., power, wealth, or status). He warned that there will be people working for you whom you do not like. Eric particularly did not care for salespeople, but he emphasized how important sales, marketing, finance, and legal people are to companies.

The nature of companies in infancy can be summed up as: you need money, and you need investors in order to get money. Investors want a return on investment, so money management becomes critical; companies therefore must optimize money functions. Eric's experiences with investors were lessons in themselves. More than giving money, good investors can provide connections and insight, so you don't want investors who don't believe in you.

A company must have bug history, change management, and people who understand the product and have a sense of the target market. Eric now sees the importance of marketing and target research. A company can never forget the simple equation that drives business: profit = revenue - expense.

Finally, the concept of value should be based on what is valuable to the customer. Eric observed that his customers needed documentation, extra value in the product that they couldn't get elsewhere, and technical support.

Eric learned hard lessons along the way. If you want to do interesting open source, it might be best not to be too successful: you don't want to let the larger dragons (companies) notice you. If you want a commercial user base, you have to manage process, watch corporate culture, provide value to customers, watch bottom lines, and develop a thick skin.

Starting a company is easy, but perpetuating it is extremely difficult.

INFORMATION VISUALIZATION FOR SYSTEMS PEOPLE

Tamara Munzner, University of British Columbia

Summarized by J.D. Welch

Munzner presented interesting evidence that information visualization, through interactive representations of data, can help people perform tasks more effectively and reduce the load on working memory.

A popular technique in visualizing data is abstracting it through node-link representation – for example, a list of cross-references in a book. Representing the relationships between references in a graphical (node-link) manner "offloads cognition to the human perceptual system." These graphs can be produced manually, but automated drawing allows for increased complexity and quicker production times.

The way in which people interpret visual data is important in designing effective visualization systems. Attention to pre-attentive visual cues is key to success: picking out a red dot among a field of identically shaped blue dots is significantly faster than picking it out of the same data represented in a mixture of hue and shape variations. There are many pre-attentive visual cues, including size, orientation, intersection, and intensity. The accuracy of these cues can be ranked, with position triggering high accuracy and color or density on the low end.

Visual cues combined with different data types – quantitative (e.g., 10 inches), ordered (small, medium, large), and categorical (apples, oranges) – can also be ranked, with spatial position being best for all types. Beyond this commonality, however, accuracy varied widely, with length ranked second for quantitative data, eighth for ordinal, and ninth for categorical data types.

These guidelines about visual perception can be applied to information visualization, which, unlike scientific visualization, focuses on abstract data and choice of specialization. Several techniques are used to clarify abstract data, including multi-part glyphs, where changes in individual parts are incorporated into an easier-to-understand gestalt; interactivity; motion; and animation. In large data sets, techniques like "focus + context," where a zoomed portion of the graph is shown along with a thumbnail view of the entire graph, are used to minimize user disorientation.

Future problems include dealing with huge databases such as the Human Genome, reckoning with dynamic data like the changing structure of the Web or real-time network monitoring, and transforming "pixel-bound" displays into large "digital wallpaper"-type systems.

FIXING NETWORK SECURITY BY HACKING THE BUSINESS CLIMATE

Bruce Schneier, Counterpane Internet Security

Summarized by Florian Buchholz

Bruce Schneier identified security as one of the fundamental building blocks of the Internet. A certain degree of security is needed for all things on the Net, and the limits of security will become limits of the Net itself. However, companies are hesitant to employ security measures. Science is doing well, but one cannot see the benefits in the real world. An increasing number of users means more problems affecting more people. Old problems such as buffer overflows haven't gone away, and new problems show up. Furthermore, the amount of expertise needed to launch attacks is decreasing.

Schneier argues that complexity is the enemy of security, and that while security is getting better, the growth in complexity is outpacing it. As security is fundamentally a people problem, one shouldn't focus on technologies but rather should look at businesses, business motivations, and business costs. Traditionally, one can distinguish between two security models: threat avoidance, where security is absolute, and risk management, where security is relative and one has to mitigate the risk with technologies and procedures. In the latter model, one has to find a point with reasonable security at an acceptable cost.

After a brief discussion of how security is handled in businesses today, in which he concluded that businesses "talk big about it but do as little as possible," Schneier identified four necessary steps to fix the problems.

First, enforce liabilities. Today no real consequences have to be feared from security incidents. Holding people accountable will increase the transparency of security and give an incentive to make processes public. As possible options to achieve this, he listed industry-defined standards, federal regulation, and lawsuits. The problems with enforcement, however, lie in difficulties associated with the international nature of the problem and the fact that complexity makes assigning liabilities difficult. Furthermore, fear of liability could have a chilling effect on free software development and could stifle new companies.


As a second step, Schneier identified the need to allow partners to transfer liability. Insurance would spread liability risk among a group and would be a CEO's primary risk analysis tool. There is a need for standardized risk models and protection profiles, and for more security as opposed to empty press releases. Schneier predicts that insurance will create a marketplace for security, where customers and vendors will have the ability to accept liabilities from each other. Computer-security insurance should soon be as common as household or fire insurance, and from that development insurance will become the driving factor of the security business.

The next step is to provide mechanisms to reduce risks, both before and after software is released; techniques and processes to improve software quality, as well as an evolution of security management, are therefore needed. Currently, security products try to rebuild the walls – such as physical badges and entry doorways – that were lost when getting connected. Schneier believes this "fortress metaphor" is bad; one should think of the problem more in terms of a city. Since most businesses cannot afford proper security, outsourcing is the only way to make security scalable. With outsourcing, the concept of best practices becomes important, and insurance companies can be tied to them; outsourcing will level the playing field.

As a final step, Schneier predicts that rational prosecution and education will lead to deterrence. He claims that people feel safe because they live in a lawful society, whereas the Internet is classified as lawless, very much like the American West in the 1800s or a society ruled by warlords. This is because it is difficult to prove who an attacker is; prosecution is hampered by complicated evidence gathering and irrational prosecution. Schneier believes, however, that education will play a major role in turning the Internet into a lawful society. Specifically, he pointed out that we need laws that can be explained.

Risks won't go away; the best we can do is manage them. A company able to manage risk better will be more profitable, and we need to give CEOs the necessary risk-management tools.

There were numerous questions. Listed below are the more complicated ones, in a paraphrased Q&A format.

Q: Will liability be effective? Will insurance companies be willing to accept the risks? A: The government might have to step in; it remains to be seen how it plays out.

Q: Is there an analogy to the real world in the fact that in a lawless society only the rich can afford security? Security solutions differentiate between classes. A: Schneier disagreed with the statement, giving an analogy to front-door locks, but conceded that there might be special cases.

Q: Doesn't homogeneity hurt security? A: Homogeneity is oversold. Diverse types can be more survivable, but given the limited number of options, the difference will be negligible.

Q: Regulations on airbag protection have led to deaths in some cases. How can we keep the pendulum from swinging the other way? A: Lobbying will not be prevented. An imperfect solution is probable; there might be reversals of requirements, such as the airbag one.

Q: What about personal liability? A: This will be analogous to auto insurance; liability comes with computer/Net access.

Q: If the rule of law is to become reality, there must be a law enforcement function that applies to a physical space. You cannot do that with any existing government agency for the whole world. Should an organization like the UN assume that role? A: Schneier was not convinced global enforcement is possible.

Q: What advice should we take away from this talk? A: Liability is coming. Since the network is important to our infrastructure, eventually the problems will be solved in a legal environment. You need to start thinking about how to solve the problems and how the solutions will affect us.

The entire speech (including slides) is available at http://www.counterpane.com/presentation4.pdf.

LIFE IN AN OPEN SOURCE STARTUP

Daryll Strauss, Consultant

Summarized by Teri Lampoudi

This talk was packed with morsels of insight and tidbits of information about what life in an open source startup is like (though "was like" might be more appropriate), what the issues are in starting, maintaining, and commercializing an open source project, and the way hardware vendors treat open source developers. Strauss began by tracing the timeline of his involvement with developing 3-D graphics support for Linux: from the 3dfx Voodoo1 driver for Linux in 1997, to the establishment of Precision Insight in 1998, to its buyout by VA Linux, to the dismantling of his group by VA, to what the future may hold.

Obviously, there are benefits to doing open source development, both for the developer and for the end user. An important point was that open source develops entire technologies, not just products. A misapprehension is the inherent difficulty of managing a project that accepts code from a large number of developers, some volunteer and some paid. The notion that code can just be thrown over the wall into the world is a Big Lie.

How does one start and keep afloat an open source development company? In the case of Precision Insight, the subject matter – developing graphics and OpenGL for Linux – required a considerable amount of expertise: intimate knowledge of the hardware, the libraries, X, and the Linux kernel, as well as the end applications. Expertise is marketable. What helps even further is having a visible virtual team of experts, people who have an established track record of contributions to open source in the particular area of expertise.

Support from the hardware and software vendors came next. While everyone was willing to contract PI to write the portions of code that were specific to their hardware, no one wanted to pay the bill for developing the underlying foundation that was necessary for the drivers to be useful. In the PI model, the infrastructure cost was shared by several customers at once, and the technology kept moving. Once a driver is written for a vendor, the vendor is perfectly capable of figuring out how to write the next driver necessary, thereby obviating the need to contract the job. This, and questions of protecting intellectual property – hardware design in this case – are deeply problematic with the open source mode of development. This is not to say that there is no way to get over them, but they are likely to arise more often than not.

Management issues seem equally important. In the PI setting, contracts were flexible – both a good and a bad thing – and developers were overworked. Further, development "in a fishbowl," that is, under public scrutiny, is not easy. Strauss stressed the value of good communication and the use of various out-of-sync communication methods like IRC and mailing lists.

Finally, Strauss discussed what portions of software can be free and what can be proprietary. His suggestion was that horizontal markets want to be free, whereas vertically one can develop proprietary solutions. The talk closed with a small stroll through the problems that project politics brings up, from things like merging branches to mailing list and source tree readability.

GENERAL TRACK SESSIONS

FILE SYSTEMS

Summarized by Haijin Yan

STRUCTURE AND PERFORMANCE OF THE DIRECT ACCESS FILE SYSTEM

Kostas Magoutis, Salimah Addetia, Alexandra Fedorova, and Margo I. Seltzer, Harvard University; Jeffrey Chase, Andrew Gallatin, Richard Kisley, and Rajiv Wickremesinghe, Duke University; Eran Gabber, Lucent Technologies

This paper won the Best Paper award.

The Direct Access File System (DAFS) is a new standard for network-attached storage over direct-access transport networks. DAFS takes advantage of user-level network interface standards that enable client-side functionality in which remote data access resides in a library rather than in the kernel. This reduces the memory-copy overhead of data movement as well as protocol overhead. Remote Direct Memory Access (RDMA) is a direct-access transport feature that allows the network adapter to reduce copy overhead by accessing application buffers directly.

This paper explores the fundamental structural and performance characteristics of network file access using a user-level file system structure on a direct-access transport network with RDMA. It describes DAFS-based client and server reference implementations for FreeBSD and reports experimental results comparing DAFS to a zero-copy NFS implementation. It illustrates the benefits and trade-offs of these techniques to provide a basis for informed choices about deployment of DAFS-based systems and similar extensions to other network file protocols, such as NFS. Experiments show that DAFS gives applications direct control over an I/O system and increases client CPU usage while the client is doing I/O. Future work includes addressing longstanding problems related to the integration of the application and file system for high-performance applications.

CONQUEST: BETTER PERFORMANCE THROUGH A DISK/PERSISTENT-RAM HYBRID FILE SYSTEM

An-I A. Wang, Peter Reiher, and Gerald J. Popek, UCLA; Geoffrey M. Kuenning, Harvey Mudd College

Motivated by the declining cost of persistent RAM, the authors propose the Conquest file system, which stores all small files, metadata, executables, and shared libraries in persistent RAM; disks hold only the data content of the remaining large files. Compared to alternatives such as caching and RAM file systems, Conquest has the advantages of efficiency, consistency, and reliability at a reduced cost. Using popular benchmarks, experiments show that Conquest incurs little overhead while achieving faster performance. Future work includes designing mechanisms for adjusting the file-size threshold dynamically and finding a better disk layout for large data blocks.
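The placement rule at the heart of such a hybrid design can be sketched in a few lines; the threshold value and the function and parameter names below are assumptions for illustration, not Conquest's actual interface.

```python
# Sketch of a Conquest-style placement decision (threshold and names are assumed).
SMALL_FILE_THRESHOLD = 1 << 20   # e.g., 1MB; the real cutoff is a tunable design parameter

def placement(size_bytes, is_metadata=False, is_executable_or_library=False):
    """Return where the bytes of a file would live in a disk/persistent-RAM hybrid.

    Small files, metadata, executables, and shared libraries stay entirely in
    persistent RAM; only the data content of large files goes to disk.
    """
    if is_metadata or is_executable_or_library or size_bytes <= SMALL_FILE_THRESHOLD:
        return "persistent RAM"
    return "persistent RAM (metadata) + disk (large-file data)"

print(placement(4096))           # small file -> entirely in persistent RAM
print(placement(300 << 20))      # large media file -> metadata in RAM, data blocks on disk
```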

EXPLOITING GRAY-BOX KNOWLEDGE OF BUFFER-CACHE MANAGEMENT

Nathan C. Burnett, John Bent, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin, Madison

Knowing what algorithm is used to manage the buffer cache is very important for improving application performance; however, there is currently no interface for discovering this algorithm. This paper introduces Dust, a simple fingerprinting tool that is able to identify the buffer-cache replacement policy. Dust automatically identifies the cache size and replacement policy based on the configured attributes of access order, recency, frequency, and long-term history. Through simulation, Dust was able to distinguish between a variety of replacement-algorithm policies found in the literature: FIFO, LRU, LFU, Clock, Segmented FIFO, 2Q, and LRU-K. Further experiments fingerprinting real operating systems such as NetBSD, Linux, and Solaris show that Dust is able to identify the hidden cache replacement algorithm.
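The fingerprinting idea is easy to see in miniature: drive a cache with a probe pattern whose outcome depends on whether the policy rewards recency, then check which blocks survive. The toy simulator below is an illustration of that idea under simplifying assumptions, not the Dust tool itself.

```python
from collections import OrderedDict

class Cache:
    """Tiny block-cache simulator with a pluggable replacement policy."""
    def __init__(self, size, policy):
        self.size, self.policy = size, policy
        self.blocks = OrderedDict()              # head = next eviction candidate

    def access(self, block):
        hit = block in self.blocks
        if hit and self.policy == "LRU":
            self.blocks.move_to_end(block)       # recency update; FIFO ignores hits
        if not hit:
            if len(self.blocks) >= self.size:
                self.blocks.popitem(last=False)  # evict from the head
            self.blocks[block] = True
        return hit

def fingerprint(policy, size=4):
    cache = Cache(size, policy)
    for b in range(size):                        # fill the cache with blocks 0..size-1
        cache.access(b)
    cache.access(0)                              # re-touch block 0 (exercises "recency")
    cache.access(size)                           # force one eviction with a new block
    # If block 0 survived, the policy rewarded the re-touch; if not, it evicts by insertion order.
    return "recency-aware (LRU-like)" if cache.access(0) else "insertion-ordered (FIFO-like)"

for policy in ("FIFO", "LRU"):
    print(policy, "->", fingerprint(policy))
```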


By knowing the underlying cache replacement policy, a cache-aware Web server can reschedule requests using a cached-request-first policy to obtain a performance improvement. Experiments show that cache-aware scheduling improves average response time and system throughput.

OPERATING SYSTEMS (AND DANCING BEARS)

Summarized by Matt Butner

THE JX OPERATING SYSTEM

Michael Golm, Meik Felser, Christian Wawersich, and Jürgen Kleinöder, University of Erlangen-Nürnberg

The talk opened by emphasizing the need for, and the practicality of, a functional Java OS. The JX OS attempts to mimic the recent trend toward using highly abstracted languages in application development in order to create an OS as functional and powerful as one written in a lower-level language.

JX is a microkernel solution that uses separate JVMs for each entity of the kernel and, in some cases, for each application. The separated domains do not share objects and have no thread migration, and each domain implements its own code. The interesting part of the presentation was the discussion of system-level programming with Java; key areas discussed were memory management and interrupt handlers. The authors concluded by noting that their type-safe and modular approach resulted in a robust system with great configuration flexibility and acceptable performance.

DESIGN EVOLUTION OF THE EROS SINGLE-LEVEL STORE

Jonathan S. Shapiro, Johns Hopkins University; Jonathan Adams, University of Pennsylvania

This presentation outlined current characteristics of file systems and some of their least desirable properties. It was based on the revival of the EROS single-level-store approach in current attempts to capitalize on system consistency and efficiency. The solution does not relate directly to EROS but rather to the design and use of the constructs and notions on which EROS is based. By extending the mapping of the memory system to include the disk, systems are further able to ensure global persistence without regard for a disk structure. In explaining the need for absolute system consistency and an environment which, by definition, is not allowed to crash, the goal of having such an exhaustive design becomes clear. The cool part of this work is the availability of its design and architecture to the public.

THINK: A SOFTWARE FRAMEWORK FOR COMPONENT-BASED OPERATING SYSTEM KERNELS

Jean-Philippe Fassino, France Télécom R&D; Jean-Bernard Stefani, INRIA; Julia Lawall, DIKU; Gilles Muller, INRIA

This presentation discussed the need for component-based operating systems and corresponding structures to ensure flexibility for arbitrarily sized systems. Think provides a binding model that maps uniform components for OS developers and architects to follow, ensuring consistent implementations for an arbitrary system size. However, the goal is not to force developers into a predefined kernel but to promote the use of certain components in varied ways.

Think's concentration is primarily on embedded systems, where short development time is necessary but is constrained by rigorous needs and limited resources. The need to build flexible systems in such an environment can be costly and implementation-specific, but Think creates an environment supported by the ability to dynamically load type-safe components. This allows for more flexible systems that retain functionality because of Think's ability to bind fine-grained components. Benchmarks revealed that dedicated microkernels can show performance improvements in comparison to monolithic kernels. Notable future work includes building components for low-end appliances and the development of a real-time OS component library.

BUILDING SERVICES

Summarized by Pradipta De

NINJA: A FRAMEWORK FOR NETWORK SERVICES

J. Robert von Behren, Eric A. Brewer, Nikita Borisov, Michael Chen, Matt Welsh, Josh MacDonald, Jeremy Lau, and David Culler, University of California, Berkeley

Robert von Behren presented Ninja, an ongoing project that aims to provide a framework for building robust and scalable Internet-based network services, like Web-hosting, instant-messaging, email, and file-sharing applications. Robert drew attention to the difficulties of writing cluster applications: one has to take care of data consistency and issues of concurrency, and for robust applications there are problems related to fault tolerance. Ninja works as a wrapper to relieve the user of these problems. The goals of the project are building network services that are scalable and highly available, maintaining persistent data, providing graceful degradation, and supporting online evolution.

The use of clusters distinguishes this setup from generic distributed systems in terms of reliability and security, as well as providing a partition-free network. The next important feature of this project is the new programming model, which is more restrictive than a general programming model but still expressive enough to write most applications. This model, described as "single program, multiple connection," uses intertask parallelism instead of multithreaded concurrency and is characterized by all nodes running the same program, with connections delegated to the nodes by a centralized connection manager (CM). The CM takes care of hiding the details of mapping an external connection to an internal node. Ninja also uses a design called "staged event-driven architecture," where each service is broken down into a set of stages connected by event queues. This architecture is suitable for a modular design and helps in graceful degradation by adaptive load shedding from the event queues.

The presentation concluded with examples of the implementation and evaluation of a Web server and an email system called NinjaMail, to show the ease of authoring and the efficacy of using the Ninja framework for service development.

USING COHORT SCHEDULING TO ENHANCE SERVER PERFORMANCE

James R. Larus and Michael Parkes, Microsoft Research

James Larus presented a new scheduling policy for increasing the throughput of server applications. Cohort scheduling batches the execution of similar operations arising in different server requests. The usual programming paradigm in handling server requests is to spawn multiple concurrent threads and switch from one thread to another whenever a thread gets blocked for I/O. Since the threads in a server mostly execute unrelated pieces of code, the locality of reference is reduced, and hence so is the effectiveness of the various caching mechanisms. One way to solve this problem is to throw in more hardware, but Larus presented a complementary view in which the program's behavior is investigated and used to improve performance.

The problem in this scheme is to identify pieces of code that can be batched together for processing. One simple way is to look at the next program counter values and use them to club threads together. However, this talk presented a new programming abstraction, "staged computation," which replaces the thread model with "stages." A stage is an abstraction for operations with similar behavior and locality. The StagedServer library can be used for programming in this model; it is a collection of C++ classes that implement staged computation and cohort scheduling on either a uniprocessor or multiprocessor, and it can be used to define stages.
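A toy version of the idea, written in Python rather than the StagedServer C++ classes and with invented names, shows how work queued at a stage is drained as one cohort so that similar operations run back to back:

```python
from collections import deque

class Stage:
    """Holds requests for one kind of operation until they are run as a cohort."""
    def __init__(self, name, handler):
        self.name, self.handler, self.queue = name, handler, deque()
    def enqueue(self, request):
        self.queue.append(request)

def run_cohorts(stages):
    """Round-robin over stages, draining each queue as a batch to keep code and data hot."""
    while any(s.queue for s in stages):
        for s in stages:
            batch, s.queue = list(s.queue), deque()
            for req in batch:                    # similar operations execute back to back
                s.handler(req, stages)

# Toy two-stage pipeline: "parse" hands its results to "respond".
parse = Stage("parse", lambda req, st: st[1].enqueue(req.upper()))
respond = Stage("respond", lambda req, st: print("reply:", req))
stages = [parse, respond]
for r in ("get /a", "get /b", "get /c"):
    parse.enqueue(r)
run_cohorts(stages)
```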

The presentation showed an experimental evaluation of cohort scheduling against the thread-based model, implementing two servers: a Web server, which is mainly I/O bound, and a publish-subscribe server, which is mainly compute bound. The SURGE benchmark was used for the first experiment and the Fabret workload for the second. The results showed that the cohort-scheduling-based implementation gave better throughput than the thread-based implementation at high loads.

NETWORK PERFORMANCE

Summarized by Xiaobo Fan

ETE: PASSIVE END-TO-END INTERNET SERVICE PERFORMANCE MONITORING

Yun Fu and Amin Vahdat, Duke University; Ludmila Cherkasova and Wenting Tang, HP Labs

This paper won the Best Student Paper award.

Ludmila Cherkasova began by listing several questions most Web service providers want answered in order to improve service quality. She reviewed the difficulties of making accurate and efficient end-to-end Web service measurements and the shortcomings of currently available approaches. The authors propose a passive, trace-based architecture called EtE to monitor Web server performance on behalf of end users.

The first step is to collect network packets passively. The second module reconstructs all TCP connections and extracts HTTP transactions. To reconstruct Web page accesses, they first build a knowledge base indexed by client IP and URL and then group objects into the related Web pages they are embedded in; statistical analysis is used to handle non-matched objects. EtE Monitor can generate three groups of metrics to measure Web service performance: response time, Web caching, and page abortion.

To demonstrate the benefits of EtE Monitor, Cherkasova talked about two case studies and, based on the calculated metrics, gave some insightful explanations about what's happening behind the variations in Web performance. Through validation, they claim their approach provides a very close approximation of the real scenario.

THE PERFORMANCE OF REMOTE DISPLAY MECHANISMS FOR THIN-CLIENT COMPUTING

S. Jae Yang, Jason Nieh, Matt Selsky, and Nikhil Tiwari, Columbia University

Noting the trend toward thin-client computing, the authors compared different techniques and design choices in measuring the performance of six popular thin-client platforms: Citrix MetaFrame, Microsoft Terminal Services, Sun Ray, Tarantella, VNC, and X. After pointing out several challenges in benchmarking thin clients, Yang proposed slow-motion benchmarking to achieve non-invasive packet monitoring and consistent visual quality. Basically, they insert delays between separate visual events in the benchmark applications on the server side so that the client's display updates can catch up with the server's processing speed. The experiments were conducted on an emulated network over a range of network bandwidths.

Their results show that thin clients can provide good performance for Web applications in LAN environments, but only some platforms performed well on the video benchmark. Pixel-based encoding may achieve better performance and bandwidth efficiency than high-level graphics. Display caching and compression should be used with care.

A MECHANISM FOR TCP-FRIENDLY TRANSPORT-LEVEL PROTOCOL COORDINATION

David E. Ott and Ketan Mayer-Patel, University of North Carolina, Chapel Hill

A revised transport-level protocol optimized for cluster-to-cluster (C-to-C) applications is what David Ott tries to explain in his talk.


A C-to-C application class is identified as one set of processes communicating with another set of processes across a common Internet path. The fundamental problem with C-to-C applications is how to coordinate all C-to-C communication flows so that they share a consistent view of the common C-to-C network, adapt to changing network conditions, and cooperate to meet specific requirements. Aggregation points (APs) are placed at the first and last hops of the common data path to probe network conditions (latency, bandwidth, loss rate, etc.). To carry and transfer this information, a new protocol, the Coordination Protocol (CP), is inserted between the network layer (IP) and the transport layer (TCP, UDP, etc.). Ott illustrated how the CP header is updated and used as packets originate from the source and traverse the local and remote APs to arrive at their destination, and how each AP maintains a per-cluster state table and detects network conditions. Simulation results suggest this coordination mechanism is effective in sharing common network resources among C-to-C communication flows.
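A rough picture of what an AP-updated header might carry can be sketched with Python's struct module; the field names and sizes below are entirely assumed for illustration and are not the actual CP wire format.

```python
import struct

# Hypothetical CP header: cluster id, sequence number, RTT (ms), bandwidth (kbps), loss rate.
# Field choices and sizes are assumptions, not the protocol's real layout.
CP_HDR = struct.Struct("!IIHIf")

def add_cp_header(payload, cluster_id, seq):
    """Sender prepends a CP header with empty measurement fields."""
    return CP_HDR.pack(cluster_id, seq, 0, 0, 0.0) + payload

def update_at_ap(packet, rtt_ms, bw_kbps, loss_rate):
    """An aggregation point stamps current path conditions so receiving flows can adapt."""
    cluster_id, seq, _, _, _ = CP_HDR.unpack_from(packet)
    return CP_HDR.pack(cluster_id, seq, rtt_ms, bw_kbps, loss_rate) + packet[CP_HDR.size:]

pkt = add_cp_header(b"application data", cluster_id=7, seq=1)
pkt = update_at_ap(pkt, rtt_ms=42, bw_kbps=3000, loss_rate=0.01)   # measured at the local AP
```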

STORAGE SYSTEMS

Summarized by Praveen Yalagandula

MY CACHE OR YOURS? MAKING STORAGE MORE EXCLUSIVE

Theodore M. Wong, Carnegie Mellon University; John Wilkes, HP Labs

Theodore Wong explained the inefficiency of current "inclusive" caching schemes in storage area networks: when a client accesses a block, the block is read from disk and is cached at both the disk array cache and the client's cache. He then presented the concept of "exclusive" caching, where a block accessed by a client is cached only in that client's cache. On eviction from the client's cache, a new DEMOTE operation moves the data block to the tail of the array cache.
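A minimal sketch of the read path under exclusive caching with DEMOTE, using simple LRU lists and invented helper names (the real clients and arrays obviously do much more), might look like this:

```python
from collections import OrderedDict

class LRUCache(OrderedDict):
    """Blocks kept in recency order; the head is the next eviction victim."""
    def __init__(self, size):
        super().__init__()
        self.size = size
    def insert_at_tail(self, block):
        self[block] = True
        if len(self) > self.size:
            return self.popitem(last=False)[0]   # return the evicted block, if any
        return None

client, array = LRUCache(4), LRUCache(16)

def read(block):
    if block in client:                          # client cache hit: just refresh recency
        client.move_to_end(block)
        return "client hit"
    if block in array:                           # array hit: hand the block to the client...
        del array[block]                         # ...and drop it there (exclusive caching)
        outcome = "array hit"
    else:
        outcome = "disk read"                    # miss everywhere: fetch from disk, cache at client only
    victim = client.insert_at_tail(block)
    if victim is not None:
        array.insert_at_tail(victim)             # DEMOTE: client's evicted block goes to the array tail
    return outcome
```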

The new exclusive caching schemes were evaluated on both single-client and multiple-client systems. The single-client results are presented for two different types of synthetic workloads: Random (transaction-processing-type workloads) and Zipf (typical Web workloads). The exclusive policy was quite effective in achieving higher hit rates and lower read latencies. They also showed a 2.2 times hit-rate improvement over inclusive techniques in the case of a real-life workload (httpd) with a single-client setting.

The DEMOTE scheme also performed well in the case of multiple-client systems when the data accessed by the clients is disjoint; for shared-data workloads, it performed worse than the inclusive schemes. Theodore then presented an adaptive exclusive caching scheme: a block accessed by a client is placed at the tail of the client's cache and also in the array cache, at a position determined by the popularity of the block. The popularity of a block is measured by maintaining a ghost cache that accumulates the number of times each block is accessed. This new adaptive scheme achieved a maximum of 52% speedup in mean latency in the experiments with real-life workloads.

For more information, visit http://www.cs.cmu.edu/~tmwong/research and http://www.hpl.hp.com/research/itc/csl/ssp.

BRIDGING THE INFORMATION GAP IN STORAGE PROTOCOL STACKS

Timothy E. Denehy, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin, Madison

Currently, there is a huge information gap between storage systems and file systems. The interface exposed by storage systems to file systems, based on blocks and providing only read/write operations, is very narrow. This leads to poor performance as a whole, because of duplicated functionality in both systems and reduced functionality resulting from the storage system's lack of file information.

The speaker presented two enhancements: ExRAID, an exposed RAID, and ILFS, the Informed Log-Structured File System. ExRAID is an enhancement to the block-based RAID storage system that exposes the following three types of information to the file system: (1) regions – contiguous portions of the address space comprising one or multiple disks; (2) performance information – queue lengths and throughput of the regions, revealing disk heterogeneity to the file system; and (3) failure information – dynamically updated information conveying the number of tolerable failures in each region.
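The exposed information amounts to a richer volume descriptor. A sketch of what a file system might see, with field names assumed purely for illustration, and one way ILFS-style code could use it to steer writes:

```python
from dataclasses import dataclass

@dataclass
class Region:
    """One ExRAID-style region as a file system might see it (field names are assumptions)."""
    start_block: int
    end_block: int              # (1) contiguous address range backed by one or more disks
    queue_length: int           # (2) performance: outstanding requests right now
    throughput_mbps: float      #     and recently observed throughput
    tolerable_failures: int     # (3) failure info, updated dynamically as disks fail or return

def pick_region(regions, min_redundancy):
    """Steer a write to the least-loaded region that can still tolerate enough failures."""
    ok = [r for r in regions if r.tolerable_failures >= min_redundancy]
    return min(ok, key=lambda r: r.queue_length) if ok else None

regions = [Region(0, 1 << 20, 12, 45.0, 1), Region(1 << 20, 2 << 20, 3, 80.0, 2)]
print(pick_region(regions, min_redundancy=1))
```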

ILFS allows online incremental expansion of the storage space, performs dynamic parallelism using ExRAID's performance information to perform well on heterogeneous storage systems, and provides a range of different mechanisms, at different granularities, for file redundancy using ExRAID's region failure characteristics. These new features were added to LFS with only a 19% increase in code size.

More information: http://www.cs.wisc.edu/wind.

MAXIMIZING THROUGHPUT IN REPLICATED DISK STRIPING OF VARIABLE BIT-RATE STREAMS

Stergios V. Anastasiadis, Duke University; Kenneth C. Sevcik and Michael Stumm, University of Toronto

There is an increasing demand for the continuous real-time streaming of media files. Disk striping is a common technique used for supporting many connections on media servers, but even with 1.2 million hours of mean time between failures, there will be more than one disk failure per week with 7,000 disks. Fault tolerance can be achieved by using data replication and reserving extra bandwidth during normal operation. This work focused on supporting the most common variable bit-rate stream formats (e.g., MPEG).


The authors used the prototype media server EXEDRA to implement and evaluate different replication and bandwidth reservation schemes. The system supports variable bit-rate streams, does stride-based disk allocation, and is capable of supporting variable-grain disk striping. Two replication policies are presented: deterministic, where data blocks are replicated in a round-robin fashion on the secondary replicas, and random, where data blocks are replicated on randomly chosen secondary replicas. Two bandwidth reservation techniques are also presented: mirroring reservation, in which disk bandwidth is reserved for both the primary and the replicas of the media file during playback, and minimum reservation, a more efficient scheme in which bandwidth is reserved only for the sum of the primary data access time and the maximum of the backup data access times required in each round.
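The difference between the two reservation schemes comes down to a small amount of per-round arithmetic; the numbers below are made up purely to illustrate it.

```python
# Hypothetical per-round disk access times (arbitrary units) for one stream's data.
t_primary = 0.40
t_backups = [0.25, 0.30, 0.20]          # access times of the stream's secondary replicas

mirroring_reservation = t_primary + sum(t_backups)   # reserve for primary and every replica
minimum_reservation   = t_primary + max(t_backups)   # reserve for primary plus the worst-case single backup

print(mirroring_reservation, minimum_reservation)    # 1.15 vs 0.70 of the round's bandwidth
```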

Experimental results showed that deterministic replica placement is better than random placement for small disks, that minimum disk bandwidth reservation achieves twice the throughput of mirroring, and that fault tolerance can be achieved with a minimal impact on throughput.

TOOLS

Summarized by Li Xiao

SIMPLE AND GENERAL STATISTICAL PROFILING WITH PCT

Charles Blake and Steven Bauer, MIT

This talk introduced the Profile Collection Toolkit (PCT), a sampling-based CPU profiling facility. A novel aspect of PCT is that it allows sampling of semantically rich data such as function call stacks, function parameters, local or global variables, CPU registers, or other execution contexts. The design objectives of PCT were driven by user needs and the inadequacies or inaccessibility of prior systems.

The rich data collection capability is achieved via a debugger-controller program, dbctl. The talk then introduced the profiling collector and data collection activation. PCT file format options provide flexibility and various ways to store exact states. PCT also has analysis options; two examples were given, "Simple" and "Mixed User and Linux Code."

The talk then went into how general sampling helps in the following areas: a debugger-controller state machine, example script fragments, a simple example program, and general value reports in terms of call-path histograms, numeric value histograms, and data generality. Related work such as gprof and Expect was then summarized. The contribution of this work is to provide a portable, general value-sampling tool. The limitations are that PCT does not support much of strip, and count inaccuracies happen because of its statistical nature. Based on this limitation, -g is preferred over strip.

Possible future directions included supporting more debuggers, such as dbx and xdb, script generators for other kinds of traces, more canned reports for general values, and a libgdb-based library sampler. PCT is available at http://pdos.lcs.mit.edu/pct. PCT runs on almost any UNIX-like system.

ENGINEERING A DIFFERENCING AND COMPRESSION DATA FORMAT

David G. Korn and Kiem-Phong Vo, AT&T Laboratories – Research

This talk began with an equation: "Differencing + Compression = Delta Compression." Compression removes information redundancy in a data set; examples are gzip, bzip, compress, and pack. Differencing encodes differences between two data sets; examples are diff -e, fdelta, bdiff, and xdelta. Delta compression compresses a data set given another, combining differencing and compression, and reduces to pure compression when there's no commonality.
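
The equation can be illustrated with standard library pieces: compute a copy/insert script between an old and a new version (differencing), then compress that script. The sketch below uses Python's difflib and zlib purely to illustrate the idea; it is not the Vcdiff encoding.

    # Illustration of "Differencing + Compression = Delta Compression"
    # using difflib (differencing) and zlib (compression). Not Vcdiff.
    import difflib
    import zlib

    def make_delta(old: bytes, new: bytes) -> bytes:
        """Encode `new` as copy/insert instructions against `old`, then compress."""
        sm = difflib.SequenceMatcher(a=old, b=new)
        instructions = []
        for op, i1, i2, j1, j2 in sm.get_opcodes():
            if op == "equal":
                instructions.append(b"C %d %d\n" % (i1, i2 - i1))        # copy from old
            else:
                instructions.append(b"I %d\n" % (j2 - j1) + new[j1:j2])  # insert literal data
        return zlib.compress(b"".join(instructions))

    def apply_delta(old: bytes, delta: bytes) -> bytes:
        out = []
        data = zlib.decompress(delta)
        pos = 0
        while pos < len(data):
            nl = data.index(b"\n", pos)
            fields = data[pos:nl].split()
            if fields[0] == b"C":
                start, length = int(fields[1]), int(fields[2])
                out.append(old[start:start + length])
                pos = nl + 1
            else:
                length = int(fields[1])
                out.append(data[nl + 1:nl + 1 + length])
                pos = nl + 1 + length
        return b"".join(out)

    if __name__ == "__main__":
        old = b"the quick brown fox jumps over the lazy dog\n" * 100
        new = old.replace(b"lazy", b"sleepy", 3)
        delta = make_delta(old, new)
        assert apply_delta(old, delta) == new
        print(len(new), len(zlib.compress(new)), len(delta))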

After presenting an overall scheme for a delta compressor and showing delta compression performance, this talk discussed the encoding format of the newly designed Vcdiff for delta compression. A Vcdiff instruction code table consists of 256 entries, each coding up to a pair of instructions, and encodes 1-byte indices and any additional data.

The talk then showed Vcdiff's performance with Web data collected from CNN, compared with the W3C standard Gdiff, Gdiff+gzip, and diff+gzip, where Gdiff was computed using Vcdiff delta instructions. The results of two experiments, "First" and "Successive," were presented. In "First," each file is compressed against the first file collected; in "Successive," each file is compressed against the one in the previous hour. The diff+gzip combination did not work well because diff is line-oriented. Vcdiff performed favorably compared with other formats.

Vcdiff is part of the Vcodex package. Vcodex is a platform for all common data transformations: delta compression, plain compression, encryption, and transcoding (e.g., uuencode, base64). It is structured in three layers for maximum usability; the base library uses Discipline and Method interfaces.

The code for Vcdiff can be found at http://www.research.att.com/sw/tools. They are moving Vcdiff to the IETF standard as a comprehensive platform for transforming data. Please refer to http://www.ietf.org/internet-drafts/draft-korn-vcdiff-06.txt.

WHERE IN THE NET

Summarized by Amit Purohit

A PRECISE AND EFFICIENT EVALUATION OF THE PROXIMITY BETWEEN WEB CLIENTS AND THEIR LOCAL DNS SERVERS

Zhuoqing Morley Mao, University of California, Berkeley; Charles D. Cranor, Fred Douglis, Michael Rabinovich, Oliver Spatscheck, and Jia Wang, AT&T Labs – Research

Content Distribution Networks (CDNs) attempt to improve Web performance by delivering Web content to end users from servers located at the edge of the network. When a Web client requests content, the CDN dynamically chooses a server to route the request to, usually the one that is nearest the client. As CDNs have access only to the IP address of the local DNS server (LDNS) of the client, the CDN's authoritative DNS server maps the client's LDNS to a geographic region within a particular network and combines that with network and server load information to perform CDN server selection.

This method has two limitations. First, it is based on the implicit assumption that clients are close to their LDNS; this may not always be valid. Second, a single request from an LDNS can represent a differing number of Web clients – called the hidden load factor. Knowledge of the hidden load factor can be used to achieve better load distribution.

The extent of the first limitation and its impact on CDN server selection is dealt with. To determine the associations between clients and their LDNS, a simple, non-intrusive, and efficient mapping technique was developed. The data collected was used to study the impact of proximity on DNS-based server selection, using four different proximity metrics: (1) autonomous system (AS) clustering – observing whether a client is in the same AS as its LDNS – concluded that LDNS is very good for coarse-grained server selection, as 64% of the associations belong to the same AS; (2) network clustering – observing whether a client is in the same network-aware cluster (NAC) – implied that DNS is less useful for finer-grained server selection, since only 16% of the clients and LDNS are in the same NAC; (3) traceroute divergence – the length of the divergent paths to the client and its LDNS from a probe point, using traceroute – implies that most clients are topologically close to their LDNS as viewed from a randomly chosen probe site; (4) round-trip time (RTT) correlation (some CDNs select servers based on RTT between the CDN server and the client's LDNS) examines the correlation between the message RTTs from a probe point to the client and its local DNS; the results imply this is a reasonable metric to use to avoid really distant servers.
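
The first two metrics boil down to grouping client and LDNS addresses by a common key (an AS number or a network-aware cluster) and measuring how often the two fall in the same group. A minimal sketch of that bookkeeping is below; the ip_to_as mapping is a made-up stand-in for whatever BGP-table or clustering data would actually be used.

    # Sketch: fraction of client/LDNS associations that share a grouping key
    # (e.g., the same AS). The ip_to_as table here is invented for illustration.

    def same_group_fraction(associations, key_of):
        """associations: iterable of (client_ip, ldns_ip); key_of: ip -> group key."""
        total = matched = 0
        for client, ldns in associations:
            ck, lk = key_of(client), key_of(ldns)
            if ck is None or lk is None:
                continue  # skip addresses we cannot map
            total += 1
            matched += (ck == lk)
        return matched / total if total else 0.0

    if __name__ == "__main__":
        ip_to_as = {  # hypothetical prefix-lookup results flattened to a dict
            "10.1.2.3": 7018, "10.1.9.8": 7018,
            "192.0.2.17": 701, "192.0.2.53": 3356,
        }
        pairs = [("10.1.2.3", "10.1.9.8"), ("192.0.2.17", "192.0.2.53")]
        print(same_group_fraction(pairs, ip_to_as.get))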

A study of the impact that client-LDNS associations have on DNS-based server selection concludes that knowing the client's IP address would allow more accurate server selection in a large number of cases. The optimality of the server selection also depends on the server density, placement, and selection algorithms.

Further information can be found at http://www.eecs.berkeley.edu/~zmao/myresearch.html or by contacting zmao@eecs.berkeley.edu.

GEOGRAPHIC PROPERTIES OF INTERNET ROUTING

Lakshminarayanan Subramanian, Venkata N. Padmanabhan, Microsoft Research; Randy H. Katz, University of California at Berkeley

Geographic information can provide insights into the structure and functioning of the Internet, including interactions between different autonomous systems, by analyzing certain properties of Internet routing. It can be used to measure and quantify routing properties such as circuitous routing, hot-potato routing, and geographic fault tolerance.

Traceroute has been used to gather the required data, and the GeoTrack tool has been used to determine the location of the nodes along each network path. This enables the computation of "linearized distances," which is the sum of the geographic distances between successive pairs of routers along the path.

In order to measure the circuitousness of a path, a metric called the "distance ratio" has been defined as the ratio of the linearized distance of a path to the geographic distance between the source and destination of the path. From the data it has been observed that the circuitousness of a route depends on both the geographic and network locations of the end hosts. A large value of the distance ratio enables us to flag paths that are highly circuitous, possibly (though not necessarily) because of routing anomalies. It is also shown that the minimum delay between end hosts is strongly correlated with the linearized distance of the path.
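
A small computation makes the two quantities concrete: the linearized distance sums great-circle distances between successive routers, and the distance ratio divides that by the end-to-end great-circle distance. The coordinates below are invented purely to exercise the arithmetic.

    # Sketch: linearized distance and distance ratio for a path of router
    # coordinates (latitude, longitude in degrees). Coordinates are made up.
    import math

    EARTH_RADIUS_KM = 6371.0

    def great_circle_km(a, b):
        """Haversine distance between two (lat, lon) points, in kilometers."""
        lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
        h = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

    def linearized_distance(path):
        return sum(great_circle_km(path[i], path[i + 1]) for i in range(len(path) - 1))

    def distance_ratio(path):
        end_to_end = great_circle_km(path[0], path[-1])
        return linearized_distance(path) / end_to_end if end_to_end else float("inf")

    if __name__ == "__main__":
        # Hypothetical path: Seattle -> Chicago -> New York -> Washington, DC
        path = [(47.6, -122.3), (41.9, -87.6), (40.7, -74.0), (38.9, -77.0)]
        print(round(linearized_distance(path), 1), round(distance_ratio(path), 3))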

Geographic information can be used to study various aspects of wide-area Internet paths that traverse multiple ISPs. It was found that end-to-end Internet paths tend to be more circuitous than intra-ISP paths, the cause for this being the peering relationships between ISPs. Also, paths traversing substantial distances within two or more ISPs tend to be more circuitous than paths largely traversing only a single ISP. Another finding is that ISPs generally employ hot-potato routing.

Geographic information can also be used to capture the fact that two seemingly unrelated routers can be susceptible to correlated failures. By using the geographic information of routers, we can construct a geographic topology of an ISP and, using this, find the tolerance of an ISP's network to the total network failure in a geographic region.

For further information, contact lakme@cs.berkeley.edu.

PROVIDING PROCESS ORIGIN INFORMATION TO AID IN NETWORK TRACEBACK

Florian P. Buchholz, Purdue University; Clay Shields, Georgetown University

Network traceback is currently limited because host audit systems do not maintain enough information to match incoming network traffic to outgoing network traffic. The talk presented an alternative: assigning origin information to every process and logging it during interactive login creation.

The current implementation concentrates mainly on interactive sessions, in which an event is logged when a new connection is established using SSH or Telnet. The method is effective and could successfully determine stepping stones and reliably detect the source of a DDoS attack. The speaker then talked about the related work done in this area, which led to discussion of the latest developments in the area of "host causality."

The speaker ended with the limitations and future directions of the framework. This framework could be extended and could find many applications in the area of security. The current implementation doesn't take care of startup scripts and cron jobs, but incorporating the origin information in the file system could solve this problem. In the current implementation, logging is just implemented as a proof of concept. It could be made safe in many ways, and this could be another important aspect of future work.

PROGRAMMING

Summarized by Amit Purohit

CYCLONE: A SAFE DIALECT OF C

Trevor Jim, AT&T Labs – Research; Greg Morrisett, Dan Grossman, Michael Hicks, James Cheney, and Yanling Wang, Cornell University

Cyclone is designed to provide protection against attacks such as buffer overflows, format string attacks, and memory management errors. The current structure of C allows programmers to write vulnerable programs. Cyclone extends C so that it has the safety guarantees of Java while keeping C syntax, types, and semantics largely untouched.

The Cyclone compiler performs static analysis of the source code and inserts runtime checks into the compiled output at places where the analysis cannot determine that the code execution will not violate safety constraints. Cyclone imposes many restrictions to preserve safety, such as NULL checks; these checks do not exist in normal C.

The speaker then talked about some sample code written in Cyclone and how it tackles safety problems. Porting an existing C application to Cyclone is pretty easy, with fewer than 10% of the code requiring change. The current implementation concentrates more on safety than on performance – hence the performance penalty is substantial. Cyclone was able to find many lingering bugs. The final step of the project will be to write a compiler to convert a normal C program to an equivalent Cyclone program.

COOPERATIVE TASK MANAGEMENT WITHOUT MANUAL STACK MANAGEMENT

Atul Adya, Jon Howell, Marvin Theimer, William J. Bolosky, and John R. Douceur, Microsoft Research

The speaker described the definitions and motivations behind the distinct concepts of task management, stack management, I/O management, conflict management, and data partitioning. Conventional concurrent programming uses preemptive task management and exploits the automatic stack management of a standard language. In the second approach, cooperative tasks are organized as event handlers that yield control by returning control to the event scheduler, manually unrolling their stacks. In this project they have adopted a hybrid approach that makes it possible for both stack management styles to coexist. Thus, programmers can code assuming one or the other of these stack management styles will operate, depending upon the application. The speaker also gave a detailed example of how to use their mechanism to switch between the two styles.

The talk then continued with the implementation. They were able to preempt many subtle concurrency problems by using cooperative task management. Paying a cost up-front to reduce a subtle race condition proved to be a good investment.

Though the choice of task management is fundamental, the choice of stack management can be left to individual taste. This project enables use of any type of stack management in conjunction with cooperative task management.

IMPROVING WAIT-FREE ALGORITHMS FOR INTERPROCESS COMMUNICATION IN EMBEDDED REAL-TIME SYSTEMS

Hai Huang, Padmanabhan Pillai, and Kang G. Shin, University of Michigan

The main characteristic of a real-time, time-sensitive system is its predictable response time. But concurrency management creates hurdles to achieving this goal because of the use of locks to maintain consistency. To solve this problem, many wait-free algorithms have been developed, but these are typically high-cost solutions. By taking advantage of the temporal characteristics of the system, however, the time and space overhead can be reduced.

The speaker presented an algorithm for temporal concurrency control and applied this technique to improve three wait-free algorithms. A single-writer multiple-reader wait-free algorithm and a double-buffer algorithm were proposed.
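
The double-buffer idea relies on exactly the kind of temporal knowledge described above: if a reader is known to finish within one writer period, the writer can alternate between two slots and publish the index of the last complete one, and neither side ever blocks. The sketch below is a simplified illustration under that timing assumption, not the authors' algorithm.

    # Simplified double-buffer IPC sketch: a single writer alternates slots and
    # publishes the index of the last fully written one; readers never block.
    # Correct only under the timing assumption that a read finishes before the
    # writer reuses that slot (the kind of temporal bound the talk exploits).
    import threading
    import time

    class DoubleBuffer:
        def __init__(self):
            self._slots = [None, None]
            self._latest = 0  # index of the last completely written slot

        def write(self, value):          # called only by the single writer
            target = 1 - self._latest    # write into the slot readers are not using
            self._slots[target] = value
            self._latest = target        # publish (atomic reference update)

        def read(self):                  # called by any number of readers
            return self._slots[self._latest]

    if __name__ == "__main__":
        buf = DoubleBuffer()
        buf.write(("init", 0))
        stop = time.time() + 0.2

        def writer():
            n = 0
            while time.time() < stop:
                buf.write(("sample", n))
                n += 1

        def reader():
            while time.time() < stop:
                tag, n = buf.read()      # always sees a complete (tag, n) pair

        threads = [threading.Thread(target=writer)]
        threads += [threading.Thread(target=reader) for _ in range(3)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        print("final value:", buf.read())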

Using their transformation mechanism, they achieved an improvement of 17–66% in ACET and a 14–70% reduction in memory requirements for IPC algorithms. This mechanism is extensible and can be applied to other non-blocking IPC algorithms as well. Future work involves reducing the synchronizing overheads in more general IPC algorithms with multiple-writer semantics.

MOBILITY

Summarized by Praveen Yalagandula

ROBUST POSITIONING ALGORITHMS FOR DISTRIBUTED AD-HOC WIRELESS SENSOR NETWORKS

Chris Savarese and Jan Rabaey, University of California, Berkeley; Koen Langendoen, Delft University of Technology

The Pico Radio Network comprises more than a hundred sensors, monitors, and actuators equipped with wireless connectivity, with the following properties: (1) no infrastructure; (2) computation in a distributed fashion instead of centralized computation; (3) dynamic topology; (4) limited radio range; (5) nodes that can act as repeaters; (6) devices of low cost and with low power; and (7) sparse anchor nodes – nodes with GPS capability. A positioning algorithm determines the geographical position of each node in the network.

In two-dimensional space, each node needs three reference positions to estimate its geographical position. There are two problems that make positioning difficult in the Pico Radio Network setting: (1) a sparse anchor node problem, and (2) a range error problem.

The two-phase approach that the authors have taken in solving the problem consists of (1) a hop-terrain algorithm – in this first phase, each node roughly guesses its location by the distance calculated using multi-hops to the anchor points; and (2) a refinement algorithm – in this second phase, each node uses its neighbors' positions to refine its own position estimate. To guarantee convergence, this approach uses confidence metrics, assigning a value of 1.0 for anchor nodes and 0.1 to start for other nodes, and increasing these with each iteration.
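
The refinement phase can be pictured as each non-anchor node repeatedly nudging its estimate toward consistency with the measured ranges to its neighbors, weighting neighbors by their confidence. The sketch below is a toy version of that iteration (simple weighted corrective steps), not the authors' exact algorithm, and the network and ranges are invented.

    # Toy refinement iteration: each free node adjusts its position estimate so
    # its distances to neighbors better match measured ranges, weighting each
    # neighbor by a confidence value (1.0 for anchors, lower for estimates).
    # Illustrative only; not the hop-terrain/refinement algorithm itself.
    import math

    def refine(positions, anchors, ranges, confidence, iterations=50, step=0.2):
        for _ in range(iterations):
            for node, neighbors in ranges.items():
                if node in anchors:
                    continue
                x, y = positions[node]
                dx = dy = 0.0
                for nbr, measured in neighbors.items():
                    nx, ny = positions[nbr]
                    dist = math.hypot(x - nx, y - ny) or 1e-9
                    err = dist - measured          # positive if we are too far away
                    w = confidence[nbr]
                    dx -= w * err * (x - nx) / dist
                    dy -= w * err * (y - ny) / dist
                positions[node] = (x + step * dx, y + step * dy)
                confidence[node] = min(1.0, confidence[node] + 0.05)
        return positions

    if __name__ == "__main__":
        anchors = {"A": (0, 0), "B": (10, 0), "C": (0, 10)}
        positions = dict(anchors, N=(5.0, 5.0))          # rough initial guess for N
        confidence = {"A": 1.0, "B": 1.0, "C": 1.0, "N": 0.1}
        true_n = (4.0, 3.0)
        ranges = {"N": {a: math.hypot(true_n[0] - p[0], true_n[1] - p[1])
                        for a, p in anchors.items()}}
        print(refine(positions, anchors, ranges, confidence)["N"])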

A simulation tool, OMNeT++, was used for both phases and in various scenarios. The results show that they achieved position errors of less than 33% in a scenario with 5% anchor nodes, an average connectivity of 7, and 5% range measurement error.

Guidelines for anchor node deployment are high connectivity (>10), a reasonable fraction of anchor nodes (>5%), and, for anchor placement, covered edges.

APPLICATION-SPECIFIC NETWORK MANAGEMENT FOR ENERGY-AWARE STREAMING OF POPULAR MULTIMEDIA FORMATS

Surendar Chandra, University of Georgia, and Amin Vahdat, Duke University

The main hindrance in supporting the increasing demand for mobile multimedia on PDAs is the battery capacity of these small devices. The idle-time power consumption is almost the same as the receive power consumption on typical wireless network cards (1319mW vs. 1425mW), while the sleep state consumes far less power (177mW). To let the system consume energy proportional to the stream quality, the network card should transition to the sleep state aggressively between packets.

The speaker presented previous work in which different multimedia streams were studied for PDAs: MS Media, Real Media, and QuickTime. It was found that if the inter-packet gap is predictable, then huge savings are possible – for example, about 80% savings in MS Media streams with few packet losses. The limitation of the IEEE 802.11b power-save mode comes into play when there are two or more nodes, producing a delay between beacon time and when the node receives/sends packets. This badly affects the streams, since higher energy is consumed while waiting.

The authors propose traffic shaping for energy conservation, where this is done by proxy such that packets arrive at regular intervals. This is achieved using a local proxy in the access point and a client-side proxy in the mobile host. Simulations show that traffic shaping reduces energy consumption and also reveal that the higher bandwidth streams have a lower energy metric (mJ/kB).
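
The energy argument can be sketched with simple arithmetic: if the card can sleep between packets only when the next arrival is known, regular (shaped) arrivals let it sleep through almost every gap, while irregular arrivals force it to idle. The numbers below reuse the power figures quoted above; the packet schedules and the 10ms wake-up overhead are invented for illustration.

    # Back-of-the-envelope energy model: idle vs. sleep between packet arrivals.
    # Power figures (mW) are the ones quoted in the talk; the packet schedules
    # and the 10ms wake-up overhead are invented.
    IDLE_MW, SLEEP_MW, WAKE_OVERHEAD_S = 1319.0, 177.0, 0.010

    def gap_energy_mj(gaps_s, predictable):
        """Energy spent between packets: sleep through predictable gaps, idle otherwise."""
        total = 0.0
        for gap in gaps_s:
            if predictable and gap > WAKE_OVERHEAD_S:
                total += SLEEP_MW * (gap - WAKE_OVERHEAD_S) + IDLE_MW * WAKE_OVERHEAD_S
            else:
                total += IDLE_MW * gap
        return total  # mW * s == mJ

    if __name__ == "__main__":
        shaped = [0.1] * 100                      # proxy delivers a packet every 100ms
        bursty = [0.02] * 80 + [0.42] * 20        # same total time, unpredictable bursts
        print("shaped:", round(gap_energy_mj(shaped, True)), "mJ")
        print("bursty:", round(gap_energy_mj(bursty, False)), "mJ")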

More information is at http://greenhouse.cs.uga.edu.

CHARACTERIZING ALERT AND BROWSE SERVICES OF MOBILE CLIENTS

Atul Adya, Paramvir Bahl, and Lili Qiu, Microsoft Research

Even though there is a dramatic increase in Internet access from wireless devices, there are not many studies done on characterizing this traffic. In this paper, the authors characterize the traffic observed on the MSN Web site with both notification and browse traces.


Around 33 million browsing accesses and about 325 million notification entries are present in the traces. Three types of analyses are done for each of these two services: content analysis, concerning the most popular content categories and their distribution; popularity analysis, or the popularity distribution of documents; and user behavior analysis.

The analysis of the notification logs shows that document access rates follow a Zipf-like distribution, with most of the accesses concentrated on a small number of messages, and the accesses exhibit geographical locality – users from the same locality tend to receive similar notification content. The browser log analysis shows that a smaller set of URLs is accessed most of the time, though the access pattern does not fit any Zipf curve, and the highly accessed URLs remain stable. A correlation study between the notification and browsing services shows that wireless users have a moderate correlation of 0.12.
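
Checking for Zipf-like behavior in such logs is mostly a matter of ranking items by access count and seeing whether the count falls off roughly as a power of rank. The sketch below does that for a synthetic access log; the data and the fitting approach are illustrative, not the authors' analysis.

    # Sketch: rank documents by access count and fit a Zipf-like power law,
    # count ~ C / rank^alpha, by linear regression in log-log space.
    # The access log here is synthetic.
    import collections
    import math
    import random

    def zipf_exponent(access_log):
        counts = sorted(collections.Counter(access_log).values(), reverse=True)
        xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
        ys = [math.log(c) for c in counts]
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
                sum((x - mx) ** 2 for x in xs)
        return -slope  # alpha: close to 1 for classic Zipf behavior

    if __name__ == "__main__":
        random.seed(0)
        # Synthetic log: document i is requested with probability proportional to 1/i.
        docs = list(range(1, 201))
        weights = [1.0 / i for i in docs]
        log = random.choices(docs, weights=weights, k=50000)
        print("estimated alpha:", round(zipf_exponent(log), 2))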

FREENIX TRACK SESSIONS

BUILDING APPLICATIONS

Summarized by Matt Butner

INTERACTIVE 3-D GRAPHICS APPLICATIONS FOR TCL

Oliver Kersting and Jürgen Döllner, Hasso Plattner Institute for Software Systems Engineering, University of Potsdam

The integration of 3-D image rendering functionality into a scripting language permits interactive and animated 3-D development and application without the formalities and precision demanded by low-level C/C++ graphics and visualization libraries. The large and complex C++ API of the Virtual Rendering System (VRS) can be combined with the conveniences of the Tcl scripting language. The mapping of class interfaces is done via an automated process that generates respective wrapper classes, all of which ensures complete API accessibility and functionality without any significant performance penalty. The general C++-to-Tcl mapper SWIG grants the necessary object and hierarchical abilities without the use of object-oriented Tcl extensions. The final API mapping techniques address C++ features such as classes, overloaded methods, enumerations, and inheritance relations. All are implemented in a proof-of-concept that maps the complete C++ API of the VRS to Tcl and are showcased in a complete interactive 3-D map system.

THE AGFL GRAMMAR WORK LAB

Cornelis H.A. Koster and Erik Verbruggen, University of Nijmegen (KUN)

The growth and implementation of Natural Language Processing (NLP) is the cornerstone of the continued evolution and implementation of truly intelligent search machines and services. In part due to the growing collections of computer-stored, human-readable documents in the public domain, the implementation of linguistic analysis will become necessary for desirable precision and resolution. Subsequently, such tools and linguistic resources must be released into the public domain, and they have done so with the release of the AGFL Grammar Work Lab under the GNU Public License as a tool for linguistic research and the development of NLP-based applications.

The AGFL (Affix Grammars over a Finite Lattice) Grammar Work Lab meshes context-free grammars with finite set-valued features that are applicable to a range of languages. In computer science terms, "Syntax rules are procedures with parameters and a non-deterministic execution." The English Phrases for Information Retrieval (EP4IR), released with the AGFL-GWL as a robust grammar of English, is an AGFL-GWL-generated English parser that outputs "Head/Modifier" frames. The sentences "CompanyX sponsored this conference" and "This conference was sponsored by CompanyX" both generate [CompanyX[sponsored conference]]. The hope is that public availability of such tools will encourage further development of grammar and lexical software systems.

SWILL: A SIMPLE EMBEDDED WEB SERVER LIBRARY

Sotiria Lampoudi and David M. Beazley, University of Chicago

This paper won the FREENIX Best Student Paper award.

SWILL (Simple Web Interface Link Library) is a simple Web server in the form of a C library whose development was motivated by a wish to give cool applications an interface to the Web. The SWILL library provides a simple interface that can be efficiently implemented for tasks that vary from flexible Web-based monitoring to software debugging and diagnostics. Though originally designed to be integrated with high-performance scientific simulation software, the interface is generic enough to allow for unbounded uses. SWILL is a single-threaded server relying upon non-blocking I/O through the creation of a temporary server which services I/O requests. SWILL does not provide SSL or cryptographic authentication but does have HTTP authentication abilities.
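
The general pattern SWILL represents – embedding a small HTTP interface inside an application so its internal state can be inspected from a browser – can be sketched in a few lines with Python's standard library. This is only an analogy to illustrate the idea; it does not use SWILL or its C API, and the application state shown is hypothetical.

    # Sketch of an embedded web interface for a running application, in the
    # spirit of SWILL but using only Python's standard library (not SWILL's API).
    import http.server
    import json
    import threading
    import time

    APP_STATE = {"iterations": 0, "residual": 1.0}   # hypothetical simulation state

    class StatusHandler(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            body = json.dumps(APP_STATE).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep the embedded server quiet

    def start_monitor(port=8080):
        server = http.server.HTTPServer(("127.0.0.1", port), StatusHandler)
        threading.Thread(target=server.serve_forever, daemon=True).start()
        return server

    if __name__ == "__main__":
        server = start_monitor()
        for i in range(5):                 # stand-in for the application's main loop
            APP_STATE["iterations"] = i
            APP_STATE["residual"] /= 2
            time.sleep(0.1)                # meanwhile, curl http://127.0.0.1:8080/
        server.shutdown()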

A fantastic feature of SWILL is its support for SPMD-style parallel applications which utilize MPI, proving valuable for Beowulf clusters and large parallel machines. Another practical application was the implementation of SWILL in a modified Yalnix emulator by University of Chicago operating systems courses, which utilized the added Yalnix functionality for OS development and debugging. SWILL requires minimal memory overhead and relies upon the HTTP/1.0 protocol.

NETWORK PERFORMANCE

Summarized by Florian Buchholz

LINUX NFS CLIENT WRITE PERFORMANCE

Chuck Lever, Network Appliance; Peter Honeyman, CITI, University of Michigan

Lever introduced a benchmark to measure NFS client write performance. Client performance is difficult to measure due to hindrances such as poor hardware or bandwidth limitations. Furthermore, measuring application performance does not identify weaknesses specifically at the client side. Thus, a benchmark was developed trying to exercise only data transfers in one direction between server and application. For this purpose, the benchmark was based on the block sequential write portion of the Bonnie file system benchmark. Once a benchmark for NFS clients is established, it can be used to improve client performance.

The performance measurements were performed with an SMP Linux client and both a Linux NFS server and a Network Appliance F85 filer. During testing, a periodic jump in write latency was discovered. This was due to a rather large number of pending write operations that were scheduled to be written after certain threshold values were exceeded. By introducing a separate daemon that flushes the cached write requests, the spikes could be eliminated, but as a result the average latency grows over time. The problem could be traced to a function that scans a linked list of write requests. After a hashtable was added to improve lookup performance, the latency improved considerably.

The improved client was then used to measure throughput against the two servers. A discrepancy between the Linux server and the filer test was noticed, and the reason for it was traced back to a global kernel lock that was unnecessarily held when accessing the network stack. After correcting this, performance improved further. However, the measurements showed that a client may run slower when paired with fast servers on fast networks. This is due to heavy client interrupt loads, more network processing on the client side, and more global kernel lock contention.

The source code of the project is available at http://www.citi.umich.edu/projects/nfs-perf/patches.

A STUDY OF THE RELATIVE COSTS OF NETWORK SECURITY PROTOCOLS

Stefan Miltchev and Sotiris Ioannidis, University of Pennsylvania; Angelos Keromytis, Columbia University

With the increasing need for security and integrity of remote network services, it becomes important to quantify the communication overhead of IPSec and compare it to alternatives such as SSH, SCP, and HTTPS.

For this purpose, the authors set up three testing networks: a direct link, two hosts separated by two gateways, and three hosts connecting through one gateway. Protocols were compared in each setup, and manual keying was used to eliminate connection setup costs. For IPSec, the different encryption algorithms AES, DES, 3DES, hardware DES, and hardware 3DES were used. In detail, FTP was compared to SFTP, SCP, and FTP over IPSec; HTTP to HTTPS and HTTP over IPSec; and NFS to NFS over IPSec and local disk performance.

The result of the measurements was that IPSec outperforms other popular encryption schemes. Overall, unencrypted communication was fastest, but in some cases, like FTP, the overhead can be small. The use of crypto hardware can significantly improve performance. For future work, the inclusion of setup costs, hardware-accelerated SSL, SFTP, and SSH was mentioned.

CONGESTION CONTROL IN LINUX TCP

Pasi Sarolahti, University of Helsinki; Alexey Kuznetsov, Institute for Nuclear Research at Moscow

Having attended the talk and read the paper, I am still unclear about whether the authors are merely describing the design decisions of TCP congestion control or whether they are actually the creators of that part of the Linux code. My guess leans toward the former.

In the talk, the speaker compared the TCP congestion control measures specified by the IETF in the RFCs with the actual Linux implementation, which does conform to the basic principles but nevertheless has differences. A specific emphasis was placed on retransmission mechanisms and the congestion window. Also, several TCP enhancements – the NewReno algorithm, Selective ACKs (SACK), Forward ACKs (FACK) – were discussed and compared.

In some instances, Linux does not conform to the IETF specifications. The fast recovery does not fully follow RFC 2582, since the threshold for triggering retransmit is adjusted dynamically and the congestion window's size is not changed. Also, the roundtrip-time estimator and the RTO calculation differ from RFC 2988, since Linux uses more conservative RTT estimates and a minimum RTO of 200ms. The performance measures showed that with the additional Linux-specific features enabled, slightly higher throughput, a more steady data flow, and fewer unnecessary retransmissions can be achieved.
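
The RFC 2988 retransmission timer can be written down in a few lines, which also makes the deviation described above easy to state: the RFC computes RTO = SRTT + 4*RTTVAR with a one-second floor, while the behavior reported for Linux keeps more conservative estimates and clamps the result at a 200ms minimum. The sketch below implements the RFC formula with a configurable floor; it is a textbook illustration, not the kernel code.

    # RFC 2988-style RTO estimation with a configurable minimum, to contrast the
    # 1s floor in the RFC with the 200ms floor described for Linux. Textbook
    # sketch only; this is not the kernel implementation.
    class RTOEstimator:
        ALPHA, BETA, K = 1 / 8, 1 / 4, 4

        def __init__(self, min_rto=1.0, max_rto=60.0):
            self.srtt = None
            self.rttvar = None
            self.min_rto, self.max_rto = min_rto, max_rto

        def update(self, rtt_sample):
            if self.srtt is None:                       # first measurement
                self.srtt = rtt_sample
                self.rttvar = rtt_sample / 2
            else:
                self.rttvar = ((1 - self.BETA) * self.rttvar
                               + self.BETA * abs(self.srtt - rtt_sample))
                self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt_sample
            rto = self.srtt + self.K * self.rttvar
            return min(self.max_rto, max(self.min_rto, rto))

    if __name__ == "__main__":
        samples = [0.030, 0.032, 0.029, 0.060, 0.031]   # seconds, made-up RTTs
        rfc = RTOEstimator(min_rto=1.0)
        linuxish = RTOEstimator(min_rto=0.2)
        for s in samples:
            print(round(rfc.update(s), 3), round(linuxish.update(s), 3))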

XTREME XCITEMENT

Summarized by Steve Bauer

THE FUTURE IS COMING: WHERE THE X WINDOW SHOULD GO

Jim Gettys, Compaq Computer Corp.

Jim Gettys, one of the principal authors of the X Window System, outlined the near-term objectives for the system, primarily focusing on the changes and infrastructure required to enable replication and migration of X applications. Providing better support for this functionality would enable users to retrieve or duplicate X applications between their servers at home and work.


One interesting example of an application that currently is capable of migration and replication is Emacs. To create a new frame on DISPLAY, try "M-x make-frame-on-display <Return> DISPLAY <Return>".

However, technical challenges make replication and migration difficult in general. These include the "major headaches" of server-side fonts, the nonuniformity of X servers and screen sizes, and the need to appropriately retrofit toolkits. Authentication and authorization issues are obviously also important. The rest of the talk delved into some of the details of these interesting technical challenges.

HACKING IN THE KERNEL

Summarized by Hai Huang

AN IMPLEMENTATION OF SCHEDULER ACTIVATIONS ON THE NETBSD OPERATING SYSTEM

Nathan J. Williams, Wasabi Systems

Scheduler activation is an old idea. Basically, there are benefits and drawbacks to using solely kernel-level or user-level threading. Scheduler activations are able to combine the two layers of control to provide more concurrency in the system.

In his talk, Nathan gave a fairly detailed description of the implementation of scheduler activations in the NetBSD kernel. One important change in the implementation is to differentiate the thread context from the process context. This is done by defining a separate data structure for these threads and relocating some of the information that was embedded in the process context to these thread contexts. The stack was especially a concern, due to the upcall: special handling must be done to make sure that the upcall doesn't mess up the stack, so that the preempted user-level thread can continue afterwards. Lastly, Nathan explained that signals are handled by upcalls.


AUTHORIZATION AND CHARGING IN PUBLIC WLANS USING FREEBSD AND 802.1X

Pekka Nikander, Ericsson Research NomadicLab

The 802.1X standards are well known in the wireless community as link-layer authentication protocols. In this talk, Pekka explained some novel ways of using the 802.1X protocols that might be of interest to people on the move. It is possible to set up a public WLAN that would support various charging schemes via virtual tokens, which people can purchase or earn and later use.

This is implemented on FreeBSD using the netgraph utility. It is basically a filter in the link layer that differentiates traffic based on the MAC address of the client node, which is either authenticated, denied, or let through. The overhead for this service is fairly minimal.

ACPI IMPLEMENTATION ON FREEBSD

Takanori Watanabe, Kobe University

ACPI (Advanced Configuration and Power Interface) was proposed as a joint effort by Intel, Toshiba, and Microsoft to provide a standard and finer-grained method of managing the power states of individual devices within a system. Such low-level power management is especially important for mobile and embedded systems that are powered by fixed-capacity batteries.

Takanori explained that the ACPI specification is composed of three parts: tables, BIOS, and registers. He was able to implement some functionalities of ACPI in a FreeBSD kernel. The ACPI Component Architecture was implemented by Intel, and it provides a high-level ACPI API to the operating system. Takanori's ACPI implementation is built upon this underlying layer of APIs.

ACCESS CONTROL

Summarized by Florian Buchholz

DESIGN AND PERFORMANCE OF THE OPENBSD STATEFUL PACKET FILTER (PF)

Daniel Hartmeier, Systor AG

Daniel Hartmeier described the new stateful packet filter (pf) that replaced IPFilter in the OpenBSD 3.0 release. IPFilter could no longer be included due to licensing issues, and thus there was a need to write a new filter making use of optimized data structures.

The filter rules are implemented as a linked list which is traversed from start to end. Two actions may be taken according to the rules, "pass" or "block": a "pass" action forwards the packet, and a "block" action will drop it. Where more than one rule matches a packet, the last rule wins. Rules that are marked as "final" will immediately terminate any further rule evaluation. An optimization called "skip-steps" was also implemented, where blocks of similar rules are skipped if they cannot match the current packet; these skip-steps are calculated when the rule set is loaded. Furthermore, a state table keeps track of TCP connections: only packets that match the sequence numbers are allowed. UDP and ICMP queries and replies are considered in the state table, where initial packets will create an entry for a pseudo-connection with a low initial timeout value. The state table is implemented using a balanced binary search tree. NAT mappings are also stored in the state table, whereas application proxies reside in user space. The packet filter is also able to perform fragment reassembly and to modulate TCP sequence numbers to protect hosts behind the firewall.
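
The core evaluation loop is simple to picture: walk the rules, remember the last one that matched, stop early on a "final" match, and use precomputed skip counts to jump over blocks of rules that cannot apply. The sketch below is a toy model of that loop, not pf's implementation, and the rule fields are simplified to a single attribute.

    # Toy model of last-match-wins rule evaluation with "final" rules and
    # precomputed skip-steps. Simplified far beyond pf: rules match on one field.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Rule:
        action: str               # "pass" or "block"
        port: Optional[int]       # None matches any port
        final: bool = False
        skip: int = 0             # rules to skip if the port cannot match

    def precompute_skips(rules):
        # For each rule, count the run of following rules with the same port
        # requirement; if this rule's port doesn't match, none of them can either.
        for i, rule in enumerate(rules):
            j = i + 1
            while j < len(rules) and rules[j].port == rule.port:
                j += 1
            rule.skip = j - i - 1

    def evaluate(rules, packet_port, default="pass"):
        decision = default
        i = 0
        while i < len(rules):
            rule = rules[i]
            if rule.port is None or rule.port == packet_port:
                decision = rule.action
                if rule.final:
                    return decision
                i += 1
            else:
                i += 1 + rule.skip      # jump over rules that also cannot match
        return decision                 # last matching rule wins

    if __name__ == "__main__":
        rules = [
            Rule("block", None),
            Rule("pass", 22),
            Rule("pass", 80),
            Rule("block", 80, final=True),
        ]
        precompute_skips(rules)
        for port in (22, 80, 443):
            print(port, evaluate(rules, port))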

Pf was compared against IPFilter as well as Linux's iptables by measuring throughput and latency with increasing traffic rates and different packet sizes. In a test with a fixed rule set size of 100, iptables outperformed the other two filters, whose results were close to each other. In a second test, where the rule set size was continually increased, iptables consistently had about twice the throughput of the other two (which evaluate the rule set on both the incoming and outgoing interfaces). A third test compared only pf and IPFilter, using a single rule that created state in the state table with a fixed-size state entry. Pf reached an overloaded state much later than IPFilter. The experiment was repeated with a variable-size state entry, and pf performed much better than IPFilter for a small number of states.

In general, rule set evaluation is expensive, and benchmarks only reflect extreme cases, whereas in real life other behavior should be observed. Furthermore, the benchmarks show that stateful filtering can actually improve performance, due to cheap state-table lookups as compared to rule evaluation.

ENHANCING NFS CROSS-ADMINISTRATIVE DOMAIN ACCESS

Joseph Spadavecchia and Erez Zadok, Stony Brook University

The speaker presented modifications to an NFS server that allow improved NFS access between administrative domains. A problem lies in the fact that NFS assumes a shared UID/GID space, which makes it unsafe to export files outside the administrative domain of the server. Design goals were to leave the protocol and existing clients unchanged, a minimum amount of server changes, flexibility, and increased performance.

To solve the problem, two techniques are utilized: "range-mapping" and "file-cloaking." Range-mapping maps IDs between client and server. The mapping is performed on a per-export basis and has to be manually set up in an export file. The mappings can be 1-1, N-N, or N-1. In file-cloaking, the server restricts file access based on UID/GID and special cloaking-mask bits. Here, users can only access their own file permissions. The policy on whether or not the file is visible to others is set by the cloaking mask, which is logically ANDed with the file's protection bits. File-cloaking only works, however, if the client doesn't hold cached copies of directory contents and file attributes. Because of this, the clients are forced to re-read directories, by incrementing the mtime value of the directory each time it is listed.
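
Range-mapping is essentially a per-export translation table applied to the UID/GID fields of each request and reply. The sketch below shows one way such a table could look and be applied; the table format and IDs are invented for illustration and are not the actual export-file syntax.

    # Sketch of per-export UID range-mapping between client and server ID spaces.
    # The table format and IDs are invented; this is not the actual export syntax.
    class RangeMap:
        def __init__(self, entries):
            # entries: list of (client_start, server_start, length); length 1 gives
            # a 1-1 mapping, length > 1 maps a whole contiguous range (N-N).
            self.entries = entries

        def to_server(self, client_uid, default_uid=65534):   # 65534 ~ "nobody"
            for c_start, s_start, length in self.entries:
                if c_start <= client_uid < c_start + length:
                    return s_start + (client_uid - c_start)
            return default_uid      # unmapped IDs are squashed (an N-1 style rule)

        def to_client(self, server_uid, default_uid=65534):
            for c_start, s_start, length in self.entries:
                if s_start <= server_uid < s_start + length:
                    return c_start + (server_uid - s_start)
            return default_uid

    if __name__ == "__main__":
        export_map = RangeMap([
            (1000, 20000, 100),   # client UIDs 1000-1099 <-> server UIDs 20000-20099
            (500, 500, 1),        # one shared account kept identical
        ])
        for uid in (1005, 500, 3000):
            print(uid, "->", export_map.to_server(uid))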

To test the performance of the modified server, five different NFS configurations were evaluated: an unmodified NFS server was compared against one server with the modified code included but not used, one with only range-mapping enabled, one with only file-cloaking enabled, and one version with all modifications enabled. For each setup, different file system benchmarks were run. The results show only a small overhead when the modifications are used, generally an increase of below 5%. Another experiment tested the performance of the system with an increasing number of mapped or cloaked entries. The results show that an increase from 10 to 1,000 entries resulted in a maximum of about a 14% cost in performance.

One member of the audience pointed out that if clients choose to ignore the changed mtimes from the server and thus still hold caches of the directory entries, the file-cloaking mechanism could be defeated. After a rather lengthy debate, the speaker had to concede that the model doesn't add any extra security. Another question was asked about the scalability of setting up range mapping. The speaker referred to application-level tools that could be developed for that purpose.

The software is available at ftp://ftp.fsl.cs.sunysb.edu/pub/enf.

ENGINEERING OPEN SOURCE SOFTWARE

Summarized by Teri Lampoudi

NINGAUI: A LINUX CLUSTER FOR BUSINESS

Andrew Hume, AT&T Labs – Research; Scott Daniels, Electronic Data Systems Corp.

Ningaui is a general-purpose, highly available, resilient architecture built from commodity software and hardware. Emphatically, however, Ningaui is not a Beowulf. Hume calls the cluster design the "Swiss canton model," in which there are a number of loosely affiliated independent nodes with data replicated among them. Jobs are assigned by bidding and leases, and cluster services done as session-based servers are sited via generic job assignment. The emphasis is on keeping the architecture end-to-end, checking all work via checksums, and logging everything. The resilience and high availability required by their goal of 8-to-5 maintenance – vs. the typical 24/7 model, where people get paged whenever the slightest thing goes wrong, regardless of the time of day – is achieved by job restartability. Finally, all computation is performed on local data, without the use of NFS or network-attached storage.

Hume's message is a hopeful one: despite the many problems encountered – things like kernel and service limits, auto-installation problems, TCP storms, and the like – the existence of source code and the paranoid practice of logging and checksumming everything has helped. The final product performs reasonably well, and it appears that the resilient techniques employed do make a difference. One drawback, however, is that the software mentioned in the paper is not yet available for download.

CPCMS: A CONFIGURATION MANAGEMENT SYSTEM BASED ON CRYPTOGRAPHIC NAMES

Jonathan S. Shapiro and John Vanderburgh, Johns Hopkins University

This paper won the FREENIX Best Paper award.

The basic notion behind the project is the fact that everyone has a pet complaint about CVS, and yet it is currently the configuration manager in most widespread use. Shapiro has unleashed an alternative. Interestingly, he did not begin but ended with the usual host of reasons why CVS is bad. The talk instead began abruptly by characterizing the job of a software configuration manager, continued by stating the namespaces which it must handle, and wrapped up with the challenges faced.

X MEETS Z: VERIFYING CORRECTNESS IN THE PRESENCE OF POSIX THREADS

Bart Massey, Portland State University; Robert T. Bauer, Rational Software Corp.

Massey delivered a humorous talk on the insight gained from applying the Z formal specification notation to system software design, rather than the more informal analysis and design process normally used.

The story is told with respect to writing XCB, which replaces the Xlib protocol layer and is supposed to be thread friendly. But where threads are concerned, deadlock avoidance becomes a hard problem that cannot be solved in an ad hoc manner, while full model checking is also too hard. In this case, Massey resorted to Z specification to model the XCB lower layer, abstract away locking and data transfers, and locate fundamental issues. Essentially, the difficulties of searching the literature and locating information relevant to the problem at hand were overcome. As Massey put it, "the formal method saved the day."

FILE SYSTEMS

Summarized by Bosko Milekic

PLANNED EXTENSIONS TO THE LINUX EXT2/EXT3 FILESYSTEM

Theodore Y. Ts'o, IBM; Stephen Tweedie, Red Hat

The speaker presented improvements to the Linux Ext2 file system, with the goal of allowing for various expansions while striving to maintain compatibility with older code. Improvements have been facilitated by a few extra superblock fields that were added to Ext2 not long before the Linux 2.0 kernel was released. The fields allow for file system features to be added without compromising the existing setup; this is done by providing bits indicating the impact of the added features with respect to compatibility. Namely, a file system with the "incompat" bit marked is not allowed to be mounted. Similarly, a "read-only" marking would only allow the file system to be mounted read-only.

Directory indexing replaces linear directory searches with a faster search using a fixed-depth tree and hashed keys. The file system size can be dynamically increased, and the expanded inode, doubled from 128 to 256 bytes, allows for more extensions.

Other potential improvements were discussed as well, in particular pre-allocation for contiguous files, which allows for better performance in certain setups by pre-allocating contiguous blocks. Security-related modifications, extended attributes, and ACLs were mentioned. An implementation of these features already exists but has not yet been merged into the mainline Ext2/3 code.

RECENT FILESYSTEM OPTIMISATIONS ON FREEBSD

Ian Dowse, Corvil Networks; David Malone, CNRI, Dublin Institute of Technology

David Malone presented four important file system optimizations for the FreeBSD OS: soft updates, dirpref, vmiodir, and dirhash. It turns out that certain combinations of the optimizations (beautifully illustrated in the paper) may yield performance improvements of anywhere between 2 and 10 orders of magnitude for real-world applications.

All four of the optimizations deal with file system metadata. Soft updates allow for asynchronous metadata updates. Dirpref changes the way directories are organized, attempting to place child directories closer to their parents, thereby increasing locality of reference and reducing disk-seek times. Vmiodir trades some extra memory in order to achieve better directory caching. Finally, dirhash, which was implemented by Ian Dowse, changes the way in which entries are searched in a directory: specifically, FFS uses a linear search to find an entry by name, whereas dirhash builds a hashtable of directory entries on the fly. For a directory of size n with a working set of m files, a search that in certain cases could have been O(n·m) has been reduced, due to dirhash, to effectively O(n + m).
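
The complexity claim is easy to see in a toy model: looking up m names by scanning a directory of n entries costs on the order of n·m comparisons, while building a hash table once (n work) and then doing m constant-time lookups costs on the order of n + m. The sketch below demonstrates this with plain Python data structures; it illustrates the idea only and is not FFS or dirhash code.

    # Toy illustration of the dirhash idea: linear directory lookups, O(n) each,
    # vs. building a hash table once and doing O(1) lookups. Not FFS code.
    import time

    def linear_lookup(entries, name):
        for entry_name, inode in entries:       # like scanning on-disk entries
            if entry_name == name:
                return inode
        return None

    if __name__ == "__main__":
        n, m = 20000, 5000
        entries = [(f"file{i:05d}", i) for i in range(n)]
        working_set = [f"file{i:05d}" for i in range(0, n, n // m)]

        t0 = time.perf_counter()
        for name in working_set:
            linear_lookup(entries, name)        # ~O(n * m) total
        t1 = time.perf_counter()

        table = dict(entries)                   # build once: ~O(n)
        for name in working_set:
            table.get(name)                     # ~O(m) total
        t2 = time.perf_counter()

        print(f"linear: {t1 - t0:.3f}s   hashed: {t2 - t1:.4f}s")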

FILESYSTEM PERFORMANCE AND SCALABILITY IN LINUX 2.4.17

Ray Bryant, SGI; Ruth Forester, IBM LTC; John Hawkes, SGI

This talk focused on performance evaluation of a number of file systems available and commonly deployed on Linux machines. Comparisons under various configurations of Ext2, Ext3, ReiserFS, XFS, and JFS were presented.

The benchmarks chosen for the data gathering were pgmeter, filemark, and the AIM Benchmark Suite VII. Pgmeter measures the rate of data transfer of reads/writes of a file under a synthetic workload. Filemark is similar to postmark in that it is an operation-intensive benchmark, although filemark is threaded and offers various other features that postmark lacks. AIM VII measures performance for various file-system-related functionalities; it offers various metrics under an imposed workload, thus stressing the performance of the file system not only under I/O load but also under significant CPU load.

Tests were run on three different setups: a small, a medium, and a large configuration. ReiserFS and Ext2 appear at the top of the pile for smaller and medium setups. Notably, XFS and JFS perform worse for smaller system configurations than the others, although XFS clearly appears to generate better numbers than JFS. It should be noted that XFS seems to scale well under a higher load. This was most evident in the large-system results, where XFS appears to offer the best overall results.

THINGS TO THINK ABOUT

Summarized by Bosko Milekic

SPEEDING UP KERNEL SCHEDULER BY REDUCING CACHE MISSES

Shuji Yamamura, Akira Hirai, Mitsuru Sato, Masao Yamamoto, Akira Naruse, and Kouichi Kumon, Fujitsu Labs

This was an interesting talk pertaining to the effects of cache coloring for task structures in the Linux kernel scheduler (Linux kernel 2.4.x). The speaker first presented some interesting benchmark numbers for the Linux scheduler, showing that as the number of processes on the task queue was increased, the performance decreased. The authors used some really nifty hardware to measure the number of bus transactions throughout their tests and were thus able to reasonably quantify the impact that cache misses had in the Linux scheduler.

Their experiments led them to implement a cache coloring scheme for task structures, which were previously aligned on 8KB boundaries and therefore were eventually being mapped to the same cache lines. This unfortunate placement of task structures in memory induced a significant number of cache misses as the number of tasks grew in the scheduler.

The implemented solution consisted of aligning task structures to more evenly distribute cached entries across the L2 cache. The result was inevitably fewer cache misses in the scheduler. Some negative effects were observed in certain situations; these were primarily due to more cache slots being used by task-structure data in the scheduler, thus forcing data previously cached there to be pushed out.
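
The underlying arithmetic is that a set-associative cache selects a set from a handful of address bits, so structures allocated on identical large alignments all land in the same few sets. The sketch below computes set indices for 8KB-aligned addresses versus addresses offset by a per-task "color"; the cache geometry and color count are made-up examples, not the authors' parameters.

    # Sketch of why 8KB-aligned task structures collide in the cache and how a
    # per-task color offset spreads them out. Cache geometry is hypothetical.
    LINE_SIZE = 64          # bytes per cache line
    NUM_SETS = 1024         # sets in a hypothetical set-associative L2 cache

    def cache_set(addr):
        return (addr // LINE_SIZE) % NUM_SETS

    def sets_touched(base_addrs):
        """Distinct cache sets hit by the first line of each structure."""
        return len({cache_set(a) for a in base_addrs})

    if __name__ == "__main__":
        num_tasks = 256
        aligned = [0x1000000 + i * 8192 for i in range(num_tasks)]           # 8KB aligned
        colored = [a + (i % 32) * LINE_SIZE for i, a in enumerate(aligned)]  # offset by color

        print("aligned :", sets_touched(aligned), "distinct sets")
        print("colored :", sets_touched(colored), "distinct sets")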



WORK-IN-PROGRESS REPORTS

Summarized by Brennan Reynolds

RESOURCE VIRTUALIZATION TECHNIQUES FOR WIDE-AREA OVERLAY NETWORKS

Kartik Gopalan, Stony Brook University

This work addressed the issue of provisioning a maximum number of virtual overlay networks (VONs) with diverse quality of service (QoS) requirements on a single physical network. Each of the VONs is logically isolated from the others to ensure the QoS requirements. Gopalan mentioned several critical research issues with this problem that are currently being investigated. Dealing with how to provision the network at various levels (link, route, or path) and then enforce the provisioning at run-time is one of the toughest challenges. Currently, Gopalan has developed several algorithms to handle admission control, end-to-end QoS, route selection, scheduling, and fault tolerance in the network.

For more information, visit http://www.ecsl.cs.sunysb.edu.

VISUALIZING SOFTWARE INSTABILITY

Jennifer Bevan, University of California, Santa Cruz

Detection of instability in software has typically been an afterthought. The point in the development cycle when the software is reviewed for instability is usually after it is difficult and costly to go back and perform major modifications to the code base. Bevan has developed a technique to allow the visualization of unstable regions of code that can be used much earlier in the development cycle. Her technique creates a time series of dependency graphs that include clusters and lines called fault lines. From the graphs, a developer is able to easily determine where the unstable sections of code are and proactively restructure them. She is currently working on a prototype implementation.

For more information, visit http://www.cse.ucsc.edu/~jbevan/evo_viz.

RELIABLE AND SCALABLE PEER-TO-PEER WEB DOCUMENT SHARING

Li Xiao, William and Mary College

The idea presented by Xiao would allow end users to share the content of their Web browser caches with neighboring Internet users. The rationale for this is that today's end users are increasingly connected to the Internet over high-speed links, and the browser caches are becoming large storage systems. Therefore, if an individual accesses a page which does not exist in their local cache, Xiao is suggesting that they first query other end users for the content before trying to access the machine hosting the original. This strategy does have some serious problems associated with it that still need to be addressed, including ensuring the integrity of the content and protecting the identity and privacy of the end users.

SEREL: FAST BOOTING FOR UNIX

Leni Mayo, Fast Boot Software

Serel is a tool that generates a visual representation of a UNIX machine's boot-up sequence. It can be used to identify the critical path and can show if a particular service or process blocks for an extended period of time. This information could be used to determine where the largest performance gains could be realized by tuning the order of execution at boot-up. Serel creates a dependency graph, expressed in XML, during boot-up. This graph is then used to create the visual representation. Currently the tool only works on POSIX-compliant systems, but Mayo stated that he would be porting it to other platforms. Other extensions that were mentioned included having the metadata include the use of shared libraries and monitoring the suspend/resume sequence of portable machines.

For more information, visit http://www.fastboot.org.


BERKELEY DB XML

John Merrells, Sleepycat Software

Merrells gave a quick introduction and overview of the new XML library for Berkeley DB. The library specializes in storage and retrieval of XML content through a tool called XPath. The library allows for multiple containers per document and stores everything natively as XML. The user is also given a wide range of elements to create indices with, including edges, elements, text strings, or presence. The XPath tool consists of a query parser, generator, optimizer, and execution engine. To conclude his presentation, Merrells gave a live demonstration of the software.

For more information, visit http://www.sleepycat.com/xml.

CLUSTER-ON-DEMAND (COD)

Justin Moore, Duke University

Modern clusters are growing at a rapid rate. Many have pushed beyond the 5,000-machine mark, and deploying them results in large expenses as well as management and provisioning issues. Furthermore, if the cluster is "rented" out to various users, it is very time-consuming to configure it to a user's specs, regardless of how long they need to use it. The COD work presented creates dynamic virtual clusters within a given physical cluster. The goal was to have a provisioning tool that would automatically select a chunk of available nodes and install the operating system and middleware specified by the customer in a short period of time. This would allow a greater use of resources, since multiple virtual clusters can exist at once. By using a virtual cluster, the size can be changed dynamically. Moore stated that they have created a working prototype and are currently testing and benchmarking its performance.

For more information, visit http://www.cs.duke.edu/~justin/cod.


CATACOMB

Elias Sinderson, University of California, Santa Cruz

Catacomb is a project to develop a database-backed DASL module for the Apache Web server. It was designed as a replacement for WebDAV. The initial release of the module only contains support for a MySQL database but could be extended to others. Sinderson briefly touched on the performance of his module: it was comparable to the mod_dav Apache module for all query types but search. The presentation was concluded with remarks about adding support for the lock method and including ACL specifications in the future.

For more information, visit http://ocean.cse.ucsc.edu/catacomb.

SELF-ORGANIZING STORAGE

Dan Ellard, Harvard University

Ellard's presentation introduced a storage system that tunes itself based on the workload of the system, without requiring the intervention of the user. The intelligence is implemented as a virtual self-organizing disk that resides below any file system. The virtual disk observes the access patterns exhibited by the system and then attempts to predict what information will be accessed next. An experiment was done using an NFS trace at a large ISP on one of their email servers. Ellard's self-organizing storage system worked well, which he attributes to the fact that most of the files being requested were large email boxes. Areas of future work include exploration of the length and detail of the data collection stage, as well as the CPU impact of running the virtual disk layer.

VISUAL DEBUGGING

John Costigan, Virginia Tech

Costigan feels that the state of current debugging facilities in the UNIX world is not as good as it should be. He proposes the addition of several elements to programs like ddd and gdb. The first addition is including separate heap and stack components. Another is to display only certain fields of a data structure and have the ability to zoom in and out if needed. Finally, the debugger should provide the ability to visually present complex data structures, including linked lists, trees, etc. He said that a beta version of a debugger with these abilities is currently available.

For more information, visit http://infovis.cs.vt.edu/datastruct.

IMPROVING APPLICATION PERFORMANCE THROUGH SYSTEM-CALL COMPOSITION

Amit Purohit, Stony Brook University

Web servers perform a huge number of context switches and a lot of internal data copying during normal operation. These two elements can drastically limit the performance of an application, regardless of the hardware platform it is run on. The Compound System Call (CoSy) framework is an attempt to reduce the performance penalty for context switches and data copies. It includes a user-level library and several kernel facilities that can be used via system calls. The library provides the programmer with a complete set of memory-management functions. Performance tests using the CoSy framework showed a large saving for context switches and data copies. The only area where the savings between conventional libraries and CoSy were negligible was for very large file copies.

For more information, visit http://www.cs.sunysb.edu/~purohit.

ELASTIC QUOTAS

John Oscar, Columbia University

Oscar began by stating that most resources are flexible, but that to date disks have not been. While approximately 80 percent of files are short-lived, disk quotas are hard limits imposed by administrators. The idea behind elastic quotas is to have a non-permanent storage area that each user can temporarily use. Oscar suggested the creation of an ehome directory structure to be used in deploying elastic quotas. Currently he has implemented elastic quotas as a stackable file system.

There were also several trends apparent in the data. Many users would ssh to their remote host (good) but then launch an email reader on the local machine which would connect via POP to the same host and send the password in clear text (bad). This situation can easily be remedied by tunneling protocols like POP through SSH, but it appeared that many people were not aware this could be done. While most of his comments on the use of the wireless network were negative, the list of passwords he had collected showed that people were indeed using strong passwords. His recommendations were to educate and encourage people to use protocols like IPSec, SSH, and SSL when conducting work over a wireless network, because you never know who else is listening.

SYSTRACE

Niels Provos, University of Michigan

How can one be sure the applications one is using actually do exactly what their developers said they do? Short answer: you can't. People today are using a large number of complex applications, which means it is impossible to check each application thoroughly for security vulnerabilities. There are tools out there that can help, though. Provos has developed a tool called systrace that allows a user to generate a policy of acceptable system calls a particular application can make. If the application attempts to make a call that is not defined in the policy, the user is notified and allowed to choose an action. Systrace includes an auto-policy generation mechanism, which uses a training phase to record the actions of all programs the user executes. If the user chooses not to be bothered by applications breaking policy, systrace allows default enforcement actions to be set. Currently systrace is implemented for FreeBSD and NetBSD, with a Linux version coming out shortly.



For more information, visit http://www.citi.umich.edu/u/provos/systrace.

VERIFIABLE SECRET REDISTRIBUTION FOR SURVIVABLE STORAGE SYSTEMS

Ted Wong, Carnegie Mellon University

Wong presented a protocol that can be used to re-create a file distributed over a set of servers, even if one of the servers is damaged. In his scheme, the user must choose how many shares the file is split into and the number of servers it will be stored across. The goal of this work is to provide persistent, secure storage of information even if it comes under attack. Wong stated that his design of the protocol was complete and he is currently building a prototype implementation of it.

For more information, visit http://www.cs.cmu.edu/~tmwong/research.

THE GURU IS IN SESSIONS

LINUX ON LAPTOP/PDA

Bdale Garbee, HP Linux Systems Operation

Summarized by Brennan Reynolds

This was a roundtable-like discussion with Garbee acting as moderator. He is currently in charge of the Debian distribution of Linux and is involved in porting Linux to platforms other than i386. He has successfully helped port the kernel to the Alpha, SPARC, and ARM, and is currently working on a version to run on VAX machines.

A discussion was held on which file system is best for battery life. The Reiser and Ext3 file systems were discussed; people commented that when using Ext3 on their laptops, the disk would never spin down and thus was always consuming a large amount of power. One suggestion was to use a RAM-based file system for any volume that requires a large number of writes and to only have the file system written to disk at infrequent intervals or when the machine is shut down or suspended.

A question was directed to Garbee about his knowledge of how the HP-Compaq merger would affect Linux. Garbee thought that the merger was good for Linux, both within the company and for the open source community as a whole. He stated that the new company would be the number one shipper of systems with Linux as the base operating system. He also fielded a question about the support of older HP servers and their ability to run Linux, saying that indeed Linux has been ported to them and he personally had several machines in his basement running it.

A question which sparked a large number of responses concerned problems people had with running Linux on mobile platforms. Most people actually did not have many problems at all. There were a few cases of a particular machine model not working, but there were only two widespread issues: docking stations and the new ACPI power management scheme. Docking stations still appear to be somewhat of a headache, but most people had developed ad hoc solutions for getting them to work. The ACPI power management scheme developed by Intel does not appear to have a quick solution. Ted Ts'o, one of the head kernel developers, stated that there are fundamental architectural problems with the 2.4 series kernel that do not easily allow ACPI to be added. However, he also stated that ACPI is supported in the newest development kernel, 2.5, and will be included in 2.6.

The final topic was the possibility of purchasing a laptop without paying any charge/tax for Microsoft Windows. The consensus was that currently this is not possible. Even if the machine comes with Linux or without any operating system, the vendors have contracts in place which require them to pay Microsoft for each machine they ship if that machine is capable of running a Microsoft operating system. The only way this will change is if the demand for other operating systems increases to the point that vendors renegotiate their contracts with Microsoft, and no one saw this happening in the near future.

LARGE CLUSTERS

Andrew Hume, AT&T Labs – Research

Summarized by Teri Lampoudi

This was a session on clusters, big data, and resilient computing. Given that the audience was primarily interested in clusters and not necessarily big data or beating nodes to a pulp, the conversation revolved mainly around what it takes to make a heterogeneous cluster resilient. Hume also presented a FREENIX paper on his current cluster project, which explained in more detail much of what was abstractly claimed in the guru session.

Hume's use of the word "cluster" referred not to what I had assumed would be a Beowulf-type system but to a loosely coupled farm of machines in which concurrency was much less of an issue than it would be in a Beowulf. In fact, the architecture Hume described was designed to deal with large amounts of independent transaction processing, essentially the process of billing calls, which requires no interprocess message passing or anything of the sort a parallel scientific application might.

Where does the large data come in? Since the transactions in question consist of large numbers of flat files, a mechanism for getting the files onto nodes and off-loading the results is necessary. In this particular cluster, named Ningaui, this task is handled by the "replication manager," which generated a large amount of interest in the audience. All management functions in the cluster, as well as job allocation, constitute services that are to be bid for and leased out to nodes. Furthermore, failures are handled by simply repeating the failed task until it is completed properly, if that is possible, and if it is not, then looking through detailed logs to discover what went wrong. Logging and testing for the successful completion of each task are what make this model resilient. Every management operation is logged at some appropriate level of detail, generating 120MB/day, and every single file transfer is md5 checksummed. This was characterized by Hume as a "patient" style of management, where no one controls the cluster, scheduling behavior is emergent rather than stringently planned, and nodes are free to drop in and out of the cluster; a downside to this model is the increase in latency, offset by the fact that the cluster continues to function unattended outside of 8-to-5 support hours.
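The "log everything, checksum every transfer" discipline is easy to picture in a few lines. The sketch below is not Ningaui's replication manager; file paths and the log format are invented, and only the MD5-verify-then-log step is shown.

# Verify a completed file transfer: checksum the copy, compare against the
# source digest, and log the outcome so failures can be retried later.
import hashlib
import logging

logging.basicConfig(level=logging.INFO)

def md5sum(path: str, chunk: int = 1 << 20) -> str:
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def verify_transfer(source_digest: str, copied_path: str) -> bool:
    ok = md5sum(copied_path) == source_digest
    logging.info("transfer %s verified=%s", copied_path, ok)  # every step logged
    return ok   # a failed check would cause the task to be repeated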

In response to questions about the projected scalability of such a scheme, Hume said the system would presumably scale from the present 60 nodes to about 100 without modifications, but that scaling higher would be a matter for further design. The guru session ended on a somewhat unrelated note regarding the problems of getting data off mainframes and onto Linux machines – issues of fixed- vs. variable-length encoding of data blocks resulting in useful information being stripped by FTP, and problems in converting COBOL copybooks to something useful on a C-based architecture. To reiterate Hume's claim, there are homemade solutions for some of these problems, and the reader should feel free to contact Mr. Hume for them.

WRITING PORTABLE APPLICATIONS

Nick Stoughton, MSB Consultants

Summarized by Josh Lothian

Nick Stoughton addressed a concern that developers are facing more and more: writing portable applications. He proposed that there is no such thing as a "portable application," only those that have been ported. Developers are still in search of the Holy Grail of applications programming: a write-once, compile-anywhere program. Languages such as Java are leading up to this, but virtual machines still have to be ported, paths will still be different, etc. Web-based applications are another area that could be promising for portable applications.

Nick went on to talk about the future of portability. The recently released POSIX 2001 standards are expected to help the situation. The 2001 revision expands the original POSIX spec into seven volumes. Even though POSIX 2001 is more specific than the previous release, Nick pointed out that even this standard allows for small differences where vendors may define their own behaviors. This can only hurt developers in the long run. Along these lines, it was pointed out that there may be room for automated tools that can check application code and determine the level of portability and standards compliance of the source.

NETWORK MANAGEMENT SYSTEM

PERFORMANCE TUNING

Jeff Allen, Tellme Networks, Inc.

Summarized by Matt Selsky

Jeff Allen, author of Cricket, discussed network tuning. The basic idea of tuning is: measure, twiddle, and measure again. Cricket can be used for large installations, but it's overkill for smaller installations, for which MRTG is better suited. You don't need Cricket to do measurement. Doing something like a shell script that dumps data to Excel is good, but you need to get some sort of measurement. Cricket was not designed for billing; it was meant to help answer questions.

Useful tools and techniques covered included looking for the difference between machines in order to identify the causes for the differences; measuring, but also thinking about what you're measuring and why; and remembering to use strace and tcpdump to identify problems. Problems repeat themselves; performance tuning requires an intricate understanding of all the layers involved to solve the problem and simplify the automation. (Having a computer science background helps but is not essential.) If you can reproduce the problem, you then should investigate each piece to find the bottleneck. If you can't reproduce the problem, then measure. You need to understand all the system interactions. Measurement can help determine whether things are actually slow or if the user is imagining it.

Troubleshooting begins with a hunch, but scientific processes are essential. You should be able to determine the event stream, or when each event occurs; having observation logs can help. Some hints provided included checking cron for periodic anomalies, slowing down /bin/rm to avoid I/O overload, and looking for unusual kernel CPU usage.

Also, try to use lightweight monitoring to reduce overhead. You don't need to monitor every resource on every system, but you should monitor those resources on those systems that are essential. Don't check things too often, since you can introduce overhead that way. Lots of information can be gathered from the /proc file system interface to the kernel.
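
A minimal example of the kind of lightweight monitoring being recommended, assuming a Linux-style /proc: poll a couple of small files at a modest interval rather than running a heavy agent. The chosen files, fields, and 30-second interval are arbitrary examples.

# Lightweight polling of /proc (Linux-specific); cheap enough to run often,
# but deliberately not run *too* often -- polling itself adds overhead.
import time

def sample():
    with open("/proc/loadavg") as f:
        load1 = float(f.read().split()[0])         # 1-minute load average
    memfree_kb = 0
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemFree:"):
                memfree_kb = int(line.split()[1])  # value is in kB
                break
    return load1, memfree_kb

if __name__ == "__main__":
    while True:
        load, free_kb = sample()
        print(f"load1={load:.2f} memfree={free_kb} kB")
        time.sleep(30)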

BSD BOF

Host: Kirk McKusick, Author and Consultant

Summarized by Bosko Milekic

The BSD Birds of a Feather session at USENIX started off with five quick and informative talks and ended with an exciting discussion of licensing issues.

Christos Zoulas, for the NetBSD Project, went over the primary goals of the NetBSD Project (namely portability, a clean architecture, and security) and then proceeded to briefly discuss some of the challenges that the project has encountered. The successes and areas for improvement of 1.6 were then examined. NetBSD has recently seen various improvements in its cross-build system, packaging system, and third-party tool support. For what concerns the kernel, a zero-copy TCP/UDP implementation has been integrated, and a new pmap code for arm32 has been introduced, as have some other rather major changes to various subsystems. More information is available in Christos's slides, which are now available at http://www.netbsd.org/gallery/events/usenix2002/.

Theo de Raadt of the OpenBSD Project presented a rundown of various new things that the OpenBSD and OpenSSH teams have been looking at. Theo's talk included an amusing description (and pictures) of some of the events that transpired during OpenBSD's hack-a-thon, which occurred the week before USENIX '02. Finally, Theo mentioned that following complications with the IPFilter license, the OpenBSD team had done a fairly extensive license audit.

Robert Watson of the FreeBSD Project brought up various examples of FreeBSD being used in the real world. He went on to describe a wide variety of changes that will surface with the upcoming FreeBSD release 5.0. Notably, 5.0 will introduce a re-architecting of the kernel aimed at providing much more scalable support for SMP machines. 5.0 will also include an early implementation of KSE, FreeBSD's version of scheduler activations, large improvements to pccard, and significant framework bits from the TrustedBSD project.

Mike Karels from Wind River Systems presented an overview of how BSD/OS has been evolving following the Wind River Systems acquisition of BSDi's software assets. Ernie Prabhakar from Apple discussed Apple's OS X and its success on the desktop. He explained how OS X aims to bring BSD to the desktop while striving to maintain a strong and stable core, one that is heavily based on BSD technology.

The BoF finished with a discussion on licensing; specifically, some folks questioned the impact of Apple's licensing for code from their core OS (the Darwin Project) on the BSD community as a whole.

CLOSING SESSION

HOW FLIES FLY

Michael H. Dickinson, University of California, Berkeley

Summarized by JD Welch

In this lively and offbeat talk, Dickinson discussed his research into the mechanisms of flight in the fruit fly (Drosophila melanogaster) and related the autonomic behavior to that of a technological system. Insects are the most successful organisms on earth, due in no small part to their early adoption of flight. Flies travel through space on straight trajectories interrupted by saccades, jumpy turning motions somewhat analogous to the fast, jumpy movements of the human eye. Using a variety of monitoring techniques, including high-speed video in a "virtual reality flight arena," Dickinson and his colleagues have observed that flies respond to visual cues to decide when to saccade during flight. For example, more visual texture makes flies saccade earlier (i.e., further away from the arena wall), described by Dickinson as a "collision control algorithm."

Through a combination of sensory inputs, flies can make decisions about their flight path. Flies possess a visual "expansion detector" which, at a certain threshold, causes the animal to turn a certain direction. However, expansion in front of the fly sometimes causes it to land. How does the fly decide? Using the virtual reality device to replicate various visual scenarios, Dickinson observed that the flies fixate on vertical objects that move back and forth across the field. Expansion on the left side of the animal causes it to turn right, and vice versa, while expansion directly in front of the animal triggers the legs to flare out in preparation for landing.
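
Purely as a restatement of the decision rule just described (and in no way Dickinson's actual model), the behavior can be written as a tiny lookup: front expansion above threshold means land, lateral expansion means saccade away from it. The threshold value and function name are invented.

# Toy caricature of the described visual-expansion responses; all numbers
# are made up, and the real control system is continuous, not a few ifs.
def fly_response(exp_left: float, exp_right: float, exp_front: float,
                 threshold: float = 0.5) -> str:
    if exp_front > threshold:
        return "flare legs (prepare to land)"
    if exp_left > threshold:
        return "saccade right"       # expansion on the left -> turn right
    if exp_right > threshold:
        return "saccade left"
    return "fly straight"

print(fly_response(0.7, 0.1, 0.0))   # saccade right
print(fly_response(0.1, 0.1, 0.9))   # flare legs (prepare to land)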

If the eyes detect these changes, how are the responses implemented? The flies' wings have "power muscles" controlled by mechanical resonance (as opposed to the nervous system), which drive the wings, combined with neurally activated "steering muscles," which change the configuration of the wing joints. Subtle variations in the timing of impulses correspond to changes in wing movement, controlled by sensors in the "wing-pit." A small wing-like structure, the haltere, controls equilibrium by beating constantly.

Voluntary control is accomplished by the halteres, whose signals can interfere with the autonomic control of the stroke cycle. The halteres have "steering muscles" as well, and information derived from the visual system can turn off the haltere or trick it into a "virtual" problem requiring a response.

Dickinson has also studied the aerodynamics of insect flight using a device called the "Robo Fly," an oversized mechanical insect wing suspended in a large tank of mineral oil. Interestingly, Dickinson observed that the average lift generated by a fly's wings is less than its body weight; the flies use three mechanisms to overcome this, including rotating the wings (rotational lift).

Insects are extraordinarily robust creatures, and because Dickinson analyzed the problem in a systems-oriented way, these observations and analysis are immediately applicable to technology; the response system can be used as an efficient search algorithm for control systems in autonomous vehicles, for example.

THE AFS WORKSHOP

Summarized by Garry Zacheiss, MIT

Love Hornquist-Astrand of the Arla development team presented the Arla status report. Arla 0.35.8 has been released. Scheduled for release soon are improved support for Tru64 UNIX, MacOS X, and FreeBSD, improved volume handling, and implementation of more of the vos/pts subcommands. It was stressed that MacOS X is considered an important platform, and that a GUI configuration manager for the cache manager, a GUI ACL manager for the Finder, and a graphical login that obtains Kerberos tickets and AFS tokens at login time were all under development.

Future goals planned for the underlying AFS protocols include GSSAPI/SPNEGO support for Rx, performance improvements to Rx, an enhanced disconnected mode, and IPv6 support for Rx; an experimental patch is already available for the latter. Future Arla-specific goals include improved performance, partial file reading, and increased stability on several platforms. Work is also in progress on the RXGSS protocol extensions (integrating GSSAPI into Rx). A partial implementation exists, and work continues as developers find time.

Derrick Brashear of the OpenAFS development team presented the OpenAFS status report. OpenAFS 1.2.5 was released immediately prior to the conference; it fixed a remotely exploitable denial-of-service vulnerability in several OpenAFS platforms, most notably IRIX and AIX. Future work planned for OpenAFS includes better support for MacOS X, including working around Finder interaction issues. Better support for the BSDs is also planned: FreeBSD has a partially implemented client, while NetBSD and OpenBSD have only server programs available right now. AIX 5, Tru64 5.1A, and MacOS X 10.2 (a.k.a. Jaguar) are all planned for the future. Other planned enhancements include nested groups in the ptserver (code donated by the University of Michigan awaits integration), disconnected AFS, and further work on a native Windows client. Derrick stated that the guiding principles of OpenAFS were to maintain compatibility with IBM AFS, support new platforms, ease the administrative burden, and add new functionality.

AFS performance was discussed. The openafs.org cell consists of two file servers, one in Stockholm, Sweden, and one in Pittsburgh, PA. AFS works reasonably well over trans-Atlantic links, although OpenAFS clients don't determine which file server to talk to very effectively; Arla clients use RTTs to the server to determine the optimal file server to fetch replicated data from. Modifications to OpenAFS to support this behavior in the future are desired.

Jimmy Engelbrecht and Harald Barth of KTH discussed their AFSCrawler script, written to determine how many AFS cells and clients were in the world, what implementations/versions they were (Arla vs. IBM AFS vs. OpenAFS), and how much data was in AFS. The script unfortunately triggered a bug in IBM AFS 3.6–derived code, causing some clients to panic while handling a specific RPC. This has since been fixed in OpenAFS 1.2.5 and the most recent IBM AFS patch level, and all AFS users are strongly encouraged to upgrade. No release of Arla is vulnerable to this particular denial-of-service attack. There was an extended discussion of the usefulness of this exploration. Many sites believed this was useful information and such scanning should continue in the future, but only on an opt-in basis.

Many sites face the problem of managing Kerberos/AFS credentials for batch-scheduled jobs. Specifically, most batch processing software needs to be modified to forward tickets as part of the batch submission process, renew tickets and tokens while the job is in the queue and for the lifetime of the job, and properly destroy credentials when the job completes. Ken Hornstein of NRL was able to pay a commercial vendor to support Kerberos 4/5 credential management in their product, although they did not implement AFS token management. MIT has implemented some of the desired functionality in OpenPBS and might be able to make it available to other interested sites.
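
The credential care being asked of batch systems looks roughly like the sketch below, which simply shells out to the usual MIT Kerberos and OpenAFS command-line tools. The polling interval, loop bound, and the idea of doing this from Python are illustrative assumptions, not how any particular batch system implements it.

# Keep a batch job's credentials alive while it waits and runs, then clean up.
import subprocess
import time

def keep_credentials_fresh(poll_seconds: int = 3600, max_polls: int = 24):
    for _ in range(max_polls):
        subprocess.run(["kinit", "-R"], check=True)   # renew the renewable TGT
        subprocess.run(["aklog"], check=True)         # refresh the AFS token
        time.sleep(poll_seconds)

def destroy_credentials():
    subprocess.run(["unlog"], check=False)            # drop AFS tokens
    subprocess.run(["kdestroy"], check=False)         # destroy Kerberos tickets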

Tools to simplify AFS administration were discussed, including:

AFS Balancer: A tool written by CMU to automate the process of balancing disk usage across all servers in a cell. Available from ftp://ftp.andrew.cmu.edu/pub/AFS-Tools/balance-1.1b.tar.gz.

Themis: Themis is KTH's enhanced version of the AFS tool "package" for updating files on local disk from a central AFS image. KTH's enhancements include allowing the deletion of files, simplifying the process of adding a file, and allowing the merging of multiple rule sets for determining which files are updated. Themis is available from the Arla CVS repository.

Stanford was presented as an example of a large AFS site. Stanford's AFS usage consists of approximately 1.4TB of data in AFS, in the form of approximately 100,000 volumes; 3.3TB of storage is available in their primary cell, ir.stanford.edu. Their file servers consist entirely of Solaris machines running a combination of Transarc 3.6 patch-level 3 and OpenAFS 1.2.x, while their database servers run OpenAFS 1.2.2. Their cell consists of 25 file servers using a combination of EMC and Sun StorEdge hardware. Stanford continues to use the kaserver for their authentication infrastructure, with future plans to migrate entirely to an MIT Kerberos 5 KDC.

Stanford has approximately 3400 clients on their campus, not including SLAC (Stanford Linear Accelerator); approximately 2100 AFS clients from outside Stanford contact their cell every month. Their supported clients are almost entirely IBM AFS 3.6 clients, although they plan to release OpenAFS clients soon. Stanford currently supports only UNIX clients. There is some on-campus presence of Windows clients, but Stanford has never publicly released or supported it. They do intend to release and support the MacOS X client in the near future.

All Stanford students, faculty, and staff are assigned AFS home directories, with a default quota of 50MB, for a total of approximately 550GB of user home directories. Other significant uses of AFS storage are data volumes for workstation software (400GB) and volumes for course Web sites and assignments (100GB).

AFS usage at Intel was also presented. Intel has been an AFS site since 1994. They had bad experiences with the IBM 3.5 Linux client; their experience with OpenAFS on Linux 2.4.x kernels has been much better. They use and are satisfied with the OpenAFS IA-64 Linux port. Intel has hundreds of OpenAFS 1.2.3 and 1.2.4 clients in many production cells accessing data stored on IBM AFS file servers, and they have not encountered any interoperability issues. Intel has some concerns about OpenAFS: they would like to purchase commercial support for OpenAFS and to see OpenAFS support for HP-UX on both PA-RISC and Itanium hardware. HP-UX support is currently unavailable due to a specific HP-UX header file being unavailable from HP; this may be available soon. Intel has not yet committed to migrating their file servers to OpenAFS and are unsure if they will do so without commercial support.

Backups are a traditional topic of discussion at AFS workshops, and this time was no exception. Many users complain that the traditional AFS backup tools ("backup" and "butc") are complex and difficult to automate, requiring many home-grown scripts and much user intervention for error recovery. An additional complaint was that the traditional AFS tools do not support file- or directory-level backups and restores; data must be backed up and restored at the volume level.

Mitch Collinsworth of Cornell presented work to make AMANDA, the free backup software from the University of Maryland, suitable for AFS backups. Using AMANDA for AFS backups allows one to share AFS backup tapes with non-AFS backups, easily run multiple backups in parallel, automate error recovery, and provide a robust degraded mode that prevents tape errors from stopping backups altogether. Their implementation allows for full volume restores as well as individual directory and file restores. They have finished coding this work and are in the process of testing and documenting it.

Peter Honeyman of CITI at the University of Michigan spoke about work he has proposed to replace Rx with RPCSEC GSS in OpenAFS; this would allow AFS to use a TCP-based transport mechanism rather than the UDP-based Rx, and possibly gain better congestion control, dynamic adaptation, and fragmentation avoidance as a result. RPCSEC GSS uses the GSSAPI to authenticate SUN ONC RPC. RPCSEC GSS is transport agnostic, provides strong security, is a developing Internet standard, and has multiple open source implementations. Backward compatibility with existing AFS servers and clients is an important goal of this project.

SMP UNIX, work on X11 and Sun's windowing system, the first enscript, and Java. The Software Tools Users Group (STUG) Award was presented to the Apache Foundation and accepted by Rasmus Lerdorf. In addition to the well-known Web server, Apache produces Jakarta, mod_perl, mod_tcl, and an XML parser, with over 80 members in at least 15 countries.

architecture. As David Reed argues, capacity can scale with the number of users – assuming an effective technology and an open architecture.

Developments over the last three years are disturbing and can be summarized as two layers of corruption: intellectual-property rights constraining technical innovation, and proprietary hardware platforms constraining software innovation.

Issues have been framed by the courts largely in two mistaken ways: (1) it's their property – a new technology shouldn't interfere with the rights of current technology owners; (2) it's just theft – copyright laws should be upheld by suppressing certain new technologies.

In fact, we must reframe the debate from "it's their property" to a "highways" metaphor that acts as a neutral platform for innovation without discrimination. "Theft" should be reframed as "Walt Disney," who built upon works from the past in richly creative ways that demonstrate the utility of allowing work to reach the public domain.

In the end, creativity depends upon the balance between property and access, public and private, controlled and common access. Free code builds a free culture, and open architectures are what give rise to the freedom to innovate. All of us need to become involved in reframing the debate on this issue; as Rachel Carson's Silent Spring points out, an entire ecology can be undermined by small changes from within.

INVITED TALKS

THE IETF, OR WHERE DO ALL THOSE RFCS

COME FROM, ANYWAY?

Steve Bellovin, AT&T Labs – Research

Summarized by Josh Simon

The Internet Engineering Task Force (IETF) is a standards body but not a legal entity, consisting of individuals (not organizations) and driven by a consensus-based decision model. Anyone who "shows up" – be it at the thrice-annual in-person meetings or on the email lists for the various groups – can join and be a member. The IETF is concerned with Internet protocols and open standards, not LAN-specific technologies (such as AppleTalk) or layer-1 or -2 issues (like copper versus fiber).

The organizational structure is loose. There are many working groups, each with a specific focus within several areas. Each area has an area director, and the area directors collectively form the Internet Engineering Steering Group (IESG). The six permanent areas are Internet (with working groups for IPv6, DNS, and ICMP), Transport (TCP, QoS, VoIP, and SCTP), Applications (mail, some Web, LDAP), Routing (OSPF, BGP), Operations and Management (SNMP), and Security (IPSec, TLS, S/MIME). There are also two other areas: SubIP is a temporary area for things underneath the IP protocol stack (such as MPLS, IP over wireless, and traffic engineering), and there's a General area for miscellaneous and process-based working groups.

Internet Requests for Comments (RFCs) fall into three tracks: Standard, Informational, and Experimental. Note that this means that not all RFCs are standards. The RFCs in the Informational track are generally for proprietary protocols or April first jokes; those in the Experimental track are results, ideas, or theories.

The RFCs in the Standard track come from working groups in the various areas through a time-consuming, complex process. Working groups are created with an agenda, a problem statement, an email list, some draft RFCs, and a chair; they typically start out as a BoF session. The working group and the IESG make a charter to define the scope, milestones, and deadlines, and the Internet Advisory Board (IAB) ensures that the working group proposals are architecturally sound. Working groups are narrowly focused and are supposed to die off once the problem is solved and all milestones are achieved. Working groups meet and work mainly through the email list, though there are three in-person, high-bandwidth meetings per year. However, decisions reached in person must be ratified by the mailing list, since not everybody can get to three meetings per year. They produce RFCs, which go through the Standard track; these need to go through the entire working group before being submitted for comment to the entire IETF and then to the IESG. Most RFCs wind up going back to the working group at least once from the area director or IESG level.

The format of an RFC is well-defined and requires that it be published in plain 7-bit ASCII. RFCs are freely redistributable, and the IETF reserves the right of change control on all Standard-track RFCs.

The big problems the IETF is currently facing are security, internationalization, and congestion control. Security has to be designed into protocols from the start. Internationalization has shown us that 7-bit-only ASCII is bad and doesn't work, especially for those character sets that require more than 7 bits (like Kanji); UTF-8 is a reasonable compromise. But what about domain names? While not specified as requiring 7-bit ASCII in the specifications, most DNS applications assume a 7-bit character set in the namespace. This is a hard problem. Finally, congestion control is another hard problem, since the Internet is not the same as a really big LAN.

INTRODUCTION TO AIR TRAFFIC

MANAGEMENT SYSTEMS

Ron Reisman, NASA Ames Research Center; Rob Savoye, Seneca Software

Summarized by Josh Simon

Air traffic control is organized into four domains: surface, which runs out of the airport control tower and controls the aircraft on the ground (e.g., taxi and takeoff); terminal area, which covers aircraft at 11,000 feet and below, handled by the Terminal Radar Approach Control (TRACON) facilities; en route, which covers between 11,000 and 40,000 feet, including climb, descent, and at-altitude flight, and runs from the 20 Air Route Traffic Control Centers (ARTCC, pronounced "artsy"); and traffic flow management, which is the strategic arm. Each area has sectors for low, high, and very-high flight. Each sector has a controller team, including one person on the microphone, and handles between 12 and 16 aircraft at a time. Since the number of sectors and areas is limited and fixed, there's limited system capacity. The events of September 11, 2001, gave us a respite in terms of system usage, but based on past growth patterns the air traffic system will be oversubscribed within two to three years. How do we handle this oversubscription?

Air Traffic Management (ATM) Decision Support Tools (DST) use physics, aeronautics, heuristics (expert systems), fuzzy logic, and neural nets to help the (human) aircraft controllers route aircraft around. The rest of the talk focused on capacity issues, but the DSTs also handle safety and security issues. The software follows open standards (ISO, POSIX, and ANSI). The team at NASA Ames made the Center-TRACON Automation System (CTAS), which is software for each of the ARTCCs, portable from Solaris to HP-UX and Linux as well. Unlike just about every other major software project, this one really is standard and portable; co-presenter Rob Savoye has experience in maintaining gcc on multiple platforms and is the project lead on the portability and standards issues for the code. CTAS allows the ARTCCs to upgrade and enhance individual aspects or parts of the system; it isn't a monolithic, all-or-nothing entity like the old ATM systems.

Some future areas of research include a head-mounted augmented reality device for tower operators, to improve their situational awareness by automating human factors, and new digital global positioning system (DGPS) technologies, which are accurate to within inches instead of feet.

Questions centered around advanced avionics (e.g., getting rid of ground control), cooperation between the US and Europe on software development (we're working together on software development, but the various European countries' controllers don't talk well to each other), and privatization.

ADVENTURES IN DNS

Bill Manning, ISI

Summarized by Brennan Reynolds

Manning began by posing a simple question: is DNS really a critical infrastructure? The answer is not simple. Perhaps seven years ago, when the Internet was just starting to become popular, the answer was a definite no. But today, with IPv6 being implemented and so many transactions being conducted over the Internet, the question does not have a clear-cut answer. The Internet Engineering Task Force (IETF) has several new modifications to the DNS service that may be used to protect and extend its usability.

The first extension Manning discussed was Internationalized Domain Names (IDN). To date, all DNS records are based on the ASCII character set, but many addresses in the Internet's global network cannot be easily written in ASCII characters. The goal of IDN is to provide encoding for hostnames that is fair, efficient, and allows for a smooth transition from the current scheme. The work has resulted in two encoding schemes, ACE and UTF-8. Each encoding is independent of the other, but they can be used together in various combinations. Manning expressed his opinion that while neither is an ideal solution, ACE appears to be the lesser of two evils. A major hindrance to getting IDN rolled out into the Internet's DNS root structure is the increase in zone file complexity.
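
For a feel of the two encodings, modern Python can show both forms of a non-ASCII label. Note that Python's built-in "idna" codec produces the ACE form that was eventually standardized (punycode); the ACE variants under discussion in 2002 differed in detail, so this is only an approximation of what Manning described.

# UTF-8 bytes versus an ASCII-compatible encoding (ACE) of the same label.
label = "bücher"
print(label.encode("utf-8"))   # b'b\xc3\xbccher'   -- 8-bit UTF-8
print(label.encode("idna"))    # b'xn--bcher-kva'   -- ACE, plain ASCII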

Manning's next topic was the inclusion of IPv6 records. In an IPv4 world, it is possible for administrators to remember the numeric representation of an address; IPv6 makes the addresses too long and complex to be easily remembered. This means that DNS will play a vital role in getting IPv6 deployed. Several new resource records (RRs) have been proposed to handle the translation, including AAAA, A6, DNAME, and BITSTRING. Manning commented on the difference between IPv4 and v6 as a transport protocol and noted that systems tuned for v4 traffic will suffer a performance hit when using v6. This is largely attributed to the increase in data transmitted per DNS request.

The final extension Manning discussed was DNSSec. He introduced this extension as a mechanism that protects the system from itself. DNSSec protects against data spoofing and provides authentication between servers. It includes a cryptographic signature of the RR set to ensure authenticity and integrity. By signing the entire set, the amount of computation is kept to a minimum. The information itself is stored in a new RR within each zone file on the DNS server.
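
The shape of that mechanism, heavily simplified: canonicalize the RR set as a unit, sign one digest for the whole set, and store the signature in an additional record. Real DNSSec uses public-key signatures and precise canonicalization rules; the sign() stub, record strings, and hash choice below are placeholders, not the actual specification.

# Sign an RRset as a unit (conceptual sketch only, not real DNSSec).
import hashlib

def canonical_rrset(records):
    # sort so every server signs and verifies the same byte string
    return b"\n".join(sorted(r.encode() for r in records))

def sign(digest: bytes, private_key) -> bytes:
    raise NotImplementedError("stand-in for an RSA/DSA signature")

rrset = [
    "www.example.com. 3600 IN A 192.0.2.1",
    "www.example.com. 3600 IN A 192.0.2.2",
]
digest = hashlib.sha1(canonical_rrset(rrset)).digest()
# signature = sign(digest, private_key)  # would be stored in a new RR in the zone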

Manning briefly commented on the use of DNS to provide a PKI infrastructure, stating that that was not the purpose of DNS and therefore it should not be used in that fashion. The signing of the RR sets can be done hierarchically, resulting in the use of a single trusted key at the root of the DNS tree to sign all sets down to the leaves. However, the job of key replacement and rollover is extremely difficult for a system that is distributed across the globe.

Manning stated that in an operational test bed with all of these extensions enabled, the packet size for a single query response grew from 518 bytes to larger than 18000. This results in a large increase in bandwidth usage for high-volume name servers. Therefore, in Manning's opinion, not all of these features will be deployed in the near future. For those looking for more information, Bill's Web site can be found at http://www.isi.edu/otdr.

THE JOY OF BREAKING THINGS

Pat Parseghian, Transmeta

Summarized by Juan Navarro

Pat Parseghian described her experience in testing the Crusoe microprocessor at the Transmeta Lab for Compatibility (TLC), where the motto is "You make it, we break it" and the goal is to make engineers miserable.

She first gave an overview of Crusoe, whose most distinctive feature is a code-morphing layer that translates x86 instructions into native VLIW. Such peculiarity makes the testing process particularly challenging, because it involves testing two processors (the "external" x86 and the "internal" VLIW) and also because of variations in how the code-morphing layer works (it may interpret code the first few times it sees an instruction sequence, and translate and save in a translation cache afterwards). There are also reproducibility issues, since the VLIW processor may run at different speeds to save energy.

Pat then described the tests that they subject the processor to (hardware compatibility tests, common applications, operating systems, envelope-pushing games, and hard-to-acquire legacy applications) and the issues that must be considered when defining testing policies. She gave some testing tips, including organizational issues like tracking resources and results.

To illustrate the testing process, Pat gave a list of possible causes of a system crash that must be investigated. If it is the silicon, then it might be because it is damaged, or because of a manufacturing problem or a design flaw. If the problem is in the code-morphing layer, is the fault with the interpreter or with the translator? The fault could also be external to the processor: it could be the homemade system BIOS, a faulty mainboard, or an operator error. Or Transmeta might not be at fault at all; the crash might be due to a bug in the OS or the application. The key to pinpointing the cause of a problem is to isolate it by identifying the conditions that reproduce the problem and repeating those conditions on other test platforms and non-Crusoe systems. Then relevant factors must be identified based on product knowledge, past experience, and common sense.

To conclude, Pat suggested that some of TLC's testing lessons can be applied to products that the audience was involved in, and assured us that breaking things is fun.

TECHNOLOGY, LIBERTY, FREEDOM, AND

WASHINGTON

Alan Davidson, Center for Democracy and Technology

Summarized by Steve Bauer

"Experience should teach us to be most on our guard to protect liberty when the Government's purposes are beneficent. Men born to freedom are naturally alert to repel invasion of their liberty by evil-minded rulers. The greatest dangers to liberty lurk in insidious encroachment by men of zeal, well-meaning but without understanding." – Louis Brandeis, Olmstead v. US

Alan Davidson, an associate director at the CDT (http://www.cdt.org) since 1996, concluded his talk with this quote. In many ways it aptly characterizes the importance to the USENIX community of the topics he covered. The major themes of the talk were the impact of law on individual liberty, system architectures, and appropriate responses by the technical community.

The first part of the talk provided the audience with an overview of legislation either already introduced or likely to be introduced in the US Congress. This included various proposals to protect children, such as relegating all sexual content to a .xxx domain or providing kid-safe zones such as kids.us. Other similar laws discussed were the Children's Online Protection Act and legislation dealing with virtual child pornography.

Other lower-profile pieces covered were laws prohibiting false contact information in emails and domain registration databases. Similar proposals exist that would prohibit misleading subject lines in emails and require messages to clearly identify whether they are advertisements. Briefly mentioned were laws impacting online gambling.

Bills and laws affecting the architectural design of systems and networks, and various efforts to establish boundaries and borders in cyberspace, were then discussed. These included the Consumer Broadband and Digital Television Promotion Act and the French cases against Yahoo and its CEO for allowing the sale of Nazi materials on the Yahoo auction site.

A major part of the talk focused on the technological aspects of the USA-Patriot Act, passed in the wake of the September 11 attacks. Topics included "pen registers" for the Internet, expanded government power to conduct secret searches, roving wiretaps, nationwide service of warrants, sharing of grand jury and intelligence information, and the establishment of an identification system for visitors to the US.

The technological community needs to be aware of and care about the impact of law on individual liberty and the systems the community builds. Specific suggestions included designing privacy and anonymity into architectures and limiting information collection, particularly any personally identifiable data, to only what is essential. Finally, Alan emphasized that many people simply do not understand the implications of law; thus, the technology community has an important role in helping make these implications clear.

TAKING AN OPEN SOURCE PROJECT TO

MARKET

Eric Allman, Sendmail, Inc.

Summarized by Scott Kilroy

Eric started the talk with the upsides and downsides of open source software. The upsides include rapid feedback, high-quality work (mostly), lots of hands, and an intensely engineering-driven approach. The downsides include no support structure, limited project marketing expertise, and volunteers having limited time.

In 1998, Sendmail was becoming a success disaster. A success disaster includes two key elements: (1) volunteers start spending all their time supporting current functionality instead of enhancing it (this heavy burden can lead to project stagnation and eventually death); (2) competition from better-funded sources can lead to FUD (fear, uncertainty, and doubt) about the project.

Eric then led listeners down the path he took to turn Sendmail into a business. Eric envisioned Sendmail, Inc. as a smallish and close-knit company but soon realized that the family atmosphere he desired could not last as the company grew. He observed that maybe only the first 10 people will buy into your vision, after which each individual is primarily interested in something else (e.g., power, wealth, or status). He warns that there will be people working for you that you do not like. Eric particularly did not care for salespeople, but he emphasized how important sales, marketing, finance, and legal people are to companies.

The nature of companies in infancy can be summed up as: you need money, and you need investors in order to get money. Investors want a return on investment, so money management becomes critical; companies therefore must optimize money functions. Eric's experiences with investors were lessons in themselves: more than giving money, good investors can provide connections and insight, so you don't want investors who don't believe in you.

A company must have bug history, change management, and people who understand the product and have a sense of the target market. Eric now sees the importance of marketing target research. A company can never forget the simple equation that drives business: profit = revenue - expense.

Finally, the concept of value should be based on what is valuable to the customer. Eric observed that his customers needed documentation, extra value in the product that they couldn't get elsewhere, and technical support.

Eric learned hard lessons along the way. If you want to do interesting open source, it might be best not to be too successful: you don't want to let the larger dragons (companies) notice you. If you want a commercial user base, you have to manage process, watch corporate culture, provide value to customers, watch bottom lines, and develop a thick skin.

Starting a company is easy, but perpetuating it is extremely difficult.

INFORMATION VISUALIZATION FOR

SYSTEMS PEOPLE

Tamara Munzner, University of British Columbia

Summarized by JD Welch

Munzner presented interesting evidence that information visualization, through interactive representations of data, can help people perform tasks more effectively and reduce the load on working memory.

A popular technique in visualizing data is abstracting it through node-link representation – for example, a list of cross-references in a book. Representing the relationships between references in a graphical (node-link) manner "offloads cognition to the human perceptual system." These graphs can be produced manually, but automated drawing allows for increased complexity and quicker production times.

The way in which people interpret visual data is important in designing effective visualization systems. Attention to preattentive visual cues is key to success: picking out a red dot among a field of identically shaped blue dots is significantly faster than finding the same data represented in a mixture of hue and shape variations. There are many preattentive visual cues, including size, orientation, intersection, and intensity. The accuracy of these cues can be ranked, with position triggering high accuracy and color or density on the low end.

Visual cues combined with different data types – quantitative (e.g., 10 inches), ordered (small, medium, large), and categorical (apples, oranges) – can also be ranked, with spatial position being best for all types. Beyond this commonality, however, accuracy varied widely, with length ranked second for quantitative data, eighth for ordinal, and ninth for categorical data types.

These guidelines about visual perception can be applied to information visualization, which, unlike scientific visualization, focuses on abstract data and choice of specialization. Several techniques are used to clarify abstract data, including multi-part glyphs, where changes in individual parts are incorporated into an easier-to-understand gestalt, as well as interactivity, motion, and animation. In large data sets, techniques like "focus + context," where a zoomed portion of the graph is shown along with a thumbnail view of the entire graph, are used to minimize user disorientation.

Future problems include dealing with huge databases such as the Human Genome, reckoning with dynamic data like the changing structure of the Web or real-time network monitoring, and transforming "pixel bound" displays into large "digital wallpaper"-type systems.

FIXING NETWORK SECURITY BY HACKING

THE BUSINESS CLIMATE

Bruce Schneier, Counterpane Internet Security

Summarized by Florian Buchholz

Bruce Schneier identified security as one of the fundamental building blocks of the Internet. A certain degree of security is needed for all things on the Net, and the limits of security will become limits of the Net itself. However, companies are hesitant to employ security measures. Science is doing well, but one cannot see the benefits in the real world. An increasing number of users means more problems affecting more people. Old problems such as buffer overflows haven't gone away, and new problems show up. Furthermore, the amount of expertise needed to launch attacks is decreasing.

Schneier argues that complexity is the enemy of security, and that while security is getting better, the growth in complexity is outpacing it. As security is fundamentally a people problem, one shouldn't focus on technologies but rather should look at businesses, business motivations, and business costs. Traditionally, one can distinguish between two security models: threat avoidance, where security is absolute, and risk management, where security is relative and one has to mitigate the risk with technologies and procedures. In the latter model, one has to find a point with reasonable security at an acceptable cost.
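
One conventional way (not something Schneier presented) to make "reasonable security at an acceptable cost" concrete is annualized loss expectancy: adopt a control only when it reduces expected yearly loss by more than it costs. All numbers below are invented.

# Annualized loss expectancy (ALE) = single-loss expectancy x annual rate.
def ale(single_loss_expectancy: float, annual_rate_of_occurrence: float) -> float:
    return single_loss_expectancy * annual_rate_of_occurrence

loss_before = ale(50_000, 0.6)   # expected yearly loss with no control
loss_after = ale(50_000, 0.1)    # expected yearly loss with the control
control_cost = 10_000
print("worth deploying:", control_cost < loss_before - loss_after)  # True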

After a brief discussion of how security is handled in businesses today, in which he concluded that businesses "talk big about it but do as little as possible," Schneier identified four necessary steps to fix the problems.

First, enforce liabilities. Today no real consequences have to be feared from security incidents. Holding people accountable will increase the transparency of security and give an incentive to make processes public. As possible options to achieve this he listed industry-defined standards, federal regulation, and lawsuits. The problems with enforcement, however, lie in difficulties associated with the international nature of the problem and the fact that complexity makes assigning liabilities difficult. Furthermore, fear of liability could have a chilling effect on free software development and could stifle new companies.

As a second step, Schneier identified the need to allow partners to transfer liability. Insurance would spread liability risk among a group and would be a CEO's primary risk analysis tool. There is a need for standardized risk models and protection profiles, and for more securities as opposed to empty press releases. Schneier predicts that insurance will create a marketplace for security where customers and vendors will have the ability to accept liabilities from each other. Computer-security insurance should soon be as common as household or fire insurance, and from that development insurance will become the driving factor of the security business.

The next step is to provide mechanisms to reduce risks, both before and after software is released; techniques and processes to improve software quality, as well as an evolution of security management, are therefore needed. Currently, security products try to rebuild the walls – such as physical badges and entry doorways – that were lost when getting connected. Schneier believes this "fortress metaphor" is bad; one should think of the problem more in terms of a city. Since most businesses cannot afford proper security, outsourcing is the only way to make security scalable. With outsourcing, the concept of best practices becomes important, and insurance companies can be tied to them; outsourcing will level the playing field.

As a final step, Schneier predicts that rational prosecution and education will lead to deterrence. He claims that people feel safe because they live in a lawful society, whereas the Internet is classified as lawless, very much like the American West in the 1800s or a society ruled by warlords. This is because it is difficult to prove who an attacker is, and prosecution is hampered by complicated evidence gathering and irrational prosecution. Schneier believes, however, that education will play a major role in turning the Internet into a lawful society. Specifically, he pointed out that we need laws that can be explained.

Risks won't go away; the best we can do is manage them. A company able to manage risk better will be more profitable, and we need to give CEOs the necessary risk-management tools.

There were numerous questions. Listed below are the more complicated ones, in a paraphrased Q&A format.

Q: Will liability be effective? Will insurance companies be willing to accept the risks? A: The government might have to step in; it needs to be seen how it plays out.

Q: Is there an analogy to the real world in the fact that in a lawless society only the rich can afford security? Security solutions differentiate between classes. A: Schneier disagreed with the statement, giving an analogy to front-door locks, but conceded that there might be special cases.

Q: Doesn't homogeneity hurt security? A: Homogeneity is oversold. Diverse types can be more survivable, but given the limited number of options, the difference will be negligible.

Q: Regulations in airbag protection have led to deaths in some cases. How can we keep the pendulum from swinging the other way? A: Lobbying will not be prevented. An imperfect solution is probable; there might be reversals of requirements, such as the airbag one.

Q: What about personal liability? A: This will be analogous to auto insurance; liability comes with computer/Net access.

Q: If the rule of law is to become reality, there must be a law enforcement function that applies to a physical space. You cannot do that with any existing government agency for the whole world. Should an organization like the UN assume that role? A: Schneier was not convinced global enforcement is possible.

Q: What advice should we take away from this talk? A: Liability is coming. Since the network is important to our infrastructure, eventually the problems will be solved in a legal environment. You need to start thinking about how to solve the problems and how the solutions will affect us.

The entire speech (including slides) is available at http://www.counterpane.com/presentation4.pdf.

LIFE IN AN OPEN SOURCE STARTUP

Daryll Strauss, Consultant

Summarized by Teri Lampoudi

This talk was packed with morsels of insight and tidbits of information about what life in an open source startup is like (though "was like" might be more appropriate), what the issues are in starting, maintaining, and commercializing an open source project, and the way hardware vendors treat open source developers. Strauss began by tracing the timeline of his involvement with developing 3-D graphics support for Linux: from the 3dfx Voodoo1 driver for Linux in 1997, to the establishment of Precision Insight in 1998, to its buyout by VA Linux, to the dismantling of his group by VA, to what the future may hold.

Obviously, there are benefits from doing open source development, both for the developer and for the end user. An important point was that open source develops entire technologies, not just products. One misapprehension concerns the inherent difficulty of managing a project that accepts code from a large number of developers, some volunteer and some paid. The notion that code can just be thrown over the wall into the world is a Big Lie.

How does one start and keep afloat an open source development company? In the case of Precision Insight, the subject matter – developing graphics and OpenGL for Linux – required a considerable amount of expertise: intimate knowledge of the hardware, the libraries, X, and the Linux kernel, as well as the end applications. Expertise is marketable. What helps even further is having a visible virtual team of experts, people who have an established track record of contributions to open source in the particular area of expertise. Support from the hardware and software vendors came next. While everyone was willing to contract PI to write the portions of code that were specific to their hardware, no one wanted to pay the bill for developing the underlying foundation that was necessary for the drivers to be useful. In the PI model, the infrastructure cost was shared by several customers at once, and the technology kept moving. Once a driver is written for a vendor, the vendor is perfectly capable of figuring out how to write the next driver necessary, thereby obviating the need to contract the job. This, and questions of protecting intellectual property – hardware design in this case – are deeply problematic with the open source mode of development. This is not to say that there is no way to get over them, but that they are likely to arise more often than not.

Management issues seem equally important. In the PI setting, contracts were flexible – both a good and a bad thing – and developers overworked. Further, development "in a fishbowl," that is, under public scrutiny, is not easy. Strauss stressed the value of good communication and the use of various out-of-sync communication methods like IRC and mailing lists.

Finally, Strauss discussed what portions of software can be free and what can be proprietary. His suggestion was that horizontal markets want to be free, whereas vertically one can develop proprietary solutions. The talk closed with a small stroll through the problems that project politics brings up, from things like merging branches to mailing list and source tree readability.

GENERAL TRACK SESSIONS

FILE SYSTEMS

Summarized by Haijin Yan

STRUCTURE AND PERFORMANCE OF THE

DIRECT ACCESS FILE SYSTEM

Kostas Magoutis, Salimah Addetia, Alexandra Fedorova, and Margo I. Seltzer, Harvard University; Jeffrey Chase, Andrew Gallatin, Richard Kisley, and Rajiv Wickremesinghe, Duke University; and Eran Gabber, Lucent Technologies

This paper won the Best Paper award.

The Direct Access File System (DAFS) is a new standard for network-attached storage over direct-access transport networks. DAFS takes advantage of user-level network interface standards that enable client-side functionality in which remote data access resides in a library rather than in the kernel. This reduces the memory-copy overhead for data movement as well as protocol overhead. Remote Direct Memory Access (RDMA) is a direct-access transport network that allows the network adapter to reduce copy overhead by accessing application buffers directly.

This paper explores the fundamental structural and performance characteristics of network file access using a user-level file system structure on a direct-access transport network with RDMA. It describes DAFS-based client and server reference implementations for FreeBSD and reports experimental results comparing DAFS to a zero-copy NFS implementation. It illustrates the benefits and trade-offs of these techniques to provide a basis for informed choices about deployment of DAFS-based systems and similar extensions to other network file protocols such as NFS. Experiments show that DAFS gives applications direct control over an I/O system and increases the client CPU's usage while the client is doing I/O. Future work includes how to address longstanding problems related to the integration of the application and file system for high-performance applications.

CONQUEST: BETTER PERFORMANCE THROUGH A DISK/PERSISTENT-RAM HYBRID FILE SYSTEM

An-I A. Wang, Peter Reiher, and Gerald J. Popek, UCLA; Geoffrey M. Kuenning, Harvey Mudd College

Motivated by the declining cost of persistent RAM, the authors propose the Conquest file system, which stores all small files, metadata, executables, and shared libraries in persistent RAM; disks hold only the data content of the remaining large files. Compared to alternatives such as caching and RAM file systems, Conquest has the advantages of efficiency, consistency, and reliability at a reduced cost. Using popular benchmarks, experiments show that Conquest incurs little overhead while achieving faster performance. Future work includes designing mechanisms for adjusting the file-size threshold dynamically and finding a better disk layout for large data blocks.

EXPLOITING GRAY-BOX KNOWLEDGE OF BUFFER-CACHE MANAGEMENT

Nathan C. Burnett, John Bent, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin, Madison

Knowing what algorithm is used to manage the buffer cache is very important for improving application performance. However, there is currently no interface for finding this algorithm. This paper introduces Dust, a simple fingerprinting tool that is able to identify the buffer-cache replacement policy. Dust automatically identifies the cache size and replacement policy based on attributes of the access order: recency, frequency, and long-term history. Through simulation, Dust was able to distinguish between a variety of replacement algorithm policies found in the literature: FIFO, LRU, LFU, Clock, Segmented FIFO, 2Q, and LRU-K. Further experiments fingerprinting real operating systems such as NetBSD, Linux, and Solaris show that Dust is able to identify the hidden cache replacement algorithm.
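
The fingerprinting idea can be illustrated with a small simulation. The following C sketch is not the authors' Dust tool; it is a hypothetical miniature that fills a four-entry cache, re-touches one block, forces an eviction, and then probes the touched block – enough to tell a recency-based policy (LRU-like) from an insertion-order policy (FIFO-like).

/*
 * Minimal sketch (not Dust): one probe pattern that separates FIFO from LRU.
 */
#include <stdio.h>
#include <string.h>

#define CACHE_SLOTS 4

struct cache {
    int blk[CACHE_SLOTS];     /* block number in each slot, -1 = empty   */
    int age[CACHE_SLOTS];     /* insertion (FIFO) or last-use (LRU) time */
    int use_lru;              /* 0 = FIFO, 1 = LRU                       */
    int clock;
};

static int cache_lookup(struct cache *c, int blk)
{
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (c->blk[i] == blk) {
            if (c->use_lru)
                c->age[i] = ++c->clock;   /* LRU refreshes age on a hit */
            return 1;                     /* hit */
        }
    }
    return 0;                             /* miss */
}

static void cache_insert(struct cache *c, int blk)
{
    int victim = 0;
    for (int i = 1; i < CACHE_SLOTS; i++)  /* evict the oldest entry */
        if (c->age[i] < c->age[victim])
            victim = i;
    c->blk[victim] = blk;
    c->age[victim] = ++c->clock;
}

static void access_block(struct cache *c, int blk)
{
    if (!cache_lookup(c, blk))
        cache_insert(c, blk);
}

static const char *fingerprint(int use_lru)
{
    struct cache c;
    memset(&c, 0, sizeof(c));
    for (int i = 0; i < CACHE_SLOTS; i++) c.blk[i] = -1;
    c.use_lru = use_lru;

    for (int b = 0; b < CACHE_SLOTS; b++) access_block(&c, b); /* fill  */
    access_block(&c, 0);                                       /* touch */
    access_block(&c, CACHE_SLOTS);                             /* spill */

    /* Probe: did the re-touched block survive the eviction? */
    return cache_lookup(&c, 0) ? "recency-based (LRU-like)"
                               : "order-based (FIFO-like)";
}

int main(void)
{
    printf("FIFO cache fingerprints as: %s\n", fingerprint(0));
    printf("LRU  cache fingerprints as: %s\n", fingerprint(1));
    return 0;
}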


By knowing the underlying cache replacement policy, a cache-aware Web server can reschedule the requests on a cached-request-first policy to obtain a performance improvement. Experiments show that cache-aware scheduling improves average response time and system throughput.

OPERATING SYSTEMS (AND DANCING BEARS)

Summarized by Matt Butner

THE JX OPERATING SYSTEM

Michael Golm, Meik Felser, Christian Wawersich, and Jürgen Kleinöder, University of Erlangen-Nürnberg

The talk opened by emphasizing the need for and the practicality of a functional Java OS. The JX OS follows the recent trend toward using highly abstracted languages in application development, in order to create an OS as functional and powerful as one written in a lower-level language.

JX is a microkernel solution that uses separate JVMs for each entity of the kernel and, in some cases, for each application. The separated domains do not share objects and have no thread migration, and each domain implements its own code. The interesting part of the presentation was the discussion of system-level programming with Java; key areas discussed were memory management and interrupt handlers. The authors concluded by noting that their type-safe and modular system resulted in a robust system with great configuration flexibility and acceptable performance.

DESIGN EVOLUTION OF THE EROS SINGLE-LEVEL STORE

Jonathan S. Shapiro, Johns Hopkins University; Jonathan Adams, University of Pennsylvania

This presentation outlined current characteristics of file systems, including some of their least desirable ones. It was based on the revival of the EROS single-level-store approach for current attempts to capitalize on system consistency and efficiency. Their solution did not relate directly to EROS, but rather to the design and use of the constructs and notions on which EROS is based. By extending the mapping of the memory system to include the disk, systems are further able to ensure global persistence without regard for a disk structure. In explaining the need for absolute system consistency and an environment which, by definition, is not allowed to crash, the goal of having such an exhaustive design becomes clear. The cool part of this work is the availability of its design and architecture to the public.

THINK: A SOFTWARE FRAMEWORK FOR COMPONENT-BASED OPERATING SYSTEM KERNELS

Jean-Philippe Fassino, France Télécom R&D; Jean-Bernard Stefani, INRIA; Julia Lawall, DIKU; Gilles Muller, INRIA

This presentation discussed the need for component-based operating systems and respective structures to ensure flexibility for arbitrary-sized systems. Think provides a binding model that maps uniform components for OS developers and architects to follow, ensuring consistent implementations for an arbitrary system size. However, the goal is not to force developers into a predefined kernel but to promote the use of certain components in varied ways.

Think's concentration is primarily on embedded systems, where short development time is necessary but is constrained by rigorous needs and limited resources. The need to build flexible systems in such an environment can be costly and implementation-specific, but Think creates an environment supported by the ability to dynamically load type-safe components. This allows for more flexible systems that retain functionality because of Think's ability to bind fine-grained components. Benchmarks revealed that dedicated microkernels can show performance improvements in comparison to monolithic kernels. Notable future work includes building components for low-end appliances and the development of a real-time OS component library.

BUILDING SERVICES

Summarized by Pradipta De

NINJA: A FRAMEWORK FOR NETWORK SERVICES

J. Robert von Behren, Eric A. Brewer, Nikita Borisov, Michael Chen, Matt Welsh, Josh MacDonald, Jeremy Lau, and David Culler, University of California, Berkeley

Robert von Behren presented Ninja, an ongoing project that aims to provide a framework for building robust and scalable Internet-based network services like Web-hosting, instant-messaging, email, and file-sharing applications. Robert drew attention to the difficulties of writing cluster applications: one has to take care of data consistency and issues of concurrency, and for robust applications there are problems related to fault tolerance. Ninja works as a wrapper to relieve the user of these problems. The goal of the project is building network services that are scalable and highly available, maintaining persistent data, providing graceful degradation, and supporting online evolution.

The use of clusters distinguishes this setup from generic distributed systems in terms of reliability and security, as well as providing a partition-free network. The next important feature of this project is the new programming model, which is more restrictive than a general programming model but still expressive enough to write most of the applications. This model, described as "single program, multiple connection," uses intertask parallelism instead of multithreaded concurrency and is characterized by all nodes running the same program, with connections delegated to the nodes by a centralized connection manager (CM). A CM takes care of hiding the details of mapping an external connection to an internal node. Ninja also uses a design called "staged event-driven architecture," where each service is broken down into a set of stages connected by event queues. This architecture is suitable for a modular design and helps in graceful degradation by adaptive load shedding from the event queues.

The presentation concluded with examples of implementation and evaluation of a Web server and an email system called NinjaMail to show the ease of authoring and the efficacy of using the Ninja framework for service development.

USING COHORT-SCHEDULING TO ENHANCE SERVER PERFORMANCE

James R. Larus and Michael Parkes, Microsoft Research

James Larus presented a new scheduling policy for increasing the throughput of server applications. Cohort scheduling batches the execution of similar operations arising in different server requests. The usual programming paradigm in handling server requests is to spawn multiple concurrent threads and switch from one thread to another whenever a thread gets blocked for I/O. Since the threads in a server mostly execute unrelated pieces of code, the locality of reference is reduced, and hence the effectiveness of different caching mechanisms. One way to solve this problem is to throw in more hardware, but Larus presented a complementary view in which the program behavior is investigated and used to improve performance.

The problem in this scheme is to identify pieces of code that can be batched together for processing. One simple way is to look at the next program counter values and use them to club threads together. However, this talk presented a new programming abstraction, "staged computation," which replaces the thread model with "stages." A stage is an abstraction for operations with similar behavior and locality. The StagedServer library can be used for programming in this model. It is a collection of C++ classes that implement staged computation and cohort scheduling on either a uniprocessor or multiprocessor, and it can be used to define stages.
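
The batching idea can be sketched compactly. The C fragment below is not the StagedServer library (which is a C++ class library); the stage_enqueue/stage_run_cohort names are hypothetical. It only shows the core move: queue similar operations per stage, then run them back to back so one stage's code and data stay hot in the cache.

/*
 * Hypothetical cohort-scheduling sketch: per-stage queues, batched execution.
 */
#include <stdio.h>

#define MAX_PENDING 64

typedef void (*op_fn)(int request_id);

struct stage {
    const char *name;
    op_fn       handler;                /* work all ops in this stage share  */
    int         pending[MAX_PENDING];   /* request ids waiting at this stage */
    int         npending;
};

static void stage_enqueue(struct stage *s, int request_id)
{
    if (s->npending < MAX_PENDING)
        s->pending[s->npending++] = request_id;
}

/* Run every queued operation back to back: one cohort, one cache footprint. */
static void stage_run_cohort(struct stage *s)
{
    printf("cohort of %d ops in stage '%s'\n", s->npending, s->name);
    for (int i = 0; i < s->npending; i++)
        s->handler(s->pending[i]);
    s->npending = 0;
}

static void parse_request(int id) { printf("  parse request %d\n", id); }
static void send_response(int id) { printf("  respond to %d\n", id); }

int main(void)
{
    struct stage parse = { "parse",   parse_request, {0}, 0 };
    struct stage reply = { "respond", send_response, {0}, 0 };

    /* Requests arrive interleaved, but are processed per stage, in batches. */
    for (int id = 1; id <= 4; id++)
        stage_enqueue(&parse, id);
    stage_run_cohort(&parse);

    for (int id = 1; id <= 4; id++)
        stage_enqueue(&reply, id);
    stage_run_cohort(&reply);
    return 0;
}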

The presentation showed the experimental evaluation of cohort scheduling over the thread-based model by implementing two servers: a Web server, which is mainly I/O bound, and a publish-subscribe server, which is mainly compute bound. The SURGE benchmark was used for the first experiment and the Fabret workload for the second. The results showed that the cohort-scheduling-based implementation gave better throughput than the thread-based implementation at high loads.

NETWORK PERFORMANCE

Summarized by Xiaobo Fan

ETE: PASSIVE END-TO-END INTERNET SERVICE PERFORMANCE MONITORING

Yun Fu and Amin Vahdat, Duke University; Ludmila Cherkasova and Wenting Tang, HP Labs

This paper won the Best Student Paper award.

Ludmila Cherkasova began by listing several questions most Web service providers want answered in order to improve service quality. She reviewed the difficulties of making accurate and efficient end-to-end Web service measurements and the shortcomings of currently available approaches. They propose a passive trace-based architecture called EtE to monitor Web server performance on behalf of end users.

The first step is to collect network packets passively. The second module reconstructs all TCP connections and extracts HTTP transactions. To reconstruct Web page accesses, they first build a knowledge base indexed by client IP and URL and then group objects to the related Web pages they are embedded in. Statistical analysis is used to handle non-matched objects. EtE Monitor can generate three groups of metrics to measure Web service performance: response time, Web caching, and page abortion.

To demonstrate the benefits of EtE Monitor, Cherkasova talked about two case studies and, based on the calculated metrics, gave some insightful explanations about what's happening behind the variations of Web performance. Through validation, they claim their approach provides a very close approximation to the real scenario.

THE PERFORMANCE OF REMOTE DISPLAY MECHANISMS FOR THIN-CLIENT COMPUTING

S. Jae Yang, Jason Nieh, Matt Selsky, and Nikhil Tiwari, Columbia University

Noting the trend toward thin-client computing, the authors compared different techniques and design choices in measuring the performance of six popular thin-client platforms – Citrix MetaFrame, Microsoft Terminal Services, Sun Ray, Tarantella, VNC, and X. After pointing out several challenges in benchmarking thin clients, Yang proposed slow-motion benchmarking to achieve non-invasive packet monitoring and consistent visual quality. Basically, they insert delays between separate visual events in the benchmark applications on the server side so that the client's display update can catch up with the server's processing speed. The experiments are conducted on an emulated network over a range of network bandwidths.

Their results show that thin clients can provide good performance for Web applications in LAN environments, but only some platforms performed well for the video benchmark. Pixel-based encoding may achieve better performance and bandwidth efficiency than high-level graphics. Display caching and compression should be used with care.

A MECHANISM FOR TCP-FRIENDLY TRANSPORT-LEVEL PROTOCOL COORDINATION

David E. Ott and Ketan Mayer-Patel, University of North Carolina, Chapel Hill

A revised transport-level protocol optimized for cluster-to-cluster (C-to-C) applications is what David Ott tries to explain in his talk. A C-to-C application class is identified as one set of processes communicating to another set of processes across a common Internet path. The fundamental problem with C-to-C applications is how to coordinate all C-to-C communication flows so that they share a consistent view of the common C-to-C network, adapt to changing network conditions, and cooperate to meet specific requirements. Aggregation points (APs) are placed at the first and last hop of the common data path to probe network conditions (latency, bandwidth, loss rate, etc.). To carry and transfer this information, a new protocol – the Coordination Protocol (CP) – is inserted between the network layer (IP) and the transport layer (TCP, UDP, etc.). Ott illustrated how the CP header is updated and used when packets originate from the source and traverse through the local and remote APs to arrive at their destination, and how an AP maintains a per-cluster state table and detects network conditions. Through simulation results, this coordination mechanism appears effective in sharing common network resources among C-to-C communication flows.
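
A rough idea of what such a shim layer carries can be given as a struct. The layout below is hypothetical – it is not the CP header defined by the authors – and merely illustrates the kind of path state (latency, bandwidth, loss) an aggregation point could stamp into packets on the shared path.

/*
 * Hypothetical coordination-header sketch; field names and sizes are invented.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct cp_header {                /* sits between IP and TCP/UDP (conceptually) */
    uint32_t cluster_id;          /* which C-to-C flow aggregate this is        */
    uint16_t probe_seq;           /* sequence number for path probing           */
    uint16_t rtt_ms;              /* latest RTT estimate stamped by an AP       */
    uint32_t avail_bw_kbps;       /* available bandwidth on the shared path     */
    uint16_t loss_per_10k;        /* loss rate in packets per 10,000            */
    uint16_t reserved;
};

/* An aggregation point updates the header in place as the packet passes. */
static void ap_stamp(struct cp_header *h, uint16_t rtt_ms,
                     uint32_t bw_kbps, uint16_t loss_per_10k)
{
    h->rtt_ms        = rtt_ms;
    h->avail_bw_kbps = bw_kbps;
    h->loss_per_10k  = loss_per_10k;
}

int main(void)
{
    unsigned char packet[128];            /* stand-in for an outgoing packet */
    struct cp_header h = { .cluster_id = 7, .probe_seq = 42 };

    ap_stamp(&h, 35, 2000, 12);           /* AP at the first hop of the path */
    memcpy(packet, &h, sizeof(h));        /* CP header precedes the payload  */

    printf("cluster %u: rtt=%ums bw=%ukbps loss=%u/10k\n",
           h.cluster_id, h.rtt_ms, h.avail_bw_kbps, h.loss_per_10k);
    return 0;
}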

STORAGE SYSTEMS

Summarized by Praveen Yalagandula

MY CACHE OR YOURS? MAKING STORAGE MORE EXCLUSIVE

Theodore M. Wong, Carnegie Mellon University; John Wilkes, HP Labs

Theodore Wong explained the inefficiency of current "inclusive" caching schemes in storage area networks – when a client accesses a block, the block is read from the disk and is cached at both the disk array cache and the client's cache. He then presented the concept of "exclusive" caching, where a block accessed by a client is cached only in that client's cache. On eviction from the client's cache, they have come up with a DEMOTE operation to move the data block to the tail of the array cache.
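
A highly simplified sketch of the exclusive-caching idea follows, assuming tiny LRU lists for both caches and ignoring reads that hit the array cache; it is not the authors' implementation, and the placement of demoted blocks in the array list is simplified. It only shows the essential move: an eviction from the client cache becomes a DEMOTE into the array cache rather than a discard.

/*
 * Toy exclusive caching with DEMOTE (not the paper's system).
 */
#include <stdio.h>

#define CLIENT_SLOTS 2
#define ARRAY_SLOTS  3

struct lru { int blk[8]; int n, cap; };

static int lru_contains(struct lru *c, int b)
{
    for (int i = 0; i < c->n; i++)
        if (c->blk[i] == b) return 1;
    return 0;
}

/* Insert at the head (most recent); return the evicted block or -1. */
static int lru_insert(struct lru *c, int b)
{
    int evicted = -1;
    if (c->n == c->cap)
        evicted = c->blk[--c->n];         /* drop the tail entry */
    for (int i = c->n; i > 0; i--)
        c->blk[i] = c->blk[i - 1];
    c->blk[0] = b;
    c->n++;
    return evicted;
}

static void client_read(struct lru *client, struct lru *array, int b)
{
    if (lru_contains(client, b)) return;          /* client hit               */
    /* On a miss the block goes only into the client cache (exclusive).       */
    int demoted = lru_insert(client, b);
    if (demoted >= 0)
        lru_insert(array, demoted);               /* DEMOTE to the array cache */
    printf("read %d: client holds %d blocks, array holds %d\n",
           b, client->n, array->n);
}

int main(void)
{
    struct lru client = { {0}, 0, CLIENT_SLOTS };
    struct lru array  = { {0}, 0, ARRAY_SLOTS };
    int workload[] = { 1, 2, 3, 4, 1 };

    for (int i = 0; i < 5; i++)
        client_read(&client, &array, workload[i]);
    return 0;
}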

The new exclusive caching schemes are evaluated on both single-client and multiple-client systems. The single-client results are presented for two different types of synthetic workloads: Random (transaction-processing-type workloads) and Zipf (typical Web workloads). The exclusive policy was quite effective in achieving higher hit rates and lower read latencies. They also showed a 2.2-times hit-rate improvement over inclusive techniques in the case of a real-life workload, httpd, with a single-client setting.

The DEMOTE scheme also performed well in the case of multiple-client systems when the data accessed by clients is disjoint. In the case of shared-data workloads, this scheme performed worse than the inclusive schemes. Theodore then presented an adaptive exclusive caching scheme – a block accessed by a client is placed at the tail in the client's cache and also in the array cache at an appropriate place determined by the popularity of the block. The popularity of the block is measured by maintaining a ghost cache to accumulate the number of times each block is accessed. This new adaptive scheme achieved a maximum of 52% speedup in mean latency in the experiments with real-life workloads.

For more information, visit http://www.cs.cmu.edu/~tmwong/research and http://www.hpl.hp.com/research/itc/csl/ssp.

BRIDGING THE INFORMATION GAP IN STORAGE PROTOCOL STACKS

Timothy E. Denehy, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin, Madison

Currently, there is a huge information gap between storage systems and file systems. The interface exposed by storage systems to file systems, based on blocks and providing only read/write interfaces, is very narrow. This leads to poor performance as a whole, because of duplicated functionality in both systems and reduced functionality resulting from storage systems' lack of file information.

The speaker presented two enhancements: ExRAID, an exposed RAID, and ILFS, the Informed Log-Structured File System. ExRAID is an enhancement to the block-based RAID storage system that exposes the following three types of information to the file system: (1) regions – contiguous portions of the address space comprising one or multiple disks; (2) performance information about queue lengths and throughput of the regions, revealing disk heterogeneity to the file systems; and (3) failure information – dynamically updated information conveying the number of tolerable failures in each region.

ILFS allows online incremental expansion of the storage space, performs dynamic parallelism using ExRAID's performance information to perform well on heterogeneous storage systems, and provides a range of different mechanisms with different granularities for redundancy of files using ExRAID's region failure characteristics. These new features are added to LFS with only a 19% increase in the code size.

More information: http://www.cs.wisc.edu/wind

MAXIMIZING THROUGHPUT IN REPLICATED DISK STRIPING OF VARIABLE BIT-RATE STREAMS

Stergios V. Anastasiadis, Duke University; Kenneth C. Sevcik and Michael Stumm, University of Toronto

There is an increasing demand for the continuous real-time streaming of media files. Disk striping is a common technique used for supporting many connections on media servers. But even with 1.2 million hours of mean time between failures, there will be more than one disk failure per week with 7000 disks. Fault tolerance can be achieved by using data replication and reserving extra bandwidth during normal operation. This work focused on supporting the most common variable bit-rate stream formats (e.g., MPEG).

The authors used the prototype media server EXEDRA to implement and evaluate different replication and bandwidth reservation schemes. This system supports a variable bit-rate scheme, does stride-based disk allocation, and is capable of supporting variable-grain disk striping. Two replication policies are presented: deterministic – data blocks are replicated in a round-robin fashion on the secondary replicas; and random – data blocks are replicated on randomly chosen secondary replicas. Two bandwidth reservation techniques are also presented: mirroring reservation – disk bandwidth is reserved for both primary and replicas of the media file during playback; and minimum reservation – a more efficient scheme in which bandwidth is reserved only for the sum of the primary data access time and the maximum of the backup data access times required in each round.
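
The difference between the two reservation policies is easiest to see with made-up numbers. The sketch below is not the EXEDRA scheduler; it simply contrasts reserving for all primary and backup transfers against reserving for the primary total plus the single worst backup transfer in a round. All access times are invented.

/*
 * Toy comparison of mirroring vs. minimum bandwidth reservation for one disk.
 */
#include <stdio.h>

int main(void)
{
    /* per-stream access times (ms) this disk would serve in one round */
    double primary[] = { 12.0, 8.0, 15.0 };   /* primary replicas       */
    double backup[]  = { 10.0, 9.0, 14.0 };   /* backups it might serve */
    int n = 3;

    double sum_primary = 0.0, sum_backup = 0.0, max_backup = 0.0;
    for (int i = 0; i < n; i++) {
        sum_primary += primary[i];
        sum_backup  += backup[i];
        if (backup[i] > max_backup)
            max_backup = backup[i];
    }

    double mirroring = sum_primary + sum_backup;   /* reserve for everything   */
    double minimum   = sum_primary + max_backup;   /* reserve for worst backup */

    printf("mirroring reservation: %.1f ms per round\n", mirroring);
    printf("minimum reservation:   %.1f ms per round\n", minimum);
    return 0;
}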

Experimental results showed that deterministic replica placement is better than random placement for small disks, minimum disk bandwidth reservation is twice as good as mirroring in the throughput achieved, and fault tolerance can be achieved with a minimal impact on throughput.

TOOLS

Summarized by Li Xiao

SIMPLE AND GENERAL STATISTICAL PROFILING WITH PCT

Charles Blake and Steven Bauer, MIT

This talk introduced the Profile Collection Toolkit (PCT) – a sampling-based CPU profiling facility. A novel aspect of PCT is that it allows sampling of semantically rich data such as function call stacks, function parameters, local or global variables, CPU registers, or other execution contexts. The design objectives of PCT were driven by user needs and the inadequacies or inaccessibility of prior systems.

The rich data collection capability is achieved via a debugger-controller program, dbct1. The talk then introduced the profiling collector and data collection activation. PCT file format options provide flexibility and various ways to store exact states. PCT also has analysis options; two examples were given – "Simple" and "Mixed User and Linux Code."

The talk then went into how general sampling helps in the following areas: a debugger-controller state machine, example script fragments, a simple example program, and general value reports in terms of call-path histograms, numeric value histograms, and data generality. Related work such as gprof and Expect was then summarized. The contribution of this work is to provide a portable, general value sampling tool. The limitations are that PCT does not support much of strip, and count inaccuracies happen because of its statistical nature. Because of this limitation, -g is preferred over strip.

Possible future directions included supporting more debuggers, such as dbx and xdb, script generators for other kinds of traces, more canned reports for general values, and a libgdb-based library sampler. PCT is available at http://pdos.lcs.mit.edu/pct and runs on almost any UNIX-like system.

ENGINEERING A DIFFERENCING AND COMPRESSION DATA FORMAT

David G. Korn and Kiem Phong Vo, AT&T Laboratories – Research

This talk began with an equation: "Differencing + Compression = Delta Compression." Compression removes information redundancy in a data set; examples are gzip, bzip, compress, and pack. Differencing encodes differences between two data sets; examples are diff -e, fdelta, bdiff, and xdelta. Delta compression compresses a data set given another, combining differencing and compression, and reduces to pure compression when there's no commonality.

After presenting an overall scheme for a delta compressor and showing delta compression performance, this talk discussed the encoding format of the newly designed Vcdiff for delta compression. A Vcdiff instruction code table consists of 256 entries, each coding up to a pair of instructions, and recodes 1-byte indices and any additional data.
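
A toy delta encoder makes the "differencing + compression" equation concrete. The sketch below is emphatically not the Vcdiff format: it only emits COPY(offset,len) instructions for runs found in the source and ADD instructions for literal bytes, leaving real string matching and entropy coding to a real tool.

/*
 * Naive delta encoding sketch: COPY runs found in the source, ADD the rest.
 */
#include <stdio.h>
#include <string.h>

/* Find the longest match of tgt[tpos..] inside src; return its length. */
static size_t longest_match(const char *src, size_t slen,
                            const char *tgt, size_t tlen, size_t tpos,
                            size_t *match_off)
{
    size_t best = 0;
    for (size_t off = 0; off < slen; off++) {
        size_t len = 0;
        while (off + len < slen && tpos + len < tlen &&
               src[off + len] == tgt[tpos + len])
            len++;
        if (len > best) { best = len; *match_off = off; }
    }
    return best;
}

static void delta_encode(const char *src, const char *tgt)
{
    size_t slen = strlen(src), tlen = strlen(tgt), pos = 0;
    while (pos < tlen) {
        size_t off = 0;
        size_t len = longest_match(src, slen, tgt, tlen, pos, &off);
        if (len >= 4) {                       /* long enough to be worth a COPY */
            printf("COPY off=%zu len=%zu\n", off, len);
            pos += len;
        } else {
            printf("ADD  '%c'\n", tgt[pos]);  /* literal byte not in the source */
            pos++;
        }
    }
}

int main(void)
{
    delta_encode("the quick brown fox", "the quick red fox");
    return 0;
}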

The talk then showed Vcdiff's performance with Web data collected from CNN, compared with the W3C standard Gdiff, Gdiff+gzip, and diff+gzip, where Gdiff was computed using Vcdiff delta instructions. The results of two experiments, "First" and "Successive," were presented. In "First," each file is compressed against the first file collected; in "Successive," each file is compressed against the one from the previous hour. The diff+gzip did not work well because diff is line-oriented. Vcdiff performed favorably compared with other formats.

Vcdiff is part of the Vcodex package. Vcodex is a platform for all common data transformations: delta compression, plain compression, encryption, and transcoding (e.g., uuencode, base64). It is structured in three layers for maximum usability; the base library uses Disciplines and Methods interfaces.

The code for Vcdiff can be found at http://www.research.att.com/sw/tools. They are moving Vcdiff to the IETF standard as a comprehensive platform for transforming data; please refer to http://www.ietf.org/internet-drafts/draft-korn-vcdiff-06.txt.

WHERE IN THE NET

Summarized by Amit Purohit

A PRECISE AND EFFICIENT EVALUATION OF THE PROXIMITY BETWEEN WEB CLIENTS AND THEIR LOCAL DNS SERVERS

Zhuoqing Morley Mao, University of California, Berkeley; Charles D. Cranor, Fred Douglis, Michael Rabinovich, Oliver Spatscheck, and Jia Wang, AT&T Labs – Research

Content Distribution Networks (CDNs) attempt to improve Web performance by delivering Web content to end users from servers located at the edge of the network. When a Web client requests content, the CDN dynamically chooses a server to route the request to, usually the one that is nearest the client. As CDNs have access only to the IP address of the local DNS server (LDNS) of the client, the CDN's authoritative DNS server maps the client's LDNS to a geographic region within a particular network and combines that with network and server load information to perform CDN server selection.

This method has two limitations. First, it is based on the implicit assumption that clients are close to their LDNS, which may not always be valid. Second, a single request from an LDNS can represent a differing number of Web clients – called the hidden load factor. Knowledge of the hidden load factor can be used to achieve better load distribution.

The extent of the first limitation and its impact on CDN server selection were examined. To determine the associations between clients and their LDNS, a simple, non-intrusive, and efficient mapping technique was developed. The data collected was used to study the impact of proximity on DNS-based server selection, using four different proximity metrics: (1) autonomous system (AS) clustering – observing whether a client is in the same AS as its LDNS – concluded that LDNS is very good for coarse-grained server selection, as 64% of the associations belong to the same AS; (2) network clustering – observing whether a client is in the same network-aware cluster (NAC) – implied that DNS is less useful for finer-grained server selection, since only 16% of the clients and LDNS are in the same NAC; (3) traceroute divergence – the length of the divergent paths to the client and its LDNS from a probe point, using traceroute – implies that most clients are topologically close to their LDNS as viewed from a randomly chosen probe site; and (4) round-trip time (RTT) correlation (some CDNs select servers based on RTT between the CDN server and the client's LDNS), which examines the correlation between the message RTTs from a probe point to the client and its local DNS; the results imply this is a reasonable metric to use to avoid really distant servers.

A study of the impact that client-LDNS associations have on DNS-based server selection concludes that knowing the client's IP address would allow more accurate server selection in a large number of cases. The optimality of the server selection also depends on the server density, placement, and selection algorithms.

Further information can be found at http://www.eecs.berkeley.edu/~zmao/myresearch.html or by contacting zmao@eecs.berkeley.edu.

GEOGRAPHIC PROPERTIES OF INTERNET ROUTING

Lakshminarayanan Subramanian, Venkata N. Padmanabhan, Microsoft Research; Randy H. Katz, University of California at Berkeley

Geographic information can provide insights into the structure and functioning of the Internet, including interactions between different autonomous systems, by analyzing certain properties of Internet routing. It can be used to measure and quantify certain routing properties, such as circuitous routing, hot-potato routing, and geographic fault tolerance.

Traceroute has been used to gather the required data, and the Geotrack tool has been used to determine the location of the nodes along each network path. This enables the computation of "linearized distances," that is, the sum of the geographic distances between successive pairs of routers along the path.

In order to measure the circuitousness of a path, a metric, the "distance ratio," has been defined as the ratio of the linearized distance of a path to the geographic distance between the source and destination of the path. From the data it has been observed that the circuitousness of a route depends on both the geographic and network location of the end hosts. A large value of the distance ratio enables us to flag paths that are highly circuitous, possibly (though not necessarily) because of routing anomalies. It is also shown that the minimum delay between end hosts is strongly correlated with the linearized distance of the path.
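
Both metrics are simple to compute once router locations are known. The sketch below uses invented coordinates and a standard great-circle formula; it is only meant to show how the linearized distance and the distance ratio fall out of a list of router positions (the hard part, locating the routers, is what Geotrack does).

/*
 * Linearized distance and distance ratio for a hypothetical router path.
 * Build with -lm.
 */
#include <math.h>
#include <stdio.h>

#define PI 3.14159265358979323846
#define EARTH_RADIUS_KM 6371.0

struct coord { double lat, lon; };       /* degrees */

static double great_circle_km(struct coord a, struct coord b)
{
    double la1 = a.lat * PI / 180.0, la2 = b.lat * PI / 180.0;
    double dlat = la2 - la1, dlon = (b.lon - a.lon) * PI / 180.0;
    double h = sin(dlat / 2) * sin(dlat / 2) +
               cos(la1) * cos(la2) * sin(dlon / 2) * sin(dlon / 2);
    return 2.0 * EARTH_RADIUS_KM * asin(sqrt(h));
}

int main(void)
{
    /* invented path: source, two intermediate routers, destination */
    struct coord path[] = {
        { 37.87, -122.27 },   /* Berkeley */
        { 47.61, -122.33 },   /* Seattle  */
        { 41.88,  -87.63 },   /* Chicago  */
        { 40.71,  -74.01 },   /* New York */
    };
    int n = sizeof(path) / sizeof(path[0]);

    double linearized = 0.0;
    for (int i = 0; i + 1 < n; i++)
        linearized += great_circle_km(path[i], path[i + 1]);

    double direct = great_circle_km(path[0], path[n - 1]);
    printf("linearized distance: %.0f km\n", linearized);
    printf("geographic distance: %.0f km\n", direct);
    printf("distance ratio:      %.2f\n", linearized / direct);
    return 0;
}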

Geographic information can be used to study various aspects of wide-area Internet paths that traverse multiple ISPs. It was found that end-to-end Internet paths tend to be more circuitous than intra-ISP paths, the cause for this being the peering relationships between ISPs. Also, paths traversing substantial distances within two or more ISPs tend to be more circuitous than paths largely traversing only a single ISP. Another finding is that ISPs generally employ hot-potato routing.

Geographic information can also be used to capture the fact that two seemingly unrelated routers can be susceptible to correlated failures. By using the geographic information of routers, we can construct a geographic topology of an ISP. Using this, we can find the tolerance of an ISP's network to total network failure in a geographic region.

For further information, contact lakme@cs.berkeley.edu.

PROVIDING PROCESS ORIGIN INFORMATION TO AID IN NETWORK TRACEBACK

Florian P. Buchholz, Purdue University; Clay Shields, Georgetown University

Network traceback is currently limited because host audit systems do not maintain enough information to match incoming network traffic to outgoing network traffic. The talk presented an alternative: assigning origin information to every process and logging it during interactive login creation.

The current implementation concentrates mainly on interactive sessions, in which an event is logged when a new connection is established using SSH or Telnet. The method is effective and could successfully determine stepping stones and reliably detect the source of a DDoS attack. The speaker then talked about the related work done in this area, which led to a discussion of the latest developments in the area of "host causality."

The speaker ended with the limitations and future directions of the framework. This framework could be extended and could find many applications in the area of security. The current implementation doesn't take care of startup scripts and cron jobs, but incorporating the origin information in the FS could solve this problem. In the current implementation, logging is just implemented as a proof of concept. It could be made safe in many ways, and this could be another important aspect of future work.

PROGRAMMING

Summarized by Amit Purohit

CYCLONE: A SAFE DIALECT OF C

Trevor Jim, AT&T Labs – Research; Greg Morrisett, Dan Grossman, Michael Hicks, James Cheney, and Yanling Wang, Cornell University

Cyclone is designed to provide protection against attacks such as buffer overflows, format string attacks, and memory management errors. The current structure of C allows programmers to write vulnerable programs. Cyclone extends C so that it has the safety guarantees of Java while keeping C syntax, types, and semantics largely untouched.

The Cyclone compiler performs static analysis of the source code and inserts runtime checks into the compiled output at places where the analysis cannot determine that the code execution will not violate safety constraints. Cyclone imposes many restrictions to preserve safety, such as NULL checks; these checks do not exist in normal C.

The speaker then talked about some sample code written in Cyclone and how it tackles safety problems. Porting an existing C application to Cyclone is pretty easy, with fewer than a 10% change required. The current implementation concentrates more on safety than on performance – hence the performance penalty is substantial. Cyclone was able to find many lingering bugs. The final step of the project will be to write a compiler to convert a normal C program to an equivalent Cyclone program.

COOPERATIVE TASK MANAGEMENT WITHOUT MANUAL STACK MANAGEMENT

Atul Adya, Jon Howell, Marvin Theimer, William J. Bolosky, and John R. Douceur, Microsoft Research

The speaker described the definitions and motivations behind the distinct concepts of task management, stack management, I/O management, conflict management, and data partitioning. Conventional concurrent programming uses preemptive task management and exploits the automatic stack management of a standard language. In the second approach, cooperative tasks are organized as event handlers that yield control by returning control to the event scheduler, manually unrolling their stacks. In this project, they have adopted a hybrid approach that makes it possible for both stack management styles to coexist. Thus, programmers can code assuming one or the other of these stack management styles will operate, depending upon the application. The speaker also gave a detailed example of how to use their mechanism to switch between the two styles.

The talk then continued with the implementation. They were able to preempt many subtle concurrency problems by using cooperative task management. Paying a cost up front to reduce a subtle race condition proved to be a good investment.

Though the choice of task management is fundamental, the choice of stack management can be left to individual taste. This project enables the use of any type of stack management in conjunction with cooperative task management.

IMPROVING WAIT-FREE ALGORITHMS FOR INTERPROCESS COMMUNICATION IN EMBEDDED REAL-TIME SYSTEMS

Hai Huang, Padmanabhan Pillai, and Kang G. Shin, University of Michigan

The main characteristic of a real-time, time-sensitive system is its predictable response time. But concurrency management creates hurdles to achieving this goal because of the use of locks to maintain consistency. To solve this problem, many wait-free algorithms have been developed, but these are typically high-cost solutions. By taking advantage of the temporal characteristics of the system, however, the time and space overhead can be reduced.

The speaker presented an algorithm for temporal concurrency control and applied this technique to improve three wait-free algorithms. A single-writer multiple-reader wait-free algorithm and a double-buffer algorithm were proposed.
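
To give a flavor of the double-buffer idea, here is a sketch using C11 atomics: the writer fills the unpublished buffer and flips an index, so neither side takes a lock. This simplified sketch is not the authors' algorithm; it assumes a single reader that finishes copying before the writer reuses a buffer, whereas the paper's algorithms (and their temporal transformation) address the general case.

/*
 * Simplified single-writer/single-reader double buffer with an atomic index.
 */
#include <stdatomic.h>
#include <stdio.h>

struct sample { int sensor_id; int value; };

static struct sample bufs[2];
static atomic_int published = 0;     /* index of the buffer readers may use */

/* Writer: always fills the unpublished buffer, then flips the index. */
static void write_sample(int sensor_id, int value)
{
    int idx = 1 - atomic_load_explicit(&published, memory_order_relaxed);
    bufs[idx].sensor_id = sensor_id;
    bufs[idx].value     = value;
    atomic_store_explicit(&published, idx, memory_order_release);
}

/* Reader: snapshots whichever buffer is currently published. */
static struct sample read_sample(void)
{
    int idx = atomic_load_explicit(&published, memory_order_acquire);
    return bufs[idx];
}

int main(void)
{
    write_sample(3, 42);
    struct sample s = read_sample();
    printf("sensor %d = %d\n", s.sensor_id, s.value);

    write_sample(3, 43);
    s = read_sample();
    printf("sensor %d = %d\n", s.sensor_id, s.value);
    return 0;
}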

Using their transformation mechanism, they achieved an improvement of 17–66% in ACET and a 14–70% reduction in memory requirements for IPC algorithms. This mechanism is extensible and can be applied to other non-blocking IPC algorithms as well. Future work involves reducing the synchronization overheads in more general IPC algorithms with multiple-writer semantics.

MOBILITY

Summarized by Praveen Yalagandula

ROBUST POSITIONING ALGORITHMS FOR DISTRIBUTED AD-HOC WIRELESS SENSOR NETWORKS

Chris Savarese and Jan Rabaey, University of California, Berkeley; Koen Langendoen, Delft University of Technology

The Pico Radio Network comprises more than a hundred sensors, monitors, and actuators equipped with wireless connectivity, with the following properties: (1) no infrastructure; (2) computation in a distributed fashion instead of centralized computation; (3) dynamic topology; (4) limited radio range; (5) nodes that can act as repeaters; (6) devices of low cost and with low power; and (7) sparse anchor nodes – nodes with GPS capability. A positioning algorithm determines the geographical position of each node in the network.

In two-dimensional space, each node needs three reference positions to estimate its geographical position. There are two problems that make positioning difficult in the Pico Radio Network setting: (1) a sparse anchor node problem, and (2) a range error problem.

The two-phase approach that the authors have taken in solving the problem consists of (1) a hop-terrain algorithm – in this first phase, each node roughly guesses its location by the distance calculated using multi-hops to the anchor points – and (2) a refinement algorithm – in this second phase, each node uses its neighbors' positions to refine its own position estimate. To guarantee convergence, this approach uses confidence metrics, assigning a value of 1.0 for anchor nodes and 0.1 to start for other nodes, and increasing these with each iteration.

A simulation tool, OMNet++, was used for both phases and in various scenarios. The results show that they achieved position errors of less than 33% in a scenario with 5% anchor nodes, an average connectivity of 7, and 5% range measurement error.

Guidelines for anchor node deployment are: high connectivity (>10), a reasonable fraction of anchor nodes (>5%), and, for anchor placement, covered edges.

APPLICATION-SPECIFIC NETWORK MANAGEMENT FOR ENERGY-AWARE STREAMING OF POPULAR MULTIMEDIA FORMATS

Surendar Chandra, University of Georgia, and Amin Vahdat, Duke University

The main hindrance in supporting the increasing demand for mobile multimedia on PDAs is the battery capacity of these small devices. The idle-time power consumption is almost the same as the receive power consumption on typical wireless network cards (1319mW vs. 1425mW), while the sleep state consumes far less power (177mW). To let the system consume energy proportional to the stream quality, the network card should transition to the sleep state aggressively between packets.

The speaker presented previous work where different multimedia streams were studied for PDAs: MS Media, Real Media, and Quicktime. It was found that if the inter-packet gap is predictable, then huge savings are possible, for example, about 80% savings in MS Media streams with few packet losses. The limitation of the IEEE 802.11b power-save mode comes into play when there are two or more nodes, producing a delay between beacon time and when the node receives/sends packets. This badly affects the streams, since higher energy is consumed while waiting.

The authors propose traffic shaping for energy conservation, where this is done by a proxy such that packets arrive at regular intervals. This is achieved using a local proxy in the access point and a client-side proxy in the mobile host. Simulations show that traffic shaping reduces energy consumption and also reveal that higher-bandwidth streams have a lower energy metric (mJ/kB).
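
A back-of-the-envelope calculation shows why shaping matters. The sketch below is not the authors' simulator; it reuses the power numbers quoted above, invents a packet schedule and a per-packet receive time, and compares idling through unpredictable gaps with sleeping through proxy-regularized ones.

/*
 * Toy energy comparison: idle through bursty gaps vs. sleep through shaped ones.
 */
#include <stdio.h>

#define P_IDLE_MW  1319.0
#define P_RECV_MW  1425.0
#define P_SLEEP_MW  177.0
#define RECV_MS       2.0      /* assumed time to receive one packet */

static double energy_mj(const double *gaps_ms, int n, int can_sleep)
{
    double e = 0.0;
    for (int i = 0; i < n; i++) {
        double wait = gaps_ms[i] - RECV_MS;
        e += RECV_MS * P_RECV_MW / 1000.0;                      /* receive  */
        e += wait * (can_sleep ? P_SLEEP_MW : P_IDLE_MW) / 1000.0;
    }
    return e;
}

int main(void)
{
    /* unshaped: bursty, unpredictable gaps -> the card must stay idle */
    double bursty[] = { 5, 5, 190, 5, 5, 190 };
    /* shaped by the proxy: one packet every 100 ms -> the card can sleep */
    double shaped[] = { 100, 100, 100, 100, 100, 100 };

    printf("unshaped (idle between packets):  %.1f mJ\n",
           energy_mj(bursty, 6, 0));
    printf("shaped   (sleep between packets): %.1f mJ\n",
           energy_mj(shaped, 6, 1));
    return 0;
}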

More information is at http://greenhouse.cs.uga.edu.

CHARACTERIZING ALERT AND BROWSE SERVICES OF MOBILE CLIENTS

Atul Adya, Paramvir Bahl, and Lili Qiu, Microsoft Research

Even though there is a dramatic increase in Internet access from wireless devices, there are not many studies done on characterizing this traffic. In this paper, the authors characterize the traffic observed on the MSN Web site with both notification and browse traces.

Around 33 million browsing accesses and about 325 million notification entries are present in the traces. Three types of analyses are done for each of these two services: content analysis, concerning the most popular content categories and their distribution; popularity analysis, or the popularity distribution of documents; and user behavior analysis.

The analysis of the notification logs shows that document access rates follow a Zipf-like distribution, with most of the accesses concentrated on a small number of messages, and the accesses exhibit geographical locality – users from the same locality tend to receive similar notification content. The browser log analysis shows that a smaller set of URLs is accessed most times, though the access pattern does not fit any Zipf curve, and the highly accessed URLs remain stable. A correlation study between the notification and browsing services shows that wireless users have a moderate correlation of 0.12.

FREENIX TRACK SESSIONS

BUILDING APPLICATIONS

Summarized by Matt Butner

INTERACTIVE 3-D GRAPHICS APPLICATIONS FOR TCL

Oliver Kersting and Jürgen Döllner, Hasso Plattner Institute for Software Systems Engineering, University of Potsdam

The integration of 3-D image rendering functionality into a scripting language permits interactive and animated 3-D development and application without the formalities and precision demanded by low-level C/C++ graphics and visualization libraries. The large and complex C++ API of the Virtual Rendering System (VRS) can be combined with the conveniences of the Tcl scripting language. The mapping of class interfaces is done via an automated process and generates respective wrapper classes, all of which ensures complete API accessibility and functionality without any significant performance penalty. The general C++-to-Tcl mapper SWIG grants the necessary object and hierarchical abilities without the use of object-oriented Tcl extensions. The final API mapping techniques address C++ features such as classes, overloaded methods, enumerations, and inheritance relations. All are implemented in a proof-of-concept that maps the complete C++ API of the VRS to Tcl and are showcased in a complete interactive 3-D map system.

THE AGFL GRAMMAR WORK LAB

Cornelis H.A. Coster and Erik Verbruggen, University of Nijmegen (KUN)

The growth and implementation of Natural Language Processing (NLP) is the cornerstone of the continued evolution and implementation of truly intelligent search machines and services. In part due to the growing collections of computer-stored, human-readable documents in the public domain, the implementation of linguistic analysis will become necessary for desirable precision and resolution. Subsequently, such tools and linguistic resources must be released into the public domain, and they have done so with the release of the AGFL Grammar Work Lab under the GNU Public License as a tool for linguistic research and the development of NLP-based applications.

The AGFL (Affix Grammars over a Finite Lattice) Grammar Work Lab meshes context-free grammars with finite set-valued features that are acceptable to a range of languages. In computer science terms, "Syntax rules are procedures with parameters and a non-deterministic execution." The English Phrases for Information Retrieval (EP4IR), released with the AGFL-GWL as a robust grammar of English, is an AGFL-GWL-generated English parser that outputs "Head/Modifier" frames. The sentences "CompanyX sponsored this conference" and "This conference was sponsored by CompanyX" both generate [CompanyX [sponsored conference]]. The hope is that public availability of such tools will encourage further development of grammar and lexical software systems.

SWILL: A SIMPLE EMBEDDED WEB SERVER LIBRARY

Sotiria Lampoudi and David M. Beazley, University of Chicago

This paper won the FREENIX Best Student Paper award.

SWILL (Simple Web Interface Link Library) is a simple Web server in the form of a C library whose development was motivated by a wish to give cool applications an interface to the Web. The SWILL library provides a simple interface that can be efficiently implemented for tasks that vary from flexible Web-based monitoring to software debugging and diagnostics. Though originally designed to be integrated with high-performance scientific simulation software, the interface is generic enough to allow for unbounded uses. SWILL is a single-threaded server relying upon non-blocking I/O through the creation of a temporary server which services I/O requests. SWILL does not provide SSL or cryptographic authentication but does have HTTP authentication abilities.

A fantastic feature of SWILL is its support for SPMD-style parallel applications which utilize MPI, proving valuable for Beowulf clusters and large parallel machines. Another practical application was the implementation of SWILL in a modified Yalnix emulator by University of Chicago Operating Systems courses, which utilized the added Yalnix functionality for OS development and debugging. SWILL requires minimal memory overhead and relies upon the HTTP/1.0 protocol.

NETWORK PERFORMANCE

Summarized by Florian Buchholz

LINUX NFS CLIENT WRITE PERFORMANCE

Chuck Lever, Network Appliance; Peter Honeyman, CITI, University of Michigan

Lever introduced a benchmark to measure NFS client write performance. Client performance is difficult to measure due to hindrances such as poor hardware or bandwidth limitations. Furthermore, measuring application performance does not identify weaknesses specifically at the client side. Thus, a benchmark was developed trying to exercise only data transfers in one direction, between server and application. For this purpose, the benchmark was based on the block-sequential-write portion of the Bonnie file system benchmark. Once a benchmark for NFS clients is established, it can be used to improve client performance.

The performance measurements were performed with an SMP Linux client and both a Linux NFS server and a Network Appliance F85 filer. During testing, a periodic jump in write latency was discovered. This was due to a rather large number of pending write operations that were scheduled to be written after certain threshold values were exceeded. By introducing a separate daemon that flushes the cached write requests, the spikes could be eliminated, but as a result the average latency grows over time. The problem could be traced to a function that scans a linked list of write requests. After a hashtable was added to improve lookup performance, the latency improved considerably.
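
The fix is a familiar pattern. The sketch below is not the actual Linux NFS client code; it only illustrates replacing a linear scan of pending write requests with a hash table keyed by page offset, so lookups stay cheap as the number of outstanding writes grows.

/*
 * Hash-bucketed pending write requests instead of one long linked list.
 */
#include <stdio.h>
#include <stdlib.h>

#define NR_BUCKETS 256

struct write_req {
    unsigned long     offset;     /* file offset this request covers */
    struct write_req *next;       /* chain within one hash bucket    */
};

static struct write_req *buckets[NR_BUCKETS];

static unsigned hash_offset(unsigned long offset)
{
    return (unsigned)((offset >> 12) % NR_BUCKETS);   /* bucket by page */
}

static void insert_req(unsigned long offset)
{
    struct write_req *r = malloc(sizeof(*r));
    if (!r)
        exit(1);
    unsigned b = hash_offset(offset);
    r->offset = offset;
    r->next = buckets[b];
    buckets[b] = r;
}

/* Expected O(1) per lookup instead of scanning every pending request. */
static struct write_req *find_req(unsigned long offset)
{
    for (struct write_req *r = buckets[hash_offset(offset)]; r; r = r->next)
        if (r->offset == offset)
            return r;
    return NULL;
}

int main(void)
{
    for (unsigned long pg = 0; pg < 10000; pg++)
        insert_req(pg << 12);
    printf("found request for offset 0x%lx? %s\n",
           4096UL * 1234, find_req(4096UL * 1234) ? "yes" : "no");
    return 0;
}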

The improved client was then used to measure throughput against the two servers. A discrepancy between the Linux server and the filer test was noticed, and the reason for it was traced back to a global kernel lock that was unnecessarily held when accessing the network stack. After correcting this, performance improved further. However, the measurements showed that a client may run slower when paired with fast servers on fast networks. This is due to heavy client interrupt loads, more network processing on the client side, and more global kernel lock contention.

The source code of the project is available at http://www.citi.umich.edu/projects/nfs-perf/patches.

A STUDY OF THE RELATIVE COSTS OF NETWORK SECURITY PROTOCOLS

Stefan Miltchev and Sotiris Ioannidis, University of Pennsylvania; Angelos Keromytis, Columbia University

With the increasing need for security and integrity of remote network services, it becomes important to quantify the communication overhead of IPSec and compare it to alternatives such as SSH, SCP, and HTTPS.

For this purpose, the authors set up three testing networks: a direct link, two hosts separated by two gateways, and three hosts connecting through one gateway. Protocols were compared in each setup, and manual keying was used to eliminate connection setup costs. For IPSec, the different encryption algorithms AES, DES, 3DES, hardware DES, and hardware 3DES were used. In detail, FTP was compared to SFTP, SCP, and FTP over IPSec; HTTP to HTTPS and HTTP over IPSec; and NFS to NFS over IPSec and local disk performance.

The result of the measurements was that IPSec outperforms other popular encryption schemes. Overall, unencrypted communication was fastest, but in some cases, like FTP, the overhead can be small. The use of crypto hardware can significantly improve performance. For future work, the inclusion of setup costs, hardware-accelerated SSL, SFTP, and SSH was mentioned.

CONGESTION CONTROL IN LINUX TCP

Pasi Sarolahti, University of Helsinki; Alexey Kuznetsov, Institute for Nuclear Research at Moscow

Having attended the talk and read the paper, I am still unclear about whether the authors are merely describing the design decisions of TCP congestion control or whether they are actually the creators of that part of the Linux code. My guess leans toward the former.

In the talk, the speaker compared the TCP protocol congestion control measures according to IETF and RFC specifications with the actual Linux implementation, which does conform to the basic principles but nevertheless has differences. A specific emphasis was placed on retransmission mechanisms and the congestion window. Also, several TCP enhancements – the NewReno algorithm, Selective ACKs (SACK), and Forward ACKs (FACK) – were discussed and compared.

In some instances, Linux does not conform to the IETF specifications. The fast recovery does not fully follow RFC 2582, since the threshold for triggering retransmit is adjusted dynamically and the congestion window's size is not changed. Also, the roundtrip-time estimator and the RTO calculation differ from RFC 2988, since Linux uses more conservative RTT estimates and a minimum RTO of 200ms. The performance measures showed that with the additional Linux-specific features enabled, slightly higher throughput, a more steady data flow, and fewer unnecessary retransmissions can be achieved.
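
For reference, the RFC 2988 estimator itself is short. The sketch below implements the RFC's SRTT/RTTVAR update with the minimum RTO left as a parameter, to show the effect of the 200ms floor mentioned above; it does not attempt to reproduce Linux's more conservative RTT estimation or its other deviations.

/*
 * RFC 2988-style RTO estimation with a configurable minimum RTO.
 */
#include <stdio.h>

struct rtt_est {
    double srtt, rttvar;   /* smoothed RTT and RTT variance, in ms */
    int    initialized;
};

static double rto_update(struct rtt_est *e, double sample_ms, double min_rto_ms)
{
    const double alpha = 1.0 / 8.0, beta = 1.0 / 4.0, k = 4.0;

    if (!e->initialized) {                    /* first measurement */
        e->srtt = sample_ms;
        e->rttvar = sample_ms / 2.0;
        e->initialized = 1;
    } else {
        double err = e->srtt - sample_ms;     /* RTTVAR updated with old SRTT */
        if (err < 0) err = -err;
        e->rttvar = (1.0 - beta) * e->rttvar + beta * err;
        e->srtt   = (1.0 - alpha) * e->srtt + alpha * sample_ms;
    }
    double rto = e->srtt + k * e->rttvar;
    return rto < min_rto_ms ? min_rto_ms : rto;
}

int main(void)
{
    double samples[] = { 12, 15, 11, 14, 13 };   /* a fast LAN path, in ms */
    struct rtt_est rfc = {0}, lnx = {0};

    for (int i = 0; i < 5; i++) {
        double r1 = rto_update(&rfc, samples[i], 1000.0); /* RFC 1s floor  */
        double r2 = rto_update(&lnx, samples[i], 200.0);  /* 200ms floor   */
        printf("sample %2.0fms -> RTO %6.1fms (RFC) %6.1fms (200ms floor)\n",
               samples[i], r1, r2);
    }
    return 0;
}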

XTREME XCITEMENT

Summarized by Steve Bauer

THE FUTURE IS COMING: WHERE THE X WINDOW SHOULD GO

Jim Gettys, Compaq Computer Corp.

Jim Gettys, one of the principal authors of the X Window System, outlined the near-term objectives for the system, primarily focusing on the changes and infrastructure required to enable replication and migration of X applications. Providing better support for this functionality would enable users to retrieve or duplicate X applications between their servers at home and work.

One interesting example of an application that currently is capable of migration and replication is Emacs. To create a new frame on DISPLAY, try "M-x make-frame-on-display <Return> DISPLAY <Return>".

However, technical challenges make replication and migration difficult in general. These include the "major headaches" of server-side fonts, the nonuniformity of X servers and screen sizes, and the need to appropriately retrofit toolkits. Authentication and authorization issues are obviously also important. The rest of the talk delved into some of the details of these interesting technical challenges.

HACKING IN THE KERNEL

Summarized by Hai Huang

AN IMPLEMENTATION OF SCHEDULER ACTIVATIONS ON THE NETBSD OPERATING SYSTEM

Nathan J. Williams, Wasabi Systems

Scheduler activation is an old idea. Basically, there are benefits and drawbacks to using solely kernel-level or user-level threading. Scheduler activation is able to combine the two layers of control to provide more concurrency in the system.

In his talk, Nathan gave a fairly detailed description of the implementation of scheduler activations in the NetBSD kernel. One important change in the implementation is to differentiate the thread context from the process context. This is done by defining a separate data structure for these threads and relocating some of the information that was embedded in the process context to these thread contexts. The stack was especially a concern, due to the upcall: special handling must be done to make sure that the upcall doesn't mess up the stack, so that the preempted user-level thread can continue afterwards. Lastly, Nathan explained that signals are handled by upcalls.

AUTHORIZATION AND CHARGING IN PUBLIC WLANS USING FREEBSD AND 802.1X

Pekka Nikander, Ericsson Research NomadicLab

The 802.1x standards are well known in the wireless community as link-layer authentication protocols. In this talk, Pekka explained some novel ways of using the 802.1x protocols that might be of interest to people on the move. It is possible to set up a public WLAN that would support various charging schemes via virtual tokens, which people can purchase or earn and later use.

This is implemented on FreeBSD using the netgraph utility. It is basically a filter in the link layer that would differentiate traffic based on the MAC address of the client node, which is either authenticated, denied, or let through. The overhead for this service is fairly minimal.

ACPI IMPLEMENTATION ON FREEBSD

Takanori Watanabe, Kobe University

ACPI (Advanced Configuration and Power Interface) was proposed as a joint effort by Intel, Toshiba, and Microsoft to provide a standard, finer-grained method of managing power states of individual devices within a system. Such low-level power management is especially important for those mobile and embedded systems that are powered by fixed-capacity energy batteries.

Takanori explained that the ACPI specification is composed of three parts: tables, BIOS, and registers. He was able to implement some functionality of ACPI in the FreeBSD kernel. The ACPI Component Architecture was implemented by Intel, and it provides a high-level ACPI API to the operating system. Takanori's ACPI implementation is built upon this underlying layer of APIs.

ACCESS CONTROL

Summarized by Florian Buchholz

DESIGN AND PERFORMANCE OF THE OPENBSD STATEFUL PACKET FILTER (PF)

Daniel Hartmeier, Systor AG

Daniel Hartmeier described the new stateful packet filter (pf) that replaced IPFilter in the OpenBSD 3.0 release. IPFilter could no longer be included due to licensing issues, and thus there was a need to write a new filter making use of optimized data structures.

The filter rules are implemented as a linked list which is traversed from start to end. Two actions may be taken according to the rules, "pass" or "block": a "pass" action forwards the packet, and a "block" action will drop it. Where more than one rule matches a packet, the last rule wins. Rules that are marked as "final" will immediately terminate any further rule evaluation. An optimization called "skip-steps" was also implemented, where blocks of similar rules are skipped if they cannot match the current packet; these skip-steps are calculated when the rule set is loaded. Furthermore, a state table keeps track of TCP connections, and only packets that match the sequence numbers are allowed. UDP and ICMP queries and replies are considered in the state table, where initial packets will create an entry for a pseudo-connection with a low initial timeout value. The state table is implemented using a balanced binary search tree. NAT mappings are also stored in the state table, whereas application proxies reside in user space. The packet filter is also able to perform fragment reassembly and to modulate TCP sequence numbers to protect hosts behind the firewall.
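
The evaluation model lends itself to a compact sketch. The C fragment below is not the OpenBSD pf source: it walks a rule list, lets the last matching rule win, and stops early on a rule marked final (pf's "quick"), leaving out skip-steps, the state table, and NAT entirely.

/*
 * Last-match-wins rule evaluation with an early-terminating "final" flag.
 */
#include <stdio.h>

enum action { PASS, BLOCK };

struct packet { unsigned src_net; unsigned dst_port; };

struct rule {
    enum action act;
    unsigned    src_net;    /* 0 matches any source network    */
    unsigned    dst_port;   /* 0 matches any destination port  */
    int         final;      /* stop evaluating on match         */
};

static int rule_matches(const struct rule *r, const struct packet *p)
{
    return (r->src_net == 0 || r->src_net == p->src_net) &&
           (r->dst_port == 0 || r->dst_port == p->dst_port);
}

static enum action evaluate(const struct rule *rules, int n,
                            const struct packet *p)
{
    enum action verdict = PASS;            /* assumed default policy */
    for (int i = 0; i < n; i++) {
        if (!rule_matches(&rules[i], p))
            continue;
        verdict = rules[i].act;            /* last match wins...        */
        if (rules[i].final)
            break;                         /* ...unless marked final    */
    }
    return verdict;
}

int main(void)
{
    struct rule ruleset[] = {
        { BLOCK, 0,  0,  0 },              /* block everything              */
        { PASS,  0,  80, 0 },              /* but allow web traffic         */
        { BLOCK, 10, 0,  1 },              /* drop network 10 immediately   */
    };
    struct packet web_ok  = { 20, 80 };
    struct packet web_bad = { 10, 80 };

    printf("net 20 -> port 80: %s\n",
           evaluate(ruleset, 3, &web_ok) == PASS ? "pass" : "block");
    printf("net 10 -> port 80: %s\n",
           evaluate(ruleset, 3, &web_bad) == PASS ? "pass" : "block");
    return 0;
}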

Pf was compared against IPFilter as well as Linux's Iptables by measuring throughput and latency with increasing traffic rates and different packet sizes. In a test with a fixed rule-set size of 100, Iptables outperformed the other two filters, whose results were close to each other. In a second test, where the rule-set size was continually increased, Iptables consistently had about twice the throughput of the other two (which evaluate the rule set on both the incoming and outgoing interfaces). A third test compared only pf and IPFilter, using a single rule that created state in the state table, with a fixed state-entry size. Pf reached an overloaded state much later than IPFilter. The experiment was repeated with a variable state-entry size, and pf performed much better than IPFilter for a small number of states.

In general, rule-set evaluation is expensive, and benchmarks only reflect extreme cases, whereas in real life other behavior should be observed. Furthermore, the benchmarks show that stateful filtering can actually improve performance, due to cheap state-table lookups as compared to rule evaluation.

ENHANCING NFS CROSS-ADMINISTRATIVE DOMAIN ACCESS

Joseph Spadavecchia and Erez Zadok, Stony Brook University

The speaker presented modifications to an NFS server that allow improved NFS access between administrative domains. The problem lies in the fact that NFS assumes a shared UID/GID space, which makes it unsafe to export files outside the administrative domain of the server. Design goals were to leave the protocol and existing clients unchanged, a minimum amount of server changes, flexibility, and increased performance.

To solve the problem, two techniques are utilized: "range-mapping" and "file-cloaking." Range-mapping maps IDs between client and server. The mapping is performed on a per-export basis and has to be manually set up in an export file. The mappings can be 1-1, N-N, or N-1. In file-cloaking, the server restricts file access based on UID/GID and special cloaking-mask bits. Here, users can only access their own file permissions. The policy on whether or not the file is visible to others is set by the cloaking mask, which is logically ANDed with the file's protection bits. File-cloaking only works, however, if the client doesn't hold cached copies of directory contents and file attributes. Because of this, the clients are forced to re-read directories by incrementing the mtime value of the directory each time it is listed.
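
A toy version of the two mechanisms follows, with invented constants; it is not the authors' server code. map_uid() applies a per-export range mapping between client and server IDs, and is_visible() hides a file from non-owners unless the export's cloaking mask, ANDed with the file's mode bits, still leaves group/other permission.

/*
 * Hypothetical range-mapping and file-cloaking checks.
 */
#include <stdio.h>
#include <sys/stat.h>

struct range_map {                /* one per-export mapping entry */
    unsigned client_base, server_base, count;
};

static unsigned map_uid(const struct range_map *m, unsigned client_uid)
{
    if (client_uid >= m->client_base && client_uid < m->client_base + m->count)
        return client_uid - m->client_base + m->server_base;
    return (unsigned)-1;          /* unmapped: treat as nobody */
}

static int is_visible(unsigned req_uid, unsigned file_uid,
                      mode_t file_mode, mode_t cloak_mask)
{
    if (req_uid == file_uid)
        return 1;                 /* owners always see their own files */
    /* Others see the file only if cloaked permissions remain for them. */
    return (file_mode & cloak_mask & (S_IRWXG | S_IRWXO)) != 0;
}

int main(void)
{
    struct range_map m = { 1000, 60000, 500 };  /* client 1000-1499 -> 60000+ */

    printf("client uid 1042 -> server uid %u\n", map_uid(&m, 1042));
    printf("other user sees 0640 file under mask 0770? %s\n",
           is_visible(60001, 60000, 0640, 0770) ? "yes" : "no");
    printf("other user sees 0640 file under mask 0700? %s\n",
           is_visible(60001, 60000, 0640, 0700) ? "yes" : "no");
    return 0;
}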

To test the performance of the modified server, five different NFS configurations were evaluated: an unmodified NFS server was compared against one server with the modified code included but not used, one with only range-mapping enabled, one with only file-cloaking enabled, and one version with all modifications enabled. For each setup, different file system benchmarks were run. The results show only a small overhead when the modifications are used, generally an increase of below 5%. Another experiment tested the performance of the system with an increasing number of mapped or cloaked entries. The results show that an increase from 10 to 1000 entries resulted in a maximum of about a 14% cost in performance.

One member of the audience pointed out that if clients choose to ignore the changed mtimes from the server and thus still hold caches of the directory entries, the file-cloaking mechanism could be defeated. After a rather lengthy debate, the speaker had to concede that the model doesn't add any extra security. Another question was asked about the scalability of the setup of range-mapping. The speaker referred to application-level tools that could be developed for that purpose.

The software is available at ftp://ftp.fsl.cs.sunysb.edu/pub/enf

ENGINEERING OPEN SOURCE SOFTWARE

Summarized by Teri Lampoudi

NINGAUI: A LINUX CLUSTER FOR BUSINESS

Andrew Hume, AT&T Labs – Research; Scott Daniels, Electronic Data Systems Corp.

Ningaui is a general-purpose, highly available, resilient architecture built from commodity software and hardware. Emphatically, however, Ningaui is not a Beowulf. Hume calls the cluster design the "Swiss canton model," in which there are a number of loosely affiliated, independent nodes with data replicated among them. Jobs are assigned by bidding and leases, and cluster services, done as session-based servers, are sited via generic job assignment. The emphasis is on keeping the architecture end-to-end, checking all work via checksums, and logging everything. The resilience and high availability required by their goal of 8-to-5 maintenance – vs. the typical 24/7 model, where people get paged whenever the slightest thing goes wrong, regardless of the time of day – is achieved by job restartability. Finally, all computation is performed on local data, without the use of NFS or network-attached storage.

Hume's message is a hopeful one: despite the many problems encountered – things like kernel and service limits, auto-installation problems, TCP storms, and the like – the existence of source code and the paranoid practice of logging and checksumming everything has helped. The final product performs reasonably well, and it appears that the resilient techniques employed do make a difference. One drawback, however, is that the software mentioned in the paper is not yet available for download.

CPCMS: A CONFIGURATION MANAGEMENT SYSTEM BASED ON CRYPTOGRAPHIC NAMES

Jonathan S. Shapiro and John Vanderburgh, Johns Hopkins University

This paper won the FREENIX Best Paper award.

The basic notion behind the project is the fact that everyone has a pet complaint about CVS, and yet it is currently the configuration manager in most widespread use. Shapiro has unleashed an alternative. Interestingly, he did not begin but ended with the usual host of reasons why CVS is bad. The talk instead began abruptly by characterizing the job of a software configuration manager, continued by stating the namespaces it must handle, and wrapped up with the challenges faced.

X MEETS Z: VERIFYING CORRECTNESS IN THE PRESENCE OF POSIX THREADS

Bart Massey, Portland State University; Robert T. Bauer, Rational Software Corp.

Massey delivered a humorous talk on the insight gained from applying the Z formal specification notation to system software design, rather than the more informal analysis and design process normally used.

The story is told with respect to writing XCB, which replaces the Xlib protocol layer and is supposed to be thread-friendly. But where threads are concerned, deadlock avoidance becomes a hard problem that cannot be solved in an ad hoc manner, while full model checking is also too hard. In this case Massey resorted to Z specification to model the XCB lower layer, abstract away locking and data transfers, and locate fundamental issues. Essentially, the difficulties of searching the literature and locating information relevant to the problem at hand were overcome. As Massey put it, "the formal method saved the day."

FILE SYSTEMS

Summarized by Bosko Milekic

PLANNED EXTENSIONS TO THE LINUX EXT2/EXT3 FILESYSTEM

Theodore Y. Ts'o, IBM; Stephen Tweedie, Red Hat

The speaker presented improvements to the Linux Ext2 file system, with the goal of allowing for various expansions while striving to maintain compatibility with older code. Improvements have been facilitated by a few extra superblock fields that were added to Ext2 not long before the Linux 2.0 kernel was released. The fields allow for file system features to be added without compromising the existing setup; this is done by providing bits indicating the impact of the added features with respect to compatibility. Namely, a file system with the "incompat" bit marked is not allowed to be mounted by a kernel that does not understand the feature. Similarly, a "read-only" marking would only allow the file system to be mounted read-only.
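
A minimal sketch of how such compatibility bitmaps can gate a mount (flag names and values are illustrative, not the actual Ext2 superblock definitions):

    /* Sketch of feature-compatibility bitmaps gating a mount. */
    #include <stdint.h>
    #include <stdio.h>

    #define FEATURE_INCOMPAT_SUPPORTED  0x0003   /* what this kernel understands */
    #define FEATURE_RO_COMPAT_SUPPORTED 0x0001

    struct sb_features {
        uint32_t compat;     /* unknown bits are safe to ignore        */
        uint32_t ro_compat;  /* unknown bits force a read-only mount   */
        uint32_t incompat;   /* unknown bits cause the mount to refuse */
    };

    static int check_mount(const struct sb_features *f, int *read_only)
    {
        if (f->incompat & ~FEATURE_INCOMPAT_SUPPORTED)
            return -1;                           /* refuse to mount */
        if (f->ro_compat & ~FEATURE_RO_COMPAT_SUPPORTED)
            *read_only = 1;                      /* degrade to read-only */
        return 0;
    }

    int main(void)
    {
        struct sb_features f = { 0x10, 0x2, 0x1 };
        int ro = 0;
        printf("mount allowed: %d, forced read-only: %d\n",
               check_mount(&f, &ro) == 0, ro);
        return 0;
    }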

Directory indexing replaces linear directory searches with a faster search using a fixed-depth tree and hashed keys. File system size can be dynamically increased, and the expanded inode, doubled from 128 to 256 bytes, allows for more extensions.

Other potential improvements were discussed as well, in particular pre-allocation for contiguous files, which allows for better performance in certain setups by pre-allocating contiguous blocks. Security-related modifications, extended attributes, and ACLs were mentioned. An implementation of these features already exists but has not yet been merged into the mainline Ext2/3 code.

RECENT FILESYSTEM OPTIMISATIONS ON FREEBSD

Ian Dowse, Corvil Networks; David Malone, CNRI, Dublin Institute of Technology

David Malone presented four important file system optimizations for the FreeBSD OS: soft updates, dirpref, vmiodir, and dirhash. It turns out that certain combinations of the optimizations (beautifully illustrated in the paper) may yield performance improvements of anywhere between 2 and 10 orders of magnitude for real-world applications.

All four of the optimizations deal with file system metadata. Soft updates allow for asynchronous metadata updates. Dirpref changes the way directories are organized, attempting to place child directories closer to their parents, thereby increasing locality of reference and reducing disk-seek times. Vmiodir trades some extra memory in order to achieve better directory caching. Finally, dirhash, which was implemented by Ian Dowse, changes the way in which entries are searched in a directory. Specifically, FFS uses a linear search to find an entry by name; dirhash builds a hashtable of directory entries on the fly. For a directory of size n with a working set of m files, a search that in certain cases could have been O(n × m) has been reduced, due to dirhash, to effectively O(n + m).
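
The idea behind dirhash can be sketched as follows (a simplified illustration; the real FFS code handles collisions, memory limits, and hash seeding differently):

    /* Sketch of the dirhash idea: build a hash table over directory entries
     * once, then answer name lookups in near-constant time. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NBUCKET 128

    struct dent { const char *name; long ino; struct dent *next; };

    static struct dent *bucket[NBUCKET];

    static unsigned hash(const char *s)
    {
        unsigned h = 5381;
        while (*s) h = h * 33 + (unsigned char)*s++;
        return h % NBUCKET;
    }

    static void dirhash_add(struct dent *d)
    {
        unsigned h = hash(d->name);
        d->next = bucket[h];
        bucket[h] = d;
    }

    static long dirhash_lookup(const char *name)
    {
        for (struct dent *d = bucket[hash(name)]; d; d = d->next)
            if (strcmp(d->name, name) == 0)
                return d->ino;
        return -1;                       /* not present in this directory */
    }

    int main(void)
    {
        static struct dent files[] = { {"a.c", 11}, {"b.c", 12}, {"notes", 13} };
        for (size_t i = 0; i < sizeof files / sizeof files[0]; i++)
            dirhash_add(&files[i]);
        printf("inode of b.c: %ld\n", dirhash_lookup("b.c"));
        return 0;
    }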

FILESYSTEM PERFORMANCE AND SCALABILITY IN LINUX 2.4.17

Ray Bryant, SGI; Ruth Forester, IBM LTC; John Hawkes, SGI

This talk focused on performance evaluation of a number of file systems available and commonly deployed on Linux machines. Comparisons under various configurations of Ext2, Ext3, ReiserFS, XFS, and JFS were presented.

The benchmarks chosen for the data gathering were pgmeter, filemark, and the AIM Benchmark Suite VII. Pgmeter measures the rate of data transfer of reads/writes of a file under a synthetic workload. Filemark is similar to postmark in that it is an operation-intensive benchmark, although filemark is threaded and offers various other features that postmark lacks. AIM VII measures performance for various file-system-related functionalities; it offers various metrics under an imposed workload, thus stressing the performance of the file system not only under I/O load but also under significant CPU load.

Tests were run on three different setups: a small, a medium, and a large configuration. ReiserFS and Ext2 appear at the top of the pile for smaller and medium setups. Notably, XFS and JFS perform worse for smaller system configurations than the others, although XFS clearly appears to generate better numbers than JFS. It should be noted that XFS seems to scale well under a higher load; this was most evident in the large-system results, where XFS appears to offer the best overall results.

THINGS TO THINK ABOUT

Summarized by Bosko Milekic

SPEEDING UP KERNEL SCHEDULER BY REDUCING CACHE MISSES

Shuji Yamamura, Akira Hirai, Mitsuru Sato, Masao Yamamoto, Akira Naruse, and Kouichi Kumon, Fujitsu Labs

This was an interesting talk pertaining to the effects of cache coloring for task structures in the Linux kernel scheduler (Linux kernel 2.4.x). The speaker first presented some interesting benchmark numbers for the Linux scheduler, showing that as the number of processes on the task queue was increased, the performance decreased. The authors used some really nifty hardware to measure the number of bus transactions throughout their tests and were thus able to reasonably quantify the impact that cache misses had in the Linux scheduler.

Their experiments led them to implement a cache coloring scheme for task structures, which were previously aligned on 8KB boundaries and therefore were eventually being mapped to the same cache lines. This unfortunate placement of task structures in memory induced a significant number of cache misses as the number of tasks grew in the scheduler.

The implemented solution consisted of aligning task structures to more evenly distribute cached entries across the L2 cache. The result was inevitably fewer cache misses in the scheduler. Some negative effects were observed in certain situations; these were primarily due to more cache slots being used by task-structure data in the scheduler, thus forcing data previously cached there to be pushed out.
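
The coloring idea itself is simple, as the following sketch suggests (sizes and the color rotation are illustrative, not the Fujitsu implementation):

    /* Sketch of cache coloring: rather than placing every task structure at
     * the same offset within an 8KB-aligned block (so they all map to the
     * same L2 sets), add a per-task "color" offset that spreads them out. */
    #include <stdio.h>

    #define BLOCK_SIZE   8192      /* task structures live in 8KB blocks   */
    #define CACHE_LINE     64
    #define NCOLORS        16      /* number of distinct offsets to rotate */

    static unsigned long task_addr(unsigned long block_base, int task_id)
    {
        unsigned long color = (unsigned long)(task_id % NCOLORS) * CACHE_LINE;
        return block_base + color;     /* uncolored code would return block_base */
    }

    int main(void)
    {
        for (int id = 0; id < 4; id++)
            printf("task %d -> offset %lu within its block\n",
                   id, task_addr((unsigned long)id * BLOCK_SIZE, id) % BLOCK_SIZE);
        return 0;
    }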


WORK-IN-PROGRESS REPORTS

Summarized by Brennan Reynolds

RESOURCE VIRTUALIZATION TECHNIQUES FOR WIDE-AREA OVERLAY NETWORKS

Kartik Gopalan, Stony Brook University

This work addressed the issue of provisioning a maximum number of virtual overlay networks (VONs) with diverse quality-of-service (QoS) requirements on a single physical network. Each of the VONs is logically isolated from the others to ensure its QoS requirements. Gopalan mentioned several critical research issues with this problem that are currently being investigated. Dealing with how to provision the network at various levels (link, route, or path) and then enforce the provisioning at run-time is one of the toughest challenges. Currently, Gopalan has developed several algorithms to handle admission control, end-to-end QoS, route selection, scheduling, and fault tolerance in the network.

For more information, visit http://www.ecsl.cs.sunysb.edu

VISUALIZING SOFTWARE INSTABILITY

Jennifer Bevan, University of California, Santa Cruz

Detection of instability in software has typically been an afterthought. The point in the development cycle when the software is reviewed for instability is usually after it is difficult and costly to go back and perform major modifications to the code base. Bevan has developed a technique to allow the visualization of unstable regions of code that can be used much earlier in the development cycle. Her technique creates a time series of dependency graphs that include clusters and lines called fault lines. From the graphs a developer is able to easily determine where the unstable sections of code are and proactively restructure them. She is currently working on a prototype implementation.

For more information, visit http://www.cse.ucsc.edu/~jbevan/evo_viz

RELIABLE AND SCALABLE PEER-TO-PEER WEB DOCUMENT SHARING

Li Xiao, College of William and Mary

The idea presented by Xiao would allow end users to share the content of their Web browser caches with neighboring Internet users. The rationale for this is that today's end users are increasingly connected to the Internet over high-speed links, and the browser caches are becoming large storage systems. Therefore, if an individual accesses a page which does not exist in their local cache, Xiao is suggesting that they first query other end users for the content before trying to access the machine hosting the original. This strategy does have some serious problems associated with it that still need to be addressed, including ensuring the integrity of the content and protecting the identity and privacy of the end users.

SEREL: FAST BOOTING FOR UNIX

Leni Mayo, Fast Boot Software

Serel is a tool that generates a visual representation of a UNIX machine's boot-up sequence. It can be used to identify the critical path and can show if a particular service or process blocks for an extended period of time. This information could be used to determine where the largest performance gains could be realized by tuning the order of execution at boot-up. Serel creates a dependency graph, expressed in XML, during boot-up. This graph is then used to create the visual representation. Currently the tool only works on POSIX-compliant systems, but Mayo stated that he would be porting it to other platforms. Other extensions that were mentioned included having the metadata include the use of shared libraries and monitoring the suspend/resume sequence of portable machines.

For more information, visit http://www.fastboot.org


BERKELEY DB XML

John Merrells, Sleepycat Software

Merrells gave a quick introduction and overview of the new XML library for Berkeley DB. The library specializes in storage and retrieval of XML content through a tool called XPath. The library allows for multiple containers per document and stores everything natively as XML. The user is also given a wide range of elements to create indices with, including edges, elements, text strings, or presence. The XPath tool consists of a query parser, generator, optimizer, and execution engine. To conclude his presentation, Merrells gave a live demonstration of the software.

For more information, visit http://www.sleepycat.com/xml

CLUSTER-ON-DEMAND (COD)

Justin Moore, Duke University

Modern clusters are growing at a rapid rate. Many have pushed beyond the 5000-machine mark, and deploying them results in large expenses as well as management and provisioning issues. Furthermore, if the cluster is "rented" out to various users, it is very time-consuming to configure it to a user's specs, regardless of how long they need to use it. The COD work presented creates dynamic virtual clusters within a given physical cluster. The goal was to have a provisioning tool that would automatically select a chunk of available nodes and install the operating system and middleware specified by the customer in a short period of time. This would allow a greater use of resources, since multiple virtual clusters can exist at once. By using a virtual cluster, the size can be changed dynamically. Moore stated that they have created a working prototype and are currently testing and benchmarking its performance.

For more information, visit http://www.cs.duke.edu/~justin/cod


CATACOMB

Elias Sinderson, University of California, Santa Cruz

Catacomb is a project to develop a database-backed DASL module for the Apache Web server. It was designed as a replacement for WebDAV. The initial release of the module only contains support for a MySQL database but could be extended to others. Sinderson briefly touched on the performance of the module: it was comparable to the mod_dav Apache module for all query types but search. The presentation was concluded with remarks about adding support for the lock method and including ACL specifications in the future.

For more information, visit http://ocean.cse.ucsc.edu/catacomb

SELF-ORGANIZING STORAGE

Dan Ellard, Harvard University

Ellard's presentation introduced a storage system that tuned itself based on the workload of the system, without requiring the intervention of the user. The intelligence was implemented as a virtual self-organizing disk that resides below any file system. The virtual disk observes the access patterns exhibited by the system and then attempts to predict what information will be accessed next. An experiment was done using an NFS trace taken at a large ISP on one of their email servers. Ellard's self-organizing storage system worked well, which he attributes to the fact that most of the files being requested were large email boxes. Areas of future work include exploration of the length and detail of the data-collection stage, as well as the CPU impact of running the virtual disk layer.

VISUAL DEBUGGING

John Costigan, Virginia Tech

Costigan feels that the state of current debugging facilities in the UNIX world is not as good as it should be. He proposes the addition of several elements to programs like ddd and gdb. The first addition is including separate heap and stack components. Another is to display only certain fields of a data structure and have the ability to zoom in and out if needed. Finally, the debugger should provide the ability to visually present complex data structures, including linked lists, trees, etc. He said that a beta version of a debugger with these abilities is currently available.

For more information, visit http://infovis.cs.vt.edu/datastruct

IMPROVING APPLICATION PERFORMANCE THROUGH SYSTEM-CALL COMPOSITION

Amit Purohit, Stony Brook University

Web servers perform a huge number of context switches and internal data copies during normal operation. These two elements can drastically limit the performance of an application, regardless of the hardware platform it is run on. The Compound System Call (CoSy) framework is an attempt to reduce the performance penalty for context switches and data copies. It includes a user-level library and several kernel facilities that can be used via system calls. The library provides the programmer with a complete set of memory-management functions. Performance tests using the CoSy framework showed a large saving for context switches and data copies. The only area where the savings between conventional libraries and CoSy were negligible was for very large file copies.
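
The compound-call idea can be sketched as a sequence of steps handed to the kernel at once; the structures and the cosy_submit() entry point below are hypothetical, not the actual CoSy API:

    /* Sketch of a compound system call: a whole read-then-write sequence is
     * described up front so the kernel can run it without bouncing data and
     * control back to user space at every step. */
    #include <stdio.h>
    #include <stddef.h>

    enum cosy_op { COSY_OPEN, COSY_READ, COSY_WRITE, COSY_CLOSE };

    struct cosy_step {
        enum cosy_op op;
        int          fd_slot;   /* which earlier OPEN result this step uses */
        const char  *path;      /* for COSY_OPEN */
        size_t       len;       /* for COSY_READ / COSY_WRITE */
    };

    /* A "copy file" request: the data read in step 3 feeds step 4 entirely
     * inside the kernel, with no user-space copies. */
    static const struct cosy_step copy_prog[] = {
        { COSY_OPEN,  0, "in.dat",  0     },
        { COSY_OPEN,  1, "out.dat", 0     },
        { COSY_READ,  0, NULL,      65536 },
        { COSY_WRITE, 1, NULL,      65536 },
        { COSY_CLOSE, 0, NULL,      0     },
        { COSY_CLOSE, 1, NULL,      0     },
    };

    /* A real framework would provide something like
     *     int cosy_submit(const struct cosy_step *prog, size_t nsteps);
     * issued as a single system call; that entry point is imaginary here. */
    int main(void)
    {
        printf("compound request with %zu steps, one kernel crossing\n",
               sizeof copy_prog / sizeof copy_prog[0]);
        return 0;
    }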

For more information, visit http://www.cs.sunysb.edu/~purohit

ELASTIC QUOTAS

John Oscar, Columbia University

Oscar began by stating that most resources are flexible but that, to date, disks have not been. While approximately 80 percent of files are short-lived, disk quotas are hard limits imposed by administrators. The idea behind elastic quotas is to have a non-permanent storage area that each user can temporarily use. Oscar suggested the creation of an ehome directory structure to be used in deploying elastic quotas. Currently he has implemented elastic quotas as a stackable file system.

There were also several trends apparent in the data. Many users would ssh to their remote host (good) but then launch an email reader on the local machine, which would connect via POP to the same host and send the password in clear text (bad). This situation can easily be remedied by tunneling protocols like POP through SSH, but it appeared that many people were not aware this could be done. While most of his comments on the use of the wireless network were negative, the list of passwords he had collected showed that people were indeed using strong passwords. His recommendations were to educate and encourage people to use protocols like IPSec, SSH, and SSL when conducting work over a wireless network, because you never know who else is listening.

SYSTRACE

Niels Provos, University of Michigan

How can one be sure the applications one is using actually do exactly what their developers said they do? Short answer: you can't. People today are using a large number of complex applications, which means it is impossible to check each application thoroughly for security vulnerabilities. There are tools out there that can help, though. Provos has developed a tool called systrace that allows a user to generate a policy of acceptable system calls a particular application can make. If the application attempts to make a call that is not defined in the policy, the user is notified and allowed to choose an action. Systrace includes an auto-policy generation mechanism, which uses a training phase to record the actions of all programs the user executes. If the user chooses not to be bothered by applications breaking policy, systrace allows default enforcement actions to be set. Currently systrace is implemented for FreeBSD and NetBSD, with a Linux version coming out shortly.
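
The train-then-enforce idea can be illustrated with a toy default-deny checker (this is not systrace's policy format or implementation):

    /* Minimal illustration of training vs. enforcement for a syscall policy. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_RULES 64

    static const char *allowed[MAX_RULES];
    static int n_allowed;
    static int training = 1;                 /* 1 = record, 0 = enforce */

    static int policy_check(const char *syscall_name)
    {
        for (int i = 0; i < n_allowed; i++)
            if (strcmp(allowed[i], syscall_name) == 0)
                return 1;                    /* permitted by learned policy */
        if (training && n_allowed < MAX_RULES) {
            allowed[n_allowed++] = syscall_name;   /* learn it during training */
            return 1;
        }
        printf("policy violation: %s (notify user or apply default action)\n",
               syscall_name);
        return 0;
    }

    int main(void)
    {
        policy_check("open");                /* recorded during the training phase */
        training = 0;
        printf("open allowed: %d\n", policy_check("open"));
        printf("connect allowed: %d\n", policy_check("connect"));
        return 0;
    }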


For more information, visit http://www.citi.umich.edu/u/provos/systrace

VERIFIABLE SECRET REDISTRIBUTION FOR SURVIVABLE STORAGE SYSTEMS

Ted Wong, Carnegie Mellon University

Wong presented a protocol that can be used to re-create a file distributed over a set of servers, even if one of the servers is damaged. In his scheme, the user must choose how many shares the file is split into and the number of servers it will be stored across. The goal of this work is to provide persistent, secure storage of information even if it comes under attack. Wong stated that his design of the protocol was complete and he is currently building a prototype implementation of it.

For more information, visit http://www.cs.cmu.edu/~tmwong/research

THE GURU IS IN SESSIONS

LINUX ON LAPTOP/PDA

Bdale Garbee, HP Linux Systems Operation

Summarized by Brennan Reynolds

This was a roundtable-like discussion, with Garbee acting as moderator. He is currently in charge of the Debian distribution of Linux and is involved in porting Linux to platforms other than i386. He has successfully helped port the kernel to the Alpha, SPARC, and ARM, and is currently working on a version to run on VAX machines.

A discussion was held on which file system is best for battery life. The Reiser and Ext3 file systems were discussed; people commented that when using Ext3 on their laptops the disk would never spin down and thus was always consuming a large amount of power. One suggestion was to use a RAM-based file system for any volume that requires a large number of writes, and to only have the file system written to disk at infrequent intervals or when the machine is shut down or suspended.

A question was directed to Garbee about his knowledge of how the HP-Compaq merger would affect Linux. Garbee thought that the merger was good for Linux, both within the company and for the open source community as a whole. He stated that the new company would be the number one shipper of systems with Linux as the base operating system. He also fielded a question about the support of older HP servers and their ability to run Linux, saying that indeed Linux has been ported to them and he personally had several machines in his basement running it.

A question which sparked a large number of responses concerned problems people had with running Linux on mobile platforms. Most people actually did not have many problems at all. There were a few cases of a particular machine model not working, but there were only two widespread issues: docking stations and the new ACPI power management scheme. Docking stations still appear to be somewhat of a headache, but most people had developed ad hoc solutions for getting them to work. The ACPI power management scheme developed by Intel does not appear to have a quick solution. Ted Ts'o, one of the head kernel developers, stated that there are fundamental architectural problems with the 2.4 series kernel that do not easily allow ACPI to be added. However, he also stated that ACPI is supported in the newest development kernel, 2.5, and will be included in 2.6.

The final topic was the possibility of purchasing a laptop without paying any charge/tax for Microsoft Windows. The consensus was that currently this is not possible. Even if the machine comes with Linux or without any operating system, the vendors have contracts in place which require them to pay Microsoft for each machine they ship if that machine is capable of running a Microsoft operating system. The only way this will change is if the demand for other operating systems increases to the point that vendors renegotiate their contracts with Microsoft, and no one saw this happening in the near future.

LARGE CLUSTERS

Andrew Hume, AT&T Labs – Research

Summarized by Teri Lampoudi

This was a session on clusters, big data, and resilient computing. Given that the audience was primarily interested in clusters, and not necessarily big data or beating nodes to a pulp, the conversation revolved mainly around what it takes to make a heterogeneous cluster resilient. Hume also presented a FREENIX paper on his current cluster project, which explained in more detail much of what was abstractly claimed in the guru session.

Hume's use of the word "cluster" referred not to what I had assumed would be a Beowulf-type system, but to a loosely coupled farm of machines in which concurrency was much less of an issue than it would be in a Beowulf. In fact, the architecture Hume described was designed to deal with large amounts of independent transaction processing, essentially the process of billing calls, which requires no interprocess message passing or anything of the sort a parallel scientific application might.

Where does the large data come in? Since the transactions in question consist of large numbers of flat files, a mechanism for getting the files onto nodes and off-loading the results is necessary. In this particular cluster, named Ningaui, this task is handled by the "replication manager," which generated a large amount of interest in the audience. All management functions in the cluster, as well as job allocation, constitute services that are to be bid for and leased out to nodes. Furthermore, failures are handled by simply repeating the failed task until it is completed properly, if that is possible, and if it is not, then looking through detailed logs to discover what went wrong. Logging and testing for the successful completion of each task are what make this model resilient. Every management operation is logged at some appropriate level of detail, generating 120MB/day, and every single file transfer is md5 checksummed. This was characterized by Hume as a "patient" style of management, where no one controls the cluster: scheduling behavior is emergent rather than stringently planned, and nodes are free to drop in and out of the cluster. A downside to this model is the increase in latency, offset by the fact that the cluster continues to function unattended outside of 8-to-5 support hours.

In response to questions about the projected scalability of such a scheme, Hume said the system would presumably scale from the present 60 nodes to about 100 without modifications, but that scaling higher would be a matter for further design. The guru session ended on a somewhat unrelated note regarding the problems of getting data off mainframes and onto Linux machines – issues of fixed- vs. variable-length encoding of data blocks resulting in useful information being stripped by FTP, and problems in converting COBOL copybooks to something useful on a C-based architecture. To reiterate Hume's claim, there are homemade solutions for some of these problems, and the reader should feel free to contact Mr. Hume for them.

WRITING PORTABLE APPLICATIONS

Nick Stoughton, MSB Consultants

Summarized by Josh Lothian

Nick Stoughton addressed a concern that developers are facing more and more: writing portable applications. He proposed that there is no such thing as a "portable application," only those that have been ported. Developers are still in search of the Holy Grail of applications programming: a write-once, compile-anywhere program. Languages such as Java are leading up to this, but virtual machines still have to be ported, paths will still be different, etc. Web-based applications are another area that could be promising for portable applications.

Nick went on to talk about the future of portability. The recently released POSIX 2001 standards are expected to help the situation. The 2001 revision expands the original POSIX spec into seven volumes. Even though POSIX 2001 is more specific than the previous release, Nick pointed out that even this standard allows for small differences where vendors may define their own behaviors. This can only hurt developers in the long run. Along these lines, it was pointed out that there may be room for automated tools that have the capability to check application code and determine the level of portability and standards compliance of the source.

NETWORK MANAGEMENT SYSTEM PERFORMANCE TUNING

Jeff Allen, Tellme Networks, Inc.

Summarized by Matt Selsky

Jeff Allen, author of Cricket, discussed network tuning. The basic idea of tuning is: measure, twiddle, and measure again. Cricket can be used for large installations, but it's overkill for smaller installations, for which MRTG is better suited. You don't need Cricket to do measurement; doing something like a shell script that dumps data to Excel is good, but you need to get some sort of measurement. Cricket was not designed for billing; it was meant to help answer questions.

Useful tools and techniques covered included looking for the difference between machines in order to identify the causes for the differences; measuring, but also thinking about what you're measuring and why; and remembering to use strace and tcpdump to identify problems. Problems repeat themselves; performance tuning requires an intricate understanding of all the layers involved to solve the problem and simplify the automation. (Having a computer science background helps but is not essential.) If you can reproduce the problem, you then should investigate each piece to find the bottleneck. If you can't reproduce the problem, then measure. You need to understand all the system interactions. Measurement can help determine whether things are actually slow or if the user is imagining it.

Troubleshooting begins with a hunch, but scientific processes are essential. You should be able to determine the event stream, or when each event occurs; having observation logs can help. Some hints provided include checking cron for periodic anomalies, slowing down /bin/rm to avoid I/O overload, and looking for unusual kernel CPU usage.

Also, try to use lightweight monitoring to reduce overhead. You don't need to monitor every resource on every system, but you should monitor those resources on those systems that are essential. Don't check things too often, since you can introduce overhead that way. Lots of information can be gathered from the /proc file system interface to the kernel.
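
As a small example of the kind of lightweight measurement Allen recommends, a monitor can read the load average directly from /proc on Linux rather than spawning external tools:

    /* Lightweight sampling from /proc (Linux): read the load average
     * without forking external commands. */
    #include <stdio.h>

    int main(void)
    {
        double load1, load5, load15;
        FILE *fp = fopen("/proc/loadavg", "r");

        if (fp == NULL) {
            perror("/proc/loadavg");
            return 1;
        }
        if (fscanf(fp, "%lf %lf %lf", &load1, &load5, &load15) == 3)
            printf("load average: %.2f (1m) %.2f (5m) %.2f (15m)\n",
                   load1, load5, load15);
        fclose(fp);
        return 0;
    }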

BSD BOF

Host: Kirk McKusick, Author and Consultant

Summarized by Bosko Milekic

The BSD Birds of a Feather session at USENIX started off with five quick and informative talks and ended with an exciting discussion of licensing issues.

Christos Zoulas, for the NetBSD Project, went over the primary goals of the NetBSD Project (namely portability, a clean architecture, and security) and then proceeded to briefly discuss some of the challenges the project has encountered. The successes and areas for improvement of 1.6 were then examined. NetBSD has recently seen various improvements in its cross-build system, packaging system, and third-party tool support. As far as the kernel is concerned, a zero-copy TCP/UDP implementation has been integrated, and a new pmap code for arm32 has been introduced, as have some other rather major changes to various subsystems. More information is available in Christos's slides, which are now available at http://www.netbsd.org/gallery/events/usenix2002

Theo de Raadt of the OpenBSD Project presented a rundown of various new things that the OpenBSD and OpenSSH teams have been looking at. Theo's talk included an amusing description (and pictures) of some of the events that transpired during OpenBSD's hack-a-thon, which occurred the week before USENIX '02. Finally, Theo mentioned that, following complications with the IPFilter license, the OpenBSD team had done a fairly extensive license audit.

Robert Watson of the FreeBSD Project brought up various examples of FreeBSD being used in the real world. He went on to describe a wide variety of changes that will surface with the upcoming FreeBSD release 5.0. Notably, 5.0 will introduce a re-architecting of the kernel aimed at providing much more scalable support for SMP machines. 5.0 will also include an early implementation of KSE, FreeBSD's version of scheduler activations, large improvements to pccard, and significant framework bits from the TrustedBSD project.

Mike Karels from Wind River Systems presented an overview of how BSD/OS has been evolving following the Wind River Systems acquisition of BSDi's software assets. Ernie Prabhakar from Apple discussed Apple's OS X and its success on the desktop. He explained how OS X aims to bring BSD to the desktop while striving to maintain a strong and stable core, one that is heavily based on BSD technology.

The BoF finished with a discussion on licensing; specifically, some folks questioned the impact of Apple's licensing for code from their core OS (the Darwin Project) on the BSD community as a whole.

CLOSING SESSION

HOW FLIES FLY

Michael H. Dickinson, University of California, Berkeley

Summarized by JD Welch

In this lively and offbeat talk, Dickinson discussed his research into the mechanisms of flight in the fruit fly (Drosophila melanogaster) and related the autonomic behavior to that of a technological system. Insects are the most successful organisms on earth, due in no small part to their early adoption of flight. Flies travel through space on straight trajectories interrupted by saccades, jumpy turning motions somewhat analogous to the fast, jumpy movements of the human eye. Using a variety of monitoring techniques, including high-speed video in a "virtual reality flight arena," Dickinson and his colleagues have observed that flies respond to visual cues to decide when to saccade during flight. For example, more visual texture makes flies saccade earlier (i.e., further away from the arena wall), described by Dickinson as a "collision-control algorithm."

Through a combination of sensory inputs, flies can make decisions about their flight path. Flies possess a visual "expansion detector" which, at a certain threshold, causes the animal to turn in a certain direction. However, expansion in front of the fly sometimes causes it to land. How does the fly decide? Using the virtual reality device to replicate various visual scenarios, Dickinson observed that the flies fixate on vertical objects that move back and forth across the field. Expansion on the left side of the animal causes it to turn right, and vice versa, while expansion directly in front of the animal triggers the legs to flare out in preparation for landing.

If the eyes detect these changes, how are the responses implemented? The flies' wings have "power muscles," controlled by mechanical resonance (as opposed to the nervous system), which drive the wings, combined with neurally activated "steering muscles," which change the configuration of the wing joints. Subtle variations in the timing of impulses correspond to changes in wing movement, controlled by sensors in the "wing-pit." A small wing-like structure, the haltere, controls equilibrium by beating constantly.

Voluntary control is accomplished by the halteres, whose signals can interfere with the autonomic control of the stroke cycle. The halteres have "steering muscles" as well, and information derived from the visual system can turn off the haltere or trick it into a "virtual" problem requiring a response.

Dickinson has also studied the aerodynamics of insect flight using a device called the "Robo Fly," an oversized mechanical insect wing suspended in a large tank of mineral oil. Interestingly, Dickinson observed that the average lift generated by the flies' wings is under its body weight; the flies use three mechanisms to overcome this, including rotating the wings (rotational lift).

Insects are extraordinarily robust creatures, and because Dickinson analyzed the problem in a systems-oriented way, these observations and analysis are immediately applicable to technology; the response system can be used as an efficient search algorithm for control systems in autonomous vehicles, for example.

THE AFS WORKSHOP

Summarized by Garry Zacheiss, MIT

Love Hornquist-Astrand of the Arla development team presented the Arla status report. Arla 0.35.8 has been released. Scheduled for release soon are improved support for Tru64 UNIX, MacOS X, and FreeBSD; improved volume handling; and implementation of more of the vos/pts subcommands. It was stressed that MacOS X is considered an important platform, and that a GUI configuration manager for the cache manager, a GUI ACL manager for the Finder, and a graphical login that obtains Kerberos tickets and AFS tokens at login time were all under development.

Future goals planned for the underlying AFS protocols include GSSAPI/SPNEGO support for Rx, performance improvements to Rx, an enhanced disconnected mode, and IPv6 support for Rx; an experimental patch is already available for the latter. Future Arla-specific goals include improved performance, partial file reading, and increased stability for several platforms. Work is also in progress on the RXGSS protocol extensions (integrating GSSAPI into Rx). A partial implementation exists, and work continues as developers find time.

Derrick Brashear of the OpenAFS development team presented the OpenAFS status report. OpenAFS 1.2.5 was released immediately prior to the conference; it fixed a remotely exploitable denial-of-service attack in several OpenAFS platforms, most notably IRIX and AIX. Future work planned for OpenAFS includes better support for MacOS X, including working around Finder interaction issues. Better support for the BSDs is also planned: FreeBSD has a partially implemented client, while NetBSD and OpenBSD have only server programs available right now. AIX 5, Tru64 5.1A, and MacOS X 10.2 (a.k.a. Jaguar) are all planned for the future. Other planned enhancements include nested groups in the ptserver (code donated by the University of Michigan awaits integration), disconnected AFS, and further work on a native Windows client. Derrick stated that the guiding principles of OpenAFS were to maintain compatibility with IBM AFS, support new platforms, ease the administrative burden, and add new functionality.

AFS performance was discussed. The openafs.org cell consists of two file servers, one in Stockholm, Sweden, and one in Pittsburgh, PA. AFS works reasonably well over trans-Atlantic links, although OpenAFS clients don't determine which file server to talk to very effectively; Arla clients use RTTs to the server to determine the optimal file server to fetch replicated data from. Modifications to OpenAFS to support this behavior in the future are desired.

Jimmy Engelbrecht and Harald Barth of KTH discussed their AFSCrawler script, written to determine how many AFS cells and clients were in the world, what implementations/versions they were (Arla vs. IBM AFS vs. OpenAFS), and how much data was in AFS. The script unfortunately triggered a bug in IBM AFS 3.6-derived code, causing some clients to panic while handling a specific RPC. This has since been fixed in OpenAFS 1.2.5 and the most recent IBM AFS patch level, and all AFS users are strongly encouraged to upgrade. No release of Arla is vulnerable to this particular denial-of-service attack. There was an extended discussion of the usefulness of this exploration. Many sites believed this was useful information and such scanning should continue in the future, but only on an opt-in basis.

Many sites face the problem of managing Kerberos/AFS credentials for batch-scheduled jobs. Specifically, most batch-processing software needs to be modified to forward tickets as part of the batch submission process, renew tickets and tokens while the job is in the queue and for the lifetime of the job, and properly destroy credentials when the job completes. Ken Hornstein of NRL was able to pay a commercial vendor to support Kerberos 4/5 credential management in their product, although they did not implement AFS token management. MIT has implemented some of the desired functionality in OpenPBS and might be able to make it available to other interested sites.

Tools to simplify AFS administration were discussed, including:

AFS Balancer: A tool written by CMU to automate the process of balancing disk usage across all servers in a cell. Available from ftp://ftp.andrew.cmu.edu/pub/AFS-Tools/balance-1.1b.tar.gz

Themis: Themis is KTH's enhanced version of the AFS tool "package" for updating files on local disk from a central AFS image. KTH's enhancements include allowing the deletion of files, simplifying the process of adding a file, and allowing the merging of multiple rule sets for determining which files are updated. Themis is available from the Arla CVS repository.

Stanford was presented as an example of a large AFS site. Stanford's AFS usage consists of approximately 14TB of data in AFS, in the form of approximately 100,000 volumes; 33TB of storage is available in their primary cell, ir.stanford.edu. Their file servers consist entirely of Solaris machines running a combination of Transarc 3.6 patch-level 3 and OpenAFS 1.2.x, while their database servers run OpenAFS 1.2.2. Their cell consists of 25 file servers, using a combination of EMC and Sun StorEdge hardware. Stanford continues to use the kaserver for their authentication infrastructure, with future plans to migrate entirely to an MIT Kerberos 5 KDC.

Stanford has approximately 3,400 clients on their campus, not including SLAC (Stanford Linear Accelerator); approximately 2,100 AFS clients from outside Stanford contact their cell every month. Their supported clients are almost entirely IBM AFS 3.6 clients, although they plan to release OpenAFS clients soon. Stanford currently supports only UNIX clients. There is some on-campus presence of Windows clients, but Stanford has never publicly released or supported it. They do intend to release and support the MacOS X client in the near future.

All Stanford students, faculty, and staff are assigned AFS home directories, with a default quota of 50MB, for a total of approximately 550GB of user home directories. Other significant uses of AFS storage are data volumes for workstation software (400GB) and volumes for course Web sites and assignments (100GB).

AFS usage at Intel was also presented. Intel has been an AFS site since 1994. They had bad experiences with the IBM 3.5 Linux client; their experience with OpenAFS on Linux 2.4.x kernels has been much better. They use and are satisfied with the OpenAFS IA64 Linux port. Intel has hundreds of OpenAFS 1.2.3 and 1.2.4 clients in many production cells accessing data stored on IBM AFS file servers; they have not encountered any interoperability issues. Intel has some concerns about OpenAFS: they would like to purchase commercial support for OpenAFS and to see OpenAFS support for HP-UX on both PA-RISC and Itanium hardware. HP-UX support is currently unavailable due to a specific HP-UX header file being unavailable from HP; this may be available soon. Intel has not yet committed to migrating their file servers to OpenAFS and are unsure if they will do so without commercial support.

Backups are a traditional topic of discussion at AFS workshops, and this time was no exception. Many users complain that the traditional AFS backup tools ("backup" and "butc") are complex and difficult to automate, requiring many home-grown scripts and much user intervention for error recovery. An additional complaint was that the traditional AFS tools do not support file- or directory-level backups and restores; data must be backed up and restored at the volume level.

Mitch Collinsworth of Cornell presented work to make AMANDA, the free backup software from the University of Maryland, suitable for AFS backups. Using AMANDA for AFS backups allows one to share AFS backup tapes with non-AFS backups, easily run multiple backups in parallel, automate error recovery, and provide a robust degraded mode that prevents tape errors from stopping backups altogether. Their implementation allows for full volume restores as well as individual directory and file restores. They have finished coding this work and are in the process of testing and documenting it.

Peter Honeyman of CITI at the University of Michigan spoke about work he has proposed to replace Rx with RPCSEC GSS in OpenAFS; this would allow AFS to use a TCP-based transport mechanism rather than the UDP-based Rx, and possibly gain better congestion control, dynamic adaptation, and fragmentation avoidance as a result. RPCSEC GSS uses the GSSAPI to authenticate SUN ONC RPC. RPCSEC GSS is transport-agnostic, provides strong security, is a developing Internet standard, and has multiple open source implementations. Backward compatibility with existing AFS servers and clients is an important goal of this project.


architecture. As David Reed argues, capacity can scale with the number of users – assuming an effective technology and an open architecture.

Developments over the last three years are disturbing and can be summarized as two layers of corruption: intellectual-property rights constraining technical innovation, and proprietary hardware platforms constraining software innovation.

Issues have been framed by the courts largely in two mistaken ways: (1) it's their property – a new technology shouldn't interfere with the rights of current technology owners; (2) it's just theft – copyright laws should be upheld by suppressing certain new technologies.

In fact, we must reframe the debate from "it's their property" to a "highways" metaphor that acts as a neutral platform for innovation without discrimination. "Theft" should be reframed as "Walt Disney," who built upon works from the past in richly creative ways that demonstrate the utility of allowing work to reach the public domain.

In the end, creativity depends upon the balance between property and access, public and private, controlled and common access. Free code builds a free culture, and open architectures are what give rise to the freedom to innovate. All of us need to become involved in reframing the debate on this issue; as Rachel Carson's Silent Spring points out, an entire ecology can be undermined by small changes from within.

INVITED TALKS

THE IETF, OR WHERE DO ALL THOSE RFCS COME FROM, ANYWAY?

Steve Bellovin, AT&T Labs – Research

Summarized by Josh Simon

The Internet Engineering Task Force (IETF) is a standards body, but not a legal entity, consisting of individuals (not organizations) and driven by a consensus-based decision model. Anyone who "shows up" – be it at the thrice-annual in-person meetings or on the email lists for the various groups – can join and be a member. The IETF is concerned with Internet protocols and open standards, not LAN-specific protocols (such as AppleTalk) or layer-1 or -2 issues (like copper versus fiber).

The organizational structure is loose. There are many working groups, each with a specific focus within several areas. Each area has an area director, and the area directors collectively form the Internet Engineering Steering Group (IESG). The six permanent areas are Internet (with working groups for IPv6, DNS, and ICMP), Transport (TCP, QoS, VoIP, and SCTP), Applications (mail, some Web, LDAP), Routing (OSPF, BGP), Operations and Management (SNMP), and Security (IPSec, TLS, S/MIME). There are also two other areas: SubIP is a temporary area for things underneath the IP protocol stack (such as MPLS, IP over wireless, and traffic engineering), and there's a General area for miscellaneous and process-based working groups.

Internet Requests for Comments (RFCs) fall into three tracks: Standard, Informational, and Experimental. Note that this means that not all RFCs are standards. The RFCs in the Informational track are generally for proprietary protocols or April first jokes; those in the Experimental track are results, ideas, or theories.

The RFCs in the Standard track come from working groups in the various areas, through a time-consuming, complex process. Working groups are created with an agenda, a problem statement, an email list, some draft RFCs, and a chair; they typically start out as a BoF session. The working group and the IESG make a charter to define the scope, milestones, and deadlines, and the Internet Architecture Board (IAB) ensures that the working-group proposals are architecturally sound. Working groups are narrowly focused and are supposed to die off once the problem is solved and all milestones are achieved. Working groups meet and work mainly through the email list, though there are three in-person, high-bandwidth meetings per year. However, decisions reached in person must be ratified by the mailing list, since not everybody can get to three meetings per year. They produce RFCs, which go through the Standard track; these need to go through the entire working group before being submitted for comment to the entire IETF and then to the IESG. Most RFCs wind up going back to the working group at least once from the area-director or IESG level.

The format of an RFC is well-defined and requires that it be published in plain 7-bit ASCII. RFCs are freely redistributable, and the IETF reserves the right of change control on all Standard-track RFCs.

The big problems the IETF is currently facing are security, internationalization, and congestion control. Security has to be designed into protocols from the start. Internationalization has shown us that 7-bit-only ASCII is bad and doesn't work, especially for those character sets that require more than 7 bits (like Kanji); UTF-8 is a reasonable compromise. But what about domain names? While not specified as requiring 7-bit ASCII in the specifications, most DNS applications assume a 7-bit character set in the namespace. This is a hard problem. Finally, congestion control is another hard problem, since the Internet is not the same as a really big LAN.

INTRODUCTION TO AIR TRAFFIC MANAGEMENT SYSTEMS

Ron Reisman, NASA Ames Research Center; Rob Savoye, Seneca Software

Summarized by Josh Simon

Air traffic control is organized into four domains: surface, which runs out of the airport control tower and controls the aircraft on the ground (e.g., taxi and takeoff); terminal area, which covers aircraft at 11,000 feet and below, handled by the Terminal Radar Approach Control (TRACON) facilities; en route, which covers between 11,000 and 40,000 feet, including climb, descent, and at-altitude flight, and runs from the 20 Air Route Traffic Control Centers (ARTCC, pronounced "artsy"); and traffic flow management, which is the strategic arm. Each area has sectors for low, high, and very-high flight. Each sector has a controller team, including one person on the microphone, and handles between 12 and 16 aircraft at a time. Since the number of sectors and areas is limited and fixed, there's limited system capacity. The events of September 11, 2001, gave us a respite in terms of system usage, but based on past growth patterns the air traffic system will be oversubscribed within two to three years. How do we handle this oversubscription?

Air Traffic Management (ATM) Decision Support Tools (DST) use physics, aeronautics, heuristics (expert systems), fuzzy logic, and neural nets to help the (human) aircraft controllers route aircraft around. The rest of the talk focused on capacity issues, but the DSTs also handle safety and security issues. The software follows open standards (ISO, POSIX, and ANSI). The team at NASA Ames made the Center-TRACON Automation System (CTAS), which is software for each of the ARTCCs, portable from Solaris to HP-UX and Linux as well. Unlike just about every other major software project, this one really is standard and portable; co-presenter Rob Savoye has experience in maintaining gcc on multiple platforms and is the project lead on the portability and standards issues for the code. CTAS allows the ARTCCs to upgrade and enhance individual aspects or parts of the system; it isn't a monolithic, all-or-nothing entity like the old ATM systems.

Some future areas of research include a head-mounted augmented-reality device for tower operators, to improve their situational awareness by automating human factors, and new digital global positioning system (DGPS) technologies, which are accurate to within inches instead of feet.


Questions centered around advanced avionics (e.g., getting rid of ground control), cooperation between the US and Europe on software development (we're working together on software development, but the various European countries' controllers don't talk well to each other), and privatization.

ADVENTURES IN DNS

Bill Manning, ISI

Summarized by Brennan Reynolds

Manning began by posing a simple question: is DNS really a critical infrastructure? The answer is not simple. Perhaps seven years ago, when the Internet was just starting to become popular, the answer was a definite no. But today, with IPv6 being implemented and so many transactions being conducted over the Internet, the question does not have a clear-cut answer. The Internet Engineering Task Force (IETF) has several new modifications to the DNS service that may be used to protect and extend its usability.

The first extension Manning discussed was Internationalized Domain Names (IDN). To date, all DNS records are based on the ASCII character set, but many addresses in the Internet's global network cannot be easily written in ASCII characters. The goal of IDN is to provide encoding for hostnames that is fair, efficient, and allows for a smooth transition from the current scheme. The work has resulted in two encoding schemes, ACE and UTF-8. Each encoding is independent of the other, but they can be used together in various combinations. Manning expressed his opinion that, while neither is an ideal solution, ACE appears to be the lesser of two evils. A major hindrance to getting IDN rolled out into the Internet's DNS root structure is the increase in zone-file complexity.

Manning's next topic was the inclusion of IPv6 records. In an IPv4 world, it is possible for administrators to remember the numeric representation of an address; IPv6 makes the addresses too long and complex to be easily remembered. This means that DNS will play a vital role in getting IPv6 deployed. Several new resource records (RRs) have been proposed to handle the translation, including AAAA, A6, DNAME, and BITSTRING. Manning commented on the difference between IPv4 and v6 as a transport protocol, and noted that systems tuned for v4 traffic will suffer a performance hit when using v6; this is largely attributed to the increase in data transmitted per DNS request.
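
On the client side, the practical effect of AAAA records shows up through the standard resolver interface; for example, getaddrinfo() can be asked for only the IPv6 addresses of a name (the hostname below is a placeholder):

    /* Looking up an IPv6 address (an AAAA record) with getaddrinfo(). */
    #include <stdio.h>
    #include <string.h>
    #include <netdb.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    int main(void)
    {
        struct addrinfo hints, *res;
        char buf[INET6_ADDRSTRLEN];

        memset(&hints, 0, sizeof hints);
        hints.ai_family   = AF_INET6;        /* ask for AAAA records only */
        hints.ai_socktype = SOCK_STREAM;

        int err = getaddrinfo("www.example.com", NULL, &hints, &res);
        if (err != 0) {
            fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(err));
            return 1;
        }
        for (struct addrinfo *p = res; p != NULL; p = p->ai_next) {
            struct sockaddr_in6 *a = (struct sockaddr_in6 *)p->ai_addr;
            printf("%s\n", inet_ntop(AF_INET6, &a->sin6_addr, buf, sizeof buf));
        }
        freeaddrinfo(res);
        return 0;
    }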

The final extension Manning discussed was DNSSec. He introduced this extension as a mechanism that protects the system from itself. DNSSec protects against data spoofing and provides authentication between servers. It includes a cryptographic signature of the RR set to ensure authenticity and integrity; by signing the entire set, the amount of computation is kept to a minimum. The information itself is stored in a new RR within each zone file on the DNS server.

Manning briefly commented on the use of DNS to provide a PKI infrastructure, stating that that was not the purpose of DNS and therefore it should not be used in that fashion. The signing of the RR sets can be done hierarchically, resulting in the use of a single trusted key at the root of the DNS tree to sign all sets down to the leaves. However, the job of key replacement and rollover is extremely difficult for a system that is distributed across the globe.

Manning stated that in an operational test bed with all of these extensions enabled, the packet size for a single query response grew from 518 bytes to larger than 18,000, which results in a large increase in bandwidth usage for high-volume name servers. Therefore, in Manning's opinion, not all of these features will be deployed in the near future. For those looking for more information, Bill's Web site can be found at http://www.isi.edu/otdr

THE JOY OF BREAKING THINGS

Pat Parseghian, Transmeta

Summarized by Juan Navarro

Pat Parseghian described her experience in testing the Crusoe microprocessor at the Transmeta Lab for Compatibility (TLC), where the motto is "You make it, we break it" and the goal is to make engineers miserable.

She first gave an overview of Crusoe, whose most distinctive feature is a code-morphing layer that translates x86 instructions into native VLIW. Such peculiarity makes the testing process particularly challenging, because it involves testing two processors (the "external" x86 and the "internal" VLIW) and also because of variations in how the code-morphing layer works (it may interpret code the first few times it sees an instruction sequence, and translate and save in a translation cache afterwards). There are also reproducibility issues, since the VLIW processor may run at different speeds to save energy.

Pat then described the tests that they subject the processor to (hardware compatibility tests, common applications, operating systems, envelope-pushing games, and hard-to-acquire legacy applications) and the issues that must be considered when defining testing policies. She gave some testing tips, including organizational issues like tracking resources and results.

To illustrate the testing process, Pat gave a list of possible causes of a system crash that must be investigated. If it is the silicon, then it might be because it is damaged, or because of a manufacturing problem or a design flaw. If the problem is in the code-morphing layer, is the fault with the interpreter or with the translator? The fault could also be external to the processor: it could be the homemade system BIOS, a faulty mainboard, or an operator error. Or Transmeta might not be at fault at all; the crash might be due to a bug in the OS or the application. The key to pinpointing the cause of a problem is to isolate it by identifying the conditions that reproduce the problem and repeating those conditions on other test platforms and non-Crusoe systems. Then relevant factors must be identified based on product knowledge, past experience, and common sense.

To conclude, Pat suggested that some of TLC's testing lessons can be applied to products that the audience was involved in, and assured us that breaking things is fun.

TECHNOLOGY, LIBERTY, FREEDOM, AND WASHINGTON

Alan Davidson, Center for Democracy and Technology

Summarized by Steve Bauer

"Experience should teach us to be most on our guard to protect liberty when the Government's purposes are beneficent. Men born to freedom are naturally alert to repel invasion of their liberty by evil-minded rulers. The greatest dangers to liberty lurk in insidious encroachment by men of zeal, well-meaning but without understanding." – Louis Brandeis, Olmstead v. U.S.

Alan Davidson, an associate director at the CDT (http://www.cdt.org) since 1996, concluded his talk with this quote. In many ways it aptly characterizes the importance to the USENIX community of the topics he covered. The major themes of the talk were the impact of law on individual liberty, system architectures, and appropriate responses by the technical community.

The first part of the talk provided the audience with an overview of legislation either already introduced or likely to be introduced in the US Congress. This included various proposals to protect children, such as relegating all sexual content to a .xxx domain or providing kid-safe zones such as kids.us. Other similar laws discussed were the Children's Online Protection Act and legislation dealing with virtual child pornography.

Other lower-profile pieces covered were laws prohibiting false contact information in emails and domain registration databases. Similar proposals exist that would prohibit misleading subject lines in emails and require messages to clearly identify if they are advertisements. Briefly mentioned were laws impacting online gambling.

Bills and laws affecting the architectural design of systems and networks, and various efforts to establish boundaries and borders in cyberspace, were then discussed. These included the Consumer Broadband and Television Promotion Act and the French cases against Yahoo and its CEO for allowing the sale of Nazi materials on the Yahoo auction site.

A major part of the talk focused on the technological aspects of the USA-Patriot Act, passed in the wake of the September 11 attacks. Topics included "pen registers" for the Internet, expanded government power to conduct secret searches, roving wiretaps, nationwide service of warrants, sharing grand jury and intelligence information, and the establishment of an identification system for visitors to the US.

The technological community needs to be aware of and care about the impact of law on individual liberty and the systems the community builds. Specific suggestions included designing privacy and anonymity into architectures and limiting information collection, particularly any personally identifiable data, to only what is essential. Finally, Alan emphasized that many people simply do not understand the implications of law. Thus, the technology community has an important role in helping make these implications clear.

TAKING AN OPEN SOURCE PROJECT TO MARKET

Eric Allman, Sendmail, Inc.

Summarized by Scott Kilroy

Eric started the talk with the upsides and downsides to open source software. The upsides include rapid feedback, high-quality work (mostly), lots of hands, and an intensely engineering-driven approach. The downsides include no support structure, limited project marketing expertise, and volunteers having limited time.

In 1998 Sendmail was becoming a success disaster. A success disaster includes two key elements: (1) volunteers start spending all their time supporting current functionality instead of enhancing it (this heavy burden can lead to project stagnation and eventually death); (2) competition from better-funded sources can lead to FUD (fear, uncertainty, and doubt) about the project.

Eric then led listeners down the path he took to turn Sendmail into a business. Eric envisioned Sendmail, Inc. as a smallish and close-knit company but soon realized that the family atmosphere he desired could not last as the company grew. He observed that maybe only the first 10 people will buy into your vision, after which each individual is primarily interested in something else (e.g., power, wealth, or status). He warned that there will be people working for you that you do not like. Eric particularly did not care for sales people, but he emphasized how important sales, marketing, finance, and legal people are to companies.

The nature of companies in infancy can be summed up as: you need money, and you need investors in order to get money. Investors want a return on investment, so money management becomes critical. Companies therefore must optimize money functions. Eric's experiences with investors were lessons all in themselves. More than giving money, good investors can provide connections and insight, so you don't want investors who don't believe in you.

A company must have bug history, change management, and people who understand the product and have a sense of the target market. Eric now sees the importance of marketing target research. A company can never forget the simple equation that drives business: profit = revenue - expense.

Finally, the concept of value should be based on what is valuable to the customer. Eric observed that his customers needed documentation, extra value in the product that they couldn't get elsewhere, and technical support.

Eric learned hard lessons along the way. If you want to do interesting open source, it might be best not to be too successful: you don't want to let the larger dragons (companies) notice you. If you want a commercial user base, you have to manage process, watch corporate culture, provide value to customers, watch bottom lines, and develop a thick skin.

Starting a company is easy, but perpetuating it is extremely difficult.

INFORMATION VISUALIZATION FOR SYSTEMS PEOPLE

Tamara Munzner, University of British Columbia

Summarized by JD Welch

Munzner presented interesting evidence that information visualization, through interactive representations of data, can help people to perform tasks more effectively and reduce the load on working memory.

A popular technique in visualizing data is abstracting it through node-link representation – for example, a list of cross-references in a book. Representing the relationships between references in a graphical (node-link) manner "offloads cognition to the human perceptual system." These graphs can be produced manually, but automated drawing allows for increased complexity and quicker production times.
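
As a small illustration of the cross-reference example, the sketch below (with a made-up index and a hypothetical emit_dot helper) turns "see also" links into a node-link graph in DOT syntax, which an automated graph-drawing tool could then lay out.

```python
# Hypothetical cross-references: "see also" links between index terms.
cross_refs = {
    "caching": ["buffer cache", "file systems"],
    "buffer cache": ["LRU"],
    "file systems": ["caching"],
}


def emit_dot(refs):
    """Turn the cross-reference table into a node-link graph in DOT syntax,
    suitable for an automated graph-drawing tool."""
    lines = ["digraph index {"]
    for source, targets in refs.items():
        for target in targets:
            lines.append(f'    "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


print(emit_dot(cross_refs))
```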

The way in which people interpret visual data is important in designing effective visualization systems. Attention to pre-attentive visual cues is key to success. Picking out a red dot among a field of identically shaped blue dots is significantly faster than the same data represented in a mixture of hue and shape variations. There are many pre-attentive visual cues, including size, orientation, intersection, and intensity. The accuracy of these cues can be ranked, with position triggering high accuracy and color or density on the low end.

Visual cues combined with different data types – quantitative (e.g., 10 inches), ordered (small, medium, large), and categorical (apples, oranges) – can also be ranked, with spatial position being best for all types. Beyond this commonality, however, accuracy varied widely, with length ranked second for quantitative data, eighth for ordinal, and ninth for categorical data types.

These guidelines about visual perception can be applied to information visualization, which, unlike scientific visualization, focuses on abstract data and choice of specialization. Several techniques are used to clarify abstract data, including multi-part glyphs, where changes in individual parts are incorporated into an easier-to-understand gestalt, interactivity, motion, and animation. In large data sets, techniques like "focus + context," where a zoomed portion of the graph is shown along with a thumbnail view of the entire graph, are used to minimize user disorientation.

Future problems include dealing with huge databases such as the Human Genome, reckoning with dynamic data like the changing structure of the Web or real-time network monitoring, and transforming "pixel bound" displays into large "digital wallpaper"-type systems.

FIXING NETWORK SECURITY BY HACKING THE BUSINESS CLIMATE

Bruce Schneier, Counterpane Internet Security

Summarized by Florian Buchholz

Bruce Schneier identified security as one of the fundamental building blocks of the Internet. A certain degree of security is needed for all things on the Net, and the limits of security will become limits of the Net itself. However, companies are hesitant to employ security measures. Science is doing well, but one cannot see the benefits in the real world. An increasing number of users means more problems affecting more people. Old problems such as buffer overflows haven't gone away, and new problems show up. Furthermore, the amount of expertise needed to launch attacks is decreasing.

Schneier argues that complexity is the enemy of security and that while security is getting better, the growth in complexity is outpacing it. As security is fundamentally a people problem, one shouldn't focus on technologies but rather should look at businesses, business motivations, and business costs. Traditionally, one can distinguish between two security models: threat avoidance, where security is absolute, and risk management, where security is relative and one has to mitigate the risk with technologies and procedures. In the latter model, one has to find a point with reasonable security at an acceptable cost.

After a brief discussion on how security is handled in businesses today, in which he concluded that businesses "talk big about it but do as little as possible," Schneier identified four necessary steps to fix the problems.

First, enforce liabilities. Today no real consequences have to be feared from security incidents. Holding people accountable will increase the transparency of security and give an incentive to make processes public. As possible options to achieve this, he listed industry-defined standards, federal regulation, and lawsuits. The problems with enforcement, however, lie in difficulties associated with the international nature of the problem and the fact that complexity makes assigning liabilities difficult. Furthermore, fear of liability could have a chilling effect on free software development and could stifle new companies.

As a second step, Schneier identified the need to allow partners to transfer liability. Insurance would spread liability risk among a group and would be a CEO's primary risk analysis tool. There is a need for standardized risk models and protection profiles, and for more security as opposed to empty press releases. Schneier predicts that insurance will create a marketplace for security where customers and vendors will have the ability to accept liabilities from each other. Computer-security insurance should soon be as common as household or fire insurance, and from that development insurance will become the driving factor of the security business.

The next step is to provide mechanisms to reduce risks both before and after software is released; techniques and processes to improve software quality, as well as an evolution of security management, are therefore needed. Currently, security products try to rebuild the walls – such as physical badges and entry doorways – that were lost when getting connected. Schneier believes this "fortress metaphor" is bad; one should think of the problem more in terms of a city. Since most businesses cannot afford proper security, outsourcing is the only way to make security scalable. With outsourcing, the concept of best practices becomes important, and insurance companies can be tied to them; outsourcing will level the playing field.

As a final step, Schneier predicts that rational prosecution and education will lead to deterrence. He claims that people feel safe because they live in a lawful society, whereas the Internet is classified as lawless, very much like the American West in the 1800s or a society ruled by warlords. This is because it is difficult to prove who an attacker is, and prosecution is hampered by complicated evidence gathering and irrational prosecution. Schneier believes, however, that education will play a major role in turning the Internet into a lawful society. Specifically, he pointed out that we need laws that can be explained.

Risks won't go away; the best we can do is manage them. A company able to manage risk better will be more profitable, and we need to give CEOs the necessary risk-management tools.

There were numerous questions. Listed below are the more complicated ones, in a paraphrased Q&A format.

Q: Will liability be effective? Will insurance companies be willing to accept the risks? A: The government might have to step in; it needs to be seen how it plays out.

Q: Is there an analogy to the real world in the fact that in a lawless society only the rich can afford security? Security solutions differentiate between classes. A: Schneier disagreed with the statement, giving an analogy to front-door locks, but conceded that there might be special cases.

Q: Doesn't homogeneity hurt security? A: Homogeneity is oversold. Diverse types can be more survivable, but given the limited number of options, the difference will be negligible.

Q: Regulations in airbag protection have led to deaths in some cases. How can we keep the pendulum from swinging the other way? A: Lobbying will not be prevented. An imperfect solution is probable; there might be reversals of requirements, such as the airbag one.

Q: What about personal liability? A: This will be analogous to auto insurance; liability comes with computer/Net access.

Q: If the rule of law is to become reality, there must be a law enforcement function that applies to a physical space. You cannot do that with any existing government agency for the whole world. Should an organization like the UN assume that role? A: Schneier was not convinced global enforcement is possible.

Q: What advice should we take away from this talk? A: Liability is coming. Since the network is important to our infrastructure, eventually the problems will be solved in a legal environment. You need to start thinking about how to solve the problems and how the solutions will affect us.

The entire speech (including slides) is available at http://www.counterpane.com/presentation4.pdf.

LIFE IN AN OPEN SOURCE STARTUP

Daryll Strauss, Consultant

Summarized by Teri Lampoudi

This talk was packed with morsels of insight and tidbits of information about what life in an open source startup is like (though "was like" might be more appropriate), what the issues are in starting, maintaining, and commercializing an open source project, and the way hardware vendors treat open source developers. Strauss began by tracing the timeline of his involvement with developing 3-D graphics support for Linux: from the 3dfx voodoo1 driver for Linux in 1997, to the establishment of Precision Insight in 1998, to its buyout by VA Linux, to the dismantling of his group by VA, to what the future may hold.

Obviously, there are benefits from doing open source development, both for the developer and for the end user. An important point was that open source develops entire technologies, not just products. A misapprehension is the inherent difficulty of managing a project that accepts code from a large number of developers, some volunteer and some paid. The notion that code can just be thrown over the wall into the world is a Big Lie.

How does one start and keep afloat an open source development company? In the case of Precision Insight, the subject matter – developing graphics and OpenGL for Linux – required a considerable amount of expertise: intimate knowledge of the hardware, the libraries, X, and the Linux kernel, as well as the end applications. Expertise is marketable. What helps even further is having a visible virtual team of experts, people who have an established track record of contributions to open source in the particular area of expertise. Support from the hardware and software vendors came next. While everyone was willing to contract PI to write the portions of code that were specific to their hardware, no one wanted to pay the bill for developing the underlying foundation that was necessary for the drivers to be useful. In the PI model, the infrastructure cost was shared by several customers at once, and the technology kept moving. Once a driver is written for a vendor, the vendor is perfectly capable of figuring out how to write the next driver necessary, thereby obviating the need to contract the job. This, and questions of protecting intellectual property – hardware design in this case – are deeply problematic with the open source mode of development. This is not to say that there is no way to get over them, but that they are likely to arise more often than not.

Management issues seem equally important. In the PI setting, contracts were flexible – both a good and a bad thing – and developers overworked. Further, development "in a fishbowl," that is, under public scrutiny, is not easy. Strauss stressed the value of good communication and the use of various out-of-sync communication methods like IRC and mailing lists.

Finally, Strauss discussed what portions of software can be free and what can be proprietary. His suggestion was that horizontal markets want to be free, whereas vertically one can develop proprietary solutions. The talk closed with a small stroll through the problems that project politics brings up, from things like merging branches to mailing list and source tree readability.

GENERAL TRACK SESSIONS

FILE SYSTEMS

Summarized by Haijin Yan

STRUCTURE AND PERFORMANCE OF THE DIRECT ACCESS FILE SYSTEM

Kostas Magoutis, Salimah Addetia, Alexandra Fedorova, and Margo I. Seltzer, Harvard University; Jeffrey Chase, Andrew Gallatin, Richard Kisley, and Rajiv Wickremesinghe, Duke University; and Eran Gabber, Lucent Technologies

This paper won the Best Paper award

The Direct Access File System (DAFS) is a new standard for network-attached storage over direct-access transport networks. DAFS takes advantage of user-level network interface standards that enable client-side functionality in which remote data access resides in a library rather than in the kernel. This reduces the overhead of memory copy for data movement and protocol overhead. Remote Direct Memory Access (RDMA) is a direct-access transport network that allows the network adapter to reduce copy overhead by accessing application buffers directly.

This paper explores the fundamental structural and performance characteristics of network file access using a user-level file system structure on a direct-access transport network with RDMA. It describes DAFS-based client and server reference implementations for FreeBSD and reports experimental results comparing DAFS to a zero-copy NFS implementation. It illustrates the benefits and trade-offs of these techniques to provide a basis for informed choices about deployment of DAFS-based systems and similar extensions to other network file protocols, such as NFS. Experiments show that DAFS gives applications direct control over an I/O system and increases the client CPU's usage while the client is doing I/O. Future work includes how to address longstanding problems related to the integration of the application and file system for high-performance applications.

CONQUEST: BETTER PERFORMANCE THROUGH A DISK/PERSISTENT-RAM HYBRID FILE SYSTEM

An-I A. Wang, Peter Reiher, and Gerald J. Popek, UCLA; Geoffrey M. Kuenning, Harvey Mudd College

Motivated by the declining cost of persistent RAM, the authors propose the Conquest file system, which stores all small files, metadata, executables, and shared libraries in persistent RAM; disks hold only the data content of remaining large files. Compared to alternatives such as caching and RAM file systems, Conquest has the advantages of efficiency, consistency, and reliability at a reduced cost. Using popular benchmarks, experiments show that Conquest incurs little overhead while achieving faster performance. Future work includes designing mechanisms for adjusting the file-size threshold dynamically and finding a better disk layout for large data blocks.
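
A minimal sketch of that placement policy, assuming a hypothetical 1MB threshold and dictionary "stores" standing in for persistent RAM and disk; this is not Conquest's actual interface, just the routing decision it describes.

```python
# Hypothetical media targets and size threshold, for illustration only.
SIZE_THRESHOLD = 1 << 20  # 1MB: files at or below this stay in persistent RAM

ram_store = {}   # small files, metadata, executables, shared libraries
disk_store = {}  # data content of large files only


def place_file(name, data, is_metadata=False, is_executable=False):
    """Route a file to persistent RAM or disk the way a Conquest-like
    hybrid would: small and hot structures in RAM, bulk data on disk."""
    if is_metadata or is_executable or len(data) <= SIZE_THRESHOLD:
        ram_store[name] = data
    else:
        # Only the large file's data goes to disk; its metadata would
        # still live in persistent RAM in the described design.
        disk_store[name] = data


place_file("/etc/passwd", b"root:x:0:0")
place_file("/video/talk.mpg", b"\x00" * (8 << 20))
print(sorted(ram_store), sorted(disk_store))
```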

EXPLOITING GRAY-BOX KNOWLEDGE OF BUFFER-CACHE MANAGEMENT

Nathan C. Burnett, John Bent, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin, Madison

Knowing what algorithm is used to manage the buffer cache is very important for improving application performance. However, there is currently no interface for finding this algorithm. This paper introduces Dust, a simple fingerprinting tool that is able to identify the buffer-cache replacement policy. Dust automatically identifies the cache size and replacement policy based on the configuring attributes of access order, recency, frequency, and long-term history. Through simulation, Dust was able to distinguish between a variety of replacement algorithm policies found in the literature: FIFO, LRU, LFU, Clock, Segmented FIFO, 2Q, and LRU-K. Further experiments fingerprinting real operating systems such as NetBSD, Linux, and Solaris show that Dust is able to identify the hidden cache replacement algorithm.

By knowing the underlying cache replacement policy, a cache-aware Web server can reschedule the requests on a cached-request-first policy to obtain performance improvement. Experiments show that cache-aware scheduling improves average response time and system throughput.
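
The fingerprinting approach can be pictured with a toy version: drive an unknown cache through a crafted access pattern and let hits and misses reveal the policy. The Cache class, its size, and the two policies below are stand-ins for illustration; Dust itself probes a real OS buffer cache and distinguishes many more algorithms.

```python
from collections import OrderedDict


class Cache:
    """Tiny block cache with a pluggable replacement policy ('lru' or 'fifo')."""

    def __init__(self, size, policy):
        self.size, self.policy = size, policy
        self.blocks = OrderedDict()

    def access(self, block):
        hit = block in self.blocks
        if hit and self.policy == "lru":
            self.blocks.move_to_end(block)       # refresh recency on a hit
        if not hit:
            if len(self.blocks) >= self.size:
                self.blocks.popitem(last=False)  # evict the head of the queue
            self.blocks[block] = True
        return hit


def fingerprint(cache, size):
    """Infer the policy purely from observed hit/miss behavior."""
    for b in range(size):
        cache.access(b)        # fill the cache with blocks 0 .. size-1
    cache.access(0)            # re-touch block 0 (matters only under LRU)
    cache.access(size)         # force exactly one eviction
    return "LRU-like" if cache.access(0) else "FIFO-like"


for policy in ("lru", "fifo"):
    print(policy, "->", fingerprint(Cache(8, policy), 8))
```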

OPERATING SYSTEMS (AND DANCING BEARS)

Summarized by Matt Butner

THE JX OPERATING SYSTEM

Michael Golm, Meik Felser, Christian Wawersich, and Jürgen Kleinöder, University of Erlangen-Nürnberg

The talk opened by emphasizing the need for and the practicality of a functional Java OS. Such implementation in the JX OS attempts to mimic the recent trend toward using highly abstracted languages in application development in order to create an OS as functional and powerful as one written in a lower-level language.

The JX is a micro-kernel solution that uses separate JVMs for each entity of the kernel and, in some cases, for each application. The separated domains do not share objects and have no thread migration, and each domain implements its own code. The interesting part of the presentation was the discussion of system-level programming with Java; key areas discussed were memory management and interrupt handlers. The authors concluded by noting that their type-safe and modular system resulted in a robust system with great configuration flexibility and acceptable performance.

DESIGN EVOLUTION OF THE EROS SINGLE-LEVEL STORE

Jonathan S. Shapiro, Johns Hopkins University; Jonathan Adams, University of Pennsylvania

This presentation outlined current characteristics of file systems and some of their least desirable characteristics. It was based on the revival of the EROS single-level-store approach for current attempts to capitalize on system consistency and efficiency. Their solution did not relate directly to EROS, but rather to the design and use of the constructs and notions on which EROS is based. By extending the mapping of the memory system to include the disk, systems are further able to ensure global persistence without regard for a disk structure. In explaining the need for absolute system consistency and an environment which, by definition, is not allowed to crash, the goal of having such an exhaustive design becomes clear. The cool part of this work is the availability of its design and architecture to the public.

THINK: A SOFTWARE FRAMEWORK FOR COMPONENT-BASED OPERATING SYSTEM KERNELS

Jean-Philippe Fassino, France Télécom R&D; Jean-Bernard Stefani, INRIA; Julia Lawall, DIKU; Gilles Muller, INRIA

This presentation discussed the need for component-based operating systems and respective structures to ensure flexibility for arbitrary-sized systems. Think provides a binding model that maps uniform components for OS developers and architects to follow, ensuring consistent implementations for an arbitrary system size. However, the goal is not to force developers into a predefined kernel but to promote the use of certain components in varied ways.

Think's concentration is primarily on embedded systems, where short development time is necessary but is constrained by rigorous needs and limited resources. The need to build flexible systems in such an environment can be costly and implementation-specific, but Think creates an environment supported by the ability to dynamically load type-safe components. This allows for more flexible systems that retain functionality because of Think's ability to bind fine-grained components. Benchmarks revealed that dedicated micro-kernels can show performance improvements in comparison to monolithic kernels. Notable future work includes building components for low-end appliances and the development of a real-time OS component library.

BUILDING SERVICES

Summarized by Pradipta De

NINJA: A FRAMEWORK FOR NETWORK SERVICES

J. Robert von Behren, Eric A. Brewer, Nikita Borisov, Michael Chen, Matt Welsh, Josh MacDonald, Jeremy Lau, and David Culler, University of California, Berkeley

Robert von Behren presented Ninja, an ongoing project that aims to provide a framework for building robust and scalable Internet-based network services like Web-hosting, instant-messaging, email, and file-sharing applications. Robert drew attention to the difficulties of writing cluster applications. One has to take care of data consistency and issues of concurrency, and for robust applications there are problems related to fault tolerance. Ninja works as a wrapper to relieve the user of these problems. The goal of the project is building network services that are scalable and highly available, maintaining persistent data, providing graceful degradation, and supporting online evolution.

The use of clusters distinguishes this setup from generic distributed systems in terms of reliability and security, as well as providing a partition-free network. The next important feature of this project is the new programming model, which is more restrictive than a general programming model but still expressive enough to write most of the applications. This model, described as "single program multiple connection," uses intertask parallelism instead of multithreaded concurrency and is characterized by all nodes running the same program, with connections delegated to the nodes by a centralized connection manager (CM). A CM takes care of hiding the details of mapping an external connection to an internal node. Ninja also uses a design called "staged event-driven architecture," where each service is broken down into a set of stages connected by event queues. This architecture is suitable for a modular design and helps in graceful degradation by adaptive load shedding from the event queues.
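
The staged design can be sketched in a few lines: each stage owns a bounded event queue, and dropping events when the queue fills is one crude form of the adaptive load shedding mentioned above. The stage names, queue bound, and handlers are invented for illustration and are not Ninja's API.

```python
from collections import deque

QUEUE_LIMIT = 4  # hypothetical bound per stage


class Stage:
    """A service stage: a handler fed by its own bounded event queue."""

    def __init__(self, name, handler):
        self.name, self.handler, self.queue = name, handler, deque()
        self.dropped = 0

    def enqueue(self, event):
        if len(self.queue) >= QUEUE_LIMIT:
            self.dropped += 1          # simple load shedding under overload
            return False
        self.queue.append(event)
        return True

    def run(self, next_stage=None):
        while self.queue:
            result = self.handler(self.queue.popleft())
            if next_stage is not None:
                next_stage.enqueue(result)


parse = Stage("parse", lambda req: req.upper())
reply = Stage("reply", lambda req: f"200 OK for {req}")

for i in range(6):                     # 6 requests against a bound of 4
    parse.enqueue(f"req-{i}")
parse.run(next_stage=reply)
reply.run()
print("dropped by parse stage:", parse.dropped)
```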

The presentation concluded with examples of implementation and evaluation of a Web server and an email system called NinjaMail to show the ease of authoring and the efficacy of using the Ninja framework for service development.

USING COHORT-SCHEDULING TO ENHANCE SERVER PERFORMANCE

James R. Larus and Michael Parkes, Microsoft Research

James Larus presented a new scheduling policy for increasing the throughput of server applications. Cohort scheduling batches execution of similar operations arising in different server requests. The usual programming paradigm in handling server requests is to spawn multiple concurrent threads and switch from one thread to another whenever a thread gets blocked for I/O. Since the threads in a server mostly execute unrelated pieces of code, the locality of reference is reduced, and hence the effectiveness of different caching mechanisms. One way to solve this problem is to throw in more hardware, but Larus presented a complementary view where the program behavior is investigated and used to improve the performance.

The problem in this scheme is to identify pieces of code that can be batched together for processing. One simple way is to look at the next program counter values and use them to club threads together. However, this talk presented a new programming abstraction, "staged computation," which replaces the thread model with "stages." A stage is an abstraction for operations with similar behavior and locality. The StagedServer library can be used for programming in this model. It is a collection of C++ classes that implement staged computation and cohort scheduling on either a uniprocessor or multiprocessor. It can be used to define stages.
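
The batching idea itself is easy to picture. In the toy sketch below (with invented stage names, not the StagedServer API), operations are grouped by stage and each cohort runs together, so a stage's code and data stay warm in the cache instead of being evicted between unrelated requests.

```python
from collections import defaultdict

# Each request goes through the same hypothetical stages.
STAGES = ["parse", "lookup", "format"]
requests = [f"req-{i}" for i in range(4)]


def run_stage(stage, request):
    return f"{stage}({request})"


def thread_per_request():
    """Interleaved execution: stages constantly alternate, hurting locality."""
    return [run_stage(s, r) for r in requests for s in STAGES]


def cohort_schedule():
    """Batch the same operation across requests and run each cohort together."""
    cohorts = defaultdict(list)
    for r in requests:
        for s in STAGES:
            cohorts[s].append(r)
    order = []
    for s in STAGES:               # one pass per stage keeps its code/data hot
        order.extend(run_stage(s, r) for r in cohorts[s])
    return order


print(thread_per_request()[:4])    # parse, lookup, format, parse, ...
print(cohort_schedule()[:4])       # parse, parse, parse, parse, ...
```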

The presentation showed the experimental evaluation of cohort scheduling over the thread-based model by implementing two servers: a Web server, which is mainly I/O bound, and a publish-subscribe server, which is mainly compute bound. The SURGE benchmark was used for the first experiment and the Fabret workload for the second. The results showed that the cohort-scheduling-based implementation gave better throughput than the thread-based implementation at high loads.

NETWORK PERFORMANCE

Summarized by Xiaobo Fan

ETE: PASSIVE END-TO-END INTERNET SERVICE PERFORMANCE MONITORING

Yun Fu and Amin Vahdat, Duke University; Ludmila Cherkasova and Wenting Tang, HP Labs

This paper won the Best Student Paper award.

Ludmila Cherkasova began by listing several questions most Web service providers want answered in order to improve service quality. She reviewed the difficulties of making accurate and efficient end-to-end Web service measurement and the shortcomings of currently available approaches. They propose a passive trace-based architecture called EtE to monitor Web server performance on behalf of end users.

The first step is to collect network packets passively. The second module reconstructs all TCP connections and extracts HTTP transactions. To reconstruct Web page accesses, they first build a knowledge base indexed by client IP and URL and then group objects to the related Web pages they are embedded in. Statistical analysis is used to handle non-matched objects. EtE Monitor can generate three groups of metrics to measure Web service performance: response time, Web caching, and page abortion.

To demonstrate the benefits of EtE Monitor, Cherkasova talked about two case studies and, based on the calculated metrics, gave some insightful explanations about what's happening behind the variations of Web performance. Through validation, they claim their approach provides a very close approximation to the real scenario.

THE PERFORMANCE OF REMOTE DISPLAY MECHANISMS FOR THIN-CLIENT COMPUTING

S. Jae Yang, Jason Nieh, Matt Selsky, and Nikhil Tiwari, Columbia University

Noting the trend toward thin-client computing, the authors compared different techniques and design choices in measuring the performance of six popular thin-client platforms – Citrix MetaFrame, Microsoft Terminal Services, Sun Ray, Tarantella, VNC, and X. After pointing out several challenges in benchmarking thin clients, Yang proposed slow-motion benchmarking to achieve non-invasive packet monitoring and consistent visual quality. Basically, they insert delays between separate visual events in the benchmark applications on the server side so that the client's display update can catch up with the server's processing speed. The experiments are conducted on an emulated network over a range of network bandwidths.

Their results show that thin clients can provide good performance for Web applications in LAN environments, but only some platforms performed well for the video benchmark. Pixel-based encoding may achieve better performance and bandwidth efficiency than high-level graphics. Display caching and compression should be used with care.

A MECHANISM FOR TCP-FRIENDLY TRANSPORT-LEVEL PROTOCOL COORDINATION

David E. Ott and Ketan Mayer-Patel, University of North Carolina, Chapel Hill

A revised transport-level protocol optimized for cluster-to-cluster (C-to-C) applications is what David Ott tries to explain in his talk. A C-to-C application class is identified as one set of processes communicating to another set of processes across a common Internet path. The fundamental problem with C-to-C applications is how to coordinate all C-to-C communication flows so that they share a consistent view of the common C-to-C network, adapt to changing network conditions, and cooperate to meet specific requirements. Aggregation points (AP) are placed at the first and last hop of the common data path to probe network conditions (latency, bandwidth, loss rate, etc.). To carry and transfer this information, a new protocol – the Coordination Protocol (CP) – is inserted between the network layer (IP) and the transport layer (TCP, UDP, etc.). Ott illustrated how the CP header is updated and used when packets originate from the source and traverse through the local and remote AP to arrive at their destination, and how an AP maintains a per-cluster state table and detects network conditions. Through simulation results, this coordination mechanism appears effective in sharing common network resources among C-to-C communication flows.

STORAGE SYSTEMS

Summarized by Praveen Yalagandula

MY CACHE OR YOURS? MAKING STORAGE MORE EXCLUSIVE

Theodore M. Wong, Carnegie Mellon University; John Wilkes, HP Labs

Theodore Wong explained the inefficiency of current "inclusive" caching schemes in storage area networks – when a client accesses a block, the block is read from the disk and is cached at both the disk array cache and at the client's cache. Then he presented the concept of "exclusive" caching, where a block accessed by a client is only cached in that client's cache. On eviction from the client's cache, they have come up with a DEMOTE operation to move the data block to the tail of the array cache.
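
A small sketch of exclusive caching with DEMOTE, using invented cache sizes and ordered dictionaries in place of the real client and array caches: a block read from disk is cached only at the client, and on client eviction it is demoted to the tail of the array cache rather than discarded.

```python
from collections import OrderedDict

CLIENT_SIZE, ARRAY_SIZE = 4, 8   # hypothetical cache sizes (in blocks)
client, array = OrderedDict(), OrderedDict()


def demote(block):
    """Client eviction: hand the block to the tail of the array cache."""
    if len(array) >= ARRAY_SIZE:
        array.popitem(last=False)          # array evicts from its head
    array[block] = True


def read(block):
    if block in client:                    # client hit
        client.move_to_end(block)
        return "client hit"
    if block in array:                     # array hit: block moves to client
        array.pop(block)
        outcome = "array hit"
    else:                                  # miss: read from disk, cache only
        outcome = "disk read"              # at the client (exclusive caching)
    if len(client) >= CLIENT_SIZE:
        victim, _ = client.popitem(last=False)
        demote(victim)                     # DEMOTE instead of dropping
    client[block] = True
    return outcome


for b in [1, 2, 3, 4, 5, 1]:
    print(b, read(b))                      # block 1 comes back as an array hit
```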

The new exclusive caching schemes are evaluated on both single-client and multiple-client systems. The single-client results are presented for two different types of synthetic workloads: Random (transaction-processing-type workloads) and Zipf (typical Web workloads). The exclusive policy was quite effective in achieving higher hit rates and lower read latencies. They also showed a 2.2 times hit-rate improvement over inclusive techniques in the case of a real-life workload, httpd, with a single-client setting.

The DEMOTE scheme also performed well in the case of multiple-client systems when the data accessed by clients is disjoint. In the case of shared-data workloads, this scheme performed worse than the inclusive schemes. Theodore then presented an adaptive exclusive caching scheme – a block accessed by a client is placed at the tail in the client's cache and also in the array cache at an appropriate place determined by the popularity of the block. The popularity of the block is measured by maintaining a ghost cache to accumulate the number of times each block is accessed. This new adaptive scheme achieved a maximum of 52% speedup in the mean latency in the experiments with real-life workloads.

For more information, visit http://www.cs.cmu.edu/~tmwong/research and http://www.hpl.hp.com/research/itc/csl/ssp.

BRIDGING THE INFORMATION GAP IN STORAGE PROTOCOL STACKS

Timothy E. Denehy, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin, Madison

Currently there is a huge information gap between storage systems and file systems. The interface exposed by storage systems to file systems, based on blocks and providing only read/write interfaces, is very narrow. This leads to poor performance as a whole because of duplicated functionality in both systems and reduced functionality resulting from storage systems' lack of file information.

The speaker presented two enhancements: ExRAID, an exposed RAID, and ILFS, an Informed Log-Structured File System. ExRAID is an enhancement to the block-based RAID storage system that exposes the following three types of information to the file system: (1) regions – contiguous portions of the address space comprising one or multiple disks; (2) performance information about queue lengths and throughput of the regions, revealing disk heterogeneity to the file systems; and (3) failure information – dynamically updated information conveying the number of tolerable failures in each region.

ILFS allows online incremental expansion of the storage space, performs dynamic parallelism using ExRAID's performance information to perform well on heterogeneous storage systems, and provides a range of different mechanisms, with different granularities, for redundancy of files using ExRAID's region failure characteristics. These new features are added to LFS with only a 19% increase in the code size.

More information: http://www.cs.wisc.edu/wind.

MAXIMIZING THROUGHPUT IN REPLICATED DISK STRIPING OF VARIABLE BIT-RATE STREAMS

Stergios V. Anastasiadis, Duke University; Kenneth C. Sevcik and Michael Stumm, University of Toronto

There is an increasing demand for the continuous real-time streaming of media files. Disk striping is a common technique used for supporting many connections on the media servers. But even with 1.2 million hours of mean time between failures, there will be more than one disk failure per week with 7000 disks. Fault tolerance can be achieved by using data replication and reserving extra bandwidth during normal operation. This work focused on supporting the most common variable bit-rate stream formats (e.g., MPEG).

The authors used the prototype media server EXEDRA to implement and evaluate different replication and bandwidth reservation schemes. This system supports a variable bit-rate scheme, does stride-based disk allocation, and is capable of supporting variable-grain disk striping. Two replication policies are presented: deterministic – data blocks are replicated in a round-robin fashion on the secondary replicas; and random – data blocks are replicated on randomly chosen secondary replicas. Two bandwidth reservation techniques are also presented: mirroring reservation – disk bandwidth is reserved for both primary and replicas of the media file during playback; and minimum reservation – a more efficient scheme in which bandwidth is reserved only for the sum of the primary data access time and the maximum of the backup data access times required in each round.

Experimental results showed that deterministic replica placement is better than random placement for small disks, minimum disk bandwidth reservation is twice as good as mirroring in the throughput achieved, and fault tolerance can be achieved with a minimal impact on the throughput.

TOOLS

Summarized by Li Xiao

SIMPLE AND GENERAL STATISTICAL PROFILING WITH PCT

Charles Blake and Steven Bauer, MIT

This talk introduced the Profile Collection Toolkit (PCT) – a sampling-based CPU profiling facility. A novel aspect of PCT is that it allows sampling of semantically rich data such as function call stacks, function parameters, local or global variables, CPU registers, or other execution contexts. The design objectives of PCT were driven by user needs and the inadequacies or inaccessibility of prior systems.

The rich data collection capability is achieved via a debugger-controller program, dbctl. The talk then introduced the profiling collector and data collection activation. PCT file format options provide flexibility and various ways to store exact states. PCT also has analysis options; two examples were given – "Simple" and "Mixed User and Linux Code."

The talk then went into how general sampling helps in the following areas: a debugger-controller state machine, example script fragments, a simple example program, and general value reports in terms of call-path histograms, numeric value histograms, and data generality. Related work such as gprof and Expect was then summarized. The contribution of this work is to provide a portable, general value sampling tool. The limitations are that PCT does not support much of strip, and count inaccuracies happen because of its statistical nature. Based on this limitation, -g is preferred over strip.

Possible future directions included supporting more debuggers, such as dbx and xdb, script generators for other kinds of traces, more canned reports for general values, and a libgdb-based library sampler. PCT is available at http://pdos.lcs.mit.edu/pct. PCT runs on almost any UNIX-like system.

ENGINEERING A DIFFERENCING AND COMPRESSION DATA FORMAT

David G. Korn and Kiem Phong Vo, AT&T Laboratories – Research

This talk began with an equation: "Differencing + Compression = Delta Compression." Compression removes information redundancy in a data set; examples are gzip, bzip, compress, and pack. Differencing encodes differences between two data sets; examples are diff -e, fdelta, bdiff, and xdelta. Delta compression compresses a data set given another, combining differencing and compression, and reduces to pure compression when there's no commonality.
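
The equation can be illustrated with standard library pieces: compute a difference of the new version against a reference, then compress the (usually small) delta. This is only a conceptual sketch built on Python's difflib and zlib, not the Vcdiff encoding.

```python
import difflib
import random
import zlib

random.seed(0)
words = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot"]

# A reference version and a lightly edited new version of a line-oriented file.
ref_lines = [f"record {i}: {random.choice(words)}\n" for i in range(2000)]
new_lines = list(ref_lines)
for i in range(0, 2000, 200):
    new_lines[i] = f"record {i}: edited\n"     # touch 10 of the 2000 lines

# Differencing: encode the new version relative to the reference.
delta = "".join(difflib.unified_diff(ref_lines, new_lines)).encode()

plain = zlib.compress("".join(new_lines).encode())  # pure compression
delta_compressed = zlib.compress(delta)             # differencing + compression

print(len(plain), len(delta_compressed))
# With a good reference the compressed delta is far smaller than compressing
# the new version by itself; with no commonality the approach degrades toward
# plain compression plus a little differencing overhead.
```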

After presenting an overall scheme for a delta compressor and showing delta compression performance, this talk discussed the encoding format of the newly designed Vcdiff for delta compression. A Vcdiff instruction code table consists of 256 entries, each coding up to a pair of instructions, and encodes 1-byte indices and any additional data.

The talk then showed Vcdiff's performance with Web data collected from CNN, compared with the W3C standard Gdiff, Gdiff+gzip, and diff+gzip, where Gdiff was computed using Vcdiff delta instructions. The results of two experiments, "First" and "Successive," were presented. In "First," each file is compressed against the first file collected; in "Successive," each file is compressed against the one in the previous hour. The diff+gzip did not work well because diff is line-oriented. Vcdiff performed favorably compared with the other formats.

Vcdiff is part of the Vcodex package. Vcodex is a platform for all common data transformations: delta compression, plain compression, encryption, and transcoding (e.g., uuencode, base64). It is structured in three layers for maximum usability; the base library uses Disciplines and Methods interfaces.

The code for Vcdiff can be found at http://www.research.att.com/sw/tools. They are moving Vcdiff to the IETF standard as a comprehensive platform for transforming data. Please refer to http://www.ietf.org/internet-drafts/draft-korn-vcdiff-06.txt.

WHERE IN THE NET

Summarized by Amit Purohit

A PRECISE AND EFFICIENT EVALUATION OF THE PROXIMITY BETWEEN WEB CLIENTS AND THEIR LOCAL DNS SERVERS

Zhuoqing Morley Mao, University of California, Berkeley; Charles D. Cranor, Fred Douglis, Michael Rabinovich, Oliver Spatscheck, and Jia Wang, AT&T Labs – Research

Content Distribution Networks (CDNs) attempt to improve Web performance by delivering Web content to end users from servers located at the edge of the network. When a Web client requests content, the CDN dynamically chooses a server to route the request to, usually the one that is nearest the client. As CDNs have access only to the IP address of the local DNS server (LDNS) of the client, the CDN's authoritative DNS server maps the client's LDNS to a geographic region within a particular network and combines that with network and server load information to perform CDN server selection.

This method has two limitations. First, it is based on the implicit assumption that clients are close to their LDNS. This may not always be valid. Second, a single request from an LDNS can represent a differing number of Web clients – called the hidden load factor. Knowledge of the hidden load factor can be used to achieve better load distribution.

The extent of the first limitation and its impact on the CDN server selection is dealt with. To determine the associations between clients and their LDNS, a simple, non-intrusive, and efficient mapping technique was developed. The data collected was used to study the impact of proximity on DNS-based server selection using four different proximity metrics: (1) autonomous system (AS) clustering – observing whether a client is in the same AS as its LDNS – concluded that LDNS is very good for coarse-grained server selection, as 64% of the associations belong to the same AS; (2) network clustering – observing whether a client is in the same network-aware cluster (NAC) – implied that DNS is less useful for finer-grained server selection, since only 16% of the clients and LDNS are in the same NAC; (3) traceroute divergence – the length of the divergent paths to the client and its LDNS from a probe point using traceroute – implies that most clients are topologically close to their LDNS as viewed from a randomly chosen probe site; (4) round-trip time (RTT) correlation (some CDNs select servers based on RTT between the CDN server and the client's LDNS) examines the correlation between the message RTTs from a probe point to the client and its local DNS; results imply this is a reasonable metric to use to avoid really distant servers.

A study of the impact that client-LDNS associations have on DNS-based server selection concludes that knowing the client's IP address would allow more accurate server selection in a large number of cases. The optimality of the server selection also depends on the server density, placement, and selection algorithms.

Further information can be found at http://www.eecs.berkeley.edu/~zmao/myresearch.html or by contacting zmao@eecs.berkeley.edu.

GEOGRAPHIC PROPERTIES OF INTERNET ROUTING

Lakshminarayanan Subramanian, Venkata N. Padmanabhan, Microsoft Research; Randy H. Katz, University of California at Berkeley

Geographic information can provide insights into the structure and functioning of the Internet, including interactions between different autonomous systems, by analyzing certain properties of Internet routing. It can be used to measure and quantify certain routing properties such as circuitous routing, hot-potato routing, and geographic fault tolerance.

Traceroute has been used to gather the required data, and the Geotrack tool has been used to determine the location of the nodes along each network path. This enables the computation of "linearized distances," which is the sum of the geographic distances between successive pairs of routers along the path.

In order to measure the circuitousness of a path, a metric "distance ratio" has been defined as the ratio of the linearized distance of a path to the geographic distance between the source and destination of the path. From the data it has been observed that the circuitousness of a route depends on both the geographic and network location of the end hosts. A large value of the distance ratio enables us to flag paths that are highly circuitous, possibly (though not necessarily) because of routing anomalies. It is also shown that the minimum delay between end hosts is strongly correlated with the linearized distance of the path.
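
Both quantities can be computed directly from router coordinates. The sketch below uses a hypothetical router path with rough latitude/longitude values; only the great-circle (haversine) distance and the ratio itself follow the definitions above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def great_circle_km(a, b):
    """Haversine distance between two (lat, lon) points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def distance_ratio(router_coords):
    """Linearized path distance divided by the end-to-end geographic distance."""
    linearized = sum(
        great_circle_km(router_coords[i], router_coords[i + 1])
        for i in range(len(router_coords) - 1)
    )
    direct = great_circle_km(router_coords[0], router_coords[-1])
    return linearized / direct


# Hypothetical path: Berkeley -> Seattle -> Chicago -> New York.
path = [(37.87, -122.27), (47.61, -122.33), (41.88, -87.63), (40.71, -74.01)]
print(round(distance_ratio(path), 2))  # values well above 1 flag circuitous routes
```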

Geographic information can be used to study various aspects of wide-area Internet paths that traverse multiple ISPs. It was found that end-to-end Internet paths tend to be more circuitous than intra-ISP paths, the cause for this being the peering relationships between ISPs. Also, paths traversing substantial distances within two or more ISPs tend to be more circuitous than paths largely traversing only a single ISP. Another finding is that ISPs generally employ hot-potato routing.

Geographic information can also be used to capture the fact that two seemingly unrelated routers can be susceptible to correlated failures. By using the geographic information of routers, we can construct a geographic topology of an ISP. Using this, we can find the tolerance of an ISP's network to the total network failure in a geographic region.

For further information, contact lakme@cs.berkeley.edu.

PROVIDING PROCESS ORIGIN INFORMATION TO AID IN NETWORK TRACEBACK

Florian P. Buchholz, Purdue University; Clay Shields, Georgetown University

Network traceback is currently limited because host audit systems do not maintain enough information to match incoming network traffic to outgoing network traffic. The talk presented an alternative: assigning origin information to every process and logging it during interactive login creation.

The current implementation concentrates mainly on interactive sessions, in which an event is logged when a new connection is established using SSH or Telnet. The method is effective and could successfully determine stepping stones and reliably detect the source of a DDoS attack. The speaker then talked about the related work done in this area, which led to discussion of the latest development in the area of "host causality."

The speaker ended with the limitations and future directions of the framework. This framework could be extended and could find many applications in the area of security. The current implementation doesn't take care of startup scripts and cron jobs, but incorporating the origin information in the FS could solve this problem. In the current implementation, logging is just implemented as a proof of concept. It could be made safe in many ways, and this could be another important aspect of future work.

PROGRAMMING

Summarized by Amit Purohit

CYCLONE: A SAFE DIALECT OF C

Trevor Jim, AT&T Labs – Research; Greg Morrisett, Dan Grossman, Michael Hicks, James Cheney, and Yanling Wang, Cornell University

Cyclone is designed to provide protection against attacks such as buffer overflows, format string attacks, and memory management errors. The current structure of C allows programmers to write vulnerable programs. Cyclone extends C so that it has the safety guarantee of Java while keeping C syntax, types, and semantics untouched.

The Cyclone compiler performs static analysis of the source code and inserts runtime checks into the compiled output at places where the analysis cannot determine that the code execution will not violate safety constraints. Cyclone imposes many restrictions to preserve safety, such as NULL checks; these checks do not exist in normal C.

The speaker then talked about some sample code written in Cyclone and how it tackles safety problems. Porting an existing C application to Cyclone is pretty easy, with fewer than 10% of lines requiring change. The current implementation concentrates more on safety than on performance – hence the performance penalty is substantial. Cyclone was able to find many lingering bugs. The final step of the project will be to write a compiler to convert a normal C program to an equivalent Cyclone program.

COOPERATIVE TASK MANAGEMENT WITHOUT MANUAL STACK MANAGEMENT

Atul Adya, Jon Howell, Marvin Theimer, William J. Bolosky, and John R. Douceur, Microsoft Research

The speaker described the definitions and motivations behind the distinct concepts of task management, stack management, I/O management, conflict management, and data partitioning. Conventional concurrent programming uses preemptive task management and exploits the automatic stack management of a standard language. In the second approach, cooperative tasks are organized as event handlers that yield control by returning control to the event scheduler, manually unrolling their stacks. In this project they have adopted a hybrid approach that makes it possible for both stack management styles to coexist. Thus, programmers can code assuming one or the other of these stack management styles will operate, depending upon the application. The speaker also gave a detailed example of how to use their mechanism to switch between the two styles.

The talk then continued with the implementation. They were able to preempt many subtle concurrency problems by using cooperative task management. Paying a cost up-front to reduce a subtle race condition proved to be a good investment.

Though the choice of task management is fundamental, the choice of stack management can be left to individual taste. This project enables use of any type of stack management in conjunction with cooperative task management.

IMPROVING WAIT-FREE ALGORITHMS FOR INTERPROCESS COMMUNICATION IN EMBEDDED REAL-TIME SYSTEMS

Hai Huang, Padmanabhan Pillai, and Kang G. Shin, University of Michigan

The main characteristic of a real-time, time-sensitive system is its predictable response time. But concurrency management creates hurdles to achieving this goal because of the use of locks to maintain consistency. To solve this problem, many wait-free algorithms have been developed, but these are typically high-cost solutions. By taking advantage of the temporal characteristics of the system, however, the time and space overhead can be reduced.

The speaker presented an algorithm for temporal concurrency control and applied this technique to improve three wait-free algorithms. A single-writer multiple-reader wait-free algorithm and a double-buffer algorithm were proposed.

Using their transformation mechanism, they achieved an improvement of 17–66% in ACET and a 14–70% reduction in memory requirements for IPC algorithms. This mechanism is extensible and can be applied to other non-blocking IPC algorithms as well. Future work involves reducing the synchronizing overheads in more general IPC algorithms with multiple-writer semantics.
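
The double-buffer idea for single-writer IPC can be sketched as follows, with invented buffer contents: the writer fills the inactive buffer and publishes it by flipping an index, so readers never block on a lock. The sequence-number recheck is a simplification of the consistency guarantees the paper's transformed algorithms actually provide.

```python
# Two buffers plus an index that tells readers which one is published.
buffers = [{"seq": 0, "value": None}, {"seq": 0, "value": None}]
active = 0


def write(seq, value):
    """Single writer: fill the inactive buffer, then publish it by flipping
    the index; no lock is ever taken."""
    global active
    inactive = 1 - active
    buffers[inactive]["seq"] = seq
    buffers[inactive]["value"] = value
    active = inactive                      # publication step


def read():
    """Reader: copy the published buffer; the sequence number is rechecked
    to detect an overwrite during the copy (retry if it changed)."""
    while True:
        idx = active
        before = buffers[idx]["seq"]
        snapshot = dict(buffers[idx])
        if buffers[idx]["seq"] == before:  # unchanged: snapshot is consistent
            return snapshot


write(1, "sensor reading A")
print(read())
write(2, "sensor reading B")
print(read())
```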

MOBILITY

Summarized by Praveen Yalagandula

ROBUST POSITIONING ALGORITHMS FOR DISTRIBUTED AD-HOC WIRELESS SENSOR NETWORKS

Chris Savarese and Jan Rabaey, University of California, Berkeley; Koen Langendoen, Delft University of Technology

The Pico Radio Network comprises more than a hundred sensors, monitors, and actuators equipped with wireless connectivity, with the following properties: (1) no infrastructure; (2) computation in a distributed fashion instead of centralized computation; (3) dynamic topology; (4) limited radio range; (5) nodes that can act as repeaters; (6) devices of low cost and with low power; and (7) sparse anchor nodes – nodes with GPS capability. A positioning algorithm determines the geographical position of each node in the network.

In two-dimensional space, each node needs three reference positions to estimate its geographical position. There are two problems that make positioning difficult in the Pico Radio Network type setting: (1) a sparse anchor node problem, and (2) a range error problem.

The two-phase approach that the authors have taken in solving the problem consists of (1) a hop-terrain algorithm – in this first phase, each node roughly guesses its location by the distance calculated using multi-hops to the anchor points – and (2) a refinement algorithm – in this second phase, each node uses its neighbors' positions to refine its own position estimate. To guarantee convergence, this approach uses confidence metrics, assigning a value of 1.0 for anchor nodes and 0.1 to start for other nodes, and increasing these with each iteration.

The simulation tool OMNeT++ was used for both phases and in various scenarios. The results show that they achieved position errors of less than 33% in a scenario with 5% anchor nodes, an average connectivity of 7, and 5% range measurement error.

Guidelines for anchor node deployment are high connectivity (>10), a reasonable fraction of anchor nodes (>5%), and, for anchor placement, covered edges.

APPLICATION-SPECIFIC NETWORK MANAGEMENT FOR ENERGY-AWARE STREAMING OF POPULAR MULTIMEDIA FORMATS

Surendar Chandra, University of Georgia, and Amin Vahdat, Duke University

The main hindrance in supporting the increasing demand for mobile multimedia on PDAs is the battery capacity of the small devices. The idle-time power consumption is almost the same as the receive power consumption on typical wireless network cards (1319mW vs. 1425mW), while the sleep state consumes far less power (177mW). To let the system consume energy proportional to the stream quality, the network card should transition to the sleep state aggressively between each packet.

The speaker presented previous work where different multimedia streams were studied for PDAs: MS Media, Real Media, and QuickTime. It was found that if the inter-packet gap is predictable, then huge savings are possible – for example, about 80% savings in MS Media streams with few packet losses. The limitation of the IEEE 802.11b power-save mode comes into play when there are two or more nodes, producing a delay between beacon time and when the node receives/sends packets. This badly affects the streams, since higher energy is consumed while waiting.

The authors propose traffic shaping for energy conservation, where this is done by proxy such that packets arrive at regular intervals. This is achieved using a local proxy in the access point and a client-side proxy in the mobile host. Simulations show that traffic shaping reduces energy consumption and also reveal that the higher-bandwidth streams have a lower energy metric (mJ/kB).

More information is at http://greenhouse.cs.uga.edu.

CHARACTERIZING ALERT AND BROWSE SERVICES OF MOBILE CLIENTS

Atul Adya, Paramvir Bahl, and Lili Qiu, Microsoft Research

Even though there is a dramatic increase in Internet access from wireless devices, there are not many studies done on characterizing this traffic. In this paper, the authors characterize the traffic observed on the MSN Web site with both notification and browse traces.

Around 33 million browsing accesses and about 325 million notification entries are present in the traces. Three types of analyses are done for each of these two services: content analysis, concerning the most popular content categories and their distribution; popularity analysis, or the popularity distribution of documents; and user behavior analysis.

The analysis of the notification logs shows that document access rates follow a Zipf-like distribution, with most of the accesses concentrated on a small number of messages, and the accesses exhibit geographical locality – users from the same locality tend to receive similar notification content. The browser log analysis shows that a smaller set of URLs is accessed most times, though the access pattern does not fit any Zipf curve, and the highly accessed URLs remain stable. A correlation study between the notification and browsing services shows that wireless users have a moderate correlation of 0.12.

FREENIX TRACK SESSIONS

BUILDING APPLICATIONS

Summarized by Matt Butner

INTERACTIVE 3-D GRAPHICS APPLICATIONS FOR TCL

Oliver Kersting and Jürgen Döllner, Hasso Plattner Institute for Software Systems Engineering, University of Potsdam

The integration of 3-D image-rendering functionality into a scripting language permits interactive and animated 3-D development and application without the formalities and precision demanded by low-level C/C++ graphics and visualization libraries. The large and complex C++ API of the Virtual Rendering System (VRS) can be combined with the conveniences of the Tcl scripting language. The mapping of class interfaces is done via an automated process that generates the respective wrapper classes, which ensures complete API accessibility and functionality without any significant performance penalty. The general C++-to-Tcl mapper SWIG grants the necessary object and hierarchical abilities without the use of object-oriented Tcl extensions. The final API mapping techniques address C++ features such as classes, overloaded methods, enumerations, and inheritance relations. All are implemented in a proof-of-concept that maps the complete C++ API of the VRS to Tcl and are showcased in a complete interactive 3-D map system.

THE AGFL GRAMMAR WORK LAB

Cornelis H.A. Koster and Erik Verbruggen, University of Nijmegen (KUN)

The growth and implementation of Natural Language Processing (NLP) is the cornerstone of the continued evolution and implementation of truly intelligent search machines and services. In part due to the growing collections of computer-stored, human-readable documents in the public domain, the implementation of linguistic analysis will become necessary for desirable precision and resolution. Subsequently, such tools and linguistic resources must be released into the public domain, and they have done so with the release of the AGFL Grammar Work Lab under the GNU Public License as a tool for linguistic research and the development of NLP-based applications.

The AGFL (Affix Grammars over a Finite Lattice) Grammar Work Lab meshes context-free grammars with finite set-valued features and is applicable to a range of languages. In computer science terms, "syntax rules are procedures with parameters and a non-deterministic execution." The English Phrases for Information Retrieval (EP4IR) grammar, released with the AGFL-GWL as a robust grammar of English, is an AGFL-GWL-generated English parser that outputs "Head/Modifier" frames. The sentences "CompanyX sponsored this conference" and "This conference was sponsored by CompanyX" both generate [CompanyX[sponsored conference]]. The hope is that public availability of such tools will encourage further development of grammar and lexical software systems.

SWILL: A SIMPLE EMBEDDED WEB SERVER LIBRARY

Sotiria Lampoudi and David M. Beazley, University of Chicago

This paper won the FREENIX Best Student Paper award.

SWILL (Simple Web Interface Link Library) is a simple Web server in the form of a C library whose development was motivated by a wish to give cool applications an interface to the Web. The SWILL library provides a simple interface that can be efficiently implemented for tasks that vary from flexible Web-based monitoring to software debugging and diagnostics. Though originally designed to be integrated with high-performance scientific simulation software, the interface is generic enough to allow for unbounded uses. SWILL is a single-threaded server relying upon non-blocking I/O through the creation of a temporary server which services I/O requests. SWILL does not provide SSL or cryptographic authentication, but it does have HTTP authentication abilities.

A fantastic feature of SWILL is its support for SPMD-style parallel applications which utilize MPI, proving valuable for Beowulf clusters and large parallel machines. Another practical application was the use of SWILL in a modified Yalnix emulator by University of Chicago operating systems courses, which utilized the added Yalnix functionality for OS development and debugging. SWILL requires minimal memory overhead and relies upon the HTTP/1.0 protocol.

NETWORK PERFORMANCE

Summarized by Florian Buchholz

LINUX NFS CLIENT WRITE PERFORMANCE

Chuck Lever, Network Appliance; Peter Honeyman, CITI, University of Michigan

Lever introduced a benchmark to measure NFS client write performance. Client performance is difficult to measure due to hindrances such as poor hardware or bandwidth limitations. Furthermore, measuring application performance does not identify weaknesses specifically at the client side. Thus a benchmark was developed that tries to exercise only data transfers in one direction, between server and application. For this purpose the benchmark was based on the block-sequential-write portion of the Bonnie file system benchmark. Once a benchmark for NFS clients is established, it can be used to improve client performance.

The performance measurements were performed with an SMP Linux client and both a Linux NFS server and a Network Appliance F85 filer. During testing, a periodic jump in write latency was discovered. This was due to a rather large number of pending write operations that were scheduled to be written after certain threshold values were exceeded. By introducing a separate daemon that flushes the cached write requests, the spikes could be eliminated, but as a result the average latency grows over time. The problem could be traced to a function that scans a linked list of write requests. After a hashtable was added to improve lookup performance, the latency improved considerably.
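The data-structure change is easy to picture outside the kernel. The sketch below is a user-level analogy, not the actual Linux NFS client code: pending write requests are indexed by page offset so that a lookup no longer has to walk a linked list of every outstanding request.

    class PendingWrites:
        def __init__(self):
            self.by_offset = {}                    # page offset -> write request

        def add(self, offset, request):
            self.by_offset[offset] = request       # O(1) insert

        def find(self, offset):
            # the slow path scanned a linked list of all outstanding
            # requests (O(n)); a hash lookup keeps latency flat as the
            # number of pending writes grows
            return self.by_offset.get(offset)

        def remove(self, offset):
            return self.by_offset.pop(offset, None)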

The improved client was then used to measure throughput against the two servers. A discrepancy between the Linux server and the filer test was noticed, and the reason for it was traced back to a global kernel lock that was unnecessarily held when accessing the network stack. After correcting this, performance improved further. However, the measurements showed that a client may run slower when paired with fast servers on fast networks. This is due to heavy client interrupt loads, more network processing on the client side, and more global kernel lock contention.

The source code of the project is available at http://www.citi.umich.edu/projects/nfs-perf/patches.

A STUDY OF THE RELATIVE COSTS OF NETWORK SECURITY PROTOCOLS

Stefan Miltchev and Sotiris Ioannidis, University of Pennsylvania; Angelos Keromytis, Columbia University

With the increasing need for security and integrity of remote network services, it becomes important to quantify the communication overhead of IPSec and compare it to alternatives such as SSH, SCP, and HTTPS.

For this purpose the authors set up three testing networks: a direct link, two hosts separated by two gateways, and three hosts connecting through one gateway. Protocols were compared in each setup, and manual keying was used to eliminate connection setup costs. For IPSec, the different encryption algorithms AES, DES, 3DES, hardware DES, and hardware 3DES were used. In detail, FTP was compared to SFTP, SCP, and FTP over IPSec; HTTP to HTTPS and HTTP over IPSec; and NFS to NFS over IPSec and local disk performance.

The result of the measurements was that IPSec outperforms the other popular encryption schemes. Overall, unencrypted communication was fastest, but in some cases, like FTP, the overhead can be small. The use of crypto hardware can significantly improve performance. For future work, the inclusion of setup costs, hardware-accelerated SSL, SFTP, and SSH was mentioned.

CONGESTION CONTROL IN LINUX TCP

Pasi Sarolahti, University of Helsinki; Alexey Kuznetsov, Institute for Nuclear Research at Moscow

Having attended the talk and read the paper, I am still unclear about whether the authors are merely describing the design decisions of TCP congestion control or whether they are actually the creators of that part of the Linux code. My guess leans toward the former.

In the talk, the speaker compared the TCP congestion-control measures specified by the IETF and the relevant RFCs with the actual Linux implementation, which conforms to the basic principles but nevertheless has differences. A specific emphasis was placed on retransmission mechanisms and the congestion window. Also, several TCP enhancements (the NewReno algorithm, Selective ACKs (SACK), and Forward ACKs (FACK)) were discussed and compared.

In some instances Linux does not conform to the IETF specifications. The fast recovery does not fully follow RFC 2582, since the threshold for triggering retransmit is adjusted dynamically and the congestion window's size is not changed. Also, the round-trip-time estimator and the RTO calculation differ from RFC 2988, since Linux uses more conservative RTT estimates and a minimum RTO of 200ms. The performance measurements showed that with the additional Linux-specific features enabled, slightly higher throughput, more steady data flow, and fewer unnecessary retransmissions can be achieved.
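The RTO difference is easy to see by running the RFC 2988 estimator with different floors. The constants below follow RFC 2988; the sample RTTs are invented, and the 200ms floor stands in for Linux's more aggressive minimum (the RFC itself recommends a 1-second floor).

    def rto_series(samples_ms, min_rto_ms):
        srtt = rttvar = None
        rtos = []
        for r in samples_ms:
            if srtt is None:                      # first measurement
                srtt, rttvar = r, r / 2.0
            else:                                 # alpha = 1/8, beta = 1/4
                rttvar = 0.75 * rttvar + 0.25 * abs(srtt - r)
                srtt = 0.875 * srtt + 0.125 * r
            rtos.append(max(min_rto_ms, srtt + 4.0 * rttvar))
        return rtos

    samples = [30, 32, 28, 35, 31]                # hypothetical RTTs in ms
    print(rto_series(samples, min_rto_ms=200))    # Linux-style floor
    print(rto_series(samples, min_rto_ms=1000))   # strict RFC 2988 floor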

XTREME XCITEMENT

Summarized by Steve Bauer

THE FUTURE IS COMING: WHERE THE X WINDOW SHOULD GO

Jim Gettys, Compaq Computer Corp.

Jim Gettys, one of the principal authors of the X Window System, outlined the near-term objectives for the system, primarily focusing on the changes and infrastructure required to enable replication and migration of X applications. Providing better support for this functionality would enable users to retrieve or duplicate X applications between their servers at home and work.


One interesting example of an application that currently is capable of migration and replication is Emacs. To create a new frame on DISPLAY, try "M-x make-frame-on-display <Return> DISPLAY <Return>".

However, technical challenges make replication and migration difficult in general. These include the "major headaches" of server-side fonts, the nonuniformity of X servers and screen sizes, and the need to appropriately retrofit toolkits. Authentication and authorization issues are obviously also important. The rest of the talk delved into some of the details of these interesting technical challenges.

HACKING IN THE KERNEL

Summarized by Hai Huang

AN IMPLEMENTATION OF SCHEDULER ACTIVATIONS ON THE NETBSD OPERATING SYSTEM

Nathan J. Williams, Wasabi Systems

Scheduler activation is an old idea. Basically, there are benefits and drawbacks to using solely kernel-level or user-level threading. Scheduler activation is able to combine the two layers of control to provide more concurrency in the system.

In his talk, Nathan gave a fairly detailed description of the implementation of scheduler activations in the NetBSD kernel. One important change in the implementation is to differentiate the thread context from the process context. This is done by defining a separate data structure for these threads and relocating some of the information that was embedded in the process context to these thread contexts. The stack was especially a concern, due to the upcall: special handling must be done to make sure that the upcall doesn't mess up the stack, so that the preempted user-level thread can continue afterwards. Lastly, Nathan explained that signals are handled by upcalls.


AUTHORIZATION AND CHARGING IN PUBLIC WLANS USING FREEBSD AND 802.1X

Pekka Nikander, Ericsson Research NomadicLab

The 802.1x standards are well known in the wireless community as link-layer authentication protocols. In this talk, Pekka explained some novel ways of using the 802.1x protocols that might be of interest to people on the move. It is possible to set up a public WLAN that would support various charging schemes via virtual tokens, which people can purchase or earn and later use.

This is implemented on FreeBSD using the netgraph utility. It is basically a filter in the link layer that differentiates traffic based on the MAC address of the client node, which is either authenticated, denied, or let through. The overhead for this service is fairly minimal.

ACPI IMPLEMENTATION ON FREEBSD

Takanori Watanabe, Kobe University

ACPI (Advanced Configuration and Power Interface) was proposed as a joint effort by Intel, Toshiba, and Microsoft to provide a standard, finer-grained method of managing the power states of individual devices within a system. Such low-level power management is especially important for mobile and embedded systems that are powered by fixed-capacity batteries.

Takanori explained that the ACPI specification is composed of three parts: tables, BIOS, and registers. He was able to implement some functionality of ACPI in a FreeBSD kernel. The ACPI Component Architecture was implemented by Intel, and it provides a high-level ACPI API to the operating system. Takanori's ACPI implementation is built upon this underlying layer of APIs.

ACCESS CONTROL

Summarized by Florian Buchholz

DESIGN AND PERFORMANCE OF THE OPENBSD STATEFUL PACKET FILTER (PF)

Daniel Hartmeier, Systor AG

Daniel Hartmeier described the new stateful packet filter (pf) that replaced IPFilter in the OpenBSD 3.0 release. IPFilter could no longer be included due to licensing issues, and thus there was a need to write a new filter making use of optimized data structures.

The filter rules are implemented as a linked list which is traversed from start to end. Two actions may be taken according to the rules: "pass" or "block." A "pass" action forwards the packet, and a "block" action drops it. Where more than one rule matches a packet, the last rule wins. Rules that are marked as "final" immediately terminate any further rule evaluation. An optimization called "skip-steps" was also implemented, in which blocks of similar rules are skipped if they cannot match the current packet; these skip-steps are calculated when the rule set is loaded. Furthermore, a state table keeps track of TCP connections, and only packets that match the sequence numbers are allowed. UDP and ICMP queries and replies are considered in the state table, where initial packets create an entry for a pseudo-connection with a low initial timeout value. The state table is implemented using a balanced binary search tree. NAT mappings are also stored in the state table, whereas application proxies reside in user space. The packet filter is also able to perform fragment reassembly and to modulate TCP sequence numbers to protect hosts behind the firewall.
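The last-match-wins semantics, with "final" rules cutting evaluation short, can be sketched in a few lines. The rule representation below is invented for illustration and is not pf's syntax or implementation.

    def evaluate(rules, packet):
        action = "pass"                          # default when no rule matches
        for rule in rules:
            if rule["match"](packet):
                action = rule["action"]          # last matching rule wins
                if rule.get("final"):
                    break                        # "final" rules end evaluation
        return action

    ruleset = [
        {"match": lambda p: True,                "action": "block"},
        {"match": lambda p: p["dport"] == 22,    "action": "pass"},
        {"match": lambda p: p["src"] == "evil",  "action": "block", "final": True},
    ]
    print(evaluate(ruleset, {"src": "good", "dport": 22}))   # -> pass
    print(evaluate(ruleset, {"src": "evil", "dport": 22}))   # -> block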

Pf was compared against IPFilter as well as Linux's iptables by measuring throughput and latency with increasing traffic rates and different packet sizes. In a test with a fixed rule-set size of 100, iptables outperformed the other two filters, whose results were close to each other. In a second test, where the rule-set size was continually increased, iptables consistently had about twice the throughput of the other two (which evaluate the rule set on both the incoming and outgoing interfaces). A third test compared only pf and IPFilter, using a single rule that created state in the state table with a fixed-size state entry. Pf reached an overloaded state much later than IPFilter. The experiment was repeated with a variable-size state entry, and pf performed much better than IPFilter for a small number of states.

In general, rule-set evaluation is expensive, and benchmarks only reflect extreme cases, whereas in real life other behavior should be observed. Furthermore, the benchmarks show that stateful filtering can actually improve performance, due to cheap state-table lookups as compared to rule evaluation.

ENHANCING NFS CROSS-ADMINISTRATIVE DOMAIN ACCESS

Joseph Spadavecchia and Erez Zadok, Stony Brook University

The speaker presented modifications to an NFS server that allow improved NFS access between administrative domains. The problem lies in the fact that NFS assumes a shared UID/GID space, which makes it unsafe to export files outside the administrative domain of the server. The design goals were to leave the protocol and existing clients unchanged, a minimum amount of server changes, flexibility, and increased performance.

To solve the problem, two techniques are utilized: "range-mapping" and "file-cloaking." Range-mapping maps IDs between client and server. The mapping is performed on a per-export basis and has to be manually set up in an export file. The mappings can be 1-1, N-N, or N-1. In file-cloaking, the server restricts file access based on UID/GID and special cloaking-mask bits. Here users can only access their own file permissions. The policy on whether or not the file is visible to others is set by the cloaking mask, which is logically ANDed with the file's protection bits. File-cloaking only works, however, if the client doesn't hold cached copies of directory contents and file attributes; because of this, the clients are forced to re-read directories by incrementing the mtime value of the directory each time it is listed.
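As a rough illustration of the two techniques, the sketch below implements a toy range map and a cloaking check that ANDs a mask with the file's permission bits. The mapping table, mask value, and visibility rule are simplifications assumed for the example, not the paper's exact semantics.

    RANGE_MAP = {"client_uids": (1000, 1999), "server_base": 50000}   # a 1-1 map

    def map_uid(client_uid):
        lo, hi = RANGE_MAP["client_uids"]
        if lo <= client_uid <= hi:
            return RANGE_MAP["server_base"] + (client_uid - lo)
        return None                                 # unmapped IDs get no access

    def visible(file_mode, file_uid, requesting_uid, cloak_mask=0o077):
        if file_uid == requesting_uid:
            return True                             # owners always see their files
        # others see the file only if the mask leaves some permission bits set
        return (file_mode & cloak_mask) != 0

    print(map_uid(1005))                  # -> 50005
    print(visible(0o600, 50005, 60001))   # -> False: cloaked from other users
    print(visible(0o644, 50005, 60001))   # -> True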

To test the performance of the modified server, five different NFS configurations were evaluated: an unmodified NFS server was compared against one with the modified code included but not used, one with only range-mapping enabled, one with only file-cloaking enabled, and one with all modifications enabled. For each setup, different file system benchmarks were run. The results show only a small overhead when the modifications are used, generally an increase of below 5%. Another experiment tested the performance of the system with an increasing number of mapped or cloaked entries; the results show that an increase from 10 to 1,000 entries resulted in a maximum of about a 14% cost in performance.

One member of the audience pointed out that if clients choose to ignore the changed mtimes from the server and thus still hold caches of the directory entries, the file-cloaking mechanism could be defeated. After a rather lengthy debate, the speaker had to concede that the model doesn't add any extra security. Another question was asked about the scalability of setting up range mappings. The speaker referred to application-level tools that could be developed for that purpose.

The software is available at ftp://ftp.fsl.cs.sunysb.edu/pub/enf.

ENGINEERING OPEN SOURCE SOFTWARE

Summarized by Teri Lampoudi

NINGAUI: A LINUX CLUSTER FOR BUSINESS

Andrew Hume, AT&T Labs – Research; Scott Daniels, Electronic Data Systems Corp.

Ningaui is a general-purpose, highly available, resilient architecture built from commodity software and hardware. Emphatically, however, Ningaui is not a Beowulf. Hume calls the cluster design the "Swiss canton model," in which there are a number of loosely affiliated independent nodes with data replicated among them. Jobs are assigned by bidding and leases, and cluster services done as session-based servers are sited via generic job assignment. The emphasis is on keeping the architecture end-to-end, checking all work via checksums, and logging everything. The resilience and high availability required by their goal of 8-to-5 maintenance (vs. the typical 24/7 model where people get paged whenever the slightest thing goes wrong, regardless of the time of day) is achieved by job restartability. Finally, all computation is performed on local data, without the use of NFS or network-attached storage.

Hume's message is a hopeful one: despite the many problems encountered (things like kernel and service limits, auto-installation problems, TCP storms, and the like) the existence of source code and the paranoid practice of logging and checksumming everything has helped. The final product performs reasonably well, and it appears that the resilient techniques employed do make a difference. One drawback, however, is that the software mentioned in the paper is not yet available for download.

CPCMS: A CONFIGURATION MANAGEMENT SYSTEM BASED ON CRYPTOGRAPHIC NAMES

Jonathan S. Shapiro and John Vanderburgh, Johns Hopkins University

This paper won the FREENIX Best Paper award.

The basic notion behind the project is the fact that everyone has a pet complaint about CVS, and yet it is currently the configuration manager in most widespread use. Shapiro has unleashed an alternative. Interestingly, he did not begin but ended with the usual host of reasons why CVS is bad. The talk instead began abruptly by characterizing the job of a software configuration manager, continued by stating the namespaces which it must handle, and wrapped up with the challenges faced.

X MEETS Z: VERIFYING CORRECTNESS IN THE PRESENCE OF POSIX THREADS

Bart Massey, Portland State University; Robert T. Bauer, Rational Software Corp.

Massey delivered a humorous talk on the insight gained from applying the Z formal specification notation to system software design, rather than the more informal analysis and design process normally used.

The story is told with respect to writing XCB, which replaces the Xlib protocol layer and is supposed to be thread-friendly. But where threads are concerned, deadlock avoidance becomes a hard problem that cannot be solved in an ad hoc manner, and full model checking is also too hard. In this case Massey resorted to Z specification to model the XCB lower layer, abstract away locking and data transfers, and locate fundamental issues. Essentially, the difficulties of searching the literature and locating information relevant to the problem at hand were overcome. As Massey put it, "the formal method saved the day."

FILE SYSTEMS

Summarized by Bosko Milekic

PLANNED EXTENSIONS TO THE LINUX EXT2/EXT3 FILESYSTEM

Theodore Y. Ts'o, IBM; Stephen Tweedie, Red Hat

The speaker presented improvements to the Linux Ext2 file system, with the goal of allowing for various expansions while striving to maintain compatibility with older code. Improvements have been facilitated by a few extra superblock fields that were added to Ext2 not long before the Linux 2.0 kernel was released. The fields allow for file system features to be added without compromising the existing setup; this is done by providing bits indicating the impact of the added features with respect to compatibility. Namely, a file system with the "incompat" bit marked is not allowed to be mounted. Similarly, a "read-only" marking would only allow the file system to be mounted read-only.

Directory indexing replaces linear directory searches with a faster search using a fixed-depth tree and hashed keys. File system size can be dynamically increased, and the expanded inode, doubled from 128 to 256 bytes, allows for more extensions.

Other potential improvements were discussed as well, in particular pre-allocation for contiguous files, which allows for better performance in certain setups by pre-allocating contiguous blocks. Security-related modifications, extended attributes, and ACLs were mentioned. An implementation of these features already exists but has not yet been merged into the mainline Ext2/3 code.

RECENT FILESYSTEM OPTIMISATIONS ON FREEBSD

Ian Dowse, Corvil Networks; David Malone, CNRI, Dublin Institute of Technology

David Malone presented four important file system optimizations for the FreeBSD OS: soft updates, dirpref, vmiodir, and dirhash. It turns out that certain combinations of the optimizations (beautifully illustrated in the paper) may yield performance improvements of anywhere between 2 and 10 orders of magnitude for real-world applications.

All four of the optimizations deal with file system metadata. Soft updates allow for asynchronous metadata updates. Dirpref changes the way directories are organized, attempting to place child directories closer to their parents, thereby increasing locality of reference and reducing disk-seek times. Vmiodir trades some extra memory for better directory caching. Finally, dirhash, which was implemented by Ian Dowse, changes the way in which entries are searched in a directory. Specifically, FFS uses a linear search to find an entry by name; dirhash builds a hashtable of directory entries on the fly. For a directory of size n with a working set of m files, a search that in certain cases could have been O(n·m) has been reduced, thanks to dirhash, to effectively O(n + m).
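The dirhash idea reduces to building a hash table over a directory the first time it is searched. The sketch below is a simplified user-level illustration, not the FFS implementation.

    class DirHash:
        def __init__(self, entries):
            # entries: list of (name, inode) pairs as stored on disk
            self.entries = entries
            self.table = None

        def lookup(self, name):
            if self.table is None:                       # built lazily, once: O(n)
                self.table = {n: ino for n, ino in self.entries}
            return self.table.get(name)                  # each later lookup: O(1)

    d = DirHash([("a.txt", 11), ("b.txt", 12), ("c.txt", 13)])
    print(d.lookup("b.txt"))   # -> 12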

FILESYSTEM PERFORMANCE AND SCALABILITY IN LINUX 2.4.17

Ray Bryant, SGI; Ruth Forester, IBM LTC; John Hawkes, SGI

This talk focused on performance evaluation of a number of file systems available and commonly deployed on Linux machines. Comparisons under various configurations of Ext2, Ext3, ReiserFS, XFS, and JFS were presented.

The benchmarks chosen for the data gathering were pgmeter, filemark, and the AIM Benchmark Suite VII. Pgmeter measures the rate of data transfer of reads/writes of a file under a synthetic workload. Filemark is similar to postmark in that it is an operation-intensive benchmark, although filemark is threaded and offers various other features that postmark lacks. AIM VII measures performance for various file-system-related functionalities; it offers various metrics under an imposed workload, thus stressing the performance of the file system not only under I/O load but also under significant CPU load.

Tests were run on three different setups: a small, a medium, and a large configuration. ReiserFS and Ext2 appear at the top of the pile for the smaller and medium setups. Notably, XFS and JFS perform worse for smaller system configurations than the others, although XFS clearly appears to generate better numbers than JFS. It should be noted that XFS seems to scale well under a higher load; this was most evident in the large-system results, where XFS appears to offer the best overall results.

THINGS TO THINK ABOUT

Summarized by Bosko Milekic

SPEEDING UP KERNEL SCHEDULER BY REDUCING CACHE MISSES

Shuji Yamamura, Akira Hirai, Mitsuru Sato, Masao Yamamoto, Akira Naruse, and Kouichi Kumon, Fujitsu Labs

This was an interesting talk pertaining to the effects of cache coloring for task structures in the Linux kernel scheduler (Linux kernel 2.4.x). The speaker first presented some interesting benchmark numbers for the Linux scheduler, showing that as the number of processes on the task queue was increased, performance decreased. The authors used some really nifty hardware to measure the number of bus transactions throughout their tests and were thus able to reasonably quantify the impact that cache misses had in the Linux scheduler.

Their experiments led them to implement a cache-coloring scheme for task structures, which were previously aligned on 8KB boundaries and therefore were eventually mapped to the same cache lines. This unfortunate placement of task structures in memory induced a significant number of cache misses as the number of tasks grew in the scheduler.

The implemented solution consisted of aligning task structures to more evenly distribute cached entries across the L2 cache. The result was inevitably fewer cache misses in the scheduler. Some negative effects were observed in certain situations; these were primarily due to more cache slots being used by task-structure data in the scheduler, thus forcing data previously cached there to be pushed out.
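The collision problem and its fix come down to simple address arithmetic. In the sketch below, the 256KB four-way cache with 64-byte lines is an assumed configuration (not necessarily the hardware used in the paper); it shows how 8KB-aligned structures crowd into a handful of cache sets and how a small per-task color offset spreads them out.

    LINE = 64
    WAYS = 4
    CACHE = 256 * 1024
    SETS = CACHE // (LINE * WAYS)              # 1024 sets in this assumed cache

    def cache_set(addr):
        return (addr // LINE) % SETS

    def distinct_sets(addrs):
        return len({cache_set(a) for a in addrs})

    tasks = 64
    aligned = [i * 8192 for i in range(tasks)]                  # 8KB-aligned structs
    colored = [i * 8192 + (i % 32) * LINE for i in range(tasks)]  # add a color offset

    print(distinct_sets(aligned), "cache sets used by", tasks, "aligned tasks")   # 8
    print(distinct_sets(colored), "cache sets used by", tasks, "colored tasks")   # 32

With only a few sets shared by every task structure, the four ways per set are quickly exhausted as the task count grows; coloring spreads the same structures over many more sets.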


WORK-IN-PROGRESS REPORTS

Summarized by Brennan Reynolds

RESOURCE VIRTUALIZATION TECHNIQUES FOR WIDE-AREA OVERLAY NETWORKS

Kartik Gopalan, Stony Brook University

This work addressed the issue of provisioning a maximum number of virtual overlay networks (VONs) with diverse quality-of-service (QoS) requirements on a single physical network. Each of the VONs is logically isolated from the others to ensure its QoS requirements. Gopalan mentioned several critical research issues with this problem that are currently being investigated. Dealing with how to provision the network at various levels (link, route, or path) and then enforce the provisioning at run-time is one of the toughest challenges. Currently, Gopalan has developed several algorithms to handle admission control, end-to-end QoS, route selection, scheduling, and fault tolerance in the network.

For more information, visit http://www.ecsl.cs.sunysb.edu.

VISUALIZING SOFTWARE INSTABILITY

Jennifer Bevan, University of California, Santa Cruz

Detection of instability in software has typically been an afterthought. The point in the development cycle when the software is reviewed for instability is usually after it is difficult and costly to go back and perform major modifications to the code base. Bevan has developed a technique to allow the visualization of unstable regions of code that can be used much earlier in the development cycle. Her technique creates a time series of dependency graphs that include clusters and lines called fault lines. From the graphs a developer is able to easily determine where the unstable sections of code are and proactively restructure them. She is currently working on a prototype implementation.

For more information, visit http://www.cse.ucsc.edu/~jbevan/evo_viz.

RELIABLE AND SCALABLE PEER-TO-PEER WEB DOCUMENT SHARING

Li Xiao, College of William and Mary

The idea presented by Xiao would allow end users to share the content of their Web browser caches with neighboring Internet users. The rationale for this is that today's end users are increasingly connected to the Internet over high-speed links, and browser caches are becoming large storage systems. Therefore, if an individual accesses a page which does not exist in their local cache, Xiao suggests that they first query other end users for the content before trying to access the machine hosting the original. This strategy does have some serious problems associated with it that still need to be addressed, including ensuring the integrity of the content and protecting the identity and privacy of the end users.

SEREL: FAST BOOTING FOR UNIX

Leni Mayo, Fast Boot Software

Serel is a tool that generates a visual representation of a UNIX machine's boot-up sequence. It can be used to identify the critical path and can show whether a particular service or process blocks for an extended period of time. This information could be used to determine where the largest performance gains could be realized by tuning the order of execution at boot-up. Serel creates a dependency graph, expressed in XML, during boot-up. This graph is then used to create the visual representation. Currently the tool only works on POSIX-compliant systems, but Mayo stated that he would be porting it to other platforms. Other extensions that were mentioned included having the metadata include the use of shared libraries and monitoring the suspend/resume sequence of portable machines.

For more information, visit http://www.fastboot.org.


BERKELEY DB XML

John Merrells, Sleepycat Software

Merrells gave a quick introduction to and overview of the new XML library for Berkeley DB. The library specializes in storage and retrieval of XML content through a tool called XPath. The library allows for multiple containers per document and stores everything natively as XML. The user is also given a wide range of elements to create indices with, including edges, elements, text strings, and presence. The XPath tool consists of a query parser, generator, optimizer, and execution engine. To conclude his presentation, Merrells gave a live demonstration of the software.

For more information, visit http://www.sleepycat.com/xml.

CLUSTER-ON-DEMAND (COD)

Justin Moore, Duke University

Modern clusters are growing at a rapid rate. Many have pushed beyond the 5,000-machine mark, and deploying them results in large expenses as well as management and provisioning issues. Furthermore, if the cluster is "rented" out to various users, it is very time-consuming to configure it to a user's specs, regardless of how long they need to use it. The COD work presented creates dynamic virtual clusters within a given physical cluster. The goal was to have a provisioning tool that would automatically select a chunk of available nodes and install the operating system and middleware specified by the customer in a short period of time. This would allow greater use of resources, since multiple virtual clusters can exist at once. By using a virtual cluster, the size can be changed dynamically. Moore stated that they have created a working prototype and are currently testing and benchmarking its performance.

For more information, visit http://www.cs.duke.edu/~justin/cod.


CATACOMB

Elias Sinderson, University of California, Santa Cruz

Catacomb is a project to develop a database-backed DASL module for the Apache Web server. It was designed as a replacement for the standard WebDAV module. The initial release of the module only contains support for a MySQL database but could be extended to others. Sinderson briefly touched on the performance of the module: it was comparable to the mod_dav Apache module for all query types but search. The presentation was concluded with remarks about adding support for the LOCK method and including ACL specifications in the future.

For more information, visit http://ocean.cse.ucsc.edu/catacomb.

SELF-ORGANIZING STORAGE

Dan Ellard, Harvard University

Ellard's presentation introduced a storage system that tunes itself based on the workload of the system, without requiring the intervention of the user. The intelligence is implemented as a virtual self-organizing disk that resides below any file system. The virtual disk observes the access patterns exhibited by the system and then attempts to predict what information will be accessed next. An experiment was done using an NFS trace from one of the email servers at a large ISP. Ellard's self-organizing storage system worked well, which he attributes to the fact that most of the files being requested were large email boxes. Areas of future work include exploration of the length and detail of the data collection stage as well as the CPU impact of running the virtual disk layer.

VISUAL DEBUGGING

John Costigan, Virginia Tech

Costigan feels that the state of current debugging facilities in the UNIX world is not as good as it should be. He proposes the addition of several elements to programs like ddd and gdb. The first addition is including separate heap and stack components. Another is to display only certain fields of a data structure, with the ability to zoom in and out if needed. Finally, the debugger should provide the ability to visually present complex data structures, including linked lists, trees, etc. He said that a beta version of a debugger with these abilities is currently available.

For more information, visit http://infovis.cs.vt.edu/datastruct.

IMPROVING APPLICATION PERFORMANCE THROUGH SYSTEM-CALL COMPOSITION

Amit Purohit, Stony Brook University

Web servers perform a huge number of context switches and much internal data copying during normal operation. These two elements can drastically limit the performance of an application, regardless of the hardware platform it is run on. The Compound System Call (CoSy) framework is an attempt to reduce the performance penalty of context switches and data copies. It includes a user-level library and several kernel facilities that can be used via system calls. The library provides the programmer with a complete set of memory-management functions. Performance tests using the CoSy framework showed a large saving in context switches and data copies. The only area where the savings between conventional libraries and CoSy were negligible was for very large file copies.

For more information, visit http://www.cs.sunysb.edu/~purohit.

ELASTIC QUOTAS

John Oscar, Columbia University

Oscar began by stating that most resources are flexible, but that to date disks have not been. While approximately 80 percent of files are short-lived, disk quotas are hard limits imposed by administrators. The idea behind elastic quotas is to have a non-permanent storage area that each user can temporarily use. Oscar suggested the creation of an ehome directory structure to be used in deploying elastic quotas. Currently he has implemented elastic quotas as a stackable file system.

There were also several trends apparent in the data. Many users would ssh to their remote host (good) but then launch an email reader on the local machine, which would connect via POP to the same host and send the password in clear text (bad). This situation can easily be remedied by tunneling protocols like POP through SSH, but it appeared that many people were not aware this could be done. While most of his comments on the use of the wireless network were negative, the list of passwords he had collected showed that people were indeed using strong passwords. His recommendations were to educate and encourage people to use protocols like IPSec, SSH, and SSL when conducting work over a wireless network, because you never know who else is listening.

SYSTRACE

Niels Provos, University of Michigan

How can one be sure the applications one is using actually do exactly what their developers said they do? Short answer: you can't. People today are using a large number of complex applications, which means it is impossible to check each application thoroughly for security vulnerabilities. There are tools out there that can help, though. Provos has developed a tool called systrace that allows a user to generate a policy of acceptable system calls a particular application can make. If the application attempts to make a call that is not defined in the policy, the user is notified and allowed to choose an action. Systrace includes an auto-policy generation mechanism, which uses a training phase to record the actions of all programs the user executes. If the user chooses not to be bothered by applications breaking policy, systrace allows default enforcement actions to be set. Currently systrace is implemented for FreeBSD and NetBSD, with a Linux version coming out shortly.
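Conceptually, the enforcement step is a lookup of the attempted system call in a per-application policy, with a fallback to asking the user or applying a default action. The sketch below illustrates only that idea; it is not systrace's real policy grammar or kernel mechanism, and the example policy is invented.

    POLICY = {
        "named": {"allow": {"open", "read", "socket", "bind", "sendto", "recvfrom"},
                  "default": "deny"},
    }

    def check(app, syscall, ask_user=None):
        policy = POLICY.get(app, {"allow": set(), "default": "ask"})
        if syscall in policy["allow"]:
            return "allow"
        if ask_user is not None:
            return ask_user(app, syscall)          # interactive decision
        return policy["default"]                   # unattended default enforcement

    print(check("named", "read"))                  # -> allow
    print(check("named", "execve"))                # -> deny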


For more information, visit http://www.citi.umich.edu/u/provos/systrace.

VERIFIABLE SECRET REDISTRIBUTION FOR SURVIVABLE STORAGE SYSTEMS

Ted Wong, Carnegie Mellon University

Wong presented a protocol that can be used to re-create a file distributed over a set of servers, even if one of the servers is damaged. In his scheme, the user must choose how many shares the file is split into and the number of servers it will be stored across. The goal of this work is to provide persistent, secure storage of information even if it comes under attack. Wong stated that his design of the protocol is complete and that he is currently building a prototype implementation.
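The summary does not describe the protocol's internals, but the share-and-threshold idea it builds on can be illustrated with classic Shamir secret sharing: split a secret into n shares so that any m of them reconstruct it. The prime field and (m, n) parameters below are arbitrary demo values, and real verifiable redistribution adds machinery (verifiability, redistribution between server sets) that this sketch omits.

    import random

    P = 2**61 - 1                      # a prime large enough for a demo field

    def split(secret, n, m):
        """Split secret into n shares; any m of them reconstruct it."""
        coeffs = [secret] + [random.randrange(P) for _ in range(m - 1)]
        def poly(x):
            return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
        return [(x, poly(x)) for x in range(1, n + 1)]

    def reconstruct(shares):
        # Lagrange interpolation at x = 0 over the prime field
        secret = 0
        for i, (xi, yi) in enumerate(shares):
            num = den = 1
            for j, (xj, _) in enumerate(shares):
                if i != j:
                    num = num * (-xj) % P
                    den = den * (xi - xj) % P
            secret = (secret + yi * num * pow(den, P - 2, P)) % P
        return secret

    shares = split(123456789, n=5, m=3)
    print(reconstruct(shares[:3]))     # -> 123456789 from any 3 of the 5 shares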

For more information, visit http://www.cs.cmu.edu/~tmwong/research.

THE GURU IS IN SESSIONS

LINUX ON LAPTOP/PDA

Bdale Garbee, HP Linux Systems Operation

Summarized by Brennan Reynolds

This was a roundtable-like discussion with Garbee acting as moderator. He is currently in charge of the Debian distribution of Linux and is involved in porting Linux to platforms other than i386. He has successfully helped port the kernel to the Alpha, SPARC, and ARM and is currently working on a version to run on VAX machines.

A discussion was held on which file system is best for battery life. The Reiser and Ext3 file systems were discussed; people commented that when using Ext3 on their laptops the disk would never spin down and thus was always consuming a large amount of power. One suggestion was to use a RAM-based file system for any volume that requires a large number of writes and to have the file system written to disk only at infrequent intervals or when the machine is shut down or suspended.

A question was directed to Garbee about his knowledge of how the HP-Compaq merger would affect Linux. Garbee thought that the merger was good for Linux, both within the company and for the open source community as a whole. He stated that the new company would be the number one shipper of systems with Linux as the base operating system. He also fielded a question about the support of older HP servers and their ability to run Linux, saying that indeed Linux has been ported to them and he personally had several machines in his basement running it.

A question which sparked a large number of responses concerned problems people had with running Linux on mobile platforms. Most people actually did not have many problems at all. There were a few cases of a particular machine model not working, but there were only two widespread issues: docking stations and the new ACPI power management scheme. Docking stations still appear to be somewhat of a headache, but most people had developed ad hoc solutions for getting them to work. The ACPI power management scheme developed by Intel does not appear to have a quick solution. Ted Ts'o, one of the head kernel developers, stated that there are fundamental architectural problems with the 2.4 series kernel that do not easily allow ACPI to be added. However, he also stated that ACPI is supported in the newest development kernel, 2.5, and will be included in 2.6.

The final topic was the possibility of purchasing a laptop without paying any charge/tax for Microsoft Windows. The consensus was that currently this is not possible. Even if the machine comes with Linux or without any operating system, the vendors have contracts in place which require them to pay Microsoft for each machine they ship if that machine is capable of running a Microsoft operating system. The only way this will change is if the demand for other operating systems increases to the point that vendors renegotiate their contracts with Microsoft, and no one saw this happening in the near future.

LARGE CLUSTERS

Andrew Hume, AT&T Labs – Research

Summarized by Teri Lampoudi

This was a session on clusters, big data, and resilient computing. Given that the audience was primarily interested in clusters, and not necessarily big data or beating nodes to a pulp, the conversation revolved mainly around what it takes to make a heterogeneous cluster resilient. Hume also presented a FREENIX paper on his current cluster project, which explained in more detail much of what was abstractly claimed in the guru session.

Hume's use of the word "cluster" referred not to what I had assumed would be a Beowulf-type system, but to a loosely coupled farm of machines in which concurrency is much less of an issue than it would be in a Beowulf. In fact, the architecture Hume described was designed to deal with large amounts of independent transaction processing, essentially the process of billing calls, which requires no interprocess message passing or anything of the sort a parallel scientific application might.

Where does the large data come in? Since the transactions in question consist of large numbers of flat files, a mechanism for getting the files onto nodes and off-loading the results is necessary. In this particular cluster, named Ningaui, this task is handled by the "replication manager," which generated a large amount of interest in the audience. All management functions in the cluster, as well as job allocation, constitute services that are to be bid for and leased out to nodes. Furthermore, failures are handled by simply repeating the failed task until it is completed properly, if that is possible, and if it is not, then looking through detailed logs to discover what went wrong. Logging and testing for the successful completion of each task are what make this model resilient. Every management operation is logged at some appropriate level of detail, generating 120MB/day, and every single file transfer is MD5 checksummed. This was characterized by Hume as a "patient" style of management, where no one controls the cluster: scheduling behavior is emergent rather than stringently planned, and nodes are free to drop in and out of the cluster. A downside to this model is the increase in latency, offset by the fact that the cluster continues to function unattended outside of 8-to-5 support hours.

In response to questions about the projected scalability of such a scheme, Hume said the system would presumably scale from the present 60 nodes to about 100 without modifications, but scaling higher would be a matter for further design. The guru session ended on a somewhat unrelated note regarding the problems of getting data off mainframes and onto Linux machines: issues of fixed- vs. variable-length encoding of data blocks resulting in useful information being stripped by FTP, and problems in converting COBOL copybooks to something useful on a C-based architecture. To reiterate Hume's claim, there are homemade solutions for some of these problems, and the reader should feel free to contact Mr. Hume for them.

WRITING PORTABLE APPLICATIONS

Nick Stoughton, MSB Consultants

Summarized by Josh Lothian

Nick Stoughton addressed a concern that developers are facing more and more: writing portable applications. He proposed that there is no such thing as a "portable application," only those that have been ported. Developers are still in search of the Holy Grail of applications programming: a write-once, compile-anywhere program. Languages such as Java are leading up to this, but virtual machines still have to be ported, paths will still be different, etc. Web-based applications are another area that could be promising for portable applications.

Nick went on to talk about the future of portability. The recently released POSIX 2001 standards are expected to help the situation. The 2001 revision expands the original POSIX spec into seven volumes. Even though POSIX 2001 is more specific than the previous release, Nick pointed out that even this standard allows for small differences where vendors may define their own behaviors. This can only hurt developers in the long run. Along these lines, it was pointed out that there may be room for automated tools that have the capability to check application code and determine the level of portability and standards compliance of the source.

NETWORK MANAGEMENT SYSTEM PERFORMANCE TUNING

Jeff Allen, Tellme Networks, Inc.

Summarized by Matt Selsky

Jeff Allen, author of Cricket, discussed network tuning. The basic idea of tuning is: measure, twiddle, and measure again. Cricket can be used for large installations, but it's overkill for smaller installations, for which MRTG is better suited. You don't need Cricket to do measurement; doing something like a shell script that dumps data to Excel is fine, but you need to get some sort of measurement. Cricket was not designed for billing; it was meant to help answer questions.

Useful tools and techniques covered included looking for the difference between machines in order to identify the causes for the differences; measuring, but also thinking about what you're measuring and why; and remembering to use strace and tcpdump to identify problems. Problems repeat themselves; performance tuning requires an intricate understanding of all the layers involved to solve the problem and simplify the automation. (Having a computer science background helps but is not essential.) If you can reproduce the problem, you then should investigate each piece to find the bottleneck. If you can't reproduce the problem, then measure. You need to understand all the system interactions. Measurement can help determine whether things are actually slow or if the user is imagining it.

Troubleshooting begins with a hunch, but scientific processes are essential. You should be able to determine the event stream, or when each event occurs; having observation logs can help. Some hints provided included checking cron for periodic anomalies, slowing down /bin/rm to avoid I/O overload, and looking for unusual kernel CPU usage.

Also, try to use lightweight monitoring to reduce overhead. You don't need to monitor every resource on every system, but you should monitor those resources on those systems that are essential. Don't check things too often, since you can introduce overhead that way. Lots of information can be gathered from the /proc file system interface to the kernel.
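As a concrete example of the kind of lightweight, /proc-based measurement suggested above, the script below samples /proc/stat twice and reports overall CPU utilization with no agents or external tools (Linux-only; the one-second sampling interval is an arbitrary choice).

    import time

    def cpu_times():
        # first line of /proc/stat: "cpu  user nice system idle iowait ..."
        with open("/proc/stat") as f:
            fields = [float(x) for x in f.readline().split()[1:]]
        idle, total = fields[3], sum(fields)
        return idle, total

    def cpu_utilization(interval=1.0):
        idle1, total1 = cpu_times()
        time.sleep(interval)
        idle2, total2 = cpu_times()
        busy = 1.0 - (idle2 - idle1) / (total2 - total1)
        return 100.0 * busy

    if __name__ == "__main__":
        print(f"CPU utilization: {cpu_utilization():.1f}%")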

BSD BOF

Host: Kirk McKusick, Author and Consultant

Summarized by Bosko Milekic

The BSD Birds of a Feather session at USENIX started off with five quick and informative talks and ended with an exciting discussion of licensing issues.

Christos Zoulas, for the NetBSD Project, went over the primary goals of the NetBSD Project (namely portability, a clean architecture, and security) and then proceeded to briefly discuss some of the challenges that the project has encountered. The successes and areas for improvement of 1.6 were then examined. NetBSD has recently seen various improvements in its cross-build system, packaging system, and third-party tool support. As for the kernel, a zero-copy TCP/UDP implementation has been integrated, and new pmap code for arm32 has been introduced, as have some other rather major changes to various subsystems. More information is available in Christos's slides, which are now available at http://www.netbsd.org/gallery/events/usenix2002.

Theo de Raadt of the OpenBSD Project presented a rundown of various new things that the OpenBSD and OpenSSH teams have been looking at. Theo's talk included an amusing description (and pictures) of some of the events that transpired during OpenBSD's hack-a-thon, which occurred the week before USENIX '02. Finally, Theo mentioned that, following complications with the IPFilter license, the OpenBSD team had done a fairly extensive license audit.

Robert Watson of the FreeBSD Project brought up various examples of FreeBSD being used in the real world. He went on to describe a wide variety of changes that will surface with the upcoming FreeBSD release 5.0. Notably, 5.0 will introduce a re-architecting of the kernel aimed at providing much more scalable support for SMP machines. 5.0 will also include an early implementation of KSE, FreeBSD's version of scheduler activations, large improvements to pccard, and significant framework bits from the TrustedBSD project.

Mike Karels from Wind River Systems presented an overview of how BSD/OS has been evolving following the Wind River Systems acquisition of BSDi's software assets. Ernie Prabhakar from Apple discussed Apple's OS X and its success on the desktop. He explained how OS X aims to bring BSD to the desktop while striving to maintain a strong and stable core, one that is heavily based on BSD technology.

The BoF finished with a discussion on licensing; specifically, some folks questioned the impact of Apple's licensing for code from their core OS (the Darwin Project) on the BSD community as a whole.

CLOSING SESSION

HOW FLIES FLY

Michael H. Dickinson, University of California, Berkeley

Summarized by JD Welch

In this lively and offbeat talk, Dickinson discussed his research into the mechanisms of flight in the fruit fly (Drosophila melanogaster) and related the autonomic behavior to that of a technological system. Insects are the most successful organisms on earth, due in no small part to their early adoption of flight. Flies travel through space on straight trajectories interrupted by saccades, jumpy turning motions somewhat analogous to the fast, jumpy movements of the human eye. Using a variety of monitoring techniques, including high-speed video in a "virtual reality flight arena," Dickinson and his colleagues have observed that flies respond to visual cues to decide when to saccade during flight. For example, more visual texture makes flies saccade earlier (i.e., further away from the arena wall), described by Dickinson as a "collision control algorithm."

Through a combination of sensory inputs, flies can make decisions about their flight path. Flies possess a visual "expansion detector" which, at a certain threshold, causes the animal to turn in a certain direction. However, expansion in front of the fly sometimes causes it to land. How does the fly decide? Using the virtual reality device to replicate various visual scenarios, Dickinson observed that the flies fixate on vertical objects that move back and forth across the field. Expansion on the left side of the animal causes it to turn right, and vice versa, while expansion directly in front of the animal triggers the legs to flare out in preparation for landing.

If the eyes detect these changes, how are the responses implemented? The flies' wings have "power muscles," controlled by mechanical resonance (as opposed to the nervous system), which drive the wings, combined with neurally activated "steering muscles," which change the configuration of the wing joints. Subtle variations in the timing of impulses correspond to changes in wing movement, controlled by sensors in the "wing-pit." A small wing-like structure, the haltere, controls equilibrium by beating constantly.

Voluntary control is accomplished by the halteres, whose signals can interfere with the autonomic control of the stroke cycle. The halteres have "steering muscles" as well, and information derived from the visual system can turn off the haltere or trick it into a "virtual" problem requiring a response.

Dickinson has also studied the aerodynamics of insect flight using a device called the "Robo Fly," an oversized mechanical insect wing suspended in a large tank of mineral oil. Interestingly, Dickinson observed that the average lift generated by the flies' wings is under their body weight; the flies use three mechanisms to overcome this, including rotating the wings (rotational lift).

Insects are extraordinarily robust creatures, and because Dickinson analyzed the problem in a systems-oriented way, these observations and analyses are immediately applicable to technology: the response system can be used as an efficient search algorithm for control systems in autonomous vehicles, for example.

THE AFS WORKSHOP

Summarized by Garry Zacheiss, MIT

Love Hornquist-Astrand of the Arla development team presented the Arla status report. Arla 0.35.8 has been released. Scheduled for release soon are improved support for Tru64 UNIX, MacOS X, and FreeBSD; improved volume handling; and implementation of more of the vos/pts subcommands. It was stressed that MacOS X is considered an important platform, and that a GUI configuration manager for the cache manager, a GUI ACL manager for the Finder, and a graphical login that obtains Kerberos tickets and AFS tokens at login time were all under development.

Future goals planned for the underlying AFS protocols include GSSAPI/SPNEGO support for Rx, performance improvements to Rx, an enhanced disconnected mode, and IPv6 support for Rx; an experimental patch is already available for the latter. Future Arla-specific goals include improved performance, partial file reading, and increased stability on several platforms. Work is also in progress on the RXGSS protocol extensions (integrating GSSAPI into Rx); a partial implementation exists, and work continues as developers find time.

Derrick Brashear of the OpenAFS development team presented the OpenAFS status report. OpenAFS 1.2.5 was released immediately prior to the conference; it fixed a remotely exploitable denial-of-service vulnerability in several OpenAFS platforms, most notably IRIX and AIX. Future work planned for OpenAFS includes better support for MacOS X, including working around Finder interaction issues. Better support for the BSDs is also planned: FreeBSD has a partially implemented client, while NetBSD and OpenBSD have only server programs available right now. AIX 5, Tru64 5.1A, and MacOS X 10.2 (a.k.a. Jaguar) are all planned for the future. Other planned enhancements include nested groups in the ptserver (code donated by the University of Michigan awaits integration), disconnected AFS, and further work on a native Windows client. Derrick stated that the guiding principles of OpenAFS were to maintain compatibility with IBM AFS, support new platforms, ease the administrative burden, and add new functionality.

AFS performance was discussed. The openafs.org cell consists of two file servers, one in Stockholm, Sweden, and one in Pittsburgh, PA. AFS works reasonably well over trans-Atlantic links, although OpenAFS clients don't determine which file server to talk to very effectively. Arla clients use RTTs to the server to determine the optimal file server to fetch replicated data from; modifications to OpenAFS to support this behavior in the future are desired.

Jimmy Engelbrecht and Harald Barth of KTH discussed their AFSCrawler script, written to determine how many AFS cells and clients were in the world, what implementations and versions they were running (Arla vs. IBM AFS vs. OpenAFS), and how much data was in AFS. The script unfortunately triggered a bug in IBM AFS 3.6–derived code, causing some clients to panic while handling a specific RPC. This has since been fixed in OpenAFS 1.2.5 and the most recent IBM AFS patch level, and all AFS users are strongly encouraged to upgrade. No release of Arla is vulnerable to this particular denial-of-service attack. There was an extended discussion of the usefulness of this exploration. Many sites believed this was useful information and that such scanning should continue in the future, but only on an opt-in basis.

Many sites face the problem of managing Kerberos/AFS credentials for batch-scheduled jobs. Specifically, most batch processing software needs to be modified to forward tickets as part of the batch submission process, renew tickets and tokens while the job is in the queue and for the lifetime of the job, and properly destroy credentials when the job completes. Ken Hornstein of NRL was able to pay a commercial vendor to support Kerberos 4/5 credential management in their product, although they did not implement AFS token management. MIT has implemented some of the desired functionality in OpenPBS and might be able to make it available to other interested sites.

Tools to simplify AFS administration were discussed, including:

AFS Balancer: A tool written by CMU to automate the process of balancing disk usage across all servers in a cell. Available from ftp://ftp.andrew.cmu.edu/pub/AFS-Tools/balance-1.1b.tar.gz.

Themis: KTH's enhanced version of the AFS tool "package" for updating files on local disk from a central AFS image. KTH's enhancements include allowing the deletion of files, simplifying the process of adding a file, and allowing the merging of multiple rule sets for determining which files are updated. Themis is available from the Arla CVS repository.

Stanford was presented as an example of a large AFS site. Stanford's AFS usage consists of approximately 14TB of data in AFS, in the form of approximately 100,000 volumes; 33TB of storage is available in their primary cell, ir.stanford.edu. Their file servers consist entirely of Solaris machines running a combination of Transarc 3.6 patch-level 3 and OpenAFS 1.2.x, while their database servers run OpenAFS 1.2.2. Their cell consists of 25 file servers using a combination of EMC and Sun StorEdge hardware. Stanford continues to use the kaserver for their authentication infrastructure, with future plans to migrate entirely to an MIT Kerberos 5 KDC.

Stanford has approximately 3,400 clients on their campus, not including SLAC (Stanford Linear Accelerator); approximately 2,100 AFS clients from outside Stanford contact their cell every month. Their supported clients are almost entirely IBM AFS 3.6 clients, although they plan to release OpenAFS clients soon. Stanford currently supports only UNIX clients. There is some on-campus presence of Windows clients, but Stanford has never publicly released or supported it. They do intend to release and support the MacOS X client in the near future.

All Stanford students, faculty, and staff are assigned AFS home directories, with a default quota of 50MB, for a total of approximately 550GB of user home directories. Other significant uses of AFS

storage are data volumes for workstation software (400GB) and volumes for course Web sites and assignments (100GB).

AFS usage at Intel was also presented. Intel has been an AFS site since 1994. They had bad experiences with the IBM 3.5 Linux client; their experience with OpenAFS on Linux 2.4.x kernels has been much better. They use, and are satisfied with, the OpenAFS IA64 Linux port. Intel has hundreds of OpenAFS 1.2.3 and 1.2.4 clients in many production cells accessing data stored on IBM AFS file servers, and they have not encountered any interoperability issues. Intel has some concerns about OpenAFS: they would like to purchase commercial support for OpenAFS and to see OpenAFS support for HP-UX on both PA-RISC and Itanium hardware. HP-UX support is currently unavailable due to a specific HP-UX header file being unavailable from HP; this may be available soon. Intel has not yet committed to migrating their file servers to OpenAFS and is unsure if it will do so without commercial support.

Backups are a traditional topic of discussion at AFS workshops, and this time was no exception. Many users complain that the traditional AFS backup tools ("backup" and "butc") are complex and difficult to automate, requiring many home-grown scripts and much user intervention for error recovery. An additional complaint was that the traditional AFS tools do not support file- or directory-level backups and restores; data must be backed up and restored at the volume level.

Mitch Collinsworth of Cornell presented work to make AMANDA, the free backup software from the University of Maryland, suitable for AFS backups. Using AMANDA for AFS backups allows one to share AFS backup tapes with non-AFS backups, easily run multiple backups in parallel, automate error recovery, and provide a robust degraded mode that prevents tape errors from stopping backups altogether. Their implementation allows for full volume restores as well as individual directory and file restores. They have finished coding this work and are in the process of testing and documenting it.

Peter Honeyman of CITI at the University of Michigan spoke about work he has proposed to replace Rx with RPCSEC GSS in OpenAFS; this would allow AFS to use a TCP-based transport mechanism rather than the UDP-based Rx, and possibly gain better congestion control, dynamic adaptation, and fragmentation avoidance as a result. RPCSEC GSS uses the GSSAPI to authenticate SUN ONC RPC. RPCSEC GSS is transport agnostic, provides strong security, is a developing Internet standard, and has multiple open source implementations. Backward compatibility with existing AFS servers and clients is an important goal of this project.


altitude flight runs from the 20 Air Route Traffic Control Centers (ARTCC, pronounced "artsy") and traffic flow management, which is the strategic arm. Each area has sectors for low, high, and very-high flight. Each sector has a controller team, including one person on the microphone, and handles between 12 and 16 aircraft at a time. Since the number of sectors and areas is limited and fixed, there's limited system capacity. The events of September 11, 2001, gave us a respite in terms of system usage, but based on past growth patterns the air traffic system will be oversubscribed within two to three years. How do we handle this oversubscription?

Air Traffic Management (ATM) Decision Support Tools (DST) use physics, aeronautics, heuristics (expert systems), fuzzy logic, and neural nets to help the (human) aircraft controllers route aircraft around. The rest of the talk focused on capacity issues, but the DST also handle safety and security issues. The software follows open standards (ISO, POSIX, and ANSI). The team at NASA Ames made Center-TRACON Automation System (CTAS), which is software for each of the ARTCCs, portable from Solaris to HP-UX and Linux as well. Unlike just about every other major software project, this one really is standard and portable; co-presenter Rob Savoye has experience in maintaining gcc on multiple platforms and is the project lead on the portability and standards issues for the code. CTAS allows the ARTCCs to upgrade and enhance individual aspects or parts of the system; it isn't a monolithic, all-or-nothing entity like the old ATM systems.

Some future areas of research include a head-mounted augmented reality device for tower operators to improve their situational awareness by automating human factors, and new digital global positioning system (DGPS) technologies, which are accurate within inches instead of feet.


Questions centered around advanced avionics (e.g., getting rid of ground control), cooperation between the US and Europe for software development (we're working together on software development, but the various European countries' controllers don't talk well to each other), and privatization.

ADVENTURES IN DNS

Bill Manning, ISI

Summarized by Brennan Reynolds

Manning began by posing a simple question: is DNS really a critical infrastructure? The answer is not simple. Perhaps seven years ago, when the Internet was just starting to become popular, the answer was a definite no. But today, with IPv6 being implemented and so many transactions being conducted over the Internet, the question does not have a clear-cut answer. The Internet Engineering Task Force (IETF) has several new modifications to the DNS service that may be used to protect and extend its usability.

The first extension Manning discussed was Internationalized Domain Names (IDN). To date, all DNS records are based on the ASCII character set, but many addresses in the Internet's global network cannot be easily written in ASCII characters. The goal of IDN is to provide encoding for hostnames that is fair and efficient and allows for a smooth transition from the current scheme. The work has resulted in two encoding schemes, ACE and UTF-8. Each encoding is independent of the other, but they can be used together in various combinations. Manning expressed his opinion that while neither is an ideal solution, ACE appears to be the lesser of two evils. A major hindrance to getting IDN rolled out into the Internet's DNS root structure is the increase in zone file complexity.

Manning's next topic was the inclusion of IPv6 records. In an IPv4 world it is possible for administrators to remember the numeric representation of an address; IPv6 makes the addresses too long and complex to be easily remembered. This means that DNS will play a vital role in getting IPv6 deployed. Several new resource records (RR) have been proposed to handle the translation, including AAAA, A6, DNAME, and BITSTRING. Manning commented on the difference between IPv4 and v6 as a transport protocol, noting that systems tuned for v4 traffic will suffer a performance hit when using v6. This is largely attributed to the increase in data transmitted per DNS request.

The final extension Manning discussed was DNSSec. He introduced this extension as a mechanism that protects the system from itself. DNSSec protects against data spoofing and provides authentication between servers. It includes a cryptographic signature of the RR set to ensure authenticity and integrity. By signing the entire set, the amount of computation is kept to a minimum. The information itself is stored in a new RR within each zone file on the DNS server.

Manning briefly commented on the use of DNS to provide a PKI infrastructure, stating that that was not the purpose of DNS and therefore it should not be used in that fashion. The signing of the RR sets can be done hierarchically, resulting in the use of a single trusted key at the root of the DNS tree to sign all sets down to the leaves. However, the job of key replacement and rollover is extremely difficult for a system that is distributed across the globe.

Manning stated that in an operational test bed with all of these extensions enabled, the packet size for a single query response grew from 518 bytes to larger than 18,000. This results in a large increase in bandwidth usage for high-volume name servers. Therefore, in Manning's opinion, not all of these features will be deployed in the near future. For those looking for more information, Bill's Web site can be found at http://www.isi.edu/otdr.

THE JOY OF BREAKING THINGS

Pat Parseghian, Transmeta

Summarized by Juan Navarro

Pat Parseghian described her experience in testing the Crusoe microprocessor at the Transmeta Lab for Compatibility (TLC), where the motto is "You make it, we break it" and the goal is to make engineers miserable.

She first gave an overview of Crusoe, whose most distinctive feature is a code-morphing layer that translates x86 instructions into native VLIW. Such peculiarity makes the testing process particularly challenging, because it involves testing two processors (the "external" x86 and the "internal" VLIW) and also because of variations in how the code-morphing layer works (it may interpret code the first few times it sees an instruction sequence, and translate and save in a translation cache afterwards). There are also reproducibility issues, since the VLIW processor may run at different speeds to save energy.

Pat then described the tests that they subject the processor to (hardware compatibility tests, common applications, operating systems, envelope-pushing games, and hard-to-acquire legacy applications) and the issues that must be considered when defining testing policies. She gave some testing tips, including organizational issues like tracking resources and results.

To illustrate the testing process, Pat gave a list of possible causes of a system crash that must be investigated. If it is the silicon, then it might be because it is damaged, or because of a manufacturing problem or a design flaw. If the problem is in the code-morphing layer, is the fault with the interpreter or with the translator? The fault could also be external to the processor: it could be the homemade system BIOS, a faulty mainboard, or an operator error. Or Transmeta might not be at fault at all; the crash might be due to a bug in the OS or the application. The key to pinpointing

the cause of a problem is to isolate it by identifying the conditions that reproduce the problem and repeating those conditions in other test platforms and non-Crusoe systems. Then relevant factors must be identified based on product knowledge, past experience, and common sense.

To conclude, Pat suggested that some of TLC's testing lessons can be applied to products that the audience was involved in, and assured us that breaking things is fun.

TECHNOLOGY, LIBERTY, FREEDOM, AND WASHINGTON

Alan Davidson, Center for Democracy and Technology

Summarized by Steve Bauer

"Experience should teach us to be most on our guard to protect liberty when the Government's purposes are beneficent. Men born to freedom are naturally alert to repel invasion of their liberty by evil-minded rulers. The greatest dangers to liberty lurk in insidious encroachment by men of zeal, well-meaning but without understanding." – Louis Brandeis, Olmstead v. US

Alan Davidson, an associate director at the CDT (http://www.cdt.org) since 1996, concluded his talk with this quote. In many ways it aptly characterizes the importance to the USENIX community of the topics he covered. The major themes of the talk were the impact of law on individual liberty, system architectures, and appropriate responses by the technical community.

The first part of the talk provided the audience with an overview of legislation either already introduced or likely to be introduced in the US Congress. This included various proposals to protect children, such as relegating all sexual content to a .xxx domain or providing kid-safe zones such as .kids.us. Other similar laws discussed were the Children's Online Protection Act and legislation dealing with virtual child pornography.

Other lower-profile pieces covered were laws prohibiting false contact information in emails and domain registration databases. Similar proposals exist that would prohibit misleading subject lines in emails and require messages to clearly identify if they are advertisements. Briefly mentioned were laws impacting online gambling.

Bills and laws affecting the architectural design of systems and networks, and various efforts to establish boundaries and borders in cyberspace, were then discussed. These included the Consumer Broadband and Digital Television Promotion Act and the French cases against Yahoo and its CEO for allowing the sale of Nazi materials on the Yahoo auction site.

A major part of the talk focused on the technological aspects of the USA-Patriot Act, passed in the wake of the September 11 attacks. Topics included "pen registers" for the Internet, expanded government power to conduct secret searches, roving wiretaps, nationwide service of warrants, sharing grand jury and intelligence information, and the establishment of an identification system for visitors to the US.

The technological community needs to be aware of and care about the impact of law on individual liberty and the systems the community builds. Specific suggestions included designing privacy and anonymity into architectures and limiting information collection, particularly any personally identifiable data, to only what is essential. Finally, Alan emphasized that many people simply do not understand the implications of law. Thus the technology community has an important role in helping make these implications clear.

TAKING AN OPEN SOURCE PROJECT TO MARKET

Eric Allman, Sendmail, Inc.

Summarized by Scott Kilroy

Eric started the talk with the upsides and downsides to open source software. The upsides include rapid feedback, high-quality work (mostly), lots of hands, and an intensely engineering-driven approach. The downsides include no support structure, limited project marketing expertise, and volunteers having limited time.

In 1998 Sendmail was becoming a success disaster. A success disaster includes two key elements: (1) volunteers start spending all their time supporting current functionality instead of enhancing it (this heavy burden can lead to project stagnation and, eventually, death); (2) competition from better-funded sources can lead to FUD (fear, uncertainty, and doubt) about the project.

Eric then led listeners down the path he took to turn Sendmail into a business. Eric envisioned Sendmail, Inc. as a smallish and close-knit company but soon realized that the family atmosphere he desired could not last as the company grew. He observed that maybe only the first 10 people will buy into your vision, after which each individual is primarily interested in something else (e.g., power, wealth, or status). He warns that there will be people working for you whom you do not like. Eric particularly did not care for sales people, but he emphasized how important sales, marketing, finance, and legal people are to companies.

The nature of companies in infancy can be summed up as: you need money, and you need investors in order to get money. Investors want a return on investment, so money management becomes critical; companies therefore must optimize money functions. Eric's experiences with investors were lessons in themselves. More than giving money, good investors can provide connections and insight, so you don't want investors who don't believe in you.

A company must have bug history, change management, and people who understand the product and have a sense of the target market. Eric now sees the importance of marketing target research. A company can never forget the simple equation that drives business: profit = revenue - expense.

Finally, the concept of value should be based on what is valuable to the customer. Eric observed that his customers needed documentation, extra value in the product that they couldn't get elsewhere, and technical support.

Eric learned hard lessons along the way. If you want to do interesting open source, it might be best not to be too successful: you don't want to let the larger dragons (companies) notice you. If you want a commercial user base, you have to manage process, watch corporate culture, provide value to customers, watch bottom lines, and develop a thick skin.

Starting a company is easy, but perpetuating it is extremely difficult.

INFORMATION VISUALIZATION FOR SYSTEMS PEOPLE

Tamara Munzner, University of British Columbia

Summarized by JD Welch

Munzner presented interesting evidence that information visualization, through interactive representations of data, can help people to perform tasks more effectively and reduce the load on working memory.

A popular technique in visualizing data is abstracting it through node-link representation – for example, a list of cross-references in a book. Representing the relationships between references in a graphical (node-link) manner "offloads cognition to the human perceptual system." These graphs can be produced manually, but automated drawing allows for increased complexity and quicker production times.

The way in which people interpret visual data is important in designing effective visualization systems. Attention to pre-attentive visual cues is key to success: picking out a red dot among a field of identically shaped blue dots is significantly faster than the same data represented in a mixture of hue and shape variations. There are many pre-attentive visual cues, including size, orientation, intersection, and intensity. The accuracy of these cues can be ranked, with position triggering high accuracy and color or density on the low end.

Visual cues combined with different data types – quantitative (e.g., 10 inches), ordered (small, medium, large), and categorical (apples, oranges) – can also be ranked, with spatial position being best for all types. Beyond this commonality, however, accuracy varied widely, with length ranked second for quantitative data, eighth for ordinal, and ninth for categorical data types.

These guidelines about visual perception can be applied to information visualization, which, unlike scientific visualization, focuses on abstract data and the choice of spatialization. Several techniques are used to clarify abstract data, including multi-part glyphs, where changes in individual parts are incorporated into an easier-to-understand gestalt; interactivity; motion; and animation. In large data sets, techniques like "focus + context," where a zoomed portion of the graph is shown along with a thumbnail view of the entire graph, are used to minimize user disorientation.

Future problems include dealing with huge databases such as the Human Genome, reckoning with dynamic data like the changing structure of the Web or real-time network monitoring, and transforming "pixel bound" displays into large "digital wallpaper"-type systems.

FIXING NETWORK SECURITY BY HACKING THE BUSINESS CLIMATE

Bruce Schneier, Counterpane Internet Security

Summarized by Florian Buchholz

Bruce Schneier identified security as one of the fundamental building blocks of the Internet. A certain degree of security is needed for all things on the Net, and the limits of security will become limits of the Net itself. However, companies are hesitant to employ security measures. Science is doing well, but one cannot see the benefits in the real world. An increasing number of users means more problems affecting more people. Old problems such as buffer overflows haven't gone away, and new problems show up. Furthermore, the amount of expertise needed to launch attacks is decreasing.

Schneier argues that complexity is the enemy of security and that while security is getting better, the growth in complexity is outpacing it. As security is fundamentally a people problem, one shouldn't focus on technologies but rather should look at businesses, business motivations, and business costs. Traditionally, one can distinguish between two security models: threat avoidance, where security is absolute, and risk management, where security is relative and one has to mitigate the risk with technologies and procedures. In the latter model, one has to find a point with reasonable security at an acceptable cost.

After a brief discussion of how security is handled in businesses today, in which he concluded that businesses "talk big about it but do as little as possible," Schneier identified four necessary steps to fix the problems.

First, enforce liabilities. Today no real consequences have to be feared from security incidents. Holding people accountable will increase the transparency of security and give an incentive to make processes public. As possible options to achieve this, he listed industry-defined standards, federal regulation, and lawsuits. The problems with enforcement, however, lie in difficulties associated with the international nature of the problem and the fact that complexity makes assigning liabilities difficult. Furthermore, fear of liability could have a chilling effect on free software development and could stifle new companies.

As a second step, Schneier identified the need to allow partners to transfer liability. Insurance would spread liability risk among a group and would be a CEO's primary risk analysis tool. There is a need for standardized risk models and protection profiles, and for more securities as opposed to empty press releases. Schneier predicts that insurance will create a marketplace for security where customers and vendors will have the ability to accept liabilities from each other. Computer-security insurance should soon be as common as household or fire insurance, and from that development insurance will become the driving factor of the security business.

The next step is to provide mechanisms to reduce risks, both before and after software is released; techniques and processes to improve software quality, as well as an evolution of security management, are therefore needed. Currently, security products try to rebuild the walls – such as physical badges and entry doorways – that were lost when getting connected. Schneier believes this "fortress metaphor" is bad; one should think of the problem more in terms of a city. Since most businesses cannot afford proper security, outsourcing is the only way to make security scalable. With outsourcing, the concept of best practices becomes important, and insurance companies can be tied to them; outsourcing will level the playing field.

As a final step, Schneier predicts that rational prosecution and education will lead to deterrence. He claims that people feel safe because they live in a lawful society, whereas the Internet is classified as lawless, very much like the American West in the 1800s or a society ruled by warlords. This is because it is difficult to prove who an attacker is, and prosecution is hampered by complicated evidence gathering and irrational prosecution. Schneier believes, however, that education will play a major role in turning the Internet into a lawful society. Specifically, he pointed out that we need laws that can be explained.

Risks won't go away; the best we can do is manage them. A company able to manage risk better will be more profitable, and we need to give CEOs the necessary risk-management tools.

There were numerous questions. Listed below are the more complicated ones, in a paraphrased Q&A format.

Q: Will liability be effective? Will insurance companies be willing to accept the risks? A: The government might have to step in; it needs to be seen how it plays out.

Q: Is there an analogy to the real world in the fact that in a lawless society only the rich can afford security? Security solutions differentiate between classes. A: Schneier disagreed with the statement, giving an analogy to front-door locks, but conceded that there might be special cases.

Q: Doesn't homogeneity hurt security? A: Homogeneity is oversold. Diverse types can be more survivable, but given the limited number of options, the difference will be negligible.

Q: Regulations in airbag protection have led to deaths in some cases. How can we keep the pendulum from swinging the other way? A: Lobbying will not be prevented. An imperfect solution is probable; there might be reversals of requirements, such as the airbag one.

Q: What about personal liability? A: This will be analogous to auto insurance; liability comes with computer/Net access.

Q: If the rule of law is to become reality, there must be a law enforcement function that applies to a physical space. You cannot do that with any existing government agency for the whole world. Should an organization like the UN assume that role? A: Schneier was not convinced global enforcement is possible.

Q: What advice should we take away from this talk? A: Liability is coming. Since the network is important to our infrastructure, eventually the problems will be solved in a legal environment. You need to start thinking about how to solve the problems and how the solutions will affect us.

The entire speech (including slides) is available at http://www.counterpane.com/presentation4.pdf.

LIFE IN AN OPEN SOURCE STARTUP

Daryll Strauss, Consultant

Summarized by Teri Lampoudi

This talk was packed with morsels of insight and tidbits of information about what life in an open source startup is like (though "was like" might be more appropriate), what the issues are in starting, maintaining, and commercializing an open source project, and the way hardware vendors treat open source developers. Strauss began by tracing the timeline of his involvement with developing 3-D graphics support for Linux: from the 3dfx Voodoo1 driver for Linux in 1997, to the establishment of Precision Insight in 1998, to its buyout by VA Linux, to the dismantling of his group by VA, to what the future may hold.

Obviously there are benefits from doing open source development, both for the developer and for the end user. An important point was that open source develops entire technologies, not just products. A misapprehension is the inherent difficulty of managing a project that accepts code from a large number of developers, some volunteer and some paid. The notion that code can just be thrown over the wall into the world is a Big Lie.

How does one start and keep afloat an open source development company? In the case of Precision Insight, the subject matter – developing graphics and OpenGL for Linux – required a considerable amount of expertise: intimate knowledge of the hardware, the libraries, X, and the Linux kernel, as well as the end applications. Expertise is marketable. What helps even further is having a visible virtual team of experts, people who have an established track record of contributions to open source in the particular area of expertise. Support from the hardware and software vendors came next. While everyone was willing to contract PI to write the portions of code that were specific to their hardware, no one wanted to pay the bill for developing the underlying foundation that was necessary for the drivers to be useful. In the PI model, the infrastructure cost was shared by several customers at once, and the technology kept moving. Once a driver is written for a vendor, the vendor is perfectly capable of figuring out how to write the next driver necessary, thereby obviating the need to contract the job. This, and questions of protecting intellectual property – hardware design in this case – are deeply problematic with the open source mode of development. This is not to say that there is no way to get over them, but that they are likely to arise more often than not.

Management issues seem equally important. In the PI setting, contracts were flexible – both a good and a bad thing – and developers overworked. Further, development "in a fishbowl," that is, under public scrutiny, is not easy. Strauss stressed the value of good communication and the use of various out-of-sync communication methods like IRC and mailing lists.

Finally, Strauss discussed what portions of software can be free and what can be proprietary. His suggestion was that horizontal markets want to be free, whereas vertically one can develop proprietary solutions. The talk closed with a small stroll through the problems that project politics brings up, from things like merging branches to mailing list and source tree readability.

GENERAL TRACK SESSIONS

FILE SYSTEMS

Summarized by Haijin Yan

STRUCTURE AND PERFORMANCE OF THE DIRECT ACCESS FILE SYSTEM

Kostas Magoutis, Salimah Addetia, Alexandra Fedorova, and Margo I. Seltzer, Harvard University; Jeffrey Chase, Andrew Gallatin, Richard Kisley, and Rajiv Wickremesinghe, Duke University; and Eran Gabber, Lucent Technologies

This paper won the Best Paper award

The Direct Access File System (DAFS) is a new standard for network-attached storage over direct-access transport networks. DAFS takes advantage of user-level network interface standards that enable client-side functionality in which remote data access resides in a library rather than in the kernel. This reduces the overhead of memory copy for data movement and protocol overhead. Remote Direct Memory Access (RDMA) is a direct-access transport network that allows the network adapter to reduce copy overhead by accessing application buffers directly.

This paper explores the fundamental structural and performance characteristics of network file access using a user-level file system structure on a direct-access transport network with RDMA. It describes DAFS-based client and server reference implementations for FreeBSD and reports experimental results comparing DAFS to a zero-copy NFS implementation. It illustrates the benefits and trade-offs of these techniques to provide a basis for informed choices about deployment of DAFS-based systems and similar extensions to other network file protocols, such as NFS. Experiments show that DAFS gives applications direct control over an I/O system and increases the client CPU's usage while the client is doing I/O. Future work includes how to address longstanding problems related to the integration of the application and file system for high-performance applications.

CONQUEST: BETTER PERFORMANCE THROUGH A DISK/PERSISTENT-RAM HYBRID FILE SYSTEM

An-I A. Wang, Peter Reiher, and Gerald J. Popek, UCLA; Geoffrey M. Kuenning, Harvey Mudd College

Motivated by the declining cost of persistent RAM, the authors propose the Conquest file system, which stores all small files, metadata, executables, and shared libraries in persistent RAM; disks hold only the data content of remaining large files. Compared to alternatives such as caching and RAM file systems, Conquest has the advantages of efficiency, consistency, and reliability at a reduced cost. Using popular benchmarks, experiments show that Conquest incurs little overhead while achieving faster performance. Future work includes designing mechanisms for adjusting the file-size threshold dynamically and finding a better disk layout for large data blocks.
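The core placement decision in a disk/persistent-RAM hybrid is easy to picture. Below is a toy sketch of that policy only; the 1MB threshold, the HybridStore class, and its write/read methods are invented for illustration and are not Conquest's actual interfaces.

```python
FILE_SIZE_THRESHOLD = 1 << 20   # assumed threshold: files under 1MB live in persistent RAM

class HybridStore:
    """Toy placement policy: small files and metadata in RAM, large-file data on disk."""

    def __init__(self, ram, disk):
        self.ram, self.disk = ram, disk          # two dict-like stores

    def write(self, path, data, is_metadata=False):
        if is_metadata or len(data) < FILE_SIZE_THRESHOLD:
            self.ram[path] = data                # metadata and small files: persistent RAM
        else:
            self.ram.pop(path, None)             # a file that grew past the threshold moves;
            self.disk[path] = data               # only its data content goes to disk

    def read(self, path):
        if path in self.ram:                     # RAM is checked first
            return self.ram[path]
        return self.disk[path]
```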

EXPLOITING GRAY-BOX KNOWLEDGE OF BUFFER-CACHE MANAGEMENT

Nathan C. Burnett, John Bent, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin, Madison

Knowing what algorithm is used to manage the buffer cache is very important for improving application performance. However, there is currently no interface for finding this algorithm. This paper introduces Dust, a simple fingerprinting tool that is able to identify the buffer-cache replacement policy. Dust automatically identifies the cache size and replacement policy based on the configured attributes of access order, recency, frequency, and long-term history. Through simulation, Dust was able to distinguish between a variety of replacement algorithm policies found in the literature: FIFO, LRU, LFU, Clock, Segmented FIFO, 2Q, and LRU-K. Further experiments on fingerprinting real operating systems such as NetBSD, Linux, and Solaris show that Dust is able to identify the hidden cache replacement algorithm.
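The fingerprinting idea can be illustrated with a toy simulation: build a reference pattern whose final probe hits under a recency-based policy but misses under FIFO. The Cache class and the probe sequence below are invented for illustration; they are not Dust's actual probe workloads or configuration steps.

```python
from collections import OrderedDict

class Cache:
    """Tiny simulator: 'lru' refreshes recency on a hit, 'fifo' evicts strictly by insertion order."""
    def __init__(self, size, policy):
        self.size, self.policy, self.blocks = size, policy, OrderedDict()

    def access(self, block):
        hit = block in self.blocks
        if hit and self.policy == "lru":
            self.blocks.move_to_end(block)          # refresh recency only under LRU
        if not hit:
            if len(self.blocks) >= self.size:
                self.blocks.popitem(last=False)     # evict the head (oldest / least recent)
            self.blocks[block] = True
        return hit

def fingerprint(make_cache):
    cache = make_cache()
    for b in "ABC":                                  # fill a 3-block cache
        cache.access(b)
    cache.access("A")                                # re-touch the oldest block
    cache.access("D")                                # force exactly one eviction
    return cache.access("A")                         # hit => recency-based, miss => FIFO

print("LRU keeps A:", fingerprint(lambda: Cache(3, "lru")))     # True
print("FIFO evicts A:", fingerprint(lambda: Cache(3, "fifo")))  # False
```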

By knowing the underlying cache replacement policy, a cache-aware Web server can reschedule requests using a cached-request-first policy to obtain performance improvement. Experiments show that cache-aware scheduling improves average response time and system throughput.

OPERATING SYSTEMS (AND DANCING BEARS)

Summarized by Matt Butner

THE JX OPERATING SYSTEM

Michael Golm, Meik Felser, Christian Wawersich, and Jürgen Kleinöder, University of Erlangen-Nürnberg

The talk opened by emphasizing the need for, and the practicality of, a functional Java OS. Such an implementation, the JX OS, attempts to mimic the recent trend toward using highly abstracted languages in application development in order to create an OS as functional and powerful as one written in a lower-level language.

JX is a micro-kernel solution that uses separate JVMs for each entity of the kernel and, in some cases, for each application. The separated domains do not share objects and have no thread migration, and each domain implements its own code. The interesting part of the presentation was the discussion of system-level programming with Java; key areas discussed were memory management and interrupt handlers. The authors concluded by noting that their type-safe and modular system resulted in a robust system with great configuration flexibility and acceptable performance.

DESIGN EVOLUTION OF THE EROS SINGLE-LEVEL STORE

Jonathan S. Shapiro, Johns Hopkins University; Jonathan Adams, University of Pennsylvania

This presentation outlined current characteristics of file systems and some of their least desirable characteristics. It was based on the revival of the EROS single-level-store approach for current attempts to capitalize on system consistency and efficiency. Their solution did not relate directly to EROS but rather to the design and use of the constructs and notions on which EROS is based. By extending the mapping of the memory system to include the disk, systems are further able to ensure global persistence without regard for a disk structure. In explaining the need for absolute system consistency and an environment which, by definition, is not allowed to crash, the goal of having such an exhaustive design becomes clear. The cool part of this work is the availability of its design and architecture to the public.

THINK: A SOFTWARE FRAMEWORK FOR COMPONENT-BASED OPERATING SYSTEM KERNELS

Jean-Philippe Fassino, France Télécom R&D; Jean-Bernard Stefani, INRIA; Julia Lawall, DIKU; Gilles Muller, INRIA

This presentation discussed the need for component-based operating systems and respective structures to ensure flexibility for arbitrary-sized systems. Think provides a binding model that maps uniform components for OS developers and architects to follow, ensuring consistent implementations for an arbitrary system size. However, the goal is not to force developers into a predefined kernel but to promote the use of certain components in varied ways.

Think's concentration is primarily on embedded systems, where short development time is necessary but is constrained by rigorous needs and limited resources. The need to build flexible systems in such an environment can be costly and implementation-specific, but Think creates an environment supported by the ability to dynamically load type-safe components. This allows for more flexible systems that retain functionality because of Think's ability to bind fine-grained components. Benchmarks revealed that dedicated micro-kernels can show performance improvements in comparison to monolithic kernels. Notable future work includes building components for low-end appliances and the development of a real-time OS component library.

BUILDING SERVICES

Summarized by Pradipta De

NINJA: A FRAMEWORK FOR NETWORK SERVICES

J. Robert von Behren, Eric A. Brewer, Nikita Borisov, Michael Chen, Matt Welsh, Josh MacDonald, Jeremy Lau, and David Culler, University of California, Berkeley

Robert von Behren presented Ninja, an ongoing project that aims to provide a framework for building robust and scalable Internet-based network services like Web-hosting, instant-messaging, email, and file-sharing applications. Robert drew attention to the difficulties of writing cluster applications: one has to take care of data consistency and issues of concurrency, and for robust applications there are problems related to fault tolerance. Ninja works as a wrapper to relieve the user of these problems. The goal of the project is building network services that are scalable and highly available, maintaining persistent data, providing graceful degradation, and supporting online evolution.

The use of clusters distinguishes this setup from generic distributed systems in terms of reliability and security, as well as providing a partition-free network. The next important feature of this project is the new programming model, which is more restrictive than a general programming model but still expressive enough to write most of the applications. This model, described as "single program multiple connection," uses intertask parallelism instead of multithreaded concurrency and is characterized by all nodes running the same program, with connections delegated to the nodes by a centralized connection manager (CM). A CM takes care of hiding the details of mapping an external connection to an internal node. Ninja also uses a design called "staged event-driven architecture," where each service is broken down into a set of stages connected by event queues. This architecture is suitable for a modular design and helps in graceful degradation by adaptive load shedding from the event queues.

The presentation concluded with examples of implementation and evaluation of a Web server and an email system called NinjaMail, to show the ease of authoring and the efficacy of using the Ninja framework for service development.

USING COHORT-SCHEDULING TO ENHANCE SERVER PERFORMANCE

James R. Larus and Michael Parkes, Microsoft Research

James Larus presented a new scheduling policy for increasing the throughput of server applications. Cohort scheduling batches execution of similar operations arising in different server requests. The usual programming paradigm in handling server requests is to spawn multiple concurrent threads and switch from one thread to another whenever a thread gets blocked for I/O. Since the threads in a server mostly execute unrelated pieces of code, the locality of reference is reduced, and hence so is the effectiveness of different caching mechanisms. One way to solve this problem is to throw in more hardware, but Larus presented a complementary view in which the program's behavior is investigated and used to improve performance.

The problem in this scheme is to identify pieces of code that can be batched together for processing. One simple way is to look at the next program counter values and use them to club threads together. However, this talk presented a new programming abstraction, "staged computation," which replaces the thread model with "stages." A stage is an abstraction for operations with similar behavior and locality. The StagedServer library can be used for programming in this model. It is a collection of C++ classes that implement staged computation and cohort scheduling on either a uniprocessor or multiprocessor, and it can be used to define stages.
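A rough sketch of the scheduling idea follows. The real StagedServer library is C++; the tiny CohortScheduler class below is an invented stand-in, meant only to show work being queued per stage and drained in batches rather than interleaved per request.

```python
from collections import defaultdict, deque

class CohortScheduler:
    """Toy cohort scheduler: requests queue work per stage; each stage's pending
    operations are run back-to-back so they share code and data locality."""
    def __init__(self):
        self.queues = defaultdict(deque)      # stage name -> pending (handler, request) pairs

    def enqueue(self, stage, handler, request):
        self.queues[stage].append((handler, request))

    def run(self):
        while any(self.queues.values()):
            # pick the stage with the largest cohort and drain it as a batch
            stage = max(self.queues, key=lambda s: len(self.queues[s]))
            batch, self.queues[stage] = self.queues[stage], deque()
            for handler, request in batch:
                handler(request, self)        # a handler may enqueue work for a later stage

# usage sketch: parse -> lookup -> reply, processed stage by stage
sched = CohortScheduler()
def parse(req, s):  s.enqueue("lookup", lookup, req + ":parsed")
def lookup(req, s): s.enqueue("reply", reply, req + ":found")
def reply(req, s):  print("done", req)
for i in range(3):
    sched.enqueue("parse", parse, f"req{i}")
sched.run()
```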

The presentation showed the experimental evaluation of cohort scheduling over the thread-based model by implementing two servers: a Web server, which is mainly I/O bound, and a publish-subscribe server, which is mainly compute bound. The SURGE benchmark was used for the first experiment and the Fabret workload for the second. The results showed that the cohort-scheduling-based implementation gave better throughput than the thread-based implementation at high loads.

NETWORK PERFORMANCE

Summarized by Xiaobo Fan

ETE: PASSIVE END-TO-END INTERNET SERVICE PERFORMANCE MONITORING

Yun Fu and Amin Vahdat, Duke University; Ludmila Cherkasova and Wenting Tang, HP Labs

This paper won the Best Student Paper award.

Ludmila Cherkasova began by listing several questions most Web service providers want answered in order to improve service quality. She reviewed the difficulties of making accurate and efficient end-to-end Web service measurements and the shortcomings of currently available approaches. They propose a passive trace-based architecture called EtE to monitor Web server performance on behalf of end users.

The first step is to collect network packets passively. The second module reconstructs all TCP connections and extracts HTTP transactions. To reconstruct Web page accesses, they first build a knowledge base indexed by client IP and URL, and then group objects into the related Web pages they are embedded in. Statistical analysis is used to handle non-matched objects. EtE Monitor can generate three groups of metrics to measure Web service performance: response time, Web caching, and page abortion.
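The page-grouping step can be sketched as follows, assuming the TCP/HTTP reconstruction has already produced a time-ordered list of transactions. The time-gap heuristic and field names are assumptions for illustration; EtE's actual knowledge base also uses other request attributes and statistical matching for non-matched objects.

```python
from collections import defaultdict

THINK_GAP = 1.0   # assumed: a silence longer than 1 second starts a new page access

def group_page_accesses(transactions):
    """transactions: (client_ip, url, timestamp, is_html) tuples.
    Returns page accesses, each an HTML container plus its embedded objects."""
    by_client = defaultdict(list)
    for t in sorted(transactions, key=lambda t: t[2]):
        by_client[t[0]].append(t)

    pages = []
    for client, txns in by_client.items():
        current = None
        for ip, url, ts, is_html in txns:
            # an HTML object, or a long silence, marks the start of a new page access
            if is_html or current is None or ts - current["last"] > THINK_GAP:
                current = {"client": ip, "page": url, "objects": [], "last": ts}
                pages.append(current)
            else:
                current["objects"].append(url)   # embedded object for the current page
                current["last"] = ts
    return pages
```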

To demonstrate the benefits of EtE Monitor, Cherkasova talked about two case studies and, based on the calculated metrics, gave some insightful explanations about what's happening behind the variations in Web performance. Through validation, they claim their approach provides a very close approximation to the real scenario.

THE PERFORMANCE OF REMOTE DISPLAY MECHANISMS FOR THIN-CLIENT COMPUTING

S. Jae Yang, Jason Nieh, Matt Selsky, and Nikhil Tiwari, Columbia University

Noting the trend toward thin-client computing, the authors compared different techniques and design choices in measuring the performance of six popular thin-client platforms – Citrix MetaFrame, Microsoft Terminal Services, Sun Ray, Tarantella, VNC, and X. After pointing out several challenges in benchmarking thin clients, Yang proposed slow-motion benchmarking to achieve non-invasive packet monitoring and consistent visual quality. Basically, they insert delays between separate visual events in the benchmark applications on the server side so that the client's display update can catch up with the server's processing speed. The experiments are conducted on an emulated network over a range of network bandwidths.

Their results show that thin clients can provide good performance for Web applications in LAN environments, but only some platforms performed well on the video benchmark. Pixel-based encoding may achieve better performance and bandwidth efficiency than high-level graphics. Display caching and compression should be used with care.

A MECHANISM FOR TCP-FRIENDLY TRANSPORT-LEVEL PROTOCOL COORDINATION

David E. Ott and Ketan Mayer-Patel, University of North Carolina, Chapel Hill

A revised transport-level protocol optimized for cluster-to-cluster (C-to-C) applications is what David Ott tries to explain in his talk. A C-to-C application class is identified as one set of processes communicating to another set of processes across a common Internet path. The fundamental problem with C-to-C applications is how to coordinate all C-to-C communication flows so that they share a consistent view of the common C-to-C network, adapt to changing network conditions, and cooperate to meet specific requirements. Aggregation points (AP) are placed at the first and last hop of the common data path to probe network conditions (latency, bandwidth, loss rate, etc.). To carry and transfer this information, a new protocol – the Coordination Protocol (CP) – is inserted between the network layer (IP) and the transport layer (TCP, UDP, etc.). Ott illustrated how the CP header is updated and used as packets originate from a source and traverse the local and remote APs to arrive at their destination, and how an AP maintains a per-cluster state table and detects network conditions. Through simulation results, this coordination mechanism appears effective in sharing common network resources among C-to-C communication flows.
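One way to picture CP is as a small shim header that the APs stamp with measured path conditions. The field layout below is entirely hypothetical (the paper defines its own header format); it is only meant to show the kind of state a shim between IP and the transport layer could carry.

```python
import struct

# Hypothetical CP shim header, carried between the IP header and the transport
# header and rewritten by the aggregation points (APs) on the common C-to-C path.
CP_FORMAT = "!IHHfff"   # flow id, cluster id, flags, latency (ms), bandwidth (Mb/s), loss rate

def pack_cp(flow_id, cluster_id, flags, latency_ms, bw_mbps, loss):
    return struct.pack(CP_FORMAT, flow_id, cluster_id, flags, latency_ms, bw_mbps, loss)

def unpack_cp(header_bytes):
    flow_id, cluster_id, flags, latency_ms, bw_mbps, loss = struct.unpack(CP_FORMAT, header_bytes)
    return {"flow": flow_id, "cluster": cluster_id, "flags": flags,
            "latency_ms": latency_ms, "bandwidth_mbps": bw_mbps, "loss": loss}

# An AP would rewrite the measured fields as the packet passes through:
hdr = pack_cp(42, 7, 0, latency_ms=80.0, bw_mbps=4.5, loss=0.01)
print(unpack_cp(hdr))
```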

STORAGE SYSTEMS

Summarized by Praveen Yalagandula

MY CACHE OR YOURS? MAKING STORAGE MORE EXCLUSIVE

Theodore M. Wong, Carnegie Mellon University; John Wilkes, HP Labs

Theodore Wong explained the inefficiency of current "inclusive" caching schemes in storage area networks: when a client accesses a block, the block is read from the disk and is cached at both the disk array cache and the client's cache. He then presented the concept of "exclusive" caching, where a block accessed by a client is only cached in that client's cache. On eviction from the client's cache, they have come up with a DEMOTE operation to move the data block to the tail of the array cache.
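In miniature, the exclusive-caching read path plus DEMOTE might look like the sketch below; the class names and the LRU details are assumptions for illustration, not the paper's protocol.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, size):
        self.size, self.blocks = size, OrderedDict()
    def insert_at_tail(self, block, data):          # tail = most-recently-used end
        self.blocks[block] = data
        self.blocks.move_to_end(block)
        if len(self.blocks) > self.size:
            self.blocks.popitem(last=False)         # drop from the LRU head
    def pop(self, block):
        return self.blocks.pop(block, None)

class ExclusiveClient:
    def __init__(self, cache_size, array_cache, disk):
        self.cache = LRUCache(cache_size)
        self.array, self.disk = array_cache, disk   # array_cache: LRUCache, disk: dict

    def read(self, block):
        if block in self.cache.blocks:              # client-side hit
            self.cache.blocks.move_to_end(block)
            return self.cache.blocks[block]
        data = self.array.pop(block)                # array hit: the block *leaves* the array cache
        if data is None:
            data = self.disk[block]                 # miss everywhere: read from disk, but do
                                                    # NOT cache in the array (exclusive policy)
        evicted = None
        if len(self.cache.blocks) >= self.cache.size:
            evicted = self.cache.blocks.popitem(last=False)
        self.cache.insert_at_tail(block, data)
        if evicted is not None:
            self.array.insert_at_tail(*evicted)     # DEMOTE: evicted block goes to the array tail
        return data
```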

The new exclusive caching schemes are evaluated on both single-client and multiple-client systems. The single-client results are presented for two different types of synthetic workloads: Random (transaction-processing-type workloads) and Zipf (typical Web workloads). The exclusive policy was quite effective in achieving higher hit rates and lower read latencies. They also showed a 2.2-times hit-rate improvement over inclusive techniques in the case of a real-life workload, httpd, with a single-client setting.

The DEMOTE scheme also performed well in the case of multiple-client systems when the data accessed by clients is disjoint. In the case of shared-data workloads, this scheme performed worse than the inclusive schemes. Theodore then presented an adaptive exclusive caching scheme: a block accessed by a client is placed at the tail in the client's cache and also in the array cache at an appropriate place determined by the popularity of the block. The popularity of the block is measured by maintaining a ghost cache to accumulate the number of times each block is accessed. This new adaptive scheme achieved a maximum of 52% speedup in mean latency in the experiments with real-life workloads.

For more information, visit http://www.cs.cmu.edu/~tmwong/research and http://www.hpl.hp.com/research/itc/csl/ssp.

BRIDGING THE INFORMATION GAP IN STORAGE PROTOCOL STACKS

Timothy E. Denehy, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin, Madison

Currently there is a huge information gap between storage systems and file systems. The interface exposed by storage systems to file systems, based on blocks and providing only read/write interfaces, is very narrow. This leads to poor performance as a whole, because of duplicated functionality in both systems and reduced functionality resulting from storage systems' lack of file information.

The speaker presented two enhancements: ExRAID, an exposed RAID, and ILFS, the Informed Log-Structured File System. ExRAID is an enhancement to the block-based RAID storage system that exposes the following three types of information to the file system: (1) regions – contiguous portions of the address space comprising one or multiple disks; (2) performance information about queue lengths and throughput of the regions, revealing disk heterogeneity to the file systems; and (3) failure information – dynamically updated information conveying the number of tolerable failures in each region.

ILFS allows online incremental expansion of the storage space, performs dynamic parallelism using ExRAID's performance information to perform well on heterogeneous storage systems, and provides a range of different mechanisms with different granularities for redundancy of files using ExRAID's region failure characteristics. These new features were added to LFS with only a 19% increase in code size.

More information: http://www.cs.wisc.edu/wind.

MAXIMIZING THROUGHPUT IN REPLICATED DISK STRIPING OF VARIABLE BIT-RATE STREAMS

Stergios V. Anastasiadis, Duke University; Kenneth C. Sevcik and Michael Stumm, University of Toronto

There is an increasing demand for the continuous real-time streaming of media files. Disk striping is a common technique used for supporting many connections on media servers. But even with 1.2 million hours of mean time between failures, there will be more than one disk failure per week with 7,000 disks. Fault tolerance can be achieved by using data replication and reserving extra bandwidth during normal operation. This work focused on supporting the most common variable bit-rate stream formats (e.g., MPEG).


The authors used the prototype media server EXEDRA to implement and evaluate different replication and bandwidth reservation schemes. This system supports a variable bit-rate scheme, does stride-based disk allocation, and is capable of supporting variable-grain disk striping. Two replication policies are presented: deterministic – data blocks are replicated in a round-robin fashion on the secondary replicas; and random – data blocks are replicated on randomly chosen secondary replicas. Two bandwidth reservation techniques are also presented: mirroring reservation – disk bandwidth is reserved for both the primary and replicas of the media file during playback; and minimum reservation – a more efficient scheme in which bandwidth is reserved only for the sum of the primary data access time and the maximum of the backup data access times required in each round.
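The two reservation rules reduce to a one-line difference per round, roughly as follows (a paraphrase of the description above, not the paper's exact admission-control accounting):

```python
def mirroring_reservation(primary_time, backup_times):
    """Reserve disk time for the primary access and for every replica, every round."""
    return primary_time + sum(backup_times)

def minimum_reservation(primary_time, backup_times):
    """Reserve only the primary access time plus the worst-case single backup access."""
    return primary_time + max(backup_times)

# e.g., one primary access of 6ms and two replicas costing 5ms and 4ms in a round:
print(mirroring_reservation(6.0, [5.0, 4.0]))   # 15.0
print(minimum_reservation(6.0, [5.0, 4.0]))     # 11.0
```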

Experimental results showed that deterministic replica placement is better than random placement for small disks, minimum disk bandwidth reservation is twice as good as mirroring in the throughput achieved, and fault tolerance can be achieved with a minimal impact on throughput.

TOOLS

Summarized by Li Xiao

SIMPLE AND GENERAL STATISTICAL PROFILING WITH PCT

Charles Blake and Steven Bauer, MIT

This talk introduced the Profile Collection Toolkit (PCT) – a sampling-based CPU profiling facility. A novel aspect of PCT is that it allows sampling of semantically rich data such as function call stacks, function parameters, local or global variables, CPU registers, or other execution contexts. The design objectives of PCT were driven by user needs and the inadequacies or inaccessibility of prior systems.

The rich data collection capability is achieved via a debugger-controller program, dbct1. The talk then introduced the profiling collector and data collection activation. PCT file format options provide flexibility and various ways to store exact states. PCT also has analysis options; two examples were given, "Simple" and "Mixed User and Linux Code."

The talk then went into how general sampling helps in the following areas: a debugger-controller state machine, example script fragments, a simple example program, and general value reports in terms of call-path histograms, numeric value histograms, and data generality. Related work such as gprof and Expect was then summarized. The contribution of this work is to provide a portable, general value sampling tool. The limitations are that PCT does not support much of strip and that count inaccuracies happen because of its statistical nature. Based on this limitation, -g is preferred over strip.
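Aggregating sampled call stacks into a call-path histogram is conceptually a counting exercise; the snippet below is a generic sketch of that kind of report, not PCT's own format or tooling.

```python
from collections import Counter

def call_path_histogram(samples):
    """samples: call stacks, each an outermost-to-innermost tuple of function names,
    as a stack-sampling profiler might capture them."""
    return Counter(samples)

samples = [
    ("main", "parse", "read_token"),
    ("main", "parse", "read_token"),
    ("main", "eval", "lookup"),
]
for path, count in call_path_histogram(samples).most_common():
    print(f"{count:4d}  {' -> '.join(path)}")
```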

Possible future directions include supporting more debuggers, such as dbx and xdb, script generators for other kinds of traces, more canned reports for general values, and a libgdb-based library sampler. PCT is available at http://pdos.lcs.mit.edu/pct. PCT runs on almost any UNIX-like system.

ENGINEERING A DIFFERENCING AND COMPRESSION DATA FORMAT

David G. Korn and Kiem Phong Vo, AT&T Laboratories – Research

This talk began with an equation: "Differencing + Compression = Delta Compression." Compression removes information redundancy in a data set; examples are gzip, bzip, compress, and pack. Differencing encodes differences between two data sets; examples are diff -e, fdelta, bdiff, and xdelta. Delta compression compresses a data set given another, combining differencing and compression, and reduces to pure compression when there's no commonality.
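The combination can be approximated in a few lines with zlib's preset-dictionary support, which lets the compressor borrow matches from a reference version of the data; this is only a conceptual stand-in for delta compression in general, not Vcdiff's encoding.

```python
import zlib

def delta_compress(new_data: bytes, reference: bytes) -> bytes:
    """Compress new_data, letting the compressor reuse matches from reference."""
    comp = zlib.compressobj(zdict=reference)
    return comp.compress(new_data) + comp.flush()

def delta_decompress(delta: bytes, reference: bytes) -> bytes:
    decomp = zlib.decompressobj(zdict=reference)
    return decomp.decompress(delta) + decomp.flush()

old = (b"header: USENIX Annual Technical Conference 2002, Monterey, California. "
       b"This is the first revision of the summary text, which will be edited later.")
new = old.replace(b"first revision", b"second revision")

delta = delta_compress(new, old)
plain = zlib.compress(new)
assert delta_decompress(delta, old) == new
# the delta is typically far smaller than plain compression when versions share content
print(len(new), len(plain), len(delta))
```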

After presenting an overall scheme for a delta compressor and showing delta compression performance, this talk discussed the encoding format of the newly designed Vcdiff for delta compression.

A Vcdiff instruction code table consists of 256 entries, each coding up to a pair of instructions, and the format encodes 1-byte indices and any additional data.

The talk then showed Vcdiff's performance with Web data collected from CNN, compared with the W3C standard Gdiff, Gdiff+gzip, and diff+gzip, where Gdiff was computed using Vcdiff delta instructions. The results of two experiments, "First" and "Successive," were presented. In "First," each file is compressed against the first file collected; in "Successive," each file is compressed against the one from the previous hour. The diff+gzip combination did not work well because diff is line-oriented. Vcdiff performed favorably compared with the other formats.

Vcdiff is part of the Vcodex package. Vcodex is a platform for all common data transformations: delta compression, plain compression, encryption, and transcoding (e.g., uuencode, base64). It is structured in three layers for maximum usability. The base library uses the Disciplines and Methods interfaces.

The code for Vcdiff can be found at http://www.research.att.com/sw/tools. They are moving Vcdiff to the IETF standard as a comprehensive platform for transforming data. Please refer to http://www.ietf.org/internet-drafts/draft-korn-vcdiff-06.txt.

WHERE IN THE NET

Summarized by Amit Purohit

A PRECISE AND EFFICIENT EVALUATION OF THE PROXIMITY BETWEEN WEB CLIENTS AND THEIR LOCAL DNS SERVERS

Zhuoqing Morley Mao, University of California, Berkeley; Charles D. Cranor, Fred Douglis, Michael Rabinovich, Oliver Spatscheck, and Jia Wang, AT&T Labs – Research

Content Distribution Networks (CDNs) attempt to improve Web performance by delivering Web content to end users from servers located at the edge of the network. When a Web client requests content, the CDN dynamically chooses a server to route the request to, usually the one that is nearest the client. As CDNs have access only to the IP address of the local DNS server (LDNS) of the client, the CDN's authoritative DNS server maps the client's LDNS to a geographic region within a particular network and combines that with network and server load information to perform CDN server selection.

This method has two limitations. First, it is based on the implicit assumption that clients are close to their LDNS, which may not always be valid. Second, a single request from an LDNS can represent a differing number of Web clients – called the hidden load factor. Knowledge of the hidden load factor can be used to achieve better load distribution.

The extent of the first limitation and its impact on CDN server selection is dealt with. To determine the associations between clients and their LDNS, a simple, non-intrusive, and efficient mapping technique was developed. The data collected was used to study the impact of proximity on DNS-based server selection, using four different proximity metrics: (1) autonomous system (AS) clustering – observing whether a client is in the same AS as its LDNS – concluded that LDNS is very good for coarse-grained server selection, as 64% of the associations belong to the same AS; (2) network clustering – observing whether a client is in the same network-aware cluster (NAC) as its LDNS – implied that DNS is less useful for finer-grained server selection, since only 16% of the clients and LDNS are in the same NAC; (3) traceroute divergence – the length of the divergent paths to the client and its LDNS from a probe point, using traceroute – implies that most clients are topologically close to their LDNS as viewed from a randomly chosen probe site; (4) round-trip time (RTT) correlation (some CDNs select servers based on the RTT between the CDN server and the client's LDNS) examines the correlation between the message RTTs from a probe point to the client and its LDNS; the results imply this is a reasonable metric to use to avoid really distant servers.

A study of the impact that client-LDNS associations have on DNS-based server selection concludes that knowing the client's IP address would allow more accurate server selection in a large number of cases. The optimality of the server selection also depends on the server density, placement, and selection algorithms.

Further information can be found at http://www.eecs.berkeley.edu/~zmao/myresearch.html or by contacting zmao@eecs.berkeley.edu.

GEOGRAPHIC PROPERTIES OF INTERNET ROUTING

Lakshminarayanan Subramanian, Venkata N. Padmanabhan, Microsoft Research; Randy H. Katz, University of California at Berkeley

Geographic information can provide insights into the structure and functioning of the Internet, including interactions between different autonomous systems, by analyzing certain properties of Internet routing. It can be used to measure and quantify certain routing properties, such as circuitous routing, hot-potato routing, and geographic fault tolerance.

Traceroute has been used to gather the required data, and the GeoTrack tool has been used to determine the location of the nodes along each network path. This enables the computation of "linearized distances," which is the sum of the geographic distances between successive pairs of routers along the path.

In order to measure the circuitousness of a path, a metric "distance ratio" has been defined as the ratio of the linearized distance of a path to the geographic distance between the source and destination of the path. From the data it has been observed that the circuitousness of a route depends on both the geographic and network location of the end hosts. A large value of the distance ratio enables us to flag paths that are highly circuitous, possibly (though not necessarily) because of routing anomalies. It is also shown that the minimum delay between end hosts is strongly correlated with the linearized distance of the path.
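As an illustration of the metric (a hedged sketch, not the authors' code), the snippet below sums great-circle distances between successive routers on a path and divides by the source-to-destination distance; the haversine formula and the sample coordinates are assumptions made for the example.

    #include <math.h>
    #include <stdio.h>

    #define EARTH_RADIUS_KM 6371.0

    typedef struct { double lat, lon; } coord_t;     /* degrees */

    /* Great-circle distance between two points (haversine formula). */
    static double geo_dist(coord_t a, coord_t b)
    {
        double rad  = M_PI / 180.0;
        double dlat = (b.lat - a.lat) * rad, dlon = (b.lon - a.lon) * rad;
        double h = sin(dlat / 2) * sin(dlat / 2) +
                   cos(a.lat * rad) * cos(b.lat * rad) * sin(dlon / 2) * sin(dlon / 2);
        return 2 * EARTH_RADIUS_KM * asin(sqrt(h));
    }

    /* Linearized distance: sum over successive router pairs along the path. */
    static double linearized_distance(const coord_t path[], int n)
    {
        double sum = 0;
        for (int i = 0; i + 1 < n; i++)
            sum += geo_dist(path[i], path[i + 1]);
        return sum;
    }

    int main(void)
    {
        /* hypothetical path: source, two intermediate routers, destination */
        coord_t path[] = { {37.87, -122.27}, {40.71, -74.01},
                           {38.90,  -77.04}, {37.77, -122.42} };
        int n = sizeof(path) / sizeof(path[0]);

        double lin    = linearized_distance(path, n);
        double direct = geo_dist(path[0], path[n - 1]);
        printf("distance ratio = %.2f\n", lin / direct);  /* 1.0 would be a perfectly direct route */
        return 0;
    }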

Geographic information can be used to study various aspects of wide-area Internet paths that traverse multiple ISPs. It was found that end-to-end Internet paths tend to be more circuitous than intra-ISP paths, the cause for this being the peering relationships between ISPs. Also, paths traversing substantial distances within two or more ISPs tend to be more circuitous than paths largely traversing only a single ISP. Another finding is that ISPs generally employ hot-potato routing.

Geographic information can also be used to capture the fact that two seemingly unrelated routers can be susceptible to correlated failures. By using the geographic information of routers, we can construct a geographic topology of an ISP. Using this, we can find the tolerance of an ISP's network to the total network failure in a geographic region.

For further information, contact lakme@cs.berkeley.edu.

PROVIDING PROCESS ORIGIN INFORMATION TO AID IN NETWORK TRACEBACK

Florian P. Buchholz, Purdue University; Clay Shields, Georgetown University

Network traceback is currently limited because host audit systems do not maintain enough information to match incoming network traffic to outgoing network traffic. The talk presented an alternative: assigning origin information to every process and logging it during interactive login creation.

The current implementation concentrates mainly on interactive sessions, in which an event is logged when a new connection is established using SSH or Telnet. The method is effective and could successfully determine stepping stones and reliably detect the source of a DDoS attack. The speaker then talked about the related work done in this area, which led to discussion of the latest development in the area of "host causality."

The speaker ended with the limitations and future directions of the framework. This framework could be extended and could find many applications in the area of security. The current implementation doesn't take care of startup scripts and cron jobs, but incorporating the origin information in the file system could solve this problem. In the current implementation, logging is just implemented as a proof of concept. It could be made safe in many ways, and this could be another important aspect of future work.

PROGRAMMING

Summarized by Amit Purohit

CYCLONE: A SAFE DIALECT OF C

Trevor Jim, AT&T Labs – Research; Greg Morrisett, Dan Grossman, Michael Hicks, James Cheney, Yanling Wang, Cornell University

Cyclone is designed to provide protection against attacks such as buffer overflows, format string attacks, and memory management errors. The current structure of C allows programmers to write vulnerable programs. Cyclone extends C so that it has the safety guarantees of Java while keeping C syntax, types, and semantics untouched.

The Cyclone compiler performs static analysis of the source code and inserts runtime checks into the compiled output at places where the analysis cannot determine that the code execution will not violate safety constraints. Cyclone imposes many restrictions to preserve safety, such as NULL checks. These checks do not exist in normal C.
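A small illustration in plain C (not actual Cyclone syntax): the code below compiles and runs as C but can index past a buffer and dereference NULL; Cyclone's bounds-carrying ("fat") and non-NULL pointer types are designed to reject or trap exactly these cases.

    #include <stdio.h>
    #include <string.h>

    /* Legal C, but unsafe: no bounds check on idx, no NULL check on s. */
    char nth_char(const char *s, int idx)
    {
        return s[idx];                 /* Cyclone's fat pointers carry bounds, so an
                                          out-of-range index faults cleanly instead of
                                          silently reading adjacent memory */
    }

    int main(void)
    {
        char buf[4];
        strcpy(buf, "abc");

        printf("%c\n", nth_char(buf, 2));    /* fine */
        printf("%c\n", nth_char(buf, 42));   /* undefined behavior in C; a checked
                                                dialect inserts a runtime bounds check */
        printf("%c\n", nth_char(NULL, 0));   /* NULL dereference; Cyclone's non-NULL
                                                pointer types rule this out statically */
        return 0;
    }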

The speaker then talked about some sample code written in Cyclone and how it tackles safety problems. Porting an existing C application to Cyclone is pretty easy, with fewer than 10% change required. The current implementation concentrates more on safety than on performance – hence the performance penalty is substantial. Cyclone was able to find many lingering bugs. The final step of the project will be to write a compiler to convert a normal C program to an equivalent Cyclone program.

COOPERATIVE TASK MANAGEMENT WITHOUT MANUAL STACK MANAGEMENT

Atul Adya, Jon Howell, Marvin Theimer, William J. Bolosky, John R. Douceur, Microsoft Research

The speaker described the definitions and motivations behind the distinct concepts of task management, stack management, I/O management, conflict management, and data partitioning. Conventional concurrent programming uses preemptive task management and exploits the automatic stack management of a standard language. In the second approach, cooperative tasks are organized as event handlers that yield control by returning control to the event scheduler, manually unrolling their stacks. In this project they have adopted a hybrid approach that makes it possible for both stack management styles to coexist. Thus, programmers can code assuming one or the other of these stack management styles will operate, depending upon the application. The speaker also gave a detailed example of how to use their mechanism to switch between the two styles.

The talk then continued with the implementation. They were able to preempt many subtle concurrency problems by using cooperative task management. Paying a cost up-front to reduce a subtle race condition proved to be a good investment.

Though the choice of task management is fundamental, the choice of stack management can be left to individual taste. This project enables use of any type of stack management in conjunction with cooperative task management.

IMPROVING WAIT-FREE ALGORITHMS FOR INTERPROCESS COMMUNICATION IN EMBEDDED REAL-TIME SYSTEMS

Hai Huang, Padmanabhan Pillai, Kang G. Shin, University of Michigan

The main characteristic of a real-time, time-sensitive system is its predictable response time. But concurrency management creates hurdles to achieving this goal because of the use of locks to maintain consistency. To solve this problem, many wait-free algorithms have been developed, but these are typically high-cost solutions. By taking advantage of the temporal characteristics of the system, however, the time and space overhead can be reduced.

The speaker presented an algorithm for temporal concurrency control and applied this technique to improve three wait-free algorithms. A single-writer multiple-reader wait-free algorithm and a double-buffer algorithm were proposed.
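A minimal sketch of the double-buffer idea under a temporal guarantee: timing analysis is assumed to ensure a reader always finishes within one writer period, so no locks are needed. This illustrates the general technique, not the authors' exact algorithms.

    #include <stdatomic.h>

    typedef struct { int data[16]; } sample_t;

    static sample_t slot[2];               /* two buffers shared by writer and reader */
    static _Atomic int latest = 0;         /* index of the most recently published slot */

    /* Writer (e.g., a periodic real-time task): fill the idle slot, then publish it. */
    void publish(const sample_t *s)
    {
        int next = 1 - atomic_load_explicit(&latest, memory_order_relaxed);
        slot[next] = *s;                                      /* no lock taken */
        atomic_store_explicit(&latest, next, memory_order_release);
    }

    /* Reader: snapshot the published index and copy that slot.  Safe only because
       the timing assumption rules out the writer reusing this slot mid-copy. */
    void snapshot(sample_t *out)
    {
        int cur = atomic_load_explicit(&latest, memory_order_acquire);
        *out = slot[cur];
    }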

Using their transformation mechanism, they achieved an improvement of 17–66% in ACET and a 14–70% reduction in memory requirements for IPC algorithms. This mechanism is extensible and can be applied to other non-blocking IPC algorithms as well. Future work involves reducing the synchronizing overheads in more general IPC algorithms with multiple-writer semantics.

MOBILITY

Summarized by Praveen Yalagandula

ROBUST POSITIONING ALGORITHMS FOR DISTRIBUTED AD-HOC WIRELESS SENSOR NETWORKS

Chris Savarese, Jan Rabaey, University of California, Berkeley; Koen Langendoen, Delft University of Technology

The Pico Radio Network comprises more than a hundred sensors, monitors, and actuators equipped with wireless connectivity, with the following properties: (1) no infrastructure; (2) computation in a distributed fashion instead of centralized computation; (3) dynamic topology; (4) limited radio range; (5) nodes that can act as repeaters; (6) devices of low cost and with low power; and (7) sparse anchor nodes – nodes with GPS capability. A positioning algorithm determines the geographical position of each node in the network.

In two-dimensional space, each node needs three reference positions to estimate its geographical position. There are two problems that make positioning difficult in the Pico Radio Network setting: (1) a sparse anchor node problem, and (2) a range error problem.

The two-phase approach that the authors have taken in solving the problem consists of (1) a hop-terrain algorithm – in this first phase, each node roughly guesses its location by the distance calculated using multi-hops to the anchor points – and (2) a refinement algorithm – in this second phase, each node uses its neighbors' positions to refine its own position estimate. To guarantee convergence, this approach uses confidence metrics, assigning a value of 1.0 for anchor nodes and 0.1 to start for other nodes, and increasing these with each iteration.
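One simple way to picture the refinement phase (a sketch under stated assumptions, not the paper's exact algorithm): each node pulls its estimate toward points lying at the measured range from each neighbor, weighting neighbors by their confidence, then raises its own confidence.

    #include <math.h>

    typedef struct { double x, y, conf; } node_t;   /* position estimate and confidence */

    /* One refinement iteration for `self`, given n neighbors and the
       measured range to each of them. */
    void refine_position(node_t *self, const node_t nbr[], const double range[], int n)
    {
        double sx = 0, sy = 0, sw = 0;

        for (int k = 0; k < n; k++) {
            double dx = self->x - nbr[k].x, dy = self->y - nbr[k].y;
            double d  = sqrt(dx * dx + dy * dy);
            if (d == 0)
                continue;
            /* point at the measured range from the neighbor, in the direction
               of our current estimate */
            double px = nbr[k].x + dx / d * range[k];
            double py = nbr[k].y + dy / d * range[k];
            sx += nbr[k].conf * px;
            sy += nbr[k].conf * py;
            sw += nbr[k].conf;
        }
        if (sw > 0) {
            self->x = sx / sw;
            self->y = sy / sw;
            /* confidence grows with each iteration, capped at the anchor value */
            self->conf = fmin(1.0, self->conf + 0.1);
        }
    }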

The simulation tool OMNeT++ was used for both phases and in various scenarios. The results show that they achieved position errors of less than 33% in a scenario with 5% anchor nodes, an average connectivity of 7, and 5% range measurement error.

Guidelines for anchor node deployment are high connectivity (>10), a reasonable fraction of anchor nodes (>5%), and, for anchor placement, covered edges.

APPLICATION-SPECIFIC NETWORK MANAGEMENT FOR ENERGY-AWARE STREAMING OF POPULAR MULTIMEDIA FORMATS

Surendar Chandra, University of Georgia, and Amin Vahdat, Duke University

The main hindrance in supporting the increasing demand for mobile multimedia on PDAs is the battery capacity of the small devices. The idle-time power consumption is almost the same as the receive power consumption on typical wireless network cards (1319mW vs. 1425mW), while the sleep state consumes far less power (177mW). To let the system consume energy proportional to the stream quality, the network card should transition to the sleep state aggressively between each packet.

The speaker presented previous work where different multimedia streams were studied for PDAs: MS Media, Real Media, and QuickTime. It was found that if the inter-packet gap is predictable, then huge savings are possible, for example about 80% savings in MS Media streams with few packet losses. The limitation of the IEEE 802.11b power-save mode comes into play when there are two or more nodes, producing a delay between beacon time and when the node receives/sends packets. This badly affects the streams, since higher energy is consumed while waiting.

The authors propose traffic shaping for energy conservation, where this is done by proxy such that packets arrive at regular intervals. This is achieved using a local proxy in the access point and a client-side proxy in the mobile host. Simulations show that traffic shaping reduces energy consumption and also reveals that the higher bandwidth streams have a lower energy metric (mJ/kB).
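A back-of-the-envelope sketch of why sleeping between packets pays off, using the card power numbers quoted above; the packet interval and wake-up cost are assumptions for the example.

    #include <stdio.h>

    int main(void)
    {
        /* power draw of a typical wireless card, from the talk (milliwatts) */
        const double P_IDLE = 1319.0, P_RECV = 1425.0, P_SLEEP = 177.0;

        /* assumed workload: one 10ms receive burst every 100ms, plus 2ms to wake up */
        const double interval = 0.100, recv = 0.010, wakeup = 0.002;

        /* stay idle between packets vs. sleep between packets */
        double e_idle  = P_RECV * recv + P_IDLE * (interval - recv);
        double e_sleep = P_RECV * recv + P_IDLE * wakeup
                       + P_SLEEP * (interval - recv - wakeup);

        printf("energy per interval: idle %.1f mJ, sleep %.1f mJ (%.0f%% saved)\n",
               e_idle, e_sleep, 100.0 * (1.0 - e_sleep / e_idle));
        return 0;
    }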

More information is at http://greenhouse.cs.uga.edu.

CHARACTERIZING ALERT AND BROWSE SERVICES OF MOBILE CLIENTS

Atul Adya, Paramvir Bahl, and Lili Qiu, Microsoft Research

Even though there is a dramatic increase in Internet access from wireless devices, there are not many studies done on characterizing this traffic. In this paper, the authors characterize the traffic observed on the MSN Web site with both notification and browse traces.


Around 33 million browsing accesses and about 325 million notification entries are present in the traces. Three types of analyses are done for each one of these two services: content analysis, concerning the most popular content categories and their distribution; popularity analysis, or the popularity distribution of documents; and user behavior analysis.

The analysis of the notification logs shows that document access rates follow a Zipf-like distribution, with most of the accesses concentrated on a small number of messages, and the accesses exhibit geographical locality – users from the same locality tend to receive similar notification content. The browser log analysis shows that a smaller set of URLs are accessed most times, though the access pattern does not fit any Zipf curve, and the highly accessed URLs remain stable. A correlation study between the notification and browsing services shows that wireless users have a moderate correlation of 0.12.

FREENIX TRACK SESSIONS

BUILDING APPLICATIONS

Summarized by Matt Butner

INTERACTIVE 3-D GRAPHICS APPLICATIONS FOR TCL

Oliver Kersting, Jürgen Döllner, Hasso Plattner Institute for Software Systems Engineering, University of Potsdam

The integration of 3-D image rendering functionality into a scripting language permits interactive and animated 3-D development and application without the formalities and precision demanded by low-level C/C++ graphics and visualization libraries. The large and complex C++ API of the Virtual Rendering System (VRS) can be combined with the conveniences of the Tcl scripting language. The mapping of class interfaces is done via an automated process and generates respective wrapper classes, all of which ensures complete API accessibility and functionality without any significant performance penalty. The general C++-to-Tcl mapper SWIG grants the necessary object and hierarchical abilities without the use of object-oriented Tcl extensions. The final API mapping techniques address C++ features such as classes, overloaded methods, enumerations, and inheritance relations. All are implemented in a proof-of-concept that maps the complete C++ API of the VRS to Tcl and are showcased in a complete interactive 3-D-map system.

THE AGFL GRAMMAR WORK LAB

Cornelis H.A. Koster, Erik Verbruggen, University of Nijmegen (KUN)

The growth and implementation of Natural Language Processing (NLP) is the cornerstone of the continued evolution and implementation of truly intelligent search machines and services. In part due to the growing collections of computer-stored, human-readable documents in the public domain, the implementation of linguistic analysis will become necessary for desirable precision and resolution. Subsequently, such tools and linguistic resources must be released into the public domain, and they have done so with the release of the AGFL Grammar Work Lab under the GNU Public License as a tool for linguistic research and the development of NLP-based applications.

The AGFL (Affix Grammars over a Finite Lattice) Grammar Work Lab meshes context-free grammars with finite set-valued features that are acceptable to a range of languages. In computer science terms, "Syntax rules are procedures with parameters and a non-deterministic execution." The English Phrases for Information Retrieval (EP4IR), released with the AGFL-GWL as a robust grammar of English, is an AGFL-GWL-generated English parser that outputs "Head/Modified" frames. The sentences "CompanyX sponsored this conference" and "This conference was sponsored by CompanyX" both generate [CompanyX[sponsored conference]]. The hope is that public availability of such tools will encourage further development of grammar and lexical software systems.

SWILL: A SIMPLE EMBEDDED WEB SERVER LIBRARY

Sotiria Lampoudi, David M. Beazley, University of Chicago

This paper won the FREENIX Best Student Paper award.

SWILL (Simple Web Interface Link Library) is a simple Web server in the form of a C library whose development was motivated by a wish to give cool applications an interface to the Web. The SWILL library provides a simple interface that can be efficiently implemented for tasks that vary from flexible Web-based monitoring to software debugging and diagnostics. Though originally designed to be integrated with high-performance scientific simulation software, the interface is generic enough to allow for unbounded uses. SWILL is a single-threaded server relying upon non-blocking I/O through the creation of a temporary server which services I/O requests. SWILL does not provide SSL or cryptographic authentication, but does have HTTP authentication abilities.

A fantastic feature of SWILL is its support for SPMD-style parallel applications which utilize MPI, proving valuable for Beowulf clusters and large parallel machines. Another practical application was the implementation of SWILL in a modified Yalnix emulator used by University of Chicago operating systems courses, which utilized the added Yalnix functionality for OS development and debugging. SWILL requires minimal memory overhead and relies upon the HTTP/1.0 protocol.

NETWORK PERFORMANCE

Summarized by Florian Buchholz

LINUX NFS CLIENT WRITE PERFORMANCE

Chuck Lever, Network Appliance; Peter Honeyman, CITI, University of Michigan

Lever introduced a benchmark to measure NFS client write performance. Client performance is difficult to measure due to hindrances such as poor hardware or bandwidth limitations. Furthermore, measuring application performance does not identify weaknesses specifically at the client side. Thus, a benchmark was developed trying to exercise only data transfers in one direction between server and application. For this purpose, the benchmark was based on the block sequential write portion of the Bonnie file system benchmark. Once a benchmark for NFS clients is established, it can be used to improve client performance.

The performance measurements were performed with an SMP Linux client and both a Linux NFS server and a Network Appliance F85 filer. During testing, a periodic jump in write latency time was discovered. This was due to a rather large number of pending write operations that were scheduled to be written after certain threshold values were exceeded. By introducing a separate daemon that flushes the cached write requests, the spikes could be eliminated, but as a result the average latency grows over time. The problem could be traced to a function that scans a linked list of write requests. After a hashtable was added to improve lookup performance, the latency improved considerably.
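The fix is the classic switch from a linear scan to a hashed lookup; the sketch below shows the shape of such a change (the structure fields and hash function are illustrative, not the actual Linux NFS client code).

    #include <stdlib.h>

    #define NBUCKETS 256

    struct write_req {
        unsigned long      page_index;   /* key: which page this write covers */
        struct write_req  *next;         /* chain within one hash bucket */
        /* ... data, flags, etc. ... */
    };

    static struct write_req *bucket[NBUCKETS];

    static unsigned hash_page(unsigned long idx)
    {
        return (unsigned)(idx * 2654435761UL) % NBUCKETS;   /* cheap multiplicative hash */
    }

    /* O(1) expected lookup instead of walking one long list of pending writes. */
    struct write_req *find_request(unsigned long page_index)
    {
        struct write_req *r = bucket[hash_page(page_index)];
        while (r && r->page_index != page_index)
            r = r->next;
        return r;
    }

    void insert_request(struct write_req *r)
    {
        unsigned h = hash_page(r->page_index);
        r->next = bucket[h];
        bucket[h] = r;
    }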

The improved client was then used to measure throughput against the two servers. A discrepancy between the Linux server and the filer test was noticed, and the reason for that was traced back to a global kernel lock that was unnecessarily held when accessing the network stack. After correcting this, performance improved further. However, the measurements showed that a client may run slower when paired with fast servers on fast networks. This is due to heavy client interrupt loads, more network processing on the client side, and more global kernel lock contention.

The source code of the project is available at http://www.citi.umich.edu/projects/nfs-perf/patches/.

A STUDY OF THE RELATIVE COSTS OF NETWORK SECURITY PROTOCOLS

Stefan Miltchev and Sotiris Ioannidis, University of Pennsylvania; Angelos Keromytis, Columbia University

With the increasing need for security and integrity of remote network services, it becomes important to quantify the communication overhead of IPSec and compare it to alternatives such as SSH, SCP, and HTTPS.

For this purpose, the authors set up three testing networks: a direct link, two hosts separated by two gateways, and three hosts connecting through one gateway. Protocols were compared in each setup, and manual keying was used to eliminate connection setup costs. For IPSec, the different encryption algorithms AES, DES, 3DES, hardware DES, and hardware 3DES were used. In detail, FTP was compared to SFTP, SCP, and FTP over IPSec; HTTP to HTTPS and HTTP over IPSec; and NFS to NFS over IPSec and local disk performance.

The result of the measurements was that IPSec outperforms other popular encryption schemes. Overall, unencrypted communication was fastest, but in some cases, like FTP, the overhead can be small. The use of crypto hardware can significantly improve performance. For future work, the inclusion of setup costs, hardware-accelerated SSL, SFTP, and SSH was mentioned.

CONGESTION CONTROL IN LINUX TCP

Pasi Sarolahti, University of Helsinki; Alexey Kuznetsov, Institute for Nuclear Research at Moscow

Having attended the talk and read the paper, I am still unclear about whether the authors are merely describing the design decisions of TCP congestion control or whether they are actually the creators of that part of the Linux code. My guess leans toward the former.

In the talk, the speaker compared the TCP protocol congestion control measures according to IETF and RFC specifications with the actual Linux implementation, which does conform to the basic principles but nevertheless has differences. A specific emphasis was placed on retransmission mechanisms and the congestion window. Also, several TCP enhancements – the NewReno algorithm, Selective ACKs (SACK), Forward ACKs (FACK) – were discussed and compared.

In some instances, Linux does not conform to the IETF specifications. The fast recovery does not fully follow RFC 2582, since the threshold for triggering retransmit is adjusted dynamically and the congestion window's size is not changed. Also, the roundtrip-time estimator and the RTO calculation differ from RFC 2988, since it uses more conservative RTT estimates and a minimum RTO of 200ms. The performance measures showed that with the additional Linux-specific features enabled, slightly higher throughput, more steady data flow, and fewer unnecessary retransmissions can be achieved.

XTREME XCITEMENT

Summarized by Steve Bauer

THE FUTURE IS COMING: WHERE THE X WINDOW SHOULD GO

Jim Gettys, Compaq Computer Corp.

Jim Gettys, one of the principal authors of the X Window System, outlined the near-term objectives for the system, primarily focusing on the changes and infrastructure required to enable replication and migration of X applications. Providing better support for this functionality would enable users to retrieve or duplicate X applications between their servers at home and work.


One interesting example of an application that currently is capable of migration and replication is Emacs. To create a new frame on DISPLAY, try "M-x make-frame-on-display <Return> DISPLAY <Return>".

However, technical challenges make replication and migration difficult in general. These include the "major headaches" of server-side fonts, the nonuniformity of X servers and screen sizes, and the need to appropriately retrofit toolkits. Authentication and authorization issues are obviously also important. The rest of the talk delved into some of the details of these interesting technical challenges.

HACKING IN THE KERNEL

Summarized by Hai Huang

AN IMPLEMENTATION OF SCHEDULER ACTIVATIONS ON THE NETBSD OPERATING SYSTEM

Nathan J. Williams, Wasabi Systems

Scheduler activation is an old idea. Basically, there are benefits and drawbacks to using solely kernel-level or user-level threading. Scheduler activation is able to combine the two layers of control to provide more concurrency in the system.

In his talk, Nathan gave a fairly detailed description of the implementation of scheduler activations in the NetBSD kernel. One important change in the implementation is to differentiate the thread context from the process context. This is done by defining a separate data structure for these threads and relocating some of the information that was embedded in the process context to these thread contexts. The stack was especially a concern, due to the upcall: special handling must be done to make sure that the upcall doesn't mess up the stack, so that the preempted user-level thread can continue afterwards. Lastly, Nathan explained that signals were handled by upcalls.


AUTHORIZATION AND CHARGING IN PUBLIC WLANS USING FREEBSD AND 802.1X

Pekka Nikander, Ericsson Research NomadicLab

802.1x standards are well known in the wireless community as link-layer authentication protocols. In this talk, Pekka explained some novel ways of using the 802.1x protocols that might be of interest to people on the move. It is possible to set up a public WLAN that would support various charging schemes via virtual tokens, which people can purchase or earn and later use.

This is implemented on FreeBSD using the netgraph utility. It is basically a filter in the link layer that differentiates traffic based on the MAC address of the client node, which is either authenticated, denied, or let through. The overhead for this service is fairly minimal.

ACPI IMPLEMENTATION ON FREEBSD

Takanori Watanabe, Kobe University

ACPI (Advanced Configuration and Power Interface) was proposed as a joint effort by Intel, Toshiba, and Microsoft to provide a standard and finer-control method of managing power states of individual devices within a system. Such low-level power management is especially important for those mobile and embedded systems that are powered by fixed-capacity energy batteries.

Takanori explained that the ACPI specification is composed of three parts: tables, BIOS, and registers. He was able to implement some functionalities of ACPI in a FreeBSD kernel. The ACPI Component Architecture was implemented by Intel, and it provides a high-level ACPI API to the operating system. Takanori's ACPI implementation is built upon this underlying layer of APIs.

ACCESS CONTROL

Summarized by Florian Buchholz

DESIGN AND PERFORMANCE OF THE OPENBSD STATEFUL PACKET FILTER (PF)

Daniel Hartmeier, Systor AG

Daniel Hartmeier described the new stateful packet filter (pf) that replaced IPFilter in the OpenBSD 3.0 release. IPFilter could no longer be included due to licensing issues, and thus there was a need to write a new filter making use of optimized data structures.

The filter rules are implemented as a linked list which is traversed from start to end. Two actions may be taken according to the rules, "pass" or "block": a "pass" action forwards the packet, and a "block" action will drop it. Where more than one rule matches a packet, the last rule wins. Rules that are marked as "final" will immediately terminate any further rule evaluation. An optimization called "skip-steps" was also implemented, where blocks of similar rules are skipped if they cannot match the current packet; these skip-steps are calculated when the rule set is loaded. Furthermore, a state table keeps track of TCP connections, and only packets that match the sequence numbers are allowed. UDP and ICMP queries and replies are considered in the state table, where initial packets will create an entry for a pseudo-connection with a low initial timeout value. The state table is implemented using a balanced binary search tree. NAT mappings are also stored in the state table, whereas application proxies reside in user space. The packet filter also is able to perform fragment reassembly and to modulate TCP sequence numbers to protect hosts behind the firewall.
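A minimal sketch of the evaluation semantics described above (last matching rule wins; "final" rules stop evaluation early); it illustrates the model only, not pf's actual data structures or skip-step optimization.

    #include <stdbool.h>
    #include <stddef.h>

    struct packet;                               /* opaque: headers, addresses, ports */

    enum action { PASS, BLOCK };

    struct rule {
        enum action  act;
        bool         final;                      /* terminates evaluation on match */
        bool       (*match)(const struct packet *);
        struct rule *next;
    };

    /* Walk the rule list from start to end; remember the last match. */
    enum action evaluate(const struct rule *rules, const struct packet *pkt)
    {
        enum action verdict = PASS;              /* assumed default policy */

        for (const struct rule *r = rules; r != NULL; r = r->next) {
            if (!r->match(pkt))
                continue;
            verdict = r->act;
            if (r->final)
                break;
        }
        return verdict;
    }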

Pf was compared against IPFilter as well as Linux's Iptables by measuring throughput and latency with increasing traffic rates and different packet sizes. In a test with a fixed rule set size of 100, Iptables outperformed the other two filters, whose results were close to each other. In a second test, where the rule set size was continually increased, Iptables consistently had about twice the throughput of the other two (which evaluate the rule set on both the incoming and outgoing interfaces). A third test compared only pf and IPFilter, using a single rule that created state in the state table, with a fixed-state entry size. Pf reached an overloaded state much later than IPFilter. The experiment was repeated with a variable-state entry size, and pf performed much better than IPFilter for a small number of states.

In general, rule set evaluation is expensive, and benchmarks only reflect extreme cases, whereas in real life other behavior should be observed. Furthermore, the benchmarks show that stateful filtering can actually improve performance, due to cheap state-table lookups as compared to rule evaluation.

ENHANCING NFS CROSS-ADMINISTRATIVE DOMAIN ACCESS

Joseph Spadavecchia and Erez Zadok, Stony Brook University

The speaker presented modifications to an NFS server that allow improved NFS access between administrative domains. A problem lies in the fact that NFS assumes a shared UID/GID space, which makes it unsafe to export files outside the administrative domain of the server. Design goals were to leave the protocol and existing clients unchanged, require a minimum amount of server changes, and provide flexibility and increased performance.

To solve the problem, two techniques are utilized: "range-mapping" and "file-cloaking." Range-mapping maps IDs between client and server. The mapping is performed on a per-export basis and has to be manually set up in an export file. The mappings can be 1-1, N-N, or N-1. In file-cloaking, the server restricts file access based on UID/GID and special cloaking-mask bits. Here, users can only access their own file permissions. The policy on whether or not the file is visible to others is set by the cloaking mask, which is logically ANDed with the file's protection bits. File-cloaking only works, however, if the client doesn't hold cached copies of directory contents and file attributes. Because of this, the clients are forced to re-read directories, by incrementing the mtime value of the directory each time it is listed.
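A hedged sketch of what a per-export range mapping can look like in code; the table layout and names are assumptions for illustration, not the modified server's implementation.

    #include <sys/types.h>

    /* One range-mapping entry from an export file:
       client UIDs [c_base, c_base+len) map to server UIDs [s_base, s_base+len). */
    struct id_range_map {
        uid_t    c_base;
        uid_t    s_base;
        unsigned len;          /* len == 1 gives a 1-1 mapping; larger values give N-N */
    };

    #define UNMAPPED_ID ((uid_t)65534)   /* e.g., squash to "nobody" if no range matches */

    uid_t map_client_uid(const struct id_range_map *map, int nmap, uid_t client_uid)
    {
        for (int i = 0; i < nmap; i++) {
            if (client_uid >= map[i].c_base &&
                client_uid <  map[i].c_base + map[i].len)
                return map[i].s_base + (client_uid - map[i].c_base);
        }
        return UNMAPPED_ID;
    }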

To test the performance of the modified server, five different NFS configurations were evaluated: an unmodified NFS server was compared against one server with the modified code included but not used, one with only range-mapping enabled, one with only file-cloaking enabled, and one version with all modifications enabled. For each setup, different file system benchmarks were run. The results show only a small overhead when the modifications are used, generally an increase of below 5%. Another experiment tested the performance of the system with an increasing number of mapped or cloaked entries. The results show that an increase from 10 to 1000 entries resulted in a maximum of about 14% cost in performance.

One member of the audience pointed out that if clients choose to ignore the changed mtimes from the server and thus still hold caches of the directory entries, the file-cloaking mechanism could be defeated. After a rather lengthy debate, the speaker had to concede that the model doesn't add any extra security. Another question was asked about scalability of the setup of range mapping. The speaker referred to application-level tools that could be developed for that purpose.

The software is available at ftp://ftp.fsl.cs.sunysb.edu/pub/enf/.

ENGINEERING OPEN SOURCE SOFTWARE

Summarized by Teri Lampoudi

NINGAUI: A LINUX CLUSTER FOR BUSINESS

Andrew Hume, AT&T Labs – Research; Scott Daniels, Electronic Data Systems Corp.

Ningaui is a general-purpose, highly available, resilient architecture built from commodity software and hardware. Emphatically, however, Ningaui is not a Beowulf. Hume calls the cluster design the "Swiss canton model," in which there are a number of loosely affiliated independent nodes with data replicated among them. Jobs are assigned by bidding and leases, and cluster services done as session-based servers are sited via generic job assignment. The emphasis is on keeping the architecture end-to-end, checking all work via checksums, and logging everything. The resilience and high availability required by their goal of 8-to-5 maintenance – vs. the typical 24-7 model, where people get paged whenever the slightest thing goes wrong, regardless of the time of day – are achieved by job restartability. Finally, all computation is performed on local data, without the use of NFS or network-attached storage.

Hume's message is a hopeful one: despite the many problems encountered – things like kernel and service limits, auto-installation problems, TCP storms, and the like – the existence of source code and the paranoid practice of logging and checksumming everything has helped. The final product performs reasonably well, and it appears that the resilient techniques employed do make a difference. One drawback, however, is that the software mentioned in the paper is not yet available for download.

CPCMS: A CONFIGURATION MANAGEMENT SYSTEM BASED ON CRYPTOGRAPHIC NAMES

Jonathan S. Shapiro and John Vanderburgh, Johns Hopkins University

This paper won the FREENIX Best Paper award.

The basic notion behind the project is the fact that everyone has a pet complaint about CVS, and yet it is currently the configuration manager in most widespread use. Shapiro has unleashed an alternative. Interestingly, he did not begin but ended with the usual host of reasons why CVS is bad. The talk instead began abruptly by characterizing the job of a software configuration manager, continued by stating the namespaces which it must handle, and wrapped up with the challenges faced.

X MEETS Z: VERIFYING CORRECTNESS IN THE PRESENCE OF POSIX THREADS

Bart Massey, Portland State University; Robert T. Bauer, Rational Software Corp.

Massey delivered a humorous talk on the insight gained from applying the Z formal specification notation to system software design, rather than the more informal analysis and design process normally used.

The story is told with respect to writing XCB, which replaces the Xlib protocol layer and is supposed to be thread friendly. But where threads are concerned, deadlock avoidance becomes a hard problem that cannot be solved in an ad-hoc manner, and full model checking is also too hard. In this case, Massey resorted to Z specification to model the XCB lower layer, abstract away locking and data transfers, and locate fundamental issues. Essentially, the difficulties of searching the literature and locating information relevant to the problem at hand were overcome. As Massey put it, "the formal method saved the day."

FILE SYSTEMS

Summarized by Bosko Milekic

PLANNED EXTENSIONS TO THE LINUX EXT2/EXT3 FILESYSTEM

Theodore Y. Ts'o, IBM; Stephen Tweedie, Red Hat

The speaker presented improvements to the Linux Ext2 file system, with the goal of allowing for various expansions while striving to maintain compatibility with older code. Improvements have been facilitated by a few extra superblock fields that were added to Ext2 not long before the Linux 2.0 kernel was released. The fields allow for file system features to be added without compromising the existing setup; this is done by providing bits indicating the impact of the added features with respect to compatibility. Namely, a file system with the "incompat" bit marked is not allowed to be mounted. Similarly, a "read-only" marking would only allow the file system to be mounted read-only.

Directory indexing replaces linear directory searches with a faster search using a fixed-depth tree and hashed keys. File system size can be dynamically increased, and the expanded inode, doubled from 128 to 256 bytes, allows for more extensions.

Other potential improvements were discussed as well, in particular pre-allocation for contiguous files, which allows for better performance in certain setups by pre-allocating contiguous blocks. Security-related modifications, extended attributes, and ACLs were mentioned. An implementation of these features already exists but has not yet been merged into the mainline Ext2/3 code.

RECENT FILESYSTEM OPTIMISATIONS ON FREEBSD

Ian Dowse, Corvil Networks; David Malone, CNRI, Dublin Institute of Technology

David Malone presented four important file system optimizations for the FreeBSD OS: soft updates, dirpref, vmiodir, and dirhash. It turns out that certain combinations of the optimizations (beautifully illustrated in the paper) may yield performance improvements of anywhere between 2 and 10 orders of magnitude for real-world applications.

All four of the optimizations deal with file system metadata. Soft updates allow for asynchronous metadata updates. Dirpref changes the way directories are organized, attempting to place child directories closer to their parents, thereby increasing locality of reference and reducing disk-seek times. Vmiodir trades some extra memory in order to achieve better directory caching. Finally, dirhash, which was implemented by Ian Dowse, changes the way in which entries are searched in a directory: specifically, FFS uses a linear search to find an entry by name, while dirhash builds a hashtable of directory entries on the fly. For a directory of size n with a working set of m files, a search that in certain cases could have been O(nm) has been reduced, due to dirhash, to effectively O(n + m).

FILESYSTEM PERFORMANCE AND SCALABILITY IN LINUX 2.4.17

Ray Bryant, SGI; Ruth Forester, IBM LTC; John Hawkes, SGI

This talk focused on performance evaluation of a number of file systems available and commonly deployed on Linux machines. Comparisons under various configurations of Ext2, Ext3, ReiserFS, XFS, and JFS were presented.

The benchmarks chosen for the data gathering were pgmeter, filemark, and AIM Benchmark Suite VII. Pgmeter measures the rate of data transfer of reads/writes of a file under a synthetic workload. Filemark is similar to postmark in that it is an operation-intensive benchmark, although filemark is threaded and offers various other features that postmark lacks. AIM VII measures performance for various file-system-related functionalities; it offers various metrics under an imposed workload, thus stressing the performance of the file system not only under I/O load but also under significant CPU load.

Tests were run on three different setups: a small, a medium, and a large configuration. ReiserFS and Ext2 appear at the top of the pile for smaller and medium setups. Notably, XFS and JFS perform worse for smaller system configurations than the others, although XFS clearly appears to generate better numbers than JFS. It should be noted that XFS seems to scale well under a higher load. This was most evident in the large-system results, where XFS appears to offer the best overall results.

THINGS TO THINK ABOUT

Summarized by Bosko Milekic

SPEEDING UP KERNEL SCHEDULER BY REDUCING CACHE MISSES

Shuji Yamamura, Akira Hirai, Mitsuru Sato, Masao Yamamoto, Akira Naruse, Kouichi Kumon, Fujitsu Labs

This was an interesting talk pertaining to the effects of cache coloring for task structures in the Linux kernel scheduler (Linux kernel 2.4.x). The speaker first presented some interesting benchmark numbers for the Linux scheduler, showing that as the number of processes on the task queue was increased, the performance decreased. The authors used some really nifty hardware to measure the number of bus transactions throughout their tests and were thus able to reasonably quantify the impact that cache misses had in the Linux scheduler.

Their experiments led them to implement a cache coloring scheme for task structures, which were previously aligned on 8KB boundaries and therefore were eventually being mapped to the same cache lines. This unfortunate placement of task structures in memory induced a significant number of cache misses as the number of tasks grew in the scheduler.

The implemented solution consisted of aligning task structures to more evenly distribute cached entries across the L2 cache. The result was inevitably fewer cache misses in the scheduler. Some negative effects were observed in certain situations; these were primarily due to more cache slots being used by task structure data in the scheduler, thus forcing data previously cached there to be pushed out.
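A sketch of the coloring idea: rather than placing every task structure at the same offset of its 8KB region (so they all compete for the same cache sets), shift each allocation by a per-task color. The constants and the allocation scheme are assumptions for illustration, not the authors' kernel change.

    #include <stdlib.h>
    #include <stdint.h>
    #include <stddef.h>

    #define REGION_SIZE 8192           /* task structures were 8KB-aligned, per the talk */
    #define CACHE_LINE  64
    #define NCOLORS     16             /* number of distinct offsets to cycle through */

    struct task_struct {               /* stand-in for the real, much larger structure */
        int  pid;
        char state;
        /* ... */
    };

    /* Allocate a task structure at a colored offset inside its region so that
       consecutive tasks map to different L2 cache sets.  (A real allocator would
       also remember the region base in order to free it later.) */
    struct task_struct *alloc_task(unsigned *next_color)
    {
        void *region;
        if (posix_memalign(&region, REGION_SIZE, REGION_SIZE) != 0)
            return NULL;

        size_t offset = (size_t)(*next_color) * CACHE_LINE;
        *next_color = (*next_color + 1) % NCOLORS;

        return (struct task_struct *)((uint8_t *)region + offset);
    }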

WORK-IN-PROGRESS REPORTS

Summarized by Brennan Reynolds

RESOURCE VIRTUALIZATION TECHNIQUES FOR WIDE-AREA OVERLAY NETWORKS

Kartik Gopalan, Stony Brook University

This work addressed the issue of provisioning a maximum number of virtual overlay networks (VONs) with diverse quality of service (QoS) requirements on a single physical network. Each of the VONs is logically isolated from others to ensure the QoS requirements. Gopalan mentioned several critical research issues with this problem that are currently being investigated. Dealing with how to provision the network at various levels (link, route, or path) and then enforce the provisioning at run-time is one of the toughest challenges. Currently, Gopalan has developed several algorithms to handle admission control, end-to-end QoS, route selection, scheduling, and fault tolerance in the network.

For more information, visit http://www.ecsl.cs.sunysb.edu.

VISUALIZING SOFTWARE INSTABILITY

Jennifer Bevan, University of California, Santa Cruz

Detection of instability in software has typically been an afterthought. The point in the development cycle when the software is reviewed for instability is usually after it is difficult and costly to go back and perform major modifications to the code base. Bevan has developed a technique to allow the visualization of unstable regions of code that can be used much earlier in the development cycle. Her technique creates a time series of dependency graphs that include clusters and lines called fault lines. From the graphs, a developer is able to easily determine where the unstable sections of code are and proactively restructure them. She is currently working on a prototype implementation.

For more information, visit http://www.cse.ucsc.edu/~jbevan/evo_viz/.

RELIABLE AND SCALABLE PEER-TO-PEER WEB DOCUMENT SHARING

Li Xiao, College of William and Mary

The idea presented by Xiao would allow end users to share the content of their Web browser caches with neighboring Internet users. The rationale for this is that today's end users are increasingly connected to the Internet over high-speed links, and the browser caches are becoming large storage systems. Therefore, if an individual accesses a page which does not exist in their local cache, Xiao is suggesting that they first query other end users for the content before trying to access the machine hosting the original. This strategy does have some serious problems associated with it that still need to be addressed, including ensuring the integrity of the content and protecting the identity and privacy of the end users.

SEREL: FAST BOOTING FOR UNIX

Leni Mayo, Fast Boot Software

Serel is a tool that generates a visual representation of a UNIX machine's boot-up sequence. It can be used to identify the critical path and can show if a particular service or process blocks for an extended period of time. This information could be used to determine where the largest performance gains could be realized by tuning the order of execution at boot-up. Serel creates a dependency graph, expressed in XML, during boot-up. This graph is then used to create the visual representation. Currently the tool only works on POSIX-compliant systems, but Mayo stated that he would be porting it to other platforms. Other extensions that were mentioned included having the metadata include the use of shared libraries and monitoring the suspend/resume sequence of portable machines.

For more information, visit http://www.fastboot.org.


BERKELEY DB XML

John Merrells, Sleepycat Software

Merrells gave a quick introduction and overview of the new XML library for Berkeley DB. The library specializes in storage and retrieval of XML content through a tool called XPath. The library allows for multiple containers per document and stores everything natively as XML. The user is also given a wide range of elements to create indices with, including edges, elements, text strings, or presence. The XPath tool consists of a query parser, generator, optimizer, and execution engine. To conclude his presentation, Merrells gave a live demonstration of the software.

For more information, visit http://www.sleepycat.com/xml.

CLUSTER-ON-DEMAND (COD)

Justin Moore, Duke University

Modern clusters are growing at a rapid rate. Many have pushed beyond the 5000-machine mark, and deploying them results in large expenses as well as management and provisioning issues. Furthermore, if the cluster is "rented" out to various users, it is very time-consuming to configure it to a user's specs, regardless of how long they need to use it. The COD work presented creates dynamic virtual clusters within a given physical cluster. The goal was to have a provisioning tool that would automatically select a chunk of available nodes and install the operating system and middleware specified by the customer in a short period of time. This would allow a greater use of resources, since multiple virtual clusters can exist at once. By using a virtual cluster, the size can be changed dynamically. Moore stated that they have created a working prototype and are currently testing and benchmarking its performance.

For more information, visit http://www.cs.duke.edu/~justin/cod/.


CATACOMB

Elias Sinderson, University of California, Santa Cruz

Catacomb is a project to develop a database-backed DASL module for the Apache Web server. It was designed as a replacement for the WebDAV module. The initial release of the module only contains support for a MySQL database, but could be extended to others. Sinderson briefly touched on the performance of the module: it was comparable to the mod_dav Apache module for all query types but search. The presentation was concluded with remarks about adding support for the lock method and including ACL specifications in the future.

For more information, visit http://ocean.cse.ucsc.edu/catacomb.

SELF-ORGANIZING STORAGE

Dan Ellard, Harvard University

Ellard's presentation introduced a storage system that tuned itself based on the workload of the system, without requiring the intervention of the user. The intelligence was implemented as a virtual, self-organizing disk that resides below any file system. The virtual disk observes the access patterns exhibited by the system and then attempts to predict what information will be accessed next. An experiment was done using an NFS trace at a large ISP on one of their email servers. Ellard's self-organizing storage system worked well, which he attributes to the fact that most of the files being requested were large email boxes. Areas of future work include exploration of the length and detail of the data collection stage, as well as the CPU impact of running the virtual disk layer.

VISUAL DEBUGGING

John Costigan, Virginia Tech

Costigan feels that the state of current debugging facilities in the UNIX world is not as good as it should be. He proposes the addition of several elements to programs like ddd and gdb. The first addition is including separate heap and stack components. Another is to display only certain fields of a data structure, and have the ability to zoom in and out if needed. Finally, the debugger should provide the ability to visually present complex data structures, including linked lists, trees, etc. He said that a beta version of a debugger with these abilities is currently available.

For more information, visit http://infovis.cs.vt.edu/datastruct.

IMPROVING APPLICATION PERFORMANCE THROUGH SYSTEM-CALL COMPOSITION

Amit Purohit, Stony Brook University

Web servers perform a huge number of context switches and internal data copying during normal operation. These two elements can drastically limit the performance of an application, regardless of the hardware platform it is run on. The Compound System Call (CoSy) framework is an attempt to reduce the performance penalty for context switches and data copies. It includes a user-level library and several kernel facilities that can be used via system calls. The library provides the programmer with a complete set of memory-management functions. Performance tests using the CoSy framework showed a large saving for context switches and data copies. The only area where the savings between conventional libraries and CoSy were negligible was for very large file copies.

For more information, visit http://www.cs.sunysb.edu/~purohit.

ELASTIC QUOTAS

John Oscar, Columbia University

Oscar began by stating that most resources are flexible, but that to date disks have not been. While approximately 80 percent of files are short-lived, disk quotas are hard limits imposed by administrators. The idea behind elastic quotas is to have a non-permanent storage area that each user can temporarily use. Oscar suggested the creation of an ehome directory structure to be used in deploying elastic quotas. Currently he has implemented elastic quotas as a stackable file system.

There were also several trends apparent in the data. Many users would ssh to their remote host (good) but then launch an email reader on the local machine, which would connect via POP to the same host and send the password in clear text (bad). This situation can easily be remedied by tunneling protocols like POP through SSH, but it appeared that many people were not aware this could be done. While most of his comments on the use of the wireless network were negative, the list of passwords he had collected showed that people were indeed using strong passwords. His recommendations were to educate and encourage people to use protocols like IPSec, SSH, and SSL when conducting work over a wireless network, because you never know who else is listening.

SYSTRACE

Niels Provos, University of Michigan

How can one be sure the applications one is using actually do exactly what their developers said they do? Short answer: you can't. People today are using a large number of complex applications, which means it is impossible to check each application thoroughly for security vulnerabilities. There are tools out there that can help, though. Provos has developed a tool called systrace that allows a user to generate a policy of acceptable system calls a particular application can make. If the application attempts to make a call that is not defined in the policy, the user is notified and allowed to choose an action. Systrace includes an auto-policy generation mechanism, which uses a training phase to record the actions of all programs the user executes. If the user chooses not to be bothered by applications breaking policy, systrace allows default enforcement actions to be set. Currently systrace is implemented for FreeBSD and NetBSD, with a Linux version coming out shortly.

For more information, visit http://www.citi.umich.edu/u/provos/systrace/.

VERIFIABLE SECRET REDISTRIBUTION FOR SURVIVABLE STORAGE SYSTEMS

Ted Wong, Carnegie Mellon University

Wong presented a protocol that can be used to re-create a file distributed over a set of servers, even if one of the servers is damaged. In his scheme, the user must choose how many shares the file is split into and the number of servers it will be stored across. The goal of this work is to provide persistent, secure storage of information even if it comes under attack. Wong stated that his design of the protocol was complete, and he is currently building a prototype implementation of it.
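The building block behind such schemes is threshold secret sharing: split a secret into n shares so that any k of them reconstruct it and fewer reveal nothing. The sketch below is plain Shamir sharing over a small prime field for illustration; it is not Wong's redistribution protocol, and a real system would share file blocks over a much larger field and add the verifiability layer.

    #include <stdint.h>
    #include <stdlib.h>

    #define P 2147483647UL                    /* prime modulus 2^31 - 1, illustration only */

    static uint64_t mulm(uint64_t a, uint64_t b) { return (a % P) * (b % P) % P; }
    static uint64_t addm(uint64_t a, uint64_t b) { return (a + b) % P; }

    static uint64_t powm(uint64_t b, uint64_t e)  /* for modular inverse via Fermat */
    {
        uint64_t r = 1;
        for (b %= P; e; e >>= 1, b = mulm(b, b))
            if (e & 1)
                r = mulm(r, b);
        return r;
    }

    /* Split `secret` into n shares (x[i], y[i]); any k of them reconstruct it.
       Assumes k <= 32 for the coefficient array. */
    void split_secret(uint64_t secret, int n, int k, uint64_t x[], uint64_t y[])
    {
        uint64_t coeff[32];
        coeff[0] = secret % P;
        for (int i = 1; i < k; i++)
            coeff[i] = (uint64_t)rand() % P;          /* use a real CSPRNG in practice */

        for (int j = 0; j < n; j++) {
            x[j] = (uint64_t)(j + 1);
            uint64_t acc = 0, xp = 1;
            for (int i = 0; i < k; i++) {             /* evaluate the polynomial at x[j] */
                acc = addm(acc, mulm(coeff[i], xp));
                xp  = mulm(xp, x[j]);
            }
            y[j] = acc;
        }
    }

    /* Lagrange interpolation at 0 using any k shares. */
    uint64_t reconstruct(int k, const uint64_t x[], const uint64_t y[])
    {
        uint64_t secret = 0;
        for (int i = 0; i < k; i++) {
            uint64_t num = 1, den = 1;
            for (int j = 0; j < k; j++) {
                if (j == i) continue;
                num = mulm(num, (P - x[j]) % P);        /* (0 - x_j) mod P */
                den = mulm(den, (x[i] + P - x[j]) % P); /* (x_i - x_j) mod P */
            }
            secret = addm(secret, mulm(y[i], mulm(num, powm(den, P - 2))));
        }
        return secret;
    }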

For more information, visit http://www.cs.cmu.edu/~tmwong/research.

THE GURU IS IN SESSIONS

LINUX ON LAPTOP/PDA

Bdale Garbee, HP Linux Systems Operation

Summarized by Brennan Reynolds

This was a roundtable-like discussion, with Garbee acting as moderator. He is currently in charge of the Debian distribution of Linux and is involved in porting Linux to platforms other than i386. He has successfully helped port the kernel to the Alpha, SPARC, and ARM, and is currently working on a version to run on VAX machines.

A discussion was held on which file system is best for battery life. The Reiser and Ext3 file systems were discussed; people commented that when using Ext3 on their laptops the disk would never spin down and thus was always consuming a large amount of power. One suggestion was to use a RAM-based file system for any volume that requires a large number of writes, and to only have the file system written to disk at infrequent intervals or when the machine is shut down or suspended.

A question was directed to Garbee about his knowledge of how the HP-Compaq merger would affect Linux. Garbee thought that the merger was good for Linux, both within the company and for the open source community as a whole. He stated that the new company would be the number one shipper of systems with Linux as the base operating system. He also fielded a question about the support of older HP servers and their ability to run Linux, saying that indeed Linux has been ported to them and he personally had several machines in his basement running it.

A question which sparked a large number of responses concerned problems people had with running Linux on mobile platforms. Most people actually did not have many problems at all. There were a few cases of a particular machine model not working, but there were only two widespread issues: docking stations and the new ACPI power management scheme. Docking stations still appear to be somewhat of a headache, but most people had developed ad-hoc solutions for getting them to work. The ACPI power management scheme developed by Intel does not appear to have a quick solution. Ted Ts'o, one of the head kernel developers, stated that there are fundamental architectural problems with the 2.4 series kernel that do not easily allow ACPI to be added. However, he also stated that ACPI is supported in the newest development kernel, 2.5, and will be included in 2.6.

The final topic was the possibility of purchasing a laptop without paying any charge/tax for Microsoft Windows. The consensus was that currently this is not possible. Even if the machine comes with Linux or without any operating system, the vendors have contracts in place which require them to pay Microsoft for each machine they ship if that machine is capable of running a Microsoft operating system. The only way this will change is if the demand for other operating systems increases to the point that vendors renegotiate their contracts with Microsoft, and no one saw this happening in the near future.

LARGE CLUSTERS

Andrew Hume, AT&T Labs – Research

Summarized by Teri Lampoudi

This was a session on clusters, big data, and resilient computing. Given that the audience was primarily interested in clusters, and not necessarily big data or beating nodes to a pulp, the conversation revolved mainly around what it takes to make a heterogeneous cluster resilient. Hume also presented a FREENIX paper on his current cluster project, which explained in more detail much of what was abstractly claimed in the guru session.

Hume's use of the word "cluster" referred not to what I had assumed would be a Beowulf-type system, but to a loosely coupled farm of machines in which concurrency was much less of an issue than it would be in a Beowulf. In fact, the architecture Hume described was designed to deal with large amounts of independent transaction processing, essentially the process of billing calls, which requires no interprocess message passing or anything of the sort a parallel scientific application might.

Where does the large data come in? Since the transactions in question consist of large numbers of flat files, a mechanism for getting the files onto nodes and off-loading the results is necessary. In this particular cluster, named Ningaui, this task is handled by the "replication manager," which generated a large amount of interest in the audience. All management functions in the cluster, as well as job allocation, constitute services that are to be bid for and leased out to nodes. Furthermore, failures are handled by simply repeating the failed task until it is completed properly, if that is possible, and if it is not, then looking through detailed logs to discover what went wrong. Logging and testing for the successful completion of each task are what make this model resilient. Every management operation is logged at some appropriate level of detail, generating 120MB/day, and every single file transfer is md5 checksummed. This was characterized by Hume as a "patient" style of management, where no one controls the cluster, scheduling behavior is emergent rather than stringently planned, and nodes are free to drop in and out of the cluster; a downside to this model is the increase in latency, offset by the fact that the cluster continues to function unattended outside of 8-to-5 support hours.

In response to questions about the projected scalability of such a scheme, Hume said the system would presumably scale from the present 60 nodes to about 100 without modifications, but that scaling higher would be a matter for further design. The guru session ended on a somewhat unrelated note regarding the problems of getting data off mainframes and onto Linux machines – issues of fixed- vs. variable-length encoding of data blocks resulting in useful information being stripped by FTP, and problems in converting COBOL copybooks to something useful on a C-based architecture. To reiterate Hume's claim, there are homemade solutions for some of these problems, and the reader should feel free to contact Mr. Hume for them.

WRITING PORTABLE APPLICATIONS

Nick Stoughton, MSB Consultants

Summarized by Josh Lothian

Nick Stoughton addressed a concern that developers are facing more and more: writing portable applications. He proposed that there is no such thing as a "portable application," only those that have been ported. Developers are still in search of the Holy Grail of applications programming: a write-once, compile-anywhere program. Languages such as Java are leading up to this, but virtual machines still have to be ported, paths will still be different, etc. Web-based applications are another area that could be promising for portable applications.

Nick went on to talk about the future of portability. The recently released POSIX 2001 standards are expected to help the situation. The 2001 revision expands the original POSIX spec into seven volumes. Even though POSIX 2001 is more specific than the previous release, Nick pointed out that even this standard allows for small differences where vendors may define their own behaviors. This can only hurt developers in the long run. Along these lines, it was pointed out that there may be room for automated tools that have the capability to check application code and determine the level of portability and standards compliance of the source.

NETWORK MANAGEMENT SYSTEM PERFORMANCE TUNING

Jeff Allen, Tellme Networks, Inc.

Summarized by Matt Selsky

Jeff Allen, author of Cricket, discussed network tuning. The basic idea of tuning is: measure, twiddle, and measure again. Cricket can be used for large installations, but it's overkill for smaller installations, for which MRTG is better suited. You don't need Cricket to do measurement. Doing something like a shell script that dumps data to Excel is good, but you need to get some sort of measurement. Cricket was not designed for billing; it was meant to help answer questions.

Useful tools and techniques covered included looking for the difference between machines in order to identify the causes for the differences; measuring, but also thinking about what you're measuring and why; and remembering to use strace and tcpdump to identify problems. Problems repeat themselves; performance tuning requires an intricate understanding of all the layers involved to solve the problem and simplify the automation. (Having a computer science background helps but is not essential.) If you can reproduce the problem, you then should investigate each piece to find the bottleneck. If you can't reproduce the problem, then measure. You need to understand all the system interactions. Measurement can help determine whether things are actually slow or if the user is imagining it.

Troubleshooting begins with a hunch, but scientific processes are essential. You should be able to determine the event stream, or when each event occurs; having observation logs can help. Some hints provided include checking cron for periodic anomalies, slowing down /bin/rm to avoid I/O overload, and looking for unusual kernel CPU usage.

Also, try to use lightweight monitoring to reduce overhead. You don't need to monitor every resource on every system, but you should monitor those resources on those systems that are essential. Don't check things too often, since you can introduce overhead that way. Lots of information can be gathered from the /proc file system interface to the kernel.
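
As a hedged illustration of this advice (not a tool Allen presented), a few lines of Python reading a Linux-style /proc at a low polling rate already give useful trend data without a heavyweight agent:

import time

INTERVAL = 300                       # seconds; poll rarely -- frequent checks add their own overhead

def sample():
    with open("/proc/loadavg") as f:
        load1 = float(f.read().split()[0])
    meminfo = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            meminfo[key] = int(rest.split()[0])     # values are reported in kB
    return load1, meminfo["MemFree"]

for _ in range(3):                   # short demo run; a real collector would loop forever
    load, free_kb = sample()
    print(time.strftime("%H:%M:%S"), "load1=%.2f" % load, "MemFree=%d kB" % free_kb)
    time.sleep(INTERVAL)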

BSD BOF

Host: Kirk McKusick, Author and Consultant

Summarized by Bosko Milekic

The BSD Birds of a Feather session at USENIX started off with five quick and informative talks and ended with an exciting discussion of licensing issues.

Christos Zoulas, for the NetBSD Project, went over the primary goals of the NetBSD Project (namely portability, a clean architecture, and security) and then proceeded to briefly discuss some of the challenges that the project has encountered. The successes and areas for improvement of 1.6 were then examined. NetBSD has recently seen various improvements in its cross-build system, packaging system, and third-party tool support. For what concerns the kernel, a zero-copy TCP/UDP implementation has been integrated, and a new pmap code for arm32 has been introduced, as have some other rather major changes to various subsystems. More information is available in Christos's slides, which are now available at http://www.netbsd.org/gallery/events/usenix2002/.

Theo de Raadt of the OpenBSD Project presented a rundown of various new things that the OpenBSD and OpenSSH teams have been looking at. Theo's talk included an amusing description (and pictures) of some of the events that transpired during OpenBSD's hack-a-thon, which occurred the week before USENIX '02. Finally, Theo mentioned that, following complications with the IPFilter license, the OpenBSD team had done a fairly extensive license audit.

Robert Watson of the FreeBSD Project brought up various examples of FreeBSD being used in the real world. He went on to describe a wide variety of changes that will surface with the upcoming FreeBSD release 5.0. Notably, 5.0 will introduce a re-architecting of the kernel aimed at providing much more scalable support for SMP machines. 5.0 will also include an early implementation of KSE, FreeBSD's version of scheduler activations, large improvements to pccard, and significant framework bits from the TrustedBSD project.

Mike Karels from Wind River Systems presented an overview of how BSD/OS has been evolving following the Wind River Systems acquisition of BSDi's software assets. Ernie Prabhakar from Apple discussed Apple's OS X and its success on the desktop. He explained how OS X aims to bring BSD to the desktop while striving to maintain a strong and stable core, one that is heavily based on BSD technology.

The BoF finished with a discussion on licensing; specifically, some folks questioned the impact of Apple's licensing for code from their core OS (the Darwin Project) on the BSD community as a whole.

CLOSING SESSION

HOW FLIES FLY

Michael H. Dickinson, University of California, Berkeley

Summarized by JD Welch

In this lively and offbeat talk, Dickinson discussed his research into the mechanisms of flight in the fruit fly (Drosophila melanogaster) and related the autonomic behavior to that of a technological system. Insects are the most successful organisms on earth, due in no small part to their early adoption of flight. Flies travel through space on straight trajectories interrupted by saccades, jumpy turning motions somewhat analogous to the fast, jumpy movements of the human eye. Using a variety of monitoring techniques, including high-speed video in a "virtual reality flight arena," Dickinson and his colleagues have observed that flies respond to visual cues to decide when to saccade during flight. For example, more visual texture makes flies saccade earlier (i.e., further away from the arena wall), described by Dickinson as a "collision control algorithm."

Through a combination of sensory inputs, flies can make decisions about their flight path. Flies possess a visual "expansion detector," which, at a certain threshold, causes the animal to turn in a certain direction. However, expansion in front of the fly sometimes causes it to land. How does the fly decide? Using the virtual reality device to replicate various visual scenarios, Dickinson observed that the flies fixate on vertical objects that move back and forth across the field. Expansion on the left side of the animal causes it to turn right, and vice versa, while expansion directly in front of the animal triggers the legs to flare out in preparation for landing.

If the eyes detect these changes, how are the responses implemented? The flies' wings have "power muscles," controlled by mechanical resonance (as opposed to the nervous system), which drive the wings, combined with neurally activated "steering muscles," which change the configuration of the wing joints. Subtle variations in the timing of impulses correspond to changes in wing movement, controlled by sensors in the "wing-pit." A small wing-like structure, the haltere, controls equilibrium by beating constantly.

Voluntary control is accomplished by the halteres, whose signals can interfere with the autonomic control of the stroke cycle. The halteres have "steering muscles" as well, and information derived from the visual system can turn off the haltere or trick it into a "virtual" problem requiring a response.

Dickinson has also studied the aerodynamics of insect flight using a device called the "Robo Fly," an oversized mechanical insect wing suspended in a large tank of mineral oil. Interestingly, Dickinson observed that the average lift generated by the flies' wings is under its body weight; the flies use three mechanisms to overcome this, including rotating the wings (rotational lift).

Insects are extraordinarily robust creatures, and because Dickinson analyzed the problem in a systems-oriented way, these observations and analysis are immediately applicable to technology: the response system can be used as an efficient search algorithm for control systems in autonomous vehicles, for example.

THE AFS WORKSHOP

Summarized by Garry Zacheiss, MIT

Love Hornquist-Astrand of the Arla development team presented the Arla status report. Arla 0.35.8 has been released. Scheduled for release soon are improved support for Tru64 UNIX, MacOS X, and FreeBSD, improved volume handling, and implementation of more of the vos/pts subcommands. It was stressed that MacOS X is considered an important platform, and that a GUI configuration manager for the cache manager, a GUI ACL manager for the finder, and a graphical login that obtains Kerberos tickets and AFS tokens at login time were all under development.

Future goals planned for the underlying AFS protocols include GSSAPI/SPNEGO support for Rx, performance improvements to Rx, an enhanced disconnected mode, and IPv6 support for Rx; an experimental patch is already available for the latter. Future Arla-specific goals include improved performance, partial file reading, and increased stability for several platforms. Work is also in progress on the RXGSS protocol extensions (integrating GSSAPI into Rx). A partial implementation exists, and work continues as developers find time.

Derrick Brashear of the OpenAFS development team presented the OpenAFS status report. OpenAFS 1.2.5 was released immediately prior to the conference; it fixed a remotely exploitable denial-of-service attack in several OpenAFS platforms, most notably IRIX and AIX. Future work planned for OpenAFS includes better support for MacOS X, including working around Finder interaction issues. Better support for the BSDs is also planned: FreeBSD has a partially implemented client, while NetBSD and OpenBSD have only server programs available right now. AIX 5, Tru64 5.1A, and MacOS X 10.2 (a.k.a. Jaguar) are all planned for the future. Other planned enhancements include nested groups in the ptserver (code donated by the University of Michigan awaits integration), disconnected AFS, and further work on a native Windows client. Derrick stated that the guiding principles of OpenAFS were to maintain compatibility with IBM AFS, support new platforms, ease the administrative burden, and add new functionality.

AFS performance was discussed. The openafs.org cell consists of two file servers, one in Stockholm, Sweden, and one in Pittsburgh, PA. AFS works reasonably well over trans-Atlantic links, although OpenAFS clients don't determine which file server to talk to very effectively. Arla clients use RTTs to the server to determine the optimal file server to fetch replicated data from. Modifications to OpenAFS to support this behavior in the future are desired.

Jimmy Engelbrecht and Harald Barth of KTH discussed their AFSCrawler script, written to determine how many AFS cells and clients were in the world, what implementations/versions they were (Arla vs. IBM AFS vs. OpenAFS), and how much data was in AFS. The script unfortunately triggered a bug in IBM AFS 3.6–derived code, causing some clients to panic while handling a specific RPC. This has since been fixed in OpenAFS 1.2.5 and the most recent IBM AFS patch level, and all AFS users are strongly encouraged to upgrade. No release of Arla is vulnerable to this particular denial-of-service attack. There was an extended discussion of the usefulness of this exploration. Many sites believed this was useful information and such scanning should continue in the future, but only on an opt-in basis.

Many sites face the problem of managing Kerberos/AFS credentials for batch-scheduled jobs. Specifically, most batch processing software needs to be modified to forward tickets as part of the batch submission process, renew tickets and tokens while the job is in the queue and for the lifetime of the job, and properly destroy credentials when the job completes. Ken Hornstein of NRL was able to pay a commercial vendor to support Kerberos 4/5 credential management in their product, although they did not implement AFS token management. MIT has implemented some of the desired functionality in OpenPBS and might be able to make it available to other interested sites.

Tools to simplify AFS administration were discussed, including:

AFS Balancer: A tool written by CMU to automate the process of balancing disk usage across all servers in a cell. Available from ftp://ftp.andrew.cmu.edu/pub/AFS-Tools/balance-1.1b.tar.gz.

Themis: KTH's enhanced version of the AFS tool "package" for updating files on local disk from a central AFS image. KTH's enhancements include allowing the deletion of files, simplifying the process of adding a file, and allowing the merging of multiple rule sets for determining which files are updated. Themis is available from the Arla CVS repository.

Stanford was presented as an example of a large AFS site. Stanford's AFS usage consists of approximately 14TB of data in AFS, in the form of approximately 100,000 volumes; 33TB of storage is available in their primary cell, ir.stanford.edu. Their file servers consist entirely of Solaris machines, running a combination of Transarc 3.6 patch-level 3 and OpenAFS 1.2.x, while their database servers run OpenAFS 1.2.2. Their cell consists of 25 file servers, using a combination of EMC and Sun StorEdge hardware. Stanford continues to use the kaserver for their authentication infrastructure, with future plans to migrate entirely to an MIT Kerberos 5 KDC.

Stanford has approximately 3400 clients on their campus, not including SLAC (Stanford Linear Accelerator); approximately 2100 AFS clients from outside Stanford contact their cell every month. Their supported clients are almost entirely IBM AFS 3.6 clients, although they plan to release OpenAFS clients soon. Stanford currently supports only UNIX clients. There is some on-campus presence of Windows clients, but Stanford has never publicly released or supported it. They do intend to release and support the MacOS X client in the near future.

All Stanford students, faculty, and staff are assigned AFS home directories, with a default quota of 50MB, for a total of approximately 550GB of user home directories. Other significant uses of AFS storage are data volumes for workstation software (400GB) and volumes for course Web sites and assignments (100GB).

AFS usage at Intel was also presented. Intel has been an AFS site since 1994. They had bad experiences with the IBM 3.5 Linux client; their experience with OpenAFS on Linux 2.4.x kernels has been much better. They use and are satisfied with the OpenAFS IA64 Linux port. Intel has hundreds of OpenAFS 1.2.3 and 1.2.4 clients in many production cells accessing data stored on IBM AFS file servers. They have not encountered any interoperability issues. Intel has some concerns about OpenAFS: they would like to purchase commercial support for OpenAFS and to see OpenAFS support for HP-UX on both PA-RISC and Itanium hardware. HP-UX support is currently unavailable due to a specific HP-UX header file being unavailable from HP; this may be available soon. Intel has not yet committed to migrating their file servers to OpenAFS and are unsure if they will do so without commercial support.

Backups are a traditional topic of discussion at AFS workshops, and this time was no exception. Many users complain that the traditional AFS backup tools ("backup" and "butc") are complex and difficult to automate, requiring many home-grown scripts and much user intervention for error recovery. An additional complaint was that the traditional AFS tools do not support file- or directory-level backups and restores; data must be backed up and restored at the volume level.

Mitch Collinsworth of Cornell presented work to make AMANDA, the free backup software from the University of Maryland, suitable for AFS backups. Using AMANDA for AFS backups allows one to share AFS backup tapes with non-AFS backups, easily run multiple backups in parallel, automate error recovery, and provide a robust degraded mode that prevents tape errors from stopping backups altogether. Their implementation allows for full volume restores as well as individual directory and file restores. They have finished coding this work and are in the process of testing and documenting it.

Peter Honeyman of CITI at the University of Michigan spoke about work he has proposed to replace Rx with RPCSEC GSS in OpenAFS; this would allow AFS to use a TCP-based transport mechanism rather than the UDP-based Rx, and possibly gain better congestion control, dynamic adaptation, and fragmentation avoidance as a result. RPCSEC GSS uses the GSSAPI to authenticate SUN ONC RPC. RPCSEC GSS is transport agnostic, provides strong security, is a developing Internet standard, and has multiple open source implementations. Backward compatibility with existing AFS servers and clients is an important goal of this project.


THE JOY OF BREAKING THINGS

Pat Parseghian, Transmeta

Summarized by Juan Navarro

Pat Parseghian described her experience in testing the Crusoe microprocessor at the Transmeta Lab for Compatibility (TLC), where the motto is "You make it, we break it" and the goal is to make engineers miserable.

She first gave an overview of Crusoe, whose most distinctive feature is a code-morphing layer that translates x86 instructions into native VLIW. Such peculiarity makes the testing process particularly challenging, because it involves testing two processors (the "external" x86 and the "internal" VLIW) and also because of variations in how the code-morphing layer works (it may interpret code the first few times it sees an instruction sequence and translate and save in a translation cache afterwards). There are also reproducibility issues, since the VLIW processor may run at different speeds to save energy.

Pat then described the tests that they subject the processor to (hardware compatibility tests, common applications, operating systems, envelope-pushing games, and hard-to-acquire legacy applications) and the issues that must be considered when defining testing policies. She gave some testing tips, including organizational issues like tracking resources and results.

To illustrate the testing process, Pat gave a list of possible causes of a system crash that must be investigated. If it is the silicon, then it might be because it is damaged, or because of a manufacturing problem or a design flaw. If the problem is in the code-morphing layer, is the fault with the interpreter or with the translator? The fault could also be external to the processor: it could be the homemade system BIOS, a faulty mainboard, or an operator error. Or Transmeta might not be at fault at all; the crash might be due to a bug in the OS or the application. The key to pinpointing the cause of a problem is to isolate it by identifying the conditions that reproduce the problem and repeating those conditions in other test platforms and non-Crusoe systems. Then relevant factors must be identified based on product knowledge, past experience, and common sense.

To conclude, Pat suggested that some of TLC's testing lessons can be applied to products that the audience was involved in, and assured us that breaking things is fun.

TECHNOLOGY, LIBERTY, FREEDOM, AND WASHINGTON

Alan Davidson, Center for Democracy and Technology

Summarized by Steve Bauer

"Experience should teach us to be most on our guard to protect liberty when the Government's purposes are beneficent. Men born to freedom are naturally alert to repel invasion of their liberty by evil-minded rulers. The greatest dangers to liberty lurk in insidious encroachment by men of zeal, well-meaning but without understanding." – Louis Brandeis, Olmstead v. US

Alan Davidson, an associate director at the CDT (http://www.cdt.org) since 1996, concluded his talk with this quote. In many ways it aptly characterizes the importance to the USENIX community of the topics he covered. The major themes of the talk were the impact of law on individual liberty, system architectures, and appropriate responses by the technical community.

The first part of the talk provided the audience with an overview of legislation either already introduced or likely to be introduced in the US Congress. This included various proposals to protect children, such as relegating all sexual content to a .xxx domain or providing kid-safe zones such as .kids.us. Other similar laws discussed were the Children's Online Protection Act and legislation dealing with virtual child pornography.

Other, lower-profile pieces covered were laws prohibiting false contact information in emails and domain registration databases. Similar proposals exist that would prohibit misleading subject lines in emails and require messages to clearly identify if they are advertisements. Briefly mentioned were laws impacting online gambling.

Bills and laws affecting the architectural design of systems and networks, and various efforts to establish boundaries and borders in cyberspace, were then discussed. These included the Consumer Broadband and Television Promotion Act and the French cases against Yahoo and its CEO for allowing the sale of Nazi materials on the Yahoo auction site.

A major part of the talk focused on the technological aspects of the USA-Patriot Act, passed in the wake of the September 11 attacks. Topics included "pen registers" for the Internet, expanded government power to conduct secret searches, roving wiretaps, nationwide service of warrants, sharing grand jury and intelligence information, and the establishment of an identification system for visitors to the US.

The technological community needs to be aware of and care about the impact of law on individual liberty and the systems the community builds. Specific suggestions included designing privacy and anonymity into architectures and limiting information collection, particularly any personally identifiable data, to only what is essential. Finally, Alan emphasized that many people simply do not understand the implications of law. Thus the technology community has an important role in helping make these implications clear.

TAKING AN OPEN SOURCE PROJECT TO MARKET

Eric Allman, Sendmail, Inc.

Summarized by Scott Kilroy

Eric started the talk with the upsides and downsides to open source software. The upsides include rapid feedback, high-quality work (mostly), lots of hands, and an intensely engineering-driven approach. The downsides include no support structure, limited project marketing expertise, and volunteers having limited time.

In 1998, Sendmail was becoming a success disaster. A success disaster includes two key elements: (1) volunteers start spending all their time supporting current functionality instead of enhancing it (this heavy burden can lead to project stagnation and eventually death); (2) competition from better-funded sources can lead to FUD (fear, uncertainty, and doubt) about the project.

Eric then led listeners down the path he took to turn Sendmail into a business. Eric envisioned Sendmail Inc. as a smallish and close-knit company, but soon realized that the family atmosphere he desired could not last as the company grew. He observed that maybe only the first 10 people will buy into your vision, after which each individual is primarily interested in something else (e.g., power, wealth, or status). He warns that there will be people working for you that you do not like. Eric particularly did not care for salespeople, but he emphasized how important sales, marketing, finance, and legal people are to companies.

The nature of companies in infancy can be summed up as: you need money, and you need investors in order to get money. Investors want a return on investment, so money management becomes critical. Companies therefore must optimize money functions. Eric's experiences with investors were lessons all in themselves. More than giving money, good investors can provide connections and insight, so you don't want investors who don't believe in you.

A company must have bug history, change management, and people who understand the product and have a sense of the target market. Eric now sees the importance of marketing and target research. A company can never forget the simple equation that drives business: profit = revenue - expense.

Finally, the concept of value should be based on what is valuable to the customer. Eric observed that his customers needed documentation, extra value in the product that they couldn't get elsewhere, and technical support.

Eric learned hard lessons along the way. If you want to do interesting open source, it might be best not to be too successful: you don't want to let the larger dragons (companies) notice you. If you want a commercial user base, you have to manage process, watch corporate culture, provide value to customers, watch bottom lines, and develop a thick skin.

Starting a company is easy, but perpetuating it is extremely difficult.

INFORMATION VISUALIZATION FOR SYSTEMS PEOPLE

Tamara Munzner, University of British Columbia

Summarized by JD Welch

Munzner presented interesting evidence that information visualization, through interactive representations of data, can help people perform tasks more effectively and reduce the load on working memory.

A popular technique in visualizing data is abstracting it through node-link representation – for example, a list of cross-references in a book. Representing the relationships between references in a graphical (node-link) manner "offloads cognition to the human perceptual system." These graphs can be produced manually, but automated drawing allows for increased complexity and quicker production times.

The way in which people interpret visual data is important in designing effective visualization systems. Attention to pre-attentive visual cues is key to success. Picking out a red dot among a field of identically shaped blue dots is significantly faster than doing so when the same data is represented in a mixture of hue and shape variations. There are many pre-attentive visual cues, including size, orientation, intersection, and intensity. The accuracy of these cues can be ranked, with position triggering high accuracy and color or density on the low end.

Visual cues combined with different data types – quantitative (e.g., 10 inches), ordered (small, medium, large), and categorical (apples, oranges) – can also be ranked, with spatial position being best for all types. Beyond this commonality, however, accuracy varied widely, with length ranked second for quantitative data, eighth for ordinal, and ninth for categorical data types.

These guidelines about visual perception can be applied to information visualization, which, unlike scientific visualization, focuses on abstract data and choice of specialization. Several techniques are used to clarify abstract data, including multi-part glyphs, where changes in individual parts are incorporated into an easier-to-understand gestalt, as well as interactivity, motion, and animation. In large data sets, techniques like "focus + context," where a zoomed portion of the graph is shown along with a thumbnail view of the entire graph, are used to minimize user disorientation.

Future problems include dealing with huge databases such as the Human Genome, reckoning with dynamic data like the changing structure of the Web or real-time network monitoring, and transforming "pixel-bound" displays into large "digital wallpaper"-type systems.

FIXING NETWORK SECURITY BY HACKING THE BUSINESS CLIMATE

Bruce Schneier, Counterpane Internet Security

Summarized by Florian Buchholz

Bruce Schneier identified security as one of the fundamental building blocks of the Internet. A certain degree of security is needed for all things on the Net, and the limits of security will become limits of the Net itself. However, companies are hesitant to employ security measures. Science is doing well, but one cannot see the benefits in the real world. An increasing number of users means more problems affecting more people. Old problems such as buffer overflows haven't gone away, and new problems show up. Furthermore, the amount of expertise needed to launch attacks is decreasing.

Schneier argues that complexity is the enemy of security, and that while security is getting better, the growth in complexity is outpacing it. As security is fundamentally a people problem, one shouldn't focus on technologies but rather should look at businesses, business motivations, and business costs. Traditionally, one can distinguish between two security models: threat avoidance, where security is absolute, and risk management, where security is relative and one has to mitigate the risk with technologies and procedures. In the latter model, one has to find a point with reasonable security at an acceptable cost.

After a brief discussion of how security is handled in businesses today, in which he concluded that businesses "talk big about it but do as little as possible," Schneier identified four necessary steps to fix the problems.

First, enforce liabilities. Today no real consequences have to be feared from security incidents. Holding people accountable will increase the transparency of security and give an incentive to make processes public. As possible options to achieve this, he listed industry-defined standards, federal regulation, and lawsuits. The problems with enforcement, however, lie in difficulties associated with the international nature of the problem and the fact that complexity makes assigning liabilities difficult. Furthermore, fear of liability could have a chilling effect on free software development and could stifle new companies.


As a second step, Schneier identified the need to allow partners to transfer liability. Insurance would spread liability risk among a group and would be a CEO's primary risk analysis tool. There is a need for standardized risk models and protection profiles, and for more security as opposed to empty press releases. Schneier predicts that insurance will create a marketplace for security where customers and vendors will have the ability to accept liabilities from each other. Computer-security insurance should soon be as common as household or fire insurance, and from that development insurance will become the driving factor of the security business.

The next step is to provide mechanisms to reduce risks, both before and after software is released; techniques and processes to improve software quality, as well as an evolution of security management, are therefore needed. Currently, security products try to rebuild the walls – such as physical badges and entry doorways – that were lost when getting connected. Schneier believes this "fortress metaphor" is bad; one should think of the problem more in terms of a city. Since most businesses cannot afford proper security, outsourcing is the only way to make security scalable. With outsourcing, the concept of best practices becomes important, and insurance companies can be tied to them; outsourcing will level the playing field.

As a final step, Schneier predicts that rational prosecution and education will lead to deterrence. He claims that people feel safe because they live in a lawful society, whereas the Internet is classified as lawless, very much like the American West in the 1800s or a society ruled by warlords. This is because it is difficult to prove who an attacker is, and prosecution is hampered by complicated evidence gathering and irrational prosecution. Schneier believes, however, that education will play a major role in turning the Internet into a lawful society. Specifically, he pointed out that we need laws that can be explained.

Risks won't go away; the best we can do is manage them. A company able to manage risk better will be more profitable, and we need to give CEOs the necessary risk-management tools.

There were numerous questions. Listed below are the more complicated ones, in a paraphrased Q&A format.

Q: Will liability be effective? Will insurance companies be willing to accept the risks? A: The government might have to step in; it needs to be seen how it plays out.

Q: Is there an analogy to the real world in the fact that in a lawless society only the rich can afford security? Security solutions differentiate between classes. A: Schneier disagreed with the statement, giving an analogy to front-door locks, but conceded that there might be special cases.

Q: Doesn't homogeneity hurt security? A: Homogeneity is oversold. Diverse types can be more survivable, but given the limited number of options, the difference will be negligible.

Q: Regulations in airbag protection have led to deaths in some cases. How can we keep the pendulum from swinging the other way? A: Lobbying will not be prevented. An imperfect solution is probable; there might be reversals of requirements, such as the airbag one.

Q: What about personal liability? A: This will be analogous to auto insurance; liability comes with computer/Net access.

Q: If the rule of law is to become reality, there must be a law enforcement function that applies to a physical space. You cannot do that with any existing government agency for the whole world. Should an organization like the UN assume that role? A: Schneier was not convinced global enforcement is possible.

Q: What advice should we take away from this talk? A: Liability is coming. Since the network is important to our infrastructure, eventually the problems will be solved in a legal environment. You need to start thinking about how to solve the problems and how the solutions will affect us.

The entire speech (including slides) is available at http://www.counterpane.com/presentation4.pdf.

LIFE IN AN OPEN SOURCE STARTUP

Daryll Strauss, Consultant

Summarized by Teri Lampoudi

This talk was packed with morsels of insight and tidbits of information about what life in an open source startup is like (though "was like" might be more appropriate), what the issues are in starting, maintaining, and commercializing an open source project, and the way hardware vendors treat open source developers. Strauss began by tracing the timeline of his involvement with developing 3-D graphics support for Linux: from the 3dfx Voodoo1 driver for Linux in 1997, to the establishment of Precision Insight in 1998, to its buyout by VA Linux, to the dismantling of his group by VA, to what the future may hold.

Obviously, there are benefits from doing open source development, both for the developer and for the end user. An important point was that open source develops entire technologies, not just products. One misapprehension concerns the inherent difficulty of managing a project that accepts code from a large number of developers, some volunteer and some paid. The notion that code can just be thrown over the wall into the world is a Big Lie.

How does one start and keep afloat an open source development company? In the case of Precision Insight, the subject matter – developing graphics and OpenGL for Linux – required a considerable amount of expertise: intimate knowledge of the hardware, the libraries, X, and the Linux kernel, as well as the end applications. Expertise is marketable. What helps even further is having a visible virtual team of experts, people who have an established track record of contributions to open source in the particular area of expertise. Support from the hardware and software vendors came next. While everyone was willing to contract PI to write the portions of code that were specific to their hardware, no one wanted to pay the bill for developing the underlying foundation that was necessary for the drivers to be useful. In the PI model, the infrastructure cost was shared by several customers at once, and the technology kept moving. Once a driver is written for a vendor, the vendor is perfectly capable of figuring out how to write the next driver necessary, thereby obviating the need to contract the job. This, and questions of protecting intellectual property – hardware design in this case – are deeply problematic with the open source mode of development. This is not to say that there is no way to get over them, but that they are likely to arise more often than not.

Management issues seem equally important. In the PI setting, contracts were flexible – both a good and a bad thing – and developers overworked. Further, development "in a fishbowl," that is, under public scrutiny, is not easy. Strauss stressed the value of good communication and the use of various out-of-sync communication methods like IRC and mailing lists.

Finally, Strauss discussed what portions of software can be free and what can be proprietary. His suggestion was that horizontal markets want to be free, whereas vertically one can develop proprietary solutions. The talk closed with a small stroll through the problems that project politics brings up, from things like merging branches to mailing list and source tree readability.

GENERAL TRACK SESSIONS

FILE SYSTEMS

Summarized by Haijin Yan

STRUCTURE AND PERFORMANCE OF THE DIRECT ACCESS FILE SYSTEM

Kostas Magoutis, Salimah Addetia, Alexandra Fedorova, and Margo I. Seltzer, Harvard University; Jeffrey Chase, Andrew Gallatin, Richard Kisley, and Rajiv Wickremesinghe, Duke University; and Eran Gabber, Lucent Technologies

This paper won the Best Paper award

The Direct Access File System (DAFS) is a new standard for network-attached storage over direct-access transport networks. DAFS takes advantage of user-level network interface standards that enable client-side functionality in which remote data access resides in a library rather than in the kernel. This reduces the memory-copy overhead for data movement as well as protocol overhead. Remote Direct Memory Access (RDMA) is a direct access transport network that allows the network adapter to reduce copy overhead by accessing application buffers directly.

This paper explores the fundamental structural and performance characteristics of network file access using a user-level file system structure on a direct-access transport network with RDMA. It describes DAFS-based client and server reference implementations for FreeBSD and reports experimental results comparing DAFS to a zero-copy NFS implementation. It illustrates the benefits and trade-offs of these techniques to provide a basis for informed choices about deployment of DAFS-based systems and similar extensions to other network file protocols, such as NFS. Experiments show that DAFS gives applications direct control over an I/O system and increases the client CPU's usage while the client is doing I/O. Future work includes how to address longstanding problems related to the integration of the application and file system for high-performance applications.

CONQUEST: BETTER PERFORMANCE THROUGH A DISK/PERSISTENT-RAM HYBRID FILE SYSTEM

An-I A. Wang, Peter Reiher, and Gerald J. Popek, UCLA; Geoffrey M. Kuenning, Harvey Mudd College

Motivated by the declining cost of persistent RAM, the authors propose the Conquest file system, which stores all small files, metadata, executables, and shared libraries in persistent RAM; disks hold only the data content of the remaining large files. Compared to alternatives such as caching and RAM file systems, Conquest has the advantages of efficiency, consistency, and reliability at a reduced cost. Using popular benchmarks, experiments show that Conquest incurs little overhead while achieving faster performance. Future work includes designing mechanisms for adjusting the file-size threshold dynamically and finding a better disk layout for large data blocks.
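
Conquest's implementation was not shown in this form; the following Python fragment is only a toy sketch of the placement policy described above, with an assumed 1MB threshold and dictionaries standing in for the persistent-RAM and disk stores:

SIZE_THRESHOLD = 1 << 20          # hypothetical 1MB cutoff; Conquest tunes this value

ram_store = {}                    # stand-in for persistent RAM: path -> bytes
disk_store = {}                   # stand-in for the disk: path -> bytes

def write_file(path, data, metadata):
    ram_store[path + ".meta"] = metadata          # metadata always lives in RAM
    if len(data) <= SIZE_THRESHOLD:
        ram_store[path] = data                    # small files served entirely from RAM
        disk_store.pop(path, None)
    else:
        disk_store[path] = data                   # only large-file data content goes to disk
        ram_store.pop(path, None)

def read_file(path):
    return ram_store.get(path) or disk_store.get(path)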

EXPLOITING GRAY-BOX KNOWLEDGE OF BUFFER-CACHE MANAGEMENT

Nathan C. Burnett, John Bent, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin, Madison

Knowing what algorithm is used to manage the buffer cache is very important for improving application performance. However, there is currently no interface for finding this algorithm. This paper introduces Dust, a simple fingerprinting tool that is able to identify the buffer-cache replacement policy. Dust automatically identifies the cache size and replacement policy based on the configuring attributes of access: order, recency, frequency, and long-term history. Through simulation, Dust was able to distinguish between a variety of replacement algorithm policies found in the literature: FIFO, LRU, LFU, Clock, Segmented FIFO, 2Q, and LRU-K. Further experiments fingerprinting real operating systems such as NetBSD, Linux, and Solaris show that Dust is able to identify the hidden cache replacement algorithm.
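
A rough sense of how such fingerprinting can work (an illustrative toy, not Dust itself): drive a simulated cache with an access pattern that separates recency from insertion order, then observe which block survives a forced eviction.

from collections import OrderedDict

class Cache:
    def __init__(self, size, policy):
        self.size, self.policy, self.blocks = size, policy, OrderedDict()
    def access(self, block):
        hit = block in self.blocks
        if hit and self.policy == "LRU":
            self.blocks.move_to_end(block)        # recency matters only for LRU
        if not hit:
            if len(self.blocks) >= self.size:
                self.blocks.popitem(last=False)   # evict head (oldest / least recent)
            self.blocks[block] = True
        return hit
    def cached(self, block):
        return block in self.blocks

def fingerprint(cache, size):
    for b in range(size):          # fill the cache with blocks 0 .. size-1
        cache.access(b)
    cache.access(0)                # touch block 0 again (raises its recency)
    cache.access(size)             # one new block forces a single eviction
    return "LRU-like" if cache.cached(0) else "FIFO-like"

for policy in ("FIFO", "LRU"):
    print(policy, "->", fingerprint(Cache(8, policy), 8))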


By knowing the underlying cache replacement policy, a cache-aware Web server can reschedule the requests on a cached-request-first policy to obtain a performance improvement. Experiments show that cache-aware scheduling improves average response time and system throughput.
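
A minimal sketch of that cached-request-first idea, assuming the server has some estimate of which files are currently resident:

def schedule(requests, likely_cached):
    # requests: list of file names; likely_cached: set of files believed to be in
    # the buffer cache (e.g., tracked with the fingerprinted replacement policy).
    return sorted(requests, key=lambda f: f not in likely_cached)

pending = ["a.html", "big.iso", "b.html", "c.html"]
print(schedule(pending, likely_cached={"b.html", "c.html"}))
# -> ['b.html', 'c.html', 'a.html', 'big.iso']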

OPERATING SYSTEMS (AND DANCING BEARS)

Summarized by Matt Butner

THE JX OPERATING SYSTEM

Michael Golm, Meik Felser, Christian Wawersich, and Jürgen Kleinöder, University of Erlangen-Nürnberg

The talk opened by emphasizing the need for, and the practicality of, a functional Java OS. Such an implementation in the JX OS attempts to mimic the recent trend toward using highly abstracted languages in application development in order to create an OS as functional and powerful as one written in a lower-level language.

JX is a microkernel solution that uses separate JVMs for each entity of the kernel and, in some cases, for each application. The separated domains do not share objects and have no thread migration, and each domain implements its own code. The interesting part of the presentation was the discussion of system-level programming with Java; key areas discussed were memory management and interrupt handlers. The authors concluded by noting that their type-safe and modular system resulted in a robust system with great configuration flexibility and acceptable performance.

DESIGN EVOLUTION OF THE EROS SINGLE-LEVEL STORE

Jonathan S. Shapiro, Johns Hopkins University; Jonathan Adams, University of Pennsylvania

This presentation outlined current characteristics of file systems and some of their least desirable characteristics. It was based on the revival of the EROS single-level-store approach for current attempts to capitalize on system consistency and efficiency. Their solution did not relate directly to EROS but, rather, to the design and use of the constructs and notions on which EROS is based. By extending the mapping of the memory system to include the disk, systems are further able to ensure global persistence without regard for a disk structure. In explaining the need for absolute system consistency and an environment which by definition is not allowed to crash, the goal of having such an exhaustive design becomes clear. The cool part of this work is the availability of its design and architecture to the public.

THINK: A SOFTWARE FRAMEWORK FOR COMPONENT-BASED OPERATING SYSTEM KERNELS

Jean-Philippe Fassino, France Télécom R&D; Jean-Bernard Stefani, INRIA; Julia Lawall, DIKU; Gilles Muller, INRIA

This presentation discussed the need for component-based operating systems and respective structures to ensure flexibility for arbitrary-sized systems. Think provides a binding model that maps uniform components for OS developers and architects to follow, ensuring consistent implementations for an arbitrary system size. However, the goal is not to force developers into a predefined kernel but to promote the use of certain components in varied ways.

Think's concentration is primarily on embedded systems, where short development time is necessary but is constrained by rigorous needs and limited resources. The need to build flexible systems in such an environment can be costly and implementation-specific, but Think creates an environment supported by the ability to dynamically load type-safe components. This allows for more flexible systems that retain functionality because of Think's ability to bind fine-grained components. Benchmarks revealed that dedicated microkernels can show performance improvements in comparison to monolithic kernels. Notable future work includes building components for low-end appliances and the development of a real-time OS component library.

BUILDING SERVICES

Summarized by Pradipta De

NINJA: A FRAMEWORK FOR NETWORK SERVICES

J. Robert von Behren, Eric A. Brewer, Nikita Borisov, Michael Chen, Matt Welsh, Josh MacDonald, Jeremy Lau, and David Culler, University of California, Berkeley

Robert von Behren presented Ninja, an ongoing project that aims to provide a framework for building robust and scalable Internet-based network services like Web-hosting, instant-messaging, email, and file-sharing applications. Robert drew attention to the difficulties of writing cluster applications. One has to take care of data consistency and issues of concurrency, and for robust applications there are problems related to fault tolerance. Ninja works as a wrapper to relieve the user of these problems. The goal of the project is building network services that are scalable and highly available, maintaining persistent data, providing graceful degradation, and supporting online evolution.

The use of clusters distinguishes this setup from generic distributed systems in terms of reliability and security, as well as providing a partition-free network. The next important feature of this project is the new programming model, which is more restrictive than a general programming model but still expressive enough to write most of the applications. This model, described as "single program, multiple connection," uses intertask parallelism instead of multithreaded concurrency and is characterized by all nodes running the same program, with connections delegated to the nodes by a centralized connection manager (CM). A CM takes care of hiding the details of mapping an external connection to an internal node. Ninja also uses a design called "staged event-driven architecture," where each service is broken down into a set of stages connected by event queues. This architecture is suitable for a modular design and helps in graceful degradation by adaptive load shedding from the event queues.

The presentation concluded with examples of implementation and evaluation of a Web server and an email system called NinjaMail to show the ease of authoring and the efficacy of using the Ninja framework for service development.

USING COHORT-SCHEDULING TO ENHANCE SERVER PERFORMANCE

James R. Larus and Michael Parkes, Microsoft Research

James Larus presented a new scheduling policy for increasing the throughput of server applications. Cohort scheduling batches execution of similar operations arising in different server requests. The usual programming paradigm in handling server requests is to spawn multiple concurrent threads and switch from one thread to another whenever a thread gets blocked for I/O. Since the threads in a server mostly execute unrelated pieces of code, the locality of reference is reduced, and hence so is the effectiveness of different caching mechanisms. One way to solve this problem is to throw in more hardware, but Larus presented a complementary view in which the program behavior is investigated and used to improve the performance.

The problem in this scheme is to identify pieces of code that can be batched together for processing. One simple way is to look at the next program counter values and use them to club threads together. However, this talk presented a new programming abstraction, "staged computation," which replaces the thread model with "stages." A stage is an abstraction for operations with similar behavior and locality. The StagedServer library can be used for programming in this model. It is a collection of C++ classes that implement staged computation and cohort scheduling on either a uniprocessor or a multiprocessor, and it can be used to define stages.
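
The StagedServer API itself was not reproduced in the summary; the following Python toy merely illustrates the cohort idea of queuing operations per stage and draining one stage's queue as a batch, so that similar operations run back-to-back and share locality instead of interleaving unrelated threads:

from collections import defaultdict, deque

class CohortScheduler:
    def __init__(self):
        self.queues = defaultdict(deque)          # stage name -> pending operations

    def submit(self, stage, operation):
        self.queues[stage].append(operation)

    def run(self):
        while any(self.queues.values()):
            # pick the stage with the largest cohort and drain it as a batch
            stage = max(self.queues, key=lambda s: len(self.queues[s]))
            batch, self.queues[stage] = self.queues[stage], deque()
            for op in batch:
                op()

sched = CohortScheduler()
for request in range(4):
    sched.submit("parse", lambda r=request: print("parse request", r))
    sched.submit("respond", lambda r=request: print("respond to", r))
sched.run()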

The presentation showed the experimental evaluation of cohort scheduling over the thread-based model by implementing two servers: a Web server, which is mainly I/O bound, and a publish-subscribe server, which is mainly compute bound. The SURGE benchmark was used for the first experiment and the Fabret workload for the second. The results showed that the cohort-scheduling-based implementation gave better throughput than the thread-based implementation at high loads.

NETWORK PERFORMANCE

Summarized by Xiaobo Fan

ETE: PASSIVE END-TO-END INTERNET SERVICE PERFORMANCE MONITORING

Yun Fu and Amin Vahdat, Duke University; Ludmila Cherkasova and Wenting Tang, HP Labs

This paper won the Best Student Paper award.

Ludmila Cherkasova began by listing several questions most Web service providers want answered in order to improve service quality. She reviewed the difficulties of making accurate and efficient end-to-end Web service measurements and the shortcomings of currently available approaches. They propose a passive trace-based architecture called EtE to monitor Web server performance on behalf of end users.

The first step is to collect network packets passively. The second module reconstructs all TCP connections and extracts HTTP transactions. To reconstruct Web page accesses, they first build a knowledge base indexed by client IP and URL and then group objects to the related Web pages they are embedded in. Statistical analysis is used to handle non-matched objects. EtE Monitor can generate three groups of metrics to measure Web service performance: response time, Web caching, and page abortion.

To demonstrate the benefits of EtE Monitor, Cherkasova talked about two case studies and, based on the calculated metrics, gave some insightful explanations about what's happening behind the variations of Web performance. Through validation, they claim their approach provides a very close approximation to the real scenario.

THE PERFORMANCE OF REMOTE DISPLAY MECHANISMS FOR THIN-CLIENT COMPUTING

S. Jae Yang, Jason Nieh, Matt Selsky, and Nikhil Tiwari, Columbia University

Noting the trend toward thin-client computing, the authors compared different techniques and design choices in measuring the performance of six popular thin-client platforms – Citrix MetaFrame, Microsoft Terminal Services, Sun Ray, Tarantella, VNC, and X. After pointing out several challenges in benchmarking thin clients, Yang proposed slow-motion benchmarking to achieve non-invasive packet monitoring and consistent visual quality. Basically, they insert delays between separate visual events in the benchmark applications on the server side so that the client's display update can catch up with the server's processing speed. The experiments are conducted on an emulated network over a range of network bandwidths.
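
As a sketch of the slow-motion idea (not the authors' harness), the measurement loop amounts to triggering one visual event at a time, waiting long enough for the client display to catch up, and attributing the observed traffic to that event; play_event() and bytes_on_wire() below are hypothetical hooks standing in for the benchmark application and the packet monitor:

import time

def slow_motion_run(events, play_event, bytes_on_wire, delay=2.0):
    per_event = []
    for ev in events:
        before = bytes_on_wire()
        play_event(ev)                 # trigger one visual event (page load, video frame, ...)
        time.sleep(delay)              # let the thin client's display fully catch up
        per_event.append((ev, bytes_on_wire() - before))
    return per_event                   # traffic attributable to each visual event

# demo with dummy hooks: each "event" pretends to put 1000 bytes on the wire
traffic = {"n": 0}
print(slow_motion_run(["page1", "page2"],
                      play_event=lambda ev: traffic.__setitem__("n", traffic["n"] + 1000),
                      bytes_on_wire=lambda: traffic["n"],
                      delay=0.1))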

Their results show that thin clients can provide good performance for Web applications in LAN environments, but only some platforms performed well on the video benchmark. Pixel-based encoding may achieve better performance and bandwidth efficiency than high-level graphics. Display caching and compression should be used with care.

A MECHANISM FOR TCP-FRIENDLY TRANSPORT-LEVEL PROTOCOL COORDINATION

David E. Ott and Ketan Mayer-Patel, University of North Carolina, Chapel Hill

A revised transport-level protocol optimized for cluster-to-cluster (C-to-C) applications is what David Ott tries to explain in his talk. A C-to-C application class is identified as one set of processes communicating to another set of processes across a common Internet path. The fundamental problem with C-to-C applications is how to coordinate all C-to-C communication flows so that they share a consistent view of the common C-to-C network, adapt to changing network conditions, and cooperate to meet specific requirements. Aggregation points (APs) are placed at the first and last hop of the common data path to probe network conditions (latency, bandwidth, loss rate, etc.). To carry and transfer this information, a new protocol – the Coordination Protocol (CP) – is inserted between the network layer (IP) and the transport layer (TCP, UDP, etc.). Ott illustrated how the CP header is updated and used when packets originate from a source and traverse through the local and remote APs to arrive at their destination, and how an AP maintains a per-cluster state table and detects network conditions. Through simulation results, this coordination mechanism appears effective in sharing common network resources among C-to-C communication flows.

STORAGE SYSTEMS

Summarized by Praveen Yalagandula

MY CACHE OR YOURS? MAKING STORAGE MORE EXCLUSIVE

Theodore M. Wong, Carnegie Mellon University; John Wilkes, HP Labs

Theodore Wong explained the inefficiency of current "inclusive" caching schemes in storage area networks: when a client accesses a block, the block is read from the disk and is cached at both the disk array cache and the client's cache. He then presented the concept of "exclusive" caching, where a block accessed by a client is only cached in that client's cache. On eviction from the client's cache, they have come up with a DEMOTE operation to move the data block to the tail of the array cache.

The new exclusive caching schemes are evaluated on both single-client and multiple-client systems. The single-client results are presented for two different types of synthetic workloads: Random (transaction-processing-type workloads) and Zipf (typical Web workloads). The exclusive policy was quite effective in achieving higher hit rates and lower read latencies. They also showed a 2.2 times hit-rate improvement over inclusive techniques in the case of a real-life workload, httpd, with a single-client setting.

The DEMOTE scheme also performed well in the case of multiple-client systems when the data accessed by clients is disjoint. In the case of shared data workloads, this scheme performed worse than the inclusive schemes. Theodore then presented an adaptive exclusive caching scheme: a block accessed by a client is placed at the tail in the client's cache and also in the array cache at an appropriate place determined by the popularity of the block. The popularity of the block is measured by maintaining a ghost cache to accumulate the number of times each block is accessed. This new adaptive scheme achieved a maximum of 52% speedup in the mean latency in the experiments with real-life workloads.
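
A toy simulation of the basic (non-adaptive) DEMOTE idea, not the paper's simulator, assuming LRU at the client and simple head eviction at the array cache:

from collections import OrderedDict

class ArrayCache:
    def __init__(self, size):
        self.size, self.blocks = size, OrderedDict()
    def read(self, block):
        if block in self.blocks:
            self.blocks.pop(block)               # exclusive: hand the block up and drop it here
            return "array hit"
        return "disk read"
    def demote(self, block):
        if len(self.blocks) >= self.size:
            self.blocks.popitem(last=False)      # evict from the head
        self.blocks[block] = True                # demoted blocks enter at the tail

class Client:
    def __init__(self, size, array):
        self.size, self.array, self.blocks = size, array, OrderedDict()
    def read(self, block):
        if block in self.blocks:
            self.blocks.move_to_end(block)       # LRU at the client
            return "client hit"
        source = self.array.read(block)          # miss: fetch from array cache or disk
        if len(self.blocks) >= self.size:
            victim, _ = self.blocks.popitem(last=False)
            self.array.demote(victim)            # DEMOTE the evicted block downstream
        self.blocks[block] = True
        return source

client = Client(4, ArrayCache(4))
trace = [1, 2, 3, 4, 5, 6, 1, 2]                 # blocks 1 and 2 come back from the array cache
print([client.read(b) for b in trace])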

For more information, visit http://www.cs.cmu.edu/~tmwong/research and http://www.hpl.hp.com/research/itc/csl/ssp.

BRIDGING THE INFORMATION GAP IN STORAGE PROTOCOL STACKS

Timothy E. Denehy, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin, Madison

Currently, there is a huge information gap between storage systems and file systems. The interface exposed by storage systems to file systems, based on blocks and providing only read/write interfaces, is very narrow. This leads to poor performance as a whole, because of duplicated functionality in both systems and reduced functionality resulting from storage systems' lack of file information.

The speaker presented two enhancements: ExRAID, an exposed RAID, and ILFS, an Informed Log-Structured File System. ExRAID is an enhancement to the block-based RAID storage system that exposes the following three types of information to the file system: (1) regions – contiguous portions of the address space comprising one or multiple disks; (2) performance information about queue lengths and throughput of the regions, revealing disk heterogeneity to the file systems; and (3) failure information – dynamically updated information conveying the number of tolerable failures in each region.

ILFS allows online incremental expansion of the storage space, performs dynamic parallelism using ExRAID's performance information to perform well on heterogeneous storage systems, and provides a range of different mechanisms with different granularities for redundancy of files using ExRAID's region failure characteristics. These new features are added to LFS with only a 19% increase in the code size.

More information: http://www.cs.wisc.edu/wind

MAXIMIZING THROUGHPUT IN REPLICATED DISK STRIPING OF VARIABLE BIT-RATE STREAMS

Stergios V. Anastasiadis, Duke University; Kenneth C. Sevcik and Michael Stumm, University of Toronto

There is an increasing demand for the continuous real-time streaming of media files. Disk striping is a common technique used for supporting many connections on media servers. But even with 1.2 million hours of mean time between failures, there will be more than one disk failure per week with 7000 disks. Fault tolerance can be achieved by using data replication and reserving extra bandwidth during normal operation. This work focused on supporting the most common variable bit-rate stream formats (e.g., MPEG).

The authors used the prototype media server EXEDRA to implement and evaluate different replication and bandwidth reservation schemes. This system supports a variable bit-rate scheme, does stride-based disk allocation, and is capable of supporting variable-grain disk striping. Two replication policies are presented: deterministic – data blocks are replicated in a round-robin fashion on the secondary replicas; and random – data blocks are replicated on randomly chosen secondary replicas. Two bandwidth reservation techniques are also presented: mirroring reservation – disk bandwidth is reserved for both the primary and the replicas of the media file during playback; and minimum reservation – a more efficient scheme in which bandwidth is reserved only for the sum of the primary data access time and the maximum of the backup data access times required in each round.

Experimental results showed that deterministic replica placement is better than random placement for small disks, minimum disk bandwidth reservation is twice as good as mirroring in the throughput achieved, and fault tolerance can be achieved with a minimal impact on the throughput.

TOOLS

Summarized by Li Xiao

SIMPLE AND GENERAL STATISTICAL PROFILING WITH PCT

Charles Blake and Steven Bauer MIT

This talk introduced a Profile Collection Toolkit (PCT) – a sampling-based CPU profiling facility. A novel aspect of PCT is that it allows sampling of semantically rich data such as function call stacks, function parameters, local or global variables, CPU registers, or other execution contexts. The design objectives of PCT were driven by user needs and the inadequacies or inaccessibility of prior systems.

The rich data collection capability is achieved via a debugger-controller program, dbctl. The talk then introduced the profiling collector and data-collection activation. PCT file format options provide flexibility and various ways to store exact states. PCT also has analysis options; two examples were given – "Simple" and "Mixed User and Linux Code."

The talk then went into how general sampling helps in the following areas: a debugger-controller state machine, example script fragments, a simple example program, and general value reports in terms of call-path histograms, numeric value histograms, and data generality. Related work such as gprof and Expect was then summarized. The contribution of this work is to provide a portable, general value sampling tool. The limitations are that PCT does not support much of strip, and count inaccuracies happen because of its statistical nature. Based on this limitation, -g is preferred over strip.
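The idea of sampling semantically rich state through a debugger can be approximated with an ordinary gdb in batch mode. The sketch below is not PCT itself; it simply grabs repeated backtraces from a running process and tallies them into a crude call-path histogram.

    import subprocess, time
    from collections import Counter

    def sample_stack(pid):
        """Take one backtrace of the target process using gdb in batch mode."""
        out = subprocess.run(
            ["gdb", "-batch", "-p", str(pid), "-ex", "bt"],
            capture_output=True, text=True).stdout
        frames = [line.split(" in ")[-1].split(" (")[0]
                  for line in out.splitlines() if line.startswith("#")]
        return tuple(frames)

    def profile(pid, samples=100, interval=0.05):
        hist = Counter()
        for _ in range(samples):
            hist[sample_stack(pid)] += 1    # call-path histogram
            time.sleep(interval)
        return hist.most_common(10)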

Possible future directions included supporting more debuggers, such as dbx and xdb, script generators for other kinds of traces, more canned reports for general values, and a libgdb-based library sampler. PCT is available at http://pdos.lcs.mit.edu/pct. PCT runs on almost any UNIX-like system.

ENGINEERING A DIFFERENCING AND COMPRESSION DATA FORMAT

David G. Korn and Kiem Phong Vo, AT&T Laboratories – Research

This talk began with an equation: "Differencing + Compression = Delta Compression." Compression removes information redundancy in a data set; examples are gzip, bzip, compress, and pack. Differencing encodes differences between two data sets; examples are diff -e, fdelta, bdiff, and xdelta. Delta compression compresses a data set given another, combining differencing and compression, and reduces to pure compression when there's no commonality.
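A toy delta encoder makes the equation concrete. The sketch below is only a conceptual illustration in Python, not the Vcdiff format: it describes the target as copy/insert operations against the source and then deflates the instruction stream.

    import zlib

    def delta_encode(source: bytes, target: bytes, block=16):
        """Differencing: emit COPY(offset, len) where a block of the target
        also appears in the source, INSERT(byte) otherwise; Compression:
        deflate the resulting instruction stream."""
        index = {source[i:i + block]: i
                 for i in range(0, max(0, len(source) - block + 1))}
        ops, i = [], 0
        while i < len(target):
            chunk = target[i:i + block]
            if len(chunk) == block and chunk in index:
                ops.append(b"C %d %d;" % (index[chunk], block))
                i += block
            else:
                ops.append(b"I " + target[i:i + 1] + b";")
                i += 1
        return zlib.compress(b"".join(ops))

    # With no commonality, nearly every op is an INSERT and the result
    # degenerates to plain compression of the target, as the talk noted.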

After presenting an overall scheme for a delta compressor and showing delta compression performance, this talk discussed the encoding format of the newly designed Vcdiff for delta compression. A Vcdiff instruction code table consists of 256 entries, each coding up to a pair of instructions, and recodes 1-byte indices and any additional data.

The talk then showed Vcdiff's performance with Web data collected from CNN, compared with the W3C standard Gdiff, Gdiff+gzip, and diff+gzip, where Gdiff was computed using Vcdiff delta instructions. The results of two experiments, "First" and "Successive," were presented. In "First," each file is compressed against the first file collected; in "Successive," each file is compressed against the one from the previous hour. The diff+gzip combination did not work well because diff is line-oriented. Vcdiff performed favorably compared with the other formats.

Vcdiff is part of the Vcodex package. Vcodex is a platform for all common data transformations: delta compression, plain compression, encryption, and transcoding (e.g., uuencode, base64). It is structured in three layers for maximum usability; the base library uses Disciplines and Methods interfaces.

The code for Vcdiff can be found at http://www.research.att.com/sw/tools. They are moving Vcdiff to the IETF standard as a comprehensive platform for transforming data; please refer to http://www.ietf.org/internet-drafts/draft-korn-vcdiff-06.txt.

WHERE IN THE NET

Summarized by Amit Purohit

A PRECISE AND EFFICIENT EVALUATION OF THE PROXIMITY BETWEEN WEB CLIENTS AND THEIR LOCAL DNS SERVERS

Zhuoqing Morley Mao, University of California, Berkeley; Charles D. Cranor, Fred Douglis, Michael Rabinovich, Oliver Spatscheck, and Jia Wang, AT&T Labs – Research

Content Distribution Networks (CDNs) attempt to improve Web performance by delivering Web content to end users from servers located at the edge of the network. When a Web client requests content, the CDN dynamically chooses a server to route the request to, usually the one that is nearest the client. As CDNs have access only to the IP address of the local DNS server (LDNS) of the client, the CDN's authoritative DNS server maps the client's LDNS to a geographic region within a particular network and combines that with network and server load information to perform CDN server selection.

This method has two limitations. First, it is based on the implicit assumption that clients are close to their LDNS. This may not always be valid. Second, a single request from an LDNS can represent a differing number of Web clients – called the hidden load factor. Knowledge of the hidden load factor can be used to achieve better load distribution.

The extent of the first limitation and its impact on the CDN server selection is dealt with. To determine the associations between clients and their LDNS, a simple, non-intrusive, and efficient mapping technique was developed. The data collected was used to study the impact of proximity on DNS-based server selection using four different proximity metrics: (1) autonomous system (AS) clustering – observing whether a client is in the same AS as its LDNS – concluded that LDNS is very good for coarse-grained server selection, as 64% of the associations belong to the same AS; (2) network clustering – observing whether a client is in the same network-aware cluster (NAC) – implied that DNS is less useful for finer-grained server selection, since only 16% of the client and LDNS pairs are in the same NAC; (3) traceroute divergence – the length of the divergent paths to the client and its LDNS from a probe point using traceroute – implies that most clients are topologically close to their LDNS as viewed from a randomly chosen probe site; and (4) round-trip time (RTT) correlation (some CDNs select servers based on RTT between the CDN server and the client's LDNS), which examines the correlation between the message RTTs from a probe point to the client and its local DNS; the results imply this is a reasonable metric to use to avoid really distant servers.

A study of the impact that client-LDNS associations have on DNS-based server selection concludes that knowing the client's IP address would allow more accurate server selection in a large number of cases. The optimality of the server selection also depends on the server density, placement, and selection algorithms.

Further information can be found at http://www.eecs.berkeley.edu/~zmao/myresearch.html or by contacting zmao@eecs.berkeley.edu.

GEOGRAPHIC PROPERTIES OF INTERNET ROUTING

Lakshminarayanan Subramanian, Venkata N. Padmanabhan, Microsoft Research; Randy H. Katz, University of California at Berkeley

Geographic information can provide insights into the structure and functioning of the Internet, including interactions between different autonomous systems, by analyzing certain properties of Internet routing. It can be used to measure and quantify routing properties such as circuitous routing, hot-potato routing, and geographic fault tolerance.

Traceroute has been used to gather the required data, and the Geotrack tool has been used to determine the location of the nodes along each network path. This enables the computation of "linearized distances," which is the sum of the geographic distances between successive pairs of routers along the path.

In order to measure the circuitousness of a path, a metric "distance ratio" has been defined as the ratio of the linearized distance of a path to the geographic distance between the source and destination of the path. From the data it has been observed that the circuitousness of a route depends on both the geographic and network location of the end hosts. A large value of the distance ratio enables us to flag paths that are highly circuitous, possibly (though not necessarily) because of routing anomalies. It is also shown that the minimum delay between end hosts is strongly correlated with the linearized distance of the path.
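Both distances in the ratio can be computed directly from router coordinates; a minimal sketch follows, using great-circle distance and coordinates that would come from a tool like Geotrack (the geometry here is standard, but the path representation is an assumption for illustration).

    from math import radians, sin, cos, asin, sqrt

    def great_circle_km(a, b):
        """Haversine distance between two (lat, lon) points given in degrees."""
        lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
        h = sin((lat2 - lat1) / 2) ** 2 + \
            cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371 * asin(sqrt(h))

    def distance_ratio(router_path):
        """Linearized distance of the path over the end-to-end distance."""
        linearized = sum(great_circle_km(router_path[i], router_path[i + 1])
                         for i in range(len(router_path) - 1))
        direct = great_circle_km(router_path[0], router_path[-1])
        return linearized / direct

    # A ratio near 1 indicates a fairly direct route; large ratios flag
    # circuitous paths.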

Geographic information can be used to study various aspects of wide-area Internet paths that traverse multiple ISPs. It was found that end-to-end Internet paths tend to be more circuitous than intra-ISP paths, the cause for this being the peering relationships between ISPs. Also, paths traversing substantial distances within two or more ISPs tend to be more circuitous than paths largely traversing only a single ISP. Another finding is that ISPs generally employ hot-potato routing.

Geographic information can also be used to capture the fact that two seemingly unrelated routers can be susceptible to correlated failures. By using the geographic information of routers, we can construct a geographic topology of an ISP. Using this, we can find the tolerance of an ISP's network to the total network failure in a geographic region.

For further information, contact lakme@cs.berkeley.edu.

PROVIDING PROCESS ORIGIN INFORMATION TO AID IN NETWORK TRACEBACK

Florian P. Buchholz, Purdue University; Clay Shields, Georgetown University

Network traceback is currently limited because host audit systems do not maintain enough information to match incoming network traffic to outgoing network traffic. The talk presented an alternative: assigning origin information to every process and logging it during interactive login creation.

The current implementation concentrates mainly on interactive sessions, in which an event is logged when a new connection is established using SSH or Telnet. The method is effective and could successfully determine stepping stones and reliably detect the source of a DDoS attack. The speaker then talked about the related work done in this area, which led to discussion of the latest development in the area of "host causality."

The speaker ended with the limitations and future directions of the framework. This framework could be extended and could find many applications in the area of security. The current implementation doesn't take care of startup scripts and cron jobs, but incorporating the origin information in the file system could solve this problem. In the current implementation, logging is just implemented as a proof of concept. It could be made safe in many ways, and this could be another important aspect of future work.

PROGRAMMING

Summarized by Amit Purohit

CYCLONE: A SAFE DIALECT OF C

Trevor Jim, AT&T Labs – Research; Greg Morrisett, Dan Grossman, Michael Hicks, James Cheney, Yanling Wang, Cornell University

Cyclone is designed to provide protection against attacks such as buffer overflows, format string attacks, and memory management errors. The current structure of C allows programmers to write vulnerable programs. Cyclone extends C so that it has the safety guarantee of Java while keeping C syntax, types, and semantics largely untouched.

The Cyclone compiler performs static analysis of the source code and inserts runtime checks into the compiled output at places where the analysis cannot determine that the code execution will not violate safety constraints. Cyclone imposes many restrictions to preserve safety, such as NULL checks. These checks do not exist in normal C.

The speaker then talked about some sample code written in Cyclone and how it tackles safety problems. Porting an existing C application to Cyclone is pretty easy, with fewer than 10% of the lines requiring change. The current implementation concentrates more on safety than on performance – hence the performance penalty is substantial. Cyclone was able to find many lingering bugs. The final step of the project will be to write a compiler to convert a normal C program to an equivalent Cyclone program.

COOPERATIVE TASK MANAGEMENT WITHOUT MANUAL STACK MANAGEMENT

Atul Adya, Jon Howell, Marvin Theimer, William J. Bolosky, John R. Douceur, Microsoft Research

The speaker described the definitions and motivations behind the distinct concepts of task management, stack management, I/O management, conflict management, and data partitioning. Conventional concurrent programming uses preemptive task management and exploits the automatic stack management of a standard language. In the second approach, cooperative tasks are organized as event handlers that yield control by returning control to the event scheduler, manually unrolling their stacks. In this project they have adopted a hybrid approach that makes it possible for both stack management styles to coexist. Thus, programmers can code assuming one or the other of these stack management styles will operate, depending upon the application. The speaker also gave a detailed example of how to use their mechanism to switch between the two styles.

The talk then continued with the implementation. They were able to preempt many subtle concurrency problems by using cooperative task management. Paying a cost up-front to reduce a subtle race condition proved to be a good investment.

Though the choice of task management is fundamental, the choice of stack management can be left to individual taste. This project enables the use of any type of stack management in conjunction with cooperative task management.

IMPROVING WAIT-FREE ALGORITHMS FOR INTERPROCESS COMMUNICATION IN EMBEDDED REAL-TIME SYSTEMS

Hai Huang, Padmanabhan Pillai, Kang G. Shin, University of Michigan

The main characteristic of a real-time, time-sensitive system is its predictable response time. But concurrency management creates hurdles to achieving this goal because of the use of locks to maintain consistency. To solve this problem, many wait-free algorithms have been developed, but these are typically high-cost solutions. By taking advantage of the temporal characteristics of the system, however, the time and space overhead can be reduced.

The speaker presented an algorithm for temporal concurrency control and applied this technique to improve three wait-free algorithms. A single-writer multiple-reader wait-free algorithm and a double-buffer algorithm were proposed.
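The double-buffer idea can be illustrated roughly as follows. This is a conceptual sketch, not the authors' algorithm: a single writer fills the slot the readers are not using and then flips an index, and the temporal analysis in the paper is what bounds the case this sketch glosses over (a slow reader being lapped by the writer).

    class DoubleBuffer:
        """Single-writer double buffer: the writer fills the inactive slot and
        then flips 'current'; readers always read the active slot without
        taking any lock."""
        def __init__(self, initial):
            self.slots = [initial, initial]
            self.current = 0              # index of the slot readers use

        def write(self, value):           # called only by the single writer
            inactive = 1 - self.current
            self.slots[inactive] = value
            self.current = inactive       # publish: flip the active slot

        def read(self):                   # any number of readers
            return self.slots[self.current]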

Using their transformation mechanism, they achieved an improvement of 17–66% in ACET and a 14–70% reduction in memory requirements for IPC algorithms. This mechanism is extensible and can be applied to other non-blocking IPC algorithms as well. Future work involves reducing the synchronizing overheads in more general IPC algorithms with multiple-writer semantics.

MOBILITY

Summarized by Praveen Yalagandula

ROBUST POSITIONING ALGORITHMS FOR DISTRIBUTED AD-HOC WIRELESS SENSOR NETWORKS

Chris Savarese, Jan Rabaey, University of California, Berkeley; Koen Langendoen, Delft University of Technology

The Pico Radio Network comprises more than a hundred sensors, monitors, and actuators equipped with wireless connectivity, with the following properties: (1) no infrastructure; (2) computation in a distributed fashion instead of centralized computation; (3) dynamic topology; (4) limited radio range; (5) nodes that can act as repeaters; (6) devices of low cost and with low power; and (7) sparse anchor nodes – nodes with GPS capability. A positioning algorithm determines the geographical position of each node in the network.

In two-dimensional space, each node needs three reference positions to estimate its geographical position. There are two problems that make positioning difficult in the Pico Radio Network setting: (1) a sparse anchor node problem, and (2) a range error problem.

The two-phase approach that the authors have taken in solving the problem consists of (1) a hop-terrain algorithm – in this first phase, each node roughly guesses its location from the distance calculated using multi-hops to the anchor points; and (2) a refinement algorithm – in this second phase, each node uses its neighbors' positions to refine its own position estimate. To guarantee convergence, this approach uses confidence metrics, assigning a value of 1.0 to anchor nodes and 0.1 to start for other nodes, and increasing these with each iteration.
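A highly simplified flavor of the refinement phase is sketched below: each node nudges its estimate toward consistency with its neighbors' current estimates and the measured ranges, weighting neighbors by their confidence. The update rule and constants are illustrative assumptions, not the paper's exact algorithm.

    def refine(pos, conf, neighbors, iterations=10):
        """pos: {node: (x, y)}, conf: {node: confidence in [0, 1]},
        neighbors: {node: [(other, measured_range), ...]}."""
        for _ in range(iterations):
            for n, links in neighbors.items():
                if conf[n] >= 1.0:           # anchors never move
                    continue
                x, y = pos[n]
                dx = dy = weight = 0.0
                for other, rng in links:
                    ox, oy = pos[other]
                    d = ((x - ox) ** 2 + (y - oy) ** 2) ** 0.5 or 1e-9
                    err = (d - rng) / d      # > 0: estimate too far away
                    dx += conf[other] * err * (ox - x)
                    dy += conf[other] * err * (oy - y)
                    weight += conf[other]
                if weight:
                    pos[n] = (x + dx / weight, y + dy / weight)
                    conf[n] = min(1.0, conf[n] + 0.1 * weight / len(links))
        return pos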

A simulation tool, OMNeT++, was used for both phases and in various scenarios. The results show that they achieved position errors of less than 33% in a scenario with 5% anchor nodes, an average connectivity of 7, and 5% range measurement error.

Guidelines for anchor node deployment are: high connectivity (>10), a reasonable fraction of anchor nodes (>5%), and, for anchor placement, covered edges.

APPLICATION-SPECIFIC NETWORK MANAGEMENT FOR ENERGY-AWARE STREAMING OF POPULAR MULTIMEDIA FORMATS

Surendar Chandra, University of Georgia, and Amin Vahdat, Duke University

The main hindrance in supporting the increasing demand for mobile multimedia on PDAs is the battery capacity of the small devices. The idle-time power consumption is almost the same as the receive power consumption on typical wireless network cards (1319mW vs. 1425mW), while the sleep state consumes far less power (177mW). To let the system consume energy proportional to the stream quality, the network card should transition to the sleep state aggressively between packets.

The speaker presented previous work where different multimedia streams were studied for PDAs: MS Media, Real Media, and QuickTime. It was found that if the inter-packet gap is predictable, then huge savings are possible – for example, about 80% savings in MS Media streams with few packet losses. The limitation of the IEEE 802.11b power-save mode comes into play when there are two or more nodes, producing a delay between beacon time and when the node receives/sends packets. This badly affects the streams, since higher energy is consumed while waiting.

The authors propose traffic shaping for energy conservation, where this is done by proxy such that packets arrive at regular intervals. This is achieved using a local proxy in the access point and a client-side proxy in the mobile host. Simulations show that traffic shaping reduces energy consumption and also reveal that the higher-bandwidth streams have a lower energy metric (mJ/kB).
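Using the power numbers quoted above, a quick calculation shows why sleeping between predictable packet arrivals matters; the packet gap, receive time, and packet count below are made-up illustration values, not measurements from the talk.

    IDLE_MW, RECV_MW, SLEEP_MW = 1319, 1425, 177    # from the talk

    def energy_mj(gap_s, recv_s, packets, sleep_between=True):
        """Energy (mJ) to receive 'packets' packets separated by 'gap_s'
        seconds of inactivity, each taking 'recv_s' seconds to receive."""
        idle_power = SLEEP_MW if sleep_between else IDLE_MW
        return packets * (recv_s * RECV_MW + gap_s * idle_power)

    # e.g., 1000 packets, 100 ms apart, 5 ms receive time each:
    awake = energy_mj(0.100, 0.005, 1000, sleep_between=False)
    asleep = energy_mj(0.100, 0.005, 1000, sleep_between=True)
    print(f"always awake: {awake:.0f} mJ, sleeping in gaps: {asleep:.0f} mJ")
    # The gap energy dominates, which is roughly the ~80% saving noted above.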

More information is at http://greenhouse.cs.uga.edu.

CHARACTERIZING ALERT AND BROWSE SERVICES OF MOBILE CLIENTS

Atul Adya, Paramvir Bahl, and Lili Qiu, Microsoft Research

Even though there is a dramatic increase in Internet access from wireless devices, there are not many studies done on characterizing this traffic. In this paper, the authors characterize the traffic observed on the MSN Web site with both notification and browse traces.


Around 33 million browsing accesses and about 325 million notification entries are present in the traces. Three types of analyses are done for each of these two services: content analysis, concerning the most popular content categories and their distribution; popularity analysis, or the popularity distribution of documents; and user behavior analysis.

The analysis of the notification logs shows that document access rates follow a Zipf-like distribution, with most of the accesses concentrated on a small number of messages, and the accesses exhibit geographical locality – users from the same locality tend to receive similar notification content. The browser log analysis shows that a smaller set of URLs is accessed most of the time, though the access pattern does not fit any Zipf curve, and the highly accessed URLs remain stable. A correlation study between the notification and browsing services shows that wireless users have a moderate correlation of 0.12.

FREENIX TRACK SESSIONS

BUILDING APPLICATIONS

Summarized by Matt Butner

INTERACTIVE 3-D GRAPHICS APPLICATIONS FOR TCL

Oliver Kersting, Jürgen Döllner, Hasso Plattner Institute for Software Systems Engineering, University of Potsdam

The integration of 3-D image rendering functionality into a scripting language permits interactive and animated 3-D development and application without the formalities and precision demanded by low-level C/C++ graphics and visualization libraries. The large and complex C++ API of the Virtual Rendering System (VRS) can be combined with the conveniences of the Tcl scripting language. The mapping of class interfaces is done via an automated process and generates respective wrapper classes, all of which ensures complete API accessibility and functionality without any significant performance penalty. The general C++-to-Tcl mapper SWIG grants the necessary object and hierarchical abilities without the use of object-oriented Tcl extensions. The final API mapping techniques address C++ features such as classes, overloaded methods, enumerations, and inheritance relations. All are implemented in a proof-of-concept that maps the complete C++ API of the VRS to Tcl and are showcased in a complete interactive 3-D map system.

THE AGFL GRAMMAR WORK LAB

Cornelis H.A. Coster, Erik Verbruggen, University of Nijmegen (KUN)

The growth and implementation of Natural Language Processing (NLP) is the cornerstone of the continued evolution and implementation of truly intelligent search machines and services. In part due to the growing collections of computer-stored, human-readable documents in the public domain, the implementation of linguistic analysis will become necessary for desirable precision and resolution. Subsequently, such tools and linguistic resources must be released into the public domain, and they have done so with the release of the AGFL Grammar Work Lab under the GNU Public License as a tool for linguistic research and the development of NLP-based applications.

The AGFL (Affix Grammars over a Finite Lattice) Grammar Work Lab meshes context-free grammars with finite set-valued features that are acceptable to a range of languages. In computer science terms, "Syntax rules are procedures with parameters and a non-deterministic execution." The English Phrases for Information Retrieval (EP4IR), released with the AGFL-GWL as a robust grammar of English, is an AGFL-GWL-generated English parser that outputs "Head/Modifier" frames. The sentences "CompanyX sponsored this conference" and "This conference was sponsored by CompanyX" both generate [CompanyX[sponsored conference]]. The hope is that public availability of such tools will encourage further development of grammar and lexical software systems.

SWILL: A SIMPLE EMBEDDED WEB SERVER LIBRARY

Sotiria Lampoudi, David M. Beazley, University of Chicago

This paper won the FREENIX Best Student Paper award.

SWILL (Simple Web Interface Link Library) is a simple Web server in the form of a C library whose development was motivated by a wish to give cool applications an interface to the Web. The SWILL library provides a simple interface that can be efficiently implemented for tasks that vary from flexible Web-based monitoring to software debugging and diagnostics. Though originally designed to be integrated with high-performance scientific simulation software, the interface is generic enough to allow for unbounded uses. SWILL is a single-threaded server relying upon non-blocking I/O through the creation of a temporary server which services I/O requests. SWILL does not provide SSL or cryptographic authentication but does have HTTP authentication abilities.

A fantastic feature of SWILL is its support for SPMD-style parallel applications which utilize MPI, proving valuable for Beowulf clusters and large parallel machines. Another practical application was the implementation of SWILL in a modified Yalnix emulator by University of Chicago Operating Systems courses, which utilized the added Yalnix functionality for OS development and debugging. SWILL requires minimal memory overhead and relies upon the HTTP/1.0 protocol.

NETWORK PERFORMANCE

Summarized by Florian Buchholz

LINUX NFS CLIENT WRITE PERFORMANCE

Chuck Lever, Network Appliance; Peter Honeyman, CITI, University of Michigan

Lever introduced a benchmark to measure NFS client write performance. Client performance is difficult to measure due to hindrances such as poor hardware or bandwidth limitations. Furthermore, measuring application performance does not identify weaknesses specifically at the client side. Thus, a benchmark was developed trying to exercise only data transfers in one direction between server and application. For this purpose, the benchmark was based on the block sequential write portion of the Bonnie file system benchmark. Once a benchmark for NFS clients is established, it can be used to improve client performance.

The performance measurements were performed with an SMP Linux client and both a Linux NFS server and a Network Appliance F85 filer. During testing, a periodic jump in write latency was discovered. This was due to a rather large number of pending write operations that were scheduled to be written after certain threshold values were exceeded. By introducing a separate daemon that flushes the cached write requests, the spikes could be eliminated, but as a result the average latency grows over time. The problem could be traced to a function that scans a linked list of write requests. After a hashtable was added to improve lookup performance, the latency improved considerably.
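The fix is the familiar shift from a linear scan to a keyed lookup. A minimal sketch follows; keying requests by file offset is an assumption made for the illustration, not a description of the actual kernel data structures.

    class PendingWrites:
        """Track outstanding write requests; lookups that previously walked a
        linked list become O(1) dictionary probes."""
        def __init__(self):
            self.by_offset = {}             # offset -> request

        def add(self, offset, request):
            self.by_offset[offset] = request

        def find(self, offset):
            # was: for r in linked_list: if r.offset == offset: return r
            return self.by_offset.get(offset)

        def flush(self, write_back):
            for offset, request in sorted(self.by_offset.items()):
                write_back(offset, request)
            self.by_offset.clear()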

The improved client was then used to measure throughput against the two servers. A discrepancy between the Linux server and the filer test was noticed, and the reason for that was traced back to a global kernel lock that was unnecessarily held when accessing the network stack. After correcting this, performance improved further. However, the measurements showed that a client may run slower when paired with fast servers on fast networks. This is due to heavy client interrupt loads, more network processing on the client side, and more global kernel lock contention.

The source code of the project is available at http://www.citi.umich.edu/projects/nfs-perf/patches.

A STUDY OF THE RELATIVE COSTS OF NETWORK SECURITY PROTOCOLS

Stefan Miltchev and Sotiris Ioannidis, University of Pennsylvania; Angelos Keromytis, Columbia University

With the increasing need for security and integrity of remote network services, it becomes important to quantify the communication overhead of IPSec and compare it to alternatives such as SSH, SCP, and HTTPS.

For this purpose, the authors set up three testing networks: a direct link, two hosts separated by two gateways, and three hosts connecting through one gateway. Protocols were compared in each setup, and manual keying was used to eliminate connection setup costs. For IPSec, the different encryption algorithms AES, DES, 3DES, hardware DES, and hardware 3DES were used. In detail, FTP was compared to SFTP, SCP, and FTP over IPSec; HTTP to HTTPS and HTTP over IPSec; and NFS to NFS over IPSec and local disk performance.

The result of the measurements was that IPSec outperforms the other popular encryption schemes. Overall, unencrypted communication was fastest, but in some cases, like FTP, the overhead can be small. The use of crypto hardware can significantly improve performance. For future work, the inclusion of setup costs, hardware-accelerated SSL, SFTP, and SSH was mentioned.

CONGESTION CONTROL IN LINUX TCP

Pasi Sarolahti, University of Helsinki; Alexey Kuznetsov, Institute for Nuclear Research at Moscow

Having attended the talk and read the paper, I am still unclear about whether the authors are merely describing the design decisions of TCP congestion control or whether they are actually the creators of that part of the Linux code. My guess leans toward the former.

In the talk, the speaker compared the TCP congestion control measures specified by the IETF RFCs with the actual Linux implementation, which does conform to the basic principles but nevertheless has differences. A specific emphasis was placed on retransmission mechanisms and the congestion window. Also, several TCP enhancements – the NewReno algorithm, Selective ACKs (SACK), Forward ACKs (FACK) – were discussed and compared.

In some instances, Linux does not conform to the IETF specifications. The fast recovery does not fully follow RFC 2582, since the threshold for triggering retransmit is adjusted dynamically and the congestion window's size is not changed. Also, the round-trip-time estimator and the RTO calculation differ from RFC 2988, since Linux uses more conservative RTT estimates and a minimum RTO of 200ms. The performance measurements showed that with the additional Linux-specific features enabled, slightly higher throughput, a more steady data flow, and fewer unnecessary retransmissions can be achieved.
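The RFC 2988 retransmission timeout is easy to state in code; the sketch below shows the standard update and, as an assumed Linux-style variation, a 200ms floor in place of the RFC's one-second lower bound (the clock granularity value here is illustrative).

    def rto_update(srtt, rttvar, sample, first=False,
                   alpha=0.125, beta=0.25, granularity=0.1, min_rto=1.0):
        """One RTO update per RTT sample, all values in seconds.
        RFC 2988 uses min_rto=1.0; Linux clamps to roughly min_rto=0.2."""
        if first:
            srtt, rttvar = sample, sample / 2
        else:
            rttvar = (1 - beta) * rttvar + beta * abs(srtt - sample)
            srtt = (1 - alpha) * srtt + alpha * sample
        rto = max(srtt + max(granularity, 4 * rttvar), min_rto)
        return srtt, rttvar, rto

    # RFC-style:   srtt, rttvar, rto = rto_update(0, 0, 0.05, first=True)
    # Linux-style: srtt, rttvar, rto = rto_update(srtt, rttvar, 0.05, min_rto=0.2)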

XTREME XCITEMENT

Summarized by Steve Bauer

THE FUTURE IS COMING: WHERE THE X WINDOW SHOULD GO

Jim Gettys Compaq Computer Corp

Jim Gettys, one of the principal authors of the X Window System, outlined the near-term objectives for the system, primarily focusing on the changes and infrastructure required to enable replication and migration of X applications. Providing better support for this functionality would enable users to retrieve or duplicate X applications between their servers at home and work.


One interesting example of an application that currently is capable of migration and replication is Emacs. To create a new frame on DISPLAY, try "M-x make-frame-on-display <Return> DISPLAY <Return>".

However, technical challenges make replication and migration difficult in general. These include the "major headaches" of server-side fonts, the nonuniformity of X servers and screen sizes, and the need to appropriately retrofit toolkits. Authentication and authorization issues are obviously also important. The rest of the talk delved into some of the details of these interesting technical challenges.

HACKING IN THE KERNEL

Summarized by Hai Huang

AN IMPLEMENTATION OF SCHEDULER ACTIVATIONS ON THE NETBSD OPERATING SYSTEM

Nathan J Williams Wasabi Systems

Scheduler activation is an old idea. Basically, there are benefits and drawbacks to using solely kernel-level or user-level threading; scheduler activation is able to combine the two layers of control to provide more concurrency in the system.

In his talk, Nathan gave a fairly detailed description of the implementation of scheduler activations in the NetBSD kernel. One important change in the implementation is to differentiate the thread context from the process context. This is done by defining a separate data structure for these threads and relocating some of the information that was embedded in the process context to these thread contexts. The stack was especially a concern, due to the upcall; special handling must be done to make sure that the upcall doesn't mess up the stack, so that the preempted user-level thread can continue afterwards. Lastly, Nathan explained that signals were handled by upcalls.


AUTHORIZATION AND CHARGING IN PUBLIC WLANS USING FREEBSD AND 802.1X

Pekka Nikander, Ericsson Research NomadicLab

The 802.1x standards are well known in the wireless community as link-layer authentication protocols. In this talk, Pekka explained some novel ways of using the 802.1x protocols that might be of interest to people on the move. It is possible to set up a public WLAN that would support various charging schemes via virtual tokens, which people can purchase or earn and later use.

This is implemented on FreeBSD using the netgraph utility. It is basically a filter in the link layer that differentiates traffic based on the MAC address of the client node, which is either authenticated, denied, or let through. The overhead for this service is fairly minimal.

ACPI IMPLEMENTATION ON FREEBSD

Takanori Watanabe Kobe University

ACPI (Advanced Configuration and Power Interface) was proposed as a joint effort by Intel, Toshiba, and Microsoft to provide a standard and finer-grained method of managing power states of individual devices within a system. Such low-level power management is especially important for those mobile and embedded systems that are powered by fixed-capacity energy batteries.

Takanori explained that the ACPI specification is composed of three parts: tables, BIOS, and registers. He was able to implement some functionalities of ACPI in a FreeBSD kernel. The ACPI Component Architecture was implemented by Intel, and it provides a high-level ACPI API to the operating system. Takanori's ACPI implementation is built upon this underlying layer of APIs.

ACCESS CONTROL

Summarized by Florian Buchholz

DESIGN AND PERFORMANCE OF THE OPENBSD STATEFUL PACKET FILTER (PF)

Daniel Hartmeier Systor AG

Daniel Hartmeier described the new stateful packet filter (pf) that replaced IPFilter in the OpenBSD 3.0 release. IPFilter could no longer be included due to licensing issues, and thus there was a need to write a new filter making use of optimized data structures.

The filter rules are implemented as a linked list which is traversed from start to end. Two actions may be taken according to the rules, "pass" or "block": a "pass" action forwards the packet, and a "block" action will drop it. Where more than one rule matches a packet, the last rule wins. Rules that are marked as "final" will immediately terminate any further rule evaluation. An optimization called "skip-steps" was also implemented, where blocks of similar rules are skipped if they cannot match the current packet; these skip-steps are calculated when the rule set is loaded. Furthermore, a state table keeps track of TCP connections, and only packets that match the sequence numbers are allowed. UDP and ICMP queries and replies are considered in the state table, where initial packets will create an entry for a pseudo-connection with a low initial timeout value. The state table is implemented using a balanced binary search tree. NAT mappings are also stored in the state table, whereas application proxies reside in user space. The packet filter is also able to perform fragment reassembly and to modulate TCP sequence numbers to protect hosts behind the firewall.
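The last-match-wins evaluation with "final" rules and skip-steps can be sketched in a few lines. This is a conceptual illustration, not pf's actual code; the rule fields and the precomputed skip index are simplifications.

    def evaluate(rules, packet):
        """rules: list of dicts with a 'match' predicate, an 'action',
        a 'final' flag, and an optional precomputed 'skip' index that jumps
        past rules testing the same values (which therefore cannot match)."""
        action, i = "pass", 0             # default policy
        while i < len(rules):
            r = rules[i]
            if r["match"](packet):
                action = r["action"]      # last matching rule wins ...
                if r["final"]:
                    return action         # ... unless it is marked final
                i += 1
            else:
                i = r.get("skip", i + 1)  # skip-step over similar rules
        return action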

Pf was compared against IPFilter as well as Linux's iptables by measuring throughput and latency with increasing traffic rates and different packet sizes. In a test with a fixed rule set size of 100, iptables outperformed the other two filters, whose results were close to each other. In a second test, where the rule set size was continually increased, iptables consistently had about twice the throughput of the other two (which evaluate the rule set on both the incoming and outgoing interfaces). A third test compared only pf and IPFilter, using a single rule that created state in the state table with a fixed-size state entry; pf reached an overloaded state much later than IPFilter. The experiment was repeated with a variable-size state entry, and pf performed much better than IPFilter for a small number of states.

In general, rule set evaluation is expensive, and benchmarks only reflect extreme cases, whereas in real life other behavior should be observed. Furthermore, the benchmarks show that stateful filtering can actually improve performance, due to cheap state-table lookups as compared to rule evaluation.

ENHANCING NFS CROSS-ADMINISTRATIVE DOMAIN ACCESS

Joseph Spadavecchia and Erez Zadok, Stony Brook University

The speaker presented modifications to an NFS server that allow improved NFS access between administrative domains. A problem lies in the fact that NFS assumes a shared UID/GID space, which makes it unsafe to export files outside the administrative domain of the server. Design goals were to leave the protocol and existing clients unchanged, a minimum amount of server changes, flexibility, and increased performance.

To solve the problem, two techniques are utilized: "range-mapping" and "file-cloaking." Range-mapping maps IDs between client and server. The mapping is performed on a per-export basis and has to be manually set up in an export file. The mappings can be 1-1, N-N, or N-1. In file-cloaking, the server restricts file access based on UID/GID and special cloaking-mask bits. Here, users can only access their own file permissions. The policy on whether or not the file is visible to others is set by the cloaking mask, which is logically ANDed with the file's protection bits. File-cloaking only works, however, if the client doesn't hold cached copies of directory contents and file attributes. Because of this, the clients are forced to re-read directories by incrementing the mtime value of the directory each time it is listed.

To test the performance of the modified server, five different NFS configurations were evaluated: an unmodified NFS server was compared against one server with the modified code included but not used, one with only range-mapping enabled, one with only file-cloaking enabled, and one version with all modifications enabled. For each setup, different file system benchmarks were run. The results show only a small overhead when the modifications are used, generally an increase of below 5%. Another experiment tested the performance of the system with an increasing number of mapped or cloaked entries. The results show that an increase from 10 to 1000 entries resulted in a maximum of about a 14% cost in performance.

One member of the audience pointed out that if clients choose to ignore the changed mtimes from the server and thus still hold caches of the directory entries, the file-cloaking mechanism could be defeated. After a rather lengthy debate, the speaker had to concede that the model doesn't add any extra security. Another question was asked about the scalability of the setup of range-mapping. The speaker referred to application-level tools that could be developed for that purpose.

The software is available at ftp://ftp.fsl.cs.sunysb.edu/pub/enf.

ENGINEERING OPEN SOURCE SOFTWARE

Summarized by Teri Lampoudi

NINGAUI: A LINUX CLUSTER FOR BUSINESS

Andrew Hume, AT&T Labs – Research; Scott Daniels, Electronic Data Systems Corp.

Ningaui is a general purpose, highly available, resilient architecture built from commodity software and hardware. Emphatically, however, Ningaui is not a Beowulf. Hume calls the cluster design the "Swiss canton model," in which there are a number of loosely affiliated independent nodes with data replicated among them. Jobs are assigned by bidding and leases, and cluster services done as session-based servers are sited via generic job assignment. The emphasis is on keeping the architecture end-to-end, checking all work via checksums, and logging everything. The resilience and high availability required by their goal of 8-5 maintenance – vs. the typical 24-7 model where people get paged whenever the slightest thing goes wrong regardless of the time of day – is achieved by job restartability. Finally, all computation is performed on local data, without the use of NFS or network-attached storage.

Hume's message is a hopeful one: despite the many problems encountered – things like kernel and service limits, auto-installation problems, TCP storms, and the like – the existence of source code and the paranoid practice of logging and checksumming everything has helped. The final product performs reasonably well, and it appears that the resilient techniques employed do make a difference. One drawback, however, is that the software mentioned in the paper is not yet available for download.

CPCMS: A CONFIGURATION MANAGEMENT SYSTEM BASED ON CRYPTOGRAPHIC NAMES

Jonathan S. Shapiro and John Vanderburgh, Johns Hopkins University

This paper won the FREENIX Best Paper award.

The basic notion behind the project is the fact that everyone has a pet complaint about CVS, and yet it is currently the configuration manager in most widespread use. Shapiro has unleashed an alternative. Interestingly, he did not begin but ended with the usual host of reasons why CVS is bad. The talk instead began abruptly by characterizing the job of a software configuration manager, continued by stating the namespaces which it must handle, and wrapped up with the challenges faced.

X MEETS Z: VERIFYING CORRECTNESS IN THE PRESENCE OF POSIX THREADS

Bart Massey, Portland State University; Robert T. Bauer, Rational Software Corp.

Massey delivered a humorous talk on the insight gained from applying Z formal specification notation to system software design, rather than the more informal analysis and design process normally used.

The story is told with respect to writing XCB, which replaces the Xlib protocol layer and is supposed to be thread-friendly. But where threads are concerned, deadlock avoidance becomes a hard problem that cannot be solved in an ad-hoc manner, yet full model checking is also too hard. In this case, Massey resorted to Z specification to model the XCB lower layer, abstract away locking and data transfers, and locate fundamental issues. Essentially, the difficulties of searching the literature and locating information relevant to the problem at hand were overcome. As Massey put it, "the formal method saved the day."

FILE SYSTEMS

Summarized by Bosko Milekic

PLANNED EXTENSIONS TO THE LINUX EXT2/EXT3 FILESYSTEM

Theodore Y. Ts'o, IBM; Stephen Tweedie, Red Hat

The speaker presented improvements to the Linux Ext2 file system, with the goal of allowing for various expansions while striving to maintain compatibility with older code. Improvements have been facilitated by a few extra superblock fields that were added to Ext2 not long before the Linux 2.0 kernel was released. The fields allow for file system features to be added without compromising the existing setup; this is done by providing bits indicating the impact of the added features with respect to compatibility. Namely, a file system with the "incompat" bit marked is not allowed to be mounted. Similarly, a "read-only" marking would only allow the file system to be mounted read-only.
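The mount-time logic implied by these three compatibility classes can be sketched as follows; the feature names below are illustrative examples, not the exact kernel constants.

    # Each feature sets a bit in one of three superblock fields, depending on
    # what an older kernel may safely do while not understanding the feature.
    KNOWN_COMPAT    = {"dir_prealloc", "ext_attr"}     # safe to ignore
    KNOWN_RO_COMPAT = {"sparse_super", "large_file"}   # safe only read-only
    KNOWN_INCOMPAT  = {"compression", "filetype"}      # must be understood

    def can_mount(sb, read_write=True):
        """sb: dict with 'compat', 'ro_compat', 'incompat' feature-name sets."""
        if sb["incompat"] - KNOWN_INCOMPAT:
            return False               # unknown incompat feature: refuse mount
        if read_write and (sb["ro_compat"] - KNOWN_RO_COMPAT):
            return False               # unknown ro_compat: read-only at best
        return True                    # unknown compat features are ignored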

Directory indexing replaces linear directory searches with a faster search using a fixed-depth tree and hashed keys. File system size can be dynamically increased, and the expanded inode, doubled from 128 to 256 bytes, allows for more extensions.

Other potential improvements were discussed as well, in particular pre-allocation for contiguous files, which allows for better performance in certain setups by pre-allocating contiguous blocks. Security-related modifications, extended attributes, and ACLs were mentioned. An implementation of these features already exists but has not yet been merged into the mainline Ext2/3 code.

RECENT FILESYSTEM OPTIMISATIONS ON FREEBSD

Ian Dowse, Corvil Networks; David Malone, CNRI, Dublin Institute of Technology

David Malone presented four important file system optimizations for the FreeBSD OS: soft updates, dirpref, vmiodir, and dirhash. It turns out that certain combinations of the optimizations (beautifully illustrated in the paper) may yield performance improvements of anywhere between 2 and 10 orders of magnitude for real-world applications.

All four of the optimizations deal with file system metadata. Soft updates allow for asynchronous metadata updates. Dirpref changes the way directories are organized, attempting to place child directories closer to their parents, thereby increasing locality of reference and reducing disk-seek times. Vmiodir trades some extra memory in order to achieve better directory caching. Finally, dirhash, which was implemented by Ian Dowse, changes the way in which entries are searched in a directory. Specifically, FFS uses a linear search to find an entry by name; dirhash builds a hashtable of directory entries on the fly. For a directory of size n with a working set of m files, a search that in certain cases could have been O(nm) has been reduced, due to dirhash, to effectively O(n + m).
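The effect is easy to see in miniature. The sketch below builds the hash table on the first lookup in a directory and answers subsequent lookups in constant time; it is an illustration of the complexity argument, not the FFS code.

    class Directory:
        def __init__(self, entries):
            self.entries = entries        # list of (name, inode), on-disk order
            self._index = None            # built lazily, like dirhash

        def lookup(self, name):
            if self._index is None:       # first lookup: one O(n) pass ...
                self._index = {n: ino for n, ino in self.entries}
            return self._index.get(name)  # ... then O(1) per name

    # m lookups in a directory of n entries cost O(n + m) instead of O(n*m).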

FILESYSTEM PERFORMANCE AND SCALABILITY IN LINUX 2.4.17

Ray Bryant, SGI; Ruth Forester, IBM LTC; John Hawkes, SGI

This talk focused on performance evaluation of a number of file systems available and commonly deployed on Linux machines. Comparisons under various configurations of Ext2, Ext3, ReiserFS, XFS, and JFS were presented.

The benchmarks chosen for the data gathering were pgmeter, filemark, and the AIM Benchmark Suite VII. Pgmeter measures the rate of data transfer of reads/writes of a file under a synthetic workload. Filemark is similar to postmark in that it is an operation-intensive benchmark, although filemark is threaded and offers various other features that postmark lacks. AIM VII measures performance for various file-system-related functionalities; it offers various metrics under an imposed workload, thus stressing the performance of the file system not only under I/O load but also under significant CPU load.

Tests were run on three different setups: a small, a medium, and a large configuration. ReiserFS and Ext2 appear at the top of the pile for smaller and medium setups. Notably, XFS and JFS perform worse for smaller system configurations than the others, although XFS clearly appears to generate better numbers than JFS. It should be noted that XFS seems to scale well under a higher load. This was most evident in the large-system results, where XFS appears to offer the best overall results.

THINGS TO THINK ABOUT

Summarized by Bosko Milekic

SPEEDING UP KERNEL SCHEDULER BY REDUCING CACHE MISSES

Shuji Yamamura, Akira Hirai, Mitsuru Sato, Masao Yamamoto, Akira Naruse, Kouichi Kumon, Fujitsu Labs

This was an interesting talk pertaining to the effects of cache coloring for task structures in the Linux kernel scheduler (Linux kernel 2.4.x). The speaker first presented some interesting benchmark numbers for the Linux scheduler, showing that as the number of processes on the task queue was increased, the performance decreased. The authors used some really nifty hardware to measure the number of bus transactions throughout their tests and were thus able to reasonably quantify the impact that cache misses had in the Linux scheduler.

Their experiments led them to implement a cache coloring scheme for task structures, which were previously aligned on 8KB boundaries and therefore were eventually being mapped to the same cache lines. This unfortunate placement of task structures in memory induced a significant number of cache misses as the number of tasks grew in the scheduler.

The implemented solution consisted of aligning task structures to more evenly distribute cached entries across the L2 cache. The result was inevitably fewer cache misses in the scheduler. Some negative effects were observed in certain situations; these were primarily due to more cache slots being used by task structure data in the scheduler, thus forcing data previously cached there to be pushed out.
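Why 8KB-aligned structures collide is a short calculation; the cache geometry below is an assumed example, not the hardware used in the talk.

    LINE, SETS, WAYS = 64, 2048, 8      # assumed L2: 1MB, 8-way, 64-byte lines

    def cache_set(addr):
        return (addr // LINE) % SETS

    # The scheduler walks the run queue touching the same hot field of every
    # task structure; with 8KB alignment those fields crowd into few sets.
    hot = {cache_set(task * 8192) for task in range(1000)}
    print(len(hot), "sets used of", SETS)              # 16 of 2048
    print("conflicts begin beyond", len(hot) * WAYS, "runnable tasks")

    # Cache coloring gives each task structure a small per-task offset (a
    # different multiple of LINE), spreading the hot fields over many more
    # sets so the run-queue walk stops thrashing the same handful of lines.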


WORK-IN-PROGRESS REPORTS

Summarized by Brennan Reynolds

RESOURCE VIRTUALIZATION TECHNIQUES FOR WIDE-AREA OVERLAY NETWORKS

Kartik Gopalan, Stony Brook University

This work addressed the issue of provisioning a maximum number of virtual overlay networks (VONs) with diverse quality of service (QoS) requirements on a single physical network. Each of the VONs is logically isolated from the others to ensure the QoS requirements. Gopalan mentioned several critical research issues with this problem that are currently being investigated. Dealing with how to provision the network at various levels (link, route, or path) and then enforce the provisioning at run-time is one of the toughest challenges. Currently, Gopalan has developed several algorithms to handle admission control, end-to-end QoS, route selection, scheduling, and fault tolerance in the network.

For more information, visit http://www.ecsl.cs.sunysb.edu.

VISUALIZING SOFTWARE INSTABILITY

Jennifer Bevan, University of California, Santa Cruz

Detection of instability in software has typically been an afterthought. The point in the development cycle when the software is reviewed for instability is usually after it is difficult and costly to go back and perform major modifications to the code base. Bevan has developed a technique to allow the visualization of unstable regions of code that can be used much earlier in the development cycle. Her technique creates a time series of dependence graphs that include clusters and lines called fault lines. From the graphs, a developer is able to easily determine where the unstable sections of code are and proactively restructure them. She is currently working on a prototype implementation.

For more information, visit http://www.cse.ucsc.edu/~jbevan/evo_viz.

RELIABLE AND SCALABLE PEER-TO-PEER WEB DOCUMENT SHARING

Li Xiao, College of William and Mary

The idea presented by Xiao would allow end users to share the content of their Web browser caches with neighboring Internet users. The rationale for this is that today's end users are increasingly connected to the Internet over high-speed links, and the browser caches are becoming large storage systems. Therefore, if an individual accesses a page which does not exist in their local cache, Xiao is suggesting that they first query other end users for the content before trying to access the machine hosting the original. This strategy does have some serious problems associated with it that still need to be addressed, including ensuring the integrity of the content and protecting the identity and privacy of the end users.

SEREL: FAST BOOTING FOR UNIX

Leni Mayo Fast Boot Software

Serel is a tool that generates a visual representation of a UNIX machine's boot-up sequence. It can be used to identify the critical path and can show if a particular service or process blocks for an extended period of time. This information could be used to determine where the largest performance gains could be realized by tuning the order of execution at boot-up. Serel creates a dependency graph, expressed in XML, during boot-up; this graph is then used to create the visual representation. Currently the tool only works on POSIX-compliant systems, but Mayo stated that he would be porting it to other platforms. Other extensions that were mentioned included having the metadata include the use of shared libraries and monitoring the suspend/resume sequence of portable machines.

For more information, visit http://www.fastboot.org.


BERKELEY DB XML

John Merrells Sleepycat Software

Merrells gave a quick introduction and overview of the new XML library for Berkeley DB. The library specializes in storage and retrieval of XML content through a tool called XPath. The library allows for multiple containers per document and stores everything natively as XML. The user is also given a wide range of elements to create indices with, including edges, elements, text strings, or presence. The XPath tool consists of a query parser, generator, optimizer, and execution engine. To conclude his presentation, Merrells gave a live demonstration of the software.

For more information, visit http://www.sleepycat.com/xml.

CLUSTER-ON-DEMAND (COD)

Justin Moore Duke University

Modern clusters are growing at a rapid rate. Many have pushed beyond the 5000-machine mark, and deploying them results in large expenses as well as management and provisioning issues. Furthermore, if the cluster is "rented" out to various users, it is very time-consuming to configure it to a user's specs, regardless of how long they need to use it. The COD work presented creates dynamic virtual clusters within a given physical cluster. The goal was to have a provisioning tool that would automatically select a chunk of available nodes and install the operating system and middleware specified by the customer in a short period of time. This would allow a greater use of resources, since multiple virtual clusters can exist at once. By using a virtual cluster, the size can be changed dynamically. Moore stated that they have created a working prototype and are currently testing and benchmarking its performance.

For more information, visit http://www.cs.duke.edu/~justin/cod.


CATACOMB

Elias Sinderson, University of California, Santa Cruz

Catacomb is a project to develop a database-backed DASL module for the Apache Web server. It was designed as a replacement for WebDAV. The initial release of the module only contains support for a MySQL database but could be extended to others. Sinderson briefly touched on the performance of the module: it was comparable to the mod_dav Apache module for all query types but search. The presentation was concluded with remarks about adding support for the lock method and including ACL specifications in the future.

For more information, visit http://ocean.cse.ucsc.edu/catacomb.

SELF-ORGANIZING STORAGE

Dan Ellard Harvard University

Ellard's presentation introduced a storage system that tunes itself based on the workload of the system without requiring the intervention of the user. The intelligence is implemented as a virtual self-organizing disk that resides below any file system. The virtual disk observes the access patterns exhibited by the system and then attempts to predict what information will be accessed next. An experiment was done using an NFS trace at a large ISP on one of their email servers. Ellard's self-organizing storage system worked well, which he attributes to the fact that most of the files being requested were large email boxes. Areas of future work include exploration of the length and detail of the data collection stage as well as the CPU impact of running the virtual disk layer.

VISUAL DEBUGGING

John Costigan Virginia Tech

Costigan feels that the state of current debugging facilities in the UNIX world is not as good as it should be. He proposes the addition of several elements to programs like ddd and gdb. The first addition is including separate heap and stack components. Another is to display only certain fields of a data structure and have the ability to zoom in and out if needed. Finally, the debugger should provide the ability to visually present complex data structures, including linked lists, trees, etc. He said that a beta version of a debugger with these abilities is currently available.

For more information, visit http://infovis.cs.vt.edu/datastruct.

IMPROVING APPLICATION PERFORMANCE THROUGH SYSTEM-CALL COMPOSITION

Amit Purohit University of Stony Brook

Web servers perform a huge number of context switches and much internal data copying during normal operation. These two elements can drastically limit the performance of an application, regardless of the hardware platform it is run on. The Compound System Call (CoSy) framework is an attempt to reduce the performance penalty for context switches and data copies. It includes a user-level library and several kernel facilities that can be used via system calls. The library provides the programmer with a complete set of memory-management functions. Performance tests using the CoSy framework showed a large saving for context switches and data copies. The only area where the savings between conventional libraries and CoSy were negligible was for very large file copies.

For more information, visit http://www.cs.sunysb.edu/~purohit.

ELASTIC QUOTAS

John Oscar Columbia University

Oscar began by stating that most resources are flexible, but that to date disks have not been. While approximately 80 percent of files are short-lived, disk quotas are hard limits imposed by administrators. The idea behind elastic quotas is to have a non-permanent storage area that each user can temporarily use. Oscar suggested the creation of an ehome directory structure to be used in deploying elastic quotas. Currently, he has implemented elastic quotas as a stackable file system.

There were also several trends apparent in the data. Many users would ssh to their remote host (good) but then launch an email reader on the local machine, which would connect via POP to the same host and send the password in clear text (bad). This situation can easily be remedied by tunneling protocols like POP through SSH, but it appeared that many people were not aware this could be done. While most of his comments on the use of the wireless network were negative, the list of passwords he had collected showed that people were indeed using strong passwords. His recommendations were to educate and encourage people to use protocols like IPSec, SSH, and SSL when conducting work over a wireless network, because you never know who else is listening.

SYSTRACE

Niels Provos University of Michigan

How can one be sure the applications one is using actually do exactly what their developers said they do? Short answer: you can't. People today are using a large number of complex applications, which means it is impossible to check each application thoroughly for security vulnerabilities. There are tools out there that can help, though. Provos has developed a tool called systrace that allows a user to generate a policy of acceptable system calls a particular application can make. If the application attempts to make a call that is not defined in the policy, the user is notified and allowed to choose an action. Systrace includes an auto-policy generation mechanism, which uses a training phase to record the actions of all programs the user executes. If the user chooses not to be bothered by applications breaking policy, systrace allows default enforcement actions to be set. Currently, systrace is implemented for FreeBSD and NetBSD, with a Linux version coming out shortly.


For more information, visit http://www.citi.umich.edu/u/provos/systrace.

VERIFIABLE SECRET REDISTRIBUTION FOR SURVIVABLE STORAGE SYSTEMS

Ted Wong Carnegie Mellon University

Wong presented a protocol that can be used to re-create a file distributed over a set of servers, even if one of the servers is damaged. In his scheme, the user must choose how many shares the file is split into and the number of servers it will be stored across. The goal of this work is to provide persistent, secure storage of information even if it comes under attack. Wong stated that his design of the protocol is complete, and he is currently building a prototype implementation of it.

For more information, visit http://www.cs.cmu.edu/~tmwong/research.
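
As a rough illustration of the underlying idea of splitting data into shares held by different servers (this simplified sketch uses an n-of-n XOR split; Wong's protocol uses verifiable threshold sharing and share redistribution, which are considerably more involved):

    # Split a secret into n shares so that no single server holds the data;
    # all n shares are needed to reconstruct. A simplified stand-in for the
    # threshold scheme used in the actual protocol.
    import os
    from functools import reduce

    def xor_bytes(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def split(secret: bytes, n: int) -> list:
        shares = [os.urandom(len(secret)) for _ in range(n - 1)]
        last = reduce(xor_bytes, shares, secret)   # secret XORed with all random shares
        return shares + [last]

    def reconstruct(shares: list) -> bytes:
        return reduce(xor_bytes, shares)

    if __name__ == "__main__":
        block = b"file block to protect"
        shares = split(block, 4)          # store each share on a different server
        assert reconstruct(shares) == block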

THE GURU IS IN SESSIONS

LINUX ON LAPTOP/PDA

Bdale Garbee, HP Linux Systems Operation

Summarized by Brennan Reynolds

This was a roundtable-like discussion with Garbee acting as moderator. He is currently in charge of the Debian distribution of Linux and is involved in porting Linux to platforms other than i386. He has successfully helped port the kernel to the Alpha, Sparc, and ARM, and is currently working on a version to run on VAX machines.

A discussion was held on which file system is best for battery life. The Reiser and Ext3 file systems were discussed; people commented that when using Ext3 on their laptops the disk would never spin down and thus was always consuming a large amount of power. One suggestion was to use a RAM-based file system for any volume that requires a large number of writes and to only have the file system written to disk at infrequent intervals or when the machine is shut down or suspended.

A question was directed to Garbee about his knowledge of how the HP-Compaq merger would affect Linux. Garbee thought that the merger was good for Linux, both within the company and for the open source community as a whole. He stated that the new company would be the number one shipper of systems with Linux as the base operating system. He also fielded a question about the support of older HP servers and their ability to run Linux, saying that indeed Linux has been ported to them and he personally had several machines in his basement running it.

A question which sparked a large number of responses concerned problems people had with running Linux on mobile platforms. Most people actually did not have many problems at all. There were a few cases of a particular machine model not working, but there were only two widespread issues: docking stations and the new ACPI power management scheme. Docking stations still appear to be somewhat of a headache, but most people had developed ad hoc solutions for getting them to work. The ACPI power management scheme developed by Intel does not appear to have a quick solution. Ted Ts'o, one of the head kernel developers, stated that there are fundamental architectural problems with the 2.4 series kernel that do not easily allow ACPI to be added. However, he also stated that ACPI is supported in the newest development kernel, 2.5, and will be included in 2.6.

The final topic was the possibility of purchasing a laptop without paying any charge/tax for Microsoft Windows. The consensus was that currently this is not possible. Even if the machine comes with Linux or without any operating system, the vendors have contracts in place which require them to pay Microsoft for each machine they ship if that machine is capable of running a Microsoft operating system. The only way this will change is if the demand for other operating systems increases to the point that vendors renegotiate their contracts with Microsoft, and no one saw this happening in the near future.

LARGE CLUSTERS

Andrew Hume, AT&T Labs – Research

Summarized by Teri Lampoudi

This was a session on clusters, big data, and resilient computing. Given that the audience was primarily interested in clusters and not necessarily big data or beating nodes to a pulp, the conversation revolved mainly around what it takes to make a heterogeneous cluster resilient. Hume also presented a FREENIX paper on his current cluster project which explained in more detail much of what was abstractly claimed in the guru session.

Hume's use of the word "cluster" referred not to what I had assumed would be a Beowulf-type system but to a loosely coupled farm of machines in which concurrency was much less of an issue than it would be in a Beowulf. In fact, the architecture Hume described was designed to deal with large amounts of independent transaction processing, essentially the process of billing calls, which requires no interprocess message passing or anything of the sort a parallel scientific application might.

Where does the large data come in? Since the transactions in question consist of large numbers of flat files, a mechanism for getting the files onto nodes and off-loading the results is necessary. In this particular cluster, named Ningaui, this task is handled by the "replication manager," which generated a large amount of interest in the audience. All management functions in the cluster, as well as job allocation, constitute services that are to be bid for and leased out to nodes. Furthermore, failures are handled by simply repeating the failed task until it is completed properly, if that is possible, and if it is not, then looking through detailed logs to discover what went wrong. Logging and testing for the successful completion of each task are what make this model resilient. Every management operation is logged at some appropriate level of detail, generating 120MB/day, and every single file transfer is md5 checksummed. This was characterized by Hume as a "patient" style of management, where no one controls the cluster, scheduling behavior is emergent rather than stringently planned, and nodes are free to drop in and out of the cluster; a downside to this model is the increase in latency, offset by the fact that the cluster continues to function unattended outside of 8-to-5 support hours.
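
As a small sketch of the retry-until-verified pattern described here (the transfer function and log format are invented stand-ins, not Ningaui's actual code):

    # Checksum every transfer, log every attempt, and simply repeat the task
    # until it completes properly -- the "patient" style described above.
    import hashlib
    import logging
    import shutil

    logging.basicConfig(level=logging.INFO)

    def md5sum(path):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def copy_to_node(src, dst):
        shutil.copyfile(src, dst)      # stand-in for the real node-to-node transfer

    def replicate(src, dst, max_attempts=5):
        want = md5sum(src)
        for attempt in range(1, max_attempts + 1):
            copy_to_node(src, dst)
            got = md5sum(dst)
            logging.info("transfer %s -> %s attempt %d md5=%s", src, dst, attempt, got)
            if got == want:
                return True            # verified; the task is complete
        logging.error("giving up on %s after %d attempts", src, max_attempts)
        return False                   # leave the detailed log for a human to inspect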

In response to questions about the projected scalability of such a scheme, Hume said the system would presumably scale from the present 60 nodes to about 100 without modifications, but that scaling higher would be a matter for further design. The guru session ended on a somewhat unrelated note regarding the problems of getting data off mainframes and onto Linux machines – issues of fixed- vs. variable-length encoding of data blocks resulting in useful information being stripped by FTP, and problems in converting COBOL copybooks to something useful on a C-based architecture. To reiterate Hume's claim, there are homemade solutions for some of these problems, and the reader should feel free to contact Mr. Hume for them.

WRITING PORTABLE APPLICATIONS

Nick Stoughton, MSB Consultants

Summarized by Josh Lothian

Nick Stoughton addressed a concern that developers are facing more and more: writing portable applications. He proposed that there is no such thing as a "portable application," only those that have been ported. Developers are still in search of the Holy Grail of applications programming: a write-once, compile-anywhere program. Languages such as Java are leading up to this, but virtual machines still have to be ported, paths will still be different, etc. Web-based applications are another area that could be promising for portable applications.

Nick went on to talk about the future of portability. The recently released POSIX 2001 standards are expected to help the situation. The 2001 revision expands the original POSIX spec into seven volumes. Even though POSIX 2001 is more specific than the previous release, Nick pointed out that even this standard allows for small differences where vendors may define their own behaviors. This can only hurt developers in the long run. Along these lines, it was pointed out that there may be room for automated tools that have the capability to check application code and determine the level of portability and standards compliance of the source.

NETWORK MANAGEMENT SYSTEM PERFORMANCE TUNING

Jeff Allen, Tellme Networks, Inc.

Summarized by Matt Selsky

Jeff Allen, author of Cricket, discussed network tuning. The basic idea of tuning is: measure, twiddle, and measure again. Cricket can be used for large installations, but it's overkill for smaller installations, for which MRTG is better suited. You don't need Cricket to do measurement; doing something like a shell script that dumps data to Excel is good, but you need to get some sort of measurement. Cricket was not designed for billing; it was meant to help answer questions.

Useful tools and techniques covered included looking for the difference between machines in order to identify the causes for the differences; measuring, but also thinking about what you're measuring and why; and remembering to use strace and tcpdump to identify problems. Problems repeat themselves; performance tuning requires an intricate understanding of all the layers involved to solve the problem and simplify the automation. (Having a computer science background helps but is not essential.) If you can reproduce the problem, you then should investigate each piece to find the bottleneck. If you can't reproduce the problem, then measure. You need to understand all the system interactions. Measurement can help determine whether things are actually slow or if the user is imagining it.

Troubleshooting begins with a hunch, but scientific processes are essential. You should be able to determine the event stream, or when each event occurs; having observation logs can help. Some hints provided include checking cron for periodic anomalies, slowing down /bin/rm to avoid I/O overload, and looking for unusual kernel CPU usage.

Also, try to use lightweight monitoring to reduce overhead. You don't need to monitor every resource on every system, but you should monitor those resources on those systems that are essential. Don't check things too often, since you can introduce overhead that way. Lots of information can be gathered from the /proc file system interface to the kernel.
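
As a small example of the kind of lightweight /proc-based sampling being suggested (the counters chosen and the CSV output are just one possibility; the file layouts assumed are Linux's):

    # Read a couple of cheap kernel counters from /proc and append them to a
    # CSV that a spreadsheet or an MRTG/Cricket-style tool could graph.
    import time

    def sample():
        with open("/proc/loadavg") as f:
            load1 = float(f.read().split()[0])       # 1-minute load average
        rx = tx = 0
        with open("/proc/net/dev") as f:
            for line in f.readlines()[2:]:           # skip the two header lines
                name, data = line.split(":", 1)
                fields = data.split()
                rx += int(fields[0])                 # bytes received
                tx += int(fields[8])                 # bytes transmitted
        return load1, rx, tx

    if __name__ == "__main__":
        with open("netstats.csv", "a") as out:
            for _ in range(3):                       # keep the probe itself cheap
                load1, rx, tx = sample()
                out.write("%d,%.2f,%d,%d\n" % (time.time(), load1, rx, tx))
                time.sleep(2)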

BSD BOF

Host: Kirk McKusick, Author and Consultant

Summarized by Bosko Milekic

The BSD Birds of a Feather session at USENIX started off with five quick and informative talks and ended with an exciting discussion of licensing issues.

Christos Zoulas, for the NetBSD Project, went over the primary goals of the NetBSD Project (namely portability, a clean architecture, and security) and then proceeded to briefly discuss some of the challenges that the project has encountered. The successes and areas for improvement of 1.6 were then examined. NetBSD has recently seen various improvements in its cross-build system, packaging system, and third-party tool support. For what concerns the kernel, a zero-copy TCP/UDP implementation has been integrated, and a new pmap code for arm32 has been introduced, as have some other rather major changes to various subsystems. More information is available in Christos's slides, which are now available at http://www.netbsd.org/gallery/events/usenix2002.

Theo de Raadt of the OpenBSD Project presented a rundown of various new things that the OpenBSD and OpenSSH teams have been looking at. Theo's talk included an amusing description (and pictures) of some of the events that transpired during OpenBSD's hack-a-thon, which occurred the week before USENIX '02. Finally, Theo mentioned that, following complications with the IPFilter license, the OpenBSD team had done a fairly extensive license audit.

Robert Watson of the FreeBSD Project brought up various examples of FreeBSD being used in the real world. He went on to describe a wide variety of changes that will surface with the upcoming FreeBSD release 5.0. Notably, 5.0 will introduce a re-architecting of the kernel aimed at providing much more scalable support for SMP machines. 5.0 will also include an early implementation of KSE, FreeBSD's version of scheduler activations, large improvements to pccard, and significant framework bits from the TrustedBSD project.

Mike Karels from Wind River Systems presented an overview of how BSD/OS has been evolving following the Wind River Systems acquisition of BSDi's software assets. Ernie Prabhakar from Apple discussed Apple's OS X and its success on the desktop. He explained how OS X aims to bring BSD to the desktop while striving to maintain a strong and stable core, one that is heavily based on BSD technology.

The BoF finished with a discussion on licensing; specifically, some folks questioned the impact of Apple's licensing for code from their core OS (the Darwin Project) on the BSD community as a whole.

CLOSING SESSION

HOW FLIES FLY

Michael H. Dickinson, University of California, Berkeley

Summarized by JD Welch

In this lively and offbeat talk, Dickinson discussed his research into the mechanisms of flight in the fruit fly (Drosophila melanogaster) and related the autonomic behavior to that of a technological system. Insects are the most successful organisms on earth, due in no small part to their early adoption of flight. Flies travel through space on straight trajectories interrupted by saccades, jumpy turning motions somewhat analogous to the fast, jumpy movements of the human eye. Using a variety of monitoring techniques, including high-speed video in a "virtual reality flight arena," Dickinson and his colleagues have observed that flies respond to visual cues to decide when to saccade during flight. For example, more visual texture makes flies saccade earlier (i.e., further away from the arena wall), described by Dickinson as a "collision control algorithm."

Through a combination of sensory inputs, flies can make decisions about their flight path. Flies possess a visual "expansion detector" which, at a certain threshold, causes the animal to turn a certain direction. However, expansion in front of the fly sometimes causes it to land. How does the fly decide? Using the virtual reality device to replicate various visual scenarios, Dickinson observed that the flies fixate on vertical objects that move back and forth across the field. Expansion on the left side of the animal causes it to turn right, and vice versa, while expansion directly in front of the animal triggers the legs to flare out in preparation for landing.

If the eyes detect these changes, how are the responses implemented? The flies' wings have "power muscles" controlled by mechanical resonance (as opposed to the nervous system), which drive the wings, combined with neurally activated "steering muscles," which change the configuration of the wing joints. Subtle variations in the timing of impulses correspond to changes in wing movement, controlled by sensors in the "wing-pit." A small wing-like structure, the haltere, controls equilibrium by beating constantly.

Voluntary control is accomplished by the halteres, whose signals can interfere with the autonomic control of the stroke cycle. The halteres have "steering muscles" as well, and information derived from the visual system can turn off the haltere or trick it into a "virtual" problem requiring a response.

Dickinson has also studied the aerodynamics of insect flight using a device called the "Robo Fly," an oversized mechanical insect wing suspended in a large tank of mineral oil. Interestingly, Dickinson observed that the average lift generated by the flies' wings is under its body weight; the flies use three mechanisms to overcome this, including rotating the wings (rotational lift).

Insects are extraordinarily robust creatures, and because Dickinson analyzed the problem in a systems-oriented way, these observations and analysis are immediately applicable to technology; the response system can be used as an efficient search algorithm for control systems in autonomous vehicles, for example.

THE AFS WORKSHOP

Summarized by Garry Zacheiss, MIT

Love Hornquist-Astrand of the Arla development team presented the Arla status report. Arla 0.35.8 has been released. Scheduled for release soon are improved support for Tru64 UNIX, MacOS X, and FreeBSD, improved volume handling, and implementation of more of the vos/pts subcommands. It was stressed that MacOS X is considered an important platform, and that a GUI configuration manager for the cache manager, a GUI ACL manager for the Finder, and a graphical login that obtains Kerberos tickets and AFS tokens at login time were all under development.

Future goals planned for the underlying AFS protocols include GSSAPI/SPNEGO support for Rx, performance improvements to Rx, an enhanced disconnected mode, and IPv6 support for Rx; an experimental patch is already available for the latter. Future Arla-specific goals include improved performance, partial file reading, and increased stability for several platforms. Work is also in progress for the RXGSS protocol extensions (integrating GSSAPI into Rx). A partial implementation exists, and work continues as developers find time.

Derrick Brashear of the OpenAFS development team presented the OpenAFS status report. OpenAFS was released immediately prior to the conference; OpenAFS 1.2.5 fixed a remotely exploitable denial-of-service attack in several OpenAFS platforms, most notably IRIX and AIX. Future work planned for OpenAFS includes better support for MacOS X, including working around Finder interaction issues. Better support for the BSDs is also planned: FreeBSD has a partially implemented client; NetBSD and OpenBSD have only server programs available right now. AIX 5, Tru64 5.1A, and MacOS X 10.2 (aka Jaguar) are all planned for the future. Other planned enhancements include nested groups in the ptserver (code donated by the University of Michigan awaits integration), disconnected AFS, and further work on a native Windows client. Derrick stated that the guiding principles of OpenAFS were to maintain compatibility with IBM AFS, support new platforms, ease the administrative burden, and add new functionality.

AFS performance was discussed. The openafs.org cell consists of two file servers, one in Stockholm, Sweden, and one in Pittsburgh, PA. AFS works reasonably well over trans-Atlantic links, although OpenAFS clients don't determine which file server to talk to very effectively. Arla clients use RTTs to the server to determine the optimal file server to fetch replicated data from. Modifications to OpenAFS to support this behavior in the future are desired.

Jimmy Engelbrecht and Harald Barth of KTH discussed their AFSCrawler script, written to determine how many AFS cells and clients were in the world, what implementations/versions they were (Arla vs. IBM AFS vs. OpenAFS), and how much data was in AFS. The script unfortunately triggered a bug in IBM AFS 3.6–derived code, causing some clients to panic while handling a specific RPC. This has since been fixed in OpenAFS 1.2.5 and the most recent IBM AFS patch level, and all AFS users are strongly encouraged to upgrade. No release of Arla is vulnerable to this particular denial-of-service attack. There was an extended discussion of the usefulness of this exploration. Many sites believed this was useful information and such scanning should continue in the future, but only on an opt-in basis.

Many sites face the problem of managing Kerberos/AFS credentials for batch-scheduled jobs. Specifically, most batch processing software needs to be modified to forward tickets as part of the batch submission process, renew tickets and tokens while the job is in the queue and for the lifetime of the job, and properly destroy credentials when the job completes. Ken Hornstein of NRL was able to pay a commercial vendor to support Kerberos 4/5 credential management in their product, although they did not implement AFS token management. MIT has implemented some of the desired functionality in OpenPBS and might be able to make it available to other interested sites.

Tools to simplify AFS administration were discussed, including:

AFS Balancer: A tool written by CMU to automate the process of balancing disk usage across all servers in a cell. Available from ftp://ftp.andrew.cmu.edu/pub/AFS-Tools/balance-1.1b.tar.gz.

Themis: Themis is KTH's enhanced version of the AFS tool "package" for updating files on local disk from a central AFS image. KTH's enhancements include allowing the deletion of files, simplifying the process of adding a file, and allowing the merging of multiple rule sets for determining which files are updated. Themis is available from the Arla CVS repository.

Stanford was presented as an example of a large AFS site. Stanford's AFS usage consists of approximately 14TB of data in AFS, in the form of approximately 100,000 volumes; 33TB of storage is available in their primary cell, ir.stanford.edu. Their file servers consist entirely of Solaris machines running a combination of Transarc 3.6 patch-level 3 and OpenAFS 1.2.x, while their database servers run OpenAFS 1.2.2. Their cell consists of 25 file servers, using a combination of EMC and Sun StorEdge hardware. Stanford continues to use the kaserver for their authentication infrastructure, with future plans to migrate entirely to an MIT Kerberos 5 KDC.

Stanford has approximately 3,400 clients on their campus, not including SLAC (Stanford Linear Accelerator); approximately 2,100 AFS clients from outside Stanford contact their cell every month. Their supported clients are almost entirely IBM AFS 3.6 clients, although they plan to release OpenAFS clients soon. Stanford currently supports only UNIX clients. There is some on-campus presence of Windows clients, but Stanford has never publicly released or supported it. They do intend to release and support the MacOS X client in the near future.

All Stanford students, faculty, and staff are assigned AFS home directories, with a default quota of 50MB, for a total of approximately 550GB of user home directories. Other significant uses of AFS storage are data volumes for workstation software (400GB) and volumes for course Web sites and assignments (100GB).

AFS usage at Intel was also presented. Intel has been an AFS site since 1994. They had bad experiences with the IBM 3.5 Linux client; their experience with OpenAFS on Linux 2.4.x kernels has been much better. They use and are satisfied with the OpenAFS IA64 Linux port. Intel has hundreds of OpenAFS 1.2.3 and 1.2.4 clients in many production cells accessing data stored on IBM AFS file servers. They have not encountered any interoperability issues. Intel has some concerns about OpenAFS: they would like to purchase commercial support for OpenAFS and to see OpenAFS support for HP-UX on both PA-RISC and Itanium hardware. HP-UX support is currently unavailable due to a specific HP-UX header file being unavailable from HP; this may be available soon. Intel has not yet committed to migrating their file servers to OpenAFS and are unsure if they will do so without commercial support.

Backups are a traditional topic of discussion at AFS workshops, and this time was no exception. Many users complain that the traditional AFS backup tools ("backup" and "butc") are complex and difficult to automate, requiring many home-grown scripts and much user intervention for error recovery. An additional complaint was that the traditional AFS tools do not support file- or directory-level backups and restores; data must be backed up and restored at the volume level.

Mitch Collinsworth of Cornell presented work to make AMANDA, the free backup software from the University of Maryland, suitable for AFS backups. Using AMANDA for AFS backups allows one to share AFS backup tapes with non-AFS backups, easily run multiple backups in parallel, automate error recovery, and provide a robust degraded mode that prevents tape errors from stopping backups altogether. Their implementation allows for full volume restores as well as individual directory and file restores. They have finished coding this work and are in the process of testing and documenting it.

Peter Honeyman of CITI at the University of Michigan spoke about work he has proposed to replace Rx with RPCSEC GSS in OpenAFS; this would allow AFS to use a TCP-based transport mechanism rather than the UDP-based Rx, and possibly gain better congestion control, dynamic adaptation, and fragmentation avoidance as a result. RPCSEC GSS uses the GSSAPI to authenticate SUN ONC RPC. RPCSEC GSS is transport agnostic, provides strong security, is a developing Internet standard, and has multiple open source implementations. Backward compatibility with existing AFS servers and clients is an important goal of this project.


quality work (mostly), lots of hands, and an intensely engineering-driven approach. The downsides include no support structure, limited project marketing expertise, and volunteers having limited time.

In 1998, Sendmail was becoming a success disaster. A success disaster includes two key elements: (1) volunteers start spending all their time supporting current functionality instead of enhancing (this heavy burden can lead to project stagnation and eventually death); (2) competition from the better-funded sources can lead to FUD (fear, uncertainty, and doubt) about the project.

Eric then led listeners down the path he took to turn Sendmail into a business. Eric envisioned Sendmail, Inc. as a smallish and close-knit company but soon realized that the family atmosphere he desired could not last as the company grew. He observed that maybe only the first 10 people will buy into your vision, after which each individual is primarily interested in something else (e.g., power, wealth, or status). He warns that there will be people working for you that you do not like. Eric particularly did not care for sales people, but he emphasized how important sales, marketing, finance, and legal people are to companies.

The nature of companies in infancy can be summed up as: you need money, and you need investors in order to get money. Investors want a return on investment, so money management becomes critical. Companies therefore must optimize money functions. Eric's experiences with investors were lessons all in themselves. More than giving money, good investors can provide connections and insight, so you don't want investors who don't believe in you.

A company must have bug history, change management, and people who understand the product and have a sense of the target market. Eric now sees the importance of marketing target research. A company can never forget the simple equation that drives business: profit = revenue - expense.

Finally, the concept of value should be based on what is valuable to the customer. Eric observed that his customers needed documentation, extra value in the product that they couldn't get elsewhere, and technical support.

Eric learned hard lessons along the way. If you want to do interesting open source, it might be best not to be too successful; you don't want to let the larger dragons (companies) notice you. If you want a commercial user base, you have to manage process, watch corporate culture, provide value to customers, watch bottom lines, and develop a thick skin.

Starting a company is easy, but perpetuating it is extremely difficult.

INFORMATION VISUALIZATION FOR SYSTEMS PEOPLE

Tamara Munzner, University of British Columbia

Summarized by JD Welch

Munzner presented interesting evidence that information visualization, through interactive representations of data, can help people to perform tasks more effectively and reduce the load on working memory.

A popular technique in visualizing data is abstracting it through node-link representation – for example, a list of cross-references in a book. Representing the relationships between references in a graphical (node-link) manner "offloads cognition to the human perceptual system." These graphs can be produced manually, but automated drawing allows for increased complexity and quicker production times.

The way in which people interpret visual data is important in designing effective visualization systems. Attention to pre-attentive visual cues is key to success. Picking out a red dot among a field of identically shaped blue dots is significantly faster than with the same data represented in a mixture of hue and shape variations. There are many pre-attentive visual cues, including size, orientation, intersection, and intensity. The accuracy of these cues can be ranked, with position triggering high accuracy and color or density on the low end.

Visual cues combined with different data types – quantitative (e.g., 10 inches), ordered (small, medium, large), and categorical (apples, oranges) – can also be ranked, with spatial position being best for all types. Beyond this commonality, however, accuracy varied widely, with length ranked second for quantitative data, eighth for ordinal, and ninth for categorical data types.

These guidelines about visual perception can be applied to information visualization, which, unlike scientific visualization, focuses on abstract data and choice of specialization. Several techniques are used to clarify abstract data, including multi-part glyphs, where changes in individual parts are incorporated into an easier-to-understand gestalt, interactivity, motion, and animation. In large data sets, techniques like "focus + context," where a zoomed portion of the graph is shown along with a thumbnail view of the entire graph, are used to minimize user disorientation.

Future problems include dealing with huge databases such as the Human Genome, reckoning with dynamic data like the changing structure of the Web or real-time network monitoring, and transforming "pixel bound" displays into large "digital wallpaper"-type systems.

FIXING NETWORK SECURITY BY HACKING THE BUSINESS CLIMATE

Bruce Schneier, Counterpane Internet Security

Summarized by Florian Buchholz

Bruce Schneier identified security as one of the fundamental building blocks of the Internet. A certain degree of security is needed for all things on the Net, and the limits of security will become limits of the Net itself. However, companies are hesitant to employ security measures. Science is doing well, but one cannot see the benefits in the real world. An increasing number of users means more problems affecting more people. Old problems such as buffer overflows haven't gone away, and new problems show up. Furthermore, the amount of expertise needed to launch attacks is decreasing.

Schneier argues that complexity is the enemy of security and that while security is getting better, the growth in complexity is outpacing it. As security is fundamentally a people problem, one shouldn't focus on technologies but rather should look at businesses, business motivations, and business costs. Traditionally, one can distinguish between two security models: threat avoidance, where security is absolute, and risk management, where security is relative and one has to mitigate the risk with technologies and procedures. In the latter model, one has to find a point with reasonable security at an acceptable cost.

After a brief discussion on how security is handled in businesses today, in which he concluded that businesses "talk big about it but do as little as possible," Schneier identified four necessary steps to fix the problems.

First, enforce liabilities. Today no real consequences have to be feared from security incidents. Holding people accountable will increase the transparency of security and give an incentive to make processes public. As possible options to achieve this, he listed industry-defined standards, federal regulation, and lawsuits. The problems with enforcement, however, lie in difficulties associated with the international nature of the problem and the fact that complexity makes assigning liabilities difficult. Furthermore, fear of liability could have a chilling effect on free software development and could stifle new companies.

As a second step, Schneier identified the need to allow partners to transfer liability. Insurance would spread liability risk among a group and would be a CEO's primary risk analysis tool. There is a need for standardized risk models and protection profiles, and for more securities as opposed to empty press releases. Schneier predicts that insurance will create a marketplace for security where customers and vendors will have the ability to accept liabilities from each other. Computer-security insurance should soon be as common as household or fire insurance, and from that development insurance will become the driving factor of the security business.

The next step is to provide mechanisms to reduce risks both before and after software is released; techniques and processes to improve software quality, as well as an evolution of security management, are therefore needed. Currently, security products try to rebuild the walls – such as physical badges and entry doorways – that were lost when getting connected. Schneier believes this "fortress metaphor" is bad; one should think of the problem more in terms of a city. Since most businesses cannot afford proper security, outsourcing is the only way to make security scalable. With outsourcing, the concept of best practices becomes important, and insurance companies can be tied to them; outsourcing will level the playing field.

As a final step, Schneier predicts that rational prosecution and education will lead to deterrence. He claims that people feel safe because they live in a lawful society, whereas the Internet is classified as lawless, very much like the American West in the 1800s or a society ruled by warlords. This is because it is difficult to prove who an attacker is, and prosecution is hampered by complicated evidence gathering and irrational prosecution. Schneier believes, however, that education will play a major role in turning the Internet into a lawful society. Specifically, he pointed out that we need laws that can be explained.

Risks won't go away; the best we can do is manage them. A company able to manage risk better will be more profitable, and we need to give CEOs the necessary risk-management tools.

There were numerous questions. Listed below are the more complicated ones, in a paraphrased Q&A format.

Q: Will liability be effective? Will insurance companies be willing to accept the risks? A: The government might have to step in; it needs to be seen how it plays out.

Q: Is there an analogy to the real world in the fact that in a lawless society only the rich can afford security? Security solutions differentiate between classes. A: Schneier disagreed with the statement, giving an analogy to front-door locks, but conceded that there might be special cases.

Q: Doesn't homogeneity hurt security? A: Homogeneity is oversold. Diverse types can be more survivable, but given the limited number of options, the difference will be negligible.

Q: Regulations in airbag protection have led to deaths in some cases. How can we keep the pendulum from swinging the other way? A: Lobbying will not be prevented. An imperfect solution is probable; there might be reversals of requirements, such as the airbag one.

Q: What about personal liability? A: This will be analogous to auto insurance; liability comes with computer/Net access.

Q: If the rule of law is to become reality, there must be a law enforcement function that applies to a physical space. You cannot do that with any existing government agency for the whole world. Should an organization like the UN assume that role? A: Schneier was not convinced global enforcement is possible.

Q: What advice should we take away from this talk? A: Liability is coming. Since the network is important to our infrastructure, eventually the problems will be solved in a legal environment. You need to start thinking about how to solve the problems and how the solutions will affect us.

The entire speech (including slides) is available at http://www.counterpane.com/presentation4.pdf.

LIFE IN AN OPEN SOURCE STARTUP

Daryll Strauss, Consultant

Summarized by Teri Lampoudi

This talk was packed with morsels of insight and tidbits of information about what life in an open source startup is like (though "was like" might be more appropriate), what the issues are in starting, maintaining, and commercializing an open source project, and the way hardware vendors treat open source developers. Strauss began by tracing the timeline of his involvement with developing 3-D graphics support for Linux: from the 3dfx Voodoo1 driver for Linux in 1997, to the establishment of Precision Insight in 1998, to its buyout by VA Linux, to the dismantling of his group by VA, to what the future may hold.

Obviously, there are benefits from doing open source development, both for the developer and for the end user. An important point was that open source develops entire technologies, not just products. A misapprehension is the inherent difficulty of managing a project that accepts code from a large number of developers, some volunteer and some paid. The notion that code can just be thrown over the wall into the world is a Big Lie.

How does one start and keep afloat an open source development company? In the case of Precision Insight, the subject matter – developing graphics and OpenGL for Linux – required a considerable amount of expertise: intimate knowledge of the hardware, the libraries, X, and the Linux kernel, as well as the end applications. Expertise is marketable. What helps even further is having a visible virtual team of experts, people who have an established track record of contributions to open source in the particular area of expertise. Support from the hardware and software vendors came next. While everyone was willing to contract PI to write the portions of code that were specific to their hardware, no one wanted to pay the bill for developing the underlying foundation that was necessary for the drivers to be useful. In the PI model, the infrastructure cost was shared by several customers at once, and the technology kept moving. Once a driver is written for a vendor, the vendor is perfectly capable of figuring out how to write the next driver necessary, thereby obviating the need to contract the job. This and questions of protecting intellectual property – hardware design in this case – are deeply problematic with the open source mode of development. This is not to say that there is no way to get over them, but that they are likely to arise more often than not.

Management issues seem equally important. In the PI setting, contracts were flexible – both a good and a bad thing – and developers overworked. Further, development "in a fishbowl," that is, under public scrutiny, is not easy. Strauss stressed the value of good communication and the use of various out-of-sync communication methods like IRC and mailing lists.

Finally, Strauss discussed what portions of software can be free and what can be proprietary. His suggestion was that horizontal markets want to be free, whereas vertically one can develop proprietary solutions. The talk closed with a small stroll through the problems that project politics brings up, from things like merging branches to mailing list and source tree readability.

GENERAL TRACK SESSIONS

FILE SYSTEMS

Summarized by Haijin Yan

STRUCTURE AND PERFORMANCE OF THE DIRECT ACCESS FILE SYSTEM

Kostas Magoutis, Salimah Addetia, Alexandra Fedorova, and Margo I. Seltzer, Harvard University; Jeffrey Chase, Andrew Gallatin, Richard Kisley, and Rajiv Wickremesinghe, Duke University; and Eran Gabber, Lucent Technologies

This paper won the Best Paper award.

The Direct Access File System (DAFS) is a new standard for network-attached storage over direct-access transport networks. DAFS takes advantage of user-level network interface standards that enable client-side functionality in which remote data access resides in a library rather than in the kernel. This reduces the overhead of memory copy for data movement and protocol overhead. Remote Direct Memory Access (RDMA) is a direct-access transport feature that allows the network adapter to reduce copy overhead by accessing application buffers directly.

This paper explores the fundamental structural and performance characteristics of network file access using a user-level file system structure on a direct-access transport network with RDMA. It describes DAFS-based client and server reference implementations for FreeBSD and reports experimental results comparing DAFS to a zero-copy NFS implementation. It illustrates the benefits and trade-offs of these techniques to provide a basis for informed choices about deployment of DAFS-based systems and similar extensions to other network file protocols such as NFS. Experiments show that DAFS gives applications direct control over an I/O system and increases the client CPU's usage while the client is doing I/O. Future work includes how to address longstanding problems related to the integration of the application and file system for high-performance applications.

CONQUEST: BETTER PERFORMANCE THROUGH A DISK/PERSISTENT-RAM HYBRID FILE SYSTEM

An-I A. Wang, Peter Reiher, and Gerald J. Popek, UCLA; Geoffrey M. Kuenning, Harvey Mudd College

Motivated by the declining cost of persistent RAM, the authors propose the Conquest file system, which stores all small files, metadata, executables, and shared libraries in persistent RAM; disks hold only the data content of remaining large files. Compared to alternatives such as caching and RAM file systems, Conquest has the advantages of efficiency, consistency, and reliability at a reduced cost. Using popular benchmarks, experiments show that Conquest incurs little overhead while achieving faster performance. Future work includes designing mechanisms for adjusting the file-size threshold dynamically and finding a better disk layout for large data blocks.
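
As a rough sketch of the placement policy just described (the 1MB threshold and the in-memory stand-ins for the two stores are assumptions for illustration, not Conquest's actual design or code):

    # Route small files to a persistent-RAM store and only the data content of
    # large files to disk, as in the hybrid design described above.
    SIZE_THRESHOLD = 1 << 20        # assumed cutoff between "small" and "large"

    ram_store = {}                  # stand-in for persistent RAM (path -> bytes)
    disk_store = {}                 # stand-in for the on-disk large-file area

    def write_file(path: str, data: bytes) -> None:
        if len(data) < SIZE_THRESHOLD:
            ram_store[path] = data          # small file: metadata and data in RAM
            disk_store.pop(path, None)
        else:
            disk_store[path] = data         # large file: only data content on disk
            ram_store.pop(path, None)

    def read_file(path: str) -> bytes:
        if path in ram_store:
            return ram_store[path]
        return disk_store[path]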

EXPLOITING GRAY-BOX KNOWLEDGE OF BUFFER-CACHE MANAGEMENT

Nathan C. Burnett, John Bent, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin, Madison

Knowing what algorithm is used to manage the buffer cache is very important for improving application performance. However, there is currently no interface for finding this algorithm. This paper introduces Dust, a simple fingerprinting tool that is able to identify the buffer-cache replacement policy. Dust automatically identifies the cache size and replacement policy based on the configuring attributes of access order, recency, frequency, and long-term history. Through simulation, Dust was able to distinguish between a variety of replacement algorithm policies found in the literature: FIFO, LRU, LFU, Clock, Segmented FIFO, 2Q, and LRU-K. Further experiments fingerprinting real operating systems such as NetBSD, Linux, and Solaris show that Dust is able to identify the hidden cache replacement algorithm.
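
As a toy illustration of the fingerprinting idea (not the Dust tool itself): a probe workload that re-references early blocks leaves different blocks resident under FIFO and LRU, so the surviving set reveals the replacement policy.

    # Simulate two replacement policies over the same probe pattern and compare
    # which blocks remain cached at the end.
    from collections import OrderedDict

    def simulate(policy, accesses, size=4):
        cache = OrderedDict()
        for block in accesses:
            if block in cache:
                if policy == "LRU":
                    cache.move_to_end(block)     # refresh recency; FIFO does not
            else:
                if len(cache) >= size:
                    cache.popitem(last=False)    # evict the oldest entry
                cache[block] = True
        return set(cache)

    if __name__ == "__main__":
        probe = [0, 1, 2, 3, 0, 1, 4, 5]         # fill, re-touch 0 and 1, add new
        for policy in ("FIFO", "LRU"):
            print(policy, "leaves resident:", sorted(simulate(policy, probe)))
        # FIFO leaves {2, 3, 4, 5}; LRU leaves {0, 1, 4, 5} -- distinguishable.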

By knowing the underlying cache replacement policy, a cache-aware Web server can reschedule the requests on a cached-request-first policy to obtain performance improvement. Experiments show that cache-aware scheduling improves average response time and system throughput.

OPERATING SYSTEMS (AND DANCING BEARS)

Summarized by Matt Butner

THE JX OPERATING SYSTEM

Michael Golm, Meik Felser, Christian Wawersich, and Jürgen Kleinöder, University of Erlangen-Nürnberg

The talk opened by emphasizing the need for and the practicality of a functional Java OS. Such implementation in the JX OS attempts to mimic the recent trend toward using highly abstracted languages in application development, in order to create an OS as functional and powerful as one written in a lower-level language.

The JX is a micro-kernel solution that uses separate JVMs for each entity of the kernel and, in some cases, for each application. The separated domains do not share objects and have no thread migration, and each domain implements its own code. The interesting part of the presentation was the discussion of system-level programming with Java; key areas discussed were memory management and interrupt handlers. The authors concluded by noting that their type-safe and modular system resulted in a robust system with great configuration flexibility and acceptable performance.

DESIGN EVOLUTION OF THE EROS SINGLE-LEVEL STORE

Jonathan S. Shapiro, Johns Hopkins University; Jonathan Adams, University of Pennsylvania

This presentation outlined current characteristics of file systems and some of their least desirable characteristics. It was based on the revival of the EROS single-level-store approach for current attempts to capitalize on system consistency and efficiency. Their solution did not relate directly to EROS, but rather to the design and use of the constructs and notions on which EROS is based. By extending the mapping of the memory system to include the disk, systems are further able to ensure global persistence without regard for a disk structure. In explaining the need for absolute system consistency and an environment which, by definition, is not allowed to crash, the goal of having such an exhaustive design becomes clear. The cool part of this work is the availability of its design and architecture to the public.

THINK: A SOFTWARE FRAMEWORK FOR COMPONENT-BASED OPERATING SYSTEM KERNELS

Jean-Philippe Fassino, France Télécom R&D; Jean-Bernard Stefani, INRIA; Julia Lawall, DIKU; Gilles Muller, INRIA

This presentation discussed the need for component-based operating systems and respective structures to ensure flexibility for arbitrary-sized systems. Think provides a binding model that maps uniform components for OS developers and architects to follow, ensuring consistent implementations for an arbitrary system size. However, the goal is not to force developers into a predefined kernel but to promote the use of certain components in varied ways.

Think's concentration is primarily on embedded systems, where short development time is necessary but is constrained by rigorous needs and limited resources. The need to build flexible systems in such an environment can be costly and implementation-specific, but Think creates an environment supported by the ability to dynamically load type-safe components. This allows for more flexible systems that retain functionality because of Think's ability to bind fine-grained components. Benchmarks revealed that dedicated micro-kernels can show performance improvements in comparison to monolithic kernels. Notable future work includes building components for low-end appliances and the development of a real-time OS component library.

BUILDING SERVICES

Summarized by Pradipta De

NINJA: A FRAMEWORK FOR NETWORK SERVICES

J. Robert von Behren, Eric A. Brewer, Nikita Borisov, Michael Chen, Matt Welsh, Josh MacDonald, Jeremy Lau, and David Culler, University of California, Berkeley

Robert von Behren presented Ninja, an ongoing project that aims to provide a framework for building robust and scalable Internet-based network services, like Web-hosting, instant-messaging, email, and file-sharing applications. Robert drew attention to the difficulties of writing cluster applications. One has to take care of data consistency and issues of concurrency, and for robust applications there are problems related to fault tolerance. Ninja works as a wrapper to relieve the user of these problems. The goal of the project is building network services that are scalable and highly available, maintaining persistent data, providing graceful degradation, and supporting online evolution.

The use of clusters distinguishes this setup from generic distributed systems in terms of reliability and security, as well as providing a partition-free network. The next important feature of this project is the new programming model, which is more restrictive than a general programming model but still expressive enough to write most of the applications. This model, described as "single program, multiple connection," uses intertask parallelism instead of multithreaded concurrency and is characterized by all nodes running the same program, with connections delegated to the nodes by a centralized connection manager (CM). A CM takes care of hiding the details of mapping an external connection to an internal node. Ninja also uses a design called "staged event-driven architecture," where each service is broken down into a set of stages connected by event queues. This architecture is suitable for a modular design and helps in graceful degradation by adaptive load shedding from the event queues.
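
As a rough sketch of the staged, event-queue structure described here (stage names, queue sizes, and the drop-on-full behavior are illustrative assumptions, not Ninja's implementation):

    # Each stage pulls events from its own bounded queue, does one kind of work,
    # and pushes results to the next stage; dropping events when a queue is full
    # is the simple form of adaptive load shedding.
    import queue
    import threading
    import time

    def make_stage(name, work, out_q=None, capacity=100):
        in_q = queue.Queue(maxsize=capacity)

        def loop():
            while True:
                event = in_q.get()
                result = work(event)
                if out_q is not None:
                    try:
                        out_q.put_nowait(result)   # shed load if the next stage is full
                    except queue.Full:
                        print(name, "dropped an event")

        threading.Thread(target=loop, daemon=True).start()
        return in_q

    if __name__ == "__main__":
        respond_q = make_stage("respond", lambda req: print("reply to", req))
        parse_q = make_stage("parse", lambda raw: raw.strip().upper(), out_q=respond_q)
        for raw in [" get /a ", " get /b "]:
            parse_q.put(raw)
        time.sleep(0.5)     # let the daemon worker threads drain the queues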

The presentation concluded with examples of implementation and evaluation of a Web server and an email system called NinjaMail, to show the ease of authoring and the efficacy of using the Ninja framework for service development.

USING COHORT-SCHEDULING TO ENHANCE SERVER PERFORMANCE

James R. Larus and Michael Parkes, Microsoft Research

James Larus presented a new scheduling policy for increasing the throughput of server applications. Cohort scheduling batches execution of similar operations arising in different server requests. The usual programming paradigm in handling server requests is to spawn multiple concurrent threads and switch from one thread to another whenever a thread gets blocked for I/O. Since the threads in a server mostly execute unrelated pieces of code, the locality of reference is reduced, and hence the effectiveness of different caching mechanisms. One way to solve this problem is to throw in more hardware, but Larus presented a complementary view where the program behavior is investigated and used to improve the performance.

The problem in this scheme is to identify pieces of code that can be batched together for processing. One simple way is to look at the next program counter values and use them to club threads together. However, this talk presented a new programming abstraction, "staged computation," which replaces the thread model with "stages." A stage is an abstraction for operations with similar behavior and locality. The StagedServer library can be used for programming in this model. It is a collection of C++ classes that implement staged computation and cohort scheduling on either a uniprocessor or multiprocessor. It can be used to define stages.
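
As a toy sketch of the batching idea (in Python rather than the C++ StagedServer classes, with invented stage names):

    # Instead of interleaving whole requests, group pending work by which
    # operation (stage) it performs next and run each group back to back, so
    # the code and data for one operation stay warm in the caches.
    from collections import defaultdict

    def parse(req):
        return ("lookup", req.upper())

    def lookup(key):
        return ("reply", key + "!")

    def reply(msg):
        print("sent:", msg)
        return None

    STAGES = {"parse": parse, "lookup": lookup, "reply": reply}

    def cohort_run(requests):
        pending = [("parse", r) for r in requests]
        while pending:
            cohorts = defaultdict(list)              # batch work by next stage
            for stage, data in pending:
                cohorts[stage].append(data)
            pending = []
            for stage, batch in cohorts.items():     # run each cohort back to back
                for data in batch:
                    nxt = STAGES[stage](data)
                    if nxt is not None:
                        pending.append(nxt)

    if __name__ == "__main__":
        cohort_run(["req-a", "req-b", "req-c"])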

The presentation showed the experimental evaluation of cohort scheduling over the thread-based model by implementing two servers: a Web server, which is mainly I/O bound, and a publish-subscribe server, which is mainly compute bound. The SURGE benchmark was used for the first experiment and the Fabret workload for the second. The results showed that the cohort-scheduling-based implementation gave better throughput than the thread-based implementation at high loads.

NETWORK PERFORMANCE

Summarized by Xiaobo Fan

ETE: PASSIVE END-TO-END INTERNET SERVICE PERFORMANCE MONITORING

Yun Fu and Amin Vahdat, Duke University; Ludmila Cherkasova and Wenting Tang, HP Labs

This paper won the Best Student Paper award.

Ludmila Cherkasova began by listing several questions most Web service providers want answered in order to improve service quality. She reviewed the difficulties of making accurate and efficient end-to-end Web service measurement and the shortcomings of currently available approaches. They propose a passive trace-based architecture called EtE to monitor Web server performance on behalf of end users.

The first step is to collect network packets passively. The second module reconstructs all TCP connections and extracts HTTP transactions. To reconstruct Web page accesses, they first build a knowledge base indexed by client IP and URL, and then group objects to the related Web pages they are embedded in. Statistical analysis is used to handle non-matched objects. EtE Monitor can generate three groups of metrics to measure Web service performance: response time, Web caching, and page abortion.
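
As a rough sketch of just the page-grouping step (EtE's actual grouping uses the knowledge base indexed by client IP and URL plus statistical analysis; the simple time-gap heuristic below is only an illustration):

    # Fold per-object HTTP requests into page accesses: an object joins the
    # client's most recent page if it arrives within a short gap of it.
    GAP = 2.0   # seconds; an assumed threshold, not EtE's real parameter

    def group_pages(requests):
        """requests: list of (timestamp, client_ip, url), sorted by timestamp."""
        pages = []
        last_by_client = {}
        for ts, client, url in requests:
            page = last_by_client.get(client)
            if page is None or ts - page["last"] > GAP:
                page = {"client": client, "start": ts, "last": ts, "urls": []}
                pages.append(page)
                last_by_client[client] = page
            page["urls"].append(url)
            page["last"] = ts
        return pages

    if __name__ == "__main__":
        trace = [(0.0, "10.0.0.1", "/index.html"), (0.3, "10.0.0.1", "/logo.gif"),
                 (5.0, "10.0.0.1", "/next.html")]
        for p in group_pages(trace):
            print(p["client"], p["start"], p["urls"])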

To demonstrate the benefits of EtE Monitor, Cherkasova talked about two case studies and, based on the calculated metrics, gave some insightful explanations about what's happening behind the variations of Web performance. Through validation, they claim their approach provides a very close approximation to the real scenario.

THE PERFORMANCE OF REMOTE DISPLAY MECHANISMS FOR THIN-CLIENT COMPUTING

S. Jae Yang, Jason Nieh, Matt Selsky, and Nikhil Tiwari, Columbia University

Noting the trend toward thin-client computing, the authors compared different techniques and design choices in measuring the performance of six popular thin-client platforms – Citrix MetaFrame, Microsoft Terminal Services, Sun Ray, Tarantella, VNC, and X. After pointing out several challenges in benchmarking thin clients, Yang proposed slow-motion benchmarking to achieve non-invasive packet monitoring and consistent visual quality. Basically, they insert delays between separate visual events in the benchmark applications on the server side, so that the client's display update can catch up with the server's processing speed. The experiments are conducted on an emulated network over a range of network bandwidths.
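
As a rough sketch of the slow-motion idea (the event player and packet counter below are hypothetical stand-ins for the real benchmark driver and monitor):

    # Replay scripted visual events with a pause after each one, so the traffic
    # belonging to each event can be measured in isolation while the client's
    # display fully catches up.
    import time

    def play_event(event):
        print("triggering visual event:", event)   # e.g. load the next page or frame

    def bytes_on_wire():
        return 0                                    # stand-in for a packet counter

    def slow_motion_run(events, delay=2.0):
        per_event = []
        for event in events:
            before = bytes_on_wire()
            play_event(event)
            time.sleep(delay)          # let the client's display fully catch up
            per_event.append((event, bytes_on_wire() - before))
        return per_event

    if __name__ == "__main__":
        print(slow_motion_run(["page-1", "page-2"], delay=0.1))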

Their results show that thin clients can provide good performance for Web applications in LAN environments, but only some platforms performed well for the video benchmark. Pixel-based encoding may achieve better performance and bandwidth efficiency than high-level graphics. Display caching and compression should be used with care.

A MECHANISM FOR TCP-FRIENDLY TRANSPORT-LEVEL PROTOCOL COORDINATION

David E. Ott and Ketan Mayer-Patel, University of North Carolina, Chapel Hill

A revised transport-level protocol optimized for cluster-to-cluster (C-to-C) applications is what David Ott tries to explain in his talk. A C-to-C application class is identified as one set of processes communicating to another set of processes across a common Internet path. The fundamental problem with C-to-C applications is how to coordinate all C-to-C communication flows so that they share a consistent view of the common C-to-C network, adapt to changing network conditions, and cooperate to meet specific requirements. Aggregation points (APs) are placed at the first and last hop of the common data path to probe network conditions (latency, bandwidth, loss rate, etc.). To carry and transfer this information, a new protocol – the Coordination Protocol (CP) – is inserted between the network layer (IP) and the transport layer (TCP, UDP, etc.). Ott illustrated how the CP header is updated and used when packets originate from the source and traverse through the local and remote APs to arrive at their destination, and how an AP maintains a per-cluster state table and detects network conditions. Through simulation results, this coordination mechanism appears effective in sharing common network resources among C-to-C communication flows.
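
As a rough sketch of the coordination idea (the header fields and the update rule are assumptions for illustration, not the actual CP header layout):

    # A small header rides between IP and the transport layer; each aggregation
    # point stamps its current view of the shared path into packets passing
    # through, so all C-to-C flows see the same network conditions.
    from dataclasses import dataclass

    @dataclass
    class CPHeader:
        latency_ms: float = 0.0
        bandwidth_kbps: float = 0.0
        loss_rate: float = 0.0

    class AggregationPoint:
        def __init__(self):
            # per-cluster view of the common path, refreshed by probing
            self.latency_ms = 20.0
            self.bandwidth_kbps = 5000.0
            self.loss_rate = 0.01

        def stamp(self, hdr: CPHeader) -> CPHeader:
            """Write the AP's current measurements into the packet's CP header."""
            hdr.latency_ms = self.latency_ms
            hdr.bandwidth_kbps = self.bandwidth_kbps
            hdr.loss_rate = self.loss_rate
            return hdr

    if __name__ == "__main__":
        hdr = AggregationPoint().stamp(CPHeader())
        print(hdr)     # receivers use this shared view to adapt their flows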

STORAGE SYSTEMS

Summarized by Praveen Yalagandula

MY CACHE OR YOURS? MAKING STORAGE MORE EXCLUSIVE

Theodore M. Wong, Carnegie Mellon University; John Wilkes, HP Labs

Theodore Wong explained the inefficiency of current "inclusive" caching schemes in storage area networks – when a client accesses a block, the block is read from the disk and is cached at both the disk array cache and the client's cache. Then he presented the concept of "exclusive" caching, where a block accessed by a client is only cached in that client's cache. On eviction from the client's cache, they have come up with a DEMOTE operation to move the data block to the tail of the array cache.
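
As a toy sketch of DEMOTE-style exclusive caching (the cache sizes and the LRU ordering here are illustrative, not the paper's parameters):

    # Blocks read by the client are cached only at the client; on eviction a
    # block is DEMOTEd to the tail of the array cache instead of being dropped.
    from collections import OrderedDict

    class Cache:
        def __init__(self, size):
            self.size, self.blocks = size, OrderedDict()

        def insert_at_tail(self, block):
            evicted = None
            if len(self.blocks) >= self.size:
                evicted, _ = self.blocks.popitem(last=False)   # drop from the head
            self.blocks[block] = True
            return evicted

        def pop(self, block):
            return self.blocks.pop(block, None)

    array_cache, client_cache = Cache(8), Cache(4)

    def client_read(block):
        if block in client_cache.blocks:
            return                                    # client hit
        array_cache.pop(block)                        # keep the copies exclusive
        demoted = client_cache.insert_at_tail(block)  # miss: read from array or disk
        if demoted is not None:
            array_cache.insert_at_tail(demoted)       # DEMOTE instead of discard

    if __name__ == "__main__":
        for b in [1, 2, 3, 4, 5, 1]:
            client_read(b)
        print("client:", list(client_cache.blocks), "array:", list(array_cache.blocks))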

The new exclusive caching schemes are evaluated on both single-client and multiple-client systems. The single-client results are presented for two different types of synthetic workloads: Random (transaction-processing-type workloads) and Zipf (typical Web workloads). The exclusive policy was quite effective in achieving higher hit rates and lower read latencies. They also showed a 2.2 times hit-rate improvement over inclusive techniques in the case of a real-life workload, httpd, with a single-client setting.

The DEMOTE scheme also performed well in the case of multiple-client systems when the data accessed by clients is disjoint. In the case of shared data workloads, this scheme performed worse than the inclusive schemes. Theodore then presented an adaptive exclusive caching scheme – a block accessed by a client is placed at the tail of the client's cache and also in the array cache at an appropriate place determined by the popularity of the block. The popularity of the block is measured by maintaining a ghost cache to accumulate the number of times each block is accessed. This new adaptive scheme achieved a maximum of 52% speedup in mean latency in the experiments with real-life workloads.

For more information, visit http://www.cs.cmu.edu/~tmwong/research and http://www.hpl.hp.com/research/itc/csl/ssp.

BRIDGING THE INFORMATION GAP IN STORAGE PROTOCOL STACKS

Timothy E. Denehy, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin, Madison

Currently, there is a huge information gap between storage systems and file systems. The interface exposed by storage systems to file systems, based on blocks and providing only read/write interfaces, is very narrow. This leads to poor performance as a whole, because of duplicated functionality in both systems and reduced functionality resulting from storage systems' lack of file information.

The speaker presented two enhancements: ExRAID, an exposed RAID, and ILFS, an Informed Log-Structured File System. ExRAID is an enhancement to the block-based RAID storage system that exposes the following three types of information to the file system: (1) regions – contiguous portions of the address space comprising one or multiple disks; (2) performance information about queue lengths and throughput of the regions, revealing disk heterogeneity to the file systems; and (3) failure information – dynamically updated information conveying the number of tolerable failures in each region.

ILFS allows online incremental expansion of the storage space, performs dynamic parallelism using ExRAID's performance information to perform well on heterogeneous storage systems, and provides a range of different mechanisms with different granularities for redundancy of files using ExRAID's region failure characteristics. These new features are added to LFS with only a 19% increase in the code size.

More information: http://www.cs.wisc.edu/wind.

MAXIMIZING THROUGHPUT IN REPLICATED DISK STRIPING OF VARIABLE BIT-RATE STREAMS

Stergios V. Anastasiadis, Duke University; Kenneth C. Sevcik and Michael Stumm, University of Toronto

There is an increasing demand for the continuous real-time streaming of media files. Disk striping is a common technique used for supporting many connections on media servers. But even with 1.2 million hours of mean time between failures, there will be more than one disk failure per week with 7,000 disks. Fault tolerance can be achieved by using data replication and reserving extra bandwidth during normal operation. This work focused on supporting the most common variable bit-rate stream formats (e.g., MPEG).

The authors used the prototype media server EXEDRA to implement and evaluate different replication and bandwidth reservation schemes. This system supports a variable bit-rate scheme, does stride-based disk allocation, and is capable of supporting variable-grain disk striping. Two replication policies are presented: deterministic, in which data blocks are replicated in a round-robin fashion on the secondary replicas; and random, in which data blocks are replicated on randomly chosen secondary replicas. Two bandwidth reservation techniques are also presented: mirroring reservation, in which disk bandwidth is reserved for both the primary and the replicas of the media file during playback; and minimum reservation, a more efficient scheme in which bandwidth is reserved only for the sum of the primary data access time and the maximum of the backup data access times required in each round.

Experimental results showed that deterministic replica placement is better than random placement for small disks, minimum disk bandwidth reservation achieves twice the throughput of mirroring, and fault tolerance can be achieved with a minimal impact on throughput.
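
The contrast between the two reservation policies can be written down directly: for one scheduling round, mirroring reserves the primary access time plus every replica's access time, while minimum reservation reserves the primary plus only the worst-case backup. The numbers below are made up for illustration.

    def mirroring_reservation(primary_time, backup_times):
        """Reserve disk time for the primary and for every replica each round."""
        return primary_time + sum(backup_times)

    def minimum_reservation(primary_time, backup_times):
        """Reserve only the primary time plus the worst-case backup access time."""
        return primary_time + max(backup_times)

    # One playback round: primary access takes 8 ms, two replicas would take 7 ms and 9 ms.
    primary, backups = 8.0, [7.0, 9.0]
    print("mirroring:", mirroring_reservation(primary, backups), "ms")  # 24.0 ms
    print("minimum:  ", minimum_reservation(primary, backups), "ms")    # 17.0 ms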

TOOLS

Summarized by Li Xiao

SIMPLE AND GENERAL STATISTICAL PROFILING WITH PCT

Charles Blake and Steven Bauer, MIT

This talk introduced the Profile Collection Toolkit (PCT), a sampling-based CPU profiling facility. A novel aspect of PCT is that it allows sampling of semantically rich data such as function call stacks, function parameters, local or global variables, CPU registers, or other execution contexts. The design objectives of PCT were driven by user needs and the inadequacies or inaccessibility of prior systems.

The rich data-collection capability is achieved via a debugger-controller program, dbctl. The talk then introduced the profiling collector and data-collection activation. PCT file format options provide flexibility and various ways to store exact states. PCT also has analysis options; two examples were given, "Simple" and "Mixed User and Linux Code."

The talk then went into how general sampling helps in the following areas: a debugger-controller state machine, example script fragments, a simple example program, and general value reports in terms of call-path histograms, numeric value histograms, and data generality. Related work such as gprof and Expect was then summarized. The contribution of this work is to provide a portable, general value sampling tool. The limitations are that PCT does not support much of strip, and count inaccuracies happen because of its statistical nature. Based on this limitation, -g is preferred over strip.

Possible future directions included supporting more debuggers, such as dbx and xdb, script generators for other kinds of traces, more canned reports for general values, and a libgdb-based library sampler. PCT is available at http://pdos.lcs.mit.edu/pct and runs on almost any UNIX-like system.

ENGINEERING A DIFFERENCING AND COMPRESSION DATA FORMAT

David G. Korn and Kiem-Phong Vo, AT&T Laboratories – Research

This talk began with an equation: "Differencing + Compression = Delta Compression." Compression removes information redundancy in a data set; examples are gzip, bzip, compress, and pack. Differencing encodes the differences between two data sets; examples are diff -e, fdelta, bdiff, and xdelta. Delta compression compresses a data set given another one, combining differencing and compression, and reduces to pure compression when there is no commonality.
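
As a toy illustration of the differencing-plus-compression idea (this is not the Vcdiff encoding), the sketch below derives COPY/ADD instructions against a reference version with Python's difflib and then zlib-compresses the encoded delta.

    import difflib, json, zlib

    def make_delta(old: bytes, new: bytes) -> bytes:
        """Encode 'new' as COPY/ADD instructions against 'old', then compress."""
        ops = []
        sm = difflib.SequenceMatcher(None, old, new, autojunk=False)
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            if tag == "equal":
                ops.append(["COPY", i1, i2 - i1])                   # reuse bytes from old
            else:
                ops.append(["ADD", new[j1:j2].decode("latin-1")])   # literal new data
        return zlib.compress(json.dumps(ops).encode("latin-1"))

    def apply_delta(old: bytes, delta: bytes) -> bytes:
        out = bytearray()
        for op in json.loads(zlib.decompress(delta)):
            if op[0] == "COPY":
                _, start, length = op
                out += old[start:start + length]
            else:
                out += op[1].encode("latin-1")
        return bytes(out)

    old = b"the quick brown fox jumps over the lazy dog" * 20
    new = old.replace(b"lazy", b"sleepy")
    delta = make_delta(old, new)
    assert apply_delta(old, delta) == new
    print(len(new), "bytes ->", len(delta), "byte delta")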

After presenting an overall scheme for a delta compressor and showing delta compression performance, the talk discussed the encoding format of the newly designed Vcdiff for delta compression. A Vcdiff instruction code table consists of 256 entries, each coding up to a pair of instructions, and encodes 1-byte indices and any additional data.

The talk then showed Vcdiff's performance with Web data collected from CNN, compared with the W3C standard Gdiff, Gdiff+gzip, and diff+gzip, where Gdiff was computed using Vcdiff delta instructions. The results of two experiments, "First" and "Successive," were presented. In "First," each file is compressed against the first file collected; in "Successive," each file is compressed against the one from the previous hour. The diff+gzip combination did not work well because diff is line-oriented. Vcdiff performed favorably compared with the other formats.

Vcdiff is part of the Vcodex package. Vcodex is a platform for all common data transformations: delta compression, plain compression, encryption, and transcoding (e.g., uuencode, base64). It is structured in three layers for maximum usability; the base library uses Discipline and Method interfaces.

The code for Vcdiff can be found at http://www.research.att.com/sw/tools. They are moving Vcdiff to an IETF standard as a comprehensive platform for transforming data; please refer to http://www.ietf.org/internet-drafts/draft-korn-vcdiff-06.txt.

WHERE IN THE NET

Summarized by Amit Purohit

A PRECISE AND EFFICIENT EVALUATION OF THE PROXIMITY BETWEEN WEB CLIENTS AND THEIR LOCAL DNS SERVERS

Zhuoqing Morley Mao, University of California, Berkeley; Charles D. Cranor, Fred Douglis, Michael Rabinovich, Oliver Spatscheck, and Jia Wang, AT&T Labs – Research

Content Distribution Networks (CDNs) attempt to improve Web performance by delivering Web content to end users from servers located at the edge of the network. When a Web client requests content, the CDN dynamically chooses a server to route the request to, usually the one that is nearest the client. As CDNs have access only to the IP address of the local DNS server (LDNS) of the client, the CDN's authoritative DNS server maps the client's LDNS to a geographic region within a particular network and combines that with network and server load information to perform CDN server selection.

This method has two limitations. First, it is based on the implicit assumption that clients are close to their LDNS, which may not always be valid. Second, a single request from an LDNS can represent a differing number of Web clients, called the hidden load factor. Knowledge of the hidden load factor can be used to achieve better load distribution.

The extent of the first limitation and its impact on CDN server selection is dealt with. To determine the associations between clients and their LDNS, a simple, non-intrusive, and efficient mapping technique was developed. The data collected was used to study the impact of proximity on DNS-based server selection, using four different proximity metrics: (1) autonomous system (AS) clustering, observing whether a client is in the same AS as its LDNS, concluded that LDNS is very good for coarse-grained server selection, as 64% of the associations belong to the same AS; (2) network clustering, observing whether a client is in the same network-aware cluster (NAC) as its LDNS, implied that DNS is less useful for finer-grained server selection, since only 16% of the clients and their LDNS are in the same NAC; (3) traceroute divergence, the length of the divergent paths to the client and its LDNS from a probe point using traceroute, implies that most clients are topologically close to their LDNS as viewed from a randomly chosen probe site; (4) round-trip time (RTT) correlation (some CDNs select servers based on the RTT between the CDN server and the client's LDNS) examines the correlation between the message RTTs from a probe point to the client and its local DNS; the results imply this is a reasonable metric to use to avoid really distant servers.
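
The first metric reduces to a membership test: map both the client address and its LDNS address to an origin AS and check whether they match. The sketch below assumes a small, hypothetical prefix-to-AS table; in practice such a table is derived from BGP routing data.

    import ipaddress

    # Hypothetical longest-prefix table mapping routed prefixes to origin AS numbers.
    PREFIX_TO_AS = {
        ipaddress.ip_network("169.229.0.0/16"): 25,    # e.g., a campus network
        ipaddress.ip_network("192.20.0.0/16"): 7018,   # e.g., an ISP block
    }

    def origin_as(addr: str):
        ip = ipaddress.ip_address(addr)
        best = None
        for net, asn in PREFIX_TO_AS.items():
            if ip in net and (best is None or net.prefixlen > best[0].prefixlen):
                best = (net, asn)
        return best[1] if best else None

    def same_as_cluster(client: str, ldns: str) -> bool:
        """AS-clustering proximity metric: client and its LDNS share an origin AS."""
        a, b = origin_as(client), origin_as(ldns)
        return a is not None and a == b

    print(same_as_cluster("169.229.1.10", "169.229.60.5"))   # True  (same AS)
    print(same_as_cluster("169.229.1.10", "192.20.3.4"))     # False (different AS)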

A study of the impact that client-LDNS associations have on DNS-based server selection concludes that knowing the client's IP address would allow more accurate server selection in a large number of cases. The optimality of the server selection also depends on the server density, placement, and selection algorithms.

Further information can be found at http://www.eecs.berkeley.edu/~zmao/myresearch.html or by contacting zmao@eecs.berkeley.edu.

GEOGRAPHIC PROPERTIES OF INTERNET ROUTING

Lakshminarayanan Subramanian, Venkata N. Padmanabhan, Microsoft Research; Randy H. Katz, University of California at Berkeley

Geographic information can provide insights into the structure and functioning of the Internet, including interactions between different autonomous systems, by analyzing certain properties of Internet routing. It can be used to measure and quantify routing properties such as circuitous routing, hot-potato routing, and geographic fault tolerance.

Traceroute has been used to gather the required data, and the Geotrack tool has been used to determine the location of the nodes along each network path. This enables the computation of "linearized distance," which is the sum of the geographic distances between successive pairs of routers along the path.

In order to measure the circuitousness of a path, a metric called the "distance ratio" has been defined as the ratio of the linearized distance of a path to the geographic distance between the source and destination of the path. From the data it has been observed that the circuitousness of a route depends on both the geographic and network locations of the end hosts. A large value of the distance ratio enables us to flag paths that are highly circuitous, possibly (though not necessarily) because of routing anomalies. It is also shown that the minimum delay between end hosts is strongly correlated with the linearized distance of the path.
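
Both quantities are easy to compute once each router on a path has been geolocated; the sketch below uses great-circle (haversine) distances over made-up coordinates to produce the linearized distance and the distance ratio.

    from math import radians, sin, cos, asin, sqrt

    EARTH_RADIUS_KM = 6371.0

    def great_circle_km(a, b):
        """Haversine distance between two (lat, lon) points given in degrees."""
        lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
        h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

    def linearized_distance(path):
        """Sum of geographic distances between successive routers on the path."""
        return sum(great_circle_km(path[i], path[i + 1]) for i in range(len(path) - 1))

    def distance_ratio(path):
        """Linearized distance divided by the end-to-end geographic distance."""
        return linearized_distance(path) / great_circle_km(path[0], path[-1])

    # Illustrative path: Berkeley -> Seattle -> Chicago -> New York (approximate coordinates).
    path = [(37.87, -122.27), (47.61, -122.33), (41.88, -87.63), (40.71, -74.01)]
    print(round(linearized_distance(path)), "km, ratio", round(distance_ratio(path), 2))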

Geographic information can be used to study various aspects of wide-area Internet paths that traverse multiple ISPs. It was found that end-to-end Internet paths tend to be more circuitous than intra-ISP paths, the cause for this being the peering relationships between ISPs. Also, paths traversing substantial distances within two or more ISPs tend to be more circuitous than paths largely traversing only a single ISP. Another finding is that ISPs generally employ hot-potato routing.

Geographic information can also be used to capture the fact that two seemingly unrelated routers can be susceptible to correlated failures. By using the geographic information of routers, we can construct a geographic topology of an ISP and use it to find the tolerance of an ISP's network to the total network failure in a geographic region.

For further information, contact lakme@cs.berkeley.edu.

PROVIDING PROCESS ORIGIN INFORMATION TO AID IN NETWORK TRACEBACK

Florian P. Buchholz, Purdue University; Clay Shields, Georgetown University

Network traceback is currently limited because host audit systems do not maintain enough information to match incoming network traffic to outgoing network traffic. The talk presented an alternative: assigning origin information to every process and logging it during interactive login creation.

The current implementation concentrates mainly on interactive sessions, in which an event is logged when a new connection is established using SSH or Telnet. The method is effective and could successfully determine stepping stones and reliably detect the source of a DDoS attack. The speaker then talked about the related work done in this area, which led to a discussion of the latest developments in the area of "host causality."

The speaker ended with the limitations and future directions of the framework. This framework could be extended and could find many applications in the area of security. The current implementation doesn't take care of startup scripts and cron jobs, but incorporating the origin information in the file system could solve this problem. In the current implementation, logging is implemented only as a proof of concept. It could be made safe in many ways, and this could be another important aspect of future work.

PROGRAMMING

Summarized by Amit Purohit

CYCLONE: A SAFE DIALECT OF C

Trevor Jim, AT&T Labs – Research; Greg Morrisett, Dan Grossman, Michael Hicks, James Cheney, Yanling Wang, Cornell University

Cyclone is designed to provide protection against attacks such as buffer overflows, format string attacks, and memory management errors. The current structure of C allows programmers to write vulnerable programs. Cyclone extends C so that it has the safety guarantees of Java while keeping C syntax, types, and semantics untouched.

The Cyclone compiler performs static analysis of the source code and inserts runtime checks into the compiled output at places where the analysis cannot determine that the code execution will not violate safety constraints. Cyclone imposes many restrictions to preserve safety, such as NULL checks; these checks do not exist in normal C.

The speaker then talked about some sample code written in Cyclone and how it tackles safety problems. Porting an existing C application to Cyclone is fairly easy, with fewer than 10% changes required. The current implementation concentrates more on safety than on performance, so the performance penalty is substantial. Cyclone was able to find many lingering bugs. The final step of the project will be to write a compiler to convert a normal C program into an equivalent Cyclone program.

COOPERATIVE TASK MANAGEMENT WITHOUT MANUAL STACK MANAGEMENT

Atul Adya, Jon Howell, Marvin Theimer, William J. Bolosky, John R. Douceur, Microsoft Research

The speaker described the definitions and motivations behind the distinct concepts of task management, stack management, I/O management, conflict management, and data partitioning. Conventional concurrent programming uses preemptive task management and exploits the automatic stack management of a standard language. In the second approach, cooperative tasks are organized as event handlers that yield control by returning control to the event scheduler, manually unrolling their stacks. In this project they have adopted a hybrid approach that makes it possible for both stack management styles to coexist. Thus, programmers can code assuming one or the other of these stack management styles will operate, depending upon the application. The speaker also gave a detailed example of how to use their mechanism to switch between the two styles.
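
The two stack-management styles, and the kind of adapter that lets them share one cooperative scheduler, can be illustrated with a small analogy in Python (this is only an analogy, not the authors' mechanism): the first task is written as chopped-up event handlers that pass state forward explicitly, the second as a generator whose local state survives across yields.

    ready = []          # cooperative scheduler: a simple run-to-completion event queue

    def schedule(fn, *args):
        ready.append((fn, args))

    def run():
        while ready:
            fn, args = ready.pop(0)
            fn(*args)

    # Style 1: manual stack management. The task is chopped into event handlers
    # that must carry their own state between steps.
    def fetch_step1(url):
        print("fetch", url)
        schedule(fetch_step2, url, "fake-bytes")

    def fetch_step2(url, data):
        print("parsed", url, len(data), "bytes")

    # Style 2: automatic stack management. A generator keeps its locals alive
    # across yields; a tiny adapter turns each resume into a scheduled event.
    def fetch_task(url):
        print("fetch", url)
        data = yield                     # wait for the "I/O" to complete
        print("parsed", url, len(data), "bytes")

    def start_generator(gen):
        next(gen)                        # run until the first yield
        schedule(lambda: resume(gen, "fake-bytes"))

    def resume(gen, value):
        try:
            gen.send(value)
        except StopIteration:
            pass

    schedule(fetch_step1, "http://a.example")         # event-handler style
    start_generator(fetch_task("http://b.example"))   # thread-like style, same scheduler
    run()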

The talk then continued with the implementation. They were able to preempt many subtle concurrency problems by using cooperative task management. Paying a cost up front to reduce a subtle race condition proved to be a good investment.

Though the choice of task management is fundamental, the choice of stack management can be left to individual taste. This project enables the use of any type of stack management in conjunction with cooperative task management.

IMPROVING WAIT-FREE ALGORITHMS FOR INTERPROCESS COMMUNICATION IN EMBEDDED REAL-TIME SYSTEMS

Hai Huang, Padmanabhan Pillai, Kang G. Shin, University of Michigan

The main characteristic of a real-time, time-sensitive system is its predictable response time. But concurrency management creates hurdles to achieving this goal because of the use of locks to maintain consistency. To solve this problem, many wait-free algorithms have been developed, but these are typically high-cost solutions. By taking advantage of the temporal characteristics of the system, however, the time and space overhead can be reduced.

The speaker presented an algorithm for temporal concurrency control and applied this technique to improve three wait-free algorithms; a single-writer multiple-reader wait-free algorithm and a double-buffer algorithm were proposed.

Using their transformation mechanism, they achieved a 17-66% improvement in ACET and a 14-70% reduction in memory requirements for IPC algorithms. This mechanism is extensible and can be applied to other non-blocking IPC algorithms as well. Future work involves reducing the synchronization overheads in more general IPC algorithms with multiple-writer semantics.

MOBILITY

Summarized by Praveen Yalagandula

ROBUST POSITIONING ALGORITHMS FOR DISTRIBUTED AD-HOC WIRELESS SENSOR NETWORKS

Chris Savarese, Jan Rabaey, University of California, Berkeley; Koen Langendoen, Delft University of Technology

The Pico Radio Network comprises more than a hundred sensors, monitors, and actuators equipped with wireless connectivity, with the following properties: (1) no infrastructure; (2) computation in a distributed fashion instead of centralized computation; (3) dynamic topology; (4) limited radio range; (5) nodes that can act as repeaters; (6) devices of low cost and with low power; and (7) sparse anchor nodes, i.e., nodes with GPS capability. A positioning algorithm determines the geographical position of each node in the network.

In two-dimensional space, each node needs three reference positions to estimate its geographical position. There are two problems that make positioning difficult in the Pico Radio Network setting: (1) a sparse anchor node problem, and (2) a range error problem.

The two-phase approach that the authors have taken in solving the problem consists of (1) a hop-terrain algorithm: in this first phase, each node roughly guesses its location from the distances calculated using multi-hops to the anchor points; and (2) a refinement algorithm: in this second phase, each node uses its neighbors' positions to refine its own position estimate. To guarantee convergence, this approach uses confidence metrics, assigning a value of 1.0 for anchor nodes and 0.1 to start for other nodes, and increasing these with each iteration.
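
A minimal sketch of the second (refinement) phase only, under simplifying assumptions: each node nudges its position estimate so that distances to its neighbors better match the measured ranges, weighting each neighbor by a confidence value (1.0 for anchors, lower for ordinary nodes). This illustrates the idea, not the published algorithm.

    from math import hypot

    def refine_position(est, neighbors, step=0.5):
        """One refinement iteration.

        est:       (x, y) current estimate of this node
        neighbors: list of ((x, y), measured_range, confidence) tuples
        """
        x, y = est
        dx_total = dy_total = weight_total = 0.0
        for (nx, ny), measured, conf in neighbors:
            dist = hypot(x - nx, y - ny) or 1e-9
            err = dist - measured                 # positive: we think we are too far away
            # Move along the line toward/away from the neighbor to shrink the error.
            dx_total += conf * (-err) * (x - nx) / dist
            dy_total += conf * (-err) * (y - ny) / dist
            weight_total += conf
        return (x + step * dx_total / weight_total,
                y + step * dy_total / weight_total)

    # Node truly near (3, 4); anchors report slightly noisy ranges.
    neighbors = [((0.0, 0.0), 5.1, 1.0), ((10.0, 0.0), 8.0, 1.0), ((0.0, 10.0), 6.8, 1.0)]
    est = (5.0, 5.0)
    for _ in range(50):
        est = refine_position(est, neighbors)
    print(tuple(round(c, 2) for c in est))   # approaches the true position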

The simulation tool OMNeT++ was used for both phases and in various scenarios. The results show that they achieved position errors of less than 33% in a scenario with 5% anchor nodes, an average connectivity of 7, and 5% range measurement error.

Guidelines for anchor node deployment are high connectivity (>10), a reasonable fraction of anchor nodes (>5%), and, for anchor placement, covered edges.

APPLICATION-SPECIFIC NETWORK MANAGEMENT FOR ENERGY-AWARE STREAMING OF POPULAR MULTIMEDIA FORMATS

Surendar Chandra, University of Georgia, and Amin Vahdat, Duke University

The main hindrance in supporting the increasing demand for mobile multimedia on PDAs is the battery capacity of these small devices. The idle-time power consumption is almost the same as the receive power consumption on typical wireless network cards (1319mW vs. 1425mW), while the sleep state consumes far less power (177mW). To let the system consume energy proportional to the stream quality, the network card should transition to the sleep state aggressively between packets.

The speaker presented previous work in which different multimedia streams were studied for PDAs: MS Media, Real Media, and QuickTime. It was found that if the inter-packet gap is predictable, then huge savings are possible, for example, about 80% savings in MS Media streams with few packet losses. The limitation of the IEEE 802.11b power-save mode comes into play when there are two or more nodes, producing a delay between the beacon time and when the node receives/sends packets. This badly affects the streams, since higher energy is consumed while waiting.

The authors propose traffic shaping for energy conservation, done by proxy such that packets arrive at regular intervals. This is achieved using a local proxy in the access point and a client-side proxy in the mobile host. Simulations show that traffic shaping reduces energy consumption and also reveal that higher-bandwidth streams have a lower energy metric (mJ/kB).
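
Using the power figures quoted above (roughly 1319mW idle, 1425mW receive, 177mW sleep), a back-of-the-envelope estimate shows why predictable inter-packet gaps matter: if the card can sleep between packets instead of idling, most of the per-interval energy disappears. The wakeup overhead below is an assumed parameter.

    IDLE_MW, RECV_MW, SLEEP_MW = 1319.0, 1425.0, 177.0

    def interval_energy_mj(gap_ms, recv_ms, shaped, wakeup_ms=2.0):
        """Energy (millijoules) spent per packet interval.

        Unshaped: the card idles for the whole gap waiting for the next packet.
        Shaped:   packets arrive on a known schedule, so the card sleeps the gap
                  and pays only a small (assumed) wakeup overhead at idle power.
        """
        recv = RECV_MW * recv_ms / 1000.0
        if shaped:
            return recv + SLEEP_MW * (gap_ms - wakeup_ms) / 1000.0 + IDLE_MW * wakeup_ms / 1000.0
        return recv + IDLE_MW * gap_ms / 1000.0

    gap, recv = 100.0, 4.0          # 100 ms between packets, 4 ms to receive one
    unshaped = interval_energy_mj(gap, recv, shaped=False)
    shaped = interval_energy_mj(gap, recv, shaped=True)
    print(f"unshaped {unshaped:.1f} mJ, shaped {shaped:.1f} mJ, "
          f"saving {100 * (1 - shaped / unshaped):.0f}%")   # roughly 80% for these numbers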

More information is at http://greenhouse.cs.uga.edu.

CHARACTERIZING ALERT AND BROWSE SERVICES OF MOBILE CLIENTS

Atul Adya, Paramvir Bahl, and Lili Qiu, Microsoft Research

Even though there has been a dramatic increase in Internet access from wireless devices, not many studies have been done on characterizing this traffic. In this paper, the authors characterize the traffic observed on the MSN Web site with both notification and browse traces.

Around 33 million browsing accesses and about 325 million notification entries are present in the traces. Three types of analyses are done for each of these two services: content analysis, concerning the most popular content categories and their distribution; popularity analysis, or the popularity distribution of documents; and user behavior analysis.

The analysis of the notification logs shows that document access rates follow a Zipf-like distribution, with most of the accesses concentrated on a small number of messages, and that the accesses exhibit geographical locality: users from the same locality tend to receive similar notification content. The browser log analysis shows that a smaller set of URLs is accessed most of the time, though the access pattern does not fit any Zipf curve, and the highly accessed URLs remain stable. A correlation study between the notification and browsing services shows that wireless users have a moderate correlation of 0.12.
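
Checking for a Zipf-like access pattern is a rank-frequency exercise: sort documents by access count and see whether log(frequency) falls roughly linearly in log(rank). The sketch below fits that slope by least squares on a synthetic trace; a slope near -1 is the classic Zipf signature.

    from math import log
    from collections import Counter
    import random

    def zipf_slope(access_log):
        """Least-squares slope of log(count) vs. log(rank) for a list of accessed item ids."""
        counts = sorted(Counter(access_log).values(), reverse=True)
        xs = [log(rank) for rank in range(1, len(counts) + 1)]
        ys = [log(c) for c in counts]
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

    # Synthetic trace: item k is requested with probability proportional to 1/k.
    random.seed(0)
    items = list(range(1, 201))
    weights = [1.0 / k for k in items]
    trace = random.choices(items, weights=weights, k=50_000)
    print(f"fitted slope {zipf_slope(trace):.2f}  (close to -1 suggests Zipf-like access)")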

FREENIX TRACK SESSIONS

BUILDING APPLICATIONS

Summarized by Matt Butner

INTERACTIVE 3-D GRAPHICS APPLICATIONS FOR TCL

Oliver Kersting, Jürgen Döllner, Hasso Plattner Institute for Software Systems Engineering, University of Potsdam

The integration of 3-D image rendering functionality into a scripting language permits interactive and animated 3-D development and application without the formalities and precision demanded by low-level C/C++ graphics and visualization libraries. The large and complex C++ API of the Virtual Rendering System (VRS) can be combined with the conveniences of the Tcl scripting language. The mapping of class interfaces is done via an automated process and generates respective wrapper classes, all of which ensures complete API accessibility and functionality without any significant performance penalty. The general C++-to-Tcl mapper SWIG grants the necessary object and hierarchical abilities without the use of object-oriented Tcl extensions. The final API mapping techniques address C++ features such as classes, overloaded methods, enumerations, and inheritance relations. All are implemented in a proof of concept that maps the complete C++ API of the VRS to Tcl and are showcased in a complete interactive 3-D map system.

THE AGFL GRAMMAR WORK LAB

Cornelis H.A. Koster, Erik Verbruggen, University of Nijmegen (KUN)

The growth and implementation of Natural Language Processing (NLP) is the cornerstone of the continued evolution and implementation of truly intelligent search machines and services. In part due to the growing collections of computer-stored, human-readable documents in the public domain, the implementation of linguistic analysis will become necessary for desirable precision and resolution. Subsequently, such tools and linguistic resources must be released into the public domain, and they have done so with the release of the AGFL Grammar Work Lab under the GNU Public License, as a tool for linguistic research and the development of NLP-based applications.

The AGFL (Affix Grammars over a Finite Lattice) Grammar Work Lab meshes context-free grammars with finite set-valued features that are acceptable to a range of languages. In computer-science terms, "syntax rules are procedures with parameters and a non-deterministic execution." The English Phrases for Information Retrieval (EP4IR), released with the AGFL-GWL as a robust grammar of English, is an AGFL-GWL-generated English parser that outputs "Head/Modifier" frames. The sentences "CompanyX sponsored this conference" and "This conference was sponsored by CompanyX" both generate [CompanyX [sponsored conference]]. The hope is that public availability of such tools will encourage further development of grammar and lexical software systems.

SWILL: A SIMPLE EMBEDDED WEB SERVER LIBRARY

Sotiria Lampoudi, David M. Beazley, University of Chicago

This paper won the FREENIX Best Student Paper award.

SWILL (Simple Web Interface Link Library) is a simple Web server in the form of a C library whose development was motivated by a wish to give cool applications an interface to the Web. The SWILL library provides a simple interface that can be efficiently implemented for tasks that vary from flexible Web-based monitoring to software debugging and diagnostics. Though originally designed to be integrated with high-performance scientific simulation software, the interface is generic enough to allow for unbounded uses. SWILL is a single-threaded server relying upon non-blocking I/O through the creation of a temporary server which services I/O requests. SWILL does not provide SSL or cryptographic authentication, but it does have HTTP authentication abilities.

A fantastic feature of SWILL is its support for SPMD-style parallel applications which utilize MPI, proving valuable for Beowulf clusters and large parallel machines. Another practical application was the implementation of SWILL in a modified Yalnix emulator used by University of Chicago operating systems courses, which utilized the added Yalnix functionality for OS development and debugging. SWILL requires minimal memory overhead and relies upon the HTTP/1.0 protocol.

NETWORK PERFORMANCE

Summarized by Florian Buchholz

LINUX NFS CLIENT WRITE PERFORMANCE

Chuck Lever, Network Appliance; Peter Honeyman, CITI, University of Michigan

Lever introduced a benchmark to measure NFS client write performance. Client performance is difficult to measure due to hindrances such as poor hardware or bandwidth limitations. Furthermore, measuring application performance does not identify weaknesses specifically at the client side. Thus, a benchmark was developed that tries to exercise only data transfers in one direction, between server and application. For this purpose, the benchmark was based on the block sequential write portion of the Bonnie file system benchmark. Once a benchmark for NFS clients is established, it can be used to improve client performance.

The performance measurements were performed with an SMP Linux client and both a Linux NFS server and a Network Appliance F85 filer. During testing, a periodic jump in write latency was discovered. This was due to a rather large number of pending write operations that were scheduled to be written after certain threshold values were exceeded. By introducing a separate daemon that flushes the cached write requests, the spikes could be eliminated, but as a result the average latency grows over time. The problem was traced to a function that scans a linked list of write requests. After a hashtable was added to improve lookup performance, the latency improved considerably.
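
The fix described above is the classic list-to-hashtable move: instead of scanning every pending write request, index the requests by a key such as the page offset. The sketch below is a generic illustration, not the actual Linux NFS client code.

    class PendingWrites:
        """Track outstanding write requests, indexed for O(1) lookup by page offset."""

        def __init__(self):
            self.by_offset = {}          # page offset -> request dict

        def add(self, offset, length, data):
            self.by_offset[offset] = {"offset": offset, "length": length, "data": data}

        def find(self, offset):
            # The original code walked a linked list of every pending request (O(n));
            # a hash lookup keeps the per-page check cheap even with many dirty pages.
            return self.by_offset.get(offset)

        def complete(self, offset):
            return self.by_offset.pop(offset, None)

    pw = PendingWrites()
    for page in range(10_000):
        pw.add(page * 4096, 4096, b"")
    print(pw.find(4096 * 1234)["length"])   # constant-time, no list scan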

The improved client was then used to measure throughput against the two servers. A discrepancy between the Linux server and the filer test was noticed, and the reason was traced back to a global kernel lock that was unnecessarily held when accessing the network stack. After correcting this, performance improved further. However, the measurements showed that a client may run slower when paired with fast servers on fast networks. This is due to heavy client interrupt loads, more network processing on the client side, and more global kernel lock contention.

The source code of the project is available at http://www.citi.umich.edu/projects/nfs-perf/patches.

A STUDY OF THE RELATIVE COSTS OF NETWORK SECURITY PROTOCOLS

Stefan Miltchev and Sotiris Ioannidis, University of Pennsylvania; Angelos Keromytis, Columbia University

With the increasing need for security and integrity of remote network services, it becomes important to quantify the communication overhead of IPSec and compare it to alternatives such as SSH, SCP, and HTTPS.

For this purpose, the authors set up three testing networks: a direct link, two hosts separated by two gateways, and three hosts connecting through one gateway. Protocols were compared in each setup, and manual keying was used to eliminate connection setup costs. For IPSec, the encryption algorithms AES, DES, 3DES, hardware DES, and hardware 3DES were used. In detail, FTP was compared to SFTP, SCP, and FTP over IPSec; HTTP to HTTPS and HTTP over IPSec; and NFS to NFS over IPSec and local disk performance.

The result of the measurements was that IPSec outperforms the other popular encryption schemes. Overall, unencrypted communication was fastest, but in some cases, like FTP, the overhead can be small. The use of crypto hardware can significantly improve performance. For future work, the inclusion of setup costs, hardware-accelerated SSL, SFTP, and SSH was mentioned.

CONGESTION CONTROL IN LINUX TCP

Pasi Sarolahti, University of Helsinki; Alexey Kuznetsov, Institute for Nuclear Research at Moscow

Having attended the talk and read the paper, I am still unclear about whether the authors are merely describing the design decisions of TCP congestion control or whether they are actually the creators of that part of the Linux code. My guess leans toward the former.

In the talk, the speaker compared the TCP congestion control measures specified by the IETF RFCs with the actual Linux implementation, which does conform to the basic principles but nevertheless has differences. A specific emphasis was placed on retransmission mechanisms and the congestion window. Also, several TCP enhancements, the NewReno algorithm, Selective ACKs (SACK), and Forward ACKs (FACK), were discussed and compared.

In some instances, Linux does not conform to the IETF specifications. Fast recovery does not fully follow RFC 2582, since the threshold for triggering retransmit is adjusted dynamically and the congestion window's size is not changed. Also, the round-trip-time estimator and the RTO calculation differ from RFC 2988, since Linux uses more conservative RTT estimates and a minimum RTO of 200ms. The performance measurements showed that with the additional Linux-specific features enabled, slightly higher throughput, a steadier data flow, and fewer unnecessary retransmissions can be achieved.
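
For reference, the standard RTO computation that the Linux code deviates from looks roughly like the sketch below (an RFC 2988-style smoothed RTT and variance), here with a 200ms floor substituted for the RFC's one-second minimum to mimic the Linux behavior described above. The smoothing constants follow the RFC; the floor is the only Linux-specific tweak shown.

    class RtoEstimator:
        """RFC 2988-style RTO calculation with a Linux-like 200 ms minimum."""

        def __init__(self, min_rto=0.200, max_rto=60.0):
            self.srtt = None      # smoothed round-trip time (seconds)
            self.rttvar = None    # round-trip time variance
            self.min_rto = min_rto
            self.max_rto = max_rto

        def update(self, rtt_sample):
            if self.srtt is None:                 # first measurement
                self.srtt = rtt_sample
                self.rttvar = rtt_sample / 2
            else:                                 # alpha = 1/8, beta = 1/4 per the RFC
                self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - rtt_sample)
                self.srtt = 0.875 * self.srtt + 0.125 * rtt_sample
            return self.rto()

        def rto(self):
            raw = self.srtt + 4 * self.rttvar
            return min(self.max_rto, max(self.min_rto, raw))

    est = RtoEstimator()
    for sample in (0.030, 0.032, 0.029, 0.120, 0.031):   # RTT samples in seconds
        print(f"sample {sample * 1000:.0f} ms -> RTO {est.update(sample) * 1000:.0f} ms")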

XTREME XCITEMENT

Summarized by Steve Bauer

THE FUTURE IS COMING: WHERE THE X WINDOW SHOULD GO

Jim Gettys, Compaq Computer Corp.

Jim Gettys, one of the principal authors of the X Window System, outlined the near-term objectives for the system, primarily focusing on the changes and infrastructure required to enable replication and migration of X applications. Providing better support for this functionality would enable users to retrieve or duplicate X applications between their servers at home and at work.

One interesting example of an application that is currently capable of migration and replication is Emacs. To create a new frame on DISPLAY, try "M-x make-frame-on-display <Return> DISPLAY <Return>".

However, technical challenges make replication and migration difficult in general. These include the "major headaches" of server-side fonts, the nonuniformity of X servers and screen sizes, and the need to appropriately retrofit toolkits. Authentication and authorization issues are obviously also important. The rest of the talk delved into some of the details of these interesting technical challenges.

HACKING IN THE KERNEL

Summarized by Hai Huang

AN IMPLEMENTATION OF SCHEDULER ACTIVATIONS ON THE NETBSD OPERATING SYSTEM

Nathan J. Williams, Wasabi Systems

Scheduler activations are an old idea. Basically, there are benefits and drawbacks to using solely kernel-level or user-level threading; scheduler activations combine the two layers of control to provide more concurrency in the system.

In his talk, Nathan gave a fairly detailed description of the implementation of scheduler activations in the NetBSD kernel. One important change in the implementation is to differentiate the thread context from the process context. This is done by defining a separate data structure for these threads and relocating some of the information that was embedded in the process context to these thread contexts. The stack was especially a concern, due to the upcall: special handling must be done to make sure that the upcall doesn't mess up the stack, so that the preempted user-level thread can continue afterwards. Lastly, Nathan explained that signals were handled by upcalls.


AUTHORIZATION AND CHARGING IN PUBLIC WLANS USING FREEBSD AND 802.1X

Pekka Nikander, Ericsson Research NomadicLab

The 802.1x standards are well known in the wireless community as link-layer authentication protocols. In this talk, Pekka explained some novel ways of using the 802.1x protocols that might be of interest to people on the move. It is possible to set up a public WLAN that would support various charging schemes via virtual tokens, which people can purchase or earn and later use.

This is implemented on FreeBSD using the netgraph utility. It is basically a filter in the link layer that differentiates traffic based on the MAC address of the client node, which is either authenticated, denied, or let through. The overhead for this service is fairly minimal.

ACPI IMPLEMENTATION ON FREEBSD

Takanori Watanabe, Kobe University

ACPI (Advanced Configuration and Power Interface) was proposed as a joint effort by Intel, Toshiba, and Microsoft to provide a standard, finer-grained method of managing the power states of individual devices within a system. Such low-level power management is especially important for mobile and embedded systems that are powered by fixed-capacity batteries.

Takanori explained that the ACPI specification is composed of three parts: tables, BIOS, and registers. He was able to implement some of the functionality of ACPI in a FreeBSD kernel. The ACPI Component Architecture was implemented by Intel, and it provides a high-level ACPI API to the operating system; Takanori's ACPI implementation is built upon this underlying layer of APIs.

ACCESS CONTROL

Summarized by Florian Buchholz

DESIGN AND PERFORMANCE OF THE OPENBSD STATEFUL PACKET FILTER (PF)

Daniel Hartmeier, Systor AG

Daniel Hartmeier described the new stateful packet filter (pf) that replaced IPFilter in the OpenBSD 3.0 release. IPFilter could no longer be included due to licensing issues, and thus there was a need to write a new filter making use of optimized data structures.

The filter rules are implemented as a linked list which is traversed from start to end. Two actions may be taken according to the rules, "pass" or "block": a "pass" action forwards the packet, and a "block" action drops it. Where more than one rule matches a packet, the last rule wins. Rules that are marked as "final" immediately terminate any further rule evaluation. An optimization called "skip-steps" was also implemented, in which blocks of similar rules are skipped if they cannot match the current packet; these skip-steps are calculated when the rule set is loaded. Furthermore, a state table keeps track of TCP connections, and only packets that match the sequence numbers are allowed. UDP and ICMP queries and replies are also considered in the state table, where initial packets create an entry for a pseudo-connection with a low initial timeout value. The state table is implemented using a balanced binary search tree. NAT mappings are also stored in the state table, whereas application proxies reside in user space. The packet filter is also able to perform fragment reassembly and to modulate TCP sequence numbers to protect hosts behind the firewall.
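
The evaluation model described here (walk every rule, last match wins, stop early only at a rule marked final) can be captured in a few lines; the rule representation below is a simplification for illustration, not pf's actual syntax or data structures.

    def evaluate(rules, packet):
        """Last-matching-rule-wins evaluation with 'final' termination."""
        decision = "pass"                      # default policy if nothing matches
        for rule in rules:
            if all(packet.get(k) == v for k, v in rule["match"].items()):
                decision = rule["action"]
                if rule.get("final"):          # stop evaluating further rules
                    break
        return decision

    rules = [
        {"match": {},                            "action": "block"},                  # block everything...
        {"match": {"proto": "tcp", "dport": 22}, "action": "pass"},                   # ...but allow ssh
        {"match": {"src": "192.0.2.66"},         "action": "block", "final": True},   # banned host, stop here
    ]

    print(evaluate(rules, {"proto": "tcp", "dport": 22, "src": "198.51.100.7"}))  # pass
    print(evaluate(rules, {"proto": "tcp", "dport": 22, "src": "192.0.2.66"}))    # block
    print(evaluate(rules, {"proto": "udp", "dport": 53, "src": "198.51.100.7"}))  # block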

Pf was compared against IPFilter as well as Linux's iptables by measuring throughput and latency with increasing traffic rates and different packet sizes. In a test with a fixed rule-set size of 100, iptables outperformed the other two filters, whose results were close to each other. In a second test, where the rule-set size was continually increased, iptables consistently had about twice the throughput of the other two (which evaluate the rule set on both the incoming and outgoing interfaces). A third test compared only pf and IPFilter, using a single rule that created state in the state table, with a fixed state-entry size; pf reached an overloaded state much later than IPFilter. The experiment was repeated with a variable state-entry size, and pf performed much better than IPFilter for a small number of states.

In general, rule-set evaluation is expensive, and benchmarks only reflect extreme cases, whereas in real life other behavior should be observed. Furthermore, the benchmarks show that stateful filtering can actually improve performance, due to cheap state-table lookups as compared to rule evaluation.

ENHANCING NFS CROSS-ADMINISTRATIVE DOMAIN ACCESS

Joseph Spadavecchia and Erez Zadok, Stony Brook University

The speaker presented modifications to an NFS server that allow improved NFS access between administrative domains. A problem lies in the fact that NFS assumes a shared UID/GID space, which makes it unsafe to export files outside the administrative domain of the server. The design goals were to leave the protocol and existing clients unchanged, to require a minimum amount of server changes, and to provide flexibility and increased performance.

To solve the problem, two techniques are utilized: "range-mapping" and "file-cloaking." Range-mapping maps IDs between client and server. The mapping is performed on a per-export basis and has to be manually set up in an export file; the mappings can be 1-1, N-N, or N-1. In file-cloaking, the server restricts file access based on UID/GID and special cloaking-mask bits. Here, users can only access their own file permissions. The policy on whether or not the file is visible to others is set by the cloaking mask, which is logically ANDed with the file's protection bits. File-cloaking only works, however, if the client doesn't hold cached copies of directory contents and file attributes. Because of this, the clients are forced to re-read directories, by incrementing the mtime value of the directory each time it is listed.
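
Range-mapping is essentially an ID translation table applied per export; the sketch below shows a minimal version that rewrites client UIDs into the server's ID space on the way in and back again on the way out. The table format and the "nobody" fallback are invented for illustration.

    class RangeMap:
        """Translate UIDs between a client's ID space and the server's, per export."""

        def __init__(self, ranges):
            # ranges: list of (client_start, server_start, count) tuples, e.g. a 1-1 block map
            self.ranges = ranges

        def to_server(self, client_uid, default=65534):      # squash strangers to 'nobody'
            for c_start, s_start, count in self.ranges:
                if c_start <= client_uid < c_start + count:
                    return s_start + (client_uid - c_start)
            return default

        def to_client(self, server_uid, default=65534):
            for c_start, s_start, count in self.ranges:
                if s_start <= server_uid < s_start + count:
                    return c_start + (server_uid - s_start)
            return default

    # Export maps client UIDs 1000-1999 onto server UIDs 20000-20999.
    m = RangeMap([(1000, 20000, 1000)])
    print(m.to_server(1042))   # 20042: what the server stores on disk
    print(m.to_client(20042))  # 1042:  what the client sees in replies
    print(m.to_server(3))      # 65534: unmapped IDs squashed to nobody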

To test the performance of the modified server, five different NFS configurations were evaluated: an unmodified NFS server was compared against one with the modified code included but not used, one with only range-mapping enabled, one with only file-cloaking enabled, and one version with all modifications enabled. For each setup, different file system benchmarks were run. The results show only a small overhead when the modifications are used, generally an increase of below 5%. Another experiment tested the performance of the system with an increasing number of mapped or cloaked entries; the results show that an increase from 10 to 1,000 entries resulted in a maximum of about a 14% cost in performance.

One member of the audience pointed out that if clients choose to ignore the changed mtimes from the server, and thus still hold caches of the directory entries, the file-cloaking mechanism could be defeated. After a rather lengthy debate, the speaker had to concede that the model doesn't add any extra security. Another question was asked about the scalability of setting up range-mapping; the speaker referred to application-level tools that could be developed for that purpose.

The software is available at ftp://ftp.fsl.cs.sunysb.edu/pub/enf.

ENGINEERING OPEN SOURCE SOFTWARE

Summarized by Teri Lampoudi

NINGAUI: A LINUX CLUSTER FOR BUSINESS

Andrew Hume, AT&T Labs – Research; Scott Daniels, Electronic Data Systems Corp.

Ningaui is a general-purpose, highly available, resilient architecture built from commodity software and hardware. Emphatically, however, Ningaui is not a Beowulf. Hume calls the cluster design the "Swiss canton model," in which there are a number of loosely affiliated, independent nodes with data replicated among them. Jobs are assigned by bidding and leases, and cluster services done as session-based servers are sited via generic job assignment. The emphasis is on keeping the architecture end-to-end, checking all work via checksums, and logging everything. The resilience and high availability required by their goal of 8-5 maintenance, as opposed to the typical 24-7 model where people get paged whenever the slightest thing goes wrong regardless of the time of day, is achieved by job restartability. Finally, all computation is performed on local data, without the use of NFS or network-attached storage.

Hume's message is a hopeful one: despite the many problems encountered, things like kernel and service limits, auto-installation problems, TCP storms, and the like, the existence of source code and the paranoid practice of logging and checksumming everything has helped. The final product performs reasonably well, and it appears that the resilience techniques employed do make a difference. One drawback, however, is that the software mentioned in the paper is not yet available for download.

CPCMS: A CONFIGURATION MANAGEMENT SYSTEM BASED ON CRYPTOGRAPHIC NAMES

Jonathan S. Shapiro and John Vanderburgh, Johns Hopkins University

This paper won the FREENIX Best Paper award.

The basic notion behind the project is the fact that everyone has a pet complaint about CVS, and yet it is currently the configuration manager in most widespread use. Shapiro has unleashed an alternative. Interestingly, he did not begin but ended with the usual host of reasons why CVS is bad. The talk instead began abruptly by characterizing the job of a software configuration manager, continued by stating the namespaces which it must handle, and wrapped up with the challenges faced.

X MEETS Z: VERIFYING CORRECTNESS IN THE PRESENCE OF POSIX THREADS

Bart Massey, Portland State University; Robert T. Bauer, Rational Software Corp.

Massey delivered a humorous talk on the insight gained from applying the Z formal specification notation to system software design, rather than the more informal analysis and design process normally used.

The story is told with respect to writing XCB, which replaces the Xlib protocol layer and is supposed to be thread-friendly. But where threads are concerned, deadlock avoidance becomes a hard problem that cannot be solved in an ad-hoc manner, while full model checking is also too hard. In this case, Massey resorted to a Z specification to model the XCB lower layer, abstract away locking and data transfers, and locate fundamental issues. Essentially, the difficulties of searching the literature and locating information relevant to the problem at hand were overcome. As Massey put it, "the formal method saved the day."

FILE SYSTEMS

Summarized by Bosko Milekic

PLANNED EXTENSIONS TO THE LINUX EXT2/EXT3 FILESYSTEM

Theodore Y. Ts'o, IBM; Stephen Tweedie, Red Hat

The speaker presented improvements to the Linux Ext2 file system, with the goal of allowing various expansions while striving to maintain compatibility with older code. Improvements have been facilitated by a few extra superblock fields that were added to Ext2 not long before the Linux 2.0 kernel was released. The fields allow file system features to be added without compromising the existing setup; this is done by providing bits indicating the impact of the added features with respect to compatibility. Namely, a file system with the "incompat" bit marked is not allowed to be mounted, and similarly, a "read-only" marking would only allow the file system to be mounted read-only.
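
The compatibility scheme amounts to feature bitmasks consulted at mount time: an unknown bit in the incompat set forces a refusal, and an unknown bit in the read-only-compat set forces a read-only mount. The sketch below illustrates that decision logic generically; the flag names are placeholders, not the actual Ext2/Ext3 bit definitions.

    # Feature bits this (hypothetical) implementation understands.
    SUPPORTED_COMPAT    = {"dir_prealloc"}
    SUPPORTED_RO_COMPAT = {"sparse_super"}
    SUPPORTED_INCOMPAT  = {"filetype"}

    def mount_decision(sb):
        """Return 'rw' or 'ro', or raise, based on the superblock's feature sets."""
        unknown_incompat = sb["incompat"] - SUPPORTED_INCOMPAT
        if unknown_incompat:
            raise RuntimeError(f"refusing to mount: unknown incompat features {unknown_incompat}")
        unknown_ro = sb["ro_compat"] - SUPPORTED_RO_COMPAT
        if unknown_ro:
            return "ro"   # safe to read, but we must not modify structures we don't understand
        # Unknown plain-compat features are harmless either way.
        return "rw"

    print(mount_decision({"compat": {"dir_prealloc"}, "ro_compat": set(), "incompat": set()}))  # rw
    print(mount_decision({"compat": set(), "ro_compat": {"huge_file"}, "incompat": set()}))     # ro
    # mount_decision({"compat": set(), "ro_compat": set(), "incompat": {"extents"}})  # would raise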

Directory indexing replaces linear directory searches with a faster search using a fixed-depth tree and hashed keys. File system size can be dynamically increased, and the expanded inode, doubled from 128 to 256 bytes, allows for more extensions.

Other potential improvements were discussed as well, in particular pre-allocation for contiguous files, which allows for better performance in certain setups by pre-allocating contiguous blocks. Security-related modifications, extended attributes, and ACLs were also mentioned. An implementation of these features already exists but has not yet been merged into the mainline Ext2/3 code.

RECENT FILESYSTEM OPTIMISATIONS ON FREEBSD

Ian Dowse, Corvil Networks; David Malone, CNRI, Dublin Institute of Technology

David Malone presented four important file system optimizations for the FreeBSD OS: soft updates, dirpref, vmiodir, and dirhash. It turns out that certain combinations of the optimizations (beautifully illustrated in the paper) may yield performance improvements of anywhere between 2 and 10 orders of magnitude for real-world applications.

All four of the optimizations deal with file system metadata. Soft updates allow for asynchronous metadata updates. Dirpref changes the way directories are organized, attempting to place child directories closer to their parents, thereby increasing locality of reference and reducing disk-seek times. Vmiodir trades some extra memory for better directory caching. Finally, dirhash, which was implemented by Ian Dowse, changes the way in which entries are searched in a directory: FFS uses a linear search to find an entry by name, while dirhash builds a hashtable of directory entries on the fly. For a directory of size n with a working set of m files, a search that in certain cases could have been O(nm) has been reduced, due to dirhash, to effectively O(n + m).
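
The dirhash idea in miniature: pay O(n) once to hash a large directory's entries the first time it is searched, then answer each of the m subsequent name lookups in constant time instead of rescanning the directory. The sketch below is a generic illustration, not the FFS code.

    class Directory:
        def __init__(self, entries):
            self.entries = entries        # list of (name, inode) pairs, as stored on disk
            self._index = None            # built lazily, like dirhash

        def lookup_linear(self, name):
            for entry_name, inode in self.entries:     # O(n) scan per lookup
                if entry_name == name:
                    return inode
            return None

        def lookup_hashed(self, name):
            if self._index is None:                    # one O(n) pass on first use
                self._index = {entry_name: inode for entry_name, inode in self.entries}
            return self._index.get(name)               # O(1) per lookup afterwards

    d = Directory([(f"file{i:05d}", i) for i in range(100_000)])
    # A working set of m lookups costs O(n*m) with the linear scan,
    # but only O(n + m) once the hash index has been built.
    print(d.lookup_hashed("file09999"), d.lookup_hashed("file00001"))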

FILESYSTEM PERFORMANCE AND SCALABILITY IN LINUX 2.4.17

Ray Bryant, SGI; Ruth Forester, IBM LTC; John Hawkes, SGI

This talk focused on performance evaluation of a number of file systems available and commonly deployed on Linux machines. Comparisons under various configurations of Ext2, Ext3, ReiserFS, XFS, and JFS were presented.

The benchmarks chosen for the data gathering were pgmeter, filemark, and the AIM Benchmark Suite VII. Pgmeter measures the rate of data transfer of reads/writes of a file under a synthetic workload. Filemark is similar to postmark in that it is an operation-intensive benchmark, although filemark is threaded and offers various other features that postmark lacks. AIM VII measures performance for various file-system-related functionalities; it offers various metrics under an imposed workload, thus stressing the performance of the file system not only under I/O load but also under significant CPU load.

Tests were run on three different setups: a small, a medium, and a large configuration. ReiserFS and Ext2 appear at the top of the pile for the smaller and medium setups. Notably, XFS and JFS perform worse for smaller system configurations than the others, although XFS clearly appears to generate better numbers than JFS. It should be noted that XFS seems to scale well under a higher load; this was most evident in the large-system results, where XFS appears to offer the best overall results.

THINGS TO THINK ABOUT

Summarized by Bosko Milekic

SPEEDING UP KERNEL SCHEDULER BY REDUCING CACHE MISSES

Shuji Yamamura, Akira Hirai, Mitsuru Sato, Masao Yamamoto, Akira Naruse, Kouichi Kumon, Fujitsu Labs

This was an interesting talk pertaining to the effects of cache coloring for task structures in the Linux kernel scheduler (Linux kernel 2.4.x). The speaker first presented some interesting benchmark numbers for the Linux scheduler, showing that as the number of processes on the task queue was increased, performance decreased. The authors used some really nifty hardware to measure the number of bus transactions throughout their tests and were thus able to reasonably quantify the impact that cache misses had in the Linux scheduler.

Their experiments led them to implement a cache coloring scheme for task structures, which were previously aligned on 8KB boundaries and therefore were eventually being mapped to the same cache lines. This unfortunate placement of task structures in memory induced a significant number of cache misses as the number of tasks grew in the scheduler.

The implemented solution consisted of aligning task structures to more evenly distribute cached entries across the L2 cache. The result was inevitably fewer cache misses in the scheduler. Some negative effects were observed in certain situations, primarily due to more cache slots being used by task-structure data in the scheduler, thus forcing data previously cached there to be pushed out.
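
The effect being fixed is easy to see numerically: if every task structure starts on the same power-of-two boundary, the structures all map onto the same cache sets. The sketch below computes set indices for a toy cache and shows how adding a per-task color offset spreads them out; the cache geometry and offsets are illustrative, not the actual hardware or kernel values.

    CACHE_LINE = 64          # bytes per line
    NUM_SETS = 128           # toy cache way: 128 sets * 64 bytes = 8KB per way

    def cache_set(addr):
        return (addr // CACHE_LINE) % NUM_SETS

    def task_addresses(num_tasks, stride=8192, color_step=0):
        """Place task structures every 'stride' bytes, optionally offset by a color."""
        return [i * stride + (i % 16) * color_step for i in range(num_tasks)]

    uncolored = task_addresses(32)                           # all on 8KB boundaries
    colored = task_addresses(32, color_step=CACHE_LINE)      # shift each task by one line per color

    print("distinct sets, uncolored:", len({cache_set(a) for a in uncolored}))  # 1: every task collides
    print("distinct sets, colored:  ", len({cache_set(a) for a in colored}))    # 16: spread across sets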


WORK-IN-PROGRESS REPORTS

Summarized by Brennan Reynolds

RESOURCE VIRTUALIZATION TECHNIQUES FOR WIDE-AREA OVERLAY NETWORKS

Kartik Gopalan, Stony Brook University

This work addressed the issue of provisioning a maximum number of virtual overlay networks (VONs) with diverse quality-of-service (QoS) requirements on a single physical network. Each of the VONs is logically isolated from the others to ensure its QoS requirements. Gopalan mentioned several critical research issues with this problem that are currently being investigated. Dealing with how to provision the network at various levels (link, route, or path) and then enforce the provisioning at run-time is one of the toughest challenges. Currently, Gopalan has developed several algorithms to handle admission control, end-to-end QoS, route selection, scheduling, and fault tolerance in the network.

For more information, visit http://www.ecsl.cs.sunysb.edu.

VISUALIZING SOFTWARE INSTABILITY

Jennifer Bevan, University of California, Santa Cruz

Detection of instability in software has typically been an afterthought. The point in the development cycle when the software is reviewed for instability is usually after it is difficult and costly to go back and perform major modifications to the code base. Bevan has developed a technique to allow the visualization of unstable regions of code that can be used much earlier in the development cycle. Her technique creates a time series of dependency graphs that include clusters and lines called fault lines. From the graphs, a developer is able to easily determine where the unstable sections of code are and proactively restructure them. She is currently working on a prototype implementation.

For more information, visit http://www.cse.ucsc.edu/~jbevan/evo_viz.

RELIABLE AND SCALABLE PEER-TO-PEER WEB DOCUMENT SHARING

Li Xiao, William and Mary College

The idea presented by Xiao would allow end users to share the content of their Web browser caches with neighboring Internet users. The rationale for this is that today's end users are increasingly connected to the Internet over high-speed links, and the browser caches are becoming large storage systems. Therefore, if an individual accesses a page which does not exist in their local cache, Xiao is suggesting that they first query other end users for the content before trying to access the machine hosting the original. This strategy does have some serious problems associated with it that still need to be addressed, including ensuring the integrity of the content and protecting the identity and privacy of the end users.

SEREL: FAST BOOTING FOR UNIX

Leni Mayo, Fast Boot Software

Serel is a tool that generates a visual representation of a UNIX machine's boot-up sequence. It can be used to identify the critical path and can show whether a particular service or process blocks for an extended period of time. This information could be used to determine where the largest performance gains could be realized by tuning the order of execution at boot-up. Serel creates a dependency graph, expressed in XML, during boot-up; this graph is then used to create the visual representation. Currently the tool only works on POSIX-compliant systems, but Mayo stated that he would be porting it to other platforms. Other extensions that were mentioned included having the metadata include the use of shared libraries and monitoring the suspend/resume sequence of portable machines.

For more information, visit http://www.fastboot.org.


BERKELEY DB XML

John Merrells, Sleepycat Software

Merrells gave a quick introduction and overview of the new XML library for Berkeley DB. The library specializes in the storage and retrieval of XML content through a tool called XPath. The library allows for multiple containers per document and stores everything natively as XML. The user is also given a wide range of elements to create indices with, including edges, elements, text strings, or presence. The XPath tool consists of a query parser, generator, optimizer, and execution engine. To conclude his presentation, Merrells gave a live demonstration of the software.

For more information, visit http://www.sleepycat.com/xml.

CLUSTER-ON-DEMAND (COD)

Justin Moore, Duke University

Modern clusters are growing at a rapid rate. Many have pushed beyond the 5,000-machine mark, and deploying them results in large expenses as well as management and provisioning issues. Furthermore, if the cluster is "rented" out to various users, it is very time-consuming to configure it to a user's specs, regardless of how long they need to use it. The COD work presented creates dynamic virtual clusters within a given physical cluster. The goal was to have a provisioning tool that would automatically select a chunk of available nodes and install the operating system and middleware specified by the customer in a short period of time. This would allow a greater use of resources, since multiple virtual clusters can exist at once, and by using a virtual cluster the size can be changed dynamically. Moore stated that they have created a working prototype and are currently testing and benchmarking its performance.

For more information, visit http://www.cs.duke.edu/~justin/cod.


CATACOMB

Elias Sinderson, University of California, Santa Cruz

Catacomb is a project to develop a database-backed DASL module for the Apache Web server. It was designed as a replacement for WebDAV. The initial release of the module only contains support for a MySQL database but could be extended to others. Sinderson briefly touched on the performance of her module: it was comparable to the mod_dav Apache module for all query types but search. The presentation was concluded with remarks about adding support for the lock method and including ACL specifications in the future.

For more information, visit http://ocean.cse.ucsc.edu/catacomb.

SELF-ORGANIZING STORAGE

Dan Ellard, Harvard University

Ellard's presentation introduced a storage system that tunes itself based on the workload of the system, without requiring the intervention of the user. The intelligence is implemented as a virtual self-organizing disk that resides below any file system. The virtual disk observes the access patterns exhibited by the system and then attempts to predict what information will be accessed next. An experiment was done using an NFS trace at a large ISP on one of their email servers. Ellard's self-organizing storage system worked well, which he attributes to the fact that most of the files being requested were large email boxes. Areas of future work include exploration of the length and detail of the data collection stage, as well as the CPU impact of running the virtual disk layer.

VISUAL DEBUGGING

John Costigan, Virginia Tech

Costigan feels that the state of current debugging facilities in the UNIX world is not as good as it should be. He proposes the addition of several elements to programs like ddd and gdb. The first addition is the inclusion of separate heap and stack components. Another is to display only certain fields of a data structure and to have the ability to zoom in and out if needed. Finally, the debugger should provide the ability to visually present complex data structures, including linked lists, trees, etc. He said that a beta version of a debugger with these abilities is currently available.

For more information, visit http://infovis.cs.vt.edu/datastruct.

IMPROVING APPLICATION PERFORMANCE THROUGH SYSTEM-CALL COMPOSITION

Amit Purohit, University of Stony Brook

Web servers perform a huge number of context switches and a lot of internal data copying during normal operation. These two elements can drastically limit the performance of an application, regardless of the hardware platform it is run on. The Compound System Call (CoSy) framework is an attempt to reduce the performance penalty for context switches and data copies. It includes a user-level library and several kernel facilities that can be used via system calls. The library provides the programmer with a complete set of memory-management functions. Performance tests using the CoSy framework showed a large saving for context switches and data copies; the only area where the savings between conventional libraries and CoSy were negligible was for very large file copies.

For more information, visit http://www.cs.sunysb.edu/~purohit.

ELASTIC QUOTAS

John Oscar, Columbia University

Oscar began by stating that most resources are flexible but that, to date, disks have not been. While approximately 80 percent of files are short-lived, disk quotas are hard limits imposed by administrators. The idea behind elastic quotas is to have a non-permanent storage area that each user can temporarily use. Oscar suggested the creation of an ehome directory structure to be used in deploying elastic quotas. Currently he has implemented elastic quotas as a stackable file system.

There were also several trends apparent in the data. Many users would ssh to their remote host (good) but then launch an email reader on the local machine which would connect via POP to the same host and send the password in clear text (bad). This situation can easily be remedied by tunneling protocols like POP through SSH, but it appeared that many people were not aware this could be done. While most of his comments on the use of the wireless network were negative, the list of passwords he had collected showed that people were indeed using strong passwords. His recommendations were to educate and encourage people to use protocols like IPSec, SSH, and SSL when conducting work over a wireless network, because you never know who else is listening.

SYSTRACE

Niels Provos, University of Michigan

How can one be sure the applications one is using actually do exactly what their developers said they do? Short answer: you can't. People today are using a large number of complex applications, which means it is impossible to check each application thoroughly for security vulnerabilities. There are tools out there that can help, though. Provos has developed a tool called systrace that allows a user to generate a policy of acceptable system calls a particular application can make. If the application attempts to make a call that is not defined in the policy, the user is notified and allowed to choose an action. Systrace includes an auto-policy generation mechanism, which uses a training phase to record the actions of all programs the user executes. If the user chooses not to be bothered by applications breaking policy, systrace allows default enforcement actions to be set. Currently systrace is implemented for FreeBSD and NetBSD, with a Linux version coming out shortly.
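
The summary does not describe systrace's actual policy syntax or interface, so the minimal sketch below only illustrates the general train-then-enforce idea using a hypothetical record of (syscall, argument) pairs; it is not the real systrace policy format or API.

    # Hypothetical train-then-enforce sketch; not the real systrace interface.
    ALLOW, DENY, ASK = "allow", "deny", "ask"

    def train(observed_calls):
        """Build a policy from calls observed while running a trusted workload."""
        return set(observed_calls)

    def enforce(policy, syscall, arg, default=ASK):
        """Return the action to take for a syscall the application attempts."""
        if (syscall, arg) in policy:
            return ALLOW
        return default      # e.g., DENY when the user opts not to be asked

    training_run = [("open", "/etc/resolv.conf"), ("connect", "53/udp")]
    policy = train(training_run)
    print(enforce(policy, "open", "/etc/passwd"))   # -> "ask" (not in the policy)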

For more information visit http://www.citi.umich.edu/u/provos/systrace/.

VERIFIABLE SECRET REDISTRIBUTION FOR SURVIVABLE STORAGE SYSTEMS

Ted Wong, Carnegie Mellon University

Wong presented a protocol that can be used to re-create a file distributed over a set of servers, even if one of the servers is damaged. In his scheme, the user must choose how many shares the file is split into and the number of servers it will be stored across. The goal of this work is to provide persistent, secure storage of information even if it comes under attack. Wong stated that his design of the protocol was complete, and he is currently building a prototype implementation of it.
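
Wong's protocol itself is not detailed in the summary; as background, the sketch below shows a standard (m, n) threshold split in the style of Shamir secret sharing, the kind of building block such share-based storage systems rely on. The field prime, share layout, and parameters are illustrative assumptions, not Wong's scheme.

    import random

    PRIME = 2**127 - 1  # illustrative prime field, large enough for small secrets

    def split(secret, n, m):
        """Split `secret` into n shares; any m of them can reconstruct it."""
        coeffs = [secret] + [random.randrange(PRIME) for _ in range(m - 1)]
        shares = []
        for x in range(1, n + 1):
            y = 0
            for c in reversed(coeffs):          # Horner evaluation mod PRIME
                y = (y * x + c) % PRIME
            shares.append((x, y))
        return shares

    def reconstruct(shares):
        """Lagrange interpolation at x = 0 over the collected shares."""
        secret = 0
        for i, (xi, yi) in enumerate(shares):
            num, den = 1, 1
            for j, (xj, _) in enumerate(shares):
                if i != j:
                    num = (num * -xj) % PRIME
                    den = (den * (xi - xj)) % PRIME
            secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
        return secret

    shares = split(123456789, n=5, m=3)          # e.g., five servers, any three suffice
    print(reconstruct(shares[:3]) == 123456789)  # True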

For more information visit http://www.cs.cmu.edu/~tmwong/research.

THE GURU IS IN SESSIONS

LINUX ON LAPTOP/PDA

Bdale Garbee, HP Linux Systems Operation

Summarized by Brennan Reynolds

This was a roundtable-like discussion with Garbee acting as moderator. He is currently in charge of the Debian distribution of Linux and is involved in porting Linux to platforms other than i386. He has successfully helped port the kernel to the Alpha, Sparc, and ARM, and is currently working on a version to run on VAX machines.

A discussion was held on which file system is best for battery life. The Reiser and Ext3 file systems were discussed; people commented that when using Ext3 on their laptops the disk would never spin down and thus was always consuming a large amount of power. One suggestion was to use a RAM-based file system for any volume that requires a large number of writes and to only have the file system written to disk at infrequent intervals or when the machine is shut down or suspended.

A question was directed to Garbee about his knowledge of how the HP-Compaq merger would affect Linux. Garbee thought that the merger was good for Linux, both within the company and for the open source community as a whole. He stated that the new company would be the number one shipper of systems with Linux as the base operating system. He also fielded a question about the support of older HP servers and their ability to run Linux, saying that indeed Linux has been ported to them and he personally had several machines in his basement running it.

A question which sparked a large number of responses concerned problems people had with running Linux on mobile platforms. Most people actually did not have many problems at all. There were a few cases of a particular machine model not working, but there were only two widespread issues: docking stations and the new ACPI power management scheme. Docking stations still appear to be somewhat of a headache, but most people had developed ad hoc solutions for getting them to work. The ACPI power management scheme developed by Intel does not appear to have a quick solution. Ted Ts'o, one of the head kernel developers, stated that there are fundamental architectural problems with the 2.4 series kernel that do not easily allow ACPI to be added. However, he also stated that ACPI is supported in the newest development kernel, 2.5, and will be included in 2.6.

The final topic was the possibility of purchasing a laptop without paying any charge/tax for Microsoft Windows. The consensus was that currently this is not possible. Even if the machine comes with Linux or without any operating system, the vendors have contracts in place which require them to pay Microsoft for each machine they ship if that machine is capable of running a Microsoft operating system. The only way this will change is if the demand for other operating systems increases to the point that vendors renegotiate their contracts with Microsoft, and no one saw this happening in the near future.

LARGE CLUSTERS

Andrew Hume, AT&T Labs – Research

Summarized by Teri Lampoudi

This was a session on clusters, big data, and resilient computing. Given that the audience was primarily interested in clusters, and not necessarily big data or beating nodes to a pulp, the conversation revolved mainly around what it takes to make a heterogeneous cluster resilient. Hume also presented a FREENIX paper on his current cluster project, which explained in more detail much of what was abstractly claimed in the guru session.

Hume's use of the word "cluster" referred not to what I had assumed would be a Beowulf-type system, but to a loosely coupled farm of machines in which concurrency was much less of an issue than it would be in a Beowulf. In fact, the architecture Hume described was designed to deal with large amounts of independent transaction processing, essentially the process of billing calls, which requires no interprocess message passing or anything of the sort a parallel scientific application might.

Where does the large data come in? Since the transactions in question consist of large numbers of flat files, a mechanism for getting the files onto nodes and off-loading the results is necessary. In this particular cluster, named Ningaui, this task is handled by the "replication manager," which generated a large amount of interest in the audience. All management functions in the cluster, as well as job allocation, constitute services that are to be bid for and leased out to nodes. Furthermore, failures are handled by simply repeating the failed task until it is completed properly, if that is possible, and if it is not, then looking through detailed logs to discover what went wrong. Logging and testing for the successful completion of each task are what make this model resilient. Every management operation is logged at some appropriate level of detail, generating 120MB/day, and every single file transfer is md5 checksummed. This was characterized by Hume as a "patient" style of management where no one controls the cluster; scheduling behavior is emergent rather than stringently planned, and nodes are free to drop in and out of the cluster. A downside to this model is the increase in latency, offset by the fact that the cluster continues to function unattended outside of 8-to-5 support hours.

In response to questions about the projected scalability of such a scheme, Hume said the system would presumably scale from the present 60 nodes to about 100 without modifications, but that scaling higher would be a matter for further design. The guru session ended on a somewhat unrelated note regarding the problems of getting data off mainframes and onto Linux machines – issues of fixed- vs. variable-length encoding of data blocks resulting in useful information being stripped by FTP, and problems in converting COBOL copybooks to something useful on a C-based architecture. To reiterate Hume's claim, there are homemade solutions for some of these problems, and the reader should feel free to contact Mr. Hume for them.

WRITING PORTABLE APPLICATIONS

Nick Stoughton, MSB Consultants

Summarized by Josh Lothian

Nick Stoughton addressed a concern that developers are facing more and more: writing portable applications. He proposed that there is no such thing as a "portable application," only those that have been ported. Developers are still in search of the Holy Grail of applications programming: a write-once, compile-anywhere program. Languages such as Java are leading up to this, but virtual machines still have to be ported, paths will still be different, etc. Web-based applications are another area that could be promising for portable applications.

Nick went on to talk about the future of portability. The recently released POSIX 2001 standards are expected to help the situation. The 2001 revision expands the original POSIX spec into seven volumes. Even though POSIX 2001 is more specific than the previous release, Nick pointed out that even this standard allows for small differences where vendors may define their own behaviors. This can only hurt developers in the long run. Along these lines, it was pointed out that there may be room for automated tools that have the capability to check application code and determine the level of portability and standards compliance of the source.
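
One naive way to approximate such a tool is to scan C sources for calls that are not on a known-portable list. The sketch below does only that; the tiny allowlist and keyword filter are illustrative assumptions, nothing like a full POSIX 2001 compliance checker.

    import re, sys

    # Illustrative allowlist; a real checker would carry the full standard API.
    PORTABLE = {"open", "read", "write", "close", "printf", "malloc", "free"}
    KEYWORDS = {"if", "while", "for", "switch", "return", "sizeof"}
    CALL = re.compile(r"\b([a-z_][a-z0-9_]*)\s*\(")

    def report(path):
        with open(path) as f:
            for lineno, line in enumerate(f, 1):
                for name in CALL.findall(line):
                    if name not in PORTABLE and name not in KEYWORDS:
                        print(f"{path}:{lineno}: unknown or non-portable call {name}()")

    if __name__ == "__main__":
        for src in sys.argv[1:]:
            report(src)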

NETWORK MANAGEMENT SYSTEM PERFORMANCE TUNING

Jeff Allen, Tellme Networks, Inc.

Summarized by Matt Selsky

Jeff Allen, author of Cricket, discussed network tuning. The basic idea of tuning is: measure, twiddle, and measure again. Cricket can be used for large installations, but it's overkill for smaller installations, for which MRTG is better suited. You don't need Cricket to do measurement. Doing something like a shell script that dumps data to Excel is good, but you need to get some sort of measurement. Cricket was not designed for billing; it was meant to help answer questions.

Useful tools and techniques covered included: looking for the difference between machines in order to identify the causes for the differences; measuring, but also thinking about what you're measuring and why; and remembering to use strace and tcpdump to identify problems. Problems repeat themselves; performance tuning requires an intricate understanding of all the layers involved to solve the problem and simplify the automation. (Having a computer science background helps but is not essential.) If you can reproduce the problem, you then should investigate each piece to find the bottleneck. If you can't reproduce the problem, then measure. You need to understand all the system interactions. Measurement can help determine whether things are actually slow or if the user is imagining it.

Troubleshooting begins with a hunch, but scientific processes are essential. You should be able to determine the event stream, or when each event occurs; having observation logs can help. Some hints provided include checking cron for periodic anomalies, slowing down /bin/rm to avoid I/O overload, and looking for unusual kernel CPU usage.

Also, try to use lightweight monitoring to reduce overhead. You don't need to monitor every resource on every system, but you should monitor those resources on those systems that are essential. Don't check things too often, since you can introduce overhead that way. Lots of information can be gathered from the /proc file system interface to the kernel.
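
A minimal example of this kind of lightweight monitoring, assuming a Linux-style /proc layout; the one-minute polling interval and the two fields sampled are arbitrary choices for the sketch.

    import time

    def sample():
        with open("/proc/loadavg") as f:
            load1 = float(f.read().split()[0])
        meminfo = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, value = line.split(":", 1)
                meminfo[key] = int(value.split()[0])   # values are reported in kB
        return load1, meminfo["MemFree"]

    while True:
        load, free_kb = sample()
        print(f"load1={load:.2f} memfree={free_kb}kB")
        time.sleep(60)    # infrequent checks keep the monitoring overhead low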

BSD BOF

Host: Kirk McKusick, Author and Consultant

Summarized by Bosko Milekic

The BSD Birds of a Feather session at USENIX started off with five quick and informative talks and ended with an exciting discussion of licensing issues.

Christos Zoulas, for the NetBSD Project, went over the primary goals of the NetBSD Project (namely portability, a clean architecture, and security) and then proceeded to briefly discuss some of the challenges that the project has encountered. The successes and areas for improvement of 1.6 were then examined. NetBSD has recently seen various improvements in its cross-build system, packaging system, and third-party tool support. For what concerns the kernel, a zero-copy TCP/UDP implementation has been integrated, and a new pmap code for arm32 has been introduced, as have some other rather major changes to various subsystems. More information is available in Christos's slides, which are now available at http://www.netbsd.org/gallery/events/usenix2002.

Theo de Raadt of the OpenBSD Project presented a rundown of various new things that the OpenBSD and OpenSSH teams have been looking at. Theo's talk included an amusing description (and pictures) of some of the events that transpired during OpenBSD's hack-a-thon, which occurred the week before USENIX '02. Finally, Theo mentioned that following complications with the IPFilter license, the OpenBSD team had done a fairly extensive license audit.

Robert Watson of the FreeBSD Project brought up various examples of FreeBSD being used in the real world. He went on to describe a wide variety of changes that will surface with the upcoming FreeBSD release 5.0. Notably, 5.0 will introduce a re-architecting of the kernel aimed at providing much more scalable support for SMP machines. 5.0 will also include an early implementation of KSE, FreeBSD's version of scheduler activations, large improvements to pccard, and significant framework bits from the TrustedBSD project.

Mike Karels from Wind River Systems presented an overview of how BSD/OS has been evolving following the Wind River Systems acquisition of BSDi's software assets. Ernie Prabhakar from Apple discussed Apple's OS X and its success on the desktop. He explained how OS X aims to bring BSD to the desktop while striving to maintain a strong and stable core, one that is heavily based on BSD technology.

The BoF finished with a discussion on licensing; specifically, some folks questioned the impact of Apple's licensing for code from their core OS (the Darwin Project) on the BSD community as a whole.

CLOSING SESSION

HOW FLIES FLY

Michael H. Dickinson, University of California, Berkeley

Summarized by JD Welch

In this lively and offbeat talk, Dickinson discussed his research into the mechanisms of flight in the fruit fly (Drosophila melanogaster) and related the autonomic behavior to that of a technological system. Insects are the most successful organisms on earth, due in no small part to their early adoption of flight. Flies travel through space on straight trajectories interrupted by saccades, jumpy turning motions somewhat analogous to the fast, jumpy movements of the human eye. Using a variety of monitoring techniques, including high-speed video in a "virtual reality flight arena," Dickinson and his colleagues have observed that flies respond to visual cues to decide when to saccade during flight. For example, more visual texture makes flies saccade earlier (i.e., further away from the arena wall), described by Dickinson as a "collision control algorithm."

Through a combination of sensory inputs, flies can make decisions about their flight path. Flies possess a visual "expansion detector" which, at a certain threshold, causes the animal to turn in a certain direction. However, expansion in front of the fly sometimes causes it to land. How does the fly decide? Using the virtual reality device to replicate various visual scenarios, Dickinson observed that the flies fixate on vertical objects that move back and forth across the field. Expansion on the left side of the animal causes it to turn right, and vice versa, while expansion directly in front of the animal triggers the legs to flare out in preparation for landing.

If the eyes detect these changes, how are the responses implemented? The flies' wings have "power muscles," controlled by mechanical resonance (as opposed to the nervous system), which drive the wings, combined with neurally activated "steering muscles," which change the configuration of the wing joints. Subtle variations in the timing of impulses correspond to changes in wing movement, controlled by sensors in the "wing-pit." A small wing-like structure, the haltere, controls equilibrium by beating constantly.

Voluntary control is accomplished by the halteres, whose signals can interfere with the autonomic control of the stroke cycle. The halteres have "steering muscles" as well, and information derived from the visual system can turn off the haltere or trick it into a "virtual" problem requiring a response.

Dickinson has also studied the aerodynamics of insect flight using a device called the "Robo Fly," an oversized mechanical insect wing suspended in a large tank of mineral oil. Interestingly, Dickinson observed that the average lift generated by the flies' wings is under its body weight; the flies use three mechanisms to overcome this, including rotating the wings (rotational lift).

Insects are extraordinarily robust creatures, and because Dickinson analyzed the problem in a systems-oriented way, these observations and analyses are immediately applicable to technology; the response system can be used as an efficient search algorithm for control systems in autonomous vehicles, for example.

THE AFS WORKSHOP

Summarized by Garry Zacheiss, MIT

Love Hornquist-Astrand of the Arla development team presented the Arla status report. Arla 0.35.8 has been released. Scheduled for release soon are improved support for Tru64 UNIX, MacOS X, and FreeBSD, improved volume handling, and implementation of more of the vos/pts subcommands. It was stressed that MacOS X is considered an important platform and that a GUI configuration manager for the cache manager, a GUI ACL manager for the finder, and a graphical login that obtains Kerberos tickets and AFS tokens at login time were all under development.

Future goals planned for the underlying AFS protocols include GSSAPI/SPNEGO support for Rx, performance improvements to Rx, an enhanced disconnected mode, and IPv6 support for Rx; an experimental patch is already available for the latter. Future Arla-specific goals include improved performance, partial file reading, and increased stability for several platforms. Work is also in progress on the RXGSS protocol extensions (integrating GSSAPI into Rx). A partial implementation exists, and work continues as developers find time.

Derrick Brashear of the OpenAFS development team presented the OpenAFS status report. OpenAFS 1.2.5 was released immediately prior to the conference; it fixed a remotely exploitable denial-of-service attack in several OpenAFS platforms, most notably IRIX and AIX. Future work planned for OpenAFS includes better support for MacOS X, including working around Finder interaction issues. Better support for the BSDs is also planned: FreeBSD has a partially implemented client; NetBSD and OpenBSD have only server programs available right now. AIX 5, Tru64 5.1A, and MacOS X 10.2 (aka Jaguar) are all planned for the future. Other planned enhancements include nested groups in the ptserver (code donated by the University of Michigan awaits integration), disconnected AFS, and further work on a native Windows client. Derrick stated that the guiding principles of OpenAFS were to maintain compatibility with IBM AFS, support new platforms, ease the administrative burden, and add new functionality.

AFS performance was discussed. The openafs.org cell consists of two file servers, one in Stockholm, Sweden, and one in Pittsburgh, PA. AFS works reasonably well over trans-Atlantic links, although OpenAFS clients don't determine which file server to talk to very effectively. Arla clients use RTTs to the server to determine the optimal file server to fetch replicated data from. Modifications to OpenAFS to support this behavior in the future are desired.

Jimmy Engelbrecht and Harald Barth of KTH discussed their AFSCrawler script, written to determine how many AFS cells and clients were in the world, what implementations/versions they were (Arla vs. IBM AFS vs. OpenAFS), and how much data was in AFS. The script unfortunately triggered a bug in IBM AFS 3.6-derived code, causing some clients to panic while handling a specific RPC. This has since been fixed in OpenAFS 1.2.5 and the most recent IBM AFS patch level, and all AFS users are strongly encouraged to upgrade. No release of Arla is vulnerable to this particular denial-of-service attack. There was an extended discussion of the usefulness of this exploration. Many sites believed this was useful information and such scanning should continue in the future, but only on an opt-in basis.

Many sites face the problem of managing Kerberos/AFS credentials for batch-scheduled jobs. Specifically, most batch processing software needs to be modified to forward tickets as part of the batch submission process, renew tickets and tokens while the job is in the queue and for the lifetime of the job, and properly destroy credentials when the job completes. Ken Hornstein of NRL was able to pay a commercial vendor to support Kerberos 4/5 credential management in their product, although they did not implement AFS token management. MIT has implemented some of the desired functionality in OpenPBS and might be able to make it available to other interested sites.

Tools to simplify AFS administration were discussed, including:

AFS Balancer: A tool written by CMU to automate the process of balancing disk usage across all servers in a cell. Available from ftp://ftp.andrew.cmu.edu/pub/AFS-Tools/balance-1.1b.tar.gz.

Themis: Themis is KTH's enhanced version of the AFS tool "package" for updating files on local disk from a central AFS image. KTH's enhancements include allowing the deletion of files, simplifying the process of adding a file, and allowing the merging of multiple rule sets for determining which files are updated. Themis is available from the Arla CVS repository.

Stanford was presented as an example of a large AFS site. Stanford's AFS usage consists of approximately 14TB of data in AFS, in the form of approximately 100,000 volumes; 33TB of storage is available in their primary cell, ir.stanford.edu. Their file servers consist entirely of Solaris machines running a combination of Transarc 3.6 patch-level 3 and OpenAFS 1.2.x, while their database servers run OpenAFS 1.2.2. Their cell consists of 25 file servers using a combination of EMC and Sun StorEdge hardware. Stanford continues to use the kaserver for their authentication infrastructure, with future plans to migrate entirely to an MIT Kerberos 5 KDC.

Stanford has approximately 3,400 clients on their campus, not including SLAC (Stanford Linear Accelerator); approximately 2,100 AFS clients from outside Stanford contact their cell every month. Their supported clients are almost entirely IBM AFS 3.6 clients, although they plan to release OpenAFS clients soon. Stanford currently supports only UNIX clients. There is some on-campus presence of Windows clients, but Stanford has never publicly released or supported it. They do intend to release and support the MacOS X client in the near future.

All Stanford students, faculty, and staff are assigned AFS home directories, with a default quota of 50MB, for a total of approximately 550GB of user home directories. Other significant uses of AFS storage are data volumes for workstation software (400GB) and volumes for course Web sites and assignments (100GB).

AFS usage at Intel was also presented. Intel has been an AFS site since 1994. They had bad experiences with the IBM 3.5 Linux client; their experience with OpenAFS on Linux 2.4.x kernels has been much better. They use and are satisfied with the OpenAFS IA64 Linux port. Intel has hundreds of OpenAFS 1.2.3 and 1.2.4 clients in many production cells accessing data stored on IBM AFS file servers. They have not encountered any interoperability issues. Intel has some concerns about OpenAFS: they would like to purchase commercial support for OpenAFS and to see OpenAFS support for HP-UX on both PA-RISC and Itanium hardware. HP-UX support is currently unavailable due to a specific HP-UX header file being unavailable from HP; this may be available soon. Intel has not yet committed to migrating their file servers to OpenAFS and are unsure if they will do so without commercial support.

Backups are a traditional topic of discussion at AFS workshops, and this time was no exception. Many users complain that the traditional AFS backup tools ("backup" and "butc") are complex and difficult to automate, requiring many home-grown scripts and much user intervention for error recovery. An additional complaint was that the traditional AFS tools do not support file- or directory-level backups and restores; data must be backed up and restored at the volume level.

Mitch Collinsworth of Cornell presented work to make AMANDA, the free backup software from the University of Maryland, suitable for AFS backups. Using AMANDA for AFS backups allows one to share AFS backup tapes with non-AFS backups, easily run multiple backups in parallel, automate error recovery, and provide a robust degraded mode that prevents tape errors from stopping backups altogether. Their implementation allows for full volume restores as well as individual directory and file restores. They have finished coding this work and are in the process of testing and documenting it.

Peter Honeyman of CITI at the University of Michigan spoke about work he has proposed to replace Rx with RPCSEC GSS in OpenAFS; this would allow AFS to use a TCP-based transport mechanism rather than the UDP-based Rx, and possibly gain better congestion control, dynamic adaptation, and fragmentation avoidance as a result. RPCSEC GSS uses the GSSAPI to authenticate SUN ONC RPC. RPCSEC GSS is transport agnostic, provides strong security, is a developing Internet standard, and has multiple open source implementations. Backward compatibility with existing AFS servers and clients is an important goal of this project.


hesitant to employ security measures. Science is doing well, but one cannot see the benefits in the real world. An increasing number of users means more problems affecting more people. Old problems, such as buffer overflows, haven't gone away, and new problems show up. Furthermore, the amount of expertise needed to launch attacks is decreasing.

Schneier argues that complexity is the enemy of security and that while security is getting better, the growth in complexity is outpacing it. As security is fundamentally a people problem, one shouldn't focus on technologies but rather should look at businesses, business motivations, and business costs. Traditionally, one can distinguish between two security models: threat avoidance, where security is absolute, and risk management, where security is relative and one has to mitigate the risk with technologies and procedures. In the latter model, one has to find a point with reasonable security at an acceptable cost.

After a brief discussion on how security is handled in businesses today, in which he concluded that businesses "talk big about it but do as little as possible," Schneier identified four necessary steps to fix the problems.

First, enforce liabilities. Today no real consequences have to be feared from security incidents. Holding people accountable will increase the transparency of security and give an incentive to make processes public. As possible options to achieve this he listed industry-defined standards, federal regulation, and lawsuits. The problems with enforcement, however, lie in difficulties associated with the international nature of the problem and the fact that complexity makes assigning liabilities difficult. Furthermore, fear of liability could have a chilling effect on free software development and could stifle new companies.

As a second step, Schneier identified the need to allow partners to transfer liability. Insurance would spread liability risk among a group and would be a CEO's primary risk analysis tool. There is a need for standardized risk models and protection profiles, and for more security as opposed to empty press releases. Schneier predicts that insurance will create a marketplace for security where customers and vendors will have the ability to accept liabilities from each other. Computer-security insurance should soon be as common as household or fire insurance, and from that development insurance will become the driving factor of the security business.

The next step is to provide mechanisms to reduce risks, both before and after software is released; techniques and processes to improve software quality, as well as an evolution of security management, are therefore needed. Currently, security products try to rebuild the walls – such as physical badges and entry doorways – that were lost when getting connected. Schneier believes this "fortress metaphor" is bad; one should think of the problem more in terms of a city. Since most businesses cannot afford proper security, outsourcing is the only way to make security scalable. With outsourcing, the concept of best practices becomes important, and insurance companies can be tied to them; outsourcing will level the playing field.

As a final step, Schneier predicts that rational prosecution and education will lead to deterrence. He claims that people feel safe because they live in a lawful society, whereas the Internet is classified as lawless, very much like the American West in the 1800s or a society ruled by warlords. This is because it is difficult to prove who an attacker is, and prosecution is hampered by complicated evidence gathering and irrational prosecution. Schneier believes, however, that education will play a major role in turning the Internet into a lawful society. Specifically, he pointed out that we need laws that can be explained.

Risks won't go away; the best we can do is manage them. A company able to manage risk better will be more profitable, and we need to give CEOs the necessary risk-management tools.

There were numerous questions. Listed below are the more complicated ones, in a paraphrased Q&A format.

Q: Will liability be effective? Will insurance companies be willing to accept the risks? A: The government might have to step in; it needs to be seen how it plays out.

Q: Is there an analogy to the real world in the fact that in a lawless society only the rich can afford security? Security solutions differentiate between classes. A: Schneier disagreed with the statement, giving an analogy to front-door locks, but conceded that there might be special cases.

Q: Doesn't homogeneity hurt security? A: Homogeneity is oversold. Diverse types can be more survivable, but given the limited number of options, the difference will be negligible.

Q: Regulations in airbag protection have led to deaths in some cases. How can we keep the pendulum from swinging the other way? A: Lobbying will not be prevented. An imperfect solution is probable; there might be reversals of requirements, such as the airbag one.

Q: What about personal liability? A: This will be analogous to auto insurance; liability comes with computer/Net access.

Q: If the rule of law is to become reality, there must be a law enforcement function that applies to a physical space. You cannot do that with any existing government agency for the whole world. Should an organization like the UN assume that role? A: Schneier was not convinced global enforcement is possible.

Q: What advice should we take away from this talk? A: Liability is coming. Since the network is important to our infrastructure, eventually the problems will be solved in a legal environment. You need to start thinking about how to solve the problems and how the solutions will affect us.

The entire speech (including slides) is available at http://www.counterpane.com/presentation4.pdf.

LIFE IN AN OPEN SOURCE STARTUP

Daryll Strauss, Consultant

Summarized by Teri Lampoudi

This talk was packed with morsels of insight and tidbits of information about what life in an open source startup is like (though "was like" might be more appropriate), what the issues are in starting, maintaining, and commercializing an open source project, and the way hardware vendors treat open source developers. Strauss began by tracing the timeline of his involvement with developing 3-D graphics support for Linux: from the 3dfx Voodoo1 driver for Linux in 1997, to the establishment of Precision Insight in 1998, to its buyout by VA Linux, to the dismantling of his group by VA, to what the future may hold.

Obviously, there are benefits from doing open source development, both for the developer and for the end user. An important point was that open source develops entire technologies, not just products. A misapprehension is the inherent difficulty of managing a project that accepts code from a large number of developers, some volunteer and some paid. The notion that code can just be thrown over the wall into the world is a Big Lie.

How does one start and keep afloat an open source development company? In the case of Precision Insight, the subject matter – developing graphics and OpenGL for Linux – required a considerable amount of expertise: intimate knowledge of the hardware, the libraries, X, and the Linux kernel, as well as the end applications. Expertise is marketable. What helps even further is having a visible virtual team of experts, people who have an established track record of contributions to open source in the particular area of expertise. Support from the hardware and software vendors came next. While everyone was willing to contract PI to write the portions of code that were specific to their hardware, no one wanted to pay the bill for developing the underlying foundation that was necessary for the drivers to be useful. In the PI model, the infrastructure cost was shared by several customers at once, and the technology kept moving. Once a driver is written for a vendor, the vendor is perfectly capable of figuring out how to write the next driver necessary, thereby obviating the need to contract the job. This, and questions of protecting intellectual property – hardware design in this case – are deeply problematic with the open source mode of development. This is not to say that there is no way to get over them, but that they are likely to arise more often than not.

Management issues seem equally important. In the PI setting, contracts were flexible – both a good and a bad thing – and developers overworked. Further, development "in a fishbowl," that is, under public scrutiny, is not easy. Strauss stressed the value of good communication and the use of various out-of-sync communication methods like IRC and mailing lists.

Finally, Strauss discussed what portions of software can be free and what can be proprietary. His suggestion was that horizontal markets want to be free, whereas vertically one can develop proprietary solutions. The talk closed with a small stroll through the problems that project politics brings up, from things like merging branches to mailing list and source tree readability.

GENERAL TRACK SESSIONS

FILE SYSTEMS

Summarized by Haijin Yan

STRUCTURE AND PERFORMANCE OF THE DIRECT ACCESS FILE SYSTEM

Kostas Magoutis, Salimah Addetia, Alexandra Fedorova, and Margo I. Seltzer, Harvard University; Jeffrey Chase, Andrew Gallatin, Richard Kisley, and Rajiv Wickremesinghe, Duke University; and Eran Gabber, Lucent Technologies

This paper won the Best Paper award.

The Direct Access File System (DAFS) is a new standard for network-attached storage over direct-access transport networks. DAFS takes advantage of user-level network interface standards that enable client-side functionality in which remote data access resides in a library rather than in the kernel. This reduces the overhead of memory copies for data movement as well as protocol overhead. Remote Direct Memory Access (RDMA) is a direct-access transport network that allows the network adapter to reduce copy overhead by accessing application buffers directly.

This paper explores the fundamental structural and performance characteristics of network file access using a user-level file system structure on a direct-access transport network with RDMA. It describes DAFS-based client and server reference implementations for FreeBSD and reports experimental results comparing DAFS to a zero-copy NFS implementation. It illustrates the benefits and trade-offs of these techniques to provide a basis for informed choices about deployment of DAFS-based systems and similar extensions to other network file protocols, such as NFS. Experiments show that DAFS gives applications direct control over an I/O system and increases the client CPU's usage while the client is doing I/O. Future work includes how to address longstanding problems related to the integration of the application and file system for high-performance applications.

CONQUEST: BETTER PERFORMANCE THROUGH A DISK/PERSISTENT-RAM HYBRID FILE SYSTEM

An-I A. Wang, Peter Reiher, and Gerald J. Popek, UCLA; Geoffrey M. Kuenning, Harvey Mudd College

Motivated by the declining cost of persistent RAM, the authors propose the Conquest file system, which stores all small files, metadata, executables, and shared libraries in persistent RAM; disks hold only the data content of the remaining large files. Compared to alternatives such as caching and RAM file systems, Conquest has the advantages of efficiency, consistency, and reliability at a reduced cost. Using popular benchmarks, experiments show that Conquest incurs little overhead while achieving faster performance. Future work includes designing mechanisms for adjusting the file-size threshold dynamically and finding a better disk layout for large data blocks.
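
A toy illustration of the placement rule described above: files at or below a size threshold (plus all metadata) go to persistent RAM, while only the data of larger files goes to disk. The 1MB cutoff and the two dictionaries standing in for the RAM and disk stores are assumptions for the sketch, not Conquest's actual structures.

    SMALL_FILE_THRESHOLD = 1 << 20        # assumed 1MB cutoff for the sketch

    ram_store, disk_store = {}, {}        # stand-ins for persistent RAM and disk

    def write_file(name, data, metadata):
        ram_store[name + ".meta"] = metadata      # metadata always lives in RAM
        if len(data) <= SMALL_FILE_THRESHOLD:
            ram_store[name] = data                # small files: RAM only
        else:
            disk_store[name] = data               # large files: data content on disk

    def read_file(name):
        return ram_store.get(name, disk_store.get(name))

    write_file("note.txt", b"x" * 100, {"mode": 0o644})
    write_file("trace.dat", b"x" * (4 << 20), {"mode": 0o644})
    print("note.txt" in ram_store, "trace.dat" in disk_store)   # True True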

EXPLOITING GRAY-BOX KNOWLEDGE OF BUFFER-CACHE MANAGEMENT

Nathan C. Burnett, John Bent, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin, Madison

Knowing what algorithm is used to manage the buffer cache is very important for improving application performance. However, there is currently no interface for finding this algorithm. This paper introduces Dust, a simple fingerprinting tool that is able to identify the buffer-cache replacement policy. Dust automatically identifies the cache size and replacement policy based on the configuring attributes of access orders: recency, frequency, and long-term history. Through simulation, Dust was able to distinguish between a variety of replacement algorithm policies found in the literature: FIFO, LRU, LFU, Clock, Segmented FIFO, 2Q, and LRU-K. Further experiments fingerprinting real operating systems such as NetBSD, Linux, and Solaris show that Dust is able to identify the hidden cache replacement algorithm.
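
The fingerprinting idea can be seen in miniature by driving a cache of known size with an access pattern on which FIFO and LRU disagree and then probing which block survived. The simulator below is a simplified illustration of that principle, not Dust itself, and the cache size and probe pattern are arbitrary.

    from collections import OrderedDict

    def simulate(policy, accesses, size=3):
        """Return the set of resident blocks after replaying the accesses."""
        cache = OrderedDict()
        for block in accesses:
            if block in cache:
                if policy == "LRU":
                    cache.move_to_end(block)      # refresh recency; FIFO ignores hits
            else:
                if len(cache) >= size:
                    cache.popitem(last=False)     # evict head of the order
                cache[block] = True
        return set(cache)

    probe = ["A", "B", "C", "A", "D"]             # re-touch A, then force one eviction
    for policy in ("FIFO", "LRU"):
        resident = simulate(policy, probe)
        print(policy, "kept A" if "A" in resident else "evicted A")
    # FIFO evicts A (oldest insertion); LRU keeps A because the re-access refreshed it.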

By knowing the underlying cache replacement policy, a cache-aware Web server can reschedule requests on a cached-request-first policy to obtain a performance improvement. Experiments show that cache-aware scheduling improves average response time and system throughput.

OPERATING SYSTEMS (AND DANCING BEARS)

Summarized by Matt Butner

THE JX OPERATING SYSTEM

Michael Golm, Meik Felser, Christian Wawersich, and Jürgen Kleinöder, University of Erlangen-Nürnberg

The talk opened by emphasizing the need for and the practicality of a functional Java OS. Such an implementation, in the JX OS, attempts to mimic the recent trend toward using highly abstracted languages in application development in order to create an OS as functional and powerful as one written in a lower-level language.

JX is a micro-kernel solution that uses separate JVMs for each entity of the kernel and, in some cases, for each application. The separated domains do not share objects and have no thread migration, and each domain implements its own code. The interesting part of the presentation was the discussion of system-level programming with Java; key areas discussed were memory management and interrupt handlers. The authors concluded by noting that their type-safe and modular system resulted in a robust system with great configuration flexibility and acceptable performance.

DESIGN EVOLUTION OF THE EROS SINGLE-LEVEL STORE

Jonathan S. Shapiro, Johns Hopkins University; Jonathan Adams, University of Pennsylvania

This presentation outlined current characteristics of file systems and some of their least desirable characteristics. It was based on the revival of the EROS single-level-store approach for current attempts to capitalize on system consistency and efficiency. Their solution did not relate directly to EROS but, rather, to the design and use of the constructs and notions on which EROS is based. By extending the mapping of the memory system to include the disk, systems are further able to ensure global persistence without regard for a disk structure. In explaining the need for absolute system consistency and an environment which, by definition, is not allowed to crash, the goal of having such an exhaustive design becomes clear. The cool part of this work is the availability of its design and architecture to the public.

THINK: A SOFTWARE FRAMEWORK FOR COMPONENT-BASED OPERATING SYSTEM KERNELS

Jean-Philippe Fassino, France Télécom R&D; Jean-Bernard Stefani, INRIA; Julia Lawall, DIKU; Gilles Muller, INRIA

This presentation discussed the need for component-based operating systems and respective structures to ensure flexibility for arbitrarily sized systems. Think provides a binding model that maps uniform components for OS developers and architects to follow, ensuring consistent implementations for an arbitrary system size. However, the goal is not to force developers into a predefined kernel but to promote the use of certain components in varied ways.

Think's concentration is primarily on embedded systems, where short development time is necessary but is constrained by rigorous needs and limited resources. The need to build flexible systems in such an environment can be costly and implementation-specific, but Think creates an environment supported by the ability to dynamically load type-safe components. This allows for more flexible systems that retain functionality because of Think's ability to bind fine-grained components. Benchmarks revealed that dedicated micro-kernels can show performance improvements in comparison to monolithic kernels. Notable future work includes building components for low-end appliances and the development of a real-time OS component library.

BUILDING SERVICES

Summarized by Pradipta De

NINJA: A FRAMEWORK FOR NETWORK SERVICES

J. Robert von Behren, Eric A. Brewer, Nikita Borisov, Michael Chen, Matt Welsh, Josh MacDonald, Jeremy Lau, and David Culler, University of California, Berkeley

Robert von Behren presented Ninja, an ongoing project that aims to provide a framework for building robust and scalable Internet-based network services like Web-hosting, instant-messaging, email, and file-sharing applications. Robert drew attention to the difficulties of writing cluster applications. One has to take care of data consistency and issues of concurrency, and for robust applications there are problems related to fault tolerance. Ninja works as a wrapper to relieve the user of these problems. The goal of the project is building network services that are scalable and highly available, maintaining persistent data, providing graceful degradation, and supporting online evolution.

The use of clusters distinguishes this setup from generic distributed systems in terms of reliability and security, as well as providing a partition-free network. The next important feature of this project is the new programming model, which is more restrictive than a general programming model but still expressive enough to write most applications. This model, described as "single program, multiple connection," uses intertask parallelism instead of multithreaded concurrency and is characterized by all nodes running the same program, with connections delegated to the nodes by a centralized connection manager (CM). The CM takes care of hiding the details of mapping an external connection to an internal node. Ninja also uses a design called "staged event-driven architecture," where each service is broken down into a set of stages connected by event queues. This architecture is suitable for a modular design and helps in graceful degradation by adaptive load shedding from the event queues.

The presentation concluded with examples of the implementation and evaluation of a Web server and an email system called NinjaMail, to show the ease of authoring and the efficacy of using the Ninja framework for service development.

USING COHORT-SCHEDULING TO ENHANCE SERVER PERFORMANCE

James R. Larus and Michael Parkes, Microsoft Research

James Larus presented a new scheduling policy for increasing the throughput of server applications. Cohort scheduling batches the execution of similar operations arising in different server requests. The usual programming paradigm in handling server requests is to spawn multiple concurrent threads and switch from one thread to another whenever a thread gets blocked for I/O. Since the threads in a server mostly execute unrelated pieces of code, the locality of reference is reduced, and hence the effectiveness of different caching mechanisms. One way to solve this problem is to throw in more hardware. But Larus presented a complementary view, where the program behavior is investigated and used to improve performance.

The problem in this scheme is to identify pieces of code that can be batched together for processing. One simple way is to look at the next program counter values and use them to club threads together. However, this talk presented a new programming abstraction, "staged computation," which replaces the thread model with "stages." A stage is an abstraction for operations with similar behavior and locality. The StagedServer library can be used for programming in this model. It is a collection of C++ classes that implement staged computation and cohort scheduling on either a uniprocessor or multiprocessor, and it can be used to define stages.

The presentation showed the experimental evaluation of cohort scheduling over the thread-based model by implementing two servers: a Web server, which is mainly I/O bound, and a publish-subscribe server, which is mainly compute bound. The SURGE benchmark was used for the first experiment and the Fabret workload for the second. The results showed that the cohort-scheduling-based implementation gave better throughput than the thread-based implementation at high loads.

NETWORK PERFORMANCE

Summarized by Xiaobo Fan

ETE: PASSIVE END-TO-END INTERNET SERVICE PERFORMANCE MONITORING

Yun Fu and Amin Vahdat, Duke University; Ludmila Cherkasova and Wenting Tang, HP Labs

This paper won the Best Student Paper award.

Ludmila Cherkasova began by listing several questions most Web service providers want answered in order to improve service quality. She reviewed the difficulties of making accurate and efficient end-to-end Web service measurements and the shortcomings of currently available approaches. They propose a passive, trace-based architecture called EtE to monitor Web server performance on behalf of end users.

The first step is to collect network packets passively. The second module reconstructs all TCP connections and extracts HTTP transactions. To reconstruct Web page accesses, they first build a knowledge base indexed by client IP and URL and then group objects to the related Web pages they are embedded in. Statistical analysis is used to handle non-matched objects. EtE Monitor can generate three groups of metrics to measure Web service performance: response time, Web caching, and page abortion.

To demonstrate the benefits of EtE Monitor, Cherkasova talked about two case studies and, based on the calculated metrics, gave some insightful explanations about what's happening behind the variations of Web performance. Through validation, they claim their approach provides a very close approximation to the real scenario.

THE PERFORMANCE OF REMOTE DISPLAY MECHANISMS FOR THIN-CLIENT COMPUTING

S. Jae Yang, Jason Nieh, Matt Selsky, and Nikhil Tiwari, Columbia University

Noting the trend toward thin-client computing, the authors compared different techniques and design choices in measuring the performance of six popular thin-client platforms – Citrix MetaFrame, Microsoft Terminal Services, Sun Ray, Tarantella, VNC, and X. After pointing out several challenges in benchmarking thin clients, Yang proposed slow-motion benchmarking to achieve non-invasive packet monitoring and consistent visual quality. Basically, they insert delays between separate visual events in the benchmark applications on the server side so that the client's display update can catch up with the server's processing speed. The experiments are conducted on an emulated network over a range of network bandwidths.

Their results show that thin clients can provide good performance for Web applications in LAN environments, but only some platforms performed well on the video benchmark. Pixel-based encoding may achieve better performance and bandwidth efficiency than high-level graphics. Display caching and compression should be used with care.

A MECHANISM FOR TCP-FRIENDLY TRANSPORT-LEVEL PROTOCOL COORDINATION

David E. Ott and Ketan Mayer-Patel, University of North Carolina, Chapel Hill

A revised transport-level protocol optimized for cluster-to-cluster (C-to-C) applications is what David Ott tried to explain in his talk. A C-to-C application class is identified as one set of processes communicating to another set of processes across a common Internet path. The fundamental problem with C-to-C applications is how to coordinate all C-to-C communication flows so that they share a consistent view of the common C-to-C network, adapt to changing network conditions, and cooperate to meet specific requirements. Aggregation points (APs) are placed at the first and last hop of the common data path to probe network conditions (latency, bandwidth, loss rate, etc.). To carry and transfer this information, a new protocol – the Coordination Protocol (CP) – is inserted between the network layer (IP) and the transport layer (TCP, UDP, etc.). Ott illustrated how the CP header is updated and used when packets originate from a source and traverse through the local and remote APs to arrive at their destination, and how an AP maintains a per-cluster state table and detects network conditions. Through simulation results, this coordination mechanism appears effective in sharing common network resources among C-to-C communication flows.

STORAGE SYSTEMS

Summarized by Praveen Yalagandula

MY CACHE OR YOURS? MAKING STORAGE MORE EXCLUSIVE

Theodore M. Wong, Carnegie Mellon University; John Wilkes, HP Labs

Theodore Wong explained the inefficiency of current "inclusive" caching schemes in storage area networks – when a client accesses a block, the block is read from the disk and is cached at both the disk array cache and the client's cache. He then presented the concept of "exclusive" caching, where a block accessed by a client is only cached in that client's cache. On eviction from the client's cache, they have come up with a DEMOTE operation to move the data block to the tail of the array cache.

The new exclusive caching schemes are evaluated on both single-client and multiple-client systems. The single-client results are presented for two different types of synthetic workloads: Random (transaction-processing-type workloads) and Zipf (typical Web workloads). The exclusive policy was quite effective in achieving higher hit rates and lower read latencies. They also showed a 2.2-times hit-rate improvement over inclusive techniques in the case of a real-life workload, httpd, in a single-client setting.

The DEMOTE scheme also performed well in the case of multiple-client systems when the data accessed by clients is disjoint. In the case of shared-data workloads, this scheme performed worse than the inclusive schemes. Theodore then presented an adaptive exclusive caching scheme – a block accessed by a client is placed at the tail in the client's cache and also in the array cache at an appropriate place determined by the popularity of the block. The popularity of the block is measured by maintaining a ghost cache to accumulate the number of times each block is accessed. This new adaptive scheme achieved a maximum of 52% speedup in mean latency in the experiments with real-life workloads.
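
The DEMOTE idea can be sketched with two LRU lists: on a client eviction the block is demoted to the most-recently-used end of the array cache instead of being dropped, keeping the two caches close to exclusive. The cache sizes and list-based representation below are illustrative assumptions, not the paper's implementation.

    from collections import OrderedDict

    class LRUCache:
        def __init__(self, size):
            self.size, self.blocks = size, OrderedDict()

        def touch(self, block):                     # insert or refresh at the MRU end
            self.blocks.pop(block, None)
            self.blocks[block] = True
            if len(self.blocks) > self.size:
                return self.blocks.popitem(last=False)[0]   # evicted LRU block
            return None

    client, array = LRUCache(4), LRUCache(8)

    def client_read(block):
        if block in client.blocks:
            client.touch(block)
            return "client hit"
        array.blocks.pop(block, None)       # exclusive: drop from array on promotion
        victim = client.touch(block)
        if victim is not None:
            array.touch(victim)             # DEMOTE: evicted block goes to the array tail
        return "array/disk read"

    for b in [1, 2, 3, 4, 5, 1]:
        print(b, client_read(b))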

For more information visit http://www.cs.cmu.edu/~tmwong/research and http://www.hpl.hp.com/research/itc/csl/ssp.

BRIDGING THE INFORMATION GAP IN STORAGE PROTOCOL STACKS

Timothy E. Denehy, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin, Madison

Currently, there is a huge information gap between storage systems and file systems. The interface exposed by storage systems to file systems, based on blocks and providing only read/write interfaces, is very narrow. This leads to poor performance as a whole, because of duplicated functionality in both systems and reduced functionality resulting from storage systems' lack of file information.

The speaker presented two enhancements: ExRAID, an exposed RAID, and ILFS, the Informed Log-Structured File System. ExRAID is an enhancement to the block-based RAID storage system that exposes the following three types of information to the file system: (1) regions – contiguous portions of the address space comprising one or multiple disks; (2) performance information about queue lengths and throughput of the regions, revealing disk heterogeneity to the file systems; and (3) failure information – dynamically updated information conveying the number of tolerable failures in each region.

ILFS allows online incremental expansion of the storage space, performs dynamic parallelism using ExRAID's performance information to perform well on heterogeneous storage systems, and provides a range of different mechanisms, with different granularities, for redundancy of files using ExRAID's region failure characteristics. These new features are added to LFS with only a 19% increase in the code size.

More information: http://www.cs.wisc.edu/wind.

MAXIMIZING THROUGHPUT IN REPLICATED DISK STRIPING OF VARIABLE BIT-RATE STREAMS

Stergios V. Anastasiadis, Duke University; Kenneth C. Sevcik and Michael Stumm, University of Toronto

There is an increasing demand for the continuous, real-time streaming of media files. Disk striping is a common technique used for supporting many connections on media servers. But even with 1.2 million hours of mean time between failures, there will be more than one disk failure per week with 7,000 disks. Fault tolerance can be achieved by using data replication and reserving extra bandwidth during normal operation. This work focused on supporting the most common variable bit-rate stream formats (e.g., MPEG).


The authors used the prototype media server EXEDRA to implement and evaluate different replication and bandwidth reservation schemes. This system supports a variable bit-rate scheme, does stride-based disk allocation, and is capable of supporting variable-grain disk striping. Two replication policies are presented: deterministic – data blocks are replicated in a round-robin fashion on the secondary replicas; and random – data blocks are replicated on randomly chosen secondary replicas. Two bandwidth reservation techniques are also presented: mirroring reservation – disk bandwidth is reserved for both the primary and the replicas of the media file during playback; and minimum reservation – a more efficient scheme in which bandwidth is reserved only for the sum of the primary data access time and the maximum of the backup data access times required in each round.

Experimental results showed that deterministic replica placement is better than random placement for small disks, minimum disk bandwidth reservation is twice as good as mirroring in the throughput achieved, and fault tolerance can be achieved with a minimal impact on throughput.
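
A small sketch of the two placement policies compared above: for each primary block, the deterministic policy picks the secondary server round-robin, while the random policy picks any other server at random. The four-server layout is an arbitrary assumption for the example.

    import random

    SERVERS = ["s0", "s1", "s2", "s3"]

    def deterministic_placement(num_blocks):
        # Secondary copy of block i goes to the next server, round-robin.
        n = len(SERVERS)
        return [(SERVERS[i % n], SERVERS[(i + 1) % n]) for i in range(num_blocks)]

    def random_placement(num_blocks):
        placement = []
        for i in range(num_blocks):
            primary = SERVERS[i % len(SERVERS)]
            secondary = random.choice([s for s in SERVERS if s != primary])
            placement.append((primary, secondary))
        return placement

    print(deterministic_placement(4))
    print(random_placement(4))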

TOOLS

Summarized by Li Xiao

SIMPLE AND GENERAL STATISTICAL PROFILING WITH PCT

Charles Blake and Steven Bauer, MIT

This talk introduced the Profile Collection Toolkit (PCT) – a sampling-based CPU profiling facility. A novel aspect of PCT is that it allows sampling of semantically rich data such as function call stacks, function parameters, local or global variables, CPU registers, or other execution contexts. The design objectives of PCT were driven by user needs and the inadequacies or inaccessibility of prior systems.

The rich data collection capability is achieved via a debugger-controller program, dbctl. The talk then introduced the profiling collector and data collection activation. PCT file format options provide flexibility and various ways to store exact states. PCT also has analysis options; two examples were given – "Simple" and "Mixed User and Linux Code."

The talk then went into how general sampling helps in the following areas: a debugger-controller state machine, example script fragments, a simple example program, and general value reports in terms of call-path histograms, numeric value histograms, and data generality. Related work such as gprof and Expect was then summarized. The contribution of this work is to provide a portable, general value sampling tool. The limitations are that PCT does not support much of strip, and count inaccuracies happen because of its statistical nature. Based on this limitation, -g is preferred over strip.

Possible future directions include supporting more debuggers, such as dbx and xdb, script generators for other kinds of traces, more canned reports for general values, and a libgdb-based library sampler. PCT is available at http://pdos.lcs.mit.edu/pct. PCT runs on almost any UNIX-like system.

ENGINEERING A DIFFERENCING AND COMPRESSION DATA FORMAT

David G. Korn and Kiem-Phong Vo, AT&T Laboratories – Research

This talk began with an equation: "Differencing + Compression = Delta Compression." Compression removes information redundancy in a data set. Examples are gzip, bzip, compress, and pack. Differencing encodes differences between two data sets. Examples are diff -e, fdelta, bdiff, and xdelta. Delta compression compresses a data set given another, combining differencing and compression, and reduces to pure compression when there's no commonality.
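
The general idea can be sketched with zlib's preset-dictionary support (this only illustrates delta compression; it is not the Vcdiff encoding): compressing a target against a source shrinks the output when the two share data, and degenerates to ordinary compression when they share nothing.

import os
import zlib

def delta_compress(source: bytes, target: bytes) -> bytes:
    # the source acts as a preset dictionary, so bytes shared with it are cheap
    co = zlib.compressobj(level=9, zdict=source)
    return co.compress(target) + co.flush()

def delta_decompress(source: bytes, delta: bytes) -> bytes:
    do = zlib.decompressobj(zdict=source)
    return do.decompress(delta) + do.flush()

old = os.urandom(20000)                      # incompressible "version 1"
new = old[:10000] + b"PATCH" + old[10000:]   # small edit -> "version 2"
delta = delta_compress(old, new)
assert delta_decompress(old, delta) == new
print(len(new), len(zlib.compress(new, 9)), len(delta))  # raw vs. plain vs. delta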

After presenting an overall scheme for a delta compressor and showing delta compression performance, this talk discussed the encoding format of a newly designed Vcdiff for delta compression. A Vcdiff instruction code table consists of 256 entries, each coding up to a pair of instructions, and recodes 1-byte indices and any additional data.

The talk then showed Vcdiff's performance with Web data collected from CNN, compared with the W3C standard Gdiff, Gdiff+gzip, and diff+gzip, where Gdiff was computed using Vcdiff delta instructions. The results of two experiments, "First" and "Successive," were presented. In "First," each file is compressed against the first file collected; in "Successive," each file is compressed against the one in the previous hour. The diff+gzip did not work well because diff was line-oriented. Vcdiff performed favorably compared with other formats.

Vcdiff is part of the Vcodex package. Vcodex is a platform for all common data transformations: delta compression, plain compression, encryption, transcoding (e.g., uuencode, base64). It is structured in three layers for maximum usability. The base library uses Disciplines and Methods interfaces.

The code for Vcdiff can be found at http://www.research.att.com/sw/tools/. They are moving Vcdiff to the IETF standard as a comprehensive platform for transforming data. Please refer to http://www.ietf.org/internet-drafts/draft-korn-vcdiff-06.txt.

WHERE IN THE NET

Summarized by Amit Purohit

A PRECISE AND EFFICIENT EVALUATION OF THE PROXIMITY BETWEEN WEB CLIENTS AND THEIR LOCAL DNS SERVERS

Zhuoqing Morley Mao, University of California, Berkeley; Charles D. Cranor, Fred Douglis, Michael Rabinovich, Oliver Spatscheck, and Jia Wang, AT&T Labs – Research

Content Distribution Networks (CDNs) attempt to improve Web performance by delivering Web content to end users from servers located at the edge of the network. When a Web client requests content, the CDN dynamically chooses a server to route the request to, usually the
one that is nearest the client. As CDNs have access only to the IP address of the local DNS server (LDNS) of the client, the CDN's authoritative DNS server maps the client's LDNS to a geographic region within a particular network and combines that with network and server load information to perform CDN server selection.

This method has two limitations. First, it is based on the implicit assumption that clients are close to their LDNS. This may not always be valid. Second, a single request from an LDNS can represent a differing number of Web clients – called the hidden load factor. Knowledge of the hidden load factor can be used to achieve better load distribution.

The extent of the first limitation and its impact on the CDN server selection is dealt with. To determine the associations between clients and their LDNS, a simple, non-intrusive, and efficient mapping technique was developed. The data collected was used to study the impact of proximity on DNS-based server selection using four different proximity metrics: (1) autonomous system (AS) clustering – observing whether a client is in the same AS as its LDNS – concluded that LDNS is very good for coarse-grained server selection, as 64% of the associations belong to the same AS; (2) network clustering – observing whether a client is in the same network-aware cluster (NAC) – implied that DNS is less useful for finer-grained server selection, since only 16% of the client and LDNS pairs are in the same NAC; (3) traceroute divergence – the length of the divergent paths to the client and its LDNS from a probe point using traceroute – implies that most clients are topologically close to their LDNS as viewed from a randomly chosen probe site; (4) round-trip time (RTT) correlation (some CDNs select servers based on RTT between the CDN server and the client's LDNS) examines the correlation between the message RTTs from a probe point to the client and its local DNS; results imply this is a reasonable metric to use to avoid really distant servers.

A study of the impact that client-LDNS associations have on DNS-based server selection concludes that knowing the client's IP address would allow more accurate server selection in a large number of cases. The optimality of the server selection also depends on the server density, placement, and selection algorithms.

Further information can be found at http://www.eecs.berkeley.edu/~zmao/myresearch.html or by contacting zmao@eecs.berkeley.edu.

GEOGRAPHIC PROPERTIES OF INTERNET ROUTING

Lakshminarayanan Subramanian, Venkata N. Padmanabhan, Microsoft Research; Randy H. Katz, University of California at Berkeley

Geographic information can provide insights into the structure and functioning of the Internet, including interactions between different autonomous systems, by analyzing certain properties of Internet routing. It can be used to measure and quantify certain routing properties, such as circuitous routing, hot-potato routing, and geographic fault tolerance.

Traceroute has been used to gather the required data, and the GeoTrack tool has been used to determine the location of the nodes along each network path. This enables the computation of "linearized distances," which is the sum of the geographic distances between successive pairs of routers along the path.

In order to measure the circuitousness of a path, a metric "distance ratio" has been defined as the ratio of the linearized distance of a path to the geographic distance between the source and destination of the path. From the data it has been observed that the circuitousness of a route depends on both the geographic and network location of the end hosts. A large value of the distance ratio enables us to flag paths that are highly
circuitous, possibly (though not necessarily) because of routing anomalies. It is also shown that the minimum delay between end hosts is strongly correlated with the linearized distance of the path.
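
A sketch of the two metrics, assuming the coordinates of the end hosts and of each router along the path have already been obtained (great-circle distance stands in for the paper's geographic distance; the coordinates below are made up):

from math import radians, sin, cos, asin, sqrt

def great_circle_km(a, b):
    # haversine distance between two (lat, lon) points given in degrees
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def linearized_distance(path):
    # sum of geographic distances between successive routers on the path
    return sum(great_circle_km(path[i], path[i + 1]) for i in range(len(path) - 1))

def distance_ratio(path):
    return linearized_distance(path) / great_circle_km(path[0], path[-1])

# hypothetical circuitous path: Berkeley -> Seattle -> Chicago -> San Jose
path = [(37.87, -122.27), (47.61, -122.33), (41.88, -87.63), (37.34, -121.89)]
print(round(linearized_distance(path)), "km, ratio", round(distance_ratio(path), 1))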

Geographic information can be used to study various aspects of wide-area Internet paths that traverse multiple ISPs. It was found that end-to-end Internet paths tend to be more circuitous than intra-ISP paths, the cause for this being the peering relationships between ISPs. Also, paths traversing substantial distances within two or more ISPs tend to be more circuitous than paths largely traversing only a single ISP. Another finding is that ISPs generally employ hot-potato routing.

Geographic information can also be used to capture the fact that two seemingly unrelated routers can be susceptible to correlated failures. By using the geographic information of routers, we can construct a geographic topology of an ISP. Using this, we can find the tolerance of an ISP's network to the total network failure in a geographic region.

For further information, contact lakme@cs.berkeley.edu.

PROVIDING PROCESS ORIGIN INFORMATION TO AID IN NETWORK TRACEBACK

Florian P. Buchholz, Purdue University; Clay Shields, Georgetown University

Network traceback is currently limited because host audit systems do not maintain enough information to match incoming network traffic to outgoing network traffic. The talk presented an alternative: assigning origin information to every process and logging it during interactive login creation.

The current implementation concentrates mainly on interactive sessions, in which an event is logged when a new connection is established using SSH or Telnet. The method is effective and could successfully determine stepping stones and reliably detect the source of a DDoS attack. The speaker then talked about the related work done in this area,
which led to discussion of the latest development in the area of "host causality."

The speaker ended with the limitations and future directions of the framework. This framework could be extended and could find many applications in the area of security. The current implementation doesn't take care of startup scripts and cron jobs, but incorporating the origin information in the FS could solve this problem. In the current implementation, logging is just implemented as a proof of concept. It could be made safe in many ways, and this could be another important aspect of future work.

PROGRAMMING

Summarized by Amit Purohit

CYCLONE A SAFE DIALECT OF C

Trevor Jim, AT&T Labs – Research; Greg Morrisett, Dan Grossman, Michael Hicks, James Cheney, Yanling Wang, Cornell University

Cyclone is designed to provide protection against attacks such as buffer overflows, format string attacks, and memory management errors. The current structure of C allows programmers to write vulnerable programs. Cyclone extends C so that it has the safety guarantee of Java while keeping C syntax, types, and semantics untouched.

The Cyclone compiler performs static analysis of the source code and inserts runtime checks into the compiled output at places where the analysis cannot determine that the code execution will not violate safety constraints. Cyclone imposes many restrictions to preserve safety, such as NULL checks. These checks do not exist in normal C.

The speaker then talked about some sample code written in Cyclone and how it tackles safety problems. Porting an existing C application to Cyclone is pretty easy, with fewer than 10% of the code requiring change. The current implementation concentrates more on safety than on performance – hence the performance penalty is substantial. Cyclone was able to find many lingering bugs. The final step of the project will be to write a compiler to convert a normal C program to an equivalent Cyclone program.

COOPERATIVE TASK MANAGEMENT WITHOUT MANUAL STACK MANAGEMENT

Atul Adya, Jon Howell, Marvin Theimer, William J. Bolosky, John R. Douceur, Microsoft Research

The speaker described the definitions and motivations behind the distinct concepts of task management, stack management, I/O management, conflict management, and data partitioning. Conventional concurrent programming uses preemptive task management and exploits the automatic stack management of a standard language. In the second approach, cooperative tasks are organized as event handlers that yield control by returning control to the event scheduler, manually unrolling their stacks. In this project they have adopted a hybrid approach that makes it possible for both stack management styles to coexist. Thus, programmers can code assuming one or the other of these stack management styles will operate, depending upon the application. The speaker also gave a detailed example of how to use their mechanism to switch between the two styles.

The talk then continued with the implementation. They were able to preempt many subtle concurrency problems by using cooperative task management. Paying a cost up-front to reduce a subtle race condition proved to be a good investment.

Though the choice of task management is fundamental, the choice of stack management can be left to individual taste. This project enables use of any type of stack management in conjunction with cooperative task management.

IMPROVING WAIT-FREE ALGORITHMS FOR INTERPROCESS COMMUNICATION IN EMBEDDED REAL-TIME SYSTEMS

Hai Huang, Padmanabhan Pillai, Kang G. Shin, University of Michigan

The main characteristic of the real-time, time-sensitive system is its predictable response time. But concurrency management creates hurdles to achieving this goal because of the use of locks to maintain consistency. To solve this problem, many wait-free algorithms have been developed, but these are typically high-cost solutions. By taking advantage of the temporal characteristics of the system, however, the time and space overhead can be reduced.

The speaker presented an algorithm for temporal concurrency control and applied this technique to improve three wait-free algorithms. A single-writer multiple-reader wait-free algorithm and a double-buffer algorithm were proposed.
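
A rough sketch of the double-buffer idea for single-writer, multiple-reader communication (illustrative only; the paper's algorithms are defined in terms of atomic operations and real-time scheduling guarantees, not Python objects):

class DoubleBuffer:
    """Single writer publishes snapshots; readers never block the writer."""

    def __init__(self, initial):
        self._slots = [initial, initial]
        self._latest = 0          # index of the slot readers should use

    def write(self, value):
        spare = 1 - self._latest  # writer only ever touches the spare slot
        self._slots[spare] = value
        self._latest = spare      # a single index "flip" publishes the update

    def read(self):
        # a real wait-free scheme also tracks active readers so the writer
        # never reuses a slot that is still being copied
        return self._slots[self._latest]

buf = DoubleBuffer(0)
buf.write(42)
print(buf.read())  # 42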

Using their transformation mechanism, they achieved an improvement of 17–66% in ACET and a 14–70% reduction in memory requirements for IPC algorithms. This mechanism is extensible and can be applied to other non-blocking IPC algorithms as well. Future work involves reducing the synchronizing overheads in more general IPC algorithms with multiple-writer semantics.

MOBILITY

Summarized by Praveen Yalagandula

ROBUST POSITIONING ALGORITHMS FOR DISTRIBUTED AD-HOC WIRELESS SENSOR NETWORKS

Chris Savarese, Jan Rabaey, University of California, Berkeley; Koen Langendoen, Delft University of Technology

The Pico Radio Network comprises more than a hundred sensors, monitors, and actuators equipped with wireless connectivity, with the following properties: (1) no infrastructure, (2) computation in a distributed fashion instead of centralized computation, (3) dynamic topology, (4) limited radio range,
(5) nodes that can act as repeaters, (6) devices of low cost and with low power, and (7) sparse anchor nodes – nodes with GPS capability. A positioning algorithm determines the geographical position of each node in the network.

In two-dimensional space, each node needs three reference positions to estimate its geographical position. There are two problems that make positioning difficult in the Pico Radio Network–type setting: (1) a sparse anchor node problem and (2) a range error problem.

The two-phase approach that the authors have taken in solving the problem consists of (1) a hop-terrain algorithm – in this first phase, each node roughly guesses its location by the distance calculated using multi-hops to the anchor points; and (2) a refinement algorithm – in this second phase, each node uses its neighbors' positions to refine its own position estimate. To guarantee convergence, this approach uses confidence metrics, assigning a value of 1.0 for anchor nodes and 0.1 to start for other nodes, and increasing these with each iteration.
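
For the basic step of estimating a position from three reference points and ranges, a plain 2-D trilateration sketch (the anchor coordinates and range values below are hypothetical):

def trilaterate(anchors, dists):
    # anchors: three (x, y) reference positions; dists: estimated ranges to them
    (x1, y1), (x2, y2), (x3, y3) = anchors
    d1, d2, d3 = dists
    # linearize the three circle equations by subtracting the third from the others
    a1, b1 = 2 * (x3 - x1), 2 * (y3 - y1)
    c1 = d1**2 - d3**2 - x1**2 - y1**2 + x3**2 + y3**2
    a2, b2 = 2 * (x3 - x2), 2 * (y3 - y2)
    c2 = d2**2 - d3**2 - x2**2 - y2**2 + x3**2 + y3**2
    det = a1 * b2 - a2 * b1
    return ((c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det)

# ranges measured from the point (3, 4); the estimate comes back as roughly (3, 4)
print(trilaterate([(0, 0), (10, 0), (0, 10)], [5.0, 8.062, 6.708]))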

A simulation tool, OMNeT++, was used for both phases and in various scenarios. The results show that they achieved position errors of less than 33% in a scenario with 5% anchor nodes, an average connectivity of 7, and 5% range measurement error.

Guidelines for anchor node deployment are high connectivity (>10), a reasonable fraction of anchor nodes (>5%), and, for anchor placement, covered edges.

APPLICATION-SPECIFIC NETWORK MANAGEMENT FOR ENERGY-AWARE STREAMING OF POPULAR MULTIMEDIA FORMATS

Surendar Chandra, University of Georgia, and Amin Vahdat, Duke University

The main hindrance in supporting the increasing demand for mobile multimedia on PDAs is the battery capacity of the small devices. The idle-time power consumption is almost the same as the receive power consumption on typical wireless network cards (1319mW vs. 1425mW), while the sleep state consumes far less power (177mW). To let the system consume energy proportional to the stream quality, the network card should transition to the sleep state aggressively between each packet.
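
Using the power figures quoted above, a back-of-the-envelope model shows why sleeping between packets matters (a sketch only; it ignores transition costs, and the packet timings are hypothetical):

IDLE_MW, RECV_MW, SLEEP_MW = 1319, 1425, 177

def energy_mj(recv_ms, gap_ms, sleep_between_packets):
    # energy for one packet: time spent receiving plus the inter-packet gap
    gap_mw = SLEEP_MW if sleep_between_packets else IDLE_MW
    return (recv_ms * RECV_MW + gap_ms * gap_mw) / 1000.0

# hypothetical stream: 2ms to receive each packet, 98ms between packets
print(energy_mj(2, 98, sleep_between_packets=False))  # ~132 mJ per packet
print(energy_mj(2, 98, sleep_between_packets=True))   # ~20 mJ per packet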

The speaker presented previous work where different multimedia streams were studied for PDAs: MS Media, Real Media, and QuickTime. It was found that if the inter-packet gap is predictable, then huge savings are possible; for example, about 80% savings in MS Media streams with few packet losses. The limitation of the IEEE 802.11b power-save mode comes into play when there are two or more nodes, producing a delay between beacon time and when the node receives/sends packets. This badly affects the streams, since higher energy is consumed while waiting.

The authors propose traffic shaping for energy conservation, where this is done by a proxy such that packets arrive at regular intervals. This is achieved using a local proxy in the access point and a client-side proxy in the mobile host. Simulations show that traffic shaping reduces energy consumption and also reveal that the higher bandwidth streams have a lower energy metric (mJ/kB).

More information is at http://greenhouse.cs.uga.edu/.

CHARACTERIZING ALERT AND BROWSE SERVICES OF MOBILE CLIENTS

Atul Adya, Paramvir Bahl, and Lili Qiu, Microsoft Research

Even though there is a dramatic increase in Internet access from wireless devices, there are not many studies done on characterizing this traffic. In this paper, the authors characterize the traffic observed on the MSN Web site with both notification and browse traces.

Around 33 million browsing accesses and about 325 million notification entries are present in the traces. Three types of analyses are done for each one of these two services: content analysis, concerning the most popular content categories and their distribution; popularity analysis, or the popularity distribution of documents; and user behavior analysis.

The analysis of the notification logs shows that document access rates follow a Zipf-like distribution, with most of the accesses concentrated on a small number of messages, and the accesses exhibit geographical locality – users from the same locality tend to receive similar notification content. The browser log analysis shows that a smaller set of URLs are accessed most times, though the access pattern does not fit any Zipf curve, and the highly accessed URLs remain stable. A correlation study between the notification and browsing services shows that wireless users have a moderate correlation of 0.12.
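
A Zipf-like distribution means access frequency falls off roughly as a power of rank. A quick way to eyeball this from an access log (illustrative; the MSN traces are of course not available here) is to fit log(frequency) against log(rank) and check for a slope near -1:

import math
from collections import Counter

def zipf_slope(accesses):
    # least-squares slope of log(freq) vs. log(rank); near -1 suggests Zipf-like
    freqs = sorted(Counter(accesses).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

# hypothetical notification log: most accesses hit a few message IDs
log = ["m1"] * 100 + ["m2"] * 50 + ["m3"] * 33 + ["m4"] * 25 + ["m5"] * 20
print(round(zipf_slope(log), 2))  # close to -1.0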

FREENIX TRACK SESSIONS

BUILDING APPLICATIONS

Summarized by Matt Butner

INTERACTIVE 3-D GRAPHICS APPLICATIONS FOR TCL

Oliver Kersting, Jürgen Döllner, Hasso Plattner Institute for Software Systems Engineering, University of Potsdam

The integration of 3-D image rendering functionality into a scripting language permits interactive and animated 3-D development and application without the formalities and precision demanded by low-level C/C++ graphics and visualization libraries. The large and complex C++ API of the Virtual Rendering System (VRS) can be combined with the conveniences of the Tcl scripting language. The mapping of class interfaces is done via an automated process and generates respective wrapper classes, all of which ensures complete API accessibility and functionality without any significant performance retribution. The general C++-to-Tcl mapper SWIG grants the necessary object and hierarchal abilities without the use of object-oriented Tcl extensions. The final API mapping techniques address C++ features such as classes, overloaded methods, enumerations, and inheritance relations. All are implemented in a proof-of-concept that maps the complete C++ API of the VRS to Tcl and are showcased in a complete interactive 3-D-map system.

THE AGFL GRAMMAR WORK LAB

Cornelis H.A. Koster, Erik Verbruggen, University of Nijmegen (KUN)

The growth and implementation of Natural Language Processing (NLP) is the cornerstone of the continued evolution and implementation of truly intelligent search machines and services. In part due to the growing collections of computer-stored, human-readable documents in the public domain, the implementation of linguistic analysis will become necessary for desirable precision and resolution. Subsequently, such tools and linguistic resources must be released into the public domain, and they have done so with the release of the AGFL Grammar Work Lab under the GNU Public License as a tool for linguistic research and the development of NLP-based applications.

The AGFL (Affix Grammars over a Finite Lattice) Grammar Work Lab meshes context-free grammars with finite set-valued features that are acceptable to a range of languages. In computer science terms, "Syntax rules are procedures with parameters and a non-deterministic execution." The English Phrases for Information Retrieval (EP4IR), released with the AGFL-GWL as a robust grammar of English, is an AGFL-GWL-generated English parser that outputs "Head/Modifier" frames. The sentences "CompanyX sponsored this conference" and "This conference was sponsored by CompanyX" both generate [CompanyX[sponsored conference]]. The hope is that public availability of such tools will encourage further development of grammar and lexical software systems.

SWILL: A SIMPLE EMBEDDED WEB SERVER LIBRARY

Sotiria Lampoudi, David M. Beazley, University of Chicago

This paper won the FREENIX Best Student Paper award.

SWILL (Simple Web Interface Link Library) is a simple Web server in the form of a C library whose development was motivated by a wish to give cool applications an interface to the Web. The SWILL library provides a simple interface that can be efficiently implemented for tasks that vary from flexible Web-based monitoring to software debugging and diagnostics. Though originally designed to be integrated with high-performance scientific simulation software, the interface is generic enough to allow for unbounded uses. SWILL is a single-threaded server relying upon non-blocking I/O through the creation of a temporary server which services I/O requests. SWILL does not provide SSL or cryptographic authentication but does have HTTP authentication abilities.

A fantastic feature of SWILL is its support for SPMD-style parallel applications which utilize MPI, proving valuable for Beowulf clusters and large parallel machines. Another practical application was the implementation of SWILL in a modified Yalnix emulator by University of Chicago operating systems courses, which utilized the added Yalnix functionality for OS development and debugging. SWILL requires minimal memory overhead and relies upon the HTTP/1.0 protocol.

NETWORK PERFORMANCE

Summarized by Florian Buchholz

LINUX NFS CLIENT WRITE PERFORMANCE

Chuck Lever, Network Appliance; Peter Honeyman, CITI, University of Michigan

Lever introduced a benchmark to measure NFS client write performance. Client performance is difficult to measure due to hindrances such as poor hardware or bandwidth limitations. Furthermore, measuring application performance does not identify weaknesses specifically at the client side. Thus, a benchmark was developed trying to exercise only data transfers in one direction, between server and application. For this purpose, the benchmark was based on the block sequential write portion of the Bonnie file system benchmark. Once a benchmark for NFS clients is established, it can be used to improve client performance.

The performance measurements were performed with an SMP Linux client and both a Linux NFS server and a Network Appliance F85 filer. During testing, a periodic jump in write latency time was discovered. This was due to a rather large number of pending write operations that were scheduled to be written after certain threshold values were exceeded. By introducing a separate daemon that flushes the cached write requests, the spikes could be eliminated, but as a result the average latency grows over time. The problem could be traced to a function that scans a linked list of write requests. After having added a hashtable to improve lookup performance, the latency improved considerably.
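
The fix is the classic data-structure change sketched below (illustrative only, not the actual Linux kernel code): replace an O(n) walk of the pending-write list with an O(1) keyed lookup.

# linear scan: cost grows with the number of pending write requests
def find_request_list(pending, file_offset):
    for req in pending:                 # O(n) per lookup
        if req["offset"] == file_offset:
            return req
    return None

# hash lookup: index the same requests by offset once, then look up in O(1)
def find_request_hash(pending_by_offset, file_offset):
    return pending_by_offset.get(file_offset)

pending = [{"offset": o, "data": b"..."} for o in range(0, 4096 * 1000, 4096)]
pending_by_offset = {req["offset"]: req for req in pending}
assert find_request_list(pending, 4096 * 500) is find_request_hash(pending_by_offset, 4096 * 500)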

The improved client was then used to measure throughput against the two servers. A discrepancy between the Linux server and the filer test was noticed, and the reason for that was traced back to a global kernel lock that was unnecessarily held when accessing the network stack. After correcting this, performance improved further. However,
the measurements showed that a client may run slower when paired with fast servers on fast networks. This is due to heavy client interrupt loads, more network processing on the client side, and more global kernel lock contention.

The source code of the project is available at http://www.citi.umich.edu/projects/nfs-perf/patches/.

A STUDY OF THE RELATIVE COSTS OF NETWORK SECURITY PROTOCOLS

Stefan Miltchev and Sotiris Ioannidis, University of Pennsylvania; Angelos Keromytis, Columbia University

With the increasing need for security and integrity of remote network services, it becomes important to quantify the communication overhead of IPSec and compare it to alternatives such as SSH, SCP, and HTTPS.

For this purpose, the authors set up three testing networks: a direct link, two hosts separated by two gateways, and three hosts connecting through one gateway. Protocols were compared in each setup, and manual keying was used to eliminate connection setup costs. For IPSec, the different encryption algorithms AES, DES, 3DES, hardware DES, and hardware 3DES were used. In detail, FTP was compared to SFTP, SCP, and FTP over IPSec; HTTP to HTTPS and HTTP over IPSec; and NFS to NFS over IPSec and local disk performance.

The results of the measurements were that IPSec outperforms other popular encryption schemes. Overall, unencrypted communication was fastest, but in some cases, like FTP, the overhead can be small. The use of crypto hardware can significantly improve performance. For future work, the inclusion of setup costs, hardware-accelerated SSL, SFTP, and SSH was mentioned.

CONGESTION CONTROL IN LINUX TCP

Pasi Sarolahti, University of Helsinki; Alexey Kuznetsov, Institute for Nuclear Research at Moscow

Having attended the talk and read the paper, I am still unclear about whether the authors are merely describing the design decisions of TCP congestion control or whether they are actually the creators of that part of the Linux code. My guess leans toward the former.

In the talk, the speaker compared the TCP protocol congestion control measures according to IETF and RFC specifications with the actual Linux implementation, which does conform to the basic principles but nevertheless has differences. A specific emphasis was placed on retransmission mechanisms and the congestion window. Also, several TCP enhancements – the NewReno algorithm, Selective ACKs (SACK), Forward ACKs (FACK) – were discussed and compared.

In some instances, Linux does not conform to the IETF specifications. The fast recovery does not fully follow RFC 2582, since the threshold for triggering retransmit is adjusted dynamically and the congestion window's size is not changed. Also, the roundtrip-time estimator and the RTO calculation differ from RFC 2988, since it uses more conservative RTT estimates and a minimum RTO of 200ms. The performance measures showed that with the additional Linux-specific features enabled, slightly higher throughput, more steady data flow, and fewer unnecessary retransmissions can be achieved.
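
For flavor, a sketch of a retransmission-timeout estimator: the update rule follows RFC 2988's SRTT/RTTVAR smoothing, with a 200ms floor standing in for the Linux minimum mentioned above (the real kernel code differs in its RTT estimates and many details).

def make_rto_estimator(min_rto=0.2, k=4, alpha=1/8, beta=1/4):
    srtt = rttvar = None
    def update(rtt):
        nonlocal srtt, rttvar
        if srtt is None:
            srtt, rttvar = rtt, rtt / 2          # first measurement (RFC 2988)
        else:
            rttvar = (1 - beta) * rttvar + beta * abs(srtt - rtt)
            srtt = (1 - alpha) * srtt + alpha * rtt
        return max(min_rto, srtt + k * rttvar)   # clamp to the minimum RTO
    return update

rto = make_rto_estimator()
for sample in (0.010, 0.012, 0.011, 0.100, 0.015):
    print(round(rto(sample), 3))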

XTREME XCITEMENT

Summarized by Steve Bauer

THE FUTURE IS COMING: WHERE THE X WINDOW SHOULD GO

Jim Gettys Compaq Computer Corp

Jim Gettys, one of the principal authors of the X Window System, outlined the near-term objectives for the system, primarily focusing on the changes and infrastructure required to enable replication and migration of X applications. Providing better support for this functionality would enable users to retrieve or duplicate X applications between their servers at home and work.

One interesting example of an application that currently is capable of migration and replication is Emacs. To create a new frame on DISPLAY, try "M-x make-frame-on-display <Return> DISPLAY <Return>".

However, technical challenges make replication and migration difficult in general. These include the "major headaches" of server-side fonts, the nonuniformity of X servers and screen sizes, and the need to appropriately retrofit toolkits. Authentication and authorization issues are obviously also important. The rest of the talk delved into some of the details of these interesting technical challenges.

HACKING IN THE KERNEL

Summarized by Hai Huang

AN IMPLEMENTATION OF SCHEDULER ACTIVATIONS ON THE NETBSD OPERATING SYSTEM

Nathan J Williams Wasabi Systems

Scheduler activation is an old idea. Basically, there are benefits and drawbacks to using solely kernel-level or user-level threading. Scheduler activation is able to combine the two layers of control to provide more concurrency in the system.

In his talk, Nathan gave a fairly detailed description of the implementation of scheduler activation in the NetBSD kernel. One important change in the implementation is to differentiate the thread context from the process context. This is done by defining a separate data structure for these threads and relocating some of the information that was embedded in the process context to these thread contexts. The stack was especially a concern, due to the upcall. Special handling must be done to make sure that the upcall doesn't mess up the stack, so that the preempted user-level thread can continue afterwards. Lastly, Nathan explained that signals were handled by upcalls.

AUTHORIZATION AND CHARGING IN PUBLIC WLANS USING FREEBSD AND 802.1X

Pekka Nikander, Ericsson Research NomadicLab

802.1x standards are well known in the wireless community as link-layer authentication protocols. In this talk, Pekka explained some novel ways of using the 802.1x protocols that might be of interest to people on the move. It is possible to set up a public WLAN that would support various charging schemes via virtual tokens, which people can purchase or earn and later use.

This is implemented on FreeBSD using the netgraph utility. It is basically a filter in the link layer that would differentiate traffic based on the MAC address of the client node, which is either authenticated, denied, or let through. The overhead for this service is fairly minimal.

ACPI IMPLEMENTATION ON FREEBSD

Takanori Watanabe Kobe University

ACPI (Advanced Configuration and Power Interface) was proposed as a joint effort by Intel, Toshiba, and Microsoft to provide a standard and finer-control method of managing power states of individual devices within a system. Such low-level power management is especially important for those mobile and embedded systems that are powered by fixed-capacity energy batteries.

Takanori explained that the ACPI specification is composed of three parts: tables, BIOS, and registers. He was able to implement some functionalities of ACPI in a FreeBSD kernel. The ACPI Component Architecture was implemented by Intel, and it provides a high-level ACPI API to the operating system. Takanori's ACPI implementation is built upon this underlying layer of APIs.

ACCESS CONTROL

Summarized by Florian Buchholz

DESIGN AND PERFORMANCE OF THE OPENBSD STATEFUL PACKET FILTER (PF)

Daniel Hartmeier Systor AG

Daniel Hartmeier described the new stateful packet filter (pf) that replaced IPFilter in the OpenBSD 3.0 release. IPFilter could no longer be included due to licensing issues, and thus there was a need to write a new filter making use of optimized data structures.

The filter rules are implemented as a linked list which is traversed from start to end. Two actions may be taken according to the rules, "pass" or "block." A "pass" action forwards the packet, and a "block" action will drop it. Where more than one rule matches a packet, the last rule wins. Rules that are marked as "final" will immediately terminate any further rule evaluation. An optimization called "skip-steps" was also implemented, where blocks of similar rules are skipped if they cannot match the current packet. These skip-steps are calculated when the rule set is loaded. Furthermore, a state table keeps track of TCP connections. Only packets that match the sequence numbers are allowed. UDP and ICMP queries and replies are considered in the state table, where initial packets will create an entry for a pseudo-connection with a low initial timeout value. The state table is implemented using a balanced binary search tree. NAT mappings are also stored in the state table, whereas application proxies reside in user space. The packet filter also is able to perform fragment reassembly and to modulate TCP sequence numbers to protect hosts behind the firewall.
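
The evaluation model (last matching rule wins, unless a rule is marked final) is easy to sketch; this only illustrates the semantics, not pf's actual rule syntax, data structures, or skip-step optimization:

def evaluate(rules, packet):
    # rules are evaluated in order; the last match wins unless marked "final"
    verdict = "pass"                      # hypothetical default policy
    for rule in rules:
        if all(packet.get(k) == v for k, v in rule["match"].items()):
            verdict = rule["action"]
            if rule.get("final"):
                break
    return verdict

rules = [
    {"match": {}, "action": "block"},                               # block everything...
    {"match": {"proto": "tcp", "dport": 22}, "action": "pass"},     # ...except SSH
    {"match": {"src": "10.0.0.66"}, "action": "block", "final": True},
]
print(evaluate(rules, {"proto": "tcp", "dport": 22, "src": "10.0.0.5"}))   # pass
print(evaluate(rules, {"proto": "tcp", "dport": 22, "src": "10.0.0.66"}))  # block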

Pf was compared against IPFilter as well as Linux's Iptables by measuring throughput and latency with increasing traffic rates and different packet sizes. In a test with a fixed rule set size of 100, Iptables outperformed the other two filters, whose results were close to each other. In a second test, where the rule set size was continually increased, Iptables consistently had about twice the throughput of the other two (which evaluate the rule set on both the incoming and outgoing interfaces). A third test compared only pf and IPFilter, using a single rule that created state in the state table, with a fixed-state entry size. Pf reached an overloaded state much later than IPFilter. The experiment was repeated with a variable-state entry size, and pf performed much better than IPFilter for a small number of states.

In general, rule set evaluation is expensive, and benchmarks only reflect extreme cases, whereas in real life other behavior should be observed. Furthermore, the benchmarks show that stateful filtering can actually improve performance, due to cheap state-table lookups as compared to rule evaluation.

ENHANCING NFS CROSS-ADMINISTRATIVE DOMAIN ACCESS

Joseph Spadavecchia and Erez Zadok, Stony Brook University

The speaker presented modifications to an NFS server that allow improved NFS access between administrative domains. A problem lies in the fact that NFS assumes a shared UID/GID space, which makes it unsafe to export files outside the administrative domain of the server. Design goals were to leave the protocol and existing clients unchanged, a minimum amount of server changes, flexibility, and increased performance.

To solve the problem, two techniques are utilized: "range-mapping" and "file-cloaking." Range-mapping maps IDs between client and server. The mapping is performed on a per-export basis and has to be manually set up in an export file. The mappings can be 1-1, N-N, or N-1. In file-cloaking, the server restricts file access based on UID/GID and special cloaking-mask bits. Here users can only access their own file permissions. The policy on whether or not the file is visible to others is set by the cloaking mask, which is logically ANDed with the
file's protection bits. File-cloaking only works, however, if the client doesn't hold cached copies of directory contents and file attributes. Because of this, the clients are forced to re-read directories, by incrementing the mtime value of the directory each time it is listed.
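
A sketch of the range-mapping idea, translating client IDs to server IDs per export (the table below is hypothetical and far simpler than the server's actual export-file syntax):

# per-export mapping of client UID ranges onto server UIDs
EXPORT_MAP = {
    "/export/shared": [
        {"client": range(1000, 2000), "server": range(21000, 22000)},  # 1-1 range
        {"client": range(0, 100),     "server": [65534]},              # N-1 squash
    ],
}

def map_uid(export, client_uid):
    for entry in EXPORT_MAP.get(export, []):
        if client_uid in entry["client"]:
            servers = entry["server"]
            if len(servers) == 1:                       # N-1: squash to a single ID
                return servers[0]
            return servers[client_uid - entry["client"].start]  # 1-1: shift into range
    return None                                         # unmapped IDs get no access

print(map_uid("/export/shared", 1042))  # 21042
print(map_uid("/export/shared", 7))     # 65534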

To test the performance of the modified server, five different NFS configurations were evaluated: an unmodified NFS server was compared against one server with the modified code included but not used, one with only range-mapping enabled, one with only file-cloaking enabled, and one version with all modifications enabled. For each setup, different file system benchmarks were run. The results show only a small overhead when the modifications are used, generally an increase of below 5%. Another experiment tested the performance of the system with an increasing number of mapped or cloaked entries on a system. The results show that an increase from 10 to 1000 entries resulted in a maximum of about a 14% cost in performance.

One member of the audience pointed out that if clients choose to ignore the changed mtimes from the server and thus still hold caches of the directory entries, the file-cloaking mechanism could be defeated. After a rather lengthy debate, the speaker had to concede that the model doesn't add any extra security. Another question was asked about scalability of the setup of range-mapping. The speaker referred to application-level tools that could be developed for that purpose.

The software is available at ftp://ftp.fsl.cs.sunysb.edu/pub/enf/.

ENGINEERING OPEN SOURCE SOFTWARE

Summarized by Teri Lampoudi

NINGAUI A LINUX CLUSTER FOR BUSINESS

Andrew Hume, AT&T Labs – Research; Scott Daniels, Electronic Data Systems Corp.

Ningaui is a general purpose, highly available, resilient architecture built from commodity software and hardware. Emphatically, however, Ningaui is not a Beowulf. Hume calls the cluster design the "Swiss canton model," in which there are a number of loosely affiliated independent nodes with data replicated among them. Jobs are assigned by bidding and leases, and cluster services done as session-based servers are sited via generic job assignment. The emphasis is on keeping the architecture end-to-end, checking all work via checksums, and logging everything. The resilience and high availability required by their goal of 8-5 maintenance – vs. the typical 24-7 model where people get paged whenever the slightest thing goes wrong, regardless of the time of day – is achieved by job restartability. Finally, all computation is performed on local data, without the use of NFS or network-attached storage.

Hume's message is a hopeful one: despite the many problems encountered – things like kernel and service limits, auto-installation problems, TCP storms, and the like – the existence of source code and the paranoid practice of logging and checksumming everything has helped. The final product performs reasonably well, and it appears that the resilient techniques employed do make a difference. One drawback, however, is that the software mentioned in the paper is not yet available for download.

CPCMS: A CONFIGURATION MANAGEMENT SYSTEM BASED ON CRYPTOGRAPHIC NAMES

Jonathan S. Shapiro and John Vanderburgh, Johns Hopkins University

This paper won the FREENIX Best Paper award.

The basic notion behind the project is the fact that everyone has a pet complaint about CVS, and yet it is currently the configuration manager in most widespread use. Shapiro has unleashed an alternative. Interestingly, he did not begin but ended with the usual host of reasons why CVS is bad. The talk instead began abruptly by characterizing the job
of a software configuration manager, continued by stating the namespaces which it must handle, and wrapped up with the challenges faced.

X MEETS Z: VERIFYING CORRECTNESS IN THE PRESENCE OF POSIX THREADS

Bart Massey, Portland State University; Robert T. Bauer, Rational Software Corp.

Massey delivered a humorous talk on the insight gained from applying Z formal specification notation to system software design, rather than the more informal analysis and design process normally used.

The story is told with respect to writing XCB, which replaces the Xlib protocol layer and is supposed to be thread-friendly. But where threads are concerned, deadlock avoidance becomes a hard problem that cannot be solved in an ad-hoc manner. But full model checking is also too hard. In this case, Massey resorted to Z specification to model the XCB lower layer, abstract away locking and data transfers, and locate fundamental issues. Essentially, the difficulties of searching the literature and locating information relevant to the problem at hand were overcome. As Massey put it, "the formal method saved the day."

FILE SYSTEMS

Summarized by Bosko Milekic

PLANNED EXTENSIONS TO THE LINUX EXT2/EXT3 FILESYSTEM

Theodore Y. Ts'o, IBM; Stephen Tweedie, Red Hat

The speaker presented improvements to the Linux Ext2 file system, with the goal of allowing for various expansions while striving to maintain compatibility with older code. Improvements have been facilitated by a few extra superblock fields that were added to Ext2 not long before the Linux 2.0 kernel was released. The fields allow for file system features to be added without compromising the existing setup; this is done by providing
bits indicating the impact of the added features with respect to compatibility. Namely, a file system with the "incompat" bit marked is not allowed to be mounted. Similarly, a "read-only" marking would only allow the file system to be mounted read-only.
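
The compatibility-bit idea can be sketched as follows (a toy model, not the real Ext2 superblock layout or flag values):

# hypothetical feature masks, one bit per on-disk feature
KNOWN_COMPAT    = 0b0111   # features this kernel fully understands
KNOWN_RO_COMPAT = 0b0011
KNOWN_INCOMPAT  = 0b0001

def can_mount(compat, ro_compat, incompat, read_only=False):
    if incompat & ~KNOWN_INCOMPAT:
        return False        # unknown "incompat" feature: refuse to mount
    if ro_compat & ~KNOWN_RO_COMPAT and not read_only:
        return False        # unknown "read-only" feature: allow read-only mounts only
    return True             # unknown purely-compatible features are safe to ignore

print(can_mount(compat=0b1000, ro_compat=0, incompat=0))                  # True
print(can_mount(compat=0, ro_compat=0b0100, incompat=0))                  # False
print(can_mount(compat=0, ro_compat=0b0100, incompat=0, read_only=True))  # True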

Directory indexing replaces linear directory searches with a faster search using a fixed-depth tree and hashed keys. File system size can be dynamically increased, and the expanded inode, doubled from 128 to 256 bytes, allows for more extensions.

Other potential improvements were discussed as well, in particular pre-allocation for contiguous files, which allows for better performance in certain setups by pre-allocating contiguous blocks. Security-related modifications, extended attributes, and ACLs were mentioned. An implementation of these features already exists but has not yet been merged into the mainline Ext2/3 code.

RECENT FILESYSTEM OPTIMISATIONS ON FREEBSD

Ian Dowse, Corvil Networks; David Malone, CNRI, Dublin Institute of Technology

David Malone presented four important file system optimizations for the FreeBSD OS: soft updates, dirpref, vmiodir, and dirhash. It turns out that certain combinations of the optimizations (beautifully illustrated in the paper) may yield performance improvements of anywhere between 2 and 10 orders of magnitude for real-world applications.

All four of the optimizations deal with file system metadata. Soft updates allow for asynchronous metadata updates. Dirpref changes the way directories are organized, attempting to place child directories closer to their parents, thereby increasing locality of reference and reducing disk-seek times. Vmiodir trades some extra memory in order to achieve better directory caching. Finally, dirhash, which was implemented by Ian Dowse, changes the way in which entries are searched in a directory. Specifically, FFS uses a linear search to find an entry by name; dirhash builds a hashtable of directory entries on the fly. For a directory of size n with a working set of m files, a search that in certain cases could have been O(nm) has been reduced, due to dirhash, to effectively O(n + m).
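
A sketch of the difference (illustrative only; FFS directory blocks and the real dirhash code look nothing like Python dictionaries):

entries = [f"file{i:05d}" for i in range(50000)]   # a large directory

def lookup_linear(name):
    for i, entry in enumerate(entries):            # O(n) scan per lookup
        if entry == name:
            return i
    return -1

# dirhash-style: build the index once, then each lookup is O(1) on average
index = {name: i for i, name in enumerate(entries)}

def lookup_hashed(name):
    return index.get(name, -1)

assert lookup_linear("file04999") == lookup_hashed("file04999") == 4999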

FILESYSTEM PERFORMANCE AND SCALABILITY IN LINUX 2.4.17

Ray Bryant, SGI; Ruth Forester, IBM LTC; John Hawkes, SGI

This talk focused on performance evaluation of a number of file systems available and commonly deployed on Linux machines. Comparisons under various configurations of Ext2, Ext3, ReiserFS, XFS, and JFS were presented.

The benchmarks chosen for the data gathering were pgmeter, filemark, and the AIM Benchmark Suite VII. Pgmeter measures the rate of data transfer of reads/writes of a file under a synthetic workload. Filemark is similar to postmark in that it is an operation-intensive benchmark, although filemark is threaded and offers various other features that postmark lacks. AIM VII measures performance for various file-system-related functionalities; it offers various metrics under an imposed workload, thus stressing the performance of the file system not only under I/O load but also under significant CPU load.

Tests were run on three different setups: a small, a medium, and a large configuration. ReiserFS and Ext2 appear at the top of the pile for smaller and medium setups. Notably, XFS and JFS perform worse for smaller system configurations than the others, although XFS clearly appears to generate better numbers than JFS. It should be noted that XFS seems to scale well under a higher load. This was most evident in the large-system results, where XFS appears to offer the best overall results.

THINGS TO THINK ABOUT

Summarized by Bosko Milekic

SPEEDING UP KERNEL SCHEDULER BY REDUCING CACHE MISSES

Shuji Yamamura, Akira Hirai, Mitsuru Sato, Masao Yamamoto, Akira Naruse, Kouichi Kumon, Fujitsu Labs

This was an interesting talk pertaining to the effects of cache coloring for task structures in the Linux kernel scheduler (Linux kernel 2.4.x). The speaker first presented some interesting benchmark numbers for the Linux scheduler, showing that as the number of processes on the task queue was increased, the performance decreased. The authors used some really nifty hardware to measure the number of bus transactions throughout their tests and were thus able to reasonably quantify the impact that cache misses had in the Linux scheduler.

Their experiments led them to implement a cache coloring scheme for task structures, which were previously aligned on 8KB boundaries and therefore were being eventually mapped to the same cache lines. This unfortunate placement of task structures in memory induced a significant number of cache misses as the number of tasks grew in the scheduler.
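
The aliasing effect is simple arithmetic: structures placed on 8KB boundaries repeat the same low address bits, so they compete for the same cache sets. A sketch (the cache geometry below is hypothetical, not the hardware used in the paper):

LINE = 64
SETS = 512                      # hypothetical L2 way: 64-byte lines, 512 sets

def cache_set(addr):
    return (addr // LINE) % SETS

# task structures aligned on 8KB boundaries reuse the same few cache sets
aligned = [0x10000000 + i * 8192 for i in range(64)]
print(len({cache_set(a) for a in aligned}))      # 4 distinct sets for 64 tasks

# coloring staggers each structure by one cache line, spreading them out
colored = [a + (i % 32) * LINE for i, a in enumerate(aligned)]
print(len({cache_set(a) for a in colored}))      # 32 distinct sets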

The implemented solution consisted of aligning task structures to more evenly distribute cached entries across the L2 cache. The result was inevitably fewer cache misses in the scheduler. Some negative effects were observed in certain situations. These were primarily due to more cache slots being used by task structure data in the scheduler, thus forcing data previously cached there to be pushed out.

WORK-IN-PROGRESS REPORTS

Summarized by Brennan Reynolds

RESOURCE VIRTUALIZATION TECHNIQUES FOR WIDE-AREA OVERLAY NETWORKS

Kartik Gopalan, Stony Brook University

This work addressed the issue of provisioning a maximum number of virtual overlay networks (VON) with diverse quality of service (QoS) requirements on a single physical network. Each of the VONs is logically isolated from others to ensure the QoS requirements. Gopalan mentioned several critical research issues with this problem that are currently being investigated. Dealing with how to provision the network at various levels (link, route, or path) and then enforce the provisioning at run-time is one of the toughest challenges. Currently, Gopalan has developed several algorithms to handle admission control, end-to-end QoS, route selection, scheduling, and fault tolerance in the network.

For more information, visit http://www.ecsl.cs.sunysb.edu/.

VISUALIZING SOFTWARE INSTABILITY

Jennifer Bevan, University of California, Santa Cruz

Detection of instability in software has typically been an afterthought. The point in the development cycle when the software is reviewed for instability is usually after it is difficult and costly to go back and perform major modifications to the code base. Bevan has developed a technique to allow the visualization of unstable regions of code that can be used much earlier in the development cycle. Her technique creates a time series of dependent graphs that include clusters and lines called fault lines. From the graphs, a developer is able to easily determine where the unstable sections of code are and proactively restructure them. She is currently working on a prototype implementation.

For more information, visit http://www.cse.ucsc.edu/~jbevan/evo_viz/.

RELIABLE AND SCALABLE PEER-TO-PEER WEB DOCUMENT SHARING

Li Xiao, College of William and Mary

The idea presented by Xiao would allow end users to share the content of their Web browser caches with neighboring Internet users. The rationale for this is that today's end users are increasingly connected to the Internet over high-speed links, and the browser caches are becoming large storage systems. Therefore, if an individual accesses a page which does not exist in their local cache, Xiao is suggesting that they first query other end users for the content before trying to access the machine hosting the original. This strategy does have some serious problems associated with it that still need to be addressed, including ensuring the integrity of the content and protecting the identity and privacy of the end users.

SEREL FAST BOOTING FOR UNIX

Leni Mayo Fast Boot Software

Serel is a tool that generates a visual representation of a UNIX machine's boot-up sequence. It can be used to identify the critical path and can show if a particular service or process blocks for an extended period of time. This information could be used to determine where the largest performance gains could be realized by tuning the order of execution at boot-up. Serel creates a dependency graph expressed in XML during boot-up. This graph is then used to create the visual representation. Currently, the tool only works on POSIX-compliant systems, but Mayo stated that he would be porting it to other platforms. Other extensions that were mentioned included having the metadata include the use of shared libraries and monitoring the suspend/resume sequence of portable machines.

For more information, visit http://www.fastboot.org/.

BERKELEY DB XML

John Merrells Sleepycat Software

Merrells gave a quick introduction and overview of the new XML library for Berkeley DB. The library specializes in storage and retrieval of XML content through a tool called XPath. The library allows for multiple containers per document and stores everything natively as XML. The user is also given a wide range of elements to create indices with, including edges, elements, text strings, or presence. The XPath tool consists of a query parser, generator, optimizer, and execution engine. To conclude his presentation, Merrells gave a live demonstration of the software.

For more information, visit http://www.sleepycat.com/xml/.

CLUSTER-ON-DEMAND (COD)

Justin Moore Duke University

Modern clusters are growing at a rapid rate. Many have pushed beyond the 5000-machine mark, and deploying them results in large expenses as well as management and provisioning issues. Furthermore, if the cluster is "rented" out to various users, it is very time-consuming to configure it to a user's specs, regardless of how long they need to use it. The COD work presented creates dynamic virtual clusters within a given physical cluster. The goal was to have a provisioning tool that would automatically select a chunk of available nodes and install the operating system and middleware specified by the customer in a short period of time. This would allow a greater use of resources, since multiple virtual clusters can exist at once. By using a virtual cluster, the size can be changed dynamically. Moore stated that they have created a working prototype and are currently testing and benchmarking its performance.

For more information, visit http://www.cs.duke.edu/~justin/cod/.

CATACOMB

Elias Sinderson, University of California, Santa Cruz

Catacomb is a project to develop a database-backed DASL module for the Apache Web server. It was designed as a replacement for WebDAV. The initial release of the module only contains support for a MySQL database but could be extended to others. Sinderson briefly touched on the performance of her module. It was comparable to the mod_dav Apache module for all query types but search. The presentation was concluded with remarks about adding support for the lock method and including ACL specifications in the future.

For more information, visit http://ocean.cse.ucsc.edu/catacomb/.

SELF-ORGANIZING STORAGE

Dan Ellard Harvard University

Ellard's presentation introduced a storage system that tuned itself based on the workload of the system without requiring the intervention of the user. The intelligence was implemented as a virtual, self-organizing disk that resides below any file system. The virtual disk observes the access patterns exhibited by the system and then attempts to predict what information will be accessed next. An experiment was done using an NFS trace at a large ISP on one of their email servers. Ellard's self-organizing storage system worked well, which he attributes to the fact that most of the files being requested were large email boxes. Areas of future work include exploration of the length and detail of the data collection stage as well as the CPU impact of running the virtual disk layer.

VISUAL DEBUGGING

John Costigan Virginia Tech

Costigan feels that the state of current debugging facilities in the UNIX world is not as good as it should be. He proposes the addition of several elements to programs like ddd and gdb. The first addition is including separate heap and stack components. Another is to display only certain fields of a data structure and have the ability to zoom in and out if needed. Finally, the debugger should provide the ability to visually present complex data structures, including linked lists, trees, etc. He said that a beta version of a debugger with these abilities is currently available.

For more information, visit http://infovis.cs.vt.edu/datastruct/.

IMPROVING APPLICATION PERFORMANCE THROUGH SYSTEM-CALL COMPOSITION

Amit Purohit University of Stony Brook

Web servers perform a huge number of context switches and internal data copying during normal operation. These two elements can drastically limit the performance of an application, regardless of the hardware platform it is run on. The Compound System Call (CoSy) framework is an attempt to reduce the performance penalty for context switches and data copies. It includes a user-level library and several kernel facilities that can be used via system calls. The library provides the programmer with a complete set of memory-management functions. Performance tests using the CoSy framework showed a large saving for context switches and data copies. The only area where the savings between conventional libraries and CoSy were negligible was for very large file copies.

For more information, visit http://www.cs.sunysb.edu/~purohit/.

ELASTIC QUOTAS

John Oscar Columbia University

Oscar began by stating that most resources are flexible, but that to date disks have not been. While approximately 80 percent of files are short-lived, disk quotas are hard limits imposed by administrators. The idea behind elastic quotas is to have a non-permanent storage area that each user can temporarily use. Oscar suggested the creation of an ehome directory structure to be used in deploying elastic quotas. Currently, he has implemented elastic quotas as a stackable file system. There were also several trends apparent in the data. Many users would ssh to their remote host (good) but then launch an email reader on the local machine which would connect via POP to the same host and send the password in clear text (bad). This situation can easily be remedied by tunneling protocols like POP through SSH, but it appeared that many people were not aware this could be done. While most of his comments on the use of the wireless network were negative, the list of passwords he had collected showed that people were indeed using strong passwords. His recommendations were to educate and encourage people to use protocols like IPSec, SSH, and SSL when conducting work over a wireless network, because you never know who else is listening.

SYSTRACE

Niels Provos University of Michigan

How can one be sure the applications one is using actually do exactly what their developers said they do? Short answer: you can't. People today are using a large number of complex applications, which means it is impossible to check each application thoroughly for security vulnerabilities. There are tools out there that can help, though. Provos has developed a tool called systrace that allows a user to generate a policy of acceptable system calls a particular application can make. If the application attempts to make a call that is not defined in the policy, the user is notified and allowed to choose an action. Systrace includes an auto-policy generation mechanism, which uses a training phase to record the actions of all programs the user executes. If the user chooses not to be bothered by applications breaking policy, systrace allows default enforcement actions to be set. Currently, systrace is implemented for FreeBSD and NetBSD, with a Linux version coming out shortly.

For more information, visit http://www.citi.umich.edu/u/provos/systrace/.

VERIFIABLE SECRET REDISTRIBUTION FOR SURVIVABLE STORAGE SYSTEMS

Ted Wong Carnegie Mellon University

Wong presented a protocol that can be used to re-create a file distributed over a set of servers, even if one of the servers is damaged. In his scheme, the user must choose how many shares the file is split into and the number of servers it will be stored across. The goal of this work is to provide persistent, secure storage of information even if it comes under attack. Wong stated that his design of the protocol was complete, and he is currently building a prototype implementation of it.

For more information, visit http://www.cs.cmu.edu/~tmwong/research/.

THE GURU IS IN SESSIONS

LINUX ON LAPTOP/PDA

Bdale Garbee HP Linux Systems Operation

Summarized by Brennan Reynolds

This was a roundtable-like discussion, with Garbee acting as moderator. He is currently in charge of the Debian distribution of Linux and is involved in porting Linux to platforms other than i386. He has successfully helped port the kernel to the Alpha, SPARC, and ARM and is currently working on a version to run on VAX machines.

A discussion was held on which file system is best for battery life. The Reiser and Ext3 file systems were discussed; people commented that when using Ext3 on their laptops, the disk would never spin down and thus was always consuming a large amount of power. One suggestion was to use a RAM-based file system for any volume that requires a large number of writes and to only have the file system written to disk at infrequent intervals or when the machine is shut down or suspended.

A question was directed to Garbee about his knowledge of how the HP-Compaq merger would affect Linux. Garbee thought that the merger was good for Linux, both within the company and for the open source community as a whole. He stated that the new company would be the number one shipper of systems with Linux as the base operating system. He also fielded a question about the support of older HP servers and their ability to run Linux, saying that indeed Linux has been ported to them and he personally had several machines in his basement running it.

A question which sparked a large number of responses concerned problems people had with running Linux on mobile platforms. Most people actually did not have many problems at all. There were a few cases of a particular machine model not working, but there were only two widespread issues: docking stations and the new ACPI power management scheme. Docking stations still appear to be somewhat of a headache, but most people had developed ad-hoc solutions for getting them to work. The ACPI power management scheme developed by Intel does not appear to have a quick solution. Ted Ts'o, one of the head kernel developers, stated that there are fundamental architectural problems with the 2.4 series kernel that do not easily allow ACPI to be added. However, he also stated that ACPI is supported in the newest development kernel, 2.5, and will be included in 2.6.

The final topic was the possibility of purchasing a laptop without paying any charge/tax for Microsoft Windows. The consensus was that currently this is not possible. Even if the machine comes with Linux or without any operating system, the vendors have contracts in place which require them to pay Microsoft for each machine they ship if that machine is capable of running a Microsoft operating system.


The only way this will change is if the demand for other operating systems increases to the point that vendors renegotiate their contracts with Microsoft, and no one saw this happening in the near future.

LARGE CLUSTERS

Andrew Hume, AT&T Labs – Research

Summarized by Teri Lampoudi

This was a session on clusters, big data, and resilient computing. Given that the audience was primarily interested in clusters and not necessarily big data or beating nodes to a pulp, the conversation revolved mainly around what it takes to make a heterogeneous cluster resilient. Hume also presented a FREENIX paper on his current cluster project, which explained in more detail much of what was abstractly claimed in the guru session.

Hume's use of the word "cluster" referred not to what I had assumed would be a Beowulf-type system but to a loosely coupled farm of machines in which concurrency was much less of an issue than it would be in a Beowulf. In fact, the architecture Hume described was designed to deal with large amounts of independent transaction processing, essentially the process of billing calls, which requires no interprocess message passing or anything of the sort a parallel scientific application might.

Where does the large data come in? Since the transactions in question consist of large numbers of flat files, a mechanism for getting the files onto nodes and off-loading the results is necessary. In this particular cluster, named Ningaui, this task is handled by the "replication manager," which generated a large amount of interest in the audience. All management functions in the cluster, as well as job allocation, constitute services that are to be bid for and leased out to nodes. Furthermore, failures are handled by simply repeating the failed task until it is completed properly, if that is possible, and if it is not, then looking through detailed logs to discover what went wrong.


Logging and testing for the successful completion of each task are what make this model resilient. Every management operation is logged at some appropriate level of detail, generating 120MB/day, and every single file transfer is MD5 checksummed. This was characterized by Hume as a "patient" style of management, where no one controls the cluster, scheduling behavior is emergent rather than stringently planned, and nodes are free to drop in and out of the cluster; a downside to this model is the increase in latency, offset by the fact that the cluster continues to function unattended outside of 8-to-5 support hours.

In response to questions about the projected scalability of such a scheme, Hume said the system would presumably scale from the present 60 nodes to about 100 without modifications, but that scaling higher would be a matter for further design. The guru session ended on a somewhat unrelated note regarding the problems of getting data off mainframes and onto Linux machines – issues of fixed- vs. variable-length encoding of data blocks resulting in useful information being stripped by FTP, and problems in converting COBOL copybooks to something useful on a C-based architecture. To reiterate Hume's claim, there are homemade solutions for some of these problems, and the reader should feel free to contact Mr. Hume for them.

WRITING PORTABLE APPLICATIONS

Nick Stoughton, MSB Consultants

Summarized by Josh Lothian

Nick Stoughton addressed a concern that developers are facing more and more: writing portable applications. He proposed that there is no such thing as a "portable application," only those that have been ported. Developers are still in search of the Holy Grail of applications programming: a write-once, compile-anywhere program.

Languages such as Java are leading up to this, but virtual machines still have to be ported, paths will still be different, etc. Web-based applications are another area that could be promising for portable applications.

Nick went on to talk about the future of portability. The recently released POSIX 2001 standards are expected to help the situation. The 2001 revision expands the original POSIX spec into seven volumes. Even though POSIX 2001 is more specific than the previous release, Nick pointed out that even this standard allows for small differences where vendors may define their own behaviors. This can only hurt developers in the long run. Along these lines, it was pointed out that there may be room for automated tools that have the capability to check application code and determine the level of portability and standards compliance of the source.

NETWORK MANAGEMENT SYSTEM PERFORMANCE TUNING

Jeff Allen, Tellme Networks, Inc.

Summarized by Matt Selsky

Jeff Allen, author of Cricket, discussed network tuning. The basic idea of tuning is: measure, twiddle, and measure again. Cricket can be used for large installations, but it's overkill for smaller installations, for which MRTG is better suited. You don't need Cricket to do measurement. Doing something like a shell script that dumps data to Excel is good, but you need to get some sort of measurement. Cricket was not designed for billing; it was meant to help answer questions.

Useful tools and techniques covered included: looking for the difference between machines in order to identify the causes for the differences; measuring, but also thinking about what you're measuring and why; and remembering to use strace and tcpdump to identify problems. Problems repeat themselves; performance tuning requires an intricate understanding of all the layers involved to solve the problem and simplify the automation.

(Having a computer science background helps, but is not essential.) If you can reproduce the problem, you then should investigate each piece to find the bottleneck. If you can't reproduce the problem, then measure. You need to understand all the system interactions. Measurement can help determine whether things are actually slow or if the user is imagining it.

Troubleshooting begins with a hunch, but scientific processes are essential. You should be able to determine the event stream, or when each event occurs; having observation logs can help. Some hints provided include checking cron for periodic anomalies, slowing down /bin/rm to avoid I/O overload, and looking for unusual kernel CPU usage.

Also, try to use lightweight monitoring to reduce overhead. You don't need to monitor every resource on every system, but you should monitor those resources on those systems that are essential. Don't check things too often, since you can introduce overhead that way. Lots of information can be gathered from the /proc file system interface to the kernel.
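As a small example of the kind of lightweight check being suggested, a monitor can poll a /proc entry such as /proc/loadavg (Linux-specific) at a modest interval rather than invoking heavier tools; the threshold and interval below are illustrative only:

/* Minimal sketch of lightweight monitoring via /proc (Linux-specific). */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    for (;;) {
        FILE *f = fopen("/proc/loadavg", "r");
        if (f != NULL) {
            double load1, load5, load15;
            if (fscanf(f, "%lf %lf %lf", &load1, &load5, &load15) == 3 &&
                load1 > 4.0)   /* example threshold, tune per system */
                fprintf(stderr, "high 1-minute load: %.2f\n", load1);
            fclose(f);
        }
        sleep(60);             /* infrequent checks keep overhead low */
    }
    return 0;
}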

BSD BOF

Host: Kirk McKusick, Author and Consultant

Summarized by Bosko Milekic

The BSD Birds of a Feather session at USENIX started off with five quick and informative talks and ended with an exciting discussion of licensing issues.

Christos Zoulas, for the NetBSD Project, went over the primary goals of the NetBSD Project (namely portability, a clean architecture, and security) and then proceeded to briefly discuss some of the challenges that the project has encountered. The successes and areas for improvement of 1.6 were then examined. NetBSD has recently seen various improvements in its cross-build system, packaging system, and third-party tool support.

For what concerns the kernel, a zero-copy TCP/UDP implementation has been integrated, and new pmap code for arm32 has been introduced, as have some other rather major changes to various subsystems. More information is available in Christos's slides, which are now available at http://www.netbsd.org/gallery/events/usenix2002/.

Theo de Raadt of the OpenBSD Project presented a rundown of various new things that the OpenBSD and OpenSSH teams have been looking at. Theo's talk included an amusing description (and pictures) of some of the events that transpired during OpenBSD's hack-a-thon, which occurred the week before USENIX '02. Finally, Theo mentioned that following complications with the IPFilter license, the OpenBSD team had done a fairly extensive license audit.

Robert Watson of the FreeBSD Project brought up various examples of FreeBSD being used in the real world. He went on to describe a wide variety of changes that will surface with the upcoming FreeBSD release 5.0. Notably, 5.0 will introduce a re-architecting of the kernel aimed at providing much more scalable support for SMP machines. 5.0 will also include an early implementation of KSE, FreeBSD's version of scheduler activations, large improvements to pccard, and significant framework bits from the TrustedBSD project.

Mike Karels from Wind River Systems presented an overview of how BSD/OS has been evolving following the Wind River Systems acquisition of BSDi's software assets. Ernie Prabhakar from Apple discussed Apple's OS X and its success on the desktop. He explained how OS X aims to bring BSD to the desktop while striving to maintain a strong and stable core, one that is heavily based on BSD technology.

The BoF finished with a discussion on licensing; specifically, some folks questioned the impact of Apple's licensing for code from their core OS (the Darwin Project) on the BSD community as a whole.

CLOSING SESSION

HOW FLIES FLY

Michael H. Dickinson, University of California, Berkeley

Summarized by JD Welch

In this lively and offbeat talk, Dickinson discussed his research into the mechanisms of flight in the fruit fly (Drosophila melanogaster) and related the autonomic behavior to that of a technological system. Insects are the most successful organisms on earth, due in no small part to their early adoption of flight. Flies travel through space on straight trajectories interrupted by saccades, jumpy turning motions somewhat analogous to the fast, jumpy movements of the human eye. Using a variety of monitoring techniques, including high-speed video in a "virtual reality flight arena," Dickinson and his colleagues have observed that flies respond to visual cues to decide when to saccade during flight. For example, more visual texture makes flies saccade earlier (i.e., further away from the arena wall), described by Dickinson as a "collision control algorithm."

Through a combination of sensory inputs, flies can make decisions about their flight path. Flies possess a visual "expansion detector" which, at a certain threshold, causes the animal to turn a certain direction. However, expansion in front of the fly sometimes causes it to land. How does the fly decide? Using the virtual reality device to replicate various visual scenarios, Dickinson observed that the flies fixate on vertical objects that move back and forth across the field. Expansion on the left side of the animal causes it to turn right, and vice versa, while expansion directly in front of the animal triggers the legs to flare out in preparation for landing.

If the eyes detect these changes, how are the responses implemented?


The flies' wings have "power muscles," controlled by mechanical resonance (as opposed to the nervous system), which drive the wings, combined with neurally activated "steering muscles," which change the configuration of the wing joints. Subtle variations in the timing of impulses correspond to changes in wing movement, controlled by sensors in the "wing-pit." A small wing-like structure, the haltere, controls equilibrium by beating constantly.

Voluntary control is accomplished by the halteres, whose signals can interfere with the autonomic control of the stroke cycle. The halteres have "steering muscles" as well, and information derived from the visual system can turn off the haltere or trick it into a "virtual" problem requiring a response.

Dickinson has also studied the aerodynamics of insect flight using a device called the "Robo Fly," an oversized mechanical insect wing suspended in a large tank of mineral oil. Interestingly, Dickinson observed that the average lift generated by the flies' wings is under its body weight; the flies use three mechanisms to overcome this, including rotating the wings (rotational lift).

Insects are extraordinarily robust creatures, and because Dickinson analyzed the problem in a systems-oriented way, these observations and analysis are immediately applicable to technology; the response system can be used as an efficient search algorithm for control systems in autonomous vehicles, for example.

THE AFS WORKSHOP

Summarized by Garry Zacheiss, MIT

Love Hornquist-Astrand of the Arla development team presented the Arla status report. Arla 0.35.8 has been released. Scheduled for release soon are improved support for Tru64 UNIX, MacOS X, and FreeBSD, improved volume handling, and implementation of more of the vos/pts subcommands.


It was stressed that MacOS X is considered an important platform, and that a GUI configuration manager for the cache manager, a GUI ACL manager for the Finder, and a graphical login that obtains Kerberos tickets and AFS tokens at login time were all under development.

Future goals planned for the underlying AFS protocols include GSSAPI/SPNEGO support for Rx, performance improvements to Rx, an enhanced disconnected mode, and IPv6 support for Rx; an experimental patch is already available for the latter. Future Arla-specific goals include improved performance, partial file reading, and increased stability for several platforms. Work is also in progress for the RXGSS protocol extensions (integrating GSSAPI into Rx). A partial implementation exists, and work continues as developers find time.

Derrick Brashear of the OpenAFS development team presented the OpenAFS status report. OpenAFS 1.2.5 was released immediately prior to the conference; it fixed a remotely exploitable denial-of-service attack in several OpenAFS platforms, most notably IRIX and AIX. Future work planned for OpenAFS includes better support for MacOS X, including working around Finder interaction issues. Better support for the BSDs is also planned: FreeBSD has a partially implemented client, while NetBSD and OpenBSD have only server programs available right now. AIX 5, Tru64 5.1A, and MacOS X 10.2 (a.k.a. Jaguar) are all planned for the future. Other planned enhancements include nested groups in the ptserver (code donated by the University of Michigan awaits integration), disconnected AFS, and further work on a native Windows client. Derrick stated that the guiding principles of OpenAFS were to maintain compatibility with IBM AFS, support new platforms, ease the administrative burden, and add new functionality.

AFS performance was discussed. The openafs.org cell consists of two file servers, one in Stockholm, Sweden, and one in Pittsburgh, PA.

AFS works reasonably well over trans-Atlantic links, although OpenAFS clients don't determine which file server to talk to very effectively. Arla clients use RTTs to the server to determine the optimal file server to fetch replicated data from. Modifications to OpenAFS to support this behavior in the future are desired.

Jimmy Engelbrecht and Harald Barth of KTH discussed their AFSCrawler script, written to determine how many AFS cells and clients were in the world, what implementations/versions they were (Arla vs. IBM AFS vs. OpenAFS), and how much data was in AFS. The script unfortunately triggered a bug in IBM AFS 3.6–derived code, causing some clients to panic while handling a specific RPC. This has since been fixed in OpenAFS 1.2.5 and the most recent IBM AFS patch level, and all AFS users are strongly encouraged to upgrade. No release of Arla is vulnerable to this particular denial-of-service attack. There was an extended discussion of the usefulness of this exploration. Many sites believed this was useful information and such scanning should continue in the future, but only on an opt-in basis.

Many sites face the problem of managing Kerberos/AFS credentials for batch-scheduled jobs. Specifically, most batch processing software needs to be modified to forward tickets as part of the batch submission process, renew tickets and tokens while the job is in the queue and for the lifetime of the job, and properly destroy credentials when the job completes. Ken Hornstein of NRL was able to pay a commercial vendor to support Kerberos 4/5 credential management in their product, although they did not implement AFS token management. MIT has implemented some of the desired functionality in OpenPBS and might be able to make it available to other interested sites.

Tools to simplify AFS administration were discussed, including:

AFS Balancer: A tool written by CMU to automate the process of balancing disk usage across all servers in a cell. Available from ftp://ftp.andrew.cmu.edu/pub/AFS-Tools/balance-1.1b.tar.gz.

Themis: Themis is KTH's enhanced version of the AFS tool "package" for updating files on local disk from a central AFS image. KTH's enhancements include allowing the deletion of files, simplifying the process of adding a file, and allowing the merging of multiple rule sets for determining which files are updated. Themis is available from the Arla CVS repository.

Stanford was presented as an example of a large AFS site. Stanford's AFS usage consists of approximately 14TB of data in AFS, in the form of approximately 100,000 volumes; 33TB of storage is available in their primary cell, ir.stanford.edu. Their file servers consist entirely of Solaris machines running a combination of Transarc 3.6 patch-level 3 and OpenAFS 1.2.x, while their database servers run OpenAFS 1.2.2. Their cell consists of 25 file servers, using a combination of EMC and Sun StorEdge hardware. Stanford continues to use the kaserver for their authentication infrastructure, with future plans to migrate entirely to an MIT Kerberos 5 KDC.

Stanford has approximately 3,400 clients on their campus, not including SLAC (Stanford Linear Accelerator); approximately 2,100 AFS clients from outside Stanford contact their cell every month. Their supported clients are almost entirely IBM AFS 3.6 clients, although they plan to release OpenAFS clients soon. Stanford currently supports only UNIX clients. There is some on-campus presence of Windows clients, but Stanford has never publicly released or supported it. They do intend to release and support the MacOS X client in the near future.

All Stanford students, faculty, and staff are assigned AFS home directories, with a default quota of 50MB, for a total of approximately 550GB of user home directories.

Other significant uses of AFS storage are data volumes for workstation software (400GB) and volumes for course Web sites and assignments (100GB).

AFS usage at Intel was also presented. Intel has been an AFS site since 1994. They had bad experiences with the IBM 3.5 Linux client; their experience with OpenAFS on Linux 2.4.x kernels has been much better. They use and are satisfied with the OpenAFS IA-64 Linux port. Intel has hundreds of OpenAFS 1.2.3 and 1.2.4 clients in many production cells accessing data stored on IBM AFS file servers. They have not encountered any interoperability issues. Intel has some concerns about OpenAFS: they would like to purchase commercial support for OpenAFS and to see OpenAFS support for HP-UX on both PA-RISC and Itanium hardware. HP-UX support is currently unavailable due to a specific HP-UX header file being unavailable from HP; this may be available soon. Intel has not yet committed to migrating their file servers to OpenAFS and are unsure if they will do so without commercial support.

Backups are a traditional topic of discussion at AFS workshops, and this time was no exception. Many users complain that the traditional AFS backup tools ("backup" and "butc") are complex and difficult to automate, requiring many home-grown scripts and much user intervention for error recovery. An additional complaint was that the traditional AFS tools do not support file- or directory-level backups and restores; data must be backed up and restored at the volume level.

Mitch Collinsworth of Cornell presented work to make AMANDA, the free backup software from the University of Maryland, suitable for AFS backups. Using AMANDA for AFS backups allows one to share AFS backup tapes with non-AFS backups, easily run multiple backups in parallel, automate error recovery, and provide a robust degraded mode that prevents tape errors from stopping backups altogether.

Their implementation allows for full volume restores as well as individual directory and file restores. They have finished coding this work and are in the process of testing and documenting it.

Peter Honeyman of CITI at the University of Michigan spoke about work he has proposed to replace Rx with RPCSEC GSS in OpenAFS; this would allow AFS to use a TCP-based transport mechanism rather than the UDP-based Rx, and possibly gain better congestion control, dynamic adaptation, and fragmentation avoidance as a result. RPCSEC GSS uses the GSSAPI to authenticate SUN ONC RPC. RPCSEC GSS is transport agnostic, provides strong security, is a developing Internet standard, and has multiple open source implementations. Backward compatibility with existing AFS servers and clients is an important goal of this project.


will be solved in a legal environment. You need to start thinking about how to solve the problems and how the solutions will affect us.

The entire speech (including slides) is available at http://www.counterpane.com/presentation4.pdf.

LIFE IN AN OPEN SOURCE STARTUP

Daryll Strauss, Consultant

Summarized by Teri Lampoudi

This talk was packed with morsels of insight and tidbits of information about what life in an open source startup is like (though "was like" might be more appropriate), what the issues are in starting, maintaining, and commercializing an open source project, and the way hardware vendors treat open source developers. Strauss began by tracing the timeline of his involvement with developing 3-D graphics support for Linux, from the 3dfx Voodoo1 driver for Linux in 1997, to the establishment of Precision Insight in 1998, to its buyout by VA Linux, to the dismantling of his group by VA, to what the future may hold.

Obviously, there are benefits from doing open source development, both for the developer and for the end user. An important point was that open source develops entire technologies, not just products. A misapprehension is the inherent difficulty of managing a project that accepts code from a large number of developers, some volunteer and some paid. The notion that code can just be thrown over the wall into the world is a Big Lie.

How does one start and keep afloat an open source development company? In the case of Precision Insight, the subject matter – developing graphics and OpenGL for Linux – required a considerable amount of expertise: intimate knowledge of the hardware, the libraries, X, and the Linux kernel, as well as the end applications. Expertise is marketable.


What helps even further is having a visible virtual team of experts, people who have an established track record of contributions to open source in the particular area of expertise. Support from the hardware and software vendors came next. While everyone was willing to contract PI to write the portions of code that were specific to their hardware, no one wanted to pay the bill for developing the underlying foundation that was necessary for the drivers to be useful. In the PI model, the infrastructure cost was shared by several customers at once, and the technology kept moving. Once a driver is written for a vendor, the vendor is perfectly capable of figuring out how to write the next driver necessary, thereby obviating the need to contract the job. This, and questions of protecting intellectual property – hardware design in this case – are deeply problematic with the open source mode of development. This is not to say that there is no way to get over them, but that they are likely to arise more often than not.

Management issues seem equally important. In the PI setting, contracts were flexible – both a good and a bad thing – and developers overworked. Further, development "in a fishbowl," that is, under public scrutiny, is not easy. Strauss stressed the value of good communication and the use of various out-of-sync communication methods like IRC and mailing lists.

Finally, Strauss discussed what portions of software can be free and what can be proprietary. His suggestion was that horizontal markets want to be free, whereas vertically one can develop proprietary solutions. The talk closed with a small stroll through the problems that project politics brings up, from things like merging branches to mailing list and source tree readability.

GENERAL TRACK SESSIONS

FILE SYSTEMS

Summarized by Haijin Yan

STRUCTURE AND PERFORMANCE OF THE DIRECT ACCESS FILE SYSTEM

Kostas Magoutis, Salimah Addetia, Alexandra Fedorova, and Margo I. Seltzer, Harvard University; Jeffrey Chase, Andrew Gallatin, Richard Kisley, and Rajiv Wickremesinghe, Duke University; Eran Gabber, Lucent Technologies

This paper won the Best Paper award

The Direct Access File System (DAFS) is a new standard for network-attached storage over direct-access transport networks. DAFS takes advantage of user-level network interface standards that enable client-side functionality in which remote data access resides in a library rather than in the kernel. This reduces the overhead of memory copy for data movement and protocol overhead. Remote Direct Memory Access (RDMA) is a direct access transport network that allows the network adapter to reduce copy overhead by accessing application buffers directly.

This paper explores the fundamental structural and performance characteristics of network file access using a user-level file system structure on a direct-access transport network with RDMA. It describes DAFS-based client and server reference implementations for FreeBSD and reports experimental results comparing DAFS to a zero-copy NFS implementation. It illustrates the benefits and trade-offs of these techniques to provide a basis for informed choices about deployment of DAFS-based systems and similar extensions to other network file protocols, such as NFS. Experiments show that DAFS gives applications direct control over an I/O system and increases the client CPU's usage while the client is doing I/O. Future work includes how to address longstanding problems related to the integration of the application and file system for high-performance applications.

CONQUEST: BETTER PERFORMANCE THROUGH A DISK/PERSISTENT-RAM HYBRID FILE SYSTEM

An-I A. Wang, Peter Reiher, and Gerald J. Popek, UCLA; Geoffrey M. Kuenning, Harvey Mudd College

Motivated by the declining cost of persistent RAM, the authors propose the Conquest file system, which stores all small files, metadata, executables, and shared libraries in persistent RAM; disks hold only the data content of remaining large files. Compared to alternatives such as caching and RAM file systems, Conquest has the advantages of efficiency, consistency, and reliability at a reduced cost. Using popular benchmarks, experiments show that Conquest incurs little overhead while achieving faster performance. Future work includes designing mechanisms for adjusting the file-size threshold dynamically and finding a better disk layout for large data blocks.

EXPLOITING GRAY-BOX KNOWLEDGE OF BUFFER-CACHE MANAGEMENT

Nathan C. Burnett, John Bent, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin, Madison

Knowing what algorithm is used to manage the buffer cache is very important for improving application performance. However, there is currently no interface for finding this algorithm. This paper introduces Dust, a simple fingerprinting tool that is able to identify the buffer-cache replacement policy. Dust automatically identifies the cache size and replacement policy based on the configuring attributes of access orders: recency, frequency, and long-term history. Through simulation, Dust was able to distinguish between a variety of replacement algorithm policies found in the literature: FIFO, LRU, LFU, Clock, Segmented FIFO, 2Q, and LRU-K. Further experiments of fingerprinting real operating systems such as NetBSD, Linux, and Solaris show that Dust is able to identify the hidden cache replacement algorithm.
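To make the fingerprinting idea concrete, here is a self-contained toy in C (not Dust itself, whose probe patterns are considerably more careful): it replays one access string against a tiny simulated cache under FIFO and under LRU, and the hit/miss outcome of a final probe is enough to tell the two policies apart.

/* Toy illustration of policy fingerprinting: the probe "ABCAD" leaves
 * block A resident under LRU (B is evicted) but evicted under FIFO,
 * so a final access to A reveals which policy is in use. */
#include <stdio.h>
#include <string.h>

#define WAYS 3

struct line { char tag; int inserted; int last_used; int valid; };

/* Returns 1 on hit, 0 on miss; evicts per policy ("FIFO" or "LRU"). */
static int access_block(struct line c[WAYS], char tag, int now, const char *policy)
{
    for (int i = 0; i < WAYS; i++) {
        if (c[i].valid && c[i].tag == tag) {   /* hit: update recency */
            c[i].last_used = now;
            return 1;
        }
    }
    int victim = 0;                            /* miss: pick a victim */
    for (int i = 1; i < WAYS; i++) {
        if (!c[victim].valid)
            break;                             /* found an empty slot */
        if (!c[i].valid) { victim = i; continue; }
        int key_i = strcmp(policy, "FIFO") == 0 ? c[i].inserted : c[i].last_used;
        int key_v = strcmp(policy, "FIFO") == 0 ? c[victim].inserted : c[victim].last_used;
        if (key_i < key_v)
            victim = i;                        /* oldest by the policy's key */
    }
    c[victim] = (struct line){ tag, now, now, 1 };
    return 0;
}

int main(void)
{
    const char *policies[] = { "FIFO", "LRU" };
    const char probe[] = "ABCAD";              /* warm-up, re-reference, evictor */
    for (int p = 0; p < 2; p++) {
        struct line cache[WAYS] = { 0 };
        int now = 0;
        for (int i = 0; probe[i]; i++)
            access_block(cache, probe[i], now++, policies[p]);
        printf("%-4s: final probe of A %s\n", policies[p],
               access_block(cache, 'A', now, policies[p]) ? "hits" : "misses");
    }
    return 0;
}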

By knowing the underlying cache replacement policy, a cache-aware Web server can reschedule the requests on a cached-request-first policy to obtain performance improvement. Experiments show that cache-aware scheduling improves average response time and system throughput.

OPERATING SYSTEMS (AND DANCING BEARS)

Summarized by Matt Butner

THE JX OPERATING SYSTEM

Michael Golm, Meik Felser, Christian Wawersich, and Jürgen Kleinöder, University of Erlangen-Nürnberg

The talk opened by emphasizing the need for, and the practicality of, a functional Java OS. The implementation, the JX OS, attempts to mimic the recent trend toward using highly abstracted languages in application development in order to create an OS as functional and powerful as one written in a lower-level language.

JX is a micro-kernel solution that uses separate JVMs for each entity of the kernel and, in some cases, for each application. The separated domains do not share objects and have no thread migration, and each domain implements its own code. The interesting part of the presentation was the discussion of system-level programming with Java; key areas discussed were memory management and interrupt handlers. The authors concluded by noting that their type-safe and modular system resulted in a robust system with great configuration flexibility and acceptable performance.

DESIGN EVOLUTION OF THE EROS SINGLE-LEVEL STORE

Jonathan S. Shapiro, Johns Hopkins University; Jonathan Adams, University of Pennsylvania

This presentation outlined current characteristics of file systems and some of their least desirable characteristics.

It was based on the revival of the EROS single-level-store approach for current attempts to capitalize on system consistency and efficiency. Their solution did not relate directly to EROS, but rather to the design and use of the constructs and notions on which EROS is based. By extending the mapping of the memory system to include the disk, systems are further able to ensure global persistence without regard for a disk structure. In explaining the need for absolute system consistency and an environment which, by definition, is not allowed to crash, the goal of having such an exhaustive design becomes clear. The cool part of this work is the availability of its design and architecture to the public.

THINK: A SOFTWARE FRAMEWORK FOR COMPONENT-BASED OPERATING SYSTEM KERNELS

Jean-Philippe Fassino, France Télécom R&D; Jean-Bernard Stefani, INRIA; Julia Lawall, DIKU; Gilles Muller, INRIA

This presentation discussed the need for component-based operating systems and respective structures to ensure flexibility for arbitrary-sized systems. Think provides a binding model that maps uniformed components for OS developers and architects to follow, ensuring consistent implementations for an arbitrary system size. However, the goal is not to force developers into a predefined kernel but to promote the use of certain components in varied ways.

Think's concentration is primarily on embedded systems, where short development time is necessary but is constrained by rigorous needs and limited resources. The need to build flexible systems in such an environment can be costly and implementation-specific, but Think creates an environment supported by the ability to dynamically load type-safe components. This allows for more flexible systems that retain functionality because of Think's ability to bind fine-grained components.


Benchmarks revealed that dedicated micro-kernels can show performance improvements in comparison to monolithic kernels. Notable future work includes building components for low-end appliances and the development of a real-time OS component library.

BUILDING SERVICES

Summarized by Pradipta De

NINJA: A FRAMEWORK FOR NETWORK SERVICES

J. Robert von Behren, Eric A. Brewer, Nikita Borisov, Michael Chen, Matt Welsh, Josh MacDonald, Jeremy Lau, and David Culler, University of California, Berkeley

Robert von Behren presented Ninja, an ongoing project that aims to provide a framework for building robust and scalable Internet-based network services like Web-hosting, instant-messaging, email, and file-sharing applications. Robert drew attention to the difficulties of writing cluster applications. One has to take care of data consistency and issues of concurrency, and for robust applications there are problems related to fault tolerance. Ninja works as a wrapper to relieve the user of these problems. The goal of the project is building network services that are scalable and highly available, maintaining persistent data, providing graceful degradation, and supporting online evolution.

The use of clusters distinguishes this setup from generic distributed systems in terms of reliability and security, as well as providing a partition-free network. The next important feature of this project is the new programming model, which is more restrictive than a general programming model but still expressive enough to write most of the applications. This model, described as "single program multiple connection," uses intertask parallelism instead of multithreaded concurrency and is characterized by all nodes running the same program, with connections delegated to the nodes by a centralized connection manager (CM). A CM takes care of hiding the details of mapping an external connection to an internal node.


Ninja also uses a design called "staged event-driven architecture," where each service is broken down into a set of stages connected by event queues. This architecture is suitable for a modular design and helps in graceful degradation by adaptive load shedding from the event queues.

The presentation concluded with examples of implementation and evaluation of a Web server and an email system called NinjaMail, to show the ease of authoring and the efficacy of using the Ninja framework for service development.

USING COHORT-SCHEDULING TO ENHANCE SERVER PERFORMANCE

James R. Larus and Michael Parkes, Microsoft Research

James Larus presented a new scheduling policy for increasing the throughput of server applications. Cohort scheduling batches execution of similar operations arising in different server requests. The usual programming paradigm in handling server requests is to spawn multiple concurrent threads and switch from one thread to another whenever a thread gets blocked for I/O. Since the threads in a server mostly execute unrelated pieces of code, the locality of reference is reduced, and hence the effectiveness of different caching mechanisms. One way to solve this problem is to throw in more hardware. But Larus presented a complementary view, where the program behavior is investigated and used to improve performance.

The problem in this scheme is to identify pieces of code that can be batched together for processing. One simple way is to look at the next program counter values and use them to club threads together. However, this talk presented a new programming abstraction, "staged computation," which replaces the thread model with "stages." A stage is an abstraction for operations with similar behavior and locality. The StagedServer library can be used for programming in this model.

It is a collection of C++ classes that implement staged computation and cohort scheduling on either a uniprocessor or a multiprocessor. It can be used to define stages.

The presentation showed the experimental evaluation of cohort scheduling over the thread-based model by implementing two servers: a Web server, which is mainly I/O bound, and a publish-subscribe server, which is mainly compute bound. The SURGE benchmark was used for the first experiment and the Fabret workload for the second. The results showed that the cohort-scheduling-based implementation gave better throughput than the thread-based implementation at high loads.

NETWORK PERFORMANCE

Summarized by Xiaobo Fan

ETE: PASSIVE END-TO-END INTERNET SERVICE PERFORMANCE MONITORING

Yun Fu and Amin Vahdat, Duke University; Ludmila Cherkasova and Wenting Tang, HP Labs

This paper won the Best Student Paperaward

Ludmila Cherkasova began by listing several questions most Web service providers want answered in order to improve service quality. She reviewed the difficulties of making accurate and efficient end-to-end Web service measurement and the shortcomings of currently available approaches. They propose a passive trace-based architecture called EtE to monitor Web server performance on behalf of end users.

The first step is to collect network packets passively. The second module reconstructs all TCP connections and extracts HTTP transactions. To reconstruct Web page accesses, they first build a knowledge base indexed by client IP and URL and then group objects to the related Web pages they are embedded in. Statistical analysis is used to handle non-matched objects. EtE Monitor can generate three groups of metrics to measure Web service performance: response time, Web caching, and page abortion.

To demonstrate the benefits of EtE Monitor, Cherkasova talked about two case studies and, based on the calculated metrics, gave some insightful explanations about what's happening behind the variations of Web performance. Through validation, they claim their approach provides a very close approximation to the real scenario.

THE PERFORMANCE OF REMOTE DISPLAY MECHANISMS FOR THIN-CLIENT COMPUTING

S. Jae Yang, Jason Nieh, Matt Selsky, and Nikhil Tiwari, Columbia University

Noting the trend toward thin-client computing, the authors compared different techniques and design choices in measuring the performance of six popular thin-client platforms – Citrix MetaFrame, Microsoft Terminal Services, Sun Ray, Tarantella, VNC, and X. After pointing out several challenges in benchmarking thin clients, Yang proposed slow-motion benchmarking to achieve non-invasive packet monitoring and consistent visual quality. Basically, they insert delays between separate visual events in the benchmark applications on the server side so that the client's display update can catch up with the server's processing speed. The experiments are conducted on an emulated network over a range of network bandwidths.

Their results show that thin clients can provide good performance for Web applications in LAN environments, but only some platforms performed well for the video benchmark. Pixel-based encoding may achieve better performance and bandwidth efficiency than high-level graphics. Display caching and compression should be used with care.

A MECHANISM FOR TCP-FRIENDLY TRANSPORT-LEVEL PROTOCOL COORDINATION

David E. Ott and Ketan Mayer-Patel, University of North Carolina, Chapel Hill

A revised transport-level protocol optimized for cluster-to-cluster (C-to-C) applications is what David Ott tries to explain in his talk. A C-to-C application class is identified as one set of processes communicating to another set of processes across a common Internet path. The fundamental problem with C-to-C applications is how to coordinate all C-to-C communication flows so that they share a consistent view of the common C-to-C network, adapt to changing network conditions, and cooperate to meet specific requirements. Aggregation points (AP) are placed at the first and last hop of the common data path to probe network conditions (latency, bandwidth, loss rate, etc.). To carry and transfer this information, a new protocol – the Coordination Protocol (CP) – is inserted between the network layer (IP) and the transport layer (TCP, UDP, etc.). Ott illustrated how the CP header is updated and used when packets originate from a source and traverse through local and remote APs to arrive at their destination, and how an AP maintains a per-cluster state table and detects network conditions. Through simulation results, this coordination mechanism appears effective in sharing common network resources among C-to-C communication flows.
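The paper defines the actual CP header format; the struct below is only a hypothetical sketch of the kind of fields such a shim header, sitting between IP and TCP/UDP, might carry (names, widths, and units are invented for illustration):

/* Hypothetical sketch of a coordination-shim header carried between the
 * IP header and the transport header. Field names and widths are invented
 * for illustration; see the paper for the real CP header definition. */
#include <stdint.h>
#include <stdio.h>

struct cp_header_sketch {
    uint16_t cluster_id;       /* which C-to-C aggregate this flow belongs to */
    uint8_t  proto;            /* the real transport protocol that follows    */
    uint8_t  flags;            /* e.g., "measurements valid" bits             */
    uint32_t path_latency_us;  /* latency estimate stamped in by the APs      */
    uint32_t path_bw_kbps;     /* available-bandwidth estimate                */
    uint16_t loss_rate_x1000;  /* loss rate scaled by 1000                    */
    uint16_t reserved;
};

int main(void)
{
    printf("sketch CP header size: %zu bytes\n", sizeof(struct cp_header_sketch));
    return 0;
}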

STORAGE SYSTEMS

Summarized by Praveen Yalagandula

MY CACHE OR YOURS? MAKING STORAGE MORE EXCLUSIVE

Theodore M. Wong, Carnegie Mellon University; John Wilkes, HP Labs

Theodore Wong explained the inefficiency of current "inclusive" caching schemes in storage area networks – when a client accesses a block, the block is read from the disk and is cached at both the disk array cache and the client's cache. He then presented the concept of "exclusive" caching, where a block accessed by a client is only cached in that client's cache. On eviction from the client's cache, they have come up with a DEMOTE operation to move the data block to the tail of the array cache.
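A minimal sketch of the idea, assuming a client-side LRU list and an array cache that accepts demoted blocks at its tail (the function names and data structures here are illustrative, not the paper's interfaces):

/* Illustrative sketch of exclusive caching with DEMOTE: when the client
 * evicts a block, instead of silently dropping it, the block is handed to
 * the array cache, which appends it at its tail. Not the paper's API. */
#include <stdio.h>

#define CLIENT_SLOTS 4
#define ARRAY_SLOTS  8

static int client_cache[CLIENT_SLOTS];  /* block numbers; 0 = empty slot */
static int array_cache[ARRAY_SLOTS];

/* Append a demoted block at the tail of the array cache, dropping the
 * head (oldest entry) if the array cache is full. */
static void demote(int block)
{
    int i;
    for (i = 0; i < ARRAY_SLOTS && array_cache[i] != 0; i++)
        ;
    if (i == ARRAY_SLOTS) {                   /* full: drop the head */
        for (int j = 1; j < ARRAY_SLOTS; j++)
            array_cache[j - 1] = array_cache[j];
        i = ARRAY_SLOTS - 1;
    }
    array_cache[i] = block;
}

/* Bring a block into the client cache; the displaced block is DEMOTEd
 * rather than cached redundantly in both places ("exclusive" caching). */
static void client_read(int block)
{
    int victim = client_cache[CLIENT_SLOTS - 1];  /* LRU slot */
    for (int j = CLIENT_SLOTS - 1; j > 0; j--)
        client_cache[j] = client_cache[j - 1];
    client_cache[0] = block;                      /* MRU slot */
    if (victim != 0)
        demote(victim);
}

int main(void)
{
    for (int b = 1; b <= 6; b++)                  /* read blocks 1..6 */
        client_read(b);
    printf("array cache now holds demoted blocks:");
    for (int i = 0; i < ARRAY_SLOTS; i++)
        if (array_cache[i] != 0)
            printf(" %d", array_cache[i]);
    printf("\n");
    return 0;
}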

The new exclusive caching schemes are evaluated on both single-client and multiple-client systems. The single-client results are presented for two different types of synthetic workloads: Random (transaction-processing-type workloads) and Zipf (typical Web workloads). The exclusive policy was quite effective in achieving higher hit rates and lower read latencies. They also showed a 2.2-times hit-rate improvement over inclusive techniques in the case of a real-life workload, httpd, with a single-client setting.

The DEMOTE scheme also performed well in the case of multiple-client systems when the data accessed by clients is disjoint. In the case of shared-data workloads, this scheme performed worse than the inclusive schemes. Theodore then presented an adaptive exclusive caching scheme – a block accessed by a client is placed at the tail in the client's cache and also in the array cache at an appropriate place determined by the popularity of the block. The popularity of the block is measured by maintaining a ghost cache to accumulate the number of times each block is accessed. This new adaptive scheme achieved a maximum of 52% speedup in the mean latency in the experiments with real-life workloads.

For more information, visit http://www.cs.cmu.edu/~tmwong/research/ and http://www.hpl.hp.com/research/itc/csl/ssp/.

BRIDGING THE INFORMATION GAP IN STORAGE PROTOCOL STACKS

Timothy E. Denehy, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin, Madison

Currently, there is a huge information gap between storage systems and file systems. The interface exposed by storage systems to file systems, based on blocks and providing only read/write interfaces, is very narrow. This leads to poor performance as a whole, because of duplicated functionality in both systems


and reduced functionality resulting from storage systems' lack of file information.

The speaker presented two enhancements: ExRAID, an exposed RAID, and ILFS, the Informed Log-Structured File System. ExRAID is an enhancement to the block-based RAID storage system that exposes the following three types of information to the file system: (1) regions – contiguous portions of the address space comprising one or multiple disks; (2) performance information about queue lengths and throughput of the regions, revealing disk heterogeneity to the file systems; and (3) failure information – dynamically updated information conveying the number of tolerable failures in each region.
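As a hedged illustration of the shape of that exposed interface (the names and types below are invented; the paper defines the real interface), the three kinds of information map naturally onto a per-region descriptor that a file system could query:

/* Hypothetical per-region descriptor for an ExRAID-style exposed interface.
 * Names and types are illustrative only; the paper defines the real API. */
#include <stdint.h>
#include <stdio.h>

struct region_info_sketch {
    uint64_t start_block;        /* (1) region boundaries in the address space */
    uint64_t num_blocks;
    uint32_t queue_length;       /* (2) current performance characteristics,   */
    uint32_t throughput_kbps;    /*     revealing heterogeneity across regions */
    uint32_t tolerable_failures; /* (3) how many failures this region survives */
};

/* A file system such as ILFS might poll these descriptors to place new
 * data on the fastest region or to pick a redundancy level per file. */
int main(void)
{
    printf("sketch region descriptor size: %zu bytes\n",
           sizeof(struct region_info_sketch));
    return 0;
}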

ILFS allows online incremental expansion of the storage space, performs dynamic parallelism using ExRAID's performance information to perform well on heterogeneous storage systems, and provides a range of different mechanisms, with different granularities, for redundancy of files using ExRAID's region failure characteristics. These new features are added to LFS with only a 19% increase in the code size.

More information: http://www.cs.wisc.edu/wind/.

MAXIMIZING THROUGHPUT IN REPLICATED DISK STRIPING OF VARIABLE BIT-RATE STREAMS

Stergios V. Anastasiadis, Duke University; Kenneth C. Sevcik and Michael Stumm, University of Toronto

There is an increasing demand for the continuous real-time streaming of media files. Disk striping is a common technique used for supporting many connections on the media servers. But even with 1.2 million hours of mean time between failures, there will be more than one disk failure per week with 7,000 disks. Fault tolerance can be achieved by using data replication and reserving extra bandwidth during normal operation. This work focused on supporting the most common variable bit-rate stream formats (e.g., MPEG).


The authors used the prototype media server EXEDRA to implement and evaluate different replication and bandwidth reservation schemes. This system supports a variable bit-rate scheme, does stride-based disk allocation, and is capable of supporting variable-grain disk striping. Two replication policies are presented: deterministic – data blocks are replicated in a round-robin fashion on the secondary replicas; and random – data blocks are replicated on randomly chosen secondary replicas. Two bandwidth reservation techniques are also presented: mirroring reservation – disk bandwidth is reserved for both the primary and replicas of the media file during playback; and minimum reservation – a more efficient scheme in which bandwidth is reserved only for the sum of the primary data access time and the maximum of the backup data access times required in each round.

Experimental results showed that deterministic replica placement is better than random placement for small disks, minimum disk bandwidth reservation is twice as good as mirroring in the throughput achieved, and fault tolerance can be achieved with a minimal impact on throughput.

TOOLS

Summarized by Li Xiao

SIMPLE AND GENERAL STATISTICAL PROFILING WITH PCT

Charles Blake and Steven Bauer, MIT

This talk introduced a Profile Collection Toolkit (PCT) – a sampling-based CPU profiling facility. A novel aspect of PCT is that it allows sampling of semantically rich data such as function call stacks, function parameters, local or global variables, CPU registers, or other execution contexts. The design objectives of PCT were driven by user needs and the inadequacies or inaccessibility of prior systems.

The rich data collection capability is achieved via a debugger-controller program, dbct1.

The talk then introduced the profiling collector and data collection activation. PCT file format options provide flexibility and various ways to store exact states. PCT also has analysis options; two examples were given – "Simple" and "Mixed User and Linux Code."

The talk then went into how general sampling helps in the following areas: a debugger-controller state machine, example script fragments, a simple example program, and general value reports in terms of call-path histograms, numeric value histograms, and data generality. Related work such as gprof and Expect was then summarized. The contribution of this work is to provide a portable, general value sampling tool. The limitations are that PCT does not support much of strip, and count inaccuracies happen because of its statistical nature. Based on this limitation, -g is preferred over strip.

Possible future directions included supporting more debuggers, such as dbx and xdb, script generators for other kinds of traces, more canned reports for general values, and a libgdb-based library sampler. PCT is available at http://pdos.lcs.mit.edu/pct/. PCT runs on almost any UNIX-like system.

ENGINEERING A DIFFERENCING AND COMPRESSION DATA FORMAT

David G. Korn and Kiem-Phong Vo, AT&T Laboratories – Research

This talk began with an equation: "Differencing + Compression = Delta Compression." Compression removes information redundancy in a data set. Examples are gzip, bzip, compress, and pack. Differencing encodes differences between two data sets. Examples are diff -e, fdelta, bdiff, and xdelta. Delta compression compresses a data set given another, combining differencing and compression, and reduces to pure compression when there's no commonality.

After presenting an overall scheme for a delta compressor and showing delta compression performance, this talk discussed the encoding format of a newly designed Vcdiff for delta compression.

A Vcdiff instruction code table consists of 256 entries, each coding up to a pair of instructions, and a delta encodes 1-byte indices into this table along with any additional data.

The talk then showed Vcdiff's performance with Web data collected from CNN, compared with the W3C standard Gdiff, Gdiff+gzip, and diff+gzip, where Gdiff was computed using Vcdiff delta instructions. The results of two experiments, "First" and "Successive," were presented. In "First," each file is compressed against the first file collected; in "Successive," each file is compressed against the one from the previous hour. diff+gzip did not work well because diff is line-oriented. Vcdiff performed favorably compared with the other formats.

Vcdiff is part of the Vcodex package. Vcodex is a platform for all common data transformations: delta compression, plain compression, encryption, and transcoding (e.g., uuencode, base64). It is structured in three layers for maximum usability. The base library uses Disciplines and Methods interfaces.

The code for Vcdiff can be found at http://www.research.att.com/sw/tools/. They are moving Vcdiff to the IETF standard as a comprehensive platform for transforming data. Please refer to http://www.ietf.org/internet-drafts/draft-korn-vcdiff-06.txt.

WHERE IN THE NET

Summarized by Amit Purohit

A PRECISE AND EFFICIENT EVALUATION OF THE PROXIMITY BETWEEN WEB CLIENTS AND THEIR LOCAL DNS SERVERS

Zhuoqing Morley Mao, University of California, Berkeley; Charles D. Cranor, Fred Douglis, Michael Rabinovich, Oliver Spatscheck, and Jia Wang, AT&T Labs – Research

Content Distribution Networks (CDNs) attempt to improve Web performance by delivering Web content to end users from servers located at the edge of the network.

When a Web client requests content, the CDN dynamically chooses a server to route the request to, usually the one that is nearest the client. As CDNs have access only to the IP address of the local DNS server (LDNS) of the client, the CDN's authoritative DNS server maps the client's LDNS to a geographic region within a particular network and combines that with network and server load information to perform CDN server selection.

This method has two limitations. First, it is based on the implicit assumption that clients are close to their LDNS. This may not always be valid. Second, a single request from an LDNS can represent a differing number of Web clients – called the hidden load factor. Knowledge of the hidden load factor can be used to achieve better load distribution.

The extent of the first limitation and its impact on CDN server selection is dealt with. To determine the associations between clients and their LDNS, a simple, non-intrusive, and efficient mapping technique was developed. The data collected was used to study the impact of proximity on DNS-based server selection, using four different proximity metrics: (1) autonomous system (AS) clustering – observing whether a client is in the same AS as its LDNS – concluded that LDNS is very good for coarse-grained server selection, as 64% of the associations belong to the same AS; (2) network clustering – observing whether a client is in the same network-aware cluster (NAC) – implied that DNS is less useful for finer-grained server selection, since only 16% of clients and their LDNS are in the same NAC; (3) traceroute divergence – the length of the divergent paths to the client and its LDNS from a probe point, using traceroute – implies that most clients are topologically close to their LDNS as viewed from a randomly chosen probe site; and

(4) round-trip time (RTT) correlation (some CDNs select servers based on RTT between the CDN server and the client's LDNS) examines the correlation between the message RTTs from a probe point to the client and its local DNS; the results imply this is a reasonable metric to use to avoid really distant servers.

A study of the impact that client-LDNS associations have on DNS-based server selection concludes that knowing the client's IP address would allow more accurate server selection in a large number of cases. The optimality of the server selection also depends on the server density, placement, and selection algorithms.

Further information can be found at http://www.eecs.berkeley.edu/~zmao/myresearch.html or by contacting zmao@eecs.berkeley.edu.

GEOGRAPHIC PROPERTIES OF INTERNET ROUTING

Lakshminarayanan Subramanian, Venkata N. Padmanabhan, Microsoft Research; Randy H. Katz, University of California at Berkeley

Geographic information can provide insights into the structure and functioning of the Internet, including interactions between different autonomous systems, by analyzing certain properties of Internet routing. It can be used to measure and quantify certain routing properties such as circuitous routing, hot-potato routing, and geographic fault tolerance.

Traceroute has been used to gather the required data, and the Geotrack tool has been used to determine the location of the nodes along each network path. This enables the computation of "linearized distances," which is the sum of the geographic distances between successive pairs of routers along the path.

In order to measure the circuitousness of a path, a metric called the "distance ratio" has been defined as the ratio of the linearized distance of a path to the geographic distance between the source and destination of the path. From the data, it has been observed that the circuitousness of a route depends on both the geographic and network locations of the end hosts.
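In code, the two quantities are straightforward to compute from per-hop router coordinates of the kind a tool like Geotrack yields; the sketch below (illustrative only, with made-up coordinates) sums great-circle hop distances for the linearized distance and divides by the direct source-to-destination distance:

/* Compute linearized distance and distance ratio for one path, given the
 * latitude/longitude of each router hop. Coordinates are made up purely
 * to exercise the code; a real run would use Geotrack output. */
#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

struct coord { double lat, lon; };   /* degrees */

/* Great-circle distance in km between two points (haversine formula). */
static double gc_km(struct coord a, struct coord b)
{
    const double R = 6371.0, rad = M_PI / 180.0;
    double dlat = (b.lat - a.lat) * rad, dlon = (b.lon - a.lon) * rad;
    double h = sin(dlat / 2) * sin(dlat / 2) +
               cos(a.lat * rad) * cos(b.lat * rad) * sin(dlon / 2) * sin(dlon / 2);
    return 2 * R * asin(sqrt(h));
}

int main(void)
{
    /* Hypothetical path: source, two intermediate routers, destination. */
    struct coord path[] = {
        { 37.87, -122.27 },   /* Berkeley, CA */
        { 47.61, -122.33 },   /* Seattle, WA  */
        { 41.88,  -87.63 },   /* Chicago, IL  */
        { 40.71,  -74.01 },   /* New York, NY */
    };
    int hops = sizeof(path) / sizeof(path[0]);

    double linearized = 0.0;
    for (int i = 0; i + 1 < hops; i++)
        linearized += gc_km(path[i], path[i + 1]);

    double direct = gc_km(path[0], path[hops - 1]);
    printf("linearized = %.0f km, direct = %.0f km, distance ratio = %.2f\n",
           linearized, direct, linearized / direct);
    return 0;
}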


A large value of the distance ratio enables us to flag paths that are highly circuitous, possibly (though not necessarily) because of routing anomalies. It is also shown that the minimum delay between end hosts is strongly correlated with the linearized distance of the path.

Geographic information can be used to study various aspects of wide-area Internet paths that traverse multiple ISPs. It was found that end-to-end Internet paths tend to be more circuitous than intra-ISP paths, the cause for this being the peering relationships between ISPs. Also, paths traversing substantial distances within two or more ISPs tend to be more circuitous than paths largely traversing only a single ISP. Another finding is that ISPs generally employ hot-potato routing.

Geographic information can also be used to capture the fact that two seemingly unrelated routers can be susceptible to correlated failures. By using the geographic information of routers, we can construct a geographic topology of an ISP. Using this, we can find the tolerance of an ISP's network to a total network failure in a geographic region.

For further information, contact lakme@cs.berkeley.edu.

PROVIDING PROCESS ORIGIN INFORMATION TO AID IN NETWORK TRACEBACK

Florian P. Buchholz, Purdue University; Clay Shields, Georgetown University

Network traceback is currently limited because host audit systems do not maintain enough information to match incoming network traffic to outgoing network traffic. The talk presented an alternative: assigning origin information to every process and logging it during interactive login creation.

The current implementation concentrates mainly on interactive sessions, in which an event is logged when a new connection is established using SSH or Telnet. The method is effective and could successfully determine stepping stones and reliably detect the source of a DDoS attack.


The speaker then talked about related work done in this area, which led to a discussion of the latest developments in the area of "host causality."

The speaker ended with the limitations and future directions of the framework. This framework could be extended and could find many applications in the area of security. The current implementation doesn't take care of startup scripts and cron jobs, but incorporating the origin information in the FS could solve this problem. In the current implementation, logging is just implemented as a proof of concept. It could be made safe in many ways, and this could be another important aspect of future work.

PROGRAMMING

Summarized by Amit Purohit

CYCLONE: A SAFE DIALECT OF C

Trevor Jim, AT&T Labs – Research; Greg Morrisett, Dan Grossman, Michael Hicks, James Cheney, and Yanling Wang, Cornell University

Cyclone is designed to provide protection against attacks such as buffer overflows, format string attacks, and memory management errors. The current structure of C allows programmers to write vulnerable programs. Cyclone extends C so that it has the safety guarantees of Java while keeping C syntax, types, and semantics untouched.

The Cyclone compiler performs static analysis of the source code and inserts runtime checks into the compiled output at places where the analysis cannot determine that the code execution will not violate safety constraints. Cyclone imposes many restrictions to preserve safety, such as NULL checks. These checks do not exist in normal C.
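For context, the class of bug these checks target is the familiar unchecked-pointer idiom of standard C; the fragment below compiles silently as plain C (it is shown in C for illustration, not Cyclone syntax), whereas a Cyclone-style analysis would either reject it or force bounds and NULL checks at runtime:

/* Plain C: both functions compile without complaint, yet the first can
 * write past the end of 'buf' and the second could dereference NULL.
 * A safe dialect rejects such code or inserts checks at runtime. */
#include <stdio.h>
#include <string.h>

static void copy_arg(const char *arg)
{
    char buf[16];
    strcpy(buf, arg);            /* no bounds check: classic overflow */
    printf("copied: %s\n", buf);
}

static int first_byte(const char *p)
{
    return p[0];                 /* no NULL check before dereference  */
}

int main(int argc, char **argv)
{
    const char *arg = (argc > 1) ? argv[1] : NULL;
    if (arg != NULL)
        copy_arg(arg);           /* unsafe if arg is 16 bytes or more */
    return arg ? first_byte(arg) : 0;
}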

The speaker then talked about some sample code written in Cyclone and how it tackles safety problems. Porting an existing C application to Cyclone is pretty easy, with fewer than 10% change required.

penalty is substantial Cyclone was ableto find many lingering bugs The finalstep of the project will be to write acompiler to convert a normal C programto an equivalent Cyclone program

COOPERATIVE TASK MANAGEMENT WITHOUT MANUAL STACK MANAGEMENT

Atul Adya, Jon Howell, Marvin Theimer, William J. Bolosky, John R. Douceur, Microsoft Research

The speaker described the definitions and motivations behind the distinct concepts of task management, stack management, I/O management, conflict management, and data partitioning. Conventional concurrent programming uses preemptive task management and exploits the automatic stack management of a standard language. In the second approach, cooperative tasks are organized as event handlers that yield control by returning control to the event scheduler, manually unrolling their stacks. In this project they have adopted a hybrid approach that makes it possible for both stack management styles to coexist. Thus, programmers can code assuming one or the other of these stack management styles will operate, depending upon the application. The speaker also gave a detailed example of how to use their mechanism to switch between the two styles.

The talk then continued with the implementation. They were able to preempt many subtle concurrency problems by using cooperative task management. Paying a cost up front to reduce a subtle race condition proved to be a good investment.

Though the choice of task management is fundamental, the choice of stack management can be left to individual taste. This project enables use of any type of stack management in conjunction with cooperative task management.

IMPROVING WAIT-FREE ALGORITHMS FOR INTERPROCESS COMMUNICATION IN EMBEDDED REAL-TIME SYSTEMS

Hai Huang, Padmanabhan Pillai, Kang G. Shin, University of Michigan

The main characteristic of a real-time, time-sensitive system is its predictable response time. But concurrency management creates hurdles to achieving this goal because of the use of locks to maintain consistency. To solve this problem, many wait-free algorithms have been developed, but these are typically high-cost solutions. By taking advantage of the temporal characteristics of the system, however, the time and space overhead can be reduced.

The speaker presented an algorithm for temporal concurrency control and applied this technique to improve three wait-free algorithms. A single-writer multiple-reader wait-free algorithm and a double-buffer algorithm were proposed.
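As a rough illustration of the double-buffer idea (not the authors' improved algorithms), the following C11 sketch shows a writer that always fills the buffer the reader is not using and then publishes it by flipping an index; a fully wait-free version under arbitrary timing needs more slots (for example, Simpson's four-slot algorithm).

#include <stdatomic.h>
#include <string.h>

/* Classic double-buffer exchange: the writer fills the buffer the reader
 * is NOT currently using, then publishes it by flipping an index.
 * Neither side ever blocks on a lock.  This is only a sketch of the
 * general idea; it assumes the reader finishes copying before the writer
 * publishes twice. */
typedef struct {
    int         buf[2][64];
    _Atomic int latest;        /* index of the most recently published buffer */
} dbuf_t;

void writer_publish(dbuf_t *d, const int *sample)
{
    int next = 1 - atomic_load(&d->latest);
    memcpy(d->buf[next], sample, sizeof d->buf[next]);
    atomic_store(&d->latest, next);        /* publish the new snapshot */
}

void reader_snapshot(dbuf_t *d, int *out)
{
    int cur = atomic_load(&d->latest);
    memcpy(out, d->buf[cur], sizeof d->buf[cur]);
}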

Using their transformation mechanism, they achieved an improvement of 17–66% in ACET and a 14–70% reduction in memory requirements for IPC algorithms. This mechanism is extensible and can be applied to other non-blocking IPC algorithms as well. Future work involves reducing the synchronization overheads in more general IPC algorithms with multiple-writer semantics.

MOBILITY

Summarized by Praveen Yalagandula

ROBUST POSITIONING ALGORITHMS FOR DISTRIBUTED AD-HOC WIRELESS SENSOR NETWORKS

Chris Savarese, Jan Rabaey, University of California, Berkeley; Koen Langendoen, Delft University of Technology

The Pico Radio Network comprises more than a hundred sensors, monitors, and actuators equipped with wireless connectivity, with the following properties: (1) no infrastructure; (2) computation in a distributed fashion instead of centralized computation; (3) dynamic topology; (4) limited radio range; (5) nodes that can act as repeaters; (6) devices of low cost and with low power; and (7) sparse anchor nodes – nodes with GPS capability. A positioning algorithm determines the geographical position of each node in the network.

In two-dimensional space, each node needs three reference positions to estimate its geographical position. There are two problems that make positioning difficult in the Pico Radio Network setting: (1) a sparse anchor node problem, and (2) a range error problem.
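In the usual least-squares formulation (notation mine, not the paper's), a node at unknown position x with measured ranges d_i to three reference positions a_i estimates

\hat{x} = \arg\min_{x \in \mathbb{R}^2} \sum_{i=1}^{3} \bigl( \lVert x - a_i \rVert - d_i \bigr)^2

and both of the problems above degrade this estimate: sparse anchors mean too few a_i are reachable, and range error means the d_i are noisy.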

The two-phase approach that the authors have taken in solving the problem consists of (1) a hop-terrain algorithm – in this first phase, each node roughly guesses its location from the distances calculated using multi-hop paths to the anchor points – and (2) a refinement algorithm – in this second phase, each node uses its neighbors' positions to refine its own position estimate. To guarantee convergence, this approach uses confidence metrics, assigning a value of 1.0 for anchor nodes and 0.1 to start for other nodes, and increasing these with each iteration.

A simulation tool, OMNeT++, was used for both phases and in various scenarios. The results show that they achieved position errors of less than 33% in a scenario with 5% anchor nodes, an average connectivity of 7, and 5% range measurement error.

Guidelines for anchor node deployment are high connectivity (>10), a reasonable fraction of anchor nodes (>5%), and, for anchor placement, covered edges.

APPLICATION-SPECIFIC NETWORK MANAGEMENT FOR ENERGY-AWARE STREAMING OF POPULAR MULTIMEDIA FORMATS

Surendar Chandra, University of Georgia, and Amin Vahdat, Duke University

The main hindrance in supporting the increasing demand for mobile multimedia on PDAs is the battery capacity of these small devices. The idle-time power consumption is almost the same as the receive power consumption on typical wireless network cards (1319mW vs. 1425mW), while the sleep state consumes far less power (177mW). To let the system consume energy proportional to the stream quality, the network card should transition to the sleep state aggressively between packets.

The speaker presented previous work where different multimedia streams were studied for PDAs: MS Media, Real Media, and QuickTime. It was found that if the inter-packet gap is predictable, then huge savings are possible – for example, about 80% savings in MS Media streams with few packet losses. The limitation of the IEEE 802.11b power-save mode comes into play when there are two or more nodes, producing a delay between beacon time and when the node receives/sends packets. This badly affects the streams, since higher energy is consumed while waiting.

The authors propose traffic shaping for energy conservation, done by proxy such that packets arrive at regular intervals. This is achieved using a local proxy in the access point and a client-side proxy in the mobile host. Simulations show that traffic shaping reduces energy consumption and also reveal that higher-bandwidth streams have a lower energy metric (mJ/kB).
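A toy C sketch of the regular-interval idea follows; the parameter names are made up, and the real system shapes traffic in a proxy at the access point plus a client-side proxy rather than in the sender itself.

#include <time.h>

/* Toy pacing loop: release queued packets at a fixed interval so the
 * client's wireless card can sleep between arrivals.  Only the
 * regular-interval release is illustrated here. */
void pace_packets(void (*send_one)(void), int count, long interval_ms)
{
    struct timespec gap = { interval_ms / 1000,
                            (interval_ms % 1000) * 1000000L };
    for (int i = 0; i < count; i++) {
        send_one();                 /* forward the next buffered packet */
        nanosleep(&gap, NULL);      /* card may sleep until the next slot */
    }
}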

More information is at http://greenhouse.cs.uga.edu/.

CHARACTERIZING ALERT AND BROWSE SERVICES OF MOBILE CLIENTS

Atul Adya, Paramvir Bahl, and Lili Qiu, Microsoft Research

Even though there is a dramatic increase in Internet access from wireless devices, there are not many studies characterizing this traffic. In this paper the authors characterize the traffic observed on the MSN Web site with both notification and browse traces.


Around 33 million browsing accesses and about 325 million notification entries are present in the traces. Three types of analyses are done for each of these two services: content analysis, concerning the most popular content categories and their distribution; popularity analysis, or the popularity distribution of documents; and user behavior analysis.

The analysis of the notification logs shows that document access rates follow a Zipf-like distribution, with most of the accesses concentrated on a small number of messages, and the accesses exhibit geographical locality – users from the same locality tend to receive similar notification content. The browser log analysis shows that a smaller set of URLs is accessed most of the time, though the access pattern does not fit any Zipf curve, and the highly accessed URLs remain stable. A correlation study between the notification and browsing services shows that wireless users have a moderate correlation of 0.12.
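For reference, a Zipf-like distribution means the access probability of the i-th most popular document falls off as a power of its rank,

P(i) = \frac{i^{-\alpha}}{\sum_{j=1}^{N} j^{-\alpha}}, \qquad \alpha > 0,

so a small number of top-ranked documents absorb most accesses; the browse traces reportedly deviate from this shape.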

FREENIX TRACK SESSIONS

BUILDING APPLICATIONS

Summarized by Matt Butner

INTERACTIVE 3-D GRAPHICS APPLICATIONS FOR TCL

Oliver Kersting, Jürgen Döllner, Hasso Plattner Institute for Software Systems Engineering, University of Potsdam

The integration of 3-D image rendering functionality into a scripting language permits interactive and animated 3-D development and application without the formalities and precision demanded by low-level C/C++ graphics and visualization libraries. The large and complex C++ API of the Virtual Rendering System (VRS) can be combined with the conveniences of the Tcl scripting language. The mapping of class interfaces is done via an automated process and generates respective wrapper classes, all of which ensures complete API accessibility and functionality without any significant performance penalty. The general C++-to-Tcl mapper SWIG grants the necessary object and hierarchical abilities without the use of object-oriented Tcl extensions. The final API mapping techniques address C++ features such as classes, overloaded methods, enumerations, and inheritance relations. All are implemented in a proof-of-concept that maps the complete C++ API of the VRS to Tcl and are showcased in a complete interactive 3-D map system.

THE AGFL GRAMMAR WORK LAB

Cornelis H.A. Coster, Erik Verbruggen, University of Nijmegen (KUN)

The growth and implementation of Natural Language Processing (NLP) is the cornerstone of the continued evolution and implementation of truly intelligent search machines and services. In part due to the growing collections of computer-stored, human-readable documents in the public domain, the implementation of linguistic analysis will become necessary for desirable precision and resolution. Subsequently, such tools and linguistic resources must be released into the public domain, and they have done so with the release of the AGFL Grammar Work Lab under the GNU Public License as a tool for linguistic research and the development of NLP-based applications.

The AGFL (Affix Grammars over a Finite Lattice) Grammar Work Lab meshes context-free grammars with finite set-valued features that are applicable to a range of languages. In computer science terms, "Syntax rules are procedures with parameters and a non-deterministic execution." The English Phrases for Information Retrieval (EP4IR), released with the AGFL-GWL as a robust grammar of English, is an AGFL-GWL-generated English parser that outputs "Head/Modified" frames. The sentences "CompanyX sponsored this conference" and "This conference was sponsored by CompanyX" both generate [CompanyX[sponsored conference]]. The hope is that public availability of such tools will encourage further development of grammar and lexical software systems.

SWILL: A SIMPLE EMBEDDED WEB SERVER LIBRARY

Sotiria Lampoudi, David M. Beazley, University of Chicago

This paper won the FREENIX Best Student Paper award.

SWILL (Simple Web Interface Link Library) is a simple Web server in the form of a C library whose development was motivated by a wish to give cool applications an interface to the Web. The SWILL library provides a simple interface that can be efficiently implemented for tasks that vary from flexible Web-based monitoring to software debugging and diagnostics. Though originally designed to be integrated with high-performance scientific simulation software, the interface is generic enough to allow for unbounded uses. SWILL is a single-threaded server relying upon non-blocking I/O through the creation of a temporary server which services I/O requests. SWILL does not provide SSL or cryptographic authentication, but it does have HTTP authentication abilities.
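The embedding style described in the talk looks roughly like the following C sketch; the call names and signatures here are assumptions based on the paper's description and are not verified against the released SWILL API.

#include <stdio.h>
#include "swill.h"   /* assumed header name */

/* Hypothetical SWILL usage: register a handler that writes its output to
 * a FILE*, then service Web requests from the embedding application's
 * own loop.  Function names and signatures are assumptions. */
static void status_page(FILE *out, void *clientdata)
{
    fprintf(out, "simulation step %d\n", *(int *)clientdata);
}

int main(void)
{
    static int step = 0;
    swill_init(8080);                               /* assumed: start server on a port */
    swill_handle("status.txt", status_page, &step); /* assumed registration call */
    for (;;) {
        step++;                                     /* application work would go here */
        swill_poll();                               /* assumed non-blocking service call */
    }
}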

A fantastic feature of SWILL is its support for SPMD-style parallel applications which utilize MPI, proving valuable for Beowulf clusters and large parallel machines. Another practical application was the implementation of SWILL in a modified Yalnix emulator by University of Chicago operating systems courses, which utilized the added Yalnix functionality for OS development and debugging. SWILL requires minimal memory overhead and relies upon the HTTP/1.0 protocol.

NETWORK PERFORMANCE

Summarized by Florian Buchholz

LINUX NFS CLIENT WRITE PERFORMANCE

Chuck Lever, Network Appliance; Peter Honeyman, CITI, University of Michigan

Lever introduced a benchmark to measure NFS client write performance. Client performance is difficult to measure due to hindrances such as poor hardware or bandwidth limitations. Furthermore, measuring application performance does not identify weaknesses specifically at the client side. Thus, a benchmark was developed that tries to exercise only data transfers in one direction, between server and application. For this purpose the benchmark was based on the block sequential write portion of the Bonnie file system benchmark. Once a benchmark for NFS clients is established, it can be used to improve client performance.

The performance measurements were performed with an SMP Linux client and both a Linux NFS server and a Network Appliance F85 filer. During testing, a periodic jump in write latency was discovered. This was due to a rather large number of pending write operations that were scheduled to be written after certain threshold values were exceeded. By introducing a separate daemon that flushes the cached write requests, the spikes could be eliminated, but as a result the average latency grows over time. The problem could be traced to a function that scans a linked list of write requests. After a hashtable was added to improve lookup performance, the latency improved considerably.
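The reported fix amounts to replacing a linear walk of the pending-write list with a hash lookup; a generic C sketch of that idea follows (all names are invented here, not the actual Linux NFS client code).

#define NBUCKETS 256

/* Pending write request, chained per hash bucket. */
struct wreq {
    unsigned long  offset;
    struct wreq   *next;
};

static struct wreq *buckets[NBUCKETS];

static unsigned hash_offset(unsigned long off)
{
    return (off >> 12) % NBUCKETS;     /* page-granular key */
}

void wreq_insert(struct wreq *r)
{
    unsigned h = hash_offset(r->offset);
    r->next = buckets[h];
    buckets[h] = r;
}

struct wreq *wreq_lookup(unsigned long off)
{
    for (struct wreq *r = buckets[hash_offset(off)]; r; r = r->next)
        if (r->offset == off)
            return r;          /* O(chain length) instead of O(all requests) */
    return (struct wreq *)0;
}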

The improved client was then used to measure throughput against the two servers. A discrepancy between the Linux server and the filer test was noticed, and the reason for it was traced back to a global kernel lock that was unnecessarily held when accessing the network stack. After correcting this, performance improved further. However, the measurements showed that a client may run slower when paired with fast servers on fast networks. This is due to heavy client interrupt loads, more network processing on the client side, and more global kernel lock contention.

The source code of the project is available at http://www.citi.umich.edu/projects/nfs-perf/patches/.

A STUDY OF THE RELATIVE COSTS OF NETWORK SECURITY PROTOCOLS

Stefan Miltchev and Sotiris Ioannidis, University of Pennsylvania; Angelos Keromytis, Columbia University

With the increasing need for security and integrity of remote network services, it becomes important to quantify the communication overhead of IPSec and compare it to alternatives such as SSH, SCP, and HTTPS.

For this purpose, the authors set up three testing networks: a direct link, two hosts separated by two gateways, and three hosts connecting through one gateway. Protocols were compared in each setup, and manual keying was used to eliminate connection setup costs. For IPSec, the different encryption algorithms AES, DES, 3DES, hardware DES, and hardware 3DES were used. In detail, FTP was compared to SFTP, SCP, and FTP over IPSec; HTTP to HTTPS and HTTP over IPSec; and NFS to NFS over IPSec and local disk performance.

The result of the measurements was that IPSec outperforms other popular encryption schemes. Overall, unencrypted communication was fastest, but in some cases, like FTP, the overhead can be small. The use of crypto hardware can significantly improve performance. For future work, the inclusion of setup costs, hardware-accelerated SSL, SFTP, and SSH were mentioned.

CONGESTION CONTROL IN LINUX TCP

Pasi Sarolahti, University of Helsinki; Alexey Kuznetsov, Institute for Nuclear Research at Moscow

Having attended the talk and read the paper, I am still unclear about whether the authors are merely describing the design decisions of TCP congestion control or whether they are actually the creators of that part of the Linux code. My guess leans toward the former.

In the talk, the speaker compared the TCP congestion control measures according to IETF and RFC specifications with the actual Linux implementation, which does conform to the basic principles but nevertheless has differences. A specific emphasis was placed on retransmission mechanisms and the congestion window. Also, several TCP enhancements – the NewReno algorithm, Selective ACKs (SACK), Forward ACKs (FACK) – were discussed and compared.

In some instances, Linux does not conform to the IETF specifications. The fast recovery does not fully follow RFC 2582, since the threshold for triggering retransmit is adjusted dynamically and the congestion window's size is not changed. Also, the round-trip-time estimator and the RTO calculation differ from RFC 2988, since Linux uses more conservative RTT estimates and a minimum RTO of 200ms. The performance measures showed that with the additional Linux-specific features enabled, slightly higher throughput, more steady data flow, and fewer unnecessary retransmissions can be achieved.
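For comparison, RFC 2988 computes the retransmission timeout from smoothed RTT estimates as

\mathrm{RTTVAR} \leftarrow (1-\beta)\,\mathrm{RTTVAR} + \beta\,\lvert \mathrm{SRTT} - R' \rvert, \qquad \mathrm{SRTT} \leftarrow (1-\alpha)\,\mathrm{SRTT} + \alpha R'

\mathrm{RTO} = \mathrm{SRTT} + \max(G,\ 4\,\mathrm{RTTVAR}), \qquad \alpha = \tfrac{1}{8},\ \beta = \tfrac{1}{4}

with a one-second minimum; the Linux code described in the talk instead uses more conservative RTT estimates and clamps the minimum RTO at 200ms.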

XTREME XCITEMENT

Summarized by Steve Bauer

THE FUTURE IS COMING: WHERE THE X WINDOW SHOULD GO

Jim Gettys Compaq Computer Corp

Jim Gettys, one of the principal authors of the X Window System, outlined the near-term objectives for the system, primarily focusing on the changes and infrastructure required to enable replication and migration of X applications. Providing better support for this functionality would enable users to retrieve or duplicate X applications between their servers at home and work.


One interesting example of an application that currently is capable of migration and replication is Emacs. To create a new frame on DISPLAY, try "M-x make-frame-on-display <Return> DISPLAY <Return>".

However, technical challenges make replication and migration difficult in general. These include the "major headaches" of server-side fonts, the nonuniformity of X servers and screen sizes, and the need to appropriately retrofit toolkits. Authentication and authorization issues are obviously also important. The rest of the talk delved into some of the details of these interesting technical challenges.

HACKING IN THE KERNEL

Summarized by Hai Huang

AN IMPLEMENTATION OF SCHEDULER ACTIVATIONS ON THE NETBSD OPERATING SYSTEM

Nathan J Williams Wasabi Systems

Scheduler activation is an old idea. Basically, there are benefits and drawbacks to using solely kernel-level or user-level threading. Scheduler activation is able to combine the two layers of control to provide more concurrency in the system.

In his talk, Nathan gave a fairly detailed description of the implementation of scheduler activations in the NetBSD kernel. One important change in the implementation is to differentiate the thread context from the process context. This is done by defining a separate data structure for these threads and relocating some of the information that was embedded in the process context to these thread contexts. The stack was especially a concern, due to the upcall: special handling must be done to make sure that the upcall doesn't mess up the stack, so that the preempted user-level thread can continue afterwards. Lastly, Nathan explained that signals are handled by upcalls.

78 Vol 27 No 5 login

AUTHORIZATION AND CHARGING IN PUBLIC WLANS USING FREEBSD AND 802.1X

Pekka Nikander, Ericsson Research NomadicLab

802.1x standards are well known in the wireless community as link-layer authentication protocols. In this talk, Pekka explained some novel ways of using the 802.1x protocols that might be of interest to people on the move. It is possible to set up a public WLAN that would support various charging schemes via virtual tokens, which people can purchase or earn and later use.

This is implemented on FreeBSD using the netgraph utility. It is basically a filter in the link layer that differentiates traffic based on the MAC address of the client node, which is either authenticated, denied, or let through. The overhead for this service is fairly minimal.

ACPI IMPLEMENTATION ON FREEBSD

Takanori Watanabe Kobe University

ACPI (Advanced Configuration and Power Management Interface) was proposed as a joint effort by Intel, Toshiba, and Microsoft to provide a standard, finer-grained method of managing power states of individual devices within a system. Such low-level power management is especially important for those mobile and embedded systems that are powered by fixed-capacity batteries.

Takanori explained that the ACPI specification is composed of three parts: tables, BIOS, and registers. He was able to implement some functionality of ACPI in a FreeBSD kernel. The ACPI Component Architecture was implemented by Intel, and it provides a high-level ACPI API to the operating system. Takanori's ACPI implementation is built upon this underlying layer of APIs.

ACCESS CONTROL

Summarized by Florian Buchholz

DESIGN AND PERFORMANCE OF THE OPENBSD STATEFUL PACKET FILTER (PF)

Daniel Hartmeier Systor AG

Daniel Hartmeier described the new stateful packet filter (pf) that replaced IPFilter in the OpenBSD 3.0 release. IPFilter could no longer be included due to licensing issues, and thus there was a need to write a new filter making use of optimized data structures.

The filter rules are implemented as a linked list which is traversed from start to end. Two actions may be taken according to the rules: "pass" or "block." A "pass" action forwards the packet, and a "block" action will drop it. Where more than one rule matches a packet, the last rule wins. Rules that are marked as "final" will immediately terminate any further rule evaluation. An optimization called "skip-steps" was also implemented, where blocks of similar rules are skipped if they cannot match the current packet; these skip-steps are calculated when the rule set is loaded. Furthermore, a state table keeps track of TCP connections, and only packets that match the sequence numbers are allowed. UDP and ICMP queries and replies are considered in the state table, where initial packets will create an entry for a pseudo-connection with a low initial timeout value. The state table is implemented using a balanced binary search tree. NAT mappings are also stored in the state table, whereas application proxies reside in user space. The packet filter is also able to perform fragment reassembly and to modulate TCP sequence numbers to protect hosts behind the firewall.
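The pass/block decision described above can be modeled in a few lines of C (a sketch of the evaluation order only; skip-steps, state lookup, and the real pf data structures are omitted).

#include <stdbool.h>
#include <stddef.h>

/* Minimal model of pf-style evaluation: walk the whole rule list, let the
 * last matching rule decide, unless a rule marked "final" short-circuits
 * the walk. */
enum action { PASS, BLOCK };

struct rule {
    bool (*match)(const void *pkt);
    enum action act;
    bool final;                /* terminate evaluation on match */
};

enum action evaluate(const struct rule *rules, size_t n,
                     const void *pkt, enum action dflt)
{
    enum action verdict = dflt;
    for (size_t i = 0; i < n; i++) {
        if (!rules[i].match(pkt))
            continue;
        verdict = rules[i].act;        /* last match wins ... */
        if (rules[i].final)
            return verdict;            /* ... unless marked final */
    }
    return verdict;
}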

Pf was compared against IPFilter as well as Linux's iptables by measuring throughput and latency with increasing traffic rates and different packet sizes. In a test with a fixed rule set size of 100, iptables outperformed the other two filters, whose results were close to each other. In a second test, where the rule set size was continually increased, iptables consistently had about twice the throughput of the other two (which evaluate the rule set on both the incoming and outgoing interfaces). A third test compared only pf and IPFilter, using a single rule that created state in the state table with a fixed-state entry size. Pf reached an overloaded state much later than IPFilter. The experiment was repeated with a variable-state entry size, and pf performed much better than IPFilter for a small number of states.

In general, rule set evaluation is expensive, and benchmarks only reflect extreme cases, whereas in real life other behavior should be observed. Furthermore, the benchmarks show that stateful filtering can actually improve performance, due to cheap state-table lookups as compared to rule evaluation.

ENHANCING NFS CROSS-ADMINISTRATIVE DOMAIN ACCESS

Joseph Spadavecchia and Erez Zadok, Stony Brook University

The speaker presented modifications to an NFS server that allow improved NFS access between administrative domains. The problem lies in the fact that NFS assumes a shared UID/GID space, which makes it unsafe to export files outside the administrative domain of the server. Design goals were to leave the protocol and existing clients unchanged, a minimum amount of server changes, flexibility, and increased performance.

To solve the problem, two techniques are utilized: "range-mapping" and "file-cloaking." Range-mapping maps IDs between client and server. The mapping is performed on a per-export basis and has to be manually set up in an export file. The mappings can be 1-1, N-N, or N-1. In file-cloaking, the server restricts file access based on UID/GID and special cloaking-mask bits, so users can only access their own file permissions. The policy on whether or not the file is visible to others is set by the cloaking mask, which is logically ANDed with the file's protection bits. File-cloaking only works, however, if the client doesn't hold cached copies of directory contents and file attributes. Because of this, the clients are forced to re-read directories by incrementing the mtime value of the directory each time it is listed.
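A minimal model of the two techniques, with invented names and constants, might look like the following C sketch; the real server applies these checks per export inside the NFS request path.

#include <stdbool.h>
#include <sys/types.h>

/* Range-mapping: rewrite a client UID into the server's ID space. */
struct idmap { uid_t cli_lo, cli_hi, srv_lo; };   /* one per-export entry */

uid_t map_uid(const struct idmap *m, uid_t cli)
{
    if (cli < m->cli_lo || cli > m->cli_hi)
        return (uid_t)65534;                /* unmapped -> nobody */
    return m->srv_lo + (cli - m->cli_lo);
}

/* File-cloaking: hide a file unless the caller owns it or the cloaking
 * mask exposes it via the file's permission bits. */
bool visible(uid_t caller, uid_t owner, mode_t perms, mode_t cloak_mask)
{
    if (caller == owner)
        return true;
    return (perms & cloak_mask) != 0;       /* mask ANDed with protection bits */
}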

To test the performance of the modified server, five different NFS configurations were evaluated: an unmodified NFS server was compared against one server with the modified code included but not used, one with only range-mapping enabled, one with only file-cloaking enabled, and one version with all modifications enabled. For each setup, different file system benchmarks were run. The results show only a small overhead when the modifications are used, generally an increase of below 5%. Another experiment tested the performance of the system with an increasing number of mapped or cloaked entries; the results show that an increase from 10 to 1000 entries resulted in a maximum of about 14% cost in performance.

One member of the audience pointed out that if clients choose to ignore the changed mtimes from the server and thus still hold caches of the directory entries, the file-cloaking mechanism could be defeated. After a rather lengthy debate, the speaker had to concede that the model doesn't add any extra security. Another question was asked about scalability of the setup of range-mapping; the speaker referred to application-level tools that could be developed for that purpose.

The software is available at ftp://ftp.fsl.cs.sunysb.edu/pub/enf/.

ENGINEERING OPEN SOURCE SOFTWARE

Summarized by Teri Lampoudi

NINGAUI A LINUX CLUSTER FOR BUSINESS

Andrew Hume, AT&T Labs – Research; Scott Daniels, Electronic Data Systems Corp.

Ningaui is a general-purpose, highly available, resilient architecture built from commodity software and hardware. Emphatically, however, Ningaui is not a Beowulf. Hume calls the cluster design the "Swiss canton model," in which there are a number of loosely affiliated independent nodes with data replicated among them. Jobs are assigned by bidding and leases, and cluster services done as session-based servers are sited via generic job assignment. The emphasis is on keeping the architecture end-to-end, checking all work via checksums, and logging everything. The resilience and high availability required by their goal of 8-to-5 maintenance – vs. the typical 24/7 model where people get paged whenever the slightest thing goes wrong, regardless of the time of day – is achieved by job restartability. Finally, all computation is performed on local data, without the use of NFS or network-attached storage.

Hume's message is a hopeful one: despite the many problems encountered – things like kernel and service limits, auto-installation problems, TCP storms, and the like – the existence of source code and the paranoid practice of logging and checksumming everything has helped. The final product performs reasonably well, and it appears that the resilient techniques employed do make a difference. One drawback, however, is that the software mentioned in the paper is not yet available for download.

CPCMS: A CONFIGURATION MANAGEMENT SYSTEM BASED ON CRYPTOGRAPHIC NAMES

Jonathan S. Shapiro and John Vanderburgh, Johns Hopkins University

This paper won the FREENIX Best Paper award.

The basic notion behind the project is the fact that everyone has a pet complaint about CVS, and yet it is currently the configuration manager in most widespread use. Shapiro has unleashed an alternative. Interestingly, he did not begin but ended with the usual host of reasons why CVS is bad. The talk instead began abruptly by characterizing the job of a software configuration manager, continued by stating the namespaces which it must handle, and wrapped up with the challenges faced.

X MEETS Z: VERIFYING CORRECTNESS IN THE PRESENCE OF POSIX THREADS

Bart Massey, Portland State University; Robert T. Bauer, Rational Software Corp.

Massey delivered a humorous talk on the insight gained from applying Z formal specification notation to system software design, rather than the more informal analysis and design process normally used.

The story is told with respect to writing XCB, which replaces the Xlib protocol layer and is supposed to be thread friendly. But where threads are concerned, deadlock avoidance becomes a hard problem that cannot be solved in an ad hoc manner, while full model checking is also too hard. In this case Massey resorted to Z specification to model the XCB lower layer, abstract away locking and data transfers, and locate fundamental issues. Essentially, the difficulties of searching the literature and locating information relevant to the problem at hand were overcome. As Massey put it, "the formal method saved the day."

FILE SYSTEMS

Summarized by Bosko Milekic

PLANNED EXTENSIONS TO THE LINUX EXT2/EXT3 FILESYSTEM

Theodore Y. Ts'o, IBM; Stephen Tweedie, Red Hat

The speaker presented improvements to the Linux Ext2 file system, with the goal of allowing for various expansions while striving to maintain compatibility with older code. Improvements have been facilitated by a few extra superblock fields that were added to Ext2 not long before the Linux 2.0 kernel was released. The fields allow file system features to be added without compromising the existing setup; this is done by providing bits indicating the impact of the added features with respect to compatibility. Namely, a file system with the "incompat" bit marked is not allowed to be mounted. Similarly, a "read-only" marking would only allow the file system to be mounted read-only.
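The compatibility scheme amounts to a three-way check at mount time, sketched below in C with illustrative names (the real ext2 constants and superblock fields differ).

#include <stdbool.h>
#include <stdint.h>

/* Unknown "incompat" features refuse the mount, unknown "ro_compat"
 * features force read-only, and unknown "compat" features are ignored. */
struct sb_features { uint32_t compat, ro_compat, incompat; };

enum mount_result { MOUNT_RW, MOUNT_RO, MOUNT_REFUSE };

enum mount_result check_features(const struct sb_features *sb,
                                 uint32_t known_ro_compat,
                                 uint32_t known_incompat,
                                 bool want_write)
{
    if (sb->incompat & ~known_incompat)
        return MOUNT_REFUSE;                 /* can't even read it safely */
    if (want_write && (sb->ro_compat & ~known_ro_compat))
        return MOUNT_RO;                     /* readable, but don't write */
    return want_write ? MOUNT_RW : MOUNT_RO; /* unknown compat bits are harmless */
}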

Directory indexing replaces linear directory searches with a faster search using a fixed-depth tree and hashed keys. File system size can be dynamically increased, and the expanded inode, doubled from 128 to 256 bytes, allows for more extensions.

Other potential improvements were discussed as well, in particular pre-allocation for contiguous files, which allows for better performance in certain setups by pre-allocating contiguous blocks. Security-related modifications, extended attributes, and ACLs were mentioned. An implementation of these features already exists but has not yet been merged into the mainline Ext2/3 code.

RECENT FILESYSTEM OPTIMISATIONS ON FREEBSD

Ian Dowse, Corvil Networks; David Malone, CNRI, Dublin Institute of Technology

David Malone presented four important file system optimizations for the FreeBSD OS: soft updates, dirpref, vmiodir, and dirhash. It turns out that certain combinations of the optimizations (beautifully illustrated in the paper) may yield performance improvements of anywhere between 2 and 10 orders of magnitude for real-world applications.

All four of the optimizations deal with file system metadata. Soft updates allow for asynchronous metadata updates. Dirpref changes the way directories are organized, attempting to place child directories closer to their parents, thereby increasing locality of reference and reducing disk-seek times. Vmiodir trades some extra memory in order to achieve better directory caching. Finally, dirhash, which was implemented by Ian Dowse, changes the way in which entries are searched in a directory. Specifically, FFS uses a linear search to find an entry by name; dirhash builds a hashtable of directory entries on the fly. For a directory of size n with a working set of m files, a search that in certain cases could have been O(n·m) has been reduced, due to dirhash, to effectively O(n + m).
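A toy version of the dirhash idea: pay one O(n) pass to build a name hash the first time a large directory is searched, so the m subsequent lookups cost roughly O(1) each. Names and sizes here are illustrative, not the FreeBSD implementation.

#include <string.h>

struct dent { const char *name; long slot; struct dent *next; };

#define DH_BUCKETS 512

static unsigned dh_hash(const char *s)
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % DH_BUCKETS;
}

/* One linear pass over the directory builds the table. */
void dh_build(struct dent **table, struct dent *entries, size_t n)
{
    memset(table, 0, DH_BUCKETS * sizeof *table);
    for (size_t i = 0; i < n; i++) {
        unsigned h = dh_hash(entries[i].name);
        entries[i].next = table[h];
        table[h] = &entries[i];
    }
}

/* Each subsequent lookup touches only one short chain. */
long dh_lookup(struct dent **table, const char *name)
{
    for (struct dent *d = table[dh_hash(name)]; d; d = d->next)
        if (strcmp(d->name, name) == 0)
            return d->slot;
    return -1;
}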

FILESYSTEM PERFORMANCE AND SCALABILITY IN LINUX 2.4.17

Ray Bryant, SGI; Ruth Forester, IBM LTC; John Hawkes, SGI

This talk focused on performance evaluation of a number of file systems available and commonly deployed on Linux machines. Comparisons under various configurations of Ext2, Ext3, ReiserFS, XFS, and JFS were presented.

The benchmarks chosen for the data gathering were pgmeter, filemark, and the AIM Benchmark Suite VII. Pgmeter measures the rate of data transfer of reads/writes of a file under a synthetic workload. Filemark is similar to postmark in that it is an operation-intensive benchmark, although filemark is threaded and offers various other features that postmark lacks. AIM VII measures performance for various file-system-related functionalities; it offers various metrics under an imposed workload, thus stressing the performance of the file system not only under I/O load but also under significant CPU load.

Tests were run on three different setups: a small, a medium, and a large configuration. ReiserFS and Ext2 appear at the top of the pile for smaller and medium setups. Notably, XFS and JFS perform worse for smaller system configurations than the others, although XFS clearly appears to generate better numbers than JFS. It should be noted that XFS seems to scale well under a higher load; this was most evident in the large-system results, where XFS appears to offer the best overall results.

THINGS TO THINK ABOUT

Summarized by Bosko Milekic

SPEEDING UP KERNEL SCHEDULER BY REDUCING CACHE MISSES

Shuji Yamamura, Akira Hirai, Mitsuru Sato, Masao Yamamoto, Akira Naruse, Kouichi Kumon, Fujitsu Labs

This was an interesting talk pertaining to the effects of cache coloring for task structures in the Linux kernel scheduler (Linux kernel 2.4.x). The speaker first presented some interesting benchmark numbers for the Linux scheduler, showing that as the number of processes on the task queue was increased, the performance decreased. The authors used some really nifty hardware to measure the number of bus transactions throughout their tests and were thus able to reasonably quantify the impact that cache misses had in the Linux scheduler.

Their experiments led them to implement a cache coloring scheme for task structures, which were previously aligned on 8KB boundaries and therefore were eventually mapped to the same cache lines. This unfortunate placement of task structures in memory induced a significant number of cache misses as the number of tasks grew in the scheduler.

The implemented solution consisted of aligning task structures to more evenly distribute cached entries across the L2 cache. The result was inevitably fewer cache misses in the scheduler. Some negative effects were observed in certain situations; these were primarily due to more cache slots being used by task-structure data in the scheduler, thus forcing data previously cached there to be pushed out.
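The fix can be pictured as giving each 8KB task-structure allocation a different small offset, so successive structures map to different cache sets. The user-space C sketch below is only illustrative: the sizes and color count are made up, and the real kernel also keeps the base pointer so the area can be freed later.

#include <stdlib.h>

#define STACK_AREA   8192          /* task struct + kernel stack allocation */
#define CACHE_LINE   64
#define NCOLORS      16

/* Return a task-structure pointer offset by a rotating "color" so that
 * 8KB-aligned allocations no longer all collide on the same cache sets. */
void *alloc_task_struct(void)
{
    static unsigned next_color = 0;
    unsigned color = next_color++ % NCOLORS;
    char *base = aligned_alloc(STACK_AREA, STACK_AREA);
    if (base == NULL)
        return NULL;
    return base + (size_t)color * CACHE_LINE;  /* shift by 'color' cache lines */
}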

WORK-IN-PROGRESS REPORTS

Summarized by Brennan Reynolds

RESOURCE VIRTUALIZATION TECHNIQUES FOR WIDE-AREA OVERLAY NETWORKS

Kartik Gopalan, Stony Brook University

This work addressed the issue of provisioning a maximum number of virtual overlay networks (VONs) with diverse quality of service (QoS) requirements on a single physical network. Each of the VONs is logically isolated from the others to ensure the QoS requirements. Gopalan mentioned several critical research issues with this problem that are currently being investigated. Dealing with how to provision the network at various levels (link, route, or path) and then enforce the provisioning at runtime is one of the toughest challenges. Currently, Gopalan has developed several algorithms to handle admission control, end-to-end QoS, route selection, scheduling, and fault tolerance in the network.

For more information, visit http://www.ecsl.cs.sunysb.edu/.

VISUALIZING SOFTWARE INSTABILITY

Jennifer Bevan, University of California, Santa Cruz

Detection of instability in software has typically been an afterthought. The point in the development cycle when the software is reviewed for instability is usually after it is difficult and costly to go back and perform major modifications to the code base. Bevan has developed a technique to allow the visualization of unstable regions of code that can be used much earlier in the development cycle. Her technique creates a time series of dependency graphs that include clusters and lines called fault lines. From the graphs, a developer is able to easily determine where the unstable sections of code are and proactively restructure them. She is currently working on a prototype implementation.

For more information, visit http://www.cse.ucsc.edu/~jbevan/evo_viz/.

RELIABLE AND SCALABLE PEER-TO-PEER WEB DOCUMENT SHARING

Li Xiao, College of William and Mary

The idea presented by Xiao would allow end users to share the content of their Web browser caches with neighboring Internet users. The rationale for this is that today's end users are increasingly connected to the Internet over high-speed links, and browser caches are becoming large storage systems. Therefore, if an individual accesses a page which does not exist in their local cache, Xiao is suggesting that they first query other end users for the content before trying to access the machine hosting the original. This strategy does have some serious problems associated with it that still need to be addressed, including ensuring the integrity of the content and protecting the identity and privacy of the end users.

SEREL FAST BOOTING FOR UNIX

Leni Mayo Fast Boot Software

Serel is a tool that generates a visual representation of a UNIX machine's boot-up sequence. It can be used to identify the critical path and can show if a particular service or process blocks for an extended period of time. This information could be used to determine where the largest performance gains could be realized by tuning the order of execution at boot-up. Serel creates a dependency graph expressed in XML during boot-up; this graph is then used to create the visual representation. Currently the tool only works on POSIX-compliant systems, but Mayo stated that he would be porting it to other platforms. Other extensions that were mentioned included having the metadata include the use of shared libraries and monitoring the suspend/resume sequence of portable machines.

For more information, visit http://www.fastboot.org/.


BERKELEY DB XML

John Merrells Sleepycat Software

Merrells gave a quick introduction and overview of the new XML library for Berkeley DB. The library specializes in storage and retrieval of XML content through a tool called XPath. The library allows for multiple containers per document and stores everything natively as XML. The user is also given a wide range of elements to create indices with, including edges, elements, text strings, or presence. The XPath tool consists of a query parser, generator, optimizer, and execution engine. To conclude his presentation, Merrells gave a live demonstration of the software.

For more information, visit http://www.sleepycat.com/xml/.

CLUSTER-ON-DEMAND (COD)

Justin Moore Duke University

Modern clusters are growing at a rapid rate. Many have pushed beyond the 5000-machine mark, and deploying them results in large expenses as well as management and provisioning issues. Furthermore, if the cluster is "rented" out to various users, it is very time-consuming to configure it to a user's specs, regardless of how long they need to use it. The COD work presented creates dynamic virtual clusters within a given physical cluster. The goal was to have a provisioning tool that would automatically select a chunk of available nodes and install the operating system and middleware specified by the customer in a short period of time. This would allow greater use of resources, since multiple virtual clusters can exist at once, and by using a virtual cluster the size can be changed dynamically. Moore stated that they have created a working prototype and are currently testing and benchmarking its performance.

For more information, visit http://www.cs.duke.edu/~justin/cod/.


CATACOMB

Elias Sinderson, University of California, Santa Cruz

Catacomb is a project to develop a database-backed DASL module for the Apache Web server. It was designed as a replacement for WebDAV. The initial release of the module only contains support for a MySQL database but could be extended to others. Sinderson briefly touched on the performance of the module: it was comparable to the mod_dav Apache module for all query types but search. The presentation concluded with remarks about adding support for the lock method and including ACL specifications in the future.

For more information, visit http://ocean.cse.ucsc.edu/catacomb/.

SELF-ORGANIZING STORAGE

Dan Ellard Harvard University

Ellard's presentation introduced a storage system that tunes itself based on the workload of the system, without requiring the intervention of the user. The intelligence is implemented as a virtual, self-organizing disk that resides below any file system. The virtual disk observes the access patterns exhibited by the system and then attempts to predict what information will be accessed next. An experiment was done using an NFS trace at a large ISP on one of their email servers. Ellard's self-organizing storage system worked well, which he attributes to the fact that most of the files being requested were large email boxes. Areas of future work include exploration of the length and detail of the data collection stage, as well as the CPU impact of running the virtual disk layer.

VISUAL DEBUGGING

John Costigan Virginia Tech

Costigan feels that the state of current debugging facilities in the UNIX world is not as good as it should be. He proposes the addition of several elements to programs like ddd and gdb. The first addition is including separate heap and stack components. Another is to display only certain fields of a data structure, and to have the ability to zoom in and out if needed. Finally, the debugger should provide the ability to visually present complex data structures, including linked lists, trees, etc. He said that a beta version of a debugger with these abilities is currently available.

For more information, visit http://infovis.cs.vt.edu/datastruct/.

IMPROVING APPLICATION PERFORMANCE THROUGH SYSTEM-CALL COMPOSITION

Amit Purohit, Stony Brook University

Web servers perform a huge number of context switches and internal data copies during normal operation. These two elements can drastically limit the performance of an application, regardless of the hardware platform it is run on. The Compound System Call (CoSy) framework is an attempt to reduce the performance penalty for context switches and data copies. It includes a user-level library and several kernel facilities that can be used via system calls. The library provides the programmer with a complete set of memory-management functions. Performance tests using the CoSy framework showed a large saving for context switches and data copies; the only area where the savings between conventional libraries and CoSy were negligible was for very large file copies.

For more information, visit http://www.cs.sunysb.edu/~purohit/.

ELASTIC QUOTAS

John Oscar Columbia University

Oscar began by stating that most resources are flexible, but that to date disks have not been. While approximately 80 percent of files are short-lived, disk quotas are hard limits imposed by administrators. The idea behind elastic quotas is to have a non-permanent storage area that each user can temporarily use. Oscar suggested the creation of an ehome directory structure to be used in deploying elastic quotas. Currently he has implemented elastic quotas as a stackable file system.

There were also several trends apparent in the data. Many users would ssh to their remote host (good) but then launch an email reader on the local machine, which would connect via POP to the same host and send the password in clear text (bad). This situation can easily be remedied by tunneling protocols like POP through SSH, but it appeared that many people were not aware this could be done. While most of his comments on the use of the wireless network were negative, the list of passwords he had collected showed that people were indeed using strong passwords. His recommendations were to educate and encourage people to use protocols like IPSec, SSH, and SSL when conducting work over a wireless network, because you never know who else is listening.

SYSTRACE

Niels Provos University of Michigan

How can one be sure the applications one is using actually do exactly what their developers said they do? Short answer: you can't. People today are using a large number of complex applications, which means it is impossible to check each application thoroughly for security vulnerabilities. There are tools out there that can help, though. Provos has developed a tool called systrace that allows a user to generate a policy of acceptable system calls a particular application can make. If the application attempts to make a call that is not defined in the policy, the user is notified and allowed to choose an action. Systrace includes an auto-policy generation mechanism, which uses a training phase to record the actions of all programs the user executes. If the user chooses not to be bothered by applications breaking policy, systrace allows default enforcement actions to be set. Currently systrace is implemented for FreeBSD and NetBSD, with a Linux version coming out shortly.

For more information, visit http://www.citi.umich.edu/u/provos/systrace/.

VERIFIABLE SECRET REDISTRIBUTION FOR SURVIVABLE STORAGE SYSTEMS

Ted Wong Carnegie Mellon University

Wong presented a protocol that can be used to re-create a file distributed over a set of servers, even if one of the servers is damaged. In his scheme, the user must choose how many shares the file is split into and the number of servers it will be stored across. The goal of this work is to provide persistent, secure storage of information even if it comes under attack. Wong stated that his design of the protocol is complete and that he is currently building a prototype implementation of it.
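The share and server counts the user chooses correspond to a standard (m, n) threshold scheme, shown here only as background and not necessarily Wong's exact redistribution protocol:

f(x) = s + c_1 x + \cdots + c_{m-1} x^{m-1} \pmod{p}, \qquad \text{share}_i = f(i),\ i = 1, \dots, n

Any m of the n shares reconstruct the secret s by interpolation, while fewer than m reveal nothing; redistribution re-shares s to a new server set without ever reassembling it in one place.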

For more information, visit http://www.cs.cmu.edu/~tmwong/research/.

THE GURU IS IN SESSIONS

LINUX ON LAPTOP/PDA

Bdale Garbee HP Linux Systems Operation

Summarized by Brennan Reynolds

This was a roundtable-like discussion with Garbee acting as moderator. He is currently in charge of the Debian distribution of Linux and is involved in porting Linux to platforms other than i386. He has successfully helped port the kernel to the Alpha, SPARC, and ARM, and is currently working on a version to run on VAX machines.

A discussion was held on which file system is best for battery life. The Reiser and Ext3 file systems were discussed; people commented that when using Ext3 on their laptops the disk would never spin down and thus was always consuming a large amount of power. One suggestion was to use a RAM-based file system for any volume that requires a large number of writes, and to only have the file system written to disk at infrequent intervals or when the machine is shut down or suspended.

A question was directed to Garbee about his knowledge of how the HP-Compaq merger would affect Linux. Garbee thought that the merger was good for Linux, both within the company and for the open source community as a whole. He stated that the new company would be the number one shipper of systems with Linux as the base operating system. He also fielded a question about the support of older HP servers and their ability to run Linux, saying that indeed Linux has been ported to them, and he personally had several machines in his basement running it.

A question which sparked a large number of responses concerned problems people had with running Linux on mobile platforms. Most people actually did not have many problems at all. There were a few cases of a particular machine model not working, but there were only two widespread issues: docking stations and the new ACPI power management scheme. Docking stations still appear to be somewhat of a headache, but most people had developed ad hoc solutions for getting them to work. The ACPI power management scheme developed by Intel does not appear to have a quick solution. Ted Ts'o, one of the head kernel developers, stated that there are fundamental architectural problems with the 2.4 series kernel that do not easily allow ACPI to be added. However, he also stated that ACPI is supported in the newest development kernel, 2.5, and will be included in 2.6.

The final topic was the possibility of purchasing a laptop without paying any charge/tax for Microsoft Windows. The consensus was that currently this is not possible. Even if the machine comes with Linux or without any operating system, the vendors have contracts in place which require them to pay Microsoft for each machine they ship if that machine is capable of running a Microsoft operating system. The only way this will change is if the demand for other operating systems increases to the point that vendors renegotiate their contracts with Microsoft, and no one saw this happening in the near future.

LARGE CLUSTERS

Andrew Hume, AT&T Labs – Research

Summarized by Teri Lampoudi

This was a session on clusters, big data, and resilient computing. Given that the audience was primarily interested in clusters and not necessarily big data or beating nodes to a pulp, the conversation revolved mainly around what it takes to make a heterogeneous cluster resilient. Hume also presented a FREENIX paper on his current cluster project, which explained in more detail much of what was abstractly claimed in the guru session.

Hume's use of the word "cluster" referred not to what I had assumed would be a Beowulf-type system, but to a loosely coupled farm of machines in which concurrency is much less of an issue than it would be in a Beowulf. In fact, the architecture Hume described was designed to deal with large amounts of independent transaction processing, essentially the process of billing calls, which requires no interprocess message passing or anything of the sort a parallel scientific application might.

Where does the large data come in? Since the transactions in question consist of large numbers of flat files, a mechanism for getting the files onto nodes and off-loading the results is necessary. In this particular cluster, named Ningaui, this task is handled by the "replication manager," which generated a large amount of interest in the audience. All management functions in the cluster, as well as job allocation, constitute services that are to be bid for and leased out to nodes. Furthermore, failures are handled by simply repeating the failed task until it is completed properly, if that is possible, and if it is not, then looking through detailed logs to discover what went wrong. Logging and testing for the successful completion of each task are what make this model resilient. Every management operation is logged at some appropriate level of detail, generating 120MB/day, and every single file transfer is MD5 checksummed. This was characterized by Hume as a "patient" style of management, where no one controls the cluster: scheduling behavior is emergent rather than stringently planned, and nodes are free to drop in and out of the cluster. A downside to this model is the increase in latency, offset by the fact that the cluster continues to function unattended outside of 8-to-5 support hours.

In response to questions about the projected scalability of such a scheme, Hume said the system would presumably scale from the present 60 nodes to about 100 without modifications, but that scaling higher would be a matter for further design. The guru session ended on a somewhat unrelated note regarding the problems of getting data off mainframes and onto Linux machines – issues of fixed- vs. variable-length encoding of data blocks resulting in useful information being stripped by FTP, and problems in converting COBOL copybooks to something useful on a C-based architecture. To reiterate Hume's claim, there are homemade solutions for some of these problems, and the reader should feel free to contact Mr. Hume for them.

WRITING PORTABLE APPLICATIONS

Nick Stoughton MSB Consultants

Summarized by Josh Lothian

Nick Stoughton addressed a concern that developers are facing more and more: writing portable applications. He proposed that there is no such thing as a "portable application," only those that have been ported. Developers are still in search of the Holy Grail of applications programming: a write-once, compile-anywhere program. Languages such as Java are leading up to this, but virtual machines still have to be ported, paths will still be different, etc. Web-based applications are another area that could be promising for portable applications.

Nick went on to talk about the future of portability. The recently released POSIX 2001 standards are expected to help the situation. The 2001 revision expands the original POSIX spec into seven volumes. Even though POSIX 2001 is more specific than the previous release, Nick pointed out that even this standard allows for small differences where vendors may define their own behaviors. This can only hurt developers in the long run. Along these lines, it was pointed out that there may be room for automated tools that can check application code and determine the level of portability and standards compliance of the source.

NETWORK MANAGEMENT SYSTEM PERFORMANCE TUNING

Jeff Allen Tellme Networks Inc

Summarized by Matt Selsky

Jeff Allen, author of Cricket, discussed network tuning. The basic idea of tuning is: measure, twiddle, and measure again. Cricket can be used for large installations, but it's overkill for smaller installations, for which MRTG is better suited. You don't need Cricket to do measurement; doing something like a shell script that dumps data to Excel is fine, but you need to get some sort of measurement. Cricket was not designed for billing; it was meant to help answer questions.

Useful tools and techniques covered included looking for the difference between machines in order to identify the causes for the differences; measuring, but also thinking about what you're measuring and why; and remembering to use strace and tcpdump to identify problems. Problems repeat themselves; performance tuning requires an intricate understanding of all the layers involved to solve the problem and simplify the automation. (Having a computer science background helps but is not essential.) If you can reproduce the problem, you then should investigate each piece to find the bottleneck. If you can't reproduce the problem, then measure. You need to understand all the system interactions. Measurement can help determine whether things are actually slow or if the user is imagining it.

Troubleshooting begins with a hunch, but scientific processes are essential. You should be able to determine the event stream, or when each event occurs; having observation logs can help. Some hints provided include checking cron for periodic anomalies, slowing down /bin/rm to avoid I/O overload, and looking for unusual kernel CPU usage.

Also, try to use lightweight monitoring to reduce overhead. You don't need to monitor every resource on every system, but you should monitor those resources on those systems that are essential. Don't check things too often, since you can introduce overhead that way. Lots of information can be gathered from the /proc file system interface to the kernel.

BSD BOF

Host: Kirk McKusick, Author and Consultant

Summarized by Bosko Milekic

The BSD Birds of a Feather session at USENIX started off with five quick and informative talks and ended with an exciting discussion of licensing issues.

Christos Zoulas, for the NetBSD Project, went over the primary goals of the NetBSD Project (namely portability, a clean architecture, and security) and then proceeded to briefly discuss some of the challenges that the project has encountered. The successes and areas for improvement of 1.6 were then examined. NetBSD has recently seen various improvements in its cross-build system, packaging system, and third-party tool support. For what concerns the kernel, a zero-copy TCP/UDP implementation has been integrated, and new pmap code for arm32 has been introduced, as have some other rather major changes to various subsystems. More information is available in Christos's slides, which are now available at http://www.netbsd.org/gallery/events/usenix2002/.

Theo de Raadt of the OpenBSD Project presented a rundown of various new things that the OpenBSD and OpenSSH teams have been looking at. Theo's talk included an amusing description (and pictures) of some of the events that transpired during OpenBSD's hack-a-thon, which occurred the week before USENIX '02. Finally, Theo mentioned that following complications with the IPFilter license, the OpenBSD team had done a fairly extensive license audit.

Robert Watson of the FreeBSD Project brought up various examples of FreeBSD being used in the real world. He went on to describe a wide variety of changes that will surface with the upcoming FreeBSD release 5.0. Notably, 5.0 will introduce a re-architecting of the kernel aimed at providing much more scalable support for SMP machines. 5.0 will also include an early implementation of KSE, FreeBSD's version of scheduler activations, large improvements to pccard, and significant framework bits from the TrustedBSD project.

Mike Karels from Wind River Systems presented an overview of how BSD/OS has been evolving following the Wind River Systems acquisition of BSDi's software assets. Ernie Prabhakar from Apple discussed Apple's OS X and its success on the desktop. He explained how OS X aims to bring BSD to the desktop while striving to maintain a strong and stable core, one that is heavily based on BSD technology.

The BoF finished with a discussion on licensing; specifically, some folks questioned the impact of Apple's licensing for code from their core OS (the Darwin Project) on the BSD community as a whole.

CLOSING SESSION

HOW FLIES FLY

Michael H. Dickinson, University of California, Berkeley

Summarized by J.D. Welch

In this lively and offbeat talk, Dickinson discussed his research into the mechanisms of flight in the fruit fly (Drosophila melanogaster) and related the autonomic behavior to that of a technological system. Insects are the most successful organisms on earth, due in no small part to their early adoption of flight. Flies travel through space on straight trajectories interrupted by saccades, jumpy turning motions somewhat analogous to the fast, jumpy movements of the human eye. Using a variety of monitoring techniques, including high-speed video in a "virtual reality flight arena," Dickinson and his colleagues have observed that flies respond to visual cues to decide when to saccade during flight. For example, more visual texture makes flies saccade earlier (i.e., further away from the arena wall), described by Dickinson as a "collision control algorithm."

Through a combination of sensory inputs, flies can make decisions about their flight path. Flies possess a visual "expansion detector" which, at a certain threshold, causes the animal to turn a certain direction. However, expansion in front of the fly sometimes causes it to land. How does the fly decide? Using the virtual reality device to replicate various visual scenarios, Dickinson observed that the flies fixate on vertical objects that move back and forth across the field. Expansion on the left side of the animal causes it to turn right, and vice versa, while expansion directly in front of the animal triggers the legs to flare out in preparation for landing.

If the eyes detect these changes, how are the responses implemented? The flies' wings have "power muscles" controlled by mechanical resonance (as opposed to the nervous system), which drive the wings, combined with neurally activated "steering muscles," which change the configuration of the wing joints. Subtle variations in the timing of impulses correspond to changes in wing movement, controlled by sensors in the "wing-pit." A small wing-like structure, the haltere, controls equilibrium by beating constantly.

Voluntary control is accomplished by the halteres, whose signals can interfere with the autonomic control of the stroke cycle. The halteres have "steering muscles" as well, and information derived from the visual system can turn off the haltere or trick it into a "virtual" problem requiring a response.

Dickinson has also studied the aerodynamics of insect flight using a device called the "Robo Fly," an oversized mechanical insect wing suspended in a large tank of mineral oil. Interestingly, Dickinson observed that the average lift generated by the flies' wings is under its body weight; the flies use three mechanisms to overcome this, including rotating the wings (rotational lift).

Insects are extraordinarily robust creatures, and because Dickinson analyzed the problem in a systems-oriented way, these observations and analysis are immediately applicable to technology; the response system can be used as an efficient search algorithm for control systems in autonomous vehicles, for example.

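A toy illustration of the kind of control rule described above, written as a Python sketch: the thresholds, and the decision to turn away from the side with stronger expansion or to land when expansion is frontal, are simplifications of the talk's description, not Dickinson's actual model.

   def fly_controller(left_expansion, right_expansion, frontal_expansion,
                      saccade_threshold=1.0, landing_threshold=1.5):
       # Expansion straight ahead beyond a threshold triggers the landing reflex.
       if frontal_expansion > landing_threshold:
           return "extend legs / land"
       # Otherwise, strong expansion on one side triggers a saccade away from it.
       if max(left_expansion, right_expansion) > saccade_threshold:
           return "saccade right" if left_expansion > right_expansion else "saccade left"
       return "fly straight"
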
THE AFS WORKSHOP

Summarized by Garry Zacheiss, MIT

Love Hornquist-Astrand of the Arla development team presented the Arla status report. Arla 0.35.8 has been released. Scheduled for release soon are improved support for Tru64 UNIX, MacOS X, and FreeBSD; improved volume handling; and implementation of more of the vos/pts subcommands. It was stressed that MacOS X is considered an important platform, and that a GUI configuration manager for the cache manager, a GUI ACL manager for the finder, and a graphical login that obtains Kerberos tickets and AFS tokens at login time were all under development.

Future goals planned for the underlying AFS protocols include GSSAPI/SPNEGO support for Rx, performance improvements to Rx, an enhanced disconnected mode, and IPv6 support for Rx; an experimental patch is already available for the latter. Future Arla-specific goals include improved performance, partial file reading, and increased stability for several platforms. Work is also in progress for the RXGSS protocol extensions (integrating GSSAPI into Rx). A partial implementation exists, and work continues as developers find time.

Derrick Brashear of the OpenAFS development team presented the OpenAFS status report. OpenAFS was released immediately prior to the conference. OpenAFS 1.2.5 fixed a remotely exploitable denial-of-service attack in several OpenAFS platforms, most notably IRIX and AIX. Future work planned for OpenAFS includes better support for MacOS X, including working around Finder interaction issues. Better support for the BSDs is also planned: FreeBSD has a partially implemented client; NetBSD and OpenBSD have only server programs available right now. AIX 5, Tru64 5.1A, and MacOS X 10.2 (aka Jaguar) are all planned for the future. Other planned enhancements include nested groups in the ptserver (code donated by the University of Michigan awaits integration), disconnected AFS, and further work on a native Windows client. Derrick stated that the guiding principles of OpenAFS were to maintain compatibility with IBM AFS, support new platforms, ease the administrative burden, and add new functionality.

AFS performance was discussed. The openafs.org cell consists of two file servers, one in Stockholm, Sweden, and one in Pittsburgh, PA. AFS works reasonably well over trans-Atlantic links, although OpenAFS clients don't determine which file server to talk to very effectively; Arla clients use RTTs to the server to determine the optimal file server to fetch replicated data from. Modifications to OpenAFS to support this behavior in the future are desired.

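The Arla heuristic mentioned above, preferring the replica server with the lowest observed round-trip time, can be sketched in a few lines of Python. The probe below times a TCP connect as a crude stand-in; a real AFS client would measure Rx RPC round-trips, so treat the port and probe method as illustrative assumptions.

   import time, socket

   def rtt_probe(host, port=7000, timeout=1.0):
       # Crude RTT estimate: time how long a connection attempt takes.
       start = time.monotonic()
       try:
           with socket.create_connection((host, port), timeout=timeout):
               pass
       except OSError:
           return float("inf")      # unreachable servers sort last
       return time.monotonic() - start

   def pick_file_server(replica_servers):
       # Fetch replicated data from whichever server answers fastest.
       return min(replica_servers, key=rtt_probe)
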
Jimmy Engelbrecht and Harald Barth of KTH discussed their AFSCrawler script, written to determine how many AFS cells and clients were in the world, what implementations/versions they were (Arla vs. IBM AFS vs. OpenAFS), and how much data was in AFS. The script unfortunately triggered a bug in IBM AFS 3.6-derived code, causing some clients to panic while handling a specific RPC. This has since been fixed in OpenAFS 1.2.5 and the most recent IBM AFS patch level, and all AFS users are strongly encouraged to upgrade. No release of Arla is vulnerable to this particular denial-of-service attack. There was an extended discussion of the usefulness of this exploration. Many sites believed this was useful information and such scanning should continue in the future, but only on an opt-in basis.

Many sites face the problem of managing Kerberos/AFS credentials for batch-scheduled jobs. Specifically, most batch processing software needs to be modified to forward tickets as part of the batch submission process, renew tickets and tokens while the job is in the queue and for the lifetime of the job, and properly destroy credentials when the job completes. Ken Hornstein of NRL was able to pay a commercial vendor to support Kerberos 4/5 credential management in their product, although they did not implement AFS token management. MIT has implemented some of the desired functionality in OpenPBS and might be able to make it available to other interested sites.

Tools to simplify AFS administration were discussed, including:

AFS Balancer: A tool written by CMU to automate the process of balancing disk usage across all servers in a cell. Available from ftp://ftp.andrew.cmu.edu/pub/AFS-Tools/balance-1.1b.tar.gz.

Themis: Themis is KTH's enhanced version of the AFS tool "package" for updating files on local disk from a central AFS image. KTH's enhancements include allowing the deletion of files, simplifying the process of adding a file, and allowing the merging of multiple rule sets for determining which files are updated. Themis is available from the Arla CVS repository.

Stanford was presented as an example of a large AFS site. Stanford's AFS usage consists of approximately 14TB of data in AFS, in the form of approximately 100,000 volumes; 33TB of storage is available in their primary cell, ir.stanford.edu. Their file servers consist entirely of Solaris machines running a combination of Transarc 3.6 patch-level 3 and OpenAFS 1.2.x, while their database servers run OpenAFS 1.2.2. Their cell consists of 25 file servers, using a combination of EMC and Sun StorEdge hardware. Stanford continues to use the kaserver for their authentication infrastructure, with future plans to migrate entirely to an MIT Kerberos 5 KDC.

Stanford has approximately 3,400 clients on their campus, not including SLAC (Stanford Linear Accelerator); approximately 2,100 AFS clients from outside Stanford contact their cell every month. Their supported clients are almost entirely IBM AFS 3.6 clients, although they plan to release OpenAFS clients soon. Stanford currently supports only UNIX clients. There is some on-campus presence of Windows clients, but Stanford has never publicly released or supported it. They do intend to release and support the MacOS X client in the near future.

All Stanford students, faculty, and staff are assigned AFS home directories, with a default quota of 50MB, for a total of approximately 550GB of user home directories. Other significant uses of AFS storage are data volumes for workstation software (400GB) and volumes for course Web sites and assignments (100GB).

AFS usage at Intel was also presented. Intel has been an AFS site since 1994. They had bad experiences with the IBM 3.5 Linux client; their experience with OpenAFS on Linux 2.4.x kernels has been much better. They use and are satisfied with the OpenAFS IA64 Linux port. Intel has hundreds of OpenAFS 1.2.3 and 1.2.4 clients in many production cells accessing data stored on IBM AFS file servers. They have not encountered any interoperability issues. Intel has some concerns about OpenAFS: they would like to purchase commercial support for OpenAFS and to see OpenAFS support for HP-UX on both PA-RISC and Itanium hardware. HP-UX support is currently unavailable due to a specific HP-UX header file being unavailable from HP; this may be available soon. Intel has not yet committed to migrating their file servers to OpenAFS and are unsure if they will do so without commercial support.

Backups are a traditional topic of discussion at AFS workshops, and this time was no exception. Many users complain that the traditional AFS backup tools ("backup" and "butc") are complex and difficult to automate, requiring many home-grown scripts and much user intervention for error recovery. An additional complaint was that the traditional AFS tools do not support file- or directory-level backups and restores; data must be backed up and restored at the volume level.

Mitch Collinsworth of Cornell presented work to make AMANDA, the free backup software from the University of Maryland, suitable for AFS backups. Using AMANDA for AFS backups allows one to share AFS backup tapes with non-AFS backups, easily run multiple backups in parallel, automate error recovery, and provide a robust degraded mode that prevents tape errors from stopping backups altogether. Their implementation allows for full volume restores as well as individual directory and file restores. They have finished coding this work and are in the process of testing and documenting it.

Peter Honeyman of CITI at the University of Michigan spoke about work he has proposed to replace Rx with RPCSEC GSS in OpenAFS; this would allow AFS to use a TCP-based transport mechanism rather than the UDP-based Rx, and possibly gain better congestion control, dynamic adaptation, and fragmentation avoidance as a result. RPCSEC GSS uses the GSSAPI to authenticate SUN ONC RPC. RPCSEC GSS is transport agnostic, provides strong security, is a developing Internet standard, and has multiple open source implementations. Backward compatibility with existing AFS servers and clients is an important goal of this project.


CONQUEST: BETTER PERFORMANCE THROUGH A DISK/PERSISTENT-RAM HYBRID FILE SYSTEM

An-I A. Wang, Peter Reiher, and Gerald J. Popek, UCLA; Geoffrey M. Kuenning, Harvey Mudd College

Motivated by the declining cost of persistent RAM, the authors propose the Conquest file system, which stores all small files, metadata, executables, and shared libraries in persistent RAM; disks hold only the data content of remaining large files. Compared to alternatives such as caching and RAM file systems, Conquest has the advantages of efficiency, consistency, and reliability at a reduced cost. Using popular benchmarks, experiments show that Conquest incurs little overhead while achieving faster performance. Future work includes designing mechanisms for adjusting the file-size threshold dynamically and finding a better disk layout for large data blocks.

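The core placement rule, with small files and metadata living in persistent RAM and only the data of large files going to disk, can be sketched as follows in Python. The 1MB threshold and the store objects are hypothetical stand-ins, not values or interfaces taken from Conquest itself.

   SIZE_THRESHOLD = 1 << 20   # 1MB cutoff; Conquest adjusts this, here it is fixed

   def place_file(name, data, ram_store, disk_store, metadata_store):
       # Metadata always stays in persistent RAM, regardless of file size.
       metadata_store[name] = {"size": len(data)}
       if len(data) < SIZE_THRESHOLD:
           ram_store[name] = data        # small files are served at memory speed
       else:
           disk_store[name] = data       # only large-file data content hits the disk
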
EXPLOITING GRAY-BOX KNOWLEDGE OF BUFFER-CACHE MANAGEMENT

Nathan C. Burnett, John Bent, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin, Madison

Knowing what algorithm is used to manage the buffer cache is very important for improving application performance. However, there is currently no interface for finding this algorithm. This paper introduces Dust, a simple fingerprinting tool that is able to identify the buffer-cache replacement policy. Dust automatically identifies the cache size and replacement policy based on the configuring attributes of access orders: recency, frequency, and long-term history. Through simulation, Dust was able to distinguish between a variety of replacement algorithm policies found in the literature: FIFO, LRU, LFU, Clock, Segmented FIFO, 2Q, and LRU-K. Further experiments of fingerprinting real operating systems such as NetBSD, Linux, and Solaris show that Dust is able to identify the hidden cache replacement algorithm.

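As a much-simplified illustration of the fingerprinting idea (probe a black-box cache with an access pattern that separates insertion order from recency, then see which blocks survived), here is a Python sketch that distinguishes FIFO-like from LRU-like behavior. Dust's real probes also exercise frequency and history and detect the cache size, and a real tool infers residency from read timings rather than the contains() helper assumed here.

   def fingerprint(cache, cache_size):
       """cache exposes access(block) and, for this sketch only, contains(block)."""
       old = list(range(cache_size))
       for b in old:                      # fill the cache; insertion order 0..N-1
           cache.access(b)
       touched = old[: cache_size // 2]
       for b in touched:                  # re-touch the oldest half: most recent now
           cache.access(b)
       for b in range(cache_size, cache_size + cache_size // 2):
           cache.access(b)                # new blocks force about N/2 evictions
       survivors = {b for b in old if cache.contains(b)}
       if all(b in survivors for b in touched):
           return "LRU-like: recency saved the re-touched blocks"
       if not any(b in survivors for b in touched):
           return "FIFO-like: re-touching the oldest blocks did not help"
       return "inconclusive"
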

By knowing the underlying cache replacement policy, a cache-aware Web server can reschedule the requests on a cached-request-first policy to obtain a performance improvement. Experiments show that cache-aware scheduling improves average response time and system throughput.

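A minimal sketch of the cached-request-first idea: given a predictor for which files the now-known replacement policy says are still resident, serve those requests before the ones that will have to go to disk. The predictor interface is hypothetical.

   def schedule(requests, predicted_cached):
       """Order pending requests so likely cache hits are served first.

       requests         -- iterable of file names to serve
       predicted_cached -- callable(name) -> bool, driven by a model of the
                           kernel's replacement policy
       """
       hits = [r for r in requests if predicted_cached(r)]
       misses = [r for r in requests if not predicted_cached(r)]
       return hits + misses   # cheap requests finish quickly, cutting mean response time
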
OPERATING SYSTEMS (AND DANCING BEARS)

Summarized by Matt Butner

THE JX OPERATING SYSTEM

Michael Golm, Meik Felser, Christian Wawersich, and Jürgen Kleinöder, University of Erlangen-Nürnberg

The talk opened by emphasizing the need for and the practicality of a functional Java OS. Such implementation in the JX OS attempts to mimic the recent trend toward using highly abstracted languages in application development in order to create an OS as functional and powerful as one written in a lower-level language.

JX is a micro-kernel solution that uses separate JVMs for each entity of the kernel and, in some cases, for each application. The separated domains do not share objects and have no thread migration, and each domain implements its own code. The interesting part of the presentation was the discussion of system-level programming with Java; key areas discussed were memory management and interrupt handlers. The authors concluded by noting that their type-safe and modular system resulted in a robust system with great configuration flexibility and acceptable performance.

DESIGN EVOLUTION OF THE EROS SINGLE-LEVEL STORE

Jonathan S. Shapiro, Johns Hopkins University; Jonathan Adams, University of Pennsylvania

This presentation outlined current characteristics of file systems and some of their least desirable characteristics. It was based on the revival of the EROS single-level-store approach for current attempts to capitalize on system consistency and efficiency. Their solution did not relate directly to EROS but, rather, to the design and use of the constructs and notions on which EROS is based. By extending the mapping of the memory system to include the disk, systems are further able to ensure global persistence without regard for a disk structure. In explaining the need for absolute system consistency and an environment which, by definition, is not allowed to crash, the goal of having such an exhaustive design becomes clear. The cool part of this work is the availability of its design and architecture to the public.

THINK: A SOFTWARE FRAMEWORK FOR COMPONENT-BASED OPERATING SYSTEM KERNELS

Jean-Philippe Fassino, France Télécom R&D; Jean-Bernard Stefani, INRIA; Julia Lawall, DIKU; Gilles Muller, INRIA

This presentation discussed the need for component-based operating systems and respective structures to ensure flexibility for arbitrary-sized systems. Think provides a binding model that maps uniformed components for OS developers and architects to follow, ensuring consistent implementations for an arbitrary system size. However, the goal is not to force developers into a predefined kernel but to promote the use of certain components in varied ways.

Think's concentration is primarily on embedded systems, where short development time is necessary but is constrained by rigorous needs and limited resources. The need to build flexible systems in such an environment can be costly and implementation-specific, but Think creates an environment supported by the ability to dynamically load type-safe components. This allows for more flexible systems that retain functionality because of Think's ability to bind fine-grained components. Benchmarks revealed that dedicated micro-kernels can show performance improvements in comparison to monolithic kernels. Notable future work includes building components for low-end appliances and the development of a real-time OS component library.

BUILDING SERVICES

Summarized by Pradipta De

NINJA: A FRAMEWORK FOR NETWORK SERVICES

J. Robert von Behren, Eric A. Brewer, Nikita Borisov, Michael Chen, Matt Welsh, Josh MacDonald, Jeremy Lau, and David Culler, University of California, Berkeley

Robert von Behren presented Ninja, an ongoing project that aims to provide a framework for building robust and scalable Internet-based network services like Web-hosting, instant-messaging, email, and file-sharing applications. Robert drew attention to the difficulties of writing cluster applications: one has to take care of data consistency and issues of concurrency, and for robust applications there are problems related to fault tolerance. Ninja works as a wrapper to relieve the user of these problems. The goal of the project is building network services that are scalable and highly available, maintaining persistent data, providing graceful degradation, and supporting online evolution.

The use of clusters distinguishes this setup from generic distributed systems in terms of reliability and security, as well as providing a partition-free network. The next important feature of this project is the new programming model, which is more restrictive than a general programming model but still expressive enough to write most of the applications. This model, described as "single program multiple connection," uses intertask parallelism instead of multithreaded concurrency and is characterized by all nodes running the same program, with connections delegated to the nodes by a centralized connection manager (CM). A CM takes care of hiding the details of mapping an external connection to an internal node. Ninja also uses a design called "staged event-driven architecture," where each service is broken down into a set of stages connected by event queues. This architecture is suitable for a modular design and helps in graceful degradation by adaptive load shedding from the event queues.

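The staged, queue-connected structure described above can be sketched in a few lines of Python; the two stages and the load-shedding bound below are invented for illustration and are not Ninja's actual stages or API.

   from collections import deque

   class Stage:
       def __init__(self, name, handler, max_queue=1000):
           self.name, self.handler = name, handler
           self.queue = deque()
           self.max_queue = max_queue      # bound used for adaptive load shedding

       def enqueue(self, event):
           if len(self.queue) >= self.max_queue:
               return False                # shed load instead of collapsing
           self.queue.append(event)
           return True

       def run_once(self, next_stage=None):
           while self.queue:
               result = self.handler(self.queue.popleft())
               if next_stage is not None and result is not None:
                   next_stage.enqueue(result)

   # Hypothetical pipeline: parse a request, then build a reply.
   parse = Stage("parse", lambda raw: {"path": raw.strip()})
   reply = Stage("reply", lambda req: print("200 OK", req["path"]))
   parse.enqueue("/index.html")
   parse.run_once(next_stage=reply)
   reply.run_once()
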
The presentation concluded with examples of implementation and evaluation of a Web server and an email system called NinjaMail, to show the ease of authoring and the efficacy of using the Ninja framework for service development.

USING COHORT-SCHEDULING TO ENHANCE SERVER PERFORMANCE

James R. Larus and Michael Parkes, Microsoft Research

James Larus presented a new scheduling policy for increasing the throughput of server applications. Cohort scheduling batches execution of similar operations arising in different server requests. The usual programming paradigm in handling server requests is to spawn multiple concurrent threads and switch from one thread to another whenever a thread gets blocked for I/O. Since the threads in a server mostly execute unrelated pieces of code, the locality of reference is reduced, and hence the effectiveness of different caching mechanisms. One way to solve this problem is to throw in more hardware, but Larus presented a complementary view in which the program behavior is investigated and used to improve the performance.

The problem in this scheme is to identify pieces of code that can be batched together for processing. One simple way is to look at the next program counter values and use them to club threads together. However, this talk presented a new programming abstraction, "staged computation," which replaces the thread model with "stages." A stage is an abstraction for operations with similar behavior and locality. The StagedServer library can be used for programming in this model. It is a collection of C++ classes that implement staged computation and cohort scheduling on either a uniprocessor or multiprocessor. It can be used to define stages.

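To make the batching idea concrete, here is a small Python sketch (not the C++ StagedServer API) that collects pending operations by stage and then runs each cohort back-to-back, so the code and data for one kind of operation stay warm in the caches; the tuple layout is invented for the sketch.

   from collections import defaultdict

   def run_cohorts(pending):
       """pending: list of (stage_name, stage_function, request) tuples."""
       cohorts = defaultdict(list)
       for stage_name, stage_fn, request in pending:
           cohorts[(stage_name, stage_fn)].append(request)

       results = []
       for (stage_name, stage_fn), requests in cohorts.items():
           # Executing the same operation for many requests back-to-back keeps
           # its instructions and data structures resident in the caches.
           for request in requests:
               results.append(stage_fn(request))
       return results
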
The presentation showed the experimental evaluation of cohort scheduling over the thread-based model by implementing two servers: a Web server, which is mainly I/O bound, and a publish-subscribe server, which is mainly compute bound. The SURGE benchmark was used for the first experiment and the Fabret workload for the second. The results showed that the cohort-scheduling-based implementation gave a better throughput than the thread-based implementation at high loads.

NETWORK PERFORMANCE

Summarized by Xiaobo Fan

ETE: PASSIVE END-TO-END INTERNET SERVICE PERFORMANCE MONITORING

Yun Fu and Amin Vahdat, Duke University; Ludmila Cherkasova and Wenting Tang, HP Labs

This paper won the Best Student Paper award.

Ludmila Cherkasova began by listing several questions most Web service providers want answered in order to improve service quality. She reviewed the difficulties of making accurate and efficient end-to-end Web service measurement and the shortcomings of currently available approaches. They propose a passive trace-based architecture called EtE to monitor Web server performance on behalf of end users.

The first step is to collect network packets passively. The second module reconstructs all TCP connections and extracts HTTP transactions. To reconstruct Web page accesses, they first build a knowledge base indexed by client IP and URL and then group objects to the related Web pages they are embedded in. Statistical analysis is used to handle non-matched objects. EtE Monitor can generate three groups of metrics to measure Web service performance: response time, Web caching, and page abortion.

To demonstrate the benefits of EtE Monitor, Cherkasova talked about two case studies and, based on the calculated metrics, gave some insightful explanations about what's happening behind the variations of Web performance. Through validation, they claim their approach provides a very close approximation to the real scenario.

THE PERFORMANCE OF REMOTE DISPLAY MECHANISMS FOR THIN-CLIENT COMPUTING

S. Jae Yang, Jason Nieh, Matt Selsky, and Nikhil Tiwari, Columbia University

Noting the trend toward thin-client computing, the authors compared different techniques and design choices in measuring the performance of six popular thin-client platforms: Citrix MetaFrame, Microsoft Terminal Services, Sun Ray, Tarantella, VNC, and X. After pointing out several challenges in benchmarking thin clients, Yang proposed slow-motion benchmarking to achieve non-invasive packet monitoring and consistent visual quality. Basically, they insert delays between separate visual events in the benchmark applications on the server side so that the client's display update can catch up with the server's processing speed. The experiments are conducted on an emulated network over a range of network bandwidths.

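The essence of slow-motion benchmarking, spacing out the visual events on the server so that each client update can be captured and measured in isolation, can be sketched like this in Python; the event list, delay value, and replay function are placeholders rather than anything from the paper's actual benchmark scripts.

   import time

   def run_slow_motion(visual_events, replay_event, delay_seconds=2.0):
       """Replay benchmark events one at a time, separated by idle gaps.

       visual_events -- ordered list of opaque events (e.g., page loads, video frames)
       replay_event  -- callable that triggers one event on the server side
       """
       timings = []
       for event in visual_events:
           start = time.monotonic()
           replay_event(event)
           timings.append((event, time.monotonic() - start))
           # The idle gap lets the thin client finish drawing and lets packet
           # traces attribute every byte to exactly one visual event.
           time.sleep(delay_seconds)
       return timings
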
Their results show that thin clients can provide good performance for Web applications in LAN environments, but only some platforms performed well for the video benchmark. Pixel-based encoding may achieve better performance and bandwidth efficiency than high-level graphics. Display caching and compression should be used with care.

A MECHANISM FOR TCP-FRIENDLY TRANSPORT-LEVEL PROTOCOL COORDINATION

David E. Ott and Ketan Mayer-Patel, University of North Carolina, Chapel Hill

A revised transport-level protocol optimized for cluster-to-cluster (C-to-C) applications is what David Ott tries to explain in his talk. A C-to-C application class is identified as one set of processes communicating to another set of processes across a common Internet path. The fundamental problem with C-to-C applications is how to coordinate all C-to-C communication flows so that they share a consistent view of the common C-to-C network, adapt to changing network conditions, and cooperate to meet specific requirements. Aggregation points (APs) are placed at the first and last hop of the common data path to probe network conditions (latency, bandwidth, loss rate, etc.). To carry and transfer this information, a new protocol, the Coordination Protocol (CP), is inserted between the network layer (IP) and the transport layer (TCP, UDP, etc.). Ott illustrated how the CP header is updated and used when packets originate from a source and traverse through the local and remote AP to arrive at their destination, and how an AP maintains a per-cluster state table and detects network conditions. Through simulation results, this coordination mechanism appears effective in sharing common network resources among C-to-C communication flows.

STORAGE SYSTEMS

Summarized by Praveen Yalagandula

MY CACHE OR YOURS? MAKING STORAGE MORE EXCLUSIVE

Theodore M. Wong, Carnegie Mellon University; John Wilkes, HP Labs

Theodore Wong explained the inefficiency of current "inclusive" caching schemes in storage area networks: when a client accesses a block, the block is read from the disk and is cached at both the disk array cache and the client's cache. He then presented the concept of "exclusive" caching, where a block accessed by a client is only cached in that client's cache. On eviction from the client's cache, they have come up with a DEMOTE operation to move the data block to the tail of the array cache.

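A minimal Python sketch of the DEMOTE idea follows: the client keeps blocks in LRU order, the array gives up its copy of any block it hands to a client (keeping the two tiers exclusive), and a block evicted from the client is demoted back into the array cache instead of being silently dropped. Cache sizes, the LRU policy at both tiers, and the exact queue position of demoted blocks are simplifying assumptions of this sketch, not details from the paper.

   from collections import OrderedDict

   class ArrayCache:
       def __init__(self, capacity):
           self.capacity, self.blocks = capacity, OrderedDict()

       def take(self, block):
           return self.blocks.pop(block, None)    # hand the block to the client, drop our copy

       def demote(self, block, data):
           self.blocks[block] = data               # demoted blocks re-enter the array's queue
           if len(self.blocks) > self.capacity:
               self.blocks.popitem(last=False)

   class ClientCache:
       def __init__(self, capacity, array_cache):
           self.capacity, self.blocks = capacity, OrderedDict()
           self.array = array_cache

       def read(self, block, read_from_disk):
           if block in self.blocks:                # client hit
               self.blocks.move_to_end(block)
               return self.blocks[block]
           data = self.array.take(block)           # exclusive: at most one tier holds the block
           if data is None:
               data = read_from_disk(block)
           self.blocks[block] = data
           if len(self.blocks) > self.capacity:
               victim, vdata = self.blocks.popitem(last=False)
               self.array.demote(victim, vdata)    # DEMOTE instead of discarding
           return data
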
The new exclusive caching schemes are evaluated on both single-client and multiple-client systems. The single-client results are presented for two different types of synthetic workloads: Random (transaction-processing type workloads) and Zipf (typical Web workloads). The exclusive policy was quite effective in achieving higher hit rates and lower read latencies. They also showed a 2.2 times hit-rate improvement over inclusive techniques in the case of a real-life workload, httpd, with a single-client setting.

The DEMOTE scheme also performed well in the case of multiple-client systems when the data accessed by clients is disjointed. In the case of shared data workloads, this scheme performed worse than the inclusive schemes. Theodore then presented an adaptive exclusive caching scheme: a block accessed by a client is placed at the tail in the client's cache and also in the array cache at an appropriate place determined by the popularity of the block. The popularity of the block is measured by maintaining a ghost cache to accumulate the number of times each block is accessed. This new adaptive scheme has achieved a maximum of 52% speedup in the mean latency in the experiments with real-life workloads.

For more information, visit http://www.cs.cmu.edu/~tmwong/research and http://www.hpl.hp.com/research/itc/csl/ssp.

BRIDGING THE INFORMATION GAP IN STORAGE PROTOCOL STACKS

Timothy E. Denehy, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin, Madison

Currently, there is a huge information gap between storage systems and file systems. The interface exposed by storage systems to file systems, based on blocks and providing only read/write interfaces, is very narrow. This leads to poor performance as a whole, because of duplicated functionality in both systems and reduced functionality resulting from storage systems' lack of file information.

The speaker presented two enhancements: ExRAID, an exposed RAID, and ILFS, the Informed Log-Structured File System. ExRAID is an enhancement to the block-based RAID storage system that exposes the following three types of information to the file system: (1) regions, contiguous portions of the address space comprising one or multiple disks; (2) performance information about queue lengths and throughput of the regions, revealing disk heterogeneity to the file systems; and (3) failure information, dynamically updated information conveying the number of tolerable failures in each region.

ILFS allows online incremental expansion of the storage space, performs dynamic parallelism using ExRAID's performance information to perform well on heterogeneous storage systems, and provides a range of different mechanisms with different granularities for redundancy of files using ExRAID's region failure characteristics. These new features are added to LFS with only a 19% increase in the code size.

More information: http://www.cs.wisc.edu/wind.

MAXIMIZING THROUGHPUT IN REPLICATED DISK STRIPING OF VARIABLE BIT-RATE STREAMS

Stergios V. Anastasiadis, Duke University; Kenneth C. Sevcik and Michael Stumm, University of Toronto

There is an increasing demand for the continuous real-time streaming of media files. Disk striping is a common technique used for supporting many connections on media servers. But even with 1.2 million hours of mean time between failures, there will be more than one disk failure per week with 7,000 disks. Fault tolerance can be achieved by using data replication and reserving extra bandwidth during normal operation. This work focused on supporting the most common variable bit-rate stream formats (e.g., MPEG).


The authors used the prototype media server EXEDRA to implement and evaluate different replication and bandwidth reservation schemes. This system supports a variable bit-rate scheme, does stride-based disk allocation, and is capable of supporting variable-grain disk striping. Two replication policies are presented: deterministic, where data blocks are replicated in a round-robin fashion on the secondary replicas, and random, where data blocks are replicated on randomly chosen secondary replicas. Two bandwidth reservation techniques are also presented: mirroring reservation, where disk bandwidth is reserved for both the primary and the replicas of the media file during playback, and minimum reservation, a more efficient scheme in which bandwidth is reserved only for the sum of the primary data access time and the maximum of the backup data access times required in each round.

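The difference between the two reservation policies can be made concrete with a small Python sketch; the per-round access times below are invented numbers, and the two functions simply restate the policies described above under the assumption that per-round access time is the quantity being reserved.

   def mirroring_reservation(primary_time, backup_times):
       # Reserve bandwidth for the primary and for every replica, every round.
       return primary_time + sum(backup_times)

   def minimum_reservation(primary_time, backup_times):
       # Reserve only for the primary plus the single worst-case backup access.
       return primary_time + (max(backup_times) if backup_times else 0.0)

   round_primary = 4.0            # ms of disk time for primary data this round
   round_backups = [3.0, 2.5]     # ms for each replica's data this round
   print(mirroring_reservation(round_primary, round_backups))  # 9.5
   print(minimum_reservation(round_primary, round_backups))    # 7.0
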
Experimental results showed that deterministic replica placement is better than random placement for small disks, minimum disk bandwidth reservation is twice as good as mirroring in the throughput achieved, and fault tolerance can be achieved with a minimal impact on the throughput.

TOOLS

Summarized by Li Xiao

SIMPLE AND GENERAL STATISTICAL PROFILING WITH PCT

Charles Blake and Steven Bauer, MIT

This talk introduced the Profile Collection Toolkit (PCT), a sampling-based CPU profiling facility. A novel aspect of PCT is that it allows sampling of semantically rich data such as function call stacks, function parameters, local or global variables, CPU registers, or other execution contexts. The design objectives of PCT were driven by user needs and the inadequacies or inaccessibility of prior systems.

The rich data collection capability is achieved via a debugger-controller program, dbctl. The talk then introduced the profiling collector and data collection activation. PCT file format options provide flexibility and various ways to store exact states. PCT also has analysis options; two examples were given, "Simple" and "Mixed User and Linux Code."

The talk then went into how general sampling helps in the following areas: a debugger-controller state machine, example script fragments, a simple example program, and general value reports in terms of call-path histograms, numeric value histograms, and data generality. Related work such as gprof and Expect was then summarized. The contribution of this work is to provide a portable, general value sampling tool. The limitations are that PCT does not support much of strip, and count inaccuracies happen because of its statistical nature. Based on this limitation, -g is preferred over strip.

Possible future directions include supporting more debuggers, such as dbx and xdb, script generators for other kinds of traces, more canned reports for general values, and a libgdb-based library sampler. PCT is available at http://pdos.lcs.mit.edu/pct. PCT runs on almost any UNIX-like system.

ENGINEERING A DIFFERENCING AND COMPRESSION DATA FORMAT

David G. Korn and Kiem-Phong Vo, AT&T Laboratories – Research

This talk began with an equation: "Differencing + Compression = Delta Compression." Compression removes information redundancy in a data set; examples are gzip, bzip, compress, and pack. Differencing encodes differences between two data sets; examples are diff -e, fdelta, bdiff, and xdelta. Delta compression compresses a data set given another, combining differencing and compression, and reduces to pure compression when there's no commonality.

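A quick way to see the idea in action, without implementing the Vcdiff format itself, is zlib's preset-dictionary feature in Python: compressing a new version of a file with the old version as the dictionary is a (window-limited) form of delta compression, and it degrades toward plain compression when the two inputs share nothing. This is only an illustration of the concept, not the Vcdiff encoding.

   import zlib

   def delta_compress(target: bytes, reference: bytes) -> bytes:
       # The reference acts as a preset dictionary, so substrings shared with the
       # target can be encoded as back-references instead of literal bytes.
       comp = zlib.compressobj(level=9, zdict=reference)
       return comp.compress(target) + comp.flush()

   def delta_decompress(delta: bytes, reference: bytes) -> bytes:
       decomp = zlib.decompressobj(zdict=reference)
       return decomp.decompress(delta) + decomp.flush()

   old = b"the quick brown fox jumps over the lazy dog\n" * 20
   new = old.replace(b"lazy", b"sleepy")
   delta = delta_compress(new, old)
   assert delta_decompress(delta, old) == new
   # Compare sizes: raw, self-compressed, and delta-compressed against the old version.
   print(len(new), len(zlib.compress(new, 9)), len(delta))
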
After presenting an overall scheme for a delta compressor and showing delta compression performance, this talk discussed the encoding format of the newly designed Vcdiff for delta compression. A Vcdiff instruction code table consists of 256 entries, each coding up to a pair of instructions, and recodes 1-byte indices and any additional data.

The talk then showed Vcdiff's performance with Web data collected from CNN, compared with the W3C standard Gdiff, Gdiff+gzip, and diff+gzip, where Gdiff was computed using Vcdiff delta instructions. The results of two experiments, "First" and "Successive," were presented. In "First," each file is compressed against the first file collected; in "Successive," each file is compressed against the one in the previous hour. diff+gzip did not work well because diff is line-oriented; Vcdiff performed favorably compared with the other formats.

Vcdiff is part of the Vcodex package. Vcodex is a platform for all common data transformations: delta compression, plain compression, encryption, and transcoding (e.g., uuencode, base64). It is structured in three layers for maximum usability. The base library uses Disciplines and Methods interfaces.

The code for Vcdiff can be found at http://www.research.att.com/sw/tools. They are moving Vcdiff to the IETF standard as a comprehensive platform for transforming data. Please refer to http://www.ietf.org/internet-drafts/draft-korn-vcdiff-06.txt.

WHERE IN THE NET

Summarized by Amit Purohit

A PRECISE AND EFFICIENT EVALUATION OF THE PROXIMITY BETWEEN WEB CLIENTS AND THEIR LOCAL DNS SERVERS

Zhuoqing Morley Mao, University of California, Berkeley; Charles D. Cranor, Fred Douglis, Michael Rabinovich, Oliver Spatscheck, and Jia Wang, AT&T Labs – Research

Content Distribution Networks (CDNs) attempt to improve Web performance by delivering Web content to end users from servers located at the edge of the network. When a Web client requests content, the CDN dynamically chooses a server to route the request to, usually the one that is nearest the client. As CDNs have access only to the IP address of the local DNS server (LDNS) of the client, the CDN's authoritative DNS server maps the client's LDNS to a geographic region within a particular network and combines that with network and server load information to perform CDN server selection.

This method has two limitations. First, it is based on the implicit assumption that clients are close to their LDNS, which may not always be valid. Second, a single request from an LDNS can represent a differing number of Web clients, called the hidden load factor. Knowledge of the hidden load factor can be used to achieve better load distribution.

The extent of the first limitation and its impact on CDN server selection is dealt with. To determine the associations between clients and their LDNS, a simple, non-intrusive, and efficient mapping technique was developed. The data collected was used to study the impact of proximity on DNS-based server selection, using four different proximity metrics. (1) Autonomous system (AS) clustering, observing whether a client is in the same AS as its LDNS, concluded that LDNS is very good for coarse-grained server selection, as 64% of the associations belong to the same AS. (2) Network clustering, observing whether a client is in the same network-aware cluster (NAC), implied that DNS is less useful for finer-grained server selection, since only 16% of the client and LDNS pairs are in the same NAC. (3) Traceroute divergence, the length of the divergent paths to the client and its LDNS from a probe point using traceroute, implies that most clients are topologically close to their LDNS, as viewed from a randomly chosen probe site. (4) Round-trip time (RTT) correlation (some CDNs select servers based on RTT between the CDN server and the client's LDNS) examines the correlation between the message RTTs from a probe point to the client and its local DNS; the results imply this is a reasonable metric to use to avoid really distant servers.

A study of the impact that client-LDNS associations have on DNS-based server selection concludes that knowing the client's IP address would allow more accurate server selection in a large number of cases. The optimality of the server selection also depends on the server density, placement, and selection algorithms.

Further information can be found at http://www.eecs.berkeley.edu/~zmao/myresearch.html or by contacting zmao@eecs.berkeley.edu.

GEOGRAPHIC PROPERTIES OF INTERNET ROUTING

Lakshminarayanan Subramanian and Venkata N. Padmanabhan, Microsoft Research; Randy H. Katz, University of California at Berkeley

Geographic information can provide insights into the structure and functioning of the Internet, including interactions between different autonomous systems, by analyzing certain properties of Internet routing. It can be used to measure and quantify certain routing properties, such as circuitous routing, hot-potato routing, and geographic fault tolerance.

Traceroute has been used to gather the required data, and the GeoTrack tool has been used to determine the location of the nodes along each network path. This enables the computation of "linearized distances," which is the sum of the geographic distances between successive pairs of routers along the path.

In order to measure the circuitousness of a path, a metric called the "distance ratio" has been defined as the ratio of the linearized distance of a path to the geographic distance between the source and destination of the path. From the data, it has been observed that the circuitousness of a route depends on both the geographic and network location of the end hosts. A large value of the distance ratio enables us to flag paths that are highly circuitous, possibly (though not necessarily) because of routing anomalies. It is also shown that the minimum delay between end hosts is strongly correlated with the linearized distance of the path.

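The two quantities discussed above reduce to a few lines of Python; the great-circle helper and the sample coordinates are generic, and real measurements would come from GeoTrack-style geolocation of each traceroute hop.

   from math import radians, sin, cos, asin, sqrt

   def great_circle_km(a, b):
       # Haversine distance between two (latitude, longitude) points, in km.
       lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
       h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
       return 2 * 6371.0 * asin(sqrt(h))

   def linearized_distance(hop_locations):
       # Sum of geographic distances between successive routers on the path.
       return sum(great_circle_km(a, b) for a, b in zip(hop_locations, hop_locations[1:]))

   def distance_ratio(hop_locations):
       end_to_end = great_circle_km(hop_locations[0], hop_locations[-1])
       return linearized_distance(hop_locations) / end_to_end if end_to_end else float("inf")

   # Hypothetical path: Berkeley -> Seattle -> Chicago -> New York
   path = [(37.87, -122.27), (47.61, -122.33), (41.88, -87.63), (40.71, -74.01)]
   print(round(distance_ratio(path), 2))   # values well above 1 mean a circuitous route
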
Geographic information can be used to study various aspects of wide-area Internet paths that traverse multiple ISPs. It was found that end-to-end Internet paths tend to be more circuitous than intra-ISP paths, the cause for this being the peering relationships between ISPs. Also, paths traversing substantial distances within two or more ISPs tend to be more circuitous than paths largely traversing only a single ISP. Another finding is that ISPs generally employ hot-potato routing.

Geographic information can also be used to capture the fact that two seemingly unrelated routers can be susceptible to correlated failures. By using the geographic information of routers, we can construct a geographic topology of an ISP. Using this, we can find the tolerance of an ISP's network to a total network failure in a geographic region.

For further information, contact lakme@cs.berkeley.edu.

PROVIDING PROCESS ORIGIN INFORMATION TO AID IN NETWORK TRACEBACK

Florian P. Buchholz, Purdue University; Clay Shields, Georgetown University

Network traceback is currently limited because host audit systems do not maintain enough information to match incoming network traffic to outgoing network traffic. The talk presented an alternative: assigning origin information to every process and logging it during interactive login creation.

The current implementation concentrates mainly on interactive sessions, in which an event is logged when a new connection is established using SSH or Telnet. The method is effective and could successfully determine stepping stones and reliably detect the source of a DDoS attack. The speaker then talked about the related work done in this area, which led to discussion of the latest development in the area of "host causality."

The speaker ended with the limitations and future directions of the framework. This framework could be extended and could find many applications in the area of security. The current implementation doesn't take care of startup scripts and cron jobs, but incorporating the origin information in the FS could solve this problem. In the current implementation, logging is just implemented as a proof of concept. It could be made safe in many ways, and this could be another important aspect of future work.

PROGRAMMING

Summarized by Amit Purohit

CYCLONE: A SAFE DIALECT OF C

Trevor Jim, AT&T Labs – Research; Greg Morrisett, Dan Grossman, Michael Hicks, James Cheney, and Yanling Wang, Cornell University

Cyclone is designed to provide protection against attacks such as buffer overflows, format string attacks, and memory management errors. The current structure of C allows programmers to write vulnerable programs. Cyclone extends C so that it has the safety guarantees of Java while keeping C syntax, types, and semantics untouched.

The Cyclone compiler performs static analysis of the source code and inserts runtime checks into the compiled output at places where the analysis cannot determine that the code execution will not violate safety constraints. Cyclone imposes many restrictions to preserve safety, such as NULL checks; these checks do not exist in normal C.

The speaker then talked about some sample code written in Cyclone and how it tackles safety problems. Porting an existing C application to Cyclone is pretty easy, with fewer than 10% of the lines requiring change. The current implementation concentrates more on safety than on performance, so the performance penalty is substantial. Cyclone was able to find many lingering bugs. The final step of the project will be to write a compiler to convert a normal C program to an equivalent Cyclone program.

COOPERATIVE TASK MANAGEMENT WITHOUT MANUAL STACK MANAGEMENT

Atul Adya, Jon Howell, Marvin Theimer, William J. Bolosky, and John R. Douceur, Microsoft Research

The speaker described the definitions and motivations behind the distinct concepts of task management, stack management, I/O management, conflict management, and data partitioning. Conventional concurrent programming uses preemptive task management and exploits the automatic stack management of a standard language. In the second approach, cooperative tasks are organized as event handlers that yield control by returning control to the event scheduler, manually unrolling their stacks. In this project, they have adopted a hybrid approach that makes it possible for both stack management styles to coexist. Thus, programmers can code assuming one or the other of these stack management styles will operate, depending upon the application. The speaker also gave a detailed example of how to use their mechanism to switch between the two styles.

The talk then continued with the implementation. They were able to preempt many subtle concurrency problems by using cooperative task management. Paying a cost up-front to reduce a subtle race condition proved to be a good investment.

Though the choice of task management is fundamental, the choice of stack management can be left to individual taste. This project enables the use of any type of stack management in conjunction with cooperative task management.

IMPROVING WAIT-FREE ALGORITHMS FOR INTERPROCESS COMMUNICATION IN EMBEDDED REAL-TIME SYSTEMS

Hai Huang, Padmanabhan Pillai, and Kang G. Shin, University of Michigan

The main characteristic of a real-time, time-sensitive system is its predictable response time. But concurrency management creates hurdles to achieving this goal because of the use of locks to maintain consistency. To solve this problem, many wait-free algorithms have been developed, but these are typically high-cost solutions. By taking advantage of the temporal characteristics of the system, however, the time and space overhead can be reduced.

The speaker presented an algorithm for temporal concurrency control and applied this technique to improve three wait-free algorithms. A single-writer multiple-reader wait-free algorithm and a double-buffer algorithm were proposed.

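As background for the double-buffer algorithm mentioned above, here is a conceptual Python sketch of single-writer double buffering: the writer always fills the buffer readers are not using and then flips an index, so neither side blocks on a lock. This is a textbook illustration rather than the authors' transformed algorithm, and it leans on Python's atomic attribute assignment where real embedded code would use an explicit atomic store.

   class DoubleBuffer:
       def __init__(self, initial):
           self.buffers = [initial, initial]   # two slots: one stable, one in progress
           self.current = 0                    # index of the slot readers should use

       def write(self, value):
           spare = 1 - self.current            # writer works on the slot readers ignore
           self.buffers[spare] = value
           self.current = spare                # single atomic flip publishes the update

       def read(self):
           return self.buffers[self.current]   # readers never wait on the writer
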
Using their transformation mechanism, they achieved an improvement of 17–66% in ACET and a 14–70% reduction in memory requirements for IPC algorithms. This mechanism is extensible and can be applied to other non-blocking IPC algorithms as well. Future work involves reducing the synchronizing overheads in more general IPC algorithms with multiple-writer semantics.

MOBILITY

Summarized by Praveen Yalagandula

ROBUST POSITIONING ALGORITHMS FOR DISTRIBUTED AD-HOC WIRELESS SENSOR NETWORKS

Chris Savarese and Jan Rabaey, University of California, Berkeley; Koen Langendoen, Delft University of Technology

The Pico Radio Network comprises more than a hundred sensors, monitors, and actuators equipped with wireless connectivity, with the following properties: (1) no infrastructure; (2) computation in a distributed fashion instead of centralized computation; (3) dynamic topology; (4) limited radio range; (5) nodes that can act as repeaters; (6) devices of low cost and with low power; and (7) sparse anchor nodes, i.e., nodes with GPS capability. A positioning algorithm determines the geographical position of each node in the network.

In two-dimensional space, each node needs three reference positions to estimate its geographical position. There are two problems that make positioning difficult in the Pico Radio Network type setting: (1) a sparse anchor node problem and (2) a range error problem.

The two-phase approach that the authors have taken in solving the problem consists of (1) a hop-terrain algorithm: in this first phase, each node roughly guesses its location by the distance calculated using multi-hops to the anchor points; and (2) a refinement algorithm: in this second phase, each node uses its neighbors' positions to refine its own position estimate. To guarantee convergence, this approach uses confidence metrics, assigning a value of 1.0 for anchor nodes and 0.1 to start for other nodes, and increasing these with each iteration.

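A simplified sketch of the refinement phase in Python: each node nudges its estimate so its distances to neighbors better match the measured ranges, weighting neighbors by their confidence. The relaxation step and the confidence update rule here are illustrative simplifications, not the paper's actual least-squares refinement.

   from math import hypot

   def refine(position, confidence, neighbors, step=0.5):
       """neighbors: list of (neighbor_position, neighbor_confidence, measured_range)."""
       x, y = position
       dx = dy = weight_sum = 0.0
       for (nx, ny), n_conf, measured in neighbors:
           dist = hypot(x - nx, y - ny) or 1e-9
           error = dist - measured              # positive: we think we are too far away
           # Move along the line to the neighbor by a confidence-weighted
           # fraction of the range error.
           dx += n_conf * error * (nx - x) / dist
           dy += n_conf * error * (ny - y) / dist
           weight_sum += n_conf
       if weight_sum == 0:
           return position, confidence
       new_pos = (x + step * dx / weight_sum, y + step * dy / weight_sum)
       new_conf = min(1.0, confidence + 0.1)    # grow confidence as estimates settle
       return new_pos, new_conf
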
A simulation tool, OMNeT++, was used for both phases and in various scenarios. The results show that they achieved position errors of less than 33% in a scenario with 5% anchor nodes, an average connectivity of 7, and 5% range measurement error.

Guidelines for anchor node deployment are high connectivity (>10), a reasonable fraction of anchor nodes (>5%), and, for anchor placement, covered edges.

APPLICATION-SPECIFIC NETWORK MANAGEMENT FOR ENERGY-AWARE STREAMING OF POPULAR MULTIMEDIA FORMATS

Surendar Chandra, University of Georgia, and Amin Vahdat, Duke University

The main hindrance in supporting the increasing demand for mobile multimedia on PDAs is the battery capacity of the small devices. The idle-time power consumption is almost the same as the receive power consumption on typical wireless network cards (1319mW vs. 1425mW), while the sleep state consumes far less power (177mW). To let the system consume energy proportional to the stream quality, the network card should transition to the sleep state aggressively between each packet.

The speaker presented previous work where different multimedia streams were studied for PDAs: MS Media, Real Media, and Quicktime. It was found that if the inter-packet gap is predictable, then huge savings are possible, for example, about 80% savings in MS Media streams with few packet losses. The limitation of the IEEE 802.11b power-save mode comes into play when there are two or more nodes, producing a delay between the beacon time and when the node receives/sends packets. This badly affects the streams, since higher energy is consumed while waiting.

The authors propose traffic shaping for energy conservation, where this is done by a proxy such that packets arrive at regular intervals. This is achieved using a local proxy in the access point and a client-side proxy in the mobile host. Simulations show that traffic shaping reduces energy consumption and also reveal that the higher bandwidth streams have a lower energy metric (mJ/kB).

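The proxy-side idea, buffering the server's bursty packets and releasing them on a fixed schedule so the client can sleep between arrivals, can be sketched as follows in Python; the interval, queue, and send function are placeholders, not the actual access-point proxy.

   import time
   from collections import deque

   def shape_traffic(incoming, send_to_client, interval=0.1):
       """Release buffered packets at regular, predictable intervals (daemon loop).

       incoming       -- deque that the upstream side appends packets to
       send_to_client -- callable delivering one packet over the wireless hop
       interval       -- fixed gap (seconds) the client can count on for sleeping
       """
       next_slot = time.monotonic()
       while True:
           next_slot += interval
           if incoming:
               # One packet per slot: the client wakes, receives, and sleeps again.
               send_to_client(incoming.popleft())
           time.sleep(max(0.0, next_slot - time.monotonic()))
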
More information is at http://greenhouse.cs.uga.edu.

CHARACTERIZING ALERT AND BROWSE SERVICES OF MOBILE CLIENTS

Atul Adya, Paramvir Bahl, and Lili Qiu, Microsoft Research

Even though there is a dramatic increase in Internet access from wireless devices, there are not many studies done on characterizing this traffic. In this paper, the authors characterize the traffic observed on the MSN Web site with both notification and browse traces.


Around 33 million browsing accesses and about 325 million notification entries are present in the traces. Three types of analyses are done for each one of these two services: content analysis, concerning the most popular content categories and their distribution; popularity analysis, or the popularity distribution of documents; and user behavior analysis.

The analysis of the notification logs shows that document access rates follow a Zipf-like distribution, with most of the accesses concentrated on a small number of messages, and the accesses exhibit geographical locality: users from the same locality tend to receive similar notification content. The browser log analysis shows that a smaller set of URLs are accessed most times, though the access pattern does not fit any Zipf curve, and the highly accessed URLs remain stable. A correlation study between the notification and browsing services shows that wireless users have a moderate correlation of 0.12.

FREENIX TRACK SESSIONS

BUILDING APPLICATIONS

Summarized by Matt Butner

INTERACTIVE 3-D GRAPHICS APPLICATIONS FOR TCL

Oliver Kersting and Jürgen Döllner, Hasso Plattner Institute for Software Systems Engineering, University of Potsdam

The integration of 3-D image rendering functionality into a scripting language permits interactive and animated 3-D development and application without the formalities and precision demanded by low-level C/C++ graphics and visualization libraries. The large and complex C++ API of the Virtual Rendering System (VRS) can be combined with the conveniences of the Tcl scripting language. The mapping of class interfaces is done via an automated process and generates respective wrapper classes, all of which ensures complete API accessibility and functionality without any significant performance retribution. The general C++-to-Tcl mapper SWIG grants the necessary object and hierarchal abilities without the use of object-oriented Tcl extensions. The final API mapping techniques address C++ features such as classes, overloaded methods, enumerations, and inheritance relations. All are implemented in a proof-of-concept that maps the complete C++ API of the VRS to Tcl and are showcased in a complete interactive 3-D-map system.

THE AGFL GRAMMAR WORK LAB

Cornelis H.A. Koster and Erik Verbruggen, University of Nijmegen (KUN)

The growth and implementation of Natural Language Processing (NLP) is the cornerstone of the continued evolution and implementation of truly intelligent search machines and services. In part due to the growing collections of computer-stored, human-readable documents in the public domain, the implementation of linguistic analysis will become necessary for desirable precision and resolution. Subsequently, such tools and linguistic resources must be released into the public domain, and they have done so with the release of the AGFL Grammar Work Lab under the GNU Public License as a tool for linguistic research and the development of NLP-based applications.

The AGFL (Affix Grammars over a Finite Lattice) Grammar Work Lab meshes context-free grammars with finite set-valued features that are acceptable to a range of languages. In computer science terms, "Syntax rules are procedures with parameters and a non-deterministic execution." The English Phrases for Information Retrieval (EP4IR), released with the AGFL-GWL as a robust grammar of English, is an AGFL-GWL-generated English parser that outputs "Head/Modifier" frames. The sentences "CompanyX sponsored this conference" and "This conference was sponsored by CompanyX" both generate [CompanyX[sponsored conference]]. The hope is that public availability of such tools will encourage further development of grammar and lexical software systems.

ence]] The hope is that public availabil-ity of such tools will encourage furtherdevelopment of grammar and lexicalsoftware systems

SWILL: A SIMPLE EMBEDDED WEB SERVER LIBRARY

Sotiria Lampoudi and David M. Beazley, University of Chicago

This paper won the FREENIX Best Student Paper award.

SWILL (Simple Web Interface Link Library) is a simple Web server in the form of a C library whose development was motivated by a wish to give cool applications an interface to the Web. The SWILL library provides a simple interface that can be efficiently implemented for tasks that vary from flexible Web-based monitoring to software debugging and diagnostics. Though originally designed to be integrated with high-performance scientific simulation software, the interface is generic enough to allow for unbounded uses. SWILL is a single-threaded server relying upon non-blocking I/O through the creation of a temporary server which services I/O requests. SWILL does not provide SSL or cryptographic authentication but does have HTTP authentication abilities.

A fantastic feature of SWILL is its support for SPMD-style parallel applications which utilize MPI, proving valuable for Beowulf clusters and large parallel machines. Another practical application was the implementation of SWILL in a modified Yalnix emulator by University of Chicago Operating Systems courses, which utilized the added Yalnix functionality for OS development and debugging. SWILL requires minimal memory overhead and relies upon the HTTP/1.0 protocol.

NETWORK PERFORMANCE

Summarized by Florian Buchholz

LINUX NFS CLIENT WRITE PERFORMANCE

Chuck Lever, Network Appliance; Peter Honeyman, CITI, University of Michigan

Lever introduced a benchmark to measure an NFS client's write performance. Client performance is difficult to measure due to hindrances such as poor hardware or bandwidth limitations. Furthermore, measuring application performance does not identify weaknesses specifically at the client side. Thus, a benchmark was developed that tries to exercise only data transfers in one direction between server and application. For this purpose, the benchmark was based on the block sequential write portion of the Bonnie file system benchmark. Once a benchmark for NFS clients is established, it can be used to improve client performance.

The performance measurements were performed with an SMP Linux client and both a Linux NFS server and a Network Appliance F85 filer. During testing, a periodic jump in write latency time was discovered. This was due to a rather large number of pending write operations that were scheduled to be written after certain threshold values were exceeded. By introducing a separate daemon that flushes the cached write requests, the spikes could be eliminated, but as a result the average latency grows over time. The problem could be traced to a function that scans a linked list of write requests. After a hashtable was added to improve lookup performance, the latency improved considerably.

The improved client was then used to measure throughput against the two servers. A discrepancy between the Linux server and the filer test was noticed, and the reason for that was traced back to a global kernel lock that was unnecessarily held when accessing the network stack. After correcting this, performance improved further. However, the measurements showed that a client may run slower when paired with fast servers on fast networks. This is due to heavy client interrupt loads, more network processing on the client side, and more global kernel lock contention.

The source code of the project is available at http://www.citi.umich.edu/projects/nfs-perf/patches.

A STUDY OF THE RELATIVE COSTS OF NETWORK SECURITY PROTOCOLS

Stefan Miltchev and Sotiris Ioannidis, University of Pennsylvania; Angelos Keromytis, Columbia University

With the increasing need for security and integrity of remote network services, it becomes important to quantify the communication overhead of IPSec and compare it to alternatives such as SSH, SCP, and HTTPS.

For this purpose the authors set up three testing networks: direct link, two hosts separated by two gateways, and three hosts connecting through one gateway. Protocols were compared in each setup, and manual keying was used to eliminate connection setup costs. For IPSec, the different encryption algorithms AES, DES, 3DES, hardware DES, and hardware 3DES were used. In detail, FTP was compared to SFTP, SCP, and FTP over IPSec; HTTP to HTTPS and HTTP over IPSec; and NFS to NFS over IPSec and local disk performance.

The result of the measurements was that IPSec outperforms other popular encryption schemes. Overall, unencrypted communication was fastest, but in some cases, like FTP, the overhead can be small. The use of crypto hardware can significantly improve performance. For future work, the inclusion of setup costs, hardware-accelerated SSL, SFTP, and SSH was mentioned.

CONGESTION CONTROL IN LINUX TCP

Pasi Sarolahti, University of Helsinki; Alexey Kuznetsov, Institute for Nuclear Research at Moscow

Having attended the talk and read the paper, I am still unclear about whether the authors are merely describing the design decisions of TCP congestion control or whether they are actually the creators of that part of the Linux code. My guess leans toward the former.

In the talk, the speaker compared the TCP protocol congestion control measures according to IETF and RFC specifications with the actual Linux implementation, which does conform to the basic principles but nevertheless has differences. A specific emphasis was placed on retransmission mechanisms and the congestion window. Also, several TCP enhancements – the NewReno algorithm, Selective ACKs (SACK), Forward ACKs (FACK) – were discussed and compared.

In some instances Linux does not conform to the IETF specifications. The fast recovery does not fully follow RFC 2582, since the threshold for triggering retransmit is adjusted dynamically and the congestion window's size is not changed. Also, the round-trip-time estimator and the RTO calculation differ from RFC 2988, since it uses more conservative RTT estimates and a minimum RTO of 200ms. The performance measures showed that with the additional Linux-specific features enabled, slightly higher throughput, more steady data flow, and fewer unnecessary retransmissions can be achieved.
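As a rough illustration of the RTO difference mentioned above, the sketch below computes the retransmission timeout both ways; the smoothing constants follow RFC 2988, while the 200ms floor reflects the Linux behavior described in the talk. The structure is simplified for the example.

/* Simplified sketch of the RTO calculation difference described above.
   RFC 2988 computes RTO = SRTT + 4*RTTVAR with a 1-second lower bound;
   the Linux implementation discussed in the talk clamps to a 200ms
   minimum instead.  All values are in milliseconds. */
struct rtt_state {
    double srtt;     /* smoothed round-trip time  */
    double rttvar;   /* round-trip time variation */
};

static void rtt_sample(struct rtt_state *s, double r)
{
    double err = (s->srtt > r) ? s->srtt - r : r - s->srtt;
    s->rttvar = 0.75 * s->rttvar + 0.25 * err;    /* beta  = 1/4 */
    s->srtt   = 0.875 * s->srtt + 0.125 * r;      /* alpha = 1/8 */
}

static double rto_rfc2988(const struct rtt_state *s)
{
    double rto = s->srtt + 4.0 * s->rttvar;
    return (rto < 1000.0) ? 1000.0 : rto;   /* RFC 2988 minimum: 1 second */
}

static double rto_linux_style(const struct rtt_state *s)
{
    double rto = s->srtt + 4.0 * s->rttvar;
    return (rto < 200.0) ? 200.0 : rto;     /* 200ms floor, as described */
}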

XTREME XCITEMENT

Summarized by Steve Bauer

THE FUTURE IS COMING: WHERE THE X WINDOW SHOULD GO

Jim Gettys, Compaq Computer Corp.

Jim Gettys, one of the principal authors of the X Window System, outlined the near-term objectives for the system, primarily focusing on the changes and infrastructure required to enable replication and migration of X applications. Providing better support for this functionality would enable users to retrieve or duplicate X applications between their servers at home and work.


One interesting example of an application that currently is capable of migration and replication is Emacs. To create a new frame on DISPLAY, try "M-x make-frame-on-display <Return> DISPLAY <Return>".

However, technical challenges make replication and migration difficult in general. These include the "major headaches" of server-side fonts, the nonuniformity of X servers and screen sizes, and the need to appropriately retrofit toolkits. Authentication and authorization issues are obviously also important. The rest of the talk delved into some of the details of these interesting technical challenges.

HACKING IN THE KERNEL

Summarized by Hai Huang

AN IMPLEMENTATION OF SCHEDULER ACTIVATIONS ON THE NETBSD OPERATING SYSTEM

Nathan J. Williams, Wasabi Systems

Scheduler activation is an old idea. Basically, there are benefits and drawbacks to using solely kernel-level or user-level threading. Scheduler activation is able to combine the two layers of control to provide more concurrency in the system.

In his talk, Nathan gave a fairly detailed description of the implementation of scheduler activations in the NetBSD kernel. One important change in the implementation is to differentiate the thread context from the process context. This is done by defining a separate data structure for these threads and relocating some of the information that was embedded in the process context to these thread contexts. The stack was especially a concern, due to the upcall: special handling must be done to make sure that the upcall doesn't mess up the stack, so that the preempted user-level thread can continue afterwards. Lastly, Nathan explained that signals were handled by upcalls.
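The data-structure split described above might look roughly like the following sketch; the structure and field names are invented for illustration and are not NetBSD's actual definitions.

/* Illustrative sketch of separating per-thread state from the
   traditional per-process structure so that scheduler activations can
   manage several kernel contexts per process.  Not NetBSD's real code. */
struct lwp {                        /* one kernel-visible thread context   */
    struct proc    *l_proc;         /* back pointer to the owning process  */
    void           *l_kstack;       /* kernel stack, also used for upcalls */
    int             l_state;        /* running, blocked, preempted, ...    */
    struct lwp     *l_next;         /* sibling threads in this process     */
    /* saved registers, priority, ... */
};

struct proc {                       /* per-process state only              */
    int              p_pid;
    struct vmspace  *p_vm;          /* address space shared by all threads */
    struct lwp      *p_threads;     /* list of thread contexts             */
    /* credentials, signal actions, open files, ... */
};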


AUTHORIZATION AND CHARGING IN PUBLIC WLANS USING FREEBSD AND 802.1X

Pekka Nikander, Ericsson Research NomadicLab

802.1x standards are well known in the wireless community as link-layer authentication protocols. In this talk, Pekka explained some novel ways of using the 802.1x protocols that might be of interest to people on the move. It is possible to set up a public WLAN that would support various charging schemes via virtual tokens, which people can purchase or earn and later use.

This is implemented on FreeBSD using the netgraph utility. It is basically a filter in the link layer that would differentiate traffic based on the MAC address of the client node, which is either authenticated, denied, or let through. The overhead for this service is fairly minimal.

ACPI IMPLEMENTATION ON FREEBSD

Takanori Watanabe, Kobe University

ACPI (Advanced Configuration and Power Interface) was proposed as a joint effort by Intel, Toshiba, and Microsoft to provide a standard and finer-grained method of managing the power states of individual devices within a system. Such low-level power management is especially important for those mobile and embedded systems that are powered by fixed-capacity batteries.

Takanori explained that the ACPI specification is composed of three parts: tables, BIOS, and registers. He was able to implement some of the functionality of ACPI in a FreeBSD kernel. The ACPI Component Architecture was implemented by Intel, and it provides a high-level ACPI API to the operating system. Takanori's ACPI implementation is built upon this underlying layer of APIs.

ACCESS CONTROL

Summarized by Florian Buchholz

DESIGN AND PERFORMANCE OF THE OPENBSD STATEFUL PACKET FILTER (PF)

Daniel Hartmeier, Systor AG

Daniel Hartmeier described the new stateful packet filter (pf) that replaced IPFilter in the OpenBSD 3.0 release. IPFilter could no longer be included due to licensing issues, and thus there was a need to write a new filter making use of optimized data structures.

The filter rules are implemented as a linked list which is traversed from start to end. Two actions may be taken according to the rules, "pass" or "block": a "pass" action forwards the packet, and a "block" action will drop it. Where more than one rule matches a packet, the last rule wins. Rules that are marked as "final" will immediately terminate any further rule evaluation. An optimization called "skip-steps" was also implemented, where blocks of similar rules are skipped if they cannot match the current packet; these skip-steps are calculated when the rule set is loaded. Furthermore, a state table keeps track of TCP connections: only packets that match the sequence numbers are allowed. UDP and ICMP queries and replies are considered in the state table, where initial packets will create an entry for a pseudo-connection with a low initial timeout value. The state table is implemented using a balanced binary search tree. NAT mappings are also stored in the state table, whereas application proxies reside in user space. The packet filter also is able to perform fragment reassembly and to modulate TCP sequence numbers to protect hosts behind the firewall.
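To make the state-table idea concrete, here is a minimal sketch of the kind of connection key and ordering function a balanced-tree lookup implies; the field names and layout are illustrative assumptions, not pf's actual code.

/* Illustrative sketch of a connection-state key ordered for a balanced
   binary search tree, in the spirit of (but not copied from) pf's state
   table.  A lookup walks the tree comparing keys until it finds the
   entry matching the packet's connection. */
#include <stdint.h>

struct state_key {
    uint32_t src_addr, dst_addr;    /* IPv4 addresses                 */
    uint16_t src_port, dst_port;
    uint8_t  proto;                 /* TCP, UDP, ICMP                 */
};

struct state_entry {
    struct state_key key;
    uint32_t seq_lo, seq_hi;        /* acceptable TCP sequence window */
    uint32_t expire;                /* timeout for this connection    */
};

/* Total order over keys, usable as the comparison callback of an
   RB-tree or AVL implementation. */
static int state_cmp(const struct state_key *a, const struct state_key *b)
{
    if (a->src_addr != b->src_addr) return a->src_addr < b->src_addr ? -1 : 1;
    if (a->dst_addr != b->dst_addr) return a->dst_addr < b->dst_addr ? -1 : 1;
    if (a->src_port != b->src_port) return a->src_port < b->src_port ? -1 : 1;
    if (a->dst_port != b->dst_port) return a->dst_port < b->dst_port ? -1 : 1;
    if (a->proto    != b->proto)    return a->proto    < b->proto    ? -1 : 1;
    return 0;
}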

Pf was compared against IPFilter as well as Linux's Iptables by measuring throughput and latency with increasing traffic rates and different packet sizes. In a test with a fixed rule set size of 100, Iptables outperformed the other two filters, whose results were close to each other.

In a second test, where the rule set size was continually increased, Iptables consistently had about twice the throughput of the other two (which evaluate the rule set on both the incoming and outgoing interfaces). A third test compared only pf and IPFilter, using a single rule that created state in the state table, with a fixed-state entry size. Pf reached an overloaded state much later than IPFilter. The experiment was repeated with a variable-state entry size, and pf performed much better than IPFilter for a small number of states.

In general, rule set evaluation is expensive, and benchmarks only reflect extreme cases, whereas in real life other behavior should be observed. Furthermore, the benchmarks show that stateful filtering can actually improve performance, due to cheap state-table lookups as compared to rule evaluation.

ENHANCING NFS CROSS-ADMINISTRATIVE DOMAIN ACCESS

Joseph Spadavecchia and Erez Zadok, Stony Brook University

The speaker presented modifications to an NFS server that allow improved NFS access between administrative domains. A problem lies in the fact that NFS assumes a shared UID/GID space, which makes it unsafe to export files outside the administrative domain of the server. Design goals were to leave the protocol and existing clients unchanged, a minimum amount of server changes, flexibility, and increased performance.

To solve the problem, two techniques are utilized: "range-mapping" and "file-cloaking." Range-mapping maps IDs between client and server. The mapping is performed on a per-export basis and has to be manually set up in an export file. The mappings can be 1-1, N-N, or N-1. In file-cloaking, the server restricts file access based on UID/GID and special cloaking-mask bits. Here users can only access their own file permissions. The policy on whether or not the file is visible to others is set by the cloaking mask, which is logically ANDed with the file's protection bits.

File-cloaking only works, however, if the client doesn't hold cached copies of directory contents and file attributes. Because of this, the clients are forced to re-read directories by incrementing the mtime value of the directory each time it is listed.
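A minimal sketch of the per-export range-mapping idea is given below; the structure, ranges, and helper names are hypothetical, not the implementation presented in the talk.

/* Hypothetical sketch of per-export UID range-mapping between a
   client's ID space and the server's. */
#include <sys/types.h>

struct id_range_map {
    uid_t client_lo, client_hi;    /* contiguous range of client UIDs      */
    uid_t server_lo;               /* start of the corresponding server
                                      range (1-1/N-N); N-1 squashes the
                                      whole range to this single UID       */
    int   squash_to_one;           /* nonzero: N-1 mapping                 */
};

/* Translate a UID arriving from the client into the server's ID space. */
static uid_t map_client_uid(const struct id_range_map *maps, int nmaps,
                            uid_t cuid, uid_t anon_uid)
{
    int i;
    for (i = 0; i < nmaps; i++) {
        const struct id_range_map *m = &maps[i];
        if (cuid >= m->client_lo && cuid <= m->client_hi)
            return m->squash_to_one ? m->server_lo
                                    : m->server_lo + (cuid - m->client_lo);
    }
    return anon_uid;    /* unmapped IDs fall back to an anonymous user */
}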

To test the performance of the modified server, five different NFS configurations were evaluated: an unmodified NFS server was compared against one server with the modified code included but not used, one with only range-mapping enabled, one with only file-cloaking enabled, and one version with all modifications enabled. For each setup, different file system benchmarks were run. The results show only a small overhead when the modifications are used, generally an increase of below 5%. Another experiment tested the performance of the system with an increasing number of mapped or cloaked entries on a system. The results show that an increase from 10 to 1000 entries resulted in a maximum of about 14% cost in performance.

One member of the audience pointed out that if clients choose to ignore the changed mtimes from the server and thus still hold caches of the directory entries, the file-cloaking mechanism could be defeated. After a rather lengthy debate, the speaker had to concede that the model doesn't add any extra security. Another question was asked about scalability of the setup of range-mapping; the speaker referred to application-level tools that could be developed for that purpose.

The software is available at ftp://ftp.fsl.cs.sunysb.edu/pub/enf/.

ENGINEERING OPEN SOURCE SOFTWARE

Summarized by Teri Lampoudi

NINGAUI: A LINUX CLUSTER FOR BUSINESS

Andrew Hume, AT&T Labs – Research; Scott Daniels, Electronic Data Systems Corp.

Ningaui is a general-purpose, highly available, resilient architecture built from commodity software and hardware.

Emphatically, however, Ningaui is not a Beowulf. Hume calls the cluster design the "Swiss canton model," in which there are a number of loosely affiliated independent nodes with data replicated among them. Jobs are assigned by bidding and leases, and cluster services done as session-based servers are sited via generic job assignment. The emphasis is on keeping the architecture end-to-end, checking all work via checksums, and logging everything. The resilience and high availability required by their goal of 8-5 maintenance – vs. the typical 24-7 model where people get paged whenever the slightest thing goes wrong, regardless of the time of day – is achieved by job restartability. Finally, all computation is performed on local data, without the use of NFS or network-attached storage.

Hume's message is a hopeful one: despite the many problems encountered – things like kernel and service limits, auto-installation problems, TCP storms, and the like – the existence of source code and the paranoid practice of logging and checksumming everything has helped. The final product performs reasonably well, and it appears that the resilient techniques employed do make a difference. One drawback, however, is that the software mentioned in the paper is not yet available for download.

CPCMS: A CONFIGURATION MANAGEMENT SYSTEM BASED ON CRYPTOGRAPHIC NAMES

Jonathan S. Shapiro and John Vanderburgh, Johns Hopkins University

This paper won the FREENIX Best Paper award.

The basic notion behind the project is the fact that everyone has a pet complaint about CVS, and yet it is currently the configuration manager in most widespread use. Shapiro has unleashed an alternative. Interestingly, he did not begin but ended with the usual host of reasons why CVS is bad. The talk instead began abruptly by characterizing the job of a software configuration manager, continued by stating the namespaces which it must handle, and wrapped up with the challenges faced.

X MEETS Z: VERIFYING CORRECTNESS IN THE PRESENCE OF POSIX THREADS

Bart Massey, Portland State University; Robert T. Bauer, Rational Software Corp.

Massey delivered a humorous talk on the insight gained from applying Z formal specification notation to system software design, rather than the more informal analysis and design process normally used.

The story is told with respect to writing XCB, which replaces the Xlib protocol layer and is supposed to be thread-friendly. But where threads are concerned, deadlock avoidance becomes a hard problem that cannot be solved in an ad hoc manner, while full model checking is also too hard. In this case Massey resorted to Z specification to model the XCB lower layer, abstract away locking and data transfers, and locate fundamental issues. Essentially, the difficulties of searching the literature and locating information relevant to the problem at hand were overcome. As Massey put it, "the formal method saved the day."

FILE SYSTEMS

Summarized by Bosko Milekic

PLANNED EXTENSIONS TO THE LINUX EXT2/EXT3 FILESYSTEM

Theodore Y. Ts'o, IBM; Stephen Tweedie, Red Hat

The speaker presented improvements to the Linux Ext2 file system, with the goal of allowing for various expansions while striving to maintain compatibility with older code. Improvements have been facilitated by a few extra superblock fields that were added to Ext2 not long before the Linux 2.0 kernel was released. The fields allow for file system features to be added without compromising the existing setup; this is done by providing bits indicating the impact of the added features with respect to compatibility.


Namely, a file system with the "incompat" bit marked is not allowed to be mounted. Similarly, a "read-only" marking would only allow the file system to be mounted read-only.
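The sketch below shows the kind of compatibility-bit check this scheme implies; the field and constant names are invented for the example rather than taken from the kernel sources.

/* Illustrative sketch of the compatibility-bit scheme described above.
   Names and bit assignments are made up for the example; the real Ext2
   code defines its own flags. */
#include <stdint.h>

struct ext2_super_features {
    uint32_t compat;      /* unknown bits here are safe to ignore          */
    uint32_t ro_compat;   /* unknown bits here: allow read-only mount only */
    uint32_t incompat;    /* unknown bits here: refuse to mount at all     */
};

#define SUPPORTED_RO_COMPAT  0x00000003u   /* features this kernel knows */
#define SUPPORTED_INCOMPAT   0x00000001u

/* Returns 0 = mount normally, 1 = force read-only, -1 = refuse mount. */
static int check_features(const struct ext2_super_features *f, int want_write)
{
    if (f->incompat & ~SUPPORTED_INCOMPAT)
        return -1;                              /* unknown incompat feature   */
    if (want_write && (f->ro_compat & ~SUPPORTED_RO_COMPAT))
        return 1;                               /* writable mount not allowed */
    return 0;                                   /* compat bits never block    */
}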

Directory indexing replaces linear directory searches with a faster search using a fixed-depth tree and hashed keys. File system size can be dynamically increased, and the expanded inode, doubled from 128 to 256 bytes, allows for more extensions.

Other potential improvements were discussed as well, in particular pre-allocation for contiguous files, which allows for better performance in certain setups by pre-allocating contiguous blocks. Security-related modifications, extended attributes, and ACLs were mentioned. An implementation of these features already exists but has not yet been merged into the mainline Ext2/3 code.

RECENT FILESYSTEM OPTIMISATIONS ON FREEBSD

Ian Dowse, Corvil Networks; David Malone, CNRI, Dublin Institute of Technology

David Malone presented four important file system optimizations for the FreeBSD OS: soft updates, dirpref, vmiodir, and dirhash. It turns out that certain combinations of the optimizations (beautifully illustrated in the paper) may yield performance improvements of anywhere between 2 and 10 orders of magnitude for real-world applications.

All four of the optimizations deal with file system metadata. Soft updates allow for asynchronous metadata updates. Dirpref changes the way directories are organized, attempting to place child directories closer to their parents, thereby increasing locality of reference and reducing disk-seek times. Vmiodir trades some extra memory in order to achieve better directory caching. Finally, dirhash, which was implemented by Ian Dowse, changes the way in which entries are searched in a directory.

Specifically, FFS uses a linear search to find an entry by name; dirhash builds a hashtable of directory entries on the fly. For a directory of size n with a working set of m files, a search that in certain cases could have been O(n*m) has been reduced, due to dirhash, to effectively O(n + m).

FILESYSTEM PERFORMANCE AND SCALABILITY IN LINUX 2.4.17

Ray Bryant, SGI; Ruth Forester, IBM LTC; John Hawkes, SGI

This talk focused on performance evaluation of a number of file systems available and commonly deployed on Linux machines. Comparisons under various configurations of Ext2, Ext3, ReiserFS, XFS, and JFS were presented.

The benchmarks chosen for the data gathering were pgmeter, filemark, and the AIM Benchmark Suite VII. Pgmeter measures the rate of data transfer of reads/writes of a file under a synthetic workload. Filemark is similar to postmark in that it is an operation-intensive benchmark, although filemark is threaded and offers various other features that postmark lacks. AIM VII measures performance for various file-system-related functionalities; it offers various metrics under an imposed workload, thus stressing the performance of the file system not only under I/O load but also under significant CPU load.

Tests were run on three different setups: a small, a medium, and a large configuration. ReiserFS and Ext2 appear at the top of the pile for smaller and medium setups. Notably, XFS and JFS perform worse for smaller system configurations than the others, although XFS clearly appears to generate better numbers than JFS. It should be noted that XFS seems to scale well under a higher load. This was most evident in the large-system results, where XFS appears to offer the best overall results.

THINGS TO THINK ABOUT

Summarized by Bosko Milekic

SPEEDING UP KERNEL SCHEDULER BY REDUCING CACHE MISSES

Shuji Yamamura, Akira Hirai, Mitsuru Sato, Masao Yamamoto, Akira Naruse, and Kouichi Kumon, Fujitsu Labs

This was an interesting talk pertaining to the effects of cache coloring for task structures in the Linux kernel scheduler (Linux kernel 2.4.x). The speaker first presented some interesting benchmark numbers for the Linux scheduler, showing that as the number of processes on the task queue was increased, the performance decreased. The authors used some really nifty hardware to measure the number of bus transactions throughout their tests and were thus able to reasonably quantify the impact that cache misses had in the Linux scheduler.

Their experiments led them to implement a cache coloring scheme for task structures, which were previously aligned on 8KB boundaries and therefore were eventually being mapped to the same cache lines. This unfortunate placement of task structures in memory induced a significant number of cache misses as the number of tasks grew in the scheduler.

The implemented solution consisted of aligning task structures to more evenly distribute cached entries across the L2 cache. The result was inevitably fewer cache misses in the scheduler. Some negative effects were observed in certain situations; these were primarily due to more cache slots being used by task-structure data in the scheduler, thus forcing data previously cached there to be pushed out.
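A simplified sketch of cache coloring at allocation time follows; the block size matches the 8KB alignment mentioned above, but the coloring policy, sizes, and function names are illustrative only.

/* Simplified sketch of cache coloring for fixed-size kernel structures.
   Instead of placing every task structure at the same offset within an
   8KB-aligned block (so they all compete for the same L2 cache sets),
   each allocation is shifted by a per-object "color" offset.  A real
   allocator would also remember the block base for freeing. */
#include <stdlib.h>

#define BLOCK_SIZE  8192
#define CACHE_LINE  64
#define NUM_COLORS  16              /* spread objects over 16 cache lines */

static unsigned next_color = 0;

void *alloc_task_struct(void)
{
    void *block;
    unsigned color;

    if (posix_memalign(&block, BLOCK_SIZE, BLOCK_SIZE) != 0)
        return NULL;
    color = next_color++ % NUM_COLORS;
    /* The object starts 'color' cache lines into the block, so tasks with
       different colors map to different L2 cache sets. */
    return (char *)block + (size_t)color * CACHE_LINE;
}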

WORK-IN-PROGRESS REPORTS

Summarized by Brennan Reynolds

RESOURCE VIRTUALIZATION TECHNIQUES FOR WIDE-AREA OVERLAY NETWORKS

Kartik Gopalan, Stony Brook University

This work addressed the issue of provisioning a maximum number of virtual overlay networks (VONs) with diverse quality of service (QoS) requirements on a single physical network. Each of the VONs is logically isolated from others to ensure the QoS requirements. Gopalan mentioned several critical research issues with this problem that are currently being investigated. Dealing with how to provision the network at various levels (link, route, or path) and then enforce the provisioning at run-time is one of the toughest challenges. Currently, Gopalan has developed several algorithms to handle admission control, end-to-end QoS, route selection, scheduling, and fault tolerance in the network.

For more information, visit http://www.ecsl.cs.sunysb.edu/.

VISUALIZING SOFTWARE INSTABILITY

Jennifer Bevan, University of California, Santa Cruz

Detection of instability in software has typically been an afterthought. The point in the development cycle when the software is reviewed for instability is usually after it is difficult and costly to go back and perform major modifications to the code base. Bevan has developed a technique to allow the visualization of unstable regions of code that can be used much earlier in the development cycle. Her technique creates a time series of dependency graphs that include clusters and lines called fault lines. From the graphs, a developer is able to easily determine where the unstable sections of code are and proactively restructure them. She is currently working on a prototype implementation.

For more information, visit http://www.cse.ucsc.edu/~jbevan/evo_viz/.

RELIABLE AND SCALABLE PEER-TO-PEER WEB DOCUMENT SHARING

Li Xiao, College of William and Mary

The idea presented by Xiao would allow end users to share the content of their Web browser caches with neighboring Internet users. The rationale for this is that today's end users are increasingly connected to the Internet over high-speed links, and the browser caches are becoming large storage systems. Therefore, if an individual accesses a page which does not exist in their local cache, Xiao is suggesting that they first query other end users for the content before trying to access the machine hosting the original. This strategy does have some serious problems associated with it that still need to be addressed, including ensuring the integrity of the content and protecting the identity and privacy of the end users.

SEREL: FAST BOOTING FOR UNIX

Leni Mayo, Fast Boot Software

Serel is a tool that generates a visual representation of a UNIX machine's boot-up sequence. It can be used to identify the critical path and can show if a particular service or process blocks for an extended period of time. This information could be used to determine where the largest performance gains could be realized by tuning the order of execution at boot-up. Serel creates a dependency graph, expressed in XML, during boot-up. This graph is then used to create the visual representation. Currently the tool only works on POSIX-compliant systems, but Mayo stated that he would be porting it to other platforms. Other extensions that were mentioned included having the metadata include the use of shared libraries and monitoring the suspend/resume sequence of portable machines.

For more information, visit http://www.fastboot.org/.


BERKELEY DB XML

John Merrells, Sleepycat Software

Merrells gave a quick introduction and overview of the new XML library for Berkeley DB. The library specializes in storage and retrieval of XML content through a tool called XPath. The library allows for multiple containers per document and stores everything natively as XML. The user is also given a wide range of elements to create indices with, including edges, elements, text strings, or presence. The XPath tool consists of a query parser, generator, optimizer, and execution engine. To conclude his presentation, Merrells gave a live demonstration of the software.

For more information, visit http://www.sleepycat.com/xml/.

CLUSTER-ON-DEMAND (COD)

Justin Moore, Duke University

Modern clusters are growing at a rapid rate. Many have pushed beyond the 5000-machine mark, and deploying them results in large expenses as well as management and provisioning issues. Furthermore, if the cluster is "rented" out to various users, it is very time-consuming to configure it to a user's specs, regardless of how long they need to use it. The COD work presented creates dynamic virtual clusters within a given physical cluster. The goal was to have a provisioning tool that would automatically select a chunk of available nodes and install the operating system and middleware specified by the customer in a short period of time. This would allow a greater use of resources, since multiple virtual clusters can exist at once. By using a virtual cluster, the size can be changed dynamically. Moore stated that they have created a working prototype and are currently testing and benchmarking its performance.

For more information, visit http://www.cs.duke.edu/~justin/cod/.


CATACOMB

Elias Sinderson, University of California, Santa Cruz

Catacomb is a project to develop a database-backed DASL module for the Apache Web server. It was designed as a replacement for WebDAV. The initial release of the module only contains support for a MySQL database but could be extended to others. Sinderson briefly touched on the performance of the module: it was comparable to the mod_dav Apache module for all query types but search. The presentation was concluded with remarks about adding support for the lock method and including ACL specifications in the future.

For more information, visit http://ocean.cse.ucsc.edu/catacomb/.

SELF-ORGANIZING STORAGE

Dan Ellard, Harvard University

Ellard's presentation introduced a storage system that tuned itself based on the workload of the system, without requiring the intervention of the user. The intelligence was implemented as a virtual self-organizing disk that resides below any file system. The virtual disk observes the access patterns exhibited by the system and then attempts to predict what information will be accessed next. An experiment was done using an NFS trace at a large ISP on one of their email servers. Ellard's self-organizing storage system worked well, which he attributes to the fact that most of the files being requested were large email boxes. Areas of future work include exploration of the length and detail of the data collection stage as well as the CPU impact of running the virtual disk layer.

VISUAL DEBUGGING

John Costigan, Virginia Tech

Costigan feels that the state of current debugging facilities in the UNIX world is not as good as it should be. He proposes the addition of several elements to programs like ddd and gdb. The first addition is including separate heap and stack components.

Another is to display only certain fields of a data structure and have the ability to zoom in and out if needed. Finally, the debugger should provide the ability to visually present complex data structures, including linked lists, trees, etc. He said that a beta version of a debugger with these abilities is currently available.

For more information, visit http://infovis.cs.vt.edu/datastruct/.

IMPROVING APPLICATION PERFORMANCE THROUGH SYSTEM-CALL COMPOSITION

Amit Purohit, Stony Brook University

Web servers perform a huge number of context switches and internal data copying during normal operation. These two elements can drastically limit the performance of an application, regardless of the hardware platform it is run on. The Compound System Call (CoSy) framework is an attempt to reduce the performance penalty for context switches and data copies. It includes a user-level library and several kernel facilities that can be used via system calls. The library provides the programmer with a complete set of memory-management functions. Performance tests using the CoSy framework showed a large saving for context switches and data copies. The only area where the savings between conventional libraries and CoSy were negligible was for very large file copies.

For more information, visit http://www.cs.sunysb.edu/~purohit/.

ELASTIC QUOTAS

John Oscar, Columbia University

Oscar began by stating that most resources are flexible but that, to date, disks have not been. While approximately 80 percent of files are short-lived, disk quotas are hard limits imposed by administrators. The idea behind elastic quotas is to have a non-permanent storage area that each user can temporarily use. Oscar suggested the creation of an ehome directory structure to be used in deploying elastic quotas.

Currently he has implemented elastic quotas as a stackable file system. There were also several trends apparent in the data: many users would ssh to their remote host (good) but then launch an email reader on the local machine which would connect via POP to the same host and send the password in clear text (bad). This situation can easily be remedied by tunneling protocols like POP through SSH, but it appeared that many people were not aware this could be done. While most of his comments on the use of the wireless network were negative, the list of passwords he had collected showed that people were indeed using strong passwords. His recommendations were to educate and encourage people to use protocols like IPSec, SSH, and SSL when conducting work over a wireless network, because you never know who else is listening.

SYSTRACE

Niels Provos, University of Michigan

How can one be sure the applications one is using actually do exactly what their developers said they do? Short answer: you can't. People today are using a large number of complex applications, which means it is impossible to check each application thoroughly for security vulnerabilities. There are tools out there that can help, though. Provos has developed a tool called systrace that allows a user to generate a policy of acceptable system calls a particular application can make. If the application attempts to make a call that is not defined in the policy, the user is notified and allowed to choose an action. Systrace includes an auto-policy generation mechanism, which uses a training phase to record the actions of all programs the user executes. If the user chooses not to be bothered by applications breaking policy, systrace allows default enforcement actions to be set. Currently systrace is implemented for FreeBSD and NetBSD, with a Linux version coming out shortly.
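The enforcement model described above can be pictured with the small sketch below; it illustrates the idea of a per-application syscall policy with a default action, and is not systrace's implementation or policy syntax.

/* Minimal sketch of the policy model described above: each system call
   is looked up in a per-application policy; unknown calls fall back to a
   default action or a user prompt.  Illustration of the idea only. */
#include <string.h>

enum action { PERMIT, DENY, ASK_USER };

struct policy_entry {
    const char *syscall;    /* e.g., "open", "connect" */
    enum action act;
};

static const struct policy_entry web_browser_policy[] = {
    { "read",    PERMIT },
    { "write",   PERMIT },
    { "open",    ASK_USER },   /* prompt so the user can refine the policy */
    { "connect", PERMIT },
    { "execve",  DENY },
};

static enum action check_syscall(const struct policy_entry *p, int n,
                                 const char *syscall, enum action dflt)
{
    for (int i = 0; i < n; i++)
        if (strcmp(p[i].syscall, syscall) == 0)
            return p[i].act;
    return dflt;   /* default enforcement action when nothing matches */
}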

For more information, visit http://www.citi.umich.edu/u/provos/systrace/.

VERIFIABLE SECRET REDISTRIBUTION FOR SURVIVABLE STORAGE SYSTEMS

Ted Wong, Carnegie Mellon University

Wong presented a protocol that can be used to re-create a file distributed over a set of servers, even if one of the servers is damaged. In his scheme, the user must choose how many shares the file is split into and the number of servers it will be stored across. The goal of this work is to provide persistent, secure storage of information even if it comes under attack. Wong stated that his design of the protocol was complete, and he is currently building a prototype implementation of it.

For more information, visit http://www.cs.cmu.edu/~tmwong/research/.

THE GURU IS IN SESSIONS

LINUX ON LAPTOP/PDA

Bdale Garbee, HP Linux Systems Operation

Summarized by Brennan Reynolds

This was a roundtable-like discussion, with Garbee acting as moderator. He is currently in charge of the Debian distribution of Linux and is involved in porting Linux to platforms other than i386. He has successfully helped port the kernel to the Alpha, SPARC, and ARM, and is currently working on a version to run on VAX machines.

A discussion was held on which file system is best for battery life. The Reiser and Ext3 file systems were discussed; people commented that when using Ext3 on their laptops the disk would never spin down and thus was always consuming a large amount of power. One suggestion was to use a RAM-based file system for any volume that requires a large number of writes and to only have the file system written to disk at infrequent intervals or when the machine is shut down or suspended.

A question was directed to Garbee about his knowledge of how the HP-Compaq merger would affect Linux. Garbee thought that the merger was good for Linux, both within the company and for the open source community as a whole. He stated that the new company would be the number one shipper of systems with Linux as the base operating system. He also fielded a question about the support of older HP servers and their ability to run Linux, saying that indeed Linux has been ported to them and he personally had several machines in his basement running it.

A question which sparked a large number of responses concerned problems people had with running Linux on mobile platforms. Most people actually did not have many problems at all. There were a few cases of a particular machine model not working, but there were only two widespread issues: docking stations and the new ACPI power management scheme. Docking stations still appear to be somewhat of a headache, but most people had developed ad hoc solutions for getting them to work. The ACPI power management scheme developed by Intel does not appear to have a quick solution. Ted Ts'o, one of the head kernel developers, stated that there are fundamental architectural problems with the 2.4 series kernel that do not easily allow ACPI to be added. However, he also stated that ACPI is supported in the newest development kernel, 2.5, and will be included in 2.6.

The final topic was the possibility of purchasing a laptop without paying any charge/tax for Microsoft Windows. The consensus was that currently this is not possible. Even if the machine comes with Linux or without any operating system, the vendors have contracts in place which require them to pay Microsoft for each machine they ship if that machine is capable of running a Microsoft operating system. The only way this will change is if the demand for other operating systems increases to the point that vendors renegotiate their contracts with Microsoft, and no one saw this happening in the near future.

LARGE CLUSTERS

Andrew Hume, AT&T Labs – Research

Summarized by Teri Lampoudi

This was a session on clusters, big data, and resilient computing. Given that the audience was primarily interested in clusters and not necessarily big data or beating nodes to a pulp, the conversation revolved mainly around what it takes to make a heterogeneous cluster resilient. Hume also presented a FREENIX paper on his current cluster project, which explained in more detail much of what was abstractly claimed in the guru session.

Hume's use of the word "cluster" referred not to what I had assumed would be a Beowulf-type system, but to a loosely coupled farm of machines in which concurrency was much less of an issue than it would be in a Beowulf. In fact, the architecture Hume described was designed to deal with large amounts of independent transaction processing, essentially the process of billing calls, which requires no interprocess message passing or anything of the sort a parallel scientific application might.

Where does the large data come in? Since the transactions in question consist of large numbers of flat files, a mechanism for getting the files onto nodes and off-loading the results is necessary. In this particular cluster, named Ningaui, this task is handled by the "replication manager," which generated a large amount of interest in the audience. All management functions in the cluster, as well as job allocation, constitute services that are to be bid for and leased out to nodes. Furthermore, failures are handled by simply repeating the failed task until it is completed properly, if that is possible, and if it is not, then looking through detailed logs to discover what went wrong.


Logging and testing for the successful completion of each task are what make this model resilient. Every management operation is logged at some appropriate level of detail, generating 120MB/day, and every single file transfer is md5 checksummed. This was characterized by Hume as a "patient" style of management, where no one controls the cluster, scheduling behavior is emergent rather than stringently planned, and nodes are free to drop in and out of the cluster; a downside to this model is the increase in latency, offset by the fact that the cluster continues to function unattended outside of 8-to-5 support hours.

In response to questions about the projected scalability of such a scheme, Hume said the system would presumably scale from the present 60 nodes to about 100 without modifications, but that scaling higher would be a matter for further design. The guru session ended on a somewhat unrelated note regarding the problems of getting data off mainframes and onto Linux machines – issues of fixed- vs. variable-length encoding of data blocks resulting in useful information being stripped by FTP, and problems in converting COBOL copybooks to something useful on a C-based architecture. To reiterate Hume's claim, there are homemade solutions for some of these problems, and the reader should feel free to contact Mr. Hume for them.

WRITING PORTABLE APPLICATIONS

Nick Stoughton, MSB Consultants

Summarized by Josh Lothian

Nick Stoughton addressed a concern that developers are facing more and more: writing portable applications. He proposed that there is no such thing as a "portable application," only those that have been ported. Developers are still in search of the Holy Grail of applications programming: a write-once, compile-anywhere program. Languages such as Java are leading up to this, but virtual machines still have to be ported, paths will still be different, etc. Web-based applications are another area that could be promising for portable applications.

Nick went on to talk about the future of portability. The recently released POSIX 2001 standards are expected to help the situation; the 2001 revision expands the original POSIX spec into seven volumes. Even though POSIX 2001 is more specific than the previous release, Nick pointed out that even this standard allows for small differences where vendors may define their own behaviors. This can only hurt developers in the long run. Along these lines, it was pointed out that there may be room for automated tools that have the capability to check application code and determine the level of portability and standards compliance of the source.

NETWORK MANAGEMENT SYSTEM PERFORMANCE TUNING

Jeff Allen, Tellme Networks, Inc.

Summarized by Matt Selsky

Jeff Allen, author of Cricket, discussed network tuning. The basic idea of tuning is: measure, twiddle, and measure again. Cricket can be used for large installations, but it's overkill for smaller installations, for which MRTG is better suited. You don't need Cricket to do measurement; doing something like a shell script that dumps data to Excel is good, but you need to get some sort of measurement. Cricket was not designed for billing; it was meant to help answer questions.

Useful tools and techniques covered included looking for the difference between machines in order to identify the causes for the differences; measuring, but also thinking about what you're measuring and why; and remembering to use strace and tcpdump to identify problems. Problems repeat themselves; performance tuning requires an intricate understanding of all the layers involved to solve the problem and simplify the automation. (Having a computer science background helps but is not essential.)

If you can reproduce the problem, you then should investigate each piece to find the bottleneck. If you can't reproduce the problem, then measure. You need to understand all the system interactions. Measurement can help determine whether things are actually slow or if the user is imagining it.

Troubleshooting begins with a hunch, but scientific processes are essential. You should be able to determine the event stream, or when each event occurs; having observation logs can help. Some hints provided include checking cron for periodic anomalies, slowing down /bin/rm to avoid I/O overload, and looking for unusual kernel CPU usage.

Also, try to use lightweight monitoring to reduce overhead. You don't need to monitor every resource on every system, but you should monitor those resources on those systems that are essential. Don't check things too often, since you can introduce overhead that way. Lots of information can be gathered from the /proc file system interface to the kernel.

BSD BOF

Host: Kirk McKusick, Author and Consultant

Summarized by Bosko Milekic

The BSD Birds of a Feather session at USENIX started off with five quick and informative talks and ended with an exciting discussion of licensing issues.

Christos Zoulas, for the NetBSD Project, went over the primary goals of the NetBSD Project (namely portability, a clean architecture, and security) and then proceeded to briefly discuss some of the challenges that the project has encountered. The successes and areas for improvement of 1.6 were then examined. NetBSD has recently seen various improvements in its cross-build system, packaging system, and third-party tool support. For what concerns the kernel, a zero-copy TCP/UDP implementation has been integrated.

New pmap code for arm32 has been introduced, as have some other rather major changes to various subsystems. More information is available in Christos's slides, which are now available at http://www.netbsd.org/gallery/events/usenix2002/.

Theo de Raadt of the OpenBSD Project presented a rundown of various new things that the OpenBSD and OpenSSH teams have been looking at. Theo's talk included an amusing description (and pictures) of some of the events that transpired during OpenBSD's hack-a-thon, which occurred the week before USENIX '02. Finally, Theo mentioned that following complications with the IPFilter license, the OpenBSD team had done a fairly extensive license audit.

Robert Watson of the FreeBSD Project brought up various examples of FreeBSD being used in the real world. He went on to describe a wide variety of changes that will surface with the upcoming FreeBSD release 5.0. Notably, 5.0 will introduce a re-architecting of the kernel aimed at providing much more scalable support for SMP machines. 5.0 will also include an early implementation of KSE, FreeBSD's version of scheduler activations, large improvements to pccard, and significant framework bits from the TrustedBSD project.

Mike Karels from Wind River Systems presented an overview of how BSD/OS has been evolving following the Wind River Systems acquisition of BSDi's software assets. Ernie Prabhakar from Apple discussed Apple's OS X and its success on the desktop. He explained how OS X aims to bring BSD to the desktop while striving to maintain a strong and stable core, one that is heavily based on BSD technology.

The BoF finished with a discussion on licensing; specifically, some folks questioned the impact of Apple's licensing for code from their core OS (the Darwin Project) on the BSD community as a whole.

CLOSING SESSION

HOW FLIES FLY

Michael H. Dickinson, University of California, Berkeley

Summarized by JD Welch

In this lively and offbeat talk, Dickinson discussed his research into the mechanisms of flight in the fruit fly (Drosophila melanogaster) and related the autonomic behavior to that of a technological system. Insects are the most successful organisms on earth, due in no small part to their early adoption of flight. Flies travel through space on straight trajectories interrupted by saccades, jumpy turning motions somewhat analogous to the fast, jumpy movements of the human eye. Using a variety of monitoring techniques, including high-speed video in a "virtual reality flight arena," Dickinson and his colleagues have observed that flies respond to visual cues to decide when to saccade during flight. For example, more visual texture makes flies saccade earlier (i.e., further away from the arena wall), described by Dickinson as a "collision-control algorithm."

Through a combination of sensory inputs, flies can make decisions about their flight path. Flies possess a visual "expansion detector" which, at a certain threshold, causes the animal to turn in a certain direction. However, expansion in front of the fly sometimes causes it to land. How does the fly decide? Using the virtual reality device to replicate various visual scenarios, Dickinson observed that the flies fixate on vertical objects that move back and forth across the field. Expansion on the left side of the animal causes it to turn right, and vice versa, while expansion directly in front of the animal triggers the legs to flare out in preparation for landing.

If the eyes detect these changes, how are the responses implemented? The flies' wings have "power muscles," controlled by mechanical resonance (as opposed to the nervous system), which drive the wings, combined with neurally activated "steering muscles," which change the configuration of the wing joints. Subtle variations in the timing of impulses correspond to changes in wing movement, controlled by sensors in the "wing-pit." A small wing-like structure, the haltere, controls equilibrium by beating constantly.

Voluntary control is accomplished by the halteres, whose signals can interfere with the autonomic control of the stroke cycle. The halteres have "steering muscles" as well, and information derived from the visual system can turn off the haltere or trick it into a "virtual" problem requiring a response.

Dickinson has also studied the aerodynamics of insect flight using a device called the "Robo Fly," an oversized mechanical insect wing suspended in a large tank of mineral oil. Interestingly, Dickinson observed that the average lift generated by the flies' wings is under its body weight; the flies use three mechanisms to overcome this, including rotating the wings (rotational lift).

Insects are extraordinarily robust creatures, and because Dickinson analyzed the problem in a systems-oriented way, these observations and analysis are immediately applicable to technology; the response system can be used as an efficient search algorithm for control systems in autonomous vehicles, for example.

THE AFS WORKSHOP

Summarized by Garry Zacheiss, MIT

Love Hornquist-Astrand of the Arla development team presented the Arla status report. Arla 0.35.8 has been released. Scheduled for release soon are improved support for Tru64 UNIX, MacOS X, and FreeBSD, improved volume handling, and implementation of more of the vos/pts subcommands. It was stressed that MacOS X is considered an important platform, and that a GUI configuration manager for the cache manager, a GUI ACL manager for the Finder, and a graphical login that obtains Kerberos tickets and AFS tokens at login time were all under development.

Future goals planned for the underlying AFS protocols include GSSAPI/SPNEGO support for Rx, performance improvements to Rx, an enhanced disconnected mode, and IPv6 support for Rx; an experimental patch is already available for the latter. Future Arla-specific goals include improved performance, partial file reading, and increased stability for several platforms. Work is also in progress for the RXGSS protocol extensions (integrating GSSAPI into Rx). A partial implementation exists, and work continues as developers find time.

Derrick Brashear of the OpenAFS development team presented the OpenAFS status report. OpenAFS 1.2.5 was released immediately prior to the conference; it fixed a remotely exploitable denial-of-service vulnerability in several OpenAFS platforms, most notably IRIX and AIX. Future work planned for OpenAFS includes better support for MacOS X, including working around Finder interaction issues. Better support for the BSDs is also planned: FreeBSD has a partially implemented client; NetBSD and OpenBSD have only server programs available right now. AIX 5, Tru64 5.1A, and MacOS X 10.2 (aka Jaguar) are all planned for the future. Other planned enhancements include nested groups in the ptserver (code donated by the University of Michigan awaits integration), disconnected AFS, and further work on a native Windows client. Derrick stated that the guiding principles of OpenAFS were to maintain compatibility with IBM AFS, support new platforms, ease the administrative burden, and add new functionality.

AFS performance was discussed. The openafs.org cell consists of two file servers, one in Stockholm, Sweden, and one in Pittsburgh, PA. AFS works reasonably well over trans-Atlantic links, although OpenAFS clients don't determine which file server to talk to very effectively.

Arla clients use RTTs to the server to determine the optimal file server to fetch replicated data from. Modifications to OpenAFS to support this behavior in the future are desired.

Jimmy Engelbrecht and Harald Barth of KTH discussed their AFSCrawler script, written to determine how many AFS cells and clients were in the world, what implementations/versions they were (Arla vs. IBM AFS vs. OpenAFS), and how much data was in AFS. The script unfortunately triggered a bug in IBM AFS 3.6-derived code, causing some clients to panic while handling a specific RPC. This has since been fixed in OpenAFS 1.2.5 and the most recent IBM AFS patch level, and all AFS users are strongly encouraged to upgrade. No release of Arla is vulnerable to this particular denial-of-service attack. There was an extended discussion of the usefulness of this exploration. Many sites believed this was useful information and such scanning should continue in the future, but only on an opt-in basis.

Many sites face the problem of managing Kerberos/AFS credentials for batch-scheduled jobs. Specifically, most batch processing software needs to be modified to forward tickets as part of the batch submission process, renew tickets and tokens while the job is in the queue and for the lifetime of the job, and properly destroy credentials when the job completes. Ken Hornstein of NRL was able to pay a commercial vendor to support Kerberos 4/5 credential management in their product, although they did not implement AFS token management. MIT has implemented some of the desired functionality in OpenPBS and might be able to make it available to other interested sites.

Tools to simplify AFS administration were discussed, including:

AFS Balancer: A tool written by CMU to automate the process of balancing disk usage across all servers in a cell. Available from ftp://ftp.andrew.cmu.edu/pub/AFS-Tools/balance-1.1b.tar.gz.

Themis: Themis is KTH's enhanced version of the AFS tool "package" for updating files on local disk from a central AFS image. KTH's enhancements include allowing the deletion of files, simplifying the process of adding a file, and allowing the merging of multiple rule sets for determining which files are updated. Themis is available from the Arla CVS repository.

Stanford was presented as an example of a large AFS site. Stanford's AFS usage consists of approximately 14TB of data in AFS, in the form of approximately 100,000 volumes; 33TB of storage is available in their primary cell, ir.stanford.edu. Their file servers consist entirely of Solaris machines, running a combination of Transarc 3.6 patch-level 3 and OpenAFS 1.2.x, while their database servers run OpenAFS 1.2.2. Their cell consists of 25 file servers, using a combination of EMC and Sun StorEdge hardware. Stanford continues to use the kaserver for their authentication infrastructure, with future plans to migrate entirely to an MIT Kerberos 5 KDC.

Stanford has approximately 3400 clients on their campus, not including SLAC (Stanford Linear Accelerator); approximately 2100 AFS clients from outside Stanford contact their cell every month. Their supported clients are almost entirely IBM AFS 3.6 clients, although they plan to release OpenAFS clients soon. Stanford currently supports only UNIX clients. There is some on-campus presence of Windows clients, but Stanford has never publicly released or supported it. They do intend to release and support the MacOS X client in the near future.

All Stanford students, faculty, and staff are assigned AFS home directories, with a default quota of 50MB, for a total of approximately 550GB of user home directories.

Other significant uses of AFS storage are data volumes for workstation software (400GB) and volumes for course Web sites and assignments (100GB).

AFS usage at Intel was also presented. Intel has been an AFS site since 1994. They had bad experiences with the IBM 3.5 Linux client; their experience with OpenAFS on Linux 2.4.x kernels has been much better. They use and are satisfied with the OpenAFS IA64 Linux port. Intel has hundreds of OpenAFS 1.2.3 and 1.2.4 clients in many production cells accessing data stored on IBM AFS file servers; they have not encountered any interoperability issues. Intel has some concerns about OpenAFS: they would like to purchase commercial support for OpenAFS and to see OpenAFS support for HP-UX on both PA-RISC and Itanium hardware. HP-UX support is currently unavailable due to a specific HP-UX header file being unavailable from HP; this may be available soon. Intel has not yet committed to migrating their file servers to OpenAFS and is unsure if they will do so without commercial support.

Backups are a traditional topic of discussion at AFS workshops, and this time was no exception. Many users complain that the traditional AFS backup tools ("backup" and "butc") are complex and difficult to automate, requiring many home-grown scripts and much user intervention for error recovery. An additional complaint was that the traditional AFS tools do not support file- or directory-level backups and restores; data must be backed up and restored at the volume level.

Mitch Collinsworth of Cornell presented work to make AMANDA, the free backup software from the University of Maryland, suitable for AFS backups. Using AMANDA for AFS backups allows one to share AFS backup tapes with non-AFS backups, easily run multiple backups in parallel, automate error recovery, and provide a robust degraded mode that prevents tape errors from stopping backups altogether.

Their implementation allows for full volume restores as well as individual directory and file restores. They have finished coding this work and are in the process of testing and documenting it.

Peter Honeyman of CITI at the University of Michigan spoke about work he has proposed to replace Rx with RPCSEC GSS in OpenAFS; this would allow AFS to use a TCP-based transport mechanism rather than the UDP-based Rx, and possibly gain better congestion control, dynamic adaptation, and fragmentation avoidance as a result. RPCSEC GSS uses the GSSAPI to authenticate SUN ONC RPC. RPCSEC GSS is transport agnostic, provides strong security, is a developing Internet standard, and has multiple open source implementations. Backward compatibility with existing AFS servers and clients is an important goal of this project.


lithic kernels. Notable future work includes building components for low-end appliances and the development of a real-time OS component library.

BUILDING SERVICES

Summarized by Pradipta De

NINJA: A FRAMEWORK FOR NETWORK SERVICES

J. Robert von Behren, Eric A. Brewer, Nikita Borisov, Michael Chen, Matt Welsh, Josh MacDonald, Jeremy Lau, and David Culler, University of California, Berkeley

Robert von Behren presented Ninja, an ongoing project that aims to provide a framework for building robust and scalable Internet-based network services like Web-hosting, instant-messaging, email, and file-sharing applications. Robert drew attention to the difficulties of writing cluster applications: one has to take care of data consistency and issues of concurrency, and for robust applications there are problems related to fault tolerance. Ninja works as a wrapper to relieve the user of these problems. The goal of the project is building network services that are scalable and highly available, maintaining persistent data, providing graceful degradation, and supporting online evolution.

The use of clusters distinguishes this setup from generic distributed systems in terms of reliability and security, as well as providing a partition-free network. The next important feature of this project is the new programming model, which is more restrictive than a general programming model but still expressive enough to write most of the applications. This model, described as "single program multiple connection," uses intertask parallelism instead of multithreaded concurrency and is characterized by all nodes running the same program, with connections delegated to the nodes by a centralized connection manager (CM). A CM takes care of hiding the details of mapping an external connection to an internal node.


Ninja also uses a design called "staged event-driven architecture," where each service is broken down into a set of stages connected by event queues. This architecture is suitable for a modular design and helps in graceful degradation by adaptive load shedding from the event queues.

The presentation concluded with examples of the implementation and evaluation of a Web server and an email system called NinjaMail, to show the ease of authoring and the efficacy of using the Ninja framework for service development.

USING COHORT-SCHEDULING TO ENHANCE SERVER PERFORMANCE

James R. Larus and Michael Parkes, Microsoft Research

James Larus presented a new scheduling policy for increasing the throughput of server applications. Cohort scheduling batches the execution of similar operations arising in different server requests. The usual programming paradigm in handling server requests is to spawn multiple concurrent threads and switch from one thread to another whenever a thread gets blocked for I/O. Since the threads in a server mostly execute unrelated pieces of code, the locality of reference is reduced, and hence the effectiveness of different caching mechanisms. One way to solve this problem is to throw in more hardware, but Larus presented a complementary view in which the program behavior is investigated and used to improve performance.

The problem in this scheme is to identify pieces of code that can be batched together for processing. One simple way is to look at the next program counter values and use them to club threads together. However, this talk presented a new programming abstraction, "staged computation," which replaces the thread model with "stages." A stage is an abstraction for operations with similar behavior and locality. The StagedServer library can be used for programming in this model. It is a collection of C++ classes that implement staged computation and cohort scheduling on either a uniprocessor or multiprocessor, and it can be used to define stages.
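
The StagedServer library itself is C++; the minimal C sketch below only illustrates the idea of a stage with its own queue whose pending operations are drained together as a cohort. The names (stage, op, cohort_run) are invented for illustration and are not the StagedServer API.

    #include <stdio.h>
    #include <stdlib.h>

    /* A queued operation: what to run and on which request state. */
    struct op {
        void (*fn)(void *arg);
        void *arg;
        struct op *next;
    };

    /* A stage groups operations with similar behavior and locality. */
    struct stage {
        const char *name;
        struct op *head, *tail;
    };

    /* Post work to a stage instead of running it immediately. */
    static void stage_post(struct stage *s, void (*fn)(void *), void *arg)
    {
        struct op *o = malloc(sizeof *o);
        o->fn = fn; o->arg = arg; o->next = NULL;
        if (s->tail) s->tail->next = o; else s->head = o;
        s->tail = o;
    }

    /* Cohort scheduling: drain one stage completely, so all queued
     * operations run back to back and share code/data locality. */
    static void cohort_run(struct stage *s)
    {
        while (s->head) {
            struct op *o = s->head;
            s->head = o->next;
            if (!s->head) s->tail = NULL;
            o->fn(o->arg);
            free(o);
        }
    }

    static void parse_request(void *arg) { printf("parse %s\n", (char *)arg); }

    int main(void)
    {
        struct stage parse = { "parse", NULL, NULL };
        stage_post(&parse, parse_request, "req-1");
        stage_post(&parse, parse_request, "req-2");
        cohort_run(&parse);   /* both requests pass through the stage together */
        return 0;
    }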

The presentation showed an experimental evaluation of cohort scheduling against the thread-based model by implementing two servers: a Web server, which is mainly I/O bound, and a publish-subscribe server, which is mainly compute bound. The SURGE benchmark was used for the first experiment and the Fabret workload for the second. The results showed that the cohort-scheduling-based implementation gave better throughput than the thread-based implementation at high loads.

NETWORK PERFORMANCE

Summarized by Xiaobo Fan

ETE: PASSIVE END-TO-END INTERNET SERVICE PERFORMANCE MONITORING

Yun Fu and Amin Vahdat, Duke University; Ludmila Cherkasova and Wenting Tang, HP Labs

This paper won the Best Student Paper award.

Ludmila Cherkasova began by listing several questions most Web service providers want answered in order to improve service quality. She reviewed the difficulties of making accurate and efficient end-to-end Web service measurements and the shortcomings of currently available approaches. The authors propose a passive trace-based architecture called EtE to monitor Web server performance on behalf of end users.

The first step is to collect network packets passively. The second module reconstructs all TCP connections and extracts HTTP transactions. To reconstruct Web page accesses, they first build a knowledge base indexed by client IP and URL and then group objects to the related Web pages they are embedded in. Statistical analysis is used to handle non-matched objects. EtE Monitor can generate three groups of metrics to measure Web service performance: response time, Web caching, and page abortion.

To demonstrate the benefits of EtE Monitor, Cherkasova talked about two case studies and, based on the calculated metrics, gave some insightful explanations about what is happening behind the variations in Web performance. Through validation, they claim their approach provides a very close approximation to the real scenario.

THE PERFORMANCE OF REMOTE DISPLAY MECHANISMS FOR THIN-CLIENT COMPUTING

S. Jae Yang, Jason Nieh, Matt Selsky, and Nikhil Tiwari, Columbia University

Noting the trend toward thin-client computing, the authors compared different techniques and design choices in measuring the performance of six popular thin-client platforms – Citrix MetaFrame, Microsoft Terminal Services, Sun Ray, Tarantella, VNC, and X. After pointing out several challenges in benchmarking thin clients, Yang proposed slow-motion benchmarking to achieve non-invasive packet monitoring and consistent visual quality. Basically, they insert delays between separate visual events in the benchmark applications on the server side so that the client's display update can catch up with the server's processing speed. The experiments were conducted on an emulated network over a range of network bandwidths.

Their results show that thin clients can provide good performance for Web applications in LAN environments, but only some platforms performed well on the video benchmark. Pixel-based encoding may achieve better performance and bandwidth efficiency than high-level graphics. Display caching and compression should be used with care.

A MECHANISM FOR TCP-FRIENDLY TRANSPORT-LEVEL PROTOCOL COORDINATION

David E. Ott and Ketan Mayer-Patel, University of North Carolina, Chapel Hill

A revised transport-level protocol optimized for cluster-to-cluster (C-to-C) applications is what David Ott tried to explain in his talk. A C-to-C application class is identified as one set of processes communicating to another set of processes across a common Internet path. The fundamental problem with C-to-C applications is how to coordinate all C-to-C communication flows so that they share a consistent view of the common C-to-C network, adapt to changing network conditions, and cooperate to meet specific requirements. Aggregation points (AP) are placed at the first and last hop of the common data path to probe network conditions (latency, bandwidth, loss rate, etc.). To carry and transfer this information, a new protocol – the Coordination Protocol (CP) – is inserted between the network layer (IP) and the transport layer (TCP, UDP, etc.). Ott illustrated how the CP header is updated and used as packets originate from the source and traverse the local and remote AP to arrive at their destination, and how an AP maintains a per-cluster state table and detects network conditions. Through simulation results, this coordination mechanism appears effective in sharing common network resources among C-to-C communication flows.
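
The paper defines the actual CP header format; the struct below is only a guess at what such a shim header might carry, with invented field names and widths, intended to make the layering between IP and the transport protocol concrete.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical layout of a Coordination Protocol (CP) header sitting
     * between IP and the transport header.  The real field definitions are
     * in the paper; these names and widths are illustrative only. */
    struct cp_header {
        uint16_t cluster_id;     /* which C-to-C aggregate this flow belongs to */
        uint16_t flow_id;        /* individual flow within the aggregate        */
        uint32_t probe_seq;      /* sequence number for latency/loss probing    */
        uint32_t rtt_estimate;   /* latest RTT measured between the two APs, us */
        uint32_t bw_estimate;    /* available bandwidth estimate, kbit/s        */
        uint8_t  loss_rate;      /* smoothed loss rate, in 1/256 units          */
        uint8_t  next_proto;     /* transport protocol that follows (TCP, UDP)  */
        uint16_t reserved;
    };

    int main(void)
    {
        /* The aggregation points would rewrite the estimate fields as packets
         * pass, so endpoints and APs share one view of path conditions. */
        printf("CP header is %zu bytes in this sketch\n", sizeof(struct cp_header));
        return 0;
    }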

STORAGE SYSTEMS

Summarized by Praveen Yalagandula

MY CACHE OR YOURS? MAKING STORAGE MORE EXCLUSIVE

Theodore M. Wong, Carnegie Mellon University; John Wilkes, HP Labs

Theodore Wong explained the inefficiency of current "inclusive" caching schemes in storage area networks – when a client accesses a block, the block is read from the disk and is cached at both the disk array cache and the client's cache. He then presented the concept of "exclusive" caching, where a block accessed by a client is cached only in that client's cache. On eviction from the client's cache, they have come up with a DEMOTE operation to move the data block to the tail of the array cache.
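
A minimal, compilable C sketch of the demote-on-eviction idea follows, with tiny fixed-size caches and invented names; it is not the paper's interface, only an illustration of how a client keeps the two caches exclusive by handing its eviction victim to the array instead of dropping it.

    #include <stdio.h>
    #include <string.h>

    #define WAYS 4                       /* tiny caches, purely for illustration */

    struct cache { long lba[WAYS]; int n; };

    /* Insert at the most-recent end, evicting (and returning) the oldest
     * entry if the cache is full; returns -1 if nothing was evicted. */
    static long cache_insert(struct cache *c, long lba)
    {
        long victim = -1;
        if (c->n == WAYS) {
            victim = c->lba[0];
            memmove(&c->lba[0], &c->lba[1], (WAYS - 1) * sizeof(long));
            c->n--;
        }
        c->lba[c->n++] = lba;
        return victim;
    }

    /* Exclusive caching: a block read by the client goes only into the
     * client cache; the block it displaces is DEMOTEd to the array cache
     * instead of being discarded, so one cached copy survives somewhere. */
    static void client_read(struct cache *client, struct cache *array, long lba)
    {
        long victim = cache_insert(client, lba);
        if (victim >= 0)
            cache_insert(array, victim);     /* the DEMOTE operation */
    }

    int main(void)
    {
        struct cache client = {{0}, 0}, array = {{0}, 0};
        for (long lba = 1; lba <= 6; lba++)
            client_read(&client, &array, lba);
        printf("client holds %d blocks, array holds %d demoted blocks\n",
               client.n, array.n);
        return 0;
    }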

The new exclusive caching schemes were evaluated on both single-client and multiple-client systems. The single-client results are presented for two different types of synthetic workloads: Random (transaction-processing-type workloads) and Zipf (typical Web workloads). The exclusive policy was quite effective in achieving higher hit rates and lower read latencies. They also showed a 2.2-times hit-rate improvement over inclusive techniques for a real-life workload, httpd, in a single-client setting.

The DEMOTE scheme also performed well on multiple-client systems when the data accessed by the clients is disjoint. For shared-data workloads, this scheme performed worse than the inclusive schemes. Theodore then presented an adaptive exclusive caching scheme – a block accessed by a client is placed at the tail of the client's cache and also in the array cache at an appropriate place determined by the popularity of the block. The popularity of the block is measured by maintaining a ghost cache that accumulates the number of times each block is accessed. This new adaptive scheme achieved a maximum 52% speedup in mean latency in the experiments with real-life workloads.

For more information, visit http://www.cs.cmu.edu/~tmwong/research and http://www.hpl.hp.com/research/itc/csl/ssp.

BRIDGING THE INFORMATION GAP IN STORAGE PROTOCOL STACKS

Timothy E. Denehy, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin, Madison

Currently there is a huge information gap between storage systems and file systems. The interface exposed by storage systems to file systems, based on blocks and providing only read/write operations, is very narrow. This leads to poor performance as a whole because of duplicated functionality in both systems and reduced functionality resulting from the storage system's lack of file information.

The speaker presented two enhancements: ExRAID, an exposed RAID, and ILFS, the Informed Log-Structured File System. ExRAID is an enhancement to the block-based RAID storage system that exposes the following three types of information to the file system: (1) regions – contiguous portions of the address space comprising one or multiple disks; (2) performance information about queue lengths and throughput of the regions, revealing disk heterogeneity to the file system; and (3) failure information – dynamically updated information conveying the number of tolerable failures in each region.

ILFS allows online incremental expansion of the storage space, performs dynamic parallelism using ExRAID's performance information to perform well on heterogeneous storage systems, and provides a range of different mechanisms with different granularities for file redundancy using ExRAID's region failure characteristics. These new features were added to LFS with only a 19% increase in code size.

More information: http://www.cs.wisc.edu/wind.

MAXIMIZING THROUGHPUT IN REPLICATED DISK STRIPING OF VARIABLE BIT-RATE STREAMS

Stergios V. Anastasiadis, Duke University; Kenneth C. Sevcik and Michael Stumm, University of Toronto

There is an increasing demand for the continuous real-time streaming of media files. Disk striping is a common technique used for supporting many connections on media servers. But even with 1.2 million hours of mean time between failures, there will be more than one disk failure per week with 7,000 disks. Fault tolerance can be achieved by using data replication and reserving extra bandwidth during normal operation. This work focused on supporting the most common variable bit-rate stream formats (e.g., MPEG).


The authors used the prototype media server EXEDRA to implement and evaluate different replication and bandwidth reservation schemes. This system supports a variable bit-rate scheme, does stride-based disk allocation, and is capable of supporting variable-grain disk striping. Two replication policies are presented: deterministic – data blocks are replicated in a round-robin fashion on the secondary replicas; and random – data blocks are replicated on randomly chosen secondary replicas. Two bandwidth reservation techniques are also presented: mirroring reservation – disk bandwidth is reserved for both the primary and the replicas of the media file during playback; and minimum reservation – a more efficient scheme in which bandwidth is reserved only for the sum of the primary data access time and the maximum of the backup data access times required in each round.
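
The difference between the two reservation rules can be made concrete with a small sketch, assuming the per-round access times of the primary and of each backup replica are known; the function names are illustrative, not EXEDRA's API.

    #include <stdio.h>

    static double max_of(const double *v, int n)
    {
        double m = 0.0;
        for (int i = 0; i < n; i++)
            if (v[i] > m) m = v[i];
        return m;
    }

    /* Mirroring reservation: reserve for the primary and every replica. */
    double reserve_mirroring(double primary, const double *backups, int n)
    {
        double total = primary;
        for (int i = 0; i < n; i++)
            total += backups[i];
        return total;
    }

    /* Minimum reservation: primary plus only the worst-case single backup
     * that might have to be read in a round. */
    double reserve_minimum(double primary, const double *backups, int n)
    {
        return primary + max_of(backups, n);
    }

    int main(void)
    {
        double backups[] = { 0.4, 0.7 };          /* disk time per round, seconds */
        printf("mirroring: %.2f  minimum: %.2f\n",
               reserve_mirroring(1.0, backups, 2),
               reserve_minimum(1.0, backups, 2));
        return 0;
    }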

Experimental results showed that deterministic replica placement is better than random placement for small disks, that minimum disk bandwidth reservation is twice as good as mirroring in the throughput achieved, and that fault tolerance can be achieved with minimal impact on throughput.

TOOLS

Summarized by Li Xiao

SIMPLE AND GENERAL STATISTICAL PROFILING WITH PCT

Charles Blake and Steven Bauer, MIT

This talk introduced the Profile Collection Toolkit (PCT) – a sampling-based CPU profiling facility. A novel aspect of PCT is that it allows sampling of semantically rich data such as function call stacks, function parameters, local or global variables, CPU registers, or other execution contexts. The design objectives of PCT were driven by user needs and the inadequacies or inaccessibility of prior systems.

The rich data collection capability is achieved via a debugger-controller program, dbct1. The talk then introduced the profiling collector and data-collection activation. PCT file format options provide flexibility and various ways to store exact states. PCT also has analysis options; two examples were given – "Simple" and "Mixed User and Linux Code."

The talk then went into how general sampling helps in the following areas: a debugger-controller state machine, example script fragments, a simple example program, and general value reports in terms of call-path histograms, numeric value histograms, and data generality. Related work such as gprof and Expect was then summarized. The contribution of this work is to provide a portable, general value-sampling tool. The limitations are that PCT cannot do much with stripped binaries, and count inaccuracies happen because of its statistical nature. Given this limitation, building with -g is preferred over stripping.

Possible future directions include supporting more debuggers, such as dbx and xdb, script generators for other kinds of traces, more canned reports for general values, and a libgdb-based library sampler. PCT is available at http://pdos.lcs.mit.edu/pct and runs on almost any UNIX-like system.

ENGINEERING A DIFFERENCING AND COMPRESSION DATA FORMAT

David G. Korn and Kiem-Phong Vo, AT&T Laboratories – Research

This talk began with an equation: "Differencing + Compression = Delta Compression." Compression removes information redundancy in a data set; examples are gzip, bzip, compress, and pack. Differencing encodes the differences between two data sets; examples are diff -e, fdelta, bdiff, and xdelta. Delta compression compresses a data set given another, combining differencing and compression, and reduces to pure compression when there's no commonality.

After presenting an overall scheme for a delta compressor and showing delta compression performance, the talk discussed the encoding format of the newly designed Vcdiff for delta compression. A Vcdiff instruction code table consists of 256 entries, each coding up to a pair of instructions, and recodes 1-byte indices and any additional data.

The talk then showed Vcdiff's performance on Web data collected from CNN, compared with the W3C standard Gdiff, Gdiff+gzip, and diff+gzip, where Gdiff was computed using Vcdiff delta instructions. The results of two experiments, "First" and "Successive," were presented: in "First," each file is compressed against the first file collected; in "Successive," each file is compressed against the one from the previous hour. The diff+gzip combination did not work well because diff is line-oriented. Vcdiff performed favorably compared with the other formats.

Vcdiff is part of the Vcodex package. Vcodex is a platform for all common data transformations: delta compression, plain compression, encryption, and transcoding (e.g., uuencode, base64). It is structured in three layers for maximum usability; the base library uses the Disciplines and Methods interfaces.

The code for Vcdiff can be found at http://www.research.att.com/sw/tools. They are moving Vcdiff toward an IETF standard as a comprehensive platform for transforming data; see http://www.ietf.org/internet-drafts/draft-korn-vcdiff-06.txt.

WHERE IN THE NET

Summarized by Amit Purohit

A PRECISE AND EFFICIENT EVALUATION OF THE PROXIMITY BETWEEN WEB CLIENTS AND THEIR LOCAL DNS SERVERS

Zhuoqing Morley Mao, University of California, Berkeley; Charles D. Cranor, Fred Douglis, Michael Rabinovich, Oliver Spatscheck, and Jia Wang, AT&T Labs – Research

Content Distribution Networks (CDNs) attempt to improve Web performance by delivering Web content to end users from servers located at the edge of the network. When a Web client requests content, the CDN dynamically chooses a server to route the request to, usually the one that is nearest the client. As CDNs have access only to the IP address of the local DNS server (LDNS) of the client, the CDN's authoritative DNS server maps the client's LDNS to a geographic region within a particular network and combines that with network and server load information to perform CDN server selection.

This method has two limitations. First, it is based on the implicit assumption that clients are close to their LDNS, which may not always be valid. Second, a single request from an LDNS can represent a differing number of Web clients – called the hidden load factor. Knowledge of the hidden load factor can be used to achieve better load distribution.

The talk dealt with the extent of the first limitation and its impact on CDN server selection. To determine the associations between clients and their LDNS, a simple, non-intrusive, and efficient mapping technique was developed. The data collected was used to study the impact of proximity on DNS-based server selection using four different proximity metrics: (1) autonomous system (AS) clustering – observing whether a client is in the same AS as its LDNS – concluded that LDNS is very good for coarse-grained server selection, as 64% of the associations belong to the same AS; (2) network clustering – observing whether a client is in the same network-aware cluster (NAC) as its LDNS – implied that DNS is less useful for finer-grained server selection, since only 16% of client–LDNS pairs are in the same NAC; (3) traceroute divergence – the length of the divergent paths to the client and its LDNS from a probe point, using traceroute – implies that most clients are topologically close to their LDNS as viewed from a randomly chosen probe site; and (4) round-trip time (RTT) correlation (some CDNs select servers based on the RTT between the CDN server and the client's LDNS), which examines the correlation between the message RTTs from a probe point to the client and to its LDNS; the results imply this is a reasonable metric to use to avoid really distant servers.

A study of the impact that client–LDNS associations have on DNS-based server selection concludes that knowing the client's IP address would allow more accurate server selection in a large number of cases. The optimality of the server selection also depends on the server density, placement, and selection algorithms.

Further information can be found at http://www.eecs.berkeley.edu/~zmao/myresearch.html or by contacting zmao@eecs.berkeley.edu.

GEOGRAPHIC PROPERTIES OF INTERNET ROUTING

Lakshminarayanan Subramanian, Venkata N. Padmanabhan, Microsoft Research; Randy H. Katz, University of California at Berkeley

Geographic information can provide insights into the structure and functioning of the Internet, including interactions between different autonomous systems, by analyzing certain properties of Internet routing. It can be used to measure and quantify routing properties such as circuitous routing, hot-potato routing, and geographic fault tolerance.

Traceroute was used to gather the required data, and the Geotrack tool was used to determine the location of the nodes along each network path. This enables the computation of the "linearized distance," which is the sum of the geographic distances between successive pairs of routers along the path.

In order to measure the circuitousness of a path, a metric called the "distance ratio" has been defined as the ratio of the linearized distance of a path to the geographic distance between the source and destination of the path. From the data it has been observed that the circuitousness of a route depends on both the geographic and network locations of the end hosts. A large value of the distance ratio enables us to flag paths that are highly circuitous, possibly (though not necessarily) because of routing anomalies. It is also shown that the minimum delay between end hosts is strongly correlated with the linearized distance of the path.
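
A short sketch of the two quantities, using a haversine great-circle helper and made-up router coordinates, shows how the distance ratio would be computed; a value near 1.0 indicates a direct path, while larger values indicate circuitous routing.

    #include <math.h>
    #include <stdio.h>

    struct coord { double lat, lon; };              /* router location, degrees */

    static const double PI = 3.14159265358979;

    /* Great-circle (haversine) distance in km; fine for illustration. */
    static double geo_km(struct coord a, struct coord b)
    {
        double rad = PI / 180.0, R = 6371.0;
        double dlat = (b.lat - a.lat) * rad, dlon = (b.lon - a.lon) * rad;
        double h = sin(dlat / 2) * sin(dlat / 2) +
                   cos(a.lat * rad) * cos(b.lat * rad) *
                   sin(dlon / 2) * sin(dlon / 2);
        return 2 * R * asin(sqrt(h));
    }

    /* Linearized distance: sum of distances between successive routers. */
    static double linearized_km(const struct coord *path, int n)
    {
        double sum = 0;
        for (int i = 1; i < n; i++)
            sum += geo_km(path[i - 1], path[i]);
        return sum;
    }

    int main(void)
    {
        struct coord path[] = { { 37.87, -122.27 },     /* Berkeley   */
                                { 40.44,  -79.94 },     /* Pittsburgh */
                                { 47.61, -122.33 } };   /* Seattle    */
        double lin = linearized_km(path, 3);
        double direct = geo_km(path[0], path[2]);
        printf("distance ratio = %.2f\n", lin / direct); /* 1.0 means no detour */
        return 0;
    }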

Geographic information can be used to study various aspects of wide-area Internet paths that traverse multiple ISPs. It was found that end-to-end Internet paths tend to be more circuitous than intra-ISP paths, the cause being the peering relationships between ISPs. Also, paths traversing substantial distances within two or more ISPs tend to be more circuitous than paths largely traversing only a single ISP. Another finding is that ISPs generally employ hot-potato routing.

Geographic information can also be used to capture the fact that two seemingly unrelated routers can be susceptible to correlated failures. Using the geographic information of routers, we can construct a geographic topology of an ISP and use it to find the tolerance of an ISP's network to the total network failure of a geographic region.

For further information, contact lakme@cs.berkeley.edu.

PROVIDING PROCESS ORIGIN INFORMATION TO AID IN NETWORK TRACEBACK

Florian P. Buchholz, Purdue University; Clay Shields, Georgetown University

Network traceback is currently limited because host audit systems do not maintain enough information to match incoming network traffic to outgoing network traffic. The talk presented an alternative: assigning origin information to every process and logging it during interactive login creation.

The current implementation concentrates mainly on interactive sessions, in which an event is logged when a new connection is established using SSH or Telnet. The method is effective and could successfully determine stepping stones and reliably detect the source of a DDoS attack. The speaker then talked about related work done in this area, which led to a discussion of the latest developments in the area of "host causality."

The speaker ended with the limitations and future directions of the framework. The framework could be extended and could find many applications in the area of security. The current implementation doesn't take care of startup scripts and cron jobs, but incorporating the origin information in the file system could solve this problem. In the current implementation, logging is implemented only as a proof of concept; it could be made safe in many ways, and this could be another important aspect of future work.

PROGRAMMING

Summarized by Amit Purohit

CYCLONE: A SAFE DIALECT OF C

Trevor Jim, AT&T Labs – Research; Greg Morrisett, Dan Grossman, Michael Hicks, James Cheney, and Yanling Wang, Cornell University

Cyclone is designed to provide protection against attacks such as buffer overflows, format string attacks, and memory management errors. The current structure of C allows programmers to write vulnerable programs. Cyclone extends C so that it has the safety guarantees of Java while keeping C's syntax, types, and semantics untouched.

The Cyclone compiler performs static analysis of the source code and inserts runtime checks into the compiled output at places where the analysis cannot determine that the code execution will not violate safety constraints. Cyclone imposes many restrictions to preserve safety, such as NULL checks; these checks do not exist in normal C.
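
The following is not Cyclone syntax; it is a plain-C illustration of the kind of bounds and NULL check a safe dialect has to insert before a dereference when static analysis cannot prove safety, here written by hand around a "fat" pointer that carries its own bounds.

    #include <stdio.h>
    #include <stdlib.h>

    /* A pointer that carries its bounds, so checks can be inserted
     * automatically.  Cyclone's own pointer kinds and generated checks
     * differ in detail; this is only an illustration. */
    struct fat_ptr { int *base; size_t len; };

    static int checked_read(struct fat_ptr p, size_t i)
    {
        if (p.base == NULL || i >= p.len) {      /* the inserted runtime check */
            fprintf(stderr, "bounds/NULL violation caught\n");
            exit(1);
        }
        return p.base[i];                        /* safe to dereference now */
    }

    int main(void)
    {
        int data[4] = { 1, 2, 3, 4 };
        struct fat_ptr p = { data, 4 };
        printf("%d\n", checked_read(p, 2));      /* fine */
        printf("%d\n", checked_read(p, 9));      /* would overflow in plain C */
        return 0;
    }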

The speaker then talked about some sample code written in Cyclone and how it tackles safety problems. Porting an existing C application to Cyclone is fairly easy, with fewer than 10% of the code requiring change. The current implementation concentrates more on safety than on performance, so the performance penalty is substantial. Cyclone was able to find many lingering bugs. The final step of the project will be to write a compiler to convert a normal C program into an equivalent Cyclone program.

COOPERATIVE TASK MANAGEMENT WITHOUT MANUAL STACK MANAGEMENT

Atul Adya, Jon Howell, Marvin Theimer, William J. Bolosky, and John R. Douceur, Microsoft Research

The speaker described the definitions and motivations behind the distinct concepts of task management, stack management, I/O management, conflict management, and data partitioning. Conventional concurrent programming uses preemptive task management and exploits the automatic stack management of a standard language. In the second approach, cooperative tasks are organized as event handlers that yield control by returning control to the event scheduler, manually unrolling their stacks. In this project they have adopted a hybrid approach that makes it possible for both stack management styles to coexist; programmers can thus write code assuming one or the other of these stack management styles will operate, depending upon the application. The speaker also gave a detailed example of how to use their mechanism to switch between the two styles.

The talk then continued with the implementation. They were able to preempt many subtle concurrency problems by using cooperative task management; paying a cost up front to reduce subtle race conditions proved to be a good investment.

Though the choice of task management is fundamental, the choice of stack management can be left to individual taste. This project enables the use of any type of stack management in conjunction with cooperative task management.

IMPROVING WAIT-FREE ALGORITHMS FOR INTERPROCESS COMMUNICATION IN EMBEDDED REAL-TIME SYSTEMS

Hai Huang, Padmanabhan Pillai, and Kang G. Shin, University of Michigan

The main characteristic of a real-time, time-sensitive system is its predictable response time, but concurrency management creates hurdles to achieving this goal because of the use of locks to maintain consistency. To solve this problem, many wait-free algorithms have been developed, but these are typically high-cost solutions. By taking advantage of the temporal characteristics of the system, however, the time and space overhead can be reduced.

The speaker presented an algorithm for temporal concurrency control and applied this technique to improve three wait-free algorithms. A single-writer, multiple-reader wait-free algorithm and a double-buffer algorithm were proposed.
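
A minimal C11 sketch of the classic single-writer double-buffer these algorithms build on is shown below. It is safe only under the kind of temporal assumption the authors exploit – that every reader finishes its copy before the writer wraps back to the same slot – and it is not the authors' transformed algorithm.

    #include <stdatomic.h>
    #include <stdio.h>

    /* Single-writer, multiple-reader double buffer.  The writer alternates
     * between the two slots and publishes the slot index atomically; readers
     * read the most recently published slot.  Correctness relies on the
     * temporal assumption stated above. */
    struct sample { long seq; double value; };

    static struct sample buf[2];
    static _Atomic int published = 0;        /* index of the slot readers use */

    void writer_update(long seq, double value)
    {
        int next = 1 - atomic_load_explicit(&published, memory_order_relaxed);
        buf[next].seq = seq;                 /* fill the private slot */
        buf[next].value = value;
        atomic_store_explicit(&published, next, memory_order_release);
    }

    struct sample reader_snapshot(void)
    {
        int cur = atomic_load_explicit(&published, memory_order_acquire);
        return buf[cur];                     /* bounded steps: wait-free */
    }

    int main(void)
    {
        writer_update(1, 3.14);
        struct sample s = reader_snapshot();
        printf("seq=%ld value=%g\n", s.seq, s.value);
        return 0;
    }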

Using their transformation mechanism, they achieved an improvement of 17–66% in ACET and a 14–70% reduction in memory requirements for IPC algorithms. The mechanism is extensible and can be applied to other non-blocking IPC algorithms as well. Future work involves reducing the synchronization overheads in more general IPC algorithms with multiple-writer semantics.

MOBILITY

Summarized by Praveen Yalagandula

ROBUST POSITIONING ALGORITHMS FOR DISTRIBUTED AD-HOC WIRELESS SENSOR NETWORKS

Chris Savarese and Jan Rabaey, University of California, Berkeley; Koen Langendoen, Delft University of Technology

The Pico Radio Network comprises more than a hundred sensors, monitors, and actuators equipped with wireless connectivity, with the following properties: (1) no infrastructure; (2) computation in a distributed fashion instead of centralized computation; (3) dynamic topology; (4) limited radio range; (5) nodes that can act as repeaters; (6) devices of low cost and low power; and (7) sparse anchor nodes – nodes with GPS capability. A positioning algorithm determines the geographical position of each node in the network.

In two-dimensional space, each node needs three reference positions to estimate its geographical position. Two problems make positioning difficult in the Pico Radio Network setting: (1) a sparse anchor node problem, and (2) a range error problem.

The two-phase approach the authors have taken to solving the problem consists of (1) a hop-terrain algorithm – in this first phase, each node roughly guesses its location from distances calculated over multiple hops to the anchor points – and (2) a refinement algorithm – in this second phase, each node uses its neighbors' positions to refine its own position estimate. To guarantee convergence, this approach uses confidence metrics, assigning a value of 1.0 to anchor nodes and 0.1 initially to other nodes, and increasing these with each iteration.

The simulation tool OMNeT++ was used for both phases and in various scenarios. The results show that they achieved position errors of less than 33% in a scenario with 5% anchor nodes, an average connectivity of 7, and 5% range measurement error.

Guidelines for anchor node deployment are high connectivity (>10), a reasonable fraction of anchor nodes (>5%), and, for anchor placement, covered edges.

APPLICATION-SPECIFIC NETWORK MANAGEMENT FOR ENERGY-AWARE STREAMING OF POPULAR MULTIMEDIA FORMATS

Surendar Chandra, University of Georgia, and Amin Vahdat, Duke University

The main hindrance to supporting the increasing demand for mobile multimedia on PDAs is the battery capacity of these small devices. The idle-time power consumption is almost the same as the receive power consumption on typical wireless network cards (1319mW vs. 1425mW), while the sleep state consumes far less power (177mW). To let the system consume energy proportional to the stream quality, the network card should transition to the sleep state aggressively between packets.

The speaker presented previous work in which different multimedia streams were studied for PDAs: MS Media, Real Media, and QuickTime. It was found that if the inter-packet gap is predictable, then huge savings are possible – for example, about 80% savings for MS Media streams with few packet losses. The limitation of the IEEE 802.11b power-save mode comes into play when there are two or more nodes, producing a delay between the beacon time and when the node receives or sends packets. This badly affects the streams, since more energy is consumed while waiting.

The authors propose traffic shaping for energy conservation, done by proxy so that packets arrive at regular intervals. This is achieved using a local proxy in the access point and a client-side proxy in the mobile host. Simulations show that traffic shaping reduces energy consumption and also reveal that higher-bandwidth streams have a lower energy metric (mJ/kB).

More information is at http://greenhouse.cs.uga.edu.

CHARACTERIZING ALERT AND BROWSE SERVICES OF MOBILE CLIENTS

Atul Adya, Paramvir Bahl, and Lili Qiu, Microsoft Research

Even though there is a dramatic increase in Internet access from wireless devices, there are not many studies characterizing this traffic. In this paper the authors characterize the traffic observed on the MSN Web site using both notification and browse traces.


Around 33 million browsing accesses and about 325 million notification entries are present in the traces. Three types of analyses were done for each of the two services: content analysis, concerning the most popular content categories and their distribution; popularity analysis, or the popularity distribution of documents; and user behavior analysis.

The analysis of the notification logs shows that document access rates follow a Zipf-like distribution, with most of the accesses concentrated on a small number of messages, and that accesses exhibit geographical locality – users from the same locality tend to receive similar notification content. The browser log analysis shows that a smaller set of URLs is accessed most of the time, though the access pattern does not fit any Zipf curve, and the highly accessed URLs remain stable. A correlation study between the notification and browsing services shows that wireless users have a moderate correlation of 0.12.

FREENIX TRACK SESSIONS

BUILDING APPLICATIONS

Summarized by Matt Butner

INTERACTIVE 3-D GRAPHICS APPLICATIONS FOR TCL

Oliver Kersting and Jürgen Döllner, Hasso Plattner Institute for Software Systems Engineering, University of Potsdam

The integration of 3-D image rendering functionality into a scripting language permits interactive and animated 3-D development and application without the formalities and precision demanded by low-level C/C++ graphics and visualization libraries. The large and complex C++ API of the Virtual Rendering System (VRS) can be combined with the conveniences of the Tcl scripting language. The mapping of class interfaces is done via an automated process that generates the respective wrapper classes, which ensures complete API accessibility and functionality without any significant performance penalty. The general C++-to-Tcl mapper SWIG grants the necessary object and hierarchical abilities without the use of object-oriented Tcl extensions. The final API mapping techniques address C++ features such as classes, overloaded methods, enumerations, and inheritance relations. All are implemented in a proof of concept that maps the complete C++ API of the VRS to Tcl and is showcased in a complete interactive 3-D map system.

THE AGFL GRAMMAR WORK LAB

Cornelis H.A. Koster and Erik Verbruggen, University of Nijmegen (KUN)

The growth and implementation of Natural Language Processing (NLP) is the cornerstone of the continued evolution and implementation of truly intelligent search machines and services. In part due to the growing collections of computer-stored, human-readable documents in the public domain, the implementation of linguistic analysis will become necessary for desirable precision and resolution. Consequently, such tools and linguistic resources must be released into the public domain, and the authors have done so with the release of the AGFL Grammar Work Lab under the GNU Public License as a tool for linguistic research and the development of NLP-based applications.

The AGFL (Affix Grammars over a Finite Lattice) Grammar Work Lab meshes context-free grammars with finite set-valued features applicable to a range of languages. In computer science terms, "syntax rules are procedures with parameters and a non-deterministic execution." The English Phrases for Information Retrieval (EP4IR) grammar, released with the AGFL-GWL as a robust grammar of English, is an AGFL-GWL-generated English parser that outputs "Head/Modifier" frames. The sentences "CompanyX sponsored this conference" and "This conference was sponsored by CompanyX" both generate [CompanyX[sponsored conference]]. The hope is that public availability of such tools will encourage further development of grammar and lexical software systems.

SWILL: A SIMPLE EMBEDDED WEB SERVER LIBRARY

Sotiria Lampoudi and David M. Beazley, University of Chicago

This paper won the FREENIX Best Student Paper award.

SWILL (Simple Web Interface Link Library) is a simple Web server in the form of a C library whose development was motivated by a wish to give cool applications an interface to the Web. The SWILL library provides a simple interface that can be efficiently implemented for tasks that vary from flexible Web-based monitoring to software debugging and diagnostics. Though originally designed to be integrated with high-performance scientific simulation software, the interface is generic enough to allow for unbounded uses. SWILL is a single-threaded server relying upon non-blocking I/O through the creation of a temporary server that services I/O requests. SWILL does not provide SSL or cryptographic authentication, but it does have HTTP authentication abilities.

A fantastic feature of SWILL is its support for SPMD-style parallel applications that utilize MPI, proving valuable for Beowulf clusters and large parallel machines. Another practical application was the use of SWILL in a modified Yalnix emulator by University of Chicago operating systems courses, which utilized the added Yalnix functionality for OS development and debugging. SWILL requires minimal memory overhead and relies upon the HTTP/1.0 protocol.

NETWORK PERFORMANCE

Summarized by Florian Buchholz

LINUX NFS CLIENT WRITE PERFORMANCE

Chuck Lever, Network Appliance; Peter Honeyman, CITI, University of Michigan

Lever introduced a benchmark to measure NFS client write performance. Client performance is difficult to measure due to hindrances such as poor hardware or bandwidth limitations. Furthermore, measuring application performance does not identify weaknesses specifically at the client side. Thus, a benchmark was developed that tries to exercise only data transfers in one direction between server and application. For this purpose, the benchmark was based on the block-sequential-write portion of the Bonnie file system benchmark. Once a benchmark for NFS clients is established, it can be used to improve client performance.

The performance measurements were performed with an SMP Linux client and both a Linux NFS server and a Network Appliance F85 filer. During testing, a periodic jump in write latency was discovered. This was due to a rather large number of pending write operations that were scheduled to be written after certain threshold values were exceeded. By introducing a separate daemon that flushes the cached write requests, the spikes could be eliminated, but as a result the average latency grew over time. The problem was traced to a function that scans a linked list of write requests. After a hashtable was added to improve lookup performance, the latency improved considerably.
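
The data-structure change can be pictured with a small illustrative sketch: pending write requests hashed by page index so that lookups no longer scan one long list. This is hand-written illustration, not the actual kernel patch.

    /* Pending write request, hashed by page index instead of kept only in
     * one long linked list (illustrative, not the actual Linux NFS code). */
    struct nfs_write_req {
        unsigned long page_index;
        struct nfs_write_req *hash_next;
        /* ... data, flags, credentials ... */
    };

    #define HASH_BUCKETS 256
    static struct nfs_write_req *buckets[HASH_BUCKETS];

    static unsigned hash(unsigned long page_index)
    {
        return (unsigned)(page_index * 2654435761UL) % HASH_BUCKETS;
    }

    /* O(1) expected lookup, replacing the O(n) scan of all pending writes. */
    struct nfs_write_req *find_request(unsigned long page_index)
    {
        struct nfs_write_req *r = buckets[hash(page_index)];
        while (r && r->page_index != page_index)
            r = r->hash_next;
        return r;
    }

    void add_request(struct nfs_write_req *r)
    {
        unsigned h = hash(r->page_index);
        r->hash_next = buckets[h];
        buckets[h] = r;
    }

    int main(void)
    {
        struct nfs_write_req r = { 42, 0 };
        add_request(&r);
        return find_request(42) == &r ? 0 : 1;
    }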

The improved client was then used to measure throughput against the two servers. A discrepancy between the Linux server and the filer tests was noticed, and the reason was traced back to a global kernel lock that was unnecessarily held when accessing the network stack. After correcting this, performance improved further. However, the measurements showed that a client may run slower when paired with fast servers on fast networks. This is due to heavy client interrupt loads, more network processing on the client side, and more global kernel lock contention.

The source code of the project is available at http://www.citi.umich.edu/projects/nfs-perf/patches.

A STUDY OF THE RELATIVE COSTS OF NETWORK SECURITY PROTOCOLS

Stefan Miltchev and Sotiris Ioannidis, University of Pennsylvania; Angelos Keromytis, Columbia University

With the increasing need for security and integrity of remote network services, it becomes important to quantify the communication overhead of IPSec and compare it to alternatives such as SSH, SCP, and HTTPS.

For this purpose the authors set up three testing networks: a direct link, two hosts separated by two gateways, and three hosts connecting through one gateway. Protocols were compared in each setup, and manual keying was used to eliminate connection setup costs. For IPSec, the encryption algorithms AES, DES, 3DES, hardware DES, and hardware 3DES were used. In detail, FTP was compared to SFTP, SCP, and FTP over IPSec; HTTP to HTTPS and HTTP over IPSec; and NFS to NFS over IPSec and local disk performance.

The result of the measurements was that IPSec outperforms other popular encryption schemes. Overall, unencrypted communication was fastest, but in some cases, like FTP, the overhead can be small. The use of crypto hardware can significantly improve performance. For future work, the inclusion of setup costs, hardware-accelerated SSL, SFTP, and SSH was mentioned.

CONGESTION CONTROL IN LINUX TCP

Pasi Sarolahti, University of Helsinki; Alexey Kuznetsov, Institute for Nuclear Research at Moscow

Having attended the talk and read the paper, I am still unclear about whether the authors are merely describing the design decisions of TCP congestion control or whether they actually created that part of the Linux code. My guess leans toward the former.

In the talk, the speaker compared the TCP congestion control measures specified by the IETF RFCs with the actual Linux implementation, which conforms to the basic principles but nevertheless has differences. A specific emphasis was placed on retransmission mechanisms and the congestion window. Several TCP enhancements – the NewReno algorithm, Selective ACKs (SACK), and Forward ACKs (FACK) – were also discussed and compared.

In some instances Linux does not conform to the IETF specifications. Fast recovery does not fully follow RFC 2582, since the threshold for triggering retransmit is adjusted dynamically and the congestion window's size is not changed. Also, the round-trip-time estimator and the RTO calculation differ from RFC 2988, since Linux uses more conservative RTT estimates and a minimum RTO of 200ms. The performance measurements showed that with the additional Linux-specific features enabled, slightly higher throughput, steadier data flow, and fewer unnecessary retransmissions can be achieved.

XTREME XCITEMENT

Summarized by Steve Bauer

THE FUTURE IS COMING: WHERE THE X WINDOW SHOULD GO

Jim Gettys, Compaq Computer Corp.

Jim Gettys, one of the principal authors of the X Window System, outlined the near-term objectives for the system, primarily focusing on the changes and infrastructure required to enable replication and migration of X applications. Providing better support for this functionality would enable users to retrieve or duplicate X applications between their servers at home and work.


One interesting example of an application that is currently capable of migration and replication is Emacs. To create a new frame on DISPLAY, try "M-x make-frame-on-display <Return> DISPLAY <Return>".

However, technical challenges make replication and migration difficult in general. These include the "major headaches" of server-side fonts, the nonuniformity of X servers and screen sizes, and the need to appropriately retrofit toolkits. Authentication and authorization issues are obviously also important. The rest of the talk delved into some of the details of these interesting technical challenges.

HACKING IN THE KERNEL

Summarized by Hai Huang

AN IMPLEMENTATION OF SCHEDULER ACTIVATIONS ON THE NETBSD OPERATING SYSTEM

Nathan J. Williams, Wasabi Systems

Scheduler activation is an old idea. Basically, there are benefits and drawbacks to using solely kernel-level or user-level threading; scheduler activations combine the two layers of control to provide more concurrency in the system.

In his talk, Nathan gave a fairly detailed description of the implementation of scheduler activations in the NetBSD kernel. One important change in the implementation is to differentiate the thread context from the process context. This is done by defining a separate data structure for these threads and relocating some of the information that was embedded in the process context to these thread contexts. The stack was a particular concern because of upcalls: special handling must be done to make sure that an upcall doesn't mess up the stack, so that the preempted user-level thread can continue afterwards. Lastly, Nathan explained that signals are handled by upcalls.

78 Vol 27 No 5 login

AUTHORIZATION AND CHARGING IN PUBLIC WLANS USING FREEBSD AND 802.1X

Pekka Nikander, Ericsson Research NomadicLab

The 802.1x standards are well known in the wireless community as link-layer authentication protocols. In this talk, Pekka explained some novel ways of using the 802.1x protocols that might be of interest to people on the move. It is possible to set up a public WLAN that supports various charging schemes via virtual tokens, which people can purchase or earn and later use.

This is implemented on FreeBSD using the netgraph facility. It is basically a filter in the link layer that differentiates traffic based on the MAC address of the client node, which is either authenticated, denied, or let through. The overhead for this service is fairly minimal.

ACPI IMPLEMENTATION ON FREEBSD

Takanori Watanabe, Kobe University

ACPI (Advanced Configuration and Power Interface) was proposed as a joint effort by Intel, Toshiba, and Microsoft to provide a standard, finer-grained method of managing the power states of individual devices within a system. Such low-level power management is especially important for mobile and embedded systems that are powered by fixed-capacity batteries.

Takanori explained that the ACPI specification is composed of three parts: tables, BIOS, and registers. He has implemented some of the functionality of ACPI in the FreeBSD kernel. The ACPI Component Architecture, implemented by Intel, provides a high-level ACPI API to the operating system; Takanori's ACPI implementation is built upon this underlying layer of APIs.

ACCESS CONTROL

Summarized by Florian Buchholz

DESIGN AND PERFORMANCE OF THE OPENBSD STATEFUL PACKET FILTER (PF)

Daniel Hartmeier, Systor AG

Daniel Hartmeier described the new stateful packet filter (pf) that replaced IPFilter in the OpenBSD 3.0 release. IPFilter could no longer be included due to licensing issues, and thus there was a need to write a new filter making use of optimized data structures.

The filter rules are implemented as a linked list which is traversed from start to end. Two actions may be taken according to the rules: "pass" or "block." A "pass" action forwards the packet, and a "block" action drops it. Where more than one rule matches a packet, the last rule wins. Rules that are marked as "final" will immediately terminate any further rule evaluation. An optimization called "skip-steps" was also implemented, in which blocks of similar rules are skipped if they cannot match the current packet; these skip-steps are calculated when the rule set is loaded. Furthermore, a state table keeps track of TCP connections, and only packets that match the sequence numbers are allowed. UDP and ICMP queries and replies are considered in the state table, where initial packets create an entry for a pseudo-connection with a low initial timeout value. The state table is implemented using a balanced binary search tree. NAT mappings are also stored in the state table, whereas application proxies reside in user space. The packet filter is also able to perform fragment reassembly and to modulate TCP sequence numbers to protect hosts behind the firewall.
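
A compact sketch of the evaluation strategy – last match wins, "final" rules short-circuit, and skip-steps jump over rule blocks that cannot match – is shown below; the rule and packet fields are heavily simplified and are not pf's internal structures.

    #include <stdio.h>
    #include <stdbool.h>

    /* Simplified rule and packet, just to show the evaluation strategy. */
    struct packet { unsigned saddr, daddr; unsigned short dport; };

    struct rule {
        bool block;              /* action if this rule is the last match  */
        bool final;              /* stop evaluating further rules on match */
        unsigned daddr;          /* 0 means "any"                          */
        unsigned short dport;    /* 0 means "any"                          */
        int skip;                /* skip-step: rules to jump over when the
                                    destination address cannot match       */
    };

    static bool rule_matches(const struct rule *r, const struct packet *p)
    {
        return (r->daddr == 0 || r->daddr == p->daddr) &&
               (r->dport == 0 || r->dport == p->dport);
    }

    /* Last matching rule decides; "final" short-circuits; skip-steps jump
     * over blocks of rules sharing a destination that cannot match. */
    bool filter_pass(const struct rule *rules, int n, const struct packet *p)
    {
        bool block = false;                   /* default policy: pass */
        for (int i = 0; i < n; ) {
            if (rules[i].daddr != 0 && rules[i].daddr != p->daddr) {
                i += rules[i].skip > 0 ? rules[i].skip : 1;
                continue;
            }
            if (rule_matches(&rules[i], p)) {
                block = rules[i].block;
                if (rules[i].final)
                    break;
            }
            i++;
        }
        return !block;
    }

    int main(void)
    {
        struct rule rules[] = {
            { true,  false, 0,          0,  0 },   /* block everything      */
            { false, false, 0xC0A80001, 80, 0 },   /* pass to one Web host  */
        };
        struct packet p = { 0x0A000001, 0xC0A80001, 80 };
        printf("%s\n", filter_pass(rules, 2, &p) ? "pass" : "block");
        return 0;
    }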

Pf was compared against IPFilter as well as Linux's iptables by measuring throughput and latency with increasing traffic rates and different packet sizes. In a test with a fixed rule set size of 100, iptables outperformed the other two filters, whose results were close to each other. In a second test, where the rule set size was continually increased, iptables consistently had about twice the throughput of the other two (which evaluate the rule set on both the incoming and outgoing interfaces). A third test compared only pf and IPFilter, using a single rule that created state in the state table with a fixed state-entry size; pf reached an overloaded state much later than IPFilter. The experiment was repeated with a variable state-entry size, and pf performed much better than IPFilter for a small number of states.

In general, rule set evaluation is expensive, and benchmarks only reflect extreme cases, whereas in real life other behavior should be observed. Furthermore, the benchmarks show that stateful filtering can actually improve performance, since state-table lookups are cheap compared to rule evaluation.

ENHANCING NFS CROSS-ADMINISTRATIVE DOMAIN ACCESS

Joseph Spadavecchia and Erez Zadok, Stony Brook University

The speaker presented modifications to an NFS server that allow improved NFS access between administrative domains. The problem lies in the fact that NFS assumes a shared UID/GID space, which makes it unsafe to export files outside the administrative domain of the server. The design goals were to leave the protocol and existing clients unchanged, to require a minimum amount of server changes, and to provide flexibility and increased performance.

To solve the problem, two techniques are utilized: "range-mapping" and "file-cloaking." Range-mapping maps IDs between client and server. The mapping is performed on a per-export basis and has to be manually set up in an export file; the mappings can be 1-1, N-N, or N-1. In file-cloaking, the server restricts file access based on UID/GID and special cloaking-mask bits; here users can only access their own file permissions. The policy on whether or not a file is visible to others is set by the cloaking mask, which is logically ANDed with the file's protection bits. File-cloaking only works, however, if the client doesn't hold cached copies of directory contents and file attributes; because of this, the clients are forced to re-read directories by incrementing the mtime value of the directory each time it is listed.
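
A small sketch of the range-mapping idea follows, with invented structures rather than the authors' code: each export carries a table mapping contiguous client UID ranges onto server UIDs, and unmapped IDs fall through to "nobody."

    #include <stdio.h>

    /* Illustrative per-export range-mapping table: map a contiguous range
     * of client UIDs onto server UIDs (1-1 per entry here; the paper also
     * allows N-N and N-1 mappings).  Not the actual server's structures. */
    struct uid_range_map {
        unsigned client_base;   /* first UID on the client side    */
        unsigned server_base;   /* corresponding UID on the server */
        unsigned count;         /* how many consecutive UIDs map   */
    };

    /* Translate an incoming client UID into the server's UID space;
     * unmapped UIDs fall through to "nobody". */
    unsigned map_client_uid(const struct uid_range_map *m, int n,
                            unsigned client_uid, unsigned nobody)
    {
        for (int i = 0; i < n; i++)
            if (client_uid >= m[i].client_base &&
                client_uid <  m[i].client_base + m[i].count)
                return m[i].server_base + (client_uid - m[i].client_base);
        return nobody;
    }

    int main(void)
    {
        struct uid_range_map exports[] = { { 1000, 5000, 100 } };
        printf("client 1042 -> server %u\n",
               map_client_uid(exports, 1, 1042, 65534));
        printf("client 4242 -> server %u\n",
               map_client_uid(exports, 1, 4242, 65534));
        return 0;
    }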

To test the performance of the modified server, five different NFS configurations were evaluated: an unmodified NFS server was compared against one with the modified code included but not used, one with only range-mapping enabled, one with only file-cloaking enabled, and one version with all modifications enabled. For each setup, different file system benchmarks were run. The results show only a small overhead when the modifications are used, generally an increase of below 5%. Another experiment tested the performance of the system with an increasing number of mapped or cloaked entries; an increase from 10 to 1000 entries resulted in a maximum of about 14% cost in performance.

One member of the audience pointed out that if clients choose to ignore the changed mtimes from the server and thus still hold caches of the directory entries, the file-cloaking mechanism could be defeated. After a rather lengthy debate, the speaker had to concede that the model doesn't add any extra security. Another question was asked about the scalability of setting up range mapping; the speaker referred to application-level tools that could be developed for that purpose.

The software is available at ftp://ftp.fsl.cs.sunysb.edu/pub/enf.

ENGINEERING OPEN SOURCE SOFTWARE

Summarized by Teri Lampoudi

NINGAUI: A LINUX CLUSTER FOR BUSINESS

Andrew Hume, AT&T Labs – Research; Scott Daniels, Electronic Data Systems Corp.

Ningaui is a general-purpose, highly available, resilient architecture built from commodity software and hardware. Emphatically, however, Ningaui is not a Beowulf. Hume calls the cluster design the "Swiss canton model," in which there are a number of loosely affiliated independent nodes with data replicated among them. Jobs are assigned by bidding and leases, and cluster services done as session-based servers are sited via generic job assignment. The emphasis is on keeping the architecture end-to-end, checking all work via checksums, and logging everything. The resilience and high availability required by their goal of 8-5 maintenance – vs. the typical 24-7 model, where people get paged whenever the slightest thing goes wrong regardless of the time of day – is achieved by job restartability. Finally, all computation is performed on local data, without the use of NFS or network-attached storage.

Hume's message is a hopeful one: despite the many problems encountered – things like kernel and service limits, auto-installation problems, TCP storms, and the like – the existence of source code and the paranoid practice of logging and checksumming everything has helped. The final product performs reasonably well, and it appears that the resilience techniques employed do make a difference. One drawback, however, is that the software mentioned in the paper is not yet available for download.

CPCMS: A CONFIGURATION MANAGEMENT SYSTEM BASED ON CRYPTOGRAPHIC NAMES

Jonathan S. Shapiro and John Vanderburgh, Johns Hopkins University

This paper won the FREENIX Best Paper award.

The basic notion behind the project is the fact that everyone has a pet complaint about CVS, and yet it is currently the configuration manager in most widespread use. Shapiro has unleashed an alternative. Interestingly, he did not begin but ended with the usual host of reasons why CVS is bad. The talk instead began abruptly by characterizing the job of a software configuration manager, continued by stating the namespaces it must handle, and wrapped up with the challenges faced.

X MEETS Z: VERIFYING CORRECTNESS IN THE PRESENCE OF POSIX THREADS

Bart Massey, Portland State University; Robert T. Bauer, Rational Software Corp.

Massey delivered a humorous talk on the insight gained from applying the Z formal specification notation to systems software design, rather than the more informal analysis and design process normally used.

The story is told with respect to writing XCB, which replaces the Xlib protocol layer and is supposed to be thread friendly. But where threads are concerned, deadlock avoidance becomes a hard problem that cannot be solved in an ad hoc manner, and full model checking is also too hard. In this case Massey resorted to Z specification to model the XCB lower layer, abstract away locking and data transfers, and locate fundamental issues. Essentially, the difficulties of searching the literature and locating information relevant to the problem at hand were overcome. As Massey put it, "the formal method saved the day."

FILE SYSTEMS

Summarized by Bosko Milekic

PLANNED EXTENSIONS TO THE LINUX EXT2/EXT3 FILESYSTEM

Theodore Y. Ts'o, IBM; Stephen Tweedie, Red Hat

The speaker presented improvements to the Linux Ext2 file system, with the goal of allowing for various expansions while striving to maintain compatibility with older code. Improvements have been facilitated by a few extra superblock fields that were added to Ext2 not long before the Linux 2.0 kernel was released. The fields allow file system features to be added without compromising the existing setup; this is done by providing bits indicating the impact of the added features with respect to compatibility. Namely, a file system with the "incompat" bit marked is not allowed to be mounted, while a "read-only" marking would only allow the file system to be mounted read-only.
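
The compatibility-bit scheme can be sketched as follows, modeled loosely on Ext2's three superblock feature masks (compat, incompat, and ro_compat); the mask values and the check below are simplified compared to what the kernel actually does.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Simplified view of the three superblock feature masks. */
    struct sb_features {
        uint32_t compat;     /* unknown bits here are harmless            */
        uint32_t incompat;   /* unknown bits here forbid mounting at all  */
        uint32_t ro_compat;  /* unknown bits here force a read-only mount */
    };

    #define SUPPORTED_INCOMPAT   0x0001u   /* what this kernel understands */
    #define SUPPORTED_RO_COMPAT  0x0003u

    /* Decide whether we may mount, and whether it must be read-only. */
    bool can_mount(const struct sb_features *sb, bool *read_only)
    {
        if (sb->incompat & ~SUPPORTED_INCOMPAT)
            return false;                          /* refuse: unknown incompat */
        *read_only = (sb->ro_compat & ~SUPPORTED_RO_COMPAT) != 0;
        return true;                               /* compat bits never matter */
    }

    int main(void)
    {
        struct sb_features sb = { 0x10, 0x0001, 0x0004 };
        bool ro;
        if (can_mount(&sb, &ro))
            printf("mount %s\n", ro ? "read-only" : "read-write");
        else
            printf("refuse to mount\n");
        return 0;
    }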

Directory indexing replaces linear directory searches with a faster search using a fixed-depth tree and hashed keys. File system size can be dynamically increased, and an expanded inode, doubled from 128 to 256 bytes, allows for more extensions.

Other potential improvements were discussed as well, in particular pre-allocation for contiguous files, which allows for better performance in certain setups by pre-allocating contiguous blocks. Security-related modifications, extended attributes, and ACLs were mentioned. An implementation of these features already exists but has not yet been merged into the mainline Ext2/3 code.

RECENT FILESYSTEM OPTIMISATIONS ON FREEBSD

Ian Dowse, Corvil Networks; David Malone, CNRI, Dublin Institute of Technology

David Malone presented four important file system optimizations for the FreeBSD OS: soft updates, dirpref, vmiodir, and dirhash. It turns out that certain combinations of the optimizations (beautifully illustrated in the paper) may yield performance improvements of anywhere between 2 and 10 orders of magnitude for real-world applications.

All four of the optimizations deal with file system metadata. Soft updates allow for asynchronous metadata updates. Dirpref changes the way directories are organized, attempting to place child directories closer to their parents, thereby increasing locality of reference and reducing disk-seek times. Vmiodir trades some extra memory for better directory caching. Finally, dirhash, which was implemented by Ian Dowse, changes the way entries are searched in a directory: FFS uses a linear search to find an entry by name, whereas dirhash builds a hashtable of directory entries on the fly. For a directory of size n with a working set of m files, a search that in certain cases could have been O(n·m) has been reduced, thanks to dirhash, to effectively O(n + m).

FILESYSTEM PERFORMANCE AND SCALABILITY IN LINUX 2.4.17

Ray Bryant, SGI; Ruth Forester, IBM LTC; John Hawkes, SGI

This talk focused on the performance evaluation of a number of file systems available and commonly deployed on Linux machines. Comparisons under various configurations of Ext2, Ext3, ReiserFS, XFS, and JFS were presented.

The benchmarks chosen for the data gathering were pgmeter, filemark, and the AIM Benchmark Suite VII. Pgmeter measures the rate of data transfer for reads and writes of a file under a synthetic workload. Filemark is similar to postmark in that it is an operation-intensive benchmark, although filemark is threaded and offers various other features that postmark lacks. AIM VII measures performance for various file-system-related functionalities; it offers various metrics under an imposed workload, thus stressing the performance of the file system not only under I/O load but also under significant CPU load.

Tests were run on three different setups: a small, a medium, and a large configuration. ReiserFS and Ext2 appear at the top of the pile for the smaller and medium setups. Notably, XFS and JFS perform worse for smaller system configurations than the others, although XFS clearly appears to generate better numbers than JFS. It should be noted that XFS seems to scale well under higher load; this was most evident in the large-system results, where XFS appears to offer the best overall results.

THINGS TO THINK ABOUT

Summarized by Bosko Milekic

SPEEDING UP KERNEL SCHEDULER BY REDUCING CACHE MISSES

Shuji Yamamura, Akira Hirai, Mitsuru Sato, Masao Yamamoto, Akira Naruse, and Kouichi Kumon, Fujitsu Labs

This was an interesting talk pertaining to the effects of cache coloring for task structures in the Linux kernel scheduler (Linux kernel 2.4.x). The speaker first presented some interesting benchmark numbers for the Linux scheduler, showing that as the number of processes on the task queue increased, performance decreased. The authors used some really nifty hardware to measure the number of bus transactions throughout their tests and were thus able to reasonably quantify the impact that cache misses had on the Linux scheduler.

Their experiments led them to implement a cache coloring scheme for task structures, which were previously aligned on 8KB boundaries and therefore were eventually mapped to the same cache lines. This unfortunate placement of task structures in memory induced a significant number of cache misses as the number of tasks in the scheduler grew.

The implemented solution consisted of aligning task structures to more evenly distribute cached entries across the L2 cache. The result was inevitably fewer cache misses in the scheduler. Some negative effects were observed in certain situations. These were primarily due to more cache slots being used by task-structure data in the scheduler, thus forcing data previously cached there to be pushed out.
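As a rough illustration of why the coloring helps, the following sketch uses assumed cache parameters (512KB, 64-byte lines, 4-way associativity; the real hardware details are not given in the summary) to count how many distinct cache sets 8KB-aligned task structures land in, with and without a small per-task color offset.

# Toy calculation, not the Fujitsu patch: structures that all start on 8KB
# boundaries map to a small number of cache sets, while a per-task "color"
# offset spreads them across many more sets.
CACHE_SIZE = 512 * 1024      # assumed 512KB L2 cache
LINE_SIZE = 64               # assumed 64-byte cache lines
WAYS = 4                     # assumed 4-way set associativity
NUM_SETS = CACHE_SIZE // (LINE_SIZE * WAYS)

def cache_set(addr):
    return (addr // LINE_SIZE) % NUM_SETS

def task_addresses(n_tasks, color_stride=0):
    # each task structure starts on an 8KB boundary, plus an optional color
    return [t * 8192 + (t % 32) * color_stride for t in range(n_tasks)]

uncolored = {cache_set(a) for a in task_addresses(128)}
colored = {cache_set(a) for a in task_addresses(128, color_stride=LINE_SIZE)}
print(f"distinct sets used: uncolored={len(uncolored)}, colored={len(colored)}")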

WORK-IN-PROGRESS REPORTS

Summarized by Brennan Reynolds

RESOURCE VIRTUALIZATION TECHNIQUES FOR WIDE-AREA OVERLAY NETWORKS

Kartik Gopalan, Stony Brook University

This work addressed the issue of provisioning a maximum number of virtual overlay networks (VONs) with diverse quality of service (QoS) requirements on a single physical network. Each of the VONs is logically isolated from the others to ensure the QoS requirements. Gopalan mentioned several critical research issues with this problem that are currently being investigated. Dealing with how to provision the network at various levels (link, route, or path) and then enforce the provisioning at run-time is one of the toughest challenges. Currently, Gopalan has developed several algorithms to handle admission control, end-to-end QoS, route selection, scheduling, and fault tolerance in the network.

For more information, visit http://www.ecsl.cs.sunysb.edu/.

VISUALIZING SOFTWARE INSTABILITY

Jennifer Bevan, University of California, Santa Cruz

Detection of instability in software has typically been an afterthought. The point in the development cycle when the software is reviewed for instability is usually after it is difficult and costly to go back and perform major modifications to the code base. Bevan has developed a technique to allow the visualization of unstable regions of code that can be used much earlier in the development cycle. Her technique creates a time series of dependency graphs that include clusters and lines called fault lines. From the graphs, a developer is able to easily determine where the unstable sections of code are and proactively restructure them. She is currently working on a prototype implementation.

For more information, visit http://www.cse.ucsc.edu/~jbevan/evo_viz/.

RELIABLE AND SCALABLE PEER-TO-PEER WEB DOCUMENT SHARING

Li Xiao, College of William and Mary

The idea presented by Xiao would allow end users to share the content of their Web browser caches with neighboring Internet users. The rationale for this is that today's end users are increasingly connected to the Internet over high-speed links, and the browser caches are becoming large storage systems. Therefore, if an individual accesses a page which does not exist in their local cache, Xiao is suggesting that they first query other end users for the content before trying to access the machine hosting the original. This strategy does have some serious problems associated with it that still need to be addressed, including ensuring the integrity of the content and protecting the identity and privacy of the end users.

SEREL: FAST BOOTING FOR UNIX

Leni Mayo, Fast Boot Software

Serel is a tool that generates a visual representation of a UNIX machine's boot-up sequence. It can be used to identify the critical path and can show if a particular service or process blocks for an extended period of time. This information could be used to determine where the largest performance gains could be realized by tuning the order of execution at boot-up. Serel creates a dependency graph, expressed in XML, during boot-up. This graph is then used to create the visual representation. Currently the tool only works on POSIX-compliant systems, but Mayo stated that he would be porting it to other platforms. Other extensions that were mentioned included having the metadata include the use of shared libraries and monitoring the suspend/resume sequence of portable machines.

For more information, visit http://www.fastboot.org/.
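For readers unfamiliar with the critical-path idea, here is a minimal sketch under assumptions about the data Serel records: a service-to-prerequisites map plus a measured duration per service (Serel's actual XML schema is not reproduced here). The longest dependency chain bounds the boot time.

# Sketch only: compute the critical path of a boot dependency graph.
# The services, dependencies, and durations below are invented examples.
def critical_path(deps, duration):
    finish = {}                                  # earliest finish time per service
    def walk(s):
        if s not in finish:
            start = max((walk(p) for p in deps.get(s, [])), default=0.0)
            finish[s] = start + duration[s]
        return finish[s]
    for s in duration:
        walk(s)
    return max(finish, key=finish.get), max(finish.values())

deps = {"kernel": [], "network": ["kernel"], "nfs": ["network"], "sshd": ["network"]}
duration = {"kernel": 2.0, "network": 5.0, "nfs": 1.0, "sshd": 0.5}
print(critical_path(deps, duration))             # ('nfs', 8.0) bounds boot time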


BERKELEY DB XML

John Merrells, Sleepycat Software

Merrells gave a quick introduction and overview of the new XML library for Berkeley DB. The library specializes in storage and retrieval of XML content through a tool called XPath. The library allows for multiple containers per document and stores everything natively as XML. The user is also given a wide range of elements to create indices with, including edges, elements, text strings, or presence. The XPath tool consists of a query parser, generator, optimizer, and execution engine. To conclude his presentation, Merrells gave a live demonstration of the software.

For more information, visit http://www.sleepycat.com/xml/.

CLUSTER-ON-DEMAND (COD)

Justin Moore, Duke University

Modern clusters are growing at a rapid rate. Many have pushed beyond the 5000-machine mark, and deploying them results in large expenses as well as management and provisioning issues. Furthermore, if the cluster is "rented" out to various users, it is very time-consuming to configure it to a user's specs, regardless of how long they need to use it. The COD work presented creates dynamic virtual clusters within a given physical cluster. The goal was to have a provisioning tool that would automatically select a chunk of available nodes and install the operating system and middleware specified by the customer in a short period of time. This would allow a greater use of resources, since multiple virtual clusters can exist at once. By using a virtual cluster, the size can be changed dynamically. Moore stated that they have created a working prototype and are currently testing and benchmarking its performance.

For more information, visit http://www.cs.duke.edu/~justin/cod/.


CATACOMB

Elias Sinderson, University of California, Santa Cruz

Catacomb is a project to develop a database-backed DASL module for the Apache Web server. It was designed as a replacement for WebDAV. The initial release of the module only contains support for a MySQL database but could be extended to others. Sinderson briefly touched on the performance of her module: it was comparable to the mod_dav Apache module for all query types but search. The presentation was concluded with remarks about adding support for the lock method and including ACL specifications in the future.

For more information, visit http://ocean.cse.ucsc.edu/catacomb/.

SELF-ORGANIZING STORAGE

Dan Ellard, Harvard University

Ellard's presentation introduced a storage system that tuned itself based on the workload of the system, without requiring the intervention of the user. The intelligence was implemented as a virtual, self-organizing disk that resides below any file system. The virtual disk observes the access patterns exhibited by the system and then attempts to predict what information will be accessed next. An experiment was done using an NFS trace at a large ISP on one of their email servers. Ellard's self-organizing storage system worked well, which he attributes to the fact that most of the files being requested were large email boxes. Areas of future work include exploration of the length and detail of the data-collection stage as well as the CPU impact of running the virtual disk layer.

VISUAL DEBUGGING

John Costigan, Virginia Tech

Costigan feels that the state of current debugging facilities in the UNIX world is not as good as it should be. He proposes the addition of several elements to programs like ddd and gdb. The first addition is including separate heap and stack components. Another is to display only certain fields of a data structure and have the ability to zoom in and out if needed. Finally, the debugger should provide the ability to visually present complex data structures, including linked lists, trees, etc. He said that a beta version of a debugger with these abilities is currently available.

For more information, visit http://infovis.cs.vt.edu/datastruct/.

IMPROVING APPLICATION PERFORMANCE THROUGH SYSTEM-CALL COMPOSITION

Amit Purohit, Stony Brook University

Web servers perform a huge number of context switches and internal data copying during normal operation. These two elements can drastically limit the performance of an application, regardless of the hardware platform it is run on. The Compound System Call (CoSy) framework is an attempt to reduce the performance penalty for context switches and data copies. It includes a user-level library and several kernel facilities that can be used via system calls. The library provides the programmer with a complete set of memory-management functions. Performance tests using the CoSy framework showed a large saving for context switches and data copies. The only area where the savings between conventional libraries and CoSy were negligible was for very large file copies.

For more information, visit http://www.cs.sunysb.edu/~purohit/.

ELASTIC QUOTAS

John Oscar, Columbia University

Oscar began by stating that most resources are flexible but that, to date, disks have not been. While approximately 80 percent of files are short-lived, disk quotas are hard limits imposed by administrators. The idea behind elastic quotas is to have a non-permanent storage area that each user can temporarily use. Oscar suggested the creation of an ehome directory structure to be used in deploying elastic quotas. Currently, he has implemented elastic quotas as a stackable file system.

There were also several trends apparent in the data. Many users would ssh to their remote host (good) but then launch an email reader on the local machine which would connect via POP to the same host and send the password in clear text (bad). This situation can easily be remedied by tunneling protocols like POP through SSH, but it appeared that many people were not aware this could be done. While most of his comments on the use of the wireless network were negative, the list of passwords he had collected showed that people were indeed using strong passwords. His recommendations were to educate and encourage people to use protocols like IPSec, SSH, and SSL when conducting work over a wireless network, because you never know who else is listening.

SYSTRACE

Niels Provos, University of Michigan

How can one be sure the applications one is using actually do exactly what their developers said they do? Short answer: you can't. People today are using a large number of complex applications, which means it is impossible to check each application thoroughly for security vulnerabilities. There are tools out there that can help, though. Provos has developed a tool called systrace that allows a user to generate a policy of acceptable system calls a particular application can make. If the application attempts to make a call that is not defined in the policy, the user is notified and allowed to choose an action. Systrace includes an auto-policy generation mechanism, which uses a training phase to record the actions of all programs the user executes. If the user chooses not to be bothered by applications breaking policy, systrace allows default enforcement actions to be set. Currently systrace is implemented for FreeBSD and NetBSD, with a Linux version coming out shortly.

For more information, visit http://www.citi.umich.edu/u/provos/systrace/.
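The flavor of such a policy check can be sketched as follows; this is an illustration of the whitelist-plus-prompt idea only, not the actual systrace policy syntax or enforcement engine, and the application and call names are made up.

# Illustrative whitelist check: permitted (syscall, argument) pairs per
# application; anything not covered prompts the user for a decision.
POLICY = {
    "myapp": {("open", "/etc/myapp.conf"), ("read", None), ("write", None)},
}

def check(app, syscall, argument=None):
    rules = POLICY.get(app, set())
    if (syscall, argument) in rules or (syscall, None) in rules:
        return "permit"
    answer = input(f"{app} wants {syscall}({argument!r}); allow? [y/N] ")
    return "permit" if answer.lower() == "y" else "deny"

# A whitelisted call is allowed silently; an unexpected one asks the user.
print(check("myapp", "read"))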

VERIFIABLE SECRET REDISTRIBUTION FOR SURVIVABLE STORAGE SYSTEMS

Ted Wong, Carnegie Mellon University

Wong presented a protocol that can be used to re-create a file distributed over a set of servers even if one of the servers is damaged. In his scheme, the user must choose how many shares the file is split into and the number of servers it will be stored across. The goal of this work is to provide persistent, secure storage of information even if it comes under attack. Wong stated that his design of the protocol was complete, and he is currently building a prototype implementation of it.

For more information, visit http://www.cs.cmu.edu/~tmwong/research/.

THE GURU IS IN SESSIONS

LINUX ON LAPTOP/PDA

Bdale Garbee, HP Linux Systems Operation

Summarized by Brennan Reynolds

This was a roundtable-like discussion with Garbee acting as moderator. He is currently in charge of the Debian distribution of Linux and is involved in porting Linux to platforms other than i386. He has successfully helped port the kernel to the Alpha, SPARC, and ARM, and is currently working on a version to run on VAX machines.

A discussion was held on which file system is best for battery life. The ReiserFS and Ext3 file systems were discussed; people commented that when using Ext3 on their laptops, the disk would never spin down and thus was always consuming a large amount of power. One suggestion was to use a RAM-based file system for any volume that requires a large number of writes, and to only have the file system written to disk at infrequent intervals or when the machine is shut down or suspended.

A question was directed to Garbee about his knowledge of how the HP-Compaq merger would affect Linux. Garbee thought that the merger was good for Linux, both within the company and for the open source community as a whole. He stated that the new company would be the number one shipper of systems with Linux as the base operating system. He also fielded a question about the support of older HP servers and their ability to run Linux, saying that indeed Linux has been ported to them and he personally had several machines in his basement running it.

A question which sparked a large number of responses concerned problems people had with running Linux on mobile platforms. Most people actually did not have many problems at all. There were a few cases of a particular machine model not working, but there were only two widespread issues: docking stations and the new ACPI power management scheme. Docking stations still appear to be somewhat of a headache, but most people had developed ad hoc solutions for getting them to work. The ACPI power management scheme developed by Intel does not appear to have a quick solution. Ted Ts'o, one of the head kernel developers, stated that there are fundamental architectural problems with the 2.4 series kernel that do not easily allow ACPI to be added. However, he also stated that ACPI is supported in the newest development kernel, 2.5, and will be included in 2.6.

The final topic was the possibility of purchasing a laptop without paying any charge/tax for Microsoft Windows. The consensus was that currently this is not possible. Even if the machine comes with Linux or without any operating system, the vendors have contracts in place which require them to pay Microsoft for each machine they ship if that machine is capable of running a Microsoft operating system. The only way this will change is if the demand for other operating systems increases to the point that vendors renegotiate their contracts with Microsoft, and no one saw this happening in the near future.

LARGE CLUSTERS

Andrew Hume, AT&T Labs – Research

Summarized by Teri Lampoudi

This was a session on clusters, big data, and resilient computing. Given that the audience was primarily interested in clusters, and not necessarily big data or beating nodes to a pulp, the conversation revolved mainly around what it takes to make a heterogeneous cluster resilient. Hume also presented a FREENIX paper on his current cluster project which explained in more detail much of what was abstractly claimed in the guru session.

Hume's use of the word "cluster" referred not to what I had assumed would be a Beowulf-type system, but to a loosely coupled farm of machines in which concurrency was much less of an issue than it would be in a Beowulf. In fact, the architecture Hume described was designed to deal with large amounts of independent transaction processing, essentially the process of billing calls, which requires no interprocess message passing or anything of the sort a parallel scientific application might.

Where does the large data come in? Since the transactions in question consist of large numbers of flat files, a mechanism for getting the files onto nodes and off-loading the results is necessary. In this particular cluster, named Ningaui, this task is handled by the "replication manager," which generated a large amount of interest in the audience. All management functions in the cluster, as well as job allocation, constitute services that are to be bid for and leased out to nodes. Furthermore, failures are handled by simply repeating the failed task until it is completed properly, if that is possible, and if it is not, then looking through detailed logs to discover what went wrong. Logging and testing for the successful completion of each task are what make this model resilient. Every management operation is logged at some appropriate level of detail, generating 120MB/day, and every single file transfer is md5 checksummed. This was characterized by Hume as a "patient" style of management where no one controls the cluster; scheduling behavior is emergent rather than stringently planned, and nodes are free to drop in and out of the cluster. A downside to this model is the increase in latency, offset by the fact that the cluster continues to function unattended outside of 8-to-5 support hours.
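A minimal sketch of the checksummed-transfer idea (my own illustration, not Ningaui's replication manager): record the sender-side digest, recompute it after the copy, and let a mismatch trigger a repeat of the task.

# Sketch: md5-verified file copy; a mismatch means the task is simply retried.
import hashlib, shutil

def md5sum(path, chunk=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def checked_copy(src, dst):
    before = md5sum(src)
    shutil.copyfile(src, dst)
    if md5sum(dst) != before:
        raise IOError(f"transfer of {src} corrupted; repeat the task")
    return before            # log this digest with the management records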

In response to questions about the projected scalability of such a scheme, Hume said the system would presumably scale from the present 60 nodes to about 100 without modifications, but that scaling higher would be a matter for further design. The guru session ended on a somewhat unrelated note regarding the problems of getting data off mainframes and onto Linux machines – issues of fixed- vs. variable-length encoding of data blocks resulting in useful information being stripped by FTP, and problems in converting COBOL copybooks to something useful on a C-based architecture. To reiterate Hume's claim, there are homemade solutions for some of these problems, and the reader should feel free to contact Mr. Hume for them.

WRITING PORTABLE APPLICATIONS

Nick Stoughton, MSB Consultants

Summarized by Josh Lothian

Nick Stoughton addressed a concern that developers are facing more and more: writing portable applications. He proposed that there is no such thing as a "portable application," only those that have been ported. Developers are still in search of the Holy Grail of applications programming: a write-once, compile-anywhere program. Languages such as Java are leading up to this, but virtual machines still have to be ported, paths will still be different, etc. Web-based applications are another area that could be promising for portable applications.

Nick went on to talk about the future of portability. The recently released POSIX 2001 standards are expected to help the situation. The 2001 revision expands the original POSIX spec into seven volumes. Even though POSIX 2001 is more specific than the previous release, Nick pointed out that even this standard allows for small differences where vendors may define their own behaviors. This can only hurt developers in the long run. Along these lines, it was pointed out that there may be room for automated tools that have the capability to check application code and determine the level of portability and standards compliance of the source.

NETWORK MANAGEMENT SYSTEM PERFORMANCE TUNING

Jeff Allen, Tellme Networks, Inc.

Summarized by Matt Selsky

Jeff Allen, author of Cricket, discussed network tuning. The basic idea of tuning is: measure, twiddle, and measure again. Cricket can be used for large installations, but it's overkill for smaller installations, for which MRTG is better suited. You don't need Cricket to do measurement; doing something like a shell script that dumps data to Excel is good, but you need to get some sort of measurement. Cricket was not designed for billing; it was meant to help answer questions.

Useful tools and techniques covered included: looking for the difference between machines in order to identify the causes for the differences; measuring, but also thinking about what you're measuring and why; and remembering to use strace and tcpdump to identify problems. Problems repeat themselves; performance tuning requires an intricate understanding of all the layers involved to solve the problem and simplify the automation. (Having a computer science background helps but is not essential.) If you can reproduce the problem, you then should investigate each piece to find the bottleneck. If you can't reproduce the problem, then measure. You need to understand all the system interactions. Measurement can help determine whether things are actually slow or if the user is imagining it.

Troubleshooting begins with a hunch, but scientific processes are essential. You should be able to determine the event stream, or when each event occurs; having observation logs can help. Some hints provided include checking cron for periodic anomalies, slowing down /bin/rm to avoid I/O overload, and looking for unusual kernel CPU usage.

Also, try to use lightweight monitoring to reduce overhead. You don't need to monitor every resource on every system, but you should monitor those resources on those systems that are essential. Don't check things too often, since you can introduce overhead that way. Lots of information can be gathered from the /proc file system interface to the kernel.
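A small, hedged example of what such lightweight monitoring might look like on Linux, assuming the usual /proc layout; it samples just two cheap values at a coarse interval rather than polling everything.

# Sketch: sample load average and free memory from /proc at a coarse interval.
import time

def sample():
    with open("/proc/loadavg") as f:
        load1 = float(f.read().split()[0])
    meminfo = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            meminfo[key] = int(value.split()[0])      # values are in kB
    return load1, meminfo["MemFree"]

if __name__ == "__main__":
    for _ in range(3):
        load, free_kb = sample()
        print(f"load1={load:.2f} free={free_kb} kB")
        time.sleep(5)        # keep the sampling interval coarse to limit overhead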

BSD BOF

Host: Kirk McKusick, Author and Consultant

Summarized by Bosko Milekic

The BSD Birds of a Feather session at USENIX started off with five quick and informative talks and ended with an exciting discussion of licensing issues.

Christos Zoulas, for the NetBSD Project, went over the primary goals of the NetBSD Project (namely portability, a clean architecture, and security) and then proceeded to briefly discuss some of the challenges that the project has encountered. The successes and areas for improvement of 1.6 were then examined. NetBSD has recently seen various improvements in its cross-build system, packaging system, and third-party tool support. For what concerns the kernel, a zero-copy TCP/UDP implementation has been integrated, and a new pmap code for arm32 has been introduced, as have some other rather major changes to various subsystems. More information is available in Christos's slides, which are now available at http://www.netbsd.org/gallery/events/usenix2002/.

Theo de Raadt of the OpenBSD Project presented a rundown of various new things that the OpenBSD and OpenSSH teams have been looking at. Theo's talk included an amusing description (and pictures) of some of the events that transpired during OpenBSD's hack-a-thon, which occurred the week before USENIX '02. Finally, Theo mentioned that following complications with the IPFilter license, the OpenBSD team had done a fairly extensive license audit.

Robert Watson of the FreeBSD Project brought up various examples of FreeBSD being used in the real world. He went on to describe a wide variety of changes that will surface with the upcoming FreeBSD release 5.0. Notably, 5.0 will introduce a re-architecting of the kernel aimed at providing much more scalable support for SMP machines. 5.0 will also include an early implementation of KSE, FreeBSD's version of scheduler activations, large improvements to pccard, and significant framework bits from the TrustedBSD project.

Mike Karels from Wind River Systems presented an overview of how BSD/OS has been evolving following the Wind River Systems acquisition of BSDi's software assets. Ernie Prabhakar from Apple discussed Apple's OS X and its success on the desktop. He explained how OS X aims to bring BSD to the desktop while striving to maintain a strong and stable core, one that is heavily based on BSD technology.

The BoF finished with a discussion on licensing; specifically, some folks questioned the impact of Apple's licensing for code from their core OS (the Darwin Project) on the BSD community as a whole.

CLOSING SESSION

HOW FLIES FLY

Michael H. Dickinson, University of California, Berkeley

Summarized by JD Welch

In this lively and offbeat talk, Dickinson discussed his research into the mechanisms of flight in the fruit fly (Drosophila melanogaster) and related the autonomic behavior to that of a technological system. Insects are the most successful organisms on earth, due in no small part to their early adoption of flight. Flies travel through space on straight trajectories interrupted by saccades, jumpy turning motions somewhat analogous to the fast, jumpy movements of the human eye. Using a variety of monitoring techniques, including high-speed video in a "virtual reality flight arena," Dickinson and his colleagues have observed that flies respond to visual cues to decide when to saccade during flight. For example, more visual texture makes flies saccade earlier (i.e., further away from the arena wall), described by Dickinson as a "collision control algorithm."

Through a combination of sensory inputs, flies can make decisions about their flight path. Flies possess a visual "expansion detector" which, at a certain threshold, causes the animal to turn a certain direction. However, expansion in front of the fly sometimes causes it to land. How does the fly decide? Using the virtual reality device to replicate various visual scenarios, Dickinson observed that the flies fixate on vertical objects that move back and forth across the field. Expansion on the left side of the animal causes it to turn right, and vice versa, while expansion directly in front of the animal triggers the legs to flare out in preparation for landing.

If the eyes detect these changes, how are the responses implemented? The flies' wings have "power muscles," controlled by mechanical resonance (as opposed to the nervous system), which drive the wings, combined with neurally activated "steering muscles," which change the configuration of the wing joints. Subtle variations in the timing of impulses correspond to changes in wing movement, controlled by sensors in the "wing-pit." A small wing-like structure, the haltere, controls equilibrium by beating constantly.

Voluntary control is accomplished by the halteres, whose signals can interfere with the autonomic control of the stroke cycle. The halteres have "steering muscles" as well, and information derived from the visual system can turn off the haltere or trick it into a "virtual" problem requiring a response.

Dickinson has also studied the aerodynamics of insect flight using a device called the "Robo Fly," an oversized mechanical insect wing suspended in a large tank of mineral oil. Interestingly, Dickinson observed that the average lift generated by the flies' wings is under its body weight; the flies use three mechanisms to overcome this, including rotating the wings (rotational lift).

Insects are extraordinarily robust creatures, and because Dickinson analyzed the problem in a systems-oriented way, these observations and analysis are immediately applicable to technology; the response system can be used as an efficient search algorithm for control systems in autonomous vehicles, for example.

THE AFS WORKSHOP

Summarized by Garry Zacheiss, MIT

Love Hornquist-Astrand of the Arla development team presented the Arla status report. Arla 0.35.8 has been released. Scheduled for release soon are improved support for Tru64 UNIX, MacOS X, and FreeBSD, improved volume handling, and implementation of more of the vos/pts subcommands. It was stressed that MacOS X is considered an important platform, and that a GUI configuration manager for the cache manager, a GUI ACL manager for the Finder, and a graphical login that obtains Kerberos tickets and AFS tokens at login time were all under development.

Future goals planned for the underlying AFS protocols include GSSAPI/SPNEGO support for Rx, performance improvements to Rx, an enhanced disconnected mode, and IPv6 support for Rx; an experimental patch is already available for the latter. Future Arla-specific goals include improved performance, partial file reading, and increased stability for several platforms. Work is also in progress on the RXGSS protocol extensions (integrating GSSAPI into Rx). A partial implementation exists, and work continues as developers find time.

Derrick Brashear of the OpenAFS development team presented the OpenAFS status report. OpenAFS 1.2.5 was released immediately prior to the conference; it fixed a remotely exploitable denial-of-service attack in several OpenAFS platforms, most notably IRIX and AIX. Future work planned for OpenAFS includes better support for MacOS X, including working around Finder interaction issues. Better support for the BSDs is also planned: FreeBSD has a partially implemented client, while NetBSD and OpenBSD have only server programs available right now. AIX 5, Tru64 5.1A, and MacOS X 10.2 (aka Jaguar) are all planned for the future. Other planned enhancements include nested groups in the ptserver (code donated by the University of Michigan awaits integration), disconnected AFS, and further work on a native Windows client. Derrick stated that the guiding principles of OpenAFS were to maintain compatibility with IBM AFS, support new platforms, ease the administrative burden, and add new functionality.

AFS performance was discussed. The openafs.org cell consists of two file servers, one in Stockholm, Sweden, and one in Pittsburgh, PA. AFS works reasonably well over trans-Atlantic links, although OpenAFS clients don't determine which file server to talk to very effectively. Arla clients use RTTs to the server to determine the optimal file server to fetch replicated data from. Modifications to OpenAFS to support this behavior in the future are desired.

Jimmy Engelbrecht and Harald Barth of KTH discussed their AFSCrawler script, written to determine how many AFS cells and clients were in the world, what implementations/versions they were (Arla vs. IBM AFS vs. OpenAFS), and how much data was in AFS. The script unfortunately triggered a bug in IBM AFS 3.6-derived code, causing some clients to panic while handling a specific RPC. This has since been fixed in OpenAFS 1.2.5 and the most recent IBM AFS patch level, and all AFS users are strongly encouraged to upgrade. No release of Arla is vulnerable to this particular denial-of-service attack. There was an extended discussion of the usefulness of this exploration. Many sites believed this was useful information and such scanning should continue in the future, but only on an opt-in basis.

Many sites face the problem of managing Kerberos/AFS credentials for batch-scheduled jobs. Specifically, most batch processing software needs to be modified to forward tickets as part of the batch submission process, renew tickets and tokens while the job is in the queue and for the lifetime of the job, and properly destroy credentials when the job completes. Ken Hornstein of NRL was able to pay a commercial vendor to support Kerberos 4/5 credential management in their product, although they did not implement AFS token management. MIT has implemented some of the desired functionality in OpenPBS and might be able to make it available to other interested sites.

Tools to simplify AFS administration were discussed, including:

AFS Balancer: A tool written by CMU to automate the process of balancing disk usage across all servers in a cell. Available from ftp://ftp.andrew.cmu.edu/pub/AFS-Tools/balance-1.1b.tar.gz.

Themis: Themis is KTH's enhanced version of the AFS tool "package" for updating files on local disk from a central AFS image. KTH's enhancements include allowing the deletion of files, simplifying the process of adding a file, and allowing the merging of multiple rule sets for determining which files are updated. Themis is available from the Arla CVS repository.

Stanford was presented as an example of a large AFS site. Stanford's AFS usage consists of approximately 14TB of data in AFS, in the form of approximately 100,000 volumes; 33TB of storage is available in their primary cell, ir.stanford.edu. Their file servers consist entirely of Solaris machines running a combination of Transarc 3.6 patch-level 3 and OpenAFS 1.2.x, while their database servers run OpenAFS 1.2.2. Their cell consists of 25 file servers using a combination of EMC and Sun StorEdge hardware. Stanford continues to use the kaserver for their authentication infrastructure, with future plans to migrate entirely to an MIT Kerberos 5 KDC.

Stanford has approximately 3,400 clients on their campus, not including SLAC (Stanford Linear Accelerator); approximately 2,100 AFS clients from outside Stanford contact their cell every month. Their supported clients are almost entirely IBM AFS 3.6 clients, although they plan to release OpenAFS clients soon. Stanford currently supports only UNIX clients. There is some on-campus presence of Windows clients, but Stanford has never publicly released or supported it. They do intend to release and support the MacOS X client in the near future.

All Stanford students, faculty, and staff are assigned AFS home directories, with a default quota of 50MB, for a total of approximately 550GB of user home directories. Other significant uses of AFS storage are data volumes for workstation software (400GB) and volumes for course Web sites and assignments (100GB).

AFS usage at Intel was also presented. Intel has been an AFS site since 1994. They had bad experiences with the IBM 3.5 Linux client; their experience with OpenAFS on Linux 2.4.x kernels has been much better. They use and are satisfied with the OpenAFS IA64 Linux port. Intel has hundreds of OpenAFS 1.2.3 and 1.2.4 clients in many production cells accessing data stored on IBM AFS file servers, and they have not encountered any interoperability issues. Intel has some concerns about OpenAFS: they would like to purchase commercial support for OpenAFS and to see OpenAFS support for HP-UX on both PA-RISC and Itanium hardware. HP-UX support is currently unavailable due to a specific HP-UX header file being unavailable from HP; this may be available soon. Intel has not yet committed to migrating their file servers to OpenAFS and are unsure if they will do so without commercial support.

Backups are a traditional topic of discussion at AFS workshops, and this time was no exception. Many users complain that the traditional AFS backup tools ("backup" and "butc") are complex and difficult to automate, requiring many home-grown scripts and much user intervention for error recovery. An additional complaint was that the traditional AFS tools do not support file- or directory-level backups and restores; data must be backed up and restored at the volume level.

Mitch Collinsworth of Cornell presented work to make AMANDA, the free backup software from the University of Maryland, suitable for AFS backups. Using AMANDA for AFS backups allows one to share AFS backup tapes with non-AFS backups, easily run multiple backups in parallel, automate error recovery, and provide a robust degraded mode that prevents tape errors from stopping backups altogether. Their implementation allows for full volume restores as well as individual directory and file restores. They have finished coding this work and are in the process of testing and documenting it.

Peter Honeyman of CITI at the University of Michigan spoke about work he has proposed to replace Rx with RPCSEC GSS in OpenAFS; this would allow AFS to use a TCP-based transport mechanism rather than the UDP-based Rx, and possibly gain better congestion control, dynamic adaptation, and fragmentation avoidance as a result. RPCSEC GSS uses the GSSAPI to authenticate SUN ONC RPC. RPCSEC GSS is transport agnostic, provides strong security, is a developing Internet standard, and has multiple open source implementations. Backward compatibility with existing AFS servers and clients is an important goal of this project.


To demonstrate the benefits of EtE monitor, Cherkasova talked about two case studies and, based on the calculated metrics, gave some insightful explanations about what's happening behind the variations of Web performance. Through validation, they claim their approach provides a very close approximation to the real scenario.

THE PERFORMANCE OF REMOTE DISPLAY MECHANISMS FOR THIN-CLIENT COMPUTING

S. Jae Yang, Jason Nieh, Matt Selsky, and Nikhil Tiwari, Columbia University

Noting the trend toward thin-client computing, the authors compared different techniques and design choices in measuring the performance of six popular thin-client platforms – Citrix MetaFrame, Microsoft Terminal Services, Sun Ray, Tarantella, VNC, and X. After pointing out several challenges in benchmarking thin clients, Yang proposed slow-motion benchmarking to achieve non-invasive packet monitoring and consistent visual quality. Basically, they insert delays between separate visual events in the benchmark applications on the server side so that the client's display update can catch up with the server's processing speed. The experiments are conducted on an emulated network over a range of network bandwidths.
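The core of the slow-motion idea can be sketched in a few lines; the event names and the two-second delay below are made-up placeholders, not the benchmarks used in the paper.

# Sketch: replay benchmark events with pauses so each client display update
# completes (and can be captured on the wire) before the next event fires.
import time

def replay_slow_motion(events, delay_s=2.0):
    for event in events:
        event()              # e.g., request the next page of the benchmark
        time.sleep(delay_s)  # let the thin client finish its screen update

pages = ["page1.html", "page2.html", "page3.html"]
replay_slow_motion([lambda p=p: print("request", p) for p in pages])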

Their results show that thin clients can provide good performance for Web applications in LAN environments, but only some platforms performed well for the video benchmark. Pixel-based encoding may achieve better performance and bandwidth efficiency than high-level graphics. Display caching and compression should be used with care.

A MECHANISM FOR TCP-FRIENDLY TRANSPORT-LEVEL PROTOCOL COORDINATION

David E. Ott and Ketan Mayer-Patel, University of North Carolina, Chapel Hill

A revised transport-level protocol optimized for cluster-to-cluster (C-to-C) applications is what David Ott tries to explain in his talk. A C-to-C application class is identified as one set of processes communicating to another set of processes across a common Internet path. The fundamental problem with C-to-C applications is how to coordinate all C-to-C communication flows so that they share a consistent view of the common C-to-C network, adapt to changing network conditions, and cooperate to meet specific requirements. Aggregation points (APs) are placed at the first and last hop of the common data path to probe network conditions (latency, bandwidth, loss rate, etc.). To carry and transfer this information, a new protocol – the Coordination Protocol (CP) – is inserted between the network layer (IP) and the transport layer (TCP, UDP, etc.). Ott illustrated how the CP header is updated and used when packets originate from a source and traverse through the local and remote APs to arrive at their destination, and how an AP maintains a per-cluster state table and detects network conditions. Through simulation results, this coordination mechanism appears effective in sharing common network resources among C-to-C communication flows.
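To make the mechanism concrete, here is a hypothetical sketch of a CP header and an AP update step; the field names and layout are my assumptions for illustration, not the header format defined in the paper.

# Hypothetical CP header carrying path conditions measured by the APs.
from dataclasses import dataclass

@dataclass
class CPHeader:
    flow_id: int           # which C-to-C flow this packet belongs to
    latency_ms: float      # conditions measured on the common path
    bandwidth_kbps: float
    loss_rate: float

def ap_update(header, probe):
    # an aggregation point stamps the latest measured path conditions
    header.latency_ms = probe["latency_ms"]
    header.bandwidth_kbps = probe["bandwidth_kbps"]
    header.loss_rate = probe["loss_rate"]
    return header

hdr = ap_update(CPHeader(1, 0.0, 0.0, 0.0),
                {"latency_ms": 42.0, "bandwidth_kbps": 1500.0, "loss_rate": 0.01})
print(hdr)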

STORAGE SYSTEMS

Summarized by Praveen Yalagandula

MY CACHE OR YOURS? MAKING STORAGE MORE EXCLUSIVE

Theodore M. Wong, Carnegie Mellon University; John Wilkes, HP Labs

Theodore Wong explained the inefficiency of current "inclusive" caching schemes in storage area networks – when a client accesses a block, the block is read from the disk and is cached at both the disk array cache and the client's cache. Then he presented the concept of "exclusive" caching, where a block accessed by a client is only cached in that client's cache. On eviction from the client's cache, they have come up with a DEMOTE operation to move the data block to the tail of the array cache.
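A compact sketch of exclusive caching with DEMOTE (an illustration of the idea, not the authors' simulator): an array hit hands the block to the client and drops it from the array, and a client eviction demotes the victim to the tail (MRU end) of the array cache.

# Sketch of exclusive caching with DEMOTE using two LRU caches.
from collections import OrderedDict

class LRU:
    def __init__(self, size):
        self.size, self.d = size, OrderedDict()
    def get(self, k):
        if k in self.d:
            self.d.move_to_end(k)                  # most recently used at the end
            return True
        return False
    def put(self, k):
        self.d[k] = True
        self.d.move_to_end(k)
        if len(self.d) > self.size:
            return self.d.popitem(last=False)[0]   # evict the LRU victim
        return None

class ExclusiveClient:
    def __init__(self, client_blocks, array_blocks):
        self.client, self.array = LRU(client_blocks), LRU(array_blocks)
    def read(self, block):
        if self.client.get(block):
            return "client hit"
        result = "array hit" if self.array.get(block) else "disk read"
        if result == "array hit":
            self.array.d.pop(block)                # keep the caches exclusive
        victim = self.client.put(block)
        if victim is not None:
            self.array.put(victim)                 # DEMOTE the victim to the array tail
        return result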

The new exclusive caching schemes are evaluated on both single-client and multiple-client systems. The single-client results are presented for two different types of synthetic workloads: Random (transaction-processing-type workloads) and Zipf (typical Web workloads). The exclusive policy was quite effective in achieving higher hit rates and lower read latencies. They also showed a 2.2-times hit-rate improvement over inclusive techniques in the case of a real-life workload, httpd, in a single-client setting.

The DEMOTE scheme also performed well in the case of multiple-client systems when the data accessed by clients is disjoint. In the case of shared-data workloads, this scheme performed worse than the inclusive schemes. Theodore then presented an adaptive exclusive caching scheme: a block accessed by a client is placed at the tail in the client's cache and also in the array cache at an appropriate place determined by the popularity of the block. The popularity of the block is measured by maintaining a ghost cache to accumulate the number of times each block is accessed. This new adaptive scheme achieved a maximum of 52% speedup in mean latency in the experiments with real-life workloads.

For more information, visit http://www.cs.cmu.edu/~tmwong/research/ and http://www.hpl.hp.com/research/itc/csl/ssp/.

BRIDGING THE INFORMATION GAP IN STORAGE PROTOCOL STACKS

Timothy E. Denehy, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin, Madison

Currently, there is a huge information gap between storage systems and file systems. The interface exposed by storage systems to file systems, based on blocks and providing only read/write interfaces, is very narrow. This leads to poor performance as a whole, because of duplicated functionality in both systems and reduced functionality resulting from storage systems' lack of file information.

The speaker presented two enhancements: ExRAID, an exposed RAID, and ILFS, the Informed Log-Structured File System. ExRAID is an enhancement to the block-based RAID storage system that exposes the following three types of information to the file system: (1) regions – contiguous portions of the address space comprising one or multiple disks; (2) performance information about queue lengths and throughput of the regions, revealing disk heterogeneity to the file systems; and (3) failure information – dynamically updated information conveying the number of tolerable failures in each region.

ILFS allows online incremental expansion of the storage space, performs dynamic parallelism using ExRAID's performance information to perform well on heterogeneous storage systems, and provides a range of different mechanisms with different granularities for redundancy of files using ExRAID's region failure characteristics. These new features are added to LFS with only a 19% increase in the code size.

More information: http://www.cs.wisc.edu/wind/.

MAXIMIZING THROUGHPUT IN REPLICATED DISK STRIPING OF VARIABLE BIT-RATE STREAMS

Stergios V. Anastasiadis, Duke University; Kenneth C. Sevcik and Michael Stumm, University of Toronto

There is an increasing demand for the continuous real-time streaming of media files. Disk striping is a common technique used for supporting many connections on media servers. But even with 1.2 million hours of mean time between failures, there will be more than one disk failure per week with 7,000 disks. Fault tolerance can be achieved by using data replication and reserving extra bandwidth during normal operation. This work focused on supporting the most common variable bit-rate stream formats (e.g., MPEG).


The authors used the prototype media server EXEDRA to implement and evaluate different replication and bandwidth reservation schemes. This system supports a variable bit-rate scheme, does stride-based disk allocation, and is capable of supporting variable-grain disk striping. Two replication policies are presented: deterministic – data blocks are replicated in a round-robin fashion on the secondary replicas; and random – data blocks are replicated on randomly chosen secondary replicas. Two bandwidth reservation techniques are also presented: mirroring reservation – disk bandwidth is reserved for both the primary and the replicas of the media file during playback; and minimum reservation – a more efficient scheme in which bandwidth is reserved only for the sum of the primary data access time and the maximum of the backup data access times required in each round.
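The two reservation rules reduce to simple arithmetic per round; the sketch below uses made-up access times just to show how the minimum-reservation scheme reserves less bandwidth than mirroring.

# Sketch of the two reservation rules described above (my notation; the
# access times are invented example values, measured per round).
def mirroring_reservation(primary_time, backup_times):
    # reserve disk time for the primary and for every replica
    return primary_time + sum(backup_times)

def minimum_reservation(primary_time, backup_times):
    # reserve only for the worst-case single backup in each round
    return primary_time + max(backup_times)

print(mirroring_reservation(4.0, [3.0, 2.5]))   # 9.5 time units reserved
print(minimum_reservation(4.0, [3.0, 2.5]))     # 7.0 time units reserved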

Experimental results showed that deterministic replica placement is better than random placement for small disks, minimum disk bandwidth reservation is twice as good as mirroring in the throughput achieved, and fault tolerance can be achieved with a minimal impact on the throughput.

TOOLS

Summarized by Li Xiao

SIMPLE AND GENERAL STATISTICAL PROFILING WITH PCT

Charles Blake and Steven Bauer, MIT

This talk introduced the Profile Collection Toolkit (PCT) – a sampling-based CPU profiling facility. A novel aspect of PCT is that it allows sampling of semantically rich data such as function call stacks, function parameters, local or global variables, CPU registers, or other execution contexts. The design objectives of PCT were driven by user needs and the inadequacies or inaccessibility of prior systems.

The rich data collection capability is achieved via a debugger-controller program, dbctl. The talk then introduced the profiling collector and data collection activation. PCT file format options provide flexibility and various ways to store exact states. PCT also has analysis options; two examples were given – "Simple" and "Mixed User and Linux Code."

The talk then went into how general sampling helps in the following areas: a debugger-controller state machine, example script fragments, a simple example program, and general value reports in terms of call-path histograms, numeric value histograms, and data generality. Related work such as gprof and Expect was then summarized. The contribution of this work is to provide a portable, general value-sampling tool. The limitations are that PCT does not support much of strip, and count inaccuracies happen because of its statistical nature. Based on this limitation, -g is preferred over strip.

Possible future directions include supporting more debuggers, such as dbx and xdb, script generators for other kinds of traces, more canned reports for general values, and a libgdb-based library sampler. PCT is available at http://pdos.lcs.mit.edu/pct/. PCT runs on almost any UNIX-like system.

ENGINEERING A DIFFERENCING AND COMPRESSION DATA FORMAT

David G. Korn and Kiem-Phong Vo, AT&T Laboratories – Research

This talk began with an equation: "Differencing + Compression = Delta Compression." Compression removes information redundancy in a data set; examples are gzip, bzip, compress, and pack. Differencing encodes differences between two data sets; examples are diff -e, fdelta, bdiff, and xdelta. Delta compression compresses a data set given another, combining differencing and compression, and reduces to pure compression when there's no commonality.
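The equation can be demonstrated with nothing but the standard library; the sketch below is an illustration of differencing followed by compression, not the Vcdiff format, and it only compares encoded sizes (it does not define a reversible wire format).

# Sketch: encode "copy from old" and "add literal" operations, then compress
# the encoded delta; compare against compressing the new data alone.
import difflib, random, zlib

def delta_compress(old, new):
    ops = []
    sm = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            ops.append(b"C %d %d" % (i1, i2 - i1))   # copy i2-i1 bytes from old
        else:
            ops.append(b"A " + new[j1:j2])           # add literal data
    return zlib.compress(b"\n".join(ops))            # then compress the delta

random.seed(0)
old = bytes(random.randrange(65, 91) for _ in range(4000))   # hard to compress alone
new = old[:1000] + b"A SMALL EDIT" + old[1000:]
print(len(zlib.compress(new)), len(delta_compress(old, new)))  # delta is far smaller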

After presenting an overall scheme for a delta compressor and showing delta compression performance, this talk discussed the encoding format of the newly designed Vcdiff for delta compression. A Vcdiff instruction code table consists of 256 entries, each coding up to a pair of instructions, and recodes 1-byte indices and any additional data.

The talk then showed Vcdiff's performance with Web data collected from CNN, compared with the W3C standard Gdiff, Gdiff+gzip, and diff+gzip, where Gdiff was computed using Vcdiff delta instructions. The results of two experiments, "First" and "Successive," were presented. In "First," each file is compressed against the first file collected; in "Successive," each file is compressed against the one in the previous hour. The diff+gzip did not work well because diff was line-oriented. Vcdiff performed favorably compared with the other formats.

Vcdiff is part of the Vcodex package. Vcodex is a platform for all common data transformations: delta compression, plain compression, encryption, and transcoding (e.g., uuencode, base64). It is structured in three layers for maximum usability. The base library uses Disciplines and Methods interfaces.

The code for Vcdiff can be found at http://www.research.att.com/sw/tools/. They are moving Vcdiff to the IETF standard as a comprehensive platform for transforming data. Please refer to http://www.ietf.org/internet-drafts/draft-korn-vcdiff-06.txt.

WHERE IN THE NET

Summarized by Amit Purohit

A PRECISE AND EFFICIENT EVALUATION OF THE PROXIMITY BETWEEN WEB CLIENTS AND THEIR LOCAL DNS SERVERS

Zhuoqing Morley Mao, University of California, Berkeley; Charles D. Cranor, Fred Douglis, Michael Rabinovich, Oliver Spatscheck, and Jia Wang, AT&T Labs – Research

Content Distribution Networks (CDNs) attempt to improve Web performance by delivering Web content to end users from servers located at the edge of the network. When a Web client requests content, the CDN dynamically chooses a server to route the request to, usually the one that is nearest the client. As CDNs have access only to the IP address of the local DNS server (LDNS) of the client, the CDN's authoritative DNS server maps the client's LDNS to a geographic region within a particular network and combines that with network and server load information to perform CDN server selection.

This method has two limitations. First, it is based on the implicit assumption that clients are close to their LDNS. This may not always be valid. Second, a single request from an LDNS can represent a differing number of Web clients – called the hidden load factor. Knowledge of the hidden load factor can be used to achieve better load distribution.

The extent of the first limitation and its impact on CDN server selection were dealt with. To determine the associations between clients and their LDNS, a simple, non-intrusive, and efficient mapping technique was developed. The data collected was used to study the impact of proximity on DNS-based server selection, using four different proximity metrics: (1) autonomous system (AS) clustering – observing whether a client is in the same AS as its LDNS – concluded that LDNS is very good for coarse-grained server selection, as 64% of the associations belong to the same AS; (2) network clustering – observing whether a client is in the same network-aware cluster (NAC) – implied that DNS is less useful for finer-grained server selection, since only 16% of the clients and LDNS are in the same NAC; (3) traceroute divergence – the length of the divergent paths to the client and its LDNS from a probe point, using traceroute – implies that most clients are topologically close to their LDNS, as viewed from a randomly chosen probe site; (4) round-trip time (RTT) correlation (some CDNs select servers based on RTT between the CDN server and the client's LDNS) examines the correlation between the message RTTs from a probe point to the client and its local DNS; the results imply this is a reasonable metric to use to avoid really distant servers.

A study of the impact that client-LDNS associations have on DNS-based server selection concludes that knowing the client's IP address would allow more accurate server selection in a large number of cases. The optimality of the server selection also depends on the server density, placement, and selection algorithms.

Further information can be found at http://www.eecs.berkeley.edu/~zmao/myresearch.html or by contacting zmao@eecs.berkeley.edu.

GEOGRAPHIC PROPERTIES OF INTERNET ROUTING

Lakshminarayanan Subramanian, Venkata N. Padmanabhan, Microsoft Research; Randy H. Katz, University of California at Berkeley

Geographic information can provide insights into the structure and functioning of the Internet, including interactions between different autonomous systems, by analyzing certain properties of Internet routing. It can be used to measure and quantify certain routing properties, such as circuitous routing, hot-potato routing, and geographic fault tolerance.

Traceroute has been used to gather the required data, and the Geotrack tool has been used to determine the location of the nodes along each network path. This enables the computation of "linearized distances," which is the sum of the geographic distances between successive pairs of routers along the path.

In order to measure the circuitousness of a path, a metric called the "distance ratio" has been defined as the ratio of the linearized distance of a path to the geographic distance between the source and destination of the path. From the data it has been observed that the circuitousness of a route depends on both the geographic and network location of the end hosts. A large value of the distance ratio enables us to flag paths that are highly circuitous, possibly (though not necessarily) because of routing anomalies. It is also shown that the minimum delay between end hosts is strongly correlated with the linearized distance of the path.
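A sketch of the distance-ratio computation, assuming router locations have already been resolved to latitude/longitude (Geotrack's job in the paper); the four-hop path below is invented for illustration.

# Sketch: linearized distance over a path divided by the direct distance.
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def distance_ratio(path_coords):
    linearized = sum(haversine_km(path_coords[i], path_coords[i + 1])
                     for i in range(len(path_coords) - 1))
    direct = haversine_km(path_coords[0], path_coords[-1])
    return linearized / direct

# hypothetical route: Berkeley -> Seattle -> Chicago -> New York
path = [(37.87, -122.27), (47.61, -122.33), (41.88, -87.63), (40.71, -74.01)]
print(round(distance_ratio(path), 2))   # values well above 1 flag circuitous routes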

Geographic information can be used to study various aspects of wide-area Internet paths that traverse multiple ISPs. It was found that end-to-end Internet paths tend to be more circuitous than intra-ISP paths, the cause for this being the peering relationships between ISPs. Also, paths traversing substantial distances within two or more ISPs tend to be more circuitous than paths largely traversing only a single ISP. Another finding is that ISPs generally employ hot-potato routing.

Geographic information can also be used to capture the fact that two seemingly unrelated routers can be susceptible to correlated failures. By using the geographic information of routers, we can construct a geographic topology of an ISP. Using this, we can find the tolerance of an ISP's network to a total network failure in a geographic region.

For further information, contact lakme@cs.berkeley.edu.

PROVIDING PROCESS ORIGIN INFORMATION TO AID IN NETWORK TRACEBACK

Florian P. Buchholz, Purdue University; Clay Shields, Georgetown University

Network traceback is currently limited because host audit systems do not maintain enough information to match incoming network traffic to outgoing network traffic. The talk presented an alternative: assigning origin information to every process and logging it during interactive login creation.

The current implementation concentrates mainly on interactive sessions, in which an event is logged when a new connection is established using SSH or Telnet. The method is effective and could successfully determine stepping stones and reliably detect the source of a DDoS attack. The speaker then talked about the related work done in this area, which led to discussion of the latest development in the area of "host causality."

The speaker ended with the limitations and future directions of the framework. This framework could be extended and could find many applications in the area of security. The current implementation doesn't take care of startup scripts and cron jobs, but incorporating the origin information in the file system could solve this problem. In the current implementation, logging is just implemented as a proof of concept. It could be made safe in many ways, and this could be another important aspect of future work.

PROGRAMMING

Summarized by Amit Purohit

CYCLONE: A SAFE DIALECT OF C

Trevor Jim, AT&T Labs – Research; Greg Morrisett, Dan Grossman, Michael Hicks, James Cheney, and Yanling Wang, Cornell University

Cyclone is designed to provide protection against attacks such as buffer overflows, format string attacks, and memory management errors. The current structure of C allows programmers to write vulnerable programs. Cyclone extends C so that it has the safety guarantees of Java while keeping C syntax, types, and semantics untouched.

The Cyclone compiler performs static analysis of the source code and inserts runtime checks into the compiled output at places where the analysis cannot determine that the code execution will not violate safety constraints. Cyclone imposes many restrictions to preserve safety, such as NULL checks; these checks do not exist in normal C.

The speaker then talked about some sample code written in Cyclone and how it tackles safety problems. Porting an existing C application to Cyclone is pretty easy, with fewer than 10% of changes required. The current implementation concentrates more on safety than on performance, hence the performance penalty is substantial. Cyclone was able to find many lingering bugs. The final step of the project will be to write a compiler to convert a normal C program to an equivalent Cyclone program.

COOPERATIVE TASK MANAGEMENT WITHOUT MANUAL STACK MANAGEMENT

Atul Adya, Jon Howell, Marvin Theimer, William J. Bolosky, and John R. Douceur, Microsoft Research

The speaker described the definitions and motivations behind the distinct concepts of task management, stack management, I/O management, conflict management, and data partitioning. Conventional concurrent programming uses preemptive task management and exploits the automatic stack management of a standard language. In the second approach, cooperative tasks are organized as event handlers that yield control by returning control to the event scheduler, manually unrolling their stacks. In this project, they have adopted a hybrid approach that makes it possible for both stack management styles to coexist. Thus, programmers can code assuming one or the other of these stack management styles will operate, depending upon the application. The speaker also gave a detailed example of how to use their mechanism to switch between the two styles.

The talk then continued with the implementation. They were able to preempt many subtle concurrency problems by using cooperative task management. Paying a cost up front to reduce a subtle race condition proved to be a good investment.

Though the choice of task management is fundamental, the choice of stack management can be left to individual taste. This project enables the use of any type of stack management in conjunction with cooperative task management.

IMPROVING WAIT-FREE ALGORITHMS FOR INTERPROCESS COMMUNICATION IN EMBEDDED REAL-TIME SYSTEMS

Hai Huang, Padmanabhan Pillai, and Kang G. Shin, University of Michigan

The main characteristic of a real-time, time-sensitive system is its predictable response time. But concurrency management creates hurdles to achieving this goal because of the use of locks to maintain consistency. To solve this problem, many wait-free algorithms have been developed, but these are typically high-cost solutions. By taking advantage of the temporal characteristics of the system, however, the time and space overhead can be reduced.

The speaker presented an algorithm for temporal concurrency control and applied this technique to improve three wait-free algorithms. A single-writer multiple-reader wait-free algorithm and a double-buffer algorithm were proposed.
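As background, the double-buffer idea for lock-free IPC can be sketched as below. This is a minimal single-writer/single-reader version using C11 atomics, assuming the reader finishes its read before the writer wraps around again; the paper's algorithms, and the temporal analysis that shrinks their time and space costs, handle the general multi-reader case.

#include <stdatomic.h>
#include <stdio.h>

/* The writer always fills the buffer the reader is not using, then
   atomically publishes it.  Neither side blocks or spins on the other. */
typedef struct {
    int        data[2];
    atomic_int current;     /* index of the buffer readers should use */
} dbuf_t;

static void dbuf_write(dbuf_t *b, int value)
{
    int cur  = atomic_load_explicit(&b->current, memory_order_relaxed);
    int next = 1 - cur;                 /* the buffer the reader is not using */
    b->data[next] = value;
    atomic_store_explicit(&b->current, next, memory_order_release);
}

static int dbuf_read(dbuf_t *b)
{
    int cur = atomic_load_explicit(&b->current, memory_order_acquire);
    return b->data[cur];
}

int main(void)
{
    dbuf_t b = { .data = {0, 0}, .current = 0 };
    dbuf_write(&b, 42);
    printf("reader sees %d\n", dbuf_read(&b));
    return 0;
}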

Using their transformation mechanism, they achieved an improvement of 17–66% in ACET and a 14–70% reduction in memory requirements for IPC algorithms. This mechanism is extensible and can be applied to other non-blocking IPC algorithms as well. Future work involves reducing the synchronizing overheads in more general IPC algorithms with multiple-writer semantics.

MOBILITY

Summarized by Praveen Yalagandula

ROBUST POSITIONING ALGORITHMS FOR DISTRIBUTED AD-HOC WIRELESS SENSOR NETWORKS

Chris Savarese, Jan Rabaey, University of California, Berkeley; Koen Langendoen, Delft University of Technology

The Pico Radio Network comprises more than a hundred sensors, monitors, and actuators equipped with wireless connectivity, with the following properties: (1) no infrastructure; (2) computation in a distributed fashion instead of centralized computation; (3) dynamic topology; (4) limited radio range; (5) nodes that can act as repeaters; (6) devices of low cost and with low power; and (7) sparse anchor nodes – nodes with GPS capability. A positioning algorithm determines the geographical position of each node in the network.

In two-dimensional space, each node needs three reference positions to estimate its geographical position. There are two problems that make positioning difficult in the Pico Radio Network setting: (1) a sparse anchor node problem and (2) a range error problem.

The two-phase approach that the authors have taken in solving the problem consists of (1) a hop-terrain algorithm – in this first phase, each node roughly guesses its location from the distance calculated using multi-hops to the anchor points – and (2) a refinement algorithm – in this second phase, each node uses its neighbors' positions to refine its own position estimate. To guarantee convergence, this approach uses confidence metrics, assigning a value of 1.0 for anchor nodes and 0.1 to start for other nodes, and increasing these with each iteration.
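A toy version of the refinement phase is sketched below with made-up coordinates and ranges: each low-confidence node nudges its estimate toward consistency with the measured ranges to its neighbors, weighting each correction by the neighbor's confidence. The authors' actual refinement solves a weighted least-squares lateration, so this only shows the shape of the iteration, not their algorithm.

#include <math.h>
#include <stdio.h>

#define N 4   /* nodes: 0..2 are anchors, 3 is unknown (hypothetical layout) */

typedef struct { double x, y, conf; } node_t;

/* One refinement pass for node u: for every neighbor j with a measured
   range, move u along the line to j so the geometric distance gets closer
   to that range, weighted by j's confidence. */
static void refine(node_t *nodes, int u, const double range[N])
{
    double dx = 0.0, dy = 0.0, wsum = 0.0;

    for (int j = 0; j < N; j++) {
        if (j == u || range[j] <= 0.0) continue;
        double ex = nodes[j].x - nodes[u].x;
        double ey = nodes[j].y - nodes[u].y;
        double d  = sqrt(ex * ex + ey * ey);
        if (d < 1e-9) continue;
        double err = d - range[j];          /* positive: we are too far away */
        double w   = nodes[j].conf;
        dx += w * err * ex / d;             /* pull toward/away from neighbor j */
        dy += w * err * ey / d;
        wsum += w;
    }
    if (wsum > 0.0) {
        nodes[u].x += 0.5 * dx / wsum;      /* damped correction */
        nodes[u].y += 0.5 * dy / wsum;
        if (nodes[u].conf < 1.0) nodes[u].conf += 0.1;  /* confidence creeps up */
    }
}

int main(void)
{
    node_t nodes[N] = {
        {  0.0,  0.0, 1.0 },   /* anchors start with confidence 1.0 */
        { 10.0,  0.0, 1.0 },
        {  0.0, 10.0, 1.0 },
        {  8.0,  8.0, 0.1 },   /* unknown node: rough hop-terrain guess, confidence 0.1 */
    };
    double range[N] = { 7.07, 7.07, 7.07, 0.0 };   /* measured ranges from node 3 */

    for (int it = 0; it < 20; it++) refine(nodes, 3, range);
    printf("node 3 estimate: (%.2f, %.2f)\n", nodes[3].x, nodes[3].y);
    return 0;
}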

The OMNeT++ simulation tool was used for both phases and in various scenarios. The results show that they achieved position errors of less than 33% in a scenario with 5% anchor nodes, an average connectivity of 7, and 5% range measurement error.

Guidelines for anchor node deployment are: high connectivity (>10), a reasonable fraction of anchor nodes (>5%), and, for anchor placement, covered edges.

APPLICATION-SPECIFIC NETWORK MANAGEMENT FOR ENERGY-AWARE STREAMING OF POPULAR MULTIMEDIA FORMATS

Surendar Chandra, University of Georgia, and Amin Vahdat, Duke University

The main hindrance in supporting the increasing demand for mobile multimedia on PDAs is the battery capacity of these small devices. The idle-time power consumption is almost the same as the receive power consumption on typical wireless network cards (1319mW vs. 1425mW), while the sleep state consumes far less power (177mW). To let the system consume energy proportional to the stream quality, the network card should transition to the sleep state aggressively between each packet.

The speaker presented previous work where different multimedia streams were studied for PDAs: MS Media, Real Media, and QuickTime. It was found that if the inter-packet gap is predictable, then huge savings are possible, for example about 80% savings in MS Media streams with few packet losses. The limitation of the IEEE 802.11b power-save mode comes into play when there are two or more nodes, producing a delay between beacon time and when the node receives/sends packets. This badly affects the streams, since higher energy is consumed while waiting.

The authors propose traffic shaping for energy conservation, done by proxy so that packets arrive at regular intervals. This is achieved using a local proxy in the access point and a client-side proxy in the mobile host. Simulations show that traffic shaping reduces energy consumption and also reveal that higher-bandwidth streams have a lower energy metric (mJ/kB).
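Back-of-the-envelope arithmetic with the power figures quoted above shows why predictable packet arrivals matter. The sketch assumes a hypothetical 4-second inter-packet gap and a fixed 50ms receive burst; those two numbers are chosen only for illustration, not taken from the paper.

#include <stdio.h>

int main(void)
{
    /* Card power states from the talk (milliwatts). */
    const double p_idle = 1319.0, p_recv = 1425.0, p_sleep = 177.0;

    /* Hypothetical stream: one 50ms receive burst every 4 seconds. */
    const double gap_s = 4.0, burst_s = 0.050;

    /* Energy per interval (millijoules = mW * s). */
    double e_awake  = p_recv * burst_s + p_idle  * (gap_s - burst_s);
    double e_shaped = p_recv * burst_s + p_sleep * (gap_s - burst_s);

    printf("always awake: %.0f mJ per interval\n", e_awake);
    printf("sleeping between predictable packets: %.0f mJ per interval (%.0f%% saved)\n",
           e_shaped, 100.0 * (1.0 - e_shaped / e_awake));
    return 0;
}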

More information is at http://greenhouse.cs.uga.edu/.

CHARACTERIZING ALERT AND BROWSE SERVICES OF MOBILE CLIENTS

Atul Adya, Paramvir Bahl, and Lili Qiu, Microsoft Research

Even though there is a dramatic increase in Internet access from wireless devices, there are not many studies done on characterizing this traffic. In this paper, the authors characterize the traffic observed on the MSN Web site with both notification and browse traces.


Around 33 million browsing accesses and about 325 million notification entries are present in the traces. Three types of analyses are done for each of these two services: content analysis, concerning the most popular content categories and their distribution; popularity analysis, or the popularity distribution of documents; and user behavior analysis.

The analysis of the notification logs shows that document access rates follow a Zipf-like distribution, with most of the accesses concentrated on a small number of messages, and the accesses exhibit geographical locality – users from the same locality tend to receive similar notification content. The browser log analysis shows that a smaller set of URLs is accessed most of the time, though the access pattern does not fit any Zipf curve, and the highly accessed URLs remain stable. A correlation study between the notification and browsing services shows that wireless users have a moderate correlation of 0.12.

FREENIX TRACK SESSIONS

BUILDING APPLICATIONS

Summarized by Matt Butner

INTERACTIVE 3-D GRAPHICS APPLICATIONS FOR TCL

Oliver Kersting, Jürgen Döllner, Hasso Plattner Institute for Software Systems Engineering, University of Potsdam

The integration of 3-D image rendering functionality into a scripting language permits interactive and animated 3-D development and application without the formalities and precision demanded by low-level C/C++ graphics and visualization libraries. The large and complex C++ API of the Virtual Rendering System (VRS) can be combined with the conveniences of the Tcl scripting language. The mapping of class interfaces is done via an automated process and generates respective wrapper classes, all of which ensures complete API accessibility and functionality without any significant performance penalty. The general C++-to-Tcl mapper SWIG grants the necessary object and hierarchical abilities without the use of object-oriented Tcl extensions. The final API mapping techniques address C++ features such as classes, overloaded methods, enumerations, and inheritance relations. All are implemented in a proof-of-concept that maps the complete C++ API of the VRS to Tcl and are showcased in a complete interactive 3-D map system.

THE AGFL GRAMMAR WORK LAB

Cornelis H.A. Coster, Erik Verbruggen, University of Nijmegen (KUN)

The growth and implementation of Natural Language Processing (NLP) is the cornerstone of the continued evolution and implementation of truly intelligent search machines and services. In part due to the growing collections of computer-stored, human-readable documents in the public domain, the implementation of linguistic analysis will become necessary for desirable precision and resolution. Subsequently, such tools and linguistic resources must be released into the public domain, and they have done so with the release of the AGFL Grammar Work Lab under the GNU Public License as a tool for linguistic research and the development of NLP-based applications.

The AGFL (Affix Grammars over a Finite Lattice) Grammar Work Lab meshes context-free grammars with finite set-valued features that are applicable to a range of languages. In computer science terms, "Syntax rules are procedures with parameters and a non-deterministic execution." The English Phrases for Information Retrieval (EP4IR), released with the AGFL-GWL as a robust grammar of English, is an AGFL-GWL-generated English parser that outputs "Head/Modifier" frames. The sentences "CompanyX sponsored this conference" and "This conference was sponsored by CompanyX" both generate [CompanyX[sponsored conference]]. The hope is that public availability of such tools will encourage further development of grammar and lexical software systems.

SWILL: A SIMPLE EMBEDDED WEB SERVER LIBRARY

Sotiria Lampoudi, David M. Beazley, University of Chicago

This paper won the FREENIX Best Student Paper award.

SWILL (Simple Web Interface Link Library) is a simple Web server in the form of a C library whose development was motivated by a wish to give cool applications an interface to the Web. The SWILL library provides a simple interface that can be efficiently implemented for tasks that vary from flexible Web-based monitoring to software debugging and diagnostics. Though originally designed to be integrated with high-performance scientific simulation software, the interface is generic enough to allow for unbounded uses. SWILL is a single-threaded server relying upon non-blocking I/O through the creation of a temporary server which services I/O requests. SWILL does not provide SSL or cryptographic authentication but does have HTTP authentication abilities.

A fantastic feature of SWILL is its support for SPMD-style parallel applications which utilize MPI, proving valuable for Beowulf clusters and large parallel machines. Another practical application was the implementation of SWILL in a modified Yalnix emulator by University of Chicago Operating Systems courses, which utilized the added Yalnix functionality for OS development and debugging. SWILL requires minimal memory overhead and relies upon the HTTP/1.0 protocol.
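Embedding such a monitor in a long-running computation looks roughly like the sketch below. The swill_init, swill_handle, and swill_poll names and signatures are assumptions based on the paper's description, not a verified copy of the library's header, so treat this as a sketch against a SWILL-like API and consult the distribution for the real interface.

/* Sketch of wiring a SWILL-style status page into an application's own
   main loop.  NOTE: the swill_* calls below are assumed, not verified. */
#include <stdio.h>
#include "swill.h"                     /* assumed header name */

static long iterations = 0;            /* state of the computation we want to watch */

/* Handler assumed to receive a FILE* tied to the HTTP response. */
static void status_page(FILE *f, void *clientdata)
{
    (void)clientdata;
    fprintf(f, "iterations so far: %ld\n", iterations);
}

int main(void)
{
    swill_init(8080);                               /* assumed: start the embedded server */
    swill_handle("status.txt", status_page, NULL);  /* assumed: register a URL */

    for (;;) {
        iterations++;                  /* one step of the real computation */
        if (iterations % 100000 == 0)
            swill_poll();              /* assumed: service pending requests, non-blocking */
    }
    return 0;
}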

NETWORK PERFORMANCE

Summarized by Florian Buchholz

LINUX NFS CLIENT WRITE PERFORMANCE

Chuck Lever, Network Appliance; Peter Honeyman, CITI, University of Michigan

Lever introduced a benchmark to measure NFS client write performance. Client performance is difficult to measure due to hindrances such as poor hardware or bandwidth limitations. Furthermore, measuring application performance does not identify weaknesses specifically at the client side. Thus, a benchmark was developed that tries to exercise only data transfers in one direction between server and application. For this purpose, the benchmark was based on the block sequential write portion of the Bonnie file system benchmark. Once a benchmark for NFS clients is established, it can be used to improve client performance.

The performance measurements were performed with an SMP Linux client and both a Linux NFS server and a Network Appliance F85 filer. During testing, a periodic jump in write latency was discovered. This was due to a rather large number of pending write operations that were scheduled to be written after certain threshold values were exceeded. By introducing a separate daemon that flushes the cached write requests, the spikes could be eliminated, but as a result the average latency grows over time. The problem was traced to a function that scans a linked list of write requests. After a hashtable was added to improve lookup performance, the latency improved considerably.
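The fix described above, replacing a linear scan of pending write requests with a hash lookup, amounts to something like the following. The structure and the offset-based key are illustrative only, not the actual Linux NFS client code.

#include <stdlib.h>
#include <stdio.h>

/* Pending write request, keyed here by page offset (illustrative). */
struct nfs_write_req {
    unsigned long         offset;
    struct nfs_write_req *next;       /* chain within one hash bucket */
};

#define NBUCKETS 256
static struct nfs_write_req *buckets[NBUCKETS];

static unsigned hash_offset(unsigned long offset)
{
    return (unsigned)((offset >> 12) & (NBUCKETS - 1));   /* 4KB pages */
}

/* O(1) average lookup instead of walking one long list of all pending
   writes, which is what caused the growing latency. */
static struct nfs_write_req *find_req(unsigned long offset)
{
    struct nfs_write_req *r = buckets[hash_offset(offset)];
    while (r && r->offset != offset)
        r = r->next;
    return r;
}

static void add_req(unsigned long offset)
{
    struct nfs_write_req *r = malloc(sizeof(*r));
    unsigned b = hash_offset(offset);
    r->offset = offset;
    r->next = buckets[b];
    buckets[b] = r;
}

int main(void)
{
    add_req(4096);
    add_req(81920);
    printf("found: %s\n", find_req(81920) ? "yes" : "no");
    return 0;
}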

The improved client was then used to measure throughput against the two servers. A discrepancy between the Linux server and the filer test was noticed, and the reason for that was traced back to a global kernel lock that was unnecessarily held when accessing the network stack. After correcting this, performance improved further. However, the measurements showed that a client may run slower when paired with fast servers on fast networks. This is due to heavy client interrupt loads, more network processing on the client side, and more global kernel lock contention.

The source code of the project is available at http://www.citi.umich.edu/projects/nfs-perf/patches/.

A STUDY OF THE RELATIVE COSTS OF NETWORK SECURITY PROTOCOLS

Stefan Miltchev and Sotiris Ioannidis, University of Pennsylvania; Angelos Keromytis, Columbia University

With the increasing need for security and integrity of remote network services, it becomes important to quantify the communication overhead of IPSec and compare it to alternatives such as SSH, SCP, and HTTPS.

For this purpose, the authors set up three testing networks: a direct link, two hosts separated by two gateways, and three hosts connecting through one gateway. Protocols were compared in each setup, and manual keying was used to eliminate connection setup costs. For IPSec, the different encryption algorithms AES, DES, 3DES, hardware DES, and hardware 3DES were used. In detail, FTP was compared to SFTP, SCP, and FTP over IPSec; HTTP to HTTPS and HTTP over IPSec; and NFS to NFS over IPSec and local disk performance.

The result of the measurements was that IPSec outperforms other popular encryption schemes. Overall, unencrypted communication was fastest, but in some cases, like FTP, the overhead can be small. The use of crypto hardware can significantly improve performance. For future work, the inclusion of setup costs, hardware-accelerated SSL, SFTP, and SSH was mentioned.

CONGESTION CONTROL IN LINUX TCP

Pasi Sarolahti, University of Helsinki; Alexey Kuznetsov, Institute for Nuclear Research at Moscow

Having attended the talk and read the paper, I am still unclear about whether the authors are merely describing the design decisions of TCP congestion control or whether they are actually the creators of that part of the Linux code. My guess leans toward the former.

In the talk, the speaker compared the TCP congestion control measures specified by the IETF and the RFCs with the actual Linux implementation, which does conform to the basic principles but nevertheless has differences. A specific emphasis was placed on retransmission mechanisms and the congestion window. Also, several TCP enhancements – the NewReno algorithm, Selective ACKs (SACK), Forward ACKs (FACK) – were discussed and compared.

In some instances Linux does not conform to the IETF specifications. The fast recovery does not fully follow RFC 2582, since the threshold for triggering retransmit is adjusted dynamically and the congestion window's size is not changed. Also, the round-trip-time estimator and the RTO calculation differ from RFC 2988, since Linux uses more conservative RTT estimates and a minimum RTO of 200ms. The performance measures showed that with the additional Linux-specific features enabled, slightly higher throughput, more steady data flow, and fewer unnecessary retransmissions can be achieved.
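For reference, the RFC 2988 retransmission timeout the talk compared against is computed from a smoothed RTT and an RTT variance estimate. The sketch below follows the RFC's formula, ignores the clock-granularity term, and applies a Linux-style 200ms floor instead of the RFC's 1-second minimum; the RTT samples are hypothetical.

#include <stdio.h>

/* RFC 2988 estimator state (all values in milliseconds). */
struct rtt_state { double srtt, rttvar; int initialized; };

static double update_rto(struct rtt_state *s, double r /* measured RTT */)
{
    const double alpha = 1.0 / 8.0, beta = 1.0 / 4.0;

    if (!s->initialized) {                 /* first measurement */
        s->srtt = r;
        s->rttvar = r / 2.0;
        s->initialized = 1;
    } else {
        double err = s->srtt - r;          /* use the old SRTT, per the RFC */
        if (err < 0) err = -err;
        s->rttvar = (1 - beta) * s->rttvar + beta * err;
        s->srtt   = (1 - alpha) * s->srtt + alpha * r;
    }

    double rto = s->srtt + 4.0 * s->rttvar;
    if (rto < 200.0) rto = 200.0;          /* Linux floor; RFC 2988 specifies 1000ms */
    return rto;
}

int main(void)
{
    struct rtt_state s = {0};
    double samples[] = { 30, 32, 28, 31 }; /* hypothetical RTT samples on a LAN */
    for (int i = 0; i < 4; i++)
        printf("RTT %.0fms -> RTO %.1fms\n", samples[i], update_rto(&s, samples[i]));
    return 0;
}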

XTREME XCITEMENT

Summarized by Steve Bauer

THE FUTURE IS COMING: WHERE THE X WINDOW SHOULD GO

Jim Gettys, Compaq Computer Corp.

Jim Gettys, one of the principal authors of the X Window System, outlined the near-term objectives for the system, primarily focusing on the changes and infrastructure required to enable replication and migration of X applications. Providing better support for this functionality would enable users to retrieve or duplicate X applications between their servers at home and work.


One interesting example of an application that currently is capable of migration and replication is Emacs. To create a new frame on DISPLAY, try "M-x make-frame-on-display <Return> DISPLAY <Return>".

However, technical challenges make replication and migration difficult in general. These include the "major headaches" of server-side fonts, the nonuniformity of X servers and screen sizes, and the need to appropriately retrofit toolkits. Authentication and authorization issues are obviously also important. The rest of the talk delved into some of the details of these interesting technical challenges.

HACKING IN THE KERNEL

Summarized by Hai Huang

AN IMPLEMENTATION OF SCHEDULER ACTIVATIONS ON THE NETBSD OPERATING SYSTEM

Nathan J. Williams, Wasabi Systems

Scheduler activation is an old idea. Basically, there are benefits and drawbacks to using solely kernel-level or user-level threading. Scheduler activation is able to combine the two layers of control to provide more concurrency in the system.

In his talk, Nathan gave a fairly detailed description of the implementation of scheduler activations in the NetBSD kernel. One important change in the implementation is to differentiate the thread context from the process context. This is done by defining a separate data structure for these threads and relocating some of the information that was embedded in the process context to these thread contexts. The stack was especially a concern, due to the upcall: special handling must be done to make sure that the upcall doesn't mess up the stack, so that the preempted user-level thread can continue afterwards. Lastly, Nathan explained that signals are handled by upcalls.


AUTHORIZATION AND CHARGING IN PUBLIC WLANS USING FREEBSD AND 802.1X

Pekka Nikander, Ericsson Research NomadicLab

The 802.1x standards are well known in the wireless community as link-layer authentication protocols. In this talk, Pekka explained some novel ways of using the 802.1x protocols that might be of interest to people on the move. It is possible to set up a public WLAN that would support various charging schemes via virtual tokens, which people can purchase or earn and later use.

This is implemented on FreeBSD using the netgraph utility. It is basically a filter in the link layer that differentiates traffic based on the MAC address of the client node, which is either authenticated, denied, or let through. The overhead for this service is fairly minimal.

ACPI IMPLEMENTATION ON FREEBSD

Takanori Watanabe, Kobe University

ACPI (Advanced Configuration and Power Interface) was proposed as a joint effort by Intel, Toshiba, and Microsoft to provide a standard, finer-grained method of managing power states of individual devices within a system. Such low-level power management is especially important for those mobile and embedded systems that are powered by fixed-capacity batteries.

Takanori explained that the ACPI specification is composed of three parts: tables, BIOS, and registers. He was able to implement some functionalities of ACPI in a FreeBSD kernel. The ACPI Component Architecture was implemented by Intel, and it provides a high-level ACPI API to the operating system. Takanori's ACPI implementation is built upon this underlying layer of APIs.

ACCESS CONTROL

Summarized by Florian Buchholz

DESIGN AND PERFORMANCE OF THE OPENBSD STATEFUL PACKET FILTER (PF)

Daniel Hartmeier, Systor AG

Daniel Hartmeier described the new stateful packet filter (pf) that replaced IPFilter in the OpenBSD 3.0 release. IPFilter could no longer be included due to licensing issues, and thus there was a need to write a new filter making use of optimized data structures.

The filter rules are implemented as a linked list which is traversed from start to end. Two actions may be taken according to the rules, "pass" or "block": a "pass" action forwards the packet, and a "block" action will drop it. Where more than one rule matches a packet, the last rule wins. Rules that are marked as "final" will immediately terminate any further rule evaluation. An optimization called "skip-steps" was also implemented, where blocks of similar rules are skipped if they cannot match the current packet; these skip-steps are calculated when the rule set is loaded. Furthermore, a state table keeps track of TCP connections, and only packets that match the sequence numbers are allowed. UDP and ICMP queries and replies are considered in the state table, where initial packets will create an entry for a pseudo-connection with a low initial timeout value. The state table is implemented using a balanced binary search tree. NAT mappings are also stored in the state table, whereas application proxies reside in user space. The packet filter is also able to perform fragment reassembly and to modulate TCP sequence numbers to protect hosts behind the firewall.
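The last-match-wins evaluation with "final" rules reduces to a loop like the one below. The rule structure is simplified to just the fields needed to show the idea; skip-steps and the state table are omitted, and the matcher is not pf's.

#include <stdio.h>
#include <stdbool.h>

enum action { PASS, BLOCK };

struct rule {
    enum action act;
    bool        final;        /* stop evaluating further rules on a match */
    int         proto;        /* 0 matches any protocol (simplified matcher) */
    int         dst_port;     /* 0 matches any port */
};

static bool rule_matches(const struct rule *r, int proto, int dst_port)
{
    return (r->proto == 0 || r->proto == proto) &&
           (r->dst_port == 0 || r->dst_port == dst_port);
}

/* Walk the whole list; remember the last matching rule unless a matching
   rule is marked final, in which case it decides immediately. */
static enum action evaluate(const struct rule *rules, int n, int proto, int port)
{
    enum action verdict = PASS;           /* default policy (simplified) */
    for (int i = 0; i < n; i++) {
        if (!rule_matches(&rules[i], proto, port))
            continue;
        verdict = rules[i].act;
        if (rules[i].final)
            break;
    }
    return verdict;
}

int main(void)
{
    struct rule ruleset[] = {
        { BLOCK, false, 0, 0  },          /* block everything ... */
        { PASS,  false, 6, 22 },          /* ... but pass TCP to port 22 */
        { BLOCK, true,  6, 23 },          /* final: always block telnet */
    };
    printf("ssh:    %s\n", evaluate(ruleset, 3, 6, 22) == PASS ? "pass" : "block");
    printf("telnet: %s\n", evaluate(ruleset, 3, 6, 23) == PASS ? "pass" : "block");
    return 0;
}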

Pf was compared against IPFilter as well as Linux's iptables by measuring throughput and latency with increasing traffic rates and different packet sizes. In a test with a fixed rule set size of 100, iptables outperformed the other two filters, whose results were close to each other. In a second test, where the rule set size was continually increased, iptables consistently had about twice the throughput of the other two (which evaluate the rule set on both the incoming and outgoing interfaces). A third test compared only pf and IPFilter, using a single rule that created state in the state table, with a fixed state-entry size. Pf reached an overloaded state much later than IPFilter. The experiment was repeated with a variable state-entry size, and pf performed much better than IPFilter for a small number of states.

In general, rule set evaluation is expensive, and benchmarks only reflect extreme cases, whereas in real life other behavior should be observed. Furthermore, the benchmarks show that stateful filtering can actually improve performance, due to cheap state-table lookups as compared to rule evaluation.

ENHANCING NFS CROSS-ADMINISTRATIVE DOMAIN ACCESS

Joseph Spadavecchia and Erez Zadok, Stony Brook University

The speaker presented modifications to an NFS server that allow improved NFS access between administrative domains. A problem lies in the fact that NFS assumes a shared UID/GID space, which makes it unsafe to export files outside the administrative domain of the server. Design goals were to leave the protocol and existing clients unchanged, a minimum amount of server changes, flexibility, and increased performance.

To solve the problem, two techniques are utilized: "range-mapping" and "file-cloaking." Range-mapping maps IDs between client and server. The mapping is performed on a per-export basis and has to be manually set up in an export file. The mappings can be 1-1, N-N, or N-1. In file-cloaking, the server restricts file access based on UID/GID and special cloaking-mask bits. Here, users can only access their own file permissions. The policy on whether or not the file is visible to others is set by the cloaking mask, which is logically ANDed with the file's protection bits. File-cloaking only works, however, if the client doesn't hold cached copies of directory contents and file attributes. Because of this, the clients are forced to re-read directories by incrementing the mtime value of the directory each time it is listed.
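A minimal sketch of the two checks as described above: a per-export ID map applied to the client's UID, and a visibility test that ANDs the cloaking mask with the file's mode bits. The field names, the example ranges, and the exact policy test are illustrative assumptions; the paper defines the real semantics.

#include <stdio.h>
#include <stdbool.h>
#include <sys/stat.h>

/* One per-export range mapping: client UIDs [c_start, c_start+len) are
   translated to server UIDs [s_start, s_start+len). */
struct uid_range_map { unsigned c_start, s_start, len; };

static unsigned map_uid(const struct uid_range_map *maps, int n, unsigned client_uid)
{
    for (int i = 0; i < n; i++)
        if (client_uid >= maps[i].c_start &&
            client_uid <  maps[i].c_start + maps[i].len)
            return maps[i].s_start + (client_uid - maps[i].c_start);
    return (unsigned)-1;                  /* unmapped: treat as nobody */
}

/* File-cloaking visibility test (illustrative): a file owned by someone
   else is visible only if the cloaking mask ANDed with its mode bits
   still grants "other" read permission. */
static bool is_visible(unsigned mapped_uid, unsigned file_uid,
                       mode_t file_mode, mode_t cloak_mask)
{
    if (mapped_uid == file_uid)
        return true;                      /* owners always see their own files */
    return (file_mode & cloak_mask & S_IROTH) != 0;
}

int main(void)
{
    struct uid_range_map maps[] = { { 1000, 20000, 500 } };   /* hypothetical export */
    unsigned uid = map_uid(maps, 1, 1042);                    /* -> 20042 */

    printf("mapped uid: %u\n", uid);
    printf("other's 0644 file visible: %s\n",
           is_visible(uid, 20007, 0644, 0007) ? "yes" : "no");
    printf("other's 0600 file visible: %s\n",
           is_visible(uid, 20007, 0600, 0007) ? "yes" : "no");
    return 0;
}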

To test the performance of the modified server, five different NFS configurations were evaluated: an unmodified NFS server was compared against one with the modified code included but not used, one with only range-mapping enabled, one with only file-cloaking enabled, and one version with all modifications enabled. For each setup, different file system benchmarks were run. The results show only a small overhead when the modifications are used, generally an increase of below 5%. Another experiment tested the performance of the system with an increasing number of mapped or cloaked entries. The results show that an increase from 10 to 1000 entries resulted in a maximum of about a 14% cost in performance.

One member of the audience pointed out that if clients choose to ignore the changed mtimes from the server and thus still hold caches of the directory entries, the file-cloaking mechanism could be defeated. After a rather lengthy debate, the speaker had to concede that the model doesn't add any extra security. Another question was asked about scalability of the setup of range-mapping. The speaker referred to application-level tools that could be developed for that purpose.

The software is available at ftp://ftp.fsl.cs.sunysb.edu/pub/enf/.

ENGINEERING OPEN SOURCE SOFTWARE

Summarized by Teri Lampoudi

NINGAUI: A LINUX CLUSTER FOR BUSINESS

Andrew Hume, AT&T Labs – Research; Scott Daniels, Electronic Data Systems Corp.

Ningaui is a general-purpose, highly available, resilient architecture built from commodity software and hardware. Emphatically, however, Ningaui is not a Beowulf. Hume calls the cluster design the "Swiss canton model," in which there are a number of loosely affiliated independent nodes with data replicated among them. Jobs are assigned by bidding and leases, and cluster services done as session-based servers are sited via generic job assignment. The emphasis is on keeping the architecture end-to-end, checking all work via checksums, and logging everything. The resilience and high availability required by their goal of 8-to-5 maintenance – vs. the typical 24-7 model, where people get paged whenever the slightest thing goes wrong, regardless of the time of day – is achieved by job restartability. Finally, all computation is performed on local data, without the use of NFS or network-attached storage.

Hume's message is a hopeful one: despite the many problems encountered – things like kernel and service limits, auto-installation problems, TCP storms, and the like – the existence of source code and the paranoid practice of logging and checksumming everything has helped. The final product performs reasonably well, and it appears that the resilient techniques employed do make a difference. One drawback, however, is that the software mentioned in the paper is not yet available for download.

CPCMS: A CONFIGURATION MANAGEMENT SYSTEM BASED ON CRYPTOGRAPHIC NAMES

Jonathan S. Shapiro and John Vanderburgh, Johns Hopkins University

This paper won the FREENIX Best Paper award.

The basic notion behind the project is the fact that everyone has a pet complaint about CVS, and yet it is currently the configuration manager in most widespread use. Shapiro has unleashed an alternative. Interestingly, he did not begin but ended with the usual host of reasons why CVS is bad. The talk instead began abruptly by characterizing the job of a software configuration manager, continued by stating the namespaces which it must handle, and wrapped up with the challenges faced.

X MEETS Z: VERIFYING CORRECTNESS IN THE PRESENCE OF POSIX THREADS

Bart Massey, Portland State University; Robert T. Bauer, Rational Software Corp.

Massey delivered a humorous talk on the insight gained from applying the Z formal specification notation to system software design, rather than the more informal analysis and design process normally used.

The story is told with respect to writing XCB, which replaces the Xlib protocol layer and is supposed to be thread-friendly. But where threads are concerned, deadlock avoidance becomes a hard problem that cannot be solved in an ad-hoc manner, and full model checking is also too hard. In this case, Massey resorted to Z specification to model the XCB lower layer, abstract away locking and data transfers, and locate fundamental issues. Essentially, the difficulties of searching the literature and locating information relevant to the problem at hand were overcome. As Massey put it, "the formal method saved the day."

FILE SYSTEMS

Summarized by Bosko Milekic

PLANNED EXTENSIONS TO THE LINUX EXT2/EXT3 FILESYSTEM

Theodore Y. Ts'o, IBM; Stephen Tweedie, Red Hat

The speaker presented improvements to the Linux Ext2 file system, with the goal of allowing for various expansions while striving to maintain compatibility with older code. Improvements have been facilitated by a few extra superblock fields that were added to Ext2 not long before the Linux 2.0 kernel was released. The fields allow for file system features to be added without compromising the existing setup; this is done by providing bits indicating the impact of the added features with respect to compatibility. Namely, a file system with the "incompat" bit marked is not allowed to be mounted. Similarly, a "read-only" marking would only allow the file system to be mounted read-only.

Directory indexing replaces linear directory searches with a faster search using a fixed-depth tree and hashed keys. File system size can be dynamically increased, and the expanded inode, doubled from 128 to 256 bytes, allows for more extensions.

Other potential improvements were discussed as well, in particular pre-allocation for contiguous files, which allows for better performance in certain setups by pre-allocating contiguous blocks. Security-related modifications, extended attributes, and ACLs were mentioned. An implementation of these features already exists but has not yet been merged into the mainline Ext2/3 code.

RECENT FILESYSTEM OPTIMISATIONS ON FREEBSD

Ian Dowse, Corvil Networks; David Malone, CNRI, Dublin Institute of Technology

David Malone presented four important file system optimizations for the FreeBSD OS: soft updates, dirpref, vmiodir, and dirhash. It turns out that certain combinations of the optimizations (beautifully illustrated in the paper) may yield performance improvements of anywhere between 2 and 10 orders of magnitude for real-world applications.

All four of the optimizations deal with file system metadata. Soft updates allow for asynchronous metadata updates. Dirpref changes the way directories are organized, attempting to place child directories closer to their parents, thereby increasing locality of reference and reducing disk-seek times. Vmiodir trades some extra memory in order to achieve better directory caching. Finally, dirhash, which was implemented by Ian Dowse, changes the way in which entries are searched in a directory. Specifically, FFS uses a linear search to find an entry by name; dirhash builds a hashtable of directory entries on the fly. For a directory of size n with a working set of m files, a search that in certain cases could have been O(n*m) has been reduced, due to dirhash, to effectively O(n + m).
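The effect of dirhash can be pictured with a few lines of C: the first lookup in a large directory pays for building a name-to-entry hash table, and every later lookup in the working set is a constant-time probe instead of another linear scan. This is only the shape of the idea, not the FreeBSD implementation.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define NBUCKETS 128

struct dirent_node {
    char                name[64];
    long                offset;          /* where the entry lives in the directory */
    struct dirent_node *next;
};

struct dirhash { struct dirent_node *bucket[NBUCKETS]; };

static unsigned hash_name(const char *name)
{
    unsigned h = 5381;
    while (*name) h = h * 33 + (unsigned char)*name++;
    return h % NBUCKETS;
}

/* Built once per directory, on first lookup: O(n). */
static void dirhash_add(struct dirhash *dh, const char *name, long offset)
{
    struct dirent_node *e = calloc(1, sizeof(*e));
    unsigned b = hash_name(name);
    snprintf(e->name, sizeof(e->name), "%s", name);
    e->offset = offset;
    e->next = dh->bucket[b];
    dh->bucket[b] = e;
}

/* Each subsequent lookup: O(1) on average, instead of rescanning n entries. */
static long dirhash_lookup(const struct dirhash *dh, const char *name)
{
    for (struct dirent_node *e = dh->bucket[hash_name(name)]; e; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e->offset;
    return -1;
}

int main(void)
{
    struct dirhash dh = {0};
    dirhash_add(&dh, "paper.tex", 0);
    dirhash_add(&dh, "results.dat", 512);
    printf("results.dat at offset %ld\n", dirhash_lookup(&dh, "results.dat"));
    return 0;
}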

FILESYSTEM PERFORMANCE AND SCALABILITY IN LINUX 2.4.17

Ray Bryant, SGI; Ruth Forester, IBM LTC; John Hawkes, SGI

This talk focused on performance evaluation of a number of file systems available and commonly deployed on Linux machines. Comparisons under various configurations of Ext2, Ext3, ReiserFS, XFS, and JFS were presented.

The benchmarks chosen for the data gathering were pgmeter, filemark, and the AIM Benchmark Suite VII. Pgmeter measures the rate of data transfer of reads/writes of a file under a synthetic workload. Filemark is similar to postmark in that it is an operation-intensive benchmark, although filemark is threaded and offers various other features that postmark lacks. AIM VII measures performance for various file-system-related functionalities; it offers various metrics under an imposed workload, thus stressing the performance of the file system not only under I/O load but also under significant CPU load.

Tests were run on three different setups: a small, a medium, and a large configuration. ReiserFS and Ext2 appear at the top of the pile for smaller and medium setups. Notably, XFS and JFS perform worse for smaller system configurations than the others, although XFS clearly appears to generate better numbers than JFS. It should be noted that XFS seems to scale well under a higher load. This was most evident in the large-system results, where XFS appears to offer the best overall results.

THINGS TO THINK ABOUT

Summarized by Bosko Milekic

SPEEDING UP KERNEL SCHEDULER BY REDUCING CACHE MISSES

Shuji Yamamura, Akira Hirai, Mitsuru Sato, Masao Yamamoto, Akira Naruse, Kouichi Kumon, Fujitsu Labs

This was an interesting talk pertaining to the effects of cache coloring for task structures in the Linux kernel scheduler (Linux kernel 2.4.x). The speaker first presented some interesting benchmark numbers for the Linux scheduler, showing that as the number of processes on the task queue was increased, the performance decreased. The authors used some really nifty hardware to measure the number of bus transactions throughout their tests and were thus able to reasonably quantify the impact that cache misses had in the Linux scheduler.

Their experiments led them to implement a cache coloring scheme for task structures, which were previously aligned on 8KB boundaries and therefore were eventually mapped to the same cache lines. This unfortunate placement of task structures in memory induced a significant number of cache misses as the number of tasks grew in the scheduler.

The implemented solution consisted of aligning task structures to more evenly distribute cached entries across the L2 cache. The result was inevitably fewer cache misses in the scheduler. Some negative effects were observed in certain situations; these were primarily due to more cache slots being used by task-structure data in the scheduler, thus forcing data previously cached there to be pushed out.
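The gist of the fix can be shown in a few lines of C: instead of placing every task structure at the start of its 8KB allocation (so they all compete for the same cache sets), each one is offset by a per-task "color" that spreads them across the cache. The structure and constants below are made up for illustration; the kernel patch is more involved.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define STACK_ALLOC_SIZE  8192   /* task struct and kernel stack share an 8KB block */
#define CACHE_LINE        64
#define NCOLORS           16     /* how many distinct offsets to spread tasks over */

struct task_struct { int pid; char state; };   /* trimmed-down stand-in */

/* Without coloring, the task struct sits at offset 0 of every 8KB block,
   so all tasks map to the same cache sets.  With coloring, task i is
   shifted by (i % NCOLORS) cache lines within its block. */
static struct task_struct *place_task(void *block, int task_index)
{
    size_t color_offset = (size_t)(task_index % NCOLORS) * CACHE_LINE;
    return (struct task_struct *)((uint8_t *)block + color_offset);
}

int main(void)
{
    for (int i = 0; i < 4; i++) {
        void *block = aligned_alloc(STACK_ALLOC_SIZE, STACK_ALLOC_SIZE);
        struct task_struct *t = place_task(block, i);
        printf("task %d at offset %zu within its block\n",
               i, (size_t)((uint8_t *)t - (uint8_t *)block));
        free(block);
    }
    return 0;
}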


WORK-IN-PROGRESS REPORTS

Summarized by Brennan Reynolds

RESOURCE VIRTUALIZATION TECHNIQUES FOR WIDE-AREA OVERLAY NETWORKS

Kartik Gopalan, Stony Brook University

This work addressed the issue of provisioning a maximum number of virtual overlay networks (VONs) with diverse quality-of-service (QoS) requirements on a single physical network. Each of the VONs is logically isolated from the others to ensure the QoS requirements. Gopalan mentioned several critical research issues with this problem that are currently being investigated. Dealing with how to provision the network at various levels (link, route, or path) and then enforce the provisioning at run-time is one of the toughest challenges. Currently, Gopalan has developed several algorithms to handle admission control, end-to-end QoS, route selection, scheduling, and fault tolerance in the network.

For more information, visit http://www.ecsl.cs.sunysb.edu/.

VISUALIZING SOFTWARE INSTABILITY

Jennifer Bevan, University of California, Santa Cruz

Detection of instability in software has typically been an afterthought. The point in the development cycle when the software is reviewed for instability is usually after it is difficult and costly to go back and perform major modifications to the code base. Bevan has developed a technique to allow the visualization of unstable regions of code that can be used much earlier in the development cycle. Her technique creates a time series of dependency graphs that include clusters and lines called fault lines. From the graphs, a developer is able to easily determine where the unstable sections of code are and proactively restructure them. She is currently working on a prototype implementation.

For more information, visit http://www.cse.ucsc.edu/~jbevan/evo_viz/.

RELIABLE AND SCALABLE PEER-TO-PEER WEB DOCUMENT SHARING

Li Xiao, College of William and Mary

The idea presented by Xiao would allow end users to share the contents of their Web browser caches with neighboring Internet users. The rationale for this is that today's end users are increasingly connected to the Internet over high-speed links, and browser caches are becoming large storage systems. Therefore, if an individual accesses a page which does not exist in their local cache, Xiao is suggesting that they first query other end users for the content before trying to access the machine hosting the original. This strategy does have some serious problems associated with it that still need to be addressed, including ensuring the integrity of the content and protecting the identity and privacy of the end users.

SEREL: FAST BOOTING FOR UNIX

Leni Mayo, Fast Boot Software

Serel is a tool that generates a visual representation of a UNIX machine's boot-up sequence. It can be used to identify the critical path and can show if a particular service or process blocks for an extended period of time. This information could be used to determine where the largest performance gains could be realized by tuning the order of execution at boot-up. Serel creates a dependency graph, expressed in XML, during boot-up. This graph is then used to create the visual representation. Currently the tool only works on POSIX-compliant systems, but Mayo stated that he would be porting it to other platforms. Other extensions that were mentioned included having the metadata include the use of shared libraries and monitoring the suspend/resume sequence of portable machines.

For more information, visit http://www.fastboot.org/.


BERKELEY DB XML

John Merrells, Sleepycat Software

Merrells gave a quick introduction and overview of the new XML library for Berkeley DB. The library specializes in storage and retrieval of XML content through a tool called XPath. The library allows for multiple containers per document and stores everything natively as XML. The user is also given a wide range of elements to create indices with, including edges, elements, text strings, or presence. The XPath tool consists of a query parser, generator, optimizer, and execution engine. To conclude his presentation, Merrells gave a live demonstration of the software.

For more information, visit http://www.sleepycat.com/xml/.

CLUSTER-ON-DEMAND (COD)

Justin Moore, Duke University

Modern clusters are growing at a rapid rate. Many have pushed beyond the 5000-machine mark, and deploying them results in large expenses as well as management and provisioning issues. Furthermore, if the cluster is "rented" out to various users, it is very time-consuming to configure it to a user's specs, regardless of how long they need to use it. The COD work presented creates dynamic virtual clusters within a given physical cluster. The goal was to have a provisioning tool that would automatically select a chunk of available nodes and install the operating system and middleware specified by the customer in a short period of time. This would allow greater use of resources, since multiple virtual clusters can exist at once. By using a virtual cluster, the size can be changed dynamically. Moore stated that they have created a working prototype and are currently testing and benchmarking its performance.

For more information, visit http://www.cs.duke.edu/~justin/cod/.


CATACOMB

Elias Sinderson, University of California, Santa Cruz

Catacomb is a project to develop a database-backed DASL module for the Apache Web server. It was designed as a replacement for WebDAV. The initial release of the module only contains support for a MySQL database but could be extended to others. Sinderson briefly touched on the performance of the module: it was comparable to the mod_dav Apache module for all query types but search. The presentation concluded with remarks about adding support for the lock method and including ACL specifications in the future.

For more information, visit http://ocean.cse.ucsc.edu/catacomb/.

SELF-ORGANIZING STORAGE

Dan Ellard, Harvard University

Ellard's presentation introduced a storage system that tuned itself based on the workload of the system, without requiring the intervention of the user. The intelligence was implemented as a virtual self-organizing disk that resides below any file system. The virtual disk observes the access patterns exhibited by the system and then attempts to predict what information will be accessed next. An experiment was done using an NFS trace at a large ISP on one of their email servers. Ellard's self-organizing storage system worked well, which he attributes to the fact that most of the files being requested were large email boxes. Areas of future work include exploration of the length and detail of the data collection stage, as well as the CPU impact of running the virtual disk layer.

VISUAL DEBUGGING

John Costigan, Virginia Tech

Costigan feels that the state of current debugging facilities in the UNIX world is not as good as it should be. He proposes the addition of several elements to programs like ddd and gdb. The first addition is including separate heap and stack components. Another is to display only certain fields of a data structure and have the ability to zoom in and out if needed. Finally, the debugger should provide the ability to visually present complex data structures, including linked lists, trees, etc. He said that a beta version of a debugger with these abilities is currently available.

For more information, visit http://infovis.cs.vt.edu/datastruct/.

IMPROVING APPLICATION PERFORMANCE THROUGH SYSTEM-CALL COMPOSITION

Amit Purohit, Stony Brook University

Web servers perform a huge number of context switches and internal data copies during normal operation. These two elements can drastically limit the performance of an application, regardless of the hardware platform it is run on. The Compound System Call (CoSy) framework is an attempt to reduce the performance penalty for context switches and data copies. It includes a user-level library and several kernel facilities that can be used via system calls. The library provides the programmer with a complete set of memory-management functions. Performance tests using the CoSy framework showed a large saving for context switches and data copies; the only area where the savings between conventional libraries and CoSy were negligible was for very large file copies.

For more information, visit http://www.cs.sunysb.edu/~purohit/.

ELASTIC QUOTAS

John Oscar, Columbia University

Oscar began by stating that most resources are flexible but that, to date, disks have not been. While approximately 80 percent of files are short-lived, disk quotas are hard limits imposed by administrators. The idea behind elastic quotas is to have a non-permanent storage area that each user can temporarily use. Oscar suggested the creation of an ehome directory structure to be used in deploying elastic quotas. Currently he has implemented elastic quotas as a stackable file system.

There were also several trends apparent in the data. Many users would ssh to their remote host (good) but then launch an email reader on the local machine, which would connect via POP to the same host and send the password in clear text (bad). This situation can easily be remedied by tunneling protocols like POP through SSH, but it appeared that many people were not aware this could be done. While most of his comments on the use of the wireless network were negative, the list of passwords he had collected showed that people were indeed using strong passwords. His recommendations were to educate and encourage people to use protocols like IPSec, SSH, and SSL when conducting work over a wireless network, because you never know who else is listening.

SYSTRACE

Niels Provos, University of Michigan

How can one be sure the applications one is using actually do exactly what their developers said they do? Short answer: you can't. People today are using a large number of complex applications, which means it is impossible to check each application thoroughly for security vulnerabilities. There are tools out there that can help, though. Provos has developed a tool called systrace that allows a user to generate a policy of acceptable system calls a particular application can make. If the application attempts to make a call that is not defined in the policy, the user is notified and allowed to choose an action. Systrace includes an auto-policy generation mechanism, which uses a training phase to record the actions of all programs the user executes. If the user chooses not to be bothered by applications breaking policy, systrace allows default enforcement actions to be set. Currently systrace is implemented for FreeBSD and NetBSD, with a Linux version coming out shortly.


For more information, visit http://www.citi.umich.edu/u/provos/systrace/.

VERIFIABLE SECRET REDISTRIBUTION FOR SURVIVABLE STORAGE SYSTEMS

Ted Wong, Carnegie Mellon University

Wong presented a protocol that can be used to re-create a file distributed over a set of servers, even if one of the servers is damaged. In his scheme, the user must choose how many shares the file is split into and the number of servers it will be stored across. The goal of this work is to provide persistent, secure storage of information even if it comes under attack. Wong stated that his design of the protocol was complete and he is currently building a prototype implementation of it.

For more information, visit http://www.cs.cmu.edu/~tmwong/research/.

THE GURU IS IN SESSIONS

LINUX ON LAPTOP/PDA

Bdale Garbee, HP Linux Systems Operation

Summarized by Brennan Reynolds

This was a roundtable-like discussion, with Garbee acting as moderator. He is currently in charge of the Debian distribution of Linux and is involved in porting Linux to platforms other than i386. He has successfully helped port the kernel to the Alpha, SPARC, and ARM, and is currently working on a version to run on VAX machines.

A discussion was held on which file system is best for battery life. The Reiser and Ext3 file systems were discussed. People commented that when using Ext3 on their laptops, the disk would never spin down and thus was always consuming a large amount of power. One suggestion was to use a RAM-based file system for any volume that requires a large number of writes and to only have the file system written to disk at infrequent intervals, or when the machine is shut down or suspended.

A question was directed to Garbee about his knowledge of how the HP-Compaq merger would affect Linux. Garbee thought that the merger was good for Linux, both within the company and for the open source community as a whole. He stated that the new company would be the number one shipper of systems with Linux as the base operating system. He also fielded a question about the support of older HP servers and their ability to run Linux, saying that indeed Linux has been ported to them and he personally had several machines in his basement running it.

A question which sparked a large number of responses concerned problems people had with running Linux on mobile platforms. Most people actually did not have many problems at all. There were a few cases of a particular machine model not working, but there were only two widespread issues: docking stations and the new ACPI power management scheme. Docking stations still appear to be somewhat of a headache, but most people had developed ad-hoc solutions for getting them to work. The ACPI power management scheme developed by Intel does not appear to have a quick solution. Ted Ts'o, one of the head kernel developers, stated that there are fundamental architectural problems with the 2.4 series kernel that do not easily allow ACPI to be added. However, he also stated that ACPI is supported in the newest development kernel, 2.5, and will be included in 2.6.

The final topic was the possibility of purchasing a laptop without paying any charge or tax for Microsoft Windows. The consensus was that currently this is not possible. Even if the machine comes with Linux or without any operating system, the vendors have contracts in place which require them to pay Microsoft for each machine they ship if that machine is capable of running a Microsoft operating system. The only way this will change is if the demand for other operating systems increases to the point that vendors renegotiate their contracts with Microsoft, and no one saw this happening in the near future.

LARGE CLUSTERS

Andrew Hume, AT&T Labs – Research

Summarized by Teri Lampoudi

This was a session on clusters, big data, and resilient computing. Given that the audience was primarily interested in clusters, and not necessarily big data or beating nodes to a pulp, the conversation revolved mainly around what it takes to make a heterogeneous cluster resilient. Hume also presented a FREENIX paper on his current cluster project, which explained in more detail much of what was abstractly claimed in the guru session.

Hume's use of the word "cluster" referred not to what I had assumed would be a Beowulf-type system, but to a loosely coupled farm of machines in which concurrency is much less of an issue than it would be in a Beowulf. In fact, the architecture Hume described was designed to deal with large amounts of independent transaction processing, essentially the process of billing calls, which requires no interprocess message passing or anything of the sort a parallel scientific application might.

Where does the large data come in? Since the transactions in question consist of large numbers of flat files, a mechanism for getting the files onto nodes and off-loading the results is necessary. In this particular cluster, named Ningaui, this task is handled by the "replication manager," which generated a large amount of interest in the audience. All management functions in the cluster, as well as job allocation, constitute services that are to be bid for and leased out to nodes. Furthermore, failures are handled by simply repeating the failed task until it is completed properly, if that is possible, and if it is not, then looking through detailed logs to discover what went wrong. Logging and testing for the successful completion of each task are what make this model resilient. Every management operation is logged at some appropriate level of detail, generating 120MB/day, and every single file transfer is md5 checksummed. This was characterized by Hume as a "patient" style of management, where no one controls the cluster, scheduling behavior is emergent rather than stringently planned, and nodes are free to drop in and out of the cluster; a downside to this model is the increase in latency, offset by the fact that the cluster continues to function unattended outside of 8-to-5 support hours.

In response to questions about the projected scalability of such a scheme, Hume said the system would presumably scale from the present 60 nodes to about 100 without modifications, but that scaling higher would be a matter for further design. The guru session ended on a somewhat unrelated note regarding the problems of getting data off mainframes and onto Linux machines – issues of fixed- vs. variable-length encoding of data blocks resulting in useful information being stripped by FTP, and problems in converting COBOL copybooks to something useful on a C-based architecture. To reiterate Hume's claim, there are homemade solutions for some of these problems, and the reader should feel free to contact Mr. Hume for them.

WRITING PORTABLE APPLICATIONS

Nick Stoughton, MSB Consultants

Summarized by Josh Lothian

Nick Stoughton addressed a concern that developers are facing more and more: writing portable applications. He proposed that there is no such thing as a "portable application," only those that have been ported. Developers are still in search of the Holy Grail of applications programming, a write-once, compile-anywhere program. Languages such as Java are leading up to this, but virtual machines still have to be ported, paths will still be different, etc. Web-based applications are another area that could be promising for portable applications.

Nick went on to talk about the future of portability. The recently released POSIX 2001 standards are expected to help the situation. The 2001 revision expands the original POSIX spec into seven volumes. Even though POSIX 2001 is more specific than the previous release, Nick pointed out that even this standard allows for small differences where vendors may define their own behaviors. This can only hurt developers in the long run. Along these lines, it was pointed out that there may be room for automated tools that have the capability to check application code and determine the level of portability and standards compliance of the source.

NETWORK MANAGEMENT SYSTEM PERFORMANCE TUNING

Jeff Allen, Tellme Networks, Inc.

Summarized by Matt Selsky

Jeff Allen, author of Cricket, discussed network tuning. The basic idea of tuning is: measure, twiddle, and measure again. Cricket can be used for large installations, but it's overkill for smaller installations, for which MRTG is better suited. You don't need Cricket to do measurement; doing something like a shell script that dumps data to Excel is fine, but you need to get some sort of measurement. Cricket was not designed for billing; it was meant to help answer questions.

Useful tools and techniques covered included looking for the differences between machines in order to identify their causes; measuring, but also thinking about what you're measuring and why; and remembering to use strace and tcpdump to identify problems. Problems repeat themselves; performance tuning requires an intricate understanding of all the layers involved to solve the problem and simplify the automation (having a computer science background helps but is not essential). If you can reproduce the problem, you then should investigate each piece to find the bottleneck. If you can't reproduce the problem, then measure. You need to understand all the system interactions. Measurement can help determine whether things are actually slow or if the user is imagining it.

Troubleshooting begins with a hunch, but scientific processes are essential. You should be able to determine the event stream, or when each event occurs; having observation logs can help. Some hints provided include checking cron for periodic anomalies, slowing down /bin/rm to avoid I/O overload, and looking for unusual kernel CPU usage.

Also, try to use lightweight monitoring to reduce overhead. You don't need to monitor every resource on every system, but you should monitor those resources on those systems that are essential. Don't check things too often, since you can introduce overhead that way. Lots of information can be gathered from the /proc file system interface to the kernel.

BSD BOF

Host: Kirk McKusick, Author and Consultant

Summarized by Bosko Milekic

The BSD Birds-of-a-Feather session at USENIX started off with five quick and informative talks and ended with an exciting discussion of licensing issues.

Christos Zoulas, for the NetBSD Project, went over the primary goals of the NetBSD Project (namely, portability, a clean architecture, and security) and then proceeded to briefly discuss some of the challenges that the project has encountered. The successes and areas for improvement of 1.6 were then examined. NetBSD has recently seen various improvements in its cross-build system, packaging system, and third-party tool support. As far as the kernel is concerned, a zero-copy TCP/UDP implementation has been integrated, and new pmap code for arm32 has been introduced, as have some other rather major changes to various subsystems. More information is available in Christos's slides, which are now available at http://www.netbsd.org/gallery/events/usenix2002/.

Theo de Raadt of the OpenBSD Project presented a rundown of various new things that the OpenBSD and OpenSSH teams have been looking at. Theo's talk included an amusing description (and pictures) of some of the events that transpired during OpenBSD's hack-a-thon, which occurred the week before USENIX '02. Finally, Theo mentioned that following complications with the IPFilter license, the OpenBSD team had done a fairly extensive license audit.

Robert Watson of the FreeBSD Project brought up various examples of FreeBSD being used in the real world. He went on to describe a wide variety of changes that will surface with the upcoming FreeBSD release 5.0. Notably, 5.0 will introduce a re-architecting of the kernel aimed at providing much more scalable support for SMP machines. 5.0 will also include an early implementation of KSE, FreeBSD's version of scheduler activations, large improvements to pccard, and significant framework bits from the TrustedBSD project.

Mike Karels from Wind River Systems presented an overview of how BSD/OS has been evolving following the Wind River Systems acquisition of BSDi's software assets. Ernie Prabhakar from Apple discussed Apple's OS X and its success on the desktop. He explained how OS X aims to bring BSD to the desktop while striving to maintain a strong and stable core, one that is heavily based on BSD technology.

The BoF finished with a discussion on licensing; specifically, some folks questioned the impact of Apple's licensing for code from their core OS (the Darwin Project) on the BSD community as a whole.

CLOSING SESSION

HOW FLIES FLY

Michael H. Dickinson, University of California, Berkeley

Summarized by JD Welch

In this lively and offbeat talk, Dickinson discussed his research into the mechanisms of flight in the fruit fly (Drosophila melanogaster) and related the autonomic behavior to that of a technological system. Insects are the most successful organisms on earth, due in no small part to their early adoption of flight. Flies travel through space on straight trajectories interrupted by saccades, jumpy turning motions somewhat analogous to the fast, jumpy movements of the human eye. Using a variety of monitoring techniques, including high-speed video in a "virtual reality flight arena," Dickinson and his colleagues have observed that flies respond to visual cues to decide when to saccade during flight. For example, more visual texture makes flies saccade earlier (i.e., further away from the arena wall), described by Dickinson as a "collision-control algorithm."

Through a combination of sensory inputs, flies can make decisions about their flight path. Flies possess a visual "expansion detector," which, at a certain threshold, causes the animal to turn in a certain direction. However, expansion in front of the fly sometimes causes it to land. How does the fly decide? Using the virtual reality device to replicate various visual scenarios, Dickinson observed that the flies fixate on vertical objects that move back and forth across the field. Expansion on the left side of the animal causes it to turn right, and vice versa, while expansion directly in front of the animal triggers the legs to flare out in preparation for landing.

If the eyes detect these changes, how are the responses implemented? The flies' wings have "power muscles," controlled by mechanical resonance (as opposed to the nervous system), which drive the wings, combined with neurally activated "steering muscles," which change the configuration of the wing joints. Subtle variations in the timing of impulses correspond to changes in wing movement, controlled by sensors in the "wing-pit." A small wing-like structure, the haltere, controls equilibrium by beating constantly.

Voluntary control is accomplished by the halteres, whose signals can interfere with the autonomic control of the stroke cycle. The halteres have "steering muscles" as well, and information derived from the visual system can turn off the haltere or trick it into a "virtual" problem requiring a response.

Dickinson has also studied the aerodynamics of insect flight using a device called the "Robo Fly," an oversized mechanical insect wing suspended in a large tank of mineral oil. Interestingly, Dickinson observed that the average lift generated by the flies' wings is under its body weight; the flies use three mechanisms to overcome this, including rotating the wings (rotational lift).

Insects are extraordinarily robust creatures, and because Dickinson analyzed the problem in a systems-oriented way, these observations and analysis are immediately applicable to technology: the response system can be used as an efficient search algorithm for control systems in autonomous vehicles, for example.

THE AFS WORKSHOP

Summarized by Garry Zacheiss, MIT

Love Hornquist-Astrand of the Arla development team presented the Arla status report. Arla 0.35.8 has been released. Scheduled for release soon are improved support for Tru64 UNIX, MacOS X, and FreeBSD, improved volume handling, and implementation of more of the vos/pts subcommands. It was stressed that MacOS X is considered an important platform, and that a GUI configuration manager for the cache manager, a GUI ACL manager for the finder, and a graphical login that obtains Kerberos tickets and AFS tokens at login time were all under development.

Future goals planned for the underlying AFS protocols include GSSAPI/SPNEGO support for Rx, performance improvements to Rx, an enhanced disconnected mode, and IPv6 support for Rx; an experimental patch is already available for the latter. Future Arla-specific goals include improved performance, partial file reading, and increased stability for several platforms. Work is also in progress for the RXGSS protocol extensions (integrating GSSAPI into Rx). A partial implementation exists, and work continues as developers find time.

Derrick Brashear of the OpenAFS development team presented the OpenAFS status report. OpenAFS 1.2.5 was released immediately prior to the conference; it fixed a remotely exploitable denial-of-service attack in several OpenAFS platforms, most notably IRIX and AIX. Future work planned for OpenAFS includes better support for MacOS X, including working around Finder interaction issues. Better support for the BSDs is also planned: FreeBSD has a partially implemented client; NetBSD and OpenBSD have only server programs available right now. AIX 5, Tru64 5.1A, and MacOS X 10.2 (aka Jaguar) are all planned for the future. Other planned enhancements include nested groups in the ptserver (code donated by the University of Michigan awaits integration), disconnected AFS, and further work on a native Windows client. Derrick stated that the guiding principles of OpenAFS were to maintain compatibility with IBM AFS, support new platforms, ease the administrative burden, and add new functionality.

AFS performance was discussed. The openafs.org cell consists of two file servers, one in Stockholm, Sweden, and one in Pittsburgh, PA. AFS works reasonably well over trans-Atlantic links, although OpenAFS clients don't determine which file server to talk to very effectively. Arla clients use RTTs to the server to determine the optimal file server to fetch replicated data from. Modifications to OpenAFS to support this behavior in the future are desired.

Jimmy Engelbrecht and Harald Barth of KTH discussed their AFSCrawler script, written to determine how many AFS cells and clients were in the world, what implementations/versions they were (Arla vs. IBM AFS vs. OpenAFS), and how much data was in AFS. The script unfortunately triggered a bug in IBM AFS 3.6–derived code, causing some clients to panic while handling a specific RPC. This has since been fixed in OpenAFS 1.2.5 and the most recent IBM AFS patch level, and all AFS users are strongly encouraged to upgrade. No release of Arla is vulnerable to this particular denial-of-service attack. There was an extended discussion of the usefulness of this exploration. Many sites believed this was useful information and such scanning should continue in the future, but only on an opt-in basis.

Many sites face the problem of managing Kerberos/AFS credentials for batch-scheduled jobs. Specifically, most batch processing software needs to be modified to forward tickets as part of the batch submission process, renew tickets and tokens while the job is in the queue and for the lifetime of the job, and properly destroy credentials when the job completes. Ken Hornstein of NRL was able to pay a commercial vendor to support Kerberos 4/5 credential management in their product, although they did not implement AFS token management. MIT has implemented some of the desired functionality in OpenPBS and might be able to make it available to other interested sites.

Tools to simplify AFS administration were discussed, including:

AFS Balancer: A tool written by CMU to automate the process of balancing disk usage across all servers in a cell. Available from ftp://ftp.andrew.cmu.edu/pub/AFS-Tools/balance-11b.tar.gz.

Themis: Themis is KTH's enhanced version of the AFS tool "package" for updating files on local disk from a central AFS image. KTH's enhancements include allowing the deletion of files, simplifying the process of adding a file, and allowing the merging of multiple rule sets for determining which files are updated. Themis is available from the Arla CVS repository.

Stanford was presented as an example of a large AFS site. Stanford's AFS usage consists of approximately 14TB of data in AFS, in the form of approximately 100,000 volumes; 33TB of storage is available in their primary cell, ir.stanford.edu. Their file servers consist entirely of Solaris machines running a combination of Transarc 3.6 patch-level 3 and OpenAFS 1.2.x, while their database servers run OpenAFS 1.2.2. Their cell consists of 25 file servers, using a combination of EMC and Sun StorEdge hardware. Stanford continues to use the kaserver for their authentication infrastructure, with future plans to migrate entirely to an MIT Kerberos 5 KDC.

Stanford has approximately 3400 clients on their campus, not including SLAC (Stanford Linear Accelerator); approximately 2100 AFS clients from outside Stanford contact their cell every month. Their supported clients are almost entirely IBM AFS 3.6 clients, although they plan to release OpenAFS clients soon. Stanford currently supports only UNIX clients. There is some on-campus presence of Windows clients, but Stanford has never publicly released or supported it. They do intend to release and support the MacOS X client in the near future.

All Stanford students, faculty, and staff are assigned AFS home directories, with a default quota of 50MB, for a total of approximately 550GB of user home directories. Other significant uses of AFS storage are data volumes for workstation software (400GB) and volumes for course Web sites and assignments (100GB).

AFS usage at Intel was also presented. Intel has been an AFS site since 1994. They had bad experiences with the IBM 3.5 Linux client; their experience with OpenAFS on Linux 2.4.x kernels has been much better. They use and are satisfied with the OpenAFS IA64 Linux port. Intel has hundreds of OpenAFS 1.2.3 and 1.2.4 clients in many production cells accessing data stored on IBM AFS file servers. They have not encountered any interoperability issues. Intel has some concerns about OpenAFS: they would like to purchase commercial support for OpenAFS and to see OpenAFS support for HP-UX on both PA-RISC and Itanium hardware. HP-UX support is currently unavailable due to a specific HP-UX header file being unavailable from HP; this may be available soon. Intel has not yet committed to migrating their file servers to OpenAFS and are unsure if they will do so without commercial support.

Backups are a traditional topic of discussion at AFS workshops, and this time was no exception. Many users complain that the traditional AFS backup tools ("backup" and "butc") are complex and difficult to automate, requiring many home-grown scripts and much user intervention for error recovery. An additional complaint was that the traditional AFS tools do not support file- or directory-level backups and restores; data must be backed up and restored at the volume level.

Mitch Collinsworth of Cornell presented work to make AMANDA, the free backup software from the University of Maryland, suitable for AFS backups. Using AMANDA for AFS backups allows one to share AFS backup tapes with non-AFS backups, easily run multiple backups in parallel, automate error recovery, and provide a robust degraded mode that prevents tape errors from stopping backups altogether. Their implementation allows for full volume restores as well as individual directory and file restores. They have finished coding this work and are in the process of testing and documenting it.

Peter Honeyman of CITI at the University of Michigan spoke about work he has proposed to replace Rx with RPCSEC GSS in OpenAFS; this would allow AFS to use a TCP-based transport mechanism rather than the UDP-based Rx, and possibly gain better congestion control, dynamic adaptation, and fragmentation avoidance as a result. RPCSEC GSS uses the GSSAPI to authenticate SUN ONC RPC. RPCSEC GSS is transport agnostic, provides strong security, is a developing Internet standard, and has multiple open source implementations. Backward compatibility with existing AFS servers and clients is an important goal of this project.


reduced functionality resulting from storage systems' lack of file information.

The speaker presented two enhancements: ExRAID, an exposed RAID, and ILFS, an Informed Log-Structured File System. ExRAID is an enhancement to the block-based RAID storage system that exposes the following three types of information to the file system: (1) regions – contiguous portions of the address space comprising one or multiple disks; (2) performance information about queue lengths and throughput of the regions, revealing disk heterogeneity to the file systems; and (3) failure information – dynamically updated information conveying the number of tolerable failures in each region.

ILFS allows online incremental expansion of the storage space, performs dynamic parallelism using ExRAID's performance information to perform well on heterogeneous storage systems, and provides a range of different mechanisms with different granularities for redundancy of files using ExRAID's region failure characteristics. These new features are added to LFS with only a 19% increase in the code size.
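As a rough sketch of the kind of interface an exposed RAID might present (the names and fields below are illustrative, not taken from the ExRAID code), the three types of exposed information and one way a file system could use them for dynamic parallelism might look like this:

/* Illustrative sketch only: region information of the sort ExRAID exposes,
 * and a trivial consumer that picks the least-loaded region for new writes. */
#include <stdio.h>

struct region_info {
    unsigned long start_block, nblocks;  /* (1) contiguous address-space region */
    unsigned      queue_len;             /* (2) performance: current queue length */
    double        throughput_mbs;        /* (2) performance: observed throughput */
    int           tolerable_failures;    /* (3) failure info, updated dynamically */
};

/* Pick the region with the shortest queue among those still tolerating
 * at least one failure: a crude form of ExRAID-informed load balancing. */
static int pick_region(const struct region_info *r, int n) {
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (r[i].tolerable_failures < 1)
            continue;
        if (best < 0 || r[i].queue_len < r[best].queue_len)
            best = i;
    }
    return best;
}

int main(void) {
    struct region_info regions[] = {
        { 0,       1 << 20, 12, 35.0, 1 },
        { 1 << 20, 1 << 20,  3, 18.0, 2 },
        { 2 << 20, 1 << 20,  7, 60.0, 0 },   /* no remaining redundancy */
    };
    printf("write to region %d\n", pick_region(regions, 3));
    return 0;
}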

More information: http://www.cs.wisc.edu/wind.

MAXIMIZING THROUGHPUT IN REPLICATED DISK STRIPING OF VARIABLE BIT-RATE STREAMS

Stergios V. Anastasiadis, Duke University; Kenneth C. Sevcik and Michael Stumm, University of Toronto

There is an increasing demand for the continuous real-time streaming of media files. Disk striping is a common technique used for supporting many connections on the media servers. But even with 1.2 million hours of mean time between failures, there will be more than one disk failure per week with 7000 disks. Fault tolerance can be achieved by using data replication and reserving extra bandwidth during normal operation. This work focused on supporting the most common variable bit-rate stream formats (e.g., MPEG).


The authors used the prototype media server EXEDRA to implement and evaluate different replication and bandwidth reservation schemes. This system supports a variable bit-rate scheme, does stride-based disk allocation, and is capable of supporting variable-grain disk striping. Two replication policies are presented: deterministic – data blocks are replicated in a round-robin fashion on the secondary replicas; and random – data blocks are replicated on randomly chosen secondary replicas. Two bandwidth reservation techniques are also presented: mirroring reservation – disk bandwidth is reserved for both primary and replicas of the media file during the playback; and minimum reservation – a more efficient scheme in which bandwidth is reserved only for the sum of primary data access time and the maximum of the backup data access times required in each round.
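A minimal sketch of the two placement policies as described (not the EXEDRA implementation; the helper functions and their parameters are hypothetical, and the random variant assumes more disks than replicas):

/* Sketch of the two replica-placement policies described above: given a
 * block, its primary disk, D disks total, and R secondary replicas,
 * fill out[] with the disks holding the replicas. */
#include <stdio.h>
#include <stdlib.h>

/* Deterministic: replicas follow the primary in round-robin order. */
static void place_deterministic(int primary, int D, int R, int *out) {
    for (int i = 0; i < R; i++)
        out[i] = (primary + 1 + i) % D;
}

/* Random: replicas land on randomly chosen disks other than the primary
 * (a real placement would also avoid putting two replicas on one disk). */
static void place_random(int primary, int D, int R, int *out) {
    for (int i = 0; i < R; i++) {
        int d;
        do {
            d = rand() % D;
        } while (d == primary);          /* never co-locate with the primary */
        out[i] = d;
    }
}

int main(void) {
    int out[2];
    place_deterministic(3, 8, 2, out);
    printf("deterministic: %d %d\n", out[0], out[1]);
    place_random(3, 8, 2, out);
    printf("random: %d %d\n", out[0], out[1]);
    return 0;
}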

Experimental results showed that deterministic replica placement is better than random placement for small disks, minimum disk bandwidth reservation is twice as good as mirroring in the throughput achieved, and fault tolerance can be achieved with a minimal impact on the throughput.

TOOLS

Summarized by Li Xiao

SIMPLE AND GENERAL STATISTICAL PROFILING WITH PCT

Charles Blake and Steven Bauer, MIT

This talk introduced the Profile Collection Toolkit (PCT) – a sampling-based CPU profiling facility. A novel aspect of PCT is that it allows sampling of semantically rich data such as function call stacks, function parameters, local or global variables, CPU registers, or other execution contexts. The design objectives of PCT were driven by user needs and the inadequacies or inaccessibility of prior systems.

The rich data collection capability is achieved via a debugger-controller program, dbct1. The talk then introduced the profiling collector and data collection activation. PCT file format options provide flexibility and various ways to store exact states. PCT also has analysis options; two examples were given – "Simple" and "Mixed User and Linux Code."

The talk then went into how general sampling helps in the following areas: a debugger-controller state machine, example script fragments, a simple example program, and general value reports in terms of call-path histograms, numeric value histograms, and data generality. Related work, such as gprof and Expect, was then summarized. The contribution of this work is to provide a portable, general value sampling tool. The limitations are that PCT does not support much of strip, and count inaccuracies happen because of its statistical nature. Based on this limitation, -g is preferred over strip.

Possible future directions included supporting more debuggers, such as dbx and xdb, script generators for other kinds of traces, more canned reports for general values, and a libgdb-based library sampler. PCT is available at http://pdos.lcs.mit.edu/pct. PCT runs on almost any UNIX-like system.

ENGINEERING A DIFFERENCING AND COMPRESSION DATA FORMAT

David G. Korn and Kiem Phong Vo, AT&T Laboratories – Research

This talk began with an equation: "Differencing + Compression = Delta Compression." Compression removes information redundancy in a data set; examples are gzip, bzip, compress, and pack. Differencing encodes differences between two data sets; examples are diff -e, fdelta, bdiff, and xdelta. Delta compression compresses a data set given another, combining differencing and compression, and reduces to pure compression when there's no commonality.
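The idea can be made concrete with a toy encoder: target regions that already appear in the source become COPY instructions, everything else is emitted as ADD literals, and the resulting instruction stream is then handed to an ordinary compressor. This is only a sketch of the concept, not the Vcdiff format:

/* Toy delta encoder illustrating the COPY/ADD idea behind delta
 * compression (not the Vcdiff encoding itself). */
#include <stdio.h>
#include <string.h>

#define MIN_MATCH 4

static void delta(const char *src, const char *tgt) {
    size_t slen = strlen(src), tlen = strlen(tgt), i = 0;
    while (i < tlen) {
        size_t best_len = 0, best_off = 0;
        for (size_t off = 0; off < slen; off++) {   /* brute-force match search */
            size_t len = 0;
            while (off + len < slen && i + len < tlen &&
                   src[off + len] == tgt[i + len])
                len++;
            if (len > best_len) { best_len = len; best_off = off; }
        }
        if (best_len >= MIN_MATCH) {
            printf("COPY off=%zu len=%zu\n", best_off, best_len);
            i += best_len;
        } else {
            printf("ADD '%c'\n", tgt[i]);           /* literal byte */
            i++;
        }
    }
}

int main(void) {
    delta("the quick brown fox", "the quiet brown fox jumps");
    return 0;
}

In a real delta compressor the ADD literals are batched and the whole instruction stream is entropy-coded, which is where the "compression" half of the equation comes in.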

After presenting an overall scheme for a delta compressor and showing delta compression performance, this talk discussed the encoding format of the newly designed Vcdiff for delta compression. A Vcdiff instruction code table consists of 256 entries, each coding up to a pair of instructions, and encodes 1-byte indices and any additional data.

The talk then showed Vcdiff's performance with Web data collected from CNN, compared with the W3C standard Gdiff, Gdiff+gzip, and diff+gzip, where Gdiff was computed using Vcdiff delta instructions. The results of two experiments, "First" and "Successive," were presented. In "First," each file is compressed against the first file collected; in "Successive," each file is compressed against the one in the previous hour. The diff+gzip did not work well because diff was line-oriented. Vcdiff performed favorably compared with other formats.

Vcdiff is part of the Vcodex package. Vcodex is a platform for all common data transformations: delta compression, plain compression, encryption, transcoding (e.g., uuencode, base64). It is structured in three layers for maximum usability; the base library uses Disciplines and Methods interfaces.

The code for Vcdiff can be found at http://www.research.att.com/sw/tools. They are moving Vcdiff to the IETF standard as a comprehensive platform for transforming data. Please refer to http://www.ietf.org/internet-draft/draft-korn-vcdiff06.txt.

WHERE IN THE NET

Summarized by Amit Purohit

A PRECISE AND EFFICIENT EVALUATION OF THE PROXIMITY BETWEEN WEB CLIENTS AND THEIR LOCAL DNS SERVERS

Zhuoqing Morley Mao, University of California, Berkeley; Charles D. Cranor, Fred Douglis, Michael Rabinovich, Oliver Spatscheck, and Jia Wang, AT&T Labs – Research

Content Distribution Networks (CDNs) attempt to improve Web performance by delivering Web content to end users from servers located at the edge of the network. When a Web client requests content, the CDN dynamically chooses a server to route the request to, usually the one that is nearest the client. As CDNs have access only to the IP address of the local DNS server (LDNS) of the client, the CDN's authoritative DNS server maps the client's LDNS to a geographic region within a particular network and combines that with network and server load information to perform CDN server selection.

This method has two limitations. First, it is based on the implicit assumption that clients are close to their LDNS. This may not always be valid. Second, a single request from an LDNS can represent a differing number of Web clients – called the hidden load factor. Knowledge of the hidden load factor can be used to achieve better load distribution.

The extent of the first limitation and its impact on the CDN server selection is dealt with. To determine the associations between clients and their LDNS, a simple, non-intrusive, and efficient mapping technique was developed. The data collected was used to study the impact of proximity on DNS-based server selection using four different proximity metrics: (1) autonomous system (AS) clustering – observing whether a client is in the same AS as its LDNS – concluded that LDNS is very good for coarse-grained server selection, as 64% of the associations belong to the same AS; (2) network clustering – observing whether a client is in the same network-aware cluster (NAC) – implied that DNS is less useful for finer-grained server selection, since only 16% of the client and LDNS pairs are in the same NAC; (3) traceroute divergence – the length of the divergent paths to the client and its LDNS from a probe point using traceroute – implies that most clients are topologically close to their LDNS as viewed from a randomly chosen probe site; (4) round-trip time (RTT) correlation (some CDNs select servers based on RTT between the CDN server and the client's LDNS) examines the correlation between the message RTTs from a probe point to the client and its local DNS; results imply this is a reasonable metric to use to avoid really distant servers.

A study of the impact that client-LDNS associations have on DNS-based server selection concludes that knowing the client's IP address would allow more accurate server selection in a large number of cases. The optimality of the server selection also depends on the server density, placement, and selection algorithms.

Further information can be found at http://www.eecs.berkeley.edu/~zmao/myresearch.html or by contacting zmao@eecs.berkeley.edu.

GEOGRAPHIC PROPERTIES OF INTERNET ROUTING

Lakshminarayanan Subramanian, Venkata N. Padmanabhan, Microsoft Research; Randy H. Katz, University of California at Berkeley

Geographic information can provide insights into the structure and functioning of the Internet, including interactions between different autonomous systems, by analyzing certain properties of Internet routing. It can be used to measure and quantify certain routing properties, such as circuitous routing, hot-potato routing, and geographic fault tolerance.

Traceroute has been used to gather the required data, and the Geotrack tool has been used to determine the location of the nodes along each network path. This enables the computation of "linearized distances," which is the sum of the geographic distances between successive pairs of routers along the path.

In order to measure the circuitousness of a path, a metric "distance ratio" has been defined as the ratio of the linearized distance of a path to the geographic distance between the source and destination of the path. From the data it has been observed that the circuitousness of a route depends on both the geographic and network location of the end hosts. A large value of the distance ratio enables us to flag paths that are highly circuitous, possibly (though not necessarily) because of routing anomalies. It is also shown that the minimum delay between end hosts is strongly correlated with the linearized distance of the path.
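A small sketch of how the two quantities combine (the router coordinates below are invented for illustration; the paper derives locations with Geotrack):

/* Distance ratio = linearized distance (sum of great-circle hops along
 * the router path) / direct great-circle distance source-to-destination. */
#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif
#define EARTH_RADIUS_KM 6371.0

struct coord { double lat, lon; };   /* degrees */

static double haversine_km(struct coord a, struct coord b) {
    double rlat1 = a.lat * M_PI / 180.0, rlat2 = b.lat * M_PI / 180.0;
    double dlat = rlat2 - rlat1;
    double dlon = (b.lon - a.lon) * M_PI / 180.0;
    double h = sin(dlat / 2) * sin(dlat / 2) +
               cos(rlat1) * cos(rlat2) * sin(dlon / 2) * sin(dlon / 2);
    return 2.0 * EARTH_RADIUS_KM * asin(sqrt(h));
}

int main(void) {
    /* Hypothetical path: Berkeley -> Seattle -> Chicago -> New York */
    struct coord path[] = {
        {37.87, -122.27}, {47.61, -122.33}, {41.88, -87.63}, {40.71, -74.01}
    };
    int n = sizeof(path) / sizeof(path[0]);

    double linearized = 0.0;
    for (int i = 0; i + 1 < n; i++)
        linearized += haversine_km(path[i], path[i + 1]);

    double direct = haversine_km(path[0], path[n - 1]);
    printf("linearized = %.0f km, direct = %.0f km, distance ratio = %.2f\n",
           linearized, direct, linearized / direct);
    return 0;
}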

Geographic information can be used to study various aspects of wide-area Internet paths that traverse multiple ISPs. It was found that end-to-end Internet paths tend to be more circuitous than intra-ISP paths, the cause for this being the peering relationships between ISPs. Also, paths traversing substantial distances within two or more ISPs tend to be more circuitous than paths largely traversing only a single ISP. Another finding is that ISPs generally employ hot-potato routing.

Geographic information can also be used to capture the fact that two seemingly unrelated routers can be susceptible to correlated failures. By using the geographic information of routers, we can construct a geographic topology of an ISP. Using this, we can find the tolerance of an ISP's network to the total network failure in a geographic region.

For further information, contact lakme@cs.berkeley.edu.

PROVIDING PROCESS ORIGIN INFORMATION TO AID IN NETWORK TRACEBACK

Florian P. Buchholz, Purdue University; Clay Shields, Georgetown University

Network traceback is currently limited because host audit systems do not maintain enough information to match incoming network traffic to outgoing network traffic. The talk presented an alternative: assigning origin information to every process and logging it during interactive login creation.

The current implementation concentrates mainly on interactive sessions, in which an event is logged when a new connection is established using SSH or Telnet. The method is effective and could successfully determine stepping stones and reliably detect the source of a DDoS attack. The speaker then talked about the related work done in this area, which led to discussion of the latest development in the area of "host causality."

The speaker ended with the limitations and future directions of the framework. This framework could be extended and could find many applications in the area of security. The current implementation doesn't take care of startup scripts and cron jobs, but incorporating the origin information in the FS could solve this problem. In the current implementation, logging is just implemented as a proof of concept. It could be made safe in many ways, and this could be another important aspect of future work.

PROGRAMMING

Summarized by Amit Purohit

CYCLONE: A SAFE DIALECT OF C

Trevor Jim, AT&T Labs – Research; Greg Morrisett, Dan Grossman, Michael Hicks, James Cheney, Yanling Wang, Cornell University

Cyclone is designed to provide protection against attacks such as buffer overflows, format string attacks, and memory management errors. The current structure of C allows programmers to write vulnerable programs. Cyclone extends C so that it has the safety guarantee of Java while keeping C syntax, types, and semantics untouched.

The Cyclone compiler performs static analysis of the source code and inserts runtime checks into the compiled output at places where the analysis cannot determine that the code execution will not violate safety constraints. Cyclone imposes many restrictions to preserve safety, such as NULL checks. These checks do not exist in normal C.
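As a rough illustration in plain C (not Cyclone syntax), the guard below mimics the NULL and bounds checks that the Cyclone compiler inserts automatically wherever its static analysis cannot rule out a violation:

/* Plain-C sketch of the checks Cyclone inserts for accesses it cannot
 * prove safe; the "fat" buffer carrying its own length is an analogy to
 * Cyclone's bounds-carrying pointers. */
#include <stdio.h>
#include <stdlib.h>

struct fat_buf {
    char  *data;
    size_t len;
};

static char checked_read(struct fat_buf b, size_t i) {
    if (b.data == NULL || i >= b.len) {     /* NULL and bounds check */
        fprintf(stderr, "runtime check failed: index %zu\n", i);
        exit(1);                            /* Cyclone raises an exception instead */
    }
    return b.data[i];
}

int main(void) {
    char raw[8] = "usenix";
    struct fat_buf b = { raw, sizeof(raw) };
    printf("%c\n", checked_read(b, 2));     /* fine */
    printf("%c\n", checked_read(b, 42));    /* caught instead of overflowing */
    return 0;
}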

The speaker then talked about some sample code written in Cyclone and how it tackles safety problems. Porting an existing C application to Cyclone is pretty easy, with fewer than 10% of the code requiring change. The current implementation concentrates more on safety than on performance – hence the performance penalty is substantial. Cyclone was able to find many lingering bugs. The final step of the project will be to write a compiler to convert a normal C program to an equivalent Cyclone program.

COOPERATIVE TASK MANAGEMENT WITHOUT MANUAL STACK MANAGEMENT

Atul Adya, Jon Howell, Marvin Theimer, William J. Bolosky, John R. Douceur, Microsoft Research

The speaker described the definitions and motivations behind the distinct concepts of task management, stack management, I/O management, conflict management, and data partitioning. Conventional concurrent programming uses preemptive task management and exploits the automatic stack management of a standard language. In the second approach, cooperative tasks are organized as event handlers that yield control by returning control to the event scheduler, manually unrolling their stacks. In this project they have adopted a hybrid approach that makes it possible for both stack management styles to coexist. Thus, programmers can code assuming one or the other of these stack management styles will operate, depending upon the application. The speaker also gave a detailed example of how to use their mechanism to switch between the two styles.

The talk then continued with the implementation. They were able to preempt many subtle concurrency problems by using cooperative task management. Paying a cost up-front to reduce a subtle race condition proved to be a good investment.

Though the choice of task management is fundamental, the choice of stack management can be left to individual taste. This project enables use of any type of stack management in conjunction with cooperative task management.

IMPROVING WAIT-FREE ALGORITHMS FOR INTERPROCESS COMMUNICATION IN EMBEDDED REAL-TIME SYSTEMS

Hai Huang, Padmanabhan Pillai, Kang G. Shin, University of Michigan

The main characteristic of a real-time, time-sensitive system is its predictable response time. But concurrency management creates hurdles to achieving this goal because of the use of locks to maintain consistency. To solve this problem, many wait-free algorithms have been developed, but these are typically high-cost solutions. By taking advantage of the temporal characteristics of the system, however, the time and space overhead can be reduced.

The speaker presented an algorithm for temporal concurrency control and applied this technique to improve three wait-free algorithms. A single-writer multiple-reader wait-free algorithm and a double-buffer algorithm were proposed.
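A minimal sketch of the double-buffer idea for a single writer, written with C11 atomics (an illustration, not the algorithm from the paper; a production real-time version needs extra slots or checks so that a reader preempted mid-read cannot be overtaken by the writer):

/* Single-writer double buffer: the writer fills the inactive slot and then
 * flips the published index; readers always read the published slot, so
 * neither side ever blocks on a lock. */
#include <stdatomic.h>
#include <stdio.h>

struct sample { long seq; double value; };

static struct sample slot[2];
static atomic_int published = 0;          /* index readers should use */

static void writer_publish(long seq, double value) {
    int next = 1 - atomic_load_explicit(&published, memory_order_relaxed);
    slot[next].seq = seq;                 /* fill the inactive slot */
    slot[next].value = value;
    atomic_store_explicit(&published, next, memory_order_release);
}

static struct sample reader_get(void) {
    int cur = atomic_load_explicit(&published, memory_order_acquire);
    return slot[cur];                     /* read the published slot */
}

int main(void) {
    writer_publish(1, 3.14);
    struct sample s = reader_get();
    printf("seq=%ld value=%g\n", s.seq, s.value);
    writer_publish(2, 2.71);
    s = reader_get();
    printf("seq=%ld value=%g\n", s.seq, s.value);
    return 0;
}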

Using their transformation mechanism, they achieved an improvement of 17–66% in ACET and a 14–70% reduction in memory requirements for IPC algorithms. This mechanism is extensible and can be applied to other non-blocking IPC algorithms as well. Future work involves reducing the synchronizing overheads in more general IPC algorithms with multiple-writer semantics.

MOBILITY

Summarized by Praveen Yalagandula

ROBUST POSITIONING ALGORITHMS FOR DISTRIBUTED AD-HOC WIRELESS SENSOR NETWORKS

Chris Savarese, Jan Rabaey, University of California, Berkeley; Koen Langendoen, Delft University of Technology

The Pico Radio Network comprises more than a hundred sensors, monitors, and actuators equipped with wireless connectivity, with the following properties: (1) no infrastructure, (2) computation in a distributed fashion instead of centralized computation, (3) dynamic topology, (4) limited radio range, (5) nodes that can act as repeaters, (6) devices of low cost and with low power, and (7) sparse anchor nodes – nodes with GPS capability. A positioning algorithm determines the geographical position of each node in the network.

In two-dimensional space, each node needs three reference positions to estimate its geographical position. There are two problems that make positioning difficult in the Pico Radio Network setting: (1) a sparse anchor node problem, and (2) a range error problem.

The two-phase approach that the authors have taken in solving the problem consists of (1) a hop-terrain algorithm – in this first phase, each node roughly guesses its location by the distance calculated using multi-hops to the anchor points – and (2) a refinement algorithm – in this second phase, each node uses its neighbors' positions to refine its own position estimate. To guarantee convergence, this approach uses confidence metrics, assigning a value of 1.0 for anchor nodes and 0.1 to start for other nodes, and increasing these with each iteration.

A simulation tool, OMNeT++, was used for both phases and in various scenarios. The results show that they achieved position errors of less than 33% in a scenario with 5% anchor nodes, an average connectivity of 7, and 5% range measurement error.

Guidelines for anchor node deployment are high connectivity (>10), a reasonable fraction of anchor nodes (>5%), and, for anchor placement, covered edges.

APPLICATION-SPECIFIC NETWORK MANAGEMENT FOR ENERGY-AWARE STREAMING OF POPULAR MULTIMEDIA FORMATS

Surendar Chandra, University of Georgia, and Amin Vahdat, Duke University

The main hindrance in supporting the increasing demand for mobile multimedia on PDAs is the battery capacity of the small devices. The idle-time power consumption is almost the same as the receive power consumption on typical wireless network cards (1319mW vs. 1425mW), while the sleep state consumes far less power (177mW). To let the system consume energy proportional to the stream quality, the network card should transition to the sleep state aggressively between each packet.

The speaker presented previous work where different multimedia streams were studied for PDAs: MS Media, Real Media, and QuickTime. It was found that if the inter-packet gap is predictable, then huge savings are possible; for example, about 80% savings in MS Media streams with few packet losses. The limitation of the IEEE 802.11b power-save mode comes into play when there are two or more nodes, producing a delay between beacon time and when the node receives/sends packets. This badly affects the streams, since higher energy is consumed while waiting.

The authors propose traffic shaping for energy conservation, where this is done by proxy such that packets arrive at regular intervals. This is achieved using a local proxy in the access point and a client-side proxy in the mobile host. Simulations show that traffic shaping reduces energy consumption and also reveal that the higher bandwidth streams have a lower energy metric (mJ/kB).
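A back-of-the-envelope calculation with the power numbers quoted above shows why sleeping in predictable inter-packet gaps matters; the 100ms of receive time per second is an assumed figure, and state-transition overheads are ignored:

/* Energy per second of playback: receive time at receive power, and the
 * remaining gap either idled away or slept away (power in mW, time in s,
 * so the products are in mJ). */
#include <stdio.h>

int main(void) {
    const double p_rx = 1425.0, p_idle = 1319.0, p_sleep = 177.0; /* mW */
    const double interval_s = 1.0;    /* one second of playback */
    const double rx_s = 0.10;         /* assumed: 100ms of actual reception */
    const double gap_s = interval_s - rx_s;

    double e_unshaped = p_rx * rx_s + p_idle * gap_s;   /* card idles in gaps */
    double e_shaped   = p_rx * rx_s + p_sleep * gap_s;  /* card sleeps in gaps */

    printf("unshaped: %.0f mJ, shaped: %.0f mJ, savings: %.0f%%\n",
           e_unshaped, e_shaped, 100.0 * (1.0 - e_shaped / e_unshaped));
    return 0;
}

With these assumptions the savings come out to roughly three quarters of the energy, in line with the "about 80% savings" figure quoted for predictable MS Media streams.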

More information is at http://greenhouse.cs.uga.edu.

CHARACTERIZING ALERT AND BROWSE SERVICES OF MOBILE CLIENTS

Atul Adya, Paramvir Bahl, and Lili Qiu, Microsoft Research

Even though there is a dramatic increase in Internet access from wireless devices, there are not many studies done on characterizing this traffic. In this paper, the authors characterize the traffic observed on the MSN Web site with both notification and browse traces.


Around 33 million browsing accesses and about 325 million notification entries are present in the traces. Three types of analyses are done for each one of these two services: content analysis, concerning the most popular content categories and their distribution; popularity analysis, or the popularity distribution of documents; and user behavior analysis.

The analysis of the notification logs shows that document access rates follow a Zipf-like distribution, with most of the accesses concentrated on a small number of messages, and the accesses exhibit geographical locality – users from the same locality tend to receive similar notification content. The browser log analysis shows that a smaller set of URLs are accessed most times, though the access pattern does not fit any Zipf curve, and the highly accessed URLs remain stable. A correlation study between the notification and browsing services shows that wireless users have a moderate correlation of 0.12.

FREENIX TRACK SESSIONS

BUILDING APPLICATIONS

Summarized by Matt Butner

INTERACTIVE 3-D GRAPHICS APPLICATIONS FOR TCL

Oliver Kersting, Jürgen Döllner, Hasso Plattner Institute for Software Systems Engineering, University of Potsdam

The integration of 3-D image rendering functionality into a scripting language permits interactive and animated 3-D development and application without the formalities and precision demanded by low-level C/C++ graphics and visualization libraries. The large and complex C++ API of the Virtual Rendering System (VRS) can be combined with the conveniences of the Tcl scripting language. The mapping of class interfaces is done via an automated process and generates respective wrapper classes, all of which ensures complete API accessibility and functionality without any significant performance retribution. The general C++-to-Tcl mapper SWIG grants the necessary object and hierarchal abilities without the use of object-oriented Tcl extensions. The final API mapping techniques address C++ features such as classes, overloaded methods, enumerations, and inheritance relations. All are implemented in a proof-of-concept that maps the complete C++ API of the VRS to Tcl and are showcased in a complete interactive 3-D-map system.

THE AGFL GRAMMAR WORK LAB

Cornelis H.A. Coster, Erik Verbruggen, University of Nijmegen (KUN)

The growth and implementation of Natural Language Processing (NLP) is the cornerstone of the continued evolution and implementation of truly intelligent search machines and services. In part due to the growing collections of computer-stored, human-readable documents in the public domain, the implementation of linguistic analysis will become necessary for desirable precision and resolution. Subsequently, such tools and linguistic resources must be released into the public domain, and they have done so with the release of the AGFL Grammar Work Lab under the GNU Public License as a tool for linguistic research and the development of NLP-based applications.

The AGFL (Affix Grammars over a Finite Lattice) Grammar Work Lab meshes context-free grammars with finite set-valued features that are acceptable to a range of languages. In computer science terms, "Syntax rules are procedures with parameters and a non-deterministic execution." The English Phrases for Information Retrieval (EP4IR), released with the AGFL-GWL as a robust grammar of English, is an AGFL-GWL-generated English parser that outputs "Head/Modified" frames. The sentences "CompanyX sponsored this conference" and "This conference was sponsored by CompanyX" both generate [CompanyX[sponsored conference]]. The hope is that public availability of such tools will encourage further development of grammar and lexical software systems.

SWILL: A SIMPLE EMBEDDED WEB SERVER LIBRARY

Sotiria Lampoudi, David M. Beazley, University of Chicago

This paper won the FREENIX Best Student Paper award.

SWILL (Simple Web Interface Link Library) is a simple Web server in the form of a C library whose development was motivated by a wish to give cool applications an interface to the Web. The SWILL library provides a simple interface that can be efficiently implemented for tasks that vary from flexible Web-based monitoring to software debugging and diagnostics. Though originally designed to be integrated with high-performance scientific simulation software, the interface is generic enough to allow for unbounded uses. SWILL is a single-threaded server relying upon non-blocking I/O through the creation of a temporary server which services I/O requests. SWILL does not provide SSL or cryptographic authentication, but does have HTTP authentication abilities.

A fantastic feature of SWILL is its support for SPMD-style parallel applications which utilize MPI, proving valuable for Beowulf clusters and large parallel machines. Another practical application was the implementation of SWILL in a modified Yalnix emulator by University of Chicago Operating Systems courses, which utilized the added Yalnix functionality for OS development and debugging. SWILL requires minimal memory overhead and relies upon the HTTP/1.0 protocol.

NETWORK PERFORMANCE

Summarized by Florian Buchholz

LINUX NFS CLIENT WRITE PERFORMANCE

Chuck Lever, Network Appliance; Peter Honeyman, CITI, University of Michigan

Lever introduced a benchmark to measure NFS client write performance. Client performance is difficult to measure due to hindrances such as poor hardware or bandwidth limitations. Furthermore, measuring application performance does not identify weaknesses specifically at the client side. Thus, a benchmark was developed trying to exercise only data transfers in one direction between server and application. For this purpose, the benchmark was based on the block sequential write portion of the Bonnie file system benchmark. Once a benchmark for NFS clients is established, it can be used to improve client performance.

The performance measurements were performed with an SMP Linux client and both a Linux NFS server and a Network Appliance F85 filer. During testing, a periodic jump in write latency time was discovered. This was due to a rather large number of pending write operations that were scheduled to be written after certain threshold values were exceeded. By introducing a separate daemon that flushes the cached write requests, the spikes could be eliminated, but as a result the average latency grows over time. The problem could be traced to a function that scans a linked list of write requests. After having added a hashtable to improve lookup performance, the latency improved considerably.

The improved client was then used to measure throughput against the two servers. A discrepancy between the Linux server and the filer test was noticed, and the reason for that was traced back to a global kernel lock that was unnecessarily held when accessing the network stack. After correcting this, performance improved further. However, the measurements showed that a client may run slower when paired with fast servers on fast networks. This is due to heavy client interrupt loads, more network processing on the client side, and more global kernel lock contention.

The source code of the project is available at http://www.citi.umich.edu/projects/nfs-perf/patches.

A STUDY OF THE RELATIVE COSTS OF NETWORK SECURITY PROTOCOLS

Stefan Miltchev and Sotiris Ioannidis, University of Pennsylvania; Angelos Keromytis, Columbia University

With the increasing need for security and integrity of remote network services, it becomes important to quantify the communication overhead of IPSec and compare it to alternatives such as SSH, SCP, and HTTPS.

For this purpose, the authors set up three testing networks: a direct link, two hosts separated by two gateways, and three hosts connecting through one gateway. Protocols were compared in each setup, and manual keying was used to eliminate connection setup costs. For IPSec, the different encryption algorithms AES, DES, 3DES, hardware DES, and hardware 3DES were used. In detail, FTP was compared to SFTP, SCP, and FTP over IPSec; HTTP to HTTPS and HTTP over IPSec; and NFS to NFS over IPSec and local disk performance.

The result of the measurements was that IPSec outperforms other popular encryption schemes. Overall, unencrypted communication was fastest, but in some cases, like FTP, the overhead can be small. The use of crypto hardware can significantly improve performance. For future work, the inclusion of setup costs, hardware-accelerated SSL, SFTP, and SSH were mentioned.

CONGESTION CONTROL IN LINUX TCP

Pasi Sarolahti, University of Helsinki; Alexey Kuznetsov, Institute for Nuclear Research at Moscow

Having attended the talk and read the paper, I am still unclear about whether the authors are merely describing the design decisions of TCP congestion control or whether they are actually the creators of that part of the Linux code. My guess leans toward the former.

In the talk, the speaker compared the TCP protocol congestion control measures according to IETF and RFC specifications with the actual Linux implementation, which does conform to the basic principles but nevertheless has differences. A specific emphasis was placed on retransmission mechanisms and the congestion window. Also, several TCP enhancements – the NewReno algorithm, Selective ACKs (SACK), Forward ACKs (FACK) – were discussed and compared.

In some instances, Linux does not conform to the IETF specifications. The fast recovery does not fully follow RFC 2582, since the threshold for triggering retransmit is adjusted dynamically and the congestion window's size is not changed. Also, the roundtrip-time estimator and the RTO calculation differ from RFC 2988, since it uses more conservative RTT estimates and a minimum RTO of 200ms. The performance measures showed that with the additional Linux-specific features enabled, slightly higher throughput, more steady data flow, and fewer unnecessary retransmissions can be achieved.

XTREME XCITEMENT

Summarized by Steve Bauer

THE FUTURE IS COMING: WHERE THE X WINDOW SHOULD GO

Jim Gettys, Compaq Computer Corp.

Jim Gettys, one of the principal authors of the X Window System, outlined the near-term objectives for the system, primarily focusing on the changes and infrastructure required to enable replication and migration of X applications. Providing better support for this functionality would enable users to retrieve or duplicate X applications between their servers at home and work.


One interesting example of an application that currently is capable of migration and replication is Emacs. To create a new frame on DISPLAY, try "M-x make-frame-on-display <Return> DISPLAY <Return>".

However, technical challenges make replication and migration difficult in general. These include the "major headaches" of server-side fonts, the nonuniformity of X servers and screen sizes, and the need to appropriately retrofit toolkits. Authentication and authorization issues are obviously also important. The rest of the talk delved into some of the details of these interesting technical challenges.

HACKING IN THE KERNEL

Summarized by Hai Huang

AN IMPLEMENTATION OF SCHEDULER ACTIVATIONS ON THE NETBSD OPERATING SYSTEM

Nathan J. Williams, Wasabi Systems

Scheduler activation is an old idea. Basically, there are benefits and drawbacks to using solely kernel-level or user-level threading. Scheduler activation is able to combine the two layers of control to provide more concurrency in the system.

In his talk, Nathan gave a fairly detailed description of the implementation of scheduler activations in the NetBSD kernel. One important change in the implementation is to differentiate the thread context from the process context. This is done by defining a separate data structure for these threads and relocating some of the information that was embedded in the process context to these thread contexts. Stack was especially a concern, due to the upcall; special handling must be done to make sure that the upcall doesn't mess up the stack, so that the preempted user-level thread can continue afterwards. Lastly, Nathan explained that signals were handled by upcalls.


AUTHORIZATION AND CHARGING IN PUBLIC WLANS USING FREEBSD AND 802.1X

Pekka Nikander, Ericsson Research NomadicLab

The 802.1x standards are well known in the wireless community as link-layer authentication protocols. In this talk, Pekka explained some novel ways of using the 802.1x protocols that might be of interest to people on the move. It is possible to set up a public WLAN that would support various charging schemes via virtual tokens, which people can purchase or earn and later use.

This is implemented on FreeBSD using the netgraph utility. It is basically a filter in the link layer that would differentiate traffic based on the MAC address of the client node, which is either authenticated, denied, or let through. The overhead for this service is fairly minimal.

ACPI IMPLEMENTATION ON FREEBSD

Takanori Watanabe, Kobe University

ACPI (Advanced Configuration and Power Interface) was proposed as a joint effort by Intel, Toshiba, and Microsoft to provide a standard and finer-control method of managing power states of individual devices within a system. Such low-level power management is especially important for those mobile and embedded systems that are powered by fixed-capacity energy batteries.

Takanori explained that the ACPI specification is composed of three parts: tables, BIOS, and registers. He was able to implement some functionalities of ACPI in a FreeBSD kernel. The ACPI Component Architecture was implemented by Intel, and it provides a high-level ACPI API to the operating system. Takanori's ACPI implementation is built upon this underlying layer of APIs.

ACCESS CONTROL

Summarized by Florian Buchholz

DESIGN AND PERFORMANCE OF THE OPENBSD STATEFUL PACKET FILTER (PF)

Daniel Hartmeier, Systor AG

Daniel Hartmeier described the new stateful packet filter (pf) that replaced IPFilter in the OpenBSD 3.0 release. IPFilter could no longer be included due to licensing issues, and thus there was a need to write a new filter making use of optimized data structures.

The filter rules are implemented as a linked list which is traversed from start to end. Two actions may be taken according to the rules, "pass" or "block": a "pass" action forwards the packet, and a "block" action will drop it. Where more than one rule matches a packet, the last rule wins. Rules that are marked as "final" will immediately terminate any further rule evaluation. An optimization called "skip-steps" was also implemented, where blocks of similar rules are skipped if they cannot match the current packet; these skip-steps are calculated when the rule set is loaded. Furthermore, a state table keeps track of TCP connections; only packets that match the sequence numbers are allowed. UDP and ICMP queries and replies are considered in the state table, where initial packets will create an entry for a pseudo-connection with a low initial timeout value. The state table is implemented using a balanced binary search tree. NAT mappings are also stored in the state table, whereas application proxies reside in user space. The packet filter also is able to perform fragment reassembly and to modulate TCP sequence numbers to protect hosts behind the firewall.
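A minimal sketch of last-match-wins evaluation with "final" rules (illustration only, not the pf source; the match fields are reduced here to protocol and destination port):

/* Last matching rule wins, unless a matching rule is marked "final",
 * which ends evaluation immediately. */
#include <stdbool.h>
#include <stdio.h>

enum action { PASS, BLOCK };

struct rule {
    enum action act;
    int         proto;      /* protocol number, -1 = any */
    int         dst_port;   /* destination port, -1 = any */
    bool        final;      /* stop evaluating further rules on match */
};

struct pkt { int proto; int dst_port; };

static bool rule_matches(const struct rule *r, const struct pkt *p) {
    return (r->proto == -1 || r->proto == p->proto) &&
           (r->dst_port == -1 || r->dst_port == p->dst_port);
}

static enum action evaluate(const struct rule *rules, int n, const struct pkt *p) {
    enum action last = PASS;            /* assumed default policy */
    for (int i = 0; i < n; i++) {
        if (!rule_matches(&rules[i], p))
            continue;
        last = rules[i].act;            /* last matching rule wins... */
        if (rules[i].final)
            break;                      /* ...unless it is marked final */
    }
    return last;
}

int main(void) {
    struct rule ruleset[] = {
        { BLOCK, -1, -1, false },       /* block everything by default */
        { PASS,   6, 80, false },       /* but pass TCP to port 80 */
        { BLOCK,  6, 23, true  },       /* and always block telnet */
    };
    struct pkt web = { 6, 80 }, telnet = { 6, 23 };
    printf("web: %s\n", evaluate(ruleset, 3, &web) == PASS ? "pass" : "block");
    printf("telnet: %s\n", evaluate(ruleset, 3, &telnet) == PASS ? "pass" : "block");
    return 0;
}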

Pf was compared against IPFilter as well as Linux's Iptables by measuring throughput and latency with increasing traffic rates and different packet sizes. In a test with a fixed rule set size of 100, Iptables outperformed the other two filters, whose results were close to each other. In a second test, where the rule set size was continually increased, Iptables consistently had about twice the throughput of the other two (which evaluate the rule set on both the incoming and outgoing interfaces). A third test compared only pf and IPFilter, using a single rule that created state in the state table, with a fixed-state entry size. Pf reached an overloaded state much later than IPFilter. The experiment was repeated with a variable-state entry size, and pf performed much better than IPFilter for a small number of states.

In general, rule set evaluation is expensive, and benchmarks only reflect extreme cases, whereas in real life other behavior should be observed. Furthermore, the benchmarks show that stateful filtering can actually improve performance, due to cheap state-table lookups as compared to rule evaluation.

ENHANCING NFS CROSS-ADMINISTRATIVE DOMAIN ACCESS

Joseph Spadavecchia and Erez Zadok, Stony Brook University

The speaker presented modifications to an NFS server that allow improved NFS access between administrative domains. A problem lies in the fact that NFS assumes a shared UID/GID space, which makes it unsafe to export files outside the administrative domain of the server. Design goals were to leave the protocol and existing clients unchanged, a minimum amount of server changes, flexibility, and increased performance.

To solve the problem, two techniques are utilized: "range-mapping" and "file-cloaking." Range-mapping maps IDs between client and server. The mapping is performed on a per-export basis and has to be manually set up in an export file. The mappings can be 1-1, N-N, or N-1. In file-cloaking, the server restricts file access based on UID/GID and special cloaking-mask bits. Here users can only access their own file permissions. The policy on whether or not the file is visible to others is set by the cloaking mask, which is logically ANDed with the file's protection bits. File-cloaking only works, however, if the client doesn't hold cached copies of directory contents and file attributes. Because of this, the clients are forced to re-read directories by incrementing the mtime value of the directory each time it is listed.
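A sketch of the two ideas (illustrative code, not the modified server): a per-export UID range map, and a visibility test that ANDs a cloaking mask with the file's permission bits:

/* Hypothetical helpers: a 1-1 UID range mapping for one export, and a
 * cloaking check for non-owners. */
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

struct range_map { unsigned c_base, s_base, count; };

static bool map_uid(const struct range_map *m, unsigned client_uid,
                    unsigned *server_uid) {
    if (client_uid < m->c_base || client_uid >= m->c_base + m->count)
        return false;                       /* not covered by this export */
    *server_uid = m->s_base + (client_uid - m->c_base);
    return true;
}

/* A file is visible to a non-owner only where the cloaking mask leaves
 * its "other" permission bits in place. */
static bool cloaked_visible(mode_t file_mode, mode_t cloak_mask,
                            unsigned req_uid, unsigned owner_uid) {
    if (req_uid == owner_uid)
        return true;                        /* owners always see their files */
    return (file_mode & cloak_mask & (S_IROTH | S_IXOTH)) != 0;
}

int main(void) {
    struct range_map m = { 1000, 20000, 500 };  /* hypothetical export entry */
    unsigned suid;
    if (map_uid(&m, 1042, &suid))
        printf("client uid 1042 -> server uid %u\n", suid);
    printf("world-readable file visible: %d\n",
           cloaked_visible(0644, 0007, 1043, 1042));
    printf("private file visible: %d\n",
           cloaked_visible(0600, 0007, 1043, 1042));
    return 0;
}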

To test the performance of the modified server, five different NFS configurations were evaluated: an unmodified NFS server was compared against one server with the modified code included but not used, one with only range-mapping enabled, one with only file-cloaking enabled, and one version with all modifications enabled. For each setup, different file system benchmarks were run. The results show only a small overhead when the modifications are used, generally an increase of below 5%. Another experiment tested the performance of the system with an increasing number of mapped or cloaked entries on a system. The results show that an increase from 10 to 1000 entries resulted in a maximum of about 14% cost in performance.

One member of the audience pointed out that if clients choose to ignore the changed mtimes from the server and thus still hold caches of the directory entries, the file-cloaking mechanism could be defeated. After a rather lengthy debate, the speaker had to concede that the model doesn't add any extra security. Another question was asked about scalability of the setup of range-mapping. The speaker referred to application-level tools that could be developed for that purpose.

The software is available at ftp://ftp.fsl.cs.sunysb.edu/pub/enf.

ENGINEERING OPEN SOURCE SOFTWARE

Summarized by Teri Lampoudi

NINGAUI: A LINUX CLUSTER FOR BUSINESS

Andrew Hume, AT&T Labs – Research; Scott Daniels, Electronic Data Systems Corp.

Ningaui is a general-purpose, highly available, resilient architecture built from commodity software and hardware. Emphatically, however, Ningaui is not a Beowulf. Hume calls the cluster design the "Swiss canton model," in which there are a number of loosely affiliated independent nodes with data replicated among them. Jobs are assigned by bidding and leases, and cluster services done as session-based servers are sited via generic job assignment. The emphasis is on keeping the architecture end-to-end, checking all work via checksums, and logging everything. The resilience and high availability required by their goal of 8-5 maintenance – vs. the typical 24-7 model where people get paged whenever the slightest thing goes wrong regardless of the time of day – is achieved by job restartability. Finally, all computation is performed on local data, without the use of NFS or network-attached storage.

Hume's message is a hopeful one: despite the many problems encountered – things like kernel and service limits, auto-installation problems, TCP storms, and the like – the existence of source code and the paranoid practice of logging and checksumming everything has helped. The final product performs reasonably well, and it appears that the resilient techniques employed do make a difference. One drawback, however, is that the software mentioned in the paper is not yet available for download.

CPCMS: A CONFIGURATION MANAGEMENT SYSTEM BASED ON CRYPTOGRAPHIC NAMES

Jonathan S. Shapiro and John Vanderburgh, Johns Hopkins University

This paper won the FREENIX Best Paper award.

The basic notion behind the project is the fact that everyone has a pet complaint about CVS, and yet it is currently the configuration manager in most widespread use. Shapiro has unleashed an alternative. Interestingly, he did not begin but ended with the usual host of reasons why CVS is bad. The talk instead began abruptly by characterizing the job of a software configuration manager, continued by stating the namespaces which it must handle, and wrapped up with the challenges faced.

X MEETS Z: VERIFYING CORRECTNESS IN THE PRESENCE OF POSIX THREADS

Bart Massey, Portland State University; Robert T. Bauer, Rational Software Corp.

Massey delivered a humorous talk on the insight gained from applying Z formal specification notation to system software design, rather than the more informal analysis and design process normally used.

The story is told with respect to writing XCB, which replaces the Xlib protocol layer and is supposed to be thread friendly. But where threads are concerned, deadlock avoidance becomes a hard problem that cannot be solved in an ad-hoc manner. But full model checking is also too hard. In this case, Massey resorted to Z specification to model the XCB lower layer, abstract away locking and data transfers, and locate fundamental issues. Essentially, the difficulties of searching the literature and locating information relevant to the problem at hand were overcome. As Massey put it, "the formal method saved the day."

FILE SYSTEMS

Summarized by Bosko Milekic

PLANNED EXTENSIONS TO THE LINUX EXT2/EXT3 FILESYSTEM

Theodore Y. Ts'o, IBM; Stephen Tweedie, Red Hat

The speaker presented improvements to the Linux Ext2 file system, with the goal of allowing for various expansions while striving to maintain compatibility with older code. Improvements have been facilitated by a few extra superblock fields that were added to Ext2 not long before the Linux 2.0 kernel was released. The fields allow for file system features to be added without compromising the existing setup; this is done by providing bits indicating the impact of the added features with respect to compatibility. Namely, a file system with the "incompat" bit marked is not allowed to be mounted. Similarly, a "read-only" marking would only allow the file system to be mounted read-only.
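The mount-time logic implied by those bits can be sketched as follows (field and flag names here are illustrative, not the exact Ext2/Ext3 definitions):

/* Sketch of feature-compatibility checking at mount time. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define FEATURE_INCOMPAT_SUPPORTED   0x0001u   /* features this kernel knows */
#define FEATURE_RO_COMPAT_SUPPORTED  0x0003u

struct superblock {
    uint32_t feature_compat;     /* safe to ignore entirely */
    uint32_t feature_incompat;   /* unknown bit here: refuse to mount */
    uint32_t feature_ro_compat;  /* unknown bit here: mount read-only */
};

static int try_mount(const struct superblock *sb, bool readonly) {
    if (sb->feature_incompat & ~FEATURE_INCOMPAT_SUPPORTED) {
        printf("unknown incompatible feature: refusing to mount\n");
        return -1;
    }
    if (!readonly && (sb->feature_ro_compat & ~FEATURE_RO_COMPAT_SUPPORTED)) {
        printf("unknown ro-compat feature: mounting read-only instead\n");
        readonly = true;
    }
    printf("mounted %s\n", readonly ? "read-only" : "read-write");
    return 0;
}

int main(void) {
    struct superblock ok  = { 0x10, 0x0001, 0x0002 };
    struct superblock bad = { 0x10, 0x0800, 0x0002 };  /* unknown incompat bit */
    try_mount(&ok, false);
    try_mount(&bad, false);
    return 0;
}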

Directory indexing replaces linear directory searches with a faster search using a fixed-depth tree and hashed keys. File system size can be dynamically increased, and the expanded inode, doubled from 128 to 256 bytes, allows for more extensions.

Other potential improvements were discussed as well, in particular pre-allocation for contiguous files, which allows for better performance in certain setups by pre-allocating contiguous blocks. Security-related modifications, extended attributes, and ACLs were mentioned. An implementation of these features already exists but has not yet been merged into the mainline Ext2/3 code.

RECENT FILESYSTEM OPTIMISATIONS ON FREEBSD

Ian Dowse, Corvil Networks; David Malone, CNRI, Dublin Institute of Technology

David Malone presented four important file system optimizations for the FreeBSD OS: soft updates, dirpref, vmiodir, and dirhash. It turns out that certain combinations of the optimizations (beautifully illustrated in the paper) may yield performance improvements of anywhere between 2 and 10 orders of magnitude for real-world applications.

All four of the optimizations deal with file system metadata. Soft updates allow for asynchronous metadata updates. Dirpref changes the way directories are organized, attempting to place child directories closer to their parents, thereby increasing locality of reference and reducing disk-seek times. Vmiodir trades some extra memory in order to achieve better directory caching. Finally, dirhash, which was implemented by Ian Dowse, changes the way in which entries are searched in a directory. Specifically, FFS uses a linear search to find an entry by name; dirhash builds a hashtable of directory entries on the fly. For a directory of size n with a working set of m files, a search that in certain cases could have been O(n·m) has been reduced, due to dirhash, to effectively O(n + m).
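A toy version of the dirhash idea: build a hash table over the directory's entries once, then answer name lookups from it instead of rescanning the entry list every time (illustration only, not the FreeBSD code):

/* Build a hash table over directory entries on the fly, then look up by name. */
#include <stdio.h>
#include <string.h>

#define NBUCKET 64

struct dirent_s { const char *name; int inode; struct dirent_s *next; };

static struct dirent_s *bucket[NBUCKET];

static unsigned hash_name(const char *s) {
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKET;
}

static void dirhash_build(struct dirent_s *entries, int n) {
    for (int i = 0; i < n; i++) {
        unsigned h = hash_name(entries[i].name);
        entries[i].next = bucket[h];
        bucket[h] = &entries[i];
    }
}

static int dirhash_lookup(const char *name) {
    for (struct dirent_s *e = bucket[hash_name(name)]; e; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e->inode;
    return -1;                               /* not found */
}

int main(void) {
    struct dirent_s entries[] = {
        { "Makefile", 101, NULL }, { "main.c", 102, NULL }, { "README", 103, NULL }
    };
    dirhash_build(entries, 3);
    printf("main.c -> inode %d\n", dirhash_lookup("main.c"));
    printf("missing -> %d\n", dirhash_lookup("missing"));
    return 0;
}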

FILESYSTEM PERFORMANCE AND SCALABILITY IN LINUX 2.4.17

Ray Bryant, SGI; Ruth Forester, IBM LTC; John Hawkes, SGI

This talk focused on performance evaluation of a number of file systems available and commonly deployed on Linux machines. Comparisons under various configurations of Ext2, Ext3, ReiserFS, XFS, and JFS were presented.

The benchmarks chosen for the data gathering were pgmeter, filemark, and AIM Benchmark Suite VII. Pgmeter measures the rate of data transfer of reads/writes of a file under a synthetic workload. Filemark is similar to postmark in that it is an operation-intensive benchmark, although filemark is threaded and offers various other features that postmark lacks. AIM VII measures performance for various file-system-related functionalities; it offers various metrics under an imposed workload, thus stressing the performance of the file system not only under I/O load but also under significant CPU load.

Tests were run on three different setups: a small, a medium, and a large configuration. ReiserFS and Ext2 appear at the top of the pile for smaller and medium setups. Notably, XFS and JFS perform worse for smaller system configurations than the others, although XFS clearly appears to generate better numbers than JFS. It should be noted that XFS seems to scale well under a higher load. This was most evident in the large-system results, where XFS appears to offer the best overall results.

THINGS TO THINK ABOUT

Summarized by Bosko Milekic

SPEEDING UP KERNEL SCHEDULER BY REDUCING CACHE MISSES

Shuji Yamamura, Akira Hirai, Mitsuru Sato, Masao Yamamoto, Akira Naruse, Kouichi Kumon, Fujitsu Labs

This was an interesting talk pertaining to the effects of cache coloring for task structures in the Linux kernel scheduler (Linux kernel 2.4.x). The speaker first presented some interesting benchmark numbers for the Linux scheduler, showing that as the number of processes on the task queue was increased, the performance decreased. The authors used some really nifty hardware to measure the number of bus transactions throughout their tests and were thus able to reasonably quantify the impact that cache misses had in the Linux scheduler.

Their experiments led them to implement a cache coloring scheme for task structures, which were previously aligned on 8KB boundaries and therefore were eventually being mapped to the same cache lines. This unfortunate placement of task structures in memory induced a significant number of cache misses as the number of tasks grew in the scheduler.

The implemented solution consisted of aligning task structures to more evenly distribute cached entries across the L2 cache. The result was inevitably fewer cache misses in the scheduler. Some negative effects were observed in certain situations; these were primarily due to more cache slots being used by task-structure data in the scheduler, thus forcing data previously cached there to be pushed out.
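A minimal sketch of the underlying idea, assuming a task structure that is normally allocated on a fixed power-of-two boundary: adding a per-task "color" offset spreads the hot fields of successive tasks across different cache sets instead of letting them all collide. This is an illustration of the general technique, not the patch described in the talk.

#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

#define STACK_ALIGN (8 * 1024)  /* tasks normally start on 8KB boundaries */
#define CACHE_LINE  64
#define NCOLORS     16          /* number of distinct offsets to rotate through */

/* Without coloring, (addr % 8KB) is identical for every task, so the
 * frequently touched fields of all tasks map onto the same L2 sets.
 * With coloring, task i is displaced by (i % NCOLORS) cache lines.   */
void *alloc_task_struct(size_t size, unsigned task_seq)
{
    size_t color = (task_seq % NCOLORS) * CACHE_LINE;
    void  *base;

    if (posix_memalign(&base, STACK_ALIGN, size + NCOLORS * CACHE_LINE) != 0)
        return NULL;
    return (char *)base + color;  /* caller must keep base around to free it */
}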


WORK-IN-PROGRESS REPORTS

Summarized by Brennan Reynolds

RESOURCE VIRTUALIZATION TECHNIQUES FOR WIDE-AREA OVERLAY NETWORKS

Kartik Gopalan, Stony Brook University

This work addressed the issue of provisioning a maximum number of virtual overlay networks (VONs) with diverse quality of service (QoS) requirements on a single physical network. Each of the VONs is logically isolated from the others to ensure the QoS requirements. Gopalan mentioned several critical research issues with this problem that are currently being investigated. Dealing with how to provision the network at various levels (link, route, or path) and then enforce the provisioning at run-time is one of the toughest challenges. Currently, Gopalan has developed several algorithms to handle admission control, end-to-end QoS, route selection, scheduling, and fault tolerance in the network.

For more information, visit http://www.ecsl.cs.sunysb.edu/.

VISUALIZING SOFTWARE INSTABILITY

Jennifer Bevan, University of California, Santa Cruz

Detection of instability in software has typically been an afterthought. The point in the development cycle when the software is reviewed for instability is usually after it is difficult and costly to go back and perform major modifications to the code base. Bevan has developed a technique to allow the visualization of unstable regions of code that can be used much earlier in the development cycle. Her technique creates a time series of dependency graphs that include clusters and lines called fault lines. From the graphs, a developer is able to easily determine where the unstable sections of code are and proactively restructure them. She is currently working on a prototype implementation.

For more information, visit http://www.cse.ucsc.edu/~jbevan/evo_viz/.

RELIABLE AND SCALABLE PEER-TO-PEER WEB DOCUMENT SHARING

Li Xiao, College of William and Mary

The idea presented by Xiao would allow end users to share the contents of their Web browser caches with neighboring Internet users. The rationale for this is that today's end users are increasingly connected to the Internet over high-speed links, and the browser caches are becoming large storage systems. Therefore, if an individual accesses a page that does not exist in their local cache, Xiao is suggesting that they first query other end users for the content before trying to access the machine hosting the original. This strategy does have some serious problems associated with it that still need to be addressed, including ensuring the integrity of the content and protecting the identity and privacy of the end users.

SEREL: FAST BOOTING FOR UNIX

Leni Mayo Fast Boot Software

Serel is a tool that generates a visual representation of a UNIX machine's boot-up sequence. It can be used to identify the critical path and can show if a particular service or process blocks for an extended period of time. This information could be used to determine where the largest performance gains could be realized by tuning the order of execution at boot-up. Serel creates a dependency graph, expressed in XML, during boot-up. This graph is then used to create the visual representation. Currently the tool only works on POSIX-compliant systems, but Mayo stated that he would be porting it to other platforms. Other extensions that were mentioned included having the metadata include the use of shared libraries and monitoring the suspend/resume sequence of portable machines.

For more information, visit http://www.fastboot.org/.


BERKELEY DB XML

John Merrells Sleepycat Software

Merrells gave a quick introduction and overview of the new XML library for Berkeley DB. The library specializes in storage and retrieval of XML content through a tool called XPath. The library allows for multiple containers per document and stores everything natively as XML. The user is also given a wide range of elements to create indices with, including edges, elements, text strings, or presence. The XPath tool consists of a query parser, generator, optimizer, and execution engine. To conclude his presentation, Merrells gave a live demonstration of the software.

For more information, visit http://www.sleepycat.com/xml/.

CLUSTER-ON-DEMAND (COD)

Justin Moore Duke University

Modern clusters are growing at a rapid rate. Many have pushed beyond the 5000-machine mark, and deploying them results in large expenses as well as management and provisioning issues. Furthermore, if the cluster is "rented" out to various users, it is very time-consuming to configure it to a user's specs, regardless of how long they need to use it. The COD work presented creates dynamic virtual clusters within a given physical cluster. The goal was to have a provisioning tool that would automatically select a chunk of available nodes and install the operating system and middleware specified by the customer in a short period of time. This would allow a greater use of resources, since multiple virtual clusters can exist at once. By using a virtual cluster, the size can be changed dynamically. Moore stated that they have created a working prototype and are currently testing and benchmarking its performance.

For more information, visit http://www.cs.duke.edu/~justin/cod/.


CATACOMB

Elias Sinderson, University of California, Santa Cruz

Catacomb is a project to develop a database-backed DASL module for the Apache Web server. It was designed as a replacement for the WebDAV module. The initial release of the module only contains support for a MySQL database but could be extended to others. Sinderson briefly touched on the performance of her module: it was comparable to the mod_dav Apache module for all query types but search. The presentation was concluded with remarks about adding support for the lock method and including ACL specifications in the future.

For more information, visit http://ocean.cse.ucsc.edu/catacomb/.

SELF-ORGANIZING STORAGE

Dan Ellard Harvard University

Ellard's presentation introduced a storage system that tuned itself based on the workload of the system without requiring the intervention of the user. The intelligence was implemented as a virtual self-organizing disk that resides below any file system. The virtual disk observes the access patterns exhibited by the system and then attempts to predict what information will be accessed next. An experiment was done using an NFS trace at a large ISP on one of their email servers. Ellard's self-organizing storage system worked well, which he attributes to the fact that most of the files being requested were large email boxes. Areas of future work include exploration of the length and detail of the data collection stage as well as the CPU impact of running the virtual disk layer.

VISUAL DEBUGGING

John Costigan Virginia Tech

Costigan feels that the state of current debugging facilities in the UNIX world is not as good as it should be. He proposes the addition of several elements to programs like ddd and gdb. The first addition is including separate heap and stack components. Another is to display only certain fields of a data structure and have the ability to zoom in and out if needed. Finally, the debugger should provide the ability to visually present complex data structures, including linked lists, trees, etc. He said that a beta version of a debugger with these abilities is currently available.

For more information, visit http://infovis.cs.vt.edu/datastruct/.

IMPROVING APPLICATION PERFORMANCE THROUGH SYSTEM-CALL COMPOSITION

Amit Purohit, Stony Brook University

Web servers perform a huge number of context switches and internal data copies during normal operation. These two elements can drastically limit the performance of an application, regardless of the hardware platform it is run on. The Compound System Call (CoSy) framework is an attempt to reduce the performance penalty for context switches and data copies. It includes a user-level library and several kernel facilities that can be used via system calls. The library provides the programmer with a complete set of memory-management functions. Performance tests using the CoSy framework showed a large saving for context switches and data copies. The only area where the savings between conventional libraries and CoSy were negligible was for very large file copies.

For more information, visit http://www.cs.sunysb.edu/~purohit/.

ELASTIC QUOTAS

John Oscar Columbia University

Oscar began by stating that most resources are flexible, but that to date disks have not been. While approximately 80 percent of files are short-lived, disk quotas are hard limits imposed by administrators. The idea behind elastic quotas is to have a non-permanent storage area that each user can temporarily use. Oscar suggested the creation of an ehome directory structure to be used in deploying elastic quotas. Currently he has implemented elastic quotas as a stackable file system.

There were also several trends apparent in the data. Many users would ssh to their remote host (good) but then launch an email reader on the local machine, which would connect via POP to the same host and send the password in clear text (bad). This situation can easily be remedied by tunneling protocols like POP through SSH, but it appeared that many people were not aware this could be done. While most of his comments on the use of the wireless network were negative, the list of passwords he had collected showed that people were indeed using strong passwords. His recommendations were to educate and encourage people to use protocols like IPSec, SSH, and SSL when conducting work over a wireless network, because you never know who else is listening.

SYSTRACE

Niels Provos University of Michigan

How can one be sure the applications one is using actually do exactly what their developers said they do? Short answer: you can't. People today are using a large number of complex applications, which means it is impossible to check each application thoroughly for security vulnerabilities. There are tools out there that can help, though. Provos has developed a tool called systrace that allows a user to generate a policy of acceptable system calls a particular application can make. If the application attempts to make a call that is not defined in the policy, the user is notified and allowed to choose an action. Systrace includes an auto-policy generation mechanism, which uses a training phase to record the actions of all programs the user executes. If the user chooses not to be bothered by applications breaking policy, systrace allows default enforcement actions to be set. Currently systrace is implemented for FreeBSD and NetBSD, with a Linux version coming out shortly.



For more information, visit http://www.citi.umich.edu/u/provos/systrace/.

VERIFIABLE SECRET REDISTRIBUTION FOR SURVIVABLE STORAGE SYSTEMS

Ted Wong Carnegie Mellon University

Wong presented a protocol that can be used to re-create a file distributed over a set of servers even if one of the servers is damaged. In his scheme, the user must choose how many shares the file is split into and the number of servers it will be stored across. The goal of this work is to provide persistent, secure storage of information even if it comes under attack. Wong stated that his design of the protocol was complete, and he is currently building a prototype implementation of it.

For more information, visit http://www.cs.cmu.edu/~tmwong/research/.

THE GURU IS IN SESSIONS

LINUX ON LAPTOP/PDA

Bdale Garbee HP Linux Systems Operation

Summarized by Brennan Reynolds

This was a roundtable-like discussion with Garbee acting as moderator. He is currently in charge of the Debian distribution of Linux and is involved in porting Linux to platforms other than i386. He has successfully helped port the kernel to the Alpha, Sparc, and ARM and is currently working on a version to run on VAX machines.

A discussion was held on which file system is best for battery life. The Reiser and Ext3 file systems were discussed. People commented that when using Ext3 on their laptops the disk would never spin down and thus was always consuming a large amount of power. One suggestion was to use a RAM-based file system for any volume that requires a large number of writes and to only have the file system written to disk at infrequent intervals or when the machine is shut down or suspended.

A question was directed to Garbee about his knowledge of how the HP-Compaq merger would affect Linux. Garbee thought that the merger was good for Linux, both within the company and for the open source community as a whole. He stated that the new company would be the number one shipper of systems with Linux as the base operating system. He also fielded a question about the support of older HP servers and their ability to run Linux, saying that indeed Linux has been ported to them and he personally had several machines in his basement running it.

A question which sparked a large number of responses concerned problems people had with running Linux on mobile platforms. Most people actually did not have many problems at all. There were a few cases of a particular machine model not working, but there were only two widespread issues: docking stations and the new ACPI power management scheme. Docking stations still appear to be somewhat of a headache, but most people had developed ad-hoc solutions for getting them to work. The ACPI power management scheme developed by Intel does not appear to have a quick solution. Ted Ts'o, one of the head kernel developers, stated that there are fundamental architectural problems with the 2.4 series kernel that do not easily allow ACPI to be added. However, he also stated that ACPI is supported in the newest development kernel, 2.5, and will be included in 2.6.

The final topic was the possibility of purchasing a laptop without paying any charge/tax for Microsoft Windows. The consensus was that currently this is not possible. Even if the machine comes with Linux or without any operating system, the vendors have contracts in place which require them to pay Microsoft for each machine they ship if that machine is capable of running a Microsoft operating system. The only way this will change is if the demand for other operating systems increases to the point that vendors renegotiate their contracts with Microsoft, and no one saw this happening in the near future.

LARGE CLUSTERS

Andrew Hume, AT&T Labs – Research

Summarized by Teri Lampoudi

This was a session on clusters, big data, and resilient computing. Given that the audience was primarily interested in clusters, and not necessarily big data or beating nodes to a pulp, the conversation revolved mainly around what it takes to make a heterogeneous cluster resilient. Hume also presented a FREENIX paper on his current cluster project, which explained in more detail much of what was abstractly claimed in the guru session.

Hume's use of the word "cluster" referred not to what I had assumed would be a Beowulf-type system, but to a loosely coupled farm of machines in which concurrency was much less of an issue than it would be in a Beowulf. In fact, the architecture Hume described was designed to deal with large amounts of independent transaction processing, essentially the process of billing calls, which requires no interprocess message passing or anything of the sort a parallel scientific application might.

Where does the large data come in? Since the transactions in question consist of large numbers of flat files, a mechanism for getting the files onto nodes and off-loading the results is necessary. In this particular cluster, named Ningaui, this task is handled by the "replication manager," which generated a large amount of interest in the audience. All management functions in the cluster, as well as job allocation, constitute services that are to be bid for and leased out to nodes. Furthermore, failures are handled by simply repeating the failed task until it is completed properly, if that is possible, and, if it is not, then looking through detailed logs to discover what went wrong. Logging and testing for the successful completion of each task are what make this model resilient. Every management operation is logged at some appropriate level of detail, generating 120MB/day, and every single file transfer is md5 checksummed. This was characterized by Hume as a "patient" style of management where no one controls the cluster: scheduling behavior is emergent rather than stringently planned, and nodes are free to drop in and out of the cluster; a downside to this model is the increase in latency, offset by the fact that the cluster continues to function unattended outside of 8-to-5 support hours.

In response to questions about the projected scalability of such a scheme, Hume said the system would presumably scale from the present 60 nodes to about 100 without modifications, but that scaling higher would be a matter for further design. The guru session ended on a somewhat unrelated note regarding the problems of getting data off mainframes and onto Linux machines – issues of fixed- vs. variable-length encoding of data blocks resulting in useful information being stripped by FTP, and problems in converting COBOL copybooks to something useful on a C-based architecture. To reiterate Hume's claim, there are homemade solutions for some of these problems, and the reader should feel free to contact Mr. Hume for them.

WRITING PORTABLE APPLICATIONS

Nick Stoughton MSB Consultants

Summarized by Josh Lothian

Nick Stoughton addressed a concern that developers are facing more and more: writing portable applications. He proposed that there is no such thing as a "portable application," only those that have been ported. Developers are still in search of the Holy Grail of applications programming: a write-once, compile-anywhere program. Languages such as Java are leading up to this, but virtual machines still have to be ported, paths will still be different, etc. Web-based applications are another area that could be promising for portable applications.

Nick went on to talk about the future of portability. The recently released POSIX 2001 standards are expected to help the situation. The 2001 revision expands the original POSIX spec into seven volumes. Even though POSIX 2001 is more specific than the previous release, Nick pointed out that even this standard allows for small differences where vendors may define their own behaviors. This can only hurt developers in the long run. Along these lines, it was pointed out that there may be room for automated tools that have the capability to check application code and determine the level of portability and standards compliance of the source.

NETWORK MANAGEMENT SYSTEM PERFORMANCE TUNING

Jeff Allen Tellme Networks Inc

Summarized by Matt Selsky

Jeff Allen, author of Cricket, discussed network tuning. The basic idea of tuning is: measure, twiddle, and measure again. Cricket can be used for large installations, but it's overkill for smaller installations, for which MRTG is better suited. You don't need Cricket to do measurement. Doing something like a shell script that dumps data to Excel is good, but you need to get some sort of measurement. Cricket was not designed for billing; it was meant to help answer questions.

Useful tools and techniques covered included: looking for the difference between machines in order to identify the causes for the differences; measuring, but also thinking about what you're measuring and why; and remembering to use strace and tcpdump to identify problems. Problems repeat themselves; performance tuning requires an intricate understanding of all the layers involved to solve the problem and simplify the automation. (Having a computer science background helps, but is not essential.) If you can reproduce the problem, you then should investigate each piece to find the bottleneck. If you can't reproduce the problem, then measure. You need to understand all the system interactions. Measurement can help determine whether things are actually slow or if the user is imagining it.

Troubleshooting begins with a hunch, but scientific processes are essential. You should be able to determine the event stream, or when each event occurs; having observation logs can help. Some hints provided included checking cron for periodic anomalies, slowing down /bin/rm to avoid I/O overload, and looking for unusual kernel CPU usage.

Also, try to use lightweight monitoring to reduce overhead. You don't need to monitor every resource on every system, but you should monitor those resources on those systems that are essential. Don't check things too often, since you can introduce overhead that way. Lots of information can be gathered from the /proc file system interface to the kernel.

BSD BOF

Host Kirk McKusick Author and Consultant

Summarized by Bosko Milekic

The BSD Birds of a Feather session at USENIX started off with five quick and informative talks and ended with an exciting discussion of licensing issues.

Christos Zoulas, for the NetBSD Project, went over the primary goals of the NetBSD Project (namely portability, a clean architecture, and security) and then proceeded to briefly discuss some of the challenges that the project has encountered. The successes and areas for improvement of 1.6 were then examined. NetBSD has recently seen various improvements in its cross-build system, packaging system, and third-party tool support. For what concerns the kernel, a zero-copy TCP/UDP implementation has been integrated, and a new pmap code for arm32 has been introduced, as have some other rather major changes to various subsystems. More information is available in Christos's slides, which are now available at http://www.netbsd.org/gallery/events/usenix2002/.

Theo de Raadt of the OpenBSD Project presented a rundown of various new things that the OpenBSD and OpenSSH teams have been looking at. Theo's talk included an amusing description (and pictures) of some of the events that transpired during OpenBSD's hack-a-thon, which occurred the week before USENIX '02. Finally, Theo mentioned that, following complications with the IPFilter license, the OpenBSD team had done a fairly extensive license audit.

Robert Watson of the FreeBSD Project brought up various examples of FreeBSD being used in the real world. He went on to describe a wide variety of changes that will surface with the upcoming FreeBSD release 5.0. Notably, 5.0 will introduce a re-architecting of the kernel aimed at providing much more scalable support for SMP machines. 5.0 will also include an early implementation of KSE, FreeBSD's version of scheduler activations, large improvements to pccard, and significant framework bits from the TrustedBSD project.

Mike Karels from Wind River Systems presented an overview of how BSD/OS has been evolving following the Wind River Systems acquisition of BSDi's software assets. Ernie Prabhakar from Apple discussed Apple's OS X and its success on the desktop. He explained how OS X aims to bring BSD to the desktop while striving to maintain a strong and stable core, one that is heavily based on BSD technology.

The BoF finished with a discussion on licensing; specifically, some folks questioned the impact of Apple's licensing for code from their core OS (the Darwin Project) on the BSD community as a whole.

CLOSING SESSION

HOW FLIES FLY

Michael H. Dickinson, University of California, Berkeley

Summarized by JD Welch

In this lively and offbeat talk, Dickinson discussed his research into the mechanisms of flight in the fruit fly (Drosophila melanogaster) and related the autonomic behavior to that of a technological system. Insects are the most successful organisms on earth, due in no small part to their early adoption of flight. Flies travel through space on straight trajectories interrupted by saccades, jumpy turning motions somewhat analogous to the fast, jumpy movements of the human eye. Using a variety of monitoring techniques, including high-speed video in a "virtual reality flight arena," Dickinson and his colleagues have observed that flies respond to visual cues to decide when to saccade during flight. For example, more visual texture makes flies saccade earlier (i.e., further away from the arena wall), described by Dickinson as a "collision control algorithm."

Through a combination of sensory inputs, flies can make decisions about their flight path. Flies possess a visual "expansion detector," which, at a certain threshold, causes the animal to turn in a certain direction. However, expansion in front of the fly sometimes causes it to land. How does the fly decide? Using the virtual reality device to replicate various visual scenarios, Dickinson observed that the flies fixate on vertical objects that move back and forth across the field. Expansion on the left side of the animal causes it to turn right, and vice versa, while expansion directly in front of the animal triggers the legs to flare out in preparation for landing.

If the eyes detect these changes, how are the responses implemented? The flies' wings have "power muscles," controlled by mechanical resonance (as opposed to the nervous system), which drive the wings, combined with neurally activated "steering muscles," which change the configuration of the wing joints. Subtle variations in the timing of impulses correspond to changes in wing movement, controlled by sensors in the "wing-pit." A small wing-like structure, the haltere, controls equilibrium by beating constantly.

Voluntary control is accomplished by the halteres, whose signals can interfere with the autonomic control of the stroke cycle. The halteres have "steering muscles" as well, and information derived from the visual system can turn off the haltere or trick it into a "virtual" problem requiring a response.

Dickinson has also studied the aerodynamics of insect flight using a device called the "Robo Fly," an oversized mechanical insect wing suspended in a large tank of mineral oil. Interestingly, Dickinson observed that the average lift generated by the flies' wings is under its body weight; the flies use three mechanisms to overcome this, including rotating the wings (rotational lift).

Insects are extraordinarily robust creatures, and because Dickinson analyzed the problem in a systems-oriented way, these observations and analysis are immediately applicable to technology; the response system can be used as an efficient search algorithm for control systems in autonomous vehicles, for example.

THE AFS WORKSHOP

Summarized by Garry Zacheiss MIT

Love Hornquist-Astrand of the Arla development team presented the Arla status report. Arla 0.35.8 has been released. Scheduled for release soon are improved support for Tru64 UNIX, MacOS X, and FreeBSD, improved volume handling, and implementation of more of the vos/pts subcommands. It was stressed that MacOS X is considered an important platform and that a GUI configuration manager for the cache manager, a GUI ACL manager for the Finder, and a graphical login that obtains Kerberos tickets and AFS tokens at login time were all under development.

Future goals planned for the underlying AFS protocols include GSSAPI/SPNEGO support for Rx, performance improvements to Rx, an enhanced disconnected mode, and IPv6 support for Rx; an experimental patch is already available for the latter. Future Arla-specific goals include improved performance, partial file reading, and increased stability for several platforms. Work is also in progress for the RXGSS protocol extensions (integrating GSSAPI into Rx). A partial implementation exists, and work continues as developers find time.

Derrick Brashear of the OpenAFS development team presented the OpenAFS status report. OpenAFS 1.2.5 was released immediately prior to the conference; it fixed a remotely exploitable denial-of-service attack in several OpenAFS platforms, most notably IRIX and AIX. Future work planned for OpenAFS includes better support for MacOS X, including working around Finder interaction issues. Better support for the BSDs is also planned; FreeBSD has a partially implemented client, while NetBSD and OpenBSD have only server programs available right now. AIX 5, Tru64 5.1A, and MacOS X 10.2 (aka Jaguar) are all planned for the future. Other planned enhancements include nested groups in the ptserver (code donated by the University of Michigan awaits integration), disconnected AFS, and further work on a native Windows client. Derrick stated that the guiding principles of OpenAFS were to maintain compatibility with IBM AFS, support new platforms, ease the administrative burden, and add new functionality.

AFS performance was discussed. The openafs.org cell consists of two file servers, one in Stockholm, Sweden, and one in Pittsburgh, PA. AFS works reasonably well over trans-Atlantic links, although OpenAFS clients don't determine which file server to talk to very effectively. Arla clients use RTTs to the server to determine the optimal file server to fetch replicated data from. Modifications to OpenAFS to support this behavior in the future are desired.

Jimmy Engelbrecht and Harald Barth of KTH discussed their AFSCrawler script, written to determine how many AFS cells and clients were in the world, what implementations/versions they were (Arla vs. IBM AFS vs. OpenAFS), and how much data was in AFS. The script unfortunately triggered a bug in IBM AFS 3.6-derived code, causing some clients to panic while handling a specific RPC. This has since been fixed in OpenAFS 1.2.5 and the most recent IBM AFS patch level, and all AFS users are strongly encouraged to upgrade. No release of Arla is vulnerable to this particular denial-of-service attack. There was an extended discussion of the usefulness of this exploration. Many sites believed this was useful information and such scanning should continue in the future, but only on an opt-in basis.

Many sites face the problem of managing Kerberos/AFS credentials for batch-scheduled jobs. Specifically, most batch processing software needs to be modified to forward tickets as part of the batch submission process, renew tickets and tokens while the job is in the queue and for the lifetime of the job, and properly destroy credentials when the job completes. Ken Hornstein of NRL was able to pay a commercial vendor to support Kerberos 4/5 credential management in their product, although they did not implement AFS token management. MIT has implemented some of the desired functionality in OpenPBS and might be able to make it available to other interested sites.

Tools to simplify AFS administration were discussed, including:

AFS Balancer: A tool written by CMU to automate the process of balancing disk usage across all servers in a cell. Available from ftp://ftp.andrew.cmu.edu/pub/AFS-Tools/balance-1.1b.tar.gz.

Themis: Themis is KTH's enhanced version of the AFS tool "package" for updating files on local disk from a central AFS image. KTH's enhancements include allowing the deletion of files, simplifying the process of adding a file, and allowing the merging of multiple rule sets for determining which files are updated. Themis is available from the Arla CVS repository.

Stanford was presented as an example of a large AFS site. Stanford's AFS usage consists of approximately 14TB of data in AFS, in the form of approximately 100,000 volumes; 3.3TB of storage is available in their primary cell, ir.stanford.edu. Their file servers consist entirely of Solaris machines running a combination of Transarc 3.6 patch-level 3 and OpenAFS 1.2.x, while their database servers run OpenAFS 1.2.2. Their cell consists of 25 file servers using a combination of EMC and Sun StorEdge hardware. Stanford continues to use the kaserver for their authentication infrastructure, with future plans to migrate entirely to an MIT Kerberos 5 KDC.

Stanford has approximately 3,400 clients on their campus, not including SLAC (Stanford Linear Accelerator); approximately 2,100 AFS clients from outside Stanford contact their cell every month. Their supported clients are almost entirely IBM AFS 3.6 clients, although they plan to release OpenAFS clients soon. Stanford currently supports only UNIX clients. There is some on-campus presence of Windows clients, but Stanford has never publicly released or supported it. They do intend to release and support the MacOS X client in the near future.

All Stanford students, faculty, and staff are assigned AFS home directories, with a default quota of 50MB, for a total of approximately 550GB of user home directories. Other significant uses of AFS storage are data volumes for workstation software (400GB) and volumes for course Web sites and assignments (100GB).

AFS usage at Intel was also presented. Intel has been an AFS site since 1994. They had bad experiences with the IBM 3.5 Linux client; their experience with OpenAFS on Linux 2.4.x kernels has been much better. They use and are satisfied with the OpenAFS IA64 Linux port. Intel has hundreds of OpenAFS 1.2.3 and 1.2.4 clients in many production cells accessing data stored on IBM AFS file servers. They have not encountered any interoperability issues. Intel has some concerns about OpenAFS: they would like to purchase commercial support for OpenAFS and to see OpenAFS support for HP-UX on both PA-RISC and Itanium hardware. HP-UX support is currently unavailable due to a specific HP-UX header file being unavailable from HP; this may be available soon. Intel has not yet committed to migrating their file servers to OpenAFS and is unsure if they will do so without commercial support.

Backups are a traditional topic of discussion at AFS workshops, and this time was no exception. Many users complain that the traditional AFS backup tools ("backup" and "butc") are complex and difficult to automate, requiring many home-grown scripts and much user intervention for error recovery. An additional complaint was that the traditional AFS tools do not support file- or directory-level backups and restores; data must be backed up and restored at the volume level.

Mitch Collinsworth of Cornell presented work to make AMANDA, the free backup software from the University of Maryland, suitable for AFS backups. Using AMANDA for AFS backups allows one to share AFS backup tapes with non-AFS backups, easily run multiple backups in parallel, automate error recovery, and provide a robust degraded mode that prevents tape errors from stopping backups altogether. Their implementation allows for full volume restores as well as individual directory and file restores. They have finished coding this work and are in the process of testing and documenting it.

Peter Honeyman of CITI at the University of Michigan spoke about work he has proposed to replace Rx with RPCSEC GSS in OpenAFS; this would allow AFS to use a TCP-based transport mechanism rather than the UDP-based Rx, and possibly gain better congestion control, dynamic adaptation, and fragmentation avoidance as a result. RPCSEC GSS uses the GSSAPI to authenticate SUN ONC RPC. RPCSEC GSS is transport agnostic, provides strong security, is a developing Internet standard, and has multiple open source implementations. Backward compatibility with existing AFS servers and clients is an important goal of this project.


The Vcdiff instruction code table consists of 256 entries, each coding up to a pair of instructions; the delta encoding records 1-byte indices into this table plus any additional data.
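As a rough illustration of what such delta instructions do (this is a simplified model, not the actual VCDIFF byte format or the Vcodex API), a decoder walks a list of ADD/COPY instructions, either appending new literal bytes or copying a range from the reference version:

#include <stdio.h>
#include <string.h>

/* Simplified delta instructions: ADD appends literal bytes,
 * COPY re-uses a range of the reference ("source") data.     */
enum op { ADD, COPY };

struct instr {
    enum op     op;
    size_t      len;
    size_t      src_off;   /* used by COPY */
    const char *data;      /* used by ADD  */
};

size_t delta_apply(const char *src, const struct instr *prog, size_t n,
                   char *out)
{
    size_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        if (prog[i].op == COPY)
            memcpy(out + pos, src + prog[i].src_off, prog[i].len);
        else
            memcpy(out + pos, prog[i].data, prog[i].len);
        pos += prog[i].len;
    }
    return pos;            /* length of the reconstructed target */
}

int main(void)
{
    const char *src = "hello world";
    struct instr prog[] = {
        { COPY, 6, 0, NULL },       /* copies "hello "       */
        { ADD,  5, 0, "again" },    /* appends literal bytes */
    };
    char out[64];
    size_t len = delta_apply(src, prog, 2, out);
    printf("%.*s\n", (int)len, out);   /* prints "hello again" */
    return 0;
}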

The talk then showed Vcdiff's performance with Web data collected from CNN, compared with the W3C standard Gdiff, Gdiff+gzip, and diff+gzip, where Gdiff was computed using Vcdiff delta instructions. The results of two experiments, "First" and "Successive," were presented. In "First," each file is compressed against the first file collected; in "Successive," each file is compressed against the one in the previous hour. The diff+gzip did not work well because diff is line-oriented. Vcdiff performed favorably compared with the other formats.

Vcdiff is part of the Vcodex package. Vcodex is a platform for all common data transformations: delta compression, plain compression, encryption, and transcoding (e.g., uuencode, base64). It is structured in three layers for maximum usability; the base library uses Disciplines and Methods interfaces.

The code for Vcdiff can be found at http://www.research.att.com/sw/tools/. They are moving Vcdiff to the IETF standard as a comprehensive platform for transforming data. Please refer to http://www.ietf.org/internet-drafts/draft-korn-vcdiff-06.txt.

WHERE IN THE NET

Summarized by Amit Purohit

A PRECISE AND EFFICIENT EVALUATION OF THE PROXIMITY BETWEEN WEB CLIENTS AND THEIR LOCAL DNS SERVERS

Zhuoqing Morley Mao, University of California, Berkeley; Charles D. Cranor, Fred Douglis, Michael Rabinovich, Oliver Spatscheck, and Jia Wang, AT&T Labs – Research

Content Distribution Networks (CDNs) attempt to improve Web performance by delivering Web content to end users from servers located at the edge of the network. When a Web client requests content, the CDN dynamically chooses a server to route the request to, usually the one that is nearest the client. As CDNs have access only to the IP address of the local DNS server (LDNS) of the client, the CDN's authoritative DNS server maps the client's LDNS to a geographic region within a particular network and combines that with network and server load information to perform CDN server selection.

This method has two limitations. First, it is based on the implicit assumption that clients are close to their LDNS. This may not always be valid. Second, a single request from an LDNS can represent a differing number of Web clients – called the hidden load factor. Knowledge of the hidden load factor can be used to achieve better load distribution.

The extent of the first limitation and its impact on the CDN server selection is dealt with. To determine the associations between clients and their LDNS, a simple, non-intrusive, and efficient mapping technique was developed. The data collected was used to study the impact of proximity on DNS-based server selection, using four different proximity metrics: (1) autonomous system (AS) clustering – observing whether a client is in the same AS as its LDNS – concluded that LDNS is very good for coarse-grained server selection, as 64% of the associations belong to the same AS; (2) network clustering – observing whether a client is in the same network-aware cluster (NAC) – implied that DNS is less useful for finer-grained server selection, since only 16% of the clients and LDNS are in the same NAC; (3) traceroute divergence – the length of the divergent paths to the client and its LDNS from a probe point, using traceroute – implies that most clients are topologically close to their LDNS as viewed from a randomly chosen probe site; (4) round-trip time (RTT) correlation (some CDNs select servers based on RTT between the CDN server and the client's LDNS) examines the correlation between the message RTTs from a probe point to the client and its local DNS; the results imply this is a reasonable metric to use to avoid really distant servers.

A study of the impact that client-LDNS associations have on DNS-based server selection concludes that knowing the client's IP address would allow more accurate server selection in a large number of cases. The optimality of the server selection also depends on the server density, placement, and selection algorithms.

Further information can be found at http://www.eecs.berkeley.edu/~zmao/myresearch.html or by contacting zmao@eecs.berkeley.edu.

GEOGRAPHIC PROPERTIES OF INTERNET ROUTING

Lakshminarayanan Subramanian, Venkata N. Padmanabhan, Microsoft Research; Randy H. Katz, University of California at Berkeley

Geographic information can provide insights into the structure and functioning of the Internet, including interactions between different autonomous systems, by analyzing certain properties of Internet routing. It can be used to measure and quantify certain routing properties, such as circuitous routing, hot-potato routing, and geographic fault tolerance.

Traceroute has been used to gather the required data, and the Geotrack tool has been used to determine the location of the nodes along each network path. This enables the computation of "linearized distances," which is the sum of the geographic distances between successive pairs of routers along the path.

In order to measure the circuitousness of a path, a metric called the "distance ratio" has been defined as the ratio of the linearized distance of a path to the geographic distance between the source and destination of the path. From the data it has been observed that the circuitousness of a route depends on both the geographic and network locations of the end hosts. A large value of the distance ratio enables us to flag paths that are highly circuitous, possibly (though not necessarily) because of routing anomalies. It is also shown that the minimum delay between end hosts is strongly correlated with the linearized distance of the path.
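Written out (this is just the two definitions above restated in symbols, with $d_g$ denoting great-circle geographic distance, $r_1,\dots,r_n$ the routers on a path $P$ from source $s$ to destination $t$):

\[
  D_{\mathrm{lin}}(P) \;=\; \sum_{i=1}^{n-1} d_g(r_i, r_{i+1}),
  \qquad
  \text{distance ratio}(P) \;=\; \frac{D_{\mathrm{lin}}(P)}{d_g(s,\,t)} .
\]

A ratio close to 1 indicates a geographically direct route, while large values flag the circuitous paths discussed above.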

Geographic information can be used to study various aspects of wide-area Internet paths that traverse multiple ISPs. It was found that end-to-end Internet paths tend to be more circuitous than intra-ISP paths, the cause for this being the peering relationships between ISPs. Also, paths traversing substantial distances within two or more ISPs tend to be more circuitous than paths largely traversing only a single ISP. Another finding is that ISPs generally employ hot-potato routing.

Geographic information can also be used to capture the fact that two seemingly unrelated routers can be susceptible to correlated failures. By using the geographic information of routers, we can construct a geographic topology of an ISP. Using this, we can find the tolerance of an ISP's network to the total network failure in a geographic region.

For further information, contact lakme@cs.berkeley.edu.

PROVIDING PROCESS ORIGIN INFORMATION TO AID IN NETWORK TRACEBACK

Florian P. Buchholz, Purdue University; Clay Shields, Georgetown University

Network traceback is currently limited because host audit systems do not maintain enough information to match incoming network traffic to outgoing network traffic. The talk presented an alternative: assigning origin information to every process and logging it during interactive login creation.

The current implementation concentrates mainly on interactive sessions, in which an event is logged when a new connection is established using SSH or Telnet. The method is effective and could successfully determine stepping stones and reliably detect the source of a DDoS attack. The speaker then talked about the related work done in this area, which led to discussion of the latest developments in the area of "host causality."

The speaker ended with the limitations and future directions of the framework. This framework could be extended and could find many applications in the area of security. The current implementation doesn't take care of startup scripts and cron jobs, but incorporating the origin information in the file system could solve this problem. In the current implementation, logging is just implemented as a proof of concept. It could be made safe in many ways, and this could be another important aspect of future work.

PROGRAMMING

Summarized by Amit Purohit

CYCLONE: A SAFE DIALECT OF C

Trevor Jim, AT&T Labs – Research; Greg Morrisett, Dan Grossman, Michael Hicks, James Cheney, Yanling Wang, Cornell University

Cyclone is designed to provide protection against attacks such as buffer overflows, format string attacks, and memory management errors. The current structure of C allows programmers to write vulnerable programs. Cyclone extends C so that it has the safety guarantees of Java while keeping C syntax, types, and semantics largely untouched.

The Cyclone compiler performs static analysis of the source code and inserts runtime checks into the compiled output at places where the analysis cannot determine that the code execution will not violate safety constraints. Cyclone imposes many restrictions to preserve safety, such as NULL checks; these checks do not exist in normal C.
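As a plain-C illustration of the class of bug being targeted (written in ordinary C rather than Cyclone syntax), the following compiles without complaint under a standard C compiler even though it can both dereference NULL and write past the end of a buffer; a safe dialect must either reject such code statically or insert NULL and bounds checks that fail cleanly at runtime:

#include <stdio.h>
#include <string.h>

void greet(const char *name)          /* name may be NULL               */
{
    char buf[8];
    strcpy(buf, name);                /* no NULL check, no bounds check */
    printf("hello %s\n", buf);
}

int main(void)
{
    greet("this string is longer than eight bytes");  /* buffer overflow  */
    greet(NULL);                                       /* NULL dereference */
    return 0;
}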

The speaker then talked about some sample code written in Cyclone and how it tackles safety problems. Porting an existing C application to Cyclone is fairly easy, with fewer than 10% of the lines requiring change. The current implementation concentrates more on safety than on performance – hence the performance penalty is substantial. Cyclone was able to find many lingering bugs. The final step of the project will be to write a compiler to convert a normal C program to an equivalent Cyclone program.

COOPERATIVE TASK MANAGEMENT WITHOUT MANUAL STACK MANAGEMENT

Atul Adya, Jon Howell, Marvin Theimer, William J. Bolosky, John R. Douceur, Microsoft Research

The speaker described the definitions and motivations behind the distinct concepts of task management, stack management, I/O management, conflict management, and data partitioning. Conventional concurrent programming uses preemptive task management and exploits the automatic stack management of a standard language. In the second approach, cooperative tasks are organized as event handlers that yield control by returning control to the event scheduler, manually unrolling their stacks. In this project they have adopted a hybrid approach that makes it possible for both stack management styles to coexist. Thus, programmers can code assuming one or the other of these stack management styles will operate, depending upon the application. The speaker also gave a detailed example of how to use their mechanism to switch between the two styles.

The talk then continued with the implementation. They were able to preempt many subtle concurrency problems by using cooperative task management. Paying a cost up-front to reduce a subtle race condition proved to be a good investment.

Though the choice of task management is fundamental, the choice of stack management can be left to individual taste. This project enables the use of any type of stack management in conjunction with cooperative task management.

IMPROVING WAIT-FREE ALGORITHMS FOR INTERPROCESS COMMUNICATION IN EMBEDDED REAL-TIME SYSTEMS

Hai Huang, Padmanabhan Pillai, Kang G. Shin, University of Michigan

The main characteristic of a real-time, time-sensitive system is its predictable response time. But concurrency management creates hurdles to achieving this goal because of the use of locks to maintain consistency. To solve this problem, many wait-free algorithms have been developed, but these are typically high-cost solutions. By taking advantage of the temporal characteristics of the system, however, the time and space overhead can be reduced.

The speaker presented an algorithm for temporal concurrency control and applied this technique to improve three wait-free algorithms. A single-writer/multiple-reader wait-free algorithm and a double-buffer algorithm were proposed.
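For readers unfamiliar with the double-buffer idea, here is a minimal single-writer sketch in C using C11 atomics. It is a generic textbook version, not the algorithm or the temporal optimization from the paper; note that a plain double buffer like this is only safe when the reader is guaranteed to finish before the writer wraps around to reuse the other buffer – exactly the kind of timing knowledge the authors exploit, whereas fully general wait-free schemes need more slots.

#include <stdatomic.h>
#include <string.h>

#define MSG_SIZE 64

static char       buffers[2][MSG_SIZE];
static atomic_int current = 0;          /* index the reader should use */

/* Writer: fill the inactive buffer, then atomically flip the index.
 * Neither side ever blocks the other.                                */
void publish(const char *msg)
{
    int next = 1 - atomic_load_explicit(&current, memory_order_relaxed);
    strncpy(buffers[next], msg, MSG_SIZE - 1);
    buffers[next][MSG_SIZE - 1] = '\0';
    atomic_store_explicit(&current, next, memory_order_release);
}

/* Reader: copy out whichever buffer is currently published.
 * Correct only under the timing assumption stated above.             */
void snapshot(char out[MSG_SIZE])
{
    int cur = atomic_load_explicit(&current, memory_order_acquire);
    memcpy(out, buffers[cur], MSG_SIZE);
}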

Using their transformation mechanism, they achieved an improvement of 17–66% in ACET and a 14–70% reduction in memory requirements for IPC algorithms. This mechanism is extensible and can be applied to other non-blocking IPC algorithms as well. Future work involves reducing the synchronizing overheads in more general IPC algorithms with multiple-writer semantics.

MOBILITY

Summarized by Praveen Yalagandula

ROBUST POSITIONING ALGORITHMS FOR DISTRIBUTED AD-HOC WIRELESS SENSOR NETWORKS

Chris Savarese, Jan Rabaey, University of California, Berkeley; Koen Langendoen, Delft University of Technology

The Pico Radio Network comprises more than a hundred sensors, monitors, and actuators equipped with wireless connectivity, with the following properties: (1) no infrastructure, (2) computation in a distributed fashion instead of centralized computation, (3) dynamic topology, (4) limited radio range, (5) nodes that can act as repeaters, (6) devices of low cost and with low power, and (7) sparse anchor nodes – nodes with GPS capability. A positioning algorithm determines the geographical position of each node in the network.

In two-dimensional space, each node needs three reference positions to estimate its geographical position. There are two problems that make positioning difficult in the Pico Radio Network setting: (1) a sparse anchor node problem, and (2) a range error problem.

The two-phase approach that the authors have taken in solving the problem consists of (1) a hop-terrain algorithm – in this first phase, each node roughly guesses its location by the distance calculated using multi-hops to the anchor points – and (2) a refinement algorithm – in this second phase, each node uses its neighbors' positions to refine its own position estimate. To guarantee convergence, this approach uses confidence metrics, assigning a value of 1.0 for anchor nodes and 0.1 to start for other nodes, and increasing these with each iteration.
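The refinement step can be read as a weighted least-squares problem; this is a standard way of stating such lateration-based refinement and is offered here as an interpretation rather than the paper's exact equations. Node $i$, with current estimate $x_i$, neighbors $j$ at positions $x_j$, measured ranges $d_{ij}$, and confidence weights $w_j$, repeatedly updates

\[
  x_i \;\leftarrow\; \arg\min_{x} \sum_{j \in N(i)} w_j \,\bigl(\lVert x - x_j \rVert - d_{ij}\bigr)^2 ,
\]

so that anchors (weight 1.0) pull hard on the solution while low-confidence neighbors (initial weight 0.1) contribute little until their own estimates improve.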

A simulation tool, OMNeT++, was used for both phases and in various scenarios. The results show that they achieved position errors of less than 33% in a scenario with 5% anchor nodes, an average connectivity of 7, and 5% range measurement error.

Guidelines for anchor node deployment are: high connectivity (>10), a reasonable fraction of anchor nodes (>5%), and, for anchor placement, covered edges.

APPLICATION-SPECIFIC NETWORK MANAGEMENT FOR ENERGY-AWARE STREAMING OF POPULAR MULTIMEDIA FORMATS

Surendar Chandra, University of Georgia, and Amin Vahdat, Duke University

The main hindrance in supporting the increasing demand for mobile multimedia on PDAs is the battery capacity of these small devices. The idle-time power consumption is almost the same as the receive power consumption on typical wireless network cards (1319mW vs. 1425mW), while the sleep state consumes far less power (177mW). To let the system consume energy proportional to the stream quality, the network card should transition to the sleep state aggressively between each packet.

The speaker presented previous work where different multimedia streams were studied for PDAs: MS Media, Real Media, and QuickTime. It was found that if the inter-packet gap is predictable, then huge savings are possible, for example, about 80% savings in MS Media streams with few packet losses. The limitation of the IEEE 802.11b power-save mode comes into play when there are two or more nodes, producing a delay between beacon time and when the node receives/sends packets. This badly affects the streams, since higher energy is consumed while waiting.

The authors propose traffic shaping for energy conservation, where this is done by a proxy such that packets arrive at regular intervals. This is achieved using a local proxy in the access point and a client-side proxy in the mobile host. Simulations show that traffic shaping reduces energy consumption and also reveal that the higher-bandwidth streams have a lower energy metric (mJ/kB).

More information is at http://greenhouse.cs.uga.edu/.

CHARACTERIZING ALERT AND BROWSE SERVICES OF MOBILE CLIENTS

Atul Adya, Paramvir Bahl, and Lili Qiu, Microsoft Research

Even though there is a dramatic increase in Internet access from wireless devices, there are not many studies done on characterizing this traffic. In this paper, the authors characterize the traffic observed on the MSN Web site with both notification and browse traces.


Around 33 million browsing accesses and about 325 million notification entries are present in the traces. Three types of analyses are done for each of these two services: content analysis, concerning the most popular content categories and their distribution; popularity analysis, or the popularity distribution of documents; and user behavior analysis.

The analysis of the notification logs shows that document access rates follow a Zipf-like distribution, with most of the accesses concentrated on a small number of messages, and the accesses exhibit geographical locality – users from the same locality tend to receive similar notification content. The browser log analysis shows that a smaller set of URLs is accessed most of the time, though the access pattern does not fit any Zipf curve, and the highly accessed URLs remain stable. A correlation study between the notification and browsing services shows that wireless users have a moderate correlation of 0.12.
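For reference, a Zipf-like distribution means that the access probability of the $i$-th most popular document falls off as a power law,

\[
  P(i) \;=\; \frac{c}{i^{\alpha}},
  \qquad
  c = \Bigl(\sum_{k=1}^{N} k^{-\alpha}\Bigr)^{-1},
\]

so a handful of top-ranked items absorb most of the accesses; the observation above is that the notification traces fit such a curve while the browse traces do not.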

FREENIX TRACK SESSIONS

BUILDING APPLICATIONS

Summarized by Matt Butner

INTERACTIVE 3-D GRAPHICS APPLICATIONS FOR TCL

Oliver Kersting, Jürgen Döllner, Hasso Plattner Institute for Software Systems Engineering, University of Potsdam

The integration of 3-D image rendering functionality into a scripting language permits interactive and animated 3-D development and application without the formalities and precision demanded by low-level C/C++ graphics and visualization libraries. The large and complex C++ API of the Virtual Rendering System (VRS) can be combined with the conveniences of the Tcl scripting language. The mapping of class interfaces is done via an automated process and generates respective wrapper classes, all of which ensures complete API accessibility and functionality without any significant performance penalty. The general C++-to-Tcl mapper SWIG grants the necessary object and hierarchical abilities without the use of object-oriented Tcl extensions. The final API mapping techniques address C++ features such as classes, overloaded methods, enumerations, and inheritance relations. All are implemented in a proof of concept that maps the complete C++ API of the VRS to Tcl and are showcased in a complete interactive 3-D map system.

THE AGFL GRAMMAR WORK LAB

Cornelis H.A. Coster, Erik Verbruggen, University of Nijmegen (KUN)

The growth and implementation of Natural Language Processing (NLP) is the cornerstone of the continued evolution and implementation of truly intelligent search machines and services. In part due to the growing collections of computer-stored, human-readable documents in the public domain, the implementation of linguistic analysis will become necessary for desirable precision and resolution. Subsequently, such tools and linguistic resources must be released into the public domain, and they have done so with the release of the AGFL Grammar Work Lab, under the GNU Public License, as a tool for linguistic research and the development of NLP-based applications.

The AGFL (Affix Grammars over a Finite Lattice) Grammar Work Lab meshes context-free grammars with finite set-valued features that are acceptable to a range of languages. In computer science terms, "syntax rules are procedures with parameters and a non-deterministic execution." The English Phrases for Information Retrieval (EP4IR), released with the AGFL-GWL as a robust grammar of English, is an AGFL-GWL-generated English parser that outputs "Head/Modifier" frames. The sentences "CompanyX sponsored this conference" and "This conference was sponsored by CompanyX" both generate [CompanyX[sponsored conference]]. The hope is that public availability of such tools will encourage further development of grammar and lexical software systems.

SWILL: A SIMPLE EMBEDDED WEB SERVER LIBRARY

Sotiria Lampoudi, David M. Beazley, University of Chicago

This paper won the FREENIX Best Student Paper award.

SWILL (Simple Web Interface Link Library) is a simple Web server in the form of a C library whose development was motivated by a wish to give cool applications an interface to the Web. The SWILL library provides a simple interface that can be efficiently implemented for tasks that vary from flexible Web-based monitoring to software debugging and diagnostics. Though originally designed to be integrated with high-performance scientific simulation software, the interface is generic enough to allow for unbounded uses. SWILL is a single-threaded server relying upon non-blocking I/O through the creation of a temporary server which services I/O requests. SWILL does not provide SSL or cryptographic authentication but does have HTTP authentication abilities.
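The flavor of embedding such a server is roughly as follows. The header and function names here (swill_init, swill_handle, swill_poll) are recalled from the SWILL distribution and should be treated as assumptions approximating the API, not an authoritative reference:

#include <stdio.h>
#include "swill.h"   /* assumed header name */

/* A handler writes its output to the FILE* SWILL hands it, so
 * existing printf-style reporting code can be reused directly.  */
void show_status(FILE *f, void *clientdata)
{
    int *iterations = (int *)clientdata;
    fprintf(f, "simulation has completed %d iterations\n", *iterations);
}

int main(void)
{
    static int iterations = 0;

    swill_init(8080);                                /* listen on port 8080 */
    swill_handle("status.txt", show_status, &iterations);

    while (1) {
        iterations++;                                /* ... do real work ... */
        swill_poll();      /* service any pending HTTP requests, non-blocking */
    }
    return 0;
}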

A fantastic feature of SWILL is its support for SPMD-style parallel applications which utilize MPI, proving valuable for Beowulf clusters and large parallel machines. Another practical application was the implementation of SWILL in a modified Yalnix emulator by University of Chicago Operating Systems courses, which utilized the added Yalnix functionality for OS development and debugging. SWILL requires minimal memory overhead and relies upon the HTTP/1.0 protocol.
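
As a rough illustration of the style of embedding described above, the sketch below wires a single monitoring handler into a C program. The function names (swill_init, swill_handle, swill_poll) follow the paper's description of the library, but the exact signatures and header name here are assumptions, not the documented API.

    #include <stdio.h>
    #include "swill.h"               /* SWILL header; name is an assumption */

    /* Handler invoked for http://host:8080/status.txt.  SWILL passes a
       FILE* connected to the HTTP response plus the registered user data. */
    static void show_status(FILE *out, void *clientdata)
    {
        long *iterations = (long *)clientdata;
        fprintf(out, "iterations completed: %ld\n", *iterations);
    }

    int main(void)
    {
        static long iterations = 0;

        swill_init(8080);                      /* listen on port 8080 */
        swill_handle("status.txt", show_status, &iterations);

        for (;;) {
            iterations++;           /* the application's real computation */
            swill_poll();           /* service any pending HTTP requests
                                       without blocking the main loop     */
        }
    }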

NETWORK PERFORMANCE

Summarized by Florian Buchholz

LINUX NFS CLIENT WRITE PERFORMANCE

Chuck Lever, Network Appliance; Peter Honeyman, CITI, University of Michigan

Lever introduced a benchmark to measure NFS client write performance. Client performance is difficult to measure due to hindrances such as poor hardware or bandwidth limitations. Furthermore, measuring application performance does not identify weaknesses specifically at the client side. Thus a benchmark was developed that tries to exercise only data transfers in one direction between server and application. For this purpose the benchmark was based on the block sequential write portion of the Bonnie file system benchmark. Once a benchmark for NFS clients is established, it can be used to improve client performance.

The performance measurements were performed with an SMP Linux client and both a Linux NFS server and a Network Appliance F85 filer. During testing, a periodic jump in write latency was discovered. This was due to a rather large number of pending write operations that were scheduled to be written after certain threshold values were exceeded. By introducing a separate daemon that flushes the cached write requests, the spikes could be eliminated, but as a result the average latency grows over time. The problem could be traced to a function that scans a linked list of write requests. After a hashtable was added to improve lookup performance, the latency improved considerably.
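
The fix is easy to picture: instead of walking one long linked list of pending write requests, the client buckets requests by offset. The sketch below is a generic illustration of that idea; the structure layout and names are invented for the example and are not the actual Linux NFS client code.

    #include <stddef.h>

    #define WRITE_HASH_BUCKETS 256            /* illustrative table size */

    struct write_req {
        unsigned long     offset;             /* file offset of the request */
        struct write_req *next;               /* chain within one bucket    */
        /* ... pages, credentials, flags ... */
    };

    static struct write_req *write_hash[WRITE_HASH_BUCKETS];

    static unsigned int bucket(unsigned long offset)
    {
        return (unsigned int)((offset >> 12) % WRITE_HASH_BUCKETS); /* by page */
    }

    /* Expected O(1) lookup instead of an O(n) scan of every pending write. */
    struct write_req *find_write(unsigned long offset)
    {
        struct write_req *r = write_hash[bucket(offset)];
        while (r != NULL && r->offset != offset)
            r = r->next;
        return r;
    }

    void add_write(struct write_req *r)
    {
        unsigned int b = bucket(r->offset);
        r->next = write_hash[b];
        write_hash[b] = r;
    }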

The improved client was then used to measure throughput against the two servers. A discrepancy between the Linux server and the filer test was noticed, and the reason for it was traced back to a global kernel lock that was unnecessarily held when accessing the network stack. After correcting this, performance improved further. However, the measurements showed that a client may run slower when paired with fast servers on fast networks. This is due to heavy client interrupt loads, more network processing on the client side, and more global kernel lock contention.

The source code of the project is available at http://www.citi.umich.edu/projects/nfs-perf/patches/.

A STUDY OF THE RELATIVE COSTS OF NETWORK SECURITY PROTOCOLS

Stefan Miltchev and Sotiris Ioannidis, University of Pennsylvania; Angelos Keromytis, Columbia University

With the increasing need for security and integrity of remote network services, it becomes important to quantify the communication overhead of IPSec and compare it to alternatives such as SSH, SCP, and HTTPS.

For this purpose the authors set up three testing networks: a direct link, two hosts separated by two gateways, and three hosts connecting through one gateway. Protocols were compared in each setup, and manual keying was used to eliminate connection setup costs. For IPSec, the different encryption algorithms AES, DES, 3DES, hardware DES, and hardware 3DES were used. In detail, FTP was compared to SFTP, SCP, and FTP over IPSec; HTTP to HTTPS and HTTP over IPSec; and NFS to NFS over IPSec and local disk performance.

The result of the measurements was that IPSec outperforms the other popular encryption schemes. Overall, unencrypted communication was fastest, but in some cases, like FTP, the overhead can be small. The use of crypto hardware can significantly improve performance. For future work, the inclusion of setup costs, hardware-accelerated SSL, SFTP, and SSH was mentioned.

CONGESTION CONTROL IN LINUX TCP

Pasi Sarolahti, University of Helsinki; Alexey Kuznetsov, Institute for Nuclear Research at Moscow

Having attended the talk and read the paper, I am still unclear about whether the authors are merely describing the design decisions of TCP congestion control or whether they are actually the creators of that part of the Linux code. My guess leans toward the former.

In the talk, the speaker compared the TCP congestion control measures specified by the IETF RFCs with the actual Linux implementation, which does conform to the basic principles but nevertheless has differences. A specific emphasis was placed on retransmission mechanisms and the congestion window. Also, several TCP enhancements (the NewReno algorithm, Selective ACKs (SACK), and Forward ACKs (FACK)) were discussed and compared.

In some instances Linux does not conform to the IETF specifications. The fast recovery does not fully follow RFC 2582, since the threshold for triggering retransmit is adjusted dynamically and the congestion window's size is not changed. Also, the round-trip-time estimator and the RTO calculation differ from RFC 2988, since Linux uses more conservative RTT estimates and a minimum RTO of 200ms. The performance measures showed that with the additional Linux-specific features enabled, slightly higher throughput, more steady data flow, and fewer unnecessary retransmissions can be achieved.
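
A minimal sketch of the kind of retransmission-timeout bookkeeping being compared, with the conservative 200ms floor mentioned above; the constants and variable names are illustrative, and this is not the actual Linux implementation.

    /* RFC 2988-style smoothed RTT / RTO estimation, in milliseconds,
       with a Linux-like lower bound on the resulting timeout. */
    #define MIN_RTO_MS 200

    struct rtt_state {
        int srtt;      /* smoothed round-trip time  */
        int rttvar;    /* mean deviation of the RTT */
        int rto;       /* retransmission timeout    */
    };

    void update_rto(struct rtt_state *s, int measured_rtt)
    {
        if (s->srtt == 0) {                      /* first sample */
            s->srtt   = measured_rtt;
            s->rttvar = measured_rtt / 2;
        } else {
            int err = measured_rtt - s->srtt;
            if (err < 0)
                err = -err;
            s->rttvar += (err - s->rttvar) / 4;          /* beta  = 1/4 */
            s->srtt   += (measured_rtt - s->srtt) / 8;   /* alpha = 1/8 */
        }

        s->rto = s->srtt + 4 * s->rttvar;
        if (s->rto < MIN_RTO_MS)      /* clamp to the conservative floor */
            s->rto = MIN_RTO_MS;
    }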

XTREME XCITEMENT

Summarized by Steve Bauer

THE FUTURE IS COMING: WHERE THE X WINDOW SHOULD GO

Jim Gettys, Compaq Computer Corp.

Jim Gettys, one of the principal authors of the X Window System, outlined the near-term objectives for the system, primarily focusing on the changes and infrastructure required to enable replication and migration of X applications. Providing better support for this functionality would enable users to retrieve or duplicate X applications between their servers at home and work.


One interesting example of an application that currently is capable of migration and replication is Emacs. To create a new frame on DISPLAY, try "M-x make-frame-on-display <Return> DISPLAY <Return>".

However, technical challenges make replication and migration difficult in general. These include the "major headaches" of server-side fonts, the nonuniformity of X servers and screen sizes, and the need to appropriately retrofit toolkits. Authentication and authorization issues are obviously also important. The rest of the talk delved into some of the details of these interesting technical challenges.

HACKING IN THE KERNEL

Summarized by Hai Huang

AN IMPLEMENTATION OF SCHEDULER ACTIVATIONS ON THE NETBSD OPERATING SYSTEM

Nathan J. Williams, Wasabi Systems

Scheduler activation is an old idea. Basically, there are benefits and drawbacks to using solely kernel-level or user-level threading; scheduler activation is able to combine the two layers of control to provide more concurrency in the system.

In his talk, Nathan gave a fairly detailed description of the implementation of scheduler activation in the NetBSD kernel. One important change in the implementation is to differentiate the thread context from the process context. This is done by defining a separate data structure for these threads and relocating some of the information that was embedded in the process context to these thread contexts. The stack was especially a concern because of the upcall: special handling must be done to make sure that the upcall doesn't mess up the stack, so that the preempted user-level thread can continue afterwards. Lastly, Nathan explained that signals were handled by upcalls.


AUTHORIZATION AND CHARGING IN PUBLIC WLANS USING FREEBSD AND 802.1X

Pekka Nikander, Ericsson Research NomadicLab

The 802.1X standards are well known in the wireless community as link-layer authentication protocols. In this talk, Pekka explained some novel ways of using the 802.1X protocols that might be of interest to people on the move. It is possible to set up a public WLAN that would support various charging schemes via virtual tokens, which people can purchase or earn and later use.

This is implemented on FreeBSD using the netgraph utility. It is basically a filter in the link layer that differentiates traffic based on the MAC address of the client node, which is either authenticated, denied, or let through. The overhead for this service is fairly minimal.

ACPI IMPLEMENTATION ON FREEBSD

Takanori Watanabe, Kobe University

ACPI (Advanced Configuration and Power Interface) was proposed as a joint effort by Intel, Toshiba, and Microsoft to provide a standard and finer-grained method of managing the power states of individual devices within a system. Such low-level power management is especially important for those mobile and embedded systems that are powered by fixed-capacity batteries.

Takanori explained that the ACPI specification is composed of three parts: tables, BIOS, and registers. He was able to implement some functionality of ACPI in a FreeBSD kernel. The ACPI Component Architecture, implemented by Intel, provides a high-level ACPI API to the operating system; Takanori's ACPI implementation is built upon this underlying layer of APIs.

ACCESS CONTROL

Summarized by Florian Buchholz

DESIGN AND PERFORMANCE OF THE OPENBSD STATEFUL PACKET FILTER (PF)

Daniel Hartmeier, Systor AG

Daniel Hartmeier described the new stateful packet filter (pf) that replaced IPFilter in the OpenBSD 3.0 release. IPFilter could no longer be included due to licensing issues, and thus there was a need to write a new filter making use of optimized data structures.

The filter rules are implemented as a linked list which is traversed from start to end. Two actions may be taken according to the rules, "pass" or "block": a "pass" action forwards the packet, and a "block" action will drop it. Where more than one rule matches a packet, the last rule wins. Rules that are marked as "final" will immediately terminate any further rule evaluation. An optimization called "skip-steps" was also implemented, where blocks of similar rules are skipped if they cannot match the current packet; these skip-steps are calculated when the rule set is loaded. Furthermore, a state table keeps track of TCP connections, and only packets that match the sequence numbers are allowed. UDP and ICMP queries and replies are considered in the state table, where initial packets will create an entry for a pseudo-connection with a low initial timeout value. The state table is implemented using a balanced binary search tree. NAT mappings are also stored in the state table, whereas application proxies reside in user space. The packet filter is also able to perform fragment reassembly and to modulate TCP sequence numbers to protect hosts behind the firewall.
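
For readers who have not seen pf, a few illustrative rules convey the model described above: rules are evaluated top to bottom, the last match wins, and a rule can cut evaluation short (in released versions of pf this is spelled with the quick keyword). The interface name and the rules below are made up for the example, not taken from the paper.

    # Illustrative pf rule set (not from the paper)
    block in  all                      # default: drop incoming packets
    block out all                      # default: drop outgoing packets

    # "quick" stops rule evaluation immediately, like the "final" marking
    # described in the talk; "keep state" adds an entry to the state table.
    pass in  quick on fxp0 proto tcp from any to any port 22 keep state
    pass out quick on fxp0 proto { tcp, udp } from any to any keep state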

Pf was compared against IPFilter as well as Linux's Iptables by measuring throughput and latency with increasing traffic rates and different packet sizes. In a test with a fixed rule set size of 100, Iptables outperformed the other two filters, whose results were close to each other. In a second test, where the rule set size was continually increased, Iptables consistently had about twice the throughput of the other two (which evaluate the rule set on both the incoming and outgoing interfaces). A third test compared only pf and IPFilter, using a single rule that created state in the state table with a fixed-state entry size; pf reached an overloaded state much later than IPFilter. The experiment was repeated with a variable-state entry size, and pf performed much better than IPFilter for a small number of states.

In general, rule set evaluation is expensive, and benchmarks only reflect extreme cases, whereas in real life other behavior should be observed. Furthermore, the benchmarks show that stateful filtering can actually improve performance, due to cheap state-table lookups as compared to rule evaluation.

ENHANCING NFS CROSS-ADMINISTRATIVE DOMAIN ACCESS

Joseph Spadavecchia and Erez Zadok, Stony Brook University

The speaker presented modifications to an NFS server that allow improved NFS access between administrative domains. A problem lies in the fact that NFS assumes a shared UID/GID space, which makes it unsafe to export files outside the administrative domain of the server. Design goals were to leave the protocol and existing clients unchanged, a minimum amount of server changes, flexibility, and increased performance.

To solve the problem, two techniques are utilized: "range-mapping" and "file-cloaking." Range-mapping maps IDs between client and server. The mapping is performed on a per-export basis and has to be manually set up in an export file. The mappings can be 1-1, N-N, or N-1. In file-cloaking, the server restricts file access based on UID/GID and special cloaking-mask bits. Here, users can only access their own file permissions. The policy on whether or not the file is visible to others is set by the cloaking mask, which is logically ANDed with the file's protection bits. File-cloaking only works, however, if the client doesn't hold cached copies of directory contents and file attributes. Because of this, the clients are forced to re-read directories, by incrementing the mtime value of the directory each time it is listed.
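
The range-mapping side of this can be pictured with a small sketch: each export carries a table of ID ranges, and a UID arriving from a client is translated into the server's ID space (or squashed to an anonymous ID if it is unmapped). The structure and function below are illustrative only, not the modified server's actual code.

    struct id_range_map {
        unsigned int client_base;   /* first UID of the client-side range */
        unsigned int server_base;   /* first UID of the server-side range */
        unsigned int count;         /* 1 for a 1-1 mapping, >1 for N-N    */
    };

    /* Translate a client UID for one export; fall back to an anonymous
       UID when the incoming ID is not covered by any configured range. */
    unsigned int map_uid(const struct id_range_map *map, int nranges,
                         unsigned int client_uid, unsigned int anon_uid)
    {
        for (int i = 0; i < nranges; i++) {
            if (client_uid >= map[i].client_base &&
                client_uid <  map[i].client_base + map[i].count)
                return map[i].server_base + (client_uid - map[i].client_base);
        }
        return anon_uid;
    }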

To test the performance of the modified server, five different NFS configurations were evaluated: an unmodified NFS server was compared against one server with the modified code included but not used, one with only range-mapping enabled, one with only file-cloaking enabled, and one version with all modifications enabled. For each setup, different file system benchmarks were run. The results show only a small overhead when the modifications are used, generally an increase of below 5%. Another experiment tested the performance of the system with an increasing number of mapped or cloaked entries. The results show that an increase from 10 to 1000 entries resulted in a maximum of about a 14% cost in performance.

One member of the audience pointed out that if clients choose to ignore the changed mtimes from the server and thus still hold caches of the directory entries, the file-cloaking mechanism could be defeated. After a rather lengthy debate, the speaker had to concede that the model doesn't add any extra security. Another question was asked about the scalability of setting up range-mapping; the speaker referred to application-level tools that could be developed for that purpose.

The software is available at ftp://ftp.fsl.cs.sunysb.edu/pub/enf/.

ENGINEERING OPEN SOURCE SOFTWARE

Summarized by Teri Lampoudi

NINGAUI: A LINUX CLUSTER FOR BUSINESS

Andrew Hume, AT&T Labs - Research; Scott Daniels, Electronic Data Systems Corp.

Ningaui is a general-purpose, highly available, resilient architecture built from commodity software and hardware. Emphatically, however, Ningaui is not a Beowulf. Hume calls the cluster design the "Swiss canton model," in which there are a number of loosely affiliated independent nodes with data replicated among them. Jobs are assigned by bidding and leases, and cluster services done as session-based servers are sited via generic job assignment. The emphasis is on keeping the architecture end-to-end, checking all work via checksums, and logging everything. The resilience and high availability required by their goal of 8-to-5 maintenance (vs. the typical 24/7 model, where people get paged whenever the slightest thing goes wrong regardless of the time of day) is achieved by job restartability. Finally, all computation is performed on local data, without the use of NFS or network-attached storage.

Hume's message is a hopeful one: despite the many problems encountered (things like kernel and service limits, auto-installation problems, TCP storms, and the like), the existence of source code and the paranoid practice of logging and checksumming everything has helped. The final product performs reasonably well, and it appears that the resilient techniques employed do make a difference. One drawback, however, is that the software mentioned in the paper is not yet available for download.

CPCMS: A CONFIGURATION MANAGEMENT SYSTEM BASED ON CRYPTOGRAPHIC NAMES

Jonathan S. Shapiro and John Vanderburgh, Johns Hopkins University

This paper won the FREENIX Best Paper award.

The basic notion behind the project is the fact that everyone has a pet complaint about CVS, and yet it is currently the configuration manager in most widespread use. Shapiro has unleashed an alternative. Interestingly, he did not begin but ended with the usual host of reasons why CVS is bad. The talk instead began abruptly by characterizing the job of a software configuration manager, continued by stating the namespaces which it must handle, and wrapped up with the challenges faced.

X MEETS Z: VERIFYING CORRECTNESS IN THE PRESENCE OF POSIX THREADS

Bart Massey, Portland State University; Robert T. Bauer, Rational Software Corp.

Massey delivered a humorous talk on the insight gained from applying Z formal specification notation to system software design, rather than the more informal analysis and design process normally used.

The story is told with respect to writing XCB, which replaces the Xlib protocol layer and is supposed to be thread friendly. But where threads are concerned, deadlock avoidance becomes a hard problem that cannot be solved in an ad hoc manner, while full model checking is also too hard. In this case Massey resorted to Z specification to model the XCB lower layer, abstract away locking and data transfers, and locate fundamental issues. Essentially, the difficulties of searching the literature and locating information relevant to the problem at hand were overcome. As Massey put it, "the formal method saved the day."

FILE SYSTEMS

Summarized by Bosko Milekic

PLANNED EXTENSIONS TO THE LINUX EXT2/EXT3 FILESYSTEM

Theodore Y. Ts'o, IBM; Stephen Tweedie, Red Hat

The speaker presented improvements to the Linux Ext2 file system, with the goal of allowing for various expansions while striving to maintain compatibility with older code. Improvements have been facilitated by a few extra superblock fields that were added to Ext2 not long before the Linux 2.0 kernel was released. The fields allow for file system features to be added without compromising the existing setup; this is done by providing bits indicating the impact of the added features with respect to compatibility. Namely, a file system with the "incompat" bit marked is not allowed to be mounted, while a "read-only" marking would only allow the file system to be mounted read-only.
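
The scheme can be sketched as a simple mount-time check against three feature masks; the field names mirror the Ext2 superblock's compat/incompat/ro-compat fields, but the values and the function itself are a simplified illustration rather than kernel code.

    /* Simplified view of the Ext2 feature-mask check performed at mount time. */
    struct ext2_super_features {
        unsigned int compat;     /* unknown bits here are harmless            */
        unsigned int incompat;   /* unknown bits here: refuse to mount        */
        unsigned int ro_compat;  /* unknown bits here: allow read-only mounts */
    };

    #define SUPPORTED_INCOMPAT   0x0001u   /* illustrative values only */
    #define SUPPORTED_RO_COMPAT  0x0003u

    int ext2_check_features(const struct ext2_super_features *f, int readonly)
    {
        if (f->incompat & ~SUPPORTED_INCOMPAT)
            return -1;                 /* unknown incompat feature: fail       */
        if (!readonly && (f->ro_compat & ~SUPPORTED_RO_COMPAT))
            return -2;                 /* only a read-only mount is safe       */
        return 0;                      /* unknown compat bits are ignored      */
    }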

Directory indexing replaces linear directory searches with a faster search using a fixed-depth tree and hashed keys. File system size can be dynamically increased, and the expanded inode, doubled from 128 to 256 bytes, allows for more extensions.

Other potential improvements were discussed as well, in particular pre-allocation for contiguous files, which allows for better performance in certain setups by pre-allocating contiguous blocks. Security-related modifications, extended attributes, and ACLs were mentioned. An implementation of these features already exists but has not yet been merged into the mainline Ext2/3 code.

RECENT FILESYSTEM OPTIMISATIONS ON FREEBSD

Ian Dowse, Corvil Networks; David Malone, CNRI, Dublin Institute of Technology

David Malone presented four important file system optimizations for the FreeBSD OS: soft updates, dirpref, vmiodir, and dirhash. It turns out that certain combinations of the optimizations (beautifully illustrated in the paper) may yield performance improvements of anywhere between 2 and 10 orders of magnitude for real-world applications.

All four of the optimizations deal with file system metadata. Soft updates allow for asynchronous metadata updates. Dirpref changes the way directories are organized, attempting to place child directories closer to their parents, thereby increasing locality of reference and reducing disk-seek times. Vmiodir trades some extra memory in order to achieve better directory caching. Finally, dirhash, which was implemented by Ian Dowse, changes the way in which entries are searched in a directory: specifically, FFS uses a linear search to find an entry by name, while dirhash builds a hashtable of directory entries on the fly. For a directory of size n with a working set of m files, a search that in certain cases could have been O(nm) has been reduced, due to dirhash, to effectively O(n + m).

FILESYSTEM PERFORMANCE AND SCALABILITY IN LINUX 2.4.17

Ray Bryant, SGI; Ruth Forester, IBM LTC; John Hawkes, SGI

This talk focused on performance evaluation of a number of file systems available and commonly deployed on Linux machines. Comparisons under various configurations of Ext2, Ext3, ReiserFS, XFS, and JFS were presented.

The benchmarks chosen for the data gathering were pgmeter, filemark, and the AIM Benchmark Suite VII. Pgmeter measures the rate of data transfer of reads/writes of a file under a synthetic workload. Filemark is similar to postmark in that it is an operation-intensive benchmark, although filemark is threaded and offers various other features that postmark lacks. AIM VII measures performance for various file-system-related functionalities; it offers various metrics under an imposed workload, thus stressing the performance of the file system not only under I/O load but also under significant CPU load.

Tests were run on three different setups: a small, a medium, and a large configuration. ReiserFS and Ext2 appear at the top of the pile for smaller and medium setups. Notably, XFS and JFS perform worse for smaller system configurations than the others, although XFS clearly appears to generate better numbers than JFS. It should be noted that XFS seems to scale well under a higher load; this was most evident in the large-system results, where XFS appears to offer the best overall results.

THINGS TO THINK ABOUT

Summarized by Bosko Milekic

SPEEDING UP KERNEL SCHEDULER BY REDUCING CACHE MISSES

Shuji Yamamura, Akira Hirai, Mitsuru Sato, Masao Yamamoto, Akira Naruse, and Kouichi Kumon, Fujitsu Labs

This was an interesting talk pertaining to the effects of cache coloring for task structures in the Linux kernel scheduler (Linux kernel 2.4.x). The speaker first presented some interesting benchmark numbers for the Linux scheduler, showing that as the number of processes on the task queue was increased, the performance decreased. The authors used some really nifty hardware to measure the number of bus transactions throughout their tests and were thus able to reasonably quantify the impact that cache misses had in the Linux scheduler.

Their experiments led them to implement a cache coloring scheme for task structures, which were previously aligned on 8KB boundaries and therefore were eventually being mapped to the same cache lines. This unfortunate placement of task structures in memory induced a significant number of cache misses as the number of tasks grew in the scheduler.

The implemented solution consisted of aligning task structures to more evenly distribute cached entries across the L2 cache. The result was inevitably fewer cache misses in the scheduler. Some negative effects were observed in certain situations; these were primarily due to more cache slots being used by task structure data in the scheduler, thus forcing data previously cached there to be pushed out.
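
The coloring idea itself is straightforward: rather than starting every task structure at the same 8KB alignment (so that the hot fields of every task land in the same cache sets), successive allocations are offset by different multiples of the cache-line size. The sketch below is a user-level illustration of the idea with made-up constants, not the authors' kernel patch.

    #include <stdlib.h>

    #define CACHE_LINE  64        /* bytes per cache line (illustrative)   */
    #define NUM_COLORS  16        /* number of distinct starting offsets   */
    #define TASK_SLAB   8192      /* original 8KB-aligned allocation unit  */

    /* Return an 8KB block whose usable start is shifted by a per-allocation
       "color" so consecutive task structures map to different cache sets.
       (A real implementation would record the block base for freeing.) */
    void *alloc_task_struct(void)
    {
        static unsigned int next_color = 0;
        size_t color = (next_color++ % NUM_COLORS) * CACHE_LINE;

        void *base = NULL;
        if (posix_memalign(&base, TASK_SLAB, TASK_SLAB) != 0)
            return NULL;
        return (char *)base + color;   /* task data must fit in the remainder */
    }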


WORK-IN-PROGRESS REPORTS

Summarized by Brennan Reynolds

RESOURCE VIRTUALIZATION TECHNIQUES FOR WIDE-AREA OVERLAY NETWORKS

Kartik Gopalan, Stony Brook University

This work addressed the issue of provisioning a maximum number of virtual overlay networks (VONs) with diverse quality-of-service (QoS) requirements on a single physical network. Each of the VONs is logically isolated from the others to ensure the QoS requirements. Gopalan mentioned several critical research issues with this problem that are currently being investigated. Dealing with how to provision the network at various levels (link, route, or path) and then enforce the provisioning at run-time is one of the toughest challenges. Currently, Gopalan has developed several algorithms to handle admission control, end-to-end QoS, route selection, scheduling, and fault tolerance in the network.

For more information, visit http://www.ecsl.cs.sunysb.edu/.

VISUALIZING SOFTWARE INSTABILITY

Jennifer Bevan, University of California, Santa Cruz

Detection of instability in software has typically been an afterthought. The point in the development cycle when the software is reviewed for instability is usually after it is difficult and costly to go back and perform major modifications to the code base. Bevan has developed a technique to allow the visualization of unstable regions of code that can be used much earlier in the development cycle. Her technique creates a time series of dependence graphs that include clusters and lines called fault lines. From the graphs, a developer is able to easily determine where the unstable sections of code are and proactively restructure them. She is currently working on a prototype implementation.

For more information, visit http://www.cse.ucsc.edu/~jbevan/evo_viz/.

RELIABLE AND SCALABLE PEER-TO-PEER WEB DOCUMENT SHARING

Li Xiao, College of William and Mary

The idea presented by Xiao would allow end users to share the content of their Web browser caches with neighboring Internet users. The rationale for this is that today's end users are increasingly connected to the Internet over high-speed links, and the browser caches are becoming large storage systems. Therefore, if an individual accesses a page which does not exist in their local cache, Xiao is suggesting that they first query other end users for the content before trying to access the machine hosting the original. This strategy does have some serious problems associated with it that still need to be addressed, including ensuring the integrity of the content and protecting the identity and privacy of the end users.

SEREL: FAST BOOTING FOR UNIX

Leni Mayo, Fast Boot Software

Serel is a tool that generates a visual representation of a UNIX machine's boot-up sequence. It can be used to identify the critical path and can show if a particular service or process blocks for an extended period of time. This information could be used to determine where the largest performance gains could be realized by tuning the order of execution at boot-up. Serel creates a dependency graph, expressed in XML, during boot-up; this graph is then used to create the visual representation. Currently the tool only works on POSIX-compliant systems, but Mayo stated that he would be porting it to other platforms. Other extensions that were mentioned included having the metadata include the use of shared libraries and monitoring the suspend/resume sequence of portable machines.

For more information, visit http://www.fastboot.org/.


BERKELEY DB XML

John Merrells, Sleepycat Software

Merrells gave a quick introduction and overview of the new XML library for Berkeley DB. The library specializes in storage and retrieval of XML content through XPath queries. The library allows for multiple containers per document and stores everything natively as XML. The user is also given a wide range of elements to create indices with, including edges, elements, text strings, or presence. The XPath tool consists of a query parser, generator, optimizer, and execution engine. To conclude his presentation, Merrells gave a live demonstration of the software.

For more information, visit http://www.sleepycat.com/xml/.

CLUSTER-ON-DEMAND (COD)

Justin Moore, Duke University

Modern clusters are growing at a rapid rate. Many have pushed beyond the 5000-machine mark, and deploying them results in large expenses as well as management and provisioning issues. Furthermore, if the cluster is "rented" out to various users, it is very time-consuming to configure it to a user's specs, regardless of how long they need to use it. The COD work presented creates dynamic virtual clusters within a given physical cluster. The goal was to have a provisioning tool that would automatically select a chunk of available nodes and install the operating system and middleware specified by the customer in a short period of time. This would allow a greater use of resources, since multiple virtual clusters can exist at once. By using a virtual cluster, the size can be changed dynamically. Moore stated that they have created a working prototype and are currently testing and benchmarking its performance.

For more information, visit http://www.cs.duke.edu/~justin/cod/.


CATACOMB

Elias Sinderson, University of California, Santa Cruz

Catacomb is a project to develop a database-backed DASL module for the Apache Web server. It was designed as a replacement for WebDAV. The initial release of the module only contains support for a MySQL database but could be extended to others. Sinderson briefly touched on the performance of the module: it was comparable to the mod_dav Apache module for all query types but search. The presentation was concluded with remarks about adding support for the lock method and including ACL specifications in the future.

For more information, visit http://ocean.cse.ucsc.edu/catacomb/.

SELF-ORGANIZING STORAGE

Dan Ellard, Harvard University

Ellard's presentation introduced a storage system that tunes itself based on the workload of the system, without requiring the intervention of the user. The intelligence is implemented as a virtual self-organizing disk that resides below any file system. The virtual disk observes the access patterns exhibited by the system and then attempts to predict what information will be accessed next. An experiment was done using an NFS trace at a large ISP on one of their email servers. Ellard's self-organizing storage system worked well, which he attributes to the fact that most of the files being requested were large email boxes. Areas of future work include exploration of the length and detail of the data collection stage, as well as the CPU impact of running the virtual disk layer.

VISUAL DEBUGGING

John Costigan, Virginia Tech

Costigan feels that the state of current debugging facilities in the UNIX world is not as good as it should be. He proposes the addition of several elements to programs like ddd and gdb. The first addition is including separate heap and stack components. Another is to display only certain fields of a data structure, with the ability to zoom in and out if needed. Finally, the debugger should provide the ability to visually present complex data structures, including linked lists, trees, etc. He said that a beta version of a debugger with these abilities is currently available.

For more information, visit http://infovis.cs.vt.edu/datastruct/.

IMPROVING APPLICATION PERFORMANCE THROUGH SYSTEM-CALL COMPOSITION

Amit Purohit, Stony Brook University

Web servers perform a huge number of context switches and much internal data copying during normal operation. These two elements can drastically limit the performance of an application, regardless of the hardware platform it is run on. The Compound System Call (CoSy) framework is an attempt to reduce the performance penalty for context switches and data copies. It includes a user-level library and several kernel facilities that can be used via system calls. The library provides the programmer with a complete set of memory-management functions. Performance tests using the CoSy framework showed a large saving for context switches and data copies; the only area where the savings between conventional libraries and CoSy were negligible was for very large file copies.

For more information, visit http://www.cs.sunysb.edu/~purohit/.

ELASTIC QUOTAS

John Oscar, Columbia University

Oscar began by stating that most resources are flexible, but that to date disks have not been. While approximately 80 percent of files are short-lived, disk quotas are hard limits imposed by administrators. The idea behind elastic quotas is to have a non-permanent storage area that each user can temporarily use. Oscar suggested the creation of an ehome directory structure to be used in deploying elastic quotas. Currently he has implemented elastic quotas as a stackable file system.

There were also several trends apparent in the data. Many users would ssh to their remote host (good) but then launch an email reader on the local machine which would connect via POP to the same host and send the password in clear text (bad). This situation can easily be remedied by tunneling protocols like POP through SSH, but it appeared that many people were not aware this could be done. While most of his comments on the use of the wireless network were negative, the list of passwords he had collected showed that people were indeed using strong passwords. His recommendations were to educate and encourage people to use protocols like IPSec, SSH, and SSL when conducting work over a wireless network, because you never know who else is listening.

SYSTRACE

Niels Provos, University of Michigan

How can one be sure the applications one is using actually do exactly what their developers said they do? Short answer: you can't. People today are using a large number of complex applications, which means it is impossible to check each application thoroughly for security vulnerabilities. There are tools out there that can help, though. Provos has developed a tool called systrace that allows a user to generate a policy of acceptable system calls a particular application can make. If the application attempts to make a call that is not defined in the policy, the user is notified and allowed to choose an action. Systrace includes an auto-policy generation mechanism, which uses a training phase to record the actions of all programs the user executes. If the user chooses not to be bothered by applications breaking policy, systrace allows default enforcement actions to be set. Currently systrace is implemented for FreeBSD and NetBSD, with a Linux version coming out shortly.
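
A systrace policy is essentially a list of system calls with predicates over their arguments and a resulting action. The fragment below is a hand-written illustration of the general shape of such a policy (for a hypothetical print spooler), not an auto-generated one, and the exact predicates are assumptions.

    Policy: /usr/bin/lpr, Emulation: native
        native-fsread:  filename eq "/etc/printcap"   then permit
        native-fsread:  filename match "/tmp/*"       then permit
        native-fswrite: filename match "/var/spool/*" then permit
        native-connect: sockaddr match "inet-*:515"   then permit
        native-fswrite: filename match "/etc/*"       then deny[eperm]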


For more information, visit http://www.citi.umich.edu/u/provos/systrace/.

VERIFIABLE SECRET REDISTRIBUTION FOR SURVIVABLE STORAGE SYSTEMS

Ted Wong, Carnegie Mellon University

Wong presented a protocol that can be used to re-create a file distributed over a set of servers, even if one of the servers is damaged. In his scheme, the user must choose how many shares the file is split into and the number of servers it will be stored across. The goal of this work is to provide persistent, secure storage of information even if it comes under attack. Wong stated that his design of the protocol was complete and that he is currently building a prototype implementation of it.

For more information, visit http://www.cs.cmu.edu/~tmwong/research/.

THE GURU IS IN SESSIONS

LINUX ON LAPTOP/PDA

Bdale Garbee, HP Linux Systems Operation

Summarized by Brennan Reynolds

This was a roundtable-like discussion, with Garbee acting as moderator. He is currently in charge of the Debian distribution of Linux and is involved in porting Linux to platforms other than i386. He has successfully helped port the kernel to the Alpha, SPARC, and ARM, and is currently working on a version to run on VAX machines.

A discussion was held on which file system is best for battery life. The Reiser and Ext3 file systems were discussed; people commented that when using Ext3 on their laptops the disk would never spin down and thus was always consuming a large amount of power. One suggestion was to use a RAM-based file system for any volume that requires a large number of writes and to only have the file system written to disk at infrequent intervals or when the machine is shut down or suspended.

A question was directed to Garbee about his knowledge of how the HP-Compaq merger would affect Linux. Garbee thought that the merger was good for Linux, both within the company and for the open source community as a whole. He stated that the new company would be the number one shipper of systems with Linux as the base operating system. He also fielded a question about the support of older HP servers and their ability to run Linux, saying that indeed Linux has been ported to them and he personally had several machines in his basement running it.

A question which sparked a large number of responses concerned problems people had with running Linux on mobile platforms. Most people actually did not have many problems at all. There were a few cases of a particular machine model not working, but there were only two widespread issues: docking stations and the new ACPI power management scheme. Docking stations still appear to be somewhat of a headache, but most people had developed ad hoc solutions for getting them to work. The ACPI power management scheme developed by Intel does not appear to have a quick solution. Ted Ts'o, one of the head kernel developers, stated that there are fundamental architectural problems with the 2.4 series kernel that do not easily allow ACPI to be added. However, he also stated that ACPI is supported in the newest development kernel, 2.5, and will be included in 2.6.

The final topic was the possibility of purchasing a laptop without paying any charge/tax for Microsoft Windows. The consensus was that currently this is not possible. Even if the machine comes with Linux or without any operating system, the vendors have contracts in place which require them to pay Microsoft for each machine they ship if that machine is capable of running a Microsoft operating system. The only way this will change is if the demand for other operating systems increases to the point that vendors renegotiate their contracts with Microsoft, and no one saw this happening in the near future.

LARGE CLUSTERS

Andrew Hume, AT&T Labs - Research

Summarized by Teri Lampoudi

This was a session on clusters, big data, and resilient computing. Given that the audience was primarily interested in clusters, and not necessarily big data or beating nodes to a pulp, the conversation revolved mainly around what it takes to make a heterogeneous cluster resilient. Hume also presented a FREENIX paper on his current cluster project, which explained in more detail much of what was abstractly claimed in the guru session.

Hume's use of the word "cluster" referred not to what I had assumed would be a Beowulf-type system, but to a loosely coupled farm of machines in which concurrency is much less of an issue than it would be in a Beowulf. In fact, the architecture Hume described was designed to deal with large amounts of independent transaction processing, essentially the process of billing calls, which requires no interprocess message passing or anything of the sort a parallel scientific application might.

Where does the large data come in? Since the transactions in question consist of large numbers of flat files, a mechanism for getting the files onto nodes and off-loading the results is necessary. In this particular cluster, named Ningaui, this task is handled by the "replication manager," which generated a large amount of interest in the audience. All management functions in the cluster, as well as job allocation, constitute services that are to be bid for and leased out to nodes. Furthermore, failures are handled by simply repeating the failed task until it is completed properly, if that is possible, and if it is not, then by looking through detailed logs to discover what went wrong. Logging and testing for the successful completion of each task are what make this model resilient. Every management operation is logged at some appropriate level of detail, generating 120MB/day, and every single file transfer is md5 checksummed. This was characterized by Hume as a "patient" style of management, where no one controls the cluster: scheduling behavior is emergent rather than stringently planned, and nodes are free to drop in and out of the cluster. A downside to this model is the increase in latency, offset by the fact that the cluster continues to function unattended outside of 8-to-5 support hours.

In response to questions about the projected scalability of such a scheme, Hume said the system would presumably scale from the present 60 nodes to about 100 without modifications, but that scaling higher would be a matter for further design. The guru session ended on a somewhat unrelated note regarding the problems of getting data off mainframes and onto Linux machines: issues of fixed- vs. variable-length encoding of data blocks resulting in useful information being stripped by FTP, and problems in converting COBOL copybooks to something useful on a C-based architecture. To reiterate Hume's claim, there are homemade solutions for some of these problems, and the reader should feel free to contact Mr. Hume for them.

WRITING PORTABLE APPLICATIONS

Nick Stoughton, MSB Consultants

Summarized by Josh Lothian

Nick Stoughton addressed a concern that developers are facing more and more: writing portable applications. He proposed that there is no such thing as a "portable application," only those that have been ported. Developers are still in search of the Holy Grail of applications programming: a write-once, compile-anywhere program. Languages such as Java are leading up to this, but virtual machines still have to be ported, paths will still be different, etc. Web-based applications are another area that could be promising for portable applications.

Nick went on to talk about the future of portability. The recently released POSIX 2001 standards are expected to help the situation. The 2001 revision expands the original POSIX spec into seven volumes. Even though POSIX 2001 is more specific than the previous release, Nick pointed out that even this standard allows for small differences where vendors may define their own behaviors, which can only hurt developers in the long run. Along these lines, it was pointed out that there may be room for automated tools that have the capability to check application code and determine the level of portability and standards compliance of the source.

NETWORK MANAGEMENT SYSTEM PERFORMANCE TUNING

Jeff Allen, Tellme Networks, Inc.

Summarized by Matt Selsky

Jeff Allen, author of Cricket, discussed network tuning. The basic idea of tuning is: measure, twiddle, and measure again. Cricket can be used for large installations, but it's overkill for smaller installations, for which MRTG is better suited. You don't need Cricket to do measurement; doing something like a shell script that dumps data to Excel is good, but you need to get some sort of measurement. Cricket was not designed for billing; it was meant to help answer questions.

Useful tools and techniques covered included looking for the difference between machines in order to identify the causes for the differences; measuring, but also thinking about what you're measuring and why; and remembering to use strace and tcpdump to identify problems. Problems repeat themselves; performance tuning requires an intricate understanding of all the layers involved to solve the problem and simplify the automation. (Having a computer science background helps but is not essential.) If you can reproduce the problem, you then should investigate each piece to find the bottleneck. If you can't reproduce the problem, then measure. You need to understand all the system interactions. Measurement can help determine whether things are actually slow or if the user is imagining it.

Troubleshooting begins with a hunch, but scientific processes are essential. You should be able to determine the event stream, or when each event occurs; having observation logs can help. Some hints provided include checking cron for periodic anomalies, slowing down /bin/rm to avoid I/O overload, and looking for unusual kernel CPU usage.

Also, try to use lightweight monitoring to reduce overhead. You don't need to monitor every resource on every system, but you should monitor those resources on those systems that are essential. Don't check things too often, since you can introduce overhead that way. Lots of information can be gathered from the /proc file system interface to the kernel.

BSD BOF

Host: Kirk McKusick, Author and Consultant

Summarized by Bosko Milekic

The BSD Birds of a Feather session at USENIX started off with five quick and informative talks and ended with an exciting discussion of licensing issues.

Christos Zoulas, for the NetBSD Project, went over the primary goals of the NetBSD Project (namely portability, a clean architecture, and security) and then proceeded to briefly discuss some of the challenges that the project has encountered. The successes and areas for improvement of 1.6 were then examined. NetBSD has recently seen various improvements in its cross-build system, packaging system, and third-party tool support. For what concerns the kernel, a zero-copy TCP/UDP implementation has been integrated, and new pmap code for arm32 has been introduced, as have some other rather major changes to various subsystems. More information is available in Christos's slides, which are now available at http://www.netbsd.org/gallery/events/usenix2002/.

Theo de Raadt of the OpenBSD Project presented a rundown of various new things that the OpenBSD and OpenSSH teams have been looking at. Theo's talk included an amusing description (and pictures) of some of the events that transpired during OpenBSD's hack-a-thon, which occurred the week before USENIX '02. Finally, Theo mentioned that, following complications with the IPFilter license, the OpenBSD team had done a fairly extensive license audit.

Robert Watson of the FreeBSD Project brought up various examples of FreeBSD being used in the real world. He went on to describe a wide variety of changes that will surface with the upcoming FreeBSD release 5.0. Notably, 5.0 will introduce a re-architecting of the kernel aimed at providing much more scalable support for SMP machines. 5.0 will also include an early implementation of KSE, FreeBSD's version of scheduler activations, large improvements to pccard, and significant framework bits from the TrustedBSD project.

Mike Karels from Wind River Systems presented an overview of how BSD/OS has been evolving following the Wind River Systems acquisition of BSDi's software assets. Ernie Prabhakar from Apple discussed Apple's OS X and its success on the desktop. He explained how OS X aims to bring BSD to the desktop while striving to maintain a strong and stable core, one that is heavily based on BSD technology.

The BoF finished with a discussion on licensing; specifically, some folks questioned the impact of Apple's licensing for code from their core OS (the Darwin Project) on the BSD community as a whole.

CLOSING SESSION

HOW FLIES FLY

Michael H. Dickinson, University of California, Berkeley

Summarized by JD Welch

In this lively and offbeat talk, Dickinson discussed his research into the mechanisms of flight in the fruit fly (Drosophila melanogaster) and related the autonomic behavior to that of a technological system. Insects are the most successful organisms on earth, due in no small part to their early adoption of flight. Flies travel through space on straight trajectories interrupted by saccades, jumpy turning motions somewhat analogous to the fast, jumpy movements of the human eye. Using a variety of monitoring techniques, including high-speed video in a "virtual reality flight arena," Dickinson and his colleagues have observed that flies respond to visual cues to decide when to saccade during flight. For example, more visual texture makes flies saccade earlier (i.e., further away from the arena wall), described by Dickinson as a "collision control algorithm."

Through a combination of sensory inputs, flies can make decisions about their flight path. Flies possess a visual "expansion detector" which, at a certain threshold, causes the animal to turn in a certain direction. However, expansion in front of the fly sometimes causes it to land. How does the fly decide? Using the virtual reality device to replicate various visual scenarios, Dickinson observed that the flies fixate on vertical objects that move back and forth across the field. Expansion on the left side of the animal causes it to turn right, and vice versa, while expansion directly in front of the animal triggers the legs to flare out in preparation for landing.

If the eyes detect these changes, how are the responses implemented? The flies' wings have "power muscles" controlled by mechanical resonance (as opposed to the nervous system), which drive the wings, combined with neurally activated "steering muscles," which change the configuration of the wing joints. Subtle variations in the timing of impulses correspond to changes in wing movement, controlled by sensors in the "wing-pit." A small wing-like structure, the haltere, controls equilibrium by beating constantly.

Voluntary control is accomplished by the halteres, whose signals can interfere with the autonomic control of the stroke cycle. The halteres have "steering muscles" as well, and information derived from the visual system can turn off the haltere or trick it into a "virtual" problem requiring a response.

Dickinson has also studied the aerodynamics of insect flight using a device called the "Robo Fly," an oversized mechanical insect wing suspended in a large tank of mineral oil. Interestingly, Dickinson observed that the average lift generated by the flies' wings is under their body weight; the flies use three mechanisms to overcome this, including rotating the wings (rotational lift).

Insects are extraordinarily robust creatures, and because Dickinson analyzed the problem in a systems-oriented way, these observations and analysis are immediately applicable to technology: the response system can be used as an efficient search algorithm for control systems in autonomous vehicles, for example.

THE AFS WORKSHOP

Summarized by Garry Zacheiss, MIT

Love Hornquist-Astrand of the Arla development team presented the Arla status report. Arla 0.35.8 has been released. Scheduled for release soon are improved support for Tru64 UNIX, MacOS X, and FreeBSD; improved volume handling; and implementation of more of the vos and pts subcommands. It was stressed that MacOS X is considered an important platform and that a GUI configuration manager for the cache manager, a GUI ACL manager for the Finder, and a graphical login that obtains Kerberos tickets and AFS tokens at login time were all under development.

Future goals planned for the underlying AFS protocols include GSSAPI/SPNEGO support for Rx, performance improvements to Rx, an enhanced disconnected mode, and IPv6 support for Rx; an experimental patch is already available for the latter. Future Arla-specific goals include improved performance, partial file reading, and increased stability on several platforms. Work is also in progress on the RXGSS protocol extensions (integrating GSSAPI into Rx); a partial implementation exists, and work continues as developers find time.

Derrick Brashear of the OpenAFS development team presented the OpenAFS status report. OpenAFS 1.2.5 was released immediately prior to the conference; it fixed a remotely exploitable denial-of-service attack in several OpenAFS platforms, most notably IRIX and AIX. Future work planned for OpenAFS includes better support for MacOS X, including working around Finder interaction issues. Better support for the BSDs is also planned: FreeBSD has a partially implemented client, while NetBSD and OpenBSD have only server programs available right now. AIX 5, Tru64 5.1A, and MacOS X 10.2 (aka Jaguar) are all planned for the future. Other planned enhancements include nested groups in the ptserver (code donated by the University of Michigan awaits integration), disconnected AFS, and further work on a native Windows client. Derrick stated that the guiding principles of OpenAFS were to maintain compatibility with IBM AFS, support new platforms, ease the administrative burden, and add new functionality.

AFS performance was discussed. The openafs.org cell consists of two file servers, one in Stockholm, Sweden, and one in Pittsburgh, PA. AFS works reasonably well over trans-Atlantic links, although OpenAFS clients don't determine which file server to talk to very effectively. Arla clients use RTTs to the server to determine the optimal file server to fetch replicated data from; modifications to OpenAFS to support this behavior in the future are desired.

Jimmy Engelbrecht and Harald Barth of KTH discussed their AFSCrawler script, written to determine how many AFS cells and clients were in the world, what implementations/versions they were (Arla vs. IBM AFS vs. OpenAFS), and how much data was in AFS. The script unfortunately triggered a bug in IBM AFS 3.6-derived code, causing some clients to panic while handling a specific RPC. This has since been fixed in OpenAFS 1.2.5 and the most recent IBM AFS patch level, and all AFS users are strongly encouraged to upgrade. No release of Arla is vulnerable to this particular denial-of-service attack. There was an extended discussion of the usefulness of this exploration. Many sites believed this was useful information and that such scanning should continue in the future, but only on an opt-in basis.

Many sites face the problem of managing Kerberos/AFS credentials for batch-scheduled jobs. Specifically, most batch processing software needs to be modified to forward tickets as part of the batch submission process, renew tickets and tokens while the job is in the queue and for the lifetime of the job, and properly destroy credentials when the job completes. Ken Hornstein of NRL was able to pay a commercial vendor to support Kerberos 4/5 credential management in their product, although they did not implement AFS token management. MIT has implemented some of the desired functionality in OpenPBS and might be able to make it available to other interested sites.

Tools to simplify AFS administration were discussed, including:

AFS Balancer: A tool written by CMU to automate the process of balancing disk usage across all servers in a cell. Available from ftp://ftp.andrew.cmu.edu/pub/AFS-Tools/balance-1.1b.tar.gz.

Themis: Themis is KTH's enhanced version of the AFS tool "package" for updating files on local disk from a central AFS image. KTH's enhancements include allowing the deletion of files, simplifying the process of adding a file, and allowing the merging of multiple rule sets for determining which files are updated. Themis is available from the Arla CVS repository.

Stanford was presented as an example of a large AFS site. Stanford's AFS usage consists of approximately 1.4TB of data in AFS, in the form of approximately 100,000 volumes; 3.3TB of storage is available in their primary cell, ir.stanford.edu. Their file servers consist entirely of Solaris machines running a combination of Transarc 3.6 patch-level 3 and OpenAFS 1.2.x, while their database servers run OpenAFS 1.2.2. Their cell consists of 25 file servers, using a combination of EMC and Sun StorEdge hardware. Stanford continues to use the kaserver for their authentication infrastructure, with future plans to migrate entirely to an MIT Kerberos 5 KDC.

Stanford has approximately 3,400 clients on their campus, not including SLAC (Stanford Linear Accelerator); approximately 2,100 AFS clients from outside Stanford contact their cell every month. Their supported clients are almost entirely IBM AFS 3.6 clients, although they plan to release OpenAFS clients soon. Stanford currently supports only UNIX clients. There is some on-campus presence of Windows clients, but Stanford has never publicly released or supported it. They do intend to release and support the MacOS X client in the near future.

All Stanford students, faculty, and staff are assigned AFS home directories, with a default quota of 50MB, for a total of approximately 550GB of user home directories. Other significant uses of AFS storage are data volumes for workstation software (400GB) and volumes for course Web sites and assignments (100GB).

AFS usage at Intel was also presented. Intel has been an AFS site since 1994. They had bad experiences with the IBM 3.5 Linux client; their experience with OpenAFS on Linux 2.4.x kernels has been much better. They use and are satisfied with the OpenAFS IA64 Linux port. Intel has hundreds of OpenAFS 1.2.3 and 1.2.4 clients in many production cells accessing data stored on IBM AFS file servers, and they have not encountered any interoperability issues. Intel has some concerns about OpenAFS: they would like to purchase commercial support for OpenAFS and to see OpenAFS support for HP-UX on both PA-RISC and Itanium hardware. HP-UX support is currently unavailable due to a specific HP-UX header file being unavailable from HP; this may be available soon. Intel has not yet committed to migrating their file servers to OpenAFS and is unsure if they will do so without commercial support.

Backups are a traditional topic of discussion at AFS workshops, and this time was no exception. Many users complain that the traditional AFS backup tools ("backup" and "butc") are complex and difficult to automate, requiring many home-grown scripts and much user intervention for error recovery. An additional complaint was that the traditional AFS tools do not support file- or directory-level backups and restores; data must be backed up and restored at the volume level.

Mitch Collinsworth of Cornell presented work to make AMANDA, the free backup software from the University of Maryland, suitable for AFS backups. Using AMANDA for AFS backups allows one to share AFS backup tapes with non-AFS backups, easily run multiple backups in parallel, automate error recovery, and provide a robust degraded mode that prevents tape errors from stopping backups altogether. Their implementation allows for full volume restores as well as individual directory and file restores. They have finished coding this work and are in the process of testing and documenting it.

Peter Honeyman of CITI at the University of Michigan spoke about work he has proposed to replace Rx with RPCSEC GSS in OpenAFS; this would allow AFS to use a TCP-based transport mechanism rather than the UDP-based Rx, and possibly gain better congestion control, dynamic adaptation, and fragmentation avoidance as a result. RPCSEC GSS uses the GSSAPI to authenticate SUN ONC RPC. RPCSEC GSS is transport agnostic, provides strong security, is a developing Internet standard, and has multiple open source implementations. Backward compatibility with existing AFS servers and clients is an important goal of this project.


circuitous, possibly (though not necessarily) because of routing anomalies. It is also shown that the minimum delay between end hosts is strongly correlated with the linearized distance of the path.

Geographic information can be used to study various aspects of wide-area Internet paths that traverse multiple ISPs. It was found that end-to-end Internet paths tend to be more circuitous than intra-ISP paths, the cause for this being the peering relationships between ISPs. Also, paths traversing substantial distances within two or more ISPs tend to be more circuitous than paths largely traversing only a single ISP. Another finding is that ISPs generally employ hot-potato routing.

Geographic information can also be used to capture the fact that two seemingly unrelated routers can be susceptible to correlated failures. By using the geographic information of routers, we can construct a geographic topology of an ISP. Using this, we can find the tolerance of an ISP's network to the total network failure in a geographic region.

For further information, contact lakme@cs.berkeley.edu.

PROVIDING PROCESS ORIGIN INFORMATION TO AID IN NETWORK TRACEBACK

Florian P. Buchholz, Purdue University; Clay Shields, Georgetown University

Network traceback is currently limited because host audit systems do not maintain enough information to match incoming network traffic to outgoing network traffic. The talk presented an alternative: assigning origin information to every process and logging it during interactive login creation.

The current implementation concentrates mainly on interactive sessions, in which an event is logged when a new connection is established using SSH or Telnet. The method is effective and could successfully determine stepping stones and reliably detect the source of a DDoS attack. The speaker then talked about the related work done in this area, which led to discussion of the latest development in the area of "host causality."

The speaker ended with the limitations and future directions of the framework. This framework could be extended and could find many applications in the area of security. The current implementation doesn't take care of startup scripts and cron jobs, but incorporating the origin information in the file system could solve this problem. In the current implementation, logging is implemented only as a proof of concept. It could be made safe in many ways, and this could be another important aspect of future work.

PROGRAMMING

Summarized by Amit Purohit

CYCLONE: A SAFE DIALECT OF C

Trevor Jim, AT&T Labs – Research; Greg Morrisett, Dan Grossman, Michael Hicks, James Cheney, Yanling Wang, Cornell University

Cyclone is designed to provide protection against attacks such as buffer overflows, format string attacks, and memory management errors. The current structure of C allows programmers to write vulnerable programs. Cyclone extends C so that it has the safety guarantee of Java while keeping C syntax, types, and semantics untouched.

The Cyclone compiler performs static analysis of the source code and inserts runtime checks into the compiled output at places where the analysis cannot determine that the code execution will not violate safety constraints. Cyclone imposes many restrictions to preserve safety, such as NULL checks; these checks do not exist in normal C.

The speaker then talked about some sample code written in Cyclone and how it tackles safety problems. Porting an existing C application to Cyclone is pretty easy, with fewer than 10% of the code requiring change. The current implementation concentrates more on safety than on performance; hence the performance penalty is substantial. Cyclone was able to find many lingering bugs. The final step of the project will be to write a compiler to convert a normal C program to an equivalent Cyclone program.
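
As a concrete illustration, the following is plain C (not Cyclone code) showing the class of bugs that Cyclone's static analysis and inserted runtime checks are aimed at; the buffer size and names are invented for illustration.

#include <stdio.h>
#include <string.h>

/* Plain C: both of these compile silently, yet each one is a
 * memory-safety bug of the kind Cyclone targets.               */
void greet(const char *name) {
    char buf[16];
    strcpy(buf, name);          /* buffer overflow if name has >= 16 chars */
    printf("hello %s\n", buf);
}

int first_char(const char *s) {
    return s[0];                /* NULL dereference if s == NULL           */
}

int main(void) {
    greet("a-rather-long-user-name");   /* overflows buf          */
    return first_char(NULL);            /* crashes, at best       */
}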

COOPERATIVE TASK MANAGEMENT WITHOUT MANUAL STACK MANAGEMENT

Atul Adya, Jon Howell, Marvin Theimer, William J. Bolosky, John R. Douceur, Microsoft Research

The speaker described the definitions and motivations behind the distinct concepts of task management, stack management, I/O management, conflict management, and data partitioning. Conventional concurrent programming uses preemptive task management and exploits the automatic stack management of a standard language. In the second approach, cooperative tasks are organized as event handlers that yield control by returning control to the event scheduler, manually unrolling their stacks. In this project they have adopted a hybrid approach that makes it possible for both stack management styles to coexist. Thus, programmers can code assuming one or the other of these stack management styles will operate, depending upon the application. The speaker also gave a detailed example of how to use their mechanism to switch between the two styles.

The talk then continued with the implementation. They were able to preempt many subtle concurrency problems by using cooperative task management. Paying a cost up-front to reduce a subtle race condition proved to be a good investment.

Though the choice of task management is fundamental, the choice of stack management can be left to individual taste. This project enables use of any type of stack management in conjunction with cooperative task management.

IMPROVING WAIT-FREE ALGORITHMS FOR INTERPROCESS COMMUNICATION IN EMBEDDED REAL-TIME SYSTEMS

Hai Huang, Padmanabhan Pillai, Kang G. Shin, University of Michigan

The main characteristic of a real-time, time-sensitive system is its predictable response time. But concurrency management creates hurdles to achieving this goal because of the use of locks to maintain consistency. To solve this problem, many wait-free algorithms have been developed, but these are typically high-cost solutions. By taking advantage of the temporal characteristics of the system, however, the time and space overhead can be reduced.

The speaker presented an algorithm for temporal concurrency control and applied this technique to improve three wait-free algorithms. A single-writer multiple-reader wait-free algorithm and a double-buffer algorithm were proposed.
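
The paper's exact algorithms were not shown in the talk; the following is only a minimal C sketch of the classic single-writer double-buffer idea such work builds on. Its safety rests on a temporal guarantee — a reader must finish its copy before the writer publishes twice — which is exactly the kind of timing property a real-time system can bound.

#include <stdatomic.h>
#include <string.h>

#define RECORD_WORDS 64   /* illustrative record size */

typedef struct {
    int data[2][RECORD_WORDS];  /* two copies of the shared record           */
    _Atomic int latest;         /* index of the most recently published copy */
} dbuf_t;

/* Single writer: fill the idle copy, then flip the index.  Never blocks. */
void writer_publish(dbuf_t *b, const int *src) {
    int next = 1 - atomic_load(&b->latest);
    memcpy(b->data[next], src, sizeof(b->data[next]));
    atomic_store(&b->latest, next);
}

/* Readers: copy out whichever buffer was last published.  Never blocks.
 * Correct only if this copy completes before the writer flips twice.    */
void reader_snapshot(dbuf_t *b, int *dst) {
    int cur = atomic_load(&b->latest);
    memcpy(dst, b->data[cur], sizeof(b->data[cur]));
}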

Using their transformation mechanism, they achieved an improvement of 17–66% in ACET and a 14–70% reduction in memory requirements for IPC algorithms. This mechanism is extensible and can be applied to other non-blocking IPC algorithms as well. Future work involves reducing the synchronizing overheads in more general IPC algorithms with multiple-writer semantics.

MOBILITY

Summarized by Praveen Yalagandula

ROBUST POSITIONING ALGORITHMS FOR DISTRIBUTED AD-HOC WIRELESS SENSOR NETWORKS

Chris Savarese, Jan Rabaey, University of California, Berkeley; Koen Langendoen, Delft University of Technology

The Pico Radio Network comprises more than a hundred sensors, monitors, and actuators equipped with wireless connectivity, with the following properties: (1) no infrastructure; (2) computation in a distributed fashion instead of centralized computation; (3) dynamic topology; (4) limited radio range; (5) nodes that can act as repeaters; (6) devices of low cost and with low power; and (7) sparse anchor nodes – nodes with GPS capability. A positioning algorithm determines the geographical position of each node in the network.

In two-dimensional space, each node needs three reference positions to estimate its geographical position. There are two problems that make positioning difficult in the Pico Radio Network setting: (1) a sparse anchor node problem, and (2) a range error problem.

The two-phase approach that the authors have taken in solving the problem consists of (1) a hop-terrain algorithm – in this first phase, each node roughly guesses its location by the distance calculated using multi-hops to the anchor points – and (2) a refinement algorithm – in this second phase, each node uses its neighbors' positions to refine its own position estimate. To guarantee convergence, this approach uses confidence metrics, assigning a value of 1.0 for anchor nodes and 0.1 to start for other nodes, and increasing these with each iteration.
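
The update rule below is not the authors' exact refinement algorithm; it is a hypothetical mass-spring-style sketch in C of the second phase's idea: nudge the current estimate so that distances to neighbors better match the measured ranges, weighting each neighbor by its confidence.

#include <math.h>

typedef struct { double x, y, conf; } node_est;   /* position estimate + confidence */

/* One refinement iteration for one node, given n neighbours and the
 * measured range to each.  Anchors keep conf = 1.0 and never move.   */
void refine_step(node_est *self, const node_est *nbr, const double *range, int n) {
    double dx = 0.0, dy = 0.0, wsum = 0.0;
    for (int i = 0; i < n; i++) {
        double ex = self->x - nbr[i].x, ey = self->y - nbr[i].y;
        double d  = sqrt(ex * ex + ey * ey);
        if (d == 0.0) continue;
        double err = d - range[i];        /* >0: estimate is too far from neighbour */
        dx -= nbr[i].conf * err * ex / d; /* pull toward (or push away from) it     */
        dy -= nbr[i].conf * err * ey / d;
        wsum += nbr[i].conf;
    }
    if (wsum > 0.0) {
        self->x += dx / wsum;
        self->y += dy / wsum;
    }
    if (self->conf < 1.0)
        self->conf += 0.1;                /* confidence grows with each iteration   */
}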

A simulation tool, OMNeT++, was used for both phases and in various scenarios. The results show that they achieved position errors of less than 33% in a scenario with 5% anchor nodes, an average connectivity of 7, and 5% range measurement error.

Guidelines for anchor node deployment are high connectivity (>10), a reasonable fraction of anchor nodes (>5%), and, for anchor placement, covered edges.

APPLICATION-SPECIFIC NETWORK MANAGEMENT FOR ENERGY-AWARE STREAMING OF POPULAR MULTIMEDIA FORMATS

Surendar Chandra, University of Georgia, and Amin Vahdat, Duke University

The main hindrance in supporting the increasing demand for mobile multimedia on PDAs is the battery capacity of the small devices. The idle-time power consumption is almost the same as the receive power consumption on typical wireless network cards (1319mW vs. 1425mW), while the sleep state consumes far less power (177mW). To let the system consume energy proportional to the stream quality, the network card should transition to sleep state aggressively between each packet.

The speaker presented previous work where different multimedia streams were studied for PDAs: MS Media, Real Media, and QuickTime. It was found that if the inter-packet gap is predictable, then huge savings are possible – for example, about 80% savings in MS Media streams with few packet losses. The limitation of the IEEE 802.11b power-save mode comes into play when there are two or more nodes, producing a delay between beacon time and when the node receives/sends packets. This badly affects the streams, since higher energy is consumed while waiting.

The authors propose traffic shaping for energy conservation, where this is done by proxy such that packets arrive at regular intervals. This is achieved using a local proxy in the access point and a client-side proxy in the mobile host. Simulations show that traffic shaping reduces energy consumption and also reveal that the higher bandwidth streams have a lower energy metric (mJ/kB).

More information is at http://greenhouse.cs.uga.edu.

CHARACTERIZING ALERT AND BROWSE SERVICES OF MOBILE CLIENTS

Atul Adya, Paramvir Bahl, and Lili Qiu, Microsoft Research

Even though there is a dramatic increase in Internet access from wireless devices, there are not many studies done on characterizing this traffic. In this paper, the authors characterize the traffic observed on the MSN Web site with both notification and browse traces.


Around 33 million browsing accesses and about 325 million notification entries are present in the traces. Three types of analyses are done for each one of these two services: content analysis, concerning the most popular content categories and their distribution; popularity analysis, or the popularity distribution of documents; and user behavior analysis.

The analysis of the notification logs shows that document access rates follow a Zipf-like distribution, with most of the accesses concentrated on a small number of messages, and the accesses exhibit geographical locality – users from the same locality tend to receive similar notification content. The browser log analysis shows that a smaller set of URLs are accessed most times, though the access pattern does not fit any Zipf curve, and the highly accessed URLs remain stable. A correlation study between the notification and browsing services shows that wireless users have a moderate correlation of 0.12.

FREENIX TRACK SESSIONS

BUILDING APPLICATIONS

Summarized by Matt Butner

INTERACTIVE 3-D GRAPHICS APPLICATIONS FOR TCL

Oliver Kersting, Jürgen Döllner, Hasso Plattner Institute for Software Systems Engineering, University of Potsdam

The integration of 3-D image rendering functionality into a scripting language permits interactive and animated 3-D development and application without the formalities and precision demanded by low-level C/C++ graphics and visualization libraries. The large and complex C++ API of the Virtual Rendering System (VRS) can be combined with the conveniences of the Tcl scripting language. The mapping of class interfaces is done via an automated process and generates respective wrapper classes, all of which ensures complete API accessibility and functionality without any significant performance penalty. The general C++-to-Tcl mapper SWIG grants the necessary object and hierarchical abilities without the use of object-oriented Tcl extensions. The final API mapping techniques address C++ features such as classes, overloaded methods, enumerations, and inheritance relations. All are implemented in a proof-of-concept that maps the complete C++ API of the VRS to Tcl and are showcased in a complete interactive 3-D map system.

THE AGFL GRAMMAR WORK LAB

Cornelis H.A. Coster, Erik Verbruggen, University of Nijmegen (KUN)

The growth and implementation of Natural Language Processing (NLP) is the cornerstone of the continued evolution and implementation of truly intelligent search machines and services. In part due to the growing collections of computer-stored, human-readable documents in the public domain, the implementation of linguistic analysis will become necessary for desirable precision and resolution. Subsequently, such tools and linguistic resources must be released into the public domain, and they have done so with the release of the AGFL Grammar Work Lab under the GNU Public License as a tool for linguistic research and the development of NLP-based applications.

The AGFL (Affix Grammars over a Finite Lattice) Grammar Work Lab meshes context-free grammars with finite set-valued features that are acceptable to a range of languages. In computer science terms, "Syntax rules are procedures with parameters and a non-deterministic execution." The English Phrases for Information Retrieval (EP4IR), released with the AGFL-GWL as a robust grammar of English, is an AGFL-GWL-generated English parser that outputs "Head/Modified" frames. The sentences "CompanyX sponsored this conference" and "This conference was sponsored by CompanyX" both generate [CompanyX[sponsored conference]]. The hope is that public availability of such tools will encourage further development of grammar and lexical software systems.

SWILL: A SIMPLE EMBEDDED WEB SERVER LIBRARY

Sotiria Lampoudi, David M. Beazley, University of Chicago

This paper won the FREENIX Best Student Paper award.

SWILL (Simple Web Interface Link Library) is a simple Web server in the form of a C library whose development was motivated by a wish to give cool applications an interface to the Web. The SWILL library provides a simple interface that can be efficiently implemented for tasks that vary from flexible Web-based monitoring to software debugging and diagnostics. Though originally designed to be integrated with high-performance scientific simulation software, the interface is generic enough to allow for unbounded uses. SWILL is a single-threaded server relying upon non-blocking I/O through the creation of a temporary server which services I/O requests. SWILL does not provide SSL or cryptographic authentication but does have HTTP authentication abilities.
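
A rough sketch of how an application might embed such a server follows; the header name, function names, and signatures here are assumptions reconstructed from the talk's description, not taken from the SWILL distribution.

/* Hypothetical sketch of embedding a SWILL-style server in a simulation.
 * swill_init/swill_handle/swill_serve and their signatures are assumed,
 * not copied from the actual SWILL headers.                              */
#include <stdio.h>
#include "swill.h"            /* assumed header name */

static double temperature = 20.0;    /* some internal application state */

/* Handler: writes the current state as a page served at /status.txt */
void status_page(FILE *out, void *clientdata) {
    fprintf(out, "temperature = %f\n", temperature);
}

int main(void) {
    swill_init(8080);                           /* listen on port 8080       */
    swill_handle("status.txt", status_page, 0); /* register a monitoring URL */
    for (;;) {
        /* ... advance the simulation ... */
        temperature += 0.01;
        swill_serve();                          /* service pending HTTP
                                                   requests (the real API may
                                                   distinguish blocking and
                                                   polling variants)         */
    }
    return 0;
}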

A fantastic feature of SWILL is its support for SPMD-style parallel applications which utilize MPI, proving valuable for Beowulf clusters and large parallel machines. Another practical application was the implementation of SWILL in a modified Yalnix emulator by University of Chicago operating systems courses, which utilized the added Yalnix functionality for OS development and debugging. SWILL requires minimal memory overhead and relies upon the HTTP/1.0 protocol.

NETWORK PERFORMANCE

Summarized by Florian Buchholz

LINUX NFS CLIENT WRITE PERFORMANCE

Chuck Lever, Network Appliance; Peter Honeyman, CITI, University of Michigan

Lever introduced a benchmark to measure NFS client write performance. Client performance is difficult to measure due to hindrances such as poor hardware or bandwidth limitations. Furthermore, measuring application performance does not identify weaknesses specifically at the client side. Thus, a benchmark was developed trying to exercise only data transfers in one direction between server and application. For this purpose, the benchmark was based on the block sequential write portion of the Bonnie file system benchmark. Once a benchmark for NFS clients is established, it can be used to improve client performance.

The performance measurements were performed with an SMP Linux client and both a Linux NFS server and a Network Appliance F85 filer. During testing, a periodic jump in write latency was discovered. This was due to a rather large number of pending write operations that were scheduled to be written after certain threshold values were exceeded. By introducing a separate daemon that flushes the cached write requests, the spikes could be eliminated, but as a result the average latency grows over time. The problem could be traced to a function that scans a linked list of write requests. After a hashtable was added to improve lookup performance, the latency improved considerably.
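
A minimal sketch of the kind of change described — replacing a linear scan over all pending write requests with a hashed lookup. Structure and field names are invented for illustration, not taken from the Linux NFS client source.

#define NREQ_BUCKETS 256          /* illustrative bucket count */

struct write_req {
    unsigned long offset;          /* request key (e.g., page offset)  */
    struct write_req *hash_next;   /* chain within one hash bucket     */
    /* ... payload omitted ... */
};

static struct write_req *buckets[NREQ_BUCKETS];

static unsigned hash_offset(unsigned long offset) {
    return (unsigned)(offset * 2654435761UL) % NREQ_BUCKETS;
}

/* Before: each lookup walked one long list of every pending request,
 * O(pending).  After: only one short bucket chain is walked.          */
struct write_req *find_req(unsigned long offset) {
    struct write_req *r = buckets[hash_offset(offset)];
    while (r && r->offset != offset)
        r = r->hash_next;
    return r;
}

void insert_req(struct write_req *r) {
    unsigned b = hash_offset(r->offset);
    r->hash_next = buckets[b];
    buckets[b] = r;
}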

The improved client was then used to measure throughput against the two servers. A discrepancy between the Linux server and the filer test was noticed, and the reason for that was traced back to a global kernel lock that was unnecessarily held when accessing the network stack. After correcting this, performance improved further. However, the measurements showed that a client may run slower when paired with fast servers on fast networks. This is due to heavy client interrupt loads, more network processing on the client side, and more global kernel lock contention.

The source code of the project is available at http://www.citi.umich.edu/projects/nfs-perf/patches/.

A STUDY OF THE RELATIVE COSTS OF NETWORK SECURITY PROTOCOLS

Stefan Miltchev and Sotiris Ioannidis, University of Pennsylvania; Angelos Keromytis, Columbia University

With the increasing need for security and integrity of remote network services, it becomes important to quantify the communication overhead of IPSec and compare it to alternatives such as SSH, SCP, and HTTPS.

For this purpose, the authors set up three testing networks: a direct link, two hosts separated by two gateways, and three hosts connecting through one gateway. Protocols were compared in each setup, and manual keying was used to eliminate connection setup costs. For IPSec, the different encryption algorithms AES, DES, 3DES, hardware DES, and hardware 3DES were used. In detail, FTP was compared to SFTP, SCP, and FTP over IPSec; HTTP to HTTPS and HTTP over IPSec; and NFS to NFS over IPSec and local disk performance.

The result of the measurements was that IPSec outperforms other popular encryption schemes. Overall, unencrypted communication was fastest, but in some cases, like FTP, the overhead can be small. The use of crypto hardware can significantly improve performance. For future work, the inclusion of setup costs, hardware-accelerated SSL, SFTP, and SSH was mentioned.

CONGESTION CONTROL IN LINUX TCP

Pasi Sarolahti, University of Helsinki; Alexey Kuznetsov, Institute for Nuclear Research at Moscow

Having attended the talk and read the paper, I am still unclear about whether the authors are merely describing the design decisions of TCP congestion control or whether they are actually the creators of that part of the Linux code. My guess leans toward the former.

In the talk, the speaker compared the TCP protocol congestion control measures according to IETF and RFC specifications with the actual Linux implementation, which does conform to the basic principles but nevertheless has differences. A specific emphasis was placed on retransmission mechanisms and the congestion window. Also, several TCP enhancements – the NewReno algorithm, Selective ACKs (SACK), Forward ACKs (FACK) – were discussed and compared.

In some instances, Linux does not conform to the IETF specifications. The fast recovery does not fully follow RFC 2582, since the threshold for triggering retransmit is adjusted dynamically and the congestion window's size is not changed. Also, the round-trip-time estimator and the RTO calculation differ from RFC 2988, since it uses more conservative RTT estimates and a minimum RTO of 200ms. The performance measures showed that with the additional Linux-specific features enabled, slightly higher throughput, more steady data flow, and fewer unnecessary retransmissions can be achieved.
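
For reference, RFC 2988 derives the retransmission timeout from a smoothed RTT and an RTT variance and clamps it to a minimum of one second; the sketch below shows that calculation with the lower 200ms floor the talk attributed to Linux. It is an illustration of the formulas, not code from the Linux stack.

/* RFC 2988-style RTO estimation (illustrative; floating point for clarity).
 *   rttvar = 3/4 * rttvar + 1/4 * |srtt - sample|
 *   srtt   = 7/8 * srtt   + 1/8 * sample
 *   rto    = srtt + 4 * rttvar, clamped to a minimum                       */
struct rtt_state { double srtt, rttvar; int initialized; };

double update_rto(struct rtt_state *s, double rtt_sample, double min_rto) {
    if (!s->initialized) {
        s->srtt = rtt_sample;
        s->rttvar = rtt_sample / 2.0;
        s->initialized = 1;
    } else {
        double err = s->srtt - rtt_sample;
        if (err < 0) err = -err;
        s->rttvar = 0.75 * s->rttvar + 0.25 * err;   /* update rttvar first */
        s->srtt   = 0.875 * s->srtt + 0.125 * rtt_sample;
    }
    double rto = s->srtt + 4.0 * s->rttvar;
    return rto < min_rto ? min_rto : rto;   /* RFC 2988: 1.0s; Linux: 0.2s */
}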

XTREME XCITEMENT

Summarized by Steve Bauer

THE FUTURE IS COMING: WHERE THE X WINDOW SYSTEM SHOULD GO

Jim Gettys, Compaq Computer Corp.

Jim Gettys, one of the principal authors of the X Window System, outlined the near-term objectives for the system, primarily focusing on the changes and infrastructure required to enable replication and migration of X applications. Providing better support for this functionality would enable users to retrieve or duplicate X applications between their servers at home and work.


One interesting example of an application that currently is capable of migration and replication is Emacs. To create a new frame on DISPLAY, try "M-x make-frame-on-display <Return> DISPLAY <Return>".

However, technical challenges make replication and migration difficult in general. These include the "major headaches" of server-side fonts, the nonuniformity of X servers and screen sizes, and the need to appropriately retrofit toolkits. Authentication and authorization issues are obviously also important. The rest of the talk delved into some of the details of these interesting technical challenges.

HACKING IN THE KERNEL

Summarized by Hai Huang

AN IMPLEMENTATION OF SCHEDULER ACTIVATIONS ON THE NETBSD OPERATING SYSTEM

Nathan J. Williams, Wasabi Systems

Scheduler activation is an old idea. Basically, there are benefits and drawbacks to using solely kernel-level or user-level threading. Scheduler activation is able to combine the two layers of control to provide more concurrency in the system.

In his talk, Nathan gave a fairly detailed description of the implementation of scheduler activations in the NetBSD kernel. One important change in the implementation is to differentiate the thread context from the process context. This is done by defining a separate data structure for these threads and relocating some of the information that was embedded in the process context to these thread contexts. The stack was especially a concern, due to the upcall: special handling must be done to make sure that the upcall doesn't mess up the stack, so that the preempted user-level thread can continue afterwards. Lastly, Nathan explained that signals are handled by upcalls.


AUTHORIZATION AND CHARGING IN PUBLIC WLANS USING FREEBSD AND 802.1X

Pekka Nikander, Ericsson Research NomadicLab

802.1x standards are well known in the wireless community as link-layer authentication protocols. In this talk, Pekka explained some novel ways of using the 802.1x protocols that might be of interest to people on the move. It is possible to set up a public WLAN that would support various charging schemes via virtual tokens, which people can purchase or earn and later use.

This is implemented on FreeBSD using the netgraph utility. It is basically a filter in the link layer that differentiates traffic based on the MAC address of the client node, which is either authenticated, denied, or let through. The overhead for this service is fairly minimal.

ACPI IMPLEMENTATION ON FREEBSD

Takanori Watanabe, Kobe University

ACPI (Advanced Configuration and Power Interface) was proposed as a joint effort by Intel, Toshiba, and Microsoft to provide a standard and finer-control method of managing power states of individual devices within a system. Such low-level power management is especially important for those mobile and embedded systems that are powered by fixed-capacity energy batteries.

Takanori explained that the ACPI specification is composed of three parts: tables, BIOS, and registers. He was able to implement some functionality of ACPI in a FreeBSD kernel. The ACPI Component Architecture was implemented by Intel, and it provides a high-level ACPI API to the operating system. Takanori's ACPI implementation is built upon this underlying layer of APIs.

ACCESS CONTROL

Summarized by Florian Buchholz

DESIGN AND PERFORMANCE OF THE OPENBSD STATEFUL PACKET FILTER (PF)

Daniel Hartmeier, Systor AG

Daniel Hartmeier described the new stateful packet filter (pf) that replaced IPFilter in the OpenBSD 3.0 release. IPFilter could no longer be included due to licensing issues, and thus there was a need to write a new filter making use of optimized data structures.

The filter rules are implemented as a linked list which is traversed from start to end. Two actions may be taken according to the rules, "pass" or "block": a "pass" action forwards the packet, and a "block" action will drop it. Where more than one rule matches a packet, the last rule wins. Rules that are marked as "final" will immediately terminate any further rule evaluation. An optimization called "skip-steps" was also implemented, where blocks of similar rules are skipped if they cannot match the current packet; these skip-steps are calculated when the rule set is loaded. Furthermore, a state table keeps track of TCP connections: only packets that match the sequence numbers are allowed. UDP and ICMP queries and replies are considered in the state table, where initial packets will create an entry for a pseudo-connection with a low initial timeout value. The state table is implemented using a balanced binary search tree. NAT mappings are also stored in the state table, whereas application proxies reside in user space. The packet filter is also able to perform fragment reassembly and to modulate TCP sequence numbers to protect hosts behind the firewall.
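
A minimal C sketch of the evaluation strategy described above follows; the types, the default policy, and the match predicate are invented for illustration and are not the pf source.

#include <stddef.h>

enum action { PASS, BLOCK };
struct packet;                               /* opaque here */

struct rule {
    struct rule *next;
    struct rule *skip;     /* skip-step: next rule that could still match */
    enum action  action;
    int          final;    /* terminate evaluation on match               */
};

/* Stub for illustration; the real filter compares addresses, ports,
 * protocol, direction, and interface.                                    */
static int rule_matches(const struct rule *r, const struct packet *p) {
    (void)r; (void)p;
    return 1;
}

enum action evaluate(struct rule *head, const struct packet *pkt) {
    enum action verdict = PASS;              /* assumed default policy     */
    for (struct rule *r = head; r != NULL; ) {
        if (!rule_matches(r, pkt)) {
            r = r->skip ? r->skip : r->next; /* jump over rules that
                                                cannot match this packet   */
            continue;
        }
        verdict = r->action;                 /* last matching rule wins    */
        if (r->final)
            break;                           /* "final" rule stops here    */
        r = r->next;
    }
    return verdict;
}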

Pf was compared against IPFilter as well as Linux's iptables by measuring throughput and latency with increasing traffic rates and different packet sizes. In a test with a fixed rule set size of 100, iptables outperformed the other two filters, whose results were close to each other. In a second test, where the rule set size was continually increased, iptables consistently had about twice the throughput of the other two (which evaluate the rule set on both the incoming and outgoing interfaces). A third test compared only pf and IPFilter, using a single rule that created state in the state table with a fixed state-entry size. Pf reached an overloaded state much later than IPFilter. The experiment was repeated with a variable state-entry size, and pf performed much better than IPFilter for a small number of states.

In general, rule set evaluation is expensive, and benchmarks only reflect extreme cases, whereas in real life other behavior should be observed. Furthermore, the benchmarks show that stateful filtering can actually improve performance, due to cheap state-table lookups as compared to rule evaluation.

ENHANCING NFS CROSS-ADMINISTRATIVE DOMAIN ACCESS

Joseph Spadavecchia and Erez Zadok, Stony Brook University

The speaker presented modifications to an NFS server that allow improved NFS access between administrative domains. A problem lies in the fact that NFS assumes a shared UID/GID space, which makes it unsafe to export files outside the administrative domain of the server. Design goals were to leave the protocol and existing clients unchanged, a minimum amount of server changes, flexibility, and increased performance.

To solve the problem, two techniques are utilized: "range-mapping" and "file-cloaking." Range-mapping maps IDs between client and server. The mapping is performed on a per-export basis and has to be manually set up in an export file. The mappings can be 1-1, N-N, or N-1. In file-cloaking, the server restricts file access based on UID/GID and special cloaking-mask bits. Here, users can only access their own file permissions. The policy on whether or not the file is visible to others is set by the cloaking mask, which is logically ANDed with the file's protection bits. File-cloaking only works, however, if the client doesn't hold cached copies of directory contents and file attributes. Because of this, the clients are forced to re-read directories by incrementing the mtime value of the directory each time it is listed.
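
A minimal sketch of the range-mapping idea in C; the structure and field names are assumptions, and the real server consults its per-export table on every request.

#include <stddef.h>

/* One per-export mapping entry: client UIDs in
 * [client_base, client_base + count) are translated 1-1 to server UIDs
 * starting at server_base; overlapping or single-target entries can
 * express the N-N and N-1 cases.                                        */
struct range_map {
    unsigned client_base;
    unsigned server_base;
    unsigned count;
};

unsigned map_uid(const struct range_map *m, size_t nmaps,
                 unsigned client_uid, unsigned nobody_uid) {
    for (size_t i = 0; i < nmaps; i++) {
        if (client_uid >= m[i].client_base &&
            client_uid <  m[i].client_base + m[i].count)
            return m[i].server_base + (client_uid - m[i].client_base);
    }
    return nobody_uid;   /* unmapped client IDs fall back to an anonymous ID */
}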

To test the performance of the modified server, five different NFS configurations were evaluated: an unmodified NFS server was compared against one server with the modified code included but not used, one with only range-mapping enabled, one with only file-cloaking enabled, and one version with all modifications enabled. For each setup, different file system benchmarks were run. The results show only a small overhead when the modifications are used, generally an increase of below 5%. Another experiment tested the performance of the system with an increasing number of mapped or cloaked entries on a system. The results show that an increase from 10 to 1000 entries resulted in a maximum of about 14% cost in performance.

One member of the audience pointed out that if clients choose to ignore the changed mtimes from the server and thus still hold caches of the directory entries, the file-cloaking mechanism could be defeated. After a rather lengthy debate, the speaker had to concede that the model doesn't add any extra security. Another question was asked about scalability of the setup of range-mapping. The speaker referred to application-level tools that could be developed for that purpose.

The software is available at ftp://ftp.fsl.cs.sunysb.edu/pub/enf.

ENGINEERING OPEN SOURCE SOFTWARE

Summarized by Teri Lampoudi

NINGAUI: A LINUX CLUSTER FOR BUSINESS

Andrew Hume, AT&T Labs – Research; Scott Daniels, Electronic Data Systems Corp.

Ningaui is a general-purpose, highly available, resilient architecture built from commodity software and hardware. Emphatically, however, Ningaui is not a Beowulf. Hume calls the cluster design the "Swiss canton model," in which there are a number of loosely affiliated independent nodes with data replicated among them. Jobs are assigned by bidding and leases, and cluster services done as session-based servers are sited via generic job assignment. The emphasis is on keeping the architecture end-to-end, checking all work via checksums, and logging everything. The resilience and high availability required by their goal of 8-to-5 maintenance – vs. the typical 24-7 model, where people get paged whenever the slightest thing goes wrong, regardless of the time of day – is achieved by job restartability. Finally, all computation is performed on local data, without the use of NFS or network-attached storage.

Hume's message is a hopeful one: despite the many problems encountered – things like kernel and service limits, auto-installation problems, TCP storms, and the like – the existence of source code and the paranoid practice of logging and checksumming everything has helped. The final product performs reasonably well, and it appears that the resilient techniques employed do make a difference. One drawback, however, is that the software mentioned in the paper is not yet available for download.

CPCMS: A CONFIGURATION MANAGEMENT SYSTEM BASED ON CRYPTOGRAPHIC NAMES

Jonathan S. Shapiro and John Vanderburgh, Johns Hopkins University

This paper won the FREENIX Best Paper award.

The basic notion behind the project is the fact that everyone has a pet complaint about CVS, and yet it is currently the configuration manager in most widespread use. Shapiro has unleashed an alternative. Interestingly, he did not begin but ended with the usual host of reasons why CVS is bad. The talk instead began abruptly by characterizing the job of a software configuration manager, continued by stating the namespaces which it must handle, and wrapped up with the challenges faced.

X MEETS Z: VERIFYING CORRECTNESS IN THE PRESENCE OF POSIX THREADS

Bart Massey, Portland State University; Robert T. Bauer, Rational Software Corp.

Massey delivered a humorous talk on the insight gained from applying Z formal specification notation to system software design, rather than the more informal analysis and design process normally used.

The story is told with respect to writing XCB, which replaces the Xlib protocol layer and is supposed to be thread friendly. But where threads are concerned, deadlock avoidance becomes a hard problem that cannot be solved in an ad hoc manner. But full model checking is also too hard. In this case, Massey resorted to Z specification to model the XCB lower layer, abstract away locking and data transfers, and locate fundamental issues. Essentially, the difficulties of searching the literature and locating information relevant to the problem at hand were overcome. As Massey put it, "the formal method saved the day."

FILE SYSTEMS

Summarized by Bosko Milekic

PLANNED EXTENSIONS TO THE LINUX EXT2/EXT3 FILESYSTEM

Theodore Y. Ts'o, IBM; Stephen Tweedie, Red Hat

The speaker presented improvements to the Linux Ext2 file system, with the goal of allowing for various expansions while striving to maintain compatibility with older code. Improvements have been facilitated by a few extra superblock fields that were added to Ext2 not long before the Linux 2.0 kernel was released. The fields allow for file system features to be added without compromising the existing setup; this is done by providing bits indicating the impact of the added features with respect to compatibility. Namely, a file system with the "incompat" bit marked is not allowed to be mounted. Similarly, a "read-only" marking would only allow the file system to be mounted read-only.
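
A simplified sketch of the mount-time logic such compatibility bitmaps enable; the constant values and field names below are illustrative rather than copied from the kernel source.

#include <stdint.h>

/* Subset of an Ext2-style superblock: three feature bitmaps. */
struct sb_features {
    uint32_t compat;      /* unknown bits here are harmless                  */
    uint32_t incompat;    /* unknown bits here make the fs unmountable       */
    uint32_t ro_compat;   /* unknown bits here allow read-only mounting only */
};

#define KNOWN_INCOMPAT   0x0001u   /* illustrative masks of the features  */
#define KNOWN_RO_COMPAT  0x0003u   /* this kernel version understands     */

enum mount_result { MOUNT_RW, MOUNT_RO_ONLY, MOUNT_REFUSE };

enum mount_result check_features(const struct sb_features *f, int want_rw) {
    if (f->incompat & ~KNOWN_INCOMPAT)
        return MOUNT_REFUSE;              /* cannot even read it safely      */
    if (want_rw && (f->ro_compat & ~KNOWN_RO_COMPAT))
        return MOUNT_RO_ONLY;             /* safe to read, not to write      */
    return MOUNT_RW;                      /* unknown compat bits are ignored */
}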

Directory indexing replaces linear directory searches with a faster search using a fixed-depth tree and hashed keys. File system size can be dynamically increased, and the expanded inode, doubled from 128 to 256 bytes, allows for more extensions.

Other potential improvements were discussed as well, in particular pre-allocation for contiguous files, which allows for better performance in certain setups by pre-allocating contiguous blocks. Security-related modifications, extended attributes, and ACLs were mentioned. An implementation of these features already exists but has not yet been merged into the mainline Ext2/3 code.

RECENT FILESYSTEM OPTIMISATIONS ON FREEBSD

Ian Dowse, Corvil Networks; David Malone, CNRI, Dublin Institute of Technology

David Malone presented four important file system optimizations for the FreeBSD OS: soft updates, dirpref, vmiodir, and dirhash. It turns out that certain combinations of the optimizations (beautifully illustrated in the paper) may yield performance improvements of anywhere between 2 and 10 orders of magnitude for real-world applications.

All four of the optimizations deal with file system metadata. Soft updates allow for asynchronous metadata updates. Dirpref changes the way directories are organized, attempting to place child directories closer to their parents, thereby increasing locality of reference and reducing disk-seek times. Vmiodir trades some extra memory in order to achieve better directory caching. Finally, dirhash, which was implemented by Ian Dowse, changes the way in which entries are searched in a directory: specifically, FFS uses a linear search to find an entry by name, while dirhash builds a hashtable of directory entries on the fly. For a directory of size n with a working set of m files, a search that in certain cases could have been O(n x m) has been reduced, due to dirhash, to effectively O(n + m).

FILESYSTEM PERFORMANCE AND SCALABILITY IN LINUX 2.4.17

Ray Bryant, SGI; Ruth Forester, IBM LTC; John Hawkes, SGI

This talk focused on performance evaluation of a number of file systems available and commonly deployed on Linux machines. Comparisons under various configurations of Ext2, Ext3, ReiserFS, XFS, and JFS were presented.

The benchmarks chosen for the data gathering were pgmeter, filemark, and the AIM Benchmark Suite VII. Pgmeter measures the rate of data transfer of reads/writes of a file under a synthetic workload. Filemark is similar to postmark in that it is an operation-intensive benchmark, although filemark is threaded and offers various other features that postmark lacks. AIM VII measures performance for various file-system-related functionalities; it offers various metrics under an imposed workload, thus stressing the performance of the file system not only under I/O load but also under significant CPU load.

Tests were run on three different setups: a small, a medium, and a large configuration. ReiserFS and Ext2 appear at the top of the pile for smaller and medium setups. Notably, XFS and JFS perform worse for smaller system configurations than the others, although XFS clearly appears to generate better numbers than JFS. It should be noted that XFS seems to scale well under a higher load. This was most evident in the large-system results, where XFS appears to offer the best overall results.

THINGS TO THINK ABOUT

Summarized by Bosko Milekic

SPEEDING UP KERNEL SCHEDULER BY REDUCING CACHE MISSES

Shuji Yamamura, Akira Hirai, Mitsuru Sato, Masao Yamamoto, Akira Naruse, Kouichi Kumon, Fujitsu Labs

This was an interesting talk pertaining to the effects of cache coloring for task structures in the Linux kernel scheduler (Linux kernel 2.4.x). The speaker first presented some interesting benchmark numbers for the Linux scheduler, showing that as the number of processes on the task queue was increased, the performance decreased. The authors used some really nifty hardware to measure the number of bus transactions throughout their tests and were thus able to reasonably quantify the impact that cache misses had in the Linux scheduler.

Their experiments led them to implement a cache coloring scheme for task structures, which were previously aligned on 8KB boundaries and therefore were eventually being mapped to the same cache lines. This unfortunate placement of task structures in memory induced a significant number of cache misses as the number of tasks grew in the scheduler.

The implemented solution consisted of aligning task structures to more evenly distribute cached entries across the L2 cache. The result was inevitably fewer cache misses in the scheduler. Some negative effects were observed in certain situations; these were primarily due to more cache slots being used by task structure data in the scheduler, thus forcing data previously cached there to be pushed out.
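
A small worked illustration of the underlying arithmetic (the cache parameters are assumed, not those of the hardware used in the paper): with 8KB-aligned task structures, consecutive tasks land in only a handful of cache sets, while adding a per-task color offset spreads them out.

#include <stdio.h>

#define LINE_SIZE 64
#define NUM_SETS  1024          /* e.g., 256KB, 4-way, 64-byte lines */
#define NCOLORS   16

static unsigned set_of(unsigned long addr) {
    return (addr / LINE_SIZE) % NUM_SETS;
}

int main(void) {
    /* Without coloring, 8KB-aligned task structures hit only
     * NUM_SETS / (8192 / LINE_SIZE) = 8 distinct sets, so conflict
     * misses pile up as the task count grows.                        */
    for (int task = 0; task < 8; task++) {
        unsigned long plain   = task * 8192UL;
        unsigned long colored = plain + (task % NCOLORS) * LINE_SIZE;
        printf("task %d: set %4u without color, %4u with color\n",
               task, set_of(plain), set_of(colored));
    }
    return 0;
}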


WORK-IN-PROGRESS REPORTS

Summarized by Brennan Reynolds

RESOURCE VIRTUALIZATION TECHNIQUES FOR WIDE-AREA OVERLAY NETWORKS

Kartik Gopalan, Stony Brook University

This work addressed the issue of provisioning a maximum number of virtual overlay networks (VONs) with diverse quality-of-service (QoS) requirements on a single physical network. Each of the VONs is logically isolated from others to ensure the QoS requirements. Gopalan mentioned several critical research issues with this problem that are currently being investigated. Dealing with how to provision the network at various levels (link, route, or path) and then enforce the provisioning at run-time is one of the toughest challenges. Currently, Gopalan has developed several algorithms to handle admission control, end-to-end QoS, route selection, scheduling, and fault tolerance in the network.

For more information, visit http://www.ecsl.cs.sunysb.edu.

VISUALIZING SOFTWARE INSTABILITY

Jennifer Bevan, University of California, Santa Cruz

Detection of instability in software has typically been an afterthought: the point in the development cycle when the software is reviewed for instability is usually after it is difficult and costly to go back and perform major modifications to the code base. Bevan has developed a technique to allow the visualization of unstable regions of code that can be used much earlier in the development cycle. Her technique creates a time series of dependency graphs that include clusters and lines called fault lines. From the graphs, a developer is able to easily determine where the unstable sections of code are and proactively restructure them. She is currently working on a prototype implementation.

For more information, visit http://www.cse.ucsc.edu/~jbevan/evo_viz.

RELIABLE AND SCALABLE PEER-TO-PEER WEB DOCUMENT SHARING

Li Xiao, College of William and Mary

The idea presented by Xiao would allow end users to share the content of their Web browser caches with neighboring Internet users. The rationale for this is that today's end users are increasingly connected to the Internet over high-speed links, and the browser caches are becoming large storage systems. Therefore, if an individual accesses a page which does not exist in their local cache, Xiao is suggesting that they first query other end users for the content before trying to access the machine hosting the original. This strategy does have some serious problems associated with it that still need to be addressed, including ensuring the integrity of the content and protecting the identity and privacy of the end users.

SEREL: FAST BOOTING FOR UNIX

Leni Mayo, Fast Boot Software

Serel is a tool that generates a visual representation of a UNIX machine's boot-up sequence. It can be used to identify the critical path and can show if a particular service or process blocks for an extended period of time. This information could be used to determine where the largest performance gains could be realized by tuning the order of execution at boot-up. Serel creates a dependency graph, expressed in XML, during boot-up. This graph is then used to create the visual representation. Currently the tool only works on POSIX-compliant systems, but Mayo stated that he would be porting it to other platforms. Other extensions that were mentioned included having the metadata include the use of shared libraries and monitoring the suspend/resume sequence of portable machines.

For more information, visit http://www.fastboot.org.


BERKELEY DB XML

John Merrells, Sleepycat Software

Merrells gave a quick introduction and overview of the new XML library for Berkeley DB. The library specializes in storage and retrieval of XML content through a tool called XPath. The library allows for multiple containers per document and stores everything natively as XML. The user is also given a wide range of elements to create indices with, including edges, elements, text strings, or presence. The XPath tool consists of a query parser, generator, optimizer, and execution engine. To conclude his presentation, Merrells gave a live demonstration of the software.

For more information, visit http://www.sleepycat.com/xml.

CLUSTER-ON-DEMAND (COD)

Justin Moore, Duke University

Modern clusters are growing at a rapid rate. Many have pushed beyond the 5000-machine mark, and deploying them results in large expenses as well as management and provisioning issues. Furthermore, if the cluster is "rented" out to various users, it is very time-consuming to configure it to a user's specs, regardless of how long they need to use it. The COD work presented creates dynamic virtual clusters within a given physical cluster. The goal was to have a provisioning tool that would automatically select a chunk of available nodes and install the operating system and middleware specified by the customer in a short period of time. This would allow a greater use of resources, since multiple virtual clusters can exist at once. By using a virtual cluster, the size can be changed dynamically. Moore stated that they have created a working prototype and are currently testing and benchmarking its performance.

For more information, visit http://www.cs.duke.edu/~justin/cod.


CATACOMB

Elias Sinderson, University of California, Santa Cruz

Catacomb is a project to develop a database-backed DASL module for the Apache Web server. It was designed as a replacement for WebDAV. The initial release of the module only contains support for a MySQL database but could be extended to others. Sinderson briefly touched on the performance of the module: it was comparable to the mod_dav Apache module for all query types but search. The presentation was concluded with remarks about adding support for the lock method and including ACL specifications in the future.

For more information, visit http://ocean.cse.ucsc.edu/catacomb.

SELF-ORGANIZING STORAGE

Dan Ellard, Harvard University

Ellard's presentation introduced a storage system that tunes itself based on the workload of the system, without requiring the intervention of the user. The intelligence is implemented as a virtual self-organizing disk that resides below any file system. The virtual disk observes the access patterns exhibited by the system and then attempts to predict what information will be accessed next. An experiment was done using an NFS trace at a large ISP on one of their email servers. Ellard's self-organizing storage system worked well, which he attributes to the fact that most of the files being requested were large email boxes. Areas of future work include exploration of the length and detail of the data collection stage as well as the CPU impact of running the virtual disk layer.

VISUAL DEBUGGING

John Costigan, Virginia Tech

Costigan feels that the state of current debugging facilities in the UNIX world is not as good as it should be. He proposes the addition of several elements to programs like ddd and gdb. The first addition is including separate heap and stack components. Another is to display only certain fields of a data structure and have the ability to zoom in and out if needed. Finally, the debugger should provide the ability to visually present complex data structures, including linked lists, trees, etc. He said that a beta version of a debugger with these abilities is currently available.

For more information, visit http://infovis.cs.vt.edu/datastruct.

IMPROVING APPLICATION PERFORMANCE THROUGH SYSTEM-CALL COMPOSITION

Amit Purohit, Stony Brook University

Web servers perform a huge number of context switches and internal data copies during normal operation. These two elements can drastically limit the performance of an application, regardless of the hardware platform it is run on. The Compound System Call (CoSy) framework is an attempt to reduce the performance penalty for context switches and data copies. It includes a user-level library and several kernel facilities that can be used via system calls. The library provides the programmer with a complete set of memory-management functions. Performance tests using the CoSy framework showed a large saving for context switches and data copies. The only area where the savings between conventional libraries and CoSy were negligible was for very large file copies.
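
The CoSy API itself was not detailed in the talk; the fragment below is purely hypothetical, invented only to illustrate the idea of batching several operations into one kernel crossing instead of issuing separate system calls. None of these names exist in CoSy or in any real kernel interface.

/* Hypothetical compound-call sketch: describe a whole open-read-send-close
 * sequence up front so the kernel can run it in one crossing, without
 * copying the file data through user space.  All names are invented.      */
struct op {
    enum { OP_OPEN, OP_READ, OP_SEND, OP_CLOSE } kind;
    const char *path;     /* OP_OPEN                                        */
    int         fd_slot;  /* index of a previously produced descriptor      */
    int         sock;     /* OP_SEND target                                 */
    long        nbytes;   /* OP_READ / OP_SEND length                       */
};

long compound_call(struct op *ops, int nops);   /* one trap runs them all   */

/* Conventional code would instead call open(), then loop over read() and
 * send() -- each a separate trap plus a copy of the data through a
 * user-space buffer.                                                       */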

For more information, visit http://www.cs.sunysb.edu/~purohit.

ELASTIC QUOTAS

John Oscar, Columbia University

Oscar began by stating that most resources are flexible, but that to date disks have not been. While approximately 80 percent of files are short-lived, disk quotas are hard limits imposed by administrators. The idea behind elastic quotas is to have a non-permanent storage area that each user can temporarily use. Oscar suggested the creation of an ehome directory structure to be used in deploying elastic quotas. Currently he has implemented elastic quotas as a stackable file system.

There were also several trends apparent in the data. Many users would ssh to their remote host (good) but then launch an email reader on the local machine, which would connect via POP to the same host and send the password in clear text (bad). This situation can easily be remedied by tunneling protocols like POP through SSH, but it appeared that many people were not aware this could be done. While most of his comments on the use of the wireless network were negative, the list of passwords he had collected showed that people were indeed using strong passwords. His recommendations were to educate and encourage people to use protocols like IPSec, SSH, and SSL when conducting work over a wireless network, because you never know who else is listening.

SYSTRACE

Niels Provos, University of Michigan

How can one be sure the applications one is using actually do exactly what their developers said they do? Short answer: you can't. People today are using a large number of complex applications, which means it is impossible to check each application thoroughly for security vulnerabilities. There are tools out there that can help, though. Provos has developed a tool called systrace that allows a user to generate a policy of acceptable system calls a particular application can make. If the application attempts to make a call that is not defined in the policy, the user is notified and allowed to choose an action. Systrace includes an auto-policy generation mechanism, which uses a training phase to record the actions of all programs the user executes. If the user chooses not to be bothered by applications breaking policy, systrace allows default enforcement actions to be set. Currently systrace is implemented for FreeBSD and NetBSD, with a Linux version coming out shortly.


For more information, visit http://www.citi.umich.edu/u/provos/systrace/.

VERIFIABLE SECRET REDISTRIBUTION FOR SURVIVABLE STORAGE SYSTEMS

Ted Wong, Carnegie Mellon University

Wong presented a protocol that can be used to re-create a file distributed over a set of servers, even if one of the servers is damaged. In his scheme, the user must choose how many shares the file is split into and the number of servers it will be stored across. The goal of this work is to provide persistent, secure storage of information even if it comes under attack. Wong stated that his design of the protocol was complete, and he is currently building a prototype implementation of it.

For more information, visit http://www.cs.cmu.edu/~tmwong/research.

THE GURU IS IN SESSIONS

LINUX ON LAPTOP/PDA

Bdale Garbee, HP Linux Systems Operation

Summarized by Brennan Reynolds

This was a roundtable-like discussion, with Garbee acting as moderator. He is currently in charge of the Debian distribution of Linux and is involved in porting Linux to platforms other than i386. He has successfully helped port the kernel to the Alpha, Sparc, and ARM, and is currently working on a version to run on VAX machines.

A discussion was held on which file system is best for battery life. The Reiser and Ext3 file systems were discussed; people commented that when using Ext3 on their laptops, the disk would never spin down and thus was always consuming a large amount of power. One suggestion was to use a RAM-based file system for any volume that requires a large number of writes and to only have the file system written to disk at infrequent intervals or when the machine is shut down or suspended.

A question was directed to Garbee about his knowledge of how the HP-Compaq merger would affect Linux. Garbee thought that the merger was good for Linux, both within the company and for the open source community as a whole. He stated that the new company would be the number one shipper of systems with Linux as the base operating system. He also fielded a question about the support of older HP servers and their ability to run Linux, saying that indeed Linux has been ported to them and he personally had several machines in his basement running it.

A question which sparked a large number of responses concerned problems people had with running Linux on mobile platforms. Most people actually did not have many problems at all. There were a few cases of a particular machine model not working, but there were only two widespread issues: docking stations and the new ACPI power management scheme. Docking stations still appear to be somewhat of a headache, but most people had developed ad-hoc solutions for getting them to work. The ACPI power management scheme developed by Intel does not appear to have a quick solution. Ted Ts'o, one of the head kernel developers, stated that there are fundamental architectural problems with the 2.4 series kernel that do not easily allow ACPI to be added. However, he also stated that ACPI is supported in the newest development kernel, 2.5, and will be included in 2.6.

The final topic was the possibility of purchasing a laptop without paying any charge or tax for Microsoft Windows. The consensus was that currently this is not possible. Even if the machine comes with Linux or without any operating system, the vendors have contracts in place which require them to pay Microsoft for each machine they ship if that machine is capable of running a Microsoft operating system. The only way this will change is if the demand for other operating systems increases to the point that vendors renegotiate their contracts with Microsoft, and no one saw this happening in the near future.

LARGE CLUSTERS

Andrew Hume, AT&T Labs – Research

Summarized by Teri Lampoudi

This was a session on clusters, big data, and resilient computing. Given that the audience was primarily interested in clusters, and not necessarily big data or beating nodes to a pulp, the conversation revolved mainly around what it takes to make a heterogeneous cluster resilient. Hume also presented a FREENIX paper on his current cluster project, which explained in more detail much of what was abstractly claimed in the guru session.

Hume's use of the word "cluster" referred not to what I had assumed would be a Beowulf-type system, but to a loosely coupled farm of machines in which concurrency was much less of an issue than it would be in a Beowulf. In fact, the architecture Hume described was designed to deal with large amounts of independent transaction processing, essentially the process of billing calls, which requires no interprocess message passing or anything of the sort a parallel scientific application might.

Where does the large data come in? Since the transactions in question consist of large numbers of flat files, a mechanism for getting the files onto nodes and off-loading the results is necessary. In this particular cluster, named Ningaui, this task is handled by the "replication manager," which generated a large amount of interest in the audience. All management functions in the cluster, as well as job allocation, constitute services that are to be bid for and leased out to nodes. Furthermore, failures are handled by simply repeating the failed task until it is completed properly, if that is possible, and if it is not, then looking through detailed logs to discover what went wrong. Logging and testing for the successful completion of each task are what make this model resilient. Every management operation is logged at some appropriate level of detail, generating 120MB/day, and every single file transfer is MD5 checksummed. This was characterized by Hume as a "patient" style of management, where no one controls the cluster, scheduling behavior is emergent rather than stringently planned, and nodes are free to drop in and out of the cluster; a downside to this model is the increase in latency, offset by the fact that the cluster continues to function unattended outside of 8-to-5 support hours.

In response to questions about the projected scalability of such a scheme, Hume said the system would presumably scale from the present 60 nodes to about 100 without modifications, but that scaling higher would be a matter for further design. The guru session ended on a somewhat unrelated note regarding the problems of getting data off mainframes and onto Linux machines – issues of fixed- vs. variable-length encoding of data blocks resulting in useful information being stripped by FTP, and problems in converting COBOL copybooks to something useful on a C-based architecture. To reiterate Hume's claim, there are homemade solutions for some of these problems, and the reader should feel free to contact Mr. Hume for them.

WRITING PORTABLE APPLICATIONS

Nick Stoughton, MSB Consultants

Summarized by Josh Lothian

Nick Stoughton addressed a concern that developers are facing more and more: writing portable applications. He proposed that there is no such thing as a "portable application," only those that have been ported. Developers are still in search of the Holy Grail of applications programming: a write-once, compile-anywhere program. Languages such as Java are leading up to this, but virtual machines still have to be ported, paths will still be different, etc. Web-based applications are another area that could be promising for portable applications.

Nick went on to talk about the future of portability. The recently released POSIX 2001 standards are expected to help the situation. The 2001 revision expands the original POSIX spec into seven volumes. Even though POSIX 2001 is more specific than the previous release, Nick pointed out that even this standard allows for small differences where vendors may define their own behaviors. This can only hurt developers in the long run. Along these lines, it was pointed out that there may be room for automated tools that have the capability to check application code and determine the level of portability and standards compliance of the source.

NETWORK MANAGEMENT SYSTEM PERFORMANCE TUNING

Jeff Allen Tellme Networks Inc

Summarized by Matt Selsky

Jeff Allen, author of Cricket, discussed network tuning. The basic idea of tuning is: measure, twiddle, and measure again. Cricket can be used for large installations, but it's overkill for smaller installations, for which Mrtg is better suited. You don't need Cricket to do measurement. Doing something like a shell script that dumps data to Excel is good, but you need to get some sort of measurement. Cricket was not designed for billing; it was meant to help answer questions.
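
In the spirit of "you don't need Cricket to do measurement," here is a hedged sketch (mine, not from the talk) of the measure-twiddle-measure loop: poll one metric at a fixed interval and append it to a CSV file that a spreadsheet or plotting tool can compare before and after a tuning change.

    import csv, time, datetime

    def load_average():
        # One sample metric; /proc/loadavg is Linux-specific, so adjust per platform.
        with open("/proc/loadavg") as f:
            return float(f.read().split()[0])

    def record(path="load.csv", interval=60, samples=60):
        """Append (timestamp, load) rows so tuning runs can be compared later."""
        with open(path, "a", newline="") as f:
            w = csv.writer(f)
            for _ in range(samples):
                w.writerow([datetime.datetime.now().isoformat(), load_average()])
                f.flush()
                time.sleep(interval)

    if __name__ == "__main__":
        record()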

Useful tools and techniques covered included: looking for the difference between machines in order to identify the causes for the differences; measuring, but also thinking about what you're measuring and why; and remembering to use strace and tcpdump to identify problems. Problems repeat themselves; performance tuning requires an intricate understanding of all the layers involved to solve the problem and simplify the automation. (Having a computer science

background helps but is not essential.) If you can reproduce the problem, you then should investigate each piece to find the bottleneck. If you can't reproduce the problem, then measure. You need to understand all the system interactions. Measurement can help determine whether things are actually slow or if the user is imagining it.

Troubleshooting begins with a hunch, but scientific processes are essential. You should be able to determine the event stream, or when each event occurs; having observation logs can help. Some hints provided include checking cron for periodic anomalies, slowing down /bin/rm to avoid I/O overload, and looking for unusual kernel CPU usage.

Also, try to use lightweight monitoring to reduce overhead. You don't need to monitor every resource on every system, but you should monitor those resources on those systems that are essential. Don't check things too often, since you can introduce overhead that way. Lots of information can be gathered from the /proc file system interface to the kernel.

BSD BOF

Host Kirk McKusick Author and Consultant

Summarized by Bosko Milekic

The BSD Birds of a Feather session at USENIX started off with five quick and informative talks and ended with an exciting discussion of licensing issues.

Christos Zoulas, for the NetBSD Project, went over the primary goals of the NetBSD Project (namely portability, a clean architecture, and security) and then proceeded to briefly discuss some of the challenges that the project has encountered. The successes and areas for improvement of 1.6 were then examined. NetBSD has recently seen various improvements in its cross-build system, packaging system, and third-party tool support. For what concerns the kernel, a zero-copy TCP/UDP implementation has been integrated, and a new pmap
code for arm32 has been introduced, as have some other rather major changes to various subsystems. More information is available in Christos's slides, which are now available at http://www.netbsd.org/gallery/events/usenix2002.

Theo de Raadt of the OpenBSD Project presented a rundown of various new things that the OpenBSD and OpenSSH teams have been looking at. Theo's talk included an amusing description (and pictures) of some of the events that transpired during OpenBSD's hack-a-thon, which occurred the week before USENIX '02. Finally, Theo mentioned that, following complications with the IPFilter license, the OpenBSD team had done a fairly extensive license audit.

Robert Watson of the FreeBSD Project brought up various examples of FreeBSD being used in the real world. He went on to describe a wide variety of changes that will surface with the upcoming FreeBSD release 5.0. Notably, 5.0 will introduce a re-architecting of the kernel aimed at providing much more scalable support for SMP machines. 5.0 will also include an early implementation of KSE, FreeBSD's version of scheduler activations, large improvements to pccard, and significant framework bits from the TrustedBSD project.

Mike Karels from Wind River Systems presented an overview of how BSD/OS has been evolving following the Wind River Systems acquisition of BSDi's software assets. Ernie Prabhakar from Apple discussed Apple's OS X and its success on the desktop. He explained how OS X aims to bring BSD to the desktop while striving to maintain a strong and stable core, one that is heavily based on BSD technology.

The BoF finished with a discussion on licensing; specifically, some folks questioned the impact of Apple's licensing for code from their core OS (the Darwin Project) on the BSD community as a whole.

CLOSING SESSION

HOW FLIES FLY

Michael H. Dickinson, University of California, Berkeley

Summarized by JD Welch

In this lively and offbeat talk, Dickinson discussed his research into the mechanisms of flight in the fruit fly (Drosophila melanogaster) and related the autonomic behavior to that of a technological system. Insects are the most successful organisms on earth, due in no small part to their early adoption of flight. Flies travel through space on straight trajectories interrupted by saccades, jumpy turning motions somewhat analogous to the fast, jumpy movements of the human eye. Using a variety of monitoring techniques, including high-speed video in a "virtual reality flight arena," Dickinson and his colleagues have observed that flies respond to visual cues to decide when to saccade during flight. For example, more visual texture makes flies saccade earlier (i.e., further away from the arena wall), described by Dickinson as a "collision control algorithm."

Through a combination of sensory inputs, flies can make decisions about their flight path. Flies possess a visual "expansion detector," which, at a certain threshold, causes the animal to turn a certain direction. However, expansion in front of the fly sometimes causes it to land. How does the fly decide? Using the virtual reality device to replicate various visual scenarios, Dickinson observed that the flies fixate on vertical objects that move back and forth across the field. Expansion on the left side of the animal causes it to turn right, and vice versa, while expansion directly in front of the animal triggers the legs to flare out in preparation for landing.

If the eyes detect these changes, how are the responses implemented? The flies' wings have "power muscles," controlled by mechanical resonance (as opposed to the nervous system), which drive the
wings, combined with neurally activated "steering muscles," which change the configuration of the wing joints. Subtle variations in the timing of impulses correspond to changes in wing movement, controlled by sensors in the "wing-pit." A small wing-like structure, the haltere, controls equilibrium by beating constantly.

Voluntary control is accomplished by the halteres, whose signals can interfere with the autonomic control of the stroke cycle. The halteres have "steering muscles" as well, and information derived from the visual system can turn off the haltere or trick it into a "virtual" problem requiring a response.

Dickinson has also studied the aerodynamics of insect flight using a device called the "Robo Fly," an oversized mechanical insect wing suspended in a large tank of mineral oil. Interestingly, Dickinson observed that the average lift generated by the flies' wings is under its body weight; the flies use three mechanisms to overcome this, including rotating the wings (rotational lift).

Insects are extraordinarily robust creatures, and because Dickinson analyzed the problem in a systems-oriented way, these observations and analysis are immediately applicable to technology; the response system can be used as an efficient search algorithm for control systems in autonomous vehicles, for example.

THE AFS WORKSHOP

Summarized by Garry Zacheiss MIT

Love Hornquist-Astrand of the Arla development team presented the Arla status report. Arla 0.35.8 has been released. Scheduled for release soon are improved support for Tru64 UNIX, MacOS X, and FreeBSD, improved volume handling, and implementation of more of the vos/pts subcommands. It was stressed that MacOS X is considered an important platform and that a GUI configuration manager for the cache
manager, a GUI ACL manager for the finder, and a graphical login that obtains Kerberos tickets and AFS tokens at login time were all under development.

Future goals planned for the underlying AFS protocols include GSSAPI/SPNEGO support for Rx, performance improvements to Rx, an enhanced disconnected mode, and IPv6 support for Rx; an experimental patch is already available for the latter. Future Arla-specific goals include improved performance, partial file reading, and increased stability for several platforms. Work is also in progress for the RXGSS protocol extensions (integrating GSSAPI into Rx). A partial implementation exists, and work continues as developers find time.

Derrick Brashear of the OpenAFS development team presented the OpenAFS status report. OpenAFS was released immediately prior to the conference; OpenAFS 1.2.5 fixed a remotely exploitable denial-of-service attack in several OpenAFS platforms, most notably IRIX and AIX. Future work planned for OpenAFS includes better support for MacOS X, including working around Finder interaction issues. Better support for the BSDs is also planned: FreeBSD has a partially implemented client; NetBSD and OpenBSD have only server programs available right now. AIX 5, Tru64 5.1A, and MacOS X 10.2 (a.k.a. Jaguar) are all planned for the future. Other planned enhancements include nested groups in the ptserver (code donated by the University of Michigan awaits integration), disconnected AFS, and further work on a native Windows client. Derrick stated that the guiding principles of OpenAFS were to maintain compatibility with IBM AFS, support new platforms, ease the administrative burden, and add new functionality.

AFS performance was discussed. The openafs.org cell consists of two file servers, one in Stockholm, Sweden, and one in Pittsburgh, PA. AFS works reasonably well over trans-Atlantic links, although OpenAFS clients don't determine which file server to talk to very effectively; Arla clients use RTTs to the server to determine the optimal file server to fetch replicated data from. Modifications to OpenAFS to support this behavior in the future are desired.

Jimmy Engelbrecht and Harald Barth of KTH discussed their AFSCrawler script, written to determine how many AFS cells and clients were in the world, what implementations/versions they were (Arla vs. IBM AFS vs. OpenAFS), and how much data was in AFS. The script unfortunately triggered a bug in IBM AFS 3.6-derived code, causing some clients to panic while handling a specific RPC. This has since been fixed in OpenAFS 1.2.5 and the most recent IBM AFS patch level, and all AFS users are strongly encouraged to upgrade. No release of Arla is vulnerable to this particular denial-of-service attack. There was an extended discussion of the usefulness of this exploration. Many sites believed this was useful information and such scanning should continue in the future, but only on an opt-in basis.

Many sites face the problem of managing Kerberos/AFS credentials for batch-scheduled jobs. Specifically, most batch processing software needs to be modified to forward tickets as part of the batch submission process, renew tickets and tokens while the job is in the queue and for the lifetime of the job, and properly destroy credentials when the job completes. Ken Hornstein of NRL was able to pay a commercial vendor to support Kerberos 4/5 credential management in their product, although they did not implement AFS token management. MIT has implemented some of the desired functionality in OpenPBS and might be able to make it available to other interested sites.

Tools to simplify AFS administration were discussed, including:

AFS Balancer: A tool written by CMU to automate the process of balancing disk usage across all servers in a cell. Available from ftp://ftp.andrew.cmu.edu/pub/AFS-Tools/balance-1.1b.tar.gz.

Themis: Themis is KTH's enhanced version of the AFS tool "package" for updating files on local disk from a central AFS image. KTH's enhancements include allowing the deletion of files, simplifying the process of adding a file, and allowing the merging of multiple rule sets for determining which files are updated. Themis is available from the Arla CVS repository.

Stanford was presented as an example of a large AFS site. Stanford's AFS usage consists of approximately 14TB of data in AFS, in the form of approximately 100,000 volumes; 33TB of storage is available in their primary cell, ir.stanford.edu. Their file servers consist entirely of Solaris machines running a combination of Transarc 3.6 patch-level 3 and OpenAFS 1.2.x, while their database servers run OpenAFS 1.2.2. Their cell consists of 25 file servers, using a combination of EMC and Sun StorEdge hardware. Stanford continues to use the kaserver for their authentication infrastructure, with future plans to migrate entirely to an MIT Kerberos 5 KDC.

Stanford has approximately 3,400 clients on their campus, not including SLAC (Stanford Linear Accelerator); approximately 2,100 AFS clients from outside Stanford contact their cell every month. Their supported clients are almost entirely IBM AFS 3.6 clients, although they plan to release OpenAFS clients soon. Stanford currently supports only UNIX clients. There is some on-campus presence of Windows clients, but Stanford has never publicly released or supported it. They do intend to release and support the MacOS X client in the near future.

All Stanford students, faculty, and staff are assigned AFS home directories, with a default quota of 50MB, for a total of approximately 550GB of user home directories. Other significant uses of AFS
storage are data volumes for workstation software (400GB) and volumes for course Web sites and assignments (100GB).

AFS usage at Intel was also presented. Intel has been an AFS site since 1994. They had bad experiences with the IBM 3.5 Linux client; their experience with OpenAFS on Linux 2.4.x kernels has been much better. They use and are satisfied with the OpenAFS IA64 Linux port. Intel has hundreds of OpenAFS 1.2.3 and 1.2.4 clients in many production cells, accessing data stored on IBM AFS file servers. They have not encountered any interoperability issues. Intel has some concerns about OpenAFS: they would like to purchase commercial support for OpenAFS and to see OpenAFS support for HP-UX on both PA-RISC and Itanium hardware. HP-UX support is currently unavailable due to a specific HP-UX header file being unavailable from HP; this may be available soon. Intel has not yet committed to migrating their file servers to OpenAFS and are unsure if they will do so without commercial support.

Backups are a traditional topic of discussion at AFS workshops, and this time was no exception. Many users complain that the traditional AFS backup tools ("backup" and "butc") are complex and difficult to automate, requiring many home-grown scripts and much user intervention for error recovery. An additional complaint was that the traditional AFS tools do not support file- or directory-level backups and restores; data must be backed up and restored at the volume level.

Mitch Collinsworth of Cornell presented work to make AMANDA, the free backup software from the University of Maryland, suitable for AFS backups. Using AMANDA for AFS backups allows one to share AFS backup tapes with non-AFS backups, easily run multiple backups in parallel, automate error recovery, and provide a robust degraded mode that prevents tape errors from stopping backups altogether. Their implementation allows for full volume restores as well as individual directory and file restores. They have finished coding this work and are in the process of testing and documenting it.

Peter Honeyman of CITI at the University of Michigan spoke about work he has proposed to replace Rx with RPCSEC GSS in OpenAFS; this would allow AFS to use a TCP-based transport mechanism rather than the UDP-based Rx, and possibly gain better congestion control, dynamic adaptation, and fragmentation avoidance as a result. RPCSEC GSS uses the GSSAPI to authenticate SUN ONC RPC. RPCSEC GSS is transport agnostic, provides strong security, is a developing Internet standard, and has multiple open source implementations. Backward compatibility with existing AFS servers and clients is an important goal of this project.


IMPROVING WAIT-FREE ALGORITHMS FOR INTERPROCESS COMMUNICATION IN EMBEDDED REAL-TIME SYSTEMS

Hai Huang, Padmanabhan Pillai, Kang G. Shin, University of Michigan

The main characteristic of a real-time, time-sensitive system is its predictable response time. But concurrency management creates hurdles to achieving this goal because of the use of locks to maintain consistency. To solve this problem, many wait-free algorithms have been developed, but these are typically high-cost solutions. By taking advantage of the temporal characteristics of the system, however, the time and space overhead can be reduced.

The speaker presented an algorithm for temporal concurrency control and applied this technique to improve three wait-free algorithms. A single-writer, multiple-reader wait-free algorithm and a double-buffer algorithm were proposed.
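
As a generic illustration of the double-buffering idea (a sketch under my own assumptions, not the authors' algorithm or its temporal analysis), a single writer can fill the slot readers are not using and then flip an index, so neither side ever waits on a lock:

    class DoubleBuffer:
        """Single-writer, multiple-reader exchange without locks (conceptual sketch).

        The writer fills the inactive slot and publishes it by flipping `current`;
        readers always read whichever slot `current` points at. Real wait-free
        algorithms must also bound interference from preemption, which is where
        the temporal analysis described in the talk comes in.
        """

        def __init__(self, initial=None):
            self.slots = [initial, initial]
            self.current = 0            # index of the slot readers should use

        def write(self, value):
            idx = 1 - self.current      # work on the slot readers are not using
            self.slots[idx] = value
            self.current = idx          # single publish step

        def read(self):
            return self.slots[self.current]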

Using their transformation mechanism, they achieved an improvement of 17-66% in ACET and a 14-70% reduction in memory requirements for IPC algorithms. This mechanism is extensible and can be applied to other non-blocking IPC algorithms as well. Future work involves reducing the synchronizing overheads in more general IPC algorithms with multiple-writer semantics.

MOBILITY

Summarized by Praveen Yalagandula

ROBUST POSITIONING ALGORITHMS FOR DISTRIBUTED AD-HOC WIRELESS SENSOR NETWORKS

Chris Savarese, Jan Rabaey, University of California, Berkeley; Koen Langendoen, Delft University of Technology

The Pico Radio Network comprises more than a hundred sensors, monitors, and actuators equipped with wireless connectivity, with the following properties: (1) no infrastructure, (2) computation in a distributed fashion instead of centralized computation, (3) dynamic topology, (4) limited radio range,
(5) nodes that can act as repeaters, (6) devices of low cost and with low power, and (7) sparse anchor nodes (nodes with GPS capability). A positioning algorithm determines the geographical position of each node in the network.

In two-dimensional space, each node needs three reference positions to estimate its geographical position. There are two problems that make positioning difficult in the Pico Radio Network type setting: (1) a sparse anchor node problem and (2) a range error problem.

The two-phase approach that the authors have taken in solving the problem consists of (1) a hop-terrain algorithm, in which each node roughly guesses its location from distances calculated over multiple hops to the anchor points, and (2) a refinement algorithm, in which each node uses its neighbors' positions to refine its own position estimate. To guarantee convergence, this approach uses confidence metrics, assigning a value of 1.0 for anchor nodes and 0.1 to start for other nodes, and increasing these with each iteration.
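
A rough sketch of the refinement phase, under simplifying assumptions of my own (2-D coordinates, range measurements keyed by sorted node-id pairs, and a weighted-average update in place of the paper's least-squares step), shows how the confidence values bias estimates toward anchors:

    import math

    def refine(nodes, neighbors, ranges, iterations=10):
        """nodes: {id: {"pos": (x, y), "conf": float, "anchor": bool}}
        neighbors: {id: [neighbor ids]}
        ranges: {(a, b): measured distance}, keyed by sorted id pair."""
        for _ in range(iterations):
            for nid, node in nodes.items():
                if node["anchor"]:
                    continue                  # anchors keep conf = 1.0 and a fixed position
                x, y, wsum = 0.0, 0.0, 0.0
                for nb in neighbors[nid]:
                    nx, ny = nodes[nb]["pos"]
                    d = ranges[tuple(sorted((nid, nb)))]
                    # project the current estimate onto the circle of radius d around nb
                    dx, dy = node["pos"][0] - nx, node["pos"][1] - ny
                    norm = math.hypot(dx, dy) or 1e-9
                    px, py = nx + dx / norm * d, ny + dy / norm * d
                    w = nodes[nb]["conf"]     # trust confident neighbors more
                    x, y, wsum = x + w * px, y + w * py, wsum + w
                if wsum > 0:
                    node["pos"] = (x / wsum, y / wsum)
                    node["conf"] = min(1.0, node["conf"] + 0.1)   # grows each iteration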

A simulation tool, OMNeT++, was used for both phases and in various scenarios. The results show that they achieved position errors of less than 33% in a scenario with 5% anchor nodes, an average connectivity of 7, and 5% range measurement error.

Guidelines for anchor node deployment are: high connectivity (>10), a reasonable fraction of anchor nodes (>5%), and, for anchor placement, covered edges.

APPLICATION-SPECIFIC NETWORK MANAGEMENT FOR ENERGY-AWARE STREAMING OF POPULAR MULTIMEDIA FORMATS

Surendar Chandra, University of Georgia, and Amin Vahdat, Duke University

The main hindrance in supporting the increasing demand for mobile multimedia on PDAs is the battery capacity of the small devices. The idle-time power consumption is almost the same as the receive power consumption on typical wireless network cards (1319mW vs. 1425mW), while the sleep state consumes far less power (177mW). To let the system consume energy proportional to the stream quality, the network card should transition to sleep state aggressively between each packet.

The speaker presented previous work where different multimedia streams were studied for PDAs: MS Media, Real Media, and QuickTime. It was found that if the inter-packet gap is predictable, then huge savings are possible; for example, about 80% savings in MS Media streams with few packet losses. The limitation of the IEEE 802.11b power-save mode comes into play when there are two or more nodes, producing a delay between beacon time and when the node receives/sends packets. This badly affects the streams, since higher energy is consumed while waiting.

The authors propose traffic shaping for energy conservation, where this is done by proxy such that packets arrive at regular intervals. This is achieved using a local proxy in the access point and a client-side proxy in the mobile host. Simulations show that traffic shaping reduces energy consumption and also reveal that the higher bandwidth streams have a lower energy metric (mJ/kB).
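
The shaping idea can be sketched as a pacing loop: a proxy buffers bursty arrivals and releases them on a fixed period, so the client's card can sleep between release times. This is only an illustration of the principle, not the authors' proxy implementation.

    import queue, threading, time

    class PacingProxy:
        """Buffer incoming packets and forward them at a fixed interval.

        A predictable inter-packet gap is what lets the wireless card drop into
        its sleep state between packets instead of idling.
        """

        def __init__(self, send, interval=0.1):
            self.send = send                    # callback that transmits one packet
            self.interval = interval
            self.buf = queue.Queue()
            threading.Thread(target=self._drain, daemon=True).start()

        def enqueue(self, packet):
            self.buf.put(packet)                # arrivals may be bursty

        def _drain(self):
            while True:
                packet = self.buf.get()         # block until something is queued
                self.send(packet)
                time.sleep(self.interval)       # enforce the regular spacing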

More information is at http://greenhouse.cs.uga.edu.

CHARACTERIZING ALERT AND BROWSE SERVICES OF MOBILE CLIENTS

Atul Adya, Paramvir Bahl, and Lili Qiu, Microsoft Research

Even though there is a dramatic increase in Internet access from wireless devices, there are not many studies done on characterizing this traffic. In this paper, the authors characterize the traffic observed on the MSN Web site with both notification and browse traces.


Around 33 million browsing accesses and about 325 million notification entries are present in the traces. Three types of analyses are done for each one of these two services: content analysis, concerning the most popular content categories and their distribution; popularity analysis, or the popularity distribution of documents; and user behavior analysis.

The analysis of the notification logs shows that document access rates follow a Zipf-like distribution, with most of the accesses concentrated on a small number of messages, and the accesses exhibit geographical locality: users from the same locality tend to receive similar notification content. The browser log analysis shows that a smaller set of URLs are accessed most times, though the access pattern does not fit any Zipf curve, and the highly accessed URLs remain stable. A correlation study between the notification and browsing services shows that wireless users have a moderate correlation of 0.12.

FREENIX TRACK SESSIONS

BUILDING APPLICATIONS

Summarized by Matt Butner

INTERACTIVE 3-D GRAPHICS APPLICATIONS FOR TCL

Oliver Kersting, Jürgen Döllner, Hasso Plattner Institute for Software Systems Engineering, University of Potsdam

The integration of 3-D image rendering functionality into a scripting language permits interactive and animated 3-D development and application without the formalities and precision demanded by low-level C/C++ graphics and visualization libraries. The large and complex C++ API of the Virtual Rendering System (VRS) can be combined with the conveniences of the Tcl scripting language. The mapping of class interfaces is done via an automated process and generates respective wrapper classes, all of which ensures complete API accessibility and functionality without any
significant performance penalty. The general C++-to-Tcl mapper SWIG grants the necessary object and hierarchical abilities without the use of object-oriented Tcl extensions. The final API mapping techniques address C++ features such as classes, overloaded methods, enumerations, and inheritance relations. All are implemented in a proof-of-concept that maps the complete C++ API of the VRS to Tcl and are showcased in a complete interactive 3-D-map system.

THE AGFL GRAMMAR WORK LAB

Cornelis H.A. Coster, Erik Verbruggen, University of Nijmegen (KUN)

The growth and implementation of Natural Language Processing (NLP) is the cornerstone of the continued evolution and implementation of truly intelligent search machines and services. In part due to the growing collections of computer-stored, human-readable documents in the public domain, the implementation of linguistic analysis will become necessary for desirable precision and resolution. Subsequently, such tools and linguistic resources must be released into the public domain, and they have done so with the release of the AGFL Grammar Work Lab under the GNU Public License as a tool for linguistic research and the development of NLP-based applications.

The AGFL (Affix Grammars over a Finite Lattice) Grammar Work Lab meshes context-free grammars with finite set-valued features that are acceptable to a range of languages. In computer science terms, "Syntax rules are procedures with parameters and a non-deterministic execution." The English Phrases for Information Retrieval (EP4IR), released with the AGFL-GWL as a robust grammar of English, is an AGFL-GWL-generated English parser that outputs "Head/Modified" frames. The sentences "CompanyX sponsored this conference" and "This conference was sponsored by CompanyX" both generate [CompanyX[sponsored conference]]. The hope is that public availability of such tools will encourage further development of grammar and lexical software systems.

SWILL: A SIMPLE EMBEDDED WEB SERVER LIBRARY

Sotiria Lampoudi, David M. Beazley, University of Chicago

This paper won the FREENIX Best Student Paper award.

SWILL (Simple Web Interface Link Library) is a simple Web server in the form of a C library whose development was motivated by a wish to give cool applications an interface to the Web. The SWILL library provides a simple interface that can be efficiently implemented for tasks that vary from flexible Web-based monitoring to software debugging and diagnostics. Though originally designed to be integrated with high-performance scientific simulation software, the interface is generic enough to allow for unbounded uses. SWILL is a single-threaded server relying upon non-blocking I/O through the creation of a temporary server which services I/O requests. SWILL does not provide SSL or cryptographic authentication, but does have HTTP authentication abilities.

A fantastic feature of SWILL is its support for SPMD-style parallel applications which utilize MPI, proving valuable for Beowulf clusters and large parallel machines. Another practical application was the implementation of SWILL in a modified Yalnix emulator by University of Chicago Operating Systems courses, which utilized the added Yalnix functionality for OS development and debugging. SWILL requires minimal memory overhead and relies upon the HTTP/1.0 protocol.

NETWORK PERFORMANCE

Summarized by Florian Buchholz

LINUX NFS CLIENT WRITE PERFORMANCE

Chuck Lever, Network Appliance; Peter Honeyman, CITI, University of Michigan

Lever introduced a benchmark to measure NFS client write performance. Client performance is difficult to measure due to hindrances such as poor hardware or bandwidth limitations. Furthermore, measuring application performance does not identify weaknesses specifically at the client side. Thus, a benchmark was developed trying to exercise only data transfers in one direction between server and application. For this purpose, the benchmark was based on the block sequential write portion of the Bonnie file system benchmark. Once a benchmark for NFS clients is established, it can be used to improve client performance.

The performance measurements were performed with an SMP Linux client and both a Linux NFS server and a Network Appliance F85 filer. During testing, a periodic jump in write latency time was discovered. This was due to a rather large number of pending write operations that were scheduled to be written after certain threshold values were exceeded. By introducing a separate daemon that flushes the cached write requests, the spikes could be eliminated, but as a result the average latency grows over time. The problem could be traced to a function that scans a linked list of write requests. After having added a hashtable to improve lookup performance, the latency improved considerably.

The improved client was then used to measure throughput against the two servers. A discrepancy between the Linux server and the filer test was noticed, and the reason for that was traced back to a global kernel lock that was unnecessarily held when accessing the network stack. After correcting this, performance improved further. However,
the measurements showed that a client may run slower when paired with fast servers on fast networks. This is due to heavy client interrupt loads, more network processing on the client side, and more global kernel lock contention.

The source code of the project is available at http://www.citi.umich.edu/projects/nfs-perf/patches.

A STUDY OF THE RELATIVE COSTS OF NETWORK SECURITY PROTOCOLS

Stefan Miltchev and Sotiris Ioannidis, University of Pennsylvania; Angelos Keromytis, Columbia University

With the increasing need for security and integrity of remote network services, it becomes important to quantify the communication overhead of IPSec and compare it to alternatives such as SSH, SCP, and HTTPS.

For this purpose, the authors set up three testing networks: direct link, two hosts separated by two gateways, and three hosts connecting through one gateway. Protocols were compared in each setup, and manual keying was used to eliminate connection setup costs. For IPSec, the different encryption algorithms AES, DES, 3DES, hardware DES, and hardware 3DES were used. In detail, FTP was compared to SFTP, SCP, and FTP over IPSec; HTTP to HTTPS and HTTP over IPSec; and NFS to NFS over IPSec and local disk performance.

The result of the measurements was that IPSec outperforms other popular encryption schemes. Overall, unencrypted communication was fastest, but in some cases, like FTP, the overhead can be small. The use of crypto hardware can significantly improve performance. For future work, the inclusion of setup costs, hardware-accelerated SSL, SFTP, and SSH was mentioned.

CONGESTION CONTROL IN LINUX TCP

Pasi Sarolahti, University of Helsinki; Alexey Kuznetsov, Institute for Nuclear Research at Moscow

Having attended the talk and read the paper, I am still unclear about whether the authors are merely describing the design decisions of TCP congestion control or whether they are actually the creators of that part of the Linux code. My guess leans toward the former.

In the talk, the speaker compared the TCP protocol congestion control measures according to IETF and RFC specifications with the actual Linux implementation, which does conform to the basic principles but nevertheless has differences. A specific emphasis was placed on retransmission mechanisms and the congestion window. Also, several TCP enhancements were discussed and compared: the NewReno algorithm, Selective ACKs (SACK), and Forward ACKs (FACK).

In some instances, Linux does not conform to the IETF specifications. The fast recovery does not fully follow RFC 2582, since the threshold for triggering retransmit is adjusted dynamically and the congestion window's size is not changed. Also, the roundtrip-time estimator and the RTO calculation differ from RFC 2988, since Linux uses more conservative RTT estimates and a minimum RTO of 200ms. The performance measures showed that, with the additional Linux-specific features enabled, slightly higher throughput, more steady data flow, and fewer unnecessary retransmissions can be achieved.
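
For reference, the textbook RTO computation the Linux estimator is being compared against (RFC 2988) looks roughly like the sketch below; the 200ms floor mirrors the Linux behavior mentioned above, and the real Linux code is more conservative and more involved than this.

    class RtoEstimator:
        """RFC 2988-style RTO calculation (a sketch, not the actual Linux code)."""

        def __init__(self, min_rto=0.2, max_rto=120.0):
            self.srtt = None        # smoothed round-trip time
            self.rttvar = None      # round-trip time variation
            self.min_rto = min_rto  # Linux clamps RTO at 200ms; RFC 2988 suggests 1s
            self.max_rto = max_rto

        def sample(self, rtt):
            if self.srtt is None:                 # first measurement
                self.srtt = rtt
                self.rttvar = rtt / 2.0
            else:
                self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - rtt)
                self.srtt = 0.875 * self.srtt + 0.125 * rtt
            rto = self.srtt + 4.0 * self.rttvar
            return min(self.max_rto, max(self.min_rto, rto))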

XTREME XCITEMENT

Summarized by Steve Bauer

THE FUTURE IS COMING: WHERE THE X WINDOW SHOULD GO

Jim Gettys Compaq Computer Corp

Jim Gettys, one of the principal authors of the X Window System, outlined the near-term objectives for the system, primarily focusing on the changes and infrastructure required to enable replication and migration of X applications. Providing better support for this functionality would enable users to retrieve or duplicate X applications between their servers at home and work.


One interesting example of an application that currently is capable of migration and replication is Emacs. To create a new frame on DISPLAY, try "M-x make-frame-on-display <Return> DISPLAY <Return>".

However, technical challenges make replication and migration difficult in general. These include the "major headaches" of server-side fonts, the nonuniformity of X servers and screen sizes, and the need to appropriately retrofit toolkits. Authentication and authorization issues are obviously also important. The rest of the talk delved into some of the details of these interesting technical challenges.

HACKING IN THE KERNEL

Summarized by Hai Huang

AN IMPLEMENTATION OF SCHEDULER ACTIVATIONS ON THE NETBSD OPERATING SYSTEM

Nathan J Williams Wasabi Systems

Scheduler activation is an old idea. Basically, there are benefits and drawbacks to using solely kernel-level or user-level threading. Scheduler activation is able to combine the two layers of control to provide more concurrency in the system.

In his talk, Nathan gave a fairly detailed description of the implementation of scheduler activations in the NetBSD kernel. One important change in the implementation is to differentiate the thread context from the process context. This is done by defining a separate data structure for these threads and relocating some of the information that was embedded in the process context to these thread contexts. The stack was especially a concern due to the upcall: special handling must be done to make sure that the upcall doesn't mess up the stack, so that the preempted user-level thread can continue afterwards. Lastly, Nathan explained that signals were handled by upcalls.


AUTHORIZATION AND CHARGING IN PUBLIC WLANS USING FREEBSD AND 802.1X

Pekka Nikander, Ericsson Research NomadicLab

802.1x standards are well known in the wireless community as link-layer authentication protocols. In this talk, Pekka explained some novel ways of using the 802.1x protocols that might be of interest to people on the move. It is possible to set up a public WLAN that would support various charging schemes via virtual tokens, which people can purchase or earn and later use.

This is implemented on FreeBSD using the netgraph utility. It is basically a filter in the link layer that would differentiate traffic based on the MAC address of the client node, which is either authenticated, denied, or let through. The overhead for this service is fairly minimal.

ACPI IMPLEMENTATION ON FREEBSD

Takanori Watanabe Kobe University

ACPI (Advanced Configuration and Power Management Interface) was proposed as a joint effort by Intel, Toshiba, and Microsoft to provide a standard and finer-control method of managing power states of individual devices within a system. Such low-level power management is especially important for those mobile and embedded systems that are powered by fixed-capacity energy batteries.

Takanori explained that the ACPI specification is composed of three parts: tables, BIOS, and registers. He was able to implement some functionalities of ACPI in a FreeBSD kernel. The ACPI Component Architecture was implemented by Intel, and it provides a high-level ACPI API to the operating system. Takanori's ACPI implementation is built upon this underlying layer of APIs.

ACCESS CONTROL

Summarized by Florian Buchholz

DESIGN AND PERFORMANCE OF THE OPENBSD STATEFUL PACKET FILTER (PF)

Daniel Hartmeier Systor AG

Daniel Hartmeier described the new stateful packet filter (pf) that replaced IPFilter in the OpenBSD 3.0 release. IPFilter could no longer be included due to licensing issues, and thus there was a need to write a new filter making use of optimized data structures.

The filter rules are implemented as a linked list which is traversed from start to end. Two actions may be taken according to the rules, "pass" or "block": a "pass" action forwards the packet, and a "block" action will drop it. Where more than one rule matches a packet, the last rule wins. Rules that are marked as "final" will immediately terminate any further rule evaluation. An optimization called "skip-steps" was also implemented, where blocks of similar rules are skipped if they cannot match the current packet; these skip-steps are calculated when the rule set is loaded. Furthermore, a state table keeps track of TCP connections, and only packets that match the sequence numbers are allowed. UDP and ICMP queries and replies are considered in the state table, where initial packets will create an entry for a pseudo-connection with a low initial timeout value. The state table is implemented using a balanced binary search tree. NAT mappings are also stored in the state table, whereas application proxies reside in user space. The packet filter also is able to perform fragment reassembly and to modulate TCP sequence numbers to protect hosts behind the firewall.
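
The evaluation order described above (last matching rule wins unless a rule is marked final) can be sketched as follows; this illustrates the semantics only, not pf's actual data structures or its skip-step optimization.

    def evaluate(rules, packet):
        """rules: list of dicts with 'match' (predicate), 'action' ('pass'/'block'),
        and 'final' (bool). Returns the action to take for the packet."""
        decision = "pass"                  # default policy if nothing matches
        for rule in rules:
            if rule["match"](packet):
                decision = rule["action"]  # later matches override earlier ones
                if rule["final"]:
                    break                  # a final rule stops evaluation immediately
        return decision

    # Example: block everything from one host, but let its DNS traffic through.
    rules = [
        {"match": lambda p: p["src"] == "10.0.0.5", "action": "block", "final": False},
        {"match": lambda p: p["src"] == "10.0.0.5" and p["sport"] == 53,
         "action": "pass", "final": True},
    ]
    print(evaluate(rules, {"src": "10.0.0.5", "sport": 53}))   # -> pass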

Pf was compared against IPFilter as well as Linux's iptables by measuring throughput and latency with increasing traffic rates and different packet sizes. In a test with a fixed rule set size of 100, iptables outperformed the other two filters, whose results were close to each other. In a second test, where the rule set size was continually increased, iptables consistently had about twice the throughput of the other two (which evaluate the rule set on both the incoming and outgoing interfaces). A third test compared only pf and IPFilter, using a single rule that created state in the state table, with a fixed-state entry size. Pf reached an overloaded state much later than IPFilter. The experiment was repeated with a variable-state entry size, and pf performed much better than IPFilter for a small number of states.

In general, rule set evaluation is expensive, and benchmarks only reflect extreme cases, whereas in real life other behavior should be observed. Furthermore, the benchmarks show that stateful filtering can actually improve performance, due to cheap state-table lookups as compared to rule evaluation.

ENHANCING NFS CROSS-ADMINISTRATIVE DOMAIN ACCESS

Joseph Spadavecchia and Erez Zadok, Stony Brook University

The speaker presented modifications to an NFS server that allow improved NFS access between administrative domains. A problem lies in the fact that NFS assumes a shared UID/GID space, which makes it unsafe to export files outside the administrative domain of the server. Design goals were to leave the protocol and existing clients unchanged, a minimum amount of server changes, flexibility, and increased performance.

To solve the problem, two techniques are utilized: "range-mapping" and "file-cloaking." Range-mapping maps IDs between client and server. The mapping is performed on a per-export basis and has to be manually set up in an export file. The mappings can be 1-1, N-N, or N-1. In file-cloaking, the server restricts file access based on UID/GID and special cloaking-mask bits. Here, users can only access their own file permissions. The policy on whether or not the file is visible to others is set by the cloaking mask, which is logically ANDed with the
file's protection bits. File-cloaking only works, however, if the client doesn't hold cached copies of directory contents and file attributes. Because of this, the clients are forced to re-read directories by incrementing the mtime value of the directory each time it is listed.
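
A range-mapping table of the kind described (1-1, N-N, or N-1) amounts to a per-export translation of IDs. The sketch below uses an invented table format to show the idea; it is not the authors' export-file syntax.

    # Per-export UID maps (hypothetical format). Each entry maps an inclusive
    # client range onto a server range of equal size (1-1 / N-N), or onto a
    # single server ID (N-1) when the server range collapses to one value.
    EXPORT_MAPS = {
        "/export/shared": [
            {"client": (1000, 1999), "server": (20000, 20999)},   # N-N
            {"client": (0, 65535),   "server": (65534, 65534)},   # N-1 squash
        ],
    }

    def map_uid(export, client_uid):
        """Translate a client UID into the server's UID space; first match wins."""
        for m in EXPORT_MAPS.get(export, []):
            clo, chi = m["client"]
            slo, shi = m["server"]
            if clo <= client_uid <= chi:
                return slo if slo == shi else slo + (client_uid - clo)
        return client_uid                  # no mapping: pass the ID through

    print(map_uid("/export/shared", 1042))  # -> 20042
    print(map_uid("/export/shared", 3000))  # -> 65534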

To test the performance of the modified server, five different NFS configurations were evaluated: an unmodified NFS server was compared against one server with the modified code included but not used, one with only range-mapping enabled, one with only file-cloaking enabled, and one version with all modifications enabled. For each setup, different file system benchmarks were run. The results show only a small overhead when the modifications are used, generally an increase of below 5%. Another experiment tested the performance of the system with an increasing number of mapped or cloaked entries on a system. The results show that an increase from 10 to 1,000 entries resulted in a maximum of about 14% cost in performance.

One member of the audience pointed out that if clients choose to ignore the changed mtimes from the server and thus still hold caches of the directory entries, the file-cloaking mechanism could be defeated. After a rather lengthy debate, the speaker had to concede that the model doesn't add any extra security. Another question was asked about scalability of the setup of range-mapping. The speaker referred to application-level tools that could be developed for that purpose.

The software is available at ftp://ftp.fsl.cs.sunysb.edu/pub/enf.

ENGINEERING OPEN SOURCE SOFTWARE

Summarized by Teri Lampoudi

NINGAUI: A LINUX CLUSTER FOR BUSINESS

Andrew Hume, AT&T Labs-Research; Scott Daniels, Electronic Data Systems Corp.

Ningaui is a general-purpose, highly available, resilient architecture built from commodity software and hardware. Emphatically, however, Ningaui is not a Beowulf. Hume calls the cluster design the "Swiss canton model," in which there are a number of loosely affiliated, independent nodes with data replicated among them. Jobs are assigned by bidding and leases, and cluster services done as session-based servers are sited via generic job assignment. The emphasis is on keeping the architecture end-to-end, checking all work via checksums, and logging everything. The resilience and high availability required by their goal of 8-5 maintenance (vs. the typical 24-7 model, where people get paged whenever the slightest thing goes wrong regardless of the time of day) is achieved by job restartability. Finally, all computation is performed on local data without the use of NFS or network-attached storage.

Hume's message is a hopeful one: despite the many problems encountered (things like kernel and service limits, auto-installation problems, TCP storms, and the like), the existence of source code and the paranoid practice of logging and checksumming everything has helped. The final product performs reasonably well, and it appears that the resilient techniques employed do make a difference. One drawback, however, is that the software mentioned in the paper is not yet available for download.

CPCMS: A CONFIGURATION MANAGEMENT SYSTEM BASED ON CRYPTOGRAPHIC NAMES

Jonathan S. Shapiro and John Vanderburgh, Johns Hopkins University

This paper won the FREENIX Best Paper award.

The basic notion behind the project is the fact that everyone has a pet complaint about CVS, and yet it is currently the configuration manager in most widespread use. Shapiro has unleashed an alternative. Interestingly, he did not begin but ended with the usual host of reasons why CVS is bad. The talk instead began abruptly by characterizing the job
of a software configuration manager, continued by stating the namespaces which it must handle, and wrapped up with the challenges faced.

X MEETS Z: VERIFYING CORRECTNESS IN THE PRESENCE OF POSIX THREADS

Bart Massey, Portland State University; Robert T. Bauer, Rational Software Corp.

Massey delivered a humorous talk on the insight gained from applying the Z formal specification notation to system software design, rather than the more informal analysis and design process normally used.

The story is told with respect to writing XCB, which replaces the Xlib protocol layer and is supposed to be thread-friendly. But where threads are concerned, deadlock avoidance becomes a hard problem that cannot be solved in an ad-hoc manner, and full model checking is also too hard. In this case, Massey resorted to Z specification to model the XCB lower layer, abstract away locking and data transfers, and locate fundamental issues. Essentially, the difficulties of searching the literature and locating information relevant to the problem at hand were overcome. As Massey put it, "the formal method saved the day."

FILE SYSTEMS

Summarized by Bosko Milekic

PLANNED EXTENSIONS TO THE LINUX EXT2/EXT3 FILESYSTEM

Theodore Y. Ts'o, IBM; Stephen Tweedie, Red Hat

The speaker presented improvements to the Linux Ext2 file system, with the goal of allowing for various expansions while striving to maintain compatibility with older code. Improvements have been facilitated by a few extra superblock fields that were added to Ext2 not long before the Linux 2.0 kernel was released. The fields allow for file system features to be added without compromising the existing setup; this is done by providing
bits indicating the impact of the added features with respect to compatibility. Namely, a file system with the "incompat" bit marked is not allowed to be mounted. Similarly, a "read-only" marking would only allow the file system to be mounted read-only.

Directory indexing replaces linear directory searches with a faster search using a fixed-depth tree and hashed keys. File system size can be dynamically increased, and the expanded inode, doubled from 128 to 256 bytes, allows for more extensions.

Other potential improvements were discussed as well, in particular pre-allocation for contiguous files, which allows for better performance in certain setups by pre-allocating contiguous blocks. Security-related modifications, extended attributes, and ACLs were mentioned. An implementation of these features already exists but has not yet been merged into the mainline Ext2/3 code.

RECENT FILESYSTEM OPTIMISATIONS ON FREEBSD

Ian Dowse, Corvil Networks; David Malone, CNRI, Dublin Institute of Technology

David Malone presented four important file system optimizations for FreeBSD OS: soft updates, dirpref, vmiodir, and dirhash. It turns out that certain combinations of the optimizations (beautifully illustrated in the paper) may yield performance improvements of anywhere between 2 and 10 orders of magnitude for real-world applications.

All four of the optimizations deal with file system metadata. Soft updates allow for asynchronous metadata updates. Dirpref changes the way directories are organized, attempting to place child directories closer to their parents, thereby increasing locality of reference and reducing disk-seek times. Vmiodir trades some extra memory in order to achieve better directory caching. Finally, dirhash, which was implemented by Ian Dowse, changes the way in which entries are searched in a directory. Specifically, FFS uses a linear search to find an entry by name; dirhash builds a hashtable of directory entries on the fly. For a directory of size n with a working set of m files, a search that in certain cases could have been O(nm) has been reduced, due to dirhash, to effectively O(n + m).
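
The gain comes from replacing repeated linear scans with a hashtable built the first time a directory is searched; a toy sketch of the effect (not the FFS code):

    def linear_lookup(entries, name):
        """O(n) per lookup, as in a plain linear directory scan."""
        for i, e in enumerate(entries):
            if e == name:
                return i
        return -1

    class DirHash:
        """Build a name -> slot table once, then answer lookups in O(1).

        For m lookups in a directory of n entries this is O(n + m) total work,
        versus O(nm) for repeated linear scans.
        """
        def __init__(self, entries):
            self.table = {e: i for i, e in enumerate(entries)}   # built on first use

        def lookup(self, name):
            return self.table.get(name, -1)

    entries = ["file%04d" % i for i in range(10000)]
    d = DirHash(entries)
    assert d.lookup("file9999") == linear_lookup(entries, "file9999")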

FILESYSTEM PERFORMANCE AND SCALABILITY IN LINUX 2.4.17

Ray Bryant, SGI; Ruth Forester, IBM LTC; John Hawkes, SGI

This talk focused on performance evaluation of a number of file systems available and commonly deployed on Linux machines. Comparisons under various configurations of Ext2, Ext3, ReiserFS, XFS, and JFS were presented.

The benchmarks chosen for the data gathering were pgmeter, filemark, and AIM Benchmark Suite VII. Pgmeter measures the rate of data transfer of reads/writes of a file under a synthetic workload. Filemark is similar to postmark in that it is an operation-intensive benchmark, although filemark is threaded and offers various other features that postmark lacks. AIM VII measures performance for various file-system-related functionalities; it offers various metrics under an imposed workload, thus stressing the performance of the file system not only under I/O load but also under significant CPU load.

Tests were run on three different setups: a small, a medium, and a large configuration. ReiserFS and Ext2 appear at the top of the pile for smaller and medium setups. Notably, XFS and JFS perform worse for smaller system configurations than the others, although XFS clearly appears to generate better numbers than JFS. It should be noted that XFS seems to scale well under a higher load. This was most evident in the large-system results, where XFS appears to offer the best overall results.

THINGS TO THINK ABOUT

Summarized by Bosko Milekic

SPEEDING UP KERNEL SCHEDULER BY REDUCING CACHE MISSES

Shuji Yamamura, Akira Hirai, Mitsuru Sato, Masao Yamamoto, Akira Naruse, Kouichi Kumon, Fujitsu Labs

This was an interesting talk pertaining to the effects of cache coloring for task structures in the Linux kernel scheduler (Linux kernel 2.4.x). The speaker first presented some interesting benchmark numbers for the Linux scheduler, showing that as the number of processes on the task queue was increased, the performance decreased. The authors used some really nifty hardware to measure the number of bus transactions throughout their tests and were thus able to reasonably quantify the impact that cache misses had in the Linux scheduler.

Their experiments led them to implement a cache coloring scheme for task structures, which were previously aligned on 8KB boundaries and therefore were being eventually mapped to the same cache lines. This unfortunate placement of task structures in memory induced a significant number of cache misses as the number of tasks grew in the scheduler.

The implemented solution consisted of aligning task structures to more evenly distribute cached entries across the L2 cache. The result was inevitably fewer cache misses in the scheduler. Some negative effects were observed in certain situations; these were primarily due to more cache slots being used by task structure data in the scheduler, thus forcing data previously cached there to be pushed out.
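
Why 8KB-aligned structures collide is visible from the set-index arithmetic. The toy calculation below assumes a cache geometry of my own choosing (64-byte lines, 128 sets), not the hardware used in the paper, and shows how a per-task color offset spreads otherwise identical set indices:

    LINE = 64      # assumed cache line size in bytes
    SETS = 128     # assumed number of cache sets (direct-mapped for simplicity)
    ALIGN = 8192   # task structures on 8KB boundaries

    def set_index(addr):
        return (addr // LINE) % SETS

    # Without coloring: every 8KB-aligned structure maps to the same set.
    plain = [set_index(i * ALIGN) for i in range(8)]

    # With coloring: offset each structure by a different multiple of the line size.
    colored = [set_index(i * ALIGN + (i % SETS) * LINE) for i in range(8)]

    print(plain)    # [0, 0, 0, 0, 0, 0, 0, 0] -> conflict misses pile up
    print(colored)  # [0, 1, 2, 3, 4, 5, 6, 7] -> entries spread across sets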

WORK-IN-PROGRESS REPORTS

Summarized by Brennan Reynolds

RESOURCE VIRTUALIZATION TECHNIQUES FOR WIDE-AREA OVERLAY NETWORKS

Kartik Gopalan University Stony Brook

This work addressed the issue of provisioning a maximum number of virtual overlay networks (VONs) with diverse quality of service (QoS) requirements on a single physical network. Each of the VONs is logically isolated from the others to ensure the QoS requirements. Gopalan mentioned several critical research issues with this problem that are currently being investigated. Dealing with how to provision the network at various levels (link, route, or path) and then enforce the provisioning at run-time is one of the toughest challenges. Currently, Gopalan has developed several algorithms to handle admission control, end-to-end QoS, route selection, scheduling, and fault tolerance in the network.

For more information, visit http://www.ecsl.cs.sunysb.edu.

VISUALIZING SOFTWARE INSTABILITY

Jennifer Bevan, University of California, Santa Cruz

Detection of instability in software has typically been an afterthought. The point in the development cycle when the software is reviewed for instability is usually after it is difficult and costly to go back and perform major modifications to the code base. Bevan has developed a technique to allow the visualization of unstable regions of code that can be used much earlier in the development cycle. Her technique creates a time series of dependency graphs that include clusters and lines called fault lines. From the graphs, a developer is able to easily determine where the unstable sections of code are and proactively restructure them. She is currently working on a prototype implementation.

For more information, visit http://www.cse.ucsc.edu/~jbevan/evo_viz.

RELIABLE AND SCALABLE PEER-TO-PEER WEB DOCUMENT SHARING

Li Xiao William and Mary College

The idea presented by Xiao would allow end users to share the content of their Web browser caches with neighboring Internet users. The rationale for this is that today's end users are increasingly connected to the Internet over high-speed links, and the browser caches are becoming large storage systems. Therefore, if an individual accesses a page which does not exist in their local cache, Xiao is suggesting that they first query other end users for the content before trying to access the machine hosting the original. This strategy does have some serious problems associated with it that still need to be addressed, including ensuring the integrity of the content and protecting the identity and privacy of the end users.

SEREL: FAST BOOTING FOR UNIX

Leni Mayo Fast Boot Software

Serel is a tool that generates a visual representation of a UNIX machine's boot-up sequence. It can be used to identify the critical path and can show if a particular service or process blocks for an extended period of time. This information could be used to determine where the largest performance gains could be realized by tuning the order of execution at boot-up. Serel creates a dependency graph, expressed in XML, during boot-up. This graph is then used to create the visual representation. Currently the tool only works on POSIX-compliant systems, but Mayo stated that he would be porting it to other platforms. Other extensions that were mentioned included having the metadata include the use of shared libraries and monitoring the suspend/resume sequence of portable machines.

For more information, visit http://www.fastboot.org.


BERKELEY DB XML

John Merrells Sleepycat Software

Merrells gave a quick introduction and overview of the new XML library for Berkeley DB. The library specializes in storage and retrieval of XML content through a tool called XPath. The library allows for multiple containers per document and stores everything natively as XML. The user is also given a wide range of elements to create indices with, including edges, elements, text strings, or presence. The XPath tool consists of a query parser, generator, optimizer, and execution engine. To conclude his presentation, Merrells gave a live demonstration of the software.

For more information, visit http://www.sleepycat.com/xml.

CLUSTER-ON-DEMAND (COD)

Justin Moore Duke University

Modern clusters are growing at a rapid rate. Many have pushed beyond the 5,000-machine mark, and deploying them results in large expenses as well as management and provisioning issues. Furthermore, if the cluster is "rented" out to various users, it is very time-consuming to configure it to a user's specs, regardless of how long they need to use it. The COD work presented creates dynamic virtual clusters within a given physical cluster. The goal was to have a provisioning tool that would automatically select a chunk of available nodes and install the operating system and middleware specified by the customer in a short period of time. This would allow a greater use of resources, since multiple virtual clusters can exist at once. By using a virtual cluster, the size can be changed dynamically. Moore stated that they have created a working prototype and are currently testing and benchmarking its performance.

For more information, visit http://www.cs.duke.edu/~justin/cod.


CATACOMB

Elias Sinderson, University of California, Santa Cruz

Catacomb is a project to develop a database-backed DASL module for the Apache Web server. It was designed as a replacement for WebDAV. The initial release of the module only contains support for a MySQL database, but could be extended to others. Sinderson briefly touched on the module's performance: it was comparable to the mod_dav Apache module for all query types but search. The presentation was concluded with remarks about adding support for the lock method and including ACL specifications in the future.

For more information, visit http://ocean.cse.ucsc.edu/catacomb.

SELF-ORGANIZING STORAGE

Dan Ellard Harvard University

Ellard's presentation introduced a storage system that tuned itself based on the workload of the system without requiring the intervention of the user. The intelligence was implemented as a virtual self-organizing disk that resides below any file system. The virtual disk observes the access patterns exhibited by the system and then attempts to predict what information will be accessed next. An experiment was done using an NFS trace at a large ISP on one of their email servers. Ellard's self-organizing storage system worked well, which he attributes to the fact that most of the files being requested were large email boxes. Areas of future work include exploration of the length and detail of the data collection stage as well as the CPU impact of running the virtual disk layer.

VISUAL DEBUGGING

John Costigan Virginia Tech

Costigan feels that the state of current debugging facilities in the UNIX world is not as good as it should be. He proposes the addition of several elements to programs like ddd and gdb. The first addition is including separate heap and stack components. Another is to display only certain fields of a data structure and have the ability to zoom in and out if needed. Finally, the debugger should provide the ability to visually present complex data structures, including linked lists, trees, etc. He said that a beta version of a debugger with these abilities is currently available.

For more information, visit http://infovis.cs.vt.edu/datastruct.

IMPROVING APPLICATION PERFORMANCE THROUGH SYSTEM-CALL COMPOSITION

Amit Purohit University of Stony Brook

Web servers perform a huge number of context switches and internal data copying during normal operation. These two elements can drastically limit the performance of an application, regardless of the hardware platform it is run on. The Compound System Call (CoSy) framework is an attempt to reduce the performance penalty for context switches and data copies. It includes a user-level library and several kernel facilities that can be used via system calls. The library provides the programmer with a complete set of memory-management functions. Performance tests using the CoSy framework showed a large saving for context switches and data copies. The only area where the savings between conventional libraries and CoSy were negligible was for very large file copies.

For more information visit http://www.cs.sunysb.edu/~purohit.

ELASTIC QUOTAS

John Oscar Columbia University

Oscar began by stating that most resources are flexible but that to date disks have not been. While approximately 80 percent of files are short-lived, disk quotas are hard limits imposed by administrators. The idea behind elastic quotas is to have a non-permanent storage area that each user can temporarily use. Oscar suggested the creation of an ehome directory structure to be used in deploying elastic quotas. Currently he has implemented elastic quotas as a stackable file system.

There were also several trends apparent in the data. Many users would ssh to their remote host (good) but then launch an email reader on the local machine, which would connect via POP to the same host and send the password in clear text (bad). This situation can easily be remedied by tunneling protocols like POP through SSH, but it appeared that many people were not aware this could be done. While most of his comments on the use of the wireless network were negative, the list of passwords he had collected showed that people were indeed using strong passwords. His recommendations were to educate and encourage people to use protocols like IPSec, SSH, and SSL when conducting work over a wireless network, because you never know who else is listening.

SYSTRACE

Niels Provos University of Michigan

How can one be sure the applications one is using actually do exactly what their developers said they do? Short answer: you can't. People today are using a large number of complex applications, which means it is impossible to check each application thoroughly for security vulnerabilities. There are tools out there that can help, though. Provos has developed a tool called systrace that allows a user to generate a policy of acceptable system calls a particular application can make. If the application attempts to make a call that is not defined in the policy, the user is notified and allowed to choose an action. Systrace includes an auto-policy generation mechanism, which uses a training phase to record the actions of all programs the user executes. If the user chooses not to be bothered by applications breaking policy, systrace allows default enforcement actions to be set. Currently systrace is implemented for FreeBSD and NetBSD, with a Linux version coming out shortly.
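
The flavor of such a per-application policy can be sketched in C as a whitelist of call names plus a default action; this is only an illustration of the concept, not systrace's policy syntax or its kernel enforcement mechanism.

#include <stdio.h>
#include <string.h>
/* A whitelist of system-call names with a default action for anything the
 * policy does not cover; purely conceptual, not systrace's policy format. */
enum action { ALLOW, DENY, ASK_USER };
struct policy_entry { const char *syscall_name; enum action act; };
static enum action check(const struct policy_entry *p, int n,
                         const char *name, enum action dflt)
{
    for (int i = 0; i < n; i++)
        if (strcmp(p[i].syscall_name, name) == 0)
            return p[i].act;
    return dflt;   /* call not covered by the policy */
}
int main(void)
{
    const struct policy_entry web_policy[] = {
        { "read", ALLOW }, { "write", ALLOW },
        { "socket", ALLOW }, { "execve", DENY },
    };
    const char *labels[] = { "allow", "deny", "ask user" };
    const char *attempted[] = { "read", "execve", "unlink" };
    for (int i = 0; i < 3; i++)
        printf("%-7s -> %s\n", attempted[i],
               labels[check(web_policy, 4, attempted[i], ASK_USER)]);
    return 0;
}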

For more information visit http://www.citi.umich.edu/u/provos/systrace.

VERIFIABLE SECRET REDISTRIBUTION FOR

SURVIVABLE STORAGE SYSTEMS

Ted Wong Carnegie Mellon University

Wong presented a protocol that can be used to re-create a file distributed over a set of servers, even if one of the servers is damaged. In his scheme the user must choose how many shares the file is split into and the number of servers it will be stored across. The goal of this work is to provide persistent, secure storage of information even if it comes under attack. Wong stated that his design of the protocol was complete, and he is currently building a prototype implementation of it.
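
As a much simpler stand-in for the threshold-based, verifiable protocol described in the talk, the C sketch below shows only the basic splitting idea using an n-of-n XOR scheme: no single share reveals anything, and XORing all shares back together recovers the data. Wong's protocol additionally lets any k of n shares reconstruct the secret and supports verifiable redistribution of shares, which this sketch does not attempt.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define SHARES 3
int main(void)
{
    const unsigned char secret[] = "file block";
    const size_t len = sizeof secret;
    unsigned char share[SHARES][sizeof secret];
    unsigned char out[sizeof secret] = { 0 };
    /* The first SHARES-1 shares are random; the last is chosen so that the
     * XOR of all shares equals the secret. */
    srand(42);
    memcpy(share[SHARES - 1], secret, len);
    for (int s = 0; s < SHARES - 1; s++)
        for (size_t i = 0; i < len; i++) {
            share[s][i] = (unsigned char)rand();
            share[SHARES - 1][i] ^= share[s][i];
        }
    /* Reconstruction: XOR every share together. */
    for (int s = 0; s < SHARES; s++)
        for (size_t i = 0; i < len; i++)
            out[i] ^= share[s][i];
    printf("recovered: %s\n", (char *)out);
    return 0;
}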

For more information visit http://www.cs.cmu.edu/~tmwong/research.

THE GURU IS IN SESSIONS

LINUX ON LAPTOP/PDA

Bdale Garbee HP Linux Systems Operation

Summarized by Brennan Reynolds

This was a roundtable-like discussion with Garbee acting as moderator. He is currently in charge of the Debian distribution of Linux and is involved in porting Linux to platforms other than i386. He has successfully helped port the kernel to the Alpha, SPARC, and ARM, and is currently working on a version to run on VAX machines.

A discussion was held on which file system is best for battery life. The Reiser and Ext3 file systems were discussed; people commented that when using Ext3 on their laptops the disk would never spin down and thus was always consuming a large amount of power. One suggestion was to use a RAM-based file system for any volume that requires a large number of writes and to only have the file system written to disk at infrequent intervals or when the machine is shut down or suspended.

A question was directed to Garbee about his knowledge of how the HP-Compaq merger would affect Linux. Garbee thought that the merger was good for Linux, both within the company and for the open source community as a whole. He stated that the new company would be the number one shipper of systems with Linux as the base operating system. He also fielded a question about the support of older HP servers and their ability to run Linux, saying that indeed Linux has been ported to them and he personally had several machines in his basement running it.

A question which sparked a large number of responses concerned problems people had with running Linux on mobile platforms. Most people actually did not have many problems at all. There were a few cases of a particular machine model not working, but there were only two widespread issues: docking stations and the new ACPI power management scheme. Docking stations still appear to be somewhat of a headache, but most people had developed ad-hoc solutions for getting them to work. The ACPI power management scheme developed by Intel does not appear to have a quick solution. Ted Ts'o, one of the head kernel developers, stated that there are fundamental architectural problems with the 2.4 series kernel that do not easily allow ACPI to be added. However, he also stated that ACPI is supported in the newest development kernel, 2.5, and will be included in 2.6.

The final topic was the possibility of purchasing a laptop without paying any charge/tax for Microsoft Windows. The consensus was that currently this is not possible. Even if the machine comes with Linux or without any operating system, the vendors have contracts in place which require them to pay Microsoft for each machine they ship if that machine is capable of running a Microsoft operating system. The only way this will change is if the demand for other operating systems increases to the point that vendors renegotiate their contracts with Microsoft, and no one saw this happening in the near future.

LARGE CLUSTERS

Andrew Hume, AT&T Labs – Research

Summarized by Teri Lampoudi

This was a session on clusters, big data, and resilient computing. Given that the audience was primarily interested in clusters and not necessarily big data or beating nodes to a pulp, the conversation revolved mainly around what it takes to make a heterogeneous cluster resilient. Hume also presented a FREENIX paper on his current cluster project which explained in more detail much of what was abstractly claimed in the guru session.

Hume's use of the word "cluster" referred not to what I had assumed would be a Beowulf-type system, but to a loosely coupled farm of machines in which concurrency was much less of an issue than it would be in a Beowulf. In fact, the architecture Hume described was designed to deal with large amounts of independent transaction processing, essentially the process of billing calls, which requires no interprocess message passing or anything of the sort a parallel scientific application might.

Where does the large data come in? Since the transactions in question consist of large numbers of flat files, a mechanism for getting the files onto nodes and off-loading the results is necessary. In this particular cluster, named Ningaui, this task is handled by the "replication manager," which generated a large amount of interest in the audience. All management functions in the cluster, as well as job allocation, constitute services that are to be bid for and leased out to nodes. Furthermore, failures are handled by simply repeating the failed task until it is completed properly, if that is possible, and if it is not, then looking through detailed logs to discover what went wrong. Logging and testing for the successful completion of each task are what make this model resilient. Every management operation is logged at some appropriate level of detail, generating 120MB/day, and every single file transfer is md5 checksummed. This was characterized by Hume as a "patient" style of management where no one controls the cluster; scheduling behavior is emergent rather than stringently planned, and nodes are free to drop in and out of the cluster. A downside to this model is the increase in latency, offset by the fact that the cluster continues to function unattended outside of 8-to-5 support hours.

In response to questions about the projected scalability of such a scheme, Hume said the system would presumably scale from the present 60 nodes to about 100 without modifications, but that scaling higher would be a matter for further design. The guru session ended on a somewhat unrelated note regarding the problems of getting data off mainframes and onto Linux machines – issues of fixed- vs. variable-length encoding of data blocks resulting in useful information being stripped by FTP, and problems in converting COBOL copybooks to something useful on a C-based architecture. To reiterate Hume's claim, there are homemade solutions for some of these problems, and the reader should feel free to contact Mr. Hume for them.

WRITING PORTABLE APPLICATIONS

Nick Stoughton MSB Consultants

Summarized by Josh Lothian

Nick Stoughton addressed a concern that developers are facing more and more: writing portable applications. He proposed that there is no such thing as a "portable application," only those that have been ported. Developers are still in search of the Holy Grail of applications programming: a write-once, compile-anywhere program. Languages such as Java are leading up to this, but virtual machines still have to be ported, paths will still be different, etc. Web-based applications are another area that could be promising for portable applications.

Nick went on to talk about the future of portability. The recently released POSIX 2001 standards are expected to help the situation. The 2001 revision expands the original POSIX spec into seven volumes. Even though POSIX 2001 is more specific than the previous release, Nick pointed out that even this standard allows for small differences where vendors may define their own behaviors. This can only hurt developers in the long run. Along these lines, it was pointed out that there may be room for automated tools that have the capability to check application code and determine the level of portability and standards compliance of the source.

NETWORK MANAGEMENT SYSTEM

PERFORMANCE TUNING

Jeff Allen Tellme Networks Inc

Summarized by Matt Selsky

Jeff Allen, author of Cricket, discussed network tuning. The basic idea of tuning is measure, twiddle, and measure again. Cricket can be used for large installations, but it's overkill for smaller installations, for which MRTG is better suited. You don't need Cricket to do measurement. Doing something like a shell script that dumps data to Excel is good, but you need to get some sort of measurement. Cricket was not designed for billing; it was meant to help answer questions.

Useful tools and techniques covered included looking for the difference between machines in order to identify the causes for the differences; measuring, but also thinking about what you're measuring and why; and remembering to use strace and tcpdump to identify problems. Problems repeat themselves; performance tuning requires an intricate understanding of all the layers involved to solve the problem and simplify the automation. (Having a computer science background helps but is not essential.) If you can reproduce the problem, you then should investigate each piece to find the bottleneck. If you can't reproduce the problem, then measure. You need to understand all the system interactions. Measurement can help determine whether things are actually slow or if the user is imagining it.

Troubleshooting begins with a hunch, but scientific processes are essential. You should be able to determine the event stream, or when each event occurs; having observation logs can help. Some hints provided include checking cron for periodic anomalies, slowing down /bin/rm to avoid I/O overload, and looking for unusual kernel CPU usage.

Also, try to use lightweight monitoring to reduce overhead. You don't need to monitor every resource on every system, but you should monitor those resources on those systems that are essential. Don't check things too often, since you can introduce overhead that way. Lots of information can be gathered from the /proc file system interface to the kernel.
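
As a rough illustration of that last point, the sketch below (in C, Linux-specific, with /proc/loadavg chosen only as an example resource) samples the load average by reading the /proc interface directly rather than running a heavyweight monitoring agent.

#include <stdio.h>
/* Read the system load average from the kernel's /proc interface; a cheap,
 * low-overhead measurement compared to a full monitoring agent. */
int main(void)
{
    FILE *f = fopen("/proc/loadavg", "r");
    if (f == NULL) {
        perror("/proc/loadavg");
        return 1;
    }
    double one, five, fifteen;
    if (fscanf(f, "%lf %lf %lf", &one, &five, &fifteen) == 3)
        printf("load: 1min=%.2f 5min=%.2f 15min=%.2f\n", one, five, fifteen);
    fclose(f);
    return 0;
}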

BSD BOF

Host Kirk McKusick Author and Consultant

Summarized by Bosko Milekic

The BSD Birds of a Feather session at USENIX started off with five quick and informative talks and ended with an exciting discussion of licensing issues.

Christos Zoulas, for the NetBSD Project, went over the primary goals of the NetBSD Project (namely portability, a clean architecture, and security) and then proceeded to briefly discuss some of the challenges that the project has encountered. The successes and areas for improvement of 1.6 were then examined. NetBSD has recently seen various improvements in its cross-build system, packaging system, and third-party tool support. For what concerns the kernel, a zero-copy TCP/UDP implementation has been integrated, and a new pmap code for arm32 has been introduced, as have some other rather major changes to various subsystems. More information is available in Christos's slides, which are now available at http://www.netbsd.org/gallery/events/usenix2002.

Theo de Raadt of the OpenBSD Project presented a rundown of various new things that the OpenBSD and OpenSSH teams have been looking at. Theo's talk included an amusing description (and pictures) of some of the events that transpired during OpenBSD's hack-a-thon, which occurred the week before USENIX '02. Finally, Theo mentioned that following complications with the IPFilter license, the OpenBSD team had done a fairly extensive license audit.

Robert Watson of the FreeBSD Project brought up various examples of FreeBSD being used in the real world. He went on to describe a wide variety of changes that will surface with the upcoming FreeBSD release 5.0. Notably, 5.0 will introduce a re-architecting of the kernel aimed at providing much more scalable support for SMP machines. 5.0 will also include an early implementation of KSE, FreeBSD's version of scheduler activations, large improvements to pccard, and significant framework bits from the TrustedBSD project.

Mike Karels from Wind River Systems presented an overview of how BSD/OS has been evolving following the Wind River Systems acquisition of BSDi's software assets. Ernie Prabhakar from Apple discussed Apple's OS X and its success on the desktop. He explained how OS X aims to bring BSD to the desktop while striving to maintain a strong and stable core, one that is heavily based on BSD technology.

The BoF finished with a discussion on licensing; specifically, some folks questioned the impact of Apple's licensing for code from their core OS (the Darwin Project) on the BSD community as a whole.

CLOSING SESSION

HOW FLIES FLY

Michael H. Dickinson, University of California, Berkeley

Summarized by JD Welch

In this lively and offbeat talk, Dickinson discussed his research into the mechanisms of flight in the fruit fly (Drosophila melanogaster) and related the autonomic behavior to that of a technological system. Insects are the most successful organisms on earth, due in no small part to their early adoption of flight. Flies travel through space on straight trajectories interrupted by saccades, jumpy turning motions somewhat analogous to the fast, jumpy movements of the human eye. Using a variety of monitoring techniques, including high-speed video in a "virtual reality flight arena," Dickinson and his colleagues have observed that flies respond to visual cues to decide when to saccade during flight. For example, more visual texture makes flies saccade earlier (i.e., further away from the arena wall), described by Dickinson as a "collision control algorithm."

Through a combination of sensory inputs, flies can make decisions about their flight path. Flies possess a visual "expansion detector" which, at a certain threshold, causes the animal to turn a certain direction. However, expansion in front of the fly sometimes causes it to land. How does the fly decide? Using the virtual reality device to replicate various visual scenarios, Dickinson observed that the flies fixate on vertical objects that move back and forth across the field. Expansion on the left side of the animal causes it to turn right, and vice versa, while expansion directly in front of the animal triggers the legs to flare out in preparation for landing.

If the eyes detect these changes, how are the responses implemented? The flies' wings have "power muscles" controlled by mechanical resonance (as opposed to the nervous system), which drive the wings, combined with neurally activated "steering muscles," which change the configuration of the wing joints. Subtle variations in the timing of impulses correspond to changes in wing movement, controlled by sensors in the "wing-pit." A small wing-like structure, the haltere, controls equilibrium by beating constantly.

Voluntary control is accomplished by the halteres, whose signals can interfere with the autonomic control of the stroke cycle. The halteres have "steering muscles" as well, and information derived from the visual system can turn off the haltere or trick it into a "virtual" problem requiring a response.

Dickinson has also studied the aerodynamics of insect flight using a device called the "Robo Fly," an oversized mechanical insect wing suspended in a large tank of mineral oil. Interestingly, Dickinson observed that the average lift generated by the fly's wings is under its body weight; the flies use three mechanisms to overcome this, including rotating the wings (rotational lift).

Insects are extraordinarily robust creatures, and because Dickinson analyzed the problem in a systems-oriented way, these observations and analysis are immediately applicable to technology; the response system can be used as an efficient search algorithm for control systems in autonomous vehicles, for example.

THE AFS WORKSHOP

Summarized by Garry Zacheiss MIT

Love Hornquist-Astrand of the Arla development team presented the Arla status report. Arla 0.35.8 has been released. Scheduled for release soon are improved support for Tru64 UNIX, MacOS X, and FreeBSD, improved volume handling, and implementation of more of the vos/pts subcommands. It was stressed that MacOS X is considered an important platform, and that a GUI configuration manager for the cache manager, a GUI ACL manager for the Finder, and a graphical login that obtains Kerberos tickets and AFS tokens at login time were all under development.

Future goals planned for the underlying AFS protocols include GSSAPI/SPNEGO support for Rx, performance improvements to Rx, an enhanced disconnected mode, and IPv6 support for Rx; an experimental patch is already available for the latter. Future Arla-specific goals include improved performance, partial file reading, and increased stability for several platforms. Work is also in progress for the RXGSS protocol extensions (integrating GSSAPI into Rx). A partial implementation exists, and work continues as developers find time.

Derrick Brashear of the OpenAFS development team presented the OpenAFS status report. OpenAFS was released immediately prior to the conference; OpenAFS 1.2.5 fixed a remotely exploitable denial-of-service attack in several OpenAFS platforms, most notably IRIX and AIX. Future work planned for OpenAFS includes better support for MacOS X, including working around Finder interaction issues. Better support for the BSDs is also planned: FreeBSD has a partially implemented client; NetBSD and OpenBSD have only server programs available right now. AIX 5, Tru64 5.1A, and MacOS X 10.2 (aka Jaguar) are all planned for the future. Other planned enhancements include nested groups in the ptserver (code donated by the University of Michigan awaits integration), disconnected AFS, and further work on a native Windows client. Derrick stated that the guiding principles of OpenAFS were to maintain compatibility with IBM AFS, support new platforms, ease the administrative burden, and add new functionality.

AFS performance was discussed. The openafs.org cell consists of two file servers, one in Stockholm, Sweden, and one in Pittsburgh, PA. AFS works reasonably well over trans-Atlantic links, although OpenAFS clients don't determine which file server to talk to very effectively. Arla clients use RTTs to the server to determine the optimal file server to fetch replicated data from. Modifications to OpenAFS to support this behavior in the future are desired.

Jimmy Engelbrecht and Harald Barth of KTH discussed their AFSCrawler script, written to determine how many AFS cells and clients were in the world, what implementations/versions they were (Arla vs. IBM AFS vs. OpenAFS), and how much data was in AFS. The script unfortunately triggered a bug in IBM AFS 3.6–derived code, causing some clients to panic while handling a specific RPC. This has since been fixed in OpenAFS 1.2.5 and the most recent IBM AFS patch level, and all AFS users are strongly encouraged to upgrade. No release of Arla is vulnerable to this particular denial-of-service attack. There was an extended discussion of the usefulness of this exploration. Many sites believed this was useful information and such scanning should continue in the future, but only on an opt-in basis.

Many sites face the problem of managing Kerberos/AFS credentials for batch-scheduled jobs. Specifically, most batch processing software needs to be modified to forward tickets as part of the batch submission process, renew tickets and tokens while the job is in the queue and for the lifetime of the job, and properly destroy credentials when the job completes. Ken Hornstein of NRL was able to pay a commercial vendor to support Kerberos 4/5 credential management in their product, although they did not implement AFS token management. MIT has implemented some of the desired functionality in OpenPBS and might be able to make it available to other interested sites.

Tools to simplify AFS administration were discussed, including:

AFS Balancer: a tool written by CMU to automate the process of balancing disk usage across all servers in a cell. Available from ftp://ftp.andrew.cmu.edu/pub/AFS-Tools/balance-1.1b.tar.gz.

Themis: Themis is KTH's enhanced version of the AFS tool "package" for updating files on local disk from a central AFS image. KTH's enhancements include allowing the deletion of files, simplifying the process of adding a file, and allowing the merging of multiple rule sets for determining which files are updated. Themis is available from the Arla CVS repository.

Stanford was presented as an example of a large AFS site. Stanford's AFS usage consists of approximately 14TB of data in AFS, in the form of approximately 100,000 volumes; 33TB of storage is available in their primary cell, ir.stanford.edu. Their file servers consist entirely of Solaris machines running a combination of Transarc 3.6 patch-level 3 and OpenAFS 1.2.x, while their database servers run OpenAFS 1.2.2. Their cell consists of 25 file servers using a combination of EMC and Sun StorEdge hardware. Stanford continues to use the kaserver for their authentication infrastructure, with future plans to migrate entirely to an MIT Kerberos 5 KDC.

Stanford has approximately 3,400 clients on their campus, not including SLAC (Stanford Linear Accelerator); approximately 2,100 AFS clients from outside Stanford contact their cell every month. Their supported clients are almost entirely IBM AFS 3.6 clients, although they plan to release OpenAFS clients soon. Stanford currently supports only UNIX clients. There is some on-campus presence of Windows clients, but Stanford has never publicly released or supported it. They do intend to release and support the MacOS X client in the near future.

All Stanford students, faculty, and staff are assigned AFS home directories, with a default quota of 50MB, for a total of approximately 550GB of user home directories. Other significant uses of AFS storage are data volumes for workstation software (400GB) and volumes for course Web sites and assignments (100GB).

AFS usage at Intel was also presented. Intel has been an AFS site since 1994. They had bad experiences with the IBM 3.5 Linux client; their experience with OpenAFS on Linux 2.4.x kernels has been much better. They use and are satisfied with the OpenAFS IA64 Linux port. Intel has hundreds of OpenAFS 1.2.3 and 1.2.4 clients in many production cells accessing data stored on IBM AFS file servers. They have not encountered any interoperability issues. Intel has some concerns about OpenAFS: they would like to purchase commercial support for OpenAFS and to see OpenAFS support for HP-UX on both PA-RISC and Itanium hardware. HP-UX support is currently unavailable due to a specific HP-UX header file being unavailable from HP; this may be available soon. Intel has not yet committed to migrating their file servers to OpenAFS and are unsure if they will do so without commercial support.

Backups are a traditional topic of discussion at AFS workshops, and this time was no exception. Many users complain that the traditional AFS backup tools ("backup" and "butc") are complex and difficult to automate, requiring many home-grown scripts and much user intervention for error recovery. An additional complaint was that the traditional AFS tools do not support file- or directory-level backups and restores; data must be backed up and restored at the volume level.

Mitch Collinsworth of Cornell presented work to make AMANDA, the free backup software from the University of Maryland, suitable for AFS backups. Using AMANDA for AFS backups allows one to share AFS backup tapes with non-AFS backups, easily run multiple backups in parallel, automate error recovery, and provide a robust degraded mode that prevents tape errors from stopping backups altogether. Their implementation allows for full volume restores as well as individual directory and file restores. They have finished coding this work and are in the process of testing and documenting it.

Peter Honeyman of CITI at the University of Michigan spoke about work he has proposed to replace Rx with RPCSEC GSS in OpenAFS; this would allow AFS to use a TCP-based transport mechanism rather than the UDP-based Rx, and possibly gain better congestion control, dynamic adaptation, and fragmentation avoidance as a result. RPCSEC GSS uses the GSSAPI to authenticate SUN ONC RPC. RPCSEC GSS is transport agnostic, provides strong security, is a developing Internet standard, and has multiple open source implementations. Backward compatibility with existing AFS servers and clients is an important goal of this project.

Around 33 million browsing accesses and about 325 million notification entries are present in the traces. Three types of analyses are done for each one of these two services: content analysis, concerning the most popular content categories and their distribution; popularity analysis, or the popularity distribution of documents; and user behavior analysis.

The analysis of the notification logs shows that document access rates follow a Zipf-like distribution, with most of the accesses concentrated on a small number of messages, and the accesses exhibit geographical locality – users from the same locality tend to receive similar notification content. The browser log analysis shows that a smaller set of URLs are accessed most times, though the access pattern does not fit any Zipf curve, and the highly accessed URLs remain stable. A correlation study between the notification and browsing services shows that wireless users have a moderate correlation of 0.12.

FREENIX TRACK SESSIONS

BUILDING APPLICATIONS

Summarized by Matt Butner

INTERACTIVE 3-D GRAPHICS APPLICATIONS

FOR TCL

Oliver Kersting, Jürgen Döllner, Hasso Plattner Institute for Software Systems Engineering, University of Potsdam

The integration of 3-D image rendering functionality into a scripting language permits interactive and animated 3-D development and application without the formalities and precision demanded by low-level C/C++ graphics and visualization libraries. The large and complex C++ API of the Virtual Rendering System (VRS) can be combined with the conveniences of the Tcl scripting language. The mapping of class interfaces is done via an automated process and generates respective wrapper classes, all of which ensures complete API accessibility and functionality without any significant performance retribution. The general C++-to-Tcl mapper SWIG grants the necessary object and hierarchal abilities without the use of object-oriented Tcl extensions. The final API mapping techniques address C++ features such as classes, overloaded methods, enumerations, and inheritance relations. All are implemented in a proof-of-concept that maps the complete C++ API of the VRS to Tcl and are showcased in a complete interactive 3-D-map system.

THE AGFL GRAMMAR WORK LAB

Cornelis H.A. Coster, Erik Verbruggen, University of Nijmegen (KUN)

The growth and implementation of Natural Language Processing (NLP) is the cornerstone of the continued evolution and implementation of truly intelligent search machines and services. In part due to the growing collections of computer-stored, human-readable documents in the public domain, the implementation of linguistic analysis will become necessary for desirable precision and resolution. Subsequently, such tools and linguistic resources must be released into the public domain, and they have done so with the release of the AGFL Grammar Work Lab under the GNU Public License as a tool for linguistic research and the development of NLP-based applications.

The AGFL (Affix Grammars over a Finite Lattice) Grammar Work Lab meshes context-free grammars with finite set-valued features that are acceptable to a range of languages. In computer science terms, "Syntax rules are procedures with parameters and a non-deterministic execution." The English Phrases for Information Retrieval (EP4IR), released with the AGFL-GWL as a robust grammar of English, is an AGFL-GWL generated English parser that outputs "Head/Modified" frames. The sentences "CompanyX sponsored this conference" and "This conference was sponsored by CompanyX" both generate [CompanyX[sponsored conference]]. The hope is that public availability of such tools will encourage further development of grammar and lexical software systems.

SWILL A SIMPLE EMBEDDED WEB SERVER

LIBRARY

Sotiria Lampoudi, David M. Beazley, University of Chicago

This paper won the FREENIX Best Student Paper award.

SWILL (Simple Web Interface Link Library) is a simple Web server in the form of a C library whose development was motivated by a wish to give cool applications an interface to the Web. The SWILL library provides a simple interface that can be efficiently implemented for tasks that vary from flexible Web-based monitoring to software debugging and diagnostics. Though originally designed to be integrated with high-performance scientific simulation software, the interface is generic enough to allow for unbounded uses. SWILL is a single-threaded server relying upon non-blocking I/O through the creation of a temporary server which services I/O requests. SWILL does not provide SSL or cryptographic authentication but does have HTTP authentication abilities.

A fantastic feature of SWILL is its support for SPMD-style parallel applications which utilize MPI, proving valuable for Beowulf clusters and large parallel machines. Another practical application was the implementation of SWILL in a modified Yalnix emulator by University of Chicago Operating Systems courses, which utilized the added Yalnix functionality for OS development and debugging. SWILL requires minimal memory overhead and relies upon the HTTP/1.0 protocol.

NETWORK PERFORMANCE

Summarized by Florian Buchholz

LINUX NFS CLIENT WRITE PERFORMANCE

Chuck Lever, Network Appliance; Peter Honeyman, CITI, University of Michigan

Lever introduced a benchmark to measure NFS client write performance. Client performance is difficult to measure due to hindrances such as poor hardware or bandwidth limitations. Furthermore, measuring application performance does not identify weaknesses specifically at the client side. Thus a benchmark was developed trying to exercise only data transfers in one direction between server and application. For this purpose the benchmark was based on the block sequential write portion of the Bonnie file system benchmark. Once a benchmark for NFS clients is established, it can be used to improve client performance.

The performance measurements were performed with an SMP Linux client and both a Linux NFS server and a Network Appliance F85 filer. During testing, a periodic jump in write latency time was discovered. This was due to a rather large number of pending write operations that were scheduled to be written after certain threshold values were exceeded. By introducing a separate daemon that flushes the cached write requests, the spikes could be eliminated, but as a result the average latency grows over time. The problem could be traced to a function that scans a linked list of write requests. After having added a hashtable to improve lookup performance, the latency improved considerably.
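
The data-structure change can be pictured with a small sketch in C (field and function names here are invented for illustration, not the kernel's): a pending write is found by hashing its offset into a bucket and walking a short chain, rather than scanning one long list of all outstanding writes.

#include <stdio.h>
#include <stdlib.h>
#define NBUCKETS 64
struct wreq {
    unsigned long offset;     /* page offset of the pending write */
    struct wreq  *next;       /* hash-chain link */
};
static struct wreq *bucket[NBUCKETS];
static unsigned hashfn(unsigned long off) { return off % NBUCKETS; }
static void insert_req(unsigned long off)
{
    struct wreq *r = malloc(sizeof *r);
    unsigned b = hashfn(off);
    r->offset = off;
    r->next = bucket[b];
    bucket[b] = r;
}
/* Expected O(1) per lookup instead of O(n) over all outstanding writes. */
static struct wreq *lookup(unsigned long off)
{
    for (struct wreq *r = bucket[hashfn(off)]; r; r = r->next)
        if (r->offset == off)
            return r;
    return NULL;
}
int main(void)
{
    for (unsigned long off = 0; off < 1000; off++)
        insert_req(off);
    printf("offset 371 %s\n", lookup(371) ? "found" : "missing");
    return 0;
}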

The improved client was then used to measure throughput against the two servers. A discrepancy between the Linux server and the filer test was noticed, and the reason for that traced back to a global kernel lock that was unnecessarily held when accessing the network stack. After correcting this, performance improved further. However, the measurements showed that a client may run slower when paired with fast servers on fast networks. This is due to heavy client interrupt loads, more network processing on the client side, and more global kernel lock contention.

The source code of the project is available at http://www.citi.umich.edu/projects/nfs-perf/patches.

A STUDY OF THE RELATIVE COSTS OF

NETWORK SECURITY PROTOCOLS

Stefan Miltchev and Sotiris Ioannidis, University of Pennsylvania; Angelos Keromytis, Columbia University

With the increasing need for security and integrity of remote network services, it becomes important to quantify the communication overhead of IPSec and compare it to alternatives such as SSH, SCP, and HTTPS.

For this purpose the authors set up three testing networks: a direct link, two hosts separated by two gateways, and three hosts connecting through one gateway. Protocols were compared in each setup, and manual keying was used to eliminate connection setup costs. For IPSec, the different encryption algorithms AES, DES, 3DES, hardware DES, and hardware 3DES were used. In detail, FTP was compared to SFTP, SCP, and FTP over IPSec; HTTP to HTTPS and HTTP over IPSec; and NFS to NFS over IPSec and local disk performance.

The results of the measurements were that IPSec outperforms other popular encryption schemes. Overall, unencrypted communication was fastest, but in some cases, like FTP, the overhead can be small. The use of crypto hardware can significantly improve performance. For future work, the inclusion of setup costs, hardware-accelerated SSL, SFTP, and SSH were mentioned.

CONGESTION CONTROL IN LINUX TCP

Pasi Sarolahti, University of Helsinki; Alexey Kuznetsov, Institute for Nuclear Research at Moscow

Having attended the talk and read the paper, I am still unclear about whether the authors are merely describing the design decisions of TCP congestion control or whether they are actually the creators of that part of the Linux code. My guess leans toward the former.

In the talk, the speaker compared the TCP protocol congestion control measures according to IETF and RFC specifications with the actual Linux implementation, which does conform to the basic principles but nevertheless has differences. A specific emphasis was placed on retransmission mechanisms and the congestion window. Also, several TCP enhancements – the NewReno algorithm, Selective ACKs (SACK), Forward ACKs (FACK) – were discussed and compared.

In some instances Linux does not conform to the IETF specifications. The fast recovery does not fully follow RFC 2582, since the threshold for triggering retransmit is adjusted dynamically and the congestion window's size is not changed. Also, the roundtrip-time estimator and the RTO calculation differ from RFC 2988, since it uses more conservative RTT estimates and a minimum RTO of 200ms. The performance measures showed that with the additional Linux-specific features enabled, slightly higher throughput, more steady data flow, and fewer unnecessary retransmissions can be achieved.
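
For readers who want the baseline the text is comparing against, here is a small sketch in C of the RFC 2988 retransmission-timeout computation. The clock granularity of 0.1s and the sample values are picked only for the example; as noted above, Linux deviates by using more conservative estimates and a 200ms minimum instead of the 1-second floor.

#include <stdio.h>
/* RFC 2988: SRTT and RTTVAR are smoothed with alpha = 1/8 and beta = 1/4,
 * and RTO = SRTT + max(G, 4 * RTTVAR), clamped to a 1-second minimum. */
struct rto_state { double srtt, rttvar, rto; int first; };
static void rto_sample(struct rto_state *s, double r /* seconds */)
{
    const double alpha = 1.0 / 8.0, beta = 1.0 / 4.0, g = 0.1, min_rto = 1.0;
    if (s->first) {
        s->srtt = r;
        s->rttvar = r / 2.0;
        s->first = 0;
    } else {
        double err = s->srtt - r;
        if (err < 0) err = -err;
        s->rttvar = (1 - beta) * s->rttvar + beta * err;
        s->srtt   = (1 - alpha) * s->srtt + alpha * r;
    }
    double var = 4 * s->rttvar;
    s->rto = s->srtt + (var > g ? var : g);
    if (s->rto < min_rto)
        s->rto = min_rto;
}
int main(void)
{
    struct rto_state s = { 0, 0, 0, 1 };
    const double samples[] = { 0.100, 0.120, 0.090, 0.300, 0.110 };
    for (int i = 0; i < 5; i++) {
        rto_sample(&s, samples[i]);
        printf("rtt=%.3f  srtt=%.3f  rto=%.3f\n", samples[i], s.srtt, s.rto);
    }
    return 0;
}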

XTREME XCITEMENT

Summarized by Steve Bauer

THE FUTURE IS COMING WHERE THE X

WINDOW SHOULD GO

Jim Gettys Compaq Computer Corp

Jim Gettys, one of the principal authors of the X Window System, outlined the near-term objectives for the system, primarily focusing on the changes and infrastructure required to enable replication and migration of X applications. Providing better support for this functionality would enable users to retrieve or duplicate X applications between their servers at home and work.

One interesting example of an application that currently is capable of migration and replication is Emacs. To create a new frame on DISPLAY, try "M-x make-frame-on-display <Return> DISPLAY <Return>".

However, technical challenges make replication and migration difficult in general. These include the "major headaches" of server-side fonts, the nonuniformity of X servers and screen sizes, and the need to appropriately retrofit toolkits. Authentication and authorization issues are obviously also important. The rest of the talk delved into some of the details of these interesting technical challenges.

HACKING IN THE KERNEL

Summarized by Hai Huang

AN IMPLEMENTATION OF SCHEDULER

ACTIVATIONS ON THE NETBSD OPERATING

SYSTEM

Nathan J Williams Wasabi Systems

Scheduler activation is an old idea. Basically, there are benefits and drawbacks to using solely kernel-level or user-level threading. Scheduler activation is able to combine the two layers of control to provide more concurrency in the system.

In his talk, Nathan gave a fairly detailed description of the implementation of scheduler activation in the NetBSD kernel. One important change in the implementation is to differentiate the thread context from the process context. This is done by defining a separate data structure for these threads and relocating some of the information that was embedded in the process context to these thread contexts. Stack was especially a concern, due to the upcall. Special handling must be done to make sure that the upcall doesn't mess up the stack, so that the preempted user-level thread can continue afterwards. Lastly, Nathan explained that signals were handled by upcalls.

AUTHORIZATION AND CHARGING IN PUBLIC

WLANS USING FREEBSD AND 802.1X

Pekka Nikander, Ericsson Research NomadicLab

802.1x standards are well known in the wireless community as link-layer authentication protocols. In this talk, Pekka explained some novel ways of using the 802.1x protocols that might be of interest to people on the move. It is possible to set up a public WLAN that would support various charging schemes via virtual tokens, which people can purchase or earn and later use.

This is implemented on FreeBSD using the netgraph utility. It is basically a filter in the link layer that would differentiate traffic based on the MAC address of the client node, which is either authenticated, denied, or let through. The overhead for this service is fairly minimal.
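
The decision the filter makes can be pictured with a minimal sketch in C. The table contents and function names are invented; in the real system this logic lives in a netgraph node inside the FreeBSD kernel rather than in a user program.

#include <stdio.h>
#include <string.h>
/* Classify a station by MAC address: authenticated traffic is passed,
 * denied stations are dropped, and unknown stations are only let through
 * to the authentication machinery. */
enum cls { AUTHENTICATED, DENIED, UNKNOWN };
struct entry { const char *mac; enum cls cls; };
static enum cls classify(const struct entry *tbl, int n, const char *mac)
{
    for (int i = 0; i < n; i++)
        if (strcmp(tbl[i].mac, mac) == 0)
            return tbl[i].cls;
    return UNKNOWN;
}
int main(void)
{
    const struct entry table[] = {
        { "00:02:2d:aa:bb:cc", AUTHENTICATED },
        { "00:02:2d:11:22:33", DENIED },
    };
    const char *names[] = { "pass", "drop", "authenticate first" };
    const char *macs[] = { "00:02:2d:aa:bb:cc", "00:02:2d:99:88:77" };
    for (int i = 0; i < 2; i++)
        printf("%s -> %s\n", macs[i], names[classify(table, 2, macs[i])]);
    return 0;
}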

ACPI IMPLEMENTATION ON FREEBSD

Takanori Watanabe Kobe University

ACPI (Advanced Configuration and Power Interface) was proposed as a joint effort by Intel, Toshiba, and Microsoft to provide a standard and finer-control method of managing power states of individual devices within a system. Such low-level power management is especially important for those mobile and embedded systems that are powered by fixed-capacity energy batteries.

Takanori explained that the ACPI specification is composed of three parts: tables, BIOS, and registers. He was able to implement some functionalities of ACPI in a FreeBSD kernel. ACPI Component Architecture was implemented by Intel, and it provides a high-level ACPI API to the operating system. Takanori's ACPI implementation is built upon this underlying layer of APIs.

ACCESS CONTROL

Summarized by Florian Buchholz

DESIGN AND PERFORMANCE OF THE

OPENBSD STATEFUL PACKET FILTER (PF)

Daniel Hartmeier Systor AG

Daniel Hartmeier described the new stateful packet filter (pf) that replaced IPFilter in the OpenBSD 3.0 release. IPFilter could no longer be included due to licensing issues, and thus there was a need to write a new filter making use of optimized data structures.

The filter rules are implemented as a linked list which is traversed from start to end. Two actions may be taken according to the rules, "pass" or "block." A "pass" action forwards the packet, and a "block" action will drop it. Where more than one rule matches a packet, the last rule wins. Rules that are marked as "final" will immediately terminate any further rule evaluation. An optimization called "skip-steps" was also implemented, where blocks of similar rules are skipped if they cannot match the current packet. These skip-steps are calculated when the rule set is loaded. Furthermore, a state table keeps track of TCP connections. Only packets that match the sequence numbers are allowed. UDP and ICMP queries and replies are considered in the state table, where initial packets will create an entry for a pseudo-connection with a low initial timeout value. The state table is implemented using a balanced binary search tree. NAT mappings are also stored in the state table, whereas application proxies reside in user space. The packet filter also is able to perform fragment reassembly and to modulate TCP sequence numbers to protect hosts behind the firewall.
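
A toy version of that evaluation strategy, written in C under the assumption that a "rule" matches on nothing but a destination port, is sketched below. It shows only the last-match-wins rule walk with a "final" flag; pf's real rules match full packet headers, and the skip-step and state-table optimizations are not modeled here.

#include <stdbool.h>
#include <stdio.h>
/* Toy rule evaluation: every matching rule is remembered, the last match
 * wins, and a rule marked "final" stops evaluation immediately. */
enum verdict { PASS, BLOCK };
struct rule {
    int          port;     /* -1 matches any port */
    enum verdict action;
    bool         final;    /* terminate further rule evaluation on match */
};
static enum verdict evaluate(const struct rule *rs, int n, int port)
{
    enum verdict last = PASS;            /* default policy */
    for (int i = 0; i < n; i++) {
        if (rs[i].port != -1 && rs[i].port != port)
            continue;                    /* rule does not match this packet */
        last = rs[i].action;             /* last matching rule wins */
        if (rs[i].final)
            break;
    }
    return last;
}
int main(void)
{
    const struct rule ruleset[] = {
        { -1,  BLOCK, false },   /* block everything by default */
        { 22,  PASS,  false },   /* ...but pass ssh */
        { 80,  PASS,  false },   /* ...and http */
        { 137, BLOCK, true  },   /* always drop, no further evaluation */
    };
    const int ports[] = { 22, 25, 137 };
    for (int i = 0; i < 3; i++)
        printf("port %3d -> %s\n", ports[i],
               evaluate(ruleset, 4, ports[i]) == PASS ? "pass" : "block");
    return 0;
}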

Pf was compared against IPFilter as well as Linux's Iptables by measuring throughput and latency with increasing traffic rates and different packet sizes. In a test with a fixed rule set size of 100, Iptables outperformed the other two filters, whose results were close to each other. In a second test, where the rule set size was continually increased, Iptables consistently had about twice the throughput of the other two (which evaluate the rule set on both the incoming and outgoing interfaces). A third test compared only pf and IPFilter, using a single rule that created state in the state table, with a fixed-state entry size. Pf reached an overloaded state much later than IPFilter. The experiment was repeated with a variable-state entry size, and pf performed much better than IPFilter for a small number of states.

In general, rule set evaluation is expensive, and benchmarks only reflect extreme cases, whereas in real life other behavior should be observed. Furthermore, the benchmarks show that stateful filtering can actually improve performance, due to cheap state-table lookups as compared to rule evaluation.

ENHANCING NFS CROSS-ADMINISTRATIVE

DOMAIN ACCESS

Joseph Spadavecchia and Erez Zadok, Stony Brook University

The speaker presented modifications to an NFS server that allow improved NFS access between administrative domains. A problem lies in the fact that NFS assumes a shared UID/GID space, which makes it unsafe to export files outside the administrative domain of the server. Design goals were to leave protocol and existing clients unchanged, a minimum amount of server changes, flexibility, and increased performance.

To solve the problem, two techniques are utilized: "range-mapping" and "file-cloaking." Range-mapping maps IDs between client and server. The mapping is performed on a per-export basis and has to be manually set up in an export file. The mappings can be 1-1, N-N, or N-1. In file-cloaking, the server restricts file access based on UID/GID and special cloaking-mask bits. Here users can only access their own file permissions. The policy on whether or not the file is visible to others is set by the cloaking mask, which is logically ANDed with the file's protection bits. File-cloaking only works, however, if the client doesn't hold cached copies of directory contents and file attributes. Because of this, the clients are forced to re-read directories by incrementing the mtime value of the directory each time it is listed.
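
A compact sketch in C of the two mechanisms follows, with invented ID ranges, masks, and function names; the paper's actual export-file format and in-kernel data layout are not reproduced here.

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
/* Map client UIDs 1000-1999 onto server UIDs 20000-20999 (an N-N range);
 * anything outside the range collapses to "nobody". */
static uid_t map_uid(uid_t client_uid)
{
    if (client_uid >= 1000 && client_uid < 2000)
        return client_uid - 1000 + 20000;
    return 65534;
}
/* A non-owner sees the file only through the cloaking mask, which is
 * ANDed with the file's permission bits. */
static mode_t effective_mode(mode_t mode, mode_t cloak_mask, int is_owner)
{
    return is_owner ? mode : (mode & cloak_mask);
}
int main(void)
{
    uid_t server_uid = map_uid(1042);
    mode_t visible = effective_mode(0644, 0700, 0);  /* cloak group/other */
    printf("client uid 1042 -> server uid %u\n", (unsigned)server_uid);
    printf("non-owner sees mode %03o (owner sees 644)\n", (unsigned)visible);
    return 0;
}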

To test the performance of the modified server, five different NFS configurations were evaluated. An unmodified NFS server was compared against one server with the modified code included but not used, one with only range-mapping enabled, one with only file-cloaking enabled, and one version with all modifications enabled. For each setup, different file system benchmarks were run. The results show only a small overhead when the modifications are used, generally an increase of below 5%. Another experiment tested the performance of the system with an increasing number of mapped or cloaked entries on a system. The results show that an increase from 10 to 1000 entries resulted in a maximum of about 14% cost in performance.

One member of the audience pointed out that if clients choose to ignore the changed mtimes from the server and thus still hold caches of the directory entries, the file-cloaking mechanism could be defeated. After a rather lengthy debate, the speaker had to concede that the model doesn't add any extra security. Another question was asked about scalability of the setup of range mapping. The speaker referred to application-level tools that could be developed for that purpose.

The software is available at ftp://ftp.fsl.cs.sunysb.edu/pub/enf.

ENGINEERING OPEN SOURCE

SOFTWARE

Summarized by Teri Lampoudi

NINGAUI A LINUX CLUSTER FOR BUSINESS

Andrew Hume, AT&T Labs – Research; Scott Daniels, Electronic Data Systems Corp.

Ningaui is a general purpose, highly available, resilient architecture built from commodity software and hardware. Emphatically, however, Ningaui is not a Beowulf. Hume calls the cluster design the "Swiss canton model," in which there are a number of loosely affiliated independent nodes with data replicated among them. Jobs are assigned by bidding and leases, and cluster services done as session-based servers are sited via generic job assignment. The emphasis is on keeping the architecture end-to-end, checking all work via checksums, and logging everything. The resilience and high availability required by their goal of 8-5 maintenance – vs. the typical 24-7 model, where people get paged whenever the slightest thing goes wrong regardless of the time of day – is achieved by job restartability. Finally, all computation is performed on local data, without the use of NFS or network-attached storage.

Hume's message is a hopeful one: despite the many problems encountered – things like kernel and service limits, auto-installation problems, TCP storms, and the like – the existence of source code and the paranoid practice of logging and checksumming everything has helped. The final product performs reasonably well, and it appears that the resilient techniques employed do make a difference. One drawback, however, is that the software mentioned in the paper is not yet available for download.

CPCMS A CONFIGURATION MANAGEMENT

SYSTEM BASED ON CRYPTOGRAPHIC NAMES

Jonathan S. Shapiro and John Vanderburgh, Johns Hopkins University

This paper won the FREENIX Best Paper award.

The basic notion behind the project is the fact that everyone has a pet complaint about CVS, and yet it is currently the configuration manager in most widespread use. Shapiro has unleashed an alternative. Interestingly, he did not begin but ended with the usual host of reasons why CVS is bad. The talk instead began abruptly by characterizing the job of a software configuration manager, continued by stating the namespaces which it must handle, and wrapped up with the challenges faced.

X MEETS Z VERIFYING CORRECTNESS IN THE

PRESENCE OF POSIX THREADS

Bart Massey, Portland State University; Robert T. Bauer, Rational Software Corp.

Massey delivered a humorous talk on the insight gained from applying Z formal specification notation to system software design, rather than the more informal analysis and design process normally used.

The story is told with respect to writing XCB, which replaces the Xlib protocol layer and is supposed to be thread friendly. But where threads are concerned, deadlock avoidance becomes a hard problem that cannot be solved in an ad-hoc manner. But full model checking is also too hard. In this case, Massey resorted to Z specification to model the XCB lower layer, abstract away locking and data transfers, and locate fundamental issues. Essentially, the difficulties of searching the literature and locating information relevant to the problem at hand were overcome. As Massey put it, "the formal method saved the day."

FILE SYSTEMS

Summarized by Bosko Milekic

PLANNED EXTENSIONS TO THE LINUX

EXT2/EXT3 FILESYSTEM

Theodore Y. Ts'o, IBM; Stephen Tweedie, Red Hat

The speaker presented improvements to the Linux Ext2 file system, with the goal of allowing for various expansions while striving to maintain compatibility with older code. Improvements have been facilitated by a few extra superblock fields that were added to Ext2 not long before the Linux 2.0 kernel was released. The fields allow for file system features to be added without compromising the existing setup; this is done by providing bits indicating the impact of the added features with respect to compatibility. Namely, a file system with the "incompat" bit marked is not allowed to be mounted. Similarly, a "read-only" marking would only allow the file system to be mounted read-only.
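
The general shape of this scheme can be sketched in C as below; the constant values and field names are invented for illustration and do not reproduce the exact Ext2/3 superblock layout.

#include <stdint.h>
#include <stdio.h>
/* Unknown "incompat" features refuse the mount, unknown "ro_compat"
 * features force a read-only mount, and unknown "compat" features are
 * simply ignored. */
#define SUPPORTED_INCOMPAT   0x0001u   /* features this kernel understands */
#define SUPPORTED_RO_COMPAT  0x0003u
struct superblock {
    uint32_t feature_compat;
    uint32_t feature_incompat;
    uint32_t feature_ro_compat;
};
enum mount_result { MOUNT_RW, MOUNT_RO, MOUNT_REFUSE };
static enum mount_result can_mount(const struct superblock *sb)
{
    if (sb->feature_incompat & ~SUPPORTED_INCOMPAT)
        return MOUNT_REFUSE;               /* cannot even read it safely */
    if (sb->feature_ro_compat & ~SUPPORTED_RO_COMPAT)
        return MOUNT_RO;                   /* safe to read, not to write */
    return MOUNT_RW;                       /* compat bits never block a mount */
}
int main(void)
{
    struct superblock newer_fs = { 0x10, 0x0001, 0x0004 };
    const char *msg[] = { "read-write", "read-only", "refuse" };
    printf("mount decision: %s\n", msg[can_mount(&newer_fs)]);
    return 0;
}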

Directory indexing replaces linear directory searches with a faster search using a fixed-depth tree and hashed keys. File system size can be dynamically increased, and the expanded inode, doubled from 128 to 256 bytes, allows for more extensions.

Other potential improvements were discussed as well, in particular pre-allocation for contiguous files, which allows for better performance in certain setups by pre-allocating contiguous blocks. Security-related modifications, extended attributes, and ACLs were mentioned. An implementation of these features already exists but has not yet been merged into the mainline Ext2/3 code.

RECENT FILESYSTEM OPTIMISATIONS ON

FREEBSD

Ian Dowse, Corvil Networks; David Malone, CNRI, Dublin Institute of Technology

David Malone presented four important file system optimizations for the FreeBSD OS: soft updates, dirpref, vmiodir, and dirhash. It turns out that certain combinations of the optimizations (beautifully illustrated in the paper) may yield performance improvements of anywhere between 2 and 10 orders of magnitude for real-world applications.

All four of the optimizations deal with file system metadata. Soft updates allow for asynchronous metadata updates. Dirpref changes the way directories are organized, attempting to place child directories closer to their parents, thereby increasing locality of reference and reducing disk-seek times. Vmiodir trades some extra memory in order to achieve better directory caching. Finally, dirhash, which was implemented by Ian Dowse, changes the way in which entries are searched in a directory. Specifically, FFS uses a linear search to find an entry by name; dirhash builds a hashtable of directory entries on the fly. For a directory of size n with a working set of m files, a search that in certain cases could have been O(nm) has been reduced, due to dirhash, to effectively O(n + m).
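
The complexity argument can be seen in a small self-contained C sketch (names and sizes invented): building the hash table costs one pass over the n entries, after which each of the m lookups is an expected constant-time probe instead of another full scan.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define NBUCKET 128
struct dirent_node {
    char name[32];
    struct dirent_node *next;
};
static struct dirent_node *table[NBUCKET];
static unsigned hash_name(const char *s)
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKET;
}
/* One-time cost: O(n) to insert every directory entry. */
static void dirhash_add(const char *name)
{
    struct dirent_node *e = calloc(1, sizeof *e);
    strncpy(e->name, name, sizeof e->name - 1);
    unsigned b = hash_name(e->name);
    e->next = table[b];
    table[b] = e;
}
/* Each lookup then probes one short chain instead of scanning n entries. */
static int dirhash_exists(const char *name)
{
    for (struct dirent_node *e = table[hash_name(name)]; e; e = e->next)
        if (strcmp(e->name, name) == 0)
            return 1;
    return 0;
}
int main(void)
{
    char name[32];
    for (int i = 0; i < 10000; i++) {
        snprintf(name, sizeof name, "file%d", i);
        dirhash_add(name);
    }
    printf("file9999: %d, missing: %d\n",
           dirhash_exists("file9999"), dirhash_exists("no-such-file"));
    return 0;
}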

FILESYSTEM PERFORMANCE AND SCALABILITY

IN LINUX 2.4.17

Ray Bryant, SGI; Ruth Forester, IBM LTC; John Hawkes, SGI

This talk focused on performance evaluation of a number of file systems available and commonly deployed on Linux machines. Comparisons under various configurations of Ext2, Ext3, ReiserFS, XFS, and JFS were presented.

The benchmarks chosen for the data gathering were pgmeter, filemark, and AIM Benchmark Suite VII. Pgmeter measures the rate of data transfer of reads/writes of a file under a synthetic workload. Filemark is similar to postmark in that it is an operation-intensive benchmark, although filemark is threaded and offers various other features that postmark lacks. AIM VII measures performance for various file-system-related functionalities; it offers various metrics under an imposed workload, thus stressing the performance of the file system not only under I/O load but also under significant CPU load.

Tests were run on three different setups: a small, a medium, and a large configuration. ReiserFS and Ext2 appear at the top of the pile for smaller and medium setups. Notably, XFS and JFS perform worse for smaller system configurations than the others, although XFS clearly appears to generate better numbers than JFS. It should be noted that XFS seems to scale well under a higher load. This was most evident in the large-system results, where XFS appears to offer the best overall results.

THINGS TO THINK ABOUT

Summarized by Bosko Milekic

SPEEDING UP KERNEL SCHEDULER BY REDUCING CACHE MISSES

Shuji Yamamura, Akira Hirai, Mitsuru Sato, Masao Yamamoto, Akira Naruse, Kouichi Kumon, Fujitsu Labs

This was an interesting talk pertaining to the effects of cache coloring for task structures in the Linux kernel scheduler (Linux kernel 2.4.x). The speaker first presented some interesting benchmark numbers for the Linux scheduler, showing that as the number of processes on the task queue was increased, the performance decreased. The authors used some really nifty hardware to measure the number of bus transactions throughout their tests and were thus able to reasonably quantify the impact that cache misses had in the Linux scheduler.

Their experiments led them to implement a cache coloring scheme for task structures, which were previously aligned on 8KB boundaries and therefore were being eventually mapped to the same cache lines. This unfortunate placement of task structures in memory induced a significant number of cache misses as the number of tasks grew in the scheduler.

The implemented solution consisted of aligning task structures to more evenly distribute cached entries across the L2 cache. The result was inevitably fewer cache misses in the scheduler. Some negative effects were observed in certain situations. These were primarily due to more cache slots being used by task structure data in the scheduler, thus forcing data previously cached there to be pushed out.
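
The effect can be illustrated with a short C sketch using an invented cache geometry (64-byte lines, 128 sets per way, so an 8KB way): structures placed at 8KB strides all land in the same cache set, while adding a small per-task color offset spreads them across sets.

#include <stdio.h>
#define CACHE_LINE   64
#define CACHE_SETS   128              /* invented: 128 sets per way = 8KB way */
#define TASK_ALIGN   8192             /* task structures on 8KB boundaries */
#define COLOR_STRIDE (2 * CACHE_LINE) /* small per-task color offset */
static unsigned cache_set(unsigned long addr)
{
    return (unsigned)((addr / CACHE_LINE) % CACHE_SETS);
}
int main(void)
{
    printf("task  uncolored set  colored set\n");
    for (int task = 0; task < 8; task++) {
        unsigned long base    = 0x100000UL + (unsigned long)task * TASK_ALIGN;
        unsigned long colored = base + (unsigned long)(task % 16) * COLOR_STRIDE;
        /* Uncolored structures all fall into the same set; colored ones
         * are spread across different sets, reducing conflict misses. */
        printf("%4d  %13u  %11u\n", task, cache_set(base), cache_set(colored));
    }
    return 0;
}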

81October 2002 login

C

ON

FERE

NC

ERE

PORT

SWORK-IN-PROGRESS REPORTS

Summarized by Brennan Reynolds

RESOURCE VIRTUALIZATION TECHNIQUES FOR

WIDE-AREA OVERLAY NETWORKS

Kartik Gopalan University Stony Brook

This work addressed the issue of provisioning a maximum number of virtual overlay networks (VON) with diverse quality of service (QoS) requirements on a single physical network. Each of the VONs is logically isolated from others to ensure the QoS requirements. Gopalan mentioned several critical research issues with this problem that are currently being investigated. Dealing with how to provision the network at various levels (link, route, or path) and then enforce the provisioning at run-time is one of the toughest challenges. Currently, Gopalan has developed several algorithms to handle admission control, end-to-end QoS, route selection, scheduling, and fault tolerance in the network.

For more information visit http://www.ecsl.cs.sunysb.edu.

VISUALIZING SOFTWARE INSTABILITY

Jennifer Bevan, University of California, Santa Cruz

Detection of instability in software has typically been an afterthought. The point in the development cycle when the software is reviewed for instability is usually after it is difficult and costly to go back and perform major modifications to the code base. Bevan has developed a technique to allow the visualization of unstable regions of code that can be used much earlier in the development cycle. Her technique creates a time series of dependent graphs that include clusters and lines called fault lines. From the graphs, a developer is able to easily determine where the unstable sections of code are and proactively restructure them. She is currently working on a prototype implementation.

For more information visit http://www.cse.ucsc.edu/~jbevan/evo_viz.

RELIABLE AND SCALABLE PEER-TO-PEER WEB

DOCUMENT SHARING

Li Xiao William and Mary College

The idea presented by Xiao would allow end users to share the content of their Web browser caches with neighboring Internet users. The rationale for this is that today's end users are increasingly connected to the Internet over high-speed links, and the browser caches are becoming large storage systems. Therefore, if an individual accesses a page which does not exist in their local cache, Xiao is suggesting that they first query other end users for the content before trying to access the machine hosting the original. This strategy does have some serious problems associated with it that still need to be addressed, including ensuring the integrity of the content and protecting the identity and privacy of the end users.

SEREL: FAST BOOTING FOR UNIX

Leni Mayo, Fast Boot Software

Serel is a tool that generates a visual representation of a UNIX machine's boot-up sequence. It can be used to identify the critical path and can show whether a particular service or process blocks for an extended period of time. This information could be used to determine where the largest performance gains could be realized by tuning the order of execution at boot-up. Serel creates a dependency graph, expressed in XML, during boot-up; this graph is then used to create the visual representation. Currently the tool only works on POSIX-compliant systems, but Mayo stated that he would be porting it to other platforms. Other extensions that were mentioned included having the metadata cover the use of shared libraries and monitoring the suspend/resume sequence of portable machines.

For more information, visit http://www.fastboot.org/.


BERKELEY DB XML

John Merrells, Sleepycat Software

Merrells gave a quick introduction and overview of the new XML library for Berkeley DB. The library specializes in storage and retrieval of XML content, which is queried using XPath. The library allows for multiple containers per document and stores everything natively as XML. The user is also given a wide range of elements to create indices with, including edges, elements, text strings, or presence. The XPath implementation consists of a query parser, generator, optimizer, and execution engine. To conclude his presentation, Merrells gave a live demonstration of the software.

For more information, visit http://www.sleepycat.com/xml/.
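As a rough illustration of the kind of retrieval this enables (the record and queries below are invented for this summary, not taken from the demonstration), a document stored in a container such as

    <part number="42"><name>flange</name><price>10.95</price></part>

could be located with XPath expressions along the lines of

    /part[@number = "42"]      select by attribute value
    /part[price > 10]          select by element content
    /part[name]                select by presence of a child element

each of which maps naturally onto the attribute, element, and presence indices mentioned above.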

CLUSTER-ON-DEMAND (COD)

Justin Moore, Duke University

Modern clusters are growing at a rapid rate. Many have pushed beyond the 5,000-machine mark, and deploying them results in large expenses as well as management and provisioning issues. Furthermore, if the cluster is "rented" out to various users, it is very time-consuming to configure it to a user's specs, regardless of how long they need to use it. The COD work presented creates dynamic virtual clusters within a given physical cluster. The goal was to have a provisioning tool that would automatically select a chunk of available nodes and install the operating system and middleware specified by the customer in a short period of time. This would allow a greater use of resources, since multiple virtual clusters can exist at once. By using a virtual cluster, the size can be changed dynamically. Moore stated that they have created a working prototype and are currently testing and benchmarking its performance.

For more information, visit http://www.cs.duke.edu/~justin/cod/.


CATACOMB

Elias Sinderson, University of California, Santa Cruz

Catacomb is a project to develop a database-backed DASL module for the Apache Web server. It was designed as a replacement for WebDAV. The initial release of the module only contains support for a MySQL database but could be extended to others. Sinderson briefly touched on the module's performance: it was comparable to the mod_dav Apache module for all query types but search. The presentation was concluded with remarks about adding support for the lock method and including ACL specifications in the future.

For more information, visit http://ocean.cse.ucsc.edu/catacomb/.

SELF-ORGANIZING STORAGE

Dan Ellard, Harvard University

Ellard's presentation introduced a storage system that tunes itself based on the workload of the system, without requiring the intervention of the user. The intelligence is implemented as a virtual, self-organizing disk that resides below any file system. The virtual disk observes the access patterns exhibited by the system and then attempts to predict what information will be accessed next. An experiment was done using an NFS trace from one of the email servers at a large ISP. Ellard's self-organizing storage system worked well, which he attributes to the fact that most of the files being requested were large email boxes. Areas of future work include exploration of the length and detail of the data-collection stage, as well as the CPU impact of running the virtual disk layer.

VISUAL DEBUGGING

John Costigan, Virginia Tech

Costigan feels that the state of current debugging facilities in the UNIX world is not as good as it should be. He proposes the addition of several elements to programs like ddd and gdb. The first addition is the inclusion of separate heap and stack components. Another is to display only certain fields of a data structure, with the ability to zoom in and out as needed. Finally, the debugger should provide the ability to visually present complex data structures, including linked lists, trees, etc. He said that a beta version of a debugger with these abilities is currently available.

For more information, visit http://infovis.cs.vt.edu/datastruct/.

IMPROVING APPLICATION PERFORMANCE THROUGH SYSTEM-CALL COMPOSITION

Amit Purohit, Stony Brook University

Web servers perform a huge number of context switches and internal data copies during normal operation. These two elements can drastically limit the performance of an application, regardless of the hardware platform it is run on. The Compound System Call (CoSy) framework is an attempt to reduce the performance penalty for context switches and data copies. It includes a user-level library and several kernel facilities that can be used via system calls. The library provides the programmer with a complete set of memory-management functions. Performance tests using the CoSy framework showed a large saving for context switches and data copies; the only area where the savings between conventional libraries and CoSy were negligible was for very large file copies.

For more information, visit http://www.cs.sunysb.edu/~purohit/.
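For context, the conventional pattern CoSy targets looks like the loop below: every iteration crosses into the kernel twice and copies the data through a user buffer. (This is only the baseline being improved on; the actual CoSy library interface is not reproduced here.)

    #include <unistd.h>

    /* Copy data between two descriptors the conventional way: each iteration
       costs a read() and a write() system call plus a copy through 'buf'.
       A compound-call framework aims to hand the whole loop to the kernel as
       one batched request, avoiding the per-iteration switches and copies.   */
    ssize_t plain_copy(int in_fd, int out_fd)
    {
        char buf[64 * 1024];
        ssize_t n, total = 0;

        while ((n = read(in_fd, buf, sizeof buf)) > 0) {
            if (write(out_fd, buf, n) != n)
                return -1;
            total += n;
        }
        return n < 0 ? -1 : total;
    }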

ELASTIC QUOTAS

John Oscar, Columbia University

Oscar began by stating that most resources are flexible, but that to date disks have not been. While approximately 80 percent of files are short-lived, disk quotas are hard limits imposed by administrators. The idea behind elastic quotas is to have a non-permanent storage area that each user can temporarily use. Oscar suggested the creation of an ehome directory structure to be used in deploying elastic quotas. Currently he has implemented elastic quotas as a stackable file system.

There were also several trends apparent in the data. Many users would ssh to their remote host (good) but then launch an email reader on the local machine which would connect via POP to the same host and send the password in clear text (bad). This situation can easily be remedied by tunneling protocols like POP through SSH, but it appeared that many people were not aware this could be done. While most of his comments on the use of the wireless network were negative, the list of passwords he had collected showed that people were indeed using strong passwords. His recommendations were to educate and encourage people to use protocols like IPSec, SSH, and SSL when conducting work over a wireless network, because you never know who else is listening.

SYSTRACE

Niels Provos, University of Michigan

How can one be sure the applications one is using actually do exactly what their developers said they do? Short answer: you can't. People today are using a large number of complex applications, which means it is impossible to check each application thoroughly for security vulnerabilities. There are tools out there that can help, though. Provos has developed a tool called systrace that allows a user to generate a policy of acceptable system calls a particular application can make. If the application attempts to make a call that is not defined in the policy, the user is notified and allowed to choose an action. Systrace includes an auto-policy generation mechanism, which uses a training phase to record the actions of all programs the user executes. If the user chooses not to be bothered by applications breaking policy, systrace allows default enforcement actions to be set. Currently systrace is implemented for FreeBSD and NetBSD, with a Linux version coming out shortly.

For more information, visit http://www.citi.umich.edu/u/provos/systrace/.
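A policy is essentially a list of system-call patterns with an action for each; the fragment below only approximates the policy grammar (it is reconstructed from memory rather than copied from the tool's documentation, so check the page above for the exact syntax):

    Policy: /bin/ls, Emulation: native
        native-break: permit
        native-fsread: filename eq "/etc/pwd.db" then permit
        native-fsread: filename match "/usr/lib/*" then permit
        native-munmap: permit
        native-exit: permit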

VERIFIABLE SECRET REDISTRIBUTION FOR SURVIVABLE STORAGE SYSTEMS

Ted Wong, Carnegie Mellon University

Wong presented a protocol that can be used to re-create a file distributed over a set of servers, even if one of the servers is damaged. In his scheme, the user must choose how many shares the file is split into and the number of servers it will be stored across. The goal of this work is to provide persistent, secure storage of information even if it comes under attack. Wong stated that his design of the protocol is complete, and he is currently building a prototype implementation of it.

For more information, visit http://www.cs.cmu.edu/~tmwong/research/.
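The underlying storage model is a standard (m, n) threshold scheme of the Shamir variety, shown here only as background (Wong's contribution is the verifiable redistribution of the shares, which these formulas do not capture). The file, encoded as a secret s, is embedded in a random degree-(m-1) polynomial over a prime field; server i receives the share (x_i, f(x_i)), and any m shares recover s by Lagrange interpolation:

    f(x) = s + a_1 x + a_2 x^2 + \cdots + a_{m-1} x^{m-1} \pmod{p}

    s = f(0) = \sum_{j=1}^{m} f(x_j) \prod_{k \neq j} \frac{x_k}{x_k - x_j} \pmod{p}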

THE GURU IS IN SESSIONS

LINUX ON LAPTOP/PDA

Bdale Garbee, HP Linux Systems Operation

Summarized by Brennan Reynolds

This was a roundtable-like discussion with Garbee acting as moderator. He is currently in charge of the Debian distribution of Linux and is involved in porting Linux to platforms other than i386. He has successfully helped port the kernel to the Alpha, SPARC, and ARM, and is currently working on a version to run on VAX machines.

A discussion was held on which file system is best for battery life. The Reiser and Ext3 file systems were discussed; people commented that when using Ext3 on their laptops the disk would never spin down and thus was always consuming a large amount of power. One suggestion was to use a RAM-based file system for any volume that requires a large number of writes, and to have the file system written to disk only at infrequent intervals or when the machine is shut down or suspended.

A question was directed to Garbee about his knowledge of how the HP-Compaq merger would affect Linux. Garbee thought that the merger was good for Linux, both within the company and for the open source community as a whole. He stated that the new company would be the number one shipper of systems with Linux as the base operating system. He also fielded a question about the support of older HP servers and their ability to run Linux, saying that indeed Linux has been ported to them, and he personally had several machines in his basement running it.

A question which sparked a large number of responses concerned problems people had with running Linux on mobile platforms. Most people actually did not have many problems at all. There were a few cases of a particular machine model not working, but there were only two widespread issues: docking stations and the new ACPI power management scheme. Docking stations still appear to be somewhat of a headache, but most people had developed ad hoc solutions for getting them to work. The ACPI power management scheme developed by Intel does not appear to have a quick solution. Ted Ts'o, one of the head kernel developers, stated that there are fundamental architectural problems with the 2.4 series kernel that do not easily allow ACPI to be added. However, he also stated that ACPI is supported in the newest development kernel, 2.5, and will be included in 2.6.

The final topic was the possibility of purchasing a laptop without paying any charge/tax for Microsoft Windows. The consensus was that currently this is not possible. Even if the machine comes with Linux or without any operating system, the vendors have contracts in place which require them to pay Microsoft for each machine they ship if that machine is capable of running a Microsoft operating system. The only way this will change is if the demand for other operating systems increases to the point that vendors renegotiate their contracts with Microsoft, and no one saw this happening in the near future.

LARGE CLUSTERS

Andrew Hume, AT&T Labs – Research

Summarized by Teri Lampoudi

This was a session on clusters, big data, and resilient computing. Given that the audience was primarily interested in clusters, and not necessarily in big data or beating nodes to a pulp, the conversation revolved mainly around what it takes to make a heterogeneous cluster resilient. Hume also presented a FREENIX paper on his current cluster project, which explained in more detail much of what was abstractly claimed in the guru session.

Hume's use of the word "cluster" referred not to what I had assumed would be a Beowulf-type system, but to a loosely coupled farm of machines in which concurrency was much less of an issue than it would be in a Beowulf. In fact, the architecture Hume described was designed to deal with large amounts of independent transaction processing, essentially the process of billing calls, which requires no interprocess message passing or anything of the sort a parallel scientific application might.

Where does the large data come in? Since the transactions in question consist of large numbers of flat files, a mechanism for getting the files onto nodes and off-loading the results is necessary. In this particular cluster, named Ningaui, this task is handled by the "replication manager," which generated a large amount of interest in the audience. All management functions in the cluster, as well as job allocation, constitute services that are bid for and leased out to nodes. Furthermore, failures are handled by simply repeating the failed task until it is completed properly, if that is possible, and if it is not, then looking through detailed logs to discover what went wrong. Logging and testing for the successful completion of each task are what make this model resilient. Every management operation is logged at some appropriate level of detail, generating 120MB/day, and every single file transfer is MD5 checksummed. This was characterized by Hume as a "patient" style of management where no one controls the cluster: scheduling behavior is emergent rather than stringently planned, and nodes are free to drop in and out of the cluster. A downside to this model is the increase in latency, offset by the fact that the cluster continues to function unattended outside of 8-to-5 support hours.

In response to questions about the projected scalability of such a scheme, Hume said the system would presumably scale from the present 60 nodes to about 100 without modifications, but that scaling higher would be a matter for further design. The guru session ended on a somewhat unrelated note regarding the problems of getting data off mainframes and onto Linux machines – issues of fixed- vs. variable-length encoding of data blocks resulting in useful information being stripped by FTP, and problems in converting COBOL copybooks to something useful on a C-based architecture. To reiterate Hume's claim, there are homemade solutions for some of these problems, and the reader should feel free to contact Mr. Hume for them.

WRITING PORTABLE APPLICATIONS

Nick Stoughton, MSB Consultants

Summarized by Josh Lothian

Nick Stoughton addressed a concern that developers are facing more and more: writing portable applications. He proposed that there is no such thing as a "portable application," only those that have been ported. Developers are still in search of the Holy Grail of applications programming: a write-once, compile-anywhere program. Languages such as Java are leading up to this, but virtual machines still have to be ported, paths will still be different, etc. Web-based applications are another area that could be promising for portable applications.

Nick went on to talk about the future of portability. The recently released POSIX 2001 standards are expected to help the situation. The 2001 revision expands the original POSIX spec into seven volumes. Even though POSIX 2001 is more specific than the previous release, Nick pointed out that even this standard allows for small differences where vendors may define their own behaviors. This can only hurt developers in the long run. Along these lines, it was pointed out that there may be room for automated tools that can check application code and determine the level of portability and standards compliance of the source.

NETWORK MANAGEMENT SYSTEM PERFORMANCE TUNING

Jeff Allen, Tellme Networks, Inc.

Summarized by Matt Selsky

Jeff Allen, author of Cricket, discussed network tuning. The basic idea of tuning is: measure, twiddle, and measure again. Cricket can be used for large installations, but it's overkill for smaller installations, for which MRTG is better suited. You don't need Cricket to do measurement; doing something like a shell script that dumps data to Excel is fine, but you need to get some sort of measurement. Cricket was not designed for billing; it was meant to help answer questions.

Useful tools and techniques covered included looking for the difference between machines in order to identify the causes for the differences; measuring, but also thinking about what you're measuring and why; and remembering to use strace and tcpdump to identify problems. Problems repeat themselves; performance tuning requires an intricate understanding of all the layers involved to solve the problem and simplify the automation. (Having a computer science background helps but is not essential.) If you can reproduce the problem, you then should investigate each piece to find the bottleneck. If you can't reproduce the problem, then measure. You need to understand all the system interactions. Measurement can help determine whether things are actually slow or if the user is imagining it.

Troubleshooting begins with a hunch, but scientific processes are essential. You should be able to determine the event stream, or when each event occurs; having observation logs can help. Some hints provided included checking cron for periodic anomalies, slowing down /bin/rm to avoid I/O overload, and looking for unusual kernel CPU usage.

Also, try to use lightweight monitoring to reduce overhead. You don't need to monitor every resource on every system, but you should monitor those resources on those systems that are essential. Don't check things too often, since you can introduce overhead that way. Lots of information can be gathered from the /proc file system interface to the kernel.

BSD BOF

Host: Kirk McKusick, Author and Consultant

Summarized by Bosko Milekic

The BSD Birds of a Feather session at USENIX started off with five quick and informative talks and ended with an exciting discussion of licensing issues.

Christos Zoulas, for the NetBSD Project, went over the primary goals of the NetBSD Project (namely portability, a clean architecture, and security) and then proceeded to briefly discuss some of the challenges that the project has encountered. The successes of and areas for improvement in 1.6 were then examined. NetBSD has recently seen various improvements in its cross-build system, packaging system, and third-party tool support. As far as the kernel is concerned, a zero-copy TCP/UDP implementation has been integrated, and a new pmap code for arm32 has been introduced, as have some other rather major changes to various subsystems. More information is available in Christos's slides, which are now available at http://www.netbsd.org/gallery/events/usenix2002/.

Theo de Raadt of the OpenBSD Project presented a rundown of various new things that the OpenBSD and OpenSSH teams have been looking at. Theo's talk included an amusing description (and pictures) of some of the events that transpired during OpenBSD's hack-a-thon, which occurred the week before USENIX '02. Finally, Theo mentioned that following complications with the IPFilter license, the OpenBSD team had done a fairly extensive license audit.

Robert Watson of the FreeBSD Project brought up various examples of FreeBSD being used in the real world. He went on to describe a wide variety of changes that will surface with the upcoming FreeBSD release 5.0. Notably, 5.0 will introduce a re-architecting of the kernel aimed at providing much more scalable support for SMP machines. 5.0 will also include an early implementation of KSE, FreeBSD's version of scheduler activations, large improvements to pccard, and significant framework bits from the TrustedBSD project.

Mike Karels from Wind River Systems presented an overview of how BSD/OS has been evolving following the Wind River Systems acquisition of BSDi's software assets. Ernie Prabhakar from Apple discussed Apple's OS X and its success on the desktop. He explained how OS X aims to bring BSD to the desktop while striving to maintain a strong and stable core, one that is heavily based on BSD technology.

The BoF finished with a discussion on licensing; specifically, some folks questioned the impact of Apple's licensing for code from their core OS (the Darwin Project) on the BSD community as a whole.

CLOSING SESSION

HOW FLIES FLY

Michael H. Dickinson, University of California, Berkeley

Summarized by JD Welch

In this lively and offbeat talk, Dickinson discussed his research into the mechanisms of flight in the fruit fly (Drosophila melanogaster) and related the autonomic behavior to that of a technological system. Insects are the most successful organisms on earth, due in no small part to their early adoption of flight. Flies travel through space on straight trajectories interrupted by saccades, jumpy turning motions somewhat analogous to the fast, jumpy movements of the human eye. Using a variety of monitoring techniques, including high-speed video in a "virtual reality flight arena," Dickinson and his colleagues have observed that flies respond to visual cues to decide when to saccade during flight. For example, more visual texture makes flies saccade earlier (i.e., further away from the arena wall), described by Dickinson as a "collision control algorithm."

Through a combination of sensory inputs, flies can make decisions about their flight path. Flies possess a visual "expansion detector" which, at a certain threshold, causes the animal to turn in a certain direction. However, expansion in front of the fly sometimes causes it to land. How does the fly decide? Using the virtual reality device to replicate various visual scenarios, Dickinson observed that the flies fixate on vertical objects that move back and forth across the field. Expansion on the left side of the animal causes it to turn right, and vice versa, while expansion directly in front of the animal triggers the legs to flare out in preparation for landing.

If the eyes detect these changes, how are the responses implemented? The flies' wings have "power muscles," controlled by mechanical resonance (as opposed to the nervous system), which drive the wings, combined with neurally activated "steering muscles," which change the configuration of the wing joints. Subtle variations in the timing of impulses correspond to changes in wing movement, controlled by sensors in the "wing-pit." A small wing-like structure, the haltere, controls equilibrium by beating constantly.

Voluntary control is accomplished by the halteres, whose signals can interfere with the autonomic control of the stroke cycle. The halteres have "steering muscles" as well, and information derived from the visual system can turn off the haltere or trick it into a "virtual" problem requiring a response.

Dickinson has also studied the aerodynamics of insect flight using a device called the "Robo Fly," an oversized mechanical insect wing suspended in a large tank of mineral oil. Interestingly, Dickinson observed that the average lift generated by the flies' wings is under its body weight; the flies use three mechanisms to overcome this, including rotating the wings (rotational lift).

Insects are extraordinarily robust creatures, and because Dickinson analyzed the problem in a systems-oriented way, these observations and analysis are immediately applicable to technology: the response system can be used as an efficient search algorithm for control systems in autonomous vehicles, for example.

THE AFS WORKSHOP

Summarized by Garry Zacheiss, MIT

Love Hornquist-Astrand of the Arla development team presented the Arla status report. Arla 0.35.8 has been released. Scheduled for release soon are improved support for Tru64 UNIX, MacOS X, and FreeBSD, improved volume handling, and implementation of more of the vos/pts subcommands. It was stressed that MacOS X is considered an important platform and that a GUI configuration manager for the cache manager, a GUI ACL manager for the Finder, and a graphical login that obtains Kerberos tickets and AFS tokens at login time were all under development.

Future goals planned for the underlying AFS protocols include GSSAPI/SPNEGO support for Rx, performance improvements to Rx, an enhanced disconnected mode, and IPv6 support for Rx; an experimental patch is already available for the latter. Future Arla-specific goals include improved performance, partial file reading, and increased stability for several platforms. Work is also in progress on the RXGSS protocol extensions (integrating GSSAPI into Rx); a partial implementation exists, and work continues as developers find time.

Derrick Brashear of the OpenAFS development team presented the OpenAFS status report. OpenAFS 1.2.5 was released immediately prior to the conference; it fixed a remotely exploitable denial-of-service attack in several OpenAFS platforms, most notably IRIX and AIX. Future work planned for OpenAFS includes better support for MacOS X, including working around Finder interaction issues. Better support for the BSDs is also planned: FreeBSD has a partially implemented client, while NetBSD and OpenBSD have only server programs available right now. AIX 5, Tru64 5.1A, and MacOS X 10.2 (aka Jaguar) are all planned for the future. Other planned enhancements include nested groups in the ptserver (code donated by the University of Michigan awaits integration), disconnected AFS, and further work on a native Windows client. Derrick stated that the guiding principles of OpenAFS were to maintain compatibility with IBM AFS, support new platforms, ease the administrative burden, and add new functionality.

AFS performance was discussed. The openafs.org cell consists of two file servers, one in Stockholm, Sweden, and one in Pittsburgh, PA. AFS works reasonably well over trans-Atlantic links, although OpenAFS clients don't determine which file server to talk to very effectively. Arla clients use RTTs to the server to determine the optimal file server to fetch replicated data from. Modifications to OpenAFS to support this behavior in the future are desired.

Jimmy Engelbrecht and Harald Barth of KTH discussed their AFSCrawler script, written to determine how many AFS cells and clients were in the world, what implementations and versions they were running (Arla vs. IBM AFS vs. OpenAFS), and how much data was in AFS. The script unfortunately triggered a bug in IBM AFS 3.6-derived code, causing some clients to panic while handling a specific RPC. This has since been fixed in OpenAFS 1.2.5 and the most recent IBM AFS patch level, and all AFS users are strongly encouraged to upgrade. No release of Arla is vulnerable to this particular denial-of-service attack. There was an extended discussion of the usefulness of this exploration; many sites believed this was useful information and such scanning should continue in the future, but only on an opt-in basis.

Many sites face the problem of managing Kerberos/AFS credentials for batch-scheduled jobs. Specifically, most batch processing software needs to be modified to forward tickets as part of the batch submission process, renew tickets and tokens while the job is in the queue and for the lifetime of the job, and properly destroy credentials when the job completes. Ken Hornstein of NRL was able to pay a commercial vendor to support Kerberos 4/5 credential management in their product, although they did not implement AFS token management. MIT has implemented some of the desired functionality in OpenPBS and might be able to make it available to other interested sites.

Tools to simplify AFS administration were discussed, including:

AFS Balancer: A tool written by CMU to automate the process of balancing disk usage across all servers in a cell. Available from ftp://ftp.andrew.cmu.edu/pub/AFS-Tools/balance-1.1b.tar.gz.

Themis: Themis is KTH's enhanced version of the AFS tool "package" for updating files on local disk from a central AFS image. KTH's enhancements include allowing the deletion of files, simplifying the process of adding a file, and allowing the merging of multiple rule sets for determining which files are updated. Themis is available from the Arla CVS repository.

Stanford was presented as an example of a large AFS site. Stanford's AFS usage consists of approximately 1.4TB of data in AFS, in the form of approximately 100,000 volumes; 3.3TB of storage is available in their primary cell, ir.stanford.edu. Their file servers consist entirely of Solaris machines running a combination of Transarc 3.6 patch-level 3 and OpenAFS 1.2.x, while their database servers run OpenAFS 1.2.2. Their cell consists of 25 file servers using a combination of EMC and Sun StorEdge hardware. Stanford continues to use the kaserver for their authentication infrastructure, with future plans to migrate entirely to an MIT Kerberos 5 KDC.

Stanford has approximately 3,400 clients on their campus, not including SLAC (Stanford Linear Accelerator); approximately 2,100 AFS clients from outside Stanford contact their cell every month. Their supported clients are almost entirely IBM AFS 3.6 clients, although they plan to release OpenAFS clients soon. Stanford currently supports only UNIX clients. There is some on-campus presence of Windows clients, but Stanford has never publicly released or supported it. They do intend to release and support the MacOS X client in the near future.

All Stanford students, faculty, and staff are assigned AFS home directories, with a default quota of 50MB, for a total of approximately 550GB of user home directories. Other significant uses of AFS storage are data volumes for workstation software (400GB) and volumes for course Web sites and assignments (100GB).

AFS usage at Intel was also presented. Intel has been an AFS site since 1994. They had bad experiences with the IBM 3.5 Linux client; their experience with OpenAFS on Linux 2.4.x kernels has been much better. They use and are satisfied with the OpenAFS IA64 Linux port. Intel has hundreds of OpenAFS 1.2.3 and 1.2.4 clients in many production cells accessing data stored on IBM AFS file servers; they have not encountered any interoperability issues. Intel has some concerns about OpenAFS: they would like to purchase commercial support for OpenAFS and to see OpenAFS support for HP-UX on both PA-RISC and Itanium hardware. HP-UX support is currently unavailable due to a specific HP-UX header file being unavailable from HP; this may be available soon. Intel has not yet committed to migrating their file servers to OpenAFS and is unsure if they will do so without commercial support.

Backups are a traditional topic of discussion at AFS workshops, and this time was no exception. Many users complain that the traditional AFS backup tools ("backup" and "butc") are complex and difficult to automate, requiring many home-grown scripts and much user intervention for error recovery. An additional complaint was that the traditional AFS tools do not support file- or directory-level backups and restores; data must be backed up and restored at the volume level.

Mitch Collinsworth of Cornell presented work to make AMANDA, the free backup software from the University of Maryland, suitable for AFS backups. Using AMANDA for AFS backups allows one to share AFS backup tapes with non-AFS backups, easily run multiple backups in parallel, automate error recovery, and provide a robust degraded mode that prevents tape errors from stopping backups altogether. Their implementation allows for full volume restores as well as individual directory and file restores. They have finished coding this work and are in the process of testing and documenting it.

Peter Honeyman of CITI at the University of Michigan spoke about work he has proposed to replace Rx with RPCSEC GSS in OpenAFS; this would allow AFS to use a TCP-based transport mechanism rather than the UDP-based Rx, and possibly gain better congestion control, dynamic adaptation, and fragmentation avoidance as a result. RPCSEC GSS uses the GSSAPI to authenticate SUN ONC RPC. RPCSEC GSS is transport agnostic, provides strong security, is a developing Internet standard, and has multiple open source implementations. Backward compatibility with existing AFS servers and clients is an important goal of this project.


NETWORK PERFORMANCE

Summarized by Florian Buchholz

LINUX NFS CLIENT WRITE PERFORMANCE

Chuck Lever, Network Appliance; Peter Honeyman, CITI, University of Michigan

Lever introduced a benchmark to measure NFS client write performance. Client performance is difficult to measure due to hindrances such as poor hardware or bandwidth limitations. Furthermore, measuring application performance does not identify weaknesses specifically at the client side. Thus, a benchmark was developed that tries to exercise only data transfers in one direction, between server and application. For this purpose, the benchmark was based on the block-sequential-write portion of the Bonnie file system benchmark. Once a benchmark for NFS clients is established, it can be used to improve client performance.

The performance measurements were performed with an SMP Linux client and both a Linux NFS server and a Network Appliance F85 filer. During testing, a periodic jump in write latency was discovered. This was due to a rather large number of pending write operations that were scheduled to be written after certain threshold values were exceeded. By introducing a separate daemon that flushes the cached write requests, the spikes could be eliminated, but as a result the average latency grows over time. The problem could be traced to a function that scans a linked list of write requests; after a hashtable was added to improve lookup performance, the latency improved considerably.

The improved client was then used to measure throughput against the two servers. A discrepancy between the Linux server and the filer test was noticed, and the reason for that was traced back to a global kernel lock that was unnecessarily held when accessing the network stack. After correcting this, performance improved further. However, the measurements showed that a client may run slower when paired with fast servers on fast networks. This is due to heavy client interrupt loads, more network processing on the client side, and more global kernel lock contention.

The source code of the project is available at http://www.citi.umich.edu/projects/nfs-perf/patches/.

A STUDY OF THE RELATIVE COSTS OF NETWORK SECURITY PROTOCOLS

Stefan Miltchev and Sotiris Ioannidis, University of Pennsylvania; Angelos Keromytis, Columbia University

With the increasing need for security and integrity of remote network services, it becomes important to quantify the communication overhead of IPSec and compare it to alternatives such as SSH, SCP, and HTTPS.

For this purpose, the authors set up three testing networks: a direct link, two hosts separated by two gateways, and three hosts connecting through one gateway. Protocols were compared in each setup, and manual keying was used to eliminate connection setup costs. For IPSec, the different encryption algorithms AES, DES, 3DES, hardware DES, and hardware 3DES were used. In detail, FTP was compared to SFTP, SCP, and FTP over IPSec; HTTP to HTTPS and HTTP over IPSec; and NFS to NFS over IPSec and local disk performance.

The result of the measurements was that IPSec outperforms the other popular encryption schemes. Overall, unencrypted communication was fastest, but in some cases, like FTP, the overhead can be small. The use of crypto hardware can significantly improve performance. For future work, the inclusion of setup costs, hardware-accelerated SSL, SFTP, and SSH was mentioned.

CONGESTION CONTROL IN LINUX TCP

Pasi Sarolahti, University of Helsinki; Alexey Kuznetsov, Institute for Nuclear Research at Moscow

Having attended the talk and read the paper, I am still unclear about whether the authors are merely describing the design decisions of TCP congestion control or whether they are actually the creators of that part of the Linux code. My guess leans toward the former.

In the talk, the speaker compared the TCP congestion control measures specified in the IETF RFCs with the actual Linux implementation, which conforms to the basic principles but nevertheless has differences. A specific emphasis was placed on retransmission mechanisms and the congestion window. Also, several TCP enhancements – the NewReno algorithm, Selective ACKs (SACK), Forward ACKs (FACK) – were discussed and compared.

In some instances, Linux does not conform to the IETF specifications. The fast recovery does not fully follow RFC 2582, since the threshold for triggering retransmit is adjusted dynamically and the congestion window's size is not changed. Also, the round-trip-time estimator and the RTO calculation differ from RFC 2988, since Linux uses more conservative RTT estimates and a minimum RTO of 200ms. The performance measures showed that with the additional Linux-specific features enabled, slightly higher throughput, a steadier data flow, and fewer unnecessary retransmissions can be achieved.
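For reference, the RFC 2988 estimator the comparison is made against can be sketched as follows; the only Linux-specific detail shown is the 200ms floor mentioned above (the real kernel uses scaled integer arithmetic rather than doubles):

    #include <math.h>

    #define MIN_RTO_MS 200.0              /* Linux clamps the RTO at 200ms */

    static double srtt, rttvar, rto;

    /* Feed one round-trip-time measurement (in ms) into the estimator. */
    void rtt_sample(double r)
    {
        if (srtt == 0.0) {                /* first measurement             */
            srtt   = r;
            rttvar = r / 2.0;
        } else {                          /* subsequent measurements       */
            rttvar = 0.75 * rttvar + 0.25 * fabs(srtt - r);  /* beta = 1/4  */
            srtt   = 0.875 * srtt + 0.125 * r;               /* alpha = 1/8 */
        }
        rto = srtt + 4.0 * rttvar;
        if (rto < MIN_RTO_MS)
            rto = MIN_RTO_MS;
    }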

XTREME XCITEMENT

Summarized by Steve Bauer

THE FUTURE IS COMING: WHERE THE X WINDOW SHOULD GO

Jim Gettys, Compaq Computer Corp.

Jim Gettys, one of the principal authors of the X Window System, outlined the near-term objectives for the system, primarily focusing on the changes and infrastructure required to enable replication and migration of X applications. Providing better support for this functionality would enable users to retrieve or duplicate X applications between their servers at home and work.

One interesting example of an application that currently is capable of migration and replication is Emacs. To create a new frame on DISPLAY, try "M-x make-frame-on-display <Return> DISPLAY <Return>".

However, technical challenges make replication and migration difficult in general. These include the "major headaches" of server-side fonts, the nonuniformity of X servers and screen sizes, and the need to appropriately retrofit toolkits. Authentication and authorization issues are obviously also important. The rest of the talk delved into some of the details of these interesting technical challenges.

HACKING IN THE KERNEL

Summarized by Hai Huang

AN IMPLEMENTATION OF SCHEDULER ACTIVATIONS ON THE NETBSD OPERATING SYSTEM

Nathan J. Williams, Wasabi Systems

Scheduler activation is an old idea. Basically, there are benefits and drawbacks to using solely kernel-level or solely user-level threading. Scheduler activation is able to combine the two layers of control to provide more concurrency in the system.

In his talk, Nathan gave a fairly detailed description of the implementation of scheduler activations in the NetBSD kernel. One important change in the implementation is to differentiate the thread context from the process context. This is done by defining a separate data structure for these threads and relocating some of the information that was embedded in the process context to these thread contexts. The stack was especially a concern, due to the upcall mechanism: special handling must be done to make sure that an upcall doesn't mess up the stack, so that the preempted user-level thread can continue afterwards. Lastly, Nathan explained that signals are handled by upcalls.
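In rough outline, the split looks like the sketch below: per-thread execution state moves into its own structure, while the process structure keeps only what all threads share. The field and type names here are simplified stand-ins, not the actual NetBSD definitions.

    struct proc_stub;                      /* shared, process-wide state        */

    struct lwp_stub {                      /* one kernel-visible thread         */
        struct proc_stub *l_proc;          /* owning process                    */
        void             *l_upcall_stack;  /* stack used when an upcall is made */
        int               l_state;         /* running, blocked, preempted, ...  */
        void             *l_md_context;    /* saved machine-dependent registers */
        struct lwp_stub  *l_next;          /* sibling threads in the process    */
    };

    struct proc_stub {
        int              p_pid;
        void            *p_vmspace;        /* address space shared by threads   */
        struct lwp_stub *p_lwps;           /* list of this process's threads    */
    };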


AUTHORIZATION AND CHARGING IN PUBLIC WLANS USING FREEBSD AND 802.1X

Pekka Nikander, Ericsson Research NomadicLab

The 802.1x standards are well known in the wireless community as link-layer authentication protocols. In this talk, Pekka explained some novel ways of using the 802.1x protocols that might be of interest to people on the move. It is possible to set up a public WLAN that supports various charging schemes via virtual tokens, which people can purchase or earn and later use.

This is implemented on FreeBSD using the netgraph utility. It is basically a filter in the link layer that differentiates traffic based on the MAC address of the client node, which is either authenticated, denied, or let through. The overhead for this service is fairly minimal.

ACPI IMPLEMENTATION ON FREEBSD

Takanori Watanabe, Kobe University

ACPI (Advanced Configuration and Power Interface) was proposed as a joint effort by Intel, Toshiba, and Microsoft to provide a standard, finer-grained method of managing the power states of individual devices within a system. Such low-level power management is especially important for mobile and embedded systems that are powered by fixed-capacity batteries.

Takanori explained that the ACPI specification is composed of three parts: tables, BIOS, and registers. He was able to implement some of the functionality of ACPI in the FreeBSD kernel. The ACPI Component Architecture, implemented by Intel, provides a high-level ACPI API to the operating system; Takanori's ACPI implementation is built upon this underlying layer of APIs.

ACCESS CONTROL

Summarized by Florian Buchholz

DESIGN AND PERFORMANCE OF THE OPENBSD STATEFUL PACKET FILTER (PF)

Daniel Hartmeier, Systor AG

Daniel Hartmeier described the new stateful packet filter (pf) that replaced IPFilter in the OpenBSD 3.0 release. IPFilter could no longer be included due to licensing issues, and thus there was a need to write a new filter making use of optimized data structures.

The filter rules are implemented as a linked list which is traversed from start to end. Two actions may be taken according to the rules, "pass" or "block": a "pass" action forwards the packet, and a "block" action drops it. Where more than one rule matches a packet, the last rule wins. Rules that are marked as "final" will immediately terminate any further rule evaluation. An optimization called "skip-steps" was also implemented, where blocks of similar rules are skipped if they cannot match the current packet; these skip-steps are calculated when the rule set is loaded. Furthermore, a state table keeps track of TCP connections, and only packets that match the sequence numbers are allowed. UDP and ICMP queries and replies are also considered in the state table, where initial packets create an entry for a pseudo-connection with a low initial timeout value. The state table is implemented using a balanced binary search tree. NAT mappings are also stored in the state table, whereas application proxies reside in user space. The packet filter is also able to perform fragment reassembly and to modulate TCP sequence numbers to protect hosts behind the firewall.
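A few illustrative rules show how these pieces interact; the syntax below is in the spirit of early pf releases and may not match a given version's grammar exactly:

    # last matching rule wins, so establish a default deny first
    block in  all
    block out all

    # "quick" makes a matching rule final; "keep state" adds the connection
    # to the state table, so later packets bypass rule evaluation entirely
    pass in  quick on fxp0 proto tcp from any to any port = 22 keep state
    pass out quick on fxp0 proto tcp from any to any keep state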

Pf was compared against IPFilter as well as Linux's iptables by measuring throughput and latency with increasing traffic rates and different packet sizes. In a test with a fixed rule-set size of 100, iptables outperformed the other two filters, whose results were close to each other. In a second test, where the rule-set size was continually increased, iptables consistently had about twice the throughput of the other two (which evaluate the rule set on both the incoming and outgoing interfaces). A third test compared only pf and IPFilter, using a single rule that created state in the state table, with a fixed state-entry size. Pf reached an overloaded state much later than IPFilter. The experiment was repeated with a variable state-entry size, and pf performed much better than IPFilter for a small number of states.

In general, rule-set evaluation is expensive, and benchmarks only reflect extreme cases, whereas in real life other behavior should be observed. Furthermore, the benchmarks show that stateful filtering can actually improve performance, due to cheap state-table lookups as compared to rule evaluation.

ENHANCING NFS CROSS-ADMINISTRATIVE DOMAIN ACCESS

Joseph Spadavecchia and Erez Zadok, Stony Brook University

The speaker presented modifications to an NFS server that allow improved NFS access between administrative domains. The problem lies in the fact that NFS assumes a shared UID/GID space, which makes it unsafe to export files outside the administrative domain of the server. The design goals were to leave the protocol and existing clients unchanged, to require a minimum amount of server changes, and to provide flexibility and increased performance.

To solve the problem, two techniques are utilized: "range-mapping" and "file-cloaking." Range-mapping maps IDs between client and server. The mapping is performed on a per-export basis and has to be manually set up in an export file. The mappings can be 1-1, N-N, or N-1. In file-cloaking, the server restricts file access based on UID/GID and special cloaking-mask bits; users can only access their own file permissions. The policy on whether or not the file is visible to others is set by the cloaking mask, which is logically ANDed with the file's protection bits. File-cloaking only works, however, if the client doesn't hold cached copies of directory contents and file attributes. Because of this, the clients are forced to re-read directories by incrementing the mtime value of the directory each time it is listed.

To test the performance of the modified server, five different NFS configurations were evaluated: an unmodified NFS server was compared against one server with the modified code included but not used, one with only range-mapping enabled, one with only file-cloaking enabled, and one version with all modifications enabled. For each setup, different file system benchmarks were run. The results show only a small overhead when the modifications are used, generally an increase of below 5 percent. Another experiment tested the performance of the system with an increasing number of mapped or cloaked entries; the results show that an increase from 10 to 1,000 entries resulted in a maximum of about a 14 percent cost in performance.

One member of the audience pointed out that if clients choose to ignore the changed mtimes from the server, and thus still hold caches of the directory entries, the file-cloaking mechanism could be defeated. After a rather lengthy debate, the speaker had to concede that the model doesn't add any extra security. Another question was asked about the scalability of setting up range-mapping; the speaker referred to application-level tools that could be developed for that purpose.

The software is available at ftp://ftp.fsl.cs.sunysb.edu/pub/enf/.

ENGINEERING OPEN SOURCE SOFTWARE

Summarized by Teri Lampoudi

NINGAUI: A LINUX CLUSTER FOR BUSINESS

Andrew Hume, AT&T Labs – Research; Scott Daniels, Electronic Data Systems Corp.

Ningaui is a general-purpose, highly available, resilient architecture built from commodity software and hardware. Emphatically, however, Ningaui is not a Beowulf. Hume calls the cluster design the "Swiss canton model," in which there are a number of loosely affiliated, independent nodes with data replicated among them. Jobs are assigned by bidding and leases, and cluster services done as session-based servers are sited via generic job assignment. The emphasis is on keeping the architecture end-to-end, checking all work via checksums, and logging everything. The resilience and high availability required by their goal of 8-to-5 maintenance – vs. the typical 24-7 model, where people get paged whenever the slightest thing goes wrong, regardless of the time of day – is achieved by job restartability. Finally, all computation is performed on local data, without the use of NFS or network-attached storage.

Hume's message is a hopeful one: despite the many problems encountered – things like kernel and service limits, auto-installation problems, TCP storms, and the like – the existence of source code and the paranoid practice of logging and checksumming everything has helped. The final product performs reasonably well, and it appears that the resilient techniques employed do make a difference. One drawback, however, is that the software mentioned in the paper is not yet available for download.

CPCMS: A CONFIGURATION MANAGEMENT SYSTEM BASED ON CRYPTOGRAPHIC NAMES

Jonathan S. Shapiro and John Vanderburgh, Johns Hopkins University

This paper won the FREENIX Best Paper award.

The basic notion behind the project is the fact that everyone has a pet complaint about CVS, and yet it is currently the configuration manager in most widespread use. Shapiro has unleashed an alternative. Interestingly, he did not begin but ended with the usual host of reasons why CVS is bad. The talk instead began abruptly by characterizing the job of a software configuration manager, continued by stating the namespaces which it must handle, and wrapped up with the challenges faced.

X MEETS Z: VERIFYING CORRECTNESS IN THE PRESENCE OF POSIX THREADS

Bart Massey, Portland State University; Robert T. Bauer, Rational Software Corp.

Massey delivered a humorous talk on the insight gained from applying the Z formal specification notation to system software design, rather than the more informal analysis and design process normally used.

The story is told with respect to writing XCB, which replaces the Xlib protocol layer and is supposed to be thread-friendly. But where threads are concerned, deadlock avoidance becomes a hard problem that cannot be solved in an ad hoc manner, and full model checking is also too hard. In this case, Massey resorted to Z specification to model the XCB lower layer, abstract away locking and data transfers, and locate fundamental issues. Essentially, the difficulties of searching the literature and locating information relevant to the problem at hand were overcome. As Massey put it, "the formal method saved the day."

FILE SYSTEMS

Summarized by Bosko Milekic

PLANNED EXTENSIONS TO THE LINUX EXT2/EXT3 FILESYSTEM

Theodore Y. Ts'o, IBM; Stephen Tweedie, Red Hat

The speaker presented improvements to the Linux Ext2 file system, with the goal of allowing for various expansions while striving to maintain compatibility with older code. Improvements have been facilitated by a few extra superblock fields that were added to Ext2 not long before the Linux 2.0 kernel was released. The fields allow for file system features to be added without compromising the existing setup; this is done by providing bits indicating the impact of the added features with respect to compatibility. Namely, a file system with the "incompat" bit marked is not allowed to be mounted, while a "read-only" marking only allows the file system to be mounted read-only.
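The convention can be summarized in a few lines of pseudo-kernel code; the constants and field names here are simplified, not the real superblock layout:

    struct sb_stub {
        unsigned int feature_compat;       /* safe to ignore if unknown        */
        unsigned int feature_ro_compat;    /* unknown bits: mount read-only    */
        unsigned int feature_incompat;     /* unknown bits: refuse the mount   */
    };

    #define SUPPORTED_INCOMPAT  0x0007u    /* illustrative values only */
    #define SUPPORTED_RO_COMPAT 0x0003u

    /* -1 = refuse mount, 1 = force read-only, 0 = mount read-write */
    int check_features(const struct sb_stub *sb)
    {
        if (sb->feature_incompat & ~SUPPORTED_INCOMPAT)
            return -1;
        if (sb->feature_ro_compat & ~SUPPORTED_RO_COMPAT)
            return 1;
        return 0;                          /* unknown "compat" bits are ignored */
    }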

Directory indexing replaces linear directory searches with a faster search using a fixed-depth tree and hashed keys. File system size can be dynamically increased, and the expanded inode, doubled from 128 to 256 bytes, allows for more extensions.

Other potential improvements were discussed as well, in particular pre-allocation for contiguous files, which allows for better performance in certain setups by pre-allocating contiguous blocks. Security-related modifications, extended attributes, and ACLs were mentioned. An implementation of these features already exists but has not yet been merged into the mainline Ext2/3 code.

RECENT FILESYSTEM OPTIMISATIONS ON FREEBSD

Ian Dowse, Corvil Networks; David Malone, CNRI, Dublin Institute of Technology

David Malone presented four important file system optimizations for the FreeBSD OS: soft updates, dirpref, vmiodir, and dirhash. It turns out that certain combinations of the optimizations (beautifully illustrated in the paper) may yield performance improvements of anywhere between 2 and 10 orders of magnitude for real-world applications.

All four of the optimizations deal with file system metadata. Soft updates allow for asynchronous metadata updates. Dirpref changes the way directories are organized, attempting to place child directories closer to their parents, thereby increasing locality of reference and reducing disk-seek times. Vmiodir trades some extra memory for better directory caching. Finally, dirhash, which was implemented by Ian Dowse, changes the way in which entries are searched in a directory. Specifically, FFS uses a linear search to find an entry by name; dirhash builds a hashtable of directory entries on the fly. For a directory of size n with a working set of m files, a search that in certain cases could have been O(n*m) has been reduced, thanks to dirhash, to effectively O(n + m).
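The gist, in miniature (the real dirhash operates on FFS directory blocks and manages its own memory; the hash function and structures below are only illustrative):

    #include <string.h>

    #define NBUCKETS 256

    struct dent { const char *name; long offset; struct dent *next; };

    static struct dent *bucket[NBUCKETS];

    static unsigned hash_name(const char *s)
    {
        unsigned h = 5381;
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h % NBUCKETS;
    }

    void dirhash_insert(struct dent *d)
    {
        unsigned h = hash_name(d->name);
        d->next = bucket[h];
        bucket[h] = d;
    }

    /* Only the entries that hash to one bucket are compared, instead of every
       entry in the directory as in the plain linear FFS lookup.              */
    struct dent *dirhash_lookup(const char *name)
    {
        struct dent *d;
        for (d = bucket[hash_name(name)]; d != NULL; d = d->next)
            if (strcmp(d->name, name) == 0)
                return d;
        return NULL;
    }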

FILESYSTEM PERFORMANCE AND SCALABILITY IN LINUX 2.4.17

Ray Bryant, SGI; Ruth Forester, IBM LTC; John Hawkes, SGI

This talk focused on performance evaluation of a number of file systems available and commonly deployed on Linux machines. Comparisons under various configurations of Ext2, Ext3, ReiserFS, XFS, and JFS were presented.

The benchmarks chosen for the data gathering were pgmeter, filemark, and AIM Benchmark Suite VII. Pgmeter measures the rate of data transfer of reads/writes of a file under a synthetic workload. Filemark is similar to postmark in that it is an operation-intensive benchmark, although filemark is threaded and offers various other features that postmark lacks. AIM VII measures performance for various file-system-related functionalities; it offers various metrics under an imposed workload, thus stressing the performance of the file system not only under I/O load but also under significant CPU load.

Tests were run on three different setups: a small, a medium, and a large configuration. ReiserFS and Ext2 appear at the top of the pile for the smaller and medium setups. Notably, XFS and JFS perform worse for smaller system configurations than the others, although XFS clearly appears to generate better numbers than JFS. It should be noted that XFS seems to scale well under a higher load; this was most evident in the large-system results, where XFS appears to offer the best overall results.


This was an interesting talk pertaining tothe effects of cache coloring for taskstructures in the Linux kernel scheduler(Linux kernel 24x) The speaker firstpresented some interesting benchmarknumbers for the Linux scheduler show-ing that as the number of processes onthe task queue was increased the perfor-mance decreased The authors usedsome really nifty hardware to measurethe number of bus transactionsthroughout their tests and were thusable to reasonably quantify the impactthat cache misses had in the Linuxscheduler

Their experiments led them to imple-ment a cache coloring scheme for taskstructures which were previouslyaligned on 8KB boundaries and there-fore were being eventually mapped tothe same cache lines This unfortunateplacement of task structures in memoryinduced a significant number of cachemisses as the number of tasks grew inthe scheduler

The implemented solution consisted ofaligning task structures to more evenlydistribute cached entries across the L2cache The result was inevitably fewercache misses in the scheduler Some neg-ative effects were observed in certain sit-uations These were primarily due tomore cache slots being used by taskstructure data in the scheduler thusforcing data previously cached there tobe pushed out

81October 2002 login

C

ON

FERE

NC

ERE

PORT

SWORK-IN-PROGRESS REPORTS

Summarized by Brennan Reynolds

RESOURCE VIRTUALIZATION TECHNIQUES FOR

WIDE-AREA OVERLAY NETWORKS

Kartik Gopalan University Stony Brook

This work addressed the issue of provi-sioning a maximum number of virtualoverlay networks (VON) with diversequality of service (QoS) requirementson a single physical network Each of theVONs is logically isolated from others toensure the QoS requirements Gopalanmentioned several critical researchissues with this problem that are cur-rently being investigated Dealing withhow to provision the network at variouslevels (link route or path) and thenenforce the provisioning at run-time isone of the toughest challenges Cur-rently Gopalan has developed severalalgorithms to handle admission controlend-to-end QoS route selection sched-uling and fault tolerance in the net-work

For more information visithttpwwwecslcssunysbedu

VISUALIZING SOFTWARE INSTABILITY

Jennifer Bevan University of CaliforniaSanta Cruz

Detection of instability in software hastypically been an afterthought Thepoint in the development cycle when thesoftware is reviewed for instability isusually after it is difficult and costly togo back and perform major modifica-tions to the code base Bevan has devel-oped a technique to allow thevisualization of unstable regions of codethat can be used much earlier in thedevelopment cycle Her technique cre-ates a time series of dependent graphsthat include clusters and lines calledfault lines From the graphs a developeris able to easily determine where theunstable sections of code are and proac-tively restructure them She is currentlyworking on a prototype implementa-tion

For more information visit httpwwwcseucscecu~jbevanevo_viz

RELIABLE AND SCALABLE PEER-TO-PEER WEB

DOCUMENT SHARING

Li Xiao William and Mary College

The idea presented by Xiao would allowend users to share the content of theWeb browser caches with neighboringInternet users The rationale for this isthat todayrsquos end users are increasinglyconnected to the Internet over high-speed links and the browser caches arebecoming large storage systems There-fore if an individual accesses a pagewhich does not exist in their local cacheXiao is suggesting that they first queryother end users for the content beforetrying to access the machine hosting theoriginal This strategy does have someserious problems associated with it thatstill need to be addressed includingensuring the integrity of the content andprotecting the identity and privacy ofthe end users

SEREL FAST BOOTING FOR UNIX

Leni Mayo, Fast Boot Software

Serel is a tool that generates a visual representation of a UNIX machine's boot-up sequence. It can be used to identify the critical path and can show if a particular service or process blocks for an extended period of time. This information could be used to determine where the largest performance gains could be realized by tuning the order of execution at boot-up. Serel creates a dependency graph, expressed in XML, during boot-up; this graph is then used to create the visual representation. Currently the tool only works on POSIX-compliant systems, but Mayo stated that he would be porting it to other platforms. Other extensions that were mentioned included having the metadata include the use of shared libraries and monitoring the suspend/resume sequence of portable machines.

For more information, visit http://www.fastboot.org.


BERKELEY DB XML

John Merrells, Sleepycat Software

Merrells gave a quick introduction and overview of the new XML library for Berkeley DB. The library specializes in storage and retrieval of XML content through a tool called XPath. The library allows for multiple containers per document and stores everything natively as XML. The user is also given a wide range of elements to create indices with, including edges, elements, text strings, or presence. The XPath tool consists of a query parser, generator, optimizer, and execution engine. To conclude his presentation, Merrells gave a live demonstration of the software.

For more information, visit http://www.sleepycat.com/xml.

CLUSTER-ON-DEMAND (COD)

Justin Moore, Duke University

Modern clusters are growing at a rapid rate. Many have pushed beyond the 5,000-machine mark, and deploying them results in large expenses as well as management and provisioning issues. Furthermore, if the cluster is "rented" out to various users, it is very time-consuming to configure it to a user's specs, regardless of how long they need to use it. The COD work presented creates dynamic virtual clusters within a given physical cluster. The goal was to have a provisioning tool that would automatically select a chunk of available nodes and install the operating system and middleware specified by the customer in a short period of time. This would allow a greater use of resources, since multiple virtual clusters can exist at once, and by using a virtual cluster the size can be changed dynamically. Moore stated that they have created a working prototype and are currently testing and benchmarking its performance.

For more information, visit http://www.cs.duke.edu/~justin/cod.


CATACOMB

Elias Sinderson, University of California, Santa Cruz

Catacomb is a project to develop a database-backed DASL module for the Apache Web server. It was designed as a replacement for WebDAV. The initial release of the module only contains support for a MySQL database but could be extended to others. Sinderson briefly touched on the performance of the module: it was comparable to the mod_dav Apache module for all query types but search. The presentation was concluded with remarks about adding support for the lock method and including ACL specifications in the future.

For more information, visit http://ocean.cse.ucsc.edu/catacomb.

SELF-ORGANIZING STORAGE

Dan Ellard, Harvard University

Ellard's presentation introduced a storage system that tuned itself based on the workload of the system, without requiring the intervention of the user. The intelligence was implemented as a virtual self-organizing disk that resides below any file system. The virtual disk observes the access patterns exhibited by the system and then attempts to predict what information will be accessed next. An experiment was done using an NFS trace taken at a large ISP on one of their email servers. Ellard's self-organizing storage system worked well, which he attributes to the fact that most of the files being requested were large email boxes. Areas of future work include exploration of the length and detail of the data-collection stage, as well as the CPU impact of running the virtual disk layer.

VISUAL DEBUGGING

John Costigan, Virginia Tech

Costigan feels that the state of current debugging facilities in the UNIX world is not as good as it should be. He proposes the addition of several elements to programs like ddd and gdb. The first addition is including separate heap and stack components. Another is to display only certain fields of a data structure and have the ability to zoom in and out if needed. Finally, the debugger should provide the ability to visually present complex data structures, including linked lists, trees, etc. He said that a beta version of a debugger with these abilities is currently available.

For more information, visit http://infovis.cs.vt.edu/datastruct.

IMPROVING APPLICATION PERFORMANCE THROUGH SYSTEM-CALL COMPOSITION

Amit Purohit, Stony Brook University

Web servers perform a huge number of context switches and internal data copies during normal operation. These two elements can drastically limit the performance of an application, regardless of the hardware platform it is run on. The Compound System Call (CoSy) framework is an attempt to reduce the performance penalty for context switches and data copies. It includes a user-level library and several kernel facilities that can be used via system calls. The library provides the programmer with a complete set of memory-management functions. Performance tests using the CoSy framework showed a large saving for context switches and data copies. The only area where the savings between conventional libraries and CoSy were negligible was for very large file copies.
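CoSy's own interface is not shown in the summary, but Linux's sendfile(2) illustrates the same principle on a smaller scale: folding a per-block read/write loop into one system call removes most kernel crossings and the user-space bounce buffer. A sketch, not the CoSy API:

#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* The classic approach: two kernel crossings and one user-space copy
 * per 8KB block served. */
static void copy_with_loop(int in_fd, int out_fd)
{
    char buf[8192];
    ssize_t n;

    while ((n = read(in_fd, buf, sizeof(buf))) > 0)
        write(out_fd, buf, (size_t)n);
}

/* The composed approach: describe the whole transfer once and let the
 * kernel run the loop internally (on Linux, out_fd is typically a socket). */
static void copy_with_sendfile(int in_fd, int out_fd)
{
    struct stat st;

    if (fstat(in_fd, &st) == 0)
        sendfile(out_fd, in_fd, NULL, (size_t)st.st_size);
}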

For more information, visit http://www.cs.sunysb.edu/~purohit.

ELASTIC QUOTAS

John Oscar, Columbia University

Oscar began by stating that most resources are flexible but that, to date, disks have not been. While approximately 80 percent of files are short-lived, disk quotas are hard limits imposed by administrators. The idea behind elastic quotas is to have a non-permanent storage area that each user can temporarily use. Oscar suggested the creation of an ehome directory structure to be used in deploying elastic quotas. Currently, he has implemented elastic quotas as a stackable file system.

There were also several trends apparent in the data. Many users would ssh to their remote host (good) but then launch an email reader on the local machine which would connect via POP to the same host and send the password in clear text (bad). This situation can easily be remedied by tunneling protocols like POP through SSH, but it appeared that many people were not aware this could be done. While most of his comments on the use of the wireless network were negative, the list of passwords he had collected showed that people were indeed using strong passwords. His recommendations were to educate and encourage people to use protocols like IPSec, SSH, and SSL when conducting work over a wireless network, because you never know who else is listening.

SYSTRACE

Niels Provos, University of Michigan

How can one be sure the applications one is using actually do exactly what their developers said they do? Short answer: you can't. People today are using a large number of complex applications, which means it is impossible to check each application thoroughly for security vulnerabilities. There are tools out there that can help, though. Provos has developed a tool called systrace that allows a user to generate a policy of acceptable system calls a particular application can make. If the application attempts to make a call that is not defined in the policy, the user is notified and allowed to choose an action. Systrace includes an auto-policy generation mechanism, which uses a training phase to record the actions of all programs the user executes. If the user chooses not to be bothered by applications breaking policy, systrace allows default enforcement actions to be set. Currently systrace is implemented for FreeBSD and NetBSD, with a Linux version coming out shortly.
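For example (the flag and the policy grammar below are recalled only approximately; consult the systrace documentation for the exact form), a training run and one of the rules it might record look roughly like this:

# Training phase: run the program once and record every system call it
# makes as a permit rule in an automatically generated policy.
systrace -A /usr/bin/lynx

# A line from the resulting policy, permitting one specific file read:
# native-fsread: filename eq "/etc/resolv.conf" then permit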


For more information, visit http://www.citi.umich.edu/u/provos/systrace.

VERIFIABLE SECRET REDISTRIBUTION FOR SURVIVABLE STORAGE SYSTEMS

Ted Wong, Carnegie Mellon University

Wong presented a protocol that can be used to re-create a file distributed over a set of servers even if one of the servers is damaged. In his scheme, the user must choose how many shares the file is split into and the number of servers it will be stored across. The goal of this work is to provide persistent, secure storage of information even if it comes under attack. Wong stated that his design of the protocol was complete, and he is currently building a prototype implementation of it.

For more information, visit http://www.cs.cmu.edu/~tmwong/research.

THE GURU IS IN SESSIONS

LINUX ON LAPTOP/PDA

Bdale Garbee, HP Linux Systems Operation

Summarized by Brennan Reynolds

This was a roundtable-like discussion with Garbee acting as moderator. He is currently in charge of the Debian distribution of Linux and is involved in porting Linux to platforms other than i386. He has successfully helped port the kernel to the Alpha, SPARC, and ARM, and is currently working on a version to run on VAX machines.

A discussion was held on which file system is best for battery life. The ReiserFS and Ext3 file systems were discussed; people commented that when using Ext3 on their laptops the disk would never spin down and thus was always consuming a large amount of power. One suggestion was to use a RAM-based file system for any volume that requires a large number of writes, and to only have the file system written to disk at infrequent intervals or when the machine is shut down or suspended.
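On Linux, one way to follow that suggestion is a RAM-backed tmpfs mount for a write-heavy directory (the size and mount point below are arbitrary examples, and anything stored there is lost at power-off unless it is synced back to disk):

# /etc/fstab entry keeping a chatty directory in RAM so the disk can spin down
tmpfs   /var/log   tmpfs   size=32m,mode=0755   0 0

# or, mounted by hand for testing:
mount -t tmpfs -o size=32m tmpfs /var/log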

A question was directed to Garbee about his knowledge of how the HP-Compaq merger would affect Linux. Garbee thought that the merger was good for Linux, both within the company and for the open source community as a whole. He stated that the new company would be the number one shipper of systems with Linux as the base operating system. He also fielded a question about the support of older HP servers and their ability to run Linux, saying that indeed Linux has been ported to them and he personally had several machines in his basement running it.

A question which sparked a large number of responses concerned problems people had with running Linux on mobile platforms. Most people actually did not have many problems at all. There were a few cases of a particular machine model not working, but there were only two widespread issues: docking stations and the new ACPI power management scheme. Docking stations still appear to be somewhat of a headache, but most people had developed ad hoc solutions for getting them to work. The ACPI power management scheme developed by Intel does not appear to have a quick solution. Ted Ts'o, one of the head kernel developers, stated that there are fundamental architectural problems with the 2.4 series kernel that do not easily allow ACPI to be added. However, he also stated that ACPI is supported in the newest development kernel, 2.5, and will be included in 2.6.

The final topic was the possibility of purchasing a laptop without paying any charge/tax for Microsoft Windows. The consensus was that currently this is not possible. Even if the machine comes with Linux or without any operating system, the vendors have contracts in place which require them to pay Microsoft for each machine they ship if that machine is capable of running a Microsoft operating system. The only way this will change is if the demand for other operating systems increases to the point that vendors renegotiate their contracts with Microsoft, and no one saw this happening in the near future.

LARGE CLUSTERS

Andrew Hume, AT&T Labs – Research

Summarized by Teri Lampoudi

This was a session on clusters, big data, and resilient computing. Given that the audience was primarily interested in clusters and not necessarily big data or beating nodes to a pulp, the conversation revolved mainly around what it takes to make a heterogeneous cluster resilient. Hume also presented a FREENIX paper on his current cluster project, which explained in more detail much of what was abstractly claimed in the guru session.

Hume's use of the word "cluster" referred not to what I had assumed would be a Beowulf-type system, but to a loosely coupled farm of machines in which concurrency was much less of an issue than it would be in a Beowulf. In fact, the architecture Hume described was designed to deal with large amounts of independent transaction processing, essentially the process of billing calls, which requires no interprocess message passing or anything of the sort a parallel scientific application might.

Where does the large data come in? Since the transactions in question consist of large numbers of flat files, a mechanism for getting the files onto nodes and off-loading the results is necessary. In this particular cluster, named Ningaui, this task is handled by the "replication manager," which generated a large amount of interest in the audience. All management functions in the cluster, as well as job allocation, constitute services that are to be bid for and leased out to nodes. Furthermore, failures are handled by simply repeating the failed task until it is completed properly, if that is possible, and if it is not, then looking through detailed logs to discover what went wrong. Logging and testing for the successful completion of each task are what make this model resilient. Every management operation is logged at some appropriate level of detail, generating 120MB/day, and every single file transfer is md5-checksummed. This was characterized by Hume as a "patient" style of management, where no one controls the cluster, scheduling behavior is emergent rather than stringently planned, and nodes are free to drop in and out of the cluster; a downside to this model is the increase in latency, offset by the fact that the cluster continues to function unattended outside of 8-to-5 support hours.

In response to questions about the projected scalability of such a scheme, Hume said the system would presumably scale from the present 60 nodes to about 100 without modifications, but that scaling higher would be a matter for further design. The guru session ended on a somewhat unrelated note regarding the problems of getting data off mainframes and onto Linux machines: issues of fixed- vs. variable-length encoding of data blocks resulting in useful information being stripped by FTP, and problems in converting COBOL copybooks to something useful on a C-based architecture. To reiterate Hume's claim, there are homemade solutions for some of these problems, and the reader should feel free to contact Mr. Hume for them.

WRITING PORTABLE APPLICATIONS

Nick Stoughton, MSB Consultants

Summarized by Josh Lothian

Nick Stoughton addressed a concern that developers are facing more and more: writing portable applications. He proposed that there is no such thing as a "portable application," only those that have been ported. Developers are still in search of the Holy Grail of applications programming: a write-once, compile-anywhere program. Languages such as Java are leading up to this, but virtual machines still have to be ported, paths will still be different, etc. Web-based applications are another area that could be promising for portable applications.

Nick went on to talk about the future of portability. The recently released POSIX 2001 standards are expected to help the situation. The 2001 revision expands the original POSIX spec into seven volumes. Even though POSIX 2001 is more specific than the previous release, Nick pointed out that even this standard allows for small differences where vendors may define their own behaviors, which can only hurt developers in the long run. Along these lines, it was pointed out that there may be room for automated tools that have the capability to check application code and determine the level of portability and standards compliance of the source.
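As a small illustration of targeting the new revision, a program can state which POSIX interfaces it expects at compile time and ask the running system which revision it actually provides (a minimal, self-contained example):

/* Request the POSIX.1-2001 interfaces and report what is available. */
#define _POSIX_C_SOURCE 200112L

#include <stdio.h>
#include <unistd.h>

int main(void)
{
#ifdef _POSIX_VERSION
    printf("compile-time _POSIX_VERSION: %ld\n", (long)_POSIX_VERSION);
#endif
    printf("runtime sysconf(_SC_VERSION): %ld\n", sysconf(_SC_VERSION));
    return 0;
}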

NETWORK MANAGEMENT SYSTEM PERFORMANCE TUNING

Jeff Allen, Tellme Networks, Inc.

Summarized by Matt Selsky

Jeff Allen, author of Cricket, discussed network tuning. The basic idea of tuning is: measure, twiddle, and measure again. Cricket can be used for large installations, but it's overkill for smaller installations, for which MRTG is better suited. You don't need Cricket to do measurement; doing something like a shell script that dumps data to Excel is good, but you need to get some sort of measurement. Cricket was not designed for billing; it was meant to help answer questions.

Useful tools and techniques covered included looking for the difference between machines in order to identify the causes for the differences; measuring, but also thinking about what you're measuring and why; and remembering to use strace and tcpdump to identify problems. Problems repeat themselves; performance tuning requires an intricate understanding of all the layers involved to solve the problem and simplify the automation. (Having a computer science background helps but is not essential.) If you can reproduce the problem, you then should investigate each piece to find the bottleneck. If you can't reproduce the problem, then measure. You need to understand all the system interactions. Measurement can help determine whether things are actually slow or if the user is imagining it.

Troubleshooting begins with a hunch, but scientific processes are essential. You should be able to determine the event stream, or when each event occurs; having observation logs can help. Some hints provided include checking cron for periodic anomalies, slowing down /bin/rm to avoid I/O overload, and looking for unusual kernel CPU usage.

Also, try to use lightweight monitoring to reduce overhead. You don't need to monitor every resource on every system, but you should monitor those resources on those systems that are essential. Don't check things too often, since you can introduce overhead that way. Lots of information can be gathered from the /proc file system interface to the kernel.
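A few concrete commands of the sort discussed (Linux shown; the placeholders in angle brackets are yours to fill in):

# Which system calls dominate, and how much time do they cost?
strace -c -p <pid>

# What is the application actually doing on the wire?
tcpdump -i eth0 host <server> and port 80

# Cheap kernel-side counters, with no extra daemons required
cat /proc/loadavg /proc/meminfo
cat /proc/<pid>/status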

BSD BOF

Host: Kirk McKusick, Author and Consultant

Summarized by Bosko Milekic

The BSD Birds of a Feather session at USENIX started off with five quick and informative talks and ended with an exciting discussion of licensing issues.

Christos Zoulas, for the NetBSD Project, went over the primary goals of the NetBSD Project (namely portability, a clean architecture, and security) and then proceeded to briefly discuss some of the challenges that the project has encountered. The successes and areas for improvement of 1.6 were then examined. NetBSD has recently seen various improvements in its cross-build system, packaging system, and third-party tool support. For what concerns the kernel, a zero-copy TCP/UDP implementation has been integrated, and new pmap code for arm32 has been introduced, as have some other rather major changes to various subsystems. More information is available in Christos's slides, which are now available at http://www.netbsd.org/gallery/events/usenix2002.

Theo de Raadt of the OpenBSD Project presented a rundown of various new things that the OpenBSD and OpenSSH teams have been looking at. Theo's talk included an amusing description (and pictures) of some of the events that transpired during OpenBSD's hack-a-thon, which occurred the week before USENIX '02. Finally, Theo mentioned that, following complications with the IPFilter license, the OpenBSD team had done a fairly extensive license audit.

Robert Watson of the FreeBSD Project brought up various examples of FreeBSD being used in the real world. He went on to describe a wide variety of changes that will surface with the upcoming FreeBSD release 5.0. Notably, 5.0 will introduce a re-architecting of the kernel aimed at providing much more scalable support for SMP machines. 5.0 will also include an early implementation of KSE, FreeBSD's version of scheduler activations, large improvements to pccard, and significant framework bits from the TrustedBSD project.

Mike Karels from Wind River Systems presented an overview of how BSD/OS has been evolving following the Wind River Systems acquisition of BSDi's software assets. Ernie Prabhakar from Apple discussed Apple's OS X and its success on the desktop. He explained how OS X aims to bring BSD to the desktop while striving to maintain a strong and stable core, one that is heavily based on BSD technology.

The BoF finished with a discussion on licensing; specifically, some folks questioned the impact of Apple's licensing for code from their core OS (the Darwin Project) on the BSD community as a whole.

CLOSING SESSION

HOW FLIES FLY

Michael H. Dickinson, University of California, Berkeley

Summarized by JD Welch

In this lively and offbeat talk, Dickinson discussed his research into the mechanisms of flight in the fruit fly (Drosophila melanogaster) and related the autonomic behavior to that of a technological system. Insects are the most successful organisms on earth, due in no small part to their early adoption of flight. Flies travel through space on straight trajectories interrupted by saccades, jumpy turning motions somewhat analogous to the fast, jumpy movements of the human eye. Using a variety of monitoring techniques, including high-speed video in a "virtual reality flight arena," Dickinson and his colleagues have observed that flies respond to visual cues to decide when to saccade during flight. For example, more visual texture makes flies saccade earlier (i.e., further away from the arena wall), described by Dickinson as a "collision control algorithm."

Through a combination of sensory inputs, flies can make decisions about their flight path. Flies possess a visual "expansion detector" which, at a certain threshold, causes the animal to turn in a certain direction. However, expansion in front of the fly sometimes causes it to land. How does the fly decide? Using the virtual reality device to replicate various visual scenarios, Dickinson observed that the flies fixate on vertical objects that move back and forth across the field. Expansion on the left side of the animal causes it to turn right, and vice versa, while expansion directly in front of the animal triggers the legs to flare out in preparation for landing.

If the eyes detect these changes, how are the responses implemented? The flies' wings have "power muscles," controlled by mechanical resonance (as opposed to the nervous system), which drive the wings, combined with neurally activated "steering muscles," which change the configuration of the wing joints. Subtle variations in the timing of impulses correspond to changes in wing movement, controlled by sensors in the "wing-pit." A small wing-like structure, the haltere, controls equilibrium by beating constantly.

Voluntary control is accomplished by the halteres, whose signals can interfere with the autonomic control of the stroke cycle. The halteres have "steering muscles" as well, and information derived from the visual system can turn off the haltere or trick it into a "virtual" problem requiring a response.

Dickinson has also studied the aerodynamics of insect flight using a device called the "Robo Fly," an oversized mechanical insect wing suspended in a large tank of mineral oil. Interestingly, Dickinson observed that the average lift generated by the flies' wings is under its body weight; the flies use three mechanisms to overcome this, including rotating the wings (rotational lift).

Insects are extraordinarily robust creatures, and because Dickinson analyzed the problem in a systems-oriented way, these observations and analysis are immediately applicable to technology; the response system can be used as an efficient search algorithm for control systems in autonomous vehicles, for example.

THE AFS WORKSHOP

Summarized by Garry Zacheiss, MIT

Love Hornquist-Astrand of the Arla development team presented the Arla status report. Arla 0.35.8 has been released. Scheduled for release soon are improved support for Tru64 UNIX, MacOS X, and FreeBSD, improved volume handling, and implementation of more of the vos and pts subcommands. It was stressed that MacOS X is considered an important platform, and that a GUI configuration manager for the cache manager, a GUI ACL manager for the Finder, and a graphical login that obtains Kerberos tickets and AFS tokens at login time were all under development.

Future goals planned for the underlying AFS protocols include GSSAPI/SPNEGO support for Rx, performance improvements to Rx, an enhanced disconnected mode, and IPv6 support for Rx; an experimental patch is already available for the latter. Future Arla-specific goals include improved performance, partial file reading, and increased stability for several platforms. Work is also in progress for the RXGSS protocol extensions (integrating GSSAPI into Rx); a partial implementation exists, and work continues as developers find time.

Derrick Brashear of the OpenAFS development team presented the OpenAFS status report. OpenAFS was released immediately prior to the conference; OpenAFS 1.2.5 fixed a remotely exploitable denial-of-service attack in several OpenAFS platforms, most notably IRIX and AIX. Future work planned for OpenAFS includes better support for MacOS X, including working around Finder interaction issues. Better support for the BSDs is also planned: FreeBSD has a partially implemented client, while NetBSD and OpenBSD have only server programs available right now. AIX 5, Tru64 5.1A, and MacOS X 10.2 (aka Jaguar) are all planned for the future. Other planned enhancements include nested groups in the ptserver (code donated by the University of Michigan awaits integration), disconnected AFS, and further work on a native Windows client. Derrick stated that the guiding principles of OpenAFS were to maintain compatibility with IBM AFS, support new platforms, ease the administrative burden, and add new functionality.

AFS performance was discussed. The openafs.org cell consists of two file servers, one in Stockholm, Sweden, and one in Pittsburgh, PA. AFS works reasonably well over trans-Atlantic links, although OpenAFS clients don't determine which file server to talk to very effectively; Arla clients use RTTs to the server to determine the optimal file server to fetch replicated data from. Modifications to OpenAFS to support this behavior in the future are desired.

Jimmy Engelbrecht and Harald Barth of KTH discussed their AFSCrawler script, written to determine how many AFS cells and clients were in the world, what implementations and versions they were (Arla vs. IBM AFS vs. OpenAFS), and how much data was in AFS. The script unfortunately triggered a bug in IBM AFS 3.6-derived code, causing some clients to panic while handling a specific RPC. This has since been fixed in OpenAFS 1.2.5 and the most recent IBM AFS patch level, and all AFS users are strongly encouraged to upgrade. No release of Arla is vulnerable to this particular denial-of-service attack. There was an extended discussion of the usefulness of this exploration. Many sites believed this was useful information and such scanning should continue in the future, but only on an opt-in basis.

Many sites face the problem of managing Kerberos/AFS credentials for batch-scheduled jobs. Specifically, most batch-processing software needs to be modified to forward tickets as part of the batch submission process, renew tickets and tokens while the job is in the queue and for the lifetime of the job, and properly destroy credentials when the job completes. Ken Hornstein of NRL was able to pay a commercial vendor to support Kerberos 4/5 credential management in their product, although they did not implement AFS token management. MIT has implemented some of the desired functionality in OpenPBS and might be able to make it available to other interested sites.
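The first and last steps have a familiar shape; a minimal wrapper might look like the sketch below (the keytab path and principal are placeholders, MIT Kerberos 5 and an AFS client with aklog are assumed, and renewing credentials while the job waits in the queue is the part that still needs the batch system's cooperation):

#!/bin/sh
# Obtain Kerberos tickets and an AFS token, run the job, then clean up.
kinit -k -t /path/to/user.keytab user@EXAMPLE.ORG
aklog
/path/to/actual-batch-job
unlog
kdestroy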

Tools to simplify AFS administration were discussed, including:

AFS Balancer: A tool written by CMU to automate the process of balancing disk usage across all servers in a cell. Available from ftp://ftp.andrew.cmu.edu/pub/AFS-Tools/balance-1.1b.tar.gz.

Themis: Themis is KTH's enhanced version of the AFS tool "package" for updating files on local disk from a central AFS image. KTH's enhancements include allowing the deletion of files, simplifying the process of adding a file, and allowing the merging of multiple rule sets for determining which files are updated. Themis is available from the Arla CVS repository.

Stanford was presented as an example of a large AFS site. Stanford's AFS usage consists of approximately 14TB of data in AFS, in the form of approximately 100,000 volumes; 33TB of storage is available in their primary cell, ir.stanford.edu. Their file servers consist entirely of Solaris machines running a combination of Transarc 3.6 patch-level 3 and OpenAFS 1.2.x, while their database servers run OpenAFS 1.2.2. Their cell consists of 25 file servers using a combination of EMC and Sun StorEdge hardware. Stanford continues to use the kaserver for their authentication infrastructure, with future plans to migrate entirely to an MIT Kerberos 5 KDC.

Stanford has approximately 3,400 clients on their campus, not including SLAC (Stanford Linear Accelerator); approximately 2,100 AFS clients from outside Stanford contact their cell every month. Their supported clients are almost entirely IBM AFS 3.6 clients, although they plan to release OpenAFS clients soon. Stanford currently supports only UNIX clients. There is some on-campus presence of Windows clients, but Stanford has never publicly released or supported it. They do intend to release and support the MacOS X client in the near future.

All Stanford students, faculty, and staff are assigned AFS home directories, with a default quota of 50MB, for a total of approximately 550GB of user home directories. Other significant uses of AFS storage are data volumes for workstation software (400GB) and volumes for course Web sites and assignments (100GB).

AFS usage at Intel was also presented. Intel has been an AFS site since 1994. They had bad experiences with the IBM 3.5 Linux client; their experience with OpenAFS on Linux 2.4.x kernels has been much better. They use and are satisfied with the OpenAFS IA64 Linux port. Intel has hundreds of OpenAFS 1.2.3 and 1.2.4 clients in many production cells accessing data stored on IBM AFS file servers, and they have not encountered any interoperability issues. Intel has some concerns about OpenAFS: they would like to purchase commercial support for OpenAFS and to see OpenAFS support for HP-UX on both PA-RISC and Itanium hardware. HP-UX support is currently unavailable due to a specific HP-UX header file being unavailable from HP; this may be available soon. Intel has not yet committed to migrating their file servers to OpenAFS and is unsure if they will do so without commercial support.

Backups are a traditional topic of discussion at AFS workshops, and this time was no exception. Many users complain that the traditional AFS backup tools ("backup" and "butc") are complex and difficult to automate, requiring many home-grown scripts and much user intervention for error recovery. An additional complaint was that the traditional AFS tools do not support file- or directory-level backups and restores; data must be backed up and restored at the volume level.

Mitch Collinsworth of Cornell presented work to make AMANDA, the free backup software from the University of Maryland, suitable for AFS backups. Using AMANDA for AFS backups allows one to share AFS backup tapes with non-AFS backups, easily run multiple backups in parallel, automate error recovery, and provide a robust degraded mode that prevents tape errors from stopping backups altogether. Their implementation allows for full volume restores as well as individual directory and file restores. They have finished coding this work and are in the process of testing and documenting it.

Peter Honeyman of CITI at the University of Michigan spoke about work he has proposed to replace Rx with RPCSEC GSS in OpenAFS; this would allow AFS to use a TCP-based transport mechanism rather than the UDP-based Rx, and possibly gain better congestion control, dynamic adaptation, and fragmentation avoidance as a result. RPCSEC GSS uses the GSSAPI to authenticate Sun ONC RPC. RPCSEC GSS is transport-agnostic, provides strong security, is a developing Internet standard, and has multiple open source implementations. Backward compatibility with existing AFS servers and clients is an important goal of this project.


One interesting example of an application that currently is capable of migration and replication is Emacs. To create a new frame on DISPLAY, try "M-x make-frame-on-display <Return> DISPLAY <Return>".

However, technical challenges make replication and migration difficult in general. These include the "major headaches" of server-side fonts, the nonuniformity of X servers and screen sizes, and the need to appropriately retrofit toolkits. Authentication and authorization issues are obviously also important. The rest of the talk delved into some of the details of these interesting technical challenges.

HACKING IN THE KERNEL

Summarized by Hai Huang

AN IMPLEMENTATION OF SCHEDULER ACTIVATIONS ON THE NETBSD OPERATING SYSTEM

Nathan J. Williams, Wasabi Systems

Scheduler activation is an old idea. Basically, there are benefits and drawbacks to using solely kernel-level or user-level threading. Scheduler activation is able to combine the two layers of control to provide more concurrency in the system.

In his talk, Nathan gave a fairly detailed description of the implementation of scheduler activations in the NetBSD kernel. One important change in the implementation is to differentiate the thread context from the process context; this is done by defining a separate data structure for these threads and relocating some of the information that was embedded in the process context to these thread contexts. The stack was a particular concern because of the upcall mechanism: special handling must be done to make sure that an upcall doesn't mess up the stack, so that the preempted user-level thread can continue afterwards. Lastly, Nathan explained that signals are handled by upcalls.
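The split described above can be pictured with two illustrative structures (these are not NetBSD's actual definitions; they only show which kind of state moves where):

struct proc_ctx;                     /* forward declaration */

struct thread_ctx {                  /* one per kernel-visible thread */
    void            *kstack;         /* kernel stack used during upcalls */
    void            *trapframe;      /* saved user registers */
    int              state;          /* runnable, sleeping, ... */
    struct proc_ctx *proc;           /* back-pointer to the owning process */
};

struct proc_ctx {                    /* one per process */
    int                pid;
    void              *vmspace;      /* address space shared by all threads */
    struct thread_ctx *threads;      /* this process's threads */
    int                nthreads;
};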


AUTHORIZATION AND CHARGING IN PUBLIC WLANS USING FREEBSD AND 802.1X

Pekka Nikander, Ericsson Research NomadicLab

802.1x standards are well known in the wireless community as link-layer authentication protocols. In this talk, Pekka explained some novel ways of using the 802.1x protocols that might be of interest to people on the move. It is possible to set up a public WLAN that would support various charging schemes via virtual tokens, which people can purchase or earn and later use.

This is implemented on FreeBSD using the netgraph utility. It is basically a filter in the link layer that would differentiate traffic based on the MAC address of the client node, which is either authenticated, denied, or let through. The overhead for this service is fairly minimal.

ACPI IMPLEMENTATION ON FREEBSD

Takanori Watanabe, Kobe University

ACPI (Advanced Configuration and Power Interface) was proposed as a joint effort by Intel, Toshiba, and Microsoft to provide a standard, finer-grained method of managing power states of individual devices within a system. Such low-level power management is especially important for mobile and embedded systems that are powered by fixed-capacity batteries.

Takanori explained that the ACPI specification is composed of three parts: tables, BIOS, and registers. He was able to implement some of the functionality of ACPI in a FreeBSD kernel. The ACPI Component Architecture, implemented by Intel, provides a high-level ACPI API to the operating system; Takanori's ACPI implementation is built upon this underlying layer of APIs.

ACCESS CONTROL

Summarized by Florian Buchholz

DESIGN AND PERFORMANCE OF THE OPENBSD STATEFUL PACKET FILTER (PF)

Daniel Hartmeier, Systor AG

Daniel Hartmeier described the new stateful packet filter (pf) that replaced IPFilter in the OpenBSD 3.0 release. IPFilter could no longer be included due to licensing issues, and thus there was a need to write a new filter making use of optimized data structures.

The filter rules are implemented as a linked list which is traversed from start to end. Two actions may be taken according to the rules, "pass" or "block": a "pass" action forwards the packet, and a "block" action drops it. Where more than one rule matches a packet, the last rule wins. Rules that are marked as "final" immediately terminate any further rule evaluation. An optimization called "skip-steps" was also implemented, in which blocks of similar rules are skipped if they cannot match the current packet; these skip-steps are calculated when the rule set is loaded. Furthermore, a state table keeps track of TCP connections, and only packets that match the sequence numbers are allowed. UDP and ICMP queries and replies are considered in the state table too, where initial packets create an entry for a pseudo-connection with a low initial timeout value. The state table is implemented using a balanced binary search tree. NAT mappings are also stored in the state table, whereas application proxies reside in user space. The packet filter is also able to perform fragment reassembly and to modulate TCP sequence numbers to protect hosts behind the firewall.
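A few rules in roughly the pf syntax of that release illustrate the last-match-wins evaluation, the early exit, and state creation (interface names and addresses are placeholders, and the exact keywords may differ slightly between versions):

# Default policy: block everything; a later matching rule may override this.
block in  all
block out all

# "quick" ends evaluation immediately, so no later rule can pass this traffic.
block in quick on fxp0 from 10.0.0.0/8 to any

# Allow and track outbound TCP; replies are matched against the state
# table instead of being re-evaluated against the rule list.
pass out on fxp0 proto tcp from any to any keep state

# Allow inbound ssh, creating state; "modulate state" would additionally
# rewrite initial sequence numbers for hosts behind the firewall.
pass in on fxp0 proto tcp from any to any port 22 keep state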

Pf was compared against IPFilter as well as Linux's iptables by measuring throughput and latency with increasing traffic rates and different packet sizes. In a test with a fixed rule-set size of 100, iptables outperformed the other two filters, whose results were close to each other. In a second test, where the rule-set size was continually increased, iptables consistently had about twice the throughput of the other two (which evaluate the rule set on both the incoming and outgoing interfaces). A third test compared only pf and IPFilter, using a single rule that created state in the state table, with a fixed state-entry size; pf reached an overloaded state much later than IPFilter. The experiment was repeated with a variable state-entry size, and pf performed much better than IPFilter for a small number of states.

In general, rule-set evaluation is expensive, and benchmarks only reflect extreme cases, whereas in real life other behavior should be observed. Furthermore, the benchmarks show that stateful filtering can actually improve performance, due to cheap state-table lookups as compared to rule evaluation.

ENHANCING NFS CROSS-ADMINISTRATIVE DOMAIN ACCESS

Joseph Spadavecchia and Erez Zadok, Stony Brook University

The speaker presented modifications to an NFS server that allow improved NFS access between administrative domains. The problem lies in the fact that NFS assumes a shared UID/GID space, which makes it unsafe to export files outside the administrative domain of the server. Design goals were to leave the protocol and existing clients unchanged, to require a minimum amount of server changes, and to provide flexibility and increased performance.

To solve the problem, two techniques are utilized: "range-mapping" and "file-cloaking." Range-mapping maps IDs between client and server. The mapping is performed on a per-export basis and has to be manually set up in an export file; the mappings can be 1-1, N-N, or N-1. In file-cloaking, the server restricts file access based on UID/GID and special cloaking-mask bits, so users can only access their own file permissions. The policy on whether or not the file is visible to others is set by the cloaking mask, which is logically ANDed with the file's protection bits. File-cloaking only works, however, if the client doesn't hold cached copies of directory contents and file attributes; because of this, the clients are forced to re-read directories by incrementing the mtime value of the directory each time it is listed.
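A toy illustration of the cloaking decision (the helper and the mask value are hypothetical; the paper defines the actual semantics):

#include <sys/stat.h>

/*
 * Hypothetical sketch: a file is listed for a non-owner only if the
 * permission bits that survive the cloaking mask still grant something.
 * A mask of 0007, for example, would reveal only files that grant some
 * access to "other".
 */
static int visible_to_others(mode_t file_mode, mode_t cloak_mask)
{
    return (file_mode & cloak_mask) != 0;
}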

To test the performance of the modified server, five different NFS configurations were evaluated: an unmodified NFS server was compared against one server with the modified code included but not used, one with only range-mapping enabled, one with only file-cloaking enabled, and one version with all modifications enabled. For each setup, different file system benchmarks were run. The results show only a small overhead when the modifications are used, generally an increase of below 5 percent. Another experiment tested the performance of the system with an increasing number of mapped or cloaked entries; the results show that an increase from 10 to 1,000 entries resulted in a maximum of about a 14 percent cost in performance.

One member of the audience pointed out that if clients choose to ignore the changed mtimes from the server and thus still hold caches of the directory entries, the file-cloaking mechanism could be defeated. After a rather lengthy debate, the speaker had to concede that the model doesn't add any extra security. Another question was asked about the scalability of setting up range-mapping; the speaker referred to application-level tools that could be developed for that purpose.

The software is available at ftp://ftp.fsl.cs.sunysb.edu/pub/enf.

ENGINEERING OPEN SOURCE SOFTWARE

Summarized by Teri Lampoudi

NINGAUI: A LINUX CLUSTER FOR BUSINESS

Andrew Hume, AT&T Labs – Research; Scott Daniels, Electronic Data Systems Corp.

Ningaui is a general-purpose, highly available, resilient architecture built from commodity software and hardware. Emphatically, however, Ningaui is not a Beowulf. Hume calls the cluster design the "Swiss canton model," in which there are a number of loosely affiliated, independent nodes with data replicated among them. Jobs are assigned by bidding and leases, and cluster services done as session-based servers are sited via generic job assignment. The emphasis is on keeping the architecture end-to-end, checking all work via checksums, and logging everything. The resilience and high availability required by their goal of 8-to-5 maintenance (vs. the typical 24-7 model, where people get paged whenever the slightest thing goes wrong, regardless of the time of day) are achieved by job restartability. Finally, all computation is performed on local data, without the use of NFS or network-attached storage.

Hume's message is a hopeful one: despite the many problems encountered (things like kernel and service limits, auto-installation problems, TCP storms, and the like), the existence of source code and the paranoid practice of logging and checksumming everything has helped. The final product performs reasonably well, and it appears that the resilient techniques employed do make a difference. One drawback, however, is that the software mentioned in the paper is not yet available for download.

CPCMS: A CONFIGURATION MANAGEMENT SYSTEM BASED ON CRYPTOGRAPHIC NAMES

Jonathan S. Shapiro and John Vanderburgh, Johns Hopkins University

This paper won the FREENIX Best Paper award.

The basic notion behind the project is the fact that everyone has a pet complaint about CVS, and yet it is currently the configuration manager in most widespread use. Shapiro has unleashed an alternative. Interestingly, he did not begin but ended with the usual host of reasons why CVS is bad. The talk instead began abruptly by characterizing the job of a software configuration manager, continued by stating the namespaces which it must handle, and wrapped up with the challenges faced.

X MEETS Z: VERIFYING CORRECTNESS IN THE PRESENCE OF POSIX THREADS

Bart Massey, Portland State University; Robert T. Bauer, Rational Software Corp.

Massey delivered a humorous talk on the insight gained from applying the Z formal specification notation to system software design, rather than the more informal analysis and design process normally used.

The story is told with respect to writing XCB, which replaces the Xlib protocol layer and is supposed to be thread-friendly. But where threads are concerned, deadlock avoidance becomes a hard problem that cannot be solved in an ad hoc manner, and full model checking is also too hard. In this case, Massey resorted to Z specification to model the XCB lower layer, abstract away locking and data transfers, and locate fundamental issues. Essentially, the difficulties of searching the literature and locating information relevant to the problem at hand were overcome. As Massey put it, "the formal method saved the day."

FILE SYSTEMS

Summarized by Bosko Milekic

PLANNED EXTENSIONS TO THE LINUX EXT2/EXT3 FILESYSTEM

Theodore Y. Ts'o, IBM; Stephen Tweedie, Red Hat

The speaker presented improvements to the Linux Ext2 file system, with the goal of allowing for various expansions while striving to maintain compatibility with older code. Improvements have been facilitated by a few extra superblock fields that were added to Ext2 not long before the Linux 2.0 kernel was released. The fields allow for file system features to be added without compromising the existing setup; this is done by providing bits indicating the impact of the added features with respect to compatibility. Namely, a file system with the "incompat" bit marked is not allowed to be mounted, while a "read-only" marking would only allow the file system to be mounted read-only.
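The check reduces to a few lines at mount time; a sketch of the logic (the names here are illustrative, not the kernel's exact identifiers):

struct ext2_super {
    unsigned int feature_compat;     /* safe to ignore unknown bits  */
    unsigned int feature_incompat;   /* unknown bit: refuse to mount */
    unsigned int feature_ro_compat;  /* unknown bit: mount read-only */
};

enum { MOUNT_RW, MOUNT_RO, MOUNT_REFUSE };

static int check_features(const struct ext2_super *sb,
                          unsigned int known_incompat,
                          unsigned int known_ro_compat)
{
    if (sb->feature_incompat & ~known_incompat)
        return MOUNT_REFUSE;      /* we would misinterpret the on-disk data */
    if (sb->feature_ro_compat & ~known_ro_compat)
        return MOUNT_RO;          /* reading is safe, writing is not        */
    return MOUNT_RW;              /* compat-only features never block mount */
}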

Directory indexing replaces linear directory searches with a faster search using a fixed-depth tree and hashed keys. File system size can be dynamically increased, and the expanded inode, doubled from 128 to 256 bytes, allows for more extensions.

Other potential improvements were discussed as well, in particular pre-allocation for contiguous files, which allows for better performance in certain setups by pre-allocating contiguous blocks. Security-related modifications, extended attributes, and ACLs were also mentioned. An implementation of these features already exists but has not yet been merged into the mainline Ext2/3 code.

RECENT FILESYSTEM OPTIMISATIONS ON FREEBSD

Ian Dowse, Corvil Networks; David Malone, CNRI, Dublin Institute of Technology

David Malone presented four important file system optimizations for the FreeBSD OS: soft updates, dirpref, vmiodir, and dirhash. It turns out that certain combinations of the optimizations (beautifully illustrated in the paper) may yield performance improvements of anywhere between 2 and 10 orders of magnitude for real-world applications.

All four of the optimizations deal with file system metadata. Soft updates allow for asynchronous metadata updates. Dirpref changes the way directories are organized, attempting to place child directories closer to their parents, thereby increasing locality of reference and reducing disk-seek times. Vmiodir trades some extra memory in order to achieve better directory caching. Finally, dirhash, which was implemented by Ian Dowse, changes the way in which entries are searched in a directory: specifically, FFS uses a linear search to find an entry by name, while dirhash builds a hashtable of directory entries on the fly. For a directory of size n with a working set of m files, a search that in certain cases could have been O(n·m) has been reduced, due to dirhash, to effectively O(n + m).
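To put rough numbers on that claim (the figures are illustrative, not from the paper): with n = 20,000 directory entries and a working set of m = 20,000 lookups, a linear scan touches on the order of n · m / 2 = 200,000,000 entries, while building and then probing the hash table costs roughly n + m = 40,000 operations, a reduction of more than three orders of magnitude.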

FILESYSTEM PERFORMANCE AND SCALABILITY IN LINUX 2.4.17

Ray Bryant, SGI; Ruth Forester, IBM LTC; John Hawkes, SGI

This talk focused on performance evaluation of a number of file systems available and commonly deployed on Linux machines. Comparisons under various configurations of Ext2, Ext3, ReiserFS, XFS, and JFS were presented.

The benchmarks chosen for the data gathering were pgmeter, filemark, and the AIM Benchmark Suite VII. Pgmeter measures the rate of data transfer of reads/writes of a file under a synthetic workload. Filemark is similar to postmark in that it is an operation-intensive benchmark, although filemark is threaded and offers various other features that postmark lacks. AIM VII measures performance for various file-system-related functionalities; it offers various metrics under an imposed workload, thus stressing the performance of the file system not only under I/O load but also under significant CPU load.

Tests were run on three different setups: a small, a medium, and a large configuration. ReiserFS and Ext2 appear at the top of the pile for smaller and medium setups. Notably, XFS and JFS perform worse for smaller system configurations than the others, although XFS clearly appears to generate better numbers than JFS. It should be noted that XFS seems to scale well under a higher load; this was most evident in the large-system results, where XFS appears to offer the best overall results.

THINGS TO THINK ABOUT

Summarized by Bosko Milekic

SPEEDING UP KERNEL SCHEDULER BY REDUCING CACHE MISSES

Shuji Yamamura, Akira Hirai, Mitsuru Sato, Masao Yamamoto, Akira Naruse, and Kouichi Kumon, Fujitsu Labs

This was an interesting talk pertaining tothe effects of cache coloring for taskstructures in the Linux kernel scheduler(Linux kernel 24x) The speaker firstpresented some interesting benchmarknumbers for the Linux scheduler show-ing that as the number of processes onthe task queue was increased the perfor-mance decreased The authors usedsome really nifty hardware to measurethe number of bus transactionsthroughout their tests and were thusable to reasonably quantify the impactthat cache misses had in the Linuxscheduler

Their experiments led them to imple-ment a cache coloring scheme for taskstructures which were previouslyaligned on 8KB boundaries and there-fore were being eventually mapped tothe same cache lines This unfortunateplacement of task structures in memoryinduced a significant number of cachemisses as the number of tasks grew inthe scheduler

The implemented solution consisted ofaligning task structures to more evenlydistribute cached entries across the L2cache The result was inevitably fewercache misses in the scheduler Some neg-ative effects were observed in certain sit-uations These were primarily due tomore cache slots being used by taskstructure data in the scheduler thusforcing data previously cached there tobe pushed out

81October 2002 login

C

ON

FERE

NC

ERE

PORT

SWORK-IN-PROGRESS REPORTS

Summarized by Brennan Reynolds

RESOURCE VIRTUALIZATION TECHNIQUES FOR

WIDE-AREA OVERLAY NETWORKS

Kartik Gopalan University Stony Brook

This work addressed the issue of provi-sioning a maximum number of virtualoverlay networks (VON) with diversequality of service (QoS) requirementson a single physical network Each of theVONs is logically isolated from others toensure the QoS requirements Gopalanmentioned several critical researchissues with this problem that are cur-rently being investigated Dealing withhow to provision the network at variouslevels (link route or path) and thenenforce the provisioning at run-time isone of the toughest challenges Cur-rently Gopalan has developed severalalgorithms to handle admission controlend-to-end QoS route selection sched-uling and fault tolerance in the net-work

For more information visithttpwwwecslcssunysbedu

VISUALIZING SOFTWARE INSTABILITY

Jennifer Bevan University of CaliforniaSanta Cruz

Detection of instability in software hastypically been an afterthought Thepoint in the development cycle when thesoftware is reviewed for instability isusually after it is difficult and costly togo back and perform major modifica-tions to the code base Bevan has devel-oped a technique to allow thevisualization of unstable regions of codethat can be used much earlier in thedevelopment cycle Her technique cre-ates a time series of dependent graphsthat include clusters and lines calledfault lines From the graphs a developeris able to easily determine where theunstable sections of code are and proac-tively restructure them She is currentlyworking on a prototype implementa-tion

For more information visit httpwwwcseucscecu~jbevanevo_viz

RELIABLE AND SCALABLE PEER-TO-PEER WEB

DOCUMENT SHARING

Li Xiao William and Mary College

The idea presented by Xiao would allowend users to share the content of theWeb browser caches with neighboringInternet users The rationale for this isthat todayrsquos end users are increasinglyconnected to the Internet over high-speed links and the browser caches arebecoming large storage systems There-fore if an individual accesses a pagewhich does not exist in their local cacheXiao is suggesting that they first queryother end users for the content beforetrying to access the machine hosting theoriginal This strategy does have someserious problems associated with it thatstill need to be addressed includingensuring the integrity of the content andprotecting the identity and privacy ofthe end users

SEREL FAST BOOTING FOR UNIX

Leni Mayo Fast Boot Software

Serel is a tool that generates a visual rep-resentation of a UNIX machinersquos boot-up sequence It can be used to identifythe critical path and can show if a par-ticular service or process blocks for anextended period of time This informa-tion could be used to determine wherethe largest performance gains could berealized by tuning the order of executionat boot-up Serel creates a dependencygraph expressed in XML during boot-up This graph is then used to create thevisual representation Currently the toolonly works on POSIX-compliant sys-tems but Mayo stated that he would beporting it to other platforms Otherextensions that were mentionedincluded having the metadata includethe use of shared libraries and monitor-ing the suspendresume sequence ofportable machines

For more information visithttpwwwfastbootorg

USENIX 2002

BERKELEY DB XML

John Merrells Sleepycat Software

Merrells gave a quick introduction andoverview of the new XML library forBerkeley DB The library specializes instorage and retrieval of XML contentthrough a tool called XPath The libraryallows for multiple containers per docu-ment and stores everything natively asXML The user is also given a wide rangeof elements to create indices withincluding edges elements text stringsor presence The XPath tool consists of aquery parser generator optimizer andexecution engine To conclude his pre-sentation Merrell gave a live demonstra-tion of the software

For more information visithttpwwwsleepycatcomxml

CLUSTER-ON-DEMAND (COD)

Justin Moore Duke University

Modern clusters are growing at a rapidrate Many have pushed beyond the 5000-machine mark and deployingthem results in large expenses as well asmanagement and provisioning issuesFurthermore if the cluster is ldquorentedrdquoout to various users it is very time-con-suming to configure it to a userrsquos specsregardless of how long they need to useit The COD work presented createsdynamic virtual clusters within a givenphysical cluster The goal was to have aprovisioning tool that would automati-cally select a chunk of available nodesand install the operating system andmiddleware specified by the customer ina short period of time This would allowa greater use of resources since multiplevirtual clusters can exist at once Byusing a virtual cluster the size can bechanged dynamically Moore stated thatthey have created a working prototypeand are currently testing and bench-marking its performance

For more information visithttpwwwcsdukeedu~justincod

82 Vol 27 No 5 login

CATACOMB

Elias Sinderson University of CaliforniaSanta Cruz

Catacomb is a project to develop a data-base-backed DASL module for theApache Web server It was designed as areplacement for the WebDAV The initialrelease of the module only contains sup-port for a MySQL database but could beextended to others Sinderson brieflytouched on the performance of hermodule It was comparable to themod_dav Apache module for all querytypes but search The presentation wasconcluded with remarks about addingsupport for the lock method and includ-ing ACL specifications in the future

For more information visithttpoceancseucsceducatacomb

SELF-ORGANIZING STORAGE

Dan Ellard Harvard University

Ellardrsquos presentation introduced a stor-age system that tuned itself based on theworkload of the system without requir-ing the intervention of the user Theintelligence was implemented as a vir-tual self-organizing disk that residesbelow any file system The virtual diskobserves the access patterns exhibited bythe system and then attempts to predictwhat information will be accessed nextAn experiment was done using an NFStrace at a large ISP on one of their emailservers Ellardrsquos self-organizing storagesystem worked well which he attributesto the fact that most of the files beingrequested were large email boxes Areasof future work include exploration ofthe length and detail of the data collec-tion stage as well as the CPU impact ofrunning the virtual disk layer

VISUAL DEBUGGING

John Costigan Virginia Tech

Costigan feels that the state of currentdebugging facilities in the UNIX worldis not as good as it should be He pro-poses the addition of several elements toprograms like ddd and gdb The firstaddition is including separate heap and

stack components Another is to displayonly certain fields of a data structureand have the ability to zoom in and outif needed Finally the debugger shouldprovide the ability to visually presentcomplex data structures includinglinked lists trees etc He said that a betaversion of a debugger with these abilitiesis currently available

For more information visit httpinfoviscsvtedudatastruct

IMPROVING APPLICATION PERFORMANCE

THROUGH SYSTEM-CALL COMPOSITION

Amit Purohit, Stony Brook University

Web servers perform a huge number of context switches and internal data copying during normal operation. These two elements can drastically limit the performance of an application, regardless of the hardware platform it is run on. The Compound System Call (CoSy) framework is an attempt to reduce the performance penalty for context switches and data copies. It includes a user-level library and several kernel facilities that can be used via system calls. The library provides the programmer with a complete set of memory-management functions. Performance tests using the CoSy framework showed a large saving for context switches and data copies. The only area where the savings between conventional libraries and CoSy were negligible was for very large file copies.

For more information, visit http://www.cs.sunysb.edu/~purohit.
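For context on where the cost comes from, the conventional copy pattern below crosses the user/kernel boundary twice per block (one read(), one write()) and copies each block through a user buffer; this is the overhead a compound call aims to collapse. The code is ordinary POSIX C, not the CoSy API itself.

    /* Conventional copy loop: two system calls and two data copies per block. */
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    int copy_file(const char *src, const char *dst)
    {
        char buf[8192];
        ssize_t n;
        int in = open(src, O_RDONLY);
        if (in < 0) {
            perror("open src");
            return -1;
        }
        int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (out < 0) {
            perror("open dst");
            close(in);
            return -1;
        }
        /* Each pass costs two kernel crossings (read, write) and two data
         * copies through the user buffer; a compound system call could run
         * the whole loop inside the kernel in a single crossing. */
        while ((n = read(in, buf, sizeof buf)) > 0) {
            if (write(out, buf, n) != n) {
                perror("write");
                n = -1;
                break;
            }
        }
        close(in);
        close(out);
        return n < 0 ? -1 : 0;
    }

    int main(int argc, char **argv)
    {
        return (argc == 3 && copy_file(argv[1], argv[2]) == 0) ? 0 : 1;
    }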

ELASTIC QUOTAS

John Oscar, Columbia University

Oscar began by stating that most resources are flexible, but that to date disks have not been. While approximately 80 percent of files are short-lived, disk quotas are hard limits imposed by administrators. The idea behind elastic quotas is to have a non-permanent storage area that each user can temporarily use. Oscar suggested the creation of an ehome directory structure to be used in deploying elastic quotas. Currently he has implemented elastic quotas as a stackable file system.

There were also several trends apparent in the data. Many users would ssh to their remote host (good) but then launch an email reader on the local machine, which would connect via POP to the same host and send the password in clear text (bad). This situation can easily be remedied by tunneling protocols like POP through SSH, but it appeared that many people were not aware this could be done. While most of his comments on the use of the wireless network were negative, the list of passwords he had collected showed that people were indeed using strong passwords. His recommendations were to educate and encourage people to use protocols like IPSec, SSH, and SSL when conducting work over a wireless network, because you never know who else is listening.

SYSTRACE

Niels Provos, University of Michigan

How can one be sure the applications one is using actually do exactly what their developers said they do? Short answer: you can't. People today are using a large number of complex applications, which means it is impossible to check each application thoroughly for security vulnerabilities. There are tools out there that can help, though. Provos has developed a tool called systrace that allows a user to generate a policy of acceptable system calls a particular application can make. If the application attempts to make a call that is not defined in the policy, the user is notified and allowed to choose an action. Systrace includes an auto-policy generation mechanism, which uses a training phase to record the actions of all programs the user executes. If the user chooses not to be bothered by applications breaking policy, systrace allows default enforcement actions to be set. Currently systrace is implemented for FreeBSD and NetBSD, with a Linux version coming out shortly.


For more information, visit http://www.citi.umich.edu/u/provos/systrace.
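The policy model described above can be pictured as a per-application table mapping system calls to actions, with unknown calls falling through to a default such as asking the user. The sketch below illustrates only that lookup idea; the call names, actions, and table format are invented and are not systrace's actual policy syntax or enforcement code.

    /* Illustrative policy lookup: each syscall an application makes is
     * checked against a per-application table; calls not covered by the
     * policy fall through to a default action (here, ask the user). */
    #include <stdio.h>
    #include <string.h>

    enum action { ALLOW, DENY, ASK };

    struct rule { const char *syscall; enum action act; };

    static const struct rule policy[] = {    /* hypothetical policy */
        { "open",    ALLOW },
        { "read",    ALLOW },
        { "write",   ALLOW },
        { "connect", DENY  },
    };

    enum action check(const char *syscall)
    {
        for (unsigned i = 0; i < sizeof policy / sizeof policy[0]; i++)
            if (strcmp(policy[i].syscall, syscall) == 0)
                return policy[i].act;
        return ASK;   /* not in the policy: notify the user */
    }

    int main(void)
    {
        static const char *action_name[] = { "allow", "deny", "ask" };
        const char *calls[] = { "open", "connect", "execve" };
        for (int i = 0; i < 3; i++)
            printf("%s -> %s\n", calls[i], action_name[check(calls[i])]);
        return 0;
    }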

VERIFIABLE SECRET REDISTRIBUTION FOR SURVIVABLE STORAGE SYSTEMS

Ted Wong, Carnegie Mellon University

Wong presented a protocol that can be used to re-create a file distributed over a set of servers, even if one of the servers is damaged. In his scheme, the user must choose how many shares the file is split into and the number of servers it will be stored across. The goal of this work is to provide persistent, secure storage of information even if it comes under attack. Wong stated that his design of the protocol was complete and he is currently building a prototype implementation of it.

For more information, visit http://www.cs.cmu.edu/~tmwong/research.
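Wong's contribution is the verifiable redistribution of shares among changing sets of servers; as background for the "split into shares" step mentioned above, a toy Shamir-style (k, n) split over a small prime field is sketched below. The prime, the counts, and the secret are arbitrary illustrative choices, rand() is not a cryptographic source, and this is not Wong's protocol itself.

    /* Toy (k,n) threshold split: any k of the n shares reconstruct the
     * secret; fewer reveal nothing.  Small prime field, illustration only. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    #define P 2147483647ULL   /* prime modulus (2^31 - 1) */

    int main(void)
    {
        uint64_t secret = 123456789;   /* must be < P */
        int n = 5, k = 3;              /* 5 shares, threshold 3 */
        uint64_t coeff[3];

        coeff[0] = secret;             /* constant term is the secret */
        for (int j = 1; j < k; j++)    /* random polynomial coefficients */
            coeff[j] = (((uint64_t)rand() << 31) ^ (uint64_t)rand()) % P;

        for (int x = 1; x <= n; x++) { /* share x is the point (x, f(x)) */
            uint64_t y = 0;
            for (int j = k - 1; j >= 0; j--)
                y = (y * (uint64_t)x + coeff[j]) % P;   /* Horner, mod P */
            printf("share %d: (%d, %llu)\n", x, x, (unsigned long long)y);
        }
        return 0;
    }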

THE GURU IS IN SESSIONS

LINUX ON LAPTOP/PDA

Bdale Garbee, HP Linux Systems Operation

Summarized by Brennan Reynolds

This was a roundtable-like discussion with Garbee acting as moderator. He is currently in charge of the Debian distribution of Linux and is involved in porting Linux to platforms other than i386. He has successfully helped port the kernel to the Alpha, SPARC, and ARM, and is currently working on a version to run on VAX machines.

A discussion was held on which file system is best for battery life. The ReiserFS and Ext3 file systems were discussed; people commented that when using Ext3 on their laptops the disk would never spin down and thus was always consuming a large amount of power. One suggestion was to use a RAM-based file system for any volume that requires a large number of writes, and to only have the file system written to disk at infrequent intervals or when the machine is shut down or suspended.

A question was directed to Garbee about his knowledge of how the HP-Compaq merger would affect Linux. Garbee thought that the merger was good for Linux, both within the company and for the open source community as a whole. He stated that the new company would be the number one shipper of systems with Linux as the base operating system. He also fielded a question about the support of older HP servers and their ability to run Linux, saying that indeed Linux has been ported to them and he personally had several machines in his basement running it.

A question which sparked a large number of responses concerned problems people had with running Linux on mobile platforms. Most people actually did not have many problems at all. There were a few cases of a particular machine model not working, but there were only two widespread issues: docking stations and the new ACPI power management scheme. Docking stations still appear to be somewhat of a headache, but most people had developed ad hoc solutions for getting them to work. The ACPI power management scheme developed by Intel does not appear to have a quick solution. Ted Ts'o, one of the head kernel developers, stated that there are fundamental architectural problems with the 2.4 series kernel that do not easily allow ACPI to be added. However, he also stated that ACPI is supported in the newest development kernel, 2.5, and will be included in 2.6.

The final topic was the possibility of purchasing a laptop without paying any charge/tax for Microsoft Windows. The consensus was that currently this is not possible. Even if the machine comes with Linux or without any operating system, the vendors have contracts in place which require them to pay Microsoft for each machine they ship if that machine is capable of running a Microsoft operating system. The only way this will change is if the demand for other operating systems increases to the point that vendors renegotiate their contracts with Microsoft, and no one saw this happening in the near future.

LARGE CLUSTERS

Andrew Hume, AT&T Labs – Research

Summarized by Teri Lampoudi

This was a session on clusters, big data, and resilient computing. Given that the audience was primarily interested in clusters, and not necessarily big data or beating nodes to a pulp, the conversation revolved mainly around what it takes to make a heterogeneous cluster resilient. Hume also presented a FREENIX paper on his current cluster project, which explained in more detail much of what was abstractly claimed in the guru session.

Hume's use of the word "cluster" referred not to what I had assumed would be a Beowulf-type system, but to a loosely coupled farm of machines in which concurrency was much less of an issue than it would be in a Beowulf. In fact, the architecture Hume described was designed to deal with large amounts of independent transaction processing, essentially the process of billing calls, which requires no interprocess message passing or anything of the sort a parallel scientific application might.

Where does the large data come in? Since the transactions in question consist of large numbers of flat files, a mechanism for getting the files onto nodes and off-loading the results is necessary. In this particular cluster, named Ningaui, this task is handled by the "replication manager," which generated a large amount of interest in the audience. All management functions in the cluster, as well as job allocation, constitute services that are to be bid for and leased out to nodes. Furthermore, failures are handled by simply repeating the failed task until it is completed properly, if that is possible; if it is not, detailed logs are examined to discover what went wrong. Logging and testing for the successful completion of each task are what make this model resilient. Every management operation is logged at some appropriate level of detail, generating 120MB/day, and every single file transfer is md5 checksummed. This was characterized by Hume as a "patient" style of management where no one controls the cluster: scheduling behavior is emergent rather than stringently planned, and nodes are free to drop in and out of the cluster. A downside to this model is an increase in latency, offset by the fact that the cluster continues to function unattended outside of 8-to-5 support hours.

In response to questions about the projected scalability of such a scheme, Hume said the system would presumably scale from the present 60 nodes to about 100 without modifications, but that scaling higher would be a matter for further design. The guru session ended on a somewhat unrelated note regarding the problems of getting data off mainframes and onto Linux machines: issues of fixed- vs. variable-length encoding of data blocks resulting in useful information being stripped by FTP, and problems in converting COBOL copybooks to something useful on a C-based architecture. To reiterate Hume's claim, there are homemade solutions for some of these problems, and the reader should feel free to contact Mr. Hume for them.

WRITING PORTABLE APPLICATIONS

Nick Stoughton, MSB Consultants

Summarized by Josh Lothian

Nick Stoughton addressed a concern that developers are facing more and more: writing portable applications. He proposed that there is no such thing as a "portable application," only those that have been ported. Developers are still in search of the Holy Grail of applications programming: a write-once, compile-anywhere program. Languages such as Java are leading up to this, but virtual machines still have to be ported, paths will still be different, etc. Web-based applications are another area that could be promising for portable applications.

Nick went on to talk about the future of portability. The recently released POSIX 2001 standards are expected to help the situation. The 2001 revision expands the original POSIX spec into seven volumes. Even though POSIX 2001 is more specific than the previous release, Nick pointed out that even this standard allows for small differences where vendors may define their own behaviors. This can only hurt developers in the long run. Along these lines, it was pointed out that there may be room for automated tools that have the capability to check application code and determine the level of portability and standards compliance of the source.
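One concrete way developers cope with the vendor-defined gaps mentioned above is to probe the standard's feature-test macros at compile time and fall back when an interface is optional. The snippet below is a generic example of that style under POSIX 2001, not something presented in the talk.

    /* Guarding optional POSIX functionality behind feature-test macros. */
    #include <unistd.h>
    #include <stdio.h>

    int main(void)
    {
    #ifdef _POSIX_VERSION
        printf("POSIX version reported: %ld\n", (long)_POSIX_VERSION);
    #else
        printf("Not a POSIX system; using fallback code paths\n");
    #endif

    #if defined(_POSIX_MONOTONIC_CLOCK) && _POSIX_MONOTONIC_CLOCK > 0
        printf("Monotonic clock available\n");   /* optional feature */
    #else
        printf("Monotonic clock not guaranteed; fall back to time()\n");
    #endif
        return 0;
    }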

NETWORK MANAGEMENT SYSTEM PERFORMANCE TUNING

Jeff Allen, Tellme Networks, Inc.

Summarized by Matt Selsky

Jeff Allen, author of Cricket, discussed network tuning. The basic idea of tuning is: measure, twiddle, and measure again. Cricket can be used for large installations, but it's overkill for smaller installations, for which MRTG is better suited. You don't need Cricket to do measurement. Doing something like a shell script that dumps data to Excel is good, but you need to get some sort of measurement. Cricket was not designed for billing; it was meant to help answer questions.

Useful tools and techniques covered included looking for the difference between machines in order to identify the causes for the differences; measuring, but also thinking about what you're measuring and why; and remembering to use strace and tcpdump to identify problems. Problems repeat themselves; performance tuning requires an intricate understanding of all the layers involved to solve the problem and simplify the automation. (Having a computer science background helps but is not essential.) If you can reproduce the problem, you then should investigate each piece to find the bottleneck. If you can't reproduce the problem, then measure. You need to understand all the system interactions. Measurement can help determine whether things are actually slow or if the user is imagining it.

Troubleshooting begins with a hunch, but scientific processes are essential. You should be able to determine the event stream, or when each event occurs; having observation logs can help. Some hints provided include checking cron for periodic anomalies, slowing down /bin/rm to avoid I/O overload, and looking for unusual kernel CPU usage.

Also, try to use lightweight monitoring to reduce overhead. You don't need to monitor every resource on every system, but you should monitor those resources on those systems that are essential. Don't check things too often, since you can introduce overhead that way. Lots of information can be gathered from the /proc file system interface to the kernel.
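As a small example of the lightweight /proc-based measurement mentioned above (Linux-specific, and just one of many counters worth sampling), the load averages can be read like any ordinary file:

    /* Read the 1/5/15-minute load averages from /proc/loadavg (Linux). */
    #include <stdio.h>

    int main(void)
    {
        double one, five, fifteen;
        FILE *f = fopen("/proc/loadavg", "r");

        if (f == NULL) {
            perror("/proc/loadavg");
            return 1;
        }
        if (fscanf(f, "%lf %lf %lf", &one, &five, &fifteen) == 3)
            printf("load: %.2f %.2f %.2f\n", one, five, fifteen);
        fclose(f);
        return 0;
    }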

BSD BOF

Host: Kirk McKusick, Author and Consultant

Summarized by Bosko Milekic

The BSD Birds of a Feather session at USENIX started off with five quick and informative talks and ended with an exciting discussion of licensing issues.

Christos Zoulas, for the NetBSD Project, went over the primary goals of the NetBSD Project (namely portability, a clean architecture, and security) and then proceeded to briefly discuss some of the challenges that the project has encountered. The successes and areas for improvement of 1.6 were then examined. NetBSD has recently seen various improvements in its cross-build system, packaging system, and third-party tool support. For what concerns the kernel, a zero-copy TCP/UDP implementation has been integrated, and new pmap code for arm32 has been introduced, as have some other rather major changes to various subsystems. More information is available in Christos's slides, which are now available at http://www.netbsd.org/gallery/events/usenix2002.

Theo de Raadt of the OpenBSD Project presented a rundown of various new things that the OpenBSD and OpenSSH teams have been looking at. Theo's talk included an amusing description (and pictures) of some of the events that transpired during OpenBSD's hack-a-thon, which occurred the week before USENIX '02. Finally, Theo mentioned that, following complications with the IPFilter license, the OpenBSD team had done a fairly extensive license audit.

Robert Watson of the FreeBSD Project brought up various examples of FreeBSD being used in the real world. He went on to describe a wide variety of changes that will surface with the upcoming FreeBSD release 5.0. Notably, 5.0 will introduce a re-architecting of the kernel aimed at providing much more scalable support for SMP machines. 5.0 will also include an early implementation of KSE, FreeBSD's version of scheduler activations, large improvements to pccard, and significant framework bits from the TrustedBSD project.

Mike Karels from Wind River Systems presented an overview of how BSD/OS has been evolving following the Wind River Systems acquisition of BSDi's software assets. Ernie Prabhakar from Apple discussed Apple's OS X and its success on the desktop. He explained how OS X aims to bring BSD to the desktop while striving to maintain a strong and stable core, one that is heavily based on BSD technology.

The BoF finished with a discussion on licensing; specifically, some folks questioned the impact of Apple's licensing for code from their core OS (the Darwin Project) on the BSD community as a whole.

CLOSING SESSION

HOW FLIES FLY

Michael H. Dickinson, University of California, Berkeley

Summarized by J.D. Welch

In this lively and offbeat talk, Dickinson discussed his research into the mechanisms of flight in the fruit fly (Drosophila melanogaster) and related the autonomic behavior to that of a technological system. Insects are the most successful organisms on earth, due in no small part to their early adoption of flight. Flies travel through space on straight trajectories interrupted by saccades, jumpy turning motions somewhat analogous to the fast, jumpy movements of the human eye. Using a variety of monitoring techniques, including high-speed video in a "virtual reality flight arena," Dickinson and his colleagues have observed that flies respond to visual cues to decide when to saccade during flight. For example, more visual texture makes flies saccade earlier (i.e., further away from the arena wall), described by Dickinson as a "collision control algorithm."

Through a combination of sensory inputs, flies can make decisions about their flight path. Flies possess a visual "expansion detector," which, at a certain threshold, causes the animal to turn a certain direction. However, expansion in front of the fly sometimes causes it to land. How does the fly decide? Using the virtual reality device to replicate various visual scenarios, Dickinson observed that the flies fixate on vertical objects that move back and forth across the field. Expansion on the left side of the animal causes it to turn right, and vice versa, while expansion directly in front of the animal triggers the legs to flare out in preparation for landing.

If the eyes detect these changes, how are the responses implemented? The flies' wings have "power muscles," controlled by mechanical resonance (as opposed to the nervous system), which drive the wings, combined with neurally activated "steering muscles," which change the configuration of the wing joints. Subtle variations in the timing of impulses correspond to changes in wing movement, controlled by sensors in the "wing-pit." A small wing-like structure, the haltere, controls equilibrium by beating constantly.

Voluntary control is accomplished by the halteres, whose signals can interfere with the autonomic control of the stroke cycle. The halteres have "steering muscles" as well, and information derived from the visual system can turn off the haltere or trick it into a "virtual" problem requiring a response.

Dickinson has also studied the aerodynamics of insect flight using a device called the "Robo Fly," an oversized mechanical insect wing suspended in a large tank of mineral oil. Interestingly, Dickinson observed that the average lift generated by the flies' wings is under its body weight; the flies use three mechanisms to overcome this, including rotating the wings (rotational lift).

Insects are extraordinarily robust creatures, and because Dickinson analyzed the problem in a systems-oriented way, these observations and analysis are immediately applicable to technology; the response system can be used as an efficient search algorithm for control systems in autonomous vehicles, for example.

THE AFS WORKSHOP

Summarized by Garry Zacheiss, MIT

Love Hornquist-Astrand of the Arla development team presented the Arla status report. Arla 0.35.8 has been released. Scheduled for release soon are improved support for Tru64 UNIX, MacOS X, and FreeBSD, improved volume handling, and implementation of more of the vos and pts subcommands. It was stressed that MacOS X is considered an important platform, and that a GUI configuration manager for the cache manager, a GUI ACL manager for the Finder, and a graphical login that obtains Kerberos tickets and AFS tokens at login time were all under development.

Future goals planned for the underlying AFS protocols include GSSAPI/SPNEGO support for Rx, performance improvements to Rx, an enhanced disconnected mode, and IPv6 support for Rx; an experimental patch is already available for the latter. Future Arla-specific goals include improved performance, partial file reading, and increased stability on several platforms. Work is also in progress on the RXGSS protocol extensions (integrating GSSAPI into Rx). A partial implementation exists, and work continues as developers find time.

Derrick Brashear of the OpenAFS development team presented the OpenAFS status report. OpenAFS 1.2.5 was released immediately prior to the conference; it fixed a remotely exploitable denial-of-service attack in several OpenAFS platforms, most notably IRIX and AIX. Future work planned for OpenAFS includes better support for MacOS X, including working around Finder interaction issues. Better support for the BSDs is also planned: FreeBSD has a partially implemented client, while NetBSD and OpenBSD have only server programs available right now. AIX 5, Tru64 5.1A, and MacOS X 10.2 (a.k.a. Jaguar) are all planned for the future. Other planned enhancements include nested groups in the ptserver (code donated by the University of Michigan awaits integration), disconnected AFS, and further work on a native Windows client. Derrick stated that the guiding principles of OpenAFS were to maintain compatibility with IBM AFS, support new platforms, ease the administrative burden, and add new functionality.

AFS performance was discussed. The openafs.org cell consists of two file servers, one in Stockholm, Sweden, and one in Pittsburgh, PA. AFS works reasonably well over trans-Atlantic links, although OpenAFS clients don't determine which file server to talk to very effectively. Arla clients use RTTs to the server to determine the optimal file server to fetch replicated data from. Modifications to OpenAFS to support this behavior in the future are desired.

Jimmy Engelbrecht and Harald Barth of KTH discussed their AFSCrawler script, written to determine how many AFS cells and clients were in the world, what implementations and versions they were running (Arla vs. IBM AFS vs. OpenAFS), and how much data was in AFS. The script unfortunately triggered a bug in IBM AFS 3.6-derived code, causing some clients to panic while handling a specific RPC. This has since been fixed in OpenAFS 1.2.5 and the most recent IBM AFS patch level, and all AFS users are strongly encouraged to upgrade. No release of Arla is vulnerable to this particular denial-of-service attack. There was an extended discussion of the usefulness of this exploration. Many sites believed this was useful information and such scanning should continue in the future, but only on an opt-in basis.

Many sites face the problem of managing Kerberos/AFS credentials for batch-scheduled jobs. Specifically, most batch processing software needs to be modified to forward tickets as part of the batch submission process, renew tickets and tokens while the job is in the queue and for the lifetime of the job, and properly destroy credentials when the job completes. Ken Hornstein of NRL was able to pay a commercial vendor to support Kerberos 4/5 credential management in their product, although they did not implement AFS token management. MIT has implemented some of the desired functionality in OpenPBS and might be able to make it available to other interested sites.

Tools to simplify AFS administration were discussed, including:

AFS Balancer: A tool written by CMU to automate the process of balancing disk usage across all servers in a cell. Available from ftp://ftp.andrew.cmu.edu/pub/AFS-Tools/balance-1.1b.tar.gz.

Themis: Themis is KTH's enhanced version of the AFS tool "package" for updating files on local disk from a central AFS image. KTH's enhancements include allowing the deletion of files, simplifying the process of adding a file, and allowing the merging of multiple rule sets for determining which files are updated. Themis is available from the Arla CVS repository.

Stanford was presented as an example of a large AFS site. Stanford's AFS usage consists of approximately 14TB of data in AFS, in the form of approximately 100,000 volumes; 33TB of storage is available in their primary cell, ir.stanford.edu. Their file servers consist entirely of Solaris machines running a combination of Transarc 3.6 patch-level 3 and OpenAFS 1.2.x, while their database servers run OpenAFS 1.2.2. Their cell consists of 25 file servers using a combination of EMC and Sun StorEdge hardware. Stanford continues to use the kaserver for their authentication infrastructure, with future plans to migrate entirely to an MIT Kerberos 5 KDC.

Stanford has approximately 3,400 clients on their campus, not including SLAC (Stanford Linear Accelerator); approximately 2,100 AFS clients from outside Stanford contact their cell every month. Their supported clients are almost entirely IBM AFS 3.6 clients, although they plan to release OpenAFS clients soon. Stanford currently supports only UNIX clients. There is some on-campus presence of Windows clients, but Stanford has never publicly released or supported it. They do intend to release and support the MacOS X client in the near future.

All Stanford students, faculty, and staff are assigned AFS home directories, with a default quota of 50MB, for a total of approximately 550GB of user home directories. Other significant uses of AFS storage are data volumes for workstation software (400GB) and volumes for course Web sites and assignments (100GB).

AFS usage at Intel was also presented. Intel has been an AFS site since 1994. They had bad experiences with the IBM 3.5 Linux client; their experience with OpenAFS on Linux 2.4.x kernels has been much better. They use and are satisfied with the OpenAFS IA64 Linux port. Intel has hundreds of OpenAFS 1.2.3 and 1.2.4 clients in many production cells accessing data stored on IBM AFS file servers, and they have not encountered any interoperability issues. Intel has some concerns about OpenAFS: they would like to purchase commercial support for OpenAFS and to see OpenAFS support for HP-UX on both PA-RISC and Itanium hardware. HP-UX support is currently unavailable due to a specific HP-UX header file being unavailable from HP; this may be available soon. Intel has not yet committed to migrating their file servers to OpenAFS and is unsure whether they will do so without commercial support.

Backups are a traditional topic of discussion at AFS workshops, and this time was no exception. Many users complain that the traditional AFS backup tools ("backup" and "butc") are complex and difficult to automate, requiring many home-grown scripts and much user intervention for error recovery. An additional complaint was that the traditional AFS tools do not support file- or directory-level backups and restores; data must be backed up and restored at the volume level.

Mitch Collinsworth of Cornell presented work to make AMANDA, the free backup software from the University of Maryland, suitable for AFS backups. Using AMANDA for AFS backups allows one to share AFS backup tapes with non-AFS backups, easily run multiple backups in parallel, automate error recovery, and provide a robust degraded mode that prevents tape errors from stopping backups altogether. Their implementation allows for full volume restores as well as individual directory and file restores. They have finished coding this work and are in the process of testing and documenting it.

Peter Honeyman of CITI at the University of Michigan spoke about work he has proposed to replace Rx with RPCSEC GSS in OpenAFS; this would allow AFS to use a TCP-based transport mechanism rather than the UDP-based Rx, and possibly gain better congestion control, dynamic adaptation, and fragmentation avoidance as a result. RPCSEC GSS uses the GSSAPI to authenticate SUN ONC RPC. RPCSEC GSS is transport agnostic, provides strong security, is a developing Internet standard, and has multiple open source implementations. Backward compatibility with existing AFS servers and clients is an important goal of this project.


In a second test, where the rule set size was continually increased, Iptables consistently had about twice the throughput of the other two (which evaluate the rule set on both the incoming and outgoing interfaces). A third test compared only pf and IPFilter, using a single rule that created state in the state table, with a fixed-state entry size; pf reached an overloaded state much later than IPFilter. The experiment was repeated with a variable-state entry size, and pf performed much better than IPFilter for a small number of states.

In general, rule set evaluation is expensive, and benchmarks only reflect extreme cases, whereas in real life other behavior should be observed. Furthermore, the benchmarks show that stateful filtering can actually improve performance, due to cheap state-table lookups as compared to rule evaluation.

ENHANCING NFS CROSS-ADMINISTRATIVE DOMAIN ACCESS

Joseph Spadavecchia and Erez Zadok, Stony Brook University

The speaker presented modifications to an NFS server that allow improved NFS access between administrative domains. A problem lies in the fact that NFS assumes a shared UID/GID space, which makes it unsafe to export files outside the administrative domain of the server. Design goals were to leave the protocol and existing clients unchanged, to require a minimum amount of server changes, and to provide flexibility and increased performance.

To solve the problem, two techniques are utilized: "range-mapping" and "file-cloaking." Range-mapping maps IDs between client and server. The mapping is performed on a per-export basis and has to be manually set up in an export file. The mappings can be 1-1, N-N, or N-1. In file-cloaking, the server restricts file access based on UID/GID and special cloaking-mask bits; here users can only access their own file permissions. The policy on whether or not the file is visible to others is set by the cloaking mask, which is logically ANDed with the file's protection bits. File-cloaking only works, however, if the client doesn't hold cached copies of directory contents and file attributes. Because of this, the clients are forced to re-read directories, by incrementing the mtime value of the directory each time it is listed.
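A minimal sketch of what per-export ID range-mapping like that described above might look like on the server side; the table layout, the example ranges, and the catch-all "nobody" entry are invented for illustration and are not the authors' actual data structures.

    /* Map a client UID into the server's ID space for one export entry.
     * Supports 1-1 ranges (client base..base+len-1 -> server base..) and
     * N-1 squashing (everything -> one server ID). */
    #include <stdio.h>

    struct id_map {
        unsigned client_base;
        unsigned server_base;
        unsigned length;           /* 0 means "squash everything to server_base" */
    };

    static const struct id_map export_map[] = {   /* hypothetical export file */
        { 1000, 51000, 500 },      /* client 1000-1499 -> server 51000-51499 */
        { 0,    65534, 0   },      /* everyone else -> nobody (N-1) */
    };

    unsigned map_uid(unsigned client_uid)
    {
        for (unsigned i = 0; i < sizeof export_map / sizeof export_map[0]; i++) {
            const struct id_map *m = &export_map[i];
            if (m->length == 0)
                return m->server_base;                    /* squash */
            if (client_uid >= m->client_base &&
                client_uid < m->client_base + m->length)
                return m->server_base + (client_uid - m->client_base);
        }
        return 65534;   /* default: nobody */
    }

    int main(void)
    {
        printf("%u -> %u\n", 1005u, map_uid(1005));
        printf("%u -> %u\n", 4242u, map_uid(4242));
        return 0;
    }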

To test the performance of the modified server, five different NFS configurations were evaluated: an unmodified NFS server was compared against one server with the modified code included but not used, one with only range-mapping enabled, one with only file-cloaking enabled, and one version with all modifications enabled. For each setup, different file system benchmarks were run. The results show only a small overhead when the modifications are used, generally an increase of below 5 percent. Another experiment tested the performance of the system with an increasing number of mapped or cloaked entries. The results show that an increase from 10 to 1,000 entries resulted in a maximum of about a 14 percent cost in performance.

One member of the audience pointed out that if clients choose to ignore the changed mtimes from the server, and thus still hold caches of the directory entries, the file-cloaking mechanism could be defeated. After a rather lengthy debate, the speaker had to concede that the model doesn't add any extra security. Another question was asked about the scalability of setting up range-mapping; the speaker referred to application-level tools that could be developed for that purpose.

The software is available at ftp://ftp.fsl.cs.sunysb.edu/pub/enf.

ENGINEERING OPEN SOURCE SOFTWARE

Summarized by Teri Lampoudi

NINGAUI: A LINUX CLUSTER FOR BUSINESS

Andrew Hume, AT&T Labs – Research; Scott Daniels, Electronic Data Systems Corp.

Ningaui is a general purpose, highly available, resilient architecture built from commodity software and hardware. Emphatically, however, Ningaui is not a Beowulf. Hume calls the cluster design the "Swiss canton model," in which there are a number of loosely affiliated, independent nodes with data replicated among them. Jobs are assigned by bidding and leases, and cluster services done as session-based servers are sited via generic job assignment. The emphasis is on keeping the architecture end-to-end, checking all work via checksums, and logging everything. The resilience and high availability required by their goal of 8-to-5 maintenance (vs. the typical 24-7 model, where people get paged whenever the slightest thing goes wrong, regardless of the time of day) is achieved by job restartability. Finally, all computation is performed on local data, without the use of NFS or network-attached storage.
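The "checksum everything and log it" discipline described above can be as simple as recomputing a digest after every transfer and recording it. The sketch below uses OpenSSL's MD5 routines (link with -lcrypto) and an invented log format, purely to illustrate the practice rather than Ningaui's actual tooling.

    /* Recompute a file's MD5 after transfer and log the result. */
    #include <stdio.h>
    #include <openssl/md5.h>

    int checksum_and_log(const char *path, FILE *log)
    {
        unsigned char buf[8192], digest[MD5_DIGEST_LENGTH];
        MD5_CTX ctx;
        size_t n;
        FILE *f = fopen(path, "rb");

        if (f == NULL)
            return -1;
        MD5_Init(&ctx);
        while ((n = fread(buf, 1, sizeof buf, f)) > 0)
            MD5_Update(&ctx, buf, n);
        MD5_Final(digest, &ctx);
        fclose(f);

        fprintf(log, "xfer %s md5=", path);          /* invented log format */
        for (int i = 0; i < MD5_DIGEST_LENGTH; i++)
            fprintf(log, "%02x", digest[i]);
        fputc('\n', log);
        return 0;
    }

    int main(int argc, char **argv)
    {
        return argc > 1 ? checksum_and_log(argv[1], stdout) : 0;
    }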

Hume's message is a hopeful one: despite the many problems encountered (things like kernel and service limits, auto-installation problems, TCP storms, and the like), the existence of source code and the paranoid practice of logging and checksumming everything has helped. The final product performs reasonably well, and it appears that the resilient techniques employed do make a difference. One drawback, however, is that the software mentioned in the paper is not yet available for download.

CPCMS: A CONFIGURATION MANAGEMENT SYSTEM BASED ON CRYPTOGRAPHIC NAMES

Jonathan S. Shapiro and John Vanderburgh, Johns Hopkins University

This paper won the FREENIX Best Paper award.

The basic notion behind the project is the fact that everyone has a pet complaint about CVS, and yet it is currently the configuration manager in most widespread use. Shapiro has unleashed an alternative. Interestingly, he did not begin but ended with the usual host of reasons why CVS is bad. The talk instead began abruptly by characterizing the job of a software configuration manager, continued by stating the namespaces which it must handle, and wrapped up with the challenges faced.

X MEETS Z: VERIFYING CORRECTNESS IN THE PRESENCE OF POSIX THREADS

Bart Massey, Portland State University; Robert T. Bauer, Rational Software Corp.

Massey delivered a humorous talk on the insight gained from applying Z formal specification notation to system software design, rather than the more informal analysis and design process normally used.

The story is told with respect to writing XCB, which replaces the Xlib protocol layer and is supposed to be thread friendly. But where threads are concerned, deadlock avoidance becomes a hard problem that cannot be solved in an ad hoc manner, and full model checking is also too hard. In this case, Massey resorted to Z specification to model the XCB lower layer, abstract away locking and data transfers, and locate fundamental issues. Essentially, the difficulties of searching the literature and locating information relevant to the problem at hand were overcome. As Massey put it, "the formal method saved the day."

FILE SYSTEMS

Summarized by Bosko Milekic

PLANNED EXTENSIONS TO THE LINUX EXT2/EXT3 FILESYSTEM

Theodore Y. Ts'o, IBM; Stephen Tweedie, Red Hat

The speaker presented improvements to the Linux Ext2 file system, with the goal of allowing for various expansions while striving to maintain compatibility with older code. Improvements have been facilitated by a few extra superblock fields that were added to Ext2 not long before the Linux 2.0 kernel was released. The fields allow for file system features to be added without compromising the existing setup; this is done by providing bits indicating the impact of the added features with respect to compatibility. Namely, a file system with the "incompat" bit marked is not allowed to be mounted. Similarly, a "read-only" marking would only allow the file system to be mounted read-only.
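The compatibility scheme described above boils down to three bitmasks checked at mount time. The sketch below shows that logic schematically, with invented constants and a simplified superblock rather than the real Ext2 on-disk definitions.

    /* Schematic mount-time feature check in the style of Ext2's
     * compat / ro_compat / incompat superblock fields. */
    #include <stdio.h>

    #define SUPPORTED_INCOMPAT  0x0001u   /* features this code fully handles */
    #define SUPPORTED_RO_COMPAT 0x0003u   /* handled, but only for read-only */

    struct superblock {                    /* simplified stand-in */
        unsigned feature_compat;           /* safe to ignore if unknown */
        unsigned feature_ro_compat;        /* unknown bits force read-only */
        unsigned feature_incompat;         /* unknown bits refuse the mount */
    };

    int try_mount(const struct superblock *sb, int want_write)
    {
        if (sb->feature_incompat & ~SUPPORTED_INCOMPAT) {
            printf("unknown incompat features: refusing to mount\n");
            return -1;
        }
        if (want_write && (sb->feature_ro_compat & ~SUPPORTED_RO_COMPAT)) {
            printf("unknown ro_compat features: mounting read-only\n");
            return 0;                      /* caller downgrades to read-only */
        }
        /* compat bits never block a mount; they only advertise extras */
        printf("mount ok (%s)\n", want_write ? "read-write" : "read-only");
        return 1;
    }

    int main(void)
    {
        struct superblock sb = { 0x4, 0x8, 0x0 };
        try_mount(&sb, 1);
        return 0;
    }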

Directory indexing replaces linear directory searches with a faster search using a fixed-depth tree and hashed keys. File system size can be dynamically increased, and the expanded inode, doubled from 128 to 256 bytes, allows for more extensions.

Other potential improvements were discussed as well, in particular pre-allocation for contiguous files, which allows for better performance in certain setups by pre-allocating contiguous blocks. Security-related modifications, extended attributes, and ACLs were mentioned. An implementation of these features already exists but has not yet been merged into the mainline Ext2/3 code.

RECENT FILESYSTEM OPTIMISATIONS ON FREEBSD

Ian Dowse, Corvil Networks; David Malone, CNRI, Dublin Institute of Technology

David Malone presented four important file system optimizations for the FreeBSD OS: soft updates, dirpref, vmiodir, and dirhash. It turns out that certain combinations of the optimizations (beautifully illustrated in the paper) may yield performance improvements of anywhere between 2 and 10 orders of magnitude for real-world applications.

All four of the optimizations deal with file system metadata. Soft updates allow for asynchronous metadata updates. Dirpref changes the way directories are organized, attempting to place child directories closer to their parents, thereby increasing locality of reference and reducing disk-seek times. Vmiodir trades some extra memory in order to achieve better directory caching. Finally, dirhash, which was implemented by Ian Dowse, changes the way in which entries are searched in a directory. Specifically, FFS uses a linear search to find an entry by name; dirhash builds a hashtable of directory entries on the fly. For a directory of size n with a working set of m files, a search that in certain cases could have been O(nm) has been reduced, due to dirhash, to effectively O(n + m).
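As a rough sketch of why a hashtable turns the linear name scan described above into near-constant-time lookups, here is a generic chained hash over directory entries; it is not FreeBSD's actual dirhash code, and the entry layout is invented.

    /* Build a hash of directory entries once, then look names up in O(1)
     * expected time instead of rescanning the directory linearly. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS 256

    struct dirent_node {
        const char *name;
        long offset;                    /* where the entry lives on disk */
        struct dirent_node *next;
    };

    static struct dirent_node *bucket[NBUCKETS];

    static unsigned hash(const char *s)
    {
        unsigned h = 5381;
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h % NBUCKETS;
    }

    void dh_insert(const char *name, long offset)
    {
        struct dirent_node *n = malloc(sizeof *n);
        if (n == NULL)
            return;
        n->name = name;
        n->offset = offset;
        n->next = bucket[hash(name)];
        bucket[hash(name)] = n;
    }

    long dh_lookup(const char *name)
    {
        for (struct dirent_node *n = bucket[hash(name)]; n; n = n->next)
            if (strcmp(n->name, name) == 0)
                return n->offset;
        return -1;
    }

    int main(void)
    {
        dh_insert("Makefile", 0);
        dh_insert("main.c", 64);
        printf("main.c at offset %ld\n", dh_lookup("main.c"));
        return 0;
    }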

FILESYSTEM PERFORMANCE AND SCALABILITY IN LINUX 2.4.17

Ray Bryant, SGI; Ruth Forester, IBM LTC; John Hawkes, SGI

This talk focused on performance evaluation of a number of file systems available and commonly deployed on Linux machines. Comparisons under various configurations of Ext2, Ext3, ReiserFS, XFS, and JFS were presented.

The benchmarks chosen for the data gathering were pgmeter, filemark, and AIM Benchmark Suite VII. Pgmeter measures the rate of data transfer of reads/writes of a file under a synthetic workload. Filemark is similar to postmark in that it is an operation-intensive benchmark, although filemark is threaded and offers various other features that postmark lacks. AIM VII measures performance for various file-system-related functionalities; it offers various metrics under an imposed workload, thus stressing the performance of the file system not only under I/O load but also under significant CPU load.

Tests were run on three different setups: a small, a medium, and a large configuration. ReiserFS and Ext2 appear at the top of the pile for smaller and medium setups. Notably, XFS and JFS perform worse for smaller system configurations than the others, although XFS clearly appears to generate better numbers than JFS. It should be noted that XFS seems to scale well under a higher load. This was most evident in the large-system results, where XFS appears to offer the best overall results.

THINGS TO THINK ABOUT

Summarized by Bosko Milekic

SPEEDING UP KERNEL SCHEDULER BY REDUCING CACHE MISSES

Shuji Yamamura, Akira Hirai, Mitsuru Sato, Masao Yamamoto, Akira Naruse, Kouichi Kumon, Fujitsu Labs

This was an interesting talk pertaining to the effects of cache coloring for task structures in the Linux kernel scheduler (Linux kernel 2.4.x). The speaker first presented some interesting benchmark numbers for the Linux scheduler, showing that as the number of processes on the task queue was increased, the performance decreased. The authors used some really nifty hardware to measure the number of bus transactions throughout their tests and were thus able to reasonably quantify the impact that cache misses had on the Linux scheduler.

Their experiments led them to implement a cache coloring scheme for task structures, which were previously aligned on 8KB boundaries and therefore were eventually being mapped to the same cache lines. This unfortunate placement of task structures in memory induced a significant number of cache misses as the number of tasks grew in the scheduler.

The implemented solution consisted of aligning task structures to more evenly distribute cached entries across the L2 cache. The result was inevitably fewer cache misses in the scheduler. Some negative effects were observed in certain situations; these were primarily due to more cache slots being used by task-structure data in the scheduler, thus forcing data previously cached there to be pushed out.
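The fix described above amounts to adding a varying offset (a "color") to otherwise identically aligned objects so that consecutive allocations land in different cache sets. The sketch below shows the arithmetic in user space with assumed sizes; it is not the authors' kernel patch, and a real allocator would also keep the raw pointer so the memory could be freed.

    /* Cache-coloring sketch: stagger 8KB-aligned objects by one cache line
     * per allocation so consecutive objects map to different L2 sets. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    #define OBJ_ALIGN   8192u    /* original alignment: everything collides */
    #define LINE_SIZE   64u      /* assumed cache line size */
    #define NCOLORS     16u      /* number of distinct offsets to cycle over */

    void *alloc_colored(unsigned index)
    {
        unsigned color = (index % NCOLORS) * LINE_SIZE;
        /* over-allocate so we can align down and then add the color offset;
         * the raw pointer is leaked here for brevity */
        uint8_t *raw = malloc(OBJ_ALIGN * 2 + LINE_SIZE * NCOLORS);
        if (raw == NULL)
            return NULL;
        uintptr_t aligned = ((uintptr_t)raw + OBJ_ALIGN - 1) & ~(uintptr_t)(OBJ_ALIGN - 1);
        return (void *)(aligned + color);   /* same frame, different cache set */
    }

    int main(void)
    {
        for (unsigned i = 0; i < 4; i++) {
            void *p = alloc_colored(i);
            printf("object %u at %p (offset in 8KB frame: %lu)\n",
                   i, p, (unsigned long)((uintptr_t)p & (OBJ_ALIGN - 1)));
        }
        return 0;
    }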


WORK-IN-PROGRESS REPORTS

Summarized by Brennan Reynolds

RESOURCE VIRTUALIZATION TECHNIQUES FOR WIDE-AREA OVERLAY NETWORKS

Kartik Gopalan, Stony Brook University

This work addressed the issue of provisioning a maximum number of virtual overlay networks (VONs) with diverse quality-of-service (QoS) requirements on a single physical network. Each of the VONs is logically isolated from the others to ensure its QoS requirements. Gopalan mentioned several critical research issues with this problem that are currently being investigated. Dealing with how to provision the network at various levels (link, route, or path) and then enforce the provisioning at run-time is one of the toughest challenges. Currently, Gopalan has developed several algorithms to handle admission control, end-to-end QoS, route selection, scheduling, and fault tolerance in the network.

For more information, visit http://www.ecsl.cs.sunysb.edu.

VISUALIZING SOFTWARE INSTABILITY

Jennifer Bevan, University of California, Santa Cruz

Detection of instability in software has typically been an afterthought. The point in the development cycle when the software is reviewed for instability is usually after it is difficult and costly to go back and perform major modifications to the code base. Bevan has developed a technique to allow the visualization of unstable regions of code that can be used much earlier in the development cycle. Her technique creates a time series of dependency graphs that include clusters and lines called fault lines. From the graphs, a developer is able to easily determine where the unstable sections of code are and proactively restructure them. She is currently working on a prototype implementation.

For more information, visit http://www.cse.ucsc.edu/~jbevan/evo_viz.

RELIABLE AND SCALABLE PEER-TO-PEER WEB DOCUMENT SHARING

Li Xiao, College of William and Mary

The idea presented by Xiao would allow end users to share the contents of their Web browser caches with neighboring Internet users. The rationale for this is that today's end users are increasingly connected to the Internet over high-speed links, and their browser caches are becoming large storage systems. Therefore, if an individual accesses a page which does not exist in their local cache, Xiao suggests that they first query other end users for the content before trying to access the machine hosting the original. This strategy does have some serious problems associated with it that still need to be addressed, including ensuring the integrity of the content and protecting the identity and privacy of the end users.

SEREL: FAST BOOTING FOR UNIX

Leni Mayo, Fast Boot Software

Serel is a tool that generates a visual representation of a UNIX machine's boot-up sequence. It can be used to identify the critical path and can show whether a particular service or process blocks for an extended period of time. This information could be used to determine where the largest performance gains could be realized by tuning the order of execution at boot-up. Serel creates a dependency graph, expressed in XML, during boot-up. This graph is then used to create the visual representation. Currently the tool only works on POSIX-compliant systems, but Mayo stated that he would be porting it to other platforms. Other extensions that were mentioned included having the metadata include the use of shared libraries and monitoring the suspend/resume sequence of portable machines.

For more information, visit http://www.fastboot.org.


BERKELEY DB XML

John Merrells, Sleepycat Software

Merrells gave a quick introduction and overview of the new XML library for Berkeley DB. The library specializes in storage and retrieval of XML content through a tool called XPath. The library allows for multiple containers per document and stores everything natively as XML. The user is also given a wide range of elements to create indices with, including edges, elements, text strings, or presence. The XPath tool consists of a query parser, generator, optimizer, and execution engine. To conclude his presentation, Merrells gave a live demonstration of the software.

For more information, visit http://www.sleepycat.com/xml.


Mike Karels from Wind River Systemspresented an overview of how BSDOShas been evolving following the WindRiver Systems acquisition of BSDirsquos soft-ware assets Ernie Prabhakar from Applediscussed Applersquos OS X and its successon the desktop He explained how OS Xaims to bring BSD to the desktop whilestriving to maintain a strong and stablecore one that is heavily based on BSDtechnology

The BoF finished with a discussion onlicensing specifically some folks ques-tioned the impact of Applersquos licensingfor code from their core OS (the DarwinProject) on the BSD community as awhole

CLOSING SESSION

HOW FLIES FLY

Michael H Dickinson University ofCalifornia Berkeley

Summarized by JD Welch

In this lively and offbeat talk Dickinsondiscussed his research into the mecha-nisms of flight in the fruit fly (Droso-phila melanogaster) and related theautonomic behavior to that of a techno-logical system Insects are the most suc-cessful organisms on earth due in nosmall part to their early adoption offlight Flies travel through space onstraight trajectories interrupted by sac-cades jumpy turning motions some-what analogous to the fast jumpymovements of the human eye Using avariety of monitoring techniquesincluding high-speed video in a ldquovirtualreality flight arenardquo Dickinson and hiscolleagues have observed that fliesrespond to visual cues to decide when tosaccade during flight For example morevisual texture makes flies saccade earlier(ie further away from the arena wall)described by Dickinson as a ldquocollisioncontrol algorithmrdquo

Through a combination of sensoryinputs flies can make decisions abouttheir flight path Flies possess a visualldquoexpansion detectorrdquo which at a certainthreshold causes the animal to turn acertain direction However expansion infront of the fly sometimes causes it toland How does the fly decide Using thevirtual reality device to replicate variousvisual scenarios Dickinson observedthat the flies fixate on vertical objectsthat move back and forth across thefield Expansion on the left side of theanimal causes it to turn right and viceversa while expansion directly in frontof the animal triggers the legs to flareout in preparation for landing

If the eyes detect these changes how arethe responses implemented The fliesrsquowings have ldquopower musclesrdquo controlledby mechanical resonance (as opposed tothe nervous system) which drive the

USENIX 2002

wings combined with neurally activatedldquosteering musclesrdquo which change theconfiguration of the wing joints Subtlevariations in the timing of impulses cor-respond to changes in wing movementcontrolled by sensors in the ldquowing-pitrdquoA small wing-like structure the halterecontrols equilibrium by beating con-stantly

Voluntary control is accomplished bythe halteres whose signals can interferewith the autonomic control of the strokecycle The halteres have ldquosteering mus-clesrdquo as well and information derivedfrom the visual system can turn off thehaltere or trick it into a ldquovirtualrdquo prob-lem requiring a response

Dickinson has also studied the aerody-namics of insect flight using a devicecalled the ldquoRobo Flyrdquo an oversizedmechanical insect wing suspended in alarge tank of mineral oil InterestinglyDickinson observed that the average liftgenerated by the fliesrsquo wings is under itsbody weight the flies use three mecha-nisms to overcome this including rotat-ing the wings (rotational lift)

Insects are extraordinarily robust crea-tures and because Dickinson analyzedthe problem in a systems-oriented waythese observations and analysis areimmediately applicable to technologythe response system can be used as anefficient search algorithm for controlsystems in autonomous vehicles forexample

THE AFS WORKSHOP

Summarized by Garry Zacheiss MIT

Love Hornquist-Astrand of the Arladevelopment team presented the Arlastatus report Arla 0358 has beenreleased Scheduled for release soon areimproved support for Tru64 UNIXMacOS X and FreeBSD improved vol-ume handling and implementation ofmore of the vospts subcommands Itwas stressed that MacOS X is consideredan important platform and that a GUIconfiguration manager for the cache

86 Vol 27 No 5 login

manager a GUI ACL manager for thefinder and a graphical login that obtainsKerberos tickets and AFS tokens at logintime were all under development

Future goals planned for the underlyingAFS protocols include GSSAPISPNEGOsupport for Rx performance improve-ments to Rx an enhanced disconnectedmode and IPv6 support for Rx anexperimental patch is already availablefor the latter Future Arla-specific goalsinclude improved performance partialfile reading and increased stability forseveral platforms Work is also inprogress for the RXGSS protocol exten-sions (integrating GSSAPI into Rx) Apartial implementation exists and workcontinues as developers find time

Derrick Brashear of the OpenAFS devel-opment team presented the OpenAFSstatus report OpenAFS was releasedimmediately prior to the conferenceOpenAFS 125 fixed a remotelyexploitable denial-of-service attack inseveral OpenAFS platforms mostnotably IRIX and AIX Future workplanned for OpenAFS includes bettersupport for MacOS X including work-ing around Finder interaction issuesBetter support for the BSDs is alsoplanned FreeBSD has a partially imple-mented client NetBSD and OpenBSDhave only server programs availableright now AIX 5 Tru64 51A andMacOS X 102 (aka Jaguar) are allplanned for the future Other plannedenhancements include nested groups inthe ptserver (code donated by the Uni-versity of Michigan awaits integration)disconnected AFS and further work ona native Windows client Derrick statedthat the guiding principles of OpenAFSwere to maintain compatibility withIBM AFS support new platforms easethe administrative burden and add newfunctionality

AFS performance was discussed Theopenafsorg cell consists of two fileservers one in Stockholm Sweden andone in Pittsburgh PA AFS works rea-sonably well over trans-Atlantic links

although OpenAFS clients donrsquot deter-mine which file server to talk to veryeffectively Arla clients use RTTs to theserver to determine the optimal fileserver to fetch replicated data fromModifications to OpenAFS to supportthis behavior in the future are desired

Jimmy Engelbrecht and Harald Barth ofKTH discussed their AFSCrawler scriptwritten to determine how many AFScells and clients were in the world whatimplementationsversions they were(Arla vs IBM AFS vs OpenAFS) andhow much data was in AFS The scriptunfortunately triggered a bug in IBMAFS 36ndashderived code causing someclients to panic while handling a specificRPC This has since been fixed in OpenAFS 125 and the most recent IBMAFS patch level and all AFS users arestrongly encouraged to upgrade Norelease of Arla is vulnerable to this par-ticular denial-of-service attack Therewas an extended discussion of the use-fulness of this exploration Many sitesbelieved this was useful information andsuch scanning should continue in thefuture but only on an opt-in basis

Many sites face the problem of manag-ing KerberosAFS credentials for batch-scheduled jobs Specifically most batchprocessing software needs to be modi-fied to forward tickets as part of thebatch submission process renew ticketsand tokens while the job is in the queueand for the lifetime of the job and prop-erly destroy credentials when the jobcompletes Ken Hornstein of NRL wasable to pay a commercial vendor to sup-port Kerberos 45 credential manage-ment in their product although they didnot implement AFS token managementMIT has implemented some of thedesired functionality in OpenPBS andmight be able to make it available toother interested sites

Tools to simplify AFS administrationwere discussed including

AFS Balancer A tool written byCMU to automate the process ofbalancing disk usage across all

servers in a cell Available fromftpftpandrewcmuedupubAFS-Toolsbalance-11btargz

Themis Themis is KTHrsquos enhancedversion of the AFS tool ldquopackagerdquofor updating files on local disk froma central AFS image KTHrsquosenhancements include allowing thedeletion of files simplifying theprocess of adding a file and allow-ing the merging of multiple rulesets for determining which files areupdated Themis is available fromthe Arla CVS repository

Stanford was presented as an example ofa large AFS site Stanfordrsquos AFS usageconsists of approximately 14TB of datain AFS in the form of approximately100000 volumes 33TB of storage isavailable in their primary cell irstan-fordedu Their file servers consistentirely of Solaris machines running acombination of Transarc 36 patch-level3 and OpenAFS 12x while their data-base servers run OpenAFS 122 Theircell consists of 25 file servers using acombination of EMC and Sun StorEdgehardware Stanford continues to use thekaserver for their authentication infra-structure with future plans to migrateentirely to an MIT Kerberos 5 KDC

Stanford has approximately 3400 clientson their campus not including SLAC(Stanford Linear Accelerator) approxi-mately 2100 AFS clients from outsideStanford contact their cell every monthTheir supported clients are almostentirely IBM AFS 36 clients althoughthey plan to release OpenAFS clientssoon Stanford currently supports onlyUNIX clients There is some on-campuspresence of Windows clients but Stan-ford has never publicly released or sup-ported it They do intend to release andsupport the MacOS X client in the nearfuture

All Stanford students faculty and staffare assigned AFS home directories witha default quota of 50MB for a total ofapproximately 550GB of user homedirectories Other significant uses of AFS

87October 2002 login

C

ON

FERE

NC

ERE

PORT

Sstorage are data volumes for workstationsoftware (400GB) and volumes forcourse Web sites and assignments(100GB)

AFS usage at Intel was also presentedIntel has been an AFS site since 1994They had bad experiences with the IBM35 Linux client their experience withOpenAFS on Linux 24x kernels hasbeen much better They use and are sat-isfied with the OpenAFS IA64 Linuxport Intel has hundreds of OpenAFS123 and 124 clients in many produc-tion cells accessing data stored on IBMAFS file servers They have not encoun-tered any interoperability issues Intelhas some concerns about OpenAFS theywould like to purchase commercial sup-port for OpenAFS and to see OpenAFSsupport for HP-UX on both PA-RISCand Itanium hardware HP-UX supportis currently unavailable due to a specificHP-UX header file being unavailablefrom HP this may be available soonIntel has not yet committed to migratingtheir file servers to OpenAFS and areunsure if they will do so without com-mercial support

Backups are a traditional topic of dis-cussion at AFS workshops and this timewas no exception Many users complainthat the traditional AFS backup tools(ldquobackuprdquo and ldquobutcrdquo) are complex anddifficult to automate requiring manyhome-grown scripts and much userintervention for error recovery An addi-tional complaint was that the traditionalAFS tools do not support file- or direc-tory-level backups and restores datamust be backed up and restored at thevolume level

Mitch Collinsworth of Cornell presentedwork to make AMANDA the freebackup software from the University ofMaryland suitable for AFS backupsUsing AMANDA for AFS backups allowsone to share AFS backup tapes withnon-AFS backups easily run multiplebackups in parallel automate errorrecovery and provide a robust degradedmode that prevents tape errors from

stopping backups altogether Theirimplementation allows for full volumerestores as well as individual directoryand file restores They have finished cod-ing this work and are in the process oftesting and documenting it

Peter Honeyman of CITI at the Univer-sity of Michigan spoke about work hehas proposed to replace Rx with RPC-SEC GSS in OpenAFS this would allowAFS to use a TCP-based transport mech-anism rather than the UDP-based Rxand possibly gain better congestion con-trol dynamic adaptation and fragmen-tation avoidance as a result RPCSECGSS uses the GSSAPI to authenticateSUN ONC RPC RPCSEC GSS is trans-port agnostic provides strong security isa developing Internet standard and hasmultiple open source implementationsBackward compatibility with existingAFS servers and clients is an importantgoal of this project

Page 20: inside - USENIX · 2019. 2. 25. · Bosko Milekic Juan Navarro David E. Ott Amit Purohit Brennan Reynolds Matt Selsky J.D. Welch Li Xiao Praveen Yalagandula Haijin Yan Gary Zacheiss

of a software configuration manager, continued by stating the namespaces which it must handle, and wrapped up with the challenges faced.

X MEETS Z: VERIFYING CORRECTNESS IN THE PRESENCE OF POSIX THREADS

Bart Massey, Portland State University; Robert T. Bauer, Rational Software Corp.

Massey delivered a humorous talk on the insight gained from applying Z formal specification notation to system software design, rather than the more informal analysis and design process normally used.

The story was told with respect to writing XCB, which replaces the Xlib protocol layer and is supposed to be thread-friendly. Where threads are concerned, however, deadlock avoidance becomes a hard problem that cannot be solved in an ad hoc manner, but full model checking is also too hard. In this case Massey resorted to Z specification to model the XCB lower layer, abstract away locking and data transfers, and locate fundamental issues. In essence, the difficulties of searching the literature and locating information relevant to the problem at hand were overcome. As Massey put it, "the formal method saved the day."

FILE SYSTEMS

Summarized by Bosko Milekic

PLANNED EXTENSIONS TO THE LINUX EXT2/EXT3 FILESYSTEM

Theodore Y. Ts'o, IBM; Stephen Tweedie, Red Hat

The speaker presented improvements to the Linux Ext2 file system, with the goal of allowing for various expansions while striving to maintain compatibility with older code. Improvements have been facilitated by a few extra superblock fields that were added to Ext2 not long before the Linux 2.0 kernel was released. The fields allow for file system features to be added without compromising the existing setup; this is done by providing bits indicating the impact of the added features with respect to compatibility. Namely, a file system with the "incompat" bit marked is not allowed to be mounted. Similarly, a "read-only" marking would only allow the file system to be mounted read-only.
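
The following is a minimal sketch, in C, of how such compatibility bitmasks can be interpreted when a mount is attempted; the structure, constant names, and flag values are invented for illustration and are not the real Ext2 superblock layout.

    #include <stdio.h>
    #include <stdint.h>

    /* Invented flag value and layout -- not the real Ext2 superblock. */
    #define FEAT_RO_COMPAT_NEWTHING 0x0001u  /* unknown => allow read-only only */

    struct fake_super {
        uint32_t feature_compat;      /* safe to ignore if unknown   */
        uint32_t feature_incompat;    /* must be understood to mount */
        uint32_t feature_ro_compat;   /* must be understood to write */
    };

    /* Feature bits this particular (pretend) implementation understands. */
    static const uint32_t known_incompat  = 0;
    static const uint32_t known_ro_compat = 0;

    static int try_mount(const struct fake_super *sb, int readonly)
    {
        if (sb->feature_incompat & ~known_incompat) {
            printf("unknown incompat feature: refusing to mount\n");
            return -1;
        }
        if (!readonly && (sb->feature_ro_compat & ~known_ro_compat)) {
            printf("unknown ro_compat feature: read-only mount only\n");
            return -1;
        }
        printf("mount ok (%s)\n", readonly ? "read-only" : "read-write");
        return 0;
    }

    int main(void)
    {
        struct fake_super sb = { 0, 0, FEAT_RO_COMPAT_NEWTHING };
        try_mount(&sb, 0);   /* rejected for read-write        */
        try_mount(&sb, 1);   /* permitted as a read-only mount */
        return 0;
    }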

Directory indexing replaces linear directory searches with a faster search using a fixed-depth tree and hashed keys. File system size can be dynamically increased, and the expanded inode, doubled from 128 to 256 bytes, allows for more extensions.

Other potential improvements were discussed as well, in particular pre-allocation for contiguous files, which allows for better performance in certain setups by pre-allocating contiguous blocks. Security-related modifications, extended attributes, and ACLs were mentioned. An implementation of these features already exists but has not yet been merged into the mainline Ext2/3 code.

RECENT FILESYSTEM OPTIMISATIONS ON FREEBSD

Ian Dowse, Corvil Networks; David Malone, CNRI, Dublin Institute of Technology

David Malone presented four important file system optimizations for the FreeBSD OS: soft updates, dirpref, vmiodir, and dirhash. It turns out that certain combinations of the optimizations (beautifully illustrated in the paper) may yield performance improvements of anywhere between 2 and 10 orders of magnitude for real-world applications.

All four of the optimizations deal with file system metadata. Soft updates allow for asynchronous metadata updates. Dirpref changes the way directories are organized, attempting to place child directories closer to their parents, thereby increasing locality of reference and reducing disk-seek times. Vmiodir trades some extra memory in order to achieve better directory caching. Finally, dirhash, which was implemented by Ian Dowse, changes the way in which entries are searched in a directory. Specifically, FFS uses a linear search to find an entry by name; dirhash builds a hashtable of directory entries on the fly. For a directory of size n with a working set of m files, a search that in certain cases could have been O(nm) has been reduced, due to dirhash, to effectively O(n + m).
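
A toy illustration of the dirhash idea follows: build a hash table over a directory's entries once, then answer name lookups from the table instead of rescanning linearly. The entry layout, sizes, and hash function are invented for the example and are not FreeBSD's actual implementation.

    #include <stdio.h>
    #include <string.h>

    /* Toy dirhash: one linear pass builds a hash table over the directory's
     * entries; later lookups go through the table instead of rescanning. */
    #define NBUCKET 8

    struct dent { const char *name; int inode; int next; };  /* next: chain index, -1 ends */

    static struct dent dir[] = {
        { "README", 11, -1 }, { "src", 12, -1 }, { "Makefile", 13, -1 }, { "doc", 14, -1 },
    };
    static const int nentries = (int)(sizeof(dir) / sizeof(dir[0]));
    static int bucket[NBUCKET];

    static unsigned hash_name(const char *s)
    {
        unsigned h = 0;
        while (*s)
            h = h * 31 + (unsigned char)*s++;
        return h % NBUCKET;
    }

    static void build_dirhash(void)              /* O(n), done once */
    {
        for (int b = 0; b < NBUCKET; b++)
            bucket[b] = -1;
        for (int i = 0; i < nentries; i++) {
            unsigned b = hash_name(dir[i].name);
            dir[i].next = bucket[b];
            bucket[b] = i;
        }
    }

    static int lookup(const char *name)          /* near-constant time per lookup */
    {
        for (int i = bucket[hash_name(name)]; i != -1; i = dir[i].next)
            if (strcmp(dir[i].name, name) == 0)
                return dir[i].inode;
        return -1;
    }

    int main(void)
    {
        build_dirhash();
        printf("Makefile -> inode %d\n", lookup("Makefile"));
        printf("missing  -> inode %d\n", lookup("missing"));
        return 0;
    }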

FILESYSTEM PERFORMANCE AND SCALABILITY IN LINUX 2.4.17

Ray Bryant, SGI; Ruth Forester, IBM LTC; John Hawkes, SGI

This talk focused on performance evaluation of a number of file systems available and commonly deployed on Linux machines. Comparisons under various configurations of Ext2, Ext3, ReiserFS, XFS, and JFS were presented.

The benchmarks chosen for the data gathering were pgmeter, filemark, and the AIM Benchmark Suite VII. Pgmeter measures the rate of data transfer of reads/writes of a file under a synthetic workload. Filemark is similar to postmark in that it is an operation-intensive benchmark, although filemark is threaded and offers various other features that postmark lacks. AIM VII measures performance for various file-system-related functionalities; it offers various metrics under an imposed workload, thus stressing the performance of the file system not only under I/O load but also under significant CPU load.

Tests were run on three different setups: a small, a medium, and a large configuration. ReiserFS and Ext2 appear at the top of the pile for smaller and medium setups. Notably, XFS and JFS perform worse for smaller system configurations than the others, although XFS clearly appears to generate better numbers than JFS. It should be noted that XFS seems to scale well under a higher load. This was most evident in the large-system results, where XFS appears to offer the best overall results.

THINGS TO THINK ABOUT

Summarized by Bosko Milekic

SPEEDING UP KERNEL SCHEDULER BY REDUCING CACHE MISSES

Shuji Yamamura, Akira Hirai, Mitsuru Sato, Masao Yamamoto, Akira Naruse, Kouichi Kumon, Fujitsu Labs

This was an interesting talk pertaining to the effects of cache coloring for task structures in the Linux kernel scheduler (Linux kernel 2.4.x). The speaker first presented some interesting benchmark numbers for the Linux scheduler, showing that as the number of processes on the task queue was increased, the performance decreased. The authors used some really nifty hardware to measure the number of bus transactions throughout their tests and were thus able to reasonably quantify the impact that cache misses had in the Linux scheduler.

Their experiments led them to implement a cache coloring scheme for task structures, which were previously aligned on 8KB boundaries and therefore were eventually being mapped to the same cache lines. This unfortunate placement of task structures in memory induced a significant number of cache misses as the number of tasks grew in the scheduler.

The implemented solution consisted of aligning task structures to more evenly distribute cached entries across the L2 cache. The result was inevitably fewer cache misses in the scheduler. Some negative effects were observed in certain situations; these were primarily due to more cache slots being used by task structure data in the scheduler, thus forcing data previously cached there to be pushed out.
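
The following toy program illustrates the aliasing problem and the coloring fix: task structures whose start addresses are a whole cache-size apart map to the same L2 set, while shifting each one by a per-task "color" of a cache line spreads them out. The cache geometry and addresses are invented for the example.

    #include <stdio.h>

    /* Toy model: a 256KB, direct-mapped L2 with 64-byte lines. */
    enum { LINE = 64, NSETS = 4096 };                  /* 4096 * 64 = 256KB */

    static unsigned set_of(unsigned long addr)
    {
        return (unsigned)((addr / LINE) % NSETS);
    }

    int main(void)
    {
        /* Three task structures whose uncolored start addresses are exactly
         * one cache-size apart, so they alias to the same set. */
        unsigned long base[3] = { 0x000000UL, 0x040000UL, 0x080000UL };

        for (int i = 0; i < 3; i++) {
            unsigned long colored = base[i] + (unsigned long)i * LINE;  /* color i */
            printf("task %d: uncolored set %4u, colored set %4u\n",
                   i, set_of(base[i]), set_of(colored));
        }
        return 0;
    }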


WORK-IN-PROGRESS REPORTS

Summarized by Brennan Reynolds

RESOURCE VIRTUALIZATION TECHNIQUES FOR WIDE-AREA OVERLAY NETWORKS

Kartik Gopalan, University of Stony Brook

This work addressed the issue of provisioning a maximum number of virtual overlay networks (VON) with diverse quality of service (QoS) requirements on a single physical network. Each of the VONs is logically isolated from others to ensure the QoS requirements. Gopalan mentioned several critical research issues with this problem that are currently being investigated. Dealing with how to provision the network at various levels (link, route, or path) and then enforce the provisioning at run-time is one of the toughest challenges. Currently, Gopalan has developed several algorithms to handle admission control, end-to-end QoS, route selection, scheduling, and fault tolerance in the network.

For more information, visit http://www.ecsl.cs.sunysb.edu.

VISUALIZING SOFTWARE INSTABILITY

Jennifer Bevan, University of California, Santa Cruz

Detection of instability in software has typically been an afterthought. The point in the development cycle when the software is reviewed for instability is usually after it is difficult and costly to go back and perform major modifications to the code base. Bevan has developed a technique to allow the visualization of unstable regions of code that can be used much earlier in the development cycle. Her technique creates a time series of dependent graphs that include clusters and lines called fault lines. From the graphs, a developer is able to easily determine where the unstable sections of code are and proactively restructure them. She is currently working on a prototype implementation.

For more information, visit http://www.cse.ucsc.edu/~jbevan/evo_viz.

RELIABLE AND SCALABLE PEER-TO-PEER WEB DOCUMENT SHARING

Li Xiao, College of William and Mary

The idea presented by Xiao would allow end users to share the content of their Web browser caches with neighboring Internet users. The rationale for this is that today's end users are increasingly connected to the Internet over high-speed links, and the browser caches are becoming large storage systems. Therefore, if an individual accesses a page which does not exist in their local cache, Xiao is suggesting that they first query other end users for the content before trying to access the machine hosting the original. This strategy does have some serious problems associated with it that still need to be addressed, including ensuring the integrity of the content and protecting the identity and privacy of the end users.

SEREL: FAST BOOTING FOR UNIX

Leni Mayo, Fast Boot Software

Serel is a tool that generates a visual representation of a UNIX machine's boot-up sequence. It can be used to identify the critical path and can show if a particular service or process blocks for an extended period of time. This information could be used to determine where the largest performance gains could be realized by tuning the order of execution at boot-up. Serel creates a dependency graph, expressed in XML, during boot-up. This graph is then used to create the visual representation. Currently the tool only works on POSIX-compliant systems, but Mayo stated that he would be porting it to other platforms. Other extensions that were mentioned included having the metadata include the use of shared libraries and monitoring the suspend/resume sequence of portable machines.

For more information, visit http://www.fastboot.org.


BERKELEY DB XML

John Merrells, Sleepycat Software

Merrells gave a quick introduction and overview of the new XML library for Berkeley DB. The library specializes in storage and retrieval of XML content through a tool called XPath. The library allows for multiple containers per document and stores everything natively as XML. The user is also given a wide range of elements to create indices with, including edges, elements, text strings, or presence. The XPath tool consists of a query parser, generator, optimizer, and execution engine. To conclude his presentation, Merrells gave a live demonstration of the software.

For more information, visit http://www.sleepycat.com/xml.

CLUSTER-ON-DEMAND (COD)

Justin Moore, Duke University

Modern clusters are growing at a rapid rate. Many have pushed beyond the 5000-machine mark, and deploying them results in large expenses as well as management and provisioning issues. Furthermore, if the cluster is "rented" out to various users, it is very time-consuming to configure it to a user's specs, regardless of how long they need to use it. The COD work presented creates dynamic virtual clusters within a given physical cluster. The goal was to have a provisioning tool that would automatically select a chunk of available nodes and install the operating system and middleware specified by the customer in a short period of time. This would allow a greater use of resources, since multiple virtual clusters can exist at once. By using a virtual cluster, the size can be changed dynamically. Moore stated that they have created a working prototype and are currently testing and benchmarking its performance.

For more information, visit http://www.cs.duke.edu/~justin/cod.


CATACOMB

Elias Sinderson, University of California, Santa Cruz

Catacomb is a project to develop a database-backed DASL module for the Apache Web server. It was designed as a replacement for WebDAV. The initial release of the module only contains support for a MySQL database but could be extended to others. Sinderson briefly touched on the performance of her module: it was comparable to the mod_dav Apache module for all query types but search. The presentation was concluded with remarks about adding support for the lock method and including ACL specifications in the future.

For more information, visit http://ocean.cse.ucsc.edu/catacomb.

SELF-ORGANIZING STORAGE

Dan Ellard, Harvard University

Ellard's presentation introduced a storage system that tuned itself based on the workload of the system, without requiring the intervention of the user. The intelligence was implemented as a virtual self-organizing disk that resides below any file system. The virtual disk observes the access patterns exhibited by the system and then attempts to predict what information will be accessed next. An experiment was done using an NFS trace at a large ISP on one of their email servers. Ellard's self-organizing storage system worked well, which he attributes to the fact that most of the files being requested were large email boxes. Areas of future work include exploration of the length and detail of the data collection stage as well as the CPU impact of running the virtual disk layer.

VISUAL DEBUGGING

John Costigan, Virginia Tech

Costigan feels that the state of current debugging facilities in the UNIX world is not as good as it should be. He proposes the addition of several elements to programs like ddd and gdb. The first addition is including separate heap and stack components. Another is to display only certain fields of a data structure and have the ability to zoom in and out if needed. Finally, the debugger should provide the ability to visually present complex data structures, including linked lists, trees, etc. He said that a beta version of a debugger with these abilities is currently available.

For more information, visit http://infovis.cs.vt.edu/datastruct.

IMPROVING APPLICATION PERFORMANCE THROUGH SYSTEM-CALL COMPOSITION

Amit Purohit, University of Stony Brook

Web servers perform a huge number of context switches and internal data copying during normal operation. These two elements can drastically limit the performance of an application, regardless of the hardware platform it is run on. The Compound System Call (CoSy) framework is an attempt to reduce the performance penalty for context switches and data copies. It includes a user-level library and several kernel facilities that can be used via system calls. The library provides the programmer with a complete set of memory-management functions. Performance tests using the CoSy framework showed a large saving for context switches and data copies. The only area where the savings between conventional libraries and CoSy were negligible was for very large file copies.
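
As a rough illustration of the composition idea, the sketch below describes a whole sequence of operations in one structure and hands it to a single executor, rather than paying one crossing per operation. Everything here runs in user space, and the operation set and names are invented; this is not the CoSy API.

    #include <stdio.h>

    /* Conceptual sketch only: a framework like the one in the talk would run
     * a loop like run_compound() inside the kernel, so the whole sequence
     * costs one user/kernel crossing instead of one per operation. */
    enum op { OP_OPEN, OP_READ, OP_WRITE, OP_CLOSE };

    struct cop { enum op op; const char *arg; };

    static int run_compound(const struct cop *ops, int n)
    {
        for (int i = 0; i < n; i++) {
            switch (ops[i].op) {
            case OP_OPEN:  printf("open  %s\n", ops[i].arg); break;
            case OP_READ:  printf("read  %s\n", ops[i].arg); break;
            case OP_WRITE: printf("write %s\n", ops[i].arg); break;
            case OP_CLOSE: printf("close %s\n", ops[i].arg); break;
            }
        }
        return n;   /* number of operations completed */
    }

    int main(void)
    {
        /* One "call" that describes a whole copy instead of many calls. */
        struct cop copy_file[] = {
            { OP_OPEN,  "src" }, { OP_READ,  "src" },
            { OP_OPEN,  "dst" }, { OP_WRITE, "dst" },
            { OP_CLOSE, "src" }, { OP_CLOSE, "dst" },
        };
        int n = (int)(sizeof(copy_file) / sizeof(copy_file[0]));
        printf("completed %d ops in one crossing\n", run_compound(copy_file, n));
        return 0;
    }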

For more information, visit http://www.cs.sunysb.edu/~purohit.

ELASTIC QUOTAS

John Oscar, Columbia University

Oscar began by stating that most resources are flexible, but that to date disks have not been. While approximately 80 percent of files are short-lived, disk quotas are hard limits imposed by administrators. The idea behind elastic quotas is to have a non-permanent storage area that each user can temporarily use. Oscar suggested the creation of an /ehome directory structure to be used in deploying elastic quotas. Currently he has implemented elastic quotas as a stackable file system.

There were also several trends apparent in the data. Many users would ssh to their remote host (good) but then launch an email reader on the local machine, which would connect via POP to the same host and send the password in clear text (bad). This situation can easily be remedied by tunneling protocols like POP through SSH, but it appeared that many people were not aware this could be done. While most of his comments on the use of the wireless network were negative, the list of passwords he had collected showed that people were indeed using strong passwords. His recommendations were to educate and encourage people to use protocols like IPSec, SSH, and SSL when conducting work over a wireless network, because you never know who else is listening.

SYSTRACE

Niels Provos, University of Michigan

How can one be sure the applications one is using actually do exactly what their developers said they do? Short answer: you can't. People today are using a large number of complex applications, which means it is impossible to check each application thoroughly for security vulnerabilities. There are tools out there that can help, though. Provos has developed a tool called systrace that allows a user to generate a policy of acceptable system calls a particular application can make. If the application attempts to make a call that is not defined in the policy, the user is notified and allowed to choose an action. Systrace includes an auto-policy generation mechanism, which uses a training phase to record the actions of all programs the user executes. If the user chooses not to be bothered by applications breaking policy, systrace allows default enforcement actions to be set. Currently systrace is implemented for FreeBSD and NetBSD, with a Linux version coming out shortly.
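
The sketch below shows the general shape of such a per-application policy check: each system call is looked up in a table and permitted, denied, or referred to the user. It is a conceptual illustration only and does not use systrace's real policy format or interface.

    #include <stdio.h>
    #include <string.h>

    /* A per-application allow/deny/ask table, consulted for every call the
     * application attempts.  Entries here are invented for the example. */
    enum action { PERMIT, DENY, ASK_USER };

    struct rule { const char *syscall; enum action act; };

    static const struct rule policy[] = {
        { "read",   PERMIT },
        { "write",  PERMIT },
        { "open",   ASK_USER },   /* e.g. prompt with the file name */
        { "socket", DENY },
    };

    static enum action check(const char *syscall)
    {
        for (int i = 0; i < (int)(sizeof(policy) / sizeof(policy[0])); i++)
            if (strcmp(policy[i].syscall, syscall) == 0)
                return policy[i].act;
        return ASK_USER;          /* unknown call: fall back to the user */
    }

    int main(void)
    {
        const char *calls[] = { "read", "socket", "unlink" };

        for (int i = 0; i < 3; i++) {
            enum action a = check(calls[i]);
            printf("%-7s -> %s\n", calls[i],
                   a == PERMIT ? "permit" : a == DENY ? "deny" : "ask user");
        }
        return 0;
    }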


For more information, visit http://www.citi.umich.edu/u/provos/systrace.

VERIFIABLE SECRET REDISTRIBUTION FOR SURVIVABLE STORAGE SYSTEMS

Ted Wong, Carnegie Mellon University

Wong presented a protocol that can be used to re-create a file distributed over a set of servers, even if one of the servers is damaged. In his scheme, the user must choose how many shares the file is split into and the number of servers it will be stored across. The goal of this work is to provide persistent, secure storage of information even if it comes under attack. Wong stated that his design of the protocol was complete, and he is currently building a prototype implementation of it.
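
As a much simpler stand-in for the idea of splitting data into shares held by different servers, the sketch below does an n-of-n XOR split, where every share is needed to reconstruct; Wong's protocol is a verifiable threshold scheme (any m of n shares suffice) and is considerably more involved.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* n-of-n XOR secret splitting: the first n-1 shares are random and the
     * last is chosen so that the XOR of all shares equals the secret. */
    enum { NSHARES = 3 };

    static void split(const unsigned char *secret, size_t len,
                      unsigned char shares[NSHARES][64])
    {
        for (size_t i = 0; i < len; i++) {
            unsigned char acc = 0;
            for (int s = 0; s < NSHARES - 1; s++) {
                shares[s][i] = (unsigned char)(rand() & 0xff);
                acc ^= shares[s][i];
            }
            shares[NSHARES - 1][i] = secret[i] ^ acc;
        }
    }

    int main(void)
    {
        const char *msg = "billing-record-42";       /* made-up secret */
        size_t len = strlen(msg) + 1;
        unsigned char shares[NSHARES][64];
        unsigned char rebuilt[64];

        split((const unsigned char *)msg, len, shares);

        for (size_t i = 0; i < len; i++) {           /* XOR of all shares */
            rebuilt[i] = 0;
            for (int s = 0; s < NSHARES; s++)
                rebuilt[i] ^= shares[s][i];
        }
        printf("rebuilt: %s\n", (char *)rebuilt);
        return 0;
    }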

For more information, visit http://www.cs.cmu.edu/~tmwong/research.

THE GURU IS IN SESSIONS

LINUX ON LAPTOP/PDA

Bdale Garbee, HP Linux Systems Operation

Summarized by Brennan Reynolds

This was a roundtable-like discussion with Garbee acting as moderator. He is currently in charge of the Debian distribution of Linux and is involved in porting Linux to platforms other than i386. He has successfully helped port the kernel to the Alpha, SPARC, and ARM, and is currently working on a version to run on VAX machines.

A discussion was held on which file system is best for battery life. The Reiser and Ext3 file systems were discussed; people commented that when using Ext3 on their laptops the disk would never spin down and thus was always consuming a large amount of power. One suggestion was to use a RAM-based file system for any volume that requires a large number of writes and to only have the file system written to disk at infrequent intervals or when the machine is shut down or suspended.

A question was directed to Garbee about his knowledge of how the HP-Compaq merger would affect Linux. Garbee thought that the merger was good for Linux, both within the company and for the open source community as a whole. He stated that the new company would be the number one shipper of systems with Linux as the base operating system. He also fielded a question about the support of older HP servers and their ability to run Linux, saying that indeed Linux has been ported to them and he personally had several machines in his basement running it.

A question which sparked a large number of responses concerned problems people had with running Linux on mobile platforms. Most people actually did not have many problems at all. There were a few cases of a particular machine model not working, but there were only two widespread issues: docking stations and the new ACPI power management scheme. Docking stations still appear to be somewhat of a headache, but most people had developed ad hoc solutions for getting them to work. The ACPI power management scheme developed by Intel does not appear to have a quick solution. Ted Ts'o, one of the head kernel developers, stated that there are fundamental architectural problems with the 2.4 series kernel that do not easily allow ACPI to be added. However, he also stated that ACPI is supported in the newest development kernel, 2.5, and will be included in 2.6.

The final topic was the possibility of purchasing a laptop without paying any charge/tax for Microsoft Windows. The consensus was that currently this is not possible. Even if the machine comes with Linux or without any operating system, the vendors have contracts in place which require them to pay Microsoft for each machine they ship if that machine is capable of running a Microsoft operating system. The only way this will change is if the demand for other operating systems increases to the point that vendors renegotiate their contracts with Microsoft, and no one saw this happening in the near future.

LARGE CLUSTERS

Andrew Hume, AT&T Labs – Research

Summarized by Teri Lampoudi

This was a session on clusters, big data, and resilient computing. Given that the audience was primarily interested in clusters and not necessarily big data or beating nodes to a pulp, the conversation revolved mainly around what it takes to make a heterogeneous cluster resilient. Hume also presented a FREENIX paper on his current cluster project, which explained in more detail much of what was abstractly claimed in the guru session.

Hume's use of the word "cluster" referred not to what I had assumed would be a Beowulf-type system, but to a loosely coupled farm of machines in which concurrency was much less of an issue than it would be in a Beowulf. In fact, the architecture Hume described was designed to deal with large amounts of independent transaction processing, essentially the process of billing calls, which requires no interprocess message passing or anything of the sort a parallel scientific application might.

Where does the large data come in? Since the transactions in question consist of large numbers of flat files, a mechanism for getting the files onto nodes and off-loading the results is necessary. In this particular cluster, named Ningaui, this task is handled by the "replication manager," which generated a large amount of interest in the audience. All management functions in the cluster, as well as job allocation, constitute services that are to be bid for and leased out to nodes. Furthermore, failures are handled by simply repeating the failed task until it is completed properly, if that is possible, and if it is not, then looking through detailed logs to discover what went wrong. Logging and testing for the successful completion of each task are what make this model resilient. Every management operation is logged at some appropriate level of detail, generating 120MB/day, and every single file transfer is md5 checksummed. This was characterized by Hume as a "patient" style of management where no one controls the cluster; scheduling behavior is emergent rather than stringently planned, and nodes are free to drop in and out of the cluster. A downside to this model is the increase in latency, offset by the fact that the cluster continues to function unattended outside of 8-to-5 support hours.
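
A minimal sketch of this retry-and-verify loop is shown below; run_task() and verify_task() are stand-ins (the real system verifies transfers with md5 checksums), and the log messages are invented.

    #include <stdio.h>

    /* "Patient" management: run a task, verify the result, log the outcome,
     * and simply repeat on failure.  Here verification is pretended to
     * succeed on the third attempt. */
    static int attempts_left = 3;

    static void run_task(void)    { /* do the actual work here */ }
    static int  verify_task(void) { return --attempts_left <= 0; }

    int main(void)
    {
        for (int tries = 1; ; tries++) {
            run_task();
            if (verify_task()) {
                fprintf(stderr, "log: task completed after %d attempt(s)\n", tries);
                break;
            }
            fprintf(stderr, "log: attempt %d failed verification, retrying\n", tries);
            if (tries >= 10) {                 /* give up, leaving a detailed trail */
                fprintf(stderr, "log: giving up; see detailed logs\n");
                return 1;
            }
        }
        return 0;
    }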

In response to questions about the projected scalability of such a scheme, Hume said the system would presumably scale from the present 60 nodes to about 100 without modifications, but that scaling higher would be a matter for further design. The guru session ended on a somewhat unrelated note regarding the problems of getting data off mainframes and onto Linux machines – issues of fixed- vs. variable-length encoding of data blocks resulting in useful information being stripped by FTP, and problems in converting COBOL copybooks to something useful on a C-based architecture. To reiterate Hume's claim, there are homemade solutions for some of these problems, and the reader should feel free to contact Mr. Hume for them.

WRITING PORTABLE APPLICATIONS

Nick Stoughton, MSB Consultants

Summarized by Josh Lothian

Nick Stoughton addressed a concern that developers are facing more and more: writing portable applications. He proposed that there is no such thing as a "portable application," only those that have been ported. Developers are still in search of the Holy Grail of applications programming: a write-once, compile-anywhere program. Languages such as Java are leading up to this, but virtual machines still have to be ported, paths will still be different, etc. Web-based applications are another area that could be promising for portable applications.

Nick went on to talk about the future of portability. The recently released POSIX 2001 standards are expected to help the situation. The 2001 revision expands the original POSIX spec into seven volumes. Even though POSIX 2001 is more specific than the previous release, Nick pointed out that even this standard allows for small differences where vendors may define their own behaviors. This can only hurt developers in the long run. Along these lines, it was pointed out that there may be room for automated tools that have the capability to check application code and determine the level of portability and standards compliance of the source.

NETWORK MANAGEMENT SYSTEM PERFORMANCE TUNING

Jeff Allen, Tellme Networks, Inc.

Summarized by Matt Selsky

Jeff Allen, author of Cricket, discussed network tuning. The basic idea of tuning is: measure, twiddle, and measure again. Cricket can be used for large installations, but it's overkill for smaller installations, for which MRTG is better suited. You don't need Cricket to do measurement. Doing something like a shell script that dumps data to Excel is good, but you need to get some sort of measurement. Cricket was not designed for billing; it was meant to help answer questions.

Useful tools and techniques covered included: looking for the difference between machines in order to identify the causes for the differences; measuring, but also thinking about what you're measuring and why; and remembering to use strace and tcpdump to identify problems. Problems repeat themselves; performance tuning requires an intricate understanding of all the layers involved to solve the problem and simplify the automation. (Having a computer science background helps, but is not essential.) If you can reproduce the problem, you then should investigate each piece to find the bottleneck. If you can't reproduce the problem, then measure. You need to understand all the system interactions. Measurement can help determine whether things are actually slow or if the user is imagining it.

Troubleshooting begins with a hunch, but scientific processes are essential. You should be able to determine the event stream, or when each event occurs; having observation logs can help. Some hints provided include checking cron for periodic anomalies, slowing down /bin/rm to avoid I/O overload, and looking for unusual kernel CPU usage.

Also, try to use lightweight monitoring to reduce overhead. You don't need to monitor every resource on every system, but you should monitor those resources on those systems that are essential. Don't check things too often, since you can introduce overhead that way. Lots of information can be gathered from the /proc file system interface to the kernel.
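
As a minimal example of such lightweight monitoring, the following program reads the load average from a Linux-style /proc; a real monitor would poll a handful of such files on a timer, for only the systems and resources that matter.

    #include <stdio.h>

    /* Read the 1-, 5-, and 15-minute load averages from /proc/loadavg. */
    int main(void)
    {
        FILE *fp = fopen("/proc/loadavg", "r");
        double one, five, fifteen;

        if (fp == NULL) {
            perror("/proc/loadavg");      /* no Linux-style /proc here */
            return 1;
        }
        if (fscanf(fp, "%lf %lf %lf", &one, &five, &fifteen) == 3)
            printf("load averages: %.2f %.2f %.2f\n", one, five, fifteen);
        fclose(fp);
        return 0;
    }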

BSD BOF

Host: Kirk McKusick, Author and Consultant

Summarized by Bosko Milekic

The BSD Birds of a Feather session at USENIX started off with five quick and informative talks and ended with an exciting discussion of licensing issues.

Christos Zoulas, for the NetBSD Project, went over the primary goals of the NetBSD Project (namely portability, a clean architecture, and security) and then proceeded to briefly discuss some of the challenges that the project has encountered. The successes and areas for improvement of 1.6 were then examined. NetBSD has recently seen various improvements in its cross-build system, packaging system, and third-party tool support. For what concerns the kernel, a zero-copy TCP/UDP implementation has been integrated, and a new pmap code for arm32 has been introduced, as have some other rather major changes to various subsystems. More information is available in Christos's slides, which are now available at http://www.netbsd.org/gallery/events/usenix2002.

Theo de Raadt of the OpenBSD Project presented a rundown of various new things that the OpenBSD and OpenSSH teams have been looking at. Theo's talk included an amusing description (and pictures) of some of the events that transpired during OpenBSD's hack-a-thon, which occurred the week before USENIX '02. Finally, Theo mentioned that following complications with the IPFilter license, the OpenBSD team had done a fairly extensive license audit.

Robert Watson of the FreeBSD Project brought up various examples of FreeBSD being used in the real world. He went on to describe a wide variety of changes that will surface with the upcoming FreeBSD release 5.0. Notably, 5.0 will introduce a re-architecting of the kernel aimed at providing much more scalable support for SMP machines. 5.0 will also include an early implementation of KSE, FreeBSD's version of scheduler activations, large improvements to pccard, and significant framework bits from the TrustedBSD project.

Mike Karels from Wind River Systems presented an overview of how BSD/OS has been evolving following the Wind River Systems acquisition of BSDi's software assets. Ernie Prabhakar from Apple discussed Apple's OS X and its success on the desktop. He explained how OS X aims to bring BSD to the desktop while striving to maintain a strong and stable core, one that is heavily based on BSD technology.

The BoF finished with a discussion on licensing; specifically, some folks questioned the impact of Apple's licensing for code from their core OS (the Darwin Project) on the BSD community as a whole.

CLOSING SESSION

HOW FLIES FLY

Michael H. Dickinson, University of California, Berkeley

Summarized by JD Welch

In this lively and offbeat talk, Dickinson discussed his research into the mechanisms of flight in the fruit fly (Drosophila melanogaster) and related the autonomic behavior to that of a technological system. Insects are the most successful organisms on earth, due in no small part to their early adoption of flight. Flies travel through space on straight trajectories interrupted by saccades, jumpy turning motions somewhat analogous to the fast, jumpy movements of the human eye. Using a variety of monitoring techniques, including high-speed video in a "virtual reality flight arena," Dickinson and his colleagues have observed that flies respond to visual cues to decide when to saccade during flight. For example, more visual texture makes flies saccade earlier (i.e., further away from the arena wall), described by Dickinson as a "collision control algorithm."

Through a combination of sensory inputs, flies can make decisions about their flight path. Flies possess a visual "expansion detector" which, at a certain threshold, causes the animal to turn in a certain direction. However, expansion in front of the fly sometimes causes it to land. How does the fly decide? Using the virtual reality device to replicate various visual scenarios, Dickinson observed that the flies fixate on vertical objects that move back and forth across the field. Expansion on the left side of the animal causes it to turn right, and vice versa, while expansion directly in front of the animal triggers the legs to flare out in preparation for landing.

If the eyes detect these changes, how are the responses implemented? The flies' wings have "power muscles" controlled by mechanical resonance (as opposed to the nervous system), which drive the wings, combined with neurally activated "steering muscles," which change the configuration of the wing joints. Subtle variations in the timing of impulses correspond to changes in wing movement, controlled by sensors in the "wing-pit." A small wing-like structure, the haltere, controls equilibrium by beating constantly.

Voluntary control is accomplished by the halteres, whose signals can interfere with the autonomic control of the stroke cycle. The halteres have "steering muscles" as well, and information derived from the visual system can turn off the haltere or trick it into a "virtual" problem requiring a response.

Dickinson has also studied the aerodynamics of insect flight using a device called the "Robo Fly," an oversized mechanical insect wing suspended in a large tank of mineral oil. Interestingly, Dickinson observed that the average lift generated by the flies' wings is under its body weight; the flies use three mechanisms to overcome this, including rotating the wings (rotational lift).

Insects are extraordinarily robust creatures, and because Dickinson analyzed the problem in a systems-oriented way, these observations and analysis are immediately applicable to technology; the response system can be used as an efficient search algorithm for control systems in autonomous vehicles, for example.

THE AFS WORKSHOP

Summarized by Garry Zacheiss, MIT

Love Hornquist-Astrand of the Arla development team presented the Arla status report. Arla 0.35.8 has been released. Scheduled for release soon are improved support for Tru64 UNIX, MacOS X, and FreeBSD, improved volume handling, and implementation of more of the vos/pts subcommands. It was stressed that MacOS X is considered an important platform, and that a GUI configuration manager for the cache manager, a GUI ACL manager for the Finder, and a graphical login that obtains Kerberos tickets and AFS tokens at login time were all under development.

Future goals planned for the underlying AFS protocols include GSSAPI/SPNEGO support for Rx, performance improvements to Rx, an enhanced disconnected mode, and IPv6 support for Rx; an experimental patch is already available for the latter. Future Arla-specific goals include improved performance, partial file reading, and increased stability for several platforms. Work is also in progress on the RXGSS protocol extensions (integrating GSSAPI into Rx). A partial implementation exists, and work continues as developers find time.

Derrick Brashear of the OpenAFS development team presented the OpenAFS status report. OpenAFS 1.2.5 was released immediately prior to the conference; it fixed a remotely exploitable denial-of-service attack in several OpenAFS platforms, most notably IRIX and AIX. Future work planned for OpenAFS includes better support for MacOS X, including working around Finder interaction issues. Better support for the BSDs is also planned: FreeBSD has a partially implemented client; NetBSD and OpenBSD have only server programs available right now. AIX 5, Tru64 5.1A, and MacOS X 10.2 (a.k.a. Jaguar) are all planned for the future. Other planned enhancements include nested groups in the ptserver (code donated by the University of Michigan awaits integration), disconnected AFS, and further work on a native Windows client. Derrick stated that the guiding principles of OpenAFS were to maintain compatibility with IBM AFS, support new platforms, ease the administrative burden, and add new functionality.

AFS performance was discussed. The openafs.org cell consists of two file servers, one in Stockholm, Sweden, and one in Pittsburgh, PA. AFS works reasonably well over trans-Atlantic links, although OpenAFS clients don't determine which file server to talk to very effectively. Arla clients use RTTs to the server to determine the optimal file server to fetch replicated data from. Modifications to OpenAFS to support this behavior in the future are desired.

Jimmy Engelbrecht and Harald Barth of KTH discussed their AFSCrawler script, written to determine how many AFS cells and clients were in the world, what implementations/versions they were (Arla vs. IBM AFS vs. OpenAFS), and how much data was in AFS. The script unfortunately triggered a bug in IBM AFS 3.6-derived code, causing some clients to panic while handling a specific RPC. This has since been fixed in OpenAFS 1.2.5 and the most recent IBM AFS patch level, and all AFS users are strongly encouraged to upgrade. No release of Arla is vulnerable to this particular denial-of-service attack. There was an extended discussion of the usefulness of this exploration. Many sites believed this was useful information and such scanning should continue in the future, but only on an opt-in basis.

Many sites face the problem of managing Kerberos/AFS credentials for batch-scheduled jobs. Specifically, most batch processing software needs to be modified to forward tickets as part of the batch submission process, renew tickets and tokens while the job is in the queue and for the lifetime of the job, and properly destroy credentials when the job completes. Ken Hornstein of NRL was able to pay a commercial vendor to support Kerberos 4/5 credential management in their product, although they did not implement AFS token management. MIT has implemented some of the desired functionality in OpenPBS and might be able to make it available to other interested sites.

Tools to simplify AFS administration were discussed, including:

AFS Balancer: A tool written by CMU to automate the process of balancing disk usage across all servers in a cell. Available from ftp://ftp.andrew.cmu.edu/pub/AFS-Tools/balance-1.1b.tar.gz.

Themis: KTH's enhanced version of the AFS tool "package" for updating files on local disk from a central AFS image. KTH's enhancements include allowing the deletion of files, simplifying the process of adding a file, and allowing the merging of multiple rule sets for determining which files are updated. Themis is available from the Arla CVS repository.

Stanford was presented as an example of a large AFS site. Stanford's AFS usage consists of approximately 14TB of data in AFS, in the form of approximately 100,000 volumes; 33TB of storage is available in their primary cell, ir.stanford.edu. Their file servers consist entirely of Solaris machines running a combination of Transarc 3.6 patch-level 3 and OpenAFS 1.2.x, while their database servers run OpenAFS 1.2.2. Their cell consists of 25 file servers, using a combination of EMC and Sun StorEdge hardware. Stanford continues to use the kaserver for their authentication infrastructure, with future plans to migrate entirely to an MIT Kerberos 5 KDC.

Stanford has approximately 3,400 clients on their campus, not including SLAC (Stanford Linear Accelerator); approximately 2,100 AFS clients from outside Stanford contact their cell every month. Their supported clients are almost entirely IBM AFS 3.6 clients, although they plan to release OpenAFS clients soon. Stanford currently supports only UNIX clients. There is some on-campus presence of Windows clients, but Stanford has never publicly released or supported it. They do intend to release and support the MacOS X client in the near future.

All Stanford students, faculty, and staff are assigned AFS home directories, with a default quota of 50MB, for a total of approximately 550GB of user home directories. Other significant uses of AFS storage are data volumes for workstation software (400GB) and volumes for course Web sites and assignments (100GB).

AFS usage at Intel was also presented. Intel has been an AFS site since 1994. They had bad experiences with the IBM 3.5 Linux client; their experience with OpenAFS on Linux 2.4.x kernels has been much better. They use and are satisfied with the OpenAFS IA64 Linux port. Intel has hundreds of OpenAFS 1.2.3 and 1.2.4 clients in many production cells accessing data stored on IBM AFS file servers. They have not encountered any interoperability issues. Intel has some concerns about OpenAFS: they would like to purchase commercial support for OpenAFS and to see OpenAFS support for HP-UX on both PA-RISC and Itanium hardware. HP-UX support is currently unavailable due to a specific HP-UX header file being unavailable from HP; this may be available soon. Intel has not yet committed to migrating their file servers to OpenAFS and is unsure if it will do so without commercial support.

Backups are a traditional topic of discussion at AFS workshops, and this time was no exception. Many users complain that the traditional AFS backup tools ("backup" and "butc") are complex and difficult to automate, requiring many home-grown scripts and much user intervention for error recovery. An additional complaint was that the traditional AFS tools do not support file- or directory-level backups and restores; data must be backed up and restored at the volume level.

Mitch Collinsworth of Cornell presented work to make AMANDA, the free backup software from the University of Maryland, suitable for AFS backups. Using AMANDA for AFS backups allows one to share AFS backup tapes with non-AFS backups, easily run multiple backups in parallel, automate error recovery, and provide a robust degraded mode that prevents tape errors from stopping backups altogether. Their implementation allows for full volume restores as well as individual directory and file restores. They have finished coding this work and are in the process of testing and documenting it.

Peter Honeyman of CITI at the Univer-sity of Michigan spoke about work hehas proposed to replace Rx with RPC-SEC GSS in OpenAFS this would allowAFS to use a TCP-based transport mech-anism rather than the UDP-based Rxand possibly gain better congestion con-trol dynamic adaptation and fragmen-tation avoidance as a result RPCSECGSS uses the GSSAPI to authenticateSUN ONC RPC RPCSEC GSS is trans-port agnostic provides strong security isa developing Internet standard and hasmultiple open source implementationsBackward compatibility with existingAFS servers and clients is an importantgoal of this project

Page 21: inside - USENIX · 2019. 2. 25. · Bosko Milekic Juan Navarro David E. Ott Amit Purohit Brennan Reynolds Matt Selsky J.D. Welch Li Xiao Praveen Yalagandula Haijin Yan Gary Zacheiss

THINGS TO THINK ABOUT

Summarized by Bosko Milekic

SPEEDING UP KERNEL SCHEDULER BY REDUCING CACHE MISSES

Shuji Yamamura, Akira Hirai, Mitsuru Sato, Masao Yamamoto, Akira Naruse, Kouichi Kumon, Fujitsu Labs

This was an interesting talk pertaining to the effects of cache coloring for task structures in the Linux kernel scheduler (Linux kernel 2.4.x). The speaker first presented some interesting benchmark numbers for the Linux scheduler, showing that as the number of processes on the task queue was increased, the performance decreased. The authors used some really nifty hardware to measure the number of bus transactions throughout their tests and were thus able to reasonably quantify the impact that cache misses had in the Linux scheduler.

Their experiments led them to implement a cache coloring scheme for task structures, which were previously aligned on 8KB boundaries and therefore were eventually being mapped to the same cache lines. This unfortunate placement of task structures in memory induced a significant number of cache misses as the number of tasks grew in the scheduler.

The implemented solution consisted of aligning task structures to more evenly distribute cached entries across the L2 cache. The result was inevitably fewer cache misses in the scheduler. Some negative effects were observed in certain situations; these were primarily due to more cache slots being used by task-structure data in the scheduler, thus forcing data previously cached there to be pushed out.
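The coloring idea can be sketched outside the kernel. The fragment below is only an illustration of the technique, not the authors' patch; the 8KB block size matches the alignment described above, while the 64-byte line size and 16 colors are assumptions.

    #include <stdio.h>
    #include <stdlib.h>

    #define BLOCK_SIZE 8192   /* original alignment that caused the collisions */
    #define LINE_SIZE  64     /* assumed cache line size */
    #define NUM_COLORS 16     /* number of distinct offsets ("colors") to cycle through */

    struct task { long state; long counter; char other[512]; };

    /* Allocate an 8KB-aligned block, then place the structure at a per-call
     * "color" offset inside it, so consecutive tasks map to different cache
     * sets instead of all landing on the same lines. The block base would
     * have to be remembered for freeing; this sketch simply leaks it. */
    static struct task *alloc_task(void)
    {
        static unsigned next_color;
        size_t offset = (size_t)(next_color++ % NUM_COLORS) * LINE_SIZE;
        char *base = aligned_alloc(BLOCK_SIZE, BLOCK_SIZE);

        if (base == NULL)
            return NULL;
        return (struct task *)(base + offset);
    }

    int main(void)
    {
        for (int i = 0; i < 4; i++)
            printf("task %d at %p\n", i, (void *)alloc_task());
        return 0;
    }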

WORK-IN-PROGRESS REPORTS

Summarized by Brennan Reynolds

RESOURCE VIRTUALIZATION TECHNIQUES FOR WIDE-AREA OVERLAY NETWORKS

Kartik Gopalan, Stony Brook University

This work addressed the issue of provisioning a maximum number of virtual overlay networks (VONs) with diverse quality of service (QoS) requirements on a single physical network. Each of the VONs is logically isolated from the others to ensure the QoS requirements. Gopalan mentioned several critical research issues with this problem that are currently being investigated. Dealing with how to provision the network at various levels (link, route, or path) and then enforce the provisioning at run-time is one of the toughest challenges. Currently, Gopalan has developed several algorithms to handle admission control, end-to-end QoS, route selection, scheduling, and fault tolerance in the network.

For more information, visit http://www.ecsl.cs.sunysb.edu/.
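As a rough sketch of the admission-control piece (a generic additive-bandwidth check, not Gopalan's algorithm), a request is admitted on a candidate route only if every link still has enough residual capacity, which is then reserved:

    #include <stdio.h>

    /* Admit a VON request of `demand` Mb/s on a route described by the
     * residual capacities of its links; reserve the bandwidth on success. */
    static int admit(double residual[], int nlinks, double demand)
    {
        for (int i = 0; i < nlinks; i++)
            if (residual[i] < demand)
                return 0;               /* some link cannot carry the overlay */
        for (int i = 0; i < nlinks; i++)
            residual[i] -= demand;      /* provision the overlay on every link */
        return 1;
    }

    int main(void)
    {
        double route[] = { 100.0, 45.0, 80.0 };   /* Mb/s left on each hop */
        printf("first 30 Mb/s overlay admitted? %d\n", admit(route, 3, 30.0));
        printf("second 30 Mb/s overlay admitted? %d\n", admit(route, 3, 30.0)); /* fails: middle hop exhausted */
        return 0;
    }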

VISUALIZING SOFTWARE INSTABILITY

Jennifer Bevan, University of California, Santa Cruz

Detection of instability in software has typically been an afterthought. The point in the development cycle when the software is reviewed for instability is usually after it is difficult and costly to go back and perform major modifications to the code base. Bevan has developed a technique to allow the visualization of unstable regions of code that can be used much earlier in the development cycle. Her technique creates a time series of dependent graphs that include clusters and lines called fault lines. From the graphs, a developer is able to easily determine where the unstable sections of code are and proactively restructure them. She is currently working on a prototype implementation.

For more information, visit http://www.cse.ucsc.edu/~jbevan/evo_viz/.

RELIABLE AND SCALABLE PEER-TO-PEER WEB DOCUMENT SHARING

Li Xiao, William and Mary College

The idea presented by Xiao would allow end users to share the content of their Web browser caches with neighboring Internet users. The rationale for this is that today's end users are increasingly connected to the Internet over high-speed links, and the browser caches are becoming large storage systems. Therefore, if an individual accesses a page which does not exist in their local cache, Xiao is suggesting that they first query other end users for the content before trying to access the machine hosting the original. This strategy does have some serious problems associated with it that still need to be addressed, including ensuring the integrity of the content and protecting the identity and privacy of the end users.

SEREL FAST BOOTING FOR UNIX

Leni Mayo, Fast Boot Software

Serel is a tool that generates a visual representation of a UNIX machine's boot-up sequence. It can be used to identify the critical path and can show if a particular service or process blocks for an extended period of time. This information could be used to determine where the largest performance gains could be realized by tuning the order of execution at boot-up. Serel creates a dependency graph, expressed in XML, during boot-up. This graph is then used to create the visual representation. Currently the tool only works on POSIX-compliant systems, but Mayo stated that he would be porting it to other platforms. Other extensions that were mentioned included having the metadata include the use of shared libraries and monitoring the suspend/resume sequence of portable machines.

For more information, visit http://www.fastboot.org/.

BERKELEY DB XML

John Merrells, Sleepycat Software

Merrells gave a quick introduction and overview of the new XML library for Berkeley DB. The library specializes in storage and retrieval of XML content through a tool called XPath. The library allows for multiple containers per document and stores everything natively as XML. The user is also given a wide range of elements to create indices with, including edges, elements, text strings, or presence. The XPath tool consists of a query parser, generator, optimizer, and execution engine. To conclude his presentation, Merrells gave a live demonstration of the software.

For more information, visit http://www.sleepycat.com/xml/.

CLUSTER-ON-DEMAND (COD)

Justin Moore, Duke University

Modern clusters are growing at a rapid rate. Many have pushed beyond the 5000-machine mark, and deploying them results in large expenses as well as management and provisioning issues. Furthermore, if the cluster is "rented" out to various users, it is very time-consuming to configure it to a user's specs, regardless of how long they need to use it. The COD work presented creates dynamic virtual clusters within a given physical cluster. The goal was to have a provisioning tool that would automatically select a chunk of available nodes and install the operating system and middleware specified by the customer in a short period of time. This would allow a greater use of resources, since multiple virtual clusters can exist at once. By using a virtual cluster, the size can be changed dynamically. Moore stated that they have created a working prototype and are currently testing and benchmarking its performance.

For more information, visit http://www.cs.duke.edu/~justin/cod/.

CATACOMB

Elias Sinderson, University of California, Santa Cruz

Catacomb is a project to develop a database-backed DASL module for the Apache Web server. It was designed as a replacement for WebDAV. The initial release of the module only contains support for a MySQL database but could be extended to others. Sinderson briefly touched on the performance of the module: it was comparable to the mod_dav Apache module for all query types but search. The presentation was concluded with remarks about adding support for the lock method and including ACL specifications in the future.

For more information, visit http://ocean.cse.ucsc.edu/catacomb/.

SELF-ORGANIZING STORAGE

Dan Ellard, Harvard University

Ellard's presentation introduced a storage system that tuned itself based on the workload of the system without requiring the intervention of the user. The intelligence was implemented as a virtual self-organizing disk that resides below any file system. The virtual disk observes the access patterns exhibited by the system and then attempts to predict what information will be accessed next. An experiment was done using an NFS trace at a large ISP on one of their email servers. Ellard's self-organizing storage system worked well, which he attributes to the fact that most of the files being requested were large email boxes. Areas of future work include exploration of the length and detail of the data-collection stage as well as the CPU impact of running the virtual disk layer.
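A minimal way to picture such prediction, offered purely as an illustration and not as Ellard's design, is a first-order successor table: for each block, remember the block that most recently followed it and suggest it on the next access.

    #include <stdio.h>

    #define NBLOCKS 1024            /* size of the (toy) virtual disk */

    static int successor[NBLOCKS];  /* successor[b] = block seen right after b, or -1 */
    static int prev = -1;

    /* Record an access and return a prediction for the next block (-1 if none). */
    static int access_block(int block)
    {
        if (prev >= 0)
            successor[prev] = block;   /* learn: `block` followed `prev` */
        prev = block;
        return successor[block];       /* predict: what followed `block` last time */
    }

    int main(void)
    {
        for (int i = 0; i < NBLOCKS; i++)
            successor[i] = -1;

        int trace[] = { 7, 8, 9, 7, 8, 9, 7 };
        for (int i = 0; i < 7; i++)
            printf("access %d -> predict %d\n", trace[i], access_block(trace[i]));
        return 0;
    }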

VISUAL DEBUGGING

John Costigan, Virginia Tech

Costigan feels that the state of current debugging facilities in the UNIX world is not as good as it should be. He proposes the addition of several elements to programs like ddd and gdb. The first addition is including separate heap and stack components. Another is to display only certain fields of a data structure and have the ability to zoom in and out if needed. Finally, the debugger should provide the ability to visually present complex data structures, including linked lists, trees, etc. He said that a beta version of a debugger with these abilities is currently available.

For more information, visit http://infovis.cs.vt.edu/datastruct/.

IMPROVING APPLICATION PERFORMANCE THROUGH SYSTEM-CALL COMPOSITION

Amit Purohit, Stony Brook University

Web servers perform a huge number of context switches and internal data copying during normal operation. These two elements can drastically limit the performance of an application, regardless of the hardware platform it is run on. The Compound System Call (CoSy) framework is an attempt to reduce the performance penalty for context switches and data copies. It includes a user-level library and several kernel facilities that can be used via system calls. The library provides the programmer with a complete set of memory-management functions. Performance tests using the CoSy framework showed a large saving for context switches and data copies. The only area where the savings between conventional libraries and CoSy were negligible was for very large file copies.

For more information, visit http://www.cs.sunysb.edu/~purohit/.
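The summary does not show the CoSy API itself, so the fragment below simply illustrates the conventional POSIX pattern whose costs CoSy targets: a user-space copy loop in which every block pays for two kernel crossings and a pass through a user buffer.

    #include <fcntl.h>
    #include <unistd.h>

    int copy_file(const char *src, const char *dst)
    {
        char buf[65536];
        ssize_t n = 0;
        int in = open(src, O_RDONLY);
        int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (in < 0 || out < 0) {
            if (in >= 0) close(in);
            if (out >= 0) close(out);
            return -1;
        }
        /* Each 64KB block costs two system calls (two user/kernel context
         * switches) plus a copy through the user-space buffer; these are the
         * overheads a compound-call framework such as CoSy tries to batch away. */
        while ((n = read(in, buf, sizeof buf)) > 0)
            if (write(out, buf, (size_t)n) != n) { n = -1; break; }
        close(in);
        close(out);
        return n < 0 ? -1 : 0;
    }

    int main(int argc, char **argv)
    {
        return (argc == 3 && copy_file(argv[1], argv[2]) == 0) ? 0 : 1;
    }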

ELASTIC QUOTAS

John Oscar, Columbia University

Oscar began by stating that most resources are flexible, but that to date disks have not been. While approximately 80 percent of files are short-lived, disk quotas are hard limits imposed by administrators. The idea behind elastic quotas is to have a non-permanent storage area that each user can temporarily use. Oscar suggested the creation of an ehome directory structure to be used in deploying elastic quotas. Currently, he has implemented elastic quotas as a stackable file system.

There were also several trends apparent in the data. Many users would ssh to their remote host (good) but then launch an email reader on the local machine, which would connect via POP to the same host and send the password in clear text (bad). This situation can easily be remedied by tunneling protocols like POP through SSH, but it appeared that many people were not aware this could be done. While most of his comments on the use of the wireless network were negative, the list of passwords he had collected showed that people were indeed using strong passwords. His recommendations were to educate and encourage people to use protocols like IPSec, SSH, and SSL when conducting work over a wireless network, because you never know who else is listening.

SYSTRACE

Niels Provos, University of Michigan

How can one be sure the applications one is using actually do exactly what their developers said they do? Short answer: you can't. People today are using a large number of complex applications, which means it is impossible to check each application thoroughly for security vulnerabilities. There are tools out there that can help, though. Provos has developed a tool called systrace that allows a user to generate a policy of acceptable system calls a particular application can make. If the application attempts to make a call that is not defined in the policy, the user is notified and allowed to choose an action. Systrace includes an auto-policy generation mechanism, which uses a training phase to record the actions of all programs the user executes. If the user chooses not to be bothered by applications breaking policy, systrace allows default enforcement actions to be set. Currently systrace is implemented for FreeBSD and NetBSD, with a Linux version coming out shortly.

For more information, visit http://www.citi.umich.edu/u/provos/systrace/.

VERIFIABLE SECRET REDISTRIBUTION FOR SURVIVABLE STORAGE SYSTEMS

Ted Wong, Carnegie Mellon University

Wong presented a protocol that can be used to re-create a file distributed over a set of servers, even if one of the servers is damaged. In his scheme, the user must choose how many shares the file is split into and the number of servers it will be stored across. The goal of this work is to provide persistent, secure storage of information even if it comes under attack. Wong stated that his design of the protocol was complete, and he is currently building a prototype implementation of it.

For more information, visit http://www.cs.cmu.edu/~tmwong/research/.
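For readers unfamiliar with share-based storage, the toy n-of-n XOR split below shows the basic idea of distributing shares that are individually meaningless; it deliberately omits the thresholding and verifiability that distinguish Wong's protocol.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Trivial n-of-n XOR split: shares 0..n-2 are random, the last share is
     * the secret XORed with all of them. Any n-1 shares reveal nothing; all
     * n together reconstruct the secret. */
    static void split(const unsigned char *secret, size_t len,
                      unsigned char **shares, int n)
    {
        for (size_t i = 0; i < len; i++) {
            unsigned char acc = secret[i];
            for (int s = 0; s < n - 1; s++) {
                shares[s][i] = (unsigned char)rand();
                acc ^= shares[s][i];
            }
            shares[n - 1][i] = acc;
        }
    }

    static void combine(unsigned char **shares, int n, size_t len, unsigned char *out)
    {
        memset(out, 0, len);
        for (int s = 0; s < n; s++)
            for (size_t i = 0; i < len; i++)
                out[i] ^= shares[s][i];
    }

    int main(void)
    {
        const unsigned char secret[] = "afs-volume-block";
        size_t len = sizeof secret;
        unsigned char buf[3][sizeof secret], out[sizeof secret];
        unsigned char *shares[3] = { buf[0], buf[1], buf[2] };

        split(secret, len, shares, 3);
        combine(shares, 3, len, out);
        printf("recovered: %s\n", out);
        return 0;
    }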

THE GURU IS IN SESSIONS

LINUX ON LAPTOP/PDA

Bdale Garbee, HP Linux Systems Operation

Summarized by Brennan Reynolds

This was a roundtable-like discussion with Garbee acting as moderator. He is currently in charge of the Debian distribution of Linux and is involved in porting Linux to platforms other than i386. He has successfully helped port the kernel to the Alpha, Sparc, and ARM, and is currently working on a version to run on VAX machines.

A discussion was held on which file system is best for battery life. The ReiserFS and Ext3 file systems were discussed; people commented that when using Ext3 on their laptops the disk would never spin down and thus was always consuming a large amount of power. One suggestion was to use a RAM-based file system for any volume that requires a large number of writes and to only have the file system written to disk at infrequent intervals or when the machine is shut down or suspended.

A question was directed to Garbee about his knowledge of how the HP-Compaq merger would affect Linux. Garbee thought that the merger was good for Linux, both within the company and for the open source community as a whole. He stated that the new company would be the number one shipper of systems with Linux as the base operating system. He also fielded a question about the support of older HP servers and their ability to run Linux, saying that indeed Linux has been ported to them and he personally had several machines in his basement running it.

A question which sparked a large number of responses concerned problems people had with running Linux on mobile platforms. Most people actually did not have many problems at all. There were a few cases of a particular machine model not working, but there were only two widespread issues: docking stations and the new ACPI power management scheme. Docking stations still appear to be somewhat of a headache, but most people had developed ad hoc solutions for getting them to work. The ACPI power management scheme developed by Intel does not appear to have a quick solution. Ted Ts'o, one of the head kernel developers, stated that there are fundamental architectural problems with the 2.4 series kernel that do not easily allow ACPI to be added. However, he also stated that ACPI is supported in the newest development kernel, 2.5, and will be included in 2.6.

The final topic was the possibility of purchasing a laptop without paying any charge/tax for Microsoft Windows. The consensus was that currently this is not possible. Even if the machine comes with Linux or without any operating system, the vendors have contracts in place which require them to pay Microsoft for each machine they ship if that machine is capable of running a Microsoft operating system. The only way this will change is if the demand for other operating systems increases to the point that vendors renegotiate their contracts with Microsoft, and no one saw this happening in the near future.

LARGE CLUSTERS

Andrew Hume, AT&T Labs – Research

Summarized by Teri Lampoudi

This was a session on clusters, big data, and resilient computing. Given that the audience was primarily interested in clusters, and not necessarily big data or beating nodes to a pulp, the conversation revolved mainly around what it takes to make a heterogeneous cluster resilient. Hume also presented a FREENIX paper on his current cluster project, which explained in more detail much of what was abstractly claimed in the guru session.

Hume's use of the word "cluster" referred not to what I had assumed would be a Beowulf-type system, but to a loosely coupled farm of machines in which concurrency was much less of an issue than it would be in a Beowulf. In fact, the architecture Hume described was designed to deal with large amounts of independent transaction processing, essentially the process of billing calls, which requires no interprocess message passing or anything of the sort a parallel scientific application might.

Where does the large data come in? Since the transactions in question consist of large numbers of flat files, a mechanism for getting the files onto nodes and off-loading the results is necessary. In this particular cluster, named Ningaui, this task is handled by the "replication manager," which generated a large amount of interest in the audience. All management functions in the cluster, as well as job allocation, constitute services that are to be bid for and leased out to nodes. Furthermore, failures are handled by simply repeating the failed task until it is completed properly, if that is possible, and if it is not, then looking through detailed logs to discover what went wrong. Logging and testing for the successful completion of each task are what make this model resilient. Every management operation is logged at some appropriate level of detail, generating 120MB/day, and every single file transfer is md5 checksummed. This was characterized by Hume as a "patient" style of management, where no one controls the cluster, scheduling behavior is emergent rather than stringently planned, and nodes are free to drop in and out of the cluster; a downside to this model is the increase in latency, offset by the fact that the cluster continues to function unattended outside of 8-to-5 support hours.
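Checksumming each transfer end-to-end can be done with stock tools or a few lines of code; the sketch below uses OpenSSL's MD5 routines (an assumption, not Ningaui's actual implementation) to produce the digest that both ends of a file transfer would compare.

    #include <stdio.h>
    #include <openssl/md5.h>   /* link with -lcrypto */

    /* Compute the MD5 digest of a file, as sender and receiver might do on
     * both ends of a transfer before declaring the copy successful. */
    static int md5_file(const char *path, unsigned char digest[MD5_DIGEST_LENGTH])
    {
        unsigned char buf[1 << 16];
        size_t n;
        MD5_CTX ctx;
        FILE *f = fopen(path, "rb");

        if (f == NULL)
            return -1;
        MD5_Init(&ctx);
        while ((n = fread(buf, 1, sizeof buf, f)) > 0)
            MD5_Update(&ctx, buf, n);
        MD5_Final(digest, &ctx);
        fclose(f);
        return 0;
    }

    int main(int argc, char **argv)
    {
        unsigned char d[MD5_DIGEST_LENGTH];

        if (argc != 2 || md5_file(argv[1], d) != 0)
            return 1;
        for (int i = 0; i < MD5_DIGEST_LENGTH; i++)
            printf("%02x", d[i]);
        putchar('\n');
        return 0;
    }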

In response to questions about the projected scalability of such a scheme, Hume said the system would presumably scale from the present 60 nodes to about 100 without modifications, but that scaling higher would be a matter for further design. The guru session ended on a somewhat unrelated note regarding the problems of getting data off mainframes and onto Linux machines: issues of fixed- vs. variable-length encoding of data blocks resulting in useful information being stripped by FTP, and problems in converting COBOL copybooks to something useful on a C-based architecture. To reiterate Hume's claim, there are homemade solutions for some of these problems, and the reader should feel free to contact Mr. Hume for them.

WRITING PORTABLE APPLICATIONS

Nick Stoughton, MSB Consultants

Summarized by Josh Lothian

Nick Stoughton addressed a concern that developers are facing more and more: writing portable applications. He proposed that there is no such thing as a "portable application," only those that have been ported. Developers are still in search of the Holy Grail of applications programming: a write-once, compile-anywhere program. Languages such as Java are leading up to this, but virtual machines still have to be ported, paths will still be different, etc. Web-based applications are another area that could be promising for portable applications.

Nick went on to talk about the future of portability. The recently released POSIX 2001 standards are expected to help the situation. The 2001 revision expands the original POSIX spec into seven volumes. Even though POSIX 2001 is more specific than the previous release, Nick pointed out that even this standard allows for small differences where vendors may define their own behaviors. This can only hurt developers in the long run. Along these lines, it was pointed out that there may be room for automated tools that have the capability to check application code and determine the level of portability and standards compliance of the source.

NETWORK MANAGEMENT SYSTEM PERFORMANCE TUNING

Jeff Allen, Tellme Networks, Inc.

Summarized by Matt Selsky

Jeff Allen, author of Cricket, discussed network tuning. The basic idea of tuning is: measure, twiddle, and measure again. Cricket can be used for large installations, but it's overkill for smaller installations, for which MRTG is better suited. You don't need Cricket to do measurement. Doing something like a shell script that dumps data to Excel is good, but you need to get some sort of measurement. Cricket was not designed for billing; it was meant to help answer questions.

Useful tools and techniques covered included: looking for the difference between machines in order to identify the causes for the differences; measuring, but also thinking about what you're measuring and why; and remembering to use strace and tcpdump to identify problems. Problems repeat themselves; performance tuning requires an intricate understanding of all the layers involved to solve the problem and simplify the automation. (Having a computer science background helps, but is not essential.) If you can reproduce the problem, you then should investigate each piece to find the bottleneck. If you can't reproduce the problem, then measure. You need to understand all the system interactions. Measurement can help determine whether things are actually slow or if the user is imagining it.

Troubleshooting begins with a hunch, but scientific processes are essential. You should be able to determine the event stream, or when each event occurs; having observation logs can help. Some hints provided include checking cron for periodic anomalies, slowing down /bin/rm to avoid I/O overload, and looking for unusual kernel CPU usage.

Also, try to use lightweight monitoring to reduce overhead. You don't need to monitor every resource on every system, but you should monitor those resources on those systems that are essential. Don't check things too often, since you can introduce overhead that way. Lots of information can be gathered from the /proc file system interface to the kernel.
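As a small example of such lightweight monitoring, the following Linux program samples /proc/loadavg, one of the kernel's /proc interfaces mentioned above, with a single open and read per poll.

    #include <stdio.h>

    /* Lightweight sampling of /proc/loadavg: three load averages plus the
     * runnable/total task counts, read with one open and one read per poll. */
    int main(void)
    {
        double load1, load5, load15;
        int running, total;
        FILE *f = fopen("/proc/loadavg", "r");

        if (f == NULL)
            return 1;
        if (fscanf(f, "%lf %lf %lf %d/%d", &load1, &load5, &load15,
                   &running, &total) == 5)
            printf("load %.2f %.2f %.2f, %d of %d tasks runnable\n",
                   load1, load5, load15, running, total);
        fclose(f);
        return 0;
    }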

BSD BOF

Host: Kirk McKusick, Author and Consultant

Summarized by Bosko Milekic

The BSD Birds of a Feather session at USENIX started off with five quick and informative talks and ended with an exciting discussion of licensing issues.

Christos Zoulas, for the NetBSD Project, went over the primary goals of the NetBSD Project (namely portability, a clean architecture, and security) and then proceeded to briefly discuss some of the challenges that the project has encountered. The successes and areas for improvement of 1.6 were then examined. NetBSD has recently seen various improvements in its cross-build system, packaging system, and third-party tool support. For what concerns the kernel, a zero-copy TCP/UDP implementation has been integrated, and new pmap code for arm32 has been introduced, as have some other rather major changes to various subsystems. More information is available in Christos's slides, which are now available at http://www.netbsd.org/gallery/events/usenix2002/.

Theo de Raadt of the OpenBSD Project presented a rundown of various new things that the OpenBSD and OpenSSH teams have been looking at. Theo's talk included an amusing description (and pictures) of some of the events that transpired during OpenBSD's hack-a-thon, which occurred the week before USENIX '02. Finally, Theo mentioned that following complications with the IPFilter license, the OpenBSD team had done a fairly extensive license audit.

Robert Watson of the FreeBSD Project brought up various examples of FreeBSD being used in the real world. He went on to describe a wide variety of changes that will surface with the upcoming FreeBSD release 5.0. Notably, 5.0 will introduce a re-architecting of the kernel aimed at providing much more scalable support for SMP machines. 5.0 will also include an early implementation of KSE, FreeBSD's version of scheduler activations, large improvements to pccard, and significant framework bits from the TrustedBSD project.

Mike Karels from Wind River Systems presented an overview of how BSD/OS has been evolving following the Wind River Systems acquisition of BSDi's software assets. Ernie Prabhakar from Apple discussed Apple's OS X and its success on the desktop. He explained how OS X aims to bring BSD to the desktop while striving to maintain a strong and stable core, one that is heavily based on BSD technology.

The BoF finished with a discussion on licensing; specifically, some folks questioned the impact of Apple's licensing for code from their core OS (the Darwin Project) on the BSD community as a whole.

CLOSING SESSION

HOW FLIES FLY

Michael H. Dickinson, University of California, Berkeley

Summarized by JD Welch

In this lively and offbeat talk, Dickinson discussed his research into the mechanisms of flight in the fruit fly (Drosophila melanogaster) and related the autonomic behavior to that of a technological system. Insects are the most successful organisms on earth, due in no small part to their early adoption of flight. Flies travel through space on straight trajectories interrupted by saccades, jumpy turning motions somewhat analogous to the fast, jumpy movements of the human eye. Using a variety of monitoring techniques, including high-speed video in a "virtual reality flight arena," Dickinson and his colleagues have observed that flies respond to visual cues to decide when to saccade during flight. For example, more visual texture makes flies saccade earlier (i.e., further away from the arena wall), described by Dickinson as a "collision control algorithm."

Through a combination of sensory inputs, flies can make decisions about their flight path. Flies possess a visual "expansion detector" which, at a certain threshold, causes the animal to turn in a certain direction. However, expansion in front of the fly sometimes causes it to land. How does the fly decide? Using the virtual reality device to replicate various visual scenarios, Dickinson observed that the flies fixate on vertical objects that move back and forth across the field. Expansion on the left side of the animal causes it to turn right, and vice versa, while expansion directly in front of the animal triggers the legs to flare out in preparation for landing.

If the eyes detect these changes, how are the responses implemented? The flies' wings have "power muscles" controlled by mechanical resonance (as opposed to the nervous system), which drive the wings, combined with neurally activated "steering muscles," which change the configuration of the wing joints. Subtle variations in the timing of impulses correspond to changes in wing movement, controlled by sensors in the "wing-pit." A small wing-like structure, the haltere, controls equilibrium by beating constantly.

Voluntary control is accomplished by the halteres, whose signals can interfere with the autonomic control of the stroke cycle. The halteres have "steering muscles" as well, and information derived from the visual system can turn off the haltere or trick it into a "virtual" problem requiring a response.

Dickinson has also studied the aerodynamics of insect flight using a device called the "Robo Fly," an oversized mechanical insect wing suspended in a large tank of mineral oil. Interestingly, Dickinson observed that the average lift generated by the flies' wings is under their body weight; the flies use three mechanisms to overcome this, including rotating the wings (rotational lift).

Insects are extraordinarily robust creatures, and because Dickinson analyzed the problem in a systems-oriented way, these observations and analysis are immediately applicable to technology; the response system can be used as an efficient search algorithm for control systems in autonomous vehicles, for example.

THE AFS WORKSHOP

Summarized by Garry Zacheiss, MIT

Love Hornquist-Astrand of the Arla development team presented the Arla status report. Arla 0.35.8 has been released. Scheduled for release soon are improved support for Tru64 UNIX, MacOS X, and FreeBSD, improved volume handling, and implementation of more of the vos/pts subcommands. It was stressed that MacOS X is considered an important platform, and that a GUI configuration manager for the cache manager, a GUI ACL manager for the Finder, and a graphical login that obtains Kerberos tickets and AFS tokens at login time were all under development.

Future goals planned for the underlying AFS protocols include GSSAPI/SPNEGO support for Rx, performance improvements to Rx, an enhanced disconnected mode, and IPv6 support for Rx; an experimental patch is already available for the latter. Future Arla-specific goals include improved performance, partial file reading, and increased stability for several platforms. Work is also in progress for the RXGSS protocol extensions (integrating GSSAPI into Rx). A partial implementation exists, and work continues as developers find time.

Derrick Brashear of the OpenAFS development team presented the OpenAFS status report. OpenAFS was released immediately prior to the conference. OpenAFS 1.2.5 fixed a remotely exploitable denial-of-service attack in several OpenAFS platforms, most notably IRIX and AIX. Future work planned for OpenAFS includes better support for MacOS X, including working around Finder interaction issues. Better support for the BSDs is also planned: FreeBSD has a partially implemented client, while NetBSD and OpenBSD have only server programs available right now. AIX 5, Tru64 5.1A, and MacOS X 10.2 (aka Jaguar) are all planned for the future. Other planned enhancements include nested groups in the ptserver (code donated by the University of Michigan awaits integration), disconnected AFS, and further work on a native Windows client. Derrick stated that the guiding principles of OpenAFS were to maintain compatibility with IBM AFS, support new platforms, ease the administrative burden, and add new functionality.

AFS performance was discussed. The openafs.org cell consists of two file servers, one in Stockholm, Sweden, and one in Pittsburgh, PA. AFS works reasonably well over trans-Atlantic links, although OpenAFS clients don't determine which file server to talk to very effectively. Arla clients use RTTs to the server to determine the optimal file server to fetch replicated data from. Modifications to OpenAFS to support this behavior in the future are desired.

Jimmy Engelbrecht and Harald Barth of KTH discussed their AFSCrawler script, written to determine how many AFS cells and clients were in the world, what implementations/versions they were (Arla vs. IBM AFS vs. OpenAFS), and how much data was in AFS. The script unfortunately triggered a bug in IBM AFS 3.6-derived code, causing some clients to panic while handling a specific RPC. This has since been fixed in OpenAFS 1.2.5 and the most recent IBM AFS patch level, and all AFS users are strongly encouraged to upgrade. No release of Arla is vulnerable to this particular denial-of-service attack. There was an extended discussion of the usefulness of this exploration. Many sites believed this was useful information and such scanning should continue in the future, but only on an opt-in basis.

Many sites face the problem of managing Kerberos/AFS credentials for batch-scheduled jobs. Specifically, most batch-processing software needs to be modified to forward tickets as part of the batch submission process, renew tickets and tokens while the job is in the queue and for the lifetime of the job, and properly destroy credentials when the job completes. Ken Hornstein of NRL was able to pay a commercial vendor to support Kerberos 4/5 credential management in their product, although they did not implement AFS token management. MIT has implemented some of the desired functionality in OpenPBS and might be able to make it available to other interested sites.

Tools to simplify AFS administration were discussed, including:

AFS Balancer: A tool written by CMU to automate the process of balancing disk usage across all servers in a cell. Available from ftp://ftp.andrew.cmu.edu/pub/AFS-Tools/balance-1.1b.tar.gz.

Themis: Themis is KTH's enhanced version of the AFS tool "package" for updating files on local disk from a central AFS image. KTH's enhancements include allowing the deletion of files, simplifying the process of adding a file, and allowing the merging of multiple rule sets for determining which files are updated. Themis is available from the Arla CVS repository.

Stanford was presented as an example of a large AFS site. Stanford's AFS usage consists of approximately 14TB of data in AFS, in the form of approximately 100,000 volumes; 33TB of storage is available in their primary cell, ir.stanford.edu. Their file servers consist entirely of Solaris machines running a combination of Transarc 3.6 patch level 3 and OpenAFS 1.2.x, while their database servers run OpenAFS 1.2.2. Their cell consists of 25 file servers, using a combination of EMC and Sun StorEdge hardware. Stanford continues to use the kaserver for their authentication infrastructure, with future plans to migrate entirely to an MIT Kerberos 5 KDC.

Stanford has approximately 3,400 clients on their campus, not including SLAC (Stanford Linear Accelerator); approximately 2,100 AFS clients from outside Stanford contact their cell every month. Their supported clients are almost entirely IBM AFS 3.6 clients, although they plan to release OpenAFS clients soon. Stanford currently supports only UNIX clients. There is some on-campus presence of Windows clients, but Stanford has never publicly released or supported them. They do intend to release and support the MacOS X client in the near future.

All Stanford students, faculty, and staff are assigned AFS home directories, with a default quota of 50MB, for a total of approximately 550GB of user home directories. Other significant uses of AFS storage are data volumes for workstation software (400GB) and volumes for course Web sites and assignments (100GB).

AFS usage at Intel was also presented. Intel has been an AFS site since 1994. They had bad experiences with the IBM 3.5 Linux client; their experience with OpenAFS on Linux 2.4.x kernels has been much better. They use and are satisfied with the OpenAFS IA64 Linux port. Intel has hundreds of OpenAFS 1.2.3 and 1.2.4 clients in many production cells accessing data stored on IBM AFS file servers; they have not encountered any interoperability issues. Intel has some concerns about OpenAFS: they would like to purchase commercial support for OpenAFS and to see OpenAFS support for HP-UX on both PA-RISC and Itanium hardware. HP-UX support is currently unavailable due to a specific HP-UX header file being unavailable from HP; this may be available soon. Intel has not yet committed to migrating their file servers to OpenAFS and are unsure if they will do so without commercial support.

Backups are a traditional topic of discussion at AFS workshops, and this time was no exception. Many users complain that the traditional AFS backup tools ("backup" and "butc") are complex and difficult to automate, requiring many home-grown scripts and much user intervention for error recovery. An additional complaint was that the traditional AFS tools do not support file- or directory-level backups and restores; data must be backed up and restored at the volume level.

Mitch Collinsworth of Cornell presented work to make AMANDA, the free backup software from the University of Maryland, suitable for AFS backups. Using AMANDA for AFS backups allows one to share AFS backup tapes with non-AFS backups, easily run multiple backups in parallel, automate error recovery, and provide a robust degraded mode that prevents tape errors from stopping backups altogether. Their implementation allows for full volume restores as well as individual directory and file restores. They have finished coding this work and are in the process of testing and documenting it.

Peter Honeyman of CITI at the University of Michigan spoke about work he has proposed to replace Rx with RPCSEC GSS in OpenAFS; this would allow AFS to use a TCP-based transport mechanism rather than the UDP-based Rx, and possibly gain better congestion control, dynamic adaptation, and fragmentation avoidance as a result. RPCSEC GSS uses the GSSAPI to authenticate SUN ONC RPC. RPCSEC GSS is transport agnostic, provides strong security, is a developing Internet standard, and has multiple open source implementations. Backward compatibility with existing AFS servers and clients is an important goal of this project.

Page 22: inside - USENIX · 2019. 2. 25. · Bosko Milekic Juan Navarro David E. Ott Amit Purohit Brennan Reynolds Matt Selsky J.D. Welch Li Xiao Praveen Yalagandula Haijin Yan Gary Zacheiss

BERKELEY DB XML

John Merrells Sleepycat Software

Merrells gave a quick introduction andoverview of the new XML library forBerkeley DB The library specializes instorage and retrieval of XML contentthrough a tool called XPath The libraryallows for multiple containers per docu-ment and stores everything natively asXML The user is also given a wide rangeof elements to create indices withincluding edges elements text stringsor presence The XPath tool consists of aquery parser generator optimizer andexecution engine To conclude his pre-sentation Merrell gave a live demonstra-tion of the software

For more information visithttpwwwsleepycatcomxml

CLUSTER-ON-DEMAND (COD)

Justin Moore Duke University

Modern clusters are growing at a rapidrate Many have pushed beyond the 5000-machine mark and deployingthem results in large expenses as well asmanagement and provisioning issuesFurthermore if the cluster is ldquorentedrdquoout to various users it is very time-con-suming to configure it to a userrsquos specsregardless of how long they need to useit The COD work presented createsdynamic virtual clusters within a givenphysical cluster The goal was to have aprovisioning tool that would automati-cally select a chunk of available nodesand install the operating system andmiddleware specified by the customer ina short period of time This would allowa greater use of resources since multiplevirtual clusters can exist at once Byusing a virtual cluster the size can bechanged dynamically Moore stated thatthey have created a working prototypeand are currently testing and bench-marking its performance

For more information visithttpwwwcsdukeedu~justincod

82 Vol 27 No 5 login

CATACOMB

Elias Sinderson University of CaliforniaSanta Cruz

Catacomb is a project to develop a data-base-backed DASL module for theApache Web server It was designed as areplacement for the WebDAV The initialrelease of the module only contains sup-port for a MySQL database but could beextended to others Sinderson brieflytouched on the performance of hermodule It was comparable to themod_dav Apache module for all querytypes but search The presentation wasconcluded with remarks about addingsupport for the lock method and includ-ing ACL specifications in the future

For more information visithttpoceancseucsceducatacomb

SELF-ORGANIZING STORAGE

Dan Ellard Harvard University

Ellardrsquos presentation introduced a stor-age system that tuned itself based on theworkload of the system without requir-ing the intervention of the user Theintelligence was implemented as a vir-tual self-organizing disk that residesbelow any file system The virtual diskobserves the access patterns exhibited bythe system and then attempts to predictwhat information will be accessed nextAn experiment was done using an NFStrace at a large ISP on one of their emailservers Ellardrsquos self-organizing storagesystem worked well which he attributesto the fact that most of the files beingrequested were large email boxes Areasof future work include exploration ofthe length and detail of the data collec-tion stage as well as the CPU impact ofrunning the virtual disk layer

VISUAL DEBUGGING

John Costigan Virginia Tech

Costigan feels that the state of currentdebugging facilities in the UNIX worldis not as good as it should be He pro-poses the addition of several elements toprograms like ddd and gdb The firstaddition is including separate heap and

stack components Another is to displayonly certain fields of a data structureand have the ability to zoom in and outif needed Finally the debugger shouldprovide the ability to visually presentcomplex data structures includinglinked lists trees etc He said that a betaversion of a debugger with these abilitiesis currently available

For more information visit httpinfoviscsvtedudatastruct

IMPROVING APPLICATION PERFORMANCE

THROUGH SYSTEM-CALL COMPOSITION

Amit Purohit University of Stony Brook

Web servers perform a huge number ofcontext switches and internal data copy-ing during normal operation These twoelements can drastically limit the perfor-mance of an application regardless ofthe hardware platform it is run on TheCompound System Call (CoSy) frame-work is an attempt to reduce the perfor-mance penalty for context switches anddata copies It includes a user-levellibrary and several kernel facilities thatcan be used via system calls The libraryprovides the programmer with a com-plete set of memory-management func-tions Performance tests using the CoSyframework showed a large saving forcontext switches and data copies Theonly area where the savings betweenconventional libraries and CoSy werenegligible was for very large file copies

For more information visithttpwwwcssunysbedu~purohit

ELASTIC QUOTAS

John Oscar Columbia University

Oscar began by stating that mostresources are flexible but that to datedisks have not been While approxi-mately 80 percent of files are short-liveddisk quotas are hard limits imposed byadministrators The idea behind elasticquotas is to have a non-permanent stor-age area that each user can temporarilyuse Oscar suggested the creation of aehome directory structure to be used indeploying elastic quotas Currently he

has implemented elastic quotas as astackable file system There were alsoseveral trends apparent in the dataMany users would ssh to their remotehost (good) but then launch an emailreader on the local machine whichwould connect via POP to the same hostand send the password in clear text(bad) This situation can easily be reme-died by tunneling protocols like POPthrough SSH but it appeared that manypeople were not aware this could bedone While most of his comments onthe use of the wireless network werenegative the list of passwords he hadcollected showed that people wereindeed using strong passwords His rec-ommendations were to educate andencourage people to use protocols likeIPSec SSH and SSL when conductingwork over a wireless network becauseyou never know who else is listening

SYSTRACE

Niels Provos University of Michigan

How can one be sure the applicationsone is using actually do exactly whattheir developers said they do Shortanswer you canrsquot People today are usinga large number of complex applicationswhich means it is impossible to checkeach application thoroughly for securityvulnerabilities There are tools out therethat can help though Provos has devel-oped a tool called systrace that allows auser to generate a policy of acceptablesystem calls a particular application canmake If the application attempts tomake a call that is not defined in thepolicy the user is notified and allowed tochoose an action Systrace includes anauto-policy generation mechanismwhich uses a training phase to record theactions of all programs the user exe-cutes If the user chooses not to be both-ered by applications breaking policysystrace allows default enforcementactions to be set Currently systrace isimplemented for FreeBSD and NetBSDwith a Linux version coming out shortly

83October 2002 login

C

ON

FERE

NC

ERE

PORT

SFor more information visit httpwwwcitiumicheduuprovossystrace

VERIFIABLE SECRET REDISTRIBUTION FOR

SURVIVABLE STORAGE SYSTEMS

Ted Wong Carnegie Mellon University

Wong presented a protocol that can beused to re-create a file distributed over aset of servers even if one of the serversis damaged In his scheme the user mustchoose how many shares the file is splitinto and the number of servers it will bestored across The goal of this work is toprovide persistent secure storage ofinformation even if it comes underattack Wong stated that his design ofthe protocol was complete and he is cur-rently building a prototype implementa-tion of it

For more information visit httpwwwcscmuedu~tmwongresearch

THE GURU IS IN SESSIONS

LINUX ON LAPTOPPDA

Bdale Garbee HP Linux Systems Operation

Summarized by Brennan Reynolds

This was a roundtable-like discussionwith Garbee acting as moderator He iscurrently in charge of the Debian distri-bution of Linux and is involved in port-ing Linux to platforms other than i386He has successfully help port the kernelto the Alpha Sparc and ARM and iscurrently working on a version to runon VAX machines

A discussion was held on which file sys-tem is best for battery life The Riser andExt3 file systems were discussed peoplecommented that when using Ext3 ontheir laptops the disc would never spindown and thus was always consuming alarge amount of power One suggestionwas to use a RAM-based file system forany volume that requires a large numberof writes and to only have the file systemwritten to disk at infrequent intervals orwhen the machine is shut down or sus-pended

A question was directed to Garbee abouthis knowledge of how the HP-Compaqmerger would affect Linux Garbeethought that the merger was good forLinux both within the company and forthe open source community as a wholeHe stated that the new company wouldbe the number one shipper of systemswith Linux as the base operating systemHe also fielded a question about thesupport of older HP servers and theirability to run Linux saying that indeedLinux has been ported to them and hepersonally had several machines in hisbasement running it

A question which sparked a large num-ber of responses concerned problemspeople had with running Linux onmobile platforms Most people actuallydid not have many problems at allThere were a few cases of a particularmachine model not working but therewere only two widespread issues dock-ing stations and the new ACPI powermanagement scheme Docking stationsstill appear to be somewhat of aheadache but most people had devel-oped ad-hoc solutions for getting themto work The ACPI power managementscheme developed by Intel does notappear to have a quick solution TedTrsquoso one of the head kernel developersstated that there are fundamental archi-tectural problems with the 24 series ker-nel that do not easily allow ACPI to beadded However he also stated thatACPI is supported in the newest devel-opment kernel 25 and will be includedin 26

The final topic was the possibility ofpurchasing a laptop without paying anychargetax for Microsoft Windows Theconsensus was that currently this is notpossible Even if the machine comeswith Linux or without any operatingsystem the vendors have contracts inplace which require them to payMicrosoft for each machine they ship ifthat machine is capable of running aMicrosoft operating system The onlyway this will change is if the demand for

USENIX 2002

other operating systems increases to thepoint that vendors renegotiate their con-tracts with Microsoft and no one sawthis happening in the near future

LARGE CLUSTERS

Andrew Hume ATampT Labs ndash Research

Summarized by Teri Lampoudi

This was a session on clusters big dataand resilient computing Given that theaudience was primarily interested inclusters and not necessarily big data orbeating nodes to a pulp the conversa-tion revolved mainly around what ittakes to make a heterogeneous clusterresilient Hume also presented aFREENIX paper on his current clusterproject which explained in more detailmuch of what was abstractly claimed inthe guru session

Humersquos use of the word ldquoclusterrdquoreferred not to what I had assumedwould be a Beowulf-type system but to aloosely coupled farm of machines inwhich concurrency was much less of anissue than it would be in a Beowulf Infact the architecture Hume describedwas designed to deal with large amountsof independent transaction processingessentially the process of billing callswhich requires no interprocess messagepassing or anything of the sort a parallelscientific application might

Where does the large data come inSince the transactions in question con-sist of large numbers of flat files amechanism for getting the files ontonodes and off-loading the results is nec-essary In this particular cluster namedNingaui this task is handled by theldquoreplication managerrdquo which generateda large amount of interest in the audi-ence All management functions in thecluster as well as job allocation consti-tute services that are to be bid for andleased out to nodes Furthermore fail-ures are handled by simply repeating thefailed task until it is completed properlyif that is possible and if it is not thenlooking through detailed logs to discover

84 Vol 27 No 5 login

what went wrong Logging and testingfor the successful completion of eachtask are what make this model resilientEvery management operation is loggedat some appropriate level of detail gen-erating 120MBday and every single filetransfer is md5 checksummed This wascharacterized by Hume as a ldquopatientrdquostyle of management where no one con-trols the cluster scheduling behavior isemergent rather than stringentlyplanned and nodes are free to drop inand out of the cluster a downside to thismodel is the increase in latency offset bythe fact that the cluster continues tofunction unattended outside of 8-to-5support hours

In response to questions about the pro-jected scalability of such a schemeHume said the system would presum-ably scale from the present 60 nodes toabout 100 without modifications butthat scaling higher would be a matter forfurther design The guru session endedon a somewhat unrelated note regardingthe problems of getting data off main-frames and onto Linux machines ndashissues of fixed- vs variable-lengthencoding of data blocks resulting in use-ful information being stripped by FTPand problems in converting COBOLcopybooks to something useful on a C-based architecture To reiterate Humersquosclaim there are homemade solutions forsome of these problems and the readershould feel free to contact Mr Hume forthem

WRITING PORTABLE APPLICATIONS

Nick Stoughton MSB Consultants

Summarized by Josh Lothian

Nick Stoughton addressed a concernthat developers are facing more andmore writing portable applications Heproposed that there is no such thing as aldquoportable applicationrdquo only those thathave been ported Developers are still insearch of the Holy Grail of applicationsprogramming a write-once compile-anywhere program Languages such asJava are leading up to this but virtual

machines still have to be ported pathswill still be different etc Web-basedapplications are another area that couldbe promising for portable applications

Nick went on to talk about the future ofportability The recently released POSIX2001 standards are expected to help thesituation The 2001 revision expands theoriginal POSIX sepc into seven volumesEven though POSIX 2001 is more spe-cific than the previous release Nickpointed out that even this standardallows for small differences where ven-dors may define their own behaviorsThis can only hurt developers in thelong run Along these lines it waspointed out that there may be room forautomated tools that have the capabilityto check application code and determinethe level of portability and standardscompliance of the source

NETWORK MANAGEMENT SYSTEM

PERFORMANCE TUNING

Jeff Allen Tellme Networks Inc

Summarized by Matt Selsky

Jeff Allen author of Cricket discussednetwork tuning The basic idea of tuningis measure twiddle and measure againCricket can be used for large installa-tions but itrsquos overkill for smaller instal-lations for which Mrtg is better suitedYou donrsquot need Cricket to do measure-ment Doing something like a shellscript that dumps data to Excel is goodbut you need to get some sort of mea-surement Cricket was not designed forbilling it was meant to help answerquestions

Useful tools and techniques coveredincluded looking for the differencebetween machines in order to identifythe causes for the differences measuringbut also thinking about what yoursquoremeasuring and why remembering touse strace and tcpdump to identifyproblems Problems repeat themselvesperformance tuning requires an intricateunderstanding of all the layers involvedto solve the problem and simplify theautomation (Having a computer science

background helps but is not essential) Ifyou can reproduce the problem youthen should investigate each piece tofind the bottleneck If you canrsquot repro-duce the problem then measure Youneed to understand all the system inter-actions Measurement can help deter-mine whether things are actually slow orif the user is imagining it

Troubleshooting begins with a hunchbut scientific processes are essential Youshould be able to determine the eventstream or when each event occurs hav-ing observation logs can help Somehints provided include checking cron forperiodic anomalies slowing downbinrm to avoid IO overload and look-ing for unusual kernel CPU usage

Also try to use lightweight monitoringto reduce overhead You donrsquot need tomonitor every resource on every systembut you should monitor those resourceson those systems that are essential Donrsquotcheck things too often since you canintroduce overhead that way Lots ofinformation can be gathered from theproc file system interface to the kernel

BSD BOF

Host Kirk McKusick Author and Consultant

Summarized by Bosko Milekic

The BSD Birds of a Feather session atUSENIX started off with five quick andinformative talks and ended with anexciting discussion of licensing issues

Christos Zoulas for the NetBSD Projectwent over the primary goals of theNetBSD Project (namely portability aclean architecture and security) andthen proceeded to briefly discuss someof the challenges that the project hasencountered The successes and areas forimprovement of 16 were then exam-ined NetBSD has recently seen variousimprovements in its cross-build systempackaging system and third-party toolsupport For what concerns the kernel azero-copy TCPUDP implementationhas been integrated and a new pmap

85October 2002 login

C

ON

FERE

NC

ERE

PORT

Scode for arm32 has been introduced ashave some other rather major changes tovarious subsystems More information isavailable in Christosrsquos slides which arenow available at httpwwwnetbsdorggalleryeventsusenix2002

Theo de Raadt of the OpenBSD Projectpresented a rundown of various newthings that the OpenBSD and OpenSSHteams have been looking at Theorsquos talkincluded an amusing description (andpictures) of some of the events thattranspired during OpenBSDrsquos hack-a-thon which occurred the week beforeUSENIX rsquo02 Finally Theo mentionedthat following complications with theIPFilter license the OpenBSD team haddone a fairly extensive license audit

Robert Watson of the FreeBSD Projectbrought up various examples ofFreeBSD being used in the real worldHe went on to describe a wide variety ofchanges that will surface with theupcoming FreeBSD release 50 Notably50 will introduce a re-architecting ofthe kernel aimed at providing muchmore scalable support for SMPmachines 50 will also include an earlyimplementation of KSE FreeBSDrsquos ver-sion of scheduler activations largeimprovements to pccard and significantframework bits from the TrustedBSDproject

Mike Karels from Wind River Systemspresented an overview of how BSDOShas been evolving following the WindRiver Systems acquisition of BSDirsquos soft-ware assets Ernie Prabhakar from Applediscussed Applersquos OS X and its successon the desktop He explained how OS Xaims to bring BSD to the desktop whilestriving to maintain a strong and stablecore one that is heavily based on BSDtechnology

The BoF finished with a discussion onlicensing specifically some folks ques-tioned the impact of Applersquos licensingfor code from their core OS (the DarwinProject) on the BSD community as awhole

CLOSING SESSION

HOW FLIES FLY

Michael H Dickinson University ofCalifornia Berkeley

Summarized by JD Welch

In this lively and offbeat talk, Dickinson discussed his research into the mechanisms of flight in the fruit fly (Drosophila melanogaster) and related the autonomic behavior to that of a technological system. Insects are the most successful organisms on earth, due in no small part to their early adoption of flight. Flies travel through space on straight trajectories interrupted by saccades, jumpy turning motions somewhat analogous to the fast, jumpy movements of the human eye. Using a variety of monitoring techniques, including high-speed video in a "virtual reality flight arena," Dickinson and his colleagues have observed that flies respond to visual cues to decide when to saccade during flight. For example, more visual texture makes flies saccade earlier (i.e., further away from the arena wall), described by Dickinson as a "collision control algorithm."

Through a combination of sensory inputs, flies can make decisions about their flight path. Flies possess a visual "expansion detector" which, at a certain threshold, causes the animal to turn in a certain direction. However, expansion in front of the fly sometimes causes it to land. How does the fly decide? Using the virtual reality device to replicate various visual scenarios, Dickinson observed that the flies fixate on vertical objects that move back and forth across the field. Expansion on the left side of the animal causes it to turn right, and vice versa, while expansion directly in front of the animal triggers the legs to flare out in preparation for landing.

If the eyes detect these changes, how are the responses implemented? The flies' wings have "power muscles," controlled by mechanical resonance (as opposed to the nervous system), which drive the wings, combined with neurally activated "steering muscles," which change the configuration of the wing joints. Subtle variations in the timing of impulses correspond to changes in wing movement, controlled by sensors in the "wing-pit." A small wing-like structure, the haltere, controls equilibrium by beating constantly.

Voluntary control is accomplished by the halteres, whose signals can interfere with the autonomic control of the stroke cycle. The halteres have "steering muscles" as well, and information derived from the visual system can turn off the haltere or trick it into a "virtual" problem requiring a response.

Dickinson has also studied the aerodynamics of insect flight using a device called the "Robo Fly," an oversized mechanical insect wing suspended in a large tank of mineral oil. Interestingly, Dickinson observed that the average lift generated by the flies' wings is under its body weight; the flies use three mechanisms to overcome this, including rotating the wings (rotational lift).

Insects are extraordinarily robust creatures, and because Dickinson analyzed the problem in a systems-oriented way, these observations and analysis are immediately applicable to technology; the response system can be used as an efficient search algorithm for control systems in autonomous vehicles, for example.

THE AFS WORKSHOP

Summarized by Garry Zacheiss, MIT

Love Hornquist-Astrand of the Arla development team presented the Arla status report. Arla 0.35.8 has been released. Scheduled for release soon are improved support for Tru64 UNIX, MacOS X, and FreeBSD, improved volume handling, and implementation of more of the vos/pts subcommands. It was stressed that MacOS X is considered an important platform, and that a GUI configuration manager for the cache manager, a GUI ACL manager for the finder, and a graphical login that obtains Kerberos tickets and AFS tokens at login time were all under development.

Future goals planned for the underlying AFS protocols include GSSAPI/SPNEGO support for Rx, performance improvements to Rx, an enhanced disconnected mode, and IPv6 support for Rx; an experimental patch is already available for the latter. Future Arla-specific goals include improved performance, partial file reading, and increased stability for several platforms. Work is also in progress on the RXGSS protocol extensions (integrating GSSAPI into Rx). A partial implementation exists, and work continues as developers find time.

Derrick Brashear of the OpenAFS development team presented the OpenAFS status report. OpenAFS 1.2.5 was released immediately prior to the conference; it fixed a remotely exploitable denial-of-service attack in several OpenAFS platforms, most notably IRIX and AIX. Future work planned for OpenAFS includes better support for MacOS X, including working around Finder interaction issues. Better support for the BSDs is also planned: FreeBSD has a partially implemented client, while NetBSD and OpenBSD have only server programs available right now. AIX 5, Tru64 5.1A, and MacOS X 10.2 (a.k.a. Jaguar) are all planned for the future. Other planned enhancements include nested groups in the ptserver (code donated by the University of Michigan awaits integration), disconnected AFS, and further work on a native Windows client. Derrick stated that the guiding principles of OpenAFS were to maintain compatibility with IBM AFS, support new platforms, ease the administrative burden, and add new functionality.

AFS performance was discussed. The openafs.org cell consists of two file servers, one in Stockholm, Sweden, and one in Pittsburgh, PA. AFS works reasonably well over trans-Atlantic links, although OpenAFS clients don't determine which file server to talk to very effectively. Arla clients use RTTs to the server to determine the optimal file server to fetch replicated data from. Modifications to OpenAFS to support this behavior in the future are desired.
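
A minimal sketch of that lowest-RTT idea, under stated assumptions: the host names and port below are placeholders, and a real AFS client would time its own Rx/UDP traffic rather than a TCP connection attempt as done here.

    import socket
    import time

    def estimate_rtt(host, port=7000, timeout=2.0):
        # Crude RTT estimate: time how long a TCP connection attempt takes.
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            return float("inf")  # unreachable replicas sort last
        return time.monotonic() - start

    def pick_replica(hosts):
        # Prefer the replica with the smallest estimated round-trip time.
        return min(hosts, key=estimate_rtt)

    print(pick_replica(["afs1.example.org", "afs2.example.org"]))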

Jimmy Engelbrecht and Harald Barth of KTH discussed their AFSCrawler script, written to determine how many AFS cells and clients were in the world, what implementations/versions they were (Arla vs. IBM AFS vs. OpenAFS), and how much data was in AFS. The script unfortunately triggered a bug in IBM AFS 3.6–derived code, causing some clients to panic while handling a specific RPC. This has since been fixed in OpenAFS 1.2.5 and the most recent IBM AFS patch level, and all AFS users are strongly encouraged to upgrade. No release of Arla is vulnerable to this particular denial-of-service attack. There was an extended discussion of the usefulness of this exploration. Many sites believed this was useful information and such scanning should continue in the future, but only on an opt-in basis.

Many sites face the problem of managing Kerberos/AFS credentials for batch-scheduled jobs. Specifically, most batch processing software needs to be modified to forward tickets as part of the batch submission process, renew tickets and tokens while the job is in the queue and for the lifetime of the job, and properly destroy credentials when the job completes. Ken Hornstein of NRL was able to pay a commercial vendor to support Kerberos 4/5 credential management in their product, although they did not implement AFS token management. MIT has implemented some of the desired functionality in OpenPBS and might be able to make it available to other interested sites.
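
A hedged sketch of just the renewal part of that chore, assuming MIT Kerberos's kinit and the AFS aklog command are on the path; forwarding the initial ticket with the job and destroying credentials at completion, which the batch systems above also need, are left out.

    import subprocess
    import time

    def renew_loop(poll_seconds=3600):
        # Periodically renew the Kerberos TGT and refresh the AFS token
        # while the batch job is alive.  The interval is arbitrary.
        while True:
            subprocess.run(["kinit", "-R"], check=False)  # renew the ticket
            subprocess.run(["aklog"], check=False)        # re-obtain an AFS token
            time.sleep(poll_seconds)

    if __name__ == "__main__":
        renew_loop()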

Tools to simplify AFS administration were discussed, including:

AFS Balancer: A tool written by CMU to automate the process of balancing disk usage across all servers in a cell. Available from ftp://ftp.andrew.cmu.edu/pub/AFS-Tools/balance-1.1b.tar.gz.

Themis: Themis is KTH's enhanced version of the AFS tool "package" for updating files on local disk from a central AFS image. KTH's enhancements include allowing the deletion of files, simplifying the process of adding a file, and allowing the merging of multiple rule sets for determining which files are updated. Themis is available from the Arla CVS repository.

Stanford was presented as an example of a large AFS site. Stanford's AFS usage consists of approximately 14TB of data in AFS, in the form of approximately 100,000 volumes; 33TB of storage is available in their primary cell, ir.stanford.edu. Their file servers consist entirely of Solaris machines running a combination of Transarc 3.6 patch-level 3 and OpenAFS 1.2.x, while their database servers run OpenAFS 1.2.2. Their cell consists of 25 file servers using a combination of EMC and Sun StorEdge hardware. Stanford continues to use the kaserver for their authentication infrastructure, with future plans to migrate entirely to an MIT Kerberos 5 KDC.

Stanford has approximately 3,400 clients on their campus, not including SLAC (Stanford Linear Accelerator); approximately 2,100 AFS clients from outside Stanford contact their cell every month. Their supported clients are almost entirely IBM AFS 3.6 clients, although they plan to release OpenAFS clients soon. Stanford currently supports only UNIX clients. There is some on-campus presence of Windows clients, but Stanford has never publicly released or supported it. They do intend to release and support the MacOS X client in the near future.

All Stanford students, faculty, and staff are assigned AFS home directories, with a default quota of 50MB, for a total of approximately 550GB of user home directories. Other significant uses of AFS storage are data volumes for workstation software (400GB) and volumes for course Web sites and assignments (100GB).

AFS usage at Intel was also presented. Intel has been an AFS site since 1994. They had bad experiences with the IBM 3.5 Linux client; their experience with OpenAFS on Linux 2.4.x kernels has been much better. They use and are satisfied with the OpenAFS IA64 Linux port. Intel has hundreds of OpenAFS 1.2.3 and 1.2.4 clients in many production cells, accessing data stored on IBM AFS file servers. They have not encountered any interoperability issues. Intel has some concerns about OpenAFS: they would like to purchase commercial support for OpenAFS and to see OpenAFS support for HP-UX on both PA-RISC and Itanium hardware. HP-UX support is currently unavailable due to a specific HP-UX header file being unavailable from HP; this may be available soon. Intel has not yet committed to migrating their file servers to OpenAFS and are unsure if they will do so without commercial support.

Backups are a traditional topic of discussion at AFS workshops, and this time was no exception. Many users complain that the traditional AFS backup tools ("backup" and "butc") are complex and difficult to automate, requiring many home-grown scripts and much user intervention for error recovery. An additional complaint was that the traditional AFS tools do not support file- or directory-level backups and restores; data must be backed up and restored at the volume level.

Mitch Collinsworth of Cornell presented work to make AMANDA, the free backup software from the University of Maryland, suitable for AFS backups. Using AMANDA for AFS backups allows one to share AFS backup tapes with non-AFS backups, easily run multiple backups in parallel, automate error recovery, and provide a robust degraded mode that prevents tape errors from stopping backups altogether. Their implementation allows for full volume restores as well as individual directory and file restores. They have finished coding this work and are in the process of testing and documenting it.

Peter Honeyman of CITI at the University of Michigan spoke about work he has proposed to replace Rx with RPCSEC GSS in OpenAFS; this would allow AFS to use a TCP-based transport mechanism rather than the UDP-based Rx, and possibly gain better congestion control, dynamic adaptation, and fragmentation avoidance as a result. RPCSEC GSS uses the GSSAPI to authenticate SUN ONC RPC. RPCSEC GSS is transport agnostic, provides strong security, is a developing Internet standard, and has multiple open source implementations. Backward compatibility with existing AFS servers and clients is an important goal of this project.


has implemented elastic quotas as a stackable file system. There were also several trends apparent in the data. Many users would ssh to their remote host (good) but then launch an email reader on the local machine, which would connect via POP to the same host and send the password in clear text (bad). This situation can easily be remedied by tunneling protocols like POP through SSH, but it appeared that many people were not aware this could be done. While most of his comments on the use of the wireless network were negative, the list of passwords he had collected showed that people were indeed using strong passwords. His recommendations were to educate and encourage people to use protocols like IPSec, SSH, and SSL when conducting work over a wireless network, because you never know who else is listening.

SYSTRACE

Niels Provos, University of Michigan

How can one be sure the applications one is using actually do exactly what their developers said they do? Short answer: you can't. People today are using a large number of complex applications, which means it is impossible to check each application thoroughly for security vulnerabilities. There are tools out there that can help, though. Provos has developed a tool called systrace that allows a user to generate a policy of acceptable system calls a particular application can make. If the application attempts to make a call that is not defined in the policy, the user is notified and allowed to choose an action. Systrace includes an auto-policy generation mechanism, which uses a training phase to record the actions of all programs the user executes. If the user chooses not to be bothered by applications breaking policy, systrace allows default enforcement actions to be set. Currently systrace is implemented for FreeBSD and NetBSD, with a Linux version coming out shortly.
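
The following Python sketch imitates only the training-phase idea on Linux, by recording system calls with strace and flagging calls not seen during training. It is an illustration of the concept, not systrace itself, which defines a real policy language and enforces it in the kernel; the traced commands are arbitrary examples.

    import re
    import subprocess
    import tempfile

    # With "strace -f -o FILE", each line starts with an optional PID followed
    # by the system call name and its arguments.
    SYSCALL_RE = re.compile(r"^(?:\d+\s+)?([a-z_][a-z0-9_]*)\(")

    def record_syscalls(command):
        # Run `command` under strace and collect the set of syscall names seen.
        with tempfile.NamedTemporaryFile(mode="r", suffix=".trace") as trace:
            subprocess.run(["strace", "-f", "-o", trace.name] + command,
                           check=False)
            trace.seek(0)
            return {m.group(1) for line in trace
                    if (m := SYSCALL_RE.match(line))}

    if __name__ == "__main__":
        policy = record_syscalls(["ls", "/tmp"])        # training run
        later = record_syscalls(["ls", "-l", "/tmp"])   # run to be checked
        print("calls outside the training policy:", sorted(later - policy))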


For more information, visit http://www.citi.umich.edu/u/provos/systrace.

VERIFIABLE SECRET REDISTRIBUTION FOR SURVIVABLE STORAGE SYSTEMS

Ted Wong, Carnegie Mellon University

Wong presented a protocol that can be used to re-create a file distributed over a set of servers, even if one of the servers is damaged. In his scheme, the user must choose how many shares the file is split into and the number of servers it will be stored across. The goal of this work is to provide persistent, secure storage of information, even if it comes under attack. Wong stated that his design of the protocol was complete, and he is currently building a prototype implementation of it.

For more information, visit http://www.cs.cmu.edu/~tmwong/research.
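
This is not the redistribution protocol Wong presented (which can tolerate a damaged server); it is only a minimal n-of-n sketch of the basic share-splitting idea, where all n shares together rebuild the data and any fewer reveal nothing.

    import os
    from functools import reduce

    def xor_bytes(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def split(secret, n):
        # n-1 random shares plus one XOR share that binds them to the secret.
        shares = [os.urandom(len(secret)) for _ in range(n - 1)]
        shares.append(reduce(xor_bytes, shares, secret))
        return shares

    def combine(shares):
        return reduce(xor_bytes, shares)

    shares = split(b"file contents to protect", 5)
    assert combine(shares) == b"file contents to protect"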

THE GURU IS IN SESSIONS

LINUX ON LAPTOP/PDA

Bdale Garbee, HP Linux Systems Operation

Summarized by Brennan Reynolds

This was a roundtable-like discussion with Garbee acting as moderator. He is currently in charge of the Debian distribution of Linux and is involved in porting Linux to platforms other than i386. He has successfully helped port the kernel to the Alpha, Sparc, and ARM, and is currently working on a version to run on VAX machines.

A discussion was held on which file system is best for battery life. The Reiser and Ext3 file systems were discussed; people commented that when using Ext3 on their laptops, the disk would never spin down and thus was always consuming a large amount of power. One suggestion was to use a RAM-based file system for any volume that requires a large number of writes, and to only have the file system written to disk at infrequent intervals or when the machine is shut down or suspended.

A question was directed to Garbee about his knowledge of how the HP-Compaq merger would affect Linux. Garbee thought that the merger was good for Linux, both within the company and for the open source community as a whole. He stated that the new company would be the number one shipper of systems with Linux as the base operating system. He also fielded a question about the support of older HP servers and their ability to run Linux, saying that indeed Linux has been ported to them and he personally had several machines in his basement running it.

A question which sparked a large number of responses concerned problems people had with running Linux on mobile platforms. Most people actually did not have many problems at all. There were a few cases of a particular machine model not working, but there were only two widespread issues: docking stations and the new ACPI power management scheme. Docking stations still appear to be somewhat of a headache, but most people had developed ad-hoc solutions for getting them to work. The ACPI power management scheme developed by Intel does not appear to have a quick solution. Ted Ts'o, one of the head kernel developers, stated that there are fundamental architectural problems with the 2.4 series kernel that do not easily allow ACPI to be added. However, he also stated that ACPI is supported in the newest development kernel, 2.5, and will be included in 2.6.

The final topic was the possibility of purchasing a laptop without paying any charge/tax for Microsoft Windows. The consensus was that currently this is not possible. Even if the machine comes with Linux or without any operating system, the vendors have contracts in place which require them to pay Microsoft for each machine they ship if that machine is capable of running a Microsoft operating system. The only way this will change is if the demand for other operating systems increases to the point that vendors renegotiate their contracts with Microsoft, and no one saw this happening in the near future.

LARGE CLUSTERS

Andrew Hume, AT&T Labs – Research

Summarized by Teri Lampoudi

This was a session on clusters, big data, and resilient computing. Given that the audience was primarily interested in clusters, and not necessarily big data or beating nodes to a pulp, the conversation revolved mainly around what it takes to make a heterogeneous cluster resilient. Hume also presented a FREENIX paper on his current cluster project, which explained in more detail much of what was abstractly claimed in the guru session.

Hume's use of the word "cluster" referred not to what I had assumed would be a Beowulf-type system, but to a loosely coupled farm of machines in which concurrency was much less of an issue than it would be in a Beowulf. In fact, the architecture Hume described was designed to deal with large amounts of independent transaction processing, essentially the process of billing calls, which requires no interprocess message passing or anything of the sort a parallel scientific application might.

Where does the large data come in? Since the transactions in question consist of large numbers of flat files, a mechanism for getting the files onto nodes and off-loading the results is necessary. In this particular cluster, named Ningaui, this task is handled by the "replication manager," which generated a large amount of interest in the audience. All management functions in the cluster, as well as job allocation, constitute services that are to be bid for and leased out to nodes. Furthermore, failures are handled by simply repeating the failed task until it is completed properly, if that is possible, and if it is not, then looking through detailed logs to discover what went wrong. Logging and testing for the successful completion of each task are what make this model resilient. Every management operation is logged at some appropriate level of detail, generating 120MB/day, and every single file transfer is md5 checksummed. This was characterized by Hume as a "patient" style of management, where no one controls the cluster; scheduling behavior is emergent rather than stringently planned, and nodes are free to drop in and out of the cluster. A downside to this model is the increase in latency, offset by the fact that the cluster continues to function unattended outside of 8-to-5 support hours.
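
An illustrative sketch (not Ningaui code) of the two habits just described: log each operation and md5-checksum every file transfer so a bad copy is detected and simply retried; the file names are placeholders.

    import hashlib
    import logging
    import shutil

    logging.basicConfig(filename="transfers.log", level=logging.INFO)

    def md5sum(path, chunk=1 << 20):
        digest = hashlib.md5()
        with open(path, "rb") as f:
            while block := f.read(chunk):
                digest.update(block)
        return digest.hexdigest()

    def transfer(src, dst, retries=3):
        # Copy, verify the checksum, and retry on mismatch, logging each step.
        for attempt in range(1, retries + 1):
            shutil.copyfile(src, dst)
            if md5sum(src) == md5sum(dst):
                logging.info("copied %s -> %s (attempt %d)", src, dst, attempt)
                return True
            logging.warning("checksum mismatch for %s, retrying", dst)
        return False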

In response to questions about the projected scalability of such a scheme, Hume said the system would presumably scale from the present 60 nodes to about 100 without modifications, but that scaling higher would be a matter for further design. The guru session ended on a somewhat unrelated note regarding the problems of getting data off mainframes and onto Linux machines: issues of fixed- vs. variable-length encoding of data blocks resulting in useful information being stripped by FTP, and problems in converting COBOL copybooks to something useful on a C-based architecture. To reiterate Hume's claim, there are homemade solutions for some of these problems, and the reader should feel free to contact Mr. Hume for them.

WRITING PORTABLE APPLICATIONS

Nick Stoughton, MSB Consultants

Summarized by Josh Lothian

Nick Stoughton addressed a concern that developers are facing more and more: writing portable applications. He proposed that there is no such thing as a "portable application," only those that have been ported. Developers are still in search of the Holy Grail of applications programming: a write-once, compile-anywhere program. Languages such as Java are leading up to this, but virtual machines still have to be ported, paths will still be different, etc. Web-based applications are another area that could be promising for portable applications.

Nick went on to talk about the future of portability. The recently released POSIX 2001 standards are expected to help the situation. The 2001 revision expands the original POSIX spec into seven volumes. Even though POSIX 2001 is more specific than the previous release, Nick pointed out that even this standard allows for small differences where vendors may define their own behaviors. This can only hurt developers in the long run. Along these lines, it was pointed out that there may be room for automated tools that have the capability to check application code and determine the level of portability and standards compliance of the source.

NETWORK MANAGEMENT SYSTEM PERFORMANCE TUNING

Jeff Allen, Tellme Networks, Inc.

Summarized by Matt Selsky

Jeff Allen, author of Cricket, discussed network tuning. The basic idea of tuning is: measure, twiddle, and measure again. Cricket can be used for large installations, but it's overkill for smaller installations, for which Mrtg is better suited. You don't need Cricket to do measurement. Doing something like a shell script that dumps data to Excel is good, but you need to get some sort of measurement. Cricket was not designed for billing; it was meant to help answer questions.

Useful tools and techniques covered included looking for the difference between machines in order to identify the causes for the differences; measuring, but also thinking about what you're measuring and why; and remembering to use strace and tcpdump to identify problems. Problems repeat themselves; performance tuning requires an intricate understanding of all the layers involved to solve the problem and simplify the automation (having a computer science background helps, but is not essential).
