Attacking Data Intensive Science with Distributed Computing
Attacking Data Intensive Science with Distributed Computing
Prof. Douglas Thain
University of Notre Dame
http://www.cse.nd.edu/~dthain
Outline
Large Scale Distributed Computing
– Plentiful Computing Resources World-Wide
– Challenges: Data and Debugging
The Cooperative Computing Lab
– Distributed Data Management
– Applications to Scientific Computing
– Debugging Complex Systems
Open Problems in Distributed Computing
– Proposal: The All-Pairs Abstraction
Plentiful Computing Power
As of 04 Sep 2006:
TeraGrid
– 21,972 CPUs / 220 TB / 6 sites
Open Science Grid
– 21,156 CPUs / 83 TB / 61 sites
Condor Worldwide
– 96,352 CPUs / 1608 sites
At Notre Dame
– CRC: 500 CPUs
– BOB: 212 CPUs
– Lots of little clusters!
Who is using all of this? Anyone with unlimited computing needs!
High Energy Physics:
– Simulating the detector of a particle accelerator before turning it on allows one to understand the output.
Biochemistry:
– Simulate complex molecules under different forces to understand how they fold/mate/react.
Biometrics:
– Given a large database of human images, evaluate matching algorithms by comparing all to all.
Climatology:
– Given a starting global climate, simulate how the climate develops under varying assumptions or events.
Buzzwords
Distributed Computing
Cluster Computing
Beowulf
Grid Computing
Utility Computing
Something@Home
= A bunch of computers.
Some Outstanding Successes
TeraGrid:
– The AMANDA project uses 1000s of CPUs over months to calibrate and process data from a neutrino telescope.
PlanetLab:
– Hundreds of nodes used to test and validate a wide variety of distributed and P2P systems: Chord, Pastry, etc.
Condor:
– The MetaNEOS project solves a 30-year-old optimization problem using brute force on 1000 heterogeneous CPUs across multiple sites over several weeks.
Seti@Home:
– Millions of CPUs used to analyze celestial signals.
And now the bad news...
Large distributed systems fall to pieces when you have lots of data!
Example: Grid3 (OSG)
Robert Gardner, et al. (102 authors), "The Grid3 Production Grid: Principles and Practice," IEEE HPDC 2004.
"The Grid2003 Project has deployed a multi-virtual organization, application-driven grid laboratory that has sustained for several months the production-level services required by... ATLAS, CMS, SDSS, LIGO..."
Problem: Data Management
The good news:
– 27 sites with 2800 CPUs
– 40985 CPU-days provided over 6 months
– 10 applications with 1300 simultaneous jobs
The bad news:
– 40-70 percent utilization
– 30 percent of jobs would fail
– 90 percent of failures were site problems
– Most site failures were due to disk space!
Problem: Debugging
"Most groups reported problems in which a job had been submitted... and something had not performed correctly, but they were unable to determine where, why, or how to fix that problem..."
Jennifer Schopf and Steven Newhouse, "State of Grid Users: 25 Conversations with UK eScience Users," Argonne National Lab Tech Report ANL/MCS-TM-278, 2004.
Both Problems: Debugging I/O
A user submits 1000 jobs to a grid.
Each requires 1 GB of input.
100 start at once. (Quite normal.)
The interleaved transfers all fail.
The "robust" system retries...
(Happened last week in this department!)
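To make the failure mode concrete, here is a small back-of-the-envelope sketch (not from the talk) of what happens when 100 of those 1 GB transfers share one file server; the server bandwidth and client timeout below are assumed values, not measurements.

```python
# Sketch: why 100 interleaved 1 GB transfers from one file server can all
# fail. Every transfer shares the server's uplink, so each one slows down
# until it trips the client's timeout. Bandwidth and timeout are assumptions.

GB = 1024 ** 3
MB = 1024 ** 2

input_size = 1 * GB          # each job needs 1 GB of input
server_bandwidth = 100 * MB  # assumed server uplink: 100 MB/s
client_timeout = 600         # assumed transfer timeout in seconds
concurrent_jobs = 100        # transfers that start at once

per_transfer_bw = server_bandwidth / concurrent_jobs
transfer_time = input_size / per_transfer_bw

print(f"per-transfer bandwidth: {per_transfer_bw / MB:.1f} MB/s")
print(f"time to move 1 GB:      {transfer_time:.0f} s")
print("transfer fails" if transfer_time > client_timeout else "transfer succeeds")
# With these numbers each transfer needs ~1000 s, all 100 time out, and a
# "robust" retry policy simply repeats the stampede.
```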
A Common Thread:
Each of these problems:
– "I can't make storage do what I want!"
– "I have no idea why this system is failing!"
Arises from the following:
– Both service providers and users are lacking the tools and models that they need to harness and analyze complex environments.
Outline
Large Scale Distributed Computing
– Plentiful Computing Resources World-Wide
– Challenges: Data and Debugging
The Cooperative Computing Lab
– Distributed Data Management
– Applications to Scientific Computing
– Debugging Complex Systems
Open Problems in Distributed Computing
– Proposal: The All-Pairs Abstraction
Cooperative Computing Lab at the University of Notre Dame
Basic Computer Science Research
– Overlapping categories: operating systems, distributed systems, grid computing, filesystems, databases...
Modest Local Operation
– 300 CPUs, 20 TB of storage, 6 stakeholders
– Keeps us honest + eat our own dog food.
Software Development and Publication
– http://www.cctools.org
– Students learn engineering as well as science.
Collaboration with External Users
– High energy physics, bioinformatics, molecular dynamics...
http://www.cse.nd.edu/~ccl
Computing Environment
[Diagram: a Condor pool spanning the Fitzpatrick Workstation Cluster, the CCL Research Cluster, the CVRL Research Cluster, and miscellaneous CSE workstations, each node with its own CPU and disk. The Condor MatchMaker places queued jobs on machines according to owner policies such as "I will only run jobs when there is no-one working at the keyboard," "I will only run jobs between midnight and 8 AM," and "I prefer to run a job submitted by a CCL student."]
CPU History
Storage History
Flocking Between Universities
Notre Dame: 300 CPUs
Wisconsin: 1200 CPUs
Purdue A: 541 CPUs
Purdue B: 1016 CPUs
http://www.cse.nd.edu/~ccl/operations/condor/
Problems and Solutions
"I can't make storage do what I want!"
– Need root access, configure, reboot, etc...
– Solution: Tactical Storage Systems
"I have no idea why this system is failing!"
– Multiple services, unreliable networks...
– Solution: Debugging Via Data Mining
Why is Storage Hard?
Easy within one cluster:
– Shared filesystem on all nodes.
– But, limited to a few disks provided by the admin.
– Even a "macho" file server has limited bandwidth.
Terrible across two or more clusters:
– No shared filesystem on all nodes.
– Too hard to move data back and forth.
– Limited to using storage on head nodes.
– Unable to become root to configure.
Conventional Clusters
[Diagram: two clusters of CPU-only compute nodes, each served by a single shared disk.]
Tactical Storage Systems (TSS)
A TSS allows any node to serve as a file server or as a file system client.
All components can be deployed without special privileges – but with standard grid security (GSI).
Users can build up complex structures.
– Filesystems, databases, caches, ...
– Admins need not know/care about larger structures.
Takes advantage of two resources:
– Total storage (200 disks yields 20 TB)
– Total bandwidth (200 disks at 10 MB/s = 2 GB/s)
Tactical Storage System
[Diagram: the same two clusters, but every node contributes both its CPU and its disk, and disks are grouped into logical volumes.]
1 – Uniform access between any nodes in either cluster.
2 – Ability to group together multiple disks for a common purpose.
Tactical Storage Structures
[Diagram: three structures built from applications, adapters, and disk servers.]
– WAN File System: an application's adapter talks to a remote disk server, secured by grid GSI credentials.
– Replicated File System: several applications' adapters read from replicated disk servers – scalable bandwidth for small data.
– Expandable File System: one adapter groups several disk servers into a logical volume – scalable capacity/bandwidth for large data.
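As a rough illustration of the expandable-filesystem structure above, the sketch below shows how an adapter might present one logical volume backed by several unprivileged disk servers. This is not the actual TSS code; the server names and the hash-based placement policy are assumptions made for the example.

```python
# Sketch of the "expandable file system" idea: an adapter presents one
# logical volume and spreads files across several user-level disk servers.

import hashlib

class DiskServer:
    """Stand-in for a user-level file server running on one cluster node."""
    def __init__(self, name):
        self.name = name
        self.files = {}          # path -> bytes

    def put(self, path, data):
        self.files[path] = data

    def get(self, path):
        return self.files[path]

class LogicalVolume:
    """Adapter-side view: one namespace backed by many disk servers."""
    def __init__(self, servers):
        self.servers = servers

    def _pick(self, path):
        # Deterministic placement: hash the path to choose a server,
        # so any adapter can locate the file without a central catalog.
        h = int(hashlib.sha1(path.encode()).hexdigest(), 16)
        return self.servers[h % len(self.servers)]

    def write(self, path, data):
        self._pick(path).put(path, data)

    def read(self, path):
        return self._pick(path).get(path)

volume = LogicalVolume([DiskServer(f"node{i:02d}.cluster.example") for i in range(4)])
volume.write("/grand/2006-09-04/run1.dat", b"muon counts ...")
print(volume.read("/grand/2006-09-04/run1.dat"))
```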
Applications and Examples
Bioinformatics:
– A WAN filesystem for BLAST on the EGEE grid.
Atmospheric Physics:
– A cluster filesystem for scalable data analysis.
Biometrics:
– Distributed I/O for high-throughput image comparison.
Molecular Dynamics:
– GEMS: a scalable distributed database.
High Energy Physics:
– Global access to software distributions.
Simple Wide Area File System
Bioinformatics on the European Grid
– Users want to run BLAST on standard databases.
– Cannot copy every database to every node of the grid!
Many databases of biological data in different formats around the world:
– Archives: Swiss-Prot, TrEMBL, NCBI, etc...
– Replicas: Public, Shared, Private, ???
Goal: Refer to data objects by logical name.
– Access the nearest copy of the non-redundant protein database; don't care where it is.
Credit: Christophe Blanchet, Bioinformatics Center of Lyon, CNRS IBCP, France
http://gbio.ibcp.fr/cblanchet, [email protected]
Wide Area File System
[Diagram: BLAST runs over an adapter that speaks RFIO, gLite, HTTP, and FTP to servers on the grid.]
Run BLAST on LFN://ncbi.gov/nr.data:
1. The application calls open(LFN://ncbi.gov/nr.data).
2. The adapter asks the EGEE file location service: Where is LFN://ncbi.gov/nr.data?
3. The service answers: Find it at FTP://ibcp.fr/nr.data.
4. The adapter calls open(FTP://ibcp.fr/nr.data) and retrieves nr.data from the FTP server.
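The sketch below walks through that same resolution flow in code. It is not the gLite or EGEE API; the replica catalog simply hard-codes the example from the figure, and a real service would rank replicas by locality.

```python
# Sketch of the name-resolution flow: the adapter traps an open() on a
# logical file name, asks a file location service for replicas, and opens
# the nearest physical copy.

REPLICA_CATALOG = {
    # logical file name       -> physical replicas (nearest first)
    "LFN://ncbi.gov/nr.data": ["FTP://ibcp.fr/nr.data",
                               "HTTP://mirror.example.org/nr.data"],
}

def resolve(lfn):
    """Ask the (simulated) file location service where the data lives."""
    replicas = REPLICA_CATALOG.get(lfn)
    if not replicas:
        raise FileNotFoundError(lfn)
    return replicas[0]

def adapter_open(name):
    """What the adapter does when the unmodified application calls open()."""
    if name.startswith("LFN://"):
        physical = resolve(name)
        print(f"open({name}) -> {physical}")
        name = physical
    # ... hand off to the matching protocol driver (FTP, HTTP, RFIO, ...)
    return name

adapter_open("LFN://ncbi.gov/nr.data")
```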
Performance of Bio Apps on EGEE
[Plot: runtime (sec, up to ~450) versus protein database size (sequences, up to 1,200,000) for BLAST, FastA, and SSearch, each run with Parrot access versus whole-database copy.]
Expandable Filesystem for Experimental Data
Credit: John Poirier @ Notre Dame Astrophysics Dept.
Project GRAND: http://www.nd.edu/~grand
[Diagram: the detector writes 2 GB/day today (could be lots more!) to a buffer disk; data is spooled to daily tapes that form a 30-year archive; the analysis code can only analyze the most recent data.]
Expandable Filesystem for Experimental Data
Credit: John Poirier @ Notre Dame Astrophysics Dept.
Project GRAND: http://www.nd.edu/~grand
[Diagram: the same pipeline, but an adapter now groups several file servers into a logical volume alongside the buffer disk and tape archive, so the analysis code can analyze all data over large time scales.]
Scalable I/O for Biometrics
Computer Vision Research Lab in CSE
– Goal: Develop robust algorithms for identifying humans from (non-ideal) images.
– Technique: Collect lots of images. Think up a clever new matching function. Compare them.
How do you test a matching function?
– For a set S of images,
– Compute F(Si,Sj) for all Si and Sj in S.
– Compare the result matrix to known functions.
Credit: Patrick Flynn at Notre Dame CSE
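A minimal sketch of that evaluation loop, assuming a toy stand-in for the matching function F; the real workload applies a serious matcher to roughly 10,000 images and has to be distributed.

```python
# Sketch: apply a candidate matching function F to every pair of images in S
# and collect the score matrix (symmetric, so the upper triangle suffices).

import itertools

def F(a, b):
    """Hypothetical matching function: fraction of positions that agree."""
    matches = sum(x == y for x, y in zip(a, b))
    return round(matches / max(len(a), len(b)), 2)

S = ["img0001", "img0002", "img0003_long"]   # stand-ins for image files
scores = {}
for i, j in itertools.combinations_with_replacement(range(len(S)), 2):
    scores[(i, j)] = F(S[i], S[j])

for (i, j), v in sorted(scores.items()):
    print(f"F(S{i}, S{j}) = {v}")
```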
Computing Similarities
[Diagram: the matching function F compares every image to every other, filling the upper triangle of a similarity matrix with scores such as 1, .8, .7, .1, and 0.]
A Big Data Problem
Data size: 10k images of 1 MB = 10 GB
Total I/O: 10k * 10k * 2 MB * 1/2 = 100 TB
Would like to repeat many times!
In order to execute such a workload, we must be careful to partition both the I/O and the CPU needs, taking advantage of distributed capacity.
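A quick check of that arithmetic, using decimal units as on the slide:

```python
# 10,000 images of 1 MB each; every unordered pair compared once,
# reading 2 MB of input per comparison.

images = 10_000
image_mb = 1

data_set_gb = images * image_mb / 1000            # ~10 GB of raw data
pairs = images * images // 2                       # unordered pairs (approx.)
total_io_tb = pairs * 2 * image_mb / 1000 / 1000   # 2 MB read per comparison

print(f"data set:  {data_set_gb:.0f} GB")   # -> 10 GB
print(f"total I/O: {total_io_tb:.0f} TB")   # -> 100 TB
```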
Conventional Solution
[Diagram: every compute node pulls its input from a pair of central file servers, so all jobs hammer the same disks.]
Move 200 TB at runtime!
Using Tactical Storage
1. Break the array into MB-size chunks.
2. Replicate data to many disks.
3. Jobs find a nearby data copy, and make full use of it before discarding.
[Diagram: chunks are replicated across the disks of the compute nodes; each job reads from a nearby replica.]
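A minimal sketch of those three steps, with the chunk size, replica count, and notion of "nearby" chosen only for illustration:

```python
# Sketch: split the image set into MB-sized chunks, replicate each chunk
# onto several disks, and have each job pick a nearby replica and use it
# fully before moving on.

import random

CHUNK_MB = 64
REPLICAS = 3
disks = [f"disk{i:02d}" for i in range(20)]

def make_chunks(n_images, image_mb=1):
    """Step 1: group images into chunks of roughly CHUNK_MB."""
    per_chunk = CHUNK_MB // image_mb
    return [list(range(i, min(i + per_chunk, n_images)))
            for i in range(0, n_images, per_chunk)]

def place(chunks):
    """Step 2: replicate every chunk onto REPLICAS distinct disks."""
    return {c: random.sample(disks, REPLICAS) for c in range(len(chunks))}

def run_job(job_disk, chunk_id, placement, chunks):
    """Step 3: prefer a replica on the job's own disk, then use it fully."""
    local = job_disk in placement[chunk_id]
    source = job_disk if local else placement[chunk_id][0]
    return (f"job on {job_disk}: chunk {chunk_id} from {source} "
            f"({'local' if local else 'remote'}), {len(chunks[chunk_id])} images")

chunks = make_chunks(n_images=10_000)
placement = place(chunks)
print(run_job("disk07", 0, placement, chunks))
```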
Problems and Solutions
"I can't make storage do what I want!"
– Need root access, configure, reboot, etc...
– Solution: Tactical Storage Systems
"I have no idea why this system is failing!"
– Multiple services, unreliable networks...
– Solution: Debugging Via Data Mining
It's Ugly in the Real World
Machine related failures:
– Power outages, network outages, faulty memory, corrupted file system, bad config files, expired certs, packet filters...
Job related failures:
– Crash on some args, bad executable, missing input files, mistake in args, missing components, failure to understand dependencies...
Incompatibilities between jobs and machines:
– Missing libraries, not enough disk/cpu/mem, wrong software installed, wrong version installed, wrong memory layout...
Load related failures:
– Slow actions induce timeouts; kernel tables: files, sockets, procs; router tables: addresses, routes, connections; competition with other users...
Non-deterministic failures:
– Multi-thread/CPU synchronization, event interleaving across systems, random number generators, interactive effects, cosmic rays...
A "Grand Challenge" Problem:
A user submits one million jobs to the grid.
Half of them fail.
Now what?
– Examine the output of every failed job?
– Login to every site to examine the logs?
– Resubmit and hope for the best?
We need some way of getting the big picture.
We need to identify problems not seen before.
An Idea:
We have lots of structured information about the components of a grid.
Can we perform some form of data mining to discover the big picture of what is going on?
– User: Your jobs work fine on RH Linux 12.1 and 12.3, but they always seem to crash on version 12.2.
– Admin: Joe is running 1000s of jobs with 10 TB of data that fail immediately; perhaps he needs help.
Can we act on this information?
– User: Avoid resources that aren't working for you.
– Admin: Assist the user in understanding and fixing the problem.
Job ClassAd (excerpt)
MyType = "Job"
TargetType = "Machine"
ClusterId = 11839
Owner = "dcieslak"
Cmd = "ripper-cost-can-9-50.sh"
ImageSize = 40000
DiskUsage = 110000
TransferInput = "scripts.tar.gz,can-ripper.tar.gz"
TransferOutput = "ripper-cost-50-can-9.tar.gz"
Requirements = (OpSys == "LINUX") && (Arch == "INTEL") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (HasFileTransfer)
JobStatus = 2
NumJobMatches = 34
JobRunCount = 24
RemoteHost = "[email protected]"
RemoteUserCpu = 62319.000000
RemoteWallClockTime = 432493.000000
... (roughly one hundred attributes in all)
Machine ClassAd (excerpt)
MyType = "Machine"
TargetType = "Job"
Name = "ccl00.cse.nd.edu"
MachineGroup = "ccl"
Arch = "INTEL"
OpSys = "LINUX"
Memory = 498
Disk = 19072712
LoadAvg = 1.130000
KeyboardIdle = 817093
State = "Claimed"
Activity = "Busy"
Start = ((KeyboardIdle > 15 * 60) && (((LoadAvg - CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" && State != "Owner")))
RemoteUser = "[email protected]"
Rank = ((Owner == "dthain") || (Owner == "psnowber") || (Owner == "cmoretti") || (Owner == "jhemmes") || (Owner == "gniederw")) * 2 + (PoolName =?= "ccl00.cse.nd.edu") * 1
... (roughly one hundred attributes in all)
User Job Log
Job 1 submitted.
Job 2 submitted.
Job 1 placed on ccl00.cse.nd.edu
Job 1 evicted.
Job 1 placed on smarty.cse.nd.edu.
Job 1 completed.
Job 2 placed on dvorak.helios.nd.edu
Job 2 suspended
Job 2 resumed
Job 2 exited normally with status 1.
...
[Diagram: Job ClassAds, Machine ClassAds, and the User Job Log are joined and labeled using the failure criteria (exit != 0, core dump, evicted, suspended, bad output) into a success class and a failure class; the RIPPER rule learner then produces rules such as "Your jobs work fine on RH Linux 12.1 and 12.3 but they always seem to crash on version 12.2."]
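A minimal sketch of the labeling step, assuming made-up job records and a crude per-attribute failure count in place of the real RIPPER rule learner:

```python
# Sketch: join job and machine attributes with the job log, label each run
# by the failure criteria, and look for attribute values that separate the
# failure class from the success class. The records are invented.

from collections import Counter, defaultdict

runs = [
    {"OpSysVer": "12.1", "Memory": 1024, "exit": 0,  "evicted": False},
    {"OpSysVer": "12.2", "Memory": 2048, "exit": 1,  "evicted": False},
    {"OpSysVer": "12.2", "Memory": 2048, "exit": -6, "evicted": False},
    {"OpSysVer": "12.3", "Memory": 1024, "exit": 0,  "evicted": False},
    {"OpSysVer": "12.3", "Memory": 1024, "exit": 0,  "evicted": True},
]

def failed(run):
    """Failure criteria from the figure: nonzero exit, eviction, ..."""
    return run["exit"] != 0 or run["evicted"]

# Count failures per attribute value; a rule learner like RIPPER does this
# far more carefully, but it aims for the same kind of rule, e.g.
# "jobs crash whenever OpSysVer == 12.2".
stats = defaultdict(Counter)
for run in runs:
    for attr in ("OpSysVer", "Memory"):
        stats[attr][(run[attr], failed(run))] += 1

for attr, counts in stats.items():
    for (value, is_failure), n in sorted(counts.items()):
        print(f"{attr}={value}: {n} {'FAIL' if is_failure else 'ok'}")
```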
Unexpected Discoveries
Purdue TeraGrid (91343 jobs on 2523 CPUs)
– Jobs fail on machines with (Memory > 1920 MB).
– Diagnosis: Linux machines with > 3 GB have a different memory layout that breaks some programs that do inappropriate pointer arithmetic.
UND & UW (4005 jobs on 1460 CPUs)
– Jobs fail on machines with less than 4 MB of disk.
– Diagnosis: Condor failed in an unusual way when the job transfers input files that don't fit.
Many Open Problems
Strengths and Weaknesses of the Approach
– Correlation != causation -> could be enough?
– Limits of reported data -> increase resolution?
– Not enough data points -> direct job placement?
Acting on Information
– Steering by the end user.
– Applying learned rules back to the system.
– Evaluating (and sometimes abandoning) changes.
Data Mining Research
– Continuous intake + incremental construction.
– Creating results that non-specialists can understand.
Next Step: Monitor 21,000 CPUs on the OSG!
Problems and Solutions
"I can't make storage do what I want!"
– Need root access, configure, reboot, etc...
– Solution: Tactical Storage Systems
"I have no idea why this system is failing!"
– Multiple services, unreliable networks...
– Solution: Debugging Via Data Mining
Outline
Large Scale Distributed Computing
– Plentiful Computing Resources World-Wide
– Challenges: Data and Debugging
The Cooperative Computing Lab
– Distributed Data Management
– Applications to Scientific Computing
– Debugging Complex Systems
Open Problems in Distributed Computing
– Proposal: The All-Pairs Abstraction
Some Ruminations
These tools attack technical problems.
But, users still have to be clever:
– Where should my jobs run?
– How should I partition data?
– How long should I run before a checkpoint?
Can we provide an interface such that:
– Scientific users state what to compute.
– The system decides where, when, and how.
Previous attempts didn't incorporate data.
The All-Pairs Abstraction
All-Pairs:
– For a set S and a function F:
– Compute F(Si,Sj) for all Si and Sj in S.
The end user provides:
– Set S: A bunch of files.
– Function F: A self-contained program.
The computing system determines:
– Optimal decomposition in time and space.
– What resources to employ. (F is easy to distribute.)
– What to do when failures occur.
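A minimal sketch of the user's side of this interface, with the distributed backend replaced by a plain double loop; the function names and the toy F are assumptions for illustration, not the actual All-Pairs tool.

```python
# Sketch of the abstraction: the user hands over a set S and a
# self-contained program F and gets back the result matrix. Partitioning,
# placement, and retries all live inside all_pairs().

def all_pairs(S, F):
    """Compute F(Si, Sj) for all Si, Sj in S.

    A real facility would decompose this over many CPUs and disks and
    re-run pieces that fail; here it is just a naive local loop.
    """
    return [[F(si, sj) for sj in S] for si in S]

# The user's only obligations: the set of files...
S = ["a.img", "b.img", "c.img"]

# ...and a self-contained comparison function (illustrative stand-in).
def F(x, y):
    return 1.0 if x == y else 0.0

for row in all_pairs(S, F):
    print(row)
```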
An All-Pairs Facility at Notre Dame
[Diagram: an All-Pairs web portal in front of 100s-1000s of machines, each with its own CPU and disk.]
1 – User uploads S and F into the system.
2 – Backend decides where to run, how to partition, when to retry failures...
3 – Return the result matrix to the user.
Our Mode of Research
Find researchers with systems problems.
Solve them by developing new tools.
Generalize the solutions to new domains.
Publish papers and software!
Acknowledgments
Science Collaborators:
– Christophe Blanchet
– Sander Klous
– Peter Kunzst
– Erwin Laure
– John Poirier
– Igor Sfiligoi
– Francesco Delli Paoli
CSE Students:
– Paul Brenner
– Tim Faltemier
– James Fitzgerald
– Jeff Hemmes
– Chris Moretti
– Gerhard Niederwieser
– Phil Snowberger
– Justin Wozniak
CSE Faculty:
– Jesus Izaguirre
– Aaron Striegel
– Patrick Flynn
– Nitesh Chawla
For more information...
Cooperative Computing Lab
http://www.cse.nd.edu/~ccl
Cooperative Computing Tools
http://www.cctools.org
Douglas Thain
– [email protected]
– http://www.cse.nd.edu/~dthain