Instrumenting Folding@Work
![Page 1: Instrumenting Folding@Work](https://reader036.fdocuments.in/reader036/viewer/2022062520/56815f53550346895dce30c0/html5/thumbnails/1.jpg)
Instrumenting Folding@Work
Badi Abdul-Wahid, RJ Nowling
CSE 60641 Operating Systems
Professor Striegel
![Page 2: Instrumenting Folding@Work](https://reader036.fdocuments.in/reader036/viewer/2022062520/56815f53550346895dce30c0/html5/thumbnails/2.jpg)
Overview
• Problem Description
  – Experimental Structure
  – Folding@Work Workflow
• Benchmarks
• Results
  – Weak Scaling (ns / day)
  – Server Capacity
  – Available Workers Over Time
  – Variability of Computation Time
• Conclusions
![Page 3: Instrumenting Folding@Work](https://reader036.fdocuments.in/reader036/viewer/2022062520/56815f53550346895dce30c0/html5/thumbnails/3.jpg)
Experimental Structure
![Page 4: Instrumenting Folding@Work](https://reader036.fdocuments.in/reader036/viewer/2022062520/56815f53550346895dce30c0/html5/thumbnails/4.jpg)
Folding@Work Workflow
![Page 5: Instrumenting Folding@Work](https://reader036.fdocuments.in/reader036/viewer/2022062520/56815f53550346895dce30c0/html5/thumbnails/5.jpg)
Benchmarks
• Tasks: 1 ns generations (approx. 2 hr on test machine)
• 10 consecutive generations / simulations
• Weak Scaling
  – 10 simulations / 10 workers
  – 100 simulations / 100 workers
  – 1,000 simulations / 1,000 workers
• Condor, later added SGE jobs
• 1 trial of each; took ~2 days to run
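
The ns / day metric used for these benchmarks can be sketched as below. This is a minimal illustration, not the instrumentation code from the project: the function names are hypothetical, and the "ideal" figure assumes every worker is always busy with a 1 ns task that takes the 2 hours quoted above, with no transfer or wait overhead.

```python
SECONDS_PER_DAY = 86_400


def ns_per_day(simulated_ns: float, wall_seconds: float) -> float:
    """Aggregate throughput: simulated nanoseconds per wall-clock day."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return simulated_ns * SECONDS_PER_DAY / wall_seconds


def ideal_ns_per_day(workers: int, ns_per_task: float, task_seconds: float) -> float:
    """Upper bound if every worker is always busy and overhead is zero (assumption)."""
    return workers * ns_per_day(ns_per_task, task_seconds)


def scaling_efficiency(measured_ns_per_day: float, ideal: float) -> float:
    """Fraction of the ideal throughput actually achieved."""
    return measured_ns_per_day / ideal


if __name__ == "__main__":
    # Ideal targets for the three benchmark sizes (1 ns per task, ~2 hr per task).
    for workers in (10, 100, 1000):
        print(workers, ideal_ns_per_day(workers, 1.0, 2 * 3600))
```

Under weak scaling the ideal value grows linearly with the number of workers, so a measured ns / day curve that falls below that line shows where overhead starts to matter.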
![Page 6: Instrumenting Folding@Work](https://reader036.fdocuments.in/reader036/viewer/2022062520/56815f53550346895dce30c0/html5/thumbnails/6.jpg)
Weak Scaling of F@W
![Page 7: Instrumenting Folding@Work](https://reader036.fdocuments.in/reader036/viewer/2022062520/56815f53550346895dce30c0/html5/thumbnails/7.jpg)
Server Capacity (Wait Time)
![Page 8: Instrumenting Folding@Work](https://reader036.fdocuments.in/reader036/viewer/2022062520/56815f53550346895dce30c0/html5/thumbnails/8.jpg)
Available Workers over Time
![Page 9: Instrumenting Folding@Work](https://reader036.fdocuments.in/reader036/viewer/2022062520/56815f53550346895dce30c0/html5/thumbnails/9.jpg)
Transfer Times
![Page 10: Instrumenting Folding@Work](https://reader036.fdocuments.in/reader036/viewer/2022062520/56815f53550346895dce30c0/html5/thumbnails/10.jpg)
Variability of Computation Time
![Page 11: Instrumenting Folding@Work](https://reader036.fdocuments.in/reader036/viewer/2022062520/56815f53550346895dce30c0/html5/thumbnails/11.jpg)
Example Execution Timeline
![Page 12: Instrumenting Folding@Work](https://reader036.fdocuments.in/reader036/viewer/2022062520/56815f53550346895dce30c0/html5/thumbnails/12.jpg)
Performance Model
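$$
N_{wu} = \frac{\langle t_{exe} \rangle + \langle t_{W,wait} \rangle}{\langle t_{new} \rangle + \langle t_{trans} \rangle + \langle t_{M,wait} \rangle}
$$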
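
A small sketch of how the model could be evaluated. The slide does not define its symbols, so the reading here is an assumption: the numerator is the mean time a work unit spends on the worker side (execution plus worker wait), the denominator is the mean time the master spends per work unit (creating it, transferring it, and master wait), and N_wu is the number of work units the master can keep in flight. The function name and the sample timings are hypothetical.

```python
def work_units_supported(t_exe: float, t_w_wait: float,
                         t_new: float, t_trans: float, t_m_wait: float) -> float:
    """Evaluate N_wu = (<t_exe> + <t_W,wait>) / (<t_new> + <t_trans> + <t_M,wait>).

    All arguments are mean times in the same unit (seconds here).
    """
    master_time = t_new + t_trans + t_m_wait
    if master_time <= 0:
        raise ValueError("master-side time per work unit must be positive")
    return (t_exe + t_w_wait) / master_time


if __name__ == "__main__":
    # Hypothetical timings: a 2-hour work unit and 4 s of master-side cost.
    long_units = work_units_supported(7200, 0, 1, 2, 1)
    # Same master-side cost, but work units forced to 10 minutes.
    short_units = work_units_supported(600, 0, 1, 2, 1)
    print(long_units, short_units)
```

With the master-side cost held fixed, shortening the work units lowers N_wu in direct proportion, which matches the later observation that forcing short work units adds load to the master.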
![Page 13: Instrumenting Folding@Work](https://reader036.fdocuments.in/reader036/viewer/2022062520/56815f53550346895dce30c0/html5/thumbnails/13.jpg)
Weak Scaling (updated)
![Page 14: Instrumenting Folding@Work](https://reader036.fdocuments.in/reader036/viewer/2022062520/56815f53550346895dce30c0/html5/thumbnails/14.jpg)
Wait Times
![Page 15: Instrumenting Folding@Work](https://reader036.fdocuments.in/reader036/viewer/2022062520/56815f53550346895dce30c0/html5/thumbnails/15.jpg)
Tasks Waiting
![Page 16: Instrumenting Folding@Work](https://reader036.fdocuments.in/reader036/viewer/2022062520/56815f53550346895dce30c0/html5/thumbnails/16.jpg)
Identified Areas of Improvement
• Availability of Resources
  – Benchmarks limited by number of sustained workers available through Condor
  – New feature: WorkQueue Worker Pool can be used to start new workers
• WorkQueue Limits Number of Workers
  – Increasing number of file descriptors allowed up to 2,500 workers to connect
  – Bad behavior occurring in calls to select()
  – Working with WorkQueue developers to switch to poll()
• Long-Running Work Units Delay Completion of Trajectories
  – Some work units not returned / taking very long time
  – Prevents trajectories from finishing
  – Use fast abort feature to re-assign work units that take longer than a specified time
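
The idea behind re-assigning long-running work units can be sketched in plain Python. This is not the WorkQueue fast abort implementation or its API; `find_overdue`, the task ids, and the time limit are hypothetical names and values used only to illustrate the rule "re-assign anything that has run longer than a specified time".

```python
from typing import Dict, List


def find_overdue(running: Dict[str, float], now: float, limit_seconds: float) -> List[str]:
    """Return ids of work units that have been running longer than limit_seconds.

    running maps a work-unit id to the time (in seconds) at which it started.
    """
    return sorted(
        task_id
        for task_id, started in running.items()
        if now - started > limit_seconds
    )


if __name__ == "__main__":
    # Hypothetical state: three work units started at different times.
    running = {"wu-1": 0.0, "wu-2": 5000.0, "wu-3": 9000.0}
    # With a 2-hour limit, only the unit started at t=0 is overdue at t=10000.
    for task_id in find_overdue(running, now=10_000.0, limit_seconds=7200.0):
        print("re-assign", task_id)
```

A master loop would call a check like this periodically and resubmit the returned work units to other workers, so a single slow or lost worker cannot stall a trajectory.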
![Page 17: Instrumenting Folding@Work](https://reader036.fdocuments.in/reader036/viewer/2022062520/56815f53550346895dce30c0/html5/thumbnails/17.jpg)
Conclusion
• Accomplished
  – Identified key metrics (ns / day, wait time)
  – Developed scaling model
  – Tested model
• Conclusions
  – Real scientific applications scale well
  – Forcing short work units adds load to Master
  – Performance model validated
  – “Self-correcting” behavior