Sparrow: Distributed, Low Latency Scheduling
Agenda
- Part A: Background
- Part B: Sparrow system design
- Part C: Sparrow experimental evaluation
Part A: Background
Background: Data Processing Frameworks
• How to distribute data-parallel computations across multiple machines?
  • MapReduce (OSDI ‘04)
  • Dremel (VLDB ‘10)
  • Spark (NSDI ‘12)
• Convert high-level computation description into jobs
• Partition input data and assign jobs to multiple machines
Background: Short Tasks
• Common challenges in data processing frameworks
• Problem 1: Stragglers
  • Job response times are dominated by stragglers
  • Causes:
    • Machine performance (e.g. contended CPUs, congested networks, etc.)
    • Data partitioning (tasks take increased time due to computational skew, etc.)
• Problem 2: Sharing
  • Long-running tasks block additional tasks from running
Reference: http://kayousterhout.org/publications/hotos13-final24.pdf
Solution: Shorter Tasks!
Solution 1: Straggler Mitigation
Reference: http://kayousterhout.org/talks/tinytasks-hotos-talk.pdf
Solution 2: Improved Sharing
Reference: http://kayousterhout.org/talks/tinytasks-hotos-talk.pdf
Q: Why don’t existing data processing frameworks use short tasks?
Background: Short Tasks
• Architectural changes:
  • Cluster must support minimal task launch overhead
• Scalable storage systems:
  • Task runtime could be dominated by time taken to read input data
• Low-latency scheduling:
  • Scheduler must be able to make millions of low-latency scheduling decisions per second
• Framework-controlled I/O:
  • Framework should exploit the smaller resource footprint of short tasks (e.g. pipelined reading of input data)
• And more…
  • Changes to execution and programming model
Background: Scheduling
• Sparrow provides a solution to the scheduling problem!
• Restrictive time requirements:
  • Sparrow has around 1–10 milliseconds to make scheduling decisions
• High throughput requirements:
  • Sparrow must support millions of scheduling decisions per second
Background: Spark
• Data processing framework; optimizes for efficient data reuse and in-memory computation
  • Resilient distributed datasets (RDDs)
  • Express computation as a sequence of transformations (e.g. map, filter, join, etc.) on RDDs
• Scheduling:
  • Tasks assigned to machines based on delay scheduling
  • Delay scheduling attempts to achieve both fair sharing and data locality
    • Fair sharing: if N jobs are running, each job receives a 1/N share of resources
    • Data locality: place computations near their input data
Reference: https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
Background: Centralized vs. Decentralized
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Part B: Sparrow System Design
Sparrow’s Execution Model
• Cluster composed of worker machines that execute tasks and schedulers that assign tasks
• Each job composed of 𝑚 tasks
• Wait time:
  • Time until the task begins executing
  • Represents scheduler overhead
• Service time:
  • Time the task spends executing on a worker machine
Sparrow’s Optimizations
• Batch sampling
  • Optimization of “power of 2 choices” load balancing
  • Place 𝑚 tasks in a job on the least loaded of 𝑑 ∙ 𝑚 randomly selected machines
• Late binding
  • Delays assignment of tasks to machines until the machine is ready to run the task
Randomized Sampling
• Scheduler assigns each task to a randomly chosen machine
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Randomized Sampling: Analysis
• Let 𝑛 be the number of machines in the cluster
• Let 𝑝 be the probability that a randomly selected machine is loaded
  • Represents cluster load
• Probability that random sampling assigns all 𝑚 tasks to unloaded machines: (1 − 𝑝)^𝑚
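As a sanity check, the formula can be evaluated directly; a minimal sketch (the function name is mine, not from the paper):

```python
# Probability that random sampling places every one of a job's m tasks
# on an unloaded machine, given that each machine is independently
# loaded with probability p.
def p_random(p: float, m: int) -> float:
    return (1 - p) ** m

# At 80% cluster load, a 10-task job almost never avoids loaded machines.
print(p_random(0.8, 10))   # (0.2)**10, about 1e-7
print(p_random(0.8, 100))  # effectively zero for a 100-task job
```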
Randomized Sampling: Results
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Power of 2 Choices
• Suppose 𝑛 balls are inserted into 𝑛 bins:
  • Each ball chooses 𝑑 = 2 bins uniformly at random
  • The ball is inserted into the bin that has the lesser number of balls
  • If both bins have an identical number of balls, put the ball in either bin
• Azar et al. proved that the max load is log log 𝑛 + 𝑂(1) with high probability
• This is exponentially better than random allocation (𝑑 = 1), where the max load is ≈ log 𝑛 / log log 𝑛
• Increasing 𝑑 beyond 2 does not improve much: the max load is log log 𝑛 / log 𝑑 + 𝑂(1)
Reference 1: http://www.eecs.harvard.edu/~michaelm/postscripts/handbook2001.pdf
Reference 2: https://homes.cs.washington.edu/~karlin/papers/balls.pdf
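The balls-and-bins result is easy to reproduce in simulation; a minimal sketch (not from the paper, and d=1 recovers plain random allocation):

```python
import random

def throw_balls(n: int, d: int, seed: int = 0) -> list:
    """Throw n balls into n bins; each ball probes d bins uniformly at
    random and lands in the least loaded of them (d=1 is plain random)."""
    rng = random.Random(seed)
    bins = [0] * n
    for _ in range(n):
        probes = [rng.randrange(n) for _ in range(d)]
        bins[min(probes, key=bins.__getitem__)] += 1
    return bins

# d=2 typically yields a noticeably smaller max load than d=1.
print(max(throw_balls(100_000, 1)), max(throw_balls(100_000, 2)))
```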
Per-Task Sampling
• Scheduler chooses 2 random machines; assigns task to least loaded machine
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Per-Task Sampling: Analysis
• Let 𝑑 be the number of machines that are probed
• Probability that per-task sampling assigns all 𝑚 tasks to unloaded machines: (1 − 𝑝^𝑑)^𝑚
• Q: Why not choose a larger 𝑑?
• Problems:
  • Job response time is limited by the longest wait time of any task in the job
  • Sub-optimal placement of tasks
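Comparing this formula against random sampling shows the gain from probing; a sketch with my own function name, not the paper's code:

```python
def p_per_task(p: float, d: int, m: int) -> float:
    # A task misses only if all d of its probes hit loaded machines
    # (probability p**d), so each task succeeds with probability 1 - p**d.
    return (1 - p ** d) ** m

# At 80% load with d = 2 probes per task, a 10-task job fares far better
# than with purely random placement, (1 - p)**m.
print(p_per_task(0.8, 2, 10), (1 - 0.8) ** 10)
```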
Per-Task Sampling: Analysis
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Per-Task Sampling: Results
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Batch Sampling
• Scheduler probes 2𝑚 random machines; assigns 𝑚 tasks to least loaded machines
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Batch Sampling: Analysis
• Probability that batch sampling assigns all 𝑚 tasks to unloaded machines:
  • Equivalent to the probability that ≥ 𝑚 of the 𝑑 ∙ 𝑚 probed machines are unloaded
  • Σ_{𝑖=𝑚}^{𝑑∙𝑚} C(𝑑∙𝑚, 𝑖) (1 − 𝑝)^𝑖 𝑝^(𝑑∙𝑚−𝑖)
• Problems:
  • Estimating load based on queue length is inaccurate
    • Queue 1 = [ 50 ms, 50 ms, 50 ms ]
    • Queue 2 = [ 200 ms ]
  • Multiple schedulers may assign tasks to the same machine
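The binomial tail above can be evaluated directly; a sketch (function name is mine) comparing it against per-task sampling at the same total probe budget of 𝑑 ∙ 𝑚:

```python
from math import comb

def p_batch(p: float, d: int, m: int) -> float:
    # Probability that at least m of the d*m probed machines are unloaded,
    # where each machine is loaded independently with probability p.
    return sum(comb(d * m, i) * (1 - p) ** i * p ** (d * m - i)
               for i in range(m, d * m + 1))

# Batch sampling beats per-task sampling with the same d*m probes,
# because any task in the job can use any unloaded probe.
print(p_batch(0.8, 2, 10), (1 - 0.8 ** 2) ** 10)
```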
Batch Sampling: Results
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Late Binding
• Scheduler probes 2𝑚 random machines; places a task reservation on each probed machine
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Late Binding
• Machine requests task once it reaches front of queue
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Late Binding: Analysis
• Problems:
  • Machines are idle during the RPC to request a task from the scheduler
  • Machines might request tasks from schedulers that have already allocated all tasks in a job
• Solution: proactive cancellation
  • Upon allocating all tasks in a job, send a cancellation RPC to machines that have pending reservations
• Q: Does Sparrow’s design extend to microsecond-scale tasks?
Late Binding: Results
Reference: http://kayousterhout.org/talks/sparrow-sosp-talk.pdf
Placement constraints
• Per-job constraints:
  • E.g. job must execute on machines that have a GPU
  • Restrict batch sampling to machines that satisfy the constraint
• Per-task constraints:
  • E.g. task must execute on the machine that has its input data
  • Uses per-task sampling
  • Probed information shared across tasks:
    • Probe Task 1: [A loaded, B loaded, C unloaded]
    • Probe Task 2: [C unloaded, D unloaded, E loaded]
• Optimal placement?
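In the example, Task 1's only unloaded probe is C, while Task 2 also saw D unloaded, so the better joint placement gives C to Task 1 and D to Task 2. A most-constrained-task-first greedy captures this; a hypothetical sketch, not Sparrow's actual implementation:

```python
def place_with_shared_probes(options):
    """Greedy assignment over shared probe results: handle the task
    with the fewest unloaded candidates first, so a task with only one
    good machine is not starved by a more flexible task.
    'options' maps each task to the set of unloaded machines its
    probes found."""
    placement, taken = {}, set()
    for task in sorted(options, key=lambda t: len(options[t])):
        free = sorted(options[task] - taken)
        if free:
            placement[task] = free[0]
            taken.add(free[0])
        else:
            placement[task] = None  # fall back to the least loaded probe
    return placement

# From the slide: Task 1 saw {C} unloaded; Task 2 saw {C, D} unloaded.
print(place_with_shared_probes({"Task 1": {"C"}, "Task 2": {"C", "D"}}))
```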
Resource allocation policies
• Strict priorities:
  • Tasks are assigned priorities (e.g. high/low)
  • Sparrow maintains separate high- and low-priority task queues at each machine
  • The high-priority queue is emptied before the low-priority queue
• Weighted fair sharing:
  • Idea from network scheduling
  • Maintain separate per-user queues
  • Each user is assigned a percentage representing their allocated “bandwidth”
    • E.g. 10%, 30%, and 60% to different users
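Weighted sharing over per-user queues can be approximated with lottery-style selection; a hypothetical sketch for intuition, not Sparrow's actual mechanism:

```python
import random

def next_user(weights, rng):
    """Pick which user's queue to serve next, in proportion to the
    user's configured share (e.g. 10% / 30% / 60%)."""
    users = list(weights)
    return rng.choices(users, weights=[weights[u] for u in users], k=1)[0]

# Over many picks, service roughly tracks the configured shares.
rng = random.Random(42)
shares = {"alice": 10, "bob": 30, "carol": 60}
picks = [next_user(shares, rng) for _ in range(10_000)]
print({u: picks.count(u) for u in shares})
```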
Implementation
• Front-end client converts high-level job descriptions to task specifications
  • Clients and schedulers run on the same machine
• Scheduler assigns tasks to machines
• A local node monitor running on each machine enqueues scheduled tasks
• An executor process on each machine executes tasks
Implementation: Fault tolerance
• Schedulers do not maintain persistent state
  • Similar to stateless web server backends
• Clients must send heartbeats to schedulers to detect failure
• Upon failure, the front-end must choose how to handle in-flight tasks
  • Simplest approach is to restart all in-flight tasks
• Q: Is this a good design? Is it acceptable to restart in-flight tasks upon scheduler failure?
Example: Spark on Sparrow
• Front-end translates functional queries into parallel stages
• Sparrow receives task description and placement constraints
Part C: Sparrow evaluation
Experimental Setup: TPC-H
• Cluster running on Amazon EC2
  • 100 machines and 10 schedulers
  • 8 cores and 68.4 GB memory per machine
• Performance evaluated using the TPC-H benchmark
  • Representative of ad-hoc queries on business data
• Properties:
  • Cluster utilization fluctuates around 80%
  • Non-uniform task durations (10–100 ms)
  • Mixed constrained/unconstrained scheduling requests
Experimental Evaluation: TPC-H
Deconstructing Performance
How do task constraints affect performance?
How do scheduler failures impact job response time?
How does Sparrow compare to Spark?
How effective is Sparrow’s distributed fairness enforcement?
How much can low priority users hurt response times for high priority users?
How sensitive is Sparrow to the probe ratio?
Conclusion
• Sparrow presents a simple, scalable solution to task scheduling
  • Supports millions of scheduling requests per second
  • Scheduling decisions can be made on the order of milliseconds
• Discussion:
  • Q: Suppose the cluster operates at max load (e.g. high job arrival rate). Is Sparrow’s approach optimal?
  • Q: How could data processing frameworks co-optimize with Sparrow to obtain higher performance?
  • Q: Are there alternative solutions to the straggler problem?