Ben Lippmeier LambdaJam 2015/5/22 - YOW! … ·...


Data Parallel Data Flow in Repa 4
Ben Lippmeier
LambdaJam 2015/5/22

Image: A♥ / Aih. flickr. CC Generic.

> import Data.Repa.Flow as F
> import qualified Data.Repa.Array as A
> ws <- F.fromFiles ["data/words1.txt", "data/words2.txt"] F.sourceLines

> :type ws

ws :: Sources (Array Char)

> F.sourcesArity ws


A flow consists of a bundle of individual streams. We create a bundle of two stream sources, one for each file.

> import Data.Repa.Flow.Auto.Debug as A
> more 0 ws

Just ["A", "a", "aa", "aal", "aalii", "aam", "Aani", "aardvark", "aardwolf", "Aaron", …

> more' 0 100 ws

Just ["arbitrament", "arbitrarily", "arbitrariness", "arbitrary", "arbitrate", …

The more function shows the first few elements from the front of the next chunk. The streams are stateful, so pulling a chunk consumes it.

> moret 1 ws


Show the next chunk of the second stream as a table.

> import Data.Char
> up <- map_i (A.map toUpper) ws
> more 0 up


> more 1 up

Just ["THE", "OF", "AND", "TO", "A", "IN …

Flows are data parallel, so applying a function like map_i transforms all streams in the bundle.



> ns <- map_i A.length ws
> :type ns
ns :: Sources Int

> os <- toFiles ["out1.txt", "out2.txt"] sinkLines

> :type os
os :: Sinks (Array Char)

> ons <- map_o (\(x :: Int) -> A.fromList $ show x) os
> :type ons
ons :: Sinks Int

Now we have a bundle of stream sinks. Data pushed into the sinks gets written out as text lines.

> :type drainP

drainP :: Sources a -> Sinks a -> IO ()

Drain all data from the sources into the sinks, in parallel.

> :type ns
ns :: Sources Int

> :type ons
ons :: Sinks Int


Image: Axel Taferner. flickr. CC-NC-SA.


map_i :: (a -> b) -> Sources a -> m (Sources b)


map_o :: (a -> b) -> Sinks b -> m (Sinks a)

map_i: a pull from the output (:: b) induces a pull from the input (:: a).

map_o: a push to the input (:: a) induces a push to the output (:: b).



data Sources i m e    -- i: stream index, m: monad, e: element type
  = Sources
  { sArity :: i
  , sPull  :: i
           -> (e -> m ())   -- eat
           -> m ()          -- eject
           -> m () }

data Sinks i m e
  = Sinks
  { kArity :: i
  , kPush  :: i -> e -> m ()
  , kEject :: i -> m () }
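To make the record concrete, here is a minimal sketch that builds a `Sources` bundle from in-memory lists and pulls single elements from it. It restates the slide's `Sources` type; `fromLists` and `pull1` are hypothetical helpers written for this illustration (they are not claimed to be repa-flow functions), and the per-stream state is kept in `IORef`s.

```haskell
import Data.IORef

-- The Sources record as shown on the slide.
data Sources i m e = Sources
  { sArity :: i
  , sPull  :: i -> (e -> m ()) -> m () -> m () }

-- Hypothetical helper: one stream per input list.
-- Each stream remembers its remaining elements in an IORef.
fromLists :: [[a]] -> IO (Sources Int IO a)
fromLists xss = do
  refs <- mapM newIORef xss
  let pull i eat eject = do
        xs <- readIORef (refs !! i)
        case xs of
          []       -> eject                           -- stream i is finished
          (y : ys) -> writeIORef (refs !! i) ys >> eat y
  return (Sources (length xss) pull)

-- Hypothetical helper: pull at most one element from stream i.
pull1 :: Sources Int IO a -> Int -> IO (Maybe a)
pull1 (Sources _ pull) i = do
  out <- newIORef Nothing
  pull i (\x -> writeIORef out (Just x)) (return ())
  readIORef out
```

Pulling stream 0 of `fromLists ["ab", "c"]` three times gives `Just 'a'`, `Just 'b'` and then `Nothing`: the stream is stateful, so each pull consumes what it returns.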

module Data.Repa.Flow.Generic where

readLines_i :: [Handle] -> IO (Sources Int IO String)
readLines_i hs = return $ Sources (length hs) pullL
 where pullL i eatL ejectL
        = do eof <- hIsEOF (hs !! i)
             if eof
               then ejectL
               else do line <- hGetLine (hs !! i)
                       eatL line

smap_i :: Monad m => (i -> a -> b) -> Sources i m a -> m (Sources i m b)

smap_i f (Sources n pullA) = return $ Sources n pullB
 where pullB i eatB eject = pullA i eatA eject
        where eatA x = eatB (f i x)
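The definition above can be tried out in isolation, together with a dual for sinks. This is only a sketch: the two data types are positional restatements of the slide's records, `mapSink` is a hypothetical name for a sink-side map written by analogy with the type of `map_o`, and `recordingSink` is a hypothetical test helper.

```haskell
import Data.IORef

data Sources i m e = Sources i (i -> (e -> m ()) -> m () -> m ())
data Sinks   i m e = Sinks   i (i -> e -> m ()) (i -> m ())

-- As on the slide: wrap the consumer's eat function so that a pull
-- from the result induces a pull from the argument.
smap_i :: Monad m => (i -> a -> b) -> Sources i m a -> m (Sources i m b)
smap_i f (Sources n pullA) = return $ Sources n pullB
  where pullB i eatB eject = pullA i (\x -> eatB (f i x)) eject

-- Hypothetical dual: a push into the result induces a push
-- into the argument sink.
mapSink :: Monad m => (i -> a -> b) -> Sinks i m b -> m (Sinks i m a)
mapSink f (Sinks n pushB ejectB) = return $ Sinks n pushA ejectB
  where pushA i x = pushB i (f i x)

-- Hypothetical helper: a one-stream sink that records what is pushed,
-- plus an action to read the recorded elements in push order.
recordingSink :: IO (Sinks Int IO a, IO [a])
recordingSink = do
  ref <- newIORef []
  let sink = Sinks 1 (\_ x -> modifyIORef ref (x :)) (\_ -> return ())
  return (sink, reverse <$> readIORef ref)
```

Note the direction of the function in each case: for sources it is applied after the element has been pulled, for sinks before the element is pushed on, which is why `mapSink` takes a `Sinks i m b` and returns a `Sinks i m a`.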

module Data.Repa.Flow.Chunked where
import Data.Repa.Flow.Generic as G

type Sources a = G.Sources Int IO (Array a)

type Sinks a = G.Sinks Int IO (Array a)

The repa-flow package defines generic flows, then various instances with a more specific and simpler API.

groupsBy_i :: (k -> k -> Bool) -> Sources k -> IO (Sources (k, Int))

> toList1 0 =<< groupsBy_i (==) =<< fromList 1 "waabbbbllee"

Just [('w', 1), ('a', 2), ('b', 4), ('l', 2), ('e', 2)]
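What the operator computes can be sketched on plain lists. `groupsBy` below is a hypothetical list analogue written for illustration; it ignores streams and chunking entirely and only shows the run-length grouping.

```haskell
import Data.List (groupBy)

-- Hypothetical list analogue of groupsBy_i: pair the first key of each
-- run of equal keys with the length of that run.
groupsBy :: (k -> k -> Bool) -> [k] -> [(k, Int)]
groupsBy eq ks = [ (k, length g) | g@(k : _) <- groupBy eq ks ]
```

Applied to the same input, `groupsBy (==) "waabbbbllee"` evaluates to `[('w',1),('a',2),('b',4),('l',2),('e',2)]`, matching the result shown above.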



(keys, lens)+

foldGroupsBy_ii :: (k -> k -> Bool) -> (a -> b -> b) -> b -> Sources k -> Sources a -> IO (Sources (k, b))

> sKeys <- fromList 1 "waaaabllle"
> sVals <- fromList 1 [10, 20, 30, 40, 50 …
> toList1 0 =<< map_i (\(key, (acc, n)) -> (key, acc / n))
            =<< foldGroupsBy_ii (==) (\x (acc, n) -> (acc + x, n + 1)) (0, 0) sKeys sVals

Just [('w', 10.0), ('a', 35.0), ('b', 60.0) …
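The segmented fold can also be sketched on plain lists. `foldGroupsBy` is a hypothetical list analogue, not the library function: it groups the keys into runs, then folds the matching number of values for each run. The example data is only the visible prefix of the session above (keys "waaaab" and the first six values, where 60 is inferred from the printed mean for 'b'), since the full value list is elided on the slide.

```haskell
import Data.List (groupBy)

-- Hypothetical list analogue of foldGroupsBy_ii: the runs of equal keys
-- decide how many values are folded into each result.
foldGroupsBy :: (k -> k -> Bool) -> (a -> b -> b) -> b -> [k] -> [a] -> [(k, b)]
foldGroupsBy eq f z ks vs = go (groupBy eq ks) vs
  where
    go (g@(k : _) : gs) xs =
      let (seg, rest) = splitAt (length g) xs   -- values for this run
      in  (k, foldl (flip f) z seg) : go gs rest
    go _ _ = []

-- Mean of the values in each run, using the same worker function and
-- initial state as the session above.
means :: [(Char, Double)]
means =
  [ (k, acc / n)
  | (k, (acc, n)) <- foldGroupsBy (==) (\x (a, c) -> (a + x, c + 1)) (0, 0)
                       "waaaab" [10, 20, 30, 40, 50, 60] ]
```

Here `means` evaluates to `[('w',10.0),('a',35.0),('b',60.0)]`, which agrees with the first three results printed above.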


keys values

(keys, results)+

foldGroups_ii :: .. -> Src k -> Src a -> Src (k, b)
foldGroups_oo :: .. -> Snk k -> Snk a -> Snk (k, b)
foldGroups_io :: .. -> Src k -> Snk a -> Snk (k, b)
foldGroups_xx :: .. -> Src k -> Snk a -> Src (k, b)

(Diagram: foldGroups_ii is marked "OK"; the other labels are "Buffers", foldGroups_xx, foldGroups_io and "! !".)







drain                        buffer
zipDrain                     altBuffer
(a * b)                      (a + b)
“operator is in control”     “context is in control”
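One way to read "operator is in control" for drain: the operator itself repeatedly pulls from a source and pushes into a sink until the source ejects. Below is a minimal sequential sketch over positional restatements of the generic types. `drainS`, `fromLists` and `toLists` are hypothetical names for this illustration, the sink bundle is assumed to have at least as many streams as the source bundle, and nothing here runs in parallel the way drainP is described as doing.

```haskell
import Control.Monad (replicateM)
import Data.IORef

data Sources i m e = Sources i (i -> (e -> m ()) -> m () -> m ())
data Sinks   i m e = Sinks   i (i -> e -> m ()) (i -> m ())

-- Hypothetical sequential drain: for each stream, keep pulling and
-- pushing; when the source stream ends, eject the matching sink stream.
drainS :: Sources Int IO a -> Sinks Int IO a -> IO ()
drainS (Sources n pull) (Sinks _ push eject) = mapM_ drainStream [0 .. n - 1]
  where
    drainStream i = pull i (\x -> push i x >> drainStream i) (eject i)

-- Hypothetical helper: list-backed sources, one stream per list.
fromLists :: [[a]] -> IO (Sources Int IO a)
fromLists xss = do
  refs <- mapM newIORef xss
  let pull i eat eject = do
        xs <- readIORef (refs !! i)
        case xs of
          []       -> eject
          (y : ys) -> writeIORef (refs !! i) ys >> eat y
  return (Sources (length xss) pull)

-- Hypothetical helper: n sinks that record pushed elements, plus an
-- action returning the recorded contents of every stream in push order.
toLists :: Int -> IO (Sinks Int IO a, IO [[a]])
toLists n = do
  refs <- replicateM n (newIORef [])
  let push i x = modifyIORef (refs !! i) (x :)
      eject _  = return ()
  return (Sinks n push eject, mapM (fmap reverse . readIORef) refs)
```

Draining `fromLists ["ab", "cde"]` into `toLists 2` leaves `["ab", "cde"]` in the sinks. A buffer would be the opposite arrangement: it exposes a sink on one side and a source on the other, and leaves the pushing and pulling to its context.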


deal_o :: (Int -> a -> IO ())            (spill function)
       -> Sinks Int IO a                 (output)
       -> IO (Sinks () IO (Array a))

Diagram text:
uffish thought he stood
vorpal blade snicker snack

distribute_o :: (Int -> Array a -> IO ())            (spill function)
             -> Sinks Int IO (Array a)               (output)
             -> IO (Sinks () IO (Array (Int, a)))

Diagram data:
(0, 'a') (1, 'b') (2, 'c') (0, 'd')
(0, 'A') (3, 'B') (3, 'C') (4, 'E')

(0, 'c')

distribute_o spilled
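The routing idea behind the signature can be sketched element-wise: each pushed pair carries the index of the output stream it should go to, and pairs whose index has no matching stream are handed to the spill function. This is a simplification under stated assumptions, not the library code: `distribute` and `toLists` are hypothetical names, the sketch works on single elements rather than on `Array` chunks, and it restates the `Sinks` type positionally.

```haskell
import Control.Monad (replicateM)
import Data.IORef

data Sinks i m e = Sinks i (i -> e -> m ()) (i -> m ())

-- Hypothetical element-wise distribute: route (index, value) pairs to the
-- sink stream with that index, or to the spill function if out of range.
distribute :: (Int -> a -> IO ()) -> Sinks Int IO a -> IO (Sinks () IO (Int, a))
distribute spill (Sinks n push eject) = return (Sinks () push' eject')
  where
    push' () (i, x)
      | i >= 0 && i < n = push i x
      | otherwise       = spill i x
    -- Ejecting the single input stream ejects every output stream.
    eject' () = mapM_ eject [0 .. n - 1]

-- Hypothetical helper: n sinks that record pushed elements, plus an
-- action returning the recorded contents of every stream in push order.
toLists :: Int -> IO (Sinks Int IO a, IO [[a]])
toLists n = do
  refs <- replicateM n (newIORef [])
  let push i x = modifyIORef (refs !! i) (x :)
      eject _  = return ()
  return (Sinks n push eject, mapM (fmap reverse . readIORef) refs)
```

With three output streams, pushing (0, 'a'), (1, 'b'), (2, 'c'), (0, 'd') and (3, 'B') leaves "ad", "b" and "c" in streams 0 to 2, and passes 'B' to the spill function with index 3.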



“I pull”
naturally sequential: read from the input streams one after the other
controlled order of consumption: drain entire stream first, or round robin element-wise

“You push”
naturally concurrent: input streams are contending for a shared output
uncontrolled order of consumption: elements pushed in non-deterministic order

(0, 'a') (1, 'b') (2, 'c') (0, 'd')
(0, 'A') (3, 'B') (3, 'C') (4, 'E')

shuffleP :: (Int -> Array a -> IO ())            (spill function)
         -> Sources Int IO (Array (Int, a))
         -> Sinks Int IO (Array a)
         -> IO ()


α-quality, active development: code that’s there should work OK, but some components are still missing.


Image: Alias 0591. flickr. CC-BY-2.0.

smap_i (\i l -> (i, length l)) =<< readLines_i hs

smap_i (\i l -> (i, length l))
       (Sources (length hs) (\i eatL ejectL -> do
          eof <- hIsEOF (hs !! i)
          if eof then ejectL
                 else do line <- hGetLine (hs !! i)
                         eatL line))

return $ Sources (length hs) (\i eatB ejectB ->
  (\i eatL ejectL -> do
      eof <- hIsEOF (hs !! i)
      if eof then ejectL
             else do line <- hGetLine (hs !! i)
                     eatL line)
    i (\x -> eatB (i, length x)) ejectB)

return $ Sources (length hs) (\i eatB ejectB -> do
  eof <- hIsEOF (hs !! i)
  if eof then ejectB
         else do line <- hGetLine (hs !! i)
                 (\x -> eatB (i, length x)) line)

return $ Sources (length hs) (\i eatB ejectB -> do
  eof <- hIsEOF (hs !! i)
  if eof then ejectB
         else do line <- hGetLine (hs !! i)
                 eatB (i, length line))


Image: Leo CC-NC-SA.

conduit - Michael Snoyman

data Pipe l i o u m r
  = HaveOutput (Pipe l i o u m r) (m ()) o
  | NeedInput (i -> Pipe l i o u m r) (u -> Pipe l i o u m r)
  | Done r
  | PipeM (m (Pipe l i o u m r))
  | Leftover (Pipe l i o u m r) l

l: leftovers, i: input elems, o: output elems, u: upstream result

• Pipe is an instance of Monad.
• Data can flow both ways through the pipe, and yield a final result.
• Single stream, single element at a time.
• Individual Sources created by ‘yield’ action.
• Combine pipes/conduits with fusion operators.


pipes - Gabriel Gonzalez

data Proxy a' a b' b m r
  = Request a' (a -> Proxy a' a b' b m r)
  | Respond b (b' -> Proxy a' a b' b m r)
  | M (m (Proxy a' a b' b m r))
  | Pure r

a' and a: upstream input and output; b' and b: downstream input and output; m: underlying monad


• Proxy / Pipe is an instance of Monad.
• Data can flow both ways through the pipe, and yield a final result.

machines - Edward Kmett

newtype MachineT m k o = MachineT { runMachine :: m (Step k o (MachineT m k o)) }

type Machine k o = forall m. Monad m => MachineT m k o

type Process a b = Machine (Is a) b

type Source b = forall k. Machine k b

• Like streams as used in Data.Vector stream fusion, except the step function returns a whole new Machine (stream)

• Clean and general API, but not sure if it supports array fusion. Machines library does not seem to attempt fusion.

repa-flow vs others

• Repa flow provides chunked, data parallel database-like operators with a straightforward API.

• Sources and Sinks are values rather than computations. The “Pipe” between them is created implicitly in IO land.

• API focuses on simplicity and performance via stream and array fusion, rather than having the most general API.

• We suspect single-stream Repa flow operators could be wrapped as either Pipes or Conduits, but neither of those seems to naturally support data parallel flows.


Image: gullevek. flickr. CC-NC-SA.

repa-stream (stream / “chain” fusion)
repa-eval (parallel gang management)
repa-convert (de/serialization)

Other labels in the diagram: (delayed array fusion), (CPS fusion)