Java Tech & Tools | Big Blobs: Moving Big Data In and Out of the Cloud | Adrian Cole


Page 1

Big Blobs: moving big data in and out of the cloud

Adrian Cole / Cloudsoft

Wednesday, November 2, 11

Page 2

Adrian Cole (@jclouds)

founded jclouds March 2009

chief evangelist at Cloudsoft

Page 3

Agenda

• intro to jclouds blobstore
• Omixon case study
• awkward silence (or Q/A)

Page 4

Portable APIs: BlobStore, LoadBalancer, Compute, Table

Embeddable

Provider-Specific Hooks

Over 30 Tested Providers!

Page 5

Who’s integrating?

Page 6

Blob Storage

global name space

key, value with metadata

sites on demand

unlimited size

Page 7

Blob Storage

Set<String> containers = namespacesInMyAccount;

Map<String, InputStream> keyValues = contentsOfContainer;
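
The two views above can be pictured with a small in-memory model. This is only a sketch of the mental model, not jclouds or any provider's API; the class name InMemoryBlobs and its methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical in-memory model of an account; not a jclouds class.
class InMemoryBlobs {
    // container name -> (blob key -> blob bytes)
    private final Map<String, Map<String, byte[]>> account = new TreeMap<>();

    void createContainer(String container) {
        account.putIfAbsent(container, new TreeMap<>());
    }

    void put(String container, String key, byte[] content) {
        createContainer(container);
        account.get(container).put(key, content.clone());
    }

    // Corresponds to: Set<String> containers = namespacesInMyAccount
    Set<String> containers() {
        return account.keySet();
    }

    // Corresponds to: Map<String, InputStream> keyValues = contentsOfContainer
    Map<String, InputStream> contentsOf(String container) {
        Map<String, InputStream> view = new TreeMap<>();
        account.getOrDefault(container, new TreeMap<>())
               .forEach((key, bytes) -> view.put(key, new ByteArrayInputStream(bytes)));
        return view;
    }
}
```

A real blobstore adds metadata, locations, and network transfer on top of this shape, but the container-then-key lookup is the same idea.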

Page 8

Blob Storage

[Diagram: account adrian@googlestorage with containers "Love Letters" and "Movies"; blobs Tron, Goonies, The Blob, Shrek, The One; putBlob with metadata 3d = true, url = http://disney.go.com/tron]

Page 9

java overview github jclouds/jclouds

// init
BlobStoreContext context = new BlobStoreContextFactory().createContext("s3", accesskeyid, secret);
BlobStore blobStore = context.getBlobStore();

// create container (null location = provider default)
blobStore.createContainerInLocation(null, "adriansmovies");

// add blob
Blob blob = blobStore.blobBuilder("sushi.avi").payload(file).build();
blobStore.putBlob("adriansmovies", blob);

Page 10

clojure overview github jclouds/jclouds

(use 'org.jclouds.blobstore2)

(def *blobstore* (blobstore "azureblob" account key))

(create-container *blobstore* "movies")
(put-blob *blobstore* "movies" (blob "tron.mp4" :payload tron-file))

Page 11

Big data pipelines with Scale-out on the cloud

@tiborkisstibor

Page 12

bioinformatic pipelines

Usually require high CPU

Continuously increasing data volumes

Complex algorithms on top of large datasets

Page 13

bioinformatics SaaS

Page 14

challenges of SaaS building

Hadoop cluster startup/shutdown
 - Cluster starting problems
 - Automatic cluster shutdown strategies

Hadoop cluster monitoring on the cloud
 - System monitoring
 - Consumption based monitoring

Data transfer paths
 - AWS Import -> S3 -> hdfs -> S3 -> AWS Export
 - ACL settings for client's buckets
 - S3 <=> hdfs transfers

Page 15

where did we start?

30GB file @max 16MB/s upload to S3

32 minutes

1TB file @max 16MB/s upload to S3

18.2 hours

Page 16

where did we end up?

30GB file @max 100MB/s upload to S3

5 minutes (was 32)

1TB file @max 100MB/s upload to S3

2.9 hours (was 18.2)
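
The before and after figures can be checked with a few lines of arithmetic. The sketch below assumes binary units (1 GB = 1024 MB) and that the larger file is 1 TB, which is the size the quoted 18.2 and 2.9 hours correspond to; the class and method names are made up for illustration.

```java
class TransferTime {
    static final long MB = 1024L * 1024L;
    static final long GB = 1024L * MB;
    static final long TB = 1024L * GB;

    /** Seconds needed to move sizeBytes at a sustained rate given in (binary) MB per second. */
    static double seconds(long sizeBytes, double mbPerSecond) {
        return sizeBytes / (mbPerSecond * MB);
    }

    public static void main(String[] args) {
        // Starting point: 16MB/s
        System.out.println("30GB @ 16MB/s, minutes: " + seconds(30 * GB, 16) / 60);
        System.out.println("1TB  @ 16MB/s, hours:   " + seconds(TB, 16) / 3600);
        // End point: 100MB/s
        System.out.println("30GB @ 100MB/s, minutes: " + seconds(30 * GB, 100) / 60);
        System.out.println("1TB  @ 100MB/s, hours:   " + seconds(TB, 100) / 3600);
    }
}
```

At 16MB/s this gives 32 minutes and roughly 18.2 hours; at 100MB/s it gives roughly 5 minutes and 2.9 hours.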

Page 17

How did we get there?

Add multi-part upload support

Optimize slicing

Optimize parallel upload strategy

Find big guns

Page 18

Multi-Part upload

Large Blobs cannot be sent in a single request in most BlobStores (e.g. 5GB max in S3).

Large transfers are likely to fail at inconvenient points, with no way to resume.

Multi-part uploads allow you to send slices of a payload, which the server assembles later.
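
The "server assembles later" step can be sketched in memory. This is an illustrative stand-in, not S3's or jclouds' implementation; MultipartAssembly and its methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the server side of a multi-part upload.
class MultipartAssembly {
    // part number -> bytes; TreeMap keeps the parts sorted by part number
    private final Map<Integer, byte[]> parts = new TreeMap<>();

    /** Parts may arrive in any order; re-sending a part number replaces the earlier attempt. */
    void uploadPart(int partNumber, byte[] data) {
        parts.put(partNumber, data.clone());
    }

    /** "Complete" step: concatenate the parts in part-number order. */
    byte[] complete() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] part : parts.values()) {
            out.write(part, 0, part.length);
        }
        return out.toByteArray();
    }
}
```

Because each part is addressed by number, a failed part can be retried on its own instead of restarting the whole transfer.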

Page 19

Slicing

Each upload part must advance to the appropriate position in the source payload efficiently.

Payload slice(Payload input, long offset, long length);

ex. NettyPayloadSlicer uses ChunkedFileInputStream
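
For a file-backed payload, "advance efficiently" can mean a positional read rather than skipping through the stream. The sketch below uses only the JDK and is not how NettyPayloadSlicer works internally; FileSlicer is a hypothetical name, and it takes a Path and an int length instead of a jclouds Payload to stay self-contained.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: reads one slice of a file by offset and length.
class FileSlicer {
    /** Returns up to `length` bytes starting at `offset`, without reading the bytes before it. */
    static byte[] slice(Path file, long offset, int length) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate(length);
            long position = offset;
            while (buffer.hasRemaining()) {
                int read = channel.read(buffer, position); // positional read
                if (read < 0) {
                    break; // end of file reached before `length` bytes
                }
                position += read;
            }
            buffer.flip();
            byte[] out = new byte[buffer.remaining()];
            buffer.get(out);
            return out;
        }
    }
}
```

A production slicer would stream each slice rather than buffer it in memory; the point here is only that each part can start at its own offset.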

Page 20

Slicing Algorithm

A Blob can be sliced into a maximum number of parts, and these parts have min and max sizes.

up to 3.2GB, converge on 32MB parts

then increase part size approaching max (5GB)

then continue at max part size or overflow
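
The three phases above can be sketched as a part-size chooser. This is an interpretation of the slide, not the jclouds source: the target of 100 parts (3.2GB / 32MB), the 10,000-part limit (S3's documented maximum), and treating "overflow" as an error are all assumptions, and SlicePlanner is a hypothetical name.

```java
// Hypothetical part-size planner following the three phases on the slide.
class SlicePlanner {
    static final long MB = 1024L * 1024L;
    static final long DEFAULT_PART_SIZE = 32 * MB;     // "32MB parts" from the slide
    static final long MAX_PART_SIZE = 5L * 1024 * MB;  // 5GB max part size from the slide
    static final long PREFERRED_PARTS = 100;           // assumption: ~3.2GB / 32MB
    static final long MAX_PARTS = 10_000;              // assumption: S3's part-count limit

    /** Chooses a part size for a blob of the given length in bytes. */
    static long partSize(long length) {
        if (length <= 0) {
            throw new IllegalArgumentException("length must be positive");
        }
        // Phase 1: small blobs use fixed 32MB parts.
        if (length <= DEFAULT_PART_SIZE * PREFERRED_PARTS) {
            return DEFAULT_PART_SIZE;
        }
        // Phase 2: grow the part size so the count stays near PREFERRED_PARTS,
        // Phase 3: but never beyond the maximum part size.
        long grown = (length + PREFERRED_PARTS - 1) / PREFERRED_PARTS; // ceiling division
        long size = Math.min(grown, MAX_PART_SIZE);
        long parts = (length + size - 1) / size;
        if (parts > MAX_PARTS) {
            throw new IllegalArgumentException("overflow: blob needs more than " + MAX_PARTS + " parts");
        }
        return size;
    }

    /** Number of parts the blob is split into, the last one possibly shorter. */
    static long partCount(long length) {
        long size = partSize(length);
        return (length + size - 1) / size;
    }
}
```

Under these assumptions a 1GB blob is sent as 32 parts of 32MB, while a 1TB blob hits the 5GB cap and is sent as 205 parts.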

Page 21

Upload Strategy

Start sequential, stabilize, then parallelize

SequentialMultipartUploadStrategy
Simpler, less likely to fail, easier to retry, little to optimize outside chunk size

ParallelMultipartUploadStrategy
Much better throughput, but need to optimize degree, retries & error handling
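
The trade-off between the two strategies can be sketched with plain java.util.concurrent. This is not the jclouds strategy code; PartUploader is a hypothetical stand-in for "upload one part and return its ETag", and `degree` and `maxAttempts` are illustrative knobs for the tuning the slide mentions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelUploadSketch {
    /** Hypothetical stand-in for uploading one part and returning its ETag. */
    interface PartUploader {
        String upload(int partNumber) throws Exception;
    }

    /** Retries a single part up to maxAttempts times before giving up. */
    static String uploadWithRetry(PartUploader uploader, int part, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return uploader.upload(part);
            } catch (Exception e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }

    /** Sequential strategy: one part at a time, in order. */
    static List<String> uploadSequentially(PartUploader uploader, int parts, int maxAttempts) throws Exception {
        List<String> etags = new ArrayList<>();
        for (int part = 1; part <= parts; part++) {
            etags.add(uploadWithRetry(uploader, part, maxAttempts));
        }
        return etags;
    }

    /** Parallel strategy: up to `degree` parts in flight; results are kept in part order. */
    static List<String> uploadInParallel(PartUploader uploader, int parts, int degree, int maxAttempts) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(degree);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int p = 1; p <= parts; p++) {
                final int part = p;
                Callable<String> task = () -> uploadWithRetry(uploader, part, maxAttempts);
                futures.add(pool.submit(task));
            }
            List<String> etags = new ArrayList<>();
            for (Future<String> future : futures) {
                etags.add(future.get()); // a part that exhausted its retries surfaces here
            }
            return etags;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The sequential version has nothing to tune beyond the part size, while the parallel one adds the thread count and failure handling as extra things to get right.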

Page 22

Page 23

What’s the top-speed?

Page 24

Is this as good as it gets?

10GigE should be able to do 1280MB/s

cc1.4xlarge has been measured up to ~560MB/s local

but we’re only getting ~100MB/s sustained
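
For scale, the numbers above work out as follows, assuming the slide counts 10 Gbit/s as 10 x 1024 Mbit/s and 8 bits per byte:

```latex
\[
\frac{10 \times 1024\ \text{Mbit/s}}{8\ \text{bit/byte}} = 1280\ \text{MB/s},
\qquad
\frac{100\ \text{MB/s}}{1280\ \text{MB/s}} \approx 7.8\%,
\qquad
\frac{100\ \text{MB/s}}{560\ \text{MB/s}} \approx 17.9\%
\]
```

So the sustained rate is under a tenth of the nominal link speed and under a fifth of the measured local rate.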

Page 25

So, where do we go now?

zero copy transfer

more work on slice algorithms

tools and integrations (ex. hdfs)

add implementations for other blobstores

Page 26

Wanna play?

blobStore.putBlob("movies", blob, multipart());

(put-blob *blobstore* "movies" blob :multipart? true)

or just visit github jclouds-examples blobstore-largeblob blobstore-hdfs

Page 27

github jclouds-examples

@jclouds

Questions?
