Java Tech & Tools | Big Blobs: Moving Big Data In and Out of the Cloud | Adrian Cole
Big Blobs: Moving Big Data In and Out of the Cloud
Adrian Cole / Cloudsoft
Wednesday, November 2, 2011
Adrian Cole (@jclouds)
founded jclouds March 2009
chief evangelist at Cloudsoft
Agenda
• intro to jclouds blobstore
• Omixon case study
• awkward silence (or Q/A)
Portable APIs: BlobStore, Compute, LoadBalancer, Table
• Embeddable
• Provider-Specific Hooks
• Over 30 Tested Providers!
Who’s integrating?
Blob Storage
• global name space
• key, value with metadata
• sites on demand
• unlimited size
Blob Storage
Set<String> containers = namespacesInMyAccount;
Map<String, InputStream> keyValues = contentsOfContainer
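
In code, those two shapes map onto the BlobStore API roughly as follows (a minimal sketch; blobStore is created as on the Java overview slide, and the container name is a placeholder):

import java.io.InputStream;
import org.jclouds.blobstore.BlobStore;
import org.jclouds.blobstore.domain.Blob;
import org.jclouds.blobstore.domain.StorageMetadata;

// the Set<String>: container names in the account's namespace
for (StorageMetadata container : blobStore.list())
    System.out.println(container.getName());

// the Map<String, InputStream>: keys resolving to streamable payloads
for (StorageMetadata entry : blobStore.list("adriansmovies")) {
    Blob blob = blobStore.getBlob("adriansmovies", entry.getName());
    InputStream value = blob.getPayload().getInput();
}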
Blob Storage
[diagram: the account adrian@googlestorage holds containers "Love Letters" and "Movies"; "Movies" contains the blobs Goonies, The Blob, Shrek, and The One; a putBlob arrow adds "Tron" with metadata 3d = true, url = http://disney.go.com/tron]
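
Expressed against the jclouds API, that putBlob arrow might look like this (a hedged sketch; the names come from the diagram, and ImmutableMap is from Guava, which jclouds bundles):

import com.google.common.collect.ImmutableMap;
import org.jclouds.blobstore.domain.Blob;

Blob tron = blobStore.blobBuilder("tron")
    .payload(tronFile)                        // e.g. a File or InputStream
    .userMetadata(ImmutableMap.of(
        "3d", "true",                         // arbitrary key/value metadata
        "url", "http://disney.go.com/tron"))  // stored alongside the blob
    .build();
blobStore.putBlob("movies", tron);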
Java overview (github: jclouds/jclouds)
// init
context = new BlobStoreContextFactory().createContext("s3", accesskeyid, secret);
blobStore = context.getBlobStore();

// create container
blobStore.createContainerInLocation(null, "adriansmovies");

// add blob
blob = blobStore.blobBuilder("sushi.avi").payload(file).build();
blobStore.putBlob("adriansmovies", blob);
Clojure overview (github: jclouds/jclouds)
(use 'org.jclouds.blobstore2)
(def *blobstore* (blobstore "azureblob" account key))

(create-container *blobstore* "movies")
(put-blob *blobstore* "movies" (blob "tron.mp4" :payload tron-file))
Big data pipelines with scale-out on the cloud
@tiborkisstibor
Bioinformatic pipelines
• usually require high CPU
• continuously increasing data volumes
• complex algorithms on top of large datasets
Bioinformatics SaaS
Challenges of SaaS building
• Hadoop cluster startup/shutdown
  - Cluster starting problems
  - Automatic cluster shutdown strategies
• Hadoop cluster monitoring on the cloud
  - System monitoring
  - Consumption-based monitoring
• Data transfer paths
  - AWS Import -> S3 -> HDFS -> S3 -> AWS Export
  - ACL settings for clients' buckets
  - S3 <=> HDFS transfers
Where did we start?
30GB file @ max 16MB/s upload to S3 → 32 minutes
1TB file @ max 16MB/s upload to S3 → 18.2 hours
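
A quick sanity check of those numbers (treating GB, TB, and MB as binary units):

\[
\frac{30 \times 1024\,\text{MB}}{16\,\text{MB/s}} = 1920\,\text{s} \approx 32\,\text{min},
\qquad
\frac{1024 \times 1024\,\text{MB}}{16\,\text{MB/s}} = 65536\,\text{s} \approx 18.2\,\text{h}
\]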
Where did we end up?
30GB file @ max 100MB/s upload to S3 → 5 minutes (was 32)
1TB file @ max 100MB/s upload to S3 → 2.9 hours (was 18.2)
How did we get there?
Add multi-part upload support
Optimize slicing
Optimize parallel upload strategy
Find big guns
Multi-Part upload
Large blobs cannot be sent in a single request in most blobstores (ex. 5GB max in S3).
Large transfers are likely to fail at inconvenient positions, and without resume.
Multi-part uploads allow you to send slices of a payload, which the server assembles later.
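
jclouds exposes this through PutOptions, as the closing slide shows; a minimal sketch (container and file names are placeholders, and blobStore is set up as in the Java overview):

import static org.jclouds.blobstore.options.PutOptions.Builder.multipart;

import java.io.File;
import org.jclouds.blobstore.domain.Blob;

Blob blob = blobStore.blobBuilder("sushi.avi")
    .payload(new File("sushi.avi"))
    .build();
// multipart() makes putBlob slice the payload and upload the parts,
// which the provider assembles server-side
blobStore.putBlob("adriansmovies", blob, multipart());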
Slicing
Each upload part must advance to the appropriate position in the source payload efficiently.
Payload slice(Payload input, long offset, long length);
ex. NettyPayloadSlicer uses ChunkedFileInputStream
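
The key constraint is that slicing must seek to the offset rather than stream past it. A naive, file-backed illustration (a hypothetical helper, not the jclouds PayloadSlicer):

import java.io.File;
import java.io.RandomAccessFile;

static byte[] slice(File input, long offset, int length) throws Exception {
    try (RandomAccessFile raf = new RandomAccessFile(input, "r")) {
        raf.seek(offset);            // O(1) jump to the part boundary...
        byte[] part = new byte[length];
        raf.readFully(part);         // ...rather than reading offset bytes first
        return part;
    }
}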
Slicing Algorithm
A blob can be sliced into a maximum number of parts, and these parts have min and max sizes.
up to 3.2GB, converge on 32MB parts
then increase part size approaching max (5GB)
then continue at max part size or overflow
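
A simplified sketch of such a sizing policy (illustrative only, not the actual jclouds code; the constants mirror the slide and S3's 5GB part limit, and the 100-part figure follows from 3.2GB / 32MB):

static long partSize(long blobSize) {
    final long MB = 1L << 20, GB = 1L << 30;
    final long DEFAULT_PART = 32 * MB;    // "converge on 32MB parts"
    final long MAX_PART = 5 * GB;         // S3's maximum part size
    if (blobSize <= 100 * DEFAULT_PART)   // up to 3.2GB: keep fixed 32MB parts
        return DEFAULT_PART;
    long grown = blobSize / 100;          // then grow the part size with the blob
    return Math.min(grown, MAX_PART);     // continue at max part size once 5GB is reached
}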
Upload Strategy
Start sequential, stabilize, then parallelize
SequentialMultipartUploadStrategy
Simpler, less likely to fail, easier to retry, little to optimize outside chunk size
ParallelMultipartUploadStrategy
Much better throughput, but need to optimize degree, retries & error handling
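
A hedged sketch of what the parallel strategy amounts to (simplified: fixed degree, no retry or error handling; uploadPart is a hypothetical stand-in for the per-part PUT):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

static String uploadPart(long offset, long length) throws Exception {
    return ""; // stub: a real implementation PUTs one slice and returns its ETag
}

static void parallelUpload(long blobSize, long partSize, int degree) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(degree); // the "degree" to tune
    List<Future<String>> etags = new ArrayList<>();
    for (long offset = 0; offset < blobSize; offset += partSize) {
        long off = offset;
        long len = Math.min(partSize, blobSize - off);
        etags.add(pool.submit(() -> uploadPart(off, len))); // one task per slice
    }
    for (Future<String> etag : etags)
        etag.get(); // a real strategy retries failed parts instead of just blocking
    pool.shutdown();
}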
What’s the top-speed?
Is this as good as it gets?
10GigE should be able to do 1280MB/s
cc1.4xlarge has been measured up to ~560MB/s local
but we’re only getting ~100MB/s sustained
So, where do we go now?
zero copy transfer
more work on slice algorithms
tools and integrations (ex. hdfs)
add implementations for other blobstores
Wanna play?

blobStore.putBlob("movies", blob, multipart());

(put-blob *blobstore* "movies" blob :multipart? true)

or just visit the jclouds-examples repo on github (blobstore-largeblob, blobstore-hdfs)