Automated ShardedMongoDB Deployment and Benchmarking for ... · Automated ShardedMongoDB Deployment...

Post on 26-Jun-2018

Automated Sharded MongoDB Deployment and Benchmarking for Big Data Analysis

Gregor von Laszewski, Mark McCombe
Indiana University
Intelligent Systems Engineering Department

laszewski@gmail.com

Acknowledgement

• This study has been conducted as part of the I524 class with the topic Big Data and Software Projects
• The class used the following resources
  • Students' computers
  • FutureSystems (DSC @ Indiana University), a continuation of FutureGrid (NSF)
  • Chameleon Cloud (NSF): Project CH-818664, KVM
  • Jetstream (NSF)
• Some students also elected to use
  • AWS
  • Azure
• All resources, as far as we know, were provided to us for free.

Outline

• Motivation for the project
  • IU educates data scientists
• Sharded MongoDB deployment
• Benchmarks
• Usage Observations
• Conclusion
  • Why we did not do a large scale study…
  • Implication for future classes…

Data Scientist

Analysis
• Statistics
• Machine Learning
• Optimization

Programming
• Python, JavaScript
• Distributed Comp.
• Cloud Programming

Infrastructure
• Cloud Computing
• Distributed Systems
• DevOps

Visualization
• Basic Skills
• Customize for Data Set

Domain Knowledge

Communication
• Paper write-up
• Online Publication

• Requires integrated knowledge in several key areas. We use a project that addresses:
  • Communication
  • Analysis
  • Visualization
  • Programming
  • Infrastructure
  • Domain knowledge
• Education programs need to address all of them

Project: Sharded MongoDB Deployments

Continuous Improvement vs. Continuous Deployment via DevOps

Diagram labels: design & modification; Cloudmesh script; deployment; data; execution; verification; continuous improvement

• DevOps is integrated
• Leads to improvement when targeting not only the application but also the deployment environment.

Cloudmesh Shell – Make Booting Simple

$ emacs cloudmesh.yaml
$ cms default cloud=NAME
$ cms default image=NAME
$ cms default flavor=NAME
$ cms vm boot
$ cms vm login
$ cms vm delete

• cloudmesh.yaml
• Prepare defaults
• Boot
• Login
• Management …

Cloudmesh Shell – Manage Hybrid Clouds

$ cms aws boot
$ cms vm boot

$ cms default cloud=chameleon
$ cms vm boot

$ cms default cloud=IUCloud
$ cms vm boot

• Boot Cloud A
• Boot Cloud B
• Boot Cloud C

Cloudmesh Shell – Create a Hadoop Cluster

$ cm default cloud=chameleon
$ cm cluster define --count=10 --flavor=m1.large
$ cm hadoop define spark
$ cm hadoop sync      # ~30 sec
$ cm hadoop deploy    # ~7 min

• Set cloud
• Define cluster
• Define hadoop cluster
• Sync definition to db
• Deploy the cluster

Cloudmesh Shell – Create a Hadoop Cluster

$ cm default cloud=IUCloud
$ cm cluster define --count=10 --flavor=m1.large
$ cm nist fingerprint    # ~30 min

• Set cloud
• Define cluster
• Run NIST use case

Additional resources: https://github.com/cloudmesh/classes/blob/master/docs/source/notebooks/fingerprint_matching.ipynb

MongoDB Features

• Document oriented NoSQL database
  • JSON-like documents
  • Specified through (optional) schemas
• Cross-platform compatible
• Free open source
• NoSQL = data that is modeled by means other than the tabular relations used in relational databases.
• Ad-hoc queries
• Indexing
• Replication
• Load Balancing with Sharding
• File Storage
• Aggregation
• Server Side JavaScript
• Capped collections

MongoDB - Sharding

• User selects a shard key that determines how the data in a collection will be distributed.
  • data is split into ranges (based on the shard key)
  • distributed across multiple shards.
• (a) a shard is a master with one or more slaves.
• (b) or the shard key can be hashed to map to a shard, allowing even data distribution.
• MongoDB can run over multiple servers,
  • balancing the load
  • duplicating data for fault tolerance
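
The difference between range-based and hashed shard keys can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the split points are hypothetical, and the hashed variant maps keys with MD5 modulo the shard count rather than reproducing MongoDB's actual chunk mechanism.

```python
import hashlib
from bisect import bisect_right
from collections import Counter


def ranged_shard(key, boundaries):
    """Return the shard index for a key under range-based partitioning.

    boundaries holds the sorted split points: keys below boundaries[0]
    go to shard 0, keys from boundaries[0] up to boundaries[1] go to
    shard 1, and so on.
    """
    return bisect_right(boundaries, key)


def hashed_shard(key, num_shards):
    """Return the shard index for a key under (simplified) hash partitioning."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def distribution(assign, keys):
    """Count how many keys land on each shard."""
    return Counter(assign(k) for k in keys)


# Monotonically increasing keys, for example an auto-incrementing id.
keys = range(9000, 10000)
boundaries = [2500, 5000, 7500]  # hypothetical split points for 4 shards

ranged = distribution(lambda k: ranged_shard(k, boundaries), keys)
hashed = distribution(lambda k: hashed_shard(k, 4), keys)

print("ranged:", dict(ranged))  # every key falls into the last range
print("hashed:", dict(hashed))  # keys are spread over all four shards
```

With increasing keys, the range-based assignment sends all new documents to a single shard, while the hashed assignment spreads them roughly evenly, which is the "even data distribution" mentioned in the slide.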

Architecture

Benchmarks on Clouds

• Three clouds were selected for deployment:
  • Chameleon Cloud
  • FutureSystems
  • Jetstream
• Goal
  • Compare the performance of multiple clouds, within the allocation limitations of a class, by varying a number of parameters.
• Scripted Deployments
  • We developed an automated, scripted deployment and benchmarking process
  • cloud name is passed as a parameter
  • Customization for the deployment of MongoDB is passed via command line
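
A command-line interface of this kind might look like the following sketch. The short flags -c, -m, -s and -r are the ones shown in the deployment-time table of these slides; the long option names, the defaults, and the parser itself are assumptions made for illustration and are not the project's actual script.

```python
import argparse

CLOUDS = ["chameleon", "futuresystems", "jetstream"]


def build_parser():
    """Build a parser for a hypothetical deployment script."""
    parser = argparse.ArgumentParser(
        description="Deploy a sharded MongoDB cluster on a named cloud (sketch)."
    )
    # The cloud name is passed as a positional parameter.
    parser.add_argument("cloud", choices=CLOUDS, help="target cloud")
    # MongoDB customization is passed via command-line options.
    parser.add_argument("-c", "--config", type=int, default=1,
                        help="number of config servers")
    parser.add_argument("-m", "--mongos", type=int, default=1,
                        help="number of mongos routers")
    parser.add_argument("-s", "--shards", type=int, default=1,
                        help="number of shards")
    parser.add_argument("-r", "--replicas", type=int, default=1,
                        help="replicas per shard")
    return parser


# Deployment B from the slides: 3 config servers, 2 mongos, 3 shards, 3 replicas.
args = build_parser().parse_args(
    ["chameleon", "-c", "3", "-m", "2", "-s", "3", "-r", "3"]
)
print(args.cloud, args.config, args.mongos, args.shards, args.replicas)
```

Keeping the cloud name as a single parameter lets the same script run unchanged on each of the three clouds.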

Cloud Comparison

                 FutureSystems   Chameleon     Jetstream
CPU              Xeon E5-2670    Xeon X5550    Haswell E-2680
Cores            1024            1008          7680
Speed            2.66GHz         2.3GHz        2.5GHz
RAM              3072GB          5376GB        40TB
Storage          335TB           2TB           2TB
Deployment year  2010            Early 2015    OS2016

Flavor and OS

• Ubuntu 16.04 LTS (Xenial Xerus) operating system.
• Flavors differ slightly between clouds; we used the most similar ones
  • m1.medium on Chameleon Cloud
  • m1.medium on FutureSystems
  • m1.small was used on Jetstream
• Jetstream flavors have more resources than the same-named flavors on Chameleon and FutureSystems
  • Storage is lower on Jetstream

Cloud          Flavor     VCPU  RAM (GB)  Size (GB)
Chameleon      m1.medium  2     4         40
FutureSystems  m1.medium  2     4         40
Jetstream      m1.small   2     4         20

Requirements

• Resource Requirements
  • 60 users -> VM hours were limited.
• Capability Requirements
  • creation of VMs and the execution of our applications within these VMs
• Monitoring Requirements
  • Monitoring and benchmarking was conducted by hand without need for specialized services.
• New software created
  • improved the cloudmesh client software [5][6][7], essential to the success of the class.
• Performance Comparison
  • We have conducted a significant performance comparison among all clouds.
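
Timing a command without any monitoring service can be done with a few lines of Python. The sketch below is one possible approach, not necessarily how the measurements in this study were taken; the helper name time_command is hypothetical, and a short Python one-liner stands in for a real deployment or query command.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run a command and return (elapsed_seconds, return_code)."""
    start = time.perf_counter()
    completed = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    elapsed = time.perf_counter() - start
    return elapsed, completed.returncode


# Stand-in workload: sleep for 0.1 seconds in a child Python process.
elapsed, code = time_command(
    [sys.executable, "-c", "import time; time.sleep(0.1)"]
)
print(f"elapsed={elapsed:.2f}s return_code={code}")
```

The measured time includes process start-up, so very short commands are dominated by that overhead.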

Benchmark

• Deployment times
• Comparing MongoDB versions
• Comparing Clouds

Deployment Times

Deployments

• Deployment A
  • a simple deployment with only one of each component being created.
• Deployment B
  • variation in config servers and shards and an additional Mongos instance.
• Deployment C
  • focus on high performance.
  • 9 shards, no replication

Deployment  Config  Mongos  Shards  Replicas  Seconds
A           1       1       1       1         330
B           3       2       3       3         1059
C           1       1       9       1         719
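
To put the three deployment times on a common scale, one can divide each by the number of MongoDB processes started. The sketch below assumes that a shard with r replicas runs r mongod processes, so the total is config + mongos + shards × replicas; that counting rule is an assumption, not something stated in the slides.

```python
# (config servers, mongos, shards, replicas per shard, seconds) from the table above
deployments = {
    "A": (1, 1, 1, 1, 330),
    "B": (3, 2, 3, 3, 1059),
    "C": (1, 1, 9, 1, 719),
}


def instance_count(config, mongos, shards, replicas):
    """Total processes, assuming each shard runs `replicas` mongod members."""
    return config + mongos + shards * replicas


per_instance = {}
for name, (c, m, s, r, seconds) in deployments.items():
    n = instance_count(c, m, s, r)
    per_instance[name] = seconds / n
    print(f"{name}: {n} instances, {seconds / n:.1f} s per instance")

# A: 3 instances, 110.0 s per instance
# B: 14 instances, 75.6 s per instance
# C: 11 instances, 65.4 s per instance
```

Under this counting rule the time per process drops as the deployment grows, which would be consistent with a fixed start-up cost shared by all components.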

Varying Other Deployment Times

Config Servers (-c)  Mongos (-m)  Shards (-s)  Replicas (-r)  Time in Seconds
5                    1            1            1              534
1                    5            1            1              556
1                    1            5            1              607
1                    1            1            5              524
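
A rough marginal cost per additional component follows from these rows if the 330-second run of Deployment A (one of each component) is taken as the baseline. Treating that run as directly comparable is an assumption; the sketch only does the arithmetic.

```python
baseline_seconds = 330  # Deployment A: one of each component

# Each run raises exactly one parameter from 1 to 5 (times from the table above).
runs = {
    "config servers (-c)": 534,
    "mongos (-m)": 556,
    "shards (-s)": 607,
    "replicas (-r)": 524,
}

extra_components = 5 - 1  # four components added on top of the baseline

marginal = {
    name: (seconds - baseline_seconds) / extra_components
    for name, seconds in runs.items()
}

for name, cost in marginal.items():
    print(f"{name}: about {cost:.2f} s per additional component")

# config servers (-c): about 51.00 s per additional component
# mongos (-m): about 56.50 s per additional component
# shards (-s): about 69.25 s per additional component
# replicas (-r): about 48.50 s per additional component
```

By this estimate an extra shard is the most expensive component to add and an extra replica the cheapest.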

Data

• Major League Baseball PITCHf/x data obtained by using the program Baseball on a Stick (BBOS).
• BBOS is a Python program created by "willkoky" and hosted on sourceforge.net, which extracts data from mlb.com and loads it into a MySQL database.
• Data was captured locally to the default MySQL database and then extracted to a CSV file
• The CSV file was imported
  • Contains 5,508,014 rows and 61 columns. 1.58GB in size uncompressed.
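
The benchmarks in this study load the CSV file with the mongoimport command. As a swapped-in illustration of what such an import does conceptually, the sketch below reads a CSV stream in fixed-size batches of row dictionaries; the helper csv_batches and the three column names are hypothetical, and the tiny in-memory sample only stands in for the real 61-column file.

```python
import csv
import io
from itertools import islice


def csv_batches(handle, batch_size):
    """Yield lists of row dictionaries, at most batch_size rows each."""
    reader = csv.DictReader(handle)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# Tiny stand-in for the 5,508,014-row export; the column names are hypothetical.
sample = io.StringIO(
    "pitch_id,pitch_type,start_speed\n"
    "1,FF,93.1\n"
    "2,SL,85.4\n"
    "3,CH,84.0\n"
)

batches = list(csv_batches(sample, batch_size=2))
print([len(b) for b in batches])  # [2, 1]
print(batches[0][0])  # first row as a dictionary; all values are strings
```

Each batch is a list of dictionaries, the shape a bulk insert call such as pymongo's `insert_many` accepts; batching keeps memory use bounded for a file of this size.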

Version Comparison

Version Comparison: 3.2 vs 3.4 (Chameleon Cloud)

Figure panels: Find Command, Mongoimport Command

Version Comparison: 3.2 vs 3.4 MapReduce (Chameleon Cloud)

Figure panel: MapReduce

Result: not too many changes

Cloud Comparison Sharding Test

Figure 2: Mongoimport Command - Sharding Test

Figure 1: Find Command - Sharding Test

• Chameleon – Jetstream
  • Same
• FutureSystems
  • Acceptable results with higher number of shards

Figure 3: MapReduce - Sharding Test

• Chameleon – Jetstream
  • Same
• FutureSystems
  • Significantly worse

Figure 4: Find Command - Replication Test

• Replication
  • Chameleon Cloud seems to perform slightly better
  • FutureSystems performs surprisingly well

Cloud Comparison Replication Test

Figure 5: Mongoimport Command - Replication Test

• Chameleon – Jetstream
  • Same
• FutureSystems
  • Significantly worse

Figure 6: MapReduce - Replication Test

• Chameleon
  • Slightly better than Jetstream
• FutureSystems
  • Significantly worse

Conclusion

• Jetstream and Chameleon Cloud are essentially the same.
  • In some instances Chameleon Cloud performs slightly better
  • (disks/network…)
• As expected, FutureSystems is an older machine and does not perform as well
  • For some queries FutureSystems is surprisingly good
• Experiments were limited by the number of node hours for 60 students in the class.
  • After the class is over, there is no time to run larger examples
  • It is not obvious for a teacher when to give larger allocations to a student who performs well.
  • Allocation process broken
  • FutureSystems allocation process is superior