Making the Most of Infrastructure as a Service

E.J. Daly

CTO, Creme Global

2014-02-27

Creme Global

Scaling Resources

Business and Management

Cloud Computing and IaaS

Migrate to the Cloud?

EU FP5 Monte Carlo Project, 1999

CREME Project, 2002-2005

Creme Software Ltd, formed 2005

By 2007…

HQ: Trinity College, Lloyd Building

Team: 4 People (MSc HPC Graduates)

Since then…

• Consistently listed amongst the fastest growing technology companies in Ireland
• Deloitte “Fast 50”: 2010: 14th; 2011: 9th; 2012: 20th; 2013: 35th
• “Organic” growth

Today…

HQ: Trinity Technology & Enterprise Campus

Team: 23 Full Time Staff (Software Engineers, Quality Assurance, Maths Modellers, Statisticians, Food Scientists, Nutritionists)

What exactly does Creme Global do?

Predictive Intake Modelling

We give decision makers access to the right data, models and expertise in a form that they can understand.

We build models and software to calculate consumer exposure to substances (chemicals, flavorings, fragrances, contaminants) present in food, cosmetics, packaging and the environment.

These analyses enable decision makers to set regulatory limits based on the real consumer exposure.

Creme Global - Services

• High Performance Cloud Software
• Technical Services & Projects
• Data Validation & Curation

Cloud Software, Data Curation, Technical Services

Value Chain

• Primary Data Generation (research, labs, innovation)
• Analysis of Data -> Information (scenarios, risk)
• Decisions (Policy, Regulation, Investment)

• Complex Data, Large Volumes
• Accurate and Trusted Results
• Better Decisions and Confidence

Creme Global

Creme Global - Benefits

• Proactively Protecting Consumer Health
• Understand Exposure Assessment
• Better Decisions

Limitations of Traditional Methods

• Large investments in collecting data have been made
• Data sets are reduced to a few basic statistics to make exposure estimates for regulatory purposes
• Exposure estimates are assumed to be conservative
  • Level of conservatism is actually unknown
• Results are not accurate or realistic
  • Exposure estimates can be incorrect by an order of magnitude

Risk Analysis and The Flaw of Averages

Image: www.flawofaverages.com

Expert Models (Probabilistic & Deterministic)

Detailed Product Usage Information

Occurrence Data

Creme Global Methods

Consumer Exposure

Creme Global Methods

• Scientifically validated models of consumer exposure and risk assessment
  • As called for by FDA, EFSA, SCCS, USDA, FSA, etc.
• Use all the available real data in the exposure model
  • Retains relationship between intakes and key factors
• Aggregate Exposure from multiple sources
  • Assess substances from multiple products
• Cumulative Exposure from multiple substances / chemicals simultaneously from all sources
  • Assess full formulations

Creme Global: Probabilistic Modelling

• In an ideal world, we would have access to complete exposure data for everyone:
  • How much do they consume?
  • How often?
  • Which products?
  • What is the exact chemical concentration in these products?

• This detailed data would enable a (relatively) straightforward calculation of population exposure

• In reality, data is only available for a relatively small proportion of the population.

• The software developed by Creme enables estimation of the actual population exposure from this limited data.

Creme Database Tables

• Subjects: Subject demographics
• Consumption: Products and Foods consumed
• Brands: Market Shares, Brand Loyalties
• Groups: Recipe and Food Groups Info
• Processing: Potential processing factors
• Substances: Substance / Chemical concentrations in products / foods
• Endpoints: e.g. ARfD, ADI
• Correlations: Information on correlated variables

• The software creates a large simulated population based on the observed data, using probabilistic modelling (Monte Carlo)

• The simulated population has the same usage patterns and habits as the real population

• This simulated population is used to represent the real population
• Exposure statistics are calculated for the simulated population


• Example: Dermal Exposure to Fragrance Compounds from a Cosmetics Product

Dermal Exposure (mg/cm²/day) = (F × A × C × R) / S

F = Frequency of Use (of Cosmetic Product)
A = Amount Per Use
C = Chemical Concentration
R = Retention Factor
S = Skin Surface Area
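A minimal sketch of the dermal exposure formula for a single subject. The function name, the units assumed for each input, and the example numbers are illustrative assumptions; they are not taken from Creme Global's software.

```python
def dermal_exposure(frequency, amount, concentration, retention, surface_area):
    """Dermal exposure = (F x A x C x R) / S.

    Assumed units (illustrative only): frequency in uses/day, amount in mg/use,
    concentration and retention as fractions between 0 and 1, surface_area in cm2.
    The result is then in mg/cm2/day.
    """
    if surface_area <= 0:
        raise ValueError("surface_area must be positive")
    return (frequency * amount * concentration * retention) / surface_area


# Hypothetical subject: 2 uses/day, 1000 mg per use, 1% fragrance compound,
# 10% retained on the skin, applied over 500 cm2.
example = dermal_exposure(2, 1000, 0.01, 0.10, 500)
print(example)
```

Evaluating the formula once like this gives a single deterministic estimate; the probabilistic approach repeats it over many simulated subjects.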



These values are not available for everyone in the population. We gather information for each parameter from available data collection sources (surveys / studies):

• Freq. of Use: Survey of 36,000 EU/US consumers (1.2 million recorded events)
• Amount per Use: Surveys of between 360 and 500 people
• Chem. Conc.: Fragrance and Cosmetics Manufacturers
• Retention Factors: Expert Opinion
• Surface Area: US EPA
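A rough sketch of how the Monte Carlo step might combine these parameters: each simulated subject draws F, A, C and S from a distribution and the exposure formula is applied per subject. Every distribution and number below is a made-up placeholder; a real model would sample from the survey and manufacturer data listed above.

```python
import random
import statistics


def simulate_exposures(n_subjects, seed=0):
    """Simulate dermal exposure (F x A x C x R / S) for a synthetic population."""
    rng = random.Random(seed)
    exposures = []
    for _ in range(n_subjects):
        frequency = rng.choice([0.5, 1, 1, 2, 3])   # uses/day, resampled "survey" values
        amount = rng.lognormvariate(6.5, 0.5)       # mg/use, assumed skewed distribution
        concentration = rng.uniform(0.001, 0.02)    # fraction of product
        retention = 0.1                             # fixed value standing in for expert opinion
        surface_area = rng.uniform(400, 800)        # cm2
        exposures.append(frequency * amount * concentration * retention / surface_area)
    return exposures


def percentile(values, p):
    """Simple empirical percentile (value at the given rank of the sorted sample)."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(p / 100 * len(ordered)))
    return ordered[index]


exposures = simulate_exposures(10_000)
print("mean:", statistics.mean(exposures))
print("95th percentile:", percentile(exposures, 95))
```

The output is a whole distribution of exposures, so high percentiles can be compared against a reference dose instead of relying on a single average.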




[Figure: the distributions of F, A, C and R are multiplied together and divided by S]

[Figure: Model Output. Distribution of Subjects; x-axis: Exposure (mg/kg/day); y-axis: Frequency; Reference Dose marked]

[Figure: Daily Average & Maximum Day. x-axis: Exposure; y-axis: Frequency; ARfD marked]

[Figure: Daily Average & Lifetime Exposure. x-axis: Exposure; y-axis: Frequency]

Creme Global: Food, Cosmetics, Microbial, Crop Protection, Packaging, Pesticides, Nanotech




Cloud Computing

• How is Cloud Computing different to everything else? (Armbrust et al., 2010)
  • The appearance of infinite computing resources
  • The elimination of an up-front commitment by cloud users
  • The ability to pay for use of computing resources as needed (for example processors by the hour, storage by the day)
• Definition from NIST (National Institute of Standards and Technology): “Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.”
• More concisely: Cloud Computing = Internet Services & Pay for what you use

Image: www.jansipke.nl

Using IaaS to provide SaaS

Image: A view of Cloud Computing (Armbrust et al., 2010)

Leading Providers of IaaS

Image: blog.appcore.com





Benefits of IaaS

• High Quality, Reliable, Enterprise Grade Infrastructure
  • Servers
  • Storage
  • Networks
• Reduce waste and inefficiency
• Reduce cost
• Avoid large up-front capital expenditure
• Avoid large in-house maintenance costs
• Rapid scaling possible to fit requirement
• Accessible from anywhere

IaaS Benefits: Scalability

Image: www.techtricksworld.com


Problem 1: Wasted Resources

Problem 2: Losing Customers?

Negatives of IaaS

• Performance
  • Network (matters for High Performance Computing)
  • Disk (performance / throughput is not as good; matters for database-heavy applications)
  • Reliability in terms of performance (performance is not always consistent)
  • Performance of cloud is usually lower than that of dedicated hardware
    • Although some cloud providers can provide higher performance (“pay per performance”)
    • You can sometimes work your way around poor performance (e.g. RAID arrays)
• Cost
  • More expensive for some applications / workflows, e.g. workflows with relatively constant load
  • “Pound for pound” more expensive than Virtual Private Hosting or Colocation
  • The added flexibility and scalability come at a cost

Cloud Computing: Hype Cycle

Entering the “Trough of Disillusionment”?

Moving away from Cloud?

• Recent reports of migrations away from public cloud, to in-house / private clouds: Zynga, HubSpot, MemSQL, Uber, Mixpanel, Tradesy

Eric Frenkiel (MemSQL) estimates that, had the company stuck with Amazon, it would have spent about $900,000 over the next three years. But with physical servers, the cost will be closer to $200,000. (wired.com report, Aug 2013)




• Cloud Computing predicted growth: 36% compound annual growth through 2016 (451 Research)
• Cloud Computing is not perfect for every business / application
• As IaaS matures, some early adopters may start to consider more sophisticated approaches like Hybrid Cloud
  • (In agreement with Gartner Hype Cycle)

When is it a good idea to think about IaaS?

• Start-ups
  • Avoid up-front expenditure on hardware
  • Avoid having a dedicated sys admin function to ensure uptime for clients
  • Flexible, Agile, Lean – easy to ‘pivot’
• Elasticity
  • Not sure about predicted load / usage for the next 12-24 months?
  • Load on servers is inherently variable: you expect the load to vary a lot for the foreseeable future.

If you’re not sure:
1) Try to do a cost calculation
2) Is there a difference in the level of service you will be able to offer?
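One possible shape for such a cost calculation, as a sketch: compare paying per server-hour for what each hour needs against provisioning for the peak and paying for it all the time. The load profiles, prices and function names are hypothetical placeholders, not real provider rates.

```python
import math


def on_demand_cost(hourly_jobs, jobs_per_server, price_per_server_hour):
    """Pay only for the servers needed in each hour."""
    return sum(
        math.ceil(jobs / jobs_per_server) * price_per_server_hour
        for jobs in hourly_jobs
    )


def fixed_cost(hourly_jobs, jobs_per_server, price_per_server_hour):
    """Provision enough servers for the peak hour and pay for them every hour."""
    peak_servers = math.ceil(max(hourly_jobs) / jobs_per_server)
    return peak_servers * price_per_server_hour * len(hourly_jobs)


# Hypothetical 24-hour load profiles (concurrent jobs per hour).
variable_load = [0] * 20 + [6, 6, 4, 2]   # idle most of the day, one burst
constant_load = [4] * 24                  # steady load

# Assumed prices: a cloud server-hour costs more than a dedicated one.
CLOUD, DEDICATED = 0.50, 0.20

for name, load in (("variable", variable_load), ("constant", constant_load)):
    print(name, on_demand_cost(load, 2, CLOUD), fixed_cost(load, 2, DEDICATED))
```

With these made-up numbers the bursty profile is cheaper on demand, while the constant profile is cheaper on fixed capacity, which matches the point that constant-load workflows tend to cost more in the cloud.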

Case Study: Creme Global

• 2006-2009:
  • Single HPC Cluster (3x rack servers)
    • 8 cores
    • 16 GB RAM
  • Colocation hosting in Dublin
  • Capacity: up to 2x concurrent assessments / jobs
• 2009:
  • Increasing client base
  • Potential clients requesting trials
  • Evaluation of compute resource requirements needed…

Analysis of Compute Resources

Monitored assessment / job requests on compute servers over a 4-month period:

• 0 Jobs: 89.29% of the time
• 1+ Jobs: 10.71% of the time

Problem 1: Most of the time there is zero load on the compute servers (wasted compute resources). Compute resources are in use about 10% of the time.

Closer examination of load when resources are in use (i.e. when clients are using the compute resources):

• 1-2 Jobs: 75% of the time in use
• 3+ Jobs: 25% of the time in use

When clients are using the system, their jobs have to queue a large proportion of the time.

Problem 2: System is overloaded 25% of the time it is in use.

Analysis of Compute Resources

• Problem 1: Compute system is usually unused (~90% of the time)
  • Waste of compute resources
• Problem 2: When in use, system is regularly overloaded (~25% of the time)
  • Unsatisfactory service being offered to customers
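A sketch of how this kind of breakdown could be computed from periodic samples of the number of concurrent jobs. The sample data and the function name are invented for illustration; the capacity of 2 concurrent jobs comes from the case study.

```python
from collections import Counter


def load_breakdown(job_counts, capacity=2):
    """Share of samples that are idle, within capacity, or overloaded."""
    if not job_counts:
        raise ValueError("need at least one sample")
    buckets = Counter(
        "idle" if n == 0 else "within capacity" if n <= capacity else "overloaded"
        for n in job_counts
    )
    total = len(job_counts)
    return {
        name: buckets[name] / total
        for name in ("idle", "within capacity", "overloaded")
    }


# Hypothetical monitoring samples: concurrent jobs observed at regular intervals.
samples = [0] * 90 + [1] * 5 + [2] * 3 + [4] * 2
shares = load_breakdown(samples)
in_use = 1 - shares["idle"]
print(shares)
print("overloaded while in use:", shares["overloaded"] / in_use)
```

Reporting the overloaded share relative to time in use, and not relative to total time, is what exposes the second problem: a system that looks almost idle overall can still queue jobs for a large part of the time customers actually use it.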



[Figure: Typical Scenario. Problem 2: Losing Customers?]




Elastic Scaling of Resources in the Cloud

• This is one of the biggest benefits of using IaaS
  • Cloud is generally more expensive because of this benefit
• Manual or Automated?

Scaling Manually

• Initially, you probably won’t have a scaling strategy
  • Manually monitoring and scaling usage will provide the data you need to move to automation
  • An incomplete or poorly designed automation strategy can end up costing more, or providing worse service
  • Dev / Test phase of an application (let developers scale up/down manually)
• If your requirements will change relatively infrequently
  • Good predictions of future requirements
  • Scale up (down) as you add (remove) a product or client
  • Needs a good alignment with business development and strategy
• Scaling manually on the cloud is quite similar to Virtual Private Hosting
  • VPS are usually lower cost than on-demand cloud servers
  • You can get VPS-style hosting from cloud providers and migrate to on-demand when needed
  • Reserved instances (pay some up front to lower the overall cost)

Scaling Manually

Image: 8kmiles.com

• Scale up when you need to (e.g. new contract)
• Scale down when demand falls (e.g. end of contract)

Measuring Performance

• Even a manually scaled system will need metrics to measure performance
• Examples:
  • Server Load: CPU Utilization, Disk Read / Write, Network I/O, Memory Usage, Disk Usage
  • Availability: Uptime (%)
  • Response Time: Database queries, Server-side processing, Content distribution
  • Queue Length: Batch processes waiting to start

Scaling Automatically

• Demand on system changes too rapidly to manage manually

Image: 8kmiles.com

AutoScaling: Ready Made vs Build Your Own

AutoScaling: Ready Made Solution

• IaaS providers offer built-in scaling solutions
• Some third-party providers and consultants can help build a solution for you
  • RightScale
  • 8kmiles.com
• Pros: can be set up quite quickly and relatively cheaply
  • Don’t need to spend a lot of time and resources on R&D
• Cons: may be limited in scope
  • May not be a perfect fit for your scaling requirements

AWS provides built-in auto-scaling functionality for your EC2 instances

Image: aws.amazon.com

1) Metrics: “Should we make a change?”
2) Scaling Rules: “What change to make?”

Options shown when defining a metric condition:

• Statistic: Average, Min, Max, Sum, SampleCount
• Metric: CPU Utilization (%), Disk Reads (Bytes), Disk Read Operations, Disk Writes (Bytes), Disk Write Operations, Network In (Bytes), Network Out (Bytes)
• Comparison: <, ≤, >, ≥
• Period: 1 Minute, 5 Minutes, 15 Minutes, 1 Hour, 6 Hours

Other Metrics Possible

• EBS (Elastic Block Store)
  • Read / Write Bytes
  • Read / Write Operations
  • Idle time
  • Queue length (operations waiting to be completed)
• SQS (Simple Queue Service)
  • Number of messages sent / received
  • Number of messages in the queue
• Custom metrics can be defined by the user

• Choose range of cluster size
• Add a number of instances, or increase size by a certain percent

1) Rule for Scale Up
2) Rule for Scale Down
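To make the two parts concrete (a metric condition that decides whether to change, and a rule that decides what to change), here is a plain-Python sketch of how a threshold rule could be evaluated. It is not the AWS API; the class, function names and thresholds are hypothetical.

```python
import operator
from dataclasses import dataclass
from statistics import mean

STATISTICS = {"Average": mean, "Min": min, "Max": max, "Sum": sum, "SampleCount": len}
COMPARISONS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}


@dataclass
class ScalingRule:
    statistic: str    # how to summarise the datapoints in the period, e.g. "Average"
    comparison: str   # e.g. ">"
    threshold: float
    adjustment: int   # instances to add (positive) or remove (negative)

    def triggered(self, datapoints):
        value = STATISTICS[self.statistic](datapoints)
        return COMPARISONS[self.comparison](value, self.threshold)


def apply_rules(current_size, datapoints, rules, min_size, max_size):
    """Apply the first triggered rule, keeping the cluster within its size range."""
    for rule in rules:
        if rule.triggered(datapoints):
            return max(min_size, min(max_size, current_size + rule.adjustment))
    return current_size


# Hypothetical rules on CPU utilization (%) sampled over one period.
scale_up = ScalingRule("Average", ">", 80.0, +1)
scale_down = ScalingRule("Average", "<", 20.0, -1)
rules = [scale_up, scale_down]

print(apply_rules(2, [90, 85, 95], rules, min_size=1, max_size=4))
```

Clamping to the chosen cluster size range is what stops a noisy metric from scaling the cluster to zero or without bound.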

AutoScaling: Build Your Own Solution

• Full control and customization over the scaling algorithms
• Case Studies:
  • Netflix
  • Creme Global

Custom AutoScaling: Netflix

[Figure: load over 5 days]

• General pattern emerges over time
• Noise (unpredictable / random)

Scaling Strategy:
1) Predict the general pattern
2) React to the randomness

Metric = Requests per Second

[Figure: Throughput over 5 days]

Fast Fourier Transform (approximation of observed data as a combination of sine waves)

• Scaling plan ready before demand changes
• Random spikes (deviations from the prediction) are handled using Amazon AutoScaling
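The idea of approximating observed load as a combination of sine waves can be sketched with a small discrete Fourier transform that keeps only the strongest frequencies. This is an illustration of the general technique, not Netflix's implementation; the traffic data is synthetic, and a plain O(n²) DFT is used here in place of an FFT so the example needs only the standard library.

```python
import cmath
import math


def dft(samples):
    """Discrete Fourier transform (O(n^2); an FFT gives the same result faster)."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n)
    ]


def general_pattern(samples, keep=2):
    """Approximate samples by their mean plus the `keep` strongest sine waves."""
    n = len(samples)
    spectrum = dft(samples)
    positive = range(1, n // 2 + 1)
    strongest = sorted(positive, key=lambda k: abs(spectrum[k]), reverse=True)[:keep]
    # Keep each chosen frequency together with its mirror so the result is real.
    kept = {0} | set(strongest) | {n - k for k in strongest}
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in kept).real / n
        for t in range(n)
    ]


# Synthetic traffic: 5 days of hourly requests per second with a daily cycle,
# plus a small deterministic wobble standing in for noise.
hours = range(5 * 24)
traffic = [
    1000 + 400 * math.sin(2 * math.pi * h / 24) + 30 * ((h * 7) % 5 - 2)
    for h in hours
]
pattern = general_pattern(traffic, keep=1)
worst_gap = max(abs(a - b) for a, b in zip(traffic, pattern))
print(round(worst_gap, 1))
```

Because the reconstructed pattern is periodic over the observed window, `pattern[t % len(pattern)]` can serve as a naive forecast for a future hour `t`; the gap between forecast and actual load is the part a reactive autoscaler would still need to absorb.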

Custom AutoScaling: Creme Global

Job Requests – Very Unpredictable

• 0 Jobs: 89.29%
• 1-2 Jobs: 8.16%
• 3+ Jobs: 2.59%

Job Size – Variable from 1 hour to 10+ hours

[Figure: Job Run Times (1+ hour only). x-axis: Run Time (Hours), in buckets from 1-2 hr to 10+ hr; y-axis: 0% to 50%]

Custom AutoScaling: Creme Global

• Job size is very much larger than typical requests to a web service
  • If resources are low, a job cannot start (compare to standard web services)
• Each job requires dedicated resources to run
  • Typically 1-2 jobs per server
  • Each server is usually at close to 100% CPU while processing jobs
  • Measuring individual server load is not a good measure for scaling
• Jobs are very variable in size
  • Length of job queue alone is not a very good measure for scaling
• A custom scaling approach was required

http://www.google.com/patents/US20110138055

Custom AutoScaling: Creme Global

• Devised a more relevant metric to measure the performance of the system: “How long will it take for the last job in the queue to start?”
• How to calculate this metric?
  1. Estimate the time required for each job to complete (running or queued)
  2. Simulate the processing of each job through the queue in order
  3. Calculate the time that will have to pass before the last job will begin to process

How to calculate this metric?

1) Estimate the time required for each job to complete (running or queued)

• For jobs that are already running:
  • The application can estimate the percentage of a job completed so far
  • Estimate:
    Total time required = (Time required so far) / (Percent complete, expressed as a fraction)
    Time remaining = Total time required – Time required so far
• For jobs that have yet to start:
  • Estimate the complexity of the job based on input factors, including:
    • Monte Carlo iterations requested
    • Size and complexity of the data sets involved
    • Mathematical model computational complexity
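For a job that is already running, the estimate is a simple proportional extrapolation. A minimal sketch, with a hypothetical function name and the completion level expressed as a fraction between 0 and 1:

```python
def estimate_remaining(elapsed_minutes, fraction_complete):
    """Extrapolate the time remaining for a running job.

    total = elapsed / fraction_complete, remaining = total - elapsed.
    Assumes progress is roughly linear in time, which is only an approximation.
    """
    if not 0 < fraction_complete <= 1:
        raise ValueError("fraction_complete must be in (0, 1]")
    total = elapsed_minutes / fraction_complete
    return total - elapsed_minutes


# Hypothetical job: 15 minutes elapsed, reported as 25% complete.
print(estimate_remaining(15, 0.25))
```

Jobs that have not started have no progress to extrapolate from, which is why their estimate has to come from the input factors instead.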

How to calculate this metric?

2) Simulate the processing of each job through the queue in order

[Animation: three running jobs and a queue of four jobs, with the estimated time remaining for each]

• T = 0 min: running jobs have 32, 7 and 47 min remaining; queue (front first): 5, 30, 62, 14 min
• T = 7 min: the 7 min job is done and the 5 min job starts; running: 25, 5, 40 min; queue: 30, 62, 14 min
• T = 12 min: the 5 min job is done and the 30 min job starts; running: 20, 30, 35 min; queue: 62, 14 min
• T = 32 min: the job with 20 min remaining is done and the 62 min job starts; running: 62, 10, 15 min; queue: 14 min
• T = 42 min: the job with 10 min remaining is done; running: 52, 5 min; the last job (14 min) is ready to start

3) Calculate the time that will have to pass before the last job will begin to process

Queue Length (Performance Metric): 42 min
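The simulation can be sketched with a min-heap of the times at which each job slot becomes free. The numbers reproduce the slide example (running jobs with 32, 7 and 47 minutes remaining; queued jobs of 5, 30, 62 and 14 minutes, front of queue first); the function name and the one-job-per-slot model are simplifying assumptions, not the patented implementation.

```python
import heapq


def minutes_until_last_job_starts(running_remaining, queued_estimates):
    """Simulate the queue and return when the last queued job can start.

    running_remaining: estimated minutes left on each running job (one per slot;
        use 0 for an idle slot).
    queued_estimates: estimated run time of each queued job, front of queue first.
    """
    if not queued_estimates:
        return 0
    if not running_remaining:
        raise ValueError("need at least one job slot")
    free_at = list(running_remaining)
    heapq.heapify(free_at)
    start = 0
    for estimate in queued_estimates:
        start = heapq.heappop(free_at)            # next slot to become free
        heapq.heappush(free_at, start + estimate)  # slot is busy until this job ends
    return start


queue_length = minutes_until_last_job_starts([32, 7, 47], [5, 30, 62, 14])
print(queue_length)
```

Unlike a simple count of queued jobs, this metric accounts for how long each job is expected to take, so four short jobs and four long jobs produce very different values.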

Custom AutoScaling: Creme Global

• Scale Up Rules:
  • Queue Length > 10 min
  • No instance pending
• Scale Down Rules:
  • Instance is idle (no running job)
  • Queue is empty
  • Less than 5 min to another billing hour
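These rules translate almost directly into two predicates. A sketch, assuming hourly billing as the rules imply; the function and parameter names are hypothetical.

```python
def should_scale_up(queue_length_min, instances_pending):
    """Start a new instance if the queue metric exceeds 10 min and none is already starting."""
    return queue_length_min > 10 and instances_pending == 0


def should_scale_down(instance_idle, queue_empty, minutes_into_billing_hour):
    """Stop an idle instance only just before its next billing hour begins."""
    minutes_left = 60 - (minutes_into_billing_hour % 60)
    return instance_idle and queue_empty and minutes_left < 5


print(should_scale_up(42, 0))
print(should_scale_down(True, True, 57))
```

Waiting until the end of the paid hour before shutting down means an instance that has already been paid for stays available in case another job arrives.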


[Figure: Job Queue Times (14,623 Jobs :: 2013-14). x-axis: Queue Time (min), in buckets from 0-2 min to 12+ min; y-axis: 0% to 80%]





Business and Management Considerations

• Changing Roles (Software Dev)
• Data Security
• Monitoring Costs
• Further Cost Saving Strategies

Changing Role of Software Developers

• Hardware provisioning is now the responsibility of Software Dev
• Spinning up / down instances and volumes is part of the day-to-day for Developers
• What happens when things get busy? Think about what happens to your desk, desktop, inbox…
• Risks:
  • Test / Development instances left running
  • Volumes and Snapshots without labels
  • Easy to keep backups “just in case”, which build up over time if not managed
  • Billing is far less transparent (compared to conventional hardware purchase / budgeting)
  • Easy to scale up, easy to make a mess!
  • A fixed-resource system will self-regulate due to its inherent limits; a cloud system does not have these limits

Changing Role of Software Developers

• Benefits
  • Very empowering for some software architects: they can design, build and test hardware configurations that will support their applications
  • Complements Agile and Lean Development practices
  • Software developers can acquire new skills (e.g. systems engineering skills, IS management)
  • Streamline design, development, QA / test, release, support
• Merging of a number of roles
  • Software Developer, Software Architect, Systems Engineer, …
  • Result: “DevOps”

Data Protection

Image: aws.amazon.com

• Data protection is vital to the reputation of IaaS providers
  • Very high standards and auditing processes are in place
• Your IaaS provider should be able to grant you access to their auditing reports / whitepapers on security

Employ best practice within your own organization:
 - Server Upgrades / Patches
 - Application Security
 - Data encryption (storage / transit)
 - Principle of Least Privilege
 - Defense in Depth
 - Refer to guidelines: Data Protection Commissioner, AWS

Building trust with customers:
 - Provide audit and reports from IaaS providers
 - Provide documentation on standards within your organization

Managing Cost

• IaaS: “Pay for what you use”
• IaaS: Easy to scale up / down as demand increases / decreases
• Can you accurately predict your Cloud Computing bill each month?
• Can you afford to wait until the next bill to find out?
• An unexpectedly large bill could cause cash flow problems for a small company or start-up
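One simple way to avoid waiting for the bill is to project the month-end total from the spend so far. The sketch below assumes spending continues at the same average daily rate, which will not hold for bursty workloads; the function name and figures are hypothetical.

```python
import calendar
import datetime


def projected_monthly_bill(spend_to_date, today):
    """Linear projection of the month-end bill from month-to-date spend."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


# Hypothetical: $400 spent by 10 February 2014, against a $1,000 budget.
projection = projected_monthly_bill(400, datetime.date(2014, 2, 10))
print(projection, "over budget" if projection > 1000 else "within budget")
```

A projection like this can trigger a review well before the actual spend crosses the budget.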

Managing Cost: Monitoring & Alarms

• With AWS, you can view current monthly spend
• Set up alerts, e.g. “Email me when my bill goes over $1,000”
• But, what to do next?

Cost Management: Finding Savings

• IaaS = “Pay for what you use”, including: elasticity, scaling, storage quality / redundancy, backups, reliability
• To save costs:
  1. Can you make an upfront commitment on some servers (less flexible)?
  2. Can you build your own auto-scaling application?
  3. Can you put some of your data into cheaper archive storage?
  4. Can you live with having some of your data stored non-redundantly?
  5. Can you live with unpredictable server outages?

Cost Management: Finding Savings

1. Can you do with less flexibility in terms of number of servers?
  • Paying an upfront annual cost for a particular usage of Cloud instances will reduce the overall cost
  • AWS provides Reserved Instances, which can give up to 65% saving over on-demand instances
2. Can you build your own auto-scaling application?
3. Can you put some of your data into cheaper archive storage?
  • Data can be expensive to store on the Cloud using the standard services
  • If data is not needed “on demand”, then cheaper storage options are available
  • EBS cost: 1 TB = $600 per annum; Glacier cost: 1 TB = $60 per annum
4. Can you live with having some of your data stored non-redundantly?
  • Standard storage on AWS is 99.999999999% durable
  • If data is already stored somewhere else, then 99.99% durability may be sufficient (saving about 20% on cost)
5. Can you live with unpredictable server outages?
  • If your application is fault tolerant and able to withstand random server outages, consider AWS Spot Instances
  • Spot Instances are unused AWS instances that are auctioned off to the highest bidder
  • You will lose any instances that are outbid, without warning
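As a worked example of the archive-storage saving, the sketch below uses the per-terabyte figures quoted in the text ($600 per annum for EBS, $60 per annum for Glacier). The data volume and the archived fraction are hypothetical, and retrieval or transfer charges are ignored.

```python
EBS_PER_TB_YEAR = 600      # USD, figure quoted in the text
GLACIER_PER_TB_YEAR = 60   # USD, figure quoted in the text


def annual_storage_cost(total_tb, archived_fraction):
    """Annual cost when a fraction of the data is moved to archive storage."""
    archived_tb = total_tb * archived_fraction
    live_tb = total_tb - archived_tb
    return live_tb * EBS_PER_TB_YEAR + archived_tb * GLACIER_PER_TB_YEAR


# Hypothetical: 10 TB in total, 80% of which is rarely accessed.
all_live = annual_storage_cost(10, 0.0)
mostly_archived = annual_storage_cost(10, 0.8)
print(all_live, mostly_archived)
```

The saving scales with how much of the data can tolerate slow retrieval, so it is worth classifying data by access pattern before moving it.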

Private and Hybrid Cloud

Private Cloud

• Concerns / Considerations:
  • Is your organization large enough to have a private cloud which gives the “appearance of infinite compute resources”?
    • If not, then don’t expect your private cloud system to operate under exactly the same rules as the public cloud you’re used to
  • Private cloud will require in-house IT capability to manage
  • Can an internal system provide the same level of service as enterprise public cloud? Think about: network bandwidth / redundancy, uptime, backup, disaster recovery
  • Going private for performance: maybe bare-metal is what you really need.

“Web servers belong in the public cloud. But things like databases — that need really high performance, in terms of [input and output] and reading and writing to memory — really belong on bare-metal servers or private setups.”John Engates (CTO, Rackspace)

Hybrid Cloud

• Use Private Cloud for predictable workloads
• Overflow to Public Cloud when needed

Integration between Private and Public cloud is very important:
 - Network: Bandwidth, Latency, Reliability
 - Application Programming Interface (API)
 - Virtual Machine Image
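The overflow idea can be sketched as a simple split of demand between the two environments. The function name and the capacity units are hypothetical; a real placement decision would also weigh the network, API and image compatibility points listed above.

```python
def place_workload(demand, private_capacity):
    """Fill the private cloud first and overflow the remainder to the public cloud."""
    private = min(demand, private_capacity)
    public = demand - private
    return private, public


# Hypothetical: private cloud sized for 8 concurrent jobs.
print(place_workload(5, 8))
print(place_workload(12, 8))
```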





Thanks to:

• International Association of Software Architects, Ireland (IASA)
• Irish Computer Society (ICS)

More Info: blog.cremeglobal.com

ie.linkedin.com/in/ejdaly/
