Lessons Learned From Building Out Hyper-Scale Cloud Services Using Docker

Copyright © 2017, Oracle and/or its affiliates. All rights reserved.

Boris Scholl, VP Microservices, Oracle

Harvey Raja, Coherence Architect, Oracle

March 6th, 2017

Agenda
• Objectives
• Service use case
• Service design goals and principles
• Platform architecture
• DevOps flow
• Demo
• Lessons learned

Objectives
• Provide insights into building production-grade cloud services
• Provide insights into a production-grade CI/CD pipeline
• Share some lessons learned

Takeaways
• Get insight into an actual real-world architecture
• Awareness of potential pitfalls when entering this space

Service Use Case
• Backbone of other internal distributed services
• Needed a service for
  – Leader election
  – Service registry and discovery
  – Configuration management
• Potentially making it available to customers later

Service Design Goals
• Hyper-scale
• Highly available
• Resilient
• Multitenant
• Optimal hardware utilization to optimize costs
• Agile delivery of individual services, continuous deployments

Service Design Principles
• Design to optimize for time to market
  – Microservice architectural approach
  – Each service is delivered by an independent development team
  – Automate everything
  – E.g. the application consists of nine separate services delivered by five geographically separate development teams
• Governance
  – Unit testing, coding standards, and code reviews on all commits
  – Common log format
• Only services that are “deployable and testable” can be promoted
• Build for operations
  – Custom dashboard UI provides status for all versioned manifests and services, identifying issues, bottlenecks, etc.
  – Diagnostics and monitoring UI, alerting, and tools in place

Technology Stack

Tech Stack mainly focused on proven OSS technologies
• Reliable infrastructure
  – Oracle Bare Metal Cloud Services
  – Mesos/Marathon
    • Currently managed by our team. Will be moving to managed CaaS.
  – NGINX
• Technologies designed for operations
  – Docker
  – ELK (Elasticsearch, Logstash, Kibana) + Grafana
  – Prometheus
• Java (JAX-RS, Jersey, Grizzly, Netty, Coherence)
• Jenkins CI

Architectural overview

[Diagram. Labels: Operator, Load Balancer, Management APIs, Mesos/Marathon, Tenant 1 through Tenant 4, AD 1, AD 2, etcd-1]

Platform components
• Load balancer
  – NGINX based
  – Control plane LB and Tenant LB
  – Tenant LB sits between the service VCN and the Tenant VCN
    • Acts as a ‘wormhole’ between the private networks
• Management APIs
  – Provide endpoints for Console and CLI to create new etcd services
• Etcd service
  – Virtual concept based on a Coherence cluster
    • Etcd gateway == Frontend nodes
    • Storage enabled nodes == Backend nodes
    • Data persisted to NVMe

Platform components
• Orchestrator
  – Implemented a layer between the management APIs and Mesos/Marathon (M/M)
  – Responsible for provisioning the etcd service components in a particular order
  – Manages the life cycle of the etcd service
    • Checks for safe states etc.
  – Supports target environment profiles
    • Depending on the compute infrastructure, the orchestrator adjusts cluster size and JVM resource consumption
• Platform manifest
  – Declarative way of bundling platform components into a release
  – Contains name and version of the components (Docker images) being released
• Platform installer
  – Deploys platform software as defined in the manifest
  – Can deploy to Mesos/Marathon, BM Container Service or Virtual Machines

Service Runtime Architecture

[Diagram. Labels: Service VCN; Availability Domain 1, Availability Domain 2, Availability Domain 3; Load Balancer Service; Gateway nodes; Backend nodes hosting tenant instances (T1 Inst 1, T1 Inst 2, T2 Inst 1, T2 Inst 2); Tenant 1 VCN, Tenant 2 VCN, Tenant n VCN; clients T1 Client 1, T1 Client 2, T2 Client 1]

Testing and CI/CD Pipeline

Testing Strategy

Service-Level Tests
• Owned by each service team
• Includes
  – Unit Test
  – Component Test
  – Integration Test
• Run as a part of individual builds, prior to CI/CD stage

Platform-Level Tests
• Owned by central test team
• Includes end-to-end tests
  – Functional Acceptance Test
  – Minimal Acceptance Test (MAT)
  – Longevity Test
  – Upgrade Test
  – Non-functional (Performance/Stress)
  – Jepsen testing
• Run as a part of the CI/CD pipeline

CI/CD Pipeline and Testing Levels

Level 1: "Verify"
• Use localhost installation (isolated sandbox env) with the last published platform manifest
• Install new version of X on top
• Run Functional Acceptance Tests, in 10 minute parallel chunks
• Run Upgrade Acceptance Tests, upgrading from current production manifest to new version of X

Level 2: "Pre-Stage"
• Use Prod-like environment with last published platform manifest
• Deploy a platform instance from scratch using new version of X on top
• Run MATs (10 minutes)
• On success, a new platform manifest is produced using the new version of X
• Successfully passing this level represents CI

Level 3: "Stage"
• Use Prod-like environment with current production manifest already deployed
• Upgrade to the new version of X
• Run MATs (10 minutes)

Level 4: "Prod Candidate"
• Frequency: Once per night
• Run long running tests: Longevity, PSR, and Functional Acceptance Tests
• Based on the results, a manual decision is made to deploy a specific manifest to production at a frequency determined by management

Note: X indicates platform components that have changed since the last release
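
A minimal sketch of how "X" could be derived, assuming a manifest is modeled as a map from component name to Docker image version. The class name, the `etcd-gateway` component and the build numbers other than `1.0.0-b21` are hypothetical; the slides do not show the real manifest format.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical manifest model: component name -> Docker image version. */
class ManifestDiff {

    /**
     * Returns the components in {@code candidate} that are new or whose version
     * differs from {@code published}. Removed components are not reported.
     */
    static Map<String, String> changedComponents(Map<String, String> published,
                                                 Map<String, String> candidate) {
        Map<String, String> changed = new TreeMap<>();
        for (Map.Entry<String, String> entry : candidate.entrySet()) {
            String publishedVersion = published.get(entry.getKey());
            if (!entry.getValue().equals(publishedVersion)) {
                changed.put(entry.getKey(), entry.getValue());
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        Map<String, String> published = Map.of(
                "etcd-nginx", "1.0.0-b21",
                "etcd-gateway", "1.0.0-b40");
        Map<String, String> candidate = Map.of(
                "etcd-nginx", "1.0.0-b21",
                "etcd-gateway", "1.0.0-b41");
        System.out.println(changedComponents(published, candidate));
    }
}
```

Keeping the manifest as plain name/version pairs is what makes a diff like this cheap: each pipeline level only has to install the changed components on top of a known manifest.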

Detailed Service Release Pipeline

[Flow diagram. Stage labels: Etcd data plane and Etcd control plane (each with Build & Unit Tests, Integration Test, Publish Image); Platform Testing; Update Manifest; Install on pre-stage (MATs); Upgrade stage (UATs); Parallel Test Runs (Performance, Longevity under Stress); Publish Manifest; Prod Ready Candidate; Production Release start; Review Prod Ready List; Select Manifest; Install Prod on stage; Rollback stage; Confirm Prod Ready; Canary on prod; Finish prod Upgrade]

Lessons learned

Lessons Learned (Best Practices) – Services will fail
• Retries
  – Anticipate transient errors for services you are trying to reach
  – Implement a retry policy with an appropriate retry count and interval (e.g. exponential backoff, incremental intervals etc.)
  – Ensure idempotency with retries
• Circuit Breaker
  – Prevents an application from retrying an operation that is likely to fail
  – Can be combined with the retry pattern
• Bulkhead
  – Prevents faults in one part of the system from taking down the entire system
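
The retry and circuit breaker patterns above can be combined roughly as follows. This is a minimal sketch, not the implementation used in the service: the class names, thresholds and delays are assumptions, and the breaker is deliberately simple (a consecutive-failure counter with a fixed open interval).

```java
import java.util.concurrent.Callable;

/** Sketch: retry with exponential backoff, guarded by a simple circuit breaker. */
class Resilience {

    /** Opens after {@code threshold} consecutive failures and stays open for {@code openMillis}. */
    static final class CircuitBreaker {
        private final int threshold;
        private final long openMillis;
        private int consecutiveFailures;
        private long openedAt;

        CircuitBreaker(int threshold, long openMillis) {
            this.threshold = threshold;
            this.openMillis = openMillis;
        }

        synchronized boolean allowsCall() {
            if (consecutiveFailures < threshold) {
                return true;
            }
            // Once the open interval has elapsed, calls are allowed again;
            // a further failure re-opens the breaker, a success resets it.
            return System.currentTimeMillis() - openedAt >= openMillis;
        }

        synchronized void recordSuccess() {
            consecutiveFailures = 0;
        }

        synchronized void recordFailure() {
            consecutiveFailures++;
            if (consecutiveFailures >= threshold) {
                openedAt = System.currentTimeMillis();
            }
        }
    }

    /** Delay before the retry that follows 0-based {@code attempt}: base * 2^attempt, capped. */
    static long backoffMillis(int attempt, long baseMillis, long capMillis) {
        long delay = baseMillis << Math.min(attempt, 20);
        return Math.min(delay, capMillis);
    }

    /** Calls {@code op} up to {@code maxAttempts} times. {@code op} must be idempotent. */
    static <T> T callWithRetry(Callable<T> op, int maxAttempts, long baseMillis,
                               long capMillis, CircuitBreaker breaker) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (!breaker.allowsCall()) {
                throw new IllegalStateException("circuit open, failing fast", last);
            }
            try {
                T result = op.call();
                breaker.recordSuccess();
                return result;
            } catch (Exception e) {
                last = e;
                breaker.recordFailure();
                if (attempt + 1 < maxAttempts) {
                    Thread.sleep(backoffMillis(attempt, baseMillis, capMillis));
                }
            }
        }
        throw last;
    }
}
```

A caller would typically share one `CircuitBreaker` per downstream service, so that repeated failures seen by one request also stop retries issued by others.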

Lessons Learned (Best Practices) – Communication
• More services mean more communication and data exchange
  – HTTP (HTTP/2) for external and internal
  – TCP/UDP for internal for better performance
  – Serialization format: JSON, Protocol Buffers, Coherence POF
• Serialization and deserialization can be a bottleneck for large-scale services
  – Consider whether you need to re-serialize if a downstream service works with the same object
    • Augment the deserialized object and pass it on to another service in a form
  – Choose a JSON serializer wisely
  – Jersey (JAX-RS implementation), Jetty (as the HTTP transport) and Jackson get you pretty far

Lessons Learned – Docker and Java Apps
• Memory
  – JVM does not honor Docker runtime metrics
    • JVM tries to use all the memory it sees
    • Docker daemon kills the container when crossing constraints
    • To avoid the issue: specify a max heap size for the process that is lower than the container memory constraints
  – Ensure the JVM memory settings are correctly synced with the Marathon container memory settings
    • Failure to do this can cause Marathon to continually kill the container when the enclosed JVM hits a memory event that we Java programmers would just consider a "normal" event
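
One way to catch a mismatch early is a startup check that compares the JVM's max heap (set with `-Xmx`) against the container limit. The sketch below assumes the limit is passed in through an environment variable; `CONTAINER_MEMORY_MB` and the 25% non-heap allowance are hypothetical choices, not values from the talk.

```java
/** Sketch: check that the JVM max heap leaves headroom under a container memory limit. */
class HeapCheck {

    /**
     * True if {@code maxHeapBytes} fits in the part of the container limit that remains
     * after reserving {@code nonHeapFraction} for metaspace, thread stacks and direct buffers.
     */
    static boolean heapFitsContainer(long maxHeapBytes, long containerLimitBytes,
                                     double nonHeapFraction) {
        long heapBudget = (long) (containerLimitBytes * (1.0 - nonHeapFraction));
        return maxHeapBytes <= heapBudget;
    }

    public static void main(String[] args) {
        // Hypothetical variable: the deployment would set it to the same value
        // as the container memory setting in the Marathon app definition.
        String limitMb = System.getenv("CONTAINER_MEMORY_MB");
        if (limitMb == null) {
            System.out.println("CONTAINER_MEMORY_MB not set; skipping check");
            return;
        }
        long limitBytes = Long.parseLong(limitMb) * 1024L * 1024L;
        long maxHeap = Runtime.getRuntime().maxMemory(); // reflects -Xmx when it is set
        if (!heapFitsContainer(maxHeap, limitBytes, 0.25)) {
            throw new IllegalStateException("Max heap " + maxHeap
                    + " bytes does not fit container limit " + limitBytes + " bytes");
        }
        System.out.println("Heap setting fits the container limit");
    }
}
```

With a 1024 MB container, for example, starting the process with `-Xmx768m` would pass this check, while `-Xmx900m` would fail it.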

Lessons Learned – Docker and Java Apps
• CPU
  – The JVM sees all the cores of the host machine
    • Manually configure the core count if you rely on that information (e.g. to create threads)
    • Workaround: use a -D Java property
  – Ensure the JVM CPU settings are correctly synced with the Marathon container CPU settings
    • Failure to do this can cause the container to never start
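
The -D workaround could look like the sketch below: code that sizes thread pools reads an explicit system property and only falls back to the JVM-reported core count when the property is absent. The property name `service.cpu.count` is a hypothetical choice.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Sketch: size thread pools from an explicit -D property instead of the host core count. */
class CpuConfig {

    // Hypothetical property name; pass e.g. -Dservice.cpu.count=2 to match the container CPU setting.
    static final String CPU_PROPERTY = "service.cpu.count";

    /** Uses the property when it is a positive integer, else what the JVM reports. */
    static int effectiveCpuCount() {
        Integer configured = Integer.getInteger(CPU_PROPERTY);
        if (configured != null && configured > 0) {
            return configured;
        }
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        int threads = effectiveCpuCount();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        System.out.println("Worker threads: " + threads);
        pool.shutdown();
    }
}
```

Routing every pool size through one helper keeps the JVM side and the container CPU setting adjustable in a single place.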

Lessons Learned – Marathon (or other orchestrator)
• Mesos/Marathon status may not match service status
  – M/M reports on container status
  – It may take longer for the service inside the container to come up
  – Workaround: Configure a health check URL for the service so that one can definitively conclude that the container and the process inside it are up and running
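
From the client side, waiting on such a health check URL might be sketched as below. The endpoint `http://localhost:8080/health`, the attempt count and the timeouts are assumptions for illustration; the probe is passed in as a function so the polling loop can be exercised without a network.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.function.IntSupplier;

/** Sketch: wait for the service inside a container to answer its health check. */
class HealthWait {

    /** Returns the HTTP status of a GET to {@code url}, or -1 if the request fails. */
    static int httpStatus(String url) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(2000);
            conn.setReadTimeout(2000);
            conn.setRequestMethod("GET");
            try {
                return conn.getResponseCode();
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            return -1; // connection refused, timeout, etc.: treat as not up yet
        }
    }

    /** Polls {@code probe} until it returns 200 or {@code maxAttempts} probes have been made. */
    static boolean waitUntilHealthy(IntSupplier probe, int maxAttempts, long delayMillis)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (probe.getAsInt() == 200) {
                return true;
            }
            if (attempt < maxAttempts) {
                Thread.sleep(delayMillis);
            }
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical health endpoint; substitute the URL your service exposes.
        String url = "http://localhost:8080/health";
        boolean up = waitUntilHealthy(() -> httpStatus(url), 30, 1000);
        System.out.println(up ? "service is up" : "service did not become healthy");
    }
}
```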

Lessons Learned – NGINX
• Dynamic reconfiguration of NGINX
  – Configuration updates based on events sent by orchestrator
    • E.g. spins up new etcd clusters
    • LB needs to be aware of new routes
  – Not easy to find out when the change was applied
  – Workaround: Ping service via NGINX to make sure the service is back up
  – Disable NGINX logging
  – Increase worker processes to match load (auto == cpu count)

Lessons Learned (Best Practice) – Containers
• Use proper container image versioning
  – E.g. etcd-nginx:1.0.0-b21
  – Avoid the latest tag
• Use small base images
  – E.g. Oracle Linux 7.1 - slim
  – Large base images can delay service readiness
• Use one base image for multiple purposes
  – Add functionality to base image
  – Enable feature by configuration
  – E.g. Tenant LB needs to support HTTP/2 (required by etcd v3, which uses gRPC)
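
The versioning rule can be enforced mechanically, for example in a pipeline step that rejects untagged or `latest` image references. The sketch below is one possible check under simple assumptions: the class is hypothetical, the registry host in the example is made up, and digest references (`name@sha256:...`) are not handled.

```java
/** Sketch: reject image references that are untagged or use the floating "latest" tag. */
class ImageTag {

    /** Returns the tag of a reference such as "etcd-nginx:1.0.0-b21", or null if untagged. */
    static String tagOf(String imageRef) {
        int lastSlash = imageRef.lastIndexOf('/');
        int lastColon = imageRef.lastIndexOf(':');
        // A colon before the last slash belongs to a registry port, e.g. "registry:5000/app".
        if (lastColon <= lastSlash) {
            return null;
        }
        return imageRef.substring(lastColon + 1);
    }

    /** True only for references that carry an explicit tag other than "latest". */
    static boolean isPinned(String imageRef) {
        String tag = tagOf(imageRef);
        return tag != null && !tag.isEmpty() && !tag.equals("latest");
    }

    public static void main(String[] args) {
        String[] refs = {
            "etcd-nginx:1.0.0-b21",
            "etcd-nginx:latest",
            "etcd-nginx",
            "registry.example.com:5000/etcd-nginx" // made-up registry host
        };
        for (String ref : refs) {
            System.out.println(ref + " pinned=" + isPinned(ref));
        }
    }
}
```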

CI/CD Pipeline – Lessons Learned
• Automating everything may not be possible
• Let teams choose what they want to use
  – Supporting both Maven and Gradle projects allowed dev teams to choose the tools they preferred
  – Helped get devs more involved in the CI/CD process
• Restrict the testing pipeline to “deployable” components (i.e. Docker images)
  – Teams producing Java libraries are required to coordinate with Docker image producers
  – Dev projects should be responsible for managing dependencies with other projects

CI/CD Testing – Lessons Learned
• Isolated sandbox environment for development, debugging and testing
  – Team members should be able to easily stand up their own isolated environment
• Agile “testing pyramid” of unit / integration / end-to-end tests
  – More unit tests for better quality. Fewer end-to-end tests to reduce CI cycle time
• Parallelize testing where possible
  – To execute more tests in a short amount of time
• Test upgrades early in the CI/CD pipeline
  – Saves cycle time on other expensive tests if upgrades fail/introduce regressions
• Address intermittent failures in end-to-end tests right away
  – Prioritize these failures, identify the root cause/faulty component and fix it first
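
Parallelizing a test run can be as simple as splitting the test list into chunks and handing each chunk to a thread pool. The sketch below is illustrative only: it models a test as a `Callable<Boolean>`, which is an assumption, and a real pipeline would more likely distribute chunks across build agents than across threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Sketch: split a test list into chunks and run the chunks concurrently. */
class ParallelChunks {

    /** Splits {@code items} into consecutive chunks of at most {@code chunkSize} elements. */
    static <T> List<List<T>> chunk(List<T> items, int chunkSize) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be at least 1");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < items.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, items.size());
            chunks.add(new ArrayList<>(items.subList(i, end)));
        }
        return chunks;
    }

    /** Runs every test and returns how many failed (returned false or threw). */
    static int runChunks(List<List<Callable<Boolean>>> chunks, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> jobs = new ArrayList<>();
            for (List<Callable<Boolean>> tests : chunks) {
                // One job per chunk; tests inside a chunk run sequentially.
                jobs.add(() -> {
                    int failures = 0;
                    for (Callable<Boolean> test : tests) {
                        try {
                            if (!test.call()) {
                                failures++;
                            }
                        } catch (Exception e) {
                            failures++;
                        }
                    }
                    return failures;
                });
            }
            int total = 0;
            for (Future<Integer> result : pool.invokeAll(jobs)) {
                total += result.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```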

CI/CD General – Lessons Learned
• A developer-focused dashboard is essential to identify failures & pipeline blockages
• Providing the information needed for diagnosis is essential to understanding failures in a timely manner
• More specific to our situation:
  – Defining an external Platform Manifest was a good choice
    • Allowed reproducible test results
    • Allowed mixing and matching between component versions
  – Wish we had abstracted Container Management interfaces
    • Would allow moving between orchestrators

Summary
• Microservices approach helps with time to market
  – Be aware that you are dealing with a distributed system
• Tools and technology of choice require governance
• CI/CD pipeline and automation are key
  – You may not be able to automate all the way up to continuous deployment
• More on that topic:
  – 5 part blog series: Getting started with microservices
    • https://blogs.oracle.com/developers/getting-started-with-microservices-part-one
