Clouds at CERN : A 5 year perspective...OpenStack user committee from 2013-2015 UCC 2018 3. UCC 2018...


Clouds at CERN : A 5 year perspective

Utility and Cloud Computing Conference, December 19, 2018

Tim Bell (@noggin143), UCC 2018


About Tim

• Responsible for Compute and Monitoring in CERN IT department

• Elected member of the OpenStack Foundation management board

• Member of the OpenStack user committee from 2013-2015


CERN: a worldwide collaboration

CERN’s primary mission: SCIENCE

Fundamental research on particle physics, pushing the boundaries of knowledge and technology


CERN: World’s largest particle physics laboratory

Image credit: CERN


The Large Hadron Collider: LHC

• 1232 dipole magnets, each 15 metres long and weighing 35 t

• 27 km circumference


LHC: World’s Largest Cryogenic System (1.9 K)

Colder temperatures than outer space (120 t He)


Vacuum?

• Yes

LHC: Highest Vacuum

• 104 km of pipes

• 10^-11 bar (~ moon)


ATLAS, CMS, ALICE and LHCb

Heavier than the Eiffel Tower


40 million pictures per second

1 PB/s


About the CERN IT Department


Enable the laboratory to fulfill its mission

- Main data centre on Meyrin site

- Wigner data centre in Budapest (since 2013)

- Connected via three dedicated 100 Gb/s links

- Where possible, resources at both sites (plus disaster recovery)

Drone footage of the CERN CC


Status: Service Level Overview


Outline


• Fabric Management before 2012

• The AI Project

• The three AI areas

- Configuration Management

- Monitoring

- Resource provisioning

• Review


CERN IT Tools up to 2011 (1)


• Developed in a series of EU-funded projects

- 2001-2004: European DataGrid

- 2004-2010: EGEE

• Work package 4 (WP4) – Fabric management: “Deliver a computing fabric comprised of all the necessary tools to manage a centre providing grid services on clusters of thousands of nodes.”


CERN IT Tools up to 2011 (2)


• The WP4 software was developed from scratch

- Scale and experience needed for LHC Computing was special

- Config’ mgmt, monitoring, secret store, service status, state mgmt, service databases, …

LEMON – LHC Era Monitoring

- client/server based monitoring

- local agent with sensors

- samples stored in a cache & sent to server

- UDP or TCP, w/ or w/o encryption

- support for remote entities

Quattor

- system administration toolkit

- automated installation, configuration & management of clusters

- clients interact with a configuration database (CMDB) and an installation infrastructure (AII)

Around 8’000 servers managed!


2012: A Turning Point for CERN IT


• EU projects finished in 2010: decreasing development and support

• LHC compute and data requirements increasing

- Moore’s law would help, but not enough

• Staff would not grow with managed resources

- Standardization & automation, current tools not apt

• Other deployments have surpassed the CERN one

- Mostly commercial companies like Google, Facebook, Rackspace, Amazon, Yahoo!, …

- We were no longer special! Can we profit?

[Figure: requirements per experiment (GRID, ATLAS, CMS, LHCb, ALICE) over Run1 to Run4, axis 0 to 160; annotations: “we are here” (2012), “what we can afford”]

LS1 (2013) ahead, next window for change would only open in 2019 …


How we began …

• Formed a small team of service managers from …

- Large services (e.g. batch, plus)

- Existing fabric services (e.g. monitoring)

- Existing virtualization service

• ... to define project goals

- What issues do we need to address?

- What forward looking features do we need?

http://iopscience.iop.org/article/10.1088/1742-6596/396/4/042002/pdf


Agile Infrastructure Project Goals

1. New data centre support

- Overcome limits of CC in Meyrin

- Disaster recovery and business continuity

- ‘Smart hands’ approach

Agile Infrastructure Project Goals

2. Sustainable tool support

- Tools to be used at our scale need maintenance

- Tools with a limited community require more time for newcomers to become productive and are less valuable for the time after (transferable skills)

Agile Infrastructure Project Goals

3. Improve user response time

- Reduce the resource provisioning time span (current virtualization service reached scaling limits)

- Self-service kiosk

Agile Infrastructure Project Goals

4. Enable cloud interfaces

- Experiments already started to use EC2

- Enable libraries such as Apache’s libcloud

Agile Infrastructure Project Goals

5. Precise monitoring and accounting

- Enable timely monitoring for debugging

- Showback usage to the cloud users

- Consolidate accounting data for usage of CPU, network, storage … across batch, physical nodes and grid resources

Agile Infrastructure Project Goals

6. Improve resource efficiency

- Adapt provisioned resources to services’ needs

- Streamline the provisioning workflows (e.g. burn-in, repair or retirement)

Our Approach: Tool Chain and DevOps


• CERN’s requirements are no longer special!

• A set of tools emerged when looking at other places

• Small dedicated tools allowed for rapid validation & prototyping

• Adapted our processes, policies and workflows to the tools!

• Join (and contribute to) existing communities!


IT Policy Changes for Services


• Services shall be virtual …

- Within reason

- Exceptions are costly!

• Puppet managed, and …

• … monitored!

- (Semi-)automatic with Puppet

Decrease provisioning time

Increase resource efficiency

Simplify infrastructure mgmt

Profit from others’ work

Speed up deployment

‘Automatic’ documentation

Centralized monitoring

Integrated alarm handling


Tools + Policies:

Sounds simple!

From tools to services is complex!

- Integration w/ sec services?

- Incident handling?

- Request work flows?

- Change management?

- Accounting and charging?

- Life cycle management?

- …

Image: Subbu Allamaraju


Public Procurement Timelines


Resource Provisioning: IaaS

• Based on OpenStack

- Collection of open source projects for cloud orchestration

- Started by NASA and Rackspace in 2010

- Grown into a global software community


Early Prototypes


The CERN Cloud Service

• Production since July 2013

- Several rolling upgrades since, now on Rocky

- Many sub services deployed

• Spans two data centers

- One region, one API entry point

• Deployed using RDO + Puppet

- Mostly upstream, patched where needed

• Many sub services run on VMs!

- Boot strapping


Agility in the Cloud

• Use case spectrum

- Batch service (physics analysis)

- IT services (built on each other)

- Experiment services (build)

- Engineering (chip design)

- Infrastructure (hotel, bikes)

- Personal (development)

• Hardware spectrum

- Processor archs (features, NUMA, …)

- Core-to-RAM ratio (1:2, 1:3, 1:5, …)

- Core-to-disk ratio (2x or 4x SSDs)

- Disk layout (2, 3, 4, mixed)

- Network (1/10GbE, FC, domain)

- Location (DC, power)

- SLC6, CC7, RHEL, Windows

- …


What about our initial goals?

• The remote DC is seamlessly integrated

- No difference from provisioning PoV

- Easily accessible by users

- Local DC limits overcome (business continuity?)

• Sustainable tools

- Number of managed machines has multiplied

- Good collaboration with upstream communities

- Newcomers know tools, can use knowledge afterwards

• Provisioning time span is ~minutes

- Was several months before

- Self-service kiosk with automated workflows

• Cloud interfaces

- Good OpenStack adoption, EC2 support

• Flexible monitoring infra

- Automatic for simple cases

- Powerful tool set for more complex ones

- Accounting for local and grid resources

• Increased resource efficiency

- ‘Packing’ of services

- Overcommit

- Adapted to services’ needs

- Quick draining & back filling

So … 100% success?


Cloud Architecture Overview

• Top and child cells for scaling

- API, DB, MQ, Compute nodes

- Remote DC is set of cells

• Nova HA only on top cell

- Simplicity vs impact

• Other projects global

- Load balanced controllers

- RabbitMQ clusters

• Three Ceph instances

- Volumes (Cinder), images (Glance), shares (Manila)


HL-LHC SKA


Tech. Challenge: Scaling

• OpenStack Cells provides composable units

• Cells V1 – Special custom developments

• Cells V2 – Now the standard deployment model

• Broadcast vs targeted queries

• Handling down cells

• Quota

• Academic and scientific instances push the limits

• Now many enterprise clouds above 1000 hypervisors

• CERN running 73 Cells in production

https://www.openstack.org/analytics
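The difference between broadcast and targeted queries, and why down cells matter, can be illustrated with a toy model. This is only a sketch under stated assumptions: `Cell`, `CellRouter` and the instance records are hypothetical names invented here, not the Nova Cells API. Only the count of 73 cells comes from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    """One cell with its own instance records (stand-in for a cell database)."""
    name: str
    instances: dict = field(default_factory=dict)  # instance id -> record
    up: bool = True


class CellRouter:
    """Toy top-level layer mapping instance ids to cells (hypothetical, not Nova)."""

    def __init__(self, cells):
        self.cells = {cell.name: cell for cell in cells}
        self.mapping = {}  # instance id -> cell name
        self.queries = 0   # number of per-cell lookups issued so far

    def create(self, instance_id, cell_name, record):
        self.cells[cell_name].instances[instance_id] = record
        self.mapping[instance_id] = cell_name

    def get_targeted(self, instance_id):
        """Targeted query: follow the mapping and touch exactly one cell."""
        cell = self.cells[self.mapping[instance_id]]
        self.queries += 1
        if not cell.up:
            return None  # a down cell only affects its own instances
        return cell.instances.get(instance_id)

    def list_broadcast(self):
        """Broadcast query: ask every cell and report the ones that are down."""
        results, skipped = [], []
        for cell in self.cells.values():
            self.queries += 1
            if cell.up:
                results.extend(cell.instances.values())
            else:
                skipped.append(cell.name)
        return results, skipped


router = CellRouter([Cell(f"cell{i:02d}") for i in range(73)])
router.create("vm-1", "cell07", {"id": "vm-1", "vcpus": 4})
router.create("vm-2", "cell42", {"id": "vm-2", "vcpus": 8})

record = router.get_targeted("vm-1")
targeted_cost = router.queries                    # one cell contacted

router.cells["cell42"].up = False                 # simulate a down cell
listing, skipped = router.list_broadcast()
broadcast_cost = router.queries - targeted_cost   # every cell contacted
```

In this model a targeted lookup costs one cell query regardless of the number of cells, while a listing grows with the cell count and has to decide what to report for cells that do not answer.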


Tech. Challenge: CPU Performance

• Benchmark results on full-node VMs were about 20% lower than on the underlying host

- Smaller VMs much better

• Investigated various tuning options

- KSM*, EPT**, PAE, Pinning, … + hardware type dependencies

- Discrepancy down to ~10% between virtual and physical

• Comparison with Hyper-V: no general issue

- Loss w/o tuning ~3% (full-node), <1% for small VMs

- … NUMA-awareness!

*KSM on/off: beware of memory reclaim!

**EPT on/off: beware of expensive page table walks!


CPU Performance: NUMA

• NUMA-awareness identified as most efficient setting

• “EPT-off” side-effect

- Small number of hosts, but very visible there

• Use 2MB Huge Pages

- Keep the “EPT off” performance gain with “EPT on”


NUMA roll-out

• Rolled out on ~2’000 batch hypervisors (~6’000 VMs)

- Huge page (HP) allocation as boot parameter → reboot

- VM NUMA awareness as flavor metadata → delete/recreate

• Cell-by-cell (~200 hosts):

- Queue-reshuffle to minimize resource impact

- Draining & deletion of batch VMs

- Hypervisor reconfiguration (Puppet) & reboot

- Recreation of batch VMs

• Whole update took about 8 weeks

- Organized between batch and cloud teams

- No performance issue observed since

VM       Before   After
4x 8     8%
2x 16    16%
1x 24    20%      5%
1x 32    20%      3%


Tech. Challenge: Under used resources


VM Expiry


• Each personal instance will have an expiration date

• Set shortly after creation and evaluated daily

• Configured to 180 days, renewable

• Reminder mails starting 30 days before expiration
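The expiry policy above can be written down as a small daily check. This is a minimal sketch, not CERN's implementation: the function names are invented here, what happens to an expired instance is not stated in the talk (the sketch only returns a label), and renewal is assumed to restart the 180-day clock from the renewal date.

```python
from datetime import date, timedelta

EXPIRY_DAYS = 180    # configured lifetime, from the slide
REMINDER_DAYS = 30   # reminder mails start this many days before expiry


def initial_expiry(created: date) -> date:
    """Expiration date set shortly after instance creation."""
    return created + timedelta(days=EXPIRY_DAYS)


def renew(today: date) -> date:
    """Assumption: a renewal grants another full period from the renewal date."""
    return today + timedelta(days=EXPIRY_DAYS)


def daily_check(expires: date, today: date) -> str:
    """Action for one personal instance in the daily evaluation."""
    if today >= expires:
        return "expire"
    if today >= expires - timedelta(days=REMINDER_DAYS):
        return "remind"
    return "ok"


created = date(2018, 1, 10)
expires = initial_expiry(created)
for day in (date(2018, 6, 1), date(2018, 6, 9), date(2018, 7, 9)):
    print(day, daily_check(expires, day))
```

Keeping the check a pure function of two dates makes it straightforward to run once per day over all personal instances and to test the boundary days.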


Expiry results

• Results exceeded expectations

• Expired

- >1000 VMs

- >3000 cores


Tech. Challenge: Bare Metal

• VMs not suitable for all of our use cases

- Storage and database nodes, HPC clusters, boot strapping, critical network equipment or specialised network setups, precise/repeatable benchmarking for s/w frameworks, …

• Complete our service offerings

- Physical nodes (in addition to VMs and containers)

- OpenStack UI as the single pane of glass

• Simplify hardware provisioning workflows

- For users: openstack server create/delete

- For procurement & h/w provisioning team: initial on-boarding, server re-assignments

• Consolidate accounting & bookkeeping

- Resource accounting input will come from fewer sources

- Machine re-assignments will be easier to track


Adapt the Burn In process

• “Burn-in” before acceptance

- Compliance with technical spec (e.g. performance)

- Find failed components (e.g. broken RAM)

- Find systematic errors (e.g. bad firmware)

- Provoke early failing due to stress

• Tests include

- CPU: burnK7, burnP6, burnMMX (cooling)

- RAM: memtest, Disk: badblocks

- Network: iperf(3) between pairs of nodes (automatic node pairing)

- Benchmarking: HEPSpec06 (& fio)

- derivative of SPEC06

- we buy total compute capacity (not newest processors)
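The automatic node pairing for the network test can be sketched as follows. This is an illustration only: the talk does not say how pairs are chosen, so random pairing is an assumption, and `pair_nodes`, `iperf3_commands` and the node names are hypothetical. The generated strings use the standard `iperf3 -c <server> -t <seconds>` client invocation and assume an `iperf3 -s` server is already running on the partner node.

```python
import random


def pair_nodes(nodes, seed=None):
    """Shuffle the nodes and group them into (client, server) pairs.

    Returns the list of pairs plus the leftover node when the count is odd.
    """
    pool = list(nodes)
    random.Random(seed).shuffle(pool)
    pairs = [(pool[i], pool[i + 1]) for i in range(0, len(pool) - 1, 2)]
    leftover = pool[-1] if len(pool) % 2 else None
    return pairs, leftover


def iperf3_commands(pairs, seconds=60):
    """Client-side command to run on the first node of each pair."""
    return {client: f"iperf3 -c {server} -t {seconds}" for client, server in pairs}


nodes = [f"node{i:03d}" for i in range(5)]   # hypothetical delivery of five servers
pairs, leftover = pair_nodes(nodes, seed=1)
commands = iperf3_commands(pairs)
```

With an odd number of nodes one machine is left over and would need a second round against an already tested partner.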


Exploiting cloud services for burn in


Tech. Challenge: Containers

An OpenStack API Service that allows creation of container clusters

● Use your OpenStack credentials, quota and roles

● You choose your cluster type

● Multi-Tenancy

● Quickly create new clusters with advanced features such as multi-master

● Integrated monitoring and CERN storage access

● Making it easy to do the right thing


Scale Testing using Rally

• An OpenStack benchmark test tool

• Easily extended by plugin

• Test results in HTML reports

• Used by many projects

• Context: set up environment

• Scenario: run benchmark

• Recommended for a production service to verify that the service behaves as expected at all times

[Diagram: Rally, Kubernetes cluster (pods, containers), Rally report]


First Attempt – 1M requests/sec

• 200 Nodes

• Found multiple limits

• Heat Orchestration scaling

• Authentication caches

• Volume deletion

• Site services


Second Attempt – 7M requests/sec

• Fixes and scale to 1000 Nodes

Cluster Size (Nodes)   Concurrency   Deployment Time (min)

2 50 2.5

16 10 4

32 10 4

128 5 5.5

512 1 14

1000 1 23
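The table can be turned into a rough deployment throughput. This is a back-of-the-envelope sketch: it assumes that "Concurrency" means the number of clusters created at the same time and that the deployment time applies to each of them, neither of which the slide states explicitly.

```python
# Rows from the table: (cluster size in nodes, concurrency, deployment time in minutes)
RUNS = [
    (2, 50, 2.5),
    (16, 10, 4),
    (32, 10, 4),
    (128, 5, 5.5),
    (512, 1, 14),
    (1000, 1, 23),
]


def throughput(nodes, concurrency, minutes):
    """Nodes brought up per minute, per cluster and summed over concurrent creations."""
    return {
        "nodes": nodes,
        "per_cluster": nodes / minutes,
        "aggregate": nodes * concurrency / minutes,
    }


rates = [throughput(*run) for run in RUNS]
for r in rates:
    print(f"{r['nodes']:>5} nodes: {r['per_cluster']:6.1f} nodes/min per cluster, "
          f"{r['aggregate']:6.1f} nodes/min in aggregate")

size_ratio = RUNS[-1][0] / RUNS[0][0]   # how much larger the biggest cluster is
time_ratio = RUNS[-1][2] / RUNS[0][2]   # how much longer it takes to deploy
```

Under these assumptions the largest cluster is 500 times bigger than the smallest but takes only about 9 times longer to deploy.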


Tech. Challenge: Meltdown

• In January 2018, a security vulnerability was disclosed → a new kernel was needed everywhere

• Staged campaign

- 7 reboot days, 7 tidy up days

- By availability zone

• Benefits

- Automation now to reboot the cloud if needed: 33,000 VMs on 9,000 hypervisors

- Latest QEMU and RBD user code on all VMs

• Then L1TF came along

- And we had to do it all again...
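A staged campaign by availability zone can be planned with a simple greedy assignment. This is a sketch under loud assumptions: the zone names and sizes below are invented (the talk gives only the totals and the seven reboot days), and `plan_by_zone` is a hypothetical helper, not the automation CERN built.

```python
import math

HYPERVISORS = 9_000   # from the slide
VMS = 33_000          # from the slide
REBOOT_DAYS = 7       # from the slide


def plan_by_zone(zone_sizes, days=REBOOT_DAYS):
    """Assign whole availability zones to reboot days, largest zone first,
    always onto the day with the fewest hypervisors scheduled so far."""
    schedule = {day: [] for day in range(1, days + 1)}
    load = {day: 0 for day in schedule}
    for zone, hosts in sorted(zone_sizes.items(), key=lambda item: item[1], reverse=True):
        day = min(load, key=load.get)
        schedule[day].append(zone)
        load[day] += hosts
    return schedule, load


# Hypothetical layout: nine equally sized zones (the real zone list is not in the talk).
zones = {f"zone-{i}": HYPERVISORS // 9 for i in range(1, 10)}
schedule, load = plan_by_zone(zones)

avg_hosts_per_day = math.ceil(HYPERVISORS / REBOOT_DAYS)   # reboots needed per day on average
vms_per_hypervisor = VMS / HYPERVISORS                      # VMs affected by each reboot
```

Rebooting a whole zone on one day keeps the impact aligned with the failure domains users already plan around, at the cost of uneven daily load when zones differ in size.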


[Timeline, 2009 to 2030?: First run, LS1, Second run, LS2, Third run, LS3, HL-LHC Run4]

• Significant part of cost comes from global operations

• Even with technology increase of ~15%/year, we still have a big gap if we keep trying to do things with our current compute models

• Raw data volume increases significantly for High Luminosity LHC


Commercial Clouds


Non-Technical Challenges (1)


• Agile Infrastructure Paradigm Adoption

- ‘VMs are slower than physical machines.’

- ‘I need to keep control on the full stack.’

- ‘This would not have happened with physical machines.’

- ‘It’s the cloud, so it should be able to do X!’

- ‘Using a config’ management tool is too dangerous!’

- ‘They are my machines’


Non-Technical Challenges (2)

• Agility can bring great benefits …

• … but mind (adapted) Hooke’s Law!

- Avoid irreversible deformations

• Ensure the tail is moving as well as the head

- Application support

- Cultural changes

- Workflow adoption

- Open source community culture can help


Non-Technical Challenges (3)

• Contributor License Agreements

• Patches needed, but merges/reviews take time

• Regular staff changes limit Karma

• Need to be a polyglot

- Python, Ruby, Go, … and legacy Perl etc.

• Keep riding the release wave

- Avoid the end-of-life scenarios


Ongoing Work Areas

• Spot Market / Pre-emptible instances

• Software Defined Networking

• Regions

• GPUs

• Containers on Bare Metal

• …


Summary

Positive results 5 years into the project!

- LHC needs met without additional staff

- Tools and workflows widely adopted and accepted

- Many technical challenges were mastered and returned upstream

- Integration with open source communities successful

- Use of common tools increased CERN’s attractiveness to talent

Further enhancements in function & scale needed for HL-LHC


Further Information

• CERN information outside the auditorium

• Jobs at CERN – wide range of options

- http://jobs.cern

• CERN blogs

- http://openstack-in-production.blogspot.ch

- https://techblog.web.cern.ch/techblog/

• Recent Talks at OpenStack summits

- https://www.openstack.org/videos/search?search=cern

• Source code

- https://github.com/cernops and https://github.com/openstack


Agile Infrastructure Core Areas

• Resource provisioning (IaaS)

- Based on OpenStack

• Centralized Monitoring

- Based on Collectd (sensor) + ‘ELK’ stack

• Configuration Management

- Based on Puppet


Configuration Management

• Client/server architecture

- ‘agents’ running on hosts plus horizontally scalable ‘masters’

• Desired state of hosts described in ‘manifests’

- Simple, declarative language

- ‘resource’ basic unit for system modeling, e.g. package or service

• ‘agent’ discovers system state using ‘facter’

- Sends current system state to masters

• Master compiles data and manifests into ‘catalog’

- Agent applies catalog on the host
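The facts → catalog → apply cycle described above can be mimicked in a few lines. This is a toy model for illustration only: it is neither Puppet's manifest language nor its implementation, and the function names, the `only_if` key, the example resources and the host are all invented here.

```python
def discover_facts(host):
    """Stand-in for what 'facter' reports; real facts come from the system itself."""
    return {"hostname": host["name"], "osfamily": host["osfamily"]}


def compile_catalog(facts, manifest):
    """Master side: evaluate the manifest against the facts, return concrete resources."""
    catalog = []
    for resource in manifest:
        condition = resource.get("only_if", {})
        if all(facts.get(key) == value for key, value in condition.items()):
            catalog.append({k: v for k, v in resource.items() if k != "only_if"})
    return catalog


def apply_catalog(catalog, state):
    """Agent side: change only what differs from the desired state (idempotent)."""
    changes = []
    for resource in catalog:
        key = (resource["type"], resource["title"])
        if state.get(key) != resource["ensure"]:
            state[key] = resource["ensure"]
            changes.append(key)
    return changes


# Hypothetical manifest: each entry is one 'resource' with a desired state.
manifest = [
    {"type": "package", "title": "collectd", "ensure": "installed"},
    {"type": "service", "title": "collectd", "ensure": "running"},
    {"type": "package", "title": "yum-utils", "ensure": "installed",
     "only_if": {"osfamily": "RedHat"}},
]

host = {"name": "node001", "osfamily": "RedHat"}
state = {("package", "collectd"): "installed"}   # what is already on the host

facts = discover_facts(host)
catalog = compile_catalog(facts, manifest)
first_run = apply_catalog(catalog, state)    # brings the host to the desired state
second_run = apply_catalog(catalog, state)   # nothing left to change
```

The second run returning no changes is the property that makes a declarative description safe to re-apply on every agent run.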


Status: Config’ Management (1)


(virtual and physical, private and public cloud)

(‘base’ is what every Puppet node gets)

(compilations are spread out)

(this number includes dev changes)

(number of Puppet code committers)


Status: Config’ Management (2)


Status: Config’ Management (3)


• Changes to QA are announced publicly

• QA duration: 1 week

• All Service Managers can stop a change!


Monitoring: Scope


Data Centre Monitoring

• Two DCs at CERN and Wigner

• Hardware, O/S, and services

• PDUs, temp sensors, …

• Metrics and logs

Experiment Dashboards

- WLCG Monitoring

- Sites availability, data transfers, job information, reports

- Used by WLCG, experiments, sites and users


Status: (Unified) Monitoring (1)

• Offering: monitor, collect, aggregate, process, visualize, alarm … for metrics and logs!

• ~400 (virtual) servers, 500GB/day, 1B docs/day

- Mon data management from CERN IT and WLCG

- Infrastructure and tools for CERN IT and WLCG

• Migrations ongoing (double maintenance)

- CERN IT: From Lemon sensor to collectd

- WLCG: From former infra, tools, and dashboards


Status: (Unified) Monitoring (2)

[Architecture diagram: Data Sources (FTS, Rucio, XRootD, Jobs, Lemon, syslog, app log) → Transport (Flume gateways for AMQ, DB, HTTP feed, logs and Lemon metrics; Flume Kafka sink) → Kafka cluster (buffering) * → Processing (data enrichment, data aggregation, batch processing) → Flume sinks → Storage & Search (HDFS, Elasticsearch, others (influxdb)) → Data Access (CLI, API; user views, user jobs, user data)]

Today: > 500 GB/day, 72h buffering
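The volume figures quoted for the monitoring service (500 GB/day, 1B docs/day, 72h buffering) imply some rough sizes. This is simple arithmetic under stated assumptions: decimal gigabytes, a uniform rate over the day, and no allowance for Kafka replication or peaks, none of which the slides specify.

```python
GB = 10**9                 # assumption: decimal gigabytes
SECONDS_PER_DAY = 86_400

daily_bytes = 500 * GB     # from the slides
daily_docs = 1_000_000_000 # from the slides
buffer_hours = 72          # from the slides

avg_doc_bytes = daily_bytes / daily_docs                 # average document size
avg_rate_mb_s = daily_bytes / SECONDS_PER_DAY / 10**6    # average ingest rate in MB/s
docs_per_second = daily_docs / SECONDS_PER_DAY           # average document rate
buffer_bytes = daily_bytes * buffer_hours / 24           # data held by a 72 h buffer

print(f"average document size: {avg_doc_bytes:.0f} bytes")
print(f"average ingest rate:   {avg_rate_mb_s:.2f} MB/s, {docs_per_second:.0f} docs/s")
print(f"72 h buffer:           {buffer_bytes / 10**12:.1f} TB before replication")
```

The buffer figure is a lower bound on the storage the buffering layer needs, since replication and bursts would add to it.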