MCP Standard Configuration

version: latest

Contents

Copyright notice
Preface
    Intended audience
    Documentation history
Cloud sizes
    OpenStack environment sizes
    Kubernetes cluster sizes
    StackLight LMA and environment sizes
    Ceph cluster sizes
Minimum hardware requirements
Infrastructure nodes disk layout
Networking
    Server networking
    Access networking
    Switching fabric capabilities
    NFV considerations
StackLight LMA scaling
    StackLight LMA server roles and services distribution
    StackLight LMA resource requirements per cloud size
OpenStack environment scaling
    OpenStack server roles
    Services distribution across nodes
    Operating systems and versions
    OpenStack compact cloud architecture
    OpenStack small cloud architecture
    OpenStack medium cloud architecture with Neutron OVS DVR/non-DVR
    OpenStack medium cloud architecture with OpenContrail
    OpenStack large cloud architecture
Kubernetes cluster scaling
    Small Kubernetes cluster architecture
    Medium Kubernetes cluster architecture
    Large Kubernetes cluster architecture
Ceph hardware requirements
    Ceph cluster considerations
    Ceph hardware considerations


Copyright notice

2018 Mirantis, Inc. All rights reserved.

This product is protected by U.S. and international copyright and intellectual property laws. No part of this publication may be reproduced in any written, electronic, recording, or photocopying form without written permission of Mirantis, Inc.

Mirantis, Inc. reserves the right to modify the content of this document at any time without prior notice. Functionality described in the document may not be available at the moment. The document contains the latest information at the time of publication.

Mirantis, Inc. and the Mirantis Logo are trademarks of Mirantis, Inc. and/or its affiliates in the United States and other countries. Third party trademarks, service marks, and names mentioned in this document are the properties of their respective owners.


Preface

This documentation provides information on how to use Mirantis products to deploy cloud environments. The information is for reference purposes and is subject to change.

Intended audience

This documentation is intended for deployment engineers, system administrators, and developers; it assumes that the reader is already familiar with network and cloud concepts.

Documentation history

This is the latest version of the Mirantis Cloud Platform documentation. It is updated continuously to reflect the recent changes in the product. To switch to any release-tagged version of the Mirantis Cloud Platform documentation, use the Current version drop-down menu on the home page.


Cloud sizes

The Mirantis Cloud Platform (MCP) enables you to deploy OpenStack environments and Kubernetes clusters of different scales. This document uses the terms small, medium, and large clouds to correspond to the number of virtual machines that you can run in your OpenStack environment or the number of pods that you can run in your Kubernetes cluster.

This section describes the sizes of MCP clusters.

OpenStack environment sizes

All cloud environments require you to configure a staging environment which mimics the exact production setup and enables you to test configuration changes prior to applying them in production. However, if you have multiple clouds of the same type, one staging environment is sufficient.

Environments configured with Neutron OVS as a networking solution require additional hardware nodes called network nodes or tenant network gateway nodes.

The following table describes the number of virtual machines for each scale:

OpenStack environment sizes

Environment size   Number of virtual machines   Number of compute nodes   Number of infrastructure nodes
Compact            Up to 1,000                  Up to 50                  3
Small              1,000 - 2,500                50 - 100                  6
Medium             2,500 - 5,000                100 - 200                 12 (Neutron OVS or OpenContrail)
Large              5,000 - 10,000               200 - 500                 17 (OpenContrail only)

Kubernetes cluster sizes

Kubernetes and etcd are lightweight applications that typically do not require a lot of resources. However, each Kubernetes component scales differently and some of them may have limitations on scaling.

Based on performance testing, an optimal density of pods is 100 pods per Kubernetes Node.
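The pod density above gives a quick way to translate a target pod count into a Node count. The following minimal Python sketch is illustrative only and is not part of any MCP tooling; it simply applies the 100 pods per Node figure:

    # Rough Kubernetes Node count estimate based on the 100 pods/Node density above.
    PODS_PER_NODE = 100  # optimal density per Mirantis performance testing

    def nodes_for_pods(target_pods: int) -> int:
        """Minimum number of Kubernetes Nodes for a target number of pods."""
        return -(-target_pods // PODS_PER_NODE)  # ceiling division

    print(nodes_for_pods(5000))    # 50  -> upper bound of a small cluster
    print(nodes_for_pods(20000))   # 200 -> upper bound of a medium cluster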

Note

If you run both Kubernetes clusters and OpenStack environments in one MCP installation, you do not need to configure additional infrastructure nodes. Instead, the same infrastructure nodes are used for both.

The following table describes the number of Kubernetes Master nodes and Kubernetes Nodes for each scale:

Cluster size   Number of pods    Number of Nodes   Number of Master nodes   Number of infrastructure nodes
Small          2,000 - 5,000     20 - 50           3                        3
Medium         5,000 - 20,000    50 - 200          6                        3
Large          20,000 - 50,000   200 - 500         9                        3

StackLight LMA and environment sizes

The following table contains StackLight LMA sizing recommendations for different sizes of monitored clouds.

StackLight LMA nodes per environment size

Environment size   Number of monitored nodes   Number of StackLight LMA nodes   Notes and comments
Compact            Up to 50                    0 (colocated with VCP)           StackLight LMA is installed on virtual machines running on the same KVM infrastructure nodes as the VCP.
Small              50 - 100                    3                                StackLight LMA is installed on dedicated infrastructure nodes running InfluxDB, Elasticsearch, and the Docker Swarm mode cluster.
Medium             100 - 200                   3                                StackLight LMA is installed on dedicated infrastructure nodes running InfluxDB, Elasticsearch, and the Docker Swarm mode cluster.
Large              200 - 500                   6                                StackLight LMA is installed on 3 dedicated infrastructure nodes running InfluxDB and Elasticsearch, and on 3 bare metal servers running the Docker Swarm mode cluster. The Elasticsearch cluster can be scaled up to 5 nodes, which raises the number of StackLight LMA nodes to 8.

Ceph cluster sizes

The following table summarizes the number of storage nodes required depending on cloud size to meet the performance characteristics described in Ceph cluster considerations.

Ceph storage nodes per cloud size

Cloud size   Number of compute nodes   Number of virtual machines   Number of storage nodes     Number of Ceph Monitors
Compact      Up to 50                  Up to 1,000                  Up to 9 (20-disk chassis)   3
Small        50 - 100                  1,000 - 2,000                14 - 20 (20-disk chassis)   3
Medium       100 - 200                 2,000 - 4,000                25 - 50 (20-disk chassis)   3
Large        200 - 500                 4,000 - 10,000               60 (36-disk chassis)        3

See also

Ceph hardware considerations


Minimum hardware requirements

When calculating hardware requirements, you need to plan for the infrastructure nodes, compute nodes, storage nodes, and, if you are installing a Kubernetes cluster, Kubernetes Master nodes and Kubernetes Nodes.

Note

For more details about the services distribution throughout the nodes, see OpenStack environment scaling and Kubernetes cluster scaling.

The following tables list minimal hardware requirements for the corresponding nodes of the cloud:

MCP foundation node

Parameter   Value
CPU         • 1 x 12 Core CPU Intel Xeon E5-2670v3, or
            • 1 x 14 Core CPU Intel Xeon Gold 5120 (Skylake-SP)
RAM         32 GB
Disk        1 x 2 TB HDD
Network     1 x 1/10 GbE Intel X710 dual-port NIC

OpenStack infrastructure node

Parameter   Value
CPU         • 2 x 12 Core CPUs Intel Xeon E5-2670v3, or
            • 2 x 12 Core CPUs Intel Xeon Gold 6126 (Skylake-SP), or
            • 2 x 14 Core CPUs Intel Xeon Gold 5120 (Skylake-SP)
RAM         256 GB
Disk        2 x 1.6 TB SSD Intel S3520 or similar
Network     2 x 10 GbE Intel X710 dual-port NICs

OpenStack compute node

Parameter   Value
CPU         • 2 x 12 Core CPUs Intel Xeon E5-2670v3, or
            • 2 x 12 Core CPUs Intel Xeon Gold 6126 (Skylake-SP), or
            • 2 x 14 Core CPUs Intel Xeon Gold 5120 (Skylake-SP)
RAM         256 GB
Disk        • 1 x 2 TB HDD for system storage
            • 2 x 960 GB SSD Intel S3520 for VM disk storage [1]
Network     2 x 10 GbE Intel X710 dual-port NICs

[1] Only required if local storage for VM ephemeral disks is used. Not required if network distributed storage, such as Ceph, is set up in the cloud.


Ceph storage node

Parameter     Value
CPU           • 2 x 8 Core CPU Intel E5-2630v4, or
              • 2 x 8 Core CPU Intel Xeon Silver 4110
RAM           128 GB
Disk          • 4 x 200 GB SSD Intel S3710 for journal storage
              • 20 x 2 TB HDD for data storage
Boot drives   • 2 x 128 GB Disk on Module (DOM), or
              • 2 x 150 GB SSD Intel S3520 or similar
Network       1 x 10 GbE Intel X710 dual-port NIC or similar

Kubernetes Master node or Kubernetes Node

Parameter   Value
CPU         • 2 x 12 Core CPUs Intel Xeon E5-2670v3, or
            • 2 x 12 Core CPUs Intel Xeon Gold 6126, or
            • 2 x 14 Core CPUs Intel Xeon Gold 5120
RAM         256 GB
Disk        2 x 960 GB SSD Intel S3520 or similar for system storage
Network     2 x 10 GbE Intel X710 dual-port NICs


Infrastructure nodes disk layout

Infrastructure nodes are typically installed on hardware servers. These servers run all components of the management and control plane for both MCP and the cloud itself. It is very important to configure the hardware servers properly upfront because changing their configuration after the initial deployment is costly.

For instructions on how to configure the disk layout for MAAS to provision the hardware machines, see MCP Deployment Guide: Add a custom disk layout per node in the MCP model.

Consider the following recommendations:

Layout

Mirantis recommends using the LVM layout for disks on infrastructure machines. This option allows for more operational flexibility, such as resizing the Volume Groups and Logical Volumes for scale-out.

LVM Volume Groups

According to Minimum hardware requirements, an infrastructure node typically has two SSD disks. These disks must be configured as LVM Physical Volumes and joined into a Volume Group.

The name of the Volume Group is the same across all infrastructure nodes to ensure consistency of LCM operations. Mirantis recommends following the vg01 naming convention for the Volume Group.

LVM Logical Volumes

The following table summarizes the recommended Logical Volume schema for infrastructure nodes. Follow the instructions in the MCP Deployment Guide to configure this in your cluster model.

Logical Volume schema for infrastructure nodes

Logical Volume path   Mount point                Size
/dev/vg01/root        /                          70 GB
/dev/vg01/gluster     /srv/glusterfs             200 GB
/dev/vg01/images      /var/lib/libvirt/images/   700 GB
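As a rough sanity check, the schema above leaves most of the Volume Group free for later resizing. The following Python sketch is illustrative only; it tallies the recommended Logical Volumes against a vg01 Volume Group built from the two 1.6 TB SSDs listed in Minimum hardware requirements:

    # Recommended Logical Volume schema for infrastructure nodes (sizes in GB).
    lv_schema_gb = {
        "/dev/vg01/root": 70,       # mounted at /
        "/dev/vg01/gluster": 200,   # mounted at /srv/glusterfs
        "/dev/vg01/images": 700,    # mounted at /var/lib/libvirt/images/
    }

    vg01_capacity_gb = 2 * 1600     # two 1.6 TB SSDs joined into one Volume Group
    allocated_gb = sum(lv_schema_gb.values())
    print(f"Allocated {allocated_gb} GB of {vg01_capacity_gb} GB; "
          f"{vg01_capacity_gb - allocated_gb} GB left for resizing and scale-out.")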


Networking

This section describes the key hardware recommendations on server and data networking, as well as switching fabric capabilities and NFV considerations.

Server networking

The minimal network configuration for hardware servers includes two dual-port 1/10 Gbit Ethernet Network Interface Cards (NICs).

The first pair of 1/10 Gbit Ethernet interfaces is used for the PXE, management, and control plane traffic. These interfaces should be connected to an access switch in 1 or 10 GbE mode.

The following options are recommended to configure this NIC in terms of bonding and distribution of the traffic:

• Use one 1/10 GbE interface for the PXE or management network, and the other 1/10 GbE interface for control plane traffic. This is the simplest and most generic option, recommended for most use cases.

• Create a bond interface with this pair of 1/10 GbE physical interfaces and configure it in the balance-alb mode, following the Linux Ethernet Bonding driver documentation. This configuration allows using the interfaces separately from the bond for PXE purposes and does not require support from the switching fabric.

• Create a bond interface with this pair of 1/10 GbE interfaces in the 802.3ad mode and set the load balancing policy to the transmission hash policy based on TCP/UDP port numbers (xmit_hash_policy encap3+4). This configuration provides better availability of the connection. It requires that the switching fabric supports the LACP Fallback capability. See Switching fabric capabilities for details.

The second NIC with two interfaces is used for the data plane and storage traffic. On the operating system level, the ports on this 1/10 GbE card are joined into an LACP bond (Linux bond mode 802.3ad).

The recommended LACP load balancing method for this bond interface is the transmission hash policy based on TCP/UDP port numbers (xmit_hash_policy encap3+4).

This NIC must be connected to an access switch in 10 GbE mode.

Note

The LACP configuration in the 802.3ad mode on the server side must be supported by the corresponding configuration of the switching fabric. See Switching fabric capabilities for details.

See also

Linux Ethernet Bonding Driver Documentation


Access networking

The top of the rack (ToR) switches provide connectivity to servers on the physical and data-link levels. They must provide support for LACP and other technologies used on the server side, for example, 802.1q VLAN segmentation. Access layer switches are used in stacked pairs.

Mirantis recommends installing the following 10 GbE switches as the top of the rack (ToR) for the Public, Storage, and Tenant networks in MCP:

• Juniper QFX5100-48S 48x 1/10 GbE ports, 6x 40 GbE ports

• Quanta T3048-LY9 48x 1/10 GbE ports, 4x 40 GbE ports

• Arista 7050T-64 48x 1/10 GbE ports, 4x 40 GbE QSFP+ ports

The recommended IPMI switch is Dell PowerConnect 6248 (no stacking or uplink cards required) or a Juniper EX4300 Series switch.

The following 1/10 GbE switches are recommended to install as ToR for the PXE and Management networks in MCP:

• Juniper EX3300-48T-BF

• Arista 7010T-48

• QuantaMesh T1048-LB9

The following diagram shows how a server is connected to the switching fabric, and how the fabric itself is configured.


For external networking with OpenContrail, the following hardware is recommended:

• Juniper MX104

• Virtual instances vMX or vSRX for lower bandwidth requirements (up to 80 Gbps)

Switching fabric capabilities

The following table summarizes the requirements for the switching fabric capabilities:

Switch fabric capabilities summary

LACP TCP/UDP hash balance mode
    Level 4 LACP hash balance mode is recommended to support services that employ TCP sessions. This helps to avoid fragmentation and asymmetry in traffic flows.

Multihomed server connection support
    There are two major options to support multihomed server connections:

    • Switch stacking
      Stacked switches work as a single logical unit from the configuration and data path standpoint. This allows you to configure an IP default gateway on the logical stacked switch to support the multi-rack use case.

    • Multipath Link Aggregation Groups (MLAG)
      MLAG support is recommended to allow cross-switch bonding across stacked ToR switches. Using MLAG allows you to maintain and upgrade the switches separately without network interruption for servers. On the other hand, the Layer-3 network configuration is more complicated when using MLAG. Therefore, Mirantis recommends using MLAG with a plain access network topology.

LAG/port-channel links
    The number of supported LAG/port-channel links per switch must be twice the number of ports. Take this parameter into account so that you can create the required number of LAGs to accommodate all servers connected to the switch.

Note

LACP configurations on the access and server levels must be compatible with each other. In general, this might require additional design and testing effort in every particular case, depending on the models of switching hardware and the requirements for networking performance.

The following capabilities are required from the switching fabric to support the PXE and BOOTP protocols over an LACP bond interface. See Server networking for details.

Switch fabric capabilities summary

LACP fallback mode
    Servers must be able to boot with the BOOTP/PXE protocol over a network connected through an LACP bond interface in the 802.3ad mode. The LACP fallback capability allows you to dynamically assemble a LAG based on the status of the member NICs on the server side. Upon the initial boot, the interfaces are not bonded, the LAG is disassembled, and the server boots normally over a single PXE NIC. This requirement applies to the multihomed Management network option only.
    Note that different switch vendors have different notations for this functionality. See the links to the examples below.

See also

• Configuring LACP Fallback on Arista switches

• Forcing MC-LAG Links or Interfaces With Limited LACP Capability to Be Up

• Configure LACP auto ungroup on Dell switches

NFV considerations

Network function virtualization (NFV) is an important factor to consider while planning hardware capacity for the Mirantis Cloud Platform. Mirantis recommends separating nodes that support NFV from other nodes to reserve them as Data Plane nodes that use network virtualization functions.

The following types of the Data Plane nodes use NFV:

1. Compute nodes that can run virtual machines with hardware-assisted Virtualized Networking Functions (VNF) in terms of the OPNFV architecture.

2. Networking nodes that provide gateways and routers to the OVS-based tenant networks using network virtualization functions.

The following table describes the compatibility of NFV features for different MCP deployments.

NFV for MCP compatibility matrix

Type             Host OS   Kernel   Hugepages   DPDK   SR-IOV   NUMA   CPU pinning   Multiqueue
OVS              Xenial    4.8      Yes         No     Yes      Yes    Yes           Yes
Kernel vRouter   Xenial    4.8      Yes         No     Yes      Yes    Yes           Yes
DPDK vRouter     Xenial    4.8      Yes         Yes    No       Yes    Yes           No (version 3.2)
DPDK OVS         Xenial    4.8      Yes         Yes    No       Yes    Yes           Yes


StackLight LMA scaling

The MCP StackLight LMA enables the user to monitor all kinds of MCP clusters, including OpenStack, Kubernetes, and Ceph environments.

This section describes the distribution of StackLight LMA services and hardware requirements for different sizes of clouds.

StackLight LMA server roles and services distribution

The following table lists the roles of StackLight LMA nodes and their names in the Salt Reclass metadata model:

StackLight LMA nodes

Server role name                                    Role group name   Description
StackLight LMA metering node                        mtr               Servers that run Prometheus long-term storage and/or InfluxDB.
StackLight LMA log storage and visualization node   log               Servers that run Elasticsearch and Kibana.
StackLight LMA monitoring node                      mon               Servers that run the Prometheus, Grafana, Pushgateway, Alertmanager, and Alerta services in containers in Docker Swarm mode.

StackLight LMA resource requirements per cloud size

Compact cloud

The following table summarizes the resource requirements of all StackLight LMA node roles for compact clouds (up to 50 compute nodes).

Resource requirements per StackLight LMA role for compact cloud

Virtual server role   # of instances   CPU vCores per instance   Memory (GB) per instance   Disk space (GB) per instance
mon                   3                4                         16                         240
mtr                   3                4                         32                         240
log                   3                4                         8                          400

Small cloud

The following table summarizes the resource requirements of all StackLight LMA node roles for small clouds (50 - 100 compute nodes).

Resource requirements per StackLight LMA role for small cloud

Virtual server role   # of instances   CPU vCores per instance   Memory (GB) per instance   Disk space (GB) per instance
mon                   3                12                        64                         240
mtr                   3                12                        64                         500
log                   3                12                        48                         1500

Medium cloud

The following table summarizes the resource requirements of all StackLight LMA node roles for medium clouds (100 - 200 compute nodes).

Resource requirements per StackLight LMA role for medium cloud

Virtual server role   # of instances   CPU vCores per instance   Memory (GB) per instance   Disk space (GB) per instance
mon                   3                12                        64                         240
mtr                   3                12                        96                         1000
log                   3                16                        48                         3000

Large cloud

The following table summarizes the resource requirements of all StackLight LMA node roles for large clouds (200 - 500 compute nodes).

Resource requirements per StackLight LMA role for large cloud

Virtual server role   # of instances   CPU vCores per instance   Memory (GB) per instance   Disk space (GB) per instance
mon                   3                24                        256                        1000
mtr                   3                16                        196                        1500
log                   3                16                        48                         5000


OpenStack environment scaling

The Mirantis Cloud Platform (MCP) enables you to deploy OpenStack environments at different scales. This document defines the following sizes of environments: small, medium, and large. Each environment size requires a different number of infrastructure nodes. The Virtualized Control Plane (VCP) services are distributed among the physical infrastructure nodes for optimal performance.

This section describes the VCP services distribution, as well as hardware requirements for different sizes of clouds.

OpenStack server roles

Components of the Virtualized Control Plane (VCP) have roles that define their functions. Each role can be assigned to a specific set of virtual servers, which allows you to adjust the number of instances with a particular role independently of other roles, providing greater flexibility to the environment architecture.

The following table lists the OpenStack roles and their names in the SaltStack formulas:

OpenStack infrastructure nodes

Server role name                      Role group name   Description
Infrastructure node                   kvm               Infrastructure KVM hosts that run MCP component services as virtual machines.
Network node                          gtw               Nodes that provide tenant network data plane services.
DriveTrain Salt Master node           cfg               The Salt Master node that is responsible for sending commands to Salt Minion nodes.
DriveTrain / StackLight OSS node      cid               Nodes that run StackLight OSS and DriveTrain services in containers in Docker Swarm mode.
RabbitMQ server node                  msg               Nodes that run the message queue service RabbitMQ.
Database server node                  dbs               Nodes that run the database cluster called Galera.
OpenStack controller nodes            ctl               Nodes that run the Virtualized Control Plane services, including the API servers and scheduler components.
OpenStack compute nodes               cmp               Nodes that run the hypervisor service and VM workloads.
OpenStack monitoring database nodes   mdb               Nodes that run the Telemetry monitoring database services.
Proxy node                            prx               Nodes that run the reverse proxy that exposes the OpenStack API, dashboards, and other components externally.
Contrail controller nodes             ntw               Nodes that run the OpenContrail controller services.
Contrail analytics nodes              nal               Nodes that run the OpenContrail analytics services.
StackLight LMA log nodes              log               Nodes that run the StackLight LMA logging and visualization services.
StackLight LMA database nodes         mtr               Nodes that run the StackLight database services.
StackLight LMA and OSS nodes          mon               Nodes that run the StackLight LMA monitoring services and StackLight OSS (DevOps Portal) services.

See also

• StackLight LMA server roles and services distribution

Services distribution across nodes

The distribution of services across physical nodes depends on their resource consumption profile and the number of compute nodes they administer.

Mirantis recommends the following distribution of services across nodes:

Distribution of services across nodes in an OpenStack environment

Service                         Physical server group   Virtual   VM role group
aodh-api                        kvm                     yes       mdb
aodh-evaluator                  kvm                     yes       mdb
aodh-listener                   kvm                     yes       mdb
aodh-notifier                   kvm                     yes       mdb
ceilometer-agent-central        kvm                     yes       mdb or ctl [2]
ceilometer-agent-notification   kvm                     yes       mdb or ctl [2]
ceilometer-api                  kvm                     yes       mdb or ctl [2]
cinder-api                      kvm                     yes       ctl
cinder-scheduler                kvm                     yes       ctl
cinder-volume                   kvm                     yes       ctl
DriveTrain [3]                  kvm                     yes       cid
glance-api                      kvm                     yes       ctl
glance-registry                 kvm                     yes       ctl
gnocchi-metricd [4]             kvm                     yes       mdb
horizon/apache2                 kvm                     yes       prx
keystone-all                    kvm                     yes       ctl
mysql-server                    kvm                     yes       dbs
neutron-dhcp-agent [5]          gtw                     no        N/A
neutron-l2-agent [5]            cmp, gtw                no        N/A
neutron-l3-agent [5]            gtw                     no        N/A
neutron-metadata-agent [5]      gtw                     no        N/A
neutron-server [5]              kvm                     yes       ctl
nova-api                        kvm                     yes       ctl
nova-compute                    cmp                     no        N/A
nova-conductor                  kvm                     yes       ctl
nova-scheduler                  kvm                     yes       ctl
OSS Tools [3]                   kvm                     yes       cid
panko-api [4]                   kvm                     yes       mdb
rabbitmq-server                 kvm                     yes       msg

[2] The mdb role is for Pike, the ctl role is for Ocata.
[3] DriveTrain and OSS Tools services run in the Docker Swarm Mode cluster.
[4] Gnocchi and Panko services are added starting with Pike.
[5] Services related to Neutron OVS only.

Operating systems and versions

The following table lists the operating systems used for different roles in the Virtualized Control Plane (VCP):

Operating systems and versions

Server role name                                    Role group name   Ubuntu version
Infrastructure node                                 kvm               xenial/16.04
Network node                                        gtw               xenial/16.04
StackLight LMA monitoring node                      mon               xenial/16.04
StackLight LMA metering and tenant telemetry node   mtr               xenial/16.04
StackLight LMA log storage and visualization node   log               xenial/16.04
DriveTrain Salt Master node                         cfg               xenial/16.04
DriveTrain StackLight OSS node                      cid               xenial/16.04
RabbitMQ server node                                msg               xenial/16.04
Database server node                                dbs               xenial/16.04
OpenStack controller node                           ctl               xenial/16.04
OpenStack compute node                              cmp               xenial/16.04 (depends on whether NFV features are enabled or not)
Proxy node                                          prx               xenial/16.04
OpenContrail controller node                        ntw               trusty/14.04 for v3.2, xenial/16.04 for v4.x
OpenContrail analytics node                         nal               trusty/14.04 for v3.2, xenial/16.04 for v4.x

OpenStack compact cloud architecture

A compact OpenStack cloud includes up to 50 compute nodes and requires you to have at least three infrastructure nodes.

A compact cloud includes all the roles described in OpenStack server roles.

The following diagram describes the distribution of VCP and other services throughout the infrastructure nodes.


The following table describes the hardware nodes in an OpenStack environment of an MCP cluster, roles assigned to them, and resources per node:

Physical server roles and hardware requirements

Node type                 Role name   Number of servers
Infrastructure nodes      kvm         3
OpenStack compute nodes   cmp         up to 50

The following table summarizes the VCP virtual machines mapped to physical servers.

Resource requirements per VCP and DriveTrain roles

Virtual server role   Physical servers    # of instances   CPU vCores per instance   Memory (GB) per instance   Disk space (GB) per instance
ctl                   kvm01 kvm02 kvm03   3                8                         32                         100
msg                   kvm01 kvm02 kvm03   3                8                         32                         100
dbs                   kvm01 kvm02 kvm03   3                8                         16                         100
prx                   kvm02 kvm03         2                4                         8                          50
cfg                   kvm01               1                2                         8                          50
mon                   kvm01 kvm02 kvm03   3                4                         16                         240
mtr                   kvm01 kvm02 kvm03   3                4                         8                          240
log                   kvm01 kvm02 kvm03   3                4                         8                          400
cid                   kvm01 kvm02 kvm03   3                8                         32                         100
gtw                   kvm01 kvm02 kvm03   3                4                         16                         50
TOTAL                                     27               154                       480                        4140
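The placement above can be checked against the infrastructure node hardware from Minimum hardware requirements. The following Python sketch is illustrative only; it sums the per-host vCores and RAM of the compact-cloud VCP virtual machines and compares the RAM total with the 256 GB of an infrastructure node:

    from collections import defaultdict

    # (role, hosts, vCores per instance, RAM in GB per instance) -- values from the table above.
    placement = [
        ("ctl", ["kvm01", "kvm02", "kvm03"], 8, 32),
        ("msg", ["kvm01", "kvm02", "kvm03"], 8, 32),
        ("dbs", ["kvm01", "kvm02", "kvm03"], 8, 16),
        ("prx", ["kvm02", "kvm03"], 4, 8),
        ("cfg", ["kvm01"], 2, 8),
        ("mon", ["kvm01", "kvm02", "kvm03"], 4, 16),
        ("mtr", ["kvm01", "kvm02", "kvm03"], 4, 8),
        ("log", ["kvm01", "kvm02", "kvm03"], 4, 8),
        ("cid", ["kvm01", "kvm02", "kvm03"], 8, 32),
        ("gtw", ["kvm01", "kvm02", "kvm03"], 4, 16),
    ]

    vcores, ram = defaultdict(int), defaultdict(int)
    for role, hosts, cpu, mem in placement:
        for host in hosts:
            vcores[host] += cpu
            ram[host] += mem

    for host in sorted(ram):
        print(f"{host}: {vcores[host]} vCores, {ram[host]} GB RAM "
              f"(fits a 256 GB infrastructure node: {ram[host] <= 256})")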

Note

• The gtw VM should have four separate NICs for the following interfaces: dhcp, primary, tenant, and external. It simplifies the host networking: you do not need to pass VLANs to VMs.

• The prx VM should have an additional NIC for the Proxy network.

• All other nodes should have two NICs for the DHCP and Primary networks.

OpenStack small cloud architecture

A small OpenStack cloud includes up to 100 compute nodes and requires you to have at least 6 infrastructure nodes, including additional servers for the KVM infrastructure and RabbitMQ/Galera services.

The StackLight LMA services are distributed across 3 additional infrastructure nodes. See the detailed resource requirements for StackLight LMA in StackLight LMA resource requirements per cloud size.

A small cloud includes all the roles described in OpenStack server roles.

The number of nodes is the same for both OpenContrail and Neutron OVS-based clouds. However, the roles on network nodes are different.

The following diagram describes the distribution of VCP and other services throughout the infrastructure nodes for Neutron OVS-based small clouds.


The following table describes the hardware nodes in MCP OpenStack, roles assigned to them, and resources per node:

Physical server roles and quantities

Node type                                  Role name   Number of servers
Infrastructure nodes (VCP, OpenContrail)   kvm         3
Infrastructure nodes (StackLight LMA)      kvm         3
OpenStack compute nodes                    cmp         50 - 100
Staging infrastructure nodes               kvm         6
Staging OpenStack compute nodes            cmp         2 - 5

The following table summarizes the VCP virtual machines mapped to physical servers.

Resource requirements per VCP and DriveTrain roles

Virtual server role   Physical servers    # of instances   CPU vCores per instance   Memory (GB) per instance   Disk space (GB) per instance
ctl                   kvm01 kvm02 kvm03   3                8                         32                         100
msg                   kvm01 kvm02 kvm03   3                8                         64                         100
dbs                   kvm01 kvm02 kvm03   3                8                         32                         100
prx                   kvm02 kvm03         2                4                         8                          50
cfg                   kvm01               1                2                         8                          50
cid                   kvm01 kvm02 kvm03   3                8                         32                         100
gtw                   kvm01 kvm02 kvm03   3                4                         16                         50
TOTAL                                     18               118                       552                        1500

The following table summarizes the OpenContrail virtual machines mapped to physical servers (optional). This is only needed when OpenContrail is used for OpenStack tenant networking.

Resource requirements per OpenContrail roles (optional)

Virtual server role   Physical servers    # of instances   CPU vCores per instance   Memory (GB) per instance   Disk space (GB) per instance
ntw                   kvm01 kvm02 kvm03   3                8                         32                         50
nal                   kvm01 kvm02 kvm03   3                8                         32                         100
TOTAL                                     6                48                        192                        450

Note

A vCore is the number of available virtual cores considering hyper-threading and the overcommit ratio. Assuming an overcommit ratio of 1, the number of vCores in a physical server is roughly the number of physical cores multiplied by 1.3.
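For example, a minimal arithmetic sketch of this estimate, assuming the 2 x 12 Core infrastructure CPUs from Minimum hardware requirements:

    physical_cores = 2 * 12                 # 2 x 12 Core CPUs on an infrastructure node
    vcores = int(physical_cores * 1.3)      # hyper-threading with an overcommit ratio of 1
    print(vcores)                           # ~31 vCores available for VCP virtual machines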

Note

• The gtw VM should have four separate NICs for the following interfaces: dhcp, primary, tenant, and external. It simplifies the host networking: you do not need to pass VLANs to VMs.

• The prx VM should have an additional NIC for the Proxy network.

• All other nodes should have two NICs for the DHCP and Primary networks.


OpenStack medium cloud architecture with Neutron OVS DVR/non-DVR

A medium OpenStack cloud includes up to 200 compute nodes and requires you to have at least 9 infrastructure nodes, including additional servers for the KVM infrastructure and RabbitMQ/Galera services. For Neutron OVS as a networking solution, 3 bare-metal network nodes are installed to accommodate network traffic, as opposed to virtualized controllers in the case of OpenContrail.

The StackLight LMA services are distributed across 3 additional infrastructure nodes. See the detailed resource requirements for StackLight LMA in StackLight LMA resource requirements per cloud size.

A medium cloud includes all roles described in OpenStack server roles.

The following diagram describes the distribution of VCP and other services throughout the infrastructure nodes.

The following table describes the hardware nodes in MCP OpenStack, roles assigned to them, and resources per node:


Physical server roles and quantities

Node type                               Role name   Number of servers
Infrastructure nodes (VCP)              kvm         6
Infrastructure nodes (StackLight LMA)   kvm         3
Tenant network gateway nodes            gtw         3
OpenStack compute nodes                 cmp         100 - 200
Staging infrastructure nodes            kvm         9
Staging tenant network gateway nodes    gtw         3
Staging compute nodes                   cmp         2 - 5

The following table summarizes the VCP virtual machines mapped to physical servers.

Resource requirements per VCP role

Virtual server role   Physical servers    # of instances   CPU vCores per instance   Memory (GB) per instance   Disk space (GB) per instance
ctl                   kvm01 kvm02 kvm03   3                16                        64                         100
dbs                   kvm01 kvm02 kvm03   3                8                         32                         100
msg                   kvm04 kvm05 kvm06   3                16                        64                         100
prx                   kvm04 kvm05         2                4                         16                         50
TOTAL                                     11               128                       512                        1000

Resource requirements per DriveTrain role

Virtual server role   Physical servers    # of instances   CPU vCores per instance   Memory (GB) per instance   Disk space (GB) per instance
cfg                   kvm01               1                8                         16                         50
cid                   kvm01 kvm02 kvm03   3                8                         32                         200
TOTAL                                     4                32                        112                        650

Note

A vCore is the number of available virtual cores considering hyper-threading and the overcommit ratio. Assuming an overcommit ratio of 1, the number of vCores in a physical server is roughly the number of physical cores multiplied by 1.3.


OpenStack medium cloud architecture with OpenContrail

A medium OpenStack cloud includes up to 200 compute nodes and requires you to have at least 9 infrastructure nodes, including servers for the KVM infrastructure and RabbitMQ/Galera services. For OpenContrail as a networking solution, a virtualized OpenContrail controller is installed onto the infrastructure nodes instead of the bare metal network nodes used in the case of Neutron OVS.

The StackLight LMA services are distributed across 3 additional infrastructure nodes. See the detailed resource requirements for StackLight LMA in StackLight LMA resource requirements per cloud size.

The following diagram describes the distribution of VCP and other services throughout the infrastructure nodes.

Note

The diagram displays the OpenContrail 4.0 nodes that have the control, config, database, and analytics services running as plain Docker containers managed by docker-compose. In OpenContrail 3.2, these services are not Docker-based.

The following table describes the hardware nodes in MCP OpenStack, roles assigned to them, and resources per node:


Physical server roles and quantities

Node type                               Role name   Number of servers
Infrastructure nodes (VCP)              kvm         6
Infrastructure nodes (StackLight LMA)   kvm         3
Infrastructure nodes (OpenContrail)     kvm         3
OpenStack compute nodes                 cmp         50 - 200
Staging infrastructure nodes            kvm         12
Staging OpenStack compute nodes         cmp         2 - 5


The following table summarizes the VCP virtual machines mapped to physical servers.

Resource requirements per VCP role

Virtual server role   Physical servers    # of instances   CPU vCores per instance   Memory (GB) per instance   Disk space (GB) per instance
ctl                   kvm01 kvm02 kvm03   3                16                        64                         100
dbs                   kvm01 kvm02 kvm03   3                8                         32                         100
msg                   kvm04 kvm05 kvm06   3                16                        64                         100
prx                   kvm04 kvm05         2                4                         16                         50
TOTAL                                     11               128                       512                        1000

Resource requirements per DriveTrain role

Virtual server role   Physical servers    # of instances   CPU vCores per instance   Memory (GB) per instance   Disk space (GB) per instance
cfg                   kvm01               1                8                         16                         50
cid                   kvm01 kvm02 kvm03   3                8                         32                         200
TOTAL                                     4                32                        112                        650

Resource requirements per OpenContrail role

Virtual server role   Physical servers    # of instances   CPU vCores per instance   Memory (GB) per instance   Disk space (GB) per instance
ntw                   kvm07 kvm08 kvm09   3                8                         64                         100
nal                   kvm07 kvm08 kvm09   3                16                        96                         1200
TOTAL                                     6                72                        480                        3900

Note

A vCore is the number of available virtual cores considering hyper-threading and the overcommit ratio. Assuming an overcommit ratio of 1, the number of vCores in a physical server is roughly the number of physical cores multiplied by 1.3.


OpenStack large cloud architecture

A large OpenStack cloud includes up to 500 compute nodes and requires you to have at least 12 infrastructure nodes, including dedicated bare-metal servers for RabbitMQ, OpenStack API services, and database servers. Use OpenContrail as a networking solution for large OpenStack clouds. Neutron OVS is not recommended.

The StackLight LMA services are distributed across 6 additional infrastructure nodes, including 3 bare metal nodes for the Docker Swarm Mode cluster. See the detailed resource requirements for StackLight LMA in StackLight LMA resource requirements per cloud size.

A large cloud includes all roles described in OpenStack server roles.

The following diagram describes the distribution of VCP and other services throughout the infrastructure nodes.


Note

The diagram displays the OpenContrail 4.0 nodes that have the control, config, database, and analytics services running as plain Docker containers managed by docker-compose. In OpenContrail 3.2, these services are not Docker-based.


The following table describes the hardware nodes in MCP OpenStack, roles assigned to them, and resources per node:

Physical server roles and quantities

Node type                               Role name   Number of servers
Infrastructure nodes (VCP)              kvm         9
Infrastructure nodes (OpenContrail)     kvm         3
Monitoring nodes (StackLight LMA)       mon         3
Infrastructure nodes (StackLight LMA)   kvm         3
OpenStack compute nodes                 cmp         200 - 500
Staging infrastructure nodes            kvm         18
Staging OpenStack compute nodes         cmp         2 - 5

The following table summarizes the VCP virtual machines mapped to physical servers.

Resource requirements per VCP role

Virtual server role   Physical servers                # of instances   CPU vCores per instance   Memory (GB) per instance   Disk space (GB) per instance
ctl                   kvm02 kvm03 kvm04 kvm05 kvm06   5                24                        128                        100
dbs                   kvm04 kvm05 kvm06               3                24                        64                         1000
msg                   kvm07 kvm08 kvm09               3                32                        196                        100
prx                   kvm07 kvm08                     2                8                         32                         100
TOTAL                                                 13               304                       1484                       4000

Resource requirements per DriveTrain role

Virtual server role   Physical servers    # of instances   CPU vCores per instance   Memory (GB) per instance   Disk space (GB) per instance
cfg                   kvm01               1                8                         32                         50
cid                   kvm01 kvm02 kvm03   3                4                         32                         500
TOTAL                                     4                20                        128                        1550

Resource requirements per OpenContrail role

Virtual server role   Physical servers    # of instances   CPU vCores per instance   Memory (GB) per instance   Disk space (GB) per instance
ntw                   kvm10 kvm11 kvm12   3                16                        64                         100
nal                   kvm10 kvm11 kvm12   3                24                        128                        2000
TOTAL                                     6                120                       576                        6300

Note

A vCore is the number of available virtual cores considering hyper-threading and the overcommit ratio. Assuming an overcommit ratio of 1, the number of vCores in a physical server is roughly the number of physical cores multiplied by 1.3.


Kubernetes cluster scaling

As described in Kubernetes cluster sizes, depending on the anticipated number of pods, your Kubernetes cluster may be a small, medium, or large scale deployment. A different number of Kubernetes Nodes is required for each size of cluster.

This section describes the services distribution across hardware nodes for different sizes of Kubernetes clusters.

Small Kubernetes cluster architecture

A small Kubernetes cluster includes 2,000 - 5,000 pods spread across roughly 20 - 50 physical Kubernetes Nodes and 3 Kubernetes Master nodes. Mirantis recommends separating the Kubernetes control plane that includes etcd and the Kubernetes Master node components from Kubernetes workloads.

In addition, dedicate separate physical servers for the shared storage services GlusterFS and Ceph. Running these components on the same physical servers as the control plane components may leave insufficient cycles for the etcd cluster to stay synced and for the kubelet agents to report their statuses reliably.


The following diagram displays the layout of services per physical node for a small Kubernetes cluster.


Medium Kubernetes cluster architecture

A medium Kubernetes cluster includes 5,000 - 20,000 pods spread across roughly 50 - 200 physical Kubernetes Nodes and 6 Kubernetes Master nodes. Mirantis recommends separating the Kubernetes control plane that includes etcd and the Kubernetes Master node components from Kubernetes workloads. While Kubernetes components can run on the same host, run etcd on dedicated servers, as the etcd workload increases due to the constant recording and checking of pod and kubelet statuses.

You can place the shared storage services GlusterFS and Ceph on the same physical nodes as the Kubernetes control plane components.


The following diagram displays the layout of services per physical node for a medium Kubernetes cluster.

Large Kubernetes cluster architecture

A large Kubernetes cluster includes 20,000 - 50,000 pods spread across roughly 200 - 500 physical Kubernetes Nodes and 9 Kubernetes Master nodes. Mirantis recommends separating the Kubernetes control plane that includes etcd and the Kubernetes Master node components from the Kubernetes workloads.

Note

While Kubernetes components can run on the same host, run etcd on dedicated servers, as the etcd workload increases due to the constant recording and checking of pod and kubelet statuses.


In addition, Mirantis recommends placing kube-scheduler separately from the rest of the Kubernetes control plane components. Due to the high rate of pod turnover and rescheduling, kube-scheduler requires more resources than other Kubernetes components.

You can place the shared storage services GlusterFS and Ceph on the same physical nodes as the Kubernetes control plane components.

The following diagram displays the layout of services per physical node for a large Kubernetes cluster.


Ceph hardware requirements

Mirantis recommends using a Ceph cluster as the primary storage solution for all types of ephemeral and persistent storage. A Ceph cluster that is built in conjunction with MCP must be designed to accommodate:

• Capacity requirements

• Performance requirements

• Operational requirements

This section describes differently sized clouds that use the same Ceph building blocks.

Ceph cluster considerations

When planning storage for your cloud, you must consider performance, capacity, and operational requirements that affect the efficiency of your MCP environment.

Based on those considerations and operational experience, Mirantis recommends no less than nine-node Ceph clusters for OpenStack production environments. The recommendation for test, development, or PoC environments is a minimum of five nodes. See details in Ceph cluster sizes.

Note

This section provides simplified calculations for your reference. Each Ceph cluster must be evaluated by a Mirantis Solution Architect.

Capacity

When planning capacity for your Ceph cluster, consider the following:

• Total usable capacity

  The existing amount of data plus the expected increase of data volume over the projected life of the cluster.

• Data protection (replication)

  Typically, for persistent storage a factor of 3 is recommended, while for ephemeral storage a factor of 2 is sufficient. However, with a replication factor of 2, an object cannot be recovered if one of the replicas is damaged.

• Cluster overhead

  To ensure cluster integrity, Ceph stops writing if the cluster is 90% full. Therefore, you need to plan accordingly.

• Administrative overhead

  To catch spikes in cluster usage or unexpected increases in data volume, an additional 10-15% of the raw capacity should be set aside.

The following table describes an example of capacity calculation:


Example calculation

Parameter                           Value
Current capacity persistent         500 TB
Expected growth over 3 years        300 TB
Required usable capacity            800 TB
Replication factor for all pools    3
Raw capacity                        2.4 PB
With 10% cluster internal reserve   2.64 PB
With operational reserve of 15%     3.03 PB
Total cluster capacity              3 PB
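The same calculation can be expressed as a short Python sketch. It is illustrative only; each Ceph cluster must still be evaluated by a Mirantis Solution Architect.

    usable_tb = 500 + 300                   # current persistent capacity + expected 3-year growth
    replication_factor = 3
    raw_pb = usable_tb * replication_factor / 1000           # 2.4 PB of raw capacity
    with_cluster_reserve = raw_pb * 1.10                      # Ceph stops writing at ~90% full
    with_operational_reserve = with_cluster_reserve * 1.15    # 10-15% administrative headroom
    print(f"{raw_pb:.2f} PB raw, {with_operational_reserve:.2f} PB with reserves")
    # 2.40 PB raw, 3.04 PB with reserves -> plan for roughly a 3 PB cluster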

Overall sizing

When you have both performance and capacity requirements, scale the cluster size to the higher requirement. For example, if a Ceph cluster requires 10 nodes for capacity and 20 nodes for performance to meet requirements, size the cluster to 20 nodes.

Operational recommendations

• A minimum of 9 Ceph OSD nodes is recommended to ensure that a node failure does not impact cluster performance.

• Mirantis does not recommend using servers with an excessive number of disks, such as more than 24 disks.

• All Ceph OSD nodes must have identical CPU, memory, disk, and network hardware configurations.

• If you use multiple availability zones (AZ), the number of nodes must be evenly divisible by the number of AZs.

Performance considerations

When planning performance for your Ceph cluster, consider the following:

• Raw performance capability of the storage devices. For example, a SATA hard drive provides 150 IOPS for 4k blocks.

• Ceph read IOPS performance. Calculate it using the following formula:

  number of raw read IOPS per device X number of storage devices X 80%

• Ceph write IOPS performance. Calculate it using the following formula:

  number of raw write IOPS per device X number of storage devices / replication factor X 65%

• Ratio between reads and writes. Perform a rough calculation using the following formula:

  read IOPS X % reads + write IOPS X % writes

Note

Do not use this formula for a Ceph cluster that is based on SSDs only. Contact Mirantis for evaluation.

Storage device considerations

The expected number of IOPS that a storage device can carry out, as well as its throughput, depends on the type of device. For example, a hard disk may be rated for 150 IOPS and 75 MB/s. These numbers are complementary because IOPS are measured with very small files, while the throughput is typically measured with big files.

Read IOPS and write IOPS differ depending on the device. Considering typical usage patterns helps determine how many read and write IOPS the cluster must provide. A ratio of 70/30 is fairly common for many types of clusters. The cluster size must also be considered, since the maximum number of write IOPS that a cluster can push is divided by the cluster size. Furthermore, Ceph cannot guarantee the full IOPS numbers that a device could theoretically provide, because the numbers are typically measured under testing conditions, which a Ceph cluster cannot offer, and also because of the OSD and network overhead.

You can calculate the estimated read IOPS by multiplying the read IOPS number for the device type by the number of devices, and then multiplying by ~0.8. Write IOPS are calculated as follows:

(the device IOPS * number of devices * 0.65) / cluster size

If the cluster size for the pools is different, an average can be used. If the number of devices is required, the respective formulas can be solved for the device number instead.
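A minimal Python sketch of these estimates follows. It is illustrative only and, as noted above, should not be applied to clusters based on SSDs only.

    def ceph_read_iops(device_iops: float, devices: int) -> float:
        # read IOPS ~= raw device IOPS x number of devices x 80%
        return device_iops * devices * 0.8

    def ceph_write_iops(device_iops: float, devices: int, replication: int) -> float:
        # write IOPS ~= raw device IOPS x number of devices x 65% / replication factor
        return device_iops * devices * 0.65 / replication

    def mixed_iops(read_iops: float, write_iops: float, read_share: float = 0.7) -> float:
        # weighted 70/30 read/write pattern by default
        return read_iops * read_share + write_iops * (1 - read_share)

    # The 9-node, 20-disk example from Ceph hardware considerations: 180 HDDs at ~150 IOPS each.
    r = ceph_read_iops(150, 180)        # ~21,600, quoted in that section as ~20,000
    w = ceph_write_iops(150, 180, 3)    # ~5,850, quoted in that section as ~5,000
    print(round(r), round(w), round(mixed_iops(r, w)))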

Ceph hardware considerations

When sizing a Ceph cluster, you must consider the number of drives needed for capacity and the number of drives required to accommodate the performance requirements, and then select the larger of the two so that all requirements are met.

The following list describes generic hardware considerations for a Ceph cluster; a short sizing sketch follows the list:

• Create one Ceph Object Storage Device (OSD) per HDD disk in Ceph OSD nodes.

• Allocate 1 CPU thread per OSD.

• Allocate 1 GB of RAM per 1 TB of disk storage on the OSD node.

• Do not use RAID arrays for Ceph disks. Instead, all drives must be available to Ceph individually.

• Storage controllers with support for host bus adapter (HBA) or just-a-bunch-of-disks (JBOD) mode should be selected for Ceph OSD nodes.

  • Preferably, configure storage controllers on OSD nodes to work in the HBA or initiator target (IT) mode.

  • If the RAID controller does not support HBA mode, you can configure the JBOD mode with disks exposed individually.

  • Confirm that disks are exposed directly to the operating system by checking the device parameters.

• The Ceph monitors can run in virtual machines on the infrastructure nodes, as they use relatively few resources.

• Place Ceph write journals on write-optimized SSDs instead of OSD HDD disks. Use at least 1 journal device per 5 OSD devices.

See the detailed hardware requirements for Ceph OSD storage nodes in Minimum hardware requirements.
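The following Python sketch is illustrative only; it applies the rules of thumb above to the 20 x 2 TB reference Ceph storage node from Minimum hardware requirements:

    def osd_node_requirements(hdd_count: int, hdd_size_tb: int) -> dict:
        return {
            "osds": hdd_count,                      # one OSD per HDD
            "cpu_threads": hdd_count,               # 1 CPU thread per OSD
            "ram_gb": hdd_count * hdd_size_tb,      # 1 GB of RAM per 1 TB of OSD storage
            "journal_ssds": -(-hdd_count // 5),     # at least 1 journal device per 5 OSDs
        }

    print(osd_node_requirements(20, 2))
    # {'osds': 20, 'cpu_threads': 20, 'ram_gb': 40, 'journal_ssds': 4}
    # Well within the reference node: 2 x 8 Core CPUs, 128 GB RAM, 4 x 200 GB journal SSDs.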

The following table provides an example of input parameters for a Ceph cluster calculation:

Example of input parameters

Parameter                      Value
Virtual instance size          40 GB
Read IOPS                      14
Read to write IOPS ratio       70/30
Number of availability zones   3

For 50 compute nodes, 1,000 instances

Number of OSD nodes: 9, 20-disk 2U chassis

This configuration provides 360 TB of raw storage. With a cluster size (replication factor) of 3 and 60% used initially, the initial amount of data should not exceed 72 TB (out of 120 TB of replicated storage). The expected read IOPS for this cluster is approximately 20,000 and write IOPS 5,000, or 15,000 IOPS in a 70/30 pattern.

Note

In this case performance is the driving factor, and so the capacity is greater than required.

For 300 compute nodes, 6,000 instances

Number of OSD nodes: 54, 36-disk chassis

The cost per node is low compared to the cost of the storage devices, and with a larger number of nodes the failure of one node is proportionally less critical. A separate replication network is recommended.

For 500 compute nodes, 10,000 instances

Number of OSD nodes: 60, 36-disk chassis

You may consider using a larger chassis. A separate replication network is required.


Note

For a large cluster, three monitors is the upper recommended limit. If scale-out is planned, deploy five monitors initially.
