Page 1: Operationalizing the Value of MongoDB: The MetLife Experience

Operationalizing value of MongoDB

(MetLife experience)

Thrills and challenges of building MongoDB operations in a large enterprise

A Journey

When new technology meets enterprise standards:

- advantages and restrictions of large enterprises
- it is always a journey
- decisions we have to live with

Highly Successful Adoption of New Technology for a Fortune 50 Enterprise Organization

• Unknown technology – Proves to be capable

• New platform – Quickly matures

• Untested for the Enterprise – Delivers success

• Many new things to learn – Become experts in time

Disclaimer

• The content in this presentation represents MetLife's choices and MongoDB Inc.’s recommendations for MetLife’s specific use case. By no means is this a “universal blueprint for success” and it doesn’t necessarily represent MongoDB Inc.'s recommendations for all use cases.

• In particular, because there were some fixed decisions that predated the MongoDB implementation, MetLife's deployment may require some “manual intervention” (specifically in case of DR), whereas other, differently organized deployments might not.

Introducing “The Wall”

Basic System Architecture Decisions

• Company data center vs. public cloud placement
  Control vs. ease of use
  MetLife: Compliance requirements dictate company data center(s) placement.

• Server type and sizes
  Enterprise-class servers vs. “pizza boxes”
  MetLife: More cost-effective to run on enterprise-class servers (2 x 8-core CPU, 512 GB RAM).

• Virtualization
  VM vs. “bare metal”
  MetLife: Data nodes on physical servers; configuration servers and MongoS on VMs.

• SAN vs. local storage
  Flexibility of SAN vs. performance of local storage
  MetLife: Local storage enclosures with 600 GB SAS drives.

• Network
  Dedicated LAN for MongoDB replication
  MetLife: No dedicated LANs for the MongoDB installation.

Business Requirements and System Topology

Business requirements:

- mission-critical application
- loss of an entire data center for an indefinite time should not limit the application functionality in any way
- significant data growth is expected, as well as a significant increase in the number of users

These requirements drive the system topology:

a. Geographic placement
   MetLife: Geographically dispersed cluster spanning two data centers.

b. Sharded cluster vs. replica set
   MetLife: Sharded cluster for elastic horizontal scalability.

c. Number of nodes in the replica set
   MetLife: Minimum of 6 to ensure full operability in case of one data center loss.

d. Write and read geography
   MetLife: Write concern is chosen per business function; reads are mostly “secondary preferred”.
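As an illustration of item d, the sketch below builds a MongoDB connection URI whose write concern depends on the business function, with reads set to secondary preferred. The business function names, hostnames, and the mapping itself are hypothetical; the slides do not say which write concern MetLife uses for which function. Only the URI option names (`readPreference`, `w`, `journal`) are standard connection string options.

```python
from urllib.parse import urlencode

# Hypothetical mapping of business functions to write-concern options.
WRITE_CONCERNS = {
    "policy_update": {"w": "majority", "journal": "true"},  # durability first
    "activity_log": {"w": "1"},                             # latency first
}


def connection_uri(hosts, database, business_function):
    """Build a connection URI with per-function write concern."""
    # Reads go to a secondary when one is available, else to the primary.
    options = {"readPreference": "secondaryPreferred"}
    options.update(WRITE_CONCERNS[business_function])
    return "mongodb://{}/{}?{}".format(",".join(hosts), database, urlencode(options))


# Example with placeholder MongoS hostnames.
uri = connection_uri(
    ["mongos1.example.net:27017", "mongos2.example.net:27017"],
    "appdb",
    "policy_update",
)
```

Keeping the mapping in one table makes the durability trade-off of each business function visible in a single place, rather than scattered across application code.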

System topology

[Diagram: sharded cluster spanning Data Center 1 and Data Center 2. Labels: Configuration Server 1, 2, and 3; Primary Prod, Local Prod Replica 1, and Local Prod Hidden Replica for backups; Remote Prod Replica 1, Remote Prod Replica 2, and Remote Prod Hidden Replica for backups; four MongoS servers; two Backup Solution boxes. Each of the six replica set members is annotated “2 shards comprise this”.]

System Setup for Availability and DR

The system has to comply with MetLife’s enterprise standard for availability and DR (no single points of failure):

a. Replica sets
   MetLife: 6-member replica sets (3 members in each data center), including 2 hidden replicas for backup purposes. There are 5 voting members (the hidden replica in the DR data center has 0 votes), and the 2 replicas in the primary data center have higher priority.

b. MongoDB configuration servers
   MetLife: 3 configuration servers (2 in the primary data center and 1 in the DR data center). Loss of an entire data center halts cluster balancing, but not the application functionality.

c. MongoS
   MetLife: 4 MongoS servers (2 in each data center), all active.

d. Application server connectivity
   MetLife: MongoDB drivers on the application servers are configured to use all MongoS servers, but in a different order, for pseudo load balancing.

e. DR exercise
   MetLife: A DR exercise is conducted yearly and includes all database and application infrastructure to ensure complete operability from the DR data center.
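The replica set layout in item a and the driver ordering in item d can be sketched as follows. This is one reading of the slide, not MetLife's actual configuration: the hostnames, the replica set name, and the priority values (2 vs. 1) are assumptions. The member fields (`_id`, `host`, `priority`, `hidden`, `votes`) are the standard replica set configuration fields, and hidden members must have priority 0.

```python
def replica_set_config(name="shard0"):
    """Six members, five votes, as described in item a (hostnames hypothetical)."""
    def member(i, host, priority=1, hidden=False, votes=1):
        return {"_id": i, "host": host, "priority": priority,
                "hidden": hidden, "votes": votes}

    members = [
        # Primary data center: two higher-priority replicas plus a hidden one.
        member(0, "dc1-data1.example.net:27017", priority=2),
        member(1, "dc1-data2.example.net:27017", priority=2),
        member(2, "dc1-hidden.example.net:27017", priority=0, hidden=True),
        # DR data center: two replicas plus a hidden, non-voting one.
        member(3, "dc2-data1.example.net:27017"),
        member(4, "dc2-data2.example.net:27017"),
        member(5, "dc2-hidden.example.net:27017", priority=0, hidden=True, votes=0),
    ]
    return {"_id": name, "members": members}


def can_elect_primary(config, lost_host_prefix):
    """True if the surviving members still hold a strict majority of votes."""
    total = sum(m["votes"] for m in config["members"])
    surviving = sum(m["votes"] for m in config["members"]
                    if not m["host"].startswith(lost_host_prefix))
    return surviving * 2 > total


def rotated_seed_list(mongos_hosts, app_server_index):
    """Same MongoS hosts for every application server, in a rotated order."""
    k = app_server_index % len(mongos_hosts)
    return mongos_hosts[k:] + mongos_hosts[:k]


config = replica_set_config()
survives_dr_loss = can_elect_primary(config, "dc1-") if False else can_elect_primary(config, "dc2-")
survives_primary_loss = can_elect_primary(config, "dc1-")
```

Under these assumptions, losing the DR data center leaves 3 of 5 votes and a primary can still be elected, while losing the primary data center leaves only 2 of 5. That would be consistent with the disclaimer's note about manual intervention in a DR scenario, though the slides do not spell out the mechanism.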

System Set up for Recoverability

The system has to comply with MetLife’s enterprise standard for recoverability:

Backup and recovery strategy. MetLife:

- Daily backups in both data centers (alternating).
- Backups of hidden replicas are performed with mongod brought down. The balancer is stopped.
- Due to the database size, the backup is performed at the file system level.
- At the same time, a backup of the configuration server is performed using mongodump.

Current challenges. MetLife:

- No point-in-time recovery
- No easy way to restore one specific database

Using MMS Backup solution. MetLife:

- MMS Backup is capable of solving some of our current challenges.
- Due to compliance reasons, we cannot use the MMS cloud backup solution in AWS.
- Currently looking into the option of running the MMS Backup solution on premises.
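The daily procedure listed under the backup and recovery strategy above might be sequenced roughly as below. This is a sketch that only builds the command lines and does not run them. The hostnames, paths, archive naming, the use of `tar` for the file system copy, and the day-parity rule for alternating data centers are all assumptions; the slides say the configuration server dump runs at the same time as the file copy, whereas the list here is simply ordered.

```python
import datetime


def backup_site(day):
    """Alternate data centers daily (parity rule is an assumption)."""
    return "dc1" if day.toordinal() % 2 == 0 else "dc2"


def backup_plan(day, mongos, config_server, dbpath, out_dir):
    """Return the backup steps as argv lists, in order."""
    site = backup_site(day)
    stamp = day.isoformat()
    return [
        # 1. Stop the balancer via a MongoS (legacy mongo shell).
        ["mongo", "--host", mongos, "--eval", "sh.stopBalancer()"],
        # 2. Bring down mongod; run on the hidden replica's own host.
        ["mongod", "--shutdown", "--dbpath", dbpath],
        # 3. File-system-level copy of the data files (tar is an assumption).
        ["tar", "-czf", "{}/{}-data-{}.tar.gz".format(out_dir, site, stamp), dbpath],
        # 4. Dump the configuration server with mongodump.
        ["mongodump", "--host", config_server,
         "--out", "{}/{}-config-{}".format(out_dir, site, stamp)],
    ]


plan = backup_plan(
    datetime.date(2014, 6, 2),
    "mongos1.example.net:27017",
    "cfg1.example.net:27019",
    "/data/db",
    "/backup",
)
```

Restarting mongod and re-enabling the balancer afterwards are not described on the slide and are left out of the sketch.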

Security

The system has to comply with MetLife’s enterprise standard for data security:

Authentication and authorization. MetLife:

- The original build on MongoDB 2.2 had very limited options for database authentication, and only read or read/write permissions at the database level.
- Biggest concerns: authentication (no password policy enforcement) and authorization (excessive application permissions).
- MetLife’s MongoDB 2.6 goals: authentication through Active Directory; authorization through custom-built roles with the least set of permissions required by the application.

LDAP integration. MetLife:

- Integration with Active Directory (AD) using Linux PAM
- Third-party product for secure server/AD communications
- Currently mixed mode (both AD and in-database) authentication

Data-at-rest encryption. MetLife: Data-at-rest encryption is implemented using a third-party product (Linux file system / device encryption).

Audit. MetLife:

- Tactical: MongoDB 2.6 audit capability can do the job.
- Strategic: Database activity audit is performed by a third-party product.
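To make the "custom-built roles with the least set of permissions" goal concrete, the sketch below assembles a `createRole` command document of the kind MongoDB 2.6 introduced. The role name, database, collections, and action lists are hypothetical examples, not MetLife's actual roles; the document shape (`createRole`, `privileges`, `roles`) follows the standard command.

```python
def create_role_command(role, database, collection_actions):
    """Build a createRole command granting only the listed actions."""
    privileges = [
        {
            "resource": {"db": database, "collection": collection},
            "actions": sorted(actions),
        }
        for collection, actions in collection_actions.items()
    ]
    # "roles" is empty: nothing is inherited, so the role grants only
    # the privileges spelled out above.
    return {"createRole": role, "privileges": privileges, "roles": []}


# Hypothetical application role: read/write on one collection, read-only on another.
command = create_role_command(
    "appReadWrite",
    "appdb",
    {
        "customers": ["find", "insert", "update"],
        "reference_data": ["find"],
    },
)
```

Listing actions per collection, rather than granting a built-in database-wide role, is what addresses the "excessive application permissions" concern noted above.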

Monitoring and Alerting

The system has to comply with MetLife’s enterprise standard for monitoring and alerting:

Hardware monitoring
MetLife: No munin-node monitoring. Using the standard enterprise Linux server monitoring toolset owned by MetLife.

MongoDB monitoring with MMS
MetLife: Currently using MMS in the cloud for monitoring and alerting. Alerts are sent via SMS and e-mail to responsible individuals in operations, as well as to monitored group mailboxes.

BMC MongoDB Patrol KM as an alternative monitoring solution
MetLife: Third-party Knowledge Modules (KMs) are standard monitoring/alerting tools for MetLife’s enterprise databases. Currently engaged in MongoDB KM beta testing.

Integrating monitoring/alerting with the enterprise incident management system
MetLife: Currently no integration. Two approaches are being pursued in parallel:

- In-house process that parses the JSON attachment from an MMS alert e-mail and creates an incident ticket
- Third-party KM that is natively integrated with the enterprise incident management system
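The first approach, parsing the JSON attachment of an alert e-mail into an incident ticket, could look roughly like this using only the Python standard library. The alert field names (`host`, `event`, `status`), the ticket fields, and the severity rule are hypothetical; the slides do not describe the actual MMS alert schema or MetLife's ticket format.

```python
import json
from email import message_from_bytes, policy
from email.message import EmailMessage


def extract_alert(raw_email):
    """Return the first JSON attachment of an alert e-mail as a dict."""
    msg = message_from_bytes(raw_email, policy=policy.default)
    for part in msg.iter_attachments():
        if part.get_content_type() == "application/json":
            return json.loads(part.get_payload(decode=True))
    raise ValueError("no JSON attachment found")


def to_ticket(alert):
    """Map an alert dict to a ticket dict (all field names hypothetical)."""
    return {
        "summary": "MongoDB alert on {}: {}".format(
            alert.get("host", "unknown host"), alert.get("event", "unknown event")),
        "severity": "high" if alert.get("status") == "OPEN" else "info",
        "source": "MMS",
    }


def sample_email():
    """Build a stand-in alert e-mail for demonstration purposes."""
    msg = EmailMessage()
    msg["Subject"] = "MMS alert"
    msg.set_content("See attached alert details.")
    payload = json.dumps(
        {"host": "dc1-data1.example.net", "event": "HOST_DOWN", "status": "OPEN"}
    ).encode("utf-8")
    msg.add_attachment(payload, maintype="application", subtype="json",
                       filename="alert.json")
    return msg.as_bytes()


ticket = to_ticket(extract_alert(sample_email()))
```

Submitting the ticket to the incident management system is left out, since that interface is specific to the enterprise tool in use.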

Workload Management and Automation

The system has to reliably support business SLAs and be efficient to manage:

Workload management and resource sharing.
MetLife: Workload management and resource sharing is one of the bigger challenges. MongoDB 2.6 does not have an in-database mechanism for managing different workloads, which makes resource sharing problematic.

- Potential option: cgroups in RHEL 6

MMS automation (installation, upgrades).
MetLife: Engaged with MongoDB for MMS automation beta testing.

Next Steps in our Journey

• Automation (installation, upgrades, maintenance).

• MMS backup solution (on premises).

• Monitoring/alerting integration with an incident management system.

• Workload management / resource sharing solution.

• Introduction of arbiter to existing replica sets (3rd data center).

• Performance benchmarking toolset.