Multi-Site Perforce at NetApp
#
Scott Stanford
Multi-Site Perforce at NetApp
#
Overview
• Topology
• Infrastructure
• Backups & Disaster Recovery
• Monitoring
• Lessons Learned
• Q&A
#
Topology
#
Traditional Topology
[Diagram: a single P4D in Sunnyvale, with traditional proxies in Boston, Pittsburgh, RTP, and Bangalore]
• 1.2 TB database, mostly db.have
• Average daily journal size of 70 GB
• Average of 4.1 million daily commands
• 3,722 users globally
• 655 GB of depots
• 254,000 clients, most with ~200,000 files
• One Git-Fusion instance
• Perforce version 2014.1
• Environment has to be up 24x7x365
#
Federated Topology
[Diagram: a Commit server in Sunnyvale with Edge servers in Sunnyvale, RTP, and Bangalore; the Boston and Pittsburgh proxies point at the closest Edge (RTP), and the traditional proxies at all four sites remain during the migration]
• Currently migrating from the traditional model to Commit/Edge servers
• Traditional proxies will remain until the migration completes later this year
• Initial Edge database is 85 GB
• Major sites have an Edge server; the others run a proxy off of the closest Edge (a 50 ms improvement)
#
Infrastructure
#
Topology
• All large sites have an Edge server; these were formerly proxies
• High-performance SAN storage is used for the database, journal, and log storage
• Proxies have a P4TARGET of the closest Edge server (RTP)
• All hosts are deployed in an active/standby host pairing
#
Server Connectivity
• Redundant connectivity to storage
– FC: redundant fabric to each controller and HBA
– SAS: each dual HBA connected to each controller
• Filers have multiple redundant data LIFs
• 2 x 10 Gb NICs in an HA bond for the network (NFS and p4d)
• VIF for hosting the public IP / hostname
– Perforce licenses are tied to this IP
#
Server Configuration
Each Commit/Edge server is configured in a pair consisting of:
• A production host, controlled through a virtual NIC
– Allows for a quick failover of the p4d without any DNS changes or changes to the users' environment (see the sketch below)
• A standby host with a warm database or read-only replica
• A dedicated SAN volume for low-latency database storage
• Multiple levels of redundancy (network, storage, power, HBA)
• A common init framework for all Perforce daemon binaries
• A SnapMirrored volume used for hosting the infrastructure binaries & tools (Perl, Ruby, Python, P4, Git-Fusion, common scripts)
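A minimal sketch of what the virtual-NIC failover could look like on a Linux host; the interface name, service address, and init path are illustrative assumptions, not NetApp's actual tooling:

    # Run on the standby host to take over the service address
    # (all names and addresses below are assumptions for illustration).
    ip addr add 10.60.0.50/24 dev eth0 label eth0:p4   # claim the service VIP
    arping -c 3 -U -I eth0 10.60.0.50                  # gratuitous ARP so switches and peers learn the move
    /p4/common/etc/init.d/p4d start                    # bring up p4d through the common init framework

Because the Perforce license is tied to the service IP, p4d on the standby comes up licensed as soon as the VIP moves, with no DNS change visible to users.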
#
SAN Storage
• Storage devices used:
– NetApp EF540 w/ FC for the Commit server
• 24 x 800 GB SSD
– NetApp E5512 w/ FC or SAS for each Edge server
• 24 x 600 GB 15k SAS
– All RAID 10 with multiple spare disks, XFS, dual controllers, and dual power supplies
• Used for:
– Warm database or read-only replica on the standby host
– Production journal
• Hourly journal truncations, then copied to the filer
– Production p4d log
• Nightly log rotations, compressed and copied to the filer
#
Network Storage (NFS)
• NetApp cDOT clusters used at each site, with FAS6290 or better
• 10 Gb data LIF
• Dedicated vserver for Perforce
• Shared NFS volumes between production/standby pairs for longer-term storage, snapshots, and offsite copies
• Used for:
– Depot storage
– Rotated journals & p4d logs
– Checkpoints
– Warm database
• Used for creating checkpoints, and to run the daemon if both hosts are down
– Git-Fusion homedir & cache, with a dedicated volume per instance
#
Backups & Disaster Recovery
#
P4D Backups - Commit
Runs every hour, starting with a journal truncation (p4d -jj); a sketch of the cycle follows:
• Truncate the journal
• Checksum the journal on the SAN, copy it to NFS, and verify the checksums match
• Create a snapshot of the NFS volumes
• Remove any old snapshots
• Replay the journal on the warm SAN database
• Replay the journal on the warm NFS database
• Once a week, create a temporary snapshot of the NFS database and create a checkpoint from it (p4d -jd)
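A condensed sketch of this hourly cycle; the paths, volume names, and journal rotation number are illustrative assumptions, not the production scripts:

    p4d -r /p4/commit/db -jj                                  # truncate/rotate the live journal (produces e.g. journal.123)
    (cd /p4/commit/db && md5sum journal.123 > /tmp/jnl.md5)   # checksum the rotated journal on the SAN
    cp /p4/commit/db/journal.123 /nfs/p4/journals/
    (cd /nfs/p4/journals && md5sum -c /tmp/jnl.md5)           # verify the NFS copy matches
    ssh admin@filer volume snapshot create -vserver p4 \
        -volume p4_nfs -snapshot hourly.$(date +%Y%m%d%H)     # snapshot the NFS volume (cDOT CLI; names assumed)
    p4d -r /san/warm_db -jr /nfs/p4/journals/journal.123      # replay on the warm SAN database
    p4d -r /nfs/p4/warm_db -jr /nfs/p4/journals/journal.123   # replay on the warm NFS database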
#
P4D Backups - Edge
Warm database:
• Triggered when the Edge server's events.csv changes
• If it is a jj event, get the journals that may need to be applied:
– p4 journals -F "jdate>=(<event epoch> - 1)" -T jfile,jnum
• For each journal, run a p4d -jr (see the sketch below)
• Weekly checkpoint from a snapshot
Read-only replica from the Edge:
• Weekly checkpoint, created with:
– p4 -p localhost:<port> admin checkpoint -Z
[Flow: the Commit server truncates its journal → the Edge server captures the event in events.csv → Monit triggers backups on the events.csv change → determine which journals to apply → apply the journals]
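A hedged sketch of the Edge-side replay, assuming Monit invokes a script like this when events.csv changes; the port, paths, events.csv field layout, and the parsing of tagged output are all assumptions:

    epoch=$(tail -1 /p4/edge/logs/events.csv | cut -d, -f1)   # assumed: event epoch in the first field
    p4 -ztag -p localhost:1667 journals -F "jdate>=$((epoch - 1))" -T jfile \
      | awk '$2 == "jfile" {print $3}' \
      | while read -r jfile; do
          p4d -r /san/edge_warm_db -jr "$jfile"               # replay each rotated journal on the warm database
        done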
#
Client Backups
• New process for Edge servers, to avoid WAN NFS mounts
• For all the clients on an Edge server, at each site (sketched below):
– Save the change output for any open changes
– Generate the journal data for the client
– Create a tarball of the open files
– Retain the results for 14 days
• A similar process will be used by users to clone clients across Edge servers
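One plausible shape for the per-client pass, with invented names and paths; the journal-data and tarball steps in particular are simplified stand-ins for the real tooling:

    c=ws_alice_main                                                                  # hypothetical client
    p4 -p localhost:1667 changes -s pending -c "$c" > "/backup/clients/$c.changes"   # open changes
    p4d -r /p4/edge/db -jd - db.working | grep "@$c@" > "/backup/clients/$c.jnl"     # client's journal data (crude grep filter)
    tar czf "/backup/clients/$c.open.tgz" -C "/ws/$c" .                              # tarball of the open files (coarse: whole workspace)
    find /backup/clients -name "$c.*" -mtime +14 -delete                             # enforce the 14-day retention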
#
Snapshot/DR
• Snapshots:
– Main backup method
– Created and kept as follows:
• Every 20 minutes (at :20 and :40 past the hour), kept for 4 hours
• Every hour (top of the hour), kept for 8 hours
• Nightly during backups (at midnight PT), kept for 3 weeks
• SnapVault:
– Used for online backups
– Created every 4 weeks, kept for 12 months
• SnapMirrors:
– Contain all of the data needed to recreate the instance
– Sunnyvale:
• DataProtection (DP) mirror for data recovery
• Stored in the cluster
• Allows fast test instances to be created from production snapshots with FlexClone (see the CLI sketch below)
– DR:
• RTP is the Disaster Recovery site for the Commit server
• Sunnyvale is the Disaster Recovery site for the RTP and Bangalore Edge servers
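For flavor, roughly what the DP mirror and a FlexClone test instance look like at the cDOT CLI; the vserver, volume, and snapshot names are invented:

    snapmirror create -source-path p4_svm:p4_db -destination-path p4_svm:p4_db_dp -type DP   # DataProtection mirror, kept in the cluster
    snapmirror initialize -destination-path p4_svm:p4_db_dp                                  # baseline transfer
    volume clone create -vserver p4_svm -flexclone p4_db_test \
        -parent-volume p4_db -parent-snapshot nightly.0                                      # fast test instance from a production snapshot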
#
Monitoring
#
Tools Used
• Monit & M/Monit
– Monitors and alerts on:
• Filesystem thresholds (space and inodes)
• Specific processes and file changes (timestamp/md5)
• OS thresholds
• Ganglia
– Used for identifying host or performance issues
• NetApp OnCommand
– Storage monitoring
• Internal tools
– Monitor both the infrastructure and the end-user experience
#
Monit
• A daemon that runs on each system and sends data to a single M/Monit instance
• Monitors core daemons (Perforce and system): ssh, sendmail, ntpd, crond, ypbind, p4p, p4d, p4web, p4broker (an example check is sketched below)
• Able to restart daemons or take actions when conditions are met (e.g. clean a proxy cache, or purge it entirely)
• Configured to alert on process-children thresholds
• Dynamic monitoring through ties to the init framework
• Additional checks added for issues that have affected production in the past:
– NIC errors
– Number of file handles
– Known patterns in the system log
– p4d crashes
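An illustrative Monit process check of the kind described; the drop-in path, pidfile, port, and thresholds are assumptions:

    cat > /etc/monit.d/p4d <<'EOF'
    check process p4d with pidfile /p4/edge/logs/p4d.pid
      start program = "/etc/init.d/p4d start"
      stop  program = "/etc/init.d/p4d stop"
      if failed host localhost port 1667 then restart   # p4d stopped answering
      if children > 500 then alert                      # process-children threshold
    EOF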
#
M/Monit
• Multiple Monit instances (one per host) communicate their status to a single M/Monit instance
• All alerts and rules are controlled through M/Monit
• Provides the ability to remotely start/stop/restart daemons
• Has a dashboard of all of the Monit instances
• Keeps historical data on issues, both when they were found and when they were recovered from
#
Internal Tools
• Collect historical data (depot, database, and cache sizes; license trends; number of clients and opened files per p4d)
• Benchmarks collected every hour using the top user commands (a toy version follows)
– Alerts if a site is 15% slower than its historical average
– Runs against both the Perforce binary and the internal wrappers
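A toy version of the hourly benchmark idea; the command, paths, and alerting address are placeholders, with p4 sync -n standing in for a "top user command":

    t0=$(date +%s%N)
    p4 -p edge-rtp:1667 sync -n //depot/tools/... > /dev/null   # dry-run sync as the benchmarked command
    ms=$(( ($(date +%s%N) - t0) / 1000000 ))                    # elapsed milliseconds
    avg=$(cat /var/perf/rtp.sync.avg)                           # historical average, maintained elsewhere
    if [ "$ms" -gt $(( avg * 115 / 100 )) ]; then               # alert at 15% over the average
      echo "RTP sync ${ms}ms vs avg ${avg}ms" | mail -s "p4 perf alert" scm-admins@example.com
    fi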
#
Wrap up
#
Federated Benefits
• Faster performance for end users
– Most noticeable for sites with higher-latency WAN connections
• Higher uptime for services, since an Edge can service some commands when the WAN or the Commit site is inaccessible
• Much smaller databases: from 1.2 TB to 82 GB on a new Edge server
• Automatic "backup" of the Commit server data through the Edge servers
• Easily move users to new instances
• Can partially isolate some groups from affecting all users
#
Lessons Learned
• It is helpful to disable csv log rotations when running frequent journal truncations
– Set the dm.rotatelogwithjnl configurable to 0
• Shared log volumes with multiple databases (warm, or with a daemon) can cause interesting results with csv logs
• Set global configurables where you can: monitor, rpl.*, track, etc. (see the sketch after this list)
• Use multiple pull -u threads to ensure the replicas have warm copies of the depot files
• You need rock-solid backups on all p4d's holding client data
– Warm databases are harder to maintain with frequent journal truncations; there is no way to trigger on these events
• Shelves are not automatically promoted
• Users need to log in to each Edge server, or have their ticket file updated from existing entries
• Adjusting the Perforce topology may have unforeseen side effects; pointing proxies at new P4TARGETs can increase load on the WAN, depending on the topology
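Hedged examples of the configurables called out above; the serverid and the choice of rpl.* knob are illustrative:

    p4 configure set dm.rotatelogwithjnl=0               # stop csv logs rotating with each hourly p4d -jj
    p4 configure set monitor=1                           # global: enable command monitoring
    p4 configure set track=1                             # global: performance tracking in the server log
    p4 configure set rpl.compress=1                      # one of the rpl.* family: compress replication traffic
    p4 configure set edge-rtp#startup.2="pull -u -i 1"   # an extra pull -u thread on one Edge server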
#
Scott Stanford
SCM Lead, NetApp
Scott Stanford is the SCM Lead for NetApp, where he also functions as a worldwide Perforce administrator and tool developer. Scott has twenty years of experience in software development, with thirteen years specializing in configuration management. Prior to joining NetApp, Scott was a Senior IT Architect at Synopsys.
#
Resources
SnapShot: http://www.netapp.com/us/technology/storage-efficiency/se-technologies.aspx
SnapVault & SnapMirror: http://www.netapp.com/us/products/protection-software/index.aspx
Backup & Recovery of Perforce on NetApp: http://www.netapp.com/us/system/pdf-reader.aspx?pdfuri=tcm:10-107938-16&m=tr-4142.pdf
Monit: http://mmonit.com/