The Pensions Trust - VM Backup Experiences

Post on 12-Dec-2014

VMware Backup Experiences

Darren Bull, Business Support Manager,

The Pensions Trust

• I’m no expert – jump in with comments/corrections

• Everybody is different –

– Each solution will depend on budget, recovery objectives and available infrastructure.

• These are only our experiences –

– Things that didn’t work for us may work for you.

– We’ll focus on the things we have worked with.

Before we start

• 160 staff.

• 3 sites:

– Leeds/Edinburgh/London

– Originally DR site was Edinburgh office.

– Since downsize of Edinburgh office, now use rented rack space to house DR kit.

• 10 Mbps WAN link to DR site.

• 3 IT Infrastructure staff.

About TPT

• Legacy application - BackupExec.

• LTO2 tape archive (with associated issues):

– Tapes can go bad.

– Stored off site with Iron Mountain.

– Tapes have gone missing.

– Management of tape rotation.

• Manual rebuild of servers during DR test:

– Dissimilar hardware.

– 48 hours to complete, not able to do full recovery.

Backups prior to virtualisation

• Server consolidation began late 2006.

• Complete summer 2007.

• 40 virtualised servers. Approx 25 critical for DR.

• Simplified disaster recovery & backups one of the main drivers for the project.

The move to VMware

• Information archiving

– Keep at least the last 12 months.

• Disaster recovery

– Recover systems within 24 hours.

VMware backups – considerations

• Backups that work! No constant checking.

• A backup archive.

– On site and offsite.

• Minimal administration, vSphere integration. Set it and forget it.

• Quick backups, within the available window.

• Quick restores.

– File level

– Image level.

– Application level (e.g. Exchange mailboxes)

• No tapes.

• Efficient use of storage (de-duplication)

• Secure backup data.

Backups - what did we want?

• Fast offsite recovery.

• Consistent data.

– SQL/Exchange/Active Directory.

• This means desktops too:

– Deployed VMware View in 2009.

DR - what do we want?

• The business must decide its recovery objective and provide the funds to achieve it.

– TPT Objective: 24 hours lost data acceptable.

• Once the recovery objective is determined, many options may be ruled out.

– TPT didn’t need synchronous real time replication, could use cheaper options to be up in 24 hours.

• Even with small budgets, many things are possible:

– Redeploy old ESX servers/storage offsite.

– Shop around for bandwidth.

– With the latest backup applications, you don’t need expensive storage to make some things happen.

Limiting factors

• Installed EMC Clariion CX3-20 as part of consolidation project.

– 2nd unit installed in old Edinburgh office.

• Used Mirrorview/A for bidirectional site to site replication of VMFS data stores.

• Continued to use BackupExec and tape for archiving.

• Take snapshot of replicated LUN, make writeable, mount in ESX, power on VM for server recovery.

TPT approach (1)

• Asynchronous mode

– TPT ran 1 job per LUN per day.

– Replication of entire LUN.

– New VMs on replicated LUNs added a huge replication burden.

– No de-duplication.

– Available bandwidth an issue.

• Mirrors wouldn’t just go slow, but fail completely.

• Could only run so many sessions at once.

• Mirrors fell further and further behind as failed jobs had to start from scratch.

• Jobs needed constant monitoring.

• EMC no longer sell it.

Mirrorview/A - experiences
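
The bandwidth problem above can be made concrete with some rough arithmetic. The sketch below is only illustrative: it assumes the WAN link carries 10 megabits per second, uses decimal gigabytes, and the 500 GB LUN size is a hypothetical figure, not one taken from TPT's environment.

```python
def max_transfer_gb(link_mbps, window_hours, efficiency=1.0):
    """Upper bound on data (decimal GB) a link can move in the given window."""
    bits = link_mbps * 1_000_000 * window_hours * 3600 * efficiency
    return bits / 8 / 1_000_000_000


def hours_to_transfer(data_gb, link_mbps, efficiency=1.0):
    """Hours needed to push data_gb (decimal GB) across the link."""
    bits = data_gb * 1_000_000_000 * 8
    return bits / (link_mbps * 1_000_000 * efficiency) / 3600


if __name__ == "__main__":
    # Best case for an assumed 10 Mbps link over a full 24 hour cycle.
    print(max_transfer_gb(10, 24))  # 108.0
    # Hypothetical 500 GB LUN replicated in full over the same link.
    print(round(hours_to_transfer(500, 10), 1))  # 111.1
```

Under these assumptions a whole-LUN copy that has to restart from scratch can never fit into a daily cycle, which is consistent with mirrors falling further and further behind.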

• EMC/NetApp/HP (and others) now offer products that work much better with VMware:

• Deduplicated primary storage

• Changed block tracking - efficient replication over slow links.

• Obtaining this functionality is expensive:

• We found it difficult to obtain budget – management saw ‘nothing wrong’ with existing SAN.

Mirrored SAN - alternatives

• We needed to fix the replication problem.

• Installed 2 x DataDomain DD510.

– CIFS/NFS/VTL backup target.

– Can mount as an ESX datastore.

– Site to site bit level replication.

– De-duplicated storage.

– Massive savings on VMDK archive storage – 40x de-duplication achieved.

– Acts as backup archive storage and offsite replication engine for disaster recovery.

– All backups replicated offsite within 24 hours.

– Throw away tapes.

– Secure offsite backups, no physical media in transit.

TPT approach (2)
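
To show why repeated image level backups de-duplicate so well, here is a minimal sketch of block level de-duplication. It is a simplification under stated assumptions: fixed 4 KB blocks and SHA-256 fingerprints are illustrative choices, the function name is hypothetical, and this is not a description of how the DataDomain appliance works internally.

```python
import hashlib


def dedupe_ratio(data, block_size=4096):
    """Return logical size / stored size when identical blocks are kept once."""
    unique = {}  # fingerprint -> size of the single stored copy
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        unique.setdefault(hashlib.sha256(block).digest(), len(block))
    stored = sum(unique.values())
    return len(data) / stored if stored else 1.0


if __name__ == "__main__":
    # Forty identical copies of one block stand in for forty unchanged backups.
    archive = (b"A" * 4096) * 40
    print(dedupe_ratio(archive))  # 40.0
```

Data that never repeats gives a ratio of 1.0, so the saving depends entirely on how much of each new backup is already held in the store.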

• Tips before starting:

– Cannot snapshot independent persistent disks.

– Give a VM’s disks different names, even if on different LUNs.

– Throughput issue doing network backups using vSphere.

• Service console LAN throughput limitation.

• Patch has been released (but I’ve not tried it).

• Affected any image level backup application using LAN mode.

– ESX3.x Snapshot timeout issue:

• 15 minute timeout: VC reports a timeout to the VCB proxy, even if the ESX host continues and commits the snapshot.

– Changed block tracking must be enabled in a VM (requires VM hardware version 7).

Change the backup software

• Image level backup of VMs to DataDomain.

• DataDomain takes care of replication.

• File level restore.

• Restore server-by-server @ DR site.

• TPT started with version 3.x. First installed late 2008.

• vRanger now at version 4.

• Use vReplicator for replication of VMs.

• Vizioncore now owned by Quest Software.

Vizioncore vRanger

• Struggled to work within backup window. 24 hour job cycle.

• Had issues with snapshot timeouts (ESX 3.x).

– Had to use LAN based backups direct to ESX to work around this.

• Had issues with vRanger 3 backup naming inconsistencies:

– ‘Could not find the compressed disk to mount’ doing a FLR or DR site recovery.

– Much messing around with VMX/VMDK/INFO files to repair this and get restores working.

– Never really seemed to be fixed.

– VSS integration never worked well.

• Upgraded to vRanger Pro 4 – had the slow network backup issue and no VCB mode! Downgraded.

vRanger experiences

• Uses vStorage API.

– Backups to ‘normal’ storage (e.g. NAS) incredibly quick after 1st full (1 TB file server backed up in 10 minutes).

– No backups during office hours.

– Deduplicated backup files.

• Not the same performance with DataDomain:

– Inline dedupe performed by DataDomain slows things down a bit.

– Disable compression and deduplication options in backup job.

– Changed block tracking means things still work well.

• It ‘just works’.

• No more babysitting the backups.

Veeam Backup & Replication

• Uses changed block tracking to replicate changes to offsite replica VM.

• We synchronise nightly.

– One full backup of each VM.

– One replica pass for each VM.

• Can keep previous versions of replica offsite for archiving purposes - negates need for backup?

• Full backups of ‘large change’ servers still done to DataDomain using Veeam, then DD replicates to its offsite partner.

• DataDomain also used for backup archiving.

• One click DR testing of replica servers.

– Failover/failback using Veeam console.

Veeam Replicas
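
The idea behind changed block tracking can be sketched with a toy model: the disk records which blocks were written since the last sync, and only those blocks are sent to the replica. Everything here is hypothetical and illustrative (the class, the function and the 512 byte block size); it is not the vStorage API or Veeam's implementation.

```python
class TrackedDisk:
    """Toy virtual disk that remembers which blocks changed since the last sync."""

    def __init__(self, num_blocks, block_size=512):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]
        self.changed = set()  # indices of blocks written since the last sync

    def write(self, index, data):
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block")
        self.blocks[index] = data
        self.changed.add(index)


def sync_replica(source, replica):
    """Copy only the changed blocks to the replica; return how many were sent."""
    sent = 0
    for index in sorted(source.changed):
        replica[index] = source.blocks[index]
        sent += 1
    source.changed.clear()  # the next sync starts from a clean slate
    return sent


if __name__ == "__main__":
    disk = TrackedDisk(num_blocks=1000)
    replica = list(disk.blocks)           # initial full copy
    disk.write(3, b"\x01" * 512)
    disk.write(42, b"\x02" * 512)
    print(sync_replica(disk, replica))    # 2 blocks sent, not 1000
```

The amount sent over the link scales with the daily change rate rather than with the size of the disk, which is what makes a nightly pass over a slow WAN link workable.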

TPT approach (3)

• Veeam replicas – 20 servers up in approx 20 mins using failover function.

• Veeam backups

– 5 servers recovered from image level backups in approx 5 hours. Transactionally consistent.

• Time taken for full network recovery – approx 6 hours.

– If we had the bandwidth, would use 100% replicas.

2010 – DR test

• Veeam SureBackup – TPT wins:

– Automatic verification testing.

– Item level recovery?

– User self service for deleted files?

– We can power on direct from DataDomain at both primary and recovery sites.

– No more 5 hour wait for non-replica servers to be recovered. Instant recovery, then storage vMotion.

– DR restore may be minutes rather than hours…

Veeam SureBackup

• Backups that work! No constant checking. ACHIEVED.

• A backup archive.

– On site and/or offsite. ACHIEVED.

• Minimal administration, vSphere integration. Set it and forget it. ACHIEVED.

• Quick backups, within the available window. ACHIEVED.

• Quick restores.

– File level. ACHIEVED.

– Image level. ACHIEVED.

– Application level (e.g. Exchange mailboxes). NOT YET!

• No tapes. ACHIEVED.

• Efficient use of storage (de-duplication). ACHIEVED.

• Secure backup data. ACHIEVED.

Backups - what did we want?

• Fast offsite recovery. ACHIEVED VS. OBJECTIVE

• Consistent data. ACHIEVED.

DR - what do we want?

• Get rid of tape.

• Recovery objective (and therefore, budget) will drive what is possible with DR.

• If doing SAN-SAN mirroring, get the replication sizing right.

• Newer storage systems offer increased integration with VMware. If you have the budget, make use of these.

• Veeam is an excellent, cost effective alternative to costly SAN-level technology.

In conclusion…

Thank You

Darren Bull

Business Support Manager

Verity House, Canal Wharf, Leeds LS11 5BQ

Tel. 0113 234 5500 Direct. 0113 394 2533

Fax. 0113 234 5599

E-mail: darren.bull@thepensionstrust.org.uk

www.thepensionstrust.org.uk
