WestGrid Town Hall
Lindsay Sill, Executive Director
Martin Siegert, SFU (GP2/Cedar) Site Lead
Sergiy Stepanenko, University of Saskatchewan Site Lead
Erin Trifunov, Manager, Projects & Outreach
Friday, January 27, 2017
Introduction & Outline
1. New systems updates
2. Bugaboo storage system issues
3. Cedar / GP2 system
4. Migration updates
5. RAC / growing needs
6. Upcoming training opportunities
Admin
Questions:
● email info@westgrid.ca, OR
● use Vidyo chat (for those on Vidyo)
Please MUTE yourself if you’re connected via Vidyo and not speaking
New System Updates
Lindsay Sill, Executive Director
WestGrid
Technology Deployment Overview
● Major deployment of new resources underway:
○ National Data Cyberinfrastructure
○ New Cloud resources
○ New HPC resources
○ New Services
● Technology Briefing published by Compute Canada in November.
● Cloud Strategy & Services document updated in December.
Stage 1 Award to Implementation
● June 2015: award notification
● February 2016: award finalization
● September 2016: 1st system (Arbutus) operational
● April 2017: Cedar and Graham operational
● End of 2017: Niagara operational
Target - all four major new systems in full production less than 2 years after award finalization. Software services development continues through 2018.
(Note - Niagara schedule purposely delayed by recommendation of CFI expert panel, to benefit from technology improvements)
National Systems Update: Compute
System | Status | In-production estimate
Arbutus (GP1, UVic) | west.cloud.computecanada.ca: 7,640 cores | DONE (Sept. 2016)
Cedar (GP2, SFU) | Equipment is currently being delivered | April 2017
Graham (GP3, Waterloo) | Shipping planned for 1st week of February; renovations almost complete (end of January) | April 2017
Network | Preferred vendor testing | 2017
Parallel FS | In progress; will be ready for PROJECT and SCRATCH | February 2017
Scheduler | Open-source Slurm with commercial support; small test/dev cluster in cloud | February 2017
Niagara (LP1, Toronto) | Early discussion; RFP expected to go out in February | Late 2017
National Systems Update: Storage
System | Status | In-production estimate
Silo Interim | Waterloo: migration complete; SFU: migration underway | Available (see following slides)
NDC-SFU | POs waiting for signatures; vendor is ready to go | April 2017
NDC-Waterloo | 13 PB of SBBs delivered; waiting for datacentre completion | April 2017 (aim: March 2017)
NDC Object Storage | DDN WOS; initial prototype for internal testing installed on cloud | Mid 2017 (aim: April 2017)
Attached scratch | High-performance storage attached to clusters | Purchased with the clusters
NDC = “National Data Cyberinfrastructure”
Cedar / GP2 System Details
Martin Siegert, WestGrid & National Site Lead
Simon Fraser University
Bugaboo Storage Issues
Bugaboo has been having storage system issues since Christmas.
● The old DDN 10K system
● Series of disk, controller and corresponding software problems
Currently down.
● Working with DDN.
● File system corruption: about 6000 files affected in /global/scratch; estimated time to fix: one month
● Options:
○ Continue fixing the problems; Bugaboo unavailable for one month
○ Stop the fix, bring Bugaboo back up, lose 3000+ files
The decision has been made to proceed with the latter option.
Coming Soon - Cedar and Graham
These will be the most powerful CC systems ever, with multiple node types to meet a variety of needs.
- Compute nodes, with local storage
- NVIDIA “Pascal” GPU nodes
- Bigmem nodes
Delivery, installation and configuration will be happening from late January through March.
The DDN storage /scratch for Cedar (GP2) has been installed into racks at the SFU Data Centre.
Cedar Specs
Node type | # Nodes | Cores/socket | # Sockets | Memory | Details
Base compute | 576 | 16 | 2 | 128 GB | E5-2683 V4, 2.1 GHz
Large compute | 128 | 16 | 2 | 256 GB | E5-2683 V4, 2.1 GHz
Bigmem500 | 24 | 16 | 2 | 0.5 TB | E5-2683 V4, 2.1 GHz
Bigmem1500 | 24 | 16 | 2 | 1.5 TB | E5-2683 V4, 2.1 GHz
Bigmem3000 | 4 | 8 | 4 | 3 TB | E7-4809 V4, 2.1 GHz
GPU nodes | 146 | 12 | 2 | 128 or 256 GB | E5-2650 V4, 2.2 GHz; 4 × NVIDIA P100 GPUs
Interconnect: Intel OmniPath (version 1), 100 Gbit/s; non-blocking within “islands”, 2:1 blocking between islands
Storage: (next slide)
Vendor: Scalar Decisions (Dell, DDN, Intel)
Cedar Storage
Type | Est. size (PB) | Node access | Allocated? (RAC) | Quota | Purged | Details
Home | - | Mounted | No | Yes | No | 50 GB/user; code, configuration files
Scratch | - | Mounted | No | Yes | Yes | High performance, Lustre; default: 20 TB, 1M files per user; 100 TB, 10M files per group
Project | 10-20 | Mounted | Yes | Yes | No | Very large, low performance (external), tape backup
Nearline (tape) | - | None | Yes | Yes | Maybe | Tape only
Project tape backup will be auto-replicated between SFU and Waterloo tape systems.
Graham (Waterloo)
● Very similar to Cedar.
● Essentially identical software stack and batch system (Slurm), so users can move between the two very easily (see the sketch below).
○ RAC allocates to a particular system, however.
● Physical mix:
○ slightly different mix of small/large-memory nodes and GPU nodes
● InfiniBand interconnect (50 Gbit/s)
○ Non-blocking within islands, 8:1 blocking between islands
● Details on CC docs:
https://docs.computecanada.ca/wiki/Graham
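Because both clusters will run the same Slurm batch system, the same submission workflow should apply on either machine. The following is a minimal, unofficial sketch only: the job name, resource values and the use of a temporary script file are illustrative placeholders, and it simply assumes Slurm's sbatch command is on your PATH on the login node.

```python
#!/usr/bin/env python3
"""Minimal sketch: submit the same Slurm job on Cedar or Graham.

Resource values below are placeholders, not recommendations.
"""
import subprocess
import tempfile

JOB_SCRIPT = """#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --time=00:10:00        # walltime (HH:MM:SS)
#SBATCH --ntasks=1             # one task
#SBATCH --mem=1G               # memory for the job
echo "Running on $(hostname)"
"""

def submit(script_text: str) -> str:
    """Write the job script to a temporary file and submit it with sbatch."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(script_text)
        path = f.name
    result = subprocess.run(
        ["sbatch", path], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()  # e.g. "Submitted batch job <jobid>"

if __name__ == "__main__":
    print(submit(JOB_SCRIPT))
```

The same script can be carried between the two systems; only the RAC allocation it charges against differs.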
Migration Planning
Erin Trifunov, Manager, Projects & Outreach
WestGrid
Migration Stats
➔ 19 of 45 systems across Canada will be defunded* after March 31, 2017 and will be unavailable for allocations.
➔ 875 projects (40%) have utilized the to-be-defunded systems
➔ Thousands of users will be migrating onto the new systems:
○ Move datasets
○ Get code working (recompile, etc.)
○ Set up jobs
➔ Goal: for all users to have superior support for their migration:
○ Well-documented & functional systems
○ Outstanding support: local, regional, national
Who needs to migrate?
Users of systems scheduled to be “defunded” after March 31, 2017 (next slide)
Anyone else wishing to use new national systems is also welcome.
WestGrid Legacy Systems
Site | System(s) | Defunding date* | Current status
Edmonton | hungabee/jasper | Mar 31, 2017 | Available with conditions
Victoria | hermes/nestor | Mar 31, 2017 | Hermes virtual
Calgary | breezy/lattice/parallel | Mar 31, 2017 | Parallel extended to Mar 31, 2018
Vancouver - UBC | orcinus | Mar 31, 2018 | Available, but with conditions
Winnipeg | grex | Mar 31, 2018 | New storage coming
Vancouver - SFU | bugaboo | Mar 31, 2018 | Storage support issues
*Please note these are provisional dates.
WestGrid migration details: https://www.westgrid.ca/migration_process
For other regional systems see https://www.computecanada.ca/research-portal/accessing-resources/migration/
Next Steps...
Users of legacy systems to be defunded:
1. Wait for WestGrid support to contact you.
2. Prepare files for migration (see the sketch after the link below):
a. Clean up files & directories (DELETE unneeded files!)
b. Archive & compress files/directories
c. Transfer files
d. Verify and synchronize files
https://docs.computecanada.ca/wiki/General_directives_for_migration
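For steps (b) and (d) above, here is a minimal sketch, not an official tool: it archives a directory with tar/gzip and records a SHA-256 checksum so the copy can be verified after transfer. The directory and file names (my_project, my_project.tar.gz) are placeholders; the transfer itself (step c) should use whatever method your site recommends.

```python
#!/usr/bin/env python3
"""Archive a directory and record a SHA-256 checksum for later verification."""
import hashlib
import tarfile
from pathlib import Path

def sha256(path: Path, chunk: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def archive_and_checksum(src_dir: str, archive_name: str) -> None:
    """Create a compressed tarball of src_dir and write its checksum beside it."""
    archive = Path(archive_name)
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src_dir, arcname=Path(src_dir).name)
    digest = sha256(archive)
    Path(str(archive) + ".sha256").write_text(f"{digest}  {archive.name}\n")
    print(f"{archive} -> {digest}")

if __name__ == "__main__":
    # Placeholder paths; replace with your own project directory.
    archive_and_checksum("my_project", "my_project.tar.gz")
```

After the transfer, recompute the checksum on the destination system and compare it with the recorded value before deleting the original copy.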
Migration Schedule
February - April
➔ Mar. 1: RAC 2017 award letters sent
➔ Mar. 15: Virtual Test System available
➔ Apr. 13: RAC 2017 implemented
➔ Migration to new systems: regions contact users required to migrate with timeline and further instructions
Schedule mitigations
Virtual TDS: being configured now on UVic’s Arbutus cloud
➔ Suitable for internal configuration and testing
➔ Not suitable for general user support, migration, etc.
User migration system
➔ We hope to have a system available to users by March 15, 2017
➔ Perhaps earlier, or perhaps with staged availability for support personnel & selected users
➔ This might be the vTDS (expanded to support “toy”-sized parallel jobs)
➔ This might be one or both of Cedar or Graham
WARNING: April may have very limited resources due to delivery delays.
Migration & New System Info
Compute Canada documentation wiki now available:
https://docs.computecanada.ca
In particular the Migration pages:
https://docs.computecanada.ca/wiki/Migration2016
And of course WestGrid:
https://www.westgrid.ca
Silo Migration
Sergiy Stepanenko, WestGrid Site Lead
University of Saskatchewan
Interim Silo Storage I
● Transfer to Waterloo COMPLETE; SFU almost done.
● Silo interim storage “Storage Building Block” (SBB):
○ Waterloo: relatively simple, low-performance NFS system
○ SFU: shared Gluster filesystem
● Users log in much as they did on Silo.
○ But using national LDAP accounts. For details see:
https://docs.computecanada.ca/wiki/Migration2016:User_Accounts_and_Groups
● Backed up to the new tape systems.
○ Backup currently in progress
Silo Migration Stats
Silo to Waterloo completed Jan. 11, 2017:
● 57M files, 850 TB, 140 users.
● Note: a few very large users are still transferring data from their own master copy (usually experimental or observational data backups on Silo).
Start date for Silo to SFU | Jan. 11, 2017
Files transferred to SFU | 21,503,022 of ~350M
Data transferred to SFU (current) | 198.73 TB of ~850 TB
Users migrated to SFU | 160 of ~300
Max transfer rate | 650 MB/s
Current transfer rate | 85 MB/s (some issues with controllers and FC network under investigation)
As of Jan. 24, 2017
Interim Storage Solution II
● This is an interim solution for Silo data.
● A second migration to the final storage system may be required - likely summer 2017.
● USask has agreed to keep Silo going until the migration has completed. Many thanks to USask and the other regions!
Questions for Sergiy?
Growing Needs
Future Needs, RAC, Training & Tuques!
Lindsay Sill, Executive Director
WestGrid
International Comparisons
● Comparisons of giga-FLOPS (GF) per researcher:
○ Canada used to be #6 (2009)
○ We are now #24 (2015)
● Comparator countries for GF/researcher in charts that follow:
○ US - #3 in 2015
○ Germany - #5 in 2015
○ Czech Republic - #10 in 2015
International Rankings - Log Scale (GF = gigaflop/s) [chart]
Continued Growth in User Base
Resource Allocation
Process | Schedule
2016 RPP Progress Report (from PIs) | Due January 5, 2017 (DONE)
CC Scientific and Technical Reviews | December/January (COMPLETE)
CC Face-to-Face Review meeting | February 7-8
Allocation letters to users | Early March
Implementation | April
Growth in Number of Requests
Resource Allocation - 2017
Resource | 2017 requests | 2016 requests | % change
Compute - CPU-years | 256,000 | 238,000 | +7.5%
Compute - GPU-years | 2,660 | 1,357 | +96%
Storage (TB) | 55,000 | 28,660 | +92%

Resource | Fraction of 2017 requests available | Fraction of 2016 requests available
Compute - CPU | 54%* | 54%
Compute - GPU | 38% | 20%
Storage | 90+% | 90+%
* 54% in 2017 includes 50k+ new cores with better performance
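As a quick sanity check of the year-over-year changes quoted above (numbers taken straight from the requests table), a throwaway calculation:

```python
# Recompute the "% change" column of the requests table above.
requests = {
    "CPU-years": (256_000, 238_000),   # (2017, 2016)
    "GPU-years": (2_660, 1_357),
    "Storage (TB)": (55_000, 28_660),
}
for name, (y2017, y2016) in requests.items():
    change = (y2017 - y2016) / y2016 * 100
    # Close to the rounded +7.5%, +96%, +92% figures in the table.
    print(f"{name}: {change:+.1f}%")
```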
RAC Summary
● 2016 saw increased demand, static supply.
● 2017 changes:
○ new systems (Arbutus, Cedar, Graham)
● Replacing older (fragile) systems with larger, more robust systems will help, i.e. codes should run much faster.
● Massive user migration coincides with new system commissioning and 2017 RAC allocation implementation.
● Demand has continued to grow. 2017 will also be tough; the RAC 2017 success rate will be very similar to 2016.
Training Sessions
Full details online at www.westgrid.ca/training
DATE | TOPIC | TARGET AUDIENCE
JAN 31 | Intro to OpenMP: Part 1 | Anyone
FEB 7 | Tools for Managing Research Data: Intro to REDCap | Anyone
FEB 8 | Improving Your Visual Science Communications: Plots & Figures | Anyone
FEB 9 | Building Research Platforms and Portals in the Humanities & Social Sciences | Humanities & Social Sciences
FEB 21 | Intro to OpenMP: Part 2 | Anyone
FEB 28 | Visualization Workshop @ University of Alberta | Anyone
CC Staff Awards of Excellence
Nominations Open January 20, 2017!
Nominate a team or team member and share with your community on campus
Any active full-time or part-time Compute Canada, ACENET, Calcul Québec, Compute Ontario or WestGrid team member is eligible for nomination.
Submissions due April 21, 2017
www.computecanada.ca
#Tuques4Compute
National Social Media Campaign to be kicked off by Dr. Art McDonald in late January.
GOAL: Linking world-class research in Canada with ARC (advanced research computing)
Suggestions and participation welcome.
Support
Contact us anytime: support@westgrid.ca
www.westgrid.ca
docs.computecanada.ca
Questions?
Webstream viewers: email your Town Hall questions to info@westgrid.ca