Big Data Imperial June 2013

Dr Paul Calleja

Director HPCS

The SKA – the world's largest big-data project


HPCS activities & focus


Square Kilometre Array - SKA

• Next-generation radio telescope
• 100× more sensitive
• 1,000,000× faster
• 5 square km of dish over 3000 km
• The next big science project
• Currently the world's most ambitious IT project
• Cambridge leads the computational design:
  • HPC compute design
  • HPC storage design
  • HPC operations


SKA location

• Needs a radio-quiet site
• Very low population density
• Large amount of space
• Two sites:
  • Western Australia
  • Karoo Desert, RSA

A continental-sized radio telescope


What is radio astronomy

[Diagram: the aperture-synthesis signal chain. An astronomical signal (an EM wave) arriving at an array of antennas is detected & amplified, digitised & delayed, correlated across antenna pairs (baselines), processed (calibrate, grid, FFT) and integrated to produce an image of the sky.]
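To make the "calibrate, grid, FFT" stage concrete, below is a minimal NumPy sketch of just the gridding and Fourier-transform step, using a handful of made-up visibilities on a toy grid. It is illustrative only, not SKA pipeline code; real imagers use convolutional gridding, w-projection and calibration, which are omitted here.

```python
import numpy as np

# Toy version of the grid + FFT stage of an interferometric imaging
# pipeline. Grid size, baselines and visibility values are made up.
N = 256                                   # pixels per image axis (hypothetical)
grid = np.zeros((N, N), dtype=complex)    # the UV grid

# A few fake visibilities: (u, v) in grid cells, plus a complex sample.
visibilities = [
    (10.3, -22.7, 1.0 + 0.5j),
    (-45.1, 8.2, 0.7 - 0.2j),
    (30.0, 30.0, 0.9 + 0.1j),
]

for u, v, vis in visibilities:
    # Nearest-neighbour gridding; real pipelines use convolutional kernels.
    iu, iv = int(round(u)) % N, int(round(v)) % N
    grid[iv, iu] += vis
    grid[(-iv) % N, (-iu) % N] += np.conj(vis)   # Hermitian counterpart

# Inverse FFT of the gridded visibilities gives the (dirty) sky image.
dirty_image = np.fft.fftshift(np.fft.ifft2(np.fft.ifftshift(grid))).real
print("peak of dirty image:", dirty_image.max())
```

The compute cost of the real pipeline is dominated by the gridding kernels and calibration; the sketch only shows the basic data movement from visibilities to an image.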


Why SKA – Key scientific drivers

Are we alone?

Cosmic Magnetism

Evolution of galaxies

Pulsar survey / gravity waves

Exploring the dark ages


SKA timeline

• 1995-2000: Preliminary ideas and R&D
• 2000-2007: Initial concepts stage
• 2008-2012: System design and refinement of specification
• 2012: Site selection
• 2012-2015: Pre-construction (PEP, €90M): 1 yr detailed design, 3 yr production readiness
• 2016-2019: 10% SKA construction, SKA1 (€300M)
• 2019: Operations SKA1
• 2019-2023: Construction of full SKA, SKA2 (€1.5B)
• 2024: Operations SKA2


SKA project structure

[Organisation chart: SKA Board; Director General; Project Office (OSKAO); Advisory Committees (Science, Engineering, Finance, Funding, …); Work Package Consortia 1 … n, which are locally funded.]


Work package breakdown

Participating organisations: UK (lead), AU (CSIRO, …), NL (ASTRON, …), South Africa SKA, industry (Intel, IBM, …)

1. System

2. Science

3. Maintenance and support / Operations plan

4. Site preparation

5. Dishes

6. Aperture arrays

7. Signal transport

8. Data networks

9. Signal processing

10. Science Data Processor

11. Monitor and Control

12. Power

SPO (SKA Project Office)


SKA data flow

[Diagram: SKA data flow. The collectors are sparse and dense aperture arrays (70-450 MHz and 0.4-1.4 GHz, wide field of view; up to 250 AA stations) and up to 1200 15 m dishes (1.2-10 GHz, wideband single-pixel feeds). Their signals pass through tile & station processing and DSP, then over optical data links into the Central Processing Facility (CPF), carrying data, time and control streams. Inside the CPF, correlation (AA slices plus dish & AA+dish correlation), banks of processor/buffer nodes, a data switch, UV processors, imaging processors, science processors and a data archive progressively reduce the stream, with control processors, a time standard and a user interface via the Internet. Indicative rates on the diagram: roughly 16 Tb/s out of an aperture-array station and 10 Gb/s per dish, Pb/s-scale aggregates into the correlator (16 Tb/s, 4 Pb/s, 24 Tb/s, 1000 Tb/s at various internal stages), falling to ~20 Gb/s towards the archive.]
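As a sanity check on the rates in the diagram, the snippet below simply multiplies out the per-collector figures quoted on the slide (16 Tb/s per AA station for ~250 stations, 10 Gb/s per dish for ~1200 dishes); note that 250 × 16 Tb/s reproduces the 4 Pb/s figure. The 8-hour observation used for the volume estimate is an assumed example, not an SKA parameter.

```python
# Back-of-envelope aggregation of the per-collector data rates quoted on
# the slide; purely illustrative, not an SKA design calculation.
AA_STATIONS = 250
DISHES = 1200
AA_RATE_TBPS = 16.0        # Tb/s per aperture-array station (from the slide)
DISH_RATE_GBPS = 10.0      # Gb/s per dish (from the slide)

aa_total_tbps = AA_STATIONS * AA_RATE_TBPS             # 4000 Tb/s = 4 Pb/s
dish_total_tbps = DISHES * DISH_RATE_GBPS / 1000.0     # 12 Tb/s

print(f"AA stations total: {aa_total_tbps:,.0f} Tb/s "
      f"({aa_total_tbps / 1000:.1f} Pb/s)")
print(f"Dishes total:      {dish_total_tbps:,.0f} Tb/s")

# Volume delivered into the CPF over an assumed 8-hour observation.
seconds = 8 * 3600
total_tb = (aa_total_tbps + dish_total_tbps) * seconds / 8   # Tb/s -> TB/s
print(f"~{total_tb / 1e6:.1f} EB per 8-hour observation")
```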


Science data processor pipeline

[Diagram: science data processor pipeline. Incoming data from the collectors passes through a switch into the correlator/beamformer, then a UV processor and an image processor, with buffer stores, a bulk store and a final switch along the way, before HPC science processing; software complexity increases along the pipeline. The imaging path comprises corner turning, coarse delays, fine F-step/correlation, visibility steering, an observation buffer, gridding of visibilities, imaging and image storage. The non-imaging path comprises corner turning, coarse delays, beamforming/de-dispersion, beam steering, an observation buffer, time-series searching, search analysis and object/timing storage. Indicative figures on the diagram: ingest rates of 3,200 GB/s and 128,000 GB/s; compute loads of 10 Pflop, 100 Pflop, 200 Pflop, 1 Eflop and 2.5 Eflop across the stages; and storage volumes labelled for SKA1 and SKA2 of 135 PB, 300 PB, 3 EB and 5.4 EB.]
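A quick back-of-envelope calculation shows how buffer and bulk-store volumes of this order follow from the ingest rates on the slide; the 6-hour observation length below is an assumed example value, not an SKA specification.

```python
# Rough sizing of the SDP buffer store from the slide's ingest rates.
# The observation length is an assumed example value.
INGEST_SKA1_GBPS = 3_200        # GB/s into the SDP for SKA1 (from the slide)
INGEST_SKA2_GBPS = 128_000      # GB/s for SKA2 (from the slide)
OBS_HOURS = 6                   # hypothetical observation length

def buffer_pb(ingest_gb_per_s: float, hours: float) -> float:
    """Data volume landed in the buffer store, in petabytes."""
    return ingest_gb_per_s * hours * 3600 / 1e6

for label, rate in (("SKA1", INGEST_SKA1_GBPS), ("SKA2", INGEST_SKA2_GBPS)):
    print(f"{label}: {buffer_pb(rate, OBS_HOURS):,.0f} PB "
          f"for a {OBS_HOURS} h observation")
```

For SKA1 this lands at roughly 70 PB per observation, the same order of magnitude as the 135-300 PB storage figures on the diagram.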


SKA exascale computing in the desert

• The SKA SDP compute facility will, at the time of deployment, be one of the largest HPC systems in existence
• Operational management of large HPC systems is challenging at the best of times, even when the systems are housed in well-established research centres with good IT logistics and experienced Linux HPC staff
• The SKA SDP will be housed in a desert location with little surrounding IT infrastructure, poor IT logistics and little prior HPC history at the site
• A potential SKA SDP exascale system is likely to consist of ~100,000 nodes, occupy 800 cabinets and consume 30 MW. This is very large: around 5 times the size of today's largest supercomputer, the Cray Titan at Oak Ridge National Laboratory (a quick consistency check follows below)
• SKA SDP HPC operations will therefore be very challenging
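The headline sizing figures are mutually consistent, as the short check below shows; the per-node and per-cabinet numbers are derived ratios for illustration, not published SDP specifications.

```python
# Simple consistency check of the slide's exascale sizing figures.
NODES = 100_000          # candidate SDP system size (from the slide)
CABINETS = 800           # cabinets (from the slide)
POWER_MW = 30            # total power draw (from the slide)

nodes_per_cabinet = NODES / CABINETS        # 125 nodes per cabinet
watts_per_node = POWER_MW * 1e6 / NODES     # 300 W per node
kw_per_cabinet = POWER_MW * 1e3 / CABINETS  # 37.5 kW per cabinet

print(f"{nodes_per_cabinet:.0f} nodes/cabinet, "
      f"{watts_per_node:.0f} W/node, {kw_per_cabinet:.1f} kW/cabinet")
```

The ~37.5 kW per cabinet that falls out is consistent with the ~40 kW of heat per rack quoted on the machine-room slide.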


The challenge is tractable

• Although the operational aspects of the SKA SDP exascale facility are challenging, they are tractable if dealt with systematically and in collaboration with the HPC community


SKA HPC operations – functional elements

• We can describe the operational aspects by functional element:
  • Machine room requirements
  • SDP data connectivity requirements
  • SDP workflow requirements
  • System service-level requirements
  • System management software requirements
  • Commissioning & acceptance test procedures
  • System administration procedures
  • User access procedures
  • Security procedures
  • Maintenance & logistical procedures
  • Refresh procedures
  • System staffing & training procedures


Machine room requirements

• Machine room infrastructure for exascale HPC facilities is challenging:
  • 800 racks, 1,600 m² of floor space
  • 30 MW IT load
  • ~40 kW of heat per rack
• Cooling efficiency and heat-density management are vital
• Machine room infrastructure at this scale is in the £150M bracket, with a design and implementation timescale of 2-3 years
• The power cost alone, at today's prices, is around £30M per year (worked through below)
• The desert location presents particular problems for the data centre:
  • Hot ambient temperatures: difficult for compressor-less cooling
  • Lack of water: difficult for compressor-less cooling
  • Very dry air: difficult for humidification
  • Remote location: difficult for DC maintenance
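The £30M per year figure follows directly from the 30 MW IT load once an electricity tariff and a PUE are assumed; both assumptions in the sketch below are illustrative values, not SKA numbers.

```python
# Annual power cost for a 30 MW facility at an assumed tariff.
IT_LOAD_MW = 30                # IT load from the slide
PUE = 1.2                      # assumed power usage effectiveness
TARIFF_GBP_PER_KWH = 0.09      # assumed electricity price, GBP/kWh

kwh_per_year = IT_LOAD_MW * 1_000 * PUE * 24 * 365
cost_gbp = kwh_per_year * TARIFF_GBP_PER_KWH
print(f"~£{cost_gbp / 1e6:.0f}M per year")
```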


System management software

• System management software is the vital element in HPC operations
• System management software today does not scale to exascale (illustrated below)
• A worldwide coordinated effort is under way to develop system management software for exascale
• Elements of the system management software stack:
  • Power management
  • Network management
  • Storage management
  • Workflow management
  • OS
  • Runtime environment
  • Security management
  • System resilience
  • System monitoring
  • System data analytics
  • Development tools
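To illustrate why flat, centralised tooling struggles at this scale, the sketch below compares the sample rate arriving at a single monitoring collector with a hierarchical (tree) aggregation scheme; the metric count, sampling interval and fan-out are assumed values, not figures from the slide.

```python
# Why flat monitoring does not scale: compare a single collector with a
# tree of aggregators. All parameters are illustrative assumptions.
import math

NODES = 100_000          # compute nodes (from the SKA sizing above)
METRICS_PER_NODE = 200   # assumed metrics sampled per node
INTERVAL_S = 10          # assumed sampling interval
FAN_OUT = 64             # assumed aggregation-tree fan-out

flat_rate = NODES * METRICS_PER_NODE / INTERVAL_S
print(f"flat collection: {flat_rate:,.0f} samples/s at one head node")

# With a tree, each aggregator handles only FAN_OUT children, and the
# tree depth grows logarithmically with the node count.
depth = math.ceil(math.log(NODES, FAN_OUT))
per_aggregator = FAN_OUT * METRICS_PER_NODE / INTERVAL_S
print(f"tree of depth {depth}: {per_aggregator:,.0f} samples/s per aggregator")
```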


Maintenance logistics

• Current HPC technology MTBF for hardware and system software results in failure rates of ~2 nodes per week on a cluster of ~600 nodes
• SKA exascale systems are expected to contain ~100,000 nodes
• Thus an expected failure rate of ~300 nodes per week could be realistic (a rough scaling is sketched below)
• During system commissioning this will be 3-4× higher
• Fixing nodes quickly is vital, otherwise the system will soon degrade into a non-functional state
• The manual engineering processes for fault detection and diagnosis on 600 nodes will not scale to 100,000 nodes; this needs to be automated in the system software layer
• Scalable maintenance procedures need to be developed between HPC system administrators, system software and smart hands in the DC
• Vendor hardware-replacement logistics need to cope with high turnaround rates
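The expected failure rate is a straightforward linear scaling of the quoted baseline; the short calculation below makes the step explicit, using only the figures from the slide (including the 3-4× commissioning multiplier).

```python
# Scale the observed failure rate from a ~600-node cluster to ~100,000 nodes.
BASE_FAILURES_PER_WEEK = 2   # observed on a ~600-node cluster (from the slide)
BASE_NODES = 600
SKA_NODES = 100_000

rate = BASE_FAILURES_PER_WEEK / BASE_NODES   # failures per node per week
ska_per_week = rate * SKA_NODES              # ~333 nodes per week
print(f"steady state: ~{ska_per_week:.0f} node failures per week "
      f"(~{ska_per_week / 7:.0f} per day)")

# During commissioning the slide expects 3-4x this rate.
for mult in (3, 4):
    print(f"commissioning x{mult}: ~{ska_per_week * mult:.0f} per week")
```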


Staffing levels and training

• Providing functional staffing levels and experience at a remote desert location will be challenging
• It is hard enough finding good HPC staff to run small-scale HPC systems in Cambridge; finding orders of magnitude more staff to run much more complicated systems in a remote desert location will be very challenging
• Operational procedures using a combination of remote system administration staff and DC smart hands will be needed
• HPC training programmes need to be implemented to build up skills well in advance
• The HPCS, in partnership with the South African national HPC provider and the SKA organisation, is already in the process of building out pan-African HPC training activities


Early Cambridge SKA solution - EDSAC 1