Network, Operations and Security Area Tony Rimovsky NOS Area Director [email protected].

Network, Operations and Security Area

Tony Rimovsky, NOS Area Director

[email protected]

The TeraGrid Map

[Figure: map of TeraGrid sites. Legend: Resource Provider (RP), Network Hub. Sites labeled: SDSC, TACC, UC/ANL, NCSA, ORNL, PU, IU, PSC, NCAR, NICS, LONI.]

Networking

• Origins of the TeraGrid network
  – Originally 4 sites with 3x10G each
  – Full mesh of 10Gbps links
• Evolution
  – Most sites now at 10G, not 30G
  – TG Backbone is now 10G
  – One router serves almost all Resource Providers
• Key question: Why continue to have a TeraGrid-specific network?
  – Variation in capacities to R&E Networks
  – Application-specific utility
    • GPFS-WAN and Lustre-WAN
  – Security

Networking

• Networking challenges
  – Tracking application-specific use of the network
  – Finding a new architecture paradigm

Operations

• TeraGrid GIG does not operate resources
• Resource Providers operate resources individually, but in a coordinated fashion
  – Accounts, Allocations, Accounting, Software, Processes, User Support and Policy are all coordinated
  – Coordinated does not necessarily mean “the same”
  – Resources are operated under a range of complementary awards. This has encompassed Resource Provider, HPCOPS and Track 2 awards.
  – Some activities are either shared or have an impact across the project
    » Operations Center, Accounting/Allocations, Networking, Security
• Instrumenting core activities is key
  – INCA validation testing helps coordinate software
  – Common accounting provides the ability to report across resources
  – Usage instrumentation helps us understand how users interact with TeraGrid across all platforms

Operations

• TGCDB/AMIE, POPS and Core2
  – Some definitions
    • TGCDB is the database of account and accounting records
    • AMIE is the protocol for transferring records
    • POPS is the system for submitting and reviewing allocation requests
    • Core2 is the current name for a major review and redesign of the account/allocations system
  – The accounting system is significant to us and the user community for several reasons
    • Common account/allocation mechanism across HPC resources
    • Relatively easy to add new resources
    • Facilitates user portability and access to resources across the project

Operations

• Operations challenges
  – There are a lot of places to collect data
  – It is difficult to get a complete picture in any particular area
    • E.g., network traffic levels can be measured, but the real question is about the applications that are driving that traffic. Some applications can be measured; others are more challenging.
  – New systems with unique architectures make it challenging to balance commonality with resource needs.

Security

• Security has two main thrusts:
  – Operational Security/Incident Response
  – Security Architecture
• Operational Security/Incident Response
  – Security events happen. The goal of TG IR is to control the spread of incidents among the sites.
  – Communication is key to success.
    • All sites participate in IR
    • Regular calls combined with distinct tools to maintain a secure and rapid communication environment in the event of an incident
    • The group is very successful at IR
  – Operational Security includes TAGPMA, writing and reviewing policy, and working with WGs on implementation details.

Security

• Security Architecture
  – Emphasis is on design and keeping track of the big picture
  – Grid security
  – Gateway and Campus AAA

Security Challenges

• Policy crafting and adoption
  – NSF, DOE and campus cultures bring unique perspectives
  – We try for consensus and people are passionate.
  – Example: Centralized password management
• Grid security and operations
  – Operational people are focused on traditional computer security and exposures
  – Architectural group creation was driven by the need for big-picture security
  – Example: capturing the process for distributing DNs for SSO
• Certificate-based authentication needs
  – Accounting and record keeping
  – IR logging
  – Gateways, community accounts, and accountability
  – Example: Attribute passing and tracking