GGF12 – 20 Sept 2004 - 1
LCG Incident Response
Ian Neilson
LCG Security Officer
Grid Deployment Group
CERN
Background
• LCG – Large Hadron Collider (LHC) Computing Grid
• Computing environment for the 4 LHC experiments
• ALICE, ATLAS, CMS, LHCb
• LHC operation in 2007
• Required 12-14 PetaBytes/year, equivalent 70,000 PCs compute
• LCG1/2003, LCG2/2003-4, EGEE
• 70+ sites in Europe, USA, Asia, S. America ……
• 7000+ CPUs
• 6000GB+ Storage
• Software certification, testing, deployment group
• Distributed GOCs
• UK
• http://goc.grid-support.ac.uk/gridsite/gocmain/monitoring/
• Taiwan
• http://goc.grid.sinica.edu.tw/goc/
www.cern.ch/lcg
Grid monitoring
EGEE - Enabling Grids for E-science in Europe
• 12 federations with 70 partner institutions
• 2 year + 2 project
• Operate a service grid facility for e-science
• Initially built on LCG2 infrastructure
• Re-engineer a robust middleware layer
• gLite
• Attract new users
• Research and Industry
• Broader focus than HEP: Biomedical, Earth Science ……
www.cern.ch/egee
Policy – the Joint Security Group
Security & Availability Policy
Usage Rules
Certification Authorities
Audit Requirements
GOC Guides
Incident Response
User Registration
Application Development & Network Admin Guide
http://cern.ch/proj-lcg-security/documents.html
Incident Response Policy
• Agreement on Incident Response
• June 2003 for LCG1
• What is an incident?
• Security investigation causing service interruption
• Suspected misuse of resources beyond site
• “Reasonable possibility” of stolen credentials
• Not to expire or be revoked within 3 days
• Classifications
• Identity theft
• Suspected / Probable / Confirmed
• Actions
• Misuse / Enforcement / Restoration / Escalation
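The stolen-credential rule and the three classification levels above can be sketched in code. This is a minimal illustration only, not part of the LCG policy documents: the names `Classification`, `CredentialReport` and `is_credential_incident` are hypothetical, and the sketch assumes one reading of the slide, namely that a possibly stolen credential counts as an incident only if it will neither expire nor be revoked within 3 days.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Classification(Enum):
    """Classification levels as listed on the slide."""
    SUSPECTED = "suspected"
    PROBABLE = "probable"
    CONFIRMED = "confirmed"


# Window taken from the slide. Assumption: credentials that expire or are
# revoked within this window fall outside the incident definition.
CREDENTIAL_WINDOW = timedelta(days=3)


@dataclass
class CredentialReport:
    """Hypothetical record of a possibly stolen credential."""
    classification: Classification
    expires_at: datetime
    revocation_due_at: Optional[datetime] = None  # None: no revocation planned


def is_credential_incident(report: CredentialReport, now: datetime) -> bool:
    """Return True if the credential stays usable beyond the 3-day window."""
    deadline = now + CREDENTIAL_WINDOW
    if report.expires_at <= deadline:
        return False  # expires on its own within the window
    if report.revocation_due_at is not None and report.revocation_due_at <= deadline:
        return False  # will be revoked within the window
    return True
```

For example, a credential valid for another 30 days with no revocation planned would be treated as an incident under this reading, while one expiring tomorrow would not.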
Incident Response - Communications
• Site enrolment collects 2 entries per site
• Registration questionnaire
• Site Contacts mail list
• Closed list of named individuals
• email, telephone
• CSIRT mail list
• List-of-lists (Open)
• 1 entry per site
• Updated list circulated to contacts list as sites enrol
• Pointers to policy documents for responsibilities
• Channels
• Users - local site contacts (& GOC)
• Contacts - discussion and information exchange
• CSIRT - incident notification, update
• Roll-out - system administrators
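The four channels above amount to a small routing table. The sketch below is illustrative only: the message-kind labels and the `channel_for` helper are hypothetical, and sending unknown kinds to the Contacts list is an assumed design choice, made so that the CSIRT channel carries incident traffic only.

```python
from enum import Enum


class Channel(Enum):
    """Channels as named on the slide."""
    USERS = "users"        # users report to local site contacts (& GOC)
    CONTACTS = "contacts"  # discussion and information exchange
    CSIRT = "csirt"        # incident notification, update
    ROLLOUT = "roll-out"   # system administrators


# Hypothetical message kinds mapped to the channel purposes on the slide.
ROUTES = {
    "user_report": Channel.USERS,
    "discussion": Channel.CONTACTS,
    "information_exchange": Channel.CONTACTS,
    "incident_notification": Channel.CSIRT,
    "incident_update": Channel.CSIRT,
    "sysadmin_notice": Channel.ROLLOUT,
}


def channel_for(kind: str) -> Channel:
    """Pick a channel for a message kind.

    Unknown kinds default to the Contacts list (an assumption of this
    sketch), which keeps the CSIRT channel free of general traffic.
    """
    return ROUTES.get(kind, Channel.CONTACTS)
```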
Incident Response – management issues
• LCG “community” known at CERN, EGEE community is broader
• User enrolment is well controlled, site enrolment is not
• Incomplete questionnaires
• Personal instead of list
• List instead of personal
• Undeliverable addresses
• Delayed delivery
• Moderated delivery
• Enrolment information not circulated
• SPAM, SPAM, SPAM, SPAM
• Lists need active management!
• Can we “see” all the sites?
• CERN/GOC view
• VO “private” information systems
Incident response – operational issues
• Recognising and reporting
• What is a local CSIRT?
• Scale of coverage
• 24x7 site/campus network operations team
• Department Security Officer
• LCG system administrator
• Who is a security contact?
• as above
• Intersection with local CSIRT procedures
• Local quarantine and analysis
• Keeping emergency channels clear
• Discussions, cross-postings
Incident response – near-term
• JSG, EGEE MWSG/JRA3, OSG, ……
• Site and VO registration policy and process
• Control gathering, distribution and management of data
• Sites need to understand requirements and responsibilities
• Coverage, access, audit
• Needs to be actively managed (? Self managed)
• Operational Security Co-ordination Team (OSCT)
• Ownership of security incidents
• From notification to resolution
• Liaise with national/institute CERTs
• Ownership of known problems
• Liaise with development & deployment groups
• Co-ordination of monitoring
• Post-mortem analysis
• Team of experts
Security Co-ordination
• How does OSCT map onto EGEE operations structures?
• Resource Centres (lots)
• Regional Operations Centres - ROC (~9)
• Core Infrastructure Centres - CIC (~5)
• Operations Management Centre - OMC (1)
• Co-ordination with Open Science Grid ………
• Adopt same co-ordinating model
2004 Security Service Challenges
• Objectives
• a) Evaluate the effectiveness of current procedures by simulating a small and well defined set of security incidents.
• b) Use the experiences of a) in an iterative fashion (during the challenges) to update procedures.
• c) Formalise the understanding gained in a) & b) in updated incident response procedures.
• d) Provide feedback to middleware development and testing activities to inform the process of building security test components.
• Exercise response procedures in a controlled manner
• Non-intrusive
• Compute resource usage trace to owner
– Run a job to send an email
• Storage resource trace to owner
– Run a job to store a file
• Disruptive
• Disrupt a service and map the effects on the service and grid
LCG/EGEE Incident Response
Thank You
Thank you to UK PPARC