LCG – Databases Meeting
Transcript of the LCG – Databases meeting
CERN IT Department
CH-1211 Genève 23
Switzerland
www.cern.ch/it
25 March 2008
• Present: Miguel Anjo, John Shade, Paolo Tedesco, Phool Chand, David Collados, Judit Novak, James Casey, Steve Traylen
Outline
Power cut problem
• Current config
• What happened
Service changes
• Move jobs to Scheduler
• End of synonyms
Tasks on users / DBA
• Division of tasks
Points to improve
• ServiceMap account
• Gridmap service
• Cleanup/partitioning of SAM
• Gridview merge/partitioning
• Weekly report checkup
AOB
• Lemon alarm for DB availability
• Next meeting
Main issue during the power cut
• Ethernet network switches in RAC6 were not connected to the critical power (the power bar was wrongly connected)
  – The public and cluster interconnect networks went down
Impact of the power cut on the DB Services

ATLAS Online RAC – RAC2 – no downtime
• Services relocated to the 2 surviving nodes until manual reboot

ATLAS Offline RAC – RAC5 – no downtime
• Streams processes moved to available nodes

Downstream servers – RAC5 – no downtime
• Services unavailable from 6:30 till 7:30
• Services available on a single cluster node from 7:30 till 10:30

CMS RAC – RAC6 – 1h downtime + 3h of reduced performance
• Services unavailable from 6:30 till 7:30
• Services available on a single cluster node from 7:30 till 9:45, except LCG_SAM (wrongly allocated to the servers which went down)
• Further downtime (9:45–11:00) while fixing other nodes to support the load

LCG RAC – RAC6 – 2h downtime + 2h of severely reduced performance
• Services unavailable from 6:30 till 7:30

LHCb RAC – RAC6 – 1h downtime
Service Changes
• Announce and schedule interventions
  – Have a main contact who keeps track of plan and progress, contacts all parties, and announces the restart of all services
• Move from dbms_job to dbms_scheduler. Jobs to migrate:
  – EGEE_PPS_SAM: rmTDLOneWeekOld
  – LCG_FTS_PROD: begin fts_history.movedata; end;
  – LCG_FTS_PROD: begin fts_servicestate.runjob; end;
  – LCG_FTS_PROD_T2: begin fts_history.movedata; end;
  – LCG_SAM_PPS: p_testdef_autodel
  – LCG_SAM_PPS: rmTDLOneWeekOld
  – See http://oracle-documentation.web.cern.ch/oracle-documentation/10gr2doc/server.102/b14231/jobtosched.htm
• End of synonyms – use Schema.TABLE_NAME instead
  – e.g. select from LCG_GRIDVIEW.SITES (from lcg_gridview_r, lcg_same_w, …)
  – Only need to grant the privileges
  – Check usage outside CERN (Miguel Anjo)
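As a rough sketch of the migration discussed above, one of the dbms_job entries could be recreated with DBMS_SCHEDULER, and the synonym replacement only needs an object grant. The job name, schedule, and grantee role here are illustrative assumptions, not the actual configuration:

```sql
-- Hypothetical DBMS_SCHEDULER replacement for the LCG_SAM_PPS dbms_job
-- (job name and repeat interval are assumptions).
BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'RM_TDL_ONE_WEEK_OLD',
    job_type        => 'STORED_PROCEDURE',
    job_action      => 'rmTDLOneWeekOld',
    repeat_interval => 'FREQ=DAILY;BYHOUR=2',
    enabled         => TRUE,
    comments        => 'Replaces the old dbms_job submission');
END;
/

-- Dropping a synonym then only requires granting the privilege directly
-- (lcg_gridview_r is the reader role mentioned in the slide):
GRANT SELECT ON lcg_gridview.sites TO lcg_gridview_r;
SELECT * FROM lcg_gridview.sites;
```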
Developer tasks
• Manage partitions (create/drop/move)
• Clean up old data
• Monitor space usage
• Defragment tables
• Check requests to production (improve docs)
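The partition-management task above boils down to a few DDL statements; the table, partition, and date-column names below are illustrative assumptions about the monthly-partitioned SAM tables, not the actual schema:

```sql
-- Create the next monthly partition (assumed range-partitioned by date).
ALTER TABLE lcg_same.testdata
  ADD PARTITION data_2008_05
  VALUES LESS THAN (TO_DATE('2008-06-01', 'YYYY-MM-DD'));

-- Clean up old data by dropping the oldest partition.
ALTER TABLE lcg_same.testdata
  DROP PARTITION data_2007_01;
```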
Points to improve
• ServiceMap account – need reader/writer accounts
• LCG_Gridmap service – user currently using the LCG_SAM service
• Cleanup/partitioning of SAM (SAM meeting?)
• Gridview merge/partitioning (Gridview meeting?)
• Weekly report checkup
AOB
• Lemon alarm for DB availability
  – Create a Lemon metric for the DB services
• Next meeting
  – 29th July?
Why and what was done
• Space pressure on LCGR storage arrays
• Of ~2750 GB, only 175 GB are available
• Not possible to shrink datafiles
• 650 GB of space unused inside datafiles
• Solution: move segments
Why and what was done – Overview
• Backup system will appreciate datafiles < 200 GB
  – A datafile is the smallest backup unit; its backup can be neither parallelized nor resumed

TABLESPACE_NAME              GB   USED
LCG_GRIDVIEW_DATA01          858  311
LCG_SAME_DATA01              387  240
LCG_GRIDVIEW_DATA02          240  224
LCG_SAME_TESTDATA_1H2007     185  184
LCG_SAME_TESTDATA_2H2007     122  98
LCG_FTS_PROD_DATA01          113  77
LCG_GRIDVIEW_JOBSTATUSRAW    92   82
LCG_SAME_TESTDATA_2H2006     42   41
LCG_SAME_TESTDATA_1H2008     28   25
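A per-tablespace report of this shape can be produced from the DBA_DATA_FILES and DBA_SEGMENTS dictionary views; this is a generic sketch, not necessarily the query used to generate the table above:

```sql
-- Allocated (GB) vs. used (USED) space per tablespace.
SELECT f.tablespace_name,
       ROUND(f.bytes / 1024 / 1024 / 1024)         AS gb,
       ROUND(NVL(s.bytes, 0) / 1024 / 1024 / 1024) AS used
FROM   (SELECT tablespace_name, SUM(bytes) AS bytes
        FROM   dba_data_files
        GROUP  BY tablespace_name) f
LEFT JOIN
       (SELECT tablespace_name, SUM(bytes) AS bytes
        FROM   dba_segments
        GROUP  BY tablespace_name) s
ON     f.tablespace_name = s.tablespace_name
ORDER  BY gb DESC;
```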
Why and what was done
• Partitioned tables
  – LCG_SAME.TESTDATA (April 2007)
    • Monthly partitions up to 2008, indexed, CLOB
    • Data since July 2007
    • Work still to do (data move during CPU Jan08 not finished – see later)
  – LCG_SAME.TESTDATA_HISTORY (March 2008)
    • Half-yearly partitions/tablespaces, no indexes
    • Data between July 2006 and July 2007
    • Created during CPU Jan08
  – LCG_GRIDVIEW.JOBSTATUSRAW (March 2008)
    • Monthly partitions up to Dec 2010, indexed
    • Created during CPU Jan08
Why and what was done
• LCG_GRIDVIEW_DATA01 space waste
  – Not possible to shrink datafiles
  – Solution: move data to a different datafile
• ALTER TABLE MOVE + ALTER INDEX REBUILD
  – Copies table and constraints online
  – Copies indexes online
  – Invalidated some cursors (application restart needed)
  – Done for tables < 1 GB (Thursday 6 March)
• DBMS_REDEFINITION works online
  – Copies table, indexes, constraints; keeps them synchronized
  – Renames tables, copies privileges
  – Done successfully for 7 tables (Monday 10 March)
  – Failed for table VO (but reported success)
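The two techniques above can be sketched as follows; table, index, and tablespace names are illustrative, and the interim table for DBMS_REDEFINITION must be created beforehand with the target storage layout:

```sql
-- Technique 1: move the segment, then rebuild the now-unusable indexes.
ALTER TABLE lcg_gridview.sites MOVE TABLESPACE lcg_gridview_data02;
ALTER INDEX lcg_gridview.sites_pk REBUILD ONLINE;

-- Technique 2: fully online move via DBMS_REDEFINITION
-- (SITES_INTERIM is a hypothetical pre-created interim table).
DECLARE
  num_errors PLS_INTEGER;
BEGIN
  DBMS_REDEFINITION.CAN_REDEF_TABLE('LCG_GRIDVIEW', 'SITES');
  DBMS_REDEFINITION.START_REDEF_TABLE('LCG_GRIDVIEW', 'SITES', 'SITES_INTERIM');
  DBMS_REDEFINITION.COPY_TABLE_DEPENDENTS('LCG_GRIDVIEW', 'SITES', 'SITES_INTERIM',
                                          num_errors => num_errors);
  DBMS_REDEFINITION.FINISH_REDEF_TABLE('LCG_GRIDVIEW', 'SITES', 'SITES_INTERIM');
END;
/
```

The redefinition path keeps the original table readable and writable throughout, which is why it was preferred for the larger, heavily-used tables; the failure on table VO discussed next shows its locking step is still a risk.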
Why and what was done
• Why did it fail (table VO)?
  – Service request opened with Oracle
  – Table VO is heavily used (several users, synonyms, views, procedures)
  – Oracle failed to get a lock but did not report an error
    • "ORA-4020: deadlock when trying to lock xxx" was reported for other tables during the move
  – Similar problem for tables SITES and NODES
  – Currently difficult to create/drop tables referencing those tables (tables in a bad state? Service Request)
Missing operations
• LCG_GRIDVIEW
  – Recreate tables SITES, NODES, VO
  – Move 10 tables off DATA01 (120 GB)
  – Possible via "exp/imp" or "table move + index rebuild"
  – 8 hours??
• LCG_SAME
  – Move 2H2007 partitions to the correct tablespace
  – Split 2008 partitions
  – Create partitions up to Dec 2010
  – > 1 day, "transparent"
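The LCG_SAME follow-up work above maps onto standard partition DDL; partition names and the split date are assumptions for illustration:

```sql
-- Split the catch-all 2008 partition into monthly pieces.
ALTER TABLE lcg_same.testdata
  SPLIT PARTITION data_2008 AT (TO_DATE('2008-05-01', 'YYYY-MM-DD'))
  INTO (PARTITION data_2008_04, PARTITION data_2008);

-- Relocate a 2H2007 partition to its intended tablespace
-- (LCG_SAME_TESTDATA_2H2007 is listed in the usage table earlier).
ALTER TABLE lcg_same.testdata_history
  MOVE PARTITION data_2h2007 TABLESPACE lcg_same_testdata_2h2007;
```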
Outline
Last week operations
• Why and what was done
• Missing operations
Space usage and cleanup
• Current situation
• What can be done
• Division of tasks
Interventions in production system
• Transparent interventions
• Applications resilience to interventions
Database meetings with developers
• Next meeting
Current situation
• LCG_SAME
  – TESTDATA_HISTORY is 224 GB
    • Data: July 06 – July 07
    • Needed? For how long?
  – TESTDATA is 286 GB + indexes
    • Data >= July 07
• LCG_GRIDVIEW
  – JOBSTATUSRAW is partitioned
    • Agreed to drop partitions more than 3 months old
    • About 30 GB/month
  – JOBSTATUS, GRIDFTPMONITORRAW
    • > 30 GB + indexes
    • Partition? Regular clean-up?
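The agreed JOBSTATUSRAW clean-up then becomes a single statement per month; the partition name is an assumption, and UPDATE GLOBAL INDEXES is only needed if the table carries global indexes:

```sql
-- Drop a JOBSTATUSRAW partition older than 3 months (~30 GB reclaimed).
ALTER TABLE lcg_gridview.jobstatusraw
  DROP PARTITION data_2007_12 UPDATE GLOBAL INDEXES;
```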
What can be done
• Partitioning
  – Some maintenance work
  – No space gain
• Aggregates
  – After aggregation, delete the raw row data
  – Space gain and performance boost
• History table (no indexes, compressed)
  – Little space gain
  – Heavy maintenance work
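The aggregate-then-delete idea could look like this; the summary table, columns, and retention window are purely illustrative, not the actual SAM schema:

```sql
-- Roll raw test results older than 3 months into a hypothetical daily
-- summary table, then delete the raw rows to reclaim space.
INSERT INTO lcg_same.testdata_daily (test_day, tests_run, tests_ok)
SELECT TRUNC(check_time),
       COUNT(*),
       SUM(DECODE(status, 'OK', 1, 0))
FROM   lcg_same.testdata
WHERE  check_time < ADD_MONTHS(SYSDATE, -3)
GROUP  BY TRUNC(check_time);

DELETE FROM lcg_same.testdata
WHERE  check_time < ADD_MONTHS(SYSDATE, -3);
COMMIT;
```

On a partitioned table the DELETE would be better done as a partition drop, which is instantaneous and generates no undo.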
Expected growth
• Start monitoring space growth per table
• What are the expectations?
• How much aggregate data will be kept?
• What about aggregation of aggregates?
• LCG_SAM?
• LCG_GRIDVIEW?
• LCG_FTS?
Transparent interventions
• Huge databases (for SAM, GridView)
  – Impossible to perform full-scale tests
• Some operations run 'with risk' for long periods
• How to schedule them?
• Possible to do with downtime instead? (less risk)
• Notification flow?

Type          Flow
"With risk"   selected users
"With risk"   all users
"Downtime"    selected users
"Downtime"    all users
Applications resilience to interventions
• Resilient: adj.
  – Marked by the ability to recover readily, as from misfortune;
  – Capable of returning to an original shape or position, as after having been compressed.

Application   Resilient?   Grid consequence?   Acceptable downtime?
FTS
Gridview
LFC
SAM
VOMS
Next meeting
• Main developers of LCG
• Weekly report, planned interventions, SQL optimization, sharing solutions
• Schedule: the Monday after the 15th, at 14:00
• Next meeting: 21st April, 14:00

FTS – Gavin McCance
GridView – James Casey
LFC –
SAM –
VOMS – Steve Murray