OSG Area Coordinator’s Report: Workload Management, February 9th, 2011, Maxim Potekhin, BNL


Page 1:

OSG Area Coordinator’s Report:

Workload Management

February 9th, 2011

Maxim Potekhin

BNL

631-344-3621

potekhin@bnl.gov

Page 2:

Workload Management: Panda

• Panda Monitoring:

Closer integration of the existing Panda Monitoring System with the Global Dashboard

Upgrade lowered in priority due to existing functionality in the Dashboard (ATLAS decision)

• Scalability of Panda:

Typical throughput has almost doubled in the past 12 months, from about 250k jobs run globally per day to almost 500k per day, with a peak count of 713k in the final days of data reprocessing in Nov '10

This puts more pressure on the database (Oracle), which is used for keeping the complete state of the system, for monitoring, and for data mining for performance analysis

Data is heavily indexed and indexes can block during copying of data across tables

The DB engine sometimes makes suboptimal choices when confronted with multiple indexes

In the fall of 2010, there were a few problem days after a series of network outages:

The resulting imbalance of data distribution across tables and a large backlog of data to be copied led to decreased performance

Multiple DB optimizations have been implemented since, notably table partitioning

Demonstrated increase in performance

Some queries are still problematic and require workarounds (illustrated in the sketch below)
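The partitioning and query workarounds mentioned above can be illustrated with a short, purely hypothetical Python sketch using the cx_Oracle client. Nothing below is the actual PanDA schema or code: the table, column, and index names are invented, and the optimizer hint is shown only as an example of the kind of workaround applied when the engine picks the wrong index among several.

    # Hypothetical illustration of two of the database measures discussed
    # above: a date-range-partitioned archive table and an optimizer hint
    # that pins a query to a specific index. All table, column and index
    # names are invented and do not reflect the real PanDA schema.
    import datetime
    import cx_Oracle

    conn = cx_Oracle.connect("panda_user/secret@pandadb")  # placeholder credentials/DSN
    cur = conn.cursor()

    # 1) Range-partitioned archive table: old partitions can be compressed,
    #    moved or dropped without touching the recent data.
    cur.execute("""
        CREATE TABLE jobs_archived (
            pandaid          NUMBER PRIMARY KEY,
            jobstatus        VARCHAR2(15),
            computingsite    VARCHAR2(128),
            modificationtime DATE
        )
        PARTITION BY RANGE (modificationtime) (
            PARTITION p_2010_q4 VALUES LESS THAN (DATE '2011-01-01'),
            PARTITION p_2011_q1 VALUES LESS THAN (DATE '2011-04-01')
        )""")
    cur.execute("""
        CREATE INDEX jobs_archived_modtime_idx
            ON jobs_archived (modificationtime) LOCAL""")

    # 2) Optimizer hint forcing the index on modificationtime when the
    #    engine would otherwise choose a less selective one.
    cur.execute("""
        SELECT /*+ INDEX(j jobs_archived_modtime_idx) */ pandaid, jobstatus
          FROM jobs_archived j
         WHERE modificationtime > :t""",
        t=datetime.datetime(2010, 11, 1))
    rows = cur.fetchall()

    conn.close()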

Page 3:

Workload Management: WBS

• 2.2.4.1 (new monitor code): on hold due to the ATLAS management decision. To be dropped?

• 2.2.4.2 (monitor integration): progressing with the existing (old) code base

• 2.2.5.1 (Daya Bay/LBNE): progress made, ready for production

• 2.2.5.2 (CHARMM expansion to 20+ sites): done, researchers are happy

• New item: Panda scalability (new database options). To be added?

Page 4:

Workload Management: Panda

• Scalability of Panda, cont’d:

Along with DB optimization, alternatives are being considered for storage of finalized job data (the archive), where Oracle is redundant; noSQL solutions in particular are being looked at, such as Cassandra, HBase, etc.

Advantages of noSQL solutions such as Cassandra:

Compared to a traditional RDBMS, more cost-effective horizontal scaling on commodity hardware and media

Load-balanced, redundant, truly distributed system

Extremely fast sinking of data with proper configuration (important)

Demonstrated performance of noSQL solutions in industry (Amazon, Facebook, Twitter, Google etc)

In December 2010, an evaluation of Cassandra was started with a real Panda job data feed

Test cluster (3 nodes) located at CERN

Data repository at Amazon S3

The first round of testing is encouraging; data design is ongoing (see the sketch below)

To be evaluated at the ATLAS Software Week at CERN in April
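As an illustration of the access pattern being tested, below is a minimal sketch using the pycassa Python client. This is not the evaluation code itself: the keyspace, column family, host names, and columns are assumptions invented for the example, since the actual data design is still being worked out.

    # Minimal, hypothetical sketch of sinking finalized job records into
    # Cassandra via the pycassa client. Keyspace, column family, hosts
    # and column names are invented for illustration only.
    import pycassa

    # Connection pool against a small test cluster (placeholder host names).
    pool = pycassa.ConnectionPool('PandaArchive',
                                  server_list=['cassandra01:9160',
                                               'cassandra02:9160',
                                               'cassandra03:9160'])
    jobs = pycassa.ColumnFamily(pool, 'jobs')

    # Write a finished job: one row keyed by PandaID, with columns for
    # the attributes that archive queries need most often.
    jobs.insert('1234567890', {'jobstatus': 'finished',
                               'computingsite': 'BNL_ATLAS_1',
                               'endtime': '2010-11-28 12:34:56'})

    # Read the record back by key; the result is a dictionary of columns.
    record = jobs.get('1234567890')
    assert record['jobstatus'] == 'finished'

In such a layout, reads and writes are spread across the cluster by row key, which is what provides the load-balanced, horizontally scalable behavior listed above.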

Page 5:

Workload Management: Engagement

• CHARMM:

Thanks to the 17+ active sites used, the recent run was expedient, according to the team

Resource requirement estimates turned out to be quite precise (encouraging)

The last wave of jobs is finishing right now, and the data goes to the experimental group; only 408 jobs were submitted in the past month

• LBNE/Daya Bay:

Jobs ran at PDSF and BNL (J. Caballero); a number of issues were discovered and resolved, such as:

Peculiarities of WN configuration at PDSF (version of curl); see the illustrative check after this list

Suboptimal job configuration resulted in some jobs running out of memory, which is now fixed

Additional software optimization was done by the researchers (MC)

An announcement went out on the Daya Bay mailing list that the initial production run will start in a few days

An additional cluster at IIT (Illinois) is under construction

Panda user documentation is being reviewed as per researchers’ request
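The curl issue at PDSF is the kind of worker-node peculiarity that can be caught with an up-front sanity check. Below is a minimal, illustrative sketch (not the actual pilot or job wrapper code; the minimum version is a placeholder chosen for the example) that verifies the locally installed curl before a payload is fetched.

    # Illustrative worker-node sanity check (not the actual pilot code):
    # verify that the locally installed curl meets a minimum version
    # before the job payload is fetched.
    import re
    import subprocess
    import sys

    MIN_CURL = (7, 19, 0)  # placeholder minimum version for this example

    def curl_version():
        """Return the installed curl version as a tuple, e.g. (7, 21, 3)."""
        proc = subprocess.Popen(['curl', '--version'],
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
        out, _ = proc.communicate()
        match = re.match(r'curl (\d+)\.(\d+)\.(\d+)', out.decode('ascii', 'replace'))
        if not match:
            raise RuntimeError('could not parse "curl --version" output')
        return tuple(int(x) for x in match.groups())

    if curl_version() < MIN_CURL:
        sys.exit('worker node curl is too old for this payload')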