The BaBar Event Building and Level-3 Trigger Farm Upgrade

S. Luitz, R. Bartoldus, S. Dasu, G. Dubois-Felsmann, B. Franek, J. Hamilton, R. Jacobsen, D. Kotturi, I. Narsky, C. O’Grady, A. Perazzo, R. Rodriguez, E. Rosenberg, A. Salnikov, M. Weaver, M. Wittgen for the BaBar Computing Group

CHEP 2003 San Diego

3/24/03, BaBar Farm Upgrade, S. Luitz, CHEP 2003

Outline
- BaBar Data Acquisition Overview
- The Old System
- Why upgrade? - Upgrade Options
- Adapting the Software
- Choosing Hardware
- Testing in the Real Environment
- Installation and Tests
- Other Performance Improvements
- Results - Summary - Plans

The Old System
- Ca. 150 Read-Out Modules (ROMs) in 23 crates, 300 MHz PPC
- 100 Mbit/s Ethernet from ROMs to switch
- 100 Mbit/s Ethernet from switch to farm nodes
- 32 Sun Ultra-5 machines (333 MHz) in the level-3 trigger farm
- Ca. 12 ms CPU per event per node (75% CPU)
- Various other limitations in the system
- 2 kHz maximum L1 trigger rate
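
The figures on this slide are mutually consistent, which a short calculation can make explicit. The sketch below assumes that L1-accepted events are spread evenly over the 32 farm nodes and that the quoted 75% is the per-node CPU load at the 2 kHz maximum rate. Neither assumption is stated on the slide, and the function name is purely illustrative.

```python
def node_cpu_utilization(l1_rate_hz, n_nodes, cpu_s_per_event):
    """Fraction of one node's CPU in use, assuming events are shared evenly (an assumption)."""
    events_per_node_per_s = l1_rate_hz / n_nodes
    return events_per_node_per_s * cpu_s_per_event

# Old system, numbers from the slide: 2 kHz L1 rate, 32 nodes, ca. 12 ms CPU per event.
old_farm_load = node_cpu_utilization(2000.0, 32, 0.012)
print(f"{old_farm_load:.0%}")  # prints "75%"
```

Under the same assumptions, doubling the rate to 4 kHz would ask for 150% of each node's CPU, which is one way to see why the old farm could not follow rising trigger rates.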

Why Upgrade the Farm?
- Increasing luminosities from PEP-II
  - Detailed projections for trigger rates and event sizes
  - At decision time: not sure about L1 trigger upgrades
- Factor 2 headroom desirable
  - Absorb background spikes and non-ideal machine conditions
- Have more CPU-intensive level-3 trigger algorithms
- Better statistics for fast monitoring
- Sun hardware (bought 98/99) end of life?
  - Increased hardware failure rate
- Reclaim rack space

Farm Upgrade Requirements
Target:
- 10x as much CPU power as the original 32-node Sun Ultra-5 farm (for our specific application)
- Gigabit Ethernet on the event building network
  - Farm side first
  - ROM side to be upgraded later
- Fit in existing 32-node rack space

Upgrade Option 1 (at decision time in 2001)
- Sun UltraSPARC-II 440 MHz single-CPU nodes replace existing nodes
- Add more nodes, maybe replace farm later
- x 1.1 per CPU
- Re-use BaBar offline machines?
- No software modifications
- Very large number of machines
  - Factor 10 in total CPU difficult to achieve (300 machines!)
- Expensive if new machines

Upgrade Option 2 (at decision time in 2001)
- Dual-CPU Pentium-III 1.3 GHz, Linux
- x 2.6 per CPU
- Relatively low hardware costs
- Small number of nodes
- 1U form factor
- Little endian (byte swapping modifications)
- Mixed system

Upgrade Option 3 (at decision time in 2001)
- Dual-CPU UltraSPARC-III 750 MHz
- x 1.8 per CPU
- No software modifications necessary
- High cost (factors, only server hardware available)
- 4U form factor

4-CPU (or more) machines not considered because of UDP network stack and SMP scaling issues
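
The machine counts implied by the three options can be estimated from the per-CPU factors. The sketch below is a back-of-the-envelope calculation, not the group's actual evaluation: it assumes the "x per CPU" factors are relative to one original Ultra-5 node and that throughput scales linearly with the number of CPUs, with no SMP or network overhead. The function and variable names are hypothetical.

```python
import math

ORIGINAL_NODES = 32   # original Sun Ultra-5 farm
TARGET_FACTOR = 10    # required gain in total CPU power

def nodes_needed(factor_per_cpu, cpus_per_node):
    """Machines needed to reach the target, assuming ideal linear scaling (an assumption)."""
    target_units = TARGET_FACTOR * ORIGINAL_NODES   # in Ultra-5 equivalents
    cpus = math.ceil(target_units / factor_per_cpu)
    return math.ceil(cpus / cpus_per_node)

options = {
    "Option 1: UltraSPARC-II 440 MHz, single CPU": nodes_needed(1.1, 1),
    "Option 2: Pentium-III 1.3 GHz, dual CPU": nodes_needed(2.6, 2),
    "Option 3: UltraSPARC-III 750 MHz, dual CPU": nodes_needed(1.8, 2),
}
for name, count in options.items():
    print(f"{name}: about {count} machines")
```

Option 1 comes out at roughly 290 machines, in line with the "300 machines!" remark, while the dual-CPU options need far fewer boxes under these idealized assumptions.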

The Choice
- After extensive consideration of all options
- Decision to go ahead with Pentium-III and Linux
- Plan for 50 dual-CPU Pentium-III machines

Adapting the Software
- Data Flow
  - Retrofit endian conversion
  - PPC and SPARC are big endian; original design did not foresee byte swapping for performance reasons
  - All byte reordering done on Linux side
  - Bulk 32-bit swapping of whole datagrams
  - Takes care of control and navigational information
- Accessing the data from Linux
  - Payload contains byte and 2-byte aligned data
  - Data 32-bit pre-swapped
  - Fix up byte and 2-byte aligned structures on demand
- Keep on-disk formats as big endian
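
The effect of a bulk 32-bit swap on smaller fields can be illustrated with a short Python sketch. This is not the BaBar data flow code. It is a minimal illustration, with hypothetical function names and a made-up datagram layout, of why 32-bit words come out correct after the swap while 2-byte and 1-byte fields need a fix-up. The fix-up shown here (flipping the element index within each word) is one possible approach and is an assumption, not something the slides describe.

```python
import struct

def swap32(datagram: bytes) -> bytes:
    """Reverse the byte order of every 32-bit word (length must be a multiple of 4)."""
    n_words, remainder = divmod(len(datagram), 4)
    if remainder:
        raise ValueError("datagram length must be a multiple of 4 bytes")
    words = struct.unpack(f">{n_words}I", datagram)
    return struct.pack(f"<{n_words}I", *words)

def read_u32(buf: bytes, index: int) -> int:
    """32-bit words are already correct after the bulk swap."""
    return struct.unpack_from("<I", buf, 4 * index)[0]

def read_u16(buf: bytes, index: int) -> int:
    """16-bit values keep their value but trade places within each word."""
    return struct.unpack_from("<H", buf, 2 * (index ^ 1))[0]

def read_u8(buf: bytes, index: int) -> int:
    """Single bytes end up in reversed order within each word."""
    return buf[index ^ 3]

# Hypothetical big-endian datagram: one 32-bit word, two 16-bit values, four bytes.
wire = struct.pack(">IHH4B", 0xDEADBEEF, 0x1234, 0x5678, 1, 2, 3, 4)
local = swap32(wire)
```

The readers use explicit little-endian format codes, so the example behaves the same on any host. With this layout, `read_u32(local, 0)` recovers the 32-bit word directly, while `read_u16(local, 2)` and `read_u8(local, 8)` recover the first 16-bit value and the first byte only because of the index flip.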

Choosing the Hardware
- Limited resources and time for evaluation
- Start out with systems known to be reliable for the Windows group at SLAC: Dell PowerEdge 1550
- Optical Gigabit (then: no experience with copper at SLAC)
- Acquire a few machines for testing

Testing in Lab and Real System
- Test stand testing of all software
- Parasitic testing of a few nodes in the real system for a few months
  - Port monitoring (SPAN) feature of switch
  - Feed copies of production datagrams to Linux nodes, no reply required
  - Run event building software on mirrored events
- No stability problems observed

Purchasing, Installation and Tests
- By the time the testing was completed, hardware of choice no longer available
- Re-test next generation machines
  - Dell PowerEdge 1650 @ 1.4 GHz OK
- Purchase 50 machines late spring 2002 and install in summer shutdown
  - Keep enough Ultra-5 in place for shutdown DAQ needs
- New farm: 2 ½ water cooled racks
  - Regular shelves, stack 2 machines
- No significant hardware problems (1 disk, 1 main board dead on arrival)

50 1U Farm Nodes

Other Improvements
- In parallel: ROM Gigabit Ethernet
  - Originally planned for later, but we realized that this could be done by the end of the shutdown too
  - Develop optimized zero-copy UDP stack
  - Install optical Gigabit Ethernet PMC on readout modules
- Split crates to balance amounts of data
- Improve feature extraction ROM software

Result and Summary
- Very smooth transition
- System now capable of 5.5 kHz L1 accept rate at current backgrounds
  - Original design + performance: 2 kHz
- System working very well in routine data taking
  - No crashes
  - No system stability problems
  - No hardware problems

Further Improvements and Longer Term Plans
Improvements
- Multi-CPU support
  - Single L3 worker thread
  - Run more than 1 L3 process per node
  - Currently being implemented
- Migrating more software to Linux
Longer Term Plans
- Keep Sun server infrastructure, however look into Linux as file servers
- Replace more systems with Linux machines