Distributed applications monitoring at
system and network level
A. Brunengo (INFN-Ge), A. Ghiselli (INFN-Cnaf), L. Luminari (INFN-Roma1), L. Perini (INFN-Mi), S. Resconi (INFN-Mi), M. Sgaravatto (INFN-Pd), C. Vistoli (INFN-Cnaf)
for the
Monarc Collaboration
7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 2
Objectives
• Analysis of measurements regarding resource utilization in a distributed environment:
  – CPU usage
  – network throughput
  – wall clock time of a single job
• To locate:
  – system and network bottlenecks
  – software and hardware inefficiencies in different scenarios
• To understand:
  – client behaviour and system resource requirements
  – network impact on application behaviour, and application efficiency in using the network
Test description
• ATLFast++ stress tests: increasing number of concurrent jobs with read access to the database
• A single job reads ~3000 events of 40 KB each
• Single Objectivity federation
• System configuration:
  – Single workstation (without AMS server)
  – One AMS server, one client machine
  – One AMS server, many client machines
• Network configuration:
  – LAN (Gigabit Ethernet, Fast Ethernet, Ethernet)
  – WAN with different bandwidth capacity (2 Mbps to 8 Mbps)
  – QoS/Differentiated Services WAN link at 2 Mbps
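As a rough back-of-the-envelope check on the workload described above, the data volume read by one job and the best-case transfer time on each tested link can be estimated as follows. This is only a sketch: it assumes 1 KB = 1000 bytes and ignores all protocol and database overhead, and the function names are illustrative, not part of the original test scripts.

```python
# Rough estimate of per-job data volume and best-case transfer time.
# Assumptions (not from the slides): 1 KB = 1000 bytes, no protocol overhead.

EVENTS_PER_JOB = 3000   # "~3000 events" read by a single job
EVENT_SIZE_KB = 40      # "40KB each"


def job_volume_mbit(events=EVENTS_PER_JOB, event_kb=EVENT_SIZE_KB):
    """Return the data read by one job, in megabits."""
    total_bytes = events * event_kb * 1000
    return total_bytes * 8 / 1_000_000


def min_transfer_seconds(link_mbps, n_jobs=1):
    """Best-case time to move the data of n_jobs over a link of link_mbps."""
    return n_jobs * job_volume_mbit() / link_mbps


if __name__ == "__main__":
    links = [("WAN 2 Mbps", 2), ("WAN 8 Mbps", 8), ("Ethernet", 10),
             ("Fast Ethernet", 100), ("Gigabit Ethernet", 1000)]
    for name, mbps in links:
        print(f"{name:18s} {min_transfer_seconds(mbps):8.1f} s per job")
```

Under these assumptions a single job moves about 960 Mbit, so on a 2 Mbps link it cannot complete in less than roughly 480 s, whatever the CPU capacity of client and server.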
Application monitoring
• System parameters collected:
  – Client side: CPU use (user and system), job wall clock time
  – Server side: CPU use (user and system), network throughput
• CPU use on the client machine is important to evaluate machine load versus the number of concurrent jobs at different link speeds.
• CPU use on the server is important to evaluate the maximum number of client jobs that can be served, and whether this depends on client characteristics and network link capacity.
• Wall clock execution time is important to evaluate the system's capacity to deliver workload as a function of the number of jobs and the network speed.
Application monitoring
• CPU usage (client and server) is collected via periodic ‘vmstat’ commands.
• The application itself records elapsed time and CPU time.
• Aggregate server throughput is collected by tracing the system calls of the AMS server process:
  – every 2 minutes a script computes and sums the number of bytes read/sent and the number of bytes written/received.
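The collection steps above could be post-processed along the following lines. This is a minimal sketch, not the scripts used in the tests: it assumes that the vmstat header row names the CPU columns "us" and "sy" as its rightmost such labels (true of the usual Solaris and Linux layouts), and the function names are hypothetical.

```python
# Sketch: extract user/system CPU percentages from vmstat text output and
# convert byte counts summed over one sampling interval into kbit/sec.
# Assumption: the header row labels the CPU columns "us" and "sy"; on
# Solaris "sy" also labels the system-call column, so the rightmost
# occurrence is used.


def parse_vmstat(text):
    """Return a list of (user_pct, system_pct) tuples, one per data row."""
    samples = []
    us_idx = sy_idx = None
    for line in text.splitlines():
        fields = line.split()
        if "us" in fields and "sy" in fields:
            # Header row: store positions as negative indexes from the right.
            us_idx = -1 - fields[::-1].index("us")
            sy_idx = -1 - fields[::-1].index("sy")
            continue
        if us_idx is None or not fields or not fields[-1].isdigit():
            continue  # not a data row, or no header seen yet
        if len(fields) < max(-us_idx, -sy_idx):
            continue  # row too short to contain the CPU columns
        samples.append((int(fields[us_idx]), int(fields[sy_idx])))
    return samples


def throughput_kbit_per_sec(bytes_out, bytes_in, interval_sec=120):
    """Aggregate throughput over one interval (2 minutes by default)."""
    return (bytes_out + bytes_in) * 8 / 1000 / interval_sec
```

Locating the columns from the header row rather than hard-coding positions keeps the parser usable when the number of disk columns differs between machines.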
Test results
• One Client – One Server – Gigabit Ethernet
[Diagram: server sunlab1 and client gsun connected via 1000BaseSX]
sunlab1, gsun: Sun Ultra5, 333 MHz, 128 MB RAM, Solaris 2.7 (14 SpecInt 95)
[Figure: Server and client connected via GE, CPU use on client. X axis: time, 09-22 18:29 to 09-23 05:13; Y axis: %CPU, 0 to 120; series: n. job, %cpu user, %cpu system, %cpu total]
[Figure: Server and client connected via GE, CPU use on server. X axis: time, 09-22 18:29 to 09-23 05:22; Y axis: %CPU, 0 to 120; series: n. job, %cpu user, %cpu system, %cpu total]
[Figure: Server and client connected via GE, throughput in kbit/sec. X axis: time, 1999-09-22 18:31 to 1999-09-23 05:25; Y axis: kbps, 0 to 40000]
[Figure: Server and client connected via GE, mean wall clock time. X axis: n. job, 1 to 100; Y axis: sec., 0 to 9000; series: gsun]
Comments:
– Client CPU: 100 % used with 5 jobs
– Server CPU: 100 % used with 50 jobs
– After 40 concurrent jobs, Client CPU decreases as well as network throughput
One Client – One Server – Fast Ethernet (Rome Babar Farm)
[Diagram: client bbfarm01 and server bbfarm02 connected via 100BaseT]

Host     | CPU                                             | Memory | Disk       | OS          | Link
bbfarm01 | Sun E450 Ultra 2 400MHz, 4 CPU, each 17 SpecInt | 512MB  | RAID A3500 | Solaris 2.7 | Sun PCI FE
bbfarm02 | Sun E450 Ultra 2 400MHz, 4 CPU, each 17 SpecInt | 512MB  | RAID A3500 | Solaris 2.7 | Sun PCI FE
[Figure: Server and client connected via FE, CPU use on server. X axis: time, 09-22 10:21 to 09-22 17:04; Y axis: %CPU, 0 to 70; series: Nr. jobs, %CPU (user), %CPU (system), %CPU (total)]
[Figure: Server and client connected via FE, CPU use on client. X axis: time, 1999-09-22 10:18 to 1999-09-22 17:08; Y axis: %CPU (4 CPU), 0 to 80; series: Nr. jobs, %CPU (user), %CPU (system), %CPU (total)]
[Figure: Server and client connected via FE, throughput in kbit/sec. Y axis: kbit/sec, 0 to 80000]
[Figure: Server and client connected via FE, mean wall clock time. X axis: n. job, 0 to 60; Y axis: sec, 0 to 7000; series: bbfarm02]
Comments:
– Client CPU: 60 % used up to 30 jobs, then 20 %
– Server CPU: 100 % used with 5 jobs (multi-processor server used as mono-processor)
– After 30 concurrent jobs, wall clock execution time increases rapidly
– After 30 concurrent jobs, client CPU decreases, as does network throughput
One server (Bologna) – one client (CERN): QoS/Differentiated Services
2 Mbps WAN ATM link
[Diagram: server sunlab1 and client monarc01 connected via a 2 Mbps link]
sunlab1: Sun Ultra5, 333 MHz, 128 MB, Solaris 2.7
monarc01: Sun Enterprise 450, 4 x 400 MHz, 512 MB, Solaris 2.6
• 4717 cells/sec --> 4717 * 48 byte/sec = 1811 Kbit/sec.
• Available bandwidth completely used.
• Server and client CPUs unloaded.
• Very high job elapsed time.
• Differentiated services mechanism working properly (with precise tuning of network parameters).
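The cell-rate arithmetic in the first bullet can be reproduced with a short calculation. The sketch below relies only on the standard ATM cell layout (53 bytes per cell, of which 48 are payload); the function names are illustrative, and adaptation-layer overhead above the cell level is not modelled.

```python
# Payload and line bandwidth of an ATM virtual circuit from its cell rate.
# An ATM cell is 53 bytes: 5 bytes of header plus 48 bytes of payload.

ATM_PAYLOAD_BYTES = 48
ATM_CELL_BYTES = 53


def atm_payload_kbit_per_sec(cells_per_sec):
    """Bandwidth available to the payload, in kbit/sec."""
    return cells_per_sec * ATM_PAYLOAD_BYTES * 8 / 1000


def atm_line_kbit_per_sec(cells_per_sec):
    """Bandwidth consumed on the line, cell headers included, in kbit/sec."""
    return cells_per_sec * ATM_CELL_BYTES * 8 / 1000


if __name__ == "__main__":
    rate = 4717  # cells/sec configured on the Cnaf - Cern circuit
    print(f"payload: {atm_payload_kbit_per_sec(rate):.0f} kbit/sec")
    print(f"line:    {atm_line_kbit_per_sec(rate):.0f} kbit/sec")
```

With 4717 cells/sec the payload rate is about 1811 kbit/sec, while the line rate including cell headers is about 2000 kbit/sec, which matches the nominal 2 Mbps of the link.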
WAN Test
Clients:
sunlab1, gsun, cmssun4, atlsun1, atlas4: Sun Ultra5/10, 333 MHz, 128 MB
vlsi06: Sun SPARC20, 125 MHz, 128 MB
monarc01: Sun Enterprise 450, 4 x 400 MHz, 512 MB
[Diagram: server sunlab1 on 1000BaseT; hosts labelled sunlab1, cmssun4, vlsi06, atlsun1, atlas4, monarc01; WAN links labelled 2 Mbps, 2 Mbps, 2 Mbps, 8 Mbps]

Site        | Access link technology to GARR-B (ATM backbone) | Access speed to GARR (Mbps) | RTT to Cnaf
Cnaf        | ATM                                             | 6M                          |
Padova      | Point-to-point                                  | 2M                          | 7 msec
Genova      | Frame-relay                                     | 2M                          | 21 msec
Milano      | ATM                                             | 8M                          | 8 msec
Cnaf - Cern | ATM                                             | 2M (4717 cells/sec)         | 51 msec
Comments
• Client CPU: never 100 % used
• Server CPU: never 100 % used
• Wall clock time for jobs running on the workstation connected via GE is very high (e.g. for 10 jobs, >1000’ versus 400’ without other clients on the WAN): do slow clients degrade performance on fast clients?
One server – many clients, LAN FE (Rome Babar Farm)
Clients and server: Sun Enterprise 450, 4 x 400 MHz, 512 MB
[Diagram: server cutter connected via 100BaseT to clients bbfarm01, bbfarm02, bbfarm03, bbfarm04]
Comments:
• Throughput decreases when there are more than 30 jobs on the server.
• Server CPU is always 100 % used.
• Starting from 40 concurrent jobs on one client, jobs start crashing.
Test results

Network speed    | Client max CPU | Client number of jobs   | Server max CPU | Server number of jobs
1000M (GE)       | 100%           | >5                      | 100%           | >50
100M (FE)        | 60%, then 20%  | Up to 30, then up to 60 | 100%           | Up to 60
10M (Eth)        | 80%            | >20                     | 30%            | >60
2M (PPP ATM WAN) | 5%             | Up to 20                | 10% (constant) | 1-20 (during the whole test)

• GE test: client CPU is saturated with 5 jobs
• FE test: server CPU is saturated with 30 jobs
• Ethernet test: network is saturated
Test results
• GE utilization is poor

Network link (speed) | Server host: max throughput | Number of jobs
1000M GEthernet      | 37Mbps                      | 20
100M FEthernet       | 80Mbps                      | 30
10M Ethernet         | 9Mbps                       | 20
2M VC ATM            | 1.7Mbps                     | 20
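The statement that GE utilization is poor follows directly from the ratio of maximum measured throughput to nominal link speed. A minimal sketch, using only the figures reported above (the function and variable names are illustrative):

```python
# Link utilization = max measured throughput / nominal link speed.
# All figures are the ones reported in the test results.

MEASUREMENTS = [
    # (link, nominal speed in Mbps, max throughput in Mbps, number of jobs)
    ("1000M GEthernet", 1000, 37, 20),
    ("100M FEthernet", 100, 80, 30),
    ("10M Ethernet", 10, 9, 20),
    ("2M VC ATM", 2, 1.7, 20),
]


def utilization_percent(nominal_mbps, measured_mbps):
    """Return the fraction of the nominal link speed in use, in percent."""
    return 100.0 * measured_mbps / nominal_mbps


if __name__ == "__main__":
    for link, nominal, measured, jobs in MEASUREMENTS:
        pct = utilization_percent(nominal, measured)
        print(f"{link:16s} {pct:5.1f}% at {jobs} jobs")
```

This gives roughly 3.7 % for Gigabit Ethernet against 80 %, 90 % and 85 % for Fast Ethernet, Ethernet and the 2 Mbps ATM circuit, so only on the Gigabit link is the network far from being the limiting resource.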
Conclusion
• Objectivity 5.1 behaviour on different network layouts:
  – Application/Objectivity on a powerful client machine uses high CPU.
  – AMS is not able to use a multi-CPU machine.
• Optimal measured values for the server correspond to 30 connections from concurrent remote jobs.
  – Too small for a production environment.
• Identified boundary conditions for efficient running with the specific CPU.
• Acceptable running conditions:
  – Server/client link minimum speed: 8 Mbps
  – Client machine: from 6 to 15 concurrent analysis jobs
  – Server: requests from 30 concurrent jobs
• Global performance degrades rapidly when moving away from the optimal conditions.
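The acceptable running conditions listed above can be expressed as a simple check. This is a hypothetical sketch and not part of any MONARC tool: the thresholds are taken from the slide, the function name is invented, and "from 6 to 15" client jobs is interpreted here as an upper limit of 15, which is an assumption.

```python
# Sketch of a check against the acceptable running conditions.
# Thresholds come from the conclusions; everything else is hypothetical.

MIN_LINK_MBPS = 8       # minimum server/client link speed
MAX_CLIENT_JOBS = 15    # assumed upper limit of the "6 to 15" range
MAX_SERVER_JOBS = 30    # concurrent jobs served by one AMS server


def violations(link_mbps, client_jobs, server_jobs):
    """Return a list of human-readable problems; empty if all is within range."""
    problems = []
    if link_mbps < MIN_LINK_MBPS:
        problems.append(f"link speed {link_mbps} Mbps below {MIN_LINK_MBPS} Mbps")
    if client_jobs > MAX_CLIENT_JOBS:
        problems.append(f"{client_jobs} client jobs exceed {MAX_CLIENT_JOBS}")
    if server_jobs > MAX_SERVER_JOBS:
        problems.append(f"{server_jobs} server jobs exceed {MAX_SERVER_JOBS}")
    return problems


if __name__ == "__main__":
    for problem in violations(link_mbps=2, client_jobs=20, server_jobs=40):
        print("WARNING:", problem)
```

Returning the list of violated conditions, rather than a single boolean, lets a monitoring loop report which resource has moved away from the optimal region.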
Future works
• Application monitoring tools able to check the working conditions in real time and to take the necessary actions to maintain the system around the optimal conditions.
• Test Objectivity 5.2 features.
• Test a multi-server configuration using a read/write application.
• Dedicated WAN test-bed with 10 Mbps bandwidth links.
• LAN/WAN behaviour with equivalent high bandwidth capacity to be investigated in depth:
  – Host tuning and RTT could impact performance
  – QoS
• Documentation: www.cern.ch/MONARC