A22 Introduction to DTrace by Kyle Hailey

Post on 06-May-2015


DTrace Introduction Kyle Hailey

Agenda
1. Intro … Me … Delphix
2. What is DTrace
3. Why DTrace
   – Make the impossible possible
   – Low overhead
4. Where DTrace can be used
5. How DTrace is used
   – Probes
   – Overhead
   – Variables
   – Resources

Kyle Hailey • OEM 10g Performance Monitoring • Visual SQL Tuning (VST) in DB Optimizer

• Delphix

Delphix

25 TB

2 TB

What is DTrace
• A way of tracing the O/S and programs
  – Making the impossible possible
• Your code is unchanged
  – Optionally add static DTrace probes
• No overhead when off
  – Turning a probe on dynamically changes the code path
• Low overhead when on
  – 1000s of events per second cause less than 1% overhead
• Event driven
  – Like Oracle events 10046, 10053

Shouting at Disks

Where can we trace
• Solaris
• OpenSolaris
• FreeBSD …
• MacOS
• Linux – announced by Oracle
• AIX – working “probevue”

What can we trace? Almost anything
– All system calls, e.g. “read”
– All kernel calls, e.g. “biodone”
– All function calls in a program
– All DTrace stable providers
  • Example: io:::start
  • Predefined stable probes
  • Non-stable probe names and arguments can change over time
– Custom probes
  • Write custom probes in programs to trace

Structure

$ cat mydtrace.d
#!/usr/sbin/dtrace -s

Name_of_something_to_trace        (called a probe)
/ filters /
{ actions }

/* additional tracing */
Something_else_to_trace
/ optional filters /
{ take some actions }

Section 1 (first block): • Probe • Filter • Clause
Section 2 (second block)
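The skeleton above can be filled in with real names. The following is a minimal sketch, not taken from the slides: the program name "cat" in the filter is an arbitrary assumption, and arg2 is assumed to be the byte count passed to write(2).

```d
#!/usr/sbin/dtrace -s
#pragma D option quiet

/*
 * probe:  entry to the write system call
 * filter: only when the calling program is named "cat" (hypothetical choice)
 * action: print the process id and the number of bytes requested
 */
syscall::write:entry
/execname == "cat"/
{
    printf("pid %d asked to write %d bytes\n", pid, arg2);
}
```

Removing the filter line would make the action run for every write call on the system, which is usually far more output than wanted.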

Event Driven
• DTrace code runs when probes fire in the OS

/usr/sbin/dtrace -n '
#pragma D option quiet
io:::start
{
    printf(" timestamp %d \n", timestamp);
}'

• Program runs until canceled

$ sudo ./mydtrace.d
 timestamp 8135515300287183
 timestamp 8135515300328512
 timestamp 8135515300346769
^C

Probe (multi-threaded, process) when this happens then:

Take action Print variable

What are these probes and variables?

io:::start
{
    printf(" timestamp %d \n", timestamp);
}

Here io:::start is the probe and timestamp is the variable.

– Probes
  • kernel and system calls
  • program function calls
  • predefined by DTrace
– Variables
  • either predefined in DTrace, like timestamp
  • or defined by the user

How to list probes? Two ways to list probes:
1. All system and kernel calls
   dtrace -l
2. All process functions
   dtrace -l pid[pid]

Output has a four-part name, colon separated: provider:module:function:name

Kernel vs User Space
[Figure: “dtrace -l” covers kernel functions and system calls; “dtrace -l pid21” covers a user-land process. User processes labeled 899, 731, 21.]

dtrace -l

$ sudo dtrace -l
   ID   PROVIDER   MODULE    FUNCTION                NAME
    1   dtrace                                       BEGIN
    2   dtrace                                       END
    3   dtrace                                       ERROR
   16   profile                                      tick-1sec
   17   fbt        klmops    lm_find_sysid           entry
   18   fbt        klmops    lm_find_sysid           return
   19   fbt        klmops    gister_share_locally    entry
(thousands of lines)

Columns: provider, module, function, name

dtrace -l: grouping probes

Provider:module:function:name

$ sudo dtrace -l | awk '{print $2 }' | sort | uniq -c | sort -nr

Count   Provider                   Area
72095   fbt                        kernel functions
 1283   sdt                        statically defined kernel probes
  629   mib                        system statistics
  473   hotspot_jni, hotspot       JVM
  466   syscall                    system calls
  173   nfsv4,nfsv3,tcp,udp,ip     network
   61   sysinfo                    kernel statistics
   55   sched                      CPU, io, scheduling
   46   fsinfo                     file system info
   41   vminfo                     memory
   40   iscsi,fc                   iscsi, fibre channel
   22   lockstat                   locks
   15   proc                       fork, exit, create
   14   profile                    timers, tick
   12   io                         io:::start, done
    3   dtrace                     BEGIN, END, ERROR

Providers: defined interfaces
Instead of tracing a kernel function, which could change between O/S versions, trace a maintained, stable probe.

https://wikis.oracle.com/display/DTrace/Providers
– I/O: io provider
– CPU: sched provider
– system calls: syscall provider
– memory: vminfo provider
– user processes: pid provider
– network: tcp provider

Provider definition files are in /usr/lib/dtrace, such as io.d, nfs.d, sched.d, tcp.d

Example Network: TCP
What if we wanted to trace received TCP traffic?

Probes have a four-part name: provider:module:function:name

$ dtrace -l | grep tcp | grep receive
tcp:ip:tcp_input_data:receive

Or look at the wiki: https://wikis.oracle.com/display/DTrace/tcp+Provider

Probe arguments: dtrace -lvn
What are the arguments for the probe “tcp:ip:tcp_input_data:receive”?

$ dtrace -lvn tcp:ip:tcp_input_data:receive
   ID   PROVIDER   MODULE   FUNCTION         NAME
 7301   tcp        ip       tcp_input_data   receive

        Argument Types
                args[0]: pktinfo_t *
                args[1]: csinfo_t *
                args[2]: ipinfo_t *
                args[3]: tcpsinfo_t *
                args[4]: tcpinfo_t *

What is “tcpsinfo_t”, for example?

Probe argument definitions
Find out what “tcpsinfo_t” is. Two ways:
1. Stable provider
   – https://wikis.oracle.com/display/DTrace/Providers
   – In our case there is a stable TCP provider: https://wikis.oracle.com/display/DTrace/tcp+Provider
2. Look at source code
   – For OpenSolaris see: http://src.illumos.org
   – Otherwise get a copy of the source
     • Load into Eclipse or similar for easy search

Let’s look up “tcpsinfo_t”

[Screenshots: src.illumos.org. Type in the variable name, then click on the link.]

Example: string tcps_raddr = remote machine’s IP address

tcpsinfo_t points to many things

Creating a Program
• Find out all the machines we are receiving TCP packets from

$ sudo ./tcpreceive.d
 address 127.0.0.1
 address 172.16.103.58
 address 127.0.0.1
 address 172.16.100.187
 address 172.16.103.58
 address 127.0.0.1
^C

$ cat tcpreceive.d
#!/usr/sbin/dtrace -s
#pragma D option quiet

/* probe: when TCP receives */
tcp:ip:tcp_input_data:receive
{
    /* action: print the remote address; args[3] is a tcpsinfo_t * */
    printf(" address %s \n", args[3]->tcps_raddr);
}
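Printing one line per packet gets noisy quickly. As a sketch of an alternative (not from the slides), the same probe can feed an aggregate keyed by remote address, so each machine appears once with a packet count. The aggregate name @packets and the column widths are illustrative choices.

```d
#!/usr/sbin/dtrace -s
#pragma D option quiet

/* count received TCP packets per remote address */
tcp:ip:tcp_input_data:receive
{
    @packets[args[3]->tcps_raddr] = count();
}

/* on Ctrl-C, print one line per remote address */
dtrace:::END
{
    printf("%-20s %8s\n", "remote address", "packets");
    printa("%-20s %@8d\n", @packets);
}
```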

Using it for TCP window sizes

ip               usend     ssz    send    recd
172.16.103.58      564   16028     564  \
172.16.103.58      696   16208     132  \
172.16.103.58     1180   16208     484  \
172.16.103.58     1664   16208     484  \
172.16.103.58     2148   16208     484  \
172.16.103.58     2148   16208          /    0
172.16.103.58     1452   16208          /    0

Columns:
– ip: remote machine
– usend: unacknowledged bytes sent
– ssz: send window bytes
– send: send bytes
– recd: receive bytes

If unacknowledged bytes sent goes above the send window, transmissions will be delayed.
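A rough sketch of how the first three columns could be produced follows. It is not the script behind the slide. It assumes the tcp provider's tcp:::send probe and the tcpsinfo_t members tcps_snxt (next sequence number to send), tcps_suna (oldest unacknowledged sequence number) and tcps_swnd (send window), none of which appear in the slides; check them against the tcp provider documentation for your release.

```d
#!/usr/sbin/dtrace -s
#pragma D option quiet

dtrace:::BEGIN
{
    printf("%-16s %8s %8s\n", "ip", "usend", "ssz");
}

/* on every TCP send, compare unacknowledged bytes with the send window */
tcp:::send
{
    /* assumed members: bytes sent but not yet acknowledged */
    this->unacked = args[3]->tcps_snxt - args[3]->tcps_suna;

    printf("%-16s %8d %8d\n",
        args[3]->tcps_raddr, this->unacked, args[3]->tcps_swnd);
}
```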

Review so far
• DTrace: trace O/S and user programs
• Solaris, and partially on Linux, among others
• Code is event driven; structure:
  – Probe
  – Optional filter
  – Action
• Get all events with “dtrace -l”
• Get event arguments with “dtrace -lvn probe”
• Get argument definitions in source or wiki

Variables
1. Globals
   • Not thread safe
   • X=1; A[1]=1;
2. Aggregates
   • Thread-safe scalars and arrays
   • Special operations: count, average, quantize
   • @ct = count(); @sm = sum(value); @sm[type] = sum(value); @agg = quantize(value);
3. self->var
   • Thread-local variable: self->x = value;
4. this->var
   • Lightweight variable for only this probe firing
   • this->x = value;
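The four kinds of variable can appear side by side in one script. This is an illustrative sketch only; the variable names (reads, requested, got) are made up, and it assumes arg2 is the requested size on read entry and arg0 is the return value on read return.

```d
#!/usr/sbin/dtrace -s
#pragma D option quiet

dtrace:::BEGIN
{
    reads = 0;                      /* 1. global: shared, not thread safe */
}

syscall::read:entry
{
    reads++;
    self->requested = arg2;         /* 3. thread-local: kept until read returns */
}

syscall::read:return
/self->requested/
{
    /* 4. clause-local: exists only for this probe firing */
    this->got = (int)arg0 > 0 ? arg0 : 0;

    /* 2. aggregates: safe to update from many CPUs at once */
    @requested[execname] = sum(self->requested);
    @returned[execname] = sum(this->got);

    self->requested = 0;            /* free the thread-local */
}

dtrace:::END
{
    printf("read calls seen: %d\n", reads);
}
```

Because the global counter is updated without any locking, its final value can undercount on a busy multi-CPU system; the aggregates do not have that problem.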

Variables: Aggregates are best

dtrace.org/blogs/brendan/2011/11/25/dtrace-variable-types/

What is an aggregate?
• Multi-CPU safe variable
• Lightweight
• Array or scalar
• Denoted by @
  – @var = function(value);
  – @var[array_index] = function(value);
• Only predefined functions, such as
  – sum() – count() – max() – quantize()***
• Print out with “printa”

Using Aggregates: count()
What program writes the most often?

syscall::write:entry
{
    @counts[execname] = count();
}

expr           72
sh            291
tee           814
make.bin     2010

Count of occurrences doing writes; execname = session

https://wikis.oracle.com/display/DTrace/Aggregations
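Other aggregating functions plug into the same pattern. The sketch below is an illustration rather than part of the talk: it assumes arg2 is the byte count passed to write, and it assumes a DTrace version whose printa accepts several aggregations with the same key in one call.

```d
#!/usr/sbin/dtrace -s
#pragma D option quiet

syscall::write:entry
{
    @calls[execname]   = count();     /* how many writes */
    @bytes[execname]   = sum(arg2);   /* total bytes requested */
    @largest[execname] = max(arg2);   /* biggest single request */
}

dtrace:::END
{
    printf("%-20s %10s %12s %10s\n", "program", "writes", "bytes", "largest");
    printa("%-20s %@10d %@12d %@10d\n", @calls, @bytes, @largest);
}
```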

Aggregate: quantize()
What if we wanted a distribution of all I/O sizes?

$ sudo dtrace -l | grep io

If that returns too many rows, limit output to specific probes with the “-ln” flag instead:

$ sudo dtrace -ln io:::
   ID   PROVIDER   MODULE    FUNCTION   NAME
 6281   io         genunix   biodone    done
 6282   io         genunix   biowait    wait-done
 6283   io         genunix   biowait    wait-start
 7868   io         nfs       nfs_bio    done
 7871   io         nfs       nfs_bio    start

nfs = NFS module; bio = block I/O

$ sudo dtrace -lvn io:genunix:biodone:done
   ID   PROVIDER   MODULE    FUNCTION   NAME
 6281   io         genunix   biodone    done

        Argument Types
                args[0]: bufinfo_t *
                args[1]: devinfo_t *
                args[2]: fileinfo_t *

What is bufinfo_t? Sounds like buffer information.

Finding what bufinfo_t points to:
– args[0] = bufinfo_t *
– bufinfo_t -> b_bcount = number of bytes
– Use in DTrace: args[0]->b_bcount

Aggregate Example: iosizes.d

#!/usr/sbin/dtrace -s
#pragma D option quiet
io:::done
{
    /* args[0]->b_bcount is the size of the I/O */
    @sizes = quantize(args[0]->b_bcount);
}

$ sudo iosizes.d
   value  --- Distribution ---   count
     256 |                       0
     512 |@@@@                   6
    1024 |@@@@                   6
    2048 |@@@@@@@@@@@@@@@@@@     31
    4096 |@@@                    5
    8192 |@@@@@                  9
   16384 |@@@@                   6
   32768 |                       0
   65536 |                       0
^C

Aggregate: iosizes.d with execname

#!/usr/sbin/dtrace -s
#pragma D option quiet
io:::done
{
    /* args[0]->b_bcount is the size of the I/O */
    @sizes[execname] = quantize(args[0]->b_bcount);
}

$ sudo iosizes.d
  sched
   value  --- Distribution ---   count
     256 |                       0
     512 |@@@@                   6
    1024 |@@@@                   6
    2048 |@@@@@@@@@@@@@@@@@@     31
    4096 |@@@                    5
    8192 |@@@@@                  9
   16384 |@@@@                   6
   32768 |                       0
^C

Only returns I/O for sched. Why?

Kernel land I/O

Kernel vs User Space
[Figure repeated: “dtrace -l” covers kernel functions and system calls; user land holds processes 899, 731, 21.]

• I/O is done in the kernel by sched, so we only see “sched”
• User programs do their I/O via a system call to the kernel, such as “read”

io:::start is in the kernel; look for the user syscall
• Look for the read system call

$ sudo dtrace -l | grep syscall | grep read
 5425   syscall   read   entry
 5426   syscall   read   return

$ sudo dtrace -lvn syscall::read:entry
   ID   PROVIDER   MODULE   FUNCTION   NAME
 5425   syscall             read       entry

        Argument Types
                None

User program system call “read”: no typed arguments are listed, so use the raw arguments
– arg0 = fd
– arg1 = *buf
– arg2 = size
Instead of args[2]->size, use arg2

Aggregate Example: readsizes.d (user land I/O)

#!/usr/sbin/dtrace -s
#pragma D option quiet
syscall::read:entry
{
    /* arg2 is the size of the I/O */
    @read_sizes[execname] = quantize(arg2);
}

  java
   value  ------------- Distribution ------------- count
    4096 |                                         0
    8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2
   16384 |                                         0

  cat
   value  ------------- Distribution ------------- count
   16384 |                                         0
   32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
   65536 |                                         0

  sshd
   value  ------------- Distribution ------------- count
    8192 |                                         0
   16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 931
   32768 |                                         0

Built-in variables
• pid – process id
• tid – thread id
• execname
• timestamp – nanoseconds
• cwd – current working directory
• Probes:
  – probeprov
  – probemod
  – probefunc
  – probename

Built-in variable examples

# cat exec.d
#!/usr/sbin/dtrace -s
syscall:::entry
{
    @num[execname, probefunc] = count();
}
dtrace:::END
{
    printa(" %-32s %-32s %@8d\n", @num);
}

# ./exec.d
dtrace: script './exec.d' matched 236 probes
 sleep        stat64              32
 vmtoolsd     pollsys             37
 java         pollsys             72
 java         lwp_cond_wait      180

– execname = program name
– probefunc = function executing; records the function that fires
– No function name in the probe (syscall:::entry) = wildcard, all matches
– Output columns: execname, function, count

Latency
Latency is crucial to performance analysis.

Latency = delta = end_time - start_time

DTrace probes come in pairs:
• entry, return
• start, done
Take the time at the beginning and the time at the end, and take the difference.

Latency: how long does I/O take?
Latency = delta = end_time - start_time
– start_time: io:::start
– end_time: io:::done

Array to hold each I/O start time:
• Array needs a unique key for each I/O
• Key could be based on (look these up in the source)
  – device = args[0]->b_edev
  – block = args[0]->b_blkno

Array: tm_start[device,block] = timestamp

Latency

#!/usr/sbin/dtrace -s
#pragma D option quiet

/* start: save the timestamp (nanoseconds) in an array indexed by device and block number */
io:::start
{
    tm_start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}

/* end: the filter only lets through I/Os whose start was recorded */
io:::done
/ tm_start[args[0]->b_edev, args[0]->b_blkno] /
{
    this->delta = (timestamp - tm_start[args[0]->b_edev, args[0]->b_blkno]);
    @io = quantize(this->delta);                          /* output array */
    tm_start[args[0]->b_edev, args[0]->b_blkno] = 0;      /* clear the timestamp array entry */
}

Other ways of keying start/end

1. We used a global array
   – tm_start[device,block] = timestamp
   – Probably the best general way
2. Some people use arg0
   – tm_start[arg0] = timestamp
   – Not as clear that this is valid
3. Others use a thread-local variable
   – self->start = timestamp;
   – This only works if the thread that fires the start probe is the same thread that fires the end probe
     • Doesn’t work for io:::start, io:::done
     • Does work for nfs:::start, nfs:::done
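The thread-local approach is easiest to see on a probe pair that always fires on the same thread. The sketch below uses the read system call's entry and return probes for that reason; it is an illustration of option 3, not a script from the talk, and the aggregate name @latency is arbitrary.

```d
#!/usr/sbin/dtrace -s
#pragma D option quiet

/* entry and return of a system call fire on the same thread */
syscall::read:entry
{
    self->start = timestamp;
}

syscall::read:return
/self->start/
{
    /* latency in nanoseconds, as a distribution per program */
    @latency[execname] = quantize(timestamp - self->start);
    self->start = 0;    /* release the thread-local variable */
}
```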

Tracing vs Profiling
Tracing
• Programs run until ^C
• Can print every probe
• At ^C all unprinted variables are printed
Profiling
• Take action every X seconds
• Special probe name: profile:::tick-1sec
• Can profile at hz or ns, us, ms, sec
  – profile:::tick-1 (Hz)
  – profile:::tick-1ms (ms)

Latency: output every second

#!/usr/sbin/dtrace -s
#pragma D option quiet

/* start: index by device and block number */
io:::start
{
    tm_start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}

/* end */
io:::done
/ tm_start[args[0]->b_edev, args[0]->b_blkno] /
{
    this->delta = (timestamp - tm_start[args[0]->b_edev, args[0]->b_blkno]);
    @io = quantize(this->delta);
    tm_start[args[0]->b_edev, args[0]->b_blkno] = 0;      /* clear */
}

/* every second: print the distribution, then clear it */
profile:::tick-1sec
{
    printa(@io);
    trunc(@io);
}

User Process Tracing
[Figure repeated: “dtrace -l” covers kernel functions and system calls; “dtrace -l pid21” covers user-land process 21 (user processes labeled 899, 731, 21).]

Tracing User Processes
• What can you trace in Oracle?
  – $ ps -ef | grep oracle
  – Get a process id
  – $ dtrace -l pid[process_id]
  – Lists program functions
• What do these functions do?
  – Source code for MySQL
  – Guess if you are on Oracle
  – Some good blogs out there
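Once a process id is known, the pid provider can count which functions that process calls. The following is a sketch under assumptions: the file name funccount.d is hypothetical, the process id is supplied with dtrace's -p option (which sets $target), and the module is restricted to a.out (the main executable) to limit the number of probes enabled.

```d
#!/usr/sbin/dtrace -s
#pragma D option quiet

/* count calls to each function in the target's main executable */
pid$target:a.out::entry
{
    @calls[probefunc] = count();
}

/* on Ctrl-C, keep only the 20 most frequently called functions */
dtrace:::END
{
    trunc(@calls, 20);
    printa("%-40s %@10d\n", @calls);
}
```

It would be run as something like `sudo ./funccount.d -p <process_id>`. A busy process can fire these probes at a very high rate, so it is worth starting with a narrow probe description on a production system.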

Overhead
User process tracing (from Brendan Gregg)
• Don't worry too much about pid provider probe cost at < 1000 events/sec.
• At > 10,000 events/sec, pid provider probe cost will be noticeable.
• At > 100,000 events/sec, pid provider probe cost may be painful.
User process probes: 2-15 us typical, could be slower

Kernel and system calls are cheaper to trace
• > 1,000,000 events/sec: 20% impact

For non-CPU workloads the impact may be greater
• TCP tests showed a 50% throughput drop at 160K events/sec
  – 40K interrupts/sec

Formatting data
Problem: formatting data is difficult in DTrace. DTrace has printf and printa (for aggregate arrays) but …
• No floating point
• No “if-then-else”, no “for-loop”
  – Only the conditional expression: type = probename == "op-write-done" ? "W" : "R";
• No way to access the index of an aggregate array (e.g. sum of time divided by sum of counts)

Solution: do formatting and calculations in perl

dtrace -n ' … ' | perl -e ' … '
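For simple cases, some formatting can stay inside D by using integer arithmetic and the conditional expression. The sketch below is an illustration, not the author's approach: it prints each read or write system call's latency in milliseconds with three decimal places by splitting a microsecond count into whole and fractional parts. Anything more involved, such as dividing one aggregate by another, still needs post-processing outside DTrace.

```d
#!/usr/sbin/dtrace -s
#pragma D option quiet

syscall::read:entry,
syscall::write:entry
{
    self->ts = timestamp;
}

syscall::read:return,
syscall::write:return
/self->ts/
{
    /* nanoseconds to microseconds; integer division, no floating point */
    this->us = (timestamp - self->ts) / 1000;

    /* the conditional expression stands in for if-then-else */
    printf("%s %d.%03d ms\n",
        probefunc == "write" ? "W" : "R",
        this->us / 1000, this->us % 1000);

    self->ts = 0;
}
```

This prints a line per system call, so it is only practical with a filter (for example on execname) or on a quiet system.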

Summary
• Structure
  #!/usr/sbin/dtrace -s
  Name_of_something_to_trace
  / filters /
  { actions }
• List of probes
  dtrace -l
• Arguments to probes
  dtrace -lvn prov:mod:func:name
• Look up args in source code: http://src.illumos.org
• Use aggregates (@); they make DTrace easy
• Google DTrace
  – Find example programs

Resources
• Oracle Wiki
  – wikis.oracle.com/display/Dtrace
• DTrace book
  – www.dtracebook.com
• Brendan Gregg’s blog
  – dtrace.org/blogs/brendan/
• Oracle examples
  – alexanderanokhin.wordpress.com/2011/11/13
  – andreynikolaev.wordpress.com/2010/10/28/
  – blog.tanelpoder.com/2009/04/24