Debugging Kate Hedstrom August 2006. Overview Think before coding Common mistakes Defensive...

Post on 17-Jan-2018


Debugging

Kate Hedstrom, August 2006

Overview
• Think before coding
• Common mistakes
• Defensive programming
• Core files
• Interactive debugging
• Other tips
• Parallel bug story
• Demo

Before Programming
• Think about the structure of the program you are writing
• What are the data structures?
• Careful planning can lead to programs that are:
  – Easier to debug
  – Easier to understand later when modifications prove necessary
• There’s a whole industry around tools for program design

Various Problems
• Failure to compile
  – The first compiler message is valid; the rest could be due to confusion caused by the first error
• Failure to link
  – Missing routines
  – Missing libraries
• Failure to run
• Runs but gives the wrong answer

Some Common Mistakes
• Wrong number or type of arguments
• Misspelled variables
• Uninitialized variables
• Failure to match up do/if and the end do/if
• Index out of range
• Array size too small
• Parallel bugs

Defensive programming
• Let the compiler help you find problems
• Implicit none (Fortran)
• Use modules or interface blocks to let the compiler check the argument count/type for you (Fortran 90)
• Check error codes on function calls
• Write useful comments!
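
As a minimal sketch of "check error codes on function calls" in C, the helper below tests the result of every standard library call that can fail and reports which file was involved. The function name count_lines and its return convention are assumptions made for this illustration, not something from the slides.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: count the newline characters in a file.
 * Returns the count, or -1 after printing a message to stderr. */
long count_lines(const char *path)
{
    assert(path != NULL);           /* a NULL path is a programming error */

    FILE *fp = fopen(path, "r");
    if (fp == NULL) {
        fprintf(stderr, "count_lines: cannot open %s: %s\n",
                path, strerror(errno));
        return -1;
    }

    long lines = 0;
    int c;
    while ((c = fgetc(fp)) != EOF) {
        if (c == '\n')
            lines++;
    }

    if (ferror(fp)) {               /* EOF can also signal a read error */
        fprintf(stderr, "count_lines: read error on %s\n", path);
        fclose(fp);
        return -1;
    }
    if (fclose(fp) != 0) {
        fprintf(stderr, "count_lines: close failed on %s: %s\n",
                path, strerror(errno));
        return -1;
    }
    return lines;
}
```

A caller can then test for -1 and stop early, much as the Fortran example on the Messages slide sets exit_flag and returns.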

Messages
• Assert (C/C++)
    #include <assert.h>
    assert(g == 9.8);
• Fortran example:
    GET_2DFLD - unable to find requested variable:
    In input file: /wrkdir/kate/….
    ERROR: Abnormal termination: NetCDF INPUT
    REASON: No error

    if (.not. Got_var) then
      write (stdout,10) trim(Vname(1,ifield)), trim(ncfile)
      exit_flag = 2
      return
    end if
    status = nf_open(trim(ncfile), nf_nowrite, ncid)
    if (status .ne. nf_noerr) then
      write (stdout, 20) trim(ncfile)
      exit_flag = 3
      ioerror = status
      return
    end if
    10 format(/, 'GET_2DFLD - unable to find …)
    20 format(/, 'GET_2DFLD - unable to open NetCDF file:', a)

• C example:
    if (init_graph(graph) == OKAY) {
      while ((count < MAX_EDGES) && !ferror(source) && !feof(source)) {
        if (fgets(line, MAX_LEN, source) != NULL) {
          linenum++;
          if (sscanf(line, "%d %d: %d\n" …) == 3) {
            :
          } else {
            fprintf(stderr, "%s[%s()] Error: sscanf couldn't parse line #%d\n",
                    progname, proc, linenum);
            fprintf(stderr, "line = \"%s\"\n", line);
            return(-2);
          }
        }
      }
    }

Modular Programming and Testing
• Write programs in components or modules
• Test them individually
• There are “test harnesses” for creating and managing tests
  – Many GNU programs can be tested with “make check” after the “make”
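
To make "test them individually" concrete, here is a minimal sketch of a hand-rolled test harness in C. The component under test (clamp), the CHECK macro, and test_clamp are all hypothetical names invented for this illustration. Real projects often use an existing test framework instead.

```c
#include <assert.h>
#include <stdio.h>

/* Component under test (hypothetical): clamp x to the range [lo, hi]. */
double clamp(double x, double lo, double hi)
{
    assert(lo <= hi);
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

/* Tiny harness: count failures instead of aborting on the first one. */
static int failures = 0;

#define CHECK(cond)                                              \
    do {                                                         \
        if (!(cond)) {                                           \
            fprintf(stderr, "%s:%d: check failed: %s\n",         \
                    __FILE__, __LINE__, #cond);                  \
            failures++;                                          \
        }                                                        \
    } while (0)

/* Returns the number of failed checks; 0 means the component passed. */
int test_clamp(void)
{
    failures = 0;
    CHECK(clamp(5.0, 0.0, 10.0) == 5.0);
    CHECK(clamp(-1.0, 0.0, 10.0) == 0.0);
    CHECK(clamp(11.0, 0.0, 10.0) == 10.0);
    CHECK(clamp(0.0, 0.0, 0.0) == 0.0);
    return failures;
}
```

A small main that returns nonzero when test_clamp() reports failures could be wired to a "make check" target, so the tests run with one command after every build.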

Other tips
• Check cpp labels:
  – ifdefs
  – ifnames
• Bounds checking (-C)
• Floating point trap (-qflttrap=enable:invalid:imprecise on IBM)
• Try another compiler - and write portable code
• There is no shame in using print statements
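
One way to combine cpp labels with print statements is a trace macro that compiles to nothing unless a label is defined. The sketch below assumes a hypothetical label DEBUG_TRACE and a hypothetical function sum_array. Neither comes from the slides.

```c
#include <assert.h>
#include <stdio.h>

/* DEBUG_TRACE is a hypothetical cpp label; build with -DDEBUG_TRACE
 * to turn the print statements on. */
#ifdef DEBUG_TRACE
#define TRACE(...)                                              \
    do {                                                        \
        fprintf(stderr, "%s:%d: ", __FILE__, __LINE__);         \
        fprintf(stderr, __VA_ARGS__);                           \
        fputc('\n', stderr);                                    \
    } while (0)
#else
#define TRACE(...) do { } while (0)
#endif

/* Sum the first n elements of a, tracing each partial sum when enabled. */
double sum_array(const double *a, int n)
{
    assert(a != NULL && n >= 0);
    double total = 0.0;
    for (int i = 0; i < n; i++) {
        total += a[i];
        TRACE("i=%d a[i]=%g total=%g", i, a[i], total);
    }
    return total;
}
```

Because the result depends on which labels were defined at compile time, it is worth confirming the label spelling in the build before trusting the absence of output.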

Core files
• Contain a binary dump of your program as it crashed
• You can extract a stack trace from them
• If you recompile with -g, you might have enough info in the core to solve your problem
• Check your limits - you might be truncating your cores
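
The core-size limit can also be checked from inside a program. The sketch below uses the POSIX getrlimit call with RLIMIT_CORE. The function name core_limit_status and its return values are assumptions made for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Report the soft core-file size limit for this process.
 * Returns 1 if core files are unlimited, 0 if they are limited
 * (possibly to zero bytes), and -1 if the limit could not be read. */
int core_limit_status(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("getrlimit(RLIMIT_CORE)");
        return -1;
    }
    if (rl.rlim_cur == RLIM_INFINITY) {
        printf("core file size: unlimited\n");
        return 1;
    }
    printf("core file size limited to %llu bytes\n",
           (unsigned long long)rl.rlim_cur);
    return 0;
}
```

From the shell, "ulimit -c" (sh/bash) or "limit coredumpsize" (csh/tcsh) reports the same setting.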

Causes of Core Files
• Not enough memory
• Segmentation violation
  – Not enough stack space
  – Wrong number of function arguments
• Floating point error if not using IEEE standard
• I/O error

More on Core Files
• Running “file” on it will tell you the executable name:
    % file core
    core: AIX core file fulldump 64-bit, ncra
• I prefer dbx to totalview on core files:
    % dbx ncra core
    (dbx) where
    abort() at 0x9000…
    nco_exit(??), line 28 in nco_ctl.c
    main(argc = 47, argv = 0x0fff….), line 548 in ncra.c

Interactive Debuggers
• Totalview:
  – Is on both Cray and IBM
  – Has a GUI
  – Works for parallel programs
  – Is worth learning
  – Isn’t my favorite debugger
• Text-based debuggers:
  – dbx, gdb, etc.
  – Some have had GUI wrappers (xxgdb, for instance)

Debugger Uses
• Finding bugs
• Help to understand the code
  – Watch variables change
  – Watch the flow control
  – Perl debugger helped me learn Perl

Debugger Features
• Set breakpoints
• Execute:
  – run/go
  – step
  – next
• View variables
• Works for each process/thread
• Debug the serial version first!

Tips for Totalview on IBM
• Use -qfullpath as well as -g compiler option
• When your program reads from standard input, invoke as:
    totalview roms < roms.in
• Doesn’t work right with -q64 or -qflttrap on IBM

dbx/gdb Commands
• help - list of commands
• where - stack trace, call trace
• print - give the value of an expression
• break/stop in/stop at - set a breakpoint
• run - start execution until first breakpoint
• cont - continue to next breakpoint
• step - step into function
• next - execute next command
• list - list source code for next ten lines
• quit - how to get out

Debugger Caveats
• Debuggers have bugs too
• Debugger developers code in C/C++ and don’t focus on Fortran
• If you don’t know where the problem is, you can spend an awful lot of time in the debugger

Miscellaneous
• Any parallel program should be compared to the serial version
• Did you overflow your quota?
• Can the processor see the filesystem?
• Try recompiling after “make clean”
• Are you solving the equations you think you’re solving?

Compiler Bugs
• More common than you might think
• Again, try other compilers
• Try turning off optimization
• I once had a situation where adding a print statement made the problem go away
• Auto-parallel compilers are especially buggy

Parallel Bug Story

• It’s always a good idea to compare the serial and parallel runs

• I can plot the difference field between the two outputs

• I can create a differences file with ncdiff (part of NCO)

Differences after a Day

Differences after one step - in a part of the domain without ice

What’s up?

• A variable was not being initialized properly - “if” statement without an “else”

• Both serial and parallel values are random junk

• Fixing this did not fix the one-day plot
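
The "if" without an "else" pattern is easy to reproduce. The sketch below is a generic C illustration with hypothetical names (scaled_excess and its arguments). It is not the model code from the story. The buggy form is shown only in a comment so that the example itself stays well-defined.

```c
#include <assert.h>

/* Buggy pattern: 'result' is assigned only when the condition holds, so
 * the other path returns whatever happened to be in memory:
 *
 *     double result;
 *     if (x > threshold) result = scale * (x - threshold);
 *     return result;
 */

/* Fixed version: every path assigns a value. */
double scaled_excess(double x, double threshold, double scale)
{
    double result;
    if (x > threshold) {
        result = scale * (x - threshold);
    } else {
        result = 0.0;   /* the branch that was missing */
    }
    return result;
}
```

Because the junk can differ from run to run and from process to process, this kind of bug shows up as a serial/parallel difference even though neither answer is right.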

Differences after a few steps - guess where the tile boundaries are

What was That?
• The ocean code does a check for water colder than the local freezing point

• It then forms ice and tells the ice model about the new ice

• It adjusts the local temperature and salinity to account for the ice growth (warmer and saltier)

• It failed to then update the salinity and temperature ghost points
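
The missing ghost-point update can be sketched in one dimension. The code below is an illustration under stated assumptions: two tiles held as plain arrays in a single process, one ghost point per side, and hypothetical names (NLOCAL, exchange_ghosts, adjust_interior). A real parallel model would do the exchange with message passing, and its adjustment would be the ice-formation step rather than the simple floor used here.

```c
#include <assert.h>

#define NLOCAL 4                 /* interior points per tile (assumption) */
#define NTOT   (NLOCAL + 2)      /* plus one ghost point on each side */

/* Copy edge interior values into the neighbor's ghost points.
 * 'left' and 'right' are adjacent tiles of a 1-D domain. */
void exchange_ghosts(double left[NTOT], double right[NTOT])
{
    left[NTOT - 1] = right[1];        /* left tile's east ghost  */
    right[0]       = left[NLOCAL];    /* right tile's west ghost */
}

/* Stand-in for a local adjustment: raise any interior value below
 * 'floor_val' up to it. Returns the number of points changed. */
int adjust_interior(double tile[NTOT], double floor_val)
{
    int changed = 0;
    for (int i = 1; i <= NLOCAL; i++) {
        if (tile[i] < floor_val) {
            tile[i] = floor_val;
            changed++;
        }
    }
    return changed;
}
```

If adjust_interior changes a point next to a tile edge and exchange_ghosts is not called afterwards, the neighbor keeps computing with the old value, and the error appears along the tile boundaries.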

More…
• Plotting the differences in surface temperature after one step failed to show this

• The change was very small and the single precision plotting code couldn’t catch it

• Differences did show up in timestep two of the ice variables

• Running ncdiff on the first step, then asking for the min/max values in temperature showed a problem
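
The min/max check on a difference field is simple enough to do by hand in double precision. The sketch below is not how ncdiff works internally. It is a hypothetical helper (compare_fields, diff_stats) that takes two fields already read into memory as flat arrays.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double min_diff;    /* most negative a[k] - b[k] */
    double max_diff;    /* most positive a[k] - b[k] */
    size_t worst;       /* index of the largest |a[k] - b[k]| */
} diff_stats;

/* Compare two fields of n points each, keeping full double precision. */
diff_stats compare_fields(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL && n > 0);
    double d0 = a[0] - b[0];
    diff_stats s = { d0, d0, 0 };
    double worst_abs = d0 < 0.0 ? -d0 : d0;
    for (size_t k = 1; k < n; k++) {
        double d = a[k] - b[k];
        double ad = d < 0.0 ? -d : d;
        if (d < s.min_diff) s.min_diff = d;
        if (d > s.max_diff) s.max_diff = d;
        if (ad > worst_abs) { worst_abs = ad; s.worst = k; }
    }
    return s;
}
```

For a 2-D field stored row by row with nx points per row, the worst point is at i = worst % nx and j = worst / nx. Any nonzero min or max flags a difference, however small, which a single precision plot may not show.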

Debugging
• I didn’t then know how to use totalview in parallel (fixed!)
• I don’t have good luck with totalview and 64-bit code
• Enclosing print statements inside if statements keeps every process from printing, and from possibly trying to print out-of-range values
• Find the i,j value of the worst point from the diff file, then print just that point for many fields
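
A guarded print can be sketched as follows. The tile_range layout (inclusive global index bounds per tile) and the function names are assumptions made for this illustration. The idea is that only the process owning the point prints, and no process indexes outside its own arrays.

```c
#include <assert.h>
#include <stdio.h>

/* Index range owned by one tile (hypothetical layout, inclusive bounds). */
typedef struct { int istr, iend, jstr, jend; } tile_range;

/* Return 1 if global point (i, j) lies inside this tile. */
int owns_point(const tile_range *t, int i, int j)
{
    assert(t != NULL);
    return i >= t->istr && i <= t->iend && j >= t->jstr && j <= t->jend;
}

/* Print one value only from the tile that owns (i, j).
 * Returns 1 if the value was printed, 0 otherwise. */
int print_if_owned(const tile_range *t, int i, int j,
                   const char *name, double value)
{
    if (!owns_point(t, i, j))
        return 0;
    printf("%s(%d,%d) = %.17g\n", name, i, j, value);
    return 1;
}
```

Printing with %.17g keeps every digit of a double, so very small serial/parallel differences remain visible when the two outputs are compared.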

Last Word

• In my field, it is the problems that blow up right away that are the easiest to fix. You can see things go bad in the debugger, perhaps in the very first timestep. The problems that blow up after days and days of cpu time are more challenging and might require a complete rewrite of the model.