
Don't Panic

The Hitchhiker's Guide to the Galaxy revealed that 42 is what you get when you multiply 6 by 9 (which explains why things keep going wrong). Now that we all know that, we can discuss what to do in one of those situations when things go wrong. What do you do when something goes wrong with your parallel program? As The Hitchhiker's Guide to the Galaxy tells us, Don't Panic! The IBM Parallel Environment for AIX provides a variety of ways to identify and correct problems that may arise when you're developing or executing your parallel program. This all depends on where in the process the problem occurred and what the symptoms are.

This chapter is probably more useful if you use it in conjunction with IBM Parallel Environment for AIX: Operation and Use, Vol. 1 and IBM Parallel Environment for AIX: Operation and Use, Vol. 2, so you might want to go find them and keep them on hand for reference.
Note: The sample programs in this chapter are provided in full and are available from the IBM RS/6000 World Wide Web site. See "Getting the Books and the Examples Online" for more information.

Before continuing, let's stop and think about the basic process of creating a parallel program. Here are the steps (greatly abbreviated):

  1. Create and compile program
  2. Start PE
  3. Execute the program
  4. Verify the output
  5. Optimize the performance

As with any process, problems can arise in any one of these steps, and different tools are required to identify, analyze and correct the problem. Knowing the right tool to use is the first step in fixing the problem. The remainder of this chapter tells you about some of the common problems you might run into, and what to do when they occur. The sections in this chapter are labeled according to the symptom you might be experiencing.


Messages

Message Catalog Errors

Messages are an important part of diagnosing problems, so it's essential that you not only have access to them, but that they are at the correct level. In some cases, you may get message catalog errors. This usually means that the message catalog couldn't be located or loaded. Check that your NLSPATH environment variable includes the path where the message catalog is located. Generally, the message catalog will be in /usr/lib/nls/msg/C, the value of the NLSPATH environment variable includes /usr/lib/nls/msg/%L/%N, and the LANG environment variable is set to En_US. If the message catalogs are not in the proper place, or your environment variables are not set properly, your System Administrator can probably help you. There's really no point in going on until you can read the real error messages!
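Before going to your System Administrator, you can do a quick check of these settings yourself from the shell. This is just a sketch, assuming the typical values described above; your site's paths may differ:

$ echo $LANG                                       # typically En_US
$ echo $NLSPATH                                    # should include /usr/lib/nls/msg/%L/%N
$ ls /usr/lib/nls/msg/C                            # the message catalogs are generally here
$ export NLSPATH=/usr/lib/nls/msg/%L/%N:$NLSPATH   # add the path if it's missing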

The following are the IBM Parallel Environment for AIX message catalogs:

Finding PE Messages

There are a number of places that you can find PE messages:

Logging POE Errors to a File

You can also specify that diagnostic messages be logged to a file in /tmp on each of the remote nodes of your partition by using the MP_PMDLOG environment variable. The log file is called /tmp/mplog.pid.taskid, where pid is the process id of the Partition Manager daemon (pmd) that was started in response to the poe command, and taskid is the task number. This file contains additional diagnostic information about why the user connection wasn't made. If the file isn't there, then pmd didn't start. Check the /etc/inetd.conf and /etc/services entries and the executability of pmd for the root user ID again.
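For example, to turn pmd logging on for a quick test run, you might do something like the following (the poe command line here is only an illustration):

$ export MP_PMDLOG=yes
$ poe hostname -procs 2

Then, on each remote node in your partition, look for the log files:

$ ls -l /tmp/mplog.*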

For more information about the MP_PMDLOG environment variable, see IBM Parallel Environment for AIX: Operation and Use, Vol. 1.

Message Format

Knowing which component a message is associated with can be helpful, especially when trying to resolve a problem. As a result, PE messages include prefixes that identify the related component. The message identifiers for the PE components are as follows.

0029-nnnn
pdbx

0030-nnnn
pedb

0031-nnnn
Parallel Operating Environment

0031-A4nn
Program Marker Array

0032-nnnn
Message Passing Library

0033-1nnn
Visualization Tool - Performance Monitor

0033-2nnn
Visualization Tool - Trace Visualization

0033-3nnn
Visualization Tool - Trace Collection

0033-4nnn
Visualization Tool - Widget

2537-nnn
Xprofiler X-Windows Performance Profiler

where the n's represent the digits of the specific message number.

For more information about PE messages, see IBM Parallel Environment for AIX: Messages.

Note that you might find it helpful to run POE, the parallel debugger, or the Visualization Tool as you use this chapter.

Diagnosing Problems Using the Install Verification Program

The Installation Verification Program (IVP) can be a useful tool for diagnosing problems. When you installed POE, you verified that everything turned out alright by running the IVP. It verified that the:

The IVP can provide some important first clues when you experience a problem, so you may want to rerun this program before you do anything else. For more information on the IVP, see Appendix C. "Installation Verification Program Summary" or IBM Parallel Environment for AIX: Installation Guide.


Can't Compile a Parallel Program

Programs for the IBM Parallel Environment for AIX must be compiled with the current release of the PE compiler scripts (mpxlf, mpcc, mpCC, and so on). If the command you're trying to use cannot be found, make sure the installation was successful and that your PATH environment variable contains the path to the compiler scripts. These scripts call the Fortran, C, and C++ compilers respectively, so you also need to make sure the underlying compiler is installed and accessible. Your System Administrator or local AIX guru should be able to assist you in verifying these things.
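If you want to verify this yourself, a couple of quick checks might look like the following (Korn shell shown; the fileset and directory names are the usual ones, but your installation may differ):

$ whence mpcc mpCC mpxlf              # are the compiler scripts in your PATH?
$ ls /usr/lpp/ppe.poe/bin             # the scripts are normally installed here
$ lslpp -l "ppe.poe*"                 # is the Parallel Environment software installed?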


Can't Start a Parallel Job

Once your program has been successfully compiled, you either invoke it directly or start the Parallel Operating Environment (POE) and then submit the program to it. In both cases, POE is started to establish communication with the parallel nodes. Problems that can occur at this point include:

These problems can be caused by other problems on the home node (where you're trying to submit the job), on the remote parallel nodes, or in the communication subsystem that connects them. You need to make sure that all the things POE expects to be set up really are. Here's what you do:

  1. Make sure you can execute POE. If you're a Korn shell user, type:
    $ whence poe
    

    If you're a C shell user, type:

    $ which poe
    

    If the result is just the shell prompt, you don't have POE in your path. It might mean that POE isn't installed, or that your path doesn't point to it. Check that the file /usr/lpp/ppe.poe/bin/poe exists and is executable, and that your PATH includes /usr/lpp/ppe.poe/bin.

  2. Type:
    $ env | grep MP_
    

    Look at the settings of the environment variables beginning with MP_ (the POE environment variables). Check their values against what you expect, particularly the MP_HOSTFILE (where the list of remote host names is to be found), MP_RESD (whether a job management system is to be used to allocate remote hosts), and MP_RMPOOL (the pool from which the job management system is to allocate remote hosts) values. If they're all unset, make sure you have a file named host.list in your current directory. This file must include the names of all the remote parallel hosts that can be used. There must be at least as many hosts available as the number of parallel processes you specified with the MP_PROCS environment variable.

  3. Type:
    $ poe -procs 1
    

    You should get the following message:

     
     
         0031-503   Enter program name (or quit): _
     
    

    If you do, POE has successfully loaded, established communication with the first remote host in your host list file, validated your use of that remote host, and is ready to go to work. If you type any AIX command, for example, date, hostname, or env, you should get a response when the command executes on the remote host (like you would from rsh).

    If you get some other set of messages, then the message text should give you some idea of where to look. Some common situations include:


Can't Execute a Parallel Program

Once POE can be started, you'll need to consider the problems that can arise in running a parallel program, specifically initializing the message passing subsystem. The way to eliminate this initialization as the source of POE startup problems is to run a program that does not use message passing. As discussed in "Running POE", you can use POE to invoke any AIX command or serial program on remote nodes. If you can get an AIX command or simple program, like Hello, World!, to run under POE, but a parallel program doesn't, you can be pretty sure the problem is in the message passing subsystem. The message passing subsystem is the underlying implementation of the message passing calls used by a parallel program (in other words, an MPI_Send). POE code that's linked into your executable by the compiler script (mpcc, mpCC, mpxlf, mpcc_r, mpCC_r, mpxlf_r) initializes the message passing subsystem.

The Parallel Operating Environment (POE) supports two distinct communication subsystems, an IP-based system, and User Space optimized adapter support for the SP Switch. The subsystem choice is normally made at run time, by environment variables or command line options passed to POE. Use the IP subsystem for diagnosing initialization problems before worrying about the User Space (US) subsystem. Select the IP subsystem by setting the environment variables:

$ export MP_EUILIB=ip
$ export MP_EUIDEVICE=en0

Use specific remote hosts in your host list file and don't use the Resource Manager (set MP_RESD=no). If you don't have a small parallel program around, recompile hello.c as follows:

$ mpcc -o hello_p hello.c

and make sure that the executable is loadable on the remote host that you are using.
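If you don't have a copy of hello.c handy, any program this simple will do. Here's a minimal sketch (it's not necessarily the version shipped with the PE examples, but it produces the same Hello, World! output used below):

/* hello.c - minimal message passing program used to test the subsystem */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);      /* initialize the message passing subsystem */
    printf("Hello, World!\n");
    MPI_Finalize();              /* shut the subsystem down cleanly */
    return 0;
}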

Type the following command, and then look at the messages on the console:

$ poe hello_p -procs 1 -infolevel 4

If the last message that you see looks like this:

Calling mpci_connect

and there are no further messages, there's an error in opening a UDP socket on the remote host. Check to make sure that the IP address of the remote host is correct, as reported in the informational messages printed out by POE, and perform any other IP diagnostic procedures that you know of.

If you get

Hello, World!

then the communication subsystem has been successfully initialized on the one node and things ought to be looking good. Just for kicks, make sure there are two remote nodes in your host list file and try again with

$ poe hello_p -procs 2

If and when hello_p works with IP and device en0 (the Ethernet), try again with the SP Switch.

Suffice it to say that each SP node has one name that it is known by on the Ethernet LAN it is connected to and another name it is known by on the SP Switch. If the node name you use is not the proper name for the network device you specify, the connection will not be made. You can put the names in your host list file. Otherwise you will have to use LoadLeveler or the Resource Manager to locate the nodes.

For example,

$ export MP_RESD=yes
$ export MP_EUILIB=ip
$ export MP_EUIDEVICE=css0
$ poe hello_p -procs 2 -ilevel 2

where css0 is the switch device name.

Look at the console lines containing the string init_data. These identify the IP address that is actually being used for message passing (as opposed to the IP address that is used to connect the home node to the remote hosts). If these aren't the switch IP addresses, check the LoadLeveler or Resource Manager configuration and the switch configuration.

Once IP works, and you're on an SP machine, you can try message passing using the User Space device support. Note that while LoadLeveler allows you to run multiple tasks over the switch adapter while in User Space, the Resource Manager will not. If you're using the Resource Manager, User Space support is accomplished by dedicating the switch adapter on a remote host to one specific task. The Resource Manager controls which remote hosts are assigned to which users.

You can run hello_p with the User Space library by typing:

$ export MP_RESD=yes
$ export MP_EUILIB=us
$ export MP_EUIDEVICE=css0
$ poe hello_p -procs 2 -ilevel 6

The console log should inform you that you're using User Space support, and that LoadLeveler or the Resource Manager is allocating the nodes for you. This happens a little differently depending on whether you're using LoadLeveler or the Resource Manager to manage your jobs. LoadLeveler will tell you it can't allocate the requested nodes if someone else is already running on them and has requested dedicated use of the switch, or if User Space capacity has been exceeded. The Resource Manager, on the other hand, will tell you that it can't allocate the requested nodes if someone else is already running on them.

So, what do you do now? You can try for other specific nodes, or you can ask LoadLeveler or the Resource Manager for non-specific nodes from a pool, but by this time, you're probably far enough along that we can just refer you to IBM Parallel Environment for AIX: Operation and Use, Vol. 1.

If you get a message that says POE can't load your program, and it mentions the symbol pm_exit_value, you are not loading POE's modification of the C run-time library. Make sure that the files /usr/lpp/ppe.poe/lib/libc.a and /usr/lpp/ppe.poe/lib/libc_r.a exist, and that the library search path (composed from MP_EUILIBPATH, MP_EUILIB, and your LIBPATH environment variable) finds these versions.
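A quick way to check for POE's versions of the C run-time library, and to see what library search path your executable will actually use, is sketched below (hello_p is the test program built earlier; your variable settings will differ):

$ ls -l /usr/lpp/ppe.poe/lib/libc.a /usr/lpp/ppe.poe/lib/libc_r.a
$ echo "$MP_EUILIBPATH $MP_EUILIB $LIBPATH"
$ dump -H hello_p                     # shows the library search path recorded in the executable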


The Program Runs But...

Once you've gotten the parallel application running, it would be nice if you were guaranteed that it would run correctly. Unfortunately, this is not the case. In some cases, you may get no output at all, and your challenge is to figure out why not. In other cases, you may get output that's just not correct and, again, you must figure out why it isn't.

The Parallel Debugger is Your Friend

An important tool in analyzing your parallel program is the PE parallel debugger (pedb or pdbx). In some situations, using the parallel debugger is no different than using a debugger for a serial program. In others, however, the parallel nature of the problem introduces some subtle and not-so-subtle differences which you should understand in order to use the debugger efficiently. While debugging a serial application, you can focus your attention on the single problem area. In a parallel application, not only must you shift your attention between the various parallel tasks, you must also consider how the interaction among the tasks may be affecting the problem.

The Simplest Problem

The simplest parallel program to debug is one where all the problems exist in a single task. In this case, you can unhook all the other tasks from the debugger's control and use the parallel debugger as if it were a serial debugger. However, in addition to being the simplest case, it is also the most rare.

The Next Simplest Problem

The next simplest case is one where all the tasks are doing the same thing and they all experience the problem that is being investigated. In this case, you can apply the same debug commands to all the tasks, advance them in lockstep and interrogate the state of each task before proceeding. In this situation, you need to be sure to avoid debugging-introduced deadlocks. These are situations where the debugger is trying to single-step a task past a blocking communication call, but the debugger has not stepped the sender of the message past the point where the message is sent. In these cases, control will not be returned to the debugger until the message is received, but the message will not be sent until control returns to the debugger. Get the picture?

OK, the Worst Problem

The most difficult situation to debug and also the most common is where not all the tasks are doing the same thing and the problem spans two or more tasks. In these situations, you have to be aware of the state of each task, and the interrelations among tasks. You must ensure that blocking communication events either have been or will be satisfied before stepping or continuing through them. This means that the debugger has already executed the send for blocking receives, or the send will occur at the same time (as observed by the debugger) as the receive. Frequently, you may find that tracing back from an error state leads to a message from a task that you were not paying attention to. In these situations, your only choice may be to re-run the application and focus on the events leading up to the send.

It Core Dumps

If your program creates a core dump, POE saves a copy of the core file so you can debug it later. Unless you specify otherwise, POE saves the core file in the coredir.taskid directory, under the current working directory, where taskid is the task number. For example, if your current directory is /u/mickey, and your application creates a core dump (segmentation fault) while running on the node that is task 4, the core file will be located in /u/mickey/coredir.4 on that node.

You can control where POE saves the core file by using the -coredir POE command line option or the MP_COREDIR environment variable.
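For example, either of the following would redirect the core output of a run of your program (my_prog and the directory name are placeholders; see IBM Parallel Environment for AIX: Operation and Use, Vol. 1 for the exact semantics of MP_COREDIR):

$ export MP_COREDIR=/u/mickey/my_cores
$ my_prog -procs 4

or, equivalently, on the command line:

$ my_prog -procs 4 -coredir /u/mickey/my_cores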

Debugging Core Dumps

There are two ways you can use core dumps to find problems in your program. After running the program, you can examine the resulting core file to see if you can find the problem. Or, you can try to view your program state by catching it at the point where the problem occurs.

Examining Core Files

Before you can debug a core file, you first need to get one. In our case, let's just generate it. The example we'll use is an MPI program in which even-numbered tasks pass the answer to the meaning of life to odd-numbered tasks. It's called bad_life.c, and here's what it looks like:

/*******************************************************************
*
* bad_life program
 
* To compile:
* mpcc -g -o bad_life bad_life.c
*
*******************************************************************/
 
#include <stdio.h>
#include <stdlib.h>     /* malloc(), free(), exit() */
#include <string.h>     /* strcpy() */
#include <mpi.h>
 
void main(int argc, char *argv[])
{
        int  taskid;
        MPI_Status  stat;
 
        /* Find out number of tasks/nodes. */
        MPI_Init( &argc, &argv);
        MPI_Comm_rank( MPI_COMM_WORLD, &taskid);
 
        if ( (taskid % 2) == 0 )
        {
                char *send_message = NULL;
 
                send_message = (char *) malloc(10);
                strcpy(send_message, "Forty Two");
                MPI_Send(send_message, 10, MPI_CHAR, taskid+1, 0,
                        MPI_COMM_WORLD);
                free(send_message);
        } else
        {
                char *recv_message = NULL;
 
                MPI_Recv(recv_message, 10, MPI_CHAR, taskid-1, 0,
                MPI_COMM_WORLD, &stat);
                printf("The answer is  %s\n", recv_message);
                free(recv_message);
        }
                printf("Task %d complete.\n",taskid);
                MPI_Finalize();
                exit(0);
}

We compiled bad_life.c with the following parameters:

$ mpcc -g bad_life.c -o bad_life

and when we run it, we get the following results:

$ export MP_PROCS=4
$ export MP_LABELIO=yes
$ bad_life
  0:Task 0 complete.
  2:Task 2 complete.
ERROR: 0031-250  task 1: Segmentation fault
ERROR: 0031-250  task 3: Segmentation fault
ERROR: 0031-250  task 0: Terminated
ERROR: 0031-250  task 2: Terminated

As you can see, bad_life gets two segmentation faults, which generate two core files. If we list our current directory, we can indeed see two core files: one for task 1 and the other for task 3.

$ ls -lR
total 88
-rwxr-xr-x   1 hoov     staff       8472 May 02 09:14 bad_life
-rw-r--r--   1 hoov     staff        928 May 02 09:13 bad_life.c
drwxr-xr-x   2 hoov     staff        512 May 02 09:01 coredir.1
drwxr-xr-x   2 hoov     staff        512 May 02 09:36 coredir.3
-rwxr-xr-x   1 hoov     staff       8400 May 02 09:14 good_life
-rw-r--r--   1 hoov     staff        912 May 02 09:13 good_life.c
-rw-r--r--   1 hoov     staff         72 May 02 08:57 host.list
./coredir.1:
total 48
-rw-r--r--   1 hoov     staff      24427 May 02 09:36 core
 
./coredir.3:
total 48
-rw-r--r--   1 hoov     staff      24427 May 02 09:36 core

So, what do we do now? Let's run dbx on one of the core files to see if we can find the problem. You run dbx like this:

$ dbx bad_life coredir.1/core
 
Type 'help' for help.
reading symbolic information ...
[using memory image in coredir.1/core]
 
Segmentation fault in moveeq.memcpy [/usr/lpp/ppe.poe/lib/ip/libmpci.a] at 0xd055b320
0xd055b320 (memcpy+0x10) 7ca01d2a       stsx   r5,r0,r3
(dbx)

Now, let's see where the program crashed and what its state was at that time. If we issue the where command,

(dbx) where

we can see the program stack:

moveeq._moveeq() at 0xd055b320
fmemcpy() at 0xd0568900
cpfromdev() at 0xd056791c
readdatafrompipe(??, ??, ??) at 0xd0558c08
readfrompipe() at 0xd0562564
finishread(??) at 0xd05571bc
kickpipes() at 0xd0556e64
mpci_recv() at 0xd05662cc
_mpi_recv() at 0xd050635c
MPI__Recv() at 0xd0504fe8
main(argc = 1, argv = 0x2ff22c08), line 32 in "bad_life.c"
(dbx)

The output of the where command shows that bad_life.c failed at line 32, so let's look at line 32, like this:

(dbx) func main
(dbx) list 32
 
    32          MPI_Recv(recv_message, 10, MPI_CHAR, taskid-1, 0,
                         MPI_COMM_WORLD, &stat);

When we look at line 32 of bad_life.c, our first guess is that one of the parameters being passed into MPI_Recv is bad. Let's look at some of these parameters to see if we can find the source of the error:

(dbx) print recv_message
(nil)

Ah ha! Our receive buffer has not been initialized and is NULL. The sample programs for this book include a solution called good_life.c. See "Getting the Books and the Examples Online" for information on how to get the sample programs.

It's important to note that we compiled bad_life.c with the -g compile flag. This gives us all the debugging information we need in order to view the entire program state and to print program variables. In many cases, people don't compile their programs with the -g flag, and they may even turn optimization on (-O), so there's virtually no information to tell them what happened when their program executed. If this is the case, you can still use dbx to look at only stack information, which allows you to determine the function or subroutine that generated the core dump.

Viewing the Program State

If collecting core files is impractical, you can also try catching the program at the segmentation fault. You do this by running the program under the control of the debugger. The debugger gets control of the application at the point of the segmentation fault, and this allows you to view your program state at the point where the problem occurs.

In the following example, we'll use bad_life again, but we'll use pdbx instead of dbx. Load bad_life under pdbx with the following command:

$ pdbx bad_life
 
pdbx Version 2.1 -- Apr 30 1996 15:56:32
 
  0:reading symbolic information ...
  1:reading symbolic information ...
  2:reading symbolic information ...
  3:reading symbolic information ...
  1:[1] stopped in main at line 12
  1:   12       char            *send_message = NULL;
  0:[1] stopped in main at line 12
  0:   12       char            *send_message = NULL;
  3:[1] stopped in main at line 12
  3:   12       char            *send_message = NULL;
  2:[1] stopped in main at line 12
  2:   12       char            *send_message = NULL;
0031-504  Partition loaded ...

Next, let the program run to allow it to reach a segmentation fault.

pdbx(all) cont
 
  0:Task 0 complete.
  2:Task 2 complete.
  3:
  3:Segmentation fault in @moveeq._moveeq [/usr/lpp/ppe.poe/lib/ip/libmpci.a]
  at 0xd036c320
  3:0xd036c320 (memmove+0x10) 7ca01d2a       stsx   r5,r0,r3
  1:
  1:Segmentation fault in @moveeq._moveeq [/usr/lpp/ppe.poe/lib/ip/libmpci.a]
  at 0xd055b320
  1:0xd055b320 (memcpy+0x10) 7ca01d2a       stsx   r5,r0,r3

Once we get segmentation faults, we can focus our attention on one of the tasks that failed. Let's look at task 1:

pdbx(all) on 1

By using the pdbx where command, we can see where the problem originated in our source code:

pdbx(1) where
 
  1:@moveeq.memcpy() at 0xd055b320
  1:fmemcpy() at 0xd0568900
  1:cpfromdev() at 0xd056791c
  1:readdatafrompipe(??, ??, ??) at 0xd0558c08
  1:readfrompipe() at 0xd0562564
  1:finishread(??) at 0xd05571bc
  1:kickpipes() at 0xd0556e50
  1:mpci_recv() at 0xd05662fc
  1:_mpi_recv() at 0xd050635c
  1:MPI__Recv() at 0xd0504fe8
  1:main(argc = 1, argv = 0x2ff22bf0), line 32 in "bad_life.c"

Now, let's move up the stack to function main:

pdbx(1) func main

Next, we'll list line 32, which is where the problem is located:

pdbx(1) l 32
 
  1:   32        MPI_Recv(recv_message, 10, MPI_CHAR, taskid-1, 0,
                 MPI_COMM_WORLD, &stat);

Now that we're at line 32, we'll print the value of recv_message:

 
pdbx(1) p recv_message
 
  1:(nil)

As we can see, our program passes a bad parameter to MPI_RECV().

Both the techniques we've talked about so far help you find the location of the problem in your code. The example we used makes it look easy, but in many cases it won't be so simple. However, knowing where the problem occurred is valuable information if you're forced to debug the problem interactively, so it's worth the time and trouble to figure it out.

Core Dumps and Threaded Programs

If a task of a threaded program produces a core file, the partial dump produced by default does not contain the stack and status information for all threads, so it is of limited usefulness. You can request AIX to produce a full core file, but such files are generally larger than permitted by user limits (the communication subsystem alone generates more than 64 MB of core information). As a result, if possible, use the attach capability of dbx, xldb, pdbx, or pedb to examine the task while it's still running.
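For example, to examine a threaded task that seems to be stuck, you might attach dbx to it on the node where that task is running instead of waiting for a core file (my_prog and the process id shown are placeholders):

$ ps -ef | grep my_prog        # on the remote node, find the task's process id
$ dbx -a 12345                 # attach dbx to that process; use the detach subcommand when done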

No Output at All

Should There Be Output?

If you're getting no output from your program and you think you ought to be, the first thing you should do is make sure you have enabled the program to send data back to you. If the MP_STDOUTMODE environment variable is set to a number, it is the number of the only task for which standard output will be displayed. If that task does not generate standard output, you won't see any.
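For reference, here are the usual ways MP_STDOUTMODE is set (these are alternatives, not a sequence; see IBM Parallel Environment for AIX: Operation and Use, Vol. 1 for details):

$ export MP_STDOUTMODE=unordered   # output from every task, as it arrives
$ export MP_STDOUTMODE=ordered     # output grouped task by task
$ export MP_STDOUTMODE=1           # output from task 1 only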

There Should Be Output

If MP_STDOUTMODE is set appropriately, the next step is to verify that the program is actually doing something. Start by observing how the program terminates (or fails to). It will do one of the following things:

In the first case, you should examine any messages you receive (since your program is not generating any output, all of the messages will be coming from POE).

In the second case, you will have to stop the program yourself (<Ctrl-c> should work).

One possible reason for lack of output could be that your program is terminating abnormally before it can generate any. POE will report abnormal termination conditions such as being killed, as well as non-zero return codes. Sometimes these messages are obscured in the blur of other errata, so it's important to check the messages carefully.

Figuring Out Return Codes

It's important to understand POE's interpretation of return codes. If the exit code for a task is zero (0) or in the range of 2 to 127, then POE will make that task wait until all tasks have exited. If the exit code is 1 or greater than 128 (or less than 0), then POE will terminate the entire parallel job abruptly (with a SIGTERM signal to each task). In normal program execution, one would expect to have each program go through exit(0) or STOP, and exit with an exit code of 0. However, if a task encounters an error condition (for example, a full file system), then it may exit unexpectedly. In these cases, the exit code is usually set to -1, but if you have written error handlers which produce exit codes other than 1 or -1, then POE's termination algorithm may cause your program to hang because one task has terminated abnormally, while the other tasks continue processing (expecting the terminated task to participate).
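Given these rules, one defensive approach is to make sure any error handler you write exits with a code that POE treats as fatal, so the whole partition is shut down instead of being left waiting for the failed task. A minimal sketch (the function name is hypothetical):

/* Exit in a way that makes POE terminate the whole parallel job.      */
/* An exit code of 1 (or one greater than 128) causes POE to send      */
/* SIGTERM to every task; codes 2 through 127 make POE wait for the    */
/* remaining tasks, which can leave the job hanging.                   */
#include <stdio.h>
#include <stdlib.h>

void fatal_error(const char *msg)
{
    fprintf(stderr, "fatal: %s\n", msg);
    exit(1);
}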

If the POE messages indicate the job was killed (either because of some external situation like low page space or because of POE's interpretation of the return codes), it may be enough information to fix the problem. Otherwise, more analysis is required.

It Hangs

If you've gotten this far and the POE messages and the additional checking by the message passing routines have been unable to shed any light on why your program is not generating output, the next step is to figure out whether your program is doing anything at all (besides not giving you output).

Let's Try Using the Visualization Tool

One way to do this is to run with Visualization Tool (VT) tracing enabled, and examine the tracefile. To do this, compile your program with the -g flag and run the program with the -tracelevel 9 command line option, or by setting the MP_TRACELEVEL environment variable to 9.

When your program terminates (either on its own or via a <Ctrl-c> from you), you will be left with a file called pgmname.trc in the directory from which you submitted the parallel job. You can view this file with VT.
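In other words, the typical sequence looks like this, where pgmname stands for your program name, as above:

$ mpcc -g -o pgmname pgmname.c
$ export MP_TRACELEVEL=9            # or use the -tracelevel 9 command line option
$ pgmname -procs 4
$ vt -tracefile pgmname.trc         # view the trace when the run ends (or after <Ctrl-c>)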

Let's look at the following example...it's got a bug in it.

/************************************************************************
*
* Ray trace program with bug
*
* To compile:
* mpcc -g -o rtrace_bug rtrace_bug.c
*
*
* Description:
* This is a sample program that partitions N tasks into
* two groups, a collect node and N - 1 compute nodes.
* The responsibility of the collect node is to collect the data
* generated by the compute nodes. The compute nodes send the
* results of their work to the collect node for collection.
*
* There is a bug in this code.  Please do not fix it in this file!
*
************************************************************************/
 
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
 
#define PIXEL_WIDTH 50
#define PIXEL_HEIGHT 50
 
int First_Line = 0;
int Last_Line  = 0;
 
void main(int argc, char *argv[])
{
  int numtask;
  int taskid;
 
  /* Find out number of tasks/nodes. */
  MPI_Init( &argc, &argv);
  MPI_Comm_size( MPI_COMM_WORLD, &numtask);
  MPI_Comm_rank( MPI_COMM_WORLD, &taskid);
 
  /* Task 0 is the coordinator and collects the processed pixels */
  /* All the other tasks process the pixels                      */
  if ( taskid == 0 )
    collect_pixels(taskid, numtask);
  else
    compute_pixels(taskid, numtask);
 
  printf("Task %d waiting to complete.\n", taskid);
  /* Wait for everybody to complete */
  MPI_Barrier(MPI_COMM_WORLD);
  printf("Task %d complete.\n",taskid);
  MPI_Finalize();
  exit(0);
}
 
/* In a real implementation, this routine would process the pixel */
/* in some manner and send back the processed pixel along with its*/
/* location.  Since we're not processing the pixel, all we do is  */
/* send back the location                                         */
compute_pixels(int taskid, int numtask)
{
  int  section;
  int  row, col;
  int  pixel_data[2];
  MPI_Status stat;
 
  printf("Compute #%d: checking in\n", taskid);
 
  section = PIXEL_HEIGHT / (numtask -1);
 
  First_Line = (taskid - 1) * section;
  Last_Line  = taskid * section;
 
  for (row = First_Line; row < Last_Line; row ++)
    for ( col = 0; col < PIXEL_WIDTH; col ++)
      {
         pixel_data[0] = row;
         pixel_data[1] = col;
         MPI_Send(pixel_data, 2, MPI_INT, 0, 0, MPI_COMM_WORLD);
      }
  printf("Compute #%d: done sending. ", taskid);
  return;
}
 
/* This routine collects the pixels.  In a real implementation, */
/* after receiving the pixel data, the routine would look at the*/
/* location information that came back with the pixel and move  */
/* the pixel into the appropriate place in the working buffer   */
/* Since we aren't doing anything with the pixel data, we don't */
/* bother and each message overwrites the previous one          */
collect_pixels(int taskid, int numtask)
{
  int  pixel_data[2];
  MPI_Status stat;
  int      mx = PIXEL_HEIGHT * PIXEL_WIDTH;
 
  printf("Control #%d: No. of nodes used is %d\n", taskid,numtask);
  printf("Control: expect to receive %d messages\n", mx);
 
  while (mx > 0)
    {
      MPI_Recv(pixel_data, 2, MPI_INT, MPI_ANY_SOURCE,
        MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
      mx--;
    }
  printf("Control node #%d: done receiving. ",taskid);
  return;
}

This example was taken from a ray tracing program that distributed a display buffer out to server nodes. The intent is that each task, other than Task 0, takes an equal number of full rows of the display buffer, processes the pixels in those rows, and then sends the updated pixel values back to the client. In the real application, the task would compute the new pixel value and send it as well, but in this example, we're just just sending the row and column of the pixel. Because the client is getting the row and column location of each pixel in the message, it doesn't care which server each pixel comes from. The client is Task 0 and the servers are all the other tasks in the parallel job.

This example has a functional bug in it. With a little bit of analysis, the bug is probably easy to spot and you may be tempted to fix it right away. PLEASE DO NOT!

When you run this program, you get the output shown below. Notice that we're using the -g option when we compile the example. We're cheating a little because we know there's going to be a problem, so we're compiling with debug information turned on right away.

$ mpcc -g -o rtrace_bug rtrace_bug.c
$ rtrace_bug -procs 4 -labelio yes
  1:Compute #1: checking in
  0:Control #0: No. of nodes used is 4
  1:Compute #1: done sending. Task 1 waiting to complete.
  2:Compute #2: checking in
  3:Compute #3: checking in
  0:Control: expect to receive 2500 messages
  2:Compute #2: done sending. Task 2 waiting to complete.
  3:Compute #3: done sending. Task 3 waiting to complete.
^C
ERROR: 0031-250  task 1: Interrupt
ERROR: 0031-250  task 2: Interrupt
ERROR: 0031-250  task 3: Interrupt
ERROR: 0031-250  task 0: Interrupt

No matter how long you wait, the program will not terminate until you press <Ctrl-c>.

So, we suspect the program is hanging somewhere. We know it starts executing because we get some messages from it. It could be a logical hang or it could be a communication hang. There are two ways you can approach this problem: by using either VT or the attach feature of pedb (the PE parallel debugger). We'll start by describing how to use VT.

Because we don't know what the problem is, we'll turn on full tracing with -tracelevel 9, as in the example below.

$ rtrace_bug -procs 4 -labelio yes -tracelevel 9
  1:Compute #1: checking in
  3:Compute #3: checking in
  2:Compute #2: checking in
  0:Control #0: No. of nodes used is 4
  0:Control: expect to receive 2500 messages
  2:Compute #2: done sending. Task 2 waiting to complete.
  1:Compute #1: done sending. Task 1 waiting to complete.
  3:Compute #3: done sending. Task 3 waiting to complete.
^C
ERROR: 0031-250  task 0: Interrupt
ERROR: 0031-250  task 3: Interrupt
ERROR: 0031-250  task 1: Interrupt
ERROR: 0031-250  task 2: Interrupt
$ ls -alt rtrace_bug.trc
-rw-r--r--   1 vt       staff     839440 May  9 14:54 rtrace_bug.trc

When you run this example, make sure you press <Ctrl-c> as soon as the last waiting to complete message is shown. Otherwise, the trace file will continue to grow with kernel information, and it will make visualization more cumbersome. This will create the trace file in the current directory with the name rtrace_bug.trc.

Now we can start VT with the command:

$ vt -tracefile rtrace_bug.trc

VT will start with the two screens shown in Figure 2 and Figure 3 below. One is the VT Control Panel:

Figure 2. VT Control Panel


The other is the VT View Selector:

Figure 3. VT View Selector


Hangs and Threaded Programs

Coordinating the threads in a task requires careful locking and signaling. Deadlocks that occur because the program is waiting on locks that haven't been released are common, in addition to the deadlock possibilities that arise from improper use of the MPI message passing calls.

Using the VT Displays

Since we don't know exactly what the problem is, we should look at communication and kernel activity. Once VT has started, you should select the Interprocessor Communication, Source Code, and System Summary displays from the View Selector panel (Figure 3). To do this, go to the VT View Selector Panel. Locate the category labeled Communication/Program. Beneath that category label is a toggle button labeled Interprocessor Communication.

PLACE
the mouse cursor on the toggle button.

PRESS
the left mouse button.

* The Interprocessor Communication display appears.

Next, locate the toggle button in the same category labeled Source Code and select it in a similar manner. Finally, locate the System Summary toggle button in the System category and select it. You may need to resize the System Summary to make all four pie charts visible.

If you experience any problems with the Source Code display, it may be because rtrace_bug was not compiled with the -g option, or because the source and executable do not reside in the directory from which you started VT. Although you can use the -spath command line option of VT to specify paths for locating the source code, for this example you should make sure that the source, executable, and trace file all exist in the directory from which you are starting VT.

Using the VT Trace File to Locate Communication Problems

Since all we know is that the program appeared to hang, the best bet at this point is to just play the trace file and watch for activity that appears to cease. Place the mouse cursor over the Play button on the main control panel and press the left mouse button to begin playback. This may take some time as VT attempts to play the trace file back at the same speed it was captured. If you want to speed up playback, you can use the mouse to drag the Replay Speed slider to the right.

You're looking for one of the following things to happen:

In the first case, a blocking communication call (for example, a Receive) has been invoked by a process, but the call was never satisfied. This can happen if a Blocking Receive, or a Wait on a Non-Blocking Receive is issued, and the corresponding Send never occurs. In this case, the Source Code display can show you where in the source code the blocked call occurred.

If the Interprocessor Communication display shows that the processes are not in a communication state, and the Source Code display shows no activity for one or more processes, then the code is probably in a standard infinite loop. You can deduce this because the Source Code display only tracks message passing calls. Once the display shows a process at a communication call, the position of that process will not change until another communication call is performed by that process. If the Interprocessor Communication display shows that the process is no longer in the communication call, you can conclude that the process has completed the call but has not reached another call. If you expected the process to go on to the next call, something in the non-communication code has prevented it from doing so. In many cases, this will be an infinite loop or blocking in I/O (although there are other legitimate causes as well, such as blocking while waiting on a signal).

If the trace file shows less or different communication activity than you expected, you should use the Reset button on the VT control panel to start the trace file from the beginning and then step the trace file, one communication event at a time, until you determine where the execution does something you didn't expect.

In each of these scenarios, you can use the VT Source Code display to locate the area that appears to contain the problem, and use that knowledge to narrow the focus of your debugging session.

Another cause for no activity in the trace file for a process is failure of the network connection to the node during execution. Normally, POE checks to make sure its connections to all remote hosts are healthy by periodic message exchange (we call it the POE pulse). If the network connection fails, the POE pulse will fail and POE will time out, terminating the job. You can turn the pulse off by setting the MP_PULSE environment variable to 0, but then you are responsible for detecting if a remote host connection fails.
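Turning the pulse off is a single setting, but remember that you then take over responsibility for noticing dead connections yourself:

$ export MP_PULSE=0      # disable the POE pulse for jobs started from this session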

In our example, the trace file playback ends with the Interprocessor Communication display and the Source Code display appearing as they do in Figure 4 and Figure 5.

The Interprocessor Communication display, below, shows that process 0 is in a blocking receive. You can tell this by placing the mouse cursor over the box that has the zero in it and pressing the left mouse button. The pop-up box that appears will tell you so. The other three processes are in an MPI_Barrier state. What this means is that process 0 is expecting more input, but the other processes are not sending any.

Figure 4. Interprocessor Communication Display


The Source Code display, below, shows that Process 0 is at the MPI_Recv call in collect_pixels. Process 0 will not leave this loop until the variable mx becomes 0. The other processes are in the MPI_Barrier call that they make after compute_pixels finishes. They have sent all the data they think they should, but Process 0 is expecting more.

Figure 5. Source Code Display


Let's Attach the Debugger

After using VT, it seems clear that our program is hanging. Let's use the debugger to find out why. The best way to diagnose this problem is to attach the debugger directly to our POE job.

Start up POE and run rtrace_bug:

$ rtrace_bug -procs 4 -labelio yes

To attach the debugger, we first need to get the process id (pid) of the POE job. You can do this with the AIX ps command:

$ ps -ef | grep poe
 
smith 24152 20728   0 08:25:22  pts/0  0:00 poe

Next, we'll need to start the debugger in attach mode. Note that we can use either the pdbx or the pedb debugger. In this next example, we'll use pedb, which we'll start in attach mode by using the -a flag and the process identifier (pid) of the POE job:

$ pedb -a 24152

After starting the debugger in attach mode, the pedb Attach Dialog window appears:

Figure 6. Attach Dialog Window


The Attach Dialog Window contains a list of task numbers and other information that describes the POE application. It provides information for each task in the following fields:

Task
The task number

IP
The IP address of the node on which the task or application is running

Node
The name, if available, of the node on which the task or application is running

PID
The process identifier of the selected task

Program
The name of the application and arguments, if any. These may be different if your program is MPMD.

At the bottom of the window there are two buttons (other than Quit and Help):

Attach
Causes the debugger to attach to the tasks that you selected. This button remains grayed out until you make a selection.

Attach All
Causes the debugger to attach to all the tasks listed in the window. You don't have to select any specific tasks.

Next, select the tasks to which you want to attach. You can either select all the tasks by pressing the Attach All button, or you can select individual tasks, by pressing the Attach button. In our example, since we don't know which task or set of tasks is causing the problem, we'll attach to all the tasks by pressing the Attach All button.

PLACE
the mouse cursor on the Attach All button.

PRESS
the left mouse button.

* The Attach Dialog window closes and the debugger main window appears:

Figure 7. pedb Main Window


Since our code is hung in low level routines, the initial main window only provides stack traces. To get additional information for a particular task, double click on the highest line in the stack trace that has a line number and a file name associated with it. This indicates that the source code is available.

For task 0, in our example, this line in the stack trace is:

collect_pixels(), line 101 in rtrace_bug.c

Clicking on this line causes the local data to appear in the Local Data area and the source file (the collect_pixels function) to appear in the Source File area. In the source for collect_pixels, line 101 is highlighted. Note that the function name and the line number within it that your program last executed appear here (in this case, the MPI_Recv() call on line 101).

Figure 8. Getting Additional Information About a Task


PLACE
the mouse cursor on the Task 0 label (not the box) in the Global Data area.

PRESS
the right mouse button.

* a pop-up menu appears.

SELECT
the Show All option.

* All the global variables for this task are tracked and displayed in the window below the task button.

Repeat the steps above for each task.

Now you can see that task 0 is stopped on an MPI_Recv() call. When we look at the Local Data values, we find that mx is still set to 100, so task 0 thinks it's still going to receive 100 messages. Now, lets look at what the other tasks are doing.

To get information on task 1, go to its stack window and double click on the highest entry that includes a line number. In our example, this line is:

main(argc = 1, argv = 0x2ff22a74), line 43, in rtrace_bug.c

Task 1 has reached an MPI_Barrier() call. If we quickly check the other tasks, we see that they have all reached this point as well. So....the problem is solved. Tasks 1 through 3 have completed sending messages, but task 0 is still expecting to receive more. Task 0 was expecting 2500 messages but only got 2400, so it's still waiting for 100 messages. Let's see how many messages each of the other tasks is sending. To do this, we'll look at the global variables First_Line and Last_Line. We can get the values of First_Line and Last_Line for each task by selecting them in the Global Data area.

PLACE
the mouse cursor over the desired task number label (not the box) in the Global Data area.

PRESS
the right mouse button.

* a pop-up menu appears.

SELECT
the Show All option.

* The First_Line and Last_Line variables are tracked and displayed in the window below the task button.

Repeat the steps above for each task.

Figure 9. Global Data Window


As you can see...

So what happened to lines 48 and 49? Since each row is 50 pixels wide, and we're missing 2 rows, that explains the 100 missing messages. As you've probably already figured out, the division of the total number of lines (50) by the number of compute tasks (3) is not integral, so we lose part of the result when it's converted back to an integer. Where each task is supposed to be processing 16 and two-thirds lines, it's only handling 16, so only lines 0 through 47 are ever sent.

Fix the Problem

So how do we fix this problem permanently? As we mentioned above, there are many ways:

In our case, since Task 1 was responsible for 16 and two thirds rows, it would process rows 0 through 16. Task 2 would process 17-33 and Task 3 would process 34-49. The way we're going to solve it is by creating blocks, with as many rows as there are servers. Each server is responsible for one row in each block (the offset of the row in the block is determined by the server's task number). The fixed code is shown in the following example. Note that this is only part of the program. You can access the entire program from the IBM RS/6000 World Wide Web site. See "Getting the Books and the Examples Online" for more information.

/************************************************************************
*
* Ray trace program with bug corrected
*
* To compile:
* mpcc -g -o rtrace_good rtrace_good.c
*
*
* Description:
* This is part of a sample program that partitions N tasks into
* two groups, a collect node and N - 1 compute nodes.
* The responsibility of the collect node is to collect the data
* generated by the compute nodes. The compute nodes send the
* results of their work to the collect node for collection.
*
* The bug in the original code was due to the fact that each processing
* task determined the rows to cover by dividing the total number of
* rows by the number of processing tasks.  If that division was not
* integral, the number of pixels processed was less than the number of
* pixels expected by the collection task and that task waited
* indefinitely for more input.
*
* The solution is to allocate the pixels among the processing tasks
* in such a manner as to ensure that all pixels are processed.
*
************************************************************************/
 
compute_pixels(int taskid, int numtask)
{
  int  offset;
  int  row, col;
  int  pixel_data[2];
  MPI_Status stat;
 
  printf("Compute #%d: checking in\n", taskid);
 
  First_Line = (taskid - 1);
     /* First n-1 rows are assigned */
     /* to processing tasks         */
  offset = numtask - 1;
     /* Each task skips over rows   */
     /* processed by other tasks    */
 
     /* Go through entire pixel buffer, jumping ahead by numtask-1 each time */
  for (row = First_Line; row < PIXEL_HEIGHT; row += offset)
    for ( col = 0; col < PIXEL_WIDTH; col ++)
      {
         pixel_data[0] = row;
         pixel_data[1] = col;
         MPI_Send(pixel_data, 2, MPI_INT, 0, 0, MPI_COMM_WORLD);
      }
  printf("Compute #%d: done sending. ", taskid);
  return;
}

This program is the same as the original one except for the loop in compute_pixels. Now, each task starts at a row determined by its task number and jumps to the next block on each iteration of the loop. The loop is terminated when the task jumps past the last row (which will be at different points when the number of rows is not evenly divisible by the number of servers).

What's the Hangup?

The symptom of the problem in the rtrace_bug program was a hang. Hangs can occur for the same reasons they occur in serial programs (in other words, loops without exit conditions). They may also occur because of message passing deadlocks or because of some subtle differences between the parallel and sequential environments. The Visualization Tool (VT), of the IBM Parallel Environment for AIX, can show you what's going on in the program when the hang occurs, and this information can be used with the parallel debugger to identify the specific cause of the hang.

However, sometimes analysis under the debugger indicates that the source of a hang is a message that was never received, even though it's a valid one, and even though it appears to have been sent. In these situations, the problem is probably due to lost messages in the communication subsystem. This is especially true if the lost message is intermittent or varies from run to run. This is either the program's fault or the environment's fault. Before investigating the environment, you should analyze the program's safety with respect to MPI. A safe MPI program is one that does not depend on a particular implementation of MPI.

Although MPI specifies many details about the interface and behavior of communication calls, it also leaves many implementation details unspecified (and it doesn't just omit them, it specifies that they are unspecified.) This means that certain uses of MPI may work correctly in one implementation and fail in another, particularly in the area of how messages are buffered. An application may even work with one set of data and fail with another in the same implementation of MPI. This is because, when the program works, it has stayed within the limits of the implementation. When it fails, it has exceeded the limits. Because the limits are unspecified by MPI, both implementations are valid. MPI safety is discussed further in Appendix B. "MPI Safety".

Once you have verified that the application is MPI-safe, your only recourse is to blame lost messages on the environment. If the communication path is IP, use the standard network analysis tools to diagnose the problem. Look particularly at mbuf usage. You can examine mbuf usage with the netstat command:

$ netstat -m

If the mbuf line shows any failed allocations, you should increase the thewall value of your network options. You can see your current setting with the no command:

$ no -a

The value presented for thewall is in kilobytes. You can use the no command to change this value. For example,

$ no -o thewall=16384

sets thewall to 16 megabytes.

Message passing between lots of remote hosts can tax the underlying IP system. Make sure you look at all the remote nodes, not just the home node. Allow lots of buffers. If the communication path is user space (US), you'll need to get your system support people involved to isolate the problem.

Other Hangups

One final cause for no output is a problem on the home node (POE is hung). Normally, a hang is associated with the remote hosts waiting for each other, or for a termination signal. POE running on the home node is alive and well, waiting patiently for some action on the remote hosts. If you type <Ctrl-c> on the POE console, you will be able to successfully interrupt and terminate the set of remote hosts. See IBM Parallel Environment for AIX: Operation and Use, Vol. 1 for information on the poekill command.

There are situations where POE itself can hang. Usually these are associated with large volumes of input or output. Remember that POE normally gets standard output from each node; if each task writes a large amount of data to standard output, it may chew up the IP buffers on the machine running POE, causing it (and all the other processes on that machine) to block and hang. The only way to know that this is the problem is by seeing that the rest of the home node has hung. If you think that POE is hung on the home node, your only solution may be to kill POE there. Press <Ctrl-c> several times, or use the command kill -9. At present, there are only partial approaches to avoiding the problem; allocate lots of mbufs on the home node, and don't make the send and receive buffers too large.

Bad Output

Bad output includes unexpected error messages (after all, who expects error messages?) as well as bad results, that is, results that are not correct.

Error Messages

The causes of error messages are tracked down and corrected in parallel programs using techniques similar to those used for serial programs. One difference, however, is that you need to identify which task is producing the message, if it's not coming from all tasks. You can do this by setting the MP_LABELIO environment variable to yes, or using the -labelio yes command line parameter. Generally, the message will give you enough information to identify the location of the problem.

You may also want to generate more error and warning messages by setting the MP_EUIDEVELOP environment variable to yes when you first start running a new parallel application. This will give you more information about the things that the message passing library considers errors or unsafe practices.
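For example, a first run of a new parallel application might look like this (my_prog is a placeholder for your program):

$ export MP_LABELIO=yes        # prefix each line of output with the task number that wrote it
$ export MP_EUIDEVELOP=yes     # ask the message passing library for extra error and warning checking
$ my_prog -procs 4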

Bad Results

Bad results are tracked down and corrected in a parallel program in a fashion similar to that used for serial programs. The process, as we saw in the previous debugging exercise, can be more complicated because the processing and control flow on one task may be affected by other tasks. In a serial program, you can follow the exact sequence of instructions that were executed and observe the values of all variables that affect the control flow. However, in a parallel program, both the control flow and the data processing on a task may be affected by messages sent from other tasks. For one thing, you may not have been watching those other tasks. For another, the messages could have been sent a long time ago, so it's very difficult to correlate a message that you receive with a particular series of events.

Debugging and Threads

So far, we've talked about debugging normal old serial or parallel programs, but you may want to debug a threaded program (or a program that uses threaded libraries). If this is the case, there are a few things you should consider.

Before you do anything else, you first need to understand the environment you're working in. You have the potential to create a multi-threaded application, using a multi-threaded library, that consists of multiple distributed tasks. As a result, finding and diagnosing bugs in this environment may require a different set of debugging techniques that you're not used to using. Here are some things to remember.

When you attach to a running program, all the tasks you selected in your program will be stopped at their current points of execution. Typically, you want to see the current point of execution of your task. This stop point is the position of the program counter, and may be in any one of the many threads that your program may create OR any one of the threads that the MPI library creates. With non-threaded programs it was adequate to just travel up the program stack until you reached your application code (assuming you compiled your program with the -g option). But with threaded programs, you now need to traverse across other threads to get to your thread(s) and then up the program stack to view the current point of execution of your code.

If you're using the threaded MPI library, the library itself will create a set of threads to process message requests. When you attach to a program that uses the MPI library, all of the threads associated with the POE job are stopped, including the ones created and used by MPI.

It's important to note that to effectively debug your application, you must be aware of how threads are dispatched. When a task is stopped, all threads are also stopped. Each time you issue an execution command such as step over, step into, step return, or continue, all the threads are released for execution until the next stop (at which time they are stopped, even if they haven't completed their work). This stop may be at a breakpoint you set or the result of a step. A single step over an MPI routine may prevent the MPI library threads from completely processing the message that is being exchanged.

For example, if you wanted to debug the transfer of a message from a sending task to a receiving task, you would step over an MPI_SEND() in your program on task 1, switch to task 2, then step over the MPI_RECV() on task 2. Unless the MPI threads on tasks 1 and 2 have the opportunity to process the message transfer, it will appear that the message was lost. Remember...the window of opportunity for the MPI threads to process the message is brief, and is only open during the step over. Otherwise, the threads are stopped. Longer-running execution requests on both the sending and receiving tasks allow the message to be processed and, eventually, received.
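To make the scenario concrete, here is a minimal sketch (not taken from any program in this book, and assuming at least three tasks so that tasks 1 and 2 exist) of the kind of send/receive pair being discussed:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int me, buf = 0;
  MPI_Status status;

  MPI_Init(&argc,&argv);
  MPI_Comm_rank(MPI_COMM_WORLD,&me);

  if (me == 1) {
      buf = 42;
      /* step over this send on task 1 ...                */
      MPI_Send(&buf, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);
  } else if (me == 2) {
      /* ... then switch tasks and step over this receive */
      MPI_Recv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
      printf("Task 2 received %d\n", buf);
  }

  MPI_Finalize();
  return 0;
}

If buf on task 2 still looks empty after both steps, it doesn't necessarily mean the message was lost; letting both tasks run a little longer (a continue to a later breakpoint, for example) gives the MPI threads the chance to complete the transfer.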

For more information on debugging threaded and non-threaded MPI programs with the PE debugging tools (pdbx and pedb), see IBM Parallel Environment for AIX: Operation and Use, Vol. 2, which provides more detailed information on how to manage and display threads.

For more information on the threaded MPI library, see IBM Parallel Environment for AIX: MPI Programming and Subroutine Reference.


Keeping an Eye on Progress

Often, once a program is running correctly, you'd like to keep tabs on its progress. Frequently, in a sequential program, this is done by printing to standard output. However, as you may remember from Chapter 1, standard output from all the tasks is interleaved, and it is difficult to follow the progress of just one task. Even if you set the MP_STDOUTMODE environment variable to ordered, you still can't see how the progress of one task relates to another. In addition, normal output is not a blocking operation: a task that writes a message continues processing, so by the time you see the message, the task is well beyond that point. This makes it difficult to understand the true state of the parallel application, and it's especially difficult to correlate the states of two tasks from their progress messages. One way to synchronize the state of a parallel task with its output messages is to use the Program Marker Array (pmarray).
Note:If you are unfamiliar with the Program Marker Array, you may find it helpful to refer to IBM Parallel Environment for AIX: Operation and Use, Vol. 1 for more information.

The Program Marker Array consists of two components: the display function, pmarray, and the instrumentation call, mpc_marker. When pmarray is running, it shows a display that looks like Figure 10, below.

Figure 10. Program Marker Array

Each row of colored squares is associated with one task, which can change the color of any of the lights in its row with the mpc_marker call. The declaration looks like this in Fortran:

MP_MARKER (INTEGER LIGHT, INTEGER COLOR, CHARACTER STRING)

And it looks like this in C:

void mpc_marker(int light, int color, char *str)

This call accepts two integer values and a character string. The first parameter, light, controls which light in the pmarray is being modified. You can have up to 100 lights for each task. The second parameter, color, specifies the color to which you are setting the light. There are 100 colors available. The third parameter is a string of up to 80 characters that is a message shown in the text area of the pmarray display.

Before you start the parallel application, you need to tell pmarray how many lights to use, as well as how many tasks there will be. You do this with the MP_PMLIGHTS and the MP_PROCS environment variables.

$ export MP_PROCS=4
$ export MP_PMLIGHTS=16

If the parallel application is started from an X-Windows environment where pmarray is running, then each time a task calls mpc_marker, the square in that task's row at the position specified by the light parameter changes to the color specified by the color parameter. The character string is displayed in a text output region for the task. In addition to providing a quick graphical representation of the progress of the application, the output to pmarray is synchronized with the task that generates it: the task does not proceed until it has been informed that the data has been sent to pmarray. This gives you a much more current view of the state of each task.

The example below shows how pmarray can be used to track the progress of an application. This program doesn't do anything useful, but there's an inner loop that's executed 16 times and an outer loop whose iteration count is taken from an input parameter. On each pass through the inner loop, mpc_marker is called to update the square whose position corresponds to the current outer loop iteration, coloring it according to the inner loop index. On each pass through the outer loop, the task is delayed by a number of seconds equal to its task number, so task 0 finishes quickly while the highest-numbered task takes the longest. As a result, the number of squares that have been colored in a task's row shows how far it has gotten through the outer loop, while the changing color of the most recent square shows it working through the inner loop. In addition, the text message is updated on each pass through the outer loop.

/************************************************************************
*
* Demonstration of use of pmarray
*
* To compile:
* mpcc -g -o use_pmarray use_pmarray.c
*
************************************************************************/
 
#include<stdlib.h>
#include<stdio.h>
#include<mpi.h>
#include<time.h>
 
int main(int argc, char **argv)
{
  int i, j;
  int inner_loops = 16, outer_loops = 0;   /* 0 means "use the default below" */
  int me;
  char buffer[256];
  time_t start, now;
 
  MPI_Init(&argc,&argv);
  MPI_Comm_rank(MPI_COMM_WORLD,&me);
 
  if(argc>1) outer_loops = atoi(argv[1]);
  if(outer_loops<1) outer_loops = 16;
 
  for(i=0;i<outer_loops;i++)
    {
      /* Create message that will be shown in pmarray text area */
      sprintf(buffer,"Task %d performing loop %d of %d",me,i,outer_loops);
      printf("%s\n",buffer);
      for(j=0;j<inner_loops;j++)
        {
          /* pmarray light shows which outer loop we are in  */
          /* color of light shows which inner loop we are in */
          /* text in buffer is created in outer loop         */
          mpc_marker(i,5*j,buffer);
        }
 
      /* Pause for a number of seconds determined by which */
      /* task this is.  sleep(me) cannot be used because   */
      /* underlying communication mechanism uses a regular */
      /* timer interrupt that interrupts the sleep call    */
      /* Instead, we'll look at the time we start waiting  */
      /* and then loop until the difference between the    */
      /* time we started and the current time is equal to  */
      /* the task id                                       */
      time(&start);
      time(&now);
      while(difftime(now,start)<(double)me)
         {
           time(&now);
         }
    }
  MPI_Finalize();
  return 0;
}

Before running this example, you need to start pmarray, telling it how many lights to use. You do this with the MP_PMLIGHTS environment variable.

In our example, if we wanted to run the program with eight outer loops, we would set MP_PMLIGHTS to 8 before running the program.
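Putting it all together, a session might look something like the following (the exact way you start pmarray under X-Windows is described in IBM Parallel Environment for AIX: Operation and Use, Vol. 1; here it's assumed to be started as a background command from your X session):

$ export MP_PROCS=4
$ export MP_PMLIGHTS=8
$ pmarray &
$ poe use_pmarray 8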

Although it's not as freeform as print statements, or as extensible, pmarray allows you to send three pieces of information (the light number, the color, and the text string) back to the home node for presentation. It also ensures that the presentation is synchronized as closely to the task state as possible. We recommend that if you use pmarray for debugging, you define a consistent strategy for your application and stick with it. For example, you may want to use color to indicate state (initializing, active, disabled, terminated), and light number to indicate module or subsystem. You can configure pmarray with as many lights as will fit on the display.
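For example, a hypothetical convention (the names and values below are made up for illustration) might reserve colors for states and lights for subsystems, and funnel every call through one helper so the scheme stays consistent:

/* Hypothetical convention: color indicates state, light indicates subsystem */
#define STATE_INIT        0
#define STATE_ACTIVE     25
#define STATE_DISABLED   50
#define STATE_TERMINATED 75

#define LIGHT_INPUT   0
#define LIGHT_SOLVER  1
#define LIGHT_OUTPUT  2

void mark_progress(int subsystem, int state, char *msg)
{
  /* subsystem selects the light, state selects the color */
  mpc_marker(subsystem, state, msg);
}

A call such as mark_progress(LIGHT_SOLVER, STATE_ACTIVE, "solver: pass 12") then means the same thing wherever it appears in the application.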

