Hitchhiker's Guide


Hitching a Lift on the Vogon Constructor Ship

The Hitchhiker's Guide to the Galaxy begins with Arthur Dent, earthman and main character, being suddenly swept aboard an alien spaceship from his garden. Once on the ship, Arthur is totally bewildered by his new surroundings. Fortunately, Ford Prefect, Arthur's earth companion (who, as Arthur recently discovered, is actually an alien from a planet somewhere in the Betelgeuse system), is there to explain what's going on.

Just as Arthur had to get used to his new environment when he first left earth, this chapter will help you get used to the new environment you're in: the IBM Parallel Environment for AIX (PE). It covers:

This book contains many examples and illustrates various commands and programs as well as the output you get as a result of running them. When looking at these examples, please keep in mind that the output you see on your system may not exactly match what's printed in the book, due to the differences between your system and ours. We've included them here just to give you a basic idea of what happens.

The sample programs, as they appear in this book, are also provided in source format on the IBM RS/6000 World Wide Web site (as described in "Getting the Books and the Examples Online"). If you intend to write or use any of the sample programs, it's best to get them from the web site rather than copying them from the book. Because of formatting and other restrictions in publishing code examples, some of what you see here may not be syntactically correct. The source code on the web site, on the other hand, will work (we paid big bucks to someone to make sure it did).

If you're unfamiliar with the terms in this chapter, Appendix E, "Glossary of Terms and Abbreviations", may be of some help.


What's the IBM Parallel Environment for AIX?

The IBM Parallel Environment for AIX (PE) software lets you develop, debug, analyze, tune, and execute parallel applications written in Fortran, C, and C++ quickly and efficiently. PE conforms to existing standards like UNIX and MPI, and runs on either an IBM RS/6000 SP (SP) machine or an AIX workstation cluster.

PE consists of:

What's the Parallel Operating Environment?

The purpose of the Parallel Operating Environment (POE) is to allow you to develop and execute your parallel applications across multiple processors, called nodes. When using POE, there is a single node (a workstation) called the home node that manages interactions with users.

POE transparently manages the allocation of remote nodes, where your parallel application actually runs. It also handles the various requests and communication between the home node and the remote nodes via the underlying network.

This approach eases the transition from serial to parallel programming by hiding the differences, and allowing you to continue using standard AIX tools and techniques. You have to tell POE what remote nodes to use (more on that in a moment), but once you have, POE does the rest.

When we say processor node, we're talking about a physical entity or location that's defined to the network. It can be a standalone machine, or a processor node within an IBM RS/6000 SP (SP) frame. From POE's point of view, a node is a node...it doesn't make much difference.

If you're using an SMP system, it's important to know that although an SMP node has more than one processing unit, it is still considered, and referred to as, a processor node.

What's New in PE 2.4?

AIX 4.3 Support

With PE 2.4, POE supports user programs developed with AIX 4.3. It also supports programs developed with AIX 4.2 that are intended for execution on AIX 4.3.

Parallel Checkpoint/Restart

This release of PE provides a mechanism for temporarily saving the state of a parallel program at a specific point (checkpointing), and then later restarting it from the saved state. When a program is checkpointed, the checkpointing function captures the state of the application as well as all data, and saves it in a file. When the program is restarted, the restart function retrieves the application information from the file it saved, and the program then starts running again from the place at which it was saved.

Enhanced Job Management Function

In earlier releases of PE, POE relied on the SP Resource Manager for performing job management functions. These functions included keeping track of which nodes were available or allocated and loading the switch tables for programs performing User Space communications. LoadLeveler, which had only been used for batch job submissions in the past, is now replacing the Resource Manager as the job management system for PE. One notable effect of this change is that LoadLeveler now allows you to run more than one User Space task per node.

MPI I/O

With PE 2.4, the MPI library now includes support for a subset of MPI I/O, described by Chapter 9 of the MPI-2 document, MPI-2: Extensions to the Message-Passing Interface, Version 2.0. MPI I/O provides a common programming interface, improving the portability of code that involves parallel I/O.
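To give you a feel for that interface, here's a minimal sketch of our own (it is not one of the PE sample programs), in which each task writes a fixed-size record into a shared file using standard MPI-2 calls. The file names demo.out and mpiio_sketch.c are made up for illustration:

/************************************************************************
*
* MPI I/O sketch - each task writes its own record to a shared file
*
* To compile:
* mpcc -o mpiio_sketch mpiio_sketch.c
*
************************************************************************/
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    char buf[16];
    MPI_File fh;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    sprintf(buf, "task %2d\n", rank);    /* exactly 8 characters per record */

    /* All tasks open the same file collectively... */
    MPI_File_open(MPI_COMM_WORLD, "demo.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* ...and each task writes its record at an offset based on its rank */
    MPI_File_write_at(fh, (MPI_Offset)(rank * 8), buf, 8, MPI_CHAR, &status);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

Because every task opens the file collectively and writes at its own offset, no task has to funnel its data through task 0.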

1024 Task Support

This release of PE supports a maximum of 1024 tasks per User Space MPI/LAPI job, as opposed to the previous release, which supported a maximum of 512 tasks. For jobs using the IP version of the MPI library, PE supports a maximum of 2048 tasks.

Enhanced Compiler Support

In this release, POE now supports the following compilers:

Message Queue Facility

The pedb debugger now includes a message queue facility. Part of the pedb debugger interface, the message queue viewing feature can help you debug Message Passing Interface (MPI) applications by showing internal message request queue information. With this feature, you can view:

Xprofiler Enhancements

This release includes a variety of enhancements to Xprofiler, including:

Before You Start

Before getting underway, you should check to see that the items covered in this section have been addressed.

Installation

Whoever installed POE (this was probably your System Administrator but may have been you or someone else) should verify that it was installed successfully by running the Installation Verification Program (IVP). The IVP, which verifies installation for both threaded and non-threaded environments, is discussed briefly in Appendix C. "Installation Verification Program Summary", and is covered in more detail in IBM Parallel Environment for AIX: Installation Guide.

The IVP tests to see if POE is able to:

The instructions for verifying that the PE Visualization Tool (VT) was installed correctly are in IBM Parallel Environment for AIX: Installation Guide.

Access

Before you can run your job, you must first have access to the compute resources in your system. Here are some things to think about:

Note that if you're using LoadLeveler to submit POE jobs, it is LoadLeveler, not POE, that handles user authorization. As a result, if you are using LoadLeveler to submit jobs, the following sections on user authorization do not apply to you, and you can skip ahead to "Job Management".

POE, when running without LoadLeveler, allows two types of user authorization:

  1. AIX-based user authorization, using entries in /etc/hosts.equiv or .rhosts files. This is the default POE user authorization method.

  2. DFS/DCE-based user authorization, using DCE credentials. If you plan to run POE jobs in a DFS environment, you must use DFS/DCE-based user authorization.

The type of user authorization is controlled by the MP_AUTH environment variable. The valid values are AIX (the default) or DFS.

The system administrator can also define a value for MP_AUTH in the /etc/poe.limits file. If MP_AUTH is specified in /etc/poe.limits, its value overrides the MP_AUTH environment variable.
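For example, to select DFS/DCE-based user authorization for your session, you would set:

$ export MP_AUTH=DFS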

AIX-Based User Authorization

You must have remote execution authority on all the nodes in the system that you will use for parallel execution. Your system administrator should:

/etc/hosts.equiv is checked first; if the home node and user/machine name don't appear there, .rhosts is checked next.

You can verify that you have remote execution authority by running a remote shell from the workstation where you intend to submit parallel jobs. For example, to test whether you have remote execution authority on 202r1n10, try the following command:

$ rsh 202r1n10 hostname

The response to this should be the remote host name. If it isn't the remote host name, or the command cannot run, you'll have to see your system administrator. Issue this command for every remote host on which you plan to have POE execute your job.

Refer to IBM Parallel Environment for AIX: Installation Guide for more detailed information.

DFS/DCE-Based User Authorization

If you plan to run POE on a system with the Distributed File System (DFS), you need to perform some additional steps in order to enable POE to run with DFS.

DFS requires you to have a set of DCE credentials which manage the files to which you have access. Since POE needs access to your DCE credentials, you need to provide them to POE:

  1. Do a dce_login. This enables your credentials through DCE (ensures that you are properly authenticated to DCE).

  2. Propagate your credentials to the nodes on which you plan to run your POE jobs. Use the poeauth command to do this. poeauth copies the credentials from task 0, using a host list file or job management system. The first node in the host list or pool must be the node from which you did the dce_login (and is where the credentials exist).

    poeauth is actually a POE application program that is used to copy the DCE credentials. As a result, before you attempt to run poeauth on a DFS system, you need to make your current working directory a non-DFS directory (for example, /tmp). Otherwise, you may encounter errors running poeauth which are related to POE's access of DFS directories.
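Putting the two steps together, a session might look something like this (the DCE user name and the host list file name are made up for illustration; poeauth accepts the usual POE options because it is itself a POE application):

$ dce_login hitchhiker
$ cd /tmp
$ poeauth -procs 4 -hostfile dfs.hosts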

Keep in mind that each node in your system, on which parallel jobs may run, requires the DFS/DCE credentials. As a result, it's wise to use the poeauth command with a host list file or pool that contains every node on which you might want to run your jobs later.

DCE credentials are maintained on a per user basis, so each user will need to invoke poeauth themselves in order to copy the credentials files. The credentials remain in effect on all nodes to which they were copied until they expire (at which time you will need to copy them using poeauth again).

For more information on running in a DFS environment and the poeauth command, refer to IBM Parallel Environment for AIX: Operation and Use, Vol. 1.

Job Management

In earlier releases of PE, POE relied on the SP Resource Manager for performing job management functions. These functions included keeping track of which nodes were available or allocated, and loading the switch tables for programs performing User Space communications. LoadLeveler, which had only been used for batch job submissions in the past, is now replacing the Resource Manager as the job management system for PE.

Parallel jobs whose tasks will run in a partition consisting of nodes at the PSSP 2.3 or 2.4 level are limited to using the Resource Manager for job management (PSSP 2.3 and 2.4 did not support LoadLeveler). These jobs will be unable to exploit the new functionality under LoadLeveler, most notably the ability to run a maximum of four User Space jobs per node. In this case, the Resource Manager is indicated when the PSSP SP_NAME environment variable is set to the name of that partition's control workstation.

Differences Between LoadLeveler and the Resource Manager

LoadLeveler and the Resource Manager differ in the following ways:

Pool Specification

With the Resource Manager, pools were specified with a pool number. With LoadLeveler, pools may be specified with a number or a name.

Host List File Entries

With the Resource Manager, CPU and adapter usage could be specified on a per-node basis in a host list file, or on a per-job basis (when allocating nodes with the MP_RMPOOL environment variable) by setting the MP_CPU_USE or MP_ADAPTER_USE environment variables. With LoadLeveler, you cannot specify CPU or adapter usage in a host list file. If you try it, you will see a message indicating the specifications are being ignored. If you use a host list file, you will need to use the MP_CPU_USE and MP_ADAPTER_USE environment variables to specify the desired usage. Note also that these settings will continue to be in effect if you use the MP_RMPOOL environment variable later on.
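For example, to ask for dedicated adapter use and unique CPU use on the nodes allocated to your job, you might set the usage variables like this (dedicated/shared and unique/multiple are the usage values PE recognizes):

$ export MP_ADAPTER_USE=dedicated
$ export MP_CPU_USE=unique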

When specifying pools in a host list file, for a job run under LoadLeveler, each entry must be for the same pool. In other words, all parallel tasks must run on nodes in the same pool. If a host list file for a LoadLeveler job contains more than one pool, the job will terminate.

Semantics of Usage

With the Resource Manager, specifying dedicated adapter usage or unique CPU usage prevented any other task from using that resource. Under LoadLeveler, this specification only prevents tasks of other parallel jobs from using the resource; tasks from the same parallel job are able to use the resource.

New LoadLeveler Options

The following environment variables are only valid for jobs that will run under LoadLeveler:
MP_MSG_API Used to indicate whether the parallel tasks of a job will be using MPI, LAPI, or both for message passing communication.
MP_NODES Used to indicate the number of physical nodes on which the tasks of a parallel job should be run.
MP_TASKS_PER_NODE Used to indicate the number of tasks to be run on each of the physical nodes.

System administrators may use the MP_USE_LL keyword in the /etc/poe.limits file to indicate that only parallel jobs that are run under LoadLeveler are allowed on a particular node.

Host List File

One way to tell POE where to run your program is with a host list file. This file can be given any name, but the default name is host.list, and many people use that name to avoid having to specify another parameter (as we'll discuss later). The host list file is generally in your current working directory, but you can keep it anywhere you like, as long as you tell POE where to find it. The file contains one of two different kinds of information: node names or pool numbers (note that if you are using LoadLeveler, a pool can also be designated by a string). A host list file cannot contain a mixture of node names and pool numbers (or strings), so you must specify one or the other.

Node names refer to the hosts on which parallel jobs may be run. They may be specified as Domain Names (as long as those Domain Names can be resolved from the workstation where you submit the job) or as Internet addresses. Each host goes on a separate line in the host list file.

Here's an example of a host list file that specifies the node names on which four tasks will run:

202r1n10.hpssl.kgn.ibm.com
202r1n11.hpssl.kgn.ibm.com
202r1n09.hpssl.kgn.ibm.com
202r1n12.hpssl.kgn.ibm.com

Pools are groups of nodes that are known to the job management system you are using (LoadLeveler or the Resource Manager). If you're using LoadLeveler, a pool is identified by either a number or a string. In general, the system administrator defines the pools and then tells particular groups of people which pool to use. If you're using the Resource Manager, a pool is identified by a number. Pools are entered in the host list file with an at (@) sign, followed by the pool number or name (for instance, @1 or @mypool).

Here's an example of a host list file that specifies pool numbers. Four tasks will run; two on nodes in pool 10 and two on nodes in pool 11. Note that if you're using LoadLeveler, the pools specified in the host list file must all be the same (all tasks must use the same pool). If you're using the Resource Manager, on the other hand, the host list file may contain different pools, as in the example below.

@10
@10
@11
@11

Running POE

Once you've checked all the items in "Before You Start", you're ready to run the Parallel Operating Environment. At this point, you can view POE as a way to run commands and programs on multiple nodes from a single point. It is important to remember that these commands and programs are really running on the remote nodes, and if you ask POE to do something on a remote node, everything necessary to do that thing must be available on that remote node. More on this in a moment.

Note that there are two ways to influence the way your parallel program is executed: with environment variables or with command-line option flags. You can set environment variables at the beginning of your session to influence each program that you execute. You could also get the same effect by specifying the related command-line flag when you invoke POE, but its influence lasts only for that particular program execution. For the most part, this book shows you how to use the command-line option flags to influence the way your program executes. "Running POE with Environment Variables" gives you some high-level information, but you may also want to refer to IBM Parallel Environment for AIX: Operation and Use, Vol. 1 to learn more about using environment variables.

One more thing. In the following sections, we show you how to run POE by requesting nodes via a host list file. Note, however, that you may also request nodes via LoadLeveler or the Resource Manager. Both are covered in more detail in "Who's In Control (SP Users Only)?".

Some Examples of Running POE

The poe command enables you to load and execute programs on remote nodes. The syntax is:

poe [program] [options]

When you invoke poe, it allocates processor nodes for each task and initializes the local environment. It then loads your program and reproduces your local shell environment on each processor node. POE also passes the user program arguments to each remote node.

The simplest thing to do with POE is to run an AIX command. When you try these examples on your system, use a host list file that contains node names (as opposed to a pool number); the reason for this will be discussed a little later. These examples also assume at least a four-node parallel environment. If you have more than four nodes, feel free to use more. If you have fewer than four nodes, it's okay to duplicate lines. This example assumes your file is called host.list and is in the directory from which you're submitting the parallel job. If either of these conditions is not true, POE will not find the host list file unless you use the -hostfile option (covered later...one thing at a time!).

The -procs 4 option tells POE to run this command on four nodes. It will use the first four in the host list file.


$ poe hostname -procs 4
 
202r1n10.hpssl.kgn.ibm.com
202r1n11.hpssl.kgn.ibm.com
202r1n09.hpssl.kgn.ibm.com
202r1n12.hpssl.kgn.ibm.com

What you see is the output from the hostname command run on each of the remote nodes. POE has taken care of submitting the command to each node, collecting the standard output and standard error from each remote node, and sending it back to your workstation. One thing that you don't see is which task is responsible for each line of output. In a simple example like this, it isn't that important, but if you had many lines of output from each node, you'd want to know which task was responsible for each line. To do that, you use the -labelio option:


$ poe hostname -procs 4 -labelio yes
 
1:202r1n10.hpssl.kgn.ibm.com
2:202r1n11.hpssl.kgn.ibm.com
0:202r1n09.hpssl.kgn.ibm.com
3:202r1n12.hpssl.kgn.ibm.com

This time, notice how each line starts with a number and a colon? Notice also that the numbering starts at 0 (zero). The number is the task id of the task that produced the line of output (it also corresponds to the line in the host list file that identifies the host which generated the output). Now we can use this option to identify lines from a command that generates more output.

Try this command:


$ poe cat /etc/motd -procs 2 -labelio yes

You should see something similar to this:

0:*******************************************************************************
0:*                                                                             *
0:*  Welcome to IBM AIX Version 4.3   on pe03.kgn.ibm.com                       *
0:*                                                                             *
0:*******************************************************************************
0:*                                                                             *
0:*     Message of the Day:  Never drink more than 3 Pan                        *
0:*     Galactic Gargle Blasters unless you are a 50 ton maga                   *
0:*     elephant with nemona.                                                   *
0:*                                                                             *
1:*******************************************************************************
1:*                                                                             *
1:*  Welcome to IBM AIX Version 4.3   on pe04.kgn.ibm.com                       *
1:*                                                                             *
1:*******************************************************************************
1:*                                                                             *
1:*                                                                             *
1:*     Message of the Day:  Never drink more than 3 Pan                        *
1:*     Galactic Gargle Blasters unless you are a 50 ton maga                   *
1:*     elephant with nemona.                                                   *
1:*                                                                             *
1:*                                                                             *
1:*******************************************************************************
0:*                                                                             *
0:*                                                                             *
0:*                                                                             *
0:*******************************************************************************

The cat command is listing the contents of the file /etc/motd on each of the remote nodes. But notice how the output from each of the remote nodes is intermingled? This is because as soon as a buffer is full on a remote node, POE sends it back to your workstation for display (in case you had any doubts that these commands were really being executed in parallel). The result is a jumbled mess that can be difficult to interpret. Fortunately, we can ask POE to clear things up with the -stdoutmode option.

Try this command:


$ poe cat /etc/motd -procs 2 -labelio yes -stdoutmode ordered

You should see something similar to this:

0:*******************************************************************************
0:*                                                                             *
0:*  Welcome to IBM AIX Version 4.3   on pe03.kgn.ibm.com                       *
0:*                                                                             *
0:*******************************************************************************
0:*                                                                             *
0:*                                                                             *
0:*     Message of the Day:  Never drink more than 3 Pan                        *
0:*     Galactic Gargle Blasters unless you are a 50 ton maga                   *
0:*     elephant with nemona.                                                   *
0:*                                                                             *
0:*                                                                             *
0:*******************************************************************************
1:*******************************************************************************
1:*                                                                             *
1:*  Welcome to IBM AIX Version 4.3   on pe04.kgn.ibm.com                       *
1:*                                                                             *
1:*******************************************************************************
1:*                                                                             *
1:*                                                                             *
1:*     Message of the Day:  Never drink more than 3 Pan                        *
1:*     Galactic Gargle Blasters unless you are a 50 ton maga                   *
1:*     elephant with nemona.                                                   *
1:*                                                                             *
1:*                                                                             *
1:*******************************************************************************

This time, POE holds onto all the output until the jobs either finish or POE itself runs out of space. If the jobs finish, POE displays the output from each remote node together. If POE runs out of space, it prints everything, and then starts a new page of output. You get less of a sense of the parallel nature of your program, but it's easier to understand. Note that the -stdoutmode option consumes a significant amount of system resources, which may affect performance.

Running POE with Environment Variables

By the way, if you're getting tired of typing the same command line options over and over again, you can set them as environment variables so you don't have to put them on the command line. The environment variable names are the same as the command line option names (without the leading dash), but they start with MP_, all in upper case. For example, the environment variable name for the -procs option is MP_PROCS, and for the -labelio option it's MP_LABELIO. If we set these two variables like this:

$ export MP_PROCS=2
$ export MP_LABELIO=yes

we can then run our cat /etc/motd example with two processes and labeled output, without specifying either on the poe command line.

Try this command:

$ poe cat /etc/motd -stdoutmode ordered

You should see something similar to this:

0:*******************************************************************************
0:*                                                                             *
0:*  Welcome to IBM AIX Version 4.3   on pe03.kgn.ibm.com                       *
0:*                                                                             *
0:*******************************************************************************
0:*                                                                             *
0:*                                                                             *
0:*     Message of the Day:  Never drink more than 3 Pan                        *
0:*     Galactic Gargle Blasters unless you are a 50 ton maga                   *
0:*     elephant with nemona.                                                   *
0:*                                                                             *
0:*                                                                             *
0:*******************************************************************************
1:*******************************************************************************
1:*                                                                             *
1:*  Welcome to IBM AIX Version 4.3   on pe04.kgn.ibm.com                       *
1:*                                                                             *
1:*******************************************************************************
1:*                                                                             *
1:*                                                                             *
1:*     Message of the Day:  Never drink more than 3 Pan                        *
1:*     Galactic Gargle Blasters unless you are a 50 ton maga                   *
1:*     elephant with nemona.                                                   *
1:*                                                                             *
1:*                                                                             *
1:*******************************************************************************

In the example above, notice how the program ran with two processes, and the output was labeled?

Now, just so you can see that your environment variable setting lasts for the duration of your session, try running the command below, without specifying the number of processes or labeled I/O.


$ poe hostname
 
0:202r1n09.hpssl.kgn.ibm.com
1:202r1n10.hpssl.kgn.ibm.com

Notice how the program still ran with two processes and you got labeled output?

Now let's try overriding the environment variables we just set. To do this, we'll use command line options when we run POE. Try running the following command:

$ poe hostname -procs 4 -labelio no
 
202r1n09.hpssl.kgn.ibm.com
202r1n12.hpssl.kgn.ibm.com
202r1n11.hpssl.kgn.ibm.com
202r1n10.hpssl.kgn.ibm.com

This time, notice how the program ran with four processes and the output wasn't labeled? No matter what the environment variables have been set to, you can always override them when you run POE.

To show that this was a temporary override of the environment variable settings, try running the following command again, without specifying any command line options.

$ poe hostname
 
0:202r1n09.hpssl.kgn.ibm.com
1:202r1n10.hpssl.kgn.ibm.com

Once again, the program ran with two processes and the output was labeled.

Compiling (a Little Vogon Poetry)

All this is fine, but you probably have your own programs that you want to (eventually) run in parallel. We'll talk in a little more detail in "The Answer is 42" about creating parallel programs, but right now we'll cover compiling a program for POE. Almost any Fortran, C, or C++ program can be compiled for execution under POE.

According to The Hitchhiker's Guide to the Galaxy, Vogon poetry is the third worst in the Universe. In fact, it's so bad that the Vogons subjected Arthur Dent and Ford Prefect to a poetry reading as a form of torture. Some people may think that compiling a parallel program is just as painful as Vogon poetry, but as you'll see, it's really quite simple.

Before compiling, you should verify that:

See IBM Parallel Environment for AIX: MPI Programming and Subroutine Reference for information on compilation restrictions for POE.

To show you how compiling works, we've selected the Hello World program. Here it is in C:


/************************************************************************
*
* Hello World C Example
*
* To compile:
* mpcc -o hello_world_c hello_world.c
*
************************************************************************/
#include <stdlib.h>
#include <stdio.h>

/* Basic program to demonstrate compilation and execution techniques */
int main()
{
    printf("Hello, World!\n");
    return 0;
}

And here it is in Fortran:


c***********************************************************************
c*
c* Hello World Fortran Example
c*
c* To compile:
c* mpxlf -o hello_world_f hello_world.f
c*
c***********************************************************************
c ------------------------------------------------------------------
c  Basic program to demonstrate compilation and execution techniques
c ------------------------------------------------------------------
      program hello
      implicit none
      write(6,*)'Hello, World!'
      stop
      end

To compile these programs, you just invoke the appropriate compiler script:


$ mpcc -o hello_world_c hello_world.c
 
$ mpxlf -o hello_world_f hello_world.f
** main   === End of Compilation 1 ===
1501-510  Compilation successful for file hello_world.f.

mpcc, mpCC, and mpxlf are POE scripts that link in the parallel libraries that allow your programs to run in parallel; they are for compiling non-threaded programs. Just as there is a version of the cc command, called cc_r, that's used for threaded programs, there is also a script called mpcc_r (along with mpxlf_r and mpCC_r) for compiling threaded message passing programs. mpcc_r generates thread-aware code by linking in the threaded version of MPI, including the threaded VT and POE utility libraries. These threaded libraries are located in the same subdirectory as the non-threaded libraries.
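For example, to build a threaded version of the C program shown earlier, you would just use the threaded script instead:

$ mpcc_r -o hello_world_c hello_world.c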

All the compiler scripts accept all the same options that the non-parallel compilers do, as well as some options specific to POE. For a complete list of all parallel-specific compilation options, see IBM Parallel Environment for AIX: Operation and Use, Vol. 1.

Running the mpcc, mpCC, mpxlf, mpcc_r, mpCC_r, or mpxlf_r script, as we've shown you, creates an executable version of your source program that takes advantage of POE. However, before POE can run your program, you need to make sure it's accessible on each remote node. You can do this either by copying it there, or by mounting the file system that your program is in on each remote node.
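If your host list file contains node names (not pool numbers), one quick, if crude, way to copy the executable out is with a shell loop over those names. This is just a sketch; the target directory (here a made-up home directory, /u/hitch) must already exist on each node:

$ for i in `cat host.list`; do rcp hello_world_c $i:/u/hitch; done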

Figure 1. Output from mpcc/mpxlf

Here's the output of the C program (threaded or non-threaded):

$ poe hello_world_c -procs 4
 
Hello, World!
Hello, World!
Hello, World!
Hello, World!

And here's the output of the Fortran program:

$ poe hello_world_f -procs 4
 
Hello, World!
Hello, World!
Hello, World!
Hello, World!

POE Options

There are a number of options (command line flags) that you may want to specify when invoking POE. These options are covered in full detail in IBM Parallel Environment for AIX: Operation and Use, Vol. 1, but here are the ones you'll most likely need to be familiar with at this stage.

-procs

When you set -procs, you're telling POE how many tasks your program will run. You can also set the MP_PROCS environment variable to do this (-procs can be used to temporarily override it).

-hostfile or -hfile

The default host list file used by POE to allocate nodes is called host.list. You can specify a file other than host.list by setting the -hostfile or -hfile options when invoking POE. You can also set the MP_HOSTFILE environment variable to do this (-hostfile and -hfile can be used to temporarily override it).

-labelio

You can set the -labelio option when invoking POE so that the output from the parallel tasks of your program is labeled by task id. This becomes especially useful when you're running a parallel program and your output is unordered. With labeled output, you can easily determine which task returned which message.

You can also set the MP_LABELIO environment variable to do this (-labelio can be used to temporarily override it).

-infolevel or -ilevel

You can use the -infolevel or -ilevel options to specify the level of messages you want from POE. There are different levels of informational, warning, and error messages, plus several debugging levels. Note that the -infolevel option consumes a significant amount of system resources. Use it with care. You can also set the MP_INFOLEVEL environment variable to do this (-infolevel and -ilevel can be used to temporarily override it).

-pmdlog

The -pmdlog option lets you specify that diagnostic messages should be logged to a file in /tmp on each of the remote nodes of your partition. These diagnostic logs are particularly useful for isolating the cause of abnormal termination. Note that the -pmdlog option consumes a significant amount of system resources. Use it with care. You can also set the MP_PMDLOG environment variable to do this (-pmdlog can be used to temporarily override it).

-stdoutmode

The -stdoutmode option lets you specify how you want the output data from each task in your program to be displayed. When you set this option to ordered, the output data from each parallel task is written to its own buffer, and later all buffers are flushed, in task order, to STDOUT. We showed you how this works in some of the examples in this section. Note that using the -stdoutmode option consumes a significant amount of system resources. Use it with care. You can also set the MP_STDOUTMODE environment variable to do this (-stdoutmode can be used to temporarily override it).

Who's In Control (SP Users Only)?

So far, we've explicitly specified to POE the set of nodes on which to run our parallel application. We did this by creating a list of hosts in a file called host.list, in the directory from which we submitted the parallel job. In the absence of any other instructions, POE selected host names out of this file until it had as many as the number of processes we told POE to use (with the -procs option).

Another way to tell POE which hosts to use is with a job management system (LoadLeveler or the Resource Manager). LoadLeveler can be used to manage jobs on a networked cluster of RS/6000 workstations, which may or may not include nodes of an IBM RS/6000 SP. If you're using LoadLeveler to manage your jobs, skip ahead to "Using LoadLeveler to Manage Your Jobs". The Resource Manager, on the other hand, is only used to manage jobs on an IBM RS/6000 SP (running PSSP 2.3 or 2.4). If you're using the Resource Manager on an SP, skip ahead to "Using the Resource Manager to Manage Your Jobs". If you don't know what you're using to manage your jobs, check with your system administrator.

For information on indicating whether you are using the Resource Manager or LoadLeveler to specify hosts, see IBM Parallel Environment for AIX: Operation and Use, Vol. 1.

Note that this section discusses only the basics of node allocation; it doesn't address performance considerations. See IBM Parallel Environment for AIX: Operation and Use, Vol. 1 for information on maximizing your program's performance.

Managing Your Jobs

Using LoadLeveler to Manage Your Jobs

LoadLeveler is used to allocate nodes, one job at a time. This is necessary if your parallel application is communicating directly over the SP Switch. With the -euilib command line option (or the MP_EUILIB environment variable), you can specify how you want to do message passing. This option lets you specify the message passing subsystem library implementation, IP or User Space (US), that you wish to use. See IBM Parallel Environment for AIX: Operation and Use, Vol. 1 for more information. With LoadLeveler, you can also dedicate the parallel nodes to a single job, so there's no conflict or contention for resources. LoadLeveler allocates nodes from either the host list file, or from a predefined pool, which the System Administrator usually sets up.

Using the Resource Manager to Manage Your Jobs

The Resource Manager is used to allocate nodes, one job at a time, on an RS/6000 SP (running PSSP 2.3 or 2.4 only). This is necessary if your parallel application is communicating directly over the SP Switch. With the -euilib command line option (or the MP_EUILIB environment variable), you can specify how you want to do message passing. This option lets you specify the message passing subsystem library implementation, IP or User Space (US), that you wish to use. See IBM Parallel Environment for AIX: Operation and Use, Vol. 1 for more information. It's also a convenient mechanism for dedicating the parallel nodes to a single job, so there's no conflict or contention for resources. The Resource Manager allocates nodes from either the host list file, or from a predefined pool, which the System Administrator usually sets up.

How Are Your SP Nodes Allocated?

So how do you know who's allocating the nodes and where they're being allocated from? First of all, you must always have a host list file (or use the MP_RMPOOL environment variable or -rmpool command line option). As we've already mentioned, the default for the host list file is a file named host.list in the directory from which the job is submitted. This default may be overridden by the -hostfile command line option or the MP_HOSTFILE environment variable. For example, the following command:

$ poe hostname -procs 4 -hostfile $HOME/myHosts

would use a file called myHosts, located in the home directory. If the value of the -hostfile parameter does not start with a slash (/), it is taken as relative to the current directory. If the value starts with a slash (/), it is taken as a fully-qualified file name.

A System Administrator defines pools differently, depending on whether you will be using LoadLeveler or the Resource Manager to submit jobs. For specific examples of how a System Administrator defines pools, see IBM LoadLeveler for AIX: Using and Administering (SA22-7311). Note, however, that there's another way to designate the pool on which you want your program to run. If myHosts didn't contain any pool numbers, you could use the MP_RMPOOL environment variable or the -rmpool command line option to specify a pool.

Note: If a host list file is used, it will override anything you specify with the MP_RMPOOL environment variable or the -rmpool command line option. You must set MP_HOSTFILE or -hostfile to NULL in order for MP_RMPOOL or -rmpool to work.

For more information about the MP_RMPOOL environment variable or the -rmpool command line option, see IBM Parallel Environment for AIX: Operation and Use, Vol. 1.
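Putting that note into practice, a pool-based run with no host list file at all might look like this (pool 1 is just an example; use whatever pool your administrator gave you):

$ export MP_HOSTFILE=NULL
$ poe hostname -procs 4 -rmpool 1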

If the myHosts file contains actual host names, but you want to use the SP Switch directly for communication, the job management system (LoadLeveler or Resource Manager) will only allocate the nodes that are listed in myHosts. The job management system you're using (LoadLeveler or the Resource Manager) keeps track of which parallel jobs are using the switch. When using LoadLeveler, more than one job at a time may use the switch, so LoadLeveler makes sure that only the allowed number of tasks actually use it. If the host list file contains actual host names, but you don't want to use the SP Switch directly for communication, POE allocates the nodes from those listed in the host list file.

When using the Resource Manager only one parallel job at a time can use the switch directly, and the Resource Manager will make sure that a node is allocated to only one job at a time.

As we said before, you can't have both host names and pool IDs in the same host list file.

Your program executes exactly the same way, regardless of whether POE or the job management system (LoadLeveler or Resource Manager) allocated the nodes. In the following example, the host list file contains a pool number which causes the job management system to allocate nodes. However, the output is identical to Figure 1, where POE allocated the nodes from the host list file.


$ poe hello_world_c -procs 4 -hostfile pool.list
 
Hello, World!
Hello, World!
Hello, World!
Hello, World!

So, if the output looks the same regardless of how your nodes are allocated, how do you skeptics know whether LoadLeveler or the Resource Manager was really used? Well, POE knows a lot that it ordinarily doesn't tell you. If you coax it with the -infolevel option, POE will tell you more than you ever wanted to know. Read on...

Getting a Little More Information

You can control the level of messages you get from POE as your program executes by using the -infolevel option of POE. The default setting is 1 (normal), which says that warning and error messages from POE will be written to STDERR. However, you can use this option to get more information about how your program executes. For example, with -infolevel set to 2, you see a couple of different things. First, you'll see a message that says POE has connected to the job management system you're using (LoadLeveler or the Resource Manager). Following that, you'll see messages that indicate which nodes the job management system passed back to POE for use.

For a description of the various -infolevel settings, see IBM Parallel Environment for AIX: Operation and Use, Vol. 1.

Here's the Hello World program again:


$ poe hello_world_c -procs 2 -hostfile pool.list -labelio yes -infolevel 2

You should see output similar to the following:

INFO: 0031-364  Contacting LoadLeveler to set and query information 
                for interactive job
INFO: 0031-119  Host k54n05.ppd.pok.ibm.com allocated for task 0
INFO: 0031-119  Host k54n01.ppd.pok.ibm.com allocated for task 1
   0:INFO: 0031-724  Executing program: <hello_world_c>
   1:INFO: 0031-724  Executing program: <hello_world_c>
   0:Hello, World!
   1:Hello, World!
   0:INFO: 0031-306  pm_atexit: pm_exit_value is 0.
   1:INFO: 0031-306  pm_atexit: pm_exit_value is 0.
INFO: 0031-656  I/O file STDOUT closed by task 0
INFO: 0031-656  I/O file STDERR closed by task 0
INFO: 0031-251  task 0 exited: rc=0
INFO: 0031-656  I/O file STDOUT closed by task 1
INFO: 0031-656  I/O file STDERR closed by task 1
INFO: 0031-251  task 1 exited: rc=0
INFO: 0031-639  Exit status from pm_respond = 0

With -infolevel set to 2, you also see messages from each node that indicate the executable they're running and what the return code from the executable is. In the example above, you can differentiate between the -infolevel messages that come from POE itself and the messages that come from the remote nodes, because the remote nodes are prefixed with their task ID. If we didn't set -infolevel, we would see only the output of the executable (Hello, World!, in the example above), interspersed with POE output from the remote nodes.

With -infolevel set to 3, you get even more information. In the following example, we invoke POE with the same pool-based host list file as before.

Look at the output, below. In this case, POE tells us that it's opening the host list file, the nodes it allocated for the job (along with their Internet addresses), the parameters to the executable being run, and the values of some of the POE parameters.


$ poe hello_world_c -procs 2 -hostfile pool.list -labelio yes -infolevel 3

You should see output similar to the following:

INFO: DEBUG_LEVEL changed from 0 to 1
D1<L1>: Open of file ./pool.list successful
D1<L1>: mp_euilib = ip
D1<L1>: task 0 5 1
D1<L1>: extended 1 5 1
D1<L1>: node allocation strategy = 2
INFO: 0031-364  Contacting LoadLeveler to set and query information for interactive job
D1<L1>: Job Command String:
#@ job_type = parallel
#@ environment = COPY_ALL
#@ requirements = (Pool == 1)
#@ node = 2
#@ total_tasks = 2
#@ node_usage = not_shared
#@ network.mpi = en0,not_shared,ip
#@ class = Inter_Class
#@ queue
INFO: 0031-119  Host k54n05.ppd.pok.ibm.com allocated for task 0
INFO: 0031-119  Host k54n08.ppd.pok.ibm.com allocated for task 1
D1<L1>: Spawning /etc/pmdv2 on all nodes
D1<L1>: Socket file descriptor for task 0 (k54n05.ppd.pok.ibm.com) is 6
D1<L1>: Socket file descriptor for task 1 (k54n08.ppd.pok.ibm.com) is 7
D1<L1>: Jobid = 900549356
   0:INFO: 0031-724  Executing program: <hello_world_c>
   1:INFO: 0031-724  Executing program: <hello_world_c>
   0:INFO: DEBUG_LEVEL changed from 0 to 1
   0:D1<L1>: mp_euilib is <ip>
   0:D1<L1>: Executing _mp_init_msg_passing() from mp_main()...
   0:D1<L1>: cssAdapterType is <1>
   1:INFO: DEBUG_LEVEL changed from 0 to 1
   1:D1<L1>: mp_euilib is <ip>
   1:D1<L1>: Executing _mp_init_msg_passing() from mp_main()...
   1:D1<L1>: cssAdapterType is <1>
D1<L1>: init_data for task 0: <129.40.148.69:38085>
D1<L1>: init_data for task 1: <129.40.148.72:38272>
   0:D1<L1>: mp_css_interrupt is <0>
   0:D1<L1>: About to call mpci_connect
   1:D1<L1>: mp_css_interrupt is <0>
   1:D1<L1>: About to call mpci_connect
   1:D1<L1>: Elapsed time for mpci_connect: 0 seconds
   1:D1<L1>: _css_init: adapter address = 00000000
   1:
   1:D1<L1>: _css_init: rc from HPSOclk_init is 0
   1:
   1:D1<L1>: About to call _ccl_init
   0:D1<L1>: Elapsed time for mpci_connect: 0 seconds
   0:D1<L1>: _css_init: adapter address = 00000000
   0:
   1:D1<L1>: Elapsed time for _ccl_init: 0 seconds
   0:D1<L1>: _css_init: rc from HPSOclk_init is 1
   0:
   0:D1<L1>: About to call _ccl_init
   0:D1<L1>: Elapsed time for _ccl_init: 0 seconds
   0:Hello, World!
   1:Hello, World!
   1:INFO: 0031-306  pm_atexit: pm_exit_value is 0.
   0:INFO: 0031-306  pm_atexit: pm_exit_value is 0.
INFO: 0031-656  I/O file STDOUT closed by task 0
INFO: 0031-656  I/O file STDOUT closed by task 1
D1<L1>: Accounting data from task 1 for source 1:
D1<L1>: Accounting data from task 0 for source 0:
INFO: 0031-656  I/O file STDERR closed by task 1
INFO: 0031-656  I/O file STDERR closed by task 0
INFO: 0031-251  task 1 exited: rc=0
INFO: 0031-251  task 0 exited: rc=0
D1<L1>: All remote tasks have exited: maxx_errcode = 0
INFO: 0031-639  Exit status from pm_respond = 0
D1<L1>: Maximum return code from user = 0

The -infolevel messages give you more information about what's happening on the home node, but if you want to see what's happening on the remote nodes, you need to use the -pmdlog option. If you set -pmdlog to a value of yes, a log is written to each of the remote nodes that tells you what POE did while running each task.

If you issue the following command, a file called mplog.pid.taskid is written in /tmp on each remote node:

$ poe hello_world -procs 4 -pmdlog yes

If -infolevel is set high enough, the process number will be displayed in the output. If you don't know what the process number is, your log is probably the most recent one. If you're sharing the node with other POE users, it will be one of the most recent log files (but you own the file, so you should be able to tell which is yours).
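For example, to look for the most recent logs on one of the remote nodes from our earlier examples, you could issue something like:

$ rsh 202r1n10 "ls -t /tmp/mplog.*"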

Here's a sample log file:

AIX Parallel Environment pmd2 version @(#) 95/06/22 10:53:26
The ID of this process is 14734
The hostname of this node is k6n05.ppd.pok.ibm.com
The taskid of this task is 0
HOMENAME: k6n05.ppd.pok.ibm.com
USERID: 1063
USERNAME: vt
GROUPID: 1
GROUPNAME: staff
PWD: /u/vt/hughes
PRIORITY: 0
NPROCS: 2
PMDLOG: 1
NEWJOB: 0
PDBX: 0
AFSTOKEN: 5765-144 AIX Parallel Environment
LIBPATH: /usr/lpp/ppe.poe/lib:/usr/lpp/ppe.poe/lib/ip:/usr/lib
ENVC recv'd
envc: 23
envc is 23
env[0] = _=/bin/poe
env[1] = LANG=En_US
env[2] = LOGIN=vt
env[3] = NLSPATH=/usr/lib/nls/msg/%L/%N:/usr/lib/nls/msg/%L/%N.cat
env[4] = PATH=/bin:/usr/bin:/etc:/usr/ucb:/usr/sbin:/usr/bin/X11:.:
env[5] = LC__FASTMSG=true
env[6] = LOGNAME=vt
env[7] = MAIL=/usr/spool/mail/vt
env[8] = LOCPATH=/usr/lib/nls/loc
env[9] = USER=vt
env[10] = AUTHSTATE=compat
env[11] = SHELL=/bin/ksh
env[12] = ODMDIR=/etc/objrepos
env[13] = HOME=/u/vt
env[14] = TERM=aixterm
env[15] = MAILMSG=[YOU HAVE NEW MAIL]
env[16] = PWD=/u/vt/hughes
env[17] = TZ=EST5EDT
env[18] = A__z=! LOGNAME
env[19] = MP_PROCS=2
env[20] = MP_HOSTFILE=host.list.k6
env[21] = MP_INFOLEVEL=2
env[22] = MP_PMDLOG=YES
Initial data msg received and parsed
Info level = 2
User validation complete
About to do user root chk
User root check complete
SSM_PARA_NODE_DATA msg recv'd
  0: 129.40.84.69: k6n05.ppd.pok.ibm.com:  -1
  1: 129.40.84.70: k6n06.ppd.pok.ibm.com:  -1
node map parsed
newjob is 0.
msg read, type is 13
string = <JOBID 804194891
hello_world_c >
SSM_CMD_STR recv'd
JOBID id 804194891
command string is <hello_world_c >
pm_putargs: argc = 1, k = 1
SSM_CMD_STR parsed
child pipes created
child: pipes successfully duped
child: MP_CHILD = 0
partition id is <31>
child: after initgroups (*group_struct).gr_gid = 100
child: after initgroups (*group_struct).gr_name = 1
fork completed
parent: my child's pid is 15248
attach data sent
pmd child: core limit is 1048576, hard limit is 2147483647
pmd child: rss limit is 33554432, hard limit is 2147483647
pmd child: stack limit is 33554432, hard limit is 2147483647
pmd child: data segment limit is 134217728, hard limit is 2147483647
pmd child: cpu time limit is 2147483647, hard limit is 2147483647
pmd child: file size limit is 1073741312, hard limit is 1073741312
child: (*group_struct).gr_gid = 1
child: (*group_struct).gr_name = staff
child: userid, groupid and cwd set!
child: current directory is /u/vt/hughes
child: about to start the user's program
child: argument list:
argv[0] = hello_world_c
argv[1] (in hex) = 0
child: environment:
env[0] = _=/bin/poe
env[1] = LANG=En_US
env[2] = LOGIN=vt
env[3] = NLSPATH=/usr/lib/nls/msg/%L/%N:/usr/lib/nls/msg/%L/%N.cat
env[4] = PATH=/bin:/usr/bin:/etc:/usr/ucb:/usr/sbin:/usr/bin/X11:.:
env[5] = LC__FASTMSG=true
env[6] = LOGNAME=vt
env[7] = MAIL=/usr/spool/mail/vt
env[8] = LOCPATH=/usr/lib/nls/loc
env[9] = USER=vt
env[10] = AUTHSTATE=compat
env[11] = SHELL=/bin/ksh
env[12] = ODMDIR=/etc/objrepos
env[13] = HOME=/u/vt
env[14] = TERM=aixterm
env[15] = MAILMSG=[YOU HAVE NEW MAIL]
env[16] = PWD=/u/vt/hughes
env[17] = TZ=EST5EDT
env[18] = A__z=! LOGNAME
env[19] = MP_PROCS=2
env[20] = MP_HOSTFILE=host.list.k6
env[21] = MP_INFOLEVEL=2
env[22] = MP_PMDLOG=YES
child: LIBPATH = /usr/lpp/ppe.poe/lib:/usr/lpp/ppe.poe/lib/ip:/usr/lib
select: rc = 1
pulse is on, curr_time is 804180935, send_time is 0, select time is 600
pulse sent at 804180935
count = 51 on stderr
pmd parent: STDERR read OK:
STDERR: INFO: 0031-724  Executing program: <hello_world_c>
select: rc = 1
pulse is on, curr_time is 804180935, send_time is 804180935, select
time is 600
SSM type = 34
STDIN:
select: rc = 1
pulse is on, curr_time is 804180935, send_time is 804180935,
select time is 600
pmd parent: cntl pipe read OK:
pmd parent: type: 26, srce: 0, dest: -2, bytes: 6
parent: SSM_CHILD_PID: 15248
select: rc = 1
pulse is on, curr_time is 804180935, send_time is 804180935,
select time is 600
pmd parent: cntl pipe read OK:
pmd parent: type: 23, srce: 0, dest: -1, bytes: 18
select: rc = 1
pulse is on, curr_time is 804180935, send_time is 804180935,
select time is 600
SSM type = 29
STDIN: 129.40.84.69:1257
129.40.84.70:1213
select: rc = 1
pulse is on, curr_time is 804180935, send_time is 804180935,
select time is 600
pmd parent: cntl pipe read OK:
pmd parent: type: 44, srce: 0, dest: -1, bytes: 2
select: rc = 1
pulse is on, curr_time is 804180935, send_time is 804180935,
select time is 600
SSM type = 3
STDIN:
select: rc = 1
pulse is on, curr_time is 804180936, send_time is 804180935,
select time is 600
pmd parent: STDOUT read OK
STDOUT: Hello, World!
select: rc = 1
pulse is on, curr_time is 804180936, send_time is 804180935,
select time is 599
count = 65 on stderr
pmd parent: STDERR read OK:
STDERR: INFO: 0033-3075 VT Node Tracing completed.  Node merge beginning
select: rc = 1
pulse is on, curr_time is 804180936, send_time is 804180935,
select time is 599
count = 47 on stderr
pmd parent: STDERR read OK:
STDERR: INFO: 0031-306  pm_atexit: pm_exit_value is 0.
select: rc = 1
pulse is on, curr_time is 804180936, send_time is 804180935,
select time is 599
pmd parent: cntl pipe read OK:
pmd parent: type: 17, srce: 0, dest: -1, bytes: 2
select: rc = 1
pulse is on, curr_time is 804180936, send_time is 804180935,
select time is 599
SSM type = 5
STDIN: 5
select: rc = 1
pulse is on, curr_time is 804180936, send_time is 804180935,
select time is 599
in pmd signal handler
wait status is 00000000
exiting child pid = 15248
err_data is 0
select: rc = 2
pulse is on, curr_time is 804180936, send_time is 804180935,
select time is 599
count = 0 on stderr
child exited and all pipes closed
err_data is 0
pmd_exit reached!, exit code is 0

Appendix A. "A Sample Program to Illustrate Messages" includes an example of setting -infolevel to 6, and explains the important lines of output.

