
IBM PE for AIX V2R4.0: Operation and Use, Vol. 1


Executing Parallel Programs

This chapter describes the Parallel Operating Environment (POE). POE is a simple and friendly environment designed to ease the transition from serial to parallel application development and execution. POE lets you develop and run parallel programs using many of the same methods and mechanisms as you would for serial jobs. POE allows you to continue to use the standard UNIX and AIX application development and execution techniques with which you are already familiar. For example, you can redirect input and output, pipe the output of programs into more or grep, write shell scripts to invoke parallel programs, and use shell tools such as history. You do all these in just the same way you would for serial programs. So while the concepts and approach to writing parallel programs must necessarily be different, POE makes your working environment as familiar as possible.

This chapter describes the steps involved in compiling and executing your parallel C, C++, or Fortran programs using either an IBM RS/6000 SP, an RS/6000 network cluster, or a mixed system.


Executing Parallel Programs Using POE

This section discusses how to compile and execute your parallel C, C++, or Fortran programs. It leaves out the first step in any application's life cycle, which is actually writing the program. For information on writing parallel programs, refer to IBM Parallel Environment for AIX: MPI Programming and Subroutine Reference; IBM Parallel Environment for AIX: MPL Programming and Subroutine Reference; IBM Parallel Environment for AIX: Hitchhiker's Guide; and IBM Parallel System Support Programs for AIX: Command and Technical Reference.
Note: If you are using POE for the first time, check that you have authorized access. See IBM Parallel Environment for AIX: Installation for information on setting up users.

In order to execute an MPI, MPL, or LAPI parallel program, you need to:

  1. Compile and link the program using shell scripts or makefiles which call the C, C++, or Fortran compilers while linking in the Partition Manager interface and message passing subroutines.

  2. Copy your executable to the individual nodes in your partition if it is not accessible to the remote nodes.

  3. Set up your execution environment. This includes setting the number of tasks, and determining the method of node allocation.

  4. Optionally start either or both of the POE X-Windows analysis tools - the Program Marker Array and the System Status Array.

  5. Load and execute the parallel program on the processor nodes of your partition.

Step 1: Compile the Program

As with a serial application, you must compile a parallel C, C++, or Fortran program before you can run it. Instead of using the cc, xlC, or xlf commands, however, you use the commands mpcc, mpCC, or mpxlf. The mpcc, mpCC, and mpxlf commands not only compile your program, but also link in the Partition Manager and message passing interface libraries. When you later invoke the program, the subroutines in these libraries enable the home node Partition Manager to communicate with the parallel tasks, and the tasks with each other. To compile threaded C, C++, or Fortran programs, use the mpcc_r, mpCC_r, or mpxlf_r commands. These commands can also be used to compile non-threaded programs with the threaded libraries.

To compile programs with the checkpoint/restart capability, use the mpcc_chkpt, mpCC_chkpt, or mpxlf_chkpt commands. See IBM Parallel Environment for AIX: Hitchhiker's Guide for an overview of checkpointing and restarting POE programs. For specific details, see the section later in this chapter, "Checkpointing and Restarting Programs".

These compiler commands are actually shell scripts which call the appropriate compiler. You can use any of the cc, xlC, or xlf flags on these commands.

The following table shows what to enter to compile a program depending on the language in which it is written. For more information on these commands, see Appendix A. "Parallel Environment Commands".
To:                                                   Enter:
Compile a C program.                                  mpcc program.c -o program
Compile a C++ program.                                mpCC program.C -o program
Compile a Fortran program.                            mpxlf program.f -o program
Compile a C program which uses threaded MPI.          mpcc_r program.c -o program
Compile a C++ program which uses threaded MPI.        mpCC_r program.C -o program
Compile a Fortran program which uses threaded MPI.    mpxlf_r program.f -o program

In each case, a communication subsystem library implementation will be dynamically linked when the executable is invoked.

Notes:

  1. Be sure to specify the -g flag when compiling a program for use with one of the parallel debuggers or with VT. The -g flag is a standard compiler flag that produces an object file with symbol table references. This object file is needed by the debuggers and by VT's Source Code view. For more information on the -g option, refer to its use on the cc command as described in IBM AIX Version 4 Commands Reference.

  2. Application programs compiled for use with POE are limited to eight (8) data segments. The -bmaxdata option cannot specify any more than 0x80000000.
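As a quick illustration of the compile commands shown above, the following shell sketch picks the POE compile script by source file suffix and only echoes the resulting command, so it runs even where POE is not installed (program.c is a hypothetical file name; nothing is actually compiled):

```shell
# Choose the POE compile script from the source suffix; echo the
# command instead of running it, so no compiler is required here.
src=program.c
case $src in
  *.c) cmd=mpcc  ;;   # C
  *.C) cmd=mpCC  ;;   # C++
  *.f) cmd=mpxlf ;;   # Fortran
esac
echo "$cmd $src -o ${src%.*}"
```

This prints mpcc program.c -o program, matching the first row of the table; add the -g flag to the command when you plan to use the parallel debuggers or VT.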

Creating a Static Executable

Note: We discourage you from creating statically bound executables with POE. If service is ever applied that affects any of the Parallel Environment libraries, you will need to recompile your application to create a new executable that will work with the new libraries. This can mean considerable rework, and exposes you to problems that are avoided when dynamic libraries are used.

In general, to create a static executable, do the following:

  1. Create an object file of your program using cc, xlf, or xlC. For your threaded program, use cc_r or xlC_r. Include the path /usr/lpp/ppe.poe/include to the message passing include files in your compilation statement. For example:
    cc -c myprog.c -I/usr/lpp/ppe.poe/include
    

  2. Using ld, create the executable.

The following shows how you create a C, C++, or Fortran static executable for IP or US.
Note: When you see ld, l represents a lower case L. When you see -bI, I represents an upper case i.

To create a C or C++ static executable:

  For IP, enter:
    ld -o myprog myprog.o /lib/crt0.o -binitfini:poe_remote_main -bnso -lmpci -lmpi -lvtd -lc -lppe -bI:/lib/syscalls.exp -L/usr/lpp/ppe.poe/lib -L/usr/lpp/ppe.poe/lib/ip

  For US (SP only), enter:
    ld -o myprog myprog.o /lib/crt0.o -binitfini:poe_remote_main -bnso -bI:/usr/lpp/ssp/css/libus/fs_ext.exp -lmpci -lmpi -lvtd -lc -lppe -bI:/lib/syscalls.exp -L/usr/lpp/ppe.poe/lib -L/usr/lpp/ppe.poe/lib/us

To create a Fortran static executable:

  For IP, enter:
    ld -o myprog myprog.o /lib/crt0.o -binitfini:poe_remote_main -bnso -lmpci -lmpi -lvtd -lxlf90 -lxlf -lc -lppe -bI:/lib/syscalls.exp -L/usr/lpp/ppe.poe/lib -L/usr/lpp/ppe.poe/lib/ip

  For US (SP only), enter:
    ld -o myprog myprog.o /lib/crt0.o -binitfini:poe_remote_main -bnso -bI:/usr/lpp/ssp/css/libus/fs_ext.exp -lmpci -lmpi -lvtd -lxlf90 -lxlf -lc -lppe -bI:/lib/syscalls.exp -L/usr/lpp/ppe.poe/lib -L/usr/lpp/ppe.poe/lib/us

To create a C or C++ static executable which uses threaded MPI:

  For IP, enter:
    ld -o myprog myprog.o /lib/crt0_r.o -binitfini:poe_remote_main -bnso -lmpi_r -lvtd_r -lc_r -lppe_r -lpthreads -lmpci_r -lc /usr/lib/libc.a -bI:/lib/threads.exp -bI:/lib/syscalls.exp -L/usr/lib/threads -L/usr/lpp/ppe.poe/lib -L/usr/lpp/ppe.poe/lib/ip

  For US (SP only), enter:
    ld -o myprog myprog.o /lib/crt0_r.o -binitfini:poe_remote_main -bnso -bI:/usr/lpp/ssp/css/libus/fs_ext.exp -lmpi_r -lvtd_r -lc_r -lppe_r -lpthreads -lmpci_r -lc /usr/lib/libc.a -bI:/lib/threads.exp -bI:/lib/syscalls.exp -L/usr/lib/threads -L/usr/lpp/ppe.poe/lib -L/usr/lpp/ppe.poe/lib/us

To create a Fortran static executable which uses threaded MPI:

  For IP, enter:
    ld -o myprog myprog.o /lib/crt0_r.o -binitfini:poe_remote_main -bnso -lmpci_r -lmpi_r -lvtd_r -lxlf90_r -lc_r -lppe_r -lc -lpthreads /usr/lib/libc.a -bI:/lib/syscalls.exp -bI:/lib/threads.exp -L/usr/lib/threads -L/usr/lpp/ppe.poe/lib -L/usr/lpp/ppe.poe/lib/ip

  For US (SP only), enter:
    ld -o myprog myprog.o /lib/crt0_r.o -binitfini:poe_remote_main -bnso -bI:/usr/lpp/ssp/css/libus/fs_ext.exp -lmpci_r -lmpi_r -lvtd_r -lxlf90_r -lc_r -lppe_r -lc -lpthreads /usr/lib/libc.a -bI:/lib/syscalls.exp -bI:/lib/threads.exp -L/usr/lpp/ppe.poe/lib -L/usr/lpp/ppe.poe/lib/us

Notes:

  1. Users of PE 2.1 and 2.2 who have made references to any crt0 in /usr/lpp/ppe.poe/lib (for example, users who create statically bound executables) and who wish to recompile using PE 2.4 should do the following:

    1. References to any crt0 in /usr/lpp/ppe.poe/lib should be changed to the desired crt0 in /lib or /usr/lpp/xlC/lib.

    2. The -binitfini:poe_remote_main binder option should be added to the compile or ld statement.

  2. POE compile scripts utilize the -binitfini binder option. As a result, POE programs have a priority default of zero. If other user applications are using the initfini binder option, they should only specify a priority in the range of 1 to 2,147,483,647.

  3. If you try to create a US static executable on the SP control workstation, and the ld command fails because it cannot find the mpci library file, it is possible that a link needs to be set by your system administrator. Refer to IBM Parallel Environment for AIX: Installation for instructions on installing PE on the SP control workstation.

  4. On a cluster, you can create an IP static executable only. The US libraries are only shipped with an SP system.

  5. When creating a Fortran static executable, include the xlf90 and xlf libraries in the ld command after the -lvtd statement.

  6. To use threads and Fortran, you should have Fortran Release 4.1.0.1 or later.

  7. To create a static executable of a program which uses LAPI subroutines, see "Understanding and Using the Communications Low-Level Application Programming Interface (LAPI)" in IBM Parallel System Support Programs for AIX: Administration Guide.

Step 2: Copy Files to Individual Nodes

Note: You only need to perform this step if your executable, your data files, and (if you plan to use pdbx) your source code files are not in a commonly accessed, or shared, file system. If running on an SP system, you can also skip this step if the needed files are part of a file collection which is distributed automatically. For information on using file collections, see IBM Parallel System Support Programs for AIX: Administration Guide. For more information on the parallel debuggers, see IBM Parallel Environment for AIX: Operation and Use, Volume 2, Tools Reference.

If the program you are running is in a shared file system, the Partition Manager loads a copy of your executable on each processor node in your partition when you invoke a program. If your executable is in a private file system, however, you must copy it to the nodes in your partition. If you plan to use the parallel debugger pdbx, you must copy your source files to all nodes as well. You can easily copy files to nodes using the mprcp command. You simply pass mprcp the name of the host list file you are using to define your partition and the absolute path name of the file.

For example, to send a copy of program to all the processor nodes listed in host.list in your current directory:

ENTER
mprcp host.list $PWD/program

* The mprcp command copies program to each of the nodes listed in host.list using the rcp command. If a program of the same name already exists, the mprcp command will overwrite it.

For more information on the rcp command, refer to IBM AIX Version 4 Commands Reference. For more information on the mprcp command, see Appendix A. "Parallel Environment Commands".

You can also copy your executable to each node with the mcp command. There is an advantage in using mcp over mprcp in that mcp copies large programs faster. mcp uses the message passing facilities of the Parallel Environment to copy a file from a file system on the home node to a remote node file system. For example, assume that your executable program is on a mounted file system (/u/edgar/somedir/myexecutable), and you want to make a private copy in /tmp on each node in host.list.

ENTER
mcp /u/edgar/somedir/myexecutable /tmp/myexecutable -procs n

Note: If you load your executable from a mounted file system, you may experience an initial delay while the program is being initialized on all nodes. You may experience this delay even after the program begins executing, because individual pages of the program are brought in on demand. This is particularly apparent during initialization of the message passing interface; since individual nodes are synchronized, there are simultaneous demands on the network file transfer system. You can minimize this delay by copying the executable to a local file system on each node, using the mcp message passing file copy program.
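The copying that mprcp performs can be sketched with ordinary shell tools: read each non-comment, non-blank line of the host list and issue one rcp per node. In this runnable sketch the rcp command is only echoed rather than run, and the host names are the placeholders used elsewhere in this chapter:

```shell
# Build a small demonstration host list, then print the rcp command
# that would be issued for each node (rcp is echoed, not executed).
cat > demo.list <<'EOF'
! demonstration host list

host1_name
host2_name
EOF
grep -v '^[!*]' demo.list | grep -v '^[[:space:]]*$' |
while read host; do
  echo "rcp program $host:/tmp/program"
done
```

The real mprcp command uses the absolute path name you supply on both the local and remote sides; this sketch shows only the per-node fan-out.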

Step 3: Set Up the Execution Environment

This step contains the following sections:

Before invoking your program, you need to set up your execution environment. There are a number of POE environment variables discussed throughout this book and summarized in Appendix B. "POE Environment Variables and Command-Line Flags". Any of these environment variables can be set at this time to later influence the execution of parallel programs. This step covers those environment variables most important for successful invocation of a parallel program. When you invoke a parallel program, your home node Partition Manager checks these environment variables to determine:

Note: If you are using an SP system and plan to use its high performance switch adapter for communication, each node has been configured by your system administrator for communication using the IP communication subsystem, the US communication subsystem, or both. Any node you request through specific or non-specific node allocation must be configured for the appropriate communication subsystem library implementation. Check with your system administrator to learn which nodes were initialized for the US communication subsystem, which for the IP communication subsystem, and which allow either.

There are five separate environment variables that, collectively, determine how nodes are allocated by the Partition Manager. While these are the only ones you must set to allocate nodes, keep in mind that there are many other environment variables you can set. These are summarized in Appendix B. "POE Environment Variables and Command-Line Flags", and control such things as standard I/O handling and VT trace file generation. The environment variables for node allocation are:

MP_HOSTFILE
which specifies the name of a host list file to use for node allocation. If set to an empty string ("") or to the word "NULL", this environment variable specifies that no host list file should be used. If MP_HOSTFILE is not set, POE looks for a file host.list in the current directory. You need to create a host list file if you want specific node allocation, or if you want non-specific node allocation from a number of SP system pools.

MP_RESD
which specifies whether or not the Partition Manager should connect to a job management system (LoadLeveler or the Resource Manager) to allocate nodes.

Notes:

  1. MP_RESD only specifies whether or not to use a job management system.

  2. When the Resource Manager is used, the actual system you are using must be identified by setting the SP_NAME environment variable to the name of the control workstation of the SP system.

  3. When running POE from a workstation that is external to the LoadLeveler cluster, the LoadL.so fileset must be installed on the external node (see Using and Administering LoadLeveler and IBM Parallel Environment for AIX: Installation for more information).

  4. When running POE from a workstation that is external to the SP system, and using the Resource Manager, the ssp.clients fileset must be installed on the external node (see IBM Parallel Environment for AIX: Installation for more information).

MP_EUILIB
which specifies the communication subsystem library implementation to use - either the IP communication subsystem implementation or the User Space (US) communication subsystem implementation. The IP communication subsystem library uses Internet Protocol for communication among processor nodes, while the US communication subsystem library lets you drive an SP system's high performance switch directly from your parallel tasks, without going through the kernel or operating system. For US communication on an SP system, you must have the high performance switch feature.

MP_EUIDEVICE
which specifies the adapter set you want to use for IP communication among processor nodes. The Partition Manager only checks this if you are using the IP communication subsystem implementation with LoadLeveler or the SP system Resource Manager. It does not check this if you are using an RS/6000 network cluster. If MP_RESD=no, the value of MP_EUIDEVICE is ignored.

MP_RMPOOL
which specifies the name or number of a LoadLeveler pool, or number of an SP system pool. The Partition Manager only checks this if you are using LoadLeveler or the SP system Resource Manager for non-specific node allocation without a host list file. You can use the llstatus command to return information about LoadLeveler pools. To use llstatus on a workstation that is external to the LoadLeveler system, the LoadL.so fileset must be installed on the external node. For more information, see Using and Administering LoadLeveler and IBM Parallel Environment for AIX: Installation. You can use the jm_status command to return information about SP system pools. To use jm_status on a workstation that is external to the SP system, the ssp.clients fileset must be installed on the external node (see IBM Parallel Environment for AIX: Installation for more information). For information on this command, and on SP system pools in general, refer to IBM Parallel System Support Programs for AIX: Command and Technical Reference

The remainder of this step consists of sub-steps describing how to set each of these environment variables, and how to create a host list file. Depending on the hardware and message passing library you are using, and the method of node allocation you want, some of the sub-steps that follow may not apply to you. For this reason, pay close attention to the task variant tables at the beginning of many of the sub-steps. They will tell you whether or not you need to perform the sub-step.

For further clarification, the following tables summarize the procedure for determining how nodes are allocated. The tables describe the possible methods of node allocation available to you, what each environment variable must be set to, and whether or not you need to create a host list file. To make the procedure of setting up the execution environment easier and less prone to error, you may eventually wish to create a shell script which automates some of the environment variable settings. To allocate nodes of an SP system, see Table 1. If you are using an RS/6000 network cluster, or if you are using a mixed system and want to allocate nodes not on the SP system, see Table 2.

Table 1. Execution Environment Setup Summary (for an SP system)

If you want to use the US communication subsystem library for communication among parallel tasks, and you want non-specific node allocation from a single pool:

  • A host list file is not required.

  • MP_HOSTFILE should be set to an empty string ("") or the word "NULL".

  • MP_RESD should be set to yes. If set to an empty string (""), or if not set, the Partition Manager assumes MP_RESD is yes.

  • MP_EUILIB should be set to us.

  • MP_EUIDEVICE is css0 (the high performance switch); however, the actual value is ignored when MP_EUILIB is set to us.

  • MP_RMPOOL should be set to the name or number of a LoadLeveler pool, or the number of an SP system pool. It must be used if you are not using a host list file.

If you want to use the US communication subsystem library, and you want specific node allocation, or non-specific node allocation from more than one pool:

  • A host list file is required.

  • MP_HOSTFILE should be set to the name of your host list file. If not set, the host list file is assumed to be host.list in the current directory.

  • MP_RESD should be set to yes. If set to an empty string (""), or if not set, the Partition Manager assumes MP_RESD is yes.

  • MP_EUILIB should be set to us.

  • MP_EUIDEVICE is css0 (the high performance switch); however, the actual value is ignored when MP_EUILIB is set to us.

  • MP_RMPOOL is ignored because you are using a host list file.

If you want to use the IP communication subsystem library for communication among parallel tasks, and you want non-specific node allocation from a single pool:

  • A host list file is not required.

  • MP_HOSTFILE should be set to an empty string ("") or the word "NULL".

  • MP_RESD should be set to yes. If set to an empty string (""), or if not set, the Partition Manager assumes MP_RESD is yes.

  • MP_EUILIB should be set to ip.

  • MP_EUIDEVICE should specify the adapter type. A valid, case-sensitive value is css0 (the high performance switch). The MP_EUIDEVICE value is only used when the value of MP_EUILIB is ip.

  • MP_RMPOOL should be set to the name or number of a LoadLeveler pool, or the number of an SP system pool. It must be used if you are not using a host list file.

If you want to use the IP communication subsystem library, and you want specific node allocation, or non-specific node allocation (using the Resource Manager) from more than one pool:

  • A host list file is required.

  • MP_HOSTFILE should be set to the name of your host list file. If not set, the host list file is assumed to be host.list in the current directory.

  • MP_RESD should be set to yes. If set to an empty string (""), the Partition Manager assumes MP_RESD is no.

  • MP_EUILIB should be set to ip.

  • MP_EUIDEVICE should specify the adapter type. A valid, case-sensitive value is css0 (the high performance switch). The MP_EUIDEVICE value is only used when the value of MP_EUILIB is ip.

  • MP_RMPOOL is ignored because you are using a host list file.
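As suggested earlier in this step, these settings are natural candidates for a small shell script. A minimal Korn shell sketch for US communication with non-specific node allocation from a single pool might look like this (pool number 6 is purely a hypothetical example):

```shell
# Minimal sketch: US communication subsystem, non-specific node
# allocation from one pool (pool 6 is a hypothetical example).
export MP_HOSTFILE=NULL   # no host list file is used
export MP_RESD=yes        # connect to the job management system
export MP_EUILIB=us       # User Space communication subsystem
export MP_RMPOOL=6        # pool to allocate nodes from
echo "MP_EUILIB=$MP_EUILIB MP_RMPOOL=$MP_RMPOOL"
```

Sourcing such a script before invoking poe keeps the settings consistent from run to run and avoids retyping.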

Table 2. Execution Environment Setup Summary (for RS/6000 Network Cluster or Mixed System)

If you are using an RS/6000 network cluster:

  • A host list file is used.

  • MP_HOSTFILE should be set to the name of a host list file. If not defined, the host list file is assumed to be host.list in the current directory.

  • MP_RESD should be set to no.

  • MP_EUILIB should be set to ip.

  • MP_EUIDEVICE is not checked.

  • MP_RMPOOL is not used because you are using a host list file.

If you are using a mixed system:

  • A host list file is used.

  • MP_HOSTFILE should be set to the name of a host list file. If not defined, the host list file is assumed to be host.list in the current directory.

  • MP_RESD should be set to yes. If set to an empty string (""), the Partition Manager assumes MP_RESD is no.

  • MP_EUILIB should be set to ip.

  • MP_EUIDEVICE should specify the adapter type. Valid, case-sensitive values are en0 (Ethernet), tr0 (token ring), fi0 (FDDI), and css0 (the high performance switch).

  • MP_RMPOOL is not used because you are using a host list file.

The following table shows how nodes will be allocated depending on the value of the environment variables discussed in this step. It is provided here for additional illustration. Refer to it in situations when the environment variables are set in patterns other than those suggested in Table 1 and Table 2.

Table 3. Node Allocation Summary

The columns show, in order: the value of MP_EUILIB; the value of MP_RESD; what your host list file contains; the allocation mode; the communication subsystem library implementation used; and the message passing address used. A dash (-) in the first two columns means the variable is not set; in the last two columns it means not applicable because the combination is an error.

MP_EUILIB  MP_RESD  Host list file  Allocation mode  Library  Address
ip         -        nodes           Node_List        IP       Nodes
ip         -        pools           RM_List          IP       MP_EUIDEVICE
ip         -        NULL            RM               IP       MP_EUIDEVICE
ip         yes      nodes           RM_List          IP       MP_EUIDEVICE
ip         yes      pools           RM_List          IP       MP_EUIDEVICE
ip         yes      NULL            RM               IP       MP_EUIDEVICE
ip         no       nodes           Node_List        IP       Nodes
ip         no       pools           Error            -        -
ip         no       NULL            Error            -        -

us         -        nodes           RM_List          US       N/A
us         -        pools           RM_List          US       N/A
us         -        NULL            RM               US       N/A
us         yes      nodes           RM_List          US       N/A
us         yes      pools           RM_List          US       N/A
us         yes      NULL            RM               US       N/A
us         no       nodes           Error            -        -
us         no       pools           Error            -        -
us         no       NULL            Error            -        -

-          -        nodes           Node_List        IP       Nodes
-          -        pools           RM_List          IP       MP_EUIDEVICE
-          -        NULL            RM               US       N/A
-          yes      nodes           RM_List          US       N/A
-          yes      pools           RM_List          US       N/A
-          yes      NULL            RM               US       N/A
-          no       nodes           Node_List        IP       Nodes
-          no       pools           Error            -        -
-          no       NULL            Error            -        -

Note:

Node_List
means that the host list file is used to create the partition.

RM_List
means that the host list file is used to create the partition, but the nodes are requested from either LoadLeveler or the SP system Resource Manager.

RM
means that the partition is created by requesting nodes in MP_RMPOOL from the SP system Resource Manager.

Nodes
indicates that the external IP address of the processor node is used for communication.

MP_EUIDEVICE
indicates that the IP adapter address indicated by MP_EUIDEVICE is used for communication.
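The MP_EUILIB=ip rows of Table 3 can be restated as a small shell function. Here allocation_mode is a hypothetical helper written only for illustration, not a POE command, and a dash stands for an unset MP_RESD:

```shell
# Print the allocation mode for a given MP_RESD value and host list
# contents ("nodes", "pools", or "NULL"), per the ip rows of Table 3.
allocation_mode() {
  case "$1/$2" in                     # MP_RESD / host list contents
    -/nodes | no/nodes)              echo Node_List ;;
    -/pools | yes/nodes | yes/pools) echo RM_List ;;
    -/NULL  | yes/NULL)              echo RM ;;
    no/*)                            echo Error ;;
  esac
}
allocation_mode yes nodes   # prints RM_List
allocation_mode no pools    # prints Error
```

Reading the function top to bottom mirrors reading the table: a host list of nodes with no job management system gives direct Node_List allocation, while pool requests always go through LoadLeveler or the Resource Manager.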

Step 3a: Set the MP_PROCS Environment Variable

Before you execute a program, you need to set the size of the partition. To do this, use the MP_PROCS environment variable or its associated command-line flag -procs. For example, say you want to specify the number of task processes as 6. You could:
To set the MP_PROCS environment variable:

ENTER
export MP_PROCS=6

Or, to use the -procs flag when invoking the program:

ENTER
poe program -procs 6

Invoking parallel programs is discussed in more detail in "Step 5: Invoke the Executable".

Notes:

  1. Keep in mind that MP_PROCS sets the number of task processes per partition and does not necessarily correspond to the number of processor nodes. If tasks are time-sharing processor nodes, for example, the number of tasks will be greater than the number of nodes.

  2. If you do not set MP_PROCS, the default number of task processes is 1 unless:

    See "Step 3i: Set the MP_RMPOOL Environment Variable" for more details.

  3. The examples in this book assume use of the Korn shell. If you are using the C shell, you would have to use the setenv command rather than the export command. See "Setting POE Environment Variables" for more information.

Step 3b: Set the SP_NAME Environment Variable

If all nodes to be used for the parallel job exist in a PSSP 2.3.0 or 2.4.0 partition, the SP_NAME environment variable should be set to the name of the control workstation of the SP system on which these nodes exist. This is the only case that results in POE contacting the Resource Manager rather than LoadLeveler for node allocation requests.
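For example, if the control workstation were named spcw01 (a hypothetical host name), you would enter:

```shell
# Identify the SP system whose Resource Manager POE should contact
# (spcw01 is a hypothetical control workstation name).
export SP_NAME=spcw01
echo "SP_NAME=$SP_NAME"
```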

Step 3c: Create a Host List File


You need to create a host list file if:

  • you are using an RS/6000 network cluster.

  • you are using a mixed system with the Resource Manager and want to allocate some nodes not on the SP system.

  • you are using an SP system and want specific node allocation.

  • you are using an SP system with the Resource Manager and want non-specific node allocation from more than one pool.

You do not need to create a host list file if:

  • you are using a LoadLeveler cluster or an SP system and want non-specific node allocation from a single pool.

A host list file specifies the processor nodes on which the individual tasks of your program should run. When you invoke a parallel program, your Partition Manager checks to see if you have specified a host list file. If you have, it reads the file to allocate processor nodes. The procedure for creating a host list file differs depending on whether you are using an RS/6000 network cluster, a LoadLeveler cluster, an SP system, or a mixed system. If you are using an RS/6000 network cluster, see "Creating a Host List File to Allocate Nodes of a Cluster". If you are using a LoadLeveler cluster, an SP system, or a mixed system, see "Creating a Host List File to Allocate Nodes of an SP System".

Creating a Host List File to Allocate Nodes of a Cluster

If you are using an RS/6000 network cluster, a host list file simply lists a series of host names - one per line. These must be the names of remote nodes accessible from the home node. Lines beginning with an exclamation point (!) or asterisk (*) are comments. The Partition Manager ignores blank lines and comments. The host list file can list more names than are required by the number of program tasks. The additional names are ignored.

To understand how the Partition Manager uses a host list file to determine the nodes on which your program should run, consider the following example host list file:

! Host list file for allocating 6 tasks
* An asterisk may also be used to indicate a comment

host1_name
host2_name
host3_name
host4_name
host5_name
host6_name

The Partition Manager ignores the first two lines because they are comments, and the third line because it is blank. It then allocates host1_name to run task 0, host2_name to run task 1, host3_name to run task 2, and so on. If any of the processor nodes listed in the host list file are unavailable when you invoke your program, the Partition Manager returns a message stating this and does not run your program.

You can also have multiple tasks of a program share the same node by simply listing the same node multiple times in your host list file. For example, say your host list file contains the following:

host1_name
host2_name
host3_name
host1_name
host2_name
host3_name

Tasks 0 and 3 will run on host1_name, tasks 1 and 4 will run on host2_name, and tasks 2 and 5 will run on host3_name.
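The round-robin mapping described above can be verified with a short shell loop that numbers each line of that six-line host list:

```shell
# Recreate the six-line host list from the example, then show which
# task each line receives (tasks are assigned to lines in order).
cat > host.list <<'EOF'
host1_name
host2_name
host3_name
host1_name
host2_name
host3_name
EOF
task=0
while read host; do
  echo "task $task runs on $host"
  task=$((task+1))
done < host.list
```

The first line printed is "task 0 runs on host1_name" and the fourth is "task 3 runs on host1_name", confirming that listing a node twice places two tasks on it.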

Creating a Host List File to Allocate Nodes of an SP System

If you are using a LoadLeveler cluster or SP system, you can use a host list file for either:

In either case, the host list file can contain a number of records - one per line. For specific node allocation, each record indicates a processor node. For non-specific node allocation, each record indicates a pool; when using LoadLeveler you can request from one pool only, while when using the Resource Manager each record indicates an SP system pool. Your host list file cannot contain a mixture of node and pool requests, so you must use one method or the other. The host list file can contain more records than required by the number of program tasks. The additional records are ignored.

For specific node allocation:

Each record is either a host name or IP adapter address of a specific processor node of the SP system. If you are using a mixed system and want to allocate nodes not on the SP system, you must request them by host name. Lines beginning with an exclamation point (!) or an asterisk (*) are comments. The Partition Manager ignores blank lines and comments.

To understand how the Partition Manager uses a host list file to determine the SP system nodes on which your program should run, consider the following representation of a host list file.

! Host list file for allocating 6 tasks
 
host1_name
host2_name
host3_name
9.117.8.53
9.117.8.53
9.117.8.53

The Partition Manager ignores the first line because it is a comment, and the second because it is blank. It then allocates host1_name to run task 0, host2_name to run task 1, host3_name to run task 2, and so on. The last three nodes are requested by adapter IP address using dot decimal notation.

Notes:

  1. You can also, on each of the records in the host list file, specify how the allocated node's adapter and CPU should be used. For more information, see "Specifying How a Node's Resources Are Used".

  2. If any of the processor nodes listed in the host list file are unavailable when you invoke your program, the Partition Manager returns a message stating this and does not run your program.

For non-specific node allocation from a number of pools

After installation of a LoadLeveler cluster or SP system, your system administrator divides its processor nodes into a number of pools. With LoadLeveler, each pool has an identifying pool name or number. With an SP system, each pool has an identifying pool number. When using LoadLeveler for non-specific node allocation, you need to supply the appropriate pool name or number; LoadLeveler does not use more than one pool. When using the Resource Manager for non-specific node allocation from a number of pools, you need to supply the appropriate pool numbers.

If you require information about LoadLeveler pools, use the command llstatus. To use llstatus on a workstation that is external to the LoadLeveler cluster, the LoadL.so fileset must be installed on the external node (see Using and Administering LoadLeveler for more information).

ENTER
llstatus -l (lower case L)

* LoadLeveler lists information about pools in the LoadLeveler cluster.

If you require information about SP system pools, use the command jm_status. To use jm_status on a workstation that is external to the SP system, the ssp.clients fileset must be installed on the external node (see IBM Parallel Environment for AIX: Installation for more information).

ENTER
jm_status -P

* The Resource Manager lists information about all SP system pools.

With regard to LoadLeveler, in a host list file intended for non-specific node allocation, each record is a pool name or number preceded by an at symbol (@). Lines beginning with an exclamation point (!) or an asterisk (*) are comments. The Partition Manager ignores blank lines and comments.

To understand how the Partition Manager uses a host list file for non-specific node allocation, consider the following example host list file:

! Host list file for allocating 3 tasks with LoadLeveler
 
@6
@6
@6

The Partition Manager ignores the first line because it is a comment, and the second line because it is blank. The at (@) symbols tell the Partition Manager that these are pool requests. It connects to LoadLeveler to request three nodes from pool 6.

With regard to the Resource Manager only, in a host list file intended for non-specific node allocation from a number of pools, each record is a pool number preceded by an at symbol (@). Lines beginning with an exclamation point (!) or an asterisk (*) are comments. The Partition Manager ignores blank lines and comments.

To understand how the Partition Manager uses a host list file for non-specific node allocation from a number of pools, consider the following example host list file:

! Host list file for allocating 6 tasks with the Resource Manager
 
@6
@6
@6
@12
@12
@12

The Partition Manager ignores the first line because it is a comment, and the second line because it is blank. The at (@) symbols tell the Partition Manager that these are pool requests. It connects to the SP system Resource Manager to request three nodes from pool 6, and three nodes from pool 12.

Notes:

  1. When using the Resource Manager you can also, on each of the records in the host list file, specify how the allocated node's adapter and CPU should be used. For more information, see "Specifying How a Node's Resources Are Used".

  2. If there are insufficient nodes available in a requested pool when you invoke your program, the Partition Manager returns a message stating this, and does not run your program.

  3. For more information on the llstatus command and LoadLeveler pools, refer to Using and Administering LoadLeveler. For more information on the jm_status command and SP system pools, refer to IBM Parallel System Support Programs for AIX: Command and Technical Reference.

  4. If the number of program tasks is greater than the number of records in the host list file, the last record in the file is used for the remaining requests.
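The reuse rule in note 4 can be sketched as a small function. This is a hypothetical helper for illustration, not part of POE: given a task number and the list of host list records, it returns the record used for that task, falling back to the last record when the task number runs past the end of the list.

```shell
# Sketch: map a task number (0-based) to a host list record.
# Tasks beyond the last record reuse the last record (note 4 above).
record_for_task() {       # usage: record_for_task <task#> <record>...
    local task=$1; shift
    if [ "$task" -lt "$#" ]; then
        shift "$task"                 # advance to the requested record
    else
        shift $(( $# - 1 ))           # past the end: reuse the last record
    fi
    echo "$1"
}

record_for_task 0 @6 @6 @12           # task 0 -> @6
record_for_task 7 @6 @6 @12           # task beyond the list -> @12
```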

Specifying How a Node's Resources Are Used

When requesting nodes of an SP system, you can optionally request how each node's resources - its adapter and CPU - should be used: whether the adapter is dedicated to your job or shared, and whether the CPU is unique to your tasks or used by multiple tasks.

Note:When using LoadLeveler, you can request how nodes are used with the MP_CPU_USE and/or MP_ADAPTER_USE environment variables, or their associated command line options. Usage specification in a host list file will be ignored when using LoadLeveler.

When Using a Host List File for Node Allocation

With regard to the Resource Manager, on each record of the host list file, you can make either or both of the specifications listed above. For example, if you wanted your program task to have exclusive use of both the adapter and CPU, the host list record would be:

host1_name dedicated unique

or

host1_name d u

This is the same for pool requests:

@6 dedicated unique

or

@6 d u
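The abbreviated forms shown above can be expanded mechanically. The following is a sketch only - a hypothetical helper, not part of POE - that normalizes the d and u abbreviations on a host list record back to their full spellings.

```shell
# Sketch: expand the abbreviated adapter/CPU usage fields of a host
# list record ("d" -> dedicated, "u" -> unique); full spellings and
# other values pass through unchanged.
expand_usage() {          # usage: expand_usage <node> [adapter] [cpu]
    local node=$1 adapter=$2 cpu=$3
    case "$adapter" in d) adapter=dedicated ;; esac
    case "$cpu"     in u) cpu=unique ;; esac
    echo "$node $adapter $cpu"
}

expand_usage host1_name d u           # host1_name dedicated unique
expand_usage @6 d u                   # @6 dedicated unique
```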

When Not Using a Host List File for Node Allocation

The MP_ADAPTER_USE and MP_CPU_USE environment variables, or their associated command-line options (-adapter_use and -cpu_use), can be used to make either or both of these specifications. The specifications then affect resource usage for each node allocated from the pool specified with MP_RMPOOL or -rmpool. For example, to request nodes from Resource Manager pool 5 with exclusive use of both the adapter and CPU, you could use the following command line:

poe [program] -rmpool 5 -adapter_use d[edicated] -cpu_use u[nique] [more_poe_options]

Associated environment variables (MP_RMPOOL, MP_ADAPTER_USE, MP_CPU_USE) could also be used to specify any or all of the options in this example.

The following tables illustrate how node resources are used. Table 4 shows the default settings for adapter and CPU use, while Table 5 outlines how the two separate specifications determine how the allocated node's resources are used.

Table 4. Adapter/CPU Default Settings

                                                    Adapter        CPU
If host list file contains non-specific
pool requests:                                      Dedicated      Unique
If host list file requests specific nodes:          Shared 1       Multiple
If host list file is not used:                      Dedicated 2    Unique 3

Note:

1 For US jobs, adapter is dedicated.

2 For IP jobs, adapter is shared.

3 For IP jobs, CPU is multiple.
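The defaults in Table 4, including the US/IP footnotes, can be expressed as a small function. This is a sketch only, assuming the request type and job type (us or ip) are already known; it is not part of POE.

```shell
# Sketch of the Table 4 defaults: given how nodes were requested
# (pool request, specific node, or no host list file) and the job
# type (us or ip), print the default "adapter cpu" usage.
default_usage() {   # usage: default_usage <pool|specific|nohostfile> <us|ip>
    case "$1" in
        pool)
            echo "dedicated unique" ;;
        specific)
            # Shared adapter by default; dedicated for US jobs (note 1).
            if [ "$2" = us ]; then echo "dedicated multiple"
            else echo "shared multiple"; fi ;;
        nohostfile)
            # Dedicated/unique by default; shared/multiple for IP (notes 2, 3).
            if [ "$2" = ip ]; then echo "shared multiple"
            else echo "dedicated unique"; fi ;;
    esac
}

default_usage pool ip                 # dedicated unique
default_usage specific us             # dedicated multiple
```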


Table 5. Adapter/CPU Use under LoadLeveler

Dedicated adapter, Unique CPU: Intended for production runs of high performance applications. Only the tasks of that parallel job use the adapter and CPU.

Dedicated adapter, Multiple CPU: The adapter you specified with MP_EUIDEVICE is dedicated to the tasks of your parallel job. However, you and other users still have access to the CPU through another adapter.

Shared adapter, Unique CPU: Only your program tasks have access to the node's CPU, but other programs' tasks can share the adapter.

Shared adapter, Multiple CPU: Both the adapter and CPU can be used by a number of your program's tasks and other users.

Table 6. Adapter/CPU Use under the Resource Manager

Dedicated adapter, Unique CPU: Intended for production runs of high performance applications. Only one task uses the adapter and CPU.

Dedicated adapter, Multiple CPU: The adapter you specified with MP_EUIDEVICE is dedicated to your program task. However, you and other users still have access to the CPU through another adapter.

Shared adapter, Unique CPU: Only you have access to the node's CPU, but a number of your program's tasks can share the adapter.

Shared adapter, Multiple CPU: Both the adapter and CPU can be used by a number of your program's tasks and other users.

Notes:

  1. When using LoadLeveler, the US communication subsystem library does not require dedicated use of the SP switch on the node. Adapter use will be defaulted, as in Table 4, but shared usage may be specified.

  2. When using the Resource Manager, the US communication subsystem library requires dedicated use of the SP switch on the node. If you are using the US communication subsystem for communication among processor nodes, POE forces adapter use to be dedicated. If you are using the US communication subsystem and you specify adapter use to be shared, the specification is ignored.

  3. Adapter/CPU usage specification is only enforced for jobs using LoadLeveler or the SP system Resource Manager for node allocation.

Generating an Output Host List File

When running parallel programs in a LoadLeveler cluster or on an SP system, you can generate an output host list file of the nodes allocated by LoadLeveler or the Resource Manager. When you have LoadLeveler or the Resource Manager perform non-specific node allocation from SP system pools, this enables you to learn which nodes were allocated. This information is vital if you want to perform some postmortem analysis or file cleanup on those nodes, or if you want to rerun the program using the same nodes. To generate a host list file, set the MP_SAVEHOSTFILE environment variable to a file name. You can specify this using a relative or full path name. As with most POE environment variables, you can temporarily override the value of MP_SAVEHOSTFILE using its associated command-line flag -savehostfile. For example, to save LoadLeveler's or the Resource Manager's node allocation into a file called /u/hinkle/myhosts, you could:
Set the MP_SAVEHOSTFILE environment variable:

ENTER
export MP_SAVEHOSTFILE=/u/hinkle/myhosts

Or use the -savehostfile flag when invoking the program:

ENTER
poe program -savehostfile /u/hinkle/myhosts

Each record in the output host list file is the name of a node that was allocated. Each record is followed by comment lines that identify the allocated node by its adapter IP addresses and show the original non-specific pool request.

For example, using LoadLeveler, say the input host list file contains the following records:

@mypool
@mypool
@mypool

The following is a representation of the output hostlist file.

host1_name
! 9.117.11.47                  9.117.8.53
 
!@mypool
host1_name
! 9.117.11.47                  9.117.8.53
 
!@mypool
host1_name
! 9.117.11.47                  9.117.8.53
 
!@mypool

Using the Resource Manager, say the input host list file contains the following records:

@6
 
@6
 
@6
 
@12
 
@12
 
@12

The following is a representation of the output hostlist file.

host1_name dedicated unique
 
! 9.117.11.47                  9.117.8.53
 
!@6
 
host2_name dedicated unique
 
! 9.117.11.47                  9.117.8.53
 
!@6
 
host3_name dedicated unique
 
! 9.117.11.47                  9.117.8.53
 
!@6
 
host4_name dedicated unique
 
! 9.117.11.47                  9.117.8.53
 
!@12
 
host5_name dedicated unique
 
! 9.117.11.47                  9.117.8.53
 
!@12
 
host6_name dedicated unique
 
! 9.117.11.47                  9.117.8.53
 
!@12

Note:The name of your output host list file can be the same as your input host list file. If a file of the same name already exists, it is overwritten by the output host list file.
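Because the comment lines in a saved host list begin with an exclamation point, the allocated node names can be recovered with ordinary text tools - useful for the postmortem cleanup mentioned above. The following is a sketch with a hypothetical helper, not part of POE.

```shell
# Sketch: recover just the allocated node names from an output host
# list file by dropping comment and blank lines and keeping the first
# field (the usage fields, if present, are discarded).
allocated_nodes() {
    grep -v -e '^[!*]' -e '^[[:space:]]*$' "$1" | awk '{print $1}'
}

# A two-node fragment in the saved-file format shown above.
cat > /tmp/saved.hosts <<'EOF'
host1_name dedicated unique
! 9.117.11.47                  9.117.8.53
!@6
host2_name dedicated unique
! 9.117.11.47                  9.117.8.53
!@6
EOF

allocated_nodes /tmp/saved.hosts      # host1_name, host2_name
```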

Step 3d: Set the MP_HOSTFILE Environment Variable


You need to set the MP_HOSTFILE environment variable if:

  • you are using a host list file other than the default ./host.list

  • you are requesting non-specific node allocation without a host list file.

You do not need to set the MP_HOSTFILE environment variable if your host list file is the default ./host.list.

The default host list file used by the Partition Manager to allocate nodes is called host.list and is located in your current directory. You can specify a file other than host.list by setting the environment variable MP_HOSTFILE to the name of a host list file, or by using either the -hostfile or -hfile flag when invoking the program. In either case, you can specify the file using its relative or full path name. For example, say you want to use the host list file myhosts located in the directory /u/hinkle. You could:
Set the MP_HOSTFILE environment variable:

ENTER
export MP_HOSTFILE=/u/hinkle/myhosts

Or use the -hostfile flag when invoking the program:

ENTER
poe program -hostfile /u/hinkle/myhosts

or poe program -hfile /u/hinkle/myhosts


If you are using LoadLeveler or the SP system Resource Manager for non-specific node allocation from a single pool specified by MP_RMPOOL, and a host list file exists in the current directory, you must set MP_HOSTFILE to an empty string or to the word "NULL". Otherwise the Partition Manager uses the host list file. You can either:
Set the MP_HOSTFILE environment variable:

ENTER
export MP_HOSTFILE=

or

export MP_HOSTFILE=""

or

export MP_HOSTFILE=NULL

Or use the -hostfile flag when invoking the program:

ENTER
poe program -hostfile ""

or poe program -hostfile NULL
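The rule above can be sketched as a function. This is a hypothetical helper for illustration, not POE's actual logic: an empty string or the word NULL disables the host list file, an unset variable falls back to the default ./host.list, and any other value names the file to use.

```shell
# Sketch: resolve which host list file (if any) the Partition Manager
# would use, given whether MP_HOSTFILE is set and its value.
resolve_hostfile() {      # usage: resolve_hostfile <set|unset> [value]
    if [ "$1" = unset ]; then
        echo "./host.list"            # default host list file
    else
        case "$2" in
            ""|NULL) echo "none" ;;   # empty string or NULL: no host file
            *)       echo "$2" ;;     # an explicit host list file
        esac
    fi
}

resolve_hostfile unset                        # ./host.list
resolve_hostfile set NULL                     # none
resolve_hostfile set /u/hinkle/myhosts        # /u/hinkle/myhosts
```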


Step 3e: Set the MP_RESD Environment Variable

To indicate whether a job management system should be used, you set the MP_RESD environment variable to yes or no. As specified in Table 1 and Table 2, MP_RESD controls whether or not the Partition Manager connects to LoadLeveler or the Resource Manager to allocate processor nodes.

If you are allocating nodes that are not part of a LoadLeveler cluster, MP_RESD should be set to no. If MP_RESD is set to yes, only nodes within the LoadLeveler cluster are allocated.

If you are allocating nodes of an RS/6000 network cluster, you do not have a job management system and should set MP_RESD to no. If you are using a mixed system, you may set MP_RESD to yes. However, the job management system only has knowledge of SP system nodes. To allocate any of the additional RS/6000 processors which supplement the SP system nodes in a mixed system, you must also use a host list file.

As with most POE environment variables, you can temporarily override the value of MP_RESD using its associated command-line flag -resd. For example, to specify that you want the Partition Manager to connect to the Resource Manager, you could:
Set the MP_RESD environment variable:

ENTER
export MP_RESD=yes

Or use the -resd flag when invoking the program:

ENTER
poe program -resd yes

You can also set MP_RESD to an empty string. If set to an empty string, or if not set, the default value of MP_RESD is interpreted as yes or no depending on the context. Specifically, the value of MP_RESD will be determined by the value of MP_EUILIB and whether or not you are using a host list file. The following table shows how the context determines the value of MP_RESD.
If MP_EUILIB is set to ip, an empty string, the word "NULL", or is not set:

  • and you are using a host list file, MP_RESD is interpreted as no by default, unless the host list file includes pool requests.

  • and you are not using a host list file, MP_RESD is interpreted as yes by default.

If MP_EUILIB is set to us, MP_RESD is interpreted as yes by default, whether or not you are using a host list file.
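The default interpretation just described can be expressed as a small function. This is a sketch only - a hypothetical helper, not POE's implementation - taking the MP_EUILIB value, whether a host list file is used, and whether that file contains pool requests.

```shell
# Sketch of the MP_RESD defaults: us always defaults to yes; ip (or
# unset) defaults to no only when a host list file without pool
# requests is used.
default_resd() {   # usage: default_resd <euilib> <hostfile:yes|no> <pools:yes|no>
    local euilib=$1 hostfile=$2 pools=$3
    if [ "$euilib" = us ]; then
        echo yes
    elif [ "$hostfile" = yes ] && [ "$pools" = no ]; then
        echo no                       # ip + host file of specific nodes
    else
        echo yes
    fi
}

default_resd ip yes no                # no
default_resd ip yes yes               # yes (pool requests present)
default_resd ip no no                 # yes (no host list file)
```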

Notes:

  1. MP_RESD only specifies whether or not to use a job management system (LoadLeveler or the Resource Manager).

  2. When the Resource Manager is used, the actual system you are using is identified by the SP_NAME environment variable, which is set to the name of the control workstation of the SP system.

  3. When running POE from a workstation that is external to the LoadLeveler cluster, the LoadL.so fileset must be installed on the external node (see Using and Administering LoadLeveler and IBM Parallel Environment for AIX: Installation for more information).

  4. When running POE from a workstation that is external to the SP system, and using the Resource Manager, the ssp.clients fileset must be installed on the external node (see IBM Parallel Environment for AIX: Installation for more information).

Step 3f: Set the MP_EUILIB Environment Variable

During execution, the tasks of your program can communicate via calls to message passing routines. The message passing routines in turn call communication subsystem library routines which enable the processor nodes to exchange the message data. Before you invoke your program, you need to decide which communication subsystem library implementation you wish to use - the Internet Protocol (IP) communication subsystem or the User Space (US) communication subsystem.

The MP_EUILIB environment variable, or its associated command-line flag -euilib, indicates which communication subsystem library implementation you are using. POE needs to know which implementation to dynamically link in as part of your executable when you invoke it. To use the IP communication subsystem, set MP_EUILIB to ip; to use the US communication subsystem, set MP_EUILIB to us. This specification is case-sensitive.

For example, say you want to dynamically link in the communication subsystem library at execution time. You could:
Set the MP_EUILIB environment variable:

ENTER
export MP_EUILIB=ip

or

export MP_EUILIB=us

Or use the -euilib flag when invoking the program:

ENTER
poe program -euilib ip

or poe program -euilib us

Note:When you invoke a parallel program, your Partition Manager checks the value of MP_EUILIB and then looks to the directory /usr/lpp/ppe.poe/lib for the message passing interface and the communication subsystem library implementation. If you are running on an RS/6000 network cluster, this is the actual location of the message passing interface. If you are running on an SP system, /usr/lpp/ppe.poe/lib contains symbolic links to the actual location. Consult your system administrator for the actual location of the message passing library if necessary.

You can make POE look to a directory other than /usr/lpp/ppe.poe/lib by setting the MP_EUILIBPATH environment variable or its associated command-line flag -euilibpath. For example, say the communication subsystem library implementations were moved to /usr/altlib. To instruct the Partition Manager to look there, you could:


Set the MP_EUILIBPATH environment variable:

ENTER
export MP_EUILIBPATH=/usr/altlib

Or use the -euilibpath flag when invoking the program:

ENTER
poe program -euilibpath /usr/altlib

By default, POE loads the communication subsystem library from the directory /usr/lpp/ppe.poe/lib/$MP_EUILIB. Setting the MP_EUILIBPATH environment variable causes POE to try to load the library from $MP_EUILIBPATH/$MP_EUILIB instead. If the communication subsystem library (libmpci.a) is not in the requested path, it is loaded from the IP library path used when the program was compiled - $MP_PREFIX/ppe.poe/lib/ip. MP_PREFIX can also be set by the user, but is normally /usr/lpp, so the default library path is normally /usr/lpp/ppe.poe/lib/ip when the library is not specified by the MP_EUILIB and/or MP_EUILIBPATH environment variables.
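The search order described above can be sketched as a path-building function. This is a hypothetical helper for illustration, not POE's actual loader.

```shell
# Sketch: compute the directory from which the communication subsystem
# library would be loaded, given MP_EUILIB, an optional MP_EUILIBPATH,
# and an optional MP_PREFIX (normally /usr/lpp).
euilib_dir() {            # usage: euilib_dir <euilib> [euilibpath] [prefix]
    local euilib=$1 libpath=$2 prefix=${3:-/usr/lpp}
    if [ -n "$libpath" ]; then
        echo "$libpath/$euilib"               # MP_EUILIBPATH overrides
    else
        echo "$prefix/ppe.poe/lib/$euilib"    # default location
    fi
}

euilib_dir ip                         # /usr/lpp/ppe.poe/lib/ip
euilib_dir us /usr/altlib             # /usr/altlib/us
```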

Step 3g: Set the MP_EUIDEVICE Environment Variable


You need to set the MP_EUIDEVICE environment variable if you have set the MP_EUILIB environment variable to ip and are using LoadLeveler or the Resource Manager. You do not need to set it if you have set MP_EUILIB to us; in that case the Partition Manager assumes that MP_EUIDEVICE is css0 - the high performance communication adapter.

If you are using the IP communication subsystem library implementation for communication among parallel tasks on an SP system, you can specify which adapter set to use for message passing - Ethernet, FDDI, token-ring, or a high performance switch. The MP_EUIDEVICE environment variable and its associated command-line flag -euidevice select an alternate adapter set for communication among processor nodes. If neither the MP_EUIDEVICE environment variable nor the -euidevice flag is set, the communication subsystem library uses the external IP address of each remote node. The following table shows the possible, case-sensitive, settings for MP_EUIDEVICE.
Setting the MP_EUIDEVICE environment variable to: Selects:
en0 The Ethernet adapter
fi0 The FDDI adapter
tr0 The token-ring adapter
css0 The high performance switch adapter

For example, say you want to use IP over the high performance switch. The nodes have been initialized for IP as described in IBM Parallel System Support Programs for AIX: Installation and Migration Guide, and you have already set the MP_EUILIB environment variable to ip. To specify the high performance switch, you could:
Set the MP_EUIDEVICE environment variable:

ENTER
export MP_EUIDEVICE=css0

Or use the -euidevice flag when invoking the program:

ENTER
poe program -euidevice css0

Notes:

  1. If you do not set the MP_EUIDEVICE environment variable, the default is the adapter set used as the external network address.

  2. If MP_EUIDEVICE is explicitly set to en0 and LoadLeveler is being used for node allocation, the en0 adapter must be configured in LoadLeveler. See Using and Administering LoadLeveler for more information.

Step 3h: Set the MP_MSG_API Environment Variable

The MP_MSG_API environment variable, or its associated command line option, is used to indicate to POE which message passing API is being used by the parallel tasks.
You need to set the MP_MSG_API environment variable if a parallel task is using LAPI, either alone or in conjunction with MPI. You do not need to set it if a parallel task is using MPI only.

Step 3i: Set the MP_RMPOOL Environment Variable


You need to set the MP_RMPOOL environment variable if you are using a LoadLeveler cluster or an SP system and want non-specific node allocation from a single pool. You do not need to set it if you are allocating nodes using a host list file.

After installation of a LoadLeveler cluster or SP system, your system administrator divides its processor nodes into a number of pools, each with an identifying pool name or number. If you want non-specific node allocation from a single pool, set the MP_RMPOOL environment variable to the name or number of that pool when using LoadLeveler, or to the number of that pool when using the Resource Manager. The pool you specify should consist of nodes configured for the appropriate communication subsystem library implementation. Check with your system administrator to learn which pools consist of nodes initialized for the US communication subsystem and which were initialized for the IP communication subsystem.

If you need information about available pools and are using LoadLeveler, use the command llstatus. To use llstatus on a workstation that is external to the LoadLeveler cluster, the LoadL.so fileset must be installed on the external node (see Using and Administering LoadLeveler and IBM Parallel Environment for AIX: Installation for more information).

ENTER
llstatus -l (lower case L)

* LoadLeveler lists information about all LoadLeveler pools and/or features.

If you need information about available pools and are using the Resource Manager, use the command jm_status to get job manager status. To use jm_status on a workstation that is external to the SP system, the ssp.clients fileset must be installed on the external node (see IBM Parallel Environment for AIX: Installation for more information).

ENTER
jm_status -P

* The Resource Manager lists information about all SP system pools.

As with most POE environment variables, you can temporarily override the value of MP_RMPOOL using its associated command-line flag -rmpool. To specify pool 6, for example, you could:
Set the MP_RMPOOL environment variable:

ENTER
export MP_RMPOOL=6

Or use the -rmpool flag when invoking the program:

ENTER
poe program -rmpool 6

Notes:

  1. For more information on the llstatus command and on LoadLeveler pools, refer to Using and Administering LoadLeveler.

  2. For more information on the jm_status command and on SP system pools, refer to IBM Parallel System Support Programs for AIX: Command and Technical Reference.

  3. When using LoadLeveler, if the value of the MP_RMPOOL environment variable is numeric, that pool number must be configured in LoadLeveler. If the value of MP_RMPOOL contains any non-numeric characters, that pool name must be configured as a feature in LoadLeveler. See Using and Administering LoadLeveler for more information.
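The numeric-versus-name distinction in note 3 can be sketched with a pattern match. This is a hypothetical helper, not part of POE or LoadLeveler.

```shell
# Sketch of note 3: a purely numeric MP_RMPOOL value is treated as a
# pool number; a value containing any non-numeric character is treated
# as a pool name (a LoadLeveler feature).
rmpool_kind() {           # usage: rmpool_kind <MP_RMPOOL value>
    case "$1" in
        *[!0-9]*) echo name ;;        # contains a non-digit
        *)        echo number ;;      # digits only
    esac
}

rmpool_kind 6                         # number
rmpool_kind mypool                    # name
```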

When using LoadLeveler, the MP_NODES and/or MP_TASKS_PER_NODE environment variables, or their associated command line options, may be used in conjunction with MP_RMPOOL.


Table 7. LoadLeveler Node Allocation
MP_PROCS set? MP_TASKS_PER_NODE set? MP_NODES set? Conditions and Results
Yes Yes Yes MP_TASKS_PER_NODE multiplied by MP_NODES must equal MP_PROCS, otherwise an error occurs.
Yes Yes No MP_TASKS_PER_NODE must divide evenly into MP_PROCS, otherwise an error occurs.
Yes No Yes MP_NODES (n) must be less than or equal to MP_PROCS (p). If less than, LoadLeveler will allocate one task to each node, from 0 to n - 1, and will then allocate a second task to each of the nodes from 0 to n - 1, etc., until there are p tasks allocated. For example, if n = 3 and p = 5, 2 tasks will run on node 0, 2 tasks will run on node 1, and 1 task will run on node 2.
Yes No No The parallel job will run with the indicated number of MP_PROCS (p) on p nodes.
No Yes Yes The parallel job will consist of MP_TASKS_PER_NODE multiplied by MP_NODES tasks.
No Yes No An error occurs. MP_NODES or MP_PROCS must be specified with MP_TASKS_PER_NODE.
No No Yes One parallel task will be run on each of n nodes.
No No No One parallel task will be run on one node.
Note:The examples in this table use the environment variable setting to illustrate each of the three options. The associated command line options may also be used.
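The round-robin placement described in the "MP_PROCS and MP_NODES" row of Table 7 can be sketched as follows. This is an illustrative helper under the table's constraint that n is less than or equal to p; it is not part of LoadLeveler.

```shell
# Sketch: with p tasks dealt one at a time across n nodes (n <= p),
# print the number of tasks placed on each node, node 0 first.
tasks_per_node() {        # usage: tasks_per_node <p tasks> <n nodes>
    local p=$1 n=$2 i=0 extra
    extra=$(( p % n ))                # the first (p mod n) nodes get one extra
    while [ "$i" -lt "$n" ]; do
        if [ "$i" -lt "$extra" ]; then
            echo $(( p / n + 1 ))
        else
            echo $(( p / n ))
        fi
        i=$(( i + 1 ))
    done
}

tasks_per_node 5 3        # prints 2, 2, 1 - the n=3, p=5 example above
```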

Step 3j: Set the MP_AUTH Environment Variable


You need to set the MP_AUTH environment variable if you are using DFS/DCE based user authorization and your system administrator has not defined the MP_AUTH value in /etc/poe.limits. You do not need to set it if you are using AIX based user authorization defined by /etc/hosts.equiv or .rhosts entries, or if your system administrator has defined the MP_AUTH value in /etc/poe.limits.

POE allows two types of user authorization:

  1. AIX based user authorization, using entries in /etc/hosts.equiv or .rhosts files. This is the default POE user authorization method.

  2. DFS/DCE based user authorization, using DCE credentials. If you plan to run POE jobs in a DFS environment, you must use DFS/DCE based user authorization.
Note:If POE is run under LoadLeveler, LoadLeveler handles the user authorization, and the POE user authorization steps are skipped.

The type of user authorization is controlled by the MP_AUTH environment variable. The valid values are AIX (the default) or DFS.

The system administrator can also define the value for MP_AUTH in the /etc/poe.limits file. If MP_AUTH is specified in /etc/poe.limits, POE uses that value, overriding the MP_AUTH environment variable if the two differ.

For more information on running POE in a DFS environment, see "Running POE within a Distributed File System".

For more information on user authorization and on the /etc/poe.limits entries, see IBM Parallel Environment for AIX: Installation.

Step 4: Start X-Windows Analysis Tools

If you wish to use either of the POE X-Windows analysis tools - the Program Marker Array or the System Status Array - you should start them before invoking the executable. For more information on these tools and how to start them, see Figure 1 and "Using the System Status Array".

Step 5: Invoke the Executable

Note:In order to perform this step, you need to have a user account on, and be able to remotely login to, each of the processor nodes. This requires that you have an .rhosts file set up in your home directory on each of the remote processor nodes. Alternatively, your user id on the home node can be authorized in the /etc/hosts.equiv file on each remote node. For more information on the TCP/IP .rhosts file format, see IBM General Concepts and Procedures for RS/6000 and IBM AIX Version 4 Files Reference.

The poe command enables you to load and execute programs on remote nodes.

When you invoke poe, the Partition Manager allocates processor nodes for each task and initializes the local environment. It then loads your program, and reproduces your local environment, on each processor node. The Partition Manager also passes the option list to each remote node. If your program is in a shared file system, the Partition Manager loads a copy of it on each node. If your program is in a private file system, you must first copy the executable to each node yourself using the mprcp or mcp command. If you are using the dynamic message passing interface, the appropriate communication subsystem library implementation (IP or US) is automatically loaded at this time.

Since the Partition Manager attempts to reproduce your local environment on each remote node, your current directory is important. When you invoke poe, the Partition Manager will, immediately before running your executable, issue the cd command to your current working directory on each remote node. If you are in a local directory that does not exist on remote nodes, you will get an error as the Partition Manager attempts to change to that directory on remote nodes. Typically, this will happen when you invoke poe from a directory under /tmp. We suggest that you invoke poe from a file system that is mounted across the system. If it is important that the current directory be under /tmp, make sure that directory exists on all the remote nodes. If you are running in the C shell, see "Running Programs Under the C Shell".
Note:The Parallel Environment opens several file descriptors before passing control to the user. The Parallel Environment will not assign specific file descriptors other than standard in, standard out, and standard error.

Before using the poe command, you can first specify which programming model you are using by setting the MP_PGMMODEL environment variable to either spmd or mpmd. As with most POE environment variables, you can temporarily override the value of MP_PGMMODEL using its associated command-line flag -pgmmodel. For example, if you want to run an MPMD program, you could:
Set the MP_PGMMODEL environment variable:

ENTER
export MP_PGMMODEL=mpmd

Or use the -pgmmodel flag when invoking the program:

ENTER
poe program -pgmmodel mpmd

Note:If you do not set the MP_PGMMODEL environment variable or -pgmmodel flag, the default programming model is SPMD.
Note:If you load your executable from a mounted file system, you may experience an initial delay while the program is being initialized on all nodes. You may experience this delay even after the program begins executing, because individual pages of the program are brought in on demand. This is particularly apparent during initialization of the message passing interface; since individual nodes are synchronized, there are simultaneous demands on the network file transfer system. You can minimize this delay by copying the executable to a local file system on each node, using the mcp message passing file copy program.

Invoking an SPMD Program

If you have an SPMD program, you want to load it as a separate task on each node of your partition. To do this, follow the poe command with the program name and any options. The options can be program options or any of the POE command-line flags shown in Appendix B. "POE Environment Variables and Command-Line Flags". You can also invoke an SPMD program by entering the program name and any options:

ENTER
poe program [options]

or

program [options]

You can also enter poe without a program name:

ENTER
poe [options]

* Once your partition is established, a prompt appears.

ENTER
the name of the program you want to load. You can follow the program name with any program options or a subset of the POE flags.

Note:For National Language Support, POE displays messages located in an externalized message catalog. POE checks the LANG and NLSPATH environment variables, and if either is not set, it will set up the following defaults:

  • LANG=C

  • NLSPATH=/usr/lib/nls/msg/%L/%N

For more information about the message catalog, see "National Language Support".

Invoking an MPMD Program

Note:You must set the MP_PGMMODEL environment variable or -pgmmodel flag to invoke an MPMD program.

With an SPMD application, the same executable runs on each of the processor nodes of your partition. With an MPMD application, you are dealing with more than one program and need to load the nodes of your partition individually.

For example, say you have two programs - master and workers - designed to run together and communicate via calls to message passing subroutines. The program master is designed to run on one processor node. The workers program is designed to run as separate tasks on any number of other nodes. The master program will coordinate and synchronize the execution of all the worker tasks. Neither program can run without the other, as master only does sends and the worker tasks only do receives.

You can establish a partition and load each node individually using either standard input or a POE commands file.

Loading Nodes Individually From Standard Input

To establish a partition and load each node individually using STDIN:

ENTER
poe [options]

* The Partition Manager allocates the processor nodes of your partition. Once your partition is established, a prompt appears containing both the logical node identifier 0 and the actual host name it maps to.

ENTER
the name of the program you want to load on node 0. You can follow the program name with any program options or a subset of the POE flags.

* A prompt for the next node in the partition displays.

ENTER
the name of the program you want to load on each processor node as you are prompted.

* When you have specified the program to run on the last node of your partition, the message "Partition loaded..." displays and execution begins.

For additional illustration, the following shows the command prompts that would appear, as well as the program names you would enter, to load the example master and workers programs. This example assumes that the MP_PROCS environment variable is set to 5.

% poe
 
0:host1_name> master [options]
 
1:host2_name> workers [options]
 
2:host3_name> workers [options]
 
3:host4_name> workers [options]
 
4:host5_name> workers [options]
 
Partition loaded...

Note:You can use some POE command-line flags on individual program names, but not those that are used to set up the partition. The flags you can use are mainly those having to do with VT trace file collection. They are:
  • -infolevel or -ilevel
  • -ttempsize or -ttsize
  • -tmpdir
  • -samplefreq or -sfreq
  • -tbuffwrap or -tbwrap
  • -tbuffsize or -tbsize
  • -euidevelop

Loading Nodes Individually Using a POE Commands File

The MP_CMDFILE environment variable, and its associated command-line flag -cmdfile, let you specify the name of a POE commands file. You can use such a file when individually loading a partition - thus freeing STDIN. The POE commands file simply lists the individual programs you want to load and run on the nodes of your partition. The programs are loaded in task order. For example, say you have a typical master/workers MPMD program that you want to run as 5 tasks. Your POE commands file would contain:

master [options]
 
workers [options]
 
workers [options]
 
workers [options]
 
workers [options]
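
Such a commands file can be created directly from the shell. In this sketch, master and workers are the example program names from above (not real commands), and /tmp/mpmdprog is an illustrative path:

```shell
# Build a 5-task POE commands file: one program name per line, in task order
# (the line for task 0 comes first).
cat > /tmp/mpmdprog <<'EOF'
master
workers
workers
workers
workers
EOF

# One line per task, so a 5-task partition needs 5 lines.
wc -l < /tmp/mpmdprog
```

POE is then pointed at this file through MP_CMDFILE or the -cmdfile flag.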

Once you have created a POE commands file, you can specify it using a relative or full path name on the MP_CMDFILE environment variable or -cmdfile flag. For example, if your POE commands file is /u/hinkle/mpmdprog, you could:
Set the MP_CMDFILE environment variable:

ENTER
export MP_CMDFILE=/u/hinkle/mpmdprog

Or use the -cmdfile flag on the poe command:

ENTER
poe -cmdfile /u/hinkle/mpmdprog

Once you have set the MP_CMDFILE environment variable to the name of the POE commands file, you can individually load the nodes of your partition. To do this:

ENTER
poe [options]

* The Partition Manager allocates the processor nodes of your partition. The programs listed in your POE commands file are run on the nodes of your partition.

Loading a Series of Programs as Job Steps

By default, the Partition Manager releases your partition when your program completes its run. However, you can set the environment variable MP_NEWJOB, or its associated command-line flag -newjob, to specify that the Partition Manager should maintain your partition for multiple job steps.

For example, say you have three separate SPMD programs. The first one sets up a particular computation by adding some files to /tmp on each of the processor nodes on the partition. The second program does the actual computation. The third program does some postmortem analysis and file cleanup. These three parallel programs must run as job steps on the same processor nodes in order to work correctly. While specific node allocation using a host list file might work, the requested nodes might not be available when you invoke each program. The better solution is to instruct the Partition Manager to maintain your partition after execution of each program completes. You can then read multiple job steps from standard input or from a POE commands file.

In either case, you must first specify that you want the Partition Manager to maintain your partition for multiple job steps. To do this, you could:
Set the MP_NEWJOB environment variable:

ENTER
export MP_NEWJOB=yes

Or use the -newjob flag on the poe command:

ENTER
poe -newjob yes

Notes:

  1. You can only load a series of programs as job steps using the poe command. You cannot do this with either of the parallel debugger commands - pdbx and pedb.

  2. poe is its own shell. Whether successive steps run after a step completes is a function of the exit code, as described in IBM Parallel Environment for AIX: MPI Programming and Subroutine Reference.

Reading Job Steps From Standard Input

Say you want to run three SPMD programs - setup, computation, and cleanup - as job steps on the same partition. Assuming STDIN is keyboard entry, MP_PGMMODEL is set to spmd, and MP_NEWJOB is set to yes, you would:

ENTER
poe [poe-options]

* The Partition Manager allocates the processor nodes of your partition, and the following prompt displays:

0031-503 Enter program name (or quit):

ENTER
setup [program-options]

* The program setup executes on all nodes of your partition. When execution completes, the following prompt displays:

0031-503 Enter program name (or quit):

ENTER
computation [program-options]

* The program computation executes on all nodes of your partition. When execution completes, the following prompt displays:

0031-503 Enter program name (or quit):

ENTER
cleanup [program-options]

* The program cleanup executes on all nodes of your partition. When execution completes, the following prompt displays:

0031-503 Enter program name (or quit):

ENTER
quit

 
or

 
<Ctrl-d>

 
<Ctrl-d>

* The Partition Manager releases the nodes of your partition.

Notes:

  1. You can also run a series of MPMD programs in job step fashion from STDIN. If MP_PGMMODEL is set to mpmd, the Partition Manager will, after each step completes, prompt you to individually reload the partition as described in "Loading Nodes Individually From Standard Input".

  2. When MP_NEWJOB is yes, the Partition Manager, by default, looks to STDIN for job steps. However, if the environment variable MP_CMDFILE is set to the name of a POE commands file as described in "Reading Job Steps From a POE Commands File", the Partition Manager will look to the commands file instead. To ensure that job steps are read from STDIN, check that the MP_CMDFILE environment variable is unspecified.

Multi-Step STDIN for Newjob Mode

POE's STDIN processing model allows redirected STDIN to be passed to all steps of a newjob sequence, when the redirection is from a file. If redirection is from a pipe, POE does not distribute the input to each step, only to the first step.

Reading Job Steps From a POE Commands File

The MP_CMDFILE environment variable, and its associated command-line flag -cmdfile, lets you specify the name of a POE commands file. If MP_NEWJOB is yes, you can have the Partition Manager read job steps from a POE commands file. The commands file in this case simply lists the programs you want to run as job steps. For example, say you want to run the three SPMD programs setup, computation, and cleanup as job steps on the same partition. Your POE commands file would contain the following three lines:

setup [program-options]
 
computation [program-options]
 
cleanup [program-options]

Program-options represent the actual values you need to specify.
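
The three job steps can likewise be listed in a commands file built from the shell. Here setup, computation, and cleanup are the example program names (not real commands), and /tmp/jobsteps is an illustrative path:

```shell
# Build the job-step commands file: one job step per line, run in order.
cat > /tmp/jobsteps <<'EOF'
setup
computation
cleanup
EOF

# Keep the partition across steps and point POE at the file.
export MP_NEWJOB=yes
export MP_CMDFILE=/tmp/jobsteps
wc -l < "$MP_CMDFILE"
```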

If you are loading a series of MPMD programs, the POE commands file is also responsible for individually loading the partition. For example, say you had three master/worker MPMD job steps that you wanted to run as 4 tasks on the same partition. The following is a representation of what your POE commands file would contain. Options represent the actual values you need to specify.

master1 [options]
 
workers1 [options]
 
workers1 [options]
 
workers1 [options]
 
master2 [options]
 
workers2 [options]
 
workers2 [options]
 
workers2 [options]
 
master3 [options]
 
workers3 [options]
 
workers3 [options]
 
workers3 [options]

While you could also redirect STDIN to read job steps from a file, a POE commands file gives you more flexibility by not tying up STDIN. You can specify a POE commands file using its relative or full path name. Say your POE commands file is called /u/hinkle/jobsteps. To specify that the Partition Manager should read job steps from this file rather than STDIN, you could:
Set the MP_CMDFILE environment variable:

ENTER
export MP_CMDFILE=/u/hinkle/jobsteps

Or use the -cmdfile flag on the poe command:

ENTER
poe -cmdfile /u/hinkle/jobsteps

Once MP_NEWJOB is set to yes, and MP_CMDFILE is set to the name of your POE commands file, you would:

ENTER
poe [poe-options]

* The Partition Manager allocates the processor nodes of your partition, and reads job steps from your POE commands file. The Partition Manager does not release your partition until it reaches the end of your commands file.

Invoking a Non-Parallel Program On Remote Nodes

You can also use POE to run non-parallel programs on the remote nodes of your partition. Any executable (binary file, shell script, UNIX utility) is suitable, and it does not need to have been compiled with mpcc, mpCC, or mpxlf. For example, if you wanted to check the process status (using the AIX command ps) for all remote nodes in your partition, you would:

ENTER
poe ps

* The process status for each remote node is written to standard out (STDOUT) at your home node. How STDOUT from all the remote nodes is handled at your home node depends on the output mode. See "Managing Standard Output (STDOUT)" for more information.


Controlling Program Execution

This section describes a number of additional POE environment variables for monitoring and controlling program execution, each covered in one of the sections that follow.

For a complete listing of all POE environment variables, see Appendix B. "POE Environment Variables and Command-Line Flags".

Specifying Develop Mode

You can run programs in one of two modes - develop mode or run mode. In develop mode, intended for developing applications, the message passing interface performs more detailed checking during execution. Because of the additional checking it performs, develop mode can significantly slow program performance. In run mode, intended for completed applications, only minimal checking is done. While run mode is the default, you can use the MP_EUIDEVELOP environment variable to specify message passing develop mode. As with most POE environment variables, MP_EUIDEVELOP has an associated command-line flag -euidevelop. To specify MPI develop mode, you could:
Set the MP_EUIDEVELOP environment variable:

ENTER
export MP_EUIDEVELOP=yes

Or use the -euidevelop flag when invoking the program:

ENTER
poe program -euidevelop yes

To later go back to run mode, set MP_EUIDEVELOP to no.

You can also use MP_EUIDEVELOP for pedb parameter checking by specifying the DEB value, for "debug".
Set the MP_EUIDEVELOP environment variable:

ENTER
export MP_EUIDEVELOP=DEB

Or use the -euidevelop flag when invoking the program:

ENTER
poe program -euidevelop DEB

To stop parameter checking, set MP_EUIDEVELOP to min, for "minimum".

Making POE Wait for Processor Nodes

If you are using an SP system, and there are not enough available nodes to run your program, the Partition Manager, by default, returns immediately with an error. Your program does not run. Using the MP_RETRY and MP_RETRYCOUNT environment variables, however, you can instruct the Partition Manager to repeat the node request a set number of times at set intervals. Each time the Partition Manager repeats the node request, it displays the following message:

Retry allocation      ......press control-C to terminate

The MP_RETRY environment variable, and its associated command-line flag -retry, specifies the interval (in seconds) to wait before repeating the node request. The MP_RETRYCOUNT environment variable, and its associated command-line flag -retrycount, specifies the number of times the Partition Manager should make the request before returning. For example, if you wanted to retry the node request five times at five minute (300 second) intervals, you could:
Set the MP_RETRY and MP_RETRYCOUNT environment variables:

ENTER
export MP_RETRY=300
export MP_RETRYCOUNT=5

Or use the -retry and -retrycount flags when invoking the program:

ENTER
poe program -retry 300 -retrycount 5

Note:If the MP_RETRYCOUNT environment variable or the -retrycount command-line flag is used, the MP_RETRY environment variable or the -retry command-line flag must be set to at least one second.

Making POE Ignore Arguments

When you invoke a parallel executable, you can specify an argument list consisting of a number of program options and POE command-line flags. The argument list is parsed by POE - the POE command-line flags are removed and the remainder of the list is passed on to the program. If any of your program arguments are identical to POE command-line flags, however, this can cause problems. For example, say you have a program that takes the argument -retry. You invoke the program with the -retry option, but it does not execute correctly. This is because there is also a POE command-line flag -retry. POE parses the argument list and so the -retry option is never passed on to your program. There are two ways to correct this sort of problem. You can:

Making POE Ignore the Entire Argument List

When you invoke a parallel executable, POE, by default, parses the argument list and removes all POE command-line flags before passing the rest of the list on to the program. Using the environment variable MP_NOARGLIST, you can prevent POE from parsing the argument list. To do this:

ENTER
export MP_NOARGLIST=yes

When the MP_NOARGLIST environment variable is set to yes, POE does not examine the argument list at all. It simply passes the entire list on to the program. For this reason, you cannot use any POE command-line flags, but must use the POE environment variables exclusively. While most POE environment variables have associated command-line flags, MP_NOARGLIST, for obvious reasons, does not. To specify that POE should again examine argument lists, either set MP_NOARGLIST to no, or unset it.

ENTER
export MP_NOARGLIST=no

 
or

 
unset MP_NOARGLIST

Making POE Ignore a Portion of the Argument List

When you invoke a parallel executable, POE, by default, parses the entire argument list and removes all POE command-line flags before passing the rest of the list on to the program. You can use a fence, however, to prevent POE from parsing the remainder of the argument list. A fence is simply a character string you define using the MP_FENCE environment variable. Once defined, you can use the fence to separate those arguments you want parsed by POE from those you do not. For example, say you have a program that takes the argument -retry. Because there is also a POE command-line flag -retry, you need to put this argument after a fence. To do this, you could:

ENTER
export MP_FENCE=Q

 
poe program -procs 26 -infolevel 2 Q -retry RGB

While this example defines Q as the fence, keep in mind that the fence can be any character string. Any arguments placed after the fence are passed by POE, unexamined, to the program. While most POE environment variables have associated command-line flags, MP_FENCE does not.
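
To make the fence behavior concrete, the following shell fragment simulates the split that POE performs on the example argument list above (an illustration only, not POE's actual code):

```shell
# Simulate fence parsing: arguments before the fence are examined for
# POE flags; everything after the fence is passed through unexamined.
fence="Q"
poe_args=""; prog_args=""; past_fence=no
for arg in -procs 26 -infolevel 2 Q -retry RGB; do
  if [ "$past_fence" = no ] && [ "$arg" = "$fence" ]; then
    past_fence=yes          # the fence itself is consumed, not passed on
  elif [ "$past_fence" = no ]; then
    poe_args="$poe_args $arg"
  else
    prog_args="$prog_args $arg"
  fi
done
echo "POE parses:$poe_args"
echo "program receives:$prog_args"
```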

Managing Standard Input, Output, and Error

POE lets you control standard input (STDIN), standard output (STDOUT), and standard error (STDERR) in several ways. You can continue using the traditional I/O manipulation techniques such as redirection and piping, and can also use the POE environment variables and command-line flags described in the sections that follow.

Managing Standard Input (STDIN)

STDIN is the primary source of data going into a command. Usually, STDIN refers to keyboard input. If you use redirection or piping, however, STDIN could refer to a file or the output from another command (see "Using MP_HOLD_STDIN"). How you manage STDIN for a parallel application depends on whether or not its parallel tasks require the same input data. Using the environment variable MP_STDINMODE or the command-line flag -stdinmode, you can specify whether all tasks, a single task, or no task should receive STDIN.

Multiple Input Mode

Setting MP_STDINMODE to all indicates that all tasks should receive the same input data from STDIN. The home node Partition Manager sends STDIN to each task as it is read.

To specify multiple input mode so all tasks receive the same input data from STDIN, you could:
Set the MP_STDINMODE environment variable:

ENTER
export MP_STDINMODE=all

Or use the -stdinmode flag when invoking the program:

ENTER
poe program -stdinmode all

Note:If you do not set the MP_STDINMODE environment variable or use the -stdinmode command-line flag, multiple input mode is the default.

Single Input Mode

There are times when you only want a single task to read from STDIN. To do this, you set MP_STDINMODE to the appropriate task id. For example, say you have an MPMD application consisting of two programs - master and workers. The program master is designed to run as a single task on one processor node. The workers program is designed to run as separate tasks on any number of other nodes. The master program handles all I/O, so only its task needs to read STDIN. If master is running as task 0, you need to specify that only task 0 should receive STDIN. To do this, you could:
Set the MP_STDINMODE environment variable:

ENTER
export MP_STDINMODE=0

Or use the -stdinmode flag when invoking the program:

ENTER
poe program -stdinmode 0

Using MP_HOLD_STDIN

The environment variable MP_HOLD_STDIN is used to defer sending of STDIN from the home node to the remote node(s) until the message passing library has been initialized. The variable must be set to "yes" when using POE to invoke a program which: (1) has been compiled with mpcc, mpxlf, or mpCC and their _r equivalents for the threaded environment, and (2) will be reading STDIN from other than the keyboard (redirection or piping). Failing to export this environment variable when running these programs will likely result in the user program hanging.

In addition, if a program invoked using POE has not been compiled with mpcc, mpxlf, or mpCC, the environment variable must not be set (or set to "no") to ensure that STDIN is delivered to the remote node(s).
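
The two rules above can be summarized as a decision sketch (illustrative shell, not part of POE; it also ignores the threaded-library refinement discussed later in this section):

```shell
# Inputs to the decision; set these to match your program and invocation.
compiled_with_mp=yes    # built with mpcc/mpCC/mpxlf (or their _r forms)?
stdin_redirected=yes    # STDIN from a file or pipe rather than the keyboard?

if [ "$compiled_with_mp" = yes ] && [ "$stdin_redirected" = yes ]; then
  export MP_HOLD_STDIN=yes   # defer STDIN until message passing init completes
else
  export MP_HOLD_STDIN=no    # deliver STDIN immediately
fi
echo "MP_HOLD_STDIN=$MP_HOLD_STDIN"
```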

To set MP_HOLD_STDIN correctly, you need to know the relative order of your program's use of stdin data and initialization of the message passing library.

The discussion immediately below applies to the signal handling message passing library (MPI/MPL), which is initialized before the user's executable gets control.

The subsequent section addresses the question for the threaded MPI library.

Using Redirected STDIN

Note:Wherever the following description refers to a POE environment variable (starting with MP_), the use of the associated command line option produces the same effect, with the exception of MP_HOLD_STDIN, which has no associated command line option.

A POE process can use its STDIN in two ways. First, if the program name is not supplied on the command line and no command file (MP_CMDFILE) is specified, POE uses STDIN to resolve the names of the programs to be run as the remote tasks. Second, any "remaining" STDIN is then distributed to the remote tasks as indicated by the MP_STDINMODE and MP_HOLD_STDIN settings. In this dual STDIN model, redirected STDIN can then pose two problems:

  1. If using job steps (MP_NEWJOB=yes), the "remaining" STDIN is always consumed by the remote tasks during the first job step.

  2. If POE attempts program name resolution on the redirected STDIN, program behavior can vary when using job steps, depending on the type of redirection used and the size of the redirected STDIN.

The first problem is addressed in POE by performing a rewind of STDIN between job steps (only if STDIN is redirected from a file, for reasons beyond the scope of this document). The second problem is addressed by providing an additional setting for MP_STDINMODE of "none", which tells POE to only use STDIN for program name resolution. As far as STDIN is concerned, "none" means that none of it is ever delivered to the remote tasks. This provides an additional method of reliably specifying the program name to POE, by redirecting STDIN from a file or pipe, or by using the shell's here-document syntax in conjunction with the "none" setting. If MP_STDINMODE is not set to "none" when POE attempts program name resolution on redirected STDIN, program behavior is undefined.

The following scenarios describe in more detail the effects of using (or not using) an MP_STDINMODE of "none" when redirecting (or not redirecting) STDIN, as shown in the example:

                                        Is STDIN redirected?

                                          Yes     No

  Is MP_STDINMODE set to "none"?   Yes     A       B

                                   No      C       D

Scenario A

POE will use the redirected STDIN for program name resolution, only if no program name is supplied on the command line (MP_CMDFILE is ignored when MP_STDINMODE=none). No STDIN is distributed to the remote tasks. No rewind of STDIN is performed when MP_STDINMODE=none. If MP_HOLD_STDIN is set to "yes", this is ignored because no STDIN is being distributed.

Scenario B

POE will use the keyboard STDIN for program name resolution, only if no program name is supplied on the command line (MP_CMDFILE is ignored when MP_STDINMODE=none). No STDIN is distributed to the remote tasks. No rewind of STDIN is performed when MP_STDINMODE=none (also, STDIN is not from a file). If MP_HOLD_STDIN is set to "yes", this is ignored because no STDIN is being distributed.

Scenario C

POE will use the redirected STDIN for program name resolution, if required, and will distribute "remaining" STDIN to the remote tasks. If STDIN is intended to be used for program name resolution, program behavior is undefined in this case, since POE was not informed of this by setting MP_STDINMODE to "none" (see Problem 2 above). If STDIN is redirected from a file, POE will rewind STDIN between each job step. If MP_HOLD_STDIN is set to "yes", this feature will behave accordingly.

Scenario D

POE will use the keyboard STDIN for program name resolution, if required. Any "remaining" STDIN is distributed to the remote tasks. No rewind of STDIN is performed since STDIN is not from a file. If MP_HOLD_STDIN is set to "yes", it is ignored because STDIN is not redirected.

Using Redirected STDIN with the Threaded MPI Library

If the user's executable is compiled with the threaded MPI library, message passing initialization occurs when MPI_Init is called, not before POE gives the user program control. If MPI_Init is called before any STDIN data is read, the discussions of the previous section apply. If, however, all STDIN is read before MPI_Init is called, then MP_HOLD_STDIN should be set to "no", to allow the STDIN data to be sent to the user's executable by POE.

Managing Standard Output (STDOUT)

STDOUT is where the data coming from the command will eventually go. Usually, STDOUT refers to the display. If you use redirection or piping, however, STDOUT could refer to a file or another command. How you manage STDOUT for a parallel application depends on whether you want output data from one task or all tasks. If all tasks are writing to STDOUT, you can also specify whether or not output is ordered by task id. Using the environment variable MP_STDOUTMODE, you can specify unordered, ordered, or single output mode.

Unordered Output Mode

Setting MP_STDOUTMODE to unordered specifies that all tasks should write output data to STDOUT asynchronously. To specify unordered output mode, you could:
Set the MP_STDOUTMODE environment variable:

ENTER
export MP_STDOUTMODE=unordered

Or use the -stdoutmode flag when invoking the program:

ENTER
poe program -stdoutmode unordered

Notes:

  1. If you do not set the MP_STDOUTMODE environment variable or use the -stdoutmode command-line flag, unordered output mode is the default.

  2. If you are using unordered output mode, you will probably want the messages labeled by task id. Otherwise it will be difficult to know which task sent which message. See "Labeling Message Output" for more information.

  3. You can also specify unordered output mode from your program by calling the MP_STDOUTMODE or mpc_stdoutmode Parallel Utility Function. Refer to IBM Parallel Environment for AIX: MPI Programming and Subroutine Reference for more information.

  4. Although the above environment variable and Parallel Utility Function are both described as "MP_STDOUTMODE", they are each used independently for their specific purposes.

Ordered Output Mode

Setting MP_STDOUTMODE to ordered specifies ordered output mode. In this mode, each task writes output data to its own buffer. Later, all the task buffers are flushed, in order of task id, to STDOUT. The buffers are flushed as they become full, and when the program completes its run.

Note:When running the parallel application under pdbx with MP_STDOUTMODE set to ordered, there will be a difference in the ordering from when the application is run directly under poe. The buffer size available for the application's STDOUT is smaller because pdbx uses some of the buffer, so the task buffers fill up more often.

To specify ordered output mode, you could:
Set the MP_STDOUTMODE environment variable:

ENTER
export MP_STDOUTMODE=ordered

Or use the -stdoutmode flag when invoking the program:

ENTER
poe program -stdoutmode ordered

Note:You can also specify ordered output mode from your program by calling the MP_STDOUTMODE or mpc_stdoutmode Parallel Utility Function. Refer to IBM Parallel Environment for AIX: MPI Programming and Subroutine Reference for more information.

Single Output Mode

You can specify that only one task should write its output data to STDOUT. To do this, you set MP_STDOUTMODE to the appropriate task id. For example, say you have an SPMD application in which all the parallel tasks are sending the exact same output messages. For easier readability, you would prefer output from only one task - task 0. To specify this, you could:
Set the MP_STDOUTMODE environment variable:

ENTER
export MP_STDOUTMODE=0

Or use the -stdoutmode flag when invoking the program:

ENTER
poe program -stdoutmode 0

Note:You can also specify single output mode from your program by calling the MP_STDOUTMODE or mpc_stdoutmode Parallel Utility Function. Refer to IBM Parallel Environment for AIX: MPI Programming and Subroutine Reference for more information.

Labeling Message Output

You can set the environment variable MP_LABELIO, or use the -labelio flag when invoking a program, so that output from the parallel tasks of your program is labeled by task id. While not necessary when output is being generated in single mode, this ability can be useful in ordered and unordered modes. For example, say the output mode is unordered. You are executing a program and receiving asynchronous output messages from all the tasks. This output is not labeled, so you do not know which task has sent which message. It would be clearer if the unordered output was labeled. For example:

  7: Hello World
 
  0: Hello World
 
  3: Hello World
 
 23: Hello World
 
 14: Hello World
 
  9: Hello World

To have the messages labeled with the appropriate task id, you could:
Set the MP_LABELIO environment variable:

ENTER
export MP_LABELIO=yes

Or use the -labelio flag when invoking the program:

ENTER
poe program -labelio yes

To no longer have message output labeled, set the MP_LABELIO environment variable to no.

Setting the Message Reporting Level for Standard Error (STDERR)

You can set the environment variable MP_INFOLEVEL to specify the level of messages you want from POE. You can set the value of MP_INFOLEVEL to one of the integers shown in the following table. The integers 0, 1, and 2 give you different levels of informational, warning, and error messages. The integers 3 through 6 indicate debug levels that provide additional debugging and diagnostic information. Should you require help from the IBM Support Center in resolving a PE-related problem, you will probably be asked to run with one of the debug levels. As with most POE environment variables, you can override MP_INFOLEVEL when you invoke a program. This is done using either the -infolevel or -ilevel flag followed by the appropriate integer.
This integer: Indicates this level of message reporting:

0 (Error) - Only error messages from POE are written to STDERR.
1 (Normal) - Warning and error messages from POE are written to STDERR. This level of message reporting is the default.
2 (Verbose) - Informational, warning, and error messages from POE are written to STDERR.
3 (Debug Level 1) - Informational, warning, and error messages from POE are written to STDERR. Also written is some high-level debugging and diagnostic information.
4 (Debug Level 2) - Informational, warning, and error messages from POE are written to STDERR. Also written is some high- and low-level debugging and diagnostic information.
5 (Debug Level 3) - Debug level 2 messages plus some additional loop detail.
6 (Debug Level 4) - Debug level 3 messages plus other informational and error messages for the greatest amount of diagnostic information.

Let us say you want the POE message level set to verbose. There are two ways to do this. You could:
Set the MP_INFOLEVEL environment variable:

ENTER
export MP_INFOLEVEL=2

Or use the -infolevel (or -ilevel) flag when invoking the program:

ENTER
poe program -infolevel 2
or
poe program -ilevel 2


As with most POE command-line flags, the -infolevel and -ilevel flags temporarily override their associated environment variable.

Generating a Diagnostic Log on Remote Nodes

Using the MP_PMDLOG environment variable, you can also specify that diagnostic messages should be logged to a file in /tmp on each of the remote nodes of your partition. The log file is named mplog.pid.n, where pid is the AIX process id of the Partition Manager Daemon, and n is the task number. Should you require help from the IBM Support Center in resolving a PE-related problem, you will probably be asked to generate these diagnostic logs.

The ability to generate diagnostic logs on each node is particularly useful for isolating the cause of abnormal termination, especially when the connection between the remote node and the home node Partition Manager has been broken. As with most POE environment variables, you can temporarily override the value of MP_PMDLOG using its associated command-line flag -pmdlog. For example, to generate a pmd log file, you could:
Set the MP_PMDLOG environment variable:

ENTER
export MP_PMDLOG=yes

Or use the -pmdlog flag when invoking the program:

ENTER
poe program -pmdlog yes

Note:By default, MP_PMDLOG is set to no, and no diagnostic logs are generated. Routinely running with MP_PMDLOG set to yes is not suggested, as it greatly impacts performance and fills up your file system space.


Checkpointing and Restarting Programs

You can set the environment variables MP_CHECKFILE and MP_CHECKDIR to checkpoint or restart a program that was previously compiled with the mpcc_chkpt, mpCC_chkpt, or mpxlf_chkpt commands. Only POE/MPI applications submitted under LoadLeveler in batch mode can be checkpointed. Checkpointing of interactive POE applications is not allowed.

The program's execution will be suspended when the mp_chkpt() function is reached. See IBM Parallel Environment for AIX: MPI Programming and Subroutine Reference for the description of the mp_chkpt function. At that point, the state of the application is captured, along with all data, and saved to the file pointed to by the MP_CHECKFILE and MP_CHECKDIR variables.

MP_CHECKFILE defines the base name of the checkpoint file. MP_CHECKDIR defines the directory where the checkpoint file will reside. If the MP_CHECKFILE variable is not specified, the program cannot be checkpointed. The file name specified by MP_CHECKFILE may include the full path, in which case the MP_CHECKDIR variable will be ignored. If MP_CHECKDIR is not defined and MP_CHECKFILE does not specify a full path name, then MP_CHECKFILE is used as a relative path name from the current working directory.
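The resolution rules above can be summarized in a short sketch. The function below is not part of POE; resolve_checkpoint_base is a hypothetical helper that simply mimics the documented precedence of MP_CHECKFILE and MP_CHECKDIR:

```shell
# Sketch of the documented path resolution for the checkpoint base file.
resolve_checkpoint_base() {
    checkfile="$1"   # value of MP_CHECKFILE
    checkdir="$2"    # value of MP_CHECKDIR (may be empty)
    if [ -z "$checkfile" ]; then
        # Without MP_CHECKFILE, the program cannot be checkpointed.
        echo "error: MP_CHECKFILE must be set" >&2
        return 1
    fi
    case "$checkfile" in
        /*) echo "$checkfile" ;;          # full path: MP_CHECKDIR is ignored
        *)  if [ -n "$checkdir" ]; then
                echo "$checkdir/$checkfile"
            else
                echo "$(pwd)/$checkfile"  # relative to the current directory
            fi ;;
    esac
}

resolve_checkpoint_base /ckpt/myprog.chk /ignored   # prints /ckpt/myprog.chk
resolve_checkpoint_base myprog.chk /gpfs/ckpt       # prints /gpfs/ckpt/myprog.chk
```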

Only programs compiled with the checkpoint compile scripts (mpcc_chkpt, mpCC_chkpt, or mpxlf_chkpt) that call the mp_chkpt function can be checkpointed.

When the checkpoint file is created during the checkpointing phase, the task id and a version id are appended to the base file name to differentiate between checkpoint files from different instances of the program.

There are certain limitations associated with checkpointing an application. Please refer to IBM Parallel Environment for AIX: MPI Programming and Subroutine Reference for specific details.

Restarting a Checkpointed Program

A program can be restarted by executing POE, using MP_CHECKFILE and MP_CHECKDIR to point to the checkpoint file from the previously checkpointed program. The checkpoint file must be valid and accessible to all tasks specified when invoking POE. The application can be restarted on the same or a different set of nodes. However, the number of tasks must remain the same.
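Because the task id and a version id are appended to the base file name, each task must be able to find its own checkpoint file at restart. The sketch below checks for this before a restart is attempted; check_restart_files is a hypothetical helper, and the <base>.<task>.<version> suffix form is an assumption, so adjust the pattern to match your actual checkpoint files:

```shell
# Sketch: confirm a checkpoint file exists for every task before restart.
# Assumes the suffix form <base>.<task>.<version>.
check_restart_files() {
    base="$1"    # resolved MP_CHECKDIR/MP_CHECKFILE base name
    ntasks="$2"  # number of tasks (must match the checkpointed run)
    missing=0
    t=0
    while [ "$t" -lt "$ntasks" ]; do
        # ls fails when no file matches the task's checkpoint pattern.
        if ! ls "$base.$t".* >/dev/null 2>&1; then
            echo "no checkpoint file for task $t" >&2
            missing=1
        fi
        t=$((t + 1))
    done
    return $missing
}
```

A zero return means every task's checkpoint file was found; a nonzero return names the tasks whose files are missing.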

During restart processing, the version and content of the checkpoint file are verified internally by POE to ensure consistency and accuracy. Any discrepancy, such as a version mismatch (that is, the versions of the specified checkpoint files are not the same across all tasks), will be reported.

The checkpoint file will be read, and the program will be restored to an executing state, after retrieving the program state and data information from the file. When execution is completely restored, the checkpoint files are deleted.

If you are using the MP_BUFFER_MEM environment variable to change the maximum size of memory used by the communication subsystem when checkpointing a program, be aware that the amount of space needed for the checkpoint files increases by the value of MP_BUFFER_MEM.

Checkpointing File Management

The ability to checkpoint or restart programs is controlled by the definition and availability of the checkpoint files, as specified by the MP_CHECKFILE environment variable.

The specified file may reside in the local file system (JFS) of the node on which the instance of the program is running, or in a shared file system such as NFS, AFS, DFS, or GPFS. When the file is in a local file system, the checkpoint file must be moved to the new system before the process can be restarted there. If the old system has crashed and is unavailable, it may not be possible to restart the program. Some kind of file management may therefore be necessary to avoid this problem. If migration is not desired, it is sufficient to place checkpoint files in the local JFS file system.

The program checkpoint files can be large and numerous, so significant amounts of disk space may be needed to maintain them. Do not use NFS, AFS, or DFS for managing checkpoint files; with these file systems it takes a very long time to write and read large files. Use GPFS or JFS instead.

If a local JFS file system is used, the checkpoint file must be written to each remote task's local file system during checkpointing. Consequently, during a restart, each remote task's local file system must be able to access the checkpoint file from the previously checkpointed program. This is of special concern when opting to restart a program on a different set of nodes from the one on which it was checkpointed, because the local checkpoint files may need to be relocated to the new nodes. For these reasons, GPFS is the file system best suited for checkpoint and restart file management.
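When local JFS checkpoint files must be relocated for a restart on different nodes, the step amounts to copying every file matching the base name to the new location. The sketch below illustrates this; relocate_checkpoints is a hypothetical helper, and in practice the copy to each new node would use a remote copy command such as rcp, while plain cp keeps the sketch self-contained:

```shell
# Sketch: stage checkpoint files written to a local JFS file system so a
# restart on another node can find them.
relocate_checkpoints() {
    src="$1"    # directory holding the existing checkpoint files
    dst="$2"    # staging directory for the new node
    base="$3"   # checkpoint base name (MP_CHECKFILE)
    mkdir -p "$dst" || return 1
    for f in "$src/$base".*; do
        # Skip the unexpanded pattern when no checkpoint files exist.
        [ -f "$f" ] || continue
        cp "$f" "$dst/" || return 1
    done
}
```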


Running POE within a Distributed File System

This section gives you instructions on how to run POE within a Distributed File System (DFS). Included is a description of the poeauth command, which allows you to copy DFS credentials to all nodes on which you want to run POE jobs.
Note:When running POE under LoadLeveler, LoadLeveler handles all user authorization instead of POE.

Setting Up Your System to Run POE

In order to run POE jobs from DFS, you need to copy the DFS/DCE credentials files to each node you wish to run on, using the poeauth command. You should be set up with a DFS account; after you log in, you access your DCE user credentials by doing a dce_login.

DCE credentials are defined on a per-user basis; therefore, each user must use poeauth to copy the credentials before running a POE job on a DFS/DCE system.

Before running the poeauth command, make some initial file and directory changes. In your pool or host list file, define all nodes on which you want to run POE jobs. Then change directories to a local, non-DFS file system, for example, /tmp. Because the poeauth command sets up the DFS credentials for POE, it cannot be run with a DFS directory as the current directory.

The execution of the poeauth command is dependent upon the type of user authorization specified by the MP_AUTH environment variable - either AIX or DFS/DCE authorization.

When AIX user authorization is selected (either by setting MP_AUTH=AIX or accepting it as the default) and your home directory resides in DFS, your user name must be properly authorized to access those nodes in the /etc/hosts.equiv file on each node. Remove all entries from the .rhosts files on each node, and allow the /etc/hosts.equiv file to authorize the users on each node; otherwise, POE will not be able to authorize users properly. Once DFS credentials are established, you can use a .rhosts file.

The dce_login command starts a new shell. As a result, set any environment variables needed to run poeauth or other POE applications (such as MP_AUTH) after doing the dce_login.

Running the poeauth Command

Run the poeauth command from task 0, the node on which the dce_login was performed. Because poeauth is a POE application, you can use any POE command-line flag or environment variable with it. Each user must run poeauth before running any POE applications. Once the credentials are copied, there is no need to run poeauth again until the credentials expire (at which time you will need to copy them again using poeauth). After you run the poeauth command successfully, you can run POE from DFS. For more information on the poeauth command, see Appendix A. "Parallel Environment Commands".
Note:Credentials files need to exist on the home node (task 0), that is, from where dce_login was performed. The poeauth command needs to be run from task 0.

Checking for Errors

When POE returns error messages related to an inability to change to a DFS directory or a problem copying a file to a DFS directory, it most likely means there is a problem with the DFS credentials on that task or node. Check to see if the credentials were properly copied with poeauth, or if they have expired (use the klist command).

Since poeauth is a POE application, if you try to run it when the credentials have expired, POE will encounter an error accessing the expired credentials.

If the credentials have expired, you must do another dce_login and run the poeauth command again.

POE maintains a master control file in /tmp to keep track of the credentials. If /tmp is periodically cleaned out or the file is accidentally erased before your credentials expire, POE will not be able to access your DCE credentials and you may get errors related to the inability to access credentials. If this occurs, you will need to run the poeauth command again to redefine your credentials to POE.

