This appendix documents various limitations, restrictions, and programming considerations for user applications written to run under the IBM Parallel Environment for AIX (PE) licensed program.
PE includes two versions of the message passing libraries. These are called the signal-handling library and the threaded library.
This appendix consists of sections that list the programming considerations common to both libraries, as well as those unique to either the signal-handling library or the threaded library. There is also a subsection on using POE and the Fortran compiler. Specifically, the sections are as follows:
The information in this section pertains to both the (MPL/MPI) signal-handling library and the MPI threaded library.
As the end user, you are encouraged to think of the Parallel Operating Environment (POE) (also referred to as the poe command) as an ordinary (serial) command. It accepts redirected I/O, can be run under the nice and time commands, interprets command flags, and can be invoked in shell scripts.
An n-task parallel job running in the Parallel Operating Environment actually consists of the n user tasks, an equal number (n) of instances of the IBM Parallel Environment for AIX pmd daemon (which is the parent task of the user's task), and the POE home node task in which the poe command runs. A pmd daemon is started by the POE home node on each machine on which each user task runs, and serves as the point of contact between the home node and the user's tasks.
The POE home node routes standard input, standard output and standard error streams between the home node and the user's tasks via the pmd daemon, using TCP/IP sockets for this purpose. The sockets are created when the POE home node starts the pmd daemon for each task of a parallel job. The POE home node and pmd also use the sockets to exchange control messages to provide task synchronization, exit status and signaling. These capabilities do not depend upon the message passing library and are available to control any parallel program run by the poe command.
Exit status is a value between 0 and 255 inclusive. It is returned from POE on the home node reflecting the composite exit status of your parallel application, as follows:
The POE job-step function is intended for the execution of a sequence of separate yet inter-related dependent programs. Therefore, it provides you with a job control mechanism that allows both job-step progression and job-step termination. The job control mechanism is the program's exit code.
POE continues the job-step sequence if the task exit code is 0 or in the range of 2 - 127.
POE terminates the parallel job, and does not execute any remaining user programs in the job-step list if the task exit code is 1 or greater than 127.
Any POE infrastructure detected failure (such as failure to open pipes to the child task or an exec failure to start the user's executable) terminates the parallel job, and does not execute any remaining user programs in the job-step queue.
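The job-step rules above can be summarized as a small predicate. The following C sketch is illustrative only: the function name job_step_continues is hypothetical and is not part of POE. It simply encodes the exit-code ranges stated above.

```c
#include <stdbool.h>

/* Hypothetical helper, not a POE routine.
 * Returns true when, by the rules above, POE would go on to the next
 * program in the job-step list: exit code 0 or 2 through 127.
 * Exit code 1 and any code above 127 end the job-step sequence. */
bool job_step_continues(int exit_code)
{
    return exit_code == 0 || (exit_code >= 2 && exit_code <= 127);
}
```

A program that is meant to stop the sequence on failure would therefore exit with 1 (or a value above 127), while a program that wants to report a warning and still let the next step run would pick a value in the range 2 - 127.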
POE links in the following routines when your executable is compiled with any of the POE compilation scripts (mpcc, mpcc_r, mpxlf, etc.).
POE installs signal handlers for most signals that cause program termination in order to notify the other tasks of termination and to complete the VT trace file, if enabled. POE then causes the program to exit normally with a code of (128+signal). When running non-threaded applications under POE, you may install a signal handler for any of these signals, and it should call the POE registered signal handler if the task decides to terminate. (See "Let POE Handle Signals When Possible".) When running threaded applications, any attempt to install a signal handler is ignored.
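Because a task terminated by a signal exits with a code of (128+signal), a program or script that examines the exit status can recover the signal number. The following C sketch shows one way to do that; the function name signal_from_exit_code is hypothetical. Note the assumption: a user program may also exit deliberately with a value above 128, so the result is only a hint, not proof that a signal occurred.

```c
/* Hypothetical helper, not a POE routine.
 * Returns the signal number encoded in an exit code of the form
 * 128 + signal, or 0 when the code is outside the range 129..255. */
int signal_from_exit_code(int exit_code)
{
    if (exit_code > 128 && exit_code <= 255)
        return exit_code - 128;
    return 0;
}
```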
Signals that are specifically handled by POE or the message passing library follow:
Caught and exits with an exit code of 128+SIGHUP.
Caught and exits with an exit code of 128+SIGINT.
Note: This signal may be caught by user or by dbx, in which case this usage is ignored.
Caught, sets the default signal handler and calls exit handler with an exit code of 128+SIGQUIT. The exit handler dumps the user's context and takes the default signal action.
Caught, sets the default signal handler and calls exit handler with an exit code of 128+SIGFPE. The exit handler dumps the user's context and takes the default signal action.
Caught, sets the default signal handler and calls exit handler with an exit code of 128+SIGSEGV. The exit handler dumps the user's context and takes the default signal action.
Caught, sets the default signal handler and calls exit handler with an exit code of 128+SIGBUS. The exit handler dumps the user's context and takes the default signal action.
Caught and exits with an exit code of 128+SIGTERM. This is also used by POE to signal orderly termination of a parallel job. If it must be caught by the user, please read carefully the section on program termination (below).
Default action (cannot be caught)
Caught, sets the default signal handler and calls exit handler with an exit code of 128+SIGPWR. The exit handler dumps the user's context and takes the default signal action.
Caught and exits with an exit code of 128+SIGDANGER.
The signal-handling library uses SIGIO, SIGALRM and SIGPIPE for its operations and it also handles these signals. For more information about the signal-handling library, see "MPI Signal-Handling Library Considerations". For more information about signals, see "Use of AIX Signals".
POE requires its own versions of the library exit()/atexit() functions, and expects to load them dynamically from its own version of libc.a (or libc_r.a) in /usr/lpp/ppe.poe/lib; therefore, do not code your own exit function to override the library function. This is to synchronize profiling and to provide barrier synchronization upon exit.
Programs that handle signals must coordinate with POE's handling of most of the common signals (see above).
DO NOT issue message passing calls from signal handlers. Also, many AIX library calls are not "signal safe", and should not be issued from signal handlers. Check the AIX Technical Reference (function sigaction()) for a list of AIX functions callable from signal handlers.
POE sets up signal handlers for all the signals that normally terminate program execution. It does this so that it can terminate the entire parallel job in an orderly fashion if one task terminates abnormally (via signal). A user program may install a handler for any or all of these signals, but should save the address of the POE signal handler. If the user program decides to terminate, it should call the POE signal handler. If the user program decides not to terminate, it should just return to the interrupted code. SIGTERM is used by POE to shut down the parallel job in a variety of abnormal circumstances, and should be allowed to terminate the job.
The POE home node converts a user's SIGTSTP signal (Ctrl-z) to a SIGSTOP signal to all the remote nodes, and passes the SIGCONT signal sent by the fg or bg command to all the remote nodes to restart the job.
Do not use hard coded file descriptor numbers beyond those specified by STDIN, STDOUT and STDERR.
POE opens several files and uses file descriptors as message passing handles. These are allocated before the user gets control, so the first file descriptor allocated to a user is unpredictable.
POE provides for orderly termination of a parallel job, so that all tasks terminate at the same time. This is accomplished in the atexit routine registered at program initialization. For normal exits (codes 0, 2-127), the atexit routine sends a control message to the POE home node, and waits for a positive response. For abnormal exits and those which don't go through the atexit routine, the pmd daemon catches the exit code and sends a control message to the POE home node.
For normal exits, when POE gets a control message for every task, it responds to each node, allowing that node to exit normally with its individual exit code. The pmd daemon monitors the exit code and passes it back to the POE home node for presentation to the user.
For abnormal exits and those detected by pmd, POE sends a message to each pmd asking that it send a SIGTERM signal to its task, thereby terminating the task. When the task finally exits, pmd sends its exit code back to the POE home node and exits itself.
User-initiated termination of the POE home node via SIGINT (Ctrl-c) and/or SIGQUIT (Ctrl-\) causes a message to be sent to pmd asking that the appropriate signal be sent to the parallel task. Again, pmd waits for the task to die then terminates itself.
To prevent uncontrolled root access to the entire parallel job computation resource, POE checks to see that the user is not root as part of its authentication.
The use of the following AIX functions may be limited, but no formal testing has been done:
You can have POE run a shell script which is loaded and run on the remote nodes as if it were a binary file.
If the POE home node task is not started under the Korn shell, mounted file system names may not be mapped correctly to the names defined for the automount daemon or AIX equivalent running on the IBM RS/6000 SP. See the IBM Parallel Environment for AIX: Operation and Use, Volume 1 for a discussion of alternative name mapping techniques.
The program executed by POE on the parallel nodes does not run under a shell on those nodes. Redirection and piping of STDIO applies to the POE home node (poe binary), and not the user's code. If shell processing of a command line is desired on the remote nodes, invoke a shell script on the remote nodes to provide the desired preprocessing before the user's application is executed.
The partition manager daemon uses pipes to direct stdin, stdout and stderr to the user's program, therefore, do not rewind these files.
Quotation marks, either single or double, used as argument delimiters are stripped away by the shell and are never "seen" by poe. Therefore, the quotation marks must be escaped to allow the quoted string to be passed correctly to the remote task(s) as one argument. For example, if you want to pass the following string to the user program (including the embedded blank)
a b
then you need to enter the following:
poe user_program \"a b\"
user_program is passed the following argument as one token:
a b
Without the backslashes, the string would have been treated as two arguments (a and b).
POE behaves like rsh when arguments are passed to POE. Therefore, the following:
poe user_program "a b"
is equivalent to:
rsh some_machine user_program "a b"
In order to pass the string argument as one token, the quotes have to be escaped.
Programs generating large volumes of STDOUT or STDERR may overload the home node. As described previously, standard output and standard error files generated by a user's program are piped to pmd, then forwarded to the poe binary via a TCP/IP socket. It is possible to generate so much data that the IP message buffers on the home node are exhausted, the poe binary hangs, and possibly the entire node hangs. Note that the option -stdoutmode (environment variable MP_STDOUTMODE) controls which output stream is displayed by the poe binary, but does not limit the standard output traffic received from the remote nodes, even if set to display the output of just one node.
The POE environment variable MP_SNDBUF can be used to override the default network settings for the size of the TCP buffers used.
If you have large volumes of standard I/O, work with your network administrator to establish appropriate TCP/IP tuning parameters. You may also want to examine if using named pipes is appropriate for your application.
When your program runs on the remote nodes, it has no controlling terminal. STDIN, STDOUT, and STDERR are always piped.
Programs that depend on piping standard input or standard output as part of a processing sequence may wish to bypass the home node poe binary. Running the poe command (or starting a program compiled with one of the POE compile scripts) causes the poe binary to be loaded on the machine on which you typed the command (the POE home node). The poe binary, in turn, starts a daemon named pmd on each parallel node assigned to run the job, and then requests pmd to run your executable (via fork and exec). The poe binary reads STDIN and passes it to each of the parallel tasks via a TCP/IP socket connection to the pmd daemon, which pipes it to the user. Similarly, STDOUT and STDERR from the user are piped to pmd and sent on the socket back to the home node, where it is written to the poe binary's STDOUT and STDERR descriptors. If you know that the task reading STDIN or writing STDOUT must be on the same node (processor) as the poe binary (the poe home node), named pipes can be used to bypass poe's reading and forwarding STDIN and STDOUT.
If STDIN is piped or redirected to the poe binary (via ordinary pipes), and your application is linked with the signal-handling message passing library (via mpcc, mpxlf, or mpCC), then set the environment variable MP_HOLD_STDIN to "yes". This lets poe initialize the signal-handling library before handling the STDIN file.
If your application is linked with the threaded library, see "Standard I/O Requires Special Attention" for more information.
The following two scripts show how STDIN and STDOUT can be piped directly between pre- and post-processing steps, bypassing the POE home node task. This example assumes that parallel task 0 is known or forced to be on the same node as the POE home node.
The script compute_home runs on the home node; the script compute_parallel runs on the parallel nodes (those running tasks 0 through n-1).
compute_home:
#! /bin/ksh
# Example script compute_home runs three tasks:
#   data_generator creates/gets data and writes to stdout
#   data_processor is a parallel program that reads data
#       from stdin, processes it in parallel, and writes
#       the results to stdout.
#   data_consumer reads data from stdin and summarizes it
#
mkfifo poe_in_$$
mkfifo poe_out_$$
export MP_STDOUTMODE=0
export MP_STDINMODE=0
data_generator >poe_in_$$ |
   poe compute_parallel poe_in_$$ poe_out_$$ data_processor |
   data_consumer <poe_out_$$
rc=$?
rm poe_in_$$
rm poe_out_$$
exit $rc
compute_parallel:
#! /bin/ksh
# Example script compute_parallel is a shell script that
# takes the following arguments:
#   1) name of input named pipe (stdin)
#   2) name of output named pipe (stdout)
#   3) name of program to be run (and arguments)
#
poe_in=$1
poe_out=$2
shift 2
$* <$poe_in >$poe_out
Environment variables starting with MP_ are intended for use by POE, and should be set only as instructed in the documentation. POE also uses a handful of MP_... environment variables for internal purposes, which should not be interfered with.
POE assumes that NLSPATH contains the appropriate POE message catalogs, even if LANG is set to "C" or is unset. Duplicate message catalogs are provided for languages "En_US", "en_US", and "C".
The Fortran, C and C++ bindings for MPI are contained in the same library and can be freely intermixed.
Refer to "Fortran Considerations" for more information about the Fortran compiler.
The AIX compilers support the flag -qarch. This option allows you to target code generation to a particular processor architecture. While this option can provide performance enhancements on specific platforms, it inhibits portability, particularly between the Power and PowerPC machines. The MPI library is not targeted to a specific architecture and is the same on PowerPC and Power nodes.
The MPI-IO functions from MPI-2 are only available with the threaded library.
AIX makes available up to 11 additional address segments for end user programs. The MPI libraries use some of these as listed in Table 16. The remaining segments are available to the user for either extended heap (-bmaxdata option) or shared memory (shmget). Very large jobs, which include all jobs with more than 1000 tasks, will need to use the -bmaxdata option to ensure a large enough heap.
Table 16. Memory Segments Used By the MPI and LAPI Libraries
|Component||RS/6000 SP node with switch||RS/6000 workstation or no switch|
|MPI User Space||2||not available|
|VT Trace Capture||1||0|
|LAPI User Space||2||not available|
* If the environment variable MP_CLOCK_SOURCE=AIX, the value is 0.
The RS/6000 SP switch clock is a globally-synchronized counter that may be used as a source for the MPI_WTIME function, provided that all tasks are run on nodes of the same SP system. The environment variable MP_CLOCK_SOURCE provides additional control. Table 17 shows how the clock source is determined. MPI guarantees that MPI_WTIME_IS_GLOBAL has the same value at every task.
Table 17. How the Clock Source Is Determined
|MP_CLOCK_SOURCE||Library Version||All Nodes SP?||Source Used||MPI_WTIME_IS_GLOBAL|
* The user is responsible for ensuring all of the nodes are in the same SP system.
POE compiles and runs all applications as 32-bit applications. 64-bit applications are not supported yet.
If you plan to run your parallel applications with a large number of tasks (more than 256), the following tips may improve stability and performance:
ulimit -n 10000
The information in this subsection provides you with specific additional programming considerations for when you are using POE and the MPL/MPI signal-handling library.
POE sets up its environment via the entry point mp_main(). mp_main() initializes the message passing library, sets up signal handlers, sets up an atexit routine, and initializes VT trace data collection before calling your main program.
Only a subset of MPL message passing is allowed on handlers created by the MPL Receive and Call function (mpc_rcvncall or MP_RCVNCALL). MPI calls on these handlers are not supported.
POE links in the following routines when your executable is compiled with mpcc, mpxlf or mpCC. These are routines specific for the signal handling environment.
POE initializes the parallel message passing library and determines that all nodes can communicate successfully before the user main() program gains control. As a result, any program compiled with the POE compiler scripts must be run under the control of POE and is not suitable as a serial program.
If communication initialization fails, the parallel task is terminated with an appropriate exit code.
The message passing library sets up signal handlers for SIGALRM, SIGIO and SIGPIPE to manage message passing activity. A user program may install a handler for any or all of these signals, but should save the address of and invoke the POE signal handler before returning to the interrupted code. The sigaction() function returns the required structure. Also, set SA_RESTART as well as the mask so all signals are masked when the signal handler is running.
The following are the signals used and specifically handled by the message passing library in a signal handling environment:
Caught by the non-threaded User Space message passing library to manage the RS/6000 SP switch. If your application catches this signal, it should call the registered message passing signal handler before returning to the main code.
Do not block this signal for more than a few milliseconds.
Caught by message passing library to manage message traffic. If you provide your own interval timing mechanism, then you should arrange to call the POE signal handler approximately every 200-800 milliseconds. Message passing calls from user programs may be blocked until the POE signal handler is called.
If the user application catches this signal but doesn't do interval timing, it should call the registered message passing signal handler before returning to the main code.
Caught by the user space message passing library to manage message traffic. If your application catches this signal, it should call the registered message passing signal handler before returning to the main code.
The message passing library uses an interval timer to manage message traffic, specifically to ensure that messages progress even when message passing calls are not being made. When this interval timer expires, a SIGALRM signal is sent to the program, interrupting whatever computation is in progress. The message passing library has a signal handler set, and normally handles the signal and returns to the user's program without the program's knowledge. However, the following library and system calls are interrupted and do not complete normally. The user is responsible for testing whether an interrupt occurred and recovering from the interrupt. In many cases, this is accomplished by just retrying the call.
Note: The normal timer interval is less than 500 milliseconds. So a sleep call (with time specified in seconds) returns to the original sleep interval, due to rounding, and can't be used to determine how much time remains in the interval. You should use the functions usleep and nsleep instead. See also the "Sample Replacement Sleep Program" in Appendix H. "Using Signals and the IBM PE Programs".
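One way to sleep for a full interval despite the interval timer is to use a call that reports the unslept time with sub-second precision and to retry with the remainder. The following C sketch uses the POSIX nanosleep() function for this purpose, as a stand-in for the AIX usleep and nsleep functions named above. The function name sleep_uninterrupted is hypothetical, and this is not the sample program from Appendix H.

```c
#include <errno.h>
#include <time.h>

/* Hypothetical helper: sleeps for the whole requested interval, resuming
 * after each signal interruption. Returns 0 on success, -1 on a real error. */
int sleep_uninterrupted(time_t seconds, long nanoseconds)
{
    struct timespec request;
    struct timespec remaining;

    request.tv_sec = seconds;
    request.tv_nsec = nanoseconds;

    while (nanosleep(&request, &remaining) == -1) {
        if (errno != EINTR)
            return -1;          /* for example, an invalid interval */
        request = remaining;    /* interrupted: sleep only the time still owed */
    }
    return 0;
}
```

For example, sleep_uninterrupted(2, 500000000L) requests a 2.5 second sleep and keeps going if a SIGALRM arrives partway through.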
With the exception of sleep, system and exec, the routines listed above set the system error indicator (the variable errno) to EINTR, which can be tested by the user's program. See the "Sample Replacement Select Program" in Appendix H. "Using Signals and the IBM PE Programs".
Normal file read and write are restarted automatically by AIX, and should not require any special treatment by the user.
The system and fork calls create a new task in which the interval timer is still running. If a fork is followed by an exec (which is what system does), the signal handler for the timer is overlaid, and the task is terminated when the interval timer expires.
To handle this for the system call, temporarily turn the interval timer off (using the alarm(0) call) before the call, and turn it on again (ualarm(500000, 500000) will do) after the system call.
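The pause-and-resume step around the system call can be sketched in C as follows. Instead of the alarm(0) and ualarm pair named above, this sketch uses the POSIX setitimer() function to save whatever timer setting was in effect and to restore it afterward. That substitution rests on the assumption that the library's interval timer is the process's ITIMER_REAL timer. The function name system_with_timer_paused is hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>
#include <sys/time.h>

/* Hypothetical helper: runs command via system() with the real-time
 * interval timer stopped, then restores the previous timer setting.
 * Returns the status from system(), or -1 if the timer calls fail. */
int system_with_timer_paused(const char *command)
{
    struct itimerval stopped = { { 0, 0 }, { 0, 0 } };   /* timer off */
    struct itimerval saved;
    int status;

    /* Stop the timer and remember its interval and remaining time. */
    if (setitimer(ITIMER_REAL, &stopped, &saved) == -1)
        return -1;

    status = system(command);

    /* Put the timer back the way it was before the call. */
    if (setitimer(ITIMER_REAL, &saved, NULL) == -1)
        return -1;

    return status;
}
```

Saving and restoring the previous setting means the helper does not have to know the library's timer interval, and it leaves the timer off if it was off to begin with.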
To handle the interval timer for a forked child, merely turn off the interval timer via alarm(0) in the child.
There are other restrictions on fork described below.
As described earlier, if a task forks, the forked child inherits the running timer. The timer should be turned off before forking another program. If the forked child does not exec another program, it should be aware that an atexit routine has been registered for the parent which is also inherited by the child. In most cases, the atexit routine will request POE to terminate the task (parent). A forked child should terminate with an _exit(0) system call to prevent the atexit routine from being called. Also, if the forked parent terminates before the child, the child task will not be cleaned up by POE.
A forked child must not call the message passing library.
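The fork restrictions above can be combined into one pattern: the child turns off the inherited timer, does its work without touching the message passing library, and leaves through _exit(0), while the parent waits for the child so that it does not terminate first. The following C sketch illustrates this under those assumptions; the function name run_in_child and its callback parameter are hypothetical.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs work() in a forked child and returns the
 * child's wait status, or -1 if fork or waitpid fails. */
int run_in_child(void (*work)(void))
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;

    if (pid == 0) {
        alarm(0);        /* child: turn off the inherited interval timer */
        if (work != NULL)
            work();      /* must not call the message passing library */
        _exit(0);        /* bypass the atexit routine inherited from the parent */
    }

    /* Parent: wait for the child, retrying if a signal interrupts the wait. */
    while (waitpid(pid, &status, 0) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return status;
}
```

The retry loop around waitpid() is there because, as noted earlier, the library's SIGALRM can interrupt long system calls in the parent.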
A user may initiate a checkpoint sequence from within a parallel MPI program by calling the MP_CHKPT function. All tasks in the parallel job must issue the call, which does not return until the checkpoint files have been created for all tasks. If the job subsequently fails and is restarted, the restart returns from the MP_CHKPT function with an indication that the parallel job has been restarted.
Programs using the signal handling (non-threaded) MPI library may be linked as a checkpointable executable, which is run as a LoadLeveler batch job. LoadLeveler Version 2.1 or later is required. Restrictions on the program follow:
Programming in a threaded environment requires specific skills and considerations. The information in this subsection provides you with specific programming considerations when using POE and the MPI threaded library. It assumes you are familiar with POSIX threads in general, including mutexes, thread condition waiting, thread-specific storage, and thread creation and termination.
POE sets up its environment via the entry point mp_main_r(). mp_main_r() sets up signal handlers, initializes VT, and sets up an atexit routine before calling your main program.
Note: In the threaded library, message passing initialization takes place when MPI_INIT is called and not by mp_main_r. The threaded library and the signal-handling library differ significantly in this regard.
The Fortran, C and C++ bindings for MPI are contained in the same library (libmpi_r.a) and can be freely intermixed.
Refer to "Fortran Considerations" for more information about running Fortran programs in a threaded environment.
The subset implementation of MPI-IO provided in the threaded library depends on all tasks running on a single file system. IBM Generalized Parallel File System (GPFS) is able to present a single file system to all nodes of an SP. Shared file systems (NFS and AFS, for example) do not have the same rigorous management of file consistency when updates occur from more than one node.
MPI-IO can be used with most file systems as long as all tasks are on a single node. This single node approach may be useful in learning to use MPI-IO, but is not likely to be worthwhile in any production context.
Any production use of MPI-IO must be based on GPFS.
The threaded POE run-time environment creates a thread to handle the following asynchronous signals:
A user signal handler must not be invoked to handle the above signals, which are handled by sigwait.
The following signals, which are used by MPI in the non-threaded library, are handled as described below.
The threaded library does not use SIGALRM and long system calls such as sleep are not interrupted by the message passing library. For example, sleep runs its entire duration unless interrupted by a user-generated event.
PE blocks SIGIO before calling your program. SIGIO is used in the IP version of the library to notify you of an I/O event or the arrival of a message packet. This notification is enabled via the environment variable MP_CSS_INTERRUPT. If this environment variable is set to YES, the message packet arrival dispatches the interrupt service thread to process the packet.
The User Space version of the library receives notification of an arriving packet via an AIX kernel event and does not use SIGIO. You may unblock it or use sigwait to process SIGIO signals.
If you've registered a signal handler (via sigaction) for SIGIO before MPI_INIT is called, the function is added to the interrupt service thread and is executed each time the service thread is dispatched. Although registered as a signal handler, the function is not required to be signal safe because it is executed on a thread. You can use pthread calls to communicate with other threads. You cannot call MPI functions in this handler.
After MPI_FINALIZE is called, your signal handler is restored but you need to unblock SIGIO in order to receive subsequent SIGIO signals.
If you register or change the SIGIO signal handler after calling MPI_INIT, your changes are ignored by the MPI library but your changes are not undone by MPI_FINALIZE.
Neither the threaded nor the non-threaded IP library uses SIGPIPE. The threaded User Space library polls a variable set by the AIX kernel to determine if the switch has faulted and needs to be restarted. As a result, it does not use SIGPIPE.
The main thread stacksize is the same as the stacksize used for non-threaded applications. If you write your own MPI reduce functions to use with nonblocking collective communications or a SIGIO handler that will be executed on one of the library service threads, you are limited to a stacksize of 96KB by default. To increase your thread stacksize, use the environment variable MP_THREAD_STACKSIZE. For more information about the default and your ability to change the default, see the manpage for AIX_PTHREAD_SET_STACKSIZE.
If a task forks, only the thread that forked exists in the child task. Therefore, the message passing library will not operate properly. Also, if the forked child does not exec another program, it should be aware that an atexit routine has been registered for the parent which is also inherited by the child. In most cases, the atexit routine requests that POE terminate the task (parent). A forked child should terminate with an _exit(0) system call to prevent the atexit routine from being called. Also, if the forked parent terminates before the child, the child task will not be cleaned up by POE.
A forked child MUST NOT call the message passing library.
When your program runs on the remote nodes, it has no controlling terminal. STDIN, STDOUT, and STDERR are always piped.
If your threaded MPI program processes STDIN from a large file on the home node, you must do one of the following:
This also includes programs which may not explicitly use MPI.
If STDIN is piped (or redirected) to the poe binary (via ordinary pipes) and your application is linked with the threaded library, then handle STDIN in the following way:
AIX provides thread-safe versions of some libraries, such as libc_r.a. However, not all libraries have a thread-safe version. It is your responsibility to determine whether the libraries you use can be safely called by more than one thread.
MPI_FINALIZE terminates the MPI service threads but does not affect user-created threads. Use pthread_exit to terminate any user-created threads, and exit(m) to terminate the main program (initial thread). The value of m is used to set POE's exit status as explained on "Exit Status".
For threaded programs, AIX requires that the system include <pthread.h> come first, with <stdio.h> and other system includes following it. <pthread.h> defines some conditional compile variables that modify the code generation of subsequent includes, particularly <stdio.h>. Please note that <pthread.h> is not required unless your file uses thread-related calls or data.
Call MPI_INIT once per task, not once per thread. MPI_INIT does not have to be called on the main thread, but MPI_INIT and MPI_FINALIZE must be called on the same thread.
MPI calls on other threads must adhere to the MPI standard in regard to the following:
Collective communications must meet the MPI standard requirement that all participating tasks execute collective communications on any given communicator in the same order. If collective communications calls are made on multiple threads, it is your responsibility to ensure the proper sequencing or to use distinct communicators.
By default, user threads are created with process contention scope, and M user threads are mapped to N kernel threads. The values of the ratio M:N and the default contention scope are settable by AIX environment variables. The service threads created by MPI, POE, and LAPI have system contention scope, that is, they are mapped 1:1 to kernel threads.
For PSSP 2.3 and 2.4, you must create system contention scope threads. For PSSP 3.1, you can create process contention scope threads, but any such thread will be converted to a system contention scope thread when it makes its first MPI call.
The information in this subsection provides you with some specific programming considerations for when you are using POE and the Fortran compiler.
Incompatibilities exist between Fortran 90 and MPI which may affect the ability to use such programs. Refer to the information in
for further details. PE, Version 2, Release 2 provided the header file mpif90.h for use with Fortran 90. The file is still available in PE, Version 2, Release 4, but should not be used by new code. The mpif.h header file is formatted to work with either mpxlf90 or mpxlf compilation.
Version 5 of the AIX XLF Fortran compiler supports threads.
Version 4.1 of the AIX XLF Fortran compiler is not thread-safe. However, XLF Version 18.104.22.168 provides a partial thread-support XLF runtime library. It supports multi-threaded applications that have one Fortran thread. Be sure you thoroughly test such use.
The partial thread-support library is libxlf90_t.a and is installed as /usr/lib/libxlf90_t.a. When you use the mpxlf_r command, this library is included automatically.
When you use libxlf90_t.a the following restrictions apply. Therefore, only one Fortran thread in a multi-threaded application may use the library.