
Hitchhiker's Guide

Appendix D. Parallel Environment Internals

This appendix provides some additional information about how the IBM Parallel Environment for AIX (PE) works, with respect to the user's application. Much of this information is also explained in IBM Parallel Environment for AIX: MPI Programming and Subroutine Reference.

What Happens When I Compile My Applications?

In order to run your program in parallel, you first need to compile your application source code with one of the following scripts:

  1. mpcc

  2. mpcc_r

  3. mpcc_chkpt

  4. mpCC

  5. mpCC_r

  6. mpCC_chkpt

  7. mpxlf

  8. mpxlf_r

  9. mpxlf_chkpt

To make sure that parallel execution works, these scripts add POE support to your application executable, including the POE entry point and the signal handlers described later in this appendix.

The compile scripts dynamically link the Message Passing library interfaces in such a way that the specific communication library that is used is determined when your application executes.

If you create a static executable, the application executable and the message passing libraries are statically bound together.

How Do My Applications Start?

Because POE adds its entry point to each application executable, user applications do not need to be run under the poe command. When a parallel application is invoked directly, rather than under the control of the poe command, POE is started automatically. POE then sets up the parallel execution environment and re-invokes the application on each of the remote nodes.

Serial applications can be run in parallel only using the poe command. However, such applications cannot take advantage of the function and performance provided with the Message Passing libraries.

How Does POE Talk to the Nodes?

A parallel job running under POE consists of a home node (where POE was started) and n tasks, each running under the control of its own Partition Manager daemon (pmd). When a parallel job is started, POE contacts the nodes assigned to run the job (called remote nodes), and starts a pmd instance for each task. POE sends the pmd daemon the environment information for the parallel job (including the name of the executable), and the pmd daemon spawns a process to run the executable. The spawned process has standard I/O redirected to socket connections back to the pmd daemon, so any output the application writes to STDOUT or STDERR is sent back to the pmd daemon. The pmd daemon, in turn, sends the output back to POE via another socket connection, and POE writes the output to its own STDOUT or STDERR. Any input that POE receives on STDIN is delivered to the remote tasks in a similar fashion.
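The relay described above can be illustrated with a small, self-contained Python sketch. It is not POE or pmd code: the names run_task and run_job are hypothetical, a local socket pair stands in for the socket connections between the spawned process and the pmd daemon, and the tasks run one after another on a single POSIX machine rather than in parallel on remote nodes.

```python
import socket
import subprocess
import sys


def run_task(task_id, argv):
    """Spawn one task with stdout/stderr redirected to a socket and relay its output.

    Hypothetical stand-in for the pmd role: the child writes to one end of a
    socket pair, and this function reads the other end line by line.
    """
    parent_sock, child_sock = socket.socketpair()
    child = subprocess.Popen(
        argv,
        stdout=child_sock.fileno(),  # child's STDOUT goes to the socket
        stderr=child_sock.fileno(),  # child's STDERR shares the same socket here
    )
    child_sock.close()  # the child holds its own duplicate of this end

    relayed = []
    with parent_sock.makefile("r") as stream:
        for line in stream:  # ends when the child closes its end of the socket
            relayed.append(f"{task_id}: {line.rstrip()}")
    parent_sock.close()
    return relayed, child.wait()


def run_job(argv, ntasks):
    """Home-node stand-in: start ntasks copies and gather their relayed output."""
    output, statuses = [], []
    for task_id in range(ntasks):
        lines, status = run_task(task_id, argv)
        output.extend(lines)
        statuses.append(status)
    return output, statuses


if __name__ == "__main__":
    command = [sys.executable, "-c", "print('hello from a task')"]
    lines, statuses = run_job(command, ntasks=2)
    for line in lines:
        print(line)
```

Tagging each relayed line with the task number is a choice made for this sketch so that the source of each line stays visible once the output of several tasks is combined.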

The socket connections between POE and the pmd daemons are also used to exchange control messages for providing task synchronization, exit status, and signalling. These capabilities are available to control any parallel program run by POE, and they don't depend on the Message Passing library.

How are Signals Handled?

POE installs signal handlers for most signals that cause program termination and interrupts, in order to control and notify all tasks of the signal. When such a signal is handled, POE exits the program normally with a code of (128 + signal), where signal is the signal number. If the user installs a signal handler for any of the signals POE supports, the user's handler should call the POE registered signal handler if the process decides to terminate. Appendix G of IBM Parallel Environment for AIX: MPI Programming and Subroutine Reference explains signal handling in greater detail.
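As an illustration of the two points above, the following Python sketch computes the (128 + signal) exit code and chains a user handler to whichever handler was registered before it. The function names are hypothetical, and the previously registered handler only stands in for the one POE installs; a real PE application would do the equivalent in C or Fortran.

```python
import signal


def poe_style_exit_code(signum):
    """Exit code for termination by a signal: 128 plus the signal number."""
    return 128 + int(signum)


def install_chained_handler(signum, cleanup):
    """Install a user handler that runs cleanup, then calls the previous handler.

    The previous handler plays the part of the POE registered handler
    (an assumption made for this sketch).
    """
    previous = signal.getsignal(signum)

    def handler(received, frame):
        cleanup()  # user-specific work first
        if callable(previous):  # SIG_DFL and SIG_IGN are not callable
            previous(received, frame)

    signal.signal(signum, handler)
    return handler


if __name__ == "__main__":
    print(poe_style_exit_code(signal.SIGTERM))
```

Saving the previous handler before installing a new one is what lets the user's handler pass control on once it decides that the process should terminate.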

What Happens When My Application Ends?

POE returns an exit status (a return code value between 0 and 255) on the home node, which reflects the composite exit status of the user application. There are various conditions and values with specific meanings associated with the exit status. These are explained in Appendix G of IBM Parallel Environment for AIX: MPI Programming and Subroutine Reference.

In addition, if the POE job-step function is used, the job control mechanism is the program's exit code. When the task exit code is 0 (zero) or in the range of 2 to 127, the job-step will be continued. If the task exit code is 1 or greater than 127, POE terminates the parallel job, as well as any remaining user programs in the job-step list. Also, any POE infrastructure failure detected (such as failure to open pipes to the child process) will terminate the parallel job as well as any remaining programs in the job-step list.
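The job-step rule above can be written down as a short Python sketch. The function names and the example step list are hypothetical; only the exit code ranges come from the text, and POE infrastructure failures are not modeled.

```python
def job_step_continues(exit_code):
    """Return True if the job-step list continues after this task exit code.

    Continue on 0 or on 2 through 127; terminate on 1 or on anything above 127.
    """
    return exit_code == 0 or 2 <= exit_code <= 127


def run_job_steps(steps):
    """Walk (name, exit_code) pairs in order and stop after a terminating code.

    Returns the names of the steps that were run.
    """
    ran = []
    for name, exit_code in steps:
        ran.append(name)
        if not job_step_continues(exit_code):
            break  # remaining programs in the job-step list are skipped
    return ran


if __name__ == "__main__":
    steps = [("prepare", 0), ("solve", 2), ("report", 1), ("cleanup", 0)]
    print(run_job_steps(steps))
```

In this example the list stops after the third step, because an exit code of 1 terminates the job and the remaining step is never started.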
