IBM Books

Messages


Chapter 4. POE Messages

0031-001 No man page available for poe

Explanation: The user requested that the poe man page be displayed (via the -h option), but the /usr/man/cat1/poe.1 file does not exist, or some directory in the path leading to the file is not searchable.

User Response: Check that the file exists and that all directories in the path leading to the file are searchable. The pedocs fileset may need to be installed if the file doesn't exist.
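
The check described above can be sketched in Python. This is a minimal illustration, not part of POE: the function name path_problems is hypothetical, and it only tests that each directory leading to the file is searchable (has execute permission for the current user) and that the file itself exists.

```python
import os

def path_problems(path):
    """Return a list of human-readable problems found along `path`."""
    problems = []
    path = os.path.abspath(path)
    # Collect every directory from the file's parent up to the root.
    parts = []
    current = os.path.dirname(path)
    while True:
        parts.append(current)
        parent = os.path.dirname(current)
        if parent == current:
            break
        current = parent
    # Check from the root downward so the first failing directory is reported.
    for directory in reversed(parts):
        if not os.path.isdir(directory):
            problems.append(f"{directory}: not a directory or does not exist")
            break
        if not os.access(directory, os.X_OK):
            problems.append(f"{directory}: not searchable (no execute permission)")
            break
    if not problems and not os.path.isfile(path):
        problems.append(f"{path}: file does not exist")
    return problems

if __name__ == "__main__":
    for problem in path_problems("/usr/man/cat1/poe.1"):
        print(problem)
```

An empty result means the file and every directory on the way to it passed both checks.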

0031-002 Error initializing communication subsystem.

Explanation: The remote node terminates.

User Response: Probable PE error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-007 Error initializing communication subsystem: return code number

Explanation: The remote node was unable to initialize its communication subsystem. This message gives the return code from the function called. The remote node terminates.

User Response: Verify that the communication subsystem is running properly. If the SP Switch is being used, make sure that the system software is operational.

0031-009 Couldn't integrate VT traces

Explanation: The system continues, but trace files for the Visualization Tool (VT) were not integrated.

User Response: Check that sufficient space is available for the VT trace files on each remote node.

0031-011 tcp service string unknown

Explanation: The Partition Manager terminates.

User Response: The PM daemon, pmd, is not known to the system. Review the results of installation to ensure that the daemon specified by inetd can be started on each remote node.

0031-012 pm_contact: socket

Explanation: The Partition Manager terminates, as it could not create a socket.

User Response: The message is followed by an explanatory sentence. Check that the number of sockets required does not exceed the number available.

0031-013 pm_contact: setsockopt

Explanation: The Partition Manager continues, but some socket options may not be set correctly.

User Response: Probable PE error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-018 Couldn't get info for hostname string

Explanation: The Partition Manager terminates.

User Response: The name mentioned cannot be identified. Check that the host name is spelled correctly and is known by name to the node on which the Partition Manager is running. If hostname is blank, allocation has failed.
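
One way to check host names before starting a job is sketched below in Python. It is an illustration only: the function name unresolvable_hosts and its resolve parameter are hypothetical, and the default lookup uses the standard socket.gethostbyname call, which may not match how the Partition Manager itself resolves names.

```python
import socket

def unresolvable_hosts(hostnames, resolve=socket.gethostbyname):
    """Return the names from `hostnames` that cannot be resolved."""
    failed = []
    for name in hostnames:
        if not name.strip():
            # A blank name cannot be looked up; the text above ties a blank
            # host name to a failed allocation.
            failed.append(name)
            continue
        try:
            resolve(name)
        except OSError:
            # socket.gaierror and socket.herror are subclasses of OSError.
            failed.append(name)
    return failed
```

Running it on the node where the Partition Manager runs shows which names that node cannot identify.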

0031-019 pm_contact: connect failed

Explanation: The Partition Manager terminates.

User Response: The Partition Manager is unable to connect to a remote node. Message 0031-020 follows. Probable PE system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-020 Couldn't connect to task number (string)

Explanation: The Partition Manager terminates.

User Response: For the indicated remote task number and indicated host name, socket connection could not be established. Check for valid names.

0031-022 setsockopt(SO_LINGER)

Explanation: The Partition Manager continues.

User Response: An error occurred in setting the LINGER socket option. Gather information about the problem and follow local site procedures for reporting hardware and software problems.
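
For readers unfamiliar with the option, the following Python sketch shows what setting SO_LINGER on a socket looks like and how a failure of the call can be detected. It assumes the common layout of struct linger (two C ints); the function name set_linger is hypothetical and the code does not reproduce the Partition Manager's own logic.

```python
import socket
import struct

def set_linger(sock, seconds):
    """Try to enable SO_LINGER; return True on success, False if the call fails."""
    # Assumed layout of struct linger: l_onoff followed by l_linger, both ints.
    value = struct.pack("ii", 1, seconds)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, value)
    except OSError as err:
        # The caller can continue, as the Partition Manager does, but the
        # option is not in effect.
        print(f"setsockopt(SO_LINGER): {err}")
        return False
    return True
```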

0031-024 string: no response; rc = number

Explanation: The Partition Manager terminates.

User Response: No acknowledgement of startup was received from the pmd daemon running on the indicated node. Check for error messages from that node. The return codes are: -1, EOF on connection; 1, I/O error; 2, allocation error.
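
When scanning job output for this message, the return codes listed above can be translated with a small lookup table. The Python sketch below is illustrative; the names PMD_STARTUP_RC and describe_rc are hypothetical, and only the three codes given in this entry are included.

```python
# Return codes documented for message 0031-024.
PMD_STARTUP_RC = {
    -1: "EOF on connection",
    1: "I/O error",
    2: "allocation error",
}

def describe_rc(rc):
    """Return the documented meaning of `rc`, or a note that it is not listed."""
    return PMD_STARTUP_RC.get(rc, f"unrecognized return code {rc}")
```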

0031-025 unexpected acknowledgment of type string from remote node

Explanation: The Partition Manager received an unexpected acknowledgement during initialization. Initialization with a remote node has failed.

User Response: Check the remote node log file to determine the reason for failure. Probable PE error.

0031-026 Couldn't create socket for PM Array

Explanation: The Partition Manager terminates. An explanatory sentence follows.

User Response: Probable PE error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-027 Write to PM Array

Explanation: The Partition Manager continues. An explanatory sentence follows.

User Response: Probable PE error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-028 pm_mgr_handle; can't send a signal message to remote nodes

Explanation: The Partition Manager terminates. An explanatory sentence follows.

User Response: Probable PE error. This error has occurred in the Partition Manager signal handler. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-029 Caught signal number (string), sending to tasks...

Explanation: The indicated signal is not used specifically by Partition Manager, and is being passed on to each remote task.

User Response: Verify that the signal was intended.

0031-031 task number is alive

Explanation: The message is sent from the indicated task in response to signal SIGUSR2.

User Response: Verify that the signal was intended.

0031-032 exiting...

Explanation: The message is sent from the indicated task in response to signal SIGINT, and the remote node is exiting.

User Response: Verify that the signal was intended.

0031-033 Your application has forced paging space to be exceeded...bailing out.

Explanation: The remote node exits with signal SIGDANGER. The message is sent from the indicated task in response to signal SIGDANGER. AIX is running out of paging space.

0031-034 task signal number: string

Explanation: The message is sent from the indicated task in response to the indicated signal, which is not handled explicitly by the Partition Manager.

User Response: Verify that the signal was intended.

0031-036 sigaction(SIGHUP)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-037 sigaction(SIGINT)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-038 sigaction(SIGQUIT)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-039 sigaction(SIGILL)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-040 sigaction(SIGTRAP)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-041 sigaction(SIGIOT)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-042 sigaction(SIGEMT)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-043 sigaction(SIGFPE)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-044 sigaction(SIGBUS)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-045 sigaction(SIGSEGV)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-046 sigaction(SIGSYS)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-047 sigaction(SIGPIPE)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-048 sigaction(SIGALRM)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-049 sigaction(SIGTERM)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-050 sigaction(SIGURG)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-051 sigaction(SIGTSTP)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-052 sigaction(SIGCONT)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-053 sigaction(SIGCHLD)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-054 sigaction(SIGTTOU)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-055 sigaction(SIGIO)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-056 sigaction(SIGXCPU)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-057 sigaction(SIGMSG)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-058 sigaction(SIGWINCH)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-059 sigaction(SIGPWR)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-060 sigaction(SIGUSR1)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-061 sigaction(SIGUSR2)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-062 sigaction(SIGPROF)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-063 sigaction(SIGDANGER)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-064 sigaction(SIGVTALRM)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-065 sigaction(SIGMIGRATE)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-066 sigaction(SIGPRE)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-067 sigaction(SIGGRANT)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-068 sigaction(SIGRETRACT)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-069 sigaction(SIGSOUND)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-070 sigaction(SIGSAK)

Explanation: An explanatory sentence follows. The Partition Manager terminates.

Cause: The return from sigaction for the indicated signal is negative.

0031-071 invalid number of procs entered

Explanation: The Partition Manager terminates. Incorrect number of tasks specified.

User Response: Enter a number from 1 to the maximum number of tasks to be run.

0031-076 invalid infolevel

Explanation: The -infolevel option was neither a 0 nor a positive number.

User Response: Correct the flag.

0031-077 invalid tracelevel

Explanation: The -tracelevel option was neither a 0 nor a positive number.

User Response: Correct the flag.

0031-078 invalid retrytime

Explanation: The -retrytime option was neither a 0 nor a positive number.

User Response: Correct the flag.

0031-079 invalid pmlights

Explanation: The -pmlights option was neither a 0 nor a positive number.

User Response: Correct the flag.

0031-080 invalid usrport

Explanation: The -usrport option was neither a 0 nor a positive number less than 32768.

User Response: Correct the flag.
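
The numeric checks described for these options can be sketched as follows. This Python fragment is an illustration under stated assumptions: the function name parse_bounded_int is hypothetical, and it accepts only plain decimal digits, which may be stricter or looser than the parsing POE actually performs.

```python
def parse_bounded_int(text, upper=None):
    """Return the value if `text` is 0 or a positive integer (and below
    `upper`, when an upper bound is given); otherwise return None."""
    text = text.strip()
    # Rejects empty strings, signs, decimals, and non-numeric text.
    if not (text.isascii() and text.isdigit()):
        return None
    value = int(text)
    if upper is not None and value >= upper:
        return None
    return value
```

For example, a -usrport value would be checked with parse_bounded_int(value, upper=32768), while options described only as "0 or a positive number" would be checked without an upper bound.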

0031-081 invalid samplefreq

Explanation: The -samplefreq option was neither a 0 nor a positive number.

User Response: Correct the flag.

0031-083 invalid wrap trace buffer

Explanation: The -tbuffwrap option was neither YES nor NO.

User Response: Correct the flag.

0031-084 invalid tbuffsize value

Explanation: The -tbuffsize option specifies too large a trace buffer (or an invalid number).

User Response: Correct the flag.

0031-085 invalid tbuffsize unit

Explanation: The -tbuffsize option is not of the form numberK or numberM.

User Response: Correct the flag.
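
The numberK or numberM form can be checked with a short pattern match. The Python sketch below is illustrative only: the function name parse_size is hypothetical, accepting lowercase unit letters is an assumption, and no maximum size is enforced because this chapter does not state the limit.

```python
import re

def parse_size(text, units):
    """Split a value such as "64K" into (number, unit).

    `units` is a string of allowed unit letters, for example "KM".
    Returns None when the value is not of the form number + unit.
    """
    match = re.fullmatch(r"([0-9]+)([A-Za-z])", text.strip())
    if match is None:
        return None
    unit = match.group(2).upper()  # assumption: unit letters are not case sensitive
    if unit not in units:
        return None
    return int(match.group(1)), unit
```

The same helper fits the -ttempsize and -tpermsize options, which take the numberM or numberG form, by passing "MG" as the allowed units.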

0031-086 invalid ttempsize value

Explanation: The -ttempsize option specifies too large a file (or an invalid number).

User Response: Correct the flag.

0031-087 invalid ttempsize unit

Explanation: The -ttempsize option is not of the form numberM or numberG.

User Response: Correct the flag.

0031-088 invalid tpermsize value

Explanation: The -tpermsize option specifies too large a file (or an invalid number).

User Response: Correct the flag.

0031-089 invalid tpermsize unit

Explanation: The -tpermsize option is not of the form numberM or numberG.

User Response: Correct the flag.

0031-092 MP_PROCS not set correctly

Explanation: The MP_PROCS environment variable is not a positive number.

User Response: Correct the variable.

0031-093 MP_INFOLEVEL not set correctly

Explanation: The MP_INFOLEVEL environment variable is neither 0 nor a positive number less than 32768.

User Response: Correct the variable.

0031-094 MP_TRACELEVEL not set correctly

Explanation: The MP_TRACELEVEL environment variable is neither 0 nor a positive number less than 32768.

User Response: Correct the variable.

0031-095 MP_RETRY not set correctly

Explanation: The MP_RETRY environment variable is neither 0 nor a positive number less than 32768.

User Response: Correct the variable.

0031-096 MP_PMLIGHTS not set correctly

Explanation: The MP_PMLIGHTS environment variable is neither 0 nor a positive number.

User Response: Correct the variable.

0031-097 MP_USRPORT not set correctly

Explanation: The MP_USRPORT environment variable is neither 0 nor a positive number less than 32768.

User Response: Correct the variable.

0031-098 MP_SAMPLEFREQ not set correctly

Explanation: The MP_SAMPLEFREQ environment variable is neither 0 nor a positive number less than 32768.

User Response: Correct the variable.

0031-100 MP_TBUFFWRAP not set correctly

Explanation: The MP_TBUFFWRAP environment variable is neither YES nor NO.

User Response: Correct the variable.

0031-101 Invalid MP_TBUFFSIZE

Explanation: The MP_TBUFFSIZE environment variable specifies too large a trace buffer (or an invalid number).

User Response: Reduce or correct the size.

0031-102 Incorrect MP_TBUFFSIZE unit

Explanation: The MP_TBUFFSIZE environment variable is not of the form numberK or numberM.

User Response: Correct the variable.

0031-103 Invalid MP_TTEMPSIZE

Explanation: The MP_TTEMPSIZE environment variable specifies too large a trace file (or an invalid number).

User Response: Reduce or correct the size.

0031-104 Incorrect MP_TTEMPSIZE unit

Explanation: The MP_TTEMPSIZE environment variable is not of the form numberM or numberG.

User Response: Correct the variable.

0031-105 Invalid MP_TPERMSIZE

Explanation: The MP_TPERMSIZE environment variable specifies too large a trace file (or an invalid number).

User Response: Reduce or correct the size.

0031-106 Incorrect MP_TPERMSIZE unit

Explanation: The MP_TPERMSIZE environment variable is not of the form numberM or numberG.

User Response: Correct the variable.

0031-110 pm: atexit

Explanation: The user exit handler could not be installed.

Cause: Probable PE error.

User Response: Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-113 Stopping the job. Wait 5 seconds for remotes.

Explanation: The Partition Manager has received a SIGTSTP signal from the LoadLeveler program or the <Ctrl-Z> keyboard interrupt and is preparing to stop the job on all the remote nodes. The system will then issue a stop message giving the task number for the job as it would for any <Ctrl-Z> keyboard interrupt.

User Response: Wait for the stop confirmation message. To continue the job in the foreground, type in fg tasknumber after you receive the stop message. To continue the job in the background, type in bg tasknumber after you receive the stop message.

0031-115 invalid resd option.

Explanation: The specification of the -resd option was neither YES nor NO.

User Response: Correct the specification.

0031-116 MP_RESD not set correctly.

Explanation: The specification of MP_RESD was neither YES nor NO.

User Response: Correct the specification of MP_RESD.

0031-117 Unable to contact Resource Manager

Explanation: The Partition Manager was unable to contact the Resource Manager to allocate nodes of the SP.

User Response: Check that the Resource Manager is running. Otherwise, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-118 string string requested for task number

Explanation: The named host or pool was requested from LoadLeveler or the Resource Manager for the indicated task number. This informational message is issued when a host list file is read for node allocation.

0031-119 Host string allocated for task number

Explanation: The named host was allocated by LoadLeveler or the Resource Manager for the indicated task number. This informational message is issued when the implicit node allocation is used.

0031-120 Resource Manager unable to allocate nodes due to internal error

Explanation: A system or socket error occurred when the Resource Manager client attempted to contact the server to request nodes. This is most often caused by loss of the connection between client and server under heavy network load. A 0023 jm message, which may provide more specific information about the problem, is often printed before this message.

User Response: Retry the job; this should correct the problem if the network was temporarily overloaded. If this fails, contact your system administrator to determine whether the network is in a stable state before retrying. Otherwise, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-121 Invalid combination of settings for MP_EUILIB, MP_HOSTFILE, and MP_RESD

Explanation: The execution environment could not be established based on these settings.

User Response: See IBM Parallel Environment for AIX: Operation and Use, Volume 1 for valid combinations of these settings.

0031-123 Retrying allocation .... press control-C to terminate

Explanation: The requested nodes were not available from the Resource Manager. However, since the retry option was specified (by either the MP_RETRY environment variable or the -retry command-line flag), the Partition Manager will continue requesting nodes at the specified delay interval.

User Response: To terminate the allocation request, press <Ctrl-C>.
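
The retry behavior described above amounts to a loop that repeats the request at a fixed interval until nodes are granted or the user interrupts it. The Python sketch below illustrates that pattern only: allocate_with_retry and its request_nodes callback are hypothetical names, and the real Partition Manager logic is not shown in this chapter.

```python
import time

def allocate_with_retry(request_nodes, delay_seconds, sleep=time.sleep):
    """Call request_nodes() until it returns a non-empty allocation.

    Returns the allocation, or None if the user interrupts the wait
    (a keyboard interrupt, as produced by <Ctrl-C>).
    """
    while True:
        try:
            nodes = request_nodes()
            if nodes:
                return nodes
            print("Retrying allocation .... press control-C to terminate")
            sleep(delay_seconds)
        except KeyboardInterrupt:
            return None
```

Here delay_seconds plays the role of the interval given by MP_RETRY or the -retry flag.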

0031-124 Fewer than number nodes available from string.

Explanation: The requested nodes were not available from LoadLeveler or the Resource Manager.

User Response: Check that you haven't specified a number of nodes greater than the number of physical compute nodes in your RS/6000 SP or RS/6000 network cluster. Otherwise, wait until later when the needed number of nodes is available. You might want to specify the retry option by either setting the MP_RETRY environment variable or using the -retry command line flag.

0031-125 Fewer nodes (number) specified in string than tasks (number).

Explanation: The number of tasks specified is larger than the number of nodes defined in the indicated host list file.

User Response: Check that you haven't specified a number of nodes greater than the number of physical compute nodes in your RS/6000 SP or RS/6000 network cluster. Otherwise, wait until later when the needed number of nodes is available. You might want to specify the retry option by either setting the MP_RETRY environment variable or using the -retry command line flag.

0031-126 Unable to read string for current directory

Explanation: The Partition Manager is unable to interpret the data from the pwd command.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-127 Executing with develop mode string

Explanation: A develop mode of value is currently active, which could significantly impact the performance of your program.

User Response: If you do not want or need the develop mode turned on, use the MP_EUIDEVELOP environment variable or the -euidevelop flag to set a value of no, normal, or minimum.

0031-128 Invalid euilib selected

Explanation: An euilib other than ip or us was entered.

User Response: Re-specify the euilib as either ip or us.

0031-129 Invalid euidevelop option, should be YES or NO

Explanation: An euidevelop other than YES or NO was entered.

User Response: Re-specify euidevelop with either YES or NO.

0031-130 Invalid newjob option, should be YES or NO

Explanation: A newjob other than YES or NO was entered.

User Response: Re-specify newjob with either YES or NO.

0031-131 Invalid pmdlog option, should be YES or NO

Explanation: A pmdlog other than YES or NO was entered.

User Response: Re-specify pmdlog with either YES or NO.

0031-133 Invalid stdoutmode

Explanation: A stdoutmode other than ORDERED, UNORDERED or an integer from 0 to (the number of tasks -1) was entered.

User Response: Re-specify stdoutmode with either ORDERED, UNORDERED or a number.

0031-134 Invalid mode for stdinmode

Explanation: A stdinmode other than ALL or an integer from 0 to (the number of tasks -1) was entered.

User Response: Re-specify stdinmode with either ALL or a number.
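
The rule shared by stdoutmode and stdinmode (a keyword, or a task number from 0 to the number of tasks minus 1) can be sketched in Python as shown below. The names are hypothetical, and treating the keywords as not case sensitive is an assumption for stdoutmode. The keyword sets follow the two explanations above; other entries in this chapter also mention "none" for stdinmode, which a caller could add to the set.

```python
STDOUT_KEYWORDS = {"ORDERED", "UNORDERED"}
STDIN_KEYWORDS = {"ALL"}

def valid_io_mode(value, keywords, num_tasks):
    """Return True if `value` is an allowed keyword or a task number
    in the range 0 .. num_tasks - 1."""
    value = value.strip()
    if value.upper() in keywords:  # assumption: keywords are not case sensitive
        return True
    if value.isascii() and value.isdigit():
        return int(value) < num_tasks
    return False
```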

0031-135 Invalid labelio option, should be YES or NO

Explanation: A labelio other than YES or NO was entered.

User Response: Re-specify labelio with either YES or NO.

0031-136 Invalid MP_NOARGLIST option, should be YES or NO

Explanation: The Partition Manager terminates.

User Response: Enter YES or NO for MP_NOARGLIST.

0031-137 poe: Internal Error: Could not broadcast ACK for string data

Explanation: An error occurred when POE was trying to acknowledge receipt of connect or finalize data from all nodes. Either one of the remote nodes is no longer accessible or a system error has occurred.

User Response: Verify that the remote nodes in the partition can be contacted by other means. If the problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-138 The following nodes may be causing connect failures during message passing initialization:

Explanation: The home node has gathered connect data from each of the remote nodes and has determined that one or more nodes were most often reported as not connectable during message passing initialization. A list of those nodes follows this message.

User Response: For jobs using the SP, contact your system administrator to determine if that node is up on the switch. For non-SP jobs, verify that the node can be contacted by other means. Also, refer to the node-specific error message related to mpci_connect for more information on what could be causing the problem (e.g. unauthorized user). If the problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-139 Could not open socket for debugger.

Explanation: The call to socket() failed when attempting to open a socket for the parallel debugger.

User Response: None.

0031-140 Could not bind local debug socket address.

Explanation: The call to bind() failed when attempting to bind the local address for the debug socket.

User Response: None.

0031-141 Could not accept debugger socket connection.

Explanation: The call to accept() failed when attempting to make a socket connection with the debugger.

User Response: None.

0031-142 Could not write to debug socket.

Explanation: The call to write() failed when attempting to write to the debug socket.

User Response: None.

0031-143 Could not read message from debug socket.

Explanation: The call to read() failed when attempting to read a message from the debug socket.

User Response: None.

0031-144 error creating directory for core files, reason: <string>

Explanation: A corefile directory could not be created for the given reason.

User Response: Correct the indicated problem and rerun the job.

0031-145 error changing to string corefile directory, reason <string>

Explanation: The core file could not be dumped to the named directory for the given reason.

User Response: Correct the indicated problem and rerun the job.

0031-146 MP_CMDFILE is ignored when MP_STDINMODE is set to none

Explanation: If you set the MP_STDINMODE environment variable or the -stdinmode option to "none", the MP_CMDFILE environment variable or the -cmdfile option is ignored.

User Response: To eliminate this WARNING message, remove the MP_CMDFILE setting or set MP_STDINMODE (which is not case sensitive) to a value other than "none".

0031-147 MP_HOLD_STDIN is ignored when MP_STDINMODE is set to none

Explanation: If you set the MP_STDINMODE environment variable or the -stdinmode option to "none", MP_HOLD_STDIN=yes is ignored.

User Response: To eliminate this WARNING message, you should set MP_HOLD_STDIN to "no".
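
The two warnings above depend on the same condition, so a job script can test for both before calling poe. The Python sketch below is illustrative: the function name stdin_none_warnings is hypothetical, settings are passed as a plain dictionary keyed by environment variable name, and comparing MP_HOLD_STDIN without regard to case is an assumption.

```python
def stdin_none_warnings(settings):
    """Return the warning texts that would apply to `settings`,
    a dict of environment variable names to string values."""
    warnings = []
    # MP_STDINMODE is documented as not case sensitive.
    if settings.get("MP_STDINMODE", "").lower() != "none":
        return warnings
    if settings.get("MP_CMDFILE"):
        warnings.append("0031-146 MP_CMDFILE is ignored when MP_STDINMODE is set to none")
    if settings.get("MP_HOLD_STDIN", "").lower() == "yes":
        warnings.append("0031-147 MP_HOLD_STDIN is ignored when MP_STDINMODE is set to none")
    return warnings
```

Passing dict(os.environ) checks the current environment; command-line flags would need to be merged in by the caller.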

0031-148 Using redirected STDIN for program name resolution

Explanation: You redirected stdin without specifying a program name or command file name, and you did not set the MP_STDINMODE environment variable or the -stdinmode option to "none". Because program behavior is undefined in this case, a warning is issued.

User Response: You should set the MP_STDINMODE environment variable or the -stdinmode option to "none". For more details, see "Managing Standard Input" in IBM Parallel Environment for AIX: Operation and Use, Volume 1.

0031-149 Unable to load shared objects required for LoadLeveler.

Explanation: You submitted a POE job via LoadLeveler and a shared object or library required for LoadLeveler does not exist. This error usually indicates files created at install time have been deleted, moved, or had their permissions changed.

One of the following files cannot be accessed:

/usr/lpp/LoadL/full/lib/llapi_shr.o
/usr/lpp/LoadL/full/lib/libllapi.a

User Response: Contact your system administrator to determine if the files described above are accessible, and correct if possible. Otherwise, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-150 Unable to load shared objects required for Resource Manager

Explanation: The execution environment specified use of the Resource Manager, but one or more of the following shared objects did not exist in /usr/lpp/ssp/lib or /usr/lib:

jm_client.shr.o
libjm_client.a
libSDRs.a

See IBM Parallel Environment for AIX: Operation and Use, Volume 1 for more information about the execution environment.

If you submitted the job from an SP node, this error usually indicates that files created during installation have been deleted, moved, or had their permissions changed. If you submitted the job from a non-SP node (using the SP_NAME environment variable), this error usually indicates that the ssp.client fileset has not been installed on the submitting node.

User Response: Contact your system administrator to determine if the files or fileset described above are accessible or installed. If possible, make the files or fileset accessible or reinstall them. Otherwise, you should gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-151 Pool specified in hostfile for task number not same as previous.

Explanation: Unless Resource Manager use is specified by setting the SP_NAME environment variable, pool entries in a hostfile must be the same for all tasks.

User Response: Modify the hostfile as described, or use the MP_RMPOOL environment variable or the -rmpool command line option. If Resource Manager use was intended, set SP_NAME to the name of the system partition in which to run the job.

0031-152 Ignoring adapter and/or CPU usage specification in hostfile.

Explanation: Unless Resource Manager use is specified by setting the SP_NAME environment variable, adapter and/or CPU usage specifications are ignored in the hostfile.

User Response: Remove the usage specifications from the hostfile to eliminate the warning messages, if desired. This will result in default usages as described in IBM Parallel Environment for AIX: Operation and Use, Volume 1. The MP_ADAPTER_USE and/or MP_CPU_USE environment variables or the associated command line options may be used to override the defaults. If Resource Manager use was intended, set SP_NAME to the name of the system partition in which to run the job.

0031-153 Unexpected return code number from ll_init_job.

Explanation: An internal error has occurred.

User Response: Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-154 Unexpected return code number from ll_parse_string.

Explanation: An internal error has occurred.

User Response: Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-155 Unexpected return code number from ll_set_data (number).

Explanation: An internal error has occurred.

User Response: Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-156 Unexpected return code number from ll_get_data (number).

Explanation: An internal error has occurred.

User Response: Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-157 Couldn't flush VT traces

Explanation: The program continues.

User Response: At termination, the Partition Manager was unable to successfully terminate trace processing for VT. Check any messages issued by VT.

0031-158 select

Explanation: An explanatory sentence follows. The Partition Manager terminates.

User Response: The select call to the sockets connecting the Partition Manager with the remote nodes failed. Presumably the connection has been lost. The explanatory sentence may give an indication of the source of the failure.

0031-160 I/O error on socket connection with task number

Explanation: An explanatory sentence follows. The Partition Manager continues. A read on the socket used to connect the Home Node with the indicated remote task failed. Probably the remote node has closed the connection. The task is marked as exited and processing continues.

User Response: Examine the communication subsystem for failure.

0031-161 EOF on socket connection with task number

Explanation: Processing continues. The socket used to connect the Home Node with the indicated remote task has closed. Probably the remote node has closed the connection.

User Response: Examine the communication subsystem for failure.

0031-164 process_io: read(io command)

Explanation: Processing continues. The command sent to the Partition Manager is ignored.

User Response: Probable system error. An incomplete or invalid I/O command was received by the Partition Manager.

0031-169 pm_remote_shutdown

Explanation: Processing continues. An explanatory sentence is appended.

User Response: A quit message being sent to all remote nodes could not be written to one of the sockets.

0031-171 unknown io command

Explanation: Processing continues. The data is ignored.

User Response: An unsupported or invalid I/O command code was received by the Partition Manager from a remote node.

0031-172 I/O buffer overflow

Explanation: The stdout or stderr string overflows the output buffer (8K). The excess is discarded.

User Response: Probable internal error. Normally, the output is automatically flushed if it exceeds the buffer length. Gather information about the problem and follow local site procedures for reporting hardware and software problems.
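
The discard behavior described for this message can be pictured with a minimal sketch. It assumes that "8K" means 8192 bytes; the function name clamp_output is hypothetical and is not part of POE.

```python
from typing import Tuple

# Assumption: "8K" in the message text means 8192 bytes.
BUFFER_LIMIT = 8 * 1024


def clamp_output(data: bytes, limit: int = BUFFER_LIMIT) -> Tuple[bytes, int]:
    """Return the part of data that fits in the buffer and the number of discarded bytes."""
    kept = data[:limit]
    return kept, len(data) - len(kept)


kept, dropped = clamp_output(b"x" * 9000)
print(len(kept), dropped)  # 8192 808
```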

0031-179 Unable to acknowledge profiling request for task number

Explanation: An error occurred writing a message to the indicated task, allowing it to begin to write the profiling data to disk.

User Response: Probable internal error. Verify that the indicated node is still connected in the partition. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-180 read(PROF)

Explanation: Processing continues. An explanatory sentence is appended.

User Response: The Partition Manager cannot read the remote node's response to a SIGUSR1 profiling signal.

0031-183 Connection to task number blocked. Task abandoned.

Explanation: While trying to stop the indicated task on a remote node, the Partition Manager discovered that the socket connection was blocked (unavailable). The remote task is marked as inactive and the Partition Manager continues.

User Response: Manual intervention may be required to kill the job on the remote node.

0031-200 pmd: getpeername <string>

Explanation: The daemon is unable to identify the partition manager.

User Response: Probable system or communication subsystem failure.

0031-201 pmd: setsockopt(SO_KEEPALIVE): <string>

Explanation: The daemon is unable to set the indicated socket option. An explanatory sentence is provided.

User Response: Probable system or communication subsystem failure.

0031-202 pmd: setsockopt(SO_LINGER): <string>

Explanation: The daemon is unable to set the indicated socket option. An explanatory sentence is provided.

User Response: Probable system or communication subsystem failure.

0031-203 malformed from address: <string>

Explanation: The socket address family is incorrect.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-207 pmd: sigaction <string>

Explanation: An error occurred when setting up to handle a signal.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-208 pmd: fork <string>

Explanation: The pm daemon is unable to fork to execute the user application.

User Response: Probable system error.

0031-212 pmd: node string: user string denied from access from host string

Explanation: The user is not permitted to run on the indicated node. The Partition Manager exits.

User Response: Make sure that the Partition Manager home node machine and user id are identified, for example, in $HOME/.rhosts or in /etc/hosts.equiv for this user on this machine. The access requirements are the same as for remote shell (rsh) access to the node.
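
As a rough aid for the check described above, the following sketch scans the text of a .rhosts-style file for a host and user pair. It is a simplification under stated assumptions: each entry is a host name optionally followed by a user name, a line with only a host name is treated as a match (assuming the user name is the same on both machines), and "+", "-", and netgroup entries are not interpreted. The function name rhosts_allows and the sample host names are hypothetical.

```python
def rhosts_allows(rhosts_text: str, host: str, user: str) -> bool:
    """Simplified check: True if a line names host and either names user or omits the user field."""
    for line in rhosts_text.splitlines():
        fields = line.split()
        # Skip blank lines and comment lines.
        if not fields or fields[0].startswith("#"):
            continue
        if fields[0] != host:
            continue
        if len(fields) == 1 or fields[1] == user:
            return True
    return False


sample = "homenode.example.com alice\n# comment\nothernode\n"
print(rhosts_allows(sample, "homenode.example.com", "alice"))  # True
print(rhosts_allows(sample, "homenode.example.com", "bob"))    # False
```

A result of False only means that no simple matching entry was found; the system's own remote shell access rules remain the authority.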

0031-213 pmd: setuid <number>

Explanation: The setuid function failed for a given userid.

User Response: Make sure that the user is known by the same number on all systems.

0031-214 pmd: chdir <string>

Explanation: An attempt to change to the indicated directory failed.

User Response: Make sure that the directory exists. Check to see that the indicated directory can be properly mounted by the Automount daemon, if the directory is part of a mounted file system. To change the name of the directory to be mounted, set the environment variable MP_REMOTEDIR=some_script, where some_script is the name of a script or quoted command that echoes a directory name. For example, MP_REMOTEDIR='echo /tmp' will request that /tmp be the current directory on the remote nodes.

For non-Korn shell users, the script mpamddir in /usr/lpp/poe/bin may provide a usable name. It tries to match the entries in the Automount list with the user's directory as reported by the pwd command.

If the directory is from a DFS file system, the DFS/DCE credentials may not have been properly established with the poeauth command.
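
Before setting MP_REMOTEDIR, it can help to preview what a candidate command actually echoes. The sketch below runs the command through the local shell and returns its output. It only illustrates the "script or quoted command that echoes a directory name" idea; how POE itself evaluates MP_REMOTEDIR is not shown here, and the function name resolve_remote_dir is hypothetical.

```python
import subprocess


def resolve_remote_dir(command: str) -> str:
    """Run command through the shell and return the directory name it echoes."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


print(resolve_remote_dir("echo /tmp"))  # /tmp
```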

0031-215 can't run parallel tasks as root

Explanation: The userid of the user running the application cannot be 0.

User Response: Re-run under a userid other than root.

0031-216 POE (number) - pmd (number) - user program (number) versions incompatible

Explanation: The versions of the POE home node, the pmd and the user's program are incompatible.

User Response: You should first check that the POE home node, the pmd, and libmpi.a are at compatible PE version levels. If necessary, install compatible versions. You should also check that the user program has been compiled with a version of PE that is compatible with the version of PE on the home node and the pmd. If necessary, recompile the user program using compatible POE home node and pmd versions.

0031-217 POE (number), pmd (number), and dbe (number) versions are incompatible.

Explanation: The versions of POE, pmd, and the debug engine (dbe) are incompatible.

User Response: You should check that POE, pmd, and dbe are at compatible PE version levels. If necessary, install compatible versions.

0031-218 Partition manager daemon not started by LoadLeveler on node string.

Explanation: The daemon on the indicated node was not started by LoadLeveler, and an entry in the /etc/poe.limits file on that node specified that LoadLeveler must be used to start the daemon.

User Response: Set up the Execution Environment (see IBM Parallel Environment for AIX Operation and Use, Volume 1) so that LoadLeveler will be used, or contact the System Administrator to determine if use of the MP_USE_LL keyword in the /etc/poe.limits file was intended.

0031-235 invalid userid received

Explanation: The userid is not valid on this node.

User Response: Run under a valid userid.

0031-237 invalid group id received

Explanation: The group id received by the pm daemon is either negative or non-numeric and therefore not valid.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.
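
The validity rule stated in the explanation (a value that is negative or non-numeric is rejected) can be sketched as follows. This is an illustration of the rule only, not the pm daemon's actual code; the function name is_valid_id is hypothetical.

```python
def is_valid_id(text: str) -> bool:
    """True only for a non-empty string of ASCII decimal digits (a non-negative integer)."""
    return text.isascii() and text.isdigit()


print(is_valid_id("100"))  # True
print(is_valid_id("-5"))   # False
print(is_valid_id("12a"))  # False
```

The same kind of check would apply to the environment length described under message 0031-243.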

0031-243 invalid environment length received

Explanation: The length received by the pm daemon is either negative or non-numeric and therefore not valid.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-247 pmd: setgid <number>

Explanation: The pmd was unable to set the groupid for the remote task.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-250 task number: string

Explanation: The given task has received the given signal.

User Response: No response needed.

0031-251 task number exited: rc=number

Explanation: The indicated task has exited. This informational message is displayed when processing completes normally and when the job is terminated by the <Ctrl-C> interrupt key.

User Response: No response needed.
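
When job output is post-processed, the task number and return code can be pulled out of this message. The sketch below assumes the rendered line looks like "0031-251 task 3 exited: rc=0", possibly with a prefix in front of the message number; that rendered form and the function name parse_task_exit are assumptions, not taken from the POE source.

```python
import re
from typing import Optional, Tuple

# Assumed rendered form: "0031-251 task 3 exited: rc=0"
EXIT_RE = re.compile(r"0031-251\s+task (\d+) exited: rc=(-?\d+)")


def parse_task_exit(line: str) -> Optional[Tuple[int, int]]:
    """Return (task, rc) if the line carries message 0031-251, else None."""
    match = EXIT_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_task_exit("0031-251 task 3 exited: rc=0"))  # (3, 0)
```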

0031-252 task number stopped: string

Explanation: The indicated task has been stopped. The second variable in this message indicates the signal that stopped the task.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-253 Priority adjustment call failed: rc = number, errno = number

Explanation: The call to start the priority adjustment process failed. Check that the priority adjustment program is executable. Execution continues, but no priority adjustment is applied to this process. The return code and errno reported relate to the system function.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-254 User string not authorized on host string

Explanation: The userid is not found on the given host.

User Response: Add the userid to the host with SMIT.

0031-255 Group string does not exist on host string

Explanation: The group id is not found in /etc/group.

User Response: Add the groupid to the host with SMIT.

0031-256 Priority adjustment process has been invoked

Explanation: The user has elected to adjust the priority of the poe job.

User Response: Informational only, no action required.

0031-257 mp_euilib is not us, high priority daemon has been started.

Explanation: The user has elected to use high priority for the poe job and the userspace library is not being used.

User Response: Informational only, no action required.

0031-258 User string not authorized for group string on host string.

Explanation: The userid is not permitted to access the given groupid on the given host.

User Response: Add the userid to the group access list on the host.

0031-259 /etc/poe.priority file not found; priority adjustment function not started

Explanation: In attempting to start the dispatching priority adjustment function, there was no /etc/poe.priority parameter file found for this task. Most likely, it was not set up or is inaccessible. Normal application execution continues, although the priority adjustment function will not be run.

User Response: Ensure the /etc/poe.priority file exists. Consult the chapter on File Formats in the IBM AIX Parallel Environment: Installation manual for further details.

0031-260 Invalid entry in /etc/poe.priority file for user string, class string; priority adjustment function not started

Explanation: In attempting to start the dispatching priority adjustment function, there was no entry for the user and class found in the /etc/poe.priority file for this task. Most likely, the entry is missing or in error. Normal application execution continues, although the priority adjustment function will not be run.

User Response: Ensure that the entries for this user and class in the /etc/poe.priority file exist and are properly defined. Consult the chapter on File Formats in the IBM AIX Parallel Environment: Installation manual for further details.

0031-300 Program statically linked with ip, css library loaded is not ip.

Explanation: The IP library does not exist on the system or the user is specifying the usage of the us library when the program has been linked with IP.

User Response: Make sure the IP library exists. Make sure you're not specifying us when you want to use IP.

0031-301 mp_euilib specifies ip, css library loaded is not ip.

Explanation: The IP library does not exist on the system.

User Response: Make sure the IP library exists. Make sure you're not specifying US when you want to use IP.

0031-302 Program statically linked with us, css library loaded is not us.

Explanation: The US library does not exist on the system or the user is specifying the usage of the IP library when the program has been linked with US.

User Response: Make sure the US library exists. Make sure you're not specifying IP when you want to use US.

0031-303 mp_euilib specifies us, css library loaded is not us.

Explanation: The us library does not exist on the system.

User Response: Make sure the us library exists. Make sure you're not specifying ip when you want to use us.

0031-304 remote child: error restoring stdout.

Explanation: The previously closed stdout cannot be restored.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-305 remote child: error restoring stderr.

Explanation: The previously closed stderr cannot be restored.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-306 pm_atexit: pm_exit_value is number.

Explanation: This message reports the program exit value.

User Response: Informational message. No action required.

0031-307 remote child: error restoring stdin.

Explanation: The previously closed stdin cannot be restored.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-308 Invalid value for string: string

Explanation: Indicated value is not a valid setting for the indicated environment variable or command line option.

User Response: Set to a valid value and rerun.

0031-309 Connect failed during message passing initialization, task number, reason: string

Explanation: The Communication Subsystem was unable to connect this task to one or more other tasks in the current partition for the reason given.

User Response: If a timeout has occurred, the MP_TIMEOUT environment variable is set too low. (The default value is 150 seconds.) If you have not explicitly set the MP_TIMEOUT environment variable and the program being run under POE is NFS mounted, 150 seconds may not be sufficient.

If the reason given indicates "Permission denied", you should ensure the login name and user ID of the user submitting the job is consistent on all nodes on which the job is running.

If the reason given indicates "Permission denied" or "Not owner" and the job was submitted under LoadLeveler, you should ensure that the adapter requirement given to LoadLeveler is compatible with the MP_EUILIB environment variable.

If the reason given indicates "No such device", the Communication Subsystem library (libmpci.a) bound into the executable does not match the switch adapter for that node. This error usually occurs when the executable was statically bound on a system that was configured for a different switch adapter. For example, a program was compiled on a system configured with a TB2 adapter and then run on a system with a TB3 adapter. In this case, you should recompile the program on a system configured for the same switch adapter as that of the node where the executable will be run.

For any other reason, an internal error has occurred. You should gather information about the problem and follow local site procedures for reporting hardware and software problems.
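
The decision steps above can be summarized in a small lookup sketch. The matching is on substrings of the reason text; the exact wording of a timeout reason is an assumption (the sketch looks for "timed out" or "timeout"), and the function name suggest_responses is hypothetical. The advice strings paraphrase the User Response text and add nothing to it.

```python
from typing import List


def suggest_responses(reason: str, under_loadleveler: bool = False) -> List[str]:
    """Map the reason string of message 0031-309 to the responses suggested in the text."""
    text = reason.lower()
    advice: List[str] = []
    # Assumption: a timeout reason contains "timed out" or "timeout".
    if "timed out" in text or "timeout" in text:
        advice.append("Increase MP_TIMEOUT (default 150 seconds).")
    if "permission denied" in text:
        advice.append("Check that login name and user ID match on all nodes.")
    if under_loadleveler and ("permission denied" in text or "not owner" in text):
        advice.append("Check the LoadLeveler adapter requirement against MP_EUILIB.")
    if "no such device" in text:
        advice.append("Recompile on a system with the same switch adapter.")
    if not advice:
        advice.append("Report the problem through local site procedures.")
    return advice


for line in suggest_responses("Permission denied", under_loadleveler=True):
    print(line)
```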

0031-310 Socket open failed during message passing initialization, task number, reason: string

Explanation: The Communication Subsystem was unable to open a socket for message passing for the indicated task for the reason given.

User Response: If the reason given is "No buffer space available," have the system administrator raise the value of sb_max using the no command. The current suggested value is 128000.

For any other reason, an internal error has most likely occurred. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-311 Restart of program string failed. Return code is number.

Explanation: The restart of the program indicated was unsuccessful.

User Response: Check that the program name is valid, and that it was previously checkpointed.

0031-312 The checkpoint file string already exists in the working directory.

Explanation: While attempting to checkpoint the program, an existing version of the checkpoint file was found in the working directory. Execution is terminated.

User Response: Check the name of the file specified by the MP_CHECKFILE and MP_CHECKDIR environment variables. If necessary, remove the previous version of the file.
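
A quick pre-run check for a leftover checkpoint file might look like the sketch below. It assumes the file sits directly in the MP_CHECKDIR directory under the exact name given by MP_CHECKFILE, and that the working directory is used when MP_CHECKDIR is unset; the real file naming may differ (for example, it may carry additional suffixes). The function name existing_checkpoint is hypothetical.

```python
import os
from typing import Mapping, Optional


def existing_checkpoint(env: Mapping[str, str]) -> Optional[str]:
    """Return the assumed checkpoint path if a file is already there, else None."""
    # Assumption: fall back to the working directory when MP_CHECKDIR is unset.
    directory = env.get("MP_CHECKDIR", ".")
    name = env.get("MP_CHECKFILE")
    if not name:
        return None
    path = os.path.join(directory, name)
    return path if os.path.exists(path) else None


found = existing_checkpoint(os.environ)
if found is not None:
    print("Checkpoint file already exists:", found)
```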

0031-313 Error saving information for program string. Return code is number.

Explanation: The internal routine setExecInfo failed. This is an unrecoverable internal error.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-314 Checkpointing is not enabled. Set CHKPT_STATE to enable.

Explanation: Checkpointing of the program is not enabled, because the value of the CHKPT_STATE LoadLeveler environment variable was not set to enable.

User Response: To enable checkpointing, set CHKPT_STATE to enable.
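
A job wrapper could test this setting before starting a run, as in the minimal sketch below. It assumes an exact, case-sensitive match on the value enable; the function name checkpointing_enabled is hypothetical.

```python
import os
from typing import Mapping


def checkpointing_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """True if CHKPT_STATE is set to the value 'enable'."""
    # Assumption: the comparison is exact and case-sensitive.
    return env.get("CHKPT_STATE") == "enable"


print(checkpointing_enabled({"CHKPT_STATE": "enable"}))  # True
print(checkpointing_enabled({}))                         # False
```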

0031-315 Invalid value for mp_chkpt flags.

Explanation: A non-valid value was set for the flags of the mp_chkpt function. MP_CUSER is the only valid value.

User Response: Set the flag value in the mp_chkpt function call to MP_CUSER.

0031-316 Error suspending the pipe connection during checkpointing. Return code is number.

Explanation: An error occurred calling the internal routine mp_stopMPI, to suspend the pipe connection during checkpointing.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-317 Error resuming the pipe connection during checkpointing. Return code is number.

Explanation: An error occurred calling the internal routine mp_resumeMPI(), to resume the pipe connection during checkpointing.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-318 Checkpointing failed. Return code is number.

Explanation: An error occurred during the checkpointing process.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-319 Error blocking signals during checkpointing. Return code is number.

Explanation: An error occurred attempting to block all signals while checkpointing the program.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-320 Error occurred saving file information during checkpointing. Return code is number.

Explanation: An error occurred attempting to save the file information for the data segment while checkpointing the program.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-321 Error occurred saving signal information during checkpointing. Return code is number.

Explanation: An error occurred attempting to save the signal information for the data segment while checkpointing the program.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-322 Error occurred opening the checkpoint file.

Explanation: An error occurred attempting to open the checkpoint file. Processing is terminated.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-323 Error occurred writing header information during checkpointing. Return code is number.

Explanation: An error occurred writing the header information while checkpointing the program.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-324 Error occurred writing program information during checkpointing. Return code is number.

Explanation: An error occurred writing the program information while checkpointing the program.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-325 Error occurred saving data segment during checkpointing. Return code is number.

Explanation: An error occurred saving the data segment while checkpointing the program.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-326 Error occurred saving the MPCI data during checkpointing. Return code is number.

Explanation: An error occurred saving the MPCI data while checkpointing the program.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-327 Error occurred saving the stack data during checkpointing. Return code is number.

Explanation: An error occurred saving the stack data while checkpointing the program.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-328 Error occurred writing the footer data during checkpointing. Return code is number.

Explanation: An error occurred writing the footer data while checkpointing the program.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-329 Error occurred unblocking signals during checkpointing. Return code is number.

Explanation: An error occurred unblocking signals while checkpointing the program.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-330 Error getting environment variable string.

Explanation: The internal getenv function failed to get the specified environment variable. The remote node terminates.

User Response: Probable PE error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-331 Error occurred disconnecting from MPCI during checkpointing. Return code is number.

Explanation: An error occurred disconnecting from MPCI while checkpointing the program.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-332 SSM_CSS_INIT expected and it was not received.

Explanation: A system error occurred: an SSM_CSS_INIT was expected on the control pipe input but was not received. The remote node terminates.

User Response: Probable PE error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-333 More node information found than expected.

Explanation: An internal error was detected where there was more node information returned from SSM_CSS_INIT than expected. The remote node terminates.

User Response: Probable PE error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-334 SSM not SSM_ACK to our sync request.

Explanation: An internal error was detected where there was no acknowledgement returned for a synchronization request. The remote node terminates.

User Response: Probable PE error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-335 SSM subtype not what was expected

Explanation: An internal error was detected where an unexpected message type was returned. The remote node terminates.

User Response: Probable PE error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-336 Error with VT_trc_set_params_c.

Explanation: An internal error was detected after trying to set up the VT trace parameters. The remote node terminates.

User Response: Probable PE error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-337 Error in starting the user's code.

Explanation: An internal error was detected after trying to start the user executable code in the remote node. The remote node terminates.

User Response: Probable PE error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-338 Error sending exit request to home node.

Explanation: An internal error was detected after trying to send an exit request to the home node. The remote node terminates.

User Response: Probable PE error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-339 VT trace initialization failed.

Explanation: VT trace initialization failed on a remote node. The remote node continues, but no visualization trace file will be created for that node.

User Response: Probable PE error. Check for other messages from VT. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-340 Error occurred getting time during checkpointing. Return code is number.

Explanation: An error occurred getting time while checkpointing the program.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-341 Error occurred reconnecting MPCI during checkpointing. Return code is number.

Explanation: An error occurred re-establishing the connections to MPCI while checkpointing the program.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-342 Error occurred initializing time during checkpointing. Return code is number.

Explanation: An error occurred initializing time while checkpointing the program.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-343 Error occurred opening the checkpoint directory.

Explanation: An error occurred opening the checkpoint file directory while checkpointing the program.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-344 Checkpoint files were not found. Restart cannot be performed.

Explanation: Local checkpoint/restart files were not found. As a result, restart of the program is not possible.

User Response: Make sure the location of the checkpoint files matches what is specified by the MP_CHECKDIR and MP_CHECKFILE environment variables, and that those files are valid from a previously checkpointed parallel program. Otherwise, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-345 Checkpointed program is not a parallel program. Restart cannot be performed.

Explanation: The program contained in the checkpoint file is not a parallel program. As a result, restart of the program is not possible.

User Response: Make sure the location of the checkpoint files matches what is specified by the MP_CHECKDIR and MP_CHECKFILE environment variables, and that those files are valid from a previously checkpointed parallel program. Otherwise, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-346 Error occurred restoring data segment. Return code is number.

Explanation: An error occurred restoring the data segment of a checkpointed program. Restore operation has failed.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-347 Error occurred restoring file descriptors. Return code is number.

Explanation: An error occurred restoring the file descriptors of a checkpointed program. Restore operation has failed.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-348 Error occurred restoring signal handlers. Return code is number.

Explanation: An error occurred restoring the signal handlers of a checkpointed program. Restore operation has failed.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-349 Program restart processing ended without program being restarted.

Explanation: The program restart function completed without the checkpointed program ever restarting.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-350 Error occurred comparing environment variables during restart. Return code is number.

Explanation: The original POE and MPI environment variables do not match those contained in the program to be restarted. As a result, the program cannot be restarted.

User Response: Make sure the contents of the checkpoint files specified by the MP_CHECKDIR and MP_CHECKFILE environment variables are valid for the previously checkpointed parallel program. Otherwise, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-351 Error occurred unblocking signals during restore processing. Return code is number.

Explanation: An error occurred unblocking the signals while restoring a checkpointed program. Restore operation has failed.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-352 Error occurred reestablishing MPI/MPCI connection during restore processing. Return code is number.

Explanation: An error occurred reconnecting to MPI/MPCI while restoring a checkpointed program. Restore operation has failed.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-353 Error occurred synchronizing POE tasks during restore processing. Return code is number.

Explanation: An error occurred synchronizing the POE tasks while restoring a checkpointed program. Restore operation has failed.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-354 Error occurred obtaining global variables during restore processing. Return code is number.

Explanation: An error occurred obtaining the global variables from the environment while restoring a previously checkpointed program. Restore operation has failed.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-355 Error allocating data while restoring a checkpointed program.

Explanation: An error occurred allocating storage during the restore processing of a previously checkpointed program. Restore operation has failed.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-356 Error occurred reinitializing the clock during restore processing. Return code is number.

Explanation: An error occurred obtaining the switch clock address and reinitializing the clock for a previously checkpointed program. Restore operation has failed.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-357 Error occurred opening the checkpoint directory during restore.

Explanation: An error occurred opening the checkpoint file directory while restoring the program.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-358 Error reading internal messages while synchronizing POE tasks. Return code is number.

Explanation: An internal error in pm_SSM_read occurred while trying to read the messages during the synchronization of POE tasks, while restoring a previously checkpointed file. Restore processing is terminated.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-359 Comparison of values for string is not consistent between original and restored value.

Explanation: The original and restored values of the environment variable indicated were not consistent, when a checkpointed program was being restored.

User Response: Make sure the contents of the checkpoint files specified by the MP_CHECKDIR and MP_CHECKFILE environment variables are valid for the previously checkpointed parallel program. Otherwise, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-360 Variable string was not found in the environment when restoring program.

Explanation: The environment variable indicated was not found in the current environment, for a previously checkpointed program during restore processing.

User Response: Make sure the contents of the checkpoint files specified by the MP_CHECKDIR and MP_CHECKFILE environment variables are valid for the previously checkpointed parallel program. Otherwise, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-361 Unexpected return code number from ll_get_job.

Explanation: An internal error has occurred.

User Response: Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-362 Unexpected return code number from ll_request.

Explanation: An internal error has occurred.

User Response: Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-363 Unexpected return code number from ll_event.

Explanation: An internal error has occurred.

User Response: Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-364 Contacting LoadLeveler to string information for string job.

Explanation: Informational message to user to indicate that LoadLeveler is being used for the interactive or batch job.

User Response: None required.

0031-365 LoadLeveler unable to run job, reason: string

Explanation: LoadLeveler either could not run the interactive job for the reason indicated, or, LoadLeveler terminated the interactive job for the reason indicated.

User Response: Refer to Using and Administering LoadLeveler for information on the specific reason indicated in the LoadLeveler message that follows this message.

0031-366 Invalid combination: nodes=number, tasks_per_node=number, procs=number.

Explanation: The combination specified did not result in a mathematical equality -- nodes times tasks_per_node must equal procs, when all three are specified.

User Response: Correct one or more of the specifications to ensure they are mathematically consistent.

0031-367 Invalid combination: tasks_per_node=number, procs=number.

Explanation: User specified the options indicated, and tasks_per_node did not divide evenly into procs, which is required as described in IBM Parallel Environment for AIX Operation and Use, Volume 1.

User Response: Correct the specifications as described above.

0031-368 Number of nodes specified (number) may not exceed total number of tasks (number).

Explanation: User has specified more nodes (using MP_NODES or -nodes) than tasks (using MP_PROCS or -procs), which is an error.

User Response: Correct the specifications so that there are the same or fewer nodes than tasks.

0031-369 Number of tasks or nodes must also be specified when using tasks_per_node.

Explanation: User has specified tasks per node (using MP_TASKS_PER_NODE or -tasks_per_node), but has not specified either the number of nodes (using MP_NODES or -nodes) or the number of tasks (using MP_PROCS or -procs), which is required as described in IBM Parallel Environment for AIX Operation and Use, Volume 1.

User Response: Provide either of the omitted specifications.
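The consistency rules behind messages 0031-366 through 0031-369 can be sketched as a small pre-submission check. This is an illustration only: the function name check_task_geometry is hypothetical, the order in which POE itself applies these checks is not stated here, and all values are assumed to be positive integers.

```python
from typing import List, Optional


def check_task_geometry(nodes: Optional[int] = None,
                        tasks_per_node: Optional[int] = None,
                        procs: Optional[int] = None) -> List[str]:
    """Return the message numbers whose documented condition is violated.

    Hypothetical helper; assumes every value supplied is a positive integer.
    """
    problems = []

    # 0031-369: tasks_per_node given without nodes or procs.
    if tasks_per_node is not None and nodes is None and procs is None:
        problems.append("0031-369")

    if nodes is not None and tasks_per_node is not None and procs is not None:
        # 0031-366: when all three are given, nodes * tasks_per_node == procs.
        if nodes * tasks_per_node != procs:
            problems.append("0031-366")
    elif tasks_per_node is not None and procs is not None:
        # 0031-367: tasks_per_node must divide evenly into procs.
        if procs % tasks_per_node != 0:
            problems.append("0031-367")

    # 0031-368: the number of nodes may not exceed the number of tasks.
    if nodes is not None and procs is not None and nodes > procs:
        problems.append("0031-368")

    return problems
```

For example, nodes=4, tasks_per_node=2, procs=8 passes every check, while tasks_per_node=3 with procs=8 reports 0031-367.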

0031-370 Internal error: invalid taskid (number) received from LoadLeveler.

Explanation: An internal error has occurred.

User Response: Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-371 Conflicting specification for -msg_api, using string.

Explanation: A batch job using POE was submitted to LoadLeveler with a network statement in the Job Command File that contained a specification for messaging API that was different than the specification provided to POE via the MP_MSG_API environment variable or the -msg_api command line option. The specification used in this case will be that which appeared in the network statement.

User Response: Verify that the specification in the network statement was intended. If it was, modify the POE specification to eliminate the warning, if desired.

0031-372 Cannot run MPI/LAPI job as specified in submit file.

Explanation: A batch job using POE was submitted to LoadLeveler with a Job Command File that did not contain any network statements, and the specification provided to POE (using the MP_MSG_API environment variable or the -msg_api command line flag) indicated that both MPI and LAPI were being used. For batch jobs using both messaging APIs, the network statements must be used.

User Response: Ensure that use of both messaging APIs was intended. If so, add the required network statements. If not, modify the POE specification to indicate the correct messaging API.

0031-373 Using string for messaging API.

Explanation: Informational message to indicate the messaging API being used for the batch POE job submitted to LoadLeveler.

User Response: None.

0031-374 Conflicting specification for -euilib, using string.

Explanation: A batch job using POE was submitted to LoadLeveler with a network statement in the Job Command File that contained a specification for the message passing library that was different than the specification provided to POE via the MP_EUILIB environment variable or the -euilib command line option. The specification used in this case will be that which appeared in the network statement.

User Response: Verify that the specification in the network statement was intended. If it was, modify the POE specification to eliminate the warning, if desired.

0031-375 Using string for euilib.

Explanation: Informational message to indicate the message passing library being used for the batch POE job submitted to LoadLeveler.

User Response: None required.

0031-376 Conflicting specification for -euidevice, using string.

Explanation: A batch job using POE was submitted to LoadLeveler with a network statement in the Job Command File that contained a specification for the message passing device that was different than the specification provided to POE via the MP_EUIDEVICE environment variable or the -euidevice command line option. The specification used in this case will be that which appeared in the network statement.

User Response: Verify that the specification in the network statement was intended. If it was, modify the POE specification to eliminate the warning, if desired.

0031-377 Using string for euidevice.

Explanation: Informational message to indicate the message passing device being used for the batch POE job submitted to LoadLeveler.

User Response: None required.

0031-378 Ignoring SP_NAME in batch mode

Explanation: User has submitted a POE job in batch mode under LoadLeveler and the SP_NAME environment variable or associated command-line option was set.

User Response: Unset SP_NAME in the environment or remove -spname from the arguments line in the LoadLeveler submit file to eliminate the warning, if desired.

0031-379 Pool setting ignored when hostfile used

Explanation: User has set the MP_RMPOOL environment variable or the -rmpool command-line option but a hostfile was found.

User Response: Ensure that use of the hostfile was intended.

0031-380 LoadLeveler step ID is string

Explanation: The indicated step ID was assigned by LoadLeveler to the current interactive job. It may be useful when using the llq command to determine the job status.

User Response: None required.

0031-381 Switch clock source requested, but not all tasks on SP

Explanation: You or another user set MP_CLOCK_SOURCE=SWITCH, but one or more of the tasks were not on SP nodes that have access to the switch. The job is terminated.

User Response: If mixed-node execution is acceptable, unset the MP_CLOCK_SOURCE environment variable. Otherwise, check that the nodes allocated were all on an SP.

0031-400 Invalid value number for stdoutmode

Explanation: User has entered a non-negative value with -stdoutmode or MP_STDOUTMODE which is greater than or equal to the number of processes requested; for SINGLE mode, this value must be between 0 and n-1, where n is the number of processes.

User Response: Re-run with valid value.

0031-401 Invalid value number for stdinmode

Explanation: User has entered a non-negative value with -stdinmode or MP_STDINMODE which is greater than or equal to the number of processes requested; for SINGLE mode, this value must be between 0 and n-1, where n is the number of processes.

User Response: Re-run with valid value.
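The range rule shared by messages 0031-400 and 0031-401 is simple arithmetic, sketched below. The function name is hypothetical; value stands for the task number given with -stdoutmode or -stdinmode in SINGLE mode, and procs for the number of processes requested.

```python
def single_mode_task_is_valid(value: int, procs: int) -> bool:
    """True when value names an existing task, that is, 0 <= value <= procs - 1."""
    return 0 <= value <= procs - 1
```

With 4 processes, the valid task numbers are 0 through 3, so a value of 4 would be rejected.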

0031-402 Using css0 as euidevice for User Space job

Explanation: User specified command line option -euidevice or used environment variable MP_EUIDEVICE with a setting of other than css0, which must be used for User Space jobs. User value has been overridden to allow job to continue.

User Response: None required.

0031-403 Forcing dedicated adapter for User Space job

Explanation: User explicitly specified User Space job using -euilib us or MP_EUILIB=us and poe is making sure the adapter usage requested from the Resource Manager is dedicated. This can also occur if no euilib was specified and the execution environment resulted in an implicit User Space job.

User Response: None required.

0031-404 Forcing shared adapter for IP job

Explanation: User explicitly specified IP job using -euilib ip or MP_EUILIB=ip and poe is making sure the adapter usage requested from the Resource Manager is shared. This can also occur if no euilib was specified and the execution environment resulted in an implicit IP job.

User Response: None required.

0031-405 Hostfile entries for string usage for task number conflict, using string

Explanation: User has hostfile entry which, for the same node, specifies shared AND dedicated, or multiple AND unique, adapter or cpu usage, respectively.

User Response: Correct conflicting entries and rerun.

0031-406 IP not enabled for node string

Explanation: The Resource Manager allocated a node which was not configured to run IP over the switch.

User Response: Have the node configured for IP over the switch, or have the node removed from the pool being used, or change the hostfile entry.

0031-407 LoadLeveler unable to allocate nodes to poe for batch job, rc = number

Explanation: An internal error occurred in LoadLeveler. Reason codes are as follows:

-2 Could not get jobid from environment
-5 Socket error
-6 Could not connect to host
-8 Could not get hostlist

User Response: Retry; if problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-408 number tasks allocated by LoadLeveler, continuing...

Explanation: LoadLeveler allocated the indicated number of nodes, which was different than that specified by the POE job (using MP_PROCS or -procs or default). The job is continued with the indicated number of nodes.

User Response: If a fixed number of nodes is required, verify that the min_processors and max_processors keywords in the job command file agree with the number of nodes requested from POE. If they agree and the message continues, contact System Administrator to determine node availability.

0031-409 Unable to start Partition Manager daemon (string) on node string, rc = number

Explanation: An error (possibly internal) occurred when LoadLeveler attempted to start /etc/pmdv2 on the indicated node. Reason codes for internal LoadLeveler errors are as follows:

1 -- Remote host could not fork new process

2 -- Could not get jobid from environment

3 -- Could not get hostname

4 -- Nameserver could not resolve host

5 -- Socket error

6 -- Could not connect to host

7 -- Could not send command to remote startd

User Response: Check pathname and permissions for /etc/pmdv2. Retry; if problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.
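When scanning job logs, the reason codes listed for messages 0031-407 and 0031-409 can be kept in lookup tables. The sketch below only restates the codes documented above; the names ALLOCATE_REASONS, PMD_START_REASONS, and describe_rc are hypothetical, and any code not listed is reported as undocumented rather than guessed.

```python
# Reason codes documented for 0031-407 (LoadLeveler node allocation).
ALLOCATE_REASONS = {
    -2: "Could not get jobid from environment",
    -5: "Socket error",
    -6: "Could not connect to host",
    -8: "Could not get hostlist",
}

# Reason codes documented for 0031-409 (starting the Partition Manager daemon).
PMD_START_REASONS = {
    1: "Remote host could not fork new process",
    2: "Could not get jobid from environment",
    3: "Could not get hostname",
    4: "Nameserver could not resolve host",
    5: "Socket error",
    6: "Could not connect to host",
    7: "Could not send command to remote startd",
}

_TABLES = {"0031-407": ALLOCATE_REASONS, "0031-409": PMD_START_REASONS}


def describe_rc(message_id: str, rc: int) -> str:
    """Translate a return code for one of the two messages into its documented text."""
    return _TABLES[message_id].get(rc, "undocumented return code")
```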

0031-410 Invalid cpu usage: string

Explanation: User has requested an invalid cpu usage, via the -cpu_use command line option or via the MP_CPU_USE environment variable.

User Response: Correct the request to be either multiple or unique and rerun.

0031-411 Invalid adapter usage: string

Explanation: User has requested an invalid adapter usage, via the -adapter_use command line option or via the MP_ADAPTER_USE environment variable.

User Response: Correct the request to be either shared or dedicated and rerun.

0031-412 Invalid pulse value.

Explanation: An invalid value was specified for the MP_PULSE environment variable or the -pulse command line flag.

User Response: Respecify a valid value for MP_PULSE or -pulse.

0031-413 Incompatible version of LoadLeveler installed... terminating job

Explanation: POE has determined that an incompatible version of LoadLeveler is installed on the node where the job was to be run.

User Response: Follow local site procedures to request installation of LoadLeveler 2.1.0 on the node.

0031-414 pm_collect: read select error

Explanation: A system error occurred while reading from a remote node. The system error message is appended. POE terminates.

User Response: Verify that the remote nodes in the partition can be contacted by other means. If problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-415 Non-zero status number returned from pm_collect

Explanation: An error has occurred in a lower level function.

User Response: Perform whatever corrective action is indicated for earlier messages and retry. If problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-416 string: no response; rc = number

Explanation: An error occurred on reading data from remote node to home node.

User Response: This is an IP communication error between the home node and the remote node. No acknowledgement of startup was received from the pmd daemon running on the indicated node. Check for error messages from that node. The return codes are: -1, EOF on connection; 1, I/O error; 2, allocation error.

0031-417 unexpected acknowledgment of type string from remote node

Explanation: The Partition Manager received an unexpected data value from remote node during pm_collect function. The data is ignored and processing continues.

User Response: None required.

0031-600 Number of tasks (number) > maximum (number)

Explanation: User has requested more tasks than maximum number allowed.

User Response: Rerun job within defined limits for number of tasks.

0031-601 Open of file string failed

Explanation: Specified hostfile could not be opened.

User Response: Check path name and permissions.

0031-603 Resource Manager allocation for task: number, node: string, rc = string

Explanation: The node for the specified task was not allocated successfully. The reason codes are:

 1     JM_NOTATTEMPTED
 2     JM_INVALIDPOOL
 3     JM_INVALIDSUBPOOL
 4     JM_INVALIDNODENAME
 5     JM_EXCEEDEDCAPACITY
 6     JM_DOWNONENET
 7     JM_DOWNONSWITCH
 8     JM_INVALIDUSER
 9     JM_INVALIDADAPTER
10     JM_PARTITIONCREATIONFAILURE
11     JM_SWITCHFAULT
12     JM_SYSTEM_ERROR
13     JM_PARTITIONINUSE

User Response: For reason codes 1, 6, or 7, use the environment variable MP_RETRYCOUNT or the command line flag -retrycount to set the number of times that the Resource Manager attempts to allocate the node. If that fails, contact the system administrator to determine node availability.

For reason code 13, the node returning the message may have a previous MPI, LAPI, or other POE job in an unknown state. Check the node and kill the previous process IDs for these processes. The POE jobs that are hung may not be reported by jm_status -j.
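The reason codes for message 0031-603 and the responses documented for them can be summarized in a short sketch. The enumeration values are taken from the table above; the names JMReason and suggested_action are hypothetical, and the sketch assumes the code is available as a number.

```python
from enum import IntEnum


class JMReason(IntEnum):
    """Reason codes listed for message 0031-603."""
    JM_NOTATTEMPTED = 1
    JM_INVALIDPOOL = 2
    JM_INVALIDSUBPOOL = 3
    JM_INVALIDNODENAME = 4
    JM_EXCEEDEDCAPACITY = 5
    JM_DOWNONENET = 6
    JM_DOWNONSWITCH = 7
    JM_INVALIDUSER = 8
    JM_INVALIDADAPTER = 9
    JM_PARTITIONCREATIONFAILURE = 10
    JM_SWITCHFAULT = 11
    JM_SYSTEM_ERROR = 12
    JM_PARTITIONINUSE = 13


# Codes for which the User Response suggests MP_RETRYCOUNT or -retrycount.
RETRY_MAY_HELP = {
    JMReason.JM_NOTATTEMPTED,
    JMReason.JM_DOWNONENET,
    JMReason.JM_DOWNONSWITCH,
}


def suggested_action(rc: int) -> str:
    """Map a numeric reason code to the response documented for 0031-603."""
    try:
        reason = JMReason(rc)
    except ValueError:
        return "undocumented reason code"
    if reason in RETRY_MAY_HELP:
        return "set MP_RETRYCOUNT or -retrycount"
    if reason is JMReason.JM_PARTITIONINUSE:
        return "check the node for leftover POE processes"
    return "no specific response documented"
```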

0031-604 Unexpected non-numeric entry in hostfile

Explanation: A non-numeric pool number exists in hostfile.

User Response: Correct the hostfile entry.

0031-605 Unexpected EOF on allocation file for task number

Explanation: There were not enough entries in the hostfile for the number of processes specified.

User Response: Lower the number of processes or add more entries to the hostfile.

0031-606 Pool request not allowed in hostfile unless using Resource Manager

Explanation: Execution environment (see IBM AIX Parallel Environment Operation and Use) did not specify use of the Resource Manager, but a hostfile entry contained a pool request.

User Response: If Resource Manager use was intended, check environment variables. Otherwise, remove pool request from hostfile.

0031-607 Pool requests and host entries may not be intermixed in hostfile

Explanation: Pool requests and host entries co-existed in hostfile.

User Response: Modify hostfile to contain only pool requests or only hostnames.

0031-608 Unrecognized option for task number: < string>

Explanation: An option other than shared, dedicated, multiple, or unique was found in the hostfile.

User Response: Correct hostfile entry.
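The hostfile conditions behind messages 0031-604, 0031-607, and 0031-608 can be checked before a job is started. The sketch below is not the POE parser. It assumes, for illustration only, that a pool request is a line whose first field begins with "@" followed by the pool number, that fields are separated by white space, and that every field after the first is a usage option. The function name check_hostfile is hypothetical.

```python
from typing import List

# Usage options named in the explanation of 0031-608.
USAGE_OPTIONS = {"shared", "dedicated", "multiple", "unique"}


def check_hostfile(lines: List[str]) -> List[str]:
    """Return the message numbers whose documented condition a hostfile violates."""
    problems = []
    saw_pool = False
    saw_host = False

    for line in lines:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        name, options = fields[0], fields[1:]

        if name.startswith("@"):  # ASSUMPTION: "@" marks a pool request
            saw_pool = True
            if not name[1:].isdigit():
                problems.append("0031-604")  # non-numeric pool number
        else:
            saw_host = True

        for option in options:
            if option not in USAGE_OPTIONS:
                problems.append("0031-608")  # unrecognized option

    if saw_pool and saw_host:
        problems.append("0031-607")  # pool requests mixed with host entries

    return problems
```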

0031-609 Unable to open save_hostfile string

Explanation: Specified save hostfile could not be opened.

User Response: Check pathname and permissions.

0031-610 Error in command broadcast

Explanation: An error occurred in broadcasting the poe command to the partition. Probably one of the remote nodes is no longer accessible. POE terminates.

User Response: Verify that the remote nodes in the partition can be contacted by other means. If problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-611 Unable to open command file <string>

Explanation: The file designated to issue POE commands cannot be opened. POE terminates.

User Response: Verify that the file name is spelled correctly and is readable.

0031-612 pm_contact: write select error

Explanation: A system error occurred while writing to a remote node. The system error message is appended. POE terminates.

User Response: Verify that the remote nodes in the partition can be contacted by other means. If problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-613 Unable to send command to task number

Explanation: An error occurred in sending the poe command to the indicated task. Probably the remote node is no longer accessible. POE terminates.

User Response: Verify that the remote node in the partition can be contacted by other means. If problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-614 Unable to send single command to task number

Explanation: An error occurred in sending the poe command to the indicated task. Probably the remote node is no longer accessible. POE terminates.

User Response: Verify that the remote node in the partition can be contacted by other means. If problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-615 strappend failed for string , rc = number

Explanation: The internal string append function failed. The system terminates.

User Response: Probable PE error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-616 gethostbyname failed for home node

Explanation: The internal gethostbyname function failed. The system terminates.

User Response: Probable PE error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-617 pm_getcwd failed, rc = number

Explanation: The internal pm_getcwd function failed. A return code of 1 means that either a pipe to ksh could not be opened or the command failed. A return code of 2 means the working directory string is longer than bufsize. The system terminates.

User Response: Probable PE error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-618 The following nodes were not contacted:

Explanation: See message 0031-623 for a list of the remote nodes that did not respond during the phase indicated by the code in message 0031-631. It is possible that some nodes were not tried, so the list doesn't necessarily indicate that all the nodes were unavailable. POE terminates.

User Response: Probably connectivity to one of the listed nodes has been lost. Verify that the node can be contacted by other means. If problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-619 string

Explanation: The text is explanatory text as would be provided by the perror() or psignal() functions. For perror(), it is the text contained in sys_errlist[errno] for the error errno. For psignal(), it is the text contained in sys_siglist[signal] for signal signal. A preliminary 0031-number message indicates the context.
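To see the kind of text that accompanies message 0031-619, the same system descriptions can be looked up directly. The sketch below uses the Python standard library counterparts of perror() and psignal(): os.strerror for an errno value and signal.strsignal (Python 3.8 or later) for a signal number. The exact wording returned depends on the operating system, so no sample output is shown; the helper names are hypothetical.

```python
import errno
import os
import signal


def perror_text(err: int) -> str:
    """Return the system description for an errno value, as perror() would print."""
    return os.strerror(err)


def psignal_text(signum: int) -> str:
    """Return the system description for a signal number, as psignal() would print."""
    # strsignal returns None for signal numbers the platform does not recognize.
    return signal.strsignal(signum) or "unknown signal"


if __name__ == "__main__":
    print(perror_text(errno.ENOENT))
    print(psignal_text(signal.SIGINT))
```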

0031-620 pm_SSM_write failed in sending the user/environment for taskid number

Explanation: The internal pm_SSM_write function failed. The system terminates.

User Response: Probable PE error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-621 pm_SSM_write failed in sending the partition map information for taskid number

Explanation: The internal pm_SSM_write function failed. The system terminates.

User Response: Probable PE error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-622 unexpected msg from task number, type number Text: string

Explanation: An unexpected message was returned from the indicated task. The system continues.

User Response: Probable PE error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-623 task number: hostname string

Explanation: The line indicates the task number and associated host name. See message 0031-618 and 0031-631 for more information.

User Response: The list may contain names of failing nodes. Verify that connectivity exists and that the pmd daemon is executable on that node.

0031-624 Error from sigprocmask for blocking stop signals

Explanation: An error occurred in setting the signal mask to block stop signals during installation. POE terminates.

User Response: Probable PE internal error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-625 sigaction(SIGSTOP)

Explanation: An error occurred in setting the flags for the SIGSTOP signal. POE terminates.

User Response: Probable PE internal error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-626 got signal number; awaiting response from signal number

Explanation: POE received a signal while processing the responses to a previous signal. The new signal is ignored unless it is the SIGKILL signal.

User Response: Often this means that a remote node is not responding. Verify that the node can be contacted by other means. If problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-627 Task number connection blocked. Task will be abandoned.

Explanation: While shutting down the partition, POE was unable to write to the indicated task, because the socket was blocked. The socket and task are subsequently ignored and the shutdown continues.

User Response: Often this means that a remote node is not responding. The tasks running on this node must be terminated manually. Verify that the node can be contacted by other means. If problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-628 pm_contact: read select error

Explanation: A system error occurred while reading from a remote node. The system error message is appended. POE terminates.

User Response: Verify that the remote nodes in the partition can be contacted by other means. If problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-629 pm_contact: write timeout occurred; nprocs = number

Explanation: The select call timed out waiting for a remote node to become ready for writing. A list of nodes not contacted is appended. POE terminates.

User Response: Verify that the remote nodes in the partition can be contacted by other means. If necessary, the timeout interval may be set with the environment variable MP_TIMEOUT. The default is 150 seconds. If problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-630 pm_contact: read timeout occurred; nprocs = number

Explanation: The select call timed out waiting for a remote node to become ready for reading. A list of nodes not contacted is appended. POE terminates.

User Response: Verify that the remote nodes in the partition can be contacted by other means. If necessary, the timeout interval may be set with the environment variable MP_TIMEOUT. The default is 150 seconds. If problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-631 pm_contact: signal received; nprocs = number, code = number

Explanation: The operation was terminated by a signal, generated either by the user (SIGINT) or by the system (for example, SIGPIPE). POE terminates. The code indicates where in the contact sequence the signal occurred, as follows:

2 - connect
3 - write select
4 - write message 1
5 - write message 2
6 - read select
7 - read acknowledgement
8 - end of contact routine

User Response: The remote node(s) do not respond. Verify that the node can be contacted by other means. Verify that the pmdv2 daemon is executable on the indicated remote node. If necessary, the timeout interval may be set with the environment variable MP_TIMEOUT. The default is 150 seconds. If the problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.
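The code reported in message 0031-631 can be translated back to the contact phase with a small lookup. The sketch below assumes the message appears in a log line in the form shown in the message text above ("nprocs = number, code = number"); the names CONTACT_PHASES and parse_contact_signal are hypothetical, and real output may carry additional prefixes.

```python
import re
from typing import Optional, Tuple

# Phase codes documented for message 0031-631.
CONTACT_PHASES = {
    2: "connect",
    3: "write select",
    4: "write message 1",
    5: "write message 2",
    6: "read select",
    7: "read acknowledgement",
    8: "end of contact routine",
}

_PATTERN = re.compile(r"0031-631 .*nprocs = (\d+), code = (\d+)")


def parse_contact_signal(line: str) -> Optional[Tuple[int, str]]:
    """Return (nprocs, phase description) for a 0031-631 line, or None if no match."""
    match = _PATTERN.search(line)
    if match is None:
        return None
    nprocs, code = int(match.group(1)), int(match.group(2))
    return nprocs, CONTACT_PHASES.get(code, "undocumented code")
```

A line reporting code = 6, for instance, indicates that the signal arrived during the read select phase.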

0031-632 Can't connect to PM Array. errno = number

Explanation: POE tried to connect to the Program Marker Array tool, but was unsuccessful. The system error number is returned. Most likely, the Program Marker Array has not been started.

User Response: If the Program Marker Array is not being used, ignore this message. Otherwise, terminate POE, start the pmarray, and restart POE. If Program Marker Array has been started, verify the value of the environment variable MP_USRPORT for a valid port number for connection.

0031-633 Unexpected EOF on socket to task number

Explanation: POE got a socket EOF when trying to broadcast a message to the partition. The affected node is marked as not active, and the broadcast continues. The broadcast calling routine may take additional actions.

User Response: Verify the reason for loss of connection. Often this means that a remote node is not responding. Verify that the node can be contacted by other means. If problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-634 Non-zero status number returned from pm_parse_args

Explanation: An error has occurred parsing the parameters.

User Response: More information or error messages should accompany this message, describing the errors in more detail. Correct the invalid values and retry.

0031-635 Non-zero status number returned from pm_mgr_init

Explanation: An error has occurred in a lower level function.

User Response: Perform whatever corrective action is indicated for earlier messages and retry. If problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-636 User requested or EOF termination of pm_command

Explanation: End of file was reached in the specified command file or user typed quit.

User Response: If termination is unexpected, verify that command file contains correct number of commands based on MP_PROCS and MP_PGMMODEL settings.

0031-637 Non-zero status number returned from pm_command

Explanation: An error has occurred in a lower level function.

User Response: Perform whatever corrective action is indicated for earlier messages and retry. If problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-638 Non-zero status number returned from pm_respond

Explanation: An error has occurred in a lower level function.

User Response: Perform whatever corrective action is indicated for earlier messages and retry. If problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-639 Exit status from pm_respond = number

Explanation: The pm_respond function exited with the indicated status.

User Response: If other error messages occurred, perform corrective action indicated for the message(s); otherwise, no action is required.

0031-641 Unrecoverable failure in Resource Manager, terminating partition...

Explanation: A non-zero return code was returned from the SP Resource Manager message interpretation function. The Partition Manager terminates the partition. The return code value is printed. The defined values are:

1 - A system error occurred. An explanatory message follows.
3 - A socket error occurred. Probably the Resource Manager has failed. If this is a batch job running under LoadLeveler, the job will be terminated.

User Response: Correct the condition causing the non-zero return code, and restart POE.

0031-642 End of File from Program Marker Array

Explanation: The socket connecting POE to the Program Marker Array has returned EOF. Execution of POE continues.

User Response: If the Program Marker Array has not been deliberately terminated, determine the cause of the EOF, and, if desired, restart POE. Otherwise, ignore the message and allow the POE job to terminate normally.

0031-643 Error read from Program Marker Array

Explanation: The socket connecting POE to the Program Marker Array has returned an error condition. POE continues. The defined error codes are:

1 - I/O error on the socket. An explanatory message is appended if errno is set.
2 - POE was unable to allocate storage for the message from PM Array. Probable internal error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.
3 - Some other error. An explanatory message is appended if errno is set.

User Response: If the Program Marker Array has not been deliberately terminated, determine the cause of the error, and, if desired, restart POE. Otherwise, ignore the message and allow the POE job to terminate normally. If the error code was 2 or 3, gather information about the problem and follow local site procedures for reporting hardware and software problems.
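
Example: The error codes listed for message 0031-643 can be summarized in a small lookup table. The following is a minimal sketch only; the names PM_ARRAY_READ_ERRORS, should_report, and describe are hypothetical and are not part of POE.

```python
# Hypothetical helper, not part of POE: summarizes the error codes listed
# for message 0031-643 and the response this entry suggests for each.
PM_ARRAY_READ_ERRORS = {
    1: "I/O error on the socket",
    2: "POE was unable to allocate storage for the message from PM Array",
    3: "Some other error",
}


def should_report(code):
    """Return True when the entry advises reporting the problem (codes 2 and 3)."""
    return code in (2, 3)


def describe(code):
    """Return a one-line summary of the code and the suggested response."""
    text = PM_ARRAY_READ_ERRORS.get(code, "undefined error code")
    if should_report(code):
        action = "gather information and report the problem"
    else:
        action = "determine the cause and, if desired, restart POE"
    return f"{code}: {text}; {action}"
```

For instance, describe(2) yields a line that advises reporting the problem, in keeping with the User Response above.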

0031-644 Can't route PM Array message to task number

Explanation: An error occurred while trying to forward a message from the Program Marker Array to the indicated task. POE continues, but the connection to PM Array is closed.

User Response: Probably the remote task is no longer active, but the Program Marker Array responses are backlogged. In this case, the messages can be ignored. Otherwise, look for other failures that may also cause this symptom.

0031-645 PM Array message to unknown destination number

Explanation: The PM Array message is not to a remote task and not for the Home Node. The destination code is given. POE continues, but the connection to PM Array is closed.

User Response: Probably the PM Array tool is issuing invalid messages. Check the PM Array application.

0031-646 PM Array is trying to tell us something ...

Explanation: A message from PM Array is directed to the Home Node. At present there are no Home Node functions responding to the PM Array, so the message text is just printed out.

User Response: Verify that the PM Array tool is working correctly.

0031-647 string

Explanation: This is the message buffer text from PM Array as described in message 0031-646.

User Response: Verify that the PM Array tool is working correctly.

0031-648 Couldn't tell world about EOF on STDIN

Explanation: An error occurred while broadcasting EOF on STDIN to the partition. The partition is terminated, and POE exits.

User Response: Verify that the remote nodes are accessible and restart POE. If the failure continues, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-649 Couldn't tell task number about EOF on STDIN

Explanation: An error occurred while sending EOF on STDIN to the indicated task. The partition is terminated, and POE exits.

User Response: Verify that the remote node is accessible and restart POE. If the failure continues, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-650 No receiver for STDIN bound for number

Explanation: STDINMODE defines a task number that is not active.

User Response: Probable user error. Verify the value of STDINMODE set by the environment variable or under program control.

0031-651 Error reading input command file

Explanation: An I/O error occurred reading the input command file describing the initialization sequence for pdbx and pedb. Input reverts to STDIN.

User Response: If possible, determine which file is being read and correct it. If the problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-652 Error reading STDIN

Explanation: An I/O error occurred reading STDIN. STDIN is subsequently ignored.

User Response: Verify that the file used for STDIN is readable. If the problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-653 Couldn't route data from STDIN to task number

Explanation: An error occurred routing STDIN to the indicated task. The partition is terminated and POE exits.

User Response: Verify that the remote task is active. If the problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-654 Allocation error for SSM_read, task number, length number

Explanation: An error occurred allocating storage for a message from a remote node. The partition is terminated and POE exits. The task id and length requested are printed.

User Response: Verify that sufficient storage is available to run POE on the Home Node, and that the requested length is not excessive. If the problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-655 Can't route message to destination number

Explanation: An error occurred routing a message to the indicated destination task. The requested routing is not supported.

User Response: If the message is generated by Parallel Environment, this is an internal error. If generated by a user program, this is a user error. Determine the source of the message. If the problem is an internal error, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-656 I/O file string closed by task number

Explanation: The stdio stream indicated has been closed by the indicated task.

User Response: Verify if this is the intended operation. If so, ignore the message. This message may also occur at the end of a job that terminates normally.

0031-657 Can't send mpl_init_data to nodes

Explanation: An error occurred in broadcasting the CSS initialization data to the remote nodes. The partition is terminated and POE exits.

User Response: The failing routine is pm_address. Look for other symptoms to determine the cause of failure.

0031-658 Can't send termination signal to nodes.

Explanation: An error occurred in broadcasting the termination message to the remote nodes. The partition is terminated and POE exits (which it was trying to do, anyway).

User Response: The failing routine is pm_shutdown_job. Look for other symptoms to determine the cause of failure.

0031-659 Can't log accounting data from node number

Explanation: An error occurred in logging the accounting records received from the remote nodes. Execution continues.

User Response: The failing routine is pm_acct_response. Look for other symptoms to determine the cause of failure.

0031-660 Partition Manager stopped ...

Explanation: The Home Node (POE) has stopped in response to a SIGTSTP (<Ctrl>Z) signal. The remote nodes have been stopped.

User Response: To resume the job, issue SIGCONT, or use the shell job control commands fg or bg.
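
Example: Besides the shell commands fg and bg, SIGCONT can be sent from a program. The following is a minimal sketch for a POSIX system; the function name resume_job is hypothetical, and the process ID must be that of the stopped poe process.

```python
import os
import signal


def resume_job(pid):
    """Send SIGCONT to the process with the given process ID."""
    # os.kill delivers a signal; despite its name it does not
    # necessarily terminate the target process.
    os.kill(pid, signal.SIGCONT)
```

Calling resume_job with the process ID of the stopped Home Node process corresponds to issuing SIGCONT as described above.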

0031-661 signal_sent = number not recognized

Explanation: The indicated signal was recorded as being sent to the remote nodes, but is not recognized by POE. Execution continues.

User Response: Probable POE internal error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-662 Node number did not send PROFILE_DONE, sent msgtype number

Explanation: The indicated node did not send the PROFILE_DONE message after profiling, but sent a message of the indicated type. Message 0031-663 gives the text sent.

User Response: Consult the explanatory text. If that fails to disclose the problem, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-663 string

Explanation: This is the text of a message received instead of the expected PROFILE_DONE message.

User Response: If the text fails to disclose the problem, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-664 Unknown message type number received

Explanation: The indicated message type is not known by POE. Execution is terminated.

User Response: Probably the socket contains a non-structured message, which would be a stray. If the source of the stray socket message cannot be determined, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-665 Invalid value for SSM_SINGLE number

Explanation: The indicated value is invalid as a destination for SINGLE I/O mode. The specification is ignored.

User Response: Verify that the correct value for SINGLE I/O mode is used.

0031-666 Out of range value for SSM_SINGLE number

Explanation: The indicated value is out of range: less than zero, or greater than the number of tasks. The specification is ignored.

User Response: Verify that the correct value for SINGLE I/O mode is used.
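
Example: The range rule stated for message 0031-666 can be checked before a value is used. This is a minimal sketch; the function name is hypothetical, and the upper bound is taken literally from the wording of the explanation, which may not match the exact bound POE enforces.

```python
def single_mode_value_in_range(value, num_tasks):
    """Apply the range rule worded in the 0031-666 explanation.

    The entry calls a value out of range when it is less than zero or
    greater than the number of tasks. Treating num_tasks itself as
    acceptable is an assumption drawn from that wording.
    """
    return 0 <= value <= num_tasks
```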

0031-667 Invalid value for SSM_UNORDERED number

Explanation: The indicated value is invalid as a specification for UNORDERED I/O mode. The specification is ignored.

User Response: Verify that the correct value for I/O mode is used.

0031-668 pm_io_command: error in pm_SSM_write, rc = number

Explanation: An error occurred while responding to a STDIO MODE QUERY message. The response is abandoned.

User Response: Probable POE internal error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-669 Can't acknowledge IO_command sync

Explanation: A socket error occurred while broadcasting a synchronization request acknowledgment. The partition is terminated and POE exits.

User Response: One or more remote nodes may not be reachable. Verify that the remote nodes can be contacted, and restart POE. If problems persist, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-670 Illegal stdout mode number

Explanation: The indicated value for STDOUT mode is not valid. The requested I/O buffering is not performed.

User Response: Correct the value for STDOUT mode.

0031-671 Can't acknowledge PMArray data from task number

Explanation: An error occurred trying to return an acknowledgment to a node sending data to PM Array.

User Response: Probable POE internal error. The error may also be caused by loss of contact with the remote node. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-672 Invalid routing request from task number to task number

Explanation: The Home Node has received a message, but doesn't know how to route it to the indicated task (destination).

User Response: Probable POE internal error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-673 Invalid mode/destination for STDIN: number

Explanation: The requested destination for STDIN is invalid. The request to route STDIN is ignored.

User Response: Verify the STDIN I/O mode requested.

0031-674 Unexpected return code number from pm_SSM_write

Explanation: Internal error has occurred.

User Response: Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-675 Invalid mode number requested

Explanation: User program has called function MP_STDOUTMODE or mpc_stdoutmode with invalid mode.

User Response: Refer to man page for explanation of valid modes.

0031-676 Invalid value string for mp_euidevice

Explanation: The mp_euidevice specified on the command line with -euidevice or in the environment with MP_EUIDEVICE is not valid.

User Response: Refer to IBM AIX Parallel Environment Operation and Use for valid choices and rerun.

0031-677 Unexpected return code number from _mp_stdoutmode

Explanation: An error may have occurred in a lower level function.

User Response: If earlier error messages exist, perform whatever corrective action is indicated for these. If there are no other messages or if problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-678 Hostfile may not contain pool requests if not using job management system

Explanation: The user explicitly requested not to use the job management system (with the MP_RESD environment variable or the -resd command line flag), but the hostfile contained pool requests.

User Response: Either use hostnames in the hostfile, or do not disable the use of LoadLeveler or the Resource Manager.

0031-679 Profiling may not have completed on node number

Explanation: A profiling file may not have been completed for the given node. However, profiling files may exist for other nodes in the job.

User Response: If a profiling file from this node is needed, ensure that there is enough room on the node's filesystem for the profiling file and re-run the job. If the problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-680 Invalid mode number requested

Explanation: User program has called function MP_STDINMODE or mpc_stdinmode with invalid mode.

User Response: Refer to man page for explanation of valid modes.

0031-682 Unexpected return code number from _mp_in_mode

Explanation: An error may have occurred in a lower level function.

User Response: If earlier error messages exist, perform whatever corrective action is indicated for these. If there are no other messages or if problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-687 Unsuccessful call to pm_SSM_read

Explanation: Internal error has occurred.

User Response: Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-688 Incorrect subtype number received in structured socket message

Explanation: Internal error has occurred.

User Response: Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-689 Unexpected return code number from _mp_stdoutmode_query

Explanation: An error may have occurred in a lower level function.

User Response: If earlier error messages exist, perform whatever corrective action is indicated for these. If there are no other messages or if problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-690 Connected to Resource Manager string string

Explanation: Informational message to user to indicate successful connection to Resource Manager.

User Response: None required.

0031-692 Invalid option number requested

Explanation: User program has called function MP_FLUSH or mpc_flush with invalid option.

User Response: Refer to man page for explanation of valid options.

0031-696 Unexpected return code number from _mp_flush

Explanation: An error may have occurred in a lower level function.

User Response: If earlier error messages exist, perform whatever corrective action is indicated for these. If there are no other messages or if problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-699 Task number waiting to profile...

Explanation: Designated task is waiting to profile.

User Response: None required, information only.

0031-700 invalid priority received

Explanation: The priority received by the pm daemon is invalid.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-701 invalid envc received

Explanation: The envc received by the pm daemon is invalid.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-702 invalid pmdlog argument

Explanation: The pmdlog argument received by the pm daemon is invalid.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-703 invalid nprocs argument

Explanation: The nprocs argument received by the pm daemon is invalid.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-704 invalid newjob argument

Explanation: The newjob argument received by the pm daemon is invalid.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-705 invalid pdbx argument

Explanation: The pdbx argument received by the pm daemon is invalid.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-706 pmd: Error reading node info from home node.

Explanation: The node info received by the pm daemon is invalid.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-707 pmd: error sending node map ack to home node.

Explanation: The pm daemon was not able to send a node map ack to the home node.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-708 pmd: invalid JOBID.

Explanation: The JOBID received by the pm daemon is invalid.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-709 pmd: SSM recv'd not cmd str or exit

Explanation: An incorrect SSM was received by the pm daemon from the home node.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-710 pmd: pipe creation error

Explanation: The pm daemon was unable to create pipes to its child.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-711 parent select errno = number

Explanation: select error from pmd parent.

User Response: Probable system error.

0031-712 parent error reading STDIN, rc = number

Explanation: pm daemon parent was unable to read STDIN.

User Response: Probable system error.

0031-713 pmd parent: error w/ack for sig req to home

Explanation: pm daemon parent had error sending ack for sig request.

User Response: Probable system error.

0031-714 pmd parent: error writing to child's STDIN

Explanation: pm daemon parent was not able to write to its child's STDIN.

User Response: Probable system error.

0031-715 pmd parent: error writing to child's cntl

Explanation: pm daemon parent was not able to write to its child's control pipe.

User Response: Probable system error.

0031-716 pmd parent: error reading STDOUT from child

Explanation: pm daemon parent was not able to read STDOUT from the child.

User Response: Probable system error.

0031-717 pmd parent: error writing to STDOUT

Explanation: pm daemon parent was not able to write to STDOUT.

User Response: Probable system error.

0031-718 pmd parent: error reading control from child

Explanation: pm daemon parent was not able to read the control pipe from the child.

User Response: Probable system error.

0031-719 AFS authorization failed in settokens

Explanation: settokens() failed in pmd child when given the afstoken.

User Response: Probable system error.

0031-720 child: initgroups error - errno = <number>

Explanation: initgroups failed, errno given.

User Response: Probable system error.

0031-721 unable to set user info

Explanation: userinfo() was unable to set user info.

User Response: No response needed.

0031-722 can't set priority to number

Explanation: setpriority() failed in pmd child.

User Response: No response needed.

0031-723 userid = <number>

Explanation: userid is set to the given userid.

User Response: No response needed.

0031-724 Executing program: <string>

Explanation: The child is executing the given program.

User Response: No response needed.

0031-725 Failed to exec program string; errno = number

Explanation: The child failed to execute the given program.

User Response: Probable system error. POE's /usr/lpp/ppe.poe/lib/libc.a may not be up to date. Have the system administrator run the following script to rebuild POE's libc.a: /usr/lpp/ppe.poe/bin/makelibc. Verify that the euilibpath includes the following path: /usr/lpp/ppe.poe/lib.
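
Example: The check that the euilibpath includes /usr/lpp/ppe.poe/lib can be scripted. The following is a minimal sketch; the function name path_includes is hypothetical, and it assumes the path value is a colon-separated list of directories.

```python
import os

# Directory named in the 0031-725 user response.
POE_LIB_DIR = "/usr/lpp/ppe.poe/lib"


def path_includes(search_path, directory=POE_LIB_DIR):
    """Return True if the colon-separated search_path contains directory."""
    wanted = os.path.normpath(directory)
    # Empty components are skipped; trailing slashes are normalized away.
    return any(
        os.path.normpath(part) == wanted
        for part in search_path.split(":")
        if part
    )
```

If the value is taken from the environment, for example with os.environ.get("MP_EUILIBPATH", ""), note that the variable name MP_EUILIBPATH is an assumption here; this entry refers only to "the euilibpath".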

0031-726 pmd: error sending node attach data record to home node.

Explanation: The remote node PMD was not able to send the node attach data via IP communications to home node. The remote node will now exit.

User Response: Probable system error. Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-727 pmd parent: error writing to child's debug engine pipe

Explanation: pm daemon parent was not able to write to debug engine pipe, which is used to communicate with the node debug server.

User Response: Probable system error.

0031-728 Cannot set string limit to number, hard limit is number.

Explanation: If the user's soft limit is greater than the inetd hard limit, the soft limit will only get changed to the hard limit value.

User Response: If this causes a problem, ask the system administrator to increase the hard limit values for inetd.
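
Example: The capping behavior described for message 0031-728 can be sketched as follows. The function name effective_soft_limit is hypothetical, and the code illustrates the rule only; it is not how the pm daemon is implemented.

```python
import resource


def effective_soft_limit(requested_soft, hard):
    """Return the soft limit that results when a request is capped at the hard limit."""
    if hard == resource.RLIM_INFINITY:
        # An unlimited hard limit never caps the request.
        return requested_soft
    if requested_soft == resource.RLIM_INFINITY or requested_soft > hard:
        # The request exceeds the hard limit, so only the hard limit is granted.
        return hard
    return requested_soft
```

The current limits of a process can be read with resource.getrlimit, for example resource.getrlimit(resource.RLIMIT_NOFILE), which returns the soft and hard values as a pair.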

0031-729 ident_match failed; user identification failed.

Explanation: The user is not authorized to communicate via the TCP/IP socket between the POE home node and partition manager daemon.

User Response: Ensure the user is properly authorized to use POE, and ensure ident_match routine is properly installed and available.

0031-730 POE DFS credentials file is empty; DFS credentials cannot be established.

Explanation: POE was unable to locate the DFS credentials, because the /tmp/poedce_master file was empty. As a result, DFS authentication cannot be established.

User Response: Contact the system administrator to ensure the DFS credentials files were successfully copied using the poeauth command.

0031-731 Error getting and setting DFS credentials.

Explanation: The PMD called the poe_dce_set function to get and set the current context for establishing the DFS/DCE credentials when it encountered an error. poe_dce_set should have issued additional messages describing the errors.

User Response: Contact the system administrator to ensure the DFS/DCE credentials are properly set up. If the problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-732 Error loading POE DFS/DCE routines.

Explanation: The Partition Manager Daemon attempted to load poe_dce_set.o, the loadable object for the DFS/DCE routines it uses for setting up the user's credentials. Possibly DFS/DCE was not installed or is inoperable, or the required installation steps for running POE with DFS were not done.

User Response: Contact the system administrator to ensure that DFS/DCE is properly installed and that the necessary POE installation steps were completed. If the problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-733 The initialization of the remote debug engine failed.

Explanation: The Partition Manager Daemon detected an error while starting the remote debug engine. The debugger is terminated.

User Response: The remote debug engine portion of pdbx and pedb depends on the bos.adt.debug fileset. Contact your system administrator to make sure that the bos.adt.debug fileset is properly installed on the nodes where the job runs.

0031-734 POE DFS master file was not found. DFS credentials cannot be established.

Explanation: POE did not find the POE DFS credentials master file, /tmp/poedce_master,<uid>. DFS user authentication was selected (by specifying MP_AUTH = DFS), but most likely the poeauth command was not run to set up the DFS/DCE credentials. As a result, DFS authentication cannot be established and the job is terminated.

User Response: Make sure a valid set of DFS/DCE credentials was successfully copied by performing a dce_login and running the poeauth command, then run the POE job again.

0031-800 -procs string ignored in remote child

Explanation: -procs interpreted only in parent code.

User Response: No response needed.

0031-801 -hostfile string ignored in remote child

Explanation: -hostfile interpreted only in parent code.

User Response: No response needed.

0031-802 -newjob string ignored in remote child

Explanation: -newjob interpreted only in parent code.

User Response: No response needed.

0031-803 -pmdlog string ignored in remote child

Explanation: -pmdlog interpreted only in parent code.

User Response: No response needed.

0031-804 -pgmmodel string ignored in remote child

Explanation: -pgmmodel interpreted only in parent code.

User Response: No response needed.

0031-805 Invalid programming model specified: string

Explanation: -pgmmodel should be either SPMD or MPMD.

User Response: Re-enter -pgmmodel with either SPMD or MPMD.

0031-806 Invalid retry count string

Explanation: Retry count should be an integer.

User Response: Re-enter -retry followed by an integer.

0031-807 Invalid node pool specified: string

Explanation: -rmpool should be an integer >= 0.

User Response: Re-enter -rmpool followed by an integer >= 0.
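
Example: The value rules stated for messages 0031-805, 0031-806, and 0031-807 can be checked before POE is invoked. This is a minimal sketch based only on the wording of these entries; the function names are hypothetical, and how POE treats letter case or surrounding white space is not described here, so the comparison for -pgmmodel is exact.

```python
def valid_pgmmodel(value):
    """-pgmmodel should be either SPMD or MPMD (0031-805)."""
    return value in ("SPMD", "MPMD")


def valid_retry(value):
    """The retry count should be an integer (0031-806)."""
    try:
        int(value)
    except ValueError:
        return False
    return True


def valid_rmpool(value):
    """-rmpool should be an integer >= 0 (0031-807)."""
    try:
        return int(value) >= 0
    except ValueError:
        return False
```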

0031-808 Hostfile or pool must be used to request nodes.

Explanation: When using LoadLeveler or the Resource Manager, the environment variable MP_RMPOOL or the command line option -rmpool must be used to specify the pool, since a hostfile did not exist.

User Response: Ensure that absence of hostfile was intended, verify command line or environment variable settings of hostfile, resd, and rmpool, and then retry. Refer to IBM Parallel Environment for AIX Operation and Use, Volume 1 for further information.

0031-809 -tracefile string ignored in remote child

Explanation: -tracefile interpreted only in parent code.

User Response: No response needed.

0031-810 -tracelevel string ignored in remote child

Explanation: -tracelevel interpreted only in parent code.

User Response: No response needed.

0031-900 Can't request profiling for task number

Explanation: A communication failure has occurred.

User Response: Retry; if problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-901 Didn't get response to profiling request for task number

Explanation: A communication failure has occurred.

User Response: Retry; if problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-902 Unexpected response to profiling request for task number

Explanation: A stray message may have been received during profiling.

User Response: Retry; if problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-903 Can't confirm profiling for task number

Explanation: A communication failure has occurred.

User Response: Retry; if problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-904 Can't rename profiling file to string

Explanation: A communication failure may have occurred, or the profiling file could not be opened.

User Response: Check path name and permissions. If problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-906 Task number finished profile...

Explanation: Designated task has finished profile.

User Response: None required, information only.

0031-907 Task number terminating due to pulse timeout

Explanation: The designated task has been terminated due to a timeout in the POE pulse processing. The connection to the home node may have dropped, or the job may have hung or been functioning abnormally.

User Response: It is possible that the pulse interval was too small to allow sufficient time for the task to complete. Verify that the node is still up, or that the job was not doing something abnormal. You may also want to increase your interval value with the MP_PULSE environment variable or -pulse command line flag.
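
Example: When POE is started from a script, the pulse interval can be raised by setting MP_PULSE in the environment passed to the job. The following is a minimal sketch; the function name env_with_pulse is hypothetical, and the unit and default of the interval are not given in this entry, so choose the value according to the POE documentation.

```python
import os


def env_with_pulse(interval, base_env=None):
    """Return a copy of the environment with MP_PULSE set to interval."""
    # Copy so that the caller's environment is left unchanged.
    env = dict(os.environ if base_env is None else base_env)
    env["MP_PULSE"] = str(interval)
    return env
```

The returned dictionary can be passed as the env argument of subprocess.run when launching poe.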

0031-908 SSM_PULSE acknowledgment failed for task number.

Explanation: There was a failure in sending the acknowledgment message for the POE pulse function from POE to pmd for the indicated task.

User Response: Possible system error, unless the network connection between the nodes dropped. Otherwise, gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-909 POE terminating due to pulse timeout for task number.

Explanation: POE has determined that there were remote nodes which did not respond during POE pulse processing. There were not enough responses before the pulse timeout interval expired. The connection to the home node may have dropped, or the job may have hung or been functioning abnormally.

User Response: It is possible that the pulse interval was too small to allow sufficient time for the task to complete. Verify that the nodes are still up, or that the job was not doing something abnormal. You may also want to increase your interval value with the MP_PULSE environment variable or -pulse command line flag.

0031-A400 Error in creating socket

Explanation: The program pmarray terminates. An explanatory sentence is appended.

User Response: Probable system error. Check the condition(s) given in the explanatory sentence.

0031-A401 Error in binding socket

Explanation: The program pmarray terminates. An explanatory sentence is appended.

User Response: Probable system error. Check the condition(s) given in the explanatory sentence.

0031-A402 Error in listen

Explanation: The program pmarray terminates. An explanatory sentence is appended.

User Response: Probable system error. Check the condition(s) given in the explanatory sentence.

0031-A403 Error in accept

Explanation: The program pmarray terminates. An explanatory sentence is appended.

User Response: Probable system error. Check the condition(s) given in the explanatory sentence.

0031-A404 Error in socket read. File descriptor = number

Explanation: The program PMArray terminates. An explanatory sentence is appended.

User Response: Probable system error. Check the condition(s) given in the explanatory sentence.

0031-A405 Unsupported data type number received by PMArray, task number

Explanation: Execution continues. The data is ignored.

User Response: PMArray received data which was not recognized. This probably indicates that the PMArray program is not connected properly to the Partition Manager.

0031-A406 Bad message type number

Explanation: Internal error has occurred.

User Response: Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-A407 Internal error reading socket message

Explanation: Internal error has occurred.

User Response: Gather information about the problem and follow local site procedures for reporting hardware and software problems.

0031-A409 Invalid action code number, msg type number

Explanation: An unexpected message was received during initialization of the Program Marker Array.

User Response: Restart pmarray; if problem persists, gather information about the problem and follow local site procedures for reporting hardware and software problems.

