



lsb.events


The LSF batch event log file lsb.events is used to display LSF batch event history and for mbatchd failure recovery.

Whenever a host, job, or queue changes status, a record is appended to the event log file. The file is located in LSB_SHAREDIR/<cluster_name>/logdir, where LSB_SHAREDIR must be defined in lsf.conf(5) and <cluster_name> is the name of the LSF cluster, as returned by lsid(1). See mbatchd(8) for the description of LSB_SHAREDIR.
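The location rule above can be sketched in a few lines of Python. This is only an illustration: the function name event_log_path and the sample directory and cluster name are hypothetical, and in practice the two inputs would come from lsf.conf and lsid(1).

```python
from pathlib import PurePosixPath


def event_log_path(lsb_sharedir: str, cluster_name: str) -> PurePosixPath:
    """Join LSB_SHAREDIR/<cluster_name>/logdir/lsb.events as described above."""
    return PurePosixPath(lsb_sharedir) / cluster_name / "logdir" / "lsb.events"


# Hypothetical sample values, not taken from any real installation.
print(event_log_path("/usr/share/lsf/work", "cluster1"))
```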



lsb.events Structure

The event log file is an ASCII file with one record per line. For the lsb.events file, the first line has the format "# <history seek position>", which indicates the file position of the first history event after log switch. For the lsb.events.# file, the first line has the format "# <timestamp of most recent event>", which gives the timestamp of the most recent event in the file.
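As a minimal sketch of reading that first line, the Python function below extracts the integer that follows the "#". It assumes the value is a plain decimal number, possibly surrounded by whitespace; the function name parse_header is hypothetical. The caller decides whether the number is a seek position (lsb.events) or a timestamp (lsb.events.#).

```python
import re

# "#", optional whitespace, a decimal number, optional trailing whitespace.
_HEADER = re.compile(r"^#\s*(\d+)\s*$")


def parse_header(first_line: str) -> int:
    """Return the number stored on the first line of an event log file."""
    match = _HEADER.match(first_line)
    if match is None:
        raise ValueError(f"unexpected header line: {first_line!r}")
    return int(match.group(1))


# Made-up sample header line.
print(parse_header("# 1024\n"))
```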

Records and fields

The fields of a record are separated by blanks. The first string of an event record indicates its type. The following types of events are recorded:

JOB_NEW

A new job has been submitted. The fields in order of occurrence are:

Version number (%s)

The version number

Event time (%d)

The time of the event

jobId (%d)

Job ID

userId (%d)

UNIX user ID of the submitter

options (%d)

Bit flags for job processing

numProcessors (%d)

Number of processors requested for execution

submitTime (%d)

Job submission time

beginTime (%d)

Start time - the job should be started on or after this time

termTime (%d)

Termination deadline - the job should be terminated by this time

sigValue (%d)

Signal value

chkpntPeriod (%d)

Checkpointing period

restartPid (%d)

Restart process ID

userName (%s)

User name

rLimits

Soft CPU time limit (%d), see getrlimit(2)

rLimits

Soft file size limit (%d), see getrlimit(2)

rLimits

Soft data segment size limit (%d), see getrlimit(2)

rLimits

Soft stack segment size limit (%d), see getrlimit(2)

rLimits

Soft core file size limit (%d), see getrlimit(2)

rLimits

Soft memory size limit (%d), see getrlimit(2)

rLimits

Reserved (%d)

rLimits

Reserved (%d)

rLimits

Reserved (%d)

rLimits

Soft run time limit (%d), see getrlimit(2)

rLimits

Reserved (%d)

hostSpec (%s)

Model or host name for normalizing CPU time and run time

hostFactor (%f)

CPU factor of the above host

umask (%d)

File creation mask for this job

queue (%s)

Name of job queue to which the job was submitted

resReq (%s)

Resource requirements

fromHost (%s)

Submission host name

cwd (%s)

Current working directory

chkpntDir (%s)

Checkpoint directory

inFile (%s)

Input file name

outFile (%s)

Output file name

errFile (%s)

Error output file name

subHomeDir (%s)

Submitter's home directory

jobFile (%s)

Job file name

numAskedHosts (%d)

Number of candidate host names

askedHosts (%s)

List of names of candidate hosts for job dispatching

dependCond (%s)

Job dependency condition

preExecCmd (%s)

Job pre-execution command

timeEvent (%d)

Time Event, for job dependency condition; specifies when time event ended

jobName (%s)

Job name

command (%s)

Job command

nxf (%d)

Number of files to transfer

xf (%s)

List of file transfer specifications

mailUser (%s)

Mail user name

projectName (%s)

Project name

niosPort (%d)

Callback port if batch interactive job

maxNumProcessors (%d)

Maximum number of processors

schedHostType (%s)

Execution host type

loginShell (%s)

Login shell

userGroup (%s)

User group

exceptList (%s)

Exception handlers for the job

options2 (%d)

Bit flags for job processing

idx (%d)

Job array index

inFileSpool (%s)

Spool input file

commandSpool (%s)

Spool command file

jobSpoolDir (%s)

Job spool directory

userPriority (%d)

User priority

rsvId (%s)

Advance reservation ID; for example, "user2#0"

jobGroup (%s)

The job group under which the job runs

extsched (%s)

External scheduling options

warningAction (%s)

Job warning action

warningTimePeriod (%d)

Job warning time period in seconds

sla (%s)

SLA service class name under which the job runs

SLArunLimit (%d)

Absolute run time limit of the job for SLA service classes
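The eleven consecutive rLimits values in a JOB_NEW record are easy to misalign, so a small lookup table helps. The sketch below follows the order listed above (CPU, file size, data, stack, core, memory, three reserved slots, run time, one reserved slot). The dictionary keys and the function name are hypothetical, and the sample values are arbitrary.

```python
# One entry per rLimits slot, in record order; None marks a reserved slot.
RLIMIT_SLOTS = (
    "cpu", "file_size", "data", "stack", "core", "memory",
    None, None, None, "run", None,
)


def label_rlimits(values):
    """Map the 11 rLimits values to names, dropping the reserved slots."""
    if len(values) != len(RLIMIT_SLOTS):
        raise ValueError(f"expected {len(RLIMIT_SLOTS)} values, got {len(values)}")
    return {name: v for name, v in zip(RLIMIT_SLOTS, values) if name is not None}


# Arbitrary sample values for illustration only.
print(label_rlimits([-1, -1, -1, -1, 0, 2048, -1, -1, -1, 3600, -1]))
```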

JOB_FORWARD

A job has been forwarded to a remote cluster (Platform MultiCluster only).

If LSF_HPC_EXTENSIONS="SHORT_EVENTFILE" is specified in lsf.conf, older daemons and commands (pre-LSF Version 6.0) cannot recognize the lsb.events file format.

The fields in order of occurrence are:

Version number (%s)

The version number

Event time (%d)

The time of the event

jobId (%d)

Job ID

numReserHosts (%d)

Number of reserved hosts in the remote cluster

cluster (%s)

Remote cluster name

reserHosts (%s)

List of names of the reserved hosts in the remote cluster

idx (%d)

Job array index

JOB_ACCEPT

A job from a remote cluster has been accepted by this cluster. The fields in order of occurrence are:

Version number (%s)

The version number

Event time (%d)

The time of the event

jobId (%d)

Job ID at the accepting cluster

remoteJid (%d)

Job ID at the submission cluster

cluster (%s)

Job submission cluster name

idx (%d)

Job array index

JOB_START

A job has been dispatched.

If LSF_HPC_EXTENSIONS="SHORT_EVENTFILE" is specified in lsf.conf, older daemons and commands (pre-LSF Version 6.0) cannot recognize the lsb.events file format.

The fields in order of occurrence are:

Version number (%s)

The version number

Event time (%d)

The time of the event

jobId (%d)

Job ID

jStatus (%d)

Job status (4, indicating the RUN status of the job)

jobPid (%d)

Job process ID

jobPGid (%d)

Job process group ID

hostFactor (%f)

CPU factor of the first execution host

numExHosts (%d)

Number of processors used for execution

execHosts (%s)

List of execution host names

queuePreCmd (%s)

Pre-execution command

queuePostCmd (%s)

Post-execution command

jFlags (%d)

Job processing flags

userGroup (%s)

User group name

idx (%d)

Job array index

additionalInfo (%s)

Placement information of HPC jobs

JOB_START_ACCEPT

A job has started on the execution host(s). The fields in order of occurrence are:

Version number (%s)

The version number

Event time (%d)

The time of the event

jobId (%d)

Job ID

jobPid (%d)

Job process ID

jobPGid (%d)

Job process group ID

idx (%d)

Job array index

JOB_STATUS

The status of a job changed after dispatch. The fields in order of occurrence are:

Version number (%s)

The version number

Event time (%d)

The time of the event

jobId (%d)

Job ID

jStatus (%d)

New status, see <lsbatch/lsbatch.h>

reason (%d)

Pending or suspended reason code, see <lsbatch/lsbatch.h>

subreasons (%d)

Pending or suspended subreason code, see <lsbatch/lsbatch.h>

cpuTime (%f)

CPU time consumed so far

endTime (%d)

Job completion time

ru (%d)

Resource usage flag

lsfRusage (%s)

Resource usage statistics, see <lsf/lsf.h>

exitStatus (%d)

Exit status of the job, see <lsbatch/lsbatch.h>

idx (%d)

Job array index

exitInfo (%d)

Job termination reason, see <lsbatch/lsbatch.h>

JOB_SWITCH

A job switched from one queue to another. The fields in order of occurrence are:

Version number (%s)

The version number

Event time (%d)

The time of the event

userId (%d)

UNIX user ID of the user invoking the command

jobId (%d)

Job ID

queue (%s)

Target queue name

idx (%d)

Job array index

userName (%s)

Name of the job submitter

JOB_MOVE

A job moved toward the top or bottom of its queue. The fields in order of occurrence are:

Version number (%s)

The version number

Event time (%d)

The time of the event

userId (%d)

UNIX user ID of the user invoking the command

jobId (%d)

Job ID

position (%d)

Position number

base (%d)

Operation code (TO_TOP or TO_BOTTOM), see <lsbatch/lsbatch.h>

idx (%d)

Job array index

userName (%s)

Name of the job submitter

QUEUE_CTRL

A job queue has been altered. The fields in order of occurrence are:

Version number (%s)

The version number

Event time (%d)

The time of the event

opCode (%d)

Operation code, see <lsbatch/lsbatch.h>

queue (%s)

Queue name

userId (%d)

UNIX user ID of the user invoking the command

userName (%s)

Name of the user

ctrlComments (%s)

Administrator comment text from the -C option of badmin queue control commands qclose, qopen, qact, and qinact

HOST_CTRL

A batch server host changed status. The fields in order of occurrence are:

Version number (%s)

The version number

Event time (%d)

The time of the event

opCode (%d)

Operation code, see <lsbatch/lsbatch.h>

host (%s)

Host name

userId (%d)

UNIX user ID of the user invoking the command

userName (%s)

Name of the user

ctrlComments (%s)

Administrator comment text from the -C option of badmin host control commands hclose and hopen

MBD_START

The mbatchd has started. The fields in order of occurrence are:

Version number (%s)

The version number

Event time (%d)

The time of the event

master (%s)

Master host name

cluster (%s)

Cluster name

numHosts (%d)

Number of hosts in the cluster

numQueues (%d)

Number of queues in the cluster

MBD_DIE

The mbatchd died. The fields in order of occurrence are:

Version number (%s)

The version number

Event time (%d)

The time of the event

master (%s)

Master host name

numRemoveJobs (%d)

Number of finished jobs that have been removed from the system and logged in the current event file

exitCode (%d)

Exit code from mbatchd

ctrlComments (%s)

Administrator comment text from the -C option of badmin mbdrestart

UNFULFILL

Actions that were not taken because the mbatchd was unable to contact the sbatchd on the job execution host. The fields in order of occurrence are:

Version number (%s)

The version number

Event time (%d)

The time of the event

jobId (%d)

Job ID

notSwitched (%d)

Not switched: the mbatchd has switched the job to a new queue, but the sbatchd has not been informed of the switch

sig (%d)

Signal: this signal has not been sent to the job

sig1 (%d)

Checkpoint signal: the job has not been sent this signal to checkpoint itself

sig1Flags (%d)

Checkpoint flags, see <lsbatch/lsbatch.h>

chkPeriod (%d)

New checkpoint period for job

notModified (%s)

If set to true, then parameters for the job cannot be modified.

idx (%d)

Job array index

LOAD_INDEX

mbatchd restarted with these load index names (see lsf.cluster(5)). The fields in order of occurrence are:

Version number (%s)

The version number

Event time (%d)

The time of the event

nIdx (%d)

Number of index names

name (%s)

List of index names
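Several records, LOAD_INDEX among them, store a count followed by that many values. A minimal Python sketch of reading such a pair is shown below. It assumes the record has already been split into a list of strings and that each list element occupies its own field; the function name read_counted_list and the sample index names are hypothetical.

```python
def read_counted_list(fields, pos):
    """Read a count at fields[pos] and the items that follow it.

    Returns (items, next_pos), where next_pos is the index of the first
    field after the list.
    """
    count = int(fields[pos])
    start = pos + 1
    end = start + count
    if count < 0 or end > len(fields):
        raise ValueError("count does not match the number of remaining fields")
    return fields[start:end], end


# Made-up tail of a LOAD_INDEX record: nIdx followed by the index names.
names, next_pos = read_counted_list(["3", "r15s", "r1m", "r15m"], 0)
print(names, next_pos)
```

Returning the next position lets a caller continue with whatever field follows the list.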

JOB_SIGACT

An action on a job has been taken. The fields in order of occurrence are:

Version number (%s)

The version number

Event time (%d)

The time of the event

jobId (%d)

Job ID

period (%d)

Action period

pid (%d)

Process ID of the child sbatchd that initiated the action

jstatus (%d)

Job status

reasons (%d)

Job pending reasons

flags (%d)

Action flags, see <lsbatch/lsbatch.h>

actStatus (%d)

Action status:

1: Action started

2: One action preempted other actions

3: Action succeeded

4: Action failed

signalSymbol (%s)

Action name, accompanied by actFlags

idx (%d)

Job array index
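The actStatus codes listed for JOB_SIGACT can be given readable names with a small enumeration. This is a sketch only: the member names are hypothetical labels chosen for this example and are not the constants defined in <lsbatch/lsbatch.h>.

```python
from enum import IntEnum


class ActStatus(IntEnum):
    """Hypothetical labels for the actStatus values 1 to 4 listed above."""
    STARTED = 1      # Action started
    PREEMPTED = 2    # One action preempted other actions
    SUCCEEDED = 3    # Action succeeded
    FAILED = 4       # Action failed


def describe_act_status(raw: str) -> str:
    """Convert the textual actStatus field to its label."""
    return ActStatus(int(raw)).name


print(describe_act_status("3"))
```

An unknown code raises ValueError, which makes unexpected values visible instead of silently passing them through.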

MIG

A job has been migrated. The fields in order of occurrence are:

Version number (%s)

The version number

Event time (%d)

The time of the event

jobId (%d)

Job ID

numAskedHosts (%d)

Number of candidate hosts for migration

askedHosts (%s)

List of names of candidate hosts

userId (%d)

UNIX user ID of the user invoking the command

idx (%d)

Job array index

userName (%s)

Name of the job submitter

JOB_MODIFY2

This is created when the mbatchd modifies a previously submitted job via bmod(1). The fields in order of occurrence are:

Version number (%s)

The version number

Event time (%d)

The time of the event

jobIdStr (%s)

Job ID

options (%d)

Bit flags for job processing

options2 (%d)

Bit flags for job processing

delOptions (%d)

Delete options for the options field

delOptions2 (%d)

Delete options for the options2 field

userId (%d)

UNIX user ID of the submitter

userName (%s)

User name

submitTime (%d)

Job submission time

umask (%d)

File creation mask for this job

numProcessors (%d)

Number of processors requested for execution

beginTime (%d)

Start time - the job should be started on or after this time

termTime (%d)

Termination deadline - the job should be terminated by this time

sigValue (%d)

Signal value

restartPid (%d)

Restart process ID for the original job

jobName (%s)

Job name

queue (%s)

Name of job queue to which the job was submitted

numAskedHosts (%d)

Number of candidate host names

askedHosts (%s)

List of names of candidate hosts for job dispatching; blank if the last field value is 0. If there is more than one host name, then each additional host name will be returned in its own field

resReq (%s)

Resource requirements

rLimits

Soft CPU time limit (%d), see getrlimit(2)

rLimits

Soft file size limit (%d), see getrlimit(2)

rLimits

Soft data segment size limit (%d), see getrlimit(2)

rLimits

Soft stack segment size limit (%d), see getrlimit(2)

rLimits

Soft core file size limit (%d), see getrlimit(2)

rLimits

Soft memory size limit (%d), see getrlimit(2)

rLimits

Reserved (%d)

rLimits

Reserved (%d)

rLimits

Reserved (%d)

rLimits

Soft run time limit (%d), see getrlimit(2)

rLimits

Reserved (%d)

hostSpec (%s)

Model or host name for normalizing CPU time and run time

dependCond (%s)

Job dependency condition

timeEvent (%d)

Time Event, for job dependency condition; specifies when time event ended

subHomeDir (%s)

Submitter's home directory

inFile (%s)

Input file name

outFile (%s)

Output file name

errFile (%s)

Error output file name

command (%s)

Job command

inFileSpool (%s)

Spool input file

commandSpool (%s)

Spool command file

chkpntPeriod (%d)

Checkpointing period

chkpntDir (%s)

Checkpoint directory

nxf (%d)

Number of files to transfer

xf (%s)

List of file transfer specifications

jobFile (%s)

Job file name

fromHost (%s)

Submission host name

cwd (%s)

Current working directory

preExecCmd (%s)

Job pre-execution command

mailUser (%s)

Mail user name

projectName (%s)

Project name

niosPort (%d)

Callback port if batch interactive job

maxNumProcessors (%d)

Maximum number of processors

loginShell (%s)

Login shell

schedHostType (%s)

Execution host type

userGroup (%s)

User group

exceptList (%s)

Exception handlers for the job

userPriority (%d)

User priority

rsvId (%s)

Advance reservation ID; for example, "user2#0"

extsched (%s)

External scheduling options

warningAction (%s)

Job warning action

warningTimePeriod (%d)

Job warning time period in seconds

jobGroup (%s)

The job group to which the job is attached

sla (%s)

SLA service class name that the job is to be attached to

JOB_SIGNAL

This is created when a job is signaled via bkill(1) or deleted via bdel(1). The fields in order of occurrence are:

Version number (%s)

The version number

Event time (%d)

The time of the event

jobId (%d)

Job ID

userId (%d)

UNIX user ID of the user invoking the command

runCount (%d)

Number of runs

signalSymbol (%s)

Signal name

idx (%d)

Job array index

userName (%s)

Name of the job submitter

CAL_NEW

This is created when a new calendar is added to the system via bcaladd(1). The fields in order of occurrence are:

Version number (%s)

The version number

Event time (%d)

The time of the event

userId (%d)

UNIX user ID of the calendar owner or the invoker

options (%d)

Options, see <lsbatch/lsbatch.h>

name (%s)

Calendar name

desc (%s)

Calendar description

calExpr (%s)

Time expression list associated with the calendar

CAL_MODIFY

This is created when a calendar is modified via bcalmod(1). The fields are the same as for CAL_NEW.

CAL_DELETE

This is created when a calendar is deleted via bcaldel(1). The fields are the same as for CAL_NEW.

JOB_EXECUTE

This is created when a job is actually running on an execution host. The fields in order of occurrence are:

Version number (%s)

The version number

Event time (%d)

The time of the event

jobId (%d)

Job ID

execUid (%d)

Mapped UNIX user ID on execution host

jobPGid (%d)

Job process group ID

execCwd (%s)

Current working directory job used on execution host

execHome (%s)

Home directory job used on execution host

execUsername (%s)

Mapped user name on execution host

jobPid (%d)

Job process ID

idx (%d)

Job array index

additionalInfo (%s)

Placement information of HPC jobs

SLAscaledRunLimit (%d)

Run time limit for the job scaled by the execution host

JOB_REQUEUE

This is created when a job has ended and been requeued by mbatchd. The fields in order of occurrence are:

Version number (%s)

The version number

Event time (%d)

The time of the event

jobId (%d)

Job ID

idx (%d)

Job array index

JOB_CLEAN

This is created when a job is removed from the mbatchd memory. The fields in order of occurrence are:

Version number (%s)

The version number

Event time (%d)

The time of the event

jobId (%d)

Job ID

idx (%d)

Job array index
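JOB_CLEAN is one of the shortest records, which makes it a convenient illustration of the common layout: event type, version number, event time, then the type-specific fields. The Python sketch below rests on two assumptions that this page does not state: string fields are enclosed in double quotes, and the event time is a count of seconds since the UNIX epoch. shlex.split is used as a stand-in tokenizer and may not match the quoting rules of real files. The class name, function name, and sample line are hypothetical.

```python
import shlex
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class JobClean:
    version: str
    event_time: datetime
    job_id: int
    idx: int


def parse_job_clean(line: str) -> JobClean:
    """Split one blank-separated record and convert the JOB_CLEAN fields."""
    fields = shlex.split(line)  # assumption: strings are double-quoted
    if len(fields) != 5 or fields[0] != "JOB_CLEAN":
        raise ValueError("not a JOB_CLEAN record")
    _, version, event_time, job_id, idx = fields
    # Assumption: event time is seconds since the UNIX epoch.
    when = datetime.fromtimestamp(int(event_time), tz=timezone.utc)
    return JobClean(version, when, int(job_id), int(idx))


# Made-up sample line.
print(parse_job_clean('"JOB_CLEAN" "6.0" 1077580800 1234 0'))
```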

JOB_EXCEPTION

This is created when an exception condition is detected for a job. The fields in order of occurrence are:

Version number (%s)

The version number

Event time (%d)

The time of the event

jobId (%d)

Job ID

exceptMask (%d)

Exception Id

0x01: missched

0x02: overrun

0x04: underrun

0x08: abend

0x10: cantrun

0x20: hostfail

0x40: startfail

actMask (%d)

Action Id

0x01: kill

0x02: alarm

0x04: rerun

0x08: setexcept

timeEvent (%d)

Time event; for the missched exception, specifies when the time event ended

exceptInfo (%d)

Exception information: the pending reason for a missched or cantrun exception, the exit code of the job for an abend exception, otherwise 0

idx (%d)

Job array index
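The exceptMask and actMask values listed for JOB_EXCEPTION are powers of two, which suggests they can be combined in a single integer. Under that assumption, the Python sketch below decodes a mask into the names given above. The class names and the helper flag_names are hypothetical.

```python
from enum import IntFlag


class ExceptMask(IntFlag):
    MISSCHED = 0x01
    OVERRUN = 0x02
    UNDERRUN = 0x04
    ABEND = 0x08
    CANTRUN = 0x10
    HOSTFAIL = 0x20
    STARTFAIL = 0x40


class ActMask(IntFlag):
    KILL = 0x01
    ALARM = 0x02
    RERUN = 0x04
    SETEXCEPT = 0x08


def flag_names(mask):
    """List the lowercase names of every bit set in an IntFlag value."""
    return [member.name.lower() for member in type(mask) if member in mask]


# 0x0A sets the overrun and abend bits; 0x03 sets kill and alarm.
print(flag_names(ExceptMask(0x0A)), flag_names(ActMask(0x03)))
```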

JOB_EXT_MSG

An external message has been sent to a job. The fields in order of occurrence are:

Version number (%s)

The version number

Event time (%d)

The time of the event

jobId (%d)

Job ID

idx (%d)

Job array index

msgIdx (%d)

Index in the list

userId (%d)

Unique user ID of the user invoking the command

dataSize (%ld)

Size of the data if it has any, otherwise 0

postTime (%ld)

Message sending time

dataStatus (%d)

Status of the attached data

desc (%s)

Text description of the message

userName (%s)

Name of the author of the message

JOB_ATTA_DATA

An update on the data status of a message for a job has been sent. The fields in order of occurrence are:

Version number (%s)

The version number

Event time (%d)

The time of the event

jobId (%d)

Job ID

idx (%d)

Job array index

msgIdx (%d)

Index in the list

dataSize (%ld)

Size of the data if it has any, otherwise 0

dataStatus (%d)

Status of the attached data

fileName (%s)

File name of the attached data

JOB_CHUNK

This is created when a job is inserted into a chunk.

If LSF_HPC_EXTENSIONS="SHORT_EVENTFILE" is specified in lsf.conf, older daemons and commands (pre-LSF Version 6.0) cannot recognize the lsb.events file format.

The fields in order of occurrence are:

Version number (%s)

The version number

Event time (%d)

The time of the event

membSize (%ld)

Size of array membJobId

membJobId (%ld)

Job IDs of jobs in the chunk

numExHosts (%ld)

Number of execution hosts

execHosts (%s)

Execution host name array

SBD_UNREPORTED_STATUS

This is created when an unreported status change occurs. The fields in order of occurrence are:

Version number (%s)

The version number

Event time (%d)

The time of the event

jobId (%d)

Job ID

actPid (%d)

Acting process ID

jobPid (%d)

Job process ID

jobPGid (%d)

Job process group ID

newStatus (%d)

New status of the job

reason (%d)

Pending or suspending reason code, see <lsbatch/lsbatch.h>

suspreason (%d)

Pending or suspending subreason code, see <lsbatch/lsbatch.h>

lsfRusage

The following fields contain resource usage information for the job. If the value of a field is unavailable (because the job was aborted or because of differences among operating systems), -1 is logged. Times are measured in seconds, and sizes are measured in KB.

ru_utime (%f)

User time used

ru_stime (%f)

System time used

ru_maxrss (%d)

Maximum shared text size

ru_ixrss (%d)

Integral of the shared text size over time (in kilobyte seconds)

ru_ismrss (%d)

Integral of the shared memory size over time (valid only on Ultrix)

ru_idrss (%d)

Integral of the unshared data size over time

ru_isrss (%d)

Integral of the unshared stack size over time

ru_minflt (%d)

Number of page reclaims

ru_majflt (%d)

Number of page faults

ru_nswap (%d)

Number of times the process was swapped out

ru_inblock (%d)

Number of block input operations

ru_oublock (%d)

Number of block output operations

ru_ioch (%d)

Number of characters read and written (valid only on HP-UX)

ru_msgsnd (%d)

Number of System V IPC messages sent

ru_msgrcv (%d)

Number of messages received

ru_nsignals (%d)

Number of signals received

ru_nvcsw (%d)

Number of voluntary context switches

ru_nivcsw (%d)

Number of involuntary context switches

ru_exutime (%d)

Exact user time used (valid only on ConvexOS)

exitStatus (%d)

Exit status of the job, see <lsbatch/lsbatch.h>

execCwd (%s)

Current working directory job used on execution host

execHome (%s)

Home directory job used on execution host

execUsername (%s)

Mapped user name on execution host

msgId (%d)

ID of the message

actStatus (%d)

Action status

1: Action started

2: One action preempted other actions

3: Action succeeded

4: Action failed

sigValue (%d)

Signal value

seq (%d)

Sequence status of the job

idx (%d)

Job array index

jRusage (run usage)

The following fields contain resource usage information for the job. If the value of a field is unavailable (because the job was aborted or because of differences among operating systems), -1 is logged. Times are measured in seconds, and sizes are measured in KB.

mem (%d)

Total resident memory usage in KB of all currently running processes in a given process group

swap (%d)

Total virtual memory usage in KB of all currently running processes in given process groups

utime (%d)

Cumulative total user time in seconds

stime (%d)

Cumulative total system time in seconds

npids (%d)

Number of currently active processes in given process groups. This entry has four sub-fields:

pid (%d)

Process ID of the child sbatchd that initiated the action

ppid (%d)

Parent process ID

pgid (%d)

Process group ID

jobId (%d)

Process Job ID

npgids (%d)

Number of currently active process groups

exitInfo (%d)

Job termination reason, see <lsbatch/lsbatch.h>
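The lsfRusage block of SBD_UNREPORTED_STATUS logs -1 for any value that is unavailable. A minimal Python sketch of naming the nineteen values and turning -1 into None follows. It assumes the values have already been extracted from the record as strings, in the order listed above. The function name is hypothetical, every value is parsed as a float for simplicity, and the page-fault counter is spelled ru_majflt as in getrusage(2).

```python
# Field names in the order given in the lsfRusage list.
LSF_RUSAGE_FIELDS = (
    "ru_utime", "ru_stime", "ru_maxrss", "ru_ixrss", "ru_ismrss",
    "ru_idrss", "ru_isrss", "ru_minflt", "ru_majflt", "ru_nswap",
    "ru_inblock", "ru_oublock", "ru_ioch", "ru_msgsnd", "ru_msgrcv",
    "ru_nsignals", "ru_nvcsw", "ru_nivcsw", "ru_exutime",
)


def parse_lsf_rusage(raw):
    """Name the lsfRusage values; -1 (unavailable) becomes None."""
    if len(raw) != len(LSF_RUSAGE_FIELDS):
        raise ValueError(f"expected {len(LSF_RUSAGE_FIELDS)} values, got {len(raw)}")
    usage = {}
    for name, text in zip(LSF_RUSAGE_FIELDS, raw):
        value = float(text)
        usage[name] = None if value == -1 else value
    return usage


# Made-up sample: only the user and system times are available.
sample = ["0.5", "0.25"] + ["-1"] * 17
print(parse_lsf_rusage(sample)["ru_utime"])
```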

SEE ALSO

Related Topics:

lsid(1), getrlimit(2), lsb_geteventrec(3), lsb.acct(5), lsb.queues(5), lsb.hosts(5), lsb.users(5), lsb.params(5), lsf.conf(5), lsf.cluster(5), badmin(8), mbatchd(8)

Files:

LSB_SHAREDIR/<cluster_name>/logdir/lsb.events[.n]



      Date Modified: February 24, 2004
Platform Computing: www.platform.com

Platform Support: support@platform.com
Platform Information Development: doc@platform.com

Copyright © 1994-2004 Platform Computing Corporation. All rights reserved.