HP3000-L Archives

March 2003, Week 3

HP3000-L@RAVEN.UTC.EDU

Subject:
From:
"VANCE,JEFF (HP-Cupertino,ex1)" <[log in to unmask]>
Reply To:
VANCE,JEFF (HP-Cupertino,ex1)
Date:
Fri, 21 Mar 2003 17:15:06 -0500
David wrote (a few days ago):

> Seems that pause and jinfo have a slight timing window when
> they disagree about when a job no longer exists(?).

This point was re-stated in a subsequent email.

> I have jobs that wait for a specified other job to end, then blow
> it off after 'x' seconds if it thinks it must be hung.  Normally
> work just fine, but one time...
>
> :PAUSE 900; JOB=#J5648
> :IF  JINFO('#J5648', "exists")   =   TRUE
> :     TELLOP ABOUT TO TRY TO ABORT #J5648
> :     ABORTJOB #J5648
> ^
> Job does not exist. (CIERR 3042)
> REMAINDER OF JOB FLUSHED.
> CPU sec. = 1.  elapsed min. = 1.  SAT, MAR 1, 2003, 1:03 AM.
>
> Both jobs actually finished within 1 minute, so the 15-minute
> pause plainly ended when it thought the other job no longer
> existed.

Not necessarily.  The PAUSE command is implemented by polling the
JMAT (the table that the SHOWJOB command scans); it does not wait for
a signal from the terminating job or use some other technique.  The
polling algorithm starts out with short sleeps between JMAT scans,
but each subsequent poll occurs after a longer and longer sleep
interval.  (Note: the sleep interval is reset when the job ID or job
state changes.  Assuming you are pausing for a single job, only the
state could change.)  The point of this explanation is simply that
the longer the PAUSE command has been "sleeping" on a particular job
(or session), the longer each real sleep interval will be.
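
To make the polling behaviour above concrete, here is a rough sketch in
Python (not MPE code, and not the actual PAUSE implementation).  The
function name, the starting interval, the growth factor and the cap are
all assumptions made up for illustration; only the general shape (poll,
sleep longer each time, reset the interval when the state changes) comes
from the description above.

```python
import time


def pause_for_job(get_state, max_seconds, sleep=time.sleep,
                  first_interval=1.0, growth=2.0, max_interval=30.0):
    """Poll get_state() until it returns None (job gone) or time runs out.

    get_state is a hypothetical callable standing in for a JMAT scan: it
    returns the job's current state, or None once the job no longer exists.
    Elapsed time is approximated as the sum of the requested sleeps.
    Returns "job_ended" or "timed_out".
    """
    elapsed = 0.0
    interval = first_interval
    last_state = get_state()
    while last_state is not None:
        if elapsed >= max_seconds:
            return "timed_out"
        # Never sleep past the caller's time limit.
        nap = min(interval, max_seconds - elapsed)
        sleep(nap)
        elapsed += nap
        state = get_state()
        if state != last_state:
            # A state change resets the back-off to the short interval.
            interval = first_interval
        else:
            # Otherwise each poll waits longer than the one before.
            interval = min(interval * growth, max_interval)
        last_state = state
    return "job_ended"
```

With these made-up defaults the sleeps run 1, 2, 4, 8, ... seconds, so a
job that ends late in a long pause may not be noticed until the next
(long) sleep finishes.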

In your example it is possible that PAUSE exited because the 900
seconds expired while the job in question still existed (and could be
seen by SHOWJOB), that JINFO found it too, but that by the time
ABORTJOB executed the job was gone.
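
The window described above is the usual check-then-act race.  As a
hedged illustration only, here it is in Python rather than CI script.
JobNotFound, abort_if_present, job_exists and abort_job are hypothetical
names standing in for CIERR 3042, the JINFO test and ABORTJOB.  The
sketch shows one possible way to cope: treat "job does not exist" at
abort time as "already gone" rather than as a failure.

```python
class JobNotFound(Exception):
    """Hypothetical stand-in for 'Job does not exist. (CIERR 3042)'."""


def abort_if_present(job_id, job_exists, abort_job):
    """Abort job_id if it is still around.

    job_exists and abort_job are caller-supplied callables (assumptions
    for this sketch).  Returns "aborted" or "already_gone".
    """
    if not job_exists(job_id):
        return "already_gone"
    try:
        abort_job(job_id)
    except JobNotFound:
        # The job terminated between the existence check and the abort.
        return "already_gone"
    return "aborted"
```

The existence check alone cannot close the window; only handling the
error from the abort itself does.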

In my scripts I always test the return of PAUSE to find out if the
pause ended because the job terminated, or if it ended because the
number of seconds expired.  This is easily done as:
   errclear
   continue
   pause x,job=y
   if hpcierr = -9032 then
      # pause ended due to x seconds expiring
      ...
   else
      # pause ended due to target job terminating
      ...
   endif

> But then jinfo thought it did still exist, and then
> abortjob voted that it didn't.  It was in fact logging off
> normally during all this.

It could have been in a logging off state, but that would not
have mattered since PAUSE and JINFO use the same exact code to
determine if a job exists. They both call the same internal
routine. However, jobs have several transient states during
initialization and termination, so you may have hit this
window.

> I could easily fix this by adding an extra ':pause 1', it (A)
> is this a known problem?  (B) any other workarounds?  (C) Any
> chance of a fix ?

In the branch of the if statement above where HPCIERR = -9032 (the
pause timed out, so you want to abort the job), I'd do something like
this:

  if hpcierr = -9032 then
     # job still exists after pause ended...
     setvar cnt 0
     while setvar(cnt,cnt+1) <= some_limit do
        setvar hpcierr 0
        continue
        abortjob !theJob
        if hpcierr = 0 then
           # exit loop
           setvar cnt some_limit+99
        endif
     endwhile
     if hpcierr <> 0 then
        echo Abortjob on !theJob failed after !cnt attempts
        ...
     endif
  endif


This approach is used by the ABORTJ script (and UDC) on Jazz at:
http://jazz.external.hp.com/src/scripts/abortj.txt


See the Communicator article on Jazz at:
http://jazz.external.hp.com/papers/Communicator/5.5/exp3/ci_enhancements.html
for more info on PAUSE.


HTH,
 Jeff Vance, vCSY

* To join/leave the list, search archives, change list settings, *
* etc., please visit http://raven.utc.edu/archives/hp3000-l.html *
