LISTSERV - HP3000-L Archives

HP3000-L Archives

August 2019, Week 3

HP3000-L@RAVEN.UTC.EDU

Subject:	Re: Job Monitoring Utility
From:	Roy Brown <[log in to unmask]>
Reply To:	Roy Brown <[log in to unmask]>
Date:	Mon, 19 Aug 2019 23:14:00 +0100
Content-Type:	text/plain
Parts/Attachments:	text/plain (117 lines)

It gets to be a battle as to just how shut down the HP3000s can get themselves.

Enjoined by Harper Lee to ‘Go Set a Watchman’, my WATCHMAN job is streamed by any job that I fear may get stuck or just needs a helper, and streamed straight away, lest things are such that it might not be able to start if just scheduled for several hours out.

Using HPSTREAMEDBY, each instantiation knows the job that launched it, and monitors it until it’s time to deliver its payload.

Sometimes it’s practically immediate, as with the payload that adds autoreply capability to the SYSGEN tape backup, as a helper function.

When it isn’t immediate, because WATCHMAN is checking that the job that streamed it isn’t overrunning, it reads a file of expected timings (our partial and full backups take different times on each of the three servers), adds some leeway, and so knows what time to deliver the payload.
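
As a rough illustration of that timing step, here is a minimal Python sketch. It is not the CI code actually used; the table layout, the server names, the run times and the 30-minute leeway are all made up for the example, since the real timings file is not shown in this thread.

```python
from datetime import datetime, timedelta

# Hypothetical timings table: (server, backup type) -> expected run time in minutes.
# The real file's layout and values are not given in this thread.
EXPECTED_MINUTES = {
    ("alpha", "partial"): 90,
    ("alpha", "full"): 240,
    ("beta", "full"): 300,
}


def payload_time(started, server, backup_type, leeway_minutes=30):
    """Return the time at which the watcher should deliver its payload."""
    expected = EXPECTED_MINUTES[(server, backup_type)]
    return started + timedelta(minutes=expected + leeway_minutes)


started = datetime(2019, 8, 19, 2, 30)
print(payload_time(started, "alpha", "full"))  # 2019-08-19 07:00:00
```

Keeping the leeway separate from the expected time means the table only has to record what a normal run looks like on each server.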

At payload time, it checks if the job that streamed it is still running; if not, it just quits.

But if it is still running, it can be set to check RECALL, tell the backup that it’s never going to get that third tape, or whatever, and/or alert the network centre to call me and hoick me out of bed to look at what’s going on.
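
The decision made at payload time can be sketched roughly as follows, again in Python rather than CI. The function name, its parameters and the action strings are hypothetical; they only stand in for the RECALL check, the reply to the backup, and the alert to the network centre.

```python
def payload_actions(parent_running, reply_pending, alert_on_overrun=True):
    """Decide what the watcher does at payload time.

    parent_running: is the job that streamed the watcher still running?
    reply_pending:  did the (hypothetical) RECALL check find an unanswered request?
    """
    if not parent_running:
        # The watched job finished in time, so there is nothing to do.
        return ["quit"]
    actions = []
    if reply_pending:
        actions.append("reply to pending request")
    if alert_on_overrun:
        actions.append("alert network centre")
    return actions


print(payload_actions(parent_running=False, reply_pending=False))  # ['quit']
print(payload_actions(parent_running=True, reply_pending=True))
# ['reply to pending request', 'alert network centre']
```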

But our HP3000s are ‘cleverer’ than that, and I’ve seen them freeze, or stop sending alerts, so that nothing we are running can detect or report anything, while still responding to the external automated checks, like pings, that the network centre uses to detect hung processors.

So at six every morning, a job runs on each of the three HP3000s that are still functioning, which performs a self-check, and then DSLINEs to the other two and checks them in turn.

It alerts us to any other processor that fails to respond to the DSLINE open, or that fails the health check if the line does open.
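
The shape of that mutual check can be sketched in Python as below. This is an assumption-laden outline, not the real job: the server names are invented, and the two callables are hypothetical stand-ins for the DSLINE open and the remote health check.

```python
def failed_peers(me, servers, open_link, health_check):
    """Return (peer, reason) for every other server that fails a check.

    open_link and health_check are caller-supplied callables standing in
    for the DSLINE open and the remote health check; both are hypothetical.
    """
    failures = []
    for peer in servers:
        if peer == me:
            continue  # the self-check is handled separately
        if not open_link(peer):
            failures.append((peer, "no response to link open"))
        elif not health_check(peer):
            failures.append((peer, "health check failed"))
    return failures


servers = ["alpha", "beta", "gamma"]
print(failed_peers("alpha", servers,
                   open_link=lambda p: p != "gamma",
                   health_check=lambda p: True))
# [('gamma', 'no response to link open')]
```

Because each server checks the other two, a single hung machine gets reported twice, and no healthy machine depends on the hung one to raise the alarm.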

We have had two documented cases so far where a server hung, issued no alerts and triggered no external alerts, and the other two successfully reported this. Ageing hardware.....

I think it’s as foolproof as I can make it now, but the HP3000s may yet prove me wrong.

But as you might imagine, the first line of defence, the WATCHMAN jobs, usually catches anything going, mostly the backups overrunning the tape capacities, and mostly it would have been fine to stream the WATCHMAN job at checking time.

Mostly..... 

All of the above is done in the CI, with no program or utility used anywhere, though we do rely on MPEX to make some of it easier. But it could be done without, I think.

Roy

> On 19 Aug 2019, at 21:15, Mark Ranft <[log in to unmask]> wrote:
> 
> Here is an example where I am concerned that the weekly SLTBACK job doesn't complete.  I stream a second job for 6 hours out to check and complain if SLTBACK is still running.
> 
> !job sltback,manager.sys,job;outclass=lp,1,1
> !
> !continue
> !stream sltchk.job.sys;in=,6
> !
> !if jobcnt('sltback,manager.sys') > 1 then
> !  continue
> !  mail.exe "-t [log in to unmask] &
> !  -f [log in to unmask] &
> !  -h !my_mailhost &
> !  -s '!osnode sltback already running!!' &
> !  sltback job already running !hpdatef at !hptimef"
> !  eoj
> !endif
> !
> !loadtap7
> !pause 60
> !
> !run autorep.exe;parm=7
> !file sysgtape;dev=7
> !showdev 7
> !
> !sysgen
> tape verbose store=^sltbk1
> exit
> !
> !tellop ---- job sltback is done
> !tell manager.sys ---- job sltback is done
> !stream sltback.job;day=SUNDAY;at=02:30
> !eoj
> 
> SLTBK1 contains....
> @[log in to unmask]@      <-- or whatever filesets you desire
> ;show;onvs=mpexl_system_volume_set,client_vs;progress;maxtapebuf;compress=high;online=start;partialdb;directory;statistics
> 
> 
> !JOB SLTCHK,manager.sys;OUTCLASS=LP,1
> !# *-----------------------------------------------*
> !# * THIS JOB WILL VERIFY THAT SLTBACK HAS         *
> !# * COMPLETED SUCCESSFULLY. THIS JOB SHOULD BE    *
> !# * STREAMED TO RUN 6 HOURS AFTER THE SLTBACK     *
> !# * JOB HAS STARTED.                              *
> !# *-----------------------------------------------*
> !
> !# *-----------------------------------------------*
> !# * Validate SLTBACK.JOB.SYS  has completed       *
> !# *-----------------------------------------------*
> !
> !SETVAR CIERROR  0
> !RUN MAIN.PUB.VESOFT;PARM=1;INFO= &
> ! "SHOWOUT @[log in to unmask](SPOOL.JSNAME MATCHES 'SLTBACK' AND &
> !  SPOOL.ISOPENED)"
> !
> !IF CIERROR = 0
> !
> !  mail.exe "-t [log in to unmask] &
> !   -f [log in to unmask] &
> !   -h !my_mailhost &
> !   -s ' ****  !osnode SLTBACK job problem ****' &
> !    SLTBACK is still running!"
> !
> !abortio 7
> !pause 2
> !abortio 7
> !PAUSE 2
> !ABORTJOB SLTBACK,manager.sys
> !
> !endif
> !
> !eoj
> 
> 
> Mark Ranft
> Pro 3K
> 
> * To join/leave the list, search archives, change list settings, *
> * etc., please visit http://raven.utc.edu/archives/hp3000-l.html *

