LISTSERV - HP3000-L Archives

HP3000-L Archives

June 2005, Week 4

HP3000-L@RAVEN.UTC.EDU

Subject:	[Fwd: Re: Who watches the watchers?]
From:	Dave Gale <[log in to unmask]>
Reply To:	Dave Gale <[log in to unmask]>
Date:	Fri, 24 Jun 2005 19:50:46 -0400
Content-Type:	text/plain
Parts/Attachments:	text/plain (108 lines)

For some reason I have forwarding as an attachment turned on.

-------- Original Message --------
Subject:        Re: Who watches the watchers?
Date:   Thu, 23 Jun 2005 23:17:02 -0400
From:   Greg Stigers <[log in to unmask]>
Reply-To:       Greg Stigers <[log in to unmask]>
To:     Dave Gale <[log in to unmask]>
References:     <[log in to unmask]>
<[log in to unmask]>



Please consider reposting on-list. I would like to see some discussion on
this thread.

I'm curious about the in-house paging. We've been looking for something like
that. And I am looking at writing a job that wakes up, runs its tests,
calculates the pause until the next time boundary, and then pauses for that long.
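
The wake, test, pause-to-boundary loop described above can be sketched as follows. This is a minimal illustration in Python rather than an MPE job stream, and it assumes the "boundary" is a fixed number of minutes counted from midnight (for example every 15 minutes). The names `seconds_until_next_boundary` and `run_forever` are hypothetical, not from any product mentioned in this thread.

```python
import time
from datetime import datetime, timedelta


def seconds_until_next_boundary(now: datetime, interval_minutes: int) -> float:
    """Return the seconds from `now` to the next interval boundary.

    Boundaries are multiples of `interval_minutes` counted from midnight.
    If `now` is exactly on a boundary, the full interval is returned.
    """
    if interval_minutes <= 0 or (24 * 60) % interval_minutes != 0:
        raise ValueError("interval_minutes must divide evenly into a day")
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    interval = timedelta(minutes=interval_minutes)
    periods_done = (now - midnight) // interval  # whole intervals elapsed today
    next_boundary = midnight + (periods_done + 1) * interval
    return (next_boundary - now).total_seconds()


def run_forever(checks, interval_minutes: int = 15) -> None:
    """Run every check, then sleep until the next boundary, indefinitely."""
    while True:
        for check in checks:
            check()
        # Recompute from the clock each pass so slow checks do not cause drift.
        time.sleep(seconds_until_next_boundary(datetime.now(), interval_minutes))
```

Computing the pause from the clock on each pass, rather than sleeping a fixed interval, keeps the job aligned to the boundary even when the tests themselves take a variable amount of time.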

We still have lots of "I ran" messages, apparently the default in Ecometry.
It's gotten so that those are ignored. The operator has asked me to cut down
on those. What I've been working to improve is monitoring events that
require response. As I posted, I had once worked on a system with the kind
of redundant monitoring I described. Pretty hard to lose both, if the system
is up. And, we have What's Up Gold for those sorts of problems.

Greg Stigers

----- Original Message -----
From: "Dave Gale" <[log in to unmask]>
To: "Greg Stigers" <[log in to unmask]>
Sent: Thursday, June 23, 2005 10:11 PM
Subject: Re: Who watches the watchers?


> Greg,
>
> Some years back I wrote what started as a watch job. After a while it
> started to look more like a cron job.
>
> By adding functionality such as an email client, I could notify the
> operators and the folks with pagers (the pagers used an internet mail site
> to pick up text messages) of problems.
>
> What we did to monitor this process was have it send an email to the
> operations staff every hour, even if everything was all right. If they didn't
> get the email, then something was amiss. There were also timely
> status messages throughout the night on important events.
>
> Later on we added routines like dataset watchers and disc space watchers.
> Mainly, if we had a system problem, I wrote a script to monitor it
> and used the watcher job to run the script at selected intervals.
>
> Just some food for thought.
>
> Dave
>
>
> Greg Stigers wrote:
>
>> I work in an Ecometry shop, and we have recently implemented MasterOp for
>> scheduling (with which I must say I am quite happy). What we lack is any way
>> of automatically monitoring DAYCLOSE, the backup-and-EOD-processing job that
>> comes with Ecometry, and which we have somewhat customized. DAYCLOSE runs a
>> provided ABORTJOB program, which can either abort absolutely everything
>> except its own job / session, or just those things in !HPACCOUNT, neither of
>> which is quite a fit for us. But this does not allow us to leave any other
>> job up and running, as things stand.
>>
>> When DAYCLOSE ends, it streams our various background jobs, and what's left
>> of the scheduling job we had been using. The background jobs include
>> MasterOp, which manages much of the scheduling, and lets us know if anything
>> it has streamed aborts. No complaints there. What we are not watching is the
>> background jobs, although I imagine I could have MasterOp do that.
>>
>> But what if MasterOp fails? Again, what if DAYCLOSE aborts? What kinds of
>> solutions have others employed, so that they can sleep soundly?
>>
>> I'm thinking about a belt and suspenders approach. In a previous shop, on
>> another platform, there was a monitoring process for all other critical
>> processes. Again, we can probably have MasterOp handle that for us. System
>> load also started a separate process, that watched this watcher (ideally
>> running under, say, OPERATOR.SYS). And the first monitor watched it. The
>> assumption was that little besides a system failure would take down both.
>> Unfortunately, we do not have another 3000, to have it DSLINE in and check
>> on things for us.
>>
>> Greg Stigers
>>
>> * To join/leave the list, search archives, change list settings, *
>> * etc., please visit http://raven.utc.edu/archives/hp3000-l.html *
>>
>

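The hourly "all is well" message and the watcher-watches-the-watcher idea from the messages above both come down to one check: has a heartbeat arrived recently enough? A minimal sketch of that check follows, in Python rather than MPE commands. It assumes each watcher records its heartbeat in a small file; the file layout and the names `record_heartbeat` and `is_stale` are hypothetical and not taken from MasterOp, Ecometry, or the job Dave describes.

```python
import os
import time


def record_heartbeat(path, now=None):
    """Write the current time (seconds since the epoch) to the heartbeat file."""
    stamp = time.time() if now is None else now
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(repr(stamp))
    # Rename into place so a reader never sees a half-written file.
    os.replace(tmp, path)


def is_stale(path, max_age_seconds, now=None):
    """Return True if the heartbeat is missing, unreadable, or too old."""
    current = time.time() if now is None else now
    try:
        with open(path) as f:
            last = float(f.read().strip())
    except (OSError, ValueError):
        return True  # no usable heartbeat counts as a failure
    return current - last > max_age_seconds
```

For the two-watcher arrangement, each process would call `record_heartbeat` on its own file every cycle and `is_stale` on the other's file, raising an alert when the partner goes quiet. Treating a missing file as stale means silence is reported as a problem, which is the point of sending the hourly message even when nothing is wrong.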