LISTSERV - HP3000-L Archives

November 1995, Week 5

HP3000-L@RAVEN.UTC.EDU

Subject:	Re: Long Time to recover from System Abort
From:	Jeff Kell <[log in to unmask]>
Reply To:	Jeff Kell <[log in to unmask]>
Date:	Tue, 28 Nov 1995 10:06:20 EST

On Tue, 28 Nov 1995 09:00:51 EST Pete Crosby said:
>Steve Cole <[log in to unmask]> wrote:
>>(1).  Over a two year period I tracked the number of System
>>Aborts and the System Abort number.  The results of the
>>analysis indicated that 80% of the failures were non-repeating.
>>By taking a dump on every failure we extended the system
>>downtime for problems that would not recur.  We dumped
>>the system on any second occurrence of a failure.

Pete replies:
>Please note the above methodology is probably not a good idea for
>the predominance of systems. Many system abort numbers (in fact most
>of the ones that are encountered such as 1457, 1458, 1047, 615, etc.)
>are very generic in their meaning and a recurrence of one of these
>abort numbers does not indicate a recurrence of the original problem.

Then there is my method.  Get the SA number PLUS any subsystems or
secondary status information, and dial up the response center while you
have the operator start a memory dump.  If/when you contact one of the SIT
engineers (they are pretty good about jumping right on aborts, thanks Pete)
give them the abbreviated abort information.  Chances are:
  * they already know about it, and/or
  * they have a patch/fix for it

If the above holds true, you can control-B, RESET, and proceed to bring the
system back.  If not, you've at least got a head start on the dump.
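
The flow above can be sketched as a small decision routine. This is only an
illustration under stated assumptions: the names AbortReport, Action, and
triage are hypothetical, nothing here is an HP tool or API, and the subsystem
strings are placeholders rather than real subsystem names.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RESET_AND_RESTART = "abandon the dump, reset, and bring the system back"
    FINISH_DUMP = "let the memory dump run to completion"


@dataclass(frozen=True)
class AbortReport:
    """Hypothetical record of what the operator reads off the console."""
    abort_number: int
    subsystem: str = ""         # placeholder text, not a real subsystem name
    secondary_status: str = ""

    def summary(self) -> str:
        # The abbreviated abort information to read to the engineer.
        return (f"SA {self.abort_number} subsys={self.subsystem!r} "
                f"status={self.secondary_status!r}")


def triage(known_to_engineer: bool, fix_available: bool) -> Action:
    """Decide what to do with a dump that was started at abort time.

    Assumes the dump is already running while the call is made, so
    FINISH_DUMP costs no extra time beyond what has already elapsed.
    """
    if known_to_engineer or fix_available:
        return Action.RESET_AND_RESTART
    return Action.FINISH_DUMP


# Same abort number, different secondary information: treated as two
# different failures, since the number alone is too generic to compare.
first = AbortReport(1457, subsystem="subsys-A")
second = AbortReport(1457, subsystem="subsys-B")
print(first.summary())
print(first == second)
print(triage(known_to_engineer=False, fix_available=False).value)
```

Keying the record on the abort number plus the subsystem and secondary status
reflects Pete's point that the number by itself does not identify a problem.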

It's also not a bad idea to have a copy of your HPSWINFO file (one that
matches what is actually running, right Pete?) to check patch levels.

I used to follow Steve's recommendations above, but historically it was
usually true that your "lost time" ended up being waiting on an engineer;
if a critical system is down it made reasonable management sense to get it
back up and hope it was a fluke (or a known problem).  But recently I've
almost always gotten in touch with an engineer soon enough to address the
problem as noted above.  Any failures should be reported (else HP may
treat them like flukes too!) whether proactive or reactive.  Dumping and
diagnosing the problem is your call.  As Pete says:

>There are some problems that are very rare timing issues and may
>not recur during the life of your system, but please try to keep a
>broader view. You may not see it again but others might. We can't
>fix a problem if we don't know it exists and problems, like
>everything else in today's world, get prioritized. Frequency of
>occurrence and customer impact are 2 of the most important criteria
>used for prioritization so if you see a problem, system abort or
>not, let us know about it and please get us the data we need to
>address it.

Jeff Kell <[log in to unmask]>
