LISTSERV - HP3000-L Archives

November 1995, Week 5

HP3000-L@RAVEN.UTC.EDU

Subject:	Re: Long Time to recover from System Abort
From:	Denys Beauchemin <[log in to unmask]>
Reply To:	[log in to unmask]
Date:	Tue, 28 Nov 1995 11:06:02 -0500
Content-Type:	text/plain
Parts/Attachments:	text/plain (103 lines)

In a message dated 95-11-28 09:22:19 EST, [log in to unmask] (Pete Crosby)
writes:

>Steve Cole <[log in to unmask]> wrote:
>
>>I have experienced exactly what you are describing but when I
>>analyzed what was going on I made several changes that
>>significantly reduced the downtime.
>>
>>(1).  Over a two year period I tracked the number of System
>>Aborts and the System Abort number.  The results of the
>>analysis indicated that 80% of the failures were non-repeating.
>>By taking a dump on every failure we extended the system
>>downtime for problems that would not recur.  We dumped
>>the system on any second occurrence of a failure.
>>
>>This process not only improved the downtime of the system
>>but also improved reliability.  By dumping the system on the
>>second failure we only applied patches to fix problems that
>>were actually impacting us rather than applying a patch for
>>every failure.  Over the years I have found that the fewer
>>patches you have to apply, the more reliable the OS is.
>>
>
>Please note the above methodology is probably not a good idea for
>the predominance of systems. Many system abort numbers (in fact most
>of the ones that are encountered such as 1457, 1458, 1047, 615, etc.)
>are very generic in their meaning and a recurrence of one of these
>abort numbers does not indicate a recurrence of the original problem.
>It generally leads to people getting very excited because they have
>'had the same failure twice in 3 days' or something similar. There
>are also many situations where multiple dumps of a failure may allow
>us to pinpoint a problem in a system, but a single dump does not
>contain enough data.
>
>Chances are, if your system failed once with a problem then it will
>fail again with that problem. It may not be the same day or the same
>week but it will almost certainly recur. Having a memory dump is the
>only tool we have to identify and prevent a recurrence by applying a
>patch, fixing a corruption issue, or developing a fix before you see
>the problem again (if a fix is not currently available).
>
>There are some problems that are very rare timing issues and may
>not recur during the life of your system, but please try to keep a
>broader view. You may not see it again but others might. We can't
>fix a problem if we don't know it exists and problems, like
>everything else in today's world, get prioritized. Frequency of
>occurrence and customer impact are 2 of the most important criteria
>used for prioritization so if you see a problem, system abort or
>not, let us know about it and please get us the data we need to
>address it.
>
>--
>                            --Pete Crosby
>

I have to chime in on this one.

Sorry Pete, but I believe that Steve's approach is more correct than yours.
You have to step out into the real world sometimes, where everything is not
in a laboratory environment, where time for dissection of a problem is not
available, where users are waiting for the system to run the company, ship
stuff, take orders, etc.

Steve has as much experience managing large on-line mission-critical systems
as anyone, and far more than most.  I believe his methodology to be the
correct one; in fact, I know it is.

I would only amend it to say that I would record the failure number (as he
did) and report it to the response center, for your reading enjoyment and
statistics :-).  I too have experienced one-time system aborts so I know from
experience that they occur (by one-time, I mean that I get an abort and may
not see it again for months, if ever.)
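
The bookkeeping Steve describes (tally each abort by number, take a dump only
when a number repeats) can be sketched roughly as below.  This is only an
illustration: the AbortLog class, its method names and the "two occurrences"
threshold are assumptions made for the example, not an HP tool or anything
Steve posted.  Per Pete's caveat, a repeated abort number is only a hint and
does not prove the same underlying problem.

```python
from collections import Counter


class AbortLog:
    """Hypothetical tally of system aborts, keyed by abort number."""

    def __init__(self, dump_threshold=2):
        # Assumption: dump when the same abort number is seen this many times.
        self.dump_threshold = dump_threshold
        self.counts = Counter()

    def record(self, abort_number):
        """Record one abort; return True if a memory dump should be taken."""
        self.counts[abort_number] += 1
        return self.counts[abort_number] >= self.dump_threshold

    def non_repeating_share(self):
        """Fraction of distinct abort numbers that were seen exactly once."""
        if not self.counts:
            return 0.0
        seen_once = sum(1 for n in self.counts.values() if n == 1)
        return seen_once / len(self.counts)


log = AbortLog()
for number in (1457, 615, 1457):
    if log.record(number):
        print(f"System Abort {number}: repeat, take a dump before restart")
    else:
        print(f"System Abort {number}: first sighting, log it and restart")
```

Every abort still gets recorded, so the numbers can be reported to the
response center even when no dump is taken.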

As for patching, I wholeheartedly agree with Steve: the fewer patches one
applies, the better.  Especially the patches concocted overnight or _just for
you_.  BTW, it is a good idea to be aware of all available patches and to read
the standard HP disclaimer on most patches I have seen in the past:  _Only
install the patch if the situation is detected on the system_.  This means,
don't install the patch just for the hell of it; make sure you need it.

As for database recovery, again Steve is correct.  The transaction manager
(XM) is a mature product which does a terrific job of maintaining database
integrity.  Use it, trust it.  Well, you are using it, you have no choice, but
go ahead and believe in it.  Experience has shown me that databases rarely,
if ever, get corrupted by system aborts.  They get corrupted through human
intervention or hardware problems.  I have forgotten the number of instances
where datasets were restored on top of databases, or left off backups or
restored out of sequence.

Unless the company will close its doors because there may be a database
problem and you can't have a few hours off-hours to fix it, or someone will
die, I would not worry about it.  I would log the transactions and ensure
proper backups though.  But if you are totally paranoid and just can't sleep
because you are wide-eyed with fear of a database problem, then use rollback
recovery, or better yet use dynamic transactions.

Kind regards,

Denys. . .
http://www.hicomp.com/hicomp
