HP3000-L Archives

November 1995, Week 5

HP3000-L@RAVEN.UTC.EDU

Subject:
From: Steve Dirickson b894 westwins <[log in to unmask]>
Reply To: Steve Dirickson b894 westwins <[log in to unmask]>
Date: Tue, 28 Nov 1995 13:22:00 P
Content-Type: text/plain
<<You have to step out into the real world sometimes, where everything is
not in a laboratory environment, where time for dissection of a problem
is not available, where users are waiting for the system to run the
company, ship stuff, take orders, etc. . .
 
Steve has as much experience managing large on-line mission-critical
systems as anyone, and far more than most.  I believe his methodology to
be the correct one, in fact I know it is.
 
I would only amend it to say that I would record the failure number (as
he did) and report it to the response center, for your reading enjoyment
and statistics :-).  I too have experienced one-time system aborts so I
know from experience that they occur (by one-time, I mean that I get an
abort and may not see it again for months, if ever.)>>
 
If I may jump in on this one: I have to respectfully disagree, because
such an approach is, IMNSHO, in the "penny wise, pound foolish" category.
Actually, a better analogy would be to the Amalie oil commercials: "Pay
me now, or pay me later."
 
Certainly there are once-in-a-blue-moon aborts, but our experience is
that the majority of aborts fall in the "repeatable" category. If an
abort-with-dump costs four hours in downtime, that's certainly painful
enough, but the second or third time you invest an hour or two in the
same abort without dumping, you've passed break-even, and you still don't
have anything to show for your pain.
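
To put rough numbers on that break-even point, here is a minimal Python
sketch. It assumes a simple cost model that is not in the original post:
one abort-with-dump costs a fixed number of hours, and each repeat of the
same abort handled without a dump costs a fixed, smaller number of hours.
The function name `aborts_to_break_even` and its parameters are
hypothetical, chosen only for illustration.

```python
import math


def aborts_to_break_even(dump_hours: float, no_dump_hours: float) -> int:
    """Return how many repeats of the same abort, each handled without a
    dump, it takes for cumulative downtime to reach the cost of one dump.

    Assumption (not from the post): both costs are constant per abort.
    """
    if dump_hours <= 0 or no_dump_hours <= 0:
        raise ValueError("downtime costs must be positive")
    # Smallest n such that n * no_dump_hours >= dump_hours.
    return math.ceil(dump_hours / no_dump_hours)


if __name__ == "__main__":
    # Four-hour dump versus one to two hours per undumped abort.
    for hours in (1.0, 1.5, 2.0):
        print(hours, aborts_to_break_even(4.0, hours))
```

Under these assumed figures, a four-hour dump is matched by the second
repeat at two hours apiece, or passed by the third at an hour and a half
apiece, and at that point there is still no dump to analyze.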
 
One factor that hasn't been mentioned is system stability. It seems
likely to me that a more stable system, i.e. one with less-frequent
hardware or software changes, requires that more priority be given to
getting the dump so the problem can be found and corrected. If you change
your hardware or key software more frequently, maybe the dump is less
critical, since you'll be on a different combination within a short time
anyway. For what it's worth, the system I work on has, in the 8.5 years
I've been here, gone from a Series 58 through 950 and 947 to a 957, and
will be a 959 by this time next month; there have been at least a
half-dozen independent OS updates along the way; and we still have a
policy of taking a dump on each abort.
 
<<As for database recovery, again Steve is correct.  The transaction
manager (XM) is a mature product which does a terrific job of maintaining
database integrity.  Use it, trust it.  Well you are using it, you have
no choice, but go ahead and believe in it.  Experience has shown me that
databases rarely, if ever, get corrupted by system aborts.  They get
corrupted through human intervention or hardware problems.  I have
forgotten the number of instances where datasets were restored on top of
databases, or left off backups or restored out of sequence.>>
 
This is a key point that many people don't fully appreciate. In the time
I've been here, and as far back as anyone is willing to talk about, this
system (in all its various incarnations) has *never* lost a byte of user
data due to hardware or system-software failure. Even when experiencing
daily (sometimes more than daily) system aborts due to a
failure-in-progress on a disk drive (two different occasions for that
one), we've still maintained 100% data integrity from a system
standpoint. The number of times data has been trashed due to operator or
developer error, requiring recovery from the backup, is withheld to avoid
embarrassment of the parties in question ;-)
 
Steve Dirickson         WestWin Consulting
(360) 598-6111  [log in to unmask]
