HP3000-L Archives

February 2003, Week 1

HP3000-L@RAVEN.UTC.EDU

From: Christian Lheureux
Date: Tue, 4 Feb 2003 15:03:06 +0100

Wirt wrote, among other things:

> People generally don't perform failure analyses on the HP3000
> when it fails,
> but all failures begin with a single, most "upstream" event
> that eventually
> cascades into the complete loss of the system.

Agreed, unfortunately. They really should, if only to prevent a given
failure from happening again. That's precisely what dump analysis and other
troubleshooting techniques are all about, at least as far as the 3000 is
concerned. AutoRestart should have been mandatory on all HA or close-to-HA
level systems. Dumpreading should have been a mandatory subject for RC
engineers.

There is not always a single root cause, though. Much like other, totally
different kinds of mishaps (I can't tell whether Columbia will belong to this
kind), MPE failures are quite often explained by a chain of causes that need
to exist TOGETHER, where no single cause alone would produce a failure.

> The capacity to perform
> backups obviates that work on a computer system. If good
> backups exists, the
> system can be easily "brought back to life," if need be, on a
> completely
> different set of hardware and never really notice the costs
> of the failure.

Then factor in the probability of human error, which is far higher than that
of software or hardware failure, and you get a rough idea of what can happen.
And yes, backups can be screwed up by human error too, adding yet another
layer of consequences.

> NASA, on the other hand, spends an inordinate amount of time
> working out
> these fault trees, attempting to assess how severe a risk the
> failure of any
> single component presents to the mission at any point in time.

I would assume that, unlike most computer users, NASA would spend a decent
amount of time analyzing possible causes BEFORE they exist, thus doing the
preventive job that is so hard to do with computers. After all, when a
computer system fails, human life is rarely at stake (with the possible
exception of some health-care related systems). On the other hand, when a
shuttle fails, at least 7 lives are at stake, possibly more... Imagine what
could have occurred if the Shuttle had blown up seconds earlier while
overflying Dallas / Fort Worth.

> Secondly, people have repeatedly mentioned how the crew was
> "doomed from the
> beginning" but didn't know it. That too is characteristic of
> all failures.
> The seed of the failure always proves to be latent in the
> system, residing
> there for perhaps months or years, not merely a few days.

Agreed. Of course, it's easy to be wise after the event, but suppose NASA
scientists had paid due attention to those chunks (tiles, foam, ice, or
something else) falling off the Shuttle at lift-off, starting with STS-1. Can
we imagine a different course of events from the one we saw on Saturday?

> Shakespeare wrote
> essentially the same when he had Hamlet say:

The rest is silence.

> Wirt Atmar

Christian Lheureux
