HP3000-L Archives

April 2000, Week 3

HP3000-L@RAVEN.UTC.EDU

Subject:
From:
Michael Hopper <[log in to unmask]>
Reply To:
Michael Hopper <[log in to unmask]>
Date:
Mon, 17 Apr 2000 18:06:48 -0400
Content-Type:
text/plain
> BACKGROUND:
>
> Our 959KS-200 has been running essentially 7 x 24 since Dec
> 1995.  IIRC zero problems with any board in the SPU....  UNTIL:
> Tuesday last week on early swing, when the machine decided
> to experience a totally silent system HALT.

<snip-here, snip-there>

>
> Oh;  yeah:  What caused our ORIGINAL HPMC failure, ala Jeff's
> above;  on the CPU that had been in there and untouched for over
> four years ??..   I haven't the foggiest idea....  HP replaced it and
> it's gone;  I can live without knowing....
>
> ....  just another week at the office.....   NOT !!
>
> Ken Sletten


All I can say is "Owch!"  What an ugly situation.
But, I may be able to beat it.  (Really)


I had just come into a system manager position for a 3000/987.  (Also,
something to keep in mind:  I had ~no~ training on an HP whatsoever.  I
had backed into the job because I was the only one in the company with
any type of basic computer knowledge after the other manager left, so I
was as green as old bread.)

We had ordered three new 3 gig HDs and when they arrived, our CE came
out a day early to do all the configs and hard wiring into the disc
hotel.  No problem.

The next day comes around and we power down the machine, make the final
wiring connections and fire it up.  The box booted, but one of the three
new drives refused to mount and the system wouldn't ever just give up
and continue with the boot process.
Power down, check all the addressing and wiring.  All looked good so we
tried to boot again.  Nothing.
Power down.  Take each drive back out of the racks and just rearrange
their orientation.  Power up.  Now the two original drives won't mount,
but the first problem one does.  (????)
Ok, try all new cables.  Nothing.
Get the RC on the horn and try any number of different combinations of
addressing and wiring configurations.
Finally, after much mucking around, we managed to get all three drives
to mount and the system happily came up.   By this time it's 9:30 in the
morning and I've got people just wandering around waiting on the system
and the big bosses starting to get antsy.
After watching the system hum along for about an hour or so while
running a few tests, the CE deems the system good to go.  Everyone jumps
on their terminals and begins working.
Our CE, being a good guy, decides to hang around ~just in case~ because
of all the problems there were with getting the thing going in the first
place.  (Good thing too.)

About 45 minutes later, <B A N G !> the system comes to a screeching
halt.  After finding that a warm boot doesn't work, we find that we
can't even do a cold boot.  This is about the time the CE adds even more
fun by saying, "Can you smell that?"  Uh-oh.

We trace the source of the interesting odor and find the rack of the new
drives slightly blackened.  After completely pulling apart the drive
hotel and drives, we find that the original drive that was giving us all
the problems had been mounted in its little sled-type base too low and
that allowed the circuit board to sit flush against the metal bottom of
the sled.   Nice.

I'm surprised that the machine (up to that point) ran as long as it did.

Now, here is where the fun ~really~ began.   After taking out the
offending drives, the machine boots and unbelievably, all the original
drives are still working.  (Surprising since we had shorted out the
entire SCSI bus.)
So..  The drives all work, but the data is messed up because parts had
already migrated to the new drives.  (They were all configured as system
volume drives.)
No problem you say, just do a reload and go.

Well, remember when I said that the other system manager had "left"?  We
found out when going to do a reload just how bad this person was at
system administration.    The latest backup tape had been created TWO
MONTHS prior to the date this system crash had happened.  According to
the records, an update to the OS had even taken place at some time within
that two months!
Yes, I should have done a backup the night before the hardware
installation, but remember that I also said that I was ~green~.  I
didn't even have my own id/passwords on the system yet.  I knew
nothing.  I was relying on the CE to cover me in most of this.

So we have no good drives, no good data and no good backups.   I'm
preparing to pack my bags and just move to Mexico at this point.

The RC suggests putting the bad drives back in the system (with the
improperly seated drive corrected).  Booting the machine mounts all
drives (even the bad one) but the data is still gone.   A team of
engineers at the RC spent about 5 hours, dialed into the support modem
in debug mode, moving bits of data a piece at a time back onto the
original drives.   When they finally came back and said that they could
do no more, we rebooted once more (after taking the offending drives
out).   The system was still lost data-wise, but we were actually able
to cut a backup tape.  (Which took about 8 hours, 6 hours longer than
what I later found out to be normal.)

After scratching all the drives (this was the ~really~ scary part, because
there was no going back at this stage, but then, what else was there to
lose), we did a reload and the machine came back up and churned along
like nothing had ever happened.

The data had been recreated so perfectly that we were able to go into
one of the databases and find the exact, half-written record that was
being created when the system dumped out with the short.   From that
point on, I deemed all the engineers all over the world who work(ed) at
the RC, gods.


Total time from initial reboot to final reload and reboot:  36 hours
Total time awake:  41 hours
Total cups of coffee consumed:  No idea. Lost count early.
Total number of pounds lost due to stress:  About 5 (I actually checked)
Total number of games of solitaire played in Windows while waiting for
the debug process/backup/reload to finish:  Well into the hundreds


I continued to serve as administrator for that machine for 4 years after
that without any major issues coming up even once.  (Needless to say, my
first priority was to put together a very solid backup plan.  :-)


M.Hopper
