HP3000-L Archives

May 2004, Week 3

HP3000-L@RAVEN.UTC.EDU

Subject:
From:
Joseph Dolliver <[log in to unmask]>
Reply To:
Joseph Dolliver <[log in to unmask]>
Date:
Wed, 19 May 2004 19:48:32 -0400
Content-Type:
text/plain
Parts/Attachments:
text/plain (135 lines)
You made a good point about the Jamaica cabinet. They are cheap, and the same
thing happened to me as well. I replaced my cabinet and it worked.


Joseph Dolliver
e3K Solutions, Inc.
1774 Stuller Road
New Windsor, MD 21776

443-838-7613 cell
410-848-9503 home
[log in to unmask]

-----Original Message-----
From: HP-3000 Systems Discussion [mailto:[log in to unmask]] On
Behalf Of Paul Christidis
Sent: Wednesday, May 19, 2004 6:11 PM
To: [log in to unmask]
Subject: [HP3000-L] Our HP3K 969 is misbehaving (Very long)


Fellow HP3000-L members,

I'm going to list my recent woes with our beloved 969/220 in the event that
they may induce some helpful suggestion from the more knowledgeable folks
in this list.

About a week ago, in a previous thread, I indicated that our HP3K was
behaving oddly and concluded that it was probably 'hanging' due to the
network scanning that was being done with "WhatsUp Gold".  The IP address
of our HP3K and the associated DTCs were excluded and I thought that was
'end of story'.

That occurred on 5/12/2004 and after the reboot the machine worked fine
until 4:00AM Saturday (5/15/2004), when the system again became
unresponsive.  Due to a scheduled power outage at the remote site where most
of the users are located, the 'hang' was not 'discovered' until Monday
morning, when I drove to our site and had to reboot the machine.  Due to
some trouble with the tape drive and pressure from the users, I was not
able to take a memory dump.  The system came up around 5:00AM on Monday and
stayed up until around 7:30AM, when it 'hung' again.

This time I did take a memory dump and since "WhatsUp Gold" was out of the
picture, no OS patches had been installed in months, no new applications,
nothing on the software side had changed, I suspected a hardware related
problem and thus I called our 3rd party support folks.  They asked if I had
taken a memory dump and indicated that they'll have someone dial-in and
read it.  The verdict was that since a large number of I/Os were pending on
a specific LDEV, it was a good bet that said device was going bad and
should be replaced.

I scheduled enough time to try and take a full backup (our normal Sunday
morning backup did not occur) and then replace the drive.  The fact that
the backup was successful (backed up around 140GB onto 4 DDS3 tapes) caused
me to think that perhaps the trouble was not with the drive; then again,
maybe data could be read from it but not written to it.  The technician arrived
around 8:00PM and proceeded to replace the drive, reboot from the recent
SLT, include the drive in the volume set, etc... and then we were ready to
start the reload.

We use BackPack/iX in our site and thus the first step was to restore from
the full backup the 'BackRest.pub.sys' program that would then be used to
perform the reload.  Said program is extracted using the MPE/iX 'restore'
command.  When the restore command was issued the proper prompts for a
'reply' were displayed on the console, and after the proper tape drive was
specified the program was restored.

I then proceed to run the 'BackRest' program, specify the needed commands
for selecting, reporting, tape drive numbers, etc.. and press '/go'.
AND.... nothing.  No activity whatsoever.  We wait a few minutes to see if
the 'autoload' command 'kicks in'.. nothing.  I abort the process and try
it again, making doubly sure that everything is specified correctly..
again nothing.  I repeat the process 4 to 5 more times.. nothing.  I then
try to extract a fresh copy of 'BackRest' through the MPE/iX restore
command.. Nothing again.  The same command that worked 10 minutes earlier
does not work, it just sits there.

I decide to reboot the machine again, through the Cntl B, RS sequence, and
attempt the same restores (MPE/iX and BackPack/iX) multiple times.. Again
nothing.

I suggest that perhaps the 'bad LDEV' was LDEV1 and not the one identified
by the memory dump. The technician reboots the machine and uses IDE and
MAPPER (I think) to check LDEV1 (no errors are detected), he then brings up
the system and we try the MPE/iX restore again.  This time it works.  I
specify the 'BackPack' reload and IT works... A few hours later the system
is up and functioning.  It is Tuesday morning around 4:30AM  and I go home
to get some sleep.

The system ran fine all day Tuesday.  Then on Tuesday evening, before going
to bed, I tried to connect from home and again no response.  I drive to the
site and sure enough the system is hung.  I reboot and again have a problem
with the tape drive and thus the memory dump is unusable.  I go home around
2:00AM leaving  the system operational.  A few hours later I get paged by
our remote users.  I try to connect from home but cannot go beyond the
password prompt.  I tried to connect using Telnet, NS/VT, Telnet to both of
our DTCs, and specifying ';parm=-1' to bypass the UDCs, dialed in by modem,
and combinations of the previous.. same results (could not go beyond the
password prompt).

I drive to the site again.  The first line on the console was my session
'log-off' from 2:00AM, followed by a job login around 3:00AM and then a
dozen or so jobs and sessions.  The last successful log-on occurred at
4:30AM, however, none of my earlier attempts from home were logged. I try
to log-on from the console. I entered the 'hello ...' string (with
';parm=-1') and I'm prompted for my password, and the session hangs as soon
as I supply it, forcing me to issue the 'Cntl B, TC' sequence.

This time I get a good memory dump, return the system to the users and call
our hardware support folks informing them that, probably, the problem was
not with the LDEV that we just replaced.  They indicate that they'll be
reading the dump shortly.

While writing all this I was notified that the memory dump still indicates
that a large number of I/Os were pending against the same LDEV that we had
replaced.

So.  Does any of this suggest to any of you where the problem may be?
Is there any circuitry in the Jamaica enclosures that could be causing this
LDEV to 'drop out of sight' of the OS?  Are we going to have to 'back step'
from that LDEV and replace the Jamaica box, cables, terminators, controller,
etc?

Any ideas are appreciated.

Regards
Paul Christidis

* To join/leave the list, search archives, change list settings, *
* etc., please visit http://raven.utc.edu/archives/hp3000-l.html *
