HP3000-L Archives

March 2005, Week 4

HP3000-L@RAVEN.UTC.EDU

From: "Dave Powell, MMfab" <[log in to unmask]>
Date: Wed, 23 Mar 2005 12:13:48 -0800

This might be it.  Our backup job stops stm before the backup (actually sysgen
with @.@@, etc) and restarts it right after, before the verify.  Sounds like a
good match for your description.  I had actually gotten suspicious of this and
yesterday added a 5-second pause after the stm start.  Today I'll change it to
five minutes.  Thanks.

So, why do we shut down stm like that and not shut down the network?  Partly
because section 6.9 "Backing Up Your System" of my C.75 System Software
Maintenance Manual seems to suggest doing it that way, and partly because our
network printers are still busy printing the spoolfiles that were created just
before the backup job.  But I am far from sure that this is ideal and am open
to suggestions.  Does anyone think it would be better to restart stm after the
verify?  Should I set up a separate job to make just an SLT when I can
shut down the network?

Also, is there any real way to know how long we need to wait after the
stm start before we try to put the tape drive back online?  Maybe HP can
give us an STMINFO function :)


----- Original Message -----
From: "John Clogg" <[log in to unmask]>
To: <[log in to unmask]>
Sent: Wednesday, March 23, 2005 10:43
Subject: Re: [HP3000-L] Hung tape enhancement request


We had a similar problem with DLT drives.  In our case it was not caused
by the drive itself, so I don't think the type of medium matters.  This
may or may not relate to your situation, but here is what was happening:

As a prelude to performing the synchronization point of our online
backup each night, we shut down all jobs and sessions (except the backup
job) and shut down networking (NETCONTROL STOP).  We learned that any
time you stop and restart networking, it is necessary to restart the STM
monitor processes used by the CSTM diagnostics, so we run STMSHUT.DIAG
before shutting down the network, and run STMSTART.DIAG after restarting
it.  This was the source of the problem.  When the diagnostic monitor is
started by STMSTART (or when the system is started), it goes through a
hardware mapping process.  That process runs diagnostics on peripherals
as it encounters them.  If you mount a tape while that is happening, the
diagnostic process and the AVR (automatic volume recognition) process
get into some kind of deadlock, and a reboot of the system is the only
way to get the drive back.

The solution for us was to forbid any tape mounts or other use of the
tape drives for several minutes after restarting the diagnostic monitor.
It's kind of like not swimming after eating.
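
The quiet-period rule above can be sketched in a few lines of Python.  This
is only an illustration of the timing logic, not anything HP supplies: the
function name and the five-minute default are assumptions (the thread says
only "several minutes"), and a real job would do the waiting in MPE job
control.

```python
from datetime import datetime, timedelta

# Assumption: five minutes stands in for "several minutes"; tune for your site.
QUIET_PERIOD = timedelta(minutes=5)


def seconds_until_tape_safe(stm_started_at, now, quiet_period=QUIET_PERIOD):
    """Return the seconds left before tape mounts are allowed again.

    stm_started_at: when the diagnostic monitor was restarted.
    now:            the current time.
    Returns 0.0 once the quiet period has fully elapsed.
    """
    remaining = (stm_started_at + quiet_period) - now
    return max(0.0, remaining.total_seconds())


if __name__ == "__main__":
    started = datetime(2005, 3, 23, 23, 0, 0)
    # Two minutes after the restart, three minutes of quiet time remain.
    print(seconds_until_tape_safe(started, started + timedelta(minutes=2)))
```

A backup job would pause for the returned number of seconds before putting
the drive back online or mounting a tape.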

I hope this is helpful to someone!

John
-----Original Message-----
From: HP-3000 Systems Discussion [mailto:[log in to unmask]] On
Behalf Of Dave Powell, MMfab
Sent: Tuesday, March 22, 2005 3:25 PM
To: [log in to unmask]
Subject: Hung tape enhancement request

If it's not too late to make requests...   How about fixing whatever causes
our tape drive to hang about 2 or 3 times a year?  HP seems to think it is
tied in with the ghost-session problem that others have mentioned.  Might be,
but we never have anything hang except the tape drive.

Details from last night's hang:
A500,  MPE 7.5 pp2 (just updated Saturday),  DDS-3, an old tape (used about 20
times with no prior problems).
Drive now shows as 'UNAVAIL', owned by 'SYS'.
Backup ran normally.  Finished 11:05 pm.
Drive failed to come back on line.  Backup job noticed the drive was
"UNAVAIL", waited about 10 times longer than it usually takes, then started
the verify anyway, at 11:16 pm.
Verify said "DEVICE UNAVAILABLE  (FSERR 55)" and "VSTORE ENCOUNTERED FOPEN
FAILURE ON DEVICE FILE "T"  (S/R 2213)".  Then it just sat there.
At 2:51 a watchdog job concluded that the backup job was hung, did a showdev 7
(it was 'UNAVAIL', owned by 'SYS'), issued a few abortios on it (but never got
the "no io to abort" message), then abortjobed it.  Job aborted ok first time.
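
For illustration, here is a minimal Python sketch of the bounded wait the
backup job performs, changed so that the caller can skip the verify instead
of starting it against an unavailable drive.  Everything named here is
hypothetical: get_status stands in for whatever parses the showdev output,
and the "AVAIL" status string is an assumption ('UNAVAIL' is the only status
quoted above).

```python
import time


def wait_for_drive(get_status, timeout_s, poll_s=10.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll the drive until it is available or the timeout expires.

    get_status: hypothetical callable returning the drive status string.
    Returns True if the drive became available, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if get_status() == "AVAIL":   # assumed status string
            return True
        if clock() >= deadline:
            return False              # give up; do not start the verify
        sleep(poll_s)


if __name__ == "__main__":
    # Simulated drive that never comes back, as in the hang described above.
    if not wait_for_drive(lambda: "UNAVAIL", timeout_s=1.0, poll_s=0.2):
        print("drive still unavailable; skipping verify and alerting operator")
```

Returning False lets the job report the problem right away rather than
sitting in a failed verify until a watchdog notices hours later.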

This morning I fed it ever-increasing abortio while-loops, and finally got the
'no io to abort' message after about 2,600 total abortios.  But that does not
mean that there were 2600 ios to abort.  If I do a single abortio at the
console, I still don't get the 'no io to abort' message.  Further while-loops
in jobs with no pauses seem to get the warning after between 50 and several
hundred repetitions.  2600 just happens to be how many total tests it took for
me to get impatient enough to feed it a big enough while-loop with no pauses.
Tape drive is STILL 'UNAVAIL', owned by 'SYS'.

Dave ("planning to reboot tonight") Powell,  MMfab

* To join/leave the list, search archives, change list settings, *
* etc., please visit http://raven.utc.edu/archives/hp3000-l.html *
