HP3000-L Archives

March 2000, Week 3

HP3000-L@RAVEN.UTC.EDU

From:
Paul H Christidis <[log in to unmask]>
Date:
Tue, 21 Mar 2000 18:06:11 -0800
Fellow Listers

A few days have passed, I'm almost caught up with my tasks, I've gotten enough
sleep and I can look back with less emotion on the events that transpired last
week, and try to make some sense out of them.  Who knows, it may even provide
some clues/insight/help to someone else on this list.

One day last week one of the programmers came and asked if I could increase the
job limit and let into the system the next job in the queue that had been
waiting for some time.  While doing that we noticed that we had an unusually
'long' job queue.  We attributed it to some 'long running' jobs that we saw
executing to satisfy some recent requirements.
Around lunch time of the same day I went into 'glancexl' to see if any jobs
were being impeded and to my surprise I saw a predictive support process
"DISKTRKP", that normally runs at 11:00 PM, consuming 99.3% of one of the CPUs.
It became obvious that this was the reason for our long job queue.  I contacted
the RC and after a few attempts we were able to 'kill' the process and 'free up'
the 2nd CPU which in turn shortened our job queue and everyone was happy.
The RC could not hazard a guess as to why the process, which should have finished
minutes after it began, ran for over 13 hours.  They simply indicated that said
process could/should be disabled since our machine does not have any of the disk
drives that said process examines.

The next day I got an early morning call from one of the programmers and I was
informed that we could not issue any 'DSLINE' commands (the session would hang)
and that some remote users could not establish a connection.
When I used 'nslookup' to see if I could 'ping' any other hosts, my session
hung.  Batch jobs that were trying to e-mail notifications through 'sendmail'
running on our HP were also hung.  I could establish sessions through our DTCs
but any remote users trying to connect with NS/VT would hang/time out.

I called the RC.  Most of the CE's suggestions resulted in additional hung
sessions, while 'glancexl' showed an ever increasing number of sessions being in
the 'impeded' state, most of them running 'vtserver'.  It became apparent that
the only way out of this would be a 're-boot' and thus we 'broadcasted' the
necessary warnings and proceeded to take the system down.  (Programmers reported
that their sessions would also hang after the 'bye' command was issued.)

I issued the shutdown command and proceeded to re-boot the system.   Just after
the 'Interact with IPL' prompt the system displayed a couple of messages
indicating that it had begun the 'booting' process and then displayed:
     IPL error: Bad IPL checksum
It then re-displayed the 'boot' menu and showed "WARN C5F0" on the console status line.

I repeated the process a few times with the same result and thus I called the RC
again.  After some additional diagnostics and attempts to 'bypass' the error it
was concluded that the only remaining option was to reboot from the latest SLT
tape and reload the data.

Anticipating that we would be doing a reload, I had requested the backup and
SLT tapes from our 'off site' storage and began the process.  Since I would be
applying two sets of tapes, the most recent partial and the most recent full, I
anticipated the reload to take about 5 hours and although we would lose that
morning's processing we should be able to 'pick up' with the evening's
production run.  I must say that the CE was very helpful in assisting me and
providing instructions for re-booting from the SLT and bringing the system to
the 'pre reload' state.

I began the reload with the "partial's" tapes and went to get some food.   The
first set finished on time and I started with the first tape from our full which
contained most of our databases.  Since most of the databases happened to also
be on the partial I figured that this step would be very quick and then I could
apply the other tapes that contained the bulk (in terms of numbers) of the
files.  I must say that what I expected to occur when the tape with the databases was
mounted was that the 'reload' command of the utility would 'detect' that the
bulk of the files on the tape had a 'newer' version already on disk, it would
restore a small number of files and terminate.

To my surprise the utility, RR, began reading the tape, read the directory and
began performing all kinds of disk accesses, disk accesses that seemed more
appropriate for actually restoring the files than just comparing their dates.
After some minutes, and since the utility kept on reading the tape and thrashing
the disks, I became concerned that something might be wrong and decided to
interrupt the process and contact the vendor.   After some effort the
technician that I contacted reached one of the developers and
called me back to say that the utility behaved as expected.  It needed to
'unpack' a portion of the file first before it could compare the dates.

I went back to the console and resumed the 'reload'.  Two and one half hours
later the process finished and informed me that out of the 246 datasets on the
tape only 5 were restored; the rest did not qualify because the ones on disk were
newer.

/Rant ON
I was very upset, needless to say.  It took longer to restore those 5 files of a
few thousand sectors than it took to store the entire tape of 115,000,000
sectors.  AND IT DOES NOT MAKE ANY SENSE.  The information needed to determine
where the 'newer' file resides (tape or disk) exists in the directory that IS
stored in the beginning of the tape.   I can accept the fact that some
additional time will be needed to 'locate' the correct blocks within the tape,
but more time than it required for the store?   My observations support the
suspicion that each file was 'restored' in its entirety as a temporary file and
then the "date comparison" was performed.
\Rant OFF
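
To make the point of the rant concrete, here is a minimal sketch (in Python,
purely for illustration) of the "compare dates from the directory first"
approach I would have expected.  This is NOT how RR works internally; the
vendor says it has to unpack part of each file before it can compare dates.
The names TapeEntry and select_for_restore, and the sample file names, dates
and sector counts, are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TapeEntry:
    """One hypothetical entry in the directory at the start of the tape."""
    name: str
    modified: datetime
    sectors: int


def select_for_restore(tape_directory, disk_modified):
    """Return tape entries that are missing on disk or newer than the disk copy.

    Only the tape directory and the on-disk modification dates are consulted;
    no file data is read from the tape at this stage.
    """
    selected = []
    for entry in tape_directory:
        on_disk = disk_modified.get(entry.name)
        if on_disk is None or entry.modified > on_disk:
            selected.append(entry)
    return selected


# Hypothetical sample data: a full backup taken on 10 March.
tape_directory = [
    TapeEntry("CUSTDB01", datetime(2000, 3, 10), 400_000),
    TapeEntry("CUSTDB02", datetime(2000, 3, 10), 2_000),
    TapeEntry("ORDERS01", datetime(2000, 3, 10), 1_500),
]

# Modification dates of files already reloaded from the partial backup.
disk_modified = {
    "CUSTDB01": datetime(2000, 3, 14),  # disk copy is newer: skip
    "CUSTDB02": datetime(2000, 3, 9),   # tape copy is newer: restore
}

to_restore = select_for_restore(tape_directory, disk_modified)
print([entry.name for entry in to_restore])
print(sum(entry.sectors for entry in to_restore), "sectors to read from tape")
```

With a selection step like this, only the blocks belonging to the selected
files would need to be located and read, which is why I expected the step to
be quick.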

Subsequent to the above step the rest of the files were reloaded, then the 3rd
party software from their separate tape, the system was 'shutdown' and rebooted
once again and around 2:00 AM the next morning we were back in business and I
was very tired.

The big unknown in all this, since we were not able to take any memory dumps, is
what caused the problem.  Does the "Bad IPL checksum" error indicate disk
corruption?  Does the impediment of the 'VTSERVER' indicate the same?  What
about the 'runaway' "DISKTRKP"?


Regards
Paul Christidis
