HP3000-L Archives

April 1998, Week 3

HP3000-L@RAVEN.UTC.EDU

Subject:
From:
Stan Sieler <[log in to unmask]>
Reply To:
Stan Sieler <[log in to unmask]>
Date:
Sat, 18 Apr 1998 14:24:10 -0700
Tom writes:
...
> After 20 years of working with HP-3000 systems and various iterations of
> MPE, I find it unbelievable that STORE/RESTORE is still the weakest link
> in the entire OS.  You would think that a data backup routine would be
> the most rock-solid part of any OS (unless CSY considers a user's data
...

I agree.

> And now this totally ridiculous business of having to completely shut
> down and restart a *PRODUCTION SYSTEM* to kill off a hung session.
> Folks, Un*x has had the capability of killing an individual process
> since its inception many years ago.  Why not MPE??  Why is an OS so
> poorly designed that you have to sacrifice the work of ALL users to get
> rid of ONE offending session.

You usually don't have to.  The only reason to shut-down/restart in this
case would be one or more of:

   - the offending process is preventing you from accessing a file
     you want to access;

   - you want to use the tape drive (or other non-sharable device)
     that the offending process has open;

   - the offending process is looping, eating CPU;

   - you really need to use the terminal that the offending process
     was run from.

If none of the above is true, you don't *have* to reboot.

For some of the above four problems, workarounds exist (e.g.,
using "mv.hpbin.sys" to move a file out and replace it,
or using ALTPROC to put the looper into the ES queue).

However, all too often...there's no workaround, and you have to
reboot.

The problem with unabortable processes is that they've said:
   "hey, I'm going to be modifying a critical data structure ...
    one that's so important that if you aborted me, you'd be
    *really*, *really* sorry."

...and MPE says: "ok, I won't let you be aborted".

Unix completely, utterly, and totally lacks this ability ... and it
shows.  You can easily (as root) blow away processes and corrupt your
file system, crash your machine, etc.

I've listened to people asking for the equivalent of the Unix "kill -9"
command, and I haven't been convinced .... because implementing it
compromises reliability so severely.

Steve Cooper said it best:

   If you want kill -9, you *SHOULD BE RUNNING UNIX*.

   MPE is reliable, first and foremost.  Kill -9 sacrifices that reliability.


That said, I sympathize with you.  I've had to reboot machines because
of hung or looping unabortable STOREs.


So...what can we do to address the problem of non-abortable processes?

Simply, try to minimize the amount of time a process is unabortable,
and to reduce the number of processes that use this feature.

Generally, a process that is unabortable is in "critical mode".

There are two basic problems with "critical mode" on the 3000:

   1) it's overused and not used well

      STORE is a big offender here.  It all too frequently
      runs in critical mode when it would appear to be unnecessary.
      But the *real* offense in STORE is that when it is in critical
      mode, it fails to adequately check for ABORT signals.

      Every loop inside STORE should have some kind of code like:

               if abort_is_requested then
                  exit;

      If STORE did this consistently, we'd see no more reports of STORE
      looping and eating CPU time, or producing thousands of lines of
      unwanted messages after a <Break> :ABORT.

      However, STORE isn't always abusing critical mode by oblivious
      looping.  Sometimes, it locks a resource and then goes into
      a wait state.  This takes us to #2.

   2) we lack the ability to say "please protect this data structure
      because I'm *reading* it" ... the only mechanism we have says
      "please protect this data structure because I'm accessing it
      and you don't know if I'm reading, writing, or both".
      As a result, when a process obtains a lock on an internal
      data structure, we have to assume it might be modifying it.

      So, if STORE has "locked" some files, and then (for some other reason)
      gets hung (perhaps in the I/O system), we see the problem: we can't
      abort it.
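
To make the abort check in #1 concrete, here is a minimal sketch in Python.
It is not STORE's code; store_files, abort_requested and write_to_tape are
hypothetical names, and a threading.Event stands in for whatever mechanism
would signal a pending abort.

```python
import threading


def store_files(files, abort_requested, write_to_tape):
    """Hypothetical STORE-like loop that polls for an abort request.

    files           -- iterable of file names to back up
    abort_requested -- threading.Event, set by whoever wants the abort
    write_to_tape   -- callable that writes one file (assumed helper)
    Returns the list of files written before the loop ended.
    """
    written = []
    for name in files:
        # Check at the top of every iteration so a pending abort is
        # honored before any further work is started.
        if abort_requested.is_set():
            break
        write_to_tape(name)
        written.append(name)
    return written
```

The check sits between units of work, so the loop stops at a point where
nothing is half done.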

MPE could address the above problems, although it would take some effort.

Things to investigate:

   - drop set_critical;  replace with a cb_lock that allows the caller
     to say "read" or "write".

   - change cb_lock to a function that returns "good" or "failed",
     and require checking of the result.  (This would facilitate breaking
     deadlocks safely.)  Define cb_lock to return "failed" if the lock
     cannot be obtained within 5 minutes.

   - Change the "being STOREd" flag mechanism of STORE to allow resetting
     the "being STOREd" bits for files locked by an aborted/hung STORE process.
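
The first two ideas could be sketched as follows, in Python.  This is only an
illustration under my own assumptions: ControlBlockLock is a hypothetical
class, not MPE's cb_lock, and it gives writers no priority over readers.

```python
import threading


class ControlBlockLock:
    """Hypothetical reader/writer lock with a timeout (not MPE's cb_lock)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0      # number of processes holding a read lock
        self._writer = False   # True while one process holds the write lock

    def lock(self, mode, timeout):
        """Return True ("good") if the lock was obtained within `timeout`
        seconds, False ("failed") otherwise.  Callers must check the result."""
        if mode not in ("read", "write"):
            raise ValueError("mode must be 'read' or 'write'")
        with self._cond:
            if mode == "read":
                # Readers only have to wait for a writer to finish.
                ok = self._cond.wait_for(lambda: not self._writer, timeout)
                if ok:
                    self._readers += 1
            else:
                # A writer needs exclusive access.
                ok = self._cond.wait_for(
                    lambda: not self._writer and self._readers == 0, timeout)
                if ok:
                    self._writer = True
            return ok

    def unlock(self, mode):
        """Release a lock previously obtained with the same mode."""
        with self._cond:
            if mode == "read":
                self._readers -= 1
            else:
                self._writer = False
            self._cond.notify_all()
```

A caller would write something like "if not cb.lock("read", 300): give up
cleanly", where 300 seconds is the 5 minutes suggested above.  Because the
lock records the mode, the system would know that a process holding only read
locks is not in the middle of modifying the structure.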

The last of those points (resetting the "being STOREd" flags) is important.  Most processes obtain locks for a very short period
of time.  STORE is unusual in that it thinks it has a requirement to "lock" files
until they get successfully written to tape (or until the end of the reel they are
on).  Why does it think so?  Because we *defined* it that way.  You know that once
a STORE process opens the tape, every file that was "covered" by the STORE *will*
be on the tape (assuming no abort of the STORE at that point).  This simply doesn't
happen on Unix backups.  fbackup, cpio, pax, tar (the ones that come with HP-UX
for free) don't have that concept.  As a result, you can start a backup, take the
final tape with you, and not discover for months that an important file was
removed after the start of the backup (before it was written to tape), or that the
file was being modified during backup.

To put it differently, at least MPE *tries* to ship a professional, quality
backup program as a bundled part of FOS.  HP-UX ships four jokes, and not very
good jokes at that.
(Yes, both MPE and HP-UX have various third party backup products available, but
I'm not discussing them here.)

Back to STORE...so it has a need to "lock" a file from the start of the STORE to
(at worst) the end of the reel it's stored on.  As it turns out, the locking mechanism
it uses is completely different from the normal internal locking mechanism, and *could*
(if desired) be easily modified to notice that the STORE process had been aborted/terminated.
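
As a rough illustration of that modification, here is a Python sketch that
releases flags whose owning STORE process no longer exists.  The table layout
and the names (being_stored, live_store_pins) are my assumptions for the
example, not how STORE actually records its flags.

```python
def reset_stale_store_flags(being_stored, live_store_pins):
    """Clear "being STOREd" flags left behind by STORE processes that are gone.

    being_stored    -- dict: file name -> id of the STORE process that
                       flagged it (hypothetical layout)
    live_store_pins -- set of ids of STORE processes that still exist
    Returns the sorted list of file names whose flags were reset.
    """
    stale = sorted(name for name, owner in being_stored.items()
                   if owner not in live_store_pins)
    for name in stale:
        # The owner has terminated, so nothing will ever clear this flag
        # unless we do it here.
        del being_stored[name]
    return stale
```

This only covers a STORE that has been aborted or has terminated; a STORE that
is hung but still exists would still appear in live_store_pins.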

--
Stan Sieler                                          [log in to unmask]
                                     http://www.allegro.com/sieler.html
