HP3000-L Archives

July 2007, Week 2

HP3000-L@RAVEN.UTC.EDU

Subject:
From: Jeff Kell <[log in to unmask]>
Reply To: Jeff Kell <[log in to unmask]>
Date: Tue, 10 Jul 2007 15:33:27 -0400

You may have seen my earlier comments about the apparently "hung" drive on our A500 that led up to this alarming display:

(MANAGER.SYS)# dstat all
>  LDEV-TYPE            STATUS    VOLUME           VOLUME SET - GEN
>  ----------           -------   ----------------------------------
>     1-ST336753LC      MASTER    MEMBER1          MPEXL_SYSTEM_VOLUME_SET-0
>     2-ST336753LC      MEMBER    MEMBER2          MPEXL_SYSTEM_VOLUME_SET-0
>    30-MAS3367NC       MASTER-MD MEMBER1          PRODUCTION-0
>    31-MAS3367NC       MEMBER-MD MEMBER2          PRODUCTION-0
>    32-MAS3367NC       MASTER-MD MEMBER1          DEVELOPMENT-0
>    33-MAS3367NC       MEMBER-MD MEMBER2          DEVELOPMENT-0
>    34-MAS3367NC       MASTER-MD MEMBER1          WORKSPACE-0
>    40-        BUSY    MASTER-MD MEMBER1          PRODUCTION-0
>    41-        BUSY    MEMBER-MD MEMBER2          PRODUCTION-0
>    42-        BUSY    MASTER-MD MEMBER1          DEVELOPMENT-0
>    43-        BUSY    MEMBER-MD MEMBER2          DEVELOPMENT-0
>    44-MAS3367NC    *DISABLED-MD MEMBER1          WORKSPACE-0

Now that we're back up and in production, I thought I'd follow up with a summary of what happened.

This involved an HP 2300 disk enclosure, ldevs 30-34 on one Ultra LVD channel and ldevs 40-44 on another.

We tried 'suspendmirrvol' of one of the sets to see if they would split and carry on about their business, but that resulted in a system abort (we had a wide variety of system aborts over the next day).

The system would not reboot with ldev 44 in place, hanging during I/O configuration.  MAPPER showed "phantom" subdevices hanging off of the existing disk drives of the affected channel.  Removing ldev 44 allowed the system to boot.

At least until it tried to do XM recovery on the affected volume sets, then another series of system aborts.

Pulled all the mirrors off the affected channel, and the system would boot, but with mirrors "pending" their mates.  Trying 'suspendmirrvol' again resulted in system aborts (usually), an error mount (in one case), or a recovered half of a two-volume set (master or member).

Tried removing the primary copies of the failed sets and booting with the mirrors, and suspendmirrvol'ing them, with much the same random results.  Even had some go from almost OK (pending) to error mount, e.g.:

> dstat all
>  LDEV-TYPE            STATUS    VOLUME           VOLUME SET - GEN
>  ----------           -------   ----------------------------------
>    30-MAS3367NC       MASTER-MD MEMBER1          PRODUCTION-0
>    32-MAS3367NC     *PENDING-MD MEMBER1          DEVELOPMENT-0
>    34-MAS3367NC     *PENDING-MD MEMBER1          WORKSPACE-0
>    40-MAS3367NC       MASTER-MD MEMBER1          PRODUCTION-0
> (MANAGER.SYS)# 
> BEGIN VOLUME MOUNTING OF DEVELOPMENT:MEMBER1 (LDEV 42) (AVR 8)
> *** ERROR MOUNTING VOLUME ***
> COULD NOT MOUNT VOLUME ON LDEV   42.  INFO     14; SUBSYS      0 
> VOLUME WILL BE MOUNTED AS ERROR VOLUME.
> *** ERROR MOUNTING VOLUME ***
> COULD NOT MOUNT VOLUME ON LDEV   32.  INFO     14; SUBSYS      0 
> VOLUME WILL BE MOUNTED AS ERROR VOLUME.
> 
> ERROR VOLUME MOUNTED ON LDEV 42 (AVR 27)
> 
> (MANAGER.SYS)# dstat all
>  LDEV-TYPE            STATUS    VOLUME           VOLUME SET - GEN
>  ----------           -------   ----------------------------------
>    30-MAS3367NC       MASTER-MD MEMBER1          PRODUCTION-0
>    32-MAS3367NC     ERROR_MOUNT MEMBER1          DEVELOPMENT-0
>    34-MAS3367NC     *PENDING-MD MEMBER1          WORKSPACE-0
>    40-MAS3367NC       MASTER-MD MEMBER1          PRODUCTION-0
>    42-MAS3367NC     ERROR_MOUNT MEMBER1          DEVELOPMENT-0



In the end, I had the PRODUCTION master intact but no member volume, one DEVELOPMENT member intact but no master, and one intact copy of WORKSPACE (the original ldev 34 which survived suspendmirrvol).

Replaced ldev 44 and reinserted the remaining drives we had pulled to get it to boot without a system abort. But again, the corruption of the mirror pairs was a bit much: it would abort during XM recovery until I selectively removed a drive, rebooted, scratched its mate, and rebooted again to get the system back up with all drives functional.  Then it was a matter of rebuilding WORKSPACE and PRODUCTION and reloading.

The reload was another war story.  In addition to DDS backup, we also do a weekly full backup of PRODUCTION and DEVELOPMENT to disk, and FTP the image to a Linux box that in turn participates in our enterprise Legato backup suite and its automated tape library.  Great, this runs on Saturdays, I'll just copy that back over and rebuild the volume sets from there without hassling with the DDS tapes.

We start with the FTPed image on the remote:

> ftp> dir
> 200 PORT command successful. Consider using PASV.
> 150 Here comes the directory listing.
> -rw-rw----    1 520      505      4294870272 Jul 07 19:45 condorfb

About four gigs.  That one is challenging enough just to retrieve.  You have to finagle around a GET with buildparms set so that the records add up to what you want (besides, you have to get with CODE=2501 [STORE] or the :restore won't work).

But thinking a nice fat blockfactor might make FTP go faster, I used rec=128,64,f,binary.  It copied just fine.  But :restore didn't like it and doesn't explain itself well.  I had run into this one before; it just didn't dawn on me at the time, since I had only done it once, 3 years ago.  You have to specifically have rec=128,1,f,binary or :restore won't work:

> RESTORE  *in;xyz.manager.sys;SHOW
>
> MON, SEP 13, 2004, 10:32 PM
>
> This message is reserved.
> STORE/RESTORE WAS UNABLE TO OPEN PERMANENT DISC FILE
>  "/SYS/BACKUP2/STORE2" OR "/SYS/BACKUP2/STORE2" (S/R 2300)
>
> RESTORE aborted because of error. (CIERR 1091)
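
The "records add up" arithmetic can be checked with a quick sketch. This assumes that a positive record size such as rec=128 is counted in 16-bit words (so 256-byte records); the function name is made up for illustration and is not part of any MPE or FTP tool. On that assumption, the image size from the listing above divides evenly into 256-byte records.

```python
# Assumption: a positive MPE record size is counted in 16-bit words,
# so rec=128 means 256-byte records.
BYTES_PER_WORD = 2


def records_needed(size_bytes: int, rec_words: int = 128) -> int:
    """Number of fixed-length records needed to hold size_bytes."""
    rec_bytes = rec_words * BYTES_PER_WORD
    # Ceiling division: a partial final record still occupies a whole record.
    return -(-size_bytes // rec_bytes)


image_size = 4294870272  # byte count from the FTP listing above
print(records_needed(image_size))
```

The resulting count is the number of records the local file has to be able to hold before the GET starts.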

OK... so finally got the store moved, and copied to a file with a digestible structure, and started the restore while I took a nap.

Only later to find... it quit halfway through looking for the "next backup volume".  

Our full backup had grown larger than 4GB, which causes :STORE to silently break up the backup into 4GB-ish chunks, subsequent pieces getting the .1, .2 suffix in the POSIX namespace.  The old FTP process just copied the original file over, and not any extra chunks.
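
One way a transfer job could avoid missing the extra pieces is to build its file list from the directory contents rather than from a single fixed name. The following is a minimal sketch, not the actual fix used here: the function name and the sample listing are hypothetical, and it assumes only the .1, .2 suffix naming described above.

```python
def store_chunk_names(base: str, listing: list[str]) -> list[str]:
    """Return base plus its consecutively numbered continuation files.

    Assumes continuation pieces are named base.1, base.2, ... and
    stops at the first missing number.
    """
    present = set(listing)
    if base not in present:
        return []
    names = [base]
    n = 1
    while f"{base}.{n}" in present:
        names.append(f"{base}.{n}")
        n += 1
    return names


# Hypothetical directory listing, for illustration only.
listing = ["STORE2", "STORE2.1", "STORE2.2", "OTHERFILE"]
for name in store_chunk_names("STORE2", listing):
    print(name)
```

Every name returned would then need to be transferred, and restored, as part of the same backup.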

So we finished the day out (finally) restoring from DDS.  And fixing the FTP process.

Well, long day, but reasonably happy ending.  How's that for an on-topic post?  :-)

Jeff

* To join/leave the list, search archives, change list settings, *
* etc., please visit http://raven.utc.edu/archives/hp3000-l.html *
