You may have seen my earlier comments about the apparently "hung" drive on our A500 that led up to this alarming display:
> (MANAGER.SYS)# dstat all
> LDEV-TYPE STATUS VOLUME VOLUME SET - GEN
> ---------- ------- ----------------------------------
> 1-ST336753LC MASTER MEMBER1 MPEXL_SYSTEM_VOLUME_SET-0
> 2-ST336753LC MEMBER MEMBER2 MPEXL_SYSTEM_VOLUME_SET-0
> 30-MAS3367NC MASTER-MD MEMBER1 PRODUCTION-0
> 31-MAS3367NC MEMBER-MD MEMBER2 PRODUCTION-0
> 32-MAS3367NC MASTER-MD MEMBER1 DEVELOPMENT-0
> 33-MAS3367NC MEMBER-MD MEMBER2 DEVELOPMENT-0
> 34-MAS3367NC MASTER-MD MEMBER1 WORKSPACE-0
> 40- BUSY MASTER-MD MEMBER1 PRODUCTION-0
> 41- BUSY MEMBER-MD MEMBER2 PRODUCTION-0
> 42- BUSY MASTER-MD MEMBER1 DEVELOPMENT-0
> 43- BUSY MEMBER-MD MEMBER2 DEVELOPMENT-0
> 44-MAS3367NC *DISABLED-MD MEMBER1 WORKSPACE-0
Now that we're back up and in production, I thought I'd follow up with a summary of what happened.
This involved an HP 2300 disk enclosure, with ldevs 30-34 on one Ultra LVD channel and ldevs 40-44 on another.
We tried 'suspendmirrvol' of one of the sets to see if they would split and carry on about their business, but that resulted in a system abort (we had a wide variety of system aborts over the next day).
The system would not reboot with ldev 44 in place, hanging during I/O configuration. MAPPER showed "phantom" subdevices hanging off of the existing disk drives of the affected channel. Removing ldev 44 allowed the system to boot.
At least until it tried to do XM recovery on the affected volume sets, then another series of system aborts.
Pulled all the mirrors off the affected channel, and the system would boot, but with mirrors "pending" their mates. Trying 'suspendmirrvol' again resulted in system aborts (usually), an error mount (in one case), or a recovered half of a two-volume set (master or member).
Tried removing the primary copies of the failed sets and booting with the mirrors, and suspendmirrvol'ing them, with much the same random results. Even had some go from almost OK (pending) to error mount, e.g.:
> dstat all
> LDEV-TYPE STATUS VOLUME VOLUME SET - GEN
> ---------- ------- ----------------------------------
> 30-MAS3367NC MASTER-MD MEMBER1 PRODUCTION-0
> 32-MAS3367NC *PENDING-MD MEMBER1 DEVELOPMENT-0
> 34-MAS3367NC *PENDING-MD MEMBER1 WORKSPACE-0
> 40-MAS3367NC MASTER-MD MEMBER1 PRODUCTION-0
> (MANAGER.SYS)#
> BEGIN VOLUME MOUNTING OF DEVELOPMENT:MEMBER1 (LDEV 42) (AVR 8)
> *** ERROR MOUNTING VOLUME ***
> COULD NOT MOUNT VOLUME ON LDEV 42. INFO 14; SUBSYS 0
> VOLUME WILL BE MOUNTED AS ERROR VOLUME.
> *** ERROR MOUNTING VOLUME ***
> COULD NOT MOUNT VOLUME ON LDEV 32. INFO 14; SUBSYS 0
> VOLUME WILL BE MOUNTED AS ERROR VOLUME.
>
> ERROR VOLUME MOUNTED ON LDEV 42 (AVR 27)
>
> (MANAGER.SYS)# dstat all
> LDEV-TYPE STATUS VOLUME VOLUME SET - GEN
> ---------- ------- ----------------------------------
> 30-MAS3367NC MASTER-MD MEMBER1 PRODUCTION-0
> 32-MAS3367NC ERROR_MOUNT MEMBER1 DEVELOPMENT-0
> 34-MAS3367NC *PENDING-MD MEMBER1 WORKSPACE-0
> 40-MAS3367NC MASTER-MD MEMBER1 PRODUCTION-0
> 42-MAS3367NC ERROR_MOUNT MEMBER1 DEVELOPMENT-0
In the end, I had the PRODUCTION master intact but no member volume, one DEVELOPMENT member intact but no master, and one intact copy of WORKSPACE (the original ldev 34 which survived suspendmirrvol).
Replaced ldev 44 and reinserted the remaining drives we had pulled to get it to boot w/o system abort. But again, the corruption of the mirror pairs was a bit much, and it would abort during XM recovery until I could selectively remove a drive, reboot, scratch its mate, and reboot again to get the system back up with all drives functional. Then it was a matter of rebuilding WORKSPACE and PRODUCTION and reloading.
The reload was another war story. In addition to DDS backup, we also do a weekly full backup of PRODUCTION and DEVELOPMENT to disk and FTP the image to a Linux box, which in turn participates in our enterprise Legato backup suite and its automated tape library. Great: this runs on Saturdays, so I'd just copy that back over and rebuild the volume sets from there without hassling with the DDS tapes.
We start with the FTPed image on the remote:
> ftp> dir
> 200 PORT command successful. Consider using PASV.
> 150 Here comes the directory listing.
> -rw-rw---- 1 520 505 4294870272 Jul 07 19:45 condorfb
About four gigs. That one is challenging enough just to retrieve: you have to finagle a GET with buildparms set so that the records add up to what you want (besides, you have to GET with CODE=2501 [STORE] or the :RESTORE won't work).
But thinking a nice fat blockfactor might make FTP go faster, I used rec=128,64,f,binary. It copied just fine, but :RESTORE didn't like it, and it doesn't explain itself well. I had run into this one before; it just didn't dawn on me at the time, since I'd only hit it once, three years ago. You have to specifically use rec=128,1,f,binary or :RESTORE won't work:
> RESTORE *in;xyz.manager.sys;SHOW
>
> MON, SEP 13, 2004, 10:32 PM
>
> This message is reserved.
> STORE/RESTORE WAS UNABLE TO OPEN PERMANENT DISC FILE
> "/SYS/BACKUP2/STORE2" OR "/SYS/BACKUP2/STORE2" (S/R 2300)
>
> RESTORE aborted because of error. (CIERR 1091)
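For what it's worth, the arithmetic behind the record format can be sanity-checked on the Linux side before pulling the image back: rec=128,1,f,binary means fixed 256-byte records (128 16-bit words), and the image in the listing above divides into them evenly. A quick sketch of that check (the size is taken from the FTP dir listing; this was not part of our original process):

```shell
# A store-to-disk image built for rec=128,1,f,binary should be a whole
# number of fixed 256-byte records (128 words x 2 bytes/word).
SIZE=4294870272          # bytes, per the FTP 'dir' listing of condorfb
RECSIZE=256              # 128 16-bit words

if [ $((SIZE % RECSIZE)) -eq 0 ]; then
  echo "OK: $((SIZE / RECSIZE)) records of $RECSIZE bytes"
else
  echo "WARNING: not a whole number of $RECSIZE-byte records"
fi
# prints: OK: 16776837 records of 256 bytes
```

It won't catch a wrong blockfactor by itself, but a size that doesn't divide evenly is an early warning that :RESTORE is going to reject the file.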
OK... so I finally got the store image moved, copied it to a file with a digestible structure, and started the restore while I took a nap.
Only later did I find... it had quit halfway through, looking for the "next backup volume".
Our full backup had grown larger than 4 GB, which causes :STORE to silently break the backup into roughly 4 GB chunks, with the subsequent pieces getting a .1, .2 suffix in the POSIX namespace. The old FTP process just copied the original file over, and none of the extra chunks.
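The eventual fix to the copy step amounts to walking the whole chunk family rather than just the base file. A minimal sketch of the idea, with empty placeholder files standing in for the real chunks (names are illustrative, and a real script would do the FTP transfer where the echo is):

```shell
# STORE splits a >4 GB store-to-disk backup into condorfb, condorfb.1,
# condorfb.2, ... so the copy step must pick up every piece.
# Demo setup: placeholder files in a scratch directory.
dir=$(mktemp -d)
cd "$dir"
touch condorfb condorfb.1 condorfb.2   # stand-ins for the backup chunks

BASE=condorfb
for f in "$BASE" "$BASE".[0-9]*; do
  [ -e "$f" ] || continue              # skip unmatched glob pattern
  echo "transfer: $f"                  # real script would FTP each chunk
done
```

All the chunks have to make it back for :RESTORE to find its "next backup volume".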
So we finished the day out (finally) restoring from DDS. And fixing the FTP process.
Well, long day, but reasonably happy ending. How's that for an on-topic post? :-)
Jeff
* To join/leave the list, search archives, change list settings, *
* etc., please visit http://raven.utc.edu/archives/hp3000-l.html *