HP3000-L Archives

August 1997, Week 4

HP3000-L@RAVEN.UTC.EDU

From: Christian Lheureux <[log in to unmask]>
Date: Fri, 22 Aug 1997 06:56:43 -0400
On Aug. 21, Richard Corn wrote:

< At one time, message files were susceptible to corruption if
<the accessing program was aborted or the system aborted. If

Case #1, program abort: I haven't seen MSG/IPC file corruption for a while
now, and maybe not since they went NM (5.0).
Case #2, system abort: Still true. You'll get the all-too-famous FSERR 105
if you run a CM program against your NM message file after a startup where
the system went belly up and you did not check, and rebuild if necessary,
the message files.

<you were using a message file to queue records for processing
<and a failure occurred, message file corruption would cause
<loss of such queue records.

Sure. If you have to rebuild the file, you must first purge it, thereby
losing everything in it. It is a good queueing mechanism, but do not let the
queue build up too much, just in case.

<Is this still the case today or has this issue been dealt with
<as MPE has moved forward?  Can message files be relied upon to
<retain records if the system or writing program fails? What
<is the current reliability of message files as a persistent
<queueing mechanism?

Hmmm... Let me think... When I was working on the 5.0 beta test, we asked CSY
if they intended to attach the newly-NM message files (newly at that time, of
course) to XM. They said 'Man, good idea, let's think about it'. When they
came back to us some days later, it was more like 'OK, the code is too
complex, we'll never be able to stabilize IPC and XM compatibility. We'd
rather DISABLE (yes, I've written DISABLE!!!) the FSETMODE feature for IPC
files'. OK, we tried, we lost.

I think that IPCs are a perfectly valid queueing mechanism, provided your
system is stable, i.e. does not have too many aborts/hangs/halts and other
disruptions and, above all, you do not let your IPC build up too big. In
other words, the less you put in the file, the less you'll lose if the
system goes belly up.

Then Gavin wrote:

<When I was at Quest, we used message files heavily as a persistent
<queueing mechanism, and I recall very few cases where the files
<actually became corrupt.  It is thus my impression that message files
<make a pretty good persistent queueing mechanism.  Whenever I say

Hmm, I was with Quest, too, for a while. That reminds me that I have
something I should post to the list re: their recent call for support
engineers. Call me an eye-witness. Back to the subject: I have worked
extensively on Netbase/Shareplex, and I know we used several of those IPCs
as queues. Short of system aborts/hangs/halts, I do not remember seeing an
IPC corruption. In other words, I support Gavin's opinion.

<this though, a lot of people start telling horror stories about the
<evils of message files and how they are "always getting corrupted".

IPCs always corrupt? Well, it sure was true in the 'bad old days' of CM and
MPE V/E, with **lots** of FSERR 105, which is the genuine IPC corruption
message. And, once again, there ain't nothin' you can do about it... short of
purging the file, losing whatever data is in it, and rebuilding it. Nowadays,
I certainly wouldn't say 'always'.

<Back in MPE/IV days when message files were brand new, there certainly
<were a lot of bugs, and pre MPE-XL systems may have tended to be more
<subject to corruption of message files on system failure.  Things are
<much better on /iX today.  Another problem is that "HP" has never really
<understood what a useful utility message files are.  There are a number
<of magical things that message files do which have no simple replacement,
<such as guaranteed notification of process termination, "pipes", queue
<files, persistent queue files, etc.

OK, there *are* other mechanisms, but they are less efficient. You may use a
circular file, but it's still CM code, it can lose data if it's overwritten,
and it's not bullet-proof either. You may use an MPE flat file, but then your
application has to handle all the functions that IPC does for you, i.e. the
writer waiting on a full file, the reader waiting on an empty file, no
overwriting, read and write pointers, and so on. Feasible, but not easy.
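
Just to give an idea of what 'not easy' means, here is only the in-process
part of that bookkeeping, hand-rolled as a bounded queue. This is generic
POSIX threads C, nothing MPE-specific, and it deliberately ignores the hard
parts a flat-file queue would add (locking across processes, crash-safe
updates of the pointers on disk):

/* The bookkeeping an IPC file gives you for free, hand-rolled as a
 * bounded in-memory queue.  Purely illustrative.                  */
#include <pthread.h>
#include <string.h>

#define QSIZE   128              /* queue depth                    */
#define RECLEN  256              /* fixed record length            */

static char slot[QSIZE][RECLEN];
static int  rd = 0, wr = 0, count = 0;      /* the two pointers    */
static pthread_mutex_t mu        = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

void q_put(const char *rec)      /* writer waits on a full file    */
{
    pthread_mutex_lock(&mu);
    while (count == QSIZE)
        pthread_cond_wait(&not_full, &mu);
    memcpy(slot[wr], rec, RECLEN);          /* no overwriting      */
    wr = (wr + 1) % QSIZE;
    count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&mu);
}

void q_get(char *rec)            /* reader waits on an empty file  */
{
    pthread_mutex_lock(&mu);
    while (count == 0)
        pthread_cond_wait(&not_empty, &mu);
    memcpy(rec, slot[rd], RECLEN);
    rd = (rd + 1) % QSIZE;
    count--;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&mu);
}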

<As far as I know, it's not possible to corrupt a message file on MPE/iX
<with anything less than a system abort (i.e. program aborts are not a
<problem) and I'm not sure exactly what windows of vulnerability to
<system aborts there are, and what sort of corruption (or simply loss of

I'm pretty sure there are significant windows of vulnerability. When I
benchmarked Shareplex, and I did that extensively, a system abort on the
shadow system would almost always result in an FSERR 105. Curiously enough,
the probability of getting an FSERR 105 if the primary system failed was
waaaaaay lower! But I have some ideas about possible explanations.

<data that didn't make it to disk before the system failed) are really

Corruption? Mainly FSERR 105 and loss of all data in the IPC. No more, no
less.

<possible.  As I recall, message files (now in Native Mode since ~5.0)
<do not support being attached to the transaction manager (an unfortunate
<oversight as this ought to have made them virtually failure proof).

OK, attaching IPCs to XM would have been an *excellent* idea. Unfortunately,
when, as part of the beta test team, I (and some HP French Response Center
colleagues) raised the idea with CSY, it was turned down as close to
impossible to implement. But, once again, I support Gavin's opinion, in that
it would have made IPCs virtually bullet-proof.

In that situation, no wonder Quest has replaced some IPCs with MPE flat
files *attached to XM* as of Shareplex 9.7. The NBM file is now a flat file,
and maybe others are too. For those who do not know Shareplex, the NBM is the
main engine for transporting data from the primary to the shadow system, and
it resides on the primary. For those really interested, e-mail me privately
for details.

<A very common reason that people *think* that message files are corrupt
<and unreliable is that they simply don't have a clue about programming
<with them.  The most common problem is failure to account for FSERR 151:

<CURRENT RECORD WAS LAST RECORD WRITTEN BEFORE SYSTEM CRASHED (FSERR 151)

<which is a feature of message files, not an indication of corruption.
<When a message file is open for write and the system fails, the next
<time the file is opened, the last record in the file is marked with a
<flag.  When a program later reads that record, the read "fails" with
<the above error.  In fact the read didn't fail at all, the system is
<just trying to be helpful and let you know where a system interruption
<occurred.  This is useful for writing recovery code, detecting partial
<transactions, knowing that the "writer IDs" in the data are now going
<to start over again, etc.

<For most (simple) applications, the correct logic for reading a message
<file is to simply ignore FSERR 151.  A common mistake is to treat the
<error as a recoverable error and simply issue another read without
<processing the data returned with the "error".  This results in a single
<lost record every time the error is encountered, which is a tricky bug
<that may never be found.

Hmmm, I'm not going to comment much on that, 'cause I've not worked much on
FSERR 151, mostly on the horrendous FSERR 105. But I know of many programs
that do not test return codes enough (or at all), and of very, very few
programs that test return codes too much!
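
Still, Gavin's point is worth making concrete. Here is a minimal read loop
along the lines he describes, as an HP C/iX sketch. The intrinsic
declarations are simplified from memory, but the logic is the whole point:
the record returned with FSERR 151 is valid data and must be processed, not
skipped:

/* Sketch of a message-file read loop that treats FSERR 151 as
 * informational, per Gavin's description.                          */
#include <mpe.h>                    /* ccode(), CCE/CCG/CCL         */

#pragma intrinsic FREAD
#pragma intrinsic FCHECK

extern void process(char *rec, int len);   /* your record handler   */

void drain_queue(short filenum)
{
    char  buf[256];
    short fserr;
    int   len, cc;

    for (;;) {
        len = FREAD(filenum, buf, -256);    /* -256 = 256 bytes     */
        cc  = ccode();
        if (cc == CCG)                      /* CCG: end of file     */
            break;
        if (cc == CCL) {                    /* CCL: an "error"      */
            FCHECK(filenum, &fserr);
            if (fserr != 151)
                break;                      /* genuine error: stop  */
            /* FSERR 151 is a feature: buf holds the last record
             * written before the crash.  Fall through and process
             * it -- reissuing the read here silently loses it.     */
        }
        process(buf, len);
    }
}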

<Message files are subject to loss of data not posted to disk when the
<machine fails, just like other types of file.  You can use the (relatively
<expensive) FCONTROL 6 operation to force all data up to the current
<time to be flushed to disk.

I have tested some programs with and without the FCONTROL 6 feature. The
performance hit is, depending of course on the program's IPC I/O activity,
quite high. Another factor is the size of the IPC: the bigger it is, the
longer the FCONTROL 6 takes to execute. I once tested it on a 3.2-million-
sector file (roughly 800 MB), and it took ... well ... a while ... in the
minutes range. That is because, as far as I understand, FCONTROL 6 against
an IPC writes 0s everywhere in your file EXCEPT (hopefully) over your
meaningful data. Imagine a case where you have a 100,000-record IPC with
only 10 meaningful records in it: you still pay for the whole file.
Obviously, that is one more reason to ensure the IPC activity on your system
performs OK. Performance consultancy and Workload Manager could help you a
bit.
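
If you still want the FCONTROL 6 guarantee without paying for it on every
write, one compromise is to post to disk every N records, accepting that up
to N records can be lost on a failure. A sketch, with the constant and the
simplified declarations as illustrative assumptions only:

/* Sketch: amortize the cost of FCONTROL 6 by flushing every
 * FLUSH_EVERY writes instead of after each one.  You trade away
 * durability for up to FLUSH_EVERY records on a system failure.  */
#include <mpe.h>

#pragma intrinsic FWRITE
#pragma intrinsic FCONTROL

#define FLUSH_EVERY 100          /* tune to your loss tolerance   */

void queue_write(short filenum, char *rec)
{
    static int since_flush = 0;
    short dummy = 0;

    FWRITE(filenum, rec, -256, 0);          /* one 256-byte record */
    if (++since_flush >= FLUSH_EVERY) {
        FCONTROL(filenum, 6, &dummy);       /* force posts to disk */
        since_flush = 0;
    }
}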

OK, it's a long message; I hope I'm not wasting our bandwidth. The song is
over, thought I'd something more to say. For those of you interested in
other details (technical, performance, and so on), e-mail me privately. I
happen to know the IPC *guru* (an understatement, actually) at the HP French
Response Center. As I do not want my former colleague flooded with messages,
I will not post his name here, but I can give it to interested people.

And, Gavin, let's exchange some e-mail about Quest, which happens to be our
common former employer.

Thanks for your patience,

Christian Lheureux
MPE/iX Consultant
Somewhere in France, but ...
Available Worldwide !!!
