At 11/20/98 08:30 AM, Simpkins, Terry wrote:
>Paul H. Christidis says:
>>then I'd like to also include some mechanism
>> for distinguishing restarts due to system failures.
>
>ME TOO!! boy would that be nice for remembering to recover all
>those KSAM files that were open at the time of the crash.
>Of course this happens sooooooo seldom, that you tend to forget
>things. (or is that the age?)
Our (Navy) application had to determine whether the application was
properly shut down, whether the application aborted, or whether the system
failed while the application was running.
We ended up using a "status" record, some Global RINs, some RIN locking
mechanisms, and some common code to ensure that the applications were
properly recovered/restarted after application or system failures.
The mechanisms we used have worked very well with batch applications and
were fairly simple. The mechanisms used with interactive applications were
more complicated because we ended up adding some "brokering functions" (for
passing primary control of the Global RINs to another process when the
first user of the application exits from the application).
Below is a very simplified description of the mechanism used for batch:
o When the application opens the database (or KSAM file), it updates
a "status" record to indicate that the application is "running" and to
record the current system "cold load ID" value. The application also
locks a global RIN.
o Duplicate runs of the application are prevented because the duplicate
run can't lock the global RIN.
o When the application shuts down normally, it updates the "status"
record, setting the status to "not running" and resetting the stored
"cold load ID" value to zero.
o If the application aborts, the status record still has the flag set
that says the application is running, and it has the system "cold
load ID" under which the application last ran.
o When the application is restarted, it sees the status record that
indicates the application is running, but tries to lock the global
RIN anyway. Since the application isn't running, it CAN lock the
global RIN, which tells the application that something is wrong.
The application then checks some things and decides that the
previous run must have aborted. It performs maintenance, and if
the maintenance completes successfully, restarts itself for a normal
run. If maintenance fails, it makes sure everyone knows about it,
and posts a PRINTOPREPLY to the console. The operators know better
than to reply to the PRINTOPREPLY request, so it sits out there until
someone fixes the problem.
o If the system fails while the application is running, recovery is
similar to the application abort recovery above, except that the
application detects that the current "cold load ID" doesn't match the
one on the status record, so it assumes that the system has failed
and responds accordingly.
o Sometimes there is an application failure and the system is rebooted
before the problem is fixed. On restart, the application then concludes
not that it aborted, but that the system failed while it was running.
Thus, the application's recovery code must do no damage when it
wrongly assumes that the system failed.
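The steps above can be sketched in a few lines. This is only an
illustration in Python, not the code we ran on the 3000: StatusRecord,
mark_running, mark_stopped and classify_previous_run are hypothetical
names, the cold load ID is passed in as a plain integer, and the sketch
assumes the caller has already locked the global RIN successfully (a
failed lock means a duplicate run, which is handled before this point).

```python
from dataclasses import dataclass
from enum import Enum


class PreviousRun(Enum):
    CLEAN_SHUTDOWN = "clean shutdown"
    APPLICATION_ABORT = "application abort"
    SYSTEM_FAILURE = "system failure"


@dataclass
class StatusRecord:
    """Hypothetical stand-in for the persistent "status" record."""
    running: bool = False
    cold_load_id: int = 0  # zero means "not running"


def mark_running(status: StatusRecord, current_cold_load_id: int) -> None:
    """Startup: flag the application as running under this cold load ID."""
    status.running = True
    status.cold_load_id = current_cold_load_id


def mark_stopped(status: StatusRecord) -> None:
    """Normal shutdown: clear the flag and zero the cold load ID."""
    status.running = False
    status.cold_load_id = 0


def classify_previous_run(status: StatusRecord,
                          current_cold_load_id: int) -> PreviousRun:
    """Decide how the previous run ended.

    Assumes the global RIN is already locked, so no other copy of the
    application is running right now.
    """
    if not status.running:
        return PreviousRun.CLEAN_SHUTDOWN
    if status.cold_load_id != current_cold_load_id:
        # "Running" flag left behind under an older cold load ID.
        return PreviousRun.SYSTEM_FAILURE
    # "Running" flag left behind, same cold load ID, nobody holds the RIN.
    return PreviousRun.APPLICATION_ABORT
```

Note that an abort followed by a reboot comes back from
classify_previous_run as SYSTEM_FAILURE, which is why the recovery for
that case has to be safe after a plain application abort as well.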
The above description is a simplified version of the methodology used, but
it should give you some ideas as to how you might approach determining
whether:
1) the application shut down normally,
2) the application failed, or
3) the system failed while the application was running.
John
--------------------------------------------------------------
John Korb
Innovative Software Solutions, Inc.
The thoughts, comments, and opinions expressed herein are mine
and do not reflect those of my employer(s), or anyone else.