LISTSERV - HP3000-L Archives

February 1998, Week 3

HP3000-L@RAVEN.UTC.EDU

Subject:	Re: NM Stack Overflow
From:	Stan Sieler
Date:	Fri, 20 Feb 1998 18:29:51 -0800

Lee writes/asks:
> We've recently run into a situation in our major on-line application
> whereby one subprogram continually aborts with a stack overflow --

> to 20000000) with no change.  We're also assuming that a child process
> inherits the stack characteristics of its parent (I think that's how MPE/iX

Nope.

Use LINKEDIT's ALTPROG to change the NM stack of CTONEUSC.EXEC.CEDDEV if
you want a permanent override of the SYSGEN default.

> **** NATIVE STACK OVERFLOW
> ABORT: CTONEUSC.EXEC.CEDDEV
>        PC=a.00a84240 dbg_stack_overflow_trace+$2c
> NM* 0) SP=41892850 RP=a.005103dc trap_handler.handle_setdump+$a8
> NM  1) SP=418927d0 RP=a.00511904 trap_handler+$3ec
> NM  2) SP=41892750 RP=a.0025ba50 leave_system_code+$114
> NM  3) SP=418920d0 RP=a.0025b928 ?leave_system_code+$8
>          export stub: 2c0.00296790 PROC'EXIT000454+$48
> NM  4) SP=41892050 RP=2c0.00233c10 dbput+$180        <---  why a second dbput
>  label ??
> NM  5) SP=4188e970 RP=2c0.00233a6c ?dbput+$8
>          export stub: 1214.001b66b8 p1108b+$6dd8       <--- aborted process
>  at DBPUT location
> NM  6) SP=4188e870 RP=1214.001a8c94 p1108a$001$+$698
> NM  7) SP=41882470 RP=1214.000635a8 p1108+$5980
> NM  8) SP=418624b0 RP=1214.0001428c p0201+$1e74
> NM  9) SP=41846730 RP=1214.00010e34 _start+$bdc
> NM  a) SP=41846370 RP=1214.00000000
>      (end of NM stack)


Re: "why a second dbput label".  The first one (?dbput+$8) is the
return address back into the export stub for dbput.  dbput+$180
is the return address from ?leave_system_code back to dbput.
(?leave_system_code isn't in the same module as dbput,
so the linker/loader put an import stub (mislabelled "export stub")
at PROC'EXIT000454+$48 (or so) to enable dbput to call leave_system_code.)

Ok...import/export stubs explained (briefly!)

p1108b calls dbput (from p1108b+$6dd0, return address is...+$6dd8).

But...dbput isn't linked into your program.  So, the linker & loader
conspire to make what COBOL thought was a local procedure call
actually call some helper code (called an "import stub").  The import
stub (in your program file) jumps into XL.PUB.SYS and lands not
in dbput, but in more helper code called an "export stub".  The
export stub is at ?dbput.  The first instruction of the export
stub is "BL dbput"...which jumps to dbput, leaving ?dbput+8 as
the return address.

The import stub used by p1108b is in the same module (program file)
as p1108b.  The export stub it jumps to is in XL.PUB.SYS.

dbput then runs for a while.  If it didn't die, it would exit
back to ?dbput+$8, which would exit back to p1108b+$6dd8 (without
having to go through the rest of the original import stub mentioned
above).

Similarly, dbput's call to leave_system_code goes through an
import stub (in XL.PUB.SYS) and an export stub (in NL.PUB.SYS)
before arriving in leave_system_code.
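
A toy sketch of the hop sequence just described, in case it helps to see
it written down.  This is only a model of the description above, not of
the real linker/loader: the function name call_path and the string
labels are made up for illustration, and the "same module means a direct
call" branch is an assumption.

```python
def call_path(caller, target, caller_module, target_module):
    """Return (outbound hops, return hops) for one procedure call.

    Hypothetical helper: models the description in the text only.
    """
    if caller_module == target_module:
        # Assumption: a call within one module needs no stubs.
        return [caller, target], [target, caller]
    outbound = [
        caller,
        f"import stub for {target} (in {caller_module})",
        f"export stub ?{target} (in {target_module})",
        target,
    ]
    # The return goes back through the export stub (at ?target+$8),
    # then straight to the caller, skipping the import stub.
    inbound = [
        target,
        f"export stub ?{target}+$8 (in {target_module})",
        caller,
    ]
    return outbound, inbound


if __name__ == "__main__":
    out, back = call_path("p1108b", "dbput", "program file", "XL.PUB.SYS")
    print(" -> ".join(out))
    print(" -> ".join(back))
    out, back = call_path("dbput", "leave_system_code",
                          "XL.PUB.SYS", "NL.PUB.SYS")
    print(" -> ".join(out))
```

The point of the model is the asymmetry: four hops on the way in, three
on the way back, which is why ?dbput+$8 shows up in the trace but the
import stub does not.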

However...dbput didn't complete :(

Let's look at the stack ... comparing the SP values
from "NM 9" to "NM 8" (and "NM 8" to "NM 7", etc.) we notice
that SP changed by the following number of bytes:

   NM* 1 to 0)     #128 bytes
   NM  2 to 1)     #128
   NM  3 to 2)   #1,664
   NM  4 to 3)     #128
   NM  5 to 4)  #14,048
   NM  6 to 5)     #256
   NM  7 to 6)  #50,176
   NM  8 to 7) #131,008
   NM  9 to 8) #114,048 bytes

so...no one is taking a LARGE amount of dynamically allocated stack space.
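
If you'd rather not do the hex subtraction by hand, here is a minimal
sketch that reproduces the table above.  The SP values are copied from
the trace; the names SP and sp_deltas are just mine for illustration.

```python
# SP values copied from the stack trace, frames NM 0 through NM 9.
SP = [
    0x41892850,  # NM 0
    0x418927D0,  # NM 1
    0x41892750,  # NM 2
    0x418920D0,  # NM 3
    0x41892050,  # NM 4
    0x4188E970,  # NM 5
    0x4188E870,  # NM 6
    0x41882470,  # NM 7
    0x418624B0,  # NM 8
    0x41846730,  # NM 9
]


def sp_deltas(sps):
    """Bytes of stack between each frame and the next older one."""
    return [sps[i] - sps[i + 1] for i in range(len(sps) - 1)]


if __name__ == "__main__":
    for i, delta in enumerate(sp_deltas(SP)):
        print(f"NM {i + 1} to {i}: {delta:>8,} bytes")
    print(f"NM 9 to 0 total: {sum(sp_deltas(SP)):,} bytes")
```

The deltas add up to 311,584 bytes between NM 9 and NM 0.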

BTW, the way you read the return address (e.g., NM 8 ... p0201+$1e74)
isn't quite obvious.  The "NM 8" line (read with the knowledge
of what NM 7 is) says:  if you were to return from p1108, you'd
return to p0201+$1e74.  However, at this moment (in p1108), your
SP=418624b0 (and your RP=1214.0001428c, which is p0201+$1e74).

Thus, if you wanted to know the amount of stack that, say,
dbput is using (for its local variables), look at:

> NM  4) SP=41892050 RP=2c0.00233c10 dbput+$180
> NM  5) SP=4188e970 RP=2c0.00233a6c ?dbput+$8

and subtract the two SP's  (giving 14,048 bytes)

Why those two?  "NM 4" is the return from ?leave_system_code+$8
into dbput and the SP value *while in* ?leave_system_code,
and "NM 5" is the return from dbput and the SP value
*while in* dbput.
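
The same reading rule can be sketched in code: find the frame whose
return address lands in the procedure you care about, then subtract the
SP of the next older frame.  The frame lines are copied from the trace
(with the "export stub:" lines left out); the regex and the helper name
stack_used_by are my own, and applying the rule to frames other than the
two worked through above is an assumption.

```python
import re

# Frame lines copied from the trace; "export stub:" lines are omitted.
TRACE = """\
NM* 0) SP=41892850 RP=a.005103dc trap_handler.handle_setdump+$a8
NM  1) SP=418927d0 RP=a.00511904 trap_handler+$3ec
NM  2) SP=41892750 RP=a.0025ba50 leave_system_code+$114
NM  3) SP=418920d0 RP=a.0025b928 ?leave_system_code+$8
NM  4) SP=41892050 RP=2c0.00233c10 dbput+$180
NM  5) SP=4188e970 RP=2c0.00233a6c ?dbput+$8
NM  6) SP=4188e870 RP=1214.001a8c94 p1108a$001$+$698
NM  7) SP=41882470 RP=1214.000635a8 p1108+$5980
NM  8) SP=418624b0 RP=1214.0001428c p0201+$1e74
NM  9) SP=41846730 RP=1214.00010e34 _start+$bdc
"""

# Captures the SP value and the procedure name in front of "+$offset".
FRAME_RE = re.compile(
    r"^NM\*?\s+[0-9a-f]+\) SP=([0-9a-f]+) RP=\S+ (\S+)\+\$[0-9a-f]+",
    re.M,
)


def stack_used_by(proc, trace=TRACE):
    """Stack bytes used by proc: SP of the frame returning into proc,
    minus SP of the next older frame (the SP while in proc's caller)."""
    frames = [(name, int(sp, 16)) for sp, name in FRAME_RE.findall(trace)]
    for i, (name, sp) in enumerate(frames[:-1]):
        if name == proc:
            return sp - frames[i + 1][1]
    raise KeyError(proc)


if __name__ == "__main__":
    print(stack_used_by("dbput"))  # 14048
    print(stack_used_by("p1108"))  # 131008
```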


Anyway...all this sheds no light on why there was a stack overflow, does it?
Nope.

Let's assume dbput is executing as normal.  It then calls something (let's
call it X, a mystery procedure).  X causes a stack overflow (or, something
that X calls caused it.)  Can we just abort the process at that point?

No.  IMAGE usually runs in "critical" mode, and is marked as a "system" process.
As such, when a stack overflow occurs, we try to back out and continue
running (to do some cleanup).  Eventually, when we leave both critical
mode and "system code" mode, we'll trigger the delayed stack overflow
abort.   And that's what happened.  Eventually, dbput said "I'm getting
ready to exit, let's leave system code" ... and poof, we triggered the
delayed abort.
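
Here is a toy model of that "delayed abort" behaviour, purely to
illustrate the sequence of events described above.  It is not how MPE/iX
is actually coded: the class, its flags and its method names are all
hypothetical.

```python
class ToyProcess:
    """Hypothetical model of a deferred stack-overflow abort."""

    def __init__(self, critical=False, system_code=False):
        self.critical = critical
        self.system_code = system_code
        self.overflow_pending = False
        self.aborted = False

    def stack_overflow(self):
        # While protected, remember the overflow and keep running.
        if self.critical or self.system_code:
            self.overflow_pending = True
        else:
            self.aborted = True

    def leave_critical(self):
        self.critical = False
        self._check_pending()

    def leave_system_code(self):
        self.system_code = False
        self._check_pending()

    def _check_pending(self):
        # The abort fires only once BOTH protections are gone.
        if self.overflow_pending and not (self.critical or self.system_code):
            self.aborted = True


if __name__ == "__main__":
    p = ToyProcess(critical=True, system_code=True)  # e.g. inside dbput
    p.stack_overflow()       # mystery procedure X overflows the stack
    print(p.aborted)         # False: abort is deferred
    p.leave_critical()
    print(p.aborted)         # False: still in system code
    p.leave_system_code()
    print(p.aborted)         # True: delayed abort triggers here
```

That last step is why the trace shows leave_system_code on top of dbput
rather than whatever actually ran out of stack.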

So...who caused the stack overflow?  When?

Good question!  There's not enough information available from this
stack trace to determine it.  (If this was a memory dump,
we might have some additional information that could help.)

However, I'd hazard a guess that Superdex was involved (perhaps
it was opening a file, or doing something that required a lot of
stack space).

Recommendation: increase the NMSTACK for the program.

SS
