HP3000-L Archives

February 1996, Week 4

HP3000-L@RAVEN.UTC.EDU

Subject:
From:
Bruce Toback <[log in to unmask]>
Reply To:
Bruce Toback <[log in to unmask]>
Date:
Fri, 23 Feb 1996 10:04:25 -0700
I have spent quite a bit of time diagnosing a network connectivity problem
between the NetManage Chameleon TCP/IP stack and the HP3000. The problems
occurred on dial-up links, but could occur under other circumstances as
well. There is also a possibility of data corruption on short transactions
with some networks. The problem turns out to be that the NetManage stack
does not follow the TCP standard in two important respects, both having to
do with error recovery. Because the NetManage stack is at fault here, the
problem may occur with other hosts as well. The problem has been verified
on several versions of NetManage software, including the latest release,
4.6. The problem is exacerbated by a somewhat unusual but perfectly
permissible characteristic of the HP 3000 TCP implementation.
 
Brief Description
-----------------
 
The symptom occurs when the first packet in a new TCP connection is
delayed. This is a common occurrence on dial-on-demand links, where the
first packet must wait for a dial-up connection to be established. It can
also occur on busy router-based catenets, especially if there's a slow link
between two or more nets in the path. The typical external result is that
Chameleon returns a "connection reset by peer" message. Sometimes, the
stack then crashes, requiring a reboot of the PC. A second, less probable
external result is that everything appears to work fine, but in fact two
transactions have been sent to and accepted by the host, rather than just the
one that the user intended. Both of these symptoms are more likely to occur
when the HP 3000 is busy.
 
Etiology
--------
 
It turns out that NetManage violates two provisions of the TCP standard.
The first of these deals with the sequence involved in setting up a new TCP
connection. (Forgive me if you're already familiar with TCP; I'm including
some background material here for people who are not.)
 
Every octet sent across a TCP connection has a sequence number associated
with it. TCP uses the sequence number to detect dropped and redundant
packets. Each packet sent includes a header that contains the sequence
number of the first octet in the packet; the receiver compares this with
the last sequence number received in order to determine whether it has seen
these octets before, and whether it has received all the octets that
preceded the current one.
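
As a rough illustration of the comparison described above, here is a minimal
Python sketch. The names (seq_delta, classify_segment, rcv_next) are made up
for this example and are not taken from any real stack; it ignores windows
and everything else a real receiver has to check.

```python
MOD = 2 ** 32  # TCP sequence numbers are 32 bits wide and wrap around


def seq_delta(a, b):
    """Signed distance from b to a in modulo-2**32 sequence space."""
    d = (a - b) % MOD
    return d - MOD if d >= MOD // 2 else d


def classify_segment(rcv_next, seg_seq, seg_len):
    """Compare an arriving segment with the next octet the receiver expects.

    rcv_next is the sequence number of the next octet expected, seg_seq is
    the sequence number in the packet header, seg_len is the octet count.
    """
    d = seq_delta(seg_seq, rcv_next)
    if d == 0:
        return "in-order"
    if d > 0:
        return "gap"        # octets before this segment have not arrived
    if seq_delta((seg_seq + seg_len) % MOD, rcv_next) <= 0:
        return "duplicate"  # every octet here was already received
    return "overlap"        # starts with old octets, ends with new ones


print(classify_segment(1000, 1000, 100))  # in-order
print(classify_segment(1000, 1100, 100))  # gap
print(classify_segment(1000, 900, 100))   # duplicate
```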
 
For this to work, each end must tell the other what the first sequence
number will be before actually sending any data. This is accomplished with
a four-part handshake, using a couple of reserved bits in the packet
header:
 
      Initiator             Responder       Meaning
            SYN    --->                     "Here's my first seq"
                  <---      ACK             "OK, I have your first seq"
                  <---      SYN             "And here's my first seq"
            ACK    --->                     "OK, I have your first seq"
 
It's possible to combine some of these steps, and most of the time, most
TCP implementations DO combine them:
 
      Initiator             Responder       Meaning
            SYN    --->                     "Here's my first seq"
                  <---      ACK, SYN        "Got yours, and here's mine"
            ACK    --->                     "OK, I have your first seq"
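
Both forms of the handshake shown above come down to the same bookkeeping on
the initiator's side: has my SYN been acknowledged, and have I seen the
other end's SYN? The following Python sketch is a toy model of that idea
only. The class and field names are invented for illustration, and it leaves
out timers, retransmission and RST handling, so it is not a full TCP state
machine.

```python
class InitiatorHandshake:
    """Simplified initiator-side bookkeeping for connection setup."""

    def __init__(self, iss):
        self.iss = iss             # our own initial sequence number
        self.our_syn_acked = False
        self.peer_isn = None       # the responder's initial sequence number

    def on_segment(self, syn=False, ack=False, seq=None, ack_no=None):
        """Process one arriving segment; return the reply to send, if any."""
        # Our SYN occupies one sequence number, so its ACK carries iss + 1.
        if ack and ack_no == (self.iss + 1) % 2 ** 32:
            self.our_syn_acked = True
        if syn:
            self.peer_isn = seq
            # Acknowledge the peer's SYN in the same way.
            return {"ack": True, "ack_no": (seq + 1) % 2 ** 32}
        return None

    @property
    def established(self):
        return self.our_syn_acked and self.peer_isn is not None


# Combined response: ACK and SYN arrive in one packet.
a = InitiatorHandshake(iss=5000)
a.on_segment(syn=True, ack=True, seq=9000, ack_no=5001)
print(a.established)  # True

# Split response: ACK first, SYN in a later packet.
b = InitiatorHandshake(iss=5000)
b.on_segment(ack=True, ack_no=5001)
print(b.established)  # False
b.on_segment(syn=True, seq=9000)
print(b.established)  # True
```

The point of the sketch is that the two facts are tracked independently, so
it makes no difference whether they arrive in one packet or two.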
 
The 3000 combines the responding ACK and SYN most of the time. However,
when the 3000 is busy, it sometimes separates its response into two
separate packets. This appears to give NetManage fits:
 
      NetManage             HP3000
            SYN    --->                     "Here's my first seq"
                  <---      ACK             "I have your first seq"
                  <---      SYN             "And here's mine"
  ACK, SYN, RST    --->                     "I have yours, here's mine again,
                                             but (RST = "reset") I really
                                             don't want to talk to you"
                            (Gives up in confusion)
 
Now, normally what would happen is that NetManage would detect that the
connection isn't established and start the whole sequence over again. If
this succeeds, no harm done except that the connection takes longer to
establish than it should. However, if there's a delay in delivery of the
first packet, NetManage sends just the first packet again. When both
acknowledgements come back, NetManage becomes confused -- thanks to a
second standards violation -- and veers off into the weeds. If you're
lucky, your data survives; if not, not. The full sequence demonstrating the
problem is 15 packets, but here's the start:
 
      NetManage             HP3000
            SYN    --->                     "Here's my first seq"
                 (link madly being dialed)
(Timeout, retry)
            SYN    --->                     "Here's my first seq"
                 (link connected, both packets sent)
                  <---      ACK             "I have your first seq"
(Timeout, retry)
            SYN    --->                     "Here's my first seq"
  ACK, SYN, RST    --->                     "I have yours, here's mine again,
                                             but (RST = "reset") I really
                                             don't want to talk to you"
                  <---      ACK             "I have your first seq, really"
 
Things degenerate from here. The reason that neither side can sort out the
mass of retries, resets and reconnects is that contrary to the
recommendations of the standard, NetManage does not pick a unique sequence
number for each new connection attempt. This is necessary so that
situations like that above can be sorted out: the 3000 has to understand
whether the SYN packet it just got is a retry of an earlier packet or
represents a new attempt at connection. But since NetManage ALWAYS uses
zero as the initial sequence number, the host can't tell one from the
other, and NetManage can't sort out the acknowledgements either.
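
To make the ambiguity concrete, here is a small Python sketch. It relies on
one fact about TCP: a SYN occupies one sequence number, so the ACK for it
carries the initial sequence number plus one. The function name and the
non-zero starting values are hypothetical, chosen only for the example.

```python
def attempts_matching_ack(attempt_isns, ack_no):
    """Return indexes of the connection attempts that ack_no could answer.

    attempt_isns lists the initial sequence number used by each attempt,
    oldest first.
    """
    return [i for i, isn in enumerate(attempt_isns)
            if (isn + 1) % 2 ** 32 == ack_no]


# Every attempt starts at zero, as described above for NetManage:
print(attempts_matching_ack([0, 0], 1))              # [0, 1]

# Hypothetical distinct starting values for each attempt:
print(attempts_matching_ack([71000, 96000], 71001))  # [0]
```

With zero every time, a late ACK matches both attempts; with distinct values
it can only belong to one of them.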
 
There are guidelines in the standard that should be used when choosing
initial sequence numbers; the standard explicitly discourages always using
zero. The potential for data loss occurs because NetManage, later in the
sequence, erroneously believes the connection is up and starts sending
data. It makes the error because it can't tell the host's ACK for the old
SYN from an ACK for the later SYN -- which it could, if it followed the
spec and used a new sequence number for the new connection attempt.
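
For reference, the TCP specification (RFC 793) describes taking the initial
sequence number from a 32-bit clock whose low-order bit advances roughly
every 4 microseconds. A minimal Python sketch of that scheme follows. The
function and constant names are my own, and using the system time as the
clock source is an assumption made for the example; it is not a description
of how any particular stack does it.

```python
import time

TICKS_PER_SECOND = 250_000  # one tick every 4 microseconds
MOD = 2 ** 32               # sequence numbers are 32 bits wide


def clock_isn(now=None):
    """Derive an initial sequence number from a clock reading in seconds."""
    if now is None:
        now = time.time()   # assumed clock source for this sketch
    return int(now * TICKS_PER_SECOND) % MOD


# Two attempts made three seconds apart get different starting values:
first = clock_isn(100.0)
retry = clock_isn(103.0)
print(retry - first)  # 750000
```

At this rate the counter wraps after 2**32 ticks, roughly 4.8 hours, so two
connection attempts a few seconds apart never start from the same number.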
 
Summary
-------
 
Two errors in NetManage's implementation of the TCP protocol can cause
connection problems and/or data loss, particularly when used with dial-up or
other slow links, and particularly when used to communicate with any TCP
implementation that separates the ACK and SYN phases of the response
handshake. I strongly recommend not using NetManage if there is any
possibility that the stack will be used in such situations. At best, such
use will result in increased communication costs due to repeated dial
attempts; at worst, use of the NetManage stack in this situation can cause
silent data loss.
 
If anyone else is using the NetManage stack, I have a workaround that can
be implemented on the host side. The workaround greatly reduces the
frequency of problems, but does not eliminate them; it also reduces the
number of connections that a TCP server can accept. Still, it's better than
the alternative.
 
-- Bruce
 
---------------------------------------------------------------------------
Bruce Toback    Tel: (602) 996-8601| My candle burns at both ends;
OPT, Inc.            (800) 858-4507| It will not last the night;
11801 N. Tatum Blvd. Ste. 142      | But ah, my foes, and oh, my friends -
Phoenix AZ 85028                   | It gives a lovely light.
[log in to unmask]                   |     -- Edna St. Vincent Millay
