.\"	$NetBSD: trans_design.nr,v 1.3 2000/03/13 23:03:35 soren Exp $
.\"
.NC "The Design of the ARGO Transport Entity"
.sh 1 "Protocol Hooks"
.pp
The design of the AOS kernel IPC support to some
extent mandates the
design of protocols. 
Each protocol must provide the following 
protocol hooks, which are procedures called through a
protocol switch table
(an array of type \fIprotosw\fR, as described in
Chapter Five).
.ip "pr_input()" 5
Called when data are to be passed up from a lower layer.
.ip "pr_output()" 5
Called when data are to be passed down from a higher layer.
.ip "pr_init()" 5
Called when the system is brought up.
.ip "pr_fasttimo()" 5
Called every 200 milliseconds by the clock functional unit.
.ip "pr_slowtimo()" 5
Called every 500 milliseconds by the clock functional unit.
.ip "pr_drain()" 5
This is meant to be called when buffer space is low.
Each protocol is expected to provide this routine to free
non-critical buffer space.
This is not yet called anywhere.
.ip "pr_ctlinput()" 5
Used for exchanging information between
protocols, such as notifying a transport protocol of changes
in routing or configuration information.
.ip "pr_ctloutput()" 5
Supports the protocol-dependent 
\fIgetsockopt()\fR
and 
\fIsetsockopt()\fR
options.
.ip "pr_usrreq()" 5
Called by the socket code to pass along a \*(lquser request\*(rq -
in other words a service primitive.
This call is also used for other protocol functions.
The functions served by the \fIpr_usrreq()\fR routine are:
.ip "     PRU_ATTACH" 10
Creates a protocol control block and attaches it to a given socket.
Called as a result of a \fIsocket()\fR system call.
.ip "     PRU_DISCONNECT" 10
Called as a result of a 
\fIclose()\fR system call.
Initiates disconnection.
.ip "     PRU_DETACH" 10
Disassociates a protocol control block from a socket and recycles
the buffer space used for the protocol control block.
Called after PRU_DISCONNECT.
.ip "     PRU_SHUTDOWN" 10
Called as a result of a 
\fIshutdown()\fR system call.
If the protocol supports the notion of half-open connections,
this closes the connection in one direction or both directions,
depending on the arguments passed to
\fIshutdown()\fR.
.ip "     PRU_BIND" 10
Gives an address to a socket.
Called as a result of a 
\fIbind()\fR system call, and also
when a
socket without a bound address is used.
In the latter case, an unused transport suffix is located and
bound to the socket.
.ip "     PRU_LISTEN" 10
Called as a result of a 
\fIlisten()\fR system call.
Marks the socket as willing to queue incoming connection
requests.
.ip "     PRU_CONNECT" 10
Called as a result of a 
\fIconnect()\fR system call.
Initiates a connection request.
.ip "     PRU_ACCEPT" 10
Called as a result of an 
\fIaccept()\fR system call.
Dequeues a pending connection request, or blocks waiting for
a connection request to arrive.
In the latter case, it marks the socket as willing to accept
connections.
.ip "     PRU_RCVD" 10
The protocol module is expected to have put incoming data
into the socket's receive buffer, \fIso_rcv\fR.
When a receive primitive is used
(\fIrecv(), recvmsg(), recvfrom(),
read(), readv(), \fRand 
\fIrecvv()\fR system calls)
the socket code module copies data from the
\fIso_rcv\fR to the user's
address space.
The protocol module may arrange to be informed each time the socket code
does this, in which case the socket code calls \fIpr_usrreq\fR(PRU_RCVD)
after the data were copied to the user.
.ip "     PRU_SEND" 10
This performs the protocol-dependent part of a send primitive
(\fIsend(), sendmsg(), sendto(), write(), writev(), 
\fRand \fIsendv()\fR system calls).
The socket code 
(procedures \fIsendit()\fR and \fIsosend()\fR)
moves outgoing data from the user's
address space into a chain of \fImbufs\fR.
The socket code takes as much data from the user as it
determines will fit into the outgoing socket buffer, so_snd. 
It passes this much data in the form of an mbuf chain to the protocol
via \fIpr_usrreq\fR(PRU_SEND).
If there are more data than 
the so_snd can accommodate,
the socket code, which is running on behalf of a user process,
puts the user process to sleep.
The protocol module is expected to wake up the user process when
more room appears in so_snd.
.ip "     PRU_ABORT" 10
Called when a socket is closed and that socket
is accepting connections and has
queued pending
connection requests or
partially open connections.
.ip "     PRU_CONTROL" 10
Called as a result of an 
\fIioctl()\fR system call.
.ip "     PRU_SENSE" 10
Called as a result of an 
\fIfstat()\fR system call.
.ip "     PRU_RCVOOB" 10
Performs the work of receiving \*(lqout-of-band\*(rq data.
The socket module has already allocated an mbuf into which
the protocol module is expected to put the incoming 
\*(lqout-of-band\*(rq data.
The socket code will then move the data from this mbuf
to the user's address space.
.ip "     PRU_SENDOOB" 10
Performs the work of sending \*(lqout-of-band\*(rq data.
The socket module has already moved the data
from the user's address space into a chain of mbufs,
which it now passes to the protocol module.
.ip "     PRU_SOCKADDR" 10
Supports the system call
\fIgetsockname()\fR.
Puts the socket's bound address into an mbuf.
.ip "     PRU_PEERADDR" 10
Supports the system call
\fIgetpeername()\fR.
Puts the peer's address into an mbuf.
.ip "     PRU_CONNECT2" 10
This is used in the Unix domain to support pipes.
It is not generally supported by transport protocols.
.ip "     PRU_FASTTIMO, PRU_SLOWTIMO" 10
These are superfluous.
None of the transport protocols uses them.
.ip "     PRU_PROTORCV, PRU_PROTOSEND" 10
None of the transport protocols uses these.
.ip "     PRU_SENDEOT" 10
This was added to support TP.
This indicates that the end of the data sent in this
send primitive should
be marked by the protocol as the end of the TSDU.
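.pp
As a rough illustration of how hooks are reached through a switch table,
the following sketch defines a cut-down table with two hooks and
dispatches a user request through it.
The names \fIdemo_protosw\fR, \fIdemo_usrreq()\fR, and the DEMO_PRU
request codes are hypothetical, and the structure is far simpler than the real
\fIprotosw\fR.
.(b
\fC
```c
#include <stddef.h>

/* Hypothetical request codes; the real PRU_* values belong to the kernel. */
enum { DEMO_PRU_ATTACH = 0, DEMO_PRU_DETACH = 1, DEMO_PRU_SEND = 2 };

/* A cut-down switch entry: one function pointer per hook. */
struct demo_protosw {
	int	(*pr_init)(void);
	int	(*pr_usrreq)(int req, void *so);
};

static int demo_attached;	/* counts attached sockets in this sketch */

static int demo_init(void)
{
	demo_attached = 0;
	return 0;
}

static int demo_usrreq(int req, void *so)
{
	(void)so;		/* the socket is unused in this sketch */
	switch (req) {
	case DEMO_PRU_ATTACH:
		demo_attached++;
		return 0;
	case DEMO_PRU_DETACH:
		demo_attached--;
		return 0;
	default:
		return -1;	/* request not handled by this sketch */
	}
}

static struct demo_protosw demo_sw[] = {
	{ demo_init, demo_usrreq },
};

/* Protocol-independent code calls a hook indirectly through the table. */
static int demo_call_usrreq(struct demo_protosw *pr, int req, void *so)
{
	return (*pr->pr_usrreq)(req, so);
}
```
\fR
.)b
.lp
The caller never names the protocol routine directly;
it only indexes the table and calls through the pointer.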
.sh 1 "The Interface Between the Transport Entity and Lower Layers"
.pp
The transport layer may run over a network layer such as IP
or the ISO connectionless network layer,
or it may run over a multi-purpose layer such as the service
provided by X.25.
X.25 is viewed as a network layer when
TP runs over X.25, and as a 
subnetwork layer 
when IP is running over X.25.
The software interface between data link and network layers differs
considerably from the software interface between transport and network
layers in AOS.
For this reason some modification of the transport-to-lower-layer
interface is necessary to support the suite of protocols included in 
ARGO.
.pp
In AOS it is assumed that the transport layer will run over one
and only one network layer, and therefore it may call the
network layer output procedure directly.
In order to allow TP to run over a set of lower layers,
all domain-specific functions have been put into a set of routines
that are called indirectly through a domain-specific switch table.
The primary reason for this is that the transport and network
layers share information, mostly information pertaining to addresses.
The protocol control blocks for different network layers
differ, so the transport layer cannot just directly
access the network layer's pcb.
Similarly, a network layer may not directly access the transport
pcb because a multitude of transport protocols can run over each
of the network protocols.
.pp
To permit different network-layer protocol control blocks to coexist
under one transport layer, all transport-dependent control
information was put into a transport-specific protocol control block.
A new field, \fIso_tpcb\fR,
was added to the \fIsocket\fR structure to hold a pointer to
the transport-layer protocol control block. 
The existing
field \fCso_pcb\fR is used for the network layer pcb.
.pp
The following structure was added to allow domain-specific
functions to be called indirectly.
All these functions operate on a network-layer pcb.
.pp
.(b
\fC
.TS
tab(+);
l s s s.
struct nl_protosw {
.T&
l l l l.
+int+nlp_afamily;+/* address family */
+int+(*nlp_putnetaddr)();+/* puts addrs in pcb */
+int+(*nlp_getnetaddr)();+/* gets addrs from pcb */
+int+(*nlp_putsufx)();+/* transp suffix -> pcb */
+int+(*nlp_getsufx)();+/* gets t-suffix */
+int+(*nlp_recycle_suffix)();+/* zeroes suffix */
+int+(*nlp_mtu)();+/* get maximum
+++transmission unit size */
+int+(*nlp_pcbbind)();+/* bind to pcb */
+int+(*nlp_pcbconn)();+/* connect */
+int+(*nlp_pcbdisc)();+/* disconnect */
+int+(*nlp_pcbdetach)();+/* detach pcb */
+int+(*nlp_pcballoc)();+/* allocate a pcb */
+int+(*nlp_output)();+/* emit packet */
+int+(*nlp_dgoutput)();+/* emit datagram */
+caddr_t+nlp_pcblist;+/* list of pcbs 
+++for management 
+++of connections */
};
.TE
\fR
.)b
.lp
The switch is based on the address family chosen when the
\fIsocket()\fR system call is made prior to connection establishment.
This unfortunately ties the address family to the domain,
but the only alternative is to add an argument to the \fIsocket()\fR
system call to let the user specify the desired network layer.
In the case of a connection-oriented environment with no multi-homing,
it would be possible to determine which network layer is to be
used
from routing
information, but to do this requires unrealistic assumptions
about the environment.
For these reasons, linking the address family to the network
layer protocol is seen as the least of the evils.
The transport suffixes are kept in the network layer's pcb
as well as in the transport layer because 
full transport address pairs are used to identify a connection
in the Internet domain.
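.pp
One way to picture the selection of a switch entry by address family is
the sketch below.
It assumes a linear search over a small table;
whether ARGO searches or indexes the table is not stated here,
and the DEMO_AF codes, the \fIdemo_\fR names, and the MTU values are
invented for illustration.
.(b
\fC
```c
#include <stddef.h>

/* Hypothetical address-family codes for this sketch only. */
enum { DEMO_AF_INET = 2, DEMO_AF_ISO = 7 };

/* A cut-down version of the switch: the family plus one routine. */
struct demo_nl_protosw {
	int	nlp_afamily;
	int	(*nlp_mtu)(void);
};

static int demo_in_mtu(void)
{
	return 576;		/* illustrative value */
}

static int demo_iso_mtu(void)
{
	return 512;		/* illustrative value */
}

static struct demo_nl_protosw demo_nl_sw[] = {
	{ DEMO_AF_INET, demo_in_mtu },
	{ DEMO_AF_ISO,  demo_iso_mtu },
};

/* Find the entry for an address family, or NULL if none matches. */
static struct demo_nl_protosw *demo_nl_lookup(int af)
{
	size_t i;

	for (i = 0; i < sizeof(demo_nl_sw) / sizeof(demo_nl_sw[0]); i++)
		if (demo_nl_sw[i].nlp_afamily == af)
			return &demo_nl_sw[i];
	return NULL;
}
```
\fR
.)b
.lp
Once the entry is found, the transport layer can keep a pointer to it
and call every domain-specific routine through that pointer.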
.sh 1 "The Architecture of the Transport Protocol Entity"
.pp
A set of protocol hooks is required
by the AOS IPC architecture.
These hooks are used by the protocol-independent parts of the kernel
to gain entry to protocol-specific code.
The protocol code can be entered in one of the following ways:
.ip "1) " 5
at boot time, when autoconfiguration
initializes each protocol through
the 
\fIpr_init()\fR
hook,
.ip "2) " 5
from above, either
a user program making a system call, through
the \fIpr_usrreq()\fR or \fIpr_ctloutput()\fR hooks, or
from a higher layer protocol using the
\fIpr_output()\fR hook,
.ip "3) " 5
from below, a device interrupt servicing an incoming packet
through the \fIpr_input()\fR  and \fIpr_ctlinput()\fR hooks, and
.ip "4) " 5
from a clock interrupt through the \fIpr_slowtimo()\fR
or the
\fIpr_fasttimo()\fR hook.
.\" FIGURE
.so figs/trans_flow.nr
.\".so figs/trans_flow.grn
.pp
The protocol code can be divided into
the following modules, which are described in more detail below.
.CF
shows the flow of data and control 
among these modules.
.in +5
.ip "Timers and References:" 5
The code executed on behalf of \fIpr_slowtimo()\fR.
The fast timeout is not used by TP.
.ip "Driver:" 5
This is the finite state machine for TP.
.ip "Input:     " 5
This is the module that decodes incoming packets,
identifies or creates the pcb for which 
the packet is destined, and creates an "event" to
pass to the driver.
.ip "Output:" 5
This is the module that creates a packet header of a given type
with fields containing 
values that are appropriate to the connection
on which the packet is being sent, appends data if necessary,
and hands a packet
to the lower layer, according to the transport-to-lower-layer
interface.
.ip "Send:      " 5
This module packetizes data from the outbound
socket buffer, \fIso_snd\fR,
handles retransmissions of packetized data, and
drops packetized data from the retransmission queue.
.ip "Receive:" 5
This module reorders packets if necessary,
depacketizes data, passes it to the socket code module,
and determines when acknowledgments should be sent.
.in -5
.sh 1 "Timers and References"
.pp
TP identifies sockets by \fIreference numbers\fR, or
\fIreferences\fR,
which are \*(lqfrozen\*(rq (may not be reassigned)
until some locally defined time after
a connection is broken and its protocol control block
is discarded.
An array of \fIreference blocks\fR is maintained by TP.
The reference number of a reference block is its
offset in the array.
When a reference block is in use it contains 
a pointer to the pcb for the socket to which the
reference applies.
.pp
The system clock calls the \fIpr_slowtimo()\fR and 
\fIpr_fasttimo()\fR hooks for each protocol in the protocol switch table
every 500 and 200 milliseconds, respectively.
Each protocol handles its own timers its own way.
The timers in TP take two forms:
those that typically are cancelled and
those that usually expire.
The latter form may have more than one instantiation at any given
time;
the former may not.
The two are implemented slightly
differently for the sake of performance.
.pp
The timers that normally expire 
are kept in a queue, their values all relative
to the value of the preceding timer.
Thus all timer values are decremented by a single
operation on the value of the first timer.
The timer is represented by the Ecallout structure:
.(b
\fC
.TS
tab(+);
l s s s.
struct Ecallout {
.T&
l l l l.
+int+c_time;+/* incremental time */
+int+c_func;+/* function to call */
+u_int+c_arg1;+/* argument to routine */
+u_int+c_arg2;+/* argument to routine */
+int+c_arg3;+/* argument to routine */
+struct Ecallout+*c_next;
};
.TE
\fR
.)b
.lp
When an Ecallout structure migrates to the head
of the E timer list, and its \fIc_time\fR
field is decremented to zero, 
the function stored in \fIc_func\fR is
called, with \fIc_arg1, c_arg2\fR, and \fIc_arg3\fR
as arguments.
Setting and cancelling these timers
are accomplished by a linear search and one
insertion or deletion from the timer queue.
This queue is linked to the 
reference block associated with a communication endpoint.
This form is used for the reference timer
and for the retransmission timers for data TPDUs.
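.pp
The relative-time queue can be sketched as a \*(lqdelta list\*(rq.
The code below is a minimal illustration, not the ARGO implementation:
the structure \fIdemo_ecallout\fR is a simplified stand-in for
\fIEcallout\fR, and an integer identifier takes the place of the
function and its arguments.
.(b
\fC
```c
#include <stddef.h>

struct demo_ecallout {
	int	c_time;		/* ticks after the preceding entry */
	int	c_id;		/* stands in for c_func and its arguments */
	struct demo_ecallout *c_next;
};

/* Insert a timer that should fire `ticks` from now into the delta list. */
static void demo_etimer_set(struct demo_ecallout **head,
    struct demo_ecallout *e, int ticks)
{
	struct demo_ecallout **pp = head;

	/* Walk past entries that expire no later than the new one. */
	while (*pp != NULL && (*pp)->c_time <= ticks) {
		ticks -= (*pp)->c_time;
		pp = &(*pp)->c_next;
	}
	e->c_time = ticks;
	e->c_next = *pp;
	if (*pp != NULL)
		(*pp)->c_time -= ticks;	/* keep the successor relative */
	*pp = e;
}

/* One clock tick: only the head is decremented.  Returns the id of an
 * expired timer, or -1 if nothing expired.  An entry whose delta is
 * zero fires on the following call; a fuller version would loop. */
static int demo_etimer_tick(struct demo_ecallout **head)
{
	struct demo_ecallout *e = *head;

	if (e == NULL)
		return -1;
	if (e->c_time > 0)
		e->c_time--;
	if (e->c_time > 0)
		return -1;
	*head = e->c_next;
	return e->c_id;
}
```
\fR
.)b
.lp
Because each entry stores only the difference from its predecessor,
a tick touches one entry no matter how many timers are queued,
while setting a timer costs a linear search.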
.pp
The second form of timer, the type that
typically is cancelled, is used for several
timers - the inactivity timer, the sendack timer,
and the retransmission
timer for all types of TPDUs except data TPDUs.
.(b
\fC
.TS
tab(+);
l s s s.
struct Ccallout {
.T&
l l l l.
+int+c_time;+/* incremental time */
+int+c_active;+/* this timer is active? */
};
.TE
\fR
.)b
.lp
All of these timers are stored
directly
in the reference block.
These timers are decremented in one linear scan of
the reference blocks.
Cancelling, setting, or resetting
one of these timers is accomplished by a
single assignment to an array element.
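.pp
A minimal sketch of this second form follows.
It assumes a small fixed array per reference block;
DEMO_N_CTIMERS and the \fIdemo_\fR routines are hypothetical names,
and the expiry actions are reduced to a bit mask.
.(b
\fC
```c
#define DEMO_N_CTIMERS 4	/* hypothetical; the real count is N_CTIMERS */

struct demo_ccallout {
	int	c_time;		/* ticks remaining */
	int	c_active;	/* nonzero while the timer is running */
};

/* Setting or resetting is one structure assignment into the array. */
static void demo_ctimer_set(struct demo_ccallout *t, int which, int ticks)
{
	struct demo_ccallout v = { ticks, 1 };

	t[which] = v;
}

/* Cancelling is a single assignment to the active flag. */
static void demo_ctimer_cancel(struct demo_ccallout *t, int which)
{
	t[which].c_active = 0;
}

/* One clock tick over one reference block: returns a bit mask of the
 * timers that expired on this tick. */
static unsigned demo_ctimer_tick(struct demo_ccallout *t)
{
	unsigned expired = 0;
	int i;

	for (i = 0; i < DEMO_N_CTIMERS; i++) {
		if (t[i].c_active && --t[i].c_time <= 0) {
			t[i].c_active = 0;
			expired |= 1u << i;
		}
	}
	return expired;
}
```
\fR
.)b
.lp
The trade is the opposite of the queued form:
setting and cancelling are constant-time,
and the cost moves to the scan performed on every tick.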
.sh 1 "Driver"
.pp
This is the finite state machine for TP.
A connection is managed by the finite state machine (fsm).
All events that pertain to a connection cause the
finite state machine driver to be called.
The driver takes two arguments - the pcb for the connection
and an event structure.
The event structure contains a field that discriminates
the different types of events, and a union of 
structures that are specific to the event types.
The driver evaluates a set of predicates based on the current
state of the finite state machine (which is kept in the pcb) and the event type.
The result of the predicate evaluation determines
a set of actions to take and a state transition.
The driver takes the actions and if they complete
without errors, the driver makes the state transition.
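.pp
The shape of such a driver can be sketched by hand,
although in ARGO the real one is generated.
The two states, two event types, and all \fIdemo_\fR names below
are hypothetical and serve only to show a discriminated event union,
predicate evaluation, and a transition made only on success.
.(b
\fC
```c
/* Hypothetical states and event types for a two-event machine. */
enum demo_state { DEMO_CLOSED, DEMO_OPEN };
enum demo_evtype { DEMO_EV_CONNECT, DEMO_EV_DATA };

struct demo_event {
	enum demo_evtype ev_type;	/* discriminates the union below */
	union {
		struct { int credit; } ev_connect;
		struct { int len; } ev_data;
	} ev_un;
};

struct demo_pcb {
	enum demo_state state;		/* the fsm state lives in the pcb */
	int	octets;			/* data accepted so far */
};

/* Returns 0 on success; on error the state is left unchanged. */
static int demo_driver(struct demo_pcb *p, const struct demo_event *e)
{
	/* Predicate: closed and asked to connect with nonzero credit. */
	if (p->state == DEMO_CLOSED && e->ev_type == DEMO_EV_CONNECT &&
	    e->ev_un.ev_connect.credit > 0) {
		p->state = DEMO_OPEN;	/* no action; make the transition */
		return 0;
	}
	/* Predicate: open and data of a sensible length arrived. */
	if (p->state == DEMO_OPEN && e->ev_type == DEMO_EV_DATA &&
	    e->ev_un.ev_data.len >= 0) {
		p->octets += e->ev_un.ev_data.len;	/* action */
		return 0;		/* state stays DEMO_OPEN */
	}
	return -1;	/* no predicate matched this state and event */
}
```
\fR
.)b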
.pp
The states, event types, predicates, actions, and state transitions are all
specified in a \fIxebec transition file\fR.
\fIXebec\fR is a utility that takes a human-readable description
of a finite state machine
and produces a set of tables and C source code for the driver.
The driver procedure is called \fItp_driver()\fR.
It is located in a file generated by xebec, 
\fCtp_driver.c\fR.
For more details about xebec, see the manual page \fIxebec(1)\fR.
.pp
The transition file for TP is \fCtp.trans\fR,
and it is a good place to begin a perusal of the TP
source code.
.sh 1 "Input"
.pp
This is the module that decodes an incoming packet,
locates or creates the pcb for which 
the packet is destined, and creates an event to
pass to the driver.
The network layer passes a packet up to the appropriate
transport layer by indirectly calling a transport input
routine through the protocol switch table for the network
domain.
There is one protocol switch entry for TP for each domain in which
TP will run (Internet, ISO).
In the Internet domain, the protocol switch field \fIpr_input()\fR
takes the value \fItpip_input()\fR.
This procedure accepts a packet from IP, with the IP header
still intact.
It extracts the network addresses from the IP header,
strips the IP header, and calls the domain-independent
input procedure for TP,
\fItp_input()\fR.
\fITp_input()\fR
decodes a TPDU.
The multitude of options, the variable-length
nature of the options, the semantics of the
options, and the possible combinations of concatenated
TPDUs make this a 
complex procedure.
It is sensitive to changes, and from 
the point of view of software maintenance, it is a
potential hazard.
Because it is in the 
critical path of TP, however, some compromise
was made between maintainability and efficiency.
Multiple copies of sections of code were avoided as much as
possible,
not for the sake of saving space, but rather for the sake
of maintainability.
Ironically,
this detracts somewhat from the readability of the code.
.pp
Once a TPDU has been decoded and a pcb has been
identified for the TPDU,
the appropriate fields of the TPDU
are extracted and their values are placed in
an event structure.
Finally, \fItp_driver()\fR is called with
the event structure and the pcb as parameters.
.sh 1 "Output"
.pp
This module creates a TPDU header of a given type
with field values that are appropriate to the connection
on which the TPDU is being sent, appends data if necessary,
and hands a TPDU
to the lower layer according to the transport-to-lower-layer
interface.
Whenever a TPDU is to be sent to the peer or prospective peer,
the function \fItp_emit()\fR
is called, passing as arguments the pcb, a TPDU type, and several
other type-specific arguments, possibly including some data.
The data are in the form of an mbuf chain.
\fITp_emit()\fR prepends to the data an mbuf containing a TP header,
fills in the fields of the header according to the parameters
given, performs the checksum if appropriate, and
calls a domain-specific output routine.
For the Internet domain, this output routine is
\fItpip_output()\fR, which takes
as arguments the mbuf chain representing the TPDU,
and a network level pcb.
Some protocol errors cannot be associated with 
a connection 
but require that TP issue
an ER TPDU or a DR TPDU. 
When these errors occur the routine
\fItp_error_emit()\fR is called.
This procedure creates the appropriate type of TPDU
and passes it to a domain-dependent routine for transmitting datagrams.
In the Internet domain,
\fItpip_output_dg()\fR is called.
This takes as arguments an mbuf chain representing the TPDU,
a source network address, and a destination network address.
.sh 1 "Send"
.\" FIGURE
.so figs/mbufsnd.nr
.\".so figs/mbufsnd.grn
.pp
This module packetizes data from the outbound
socket buffer, \fIso_snd\fR,
handles retransmissions of packetized data, and
drops packetized data from the retransmission queue.
The major routine in this module is \fItp_send()\fR, which
takes a range of sequence numbers as arguments.
For each sequence number in the range,
it packetizes an appropriate amount
of outbound data, and places the resulting TPDU on 
a retransmission control queue subject to the
constraints imposed by the rules of expedited data,
maximum packet sizes, and end-of-TSDU markers.
.pp
The most complicating factor is that of managing
expedited data.
A normal datum may not be sent (for its first time) before the
acknowledgment of any expedited datum
that was received from the user before the 
normal datum was received. 
In order to enforce this rule,
each TPDU must be marked in some way
so that it will be known which expedited datum
must be delivered and acknowledged by the peer before this TPDU may be transmitted
for the first time.
Markers are placed in \fIso_snd\fR 
when an
outgoing expedited datum arrives from the user. 
A marker is an mbuf structure with an \fIm_len\fR
of zero, but with the data area nevertheless containing
the sequence number of an expedited data TPDU.
The \fIm_type\fR of a marker is a new type, MT_XPD.
.pp
\fITp_send()\fR stops packetizing data when it encounters a marker
for an unacknowledged expedited datum.
If it encounters a marker for an expedited TPDU that has already
been acknowledged, the marker is jettisoned.
.CF
illustrates the structure of the sending socket buffer used
for normal data.
.pp
When \fItp_send()\fR moves data from mbufs on \fIso_snd\fR to the retransmission
control queue, it needs to know
how many octets of data can be placed in each TPDU.
The appropriate amount depends on, among other things,
the maximum transmission unit of the network layer
on the route the packet will take.
To determine the maximum transmission unit,
TP queries the network layer through
the domain-dependent switch table's field, \fInlp_mtu\fR.
In the Internet domain, this resolves to \fItp_inmtu()\fR.
The header sizes for the network and transport layers
also affect the amount of data that can go into a packet,
and these sizes depend on the connection's characteristics.
.pp
Once the maximum amount of data per TPDU is determined,
\fItp_send()\fR can pull this amount off the \fIso_snd\fR queue to form
a TPDU,
assign a TPDU sequence number,
and place the new TPDU on the 
retransmission control queue.
The retransmission control queue is a list of mbuf chains.
Each mbuf chain represents one TPDU, preceded by an
\fIrtc structure\fR:
.(b
\fC
.TS
tab(+);
l s s s.
struct tp_rtc {
.T&
l l l l.
+struct tp_rtc+*tprt_next;+/* next rtc struct in list */
+SeqNum+tprt_seq;+/* seq # of this TPDU */
+int+tprt_eot;+/* end of TSDU? */
+int+tprt_octets;+/* # octets in this TPDU */
+struct mbuf+*tprt_data;+/* ptr to the octets of data */
.\"/* Performance measurment info: */
.\"int	tprt_window;	/* in which call to tp_send() was
.\"			  * this TPDU formed? 
.\"			  */
.\"struct timeval	tprt_sess_time;	/* time session received the 
.\"			* majority of the data for this packet on send;
.\"			* on recv, this is the time it's given to session 
.\"			*/
.\"struct timeval	tprt_net_time;	/* time first copy was given to net layer
.\"			* on send; on receive it's the time received from
.\"			* the network 
.\"			*/
};
.TE
\fR
.)b
.lp
Once TPDUs are on the retransmission control queue,
they are retransmitted or dropped by the actions
of timers.
The procedure \fItp_sbdrop()\fR
removes the TPDUs from the retransmission queue.
It takes a sequence number as an argument and drops
all TPDUs up to and including the TPDU with that sequence number.
.pp
When an AK TPDU arrives, the values from
its credit and sequence number fields
are passed to \fItp_goodack()\fR, which
determines whether or not the AK brought any news with it,
and therefore whether TP can send more data
or expedited data.
If this AK acknowledges something heretofore unacknowledged,
\fItp_goodack()\fR drops the appropriate TPDU(s) from the retransmission
control list, computes the smoothed average round trip time
and standard deviation of the round trip time, 
and updates
the retransmission timer based on these statistics.
It sets a flag in the pcb if the TP entity is obliged to
send the flow control confirmation parameter on its next
AK TPDU.
\fITp_goodack()\fR returns true if the AK brought some news with it,
either with respect to a change in credit or with respect to
new acknowledgments.
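.pp
The \*(lqnews\*(rq test can be pictured with the following sketch.
It is an assumption-laden simplification:
the structure \fIdemo_ack_pcb\fR and the routine \fIdemo_goodack()\fR
are hypothetical, sequence-number wraparound is ignored,
and the round trip time statistics and retransmission queue are omitted.
.(b
\fC
```c
struct demo_ack_pcb {
	unsigned snduna;	/* lowest unacknowledged sequence number */
	unsigned fcredit;	/* credit granted by the peer */
};

/* Returns nonzero if the AK carried news: new acknowledgments or a
 * change in credit.  Wraparound is ignored in this sketch. */
static int demo_goodack(struct demo_ack_pcb *p, unsigned seq, unsigned cdt)
{
	int news = 0;

	if (seq > p->snduna) {		/* acknowledges something new */
		p->snduna = seq;	/* TPDUs below seq can be dropped */
		news = 1;
	}
	if (cdt != p->fcredit) {	/* the window changed */
		p->fcredit = cdt;
		news = 1;
	}
	return news;
}
```
\fR
.)b
.lp
A stale or duplicate AK leaves both fields untouched and yields zero,
so the caller has no reason to try sending again.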
.pp
The function \fItp_goodXack()\fR is called when an XAK TPDU
arrives.
It takes the XAK sequence number as an argument and
determines if the XAK acknowledges the last XPD TPDU sent.
If so, it drops the expedited data from the outgoing
expedited data buffer.
By its definition in the TP specification,
the expedited data stream has a window
of size 1,
that is, 
only one expedited datum (packet) can be buffered
at a time.
\fITp_goodXack()\fR returns true if the XAK acknowledged
the last XPD TPDU sent and the data were dropped,
and it returns false if the acknowledgment caused no action to be taken.
.\" NEXT FIGURE
.so figs/mbufrcv.nr
.\".so figs/mbufrcv.grn
.sh 1 "Receive"
.pp
This module reorders incoming TPDUs if necessary,
depacketizes data, passes it to the socket code module,
and determines when acknowledgments should be sent.
The function 
\fItp_stash()\fR
takes a DT TPDU as an argument, and if the TPDU is not in
sequence, it saves the TPDU in a \fItp_rtc\fR structure in
a list, with the TPDUs
kept in order.
When the next expected TPDU arrives, the
list of out-of-order TPDUs is scanned for 
more TPDUs in sequence, updating
a field in the pcb, \fItp_rcvnxt\fR, which
always contains the sequence
number of 
the next expected TPDU.
If an acknowledgment is to be generated
at any time, the value of tp_rcvnxt goes into the
\fIYR-TU-NR\fR\** field of the acknowledgment TPDU.
.(f
\** 
This is the name used in ISO 8073 for the field
which indicates the sequence number of the next expected DT TPDU.
.)f
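.pp
The reordering step can be sketched as an ordered list that is drained
whenever the expected TPDU arrives.
The structures \fIdemo_rtc\fR and \fIdemo_rcv\fR are simplified,
hypothetical stand-ins; duplicates, window checks, and
sequence-number wraparound are left out.
.(b
\fC
```c
#include <stddef.h>

struct demo_rtc {
	struct demo_rtc *next;
	unsigned seq;
};

struct demo_rcv {
	unsigned rcvnxt;		/* next expected sequence number */
	struct demo_rtc *held;		/* out-of-order TPDUs, ascending */
};

/* Accept one DT TPDU.  Returns the number of TPDUs that became
 * deliverable.  Wraparound and duplicates are ignored in this sketch. */
static int demo_stash(struct demo_rcv *r, struct demo_rtc *t)
{
	struct demo_rtc **pp = &r->held;
	int delivered = 0;

	if (t->seq != r->rcvnxt) {
		/* Out of sequence: insert into the ordered list. */
		while (*pp != NULL && (*pp)->seq < t->seq)
			pp = &(*pp)->next;
		t->next = *pp;
		*pp = t;
		return 0;
	}
	/* In sequence: deliver it, then drain any TPDUs that follow. */
	r->rcvnxt++;
	delivered++;
	while (r->held != NULL && r->held->seq == r->rcvnxt) {
		r->held = r->held->next;
		r->rcvnxt++;
		delivered++;
	}
	return delivered;
}
```
\fR
.)b
.lp
After each call \fIrcvnxt\fR is the value that would go into the
next acknowledgment.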
.pp
\fITp_stash()\fR returns true if an acknowledgment needs to be generated
immediately, and false otherwise.
The acknowledgment strategy is therefore implemented in this routine.
Acknowledgments may be generated for one or more of several reasons,
listed below.
\fITp_stash()\fR increments a counter for each of these reasons
for which an acknowledgment is generated, and a counter for TPDUs
that are not acknowledged immediately.
.ip "ACK_STRAT_EACH" 5
The acknowledgment strategy in use calls for acknowledging each 
data packet with an AK TPDU.
.ip "ACK_STRAT_FULLWIN" 5
The acknowledgment strategy in use calls for acknowledging 
upon receiving the DT TPDU that represents the upper window
edge of the last advertised window.
.ip "ACK_DUP" 5
A duplicate data TPDU was received.
.ip "ACK_REORDER" 5
A DT TPDU arrived in the window but out of order.
.ip "ACK_EOT" 5
A DT TPDU arrived, and it had the end-of-TSDU flag set.
.pp
Upon receipt of a DT TPDU that is in order, and upon reordering
DT TPDUs, 
\fItp_stash()\fR
places the TSDUs into the socket's receive
buffer, \fIso->so_rcv\fR, in mbuf chains, with
TSDUs delimited by mbufs of the \fIm_type\fR MT_EOT,
which is a new type in the ARGO kernel.
.CF
illustrates the structure of the receiving socket buffer used
for normal data.
.pp
A separate socket buffer, \fItpcb->tp_Xrcv\fR,
is used for
buffering expedited data.
Only one expedited data packet may reside in this buffer at a time
because the TP standard limits the size of the window on expedited flow
to be 1.
This means the data structures are straightforward;
there is no need to distinguish between separate TSDUs in this socket buffer.
.pp
Credit is determined 
by dividing the total amount of available
space in the receive buffer
by the negotiated maximum TPDU size.
TP can often offer a larger credit than this if it uses
an average of the measured actual TPDU sizes.
This strategy was once an option in the ARGO kernel,
but it was removed because unless the actual TPDU size
is constant, it leads to reneging of credit,
retransmissions, and decreased performance.
It does not work well when there is any fluctuation in the sizes
of TPDUs and it carries the penalty of lengthening the critical path
of the TP entity.
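.pp
The conservative credit computation amounts to one integer division.
The routine \fIdemo_credit()\fR below is a hypothetical sketch of it,
with a guard for a zero TPDU size added as an assumption.
.(b
\fC
```c
/* Credit, in TPDUs, that can safely be offered: the free receive-buffer
 * space divided by the negotiated maximum TPDU size.  Integer division
 * rounds down, so the offer never exceeds the space available. */
static unsigned demo_credit(unsigned free_octets, unsigned max_tpdu)
{
	if (max_tpdu == 0)
		return 0;	/* guard against a missing negotiation */
	return free_octets / max_tpdu;
}
```
\fR
.)b
.lp
For example, with 4096 free octets and a negotiated maximum of
1024 octets the offer is 4 TPDUs.
Dividing by a smaller measured average would raise the offer,
but a run of full-sized TPDUs could then exceed the space
and force the credit to be withdrawn.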
.sh 1 "Major Data Structures and Types"
.pp
In addition to the types commonly used in the kernel,
such as 
.(b
\fC
.TS
tab(+);
l l l l.
 +typedef+unsigned char+u_char;
 +typedef+unsigned int+u_int;
 +typedef+unsigned short+u_short;
.TE
\fR
.)b
TP uses the following types:
.(b
\fC
.TS
tab(+);
l l l l.
 +typedef+unsigned int+SeqNum;
 +typedef+unsigned short+RefNum;
 +typedef+int+ProtoHook;
.TE
\fR
.)b
.pp
Sequence numbers can be either 7 or 31 bits.
An unsigned integer is used in all cases, and the proper type
of arithmetic is performed with bit masks.
Reference numbers are 16 bits.
ProtoHook is the type of the procedures that are in switch
tables, which,
although they are not functions,
are declared \fIint\fR rather than \fIvoid\fR
to be consistent with the rest of the kernel.
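.pp
Masked sequence arithmetic can be sketched as follows.
This is an illustration under assumptions, not the ARGO macros:
the structure \fIdemo_seq\fR and its routines are hypothetical,
and the comparison rule (forward distance less than half the space)
is one common way to handle wraparound.
.(b
\fC
```c
/* Hypothetical per-connection values, set once at connection setup. */
struct demo_seq {
	unsigned seqmask;	/* bits in use: 0x7f or 0x7fffffff */
	unsigned seqhalf;	/* half of the sequence space */
};

static struct demo_seq demo_seq_init(int extended)
{
	struct demo_seq s;

	s.seqmask = extended ? 0x7fffffffu : 0x7fu;
	s.seqhalf = (s.seqmask >> 1) + 1;
	return s;
}

/* Add within the sequence space. */
static unsigned demo_seq_add(const struct demo_seq *s, unsigned a, unsigned b)
{
	return (a + b) & s->seqmask;
}

/* True if a is later than b, allowing for wraparound: the forward
 * distance from b to a is nonzero and less than half the space. */
static int demo_seq_gt(const struct demo_seq *s, unsigned a, unsigned b)
{
	unsigned d = (a - b) & s->seqmask;

	return d != 0 && d < s->seqhalf;
}
```
\fR
.)b
.lp
With 7-bit numbers, adding 1 to 127 yields 0,
and 0 compares as later than 127.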
.pp
The following structures are fundamental
types used throughout TP,
in addition to those already described in the 
section,
"The Design of the Transport Entity".
.(b
\fC
.TS
tab(+);
l s s s.
struct tp_ref {
.T&
l l l l.
+u_char+tpr_state;+/* REF_FROZEN...*/
+struct Ccallout+tpr_callout[N_CTIMERS];+/* C timers */
+struct Ecallout+tpr_calltodo;+/* E timers list */
+struct tp_pcb+*tpr_pcb;+/* --> PCB */
};
.TE
\fR
.)b
.lp
The reference structure is logically a part of the protocol
control block and it is linked to a pcb, but it may outlive
a pcb.
When a connection is dissolved, the pcb may be recycled
but the reference structure must remain until the reference
timer goes off.
The field \fItpr_state\fR takes the values
REF_FROZEN (a reference timer is ticking),
REF_OPEN (in use, has timers and an associated pcb),
REF_OPENING (has a pcb but no timers), and
REF_FREE (free to reallocate).
.pp
The TP protocol control block is too large to fit into
one mbuf structure so it comprises two structures
linked together, the 
\fItp_pcb\fR structure and the
\fItp_pcb_aux\fR structure.
The \fItp_pcb_aux\fR structure contains
items that are used less frequently than those in
the former structure, since each access to these
items requires a second pointer dereference.
.(b
\fC
.TS
tab(+);
l s s s.
struct tp_pcb_aux {
.T&
l l l s.
 +struct sockbuf+tpa_Xsnd;+/* for expedited data */
+struct sockbuf+tpa_Xrcv;+/* for expedited data */
+u_char +tpa_vers;+/* protocol version */
+u_char +tpa_peer_acktime;+/* to compute DT TPDU
+++retrans timer value */
+SeqNum+tpa_Xsndnxt;+/* seq # of
+++next XPD to send */
+SeqNum+tpa_Xuna;+/* seq # of 
+++unacked XPD */
+SeqNum+tpa_Xrcvnxt;+/* next XPD seq #
+++expect to recv */
+/* addressing */
+u_short+tpa_domain;+/* domain AF_ISO,...*/
+u_short+tpa_fsuffixlen;+/* foreign suffix */
+u_char+tpa_fsuffix[MAX_TSAP_SEL_LEN];+
+u_short+tpa_lsuffixlen;+/* local suffix */
+u_char+tpa_lsuffix[MAX_TSAP_SEL_LEN];+
.T&
l s s s.
 +/* AK subsequencing */
.T&
l l l s.
 +u_short+tpa_s_subseq;+/* next subseq to send */
+u_short+tpa_r_subseq;+/* highest recv subseq */
};
.TE
\fR
.)b
.pp
The major portion of the protocol control block is in the
\fItp_pcb\fR structure:
.(b
\fC
.TS
tab(%);
l s s s.
struct tp_pcb {
.\" *************************************** 
.T&
l l l l.
.\" The next line sets the spacing for the table: 1+3 17+3 17+3 13+3
 %                 %                 %
.\"456789 123456789- 123456789 123456-789 123456789 1234567890
.\"
 %struct tp_ref%*tp_refp;%    
.T&
l l l s.
%%/* reference structure */%
.\" *************************************** 
.T&
l l l l.
 %struct tp_pcb_aux%*tp_aux;% 
.T&
l l l s.
 %%/*rest of tpcb (auxiliary struct)*/%
.\" *************************************** 
.T&
l l l l.
 %caddr_t%tp_npcb;%/* to ll pcb */
%struct nl_protosw%*tp_nlproto;%
.T&
l l l s.
 % %/* domain-dependent routines */%
.\" *************************************** 
.T&
l l l l.
 %struct socket%*tp_sock;%/* back ptr */
.\" *************************************** 
.T&
l s s s.

/* local and foreign reference numbers: */
.T&
l l l l.
 %RefNum%tp_lref;% 
%RefNum%tp_fref;%
.\" *************************************** 
.T&
l s s s.
.\"456789 123456789 123456789 123456789 123456789 1234567890

/* Stuff for sequence space arithmetic: 
 * Maintaining 2 sequence spaces is a pain so we set these
 * values once at connection establishment time. Sequence
 * number arithmetic is a set of macros which uses these.
 * Sequence numbers are stored as 32 bits.
 * tp_seqmask tells which of the 32 bits is used.
 * tp_seqbit  is the lsb that is not used.  When set,
 *   it indicates wraparound has occurred.
 * tp_seqhalf is the value that is half the sequence space.
 *   (or half plus one).
 */
.T&
l l l l.
%u_int%tp_seqmask;%/* mask */
%u_int%tp_seqbit;%/* wraparound */
%u_int%tp_seqhalf;%/* half space */
.\" *************************************** 
.T&
l s s s.

/* flags:  values are defined in tp_user.h.
 * Here we keep such info as which options 
 * are in use: checksum, extended format,
 * flow control in class 2, etc.
 * See tp(4p) man page.
 */
.\" *************************************** 
.T&
l l l l.
 %u_short%tp_state;%/* fsm */
%short%tp_retrans;%
.T&
l l l s.
 % % /* # times to retransmit */% 
.\" *************************************** 
.T&
l s s s.

/* credit & sequencing info for SENDING: */
.T&
l l l s.
 %u_short%tp_fcredit;%
 % %/* remote real window */%
 %u_short%tp_cong_win;%
 % %/* remote congestion window */%
.\" *************************************** 
%SeqNum%tp_snduna;%
.T&
l l l s.
 % %/* seq # of lowest unacked DT */% 
.\" *************************************** 
.T&
l l l l.
 %struct tp_rtc    %*tp_snduna_rtc;% 
.T&
l l l s.
 % %/* ptr to mbufs containing lowest% 
%% * unacked TPDUs sent so far%
%% */%
.\" *************************************** 
.T&
l l l l.
 %SeqNum%tp_sndhiwat;% 
.T&
l l l s.
 % %/* highest DT sent yet */% 
.\" *************************************** 
.T&
l l l l.
 %struct tp_rtc%*tp_sndhiwat_rtc;% 
.T&
l l l s.
 % %/* ptr to mbufs containing the last% 
%% * DT sent - this is the last item %
%% * on the list that starts%
%% * at tp_snduna_rtc%
%% */%
.\" *************************************** 
.T&
l l l l.
 %int %tp_Nwindow;%/* for perf. measmt */
.\" *************************************** 
.T&
l s s s.

/* credit & sequencing info for RECEIVING: */
.\" *************************************** 
.T&
l l l s.
 %SeqNum%tp_sent_lcdt;%
 %%/* cdt according to last AK sent */%
 %SeqNum%tp_sent_uwe;% 
 % %/* upper window edge, according to% 
%% * the last AK sent %
%% */%
 %SeqNum%tp_sent_rcvnxt;% 
 % %/* rcvnxt, according to% 
%% * the last AK sent%
%% */%
.\" *************************************** 
.T&
l l l l.
 %short%tp_lcredit;%/* local */
.\" *************************************** 
.T&
l l l l.
 %SeqNum%tp_rcvnxt;% 
.T&
l l l s.
 % %/* next DT seq# we expect to recv */% 
.\" *************************************** 
.T&
l l l l.
 %struct tp_rtc%*tp_rcvnxt_rtc;% 
.T&
l l l s.
 % %/* ptr to mbufs containing unacked % 
%% * DTs received out of order, and %
%% * which we haven't acknowledged%
%% */%
.\" *************************************** 
.TE
.TS
tab(%);
l s s s.
/* Items kept in the aux structure: */

.\" *************************************** 
.T&
l s s l.
#define  tp_vers%tp_aux->tpa_vers
#define  tp_peer_acktime%tp_aux->tpa_peer_acktime
#define  tp_Xsnd%tp_aux->tpa_Xsnd
#define  tp_Xrcv%tp_aux->tpa_Xrcv
#define  tp_Xrcvnxt%tp_aux->tpa_Xrcvnxt
#define  tp_Xsndnxt%tp_aux->tpa_Xsndnxt
#define  tp_Xuna%tp_aux->tpa_Xuna
#define  tp_domain%tp_aux->tpa_domain
#define  tp_fsuffixlen%tp_aux->tpa_fsuffixlen
#define  tp_fsuffix%tp_aux->tpa_fsuffix
#define  tp_lsuffixlen%tp_aux->tpa_lsuffixlen
#define  tp_lsuffix%tp_aux->tpa_lsuffix
#define  tp_s_subseq%tp_aux->tpa_s_subseq
#define  tp_r_subseq%tp_aux->tpa_r_subseq
.\" *************************************** 
.T&
l s s s.
 % % % 
/* parameters per-connection controllable by user: */
.\" *************************************** 
.T&
l l l l.
 %struct%tp_conn_param%_tp_param; 
 % % %
.\" *************************************** 
.T&
l s s l.
#define  tp_Nretrans%_tp_param.p_Nretrans
#define  tp_dr_ticks%_tp_param.p_dr_ticks
#define  tp_cc_ticks%_tp_param.p_cc_ticks
#define  tp_dt_ticks%_tp_param.p_dt_ticks
#define  tp_xpd_ticks%_tp_param.p_x_ticks
#define  tp_cr_ticks%_tp_param.p_cr_ticks
#define  tp_keepalive_ticks%_tp_param.p_keepalive_ticks
#define  tp_sendack_ticks%_tp_param.p_sendack_ticks
#define  tp_refer_ticks%_tp_param.p_ref_ticks
#define  tp_inact_ticks%_tp_param.p_inact_ticks
#define  tp_xtd_format%_tp_param.p_xtd_format
#define  tp_xpd_service%_tp_param.p_xpd_service
#define  tp_ack_strat%_tp_param.p_ack_strat
#define  tp_rx_strat%_tp_param.p_rx_strat
#define  tp_use_checksum%_tp_param.p_use_checksum
#define  tp_tpdusize%_tp_param.p_tpdusize
#define  tp_class%_tp_param.p_class
#define  tp_winsize%_tp_param.p_winsize
#define  tp_netservice%_tp_param.p_netservice
#define  tp_no_disc_indications%_tp_param.p_no_disc_indications
#define  tp_dont_change_params%_tp_param.p_dont_change_params
.\" *************************************** 
.TE
.\" *************************************** 
.\" *************************************** 
.\" *************************************** 
.TS
tab(%);
l l l l.
.\" The next line sets the spacing for the table: 1+3 17+3 17+3 13+3
.\"456789 123456789- 123456789 123456-789 123456789 1234567890
.\"
.T&
l l l s.
 %%/* log2(the negotiated max size) */% 
.T&
l l l l.
 %int%tp_l_tpdusize;%/* # bytes */
.\" *************************************** 
 %struct timeval%tp_rtt;% 
.T&
l l l s.
 % %/* smoothed avg round-trip time */% 
 %struct timeval%tp_rtv;% 
 % %/* std deviation of round-trip time */% 
%struct timeval%tp_rttemit[ TP_RTT_NUM + 1 ];%
%%/* times that the last TP_RTT_NUM %
%% * DT_TPDUs were transmitted %
%% */%
.\" *************************************** 
 %unsigned % % 
%  tp_sendfcc:1,%/* shall next ack %
% %include flow control conf. param? */%
.\" *************************************** 
.T&
l l l s.
 %  tp_trace:1,%/* is this pcb being traced?% 
%% * (not used yet) %
%% */%
.\" *************************************** 
%  tp_perf_on:1,%/* statistics being kept? */% 
.\" *************************************** 
%  tp_reneged:1,%/* have we reneged on credit%
%% * since the last AK TPDU was sent? %
%% */%
%  tp_decbit:4,%/* congestion experienced? */%
%  tp_flags:8,%/* see #defines below */%
.\" *************************************** 
%  tp_unused:16;%%
.T&
l s s l.
#define  TPF_XPD_PRESENT%TPFLAG_XPD_PRESENT
#define  TPF_NLQOS_PDN%TPFLAG_NLQOS_PDN
#define  TPF_PEER_ON_SAMENET%TPFLAG_PEER_ON_SAMENET
%%%
.\" *************************************** 
.T&
l l l l.
 %struct tp_pmeas%*tp_p_meas;% 
.T&
l l l s.
 % %/* ptr to mbuf to hold the perf.% 
%% * statistics structure %
%% */%
.\" *************************************** 
};
.TE
\fR
.\"
.\" end of tpcb structure (thank you)
.\"
.)b
.fi
.sh 1 "Sequence Number Arithmetic"
.pp
Sequence numbers in TP can be either 7 bits 
(\*(lqnormal format\*(rq)
or 31 bits
(\*(lqextended format\*(rq).
Sequence numbers are unsigned integers,
regardless of their format.
Three fields are kept in the pcb to manage the sequence
number arithmetic:
.(b
\fC
.TS
tab(+);
l l l l.
 +u_int+tp_seqmask;+/* mask for seq space */
 +u_int+tp_seqbit;+/* bit for seq # wraparound */
 +u_int+tp_seqhalf;+/* half the seq space */
.TE
\fR
.)b
.lp
\fITp_seqmask\fR 
is a bit mask indicating which bits are legitimate 
for a sequence number of either format.
It takes the value 0x7f if 7-bit sequence numbers are in use,
and 0x7fffffff if 31-bit sequence numbers are in use.
\fITp_seqbit\fR 
is the bit that becomes set when a sequence number wraps around
while being incremented.
Its value is 0x80 for normal format, 0x80000000 for extended format.
\fITp_seqhalf\fR 
takes the value which is in the middle of the sequence space,
0x40 for normal format,
and
0x40000000 for extended format.
.(b
.nf
The macro 
.fi
\fC
.TS
tab(+);
l l l l.
     SEQ(tpcb, x)
.TE
\fR
.)b
.lp
extracts a sequence number from the location
in which it is stored.
.pp
The macros
.(b
\fC
.TS
tab(+);
l l s s l.
 +SEQ_GT(tpcb, seq, t)+is seq > t?
 +SEQ_GEQ(tpcb, seq, t)+is seq >= t?
 +SEQ_LT(tpcb, seq, t)+is seq < t?
 +SEQ_LEQ(tpcb, seq, t)+is seq <= t?
 +SEQ_INC(tpcb, seq)+seq\+\+
 +SEQ_DEC(tpcb, seq)+seq--
 +SEQ_SUB(tpcb, seq, amt)+seq -= amt
 +SEQ_ADD(tpcb, seq, amt)+seq \+= amt
.TE
\fR
.)b
.lp
perform the indicated comparisons and arithmetic
on their arguments.
.pp
An example of how these macros
are used is as follows.
To determine if a sequence
number \fIseq\fR is in a receive window
bounded by
\fIlwe\fR and \fIuwe\fR,
we define the
macro
.(b
\fC
.TS
tab(+);
l l.
#define+IN_RWINDOW(tpcb, seq, lwe, uwe)\\
+( SEQ_GEQ(tpcb, seq, lwe) && SEQ_LT(tpcb, seq, uwe) )
.TE
\fR
.)b
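.pp
The following is a minimal sketch, in C, of one way such
sequence arithmetic can be built from the three fields.
It is not the ARGO source:
the structure \fIseq_space\fR and the function names are hypothetical,
and functions stand in for the macros.
Comparisons are made modulo the size of the sequence space,
so a number is taken to be greater than another when it lies
less than half the space ahead of it.

```c
#include <assert.h>

typedef unsigned int SeqNum;

/* Hypothetical stand-in for the three pcb fields described above. */
struct seq_space {
    unsigned int seqmask;   /* 0x7f or 0x7fffffff */
    unsigned int seqbit;    /* 0x80 or 0x80000000 */
    unsigned int seqhalf;   /* 0x40 or 0x40000000 */
};

/* Fill in the fields once, as at connection establishment. */
static void seq_init(struct seq_space *s, int extended)
{
    s->seqmask = extended ? 0x7fffffffu : 0x7fu;
    s->seqbit  = s->seqmask + 1u;
    s->seqhalf = s->seqbit >> 1;
}

/* seq + amt, wrapped back into the sequence space. */
static SeqNum seq_add(const struct seq_space *s, SeqNum seq, unsigned int amt)
{
    assert(amt <= s->seqmask);
    return (seq + amt) & s->seqmask;
}

/* Is a > b?  True when a is ahead of b by less than half the space. */
static int seq_gt(const struct seq_space *s, SeqNum a, SeqNum b)
{
    unsigned int d = (a - b) & s->seqmask;
    return d != 0 && d < s->seqhalf;
}

/* Is a >= b? */
static int seq_geq(const struct seq_space *s, SeqNum a, SeqNum b)
{
    return ((a - b) & s->seqmask) < s->seqhalf;
}

/* Is a < b? */
static int seq_lt(const struct seq_space *s, SeqNum a, SeqNum b)
{
    return seq_gt(s, b, a);
}

/* Same test as IN_RWINDOW: lwe <= seq < uwe, allowing for wraparound. */
static int in_rwindow(const struct seq_space *s, SeqNum seq,
                      SeqNum lwe, SeqNum uwe)
{
    return seq_geq(s, seq, lwe) && seq_lt(s, seq, uwe);
}
```

.lp
With normal format, a window with lower edge 126 and upper edge 2
wraps past the end of the sequence space.
Under this sketch, sequence number 0 is inside that window
and sequence number 5 is not.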
.sh 1 "TP Implementation Options"
.pp
The transport protocol specification leaves several
things to the discretion of the implementor,
some of which may affect the performance
of individual connections and
aggregate performance.
Wherever different strategies are likely to favor
the performance of
individual connections to the detriment of aggregate performance
or vice versa, the
various strategies are under the control of options via the
\fIgetsockopt()\fR and
\fIsetsockopt()\fR system calls (see the manual pages
\fIgetsockopt(2)\fR,
\fIsetsockopt(2)\fR  
and
\fItp(4p)\fR  
for details).
In some cases the preferred strategies differ for the different
subnetworks, so the strategies chosen will be determined
by the subnetwork in use.
.sh 2 "TPDU size"
.pp
The limitation of the maximum TPDU size to a power of two is
unfortunate in the LAN environment.
For example, if the maximum NSDU size is around 1500, as in the case of an
Ethernet,
using a maximum TPDU size of 1024 reduces
the possible throughput by approximately 30%.
To avoid this loss, ARGO TP negotiates a maximum TPDU size of 2048 and
generates TPDUs of size around 1500.
Obviously this works well only when the peer is known to be 
using the same scheme (so that the peer
doesn't send TPDUs of size 2048 and cause its
network layer to fragment the TPDUs).
This is likely to be the case in a LAN where
all protocol entities are under the same administrative
control.
The maximum TPDU size negotiated is under the control of the user,
so
it is possible to prevent this scheme from being used
by default
when the peer is not on the same LAN, by
setting the \fItp.tpdusize\fR parameter in the ARGO directory service
file to
something less than the network's maximum transmission
unit.
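.pp
The arithmetic behind the 30% figure can be sketched in C as follows.
The function names are hypothetical,
and the calculation ignores TPDU and NPDU header overhead,
so it is only an approximation.

```c
#include <assert.h>

/* Largest power of two that does not exceed nsdu (nsdu must be >= 1). */
static unsigned int pow2_tpdu_size(unsigned int nsdu)
{
    unsigned int size = 1;

    assert(nsdu >= 1);
    while (size <= nsdu / 2)
        size *= 2;
    return size;
}

/* Percentage of each NSDU left unused when TPDUs are capped at tpdu bytes. */
static unsigned int percent_unused(unsigned int nsdu, unsigned int tpdu)
{
    assert(nsdu >= 1);
    if (tpdu >= nsdu)
        return 0;
    return (nsdu - tpdu) * 100u / nsdu;
}
```

.lp
For a maximum NSDU size of 1500, the largest power of two that fits is 1024,
which leaves 476 of every 1500 bytes unused, or about 31%.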
.\"***********************************************************
.sh 2 "Congestion Window Strategy"
.pp
The congestion window strategy from the
DoD Internet 
was adapted for use with TP.
The strategy is intended to minimize the 
adverse effect
of transport's retransmission on an
already congested network.
.pp
A TP entity keeps two notions of the peer's window:
the real window, which is that advertised by the peer
in AK TPDUs, and the congestion window, which is a locally
controlled window.
TP uses the smaller of the two windows when transmitting.
The congestion window starts small, which keeps a
new connection from overloading the network with a sudden
burst of packets
immediately after connection establishment.
This is called \fIslow start\fR. 
For each successful acknowledgment received, the congestion
window grows by one, until eventually the real window
is the one in use.
If a retransmission timer expires, the congestion window
is reset to size one.
.pp
The congestion window strategy is used for class 4 unless
the transport user requests that it not be used.
The slow start strategy is used for traffic over a PDN
unless
the transport user requests that it not be used.
Slow start is not used for traffic over a LAN unless
its use is requested by the transport user.
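.pp
The window bookkeeping described above can be sketched in C as follows.
This is an illustration rather than the ARGO code:
the structure and function names are hypothetical,
the initial window of one TPDU is an assumption
(the text says only that the window starts small),
and capping the congestion window at the real window is
one reading of
\*(lquntil eventually the real window is the one in use\*(rq.

```c
#include <assert.h>

/* Hypothetical stand-in for the tp_fcredit and tp_cong_win pcb fields. */
struct send_credit {
    unsigned short fcredit;    /* real window advertised by the peer */
    unsigned short cong_win;   /* locally controlled congestion window */
};

/* Start of a connection: the congestion window starts at one (assumed). */
static void cw_open(struct send_credit *c, unsigned short fcredit)
{
    c->fcredit = fcredit;
    c->cong_win = 1;
}

/* The window used when transmitting is the smaller of the two. */
static unsigned short cw_usable(const struct send_credit *c)
{
    assert(c->cong_win >= 1);
    return c->cong_win < c->fcredit ? c->cong_win : c->fcredit;
}

/* Each successful acknowledgment grows the congestion window by one. */
static void cw_on_ack(struct send_credit *c)
{
    if (c->cong_win < c->fcredit)
        c->cong_win++;
}

/* Expiry of a retransmission timer resets the window to size one. */
static void cw_on_rexmit_timeout(struct send_credit *c)
{
    c->cong_win = 1;
}
```

.lp
With a peer that advertises a credit of four,
the usable window in this sketch is one TPDU at first,
reaches four after three acknowledgments,
and falls back to one when a retransmission timer expires.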
.\"***********************************************************
.sh 2 "Retransmission strategies"
.pp
A retransmission timer is invoked for each set of DT TPDUs
sent in one send operation (call to \fItp_send()\fR).
This set of packets is called the \fIsend window\fR for the purpose
of this discussion.
.pp
The number of TPDUs 
in a send window
depends on the remote credit and the amount of data
in the local send buffers.
When a retransmission timer goes off, the lower
window edge 
is reevaluated but the upper window edge is not reevaluated.
.pp
There are several retransmission strategies implemented in
ARGO TP.
The choice of strategies is the user's, and is made with the 
\fIsetsockopt()\fR system call.
The strategies are summarized here:
.ip "Retransmit LWE TPDU only:" 5
Only the TPDU representing the new lower window edge 
is retransmitted.
This is the default retransmission strategy.
.ip "Retransmit whole send window:" 5
Retransmission begins with the new lower window edge
and continues up to the old upper window edge.
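.pp
The difference between the two strategies can be sketched in C as follows.
The enumeration and function names are hypothetical,
and the old upper window edge is assumed to be exclusive
(one past the last TPDU sent),
as the upper edge is in the IN_RWINDOW macro.

```c
#include <assert.h>

/* Hypothetical names for the two strategies listed above. */
enum rx_strat { RX_LWE_ONLY, RX_WHOLE_WINDOW };

/*
 * Number of TPDUs to retransmit, starting at new_lwe.
 * seqmask is 0x7f or 0x7fffffff; old_uwe is assumed exclusive.
 */
static unsigned int rexmit_count(enum rx_strat strat, unsigned int seqmask,
                                 unsigned int new_lwe, unsigned int old_uwe)
{
    unsigned int outstanding;

    assert(new_lwe <= seqmask && old_uwe <= seqmask);
    outstanding = (old_uwe - new_lwe) & seqmask;
    if (outstanding == 0)
        return 0;               /* everything has been acknowledged */
    return strat == RX_LWE_ONLY ? 1 : outstanding;
}
```

.lp
With normal format, a new lower window edge of 126 and an old upper
window edge of 3, the first strategy resends one TPDU (number 126)
and the second resends five (numbers 126, 127, 0, 1 and 2).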
.pp
The value of the data retransmission timer
adapts to the average round trip time and the standard deviation of
the round trip time.
A round trip time is the time that passes between
the moment of a packet's first transmission and 
the moment it is first acknowledged.
The average round trip time
is kept by the sending side of TP, using
a formula for 
smoothing the average:
.(b
\fC
.TS
tab(+);
l l l l.
#define+TP_RTT_ALPHA+3
#define+TP_RTV_ALPHA+2
+++
#define+SMOOTH(alpha, old, new) \\
+(((new-old) >> alpha ) \+ (old) )
.TE
\fR
.)b
.lp
The times included in the average are chosen as follows.
The time of 
each packet's initial transmission is kept (for the last
\fIN\fR packets, where \fIN\fR is a defined constant).
When an AK TPDU arrives, ARGO TP subtracts the initial transmission
time for the lowest unacknowledged sequence number that was
acknowledged by this AK TPDU from the current time,
and applies the resulting time to the average.
Hence, not all packets are included in this average,
which is as it should be since
the purpose of this measurement is 
to find a good value for the retransmission timer.
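.pp
A worked example may make the smoothing formula concrete.
The sketch below restates SMOOTH as a C function on plain integers;
the function name and the choice of milliseconds as the unit are
assumptions made for illustration
(the pcb keeps these values in \fIstruct timeval\fR).
Because right-shifting a negative value is implementation-defined in C,
the sketch rounds a negative difference toward minus infinity explicitly,
which is what an arithmetic shift would produce.

```c
#include <assert.h>

#define TP_RTT_ALPHA 3
#define TP_RTV_ALPHA 2

/*
 * old + (sample - old) / 2^alpha, with the quotient rounded toward
 * minus infinity.  Values are in milliseconds (an assumed unit).
 */
static long smooth(int alpha, long old, long sample)
{
    long diff = sample - old;
    long step;

    assert(alpha >= 0 && alpha < 31);
    if (diff >= 0)
        step = diff >> alpha;
    else
        step = -((-diff + (1L << alpha) - 1) >> alpha);
    return old + step;
}
```

.lp
With TP_RTT_ALPHA equal to 3, each new measurement moves the average by
one eighth of the difference:
an average of 100 and a measurement of 180 give a new average of 110,
and an average of 100 and a measurement of 20 give 90.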
.pp
Each time part of a window is retransmitted,
the retransmission timer for that window is increased.
This does not affect the retransmission timers for other windows.
.\"***********************************************************
.sh 2 "Acknowledgment strategies"
.pp
The transport protocol specification
requires acknowledgments to be sent immediately
upon receipt
of CC TPDUs (in class 4), XPD TPDUs, and DT TPDUs containing an
EOT marker, and at other times as required for flow control.
Otherwise, acknowledgments may be delayed.
In addition to the times when an acknowledgment is required,
ARGO TP transmits an AK TPDU whenever the user receives some data,
thereby increasing the size of the window.
For those times when
immediate acknowledgment is optional,
ARGO TP offers two acknowledgment strategies:
.ip "     Acknowledge each TPDU" 10
Upon receipt of a DT TPDU, an AK TPDU is sent.
.ip "     Acknowledge full window" 10
Acknowledgment is issued
upon receipt of enough data to
consume the last advertised credit.
.pp
The latter strategy
requires a timer to trigger an acknowledgment
in case the peer doesn't send the entire window 
quickly.
This timer is called the
\fIsendack timer\fR.
The upper bound on the value of this timer 
is called the \fIlocal acknowledgment time\fR.
The local acknowledgment time may be \*(lqadvertised\*(rq to the 
peer during connection establishment, and the
peer may choose to use this value to
adjust its retransmission timers.
The ARGO TP entity advertises its local acknowledgment time
on a CR TPDU, but it is not 
constrained by 
the remote acknowledgment time, should the peer 
advertise it.
Instead,
ARGO TP adapts its sendack timer
to the behavior of the connection.
.pp
Under the assumption that the round trip time is
often 
symmetric,
and lacking 
a method to measure
the round trip time in the other direction,
ARGO TP uses the measured average round trip time
to adjust the sendack timer.
.pp
The choice of strategies is made with the
\fIsetsockopt()\fR system call.
The default strategy is
to
delay acknowledgments until the most recently advertised window is filled.
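.pp
The decision made on receipt of a DT TPDU can be sketched in C as follows.
This is an illustration under stated assumptions, not the ARGO code:
the enumeration, structure and function names are hypothetical,
and only the cases described in this section are modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the two strategies described above. */
enum ack_strat { ACK_EACH_TPDU, ACK_FULL_WINDOW };

/* Hypothetical summary of the state seen when deciding whether to ack. */
struct dt_event {
    int eot;               /* the DT TPDU carried an EOT marker */
    int credit_consumed;   /* the last advertised credit is used up */
    int sendack_expired;   /* the sendack timer has gone off */
};

/* Returns nonzero if an AK TPDU should be sent now. */
static int should_send_ak(enum ack_strat strat, const struct dt_event *ev)
{
    assert(ev != NULL);
    if (ev->eot)
        return 1;          /* immediate acknowledgment is required */
    if (strat == ACK_EACH_TPDU)
        return 1;          /* acknowledge every DT TPDU */
    /* Full-window strategy: wait for the window to fill or the timer. */
    return ev->credit_consumed || ev->sendack_expired;
}
```

.lp
Under the default (full window) strategy in this sketch,
a DT TPDU without an EOT marker that leaves credit unused
is not acknowledged until the sendack timer expires.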