4.2BSD/usr/doc/net/6.t
.nr H2 1
.ds RH "Internal layering
.NH
\s+2Internal layering\s0
.PP
The internal structure of the network system is divided into
three layers. These
layers correspond to the services provided by the socket
abstraction, those provided by the communication protocols,
and those provided by the hardware interfaces. The communication
protocols are normally layered into two or more individual
cooperating layers, though they are collectively viewed
in the system as one layer providing services supportive
of the appropriate socket abstraction.
.PP
The following sections describe the properties of each layer
in the system and the interfaces each must conform to.
.NH 2
Socket layer
.PP
The socket layer deals with the interprocess communications
facilities provided by the system. A socket is a bidirectional
endpoint of communication which is ``typed'' by the semantics
of communication it supports. The system calls described in
the \fI4.2BSD System Manual\fP are used to manipulate sockets.
.PP
A socket consists of the following data structure:
.DS
._f
struct socket {
short so_type; /* generic type */
short so_options; /* from socket call */
short so_linger; /* time to linger while closing */
short so_state; /* internal state flags */
caddr_t so_pcb; /* protocol control block */
struct protosw *so_proto; /* protocol handle */
struct socket *so_head; /* back pointer to accept socket */
struct socket *so_q0; /* queue of partial connections */
short so_q0len; /* partials on so_q0 */
struct socket *so_q; /* queue of incoming connections */
short so_qlen; /* number of connections on so_q */
short so_qlimit; /* max number queued connections */
struct sockbuf so_snd; /* send queue */
struct sockbuf so_rcv; /* receive queue */
short so_timeo; /* connection timeout */
u_short so_error; /* error affecting connection */
short so_oobmark; /* chars to oob mark */
short so_pgrp; /* pgrp for signals */
};
.DE
.PP
Each socket contains two data queues, \fIso_rcv\fP and \fIso_snd\fP,
and a pointer to routines which provide supporting services.
The type of the socket,
\fIso_type\fP is defined at socket creation time and used in selecting
those services which are appropriate to support it. The supporting
protocol is selected at socket creation time and recorded in
the socket data structure for later use. Protocols are defined
by a table of procedures, the \fIprotosw\fP structure, which will
be described in detail later. A pointer to a protocol specific
data structure,
the ``protocol control block'' is also present in the socket structure.
Protocols control this data structure and it normally includes a
back pointer to the parent socket structure(s) to allow easy
lookup when returning information to a user
(for example, placing an error number in the \fIso_error\fP
field). The other entries in the socket structure are used in
queueing connection requests, validating user requests, storing
socket characteristics (e.g.
options supplied at the time a socket is created), and maintaining
a socket's state.
.PP
Processes ``rendezvous at a socket'' in many instances. For instance,
when a process wishes to extract data from a socket's receive queue
and it is empty, or lacks sufficient data to satisfy the request,
the process blocks, supplying the address of the receive queue as
an ``wait channel' to be used in notification. When data arrives
for the process and is placed in the socket's queue, the blocked
process is identified by the fact it is waiting ``on the queue''.
.NH 3
Socket state
.PP
A socket's state is defined from the following:
.DS
.if t .ta .6i 2.3i 3.0i
.if n .ta .84i 2.5i 3.20i
#define SS_NOFDREF 0x001 /* no file table ref any more */
#define SS_ISCONNECTED 0x002 /* socket connected to a peer */
#define SS_ISCONNECTING 0x004 /* in process of connecting to peer */
#define SS_ISDISCONNECTING 0x008 /* in process of disconnecting */
#define SS_CANTSENDMORE 0x010 /* can't send more data to peer */
#define SS_CANTRCVMORE 0x020 /* can't receive more data from peer */
#define SS_CONNAWAITING 0x040 /* connections awaiting acceptance */
#define SS_RCVATMARK 0x080 /* at mark on input */
#define SS_PRIV 0x100 /* privileged */
#define SS_NBIO 0x200 /* non-blocking ops */
#define SS_ASYNC 0x400 /* async i/o notify */
.DE
.PP
The state of a socket is manipulated both by the protocols
and the user (through system calls).
When a socket is created the state is defined based on the type of
input/output the user wishes to perform. ``Non-blocking'' I/O implies
a process should never be blocked to await resources. Instead, any
call which would block returns prematurely
with the error EWOULDBLOCK (the service request may be partially
fulfilled, e.g. a request for more data than is present).
.PP
If a process requested ``asynchronous'' notification of events
related to the socket the SIGIO signal is posted to the process.
An event is a change in the socket's state,
examples of such occurances are: space
becoming available in the send queue, new data available in the
receive queue, connection establishment or disestablishment, etc.
.PP
A socket may be marked ``priviledged'' if it was created by the
super-user. Only priviledged sockets may
send broadcast packets, or bind
addresses in priviledged portions of an address space.
.NH 3
Socket data queues
.PP
A socket's data queue contains a pointer to the data stored in
the queue and other entries related to the management of
the data. The following structure defines a data queue:
.DS
._f
struct sockbuf {
short sb_cc; /* actual chars in buffer */
short sb_hiwat; /* max actual char count */
short sb_mbcnt; /* chars of mbufs used */
short sb_mbmax; /* max chars of mbufs to use */
short sb_lowat; /* low water mark */
short sb_timeo; /* timeout */
struct mbuf *sb_mb; /* the mbuf chain */
struct proc *sb_sel; /* process selecting read/write */
short sb_flags; /* flags, see below */
};
.DE
.PP
Data is stored in a queue as a chain of mbufs. The actual
count of characters as well as high and low water marks are
used by the protocols in controlling the flow of data.
The socket routines cooperate in implementing the flow control
policy by blocking a process when it requests to send data and
the high water mark has been reached, or when it requests to
receive data and less than the low water mark is present
(assuming non-blocking I/O has not been specified).
.PP
When a socket is created, the supporting protocol ``reserves'' space
for the send and receive queues of the socket.
The actual storage associated with a
socket queue may fluctuate during a socket's lifetime, but is assumed
this reservation will always allow a protocol to acquire enough memory
to satisfy the high water marks.
.PP
The timeout and select values are manipulated by the socket routines
in implementing various portions of the interprocess communications
facilities and will not be described here.
.PP
A socket queue has a number of flags used in synchronizing access
to the data and in acquiring resources;
.DS
._d
#define SB_LOCK 0x01 /* lock on data queue (so_rcv only) */
#define SB_WANT 0x02 /* someone is waiting to lock */
#define SB_WAIT 0x04 /* someone is waiting for data/space */
#define SB_SEL 0x08 /* buffer is selected */
#define SB_COLL 0x10 /* collision selecting */
.DE
The last two flags are manipulated by the system in implementing
the select mechanism.
.NH 3
Socket connection queueing
.PP
In dealing with connection oriented sockets (e.g. SOCK_STREAM)
the two sides are considered distinct. One side is termed
\fIactive\fP, and generates connection requests. The other
side is called \fIpassive\fP and accepts connection requests.
.PP
From the passive side, a socket is created with the option
SO_ACCEPTCONN specified,
creating two queues of sockets: \fIso_q0\fP for connections
in progress and \fIso_q\fP for connections already made and
awaiting user acceptance.
As a protocol is preparing incoming connections, it creates
a socket structure queued on \fIso_q0\fP by calling the routine
\fIsonewconn\fP(). When the connection
is established, the socket structure is then transfered
to \fIso_q\fP, making it available for an accept.
.PP
If an SO_ACCEPTCONN socket is closed with sockets on either
\fIso_q0\fP or \fIso_q\fP, these sockets are dropped.
.NH 2
Protocol layer(s)
.PP
Protocols are described by a set of entry points and certain
socket visible characteristics, some of which are used in
deciding which socket type(s) they may support.
.PP
An entry in the ``protocol switch'' table exists for each
protocol module configured into the system. It has the following form:
.DS
._f
struct protosw {
short pr_type; /* socket type used for */
short pr_family; /* protocol family */
short pr_protocol; /* protocol number */
short pr_flags; /* socket visible attributes */
/* protocol-protocol hooks */
int (*pr_input)(); /* input to protocol (from below) */
int (*pr_output)(); /* output to protocol (from above) */
int (*pr_ctlinput)(); /* control input (from below) */
int (*pr_ctloutput)(); /* control output (from above) */
/* user-protocol hook */
int (*pr_usrreq)(); /* user request */
/* utility hooks */
int (*pr_init)(); /* initialization routine */
int (*pr_fasttimo)(); /* fast timeout (200ms) */
int (*pr_slowtimo)(); /* slow timeout (500ms) */
int (*pr_drain)(); /* flush any excess space possible */
};
.DE
.PP
A protocol is called through the \fIpr_init\fP entry before any other.
Thereafter it is called every 200 milliseconds through the
\fIpr_fasttimo\fP entry and
every 500 milliseconds through the \fIpr_slowtimo\fP for timer based actions.
The system will call the \fIpr_drain\fP entry if it is low on space and
this should throw away any non-critical data.
.PP
Protocols pass data between themselves as chains of mbufs using
the \fIpr_input\fP and \fIpr_output\fP routines. \fIPr_input\fP
passes data up (towards
the user) and \fIpr_output\fP passes it down (towards the network); control
information passes up and down on \fIpr_ctlinput\fP and \fIpr_ctloutput\fP.
The protocol is responsible for the space occupied by any the
arguments to these entries and must dispose of it.
.PP
The \fIpr_userreq\fP routine interfaces protocols to the socket
code and is described below.
.PP
The \fIpr_flags\fP field is constructed from the following values:
.DS
._d
#define PR_ATOMIC 0x01 /* exchange atomic messages only */
#define PR_ADDR 0x02 /* addresses given with messages */
#define PR_CONNREQUIRED 0x04 /* connection required by protocol */
#define PR_WANTRCVD 0x08 /* want PRU_RCVD calls */
#define PR_RIGHTS 0x10 /* passes capabilities */
.DE
Protocols which are connection-based specify the PR_CONNREQUIRED
flag so that the socket routines will never attempt to send data
before a connection has been established. If the PR_WANTRCVD flag
is set, the socket routines will notfiy the protocol when the user
has removed data from the socket's receive queue. This allows
the protocol to implement acknowledgement on user receipt, and
also update windowing information based on the amount of space
available in the receive queue. The PR_ADDR field indicates any
data placed in the socket's receive queue will be preceded by the
address of the sender. The PR_ATOMIC flag specifies each \fIuser\fP
request to send data must be performed in a single \fIprotocol\fP send
request; it is the protocol's responsibility to maintain record
boundaries on data to be sent. The PR_RIGHTS flag indicates the
protocol supports the passing of capabilities; this is currently
used only the protocols in the UNIX protocol family.
.PP
When a socket is created, the socket routines scan the protocol
table looking for an appropriate protocol to support the type of
socket being created. The \fIpr_type\fP field contains one of the
possible socket types (e.g. SOCK_STREAM), while the \fIpr_family\fP
field indicates which protocol family the protocol belongs to.
The \fIpr_protocol\fP field contains the protocol number of the
protocol, normally a well known value.
.NH 2
Network-interface layer
.PP
Each network-interface configured into a system defines a
path through which packets may be sent and received.
Normally a hardware device is associated with this interface,
though there is no requirement for this (for example, all
systems have a software ``loopback'' interface used for
debugging and performance analysis).
In addition to manipulating the hardware device, an interface
module is responsible
for encapsulation and deencapsulation of any low level header
information required to deliver a message to it's destination.
The selection of which interface to use in delivering packets
is a routing decision carried out at a
higher level than the network-interface layer. Each interface
normally identifies itself at boot time to the routing module
so that it may be selected for packet delivery.
.PP
An interface is defined by the following structure,
.DS
._f
struct ifnet {
char *if_name; /* name, e.g. ``en'' or ``lo'' */
short if_unit; /* sub-unit for lower level driver */
short if_mtu; /* maximum transmission unit */
int if_net; /* network number of interface */
short if_flags; /* up/down, broadcast, etc. */
short if_timer; /* time 'til if_watchdog called */
int if_host[2]; /* local net host number */
struct sockaddr if_addr; /* address of interface */
union {
struct sockaddr ifu_broadaddr;
struct sockaddr ifu_dstaddr;
} if_ifu;
struct ifqueue if_snd; /* output queue */
int (*if_init)(); /* init routine */
int (*if_output)(); /* output routine */
int (*if_ioctl)(); /* ioctl routine */
int (*if_reset)(); /* bus reset routine */
int (*if_watchdog)(); /* timer routine */
int if_ipackets; /* packets received on interface */
int if_ierrors; /* input errors on interface */
int if_opackets; /* packets sent on interface */
int if_oerrors; /* output errors on interface */
int if_collisions; /* collisions on csma interfaces */
struct ifnet *if_next;
};
.DE
.PP
Each interface has a send queue and routines used for
initialization, \fIif_init\fP, and output, \fIif_output\fP.
If the interface resides on a system bus, the routine \fIif_reset\fP
will be called after a bus reset has been performed.
An interface may also
specify a timer routine, \fIif_watchdog\fP, which should be called
every \fIif_timer\fP seconds (if non-zero).
.PP
The state of an interface and certain characteristics are stored in
the \fIif_flags\fP field. The following values are possible:
.DS
._d
#define IFF_UP 0x1 /* interface is up */
#define IFF_BROADCAST 0x2 /* broadcast address valid */
#define IFF_DEBUG 0x4 /* turn on debugging */
#define IFF_ROUTE 0x8 /* routing entry installed */
#define IFF_POINTOPOINT 0x10 /* interface is point-to-point link */
#define IFF_NOTRAILERS 0x20 /* avoid use of trailers */
#define IFF_RUNNING 0x40 /* resources allocated */
#define IFF_NOARP 0x80 /* no address resolution protocol */
.DE
If the interface is connected to a network which supports transmission
of \fIbroadcast\fP packets, the IFF_BROADCAST flag will be set and
the \fIif_broadaddr\fP field will contain the address to be used in
sending or accepting a broadcast packet. If the interface is associated
with a point to point hardware link (for example, a DEC DMR-11), the
IFF_POINTOPOINT flag will be set and \fIif_dstaddr\fP will contain the
address of the host on the other side of the connection. These addresses
and the local address of the interface, \fIif_addr\fP, are used in
filtering incoming packets. The interface sets IFF_RUNNING after
it has allocated system resources and posted an initial read on the
device it manages. This state bit is used to avoid multiple allocation
requests when an interface's address is changed. The IFF_NOTRAILERS
flag indicates the interface should refrain from using a \fItrailer\fP
encapsulation on outgoing packets; \fItrailer\fP protocols are described
in section 14. The IFF_NOARP flag indicates the interface should not
use an ``address resolution protocol'' in mapping internetwork addresses
to local network addresses.
.PP
The information stored in an \fIifnet\fP structure for point to point
communication devices is not currently used by the system internally.
Rather, it is used by the user level routing process in determining
host network connections and in initially devising routes (refer to
chapter 10 for more information).
.PP
Various statistics are also stored in the interface structure. These
may be viewed by users using the \fInetstat\fP(1) program.
.PP
The interface address and flags may be set with the SIOCSIFADDR and
SIOCSIFFLAGS ioctls. SIOCSIFADDR is used to initially define each
interface's address; SIOGSIFFLAGS can be used to mark
an interface down and perform site-specific configuration.
.NH 3
UNIBUS interfaces
.PP
All hardware related interfaces currently reside on the UNIBUS.
Consequently a common set of utility routines for dealing
with the UNIBUS has been developed. Each UNIBUS interface
utilizes a structure of the following form:
.DS
.if t .ta .5i 1.25i 2.8i
.if n .ta .7i 1.75i 3.8i
struct ifuba {
short ifu_uban; /* uba number */
short ifu_hlen; /* local net header length */
struct uba_regs *ifu_uba; /* uba regs, in vm */
struct ifrw {
.if t .ta .5i 1.25i 2.0i 2.8i
.if n .ta .7i 1.75i 2.75i 3.8i
caddr_t ifrw_addr; /* virt addr of header */
int ifrw_bdp; /* unibus bdp */
int ifrw_info; /* value from ubaalloc */
int ifrw_proto; /* map register prototype */
struct pte *ifrw_mr; /* base of map registers */
} ifu_r, ifu_w;
.if t .ta .5i 1.25i 2.8i
.if n .ta .7i 1.75i 3.8i
struct pte ifu_wmap[IF_MAXNUBAMR]; /* base pages for output */
short ifu_xswapd; /* mask of clusters swapped */
short ifu_flags; /* used during uballoc's */
struct mbuf *ifu_xtofree; /* pages being dma'd out */
};
.DE
.PP
The \fIif_uba\fP structure describes UNIBUS resources held by
an interface.
IF_NUBAMR map registers are held for datagram data, starting
at \fIifr_mr\fP. UNIBUS map register \fIifr_mr\fP[\-1]
maps the local network header
ending on a page boundary. UNIBUS data paths are
reserved for read and for
write, given by \fIifr_bdp\fP. The prototype of the map
registers for read and for write is saved in \fIifr_proto\fP.
.PP
When write transfers are not full pages on page boundaries
the data is just copied into the pages mapped on the UNIBUS
and the transfer is started.
If a write transfer is of a (1024 byte) page size and on a page
boundary, UNIBUS page table entries are swapped to reference
the pages, and then the initial pages are
remapped from \fIifu_wmap\fP when the transfer completes.
.PP
When read transfers give whole pages of data to be input, page
frames are allocated from a network page list and traded
with the pages already containing the data, mapping the allocated
pages to replace the input pages for the next UNIBUS data input.
.PP
The following utility routines are available for use in
writing network interface drivers, all use the \fIifuba\fP
structure described above.
.IP "if_ubainit(ifu, uban, hlen, nmr);"
.br
\fIif_ubainit\fP allocates resources on UNIBUS adaptor \fIuban\fP
and stores the resultant information
in the \fIifuba\fP structure pointed to by \fIifu\fP.
It is called only at boot time or after a UNIBUS reset.
Two data paths (buffered or unbuffered,
depending on the \fIifu_flags\fP field) are allocated, one for
reading and one for writing. The \fInmr\fP parameter indicates
the number of UNIBUS mapping registers required to map a maximal
sized packet onto the UNIBUS, while \fIhlen\fP specifies the size
of a local network header, if any, which should be mapped separately
from the data (see the description of trailer protocols in chapter 14).
Sufficient UNIBUS mapping registers and pages of memory are allocated
to initialize the input data path for an initial read. For the output
data path, mapping registers and pages of memory are also allocated
and mapped onto the UNIBUS. The pages associated with the output
data path are held in reserve in the event a write requires copying
non-page-aligned data (see \fIif_wubaput\fP below).
If \fIif_ubainit\fP is called with resources already allocated,
they will be used instead of allocating new ones (this normally
occurs after a UNIBUS reset).
A 1 is returned when allocation and initialization is successful,
0 otherwise.
.IP "m = if_rubaget(ifu, totlen, off0);"
.br
\fIif_rubaget\fP pulls read data off an interface. \fItotlen\fP
specifies the length of data to be obtained, not counting the
local network header. If \fIoff0\fP is non-zero, it indicates
a byte offset to a trailing local network header which should be
copied into a
separate mbuf and prepended to the front of the resultant mbuf
chain. When page sized units of data are present and are
page-aligned, the previously mapped data pages are remapped
into the mbufs and swapped with fresh pages; thus avoiding
any copying. A 0 return value indicates a failure to allocate
resources.
.IP "if_wubaput(ifu, m);"
.br
\fIif_wubaput\fP maps a chain of mbufs onto a network interface
in preparation for output. The chain includes any local network
header, which is copied so that it resides in the mapped and
aligned I/O space. Any other mbufs which contained non page
sized data portions are also copied to the I/O space.
Pages mapped from a previous output operation (no longer needed)
are unmapped and returned to the network page pool.
.ds RH "Socket/protocol interface
.bp