.nr H2 1
.ds RH "Internal layering
.NH
\s+2Internal layering\s0
.PP
The internal structure of the network system is divided into
three layers. These
layers correspond to the services provided by the socket
abstraction, those provided by the communication protocols,
and those provided by the hardware interfaces. The communication
protocols are normally layered into two or more individual
cooperating layers, though they are collectively viewed
in the system as one layer providing services supportive
of the appropriate socket abstraction.
.PP
The following sections describe the properties of each layer
in the system and the interfaces each must conform to.
.NH 2
Socket layer
.PP
The socket layer deals with the interprocess communications
facilities provided by the system. A socket is a bidirectional
endpoint of communication which is ``typed'' by the semantics
of communication it supports. The system calls described in
the \fI4.2BSD System Manual\fP are used to manipulate sockets.
.PP
A socket consists of the following data structure:
.DS
._f
struct socket {
	short	so_type;	/* generic type */
	short	so_options;	/* from socket call */
	short	so_linger;	/* time to linger while closing */
	short	so_state;	/* internal state flags */
	caddr_t	so_pcb;	/* protocol control block */
	struct	protosw *so_proto;	/* protocol handle */
	struct	socket *so_head;	/* back pointer to accept socket */
	struct	socket *so_q0;	/* queue of partial connections */
	short	so_q0len;	/* partials on so_q0 */
	struct	socket *so_q;	/* queue of incoming connections */
	short	so_qlen;	/* number of connections on so_q */
	short	so_qlimit;	/* max number queued connections */
	struct	sockbuf so_snd;	/* send queue */
	struct	sockbuf so_rcv;	/* receive queue */
	short	so_timeo;	/* connection timeout */
	u_short	so_error;	/* error affecting connection */
	short	so_oobmark;	/* chars to oob mark */
	short	so_pgrp;	/* pgrp for signals */
};
.DE
.PP
Each socket contains two data queues, \fIso_rcv\fP and \fIso_snd\fP,
and a
pointer to routines which provide supporting services.
The type of the socket,
\fIso_type\fP, is defined at socket creation time and used in selecting
those services which are appropriate to support it. The supporting
protocol is selected at socket creation time and recorded in
the socket data structure for later use. Protocols are defined
by a table of procedures, the \fIprotosw\fP structure, which will
be described in detail later. A pointer to a protocol specific
data structure,
the ``protocol control block'', is also present in the socket structure.
Protocols control this data structure and it normally includes a
back pointer to the parent socket structure(s) to allow easy
lookup when returning information to a user
(for example, placing an error number in the \fIso_error\fP
field). The other entries in the socket structure are used in
queueing connection requests, validating user requests, storing
socket characteristics (e.g.
options supplied at the time a socket is created), and maintaining
a socket's state.
.PP
Processes ``rendezvous at a socket'' in many instances. For instance,
when a process wishes to extract data from a socket's receive queue
and it is empty, or lacks sufficient data to satisfy the request,
the process blocks, supplying the address of the receive queue as
a ``wait channel'' to be used in notification. When data arrives
for the process and is placed in the socket's queue, the blocked
process is identified by the fact that it is waiting ``on the queue''.
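.PP
The back pointer from a protocol control block to its socket
can be illustrated with a minimal sketch.
The structure and routine names below
(\fIsocket_sketch\fP, \fIdemo_pcb\fP, \fIdemo_attach\fP, \fIdemo_post_error\fP)
are hypothetical and are not part of the system;
only \fIso_pcb\fP and \fIso_error\fP are taken from the socket structure above,
and every other field is omitted.
.DS
```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct socket: only two fields are kept. */
struct socket_sketch {
    void           *so_pcb;    /* protocol control block */
    unsigned short  so_error;  /* error affecting connection */
};

/* Hypothetical protocol control block holding a back pointer. */
struct demo_pcb {
    struct socket_sketch *dp_socket;  /* back pointer to parent socket */
};

/* Link a control block and its socket in both directions. */
static void demo_attach(struct socket_sketch *so, struct demo_pcb *pcb)
{
    assert(so != NULL && pcb != NULL);
    so->so_pcb = pcb;
    so->so_error = 0;
    pcb->dp_socket = so;
}

/* Protocol side: report an error to the user through the back pointer. */
static void demo_post_error(struct demo_pcb *pcb, unsigned short err)
{
    if (pcb->dp_socket != NULL)
        pcb->dp_socket->so_error = err;
}
```
.DE
.PP
In this sketch the protocol never searches for the socket;
it follows the pointer stored when the control block was attached.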
.NH 3
Socket state
.PP
A socket's state is defined from the following:
.DS
.if t .ta .6i 2.3i 3.0i
.if n .ta .84i 2.5i 3.20i
#define	SS_NOFDREF	0x001	/* no file table ref any more */
#define	SS_ISCONNECTED	0x002	/* socket connected to a peer */
#define	SS_ISCONNECTING	0x004	/* in process of connecting to peer */
#define	SS_ISDISCONNECTING	0x008	/* in process of disconnecting */
#define	SS_CANTSENDMORE	0x010	/* can't send more data to peer */
#define	SS_CANTRCVMORE	0x020	/* can't receive more data from peer */
#define	SS_CONNAWAITING	0x040	/* connections awaiting acceptance */
#define	SS_RCVATMARK	0x080	/* at mark on input */
#define	SS_PRIV	0x100	/* privileged */
#define	SS_NBIO	0x200	/* non-blocking ops */
#define	SS_ASYNC	0x400	/* async i/o notify */
.DE
.PP
The state of a socket is manipulated both by the protocols
and the user (through system calls).
When a socket is created the state is defined based on the type of input/output
the user wishes to perform. ``Non-blocking'' I/O implies a
process should never be blocked to await resources. Instead, any
call which would block returns prematurely
with the error EWOULDBLOCK (the service request may be partially
fulfilled, e.g. a request for more data than is present).
.PP
If a process requested ``asynchronous'' notification of events
related to the socket, the SIGIO signal is posted to the process.
An event is a change in the socket's state;
examples of such occurrences are space
becoming available in the send queue, new data available in the
receive queue, connection establishment or disestablishment, etc.
.PP
A socket may be marked ``privileged'' if it was created by the
super-user. Only privileged sockets may send broadcast packets,
or bind addresses in privileged portions of an address space.
.NH 3
Socket data queues
.PP
A socket's data queue contains a pointer to the data stored in
the queue and other entries related to the management of
the data.
The following structure defines a data queue:
.DS
._f
struct sockbuf {
	short	sb_cc;	/* actual chars in buffer */
	short	sb_hiwat;	/* max actual char count */
	short	sb_mbcnt;	/* chars of mbufs used */
	short	sb_mbmax;	/* max chars of mbufs to use */
	short	sb_lowat;	/* low water mark */
	short	sb_timeo;	/* timeout */
	struct	mbuf *sb_mb;	/* the mbuf chain */
	struct	proc *sb_sel;	/* process selecting read/write */
	short	sb_flags;	/* flags, see below */
};
.DE
.PP
Data is stored in a queue as a chain of mbufs.
The actual count of characters as well as high and low water marks are
used by the protocols in controlling the flow of data.
The socket routines cooperate in implementing the flow control
policy by blocking a process when it requests to send data and
the high water mark has been reached, or when it requests to
receive data and less than the low water mark is present
(assuming non-blocking I/O has not been specified).
.PP
When a socket is created, the supporting protocol ``reserves'' space
for the send and receive queues of the socket.
The actual storage associated with a
socket queue may fluctuate during a socket's lifetime, but it is assumed
this reservation will always allow a protocol to acquire enough memory
to satisfy the high water marks.
.PP
The timeout and select values are manipulated by the socket routines
in implementing various portions of the interprocess communications
facilities and will not be described here.
.PP
A socket queue has a number of flags used in synchronizing access
to the data and in acquiring resources:
.DS
._d
#define	SB_LOCK	0x01	/* lock on data queue (so_rcv only) */
#define	SB_WANT	0x02	/* someone is waiting to lock */
#define	SB_WAIT	0x04	/* someone is waiting for data/space */
#define	SB_SEL	0x08	/* buffer is selected */
#define	SB_COLL	0x10	/* collision selecting */
.DE
The last two flags are manipulated by the system in implementing
the select mechanism.
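.PP
The blocking policy described for the data queues can be
sketched in a few lines.
This is an illustration only:
\fIsockbuf_sketch\fP keeps just three of the fields shown above,
the routine names (\fIsb_room\fP, \fIsend_must_wait\fP, \fIrecv_must_wait\fP)
are hypothetical, and the mbuf accounting
(\fIsb_mbcnt\fP, \fIsb_mbmax\fP) is deliberately ignored.
.DS
```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct sockbuf. */
struct sockbuf_sketch {
    short sb_cc;     /* actual chars in buffer */
    short sb_hiwat;  /* max actual char count */
    short sb_lowat;  /* low water mark */
};

/* Characters that may still be queued before the high water mark. */
static int sb_room(const struct sockbuf_sketch *sb)
{
    int room;

    assert(sb != NULL);
    room = sb->sb_hiwat - sb->sb_cc;
    return room > 0 ? room : 0;
}

/* A sender waits once the high water mark has been reached. */
static int send_must_wait(const struct sockbuf_sketch *sb, int nonblocking)
{
    return !nonblocking && sb_room(sb) == 0;
}

/* A receiver waits while less than the low water mark is queued. */
static int recv_must_wait(const struct sockbuf_sketch *sb, int nonblocking)
{
    assert(sb != NULL);
    return !nonblocking && sb->sb_cc < sb->sb_lowat;
}
```
.DE
.PP
With non-blocking I/O both predicates return zero,
which corresponds to the case where the caller is given
an error instead of being put to sleep.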
.NH 3
Socket connection queueing
.PP
In dealing with connection oriented sockets (e.g. SOCK_STREAM)
the two sides are considered distinct. One side is termed
\fIactive\fP, and generates connection requests. The other
side is called \fIpassive\fP and accepts connection requests.
.PP
From the passive side, a socket is created with the option
SO_ACCEPTCONN specified, creating two queues of sockets:
\fIso_q0\fP for connections in progress
and \fIso_q\fP for connections already made and
awaiting user acceptance.
As a protocol is preparing incoming connections, it creates
a socket structure queued on \fIso_q0\fP by calling the routine
\fIsonewconn\fP(). When the connection
is established, the socket structure is then transferred
to \fIso_q\fP, making it available for an accept.
.PP
If an SO_ACCEPTCONN socket is closed with sockets on either
\fIso_q0\fP or \fIso_q\fP, these sockets are dropped.
.NH 2
Protocol layer(s)
.PP
Protocols are described by a set of entry points and certain
socket visible characteristics, some of which are used in
deciding which socket type(s) they may support.
.PP
An entry in the ``protocol switch'' table exists for each
protocol module configured into the system.
It has the following form:
.DS
._f
struct protosw {
	short	pr_type;	/* socket type used for */
	short	pr_family;	/* protocol family */
	short	pr_protocol;	/* protocol number */
	short	pr_flags;	/* socket visible attributes */
/* protocol-protocol hooks */
	int	(*pr_input)();	/* input to protocol (from below) */
	int	(*pr_output)();	/* output to protocol (from above) */
	int	(*pr_ctlinput)();	/* control input (from below) */
	int	(*pr_ctloutput)();	/* control output (from above) */
/* user-protocol hook */
	int	(*pr_usrreq)();	/* user request */
/* utility hooks */
	int	(*pr_init)();	/* initialization routine */
	int	(*pr_fasttimo)();	/* fast timeout (200ms) */
	int	(*pr_slowtimo)();	/* slow timeout (500ms) */
	int	(*pr_drain)();	/* flush any excess space possible */
};
.DE
.PP
A protocol is called through the \fIpr_init\fP entry before any other.
Thereafter it is called every 200 milliseconds through the
\fIpr_fasttimo\fP entry and
every 500 milliseconds through the \fIpr_slowtimo\fP entry for timer based actions.
The system will call the \fIpr_drain\fP entry if it is low on space and
this should throw away any non-critical data.
.PP
Protocols pass data between themselves as chains of mbufs using
the \fIpr_input\fP and \fIpr_output\fP routines. \fIPr_input\fP
passes data up (towards the user) and \fIpr_output\fP passes it down
(towards the network); control information passes up and down on
\fIpr_ctlinput\fP and \fIpr_ctloutput\fP.
The protocol is responsible for the space occupied by any of the
arguments to these entries and must dispose of it.
.PP
The \fIpr_usrreq\fP routine interfaces protocols to the socket
code and is described below.
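.PP
The periodic calls through \fIpr_fasttimo\fP and \fIpr_slowtimo\fP
can be pictured with the following sketch.
It is not the system's timer code:
\fIprotosw_sketch\fP keeps only the two timer entries,
the table \fIdemo_protosw\fP and the routine \fIrun_timers\fP are hypothetical,
time is simulated in 100 millisecond steps,
and skipping protocols whose timer entry is null is an assumption
made for the illustration.
.DS
```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct protosw: timer hooks only. */
struct protosw_sketch {
    void (*pr_fasttimo)(void);  /* fast timeout (200ms) */
    void (*pr_slowtimo)(void);  /* slow timeout (500ms) */
};

static int fast_calls;  /* times the demo fast timer has run */
static int slow_calls;  /* times the demo slow timer has run */

static void demo_fasttimo(void) { fast_calls++; }
static void demo_slowtimo(void) { slow_calls++; }

/* Hypothetical protocol switch table with two entries. */
static struct protosw_sketch demo_protosw[] = {
    { demo_fasttimo, demo_slowtimo },
    { NULL, NULL },  /* a protocol with no timer based actions */
};

#define NPROTO (sizeof(demo_protosw) / sizeof(demo_protosw[0]))

/* Simulate elapsed_ms of clock time, dispatching timers as they fall due. */
static void run_timers(int elapsed_ms)
{
    int t;
    size_t i;

    assert(elapsed_ms >= 0);
    for (t = 100; t <= elapsed_ms; t += 100) {
        for (i = 0; i < NPROTO; i++) {
            struct protosw_sketch *pr = &demo_protosw[i];

            if (t % 200 == 0 && pr->pr_fasttimo != NULL)
                pr->pr_fasttimo();
            if (t % 500 == 0 && pr->pr_slowtimo != NULL)
                pr->pr_slowtimo();
        }
    }
}
```
.DE
.PP
Over one simulated second the fast entry of the first protocol
is reached five times and the slow entry twice.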
.PP
The \fIpr_flags\fP field is constructed from the following values:
.DS
._d
#define	PR_ATOMIC	0x01	/* exchange atomic messages only */
#define	PR_ADDR	0x02	/* addresses given with messages */
#define	PR_CONNREQUIRED	0x04	/* connection required by protocol */
#define	PR_WANTRCVD	0x08	/* want PRU_RCVD calls */
#define	PR_RIGHTS	0x10	/* passes capabilities */
.DE
Protocols which are connection-based specify the PR_CONNREQUIRED
flag so that the socket routines will never attempt to send data
before a connection has been established. If the PR_WANTRCVD flag
is set, the socket routines will notify the protocol when the user
has removed data from the socket's receive queue. This allows
the protocol to implement acknowledgement on user receipt, and
also update windowing information based on the amount of space
available in the receive queue. The PR_ADDR flag indicates any
data placed in the socket's receive queue will be preceded by the
address of the sender. The PR_ATOMIC flag specifies each \fIuser\fP
request to send data must be performed in a single \fIprotocol\fP
send request; it is the protocol's responsibility to maintain
record boundaries on data to be sent. The PR_RIGHTS flag indicates
the protocol supports the passing of capabilities; this is currently
used only by the protocols in the UNIX protocol family.
.PP
When a socket is created, the socket routines scan the protocol
table looking for an appropriate protocol to support the type of
socket being created. The \fIpr_type\fP field contains one of the
possible socket types (e.g. SOCK_STREAM), while the \fIpr_family\fP
field indicates which protocol family the protocol belongs to.
The \fIpr_protocol\fP field contains the protocol number of the
protocol, normally a well known value.
.NH 2
Network-interface layer
.PP
Each network-interface configured into a system defines a
path through which packets may be sent and received.
Normally a hardware device is associated with this interface,
though there is no requirement for this (for example, all
systems have a software ``loopback'' interface used for
debugging and performance analysis).
In addition to manipulating the hardware device, an interface
module is responsible
for encapsulation and deencapsulation of any low level header
information required to deliver a message to its destination.
The selection of which interface to use in delivering packets
is a routing decision carried out at a
higher level than the network-interface layer.
Each interface normally identifies itself at boot time to the routing
module so that it may be selected for packet delivery.
.PP
An interface is defined by the following structure:
.DS
._f
struct ifnet {
	char	*if_name;	/* name, e.g. ``en'' or ``lo'' */
	short	if_unit;	/* sub-unit for lower level driver */
	short	if_mtu;	/* maximum transmission unit */
	int	if_net;	/* network number of interface */
	short	if_flags;	/* up/down, broadcast, etc. */
	short	if_timer;	/* time 'til if_watchdog called */
	int	if_host[2];	/* local net host number */
	struct	sockaddr if_addr;	/* address of interface */
	union {
		struct	sockaddr ifu_broadaddr;
		struct	sockaddr ifu_dstaddr;
	} if_ifu;
	struct	ifqueue if_snd;	/* output queue */
	int	(*if_init)();	/* init routine */
	int	(*if_output)();	/* output routine */
	int	(*if_ioctl)();	/* ioctl routine */
	int	(*if_reset)();	/* bus reset routine */
	int	(*if_watchdog)();	/* timer routine */
	int	if_ipackets;	/* packets received on interface */
	int	if_ierrors;	/* input errors on interface */
	int	if_opackets;	/* packets sent on interface */
	int	if_oerrors;	/* output errors on interface */
	int	if_collisions;	/* collisions on csma interfaces */
	struct	ifnet *if_next;
};
.DE
.PP
Each interface has a send queue and routines used for
initialization, \fIif_init\fP, and output, \fIif_output\fP.
If the interface resides on a system bus, the routine \fIif_reset\fP
will be called after a bus reset has been performed.
An interface may also
specify a timer routine, \fIif_watchdog\fP, which should be called every
\fIif_timer\fP seconds (if non-zero).
.PP
The state of an interface and certain characteristics are stored in
the \fIif_flags\fP field. The following values are possible:
.DS
._d
#define	IFF_UP	0x1	/* interface is up */
#define	IFF_BROADCAST	0x2	/* broadcast address valid */
#define	IFF_DEBUG	0x4	/* turn on debugging */
#define	IFF_ROUTE	0x8	/* routing entry installed */
#define	IFF_POINTOPOINT	0x10	/* interface is point-to-point link */
#define	IFF_NOTRAILERS	0x20	/* avoid use of trailers */
#define	IFF_RUNNING	0x40	/* resources allocated */
#define	IFF_NOARP	0x80	/* no address resolution protocol */
.DE
If the interface is connected to a network which supports transmission
of \fIbroadcast\fP packets, the IFF_BROADCAST flag will be set and
the \fIif_broadaddr\fP field will contain the address to be used in
sending or accepting a broadcast packet. If the interface is associated
with a point to point hardware link (for example, a DEC DMR-11), the
IFF_POINTOPOINT flag will be set and \fIif_dstaddr\fP will contain the
address of the host on the other side of the connection. These addresses
and the local address of the interface, \fIif_addr\fP, are used in
filtering incoming packets. The interface sets IFF_RUNNING after
it has allocated system resources and posted an initial read on the
device it manages. This state bit is used to avoid multiple allocation
requests when an interface's address is changed. The IFF_NOTRAILERS
flag indicates the interface should refrain from using a \fItrailer\fP
encapsulation on outgoing packets; \fItrailer\fP protocols are described
in section 14. The IFF_NOARP flag indicates the interface should not
use an ``address resolution protocol'' in mapping internetwork addresses
to local network addresses.
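.PP
As a rough illustration of how the flag bits qualify the address
fields when incoming packets are filtered, consider the sketch below.
It makes several assumptions that are not taken from the system:
addresses are reduced to plain integers instead of \fIsockaddr\fP structures,
the structure \fIifnet_sketch\fP and routine \fIif_accepts\fP are hypothetical,
and a packet is refused outright when IFF_UP is clear.
.DS
```c
#include <assert.h>
#include <stddef.h>

#define IFF_UP        0x1  /* interface is up */
#define IFF_BROADCAST 0x2  /* broadcast address valid */

/* Simplified stand-in for struct ifnet; addresses are plain integers. */
struct ifnet_sketch {
    short    if_flags;      /* up/down, broadcast, etc. */
    unsigned if_addr;       /* address of interface */
    unsigned if_broadaddr;  /* meaningful only with IFF_BROADCAST */
};

/* Return 1 if a packet addressed to dst should be accepted, else 0. */
static int if_accepts(const struct ifnet_sketch *ifp, unsigned dst)
{
    assert(ifp != NULL);
    if ((ifp->if_flags & IFF_UP) == 0)
        return 0;  /* assumption: a down interface accepts nothing */
    if (dst == ifp->if_addr)
        return 1;
    /* The broadcast address is consulted only when the flag says it is valid. */
    if ((ifp->if_flags & IFF_BROADCAST) != 0 && dst == ifp->if_broadaddr)
        return 1;
    return 0;
}
```
.DE
.PP
The point of the sketch is that \fIif_broadaddr\fP is never
compared unless IFF_BROADCAST is set.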
.PP
The information stored in an \fIifnet\fP structure for point to point
communication devices is not currently used by the system internally.
Rather, it is used by the user level routing process in determining
host network connections and in initially devising routes (refer
to chapter 10 for more information).
.PP
Various statistics are also stored in the interface structure. These
may be viewed by users using the \fInetstat\fP(1) program.
.PP
The interface address and flags may be set with the SIOCSIFADDR and
SIOCSIFFLAGS ioctls. SIOCSIFADDR is used to initially define each
interface's address; SIOCSIFFLAGS can be used to mark
an interface down and perform site-specific configuration.
.NH 3
UNIBUS interfaces
.PP
All hardware related interfaces currently reside on the UNIBUS.
Consequently a common set of utility routines for dealing
with the UNIBUS has been developed. Each UNIBUS interface
utilizes a structure of the following form:
.DS
.if t .ta .5i 1.25i 2.8i
.if n .ta .7i 1.75i 3.8i
struct ifuba {
	short	ifu_uban;	/* uba number */
	short	ifu_hlen;	/* local net header length */
	struct	uba_regs *ifu_uba;	/* uba regs, in vm */
	struct ifrw {
.if t .ta .5i 1.25i 2.0i 2.8i
.if n .ta .7i 1.75i 2.75i 3.8i
		caddr_t	ifrw_addr;	/* virt addr of header */
		int	ifrw_bdp;	/* unibus bdp */
		int	ifrw_info;	/* value from ubaalloc */
		int	ifrw_proto;	/* map register prototype */
		struct	pte *ifrw_mr;	/* base of map registers */
	} ifu_r, ifu_w;
.if t .ta .5i 1.25i 2.8i
.if n .ta .7i 1.75i 3.8i
	struct	pte ifu_wmap[IF_MAXNUBAMR];	/* base pages for output */
	short	ifu_xswapd;	/* mask of clusters swapped */
	short	ifu_flags;	/* used during uballoc's */
	struct	mbuf *ifu_xtofree;	/* pages being dma'd out */
};
.DE
.PP
The \fIifuba\fP structure describes UNIBUS resources held by
an interface. IF_NUBAMR map registers are held for datagram
data, starting at \fIifrw_mr\fP. UNIBUS map register \fIifrw_mr\fP[\-1]
maps the local network header ending on a page boundary.
UNIBUS data paths are reserved for read and for write,
given by \fIifrw_bdp\fP. The prototype of the map
registers for read and for write is saved in \fIifrw_proto\fP.
.PP
When write transfers are not full pages on page boundaries
the data is just copied into the pages mapped on the UNIBUS
and the transfer is started. If a write transfer is of a
(1024 byte) page size and on a page boundary, UNIBUS page
table entries are swapped to reference the pages, and then
the initial pages are remapped from \fIifu_wmap\fP when
the transfer completes.
.PP
When read transfers give whole pages of data to be input,
page frames are allocated from a network page list and
traded with the pages already containing the data,
mapping the allocated pages to replace the input pages
for the next UNIBUS data input.
.PP
The following utility routines are available for use in
writing network interface drivers; all use the
\fIifuba\fP structure described above.
.IP "if_ubainit(ifu, uban, hlen, nmr);"
.br
\fIif_ubainit\fP allocates resources on UNIBUS adaptor \fIuban\fP
and stores the resultant information in the \fIifuba\fP structure
pointed to by \fIifu\fP. It is called only at boot time or after a
UNIBUS reset. Two data paths (buffered or unbuffered,
depending on the \fIifu_flags\fP field) are allocated, one for reading
and one for writing. The \fInmr\fP parameter indicates
the number of UNIBUS mapping registers required to map a maximal
sized packet onto the UNIBUS, while \fIhlen\fP specifies the size
of a local network header, if any, which should be mapped separately
from the data (see the description of trailer protocols in chapter 14).
Sufficient UNIBUS mapping registers and pages of memory are allocated
to initialize the input data path for an initial read. For the output
data path, mapping registers and pages of memory are also allocated
and mapped onto the UNIBUS.
The pages associated with the output
data path are held in reserve in the event a write requires copying
non-page-aligned data (see \fIif_wubaput\fP below).
If \fIif_ubainit\fP is called with resources already allocated,
they will be used instead of allocating new ones (this normally
occurs after a UNIBUS reset).
A 1 is returned when allocation and initialization are successful,
0 otherwise.
.IP "m = if_rubaget(ifu, totlen, off0);"
.br
\fIif_rubaget\fP pulls read data off an interface. \fItotlen\fP
specifies the length of data to be obtained, not counting the
local network header. If \fIoff0\fP is non-zero, it indicates
a byte offset to a trailing local network header which should be
copied into a separate mbuf and prepended to the front of the resultant mbuf
chain. When page sized units of data are present and are page-aligned,
the previously mapped data pages are remapped
into the mbufs and swapped with fresh pages, thus avoiding
any copying.
A 0 return value indicates a failure to allocate resources.
.IP "if_wubaput(ifu, m);"
.br
\fIif_wubaput\fP maps a chain of mbufs onto a network interface
in preparation for output. The chain includes any local network
header, which is copied so that it resides in the mapped and
aligned I/O space.
Any other mbufs which contained non-page-sized
data portions are also copied to the I/O space.
Pages mapped from a previous output operation (no longer needed)
are unmapped and returned to the network page pool.
.ds RH "Socket/protocol interface
.bp