From: www.itworld.com

Solaris sockets, past and present

by Jim Mauro

March 20, 2001 —

 

Prior to Solaris 2.6, sockets were an abstraction that existed at the library level. That is, much of the socket state and socket semantics support were provided within the libsocket library. The kernel's view of a process's socket connection entailed a file descriptor and linkage to a Stream head, which provided the path to the underlying transport. The disparity between the library socket state and the kernel's view was one of several reasons a new implementation was introduced in Solaris 2.6.


To provide a relevant basis for comparison, we'll start by looking at what happens in the pre-Solaris 2.6 release (that is, releases up to and including Solaris 2.5.1) when a socket is created. The major software layers are shown in Figure 1 for reference.



The primary software components are the socket library and the sockmod Streams module. The specfs layer is shown for completeness and is part of the layering, due to the use of pseudodevices as an entry point into the networking layers. To digress for a moment, the special filesystem, specfs, came out of SVR4 Unix as a means of addressing the issue of device special files that exist on Unix on-disk filesystems (e.g., UFS). Unix systems have always abstracted I/O (input/output) devices through device special files. The /dev directory namespace stores files that represent physical devices and pseudodevices on the system. Using device major numbers, those device files provide an entry point into the appropriate device driver, and using minor numbers, they are able to uniquely identify one of potentially many devices of the same type. (That is something of an oversimplification, but is sufficient for our purpose here in describing specfs.)


The /dev directory resides on the root filesystem, which is an instance of UFS. As such, references to the filesystem and its files and directories are handled using the UFS filesystem operations and UFS file operations. That is usually sufficient, but is not desired behavior for device special files. I/O to a device special file requires entry into a device driver. That is, issuing an open(2) system call on /dev/rmt/0 means someone wishes to open the tape device represented by /dev/rmt/0, thereby entering the appropriate driver's xx_open() routine. As a file on a UFS filesystem, the typical open routine called would be the ufs_open() code, but that's not what we want for devices. The specfs filesystem was designed to address such situations; it provides a straightforward mechanism for linking the underlying structures for file support in the kernel to the required device driver interfaces. Like all filesystems in Solaris (and any SVR4-based Unix) it's based on the VFS/vnode infrastructure. (See Solaris Internals and UNIX Internals in the Resources section for detailed information on VFS.)
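To make the major/minor number idea concrete, here is a minimal sketch that inspects a path and reports the device numbers behind it. The helper name device_numbers() is made up for illustration, and the sketch assumes a Unix-like system where /dev/null exists as a character device.

```python
import os
import stat


def device_numbers(path):
    """Return (kind, major, minor) for a device special file, else None."""
    st = os.stat(path)
    if stat.S_ISCHR(st.st_mode):
        kind = "character"
    elif stat.S_ISBLK(st.st_mode):
        kind = "block"
    else:
        return None  # regular file, directory, etc.: no device driver behind it
    # st_rdev identifies the device the file represents; st_dev would
    # identify the filesystem that merely holds the file.
    return kind, os.major(st.st_rdev), os.minor(st.st_rdev)


if __name__ == "__main__":
    for path in ("/dev/null", "/"):
        print(path, device_numbers(path))
```

The major number selects the driver and the minor number selects the instance. The actual values differ from system to system, so none are shown here.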


Figure 1. Pre-Solaris 2.6 socket layers


Getting back to sockets in Solaris 2.5.1, the specfs layer comes into play because the socket open ultimately results in an open(2) system call issued on the tcp(7P) or udp(7P) pseudodevice. More precisely, the socket library passes the arguments given to the socket(3N) call to a lookup function that searches an internal (internal to libsocket.so) array to match the domain argument and retrieve a corresponding character string. It then uses the character string to find a match in the /etc/netconfig file, which is used for transport selection and describes all the available transport protocols in Solaris. (See netconfig(4).) This transport selection mechanism is an essential part of a network programming implementation; it allows for the interfaces to be protocol-independent, so the programmer is not required to maintain a different source base for Ethernet-based networks versus FDDI-based networks, etc.


A netconfig data structure (defined in /usr/include/sys/netconfig.h) is populated based on the line entries in /etc/netconfig that match the domain (as per the character string retrieved from the internal table), type, and protocol family specified in the socket(3N) call. Among the netconfig parameters, a device is defined that provides the entry point into the transport provider kernel module. For example, a call to socket(AF_INET, SOCK_STREAM, 0) indicates that an Internet transport providing reliable, connection-oriented behavior is desired. The TCP layer of the TCP/IP protocol family provides such a service, and the /etc/netconfig entry defines /dev/tcp as the device to open for entry into that transport layer. The socket library code will issue an open(2) on /dev/tcp accordingly. If one were developing a network-based application using the X/Open Transport Interface (XTI) -- a superset of what was the Transport Layer Interface (TLI) -- the t_open(3NSL) call would receive the /dev/tcp argument explicitly for a connection using TCP as a transport protocol.
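The lookup just described can be sketched roughly as follows. This is a model of the idea, not the libsocket code: the two sample lines follow the netconfig(4) field layout, while the two small dictionaries are hypothetical stand-ins for the library's internal table, and the function names are invented for the example.

```python
# Sample lines in the netconfig(4) layout. Illustrative only; a real
# /etc/netconfig carries more entries than this.
SAMPLE_NETCONFIG = """\
# network_id semantics    flags protofamily protoname device    nametoaddr_libs
udp          tpi_clts     v     inet        udp       /dev/udp  -
tcp          tpi_cots_ord v     inet        tcp       /dev/tcp  -
"""

FIELDS = ("network_id", "semantics", "flags", "protofamily",
          "protoname", "device", "nametoaddr_libs")

# Hypothetical stand-ins for the library-internal lookup tables.
DOMAIN_TO_FAMILY = {"AF_INET": "inet"}
TYPE_TO_SEMANTICS = {"SOCK_STREAM": "tpi_cots_ord", "SOCK_DGRAM": "tpi_clts"}


def parse_netconfig(text):
    """Turn netconfig-style text into a list of field dictionaries."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        entries.append(dict(zip(FIELDS, line.split())))
    return entries


def device_for_socket(domain, sock_type, entries):
    """Return the device to open for a socket() request, or None."""
    family = DOMAIN_TO_FAMILY.get(domain)
    semantics = TYPE_TO_SEMANTICS.get(sock_type)
    for entry in entries:
        if entry["protofamily"] == family and entry["semantics"] == semantics:
            return entry["device"]
    return None


if __name__ == "__main__":
    table = parse_netconfig(SAMPLE_NETCONFIG)
    print(device_for_socket("AF_INET", "SOCK_STREAM", table))
```

With the sample table, a stream socket in the Internet domain resolves to /dev/tcp, which is the device the library would then open.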


The block sitting below specfs in Figure 1, the Stream head, is a generic part of a Stream-based communication path. The Stream head is created when a Streams device is opened. In Figure 1, the open(2) to the /dev/tcp transport layer, which is a Streams device, resulted in the creation of the Stream head. The Stream head translates the interface calls made by the socket library into Streams messages (the Streams framework is message-based and uses queues to move data downstream [from the user process to the Streams driver] and upstream [from the driver to the user process]). The Streams facility provides for the insertion (pushing) and removal (popping) of Streams modules in the data flow, between the Stream head and the underlying driver. Each module implements a set of queues -- a read queue and a write queue -- for processing the data and messages. The generic picture is shown in Figure 2.
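As a rough mental model of that arrangement, the toy classes below record the path a message takes through a Stream. Nothing here corresponds to real kernel interfaces; the class and method names are invented, and the real framework queues messages rather than calling straight through.

```python
class Module:
    """A toy Streams component with a write-side and a read-side put routine."""

    def __init__(self, name):
        self.name = name

    def write_put(self, message):
        message["path"].append(self.name)  # message travelling downstream
        return message

    def read_put(self, message):
        message["path"].append(self.name)  # message travelling upstream
        return message


class Stream:
    """Stream head on top, driver at the bottom, pushed modules in between."""

    def __init__(self, driver_name):
        self.driver = Module(driver_name)
        self.modules = []  # index 0 sits directly beneath the Stream head

    def push(self, module):
        self.modules.insert(0, module)

    def pop(self):
        return self.modules.pop(0)

    def send_downstream(self, data):
        message = {"data": data, "path": ["stream head"]}
        for component in self.modules + [self.driver]:
            message = component.write_put(message)
        return message

    def send_upstream(self, data):
        message = {"data": data, "path": []}
        for component in [self.driver] + self.modules[::-1]:
            message = component.read_put(message)
        message["path"].append("stream head")
        return message


if __name__ == "__main__":
    stream = Stream("tcp")
    stream.push(Module("sockmod"))
    print(stream.send_downstream(b"hello")["path"])
    print(stream.send_upstream(b"reply")["path"])
```

A module pushed later lands directly beneath the Stream head, above any modules already on the Stream, and a pop removes that same topmost module.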


Figure 2. Streams organization


In the context of Solaris 2.5.1 sockets, the Streams module shown in Figure 2 is the kernel sockmod module (located in the /kernel/strmod directory). sockmod provides, in conjunction with libsocket, support for socket semantics using the Streams facility. That is, socket calls are handled initially by the socket library, then passed down to the Stream head, which transforms the calls into Streams messages and passes them down to sockmod. Upstream messages are passed from the underlying device driver and transport provider through sockmod and back up to the process. Thus, the functions in the sockmod module include read-side and write-side queue put routines that move data up and down the Stream as data is read from and written to the socket. The sockmod module communicates with the underlying transport provider using primitives and structures defined in the /usr/include/sys/tihdr.h header file.


The socket state maintained at the library level is in the form of a library-internal data structure, _si_user, which maintains various bits of information about the socket and is what the internal socket create function returns on a socket call. (The user code, of course, still gets back the file descriptor that represents the socket.) _si_user is visible only to the library. You will find the structure definition for _si_user and the associated structures that it links to (si_udata and si_sockparams) in /usr/include/sys/sockmod.h. If you look at the structure definition, you'll see that _si_user embeds the si_udata and si_sockparams structures, which maintain state information (e.g., connected, bound), socket options (e.g., accept connection), information on the transport provider (e.g., service type), and the family, type, and protocol used for the socket.


At the sockmod layer, a socket is internally represented in the so_so data structure. Fields of interest there include an embedded ti_info structure (/usr/include/sys/tiuser.h) that manages transport provider information, a network buffer (netbuf) for data transfers, a si_udata structure that replicates the socket state (among other things), and message blocks (mblk_t), which are the basic unit of communication across Streams.

Figure 3. Socket layers with sockfs


In Solaris 2.6, we did away with the sockmod Streams module and trimmed a lot of code from libsocket. Most of the socket-related library interfaces result in system call traps into the kernel, without any library-level code executing. A few of the interfaces (socket(3SOCKET) and socketpair(3SOCKET)) execute some library-level code before entering the kernel. However, all the state information is maintained in the kernel, where it belongs. This creates a nice visibility feature -- we can now see file descriptors that represent sockets:


sunsys> uname -a
SunOS sunsys 5.8 Generic_108528-01 sun4u sparc SUNW,Ultra-60
sunsys> srv &
[1]     7153
Socket port: # 34940
Send buf: 16384, Rcv buf: 24576

sunsys> pfiles 7153
7153:   srv
  Current rlimit: 1024 file descriptors
   0: S_IFCHR mode:0620 dev:32,0 ino:91176 uid:19822 gid:7 rdev:24,14
      O_RDWR|O_LARGEFILE
   1: S_IFCHR mode:0620 dev:32,0 ino:91176 uid:19822 gid:7 rdev:24,14
      O_RDWR|O_LARGEFILE
   2: S_IFCHR mode:0620 dev:32,0 ino:91176 uid:19822 gid:7 rdev:24,14
      O_RDWR|O_LARGEFILE
   3: S_IFSOCK mode:0666 dev:186,0 ino:63137 uid:0 gid:0 size:0
      O_RDWR
        sockname: AF_INET 0.0.0.0  port: 34940
sunsys> 


In the above example, a simple TCP socket server process is started (srv, PID 7153). (The Socket port and Send buf lines are output from the srv process when it starts.) Using the pfiles(1) command to dump the process's open file descriptors, we see that file descriptor 3 is identified as a socket (S_IFSOCK), and we even get the address family (AF_INET) and port number. (The freeware command, lsof, is a great utility for extracting process file descriptor information if you're on an older SunOS release that doesn't have pfiles(1). You can get lsof from ftp://vic.cc.purdue.edu/pub/tools/unix/lsof/.)
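A process can make the same observation about its own descriptors. The sketch below is not the srv program from the example (its source isn't shown here); it simply creates a TCP socket, binds it to an ephemeral loopback port, and reports what fstat(2) and getsockname(3SOCKET) say about it. The function name and the returned dictionary layout are choices made for this example.

```python
import os
import socket
import stat


def describe_socket_fd():
    """Create a bound TCP socket and report what the kernel says about it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0: let the system pick a free port
        st = os.fstat(s.fileno())
        address, port = s.getsockname()
        return {
            "fd": s.fileno(),
            "is_socket": stat.S_ISSOCK(st.st_mode),  # S_IFSOCK in the file mode
            "family": s.family.name,
            "address": address,
            "port": port,
            "send_buf": s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
            "recv_buf": s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
        }


if __name__ == "__main__":
    for key, value in describe_socket_fd().items():
        print(key, value)
```

The port and buffer sizes reported will vary by system, which is why no sample output is given.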


The libsocket changes associated with sockfs maintain the documented interfaces. Both source and binary compatibility are maintained: socket code compiled on earlier versions of Solaris should run without recompilation on Solaris 2.6 and later releases, and source code should move over and recompile with no changes as well.


The trimming down of the library-level socket code required providing a new means to map the domain type passed as an argument to socket(3SOCKET) to facilitate a lookup in /etc/netconfig. Recall that the Solaris 2.5.1 socket library did this using an internal table. In Solaris 2.6 and later, a new configuration file and a new command are introduced to provide that functionality. The /etc/sock2path file contains the necessary information to map the socket(3SOCKET) call parameters to the appropriate transport provider and device. A new command, soconfig(1M), is used to maintain /etc/sock2path. It's executed automatically at boot time via an entry in the /etc/inittab file. See the sock2path(4) and soconfig(1M) man pages for specifics. For most applications, the default entries in sock2path are sufficient.
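Conceptually, the mapping amounts to a table keyed on the three socket() arguments. The sketch below models it with sample entries in the family/type/protocol/path layout; treat the specific lines and numeric values as illustrative assumptions and check sock2path(4) and the file on your own system for the real ones. The function names are invented for the example.

```python
# Sample entries: family, type, protocol, device path. Illustrative only.
# Assumed numeric values: AF_INET is 2, and Solaris numbers SOCK_STREAM as 2
# and SOCK_DGRAM as 1 (the reverse of BSD-derived systems).
SAMPLE_SOCK2PATH = """\
# family  type  protocol  path
2         2     0         /dev/tcp
2         2     6         /dev/tcp
2         1     0         /dev/udp
2         1     17        /dev/udp
"""

AF_INET = 2
SOCK_DGRAM = 1
SOCK_STREAM = 2


def load_sock2path(text):
    """Build a {(family, type, protocol): path} table from sock2path-style text."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comments
        family, sock_type, protocol = (int(f) for f in fields[:3])
        table[(family, sock_type, protocol)] = fields[3]
    return table


def transport_device(table, family, sock_type, protocol=0):
    """Return the transport device for a socket() request, or None."""
    return table.get((family, sock_type, protocol))


if __name__ == "__main__":
    table = load_sock2path(SAMPLE_SOCK2PATH)
    print(transport_device(table, AF_INET, SOCK_STREAM))
```

A request that matches no entry simply has no transport device to open, which is how an unsupported family/type/protocol combination would surface as an error from socket().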


As a filesystem (pseudofilesystem), sockfs implements the generic VFS/vnode-related support structures and exports the required filesystem-specific functions. However, the entry into the sockfs-specific functions doesn't necessarily follow the typical flow of a regular file open, which is vectored to the file-type-specific function through the use of macros and an operations table. That is, the issuing of an open(2) system call on a file enters a generic vnode code path and ultimately resolves through a VOP_OPEN() macro to the appropriate filesystem-specific open code (e.g. ufs_open for a file in a UFS filesystem).


Sockets are created and opened using the socket(3SOCKET) API. A call to socket() from user code enters the libsocket library, which handles the mapping to the transport provider device, then enters the sockfs kernel module through an internal so_create() system call. The sock_open() routine (the filesystem-specific open routine) is invoked through the so_create() call, which is also how other necessary create functions, such as an initialization function for the socket Stream, are called.


Other conventional system calls, such as a read(2) or write(2) on a socket, are vectored into the sockfs-specific read and write code (sock_read() and sock_write()) through the standard VFS/vnode mechanism. Once entered, the sockfs read/write code makes lower-level calls into the sockfs subsystem designed to interface with the transport provider. For example, a read(2) system call on a socket vectors into sock_read(), which does some basic housekeeping and calls an internal sorecvmsg() (socket receive message) function. In sorecvmsg(), socket state is tested and the request is moved downstream via a call to the Streams get-message function.
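The vectoring through an operations table can be modeled in a few lines. This is only a sketch of the dispatch idea: the function names mirror the kernel routines mentioned in the text, but their bodies are placeholders that return strings, and the Vnode class and table layout are invented for the example.

```python
class Vnode:
    """A toy vnode: a name plus a pointer to its filesystem's operations table."""

    def __init__(self, name, ops):
        self.name = name
        self.ops = ops


# Generic entry points, standing in for the VOP_OPEN()/VOP_READ() macros.
def vop_open(vnode):
    return vnode.ops["open"](vnode)


def vop_read(vnode):
    return vnode.ops["read"](vnode)


# Placeholder filesystem-specific routines.
def ufs_open(vnode):
    return "ufs_open"


def ufs_read(vnode):
    return "ufs_read"


def sorecvmsg(vnode):
    return "sorecvmsg"


def sock_open(vnode):
    return "sock_open"


def sock_read(vnode):
    # Housekeeping would happen here before handing off to sorecvmsg().
    return "sock_read -> " + sorecvmsg(vnode)


UFS_OPS = {"open": ufs_open, "read": ufs_read}
SOCKFS_OPS = {"open": sock_open, "read": sock_read}

if __name__ == "__main__":
    regular_file = Vnode("/etc/motd", UFS_OPS)
    tcp_socket = Vnode("socket", SOCKFS_OPS)
    print(vop_read(regular_file))
    print(vop_read(tcp_socket))
```

The caller never names the filesystem; the same vop_read() call reaches ufs_read for a regular file and sock_read for a socket, purely because of which table the vnode points at.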


The most compelling part of the sockfs implementation is the consolidation of all socket state information in a single structure, maintained in one place: the kernel. Sockets are represented internally as a sonode, defined in /usr/include/sys/socketvar.h. All operations on a sonode take place within the kernel sockfs subsystem, isolating state changes and eliminating the need to replicate state for consistency, which created some of the problems alluded to last month.


That's a wrap for this month. It occurred to me as I was working through this material that it really should have been preceded by detailed coverage of the generic Streams subsystem. At the risk of moving in a backwards direction, I'll add that to the editorial calendar for the not-too-distant future. (It's a huge subsystem.)


Next month, we'll begin a series on executable files in Solaris, and the runtime linker.



Resources and Related Links