Extensible Connection-oriented Messaging (XCM)
This is the API documentation for the Extensible Connection-oriented Messaging (XCM) library.
The XCM API consists of the following parts:
The low API/ABI version number is a result of all XCM releases having been backward compatible, which has left the major version at 0.
XCM is a shared library implementing an inter-process communication service on Linux. It facilitates communication between processes on the same system, as well as over a network.
XCM internals are divided into the core library and a number of pluggable transports, which handle the actual data delivery. In combination with a URL-like addressing scheme, this allows applications to be transport agnostic: an IPC mechanism suitable for one deployment can seamlessly be replaced with another in a different deployment. The API semantics are the same, regardless of the underlying transport used.
An XCM transport either provides a messaging or a byte stream type service.
XCM supports UNIX domain sockets for efficient local-only communication, and TCP, TLS and SCTP for remote inter-process communication. The service XCM provides is of the connection-oriented, client-server type. It is not a message bus and does not implement the publish-subscribe or broadcast patterns.
This document primarily serves as an API specification, but also contains information specific to the implementation.
XCM reuses much of the terminology of the BSD Sockets API. Compared to the BSD Sockets API, XCM has more uniform semantics across underlying transports.
XCM implements a connection-oriented, client-server model. The server process creates one or more server sockets (e.g., with xcm_server()) bound to a specific address, after which clients may establish connections to the server. When a connection is established, two connection sockets are created: one on the server side (e.g., returned from xcm_accept()), and one on the client side (e.g., returned from xcm_connect()). Thus, a server serving multiple clients will have multiple sockets: one server socket and N connection sockets, one for each client. A client will typically have one connection socket for each server it is connected to.
User application data (messages or bytes, depending on service type) are always sent and received on a particular connection socket - never on a server socket.
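To make the model concrete, below is a minimal sketch of a blocking-mode server and client, using the calls mentioned above. The address, message contents and buffer sizes are placeholders, and error handling is kept to a bare minimum.

```c
#include <xcm.h>

#include <string.h>

/* Server side: accept a single client and echo back one message.
   The address format is described in the Addressing section. */
static void serve_one(const char *addr)
{
    struct xcm_socket *server_sock = xcm_server(addr);
    if (server_sock == NULL)
        return;

    struct xcm_socket *conn = xcm_accept(server_sock);
    if (conn != NULL) {
        char buf[1024];
        int len = xcm_receive(conn, buf, sizeof(buf));

        if (len > 0)
            xcm_send(conn, buf, len);

        xcm_close(conn);
    }

    xcm_close(server_sock);
}

/* Client side: connect (in blocking mode), send a request and
   wait for the reply. */
static void request_once(const char *addr)
{
    struct xcm_socket *conn = xcm_connect(addr, 0);
    if (conn == NULL)
        return;

    const char *msg = "hello";
    if (xcm_send(conn, msg, strlen(msg)) >= 0) {
        char reply[1024];
        xcm_receive(conn, reply, sizeof(reply));
    }

    xcm_close(conn);
}
```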
An XCM transport either provides a messaging or a byte stream service.
Messaging transports preserve message boundaries across the network. The buffer passed to xcm_send() constitutes one (and only one) message. What's received on the other end, in exactly one xcm_receive() call, is a buffer with the same length and contents.
The UX Transport, TCP Transport, TLS Transport, UTLS Transport, and SCTP Transport all provide a messaging type service.
For byte streams, there's no such thing as message boundaries: the data transported on the connection is just a sequence of bytes. The fact that xcm_send() accepts an array of bytes of a particular length, as opposed to individual bytes one-by-one, is a mere performance optimization.
For example, if two messages "abc" and "d" are passed to xcm_send() on a messaging transport, they will arrive as "abc" and "d", in exactly two xcm_receive() calls, on the receiving end. On a byte stream transport however, all the data "abcd" may arrive in a single xcm_receive() call, or it may arrive in multiple calls, such as three calls producing "ab", "c", and "d", respectively, or any other combination.
The BTLS and BTCP transports provide a byte stream service.
Applications that allow the user to configure an arbitrary XCM address, but are designed to handle only a certain service type, may restrict socket instantiation to the messaging service type only, or to byte stream only, by passing the "xcm.service" attribute with the appropriate value (see Generic Attributes for details) at the time of socket creation, as sketched below. Because of XCM's history as a messaging-only framework, "xcm.service" defaults to "messaging".
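A sketch of such a restriction, for an application that only handles byte streams. The attribute map API is described further in Socket Attributes, and the xcm_attr_map_add_str() helper name is an assumption based on the type-specific helpers in xcm_attr_map.h; the address is a placeholder supplied by the caller.

```c
#include <xcm.h>
#include <xcm_attr_map.h>

/* Refuse to instantiate anything but a byte stream socket,
   regardless of which transport the configured address refers to. */
static struct xcm_socket *connect_bytestream(const char *addr)
{
    struct xcm_attr_map *attrs = xcm_attr_map_create();
    xcm_attr_map_add_str(attrs, "xcm.service", "bytestream");

    struct xcm_socket *conn = xcm_connect_a(addr, attrs);

    xcm_attr_map_destroy(attrs);

    return conn;
}
```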
Applications designed to handle both messaging and byte stream transports may retrieve the value of "xcm.service" and use it to differentiate the treatment where required (e.g., in xcm_send() return code handling).
Connections spawned off a server socket (e.g., with xcm_accept()) always have the same service type as their parent socket.
The XCM API and the various messaging type transports are designed for relatively small messages. For bulk data transfer, an application needs to either employ fragmentation and reassembly, or use a byte stream transport.
The xcm.max_msg_size socket attribute specifies the maximum message size (see Generic Attributes).
Current XCM transports do not negotiate the maximum message size, so the xcm.max_msg_size limit denotes the local limit only. The remote end will reject messages larger than its limit, and may tear down the connection as a result. Application protocol level signaling or lockstep upgrade may be required to resolve such issues.
Historically, all messaging type transports in XCM have used a maximum message size of 65535 bytes. This limit was never exposed in the API, so subsequent changes to message size limits did not impact the API/ABI.
As of XCM v1.11.1, the maximum message size varies across different transports.
Applications using stack-allocated message buffers may want to impose their own upper limit (e.g., sizeof(msgbuf)) on top of xcm.max_msg_size, to avoid overrunning the stack if linked to a newer, message size-wise more capable, XCM library version.
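A sketch of such a defensive check, assuming the xcm_attr_get_int64() convenience getter and a hypothetical application buffer size MSGBUF_SIZE:

```c
#include <xcm.h>
#include <xcm_attr.h>

#include <stdint.h>

#define MSGBUF_SIZE 65535 /* hypothetical application buffer size */

/* Use the smaller of the application's own buffer size and the
   connection's advertised maximum message size. */
static size_t effective_max_msg_size(struct xcm_socket *conn)
{
    int64_t max_msg_size;

    if (xcm_attr_get_int64(conn, "xcm.max_msg_size", &max_msg_size) < 0)
        return MSGBUF_SIZE;

    return max_msg_size < MSGBUF_SIZE ? (size_t)max_msg_size : MSGBUF_SIZE;
}
```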
In-order delivery - that data arrives at the recipient in the same order it was sent by the sender - is guaranteed, but only for data sent on the same connection.
XCM transports support flow control. Thus, if the sender's message rate or bandwidth is higher than the network or the receiver can handle on a particular connection, xcm_send() in the sender process will eventually block (or fail with errno set to EAGAIN, if in non-blocking mode). Unless XCM is used for bulk data transfer (as opposed to signaling traffic), xcm_send() blocking because of a slow network or a slow receiver should be rare in practice. The TCP, TLS, and UNIX domain socket transports all have large protocol windows and/or socket buffers to allow a large amount of outstanding data.
In XCM, the application is in control of which transport is used, by means of the address supplied to xcm_connect() and xcm_server(), which includes both the transport name and the transport address.
However, there is nothing preventing an XCM transport from using a more abstract addressing format, and internally including multiple underlying IPC transport mechanisms. This model is implemented by the UTLS Transport.
Addresses are represented as strings with the following general syntax: <transport-name>:<transport-address>
For the UX UNIX Domain Socket transport, the addresses have this more specific form:
The addresses of the UXF UNIX Domain Socket transport variant have the following format:
For the TCP, TLS, UTLS, SCTP, BTCP and BTLS transports the syntax is:
'*' is a shorthand for '0.0.0.0' (i.e. bind to all IPv4 interfaces). '[*]' is the IPv6 equivalent, creating a server socket accepting connections on all IPv4 and IPv6 addresses.
Some example addresses:
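The addresses below are illustrative only; the socket names, IP addresses, hostnames and port numbers are arbitrary placeholders following the syntax described above.

```
ux:local-server                   UX socket with the name "local-server"
tcp:*:4711                        TCP server socket bound to all IPv4 interfaces, port 4711
tcp:10.42.0.15:4711               TCP socket for a particular IPv4 address and port
tls:[::1]:4711                    TLS socket for the IPv6 loopback address, port 4711
utls:my-server.example.com:4711   UTLS socket using a DNS domain name
```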
For TCP, TLS, UTLS, SCTP, BTLS and BTCP server socket addresses, the port can be set to 0, in which case XCM (or rather, the Linux kernel) allocates a free TCP port from the local port range.
For transports allowing a DNS domain name as part of the address, the transport will attempt to resolve the name to an IP address. A DNS domain name may resolve to zero or more IPv4 addresses and/or zero or more IPv6 addresses. XCM relies on the operating system to prioritize between IPv4 and IPv6.
By default, XCM will only connect to the first (highest-priority) IP address provided by DNS. This behavior can be changed for all TCP-based transports using the "dns.algorithm" attribute. See DNS Socket Attributes for more information.
XCM accepts IPv4 addresses in the dotted-decimal format. Only complete addresses, with three '.', are allowed, and not the archaic, classful forms, where some bytes were left out and the address thus contained fewer separators.
IPv6 link local addresses (i.e., fe80::/10) are not guaranteed to be unique outside a particular broadcast domain. To create such a socket an application must, besides the link local address to use, also supply a scope identifier, to allow the kernel to select which network interface to use.
The IPv6 scope id is not a part of an XCM address, but instead provided by the application as a socket attribute "ipv6.scope". See TCP Socket Attributes for details.
The rationale for this URI-style design choice, compared to the also-common practice of including a network interface name in the address ("<IPv6 address>%<if-name>"), is that IPv6 scope identifiers are strictly local to the node and thus conceptually not a part of the address. One host may use a particular scope id to refer to a particular network, and another host on the same network may use a different one.
XCM transports attempt to detect a number of conditions which can lead to lost connectivity, and do so even on idle connections.
If the remote end closes the connection, the local xcm_receive() will return 0. If the process on the remote end crashed, xcm_receive() will return -1 and set errno to ECONNRESET. If network connectivity to the remote end is lost, xcm_receive() will return -1 and set errno to ETIMEDOUT.
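A sketch of the corresponding return code handling on the receiving side:

```c
#include <xcm.h>

#include <errno.h>
#include <stdio.h>
#include <string.h>

static void handle_receive(struct xcm_socket *conn)
{
    char buf[65535];
    int rc = xcm_receive(conn, buf, sizeof(buf));

    if (rc > 0)
        printf("received %d bytes\n", rc);
    else if (rc == 0)
        printf("remote end closed the connection\n");
    else if (errno == EAGAIN)
        printf("nothing available right now (non-blocking socket)\n");
    else if (errno == ECONNRESET || errno == ETIMEDOUT)
        printf("connection lost: %s\n", strerror(errno));
    else
        printf("error: %s\n", strerror(errno));
}
```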
In general, XCM follows the UNIX system API tradition when it comes to error handling. Where possible, errors are signaled to the application by using unused parts of the value range of the function's return type. For functions returning signed integer types, this means the value of -1 (in case -1 is not a valid return value). For functions returning pointers, NULL is used to signal that an error has occurred. For functions where neither -1 nor NULL can be used, or where the function does not return anything (side-effect only functions), an 'int' is used as the return type, purely to signal success (value 0) or an error (-1) to the application.
The actual error code is stored in the thread-local errno variable. The error codes are those from the fixed set of errno values defined by POSIX, found in errno.h. Standard functions such as perror() and strerror() may be used to turn the code into a human-readable string.
In non-blocking operation, given that the actual transmission might be deferred (and the message buffered in the XCM layer), and that message receive processing might happen before the application has called receive, the error signaled at the point of a certain XCM call might not be a direct result of the requested operation, but rather an error discovered previously.
The documentation for xcm_finish() includes a list of generic error codes, applicable to xcm_connect(), xcm_accept(), xcm_send() and xcm_receive().
Also, for errors resulting in an unusable connection, repeated calls will produce the same errno.
In UNIX-style event-driven programming, a single application thread handles multiple clients (and thus multiple XCM connection sockets) and the task of accepting new clients on the XCM server socket concurrently (although not in parallel). To wait for events from multiple sources, an I/O multiplexing facility such as select(2), poll(2) or epoll(2) is used.
Each XCM socket is represented by a single fd, retrieved with xcm_fd(). The fd number and underlying file object remains the same across the life-time of the socket.
In the BSD Sockets API, the socket fd being readable means it's likely, but not guaranteed, that the application can successfully read data from the socket. Similarly, a fd marked writable by for example poll() signifies that the application is likely to be able to write data to the socket.
An application using non-blocking XCM sockets must always wait for the XCM socket fd to become readable (e.g., the XCM socket fd should always be in the readfds set in the select() call), regardless of the target condition. Thus, even if the application is waiting for an opportunity to send a message on an XCM socket, or is not interested in performing any type of operation on the socket, it must wait for the XCM socket fd to become readable. Not being interested in performing any operation here means that the application has the xcm_await() condition set to 0, and is neither interested in waiting to call xcm_send(), xcm_receive(), nor xcm_accept() on the socket.
An application must always include all of its XCM socket fds in readfds in the select() call. An application must not leave an XCM socket unattended, in the sense that its fd is missing from the set of fds passed to select(), and/or that none of xcm_send(), xcm_receive(), xcm_accept() or xcm_finish() is called when its fd is marked readable by select().
XCM is oblivious to which I/O multiplexing mechanism is employed by the application. It may call select(), poll() or epoll_wait() directly, or make use of any of the many available event loop libraries (such as libevent). For simplicity, select() is used in this documentation to denote the whole family of Linux I/O multiplexing facilities.
An event-driven application needs to set the XCM sockets it handles into non-blocking mode, by calling xcm_set_blocking(), setting the "xcm.blocking" socket attribute, or using the XCM_NONBLOCK flag in xcm_connect().
For XCM sockets in non-blocking mode, all potentially blocking API calls related to XCM connections - xcm_connect(), xcm_accept(), xcm_send(), and xcm_receive() - finish immediately.
The inability to finish the requested operation without blocking the thread (i.e., putting the thread to sleep) is signaled in the typical UNIX manner, by returning NULL or -1 (depending on the return type) and setting errno to EAGAIN. Unlike most other errno values, EAGAIN signals a temporary condition, and not a fatal error.
For xcm_send(), xcm_connect() and xcm_accept(), XCM signaling success means that the XCM layer has accepted the request. It may or may not have completed the operation.
In case the XCM_NONBLOCK flag is set in the xcm_connect() call, or in case an XCM server socket is in non-blocking mode at the time of an xcm_accept() call, the newly created XCM connection returned to the application may be in a semi-operational state, with some internal processing and/or signaling with the remote peer still required before actual message transmission and reception may occur.
The application may attempt to send or receive messages on such semi-operational connections.
There are ways for an application to determine when connection establishment or the task of accepting a new client have completed. See Finishing Outstanding Tasks for more information.
To receive a message on an XCM connection socket in non-blocking mode, the application may need to wait for the right conditions to arise (i.e., a message being available). The application needs to inform the socket that it wants to receive by calling xcm_await() with the XCM_SO_RECEIVABLE bit set in the condition bit mask. It will pass the fd it received from xcm_fd() into select(), asking to get notified when the fd becomes readable. When select() marks the socket fd as readable, the application should issue xcm_receive() to attempt to retrieve a message.
xcm_receive() may also be called on speculation, prior to any select() call, to poll the socket for incoming messages.
An XCM connection socket may have a number of messages buffered, and applications should generally, for optimal performance, repeat xcm_receive() until it returns an error, and errno is set to EAGAIN.
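Putting these pieces together, a receive path for a single non-blocking connection socket may be sketched as below; a real application would typically multiplex several sockets in the same select() loop.

```c
#include <xcm.h>

#include <errno.h>
#include <sys/select.h>

static void receive_loop(struct xcm_socket *conn)
{
    /* the application is interested in receiving only */
    xcm_await(conn, XCM_SO_RECEIVABLE);

    for (;;) {
        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(xcm_fd(conn), &readfds);

        select(xcm_fd(conn) + 1, &readfds, NULL, NULL, NULL);

        /* drain the socket until it would block */
        for (;;) {
            char buf[65535];
            int rc = xcm_receive(conn, buf, sizeof(buf));

            if (rc > 0) {
                /* process the message (or data) in buf */
            } else if (rc == 0)
                return; /* remote end closed the connection */
            else if (errno == EAGAIN)
                break; /* nothing more available; back to select() */
            else
                return; /* fatal error */
        }
    }
}
```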
Similarly to receiving a message, an application may set the XCM_SO_SENDABLE bit in the condition bit mask, if it wants to wait for a socket state where it's likely it can successfully send a message. When select() marks the socket fd as readable, the application should attempt to send a message.
Just like with xcm_receive(), it may also choose to issue an xcm_send() call on speculation (i.e., without going into select()), which is often a good idea for performance reasons.
For send operations on non-blocking connection sockets, XCM may buffer whole or part of the message (or data, for byte stream transports) before transmission to the lower layer. This may be due to socket output buffer underrun, or the need for some in-band signaling, like cryptographic key exchange, to happen before the transmission of the complete message may finish. The XCM layer will (re-)attempt to hand the message over to the lower layer at a future call to xcm_finish(), xcm_send(), or xcm_receive().
Applications wishing to determine when all buffered data have been successfully delivered to the lower layer may use xcm_finish() to do so. Normally, applications aren't expected to require this kind of control. Please also note that the fact that a message has left the XCM layer doesn't necessarily mean it has been successfully delivered to the recipient. In particular, even if the data can be dispatched immediately, it may be lingering in kernel buffers. Such buffers may be discarded in case the application closes the connection.
xcm_connect(), xcm_accept() and xcm_send() may all leave the socket in a state where work has been initiated, but not completed. In addition, the socket may have pending internal tasks, such as flushing the output buffer into the TCP/IP stack, processing XCM control interface messages, or finishing the TLS handshake procedure.
After waking up from a select() call, where a particular XCM non-blocking socket's fd is marked readable, the application must, if no xcm_send(), xcm_receive() or xcm_accept() calls are to be made, call xcm_finish(). This is to allow the socket to finish any outstanding tasks, even in the case the application has no immediate plans for the socket.
Prior to changing a socket from non-blocking to blocking mode, any outstanding tasks should be finished, or otherwise the switch might cause xcm_set_blocking() to return -1 and set errno to EAGAIN.
For example, if a server socket's desired condition has been set (with xcm_await()) to XCM_SO_ACCEPTABLE, and the application wakes up from select() with the socket's fd marked readable, a call to xcm_accept() may still not produce a new connection socket.
The same holds true for XCM_SO_RECEIVABLE and an xcm_receive() call, and for XCM_SO_SENDABLE and calls to xcm_send().
In this example, the application connects and tries to send a message, before knowing if the connection is actually established. This may fail (for example, in case TCP and/or TLS-level connection establishment has not yet been completed), in which case the application will fall back and wait with the use of xcm_await(), xcm_fd() and select().
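A sketch of that pattern; the address and message are supplied by the caller, and the connection is handed back (rather than closed) so that any data still buffered in the XCM layer is not discarded.

```c
#include <xcm.h>

#include <errno.h>
#include <stddef.h>
#include <sys/select.h>

/* Connect in non-blocking mode and send one message, attempting the
   send before connection establishment is known to have completed. */
static struct xcm_socket *connect_and_send(const char *addr,
                                           const void *msg, size_t len)
{
    struct xcm_socket *conn = xcm_connect(addr, XCM_NONBLOCK);
    if (conn == NULL)
        return NULL;

    /* send on speculation; EAGAIN means "not yet", so wait and retry */
    while (xcm_send(conn, msg, len) < 0) {
        if (errno != EAGAIN) {
            xcm_close(conn);
            return NULL;
        }

        xcm_await(conn, XCM_SO_SENDABLE);

        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(xcm_fd(conn), &readfds);
        select(xcm_fd(conn) + 1, &readfds, NULL, NULL, NULL);
    }

    return conn;
}
```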
In case the application wants to know when the connection establishment has finished, it may use xcm_finish() to do so, like in the below example sequence.
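A sketch of such an establishment wait; the connection is returned fully established, or NULL on failure.

```c
#include <xcm.h>

#include <errno.h>
#include <sys/select.h>

static struct xcm_socket *connect_wait(const char *addr)
{
    struct xcm_socket *conn = xcm_connect(addr, XCM_NONBLOCK);
    if (conn == NULL)
        return NULL;

    /* no send/receive intended yet; just let XCM finish its tasks */
    xcm_await(conn, 0);

    while (xcm_finish(conn) < 0) {
        if (errno != EAGAIN) {
            xcm_close(conn);
            return NULL;
        }

        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(xcm_fd(conn), &readfds);
        select(xcm_fd(conn) + 1, &readfds, NULL, NULL, NULL);
    }

    return conn;
}
```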
While connecting to a server socket, the client's connection attempt may be refused immediately.
In many cases, the application is handed a connection socket before the connection establishment has completed. Any errors occurring during this process are handed over to the application at a future call to xcm_finish(), xcm_send() or xcm_receive().
In this example the application flushes any internal XCM buffers before shutting down the connection, to ensure that any buffered messages are delivered to the lower layer.
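A sketch of such a flush-before-close sequence, for a non-blocking connection socket:

```c
#include <xcm.h>

#include <errno.h>
#include <sys/select.h>

static void flush_and_close(struct xcm_socket *conn)
{
    xcm_await(conn, 0);

    /* keep calling xcm_finish() until all buffered data has been
       handed over to the lower layer, or a fatal error occurs */
    while (xcm_finish(conn) < 0 && errno == EAGAIN) {
        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(xcm_fd(conn), &readfds);
        select(xcm_fd(conn) + 1, &readfds, NULL, NULL, NULL);
    }

    xcm_close(conn);
}
```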
In this sequence, a server accepts a new connection, and continues to attempt to receive a message on this connection, while still, concurrently, being ready to accept more clients on the server socket.
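A compressed sketch of that pattern is shown below. For brevity, it serves only one connection at a time; a real server would keep a collection of connection sockets.

```c
#include <xcm.h>

#include <errno.h>
#include <stdbool.h>
#include <sys/select.h>

static void serve(struct xcm_socket *server_sock)
{
    struct xcm_socket *conn = NULL;

    xcm_set_blocking(server_sock, false);
    xcm_await(server_sock, XCM_SO_ACCEPTABLE);

    for (;;) {
        fd_set readfds;
        FD_ZERO(&readfds);

        int max_fd = xcm_fd(server_sock);
        FD_SET(max_fd, &readfds);

        if (conn != NULL) {
            FD_SET(xcm_fd(conn), &readfds);
            if (xcm_fd(conn) > max_fd)
                max_fd = xcm_fd(conn);
        }

        select(max_fd + 1, &readfds, NULL, NULL, NULL);

        if (FD_ISSET(xcm_fd(server_sock), &readfds)) {
            struct xcm_socket *new_conn = xcm_accept(server_sock);

            if (new_conn != NULL && conn == NULL) {
                conn = new_conn;
                xcm_await(conn, XCM_SO_RECEIVABLE);
            } else if (new_conn != NULL)
                xcm_close(new_conn); /* one client only, for brevity */
        }

        if (conn != NULL && FD_ISSET(xcm_fd(conn), &readfds)) {
            char buf[65535];
            int rc = xcm_receive(conn, buf, sizeof(buf));

            if (rc > 0) {
                /* process the message in buf */
            } else if (rc == 0 || errno != EAGAIN) {
                xcm_close(conn);
                conn = NULL;
            }
        }
    }
}
```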
Associated with an XCM server or connection socket is a set of XCM socket attributes.
Socket attributes represent both read-only state (e.g., TCP round-trip time), and read-write run-time configuration (e.g., TCP keepalive configuration).
Which attributes are present varies across different transports, socket types (i.e., server or connection) and socket states (i.e., fully established or not).
The socket attribute API <xcm_attr.h> provides access to transport-specific parameters, without the need to extend the API with transport-specific function calls.
Socket attributes are organized as a tree. An attribute's name is a string which describes a path to a node in the tree. The leaf nodes are one of a number of primitive types, such as integers and strings (see <xcm_attr_types.h> for the full list). Composite (interior) nodes are either dictionaries or lists.
An attribute may be read-only, write-only or available both for reading and writing. This is referred to as the attribute's mode. The mode may vary across the lifetime of the socket. For example, an attribute may be writable at the time of the xcm_connect_a() call, and read-only thereafter.
An attribute name (or, path name) is a string consisting of a sequence of dictionary keys and list indices. Keys are separated by a period ".", and indices are enclosed in square brackets "[<index>]".
Read from the left, each segment in the path moves one level away from the root.
Here are some examples of attribute path names:
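For instance, the following path names (all taken from the attribute tables later in this document) illustrate the syntax:

```
xcm.blocking                 A leaf node under the "xcm" dictionary
tcp.rtt                      A leaf node under the "tcp" dictionary
tls.peer.cert.san.dns[0]     The first element of a list of DNS type SANs
```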
The attribute's value is coded in the native C data type and native CPU byte order. Strings are NUL-terminated, and the NUL character is included in the length of the attribute. There are five value types: a boolean type, a 64-bit signed integer type, a string type, a type for arbitrary binary data, and a double-precision floating point type. See xcm_attr_types.h for details.
In the current API, only leaf nodes can be accessed (i.e., it's not possible to retrieve a list or a dictionary in a single call).
The attribute access API is in xcm_attr.h.
The socket attribute tree has an unnamed root. This root dictionary has a number of keys.
The generic XCM attributes, available in all transports, are organized under the "xcm" key. Transport-specific attributes are prefixed with the transport or protocol name (e.g. "tcp" for TCP-specific attributes applicable to the TLS, BTLS, TCP, and BTCP transports).
Retrieving the value of an attribute is done using xcm_attr_get(), or any of its many type-specific convenience functions.
Below is an example of reading an integer attribute value.
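For instance, reading the "tcp.rtt" attribute (described in TCP Socket Attributes), assuming the xcm_attr_get_int64() convenience getter:

```c
#include <xcm.h>
#include <xcm_attr.h>

#include <stdint.h>
#include <stdio.h>

static void print_rtt(struct xcm_socket *conn)
{
    int64_t rtt;

    if (xcm_attr_get_int64(conn, "tcp.rtt", &rtt) < 0) {
        perror("xcm_attr_get_int64");
        return;
    }

    printf("TCP round-trip time estimate: %lld us\n", (long long)rtt);
}
```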
Iterating over a list may look something like below.
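A sketch, building the indexed path names by hand and reading string elements of the "tls.peer.cert.san.dns" list (see TLS Socket Attributes) until the lookup fails; the helper name xcm_attr_get_str() is assumed to be one of the type-specific getters.

```c
#include <xcm.h>
#include <xcm_attr.h>

#include <stdio.h>

static void print_peer_dns_sans(struct xcm_socket *conn)
{
    int i;

    for (i = 0;; i++) {
        char path[64];
        char san[256];

        snprintf(path, sizeof(path), "tls.peer.cert.san.dns[%d]", i);

        if (xcm_attr_get_str(conn, path, san, sizeof(san)) < 0)
            break; /* no more elements (or an error occurred) */

        printf("SAN %d: %s\n", i, san);
    }
}
```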
Modifying the value of an attribute is done using xcm_attr_set(), or any of its many type-specific convenience functions.
For example, setting the value of a boolean attribute may be done like below.
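A sketch, assuming the xcm_attr_set_bool() convenience setter; it puts a socket into non-blocking mode via its "xcm.blocking" attribute.

```c
#include <xcm.h>
#include <xcm_attr.h>

#include <stdbool.h>

static void set_nonblocking(struct xcm_socket *s)
{
    /* equivalent to xcm_set_blocking(s, false) */
    xcm_attr_set_bool(s, "xcm.blocking", false);
}
```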
Please note that all of these examples lack the error handling required in a real application.
An application may modify multiple attributes in one go, as a part of socket creation, by populating an attribute map and passing it to xcm_connect_a(), xcm_server_a(), or xcm_accept_a().
Certain attributes' default values (e.g., attribute controlling what TLS credentials are used) may only be modified in this manner.
Each key-value pair in the attribute maps is used as an instruction to set a node in the socket attribute tree to a particular value. The node's path is the key's name, and new desired value of the node is the key's value.
The attribute maps are represented by the xcm_attr_map type in xcm_attr_map.h.
The caller retains the ownership of the attribute map passed to xcm_connect_a(), xcm_server_a(), or xcm_accept_a(), and may destroy it after the call has completed, or reuse it.
A somewhat contrived example:
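The sketch below shows the general shape: the attribute names are taken from the attribute tables in this document, the address, file path and values are placeholders, and the xcm_attr_map_add_*() helper names are assumptions based on the type-specific helpers in xcm_attr_map.h.

```c
#include <xcm.h>
#include <xcm_attr_map.h>

#include <stdbool.h>

static struct xcm_socket *connect_with_attrs(const char *addr)
{
    struct xcm_attr_map *attrs = xcm_attr_map_create();

    xcm_attr_map_add_bool(attrs, "xcm.blocking", false);
    xcm_attr_map_add_str(attrs, "tls.cert_file", "/etc/example/cert.pem");
    xcm_attr_map_add_int64(attrs, "tcp.keepalive_count", 2);

    struct xcm_socket *conn = xcm_connect_a(addr, attrs);

    /* the map is owned by the caller, and may be destroyed (or
       reused) once the call has returned */
    xcm_attr_map_destroy(attrs);

    return conn;
}
```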
Connection sockets spawned off a server socket will inherit those of the server socket's attributes that also apply to connection sockets. An application may override such values by passing different values in the xcm_accept_a() call.
These attributes are expected to be found on XCM sockets regardless of transport type.
For TCP and BTCP transport-specific attributes, see TCP Socket Attributes, and for TLS and BTLS, see TLS Socket Attributes. For DNS-related attributes (shared among all TCP-based transports) see DNS Socket Attributes.
Attribute Name | Socket Type | Value Type | Mode | Description |
---|---|---|---|---|
xcm.type | All | String | R | The socket type: "server" or "connection". |
xcm.transport | All | String | R | The transport type. |
xcm.service | All | String | RW | The service type: "messaging" or "bytestream". Writable only at the time of socket creation. If specified, it may be used by an application to limit the type of transports being used. The string "any" may be used to signify that any type of service is accepted. The default value is "messaging". |
xcm.local_addr | All | String | RW | The local address of a socket. Writable only if supplied to xcm_connect_a() together with a TLS, UTLS or TCP type address. Usually only needs to be written on multihomed hosts, in cases where the application needs to specify the source IP address to be used. Also see xcm_local_addr(). |
xcm.blocking | All | Boolean | RW | See xcm_set_blocking() and xcm_is_blocking(). The default value is true. |
xcm.remote_addr | Connection | String | R | See xcm_remote_addr(). |
xcm.max_msg_size | Connection | Integer | R | The local maximum size of any message transported by this connection. The remote end may have a different opinion on what is the upper limit. |
XCM connection sockets keep track of the amount of data entering or leaving the XCM layer, both from the application and to the lower layer. Additionally, messaging transports also track the number of messages.
Some of the message and byte counter attributes use the concept of a "lower layer". What this means depends on the transport. For the UX and TCP transports, it is the Linux kernel. For example, for TCP, if xcm.to_lower_msgs is incremented, it means that XCM has successfully sent the complete message to the kernel's networking stack for further processing. It does not mean it has reached the receiving process. It may have, but it may also be sitting in the local or remote socket buffer, in a NIC queue, or be in transit in the network. For TLS, the lower layer is OpenSSL.
The counters only reflect data successfully sent and/or received.
These counters are available on both byte stream and messaging type connection sockets.
The byte counters are incremented with the length of the XCM data (as in the length field in xcm_send()), and thus do not include any underlying headers or other lower layer overhead.
Attribute Name | Socket Type | Value Type | Mode | Description |
---|---|---|---|---|
xcm.from_app_bytes | Connection | Integer | R | Bytes sent from the application and accepted into XCM. |
xcm.to_app_bytes | Connection | Integer | R | Bytes delivered from XCM to the application. |
xcm.from_lower_bytes | Connection | Integer | R | Bytes received by XCM from the lower layer. |
xcm.to_lower_bytes | Connection | Integer | R | Bytes successfully sent by XCM into the lower layer. |
These counters are available only on messaging type connection sockets.
Attribute Name | Socket Type | Value Type | Mode | Description |
---|---|---|---|---|
xcm.from_app_msgs | Connection | Integer | R | Messages sent from the application and accepted into XCM. |
xcm.to_app_msgs | Connection | Integer | R | Messages delivered from XCM to the application. |
xcm.from_lower_msgs | Connection | Integer | R | Messages received by XCM from the lower layer. |
xcm.to_lower_msgs | Connection | Integer | R | Messages successfully sent by XCM into the lower layer. |
XCM includes a control interface, which allows iteration over the OS instance's XCM server and connection sockets (for processes with the appropriate permissions), and access to their attributes (see Socket Attributes).
Security-sensitive attributes (e.g., tls.key) cannot be accessed.
The control interface is optional by means of build-time configuration.
For each XCM server or connection socket, there is a corresponding UNIX domain socket which is used for control signaling (i.e. state retrieval).
By default, the control interface's UNIX domain sockets are stored in the /run/xcm/ctl directory.
This directory needs to be created prior to running any XCM applications for the control interface to work properly, and should be writable for all XCM users.
A particular process using XCM may be configured to use a non-default directory for storing the UNIX domain sockets used for the control interface by setting the XCM_CTL variable. Please note that using this setting will cause the XCM connections not to be visible globally on the OS instance (unless all other XCM-using processes also use this non-default directory).
Generally, since the application is left unaware (from an API perspective) of the existence of the control interface, errors are not reported up to the application. They are however logged.
Application threads owning XCM sockets, but which are busy with non-XCM processing for a long duration of time, or otherwise leave their XCM sockets unattended (in violation of the XCM API contract), will not respond on the control interface's UNIX domain sockets (corresponding to their XCM sockets). Only the presence of these sockets may be detected, but their state cannot be retrieved.
Internally, the XCM implementation has a control interface client library, but this library's API is not public at this point.
XCM includes a command-line program xcmctl, which uses the Control API to iterate over the system's current XCM sockets, and allows access (primarily for debugging purposes) to the sockets' attributes.
XCM API calls are MT safe provided the threads do not operate on the same socket at the same time.
Thus, multiple threads may make XCM API calls in parallel, provided the calls refer to different XCM sockets.
An XCM socket may not be shared among different threads without synchronization external to XCM. Provided calls are properly serialized (e.g., with a mutex lock), a socket may be shared by different threads in the same process. However, this might prove difficult, since a thread in a blocking XCM function will continue to hold the lock, preventing other threads from accessing the socket.
For non-blocking sockets (with external synchronization), threads sharing a socket need to agree on the appropriate socket condition to wait for. When this condition is met, all threads are woken up, returning from select().
It is safe to "give away" an XCM socket from one thread to another, provided the appropriate memory fences are used.
These limitations (compared to BSD Sockets) are in place to allow socket state outside the kernel (which is required for TCP framing and TLS).
Sharing an XCM socket between threads in different processes is not possible.
After a fork() call, either of the two processes (the parent, or the child) must be designated the owner of every XCM socket the parent owned.
The owner may continue to use the XCM socket normally.
The non-owner may not call any XCM API function other than xcm_cleanup(), which frees local memory tied to the socket in the non-owner's process address space, without impacting the connection state in the owner process.
The core XCM API functions are oblivious to the transports used. However, the support for building and parsing addresses is available only for a pre-defined set of transports. There is nothing preventing xcm_addr.h from being extended, and nothing preventing an alternative XCM implementation from including more transports without extending the address helper API.
The UX transport uses UNIX Domain (AF_UNIX, also known as AF_LOCAL) sockets to provide a messaging type service.
UX sockets may only be used within the same OS instance (or, more specifically, between processes in the same Linux kernel network namespace).
UNIX Domain Sockets come in a number of flavors, and XCM uses the SOCK_SEQPACKET variety. SOCK_SEQPACKET sockets are connection-oriented, preserve message boundaries, and deliver messages in the same order they were sent; a perfect match for XCM semantics, providing for a near-trivial mapping.
UX is the most efficient of the XCM transports.
The UX transport has a nominal maximum message size of 262144 bytes. This limit may be lower due to conservative kernel runtime configuration (i.e., low net.core.wmem_max values). In such cases, xcm.max_msg_size will reflect the actual upper limit, at the time of socket creation. The maximum message size may change in future versions of the UX transport.
The standard UNIX Domain Sockets, as defined by POSIX, use the file system as their namespace, with the sockets also being files. However, for simplicity and to avoid situations where stale socket files (originating from crashed processes) cause problems, the UX transport uses a Linux-specific extension, allowing a private UNIX Domain Socket namespace. This is known as the abstract namespace (see the unix(7) man page for details). With the abstract namespace, server socket addresses have the same lifetime as TCP ports (i.e., if the process dies, the address is freed).
The UX transport enables the SO_PASSCRED BSD socket option, to give the remote peer a name (which UNIX domain connection sockets don't have by default). This is for debugging and observability purposes. Without a remote peer name, in server processes with multiple incoming connections to the same server socket, it's difficult to say which of the server-side connection sockets goes to which remote peer. The kernel-generated, unique name is an integer in the form "%05x" (printf format). Applications using hardcoded UX addresses should avoid such names by, for example, using a prefix.
The UTLS Transport also indirectly uses the UX namespace, so care should be taken to avoid any clashes between UX and UTLS sockets in the same network namespace.
The UXF transport is identical to the UX transport, only it uses the standard POSIX naming mechanism. The name of a server socket is a file system path, and the socket is also a file.
UXF sockets reside in a file system namespace, as opposed to UX sockets, which live in a network namespace.
Upon xcm_close(), the socket will be closed and the file removed. If an application crashes or otherwise fails to call xcm_close(), it will leave a file in the file system pointing toward a non-existing socket. This file will prevent the creation of another server socket with the same name.
The TCP transport uses the Transmission Control Protocol (TCP), by means of the BSD Sockets API.
TCP is a byte-stream service, but the XCM TCP transport adds framing on top of the stream. A single-field 32-bit header containing the message length in network byte order is added to every message.
TCP uses TCP Keepalive to detect lost network connectivity between the peers.
The TCP transport has a maximum message size of 262144 bytes. This limit may change in future versions.
The TCP transport supports IPv4 and IPv6.
Since XCM is designed for signaling traffic, the TCP transport disables the Nagle algorithm of TCP to avoid its excessive latency.
The TLS transport (and all other TCP protocol-based transports) supports a number of socket attributes controlling DNS-related behavior.
The DNS resolver used by XCM (either glibc or C-ares) sorts the A and AAAA records retrieved from DNS in an order of preference, before returning them to the caller. In the glibc case, the details of the sorting are a function of the system's configuration (i.e., /etc/gai.conf). In the C-ares case, the sorting is according to RFC 6724 (with some minor deviations).
By default, XCM will only attempt to connect to the first, most preferred, address in the list of IP addresses provided by the resolver. If that connection attempt fails, the XCM connection establishment procedure will be terminated.
Using the "dns.algorithm" socket attribute, the application may control the DNS resolution and TCP connection establishment procedure used.
By default, "dns.algorithm" is set to "single", behaving in accordance to the above description.
If the algorithm is set to "sequential", all IP addresses will be probed, in a serial manner, in the order provided by the DNS resolver.
Setting the algorithm to "happy_eyeballs" will result in RFC 6555-like behavior, with two concurrent connection establishment tracks; one attempting to establish an IPv4 connection and the other an IPv6-based connection. The IPv6 track is given a 200 ms head start.
When the "sequential" or "happy_eyeballs" algorithm is used, only the first 32 addresses provided by the resolver will be considered.
Attribute Name | Socket Type | Value Type | Mode | Description |
---|---|---|---|---|
dns.algorithm | Connection | String | RW | The algorithm used for connecting to IP addresses retrieved from DNS. Must take on the value "single", "sequential", or "happy_eyeballs". See DNS Resolution and TCP Connection Establishment for more information. Writable only at the time of the xcm_connect_a() call. |
dns.timeout | Connection | Double | RW | The time (in s) until DNS resolution times out. Writable only at the time of the xcm_connect_a() call. The timeout covers the complete DNS resolution process (as opposed to a particular query-response transaction). Only available when the library is built with the c-ares DNS resolver. |
The read-only TCP attributes are retrieved from the kernel (struct tcp_info in linux/tcp.h).
Many read-write attributes are mapped directly to setsockopt() calls.
See the tcp(7) manual page for a more detailed description of these attributes. The struct retrieved with TCP_INFO is the basis for the read-only attributes. The read-write attributes are mapped to TCP_KEEP* and TCP_USER_TIMEOUT.
Besides the TCP layer attributes, IP- and DNS-level attributes are also listed here.
Attribute Name | Socket Type | Value Type | Mode | Description |
---|---|---|---|---|
tcp.rtt | Connection | Integer | R | The current TCP round-trip estimate (in us). |
tcp.total_retrans | Connection | Integer | R | The total number of retransmitted TCP segments. |
tcp.segs_in | Connection | Integer | R | The total number of segments received. |
tcp.segs_out | Connection | Integer | R | The total number of segments sent. |
tcp.connect_timeout | Connection | Double | RW | The time (in s) until a particular TCP connection establishment attempt times out. Writable only at the time of the xcm_connect_a() call. The default is 3 s. The value of this attribute must be lower than the value of "tcp.user_timeout" to have any effect. Note that if "dns.algorithm" is set to "sequential" or "happy_eyeballs", one xcm_connect_a() call may result in several TCP connection establishment attempts. |
tcp.user_timeout | Connection | Integer | RW | The time (in s) before a connection is dropped due to unacknowledged data. The default value is 3 s. |
tcp.keepalive | Connection | Boolean | RW | Controls if TCP keepalive is enabled. The default value is true. |
tcp.keepalive_time | Connection | Integer | RW | The time (in s) before the first keepalive probe is sent on an idle connection. The default value is 1 s. |
tcp.keepalive_interval | Connection | Integer | RW | The time (in s) between keepalive probes. The default value is 1 s. |
tcp.keepalive_count | Connection | Integer | RW | The number of keepalive probes sent before the connection is dropped. The default value is 3. |
ipv6.scope | All | Integer | RW | The IPv6 scope id used. Only available on IPv6 sockets. Writable only at socket creation. If left unset, it will take on the value of 0 (the global scope). Any other value denotes the network interface index to be used, for IPv6 link local addresses. See the if_nametoindex(3) manual page for how to map interface names to indices. |
tcp.segs_in and tcp.segs_out are only present when running XCM on Linux kernel 4.2 or later.

The TLS transport uses the Transport Layer Security (TLS) protocol to provide a secure, private, two-way authenticated transport over TCP. A TLS connection is a byte stream, but the XCM TLS transport adds framing in the same manner as does the XCM TCP transport.
The TLS transport supports IPv4 and IPv6. It disables the Nagle algorithm of TCP.
The TLS transport has a maximum message size of 262144 bytes. This limit may change in future versions.
The TLS transport honors any limitations set by the X.509 extended key usage extension, if present in the remote peer's certificate.
The TLS transport only employs TLS 1.2 and, if the XCM library is built with OpenSSL 1.1.1 or later, TLS 1.3 as well.
TLS 1.2 renegotiation is disabled, if the XCM library is built with OpenSSL 1.1.1c or later.
The TLS transport disables both client and server-side TLS session caching, and thus does not allow for TLS session reuse across TCP connections.
The TLS 1.2 cipher list is (in order of preference, using OpenSSL naming): ECDHE-ECDSA-AES128-GCM-SHA256, ECDHE-ECDSA-AES256-GCM-SHA384, ECDHE-ECDSA-CHACHA20-POLY1305, ECDHE-RSA-AES128-GCM-SHA256, ECDHE-RSA-AES256-GCM-SHA384, ECDHE-RSA-CHACHA20-POLY1305, DHE-RSA-AES128-GCM-SHA256, DHE-RSA-AES256-GCM-SHA384, and DHE-RSA-CHACHA20-POLY1305.
The TLS 1.3 cipher suites used are: TLS_AES_256_GCM_SHA384, TLS_CHACHA20_POLY1305_SHA256 and TLS_AES_128_GCM_SHA256.
The TLS cipher lists are neither build- nor run-time configurable.
By default, the TLS transport reads the leaf certificate and the corresponding private key from the file system, as well as a file containing all trusted CA certificates. The default file system paths are configured at build-time.
TLS Socket Attributes may be used to override one or more of the default paths, on a per-socket basis. Paths set on server sockets are inherited by their connection sockets, but may in turn be overridden at the time of an xcm_accept_a() call, using the proper attributes.
The default paths may also be overridden on a per-process basis by setting a UNIX environment variable. The current value of XCM_TLS_CERT (at the time of xcm_connect() or xcm_accept()) determines the certificate directory used for that connection.
An application may also choose to configure TLS socket credentials by value, rather than by file system reference. For a particular piece of information, an application must either supply a file system path (e.g., by setting tls.cert_file) or the actual data (e.g., by passing the certificate data as the value of the tls.cert attribute).
Setting a credentials by-value attribute in the xcm_attr_map passed to xcm_accept_a() will override the corresponding by-reference attribute in the server socket, and vice versa.
Certificates (including CRLs) and private keys provided to XCM (either via files or by attribute value) must be in the Privacy-Enhanced Mail (PEM) format (RFC 7468).
The TLS transport will, at the time of xcm_connect() or xcm_server(), look up the process' current network namespace, unless the credential file paths were supplied as TLS Socket Attributes. If the namespace is given a name per the iproute2 convention, XCM will retrieve this name and use it in the certificate and key lookup.
In case the certificate, key and trusted CA files are configured using TLS Socket Attributes, no network namespace lookup will be performed.
In the certificate directory (either the compile-time default, or the directory specified with XCM_TLS_CERT), the TLS transport expects the files to adhere to the following naming conventions (where <ns> is the namespace):
The private key is stored in:
The trusted CA certificates are stored in:
The certification revocation lists (CRLs) are stored in:
For the default namespace (or rather, any network namespace not named according to iproute2 standards), the certificate needs to be stored in a file "cert.pem" and the private key in "key.pem".
If authentication is enabled (which it is, by default), the trusted CA certificates need to be stored in a file named "tc.pem".
If CRL checking is enabled (which it is not, by default), the CRLs need to be stored in a file named "crl.pem".
In case the appropriate credential-related files are not in place (for a particular namespace), an xcm_server() call will return an error and set errno to EPROTO. The application may choose to retry at a later time.
If authentication is disabled, "tc.pem" need not be present, and vice versa. The same applies to CRL checking and "crl.pem" availability.
In case a certificate, private key, or trusted CAs file is modified, the new version of the file(s) will be used by new connections. Such a change does not affect already-established connections. The TLS transport works with differences between sets of files, and thus the new generation of files need not necessarily be newer (as in having a more recent file system mtime).
The certificate, key and trusted CA certificates should be updated in an atomic manner, or XCM may end up using the certificate file from one generation of files and the key file from another, for example.
One way of achieving an atomic update is to have the three files in a common directory. This certificate directory is then made a symbolic link to the directory where the actual files are located. Upon update, a new directory is created and populated, and the old symbolic link is replaced in an atomic manner (i.e., with rename(2)).
By default, on sockets that represent the client side of an XCM TLS connection (e.g., returned from xcm_connect_a()), the XCM TLS transport will act as a TLS client. Similarly, the default behavior for sockets representing the XCM (and TCP) server side of a connection is to act as a TLS server.
The default may be changed by setting the "tls.client" attribute, so that sockets that are XCM (and TCP) level clients, act as TLS servers, and vice versa. If the value is true, the socket will act as a TLS client, and if false, the socket is a TLS server.
Connection sockets created by xcm_accept() or xcm_accept_a() inherit the "tls.client" attribute value from their parent server sockets.
The TLS role must be specified at the time of socket creation, and thus cannot be changed on already-established connections.
By default, both the client and server side authenticate the other peer, often referred to as mutual TLS (mTLS).
TLS remote peer authentication may be disabled by setting the "tls.auth" socket attribute to false. The default value is true.
Connection sockets created by xcm_accept() or xcm_accept_a() inherit the "tls.auth" attribute value from their parent server sockets.
The "tls.auth" socket attribute may only be set at the time of socket creation (except for server sockets).
The TLS transport supports verifying the remote peer's certificate subject name against an application-specified expected name, or a set of names. "Subject name" here is used as per the RFC 6125 definition, and is either a Distinguished Name (DN) of the X.509 certificate's subject field, or a DNS type subject alternative name extension. XCM does not make any distinction between the two.
Subject name verification may be enabled by setting the "tls.verify_peer_name" socket attribute to true. It is disabled by default.
If enabled, XCM will verify the hostname in the address supplied in the xcm_connect_a() call. In case the attribute "tls.peer_names" is also supplied, it overrides this behavior. The value of this attribute is a ':'-separated set of subject names. "tls.peer_names" may not be set unless "tls.verify_peer_name" is set to true.
If there is a non-zero overlap between these two sets, the verification is considered successful. The actual procedure is delegated to OpenSSL. Wildcard matching is disabled (X509_CHECK_FLAG_NO_WILDCARDS
) and the check includes the subject field (X509_CHECK_FLAG_ALWAYS_CHECK_SUBJECT
).
Subject name verification may be used both by a client (in its xcm_connect_a() call) or by a server (in xcm_server_a() or xcm_accept_a()). "tls.peer_names" must be specified in case "tls.verify_peer_name" is set to true on connection sockets created by accepting a TLS connection from a server socket (since there is no hostname to fall back to).
Connection sockets created by xcm_accept() or xcm_accept_a() inherit the "tls.verify_name" and "tls.peer_names" attributes from their parent server sockets.
After a connection is established, the "tls.peer_names" will be updated to reflect the remote peer's actual subject names, as opposed to those which were originally allowed.
OpenSSL refers to this functionality as hostname validation, and that is also how it's usually used. However, the subject name passed in "tls.peer_names" need not be a DNS domain name, but can be any kind of name or identifier. All names must follow DNS domain name syntax rules (including label and total length limitations). Also, while uppercase and lowercase letters are allowed in domain names, no significance is attached to the case.
The XCM TLS transport may be asked to perform checks against one or more Certificate Revocation Lists (CRLs).
CRL checking is enabled by setting the "tls.check_crl" socket attribute to true during socket creation (e.g., when calling xcm_connect_a()). CRL checking is disabled by default. CRL checking may be employed by both TLS client and server endpoints.
The default CRL file location may be overridden using the "tls.crl_file" attribute. Alternatively, the CRL data may be provided by value using the "tls.crl" attribute.
The CRL bundle must be in PEM format, and must be present and valid if CRL checking is enabled.
The full chain is checked against the user-provided CRLs (i.e., in OpenSSL terms, both the X509_V_FLAG_CRL_CHECK and X509_V_FLAG_CRL_CHECK_ALL flags are set).
CRL checking is only meaningful (and allowed) when authentication is enabled.
Due to a bug in OpenSSL, partial chains (i.e., where the trust anchor is a trusted non-root certificate) are not allowed when CRL checking is enabled. In OpenSSL terms, X509_V_FLAG_PARTIAL_CHAIN is disabled when X509_V_FLAG_CRL_CHECK_ALL is enabled. Future versions of XCM, built against newer versions of OpenSSL, may allow partial chains in combination with CRL checking.
By default, the XCM TLS transport checks the validity period of each X.509 certificate in the chain of trust, down to and including the remote peer's leaf certificate, against the current system time. If any certificate is found to be either not yet valid or expired, TLS connection establishment is aborted.
A socket may be configured to accept not-yet-valid certificates and expired certificates by setting the "tls.check_time" to false.
Connection sockets created by xcm_accept() or xcm_accept_a() inherit the "tls.check_time" attribute value from their parent server sockets.
Attribute Name | Socket Type | Value Type | Mode | Description |
---|---|---|---|---|
tls.cert_file | All | String | RW | The leaf certificate file. Writable only at socket creation. |
tls.key_file | All | String | RW | The leaf certificate private key file. Writable only at socket creation. |
tls.tc_file | All | String | RW | The trusted CA certificates bundle file. Writable only at socket creation. May not be set if authentication is disabled. |
tls.crl_file | All | String | RW | The certificate revocation list (CRL) bundle file. Writable only at socket creation. May only be set if CRL checking is enabled. |
tls.cert | All | Binary | RW | The leaf certificate to be used. Writable only at socket creation. |
tls.key | All | Binary | RW | The leaf certificate private key to be used. Writable only at socket creation. For security reasons, the value of this attribute is not available over the XCM control interface. |
tls.tc | All | Binary | RW | The trusted CA certificates bundle to be used. Writable only at socket creation. May not be set if authentication is disabled. |
tls.crl | All | Binary | RW | The certificate revocation list (CRL) bundle to be used. Writable only at socket creation. May only be set if CRL checking is enabled. |
tls.client | All | Boolean | RW | Controls whether to act as a TLS-level client or a server. Writable only at socket creation. |
tls.auth | All | Boolean | RW | Controls whether or not to authenticate the remote peer. Writable only at socket creation. Default value is true. |
tls.check_crl | All | Boolean | RW | Controls whether or not to perform CRL checking. Writable only at socket creation. Default value is false. |
tls.check_time | All | Boolean | RW | Controls if the X.509 certificate validity period is honored. Writable only at socket creation. Default is true. |
tls.verify_peer_name | All | Boolean | RW | Controls if subject name verification should be performed. Writable only at socket creation. Default value is false. |
tls.peer_names | All | String | RW | At socket creation, a list of acceptable peer subject names. After connection establishment, a list of actual peer subject names. Writable only at socket creation. |
tls.peer_subject_key_id | Connection | String | R | The X509v3 Subject Key Identifier of the remote peer, or a zero-length string in case no certificate is available (e.g., the TLS connection is not established, or TLS authentication is disabled and the remote peer did not send a certificate). |
tls.peer.cert.subject.cn | Connection | String | R | The common name (CN) of the remote peer's subject field, provided the certificate (including a CN in the subject DN) exists. |
tls.peer.cert.san.dns | Connection | List | R | A list of strings, where each element is a remote peer's subject alternative name (SAN) of the DNS type. The subject field CN is not included in this list. |
tls.peer.cert.san.emails | Connection | List | R | A list of strings, where each element is a remote peer's SAN of the RFC 822 type. |
tls.peer.cert.san.dirs | Connection | List | R | A list of dictionaries. Each element represents a remote peer's SAN of the directory name type, and contains the key "cn", holding the directory name DN's CN, if the CN is present. |
In addition to the TLS-specific attributes, a TLS socket also has all the DNS Socket Attributes and TCP Socket Attributes (including the IP-level attributes).
The UTLS transport provides a hybrid transport, utilizing both the TLS and UX transports internally for actual connection establishment and message delivery.
On the client side, at the time of xcm_connect(), the UTLS transport determines if the server socket can be reached by using the UX transport (i.e. if the server socket is located on the same OS instance, in the same network namespace). If not, UTLS will attempt to reach the server by means of the TLS transport.
For a particular UTLS connection, either TLS or UX is used (never both). XCM connections to a particular UTLS server socket may be a mix of the two different types.
For an UTLS server socket with the address utls:<ip>:<port>, two underlying addresses will be allocated: tls:<ip>:<port> and ux:<ip>:<port>.

In case DNS is used: tls:<hostname>:<port> and ux:<hostname>:<port>.
UTLS sockets accept all the TLS Socket Attributes, as well as the Generic Attributes. In case a UTLS connection is being established as a UX connection socket, all TLS attributes are ignored.
A wildcard should never be used when creating a UTLS server socket.
If a DNS hostname is used in place of the IP address, both the client and server need to employ DNS, and also agree upon which hostname to use (in case there are several pointing at the same IP address).
Failure to adhere to the above two rules will prevent a client from finding a local server. Such a client will instead establish a TLS connection to the server.
The SCTP transport uses the Stream Control Transmission Protocol (SCTP). SCTP provides a reliable, message-oriented service. In-order delivery is optional, but to adhere to XCM semantics (and for other reasons) XCM leaves SCTP in-order delivery enabled.
The SCTP transport utilizes the native Linux kernel's implementation of SCTP, via the BSD Socket API. The operating mode is such that there is a 1:1-mapping between an association and a socket (fd).
The SCTP transport supports IPv4 and IPv6.
To minimize latency, the SCTP transport disables the Nagle algorithm.
The SCTP transport has a maximum message size of 65535 bytes. This limit may change in future versions.
The BTCP transport provides a reliable two-way byte stream service over TCP.
Unlike the TCP Transport, BTCP doesn't use a framing header or anything else on the wire protocol level that is specific to XCM. In other words, it's a "raw" TCP connection.
Other than the above-mentioned differences, BTCP is identical to the TCP Transport, including supported DNS Socket Attributes and TCP Socket Attributes.
The BTLS transport provides a direct mapping to the Transport Layer Security (TLS) protocol over TCP. It provides a secure, private, two-way authenticated byte stream service.
BTLS has the same relationship to the TLS Transport as the BTCP Transport has to the TCP Transport.
BTLS doesn't add a framing header or any other XCM BTLS-level protocol artefacts on top of the TLS session. In other words, it's a "raw" TLS connection.
Other than providing a byte stream, it's identical to the TLS Transport, including supported DNS Socket Attributes, TCP Socket Attributes and TLS Socket Attributes.
Namespaces are a Linux kernel facility for creating multiple, independent namespaces for kernel resources of a certain kind.
Linux Network Namespaces will affect all transports, except the UXF Transport.
XCM has no explicit namespace support. Rather, the application is expected to use the Linux kernel facilities for this functionality (i.e., switch to the right namespace before calling xcm_server() or xcm_connect()).
In case the system follows the iproute2 conventions in regards to network namespace naming, the TLS and UTLS transports support per-network namespace TLS certificates and private keys.
In case XCM is built with LTTng support, the XCM library will register two tracepoints: com_ericsson_xcm:xcm_debug and com_ericsson_xcm:xcm_error. The xcm_debug tracepoint is very verbose, while messages on xcm_error are rare. The latter is mostly due to the fact that there are very few conditions the library can reliably classify as errors, since many "errors" (e.g., connection refused) may well be the expected result.
If the XCM_DEBUG environment variable is set, the same trace messages that are routed via the LTTng tracepoints are printed to stderr of the process linked to the library.
The tracepoint names and the format of the messages are subject to change, and not to be considered a part of the XCM API.