Statistics¶
Section author: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
The topic of this chapter is operational statistics for the data plane fast path.
Data plane statistics provide information on events that have occurred in the data plane application. The same mechanism may also provide a view into the current state of the data plane. The former is the focus of this chapter.
The statistics discussed in this chapter serve no purpose for the function the data plane application provides. In other words, if the statistics were removed, user data would still flow through the network as before.
The statistics are usually a direct result of fast path packet processing (e.g., counters updated on a per-packet basis). There are also counters in the slow path (assuming such a component exists). Often, the slow path and fast path statistics need to be merged in order to provide a consistent view of the data plane as a whole.
Slow path statistics are more straightforward to implement, since the performance requirements are less stringent, and are out of scope for this chapter.
Counters¶
In an SNMP MIB, counters are non-negative integers that monotonically increase, until they wrap upon exceeding some maximum value (usually 2^32-1 or 2^64-1). A gauge is a non-negative integer that varies within some predefined range.
For simplicity, this chapter uses the term counter to include both what SNMP calls counter and what SNMP calls gauge. In practice, the counter is the more common of the two.
In addition to counters, there may also be pseudo statistics representing the current state of the data plane, rather than a summary of past events. What type is best used to represent such state varies; it may be a boolean, an integer, or a string. The Structure of Management Information (SMI) - the SNMP MIB meta model - allows for a number of other types, such as Unsigned32, TimeTicks and IpAddress. Some of the techniques discussed in this chapter also apply to managing and presenting state information to external parties (e.g., the control plane).
There exists a wide variety of network management protocols for accessing statistics-type information, of which SNMP is just one. The mechanism to access statistics from outside the fast path process, and the data plane application, will be discussed in a future chapter on control plane interfaces.
Although the available counters and their semantics are likely crafted to fit some management plane data model (to avoid complex transformations), the data plane fast path level implementation should be oblivious to which entities access the data, for what purpose, and over what protocols.
Use Cases¶
Data plane statistics can be used for a number of different purposes, including:
Troubleshooting and debugging
Node and network level performance monitoring and profiling
Security (e.g., detecting intrusion attempts)
Indirect functional uses (e.g., billing)
The vague category of indirect functional uses means that while the counters have no role in the local data plane function, they still serve a purpose for the network as a whole. The counters may, for example, be used for billing, or as input to automated or manual network optimization.
Telemetry¶
Data plane statistics can be used as a basis for network telemetry. Telemetry means that network nodes provide statistics to a central location, usually using a push model. This information may then in turn be used to address one or more of the above-mentioned use cases. The data plane statistics implementation need not, and should not, know if its statistics are being used for telemetry.
Requirements¶
There are a number of aspects to consider when specifying requirements for fast path statistics:
Performance
Correctness
Propagation delay
Time correlation
Consistency across multiple counters
Counter Reset
Writer parallelism
Reader preemption safety, read frequency, and acceptable read-side cost
Performance¶
Time Efficiency¶
The per-packet processing latency of a typical data plane application ranges from a couple of hundred clock cycles for low touch applications, up to an average of a couple of tens of thousands of cycles for high touch applications.
Statistics are useful in almost all applications, and supporting a reasonably large set of data plane counters is essential. In the light of this, it makes sense to allocate a chunk of the fast path cycle budget to spend on solving the statistics problem. On the other hand, the fewer cycles spent on any per-packet task the better. The primary business value of the data plane comes from the core data plane function, and only a small minority of the CPU cycles should be spent on statistics - an auxiliary function.
The number of counters updated per packet will range from somewhere around a handful for low latency, low touch applications, up into the hundreds for a complex, multi-layer protocol stack.
One seemingly reasonable assumption is that low touch and high touch applications spend roughly the same amount of domain logic CPU cycles for every counter they need to update. The more elaborate the logic, the more cycles are spent, and the more there is a need to account for what is going on, in order to, for example, profile or debug the application, or the network. [1] Thus a high touch application tends to update more counters per packet than does its low touch cousin.
The next assumption is that no more than 5% of the total per-packet CPU cycle budget should be spent on statistics, and that roughly one counter update per 300 CPU cycles worth of domain logic processing latency is expected.
If these assumptions hold true, the processing latency budget for a counter update is 15 core clock cycles (i.e., 5% of 300).
Uncertainties aside, this approximation gives you an order-of-magnitude level indication of the performance requirements, and underscores the need to pay attention to producer-side statistics performance.
As an example of how things may go wrong if you don’t, consider an application with a budget of 5000 clock cycles per packet and a requirement for an average of 25 counter updates per packet. If the development team goes down the Shared Lock Protected Counters path, the statistics feature alone will consume several orders of magnitude more than the budget for the whole application, assuming the fast path is allocated a handful of cores or more.
Vector Processing¶
If the fast path processing is organized as per the vector packet processing design pattern, the counter update overhead can potentially be reduced.
For example, if the lcore worker thread is handed a vector of 32 packets by the work scheduler, all destined for Example Protocol-level processing, the total_pkts counter need only be updated once.
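As a rough sketch of what this might look like (using the Example Protocol statistics struct introduced later in this chapter, and a hypothetical ep_process_pkt() domain logic function), the counter updates can be folded into a single addition per counter and burst:

static void
ep_process_burst(struct rte_mbuf **pkts, uint16_t nb_pkts)
{
    uint64_t burst_bytes = 0;
    uint16_t i;

    for (i = 0; i < nb_pkts; i++) {
        burst_bytes += rte_pktmbuf_pkt_len(pkts[i]);
        /* ep_process_pkt() is a hypothetical EP domain logic function */
        ep_process_pkt(pkts[i]);
    }

    /* one update per counter for the whole burst, instead of one per packet */
    stats.total_pkts += nb_pkts;
    stats.total_bytes += burst_bytes;
}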
Space Efficiency¶
It’s not unusual to see data plane applications which track (or otherwise manage) large amounts of some kind of flows. Such flows may be TCP connections, HTTP connections, PDCP bearers, IPsec tunnels, or GTP-U tunnels, or indeed not really flows at all, like UEs or known IP hosts. Often, it makes sense to have counters on a per-flow basis. If the number of flows is large, the amount of memory used to store counter state will be as well. In such applications, care must be taken to select a sufficiently space-efficient statistics implementation.
There are two sides to statistics memory usage. One is the amount of memory allocated. The other is working set size, and the spatial (how tightly packed the data is) and temporal locality (how quickly the same cache line is reused) of the memory used, which affects Time Efficiency.
The former may impact requirements for the amount of DDR installed on the system (or allocated to the container), and the amount of huge page memory made available to the fast path process.
The latter translates to CPU core stalls (waiting for memory to be read or written), and thus more CPU cycles required per counter update, and potentially more CPU cycles for the execution of other parts of the fast path, because of the cache pressure generated by statistics processing. The more different counters are updated, the larger the working set size for the application as a whole.
Ideally, counters should not be evicted from the last level cache, and often-used counters should be held in the L1 or L2 caches.
An issue strictly tied to space efficiency is the amount of DRAM required to hold the statistics data structures. Depending on how the counters are organized in memory, only a very small fraction of the memory allocated may actually be used.
A related question is whether memory consumption is required to correlate with the number of active flows, or if it is acceptable to statically preallocate statistics memory for the maximum number of flows supported.
Prototyping may be required to determine how the counter working set size affects performance, both of the statistics-related operations and of other processing. For example, per-core counters are usually much more efficient than a shared statistics data structure. However, if the shared data structure avoids last level cache evictions, but it turns out the per-core counter approach doesn’t, the shared approach may surpass per-core counters in cycle-per-update performance, as well as requiring less memory. Cache effects are notoriously difficult to prototype, in part because, at the time of prototyping, the real network stack domain logic and the rest of the fast path is usually not yet in place.
Flow Affinity¶
An aspect to consider when choosing a statistics implementation for per-flow counters is whether or not the work scheduler employed in the fast path maintains flow-to-core affinity.
The affinity need not be completely static, in the sense that a flow need not be pinned to an lcore worker forever, until a system crash or the heat death of the universe (whichever comes first) occurs. Rather, it’s sufficient if a packet pertaining to a certain flow usually goes to the same worker core as the preceding packet of that flow.
Flow affinity reduces statistics processing overhead regardless of implementation, but applications with shared per-flow counters gain the most.
Flow affinity makes it likely that statistics-related data is in a CPU cache near the core. If, on the other hand, there is no affinity, the cache lines holding the statistics data may well be located in a different core’s private cache (having just been written to), resulting in a very expensive cache miss when being accessed. Other flow-related data follows the same pattern, resulting in affinity generally having a significant positive effect on overall fast path performance.
Parallel Flow Processing¶
Another important aspect to consider for per-flow counters is whether or not packets pertaining to a particular flow are processed in parallel. If a large flow is spread across multiple cores, the shared atomic and shared lock protected approaches will suffer greatly, as all worker cores fight to acquire the same lock, or otherwise access the same cache lines.
Usually, the data plane fast path includes some non-flow (e.g., global or per-interface) counters, so the issue with high contention counters needs to be solved regardless.
Update Propagation Delay¶
Update propagation delay in this chapter refers to the wall-clock latency from the point in time when the counted event occurred, until the counter has been updated and is available to a potential reader. The result of the domain logic processing that caused the counter update (e.g., a packet being modified), and the counter update itself, are never presented atomically to parties outside the fast path process: that would imply that the delivery of packets from the data plane and the delivery of counter updates to the external world could somehow be performed atomically.
The counter implementation patterns presented in this chapter, unless otherwise mentioned, all have a very short update propagation delay. It boils down to the time it takes for the result of a CPU store instruction to become globally visible. This process shouldn’t take more than a couple of hundred nanoseconds, at most.
One way to reason about update propagation delay requirements is to think about how quickly a user, or any kind of data plane application-external agent, could retrieve counters which should reflect a particular packet having been received, processed, or sent.
For example, if serving a control plane request has a practical (although not guaranteed) lower processing latency of 1 ms, that could be used to set an upper boundary for the counter update propagation delay.
Organization¶
A common and straightforward way of organizing counters is as one or more nested C structs. Addressing an individual counter in a C struct is very efficient, since the offset, in relation to the beginning of the struct, is known at compile time.
An issue with C structs is that the maximum cardinality of the various objects (e.g., interfaces, routes, flows, connections, users, bearers) must be known at compile time. The struct may become very large when dimensioned for the worst case, for every type of object. Spatial locality may also be worse than for other, more compact, alternatives.
An alternative to structs is to use some kind of map, for example a hash table or some sort of tree. However, even efficient map implementations typically have a processing latency in the range of tens to hundreds of clock cycles to access the value of a key.
One radically different approach to managing counters is to maintain only changes, and not the complete statistics state, in the data plane fast path application. The aggregation of the changes is either done in a data plane control process or thread, or in the control plane proper. See Shared Counters with Per Core Buffering for more on this.
The statistics can either be organized in a global statistics database for the whole fast path, or broken down on a per-module or per-layer basis, resulting in many different memory blocks being used.
In the case of a global statistics struct, care must be taken not to needlessly couple different protocol modules via the statistics module. On the statistics consumer side, on the other hand, it might make perfect sense to see all statistics, for all the different parts of the data plane fast path.
Please note that the above discussion is orthogonal to the question whether or not the statistics data type should be kept in a single instance (or an instance per module, in the modularized case), or an instance per lcore.
Separate versus Integrated Counters¶
A design decision to make is whether the counters should be an integral part of other network stack state, or kept separate (possibly together with statistics from other modules).
For example, consider an application that has a notion of a flow, a connection, an interface, a port, or a session. It keeps state related to the processing of packets pertaining to such domain-level objects in some data structure.
One option is to keep the counter state produced as a side effect of the network stack domain logic processing in the same struct as the core domain logic state. In case the domain logic state is protected by a lock, the counter updates could be performed under the protection of the same lock, at very little additional cost, provided the counters belong to that object (as opposed to counters related to some aggregate of such objects, or global counters).
A potential reader would use the same lock to serialize access to the counters. This would be an implementation of the Shared Lock Protected Counters pattern, with per-flow locks. Non-flow counters would be handled separately.
There are also hybrids, where the statistics are an integral part of the domain logic data structures, but the statistics synchronization is done differently compared to accesses of domain object state.
The example implementations in this chapter use a simple, self-contained, fixed-size, statically allocated struct, but the various patterns apply to more elaborate, dynamically allocated data structures as well.
One way to reduce the amount of memory used (or at least make it proportional to the data plane application’s configuration), for systems with a high maximum cardinality and large variations in the number of objects actually instantiated, is to use dynamic allocation, while retaining fixed-offset accesses to the fields within those dynamically allocated chunks of memory. For example, the list of flows could be a pointer to a dynamically allocated array, instead of a fixed, compile-time-sized array.
Large counter data structures should be allocated from huge page memory, regardless of whether the statistics struct is self-contained or spread across many different chunks of memory.
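A minimal sketch of what this could look like for the Example Protocol introduced below, assuming a hypothetical ep_stats_init() function called at startup, with the per-session counter array sized according to the application’s configuration and allocated from DPDK huge page memory with rte_zmalloc():

struct ep_stats
{
    struct ep_session_stats *sessions; /* dynamically allocated array */
    uint32_t num_sessions;
    uint64_t total_bytes;
    uint64_t total_pkts;
};

static struct ep_stats stats;

int
ep_stats_init(uint32_t max_sessions)
{
    /* rte_zmalloc() returns zeroed, huge page-backed memory */
    stats.sessions = rte_zmalloc("ep_session_stats",
                                 max_sessions * sizeof(struct ep_session_stats),
                                 RTE_CACHE_LINE_SIZE);
    if (stats.sessions == NULL)
        return -1;

    stats.num_sessions = max_sessions;

    return 0;
}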
Synchronization¶
This chapter describes a number of approaches to counter implementation, with a focus on writer-writer synchronization (i.e., synchronization between different lcore worker threads updating the same counter) and reader-writer synchronization (e.g., synchronization between the control threads and the lcore workers).
In the case where a data plane application has only one lcore worker thread, or there are multiple lcore workers but no overlap between which counters the different threads manipulate in parallel, there is no need for writer-writer synchronization. The only concern in this case is reader-writer synchronization, assuming there is a separate reader thread (e.g., a control thread). The reader problem can be solved in the same way as described in the section on Per Core Counters.
The different patterns described in this section assume multiple lcore workers with partly or fully overlapping statistics, and one or more separate reader threads.
Below are the data structure definition and declarations for the fictitious Example Protocol (EP). EP has packets and sessions, with counters kept both on a global level and on a per-session basis.
#define EP_MAX_SESSIONS (1000)
struct ep_session_stats
{
uint64_t bytes;
uint64_t pkts;
};
struct ep_stats
{
struct ep_session_stats sessions[EP_MAX_SESSIONS];
uint64_t total_bytes;
uint64_t total_pkts;
};
static struct ep_stats stats;
Needless to say, a real application will have more counters, organized into more struct types.
Per Core Counters¶
In DPDK, per-core data structures (usually in the form of nested C structs) are typically implemented by having as many instances of the struct as the maximum number of supported DPDK lcores (RTE_MAX_LCORE), kept in a static array. A DPDK application may reuse the same pattern, to good effect.
Since an EAL thread is almost always pinned to a particular CPU core dedicated to its use, the per-thread data structure effectively becomes a per-core data structure.
The DPDK lcore id, numbered from 0 to RTE_MAX_LCORE - 1, is then used as an index into the array of instances. The lcore id may be retrieved relatively cheaply with rte_lcore_id().
If the per-core data structures are large, it’s better to have an array of pointers, and only allocate as many instances as the actual lcore count, or allocate them dynamically on demand. This will reduce the amount of memory used, and allow allocation from huge page memory.
This scheme disallows modifying counters from an unregistered non-EAL thread, but does allow read operations from any thread (even threads that may be preempted).
A data plane application has the option of reusing this pattern for its own per-core data structures. It may also choose to use Thread Local Storage (TLS) directly, instead.
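A minimal sketch of the TLS alternative, using the per-lcore variable macros from rte_lcore.h/rte_per_lcore.h, is shown below. The function is a hypothetical TLS-based variant of the ep_stats_update() function used in the examples that follow. Note that with plain TLS, a separate reader thread cannot easily access another thread’s instance, so this variant fits best when the counters are aggregated or exported by the owning thread itself.

static RTE_DEFINE_PER_LCORE(struct ep_stats, tls_stats);

void
ep_stats_update(uint16_t session_id, uint64_t pkt_size)
{
    /* each thread transparently gets its own instance */
    struct ep_stats *stats = &RTE_PER_LCORE(tls_stats);

    stats->total_pkts++;
    stats->total_bytes += pkt_size;
    stats->sessions[session_id].pkts++;
    stats->sessions[session_id].bytes += pkt_size;
}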
The example code for this approach uses the same data model and very similar struct definition as Shared Non Synchronized Counters.
#define EP_MAX_SESSIONS (1000)
struct ep_session_stats
{
uint64_t bytes;
uint64_t pkts;
};
struct ep_stats
{
struct ep_session_stats sessions[EP_MAX_SESSIONS];
uint64_t total_bytes;
uint64_t total_pkts;
} __rte_cache_aligned;
static struct ep_stats lcore_stats[RTE_MAX_LCORE];
Note

The __rte_cache_aligned attribute is crucial from a performance perspective. If left out, the data structures for two different cores may wholly, or in part, reside on the same cache line (i.e., false sharing). If frequently-updated counters for two different cores are hosted by the same cache line, this shared cache line will partly defeat the purpose of using per-core data structures. False sharing does not affect correctness.
Arithmetic Operations¶
Since the statistics are duplicated across all lcores, no lcore-to-lcore writer synchronization is required.
static void
stats_add64(uint64_t *counter_value, uint64_t operand)
{
uint64_t new_value;
new_value = *counter_value + operand;
__atomic_store_n(counter_value, new_value, __ATOMIC_RELAXED);
}
void
ep_stats_update(uint16_t session_id, uint64_t pkt_size)
{
unsigned int lcore_id;
struct ep_stats *stats;
struct ep_session_stats *session_stats;
lcore_id = rte_lcore_id();
stats = &lcore_stats[lcore_id];
stats_add64(&stats->total_pkts, 1);
stats_add64(&stats->total_bytes, pkt_size);
session_stats = &stats->sessions[session_id];
stats_add64(&session_stats->pkts, 1);
stats_add64(&session_stats->bytes, pkt_size);
}
To guarantee that the counter is written atomically (e.g., to avoid a scenario where a 64-bit counter is moved from register to memory using four 16-bit store operations), an atomic store is used. Since there is only a single writer (i.e., an EAL thread or a registered non-EAL thread), the associated load need not be atomic. More importantly, the whole load + add + store sequence need not be atomic, and thus a comparatively expensive atomic add (e.g., __atomic_fetch_add()) is avoided.
A relaxed-memory model atomic store comes without any performance penalty on all DPDK-supported compilers and CPU architectures. See the section on reading below for more information on this topic.
Reading¶
To retrieve the value of a counter, a reader needs to sum over all instances of that counter, across all the per-core statistics structs.
A reader may choose to either iterate over all possible lcores (i.e., from 0 up to RTE_MAX_LCORE), or just those actually in use (e.g., using the RTE_LCORE_FOREACH macro from rte_lcore.h). The read-side operation is usually not very performance sensitive, so it makes sense to do whatever results in the cleanest code.
Here’s an example, using the per-counter accessor function design style:
static uint64_t
stats_get_lcore_total_bytes(unsigned int lcore_id)
{
struct ep_stats *stats = &lcore_stats[lcore_id];
return __atomic_load_n(&stats->total_bytes, __ATOMIC_RELAXED);
}
uint64_t
ep_stats_get_total_bytes(void)
{
unsigned int lcore_id;
uint64_t total_bytes = 0;
for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++)
total_bytes += stats_get_lcore_total_bytes(lcore_id);
return total_bytes;
}
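The same reader, written with the RTE_LCORE_FOREACH macro instead of iterating over all possible lcore ids, might look like the following sketch:

uint64_t
ep_stats_get_total_bytes(void)
{
    unsigned int lcore_id;
    uint64_t total_bytes = 0;

    /* iterate only over the lcores actually in use */
    RTE_LCORE_FOREACH(lcore_id)
        total_bytes += stats_get_lcore_total_bytes(lcore_id);

    return total_bytes;
}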
To guarantee that the counter is read atomically (e.g., so that a 64-bit counter is not read with two 32-bit loads), an atomic load should be used. Non-atomic loads (at the ISA level) from naturally aligned data are atomic on all contemporary architectures, and the compiler rarely has a reason to break up the load into several instructions. However, there is no reason, from a performance perspective, not to use an atomic load.
Atomic loads with the __ATOMIC_RELAXED memory model do not require memory barriers on any architecture. They do, however, allow loads of different variables to be reordered. Thus, if pkts is read before bytes in the program’s source, the compiler and/or the processor may choose to reorder those load operations, so that bytes is read before pkts. Usually this is not a problem for counters, but see the discussion on transactions and consistency.
In the unlikely case atomicity is violated, the results may be disastrous from a correctness point of view. For example, consider a 64-bit counter that currently has the value 4294967295 (FFFFFFFF in hexadecimal). Just as the counter is being read by some control plane thread, it’s also being incremented by one by the owning lcore worker thread. If the lcore worker’s store, or the control plane’s load operation, fails to be atomic, the reader may see the least significant 32 bits from the old value, and the most significant 32 bits from the new value. What the control plane thread will see is neither 4294967295, nor 4294967296, but 8589934591.
This phenomenon is known as load or store tearing.
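A small, purely illustrative snippet of how the torn value in the example above comes about; the masks merely mimic the effect of a non-atomic 64-bit access being split into two 32-bit halves:

uint64_t old_value = 4294967295; /* 0xFFFFFFFF */
uint64_t new_value = old_value + 1; /* 0x100000000 */

/* lower 32 bits from the old value, upper 32 bits from the new */
uint64_t torn = (new_value & UINT64_C(0xFFFFFFFF00000000)) |
    (old_value & UINT64_C(0xFFFFFFFF));

/* torn == 0x1FFFFFFFF == 8589934591 */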
Inter Counter Consistency¶
The atomic store used by the counter producer thread (e.g., the lcore worker) is only atomic for a particular counter; seen over a set of different counters, the statistics may be in a transient state of inconsistency for a short period of time.
For example, if a 1000-byte EP packet is processed, the reader may see a bytes counter where that packet is accounted for, but a pkts counter which is not yet updated. Similarly, it may read an updated total_bytes, but a not-yet-updated session-level bytes counter.
Provided the lcore worker thread is not preempted by the operating system (which should only very rarely happen in a correctly configured deployment), the time window during which the counters are inconsistent is likely to be very short indeed, but not zero. If the counters are updated at a very high rate, the risk of a reader seeing some inconsistencies might still be considerable.
The counter state will converge toward a consistent state. This is often enough, but for applications where it is not, and the efficiency of the per-core counter approach is still required, adding one or more sequence counters to protect the statistics data may be an option.
#define EP_MAX_SESSIONS (1000)
struct ep_session_stats
{
uint64_t bytes;
uint64_t pkts;
};
struct ep_stats
{
rte_seqcount_t sc;
struct ep_session_stats sessions[EP_MAX_SESSIONS];
uint64_t total_bytes;
uint64_t total_pkts;
} __rte_cache_aligned;
static struct ep_stats lcore_stats[RTE_MAX_LCORE];
void
ep_stats_update(uint16_t session_id, uint64_t pkt_size)
{
unsigned int lcore_id;
struct ep_stats *stats;
struct ep_session_stats *session_stats;
lcore_id = rte_lcore_id();
stats = &lcore_stats[lcore_id];
session_stats = &stats->sessions[session_id];
rte_seqcount_write_begin(&stats->sc);
stats->total_pkts++;
stats->total_bytes += pkt_size;
session_stats->pkts++;
session_stats->bytes += pkt_size;
rte_seqcount_write_end(&stats->sc);
}
In this solution, neither the loads nor the stores need to be atomic, since the sequence counter guarantees atomicity over the whole set of counters.
The introduction of a sequence counter will increase the statistics overhead, both on the reader and, more importantly, the writer side. The sequence counter requires incrementing a sequence number twice, and a number of memory barriers. On a Total Store Order (TSO) ISA (e.g., AMD64/IA-32), only an inexpensive compiler barrier is needed. On a weakly ordered CPU (e.g., ARM), actual barrier instructions are required.
The reader side will look something like:
static void
ep_stats_get_lcore_session_stats(unsigned int lcore_id, uint16_t session_id,
                                 struct ep_session_stats *result)
{
    struct ep_stats *stats = &lcore_stats[lcore_id];
    struct ep_session_stats *session_stats =
        &stats->sessions[session_id];
    uint32_t sn;

    do {
        sn = rte_seqcount_read_begin(&stats->sc);
        result->pkts = session_stats->pkts;
        result->bytes = session_stats->bytes;
    } while (rte_seqcount_read_retry(&stats->sc, sn));
}
This pattern represents an unorthodox use of a sequence counter, which is normally used to protect data that changes relatively infrequently. One issue that may occur is that the reader will have to retry many times (known as reader starvation), since the data keeps changing while it’s being read.
Another option to achieve inter counter consistency is to protect the per-core statistics structure, or structures, with one or more spinlocks. Since a writer would only contend with a reader for such a lock, the level of contention will be very low, and thus the overhead of acquiring the lock will be as well.
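A minimal sketch of such a spinlock-protected per-core variant (with the locks assumed to be initialized with rte_spinlock_init() at startup) could look like:

struct ep_stats
{
    rte_spinlock_t lock;
    struct ep_session_stats sessions[EP_MAX_SESSIONS];
    uint64_t total_bytes;
    uint64_t total_pkts;
} __rte_cache_aligned;

static struct ep_stats lcore_stats[RTE_MAX_LCORE];

void
ep_stats_update(uint16_t session_id, uint64_t pkt_size)
{
    struct ep_stats *stats = &lcore_stats[rte_lcore_id()];
    struct ep_session_stats *session_stats = &stats->sessions[session_id];

    /* the lock is taken once, covering all counter updates for the packet */
    rte_spinlock_lock(&stats->lock);
    stats->total_pkts++;
    stats->total_bytes += pkt_size;
    session_stats->pkts++;
    session_stats->bytes += pkt_size;
    rte_spinlock_unlock(&stats->lock);
}

A reader would acquire the same per-lcore lock before reading and summing the counter instances.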
Reset¶
A naive (and as we will see, broken) implementation of the reset would be to just write zero to the per-core counter values.
static void
stats_reset_lcore_total_bytes(unsigned int lcore_id)
{
struct ep_stats *stats = &lcore_stats[lcore_id];
__atomic_store_n(&stats->total_bytes, 0, __ATOMIC_RELAXED);
}
void
ep_stats_reset_total_bytes(void)
{
unsigned int lcore_id;
for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++)
stats_reset_lcore_total_bytes(lcore_id);
}
In the basic form of the per-core counter pattern, the memory location holding the state for a particular counter on a particular lcore id is only written to by the thread with that lcore id. With the introduction of the above reset routine, this is no longer true.
The updates performed by the lcore thread are done non-atomically in the sense that the whole read-modify-write cycle is not a single atomic operation. This may cause a counter just reset by another thread to be overwritten, and thus the reset request would be completely and permanently ignored.
One response to this issue may be to use an atomic add to increment counters. An atomic exchange operation could be used for the reset, allowing the previous counter value to be returned to the caller. However, an atomic add would significantly increase the counter update overhead.
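A brief sketch of what these atomic variants might look like, using GCC-style built-ins (the function names are hypothetical):

static void
stats_add64_atomic(uint64_t *counter_value, uint64_t operand)
{
    /* the whole read-modify-write cycle is a single atomic operation */
    __atomic_fetch_add(counter_value, operand, __ATOMIC_RELAXED);
}

static uint64_t
stats_reset64_atomic(uint64_t *counter_value)
{
    /* atomically set the counter to zero and return the old value */
    return __atomic_exchange_n(counter_value, 0, __ATOMIC_RELAXED);
}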
A more performant solution is to avoid having multiple writers by instead keeping an offset, tracking at which value a particular counter was last reset.
The offset could either be kept on a per-lcore id basis (in case the per-lcore id counter values are of relevance), or just a single offset for the aggregate counter value. The former will significantly increase counter-related working set size, compared to the latter (and even more so compared to a reset-free approach).
For simplicity, a per-core offset is maintained in the below example.
struct ep_counter
{
uint64_t count;
uint64_t offset;
};
static uint64_t
ep_counter_value(const struct ep_counter *counter)
{
uint64_t count;
uint64_t offset;
count = __atomic_load_n(&counter->count,
__ATOMIC_RELAXED);
offset = __atomic_load_n(&counter->offset,
__ATOMIC_RELAXED);
/* In case a writer thread does a counter update and a
 * reset in quick succession, it is possible for a reader
 * thread to see the offset update before the count update
 * (due to compiler- or CPU-level reordering), resulting in
* offset > count. A 'count' wrap would have the same
* effect, but must have a different solution.
*/
if (unlikely(offset > count))
return 0;
return count - offset;
}
static void
ep_counter_reset(struct ep_counter *counter)
{
uint64_t count;
count = __atomic_load_n(&counter->count, __ATOMIC_RELAXED);
__atomic_store_n(&counter->offset, count, __ATOMIC_RELAXED);
}
struct ep_session_stats
{
struct ep_counter bytes;
struct ep_counter pkts;
};
struct ep_stats
{
struct ep_session_stats sessions[EP_MAX_SESSIONS];
struct ep_counter total_bytes;
struct ep_counter total_pkts;
} __rte_cache_aligned;
static uint64_t
stats_get_lcore_total_bytes(unsigned int lcore_id)
{
struct ep_stats *stats = &lcore_stats[lcore_id];
return ep_counter_value(&stats->total_bytes);
}
uint64_t
ep_stats_get_total_bytes(void)
{
unsigned int lcore_id;
uint64_t total_bytes = 0;
for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++)
total_bytes += stats_get_lcore_total_bytes(lcore_id);
return total_bytes;
}
static void
stats_reset_lcore_total_bytes(unsigned int lcore_id)
{
struct ep_stats *stats = &lcore_stats[lcore_id];
ep_counter_reset(&stats->total_bytes);
}
void
ep_stats_reset_total_bytes(void)
{
unsigned int lcore_id;
for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++)
stats_reset_lcore_total_bytes(lcore_id);
}
The function to increment a counter would not be affected, and would just continue to add to a monotonically increasing count field.
Care must be taken to handle counter wraps. In some cases, it may be possible to instead use a large-enough data type that can’t reasonably overflow.
Counter reset adds a fair amount of complexity, and could reasonably be considered the responsibility of the control plane, and thus left out.
Performance¶
Keeping per-core instances of statistics data structures is generally the most CPU cycle-efficient way to implement counters.
Adding a sequence counter or a spinlock to improve consistency adds overhead, but from what the micro benchmarks suggest, it is very small. [6]
The largest threat to the viability of the Per Core Counters approach is space efficiency. The amount of statistics memory grows linearly with the lcore worker count. This is not an issue in cases where the amount of statistics is relatively small. For systems which support hundreds of thousands or even millions of flows, keep per-flow counters, and migrate flows across different lcore workers, this approach may prevent the fast path from utilizing many CPU cores, especially on memory-constrained systems.
Performance Comparison¶
The relative performance of the different approaches to counter implementation varies with a number of factors.
Counter update frequency, which in turn usually depends on the per-packet processing latency.
Counter working set size (i.e., the number of counters modified).
The amount of overlap between two or more cores’ counter working set.
Worker core count.
CPU implementation details (e.g., memory model and cache latency).
The benchmarking application simulates a fairly low-touch data plane application, spending ~1000 clock cycles/packet on domain logic processing (thus excluding packet I/O). No actual packets are sent, and the clock cycles are spent on dummy calculations, not using any memory. The numbers measured represent how much the application is slowed down when statistics are added. Latency is specified as an average over all counter add operations.
The counter implementations in the benchmark are identical to the examples.
The application modifies two global counters and two flow-related counters per packet. How many counters are incremented per packet, and how many of these are global as opposed to per-flow (or the equivalent), varies wildly between applications. This benchmark is at the low end of counter usage.
In the benchmark, load balancing of packets over cores works in such a way that packets pertaining to a particular flow hit the same core, unless the flow is migrated to another core. Migrations happen very rarely. The DSW Event Device is the work scheduler used to distribute the packets. This means there is almost no contention for the cache lines hosting the two flow-related counters for each of the 1024 flows in the test.
A real application would likely make heavy use of the level 1 and level 2 CPU caches, and thus the below numbers are an underestimation of the actual overhead.
The counters are updated all in one go. In a real application however, different counters (or sets of counters) will be incremented at different stages of processing. This makes a difference for the sequence counter and spinlock-based approaches: in the benchmark, the overhead of locking/unlocking is amortized over four counter updates (i.e., the lock is only taken once). In some applications, the lock may need to be acquired for every counter update, and thus many times for a single packet.
System under Test Hardware¶
The “Cascade Lake Xeon” is a server with a 20-core Intel Xeon Gold 6230N CPU. To improve determinism, the Intel Turbo function is disabled, and all CPU cores run at the nominal clock frequency of 2.3 GHz. The compiler used is GCC 10.3.0.
The “BCM2711 ARM A72” is a Raspberry Pi 4. It is equipped with a Broadcom BCM2711 SoC, with four ARM Cortex-A72 cores operating at 1.5 GHz. The code is compiled with GCC 9.3.0.
As expected, the variant based on sequence counter synchronization suffers somewhat from the weakly ordered memory model’s requirement for barrier instructions. Surprisingly, on the BCM2711, the per-core spinlock variant performs as well as the per-core variant which only uses atomic stores - a fact which the author finds difficult to explain.
DPDK 22.07 was used for both systems.
In both the Raspberry Pi and the Xeon server case, the test application and DPDK were compiled with -O3 -march=native.
Benchmark Results¶
The counter update overhead is expressed in CPU core cycles.
CPU | Worker Core Count | Shared No Sync | Shared Spinlock | Shared Atomics | Shared Buffered | Per Core | Per Core Spinlock | Per Core Seqcount
---|---|---|---|---|---|---|---|---
Cascade Lake Xeon | 4 | 27 | 436 | 56 | 18 | 5 | 7 | 4
Cascade Lake Xeon | 20 | 273 | 8236 | 626 | 18 | 4 | 7 | 4
BCM2711 ARM A72 | 4 | 26 | 36 | 85 | 24 | 4 | 4 | 8
The different benchmark programs implement the various patterns described earlier in this chapter, in a manner very similar to the example code.
Test case name | Description
---|---
Shared No Sync | The approach described in Shared Non Synchronized Counters
Shared Spinlock | The approach described in Shared Lock Protected Counters
Shared Buffered | The approach described in Shared Counters with Per Core Buffering
Per Core | The atomic store variant described in Per Core Counters
Per Core Spinlock | The spinlock-protected per core statistics struct variant described in Per Core Counters
Per Core Seqcount | The sequence counter-protected per core statistics struct variant described in Per Core Counters
Device Statistics¶
Metrics Library¶
Telemetry Library¶
Footnotes