The destruction of the socket and all associated resources
is done by a simple call to close(fd).
+As without PACKET_MMAP, a single socket can be used for both capture
+and transmission. This is done by mapping the allocated RX and TX
+buffer rings with a single mmap() call.
+See "Mapping and use of the circular buffer (ring)".
+
Next I will describe PACKET_MMAP settings and its constraints,
also the mapping of the circular buffer in the user process and
the use of this buffer.
the frames. This is because a frame cannot be spawn across two
blocks.
+To use one socket for capture and transmission, both the RX and TX
+buffer rings have to be mapped with one call to mmap:
+
+ ...
+ setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &foo, sizeof(foo));
+ setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &bar, sizeof(bar));
+ ...
+ rx_ring = mmap(0, size * 2, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+ tx_ring = rx_ring + size;
+
+The RX ring must come first, as the kernel maps the TX ring memory right
+after the RX one.
+
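The snippet above uses size * 2, which assumes both rings have the same size. As a rough sketch of the general case, the helper below computes the length to pass to mmap() and the offset of the TX ring, assuming each ring occupies tp_block_size * tp_block_nr bytes (the same product the TPACKET_V3 example passes to mmap()). The names ring_geometry, ring_layout and combined_layout are hypothetical and not part of the kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: the two tpacket_req fields that determine ring size. */
struct ring_geometry {
	unsigned int block_size;	/* mirrors tp_block_size */
	unsigned int block_nr;		/* mirrors tp_block_nr */
};

/* Hypothetical: where each ring sits inside the single mapping. */
struct ring_layout {
	size_t rx_off;	/* always 0, RX is mapped first */
	size_t tx_off;	/* TX starts right after the RX ring */
	size_t map_len;	/* length argument for the one mmap() call */
};

/* Assumption: a ring occupies block_size * block_nr bytes. */
static size_t ring_bytes(const struct ring_geometry *g)
{
	assert(g != NULL);
	return (size_t) g->block_size * g->block_nr;
}

static struct ring_layout combined_layout(const struct ring_geometry *rx,
					  const struct ring_geometry *tx)
{
	struct ring_layout l;

	l.rx_off = 0;
	l.tx_off = ring_bytes(rx);
	l.map_len = l.tx_off + ring_bytes(tx);
	return l;
}
```

With such a layout, the mapping would be done as rx_ring = mmap(0, l.map_len, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0), and the TX ring would start at (uint8_t *) rx_ring + l.tx_off.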
At the beginning of each frame there is an status field (see
struct tpacket_hdr). If this field is 0 means that the frame is ready
to be used for the kernel, If not, there is a frame the user can read
+++ Capture process:
from include/linux/if_packet.h
- #define TP_STATUS_COPY 2
- #define TP_STATUS_LOSING 4
- #define TP_STATUS_CSUMNOTREADY 8
+ #define TP_STATUS_COPY (1 << 1)
+ #define TP_STATUS_LOSING (1 << 2)
+ #define TP_STATUS_CSUMNOTREADY (1 << 3)
+ #define TP_STATUS_CSUM_VALID (1 << 7)
TP_STATUS_COPY : This flag indicates that the frame (and associated
meta information) has been truncated because it's
enabled previously with setsockopt() and
the PACKET_COPY_THRESH option.
- The number of frames than can be buffered to
+ The number of frames that can be buffered to
be read with recvfrom is limited like a normal socket.
See the SO_RCVBUF option in the socket (7) man page.
reading the packet we should not try to check the
checksum.
+TP_STATUS_CSUM_VALID : This flag indicates that at least the transport
+ header checksum of the packet has already been
+ validated on the kernel side. If the flag is not set,
+ we are free to verify the checksum ourselves,
+ provided that TP_STATUS_CSUMNOTREADY is also not set.
+
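The interplay of the two checksum flags can be sketched as a small decision helper. This is only an illustration: the enum and the function name csum_action_for are hypothetical, and the flag values are copied from the defines quoted above as fallbacks for when the header is not included. Checking TP_STATUS_CSUM_VALID first is an arbitrary choice for the case where both bits are set, which the text does not describe.

```c
#include <assert.h>
#include <stdint.h>

/* Fallback values, taken from the defines quoted in this document. */
#ifndef TP_STATUS_CSUMNOTREADY
# define TP_STATUS_CSUMNOTREADY	(1 << 3)
#endif
#ifndef TP_STATUS_CSUM_VALID
# define TP_STATUS_CSUM_VALID	(1 << 7)
#endif

static_assert((TP_STATUS_CSUM_VALID & TP_STATUS_CSUMNOTREADY) == 0,
	      "checksum flags must be distinct bits");

/* Hypothetical result type for this sketch. */
enum csum_action {
	CSUM_TRUSTED,	/* kernel already validated the transport checksum */
	CSUM_SKIP,	/* checksum not ready, do not try to check it */
	CSUM_VERIFY	/* user space is free to verify the checksum */
};

static enum csum_action csum_action_for(uint32_t tp_status)
{
	if (tp_status & TP_STATUS_CSUM_VALID)
		return CSUM_TRUSTED;
	if (tp_status & TP_STATUS_CSUMNOTREADY)
		return CSUM_SKIP;
	return CSUM_VERIFY;
}
```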
for convenience there are also the following defines:
#define TP_STATUS_KERNEL 0
TPACKET_V1:
- Default if not otherwise specified by setsockopt(2)
- RX_RING, TX_RING available
- - VLAN metadata information available for packets
- (TP_STATUS_VLAN_VALID)
TPACKET_V1 --> TPACKET_V2:
- Made 64 bit clean due to unsigned long usage in TPACKET_V1
userspace and the like
- Timestamp resolution in nanoseconds instead of microseconds
- RX_RING, TX_RING available
+ - VLAN metadata information available for packets
+ (TP_STATUS_VLAN_VALID, TP_STATUS_VLAN_TPID_VALID),
+ in the tpacket2_hdr structure:
+ - TP_STATUS_VLAN_VALID bit being set into the tp_status field indicates
+ that the tp_vlan_tci field has valid VLAN TCI value
+ - TP_STATUS_VLAN_TPID_VALID bit being set into the tp_status field
+ indicates that the tp_vlan_tpid field has valid VLAN TPID value
- How to switch to TPACKET_V2:
1. Replace struct tpacket_hdr by struct tpacket2_hdr
2. Query header len and save
In the AF_PACKET fanout mode, packet reception can be load balanced among
processes. This also works in combination with mmap(2) on packet sockets.
+Currently implemented fanout policies are:
+
+ - PACKET_FANOUT_HASH: schedule to socket by skb's packet hash
+ - PACKET_FANOUT_LB: schedule to socket by round-robin
+ - PACKET_FANOUT_CPU: schedule to socket by the CPU the packet arrives on
+ - PACKET_FANOUT_RND: schedule to socket by random selection
+ - PACKET_FANOUT_ROLLOVER: if one socket is full, roll over to another
+ - PACKET_FANOUT_QM: schedule to socket by skb's recorded queue_mapping
+
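A policy is selected through the integer passed to the PACKET_FANOUT socket option, where the lower 16 bits carry the fanout group id and the upper 16 bits the policy. The helper below is a minimal sketch of that encoding. The function name fanout_arg is hypothetical, and the numeric policy values are fallbacks assumed to match linux/if_packet.h for when that header is not included.

```c
#include <assert.h>
#include <stdint.h>

/* Fallback values, assumed to match linux/if_packet.h. */
#ifndef PACKET_FANOUT_HASH
# define PACKET_FANOUT_HASH	0
# define PACKET_FANOUT_LB	1
# define PACKET_FANOUT_CPU	2
# define PACKET_FANOUT_ROLLOVER	3
# define PACKET_FANOUT_RND	4
# define PACKET_FANOUT_QM	5
#endif

static_assert(sizeof(int) >= 4, "fanout argument needs 32 bits");

/* Lower 16 bits: fanout group id. Upper 16 bits: fanout policy. */
static int fanout_arg(uint16_t group_id, uint16_t policy)
{
	return (int) ((uint32_t) group_id | ((uint32_t) policy << 16));
}
```

The result would then be passed as setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &arg, sizeof(arg)) on every socket that should join the group.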
Minimal example code by David S. Miller (try things like "./test eth0 hash",
"./test eth0 lb", etc.):
Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile
it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.):
+/* Written from scratch, but kernel-to-user space API usage
+ * dissected from lolpcap:
+ * Copyright 2011, Chetan Loke <loke.chetan@gmail.com>
+ * License: GPL, version 2.0
+ */
+
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
-#define BLOCK_SIZE (1 << 22)
-#define FRAME_SIZE 2048
-
-#define NUM_BLOCKS 64
-#define NUM_FRAMES ((BLOCK_SIZE * NUM_BLOCKS) / FRAME_SIZE)
-
-#define BLOCK_RETIRE_TOV_IN_MS 64
-#define BLOCK_PRIV_AREA_SZ 13
-
-#define ALIGN_8(x) (((x) + 8 - 1) & ~(8 - 1))
-
-#define BLOCK_STATUS(x) ((x)->h1.block_status)
-#define BLOCK_NUM_PKTS(x) ((x)->h1.num_pkts)
-#define BLOCK_O2FP(x) ((x)->h1.offset_to_first_pkt)
-#define BLOCK_LEN(x) ((x)->h1.blk_len)
-#define BLOCK_SNUM(x) ((x)->h1.seq_num)
-#define BLOCK_O2PRIV(x) ((x)->offset_to_priv)
-#define BLOCK_PRIV(x) ((void *) ((uint8_t *) (x) + BLOCK_O2PRIV(x)))
-#define BLOCK_HDR_LEN (ALIGN_8(sizeof(struct block_desc)))
-#define BLOCK_PLUS_PRIV(sz_pri) (BLOCK_HDR_LEN + ALIGN_8((sz_pri)))
-
#ifndef likely
# define likely(x) __builtin_expect(!!(x), 1)
#endif
static unsigned long packets_total = 0, bytes_total = 0;
static sig_atomic_t sigint = 0;
-void sighandler(int num)
+static void sighandler(int num)
{
sigint = 1;
}
{
int err, i, fd, v = TPACKET_V3;
struct sockaddr_ll ll;
+ unsigned int blocksiz = 1 << 22, framesiz = 1 << 11;
+ unsigned int blocknum = 64;
fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
if (fd < 0) {
}
memset(&ring->req, 0, sizeof(ring->req));
- ring->req.tp_block_size = BLOCK_SIZE;
- ring->req.tp_frame_size = FRAME_SIZE;
- ring->req.tp_block_nr = NUM_BLOCKS;
- ring->req.tp_frame_nr = NUM_FRAMES;
- ring->req.tp_retire_blk_tov = BLOCK_RETIRE_TOV_IN_MS;
- ring->req.tp_sizeof_priv = BLOCK_PRIV_AREA_SZ;
- ring->req.tp_feature_req_word |= TP_FT_REQ_FILL_RXHASH;
+ ring->req.tp_block_size = blocksiz;
+ ring->req.tp_frame_size = framesiz;
+ ring->req.tp_block_nr = blocknum;
+ ring->req.tp_frame_nr = (blocksiz * blocknum) / framesiz;
+ ring->req.tp_retire_blk_tov = 60;
+ ring->req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH;
err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req,
sizeof(ring->req));
}
ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr,
- PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
- fd, 0);
+ PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0);
if (ring->map == MAP_FAILED) {
perror("mmap");
exit(1);
return fd;
}
-#ifdef __checked
-static uint64_t prev_block_seq_num = 0;
-
-void assert_block_seq_num(struct block_desc *pbd)
-{
- if (unlikely(prev_block_seq_num + 1 != BLOCK_SNUM(pbd))) {
- printf("prev_block_seq_num:%"PRIu64", expected seq:%"PRIu64" != "
- "actual seq:%"PRIu64"\n", prev_block_seq_num,
- prev_block_seq_num + 1, (uint64_t) BLOCK_SNUM(pbd));
- exit(1);
- }
-
- prev_block_seq_num = BLOCK_SNUM(pbd);
-}
-
-static void assert_block_len(struct block_desc *pbd, uint32_t bytes, int block_num)
-{
- if (BLOCK_NUM_PKTS(pbd)) {
- if (unlikely(bytes != BLOCK_LEN(pbd))) {
- printf("block:%u with %upackets, expected len:%u != actual len:%u\n",
- block_num, BLOCK_NUM_PKTS(pbd), bytes, BLOCK_LEN(pbd));
- exit(1);
- }
- } else {
- if (unlikely(BLOCK_LEN(pbd) != BLOCK_PLUS_PRIV(BLOCK_PRIV_AREA_SZ))) {
- printf("block:%u, expected len:%lu != actual len:%u\n",
- block_num, BLOCK_HDR_LEN, BLOCK_LEN(pbd));
- exit(1);
- }
- }
-}
-
-static void assert_block_header(struct block_desc *pbd, const int block_num)
-{
- uint32_t block_status = BLOCK_STATUS(pbd);
-
- if (unlikely((block_status & TP_STATUS_USER) == 0)) {
- printf("block:%u, not in TP_STATUS_USER\n", block_num);
- exit(1);
- }
-
- assert_block_seq_num(pbd);
-}
-#else
-static inline void assert_block_header(struct block_desc *pbd, const int block_num)
-{
-}
-static void assert_block_len(struct block_desc *pbd, uint32_t bytes, int block_num)
-{
-}
-#endif
-
static void display(struct tpacket3_hdr *ppd)
{
struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac);
static void walk_block(struct block_desc *pbd, const int block_num)
{
- int num_pkts = BLOCK_NUM_PKTS(pbd), i;
+ int num_pkts = pbd->h1.num_pkts, i;
unsigned long bytes = 0;
- unsigned long bytes_with_padding = BLOCK_PLUS_PRIV(BLOCK_PRIV_AREA_SZ);
struct tpacket3_hdr *ppd;
- assert_block_header(pbd, block_num);
-
- ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd + BLOCK_O2FP(pbd));
+ ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd +
+ pbd->h1.offset_to_first_pkt);
for (i = 0; i < num_pkts; ++i) {
bytes += ppd->tp_snaplen;
- if (ppd->tp_next_offset)
- bytes_with_padding += ppd->tp_next_offset;
- else
- bytes_with_padding += ALIGN_8(ppd->tp_snaplen + ppd->tp_mac);
-
display(ppd);
- ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd + ppd->tp_next_offset);
- __sync_synchronize();
+ ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd +
+ ppd->tp_next_offset);
}
- assert_block_len(pbd, bytes_with_padding, block_num);
-
packets_total += num_pkts;
bytes_total += bytes;
}
-void flush_block(struct block_desc *pbd)
+static void flush_block(struct block_desc *pbd)
{
- BLOCK_STATUS(pbd) = TP_STATUS_KERNEL;
- __sync_synchronize();
+ pbd->h1.block_status = TP_STATUS_KERNEL;
}
static void teardown_socket(struct ring *ring, int fd)
socklen_t len;
struct ring ring;
struct pollfd pfd;
- unsigned int block_num = 0;
+ unsigned int block_num = 0, blocks = 64;
struct block_desc *pbd;
struct tpacket_stats_v3 stats;
while (likely(!sigint)) {
pbd = (struct block_desc *) ring.rd[block_num].iov_base;
-retry_block:
- if ((BLOCK_STATUS(pbd) & TP_STATUS_USER) == 0) {
+
+ if ((pbd->h1.block_status & TP_STATUS_USER) == 0) {
poll(&pfd, 1, -1);
- goto retry_block;
+ continue;
}
walk_block(pbd, block_num);
flush_block(pbd);
- block_num = (block_num + 1) % NUM_BLOCKS;
+ block_num = (block_num + 1) % blocks;
}
len = sizeof(stats);
return 0;
}
+-------------------------------------------------------------------------------
++ PACKET_QDISC_BYPASS
+-------------------------------------------------------------------------------
+
+If there is a requirement to load the network with many packets in a fashion
+similar to pktgen, you might set the following option after socket
+creation:
+
+ int one = 1;
+ setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one));
+
+This has the side effect that packets sent through PF_PACKET bypass the
+kernel's qdisc layer and are forcibly pushed to the driver directly. This
+means that packets are not buffered, tc disciplines are ignored, increased
+loss can occur, and such packets are no longer visible to other PF_PACKET
+sockets. So, you have been warned; generally, this can be useful for stress
+testing various components of a system.
+
+By default, PACKET_QDISC_BYPASS is disabled and needs to be explicitly enabled
+on PF_PACKET sockets.
+
-------------------------------------------------------------------------------
+ PACKET_TIMESTAMP
-------------------------------------------------------------------------------
of hardware timestamps with SIOCSHWTSTAMP (see related information from
Documentation/networking/timestamping.txt).
-PACKET_TIMESTAMP accepts the same integer bit field as
-SO_TIMESTAMPING. However, only the SOF_TIMESTAMPING_SYS_HARDWARE
-and SOF_TIMESTAMPING_RAW_HARDWARE values are recognized by
-PACKET_TIMESTAMP. SOF_TIMESTAMPING_SYS_HARDWARE takes precedence over
-SOF_TIMESTAMPING_RAW_HARDWARE if both bits are set.
+PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING:
- int req = 0;
- req |= SOF_TIMESTAMPING_SYS_HARDWARE;
+ int req = SOF_TIMESTAMPING_RAW_HARDWARE;
setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req))
For the mmap(2)ed ring buffers, such timestamps are stored in the
what kind of timestamp has been reported, the tp_status field is binary |'ed
with the following possible bits ...
- TP_STATUS_TS_SYS_HARDWARE
TP_STATUS_TS_RAW_HARDWARE
TP_STATUS_TS_SOFTWARE
... that are equivalent to its SOF_TIMESTAMPING_* counterparts. For the
-RX_RING, if none of those 3 are set (i.e. PACKET_TIMESTAMP is not set),
-then this means that a software fallback was invoked *within* PF_PACKET's
-processing code (less precise).
+RX_RING, if neither is set (i.e. PACKET_TIMESTAMP is not set), then a
+software fallback was invoked *within* PF_PACKET's processing code (less
+precise).
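For the RX_RING, the rule above could be sketched as a small classifier over tp_status. The enum and the function name ts_source_of are hypothetical, and the bit values are fallbacks assumed to match linux/if_packet.h for when that header is not included. TP_STATUS_TS_SYS_HARDWARE is deliberately not checked, since this change drops it from the list.

```c
#include <assert.h>
#include <stdint.h>

/* Fallback values, assumed to match linux/if_packet.h. */
#ifndef TP_STATUS_TS_SOFTWARE
# define TP_STATUS_TS_SOFTWARE		(1U << 29)
#endif
#ifndef TP_STATUS_TS_RAW_HARDWARE
# define TP_STATUS_TS_RAW_HARDWARE	(1U << 31)
#endif

static_assert((TP_STATUS_TS_SOFTWARE & TP_STATUS_TS_RAW_HARDWARE) == 0,
	      "timestamp flags must be distinct bits");

/* Hypothetical result type for this sketch. */
enum ts_source {
	TS_FALLBACK,		/* neither bit set: PF_PACKET software fallback */
	TS_SOFTWARE,		/* TP_STATUS_TS_SOFTWARE */
	TS_RAW_HARDWARE		/* TP_STATUS_TS_RAW_HARDWARE */
};

static enum ts_source ts_source_of(uint32_t tp_status)
{
	if (tp_status & TP_STATUS_TS_RAW_HARDWARE)
		return TS_RAW_HARDWARE;
	if (tp_status & TP_STATUS_TS_SOFTWARE)
		return TS_SOFTWARE;
	return TS_FALLBACK;
}
```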
Getting timestamps for the TX_RING works as follows: i) fill the ring frames,
ii) call sendto() e.g. in blocking mode, iii) wait for status of relevant