--- /dev/null
+Virtual Routing and Forwarding (VRF)
+====================================
+The VRF device combined with ip rules provides the ability to create virtual
+routing and forwarding domains (aka VRFs, VRF-lite to be specific) in the
+Linux network stack. One use case is the multi-tenancy problem where each
+tenant has their own unique routing tables and in the very least need
+different default gateways.
+
+Processes can be "VRF aware" by binding a socket to the VRF device. Packets
+through the socket then use the routing table associated with the VRF
+device. An important feature of the VRF device implementation is that it
+impacts only Layer 3 and above so L2 tools (e.g., LLDP) are not affected
+(ie., they do not need to be run in each VRF). The design also allows
+the use of higher priority ip rules (Policy Based Routing, PBR) to take
+precedence over the VRF device rules directing specific traffic as desired.
+
+In addition, VRF devices allow VRFs to be nested within namespaces. For
+example network namespaces provide separation of network interfaces at L1
+(Layer 1 separation), VLANs on the interfaces within a namespace provide
+L2 separation and then VRF devices provide L3 separation.
+
+Design
+------
+A VRF device is created with an associated route table. Network interfaces
+are then enslaved to a VRF device:
+
+ +-----------------------------+
+ | vrf-blue | ===> route table 10
+ +-----------------------------+
+ | | |
+ +------+ +------+ +-------------+
+ | eth1 | | eth2 | ... | bond1 |
+ +------+ +------+ +-------------+
+ | |
+ +------+ +------+
+ | eth8 | | eth9 |
+ +------+ +------+
+
+Packets received on an enslaved device and are switched to the VRF device
+using an rx_handler which gives the impression that packets flow through
+the VRF device. Similarly on egress routing rules are used to send packets
+to the VRF device driver before getting sent out the actual interface. This
+allows tcpdump on a VRF device to capture all packets into and out of the
+VRF as a whole.[1] Similiarly, netfilter [2] and tc rules can be applied
+using the VRF device to specify rules that apply to the VRF domain as a whole.
+
+[1] Packets in the forwarded state do not flow through the device, so those
+ packets are not seen by tcpdump. Will revisit this limitation in a
+ future release.
+
+[2] Iptables on ingress is limited to NF_INET_PRE_ROUTING only with skb->dev
+ set to real ingress device and egress is limited to NF_INET_POST_ROUTING.
+ Will revisit this limitation in a future release.
+
+
+Setup
+-----
+1. VRF device is created with an association to a FIB table.
+ e.g, ip link add vrf-blue type vrf table 10
+ ip link set dev vrf-blue up
+
+2. Rules are added that send lookups to the associated FIB table when the
+ iif or oif is the VRF device. e.g.,
+ ip ru add oif vrf-blue table 10
+ ip ru add iif vrf-blue table 10
+
+ Set the default route for the table (and hence default route for the VRF).
+ e.g, ip route add table 10 prohibit default
+
+3. Enslave L3 interfaces to a VRF device.
+ e.g, ip link set dev eth1 master vrf-blue
+
+ Local and connected routes for enslaved devices are automatically moved to
+ the table associated with VRF device. Any additional routes depending on
+ the enslaved device will need to be reinserted following the enslavement.
+
+4. Additional VRF routes are added to associated table.
+ e.g., ip route add table 10 ...
+
+
+Applications
+------------
+Applications that are to work within a VRF need to bind their socket to the
+VRF device:
+
+ setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1);
+
+or to specify the output device using cmsg and IP_PKTINFO.
+
+
+Limitations
+-----------
+VRF device currently only works for IPv4. Support for IPv6 is under development.
+
+Index of original ingress interface is not available via cmsg. Will address
+soon.
--------------
The default queuing discipline to use for network devices. This allows
-overriding the default queue discipline of pfifo_fast with an
-alternative. Since the default queuing discipline is created with the
-no additional parameters so is best suited to queuing disciplines that
-work well without configuration like stochastic fair queue (sfq),
-CoDel (codel) or fair queue CoDel (fq_codel). Don't use queuing disciplines
-like Hierarchical Token Bucket or Deficit Round Robin which require setting
-up classes and bandwidths.
+overriding the default of pfifo_fast with an alternative. Since the default
+queuing discipline is created without additional parameters so is best suited
+to queuing disciplines that work well without configuration like stochastic
+fair queue (sfq), CoDel (codel) or fair queue CoDel (fq_codel). Don't use
+queuing disciplines like Hierarchical Token Bucket or Deficit Round Robin
+which require setting up classes and bandwidths. Note that physical multiqueue
+interfaces still use mq as root qdisc, which in turn uses this default for its
+leaves. Virtual devices (like e.g. lo or veth) ignore this setting and instead
+default to noqueue.
Default: pfifo_fast
busy_read
S: Maintained
F: drivers/net/vrf.c
F: include/net/vrf.h
+F: Documentation/networking/vrf.txt
VT1211 HARDWARE MONITOR DRIVER
M: Juerg Haefliger <juergh@gmail.com>
continue;
}
- skb = alloc_skb(size + 1, GFP_ATOMIC);
+ /* Use netdev_alloc_skb() because it adds NET_SKB_PAD of
+ * headroom, and ensures we can route packets back out an
+ * Ethernet interface (for example) without having to
+ * reallocate. Adding NET_IP_ALIGN also ensures that both
+ * PPPoATM and PPPoEoBR2684 packets end up aligned. */
+ skb = netdev_alloc_skb_ip_align(NULL, size + 1);
if (!skb) {
if (net_ratelimit())
dev_warn(&card->dev->dev, "Failed to allocate sk_buff for RX\n");
/* Allocate RX skbs for any ports which need them */
if (card->using_dma && card->atmdev[port] &&
!card->rx_skb[port]) {
- struct sk_buff *skb = alloc_skb(RX_DMA_SIZE, GFP_ATOMIC);
+ /* Unlike the MMIO case (qv) we can't add NET_IP_ALIGN
+ * here; the FPGA can only DMA to addresses which are
+ * aligned to 4 bytes. */
+ struct sk_buff *skb = dev_alloc_skb(RX_DMA_SIZE);
if (skb) {
SKB_CB(skb)->dma_addr =
dma_map_single(&card->dev->dev, skb->data,
CH_PCI_ID_TABLE_FENTRY(0x5090), /* Custom T540-CR */
CH_PCI_ID_TABLE_FENTRY(0x5091), /* Custom T522-CR */
CH_PCI_ID_TABLE_FENTRY(0x5092), /* Custom T520-CR */
+ CH_PCI_ID_TABLE_FENTRY(0x5093), /* Custom T580-LP-CR */
+ CH_PCI_ID_TABLE_FENTRY(0x5094), /* Custom T540-CR */
+ CH_PCI_ID_TABLE_FENTRY(0x5095), /* Custom T540-CR-SO */
+ CH_PCI_ID_TABLE_FENTRY(0x5096), /* Custom T580-CR */
+ CH_PCI_ID_TABLE_FENTRY(0x5097), /* Custom T520-KR */
/* T6 adapters:
*/
rss_context->hash_fn = MLX4_RSS_HASH_TOP;
memcpy(rss_context->rss_key, priv->rss_key,
MLX4_EN_RSS_KEY_SIZE);
- netdev_rss_key_fill(rss_context->rss_key,
- MLX4_EN_RSS_KEY_SIZE);
} else {
en_err(priv, "Unknown RSS hash function requested\n");
err = -EINVAL;
{ .compatible = "micrel,ks8851" },
{ }
};
+MODULE_DEVICE_TABLE(of, ks8851_match_table);
static struct spi_driver ks8851_driver = {
.driver = {
.flowi4_oif = vrf_dev->ifindex,
.flowi4_iif = LOOPBACK_IFINDEX,
.flowi4_tos = RT_TOS(ip4h->tos),
- .flowi4_flags = FLOWI_FLAG_ANYSRC | FLOWI_FLAG_VRFSRC,
+ .flowi4_flags = FLOWI_FLAG_ANYSRC | FLOWI_FLAG_VRFSRC |
+ FLOWI_FLAG_SKIP_NH_OIF,
.daddr = ip4h->daddr,
};
#define FLOWI_FLAG_ANYSRC 0x01
#define FLOWI_FLAG_KNOWN_NH 0x02
#define FLOWI_FLAG_VRFSRC 0x04
+#define FLOWI_FLAG_SKIP_NH_OIF 0x08
__u32 flowic_secid;
struct flowi_tunnel flowic_tun_key;
};
struct nl_info *info, struct mx6_config *mxc);
int fib6_del(struct rt6_info *rt, struct nl_info *info);
-void inet6_rt_notify(int event, struct rt6_info *rt, struct nl_info *info);
+void inet6_rt_notify(int event, struct rt6_info *rt, struct nl_info *info,
+ unsigned int flags);
void fib6_run_gc(unsigned long expires, struct net *net, bool force);
flow_flags |= FLOWI_FLAG_ANYSRC;
if (netif_index_is_vrf(sock_net(sk), oif))
- flow_flags |= FLOWI_FLAG_VRFSRC;
+ flow_flags |= FLOWI_FLAG_VRFSRC | FLOWI_FLAG_SKIP_NH_OIF;
flowi4_init_output(fl4, oif, sk->sk_mark, tos, RT_SCOPE_UNIVERSE,
protocol, flow_flags, dst, src, dport, sport);
static int clip_encap(struct atm_vcc *vcc, int mode)
{
+ if (!CLIP_VCC(vcc))
+ return -EBADFD;
+
CLIP_VCC(vcc)->encap = mode;
return 0;
}
nh->nh_flags & RTNH_F_LINKDOWN &&
!(fib_flags & FIB_LOOKUP_IGNORE_LINKSTATE))
continue;
- if (!(flp->flowi4_flags & FLOWI_FLAG_VRFSRC)) {
+ if (!(flp->flowi4_flags & FLOWI_FLAG_SKIP_NH_OIF)) {
if (flp->flowi4_oif &&
flp->flowi4_oif != nh->nh_oif)
continue;
if (netif_index_is_vrf(net, ipc.oif)) {
flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
RT_SCOPE_UNIVERSE, sk->sk_protocol,
- (flow_flags | FLOWI_FLAG_VRFSRC),
+ (flow_flags | FLOWI_FLAG_VRFSRC |
+ FLOWI_FLAG_SKIP_NH_OIF),
faddr, saddr, dport,
inet->inet_sport);
if (saddr)
fl4->saddr = saddr->a4;
+ fl4->flowi4_flags = FLOWI_FLAG_SKIP_NH_OIF;
+
rt = __ip_route_output_key(net, fl4);
if (!IS_ERR(rt))
return &rt->dst;
*ins = rt;
rt->rt6i_node = fn;
atomic_inc(&rt->rt6i_ref);
- inet6_rt_notify(RTM_NEWROUTE, rt, info);
+ inet6_rt_notify(RTM_NEWROUTE, rt, info, 0);
info->nl_net->ipv6.rt6_stats->fib_rt_entries++;
if (!(fn->fn_flags & RTN_RTINFO)) {
rt->rt6i_node = fn;
rt->dst.rt6_next = iter->dst.rt6_next;
atomic_inc(&rt->rt6i_ref);
- inet6_rt_notify(RTM_NEWROUTE, rt, info);
+ inet6_rt_notify(RTM_NEWROUTE, rt, info, NLM_F_REPLACE);
if (!(fn->fn_flags & RTN_RTINFO)) {
info->nl_net->ipv6.rt6_stats->fib_route_nodes++;
fn->fn_flags |= RTN_RTINFO;
fib6_purge_rt(rt, fn, net);
- inet6_rt_notify(RTM_DELROUTE, rt, info);
+ inet6_rt_notify(RTM_DELROUTE, rt, info, 0);
rt6_release(rt);
}
frag_id = ipv6_select_ident(net, &ipv6_hdr(skb)->daddr,
&ipv6_hdr(skb)->saddr);
+ hroom = LL_RESERVED_SPACE(rt->dst.dev);
if (skb_has_frag_list(skb)) {
int first_len = skb_pagelen(skb);
struct sk_buff *frag2;
if (first_len - hlen > mtu ||
((first_len - hlen) & 7) ||
- skb_cloned(skb))
+ skb_cloned(skb) ||
+ skb_headroom(skb) < (hroom + sizeof(struct frag_hdr)))
goto slow_path;
skb_walk_frags(skb, frag) {
/* Correct geometry. */
if (frag->len > mtu ||
((frag->len & 7) && frag->next) ||
- skb_headroom(frag) < hlen)
+ skb_headroom(frag) < (hlen + hroom + sizeof(struct frag_hdr)))
goto slow_path_clean;
/* Partially cloned skb? */
err = 0;
offset = 0;
- frag = skb_shinfo(skb)->frag_list;
- skb_frag_list_init(skb);
/* BUILD HEADER */
*prevhdr = NEXTHDR_FRAGMENT;
if (!tmp_hdr) {
IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
IPSTATS_MIB_FRAGFAILS);
- return -ENOMEM;
+ err = -ENOMEM;
+ goto fail;
}
+ frag = skb_shinfo(skb)->frag_list;
+ skb_frag_list_init(skb);
__skb_pull(skb, hlen);
fh = (struct frag_hdr *)__skb_push(skb, sizeof(struct frag_hdr));
*/
*prevhdr = NEXTHDR_FRAGMENT;
- hroom = LL_RESERVED_SPACE(rt->dst.dev);
troom = rt->dst.dev->needed_tailroom;
/*
return err;
}
-void inet6_rt_notify(int event, struct rt6_info *rt, struct nl_info *info)
+void inet6_rt_notify(int event, struct rt6_info *rt, struct nl_info *info,
+ unsigned int nlm_flags)
{
struct sk_buff *skb;
struct net *net = info->nl_net;
goto errout;
err = rt6_fill_node(net, skb, rt, NULL, NULL, 0,
- event, info->portid, seq, 0, 0, 0);
+ event, info->portid, seq, 0, 0, nlm_flags);
if (err < 0) {
/* -EMSGSIZE implies BUG in rt6_nlmsg_size() */
WARN_ON(err == -EMSGSIZE);
case NFPROTO_IPV6: {
u8 nexthdr = ipv6_hdr(skb)->nexthdr;
__be16 frag_off;
+ int ofs;
- protoff = ipv6_skip_exthdr(skb, sizeof(struct ipv6hdr),
- &nexthdr, &frag_off);
- if (protoff < 0 || (frag_off & htons(~0x7)) != 0) {
+ ofs = ipv6_skip_exthdr(skb, sizeof(struct ipv6hdr), &nexthdr,
+ &frag_off);
+ if (ofs < 0 || (frag_off & htons(~0x7)) != 0) {
pr_debug("proto header not found\n");
return NF_ACCEPT;
}
+ protoff = ofs;
break;
}
default: