5894 Want big theory statement on MAC's data path

Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Rob Gulewich <robert.gulewich@joyent.com>
Reviewed by: Max Bruning <max@joyent.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
author: Robert Mustacchi <rm@joyent.com> 2015-04-30 15:04:38 -0700
committer: Robert Mustacchi <rm@joyent.com> 2015-05-14 14:42:24 -0700
commit: bc44a9330a5eaab897440aebd5b17691ec2c1d0a (patch)
tree: 4ad633a40a5d654ec20b25ca5c4af180e909b841 /usr/src
parent: 3b3b7026bde850c59ef70bb86cf2ca9e8d8011fc (diff)
2 files changed, 950 insertions, 1 deletions
diff --git a/usr/src/uts/common/io/mac/mac.c b/usr/src/uts/common/io/mac/mac.c
index ed809e5f45..98ec8b4366 100644
--- a/usr/src/uts/common/io/mac/mac.c
+++ b/usr/src/uts/common/io/mac/mac.c
@@ -264,6 +264,13 @@
  * subflows before attempting a link property change.
  * Some of the above rules can be overridden by specifying additional command
  * line options while creating or modifying link or subflow properties.
+ *
+ * Datapath
+ * --------
+ *
+ * For information on the datapath, the world of soft rings, hardware rings, how
+ * it is structured, and the path of an mblk_t between a driver and a mac
+ * client, see mac_sched.c.
  */
 
 #include <sys/types.h>
diff --git a/usr/src/uts/common/io/mac/mac_sched.c b/usr/src/uts/common/io/mac/mac_sched.c
index 9385fb08ac..eb179b07c7 100644
--- a/usr/src/uts/common/io/mac/mac_sched.c
+++ b/usr/src/uts/common/io/mac/mac_sched.c
@@ -21,10 +21,952 @@
 /*
  * Copyright 2010 Sun Microsystems, Inc.  All rights reserved.
  * Use is subject to license terms.
- * Copyright 2011 Joyent, Inc.  All rights reserved.
+ * Copyright 2015 Joyent, Inc.
  * Copyright 2013 Nexenta Systems, Inc. All rights reserved.
  */
 
+/*
+ * MAC data path
+ *
+ * The MAC data path is concerned with the flow of traffic from mac clients --
+ * DLS, IP, etc. -- to various GLDv3 device drivers -- e1000g, vnic, aggr,
+ * ixgbe, etc. -- and from the GLDv3 device drivers back to clients.
+ *
+ * -----------
+ * Terminology
+ * -----------
+ *
+ * MAC uses a lot of different, but related terms that are associated with the
+ * design and structure of the data path. Before we cover other aspects, first
+ * let's review the terminology that MAC uses.
+ *
+ * MAC
+ *
+ * 	This driver. It interfaces with device drivers and provides abstractions
+ * 	that the rest of the system consumes. All data links -- things managed
+ * 	with dladm(1M) -- are accessed through MAC.
+ *
+ * GLDv3 DEVICE DRIVER
+ *
+ * 	A GLDv3 device driver refers to a driver, both for pseudo-devices and
+ * 	real devices, which implements the GLDv3 driver API. Common examples of
+ * 	these are igb and ixgbe, which are drivers for various Intel networking
+ * 	cards. These devices may or may not have various features, such as
+ * 	hardware rings and checksum offloading. For MAC, a GLDv3 device is the
+ * 	final point for the transmission of a packet and the starting point for
+ * 	the receipt of a packet.
+ *
+ * FLOWS
+ *
+ * 	At a high level, a flow refers to a series of packets that are related.
+ * 	Often the term is used in the context of TCP to indicate a unique
+ * 	TCP connection and the traffic over it. However, a flow can exist at
+ * 	other levels of the system as well. MAC has a notion of a default flow
+ * 	which is used for all unicast traffic addressed to the address of a MAC
+ * 	device. For example, when a VNIC is created, a default flow is created
+ * 	for the VNIC's MAC address. In addition, flows are created for broadcast
+ * 	groups and a user may create a flow with flowadm(1M).
+ *
+ * CLASSIFICATION
+ *
+ * 	Classification refers to the notion of identifying an incoming frame
+ * 	based on its destination address and optionally its source address and
+ * 	doing different processing based on that information. Classification can
+ * 	be done in both hardware and software. In general, we usually only
+ * 	classify based on the layer two destination, e.g. for Ethernet, the
+ * 	destination MAC address.
+ *
+ * 	The system also will do classification based on layer three and layer
+ * 	four properties. This is used to support things like flowadm(1M), which
+ * 	allows setting QoS and other properties on a per-flow basis.
+ *
+ * RING
+ *
+ * 	Conceptually, a ring represents a series of framed messages, often in a
+ * 	contiguous chunk of memory that acts as a circular buffer. Rings come in
+ * 	a couple of forms. Generally they are either a hardware construct (hw
+ * 	ring) or they are a software construct (sw ring) maintained by MAC.
+ *
+ * HW RING
+ *
+ * 	A hardware ring is a set of resources provided by a GLDv3 device driver
+ * 	(even if it is a pseudo-device). A hardware ring comes in two different
+ * 	forms: receive (rx) rings and transmit (tx) rings. An rx hw ring is
+ * 	something that has a unique DMA (direct memory access) region and
+ * 	generally supports some form of classification (though it isn't always
+ * 	used), as well as a means of generating an interrupt specific to that
+ * 	ring. For example, the device may generate a specific MSI-X for a PCI
+ * 	express device. A tx ring is similar, except that it is dedicated to
+ * 	transmission. It may also be a vector for enabling features such as VLAN
+ * 	tagging and large transmit offloading. It usually has its own dedicated
+ * 	interrupts for transmit being completed.
+ *
+ * SW RING
+ *
+ * 	A software ring is a construction of MAC. It represents the same thing
+ * 	that a hardware ring generally does, a collection of frames. However,
+ * 	instead of being in a contiguous ring of memory, the frames are linked
+ * 	by using the mblk_t's b_next pointer. Each frame may itself be multiple
+ * 	mblk_t's linked together by the b_cont pointer. A software ring always
+ * 	represents a collection of classified packets; however, it varies as to
+ * 	whether it uses only layer two information, or a combination of that and
+ * 	additional layer three and layer four data.
+ *
+ * FANOUT
+ *
+ * 	Fanout is the idea of spreading out the load of processing frames based
+ * 	on the source and destination information contained in the layer two,
+ * 	three, and four headers, such that the data can then be processed in
+ * 	parallel using multiple hardware threads.
+ *
+ * 	A fanout algorithm hashes the headers and uses that to place different
+ * 	flows into a bucket. The most important thing is that packets that are
+ * 	in the same flow end up in the same bucket. If they do not, performance
+ * 	can be adversely affected. Consider the case of TCP.  TCP severely
+ * 	penalizes a connection if the data arrives out of order. If a given flow
+ * 	is processed on different CPUs, then the data will appear out of order,
+ * 	hence the invariant that fanout always hashes a given flow to the same
+ * 	bucket, so that it is processed on the same CPU.
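+ *
+ * 	As a minimal sketch of that invariant (hash_headers() and the variable
+ * 	names are hypothetical, not MAC's actual code), bucket selection is:
+ *
+ * 		hash = hash_headers(src, dst, sport, dport, proto);
+ * 		bucket = hash % nbuckets;
+ *
+ * 	Because the inputs to the hash are constant for the life of a flow,
+ * 	every packet in that flow lands in the same bucket.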
+ *
+ * RECEIVE SIDE SCALING (RSS)
+ *
+ * 	Receive side scaling is a term that isn't common in illumos, but is used
+ * 	by vendors and was popularized by Microsoft. It refers to the idea of
+ * 	spreading the incoming receive load out across multiple interrupts which
+ * 	can be directed to different CPUs. This allows a device to leverage
+ * 	hardware rings even when it doesn't support hardware classification. The
+ * 	hardware uses an algorithm to perform fanout that ensures the flow
+ * 	invariant is maintained.
+ *
+ * SOFT RING SET
+ *
+ * 	A soft ring set, commonly abbreviated SRS, is a collection of rings and
+ * 	is used for both transmitting and receiving. It is maintained in the
+ * 	structure mac_soft_ring_set_t. A soft ring set is usually associated
+ * 	with flows, and coordinates the use of both hardware and software rings.
+ * 	Because the use of hardware rings can change as devices such as VNICs
+ * 	come and go, we always ensure that the set has software classification
+ * 	rules that correspond to the hardware classification rules from rings.
+ *
+ * 	Soft ring sets are also used for the enforcement of various QoS
+ * 	properties. For example, if a bandwidth limit has been placed on a
+ * 	specific flow or device, then that will be enforced by the soft ring
+ * 	set.
+ *
+ * SERVICE ATTACHMENT POINT (SAP)
+ *
+ * 	The service attachment point is a DLPI (Data Link Provider Interface)
+ * 	concept; however, it comes up quite often in MAC. Most MAC devices speak
+ * 	a protocol that has some notion of different channels or message type
+ * 	identifiers. For example, Ethernet defines an EtherType which is a part
+ * 	of the Ethernet header and defines the particular protocol of the data
+ * 	payload. If the EtherType is set to 0x0800, then it indicates that the
+ * 	contents of that Ethernet frame are IPv4 traffic. For Ethernet, the
+ * 	EtherType is the SAP.
+ *
+ * 	In DLPI, a given consumer attaches to a specific SAP. In illumos, the ip
+ * 	and arp drivers attach to the EtherTypes for IPv4, IPv6, and ARP. Using
+ * 	libdlpi(3LIB), user software can attach to arbitrary SAPs. With the
+ * 	exception of 802.1Q VLAN tagged traffic, MAC itself does not directly
+ * 	consume the SAP; however, it uses that information as part of hashing
+ * 	and it may be used as part of the construction of flows.
+ *
+ * PRIMARY MAC CLIENT
+ *
+ * 	The primary mac client refers to a mac client whose unicast address
+ * 	matches the address of the device itself. For example, if the system has
+ * 	instances of the e1000g driver such as e1000g0, e1000g1, etc., the
+ * 	primary mac client is the one named after the device itself. VNICs that
+ * 	are created on top of such devices are not the primary client.
+ *
+ * TRANSMIT DESCRIPTORS
+ *
+ * 	Transmit descriptors are a resource that most GLDv3 device drivers have.
+ * 	Generally, a GLDv3 device driver takes a frame that's meant to be output
+ * 	and puts a copy of it into a region of memory. Each region of memory
+ * 	usually has an associated descriptor that the device uses to manage
+ * 	properties of the frames. Devices have a limited number of such
+ * 	descriptors. They get reclaimed once the device finishes putting the
+ * 	frame on the wire.
+ *
+ * 	If the driver runs out of transmit descriptors, for example because the
+ * 	OS is generating more frames than the device can put on the wire, then
+ * 	it will return the frames it could not send to the MAC layer.
+ *
+ * ---------------------------------
+ * Rings, Classification, and Fanout
+ * ---------------------------------
+ *
+ * The heart of MAC is made up of rings, and not those that Elven-kings wear.
+ * When receiving a packet, MAC breaks the work into two different, though
+ * interrelated phases. The first phase is generally classification and then the
+ * second phase is generally fanout. When a frame comes in from a GLDv3 Device,
+ * MAC needs to determine where that frame should be delivered. If it's a
+ * unicast frame (say a normal TCP/IP packet), then it will be delivered to a
+ * single MAC client; however, if it's a broadcast or multicast frame, then MAC
+ * may need to deliver it to multiple MAC clients.
+ *
+ * On transmit, classification isn't quite as important, but may still be used.
+ * Unlike with the receive path, the classification is not used to determine
+ * devices that should transmit something, but rather is used for special
+ * properties of a flow, e.g. bandwidth limits for a given IP address, device,
+ * or connection.
+ *
+ * MAC employs a software classifier and leverages hardware classification as
+ * well. The software classifier can leverage the full layer two information:
+ * source, destination, VLAN, and SAP. If the SAP indicates that IP traffic is
+ * being sent, it can classify based on the IP header, and finally, it also
+ * knows how to classify based on the local and remote ports of TCP, UDP, and
+ * SCTP.
+ *
+ * Hardware classifiers vary in capability. Generally all hardware classifiers
+ * provide the capability to classify based on the destination MAC address. Some
+ * hardware has additional filters built in for performing more in-depth
+ * classification; however, it often has much more limited resources for these
+ * activities as compared to the layer two destination address classification.
+ *
+ * The modus operandi in MAC is to always ensure that we have software-based
+ * capabilities and rules in place and then to supplement that with hardware
+ * resources when available. In general, simple layer two classification is
+ * sufficient and nothing else is used, unless a specific flow is created with
+ * tools such as flowadm(1M) or bandwidth limits are set on a device with
+ * dladm(1M).
+ *
+ * RINGS AND GROUPS
+ *
+ * To get into how rings and classification play together, it's first important
+ * to understand how hardware devices commonly associate rings and allow them to
+ * be programmed. Recall that a hardware ring should be thought of as a DMA
+ * buffer and an interrupt resource. Rings are then collected into groups. A
+ * group itself has a series of classification rules. One or more MAC addresses
+ * are assigned to a group.
+ *
+ * Hardware devices vary in terms of what capabilities they provide. Sometimes
+ * they allow for a dynamic assignment of rings to a group and sometimes they
+ * have a static assignment of rings to a group. For example, the ixgbe driver
+ * has a static assignment of rings to groups such that every group has exactly
+ * one ring and the number of groups is equal to the number of rings.
+ *
+ * Classification and receive side scaling both come into play with how a device
+ * advertises itself to MAC and how MAC uses it. If a device supports layer two
+ * classification of frames, then MAC will assign MAC addresses to a group as a
+ * form of primary classification. If a single MAC address is assigned to a
+ * group, a common case, then MAC will consider packets that come in from rings
+ * on that group to be fully classified and will not need to do any software
+ * classification unless a specific flow has been created.
+ *
+ * If a device supports receive side scaling, then it may advertise or support
+ * groups with multiple rings. In those cases, receive side scaling will
+ * come into play and MAC will use that as a means of fanning out received
+ * frames across multiple CPUs. This can also be combined with groups that
+ * support layer two classification.
+ *
+ * If a device supports dynamic assignments of rings to groups, then MAC will
+ * change around the way that rings are assigned to various groups as devices
+ * come and go from the system. For example, when a VNIC is created, a new flow
+ * will be created for the VNIC's MAC address. If a hardware ring is available,
+ * MAC may opt to reassign it from one group to another.
+ *
+ * ASSIGNMENT OF HARDWARE RINGS
+ *
+ * This is a bit of a complicated subject that varies depending on the device,
+ * the use of aggregations, and the special nature of the primary mac client.
+ * This section deserves being fleshed out.
+ *
+ * FANOUT
+ *
+ * illumos uses fanout to help spread out the incoming processing load of chains
+ * of frames away from a single CPU. If a device supports receive side scaling,
+ * then that provides an initial form of fanout; however, what we're concerned
+ * with all happens after the context of a given set of frames being classified
+ * to a soft ring set.
+ *
+ * After frames reach a soft ring set and any bandwidth-related accounting has
+ * been performed, they may be fanned out based on one of the following three
+ * modes:
+ *
+ *     o No Fanout
+ *     o Protocol level fanout
+ *     o Full software ring protocol fanout
+ *
+ * MAC makes the determination as to which of these modes a given soft ring set
+ * obtains based on parameters such as whether or not it's the primary mac
+ * client, whether it's on a 10 GbE or faster device, user controlled dladm(1M)
+ * properties, and the nature of the hardware and the resources that it has.
+ *
+ * When there is no fanout, MAC does not create any soft rings for a device and
+ * the device has frames delivered directly to the MAC client.
+ *
+ * Otherwise, all fanout is performed by software. MAC divides incoming frames
+ * into one of three buckets -- IPv4 TCP traffic, IPv4 UDP traffic, and
+ * everything else. Note, VLAN tagged traffic is considered other, regardless of
+ * the interior EtherType. Regardless of the type of fanout, these three
+ * categories or buckets are always used.
+ *
+ * The difference between protocol level fanout and full software ring protocol
+ * fanout is the number of software rings that end up getting created. The
+ * system always uses the same number of software rings per protocol bucket. So
+ * in the first case when we're just doing protocol level fanout, we just create
+ * one software ring each for IPv4 TCP traffic, IPv4 UDP traffic, and everything
+ * else.
+ *
+ * In the case where we do full software ring protocol fanout, we generally use
+ * mac_compute_soft_ring_count() to determine the number of rings. There are
+ * other combinations of properties and devices that may send us down other
+ * paths, but this is a common starting point. If it's a non-bandwidth enforced
+ * device and we're on at least a 10 GbE link, then we'll use eight soft rings
+ * per protocol bucket as a starting point. See mac_compute_soft_ring_count()
+ * for more information on the total number.
+ *
+ * For each of these rings, we create a mac_soft_ring_t and an associated worker
+ * thread. Particularly when doing full software ring protocol fanout, we bind
+ * each of the worker threads to individual CPUs.
+ *
+ * The other advantage of these software rings is that it allows upper layers to
+ * optionally poll on them. For example, TCP can leverage an squeue to poll on
+ * the software ring; see squeue.c for more information.
+ *
+ * DLS BYPASS
+ *
+ * DLS is the data link services module. It interfaces with DLPI, which is the
+ * primary way that other parts of the system such as IP interface with the MAC
+ * layer. While DLS is traditionally a STREAMS-based interface, it allows for
+ * certain modules such as IP to negotiate various more modern interfaces to be
+ * used, which are useful for higher performance and allow it to use direct
+ * function calls to DLS instead of using STREAMS.
+ *
+ * When we have IPv4 TCP or UDP software rings, then traffic on those rings is
+ * eligible for what we call the DLS bypass. In those cases, rather than going
+ * through mac_rx_deliver() to DLS, frames are delivered directly to the direct
+ * callback registered with DLS, generally ip_input().
+ *
+ * HARDWARE RING POLLING
+ *
+ * GLDv3 devices with hardware rings generally deliver chains of messages
+ * (mblk_t chain) during the context of a single interrupt. However, interrupts
+ * are not the only way that these devices may be used. As part of implementing
+ * ring support, a GLDv3 device driver must have a way to disable the generation
+ * of that interrupt and allow for the operating system to poll on that ring.
+ *
+ * To implement this, every soft ring set has a worker thread and a polling
+ * thread. If a sufficient packet rate comes into the system, MAC will 'blank'
+ * (disable) interrupts on that specific ring and the polling thread will start
+ * consuming packets from the hardware device and deliver them to the soft ring
+ * set, where the worker thread will take over.
+ *
+ * Once the rate of packet intake drops down below a certain threshold, then
+ * polling on the hardware ring will be quiesced and interrupts will be
+ * re-enabled for the given ring. This effectively allows the system to shift
+ * how it handles a ring based on its load. At high packet rates, polling on the
+ * device as opposed to relying on interrupts can actually reduce overall system
+ * load due to the minimization of interrupt activity.
+ *
+ * Note the importance of each ring having its own interrupt source. The whole
+ * idea here is that we do not disable interrupts on the device as a whole, but
+ * rather each ring can be independently toggled.
+ *
+ * USE OF WORKER THREADS
+ *
+ * Both the soft ring set and individual soft rings have a worker thread
+ * associated with them that may be bound to a specific CPU in the system. Any
+ * such assignment will get reassessed as part of dynamic reconfiguration events
+ * in the system such as the onlining and offlining of CPUs and the creation of
+ * CPU partitions.
+ *
+ * In many cases, while in an interrupt, we try to deliver a frame all the way
+ * through the stack in the context of the interrupt itself. However, if the
+ * amount of queued frames has exceeded a threshold, then we instead defer to
+ * the worker thread to do this work and signal it. This is particularly useful
+ * when you have the soft ring set delivering frames into multiple software
+ * rings. If it was only delivering frames into a single software ring then
+ * there'd be no need to have another thread take over. However, if it's
+ * delivering chains of frames to multiple rings, then it's worthwhile to have
+ * the worker for the software ring take over so that the different software
+ * rings can be processed in parallel.
+ *
+ * In a similar fashion to the hardware polling thread, if we don't have a
+ * backlog or there's nothing to do, then the worker thread will go back to
+ * sleep and frames can be delivered all the way from an interrupt. This
+ * behavior is useful as it's designed to minimize latency and the default
+ * disposition of MAC is to optimize for latency.
+ *
+ * MAINTAINING CHAINS
+ *
+ * Another useful idea that MAC uses is to try and maintain frames in chains for
+ * as long as possible. The idea is that all of MAC can handle chains of frames
+ * structured as a series of mblk_t structures linked with the b_next pointer.
+ * When performing software classification and software fanout, MAC does not
+ * simply determine the destination and send the frame along. Instead, in the
+ * case of classification, it tries to maintain a chain for as long as possible
+ * before passing it along and performing additional processing.
+ *
+ * In the case of fanout, MAC first determines what the target software ring is
+ * for every frame in the original chain and constructs a new chain for each
+ * target. MAC then delivers the new chain to each software ring in succession.
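+ *
+ * As a simplified sketch of that process (hash_frame(), deliver_chain(), and
+ * the head, tail, and ring arrays are hypothetical names used only for
+ * illustration; see mac_rx_srs_fanout() for the real logic):
+ *
+ *     for (mp = chain; mp != NULL; mp = next) {
+ *             next = mp->b_next;
+ *             mp->b_next = NULL;
+ *             idx = hash_frame(mp) % nrings;
+ *             if (head[idx] == NULL)
+ *                     head[idx] = mp;
+ *             else
+ *                     tail[idx]->b_next = mp;
+ *             tail[idx] = mp;
+ *     }
+ *     for (idx = 0; idx < nrings; idx++) {
+ *             if (head[idx] != NULL)
+ *                     deliver_chain(ring[idx], head[idx]);
+ *     }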
+ *
+ * The whole rationale for doing this is that we want to try and maintain the
+ * pipe as much as possible and deliver as many frames through the stack at once
+ * as we can, rather than just pushing a single frame through. This can often
+ * help bring down latency and allows MAC to get a better sense of the overall
+ * activity in the system and properly engage worker threads.
+ *
+ * --------------------
+ * Bandwidth Management
+ * --------------------
+ *
+ * Bandwidth management is something that's built into the soft ring set itself.
+ * When bandwidth limits are placed on a flow, a corresponding soft ring set is
+ * toggled into bandwidth mode. This changes how we transmit and receive the
+ * frames in question.
+ *
+ * Bandwidth management is done on a per-tick basis. We translate the user's
+ * requested bandwidth from a quantity per-second into a quantity per-tick. MAC
+ * cannot process a frame across more than one tick, thus it sets a lower bound
+ * for the bandwidth cap to be a single MTU. This also means that when hires
+ * ticks are enabled (hz is set to 1000), the minimum amount of bandwidth is
+ * higher, because the number of ticks has increased and MAC has to go from
+ * accepting 100 packets / sec to 1000 packets / sec.
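+ *
+ * As a rough illustration of that arithmetic (the variable names here are
+ * hypothetical and this is not the code MAC uses):
+ *
+ *     bytes_per_tick = (bits_per_sec / 8) / hz;
+ *     min_bits_per_sec = mtu * 8 * hz;
+ *
+ * With a 1500 byte MTU, the minimum works out to 1.2 Mbit/s when hz is 100 and
+ * 12 Mbit/s when hz is 1000.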
+ *
+ * The bandwidth counter is reset by either the soft ring set's worker thread or
+ * a thread that is doing an inline transmit or receive if they discover that
+ * the current tick is in the future from the recorded tick.
+ *
+ * Whenever we're receiving or transmitting data, we end up leaving most of the
+ * work to the soft ring set's worker thread. This forces data inserted into the
+ * soft ring set to be effectively serialized and allows us to consume bandwidth
+ * at a reasonable rate. If there is nothing in the soft ring set at the moment
+ * and the set has available bandwidth, then it may be processed inline.
+ * Otherwise, the worker is responsible for taking care of the soft ring set.
+ *
+ * ---------------------
+ * The Receive Data Path
+ * ---------------------
+ *
+ * The following series of ASCII art images breaks apart the way that a frame
+ * comes in and is processed in MAC.
+ *
+ * Part 1 -- Initial frame receipt, SRS classification
+ *
+ * Here, a frame is received by a GLDv3 driver, generally in the context of an
+ * interrupt, and it ends up in mac_rx_common(). A driver calls either mac_rx or
+ * mac_rx_ring, depending on whether or not it supports rings and can identify
+ * the interrupt as having come from a specific ring. Here we determine whether
+ * or not it's fully classified and perform software classification as
+ * appropriate. From here, everything always ends up going to either entry [A]
+ * or entry [B] based on whether or not they have subflow processing needed. We
+ * leave via fanout or delivery.
+ *
+ *           +===========+
+ *           v hardware  v
+ *           v interrupt v
+ *           +===========+
+ *                 |
+ *                 * . . appropriate
+ *                 |     upcall made
+ *                 |     by GLDv3 driver  . . always
+ *                 |                      .
+ *  +--------+     |     +----------+     .    +---------------+
+ *  | GLDv3  |     +---->| mac_rx   |-----*--->| mac_rx_common |
+ *  | Driver |-->--+     +----------+          +---------------+
+ *  +--------+     |        ^                         |
+ *      |          |        ^                         v
+ *      ^          |        * . . always   +----------------------+
+ *      |          |        |              | mac_promisc_dispatch |
+ *      |          |    +-------------+    +----------------------+
+ *      |          +--->| mac_rx_ring |               |
+ *      |               +-------------+               * . . hw classified
+ *      |                                             v     or single flow?
+ *      |                                             |
+ *      |                                   +--------++--------------+
+ *      |                                   |        |               * hw class,
+ *      |                                   |        * hw classified | subflows
+ *      |                 no hw class and . *        | or single     | exist
+ *      |                 subflows          |        | flow          |
+ *      |                                   |        v               v
+ *      |                                   |   +-----------+   +-----------+
+ *      |                                   |   |   goto    |   |  goto     |
+ *      |                                   |   | entry [A] |   | entry [B] |
+ *      |                                   |   +-----------+   +-----------+
+ *      |                                   v          ^
+ *      |                            +-------------+   |
+ *      |                            | mac_rx_flow |   * SRS and flow found,
+ *      |                            +-------------+   | call flow cb
+ *      |                                   |          +------+
+ *      |                                   v                 |
+ *      v                             +==========+    +-----------------+
+ *      |                             v For each v--->| mac_rx_classify |
+ * +----------+                       v  mblk_t  v    +-----------------+
+ * |   srs    |                       +==========+
+ * | polling  |
+ * |  thread  |->------------------------------------------+
+ * +----------+                                            |
+ *                                                         v       . inline
+ *            +--------------------+   +----------+   +---------+  .
+ *    [A]---->| mac_rx_srs_process |-->| check bw |-->| enqueue |--*---------+
+ *            +--------------------+   |  limits  |   | frames  |            |
+ *               ^                     +----------+   | to SRS  |            |
+ *               |                                    +---------+            |
+ *               |  send chain              +--------+    |                  |
+ *               *  when classified         | signal |    * BW limits,       |
+ *               |  flow changes            |  srs   |<---+ loopback,        |
+ *               |                          | worker |      stack too        |
+ *               |                          +--------+      deep             |
+ *      +-----------------+        +--------+                                |
+ *      | mac_flow_lookup |        |  srs   |     +---------------------+    |
+ *      +-----------------+        | worker |---->| mac_rx_srs_drain    |<---+
+ *               ^                 | thread |     | mac_rx_srs_drain_bw |
+ *               |                 +--------+     +---------------------+
+ *               |                                          |
+ *         +----------------------------+                   * software rings
+ *   [B]-->| mac_rx_srs_subflow_process |                   | for fanout?
+ *         +----------------------------+                   |
+ *                                               +----------+-----------+
+ *                                               |                      |
+ *                                               v                      v
+ *                                          +--------+             +--------+
+ *                                          |  goto  |             |  goto  |
+ *                                          | Part 2 |             | Part 3 |
+ *                                          +--------+             +--------+
+ *
+ * Part 2 -- Fanout
+ *
+ * This part is concerned with using software fanout to assign frames to
+ * software rings and then deliver them to MAC clients or allow those rings to
+ * be polled upon. While there are two different primary fanout entry points,
+ * mac_rx_srs_fanout and mac_rx_srs_proto_fanout, they behave in similar ways,
+ * and aside from some of the individual hashing techniques used, most of the
+ * general flow is the same.
+ *
+ *  +--------+              +-------------------+
+ *  |  From  |---+--------->| mac_rx_srs_fanout |----+
+ *  | Part 1 |   |          +-------------------+    |    +=================+
+ *  +--------+   |                                   |    v for each mblk_t v
+ *               * . . protocol only                 +--->v assign to new   v
+ *               |     fanout                        |    v chain based on  v
+ *               |                                   |    v hash % nrings   v
+ *               |    +-------------------------+    |    +=================+
+ *               +--->| mac_rx_srs_proto_fanout |----+             |
+ *                    +-------------------------+                  |
+ *                                                                 v
+ *    +------------+    +--------------------------+       +================+
+ *    | enqueue in |<---| mac_rx_soft_ring_process |<------v for each chain v
+ *    | soft ring  |    +--------------------------+       +================+
+ *    +------------+
+ *         |                                    +-----------+
+ *         * soft ring set                      | soft ring |
+ *         | empty and no                       |  worker   |
+ *         | worker?                            |  thread   |
+ *         |                                    +-----------+
+ *         +------*----------------+                  |
+ *         |      .                |                  v
+ *    No . *      . Yes            |       +------------------------+
+ *         |                       +----<--| mac_rx_soft_ring_drain |
+ *         |                       |       +------------------------+
+ *         v                       |
+ *   +-----------+                 v
+ *   |   signal  |         +---------------+
+ *   | soft ring |         | Deliver chain |
+ *   |   worker  |         | goto Part 3   |
+ *   +-----------+         +---------------+
+ *
+ *
+ * Part 3 -- Packet Delivery
+ *
+ * Here, we go through and deliver the mblk_t chain directly to a given
+ * processing function. In a lot of cases this is mac_rx_deliver(). When DLS
+ * bypass is in use, we instead deliver the chain to the direct callback
+ * registered with DLS, generally ip_input().
+ *
+ *
+ *   +---------+            +----------------+    +------------------+
+ *   |  From   |---+------->| mac_rx_deliver |--->| Off to DLS, or   |
+ *   | Parts 1 |   |        +----------------+    | other MAC client |
+ *   |  and 2  |   * DLS bypass                   +------------------+
+ *   +---------+   | enabled   +----------+    +-------------+
+ *                 +---------->| ip_input |--->|    To IP    |
+ *                             +----------+    | and beyond! |
+ *                                             +-------------+
+ *
+ * ----------------------
+ * The Transmit Data Path
+ * ----------------------
+ *
+ * Before we go into the images, it's worth talking about a problem that is a
+ * bit different from the receive data path. GLDv3 device drivers have a finite
+ * number of transmit descriptors. When they run out, they return unused frames
+ * back to MAC. MAC, at this point, has several options about what it will do,
+ * which vary based upon the settings that the client uses.
+ *
+ * When a device runs out of descriptors, the next thing that MAC does is
+ * enqueue the unsent frames on the soft ring set or a software ring, depending
+ * on the configuration of the soft ring set. MAC will enqueue up to a high
+ * watermark of mblk_t chains, at which point it will indicate flow control
+ * back to the client. Once this condition is reached, any mblk_t chains that
+ * were not enqueued will be returned to the caller, which must decide what to
+ * do with them. There are various flags that control this behavior that a
+ * client may pass, which are discussed below.
+ *
+ * When this condition is hit, MAC also returns a cookie to the client in
+ * addition to unconsumed frames. Clients can poll on that cookie and register a
+ * callback with MAC to be notified when they are no longer subject to flow
+ * control, at which point they may continue to call mac_tx(). This flow control
+ * actually manages to work itself all the way up the stack, back through dls,
+ * to ip, through the various protocols, and to sockfs.
+ *
+ * While the behavior described above is the default, this behavior can be
+ * modified. There are two alternate modes, described below, which are
+ * controlled with flags.
+ *
+ * DROP MODE
+ *
+ * This mode is controlled by having the client pass the MAC_DROP_ON_NO_DESC
+ * flag. When this is passed, if a device driver runs out of transmit
+ * descriptors, then the MAC layer will drop any unsent traffic. The client in
+ * this case will never have any frames returned to it.
+ *
+ * DON'T ENQUEUE
+ *
+ * This mode is controlled by having the client pass the MAC_TX_NO_ENQUEUE flag.
+ * If the MAC_DROP_ON_NO_DESC flag is also passed, the drop flag takes
+ * precedence. In this mode, when a driver runs out of transmit descriptors,
+ * rather than enqueuing packets in a soft ring set or software ring, we return
+ * the mblk_t chain to the caller and immediately put the soft ring set into
+ * flow control mode.
+ *
+ * The following series of ASCII art images describe the transmit data path that
+ * MAC clients enter into based on calling into mac_tx(). A soft ring set has a
+ * transmission function associated with it. There are seven possible
+ * transmission modes, some of which share function entry points. The one that a
+ * soft ring set gets depends on properties such as whether there are
+ * transmission rings for fanout, whether the device involves aggregations,
+ * whether any bandwidth limits exist, etc.
+ *
+ *
+ * Part 1 -- Initial checks
+ *
+ *      * . called by
+ *      |   MAC clients
+ *      v                     . . No
+ *  +--------+  +-----------+ .   +-------------------+  +====================+
+ *  | mac_tx |->| device    |-*-->| mac_protect_check |->v Is this the simple v
+ *  +--------+  | quiesced? |     +-------------------+  v case? See [1]      v
+ *              +-----------+            |               +====================+
+ *                  * . Yes              * failed                 |
+ *                  v                    | frames                 |
+ *             +--------------+          |                +-------+---------+
+ *             | freemsgchain |<---------+          Yes . *            No . *
+ *             +--------------+                           v                 v
+ *                                                  +-----------+     +--------+
+ *                                                  |   goto    |     |  goto  |
+ *                                                  |  Part 2   |     | SRS TX |
+ *                                                  | Entry [A] |     |  func  |
+ *                                                  +-----------+     +--------+
+ *                                                        |                 |
+ *                                                        |                 v
+ *                                                        |           +--------+
+ *                                                        +---------->| return |
+ *                                                                    | cookie |
+ *                                                                    +--------+
+ *
+ * [1] The simple case refers to the SRS being configured with the
+ * SRS_TX_DEFAULT transmission mode, having a single mblk_t (not a chain), there
+ * being only a single active client, and not having a backlog in the SRS.
+ *
+ *
+ * Part 2 -- The SRS transmission functions
+ *
+ * This part is a bit more complicated. The different transmission paths often
+ * leverage one another. In this case, we'll draw out the more common ones
+ * before the parts that depend upon them. Here, we're going to start with the
+ * workings of mac_tx_send(), a common function that most of the others end up
+ * calling.
+ *
+ *      +-------------+
+ *      | mac_tx_send |
+ *      +-------------+
+ *            |
+ *            v
+ *      +=============+    +==============+
+ *      v  more than  v--->v    check     v
+ *      v one client? v    v VLAN and add v
+ *      +=============+    v  VLAN tags   v
+ *            |            +==============+
+ *            |                  |
+ *            +------------------+
+ *            |
+ *            |                 [A]
+ *            v                  |
+ *       +============+ . No     v
+ *       v more than  v .     +==========+     +--------------------------+
+ *       v one active v-*---->v for each v---->| mac_promisc_dispatch_one |---+
+ *       v  client?   v       v mblk_t   v     +--------------------------+   |
+ *       +============+       +==========+        ^                           |
+ *            |                                   |       +==========+        |
+ *            * . Yes                             |       v hardware v<-------+
+ *            v                      +------------+       v  rings?  v
+ *       +==========+                |                    +==========+
+ *       v for each v       No . . . *                         |
+ *       v mblk_t   v       specific |                         |
+ *       +==========+       flow     |                   +-----+-----+
+ *            |                      |                   |           |
+ *            v                      |                   v           v
+ *    +-----------------+            |               +-------+  +---------+
+ *    | mac_tx_classify |------------+               | GLDv3 |  |  GLDv3  |
+ *    +-----------------+                            |TX func|  | ring tx |
+ *            |                                      +-------+  |  func   |
+ *            * Specific flow, generally                 |      +---------+
+ *            | bcast, mcast, loopback                   |           |
+ *            v                                          +-----+-----+
+ *      +==========+       +---------+                         |
+ *      v valid L2 v--*--->| freemsg |                         v
+ *      v  header  v  . No +---------+               +-------------------+
+ *      +==========+                                 | return unconsumed |
+ *            * . Yes                                |   frames to the   |
+ *            v                                      |      caller       |
+ *      +===========+                                +-------------------+
+ *      v broadcast v      +----------------+                  ^
+ *      v   flow?   v--*-->| mac_bcast_send |------------------+
+ *      +===========+  .   +----------------+                  |
+ *            |        . . Yes                                 |
+ *       No . *                                                v
+ *            |  +---------------------+  +---------------+  +----------+
+ *            +->|mac_promisc_dispatch |->| mac_fix_cksum |->|   flow   |
+ *               +---------------------+  +---------------+  | callback |
+ *                                                           +----------+
+ *
+ *
+ * In addition, many, but not all, of the routines rely on
+ * mac_tx_soft_ring_process as an entry point.
+ *
+ *
+ *                                           . No             . No
+ * +--------------------------+   +========+ .  +===========+ .  +-------------+
+ * | mac_tx_soft_ring_process |-->v worker v-*->v out of tx v-*->|    goto     |
+ * +--------------------------+   v only?  v    v  descr.?  v    | mac_tx_send |
+ *                                +========+    +===========+    +-------------+
+ *                              Yes . *               * . Yes           |
+ *                   . No             v               |                 v
+ *     +=========+   .          +===========+ . Yes   |     Yes .  +==========+
+ *     v append  v<--*----------v out of tx v-*-------+---------*--v returned v
+ *     v mblk_t  v              v  descr.?  v         |            v frames?  v
+ *     v chain   v              +===========+         |            +==========+
+ *     +=========+                                    |                 *. No
+ *         |                                          |                 v
+ *         v                                          v           +------------+
+ * +===================+           +----------------------+       |   done     |
+ * v worker scheduled? v           | mac_tx_sring_enqueue |       | processing |
+ * v Out of tx descr?  v           +----------------------+       +------------+
+ * +===================+                      |
+ *    |           |           . Yes           v
+ *    * Yes       * No        .         +============+
+ *    |           v         +-*---------v drop on no v
+ *    |      +========+     v           v  TX desc?  v
+ *    |      v  wake  v  +----------+   +============+
+ *    |      v worker v  | mac_pkt_ |         * . No
+ *    |      +========+  | drop     |         |         . Yes         . No
+ *    |           |      +----------+         v         .             .
+ *    |           |         v   ^     +===============+ .  +========+ .
+ *    +--+--------+---------+   |     v Don't enqueue v-*->v ring   v-*----+
+ *       |                      |     v     Set?      v    v empty? v      |
+ *       |      +---------------+     +===============+    +========+      |
+ *       |      |                            |                |            |
+ *       |      |        +-------------------+                |            |
+ *       |      *. Yes   |                          +---------+            |
+ *       |      |        v                          v                      v
+ *       |      |  +===========+               +========+      +--------------+
+ *       |      +<-v At hiwat? v               v append v      |    return    |
+ *       |         +===========+               v mblk_t v      | mblk_t chain |
+ *       |                  * No               v chain  v      |   and flow   |
+ *       |                  v                  +========+      |    control   |
+ *       |               +=========+                |          |    cookie    |
+ *       |               v  append v                v          +--------------+
+ *       |               v  mblk_t v           +========+
+ *       |               v  chain  v           v  wake  v   +------------+
+ *       |               +=========+           v worker v-->|    done    |
+ *       |                    |                +========+   | processing |
+ *       |                    v       .. Yes                +------------+
+ *       |               +=========+  .   +========+
+ *       |               v  first  v--*-->v  wake  v
+ *       |               v append? v      v worker v
+ *       |               +=========+      +========+
+ *       |                   |                |
+ *       |              No . *                |
+ *       |                   v                |
+ *       |       +--------------+             |
+ *       +------>|   Return     |             |
+ *               | flow control |<------------+
+ *               |   cookie     |
+ *               +--------------+
+ *
+ *
+ * The remaining images are all specific to each of the different transmission
+ * modes.
+ *
+ * SRS_TX_DEFAULT
+ *
+ *      [ From Part 1 ]
+ *             |
+ *             v
+ * +-------------------------+
+ * | mac_tx_single_ring_mode |
+ * +-------------------------+
+ *            |
+ *            |       . Yes
+ *            v       .
+ *       +==========+ .  +============+
+ *       v   SRS    v-*->v   Try to   v---->---------------------+
+ *       v backlog? v    v enqueue in v                          |
+ *       +==========+    v     SRS    v-->------+                * . . Queue too
+ *            |          +============+         * don't enqueue  |     deep or
+ *            * . No         ^     |            | flag or at     |     drop flag
+ *            |              |     v            | hiwat,         |
+ *            v              |     |            | return    +---------+
+ *     +-------------+       |     |            | cookie    | freemsg |
+ *     |    goto     |-*-----+     |            |           +---------+
+ *     | mac_tx_send | . returned  |            |                |
+ *     +-------------+   mblk_t    |            |                |
+ *            |                    |            |                |
+ *            |                    |            |                |
+ *            * . . all mblk_t     * queued,    |                |
+ *            v     consumed       | may return |                |
+ *     +-------------+             | tx cookie  |                |
+ *     | SRS TX func |<------------+------------+----------------+
+ *     |  completed  |
+ *     +-------------+
+ *
+ * SRS_TX_SERIALIZE
+ *
+ *   +------------------------+
+ *   | mac_tx_serializer_mode |
+ *   +------------------------+
+ *               |
+ *               |        . No
+ *               v        .
+ *         +============+ .  +============+    +-------------+   +============+
+ *         v srs being  v-*->v  set SRS   v--->|    goto     |-->v remove SRS v
+ *         v processed? v    v proc flags v    | mac_tx_send |   v proc flag  v
+ *         +============+    +============+    +-------------+   +============+
+ *               |                                                     |
+ *               * Yes                                                 |
+ *               v                                       . No          v
+ *      +--------------------+                           .        +==========+
+ *      | mac_tx_srs_enqueue |  +------------------------*-----<--v returned v
+ *      +--------------------+  |                                 v frames?  v
+ *               |              |   . Yes                         +==========+
+ *               |              |   .                                  |
+ *               |              |   . +=========+                      v
+ *               v              +-<-*-v queued  v     +--------------------+
+ *        +-------------+       |     v frames? v<----| mac_tx_srs_enqueue |
+ *        | SRS TX func |       |     +=========+     +--------------------+
+ *        | completed,  |<------+         * . Yes
+ *        | may return  |       |         v
+ *        |   cookie    |       |     +========+
+ *        +-------------+       +-<---v  wake  v
+ *                                    v worker v
+ *                                    +========+
+ *
+ *
+ * SRS_TX_FANOUT
+ *
+ *                                             . Yes
+ *   +--------------------+    +=============+ .   +--------------------------+
+ *   | mac_tx_fanout_mode |--->v Have fanout v-*-->|           goto           |
+ *   +--------------------+    v   hint?     v     | mac_tx_soft_ring_process |
+ *                             +=============+     +--------------------------+
+ *                                   * . No                    |
+ *                                   v                         ^
+ *                             +===========+                   |
+ *                        +--->v for each  v           +===============+
+ *                        |    v   mblk_t  v           v pick softring v
+ *                 same   *    +===========+           v   from hash   v
+ *                 hash   |          |                 +===============+
+ *                        |          v                         |
+ *                        |   +--------------+                 |
+ *                        +---| mac_pkt_hash |--->*------------+
+ *                            +--------------+    . different
+ *                                                  hash or
+ *                                                  done proc.
+ *                                                  chain
+ *
+ * SRS_TX_AGGR
+ *
+ *   +------------------+    +================================+
+ *   | mac_tx_aggr_mode |--->v Use aggr capab function to     v
+ *   +------------------+    v find appropriate tx ring.      v
+ *                           v Applies hash based on aggr     v
+ *                           v policy, see mac_tx_aggr_mode() v
+ *                           +================================+
+ *                                          |
+ *                                          v
+ *                           +-------------------------------+
+ *                           |            goto               |
+ *                           |    mac_tx_soft_ring_process   |
+ *                           +-------------------------------+
+ *
+ *
+ * SRS_TX_BW, SRS_TX_BW_FANOUT, SRS_TX_BW_AGGR
+ *
+ * Note, all three of these tx functions start from the same place --
+ * mac_tx_bw_mode().
+ *
+ *  +----------------+
+ *  | mac_tx_bw_mode |
+ *  +----------------+
+ *         |
+ *         v          . No               . No               . Yes
+ *  +==============+  .  +============+  .  +=============+ .  +=========+
+ *  v  Out of BW?  v--*->v SRS empty? v--*->v  reset BW   v-*->v Bump BW v
+ *  +==============+     +============+     v tick count? v    v Usage   v
+ *         |                   |            +=============+    +=========+
+ *         |         +---------+                   |                |
+ *         |         |        +--------------------+                |
+ *         |         |        |              +----------------------+
+ *         v         |        v              v
+ * +===============+ |  +==========+   +==========+      +------------------+
+ * v Don't enqueue v |  v  set bw  v   v Is aggr? v--*-->|       goto       |
+ * v   flag set?   v |  v enforced v   +==========+  .   | mac_tx_aggr_mode |-+
+ * +===============+ |  +==========+         |       .   +------------------+ |
+ *   |    Yes .*     |        |         No . *       .                        |
+ *   |         |     |        |              |       . Yes                    |
+ *   * . No    |     |        v              |                                |
+ *   |  +---------+  |   +========+          v              +======+          |
+ *   |  | freemsg |  |   v append v   +============+  . Yes v pick v          |
+ *   |  +---------+  |   v mblk_t v   v Is fanout? v--*---->v ring v          |
+ *   |      |        |   v chain  v   +============+        +======+          |
+ *   +------+        |   +========+          |                  |             |
+ *          v        |        |              v                  v             |
+ *    +---------+    |        v       +-------------+ +--------------------+  |
+ *    | return  |    |   +========+   |    goto     | |       goto         |  |
+ *    |  flow   |    |   v wakeup v   | mac_tx_send | | mac_tx_fanout_mode |  |
+ *    | control |    |   v worker v   +-------------+ +--------------------+  |
+ *    | cookie  |    |   +========+          |                  |             |
+ *    +---------+    |        |              |                  +------+------+
+ *                   |        v              |                         |
+ *                   |   +---------+         |                         v
+ *                   |   | return  |   +============+           +------------+
+ *                   |   |  flow   |   v unconsumed v-------+   |   done     |
+ *                   |   | control |   v   frames?  v       |   | processing |
+ *                   |   | cookie  |   +============+       |   +------------+
+ *                   |   +---------+         |              |
+ *                   |                  Yes  *              |
+ *                   |                       |              |
+ *                   |                 +===========+        |
+ *                   |                 v subtract  v        |
+ *                   |                 v unused bw v        |
+ *                   |                 +===========+        |
+ *                   |                       |              |
+ *                   |                       v              |
+ *                   |              +--------------------+  |
+ *                   +------------->| mac_tx_srs_enqueue |  |
+ *                                  +--------------------+  |
+ *                                           |              |
+ *                                           |              |
+ *                                     +------------+       |
+ *                                     |  return fc |       |
+ *                                     | cookie and |<------+
+ *                                     |    mblk_t  |
+ *                                     +------------+
+ */
+
 #include <sys/types.h>
 #include <sys/callb.h>
 #include <sys/sdt.h>
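As an illustration of the "assign to new chain based on hash % nrings" step in Part 2 of the receive path described in the comment above, here is a minimal, self-contained sketch in C. It is not the code in mac_sched.c: frame_t, chain_t, NRINGS, and fanout_chain() are hypothetical names invented for this example, and the precomputed f_hash field stands in for whatever hashing the real fanout routines apply.

```c
#include <stddef.h>
#include <stdint.h>

#define	NRINGS	4	/* hypothetical number of soft rings */

/* Hypothetical stand-in for an mblk_t: just a chain link and a hash. */
typedef struct frame {
	struct frame	*f_next;
	uint32_t	f_hash;
} frame_t;

/* One outgoing sub-chain per soft ring. */
typedef struct chain {
	frame_t	*c_head;
	frame_t	*c_tail;
	size_t	c_count;
} chain_t;

/*
 * Walk the incoming chain once and move each frame onto the sub-chain
 * selected by hash % NRINGS. Arrival order is preserved within a ring.
 */
static void
fanout_chain(frame_t *head, chain_t rings[NRINGS])
{
	frame_t *next;

	for (size_t i = 0; i < NRINGS; i++) {
		rings[i].c_head = rings[i].c_tail = NULL;
		rings[i].c_count = 0;
	}

	for (; head != NULL; head = next) {
		chain_t *c = &rings[head->f_hash % NRINGS];

		/* Detach the frame before appending it to its new chain. */
		next = head->f_next;
		head->f_next = NULL;
		if (c->c_tail == NULL)
			c->c_head = head;
		else
			c->c_tail->f_next = head;
		c->c_tail = head;
		c->c_count++;
	}
}
```

After the call, each non-empty rings[i] holds the chain that would be handed to that soft ring in one step, which is the point of building chains rather than processing frames one at a time.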
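The transmit section of the comment above describes three behaviors when a driver runs out of descriptors: the default (enqueue up to a high watermark, then return frames with a flow control cookie), drop mode, and don't-enqueue mode. The sketch below models only that prose decision in C. The SKETCH_* flag values, tx_action_t, and tx_no_desc_action() are hypothetical and exist only for this example; the real MAC_DROP_ON_NO_DESC and MAC_TX_NO_ENQUEUE flags and the branch order in mac_sched.c may differ in detail (the soft ring diagram, for instance, has extra checks such as whether the ring is empty).

```c
#include <stddef.h>

/* Hypothetical flag bits; not the values used by the MAC framework. */
#define	SKETCH_DROP_ON_NO_DESC	0x01u
#define	SKETCH_TX_NO_ENQUEUE	0x02u

typedef enum {
	TXA_DROP,		/* free the frames, return nothing */
	TXA_RETURN_FLOWCTL,	/* hand frames back with a cookie */
	TXA_ENQUEUE		/* queue the frames for a later retry */
} tx_action_t;

/*
 * Decide what to do with unsent frames once the driver is out of
 * transmit descriptors. "queued" is the current backlog and "hiwat"
 * is the high watermark, both counted in mblk_t chains.
 */
static tx_action_t
tx_no_desc_action(unsigned int flags, size_t queued, size_t hiwat)
{
	/* Drop mode wins even if don't-enqueue is also set. */
	if (flags & SKETCH_DROP_ON_NO_DESC)
		return (TXA_DROP);

	/* Don't-enqueue: return the chain and assert flow control. */
	if (flags & SKETCH_TX_NO_ENQUEUE)
		return (TXA_RETURN_FLOWCTL);

	/* Default: enqueue until the high watermark is reached. */
	if (queued >= hiwat)
		return (TXA_RETURN_FLOWCTL);
	return (TXA_ENQUEUE);
}
```

A caller that receives TXA_RETURN_FLOWCTL would, per the comment, hold on to the returned frames and wait for the callback associated with the cookie before calling mac_tx() again.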