Re: [8023-CMSG] Server/NIC analogy
Ben,
I'm looking at this from a system architecture point of view, from the
top down, and identifying where each layer needs to be improved to make
a significant improvement in the performance of the system as a whole.
I've never assumed that solutions in any one layer could provide the
complete answer. From my perspective, IEEE can enable significant
improvements in layer 2 (802.1 & 802.3) for short range full duplex
Ethernet subnets. Since the primary target link technology is Ethernet,
my tack has been to focus on enhancements to 802.3, but not lose sight
of the fact that some support at 802.1 is likely to be required (to
enable dealing with the problems Hugh & Norm keep hammering on).
I came into the IEEE activity naively assuming there was a closer
working relationship between 802.1 & 802.3. I completely understand the
technical boundaries between the layers. What really surprised me was
the apparent turf wars over who should own a given problem and who is in
a better position to solve it. From a system perspective, the problem of
congestion control & management spans all layers. At layer 2 for
Ethernet based subnets, it spans both 802.1 and 802.3. IMO, if we allow
ourselves to consider tweaks to both 802.3 and 802.1, we can maximize
the effectiveness of the solution space while limiting the impact to
either standard. Trying to solve the problems completely in one or the
other is likely to result in falling short of supporting effective
solutions or usable/implementable solutions.
Re-stating Gadi's message: We need to open our minds to new ideas and
perspectives and stay away from the "religious" battles and turf wars.
Since Ethernet is the primary target link technology, it's a good bet
that enhancements to the 802.3 standards will be required to support an
acceptable solution space. Focusing attention on: 1) improving
communication of congestion information across 802.3 links and 2)
utilizing the congestion information to improve throughput, latency,
latency variation, and loss characteristics is a good place to start.
But ... as soon as we consider subnets with more than one layer of
switching, support of an acceptable solution space is likely to also
require enhancements to 802.1.
As far as defining the scope of subnets the CMSG is trying to address, the
following are some limits I would propose in order to bound the problem
space (this is not to say the supported solution space would be
prohibited from extending far beyond these limits):
1. Consider 1G to 10G full-duplex links as covering the target problem
space (this is not to say support can't be defined such that it can be
applied to any speed full-duplex links).
2. Consider up to 5 stages of switching in one subnet (3 layers in a
hierarchical topology). This equates to as many as 6 hops between any
two endpoints. (This would easily cover the sweet spot for clusters of
blades, shelves, and racks interconnected by a single cluster
interconnect.)
3. Limit the link lengths in the target problem space to 100 meters or
less. (This is not to say support can't be defined such that it can be
applied to links of much greater length. It only puts a limit on the
problem space we need to consider in crafting 802.3 enhancements.)
It could be argued that #2 does not pertain to 802.3. But it does
pertain to CM and needs to be considered in the solution space supported
by 802.3 enhancements.
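
The stage and hop counts in item 2 can be checked with a little arithmetic. The sketch below assumes a strictly hierarchical topology in which the longest path climbs from an edge switch to the top layer and comes back down; the function names are illustrative and not from any standard.

```python
def max_switch_stages(layers: int) -> int:
    """Switches crossed on the longest path: `layers` going up, `layers - 1` coming down."""
    if layers < 1:
        raise ValueError("need at least one switching layer")
    return 2 * layers - 1


def max_link_hops(layers: int) -> int:
    """Links crossed on the longest path: one more than the switches crossed."""
    return max_switch_stages(layers) + 1


if __name__ == "__main__":
    for layers in (1, 2, 3):
        print(layers, max_switch_stages(layers), max_link_hops(layers))
```

With 3 layers this gives 5 stages of switching and 6 link hops between endpoints, matching the limits proposed above.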
Gary
-----Original Message-----
From: owner-stds-802-3-cm@listserv.ieee.org
[mailto:owner-stds-802-3-cm@listserv.ieee.org] On Behalf Of Benjamin
Brown
Sent: Tuesday, June 08, 2004 7:30 PM
To: STDS-802-3-CM@listserv.ieee.org
Subject: Re: [8023-CMSG] Server/NIC analogy
Gary,
I don't know how to "define the target scope of the CMSG to
include multiple stages of switching, multiple hops, and different
link speeds in the same interconnect (or subnet)" and keep this
within 802.3. This kind of scope is at least 802.1 or even higher
up the stack.
The scope for 802.3 is a single link. We can talk about the
length of that link (either in terms of meters or bits for the
potentially high-latency PHYs). We can talk about the system
ramifications of events propagating from bridge to bridge via
multiple links based on the effects of one protocol vs. another
on each individual link. But we cannot talk about multiple
links and expect to keep the scope within 802.3.
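
To make "length in bits" concrete, here is a rough sketch of the conversion for the 100-meter, 1G to 10G range proposed elsewhere in this thread. It assumes a propagation delay of about 5 ns per meter (roughly two-thirds the speed of light, typical for copper or fiber) and ignores any latency added by the PHY itself; the constant and function name are illustrative assumptions.

```python
PROPAGATION_NS_PER_METER = 5.0  # assumption: signal travels at about 2e8 m/s


def link_length_in_bits(length_m: float, rate_gbps: float) -> float:
    """Bits in flight, one way, on a link of the given length and data rate.

    1 Gb/s is exactly one bit per nanosecond, so bits = delay_ns * rate_gbps.
    """
    delay_ns = length_m * PROPAGATION_NS_PER_METER
    return delay_ns * rate_gbps


if __name__ == "__main__":
    for rate in (1, 10):
        print(rate, "Gb/s:", link_length_in_bits(100, rate), "bits one way")
```

Under these assumptions a 100 m link holds about 500 bits in flight at 1 Gb/s and about 5000 bits at 10 Gb/s, before any PHY latency is counted.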
Could you elaborate on your suggested scope and whether
you disagree with my comments above or where you think
this project should take place?
Thanks,
Ben
McAlpine, Gary L wrote:
Ben,
I think we can safely assume that, in the not too distant future,
servers will be capable of consuming 10 Gb/s links on real applications.
When these servers are used to construct clusters for parallel
processing applications (database processing for instance), they will be
capable of creating frequent congestion within the cluster interconnect.
The traffic loads will include both bursty traffic and high priority
control traffic.
Prioritization will help the high priority traffic get through an
Ethernet cluster interconnect in a timely manner. The rest of the
traffic (the traffic causing all the congestion) is likely to all be
transferred at a low priority. Dropping some portion of it to relieve
congestion is simply unacceptable. During times of heavy loading,
congestion will happen often enough to bring the cluster to its knees
doing retransmissions.
IMO static rate control is also unacceptable because the loading
requirements are not that predictable and not that manageable. I just
can't see the database industry buying into static rate control
management just so they can use Ethernet in backplanes, clusters, and
SANs.
I think the key to making Ethernet acceptable as a short range
interconnect for backplanes and clusters is to support dynamic rate
control. The use of a good implementation of 802.3x (within a limited
enough scope) would be preferable to drops. If we define the target
scope of the CMSG to include multiple stages of switching, multiple
hops, and different link speeds in the same interconnect (or subnet),
then 802.3x just isn't going to hack it (and neither is static rate
control or packet dropping). I think we need to be thinking in terms of
an improved dynamic rate control, which requires feedback.
I think defining the scope of interconnects we want to target with CM
solutions will go a long way toward framing the range of acceptable
solutions.
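
The difference between dropping, on/off pausing, and feedback-driven rate control can be illustrated with a toy model. The sketch below is not any 802.3 mechanism: it assumes a single sender offering more than one queue can drain, and the policy names, thresholds, and the halve-on-congestion rule are all hypothetical choices made for illustration.

```python
from typing import Tuple


def simulate(policy: str, steps: int = 1000, offered: int = 10,
             drain: int = 6, capacity: int = 50) -> Tuple[int, int]:
    """Return (delivered, dropped) frame counts for one congested queue.

    policy is "drop" (no feedback), "pause" (on/off, in the spirit of
    802.3x) or "feedback" (sender adjusts its rate from queue depth).
    """
    if policy not in ("drop", "pause", "feedback"):
        raise ValueError(f"unknown policy: {policy}")
    high_water = capacity // 2  # hypothetical congestion threshold
    queue_depth = delivered = dropped = 0
    rate = offered
    paused = False
    for _ in range(steps):
        # Frames the sender injects this time step.
        if policy == "pause" and paused:
            arrivals = 0
        elif policy == "feedback":
            arrivals = rate
        else:
            arrivals = offered
        # Frames that do not fit in the buffer are lost.
        accepted = min(arrivals, capacity - queue_depth)
        dropped += arrivals - accepted
        queue_depth += accepted
        # The egress link drains at a fixed rate.
        sent = min(queue_depth, drain)
        queue_depth -= sent
        delivered += sent
        # Congestion signal seen by the sender on the next step.
        congested = queue_depth > high_water
        if policy == "pause":
            paused = congested
        elif policy == "feedback":
            rate = max(1, rate // 2) if congested else min(offered, rate + 1)
    return delivered, dropped


if __name__ == "__main__":
    for policy in ("drop", "pause", "feedback"):
        print(policy, simulate(policy))
```

In this model all three policies deliver the same number of frames, because the egress link is the bottleneck, but only "drop" loses frames that would then have to be retransmitted. A single queue cannot show why on/off pausing falls short across multiple hops, where pausing a whole link also stalls traffic that is not headed for the congested port; that would need a multi-stage model.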
Gary
-----Original Message-----
From: owner-stds-802-3-cm@listserv.ieee.org
[mailto:owner-stds-802-3-cm@listserv.ieee.org] On Behalf Of Benjamin
Brown
Sent: Friday, June 04, 2004 2:20 PM
To: STDS-802-3-CM@listserv.ieee.org
Subject: [8023-CMSG] Server/NIC analogy
All,
During a private discussion this afternoon regarding the results of last
week's meeting, the concept of feedback came up - whether it was
necessary or not. There was some level of discussion about this during
the meeting but no one seemed to be able to provide an adequate
justification for providing congestion feedback and why the more common
approach of packet drop wasn't adequate.
During this afternoon's discussion, I came up with something that I
think might be justification. I'm probably just painting a big target on
my chest but let's see how this goes.
Consider a stand-alone server with a 1G Ethernet NIC. Today's CPUs could
easily generate enough traffic to swamp the 1G Ethernet link (okay this
is a bit of an assumption on my part but if they can't today they will
be able to tomorrow). I don't build these things, nor have I looked at
their architecture all that closely in a number of years, but I'll step
out on a limb and state that there's a (most likely proprietary)
mechanism for the NIC to tell the CPU that the link is too slow to
handle all the packets that it is trying to transmit. I'll step even
farther out on that same limb and state that the mechanism is not packet
drop.
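
The contrast between the two behaviors can be sketched with a bounded transmit queue. This is only an illustration of the idea, not a description of how any real NIC or driver works: the "ring" is a plain Python `queue.Queue`, and both function names are made up for this example.

```python
import queue
from collections import deque


def send_with_drop(tx_ring, frames):
    """Push every frame at the ring; count those lost because it was full."""
    dropped = 0
    for frame in frames:
        try:
            tx_ring.put_nowait(frame)
        except queue.Full:
            dropped += 1
    return dropped


def send_with_backpressure(tx_ring, frames):
    """Push frames until the ring is full; return the ones the host still holds."""
    pending = deque(frames)
    while pending and not tx_ring.full():
        tx_ring.put_nowait(pending.popleft())
    return pending


if __name__ == "__main__":
    frames = list(range(10))
    print("dropped:", send_with_drop(queue.Queue(maxsize=4), frames))
    held = send_with_backpressure(queue.Queue(maxsize=4), frames)
    print("held back for later:", list(held))
```

With a 4-entry ring and 10 frames, the first function loses 6 frames, while the second loses none and leaves 6 waiting in the host until the ring has room again.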
Now, let's use this analogy to consider a server card in a backplane
that communicates to the world via a port interface line card. The
server card communicates to the port interface line card using a link
compliant with the newly emerging Backplane Ethernet standard. (Okay, so
I'm looking a little into the future.) If you consider the entire
chassis analogous to the server/NIC in my initial example then it would
seem plausible that you would want to communicate buffer congestion on
the port interface line card back to the server card using a mechanism
other than packet drop.
I'll just close my eyes now. Fire at will.
Ben
--
-----------------------------------------
Benjamin Brown
178 Bear Hill Road
Chichester, NH 03258
603-491-0296 - Cell
603-798-4115 - Office
benjamin-dot-brown-at-ieee-dot-org
(Will this cut down on my spam???)
-----------------------------------------