
Re: [8023-CMSG] {Spam?} Re: [8023-CMSG] Server/NIC analogy




Gary,

Finer granularity immediately brings to mind a myriad of possibilities.
However, as others have mentioned in this thread, the number of
queues required to support finer granularities simply explodes as
the number of hops in the network increases.

In a previous note to this thread, you mentioned that CMSG may be
most applicable to a "microcosm" network and
"Bounding the scope of the microcosms, in which we
are trying to enable the use of Ethernet as the
local interconnect, will help us define the set of
assumptions that apply in that space."
Are you prepared to take a stab at bounding this scope?

Thanks,
Ben

McAlpine, Gary L wrote:
Siamack,

Excellent summary. I think this is exactly the right direction to
proceed.

Supporting a finer granularity of flows and flow control at the link
level can translate to significantly better system characteristics. The
question is: what granularity and what flow definition provides the
optimum cost vs. performance trade-offs? I don't think we can answer
this or the other related questions without further study. Isn't that
what the CMSG is about?

Gary

-----Original Message-----
From: owner-stds-802-3-cm@LISTSERV.IEEE.ORG
[mailto:owner-stds-802-3-cm@LISTSERV.IEEE.ORG] On Behalf Of Siamack
Ayandeh
Sent: Friday, June 11, 2004 5:48 AM
To: STDS-802-3-CM@LISTSERV.IEEE.ORG
Subject: Re: [8023-CMSG] {Spam?} Re: [8023-CMSG] Server/NIC analogy


Ben,

It may help to start from a more limited scope with clear value before
we venture far into more complex territory.

Clearly there must have been a perceived value in the existing Pause
mechanism which is part of the standard and widely deployed.  This
mechanism, or a yet to be defined mechanism, can be improved in the
following sense:

1) Scope: Can remain as is i.e. a single link. Given that Ethernet is a
technology that is being used in a wide range of applications from
inter-chip communication to the local loop for Metro Ethernet services,
a single link would cover a wide range of applications.

2) Granularity: Needs to be improved. The granularity can be defined by
introduction of a grain_ID (I am treading carefully here & don't use
flow-ID). How this is mapped to Class of Service, VLAN tags, etc.
becomes a local matter over a single link and need not be part of a
standard. It is application dependent. Sure there are problems to be
solved here but that's why we need a study group.

The need is to create multiple control loops rather than one. How these
get mapped is a local decision over a single link.

3) Flow control algorithm:  Currently ON/OFF control is in place. This
is a simple and effective mechanism. Whether it can be improved using
the so-called "rate based" algorithms or something else remains to be
seen and is the subject of study for the working group.

In this limited context the study group can add value and produce a
useful extension to the existing Pause flow control mechanism.
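
As a rough illustration of points 2) and 3) above, the following Python
sketch models one transmitter on a single link with an independent
ON/OFF control loop per grain. It is only a sketch under stated
assumptions: the class GrainedLink, its method names, and the choice of
a VLAN field as the local grain mapping are all hypothetical and are not
taken from 802.3x or from any proposal in this thread.

```python
from collections import deque


class GrainedLink:
    """One transmitter on a single link, with an independent ON/OFF
    control loop per grain (hypothetical model, not 802.3x behavior)."""

    def __init__(self, classify):
        self.classify = classify  # local mapping: frame -> grain_id
        self.queues = {}          # grain_id -> deque of waiting frames
        self.paused = set()       # grains the receiver has switched OFF

    def enqueue(self, frame):
        # The grain is decided locally; the mapping is not standardized.
        self.queues.setdefault(self.classify(frame), deque()).append(frame)

    def pause(self, grain_id):
        # Receiver asserts OFF for one grain only.
        self.paused.add(grain_id)

    def resume(self, grain_id):
        # Receiver asserts ON again for that grain.
        self.paused.discard(grain_id)

    def transmit_one(self):
        """Return the head frame of the first non-empty, non-paused
        grain, or None if every waiting grain is paused."""
        for grain_id, queue in self.queues.items():
            if queue and grain_id not in self.paused:
                return queue.popleft()
        return None


if __name__ == "__main__":
    # Assumed local mapping: the grain is simply the frame's VLAN.
    link = GrainedLink(classify=lambda frame: frame["vlan"])
    link.enqueue({"vlan": 10, "id": 1})
    link.enqueue({"vlan": 20, "id": 2})
    link.pause(10)
    print(link.transmit_one())  # the VLAN 20 frame is not blocked
    print(link.transmit_one())  # nothing eligible while VLAN 10 is paused
    link.resume(10)
    print(link.transmit_one())  # the VLAN 10 frame goes out after resume
```

The sketch serves grains in insertion order, which is a simplification;
a fair scheduler or a rate-based loop could replace transmit_one without
changing the per-grain pause state.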

Whether more can be done e.g. to extend the scope to multiple hops will
no doubt arise and be debated in the course of the study. However the
ambiguity that currently is floating around this subject should not
prevent concrete progress in the more limited context.

Regards, Siamack



Benjamin Brown wrote:

  
Gary,

You say you're seeing promising results from simulations
but you're not ready to share the data. I certainly hope
that will change before the presentation deadline for the July meeting
in 4 weeks.

I don't mean to pick on you but you seem to be the only
one that is taking up the flag AND at least suggesting that there is
simulation data to back up your claims.

As chair of this group, I'm trying to stir up discussion in order to
get all the arguments on the table. If there are flaws in these
arguments (the "gospels" as you call them) and the exploitation of
these flaws has broad market potential and is both technically and
economically feasible, then we need to get this information
disseminated as soon as possible.

I don't think we can try to go through the July meeting without this
material and expect to get a continuation of this study group.

Regards,
Ben

McAlpine, Gary L wrote:

    
Norm,

I agree with you on many of your points below. A higher granularity of
"flow" than 8 priorities is needed to get any significant improvement
across multiple stages of switching. I know I'm being vague about
exactly what granularity of "flow" on which I want to exert targeted
influence (rate control/backpressure). It's not because I don't know,
it's because any discussions on the subject without data to back the
proposals will "simply" turn into a big rathole. I am busy developing
the data.

I understand all your arguments below. I've been listening to the same
ones for the last 15 years and, until a few years ago, treating them
as the gospel. It wasn't until I set out to thoroughly understand the
gory details through simulations that I realized there were some
interesting flaws in the "old" assumptions that can be very
effectively exploited in confined networks such as multi-stage cluster
interconnects.

I guess I don't see such a clear boundary of responsibility between
802.1 and 802.3 as you. I think it's an IEEE problem. And since the
target link technology is Ethernet, then the focus should be on the
802.3 support required to enable acceptable Ethernet based solutions.
I think 802.1 needs to be part of a complete solution, but only to the
extent of including support for the 802.3 mechanisms.

Gary



-----Original Message-----
From: owner-stds-802-3-cm@LISTSERV.IEEE.ORG
[mailto:owner-stds-802-3-cm@LISTSERV.IEEE.ORG] On Behalf Of Norman
Finn
Sent: Wednesday, June 09, 2004 2:33 AM
To: STDS-802-3-CM@LISTSERV.IEEE.ORG
Subject: Re: [8023-CMSG] Server/NIC analogy


Gary,

McAlpine, Gary L wrote:
      
I think this discussion is off on a tangent.
        
One can reasonably claim that you're the one who's off on a tangent.
One man's tangent is another man's heart of the argument.  You keep
saying, "we're just ..." and "we're only ..." and "we're simply ..."
and failing to acknowledge our "but you're ..." arguments.
Specifically:

You want back pressure on some level finer than whole links.  The
heart of the argument, that you are not addressing in your last
message, is, "On exactly what granularity do you want to exert back
pressure?"

The answer to that question is, inevitably, "flows".  (I have no
problem that "flows" are relatively undefined; we dealt with that in
Link Aggregation.)  Per-flow back pressure is the "but you're ..."
argument.
Hugh Barrass's comments boil down to exactly this point.  You want to
have per-flow back pressure.

The "per-something Pause" suggestions have mentioned VLANs and
priority levels as the granularity.  The use of only 8 priority
levels, and thus only 8 flows, is demonstrably insufficient in any
system with more than 9 ports.  For whatever granularity you name, you
require at least one queue in each endstation transmitter for each
flow in which that transmitter participates.  Unfortunately, this o(n)
problem in the endstations is an o(n**2) problem in the switch.  A
simple-minded switch architecture requires one queue per flow on each
inter-switch trunk port, which means o(n**2) queues per trunk port.
The construction of switches to handle back-pressured flows without
requiring o(n**2) queues per inter-switch port has been quite
thoroughly explored by ATM and Fibre Channel, to name two.  It is
*not* an easy problem to solve.

At the scale of one switch, one flow per port, and only a few ports,
as Ben suggests, it is easy and quite convenient to ignore the o(n**2)
factor, and assume that the per-link back pressure protocol is the
whole problem.  Unfortunately, as you imply in your e-mail below, the
trivial case of a one-switch "network" is insufficient.  As soon as
you scale the system up to even "a few hops", as you suggest, the
number of ports has grown large enough to stress even a 12-bit tag per
link. Furthermore, to assume that a given pair of physical ports will
never want to have multiple flows, e.g. between different processes in
the CPUs, is to deny the obvious.

In other words, implementing per-flow back pressure, even in networks
with a very small number of switches, very quickly requires very
sophisticated switch architectures.
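
The scaling argument above can be made concrete with a small Python
sketch. It assumes exactly one flow per ordered pair of end ports and
treats "every pair crosses the same trunk" as a worst-case upper bound;
both are simplifying assumptions for illustration, and the function
names are hypothetical.

```python
def endstation_queues(n_ports: int) -> int:
    """Queues one endstation transmitter needs: one per peer port."""
    return n_ports - 1


def trunk_queues_worst_case(n_ports: int) -> int:
    """Upper bound on flows crossing one inter-switch trunk port:
    every ordered (source, destination) pair of end ports."""
    return n_ports * (n_ports - 1)


def smallest_port_count_exceeding(id_space: int) -> int:
    """Smallest port count whose worst-case flow count no longer fits
    in `id_space` distinct per-link identifiers."""
    n = 2
    while trunk_queues_worst_case(n) <= id_space:
        n += 1
    return n


if __name__ == "__main__":
    for n in (4, 9, 10, 64, 65):
        print(n, endstation_queues(n), trunk_queues_worst_case(n))
    # Port count at which a 12-bit tag (4096 values) is exhausted
    # under the one-flow-per-ordered-pair assumption.
    print(smallest_port_count_exceeding(2 ** 12))
```

Under these assumptions a 10-port system already needs 9 queues per
endstation, one more than 8 priority levels provide, and the worst-case
per-trunk flow count passes 4096 at 65 ports, before allowing for
multiple flows between the same pair of ports.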

For a historical example, just look at Fibre Channel.  It started with
very similar goals, and very similar scaling expectations, to what
you're talking about, here.  (The physical size was different because
of the technology of the day, but the number of ports and flows was
quite similar.)  Fibre Channel switches are now quite sophisticated,
because the problem they are solving becomes extraordinarily difficult
even for relatively small networks.

Summary:

This project, as described by its proponents, is per-flow switching.
It is not the job of 802.3 to work on switching based even on MAC
address, much less per-flow switching.  It is essential that anyone
who desires to work on per-flow switching in 802 or any forum become
familiar with what the real problems are, and what solutions exist.

-- Norm

      
... There are assumptions being
made here that are off-base. We need to focus our attention on what it
is we are trying to enable with new standards. (My numbered items are
responses to Hugh's numbered items.)

1. If what we are trying to enable are single stage interconnects for
backplanes, then wrt the IEEE standards, we're done. We just need to
get good implementations of NIC's and switches using 802.3x (rate
control, not XON/XOFF) to meet the requirements (e.g. good enough
throughput, low latency, low latency variation, no loss due to
congestion). But ... single stage interconnects are not very
interesting to people who want to construct larger interconnects to
tie multiple racks with multiple shelves of blades together into a
single system.

2. (Putting on my server hat) We're NOT asking for IEEE to provide
end-to-end congestion management mechanisms. If IEEE can simply
standardize some tweaks to the current 802.3 (& 802.1) standards to
support better congestion visibility at layer 2 and better methods of
reacting to congestion at layer 2 (more selective rate control and no
frame drops), then the rest can be left up to the upper layers. There
are methods that can be implemented in layer 2 that don't prohibit
scalability. Scalability may be limited to a few hops, but that is all
that is needed.

3. The assumption in item 3 is not entirely true. There are
relationships (that can be automatically discovered or configured)
that can be exploited for significantly improved layer 2 congestion
control.

4. For backpressure to work, it neither requires congestion to be
pushed all the way back to the source nor does it require the
backpressuring device to accurately predict the future. From the layer
2 perspective, the source may be a router. So back pressure only needs
to be pushed up to the upper layers (which could be a source endpoint
or a router). Also, the backpressuring device simply needs to know its
own state of congestion and be able to convey clues to that state to
the surrounding devices. We don't need virtual circuits to be supported
at layer 2 to get "good enough" congestion control.

5. From an implementation perspective, I believe the queues can go
either in the MAC or the bridge, depending on the switch
implementation. (Am I wrong? I haven't seen anything in the interface
between the bridge and the MAC that would force the queues to be in
the bridge.) IMO, where they go should NOT be dictated by either 802.1
or 802.3. The interface between the bridge and MAC should be defined
to enable the queues to be placed where most appropriate for the switch
architecture. In fact, a switch could be implemented such that frame
payloads bypass the bridge and the bridge only deals with the task of
routing frame handles from MAC receivers to one or more MAC
transmitters (Do the 802.1 standards prevent such a design?).

As far as the IETF standards go, they don't seem to rely on layer 2 to
drop frames (although we don't yet have a clear answer on this). If a
router gets overwhelmed, it will drop packets. But if it supports ECN,
it can start forwarding ECN notices before becoming overwhelmed. I
think the jury is still out on whether the upper layers (in a confined
network) would work better with layer 2 backpressure or layer 2 drops.
From a datacenter server perspective, there is no doubt in my mind that
backpressure would be preferable to drops.

Gary


        

      
--
-----------------------------------------
Benjamin Brown
178 Bear Hill Road
Chichester, NH 03258
603-491-0296 - Cell
603-798-4115 - Office
benjamin-dot-brown-at-ieee-dot-org
(Will this cut down on my spam???)
-----------------------------------------

    

  
