
[RPRWG] Why Data CRC Stomping is a GOOD idea!




Nirmal,

I understand your desire to address problems,
within aggressive timelines of RPR.
However, I really don't understand your error
logging or catastrophic event concerns.

Please note that frames with a goodCrc^STOMP
value are not considered valid. They are still
considered invalid, but the MIB error counter
is not updated (to avoid large numbers of spurious
counts).
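
As a rough illustration of the receive-side rule described above, here
is a minimal sketch. It assumes STOMP is 0xFFFFFFFF (the value mentioned
later in this thread) and uses zlib's CRC-32 as a stand-in for the frame
CRC; the Station class, classify() and the crc_errors counter are
hypothetical names, not anything taken from the draft.

```python
import zlib

STOMP = 0xFFFFFFFF  # assumed stomp pattern, taken from later in this thread


def classify(payload: bytes, received_crc: int) -> str:
    """Classify a frame as 'valid', 'stomped', or 'error'."""
    good_crc = zlib.crc32(payload) & 0xFFFFFFFF
    if received_crc == good_crc:
        return "valid"
    if received_crc == good_crc ^ STOMP:
        return "stomped"  # invalid, but already counted upstream
    return "error"        # invalid, and first seen at this station


class Station:
    """Hypothetical station with a single MIB-style CRC error counter."""

    def __init__(self) -> None:
        self.crc_errors = 0

    def receive(self, payload: bytes, received_crc: int) -> bool:
        """Return True only if the frame may be used."""
        kind = classify(payload, received_crc)
        if kind == "error":
            self.crc_errors += 1  # only un-stomped errors are counted
        return kind == "valid"    # stomped frames are still discarded
```

Note that a stomped frame is never accepted; it only skips the counter.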

Comments interleaved.

DVJ

David V. James, PhD
Chief Architect
Network Processing Solutions
Data Communications Division
Cypress Semiconductor, Bldg #3
3901 North First Street
San Jose, CA 95134-1599
Work: +1.408.545.7560
Cell: +1.650.954.6906
Fax:  +1.408.456.1962
Work: djz@xxxxxxxxxxx
Base: dvj@xxxxxxxxxxxx


>>-----Original Message-----
>>From: owner-stds-802-17@xxxxxxxxxxxxxxxxxx
>>[mailto:owner-stds-802-17@xxxxxxxxxxxxxxxxxx]On Behalf Of Nirmal Saxena
>>Sent: Tuesday, July 30, 2002 10:45 PM
>>To: djz@xxxxxxxxxxx
>>Cc: Nirmal Saxena; stds-802-17
>>Subject: Re: [RPRWG] Why Data CRC Stomping is a BAD Idea?
>>
...
>>
>>Here are my comments:
>>
>> a) CRC stomping has very little to do with protocol
>>    compliance. For example, it is not on par with
>>    requirements such as frame format etc.
>>    I am at a loss to understand why this is being
>>    made a requirement. At a minimum, it must be
>>    a suggested hint with full disclosure of caveats
>>    (such as those I raised in my comments).

The usefulness of MIB registers for diagnosing errors
is compromised if stomping is sometimes done one way
and other times another.

>> b) Location of failed links is best left to layer 1
>>    interfaces. For example, loss-of-link or BER in SONET.

Not all layer 1 interfaces do this.

>>    A distinguished architect such as yourself would agree
>>    with me that it is never a good idea to overload
>>    gratuitous functions on established methods when
>>    we are fully aware that we are actually reducing
>>    the probability of error logging and causing implementation
>>    problems.

1) Identifying the failed link is not gratuitous, but useful.
   I learned that from IBM, during my distinguished career.
2) I disagree on the claim of reducing error logging.
   a) No bad frame is ever marked "good" and used incorrectly
      (in fact, stomping the frame reduces that likelihood)
   b) If one's goal was to have MIBs identify the number of
      errors generated on each link, stomping is useful:
      i)  Eliminates large numbers of extraneous reports.
      ii) Minimal chance (as stated 1/2**32) of not incrementing
          a MIB when an error occurred.
      Since (i) is some six orders of magnitude more than
      (ii), the benefit of stomping is clear.

>>    Even if we agree that layer 1 methods may not be sufficient
>>    to detect failed links; we have other established layer 2
>>    mechanisms like keep-alive messages to detect failed links/
>>    nodes.

The concern is not failed links, but marginal links. It's easy to
detect failed links, but much harder to identify marginal ones.
My RAS (reliability, availability, and serviceability) compatriots at HP
don't really care about hard faults, they care about marginal
and/or transient errors, which cannot be easily repeated under
controlled conditions.

>> c) The probability of correct CRC (undetected error) in the
>>    presence of error is irrelevant to the CRC stomping discussion
>>    because with or without stomping the probability of undetectable
>>    error is the same for both methods.

It better sets the framework for understanding. For example, I would
be much more concerned with a bad header CRC, which leads to corruption,
since the header is only protected by 16 bits.

Do you want to increase the header-CRC to 32 bits (like I do)?


>> d) The real issue is the probability of not-logging given an error
>>    in data frame (i.e., probability of CRC being checkStomp).

Let me try to clarify. The downside of stomping is that the MIB
will not be updated when:
   actualCrc==(goodCrc^STOMP)
Let's suppose that errors occur once a second, for example.
Then, the mean time between unlogged errors will be over 100 years.
Just trying to put things into perspective...
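
The arithmetic behind that figure can be sketched as follows, assuming
(as argued in this thread) that each error has a 1/2**32 chance of
producing exactly the stomped CRC. The function name is made up for
illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def mean_years_between_unlogged(errors_per_second: float, crc_bits: int = 32) -> float:
    """Mean time between unlogged errors, assuming every syndrome is
    equally likely, so one error in 2**crc_bits goes unlogged."""
    seconds = (2 ** crc_bits) / errors_per_second
    return seconds / SECONDS_PER_YEAR


print(round(mean_years_between_unlogged(1.0)))  # about 136 years
```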

>>    Your claim that the probability is 1/2**32 is not correct on two
>>    counts:
>>
>>     1) It assumes equally likely error model with probability
>>        of bit error = 1/2. The actual error rate on the links
>>        is much lower and estimating this probability with a known
>>        bit error rate is NP-Hard.

It doesn't assume any 1/2 number. It simply assumes that 0xFFFFFFFF
syndrome is no more likely than any other syndrome.

>>     2) Your expression assumes unconditional probability. What we
>>        are interested in is conditional probability. That is,
>>        given a data frame error what is the probability that
>>        the computed CRC is equal to checkStomp value.

My assumption was conditional probability, using your terminology.
If there is a syndrome, then chances are about 1/(2**32) that that
particular syndrome happens to be 0xFFFFFFFF. It's simply a
pick-one-out-of-4Gig problem.

>>        My claim is that it could be very HIGH. For example, consider
>>        a single bit error in a data frame of length L bits.
>>        Depending on the position of this single bit error and the
>>        chosen STOMP value the conditional probability could be
>>        ONE.

That's an interesting claim, but unsubstantiated.

I haven't computed how long the frame would have to be for a 1-bit error
to turn into the STOMP value, but odds are (and can be verified, if need be)
that the frame would have to be about 2**31 bits in size, which is about
32,000 times longer than a jumbo frame.
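
If need be, the single-bit case can be checked by brute force. The
sketch below is only an illustration under stated assumptions: it uses
zlib's CRC-32 as a stand-in for the frame CRC, takes STOMP to be
0xFFFFFFFF, and picks a 256-byte frame. Because the CRC is linear, the
syndrome of a flipped bit does not depend on the frame contents, so an
all-zero frame is sufficient. The script reports the number of matching
positions rather than assuming it.

```python
import zlib

STOMP = 0xFFFFFFFF   # assumed stomp pattern
FRAME_BYTES = 256    # assumed frame size for the experiment


def single_bit_syndromes(frame_bytes: int) -> list:
    """CRC difference caused by flipping each single bit of a frame."""
    base = bytes(frame_bytes)          # all-zero frame
    base_crc = zlib.crc32(base)
    syndromes = []
    for bit in range(frame_bytes * 8):
        damaged = bytearray(base)
        damaged[bit // 8] ^= 1 << (bit % 8)
        syndromes.append(zlib.crc32(bytes(damaged)) ^ base_crc)
    return syndromes


syndromes = single_bit_syndromes(FRAME_BYTES)
hits = [pos for pos, s in enumerate(syndromes) if s == STOMP]
print(len(syndromes), "bit positions checked,", len(hits), "match STOMP")
```

Raising FRAME_BYTES toward jumbo-frame sizes works the same way, only
more slowly.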

>>        I can generate a long list of STOMP values that are
>>        catastrophically BAD for the standard CRC-32 polynomial
>>        for single bit errors. The list for double-bit errors is
>>        quadratically longer. By the way, these catastrophical
>>        STOMP values are also functions of packet length. This
>>        compounds the problem of finding good STOMP values.

I'm not sure what you mean by "catastrophically BAD".
Even if the STOMP value was generated by a 1-bit error (it isn't),
then let's compute a few approximate numbers:
  averageFrameSize=256bytes=2Kbits
Then assuming any bit is equally likely to be in error:
  Probability(actualCrc==goodCrc^STOMP) is about 1/2000

I really don't think there is any problem if one frame is quietly
discarded (without a MIB update) on these rare occurrences.
If errors are occurring once-per-second, we will miss logging one
every 1/2 hour, which certainly doesn't seem "catastrophically BAD".

It's much better than logging the error at 4 locations, making it
hard to distinguish between one-link and two-link failures. In that
case, we would be falsely incrementing other MIBs once every second,
which would seem to be much worse.
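
To make the comparison concrete, here is a toy model, not the draft's
algorithm. It assumes a frame corrupted on one link is then seen by four
downstream stations, that STOMP is 0xFFFFFFFF, and that zlib's CRC-32
stands in for the frame CRC; forward() and the counters list are
hypothetical.

```python
import zlib

STOMP = 0xFFFFFFFF  # assumed stomp pattern


def forward(payload: bytes, crc: int, stations: int, stomping: bool) -> list:
    """Pass one frame through `stations` checkers; return error counts."""
    counters = [0] * stations
    for station in range(stations):
        good = zlib.crc32(payload)
        if crc == good:
            continue                  # frame is fine
        if stomping and crc == good ^ STOMP:
            continue                  # marked upstream: discard, no count
        counters[station] += 1        # first station to see the error
        if stomping:
            crc = good ^ STOMP        # mark the frame for stations downstream
    return counters


payload = b"example frame payload"
crc = zlib.crc32(payload)
# One bit flipped in the last byte, on the link into the first station.
damaged = payload[:-1] + bytes([payload[-1] ^ 0x01])

print(forward(damaged, crc, 4, stomping=False))  # [1, 1, 1, 1]
print(forward(damaged, crc, 4, stomping=True))   # [1, 0, 0, 0]
```

With stomping, only the counter next to the bad link moves, which is
what makes a marginal link identifiable.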

DVJ