Hi Paul,
Here are brief comments on your email:
1) Regarding data-path selection for “transcoding + RS encoding”, I would suggest you refer to the data flow on page 8
of the following presentation for 100GBASE-KR4:
http://www.ieee802.org/3/bj/public/may12/brown_01a_0512.pdf
Note: a) Transcoding reduces the data rate by a factor of 8×65/513 (i.e., the output rate is 513/520 of the input rate). Thus using the same bus width (65 bits) for both input and output is not appropriate.
b) The RS encoder works symbol by symbol, i.e., its input should be a multiple of 8 bits in our case. Using a 65-bit bus for the RS encoder is
clearly not a good design.
In brief, it is not trivial to select the data bus widths at the different points along this data flow.
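As a hedged illustration of the two notes above (the constants 8, 65, and 513 are from the cited data flow; the arithmetic itself is only a sketch):

```python
# Illustrative arithmetic only (numbers taken from the notes above):
# why a single 65-bit bus width cannot serve both sides of the transcoder.

BLOCKS_PER_ROW = 8      # eight 65-bit blocks are transcoded together
BLOCK_BITS = 65         # 64b/65b block size
TC_ROW_BITS = 513       # transcoded row size

# Transcoding shrinks the data rate: the output runs at 513/520 of the input.
ratio = TC_ROW_BITS / (BLOCKS_PER_ROW * BLOCK_BITS)
print(f"output rate / input rate = {ratio:.4f}")

# A 65-bit word does not align to 8-bit RS symbols:
print(f"65 % 8 = {65 % 8}  -> leftover bit on every 65-bit transfer")
```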
2) RS code is an error-correcting code. The key point is that it involves an “error correction” process.
After the syndrome check, we must perform “Chien search + Forney algorithm” to correct possible errors in the input data.
The data that you referred to as “available” are actually not yet clean; they must go through the “error correction” process
before being sent out. That is why I said the RS decoder outputs its data over multiple cycles.
On the other hand, we can move the first row (i.e., “the extra row” in your email) to the last position at the Tx side, as shown below (also see the attached PDF file).
This guarantees the latency saving regardless of the VLSI implementation on the Rx / RS-decoder side.
<image002.png>
In brief, the savings in HW complexity and processing latency for the new TC scheme are real.
People can have different opinions on whether those savings are significant or not.
It may now be too late to adopt the new TC scheme for the 40GBASE-T standard.
But I hope we can fully understand the benefits of the new scheme; in that case, we could consider it for
other Ethernet standards in the future.
Thanks.
Dear All,
Regarding Zhongfeng’s comments on my presentation about the 40G transcoding:
1)
The question of MUX’ing in Zhongfeng’s proposed scheme comes down to how to handle the datapath through the PHY. My analysis assumes that you would want to run a 64+1-bit datapath clocked on the XLGMII clock.
The problem with the scheme proposed by Zhongfeng is that it introduces the concept of an extra row to send, and in order to save the 7× 1:8 byte MUXes (an insignificant number of gates in current semiconductor processes), you either have to run
a 9/8× faster clock on the 64+1-bit datapath, or you have to transfer this new row in parallel with un-MUXed data, resulting in a 128+1-bit datapath – both undesirable options.
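As a rough sketch of these two options (the 40 Gb/s rate and 64-bit bus width here are illustrative assumptions used only to put numbers on the trade-off):

```python
# Back-of-the-envelope numbers for the two undesirable options above.
# The 40 Gb/s line rate and 64-bit datapath width are assumptions.

RATE_GBPS = 40
BUS_BITS = 64

base_clk_mhz = RATE_GBPS * 1000 / BUS_BITS   # baseline clock for a 64-bit bus
faster_clk_mhz = base_clk_mhz * 9 / 8        # option (a): run the clock 9/8x faster

print(f"option (a): {faster_clk_mhz:.3f} MHz on the 64+1-bit datapath")
print(f"option (b): {base_clk_mhz:.1f} MHz, but on a 128+1-bit datapath")
```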
2)
With respect to the question of latency and the suggestion that 513 bits will not be simultaneously available in a “proper/reasonable” RS decoder: the assumption here is that we are doing error location via a Chien search,
and until the search is complete and the correction syndrome checked, we won’t know whether the correction passed or failed – i.e., for T or fewer errors it will pass, for T to 2T errors it will fail, and for more than 2T errors it may fail. Consequently, all of the
bits are available at the time the syndrome is checked.
Regards,
Paul Langner
Dear All,
I missed this interim meeting due to a personal vacation in China (a 12-hour time difference from Florida).
I got Paul’s presentation from Tom.
I had a quick look, but I found his analyses of the alternative transcoding scheme are
not correct.
1) As I stated on page 8 of my May presentation (see attached), the new transcoding scheme does NOT need
muxing logic for the remaining 7 bytes of each row, whereas conventional transcoding DOES need muxing logic for all bytes.
Here’s a simplified example:
Input: X[0:99];
Output: Y[0:99];
Y[0:19] = X[80:99];
Y[20:99] = X[0:79];
In this case, output and input bits have a fixed 1:1 mapping, so no muxing logic is needed for the data conversion.
This explains why those 7 bytes (in each row) do not need muxing logic.
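A minimal sketch of that mapping (illustrative Python; the ranges are inclusive as written in the example, while Python slices are exclusive at the upper end):

```python
# Each output bit is tied to exactly one fixed input bit, so in hardware
# this is pure rewiring: no select signals, hence no muxing logic.

def remap(x):
    """Y[0:19] = X[80:99]; Y[20:99] = X[0:79] (100-bit toy example)."""
    assert len(x) == 100
    return x[80:100] + x[0:80]

x = list(range(100))        # label each input bit by its index
y = remap(x)
assert y[0:20] == list(range(80, 100))   # Y[0:19] comes from X[80:99]
assert y[20:100] == list(range(0, 80))   # Y[20:99] comes from X[0:79]
print("fixed wiring: each Y bit maps to exactly one X bit")
```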
2) Regarding latency: although all the input data for RS decoding have been received,
the RS decoder outputs the corrected data over multiple cycles.
Since our throughput is only 1 Gbps, we only need to output 9 bits (= 1 RS symbol) per 9 ns (roughly).
Assuming a 375 MHz clock, that is about one corrected RS symbol (9 bits) every ~3 cycles.
Obviously we will not get all 513 corrected bits in one clock cycle in a proper/reasonable VLSI design.
In a brute-force design, we could use massively parallel processing (with linearly increased HW complexity and peak power) so that
we output 513 or more corrected bits in one clock cycle.
In that case, the RS decoder has a raw throughput of 375 MHz × 513 bits ≈ 192 Gbps > 180 Gbps. This is obviously over-design and consumes
unnecessary HW.
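The arithmetic behind this point, as a sketch (clock speed and symbol size as stated above):

```python
# Required RS-decoder output rate vs. a fully parallel (513 bits/cycle) design,
# using the figures quoted above.

LINK_RATE_GBPS = 1.0    # payload throughput discussed above (1 Gb/s)
CLK_MHZ = 375           # assumed decoder clock
SYMBOL_BITS = 9         # one RS symbol
ROW_BITS = 513          # one transcoded row

# Needed: 1 Gb/s => 9 bits every 9 ns; at 375 MHz that is ~3.4 cycles/symbol.
ns_per_symbol = SYMBOL_BITS / LINK_RATE_GBPS     # Gb/s == bits per ns
cycles_per_symbol = ns_per_symbol / (1e3 / CLK_MHZ)
print(f"~{cycles_per_symbol:.2f} cycles per corrected symbol suffice")

# A fully parallel output instead delivers:
full_parallel_gbps = CLK_MHZ * 1e6 * ROW_BITS / 1e9
print(f"513 bits/cycle at 375 MHz = {full_parallel_gbps:.3f} Gb/s (> 180 Gb/s)")
```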
I’m copying this email to a few VLSI implementation experts. Hopefully they can answer any further questions from you on this matter
in a timely manner (it is late at night where I am).
Note: After my presentation in May, I basically gave up the effort to push the improved scheme into the 40GBASE-T standard, since
I knew it takes much more effort than technical analysis alone.
However, for the sake of scientific and technical truth, I want to make this further effort to explain my original scheme.
I truly believe that no one in this community wants a wrong analysis to be recorded in IEEE history forever.
Thanks for all of your attention.