Is this just a cost efficiency thing? It only takes like 1 core to terminate 200 Gb/s of reliable bytestream using a software protocol with no hardware offload over regular old 1500-byte MTU ethernet.
If you want encrypted transport, then all you need is a parallel hardware crypto accelerator so you do not bottleneck on slow serial CPU encryption.
If you want to keep it off the memory bus, then all you need is a hardware copy/DMA engine so you do not bottleneck on slow CPU serial memcpy().
Doing a whole new bespoke network protocol in hardware seems like overkill if you are only going for 800 Gb/s.
> Is this just a cost efficiency thing?

It's not entirely, but even that would be a justifiable reason. Tail behavior of all sorts matters a lot; sophisticated congestion control and load-balancing matter a lot. ML training is all about massive collectives: a single tail latency event in an NCCL collective means all GPUs in that group are idling until the last GPU makes it.
> It only takes like 1 core to terminate 200 Gb/s of reliable bytestream using a software protocol with no hardware offload over regular old 1500-byte MTU ethernet.
The conventional TCP/IP stack is a lot more than just ~25 GB/s of memcpys at 200 GbE: there's a DMA into kernel buffers and then a copy into user memory, there are syscalls and interrupts and back-and-forth, there's segmentation and checksums and reassembly and retransmits, and overall a lot more work. RDMA eliminates all of that.
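To make the contrast concrete, here's a minimal sketch (assuming libibverbs; device setup, queue pair creation, memory registration, and connection establishment are all omitted and passed in) of the receive path once RDMA hardware owns the transport: the NIC DMAs straight into a pre-registered user buffer and the CPU just polls a completion queue, with no syscall, no kernel buffer, and no copy.

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Sketch only: assumes the caller already opened the device, created the
     * QP/CQ, registered `buf` of size `len` as `mr`, and connected the QP.
     * Error handling is abbreviated. */
    int receive_one(struct ibv_qp *qp, struct ibv_cq *cq,
                    struct ibv_mr *mr, void *buf, uint32_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,   /* NIC DMAs directly into this buffer */
            .length = len,
            .lkey   = mr->lkey,
        };
        struct ibv_recv_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
        struct ibv_recv_wr *bad_wr = NULL;

        if (ibv_post_recv(qp, &wr, &bad_wr))    /* hand the buffer to the NIC */
            return -1;

        struct ibv_wc wc;
        while (ibv_poll_cq(cq, 1, &wc) == 0)    /* poll the completion queue: */
            ;                                   /* no syscall, no interrupt   */

        return wc.status == IBV_WC_SUCCESS ? (int)wc.byte_len : -1;
    }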
> all you need is a parallel hardware crypto accelerator
> all you need is a hardware copy/DMA engine
And when you add these and all the other requirements you get a modern RDMA network :).
The network is what kicks in when Moore's law recedes. Jensen Huang wants you to pretend that your 10,000 GPUs are one massive GPU: that only works if you have NVLink/InfiniBand or something in that league, and even then barely. And GOOG/MSFT/AMZN are too big, and the datacenter fabric is too precious, to be outsourced.
I am aware of how network protocol stacks work. Getting 200 Gb/s of reliable in-order bytestream per core over an unreliable, out-of-order packet-switched network using standard ethernet is not very hard with proper protocol design. If memory copying is not your bottleneck (ignoring encryption), then your protocol is bad.
Hardware crypto acceleration and a hardware memory copy engine do not constitute an RDMA engine. The API I am describing is the receiver programming into a device an (address, length) chunk of data to decrypt and a (src, dst, length) chunk of data to move, respectively. That is a far cry from a whole hardware network protocol.
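Concretely, the interface I have in mind is just two descriptor rings. A rough sketch (all names and fields are hypothetical, not any real device's API):

    #include <stdint.h>

    /* Hypothetical sketch of the two offloads described above; the structs
     * and field names are invented for illustration only. */

    /* Receiver tells the crypto engine: decrypt `length` bytes in place at `addr`. */
    struct decrypt_desc {
        uint64_t addr;
        uint32_t length;
        uint32_t key_id;    /* which session key to use (assumed field) */
    };

    /* Receiver tells the copy/DMA engine: move `length` bytes from `src` to `dst`. */
    struct copy_desc {
        uint64_t src;
        uint64_t dst;
        uint32_t length;
    };

    /* The software protocol stays on the CPU; it only posts descriptors like
     * these and later polls a completion ring, so neither decryption nor the
     * memcpy serializes on a core. */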
> Getting 200 Gb/s of reliable in-order bytestream per core over a unreliable, out-of-order packet-switched network using standard ethernet is not very hard with proper protocol design.
You also suggested that this can be done using a single CPU core. It seems to me that this proposal involves custom APIs (not sockets), and even if viable with a single core in the common case, would blow up in case of loss/recovery/retransmission events. Falcon provides a mostly lossless fabric with loss/retransmits/recovery taken care of by the fabric: the host CPU never handles any of these tail cases.
Ultimately there are two APIs for networks: sockets and verbs. The former is great for simplicity, compatibility, and portability; the latter is the standard when you are willing to break compatibility for performance.
You can use a single core to do 200 Gb/s of bytestream in the presence of loss/recovery/retransmission, assuming you size your buffers so you never need to stall while waiting for a retransmit. That works out to roughly one bandwidth-delay product of buffering per retransmission round of the same chunk of data you want to survive at full speed.
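A back-of-the-envelope sizing, assuming a 20 µs fabric RTT (the RTT and the number of retransmission rounds are assumptions; plug in your own):

    #include <stdio.h>

    int main(void)
    {
        double line_rate_bps = 200e9;   /* 200 Gb/s */
        double rtt_s         = 20e-6;   /* assumed 20 us round-trip time */
        int    retx_rounds   = 2;       /* retransmission rounds to survive at full speed */

        double bdp_bytes = line_rate_bps * rtt_s / 8.0;    /* one BDP */
        double buf_bytes = bdp_bytes * (1 + retx_rounds);  /* keep the pipe full
                                                              across retransmits */
        printf("BDP    = %.0f KB\n", bdp_bytes / 1e3);     /* 500 KB */
        printf("buffer = %.1f MB\n", buf_bytes / 1e6);     /* 1.5 MB */
        return 0;
    }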
You can use such a protocol as a simple write() and read() on a single bytestream if you so desire, though you would probably be better served by a richer API that avoids the unnecessary copy. Usage does not need to be any more complicated than using a TCP socket, which also provides a reliable, ordered bytestream abstraction. Bytes go in, the same bytes come out the other side.
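As a sketch of what I mean (these functions are hypothetical, just illustrating the two API shapes): the copying path has exactly the shape of a TCP socket, and the zero-copy path lends the application the protocol's own receive buffer instead of copying into a caller-supplied one.

    #include <stddef.h>
    #include <sys/types.h>

    /* Hypothetical API of such a protocol; names are invented for illustration. */

    /* Copying path: same shape as read()/write() on a TCP socket. */
    ssize_t stream_write(int conn, const void *buf, size_t len);
    ssize_t stream_read(int conn, void *buf, size_t len);

    /* Zero-copy path: borrow the next in-order chunk straight out of the
     * protocol's receive buffer, then mark it consumed when done, avoiding
     * the extra memcpy. */
    ssize_t stream_peek(int conn, const void **chunk);
    void    stream_consume(int conn, size_t len);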
Now do that across thousands of connections. While retaining very low p99 latency.
Even settling for a bytestream leaves opportunity on the table: if you know which protocols you are carrying, you can allow some out-of-order delivery.
Asking the kernel or DPDK or whatever to juggle contention sounds like a coherency nightmare on a large-scale system; it is a very hard scheduling problem that a hardware timing wheel can simply handle. Getting reliability and stability at massive concurrency and low latency feels like such an obvious place for hardware to shine, and it does here.
Maybe you can dedicate some cores of your system to maintain a low-enough-latency simulacrum, but you would still have to shuffle all the data through those low-latency cores, which itself costs time and system bandwidth. Leaving the work to hardware with its own buffers and its own scheduling seems like an obviously good use of hardware, especially given the remarkably precise delay-based congestion control its closed-loop timing feedback enables: the NIC can act long before the CPU would poll or take another interrupt.
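For flavor, here is a heavily simplified sketch of delay-based pacing of the kind this tight timing feedback enables (my own toy illustration in the style of target-delay schemes, not Falcon's actual algorithm; all constants are made up):

    #include <stdio.h>

    /* Toy delay-based congestion control: grow while measured delay is under
     * a target, back off promptly when it exceeds it. Illustrative only. */
    static double cwnd = 64.0;               /* congestion window, in packets */

    static void on_ack(double rtt_us)
    {
        const double target_us = 25.0;       /* assumed target fabric delay */
        const double ai        = 1.0;        /* additive increase per RTT */
        const double md        = 0.8;        /* multiplicative decrease factor */

        if (rtt_us < target_us)
            cwnd += ai / cwnd;               /* below target: ~1 packet per RTT */
        else
            cwnd *= md;                      /* above target: back off */

        if (cwnd < 1.0)
            cwnd = 1.0;
        /* A hardware engine runs this per-ACK with nanosecond timestamps and
         * paces the next transmit immediately, rather than waiting for the
         * next CPU poll or interrupt. */
    }

    int main(void)
    {
        for (int i = 0; i < 100; i++) on_ack(18.0);   /* uncongested phase */
        printf("cwnd after low-delay ACKs:  %.1f\n", cwnd);
        on_ack(40.0);                                 /* one congested ACK */
        printf("cwnd after high-delay ACK:  %.1f\n", cwnd);
        return 0;
    }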
Then having its own Upper Layer Protocol (ULP) processors offloads a ton more of the hard work these applications need.
You don't seem curious or interested at all; you seem like you are here to put down and belittle. There are so many wins in so many dimensions here, where the NIC can specialize, do very smart things, and respond with enormous speed. I'd challenge you to try just a bit to see some upsides to specialization, versus just saying a CPU hypothetically can do everything (and where is the research showing what p99 latency the best-of-breed software stacks can achieve?).
Multipathing; delay-based congestion control; predictable performance under load (even with multi-tenancy!); built-in encryption; loss recovery that maintains in-order behavior even when doing IB Verbs; modular, pluggable upper layers that let apps use NVMe and RDMA directly; smooth operation even with >100k connections; separation of coarse-grained management from the data path for rapid iteration / programmable congestion control; ordered and unordered processing; and connection-type awareness to allow further optimization and take advantage of unordered wins.
The results section speaks clearly to their wins. A tiny drop or reorder rate causes enormous goodput losses for RoCE; Falcon hardly notices. P99 latencies blow out to a 7x slowdown over ideal at 500 connections for RoCE, versus 1.5x for Falcon. Falcon recovers from disruption far faster, whereas RoCE struggles to find goodput again. Multipathing shows major wins for effective load.
If you don't care about p99, and you don't have many connections, yeah, what we have today is fine. But this is divinely awesome stability when doing absurdly brutal things to the connection. And it all works in hostile/contested multi-tenancy environments. And it exposes friendly fast APIs like NVMe or RDMA directly (covering much of the space that AWS Nitro does).
It is also wildly amazing how they implemented this: not as its own NIC, but by surrounding an Intel E2100 200 Gb/s NIC with their own ASIC.
The related work section compares against other best-of-breed and emerging systems; worth reading to see more of the wins here.
The NIC's job these days is to keep very expensive, very hot accelerators screaming along at their jobs. Time lost to congestion, retransmits, and growing p99 latencies is an incredible waste of capacity, time, and power. Falcon radically improves the odds that traffic completes its transit, which keeps those many kilowatts of accelerators fed with the data and results they need to keep crunching.
All you have said is that RoCE is bad. If it is as you say, then I guess? Not really relevant though.
I am saying that a properly designed and implemented software network protocol gets you capabilities similar to Falcon's stated capabilities. As such, while the feature set is not bad and the throughput is fine, neither is particularly unique or impressive absent some cost or power efficiency from doing it in hardware, which I saw no mention of in the paper.