As high-bandwidth carrier Ethernet services become more prevalent, skeptical clients are increasingly challenging providers and agents to prove they got the big pipe they paid for. You might think the solution is straightforward – do an online speed test, or in a private environment conduct a large file transfer test, and all is well. Not so fast.
We had a client who thought it was that easy and ended up nearly ripping out a multi-site Ethernet network, unconvinced they got what they purchased. That experience taught us that it’s vital for carriers and consultants to figure out how to explain the intricacies of high-bandwidth networks in a way that is understandable to non-technical decision-makers. It’s a lot harder than it sounds, because in reality it’s very technical.
First things first – let’s dispense with the overhead myth.
Most folks brush off client questions of capacity by explaining that there’s “overhead” that accounts for the difference between what the client sees and what they purchased. While there is indeed overhead associated with both TCP/IP and Ethernet, from a pure payload perspective that's rarely the root cause of perceived performance issues on very large circuits.
Overhead for a given frame has many elements – things like Ethernet and IP headers, error-checking bits, and timestamps. But even if you max out all of these overhead bits, the perfect network should still have almost 93% of the bandwidth available for application data.
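If you want to sanity-check the overhead figure yourself, here is a rough sketch. The byte counts are the usual textbook sizes for a full-size TCP segment on standard Ethernet and are my assumptions, not measurements from any particular circuit. Depending on which options you count, the answer lands in the low-to-mid 90s, the same neighborhood as the figure above.

```python
# Per-frame byte counts for a full-size TCP segment on standard Ethernet.
# These are common textbook values, assumed here for illustration only.
MTU = 1500               # largest IP packet in a standard Ethernet frame
ETHERNET_OVERHEAD = 38   # preamble 8 + header 14 + FCS 4 + inter-frame gap 12
VLAN_TAG = 4             # optional 802.1Q tag, often present on carrier circuits
IP_HEADER = 20           # IPv4 header without options
TCP_HEADER = 20          # TCP header without options
TCP_TIMESTAMPS = 12      # TCP timestamp option, padded to 12 bytes


def payload_efficiency(vlan: bool = True, timestamps: bool = True) -> float:
    """Fraction of on-the-wire bytes that carry application data."""
    headers = IP_HEADER + TCP_HEADER + (TCP_TIMESTAMPS if timestamps else 0)
    payload = MTU - headers
    wire_bytes = MTU + ETHERNET_OVERHEAD + (VLAN_TAG if vlan else 0)
    return payload / wire_bytes


worst_case = payload_efficiency(vlan=True, timestamps=True)
best_case = payload_efficiency(vlan=False, timestamps=False)
print(f"with every option on: {worst_case:.1%}")
print(f"bare headers only:    {best_case:.1%}")
```

Smaller packets make the ratio worse, so treat this as the best a bulk transfer can do, not a guarantee.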
What are client expectations?
Let’s say you have a client who just bought a 100Mbps Ethernet circuit. That translates to a throughput of around 11.9MB/sec (100 million bits per second, divided by 8 bits per byte, divided by 1,048,576 bytes per MB). In a perfect world 93% of the bandwidth should be available, making the theoretical max usable throughput around 11MB/sec (I know it's not perfect, but that's a discussion for another article).
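For anyone who would rather not Google it, the conversion can be sketched in a few lines. The 93% figure is the one used in this article, and counting 1MB as 1,048,576 bytes is an assumption that matches the numbers above.

```python
LINK_MBPS = 100              # purchased circuit speed, in megabits per second
USABLE_FRACTION = 0.93       # share left after overhead, per the article
BYTES_PER_MB = 1024 * 1024   # assumption: binary megabytes


def mbps_to_mb_per_sec(mbps: float) -> float:
    """Convert megabits per second to megabytes per second."""
    bits_per_sec = mbps * 1_000_000
    return bits_per_sec / 8 / BYTES_PER_MB


raw = mbps_to_mb_per_sec(LINK_MBPS)
usable = raw * USABLE_FRACTION
print(f"raw:    {raw:.1f} MB/sec")
print(f"usable: {usable:.1f} MB/sec")
```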
So your client thinks they should be able to send an 11GB file in around 17 minutes. If they have a clever IT person, that person knows it's not fair to just transfer a file to a Windows shared directory (SMB isn’t terribly efficient). So they use a freeware FTP server and put a Blu-ray rip of Avatar on it, then fire up their trusty FTP client and try to transfer it over the link. And the performance…isn’t 11MB/sec.
What’s going on here?
To understand why these types of tests will never accurately represent a big circuit’s capacity, it helps to think of the internet as a bunch of tubes (no, really). Actually, as a bunch of tubes inside another tube.
The big tube is your fat Ethernet connection. The carrier assures you its capacity is 100Mbps. Inside that tube are any number of TCP connections. Think of these as virtual circuits – small tubes inside that big tube. In order to truly test the full capacity of that Ethernet connection, you have to establish a TCP connection that fills up that big tube.
For those of you who like to read ahead to the end, I’ll save you the trouble. You can never get that little tube to fill up the big tube. The way TCP is built and the physics of network latency make it nearly impossible (in fairness, when some people say "overhead" they mean all this stuff).
Now we have to get really technical.
If your customer doesn’t believe that layman's explanation (and I don’t blame them), you’re going to have to get technical. The reason there is a limit to the size of those TCP connections has to do with how the TCP protocol was originally designed to provide reliable data transmission while still performing well.
When you send data via TCP, your computer breaks that data up into a bunch of packets then sends them out over the network to the destination. Those packets may not all follow the same path, but they all have to show up eventually at the destination so they can be put back together into that spreadsheet or Gangnam Style parody video.
The way TCP handles this is by requiring the receiver to send back an acknowledgement that it got a clean packet with no errors. If the sender doesn’t get this acknowledgement in time (a packet damaged in transit is simply thrown away, so it looks the same as a lost one), it’ll re-send it.
This back and forth would be very inefficient if it happened one packet at a time, so TCP allows a sender to negotiate with the receiver to agree on a number of packets that can be in transit at one time. This way the sender can blast out a bunch of packets and process the acknowledgements as they come in. This strategy ensures that the performance of the TCP connection is maximized.
The amount of data that the sender and receiver agree on (strictly speaking it is counted in bytes, not packets) is known as the “receive window,” and on a fast link with any real latency it becomes the limiting factor in a TCP transfer. It effectively sets the capacity for that TCP connection.
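To see why a window helps so much, here is a deliberately simplified model. It assumes no packet loss, a link fast enough that only the round trip matters, and a sender that sends one full window and then waits for it to be acknowledged. The 40ms round trip and the packet counts are arbitrary values picked for illustration.

```python
import math


def transfer_time_sec(total_packets: int, window_packets: int, rtt_sec: float) -> float:
    """Time to deliver total_packets with at most window_packets unacknowledged.

    Simplified model: each round trip delivers one full window.
    """
    round_trips = math.ceil(total_packets / window_packets)
    return round_trips * rtt_sec


RTT_SEC = 0.040   # assumed round-trip time
PACKETS = 1000    # assumed size of the transfer, in packets

for window in (1, 10, 44):
    seconds = transfer_time_sec(PACKETS, window, RTT_SEC)
    print(f"window of {window:>2} packets: {seconds:.2f} sec")
```

One packet at a time needs a thousand round trips. A window of 44 packets needs only 23, which is where the speedup comes from, and also why the window size ends up setting the ceiling.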
In its original form, TCP advertises this window in a 16-bit header field, so by default it can be no larger than 64KB. If you’re on a network with round-trip latency of 40ms, that means that your FTP transfer can have a maximum of 64KB of data flying around the network at any given point in time. Since it takes 40ms for those TCP packets to get from point A to point B and for the acknowledgement to be received, the sender can move at most 64KB every 40ms, which works out to a data transfer rate of 13.1Mbps (for those of you checking the math at home don’t forget 1KB = 1024B). Before you go cancelling that GigE connection, stick with me – it’ll all make sense soon.
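For those checking the math at home, a short sketch of the calculation follows. It also runs the numbers the other way to estimate how big the window would need to be to fill a 100Mbps circuit at the same latency. The function names are mine, and the model ignores packet loss and overhead.

```python
WINDOW_BYTES = 64 * 1024   # 64KB window, with 1KB = 1024B
RTT_SEC = 0.040            # 40ms round-trip latency
LINK_BPS = 100_000_000     # 100Mbps circuit


def max_throughput_bps(window_bytes: float, rtt_sec: float) -> float:
    """Ceiling for one connection: one full window per round trip."""
    return window_bytes * 8 / rtt_sec


def window_needed_bytes(link_bps: float, rtt_sec: float) -> float:
    """Window a single connection would need to keep the link full."""
    return link_bps * rtt_sec / 8


ceiling = max_throughput_bps(WINDOW_BYTES, RTT_SEC)
needed = window_needed_bytes(LINK_BPS, RTT_SEC)
print(f"ceiling with a 64KB window: {ceiling / 1_000_000:.1f} Mbps")
print(f"window to fill the link:    {needed / 1024:.0f} KB")
```

The second number comes out to nearly half a megabyte, several times larger than the 64KB default.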
You heard right – at 40ms of round-trip latency, a standard TCP connection can’t go faster than 13.1Mbps. Unless…
There are ways to manipulate this window and make it bigger than 64KB. I won’t go into that here, but if you really want to dig deep, these articles are a good place to start. The problem is that there are a number of components in a network system that can manipulate the window. Your OS has its own window setting, but so may the software you’re using to do the transfer. Depending on the network setup and the actual protocol you’re using to do the transfer, your routers may also manipulate the TCP traffic or may not be able to handle the tricks that are used to make the receive window bigger.
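As one small illustration of the "software has its own setting" point, here is how an application can ask the operating system for a bigger receive buffer using Python's standard socket module. The 512KB request is a hypothetical value. The OS is free to clamp, round, or otherwise adjust it, and a bigger buffer only helps if both ends and everything in between support larger windows.

```python
import socket

REQUESTED_BYTES = 512 * 1024  # hypothetical target, not a recommendation

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    # What the OS hands out before we ask for anything.
    default_size = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)

    # Ask for more. This is typically done before connecting.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, REQUESTED_BYTES)

    # Read back what the OS actually granted, which may differ from the request.
    granted_size = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)

print(f"default receive buffer: {default_size} bytes")
print(f"granted receive buffer: {granted_size} bytes")
```

Comparing the two printed values on your own machines is a quick way to see how much say the OS has in the matter.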
The bottom line is that there are a lot of reasons why your single TCP connection will have trouble breaking the 13.1Mbps barrier. And even if you do open the receive window, you still have physical limitations on the computers doing the sending and receiving. They need enough memory and fast enough disks and CPUs to keep up with the storm of TCP traffic.
As if that weren’t enough, TCP’s acknowledgements are traffic too. On a full-duplex circuit they travel in the opposite direction from the data, but as soon as the link is carrying transfers both ways, you have to share the pipe with those return packets. And we haven’t even talked about dropped packets or transmission errors.
So how can you convince your client they’re getting what they paid for?
While there are devices that can test the actual capacity of a circuit, I’m betting you don’t have one (or two, since that’s what you’d really need). My suggestion would be to explain that there are limitations to a single communication session that would cause a stand-alone test to appear to fail. To demonstrate this, you can run two transfers at the same time on two sets of computers across the same pipe. You should see each transfer max out at the same rate as one transfer would by itself, proving that the limitation is in the transfer, not in the pipe.
The reason this is important is that in the real world, traffic going over that connection isn’t going to come in the form of a single large transfer. It’s going to be dozens or hundreds of connections accessing data from any number of servers. Each of those little tubes is capped on its own, but together they can fill far more of the big tube than any single one can, so the net performance of the network will be greater. In reality, your client will indeed see most of the bandwidth they were promised.
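If it helps to show the client some numbers, the idea can be sketched with a simple model. It assumes every connection is capped by a 64KB window at 40ms, that the connections don't interfere with each other, and that 93% of the 100Mbps circuit is usable. Real traffic is messier, so treat this as an illustration and not a prediction.

```python
LINK_BPS = 100_000_000                 # 100Mbps circuit
USABLE_FRACTION = 0.93                 # share left after overhead, per the article
PER_FLOW_CAP_BPS = 64 * 1024 * 8 / 0.040   # 64KB window at 40ms round trip


def aggregate_bps(flows: int) -> float:
    """Total throughput of several capped connections sharing one circuit."""
    return min(flows * PER_FLOW_CAP_BPS, LINK_BPS * USABLE_FRACTION)


for flows in (1, 2, 4, 8, 16):
    total = aggregate_bps(flows)
    print(f"{flows:>2} connections: {total / 1_000_000:.1f} Mbps")
```

In this model, two connections move twice as much data as one, and somewhere around eight connections the circuit itself finally becomes the limit.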
If your client still doesn’t believe you, maybe Senator Ted Stevens can explain it better.