Dan Bricklin's Web Site: www.bricklin.com
Why We Don't Need QOS: Trains, Cars, and Internet Quality of Service
Common sense argues against widespread use of QOS techniques on the Internet. It is better to just get more capacity.
I keep reading how all sorts of applications for the Internet will require new, "Quality of Service" (QOS) features in order to work. There is this feeling that special handling of "the right traffic" will be required to make things work. I believe that there are some dangerous flaws in this type of thinking. Not only can QOS make the Internet worse for applications we need and depend upon, it won't be able to help deliver on its promise. I also believe that while the need for QOS sounds intuitive at first, once you think about it a bit, you'll see that common sense argues against it in the context of the Internet.
There are a variety of arguments in favor of QOS. One is that for applications like voice (well, voice is the main application cited, I believe) the fact that IP connectivity does not "guarantee" you delivery of data with a known delay makes it an inappropriate medium for those applications. "For voice, we must have QOS" seems to be the feeling. "With a dedicated circuit over a wire, we have QOS -- you know there won't be those unacceptable drop outs or delays. We must have the equivalent for unreliable connections like the Internet to be acceptable for voice." Other arguments for QOS revolve around "value". If my "valuable" voice traffic has to share the pipes equally with large file downloads (of maybe -- Horrors! -- shared music or pornography) it would be unfair. The "good" traffic should have priority. The telephone and cable-TV people are used to thinking in terms of circuits and value pricing, so this viewpoint is understandable.
Learning from an everyday analogy
To help us understand the ideas of QOS, let's look at another field most of us have observed at close hand: cars and roads. One-lane roads are kind of like circuits, with cars traveling down them kind of like data packets.
In the early days of mechanical transportation (trains), where the "road" was the most expensive part, all sorts of carefully choreographed schedules had to be worked out to make the trains run on time. Any changes would require coordinating with the whole operation. An unexpected priority train was not something you could do lightly. In urban public transportation, careful coordination let many trains work at once, with special help from indicator lights to keep trains the minimum distance apart to maximize utilization.
When automobiles came along, with their ability to go "off road", various protocols were worked out for letting multiple vehicles share a road. As long as there weren't too many cars, this worked fine. Eventually, multilane roads were needed. Drivers could self-organize by changing lanes, and traffic could work with only simple rules.
With cars, there is a QOS issue everybody is familiar with: Emergency vehicles. We all know that when an emergency vehicle comes upon us, it is our duty to move our car out of the way to let that vehicle go by. How do we know it's an emergency vehicle? There are special types of lights and sounds that make it very clear. Only "authorized" vehicles may use these signals and normal cars are not provided with them.
A bus changes lanes to let a fire truck go through an intersection, then two cars pull over to the side of an uncrowded one-way street to let it go by as a very loud siren sounds
How does this work in practice? If a road is pretty empty, the emergency vehicle doesn't have to do anything special, and just goes its way. If the road is pretty full, then the algorithm works fine: The emergency vehicle gets through at an OK speed, with only minor delays for others (who know that it could be them in that ambulance next time or are happy the police car isn't chasing them). It doesn't work as well in a very crowded situation (think of a backed up tunnel). The emergency vehicle crawls through (if at all), but at least it does better than most "normal" cars. If all of the vehicles are emergency vehicles, they will all get backed up. If there is an emergency that isn't in an "authorized" vehicle, things get murky. For example, a policeman responding to a burglary will get through but a normal car racing to the hospital with someone having a heart attack may not. Other issues: When is it "OK" for a policeman to use those indicators and when is it not? Can I pay for the privilege to get priority?
What we learn from this simple analysis is that the basic "emergency vehicle QOS" algorithm has the following properties: It is only for situations that are determined by society to be of general benefit, it is unnecessary in situations where there is lots of capacity on the roads, it works well when there is moderate congestion, it works less well (but better than nothing) when there is heavy congestion, and it works only when a very small percentage of the traffic is emergency. It doesn't help when most of the traffic is "emergency" and there is heavy congestion.
Rush hour with little room to pull over
Let's look at this graphically:
Figure 1: Quality of Service techniques have most benefit in a narrow range as you start using all of capacity
This makes great sense in cities, where road capacity is very expensive to increase, and where traffic congestion is often well above the region on the graph where you don't need help from a QOS algorithm but usually doesn't get far above the knee in that curve. Emergencies are rare (compared to general traffic), and trained people with the public good in mind make decisions about when to take advantage of the special right of way.
Note that this "pull over to the side when you hear a siren and see a flashing light" algorithm doesn't scale very well. It works at normal driving speeds, with long fields of view or very slow speeds, and space to move over, with today's automobiles. The siren technique that works well in dense areas works less well on highways, and the flashing lights technique doesn't work as well in commercial areas with other flashing lights. Emergency vehicles use both (and additional tricks like mirror-image "Emergency" signs, etc.).
Generalizing and applying to the Internet
What have we learned about QOS here?
It isn't needed when there is enough capacity.
It works well when the priority cases are a small percentage.
It doesn't help when you run out of capacity or have too many of the priority cases.
The algorithms are specific to the situations, technologies, etc.
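These observations can be illustrated with a toy fluid model of a single shared link. This is only a sketch under stated assumptions, not a description of any real router: the function name `backlog_after`, the per-tick rates, and the rule that a link without QOS shares its capacity in proportion to what is waiting are all hypothetical choices made for illustration.

```python
def backlog_after(ticks, capacity, priority_rate, normal_rate, use_priority):
    """Return (priority_backlog, normal_backlog) after `ticks` time steps.

    Assumed model: each tick, new traffic arrives at fixed rates and the
    link can carry at most `capacity` units. With use_priority=True the
    priority traffic is served first; otherwise capacity is split in
    proportion to how much of each kind is waiting.
    """
    pri = norm = 0.0
    for _ in range(ticks):
        pri += priority_rate
        norm += normal_rate
        if use_priority:
            served_pri = min(pri, capacity)
            served_norm = min(norm, capacity - served_pri)
        else:
            total = pri + norm
            share = min(1.0, capacity / total) if total else 0.0
            served_pri = pri * share
            served_norm = norm * share
        pri -= served_pri
        norm -= served_norm
    return pri, norm


# Enough capacity: nothing waits, with or without priority handling.
print(backlog_after(10, 100, 5, 50, use_priority=False))
# Moderate overload, few priority cases: priority traffic is protected.
print(backlog_after(10, 100, 5, 110, use_priority=True))
# Too many priority cases: even the priority traffic backs up.
print(backlog_after(10, 100, 120, 10, use_priority=True))
```

In this model the priority rule changes the outcome only in the middle case, which is the narrow range the list above describes.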
Now let's look at the Internet. Just like with cars, help from QOS techniques isn't needed if there is enough capacity. The main time it could be of benefit is when utilization is high, but not too high.
Unlike cars, there may be less agreement on how to decide which packets to give priority and which not. (Is an Instant Message to a help desk from a doctor, or a radiologist transferring an MRI image, more or less important than a phone call about going out for pizza? What about an IM about going out for pizza or a porn image?) You can't just use paying as a proxy for value to society (pornography often involves payment and would have no problem competing with pictures emailed from a new parent to a grandparent). With automobiles, we rarely use the emergency priority, and then we have a trained person making the determination knowing the situation. That is not feasible, nor desirable, for the Internet.
If the capacity of the Internet were fixed, and the utilization were fixed, much like the roads in a city, growing at a slow, known rate, then perhaps QOS could be helpful (if we could figure out when to use it). But the capacity of the Internet is constantly growing, with miles and miles of unused fiber waiting to be lit. Any work done on QOS will just help for a little while until things get too congested, then the QOS algorithms will do more harm than good, and they probably won't scale to new transport technologies, nor new uses of IP connectivity. If the same effort went into increasing capacity, then the QOS would be unnecessary.
Working on QOS only helps a small percentage of traffic work a bit better for a small increase in utilization. Routinely, we have increased capacity by factors of 10 through new technology, and there is no reason to believe this will stop. Complicated QOS techniques can slow the creation of newer, higher capacity transports that would lessen the need for QOS in the first place.
Here are some graphs that illustrate this idea. They assume that traffic needing QOS is a small percentage and easily distinguished -- both poor assumptions.
Figure 2: We start out with a 100Mb/second network with no QOS. It is very good for a wide variety of traffic up to, let's say, 30Mb/second, and really bogs down at 50Mb/second.
Figure 3: We add a theoretical QOS algorithm to the 100Mb/second network. For our "preferred" traffic, it is good up to, let's say, 50Mb/second, but bogs down over 90Mb/second (or less if too high a percentage of the traffic is "preferred").
Figure 4: Instead of adding QOS to our initial case, we upgrade to a faster, 1Gb/second network (either through new technology or using more of the same technology -- IP connectivity does not need a single circuit). Now we have a good transport for up to 300Mb/second, and it only bogs down at 500Mb/second.
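The three figures can be put side by side in a few lines of Python. This is a minimal sketch that simply encodes the "let's say" thresholds from Figures 2 through 4; the class name `Network`, the function `compare`, and the three condition labels are hypothetical names chosen for illustration, and the numbers are illustrative, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Network:
    name: str
    good_up_to: float    # Mb/second below which service is good
    bogs_down_at: float  # Mb/second at which service really bogs down

    def condition(self, load_mbps: float) -> str:
        if load_mbps <= self.good_up_to:
            return "good"
        if load_mbps < self.bogs_down_at:
            return "degrading"
        return "bogged down"


# Thresholds taken from the figure captions above.
NETWORKS = [
    Network("100Mb/s, no QOS", 30, 50),
    Network("100Mb/s, QOS (preferred traffic only)", 50, 90),
    Network("1Gb/s, no QOS", 300, 500),
]


def compare(load_mbps: float) -> dict:
    """Map each network's name to its condition at the given load."""
    return {n.name: n.condition(load_mbps) for n in NETWORKS}


for load in (20, 40, 80, 200, 400):
    print(load, compare(load))
```

QOS helps the preferred traffic only for loads between the two 100Mb/second thresholds, while the faster network stays good for all traffic well past the point where both 100Mb/second cases have bogged down.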
This reminds me of the arguments against Ethernet-style networks, where they bog down when there is too much traffic. (In its simplest form, Ethernet lets nodes on the network transmit whenever they want. If two happen to talk at the same time and "collide", confusing the receivers, both nodes wait a random amount of time and try again. Too many collisions add overhead and slow everything down.) It turns out that when an Ethernet is lightly loaded, collisions are rare, and traffic flows freely. For up to 30% of "capacity", the non-deterministic system acts very much like dedicated wires and there are studies that say it is quite useful at even higher levels. Is it worth it to find better algorithms for the 30%-60% utilization case that will let us carry a bit more traffic, or is it better to find faster ways of moving all traffic, moving us from 1Mb/sec, to 5Mb, to 10Mb, to 100Mb, to 1Gb, etc., which open up new applications at the same time? I think the answer is clear: Go for more capacity rather than handling the narrow advance from dealing with congestion. That's what worked for Ethernet.
In the days of the old telephone and television systems, infrastructure was slowly upgraded. "Demand" was known, and things were all sized to make the most of expensive switching equipment and wires to handle the high cases (certain holidays?). Under unforeseen load (during an emergency or TV-show call-in, or unexpected demand on a particular exchange such as when WordPerfect shipped a buggy upgrade to the entire USA) the system would bog down. Complex algorithms could be implemented because the characteristics and growth of use (voice conversations among living people) were relatively fixed and well understood, and factors of 2 were very meaningful. As I pointed out in a previous essay, all of this work still doesn't guarantee communications when you want it. The biggest problem in communicating was not a connection with long delays but rather the person not being near the phone, and that got solved by a solution at the end: answering machines. The fact that broken wires and malfunctioning equipment are common is forgotten when one talks about the advantage of dedicated lines vs. "unreliable" IP connectivity.
The Internet provides a simple service (IP connectivity) that is well suited to being transported by a wide range of technologies. Unlike with automobiles, advances in many areas come together to increase capacity such that factors of 2 happen every year or so. The dynamics of each application are not as well understood, and new applications are constantly being added. Growth in demand is very great, and is much faster than population growth. Rather than get a bit more out of what we have today, it is much better to just get much more capacity.
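The claim that collisions are rare on a lightly loaded network can be checked with a little arithmetic. The sketch below is a deliberate simplification and not real Ethernet: it assumes time is divided into slots and that each of `nodes` stations independently transmits in a slot with probability `p`, ignoring carrier sensing and the random wait after a collision. The function name `slot_outcomes` is hypothetical.

```python
def slot_outcomes(nodes: int, p: float) -> dict:
    """Probabilities that a slot is idle, carries one frame, or collides.

    Assumed model: every node transmits in a given slot with probability p,
    independently of the others. Two or more transmitters means a collision.
    """
    idle = (1 - p) ** nodes
    success = nodes * p * (1 - p) ** (nodes - 1)
    collision = 1 - idle - success
    return {"idle": idle, "success": success, "collision": collision}


for p in (0.01, 0.05, 0.3):
    out = slot_outcomes(10, p)
    print(f"p={p}: " + ", ".join(f"{k}={v:.3f}" for k, v in out.items()))
```

With ten nodes that each rarely transmit, collisions take up well under 1% of slots in this model. When every node tries to transmit 30% of the time, most slots are lost to collisions.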
-Dan Bricklin, 30 July 2003
I received this from Prof. Jerry Saltzer (he is one of the authors of the End-to-End Argument paper, which set out one of the key organizing principles of the Internet; he also developed RUNOFF, the ancestor of many on-line text formatting systems, and taught me at MIT):
I think the general line of argument and the analogy is quite useful. But there is one fact that you need to somehow incorporate in the discussion.
I have heard that the telephone system has at least one QOS feature that does not conform to your model of how things break down when there is too much low QOS traffic. If there is an earthquake in Los Angeles, and everyone who survived picks up their phone to call and reassure their relatives in other states, the telephone system will bog down and there may be 30-minute waits for a dial tone. But if someone at the police department picks up a phone to call the hospital a dial tone will appear immediately. The phone system provides a strict priority system intended to help assure that if there is still a path through the network, certain customers will be able to use it to the exclusion of others.
This scheme obviously does not scale very well, but it does not need to scale to directly address an important public policy concern.
The analogy in the packet world may be that there is no point in providing elaborate QOS features, but there may be a public policy need for a brute force, extremely simple, non-scalable QOS mechanism.
I agree with Prof. Saltzer that we need to think through public-safety issues to provide at least a small amount of connectivity to "approved" end points using as simple and basic a way as possible. As Bob Frankston has pointed out to me, we have to be careful about relying too much on any special cases, because they are often more brittle than the "standard" systems they try to work around (as we found out on September 11, 2001, when the Internet handled the load better than the "special" dedicated phone systems). Also, "special" systems are great targets for the "bad guys".
Simple QOS system in New York City, with special lanes marked for use by fire trucks
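A brute-force, strict-priority scheme of the kind Prof. Saltzer describes might look something like the following in the packet world. This is only a sketch under assumptions: the class `StrictPriorityLink`, its `capacity_per_tick` parameter, and the idea of a boolean `approved` flag are hypothetical, and the sketch says nothing about how an end point gets to be "approved".

```python
import heapq
from itertools import count


class StrictPriorityLink:
    """A link that always sends packets from approved senders first."""

    def __init__(self, capacity_per_tick: int):
        self.capacity = capacity_per_tick
        self._heap = []
        self._seq = count()  # preserves arrival order within a priority class

    def enqueue(self, packet, approved: bool = False) -> None:
        priority = 0 if approved else 1  # lower number is served first
        heapq.heappush(self._heap, (priority, next(self._seq), packet))

    def tick(self) -> list:
        """Send up to `capacity` packets, approved ones first."""
        sent = []
        while self._heap and len(sent) < self.capacity:
            _, _, packet = heapq.heappop(self._heap)
            sent.append(packet)
        return sent


link = StrictPriorityLink(capacity_per_tick=2)
for name in ("call-1", "call-2", "call-3"):
    link.enqueue(name)
link.enqueue("police-to-hospital", approved=True)
print(link.tick())  # the approved packet jumps the queue
print(link.tick())
```

Like the telephone scheme, this does not scale: if most of the traffic is marked approved, it behaves exactly like an ordinary queue.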
Bob also points out that the Internet doesn't clog up the same way telephone "circuits" do, where you have to wait for someone to hang up before the next line becomes free. It slows down more like traffic, where it moves at a crawl, but some things eventually get through.
As I see it, a more important issue is the one of "graceful degradation" of applications as connectivity gets congested. Applications that relate to public safety must be constructed in such a way as to assume that connectivity might become bogged down, and deal with it in as helpful a way as possible. For example, they could switch from "live" voice to "store and forward" Instant Messaging (voice and/or text) as bandwidth drops or delays become long. Drop outs should be flagged, so you know the message is "Don't go up the back stairs", not "...Go up the back stairs". Dealing with changing conditions like this was difficult in the old, analog/mechanical equipment days, with nothing more than amplifiers, levers, and gears. Today we have enormous amounts of computing power and storage at the edges. Error-correcting codes are routinely used, as are buffered receivers (such as those used with the low-quality video live from Iraq). They are just simple solutions.
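Both ideas, falling back to store and forward and flagging drop outs, are easy to sketch at the edge. The code below is a hypothetical illustration, not a real voice application: the function names, the 64 kb/second and 300 ms thresholds, and the `[DROPOUT]` marker are all assumptions picked for the example.

```python
def choose_mode(bandwidth_kbps: float, delay_ms: float,
                min_live_kbps: float = 64, max_live_delay_ms: float = 300) -> str:
    """Pick live voice only while the connection can sustain it.

    The default thresholds are illustrative assumptions, not measured values.
    """
    if bandwidth_kbps >= min_live_kbps and delay_ms <= max_live_delay_ms:
        return "live voice"
    return "store and forward"


def render_with_dropouts(segments, marker: str = "[DROPOUT]") -> str:
    """Join received segments, marking each missing one (None) explicitly."""
    return " ".join(marker if s is None else s for s in segments)


print(choose_mode(bandwidth_kbps=128, delay_ms=100))
print(choose_mode(bandwidth_kbps=16, delay_ms=100))
# The first segment ("Don't") was lost in transit, and the listener can see that.
print(render_with_dropouts([None, "go up the back stairs"]))
```

The receiver cannot recover the lost word, but it can make sure nobody mistakes the remainder for the whole message.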
-Dan Bricklin, 2 August 03
© Copyright 1999-2014 by Daniel Bricklin
All Rights Reserved.