Wednesday, October 12, 2011

Why wireless (802.11) roaming is a nightmare (and why CCX can help)

Part 1: the nightmare

Have you ever suffered from roaming issues, had your VoWLAN call disconnected when jumping from one AP to another, and wondered why... why do some devices roam just fine while others roam poorly, dropping packets if not the entire session? Are some vendors so incompetent that they don't know how to implement a proper roaming algorithm? Probably not. Roaming algorithms are always a trade-off between conflicting needs. This article explains what happens when your wireless device decides to roam, and which choices make your roaming experience a non-event or a nightmare.
Notice that this article is full of notes designed to add to your knowledge. You can skip the notes if you only care about the roaming and scanning issues.

Your wireless device and its environment

To understand how roaming happens, you have to put yourself in the shoes of your wireless device (yes, wireless devices have shoes sometimes). For your wireless device, the world is full of unknowns. It does not have the nice administrator view of the entire wireless infrastructure. If you were the wireless device, the only things you would know would be:
  • that there is one AP,
  • that communication with this AP is possible.
You know a lot about yourself, but you don't know much about this AP. You know:
  • the AP channel,
  • the BSSID (the AP MAC address associated to the SSID, or network name, your user configured you to join).

As you receive frames from the AP, you analyze the frame RSSI and deduce the data rate you could use to send unicast frames back to the AP. You know your current power level, but have no real idea of the AP's power level.

Suppose that you send a frame and it is not acknowledged.... you start asking yourself many questions:
  • Is it that your power is too low and the AP did not hear you?
  • Should you resend at the same data rate with a higher power level, if possible?
  • Was there a collision because someone else sent at the same time? Should you just wait an EIFS (extended interframe space, used when a collision is detected), and resend the frame at the same data rate, same power level?
  • Did the user move away from the AP, and is your current modulation/data rate not adapted anymore?
  • Should you resend the frame at a lower data rate? With the same power level? Or a higher power level, to be safe?
  • Are you getting to the edge of the cell? Is there a cell edge?
  • Did the administrator determine an area where you are not supposed to be in the current cell anymore?
  • How is this edge determined? Based on AP power limitation? Based on physical obstacles (door, wall, etc. Remember that you do not see, so you do not know if your user brought you behind a wall or not)? Based on allowed rates limitation?
  • If there is a cell edge and if you are getting close to it, should you start to scan for another AP?

Based on these 4 possible scenarios (power too low for the current distance to the AP, collision, increased distance to the AP, edge of the cell reached, with or without a potential other AP), your wireless client driver has to make a decision on what to do next...

Power level or data rate?

For all these possibilities, power level is a critical issue. If you are a mobile wireless device, your ability to conserve battery energy and spend it sparingly is what makes you popular (who wants to buy a VoWLAN phone with a one hour battery life if another brand offers the same type of VoWLAN phone with a 6 hour battery life?). For this reason, it is common to see wireless devices make initial power level decisions when joining a cell. Based on the AP RSSI and SNR, an internal algorithm decides on the right power level. For example, suppose your power level ranges from 1 mW to 40 mW. Your user turns you on, and you are set to automatically join an SSID. You send a probe request at the lowest mandatory rate you support and maximum power (40 mW).
This is a mandatory requirement as per the 802.11 standard. Stations discovering the network by sending a probe request send the request at the highest possible power and the lowest possible data rate. This ensures that the request is heard as far away as possible.
This is a probe request:

The AP is expected to answer with a probe response:

A probe response has the same format as a beacon. The only difference is that a beacon contains an additional field that the probe response does not (the TIM, or Traffic Indication Map, which lists the stations for which the AP has buffered traffic); all other fields are the same.

You receive a probe response from the AP and you determine that this frame was received at -37 dBm RSSI and 41 dB SNR. Your internal power determination algorithm immediately thinks: “Wow! These are awesome conditions! I must be very close to the AP! I bet I can send and be heard at 5 mW!”. You then start sending your next frames at 5 mW, and check your success rate. This success rate is often determined in terms of PER (packet error rate): how many frames get dropped and not acknowledged when I use this power level? If there are significantly more drops than at the previous power level (and my vendor-specific power level decision-making algorithm is going to tell me how much “significantly more” is), I might need to increase my power level until the packet error rate falls below an acceptable threshold.
Depending on the vendor, this power-change test is made often... or not. If your priority is to conserve battery power, you may want to make this test and lower your power level when first joining the cell, then keep the power low as long as you can. This means that if you determined a comfortable power level and your packet error rate starts to increase, you may choose to revert to a lower data rate, at the same power level, rather than increase your power level.
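This PER-driven power selection can be sketched as a small loop. Everything here is illustrative: the `send_frame` driver hook, the 10% PER threshold, and the 20-frame probe window are all invented for the example, not taken from any real driver.

```python
def choose_power_level(send_frame, levels_mw=(1, 2, 5, 10, 20, 40),
                       per_threshold=0.1, probe_frames=20):
    """Pick the lowest transmit power whose packet error rate is acceptable.

    `send_frame(power_mw)` is a hypothetical driver hook that transmits
    one test frame at the given power and returns True if it was ACKed.
    """
    for power in levels_mw:                      # try the lowest power first
        lost = sum(1 for _ in range(probe_frames) if not send_frame(power))
        per = lost / probe_frames                # observed packet error rate
        if per <= per_threshold:                 # good enough: keep this level
            return power
    return levels_mw[-1]                         # nothing worked: max power
```

A battery-hungry driver would run this once when joining the cell and then avoid re-testing; a more adaptive driver would re-run it whenever the PER drifts.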

To scan or not to scan

If you send a unicast frame to the AP that does not get acknowledged, your first move is probably to give that frame a second chance. You wait an EIFS, pick a random number, count down from there, then resend the frame a second time, at the same power level and same data rate.
What if the frame does not get acknowledged a second time? It is time to think. Here again, each vendor's proprietary algorithm will determine how you think:
  • Should you try a third time? 
  • Should you revert to a lower, more robust, data rate? 
  • Should you increase your power level? 
  • Should you start scanning for other APs? Are there any other APs? 
If you are in a SOHO environment with only one AP, scanning is simply a waste of battery energy... and nothing tells you what type of environment you are in (the user of the wireless device might know, but the wireless device itself has no clue). You will not know about other APs until you start scanning the other channels... which consumes time and energy, maybe just to discover that there is no other AP, or that the other APs do not serve your SSID.
Reverting to a lower data rate may be a safer solution from a battery conservation standpoint.


The first algorithms that were implemented with this type of logic are commonly called SRAs: sticky roaming algorithms. With an SRA, you try to hang on to the current AP as much as you can, lowering your data rate down to the lowest rate if you have to.
This is usually done one data rate at a time. For example, if you were transmitting at 48 Mbps, you would first revert to 36 Mbps, and try that data rate (for one or several frames, depending on your internal proprietary algorithm). If 36 Mbps does not provide a satisfactory loss rate (i.e. the loss rate, or packet error rate, is still above your internal algorithm's acceptable level), you would try 24 Mbps, etc.
If reverting to lower data rates is not enough, lab experiments determined that increasing your local power level would be less energy-costly than jumping into the unknown of scanning and roaming. So once you reached your lowest possible data rate, you would increase your power level to keep your packet error rate below the acceptable level.
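One step of this sticky logic (rate down first, power up only once the bottom of the rate ladder is reached) can be sketched as follows. The rate and power ladders and the PER threshold are invented for illustration; real drivers keep these values secret.

```python
OFDM_RATES = [54, 48, 36, 24, 18, 12, 9, 6]   # Mbps, high to low
POWER_STEPS = [5, 10, 20, 40]                  # mW, an illustrative ladder

def sticky_fallback(rate, power, per, per_limit=0.1):
    """One decision step of a sticky roaming algorithm (SRA), as a sketch.

    If the observed PER is too high, first step down the data rate;
    only once the lowest rate is reached, step up the transmit power.
    Returns the (rate, power) pair to use next.
    """
    if per <= per_limit:
        return rate, power                     # link is fine, change nothing
    i = OFDM_RATES.index(rate)
    if i + 1 < len(OFDM_RATES):                # a lower rate exists: use it
        return OFDM_RATES[i + 1], power
    j = POWER_STEPS.index(power)
    if j + 1 < len(POWER_STEPS):               # already at 6 Mbps: raise power
        return rate, POWER_STEPS[j + 1]
    return rate, power                         # nothing left: time to scan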
Notice that some vendors implement an "intelligent SRA" algorithm that takes the AP signal into account. For example, if the AP RSSI was -41 dBm and suddenly drops to -71 dBm, the "intelligent algorithm" would determine that increasing the power level is needed even before lowering the data rate. Each vendor has a table of specifications that determines what data rate is possible at what power level, for example here.

From an admin standpoint, these clients are the sticky clients, ill-adapted to enterprise environments. These clients stick to their old AP, even if they are far away from it and just below another AP that would provide far better performance. But understand that from the client's perspective, clinging to the AP is a matter of survival, to preserve battery power. Such a client will start scanning only as a last resort, because it does not know whether there are any other APs out there, and does not expect any other AP anyway.

Better algorithms

But is this clinging behavior really conserving battery? If you are transmitting at 54 Mbps, for example, sending 2346 bytes of data in a frame takes you about 350 microseconds, while transmitting the same frame at 1 Mbps takes almost 19 milliseconds. Simple math shows that your radio is on roughly 50 times longer when you send at 1 Mbps, consuming correspondingly more power. Also, if you are in the 1 Mbps area, you are far away from the AP. Your frame, on its way to the AP, has a far greater chance of colliding with another RF signal than if you were close to the AP and sending at 54 Mbps.
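A quick back-of-envelope check of the payload airtime (data bits only, deliberately ignoring the PHY preamble, PLCP header, and interframe spacing, which add further fixed overhead):

```python
def airtime_us(payload_bytes, rate_mbps):
    """Time on air in microseconds for the payload bits alone.

    Preamble, PLCP header and interframe spaces are ignored on purpose:
    this is a simplification to compare data rates, not a full PHY model.
    """
    return payload_bytes * 8 / rate_mbps       # bits / (bits per µs)

# A maximum-size 2346-byte MSDU:
fast = airtime_us(2346, 54)    # about 348 µs at 54 Mbps
slow = airtime_us(2346, 1)     # 18768 µs, almost 19 ms, at 1 Mbps
```

The ratio between the two airtimes is simply the ratio of the rates (54), which is why falling back to the lowest rate is so costly in radio-on time.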

This increased risk is related to 2 factors:
  • the signal has a longer path to travel, so it has more chances to hit another signal along the way: you have fewer chances of hitting another signal if you travel 3 meters than if you travel 100 meters;
  • the signal takes longer to cross that distance, so it stays in the air longer, and the longer a signal is in the air, the more chances it has to be hit by something suddenly starting to send.

This means that your error rate is “naturally” going to be higher at the edge of the cell than close to the AP. If the error rate is higher, the retry rate is also going to be higher. The more you retry, the more your device is using its battery to re-send instead of peacefully going to sleep/doze mode.
This means that the SRA algorithms were not that battery-efficient after all. As your device moves away from the AP, its energy consumption increases (because signals take longer to send, and have to be resent more often).
For this reason, second generation algorithms were built that determined that the station should start scanning before getting to the extreme situation of completely losing contact with the current AP.

Some drivers were even designed to allow you to determine the roaming (and therefore scanning) aggressiveness. If you know that you are in a corporate environment with many APs, you can set the behavior to “aggressive roaming” (scan early and jump if you find a better AP). If you know that you are in a one AP environment, you can set the behavior to “conserve power” (stick to the current AP as long as you can). For example, the Intel 4965 (win7 driver):

ERA: scanning doubts

Okay, so you decided to throw away your first generation SRA algorithm, and implement instead the "enterprise roaming algorithm" (ERA), starting to scan as you move away from the AP, in order to jump to another AP and maintain a good data rate, because a good data rate equals better battery conservation.
But wait a minute. This is easier said than done. Scanning in itself is not going to solve all your problems. Here again, try to think like a wireless card. Scanning is going to consume power... also, scanning can be done passively or actively.

Passive scanning: energy-efficient but time-consuming

Passive scanning is the most “energy efficient” mode. You set your radio to the next channel, and listen to detect if any beacon is heard. If your original AP is on channel 1, you may want to jump to channel 6 and listen there, because it is the next adjacent (non-overlapping) channel.
In the IEEE 802.11 standard, a channel is adjacent if it is not overlapping with the other channel. In 2.4 GHz band, channels are 5 MHz apart. Channel 1 peak frequency is 2412 MHz, channel 2 peak frequency is 2417 MHz (so channel 1 and channel 2 are 5 MHz apart).
Two channels are not overlapping for 802.11b if their peak frequencies are 25 MHz apart. Two channels are not overlapping for 802.11g if their peak frequencies are 20 MHz apart (but as most 802.11g systems are built to be compatible with older 802.11b clients, 25 MHz is typically used for 802.11b/g networks).
Channel 6 peak frequency is 2437 MHz, 25 MHz away from channel 1, so channels 1 and 6 are not overlapping. Again, 2 channels are adjacent if they are non-overlapping. Channel 1 and 6 are adjacent.
You will find many vendors who erroneously call "adjacent" 2 neighbouring and overlapping channels, for example 1 and 2. As long as you understand what the vendor means, all is well, but be aware that for the 802.11 standard these channels are not adjacent, they are overlapping.
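The arithmetic above is simple enough to encode. This sketch uses the 2.4 GHz channel plan (peak frequency 2407 + 5 × channel MHz, with channel 14 as the special case at 2484 MHz) and the 25 MHz separation rule the article uses for 802.11b:

```python
def channel_freq_mhz(ch):
    """Peak (center) frequency of a 2.4 GHz channel."""
    return 2484 if ch == 14 else 2407 + 5 * ch   # ch 14 is a special case

def non_overlapping(ch_a, ch_b, separation_mhz=25):
    """True if the two channels' peak frequencies are at least
    `separation_mhz` apart (25 MHz for 802.11b/g per the rule above)."""
    return abs(channel_freq_mhz(ch_a) - channel_freq_mhz(ch_b)) >= separation_mhz
```

Channels 1 (2412 MHz) and 6 (2437 MHz) are 25 MHz apart, so they are non-overlapping, hence adjacent in the 802.11 sense; channels 1 and 2 are only 5 MHz apart and overlap.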

Jumping from channel 1 to channel 6 to scan is nice, but here again, you do not know the environment. Maybe the next AP is set to channel 3, and is far enough away that you will not understand its signal when listening on channel 6. So you probably have to scan each channel in turn, starting from 2, then 3, 4, etc.

Can you hear a signal on channel 3 from channel 6? Well, maybe. Any signal spreads beyond its main frequency, although it is weaker as you move away from the main frequency. This phenomenon is related to the spectral mask. This is a typical spectral mask for OFDM signals:
You can see here that the 802.11 specification dictates that your signal should be 28 dB weaker than the main signal when you are 20 MHz away from the main frequency. So it is weaker, but your card may be able to hear it, and maybe understand it.
By the way, beacons contain a field, called the DS Parameter Set, that indicates on which channel the AP is supposed to be. If the signal was captured while scanning another channel, at least your station will know which channel the AP is on:


How long should you listen on each channel? You know that beacons are sent by default every 100 TU (or 102.4 ms). Reason would say that you should stay at least that long on each channel... but this would mean that it would take you more than one second to scan all channels in the 2.4 GHz spectrum, and even worse, maybe more than 2 seconds (depending on your regulatory domain and the number of allowed channels) in the 5 GHz spectrum...
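The "more than one second" figure is easy to verify: dwelling one full beacon interval (100 TU = 102.4 ms) on each channel adds up fast. The channel counts below assume an FCC regulatory domain and are illustrative.

```python
def passive_scan_budget_ms(n_channels, dwell_ms=102.4):
    """Worst-case time spent off-channel to passively sweep `n_channels`,
    dwelling one full default beacon interval (100 TU) on each."""
    return n_channels * dwell_ms

budget_24ghz = passive_scan_budget_ms(11)   # 11 FCC channels in 2.4 GHz
budget_5ghz = passive_scan_budget_ms(24)    # e.g. ~24 channels in 5 GHz
```

Eleven 2.4 GHz channels already cost more than 1.1 seconds; a 5 GHz sweep easily exceeds 2 seconds, which is why the dwell time has to be cut short somehow.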
What if your current AP has traffic to send to you in between? Luckily, the 802.11 standard thought about this issue. Before leaving your current channel, you need to send a frame to the AP with your Power save bit set to 1, so that the AP knows that you are not available.
With non-WMM stations, this frame is a null data frame, in other words a data frame with an empty body, and just the Power Management bit set to 1:
 With WMM stations, any frame can be sent by the station to the AP, as long as the Power Management bit is set to 1.
When returning to the channel, non-WMM stations must ask the AP if any traffic arrived and was buffered in between, using a specific PS-Poll frame.
WMM stations can simply send any frame to the AP to signal their return to the channel.

Does that Power Management bit solve your "I'm off channel for a while to detect other APs" issue? Not completely. You are still supposed to be back on the active channel for the next DTIM.
A DTIM is a beacon that announces that the AP has broadcast or multicast traffic to send to the cell. A DTIM can be sent every beacon, or at a longer interval (every 2, 5, or 200 beacons if you configure your AP that way).
If this DTIM is in every beacon, you have to be back in less than a beacon interval... so you CANNOT be away for an entire beacon interval! Do you really need to be away that long? One way around this issue is simply to scan, and jump back to your main channel as soon as you hear a beacon on the scanned channel. For example, suppose that you are scanning channel 3. In the worst-case scenario, there is no AP there, and you stay until it is time to jump back to channel 1 and listen to the next DTIM. In the best-case scenario, you hear a beacon on channel 3 after just a few milliseconds of scanning, and happily return immediately to channel 1, knowing that there is an AP on channel 3.

Does that make an efficient scanning algorithm? Not really... In fact, none of these behaviors is entirely satisfying:
  • if there is an AP in channel 3 and it has the same beacon interval as your main AP in channel 1, and if by coincidence both APs send their beacon at the same time, you will not discover the AP in channel 3... although it is there!
  Luckily, there is a solution for this issue: go back to scan channel 3 a few times... why would that solve the problem if both APs are set to send their beacons at the same time and with the same interval? Because a beacon is just like any other frame. Suppose the beacon interval is 100 TU. 100 TU after having started to send the previous beacon, the AP will try to send the next beacon. In order to do so, the AP will need the medium to be idle (if someone is sending at that time, the AP will have to wait until the medium gets free). The AP will also need to pick a random number and count down from there, just like for any other frame. So although the beacon interval is set to 100 TU, practical cell conditions mean that there is usually not exactly 100 TU between beacons. By coming back a few times to channel 3, you will eventually hear the AP's beacon.
  • If after listening to channel 3, you hear a beacon (and better yet, a beacon from an AP supporting your SSID, with an acceptable RSSI and SNR), should you be satisfied? Probably not! This AP you heard may be far, even if its signal is acceptable. There may be another, closer AP, that you haven't heard yet.
  Jumping to the conclusion that the AP you heard is your next best candidate may be a mistake. Some drivers make this mistake, which leads to poor roaming decisions (and comments from the network admin, in the mood of “why on earth is that client jumping to this AP on the lower floor when there is another, better AP just above the client, in the same room??”). Wisdom states that you should make sure that you detected all APs before deciding that you know about channel 3. This brings you back to the scenario where you have to stay longer on channel 3...
The sad conclusion is that there is no easy solution: in order to passively scan, you need to spend time away from the main channel and listen. Passive scanning is energy-efficient but time consuming...

Active scanning: time-efficient but energy-consuming

Another way to detect the environment is active scanning. Instead of passively listening to the other channels, you send a probe request. This behavior is more efficient, because APs are supposed to answer probe requests. Within a few milliseconds, you can know what other AP is on the scanned channel.
Once again, this is still not a perfect solution:
  • Some environments disallow active scanning (flight-safe mode, for example). Therefore, some wireless clients are not set to actively scan by default... my Nokia E71 phone is a perfect example.
  • Some APs are set to hide their SSID and not answer to probes... but this is a deviation from the standard.
  • Just like for passive scanning, should you be happy with the first probe response you get? Probably not. Here again, the contention mechanisms apply, and the probe response is just like any other frame: in order to send it, the AP must decide that the medium is idle, pick a random number and count down from there before sending. Therefore, you may very well receive the probe response from a distant AP before the response from a closer AP. You need to spend some time on the channel before deciding that you are sure you got all the answers.
Your request, just like the AP's response, may collide, and you may not be aware of that collision. If you send a request and get no response, is it because there is no AP, or because your request was not received due to a collision? If you receive 2 answers, does that mean that there are 2 APs on this channel, or could there be a third AP that did not get your probe, or whose response was lost to a collision? In most cases, you will have to probe a few times before deciding that you know what APs are on that scanned channel.
Spending time means spending energy. This is worsened by the fact that active scanning implies sending and receiving, which consumes more battery power than simply receiving. Active scanning is more time-efficient than passive scanning (although it does take time), but also more energy-consuming.
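The "probe a few times and merge the answers" logic can be sketched in a few lines. The `probe()` hook is hypothetical: it stands for one probe request plus the set of BSSIDs that answered it, and the 3-attempt count is an arbitrary illustration.

```python
def active_scan_channel(probe, attempts=3):
    """Probe one channel several times and merge the responders.

    Any single probe request (or its responses) may be lost to a
    collision, so one round of probing rarely guarantees a complete
    picture of the channel. `probe()` is a hypothetical driver hook
    returning the set of BSSIDs that answered one probe request.
    """
    seen = set()
    for _ in range(attempts):     # repeat: merge whoever answers each round
        seen |= probe()
    return seen
```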

HARA: hybrid adaptive

So scanning while conserving battery power is a fine balance between passive scanning and active scanning, while trying to favor lower data rates before deciding that scanning is needed. The exact formula (at what AP RSSI/SNR drop level do you start scanning, how many times do you retry a lost frame before reverting to a lower data rate, when do you increase your local power) depends on the vendor. The exact algorithm is of course kept secret, not only because you don't want to help your competitors by providing tools that would help them be more efficient, but also because the behavior depends in great part on the exact hardware used (circuit efficiency, card and antenna placement in the device, and their performance). The algorithms implementing this type of adaptive behavior are usually grouped under the common name of Hybrid Adaptive Roaming Algorithms (HARA).
Some drivers are directly adaptive and implement different types of scanning behavior. When your AP RSSI/SNR/packet error rate reaches a specific (vendor-dependent) threshold, your client starts passively scanning the other channels. If you reach another, lower threshold and are about to lose the current connection to your AP, and if passive scanning did not provide any good candidate AP to roam to, your client turns to panic mode and starts frantically and actively scanning the other channels, to find another AP before it is too late and your connection is lost. Smart, eh? Some vendors even do a pre-assessment (for example Intel). When you first switch on your wireless device, the device actively scans all channels and deduces the environment type. If multiple APs with the same SSID are detected, the environment is seen as Enterprise, and your client will start passively scanning in the background early in the roaming process (because it knows that there are other APs out there). If no other AP is detected with the same SSID, the client will revert to a sticky behavior, closer to SRA, because it assumes a home or SOHO environment. It is still not a pure SRA, because you may be in a meeting room where only one AP is detected, and still in an enterprise environment. Therefore, the HARA will still switch to passive/active scanning when your AP RSSI/SNR levels drop.
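The two-threshold behavior described above can be sketched as a tiny mode selector. The RSSI thresholds are invented for illustration; real drivers combine RSSI, SNR and packet error rate in vendor-specific (and secret) ways.

```python
def hara_mode(rssi_dbm, scan_thresh=-70, panic_thresh=-80):
    """Pick a scanning mode from the current AP RSSI, HARA-style.

    Hypothetical thresholds: above -70 dBm, do nothing (conserve
    battery); between -70 and -80 dBm, passively scan; below -80 dBm,
    panic and actively scan before the connection is lost.
    """
    if rssi_dbm > scan_thresh:
        return "no-scan"          # link is healthy, conserve battery
    if rssi_dbm > panic_thresh:
        return "passive-scan"     # degrading: listen for candidate APs
    return "active-scan"          # about to lose the AP: probe frantically
```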

Going further: the need for scanned channel AP power

All these constraints lead you to the conclusion that you will have to spend some time scanning around before deciding that there is another AP you can jump to. Why do some devices roam well while others don't? It depends, of course, on which algorithm they implement. Recent devices may still implement an SRA or ERA algorithm, simply because the vendor never updated its roaming algorithm for this type of device (for many possible reasons, ranging from cost to typical device use cases). Even if the device implements HARA, its adaptation to your environment will depend heavily on whether the device's expected behavior matches your deployment conditions. Drivers are heavily tested and optimized for specific roam events (for example, an office environment with a sudden roam needed because a door closed between the device and the AP, or a warehouse where little roaming and high stickiness are expected for a barcode scanner in a high-multipath environment). Although it is not possible for you to get the exact roaming behavior for which the device is optimized (unless you work for the driver development team of one of these vendors, haha), reading the driver's release notes with an "educated eye" will give you some hints about what the vendor tried to achieve in each release.
In all cases, blame the environment, not the device!

Regardless of how carefully the vendor designed its driver, there are still many parameters outside your device's control. We will name just one as a typical example. Even when you scan and discover other APs, a doubt will remain in your wireless device's mind (yes, wireless devices also have minds, sometimes :-)): what is the power of the AP I just discovered?
Does this power matter? Yes, it is key in the roaming decision.
Look at the following scenario:

This is a bird's-eye view, so you know that the laptop is moving toward the right. The laptop was connected to an AP somewhere on the lower left, and decided that it had to roam. Scanning the other channels, the laptop discovers 2 APs. One AP offers a -71 dBm RSSI (AP-to-laptop signal), and the other a -72 dBm RSSI. Which AP should the laptop roam to?
Even an admin with a global view would have to stop and think for a second. AP1's power level is 5 mW, just like the laptop's. This power level symmetry allows for an identical data rate on the way up and on the way down, 24 Mbps in both cases. Good! On the other hand, the laptop is moving away from this AP, so this nice connection is only going to last a few seconds.
For how long exactly? It depends, of course, on how far from the AP the client is. A general rule is that you lose 46 dB at 1 meter from the AP (so if the AP signal is 20 dBm, the client should read an RSSI of about -26 dBm), and roughly 10 more dB each time you double the distance (-36 dBm at 2 m, -46 dBm at 4 m, etc.). This is very general, and the environment's characteristics dictate the exact values, but these are common references in indoor open-space deployments. Typical roaming speed is 1.2 m/s...
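This rule of thumb (46 dB lost at 1 m, then about 10 dB per doubling of distance) translates directly into a small estimator. It is only the article's back-of-envelope model, not a real propagation formula:

```python
import math

def estimated_rssi_dbm(tx_power_dbm, distance_m):
    """RSSI predicted by the rule of thumb above: 46 dB lost at 1 m,
    then roughly 10 dB more per doubling of distance."""
    return tx_power_dbm - 46 - 10 * math.log2(max(distance_m, 1))

# A 20 dBm AP: about -26 dBm at 1 m, -36 dBm at 2 m, -46 dBm at 4 m
```

At 1.2 m/s walking speed, each halving or doubling of the distance to the AP happens within seconds, which is why the RSSI picture changes so quickly during a roam.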

So connecting to AP2 may be better after all... except that for now, AP2 offers a lower RSSI (-72 dBm). AP2's power is also very different from the laptop's: 40 mW against 5 mW. As a result, frames down from the AP may reach the laptop at 24 Mbps, but frames from the laptop to the AP will probably have to be sent more slowly, for example at 6 Mbps. This is something the laptop will have to discover by sending frames and falling back. Without knowing the AP's power, the laptop has to assume that if the AP's RSSI is -72 dBm, frames sent to the AP will also be heard by the AP at -72 dBm. It is only by failing to get ACK frames that the laptop will understand that there is a power mismatch and that frames have to be sent more slowly. It will try 18, then 12, then 9, before succeeding at 6 Mbps. What a waste of time! Yet, as the laptop moves toward AP2, this is more of a time investment, as the connection to AP2 will improve after a few seconds...
You know what would be great? If the laptop could know the AP's power, so that, as the laptop moves toward the right, it could evaluate both APs' RSSI variation and determine the best strategy: jump temporarily to AP1, then to AP2, or directly to AP2. But there is no mechanism in the 802.11 standard to allow for this kind of power level information exchange between APs and stations...

Cisco Wireless VoIP 802.11n phone?

Cisco announced a few days ago the end of life / end of sale for their Cisco Unified Wireless IP Phone 7921 (see here). The 7925 is the last end-user product in the wireless line (even the CB21AG card, with its PCMCIA format and lack of Win7 support, is dying away)... so will Cisco give up and soon retire the 7925 as well, or will they come up with a new product and an 802.11n phone?
Cisco is clearly moving away from the end-user market, but Wi-Fi phones are an enterprise product that complements the large range of Cisco wired IP phones.
But an 802.11n phone? Cisco has already certified the Cius as an 802.11n phone. Cisco and Dell are the only ones with a Wi-Fi-only 802.11n phone (see here), but there are hundreds of dual-band GSM/Wi-Fi certified 802.11n phones (see here).
All these phones are single-radio, single-stream. This is allowed by the 802.11n amendment, and is expected to last for a while. The main concern in a phone is conserving battery. As soon as you put 2 radio modules in a device, you consume twice as much power as another device with a single radio module. Although you can find clever ways to use the second module sparingly, your phone's battery is still going to last a lot less than your competitor's single-radio/single-stream phone... not to mention the technical challenge of fitting 2 or 3 radio modules, with 2, 3 or 4 antennas, in a device that has to be light and fit in your hand.
Isn't single radio/single stream enough anyway? If you use your phone for what it is... a phone, you need to send and receive 50 packets per second, consuming about 200 Kbps. So why bother with 802.11n? The gain in range might be useful to you, but all design guides work hard at convincing wireless network designers that they should design small cells, to control the data rate and the number of devices in their network. So you might be able to take advantage of the additional range provided by 802.11n, but only if the network you join is poorly designed... :-D
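The "50 packets per second, about 200 Kbps" figure is easy to sanity-check: it corresponds to voice packets of roughly 500 bytes each, headers included (the exact payload size depends on the codec and packetization interval, which are not specified here).

```python
def voip_bitrate_kbps(packets_per_s, bytes_per_packet):
    """Bit rate of a voice stream, headers included in `bytes_per_packet`.

    A back-of-envelope check: 50 packets/s of ~500 bytes each gives
    about 200 Kbps, nowhere near needing 802.11n throughput.
    """
    return packets_per_s * bytes_per_packet * 8 / 1000
```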
Most dual band phones certified for 802.11n offer this certification to allow for a more comfortable browsing (and downloading) experience, not really for the voice part itself.
So it would probably make sense for Cisco to work on 802.11n hybrid devices (phone + something else that needs bandwidth, like the Cius and its telepresence feature), but probably not on a pure "phone only" device... I have no internal insight into Cisco's secret plans, so this is just a thought I figured I should share, as I often get this Cisco 802.11n phone question, but it is by no means an informed or even educated guess on Cisco's strategy... :-)