Ockham’s razor

Today I spent a LOT of time troubleshooting a case that should have been easy to solve.

The customer reported severe intermittent problems with internet connectivity for users connected to a 2960-X switch that uplinks to a 2911 edge router. The download speed would crawl down to a measly 0.5 Mbit/s, while upload stayed at 9 Mbit/s. The site had 50+ people, so this basically meant an outage for the whole branch.
However, if the router was disconnected and a single laptop was connected directly to the line, the problems stopped immediately. In another experiment, restarting the router fixed the problem for a while (4-5 hours). Then the problems would recur and another restart was necessary.

Finally, there was a Cisco 474 WAAS device connected inline between the switch and the router.

Topology-wise, users reach the internet through a local exit (not through HQ). On the same exit interface, there is an IPsec over GRE tunnel to the corporate network.

It took me a while to get hold of all the logs. It seems the problems started about a week ago. The first symptom was that the tunnel carrying EIGRP routing to the corporate network would go down and TCP logging experienced problems. The same problem occurred multiple times over the last ten days.

The problem could be on any device: the switch, the WAAS device, the router, even the hosts, the links, or the VPN hub at HQ.

I was able to eliminate the hub because it terminated tens of tunnels and this was the only one flapping. All of them went through the same exit link at HQ.

When there is only one switch, it is almost never the problem (unlike complex L2 topologies where STP recomputes paths). There were, however, some rogue devices (APs and switches), and the 2960-X showed input errors on the links to some of them. Still, I thought it unlikely that all users would be behind that one faulty link.

So I was left with the WAAS device, the router, the hosts, and the links. The ISP reported no problems on the link, but hey, they always lie, so I couldn't discount that possibility. I decided to spend time looking at the configuration of the router. I spent about three hours analyzing and overanalyzing the config, thinking that maybe CBAC was causing problems. This was my first mistake: even on a 2911, CBAC won't cause problems on a 10 Mbit/s link. There's simply not enough work to do. There was plenty of free memory and the CPU was running at 2-3%.

After about 4 hours, I decided to have one more look at the logs, this time at the logging buffer, which collected an audit trail for CBAC (ip inspect audit-trail), rather than at the logs on the server. One entry repeated itself over and over again, and it was about a single host. The audit trail showed that one user was creating a lot of connections that were being inspected. I quickly checked ip nat statistics and bingo: one user had about 100 NAT entries, with the source port being a torrent port, 6889.
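In hindsight, that per-host count is easy to script. Below is a minimal sketch, assuming the NAT table has been dumped as text in the usual five-column layout of `show ip nat translations` (protocol, inside global, inside local, outside local, outside global). The sample lines, the addresses, and the function names are all made up for illustration; they are not from the customer's router.

```python
from collections import Counter

# Made-up sample in the five-column layout assumed above.
SAMPLE = """\
Pro Inside global      Inside local       Outside local       Outside global
tcp 203.0.113.10:6889  192.168.1.23:6889  198.51.100.7:51413  198.51.100.7:51413
tcp 203.0.113.10:6889  192.168.1.23:6889  198.51.100.9:6881   198.51.100.9:6881
udp 203.0.113.10:6889  192.168.1.23:6889  198.51.100.20:40000 198.51.100.20:40000
tcp 203.0.113.10:50112 192.168.1.40:50112 192.0.2.80:443      192.0.2.80:443
--- 203.0.113.11       192.168.1.5        ---                 ---
"""


def count_translations_per_host(text):
    """Count dynamic tcp/udp translations per inside local address."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Skip the header, blank lines, and static entries (protocol '---').
        if len(fields) != 5 or fields[0] not in ("tcp", "udp"):
            continue
        # The third column is the inside local 'address:port' pair.
        host, _, _port = fields[2].rpartition(":")
        counts[host] += 1
    return counts


def top_talkers(text, n=3):
    """Return the n inside hosts holding the most translations."""
    return count_translations_per_host(text).most_common(n)


if __name__ == "__main__":
    for host, entries in top_talkers(SAMPLE):
        print(host, entries)
```

A host that holds far more translations than everyone else, all from one source port, is worth a look before anyone starts hunting for router bugs.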

Again, the simplest explanation was the best. If there is no bandwidth, it is because there is not enough bandwidth, no QoS, and someone (or several people) is using up that bandwidth. I once again made the mistake of putting the initial blame on the router and hunting for a bug, memory leaks, CPU problems, etc.

It is also a bad design: if there is a local exit to the internet, someone should police that exit and mark traffic so that scavenger traffic doesn't eat up all the bandwidth. GRE traffic should have a separate class (maybe even a priority class?).

No free bandwidth? Well, no free bandwidth… Ockham’s razor.

 

 
