Multicast: your bidir RP needs to know that it’s the RP for bidir groups

Hello

Today a real no-brainer (especially in hindsight). I had a config where the distribution switch has some VLAN interfaces, with multicast devices connected to access switches downstream. Devices on different VLANs couldn’t reach each other over multicast, even though PIM sparse mode was enabled on both VLAN interfaces and the RP was configured for all groups with the bidir keyword. The cause was really simple: if you add the bidir keyword to the RP address config on your routers, make sure you set the same bidir keyword on the RP itself. Otherwise the RP only wants to be an RP for non-bidir multicast traffic.

The lesson here is again very simple. Divide the problem into digestible chunks:

  • what vrf are you looking at?
  • is the RP for the same vrf? do you have any RP-address config where you have bidir for some groups only?
  • is sparse mode enabled? or bidir?
  • what can you see in show ip mroute? S,G entries or only *,G entries? why?
  • what can you see in show ip igmp membership tables?
  • where is the RP? what can you see on the RP? Does it know that it is the RP, and is PIM enabled on the interface whose IP address is the RP address? Is the same mode enabled (sparse or bidir)? Is it bidir for all groups or only for some?
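
The bidir check from the last bullet is easy to do mechanically. Here is a minimal Python sketch, assuming IOS-style static RP lines of the form "ip pim rp-address <address> [acl] [override] [bidir]" pasted from each device. The function names and the sample address are made up for illustration, and the sketch ignores VRF-specific lines and per-group ACLs.

```python
def rp_modes(config_text):
    """Map each statically configured RP address to True (bidir) or False."""
    modes = {}
    for line in config_text.splitlines():
        words = line.split()
        # Only look at global static RP statements: "ip pim rp-address <addr> ..."
        if words[:3] == ["ip", "pim", "rp-address"] and len(words) >= 4:
            modes[words[3]] = "bidir" in words[4:]
    return modes


def bidir_mismatches(router_cfg, rp_cfg):
    """Return RP addresses whose bidir setting differs between two devices."""
    a, b = rp_modes(router_cfg), rp_modes(rp_cfg)
    return sorted(addr for addr in a.keys() & b.keys() if a[addr] != b[addr])


# Hypothetical example: the distribution switch says bidir, the RP does not.
distribution = "ip pim rp-address 10.0.0.1 bidir"
rp = "ip pim rp-address 10.0.0.1"
print(bidir_mismatches(distribution, rp))  # ['10.0.0.1']
```

An empty list means both devices agree on the mode for every RP address they share. It says nothing about whether PIM is actually enabled on the right interfaces.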

A bit of research (especially if you deal only occasionally with multicast) doesn’t hurt. That way you know (and you’re not surprised…) why you only see *,G entries when bidir is enabled.

 

ISE 1.4 show run and snmp freeze – disable CDP and you’re golden.

I’ve noticed that if you can’t do show run on ISE and the server doesn’t respond to SNMP, a quick solution is to disable CDP on your gi interface and everything starts working fine. Funny, innit?

 

PPPoE needs lower MTU and MSS or some apps and sites will break

Hello

I still remember my first job in the network field. It was for a company called SonicWall. Most of our phone calls were from really inexperienced admins, and because most of them had PPPoE connections, a large chunk of the tickets could be solved by suggesting that the MTU should be 1492, not 1500 (PPPoE takes 8 bytes out of the 1500-byte Ethernet payload).
Now today my colleague had an interesting case where employees couldn’t log in to Skype for Business. The router logs had a lot of max fragments errors, which clearly showed that the router was busy trying to fragment packets but just couldn’t store all the fragments in memory.

Adjusting the TCP MSS on the dialer interface solved the problem. This is also described in the Cisco document linked below. It turns out that Skype logins use TCP, and Skype seems to ignore PMTUD.
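
For reference, the arithmetic behind the usual values can be sketched in a few lines of Python. The assumptions are plain IPv4 and TCP headers without options and a standard 1500-byte Ethernet MTU; the constant and function names are mine.

```python
ETHERNET_MTU = 1500
PPPOE_OVERHEAD = 8   # 6-byte PPPoE header + 2-byte PPP protocol field
IPV4_HEADER = 20     # assumes no IP options
TCP_HEADER = 20      # assumes no TCP options


def pppoe_mtu(ethernet_mtu=ETHERNET_MTU):
    """Largest IP packet that fits into one PPPoE frame."""
    return ethernet_mtu - PPPOE_OVERHEAD


def tcp_mss(mtu):
    """Largest TCP payload that fits into one IP packet of the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


mtu = pppoe_mtu()
print(mtu, tcp_mss(mtu))  # 1492 1452
```

These are the numbers you typically see on an IOS dialer interface as ip mtu 1492 and ip tcp adjust-mss 1452. Check the values against your own encapsulation before copying them.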

https://www.cisco.com/c/en/us/support/docs/ip/transmission-control-protocol-tcp/200932-Ethernet-MTU-and-TCP-MSS-Adjustment-Conc.html

 

3650 CPU spikes and memory loss with software 16.3.5b XE

Hello

Everyone loves getting those SNMP agent lost alarms twice a day, right? They are great for making your NOC statistics look bad: you never get the chance to pick up the ticket because it resolves itself within 10 minutes.

So I started googling for an answer to this petty problem and couldn’t find anything. On the Cisco software download page, 16.3.5 is marked as deferred but 16.3.5b is not, so this seemed fine. The release notes don’t mention any flagrant bugs either.

On the device the problem itself is kind of weird: every 7 minutes or so I get CPU spikes to 85%, with SNMP, iosd_ipc and dbal being the biggest CPU hogs (SNMP 25%, iosd 25%, dbal about 20%). But the SNMP agent lost events occurred only twice a day, not every 7 minutes, so there had to be some other factor that was difficult to track down. I figured that I could find the guilty party by looking at show snmp stats oid to see which table was being retrieved when the CPU was busy. And here was my first surprise: even though it was clear that the AuthManager tables were being retrieved, I couldn’t exclude them from the SNMP view. Exclusions did not work. What a drag. This took me about 2 hours on Friday afternoon, and because I hate it when stuff won’t just work, I couldn’t just go home. Partly because my personality is as broken as IOS code.

About 5.30 p.m. I realized that the IOS was broken beyond repair, figured that since that IOS is 2 years old an upgrade would probably solve the issue anyway, and went home. It goes without saying that I wasn’t in a good mood. Not being able to google the answer is never good and usually means that your search strings are wrong (so probably the CPU was not the problem here). The Cisco Live presentation on troubleshooting IOS XE is a bit crap, though, so it’s not as if I hadn’t tried. There’s a ton of show commands but no clear advice on what is good output and what is bad output.

Today I had some more time and finally found the relevant field notice, which has the memory leak info and a suggestion to upgrade the software: https://www.cisco.com/c/en/us/support/docs/field-notices/703/fn70359.html
The conclusions from this are as follows:

  1. Sometimes neither the release notes nor the Cisco software download pages are updated, and you need to look further and further.
  2. You need to upgrade your software regularly… 16.3.5b was released 2 years ago and has been found to have multiple vulnerabilities anyway.
  3. That “complete rewrite” of the code from 3.x to 16.x is not exactly a success.
  4. Spending too much time on researching old software versions may be a waste of time because you need to upgrade anyway.
  5. Corporate “software upgrade research teams” are crap and cannot be relied upon. I should have received that info from them a long time ago.

 

Languages, networks, hiatuses etc.

Hello again

I haven’t posted much in the last 12 months for multiple reasons. First of all, around May last year I decided to take a short break from learning about networks and do something different for a while. Also, spending more time with my son felt better than typing in Cisco commands. So I did both. I spent way more time with my family, and in any spare time I had I was learning French. After a year or so I’m now reading ‘Tous des idiots’ by Thomas Erikson, so maybe even my soft skills will improve as a result of yet another foray of mine into the realm of languages. Being able to speak French is a nice hook to keep you aboard in the network world, too. #linkedinupdate 😀

Anyway, after a year of charging my batteries I feel I can do some network stuff again on top of what I’ve been doing at work. Cisco is revamping its range of exams, so once the first books start to come out I’m planning to cover all the new stuff.

BTW it’s 2019 and where’s IPv6? Where is it, I’m asking? 😉 Not in my projects…

WLC firmwares 8.3.143.5 – 8.3.143.8 with critical bugs

Hello

Cisco TAC has recently published versions .9 and .10 of the 8.3.143 software for WLC, which resolve critical malloc failures affecting a whole range of newer access points in the images known as 8.3MR4Esc. Bug number: https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvm18273

The symptom is clear: users can’t log in using their PSKs, and msglog on the WLC says that the PSK is incorrect even though users entered the right PSK, while dot1x users get EAP timeouts for M0 or M2 messages etc. Once logged in to the AP, you can see that the event log is full of mallocs and tracebacks.

If this happens to you, upgrade as soon as possible. However, it is only possible to get these versions from Cisco TAC. The funny thing is that we only upgraded to 8.3.143.7, because this TAC version resolved a TACACS vulnerability. #ilovecisco

Apparently the same bugs appear in 8.5.x versions; see the bug details on the Bug Search site.

HP printers losing DNS entries

Hello

Users started complaining about not being able to use several HP printers in their company. Whenever that happened, the printers’ names could not be resolved. Users were still able to log in to a printer via its IP address.

The company has an IPAM solution that sends DDNS updates to the DNS server, which is administered by a third party so we were not able to have a look at the DNS policies.

What we noticed after a while was that even though our standard DHCP lease was 7 days, HP printers would get 30 days! It turns out that the parameter “default lease time” does not mean “maximum lease time”. If a device asks for a longer lease via DHCP option 51, it gets it. Aaargh!
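
The behaviour can be modeled in a few lines of Python. This is a simplified sketch of how a DHCP server might pick the lease length, loosely following the way ISC dhcpd documents default-lease-time and max-lease-time. The IPAM product in this story may differ in detail, and the 30-day maximum is an assumed value chosen to reproduce what we saw.

```python
DAY = 24 * 60 * 60  # seconds


def granted_lease(requested, default_lease, max_lease):
    """Lease length in seconds for one client request (simplified model).

    requested is the value of DHCP option 51, or None if the client sent none.
    """
    if requested is None:
        return default_lease          # the default only applies when nothing is requested
    return min(requested, max_lease)  # the maximum is the only real cap


# Printer asks for 30 days; default is 7 days, assumed maximum is 30 days.
print(granted_lease(30 * DAY, 7 * DAY, 30 * DAY) // DAY)  # 30
# Same request once the maximum is also 7 days.
print(granted_lease(30 * DAY, 7 * DAY, 7 * DAY) // DAY)   # 7
```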

Once we found this, we asked the DNS admins what the scavenging policy was. It turned out that the DNS server considered dynamic entries stale if they were not refreshed within 8 days. The math was easy: a 30-day lease is renewed at 50% of the lease time, i.e. on day 15, so between the 9th and the 15th day of the lease the DNS server could scavenge the entry. The DNS admins were reluctant to say what the scavenging interval was.
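
The same math as a small Python sketch. It assumes that the DNS record is refreshed only when the lease is granted or renewed, and that clients renew at 50% of the lease time; the function name is made up.

```python
def scavenging_window(lease_days, stale_after_days, renew_fraction=0.5):
    """Return (start_day, end_day) during which the record is stale, or None.

    Assumes the DDNS update happens only at lease grant and at renewal.
    """
    renew_day = lease_days * renew_fraction
    if renew_day <= stale_after_days:
        return None  # the record is refreshed before it can go stale
    return (stale_after_days, renew_day)


print(scavenging_window(30, 8))  # (8, 15.0)
print(scavenging_window(7, 8))   # None
```

With a 7-day lease the renewal comes after 3.5 days, well inside the 8-day limit, so the record never goes stale.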

The solution was to change the parameter “default lease time” to “maximum lease time” on the DHCP server.

Incidentally, the same applies to iPhones, so make sure you don’t have hundreds of 30-day leases for your BYOD guests. It’s quite easy to run out of IP addresses…