Multicast: your bidir RP needs to know that it’s the RP for bidir groups

Hello

Today a real no-brainer (especially in hindsight). I had a config where the distribution switch had some VLAN interfaces with multicast devices connected to access switches downstream. The problem was that devices on different VLANs couldn't communicate via multicast even though PIM sparse mode was enabled on both VLAN interfaces and an RP was set for all groups for bidir traffic. The root cause was really simple: if you add the bidir keyword to the rp-address command, make sure that the RP itself has the same bidir keyword configured, otherwise it only wants to be an RP for non-bidir multicast traffic.
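
A minimal sketch of what that means in practice (the RP address 192.0.2.1 and ACL 10 are made-up examples): the bidir keyword has to appear in the rp-address statement on the RP itself as well as on every other PIM router.

ip multicast-routing
ip pim bidir-enable
! group range handled in bidir mode (example range)
access-list 10 permit 239.0.0.0 0.255.255.255
! same statement on the RP and on all other PIM routers - don't forget bidir on the RP
ip pim rp-address 192.0.2.1 10 bidir
!
interface Vlan10
 ip pim sparse-mode
interface Vlan20
 ip pim sparse-mode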

The lesson here is again very simple: divide the problem into digestible chunks:

  • what vrf are you looking at?
  • is the RP for the same vrf? do you have any RP-address config where you have bidir for some groups only?
  • is sparse mode enabled? or bidir?
  • what can you see in show ip mroute? S,G entries or only *,G entries? why?
  • what can you see in show ip igmp membership tables?
  • where is the RP? what can you see on the RP? does it know that it is the RP, and is PIM enabled on the interface that owns the RP address? is the same mode enabled (sparse? bidir?) is it bidir for all groups or only for some?

A bit of research (especially if you deal only occasionally with multicast) doesn’t hurt. That way you know (and you’re not surprised…) why you only see *,G entries when bidir is enabled.
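
To walk through that checklist on IOS, these are roughly the show commands I reach for (the group address is just a placeholder; add the vrf keyword if you are working in a VRF):

show ip pim rp mapping       ! is the RP learned/configured as bidir for the group range?
show ip pim interface        ! is PIM actually enabled where you expect it?
show ip mroute 239.1.1.1     ! bidir groups only ever build (*,G) entries, never (S,G)
show ip igmp membership      ! which receivers have joined which groups?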


PPPoE needs lower MTU and MSS or some apps and sites will break

Hello

I still remember my first job in the network field; it was for a company called SonicWall. Most of our phone calls were from really inexperienced admins, and because most of them had PPPoE connections, a large chunk of the tickets could be solved by suggesting that the MTU should be 1492, not 1500.
Today my colleague had an interesting case where employees couldn't log in to Skype for Business. The router logs were full of max fragments errors, which showed that it was busy fragmenting packets but just couldn't store all the fragments in memory.

Adjusting the TCP MSS on the dialer interface solved the problem. This is also described in the Cisco document below. It turns out that the Skype login uses TCP, and Skype seems to ignore PMTUD.

https://www.cisco.com/c/en/us/support/docs/ip/transmission-control-protocol-tcp/200932-Ethernet-MTU-and-TCP-MSS-Adjustment-Conc.html
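
For reference, a hedged sketch of the usual fix on an IOS PPPoE client (the interface name is an example; 1452 = 1492 minus 40 bytes of IP and TCP headers):

interface Dialer1
 ip mtu 1492              ! leave room for the 8-byte PPPoE header
 ip tcp adjust-mss 1452   ! clamp the MSS so TCP sessions never need fragmentation in the first place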


3650 CPU spikes and memory loss with IOS XE software 16.3.5b

Hello

Everyone loves getting those "SNMP agent lost" alarms twice a day, right? They are great for making your NOC statistics look bad, because you don't even have time to take on the ticket before it gets resolved automatically within 10 minutes.

So I started googling for an answer to this petty problem and couldn't find anything. On the Cisco software download pages 16.3.5 is deferred, but 16.3.5b is not, so this seemed fine. The release notes don't mention any flagrant bugs either.

On the device the problem itself is kind of weird – every 7 minutes or so I get CPU spikes to 85%, with SNMP, iosd_ipc and dbal processes being the biggest CPU hogs (SNMP 25%, iosd 25%, dbal about 20%), but the "snmp agent lost" events occurred only twice a day, not every 7 minutes, so there had to be some other factor that was difficult to track down. I figured that I could find the guilty party by looking at show snmp stats oid to see which table was being retrieved when the CPU was busy. And here was my first surprise – even though it was clear that the AuthManager tables were being retrieved, I couldn't exclude them from the SNMP view – exclusions simply did not work. What a drag. This took me about 2 hours on Friday afternoon, and because I hate it when stuff won't just work, I couldn't just go home. Partly because my personality is as broken as the IOS code.
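
For reference, this is roughly what I was trying to do – a sketch only, with a placeholder view name and a placeholder OID subtree (the exact subtree to exclude comes from the show snmp stats oid output) – and on this release the exclusion simply had no effect:

show snmp stats oid                                        ! which OIDs have been polled recently, and how often
snmp-server view NOAUTHMGR iso included                    ! NOAUTHMGR is a made-up view name
snmp-server view NOAUTHMGR 1.3.6.1.4.1.9.9.656 excluded    ! placeholder subtree for the offending tables
snmp-server community public view NOAUTHMGR ro             ! attach the view to the (example) community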

At about 5.30 p.m. I realized that this IOS is broken beyond repair, figured that since it is 2 years old an upgrade will probably solve the issue anyway, and went home. It goes without saying that I wasn't in a good mood. Not being able to google the answer is never good and usually means that your search strings are wrong (so probably the CPU was not the real problem here). The Cisco Live presentation on troubleshooting IOS XE is a bit crap, though, so it's not as if I hadn't tried. There's a ton of show commands but no clear advice on what is good output and what is bad output.

Today I had some more time and finally found the relevant field notice: https://www.cisco.com/c/en/us/support/docs/field-notices/703/fn70359.html with the memory leak info and a suggestion to upgrade the software.
The conclusions from this are as follows:

  1. Sometimes neither the release notes nor the Cisco software download pages are updated, and you need to dig further and further.
  2. You need to upgrade your software regularly… 16.3.5b was released 2 years ago and has been found to have multiple vulnerabilities anyway.
  3. That "complete rewrite" of the code from 3.x to 16.x is not exactly a success.
  4. Spending too much time researching old software versions may be a waste of time, because you need to upgrade anyway.
  5. Corporate "software upgrade research teams" are crap and cannot be relied upon. I should have received that info from them a long time ago.


Languages, networks, hiatuses etc.

Hello again

I haven't posted much in the last 12 months for multiple reasons. First of all, around May last year I decided to take a short break from learning about networks and do something different for a while. Also, spending more time with my son felt better than typing in Cisco commands. So I did both. I spent way more time with my family, and in any spare time I had I was learning French. After a year or so I'm now reading 'Tous des idiots' by Thomas Erikson, so maybe even my soft skills will improve as a result of yet another foray into the realm of languages. Being able to speak French is a nice hook to keep you aboard in the network world, too. #linkedinupdate 😀

Anyway, after a year of recharging my batteries I feel I can do some network stuff again on top of what I've been doing at work. Cisco is revamping its range of exams, so once the first books start to come out I'm planning to cover all the new stuff.

BTW it's 2019 and where's IPv6? Where is it, I'm asking? 😉 Not in my projects…

WLC firmware versions 8.3.143.5 – 8.3.143.8 with critical bugs

Hello

Cisco TAC has recently published versions .9 and .10 of the 8.3.143 software for the WLC, which resolve critical malloc failures on a whole range of newer access points in the images known as 8.3MR4Esc; the bug is https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvm18273

The symptom is clear: users can't log in using their PSKs, and the msglog on the WLC says the PSK is incorrect even though users enter the right one, while for dot1x users it's EAP timeouts for M0 or M2 messages etc. Once logged in to the AP, you can see that its event log is full of malloc failures and tracebacks.

If this happens to you, upgrade as soon as possible. However, these versions are only available from Cisco TAC. The funny thing is that we had only upgraded to 8.3.143.7 in the first place because that TAC version resolved a TACACS vulnerability. #ilovecisco

Apparently the same bugs appear in the 8.5.x versions; see the details on the Bug Search site.

HP printers losing DNS entries

Hello

Users started complaining about not being able to use several HP printers in their company. Whenever that happened, the printers' DNS names could not be resolved. Users were still able to log in to the printers via the IP address.

The company has an IPAM solution that sends DDNS updates to the DNS server, which is administered by a third party so we were not able to have a look at the DNS policies.

What we noticed after a while was that even though our standard DHCP lease was 7 days, the HP printers would get 30 days! It turns out that the "default lease time" parameter does not mean "maximum lease time". If a device asks for a longer lease via DHCP option 51, it gets it. Aaargh!

Once we found this, we asked the DNS admins what the scavenging policy was. It turned out that the DNS server considered dynamic entries stale if they were not refreshed within 8 days. The math was easy: the lease is renewed at 50% of the lease time, so with a 30-day lease the record is only refreshed on day 15, and between day 9 and day 15 of the lease DNS could scavenge the entry. The DNS admins were reluctant to say what the scavenging interval was.

The solution was to configure the 7-day value as the "maximum lease time" on the DHCP server, not just the "default lease time".
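
On an ISC-dhcpd-style server (an illustrative sketch only – your IPAM product will have its own names for the same two knobs) the difference looks like this:

# dhcpd.conf excerpt - both values are in seconds
default-lease-time 604800;   # 7 days, used only when the client does not request a specific lease time
max-lease-time     604800;   # hard cap: even if a client asks for 30 days via option 51, it gets 7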

Incidentally, the same applies to iPhones, so make sure you don't have hundreds of 30-day leases for your BYOD guests. It's quite easy to run out of IP addresses…

Applying hotfix to ISE

Hello

Today I was hotfixing my ISE because of the CSCvm14030 vulnerability. Here's the output.

The only difference between patching and hotfixing is that the command is "application install" rather than "patch install".

I chose to do this manually on all nodes rather than using the automatic GUI mode, because I wanted control over when I hotfix which node.

myISE/admin# application install ise-apply-CSCvm14030_2.1.0.474_common_1-SPA.tar.gz MyREPO
Save the current ADE-OS running configuration? (yes/no) [yes] ? yes
Generating configuration…
Saved the ADE-OS running configuration to startup successfully

Getting bundle to local machine…
Unbundling Application Package…
Verifying Application Signature…
Initiating Application Install…

Checking if CSCvm14030_2.1.0.474_common_1 is already applied
– Successful

Checking ISE version compatibility
– Current ISE Version: 2.1.0.474 Patch Version: 7
– Successful

Applying hot patch CSCvm14030_2.1.0.474_common_1
– Taking backup of integrity files
– Taking backup of patch related files
– Running hotpatch wrapper script
– Restarting ISE services

Stopping ISE Monitoring & Troubleshooting Log Collector…
grep: write error
Stopping ISE Monitoring & Troubleshooting Log Processor…
grep: write error
grep: write error
grep: write error
grep: write error
grep: write error
grep: write error
grep: write error
ISE PassiveID Service is disabled
ISE pxGrid processes are disabled
Stopping ISE Application Server…
ISE PassiveID Service is disabled
ISE pxGrid processes are disabled
Stopping ISE Application Server…
Stopping ISE Certificate Authority Service…
Stopping ISE EST Service…
ISE Sxp Engine Service is disabled
ISE TC-NAC Service is disabled
Stopping ISE Profiler Database…
Stopping ISE Indexing Engine…
grep: write error
Stopping ISE Monitoring & Troubleshooting Session Database…
ISE PassiveID Service is disabled
ISE pxGrid processes are disabled
Stopping ISE Application Server…
Stopping ISE Certificate Authority Service…
Stopping ISE EST Service…
ISE Sxp Engine Service is disabled
ISE TC-NAC Service is disabled
Stopping ISE Profiler Database…
Stopping ISE Indexing Engine…
grep: write error
Stopping ISE Monitoring & Troubleshooting Session Database…
Stopping ISE AD Connector…
grep: write error
Stopping ISE Database processes…
Starting ISE Monitoring & Troubleshooting Session Database…
Starting ISE Profiler Database…
grep: write error
Starting ISE Application Server…
Starting ISE Certificate Authority Service…Starting ISE EST Service…
Starting ISE Monitoring & Troubleshooting Log Processor…
Starting ISE Monitoring & Troubleshooting Log Collector…
Starting ISE Indexing Engine…
Starting ISE AD Connector…
Note: ISE Processes are initializing. Use 'show application status ise'
CLI to verify all processes are in running state.


DMVPN NAT mystery

Hello

Today I came across an interesting case where two out of three tunnels on one spoke (out of 50 spokes in total) were down. The third (working) tunnel was connected to the same two hubs, but via MPLS.

I checked:

  • tunnel configuration. It was identical in all other 40+ spokes
  • crypto isakmp and ipsec policies. Again, identical.
  • state of crypto: both ISAKMP and IPsec SAs were up, and packets were being encrypted and decrypted

So now it was up to GRE and NHRP to do their job of establishing DMVPN.

When I checked the state of DMVPN, the hub showed the tunnels as UP; the spoke, however, was showing them in the NHRP state. What does that mean? After running an NHRP debug I knew that the spoke wasn't getting the resolution reply for the two bad tunnels.

Finally, I saw that there were 2 NAT entries for GRE traffic. It was strange, because the traffic was sourced by the same host on the LAN and the destination was (in both cases) the DMVPN hub.
I looked at the NAT translation statement: it was a NAT overload to the outside (public) interface.

As GRE doesn't cooperate well with PAT, I decided to clear the entries, especially since the source host was showing as Incomplete in the ARP table.

After clearing the entries from the IP NAT translation table, the tunnels came up.

What does this teach us? We should add a deny statement for GRE traffic to our NAT ACLs and only match TCP and UDP in the NAT statements rather than the whole IP protocol.
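
A hedged sketch of what that could look like on the spoke (the ACL name, addresses and interface are all made up):

ip access-list extended LAN-NAT
 deny   gre any any                       ! never PAT the router's own GRE/DMVPN traffic
 permit tcp 10.10.0.0 0.0.255.255 any     ! match only TCP and UDP instead of "permit ip"
 permit udp 10.10.0.0 0.0.255.255 any
!
ip nat inside source list LAN-NAT interface GigabitEthernet0/0 overload

After changing the ACL you still need to clear the stale translations (clear ip nat translation *), otherwise the old GRE entries linger until they time out.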

I’m wondering if the same issue will crash my DMVPN lab or not.

Broken CCIE lab with BGP synchronization and OSPF

Hello

We have the following scenario:

[topology diagram]

IOU4 has a static route to 10.1.9.9 pointing towards IOU9 and redistributes it into OSPF. IOU4 has an OSPF adjacency to IOU5, and IOU5 to IOU2.

IOU2 is a BGP neigh to IOU5 and also to IOU4. IOU4 and IOU5 are route reflector clients of IOU2.

The problem starts when we enable BGP synchronization on IOU5. What do we know about synchronization? The condition is that a route learned via iBGP must also be present in the IGP before it can be used or advertised, to avoid black holes: think of R1 > R2 > R3 where BGP is only enabled on R1 and R3 – R2, which doesn't run BGP, causes the black hole.

But having matching routes in BGP and the IGP is not the only condition, I'm afraid… let's have a look at this case:

IOU5 receives the OSPF route to IOU9 from IOU4, but it receives the BGP route to IOU9 from IOU2! When the IGP is OSPF, IOS also compares the OSPF router ID of the router that advertised the route into OSPF with the BGP router ID of the iBGP peer the BGP route came from, and because they differ here, the route from the internal routing protocol will not be considered synchronized with the same route in BGP.

The only fix is to change the BGP router-id on IOU2 so that it is the same as the OSPF router-id on IOU4.
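
A minimal sketch of the fix (the AS number is made up, and 10.0.4.4 stands in for whatever OSPF router-id IOU4 is actually using); note that changing the BGP router ID resets the BGP sessions:

! on IOU2
router bgp 65000
 bgp router-id 10.0.4.4
!
! verify on IOU5 once the sessions are back up:
! show ip bgp 10.1.9.9 - the "not synchronized" flag should be gone and the route becomes best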