Everyone loves getting those SNMP agent lost alarms twice a day, right? They are great for making your NOC statistics look bad: nobody has time to pick up the ticket because it resolves itself automatically within 10 minutes.
So I started googling for an answer to this petty problem and couldn’t find anything. On the Cisco software download page, 16.3.5 is marked as deferred, but 16.3.5b is not, so this seemed fine. The release notes don’t mention any flagrant bugs either.
On the device the problem itself is kind of weird: every 7 minutes or so I get CPU spikes to 85%, with the SNMP, iosd_ipc and dbal processes being the biggest CPU hogs (SNMP 25%, iosd 25%, dbal about 20%). But the SNMP agent lost events occurred only twice a day, not every 7 minutes, so there had to be some other factor that was difficult to track down. I figured that I could find the guilty party by checking show snmp stats oid to see which table was being retrieved when the CPU was busy. And here was my first surprise: even though it was clear that the AuthManager tables were being retrieved, I couldn’t exclude them from the SNMP view. The exclusions simply did not work. What a drag. This took me about 2 hours on Friday afternoon, and because I hate it when stuff won’t just work, I couldn’t just go home. Partly because my personality is as broken as IOS code.
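If you want to do the "which table is being hammered" step without squinting at the terminal, here is a minimal Python sketch that totals request counts per OID from pasted show snmp stats oid output. The function name top_oids and the sample text are made up for illustration, and the parser assumes each data row ends with a request count followed by the OID; check that against what your own release actually prints before trusting the numbers.

```python
from collections import Counter

def top_oids(show_output, limit=5):
    """Rank OIDs by total request count.

    Assumption: every data row ends with '<count> <oid>'. Rows that do
    not fit (header, blank lines) are skipped.
    """
    totals = Counter()
    for line in show_output.splitlines():
        parts = line.split()
        if len(parts) < 2 or not parts[-2].isdigit():
            continue  # header or unrelated line
        totals[parts[-1]] += int(parts[-2])
    return totals.most_common(limit)

# Invented sample, NOT real device output; the OID names are placeholders.
SAMPLE = """\
time-stamp                  #of times requested   OID
10:00:00 UTC Jan 1 2020     120                   exampleTableA.2
10:00:01 UTC Jan 1 2020     15                    sysUpTime.0
10:07:00 UTC Jan 1 2020     130                   exampleTableA.2
"""

print(top_oids(SAMPLE))
# [('exampleTableA.2', 250), ('sysUpTime.0', 15)]
```

Capturing the output a few times around a CPU spike and comparing the top entries should point at the table the poller keeps walking.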
About 5.30 p.m. I realized that the IOS is broken beyond repair, figured that since that IOS is 2 years old the upgrade will probably solve the issue anyway, and went home. It goes without saying that I wasn’t in a good mood. Not being able to google the answer is never good and usually means that your search strings are wrong (so the CPU was probably not the real problem here). The Cisco Live presentation on troubleshooting IOS XE is a bit crap, though, so it’s not as if I hadn’t tried. There’s a ton of show commands but no clear advice on what is good output and what is bad output.
Today I had some more time and finally found the relevant field notice, with the memory leak info and a suggestion to upgrade the software: https://www.cisco.com/c/en/us/support/docs/field-notices/703/fn70359.html
The conclusions from this are as follows:
- Sometimes neither the release notes nor the Cisco software download pages are updated, and you need to dig further.
- You need to upgrade your software regularly… 16.3.5b was released 2 years ago and has been found to have multiple vulnerabilities anyway.
- That “complete rewrite” of the code from 3.x to 16.x is not exactly a success.
- Spending too much time on researching old software versions may be a waste of time because you need to upgrade anyway.
- Corporate “software upgrade research teams” are crap and cannot be relied upon. I should have received that info from them a long time ago.