CUOS Certificate Expiration Issues and Regeneration

I arrived on site at a remote client picking up a deployment that was done some 18 months ago for additional scope of work.  However, when I arrived to begin, I encountered some strange behaviour that I struggled to attribute to anything obvious.

What I saw was:

 

  • High number of AXL connections on CUCM:

admin:utils diagnose test

Log file: platform/log/diag3.log

Starting diagnostic test(s)
===========================
test – disk_space : Passed (available: 1796 MB, used: 12360 MB)
skip – disk_files : This module must be run directly and off hours
test – service_manager : Passed
test – tomcat : Passed
test – tomcat_deadlocks : Passed
test – tomcat_keystore : Passed
test – tomcat_connectors : Passed
test – tomcat_threads : Passed
test – tomcat_memory : Passed
test – tomcat_sessions : FailedThe following web applications have an unusually large number of active sessions: axl. Please collect all of the Tomcat logs for root cause analysis: file get activelog tomcat/logs/*
skip – tomcat_heapdump : This module must be run directly and off hours
test – validate_network : Passed
test – raid : Passed
test – system_info : Passed (Collected system information in diagnostic log)
test – ntp_reachability : Passed
test – ntp_clock_drift : Passed
test – ntp_stratum : Passed
skip – sdl_fragmentation : This module must be run directly and off hours
skip – sdi_fragmentation : This module must be run directly and off hours

 

  • Backups Failing

 

  • Jabber services listed as” UNKNOWN” from the GUI, even though listed as “STARTED” from Serviceability and from CLI:

 

Selection_006

 

  • Immediate Switch-Version failed after an upgrade due to SELinux failing to upgrade correctly, causing differing behaviour on different servers:
    – Java errors on login to server and CLI scripts unable to run due to permissions issues
    – A Cisco DB service refusing to start!
    – Kicked out of ssh session directly after authentication – SELinux issue

 

 

The SELinux issue actually prevented a switch-back as this could not be initiated from CLI.  To resolve this I had to complete a manual partition switch using the CUCM/CUC Recovery CD as well as a File System Check and Repair!

 

This was all very, very strange!

 

Once back to a “stable” platform after the partition swap and file-system check, I again attempted an upgrade, which this time failed:

04/20/2016 19:46:49 upgrade_install.sh|Started auditd…|<LVL::Info>
04/20/2016 19:46:49 upgrade_install.sh|Started setroubleshoot…|<LVL::Info>
04/20/2016 19:46:49 upgrade_install.sh|Changed selinux mode to enforcing|<LVL::Info>
04/20/2016 19:46:49 upgrade_install.sh|Cleaning up rpm_archive…|<LVL::Info>
04/20/2016 19:46:49 upgrade_install.sh|Removing /common/rpm-archive/10.5.2.12901-1|<LVL::Info>
04/20/2016 19:46:50 upgrade_install.sh|File:/usr/local/bin/base_scripts/upgrade_install.sh:604, Function: main(), Upgrade Failed — (1)|<LVL::Error>
04/20/2016 19:46:50 upgrade_install.sh|Parse argument status=upgrade.stage.error|<LVL::Debug>
04/20/2016 19:46:50 upgrade_install.sh|_set_upgrade_status_attribute: status set to upgrade.stage.error|<LVL::Debug>
04/20/2016 19:46:50 upgrade_install.sh|is_upgrade_lock_available: Upgrade lock is not available.|<LVL::Debug>
04/20/2016 19:46:50 upgrade_install.sh|is_upgrade_in_progress: Already locked by this process (pid: 32606).|<LVL::Debug>
04/20/2016 19:46:50 upgrade_install.sh|release_upgrade_lock: Releasing lock (pid: 32606)|<LVL::Debug>

 

This time, I had selected an immediate switch-version instead of using a 2-stage process after a successful upgrade.  Again, I could confirm a successful upgrade:

 

04/20/2016 19:39:01 post_upgrade|Post Upgrade RTMTFinish|<LVL::Notice>
04/20/2016 19:39:01 post_upgrade|========================= Upgrade complete. Awaiting switch to version. =========================|<LVL::Info>
04/20/2016 19:39:01 upgrade_manager.sh|Post-upgrade processing complete|<LVL::Info>
04/20/2016 19:39:01 upgrade_manager.sh|Application install on inactive partition complete|<LVL::Info>
04/20/2016 19:39:01 upgrade_manager.sh|L2 upgrade… Run selinux_on_inactive_partition|<LVL::Info>

 

So now, I concentrated on the post-upgrade switch-version logging, and picked up on this:

 

04/20/2016 19:46:38 upgrade_manager.sh|(CAPTURE) /sbin/restorecon.orig set context /etc/opt/cisco/elm/client/.security/trust_certs/Cisco_Root_CA_M1.pem->system_u:object_r:cisco_etc_t:s0 failed:’Operation not permitted’|<LVL::Debug>
04/20/2016 19:46:38 upgrade_manager.sh|(CAPTURE) /sbin/restorecon.orig set context /etc/opt/cisco/elm/client/.security/trust_certs/Cisco_Root_CA_M1.der->system_u:object_r:cisco_etc_t:s0 failed:’Operation not permitted’|<LVL::Debug>
04/20/2016 19:46:38 upgrade_manager.sh|(CAPTURE) /sbin/restorecon.orig set context /home/sftpuser/sftp_connect.sh->specialuser_u:object_r:user_home_t:s0 failed:’Operation not permitted’|<LVL::Debug>

 

It was then that I became very confused.  These strange issues together all pointed to **certificate expiration**.  However, the system had been installed less than 18 months ago, so this didn’t make sense!  I checked the certificates and confirmed – all expired.  Then it dawned on me.  When we had installed the system, we had staged the network and voice at the same time.  We’d pointed NTP to the firewall – which must have, prior to go-live, been set as a master by our firewall team … with the wrong date-time! 😦

So, root cause found, but now to resolve?!

 

Certificates are a tricky business in CUCM since the implementation of Security by Default.  I don’t intend to cover this topic here (it’s absolutely huge), but I will say this – changing certificates on CUCM can run the risk of getting you fired.

If you don’t know 100% what you are doing, you could cause an outage that will require serious skill to repair, an emergency TAC engagement,  manual phone intervention or a combination of the 3.

 

So, I now needed to complete Certificate Regeneration on CUCM, CUC and IM&P.

Some notes here:

  • My cluster was not in Mixed Mode – my life just got a whole lot easier!
  • The client didn’t have a Private CA, so we were forced to use Self-Signed.  A major consideration if you have signed certs.
  • I was working on a BE6K – with no redundancy – so for me…  ITL files are going to be an issue.  I must to do a “Pre- 8.0 Rollback” and blank all the ITL files before I start.
  • Multi-App integrations are probably going to require manual certificate download/import between clusters for IM&P

 

Following the above process, I carefully:

  1. Blanked all ITL files, with necessary TVS and TFTP service restarts as required
  2. Stopped TFTP Services
  3. Regenerated all applicable certificates as per the above guide
  4. Manually deleted “-trust” certificates as per the guide – I preferred the GUI to the CLI for this – watch out for related bugs in the above guide.
  5. Restarted all applicable services
  6. Removed Pre-8.0 Rollback Enterprise Parameter ITL blanking
  7. Again, restarted TVS and TFTP services
  8. Imported regenerated certs for intra- and inter-cluster communications

 

An Important Note:

When restart TFTP, always wait 5-10mins for the TFTP files to rebuild on the server.  More haste, Less-Job-Come-Monday.

 

Rebooted IM&P, and immediately saw that all the affected IM&P services were listed correctly:

Selection_010.png

 

Helpful links on SBD (Security by Default) and CUOS Certificates in General:

 

 

Check out the ITLRecovery enhancements in 10.x and above!

 

#dontcalltac

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s