Application ApicVision is not healthy

Recently we upgraded customers to ACI 5.2(2f) and plugins were failing, including the installation of a new plugin, the ExternalSwitch app used for UCSM/UCS automation.

Anyway here is the fix:

Symptom: App installation/enable/disable takes a long time and does not complete. acidiag scheduler images does not list the app container images, even after a long time.

Conditions: Any 5.2.1.x, 5.2.2.x, 5.2.3.x, or 5.2.4.x image can run into this issue. It can affect any scale-out app, including NIR and ExternalSwitch. This usually happens when there is an unstable app container that keeps restarting; acidiag scheduler containers shows some containers constantly restarting. (The command is available from the 5.2.3.x release.)
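For reference, these are the two checks I lean on to spot the condition; the prompt is just an example hostname, and the containers sub-command only exists from 5.2(3) onward:

apic1# acidiag scheduler images
apic1# acidiag scheduler containers

If images never shows the app container images and containers shows the same container restarting over and over, you are most likely hitting this issue.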

Workaround: Run the following on APIC 1:

1) acidiag scheduler disable

2) acidiag scheduler cleanup force

3) acidiag scheduler enable

When you detect this issue, disable any apps, especially those that can be identified as constantly restarting, and then proceed with these steps. If that is not possible, simply perform the steps after the downgrade completes.

The workaround is to be performed ONLY ONCE, and only when the APIC cluster is in a healthy state according to acidiag avread. All APICs should have the same version and health=255. After the workaround, wait 30-45 minutes before checking the status of the app.
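As a quick pre-check (prompt name is just an example), run the health read on any APIC and confirm every controller shows the same version and health=255 before proceeding:

dev1-ifc1# acidiag avread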

dev1-ifc1# acidiag scheduler cleanup force
Stopped all apps
[True] APIC-01 - Disabled scheduler
[True] APIC-02 - Disabled scheduler
[True] APIC-03 - Disabled scheduler
[True] APIC-01 - Cleaned up scheduler (forced)
[True] APIC-02 - Cleaned up scheduler (forced)
[True] APIC-03 - Cleaned up scheduler (forced)

dev1-ifc1# acidiag scheduler enable
Enabling scheduler:
[True] APIC-01 - Enabled scheduler
[True] APIC-02 - Enabled scheduler
[True] APIC-03 - Enabled scheduler

Benefits of using a Standby Cisco APIC Controller

So you may wonder why you should consider purchasing a standby APIC controller for your current APIC cluster. A fair question is why you would need a standby APIC if you already have a Cisco 24x7x4 SmartNet contract. Well, SmartNet is great, but:

  1. We are currently experiencing hardware shortages, so Cisco sometimes cannot commit to getting you a replacement APIC in 4 hours or less.
  2. If you do get a replacement APIC, it will most likely not ship with the code version of your current APIC cluster. As a result, you will spend time updating the CIMC (~1 hour) plus updating the APIC application code version to match; how long that takes depends on the code versions involved. In some cases the appliance may ship with a newer code version and you will have to downgrade instead. You cannot assume that SmartNet 24x7x4 means you will have an operational APIC within 4 hours, because code upgrades/downgrades will be required.
  3. When you get a replacement APIC, you must still take the time to unrack the failed APIC and install the new one, so there is physical labor involved in provisioning the replacement.
  4. If you have a standby APIC, it gets upgraded automatically during your upgrade cycle, so if an APIC in the cluster fails, you can easily replace the failed unit with the standby APIC controller. No new physical connections are needed, as all of this has already been planned and provisioned.

Cisco ACI APIC OOB Management MAC Address is flapping

Recently I came across a request to rename the APIC controllers, and directly after the APICs were renamed and rebooted, OOB management access started flaking out. Basic PING tests revealed OOB reachability issues. What I discovered was that the bond1 interface was consistently failing over between eth1/1 and eth1/2, so the OOB management switch would continuously relearn the MAC address on these ports, and this created a management access issue.

The ACI code version this was experienced with was 5.2(6e).

Cisco APIC Physical Interfaces

I confirmed this by checking where the bond1 MAC address was being learned; after multiple refreshes on the management switch, it was obvious that it was flapping back and forth between the two ports. This caused the PING tests to fail from time to time.
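If you want to run the same check, this is roughly what it looks like. The MAC address and interface are placeholders for your environment; the MAC table lookups run on the Catalyst OOB management switch, and the bonding status is read from the APIC's underlying Linux (bash) shell:

OOB-MGMT-SWITCH# show mac address-table address aaaa.bbbb.cccc
OOB-MGMT-SWITCH# show mac address-table interface GigabitEthernet1/0/1

apic1# cat /proc/net/bonding/bond1

Repeating the MAC table lookup a few times shows the learned port changing, and /proc/net/bonding/bond1 shows which bond member is currently active on the APIC side.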

I did some web searching on this issue and found a reference from an older 1.x release describing LLDP issues when changing the hostname, where the Cisco fix was to shut down one of the bond1 member interfaces. That was not acceptable for my customers, so I kept digging, and the ultimate fix was to decommission one APIC at a time, wipe it, and re-add it to the cluster. After these steps were performed, the APIC OOB MAC flapping was resolved.

Steps

  1. Document the preferred hostname, Fabric name, Fabric ID, POD #, VTEP pool, OOB MGMT IP and Gateway, the Infra VLAN used and the local admin credentials. These are the parameters you will need to recommission the APICs.
  2. Decommission one APIC at a time from the cluster. Wait ~5 minutes to make sure the change is fully replicated.
  3. Console into the target APIC via CIMC and wipe it using the commands below:
apic# acidiag touch clean
This command will wipe out this device. Proceed? [y/N] y
apic# acidiag touch setup
This command will reset the device configuration, Proceed? [y/N] y
apic# acidiag reboot
This command will restart this device, Proceed? [y/N] y

4. After the APIC reboots, the initial setup dialog should run and you can reprovision the APIC via CIMC using the parameters documented in step 1. Wait ~2-3 minutes for the APIC to converge after the settings are applied.

5. Finally, commission the target APIC in the GUI by simply right-clicking the old APIC in the list and clicking Commission. Even though the old APIC name may still be shown, after ~5 minutes the recommissioned APIC will converge into the cluster and be displayed properly.

I am also sometimes asked how the OOB management switch ports should be configured for the APIC OOB bond1 interfaces, since they are "bonded". These ports should be treated like regular access ports; no port-channel is configured on the switch side.

!---OOB-MGMT-SWITCH---
!
interface GigabitEthernet1/0/1
 description <APIC-HOSTNAME> OOB bond1 management interface
 switchport mode access
 switchport access vlan <OOB-MGMT-VLAN>
 spanning-tree portfast
!

Reason to use Cisco Nexus instead of Cisco Catalyst in the Data Center

Time and time again I have customers wanting to understand the true benefit of a Cisco Nexus switch versus a Cisco Catalyst switch in the data center for connecting servers. Customers may argue that they just need simple 1G or 10G speeds, dual-homed with a port-channel, and they can achieve this simply using a stack of Catalyst 9300 or Catalyst 9500 switches. So here are a few reasons to ponder:

Code upgrade

If you have servers that are dual-homed for redundancy across two separate Cisco Catalyst switches that are stacked so you can leverage a port-channel, that sounds fine and dandy. But when it comes time to upgrade the switches, and there will be a time you have to upgrade, the whole stack needs to be reloaded, resulting in an outage for your servers. This is not the case with Cisco Nexus, where vPC lets the server port-channel span two independent switches that can be upgraded one at a time, as sketched below.
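Here is a minimal vPC sketch of what that looks like on one of the two Nexus switches; the hostnames, VLAN, interface numbers, and keepalive IPs are made up for the example, and the second switch mirrors the config with its own addresses:

! NEXUS-A (NEXUS-B mirrors this with its own keepalive addresses)
feature lacp
feature vpc
!
vpc domain 10
  peer-keepalive destination 192.168.1.2 source 192.168.1.1
!
interface port-channel10
  description vPC peer-link to NEXUS-B
  switchport mode trunk
  vpc peer-link
!
interface Ethernet1/47-48
  description vPC peer-link member links
  switchport mode trunk
  channel-group 10 mode active
!
interface port-channel20
  description dual-homed server (one member link on each Nexus)
  switchport mode access
  switchport access vlan 100
  vpc 20
!
interface Ethernet1/1
  description server NIC
  switchport mode access
  switchport access vlan 100
  channel-group 20 mode active
!

Because each Nexus is an independent control plane, you can reload NEXUS-A for its upgrade while the server keeps forwarding over its member link on NEXUS-B, then repeat on the other switch.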

Lower latency = better performance

VeloCloud in AWS

After a few hours of troubleshooting, I found out that when using the 3.3 brownfield CloudFormation template, entering the VCO as an IP address does not work; you must use the FQDN for the VCO instead of the IP. I also made sure to set the version to 331 instead of 321, and used the c5.4xlarge instance type. After the vEdge joins the orchestrator, you can then upgrade it to newer code.