Application ApicVision is not healthy

Recently we upgraded customers to ACI 5.2(2f) and plugins were failing, including the installation of new plugins such as the ExternalSwitch app used for UCSM/UCS automation.

Anyway, here is the fix:

Symptom: App installation, enable, or disable takes a long time and does not complete, and acidiag scheduler images does not list the app container images even after a long time.

Conditions: Any 5.2.1.x, 5.2.2.x, 5.2.3.x, or 5.2.4.x image can run into this issue. It can affect any scale-out app, including NIR and ExternalSwitch, and it usually happens when there is an unstable app container that keeps restarting; acidiag scheduler containers shows some containers constantly restarting (the command is available from the 5.2.3.x release).
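
A quick way to spot the unstable app before touching anything (on 5.2.3.x or later, where the command exists) is to run the container listing a couple of times and watch for restart counts that keep climbing:

dev1-ifc1# acidiag scheduler containers

Whichever app's containers keep restarting between runs is the one to disable first.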

Workaround (run on APIC 1):

1) acidiag scheduler disable

2) acidiag scheduler cleanup force

3) acidiag scheduler enable

When you detect such an issue, disable any apps, especially ones that can be identified as constantly restarting, and then proceed with these steps. If this is not possible, simply perform the steps after the downgrade completes.

The workaround is to be performed ONLY ONCE, and only while the APIC cluster is in a healthy state according to acidiag avread: all APICs should have the same version and health=255. After the workaround, wait 30-45 minutes before checking the status of the app.
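
A quick pre-check on APIC 1 before running the workaround (look at the per-APIC version and health fields, per the note above):

dev1-ifc1# acidiag avread

Every APIC should report the same software version and health=255; if not, fix the cluster first and do not run the cleanup.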

dev1-ifc1# acidiag scheduler cleanup force
Stopped all apps
[True] APIC-01 - Disabled scheduler
[True] APIC-02 - Disabled scheduler
[True] APIC-03 - Disabled scheduler
[True] APIC-01 - Cleaned up scheduler (forced)
[True] APIC-02 - Cleaned up scheduler (forced)
[True] APIC-03 - Cleaned up scheduler (forced)

dev1-ifc1# acidiag scheduler enable
Enabling scheduler:
[True] APIC-01 - Enabled scheduler
[True] APIC-02 - Enabled scheduler
[True] APIC-03 - Enabled scheduler

Reasons to use Cisco Nexus instead of Cisco Catalyst in the Data Center

Time and time again I have customers wanting to understand the true benefit of a Cisco Nexus switch versus a Cisco Catalyst switch in the data center for connecting servers. Customers may argue that they just need simple 1G or 10G speeds, dual-homed with a port-channel, and that they can achieve this simply using a stack of Catalyst 9300 or Catalyst 9500 switches. So here are a few reasons to ponder:

Code upgrade

If you have servers that are dual-homed for redundancy across two separate Cisco Catalyst switches, and those switches are stacked because you want to leverage a port-channel, that sounds fine and dandy. But when it comes time to upgrade the switch (and there will be a time you have to upgrade), the whole stack needs to be reloaded, resulting in an outage to your servers. This is not the case with Cisco Nexus: the two switches in a vPC pair each run their own control plane, so they can be upgraded one at a time while the server stays up through the other peer.
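
As a rough sketch of what that dual-homed server looks like on the Nexus side with vPC (interface numbers, VLANs, and addresses here are made up, and the mirror-image config goes on the second peer):

feature lacp
feature vpc

vpc domain 10
  peer-keepalive destination 192.168.0.2 source 192.168.0.1

interface port-channel1
  switchport mode trunk
  vpc peer-link

interface Ethernet1/53-54
  switchport mode trunk
  channel-group 1 mode active

interface port-channel20
  switchport mode trunk
  switchport trunk allowed vlan 100
  vpc 20

interface Ethernet1/20
  description server NIC (the server's second NIC lands on vpc 20 on the peer switch)
  switchport mode trunk
  channel-group 20 mode active

Because each vPC peer runs its own control plane, you can reload or upgrade one switch while the server keeps forwarding through the other.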

Lower latency = better performance

VeloCloud in AWS

After a few hours of troubleshooting, I found out that when using the 3.3 brownfield CloudFormation template, entering the VCO as an IP does not work; you must use the FQDN instead of the IP for the VCO. I also made sure to set the version to 331 instead of 321, with an instance type of c5.4xlarge. After the vEdge joins the orchestrator, you can then upgrade it to newer code.
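
For reference, a hypothetical sketch of launching that stack with the AWS CLI; the parameter keys (VCO, SoftwareVersion, InstanceType) and the template file name below are made up for illustration, so map them to whatever the actual 3.3 brownfield template exposes:

aws cloudformation create-stack \
  --stack-name velocloud-vedge \
  --template-body file://velocloud-3.3-brownfield.json \
  --parameters ParameterKey=VCO,ParameterValue=vco.example.com \
               ParameterKey=SoftwareVersion,ParameterValue=331 \
               ParameterKey=InstanceType,ParameterValue=c5.4xlarge

The point is simply that the VCO value is an FQDN, the version is 331, and the instance type is c5.4xlarge.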

CMLv2 Node QUEUED

If you're wondering why you can't get the nodes past “QUEUED” in CML, it's because the images aren't loaded.

  1. Make sure your refplat-xxx-fcs ISO is mounted under the VM's CD/DVD drive.
  2. Log in as sysadmin at <ip>:9090.
  3. Open a terminal and run sudo /usr/local/bin/copy-refplat-iso-to-disk.sh

PAN and BFD

Setting up a BFD session between a Palo Alto firewall and a Cisco ACI leaf or general Nexus switch

If Device A (e.g., a Palo Alto) does not support BFD echo and only supports BFD control packets, Device B (e.g., a Cisco switch) will not use BFD echo and will only use BFD control packets for the session. As a result, the hold-down time equals the highest transmit interval between the two peers multiplied by the multiplier. Without BFD echo, the hold-down time is how long a BFD peer will wait before declaring the BFD session down.
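
As a concrete example of that math: if both peers end up transmitting control packets every 900 ms with a multiplier of 3, the detection (hold-down) time is 900 ms x 3 = 2,700 ms, so a neighbor is declared down roughly 2.7 seconds after its last control packet arrives.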

Another consideration is that, depending on the Palo Alto model, high control-plane CPU will affect BFD and may tear your adjacency/peering down.

I have tested 16 eBGP peers on a Palo Alto 3220 connected to ACI leaf-A and another 16 eBGP peers on the same Palo Alto connected to ACI leaf-B. If the BFD timers were anything below 900 x 3, then after a reload of leaf-A or leaf-B the Palo Alto would randomly bring down eBGP neighbors toward ACI leaf-B, even though no issue occurred between the PAN and leaf-B. BFD would tear down because of a control-plane spike, as the PAN must be processing BFD in software. The only acceptable timers were 900 x 3; anything lower and the Palo Alto would tear down BFD, which would bring down the eBGP peering.
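
For reference, a minimal sketch of the 900 x 3 timers on a standalone NX-OS switch (the interface, AS numbers, and addresses are made up); on ACI the equivalent is a BFD Interface Policy with the same intervals and multiplier applied to the L3Out interface profile:

feature bfd
feature bgp

interface Ethernet1/1
  no switchport
  ip address 10.1.1.1/30
  bfd interval 900 min_rx 900 multiplier 3

router bgp 65001
  neighbor 10.1.1.2
    remote-as 65002
    bfd
    address-family ipv4 unicast

On the Palo Alto side, the same intervals would go into the BFD profile applied to the BGP peering.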