ACI Leaf switch SSD failure

This is a short post about the SSD issue in ACI leaf switches. Now that ACI fabrics are approaching 5 or 6 years of operation, people are starting to notice fault codes F3073 and F3074. When you Google these faults you’re likely to find this technote from Cisco. The issue is that the SSDs in the switches are nearing the end of their life. Fault F3074 tells you the SSD has reached 80% of its lifetime, and F3073 is raised when the SSD reaches 90%. Once the SSD reaches 100% it will remount itself in read-only mode, which makes the switch unusable.

The full names of these faults are:

  • F3074: fltEqptFlashFlash-minor-alarm (80% lifetime)
  • F3073: fltEqptFlashFlash-worn-out (90% lifetime)

Figuring out the current SSD status of your switches

Your first question reading this is probably “What is the lifetime on my switches?” The second will likely be “How much time do my switches have left before this becomes an issue?”

To find the current lifetime counter for a switch, run the following command on the switch itself:

moquery -c eqptFlash

This command will give you some information about that specific switch. The output looks as follows:

Total Objects shown: 1

# eqpt.Flash
acc          : read-write
cap          : 61057
childAction  :
deltape      : 112
descr        : flash
dn           : sys/ch/supslot-1/sup/flash
gbb          : 0
id           : 1
lba          : 0
lifetime     : 10
majorAlarm   : no
mfgTm        : 2020-06-09T05:26:15.695+01:00
minorAlarm   : no
modTs        : 2021-03-08T05:10:37.895+01:00
model        : MODEL
monPolDn     : uni/fabric/monfab-default
operSt       : ok
peCycles     : 3622
readErr      : 0
rev          : MU01.00
rn           : flash
ser          : SERIALNUMBER
status       :
tbw          : 12.378319
type         : flash
vendor       : VENDOR
warning      : no
wlc          : 0
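
If you have more than a handful of switches to check, the key/value layout above is easy to parse. Below is a minimal Python sketch, assuming you have captured the moquery output as plain text yourself. The function name parse_moquery is my own invention and not part of any Cisco tooling.

```python
def parse_moquery(text):
    """Turn moquery 'key : value' output into a list of dicts, one per object."""
    objects = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#"):
            # A line such as "# eqpt.Flash" starts a new object.
            current = {}
            objects.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only, so timestamps stay intact.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return objects


if __name__ == "__main__":
    sample = """Total Objects shown: 1

# eqpt.Flash
acc          : read-write
lifetime     : 10
mfgTm        : 2020-06-09T05:26:15.695+01:00
operSt       : ok
"""
    flash = parse_moquery(sample)[0]
    print("lifetime:", int(flash["lifetime"]), "operSt:", flash["operSt"])
```

The field you care about is lifetime. All values come back as strings, so convert with int() before comparing against the 80% and 90% thresholds.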

This specific switch has a lifetime of 10%, so I’m fairly confident it has a long life ahead of it. But what if you’re already at 70%? How can you find out, based on information from the switch itself, how much time you have left? For a regular user it’s difficult to give an exact number, but we can get an indication with some simple math.

For this we have to figure out how many days it takes for a single percent of lifetime to be consumed. Luckily, since version 3.2(5d) switches perform SSD lifetime logging: the switch records its current lifetime every day. We can use this information to work out how many days it takes to consume a single percent. I recommend taking the average consumption over the last 5 to 10 percent, since SSD usage can differ depending on switch load and software version. Going back too far might cause you to either overestimate or underestimate the number of days you have left.

This log file is located at /mnt/pss/ssd_log_amp.log. You can view this file without root privileges. Depending on the running version the format of the log file may differ, but it should contain a column called “lifetime”. That is the lifetime as registered by the switch on the reported date. Each line in the log file is one day, so count the lines between two lifetime values to get the number of days.

For example, on my switch it takes 23 days to consume a single percent of lifetime. With 90% still remaining, extrapolating that number means it will take about 2070 days (90 × 23) to get to 100% from where I am now, which is about 5.7 years. This assumes linear usage of the SSD, which might not be the case. However, the closer you are to the end of the SSD’s life, the more accurate this estimate becomes. If this switch were at 90% and still using 1% per 23 days, it would take about 230 days to reach 100%. That would mean I have to fix the issue pretty fast.
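
The arithmetic above can be sketched in Python. This is a rough illustration, not a tool: it assumes you have already extracted (date, lifetime) pairs from the log yourself, since the log format differs between versions. The function names and the window parameter are made up for this example.

```python
from datetime import date, timedelta


def days_per_percent(samples, window=5):
    """Average days per percent of lifetime over the last `window` percent.

    samples: iterable of (date, lifetime_percent) pairs, oldest first.
    """
    # Remember the first day on which each lifetime value was logged.
    first_seen = {}
    for day, pct in samples:
        first_seen.setdefault(pct, day)
    newest = max(first_seen)
    # Lowest logged value that still falls inside the window.
    start = min(p for p in first_seen if p >= newest - window)
    if start == newest:
        raise ValueError("need at least two distinct lifetime values")
    return (first_seen[newest] - first_seen[start]).days / (newest - start)


def days_remaining(current_pct, rate):
    """Linear extrapolation to 100% at `rate` days per percent."""
    return (100 - current_pct) * rate


if __name__ == "__main__":
    # Hypothetical history: one sample per day, 1% consumed every 23 days.
    history = [(date(2021, 1, 1) + timedelta(days=i), 10 + i // 23)
               for i in range(23 * 6 + 1)]
    rate = days_per_percent(history)
    print("days per percent:", rate)
    print("days left at 10%:", days_remaining(10, rate))
    print("days left at 90%:", days_remaining(90, rate))
```

Using the first day each value appears avoids counting a partially consumed percent at the end of the log. If the log starts in the middle of a percent, the oldest value is slightly off, which is one more reason to treat the result as an indication only.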

Solution

So what can we do to fix this issue? Unfortunately, not much. As Cisco itself says in the technote:

Can the SSD be replaced in the field?

No. The SSD is not a field replaceable unit. Entire Chassis will have to be RMAed when the failure is on the leaf. On modular Spines, you will have to RMA the supervisor.

So, the whole switch needs to be replaced. Contact your partner or Cisco TAC to start an RMA process. If you’re running gen1 hardware (which is likely if you are running into this issue), you might consider replacing the switches with a newer model. Gen1 switches are not supported in ACI 5 and onward.