Data center management is a full-time job, and we really mean full-time. An uneventful night can quickly become an all-hands-on-deck remediation effort if unplanned downtime arises. Granted, these fiascoes are not necessarily common, but they happen regularly enough, and are costly enough, that they continue to be the scourge of data center management.
In addition to the monetary losses (nearly $9,000 per minute, according to TechTarget), service disruptions may violate SLAs, cause customer churn and, in many cases, inconvenience thousands of people. Case in point: several major airlines have experienced IT downtime in the past few years that grounded tens of thousands of travelers.
To prevent these and other negative consequences of unplanned IT outages, you must address the issues at their source.
What are the top causes of downtime?
Setting aside flooding and other natural disasters, the top four causes of data center downtime are the following:
- UPS failures: Paradoxically, the very equipment that is designed to keep IT systems online when the primary power source fails is the top cause of downtime. Data centers must be able to fail safely. If the mechanisms that are put in place to facilitate redundancy fail, then you don’t stand a chance.
- Human error: Spilled drinks, getting tripped up in tangled cables, miscalculations that lead to unbalanced power loads and other manifestations of human folly are surprisingly common sources of data center downtime.
- Cyberattacks: In the past few years, the number of data center outages caused by DDoS (distributed denial of service), ransomware and other forms of cybercrime have spiked. “Security expert” is yet another hat that data center managers need to wear.
- Overheating: Computing equipment or network switches will rarely cause downtime, but when they do, it’s often in response to overheating. Sometimes this is just because of poor airflow management or inadequate temperature monitoring. But in other cases, CRAC units will fail, causing temperatures to rise, especially in known hot spots.
Clearly, each of these issues needs to be approached in a different way – especially cyberattacks, which require ever more complex threat detection and response strategies. Meanwhile, UPS failure can be addressed through proper maintenance and management of secondary power sources, and optimal configuration of A/B power feeds (more about that here).
That said, other threats – environmental issues such as power and temperature, and even human error – can be mitigated.
1. Temperature and airflow monitoring
Servers and network switches, especially those positioned farthest from the cool-air source, need to be closely monitored for temperature and airflow in real time. For the former, install temperature sensors at a minimum of six points in each rack (the top, middle and bottom of the rack, both front and rear). Configure alarms that will notify data center operators the moment that an allowable or safe temperature threshold is exceeded. This will help you address hot spots as they form.
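The six-sensors-per-rack alarm logic above can be sketched in a few lines. This is an illustrative example, not a vendor API: the sensor positions, the `read_sensor` stub and the 27 °C threshold are assumptions you would replace with your own hardware interface and vendor-specified limits.

```python
SAFE_MAX_C = 27.0  # assumed threshold; use the limit your equipment vendors specify

# Six monitoring points per rack: top, middle and bottom, front and rear
SENSOR_POSITIONS = [
    "front-top", "front-middle", "front-bottom",
    "rear-top", "rear-middle", "rear-bottom",
]

def read_sensor(position: str) -> float:
    """Stub for a real sensor read (SNMP, Modbus, or a vendor API)."""
    raise NotImplementedError

def check_rack(rack_id: str, readings: dict[str, float]) -> list[str]:
    """Return an alarm message for every sensor exceeding the safe threshold."""
    alarms = []
    for position, temp_c in readings.items():
        if temp_c > SAFE_MAX_C:
            alarms.append(
                f"ALERT rack={rack_id} sensor={position} "
                f"temp={temp_c:.1f}C exceeds {SAFE_MAX_C:.1f}C"
            )
    return alarms
```

In practice the alarm messages would be routed to the operators' notification channel (DCIM software, email, pager) rather than returned as strings.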
Speaking of which, hot spots are often caused by cool air that bypasses the IT load, or by warm air that is improperly expelled into return plenums. Either airflow issue can result in an outage of an entire rack if, for instance, a top-of-rack network switch overheats and goes offline. To prevent this from happening, data center operators must make sure that:
- Front-to-back airflow is maintained, using airflow redirecting hardware if necessary.
- Exhaust is adequately expelled.
To achieve the latter, you may require rack-based, active containment (and if you're operating in a high-density environment, you should already be using active containment). This system relies on pressure sensors that control the RPM of fans built into the containment chambers. As air pressure changes, fan speed automatically adjusts to maintain near-zero differential pressure. This helps prevent the recirculation of warm air into the cold aisle.
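The pressure-to-RPM control described above can be sketched as a simple proportional loop. This is a toy model, not a containment vendor's firmware: real units use tuned PID controllers, and the gain and RPM limits below are made-up values.

```python
def adjust_fan_rpm(current_rpm: float, pressure_pa: float,
                   gain: float = 50.0,
                   min_rpm: float = 500.0, max_rpm: float = 3000.0) -> float:
    """Nudge fan speed toward zero differential pressure.

    pressure_pa is the differential pressure at the containment chamber:
    positive (warm exhaust backing up) -> speed the fans up to expel it;
    negative -> slow them down. Output is clamped to the fan's RPM range.
    """
    new_rpm = current_rpm + gain * pressure_pa
    return max(min_rpm, min(max_rpm, new_rpm))
```

A real controller would also add integral and derivative terms so the loop settles at zero pressure without oscillating, which is why containment vendors ship tuned PID loops rather than a bare proportional step like this.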
2. Smarter PDUs
Then there are the issues of unbalanced power loads, tripped-over cables and other data center gaffes.
First and foremost, color-coded power distribution units (PDUs) can help facility operators avoid accidentally unbalancing a PDU as computing equipment is swapped out. This is crucial for several reasons, one of which is that an unbalanced load can cause a UPS failure down the line. If, for instance, the combined load of your A/B power feeds exceeds the capacity of a single feed, what happens when one of those feeds fails? The answer: the surviving feed is overloaded, tripping breakers and dropping the entire load. You can also use the real-time power monitoring embedded in intelligent PDUs to catch these problems early.
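The failover math above is simple enough to automate. The sketch below is a hypothetical check, not a real DCIM feature: the capacity figures are examples, and real limits come from your breaker and UPS ratings, typically derated (the assumed 80% factor mirrors the common North American continuous-load rule).

```python
def survives_feed_loss(load_a_kw: float, load_b_kw: float,
                       feed_capacity_kw: float,
                       derating: float = 0.8) -> bool:
    """True if one feed alone can carry the combined A+B load after a failover.

    derating: fraction of the feed's rated capacity usable for continuous
    load (assumed 0.8 here; use your site's actual rating rules).
    """
    usable_kw = feed_capacity_kw * derating
    return (load_a_kw + load_b_kw) <= usable_kw
```

For example, two feeds rated at 10 kW each can safely back a combined 6.5 kW load (6.5 ≤ 8.0 usable), but not a combined 9 kW load, even though 9 kW fits comfortably when both feeds are up.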
If, however, you prefer to use basic PDUs, the other option is to invest in a brand that enables mobile monitoring. Specifically, you can use a mobile app to scan a live barcode on the PDU, which then provides a one-time smartphone display of up-to-the-second power details for that unit.
Finally, there's the issue of cables coming loose, whether from vibrations or from an under-caffeinated employee stumbling over them. Most best-in-class PDUs come with locking bezels to prevent these mishaps. In addition, you can reduce the total number of cables connecting intelligent PDUs to your network switches by daisy-chaining power strips that support Rapid Spanning Tree Protocol (RSTP). Because the protocol blocks redundant paths and prevents broadcast storms, a series of daisy-chained PDUs can be connected to the network switch at just two points, instead of one uplink per power strip.
At a glance, all of this may seem like a lot of work to set up. However, once these systems are in place, they'll significantly improve the robustness and resilience of your power infrastructure.
Remember, it’s not just data center downtime that’s at stake. It’s your downtime, too.
This guest blog has been written by our Partner Geist.