Efficiency gains: predictive maintenance supports data center operations

Using streaming analytics to replace equipment just before it fails is a smart move for data center operators.
28 July 2022

Full-time job: keeping data centers up and running requires monitoring not just computing hardware, but other key infrastructure such as generators and cooling equipment. Image credit: Google.

Saving data to the cloud – through services such as Microsoft’s OneDrive, Apple’s iCloud, Dropbox, Google Drive, and others – has become second nature. And, as users focus on the perks of accessing files across multiple devices (while taking comfort from the seamless backing up of their data), it’s easy to forget that this digital information is sitting somewhere on a physical storage device. Or, more likely, on several devices, as data mirroring routines whir away, syncing information across different locations to ensure efficient data recovery.

This behind-the-scenes activity needs power, which is supplied through a chain of generators, transfer switches, transformers, and power distribution units, all backed up with uninterruptible power supplies for good measure. And, in the main, everything performs beautifully, like a modern concert orchestra playing an extremely challenging symphony. Tier 4 data centers, which are built to be completely fault-tolerant, have an expected uptime of 99.995% – a figure that translates to just 26.3 minutes of downtime per year, a truly remarkable achievement.
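
The downtime figure follows directly from the uptime percentage. As a minimal sketch of the arithmetic, assuming a 365-day year (the function name is purely illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def annual_downtime_minutes(uptime_percent: float) -> float:
    """Return the expected minutes of downtime per year for a given uptime percentage."""
    unavailable_fraction = (100 - uptime_percent) / 100
    return MINUTES_PER_YEAR * unavailable_fraction


# Tier 4 target quoted in the article
print(round(annual_downtime_minutes(99.995), 1))  # prints 26.3
```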

Infrastructure insights

Critical to achieving such high levels of data center availability is the rise of predictive maintenance systems and the algorithms that power them. Thanks to these smart software tools, engineering teams can better forecast potential equipment failures and remedy incidents before they occur. It’s a welcome scenario, as running a modern data center is a balancing act. On the one hand, there is the need to keep energy costs down. On the other, computer components are fussy about temperature, and failure rates climb rapidly when conditions drift outside ideal operating windows, which explains why operators value reliable heating and cooling so highly.

In tandem with regular inspections, predictive maintenance helps staff not just to spot when a hard drive or processor is about to fail, but also to flag issues with compressors, air filters, fan units and other unsung equipment that operates in the background to keep cloud services running. And trends such as the growth in networked and automated sensing make it a breeze to gather operating data (the oxygen that allows machine learning to breathe), such as temperatures, air flow rates, power consumption and other facilities information.

Streaming analytics

These streaming analytics provide essential training data for artificial intelligence routines primed to scour the information in search of anomalies – rare events – that could signpost impending equipment failure or malfunction. Such outliers are easy for machine learning to latch onto, and spotting them automatically saves maintenance crews from scrolling through screens and screens of data. Also, the more data that you can feed the algorithm, the better it becomes at setting accurate thresholds for when parts should be replaced.
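
To make the idea of spotting outliers in a sensor stream concrete, here is a minimal sketch. It uses a simple rolling z-score rule rather than a trained machine learning model, and it is not taken from any operator's system: the class name, window size, threshold and temperature readings are all illustrative assumptions.

```python
from collections import deque
from statistics import mean, stdev


class RollingAnomalyDetector:
    """Flag readings that sit far from the recent average (rolling z-score)."""

    def __init__(self, window: int = 20, threshold: float = 3.0, min_samples: int = 5):
        self.readings = deque(maxlen=window)  # keeps only the most recent readings
        self.threshold = threshold            # allowed distance, in standard deviations
        self.min_samples = min_samples        # readings needed before judging anything

    def update(self, value: float) -> bool:
        """Return True if value is an outlier versus the recent window, then store it."""
        is_anomaly = False
        if len(self.readings) >= self.min_samples:
            mu = mean(self.readings)
            sigma = stdev(self.readings)
            if sigma > 0 and abs(value - mu) > self.threshold * sigma:
                is_anomaly = True
        self.readings.append(value)
        return is_anomaly


# Hypothetical inlet temperatures in degrees Celsius, with one obvious spike
temperatures = [22.0, 22.1, 21.9, 22.2, 22.0, 21.8, 22.1, 35.0, 22.0]
detector = RollingAnomalyDetector()
flags = [detector.update(t) for t in temperatures]
print([i for i, flagged in enumerate(flags) if flagged])  # prints [7]
```

In this sketch a flagged reading is still added to the window, which keeps the code short but lets a spike widen the threshold for a while afterwards. A production system would more likely exclude or down-weight flagged values, and would tune the window and threshold against historical failure data.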

Clever data center designs keep disruption to an absolute minimum when it’s time to carry out repairs – for example, by engaging uninterruptible power supplies and having redundant heating and cooling arrangements to facilitate concurrent maintenance. Increasingly, equipment designs themselves are able to self-diagnose faults or remind teams that consumable items need to be ordered or topped up – which all helps, considering the scale of the task.

Facebook owner Meta operates 21 data centers in the US, Europe and Singapore, but the maintenance does not stop there. The social media heavyweight also runs 60 solar plants, 19 wind power sites, 9 water restoration facilities and 1 heat recovery plant, in order to keep Instagram and its other products online.

Constantly learning

Equipment that can require higher levels of maintenance includes chillers, humidifiers and other elements that are essential to keeping computer rooms air-conditioned so that servers can hum away nicely in their racks. Design updates bring improvements, but also mean that predictive maintenance routines need to be constantly learning to account for changes in the setup.

Focusing on data storage itself, solid state drives (SSDs) are putting up strong competition to traditional hard disk drives (HDDs) thanks to snappier data access. There’s also evidence that SSDs could bring reliability gains, according to findings by Google engineers who teamed up with researchers from the University of Toronto in Canada to crunch the numbers. Based on six years of flash storage operating data, the group found that SSDs needed to be replaced less often than HDDs. Other studies also come out in favor of SSDs, but things unravel rapidly if the flash memory is allowed to get too hot – echoing the earlier point that equipment failure has strong ties to temperature.

Other markets

Interest in predictive maintenance is booming as operators chase competitive uptime targets and look to extend the lifetime of their equipment assets. And what’s working for data centers can work elsewhere too. Airlines have long used statistical methods to optimize jet engine maintenance, and other big sectors include manufacturing as well as healthcare – a fast-growing market for AI-driven reliability improvements, according to analysts.

The total size of the predictive maintenance market depends on who you ask, but estimates place it at between US$4 billion and US$7 billion a year, and research firms all seem to agree that it’s likely to triple in size over the next five years. What’s more, the technique fits nicely in the portfolio of features offered by computerized maintenance management systems, which are rapidly becoming a cornerstone of efficient facilities management.
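
For a sense of what tripling in five years means year to year, the implied compound annual growth rate can be worked out as below. This is a back-of-the-envelope sketch that assumes steady growth, which the estimates themselves do not claim; the function name is illustrative.

```python
def implied_annual_growth(multiple: float, years: int) -> float:
    """Return the constant yearly growth rate that yields the given multiple."""
    return multiple ** (1 / years) - 1


rate = implied_annual_growth(3, 5)
print(f"{rate:.1%}")  # prints 24.6%

# Projecting the article's range (US$ billions) forward five years at that rate
for start in (4, 7):
    print(start, "->", round(start * (1 + rate) ** 5, 1))
```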