Lessons learned from running the world’s largest data centers

Mon, 15th Jan 2018


While managing facility operations for large data centers certainly takes specialized skills in a range of disciplines, the more you do it, the better you get at it.

Given that Schneider Electric has more than 800 people managing facility operations for some 100 large data centers around the globe, it's fair to say we've learned a great deal.

In fact, I recently viewed a webinar that a colleague of mine presented on the topic, “Lessons Learned from Running the World's Largest Data Centers.”

In this post, I'll pass along at least a few of those lessons (and invite you to check out the webinar for the rest).

Most of the lessons we've learned fall into one of five general categories:

Competency
Standardization
Risk management
Tracking and reporting
Operation and maintenance costs

Competency

In terms of competency, the main issue is that most companies' expertise lies in areas other than managing data centers, a topic we covered in a previous post.

That's as it should be.

If you're in, say, retail, healthcare or manufacturing, your expertise lies in those areas; the data center is merely a supporting function.

But it's an issue if you want to run the data center using internal employees, because there isn't a large pool of trained data center staff to pull from. I've been to conferences where entire panels have been dedicated to the issue of training millennials in data center operations. Universities are only now starting programs to address the issue.

As a result, we routinely see companies with data center infrastructure management (DCIM) and other tools installed, but they're not using them to their full extent – because they simply don't have the appropriate expertise.

Standardization

With respect to standardization, companies tend to run into trouble after mergers and acquisitions, or if they experience rapid growth.

They wind up with a series of data centers, with no common set of standards in terms of how to operate them.

No matter if you've got two data centers or 20, you need to share learnings among all of them.

Schneider Electric's standards and procedures are best in class in part because we are diligent about sharing what we learn at each of the 100 or so data centers we operate. We use those learnings to continually update our processes and procedures so when a problem occurs, we have sound emergency procedures in place to follow.

Those procedures should include back-out steps to follow in the event something unexpected happens after a data center change, to prevent the issue from getting worse.

Risk management

Such procedures are closely related to the risk management topic. One of the big lessons here is to have a full-system approach to data center management.

If you need to take a component out of service to perform maintenance, for example, you need to first understand the impact and dependencies of that component with respect to the rest of the data center.

Doing so requires a thorough understanding of the data center.

For any data center we manage, Schneider Electric likes to get in on the construction phase, or as close to it as possible.

That way we can gain a thorough understanding of the architectural drawings, piping, wiring and so forth – all of which is knowledge that helps mitigate the risk involved in operating a data center.
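To make the dependency point concrete, here is a minimal sketch of an impact check run before a component is taken out of service. The component names and the FEEDS map are hypothetical, made up for illustration; a real map would come from the drawings and wiring records described above. Note that the function only reports what sits downstream. It does not model redundancy, so a listed component may still be fed from another path.

```python
from collections import deque

# Hypothetical dependency map: each key feeds the components listed as its value.
FEEDS = {
    "utility_feed": ["ups_a"],
    "generator": ["ups_b"],
    "ups_a": ["pdu_1", "pdu_2"],
    "ups_b": ["pdu_2"],
    "pdu_1": ["rack_01"],
    "pdu_2": ["rack_02"],
    "chiller_1": ["crah_1"],
    "crah_1": ["rack_01", "rack_02"],
}


def downstream_of(component, feeds):
    """Return every component reachable downstream of `component` (breadth-first)."""
    seen = set()
    queue = deque([component])
    while queue:
        current = queue.popleft()
        for child in feeds.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Everything that should be reviewed before ups_a is taken out for maintenance.
print(sorted(downstream_of("ups_a", FEEDS)))
# ['pdu_1', 'pdu_2', 'rack_01', 'rack_02']
```

In this made-up example, pdu_2 is also fed by ups_b, which is exactly the kind of detail a reviewer would confirm before approving the work.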

Tracking and reporting

Tracking and reporting is an area that gets overlooked far too often, which leads to unnecessary operating costs.

With proper tracking and reporting, you should be able to identify stranded IT capacity – that old rack of servers over in the corner, for example, that nobody is really sure still serves a purpose. (We've all seen those, right?)

Reclaiming that capacity can help you stave off a data center expansion by getting more out of the space you've already got.
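As a rough illustration of what such a report might look for, the sketch below flags racks that draw only a small share of their allocated power or that have no recorded owner. The RackReading fields, the 10 percent threshold, and the sample figures are all assumptions for the example, not values from any particular DCIM tool.

```python
from dataclasses import dataclass


@dataclass
class RackReading:
    rack_id: str
    allocated_kw: float  # power reserved for the rack
    avg_draw_kw: float   # average measured draw over the reporting period
    owner: str           # empty string when nobody claims the rack


def stranded_candidates(readings, max_utilization=0.10):
    """Flag racks that use little of their allocation or have no known owner."""
    flagged = []
    for r in readings:
        utilization = r.avg_draw_kw / r.allocated_kw if r.allocated_kw else 0.0
        if utilization <= max_utilization or not r.owner:
            flagged.append((r.rack_id, r.allocated_kw - r.avg_draw_kw))
    # Largest potentially reclaimable allocation first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


readings = [
    RackReading("R-01", 8.0, 6.4, "web"),
    RackReading("R-02", 8.0, 0.5, "legacy"),
    RackReading("R-03", 6.0, 3.0, ""),
]
print(stranded_candidates(readings))
# [('R-02', 7.5), ('R-03', 3.0)]
```

A flagged rack is only a candidate. Someone still has to confirm that it really serves no purpose before the capacity is reclaimed.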

Operation and maintenance costs

That leads to the final area: operation and maintenance costs.

We've learned plenty of lessons in how to keep these costs down, like using condition-based and predictive maintenance to replace components only when they really need it, as opposed to when some schedule says they do.
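The difference between the two approaches can be sketched in a few lines. The example below uses a UPS battery string and treats a rise in measured internal resistance over its commissioning baseline as the condition signal. The field names, the four-year interval, and the 25 percent threshold are illustrative assumptions only, not manufacturer or Schneider Electric guidance.

```python
from dataclasses import dataclass


@dataclass
class BatteryString:
    name: str
    age_years: float
    internal_resistance_mohm: float  # latest measurement
    baseline_resistance_mohm: float  # value recorded at commissioning


def due_by_schedule(unit, replace_after_years=4.0):
    """Calendar rule: replace once the unit reaches a fixed age."""
    return unit.age_years >= replace_after_years


def due_by_condition(unit, max_rise=0.25):
    """Condition rule: replace once resistance has risen too far above baseline."""
    rise = (
        unit.internal_resistance_mohm - unit.baseline_resistance_mohm
    ) / unit.baseline_resistance_mohm
    return rise >= max_rise


old_but_healthy = BatteryString("string_a", 5.0, 4.2, 4.0)
young_but_degraded = BatteryString("string_b", 2.0, 5.2, 4.0)

for unit in (old_but_healthy, young_but_degraded):
    print(unit.name, due_by_schedule(unit), due_by_condition(unit))
# string_a True False
# string_b False True
```

In this made-up data, the calendar rule would replace a healthy string and miss a degraded one, which is the cost argument for condition-based maintenance in miniature.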

And if you effectively track your assets (see previous point), then you can start determining which ones require the most maintenance – and potentially save money by replacing them.
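One simple way to start is to total maintenance spend per asset from work-order records and rank the results. The sketch below assumes each work order is an (asset ID, cost) pair; the asset names and amounts are hypothetical.

```python
from collections import defaultdict


def maintenance_cost_by_asset(work_orders):
    """Total cost per asset, sorted with the most expensive asset first."""
    totals = defaultdict(float)
    for asset_id, cost in work_orders:
        totals[asset_id] += cost
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


work_orders = [
    ("crah_1", 1200.0),
    ("ups_a", 300.0),
    ("crah_1", 800.0),
]
print(maintenance_cost_by_asset(work_orders))
# [('crah_1', 2000.0), ('ups_a', 300.0)]
```

Assets at the top of the list are the ones worth comparing against the cost of a replacement.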

Article by Anthony DeSpirito, Schneider Electric Data Center Blog