Using robots in facilities maintenance

January 5th, 2010

I took a leap last week and decided to specify the use of robots to perform janitorial/custodial tasks that are typically conducted by humans.  It had been an option for our contractors/service providers to try for the past year, but none of our datacenters or vendors had expressed any interest in pursuing it. 

It would appear that custodial service providers have a stake in keeping as many services as possible labor-intensive; they make money on percentages (profit, overhead, G&A) directly proportional to the amount of labor hours & dollars that they charge.  Good for a service provider’s bottom line, and more expensive for the serviced. 

The bottom line is somewhat important; not as important as availability, perhaps, but still something that management - and particularly upper management - regards as focal.  If we can reduce the labor footprint by use of automation in a cost effective manner, and also show that the results are equivalent, we’ll have made headway towards the goal of a lights-out, unmanned datacenter environment.

I’m starting a small pilot to use iRobot Roomba and Scooba cleaning robots to do some basic vacuuming of about 6,000sf of carpeted administrative area and 2,500sf of hard-floored hallways, and some basic floor washing in 2,500sf of lobby and break room at one of my datacenters in Ohio.  Does anyone have experience or lessons learned that they want to pass on for this pilot?

So far, no one has said “no, we can’t do that” or that they don’t like the idea of robots autonomously running around their datacenter.  Or they haven’t read the contract’s performance work statement…

When some results regarding the implementation, effectiveness, and cost effectiveness of the system are available, I’ll attempt to summarize what we’ve learned from this experiment.

What is DataCenterEngineer.com about?

March 1st, 2009

Word cloud from www.datacenterengineer.com


It’s just the right thing to do.

November 30th, 2008

I’ve been asking colleagues in my office for feedback on maintenance services for the last couple of months, trying to gauge how they feel our maintenance support does at our datacenters. I was quite surprised to get either no feedback at all, or a feeling that it all seems to be going just fine, thank you. That’s about as far as it could be from what even a cursory look would reveal. By inspection, audit, downtime results, accidental outages, equipment failures, shortened equipment lifecycles - any quantifiable measure - our current services leave a LOT to be desired.

What I found even more interesting this week was feedback that the new maintenance instructions, explaining the standards of how maintenance has to be accomplished to be effective, were too detailed, stringent, and technical for maintenance personnel to follow effectively. Setting a standard for how work has to be done - is too draconian.

I’m having a difficult time coming to terms with this. From my view, we are paying people to do something - accomplish proper and effective maintenance on our equipment and systems. From all measures, they are not doing so effectively. Defining the standards of what we expect for the pay is somehow unrealistic? I can’t grasp that concept.

Should I expect to be paid if I do not do my job? Can I sit at my desk, screw around and practice my sudoku, chat with friends online, read the paper, and collect a paycheck every two weeks? (Sure, I could. People in my office do exactly those things. You can probably guess how much respect I have for their ethics.) So no, I have to disagree with the others. This is not making me popular… but the ethical high road is also the most cost effective strategy.

Random surfing

November 17th, 2008

Annoyed at the enterprise’s lack of vision this morning, so I spent some time surfing around predictive maintenance and other random websites.  I liked this one a lot more than the vibration analysis sites…  Check out the cave photography at http://www.deepearthphotography.com

Three-phase power distribution and datacenter efficiency

November 14th, 2008

I am asked frequently to defend the mandate for three-phase power distribution to IT (Information Technology) equipment in the datacenter. This usually stems from customer representatives and server manufacturers sending small quantities (or individual pieces) of IT equipment and expecting it to be installed immediately. Historically, it has been less expensive on a per-installation (price) basis to install a single-phase power circuit to support that single piece of gear. However, looking only at the installation price of a single piece of equipment obscures the total cost of ownership (TCO) for the life of that equipment, particularly when you realize that multiple generations of IT equipment are constantly being tech refreshed over the life of the datacenter.

Yes, I am a Believer®. I believe in paying less for the same (or superior) results.

Allow me to clarify that “price” indicates the dollars one must spend immediately to get a server up and running, while “cost” indicates the sum of the installation price and all of the energy that is used to keep that server operational over its life cycle. Numerous external industry studies have confirmed that efficiency pays for itself over a life cycle, and various regulations and directives are aimed at this long-term view.

Reduction of long-term operational costs significantly outweighs the immediate price savings of a cheap solution. Given the opportunity to know the TCO of a car purchase, we may see that a Honda Accord costs $50,000 to own for twenty years, while a used Ford Taurus costs $25,000 over five years. The initial investment for the Honda may be $35,000 while the Ford only sets us back $5,000 today, but the average cost of owning and operating the vehicles is $2,500/year for the Honda and $5,000/year for the Ford. Can we really afford the cheaper vehicle in the long view?
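The car comparison is simple division, and it can be sketched in a few lines. This is only an illustration using the figures quoted above; the function name `annual_cost` is made up for the example.

```python
def annual_cost(total_cost_of_ownership, years_owned):
    """Average yearly cost: lifetime cost divided by years of service."""
    return total_cost_of_ownership / years_owned


# Figures from the example above: (purchase price, lifetime cost, years owned)
vehicles = {
    "Honda Accord": (35_000, 50_000, 20),
    "Ford Taurus (used)": (5_000, 25_000, 5),
}

for name, (price, tco, years) in vehicles.items():
    per_year = annual_cost(tco, years)
    print(f"{name}: price ${price:,}, TCO ${tco:,}, ${per_year:,.0f}/year")
```

The lower purchase price belongs to the vehicle with twice the yearly cost, which is the same trap as pricing a single-phase circuit one server at a time.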

The only way to maximize power efficiency and optimize life cycle cost (LCC) of energy usage in the datacenter is to consider the overall datacenter holistically. Each part of the center affects all of the others. The IT equipment is the driver, but the power circuit distribution, upstream power infrastructure, cooling infrastructure, and indeed the facility itself must be considered as a unified whole to have an effective efficiency strategy.

The use of three-phase power distribution in the datacenter is the most effective and straightforward means of reducing datacenter power usage. Installation of three-phase power starts a snowball effect in future energy and efficiency savings.

Let’s consider all of the places where we end up with life cycle cost savings.

• Power circuit consolidation: Installation costs drop significantly when we power up a rack at a time, instead of running individual power circuits to each new piece of equipment. A single 3-phase MPDU (Modular Power Distribution Unit, aka “power strip”) could power ten or more servers (currently up to about 5.7kW of equipment, using the equipment and technology we suggest today). Rather than installing ten single-phase circuits, we install one and save up to 85% of the installation costs for copper conductors, conduit, and labor for installation.
• Fewer circuit breaker panels are needed, because a single 3-pole breaker can supply enough electricity to support what up to ten 1-pole breakers would use in a single-phase distribution scheme, reducing panel requirements by up to 70%.
• As fewer breaker spaces are needed, we need fewer PDUs (Power Distribution Units) and RDCs (Remote Distribution Cabinets, aka satellites or expansion panels). Needing 70% fewer PDUs is significant, as each unit can cost $20,000 to purchase and $5,000 for installation to existing main distribution panel (MDP) 480V breakers.
• Fewer PDUs means fewer transformers in the datacenter generating heat. Energy is consumed (lost) converting 480V to 208/120V, and released as heat that must be transferred back out of the datacenter by installing additional cooling.
• Since more power can be transferred per breaker space by using 3-phase MPDUs, we increase the amount of electricity that each PDU actually delivers to the IT equipment. Most PDUs are 90% full of breakers, but only delivering 10% of their electrical capacity. By increasing the electricity each PDU pushes, we move significantly forward along the efficiency curve for that transformer – meaning that the average amount of heat generated by the transformer per server is significantly reduced.
• Facilities are designed with a certain number of 480V breakers on each MDP, just as PDUs have a certain number of circuit breaker panels. When you run out of circuit breaker spaces on the PDU, you have to buy and install another – which takes up another 480V breaker on the MDP. What happens when we run out of open 480V breaker spaces? We have to install an upgraded MDP, which involves a major engineering effort, can cost us millions of dollars, and generally requires shutting down the datacenter. We can avoid these millions of dollars in expense by using our existing resources as wisely as possible.
• Fewer PDUs on the datacenter floor means more room for additional IT equipment racks. Physically filling up the datacenter means that we have to build a new datacenter to support the additional workload. Over the life cycle of the enterprise, that is akin to having to replace that Ford Taurus after only five years, instead of keeping the Honda for twenty. The expected price of a replacement datacenter is in excess of $100 million and increasing; the longer we can avoid the need for this outlay, the better off we are.
• Underfloor cooling technology has limits as to how much heat transfer we can achieve. By installing fewer power circuits underfloor, and so not clogging up the cooling air plenum, we can increase static pressure and deliver more cooling to each rack. By delivering more cooling to each rack, we can install more IT equipment in each and use more of the MPDU capabilities for high-density power distribution.
• By installing more IT equipment per rack, we reduce the costs of purchasing and installing more racks for the same number of servers in the datacenter.
• For a given amount of cooling (ten 20-ton air conditioning units, for example), increasing the efficiency of the cooling system by keeping the cooling air plenum clear allows us to use fewer air conditioners to achieve the same cooling results. By turning off one or two air conditioners we save 10-20% of the energy it takes to run those, and eliminate the heat that their motors generate inside the datacenter.
• Having fewer PDUs and air conditioning units reduces the costs needed to operate and maintain those infrastructure systems. Maintaining an eight-car fleet is just less expensive than maintaining a ten-car fleet, especially when there is no reason to have those extra units in the first place.
• Advanced metered MPDUs allow facility managers to balance by-phase power loading at the rack level, which in turn balances phase loading at the PDU and thereby reduces harmonics and neutral loads, increasing PDU transformer efficiency and further reducing waste heat generation.
• The MPDU technology specified in our Facilities Standards enables visual and remote power consumption tracking (through the building automation system), which establishes baseline and trending information and will allow additional energy reduction techniques to be employed in the future (including smart load shedding and remote power management of the IT equipment).
• As IT equipment is tech refreshed every few years, the lower amount of churn at the PDU breaker panel and underfloor power circuit levels reduces abatement and reinstallation costs.
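The breaker-space and PDU arithmetic behind the list above can be sketched as follows. The percentages and unit costs are the ones quoted in the bullets (and they are "up to" figures, so best case); the baseline of ten PDUs is purely a hypothetical number chosen for illustration, as are the function names.

```python
import math


def pole_space_reduction_pct(one_pole_breakers, three_pole_breakers):
    """Percent of breaker pole spaces freed when 3-pole breakers replace 1-pole ones."""
    before = one_pole_breakers * 1    # each 1-pole breaker uses one space
    after = three_pole_breakers * 3   # each 3-pole breaker uses three spaces
    return 100 * (before - after) / before


def pdus_after_reduction(baseline_pdus, reduction_pct):
    """PDUs still required after a given percent reduction, rounded up to whole units."""
    return math.ceil(baseline_pdus * (100 - reduction_pct) / 100)


def pdu_cost_avoided(baseline_pdus, remaining_pdus,
                     purchase=20_000, installation=5_000):
    """Purchase plus installation cost of the PDUs we no longer need."""
    return (baseline_pdus - remaining_pdus) * (purchase + installation)


reduction = pole_space_reduction_pct(one_pole_breakers=10, three_pole_breakers=1)
baseline = 10  # hypothetical PDU count under single-phase distribution
remaining = pdus_after_reduction(baseline, reduction)
print(f"Pole spaces saved: {reduction:.0f}%")
print(f"PDUs needed: {remaining} instead of {baseline}")
print(f"PDU cost avoided: ${pdu_cost_avoided(baseline, remaining):,}")
```

This counts only PDU purchase and installation. The transformer heat, cooling, floor space, and MDP breaker savings described above would come on top of it.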

Further, three-phase distribution provides scalable flexibility for an uncertain future, where the direction of IT equipment is in question. What are we reasonably certain of? IT equipment will get denser in terms of computing power, power consumption, and heat generation.

Three-phase distribution is scalable in amperage – we can replace the currently prescribed 30A circuits with 60A when we need to supply more power to a rack in the future. Certainly, one can argue that this can also be accomplished with single- and two-phase distribution, but we eliminate the situation where a one- or two-phase circuit needs to be replaced with a three-phase, and the surrounding circuit breaker spaces are already in use. With a three-phase system, the same three adjacent breaker positions are ready for use by the future IT equipment requirement. This further lowers churn costs at subsequent tech refreshes.
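To put rough numbers on that amperage scaling, here is a minimal sketch of the standard balanced three-phase apparent power formula, S = √3 × V(line-to-line) × I. The 208V figure comes from the 208/120V distribution mentioned earlier; the 80% continuous-load derating is an assumption for illustration, not something specified in our Standards, and the function name is made up.

```python
import math


def three_phase_kva(line_to_line_volts, amps, derating=1.0):
    """Apparent power of a balanced three-phase circuit, in kVA.

    derating is the fraction of breaker rating assumed usable for
    continuous load (1.0 = full nameplate rating).
    """
    return math.sqrt(3) * line_to_line_volts * amps * derating / 1000


for amps in (30, 60):
    nameplate = three_phase_kva(208, amps)
    continuous = three_phase_kva(208, amps, derating=0.8)  # assumed 80% derating
    print(f"{amps}A at 208V: {nameplate:.1f} kVA nameplate, "
          f"{continuous:.1f} kVA continuous")
```

Under these assumptions a 30A circuit is good for roughly 10.8 kVA nameplate, and moving to 60A doubles that on the same three breaker positions.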

And finally, reducing power consumption also reduces production of greenhouse gases in a relatively linear manner. We decrease TCO and LCC, decrease power consumption, and decrease our environmental footprint.

What have we got to lose?


Datacenter health reporting

September 2nd, 2008

Have you ever seen the status-of-the-datacenter briefings that go up the management chain to the high-level stakeholders (owners, CEOs, shareholders, and the like)? Have you ever asked to? I’ve been amazed at how much information is distorted in the name of politics when we present our version of how the enterprise is situated, and whether it is capable of performing the missions we have as we are ‘supposed’ to - meaning, 24×7 operations, 4-9s or better uptime, and able to support a high density of modern IT equipment.

Each level of management seems to want to improve the data somehow, to push actual problems under the table, to not communicate ongoing challenges and issues. I am guessing that there is a self-preservation aspect to these sorts of modifications, but if you look at the process from an objective point of view, these actions hurt the enterprise. In the end, those people actually able to make decisions regarding important things such as the direction of the company and what sort of funding is made available to make the system(s) as robust as they believe them to be are simply given false, rose-colored-glasses information.

One of the roles of the facilities engineering and facilities management team is to accurately report the status of the enterprise systems and whether they can support the mission of the enterprise. When there is an issue, we have an obligation not only to resolve the problems, but also to conduct root cause analysis, identify and plan for corrections to causal issues, communicate the problems and recommended solutions for lessons-learned growth in the organization, to describe why fighting symptoms and assigning blame are inappropriate (they waste resources and do not resolve root issues), and basically to tell it like it is. That’s why we have engineering degrees; we’re partially here to keep people with business, marketing, accounting, and management degrees straight and honest.

Let the marketing types sell the enterprise as 6-9s capable. Correct their assumptions when possible, but don’t fight them. Their roles are different. They are here to make money, and we are here to spend it (to achieve capability for them to sell, of course).

But don’t ever, ever report falsehoods that make it seem as though everything is all peachy when it isn’t. Have a datacenter with a critical SPOF that is going to cause problems one of these days? Identify it. If not, when that failure and unplanned downtime happen, the accusations will roll back down all the way to the bottom - you. Keep copies of your correspondence stating your disagreement with the data being passed upwards, along with copies of the actual data, to show where the data was ‘interpreted’ to make it look better.

Of course, you’re still the ones who will have to resolve those problems when they occur, but you won’t have to fight two battles at once when it happens - correcting the problems AND finding yourself a new job.

O&M Standards Implementation - Finally.

August 16th, 2008

I wrote a bit about this back in May, and am pleased to be able to say that after three more months of dedicated work on the project, I finally had two breakthroughs in the last week that will enable our enterprise to begin live testing of our enterprise-wide operations and maintenance program.

First, as I have mentioned previously, the deferred maintenance at our datacenters has had very deleterious effects on our ability to support customers’ perceptions of adequate uptime.  A change in leadership - almost always necessary prior to a revolution - and elimination of some of the old guard, focused less on long-term sustainability than short-term perceptions and profits, has occurred.  New regimes see many issues and problems to be corrected.  Some are realistic, others are larger issues that may not have been in the power of previous operators to correct, even had they tried to.  Still, some sort of new idealism has been infused into the organization, as they slowly begin to see what could be done to make things better - and it has to be done relatively quickly, before frustration and budget realities set in.  This new leadership has been looking, is frustrated with the current situation, and has ordered a change - any change - to make things better.  “We really have to get this under control.”  That’s simple and indirect enough, and I am taking it as a mandate to change the business processes to meet the new vision from on high.

Second, the ‘product’ is finally ready for release.  A three-year development and pilot is nearing its first year of (relatively) successful testing, and the results are clear - not a perfect solution to problems, but one that has dramatic and obvious positive results.  Taking all of the lessons learned from that pilot program, the course-corrected new program makes scope limitations, expands roles and responsibilities of a variety of stakeholders, manages contract administration more logically and simply, and has a rather unique element, not tried before as far as we can determine.

At the core of operating any enterprise are basic rules.  Those rules must be codified, understood, and followed to meet with any level of success beyond that which luck would bring.  Yes, I am simply talking about a set of standards - Facility Standards (defining performance requirements and infrastructure organization) and O&M Standards (defining how facilities will be operated and maintained).

The issue we have organizationally faced, in all of our support contracts for all of our diverse equipment and datacenters, has been the uniqueness of each contract and the challenges resulting from that.  Every site is different; every contract has to cover those differing conditions.  One of the most challenging pieces of developing a solid contract document is anticipating future changes to covered systems and incorporating those eventualities and contingencies, sometimes up to five years into the future.  Contracts are generally intended to be (and are managed to be) static documents, non-reflective of the reality we live in daily.

In order to attempt an end-run around this, our Standards are now designed to not be simple statements of ‘x system shall be configured in this manner,’ but rather in a performance work statement (PWS) format, defining expectations as well as minimum standards, and holding whichever servicing organization gets whichever piece(s) of O&M work to certain defined, quantifiable performance standards - and generally, without telling them exactly how they are supposed to obtain the now-codified results.  Rather than inserting a static statement of work to be performed into a contract document, the new O&M program will simply require the servicing organization (contractor or similar) to conduct all work in accordance with the current version of the appropriate Standard(s), which are free to remain dynamic and flexible based on evolving situations, circumstances, requirements, and leadership edicts.

The key to managing the contracts becomes simply adequate dissemination of current standards to the servicing organizations, their review of any changes periodically, and modification of firm-fixed-price agreements by negotiation of equitable adjustments on existing contracts.  The actual work to be performed by a contractor can thus change over time based on changing requirements of the Standards, without requiring scope changes to the static contract documents.

Now, finally having been granted the authority to ‘fix’ the new leadership’s perceived problems, we begin to move into what is sure to be a long and painful contract acquisition phase.  I fully expect this to take nine months to complete successfully, and results will not be seen for at least three to six months after that.  I only hope that the will to continue fixing our O&M problems will last that long.

Irony. As pointless as a beachball.

July 22nd, 2008

I can’t tell sometimes if the purpose of a support organization is to provide support or to do exactly the opposite of their mission, setting up procedural roadblocks to stymie the flow of support.  (There’s no Socratic irony in this post whatsoever, if you couldn’t tell.)

In order to have my contracting office solicit vendors for an electrical services contract, their rules state that they must have a “bona fide” requirement that the contract will support.  Well, I certainly have a requirement to be able to have electrical services performed in my datacenter; things like adding circuits happen all the time, and it is significantly easier and less expensive to have an agreement already in place with a vendor than to get individual contracts for each and every individual circuit I might need.

The definition and etymology of “bona fide,” from the Merriam-Webster Online Dictionary:

Main Entry: bo·na fide
Pronunciation: \ˈbō-nə-ˌfīd, ˈbä-; ˌbō-nə-ˈfī-dē, -ˈfī-də\
Function: adjective
Etymology: Latin, literally, in good faith
Date: 1632
1 : made in good faith without fraud or deceit <a bona fide offer to buy a farm>
2 : made with earnest intent : sincere
3 : neither specious nor counterfeit : genuine
synonyms see authentic

That’s a pretty clear-cut definition by me.  “In good faith.”  Nowhere in there is there a statement about actually doing something, for example having an initial work assignment in order to solicit the contract documents. 

Actually, the term specifically means the opposite of how contracting is using it.  “In good faith.”  “Made with earnest intent.”  Why, oh why can they not see the great humorous conundrum they have thrown up and laugh along with me?  (Or at least use another, more accurate and less ironic word.)

No, they continue to insist that an actual order is required against the contract vehicle (which is logical based on a need to perform competitive bidding based on a set of tasks, perhaps).  Perhaps that logic is also flawed, as a sample task for vendors to consider would satisfy the logical argument, but does not satisfy the contracting officer.  There is some precedent for not allowing a “sample task” for bidding purposes, since we would not intend for that “sample task” ever to actually be accomplished - therefore having the potential to be viewed as fraudulent or deceitful.

So, is it not with earnest intent that I request a contract vehicle, even though I may not have an installation requirement today?  The overall need conclusively meets the definition of “bona fide.”  The contracting office’s narrow usage of “bona fide” to mean instantaneous and current need without regard to the actual requirement, strangely enough, is specious (oh, the joy of antonyms).

Yep.  That’s the definition of irony, all right.

Never enough

July 5th, 2008

Good afternoon, and welcome back. I would apologize for tardiness, but I hope that, instead, you have been out enjoying the summer as I have.

It has been an interesting year in our O&M world thus far. There have been a number of successes, a number of temporary setbacks, no major failures, and most importantly, a considerable furthering of the program development.

Expect Major Changes™ and lots of interesting new things to think about in the next few months. Meanwhile, back at the ranch.

Despite best efforts, measurable capability increases, downtime decreases, and millions of dollars expended in improving conditions to improve the datacenter capabilities and O&M support, we still face an adversarial situation politically. Without a budget and staff ten times what we have today, a major dent in the situation still appears unreachable in the near term, and it is obvious on the ground. Frustrations in our ability to positively affect the situation are apparent, and are bubbling up to dangerous levels.

I don’t have any thoughts on how to improve this situation today, simply my own frustrations that best efforts and an unfortunately long-term plan to repair the facilities and build a sustainable O&M program do not satisfy user demands.

Optimal Datacenter Staffing Levels, Part 2

May 23rd, 2008

How many people or equivalent people do we need to support each of our datacenters? Can we come up with some sort of algorithm that we would use to define the staffing levels we felt were necessary to maintain an adequate level of operations, maintenance, and engineering support?

I don’t believe that it’s actually going to be possible to develop an effective algorithm to determine optimal or satisficing staffing requirements for (a) facility managers or (b) assigned O&M personnel at the datacenters in a consistent, logical manner. The conditions, facilities, people, and needs at each individual site appear to be too unique for us to create a formula, plug in how large the datacenter is, and get a magic number.

Development of a basic numerical model is not difficult; we can very easily set up a spreadsheet to calculate fractional full-time employee (FTE) equivalents for a number of the functions that we anticipate being performed. Where the model breaks down is the “threshold” number that we choose for each type of work. A threshold is set as an upper limit on how much one FTE could reasonably be expected to handle – for example, 50,000 square feet of datacenter management.

The proposed initial solution is based solely on the square footages of different pieces of the datacenters, and on the total UPS capacity (all systems added together as if configured for N redundancy). We’ve tentatively decided that one FTE custodial contractor can keep, for example, 75,000 square feet clean to our standards, or that the amount of time required for fully scheduled preventative maintenance and repair of all of the facility systems can be estimated at one FTE per 1,000 kVA of total UPS capacity (reflecting a number of systems and their complexity). These numbers are our selection thresholds. The results change based on where we set those thresholds, based on our experiences supporting the sites in the past.
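The spreadsheet logic described above amounts to dividing each measured quantity by its threshold. Here is a minimal sketch of that calculation; the three thresholds are the example figures quoted in this post, while the site measurements, the dictionary keys, and the function name are hypothetical and exist only to show the arithmetic.

```python
# Thresholds: how much one FTE is assumed to handle (example figures from this post).
THRESHOLDS = {
    "datacenter_management_sqft": 50_000,  # sq ft managed per FTE
    "custodial_sqft": 75_000,              # sq ft kept clean per FTE
    "maintenance_ups_kva": 1_000,          # kVA of total UPS capacity per FTE
}


def fractional_ftes(site, thresholds=THRESHOLDS):
    """Fractional FTE estimate per function: measured quantity / threshold."""
    return {function: site[function] / per_fte
            for function, per_fte in thresholds.items()}


# Hypothetical site measurements, not taken from any real datacenter.
example_site = {
    "datacenter_management_sqft": 25_000,
    "custodial_sqft": 37_500,
    "maintenance_ups_kva": 1_500,
}

ftes = fractional_ftes(example_site)
for function, fte in ftes.items():
    print(f"{function}: {fte:.2f} FTE")
print(f"Total: {sum(ftes.values()):.2f} FTE")
```

Every result scales inversely with its threshold, so halving a threshold doubles the staffing estimate for that function. That is why the choice of thresholds, rather than the model itself, drives the answer.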

Sample Staffing Tool

Thresholds are generic numbers based on the ‘average’ or expected datacenter component being measured, and where one set of thresholds yields a staffing mix that sounds reasonable at one datacenter, the same threshold set is obviously wildly off at another.

This is because of the unique nature of the datacenters. Each has a history – how well or poorly it has been maintained, whether it is an empty, unused shell or a 90% filled facility, or the average age and condition of the structure and infrastructure equipment. O&M costs to sustain the facilities are wildly different based on age, usage, mission, and geography, and we’re not surprised to see that we cannot pick “one number” and apply it equally to all of the facilities with satisfactory results.

This is a similar but far less rigorous technique than conducting time & motion studies, actually recording how long tasks take to complete and how much time is spent in other transitional and unscheduled activities – things that we would like to be able to measure using CMMS software in the future. We will have to remember to be careful because at any given staffing level we cannot make an assumption of 100% personnel efficiency. If we have too many staff, the work will expand to meet the time available, resulting in inflated estimates.

Unfortunately, part of the exercise shows an obvious bias regarding each of the facilities. Because we know them each as individuals, we see their various problems and compare them to one another, even when we know we should not for objective questions. Even before looking at results for any particular site there is a preconceived notion of the staffing answer. We can build the model to give the answer we’re looking for at any point, but that neither makes the solution ‘right’ nor does it make the model provide appropriate solution sets for the other sites.