Recent events have made me think about what the data centre industry can learn from other sectors. The inspiration was the launch of a new initiative that I think is very important for all Data Centre operators. The group behind the launch is the Data Centre Incident Reporting Network (http://www.dcirn.org), whose purpose is to provide a mechanism for anonymous reporting of Data Centre failures. Their idea is that the whole industry needs to be made aware of data centre failures so that important lessons can be learnt and shared without any blame being attached. The DCIRN has taken its lead from the aviation industry, where anonymous reporting over many years has helped to reduce air accidents and made us all safer. This got me thinking about other sectors that could teach the data centre industry a thing or two, and a recent experience with the NHS provided me with an example.
A few days ago my four-year-old daughter had a febrile convulsion at nursery school. Although scary, it is apparently not unusual in young children, and her teachers were able to make a quick decision to call an ambulance, which arrived within 10 minutes. The ambulance crew made a quick assessment, administered some basic care and sped off to hospital, where she was assessed again, given medication and kept under observation. Within 2 hours she was discharged and home in time for tea. My daughter's condition was noticed and assessed at several stages, with key decisions made by her teachers, an emergency despatcher, an ambulance crew, an A&E triage doctor and a Registrar. It is another good example of the benefits of a National Health Service, but what can the Data Centre industry learn from this?
I think there are at least three important lessons here:
- Monitor your assets closely. My daughter was under the close supervision of skilled and caring staff; how well are your data centre assets monitored?
- React quickly to an emergency. You need a monitoring tool and a means of responding with proven and tested systems and processes.
- Have triage systems to assess whether an alarm needs acting on. Too many Data Centre monitoring tools just send alarms without the ability to federate different alarm states. A slowly rising temperature might not require an immediate response, but coupled with a failed CRAC unit it probably does. If your system “cries wolf” all the time you may ignore vital data.
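To make the triage idea concrete, here is a minimal sketch in Python of how two alarm states might be combined before anyone is paged. It is purely illustrative: the names (`ZoneState`, `triage`), the severity levels and the temperature thresholds are all assumptions made up for this example, not taken from any particular monitoring product.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    INFO = 1      # log only
    WARNING = 2   # review during working hours
    CRITICAL = 3  # respond immediately


@dataclass
class ZoneState:
    temp_rise_c_per_hour: float  # rate of temperature rise in the zone
    crac_failed: bool            # True if a CRAC unit serving the zone has failed


# Hypothetical thresholds, chosen only for illustration.
SLOW_RISE_C_PER_HOUR = 0.5
FAST_RISE_C_PER_HOUR = 3.0


def triage(state: ZoneState) -> Severity:
    """Combine the two alarm states into a single severity."""
    rising = state.temp_rise_c_per_hour >= SLOW_RISE_C_PER_HOUR

    # A rising temperature together with a failed CRAC unit is urgent.
    if state.crac_failed and rising:
        return Severity.CRITICAL
    # A fast rise is urgent on its own.
    if state.temp_rise_c_per_hour >= FAST_RISE_C_PER_HOUR:
        return Severity.CRITICAL
    # Either condition alone is worth a look, but not a call-out.
    if state.crac_failed or rising:
        return Severity.WARNING
    return Severity.INFO


if __name__ == "__main__":
    print(triage(ZoneState(temp_rise_c_per_hour=1.0, crac_failed=False)))
    print(triage(ZoneState(temp_rise_c_per_hour=1.0, crac_failed=True)))
```

The point of the sketch is that severity is decided from the combination of states rather than from each alarm in isolation, which is what keeps the system from crying wolf.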
There are important lessons for Data Centre operators that can be learnt from other sectors. Managing critical infrastructure in a Data Centre is not yet seen as a matter of life and death but, as the DCIRN points out, the growing importance of data and the role of IT in society is increasing the likelihood of fatalities resulting from a DC outage. There is a growing awareness that Data Centres are an important part of our national infrastructure and that their resilience is a matter of national importance.
This blog has been written by Steve Bailey, MD, AIT.