As some of you may be aware, I’m currently involved in building a cloud-based DR environment for a couple of core business systems. We opted to use Amazon EC2 for this, and I thought I’d share a few observations gleaned along the way.
Over the next few weeks I’m going to be taking a look at the area of disaster recovery from the IT perspective, with a focus on how you might be able to take advantage of cloud platforms to help ensure your business could survive a catastrophic disaster.
First, a quick disclaimer...
This & related posts are not to be considered a definitive guide to all things disaster recovery. Nor should they be considered as complete or appropriate advice to be applied wholesale within your environment without due care & attention. If you are not sufficiently technically competent to decide whether or not a solution is appropriate, please seek the advice of someone who is. Any disaster recovery solution must be appropriate for your organisation & meet your objectives.
With the disclaimer out of the way, this & the following articles are intended to take a look at the process of planning for business survival in the event of a major, show-stopping disaster. The primary focus to start with is looking at the technical aspects involved, skipping over many of the associated business aspects.
While not the subject at hand, something to keep in mind throughout is that every IT-focused business continuity plan needs a corresponding BUSINESS-focused business continuity plan. That plan should address questions such as: Where will your staff go to work if your primary office building becomes uninhabitable? How would you contact your staff to tell them not to come in, or to go to your alternative location? What happens to your customers? How do they get in touch with you?
Going back a few years, the slightest mention of the terms “Disaster Recovery” or “Business Continuity Planning” (especially when used by external auditors, insurers etc.) had an unrivalled ability to strike fear into both those who held budgetary responsibility for funding such an undertaking and those who would be in the running to be responsible for its implementation.
Setting the implementation worries aside however, unless you are of the opinion that encountering a catastrophic failure of your core systems or having your main offices rendered unusable following something outside of your control would actually be an “opportunity” in disguise, spending some time planning for such an event is usually time well spent.
Depending on the scale of your organisation, Disaster Recovery (from an IT perspective) may evoke thoughts of articulated trailers full of computer equipment arriving in your car park or mirrored data centres… For many other organisations however, it probably evokes thoughts of wild panic while you consider just what you’d actually do should your data centre & its contents of carefully managed servers disappear overnight.
Although disaster recovery or business continuity plans are generally plans which no one ever wants to be faced with actually invoking, consider for one moment what might happen to your company’s day-to-day operation if its computer-based services or systems ceased to exist. What would happen if email services disappeared, file servers couldn’t be accessed, or your order processing, stock control, payment or financial systems couldn’t be accessed? Or, in this age of social media, consider the potential damage to your social audience engagement & online reputation. How would your business be affected if your public web sites disappeared off the Internet? Permanently?
Generally speaking, none of those scenarios are what could be considered good things to happen. In the majority of cases, the day a major disaster strikes is probably one you’d regard as a BAD DAY (capital letters definitely required).
With enough forward planning and the appropriate level of investment however, it is perfectly possible to plan for such an eventuality and establish a degree of preparedness relevant to your business. Some businesses may have no viable option other than creating a 100% mirror of their production environments complete with real-time data replication; others might consider that just ensuring their email continued to operate would be enough.
Before the advent of managed, 3rd party-hosted systems & commercially viable cloud-hosted Infrastructure-as-a-Service (IaaS) platforms, if you were unable to justify the expense of procuring or maintaining enough physical hardware to run your core systems for DR, you essentially had two options: Trim back your DR plans until they fit into your budgets, or decide to keep your fingers crossed, hope your luck was good and that you would never have a problem.
Many businesses, large and small, bravely opt for the second option. They sit back, get on with their daily trading and simply hope that disaster will never strike, or if it does, hope that they won’t be so badly affected that they’re unable to continue operating.
This approach may work if your organisation is small enough, or not so dependent on email or other computer-based systems. Over time, however, it has the unfortunate effect of creating a false sense of security; it can rapidly become business-as-usual until something major goes wrong or the business is hit by a natural disaster… Perhaps these businesses are hoping that their insurers will cover the cost of replacement kit, and that they’ll pay up immediately so that replacement hardware can be bought? It’s a great idea, but it doesn’t usually happen…
If your organisation is a subscriber to the “luck” approach, yet couldn’t actually continue operating in the event of disaster striking, consider challenging the status quo. Full-scale disaster recovery planning might not be something that the organisation is in a position to consider, yet many small steps can often be taken to help increase the organisation’s chance of survival with minimal disruption to normal operation.
So, assuming you’re still reading by this point and are managing to avoid quivering under your desk or going to find that lucky rabbit’s foot / 4-leaf-clover / handy chunk of wood / whatever-else-you’re-relying-on, what can you do to help give yourself a fighting chance in a DR scenario?
The first task at hand is to ensure you understand what you’re trying to protect to the best degree possible. Take a detailed look through your production environment and ensure you have enough technical documentation to understand what each element of the environment does and how it interacts with, or relates to, all other elements. Unless your environment consists of a single machine, diagrams are usually key to this and can help identify what you’d need to reconstitute a particular system or service. This might seem an obvious thing to do, but the number of organisations which do not have this level of clarity is nothing short of staggering.
A significant note of caution here is that attempting to create a disaster recovery solution without first gathering this knowledge can be dangerous for your organisation; sometimes much more dangerous than adopting the rely-on-luck approach & doing nothing. This is simply because spending time planning creates a sense of security, and you can end up believing that you’re fully prepared should something happen when your plans are in fact highly likely to contain flaws. Those flaws need not come from any technical error: unless you truly understand your systems & how they interact, you run the risk of missing something vital without which your system cannot function.
For each system it’s also vital to understand how important it is to your organisation, in terms of:
– Is it critical on a daily basis, or is it only needed once a month / quarter / year?
– How long could your business viably operate without it?
– What impact does it being unavailable have on the business, or other systems?
– In the event of a major disaster, how much potential data loss could be tolerated?
– Is it vital that NOTHING is lost, or could your business continue with a few hours or longer of lost data?
– How important is it compared to other systems?
– If you can only work on restoring one service at a time, which should you work on first?
These points can be considered a set of baseline recovery objectives for each of your systems, forming a set of guidelines against which to assess possible DR solutions.
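To make those baseline objectives a little more concrete, here is a minimal Python sketch of how you might record them per system and use them to decide what to restore first, or to sanity-check a candidate DR approach. Everything in it is an assumption for illustration only: the class and field names, the example systems, and all of the figures are hypothetical and would need replacing with your own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryObjectives:
    """Baseline recovery objectives for one system (all values are examples)."""
    system: str
    max_downtime_hours: float   # how long the business could operate without it
    max_data_loss_hours: float  # how much recent data could be lost (0 = nothing)
    priority: int               # 1 = restore first


def restore_order(objectives):
    """Sort systems by priority, breaking ties by the shortest tolerable downtime."""
    return sorted(objectives, key=lambda o: (o.priority, o.max_downtime_hours))


def meets_objectives(target, est_restore_hours, est_data_loss_hours):
    """Check a candidate DR solution's estimates against a system's objectives."""
    return (est_restore_hours <= target.max_downtime_hours
            and est_data_loss_hours <= target.max_data_loss_hours)


# Hypothetical systems and figures, purely for illustration.
systems = [
    RecoveryObjectives("payroll", max_downtime_hours=72, max_data_loss_hours=24, priority=3),
    RecoveryObjectives("order-processing", max_downtime_hours=4, max_data_loss_hours=0, priority=1),
    RecoveryObjectives("email", max_downtime_hours=8, max_data_loss_hours=1, priority=2),
]

for target in restore_order(systems):
    # Example candidate: nightly backups restored onto freshly built servers,
    # assumed to take 12 hours and lose up to 24 hours of data.
    ok = meets_objectives(target, est_restore_hours=12, est_data_loss_hours=24)
    print(f"{target.priority}. {target.system}: nightly-backup restore acceptable? {ok}")
```

Even a simple table like this tends to show quickly that one DR approach rarely suits every system: in the made-up figures above, a nightly-backup restore would be fine for payroll but nowhere near good enough for order processing.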
It might be a slightly obvious statement to make, but it’s also essential to ensure each system is fully considered from a technical perspective. Among many others, some key things to check include:
- How many users rely on the system?
- How do they access it? Citrix? Applications running on desktop computers? HTTP?
- Are there alternative access routes/methods which could be used in a DR scenario?
- How much data does it store, and how / where does it store it?
- How frequently does the data change?
- What % of its data is created/updated on a daily basis?
- What does that % represent in terms of GB or TB of data?
- What options do you have in terms of replicating it, mirroring its data or identifying & copying changed data?
- How do you back it up, where are those backups stored & what devices/software would you need to restore them?
- What other elements of your underlying infrastructure does the system depend on?
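The questions about change rates matter because they determine how much data you would need to move to a DR site each day, and whether your link could realistically keep up. As a rough sketch only, the Python below turns a daily change percentage into a volume and an approximate transfer time. The function names, the 70% link efficiency figure and the example numbers are all assumptions for illustration, not measurements from any real environment.

```python
def daily_change_gb(total_gb, daily_change_percent):
    """Volume of data created or updated per day, in GB."""
    return total_gb * daily_change_percent / 100.0


def transfer_hours(data_gb, link_mbps, efficiency=0.7):
    """Approximate hours needed to copy data_gb over a link of link_mbps.

    Uses decimal units (1 GB = 8000 megabits). The efficiency factor is an
    assumed allowance for protocol overhead and other traffic on the link.
    """
    megabits = data_gb * 8000
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600


# Hypothetical example: a 500 GB file server where 4% changes daily,
# replicated over a 20 Mbps uplink.
changed = daily_change_gb(500, 4)
hours = transfer_hours(changed, link_mbps=20)
print(f"Daily change: {changed:.1f} GB, approx. {hours:.1f} hours to transfer")
```

If the estimated transfer time starts approaching 24 hours, a daily copy of changed data over that link is unlikely to keep up, which is worth knowing before you commit to a particular replication approach.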
Once you have a solid understanding of what your environment contains, double check that you’ve included the supporting infrastructure and underlying network services – DNS, DHCP, Active Directory, RADIUS etc…
Don’t forget your telecoms either – communication in the event of a disaster tends to become even more important than normal. Your staff will need to be able to communicate with each other via phone, and are probably going to be working from different locations to their normal offices. Are mobile phones going to be enough? Do you need to be able to receive calls on your existing fixed-line numbers?
What happens if your email servers are unavailable? Will mobile devices still be able to send & receive messages?
Armed with your environment diagrams & documentation, spend some time identifying key dependencies for each of your systems, applications or services. This is a vital step, as otherwise it’s all too easy to establish a great DR solution for one of your systems, only to find out in a DR situation that it isn’t able to actually do anything without three other systems up & running.
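One way to keep track of those dependencies is to record them as a simple map and let a script work out what has to be running first. The sketch below uses Python's standard `graphlib` module (Python 3.9+) to do that. The systems and dependencies listed are hypothetical examples, and the helper function names are my own, so treat it as a starting point rather than a finished tool.

```python
from graphlib import TopologicalSorter

# Hypothetical map: each system -> the systems it cannot function without.
dependencies = {
    "order-processing": {"database", "active-directory"},
    "database": {"dns"},
    "active-directory": {"dns"},
    "email": {"active-directory", "dns"},
    "dns": set(),
}


def startup_order(deps):
    """Return an order in which every system appears after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def everything_needed_for(system, deps):
    """Return all direct and indirect dependencies of one system."""
    needed = set()
    stack = [system]
    while stack:
        current = stack.pop()
        for dep in deps.get(current, ()):
            if dep not in needed:
                needed.add(dep)
                stack.append(dep)
    return needed


print("Bring-up order:", startup_order(dependencies))
print("Order processing also needs:",
      sorted(everything_needed_for("order-processing", dependencies)))
```

A useful side effect is that `TopologicalSorter` raises a `CycleError` if two systems depend on each other, which is exactly the sort of surprise you would rather discover on paper than in the middle of a recovery.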
Having identified your key systems, the next challenge is to work through your systems and determine how best to approach establishing some form of DR provision.
The “best” approach to aim for is the one that fits your business’s objectives for continuity, while delivering a supportable, maintainable & affordable solution. Don’t be afraid to consider what may be new technology or new approaches for your organisation – Cloud and IaaS platforms can offer incredible benefits when compared to more traditional approaches to BCP, but come with their own set of costs & technical challenges.
Finally, for this post, keep in mind that there is no “perfect” one-size-fits-all solution to disaster recovery planning, as every business is different, with different priorities, different demands on its IT systems, and different ways of interacting with its customers.
Therefore, every business’ plans for disaster recovery need to be as unique as they are to ensure that their DR planning reflects their own, wholly unique set of business continuity objectives.
Next: A look at what to do with this information & some thoughts on where to start…