Disaster recovery is more than application backups.

By: Allan Evora

For some reason, many organizations believe SCADA application backups and disaster recovery are synonymous. They’re not. Granted, application backups are better than nothing, but they should never be a substitute for a proper disaster recovery procedure.

Here are our recommendations for the best ways to minimize the impact of hard drive failure or an infected/corrupted operating system and quickly get back to business as usual.

 

Why is disaster recovery necessary?

Disaster recovery is a method of recovering after an event renders the SCADA hard drive or computer electronics inoperable. Failures happen for a variety of reasons, such as:

  • The building gets flooded
  • A power quality event (such as a spike) fries the motherboard or hard drive
  • Equipment failure

But physical damage isn’t the only thing that necessitates disaster recovery. The rising risk of cyberattacks such as ransomware often requires rolling the system back to a point before the malware was installed.

When a disaster occurs, most systems require a quick response that gets the system back up and running ASAP. The problem is, most owners aren’t doing the right type of backups.

 

Stop relying on application backups as your “disaster recovery plan”

Many mission-critical organizations only do application backups. This type of backup ensures you have the latest copy of the SCADA application configuration, PLC programs, data logs, and report configurations.

True, an application backup still allows a SCADA system rebuild in the event of a system failure…but it takes much longer to recover. Most mission-critical facilities don’t have that kind of time.

With an application backup, you still have to:

  • Purchase a new computer
  • Reload the operating system (OS)
  • Set environment variables
  • Apply all Windows updates
  • Reload application software and any patches

…all of which represent time.

Restoring application software is easier said than done. Reloading an OS, configuring the required settings (network, users, security), and applying patches or updates can be long and tedious. The engineer best positioned to rebuild your system cost-effectively is probably the one who originally built it, or someone with similar application software experience. Unfortunately, that type of engineer might not be available in your emergency situation.

 

Add disaster recovery (DR) backups to your disaster recovery plan

In contrast, conducting a DR backup by creating a disk image (cloning the hard drive) allows control systems integrators to quickly support and restore customers’ systems in the event of a failure.

A disk image provides an exact replica of the system at a particular point in time. Specifically, it involves taking and storing a digital image of the hard drive, including not only the application software, but also the operating system, settings, and other data. There are many options out there, but one tool we commonly use is Acronis True Image.

Using an image for disaster recovery gives a control systems integrator a lot more flexibility in who they dispatch to get a SCADA system back up and running.

It’s imperative to keep these disaster recovery images up to date. That means after any major configuration change on the system, you must ensure a new image is taken. Keeping your images up to date can be the difference between a restoration that takes hours and one that takes days.
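One way to catch an out-of-date image is to compare its timestamp against the newest file in the application's configuration folder. The following is a minimal sketch, not a feature of any imaging product. The function names and the idea that configuration lives in a single folder are assumptions made for illustration.

```python
from pathlib import Path


def newest_mtime(folder: Path) -> float:
    """Return the most recent modification time of any file under folder."""
    times = [p.stat().st_mtime for p in folder.rglob("*") if p.is_file()]
    return max(times, default=0.0)


def image_is_stale(image_file: Path, config_folder: Path) -> bool:
    """True if any configuration file changed after the image was taken.

    Both paths are hypothetical: point them at wherever your image and
    your SCADA configuration files actually live.
    """
    return newest_mtime(config_folder) > image_file.stat().st_mtime
```

A scheduled task could call image_is_stale() and send a reminder when it returns True. File timestamps are only a rough signal, so treat a True result as a prompt to review, not as proof that a new image is required.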

System backups should always be geographically separated from their original. If there’s a fire in the building and both the primary and backup mechanism are affected, you’re out of luck.
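When an image is copied to a second location, it is worth confirming that the copy matches the original byte for byte. The sketch below does this with a SHA-256 checksum from the Python standard library. The function names and folder layout are hypothetical, and the "offsite" folder is assumed to be a mounted path at a separate site.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large images do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(image_file: Path, offsite_folder: Path) -> Path:
    """Copy an image to the offsite folder and confirm the checksums match."""
    offsite_folder.mkdir(parents=True, exist_ok=True)
    target = offsite_folder / image_file.name
    shutil.copy2(image_file, target)
    if sha256_of(image_file) != sha256_of(target):
        raise IOError(f"Checksum mismatch after copying {image_file.name}")
    return target
```

A matching checksum only shows that the copy was not damaged in transit or storage. It does not prove the image will restore, which still has to be tested on real hardware.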

 

Disaster recovery in virtual environments

In virtual environments, it is much easier to support and manage disaster recovery. Because the computer, its operating system, application software, and settings are all contained within a virtual machine (VM), backing up the system is just a matter of creating snapshots (copies) of the VM on a regular basis.

Most environments, such as Microsoft’s Hyper-V and VMware, have built-in or optional software to back up your virtual machines while your system is running. Depending on your computer and IT environment, you can save the backup in the cloud or on a local network. As an added plus, because internal IT resources typically understand VMs, most customers’ IT staff are willing to help us during disaster recovery, which helps speed up the process.

We recommend VMs for all our customers, except those with a smaller SCADA system that runs all of its software on a single computer.

 

Case study: don’t forget to test your plan

Don’t wait until you undergo a catastrophe to see if your disaster recovery plan actually works. A customer of ours recently discovered this the hard way.

Their IT department oversaw the creation of system backups and stored the images at a remote location. After the customer’s SCADA system failed, Affinity Energy was asked to restore the system. After arriving onsite, we confirmed the system had failed due to multiple hard drive failures and corruption of the RAID 5 configuration. We procured and installed new hard drives. All we had left to do was restore the system images IT took and stored…but there was a problem.

We brought the computer to the remote storage location…but after a day of attempting to restore the images, their IT department wasn’t successful. We’re not exactly sure what went wrong, but obviously they had never tested their disaster recovery plan.

In the end, we rebuilt the system from scratch. What should have taken less than a day ended up taking multiple days.

 

How to test your system backup on a live system

There is a simple way to test if your system backup will work as your disaster recovery solution.

  1. Clone the hard drive
  2. Remove the “old” hard drive you’ve cloned
  3. Take the newly cloned hard drive (it’s an exact duplication at this point) and put it in place of the old hard drive
  4. Store the old hard drive in a remote location. (This is now your backup.)
  5. Run your system on the cloned drive for a while and make sure it works. If it does, congratulations! You’ve now proven your backup works!
  6. If the cloned drive doesn’t work, something was wrong with your image. Thank goodness this is just a test!
  7. In that case, take the cloned hard drive out, put the “old” one back in, and start troubleshooting what went wrong.

 

5 lessons learned from disaster recovery

Here are a few quick tips we’ve learned over the years from helping customers plan for and execute their disaster recovery plans.

  1. The location of media, CDs, DVDs, and license files can change over the years. Always make sure the software you need is readily available; it will make disaster recovery that much smoother.
  2. Screen capture any license files, software serial numbers, or activation codes. Not having this information can make the restoration process very difficult.
  3. DR backups and application backups should always be stored in a remote location, in a place easily accessible by the control system integrator.
  4. Disk/VM imaging is not a one-time thing. It should be done whenever the system baseline changes (application software updated, major OS updates applied, new software installed). The more often you image, the fewer updates and security patches we’ll have to apply after disaster recovery.
  5. Keep track of the computer hardware your system is running on. Once the computer is outside the OEM extended warranty, we recommend customers consider upgrading it. A hard drive image may not restore properly if key hardware, such as the motherboard or RAID controller, is not identical to the original. This is not a problem for VMs, so you may want to consider transitioning to a DR strategy built on VM images once you are outside the OEM warranty period.
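
As a starting point for the hardware tracking in tip 5, a short script can record basic machine details next to each image along with the warranty end date. This is a sketch only. The function name and the record fields are hypothetical, and Python's platform module reports only OS-level details, so the motherboard and RAID controller models still have to be noted by hand or with vendor tools.

```python
import platform
from datetime import date


def hardware_record(warranty_end: str) -> dict:
    """Build a small inventory record for the machine this runs on.

    warranty_end is an ISO date string (YYYY-MM-DD) that you supply;
    it cannot be read from the machine itself.
    """
    info = platform.uname()
    return {
        "hostname": info.node,
        "os": f"{info.system} {info.release}",
        "os_version": info.version,
        "architecture": info.machine,
        "warranty_end": warranty_end,
        # Flag machines that are already past their OEM warranty.
        "out_of_warranty": date.fromisoformat(warranty_end) < date.today(),
    }
```

Saving the returned dictionary as a text or JSON file alongside the image gives whoever performs the restore a quick view of what hardware the image was taken on.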