By: Duane Beyer and Mike Kalkas
Disaster recovery happens in IT, but is often overlooked or improperly implemented in OT
Remember data backup on tape cassettes? I do. The small data center I worked in had a great process for incremental backups run on one system/processor with multiple disk drives. Our tapes went into locked bins, were sent to offsite storage, and rotated back every few months. Occasionally these backups were taken to a similar data center and loaded to test the system recovery process.
Backups have evolved.
Today, servers host multiple VMs and handle huge data throughput with previously unheard-of memory capacity. However, backup tools and recovery methods have also grown to accommodate that enormous change.
Though backup and recovery have become a much simpler and more automatic process on the IT side of things, in many cases they are still completely missed or intentionally stopped on the Operations side.
Why aren’t data centers backing up SCADA?
IoT (Internet of Things) is pushing OT (Operations Technology) and IT together. Companies embracing the OT/IT merge are starting to realize some of IT’s most typical processes aren’t being done or do not work on the OT side.
Like backing up SCADA, DCIM, and EPMS systems.
What makes these systems different?
It all starts with process. Since the beginning, IT covered corporate data and had windows of low or no user activity, while OT was considered a manufacturing asset (under the facilities management umbrella) with a 24/7 schedule. In the past, OT systems were separate from corporate networks and the internet, and backups were done manually by an engineer. The segregation of the two networks was seen as a guarantee of safety, with little need for regular backups since there was little visibility into or access to this world.
However, system visibility has become a critical issue, and this means connections to more networks including intranet and internet. Now, as IoT pushes OT and IT together, IT is learning many of the automatic processes they perform on corporate networks aren’t being done on the OT side.
Unfortunately, starting up these processes isn’t as simple as adding servers to an automatic backup schedule.
Adding to the complexity of it all is the unknown value of the data involved, how the interconnected systems respond during the operation of IT processes, and a general lack of understanding by IT of OT systems and how they are engineered. As IT tries to run its normal processes, it can run into issues like:
- Network bandwidth issues causing equipment to lose connection with each other.
- Unknown systems mysteriously stopping. Tracking down the cause takes days or weeks.
- Alarms going out to high-level people within the company, causing unwanted attention to IT and their processes.
- Lost data and revenue.
- Unsafe failure of equipment and systems.
SCADA data backup: why do it?
That question opens up a whole can of worms. Here are just a few of the reasons that pop into my head.
- If the drives in your Galaxy Repository (GR) node server die and you don’t have a data backup, you could lose every DCIM, SCADA, etc. project that has ever been integrated at your location. It’s not just the data that’s gone; the digital infrastructure (every device ever added or removed) is completely obliterated. Think of how much money is spent on integration. Wasted. In addition, even if devices continue to operate properly, you’d never be able to make changes, deploy, or undeploy, which means if one of your servers went down, you couldn’t replace it.
- Think of the sheer amount of customer power data collected in historians. What would happen if the data drives died and you lost everything? How are you supposed to bill your customers without their energy usage data?
- Not backing up data could result in a mandate or compliance violation. For example, some pharmaceutical manufacturers are required by the FDA to store energy data from each batch of drugs they produce (via data backup). And for good reason: if the company were accused of producing bad batches due to temperature fluctuation, the backups could validate whether that was true. This extends into other federal mandates as obscure as occupational safety.
The list of possible scenarios that could destroy your data really is endless, but here are a few examples in brief:
- Act of God (flooding, earthquake, fire)
- Equipment failure and data corruption (due to age, manufacturing flaws, unintentional destruction)
- Accidental overloading
- Data-destroying virus
- Human error
- Terrorist act
- Bugs in the system
What should be backed up?
Your historian and GR node are the two most critical parts to back up. (They represent money!) After that, add in your application and I/O servers. Oh, and don’t forget that pesky documentation (because to restore all these backups, you’re going to need the documentation that tells you how to do it.)
You should never initiate a backup of your servers without careful planning. In a Wonderware system, some GR data changes as development is done, but for the most part that data just sits there unchanging. Because of this, start with a good full backup of your server/VM as well as the GR database, then take incremental backups of the GR database as things change. This keeps your backups lean and avoids paying for unnecessary storage space.
Backing up a historian works much like backing up a GR, since both run on an RDBMS. A full backup to start with, followed by incremental backups on a regular schedule, will help prevent data loss. Putting the GR or historian on a schedule that regularly clones the machine creates a huge storage overhead. And since you shouldn't be adding software or making major changes to a machine running your RDBMS, you only need the original clone. After that, capture only the changes caused by incoming data or system patches. Mirroring the data drive only gets you part of the way toward securing your data.
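The full-plus-incremental strategy above can be sketched in a few lines. This is a minimal illustration, not a Wonderware or RDBMS tool: it assumes you track file modification times and the timestamps of the last full and last incremental backup (all names here are hypothetical), and it decides whether a full backup is due or which files belong in the next incremental pass.

```python
import time

def plan_backup(files, last_full=None, last_backup=None, full_interval=7 * 86400):
    """Decide what the next backup pass should contain.

    files: dict mapping file path -> last-modified timestamp (seconds).
    Returns ("full", all paths) if no full backup exists or the last one
    is older than full_interval; otherwise returns ("incremental", paths
    changed since the last backup of any kind).
    """
    now = time.time()
    if last_full is None or now - last_full > full_interval:
        return "full", sorted(files)
    changed = [path for path, mtime in files.items()
               if last_backup is None or mtime > last_backup]
    return "incremental", sorted(changed)
```

The payoff is in the incremental pass: only files touched since the last backup are copied, so the steady-state backup stays small even though the historian collects data around the clock.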
Application and I/O servers
Unlike a historian, application servers require clones instead of a data backup. The software installed in an application server isn’t like a database server. It’s a jumble of application files that live all over the place. Because these applications use the system registry and in many cases, have hidden files, the only choice is to clone the machine to get a current image that can be reinstalled as necessary.
Problems with OT data backup
You can’t just add SCADA systems to the IT backup plan. The devices involved aren’t used only during business hours. Because of this, there are a few issues specific to OT backups that IT doesn’t usually encounter when conducting a data backup.
Lack of bandwidth makes automatic backups nearly impossible
Historians are under very heavy use in a data center. An EPMS is heavily loaded. SCADA/DCIM is live 24/7. There’s no downtime or after hours. Because the power is always on, it’s always collecting data.
Because of this, the way IT typically does their process for backing up servers is NOT the same process you need to follow when backing up an OT server.
If you start backing up with traditional tools, or set up automatic backup systems, you’ll hog bandwidth. And if you don’t announce that backups are running, a flood of alarms could cause a panic when the historian or data servers temporarily disconnect.
This can affect your ability to operate in real time as well. I’ve seen a backup bring an application server to its knees; the application took several minutes to navigate between screens.
What this all boils down to is: you can’t just use any data backup tool. Use a backup strategy that collects one file at a time so it isn’t hogging bandwidth. It will take longer, but it reduces the load on system resources. And please, let your people know ahead of time that the backup is happening, to avoid panic if alarms sound. If there is a notification system, contact the person in charge and get it temporarily disabled so nuisance alerts don’t go out in the first place.
Your SAN drives aren’t mirrored
Unlike IT personnel, OT managers often don’t understand the importance or benefit of mirroring SAN (Storage Area Network) drives. Many simply don’t know it’s an option. And even if they did, they’d face the problem of getting financing and justifying costs to managers who want nothing to do with anything IT-related.
A SAN is a high-speed drive cluster that presents shared pools of storage to multiple servers. If you install a second set of SAN drives and mirror them, they become a hot backup: if you lose your first set of drives, the second set is available for immediate use with no data loss.
This seems obvious to people with an IT background, but the electrical/operational managers in charge of the projects don’t have an IT background, and don’t understand the cost, need, or benefits behind mirroring data drives to maintain the robustness of the system.
Make SCADA data backup part of your disaster recovery plan
Because a SCADA engineer/integration engineer is so far down the food chain, we often don’t have a lot of say in backups or disaster recovery plans. It’s usually up to the customer to decide the priorities.
But the way I see it, there is an unacceptable amount of critical data that is one error away from being totally lost. SCADA data must be part of an IT department’s preventative strategy. Clear communication between IT and OT to understand each other’s needs, requirements, and processes is critical to the overall health and success of the business.
Duane Beyer was an Application Engineer at Affinity Energy from 2009-2017 with responsibility for developing, deploying, and maintaining integrated control systems. His recent contributions include major enhancements to monitoring systems for QTS Data Centers, Bank of America, and Amylin Pharmaceutical.
Early in his 35-year diversified career, Duane worked for Griffin Automation and Eastman Kodak Apparatus Div. as an automation and process controls engineer with experience in automotive, electronics, and batch chemical manufacturing. After 5 years at Digital Equipment Corporation as a VAX Cluster System and Operations Manager overseeing monthly data and telecom allocations to Eastman Kodak, Duane rejoined Eastman Kodak Research Labs as a software engineer to create formula management software applications used to define the manufacturing of photo sensitized chemicals and roll coating of photo sensitive film and paper.
Duane holds a B.S. in Computer Science with a minor in Mathematics from SUNY Brockport, as well as an A.A.S. in Electrical Technology from Erie Community College in Buffalo, N.Y. He also attended master’s classes in Software Development and Management at Rochester Institute of Technology.
Michael Kalkas was an Application Engineer at Affinity Energy from 2012-2017, with responsibility for developing, deploying, and maintaining integrated SCADA (Supervisory Control and Data Acquisition) systems. Some of his daily responsibilities were automated process software design, hardware footprint bolstering, system functionality testing, and system monitoring.
With over 25 years of electrical engineering experience, Michael previously worked for GE Appliance and Lighting, Manpower Inc., and the U.S. Army. Prior to joining Affinity Energy, Michael spent 11 years at Cooper Lighting as a Senior Lab Technician, where he performed photometric testing/UL testing against various fixture designs and lamp sources, maintained a UL Lab Testing SCADA system, and developed custom tools and databases for data manipulation and filtering.
Michael received a B.S. in Computer Science from South University and an A.A.S. in Electrical Engineering Technology from Blue Ridge Community College, and is certified as a Microsoft Certified Professional and Microsoft Technology Associate.