Today the basics of Information Technology (IT) security testing are becoming increasingly automated, although we are still far from the point where everything can be done this way. We are also in an age where more and more systems that were originally completely air-gapped are being exposed to the internet; as a result, security is being treated as an add-on rather than being integrated from the very beginning.
Nowhere is this more important than in the systems that form our Critical National Infrastructure (CNI): systems, such as electrical and water supply, that are necessary for the function of day-to-day life. These often control complex and dangerous interconnected devices, which fall into the category of Industrial Control Systems (ICS). ICS can be further broken down into two distinct sub-categories, which are covered in the following brief glossary along with other important terms:
- IT – Information Technology. This is the hardware and software used for storing, retrieving, or transmitting information;
- OT – Operational Technology. This is hardware and software that monitors or triggers changes in physical devices, and will be used in this post to refer to the component devices of ICS systems;
- WAN – Wide Area Network. This is a network, primarily used for computer networking, which extends over a large geographical area;
- CNI – Critical National Infrastructure. Systems (including both IT and OT) and assets that are essential for the function of society;
- ICS – Industrial Control Systems. These are the computer systems which control complex and often dangerous physical processes, and can be subdivided into two distinct categories: Supervisory Control and Data Acquisition (SCADA) systems and Distributed Control Systems (DCS);
- SCADA – Supervisory Control and Data Acquisition systems. These are types of ICS that span a Wide Area Network (WAN). SCADA is also the term most commonly used to refer (incorrectly) to all types of ICS;
- DCS – Distributed Control Systems. These are types of ICS in which no WAN is involved;
- HMI – Human Machine Interface. This is a user interface that enables a person to interact with a machine or system, and is a term most commonly applied to interfaces in ICS;
- PLCs – Programmable Logic Controllers. These are small computers used in OT which use a highly specialised operating system to handle events in real-time;
- IIoT – Industrial Internet of Things. This is the term commonly used to refer to the increasing tendency to connect OT to the internet.
The increasingly common connection of ICS systems to the internet matters because the OT involved in such systems is often quite old, and was not designed to be secure against this kind of exposure. This is in contrast to IT, where the risk of widespread connectivity is far better understood and mitigated.
There is therefore an increasing focus on securing these OT systems; however, this is often mistakenly done by simply adapting or re-using the well-documented tools, techniques and methodologies applied to security testing of IT systems. The remainder of this blog post will explore, at a fairly high level, what the issue with this approach is and what changes need to be made.
A Brief History of OT Security Incidents
Security incidents, both intentional and accidental in nature, that affect OT (primarily in ICS systems) can be considered to be high-impact but low-frequency (HILF); they don’t happen often, but when they do the cost to the business can be considerable. For example, Kaspersky Lab’s white paper “The State of Industrial Cybersecurity 2017” (1) put the average cumulative cost for a business at $347,603 (£265,881), including both the consequences of the incident and the remedial actions needed, with over half of the companies interviewed admitting to at least one incident over the preceding twelve-month period. The subsequent survey in 2018 (2), by comparison, reported that fewer than half of the participating companies admitted to any security incidents over the preceding twelve-month period, suggesting that the security of OT systems is improving. However, the cost of breaches remains high, and not only in financial terms.
For the purpose of this blog post, security incidents are considered to include the following broad categories:
- Intentional actions taken by an external attacker that circumvent security or disrupt the normal running of the system;
- Intentionally malicious actions by an employee that circumvent security or disrupt the normal running of the system; and
- Accidental actions by an employee that circumvent security or disrupt the normal running of the system.
It's worth noting that accurate numbers on the frequency of breaches affecting ICS systems are hard to ascertain. As the 2018 paper identified, only 30% of the companies reviewed at the time were required to report security incidents to a regulatory body, although hopefully this will change under GDPR regulations. However, even with limited historical data, there are some significant examples of the impact and scope of such incidents:
- Sterigenics Ethylene Oxide Explosion, 2004 – workers attempting to troubleshoot an incident used a super-user password to override safety measures. As a result, a flammable mixture was exposed to an open flame, causing an explosion which injured four employees and caused property damage in excess of $27 million.
- Stuxnet, 2006-2010 – a complex, self-replicating piece of malware was identified on Siemens WinCC SCADA systems in 2010. It was designed to target specific ICS systems and sabotage their PLCs in order to cause the physical destruction of centrifuges used in Iran’s nuclear program. The worm exploited multiple previously undisclosed vulnerabilities (known as zero-day or 0day vulnerabilities), and is widely reported, though never officially confirmed, to have been developed jointly by the US and Israel in an attempt to undermine the Iranian nuclear program.
- German Steel Mill Attack, 2014 – an attack against a steel mill prevented the blast furnace from being shut down, resulting in significant physical damage.
- Ukrainian Power Outage, 2015 – an attack against Ukrainian SCADA systems left around 230,000 people without power for several hours.
- WannaCry, 2017 – malware which encrypts files stored on infected computers spread globally. Although no reports on its effect on ICS systems have been formally released, it seems likely that some systems will have experienced downtime due to incompatibility in the Windows systems connected to OT systems (3).
While the above list is in no way exhaustive, it does provide a glimpse into the repercussions of incidents involving OT systems, be they accidental or malicious. The real question remains, though: why and how is OT different from IT in terms of security requirements?
IT vs OT
A common way of looking at the security requirements of an IT system is to acknowledge that it’s data-centric; the key requirements are the confidentiality, integrity, and availability (C, I & A) of the data being processed. Take online banking as an example; the security requirements centre, at a very high level, around allowing you to access your funds, making sure that the accounts you can see are correct and the numbers you can see are accurate, and making sure that no one else has access without permission.
A common mistake is to look at OT systems the same way, but to change the order to be availability, integrity and confidentiality. In this way we can claim that standard IT security measures can be trivially adapted and applied to OT systems.
This is unfortunately not the case. For one thing, confidentiality of data on OT systems is rarely a priority at all; an attacker being able to see if the furnace is off or on, or what temperature it’s operating at, really isn’t a concern. For another, OT is less concerned with data and more concerned with physical processes. Instead, the primary concerns of OT systems employed in ICS are going to be safety and reliability. In these systems any downtime can cost a company thousands in revenue (one blog post from 2018 (4) puts the average cost to automotive manufacturers at $22,000 per minute of downtime) so they are often designed to run continuously for years, and any malfunction can be costly not only in lost product or damaged machinery, but also in terms of the safety and well-being of personnel and impact on the environment.
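To put the per-minute figure in perspective, the short Python sketch below multiplies it out for a few outage lengths. The $22,000 rate is the automotive estimate quoted above; the outage scenarios and the function name `downtime_cost` are hypothetical and chosen purely for illustration.

```python
# Illustrative arithmetic only. The per-minute rate is the automotive
# estimate quoted in the text; the outage durations are made-up examples.
COST_PER_MINUTE_USD = 22_000


def downtime_cost(minutes: float, cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Return the estimated revenue lost for an outage of the given length."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * cost_per_minute


if __name__ == "__main__":
    scenarios = [
        ("5-minute controller restart", 5),
        ("1-hour line stoppage", 60),
        ("8-hour shift lost", 480),
    ]
    for label, minutes in scenarios:
        print(f"{label}: ${downtime_cost(minutes):,.0f}")
```

Even the five-minute case comes to $110,000 at this rate, which is why a device that merely becomes unresponsive for a short time is treated as a serious outcome in OT.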
So, in the same way that we considered the security requirements of online banking earlier, consider the security requirements of the UK’s electric grid, in particular the OT systems that keep the country supplied rather than the IT systems that deal with customer data. The primary requirement here is the ability to deliver a constant supply of electricity to the country, and to do so safely. As stated before, this boils the key concerns down to safety and reliability, rather than the confidentiality, integrity or availability of data.
Looking beyond these key, base concepts, a presentation by Byres, Lissimore and Kube (5) also presents a list of the differences between IT and OT from a security testing perspective:
- Differing Performance Requirements;
- Differing Reliability Requirements;
- “Unusual” Operating Systems (OS) and Applications;
- Differing Security Architectures; and
- Differing Risk Management Goals.
This discrepancy between the priorities of IT and OT systems is at the heart of the reason why IT security testing methodologies, common findings and common recommendations cannot simply be adapted and reused in the OT space. This is the result of a lack of understanding of OT throughout cyber industries, outside of the field of product assessments. The increased interconnection of IT and OT has led to the view that OT is an extension of IT, which is not the case. In addition, many IT security professionals have little or no exposure to OT systems despite the recent drives to increase the security of such systems. This compounds the issue of trying to apply traditional IT security measures and testing methodologies through a lack of understanding of the potential impacts.
To further illustrate this let us take a look at some examples from IT security testing, and consider their impact on an OT system.
Automated scanning with tools such as Nmap and Nessus is a common activity during infrastructure testing of IT systems, and generally involves sending packets out either to identify the logical network layout or to check individual computers for known vulnerabilities. Occasionally such scanning may be restricted to avoid certain machines, or omitted completely because it could cause unexpected behaviour such as a computer shutting down, but this is very much the exception in IT.
By contrast, such activities are far more likely to cause unexpected results in OT, and these may have much more costly consequences. The most commonly anticipated consequence is that the devices being scanned shut down or become unresponsive in reaction to what is essentially unexpected input. As mentioned above, this can cost the company thousands of pounds per minute. Alternatively, such activity can cause unexpected mechanical reactions, such as moving a robotic arm suddenly in a manufacturing plant (causing physical damage to surrounding machinery, or injury to staff working in the area), or triggering different quantities of artificial colouring to be added to soft drinks in a production pipeline, resulting in the product needing to be recalled.
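Where some active scanning is judged acceptable at all, the restriction mentioned above (avoiding certain machines) can be enforced before a single packet is sent. The following is a minimal Python sketch of that idea, not a recommended methodology: the function name `safe_scan_targets` is hypothetical, and the address ranges come from the reserved documentation block rather than any real network.

```python
import ipaddress


def safe_scan_targets(in_scope, excluded):
    """Return the host addresses in `in_scope` that fall outside every excluded network.

    Both arguments are lists of CIDR strings. The excluded list is assumed to
    hold the OT ranges that the asset owner has said must never be scanned.
    """
    excluded_nets = [ipaddress.ip_network(net) for net in excluded]
    targets = []
    for net in in_scope:
        for host in ipaddress.ip_network(net).hosts():
            # Keep the host only if no excluded network contains it.
            if not any(host in ex for ex in excluded_nets):
                targets.append(str(host))
    return targets


if __name__ == "__main__":
    in_scope = ["192.0.2.0/29"]   # hypothetical agreed scope
    excluded = ["192.0.2.4/30"]   # hypothetical PLC range, never to be scanned
    for target in safe_scan_targets(in_scope, excluded):
        print(target)
```

Building the target list explicitly, rather than handing a scanner a whole range and relying on memory, gives the asset owner a concrete list to review and sign off before testing starts.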
Missing patches are among the issues we see most commonly on IT systems undergoing security review, and patching is something of a hot topic in the security industry. Operating systems and the applications installed on them are constantly targeted both by attackers looking for a way in and by security professionals looking for holes that need to be fixed. When one is found and reported, the originator of the OS or software will often work on a patch for that hole, which then needs to be installed on the affected computer in order to secure it.
This applies both to IT networks that are connected to the internet, and therefore more widely exposed to anyone who wishes to try and find these holes, and to networks which are either air-gapped or which use other security controls to restrict access. The generally accepted wisdom is that systems should have security patches and updates installed within a short time of them becoming available, and that only operating systems which are supported by the supplier should be used.
Even in IT systems this isn’t as straightforward as it may sound, as patches may cause issues in the way different pieces of software interact with each other. As a result organisations with larger networks tend to install patches on individual machines first to check that there are no problems, before rolling them out across the rest of the network.
These issues are compounded in OT networks, where machines often run continuously for years by design. This means that:
- They may be reliant on older software versions that are no longer receiving security updates but cannot be upgraded;
- They have very short maintenance windows (time is money); and
- The cost of downtime caused by patches causing unexpected behaviour is prohibitive.
As a result, patching is not a priority for most ICS systems, as it could affect the core principles of safety and reliability. Before the IIoT it was also not a concern, as an attacker would need to physically plug into the OT network somewhere in order to exploit the lack of patching. Unfortunately, this is now one of the areas that needs much more attention, as these systems are increasingly exposed to the internet and therefore pose a much more realistic and accessible target than they once did.
Encryption and Authentication
Another security mechanism that we check for on every IT security assessment is the use of encryption. Without encryption any data has the potential to be read or modified by someone watching your network. It is used to protect financial transactions, login pages, emails and more, to the extent that a system not using encryption is considered to be a high risk target. And encryption tends to go hand-in-hand with authentication, which is the mechanism used to make sure you are the only person who can access your data. After all, there’s little sense in using a thirty character long password involving letters, numbers and symbols if someone is sat watching you type it out.
However, just because they are such common security mechanisms in IT does not mean this is true for OT. For a start, we have already pointed out that confidentiality of data on such systems simply is not a priority, although it likely is for the connected IT systems. Although use of encryption and authentication may provide some level of defence against malicious attacks, it also consumes computing resources which can potentially result in lag on the system, and this can have a serious impact on reliability and safety. Imagine a reactor in a power station that is overheating; if there is a delay in sending the reading then there is a delay in the initiation of the cooling or shutdown mechanism, which could potentially result in physical damage or destruction to the reactor.
Furthermore, authentication in OT systems can be considered something of a barrier to entry. As we have discussed previously, security incidents involving ICS systems are considered to be HILF. An employee responding to an emergency alarm does not want to be delayed by the act of having to remember and enter a complicated password as they try to diagnose the problem and act accordingly; in fact such security measures are more likely to result in employees sharing login details and writing them on sticky notes stuck to the console in order to save time.
This is likely to be a more common occurrence than an attacker gaining access to the system and doing something nefarious. Therefore, in the interest of safety, it might be considered more cost effective to accept the risk of a single breach by an attacker than to hinder employees responsible for maintaining the safety and reliability of the system. While this is far from ideal, it is still a factor that needs to be taken into consideration.
What Needs to Change?
The main takeaway from this post, we hope, is that a like-for-like use of IT security measures is not appropriate in OT systems used in ICS. There needs to be a move away from treating OT as a branch of IT and toward treating it as a separate type of technology with distinct security requirements, a comprehensive list of which is discussed by NIST (6).
At a high level, this means security professionals need to make changes in two distinct areas at a minimum:
- The approach to security testing in OT systems. This means an overhaul of the methodologies, tools and techniques used to assess the security state of an OT system, rather than simply adapting existing IT testing approaches.
- The identification of security flaws, and the recommendations we provide to fix them. At present these still seem to be largely based on the issues that would be identified as causes for concern in IT systems, and the recommendations (such as use of encryption and patching) also reflect this. Instead, the principles of safety and reliability need to be at the core of the findings that are identified, and recommendations need to take into account a deeper understanding of the actual requirements of the system.
In addition to the above, thought also needs to be given to who the threat actors are in each scenario considered as part of a security review of OT systems, as the profile is going to differ from a typical IT system.
Security professionals assessing OT systems need to consider if the system is safety critical, what the worst case scenario is if testing causes any disruption to the system (not just in terms of financial cost, but also physical damage, impact on the environment, and threat to life), where the safety barrier of the system lies, and what the most realistic threats to the system are.
For this reason every assessment of an OT system should be considered bespoke, and any OT testing methodology must therefore be flexible enough to adapt to each unique system. If that means that we spend our time talking to safety case writers and process engineers to fully understand the system, and never actually plug into the network because of the risk of disruption, then that is the course we need to be prepared to take.
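One way to keep the scoping questions above in front of the test team is to record the answers in a simple structure before any testing is agreed. The Python sketch below is an illustration under stated assumptions, not an established methodology: the class and field names, the ordering of impact levels and the decision rule in `suggested_engagement` are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Impact(IntEnum):
    """Worst-case outcome if testing disrupts the system (ordering is an assumption)."""
    FINANCIAL = 1
    PHYSICAL_DAMAGE = 2
    ENVIRONMENTAL = 3
    THREAT_TO_LIFE = 4


@dataclass
class ScopingRecord:
    """Answers to the pre-assessment questions for one OT system."""
    system_name: str
    safety_critical: bool
    worst_case_if_disrupted: Impact
    safety_barrier: str                      # free-text description of where it lies
    realistic_threats: list[str] = field(default_factory=list)


def suggested_engagement(record: ScopingRecord) -> str:
    """Return a starting point for discussion, using a deliberately cautious rule."""
    if record.safety_critical or record.worst_case_if_disrupted >= Impact.PHYSICAL_DAMAGE:
        return "interviews and document review only"
    return "limited active testing, agreed with the system owner"


if __name__ == "__main__":
    record = ScopingRecord(
        system_name="example bottling line",  # hypothetical system
        safety_critical=False,
        worst_case_if_disrupted=Impact.FINANCIAL,
        safety_barrier="hardwired interlocks independent of the PLC",
        realistic_threats=["accidental misconfiguration by staff"],
    )
    print(suggested_engagement(record))
```

The output is only a prompt for conversation with the safety case writers and process engineers; the final decision for any real system has to come from that conversation rather than from a rule like this one.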
Summary and Conclusions
Although brief, hopefully this blog post has gone some way towards explaining why IT and OT need to be considered as two distinct categories, and why OT security testing is not the same as IT security testing.
Although huge steps have been taken in recent years to understand and improve the security of the increasingly interconnected OT systems employed in ICS, there is still a long way to go. This is not the end of the conversation, merely the next step in learning how best to approach an increasingly complicated set of systems which have historically been isolated from the need for security scrutiny.
- (1) https://go.kaspersky.com/rs/802-IJN-240/images/ICS%20WHITE%20PAPER.pdf
- (2) https://ics.kaspersky.com/the-state-of-industrial-cybersecurity-2018/
- (3) https://ics-cert.kaspersky.com/reports/2017/06/22/wannacry-on-industrial-networks/
- (4) https://due.com/blog/understanding-the-financial-cost-of-downtime-in-manufacturing/
- (5) https://cansecwest.com/slides06/csw06-byres.pdf
- (6) https://csrc.nist.gov/publications/detail/sp/800-82/rev-2/final
- Measuring the Risk of Cyber Attack in Industrial Control Systems (Cook, Smith, Maglaras and Janicke, 2016)
- SCADA / DCS Penetration Testing, Jonathan Pollet, SANS SCADA SUMMIT 2008
- W32.Stuxnet Dossier, Symantec, 2010
- SCADA Security, What’s broken and how to fix it (Andrew Ginter)