There is a deep vein of academic scholarship investigating what enables organizations to be ‘highly reliable’, that is to say, to function well even when conditions are severe. Examples include emergency rooms and fire departments.
As security threats to our systems mount, and as these systems are so embedded in how every company delivers value, it is important to understand how large enterprises can learn from and emulate not just the unicorns of the web industry but also these highly reliable organizations.
It turns out, however, that the principles and practices of DevOps synthesize critical attributes of highly reliable organizations and provide a template from which enterprises may learn how to become highly reliable.
As much as we, as security professionals, charge our organizations to listen and act, so must we learn to enable our organizations to become reliable in the face of threat.
The Hedgehog and the Fox
Archilochus was a Greek lyric poet of the 7th century BC. Very little of his work remains, and most of it is in the form of scraps and fragments. One aphorism that has endured is, “The fox knows many things, but the hedgehog knows one big thing.” For all we know, this could have been a marginal doodle made while watching an Aegean sunset, but it has had quite an impact on a range of disciplines.
For the first section of this post, I will perform a whistle-stop review of some management thinking shaped by this idea and then explain why the results are pertinent for security professionals.
Isaiah Berlin was an Oxford historian and philosopher in the mid-20th century who wrote about the hedgehog and the fox.
He explains the difference in his book of the same name, thus: “For there exists a great chasm between those, on one side, who relate everything to a single central vision… on the other side, those who pursue many ends, often unrelated and even contradictory, connected, if at all, only in some de facto way, for some psychological or physiological cause, related by no moral or aesthetic principle…”
Jim Collins, in “Good to Great”, adopted the metaphor to explain how the most successful and enduring companies operated: “Those who built the good-to-great companies were, to one degree or another, hedgehogs. They used their hedgehog nature to drive toward what we came to call a Hedgehog Concept for their companies. Those who led the comparison companies tended to be foxes, never gaining the clarifying advantage of a Hedgehog Concept, being instead scattered, diffused, and inconsistent.”
But how well has this idea stood up to testing since it was first published in 2001? Not well. An article by Steven Levitt, of Freakonomics fame, tracked the performance of companies praised in “Good to Great” and found that many had performed poorly. Examples include Fannie Mae (!), Circuit City and Wells Fargo. So is the idea of “the Big Idea” (or Hedgehog Concept) dead?
Phil Rosenzweig, author of “The Halo Effect”, thinks so. He cautions against these pat cause-and-effect explanations of performance. For me, the core idea is that resilience is necessary because success is not absolute. The parameters of success are a function of the market, and the market changes pretty rapidly. Bad things happen: a “Black Swan”, as Nassim Taleb has it. Taleb’s idea of antifragility is also very powerful. Consider an organization that improves through change, like a leather satchel that is broken in and improves with age, rather than a crystal wine glass that ceases to function after a small knock.
What do resilient organizations look like? How do we organize so that we improve under threat? Is resilience the new ‘hedgehog concept’, and where does security fit in?
My thesis is that if there is a ‘hedgehog concept’ in modern business, it is velocity and not resilience, but that resilience (including with respect to security threats) requires fox-like behavior in order to produce reliable business performance.
“Reliability depends on the lack of unwanted, unanticipated, and unexplainable variance in performance,” as Erik Hollnagel put it.
Highly reliable organizations
Karl Weick is an organizational theorist who has studied how organizations make decisions and process information with which to make those decisions. Much of this work has been in the area of highly reliable organizations.
A useful definition of reliability comes from another academic, Paul Schulman: “The major determinant of reliability in an organization is not how greatly it values reliability or safety per se over other organizational values, but rather how greatly it disvalues the mis-specification, mis-estimation, and misunderstanding of things.”
Here are some examples of the kinds of organizations that promote this kind of behavior:
- Naval aircraft carriers
- Chemical production plants
- Offshore drilling rigs
- Air traffic control systems
- Incident command teams
- Wildland firefighting crews
- Hospital ER/Intensive care units
A famous study of a failure of reliability is the loss of the Space Shuttle Columbia, which broke apart on re-entry into the Earth’s atmosphere on Feb. 1, 2003. The break-up was caused by damage to the thermal protection on a wing of the shuttle, sustained when a piece of foam insulation broke off the external tank and struck the wing during launch. The strike was noted soon after launch. Some engineers at NASA believed that the damage to the wing could be catastrophic, but their concerns were not addressed in the two weeks that Columbia spent in orbit because management believed that even in the case of major damage there was little that could be done to fix it. So how can an organization fail to respond to this kind of information?
Weick identifies some heuristics against which we can rate our capability to be reliable, in other words to respond effectively to experience and improve by it:
- How preoccupied with failure are you? Do you treat near misses as information for improvement or as evidence of your awesomeness as a security team?
- How much do you attempt to simplify? Do you solicit views from outside your security team?
- How sensitive are you to the whole operation? Do teams interact enough with each other to understand the other jobs being done and to form a whole picture of the operation? How much do you share a picture of the threat landscape with the people you are trying to influence?
- Are you committed to resilience? Do you invest in people’s competence, especially in terms of informal contacts and networks that can be used to solve problems effectively?
- Do you respect expertise? Often a security team will feel that its expertise is not respected across the organization and that people do not listen. But even within a security team, does everyone know who has the expertise to respond to an issue, rather than merely the hierarchical rank to do so?
In the case of the Columbia disaster, many of these questions yielded answers that pointed to a culture of hierarchy and deferred responsibility. As the Columbia Accident Investigation Board (CAIB) report states, “NASA’s culture of bureaucratic accountability emphasized chain of command, procedure, following the rules, and going by the book… Allegiance to hierarchy and procedure had replaced deference to NASA engineers’ technical expertise.”