Tuesday, August 9, 2011

No-Failure Design and Disaster Recovery: Lessons from Fukushima

One of the striking aspects of the early stages of the nuclear accident at Fukushima-Daiichi last March was the nearly total absence of disaster recovery capability. For instance, while Japan is a super-power of robotic technology, the nuclear authorities had to import robots from France for probing the damaged nuclear plants. Fukushima can teach us an important lesson about technology.

The failure of critical technologies can be disastrous. The crash of a civilian airliner can cause hundreds of deaths. The meltdown of a nuclear reactor can release highly toxic isotopes. Failure of flood protection systems can result in vast death and damage. Society therefore insists that critical technologies be designed, operated and maintained to extremely high levels of reliability. We benefit from technology, but we also insist that its designers and operators "do their best" to protect us from its dangers.

Industries and government agencies that provide critical technologies almost invariably act in good faith, for a range of reasons. Morality dictates responsible behavior, liability legislation establishes sanctions for irresponsible behavior, and economic or political self-interest makes continuous safe operation desirable.

The language of performance-optimization (not only doing our best, but also achieving the best) may tend to undermine the successful management of technological danger. A probability of severe failure of one in a million per device per year is exceedingly, and very reassuringly, small. When we honestly believe that we have designed and implemented a technology to have vanishingly small probability of catastrophe, we can honestly ignore the need for disaster recovery.

Or can we?

Let's contrast this with an ethos that is consistent with a thorough awareness of the potential for adverse surprise. We now acknowledge that our predictions are uncertain, perhaps highly uncertain on some specific points. We attempt to achieve very demanding outcomes (for instance, vanishingly small probabilities of catastrophe) but we recognize that our ability to reliably calculate such small probabilities is compromised by the deficiency of our knowledge and understanding. We robustify ourselves against those deficiencies by choosing a design which would be acceptable over a wide range of deviations from our current best understanding. (This is called "robust-satisficing".) Not only does "vanishingly small probability of failure" still entail the possibility of failure, but our predictions of that probability may err.
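To make the arithmetic concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a method taken from any actual reactor analysis. The fleet size (400 devices), the time span (40 years) and the tolerable fleet-wide probability (5%) are hypothetical numbers chosen only for the example. The independence of device-years and the simple "fractional error" model of uncertainty are also assumptions. The function names are invented for this sketch.

```python
import math


def fleet_failure_probability(p_per_device_year, devices, years):
    """Probability of at least one severe failure across the whole fleet.

    Assumes (hypothetically) that every device-year is an independent trial
    with the same failure probability.
    """
    exposure = devices * years
    # 1 - (1 - p)**exposure, written with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(exposure * math.log1p(-p_per_device_year))


def robustness(p_nominal, devices, years, p_fleet_max):
    """Largest fractional error h in the nominal estimate that is still tolerable.

    Assumed uncertainty model: the true per-device-year probability may be as
    large as p_nominal * (1 + h). Returns the greatest h for which the fleet-wide
    probability stays at or below p_fleet_max, or 0.0 if the nominal estimate
    already violates the requirement.
    """
    exposure = devices * years
    # Largest per-device-year probability that still meets the fleet requirement
    p_allowed = -math.expm1(math.log1p(-p_fleet_max) / exposure)
    return max(0.0, p_allowed / p_nominal - 1.0)


if __name__ == "__main__":
    p = 1e-6                   # "one in a million per device per year"
    devices, years = 400, 40   # illustrative assumptions only
    print(f"fleet-wide probability: {fleet_failure_probability(p, devices, years):.4f}")
    print(f"robustness to error in p: {robustness(p, devices, years, 0.05):.2f}")
```

With these illustrative numbers, the one-in-a-million figure becomes a fleet-wide probability of roughly 1.6%, and the 5% requirement survives only if the true per-device probability is no more than about 3.2 times the estimate. Estimates of such small probabilities can easily be off by more than that, which is the sense in which a reassuring nominal number does not by itself remove the need for disaster recovery.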

Acknowledging the need for disaster recovery capability (DRC) is awkward and uncomfortable for designers and advocates of a technology. We would much rather believe that DRC is not needed, that we have in fact made catastrophe negligible. But let's not conflate good-faith attempts to deal with complex uncertainties, with guaranteed outcomes based on full knowledge. Our best models are in part wrong, so we robustify against the designer's bounded rationality. But robustness cannot guarantee success. The design and implementation of DRC is a necessary part of the design of any critical technology, and is consistent with the strategy of robust satisficing.

One final point: moral hazard and its dilemma. The design of any critical technology entails two distinct and essential elements: failure prevention and disaster recovery. What economists call a 'moral hazard' exists since the failure prevention team might rely on the disaster-recovery team, and vice versa. Each team might, at least implicitly, depend on the capabilities of the other team, and thereby relinquish some of its own responsibility. Institutional provisions are needed to manage this conflict.

The alleviation of this moral hazard entails a dilemma. Considerations of failure prevention and disaster recovery must be combined in the design process. The design teams must be aware of each other, and even collaborate, because a single coherent system must emerge. But we don't want either team to relinquish any responsibility. On the one hand we want the failure prevention team to work as though there is no disaster recovery, and the disaster recovery team should presume that failures will occur. On the other hand, we want these teams to collaborate on the design.

This moral hazard and its dilemma do not obviate the need for both elements of the design. Fukushima has taught us an important lesson by highlighting the special challenge of high-risk critical technologies: design so failure cannot occur, and prepare to respond to the unanticipated.


  1. Insurance companies usually mitigate this problem by penalizing designs that can cost them a lot of money. So the fact that insurance companies won't undertake to insure nuclear reactors can teach us something.

  2. Another case where one cannot get insurance (except from the government) is building on a floodplain. Recent experience on the Missouri River is reminding people why - even the biggest reservoir system in the world can be overwhelmed by sufficient rainfall and snowmelt! What makes the Missouri particularly interesting is that the magnitude of the flood storage ability is reduced in order to ensure water storage to support navigational traffic. Guaranteeing navigation reduces the magnitude of runoff that can be safely stored.

  3. ‘Passive defense’ is a key concept for maintaining and enhancing the earthquake (or disaster) resilience of structures. Stoppers, fail-safe mechanisms, and specification-based (not performance-based) design concepts may be examples of passive defense. If the electric batteries at the Fukushima Daiichi Nuclear Power Plant had been moved to higher levels (a passive measure), the catastrophic event might not have occurred.

    Since the Tohoku earthquake of March 11, the term ‘out of scenario’ has often been used in Japan to describe events of extremely low probability. A science writer in the USA uses the term ‘Black Swan’. However, whether an event is a Black Swan or not depends on the knowledge and understanding of the observer.

    In the field of building structural engineering under severe uncertainty, three important concepts have attracted strong interest recently. The first is redundancy, the second is robustness and the third is resilience. If a building structure is redundant (a sufficient safety factor or a sufficient fail-safe mechanism), the building structure must be robust. Furthermore, if a building structure is robust, it must be resilient to disasters. More redundant structures are strongly desired after recent catastrophic natural disasters and terrorism.

    The most important task is to narrow the range of Black Swan events and reduce the possibility of catastrophic events, desirably to zero. It is hoped that structural engineers will concentrate their skills on this theme by creating new kinds of ‘passive defense’.

  4. Everybody is behaving irrationally.

    In Japan, they built nuclear reactors on very high-risk land, prone to huge earthquakes and even huger tsunamis, and then Fukushima happened.

    In Germany, the exact opposite happened. While most or all locations are extremely low risk there, they decided, in mindless panic, to eliminate all their nuke plants (which means far, far more pollution, not to mention far more expensive energy).

    In the US, the utterly insignificant INcident (not even an accident!) at TMI in 1979 resulted in no new nuke plants being built in 30 years!

    Only France got it right, and even exports nuke-produced electricity to its neighbors!

    I also expect China to act rationally and build as many nuke plants as it can in safe locations.