• Washington, USA

Publications

How We Got Here Incident Investigation Guide for Software

The Post-Incident Guide · Dec 8 2021

Recent events have shown how critical digital services are to individuals, organizations, and society as a whole. Now more than ever, learning from incidents is crucial to meeting service reliability requirements, so that companies can continue delivering on their commitments to clients and keep employees connected and productive.

Readiness to Learn

Safely and Reliably Deploy to the Cloud · Sep 28 2021

Cloud-native companies that are well positioned to meet future market demands know that continuous learning is a competitive advantage.

Human performance in uncertain environments

Utah Avalanche Association · Aug 20 2021

The avalanche community already recognizes the limitations of strict rule following in making judgements about the snowpack. For example, in the Canadian Avalanche Association Observational Guidelines and Reporting Standards for Weather, Snowpack and Avalanches (OGRS), there are seven instances of the word “rule”, six of which indicated that a definitive rule was impossible! (The seventh instance described a rule of thumb and indicated that variability was required.) If the rules by themselves are unable to prescriptively define safe decisions, yet many outcomes are successful, then avalanche professionals must be doing something right! Given this paradox, we can guess that sophisticated cognitive work – in perception, reasoning, evaluation and judgement – goes into successfully managing the ambiguity in forecasting and guiding work.

Covid Resilience series

Covid Resilience series · Jan 1 2021

I served as an editor and author for a number of articles concerning COVID resilience, including: Continuous Learning as a Tool for Adaptation; Designing & Managing for Resilience; Adaptive Frontline Incident Response: Human-Centered Incident Management; Shifting Modes: Creating a Program to Support Sustained Resilience; and Meeting the Challenges of Disrupted Operations: Sustained Adaptability for Organizational Resilience.

Meeting the Challenges of Disrupted Operations

Sustained Adaptability for Organizational Resilience · Dec 31 2020

Specifically, in this article we will look at what an organization can do structurally - through organizational design, tooling, work practices and procedures - during (and prior to) a surprising and disruptive event to establish the conditions that help engineering teams adapt in practice and in real time as the event unfolds. Organizations create the conditions for adaptation through how they structure the system of work surrounding individuals and teams; in doing so, they either expand or constrain the potential for safe adaptation. We’ll explore an example of this by looking at two typically independent performance improvement initiatives - incident analysis and chaos engineering - that, when integrated, improve system performance and also improve the capacity of individuals within the organization to recognize and cope with problems in real time.