
Framework 10 of 12 · Recovery Phase · Graceful degradation
Failure Architecture Blueprint
The kintsugi principle: failures repaired visibly become sources of deeper trust
A design template for how agentic systems fail gracefully - including detection, communication, containment, recovery, and trust restoration.
Failure Architecture Blueprint: Core Principles
Failure Is Inevitable and Designable
Agentic systems will fail. The question is not whether, but how. Failure Architecture treats failure as a designable experience - not an exception to be handled, but a first-class scenario with its own design patterns, user experiences, and trust implications.
Detection Must Be Immediate
The time between failure occurrence and detection is the danger zone. During this period, the agent may continue acting on flawed assumptions, compounding the original error. Failure Architecture prioritises rapid detection through monitoring, anomaly detection, and integrity checks.
Communication Must Be Honest and Complete
When failure is detected, the communication to the user must be honest about what happened, what the impact is, and what the system is doing about it. Euphemistic or incomplete failure communication erodes trust faster than the failure itself.
Containment Prevents Cascade
A failure in one part of the system should not cascade to other parts. Containment architecture isolates failures, prevents propagation, and preserves the integrity of unaffected operations. This is especially critical in multi-agent systems where one agent's failure could compromise others.
Recovery Must Restore Trust, Not Just Function
Fixing the technical problem is necessary but not sufficient. The user's trust was damaged by the failure, and the recovery experience must address both the functional and the relational damage. Visible, transparent recovery - the kintsugi principle - can actually strengthen trust beyond its pre-failure level.
The measure of an agentic system is not whether it fails - all systems fail. The measure is how it fails: how quickly it detects, how honestly it communicates, how effectively it contains, and how completely it recovers.
Failure Architecture Blueprint: Implementation Patterns
Failure Detection Patterns
Monitoring and anomaly detection patterns that identify failures quickly: outcome verification checks, constraint violation alerts, performance degradation signals, and integrity verification routines. Designed for both real-time detection and post-hoc discovery.
When to use: As continuous monitoring infrastructure running alongside all agent operations.
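As a minimal sketch of an outcome verification check, the Python below compares what the agent intended with what was actually observed and returns a list of violations. The `Outcome` fields, the `verify_outcome` name, and the 1% price tolerance are illustrative assumptions, not part of the framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Hypothetical record of one agent action (field names are assumptions)."""
    item_id: str
    price: float
    vendor: str


def verify_outcome(intended: Outcome, observed: Outcome,
                   price_tolerance: float = 0.01) -> list[str]:
    """Return human-readable violations; an empty list means the check passed."""
    violations = []
    if observed.item_id != intended.item_id:
        violations.append(
            f"wrong item: expected {intended.item_id}, got {observed.item_id}")
    if observed.price > intended.price * (1 + price_tolerance):
        violations.append(
            f"price exceeded: expected at most {intended.price:.2f}, "
            f"got {observed.price:.2f}")
    if observed.vendor != intended.vendor:
        violations.append(
            f"wrong vendor: expected {intended.vendor}, got {observed.vendor}")
    return violations
```

Running a check like this after every action supports real-time detection. Running it over stored action records supports post-hoc discovery.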
Communication Protocols
Structured patterns for communicating failures to users: what happened, what the impact is, what the system is doing, and what the user needs to do. Includes severity-calibrated templates, honest language guidelines, and multi-channel notification patterns.
When to use: Immediately upon failure detection, calibrated to the severity of the failure.
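One way to make a severity-calibrated template concrete is sketched below. It models the four parts named above as required fields, so a notice cannot be built with one of them missing. The `Severity` levels, the channel names, and the wording of the rendered lines are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Hypothetical channel choices per severity; a real system would configure these.
CHANNELS = {
    Severity.LOW: ["in_app"],
    Severity.MEDIUM: ["in_app", "email"],
    Severity.HIGH: ["in_app", "email", "push"],
}


@dataclass(frozen=True)
class FailureNotice:
    severity: Severity
    what_happened: str
    impact: str
    system_action: str
    user_action: str

    def render(self) -> str:
        """Render all four parts; none is optional, so nothing can be left out."""
        return "\n".join([
            f"[{self.severity.value.upper()}] What happened: {self.what_happened}",
            f"Impact: {self.impact}",
            f"What we are doing: {self.system_action}",
            f"What you need to do: {self.user_action}",
        ])

    def channels(self) -> list[str]:
        """Return the notification channels assumed for this severity."""
        return CHANNELS[self.severity]
```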
Containment Architecture
Technical and experiential patterns for isolating failures and preventing cascade. Includes circuit breakers, fallback modes, graceful degradation pathways, and scope limitation patterns that contain the blast radius of any single failure.
When to use: As foundational infrastructure in any multi-component agentic system.
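The circuit breaker mentioned above can be sketched in a few lines of Python. This is a simplified illustration, not a production implementation: the threshold, the cool-down period, and the `fallback` callable (standing in for a degraded mode) are all assumed parameters.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is refused because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after = reset_after              # cool-down in seconds
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, operation, fallback=None):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: do not touch the failing component at all.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("circuit open")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            if fallback is not None:
                return fallback()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get the fallback immediately. This keeps one failing component from consuming retries or stalling the components that depend on it.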
Recovery Pathways
Structured approaches to restoring both function and trust after failure. Includes automated rollback patterns, manual intervention workflows, compensation mechanisms, and trust restoration sequences. Each pathway is designed for a specific failure severity level.
When to use: After containment, as the primary mechanism for returning to normal operation.
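The idea of one pathway per severity level could be expressed as a simple routing table. The sketch below is hypothetical: the three severity levels, the handler names, and the assignment of pathway to level are assumptions, since the framework does not prescribe a specific mapping.

```python
from enum import Enum


class Severity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Placeholder handlers; each would trigger a real workflow in practice.
def automated_rollback(failure_id: str) -> str:
    return f"{failure_id}: rolled back automatically"


def manual_intervention(failure_id: str) -> str:
    return f"{failure_id}: queued for human review"


def compensate_and_escalate(failure_id: str) -> str:
    return f"{failure_id}: compensation issued and escalated"


# Assumed mapping of severity to recovery pathway.
PATHWAYS = {
    Severity.LOW: automated_rollback,
    Severity.MEDIUM: manual_intervention,
    Severity.HIGH: compensate_and_escalate,
}


def recover(failure_id: str, severity: Severity) -> str:
    """Route a contained failure to the pathway for its severity."""
    return PATHWAYS[severity](failure_id)
```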
Trust Restoration Design
Patterns specifically focused on rebuilding trust after failure. Includes transparent post-mortem communication, demonstrated improvement evidence, increased oversight periods, and progressive re-delegation pathways. Based on the kintsugi principle that visible repair can strengthen trust.
When to use: After functional recovery, as the final phase of the failure response lifecycle.
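Progressive re-delegation can be pictured as a ladder of autonomy levels that the agent climbs back up after a failure. The following sketch assumes four hypothetical levels and a fixed number of clean actions per promotion. Both are illustrative choices, not values taken from the framework.

```python
from dataclasses import dataclass

# Hypothetical autonomy ladder, from most oversight to least.
LEVELS = [
    "approve_every_action",
    "approve_high_value_only",
    "notify_after_action",
    "full_autonomy",
]


@dataclass
class DelegationState:
    level: int = len(LEVELS) - 1   # index into LEVELS; starts at full autonomy
    clean_streak: int = 0          # successful actions since the last level change
    promote_after: int = 5         # assumed threshold; tune per product

    def record_failure(self) -> None:
        """A failure starts an increased-oversight period at the strictest level."""
        self.level = 0
        self.clean_streak = 0

    def record_success(self) -> None:
        """Each run of clean actions earns back one step of autonomy."""
        if self.level == len(LEVELS) - 1:
            return
        self.clean_streak += 1
        if self.clean_streak >= self.promote_after:
            self.level += 1
            self.clean_streak = 0

    @property
    def mode(self) -> str:
        return LEVELS[self.level]
```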
Failure Architecture Blueprint: Commerce Applications
Transaction Reversal and Dispute
When an agentic purchase goes wrong - wrong item, wrong price, wrong vendor - the Failure Architecture Blueprint governs the reversal process. This includes automated return initiation, refund processing, and dispute escalation. The framework ensures that the consumer's financial exposure is minimised and the recovery experience is transparent.
Substitution Failure Handling
When the agent substitutes an out-of-stock item and the substitution is unacceptable, the framework provides patterns for rapid correction: automatic return of the substitution, re-search with tighter constraints, and human consultation if the original intent cannot be satisfied.
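The three correction steps can be sketched as a short decision flow. In this hypothetical example, `search` stands in for whatever catalogue lookup the agent uses, and the step labels are invented for illustration.

```python
def handle_rejected_substitution(search, tightened_constraints):
    """Return the ordered correction steps for an unacceptable substitution.

    `search(constraints)` is a hypothetical callable returning a list of
    matching item identifiers (possibly empty).
    """
    # Step 1: always send the unwanted substitution back.
    steps = ["initiate_return_of_substitution"]
    # Step 2: search again with tighter constraints.
    matches = search(tightened_constraints)
    if matches:
        steps.append(f"propose:{matches[0]}")
    else:
        # Step 3: the original intent cannot be satisfied, so consult the human.
        steps.append("ask_human")
    return steps
```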
Payment Processing Failures
Payment failures in agentic commerce require special handling because the consumer may not be present when they occur. The framework defines patterns for payment retry, alternative payment method selection, transaction hold communication, and order preservation during payment resolution.
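A rough sketch of retry, alternative method selection, and order preservation follows. It assumes a hypothetical `charge(method)` callable that returns True on success, and a `methods` list containing only payment methods the consumer has already approved. Backoff between retries and idempotency handling are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PaymentResult:
    status: str                       # "paid" or "on_hold"
    method: Optional[str] = None      # method that succeeded, if any
    attempts: list = field(default_factory=list)  # every method tried, in order


def pay_with_fallback(charge, methods, max_retries_per_method=2):
    """Try each approved payment method in order; hold the order if all fail."""
    attempts = []
    for method in methods:
        for _ in range(max_retries_per_method):
            attempts.append(method)
            if charge(method):
                return PaymentResult("paid", method, attempts)
    # Nothing worked: preserve the order and notify the consumer instead of cancelling.
    return PaymentResult("on_hold", None, attempts)
```

The `attempts` list gives the communication layer an honest record of what was tried while the consumer was absent.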
Vendor Reliability Failures
When a trusted vendor fails to deliver - late shipment, damaged goods, incorrect items - the framework governs both the immediate resolution and the long-term trust recalibration. The agent's trust score for that vendor is adjusted, and the consumer is informed of the change.
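A trust score adjustment of this kind might look like the sketch below. The 0 to 1 score range and the penalty weights are assumptions: the framework names the failure types but does not specify how much each should cost.

```python
# Assumed penalty per failure type (illustrative values only).
PENALTIES = {
    "late_shipment": 0.05,
    "damaged_goods": 0.15,
    "incorrect_items": 0.10,
}


def adjust_vendor_trust(score: float, failure_type: str) -> tuple[float, str]:
    """Return the new score, clamped to 0..1, and a message for the consumer."""
    new_score = max(0.0, min(1.0, score - PENALTIES[failure_type]))
    message = (
        f"Vendor trust lowered from {score:.2f} to {new_score:.2f} "
        f"after {failure_type.replace('_', ' ')}."
    )
    return new_score, message
```

Returning the message together with the score keeps the adjustment and the consumer notification from drifting apart.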
Kintsugi teaches that broken things repaired with gold become more beautiful than they were before. In agentic design, failures handled with transparency and care become the foundation of deeper trust.
Failure Architecture Blueprint: Guidance for Teams
Start With
- Map your system's top 5 most likely failure modes
- Build detection mechanisms for each failure mode
- Create honest, severity-calibrated communication templates
- Implement containment patterns that prevent failure cascade
Build Toward
- Automated failure classification and routing to appropriate recovery pathways
- Predictive failure detection that identifies problems before they manifest
- Cross-agent failure coordination in multi-agent systems
- Longitudinal failure analysis that identifies systemic patterns
Measure By
- Mean time to detection - how quickly are failures identified?
- Mean time to communication - how quickly are users informed?
- Containment success rate - how often are failures isolated before cascading?
- Trust recovery rate - do users resume delegation after failure recovery?
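The four measures above can be computed from a simple log of failure records, as in this sketch. The `FailureRecord` fields are hypothetical. Measuring time to communication from the moment of detection, rather than from the moment of failure, is also an assumption.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class FailureRecord:
    occurred_at: float      # timestamps in seconds
    detected_at: float
    communicated_at: float
    contained: bool         # was the failure isolated before cascading?
    user_resumed: bool      # did the user delegate again after recovery?


def failure_metrics(records: list[FailureRecord]) -> dict[str, float]:
    """Compute the four measures over a non-empty list of failure records."""
    if not records:
        raise ValueError("need at least one failure record")
    n = len(records)
    return {
        "mean_time_to_detection": mean(
            r.detected_at - r.occurred_at for r in records),
        "mean_time_to_communication": mean(
            r.communicated_at - r.detected_at for r in records),
        "containment_success_rate": sum(r.contained for r in records) / n,
        "trust_recovery_rate": sum(r.user_resumed for r in records) / n,
    }
```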
Failure Architecture Blueprint: Lifecycle Connections
Framework 04
Trust Calibration Model
Failure is the primary cause of trust erosion. The Trust Calibration Model provides the framework for measuring and restoring trust after failures.
Framework 09
Explainability & Observability
Honest failure communication depends on explainability. The agent must be able to explain what went wrong and why.
Framework 05
Interrupt Pattern Library
Failure communication is the highest-priority interrupt. The Interrupt Pattern Library provides the delivery mechanisms.
Failure Architecture Blueprint: What Comes Next
Failure Architecture handles what happens when things go wrong. The next framework - Onboarding & Capability Discovery - designs the entry experience that sets accurate expectations from the start.