Chapter 1: Why Software Architecture is More Important Than Ever
Architecture in the Agile world, architecture in the cloud, the role of the architect as a “non-value adder”. Definition of Continuous Architecture and its set of principles. Description of the end-to-end case study.
Chapter 2: Architecture in Practice—Essential Activities
Maxim Smirnov — IT architect, author of the “Architecture of IT Solutions” channel. In the past, the chief architect of Beeline, the Bank of Russia and Binbank Digital.
Fault
A condition that can arise at random and, if it occurs, can lead to the failure of a system or component.
Failure
A situation in which the system deviates from its required behavior. A fault is the cause; a failure is the consequence.
Availability
A measurable characteristic: the ratio of the time the system is available to its total operating time.
Reliability
Probability of failure-free operation over a specified period of time in a specified environment.
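The two measurable definitions above can be illustrated with a minimal Python sketch. The function names and the sample numbers are hypothetical. The reliability formula additionally assumes a constant failure rate (an exponential model), which the definition itself does not require.

```python
import math


def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Ratio of the time the system was available to its total operating time."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total operating time must be positive")
    return uptime_hours / total


def reliability(period_hours: float, mtbf_hours: float) -> float:
    """Probability of failure-free operation over period_hours.

    ASSUMPTION: failures arrive at a constant rate of 1 / mtbf_hours
    (exponential model). Real systems may not follow this model.
    """
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return math.exp(-period_hours / mtbf_hours)


# Hypothetical year of operation: 8,750 hours up, 10 hours down.
yearly_availability = availability(8750, 10)

# Hypothetical component with a 1,000-hour MTBF, observed for one day.
one_day_reliability = reliability(24, 1000)

print(f"availability: {yearly_availability:.4%}")
print(f"reliability over 24 h: {one_day_reliability:.4f}")
```

Availability here is a backward-looking measurement, while reliability is a forward-looking probability, so a system can score well on one and poorly on the other.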
High-Availability vs Resilience
Old approach: High-Availability
Application and database clusters
Cross-site data replication
Hot standby for emergency failover
Problems: complexity, high cost, slow failover, complete unavailability during recovery
New approach: Resilience
Each part of the system is responsible for its own stability
Adaptive behavior during failures
Limiting error propagation (blast radius)
Advantages: flexibility, graceful degradation, quick recovery of individual components
Key insight: High-Availability was designed for monolithic on-premise systems. For distributed microservice systems (on-premise or in the cloud), Resilience is a more suitable approach.
MTBF vs MTTR
MTBF
Mean Time Between Failures
The average time between consecutive failures. The High-Availability approach focuses on maximizing it.
MTTR
Mean Time To Recover
The average time needed to recover after a failure. The Resilience approach focuses on minimizing downtime and blast radius.
Paradigm shift: in modern systems, failures of individual parts are frequent. Instead of trying to prevent every failure (maximizing MTBF), it is more important to recover quickly (minimizing MTTR) and to limit the impact of each failure.
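The trade-off between the two metrics can be made concrete with the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The sketch below uses hypothetical numbers and a hypothetical function name. It only shows that shortening recovery tenfold yields the same availability as making failures ten times rarer.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Long-run availability from mean time between failures and mean time to recover."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: a failure every 1,000 hours, one hour to recover.
baseline = steady_state_availability(1000, 1.0)

# Option A (High-Availability mindset): make failures ten times rarer.
longer_mtbf = steady_state_availability(10_000, 1.0)

# Option B (Resilience mindset): recover ten times faster.
faster_mttr = steady_state_availability(1000, 0.1)

print(f"baseline:    {baseline:.6f}")
print(f"longer MTBF: {longer_mtbf:.6f}")
print(f"faster MTTR: {faster_mttr:.6f}")
```

Both options give the same result on paper. In a distributed system, cutting recovery time is often the cheaper of the two, which is the argument the shift toward MTTR rests on.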
Mechanisms for achieving Resilience
Retries
Repeating failed requests with exponential backoff
Automatic restart
Self-healing processes
Circuit Breaker
Opening the circuit when a dependency keeps failing
Bulkhead
Isolation of system compartments
Timeout
Handling latency correctly
Fallback
Alternative responses when the primary path fails
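Three of the mechanisms listed above (retry with exponential backoff, circuit breaker, fallback) can be combined as in the following minimal Python sketch. It is an illustration under stated assumptions, not a production implementation. All class, function, and parameter names are hypothetical, the breaker is not thread-safe, and timeouts and bulkheads are left out (timeouts are usually passed to the client call itself).

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Opens after max_failures consecutive errors. Once reset_after seconds
    have passed, the next call is let through as a trial (half-open state)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit is open")
            # Reset window elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit
        return result


def retry(func, attempts=4, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call func, retrying failures with exponential backoff and random jitter."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # do not keep hammering a dependency the breaker has cut off
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # jitter spreads out retry storms


def with_fallback(func, fallback_value):
    """Return func(), or fallback_value if the call fails (graceful degradation)."""
    try:
        return func()
    except Exception:
        return fallback_value


def flaky_price_service():
    """Hypothetical dependency that is currently down."""
    raise ConnectionError("service unavailable")


breaker = CircuitBreaker(max_failures=3, reset_after=30.0)
price = with_fallback(
    # sleep is replaced with a no-op so the example runs instantly
    lambda: retry(lambda: breaker.call(flaky_price_service), sleep=lambda s: None),
    fallback_value="cached price",
)
print(price)
```

In this run the first three attempts fail and open the circuit, the fourth is rejected immediately by the breaker, and the caller receives the fallback value instead of an error. The failure stays contained in one component, which is the blast-radius limit described earlier.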
Additional Resources
The authors created the website continuous-architecture.org with a set of documents: