System Design Space

Updated: March 25, 2026 at 12:30 AM

Infrastructure as Code

Difficulty: medium

IaC practice for cloud-native: declarative models, state management, reusable modules, drift detection and secure rollout strategies.

Infrastructure as Code becomes necessary once infrastructure no longer fits in the team's collective memory and must become a reviewable history of decisions.

In real design work, the chapter shows how declarative definitions, policy checks, reusable modules, and disciplined handling of state and secrets turn infrastructure changes from manual magic into a repeatable engineering process.

In interviews and architecture reviews, this chapter helps frame IaC through reproducibility, drift, safety, and rollback rather than only through the choice of Terraform or another tool.

Practical value of this chapter

Design in practice

Model infrastructure declaratively and include policy checks before production rollouts.

Decision quality

Separate reusable modules, state backends, and secret handling for scalable IaC operations.

Interview articulation

Describe the full change lifecycle: plan, review, apply, drift detection, and rollback strategy.

Trade-off framing

Explain the balance between delivery speed and safety when infrastructure is managed as code.

Context

Cloud Native Overview

IaC transforms infrastructure from a manual effort into a repeatable engineering process.


Infrastructure as Code is the discipline of platform management through versioned declarations and controlled pipelines. Its main advantage is repeatability and auditability; its main requirement is a strict engineering process around changes.

Basic principles

  • Infrastructure is described declaratively and versioned in the same way as application code.
  • Changes undergo review, policy checks, and an automated plan/apply pipeline.
  • Repeatability matters more than manual speed: the same pattern deploys the same way in every environment.
  • Any drift between the code and the actual infrastructure must be detected and corrected.
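The drift principle above boils down to diffing declared attributes against actual ones. A minimal sketch in Python (illustrative only; real tools such as Terraform detect drift by running `plan` against refreshed state):

```python
def detect_drift(declared: dict, actual: dict) -> dict:
    """Compare declared resource attributes with actual ones.

    Returns attribute -> (declared, actual) for every mismatch,
    including attributes missing on either side.
    """
    drift = {}
    for key in declared.keys() | actual.keys():
        want, have = declared.get(key), actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift

declared = {"instance_type": "t3.small", "encrypted": True}
actual = {"instance_type": "t3.large", "encrypted": True}
print(detect_drift(declared, actual))  # {'instance_type': ('t3.small', 't3.large')}
```

An empty result means the environment matches the code; anything else is drift to remediate, either by re-applying or by backporting the change into the repository.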

Architectural areas of attention

State management

Store state centrally, with locking and versioning. Losing state destroys the controllability of changes.
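The two guarantees named here, locking and versioning, can be sketched as a toy backend (illustrative only; real backends such as S3 with DynamoDB locking or Terraform Cloud provide the same guarantees across machines, not just threads):

```python
import threading

class StateBackend:
    """Toy state backend with locking and version history."""

    def __init__(self):
        self._lock = threading.Lock()
        self._versions = [{}]  # index = version number; old states are kept

    def apply(self, new_state: dict) -> int:
        # Refuse concurrent applies: a held lock means another run is active.
        if not self._lock.acquire(blocking=False):
            raise RuntimeError("state is locked by another apply")
        try:
            self._versions.append(dict(new_state))
            return len(self._versions) - 1  # new version number
        finally:
            self._lock.release()

    def rollback(self, version: int) -> dict:
        """Versioning makes recovery possible: any old state is retrievable."""
        return dict(self._versions[version])
```

Locking prevents two applies from corrupting each other; versioning is what makes a backup/restore runbook for state possible at all.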

Module boundaries

Structure modules by domain ownership. Avoid giant root modules with implicit dependencies.

Secrets & config

Secrets should not live in an IaC repository. Use secret managers and short-lived credentials.
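One common way to honor this rule is to keep only secret *references* in the repository and resolve them at pipeline runtime. A minimal sketch, assuming a hypothetical `env:`/`vault:` reference scheme (the secret-manager call is stubbed):

```python
import os

def resolve_secret(ref: str) -> str:
    """Resolve a secret reference at pipeline runtime.

    The IaC repo stores only references like 'env:DB_PASSWORD';
    the actual value is injected by the CI runner or secret manager.
    """
    scheme, _, path = ref.partition(":")
    if scheme == "env":
        value = os.environ.get(path)
        if value is None:
            raise KeyError(f"secret {path!r} was not injected into the environment")
        return value
    if scheme == "vault":
        raise NotImplementedError("call your secret manager's SDK here")
    raise ValueError(f"unknown secret scheme: {scheme!r}")
```

Because only references are committed, rotating a credential never requires a commit, and nothing sensitive enters commit history or CI artifacts.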

Policy as code

Codify the required guardrails: naming, encryption, network policies, quotas, and region restrictions.
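A policy-as-code gate is essentially a pure function from a declared resource to a list of violations. A minimal sketch with example guardrail values (real setups typically run OPA, Sentinel, or Conftest against the plan output instead):

```python
REQUIRED_TAGS = {"owner", "env", "cost-center"}      # example guardrail values
ALLOWED_REGIONS = {"eu-west-1", "eu-central-1"}

def check_policies(resource: dict) -> list[str]:
    """Return guardrail violations for one declared resource."""
    violations = []
    if not resource.get("encrypted", False):
        violations.append("encryption at rest is required")
    missing = REQUIRED_TAGS - set(resource.get("tags", {}))
    if missing:
        violations.append(f"missing required tags: {sorted(missing)}")
    if resource.get("region") not in ALLOWED_REGIONS:
        violations.append("region not in allowed list")
    return violations

bucket = {
    "encrypted": True,
    "tags": {"owner": "payments", "env": "prod", "cost-center": "42"},
    "region": "eu-west-1",
}
print(check_policies(bucket))  # []
```

An empty list means the change passes the gate; a non-empty list blocks the merge or apply, which is what makes the guardrails mandatory rather than advisory.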

Next

GitOps

GitOps extends IaC through pull-based reconciliation and continuous drift correction.


Tool selection

Terraform/OpenTofu

Standardized multi-cloud provisioning and mature provider ecosystem.

Pulumi/CDK

Infrastructure as full-fledged code in programming languages with reusable abstractions.

Kubernetes manifests + controllers

Declarative management of cluster resources and platform API at runtime.

IaC operating model

Authoring

Modules, variables, and naming conventions define the platform contract. This is where linting and static policy checks should be enforced first.

Outcome: A clear pull request with controlled blast radius and readable infrastructure diff.

Planning

The pipeline generates a plan with expected changes to resources, permissions, and network policies. This is the core control point before apply.

Outcome: An approved plan reviewed by platform, security, and owning team.
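To make that review tractable, the plan is usually summarized into action counts. A sketch that reads the documented `terraform show -json` shape, where `resource_changes[].change.actions` of `["delete", "create"]` denotes a replacement:

```python
def summarize_plan(plan: dict) -> dict:
    """Summarize a JSON plan into counts per action type."""
    counts = {"create": 0, "update": 0, "delete": 0, "replace": 0, "no-op": 0}
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if sorted(actions) == ["create", "delete"]:
            counts["replace"] += 1  # destroy-and-recreate: highest-risk action
        else:
            for action in actions:
                counts[action] += 1
    return counts
```

Reviewers can then apply simple rules to the summary, for example requiring an extra approval whenever `delete` or `replace` is non-zero, since those actions carry the largest blast radius.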

Apply

Changes are applied only via automated pipelines with audit trail, state locking, and controlled parallelism.

Outcome: Repeatable rollout without manual edits in cloud consoles.
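The audit-trail requirement can be sketched as a thin apply wrapper that records who approved which exact plan, and when (illustrative; a real pipeline would also hold the state lock and run the IaC tool's apply on the saved plan file):

```python
import hashlib
import json
import time

AUDIT_LOG = []  # in practice: an append-only store written only by the pipeline

def pipeline_apply(plan: dict, approved_by: list[str]) -> str:
    """Apply only via the pipeline: refuse unapproved changes, log everything."""
    if not approved_by:
        raise PermissionError("apply requires at least one approval")
    # Hash the plan so the log proves exactly what was applied.
    plan_hash = hashlib.sha256(
        json.dumps(plan, sort_keys=True).encode()
    ).hexdigest()
    AUDIT_LOG.append({
        "plan_sha256": plan_hash,
        "approved_by": approved_by,
        "applied_at": time.time(),
    })
    return plan_hash
```

Hashing the plan ties each audit entry to the exact change set, so a later incident review can confirm that what ran in production is what was approved.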

Operate

Continuous drift detection, module lifecycle management, credential rotation, and postmortems for failed applies.

Outcome: Stable IaC operations and fewer unplanned platform incidents.

Related topic

Cost Optimization & FinOps

IaC and FinOps should be connected: cost guardrails and resource governance must be codified.


Environment strategy and ownership

Account/Subscription per environment

Best fit: Large organizations with strict isolation and security boundary requirements.

Strengths

  • Clear blast-radius separation between dev/stage/prod.
  • Easier enforcement of isolated budget and access policies.

Risks

  • Higher operational overhead for bootstrap and baseline maintenance.
  • Requires standardized landing zones and reusable module libraries.

Workspace-per-env in a single account

Best fit: Teams with moderate scale and limited platform engineering capacity.

Strengths

  • Faster initial adoption and lower operating overhead.
  • Simpler unified pipeline for common service templates.

Risks

  • Lower isolation and higher risk of accidental cross-environment changes.
  • Needs strict discipline around naming, state, and config boundaries.

Domain-owned stacks

Best fit: Organizations with platform teams and federated domain ownership.

Strengths

  • Domain teams own their infrastructure lifecycle and ship faster.
  • Platform team can focus on reusable modules and guardrails.

Risks

  • Without governance, quality bar diverges across domains.
  • Requires a shared module catalog and centralized policy model.

Common anti-patterns

One global state for the whole platform

Problem: A single state file becomes a bottleneck: lock contention, long applies, and large blast radius on failure.

Fix: Split state by domain/environment and reduce cross-stack coupling.

Manual hotfixes in cloud consoles

Problem: Out-of-band console changes create drift and make the next apply unpredictable.

Fix: Backport emergency changes to IaC via PR immediately after incident mitigation.

Secrets stored in repository

Problem: Secrets in tfvars/manifests leak into commit history and CI artifact retention.

Fix: Use secret managers, short-lived credentials, and runtime injection in pipelines.

Apply from local laptops

Problem: Local apply bypasses audit trail, increases version skew risk, and hurts reproducibility.

Fix: Allow apply only from centralized CI/CD runners with policy gates.

Patterns that work in practice

  • Versioned module library with backwards-compatible interfaces.
  • Mandatory policy gates: encryption, tagging, network boundaries, IAM least privilege.
  • Ephemeral preview environments for risky platform changes.
  • Nightly drift detection with auto-created remediation tickets.
  • Unified ownership catalog for modules, state backends, and runtime operations.
  • Progressive apply rollout strategy for critical production resources.
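The last pattern, progressive apply, can be sketched as partitioning resource addresses into waves that are applied and verified one at a time (illustrative; with Terraform this maps to repeated targeted applies, trading atomicity for a smaller blast radius):

```python
def rollout_waves(resources: list[str], wave_size: int) -> list[list[str]]:
    """Split resource addresses into progressive apply waves.

    Each wave is applied and verified before the next begins, so a bad
    change is caught while it affects only a fraction of the fleet.
    """
    if wave_size < 1:
        raise ValueError("wave_size must be >= 1")
    return [resources[i:i + wave_size] for i in range(0, len(resources), wave_size)]

waves = rollout_waves(["node_pool.a", "node_pool.b", "node_pool.c"], wave_size=1)
print(waves)  # [['node_pool.a'], ['node_pool.b'], ['node_pool.c']]
```

A wave size of one is the most conservative choice for critical production resources; larger waves speed up rollout once confidence in the change grows.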

Implementation roadmap (0-120 days)

0-30 days

IaC platform baseline

Set up state backend, locking/versioning, repository structure, and shared naming/tagging standards.

30-60 days

Policy and security

Introduce policy-as-code, misconfiguration scanning, secret management, and mandatory review workflow.

60-90 days

Delivery stabilization

Standardize plan/apply pipelines, clarify ownership boundaries, and add rollback/recovery runbooks for state.

90-120 days

Domain scaling

Onboard domain teams to shared modules, maturity metrics, and regular drift governance cycles.

Security

Supply Chain Security

The IaC pipeline must be part of the trust chain: signatures, provenance, and dependency control.


IaC maturity metrics

Lead time for infrastructure change

Target: < 1 day for standard changes

Shows whether IaC speeds up delivery instead of adding process overhead.

Change failure rate

Target: Quarter-over-quarter reduction

Share of IaC changes that cause rollback or incidents.

Drift resolution time

Target: < 24 hours for critical drift

How fast infrastructure returns to desired state after manual or emergency deviations.

Policy compliance

Target: >= 95% successful policy checks

How consistently teams comply with mandatory guardrails.

Module reuse ratio

Target: > 60%

Share of infrastructure provisioned through standard platform modules.
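Two of these metrics are simple ratios and are worth automating. A sketch, assuming each resource record carries a hypothetical `from_platform_module` flag set by the module catalog:

```python
def change_failure_rate(total_changes: int, failed_changes: int) -> float:
    """Share of IaC changes that caused a rollback or incident."""
    if total_changes == 0:
        return 0.0
    return failed_changes / total_changes

def module_reuse_ratio(resources: list[dict]) -> float:
    """Share of resources provisioned through standard platform modules."""
    if not resources:
        return 0.0
    reused = sum(1 for r in resources if r.get("from_platform_module"))
    return reused / len(resources)

print(change_failure_rate(20, 2))  # 0.1
```

Tracking these per quarter (failure rate trending down, reuse ratio above the 60% target) turns the maturity table above into a dashboard rather than a one-off assessment.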

Practical checklist

  • There is a single plan/apply workflow with mandatory review and an audit trail.
  • Critical changes pass through policy gates before merge/apply.
  • The state backend is protected, versioned, and covered by a backup/restore runbook.
  • Drift detection runs regularly in all key environments.
  • There is a strategy for module decomposition and team ownership.
