System Design Space

Updated: May 11, 2026 at 5:54 AM

Infrastructure as Code

medium

Infrastructure as Code practice: declarative infrastructure management, state, modules, policy checks, drift control, and auditable change application.

Infrastructure as Code becomes truly necessary once infrastructure no longer fits in the team’s memory and has to become a reviewable history of decisions.

In real design work, the chapter shows how declarative definitions, policy checks, reusable modules, state, and secrets turn infrastructure changes from manual magic into a repeatable engineering process.

In interviews and architecture reviews, it helps frame Infrastructure as Code through reproducibility, drift, safety, and rollback rather than only through the choice of Terraform or another tool.

Practical value of this chapter

Design in practice

Model infrastructure declaratively and include policy checks before production rollouts.

Decision quality

Separate reusable modules, state backends, and secret handling for scalable IaC operations.

Interview articulation

Describe the full change lifecycle: plan, review, apply, drift detection, and rollback strategy.

Trade-off framing

Explain the balance between delivery speed and safety when infrastructure is managed as code.

Context

Cloud Native Overview

Infrastructure as Code turns manual platform operations into a repeatable engineering process.

Infrastructure as Code is the discipline of managing a platform through versioned declarations, automated checks, and controlled application of changes. Its main advantages are repeatability and auditability; its main requirement is a rigorous engineering process around every change.

This chapter connects desired state, state backends, policy as code, infrastructure drift, audit trails, and the plan/apply pipeline into one manageable change model.

Basic principles

  • Infrastructure is described declaratively and versioned with the same discipline as application code.
  • Changes pass through review, policy checks, and a controlled plan/apply pipeline.
  • Repeatability matters more than manual speed: the same template should behave consistently across environments.
  • Any infrastructure drift between code and the real environment must be detected and corrected.
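
The last principle can be made concrete with a minimal sketch of a desired-versus-actual comparison. The resource names and attributes below are hypothetical, and real tools compare far richer resource models; the point is only that drift is the set of differences between what the code declares and what the environment contains.

```python
from typing import Any

Resources = dict[str, dict[str, Any]]


def diff_state(desired: Resources, actual: Resources) -> dict[str, list[str]]:
    """Compare declared resources with observed ones and classify the differences."""
    to_create = sorted(name for name in desired if name not in actual)
    to_delete = sorted(name for name in actual if name not in desired)
    to_update = sorted(
        name for name in desired
        if name in actual and desired[name] != actual[name]
    )
    return {"create": to_create, "update": to_update, "delete": to_delete}


# Hypothetical declarations from the repository.
desired = {
    "bucket-logs": {"encrypted": True, "region": "eu-west-1"},
    "queue-orders": {"retention_days": 7},
}
# Hypothetical view of the real environment.
actual = {
    "bucket-logs": {"encrypted": False, "region": "eu-west-1"},  # changed by hand
    "vm-debug": {"size": "small"},  # created outside the repository
}

print(diff_state(desired, actual))
```

An empty result in all three lists would mean the environment matches the code; anything else is either a pending change or drift that needs a decision.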

Architectural areas of attention

State management

Store state centrally, use state locking, and keep versions. If state is lost or corrupted, the tool can no longer tell which real resources it manages, so plans become unreliable and changes are hard to reason about.
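
As an illustration of locking and versioning together, here is a toy in-memory backend. The class and method names are assumptions made for this sketch; a real backend would use remote storage and a distributed lock, but the contract is similar: one writer at a time, and every write produces a new version that can be read back.

```python
import copy
import threading
from contextlib import contextmanager


class InMemoryStateBackend:
    """Toy state backend: one lock and an append-only version history."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._versions = [{}]  # version 0 is the empty initial state

    @contextmanager
    def locked(self):
        # Non-blocking acquire: a second writer fails fast instead of waiting.
        if not self._lock.acquire(blocking=False):
            raise RuntimeError("state is locked by another run")
        try:
            yield self
        finally:
            self._lock.release()

    def read(self, version: int = -1) -> dict:
        # Return a copy so callers cannot mutate stored history.
        return copy.deepcopy(self._versions[version])

    def write(self, new_state: dict) -> int:
        # The caller is expected to hold the lock; old versions are never overwritten.
        self._versions.append(copy.deepcopy(new_state))
        return len(self._versions) - 1


backend = InMemoryStateBackend()
with backend.locked():
    state = backend.read()
    state["bucket-logs"] = {"encrypted": True}
    version = backend.write(state)

print(version, backend.read(0), backend.read(version))
```

Keeping earlier versions readable is what makes a state recovery runbook possible: a bad write can be compared against, or replaced by, a known-good version.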

Module boundaries

Structure modules around domain responsibility. Avoid giant root modules with hidden dependencies.

Secrets and configuration

Secrets should not live in the infrastructure repository. Use secret managers and short-lived credentials.
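
One way to keep secrets out of the repository is to commit only references and resolve them when the pipeline runs. The sketch below assumes a hypothetical `secret://NAME` reference format and a pluggable `lookup` function standing in for a secret manager client; neither comes from a specific tool.

```python
import os
import re

# Hypothetical reference format: secret://NAME
SECRET_REF = re.compile(r"^secret://(?P<name>[A-Za-z0-9_]+)$")


def resolve_secrets(config: dict, lookup=os.environ.get) -> dict:
    """Replace secret://NAME references with values fetched at apply time."""
    resolved = {}
    for key, value in config.items():
        match = SECRET_REF.match(value)
        if match is None:
            resolved[key] = value  # plain configuration, safe to commit
            continue
        secret = lookup(match["name"])
        if secret is None:
            raise KeyError(f"secret {match['name']!r} is not available")
        resolved[key] = secret
    return resolved


# What lives in the repository: a reference, not the value.
committed_config = {"db_user": "orders", "db_password": "secret://ORDERS_DB_PASSWORD"}

# Stand-in for a secret manager; in a pipeline this would be injected at run time.
fake_store = {"ORDERS_DB_PASSWORD": "s3cr3t-from-manager"}

print(resolve_secrets(committed_config, lookup=fake_store.get))
```

Failing loudly on a missing secret is a deliberate choice here: a silent fallback to an empty or default value tends to surface later as a harder-to-diagnose apply failure.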

Policy as code

Codify required guardrails: naming rules, encryption, network policies, quotas, and region restrictions.
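
A minimal sketch of what such guardrails can look like when written as plain functions. The allowed regions, the naming pattern, and the resource shape are all assumptions for illustration; dedicated policy engines offer their own languages, but the structure (a rule takes a resource and returns a violation or nothing) is similar.

```python
import re
from typing import Callable, Optional

Resource = dict
Policy = Callable[[Resource], Optional[str]]

ALLOWED_REGIONS = {"eu-west-1", "eu-central-1"}      # hypothetical allow-list
NAME_PATTERN = re.compile(r"^[a-z]+(-[a-z0-9]+)+$")  # hypothetical naming rule


def require_encryption(resource: Resource) -> Optional[str]:
    if resource.get("type") == "bucket" and not resource.get("encrypted", False):
        return "storage must be encrypted"
    return None


def restrict_region(resource: Resource) -> Optional[str]:
    region = resource.get("region")
    if region not in ALLOWED_REGIONS:
        return f"region {region!r} is not allowed"
    return None


def enforce_naming(resource: Resource) -> Optional[str]:
    if not NAME_PATTERN.match(resource.get("name", "")):
        return "name does not follow the naming rule"
    return None


def evaluate(resources: list, policies: list) -> list:
    """Run every policy against every resource and collect violations."""
    violations = []
    for resource in resources:
        for policy in policies:
            message = policy(resource)
            if message is not None:
                violations.append(f"{resource['name']}: {message}")
    return violations


resources = [
    {"name": "orders-logs", "type": "bucket", "region": "eu-west-1", "encrypted": True},
    {"name": "TempBucket", "type": "bucket", "region": "us-east-1", "encrypted": False},
]
policies = [require_encryption, restrict_region, enforce_naming]

for violation in evaluate(resources, policies):
    print(violation)
```

A pipeline would typically fail the run when the list of violations is non-empty, so the check acts as a gate rather than a report.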

Next

GitOps

GitOps extends IaC through pull-based reconciliation and continuous drift control.

Tool selection

Terraform/OpenTofu

Standardized resource provisioning, multi-cloud scenarios, and a mature provider ecosystem.

Pulumi/CDK

Infrastructure described in programming languages when reusable abstractions and richer control flow are needed.

Kubernetes manifests + controllers

Declarative management of cluster resources and platform APIs in the runtime environment.

IaC operating model

Authoring

Modules, variables, and naming rules define the platform contract. Linting and static policy checks should be introduced here first.

Outcome: A clear pull request with a limited failure radius and a readable infrastructure change set.

Planning

The pipeline produces a plan that shows expected changes to resources, permissions, and network policies. This is the main control point before apply.
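
To show how a plan can serve as a control point, the sketch below summarizes a machine-readable plan and flags destructive changes for extra approval. The input is loosely modeled on the JSON plan representation that Terraform can print with `terraform show -json`; the field names used here (`resource_changes`, `address`, `change.actions`) should be verified against the tool version in use, and the resource addresses and the approval rule are hypothetical.

```python
import json

# Hypothetical excerpt in the shape of a machine-readable plan.
PLAN_JSON = """
{
  "resource_changes": [
    {"address": "aws_s3_bucket.logs", "change": {"actions": ["update"]}},
    {"address": "aws_db_instance.orders", "change": {"actions": ["delete", "create"]}},
    {"address": "aws_iam_role.ci", "change": {"actions": ["create"]}},
    {"address": "aws_vpc.main", "change": {"actions": ["no-op"]}}
  ]
}
"""


def summarize_plan(plan: dict) -> dict:
    """Group planned changes by kind; delete plus create counts as a replace."""
    summary = {"create": [], "update": [], "delete": [], "replace": []}
    for resource_change in plan.get("resource_changes", []):
        actions = resource_change["change"]["actions"]
        address = resource_change["address"]
        if "delete" in actions and "create" in actions:
            summary["replace"].append(address)
        else:
            for action in actions:
                if action in summary:  # skips "no-op" and "read"
                    summary[action].append(address)
    return summary


def needs_extra_approval(summary: dict) -> bool:
    # Assumed rule: anything that destroys a resource needs a second reviewer.
    return bool(summary["delete"] or summary["replace"])


summary = summarize_plan(json.loads(PLAN_JSON))
print(summary)
print("extra approval required:", needs_extra_approval(summary))
```

Treating replacements as destructive matters for stateful resources such as databases, where a replace can mean data loss even though the plan ends with the resource present.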

Outcome: An approved plan reviewed by the platform, security, and owning product team.

Apply

Changes are applied only through an automated pipeline with an audit trail, state locking, and controlled parallelism.

Outcome: A repeatable rollout without manual changes in cloud consoles.

Operate

This stage covers regular infrastructure drift detection, module lifecycle management, credential rotation, and postmortems for failed applies.

Outcome: Stable IaC operations and fewer unplanned platform incidents.

Related topic

Cost Optimization & FinOps

IaC and FinOps meet where cost, quotas, and resource ownership rules are expressed in code.

Environment strategies and ownership

Separate account or subscription per environment

Best fit: Large organizations with strict isolation and security-boundary requirements.

Strengths

  • Clear failure-radius separation between development, staging, and production.
  • Easier enforcement of separate budgets and access policies.

Risks

  • More operating overhead to bootstrap and maintain an architecture baseline in every environment.
  • Requires standardized landing zones and a reusable module library.

Workspace per environment in one account

Best fit: Teams with moderate scale and limited platform engineering capacity.

Strengths

  • Faster initial adoption and lower early operating cost.
  • A simpler shared pipeline for common service templates.

Risks

  • Weaker isolation and higher risk of accidental cross-environment changes.
  • Requires strict discipline around naming, state, and configuration boundaries.

Stacks owned by domain teams

Best fit: Organizations with a platform team and domain-oriented product teams.

Strengths

  • Teams own the lifecycle of their infrastructure and can ship changes faster.
  • The platform team can focus on reusable platform modules and guardrails.

Risks

  • Without architecture governance, quality standards start diverging across domains.
  • Requires a central module catalog and a shared policy model.

Common anti-patterns

One global state for the whole platform

Problem: A single state file becomes a bottleneck: lock contention, long applies, and a large failure radius when something goes wrong.

Fix: Split state by domains and environments, and make dependencies between stacks explicit.
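
Once state is split, the dependencies between stacks have to be written down somewhere so that applies run in a safe order. A minimal sketch, assuming hypothetical stack names and a simple mapping from each stack to the stacks it depends on, using the standard-library `graphlib` module (Python 3.9+):

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical stacks, each with its own state; values are the stacks they depend on.
STACK_DEPENDENCIES = {
    "network/prod": set(),
    "database/prod": {"network/prod"},
    "orders-service/prod": {"network/prod", "database/prod"},
}


def apply_order(dependencies: dict) -> list:
    """Return stacks in an order where every dependency is applied first."""
    return list(TopologicalSorter(dependencies).static_order())


print(apply_order(STACK_DEPENDENCIES))

# A circular dependency between stacks is reported instead of silently ordered.
try:
    apply_order({"a": {"b"}, "b": {"a"}})
except CycleError as error:
    print("cycle detected:", error.args[1])
```

Making the graph explicit also exposes hidden coupling: a stack that depends on nearly everything is a candidate for being split or for moving shared outputs into a smaller foundational stack.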

Manual fixes in cloud consoles

Problem: Out-of-band console changes create infrastructure drift and make the next apply unpredictable.

Fix: After emergency mitigation, backport the change into the infrastructure repository through a pull request.

Secrets stored in the repository

Problem: Secrets in variable files and manifests quickly leak into commit history and CI backups.

Fix: Use a secret manager, short-lived credentials, and dynamic injection in the pipeline.

Applying changes from a local machine

Problem: Local apply bypasses the audit trail, increases the risk of tool and provider version skew between machines, and hurts reproducibility.

Fix: Allow apply only from centralized CI/CD runners with policy checks.
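
A small guard at the start of an apply wrapper can make the intent explicit. The sketch assumes the `CI` environment variable, which several CI systems set to `true`, and a hypothetical `PIPELINE_RUN_ID` variable used to tie the run to the audit trail. It is a convenience check only: the real enforcement is that only pipeline runners hold credentials able to change production.

```python
import os


def assert_ci_runner(environ=None) -> None:
    """Refuse to continue unless the process looks like a pipeline run."""
    env = os.environ if environ is None else environ
    if env.get("CI", "").lower() != "true":
        raise PermissionError("apply is allowed only from the CI/CD pipeline")
    if not env.get("PIPELINE_RUN_ID"):  # hypothetical variable name
        raise PermissionError("missing pipeline run id for the audit trail")


# Simulated environments for illustration.
assert_ci_runner({"CI": "true", "PIPELINE_RUN_ID": "run-1042"})  # passes silently

try:
    assert_ci_runner({"USER": "alice"})  # looks like a laptop
except PermissionError as error:
    print("blocked:", error)
```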

Practices that work

  • A versioned module library with backward-compatible interfaces.
  • Mandatory policy checks for encryption, tags, network boundaries, and least-privilege IAM.
  • Preview environments for risky platform changes.
  • Nightly infrastructure drift detection and automatically created remediation tasks.
  • A single ownership catalog for modules, state backends, and runtime operations.
  • Progressive apply for critical production resources.

Adoption roadmap

0-30 days

IaC platform baseline

Set up state backend, locking, versioning, repository structure, and shared naming and tagging rules.

30-60 days

Policy and security

Introduce policy as code, misconfiguration scanners, secret management, and a mandatory review workflow.

60-90 days

Delivery stabilization

Standardize the plan/apply pipeline, clarify responsibility boundaries, and add runbooks for rollback and state recovery.

90-120 days

Domain scaling

Onboard domain teams to shared modules, maturity metrics, and a regular infrastructure-drift governance loop.

Security

Supply Chain Security

The IaC pipeline must be part of the trust chain: signatures, artifact provenance, and dependency control.

IaC maturity metrics

Infrastructure change lead time

Target: < 1 day for standard changes

Shows whether IaC actually speeds up delivery instead of adding bureaucracy.

Change failure rate

Target: Quarter-over-quarter reduction

Share of infrastructure changes that lead to rollback or incidents.

Drift resolution time

Target: < 24 hours for critical drift

How quickly infrastructure returns to the desired state after manual or emergency deviations.

Policy compliance

Target: >= 95% successful policy checks

How consistently teams follow mandatory guardrails.

Module reuse ratio

Target: > 60%

Share of infrastructure provisioned through standard platform modules.
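
Several of these metrics are simple ratios over a log of changes. The sketch below assumes a hypothetical `Change` record with three boolean fields; it approximates module reuse per change rather than per provisioned resource, which is a simplification of the definition above.

```python
from dataclasses import dataclass


@dataclass
class Change:
    failed: bool                # led to a rollback or an incident
    policy_passed: bool         # all mandatory policy checks succeeded
    used_platform_module: bool  # provisioned through a standard platform module


def ratio(count: int, total: int) -> float:
    return count / total if total else 0.0


def maturity_metrics(changes: list) -> dict:
    total = len(changes)
    return {
        "change_failure_rate": ratio(sum(c.failed for c in changes), total),
        "policy_compliance": ratio(sum(c.policy_passed for c in changes), total),
        "module_reuse_ratio": ratio(sum(c.used_platform_module for c in changes), total),
    }


# Hypothetical change log for one reporting period.
changes = [
    Change(failed=False, policy_passed=True, used_platform_module=True),
    Change(failed=True, policy_passed=True, used_platform_module=False),
    Change(failed=False, policy_passed=False, used_platform_module=True),
    Change(failed=False, policy_passed=True, used_platform_module=False),
]

metrics = maturity_metrics(changes)
print(metrics)
print("policy target met:", metrics["policy_compliance"] >= 0.95)
print("reuse target met:", metrics["module_reuse_ratio"] > 0.60)
```

Lead time and drift resolution time need timestamps rather than flags, so they would be computed from pipeline and ticket data instead of this record.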

Practical checklist

  • There is one plan/apply workflow with mandatory review and an audit trail.
  • Critical changes pass policy checks before merge and apply.
  • The state backend is protected, versioned, and has a backup and restore runbook.
  • Infrastructure drift detection runs regularly across all key environments.
  • There is a modular decomposition strategy and clear team ownership.
