System Design Space

Updated: April 24, 2026 at 2:48 PM

Interplanetary Distributed Computing System

Difficulty: medium

Classic task: autonomous nodes, contact windows, store-and-forward delivery, and state convergence under extreme latency.

An interplanetary distributed system breaks almost every familiar network assumption: latency is measured in minutes or hours, communication windows are rare, and disconnection is the normal operating mode rather than the exception.

The case helps you design local autonomy, store-and-forward delivery, deterministic state reconciliation, and planning around rare contact windows and tight resource budgets.

For interviews and architecture discussions, it is useful because it forces you to revisit hidden assumptions about time, coordination, and control in an environment where confirmation arrives far too late.

Contact Windows

The channel is available rarely and briefly, so command dispatch, acknowledgement, and result upload must be designed around those windows.

Local Autonomy

The remote segment has to make safe decisions without the center, or every lost link turns into mission downtime.

Deterministic Reconciliation

After connectivity returns, both sides need to merge updates in the same way or divergence keeps growing after every exchange.

Safety Priority

Emergency stop, safety commands, and manual override flows must outrank routine work by explicit policy.

The interesting part of this design is not faster RPC but building autonomous segments that can keep operating safely until the next contact window.

Source

Hacking the System Design Interview

Chapter 15 frames this problem around rare communication windows, autonomy, and safe state reconciliation.

Read the review

Where this pattern appears

  • Space and robotics missions: long autonomous execution between uplink windows.
  • Defense and critical field segments: nodes that can stay isolated for long periods.
  • Remote maritime and industrial fleets: operations that continue while central connectivity is unstable.
  • Disaster-response systems: safe behavior in environments where communication comes back only in bursts.

Documentary

Local First (short summary)

Useful background on why local autonomy matters whenever reliable connectivity cannot be assumed.

Read the review

Functional requirements

Command API

  • POST /commands - build a command package for a remote segment
  • GET /commands/:id/status - inspect delivery and execution status
  • POST /sync/windows/:id - upload accumulated state changes during a specific communication window
  • POST /reconcile - merge conflicting updates after connectivity is restored

Mission and operations capabilities

  • Local planning and execution during complete disconnection from central control
  • Store-and-forward delivery with deduplication and bounded retries
  • Explicit priority classes for emergency stop, safety override, and routine tasks
  • Traceable decision history showing which state and policy led to a local action
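
The explicit priority classes listed above can be sketched as a small priority queue. This is a minimal illustration, not the source's implementation: the names `Priority`, `QueuedCommand`, and `CommandQueue` are hypothetical, and the numeric ordering of the classes is an assumption.

```python
import heapq
from dataclasses import dataclass, field
from enum import IntEnum
from itertools import count


class Priority(IntEnum):
    # Assumption: a lower value is served first.
    EMERGENCY_STOP = 0
    SAFETY_OVERRIDE = 1
    ROUTINE = 2


@dataclass(order=True)
class QueuedCommand:
    priority: Priority
    seq: int                               # arrival order, keeps ties FIFO
    command_id: str = field(compare=False)


class CommandQueue:
    """Hypothetical edge-side queue: safety work always leaves first."""

    def __init__(self) -> None:
        self._heap: list[QueuedCommand] = []
        self._seq = count()

    def push(self, command_id: str, priority: Priority) -> None:
        item = QueuedCommand(priority, next(self._seq), command_id)
        heapq.heappush(self._heap, item)

    def pop(self) -> str:
        return heapq.heappop(self._heap).command_id

    def __len__(self) -> int:
        return len(self._heap)
```

Because the priority class is the first sort key, an emergency stop that arrives after a backlog of routine tasks is still executed before any of them; the sequence number only orders commands within the same class.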

Non-functional requirements

Requirement | Target | Why it matters
Inter-segment latency | Minutes to hours | The channel is physically constrained by the environment
Autonomy while disconnected | 24-72 hours and beyond | The remote segment cannot pause the mission while waiting for Earth
Delivery guarantee | At-least-once with duplicate protection | Critical commands cannot be lost, but retries must stay safe
State convergence | After each communication window | Center and edge only converge after delayed exchanges arrive
Decision traceability | Full command-to-result chain | Incidents may need to be investigated hours or days after the actual event
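
To make the latency target concrete, the one-way signal delay can be estimated from distance and the speed of light. The sketch below is illustrative only: the two Earth-Mars distances are rough approximations chosen as an example and are not taken from the source, and `one_way_delay_min` is a hypothetical helper.

```python
C_KM_PER_S = 299_792.458  # speed of light in vacuum, km/s


def one_way_delay_min(distance_km: float) -> float:
    """One-way propagation delay in minutes, ignoring relay and queueing time."""
    return distance_km / C_KM_PER_S / 60


# Approximate Earth-Mars distances (assumed, illustrative values).
closest_min = one_way_delay_min(55e6)    # about 3 minutes
farthest_min = one_way_delay_min(400e6)  # about 22 minutes

# A command plus its acknowledgement needs at least one round trip.
worst_round_trip_min = 2 * farthest_min
```

Propagation alone already puts a request/acknowledge cycle in the tens of minutes; waiting for the next contact window can stretch that to hours, which is why the design cannot depend on prompt confirmation.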

High-Level Architecture

Theory

Distributed Message Queue

Useful for reasoning about asynchronous delivery, retries, ordering, and delivery semantics in a partitioned network.

Read the review

command bundles -> autonomous edge execution -> reconciliation during contact windows

This topology separates command dispatch, autonomous execution, and the later synchronization loop.

  • Mission Control: global operations
  • Policy Engine: priority + TTL
  • Relay Network: store-and-forward
  • Orbital Gateway: window ingress
  • Edge Cluster: isolated domain
  • Local Planner: task sequencing
  • Execution Workers: idempotent runs
  • Local Event Log: append-only
  • Result Bundler: delta packaging
  • Sync Uplink: window transfer
  • Conflict Resolver: merge rules
  • Archive Store: canonical timeline

The architecture separates command dispatch, local execution, and the return synchronization loop. That keeps delivery behavior explicit, avoids hidden assumptions about always-on connectivity, and makes it clear when reconciliation can safely happen.

Write and Read Paths

How command bundles are written and how results and state updates return under extreme latency.

Write path: mission control assembles a command bundle, transfers it through the relay network, and the remote segment persists it in a local log.

  • Layer 1, Command Bundle (mission control): Mission control prepares a command package with priority, TTL, and safety policy.
  • Layer 2, Policy Layer (validate and sign): The policy layer validates the bundle and signs it before transfer.
  • Layer 3, Relay Network (store-and-forward): Commands move through a delay-tolerant network with retries and duplicate protection.
  • Layer 4, Edge Queue (remote ingest): The orbital gateway receives the bundle and pushes it into the local execution queue.
  • Layer 5, Local Event Log (durable append): The command is persisted in a durable log for autonomous execution and safe replay.

Write path checkpoints

  • Commands should include an idempotency key, priority class, and TTL.
  • Store-and-forward delivery is mandatory because transfer can take minutes or hours.
  • A local event log is required for safe replay after failures.
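
The checkpoints above can be sketched as a command bundle plus an edge-side ingest step. This is a minimal sketch under stated assumptions: `CommandBundle`, `EdgeLog`, the field names, and the returned status strings are hypothetical; the log is held in memory where a real one would be durable storage; and the TTL check assumes the edge clock is roughly comparable to mission-control time.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CommandBundle:
    idempotency_key: str   # stable across retransmissions of the same command
    priority: str          # e.g. "emergency_stop", "safety_override", "routine"
    ttl_s: int             # seconds the command stays valid after issue
    issued_at_s: int       # mission-control timestamp, seconds
    payload: dict


class EdgeLog:
    """Append-only log with duplicate protection (in-memory stand-in)."""

    def __init__(self) -> None:
        self.entries: list[str] = []
        self._seen: set[str] = set()

    def ingest(self, bundle: CommandBundle, now_s: int) -> str:
        if bundle.idempotency_key in self._seen:
            return "duplicate"            # retry of a command already logged
        if now_s > bundle.issued_at_s + bundle.ttl_s:
            return "rejected"             # arrived after its TTL expired
        # Serialize deterministically so replay reads exactly what was logged.
        self.entries.append(json.dumps(asdict(bundle), sort_keys=True))
        self._seen.add(bundle.idempotency_key)
        return "queued"
```

Deduplicating on the idempotency key at ingest is what lets the relay network retry freely: a second copy of the same bundle changes nothing in the log.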

The write path prepares commands for eventual transfer rather than immediate execution. The read path is anchored in local state, and results only travel back when the next communication window opens.

State reconciliation and resilience

Deeper dive

Clock Synchronization

Useful background for logical time, event ordering, and conflict handling in asynchronous replication.

Read the review

Conflict-resolution model

When both sides can keep making progress while disconnected, merge behavior must be formalized before the system ever ships.

    dominant_update = max_by(priority, logical_time)
    if same_priority: merge_by_domain_rules()
    ack_state = applied | queued | rejected

  • Safety priority: emergency and safety commands outrank routine work.
  • Idempotent apply: retries must not corrupt state.
  • Deterministic merge: both sides need the same result after every sync window.
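
One way to read the rules above is as a per-key merge that ranks updates by priority first. The sketch below is an assumption-laden illustration rather than the source's algorithm: `Update`, `rank`, and `merge` are hypothetical names, and the tie-break by logical time and then by origin id stands in for the unspecified domain rules. It also assumes one origin never emits two different updates for the same key at the same logical time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    key: str
    value: str
    priority: int       # higher outranks lower (assumed encoding)
    logical_time: int   # e.g. a Lamport timestamp
    origin: str         # segment id, used only as the final tie-break


def rank(u: Update) -> tuple[int, int, str]:
    # Total order: priority, then logical time, then origin id.
    return (u.priority, u.logical_time, u.origin)


def merge(local: list[Update], remote: list[Update]) -> dict[str, Update]:
    """Pick one winner per key; the result is independent of input order."""
    winners: dict[str, Update] = {}
    for u in local + remote:
        current = winners.get(u.key)
        if current is None or rank(u) > rank(current):
            winners[u.key] = u
    return winners
```

Because `rank` defines a total order, center and edge compute the same winners no matter which side's updates are listed first, which is the property the deterministic-merge bullet asks for.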

Resilience loops

Reliability comes from safe retries, explicit checkpoints, and local safety behavior rather than fast acknowledgements.

  • Replay and retransmission: critical bundles are retried until a valid sync opportunity appears.
  • Checkpointing: local progress is persisted so execution can resume after reboot.
  • Local fallback rules: the remote segment must remain safe even without central guidance.
  • Post-sync audit: tracing and archive history explain why two segments diverged and how they converged again.
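
The replay and retransmission loop can be sketched as an outbox that resends unacknowledged bundles at each contact window. This is a minimal sketch: `Outbox`, the per-window `capacity`, the `max_attempts` bound, and the `dead` list for bundles that exhaust their attempts are all assumptions introduced for illustration.

```python
from collections import OrderedDict


class Outbox:
    """Hypothetical sender-side buffer for store-and-forward retries."""

    def __init__(self, max_attempts: int) -> None:
        self.max_attempts = max_attempts
        self.pending: OrderedDict[str, int] = OrderedDict()  # bundle id -> attempts
        self.dead: list[str] = []                            # gave up, needs operator review

    def enqueue(self, bundle_id: str) -> None:
        self.pending.setdefault(bundle_id, 0)

    def on_window(self, capacity: int) -> list[str]:
        """Return the bundle ids to transmit in this window, oldest first."""
        batch: list[str] = []
        for bundle_id in list(self.pending):
            if len(batch) == capacity:
                break
            if self.pending[bundle_id] >= self.max_attempts:
                del self.pending[bundle_id]
                self.dead.append(bundle_id)
                continue
            self.pending[bundle_id] += 1
            batch.append(bundle_id)
        return batch

    def on_ack(self, bundle_id: str) -> None:
        # Acknowledgements may arrive windows later; unknown ids are ignored.
        self.pending.pop(bundle_id, None)
```

A bundle leaves the outbox only when it is acknowledged or exhausts its attempts, so a missed window delays delivery instead of losing the command, and the receiver's duplicate protection absorbs any repeated sends.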

Deeper dive

Leslie Lamport and distributed systems

Happens-before is directly tied to causality: one event can influence another only if a signal could have traveled between them.

Read the review
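
The happens-before relation mentioned in the card above is commonly tracked with a Lamport logical clock, which could supply the logical time used during reconciliation. The sketch below is illustrative; the class and method names are hypothetical, and the source does not state which clock scheme it uses.

```python
class LamportClock:
    """Logical clock: if event A happens-before event B, then time(A) < time(B)."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        # Local event: advance the counter.
        self.time += 1
        return self.time

    def send(self) -> int:
        # Sending is an event; the returned timestamp travels with the message.
        return self.tick()

    def receive(self, msg_time: int) -> int:
        # Jump past both local history and the sender's timestamp.
        self.time = max(self.time, msg_time) + 1
        return self.time
```

The receive step guarantees that a delivered message is always timestamped later than its send, however many minutes or hours the transfer took. The converse does not hold: a smaller timestamp alone does not prove that one event caused another.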

Risks and common mistakes

  • RPC mindset: treating interplanetary links like a normal synchronous API.
  • No disconnected mode: if local autonomy is missing, the mission stalls as soon as the channel disappears.
  • Unspecified merge policy: different segments can settle on different truths.
  • Retries without duplicate protection: repeated delivery starts damaging state instead of preserving reliability.
  • Weak observability: incidents become unexplainable once the evidence arrives much later than the action.

What to make explicit in the interview

  • Which operations are allowed only with confirmed connectivity and which ones must remain fully local.
  • How state convergence works once the channel returns and why that reconciliation model was chosen.
  • Which signals matter most for investigation: sync lag, retransmission backlog, rejected commands, and safety overrides.
  • What trade-off you chose between remote autonomy, state accuracy, and communication cost.
