An interplanetary distributed system breaks almost every familiar network assumption: latency is measured in minutes or hours, communication windows are rare, and disconnection is the normal operating mode rather than the exception.
The case helps you design local autonomy, store-and-forward delivery, deterministic state reconciliation, and planning around rare contact windows and tight resource budgets.
For interviews and architecture discussions, it is useful because it forces you to revisit hidden assumptions about time, coordination, and control in an environment where confirmation arrives far too late.
Contact Windows
The channel is available rarely and briefly, so command dispatch, acknowledgement, and result upload must be designed around those windows.
Local Autonomy
The remote segment has to make safe decisions without the center, or every lost link turns into mission downtime.
Deterministic Reconciliation
After connectivity returns, both sides need to merge updates in the same way or divergence keeps growing after every exchange.
Safety Priority
Emergency stop, safety commands, and manual override flows must outrank routine work by explicit policy.
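One way to combine contact windows with an explicit safety priority is to plan each window against a byte budget, most urgent class first. The sketch below is a minimal illustration, not a mission-grade scheduler: the `Bundle` type, the `PRIORITY` table, and the `plan_window` function are hypothetical names, and the link is assumed to have a fixed usable rate for the whole window.

```python
from dataclasses import dataclass

# Hypothetical priority classes; a lower number means more urgent.
PRIORITY = {"emergency_stop": 0, "safety_override": 1, "routine": 2}


@dataclass(frozen=True)
class Bundle:
    bundle_id: str
    priority_class: str
    size_bytes: int


def plan_window(pending, window_seconds, bytes_per_second):
    """Select bundles for one contact window, most urgent first, within the byte budget."""
    budget = int(window_seconds * bytes_per_second)
    # Sort by priority, then by id, so the same input always yields the same plan.
    ordered = sorted(pending, key=lambda b: (PRIORITY[b.priority_class], b.bundle_id))
    selected, deferred = [], []
    for bundle in ordered:
        if bundle.size_bytes <= budget:
            selected.append(bundle)
            budget -= bundle.size_bytes
        else:
            # Does not fit in what is left of this window; wait for the next one.
            deferred.append(bundle)
    return selected, deferred
```

Because routine work is considered only after every emergency and safety bundle, a short window degrades by deferring routine tasks rather than by delaying a stop command.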
In the Interplanetary Distributed Computing System case, the interesting part of the design is not faster RPC but building autonomous segments that can keep operating safely until the next contact window.
Source
Hacking the System Design Interview
Chapter 15 frames this problem around rare communication windows, autonomy, and safe state reconciliation.
Where this pattern appears
- Space and robotics missions: long autonomous execution between uplink windows.
- Defense and critical field segments: nodes that can stay isolated for long periods.
- Remote maritime and industrial fleets: operations that continue while central connectivity is unstable.
- Disaster-response systems: safe behavior in environments where communication comes back only in bursts.
Documentary
Local First (short summary)
Useful background on why local autonomy matters whenever reliable connectivity cannot be assumed.
Functional requirements
Command API
- POST /commands - build a command package for a remote segment
- GET /commands/:id/status - inspect delivery and execution status
- POST /sync/windows/:id - upload accumulated state changes during a specific communication window
- POST /reconcile - merge conflicting updates after connectivity is restored
Mission and operations capabilities
- Local planning and execution during complete disconnection from central control
- Store-and-forward delivery with deduplication and bounded retries
- Explicit priority classes for emergency stop, safety override, and routine tasks
- Traceable decision history showing which state and policy led to a local action
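The store-and-forward capability with bounded retries can be sketched as a sender-side outbox that is drained once per contact window. This is an illustrative sketch under stated assumptions: `Outbox`, `OutboxEntry`, and the `send` callback are hypothetical, and `send` is assumed to return `True` only once an acknowledgement for that bundle has been observed, which in practice may happen in a later window.

```python
from dataclasses import dataclass, field


@dataclass
class OutboxEntry:
    bundle_id: str
    payload: bytes
    attempts: int = 0


@dataclass
class Outbox:
    max_attempts: int = 5
    pending: dict = field(default_factory=dict)
    dead_letter: list = field(default_factory=list)

    def enqueue(self, bundle_id, payload):
        # Enqueueing an id that is already pending is a no-op (sender-side dedup).
        self.pending.setdefault(bundle_id, OutboxEntry(bundle_id, payload))

    def on_window(self, send):
        """Attempt every pending bundle once; `send(entry)` returns True when acknowledged."""
        for bundle_id in list(self.pending):
            entry = self.pending[bundle_id]
            entry.attempts += 1
            if send(entry):
                del self.pending[bundle_id]
            elif entry.attempts >= self.max_attempts:
                # Bounded retries: park the bundle for operator review instead of looping forever.
                self.dead_letter.append(self.pending.pop(bundle_id))
```

A real system would likely give safety-class bundles a much higher retry bound than routine ones; a single `max_attempts` is used here only to keep the sketch short.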
Non-functional requirements
| Requirement | Target | Why it matters |
|---|---|---|
| Inter-segment latency | Minutes to hours | The channel is physically constrained by the environment |
| Autonomy while disconnected | 24-72 hours and beyond | The remote segment cannot pause the mission while waiting for Earth |
| Delivery guarantee | At-least-once with duplicate protection | Critical commands cannot be lost, but retries must stay safe |
| State convergence | After each communication window | Center and edge only converge after delayed exchanges arrive |
| Decision traceability | Full command-to-result chain | Incidents may need to be investigated hours or days after the actual event |
High-Level Architecture
Theory
Distributed Message Queue
Useful for reasoning about asynchronous delivery, retries, ordering, and delivery semantics in a partitioned network.
High-Level Architecture
command bundles -> autonomous edge execution -> reconciliation during contact windows
The architecture separates command dispatch, local execution, and the return synchronization loop. That keeps delivery behavior explicit, avoids hidden assumptions about always-on connectivity, and makes it clear when reconciliation can safely happen.
Write and Read Paths
How command bundles are written and how results and state updates return under extreme latency.
Write path: mission control assembles a command bundle, transfers it through the relay network, and the remote segment persists it in a local log.
Command Bundle
Layer 1: mission control
Mission control prepares a command package with priority, TTL, and safety policy.
Policy Layer
Layer 2: validate and sign
The policy layer validates the bundle and signs it before transfer.
Relay Network
Layer 3: store-and-forward
Commands move through a delay-tolerant network with retries and duplicate protection.
Edge Queue
Layer 4: remote ingest
The orbital gateway receives the bundle and pushes it into the local execution queue.
Local Event Log
Layer 5: durable append
The command is persisted in a durable log for autonomous execution and safe replay.
Write path checkpoints
- Commands should include an idempotency key, priority class, and TTL.
- Store-and-forward delivery is mandatory because transfer can take minutes or hours.
- A local event log is required for safe replay after failures.
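The checkpoints above can be illustrated from the receiving side: the edge segment rejects duplicates and expired commands before appending anything to its log. This is a minimal sketch with hypothetical names (`Command`, `LocalEventLog`, `ingest`); the in-memory list stands in for a durable append-only file, and the TTL is assumed to have been converted into an absolute expiry timestamp that the edge clock can evaluate.

```python
import json
import time
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Command:
    idempotency_key: str
    priority_class: str
    expires_at: float  # absolute expiry (Unix seconds), derived from the TTL
    action: str


class LocalEventLog:
    def __init__(self):
        self.entries = []       # stand-in for a durable append-only log
        self.seen_keys = set()  # idempotency keys already accepted

    def ingest(self, cmd, now=None):
        """Return 'queued', 'duplicate', or 'expired' for an incoming command."""
        now = time.time() if now is None else now
        if cmd.idempotency_key in self.seen_keys:
            return "duplicate"  # a retry of something already logged
        if now >= cmd.expires_at:
            return "expired"    # arrived too late to be safe to execute
        self.entries.append(json.dumps(asdict(cmd), sort_keys=True))
        self.seen_keys.add(cmd.idempotency_key)
        return "queued"
```

Returning an explicit status for every bundle gives mission control something to report through the status endpoint even when the command never runs.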
The write path prepares commands for eventual transfer rather than immediate execution. The read path is anchored in local state, and results only travel back when the next communication window opens.
State reconciliation and resilience
Deeper dive
Clock Synchronization
Useful background for logical time, event ordering, and conflict handling in asynchronous replication.
Conflict-resolution model
When both sides can keep making progress while disconnected, merge behavior must be formalized before the system ever ships.
    dominant_update = max_by(priority, logical_time)
    if same_priority: merge_by_domain_rules()
    ack_state = applied | queued | rejected
- Safety priority: emergency and safety commands outrank routine work.
- Idempotent apply: retries must not corrupt state.
- Deterministic merge: both sides need the same result after every sync window.
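A deterministic merge can be sketched as a function whose result does not depend on which side runs it or in what order updates arrive. The example below is an assumption-laden illustration: the `Update` type and its fields are hypothetical, higher `priority` is assumed to mean more important, and a fixed tie-break on logical time, origin, and value stands in for the domain-specific rules a real mission would define.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    key: str
    value: str
    priority: int      # assumed convention: higher wins
    logical_time: int
    origin: str        # segment id, used only as a tie-breaker


def rank(update):
    # A total order over updates, so any two segments pick the same winner.
    return (update.priority, update.logical_time, update.origin, update.value)


def merge(local, remote):
    """Return the winning update per key; the result is independent of argument order."""
    merged = {}
    for update in list(local) + list(remote):
        current = merged.get(update.key)
        if current is None or rank(update) > rank(current):
            merged[update.key] = update
    return merged
```

Because `rank` is a total order and the merge keeps the maximum per key, calling `merge(a, b)` on one segment and `merge(b, a)` on the other yields the same state, and re-applying an already merged update changes nothing.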
Resilience loops
Reliability comes from safe retries, explicit checkpoints, and local safety behavior rather than fast acknowledgements.
- Replay and retransmission: critical bundles are retried until a valid sync opportunity appears.
- Checkpointing: local progress is persisted so execution can resume after reboot.
- Local fallback rules: the remote segment must remain safe even without central guidance.
- Post-sync audit: tracing and archive history explain why two segments diverged and how they converged again.
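The checkpointing loop can be sketched as a small resume routine: apply each log entry, then persist the index of the last applied entry with an atomic file replace. This is a minimal sketch, assuming a JSON checkpoint file and a hypothetical `apply` callback; the function names are illustrative and not taken from the source.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the checkpoint to a temp file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before the rename
    os.replace(tmp_path, path)


def load_checkpoint(path):
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"last_applied_index": -1}  # nothing applied yet


def resume(log_entries, path, apply):
    """Apply every entry after the last checkpoint, checkpointing after each one."""
    state = load_checkpoint(path)
    for index in range(state["last_applied_index"] + 1, len(log_entries)):
        apply(log_entries[index])
        state["last_applied_index"] = index
        save_checkpoint(path, state)
    return state
```

A reboot between `apply` and `save_checkpoint` replays one entry, which is why the idempotent-apply rule matters: the checkpoint bounds the replay, it does not eliminate it.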
Deeper dive
Leslie Lamport and distributed systems
Happens-before is directly tied to causality: one event can influence another only if a signal could have traveled between them.
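A Lamport logical clock is the simplest way to turn happens-before into a number that segments can compare without synchronized wall clocks. The class below is a standard textbook sketch; the `LamportClock` name and method names are chosen here for illustration.

```python
class LamportClock:
    """Logical clock: if event A happens-before event B, A's timestamp is smaller."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: advance the counter.
        self.time += 1
        return self.time

    def send(self):
        # Sending is itself an event; the returned timestamp travels with the message.
        return self.tick()

    def receive(self, message_time):
        # Jump past everything the sender had seen, then count the receive event.
        self.time = max(self.time, message_time) + 1
        return self.time
```

The guarantee runs in one direction only: a smaller timestamp does not prove that one event influenced another, which is why merge rules still need an explicit tie-break for concurrent updates.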
Risks and common mistakes
- RPC mindset: treating interplanetary links like a normal synchronous API.
- No disconnected mode: if local autonomy is missing, the mission stalls as soon as the channel disappears.
- Unspecified merge policy: different segments can settle on different truths.
- Retries without duplicate protection: repeated delivery starts damaging state instead of preserving reliability.
- Weak observability: incidents become unexplainable once the evidence arrives much later than the action.
What to make explicit in the interview
- Which operations are allowed only with confirmed connectivity and which ones must remain fully local.
- How state convergence works once the channel returns and why that reconciliation model was chosen.
- Which signals matter most for investigation: sync lag, retransmission backlog, rejected commands, and safety overrides.
- What trade-off you chose between remote autonomy, state accuracy, and communication cost.
Related chapters
- Distributed Message Queue - Store-and-forward delivery, retries, and delivery semantics for links with extreme latency.
- Clock Synchronization - Logical time, event ordering, and conflict control under huge network delays.
- Consensus Protocols - Where consensus is truly required and where post-reconnect reconciliation is the better fit.
- Local First (short summary) - Local autonomy and operation without permanent connectivity in systems with rare network access.
- Leslie Lamport and distributed systems - Causality and happens-before as the basis for deterministic state merging.
- Hacking the System Design Interview - The original case and the trade-off framing behind this system design problem.
