Interplanetary Distributed Computing System

An interplanetary distributed system breaks almost every familiar network assumption: latency is measured in minutes or hours, communication windows are rare, and disconnection is the normal operating mode rather than the exception.

Working through this case is practice in designing local autonomy, store-and-forward delivery, and deterministic state reconciliation, and in planning around rare contact windows and tight resource budgets.

For interviews and architecture discussions, it is useful because it forces you to revisit hidden assumptions about time, coordination, and control in an environment where confirmation arrives far too late.

Contact Windows

The channel is available rarely and briefly, so command dispatch, acknowledgement, and result upload must be designed around those windows.

Local Autonomy

The remote segment has to make safe decisions without the center, or every lost link turns into mission downtime.

Deterministic Reconciliation

After connectivity returns, both sides need to merge updates in the same way or divergence keeps growing after every exchange.

Safety Priority

Emergency stop, safety commands, and manual override flows must outrank routine work by explicit policy.

Designing this system means assuming that a steady connection never exists. Latency runs from minutes to hours, communication windows are rare, and disconnection is the working assumption, not the failure case, so a faster RPC layer does not help. The real work is building remote segments that keep the mission running safely until the next contact window opens, because any acknowledgement from the center arrives too late to act on.

Source

Hacking the System Design Interview

Chapter 15 frames this problem around rare communication windows, autonomy, and safe state reconciliation.


Where this pattern appears

Space and robotics missions: long autonomous execution between uplink windows.
Defense and critical field segments: nodes that can stay isolated for long periods.
Remote maritime and industrial fleets: operations that continue while central connectivity is unstable.
Disaster-response systems: safe behavior in environments where communication comes back only in bursts.

Documentary

Local First (short summary)

Useful background on why local autonomy matters whenever reliable connectivity cannot be assumed.


Functional requirements

Command API

POST /commands - build a command package for a remote segment
GET /commands/:id/status - inspect delivery and execution status
POST /sync/windows/:id - upload accumulated state changes during a specific communication window
POST /reconcile - merge conflicting updates after connectivity is restored

Mission and operations capabilities

Local planning and execution during complete disconnection from central control
Store-and-forward delivery with deduplication and bounded retries
Explicit priority classes for emergency stop, safety override, and routine tasks
Traceable decision history showing which state and policy led to a local action
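
The store-and-forward, deduplication, bounded-retry, and priority-class capabilities above can be sketched together as a single outbox that is drained during a contact window. This is a minimal sketch under stated assumptions, not the source's design: the `Outbox` class, the `Priority` values, the per-window byte budget, and the `send` callback are all hypothetical names introduced for illustration.

```python
import heapq
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Callable


class Priority(IntEnum):
    # Assumption: a lower value drains first. The classes mirror the list above.
    EMERGENCY_STOP = 0
    SAFETY_OVERRIDE = 1
    ROUTINE = 2


@dataclass(order=True)
class Bundle:
    priority: Priority
    seq: int                                 # FIFO tie-break inside one class
    bundle_id: str = field(compare=False)    # doubles as the dedup key
    size_bytes: int = field(compare=False)
    attempts: int = field(default=0, compare=False)


class Outbox:
    """Holds bundles between contact windows and drains them by priority."""

    def __init__(self, max_attempts: int = 5) -> None:
        self._heap: list[Bundle] = []
        self._known_ids: set[str] = set()
        self._seq = 0
        self.max_attempts = max_attempts
        self.dead_letter: list[Bundle] = []   # bundles that exhausted retries

    def enqueue(self, bundle_id: str, priority: Priority, size_bytes: int) -> bool:
        # Deduplication: a bundle id is accepted at most once.
        if bundle_id in self._known_ids:
            return False
        self._known_ids.add(bundle_id)
        heapq.heappush(self._heap, Bundle(priority, self._seq, bundle_id, size_bytes))
        self._seq += 1
        return True

    def drain(self, budget_bytes: int, send: Callable[[Bundle], bool]) -> list[str]:
        """Send bundles in priority order until the window budget is spent."""
        delivered: list[str] = []
        deferred: list[Bundle] = []
        while self._heap:
            bundle = heapq.heappop(self._heap)
            if bundle.size_bytes > budget_bytes:
                deferred.append(bundle)       # does not fit, keep for a later window
                continue
            budget_bytes -= bundle.size_bytes  # a failed send still costs bandwidth
            bundle.attempts += 1
            if send(bundle):
                delivered.append(bundle.bundle_id)
            elif bundle.attempts >= self.max_attempts:
                self.dead_letter.append(bundle)   # bounded retries
            else:
                deferred.append(bundle)
        for bundle in deferred:
            heapq.heappush(self._heap, bundle)
        return delivered
```

One design choice worth stating aloud: this sketch lets a small routine bundle pass a larger one that no longer fits in the window. A stricter policy would stop draining at the first bundle that does not fit, so that nothing ever overtakes a higher priority class.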

Non-functional requirements

Requirement	Target	Why it matters
Inter-segment latency	Minutes to hours	The channel is physically constrained by the environment
Autonomy while disconnected	24-72 hours and beyond	The remote segment cannot pause the mission while waiting for Earth
Delivery guarantee	At-least-once with duplicate protection	Critical commands cannot be lost, but retries must stay safe
State convergence	After each communication window	Center and edge only converge after delayed exchanges arrive
Decision traceability	Full command-to-result chain	Incidents may need to be investigated hours or days after the actual event
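
The "minutes to hours" latency target follows directly from distance divided by the speed of light. The sketch below works that arithmetic for a few illustrative distances. The function names and the chosen distances are assumptions for this example. Only the two physical constants are fixed values.

```python
SPEED_OF_LIGHT_M_S = 299_792_458        # exact by definition
ASTRONOMICAL_UNIT_M = 149_597_870_700   # exact by IAU definition


def one_way_delay_s(distance_au: float) -> float:
    """Minimum one-way signal delay in seconds for a distance in AU."""
    return distance_au * ASTRONOMICAL_UNIT_M / SPEED_OF_LIGHT_M_S


def round_trip_delay_s(distance_au: float) -> float:
    """Earliest possible command-to-acknowledgement time, ignoring processing."""
    return 2 * one_way_delay_s(distance_au)


# Illustrative distances only (assumption): inner system, mid system, outer system.
for distance_au in (1.0, 5.0, 30.0):
    one_way_min = one_way_delay_s(distance_au) / 60
    round_trip_min = round_trip_delay_s(distance_au) / 60
    print(f"{distance_au:>5.1f} AU: one-way {one_way_min:.1f} min, "
          f"round trip {round_trip_min:.1f} min")
```

At 1 AU the one-way delay is already a little over eight minutes, and at 30 AU it is on the order of four hours. This is a physical lower bound that no protocol optimization can reduce, and it is why acknowledgement-driven control loops do not fit this environment.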

High-Level Architecture

Theory

Distributed Message Queue

Useful for reasoning about asynchronous delivery, retries, ordering, and delivery semantics in a partitioned network.



command bundles -> autonomous edge execution -> reconciliation during contact windows

Earth control plane

Mission Control: global operations
Policy Engine: priority + TTL
Relay Network: store-and-forward

Planetary edge plane

Orbital Gateway: window ingress
Edge Cluster: isolated domain
Local Planner: task sequencing
Execution Workers: idempotent runs
Local Event Log: append-only

Synchronization and reconcile plane

Result Bundler: delta packaging
Sync Uplink: window transfer
Conflict Resolver: merge rules
Archive Store: canonical timeline

Data flows

Control -> policy -> relay -> gateway: command bundles and contact-window transfer
Edge cluster -> planner -> workers -> log: autonomous execution without permanent connectivity
Results -> uplink -> reconcile -> archive: state return and eventual convergence


The architecture separates command dispatch, local execution, and the return synchronization loop. That keeps delivery behavior explicit, avoids hidden assumptions about always-on connectivity, and makes it clear when reconciliation can safely happen.

Write and Read Paths

How command bundles are written and how results and state updates return under extreme latency.

Layer 1: Command Bundle. Mission control prepares a command package with priority, TTL, and safety policy.

Write path checkpoints

• Commands should include an idempotency key, priority class, and TTL.
• Store-and-forward delivery is mandatory because transfer can take minutes or hours.
• A local event log is required for safe replay after failures.

The write path prepares commands for eventual transfer rather than immediate execution. The read path is anchored in local state, and results only travel back when the next communication window opens.
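
The write-path checkpoints can be made concrete with a small sketch of a command bundle and an edge-side executor that applies it at most once. This is an illustration under assumptions, not the source's implementation: `CommandBundle`, `EdgeExecutor`, the field names, and the outcome strings are hypothetical. The TTL check also assumes that both sides share a roughly aligned mission clock.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class CommandBundle:
    idempotency_key: str
    priority: str        # e.g. "emergency", "safety", "routine" (assumed labels)
    issued_at: float     # sender clock, seconds
    ttl_s: float         # how long the command stays valid after issue
    payload: dict

    def expired(self, now: float) -> bool:
        return now > self.issued_at + self.ttl_s


class EdgeExecutor:
    """Applies each bundle at most once and records every outcome in a log."""

    def __init__(self) -> None:
        self.event_log: list[str] = []     # append-only, one JSON record per line
        self._applied: set[str] = set()

    def receive(self, bundle: CommandBundle, now: float,
                execute: Callable[[dict], None]) -> str:
        if bundle.idempotency_key in self._applied:
            return "duplicate"             # retransmission, safe to ignore
        if bundle.expired(now):
            self._append("rejected", bundle, now)
            return "rejected"              # arrived after its TTL
        execute(bundle.payload)
        self._applied.add(bundle.idempotency_key)
        self._append("applied", bundle, now)
        return "applied"

    def _append(self, outcome: str, bundle: CommandBundle, now: float) -> None:
        record = {"outcome": outcome, "at": now, **asdict(bundle)}
        self.event_log.append(json.dumps(record, sort_keys=True))

    @classmethod
    def from_log(cls, log_lines: Iterable[str]) -> "EdgeExecutor":
        """Rebuild the dedup set from the event log after a restart."""
        executor = cls()
        for line in log_lines:
            executor.event_log.append(line)
            record = json.loads(line)
            if record["outcome"] == "applied":
                executor._applied.add(record["idempotency_key"])
        return executor
```

A crash between `execute` and the log append would still cause a rerun after restart. The dedup set therefore complements idempotent worker actions and does not replace them.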

State reconciliation and resilience

Deeper dive

Clock Synchronization

Useful background for logical time, event ordering, and conflict handling in asynchronous replication.


Conflict-resolution model

Both sides keep changing state while the link is down, and each learns of the other's changes hours late. The merge rule has to be fixed up front. Skip it and the first contact window leaves center and edge disagreeing about history, with the tie broken by chance instead of policy.

dominant_update = max_by(priority, logical_time)
if same_priority: merge_by_domain_rules()
ack_state = applied | queued | rejected

Safety priority: emergency and safety commands outrank routine work.
Idempotent apply: retries must not corrupt state.
Deterministic merge: both sides need the same result after every sync window.
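
The merge rule above can be sketched as a pure function that returns the same winner regardless of argument order, which is the property that lets center and edge converge independently. This is one possible reading of the pseudocode, under stated assumptions: a higher `priority` number outranks a lower one, the default domain rule is last-writer-wins on logical time with the origin name as a final tie-break, and each `(origin, logical_time)` pair is unique. `Update`, `DOMAIN_RULES`, `merge`, and `reconcile` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Update:
    key: str
    value: object
    priority: int        # assumption: higher number outranks lower
    logical_time: int
    origin: str          # e.g. "center" or "edge"


def last_writer_wins(a: Update, b: Update) -> Update:
    # The origin breaks ties so both sides pick the same winner.
    return max(a, b, key=lambda u: (u.logical_time, u.origin))


# Per-key overrides for the same-priority case. Empty here, so the default applies.
DOMAIN_RULES: dict[str, Callable[[Update, Update], Update]] = {}


def merge(a: Update, b: Update) -> Update:
    """Pick the dominant update. The result does not depend on argument order."""
    if a.priority != b.priority:
        return a if a.priority > b.priority else b
    rule = DOMAIN_RULES.get(a.key, last_writer_wins)
    return rule(a, b)


def reconcile(local: dict[str, Update], remote: dict[str, Update]) -> dict[str, Update]:
    """Merge two keyed state maps after a contact window."""
    merged = dict(local)
    for key, update in remote.items():
        merged[key] = merge(merged[key], update) if key in merged else update
    return merged


stop = Update("drive", "halt", priority=2, logical_time=5, origin="center")
move = Update("drive", "forward", priority=0, logical_time=9, origin="edge")
# The safety command wins even though the routine update is newer.
assert merge(stop, move) == merge(move, stop) == stop
```

Any rule registered in `DOMAIN_RULES` has to be symmetric in its arguments as well. Otherwise the two sides can pick different winners from the same pair of updates.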

Resilience loops

Reliability comes from safe retries, explicit checkpoints, and local safety behavior rather than fast acknowledgements.

Replay and retransmission: critical bundles are held and retransmitted at each valid sync opportunity until they are acknowledged.
Checkpointing: local progress is persisted so execution can resume after reboot.
Local fallback rules: the remote segment must remain safe even without central guidance.
Post-sync audit: tracing and archive history explain why two segments diverged and how they converged again.
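
The checkpointing loop can be sketched as a plan runner that persists its position after every step and resumes from it after a reboot. This is a minimal sketch under assumptions: the JSON file layout, the `next_step` field, and the function names are hypothetical. The write goes to a temporary file first and is then renamed into place, so a crash during the write leaves the previous checkpoint intact.

```python
import json
import os
import tempfile
from typing import Callable, Sequence


def save_checkpoint(path: str, state: dict) -> None:
    """Write the checkpoint atomically: temp file, fsync, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)   # atomic swap of old and new checkpoint
    except BaseException:
        os.unlink(tmp_path)          # the rename did not happen, drop the temp file
        raise


def load_checkpoint(path: str, default: dict) -> dict:
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default               # first run, nothing persisted yet


def run_plan(steps: Sequence[str], path: str,
             execute: Callable[[str], None]) -> None:
    """Execute steps in order, resuming after the last persisted step."""
    state = load_checkpoint(path, {"next_step": 0})
    for index in range(state["next_step"], len(steps)):
        execute(steps[index])
        save_checkpoint(path, {"next_step": index + 1})
```

A crash after `execute` but before `save_checkpoint` reruns that step on restart, so this gives at-least-once execution per step. That is why the worker actions themselves need to be idempotent.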

Deeper dive

Leslie Lamport and distributed systems

Happens-before is directly tied to causality: one event can influence another only if a signal could have traveled between them.
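
A Lamport clock is one common way to turn happens-before into a number that can be attached to each update, for example as the logical time compared during reconciliation. The sketch below is illustrative only. The class and method names are assumptions, and the source does not say which logical clock it uses.

```python
class LamportClock:
    """Logical clock: if event A happens-before event B, then stamp(A) < stamp(B)."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        # A local event advances the clock by one.
        self.time += 1
        return self.time

    def send(self) -> int:
        # Sending is a local event. The returned stamp travels with the message.
        return self.tick()

    def receive(self, remote_time: int) -> int:
        # Jump past both our own history and the sender's stamp.
        self.time = max(self.time, remote_time) + 1
        return self.time


center = LamportClock()
edge = LamportClock()

# The edge works autonomously while the link is down.
for _ in range(3):
    edge.tick()

stamp = center.send()            # command leaves the center
received = edge.receive(stamp)   # arrives at the next contact window
assert received > stamp          # the receipt is ordered after the send
```

The implication only runs one way. A smaller stamp does not prove that one event caused another, so concurrent updates still need the explicit merge policy.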


Risks and common mistakes

RPC mindset: treating interplanetary links like a normal synchronous API.
No disconnected mode: if local autonomy is missing, the mission stalls as soon as the channel disappears.
Unspecified merge policy: different segments can settle on different truths.
Retries without duplicate protection: repeated delivery starts damaging state instead of preserving reliability.
Weak observability: incidents become unexplainable once the evidence arrives much later than the action.

What to make explicit in the interview

Which operations are allowed only with confirmed connectivity and which ones must remain fully local.
How state convergence works once the channel returns and why that reconciliation model was chosen.
Which signals matter most for investigation: sync lag, retransmission backlog, rejected commands, and safety overrides.
What trade-off you chose between remote autonomy, state accuracy, and communication cost.

References

Cerf, Burleigh et al. - RFC 4838: Delay-Tolerant Networking Architecture (IRTF, 2007)
Burleigh, Fall, Birrane - RFC 9171: Bundle Protocol Version 7 (IETF, 2022)
NASA - Delay/Disruption Tolerant Networking: overview and operational experience

Related chapters

Distributed Message Queue - Store-and-forward delivery, retries, and delivery semantics for links with extreme latency.
Clock Synchronization - Logical time, event ordering, and conflict control under huge network delays.
Consensus Protocols - Where consensus is truly required and where post-reconnect reconciliation is the better fit.
Local First (short summary) - Local autonomy and operation without permanent connectivity in systems with rare network access.
Leslie Lamport and distributed systems - Causality and happens-before as the basis for deterministic state merging.
Hacking the System Design Interview - The original case and the trade-off framing behind this system design problem.