System Design Space

Updated: May 7, 2026 at 6:26 PM

Interservice communication patterns

medium

Synchronous and asynchronous service communication: REST, gRPC, GraphQL, queues, pub/sub, contracts, timeouts, retries, idempotency, and backpressure.

Inter-service communication rarely fails on the happy path. It fails in timeouts, retries, and assumptions each side forgot to make explicit.

In real design work, the chapter shows how to choose synchronous and asynchronous patterns by SLA, acceptable business latency, coupling, and explicit rules for timeout, retry, backoff, and idempotency.

In interviews and engineering discussions, it helps frame backpressure, queue build-up, and partial failure as contract-level design properties rather than accidental implementation details.

Practical value of this chapter

Design in practice

Choose the interaction style by SLA, service coupling, and acceptable business-flow latency.

Decision quality

Encode timeout, retry, backoff, and idempotency in contracts rather than improvised handlers.

Interview articulation

Tie pattern choice to latency, reliability, and delivery-speed outcomes.

Failure framing

Model backpressure and queue growth before they become incidents in production.

Context

Decomposition Strategies

Service boundaries shape both the number and type of interactions between services.

Interservice communication patterns should be chosen by latency budget, operation criticality, and operational constraints rather than by fashion. The goal is predictable system behavior under load and during failures.

Synchronous interaction patterns

HTTP/gRPC request-response

Works well when the caller needs an immediate answer and the latency budget is tight. In production it needs clear timeouts, a retry budget, and an explicit degradation policy.
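One way to make "clear timeouts" and "an explicit degradation policy" concrete is to derive every hop timeout from a single end-to-end deadline. The sketch below is only an illustration: the names `Deadline`, `hopTimeoutMs`, and `planCall`, and all the numbers in the usage note, are assumptions rather than part of any particular framework.

```typescript
// Hypothetical helpers: derive a per-hop timeout from one end-to-end deadline.

interface Deadline {
  /** Absolute time (ms) after which the original caller no longer needs the answer. */
  expiresAtMs: number;
}

/** Remaining end-to-end budget in ms, never negative. */
function remainingMs(deadline: Deadline, nowMs: number): number {
  return Math.max(0, deadline.expiresAtMs - nowMs);
}

/**
 * Timeout for the next hop: the smaller of the hop's own cap and what is left
 * of the end-to-end budget after reserving time for local work.
 * Returns 0 when the budget is already spent.
 */
function hopTimeoutMs(deadline: Deadline, nowMs: number, hopCapMs: number, reserveMs: number): number {
  const left = remainingMs(deadline, nowMs) - reserveMs;
  return Math.max(0, Math.min(hopCapMs, left));
}

type CallPlan =
  | { kind: "call"; timeoutMs: number }
  | { kind: "degrade"; reason: string };

/** Decide up front whether the downstream call is still worth making. */
function planCall(deadline: Deadline, nowMs: number, hopCapMs: number, reserveMs: number): CallPlan {
  const timeoutMs = hopTimeoutMs(deadline, nowMs, hopCapMs, reserveMs);
  if (timeoutMs > 0) {
    return { kind: "call", timeoutMs };
  }
  return { kind: "degrade", reason: "deadline budget exhausted" };
}
```

With a deadline 1000 ms after the request started, a 300 ms hop cap, and a 50 ms reserve, a call planned at 800 ms gets a 150 ms timeout, while a call planned at 960 ms is skipped in favor of the degraded response.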

Aggregator/BFF composition

A dedicated service composes data from several upstream services. It is convenient for UI, but fan-out without caching and parallelism quickly becomes a latency bottleneck.
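The fan-out cost can be approximated with a small model. This is a rough sketch that assumes upstream calls are independent and ignores gateway overhead; the function names and the latencies in the usage note are illustrative.

```typescript
// Illustrative latency model for an aggregator; not a measurement tool.

/** Sequential fan-out waits for every upstream in turn, so latencies add up. */
function sequentialLatencyMs(upstreamMs: number[]): number {
  return upstreamMs.reduce((sum, ms) => sum + ms, 0);
}

/** Parallel fan-out is bounded by the slowest upstream. */
function parallelLatencyMs(upstreamMs: number[]): number {
  return upstreamMs.length === 0 ? 0 : Math.max(...upstreamMs);
}

/**
 * Probability that at least one of n independent upstream calls is slow,
 * when each call is slow with probability p.
 */
function anySlowProbability(n: number, p: number): number {
  return 1 - Math.pow(1 - p, n);
}
```

For upstreams answering in 20, 35, and 50 ms, the sequential path costs 105 ms and the parallel path 50 ms. Parallelism does not remove tail amplification: with five upstreams that each exceed their p95 5% of the time, roughly 23% of aggregated requests see at least one slow call, which is where caching helps.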

Asynchronous interaction patterns

Queue-based async

Producer and consumer are decoupled in time, which helps smooth traffic spikes and background work. It fits command processing when retries and a dead-letter queue must be controlled.
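Controlled retries with a dead-letter queue can be sketched in memory as follows. This is a simplification under stated assumptions: `RetryingQueue` is a hypothetical class, failed messages are requeued immediately without a delay, and a real broker would provide persistence, visibility timeouts, and its own dead-letter configuration.

```typescript
// In-memory sketch of bounded retries with a dead-letter queue.

interface Message<T> {
  id: string;
  body: T;
  attempts: number;
}

class RetryingQueue<T> {
  private readonly main: Message<T>[] = [];
  readonly deadLetters: Message<T>[] = [];
  private readonly maxAttempts: number;

  constructor(maxAttempts: number) {
    this.maxAttempts = maxAttempts;
  }

  publish(id: string, body: T): void {
    this.main.push({ id, body, attempts: 0 });
  }

  /** Processes messages until the main queue is empty; returns ids handled successfully. */
  drain(handler: (body: T) => void): string[] {
    const done: string[] = [];
    let msg = this.main.shift();
    while (msg !== undefined) {
      msg.attempts += 1;
      try {
        handler(msg.body);
        done.push(msg.id);
      } catch {
        if (msg.attempts >= this.maxAttempts) {
          this.deadLetters.push(msg); // give up and park the message for inspection
        } else {
          this.main.push(msg); // requeue at the tail for another attempt
        }
      }
      msg = this.main.shift();
    }
    return done;
  }
}
```

The attempt limit bounds how much work one poison message can consume, and the dead-letter list keeps it available for diagnosis instead of dropping it silently.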

Pub/Sub events

One service publishes an event and multiple subscribers react independently. It improves extensibility and reduces coupling across teams.
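The independence of subscribers is the key property here. A minimal in-process sketch, assuming a hypothetical `EventBus` class in place of a real broker topic, shows one failing subscriber leaving the others unaffected:

```typescript
// In-process stand-in for a topic with independent subscribers.

type Handler<E> = (event: E) => void;

class EventBus<E> {
  private readonly subscribers = new Map<string, Handler<E>>();

  subscribe(name: string, handler: Handler<E>): void {
    this.subscribers.set(name, handler);
  }

  /**
   * Delivers the event to every subscriber. A failure in one subscriber is
   * recorded but does not stop delivery to the rest.
   * Returns the names of subscribers that failed.
   */
  publish(event: E): string[] {
    const failed: string[] = [];
    this.subscribers.forEach((handler, name) => {
      try {
        handler(event);
      } catch {
        failed.push(name);
      }
    });
    return failed;
  }
}
```

The publisher only knows the event shape, not who consumes it, so a new subscriber can be added without touching the publishing service.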

Event-carried state transfer

The event carries enough context to reduce synchronous callback requests between services. The cost is stricter versioning discipline and careful payload sizing.
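As a sketch of the consumer side, a service can keep a local projection built from events and read from it instead of calling the owning service. The event shape `CustomerChanged`, its fields, and the per-customer `version` counter are assumptions made for this example.

```typescript
// Hypothetical event that carries the state a consumer needs.
interface CustomerChanged {
  customerId: string;
  version: number; // assumed to increase monotonically per customer
  email: string;
  tier: string;
}

class CustomerProjection {
  private readonly rows = new Map<string, CustomerChanged>();

  /**
   * Applies the event unless an equal or newer version is already stored,
   * which protects the projection from duplicates and out-of-order delivery.
   * Returns true when the event was applied.
   */
  apply(event: CustomerChanged): boolean {
    const current = this.rows.get(event.customerId);
    if (current !== undefined && current.version >= event.version) {
      return false;
    }
    this.rows.set(event.customerId, event);
    return true;
  }

  /** Local read: no synchronous call back to the owning service. */
  get(customerId: string): CustomerChanged | undefined {
    return this.rows.get(customerId);
  }
}
```

The version check is where the stricter versioning discipline shows up: every field the consumer relies on becomes part of the event contract.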

gRPC, REST, and GraphQL: configuration examples and mini benchmark

Single region, internal VPC, TLS enabled, payload around 1 KiB.

Service node: 4 vCPU / 8 GB RAM, 300 concurrent virtual users.

Reads from in-memory cache, no external DB and no heavy business logic.

Numbers below are a lab comparison point, not a universal law.

REST (HTTP/1.1 + JSON)

Simple integration and interoperability with external clients

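# NGINX upstream + keep-alive
upstream user_api {
  server user-api:8080;
  keepalive 256;  # idle keep-alive connections cached per worker process
}

server {
  listen 443 ssl http2;  # clients may use HTTP/2; the upstream hop below stays HTTP/1.1
  # ssl_certificate / ssl_certificate_key are omitted in this fragment
  location /v1/ {
    proxy_http_version 1.1;          # required for upstream keep-alive
    proxy_set_header Connection "";  # clear the header so connections are reused
    proxy_set_header X-Request-Id $request_id;  # correlation id for tracing
    proxy_read_timeout 300ms;        # max gap between two reads, not a total deadline
    proxy_connect_timeout 80ms;
    proxy_pass http://user_api;
  }
}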

gRPC (HTTP/2 + Protobuf)

Lower protocol overhead and a strict IDL contract

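// service.proto
syntax = "proto3";
package catalog.v1;

service CatalogService {
  rpc GetItem(GetItemRequest) returns (GetItemResponse);
}

// The message bodies are not part of the original fragment;
// the fields below are placeholders so that the file compiles.
message GetItemRequest {
  string item_id = 1;
}

message GetItemResponse {
  string item_id = 1;
  string title = 2;
}

# envoy cluster (fragment, YAML)
clusters:
  - name: catalog_grpc
    connect_timeout: 0.08s
    type: STRICT_DNS
    # Newer Envoy releases deprecate this field in favor of
    # typed_extension_protocol_options.
    http2_protocol_options:
      max_concurrent_streams: 512
    load_assignment:
      cluster_name: catalog_grpc
      endpoints: ...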

GraphQL (BFF/Gateway)

Client-driven contract and composition across multiple domains

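// Imports assume Apollo Server 4 package names.
import { ApolloServer } from '@apollo/server';
import responseCachePlugin from '@apollo/server-plugin-response-cache';

// `schema` and `redisCache` are assumed to be defined elsewhere
// (an executable schema and a key-value cache implementation).
const server = new ApolloServer({
  schema,
  persistedQueries: {
    cache: redisCache, // shared store for persisted queries
  },
  plugins: [responseCachePlugin()],
});

// resolver guardrails
const resolvers = {
  Query: {
    // DataLoader batches and de-duplicates loads within one request
    dashboard: async (_, args, ctx) =>
      ctx.loaders.dashboardByUser.load(args.userId),
  },
};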

| Approach | p50, ms | p95, ms | Throughput | Comment |
|---|---|---|---|---|
| REST (JSON, HTTP/1.1) | 12 | 41 | ~6.1k req/s | JSON serialization overhead and more bytes on the wire. |
| gRPC unary (Protobuf, HTTP/2) | 7 | 24 | ~9.8k req/s | Better CPU and network efficiency with similar business logic. |
| GraphQL gateway (persisted queries + DataLoader) | 15 | 53 | ~4.3k req/s | Great for UI flexibility, but resolver overhead and fan-out risks remain. |

Protobuf schema evolution without surprises

Never reuse field numbers after deletion.

Mark removed fields as `reserved` (both by number and by name).

Add new fields only as optional (nullable) fields whose default value is safe for readers and writers that never set them.

For enums, always keep `*_UNSPECIFIED = 0` and handle unknown values.

Breaking changes (changing a field's type or number, moving existing fields into an existing `oneof`, removing a field that consumers treat as mandatory) require a new contract version.

Before (v1)

syntax = "proto3";
message UserProfile {
  string user_id = 1;
  string email = 2;
  string phone = 3;
}

After (v2, safe evolution)

syntax = "proto3";
message UserProfile {
  string user_id = 1;
  string email = 2;
  reserved 3;
  reserved "phone";
  optional string telegram = 4;
}

| Change | Backward | Forward | Comment |
|---|---|---|---|
| Added a new field | Yes | Yes | Older consumers ignore unknown fields. |
| Removed field + reserved | Conditional | No | If old producers still send it, the new consumer loses that value. |
| Changed field type (int32 -> string) | No | No | Wire format changes and decoding becomes unsafe. |
| Added enum value | Yes | Conditional | Old code needs fallback handling for unknown enum values. |
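The fallback handling that the "Added enum value" case depends on can be sketched as below. The enum `ContactChannel` and the routing names are hypothetical; generated Protobuf code exposes enums differently per language, so treat this as the shape of the logic rather than a specific API.

```typescript
// Hypothetical enum values; in practice they come from generated code.
const ContactChannel = {
  CONTACT_CHANNEL_UNSPECIFIED: 0,
  CONTACT_CHANNEL_EMAIL: 1,
  CONTACT_CHANNEL_TELEGRAM: 2,
} as const;

type Delivery = "send-email" | "send-telegram" | "skip-and-log";

/**
 * Maps a wire value to a delivery path. Both the zero value and any number
 * this build does not know (for example one added by a newer producer)
 * take the safe path instead of throwing.
 */
function chooseDelivery(wireValue: number): Delivery {
  switch (wireValue) {
    case ContactChannel.CONTACT_CHANNEL_EMAIL:
      return "send-email";
    case ContactChannel.CONTACT_CHANNEL_TELEGRAM:
      return "send-telegram";
    case ContactChannel.CONTACT_CHANNEL_UNSPECIFIED:
    default:
      return "skip-and-log";
  }
}
```

An old consumer built this way keeps working when a producer starts sending a fourth channel; it simply does not act on it until it is redeployed with the new value.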

Performance

Performance Engineering

Latency and throughput should be measured on your own workloads with realistic payloads.

Latency and throughput comparison

| Approach | Typical latency | Typical throughput | Common fit | Key trade-off |
|---|---|---|---|---|
| REST sync | 15-60 ms (p95) | 3k-8k req/s per node | External/public APIs, simple integrations | Heavier payloads and usually higher CPU serialization cost. |
| gRPC sync | 8-30 ms (p95) | 6k-15k req/s per node | Internal low-latency RPC, streaming | Needs IDL governance/tooling and HTTP/2 readiness. |
| GraphQL (BFF/Gateway) | 25-90 ms (p95) | 1k-5k req/s on the gateway | UI aggregation, product-driven contracts | Resolver fan-out, harder profiling and caching. |
| Queue-based async | 40 ms - 2 s | 10k-120k msg/s | Background commands, smoothing traffic spikes | Eventual consistency and a separate operational loop for queues. |
| Pub/Sub events | 20-300 ms | 50k-500k msg/s (cluster) | Domain events with multiple independent subscribers | Harder ordering/duplication control and contract evolution. |

Real event contracts: CloudEvents and AsyncAPI

CloudEvents (domain event example)

{
  "specversion": "1.0",
  "type": "com.shop.order.paid.v1",
  "source": "urn:shop:payments",
  "id": "evt-01HQ7V0R4Z6A0G3T95S1ZQ6B9N",
  "time": "2026-03-03T14:23:44Z",
  "subject": "order/938475",
  "datacontenttype": "application/json",
  "dataschema": "https://events.shop.dev/schemas/order-paid-v1.json",
  "data": {
    "orderId": "938475",
    "userId": "u-1821",
    "amount": 149.90,
    "currency": "USD",
    "paymentMethod": "card"
  }
}

AsyncAPI (channel + payload contract)

asyncapi: 3.0.0
info:
  title: Order Events API
  version: 1.4.0
channels:
  order.paid.v1:
    address: order.paid.v1
    messages:
      orderPaid:
        $ref: '#/components/messages/OrderPaid'
operations:
  onOrderPaid:
    action: receive
    channel:
      $ref: '#/channels/order.paid.v1'
    messages:
      - $ref: '#/channels/order.paid.v1/messages/orderPaid'
components:
  messages:
    OrderPaid:
      payload:
        type: object
        required: [orderId, userId, amount, currency]
        properties:
          orderId: { type: string }
          userId: { type: string }
          amount: { type: number }
          currency: { type: string }

Each event should include a business key (`orderId`) and a technical id (`id`) for deduplication.

Use explicit versioning in `type`/topic (`...v1`) and keep the schema in a registry.

Document delivery SLA, at-least-once/exactly-once expectations, and TTL.

Assign an owning team and deprecation policy for each event version.
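Deduplication on both the technical id and the business key could look like the sketch below. It assumes an order can be paid only once, keeps state in memory, and uses a hypothetical `OrderPaidConsumer` class; in a real consumer the seen-id record and the state change would need to be committed atomically in durable storage.

```typescript
// Subset of the CloudEvents envelope shown above, as the consumer sees it.
interface OrderPaidEvent {
  id: string; // technical id, unique per event
  data: { orderId: string; amount: number };
}

type Outcome = "applied" | "duplicate";

class OrderPaidConsumer {
  private readonly seenEventIds = new Set<string>();
  private readonly paidOrders = new Map<string, number>();

  /**
   * Applies the event once. A redelivery (same technical id) or a second
   * event for an already paid order (same business key) is reported as a
   * duplicate and leaves state untouched.
   */
  handle(event: OrderPaidEvent): Outcome {
    if (this.seenEventIds.has(event.id) || this.paidOrders.has(event.data.orderId)) {
      return "duplicate";
    }
    this.paidOrders.set(event.data.orderId, event.data.amount);
    this.seenEventIds.add(event.id);
    return "applied";
  }

  paidOrderCount(): number {
    return this.paidOrders.size;
  }
}
```

The technical id catches broker redeliveries under at-least-once delivery, while the business key catches a producer that emitted the same fact twice under different ids.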

Reliability

Fault Tolerance Patterns

Distributed communication is fragile unless resilience policies are part of the contract.

How to choose an interaction pattern

Need a response to the user within one HTTP request -> prefer a synchronous path.

Need resistance to spikes and loose coupling -> use async communication through a queue or topic.

If the operation is money/order critical, check idempotency and ordering before selecting a pattern.

If a request crosses many services, reduce the depth of synchronous chains and introduce caches or materialized views.

Set a timeout budget for each service hop and one end-to-end deadline policy.

Use retries with jitter and a retry budget, so dependency degradation does not turn into a retry storm.

Add a circuit breaker and bulkhead for fault isolation and concurrency control.

Require idempotency keys for commands and deduplication for event consumers.

Route invalid or problematic messages to dead-letter and parking-lot queues.
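Retries with jitter and a retry budget can be sketched as two small pieces. The code assumes the "full jitter" variant of exponential backoff and a simple ratio-based budget; `backoffDelayMs` and `RetryBudget` are illustrative names, and the counters here never decay, whereas a production budget would use a time window.

```typescript
/**
 * Exponential backoff with full jitter: a delay drawn from
 * [0, min(capMs, baseMs * 2^attempt)). The random source is injected so the
 * function can be tested deterministically; pass Math.random in real use.
 */
function backoffDelayMs(attempt: number, baseMs: number, capMs: number, random: () => number): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

/** Allows retries only while they stay under a fixed fraction of first-try requests. */
class RetryBudget {
  private requests = 0;
  private retries = 0;
  private readonly ratio: number;

  constructor(ratio: number) {
    this.ratio = ratio;
  }

  recordRequest(): void {
    this.requests += 1;
  }

  /** Returns true and counts the retry when the budget allows it. */
  tryAcquireRetry(): boolean {
    if (this.retries + 1 > this.requests * this.ratio) {
      return false;
    }
    this.retries += 1;
    return true;
  }
}
```

Jitter spreads retries from many clients over time, and the budget caps their total volume: with a ratio of 0.2, ten original requests permit at most two retries, so a degraded dependency sees at most 20% extra load instead of a multiplied one.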

Practical checklist

  • For each integration channel, an owner, SLO and error budget are specified.
  • Contracts are versioned and verified by contract tests in CI.
  • There is a degradation strategy when the downstream service is unavailable.
  • Tracing covers the end-to-end path through synchronous and asynchronous segments.
  • Critical commands and events are processed idempotently.
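For the last checklist item, idempotent handling of a critical command could be sketched as follows. The command shape, the `IdempotentChargeHandler` class, and the generated charge ids are assumptions for the example; a real handler would store the key and the outcome in the same transaction as the side effect.

```typescript
interface ChargeCommand {
  idempotencyKey: string; // chosen by the caller and reused on every retry
  orderId: string;
  amountCents: number;
}

interface ChargeResult {
  chargeId: string;
  amountCents: number;
}

class IdempotentChargeHandler {
  private readonly results = new Map<string, ChargeResult>();
  private executed = 0;

  /** Executes the charge once per idempotency key and replays the stored result afterwards. */
  handle(cmd: ChargeCommand): ChargeResult {
    const previous = this.results.get(cmd.idempotencyKey);
    if (previous !== undefined) {
      return previous; // retry of a completed command: no second side effect
    }
    this.executed += 1;
    const result: ChargeResult = {
      chargeId: "ch-" + this.executed,
      amountCents: cmd.amountCents,
    };
    this.results.set(cmd.idempotencyKey, result);
    return result;
  }

  /** Number of charges that actually ran. */
  executedCount(): number {
    return this.executed;
  }
}
```

Because the caller owns the key, a client that timed out and retried receives the original result instead of charging the order twice.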
