System Design Space

Updated: March 25, 2026 at 1:00 AM

Interservice communication patterns


Synchronous and asynchronous patterns of interaction between services: RPC, messaging, pub/sub, contracts, retries and backpressure.

Inter-service communication does not fail on the happy path. It fails in timeouts, retries, and the assumptions each side forgets to make explicit.

In real design work, the chapter shows how to choose between synchronous and asynchronous patterns based on SLA, acceptable business latency, and coupling, with explicit treatment of timeouts, retries, backoff, and idempotency.

In interviews and engineering discussions, it helps frame backpressure, queue build-up, and partial failure as contract-level design properties rather than accidental implementation details.

Practical value of this chapter

Design in practice

Choose sync vs async communication by SLA, coupling, and acceptable business-flow latency.

Decision quality

Encode timeout, retry, backoff, and idempotency in contracts rather than ad-hoc handlers.

Interview articulation

Tie pattern choice to latency, reliability, and developer productivity outcomes.

Failure framing

Model backpressure and queue build-up before they become production incidents.

Context

Decomposition Strategies

How a system is decomposed determines the number and kinds of interservice interactions it will need.


Choose interservice communication patterns not by fashion, but by latency budget, criticality of the operation, and operational constraints. The main goal is predictable system behavior under load and during failures.

Synchronous patterns

HTTP/gRPC request-response

Suitable for low-latency requests when the client needs an immediate response. In production you usually need per-hop timeout budgets, a retry budget, and an explicit degradation policy.
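The timeout-budget-plus-retry-budget idea can be sketched generically. This is an illustrative helper, not a real library API; `call_with_budget` and its parameters are made-up names, and the transport-specific call is abstracted behind a callable:

```python
import time


def call_with_budget(op, *, attempt_timeout: float, deadline: float, max_retries: int):
    """Invoke op() with a per-attempt timeout under an overall end-to-end deadline.

    op receives the time it may spend on this attempt; retries stop when
    either the retry budget or the remaining deadline is exhausted.
    """
    start = time.monotonic()
    last_exc = None
    for _attempt in range(max_retries + 1):
        remaining = deadline - (time.monotonic() - start)
        if remaining <= 0:
            break  # end-to-end deadline exhausted, stop retrying
        try:
            return op(min(attempt_timeout, remaining))
        except TimeoutError as exc:
            last_exc = exc  # attempt timed out; loop consumes retry budget
    raise TimeoutError("deadline or retry budget exhausted") from last_exc
```

The key property is that the per-attempt timeout is clamped to the remaining deadline, so a retrying client can never exceed the end-to-end budget promised to its own caller.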

Aggregator/BFF composition

A separate service collects data from several upstreams. Convenient for UI, but fan-out without caching and parallel requests quickly becomes a latency bottleneck.

Asynchronous patterns

Queue-based async

Producer and consumer are decoupled in time; useful for smoothing spikes and background jobs. Fits command processing where controlled retries and DLQ are needed.
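The controlled-retries-plus-DLQ flow can be sketched with an in-memory queue. A real broker (e.g. RabbitMQ or SQS) handles redelivery and DLQ routing natively; this only shows the shape of the consumer logic:

```python
from collections import deque

MAX_ATTEMPTS = 3
queue: deque = deque()  # stands in for the broker queue
dlq: list = []          # stands in for the dead-letter queue


def consume(handler) -> None:
    """Drain the queue; a message that fails MAX_ATTEMPTS times is parked in the DLQ."""
    while queue:
        msg = queue.popleft()
        try:
            handler(msg["body"])
        except Exception:
            msg["attempts"] += 1
            if msg["attempts"] >= MAX_ATTEMPTS:
                dlq.append(msg)    # park for inspection instead of retrying forever
            else:
                queue.append(msg)  # controlled redelivery
```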

Pub/Sub events

One emitter publishes an event, multiple subscribers react independently. Good for extensibility and reducing coupling across teams.

Event-carried state transfer

The event carries enough context to reduce synchronous callback requests between services. The cost is stricter versioning discipline and larger payloads.

gRPC vs REST vs GraphQL: configurations and mini benchmark

Single region, internal VPC, TLS enabled, payload around 1 KiB.

Service node: 4 vCPU / 8 GB RAM, 300 concurrent virtual users.

Reads from in-memory cache, no external DB and no heavy business logic.

Numbers below are a lab baseline example, not universal truth.

REST (HTTP/1.1 + JSON)

Simple integration and interoperability with external clients

```nginx
# NGINX upstream + keep-alive (TLS certificate directives omitted)
upstream user_api {
  server user-api:8080;
  keepalive 256;
}

server {
  listen 443 ssl http2;
  location /v1/ {
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_set_header X-Request-Id $request_id;
    proxy_read_timeout 300ms;
    proxy_connect_timeout 80ms;
    proxy_pass http://user_api;
  }
}
```

gRPC (HTTP/2 + Protobuf)

Lower protocol overhead and strict IDL contract

```protobuf
// service.proto
syntax = "proto3";
package catalog.v1;

service CatalogService {
  rpc GetItem(GetItemRequest) returns (GetItemResponse);
}
```

```yaml
# envoy cluster (fragment)
clusters:
  - name: catalog_grpc
    connect_timeout: 0.08s
    type: STRICT_DNS
    http2_protocol_options:
      max_concurrent_streams: 512
    load_assignment:
      cluster_name: catalog_grpc
      endpoints: ...
```

GraphQL (BFF/Gateway)

Client-driven contract and composition across multiple domains

```javascript
// Assumes Apollo Server 3 style imports, e.g.:
// const { ApolloServer } = require('apollo-server');
// const responseCachePlugin = require('apollo-server-plugin-response-cache').default;
const server = new ApolloServer({
  schema,
  persistedQueries: {
    cache: redisCache, // persisted queries shrink request payloads
  },
  plugins: [responseCachePlugin()],
});

// resolver guardrails: batch per-request loads via DataLoader to avoid N+1 fan-out
const resolvers = {
  Query: {
    dashboard: async (_, args, ctx) =>
      ctx.loaders.dashboardByUser.load(args.userId),
  },
};
```
| Approach | p50 latency | p95 latency | Throughput | Comment |
| --- | --- | --- | --- | --- |
| REST (JSON, HTTP/1.1) | 12 ms | 41 ms | ~6.1k req/s | JSON serialization overhead and larger wire payload. |
| gRPC unary (Protobuf, HTTP/2) | 7 ms | 24 ms | ~9.8k req/s | Better CPU/network efficiency with similar business logic. |
| GraphQL gateway (persisted queries + DataLoader) | 15 ms | 53 ms | ~4.3k req/s | Great for UI flexibility, but resolver overhead and fan-out risks remain. |

Protobuf schema evolution without pain

Never reuse field numbers after deletion.

Mark removed fields as `reserved` (both by number and by name).

Add new fields only as optional/nullable with safe default behavior.

For enums, always keep `*_UNSPECIFIED = 0` and handle unknown values.

Breaking changes (type change, moving into `oneof`, removing required behavior) require a new contract version.

Before (v1)

```protobuf
syntax = "proto3";
message UserProfile {
  string user_id = 1;
  string email = 2;
  string phone = 3;
}
```

After (v2, safe evolution)

```protobuf
syntax = "proto3";
message UserProfile {
  string user_id = 1;
  string email = 2;
  reserved 3;
  reserved "phone";
  optional string telegram = 4;
}
```
| Change | Backward | Forward | Comment |
| --- | --- | --- | --- |
| Added a new field | Yes | Yes | Older consumers ignore unknown fields. |
| Removed field + reserved | Conditional | No | If old producers still send it, the new consumer loses that value. |
| Changed field type (int32 -> string) | No | No | Wire format changes and decoding becomes unsafe. |
| Added enum value | Yes | Conditional | Old code needs fallback handling for unknown enum values. |
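The "added enum value" case requires a fallback in consumer code. A sketch of that fallback, with a hypothetical `PaymentMethod` enum mirroring the `*_UNSPECIFIED = 0` convention (generated protobuf bindings expose this differently, so this only illustrates the rule):

```python
from enum import IntEnum


class PaymentMethod(IntEnum):
    PAYMENT_METHOD_UNSPECIFIED = 0  # always keep a zero "unknown" value
    CARD = 1
    WALLET = 2


def parse_payment_method(wire_value: int) -> PaymentMethod:
    """Map a wire value to the enum, falling back to UNSPECIFIED for values
    sent by newer producers that this consumer does not know yet."""
    try:
        return PaymentMethod(wire_value)
    except ValueError:
        return PaymentMethod.PAYMENT_METHOD_UNSPECIFIED
```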

Performance

Performance Engineering

Latency and throughput should be measured on your own workloads with realistic payloads.


Latency/throughput comparison table

| Approach | Typical latency | Typical throughput | Common fit | Key trade-off |
| --- | --- | --- | --- | --- |
| REST sync | 15-60 ms (p95) | 3k-8k req/s per node | External/public APIs, simple integrations | Heavier payloads and usually higher CPU serialization cost. |
| gRPC sync | 8-30 ms (p95) | 6k-15k req/s per node | Internal low-latency RPC, streaming | Needs IDL governance/tooling and HTTP/2 readiness. |
| GraphQL (BFF/Gateway) | 25-90 ms (p95) | 1k-5k req/s on gateway | UI aggregation, product-driven contracts | Resolver fan-out, harder profiling and caching. |
| Queue-based async | 40 ms - 2 s | 10k-120k msg/s | Background commands, smoothing traffic spikes | Eventual consistency and queue operations overhead. |
| Pub/Sub events | 20-300 ms | 50k-500k msg/s (cluster) | Domain events with multiple independent subscribers | Harder ordering/duplication control and contract evolution. |

Real event contracts: CloudEvents and AsyncAPI

CloudEvents (domain event example)

```json
{
  "specversion": "1.0",
  "type": "com.shop.order.paid.v1",
  "source": "urn:shop:payments",
  "id": "evt-01HQ7V0R4Z6A0G3T95S1ZQ6B9N",
  "time": "2026-03-03T14:23:44Z",
  "subject": "order/938475",
  "datacontenttype": "application/json",
  "dataschema": "https://events.shop.dev/schemas/order-paid-v1.json",
  "data": {
    "orderId": "938475",
    "userId": "u-1821",
    "amount": 149.90,
    "currency": "USD",
    "paymentMethod": "card"
  }
}
```

AsyncAPI (channel + payload contract)

```yaml
asyncapi: 3.0.0
info:
  title: Order Events API
  version: 1.4.0
channels:
  order.paid.v1:
    address: order.paid.v1
    messages:
      orderPaid:
        $ref: '#/components/messages/OrderPaid'
operations:
  onOrderPaid:
    action: receive
    channel:
      $ref: '#/channels/order.paid.v1'
    messages:
      - $ref: '#/channels/order.paid.v1/messages/orderPaid'
components:
  messages:
    OrderPaid:
      payload:
        type: object
        required: [orderId, userId, amount, currency]
        properties:
          orderId: { type: string }
          userId: { type: string }
          amount: { type: number }
          currency: { type: string }
```

Each event should include a business key (`orderId`) and a technical id (`id`) for deduplication.

Use explicit versioning in `type`/topic (`...v1`) and keep the schema in a registry.

Document the delivery SLA: at-least-once or exactly-once expectations and TTL.

Assign an owning team and a deprecation policy to each event version.
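The deduplication rule (technical `id` for dedup, business key for the effect) can be sketched as an at-least-once consumer. The in-memory set is a stand-in for a persistent dedup store such as Redis or a unique-constraint table:

```python
# Stand-in for a persistent dedup store; production would persist processed
# ids with a TTL that matches the broker's event retention window.
processed_ids: set = set()


def handle_event(event: dict, apply_effect) -> bool:
    """At-least-once consumer: skip duplicates by the technical event id,
    apply the business effect keyed by the business key (orderId).
    Returns False when the delivery is a duplicate."""
    if event["id"] in processed_ids:
        return False  # redelivery of an already-applied event
    apply_effect(event["data"]["orderId"])
    # Marking after applying keeps at-least-once semantics: a crash between
    # the two lines causes a re-apply, never a silent loss.
    processed_ids.add(event["id"])
    return True
```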

Reliability

Fault Tolerance Patterns

Communication without resilience policies in a distributed environment is usually unstable.


How to choose a pattern

Need a response to the user within one HTTP request -> usually the sync path.

Need resistance to spikes and loose coupling -> async via a queue or topic.

If the operation is money- or order-critical, verify idempotency and ordering before selecting a pattern.

If there are many cross-service hops, reduce the depth of synchronous chains and introduce caches or materialized views.

Timeout budgets for each hop and an end-to-end deadline policy.

Retries with jitter and a retry budget, so retries do not amplify into a retry storm when a dependency degrades.

Circuit breaker/bulkhead for fault isolation and concurrency control.

Idempotency keys for commands and deduplication for event consumers.

DLQ/parking lot for invalid or repeatedly failing messages.
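Retry with jitter is commonly implemented as capped exponential backoff with "full jitter". The `base` and `cap` values below are illustrative defaults, not recommendations:

```python
import random


def backoff_with_jitter(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Capped exponential backoff with full jitter: the delay is drawn
    uniformly from [0, min(cap, base * 2**attempt)], so many clients
    retrying at once spread out instead of hitting the degraded dependency
    as a synchronized retry storm."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```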

Practical checklist

  • Every integration channel has a named owner, an SLO, and an error budget.
  • Contracts are versioned and verified by contract tests in CI.
  • A degradation strategy exists for when a downstream service is unavailable.
  • Tracing covers the end-to-end path through sync and async segments.
  • Critical commands and events are processed idempotently.
