Service Discovery — System Design Space

Service discovery looks like a small detail right up until there are too many services and environments for manual addresses and static configuration.

In real design work, the chapter shows how service registries, DNS, TTL, health checks, traffic balancing, and failover form a shared control plane for service-to-service connectivity.

In interviews and engineering discussions, it helps surface stale addresses, service-registry outages, and split-brain failure modes before they show up as incidents.

Practical value of this chapter

Design in practice

Design discovery around dynamic instances and automatic failover behavior.

Decision quality

Define service-registry consistency, TTL policy, and health-signal propagation.

Interview articulation

Justify client-side versus infrastructure-side discovery through latency and operational simplicity.

Failure framing

Model stale addresses, service-registry outages, and control-plane split-brain scenarios.

Context

Interservice communication patterns

Service-to-service communication becomes fragile when instance addresses and routing rules drift away from real system state.

In a distributed system, instance addresses are short-lived: instances move, crash, scale, and fail over. Once a client clings to a hard-coded address, the first migration turns into a user-facing outage. Service discovery answers how services keep finding each other through all that motion. A reliable discovery loop holds together the service registry, health checks, traffic balancing, and recovery rules; if one link drops out, traffic lands on a dead address.

Discovery models

Client-side discovery

The client pulls the list of service endpoints from the registry itself and picks a target with local load balancing. There is no extra hop, but discovery logic moves into every client, and behavior diverges between services when the client libraries are not standardized.
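As a minimal sketch of that flow, the example below pulls an endpoint list from a registry and balances locally with round-robin. `InMemoryRegistry`, `DiscoveryClient`, the `billing` service name, and the addresses are hypothetical names chosen for illustration; a real registry would sit behind a network API.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    host: str
    port: int


class InMemoryRegistry:
    """Hypothetical registry stub: maps a service name to its endpoints."""

    def __init__(self):
        self._services = {}

    def register(self, service, endpoint):
        self._services.setdefault(service, []).append(endpoint)

    def lookup(self, service):
        # Return a copy so callers cannot mutate registry state.
        return list(self._services.get(service, []))


class DiscoveryClient:
    """Keeps a local endpoint pool and balances requests round-robin."""

    def __init__(self, registry, service):
        self._registry = registry
        self._service = service
        self._cycle = None
        self.refresh()

    def refresh(self):
        # Pull the current endpoint list from the registry.
        endpoints = self._registry.lookup(self._service)
        self._cycle = itertools.cycle(endpoints) if endpoints else None

    def pick(self):
        if self._cycle is None:
            raise LookupError(f"no endpoints for {self._service}")
        return next(self._cycle)


registry = InMemoryRegistry()
registry.register("billing", Endpoint("10.0.0.1", 8080))
registry.register("billing", Endpoint("10.0.0.2", 8080))

client = DiscoveryClient(registry, "billing")
picks = [client.pick().host for _ in range(3)]
print(picks)  # ['10.0.0.1', '10.0.0.2', '10.0.0.1']
```

The pool only changes when `refresh()` is called, so how often a client refreshes is exactly the kind of behavior that drifts when every team writes its own version of this class.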

Infrastructure-side discovery

The client sees one stable entry point (a load balancer or proxy), while the choice of the concrete instance stays hidden inside the infrastructure. The client stays thin, but that entry point becomes a shared component on the path of every request.

DNS-based discovery

Services are published as DNS names, and plain name resolution is enough for the client. The price of that simplicity is TTL and DNS caching: until the old record expires, the client keeps connecting to an address that is already gone.
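The stale-record window can be illustrated with a small TTL cache. This is a sketch under assumptions: `CachingResolver`, the `orders.internal` name, the addresses, and the 30-second TTL are all hypothetical, and a manual clock stands in for real time so the run is deterministic.

```python
class CachingResolver:
    """Caches resolved addresses for ttl seconds, like a DNS client cache."""

    def __init__(self, resolve, ttl, clock):
        self._resolve = resolve  # function: name -> address (simulated upstream)
        self._ttl = ttl
        self._clock = clock      # function returning the current time in seconds
        self._cache = {}         # name -> (address, expires_at)

    def lookup(self, name):
        now = self._clock()
        cached = self._cache.get(name)
        if cached and now < cached[1]:
            return cached[0]     # may be stale if the record changed upstream
        address = self._resolve(name)
        self._cache[name] = (address, now + self._ttl)
        return address


# Simulated authoritative records and a manual clock.
records = {"orders.internal": "10.0.0.5"}
now = [0.0]
resolver = CachingResolver(records.__getitem__, ttl=30, clock=lambda: now[0])

first = resolver.lookup("orders.internal")  # cache miss: resolves upstream
records["orders.internal"] = "10.0.0.9"     # the instance moved
now[0] = 10.0
stale = resolver.lookup("orders.internal")  # within TTL: old address is served
now[0] = 31.0
fresh = resolver.lookup("orders.internal")  # TTL expired: new address is served
print(first, stale, fresh)  # 10.0.0.5 10.0.0.5 10.0.0.9
```

Between the move and the expiry, every request from this client goes to the old address, which is why the TTL bounds how fast DNS-based failover can be.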

Client-side discovery

The SDK performs the registry lookup, keeps a local instance pool, and routes requests on the client side.

Strengths

Maximum control over routing and retry logic on the client side.
Fast reaction to local latency and error-rate metrics.
No dependency on a central proxy in the data path.

Limitations

Discovery SDK must be supported across all services and languages.
Harder to enforce uniform rules across the whole platform.

Best fit: High-load internal services with a unified SDK and a mature observability platform.

Pipeline: registration -> health -> lookup -> routing

Client performs lookup and selects an endpoint locally using a routing policy.

[Interactive simulation: sample requests SD-REQ-101 (Web, billing / tenant:acme), SD-REQ-102 (Mobile, profile / user:42), SD-REQ-103 (Partner, orders / order:7712), and SD-REQ-104 (Web, billing / tenant:globex) pass through the discovery plane (registry v42) to three healthy instances: service-a-01 (zone cluster-a, load 1), service-a-02 (zone cluster-b, load 2), and service-a-03 (zone cluster-c, load 1).]
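The routing step of this pipeline can be sketched with the sample pool from the simulation (service-a-01 to service-a-03 with their zones and loads). The `Instance` type and both policy functions are hypothetical illustrations, not the page's actual simulation logic.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    zone: str
    load: int
    healthy: bool = True


def least_loaded(pool):
    """Pick the healthy instance with the lowest load (ties: first in pool)."""
    healthy = [i for i in pool if i.healthy]
    if not healthy:
        raise LookupError("no healthy instances")
    return min(healthy, key=lambda i: i.load)


def locality_aware(pool, client_zone):
    """Prefer healthy instances in the client's zone, else use the whole pool."""
    local = [i for i in pool if i.healthy and i.zone == client_zone]
    return least_loaded(local or pool)


pool = [
    Instance("service-a-01", "cluster-a", load=1),
    Instance("service-a-02", "cluster-b", load=2),
    Instance("service-a-03", "cluster-c", load=1),
]

global_pick = least_loaded(pool).name               # service-a-01
local_pick = locality_aware(pool, "cluster-b").name  # service-a-02

pool[1].healthy = False  # the local instance fails its health check
failover_pick = locality_aware(pool, "cluster-b").name  # service-a-01
print(global_pick, local_pick, failover_pick)
```

Locality-aware routing accepts a more loaded instance to stay in the client's zone, and only crosses zones once no healthy local instance remains.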

Key components

Service registry: stores current instance addresses, zones, versions, and other metadata.
Health checks: readiness probes determine whether traffic can be sent to an instance, and liveness probes determine whether it should be restarted.
Heartbeat and TTL: inactive instances are removed from discovery when they stop renewing their registration.
Load-balancing policy: round-robin, least-loaded, and locality-aware routing.
Retry and timeout policy: protects clients from short failures and network jitter.
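The heartbeat and TTL component from the list above can be sketched as a registry that evicts instances whose registration was not renewed in time. `HeartbeatRegistry`, the 10-second TTL, and the instance names are assumptions for illustration; the clock is injected so the example runs deterministically.

```python
class HeartbeatRegistry:
    """Tracks the last heartbeat per instance and evicts expired ones."""

    def __init__(self, ttl, clock):
        self._ttl = ttl
        self._clock = clock    # function returning the current time in seconds
        self._last_seen = {}   # instance id -> time of last heartbeat

    def heartbeat(self, instance_id):
        # A heartbeat both registers a new instance and renews an existing one.
        self._last_seen[instance_id] = self._clock()

    def live_instances(self):
        now = self._clock()
        # Evict instances whose registration was not renewed within the TTL.
        expired = [i for i, seen in self._last_seen.items() if now - seen > self._ttl]
        for instance_id in expired:
            del self._last_seen[instance_id]
        return sorted(self._last_seen)


now = [0.0]
registry = HeartbeatRegistry(ttl=10, clock=lambda: now[0])
registry.heartbeat("service-a-01")
registry.heartbeat("service-a-02")

now[0] = 8.0
registry.heartbeat("service-a-01")  # only service-a-01 renews

now[0] = 12.0
live = registry.live_instances()
print(live)  # ['service-a-01']
```

The TTL sets the worst-case time a crashed instance stays discoverable, so it is usually chosen as a small multiple of the heartbeat interval to tolerate one missed renewal.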

Adjacent practice

Service Mesh Architecture

In many companies, the service catalog is complemented by proxy routing through a service mesh.

Industry approaches

Kubernetes Service and DNS

Best fit: Teams running primarily on Kubernetes, where traffic routing is already standardized through Service and DNS.

Typical stack: Service, EndpointSlice, CoreDNS, readiness/liveness probes, kube-proxy/IPVS.

Strengths

Minimal standalone registry infrastructure because discovery is built into the runtime platform.
Instance movement and rolling updates are reflected automatically in the endpoint pool.

Risks and limitations

Multi-cluster scenarios need extra mechanisms such as MCS, federated DNS, or a service mesh.
Incorrect client or DNS cache settings can slow down failover.
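From the client's point of view, Kubernetes discovery is a DNS lookup of the Service name. The sketch below builds the conventional in-cluster name `<service>.<namespace>.svc.<cluster-domain>` and resolves a host with the system resolver. The `billing` service and `payments` namespace are hypothetical, and `cluster.local` is the common default cluster domain, which a given cluster may override.

```python
import socket


def service_fqdn(service, namespace, cluster_domain="cluster.local"):
    """Build the in-cluster DNS name of a Kubernetes Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def resolve_all(host, port):
    """Resolve a host to its unique addresses using the system resolver."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


name = service_fqdn("billing", "payments")
print(name)  # billing.payments.svc.cluster.local

# Inside a cluster, a client would then call, for example:
#   resolve_all(name, 8080)
# For a regular ClusterIP Service this typically returns one virtual IP;
# for a headless Service it returns the addresses of the ready Pods.
```

Any caching between this call and the next one (in the application, the runtime, or a node-local cache) adds to the failover delay mentioned above.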

Consul catalog

Best fit: Hybrid VM and Kubernetes environments, multi-datacenter organizations, and teams with an explicit platform control plane.

Typical stack: Consul agents, service catalog, health checks, ACL, Consul Connect.

Strengths

A single service catalog across runtimes and network segments.
Rich metadata and access policies: you can constrain who sees whom right at the discovery layer.

Risks and limitations

The control plane needs disciplined operations: Raft/gossip topology, upgrade policy, and backups.
Without strict health-check hygiene, stale service endpoints accumulate in the catalog.
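To show why health-check hygiene matters at lookup time, the sketch below filters catalog entries down to instances whose checks all pass. The payload is a hand-written sample shaped like the response of Consul's `/v1/health/service/<name>` HTTP endpoint, not captured output; the node names, addresses, and the `healthy_endpoints` helper are assumptions for illustration.

```python
import json

# Hand-written sample in the shape of a Consul health query response.
SAMPLE_RESPONSE = json.loads("""
[
  {"Node": {"Node": "vm-1", "Address": "10.0.0.1"},
   "Service": {"Service": "billing", "Address": "", "Port": 8080},
   "Checks": [{"Status": "passing"}]},
  {"Node": {"Node": "vm-2", "Address": "10.0.0.2"},
   "Service": {"Service": "billing", "Address": "10.1.0.2", "Port": 8080},
   "Checks": [{"Status": "passing"}, {"Status": "critical"}]}
]
""")


def healthy_endpoints(entries):
    """Return (address, port) pairs for entries whose checks are all passing."""
    result = []
    for entry in entries:
        if any(check["Status"] != "passing" for check in entry["Checks"]):
            continue  # drop instances with any warning or critical check
        service = entry["Service"]
        # An empty service address means "use the node's address".
        address = service["Address"] or entry["Node"]["Address"]
        result.append((address, service["Port"]))
    return result


endpoints = healthy_endpoints(SAMPLE_RESPONSE)
print(endpoints)  # [('10.0.0.1', 8080)]
```

The same filtering can be requested server-side with the endpoint's `passing` query parameter; either way, an instance registered without a meaningful check is never filtered out, which is how stale endpoints accumulate.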

Eureka and client-side load balancing

Best fit: Java/Spring ecosystems and east-west microservice traffic where predictable latency matters.

Typical stack: Eureka Server, Spring Cloud Netflix, client-side load balancing, and resilience policies.

Strengths

Client-side target selection avoids an extra network hop through a central proxy.
Works well with retry and circuit-breaker policies implemented in client SDKs.

Risks and limitations

Client libraries must be standardized, otherwise discovery behavior diverges between services.
Polyglot stacks make it harder to keep one protocol and one operating model.

Cloud-managed discovery and Envoy/xDS

Best fit: AWS/GCP platform teams that prioritize a managed control plane and cloud-service integration.

Typical stack: AWS Cloud Map or Traffic Director + Envoy/xDS, sometimes with a managed service mesh.

Strengths

Registry clusters are operated and upgraded by the provider, not by an on-call team.
Native integration with IAM, VPC, and cloud observability, which means less glue between discovery and the rest of the platform.

Risks and limitations

Vendor lock-in risk at the API, network-policy, and operations layers.
Regional degradation and control-plane API limits still need explicit testing.

Foundation

DNS

DNS is the name-resolution foundation behind many service discovery implementations.

Trade-offs

Registry consistency and availability

The stricter the registry's consistency requirement, the more readily it refuses to answer during network problems, and with no answer from the registry, discovery stalls.
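One common way to soften this trade-off on the client side is to fall back to the last known good answer when the registry does not respond. The sketch below is one possible shape, not a prescribed design; `RegistryUnavailable`, `FallbackLookup`, and the sample addresses are hypothetical.

```python
class RegistryUnavailable(Exception):
    """Raised by the (simulated) registry when it cannot answer."""


class FallbackLookup:
    """Serves the last known good endpoint list when the registry is down."""

    def __init__(self, fetch):
        self._fetch = fetch      # function: service -> list of addresses
        self._last_good = {}

    def lookup(self, service):
        """Return (endpoints, is_stale)."""
        try:
            endpoints = self._fetch(service)
        except RegistryUnavailable:
            # Prefer possibly stale data over failing every request.
            if service in self._last_good:
                return self._last_good[service], True
            raise
        self._last_good[service] = endpoints
        return endpoints, False


registry_up = [True]


def fetch(service):
    if not registry_up[0]:
        raise RegistryUnavailable(service)
    return ["10.0.0.1:8080", "10.0.0.2:8080"]


lookup = FallbackLookup(fetch)
fresh = lookup.lookup("billing")
registry_up[0] = False               # simulate a registry outage
degraded = lookup.lookup("billing")
print(fresh[1], degraded[1])  # False True
```

Returning the staleness flag lets callers decide how long they are willing to trust cached endpoints, which moves the consistency decision from the registry to each client.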

TTL freshness and DNS load

Short TTL values make route updates faster, but increase load on DNS and the control plane.
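A rough back-of-the-envelope model makes the cost visible. It assumes every client re-resolves each name once per TTL and that no shared cache sits in between; the client count and names-per-client figures are invented for illustration.

```python
def steady_state_qps(clients, names_per_client, ttl_seconds):
    """Approximate DNS queries per second if each name is re-resolved once per TTL."""
    return clients * names_per_client / ttl_seconds


# Hypothetical fleet: 2000 clients, each resolving 20 service names.
for ttl in (60, 30, 5):
    print(ttl, round(steady_state_qps(2000, 20, ttl), 1))
```

Under these assumptions, cutting the TTL from 60 seconds to 5 seconds multiplies resolver load by twelve, in exchange for a stale window that is twelve times shorter.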

Centralized control and client autonomy

A centralized control plane is convenient, but it increases the blast radius of configuration mistakes.

Dynamic addresses and stale caches

A cache answers fast, but it is during failover that it does the most harm: it keeps serving an address that no longer exists, and the client keeps hitting a dead instance until the record expires.

Practical checklist

Automatic deregistration is in place when an instance fails or becomes isolated.
Discovery behavior is tested under network partitions and controller failures.
Retries and timeouts use jitter and bounded retry counts.
Stale service endpoints and address-resolution latency are monitored.
Service names and ownership rules are standardized at the platform level.
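The retry item in the checklist can be sketched as exponential backoff with full jitter and a hard cap on attempts. The function name, default values, and the choice to retry only on `ConnectionError` are assumptions for illustration; `sleep` and `rng` are injectable so the example runs without waiting.

```python
import random
import time


def call_with_retries(operation, max_attempts=3, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation, retrying on ConnectionError with jittered backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # bounded retries: give up after the last attempt
            # Full jitter: sleep a random duration up to the capped backoff.
            backoff = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * backoff)


attempts = []
delays = []


def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("instance unreachable")
    return "ok"


# Record the delays instead of sleeping, to keep the example fast.
result = call_with_retries(flaky, sleep=delays.append)
print(result, len(attempts), len(delays))  # ok 3 2
```

Jitter spreads retries from many clients over time, so a brief outage does not turn into a synchronized retry burst against the recovering instance.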

References

Related chapters

DNS - The name-resolution foundation behind many discovery strategies.
Service Mesh Architecture - A service mesh adds policy-aware routing on top of service discovery.
Interservice communication patterns - How a service calls its neighbor should be decided together with how it finds that neighbor; otherwise communication and discovery patterns start to contradict each other.
Kubernetes Fundamentals - Practical service discovery through Kubernetes Service, Endpoints, and DNS.
Fault Tolerance Patterns - Discovery must work together with retries, circuit breakers, and health checks.