Updated: May 7, 2026 at 6:26 PM

Service Discovery

medium

How services find current addresses for one another: service registries, DNS, health checks, traffic balancing, and failure behavior.

Service discovery looks like a small detail right up until there are too many services and environments for manual addresses and static configuration.

In real design work, the chapter shows how service registries, DNS, TTL, health checks, traffic balancing, and failover form a shared control plane for service-to-service connectivity.

In interviews and engineering discussions, it helps surface stale addresses, service-registry outages, and split-brain failure modes before they show up as incidents.

Practical value of this chapter

Design in practice

Design discovery around dynamic instances and automatic failover behavior.

Decision quality

Define service-registry consistency, TTL policy, and health-signal propagation.

Interview articulation

Justify client-side versus infrastructure-side discovery through latency and operational simplicity.

Failure framing

Model stale addresses, service-registry outages, and control-plane split-brain scenarios.

Context

Interservice communication patterns

Service-to-service communication becomes fragile when instance addresses and routing rules drift away from real system state.

Service discovery solves a basic distributed-systems problem: how services find each other when instances move, fail over, and scale. A reliable discovery loop ties together the service registry, health checks, traffic balancing, and recovery rules.

Discovery models

Client-side discovery

The client reads the service registry, keeps a local pool of service endpoints, and chooses a target with local load balancing.

Infrastructure-side discovery

The client calls a stable entry point such as an LB or proxy, while the infrastructure chooses the concrete service instance.

DNS-based discovery

Services are published as DNS names; clients rely on name resolution, TTL, and DNS cache behavior.
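As a minimal sketch of DNS-based discovery, the function below resolves a service name to a list of endpoints using Python's standard resolver. The function name `resolve_endpoints` is an illustrative assumption. Note that `socket.getaddrinfo` does not expose the record TTL, so a client built this way has to choose its own cache lifetime or rely on the caching of the resolver underneath it.

```python
import socket


def resolve_endpoints(hostname: str, port: int) -> list[tuple[str, int]]:
    """Resolve a service DNS name to unique (ip, port) endpoints."""
    # Restrict to TCP so each address is not repeated per socket type.
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    endpoints: list[tuple[str, int]] = []
    for _family, _socktype, _proto, _canonname, sockaddr in infos:
        endpoint = (sockaddr[0], sockaddr[1])
        if endpoint not in endpoints:  # preserve resolver order, drop duplicates
            endpoints.append(endpoint)
    return endpoints
```

Every call goes back to the resolver, so how quickly a client notices a changed record depends entirely on the caches between it and the authoritative server.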

Client-side discovery

The SDK performs the registry lookup, keeps a local instance pool, and routes requests on the client side.

Strengths

  • Maximum control over routing and retry logic on the client side.
  • Fast reaction to local latency and error-rate metrics.
  • No dependency on a central proxy in the data path.

Limitations

  • Discovery SDK must be supported across all services and languages.
  • Harder to enforce uniform rules across the whole platform.

Best fit: High-load internal services with a unified SDK and mature observability platform.

Pipeline: registration -> health -> lookup -> routing
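The lookup and routing steps of the pipeline above can be sketched as follows. This is an illustration under stated assumptions, not the API of any particular SDK: the `Instance` shape, the `DiscoveryClient` class, and the zone names are all hypothetical, and the registry snapshot is passed in directly rather than fetched over the network.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One registered endpoint (hypothetical shape, not a real registry schema)."""
    name: str
    address: str
    zone: str
    healthy: bool = True


class DiscoveryClient:
    """Keeps a local pool from a registry snapshot and routes round-robin."""

    def __init__(self, local_zone: str):
        self.local_zone = local_zone
        self._pool: list[Instance] = []
        self._counter = itertools.count()

    def refresh(self, snapshot: list[Instance]) -> None:
        # Lookup step: replace the local pool, keeping only healthy instances.
        self._pool = [i for i in snapshot if i.healthy]

    def choose(self) -> Instance:
        # Routing step: prefer same-zone instances, else any healthy instance.
        if not self._pool:
            raise LookupError("no healthy instances in the local pool")
        local = [i for i in self._pool if i.zone == self.local_zone]
        candidates = local or self._pool
        return candidates[next(self._counter) % len(candidates)]
```

A caller would typically run `refresh` on a timer or on registry change notifications and call `choose` once per request. Because routing happens in the client, any retry or latency-aware policy can be layered onto `choose` without touching shared infrastructure.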

Key components

  • Service registry: stores current instance addresses, zones, versions, and other metadata.
  • Health checks: readiness probes determine whether an instance may receive traffic, while liveness probes determine whether it should be restarted.
  • Heartbeat and TTL: inactive instances are removed from discovery when they stop renewing their registration.
  • Load-balancing policy: round-robin, least-loaded, and locality-aware routing.
  • Retry and timeout policy: protects clients from short failures and network jitter.
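The heartbeat-and-TTL component from the list above can be illustrated with a small in-memory registry. This is a sketch under assumptions: `TtlRegistry` and its method names are hypothetical, state is held in a single process, and the clock is injectable only so that expiry can be exercised without sleeping.

```python
import time


class TtlRegistry:
    """In-memory registry that forgets instances whose heartbeat lease expired."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._last_seen: dict[str, dict[str, float]] = {}

    def heartbeat(self, service: str, address: str) -> None:
        # Registration and renewal are the same operation: record "seen now".
        self._last_seen.setdefault(service, {})[address] = self._clock()

    def deregister(self, service: str, address: str) -> None:
        # Explicit removal for graceful shutdown; a missing entry is not an error.
        self._last_seen.get(service, {}).pop(address, None)

    def lookup(self, service: str) -> list[str]:
        # Return only addresses whose last heartbeat is within the TTL window.
        now = self._clock()
        entries = self._last_seen.get(service, {})
        expired = [a for a, seen in entries.items() if now - seen > self.ttl_seconds]
        for address in expired:
            del entries[address]
        return sorted(entries)
```

An instance that crashes or becomes isolated simply stops calling `heartbeat` and disappears from `lookup` results after at most `ttl_seconds`, which is the bound on how long clients can be handed a dead address by this registry.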

Adjacent practice

Service Mesh Architecture

In many companies, the service catalog is complemented by proxy routing through a service mesh.

Industry approaches

Kubernetes Service and DNS

Best fit: Teams running primarily on Kubernetes, where traffic routing is already standardized through Service and DNS.

Typical stack: Service, EndpointSlice, CoreDNS, readiness/liveness probes, kube-proxy/IPVS.

Strengths

  • Minimal standalone registry infrastructure because discovery is built into the runtime platform.
  • Instance movement and rolling updates are reflected automatically in the endpoint pool.

Risks and limitations

  • Multi-cluster scenarios need extra mechanisms such as MCS, federated DNS, or a service mesh.
  • Incorrect client or DNS cache settings can slow down failover.

Consul catalog

Best fit: Hybrid VM and Kubernetes environments, multi-datacenter organizations, and teams with an explicit platform control plane.

Typical stack: Consul agents, service catalog, health checks, ACL, Consul Connect.

Strengths

  • A single service catalog across runtimes and network segments.
  • Rich metadata and access policies for governed discovery.

Risks and limitations

  • The control plane needs disciplined operations: Raft/gossip topology, upgrade policy, and backups.
  • Without strict health-check hygiene, stale service endpoints accumulate in the catalog.

Eureka and client-side load balancing

Best fit: Java/Spring ecosystems and east-west microservice traffic where predictable latency matters.

Typical stack: Eureka Server, Spring Cloud Netflix, client-side load balancing, and resilience policies.

Strengths

  • Client-side target selection avoids an extra network hop through a central proxy.
  • Works well with retry and circuit-breaker policies implemented in client SDKs.

Risks and limitations

  • Client libraries must be standardized, otherwise discovery behavior diverges between services.
  • Polyglot stacks make it harder to keep one protocol and one operating model.

Cloud-managed discovery and Envoy/xDS

Best fit: AWS/GCP platform teams that prioritize a managed control plane and cloud-service integration.

Typical stack: AWS Cloud Map or Traffic Director + Envoy/xDS, sometimes with a managed service mesh.

Strengths

  • Lower operational load on self-hosted registry clusters.
  • Native integration with IAM, VPC, and cloud observability ecosystems.

Risks and limitations

  • Vendor lock-in risk at the API, network-policy, and operations layers.
  • Regional degradation and control-plane API limits still need explicit testing.

Foundation

DNS

DNS is the name-resolution foundation behind many service discovery implementations.

Trade-offs

Registry consistency and availability

Strict consistency can reduce discovery availability during network partitions: a registry that requires a quorum may reject reads or writes until the quorum is restored.
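One common way to soften this trade-off on the client side is to keep serving the last known-good endpoint list while the registry is unreachable. The sketch below assumes a hypothetical `fetch` callable that raises `RegistryUnavailable` on outage; both names are illustrative and do not belong to any real library.

```python
class RegistryUnavailable(Exception):
    """Raised by the (hypothetical) registry fetch function during an outage."""


class CachedResolver:
    """Serves the last known-good endpoint list when the registry is unreachable."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable: service name -> iterable of addresses
        self._last_good: dict[str, list[str]] = {}

    def resolve(self, service: str) -> tuple[list[str], bool]:
        # Returns (endpoints, is_stale) so callers and metrics can see staleness.
        try:
            endpoints = list(self._fetch(service))
        except RegistryUnavailable:
            if service not in self._last_good:
                raise  # nothing cached: fail loudly rather than invent endpoints
            return self._last_good[service], True
        self._last_good[service] = endpoints
        return endpoints, False
```

The `is_stale` flag makes the choice explicit: the client stays available during a registry outage, but it may route to instances that have since gone away, so it still needs timeouts and retries on the data path.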

TTL freshness and DNS load

Short TTL values make route updates faster, but increase load on DNS and the control plane.
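A rough back-of-the-envelope estimate makes the trade-off concrete. The helper below is a simplification under stated assumptions: every client caches independently, re-resolves the name once per TTL, and no shared resolver cache sits in between. The function name and the returned keys are illustrative.

```python
def ttl_tradeoff(clients: int, ttl_seconds: float) -> dict[str, float]:
    """Rough steady-state estimate for one DNS name resolved by many clients."""
    if clients < 0 or ttl_seconds <= 0:
        raise ValueError("clients must be >= 0 and ttl_seconds must be > 0")
    return {
        # Each client re-resolves about once per TTL window.
        "lookups_per_second": clients / ttl_seconds,
        # A record cached just before a change stays in use for up to one TTL.
        "max_staleness_seconds": ttl_seconds,
    }
```

Under these assumptions, 3000 clients with a 30-second TTL generate about 100 lookups per second for that name; cutting the TTL to 5 seconds shortens worst-case staleness sixfold and raises the lookup rate to about 600 per second.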

Centralized control and client autonomy

A centralized control plane is convenient, but it increases the blast radius of configuration mistakes.

Dynamic addresses and stale caches

Caches speed up address resolution, but may hold stale records during failover.

Practical checklist

  • Automatic deregistration is in place when an instance fails or becomes isolated.
  • Discovery behavior is tested under network partitions and controller failures.
  • Retries and timeouts use jitter and bounded retry counts.
  • Stale service endpoints and address-resolution latency are monitored.
  • Service names and ownership rules are standardized at the platform level.
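The checklist item on retries can be sketched as a bounded retry loop with exponential backoff and full jitter. The function name, defaults, and the choice of which exceptions count as retryable are assumptions for illustration; `sleep` and `rng` are injectable only to make the behavior testable.

```python
import random
import time


def call_with_retries(operation, max_attempts=3, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(); retry connection errors and timeouts with jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # bounded: give up after the final attempt
            # Full jitter: sleep a random time below the capped exponential delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)
```

Jitter spreads retries from many clients over time so that a brief registry or instance failure is not followed by a synchronized burst, and the attempt bound keeps a persistent failure from turning into unbounded extra load.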
