Service discovery looks like a small detail right up until there are too many services and environments for manual addresses and static configuration.
For design work, this chapter shows how service registries, DNS, TTL, health checks, traffic balancing, and failover form a shared control plane for service-to-service connectivity.
In interviews and engineering discussions, it helps surface stale addresses, service-registry outages, and split-brain failure modes before they show up as incidents.
Practical value of this chapter
Design in practice
Design discovery around dynamic instances and automatic failover behavior.
Decision quality
Define service-registry consistency, TTL policy, and health-signal propagation.
Interview articulation
Justify the choice between client-side and infrastructure-side discovery in terms of latency and operational simplicity.
Failure framing
Model stale addresses, service-registry outages, and control-plane split-brain scenarios.
Context
Interservice communication patterns
Service-to-service communication becomes fragile when instance addresses and routing rules drift away from real system state.
Service discovery solves a basic distributed-systems problem: how services find each other when instances move, fail over, and scale. A reliable discovery loop ties together the service registry, health checks, traffic balancing, and recovery rules.
Discovery models
Client-side discovery
The client reads the service registry, keeps a local pool of service endpoints, and chooses a target with local load balancing.
Infrastructure-side discovery
The client calls a stable entry point such as a load balancer or proxy, while the infrastructure chooses the concrete service instance.
DNS-based discovery
Services are published as DNS names; clients rely on name resolution, TTL, and DNS cache behavior.
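To make the effect of TTL and DNS cache behavior concrete, here is a minimal sketch of a client-side cache that re-resolves a name only after its TTL expires. The class name `TtlResolverCache`, the injected `resolve` callable, and the injected clock are illustrative assumptions, not the API of any real resolver library.

```python
import time
from typing import Callable, Dict, List, Tuple


class TtlResolverCache:
    """Hypothetical helper: caches resolved addresses per name until a TTL expires."""

    def __init__(
        self,
        resolve: Callable[[str], List[str]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolve = resolve      # assumed lookup function, e.g. a DNS query wrapper
        self._ttl = ttl_seconds
        self._clock = clock
        # name -> (expiry time, cached addresses)
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def lookup(self, name: str) -> List[str]:
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and now < cached[0]:
            # Still inside the TTL window: the answer may already be stale.
            return cached[1]
        addresses = self._resolve(name)
        self._entries[name] = (now + self._ttl, addresses)
        return addresses
```

Under these assumptions, a record that changes right after a lookup stays invisible to this client until the TTL runs out, which is the stale-address window that the TTL setting controls.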
Client-side discovery
The SDK performs the registry lookup, keeps a local instance pool, and routes requests on the client side.
Strengths
- Maximum control over routing and retry logic on the client side.
- Fast reaction to local latency and error-rate metrics.
- Independent from a central proxy in the data path.
Limitations
- Discovery SDK must be supported across all services and languages.
- Harder to enforce uniform rules across the whole platform.
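The client-side model described above can be sketched as a small balancer that refreshes a local pool from the registry and picks endpoints round-robin. The class `ClientSideBalancer`, the `fetch_instances` callable, and the endpoint strings are hypothetical; a real SDK would add health filtering, background refresh, and thread safety.

```python
from typing import Callable, List


class ClientSideBalancer:
    """Hypothetical sketch: local endpoint pool with round-robin selection."""

    def __init__(self, fetch_instances: Callable[[str], List[str]], service: str) -> None:
        self._fetch = fetch_instances  # assumed registry lookup: service name -> endpoints
        self._service = service
        self._pool: List[str] = []
        self._cursor = 0

    def refresh(self) -> None:
        # Replace the local pool with the registry's current view.
        # Sorting keeps the rotation order stable between refreshes.
        self._pool = sorted(self._fetch(self._service))

    def choose(self) -> str:
        if not self._pool:
            raise LookupError(f"no known instances for {self._service!r}")
        endpoint = self._pool[self._cursor % len(self._pool)]
        self._cursor += 1
        return endpoint
```

In this sketch the client keeps routing from its last snapshot if the registry becomes unreachable, which gives it independence from a central proxy in the data path at the cost of possibly stale endpoints.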
Key components
- Service registry: stores current instance addresses, zones, versions, and other metadata.
- Health checks: readiness and liveness probes determine whether traffic can be sent to an instance.
- Heartbeat and TTL: inactive instances are removed from discovery when they stop renewing their registration.
- Load-balancing policy: round-robin, least-loaded, and locality-aware routing.
- Retry and timeout policy: protects clients from transient failures and network jitter.
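The heartbeat and TTL component can be illustrated with a minimal in-memory registry that drops instances once they stop renewing. The name `HeartbeatRegistry`, its methods, and the lazy expiry on read are assumptions made for illustration; production registries typically replicate this state and expire entries in the background.

```python
import time
from typing import Callable, Dict, List


class HeartbeatRegistry:
    """Hypothetical in-memory registry with TTL-based deregistration."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # service name -> {instance id -> time of last heartbeat}
        self._last_seen: Dict[str, Dict[str, float]] = {}

    def heartbeat(self, service: str, instance: str) -> None:
        # Registers the instance on first call and renews it on later calls.
        self._last_seen.setdefault(service, {})[instance] = self._clock()

    def live_instances(self, service: str) -> List[str]:
        now = self._clock()
        instances = self._last_seen.get(service, {})
        expired = [i for i, seen in instances.items() if now - seen > self._ttl]
        for instance in expired:
            del instances[instance]  # missed its renewal window: remove from discovery
        return sorted(instances)
```

The TTL here bounds how long a crashed instance can keep receiving traffic, so it is usually chosen together with the heartbeat interval (for example, a TTL of a few missed heartbeats).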
Adjacent practice
Service Mesh Architecture
In many companies, the service catalog is complemented by proxy routing through a service mesh.
Industry approaches
Kubernetes Service and DNS
Best fit: Teams running primarily on Kubernetes, where traffic routing is already standardized through Service and DNS.
Typical stack: Service, EndpointSlice, CoreDNS, readiness/liveness probes, kube-proxy/IPVS.
Strengths
- Minimal standalone registry infrastructure because discovery is built into the runtime platform.
- Instance movement and rolling updates are reflected automatically in the endpoint pool.
Risks and limitations
- Multi-cluster scenarios need extra mechanisms such as the Multi-Cluster Services (MCS) API, federated DNS, or a service mesh.
- Incorrect client or DNS cache settings can slow down failover.
Consul catalog
Best fit: Hybrid VM and Kubernetes environments, multi-datacenter organizations, and teams with an explicit platform control plane.
Typical stack: Consul agents, service catalog, health checks, ACL, Consul Connect.
Strengths
- A single service catalog across runtimes and network segments.
- Rich metadata and access policies for governed discovery.
Risks and limitations
- The control plane needs disciplined operations: Raft/gossip topology, upgrade policy, and backups.
- Without strict health-check hygiene, stale service endpoints accumulate in the catalog.
Eureka and client-side load balancing
Best fit: Java/Spring ecosystems and east-west microservice traffic where predictable latency matters.
Typical stack: Eureka Server, Spring Cloud Netflix, client-side load balancing, and resilience policies.
Strengths
- Client-side target selection avoids an extra network hop through a central proxy.
- Works well with retry and circuit-breaker policies implemented in client SDKs.
Risks and limitations
- Client libraries must be standardized, otherwise discovery behavior diverges between services.
- Polyglot stacks make it harder to keep one protocol and one operating model.
Cloud-managed discovery and Envoy/xDS
Best fit: AWS/GCP platform teams that prioritize a managed control plane and cloud-service integration.
Typical stack: AWS Cloud Map or Traffic Director + Envoy/xDS, sometimes with a managed service mesh.
Strengths
- Lower operational load on self-hosted registry clusters.
- Native integration with IAM, VPC, and cloud observability ecosystems.
Risks and limitations
- Vendor lock-in risk at the API, network-policy, and operations layers.
- Regional degradation and control-plane API limits still need explicit testing.
Foundation
DNS
DNS is the name-resolution foundation behind many service discovery implementations.
Trade-offs
Registry consistency and availability
A strongly consistent registry may refuse reads or writes during a network partition, which reduces discovery availability exactly when clients need it.
TTL freshness and DNS load
Short TTL values make route updates faster, but increase load on DNS and the control plane.
Centralized control and client autonomy
A centralized control plane is convenient, but it increases the blast radius of configuration mistakes.
Dynamic addresses and stale caches
Caches speed up address resolution, but may hold stale records during failover.
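The TTL freshness versus DNS load trade-off can be estimated with simple arithmetic. The sketch below assumes a rough upper-bound model in which every client re-resolves every name once per TTL and no shared caching resolver sits in between; the function name and parameters are illustrative.

```python
def resolver_qps(clients: int, names_per_client: int, ttl_seconds: float) -> float:
    """Approximate steady-state lookup rate under the simplified model above."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    # Each (client, name) pair triggers one lookup per TTL window.
    return clients * names_per_client / ttl_seconds
```

For example, 2,000 clients that each resolve 20 names produce about 1,333 lookups per second at a 30-second TTL and 8,000 per second at a 5-second TTL, while the worst-case stale window shrinks from roughly 30 seconds to roughly 5.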
Practical checklist
- Automatic deregistration is in place when an instance fails or becomes isolated.
- Discovery behavior is tested under network partitions and controller failures.
- Retries and timeouts use jitter and bounded retry counts.
- Stale service endpoints and address-resolution latency are monitored.
- Service names and ownership rules are standardized at the platform level.
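The checklist item on retries can be sketched as a bounded retry loop with exponential backoff and full jitter. The function name, default values, and the choice to retry only on `ConnectionError` are assumptions for illustration; the `sleep` and `rng` parameters are injected only to keep the sketch testable.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Hypothetical helper: bounded retries with exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the failure to the caller
            # The backoff ceiling doubles per attempt and is capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random fraction of the ceiling so that
            # many clients do not retry in lockstep after a shared failure.
            sleep(rng() * ceiling)
    raise AssertionError("unreachable")  # the loop always returns or raises
```

Bounding the attempt count keeps a failing dependency from being amplified by retry traffic, and jitter spreads the retries that do happen.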
References
Related chapters
- DNS - The name-resolution foundation behind many discovery strategies.
- Service Mesh Architecture - A service mesh adds policy-aware routing on top of service discovery.
- Interservice communication patterns - Communication and discovery patterns need to be designed together.
- Kubernetes Fundamentals - Practical service discovery through Kubernetes Service, Endpoints, and DNS.
- Fault Tolerance Patterns - Discovery must work together with retries, circuit breakers, and health checks.
