Context
Interservice communication patterns
Communication between services becomes fragile without the correct discovery and routing layer.
Service discovery solves a basic problem of distributed environments: how services find each other when instances appear and disappear through scaling and failover. A reliable discovery layer shortens time-to-recovery and reduces the risk of cascading incidents.
Discovery models
Client-side discovery
The client itself receives a list of instances from the registry and selects an endpoint via local load balancing.
Server-side discovery
The client accesses a stable entry point (LB/proxy), and routing to services is hidden within the infrastructure.
DNS-based discovery
Services are published as DNS names; clients use standard DNS resolvers and TTL policies.
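As a rough illustration of DNS-based discovery, the sketch below resolves a service name into a deduplicated endpoint list with Python's standard resolver. The function name `resolve_endpoints` and the injectable `resolver` parameter are assumptions made for this example. Note that `socket.getaddrinfo` does not expose record TTLs, so any caching here is whatever the operating system resolver applies.

```python
import socket
from typing import Callable, List


def resolve_endpoints(
    name: str,
    port: int,
    resolver: Callable = socket.getaddrinfo,
) -> List[str]:
    """Resolve a DNS name to unique 'host:port' strings, keeping resolver order.

    `resolver` is injectable (an assumption of this sketch) so the function
    can be exercised without network access.
    """
    endpoints: List[str] = []
    for family, _type, _proto, _canon, sockaddr in resolver(
        name, port, type=socket.SOCK_STREAM
    ):
        host = sockaddr[0]
        # IPv6 literals are bracketed so the port separator stays unambiguous.
        if family == socket.AF_INET6:
            endpoint = f"[{host}]:{port}"
        else:
            endpoint = f"{host}:{port}"
        if endpoint not in endpoints:
            endpoints.append(endpoint)
    return endpoints
```

Calling `resolve_endpoints("service-a.internal", 8080)` would use the system resolver; the name here is a hypothetical placeholder.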
Client-side discovery
The SDK performs a registry lookup, keeps a local instance pool, and routes requests on the client side.
Strengths
- Maximum control over routing and retry logic on the client side.
- Fast reaction to local latency and error-rate metrics.
- No dependency on a central proxy in the data path.
Limitations
- Discovery SDK must be supported across all services and languages.
- Harder to enforce uniform rules across the whole platform.
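The client-side flow described above (registry lookup, local instance pool, local endpoint selection) can be sketched as follows. This is a minimal illustration, not a production SDK: `InMemoryRegistry` and `DiscoveryClient` are hypothetical names, the registry is a plain in-process dictionary, and round-robin is assumed as the routing policy.

```python
import itertools
from typing import Dict, List


class InMemoryRegistry:
    """Hypothetical stand-in for a real service registry."""

    def __init__(self) -> None:
        self._instances: Dict[str, List[str]] = {}

    def register(self, service: str, endpoint: str) -> None:
        self._instances.setdefault(service, []).append(endpoint)

    def lookup(self, service: str) -> List[str]:
        # Return a copy so callers cannot mutate registry state.
        return list(self._instances.get(service, []))


class DiscoveryClient:
    """Keeps a local instance pool and picks endpoints round-robin."""

    def __init__(self, registry: InMemoryRegistry, service: str) -> None:
        self._registry = registry
        self._service = service
        self._pool: List[str] = []
        self._counter = itertools.count()

    def refresh(self) -> None:
        """Replace the local pool with the registry's current view."""
        self._pool = self._registry.lookup(self._service)

    def pick(self) -> str:
        """Select the next endpoint from the local pool."""
        if not self._pool:
            raise LookupError(f"no instances known for {self._service!r}")
        return self._pool[next(self._counter) % len(self._pool)]
```

In a real SDK, `refresh` would typically run periodically or on a watch notification, and `pick` could weigh local latency and error-rate metrics instead of plain rotation.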
Key Components
- Service registry: stores current endpoints and instance metadata.
- Health checks: readiness and liveness probes signal whether traffic can be directed to an instance.
- Heartbeat/session TTL: instances that stop sending heartbeats are removed from the registry automatically once their TTL expires.
- Load balancing policy: round-robin, least-loaded, locality-aware routing.
- Retry/timeout policy: protection against short-term failures and network fluctuations.
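To make the heartbeat/TTL component concrete, here is a small sketch of a registry that forgets instances whose last heartbeat is older than the TTL. The class name `TtlRegistry` and the injectable `clock` parameter are assumptions for illustration; real registries usually pair this with health checks and replicated state.

```python
import time
from typing import Callable, Dict, List


class TtlRegistry:
    """Tracks endpoints by last heartbeat and drops those past their TTL."""

    def __init__(
        self,
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, endpoint: str) -> None:
        """Register the endpoint or renew its session."""
        self._last_seen[endpoint] = self._clock()

    def live_endpoints(self) -> List[str]:
        """Evict expired endpoints, then return the remaining ones sorted."""
        now = self._clock()
        self._last_seen = {
            endpoint: seen
            for endpoint, seen in self._last_seen.items()
            if now - seen <= self._ttl
        }
        return sorted(self._last_seen)
```

Eviction happens lazily on read in this sketch; a background sweeper is an equally valid design and avoids doing cleanup work on the lookup path.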
Foundation
DNS
DNS is the basic building block for many service discovery implementations.
Trade-offs
Registry consistency vs availability
Overly strict consistency can make discovery unavailable during network partitions.
TTL freshness vs DNS/query overhead
A short TTL speeds up route updates, but increases the load on the DNS/control plane.
Centralized control vs local autonomy
A centralized control plane is convenient, but it increases the blast radius in case of configuration errors.
Dynamic endpoints vs cache staleness
Caches speed up lookups, but can keep stale addresses during failover.
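The TTL and cache-staleness trade-offs can be seen in a few lines of code. The sketch below wraps an arbitrary lookup function in a TTL cache; `CachedLookup`, `fetch`, and `clock` are hypothetical names chosen for this example. Until the TTL elapses, callers keep receiving the old endpoint list even if the backing registry has already changed.

```python
import time
from typing import Callable, List, Optional


class CachedLookup:
    """Caches the result of `fetch` for `ttl_seconds`."""

    def __init__(
        self,
        fetch: Callable[[], List[str]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: List[str] = []
        self._expires_at: Optional[float] = None

    def get(self) -> List[str]:
        """Return cached endpoints, refetching only after the TTL expires."""
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = list(self._fetch())
            self._expires_at = now + self._ttl
        return list(self._value)

    def invalidate(self) -> None:
        """Force the next get() to refetch, e.g. after a connection error."""
        self._expires_at = None
```

A shorter TTL narrows the stale window at the cost of more `fetch` calls; invalidating on connection errors is one way to recover faster during failover without lowering the TTL globally.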
Practical checklist
- Automatic deregistration is implemented for nodes that fail or become isolated.
- Discovery behavior under network partitions and controller failures has been tested.
- Retries and timeouts are configured with jitter and a cap on the number of attempts.
- Stale endpoints and lookup latency in the discovery path are monitored.
- Service names and ownership are standardized at the platform level.
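For the retry item in the checklist, one common shape is exponential backoff with full jitter and a hard cap on attempts. The sketch below is an illustration under stated assumptions: `call_with_retries` and its parameters are hypothetical, and only `ConnectionError` and `TimeoutError` are treated as retryable.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Run `operation`, retrying transient failures with jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the last error
            # Exponential backoff capped at max_delay, scaled by a random
            # factor in [0, 1) so clients do not retry in lockstep.
            backoff = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * backoff)
    raise AssertionError("unreachable")
```

The `operation` is expected to enforce its own per-call timeout (for example, a socket or HTTP client timeout); the cap on attempts keeps retries from amplifying load during an outage.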
