Acing SDI
Practice task from chapter 11
Ad click event aggregator: dedupe, windowing, and consistent analytics outputs.
Ad Click Event Aggregator tests your ability to design a streaming system where speed, correctness, and metric explainability all matter at once. It is a common interview case at the boundary of data platform and product analytics.
Functional requirements
- Ingest ad click/impression/conversion events.
- Deduplicate events for billing correctness.
- Build minute/hour/day window aggregates.
- Serve realtime dashboards and batch reports.
Non-functional requirements
- Stable operation under campaign traffic bursts.
- Bounded latency for near-realtime analytics.
- Clear data freshness and lineage visibility.
- Controlled storage and recomputation costs.
High-Level Architecture
Theory
Streaming Data
Windowing, watermarks, late events, reprocessing, and realtime/batch trade-offs.
Stream ingest + window aggregation + reconciliation. This topology combines ingest flow, window aggregation, and a reconciliation/backfill control loop for billing correctness.
The architecture separates ingest, realtime serving, and reliability control loops with batch reconciliation. This keeps dashboard latency predictable while preserving billing correctness.
Write/Read Paths
How events are written into aggregates and how dashboards read metrics under load.
Write path: ingest accepts events, runs deduplication/windowing, and updates serving aggregates for near-realtime analytics.
Read path: dashboards query pre-computed minute/hour/day aggregates from the serving store, so read cost stays bounded even under bursty traffic.
Event Sources
SDK / trackers / pixels
Clicks, impressions, and conversions are sent to ingest endpoints.
Collector API
validate + enrich
Schema validation, enrichment, and idempotency key generation.
Stream + Dedupe
Kafka/PubSub + state
Stream processor applies dedupe, ordering, and late-event handling.
Window Aggregator
minute / hour / day
Windowed aggregates are computed and written into serving storage.
Serving Store
ClickHouse/Pinot
Aggregate storage optimized for fast analytical reads.
Write path checkpoints
- Ingress idempotency protects billing from double counting.
- Window aggregation builds minute/hour/day views while handling late events.
- Immutable raw storage remains the source of truth for replay and reconciliation.
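The raw-storage checkpoint implies a reconciliation job: recompute aggregates from the immutable event log and diff against the serving store. A sketch with hypothetical data shapes (deduped `(ad_id, minute)` tuples for raw events, a plain dict for the serving store):

```python
from collections import Counter

def reconcile(raw_events, served):
    """Recompute truth from raw events and report keys whose served
    count disagrees; these are candidates for backfill/correction."""
    truth = Counter(raw_events)
    mismatches = {}
    for key in set(truth) | set(served):
        if truth.get(key, 0) != served.get(key, 0):
            mismatches[key] = (truth.get(key, 0), served.get(key, 0))
    return mismatches

raw = [("a1", 10), ("a1", 10), ("a2", 10)]
served = {("a1", 10): 2, ("a2", 10): 0}
print(reconcile(raw, served))  # {('a2', 10): (1, 0)}
```

In production this runs as a periodic batch job over a bounded time range, and the mismatch set feeds both alerting and targeted backfills.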
Data and deduplication
- Idempotency key such as ad_id + user_id + ts_bucket.
- Late events handled via watermarks and grace periods.
- Schema evolution with strict versioning and backward compatibility.
- Aggregate correction through reprocessing over immutable raw data.
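The first two points can be made concrete: derive the dedupe key from ad_id + user_id + ts_bucket, and gate events against a watermark minus a grace period. The 1-minute bucket and 5-minute grace values below are assumptions, not requirements.

```python
import hashlib

def idempotency_key(ad_id, user_id, ts, bucket_secs=60):
    """Dedupe key from ad_id + user_id + ts_bucket (bucket size assumed)."""
    bucket = ts // bucket_secs
    return hashlib.sha256(f"{ad_id}:{user_id}:{bucket}".encode()).hexdigest()

def accept_event(event_ts, watermark, grace_secs=300):
    """Late-event policy: events older than watermark - grace are rejected
    from the hot path (and would be routed to a correction path)."""
    return event_ts >= watermark - grace_secs

k1 = idempotency_key("a1", "u1", 120)
k2 = idempotency_key("a1", "u1", 150)  # same minute bucket -> same key
assert k1 == k2
print(accept_event(700, 1000))  # True: within the grace period
print(accept_event(600, 1000))  # False: too late for the hot path
```

Hashing keeps the key fixed-width regardless of identifier length; the grace period trades window completeness against dashboard freshness.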
SLO and operational metrics
- Data freshness (p95 end-to-end lag).
- Duplicate rate and window completeness.
- Reprocessing duration and backfill cost.
- Mismatch between online dashboard and billing reports.
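Two of these metrics are straightforward to compute; a sketch of p95 end-to-end lag (nearest-rank percentile) and duplicate rate, over made-up sample values:

```python
def p95_lag(lags):
    """p95 of end-to-end lags (event time -> serving time), nearest-rank."""
    s = sorted(lags)
    idx = max(0, round(0.95 * len(s)) - 1)
    return s[idx]

def duplicate_rate(total_received, unique_accepted):
    """Fraction of received events dropped as duplicates."""
    if total_received == 0:
        return 0.0
    return (total_received - unique_accepted) / total_received

lags = [1.2, 0.8, 2.5, 1.1, 30.0]  # seconds; hypothetical sample
print(p95_lag(lags))                # 30.0
print(duplicate_rate(1000, 990))    # 0.01
```

A sudden rise in duplicate rate usually points at client retries or redelivery upstream, while a p95 lag spike with stable p50 suggests a slow partition rather than global backpressure.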
Questions to clarify in interview
- Required billing precision: near-exact or acceptable tolerance.
- Dashboard freshness SLA and what lag is considered critical.
- Need for drill-down into raw events and retention duration.
- Auditability and legal/compliance constraints for event history.
