System Design Space

Updated: April 16, 2026 at 10:20 PM

CPU and GPU: overview and differences

medium

How to choose between CPU and GPU by workload shape: latency versus throughput, hybrid pipelines, and the infrastructure cost of acceleration.

This chapter matters not because it compares benchmark numbers, but because it explains the nature of the work itself: where you need a fast response to one request and where massive parallelism actually helps.

In practice, it helps you choose the compute path by workload shape: what is latency-sensitive, what belongs in batched execution, and where data-movement overhead cancels the theoretical GPU win.

In interviews and design discussions, it gives you a concrete way to justify CPU-versus-GPU choices through workload shape, throughput, and operating cost instead of hype.

Practical value of this chapter

Workload profile

Supports CPU/GPU decisions by workload shape: latency sensitivity, batch size, and parallelism level.

Resource economics

Encourages evaluating not only raw performance, but also accelerator cost, data movement, and fallback modes.

Capacity planning

Provides a practical model for inference, analytics, and mixed compute scenarios.

Interview framing

Helps justify compute stack selection with measurable criteria rather than trends.

Source

Central processing unit

General CPU structure and its role in computation.


CPU versus GPU is really a question of latency versus throughput. CPUs win where individual requests need a quick response and predictable tail latency, for example tight control over p99. GPUs win where homogeneous work can be batched and executed in parallel.

In real systems, the choice is rarely exclusive: CPU keeps orchestration, business logic, and fallback paths, while GPU accelerates inference or bulk numeric stages. The real cost is shaped by VRAM, memory bandwidth, and data movement, not only by raw FLOPS.

That is why this chapter is as much about measurement as it is about hardware: CPU vectorization, interconnects such as NVLink, and profiling tools such as flamegraphs and pprof all matter when deciding where the real bottleneck lives.

How CPU is organized

  • Cores execute instructions; the operating system schedules threads onto them.
  • The ALU performs arithmetic and logical operations.
  • The control unit manages instruction sequencing and execution flow.
  • Registers provide the fastest memory near each core.
  • L1/L2/L3 caches reduce the cost of data access.
  • The memory controller and buses connect CPU with RAM and devices.

Source

Graphics processing unit

GPU architecture and key properties of parallel computing.


How GPU is organized

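  • SM (streaming multiprocessor, NVIDIA) or CU (compute unit, AMD) blocks handle massively parallel execution.
  • Kernel launches create thousands of lightweight threads, grouped into blocks.
  • Schedulers dispatch warps (NVIDIA) or wavefronts (AMD), fixed-size groups of threads, onto the compute units.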
  • VRAM provides high-bandwidth local memory.
  • Cache and memory controllers improve data access efficiency.
  • The command processor accepts and schedules tasks from the CPU.

CPU vs GPU comparison

CPU

  • A small number of complex cores
  • High performance per thread
  • Strong on branching and irregular control flow

A small number of powerful cores execute heterogeneous tasks.

GPU

  • Many simpler compute units
  • High throughput on homogeneous operations
  • Best for massively parallel workloads

Many simpler cores execute similar work in parallel.

Dynamic visualization: workload simulator

The interactive simulator takes a workload profile and a batch size (range: 8 - 256) and uses a simplified model to show when CPU wins and when GPU wins. One captured state, at batch size 32:

CPU: fit for this profile 88%, batch processing time 7 ms, estimated throughput 4571 units/s.

GPU: fit for this profile 35%, batch processing time 8 ms, estimated throughput 4000 units/s.

Simulation result: relative batch time, CPU to GPU, is 0.88x (CPU is faster in this scenario).

Bottleneck: branching, lock contention, and tail latency. Recommendation: keep CPU as the main execution path and offload only isolated compute-heavy stages to GPU.
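The derived figures in the simulator snapshot follow from two simple ratios. The sketch below assumes that throughput is batch size divided by batch time and that the relative figure is CPU time over GPU time; the function names are illustrative and are not taken from the page's actual model.

```python
def throughput_units_per_s(batch_size: int, batch_time_ms: float) -> float:
    """Units processed per second when one batch takes batch_time_ms."""
    return batch_size * 1000.0 / batch_time_ms


def relative_batch_time(cpu_ms: float, gpu_ms: float) -> float:
    """CPU batch time relative to GPU; a value below 1.0 means CPU is faster."""
    return cpu_ms / gpu_ms


# Inputs copied from the snapshot: batch size 32, CPU 7 ms, GPU 8 ms.
cpu_tput = throughput_units_per_s(32, 7.0)
gpu_tput = throughput_units_per_s(32, 8.0)
ratio = relative_batch_time(7.0, 8.0)

print(f"CPU {cpu_tput:.0f} units/s, GPU {gpu_tput:.0f} units/s, ratio {ratio:.3f}x")
```

With these inputs the ratio is 0.875, which the page shows rounded as 0.88x.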

Dynamic visualization: hybrid CPU + GPU pipeline

In real systems, CPU and GPU almost always work as one pipeline, and the load shifts between them from stage to stage.

The transfer stage is often where hidden cost appears: even fast links such as NVLink do not remove the price of data copies and GPU-launch preparation.

Stage shown: input validation, feature extraction, serialization, and batch assembly.

CPU utilization: 82%
GPU utilization: 12%

Key metric: batch preparation time

Optimization focus: CPU vectorization, fewer copies, and lower allocation churn

Where CPU fits better and where GPU fits better

CPU

  • Server requests with complex business logic
  • Transactions, service coordination, and OS-level tasks
  • Workloads with unpredictable branching

GPU

  • Graphics and rendering
  • Machine learning and tensor-heavy operations
  • Large-scale parallel compute and simulation

Common CPU/GPU selection mistakes

Choosing GPU just because an accelerator is available

Moving branch-heavy logic to GPU without accounting for launch overhead often makes latency worse.

Ignoring data transfer costs

With small batches, the time spent moving data between CPU and GPU can erase the entire expected gain.
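A rough break-even model makes the small-batch effect concrete. The sketch below assumes a fixed per-launch overhead plus per-item transfer and compute costs on the GPU side, and a purely per-item cost on the CPU side. Every number and name is hypothetical and chosen only for illustration.

```python
def cpu_time_ms(n: int, cpu_ms_per_item: float) -> float:
    """CPU cost grows linearly with batch size and has no fixed overhead here."""
    return n * cpu_ms_per_item


def gpu_time_ms(n: int, fixed_overhead_ms: float,
                transfer_ms_per_item: float, gpu_ms_per_item: float) -> float:
    """GPU cost: launch/setup overhead plus per-item copy and compute."""
    return fixed_overhead_ms + n * (transfer_ms_per_item + gpu_ms_per_item)


def break_even_batch(cpu_ms_per_item: float, fixed_overhead_ms: float,
                     transfer_ms_per_item: float, gpu_ms_per_item: float,
                     max_batch: int = 4096):
    """Smallest batch size at which the GPU path is faster, or None if never."""
    for n in range(1, max_batch + 1):
        gpu = gpu_time_ms(n, fixed_overhead_ms, transfer_ms_per_item, gpu_ms_per_item)
        if gpu < cpu_time_ms(n, cpu_ms_per_item):
            return n
    return None


# Hypothetical costs: 0.2 ms/item on CPU; 2 ms launch overhead,
# 0.02 ms/item transfer and 0.01 ms/item compute on GPU.
n_star = break_even_batch(0.2, 2.0, 0.02, 0.01)

# If moving an item costs more than computing it on CPU, GPU never wins.
never = break_even_batch(0.2, 2.0, 0.30, 0.01)
```

Under these assumed costs, batches below the break-even size are cheaper to keep on CPU, and a transfer cost above the CPU per-item cost removes the GPU advantage at any batch size.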

Looking only at host-level metrics

Host CPU and RAM metrics are not enough; you also need GPU utilization, memory-bandwidth, and queue-depth signals.

No fallback path when GPU is unavailable

Without a CPU fallback path, the system loses resilience under GPU scarcity or scheduling failures.
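A CPU fallback can be as small as a wrapper around the accelerated call. This is a minimal sketch: `GpuUnavailable`, `gpu_score`, and `cpu_score` are hypothetical names, and the GPU call is a stand-in rather than a real accelerator API.

```python
class GpuUnavailable(Exception):
    """Hypothetical error raised when no accelerator can take the batch."""


def cpu_score(batch):
    # Reference implementation: slower, but always available.
    return [x * x for x in batch]


def gpu_score(batch, gpu_ready):
    # Stand-in for an accelerator call; a real one would submit a kernel or model run.
    if not gpu_ready:
        raise GpuUnavailable("no GPU capacity")
    return [x * x for x in batch]


def score_with_fallback(batch, gpu_ready):
    """Try the GPU path first; degrade to CPU instead of failing the request."""
    try:
        return gpu_score(batch, gpu_ready), "gpu"
    except GpuUnavailable:
        return cpu_score(batch), "cpu"


result, path = score_with_fallback([1, 2, 3], gpu_ready=False)
print(result, path)
```

Returning the path taken alongside the result lets the caller count fallbacks, which is a useful signal for GPU scarcity.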

Profiling should show where time is actually spent: flamegraphs and pprof help on CPU, while GPU traces and queue telemetry clarify whether the accelerator is busy or just waiting on data.
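Before reaching for a full profiler, a coarse per-stage timer can already show whether time goes to preparation, transfer, or compute. The sketch below is an illustrative helper, not a replacement for flamegraphs, pprof, or GPU traces; `StageTimer` and the stage names are assumptions.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so the timer can be tested deterministically
        self.totals = {}

    @contextmanager
    def stage(self, name):
        start = self._clock()
        try:
            yield
        finally:
            elapsed = self._clock() - start
            self.totals[name] = self.totals.get(name, 0.0) + elapsed

    def shares(self):
        """Fraction of total measured time spent in each stage."""
        total = sum(self.totals.values())
        if not total:
            return {}
        return {name: t / total for name, t in self.totals.items()}


timer = StageTimer()
with timer.stage("prepare"):
    batch = [x * 2 for x in range(1000)]
with timer.stage("compute"):
    total = sum(batch)

print(timer.shares())
```

If one stage dominates the shares, that is the stage worth profiling in detail.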

Practical recommendations

Hybrid pipeline

Keep control flow, business logic, and fallback paths on CPU; move bulk math to GPU.

Batch size driven by SLO

Tune batch size from latency SLOs: too small underutilizes GPU, too large increases tail latency.
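One way to turn this into a number is to pick the largest batch whose worst-case latency still fits the SLO. The sketch below assumes a steady arrival rate, a worst-case wait for the batch to fill, and a linear compute cost. The model and all parameters are hypothetical, and the candidate sizes simply mirror the simulator's 8 - 256 range.

```python
def worst_case_latency_ms(batch: int, arrival_rate_per_s: float,
                          fixed_ms: float, per_item_ms: float) -> float:
    """Latency seen by the first request in a batch: fill wait plus compute."""
    fill_wait_ms = (batch - 1) / arrival_rate_per_s * 1000.0
    return fill_wait_ms + fixed_ms + per_item_ms * batch


def largest_batch_within_slo(slo_ms: float, arrival_rate_per_s: float,
                             fixed_ms: float, per_item_ms: float,
                             candidates=(8, 16, 32, 64, 128, 256)):
    """Largest candidate batch that meets the SLO, or None if none does."""
    best = None
    for batch in sorted(candidates):
        latency = worst_case_latency_ms(batch, arrival_rate_per_s, fixed_ms, per_item_ms)
        if latency <= slo_ms:
            best = batch
    return best


# Hypothetical service: 50 ms SLO, 2000 requests/s,
# 5 ms fixed GPU cost per batch, 0.05 ms per item.
chosen = largest_batch_within_slo(50.0, 2000.0, 5.0, 0.05)

# An SLO tighter than the smallest batch's latency leaves no valid batch size.
too_tight = largest_batch_within_slo(5.0, 2000.0, 5.0, 0.05)
```

When no candidate fits, that is a sign the stage should either run unbatched or stay on CPU.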

Profile before and after

Profile CPU and GPU before and after optimization so decisions stay measurement-driven.

Capacity planning across two compute domains

Model CPU-bound and GPU-bound stages separately to plan scaling and infrastructure cost correctly.
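Sizing the two domains separately can be sketched with simple arithmetic. The example assumes each stage has a measured per-node throughput and a target utilization. All figures, including the hourly costs, are hypothetical.

```python
import math


def nodes_needed(peak_rps: float, per_node_rps: float,
                 target_utilization: float = 0.7) -> int:
    """Nodes required to serve peak load while staying under target utilization."""
    return math.ceil(peak_rps / (per_node_rps * target_utilization))


def plan(peak_rps: float, cpu_node_rps: float, gpu_node_rps: float,
         cpu_node_cost: float, gpu_node_cost: float) -> dict:
    """Size the CPU-bound and GPU-bound stages independently, then sum the cost."""
    cpu_nodes = nodes_needed(peak_rps, cpu_node_rps)
    gpu_nodes = nodes_needed(peak_rps, gpu_node_rps)
    return {
        "cpu_nodes": cpu_nodes,
        "gpu_nodes": gpu_nodes,
        "hourly_cost": cpu_nodes * cpu_node_cost + gpu_nodes * gpu_node_cost,
    }


# Hypothetical inputs: 10,000 req/s peak; a CPU node sustains 1,500 req/s of
# preprocessing, a GPU node sustains 4,000 req/s of inference.
capacity = plan(10_000, 1_500, 4_000, cpu_node_cost=0.5, gpu_node_cost=3.0)
print(capacity)
```

Keeping the two node counts separate shows which stage drives cost as load grows.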

Practical takeaway

In production systems, CPU and GPU usually work as a pair: CPU handles control flow, routing, and fallback paths, while GPU takes on large homogeneous computations. The decision depends first on workload shape: latency sensitivity and branching favor CPU, while uniform parallel work favors GPU.

Why this matters for system design

  • Helps choose the right compute path for each workload and design the processing pipeline deliberately.
  • Directly affects infrastructure cost: CPU-heavy and GPU-heavy services scale in different ways.
  • Defines memory and network constraints: GPU paths are especially sensitive to bandwidth and transfer strategy.
  • Reduces architecture risk: CPU fallback and explicit SLOs keep the system resilient under GPU scarcity.
