This chapter matters less for benchmark numbers than for what it explains about the workload itself: where a single request needs a fast response and where massive parallelism actually helps.
In practice, it helps you choose the compute path by workload shape: what is latency-sensitive, what belongs in batched execution, and where data-movement overhead cancels the theoretical GPU win.
In interviews and design discussions, it gives you a concrete way to justify CPU-versus-GPU choices through workload shape, throughput, and operating cost instead of hype.
Practical value of this chapter
Workload profile
Supports CPU/GPU decisions by workload shape: latency sensitivity, batch size, and parallelism level.
Resource economics
Encourages evaluating not only raw performance, but also accelerator cost, data movement, and fallback modes.
Capacity planning
Provides a practical model for inference, analytics, and mixed compute scenarios.
Interview framing
Helps justify compute stack selection with measurable criteria rather than trends.
Source
Central processing unit
General CPU structure and its role in computation.
CPU versus GPU is really a question of latency versus throughput. CPUs win where individual requests need a quick response and predictable tail latency, such as a tight p99 target. GPUs win where homogeneous work can be batched and executed in parallel.
In real systems, the choice is rarely exclusive: CPU keeps orchestration, business logic, and fallback paths, while GPU accelerates inference or bulk numeric stages. The real cost is shaped by VRAM, memory bandwidth, and data movement, not only by raw FLOPS.
That is why this chapter is as much about measurement as it is about hardware: CPU vectorization, interconnects such as NVLink, and profiling tools such as flamegraphs and pprof all matter when deciding where the real bottleneck lives.
How a CPU is organized
- Cores fetch, decode, and execute instructions; the operating system schedules threads onto them.
- The ALU performs arithmetic and logical operations.
- The control unit manages instruction sequencing and execution flow.
- Registers provide the fastest memory near each core.
- L1/L2/L3 caches reduce the cost of data access.
- The memory controller and buses connect the CPU to RAM and devices.
Source
Graphics processing unit
GPU architecture and key properties of parallel computing.
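How a GPU is organized
- SM/CU blocks (streaming multiprocessors on NVIDIA, compute units on AMD) handle massively parallel execution.
- Each kernel launch starts thousands of lightweight threads, organized into groups.
- Schedulers issue warps (NVIDIA) or wavefronts (AMD) on the compute units.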
- VRAM provides high-bandwidth local memory.
- Cache and memory controllers improve data access efficiency.
- The command processor accepts and schedules tasks from the CPU.
CPU vs GPU comparison
CPU
- A small number of complex cores
- High performance per thread
- Strong on branching and irregular control flow
A small number of powerful cores execute heterogeneous tasks.
GPU
- Many simpler compute units
- High throughput on homogeneous operations
- Best for massively parallel workloads
Many simpler cores execute similar work in parallel.
Dynamic visualization: workload simulator
The simplified model below shows when CPU wins and when GPU wins. The snapshot shown here uses a batch size of 32; the simulator supports batch sizes from 8 to 256.
CPU
Fit for this profile: 88%
Batch processing time: 7 ms
Estimated throughput: 4571 units/s
GPU
Fit for this profile: 35%
Batch processing time: 8 ms
Estimated throughput: 4000 units/s
Simulation result
Relative batch time, CPU to GPU: 0.88x (CPU is faster in this scenario).
Bottleneck: branching, lock contention, and tail latency. Recommendation: Keep CPU as the main execution path and offload only isolated compute-heavy stages to GPU.
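The throughput and ratio figures in the simulation result follow from simple arithmetic. As a minimal sketch (the function names are illustrative, and the model assumes one batch is processed at a time with no overlap between batches), the numbers can be reproduced like this:

```python
def throughput(batch_size: int, batch_time_ms: float) -> float:
    """Units processed per second when batches run one after another."""
    return batch_size / (batch_time_ms / 1000.0)


def relative_batch_time(cpu_ms: float, gpu_ms: float) -> float:
    """CPU batch time divided by GPU batch time; below 1.0 means CPU is faster."""
    return cpu_ms / gpu_ms


# Values taken from the simulator snapshot: batch of 32, 7 ms on CPU, 8 ms on GPU.
batch = 32
cpu_ms, gpu_ms = 7.0, 8.0

cpu_tps = throughput(batch, cpu_ms)
gpu_tps = throughput(batch, gpu_ms)
ratio = relative_batch_time(cpu_ms, gpu_ms)

print(f"CPU: {cpu_tps:.0f} units/s")
print(f"GPU: {gpu_tps:.0f} units/s")
print(f"CPU/GPU batch time: {ratio:.3f}x")
```

The ratio only compares batch time at one batch size. Changing the batch size or the workload profile changes both inputs, so the comparison has to be recomputed from measured times rather than extrapolated.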
Dynamic visualization: hybrid CPU + GPU pipeline
In real systems, CPU and GPU almost always work as one pipeline.
The transfer stage is often where hidden cost appears: even fast links such as NVLink do not remove the price of data copies and GPU-launch preparation.
Input validation, feature extraction, serialization, and batch assembly.
Key metric
Batch preparation time
Optimization focus
CPU vectorization, fewer copies, and lower allocation churn
Where CPU fits better and where GPU fits better
CPU
- Server requests with complex business logic
- Transactions, service coordination, and OS-level tasks
- Workloads with unpredictable branching
GPU
- Graphics and rendering
- Machine learning and tensor-heavy operations
- Large-scale parallel compute and simulation
Common CPU/GPU selection mistakes
Choosing GPU just because an accelerator is available
Moving branch-heavy logic to GPU without accounting for launch overhead often makes latency worse.
Ignoring data transfer costs
With small batches, the time spent moving data between CPU and GPU can erase the entire expected gain.
Looking only at host-level metrics
CPU and RAM metrics are not enough; you also need GPU utilization, memory-bandwidth, and queue-depth signals.
No fallback path when the GPU is unavailable
Without a CPU fallback path, the system loses resilience under GPU scarcity or scheduling failures.
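The data-transfer mistake can be made concrete with a small break-even model. The sketch below is an assumption-laden simplification, not a measured result: `CostModel`, its fields, and all the microsecond figures are hypothetical, and the model treats launch overhead as a fixed cost per batch and transfer as a linear cost per item.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CostModel:
    """Hypothetical per-batch cost model; every number is illustrative."""
    cpu_us_per_item: float        # CPU compute time per item
    gpu_us_per_item: float        # GPU compute time per item, data already on device
    transfer_us_per_item: float   # host-to-device plus device-to-host copy per item
    launch_overhead_us: float     # fixed cost per GPU batch (launch and sync)

    def cpu_time_us(self, n: int) -> float:
        return n * self.cpu_us_per_item

    def gpu_time_us(self, n: int) -> float:
        per_item = self.transfer_us_per_item + self.gpu_us_per_item
        return self.launch_overhead_us + n * per_item


def break_even_batch(model: CostModel, max_batch: int = 4096) -> Optional[int]:
    """Smallest batch size where the GPU path beats the CPU path, or None."""
    for n in range(1, max_batch + 1):
        if model.gpu_time_us(n) < model.cpu_time_us(n):
            return n
    return None


# GPU is 10x faster per item, but pays 200 us per batch plus 2 us per item to move data.
cheap_transfer = CostModel(10.0, 1.0, 2.0, 200.0)
# Same compute speeds, but copying one item costs more than computing it on the CPU.
costly_transfer = CostModel(10.0, 1.0, 12.0, 200.0)

print(break_even_batch(cheap_transfer))
print(break_even_batch(costly_transfer))
```

In the first model the GPU only pays off once the batch is large enough to amortize the fixed overhead. In the second, per-item transfer alone exceeds the CPU cost, so no batch size helps, however fast the GPU compute is.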
Profiling should show where time is actually spent: flamegraphs and pprof help on CPU, while GPU traces and queue telemetry clarify whether the accelerator is busy or just waiting on data.
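Before reaching for a full profiler, a coarse per-stage wall-clock breakdown is often enough to see which part of the pipeline dominates. The sketch below is not a flamegraph or pprof replacement; `StageTimer` and the stage names are hypothetical, and it measures host-side time only. GPU work is typically launched asynchronously, so timing a real GPU stage this way would require synchronizing with the device before the stage ends.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage (illustrative helper)."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock            # injectable so the timer can be tested
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name: str):
        start = self._clock()
        try:
            yield
        finally:
            # Record elapsed time even if the stage raises.
            self.totals[name] += self._clock() - start

    def shares(self) -> dict:
        """Fraction of total measured time spent in each stage."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}

    def bottleneck(self) -> str:
        """Name of the stage with the largest accumulated time."""
        return max(self.totals, key=self.totals.get)


timer = StageTimer()
with timer.stage("cpu_prepare"):
    batch = [float(i) for i in range(10_000)]
with timer.stage("compute"):
    result = sum(x * x for x in batch)

print(timer.bottleneck(), timer.shares())
```

If the preparation stage dominates, the accelerator is probably waiting on data, and a faster GPU would not change the end-to-end time.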
Practical recommendations
Hybrid pipeline
Keep control flow, business logic, and fallback paths on CPU; move bulk math to GPU.
Batch size driven by SLO
Derive batch size from the latency SLO: batches that are too small underutilize the GPU, and batches that are too large raise tail latency.
Profile before and after
Profile CPU and GPU before and after optimization so decisions stay measurement-driven.
Capacity planning across two compute domains
Model CPU-bound and GPU-bound stages separately to plan scaling and infrastructure cost correctly.
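The batch-size recommendation above can be sketched as a search for the largest batch that still fits the latency budget. This is a simplified model under stated assumptions: requests arrive at a steady rate, the first request in a batch waits for the batch to fill, and processing time is a fixed cost plus a linear per-item cost. The function names and all numbers are hypothetical.

```python
from typing import Iterable, Optional


def worst_case_latency_ms(batch: int, arrival_per_s: float,
                          fixed_ms: float, per_item_ms: float) -> float:
    """Latency seen by the first request in a batch: fill wait plus processing."""
    fill_wait_ms = (batch - 1) / arrival_per_s * 1000.0
    processing_ms = fixed_ms + per_item_ms * batch
    return fill_wait_ms + processing_ms


def largest_batch_within_slo(slo_ms: float, arrival_per_s: float,
                             fixed_ms: float, per_item_ms: float,
                             candidates: Iterable[int] = (8, 16, 32, 64, 128, 256),
                             ) -> Optional[int]:
    """Largest candidate batch whose worst-case latency fits the SLO, or None."""
    fitting = [
        b for b in candidates
        if worst_case_latency_ms(b, arrival_per_s, fixed_ms, per_item_ms) <= slo_ms
    ]
    return max(fitting) if fitting else None


# Illustrative inputs: 2000 requests/s, 5 ms fixed cost, 0.1 ms per item.
relaxed = largest_batch_within_slo(50.0, 2000.0, 5.0, 0.1)
strict = largest_batch_within_slo(5.0, 2000.0, 5.0, 0.1)

print(relaxed, strict)
```

A result of `None` means no candidate batch fits the budget under this model, which is a signal to keep that path on CPU or to relax the SLO rather than to force batching.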
Practical takeaway
In production systems, CPU and GPU usually work as a pair: CPU handles control flow, routing, and fallback paths, while GPU takes on large homogeneous computations. The decision depends first on workload shape: latency and branching favor CPU, while uniform parallel work favors GPU.
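A minimal sketch of that pairing, assuming the caller supplies both implementations: the dispatcher below routes small batches to CPU, sends large batches to GPU, and falls back to CPU when the accelerator is unavailable. `GpuUnavailable`, `run_batch`, the `min_gpu_batch` threshold, and the stand-in compute functions are all hypothetical; a real threshold should come from measurement.

```python
from typing import Callable, List, Sequence, Tuple

Compute = Callable[[Sequence[float]], List[float]]


class GpuUnavailable(RuntimeError):
    """Raised by the GPU path when no device can take the work (hypothetical)."""


def run_batch(items: Sequence[float], gpu_fn: Compute, cpu_fn: Compute,
              min_gpu_batch: int = 32) -> Tuple[List[float], str]:
    """Return (result, path), where path names the route that produced it."""
    if len(items) < min_gpu_batch:
        # Small batches stay on CPU: launch and transfer overhead would dominate.
        return cpu_fn(items), "cpu"
    try:
        return gpu_fn(items), "gpu"
    except GpuUnavailable:
        # Same computation on the CPU path keeps the service available.
        return cpu_fn(items), "cpu-fallback"


# Stand-ins for real implementations; no actual GPU is used here.
def cpu_square(items: Sequence[float]) -> List[float]:
    return [x * x for x in items]


def gpu_square(items: Sequence[float]) -> List[float]:
    return [x * x for x in items]


def failing_gpu_square(items: Sequence[float]) -> List[float]:
    raise GpuUnavailable("no device scheduled")


small = [1.0, 2.0, 3.0]
large = [float(i) for i in range(64)]

print(run_batch(small, gpu_square, cpu_square)[1])
print(run_batch(large, gpu_square, cpu_square)[1])
print(run_batch(large, failing_gpu_square, cpu_square)[1])
```

Returning the path alongside the result makes it possible to count fallbacks in telemetry, so a silent shift of load onto CPU shows up before it affects latency.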
Why this matters for system design
- Helps choose the right compute path for each workload and design the processing pipeline deliberately.
- Directly affects infrastructure cost: CPU-heavy and GPU-heavy services scale in different ways.
- Defines memory and network constraints: GPU paths are especially sensitive to bandwidth and transfer strategy.
- Reduces architecture risk: CPU fallback and explicit SLOs keep the system resilient under GPU scarcity.
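Because CPU-heavy and GPU-heavy stages scale differently, it helps to size them separately. The sketch below is a rough planning model, not a pricing tool: the per-instance throughput figures, the 70% target utilization, and the hourly prices are all hypothetical placeholders to be replaced with measured and quoted values.

```python
import math


def instances_needed(peak_rps: float, per_instance_rps: float,
                     target_utilization: float = 0.7) -> int:
    """Instances required to serve peak load while staying under target utilization."""
    usable_rps = per_instance_rps * target_utilization
    return math.ceil(peak_rps / usable_rps)


peak_rps = 10_000

# Hypothetical measured throughput per instance for each stage.
cpu_nodes = instances_needed(peak_rps, per_instance_rps=1_200)
gpu_nodes = instances_needed(peak_rps, per_instance_rps=4_000)

# Hypothetical hourly prices per instance.
CPU_PRICE_PER_HOUR = 0.40
GPU_PRICE_PER_HOUR = 2.50
hourly_cost = cpu_nodes * CPU_PRICE_PER_HOUR + gpu_nodes * GPU_PRICE_PER_HOUR

print(f"CPU nodes: {cpu_nodes}, GPU nodes: {gpu_nodes}, cost/hour: {hourly_cost:.2f}")
```

Sizing the two stages independently shows which one drives cost as load grows, and how much extra CPU capacity a fallback path would need if the GPU pool became unavailable.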
Related chapters
- Why foundational knowledge matters - shows how hardware and runtime constraints become architecture decisions in practice.
- Structured Computer Organization (short summary) - provides hardware fundamentals: abstraction layers, ISA, memory, and CPU-device interaction.
- Operating system: overview - connects the topic to scheduling, system calls, and the impact of the kernel on latency.
- RAM and storage - shows why memory bandwidth and data locality often matter more than peak compute capacity.
- Performance Engineering - adds practical guidance for profiling CPU/GPU load and improving full-path performance.
- The history of Google TPUs and their evolution - broadens the accelerator comparison and clarifies when GPU or TPU gives better results.
