This chapter matters not because it compares benchmark numbers, but because it explains the different nature of the work: general-purpose sequential control on CPUs versus massive parallelism on GPUs.
In practice, it helps you choose the compute path by workload shape: what is latency-sensitive, what belongs in batch, and where data-transfer overhead cancels the theoretical GPU win.
In interviews and design reviews, it gives you a concrete way to justify compute choices through workload shape, throughput, and operating cost instead of hype.
Practical value of this chapter
Workload profile
Supports CPU/GPU decisions by workload shape: latency sensitivity, batch mode, and parallelism level.
Resource economics
Encourages evaluating both performance and total operating cost of compute choices.
Capacity planning
Provides a practical model for inference, analytics, and mixed compute scenarios.
Interview framing
Helps justify compute stack selection with measurable criteria rather than trends.
Central processing unit
General CPU structure and its role in computation.
CPUs and GPUs solve the same fundamental problem in different ways: the CPU is optimized for versatility and low latency, while the GPU is built for massive parallelism and throughput.
CPU architecture basics
- Cores execute independent instruction streams in parallel; the OS schedules software threads onto them.
- ALU performs arithmetic and logical operations.
- Control Unit manages instruction sequencing and execution flow.
- Registers provide the fastest memory near each core.
- L1/L2/L3 caches reduce data access latency.
- Memory controller and buses connect CPU with RAM and devices.
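The value of the cache hierarchy can be made concrete with a small model of average memory access time. This is a hedged sketch: the latency numbers are illustrative assumptions (real values vary widely by microarchitecture), and the hit rates are hypothetical inputs.

```python
# Hypothetical latencies in cycles for each level of the hierarchy;
# real values differ per microarchitecture.
LATENCY = {"L1": 4, "L2": 12, "L3": 40, "RAM": 200}

def avg_access_time(hit_rates):
    """Average memory access time in cycles.

    hit_rates maps a cache level to the probability of a hit given
    that the access reached that level; a miss falls through to the
    next, slower level.
    """
    time = 0.0
    p_reach = 1.0  # probability the access reaches this level
    for level in ("L1", "L2", "L3"):
        hit = hit_rates[level]
        time += p_reach * hit * LATENCY[level]
        p_reach *= 1.0 - hit
    time += p_reach * LATENCY["RAM"]  # every cache level missed
    return time

# Cache-friendly access pattern vs. a pattern with poor locality.
print(avg_access_time({"L1": 0.95, "L2": 0.80, "L3": 0.70}))  # ~5.2 cycles
print(avg_access_time({"L1": 0.50, "L2": 0.50, "L3": 0.50}))  # 35.0 cycles
```

Even with the same hardware, a drop in locality multiplies the effective access time, which is why data layout often matters more than raw clock speed.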
Graphics processing unit
GPU architecture and key properties of parallel computing.
GPU architecture basics
- Streaming multiprocessors (NVIDIA SMs, AMD CUs) execute massively parallel work.
- Thousands of lightweight threads are organized into groups and launched in bulk.
- Schedulers map warps (NVIDIA) and wavefronts (AMD) onto execution units.
- VRAM provides high-bandwidth local memory.
- Cache and memory controllers improve data access efficiency.
- Command processor accepts and schedules tasks from the CPU.
CPU vs GPU comparison
CPU
- Small number of complex cores
- High per-thread performance
- Best for branching and latency-critical paths
A small number of powerful cores execute heterogeneous tasks.
GPU
- Many relatively simple cores
- High throughput for homogeneous operations
- Great for data-parallel workloads
Many simpler cores execute similar work in parallel.
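The contrast can be sketched in code. The first function has the classic SAXPY shape: one uniform, branch-free operation per element, which maps naturally onto many simple cores. The second is branch-heavy per-element logic that would cause divergence on a GPU, where threads taking different branches within a warp serialize. Both functions are illustrative stand-ins in plain Python, not real kernels.

```python
def saxpy(a, xs, ys):
    """y = a*x + y for every element: one identical, branch-free
    operation per element, so the work is trivially data-parallel."""
    return [a * x + y for x, y in zip(xs, ys)]

def branchy(xs):
    """Per-element logic with data-dependent branches: on a GPU,
    threads in the same warp that take different branches execute
    serially, wasting the parallel hardware."""
    out = []
    for x in xs:
        if x % 3 == 0:
            out.append(x // 3)
        elif x % 2 == 0:
            out.append(x * x)
        else:
            out.append(x + 1)
    return out

print(saxpy(2, [1, 2, 3], [10, 20, 30]))  # [12, 24, 36]
print(branchy([3, 4, 5]))                 # [1, 16, 6]
```

The uniform shape of `saxpy` is exactly what GPUs accelerate well; the divergent shape of `branchy` is what keeps latency-critical branching logic on the CPU.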
Dynamic visualization: workload simulator
In the interactive version, you select a workload profile and batch size, and a simplified model shows when CPU wins and when GPU wins. A snapshot for a branch-heavy profile at batch size 32 (adjustable from 8 to 256):
- CPU: fit for this profile 88%, batch processing time 7 ms, throughput estimate ~4571 req/s
- GPU: fit for this profile 35%, batch processing time 8 ms, throughput estimate ~4000 req/s
Simulation result: CPU/GPU speedup by batch time is 0.88x (CPU faster). Bottleneck: branching, lock contention, and tail latency. Recommendation: use CPU as the primary runtime and offload only isolated compute-heavy kernels to GPU.
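The mechanism behind results like this can be captured in a toy cost model. All constants below are assumptions, loosely calibrated so that at batch size 32 the CPU finishes in about 7 ms and the GPU in about 8 ms: the GPU pays a fixed launch-and-transfer overhead, so it wins only once the batch is large enough to amortize it.

```python
# Toy cost model: illustrative constants, not measurements.
CPU_PER_ITEM_MS = 0.22       # assumed per-item CPU compute time
GPU_PER_ITEM_MS = 0.01       # assumed per-item GPU compute time
TRANSFER_PER_ITEM_MS = 0.05  # assumed per-item host<->device copy cost
GPU_FIXED_MS = 6.1           # assumed kernel-launch + setup overhead

def cpu_batch_ms(n):
    return CPU_PER_ITEM_MS * n

def gpu_batch_ms(n):
    return GPU_FIXED_MS + (GPU_PER_ITEM_MS + TRANSFER_PER_ITEM_MS) * n

def crossover_batch():
    """Smallest batch size at which the GPU path becomes faster."""
    n = 1
    while gpu_batch_ms(n) >= cpu_batch_ms(n):
        n += 1
    return n

for n in (8, 32, 256):
    print(n, round(cpu_batch_ms(n), 2), round(gpu_batch_ms(n), 2))
print("GPU wins from batch size", crossover_batch())
```

Under these assumptions the CPU is faster at batch 32 (about 7.0 ms vs 8.0 ms), and the GPU pulls ahead only around batch 39, which is the same shape the simulator demonstrates: small batches amortize nothing, large batches amortize everything.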
Dynamic visualization: hybrid CPU + GPU pipeline
Production systems typically run CPU and GPU together. In the interactive version, you switch stages to inspect each resource profile. One representative stage, CPU-side batch preparation, covers input validation, feature extraction, serialization, and batch assembly.
- Key metric: batch preparation time
- Optimization focus: CPU vectorization, fewer copies, and lower allocation churn
Where each one works better
CPU
- Server requests with complex business logic
- Transactions, service coordination, and OS-level tasks
- Workloads with unpredictable branching
GPU
- Graphics and rendering
- Machine learning and tensor-heavy operations
- Large-scale parallel compute and simulation
Common CPU/GPU selection mistakes
Using GPU just because it is available
Moving branch-heavy logic to GPU without accounting for kernel launch overhead often worsens latency.
Ignoring data transfer costs
With small batches, CPU-to-GPU and GPU-to-CPU transfer time can consume the entire expected gain.
Looking only at host-level metrics
CPU and RAM metrics are not enough; track GPU utilization, memory bandwidth, and queue depth too.
No degradation path when GPU is unavailable
Without a CPU fallback path, the system loses resilience under GPU scarcity or scheduling failures.
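The last mistake, a missing degradation path, has a simple structural fix: try the GPU path and fall back to CPU when the accelerator is unavailable. This is a minimal sketch with hypothetical `run_on_gpu` and `run_on_cpu` stand-ins; here the GPU path always fails so the example exercises the fallback branch.

```python
class GpuUnavailable(Exception):
    """Raised when no GPU can serve the request."""

def run_on_gpu(batch):
    # Stand-in for a real GPU submission; it always fails here so the
    # example demonstrates the degradation path.
    raise GpuUnavailable("no device")

def run_on_cpu(batch):
    # Stand-in for the slower but always-available CPU implementation.
    return [x * 2 for x in batch]

def process(batch):
    """Serve a batch on GPU if possible, otherwise degrade to CPU."""
    try:
        return "gpu", run_on_gpu(batch)
    except GpuUnavailable:
        # Degrade gracefully; in production, also emit a metric so the
        # fallback rate is visible to alerting.
        return "cpu", run_on_cpu(batch)

path, result = process([1, 2, 3])
print(path, result)  # cpu [2, 4, 6]
```

The fallback result may be slower, but the request is served, which keeps the system resilient under GPU scarcity or scheduling failures.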
Practical recommendations
Hybrid pipeline
Keep orchestration, flow control, and business logic on CPU; move bulk math to GPU.
Batch sizing by SLO
Tune batch size from latency SLOs: too small a batch underutilizes the GPU, while too large a batch inflates tail latency.
Profile before and after
Use flamegraphs or pprof on CPU and kernel traces on GPU so optimization is measurement-driven.
Two-domain capacity planning
Model CPU-bound and GPU-bound stages separately to plan scaling and infrastructure cost correctly.
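The batch-sizing recommendation above can be turned into simple arithmetic: subtract fixed overheads from the latency budget and divide the remainder by the per-item cost. All constants here are illustrative assumptions, not measurements; in practice you would substitute profiled numbers.

```python
GPU_FIXED_MS = 6.1      # assumed launch + transfer-setup overhead
GPU_PER_ITEM_MS = 0.06  # assumed compute + copy cost per item
QUEUE_WAIT_MS = 2.0     # assumed worst-case wait to fill the batch

def max_batch_for_slo(slo_ms):
    """Largest batch size whose worst-case latency fits the SLO."""
    budget = slo_ms - GPU_FIXED_MS - QUEUE_WAIT_MS
    if budget <= 0:
        return 0  # the fixed costs alone already blow the budget
    return int(budget // GPU_PER_ITEM_MS)

print(max_batch_for_slo(20))  # batch cap for a 20 ms latency SLO
print(max_batch_for_slo(8))   # SLO too tight for this GPU path
```

This is the capacity-planning half of the same trade-off: the SLO caps the batch size, and the batch size caps the throughput per device, which in turn sets the fleet size.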
Practical conclusion
In production systems, CPU and GPU almost always work as a pair: CPU manages control flow and orchestration, while GPU handles bulk computations. Architecture choice depends on workload shape: latency and branching favor CPU, while homogeneous parallel workloads favor GPU.
Why this matters for system design
- Helps choose the right runtime for each workload and design a correct processing pipeline.
- Directly affects infrastructure cost: CPU-dominant and GPU-dominant services scale differently.
- Defines memory and network constraints: GPU paths are sensitive to bandwidth and transfer strategy.
- Reduces architecture risk: CPU fallback and explicit SLOs keep the system resilient under GPU scarcity.
Related chapters
- Why foundational knowledge matters - explains how hardware constraints shape architecture decisions in system design.
- Structured Computer Organization (short summary) - provides hardware fundamentals: abstraction layers, ISA, memory, and CPU-device interaction.
- Operating system: overview - extends the topic with scheduling, system calls, and kernel-space impact on latency.
- RAM and persistent storage - shows why memory bandwidth and data locality often matter more than raw compute capacity.
- Performance Engineering - adds practical guidance for profiling CPU/GPU load and optimizing end-to-end performance.
- The history of Google TPUs and their evolution - broadens the accelerator comparison and clarifies when GPU or TPU gives better results.
