System Design Space

Updated: March 24, 2026 at 11:23 AM

CPU and GPU: overview and differences


Comparison of architecture and workload types: CPU versatility versus GPU parallelism.

This chapter matters not because it compares benchmark numbers, but because it explains the different nature of the work: general-purpose sequential control on CPUs versus massive parallelism on GPUs.

In practice, it helps you choose the compute path by workload shape: what is latency-sensitive, what belongs in batch, and where data-transfer overhead cancels the theoretical GPU win.

In interviews and design reviews, it gives you a concrete way to justify compute choices through workload shape, throughput, and operating cost instead of hype.

Practical value of this chapter

Workload profile

Supports CPU/GPU decisions by workload shape: latency sensitivity, batch mode, and parallelism level.

Resource economics

Encourages evaluating both performance and total operating cost of compute choices.

Capacity planning

Provides a practical model for inference, analytics, and mixed compute scenarios.

Interview framing

Helps justify compute stack selection with measurable criteria rather than trends.

Source: Central processing unit — general CPU structure and its role in computation.

CPUs and GPUs both perform computation, but they are optimized for different goals: the CPU focuses on versatility and low per-task latency, while the GPU is built for massive parallelism and throughput.

CPU architecture basics

  • Cores execute instructions and coordinate thread scheduling.
  • ALU performs arithmetic and logical operations.
  • Control Unit manages instruction sequencing and execution flow.
  • Registers provide the fastest memory near each core.
  • L1/L2/L3 caches reduce data access latency.
  • Memory controller and buses connect CPU with RAM and devices.
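The role of the L1/L2/L3 hierarchy can be made concrete with a simple average-memory-access-time sketch: each level is consulted only on a miss in the level above. The latencies and hit rates below are illustrative assumptions, not measurements of any specific CPU.

```python
# Average memory access time (AMAT) for a three-level cache hierarchy.
# All latency and hit-rate figures are illustrative assumptions.

def amat(levels, memory_latency_ns):
    """levels: list of (hit_rate, latency_ns) from L1 down to L3.

    Each level is reached only after a miss in the level above,
    so miss probabilities accumulate multiplicatively.
    """
    expected = 0.0
    miss_prob = 1.0
    for hit_rate, latency in levels:
        expected += miss_prob * hit_rate * latency
        miss_prob *= (1.0 - hit_rate)
    return expected + miss_prob * memory_latency_ns

# Assumed: L1 ~1 ns @ 95% hits, L2 ~4 ns @ 80%, L3 ~15 ns @ 70%, RAM ~100 ns
print(amat([(0.95, 1.0), (0.80, 4.0), (0.70, 15.0)], 100.0))  # ~1.5 ns on average
```

Even with RAM two orders of magnitude slower than L1, high hit rates keep the average access close to L1 latency — which is why cache-friendly access patterns matter so much on CPUs.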

Source: Graphics processing unit — GPU architecture and key properties of parallel computing.

GPU architecture basics

  • SM/CU (multiprocessors) handle massively parallel execution.
  • Thread pools launch thousands of lightweight threads.
  • Scheduler/dispatch maps warps and wavefronts to compute units.
  • VRAM provides high-bandwidth local memory.
  • Cache and memory controllers improve data access efficiency.
  • Command processor accepts and schedules tasks from the CPU.
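The scale of this parallelism can be illustrated with a back-of-the-envelope thread count across SMs and warps. The SM count and warps-per-SM figures below are hypothetical, not any specific GPU's spec sheet.

```python
# Back-of-the-envelope GPU parallelism: resident threads = SMs × warps/SM × warp size.
# The device figures below are illustrative assumptions.

WARP_SIZE = 32  # threads per warp (NVIDIA terminology)

def resident_threads(num_sms, warps_per_sm):
    """Upper bound on threads resident on the device at once."""
    return num_sms * warps_per_sm * WARP_SIZE

# A hypothetical GPU with 80 SMs and 48 resident warps per SM:
print(resident_threads(80, 48))  # 122880 threads in flight
```

Tens of thousands of in-flight threads is what lets the GPU hide memory latency by switching warps — and it is also why it needs homogeneous, data-parallel work to stay busy.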

CPU vs GPU comparison

CPU

  • Small number of complex cores
  • High per-thread performance
  • Best for branching and latency-critical paths

A small number of powerful cores execute heterogeneous tasks.

GPU

  • Many relatively simple cores
  • High throughput for homogeneous operations
  • Great for data-parallel workloads

Many simpler cores execute similar work in parallel.

Workload simulator

For a given workload profile and batch size, the simplified model below shows when the CPU wins and when the GPU wins.

Example state: batch size 32 (range 8–256).

  • CPU — fit for this profile: 88%; batch processing time: 7 ms; throughput estimate: 4571 req/s
  • GPU — fit for this profile: 35%; batch processing time: 8 ms; throughput estimate: 4000 req/s

Simulation result: CPU/GPU speedup by batch time: 0.88x (CPU faster).

Bottleneck: branching, lock contention, and tail latency. Recommendation: Use CPU as the primary runtime; offload only isolated compute-heavy kernels to GPU.
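The simulator's behavior can be approximated with a toy cost model: the GPU pays a fixed kernel-launch-plus-transfer overhead per batch but a much smaller per-item cost, so it only wins once the batch is large enough to amortize that overhead. All constants below are assumptions chosen for illustration, not measurements.

```python
# Toy CPU-vs-GPU batch cost model. All constants are illustrative assumptions.

CPU_PER_ITEM_MS = 0.22   # per-item cost on CPU
GPU_OVERHEAD_MS = 8.0    # kernel launch + host<->device transfer, paid per batch
GPU_PER_ITEM_MS = 0.02   # per-item cost on GPU once data is resident

def cpu_batch_ms(n):
    return CPU_PER_ITEM_MS * n

def gpu_batch_ms(n):
    return GPU_OVERHEAD_MS + GPU_PER_ITEM_MS * n

def crossover_batch():
    """Smallest batch size at which the GPU's batch time beats the CPU's."""
    n = 1
    while gpu_batch_ms(n) >= cpu_batch_ms(n):
        n += 1
    return n

print(cpu_batch_ms(32), gpu_batch_ms(32))  # small batch: CPU wins
print(crossover_batch())                   # past this size, GPU wins
```

Under these assumed constants the CPU wins at batch 32 and the GPU takes over in the low forties — the same qualitative shape the simulator shows, even though the exact numbers depend entirely on the chosen constants.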

Hybrid CPU + GPU pipeline

Production systems typically run CPU and GPU together; each stage of the pipeline has its own resource profile.

Example stage: batch preparation (input validation, feature extraction, serialization, and batch assembly).

  • CPU load: 82%
  • GPU load: 12%
  • Key metric: batch preparation time
  • Optimization focus: CPU vectorization, fewer copies, and lower allocation churn

Where each one works better

CPU

  • Server requests with complex business logic
  • Transactions, service coordination, and OS-level tasks
  • Workloads with unpredictable branching

GPU

  • Graphics and rendering
  • Machine learning and tensor-heavy operations
  • Large-scale parallel compute and simulation

Common CPU/GPU selection mistakes

Using GPU just because it is available

Moving branch-heavy logic to GPU without accounting for kernel launch overhead often worsens latency.

Ignoring data transfer costs

With small batches, CPU-to-GPU and GPU-to-CPU transfer time can consume the entire expected gain.
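The transfer trap is easy to quantify with a rough bandwidth calculation: total bytes divided by effective bus bandwidth. The PCIe bandwidth and payload sizes below are assumptions for illustration.

```python
# Rough host<->device transfer time: bytes / effective bus bandwidth.
# Bandwidth and payload figures are illustrative assumptions.

PCIE_GBPS = 16.0  # assumed effective PCIe bandwidth, GB/s

def transfer_ms(num_items, bytes_per_item, gbps=PCIE_GBPS):
    total_bytes = num_items * bytes_per_item
    return total_bytes / (gbps * 1e9) * 1e3  # milliseconds

# A batch of 32 items, 4 MiB each, copied to the GPU:
one_way = transfer_ms(32, 4 * 1024 * 1024)
print(one_way * 2)  # round trip, ms — ~16.8 ms before any compute runs
```

If the GPU kernel itself takes a few milliseconds, a round trip of this size erases the win entirely — which is why small batches with large payloads often belong on the CPU.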

Looking only at host-level metrics

CPU and RAM metrics are not enough; track GPU utilization, memory bandwidth, and queue depth too.

No degradation path when GPU is unavailable

Without a CPU fallback path, the system loses resilience under GPU scarcity or scheduling failures.

Practical recommendations

Hybrid pipeline

Keep orchestration, flow control, and business logic on CPU; move bulk math to GPU.

Batch sizing by SLO

Tune batch size from latency SLOs: too small underutilizes GPU, too large increases tail latency.
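This sizing rule can be sketched as a budget calculation: the worst-case request waits for the batch to fill and then for it to be processed, so the batch size must keep wait plus processing under the SLO. The arrival rate and processing constants below are assumptions.

```python
# Pick the largest batch size that still meets a latency SLO.
# The first request in a batch waits for the batch to fill, then for processing.
# Arrival rate and processing constants are illustrative assumptions.

def max_batch_for_slo(slo_ms, arrival_rate_per_s, overhead_ms, per_item_ms):
    best = 1
    n = 1
    while True:
        fill_ms = (n - 1) / arrival_rate_per_s * 1e3  # worst-case wait to fill
        proc_ms = overhead_ms + per_item_ms * n       # batch processing time
        if fill_ms + proc_ms > slo_ms:
            return best
        best = n
        n += 1

# Assumed: 50 ms SLO, 2000 req/s arrivals, 8 ms batch overhead, 0.02 ms per item:
print(max_batch_for_slo(50.0, 2000.0, 8.0, 0.02))
```

Note how both failure modes from the text appear in the formula: a tiny batch wastes the fixed overhead (underutilized GPU), while a large batch inflates the fill-time term (tail latency).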

Profile before and after

Use flamegraphs or pprof on CPU and kernel traces on GPU so optimization is measurement-driven.

Two-domain capacity planning

Model CPU-bound and GPU-bound stages separately to plan scaling and infrastructure cost correctly.
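A minimal version of this two-domain model: give each stage its own per-replica capacity, size each domain independently against target demand, and keep utilization headroom. The demand and capacity numbers below are assumptions for illustration.

```python
import math

# Plan replicas per stage so each domain independently meets target demand.
# Per-replica capacities below are illustrative assumptions.

def replicas_needed(target_rps, per_replica_rps, headroom=0.7):
    """Size each stage to run at ~70% utilization at target load."""
    return math.ceil(target_rps / (per_replica_rps * headroom))

target = 5000.0  # req/s the whole pipeline must sustain
stages = {
    "cpu_preprocess": 900.0,    # req/s one CPU replica can prepare
    "gpu_inference": 2500.0,    # req/s one GPU replica can serve
    "cpu_postprocess": 1200.0,
}

plan = {name: replicas_needed(target, cap) for name, cap in stages.items()}
print(plan)
```

Sizing the domains separately makes the cost asymmetry visible: here the CPU stages need far more replicas than the GPU stage, so scaling the fleet as one homogeneous pool would either starve the CPU side or overprovision the GPUs.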

Practical conclusion

In production systems, CPU and GPU almost always work as a pair: CPU manages control flow and orchestration, while GPU handles bulk computations. Architecture choice depends on workload shape: latency and branching favor CPU, while homogeneous parallel workloads favor GPU.

Why this matters for system design

  • Helps choose the right runtime for each workload and design a correct processing pipeline.
  • Directly affects infrastructure cost: CPU-dominant and GPU-dominant services scale differently.
  • Defines memory and network constraints: GPU paths are sensitive to bandwidth and transfer strategy.
  • Reduces architecture risk: CPU fallback and explicit SLOs keep the system resilient under GPU scarcity.
