CPU and GPU: overview and differences

This chapter is less about benchmark numbers than about the nature of the work itself: where a single request needs a fast response and where massive parallelism actually helps.

In practice, it helps you choose the compute path by workload shape: what is latency-sensitive, what belongs in batched execution, and where data-movement overhead cancels the theoretical GPU win.

In interviews and design discussions, it gives you a concrete way to justify CPU-versus-GPU choices through workload shape, throughput, and operating cost instead of hype.

Practical value of this chapter

Workload profile

Supports CPU/GPU decisions by workload shape: latency sensitivity, batch size, and parallelism level.

Resource economics

Encourages evaluating not only raw performance, but also accelerator cost, data movement, and fallback modes.

Capacity planning

Provides a practical model for inference, analytics, and mixed compute scenarios.

Interview framing

Helps justify compute stack selection with measurable criteria rather than trends.

Source

Central processing unit

General CPU structure and its role in computation.


CPU versus GPU is really a question of latency and throughput. CPUs win where individual requests need quick response, predictable tail latency, and tight control over p99 behavior. GPUs win where homogeneous work can be batched and executed in parallel.

In real systems, the choice is rarely exclusive: CPU keeps orchestration, business logic, and fallback paths, while GPU accelerates inference or bulk numeric stages. The real cost is shaped by VRAM, memory bandwidth, and data movement, not only by raw FLOPS.

So the choice is settled by measurement, not on paper: CPU vectorization, interconnects such as NVLink, and profiling tools such as flamegraphs and pprof all matter when deciding where the real bottleneck lives.

How CPU is organized

Cores fetch and execute instructions for the threads that the operating system schedules onto them.
The ALU performs arithmetic and logical operations.
The control unit manages instruction sequencing and execution flow.
Registers provide the fastest storage, located inside each core.
L1/L2/L3 caches reduce the cost of data access.
The memory controller and buses connect CPU with RAM and devices.

Source

Graphics processing unit

GPU architecture and key properties of parallel computing.


How GPU is organized

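Streaming multiprocessors (SMs, NVIDIA) or compute units (CUs, AMD) handle massively parallel execution.
Kernel launches start thousands of lightweight threads, grouped into blocks.
Schedulers issue warps (NVIDIA) or wavefronts (AMD) onto the compute units.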
VRAM provides high-bandwidth local memory.
Cache and memory controllers improve data access efficiency.
The command processor accepts and schedules tasks from the CPU.

CPU vs GPU comparison

CPU

A small number of complex cores
High performance per thread
Strong on branching and irregular control flow

A small number of powerful cores execute heterogeneous tasks.

GPU

Many simpler compute units
High throughput on homogeneous operations
Best for massively parallel workloads

Many simpler cores execute similar work in parallel.

Dynamic visualization: workload simulator

The simplified model shows when CPU wins and when GPU wins for a given workload profile and batch size. Snapshot at batch size 32 (range: 8 to 256):

CPU: fit for this profile 88%, batch processing time 7 ms, estimated throughput 4571 units/s
GPU: fit for this profile 35%, batch processing time 8 ms, estimated throughput 4000 units/s

Simulation result

Relative batch time, CPU to GPU: 0.88x (CPU is faster in this scenario).

Bottleneck: branching, lock contention, and tail latency. Recommendation: Keep CPU as the main execution path and offload only isolated compute-heavy stages to GPU.
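The throughput and ratio figures above can be checked with a few lines of Python. This is a minimal sketch that assumes throughput is simply batch size divided by batch processing time. The simulator's internal model is not given, so the function name and the formula are assumptions.

```python
def throughput_units_per_s(batch_size: int, batch_time_ms: float) -> float:
    """Items per second if one batch of `batch_size` items takes `batch_time_ms`."""
    return batch_size * 1000.0 / batch_time_ms


# Values taken from the simulator snapshot: batch size 32, 7 ms on CPU, 8 ms on GPU.
BATCH_SIZE = 32
CPU_BATCH_MS = 7.0
GPU_BATCH_MS = 8.0

cpu_tput = throughput_units_per_s(BATCH_SIZE, CPU_BATCH_MS)
gpu_tput = throughput_units_per_s(BATCH_SIZE, GPU_BATCH_MS)
relative_time = CPU_BATCH_MS / GPU_BATCH_MS  # below 1.0 means the CPU batch finishes first

print(f"CPU: {cpu_tput:.0f} units/s")                 # CPU: 4571 units/s
print(f"GPU: {gpu_tput:.0f} units/s")                 # GPU: 4000 units/s
print(f"CPU/GPU batch time: {relative_time:.3f}x")    # CPU/GPU batch time: 0.875x
```

The 0.875 ratio is what the simulator reports as 0.88x after rounding.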

Dynamic visualization: hybrid CPU + GPU pipeline

In real systems, CPU and GPU almost always work as one pipeline, and the load moves between them from stage to stage.

The transfer stage is often where hidden cost appears: even fast links such as NVLink do not remove the price of data copies and GPU-launch preparation.

Example stage: input validation, feature extraction, serialization, and batch assembly.

CPU utilization: 82%

GPU utilization: 12%

Key metric

Batch preparation time

Optimization focus

CPU vectorization, fewer copies, and lower allocation churn

Where CPU fits better and where GPU fits better

CPU

Server requests with complex business logic
Transactions, service coordination, and OS-level tasks
Workloads with unpredictable branching

GPU

Graphics and rendering
Machine learning and tensor-heavy operations
Large-scale parallel compute and simulation

Common CPU/GPU selection mistakes

Choosing GPU just because an accelerator is available

Branch-heavy logic maps poorly onto parallel units. Unless the launch overhead is priced in, moving such logic to GPU hurts latency more often than it helps.
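Why branching hurts can be illustrated with a toy model of lockstep execution. The sketch below assumes a 32-lane warp in which every branch path taken by at least one lane is executed in turn, with the other lanes masked off. The function name and step costs are hypothetical, and real hardware scheduling is more nuanced.

```python
WARP_SIZE = 32  # lanes that execute in lockstep (assumed for this toy model)


def warp_steps(branch_taken: list[bool], then_cost: int, else_cost: int) -> int:
    """Steps a warp spends on an if/else when lanes may disagree on the branch.

    Each path taken by at least one lane is executed once for the whole warp,
    so a divergent warp pays for both paths back to back.
    """
    steps = 0
    if any(branch_taken):       # some lane needs the 'then' path
        steps += then_cost
    if not all(branch_taken):   # some lane needs the 'else' path
        steps += else_cost
    return steps


uniform = [True] * WARP_SIZE                             # every lane agrees
divergent = [lane % 2 == 0 for lane in range(WARP_SIZE)]  # lanes alternate

print(warp_steps(uniform, then_cost=10, else_cost=10))    # 10
print(warp_steps(divergent, then_cost=10, else_cost=10))  # 20
```

In this model a single disagreeing lane is enough to double the cost of the branch, which is why irregular control flow tends to stay on CPU.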

Ignoring data transfer costs

On small batches, moving data between CPU and GPU can erase the entire expected gain. Measure the speedup end to end, with the copy cost included, not the kernel time alone.

Looking only at host-level metrics

CPU and RAM show only half the picture. Without GPU utilization, memory bandwidth, and queue depth, the bottleneck stays invisible until it turns into an incident.

No fallback path when GPU is unavailable

GPU scarcity and scheduling failures happen regularly. With no CPU fallback, the workload simply stalls: the system has nothing to degrade to.
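A fallback path can be as simple as a wrapper that tries the accelerated path first and degrades to CPU on failure. This is a minimal sketch: `GpuUnavailable`, `run_with_fallback`, and the two stand-in functions are hypothetical names for illustration, not part of any real library.

```python
from typing import Callable, Sequence


class GpuUnavailable(RuntimeError):
    """Hypothetical error raised when no accelerator can take the batch."""


def run_with_fallback(
    batch: Sequence[float],
    gpu_fn: Callable[[Sequence[float]], list[float]],
    cpu_fn: Callable[[Sequence[float]], list[float]],
) -> tuple[list[float], str]:
    """Try the GPU path; on GpuUnavailable, serve the same batch from CPU.

    Returns the result and the path that produced it, so the degraded mode
    is visible in metrics instead of silently slower.
    """
    try:
        return gpu_fn(batch), "gpu"
    except GpuUnavailable:
        return cpu_fn(batch), "cpu"


# Stand-ins for real implementations (assumptions for the example).
def fake_gpu_path(batch: Sequence[float]) -> list[float]:
    raise GpuUnavailable("no device scheduled")


def cpu_path(batch: Sequence[float]) -> list[float]:
    return [x * 2.0 for x in batch]


result, path = run_with_fallback([1.0, 2.0, 3.0], fake_gpu_path, cpu_path)
print(path, result)  # cpu [2.0, 4.0, 6.0]
```

Returning the path label is a deliberate choice: a fallback that nobody can observe tends to hide a capacity problem until the CPU path saturates as well.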

Profiling should show where time is actually spent: flamegraphs and pprof help on CPU, while GPU traces and queue telemetry clarify whether the accelerator is busy or just waiting on data.

Practical recommendations

Hybrid pipeline

Keep control flow, business logic, and fallback paths on CPU; move bulk math to GPU.

Batch size driven by SLO

Batch size is a trade-off, not a default: too small underutilizes GPU, too large inflates tail latency. Anchor it to the latency SLO.

Profile before and after

Profile CPU and GPU before and after each change: otherwise optimization rests on a guess, and a regression can pass for a win.
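A before/after comparison needs repeated measurements and a noise threshold, otherwise a small fluctuation reads as a result. The sketch below is a minimal harness using only the Python standard library. The 5% noise band and the function names are assumptions, and it measures host-side wall time only, not GPU kernel time.

```python
import statistics
import time
from typing import Callable


def median_ms(fn: Callable[[], object], repeats: int = 7) -> float:
    """Median wall-clock time of `fn` in milliseconds over several runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def verdict(before_ms: float, after_ms: float, noise: float = 0.05) -> str:
    """Classify a change, treating anything inside the noise band as no change."""
    change = (after_ms - before_ms) / before_ms
    if change < -noise:
        return "win"
    if change > noise:
        return "regression"
    return "within noise"


# Hypothetical workloads standing in for the code before and after a change.
before = median_ms(lambda: sum(i * i for i in range(50_000)))
after = median_ms(lambda: sum(i * i for i in range(25_000)))
print(f"before={before:.2f} ms, after={after:.2f} ms -> {verdict(before, after)}")
```

The median is used instead of the mean so that one slow run, for example from a cold cache, does not dominate the comparison.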

Capacity planning across two compute domains

Model CPU-bound and GPU-bound stages separately to plan scaling and infrastructure cost correctly.
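Planning the two domains separately can start from a per-stage replica count. The sketch below assumes each stage has a measured per-replica throughput and a target utilization ceiling. The throughput figures, the 70% ceiling, and the hourly prices are placeholders, not real numbers.

```python
import math


def replicas_needed(
    target_units_per_s: float,
    per_replica_units_per_s: float,
    max_utilization: float = 0.7,  # assumed headroom: plan to run at 70% at most
) -> int:
    """Replicas required so the stage stays under the utilization ceiling."""
    usable = per_replica_units_per_s * max_utilization
    return math.ceil(target_units_per_s / usable)


TARGET = 20_000  # units/s the whole pipeline must sustain (assumed)

# Each stage is sized from its own throughput, not from a shared yardstick.
cpu_nodes = replicas_needed(TARGET, per_replica_units_per_s=4_000)
gpu_nodes = replicas_needed(TARGET, per_replica_units_per_s=16_000)

# Placeholder hourly prices per node.
hourly_cost = cpu_nodes * 0.40 + gpu_nodes * 3.00

print(f"CPU nodes: {cpu_nodes}, GPU nodes: {gpu_nodes}, cost: {hourly_cost:.2f}/hour")
```

With these placeholder numbers the CPU stage needs more nodes but the GPU stage carries most of the cost, which is the kind of asymmetry a single combined estimate would hide.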

Practical takeaway

In production systems, CPU and GPU usually work as a pair: CPU handles control flow, routing, and fallback paths, while GPU takes on large homogeneous computations. The decision depends first on workload shape: latency and branching favor CPU, while uniform parallel work favors GPU.

Why this matters for system design

Fixes the main compute path for each workload: get it wrong and the mistake is baked into the pipeline and carried downstream.
Hits the infrastructure bill: CPU-heavy and GPU-heavy services scale differently and cannot be planned with one yardstick.
Pushes memory and network to the front: for GPU paths, memory bandwidth, transfer path, and batch size matter more than peak FLOPS.
Keeps the system standing under GPU scarcity: a CPU fallback and an explicit SLO are what carry it through.

Related chapters

Why foundational knowledge matters - shows how hardware and runtime constraints become architecture decisions in practice.
Structured Computer Organization (short summary) - provides hardware fundamentals: abstraction layers, ISA, memory, and CPU-device interaction.
Operating system: overview - connects the topic to scheduling, system calls, and the impact of the kernel on latency.
RAM and storage - shows why memory bandwidth and data locality often matter more than peak compute capacity.
Performance Engineering - adds practical guidance for profiling CPU/GPU load and improving full-path performance.
The history of Google TPUs and their evolution - broadens the accelerator comparison and clarifies when GPU or TPU gives better results.