The TPU story matters not only as hardware history, but as an example of how compute economics reshape ML architecture.
The chapter shows why accelerator specialization, the split between training and inference, and platform constraints directly affect engineering decisions around models.
It is especially useful when hardware choice and the cost of serving in production become part of the architectural trade-off.
Practical value of this chapter
Hardware lens
See how accelerator choices reshape system design around ML models.
Compute economics
Connect accelerator strategy to the economics of training and inference.
Platform context
Understand why infrastructure choices become part of ML product design, not just platform plumbing.
Architecture narrative
Add a more mature infrastructure perspective to ML architecture answers.
Primary source
Book Cube TPU series
A two-part breakdown of why Google needed TPUs and how the generations evolved.
This chapter brings together the Book Cube posts and official Google materials to explain how TPUs grew out of an infrastructure bottleneck, why Google chose a dedicated ASIC path, and how the architecture moved from an inference accelerator to a large-scale training platform before turning back toward inference in the GenAI era.
Why did TPUs appear in the first place?
- To sharply improve the price-to-performance ratio of ML inference compared with the CPUs and GPUs available at the time.
- To move from decision to working deployment quickly, under tight timelines.
- To keep the economics sustainable as ML load grew across Google services.
Evolution of TPU by generation
TPU v1
Inference
- It took about 15 months to go from project start to production deployment.
- 28 nm process technology, 700 MHz, ~40 W.
- Peak throughput: 92 TOPS at INT8, with a noticeable jump in energy efficiency over the CPUs and GPUs of the time.
TPU v2
Training + inference
- A shift from an inference-only accelerator to a platform that could handle both training and inference.
- TPU Pod with a 256-chip network.
- Order of magnitude: 180 TFLOPS and 64 GB of HBM per four-chip board, based on the sources used in this chapter.
TPU v3
Performance growth
- Liquid cooling introduced.
- Compute capacity and memory bandwidth increased substantially.
- Order of magnitude: up to 420 TFLOPS per four-chip board, based on the sources used in this chapter.
TPU v4
Scaling pod networks
- Optical circuit switching to accelerate inter-chip communication.
- Focus on distributed training for large-scale models.
- Order of magnitude: 275 TFLOPS per chip, based on the sources used in this chapter.
TPU v5e / v5p
Cost optimization
- Emphasis on more efficient economics for training and inference.
- Improved power efficiency and pod scaling.
- Support for sparsity and more flexible load profiles.
TPU v6 Trillium
Performance leap
- Up to 4.7x more compute per chip than TPU v5e, according to Google.
- HBM capacity and bandwidth doubled, and interconnect bandwidth also increased.
- Roughly 67% higher energy efficiency than TPU v5e, according to Google.
TPU v7 Ironwood
Inference in the GenAI era
- A return to the idea of an accelerator built primarily for inference, like TPU v1, but at a very different scale.
- Up to 9,216 chips in a liquid-cooled cluster.
- Order of magnitude: 4,614 TFLOPS per chip, 192 GB of HBM, and 7.37 TB/s of memory bandwidth.
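Two quick derived numbers help put the Ironwood figures above in context: the aggregate peak compute of a full cluster, and the roofline ridge point (how many FLOPs a kernel must perform per byte read from HBM before it stops being memory-bound). The sketch below uses only the three figures quoted above; the function names are illustrative, and peak numbers say nothing about what a real workload will sustain.

```python
# Figures quoted in this chapter for TPU v7 Ironwood.
PEAK_TFLOPS_PER_CHIP = 4_614   # peak compute per chip, TFLOPS
HBM_BANDWIDTH_TBPS = 7.37      # memory bandwidth per chip, TB/s
CHIPS_PER_CLUSTER = 9_216      # largest cluster size quoted


def cluster_peak_exaflops(tflops_per_chip: float, chips: int) -> float:
    """Aggregate peak compute in exaFLOPS (1 EFLOPS = 1e6 TFLOPS)."""
    return tflops_per_chip * chips / 1e6


def ridge_point_flops_per_byte(tflops: float, bandwidth_tbps: float) -> float:
    """Roofline ridge point: FLOPs per HBM byte needed to be compute-bound.

    Both inputs carry a factor of 1e12, so the factors cancel.
    """
    return tflops / bandwidth_tbps


if __name__ == "__main__":
    peak = cluster_peak_exaflops(PEAK_TFLOPS_PER_CHIP, CHIPS_PER_CLUSTER)
    ridge = ridge_point_flops_per_byte(PEAK_TFLOPS_PER_CHIP, HBM_BANDWIDTH_TBPS)
    print(f"Cluster peak: {peak:.1f} EFLOPS")
    print(f"Ridge point: {ridge:.0f} FLOPs per byte")
```

A ridge point in the hundreds of FLOPs per byte means low-batch decoding, which reads far more bytes than it computes on, is likely to be limited by memory bandwidth rather than by peak compute.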
TPU and GPU: how to read the comparisons
Compute profile
GPUs are usually more versatile, while TPUs are tuned more aggressively for tensor-heavy workloads and fit more deeply into the Google Cloud stack.
Economics
In a number of comparisons, TPUs look stronger on cost per useful unit of work, but the answer depends heavily on the model, batch size, and the quality of optimization.
Ecosystem
NVIDIA's CUDA ecosystem is broader; TPUs are especially strong when the team is already building around TensorFlow, JAX, and GCP services.
Practical takeaway: comparing FLOPS, tokens, and dollars without a shared methodology is an easy way to fool yourself. Look at the model, numeric precision, batch size, interconnect, software stack, and the latency and throughput budget you actually need to hit.
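One way to apply this takeaway is to normalize every measurement to the same unit, for example dollars per million tokens, and to compare only runs that share a precision and meet the same latency budget. The sketch below is a minimal illustration: `BenchmarkRun` and its fields are hypothetical names, and any numbers fed into it must be your own measurements on one workload, not vendor peak figures.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class BenchmarkRun:
    """One measured configuration (hypothetical record layout)."""
    accelerator: str
    precision: str            # e.g. "bf16" or "int8"
    batch_size: int
    tokens_per_second: float  # measured throughput of the whole instance
    p99_latency_ms: float     # measured tail latency
    price_per_hour: float     # instance price in dollars


def cost_per_million_tokens(run: BenchmarkRun) -> float:
    """Dollars spent to produce one million tokens at the measured throughput."""
    tokens_per_hour = run.tokens_per_second * 3600
    return run.price_per_hour / tokens_per_hour * 1_000_000


def cheapest_within_budget(
    runs: Sequence[BenchmarkRun], precision: str, latency_budget_ms: float
) -> Optional[BenchmarkRun]:
    """Cheapest run among those sharing a precision and meeting the latency budget."""
    eligible = [
        r for r in runs
        if r.precision == precision and r.p99_latency_ms <= latency_budget_ms
    ]
    return min(eligible, key=cost_per_million_tokens, default=None)
```

Filtering before ranking is the important design choice: a configuration that looks cheapest per token at a large batch size is irrelevant if it cannot meet the latency the product needs.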
Key TPU Evolution Inflection Points
From product bottleneck to a dedicated ASIC
TPU v1 was not a research experiment, but a practical response to a production bottleneck: Google needed a dedicated ASIC path to keep inference latency and cost under control as neural workloads grew rapidly.
Impact: From day one, the architecture was designed around production SLAs and data center energy efficiency, not just peak benchmark numbers.
v2/v3 shift: from inference accelerator to general platform
As model sizes increased, accelerating inference alone was no longer enough. TPU v2/v3 added support for large-scale training, HBM, and pod-level scaling.
Impact: Google could speed up the full ML lifecycle in one stack: experiments, training, and live inference.
v4/v5 shift: inter-chip network and pod economics
In distributed training, compute is only part of the limit; interconnect becomes critical. The v4 and v5 generations put more weight on network fabric, pod-level scaling, and total cost of ownership.
Impact: Optimization moved to the full-system level: compute, memory, network, and operations together.
v6/v7 shift: inference-first again in the GenAI era
GenAI workloads pulled inference back into the center: long contexts, high throughput demand, and predictable latency at scale.
Impact: TPU v7 Ironwood effectively revisits the original v1 idea, but at massive cluster scale and with a much more advanced memory and interconnect profile.
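The v4/v5 point about interconnect can be made concrete with a toy data-parallel model: per-step compute shrinks as chips are added, while gradient synchronization does not. The sketch below assumes a bandwidth-only ring all-reduce cost of 2(N-1)/N times the gradient size, no overlap between compute and communication, and no latency term. These are simplifying assumptions for illustration, not a description of how TPU pods actually schedule collectives.

```python
def ring_allreduce_seconds(
    gradient_bytes: float, chips: int, link_bytes_per_second: float
) -> float:
    """Bandwidth term of a ring all-reduce: each chip moves 2(N-1)/N of the data."""
    if chips == 1:
        return 0.0
    return 2 * (chips - 1) / chips * gradient_bytes / link_bytes_per_second


def step_seconds(
    single_chip_compute_seconds: float,
    gradient_bytes: float,
    chips: int,
    link_bytes_per_second: float,
) -> float:
    """Step time if compute splits perfectly and communication is not overlapped."""
    compute = single_chip_compute_seconds / chips
    comm = ring_allreduce_seconds(gradient_bytes, chips, link_bytes_per_second)
    return compute + comm


def scaling_efficiency(
    single_chip_compute_seconds: float,
    gradient_bytes: float,
    chips: int,
    link_bytes_per_second: float,
) -> float:
    """Achieved speedup divided by the ideal linear speedup (1.0 is perfect)."""
    step = step_seconds(
        single_chip_compute_seconds, gradient_bytes, chips, link_bytes_per_second
    )
    return (single_chip_compute_seconds / step) / chips


if __name__ == "__main__":
    # Placeholder workload: 8 s of single-chip compute, 2 GB of gradients.
    for n in (1, 8, 64, 256):
        eff = scaling_efficiency(8.0, 2e9, n, 1e11)
        print(f"{n:>4} chips: efficiency {eff:.2f}")
```

Even in this idealized model, efficiency falls as the chip count grows because the communication term stays roughly constant while compute per chip shrinks, which is why faster fabric shows up directly in pod economics.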
Strengths and weaknesses of the TPU approach
Pros
- Specialization in tensor operations and deep learning.
- High energy efficiency and strong total-cost-of-ownership economics in many AI scenarios.
- Deep integration with Google Cloud, TensorFlow, and JAX.
- Good scalability through the TPU Pod approach.
Cons
- Availability is primarily via Google Cloud.
- Less versatility for non-standard compute workloads.
- The tooling ecosystem is narrower than the one around CUDA.
- A meaningful vendor lock-in risk if the architecture becomes deeply tied to TPU-specific assumptions.
TPU Selection Framework for Real Projects
Load profile
Signal in favor of TPU: Repeated tensor-heavy training and inference tasks with a clear optimization path in TensorFlow or JAX.
Where mistakes happen: If you have many custom kernels or mixed tasks, GPU versatility may matter more.
Data center economics
Signal in favor of TPU: Cost per token or per training iteration and long-term energy efficiency are core constraints.
Where mistakes happen: Without a proper total-cost-of-ownership model, choosing by raw hourly hardware price often leads to the wrong answer.
Network architecture
Signal in favor of TPU: You need pod-level scaling for training and inference and are ready to optimize interconnect behavior actively.
Where mistakes happen: If the network and software stack are not ready, adding chips will not produce linear performance gains.
Engineering ecosystem
Signal in favor of TPU: The team already uses GCP managed services and is ready to invest in XLA/JAX/TensorFlow profiling.
Where mistakes happen: If your stack is deeply CUDA-centered and multi-cloud portability is a hard requirement, migration cost can be high.
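The warnings about raw hourly price and migration cost in the framework above can be folded into a very small total-cost model. The sketch below is an assumption-laden illustration: `total_cost` and its parameters are hypothetical names, utilization stands for the fraction of billed hours that do useful work, and migration cost is treated as a one-off sum over the planning horizon.

```python
def total_cost(
    price_per_hour: float,
    useful_hours: float,
    utilization: float,
    migration_cost: float = 0.0,
) -> float:
    """Rough cost of getting `useful_hours` of productive accelerator time.

    utilization: fraction of billed hours that do useful work, in (0, 1].
    migration_cost: one-off engineering cost of moving to this platform.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    billed_hours = useful_hours / utilization
    return price_per_hour * billed_hours + migration_cost


if __name__ == "__main__":
    # Placeholder inputs, not real prices: the cheaper hourly rate loses
    # once low utilization and a migration bill are included.
    cheaper_per_hour = total_cost(3.0, 10_000, utilization=0.5, migration_cost=40_000)
    pricier_per_hour = total_cost(4.0, 10_000, utilization=0.8)
    print(f"cheaper hourly rate: ${cheaper_per_hour:,.0f}")
    print(f"pricier hourly rate: ${pricier_per_hour:,.0f}")
```

In practice `useful_hours` also differs per platform, because the same job finishes at different speeds on different hardware, so it should come from measured runs of your own workload.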
What to carry into your own architecture decisions
- Treat accelerator strategy as part of product architecture, not as an afterthought buried in infrastructure.
- Optimize not only model quality, but also the economics of the full training and inference cycle.
- Design some portability into the system if reducing vendor lock-in is important.
- Measure end-to-end efficiency across the full chain: model, memory, interconnect, software stack, and operations.
- Validate scalability on production-like data paths and real SLOs, not only on synthetic benchmarks.
References
Book Cube #3822
Part 1: why Google needed TPUs at all and how generations v1-v3 evolved.
Book Cube #3823
Part 2: generations v4-v7, accelerator economics, and comparison with GPU.
Google Cloud: TPU transformation (10-year look back)
Google's official overview of how TPU generations evolved and why that direction mattered.
In-Datacenter Performance Analysis of a TPU (ISCA 2017)
A classic paper on TPU v1, its motivation, and comparison with CPU and GPU.
CloudExpat comparison
A comparison of TPU v5e, H100, and Trainium economics; useful, but worth reading critically because methodology matters.
All numerical comparisons in this chapter are directional and come from the cited sources; they still need to be validated against a concrete workload.
Related chapters
- Why should an engineer know ML and AI? - Section context and the role of ML thinking for architects.
- CPU vs GPU - The baseline accelerator comparison you want before talking about TPU versus GPU.
- Google Global Network - The networking foundation that matters for distributed TPU and GPU clusters.
- Performance Engineering - Practical ways to measure latency and throughput and optimize systems under load.
