The TPU story shows that in AI, the model is only part of the answer; the compute base it can live on economically matters just as much.
The chapter shifts the discussion away from abstract "more compute" toward architectural choices about throughput, memory, energy, training versus inference, and the cost of the full infrastructure.
It is especially useful wherever you need to explain why hardware choices change not only performance, but also the product bets a team can afford to make.
Practical value of this chapter
Design in practice
Translate guidance on TPU evolution and accelerator-infrastructure impact on AI architecture into architecture decisions for data flow, model serving, and quality control points.
Decision quality
Evaluate system quality through both model and platform metrics: precision/recall, latency, drift, cost, and operational risk.
Interview articulation
Frame answers as data -> model -> serving -> monitoring, showing where constraints appear and how you manage them.
Trade-off framing
Make trade-offs explicit for TPU evolution and accelerator-infrastructure impact on AI architecture: experiment speed, quality, explainability, resource budget, and maintenance complexity.
Primary source
book_cube TPU series
A two-part analysis of how the TPU emerged and how it evolved across generations.
This chapter is compiled from the book_cube posts and official Google materials: how TPUs emerged from an infrastructure bottleneck, why Google needed an ASIC approach, and how the architecture evolved from an inference chip into a large-scale training platform and back to inference-first in the GenAI era.
Why did TPUs appear in the first place?
Deliver a multi-fold price/performance gain for ML inference compared to the CPUs/GPUs available at the time.
Move quickly from decision to production deployment.
Maintain cost efficiency as ML load grew across Google products.
Evolution of TPU by generation
TPU v1
Inference
- Developed in ~15 months from start to deployment.
- 28 nm process technology, 700 MHz, ~40 W.
- Benchmark: 92 TOPS INT8, a noticeable jump in energy efficiency.
TPU v2
Training + inference
- Transition from a “chip for inference” to a train+infer platform.
- TPU Pods networking 256 chips.
- Order of magnitude: 180 TFLOPS, 64 GB HBM (according to chapter sources).
TPU v3
Performance growth
- Liquid cooling introduced.
- Compute and memory bandwidth have been significantly increased.
- Order of magnitude: up to 420 TFLOPS (according to chapter sources).
TPU v4
Pod network scaling
- Optical circuit switching to speed up inter-chip communication.
- Focus on distributed training of large scale models.
- Order of magnitude: 275 TFLOPS per chip (according to chapter sources).
TPU v5e / v5p
Cost optimization
- Emphasis on cost-effective training and inference.
- Improved power efficiency and pod scaling.
- Support for sparsity and more flexible workload profiles.
TPU v6 Trillium
Performance leap
- Up to 4.7x per-chip compute vs TPU v5e (according to Google).
- Double HBM capacity/throughput and interconnect bandwidth.
- ~67% higher energy efficiency vs TPU v5e (according to Google).
TPU v7 Ironwood
GenAI-era inference
- A return to the inference-first idea of TPU v1, but at a new scale.
- Up to 9,216 chips in a liquid-cooled cluster.
- Order of magnitude: 4,614 TFLOPS/chip, 192 GB HBM, 7.37 TB/s memory bandwidth.
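The Ironwood figures above can be sanity-checked with a simple roofline calculation: at 4,614 TFLOPS and 7.37 TB/s, a kernel needs roughly 626 FLOPs per byte of memory traffic to be compute-bound. A minimal sketch using the chapter's numbers (the assumption that 4,614 TFLOPS is a low-precision peak is mine):

```python
# Roofline sanity check on the chapter's TPU v7 (Ironwood) figures.
# Assumption: 4,614 TFLOPS is peak low-precision compute per chip.

peak_flops = 4614e12   # FLOP/s per chip (chapter figure)
mem_bw = 7.37e12       # bytes/s HBM bandwidth (chapter figure)

# Arithmetic intensity (FLOPs per byte) at which the chip shifts
# from memory-bound to compute-bound.
ridge_point = peak_flops / mem_bw
print(f"ridge point: {ridge_point:.0f} FLOPs/byte")

def attainable_tflops(intensity_flops_per_byte: float) -> float:
    """Attainable TFLOPS for a kernel of given arithmetic intensity."""
    return min(peak_flops, mem_bw * intensity_flops_per_byte) / 1e12

# Bandwidth-heavy kernels (e.g. ~2 FLOPs/byte, typical of decode-phase
# GenAI inference) see only a small fraction of the peak.
print(f"at 2 FLOPs/byte: {attainable_tflops(2):.1f} TFLOPS")
print(f"at 1000 FLOPs/byte: {attainable_tflops(1000):.1f} TFLOPS")
```

This is why the v6/v7 generations emphasize HBM capacity and bandwidth as much as raw compute: inference workloads often sit left of the ridge point.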
TPU vs GPU: How to Read the Comparisons
Compute and memory profile
GPUs are usually more versatile; TPUs are optimized for tensor workloads and tightly integrated with the Google Cloud stack.
Economics
In several training/inference comparisons, TPUs show better cost per workload, but the estimates depend heavily on the model, batch size, and optimization level.
Ecosystem
NVIDIA's CUDA ecosystem is broader; TPUs win where the team is already building its pipeline on TensorFlow/JAX and GCP managed infrastructure.
An important practical point: comparing FLOPS, tokens, or dollars without a common methodology easily produces distorted conclusions. Compare on the same model, precision, batch size, interconnect, software stack, and operational constraints.
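One way to apply a common methodology is to normalize everything to cost per million generated tokens at a fixed model, batch, and precision. A minimal sketch; all prices and throughputs below are hypothetical placeholders, not vendor benchmarks:

```python
# Sketch: normalize accelerator comparisons to cost per million tokens
# instead of raw FLOPS or hourly price. All numbers are hypothetical.

def cost_per_million_tokens(hourly_price_usd: float,
                            tokens_per_second: float) -> float:
    """Cost to generate 1M tokens at a sustained measured throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1e6

# Same model, same batch and precision settings -- the only fair basis.
candidates = {
    "accelerator_a": {"price": 4.20, "tps": 9000},  # hypothetical
    "accelerator_b": {"price": 2.10, "tps": 4000},  # hypothetical
}
for name, c in candidates.items():
    print(name, round(cost_per_million_tokens(c["price"], c["tps"]), 3))
```

Note how the nominally cheaper hourly price can lose once throughput is factored in; that is exactly the distortion the methodology warning is about.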
Key TPU Evolution Inflection Points
From product bottleneck to a dedicated ASIC
TPU v1 was not a research experiment, but a practical response to a production problem: keep inference latency and cost under control for Google services as neural model usage grew quickly.
Impact: From day one, the architecture was designed for production SLA and data center energy efficiency, not only for peak benchmark numbers.
v2/v3 shift: from inference chip to train+infer platform
As model sizes increased, accelerating inference alone became insufficient. TPU v2/v3 added large-scale training support, HBM memory, and pod-level scaling.
Impact: Google could accelerate the full ML lifecycle in one stack: experiments, training, and production inference.
v4/v5 shift: inter-chip network and pod economics
In distributed training, compute is only part of the bottleneck; the interconnect becomes critical. TPU evolution shifted focus toward the network fabric, pod scaling, and TCO.
Impact: Optimization moved to the full-system level: compute + memory + network + operations.
v6/v7 shift: inference-first again in the GenAI era
GenAI workloads put inference back in the center: long contexts, high throughput demands, and predictable latency at scale.
Impact: TPU v7 Ironwood effectively revisits the original v1 idea, but at massive cluster scale and with modern memory/interconnect characteristics.
Strengths and weaknesses of the TPU approach
Pros
- Specialization in tensor operations and deep learning.
- High energy efficiency and strong TCO economics in a number of AI scenarios.
- Deep integration with Google Cloud, TensorFlow and JAX.
- Good scalability via TPU Pod approach.
Cons
- Availability is primarily via Google Cloud.
- Less versatility for atypical computing workloads.
- The tool ecosystem as a whole is narrower than that around CUDA.
- Risks of vendor lock-in when the architecture is deeply tied to TPU specifics.
TPU Selection Framework for Real Projects
Workload profile
Signal in favor of TPU: Repeated tensor-heavy training/inference tasks with a clear optimization path in TensorFlow/JAX.
Where mistakes happen: If you have many custom kernels or mixed workloads, GPU versatility can be more important.
Data center economics
Signal in favor of TPU: Token/iteration cost and long-term energy efficiency are core constraints.
Where mistakes happen: Without a proper TCO model, choosing by raw hourly hardware price often leads to wrong conclusions.
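The "raw hourly price" trap can be made concrete with a toy TCO model that amortizes price and energy over the hours the accelerator is actually doing useful work. All inputs below are hypothetical placeholders:

```python
# Toy TCO sketch: effective cost per *useful* accelerator-hour.
# All inputs are hypothetical; a cloud hourly price may already
# include energy, so adapt the terms to your own billing model.

def effective_cost_per_useful_hour(hourly_price: float,
                                   utilization: float,
                                   power_kw: float,
                                   energy_price_per_kwh: float) -> float:
    """Hourly price plus energy, divided by the useful-work fraction."""
    energy_cost = power_kw * energy_price_per_kwh
    return (hourly_price + energy_cost) / utilization

# A nominally cheaper option can cost more per useful hour
# if it sits idle waiting on data pipelines or interconnect.
cheap_but_idle = effective_cost_per_useful_hour(2.00, 0.40, 0.7, 0.12)
pricier_but_busy = effective_cost_per_useful_hour(3.00, 0.85, 0.7, 0.12)
print(round(cheap_but_idle, 2), round(pricier_but_busy, 2))
```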
Network architecture
Signal in favor of TPU: You need pod-level scaling for training/inference and can actively optimize interconnect behavior.
Where mistakes happen: If network and software stack are not ready, adding chips will not produce linear performance gains.
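The non-linear scaling warning can be sketched with an Amdahl-style model in which a fixed fraction of each training step is serialized synchronization/communication. The 5% overhead figure is a hypothetical input, not a measured value:

```python
# Sketch: why adding chips does not yield linear gains when the
# interconnect is the bottleneck. Amdahl-style model with a fixed
# non-parallelizable communication fraction (hypothetical value).

def speedup(n_chips: int, comm_fraction: float) -> float:
    """Speedup over one chip when comm_fraction of each step
    does not parallelize across chips."""
    return 1.0 / (comm_fraction + (1.0 - comm_fraction) / n_chips)

for n in (8, 64, 256):
    print(f"{n} chips -> {speedup(n, 0.05):.1f}x")
```

Even a modest 5% serialized fraction caps 256 chips well below 20x, which is why the v4/v5 generations invested in optical switching and pod fabric rather than only per-chip compute.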
Engineering ecosystem
Signal in favor of TPU: The team already uses GCP managed services and is ready to invest in XLA/JAX/TensorFlow profiling.
Where mistakes happen: If your stack is deeply CUDA-centric and multi-cloud portability is a hard requirement, migration cost can be high.
What to take into your own System Design
- Plan hardware strategy as part of product architecture, not as an afterthought.
- Optimize not only model quality, but also training/inference cycle economics.
- Design a portability layer to lower vendor lock-in risk.
- Measure end-to-end efficiency: model + interconnect + software stack + operations.
- Validate scalability on production-like data paths and SLOs, not only synthetic benchmarks.
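The portability-layer point above can be sketched as a thin backend interface: product code depends on a small protocol, so swapping TPU for GPU serving touches one adapter rather than the whole codebase. All class and method names are illustrative, not a real library:

```python
# Sketch of a portability layer reducing vendor lock-in.
# Names are illustrative only; real adapters would wrap an
# XLA-based (TPU) or CUDA-based (GPU) serving runtime.

from typing import Protocol, Sequence

class InferenceBackend(Protocol):
    def load(self, model_path: str) -> None: ...
    def infer(self, batch: Sequence[list]) -> Sequence[list]: ...

class EchoBackend:
    """Stand-in backend for tests; reverses each input row."""
    def load(self, model_path: str) -> None:
        self.model_path = model_path
    def infer(self, batch):
        return [list(reversed(row)) for row in batch]

def serve(backend: InferenceBackend, batch):
    # Product code depends only on the protocol, not a vendor SDK.
    return backend.infer(batch)

b = EchoBackend()
b.load("models/demo")
print(serve(b, [[1.0, 2.0, 3.0]]))  # [[3.0, 2.0, 1.0]]
```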
References
book_cube #3822
Part 1: TPU background and evolution v1-v3.
book_cube #3823
Part 2: evolution of v4-v7 and comparison with GPU.
Google Cloud: TPU transformation (10-year look back)
An official overview of the evolution of TPU generations.
In-Datacenter Performance Analysis of a TPU (ISCA 2017)
Classic article on TPU v1 and comparisons with CPU/GPU.
CloudExpat comparison
A cost-efficiency comparison of TPU v5e, H100, and Trainium (read critically, with attention to methodology).
All numerical comparisons in this chapter are provided as guidelines from the specified sources and require validation for a specific workload.
Related chapters
- Why should an engineer know ML and AI? - Section context and the role of ML thinking for architects.
- CPU vs GPU - Core accelerator differences before a TPU/GPU comparison.
- Google Global Network - Network foundation relevant for distributed TPU/GPU clusters.
- Performance Engineering - Latency/throughput measurement and optimization practices.
