Primary source
book_cube TPU series
A two-part analysis of how the TPU emerged and how its generations evolved.
This chapter is compiled from the book_cube posts and official Google materials: how TPUs grew out of an infrastructure bottleneck, why Google needed an ASIC approach, and how the architecture evolved from an inference chip into a large-scale training platform and back to inference-first in the GenAI era.
Why did TPUs appear in the first place?
- Deliver a multi-fold improvement in price/performance for ML inference over the CPUs and GPUs then available.
- Move from decision to production deployment in a short time.
- Maintain cost efficiency as ML workloads grew across Google products.
Evolution of TPU by generation
TPU v1
Inference
- Development took ~15 months from start to deployment.
- 28 nm process technology, 700 MHz, ~40 W.
- Peak: 92 TOPS (INT8), a noticeable jump in energy efficiency.
TPU v2
Training + inference
- Transition from an inference-only chip to a combined train+infer platform.
- TPU Pod: a network of 256 chips.
- Ballpark figures: 180 TFLOPS, 64 GB HBM (per the chapter's sources).
TPU v3
Performance growth
- Liquid cooling introduced.
- Compute and memory bandwidth significantly increased.
- Ballpark figures: up to 420 TFLOPS (per the chapter's sources).
TPU v4
Scaling pod networks
- Optical circuit switching to speed up inter-chip communication.
- Focus on distributed training of large-scale models.
- Ballpark figures: 275 TFLOPS per chip (per the chapter's sources).
TPU v5e / v5p
Cost optimization
- Emphasis on cost-effective training and inference.
- Improved power efficiency and pod scaling.
- Support for sparsity and more flexible workload profiles.
TPU v6 Trillium
Performance leap
- Up to 4.7x compute growth per chip vs TPU v5e (according to Google).
- Doubled HBM capacity and bandwidth, and doubled interconnect bandwidth.
- ~67% higher energy efficiency vs TPU v5e (according to Google).
TPU v7 Ironwood
GenAI-era inference
- A return to the inference-first idea of TPU v1, but at a new scale.
- Up to 9,216 chips in a liquid-cooled cluster.
- Ballpark figures: 4,614 TFLOPS per chip, 192 GB HBM, 7.37 TB/s memory bandwidth.
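Taking the chapter's per-chip figure at face value, the pod-level peak can be sanity-checked with simple arithmetic. A minimal sketch (peak numbers only, ignoring real-world utilization and interconnect overhead):

```python
# Rough aggregate peak compute for a full Ironwood pod, using the
# chapter's per-chip figure. Peak numbers only: achieved utilization
# (MFU) on real LLM workloads is typically well below 100%.
chips_per_pod = 9_216
tflops_per_chip = 4_614            # peak, per the chapter's sources

pod_tflops = chips_per_pod * tflops_per_chip
pod_exaflops = pod_tflops / 1_000_000   # 1 EFLOPS = 10^6 TFLOPS

print(f"{pod_exaflops:.1f} EFLOPS peak")  # ≈ 42.5 EFLOPS
```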
TPU vs GPU: How to Read the Comparisons
Compute and memory profile
GPUs are generally more versatile; TPUs are optimized for tensor workloads and tightly integrated with the Google Cloud stack.
Economics
In several training and inference comparisons, TPUs show a favorable cost per workload, but the estimates depend heavily on the model, batch size, and level of optimization.
Ecosystem
NVIDIA's CUDA ecosystem is broader; TPUs win in scenarios where the team is already building its pipeline on TensorFlow/JAX and GCP managed infrastructure.
An important practical point: comparing FLOPS, tokens, or dollars without a shared methodology easily produces misleading conclusions. Check the model, precision, batch size, interconnect, software stack, and operational constraints.
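One way to apply a shared methodology is to normalize everything to a single unit such as cost per million tokens. A minimal sketch, where every number is a hypothetical placeholder rather than a measurement:

```python
# Minimal sketch: normalize accelerator comparisons to cost per
# million tokens. All figures below are hypothetical placeholders;
# measure throughput yourself for your model, precision, and batch.

def cost_per_million_tokens(tokens_per_second: float,
                            usd_per_hour: float) -> float:
    """Hourly price divided by hourly token throughput, scaled to 1M tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return usd_per_hour / tokens_per_hour * 1_000_000

# Hypothetical example: two accelerators serving the same model.
accel_a = cost_per_million_tokens(tokens_per_second=12_000, usd_per_hour=4.20)
accel_b = cost_per_million_tokens(tokens_per_second=9_000, usd_per_hour=2.80)
print(f"A: ${accel_a:.3f}/Mtok, B: ${accel_b:.3f}/Mtok")
# → A: $0.097/Mtok, B: $0.086/Mtok
```

Even this simple normalization makes the hidden variables explicit: the throughput figure bakes in model, precision, batch size, and software stack, so it must be measured rather than quoted from a datasheet.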
Strengths and weaknesses of the TPU approach
Pros
- Specialization in tensor operations and deep learning.
- High energy efficiency and strong TCO economics in a number of AI scenarios.
- Deep integration with Google Cloud, TensorFlow and JAX.
- Good scalability via TPU Pod approach.
Limitations
- Availability is primarily via Google Cloud.
- Less versatility for atypical computing workloads.
- The overall tooling ecosystem is narrower than the one around CUDA.
- Risks of vendor lock-in when the architecture is deeply tied to TPU specifics.
References
book_cube #3822
Part 1: TPU background and evolution v1-v3.
book_cube #3823
Part 2: evolution of v4-v7 and comparison with GPU.
Google Cloud: TPU transformation (10-year look back)
An official overview of the evolution of TPU generations.
In-Datacenter Performance Analysis of a TPU (ISCA 2017)
The classic paper on TPU v1 and its comparisons with CPUs/GPUs.
CloudExpat comparison
A cost-efficiency comparison of TPU v5e, H100, and Trainium (read critically, with attention to methodology).
All numerical comparisons in this chapter are guidelines taken from the sources above and must be validated against your specific workload.
