The TPU story shows that in AI, the model is only part of the answer; the compute base it can live on economically matters just as much.
The chapter shifts the discussion away from abstract "more compute" toward architectural choices about throughput, memory, energy, training versus inference, and the cost of the full infrastructure.
It is especially useful wherever you need to explain why hardware choices change not only performance, but also the product bets a team can afford to make.
Practical value of this chapter
Design in practice
Translate guidance on TPU evolution and accelerator-infrastructure impact on AI architecture into architecture decisions for data flow, model serving, and quality control points.
Decision quality
Evaluate system quality through both model and platform metrics: precision/recall, latency, drift, cost, and operational risk.
Interview articulation
Frame answers as data -> model -> serving -> monitoring, showing where constraints appear and how you manage them.
Trade-off framing
Make trade-offs explicit for TPU evolution and accelerator-infrastructure impact on AI architecture: experiment speed, quality, explainability, resource budget, and maintenance complexity.
Primary source
book_cube TPU series
A two-part analysis of how the TPU emerged and how it evolved across generations.
This chapter is compiled from the book_cube posts and official Google materials: how TPUs emerged from an infrastructure bottleneck, why Google needed an ASIC approach, and how the architecture evolved from an inference chip into a large-scale training platform and back to inference-first in the GenAI era.
Why did TPUs appear in the first place?
Deliver a multi-fold price/performance gain for ML inference compared to the CPUs/GPUs available at the time.
Move quickly from decision to production deployment.
Maintain cost efficiency as ML load grew across Google products.
Evolution of TPU by generation
TPU v1
Inference
- Developed in ~15 months from start to deployment.
- 28 nm process technology, 700 MHz, ~40 W.
- Benchmark: 92 TOPS INT8, a noticeable jump in energy efficiency.
TPU v2
Training + inference
- Transition from a “chip for inference” to a train+infer platform.
- TPU Pods networking 256 chips.
- Order of magnitude: 180 TFLOPS, 64 GB HBM (according to chapter sources).
TPU v3
Performance growth
- Liquid cooling introduced.
- Compute and memory bandwidth have been significantly increased.
- Order of magnitude: up to 420 TFLOPS (according to chapter sources).
TPU v4
Pod network scaling
- Optical circuit switching to speed up inter-chip communication.
- Focus on distributed training of large scale models.
- Order of magnitude: 275 TFLOPS per chip (according to chapter sources).
TPU v5e / v5p
Cost optimization
- Emphasis on cost-effective training and inference.
- Improved power efficiency and pod scaling.
- Support for sparsity and more flexible workload profiles.
TPU v6 Trillium
Performance leap
- Up to 4.7x per-chip compute vs TPU v5e (according to Google).
- Double HBM capacity/throughput and interconnect bandwidth.
- ~67% higher energy efficiency vs TPU v5e (according to Google).
TPU v7 Ironwood
GenAI-era inference
- A return to the inference-first idea of TPU v1, but at a new scale.
- Up to 9,216 chips in a liquid-cooled cluster.
- Order of magnitude: 4,614 TFLOPS/chip, 192 GB HBM, 7.37 TB/s memory bandwidth.
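The Ironwood figures above can be sanity-checked with a simple roofline calculation: at 4,614 TFLOPS and 7.37 TB/s, a kernel needs roughly 626 FLOPs per byte of memory traffic to be compute-bound. A minimal sketch using the chapter's numbers (the assumption that 4,614 TFLOPS is a low-precision peak is mine):

```python
# Roofline sanity check on the chapter's TPU v7 (Ironwood) figures.
# Assumption: 4,614 TFLOPS is peak low-precision compute per chip.

peak_flops = 4614e12   # FLOP/s per chip (chapter figure)
mem_bw = 7.37e12       # bytes/s HBM bandwidth (chapter figure)

# Arithmetic intensity (FLOPs per byte) at which the chip shifts
# from memory-bound to compute-bound.
ridge_point = peak_flops / mem_bw
print(f"ridge point: {ridge_point:.0f} FLOPs/byte")

def attainable_tflops(intensity_flops_per_byte: float) -> float:
    """Attainable TFLOPS for a kernel of given arithmetic intensity."""
    return min(peak_flops, mem_bw * intensity_flops_per_byte) / 1e12

# Bandwidth-heavy kernels (e.g. ~2 FLOPs/byte, typical of decode-phase
# GenAI inference) see only a small fraction of the peak.
print(f"at 2 FLOPs/byte: {attainable_tflops(2):.1f} TFLOPS")
print(f"at 1000 FLOPs/byte: {attainable_tflops(1000):.1f} TFLOPS")
```

This is why the v6/v7 generations emphasize HBM capacity and bandwidth as much as raw compute: inference workloads often sit left of the ridge point.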
TPU vs GPU: How to Read the Comparisons
Compute and memory profile
GPUs are usually more versatile; TPUs are optimized for tensor workloads and tightly integrated with the Google Cloud stack.
Economics
In several training/inference comparisons, TPUs show better cost per workload, but the estimates depend heavily on the model, batch size, and optimization level.
Ecosystem
NVIDIA's CUDA ecosystem is broader; TPUs win where the team is already building its pipeline on TensorFlow/JAX and GCP managed infrastructure.
An important practical point: comparing FLOPS, tokens, or dollars without a common methodology easily produces distorted conclusions. Compare on the same model, precision, batch size, interconnect, software stack, and operational constraints.
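One way to apply a common methodology is to normalize everything to cost per million generated tokens at a fixed model, batch, and precision. A minimal sketch; all prices and throughputs below are hypothetical placeholders, not vendor benchmarks:

```python
# Sketch: normalize accelerator comparisons to cost per million tokens
# instead of raw FLOPS or hourly price. All numbers are hypothetical.

def cost_per_million_tokens(hourly_price_usd: float,
                            tokens_per_second: float) -> float:
    """Cost to generate 1M tokens at a sustained measured throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1e6

# Same model, same batch and precision settings -- the only fair basis.
candidates = {
    "accelerator_a": {"price": 4.20, "tps": 9000},  # hypothetical
    "accelerator_b": {"price": 2.10, "tps": 4000},  # hypothetical
}
for name, c in candidates.items():
    print(name, round(cost_per_million_tokens(c["price"], c["tps"]), 3))
```

Note how the nominally cheaper hourly price can lose once throughput is factored in; that is exactly the distortion the methodology warning is about.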
Key TPU Evolution Inflection Points
From product bottleneck to a dedicated ASIC
TPU v1 was not a research experiment, but a practical response to a production problem: keep inference latency and cost under control for Google services as neural model usage grew quickly.
Impact: From day one, the architecture was designed for production SLA and data center energy efficiency, not only for peak benchmark numbers.
v2/v3 shift: from inference chip to train+infer platform
As model sizes increased, accelerating inference alone became insufficient. TPU v2/v3 added large-scale training support, HBM memory, and pod-level scaling.
Impact: Google could accelerate the full ML lifecycle in one stack: experiments, training, and production inference.
v4/v5 shift: inter-chip network and pod economics
In distributed training, compute is only part of the bottleneck; the interconnect becomes critical. TPU evolution shifted focus toward the network fabric, pod scaling, and TCO.
Impact: Optimization moved to the full-system level: compute + memory + network + operations.
v6/v7 shift: inference-first again in the GenAI era
GenAI workloads put inference back in the center: long contexts, high throughput demands, and predictable latency at scale.
Impact: TPU v7 Ironwood effectively revisits the original v1 idea, but at massive cluster scale and with modern memory/interconnect characteristics.
Strengths and weaknesses of the TPU approach
Pros
- Specialization in tensor operations and deep learning.
- High energy efficiency and strong TCO economics in a number of AI scenarios.
- Deep integration with Google Cloud, TensorFlow and JAX.
- Good scalability via TPU Pod approach.
Cons
- Availability is primarily via Google Cloud.
- Less versatility for atypical computing workloads.
- The tool ecosystem as a whole is narrower than that around CUDA.
- Risks of vendor lock-in when the architecture is deeply tied to TPU specifics.
TPU Selection Framework for Real Projects
Workload profile
Signal in favor of TPU: Repeated tensor-heavy training/inference tasks with a clear optimization path in TensorFlow/JAX.
Where mistakes happen: If you have many custom kernels or mixed workloads, GPU versatility can be more important.
Data center economics
Signal in favor of TPU: Token/iteration cost and long-term energy efficiency are core constraints.
Where mistakes happen: Without a proper TCO model, choosing by raw hourly hardware price often leads to wrong conclusions.
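The "raw hourly price" trap can be made concrete with a toy TCO model that amortizes price and energy over the hours the accelerator is actually doing useful work. All inputs below are hypothetical placeholders:

```python
# Toy TCO sketch: effective cost per *useful* accelerator-hour.
# All inputs are hypothetical; a cloud hourly price may already
# include energy, so adapt the terms to your own billing model.

def effective_cost_per_useful_hour(hourly_price: float,
                                   utilization: float,
                                   power_kw: float,
                                   energy_price_per_kwh: float) -> float:
    """Hourly price plus energy, divided by the useful-work fraction."""
    energy_cost = power_kw * energy_price_per_kwh
    return (hourly_price + energy_cost) / utilization

# A nominally cheaper option can cost more per useful hour
# if it sits idle waiting on data pipelines or interconnect.
cheap_but_idle = effective_cost_per_useful_hour(2.00, 0.40, 0.7, 0.12)
pricier_but_busy = effective_cost_per_useful_hour(3.00, 0.85, 0.7, 0.12)
print(round(cheap_but_idle, 2), round(pricier_but_busy, 2))
```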
Network architecture
Signal in favor of TPU: You need pod-level scaling for training/inference and can actively optimize interconnect behavior.
Where mistakes happen: If network and software stack are not ready, adding chips will not produce linear performance gains.
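The non-linear scaling warning can be sketched with an Amdahl-style model in which a fixed fraction of each training step is serialized synchronization/communication. The 5% overhead figure is a hypothetical input, not a measured value:

```python
# Sketch: why adding chips does not yield linear gains when the
# interconnect is the bottleneck. Amdahl-style model with a fixed
# non-parallelizable communication fraction (hypothetical value).

def speedup(n_chips: int, comm_fraction: float) -> float:
    """Speedup over one chip when comm_fraction of each step
    does not parallelize across chips."""
    return 1.0 / (comm_fraction + (1.0 - comm_fraction) / n_chips)

for n in (8, 64, 256):
    print(f"{n} chips -> {speedup(n, 0.05):.1f}x")
```

Even a modest 5% serialized fraction caps 256 chips well below 20x, which is why the v4/v5 generations invested in optical switching and pod fabric rather than only per-chip compute.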
Engineering ecosystem
Signal in favor of TPU: The team already uses GCP managed services and is ready to invest in XLA/JAX/TensorFlow profiling.
Where mistakes happen: If your stack is deeply CUDA-centric and multi-cloud portability is a hard requirement, migration cost can be high.
What to take into your own System Design
- Plan hardware strategy as part of product architecture, not as an afterthought.
- Optimize not only model quality, but also training/inference cycle economics.
- Design a portability layer to lower vendor lock-in risk.
- Measure end-to-end efficiency: model + interconnect + software stack + operations.
- Validate scalability on production-like data paths and SLOs, not only synthetic benchmarks.
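The portability-layer point above can be sketched as a thin backend interface: product code depends on a small protocol, so swapping TPU for GPU serving touches one adapter rather than the whole codebase. All class and method names are illustrative, not a real library:

```python
# Sketch of a portability layer reducing vendor lock-in.
# Names are illustrative only; real adapters would wrap an
# XLA-based (TPU) or CUDA-based (GPU) serving runtime.

from typing import Protocol, Sequence

class InferenceBackend(Protocol):
    def load(self, model_path: str) -> None: ...
    def infer(self, batch: Sequence[list]) -> Sequence[list]: ...

class EchoBackend:
    """Stand-in backend for tests; reverses each input row."""
    def load(self, model_path: str) -> None:
        self.model_path = model_path
    def infer(self, batch):
        return [list(reversed(row)) for row in batch]

def serve(backend: InferenceBackend, batch):
    # Product code depends only on the protocol, not a vendor SDK.
    return backend.infer(batch)

b = EchoBackend()
b.load("models/demo")
print(serve(b, [[1.0, 2.0, 3.0]]))  # [[3.0, 2.0, 1.0]]
```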
References
book_cube #3822
Part 1: TPU background and evolution v1-v3.
book_cube #3823
Part 2: evolution of v4-v7 and comparison with GPU.
Google Cloud: TPU transformation (10-year look back)
An official overview of the evolution of TPU generations.
In-Datacenter Performance Analysis of a TPU (ISCA 2017)
Classic article on TPU v1 and comparisons with CPU/GPU.
CloudExpat comparison
A cost-efficiency comparison of TPU v5e, H100, and Trainium (read critically, with attention to methodology).
All numerical comparisons in this chapter are provided as guidelines from the specified sources and require validation for a specific workload.
Related chapters
- Why should an engineer know ML and AI? - Section context and the role of ML thinking for architects.
- CPU vs GPU - Core accelerator differences before a TPU/GPU comparison.
- Google Global Network - Network foundation relevant for distributed TPU/GPU clusters.
- Performance Engineering - Latency/throughput measurement and optimization practices.
