Updated: April 5, 2026 at 2:52 PM

The history of Google TPUs and their evolution

How Google moved from TPU v1 for inference to Ironwood: architectural trade-offs, compute economics, and what distinguishes the TPU approach from GPU-centric designs.

The TPU story matters not only as hardware history, but as an example of how compute economics reshape ML architecture.

The chapter shows why accelerator specialization, the split between training and inference, and platform constraints directly affect engineering decisions around models.

It is especially useful when hardware choice and the cost of the live path become part of the architectural trade-off.

Practical value of this chapter

Hardware lens

See how accelerator choices reshape system design around ML models.

Compute economics

Connect accelerator strategy to the economics of training and inference.

Platform context

Understand why infrastructure choices become part of ML product design, not just platform plumbing.

Architecture narrative

Add a more mature infrastructure perspective to ML architecture answers.

Primary source

Book Cube TPU series

A two-part breakdown of why Google needed TPUs and how the generations evolved.

This chapter brings together the Book Cube posts and official Google materials to explain how TPUs grew out of an infrastructure bottleneck, why Google chose a dedicated ASIC path, and how the architecture moved from an inference accelerator to a large-scale training platform before bending back toward inference in the GenAI era.

Why did TPUs appear in the first place?

Sharply improve the price-to-performance ratio for ML inference compared with the CPUs and GPUs available at the time.

Make the decision quickly and get it into a working deployment under tight timelines.

Keep the economics sustainable as ML load grows across Google services.

Evolution of TPU by generation

2015

TPU v1

Inference
  • It took about 15 months to go from project start to production deployment.
  • 28 nm process technology, 700 MHz, ~40 W.
  • Peak throughput: 92 TOPS at INT8, a noticeable jump in energy efficiency.
2017

TPU v2

Training + inference
  • A shift from an inference-only accelerator to a platform that could handle both training and inference.
  • TPU Pod: 256 chips connected by a dedicated network.
  • Order of magnitude: 180 TFLOPS and 64 GB of HBM per four-chip board, based on the sources used in this chapter.
2018

TPU v3

Performance growth
  • Liquid cooling introduced.
  • Compute capacity and memory bandwidth increased substantially.
  • Order of magnitude: up to 420 TFLOPS per four-chip board, based on the sources used in this chapter.
2021

TPU v4

Scaling pod networks
  • Optical circuit switching to accelerate inter-chip communication.
  • Focus on distributed training for large-scale models.
  • Order of magnitude: 275 TFLOPS per chip, based on the sources used in this chapter.
2023

TPU v5e / v5p

Cost optimization
  • Emphasis on more efficient economics for training and inference.
  • Improved power efficiency and pod scaling.
  • Support for sparsity and more flexible load profiles.
2024

TPU v6 Trillium

Performance leap
  • Up to 4.7x more compute per chip than TPU v5e, according to Google.
  • HBM capacity and throughput doubled, and interconnect bandwidth also increased.
  • Roughly 67% higher energy efficiency than TPU v5e, according to Google.
2025

TPU v7 Ironwood

Inference in the GenAI era
  • A return to the idea of an accelerator built primarily for inference, like TPU v1, but at a very different scale.
  • Up to 9,216 chips in a liquid-cooled cluster.
  • Order of magnitude: 4,614 TFLOPS per chip, 192 GB of HBM, and 7.37 TB/s of memory bandwidth.
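
The Ironwood figures above can be turned into two rough sanity checks: the aggregate peak of a full cluster and the arithmetic intensity at which a kernel stops being limited by HBM bandwidth. The sketch below is a minimal illustration and not part of the cited sources. The roofline framing, the function names, and the sample intensity of 2 FLOPs per byte are assumptions introduced here, and peak numbers say nothing about sustained performance.

```python
# Peak per-chip figures quoted in this chapter for TPU v7 Ironwood.
PEAK_TFLOPS_PER_CHIP = 4_614.0
HBM_BANDWIDTH_TB_PER_S = 7.37
HBM_CAPACITY_GB = 192
CHIPS_PER_CLUSTER = 9_216


def cluster_peak_exaflops(chips: int = CHIPS_PER_CLUSTER,
                          tflops_per_chip: float = PEAK_TFLOPS_PER_CHIP) -> float:
    """Aggregate peak compute in exaFLOPS (1 exaFLOPS = 1e6 TFLOPS)."""
    return chips * tflops_per_chip / 1e6


def cluster_hbm_petabytes(chips: int = CHIPS_PER_CLUSTER,
                          gb_per_chip: float = HBM_CAPACITY_GB) -> float:
    """Total HBM across the cluster in petabytes (1 PB = 1e6 GB)."""
    return chips * gb_per_chip / 1e6


def ridge_point_flops_per_byte(tflops: float = PEAK_TFLOPS_PER_CHIP,
                               tb_per_s: float = HBM_BANDWIDTH_TB_PER_S) -> float:
    """Roofline ridge point: FLOPs per byte of HBM traffic needed to be compute-bound."""
    return tflops / tb_per_s  # the "tera" prefixes cancel


def attainable_tflops(flops_per_byte: float,
                      tflops: float = PEAK_TFLOPS_PER_CHIP,
                      tb_per_s: float = HBM_BANDWIDTH_TB_PER_S) -> float:
    """Simple roofline bound: min(peak compute, intensity * memory bandwidth)."""
    return min(tflops, flops_per_byte * tb_per_s)


if __name__ == "__main__":
    print(f"cluster peak: {cluster_peak_exaflops():.1f} exaFLOPS")
    print(f"cluster HBM:  {cluster_hbm_petabytes():.2f} PB")
    print(f"ridge point:  {ridge_point_flops_per_byte():.0f} FLOPs/byte")
    # Hypothetical low-intensity kernel, e.g. a memory-bound decode step.
    print(f"bound at 2 FLOPs/byte: {attainable_tflops(2.0):.2f} TFLOPS")
```

Under these figures a full cluster peaks at roughly 42.5 exaFLOPS, while a kernel needs on the order of 600 FLOPs per byte of HBM traffic before the compute peak becomes the limit. That is one way to see why memory bandwidth and capacity, not only TFLOPS, matter for inference-heavy workloads.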

TPU and GPU: how to read the comparisons

Compute profile

GPUs are usually more versatile, while TPUs are tuned more aggressively for tensor-heavy workloads and fit more deeply into the Google Cloud stack.

Economics

In a number of comparisons, TPUs look stronger on cost per useful unit of work, but the answer depends heavily on the model, batch size, and the quality of optimization.

Ecosystem

NVIDIA's CUDA ecosystem is broader; TPUs are especially strong when the team is already building around TensorFlow, JAX, and GCP services.

Practical takeaway: comparing FLOPS, tokens, and dollars without a shared methodology is an easy way to fool yourself. Look at the model, numeric precision, batch size, interconnect, software stack, and the latency and throughput budget you actually need to hit.
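
One way to enforce a shared methodology is to record every run in the same shape and refuse to rank runs that differ in precision or batch size or that miss the latency budget. The sketch below is a minimal illustration. The `Measurement` fields, the candidate names, and all numbers are hypothetical and do not describe any real accelerator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """One benchmark run; fields must be measured the same way for every candidate."""
    name: str
    precision: str            # e.g. "bf16" or "int8"
    batch_size: int
    tokens_per_second: float  # sustained throughput of the whole deployment
    p99_latency_ms: float
    dollars_per_hour: float   # price of the whole deployment, not of one chip


def cost_per_million_tokens(m: Measurement) -> float:
    """Dollars spent to produce one million tokens at the measured throughput."""
    tokens_per_hour = m.tokens_per_second * 3600
    return m.dollars_per_hour / tokens_per_hour * 1_000_000


def comparable(a: Measurement, b: Measurement) -> bool:
    """Runs are only comparable when precision and batch size match."""
    return a.precision == b.precision and a.batch_size == b.batch_size


def rank(measurements: list[Measurement], latency_budget_ms: float) -> list[Measurement]:
    """Drop runs that miss the latency budget, then sort by cost per million tokens."""
    if any(not comparable(measurements[0], m) for m in measurements[1:]):
        raise ValueError("runs differ in precision or batch size; re-measure first")
    within_budget = [m for m in measurements if m.p99_latency_ms <= latency_budget_ms]
    return sorted(within_budget, key=cost_per_million_tokens)


# Hypothetical runs, all at bf16 and batch size 32.
runs = [
    Measurement("option-a", "bf16", 32, 20_000, 350.0, 40.0),
    Measurement("option-b", "bf16", 32, 12_000, 280.0, 30.0),
    Measurement("option-c", "bf16", 32, 30_000, 900.0, 45.0),
]
ranking = [m.name for m in rank(runs, latency_budget_ms=500.0)]
```

In this made-up data, `option-c` has the lowest cost per token but is excluded because it misses the 500 ms budget, which is exactly the kind of detail a raw FLOPS or price comparison hides.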

Key inflection points in TPU evolution

From product bottleneck to a dedicated ASIC

TPU v1 was not a research experiment, but a practical response to a production bottleneck: Google needed a dedicated ASIC path to keep inference latency and cost under control as neural workloads grew rapidly.

Impact: From day one, the architecture was designed around production SLAs and data center energy efficiency, not just peak benchmark numbers.

v2/v3 shift: from inference accelerator to general platform

As model sizes increased, accelerating inference alone was no longer enough. TPU v2/v3 added support for large-scale training, HBM, and pod-level scaling.

Impact: Google could speed up the full ML lifecycle in one stack: experiments, training, and live inference.

v4/v5 shift: inter-chip network and pod economics

In distributed training, compute is only part of the limit; interconnect becomes critical. TPU evolution increased the focus on network fabric, pod-level scaling, and total cost of ownership.

Impact: Optimization moved to the full-system level: compute, memory, network, and operations together.

v6/v7 shift: inference-first again in the GenAI era

GenAI workloads pulled inference back into the center: long contexts, high throughput demand, and predictable latency at scale.

Impact: TPU v7 Ironwood effectively revisits the original v1 idea, but at massive cluster scale and with a much more advanced memory and interconnect profile.

Strengths and weaknesses of the TPU approach

Pros

  • Specialization in tensor operations and deep learning.
  • High energy efficiency and strong total-cost-of-ownership economics in many AI scenarios.
  • Deep integration with Google Cloud, TensorFlow, and JAX.
  • Good scalability through the TPU Pod approach.

Cons

  • Availability is primarily via Google Cloud.
  • Less versatile for workloads that are not tensor-heavy.
  • The tooling ecosystem is narrower than the one around CUDA.
  • A meaningful vendor lock-in risk if the architecture becomes deeply tied to TPU-specific assumptions.

TPU selection framework for real projects

Load profile

Signal in favor of TPU: Repeated tensor-heavy training and inference tasks with a clear optimization path in TensorFlow and JAX.

Where mistakes happen: If you have many custom kernels or mixed tasks, GPU versatility may matter more.

Data center economics

Signal in favor of TPU: Cost per token or per training iteration and long-term energy efficiency are core constraints.

Where mistakes happen: Without a proper total-cost-of-ownership model, choosing by raw hourly hardware price often leads to the wrong answer.
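
A toy total-cost-of-ownership calculation shows how the hourly price can point the wrong way. The sketch below is a simplified model under stated assumptions: the function name, the utilization factor, the one-off migration cost, and every number are hypothetical, and a real model would also cover storage, networking, staffing, and commitment discounts.

```python
def cost_per_unit_of_work(hourly_price: float,
                          units_per_hour_at_full_load: float,
                          utilization: float,
                          lifetime_hours: float,
                          one_off_cost: float = 0.0) -> float:
    """Total spend divided by useful work delivered over the planning horizon.

    utilization is the fraction of paid hours that produce useful work;
    one_off_cost covers things like migration and re-optimization effort.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    if lifetime_hours <= 0 or units_per_hour_at_full_load <= 0:
        raise ValueError("lifetime and throughput must be positive")
    total_cost = hourly_price * lifetime_hours + one_off_cost
    total_units = units_per_hour_at_full_load * utilization * lifetime_hours
    return total_cost / total_units


# Hypothetical options: X is cheaper per hour, Y is faster but needs a migration.
def option_x(hours: float) -> float:
    return cost_per_unit_of_work(10.0, 1_000, 0.5, hours)


def option_y(hours: float) -> float:
    return cost_per_unit_of_work(14.0, 1_800, 0.5, hours, one_off_cost=20_000)


one_year = 8_760.0
short_pilot = 1_000.0
```

With these made-up inputs, option Y costs more per hour yet is cheaper per unit of work over a year, while option X wins for a short pilot because the one-off cost has not been amortized. The planning horizon and the one-off costs belong in the comparison alongside the hourly price.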

Network architecture

Signal in favor of TPU: You need pod-level scaling for training and inference and are ready to optimize interconnect behavior actively.

Where mistakes happen: If the network and software stack are not ready, adding chips will not produce linear performance gains.
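
The non-linear scaling can be illustrated with a deliberately simple model of data-parallel training: compute time shrinks with the number of chips, while the gradient exchange does not. The sketch below assumes a fixed global batch, a ring all-reduce that moves 2(n-1)/n of the gradient bytes per chip, and no overlap between communication and compute. The function names and all input values are hypothetical, and the model is not a description of how TPU pods actually schedule collectives.

```python
def step_time_seconds(chips: int,
                      single_chip_compute_s: float,
                      gradient_bytes: float,
                      link_bytes_per_s: float) -> float:
    """Time for one training step: perfectly divided compute plus ring all-reduce."""
    if chips < 1:
        raise ValueError("chips must be >= 1")
    compute = single_chip_compute_s / chips
    if chips == 1:
        return compute  # nothing to exchange on a single chip
    # Ring all-reduce: each chip sends and receives 2 * (n - 1) / n of the gradients.
    communication = 2 * (chips - 1) / chips * gradient_bytes / link_bytes_per_s
    return compute + communication


def scaling_efficiency(chips: int,
                       single_chip_compute_s: float,
                       gradient_bytes: float,
                       link_bytes_per_s: float) -> float:
    """Speedup over one chip divided by the number of chips (1.0 means linear)."""
    t1 = step_time_seconds(1, single_chip_compute_s, gradient_bytes, link_bytes_per_s)
    tn = step_time_seconds(chips, single_chip_compute_s, gradient_bytes, link_bytes_per_s)
    return (t1 / tn) / chips


# Hypothetical workload: 8 s of compute per step on one chip,
# 2 GB of gradients, 100 GB/s of usable link bandwidth per chip.
WORKLOAD = (8.0, 2e9, 100e9)
efficiencies = {n: scaling_efficiency(n, *WORKLOAD) for n in (1, 8, 64, 256)}
```

In this toy setup efficiency stays above 90% at 8 chips and falls below 50% at 256, because the communication term approaches a constant while the compute term keeps shrinking. Real systems soften this by overlapping communication with compute and by growing the batch with the chip count, which is the kind of active interconnect optimization the signal above refers to.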

Engineering ecosystem

Signal in favor of TPU: The team already uses GCP managed services and is ready to invest in XLA/JAX/TensorFlow profiling.

Where mistakes happen: If your stack is deeply CUDA-centered and multi-cloud portability is a hard requirement, migration cost can be high.

What to carry into your own architecture decisions

  • Treat accelerator strategy as part of product architecture, not as an afterthought buried in infrastructure.
  • Optimize not only model quality, but also the economics of the full training and inference cycle.
  • Design some portability into the system if reducing vendor lock-in is important.
  • Measure end-to-end efficiency across the full chain: model, memory, interconnect, software stack, and operations.
  • Validate scalability on production-like data paths and real SLOs, not only on synthetic benchmarks.

References

All numerical comparisons in this chapter are directional and come from the cited sources; they still need to be validated against a concrete workload.
