The announcement Google made at Cloud Next in Las Vegas on April 22 looked, on the surface, like an incremental hardware update. Two new chips, a new name for a data-center fabric, another round of performance claims. But the architecture beneath the numbers represents the clearest signal yet that the AI chip industry has reached a structural inflection point — one where a single general-purpose accelerator can no longer serve the economics of both training and inference, and where Google has decided to stop pretending otherwise.
For seven generations, Google's Tensor Processing Units were unified designs: one chip family meant to handle both the compute-intensive work of training large models and the latency-sensitive job of serving them to users. The eighth generation breaks that pattern entirely. TPU 8t — codenamed Sunfish, co-designed with Broadcom — is a training chip. TPU 8i — codenamed Zebrafish, co-designed with MediaTek — is an inference chip. Both are headed to TSMC's 2-nanometre process node, targeted for late 2027. The decision to split the product line is not an engineering footnote. It is a strategic statement about where AI infrastructure economics are heading, and a direct challenge to Nvidia's strategy of selling one class of hardware for everything.
Why Training and Inference Can No Longer Share Silicon

For most of the deep learning era, the gap between training a model and running it was narrow enough that the same hardware worked for both. GPUs were designed for parallel matrix math, and the math looked roughly similar whether you were computing gradients or generating tokens. That overlap justified Nvidia's product line and made unified accelerators the default.
That default is breaking down for a specific, measurable reason: the rise of Mixture-of-Experts architectures and multi-agent workloads. MoE models route each token through only a fraction of their total parameters, which means inference latency is dominated by memory bandwidth rather than raw compute. A chip tuned for dense matrix multiplication at training scale will be inefficient at inference for MoE models because the bottleneck is in the wrong place. Training needs peak FP8 throughput and interconnect bandwidth between chips in a pod. Inference needs massive on-chip SRAM, fast access to key-value attention caches, and low-latency serving of sparse activations to millions of concurrent users.
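To make the bottleneck concrete, here is a minimal sketch of top-k expert routing in plain NumPy, using illustrative dimensions that are not taken from any Google or production model: each token activates only a small slice of the expert weights, so the arithmetic per token is modest, while those weights and the key-value cache still have to be streamed from memory on every decoding step.

```python
import numpy as np

# Illustrative MoE dimensions -- hypothetical numbers, not any real model's.
d_model   = 4096      # hidden size
n_experts = 64        # total experts per MoE layer
top_k     = 2         # experts actually activated per token

# Each expert is a small feed-forward block: two weight matrices.
d_ff = 4 * d_model
params_per_expert = 2 * d_model * d_ff

def route(token_activations, router_weights):
    """Pick the top-k experts for each token from the router logits."""
    logits = token_activations @ router_weights          # [tokens, n_experts]
    return np.argsort(logits, axis=-1)[:, -top_k:]       # indices of chosen experts

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, d_model)).astype(np.float32)
router = rng.standard_normal((d_model, n_experts)).astype(np.float32)
chosen = route(tokens, router)

total_params  = n_experts * params_per_expert
active_params = top_k * params_per_expert
print(f"experts chosen per token: {chosen.shape[1]}")
print(f"fraction of expert weights touched per token: {active_params / total_params:.1%}")
# With 2 of 64 experts active, ~3% of the expert weights do arithmetic for a given
# token, but those weights (plus the KV cache) still have to be read from memory
# on every step -- which is why bandwidth, not FLOPS, bounds MoE decode latency.
```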
Google's engineers concluded that no single die can optimize both simultaneously without compromising one. The result is two chips with different memories, different interconnects, and different silicon tradeoffs — designed for different jobs.
The Economics of the Split

The financial implications are substantial. Google claims that TPU 8i delivers 80% better performance-per-dollar than the previous Ironwood generation on low-latency serving of MoE models. That is not a marginal efficiency gain. It is the kind of step-change that can redefine what cloud inference costs.
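A quick back-of-the-envelope conversion is worth keeping in mind when reading the rest of this piece (the 80% figure is Google's claim; the arithmetic is illustrative): 80% better performance-per-dollar is a 1.8x multiplier, which implies roughly a 44% cut in cost for the same inference work, not an 80% one.

```python
# Back-of-the-envelope: what "80% better performance-per-dollar" implies for cost.
# The gain below is the claim quoted in the article; the conversion is illustrative.
perf_per_dollar_gain = 1.80                 # TPU 8i vs Ironwood, per Google's claim

cost_ratio = 1 / perf_per_dollar_gain       # cost of the same amount of inference work
print(f"cost per unit of inference work: {cost_ratio:.1%} of Ironwood")  # ~55.6%
print(f"implied cost reduction: {1 - cost_ratio:.1%}")                   # ~44.4%, not 80%
```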
For the training side, the TPU 8t superpod scales to 9,600 chips and two petabytes of shared HBM3e2 memory in a single fabric, reaching 121 FP4 exaFLOPS per pod and delivering 2.7 times better price-performance than Ironwood for large-scale training workloads. A new megascale data-center fabric called Virgo Network connects up to 134,000 TPU 8t chips with 47 petabits per second of non-blocking bisection bandwidth — bandwidth figures that begin to look less like a hardware spec and more like a statement of strategic intent about what Google believes future frontier training runs will require.
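Dividing those pod-level figures down to a single chip gives a rough sense of the scale involved. This is purely illustrative division of the numbers quoted above, not any published per-chip specification.

```python
# Rough per-chip figures derived only from the pod-level numbers quoted above.
# These are illustrative divisions, not official TPU 8t per-chip specifications.
chips_per_superpod   = 9_600
shared_hbm_bytes     = 2e15            # 2 PB of shared HBM across the superpod
pod_fp4_flops        = 121e18          # 121 FP4 exaFLOPS per pod

fabric_chips         = 134_000         # Virgo Network maximum scale
fabric_bisection_bps = 47e15           # 47 Pb/s of non-blocking bisection bandwidth

print(f"HBM per chip:         ~{shared_hbm_bytes / chips_per_superpod / 1e9:.0f} GB")
print(f"FP4 compute per chip: ~{pod_fp4_flops / chips_per_superpod / 1e15:.1f} PFLOPS")
print(f"bisection share per chip: ~{fabric_bisection_bps / fabric_chips / 1e9:.0f} Gb/s")
```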
For enterprise customers, the split matters in a direct way: running AI agents at scale is almost entirely an inference problem. An enterprise running millions of concurrent AI agents is not training a model. It is hitting inference endpoints thousands of times per second, trying to meet latency targets measured in hundreds of milliseconds, and paying a cloud bill that scales with every token generated. An 80% improvement in performance-per-dollar, if the numbers hold under real workloads, would be a meaningful reduction in the cost of deploying AI at production scale.
A Supply Chain Nvidia Cannot Replicate
What Google is building around these chips is as significant as the chips themselves. The supply chain behind TPU 8 now involves four partners: Broadcom for the training chip, MediaTek for the inference chip, Marvell for a new memory processing unit and a potential additional inference-optimized TPU currently in design discussions, and TSMC for 2nm fabrication. The Broadcom agreement for TPU design runs through 2031. Google reportedly plans to produce nearly two million of the Marvell memory processing units, with the design expected to be finalized within the next year.
This is a deliberate diversification play. Nvidia's competitive advantage has historically rested on a tightly integrated stack: CUDA software, NVLink interconnects, HBM sourced from a small number of suppliers, and a design process conducted almost entirely in-house. The result is a coherent product line where performance, software, and memory are co-optimized — but also one that can only be replicated or exceeded by another integrated actor.
Google's approach inverts that logic. By distributing chiplet design responsibilities across multiple semiconductor partners, Google reduces dependence on any single vendor relationship while pulling each partner's specialized expertise into the final product. Broadcom's chiplet packaging and interface design, MediaTek's experience with cost-optimized, high-volume silicon, and Marvell's memory architecture capabilities each contribute to a system that no single company outside Google could assemble in the same way.
The Register and TheNextWeb noted that Google remains committed to offering Nvidia hardware to cloud customers alongside its own — Nvidia's Vera Rubin GPU is expected to be available on Google Cloud later this year. Google is not trying to evict Nvidia from its cloud. It is trying to build enough internal silicon capability that Nvidia's pricing leverage diminishes, particularly for the inference workloads where Google Cloud competes most directly with AWS, Azure, and a widening set of inference-focused startups.
What Changes for Cloud Buyers
For enterprises and developers buying compute, the near-term practical implications are limited — TPU 8t and 8i are both targeted for late 2027. But the announcement shapes purchasing decisions and architectural choices being made today.
First, the performance claims signal where Google Cloud's competitive positioning is headed. Any enterprise currently running large-scale AI inference on Nvidia hardware inside Google Cloud should be watching the TPU 8i roadmap closely. An 80% improvement in performance-per-dollar, if it holds up under production workloads, would make re-architecting workloads around Google's silicon worth engineering time in a way that earlier TPU generations could not justify.
Second, the MoE architecture optimization is a tell about where Google believes frontier AI is going. Google, Meta, and Mistral have all shipped MoE-based models. The fact that Google is designing inference hardware around MoE serving — not around dense transformer inference — signals that its internal model roadmap is firmly in the MoE direction. Developers building on Google Cloud infrastructure would be building for the same architectural future.
Third, the Virgo Network fabric, with its 47 petabit-per-second bandwidth at 134,000-chip scale, is relevant mainly to the handful of frontier labs training at that scale today. For everyone else, the TPU 8i inference story is the more immediately actionable development.
The Vera Rubin Signal and What Coexistence Means
Google's decision to keep Nvidia in its cloud while launching competing silicon is itself a strategy, not a contradiction. Bloomberg and TechCrunch both noted that Google, like Microsoft and Amazon, offers Nvidia hardware as part of a portfolio, with Vera Rubin expected on the platform in 2026. This coexistence serves a near-term purpose — enterprise customers have Nvidia-dependent software stacks that are not trivially portable — but it also reflects a longer-term calculation.
Google is not trying to win by replacing Nvidia on day one. It is trying to win by building internal silicon that is competitive enough on specific workloads — training at massive scale, MoE inference at low latency — that the proportion of Google Cloud revenue flowing through Nvidia hardware declines over time. The chip margin stays inside Alphabet instead of going to Santa Clara. For a company with the capital spending profile Alphabet has committed to AI infrastructure, capturing the chip economics on even a fraction of that spend represents tens of billions of dollars in cost structure over a decade.
Bloomberg has reported that Alphabet is simultaneously planning to commit up to $40 billion to Anthropic, which runs substantially on Google Cloud infrastructure. Anthropic makes Claude, a frontier model with rapidly growing enterprise adoption. The TPU 8i inference chip is, in a meaningful sense, also being designed for the workload that Anthropic will be scaling through Google's infrastructure in the same 2027 timeframe the chip is targeting.
The Strategic Wager Google Is Making
Google is betting that AI infrastructure in 2027 and beyond will be defined by two separate workloads with incompatible silicon requirements — and that being the only hyperscaler to have shipped purpose-built dies for both, at 2nm, with a diversified four-partner supply chain, is a durable advantage.
That bet is not guaranteed. Training-inference splits require software ecosystems to catch up. The same MLOps tooling that already handles GPU heterogeneity will need to route workloads efficiently across TPU 8t and 8i clusters. Enterprises with mixed training-and-inference needs will face more architectural decisions, not fewer. And the 2027 target date leaves substantial room for Nvidia's Vera Rubin and whatever generation follows it to reset the competitive baseline.
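What that routing problem looks like in practice can be sketched in a few lines. The pool names, job fields, and thresholds below are entirely hypothetical, not any real Google Cloud API; the point is only to show the kind of placement decision schedulers will have to make once training and inference silicon diverge.

```python
from dataclasses import dataclass

# Hypothetical scheduler sketch: pool names, job fields, and thresholds are
# illustrative assumptions, not a real Google Cloud or TPU API.
@dataclass
class Job:
    name: str
    kind: str              # "training" or "serving"
    latency_slo_ms: float  # end-to-end latency target for serving jobs
    world_size: int        # number of chips requested

def choose_pool(job: Job) -> str:
    """Route a job to a training-optimized or inference-optimized pool."""
    if job.kind == "training" or job.world_size > 256:
        return "tpu-8t-superpod"      # bandwidth-heavy, large-world training fabric
    if job.latency_slo_ms <= 500:
        return "tpu-8i-serving"       # memory-bound, low-latency MoE serving
    return "tpu-8i-batch"             # throughput-oriented offline inference

jobs = [
    Job("moe-pretrain",   "training", latency_slo_ms=0,     world_size=8_192),
    Job("agent-endpoint", "serving",  latency_slo_ms=300,   world_size=8),
    Job("eval-sweep",     "serving",  latency_slo_ms=5_000, world_size=64),
]
for j in jobs:
    print(f"{j.name:14s} -> {choose_pool(j)}")
```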
But the direction of the bet reflects a real analysis of where AI compute economics are breaking. The companies that shape that inflection — with hardware designed for the actual workloads rather than hardware adapted from prior generations — are the ones that set the pricing and performance baselines that every other player has to respond to. Google, at Cloud Next 2026, made clear it intends to be one of those companies.
Sources: TechCrunch (April 22), Bloomberg (April 22), TheNextWeb (April 22), The Register (April 22), Google Cloud Blog, NextPlatform.