20,000 Chips

The Cerebras IPO, the architectural shift beneath it, and the milestones I will be watching

By Ritesh Vajariya | May 17, 2026 | 11 min read

Two disclosures upfront. I spent eighteen months at Cerebras as Global Head of Generative AI Strategy and Business Development. I left on good terms. I also hold Cerebras stock as vested RSUs, and I am genuinely happy about that. I want you to know it as you read this.

This is not investment advice. This is analysis based entirely on public information: the S-1, Cerebras's own press releases and technical blogs, OpenAI and AWS announcements, and industry coverage. My background informs how I read that public information. It does not give me access to anything non-public, and nothing in what follows depends on anything that is not openly disclosed.

Cerebras priced its IPO at $185 per share on May 13, 2026, raising $5.55 billion. The stock opened at $350, hit an intraday high of $386 that briefly pushed the fully diluted market cap past $100 billion, and closed its first day at $311, up 68%. It is the largest US technology IPO since Uber's 2019 debut.

Every publication covered the pop. The deeper story is what had already been built, and what is now being assembled around it.

The Wrong First-Order Read

The market framed this as a contrarian architecture finally getting its public validation. A decade-long bet on wafer-scale computing. The GPU alternative story goes public.

That framing has the timeline wrong. Cerebras did not become validated on May 14, 2026. The validation has been arriving in revenue lines, customer contracts, and production deployments for several years.

Look at the revenue trajectory in the S-1. $25 million in 2022. $79 million in 2023. $290 million in 2024. $510 million in 2025, up 76% year over year. That curve is not a thesis. That is paying customers expanding deployments.
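The growth rates behind that curve are easy to check. Here is a minimal sketch using only the four revenue figures quoted above; the function and variable names are mine, chosen for illustration.

```python
# Annual revenue in millions of USD, as quoted above from the S-1.
revenue_musd = {2022: 25, 2023: 79, 2024: 290, 2025: 510}


def yoy_growth(series):
    """Return {year: percent growth over the prior year} for a {year: value} mapping."""
    years = sorted(series)
    return {
        year: (series[year] / series[prev] - 1) * 100
        for prev, year in zip(years, years[1:])
    }


growth = yoy_growth(revenue_musd)
for year, pct in growth.items():
    print(f"{year}: {pct:.0f}% year over year")
```

The 2025 figure comes out at roughly 76%, matching the number above. The two earlier years each work out to more than a tripling of revenue.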

Now look at who those customers are. MBZUAI in the UAE is training frontier English-Arabic models on Cerebras systems. G42's three Condor Galaxy supercomputers were built on Cerebras hardware and have been operational since 2023. Meta's Llama API runs on Cerebras infrastructure. Mistral's Le Chat assistant. Perplexity's AI search. IBM enterprise applications. Cognition's Devin software engineering agent. Mayo Clinic's Genomic Foundation Model for personalized medicine. GSK's drug discovery models. The US Department of Energy, the Department of Defense, and a $45 million DARPA contract on optical chip interconnects with Ranovus for real-time battlefield simulation.

The breadth was already there. The honest caveat is that the revenue picture is more concentrated than the customer roster suggests, with MBZUAI and G42 accounting for 86% of 2025 revenue per the S-1. Worth noting that the OpenAI deployment began in February 2026 with Codex Spark, so OpenAI revenue is not yet visible in the 2025 numbers; that contribution starts showing up in 2026 results. I will come back to revenue concentration in the scenarios. But the production work spans sovereign AI, frontier model training, consumer AI platforms, agentic coding, biomedical research, drug discovery, and national defense. Cerebras did not show up at the Nasdaq with a promising architecture and a single anchor customer.

What OpenAI Just Confirmed

The OpenAI relationship matters because it added a new kind of proof on top of an existing customer base.

On February 12, 2026, three months before the IPO, OpenAI launched GPT-5.3-Codex-Spark. Codex Spark runs exclusively on Cerebras Wafer-Scale Engine 3. It is the first OpenAI model served on non-Nvidia hardware. It delivers over 1,000 tokens per second, roughly fifteen times faster than the standard Codex model. Codex Spark has been available to ChatGPT Pro users as a research preview for about three months, and Codex more broadly now has over a million weekly active users per Tom's Hardware.

What OpenAI confirmed is not that the architecture works. The Llama API, Le Chat, Devin, and Perplexity had already confirmed that at scale. What OpenAI confirmed is that Cerebras can run latency-critical workloads for OpenAI's product surfaces, and that OpenAI was willing to make Cerebras its first non-Nvidia production deployment.

Sachin Katti, OpenAI's head of industrial compute, framed what it changed: Cerebras adds a dedicated low-latency inference tier. GPUs remain foundational for training and broad inference. The Cerebras tier is for workloads where response speed is the product experience.

The Architectural Shift Beneath the IPO

Here is the story the IPO coverage has largely skipped: the two largest AI infrastructure stacks in the world have independently converged on the same architectural principle in the same quarter. Cerebras itself laid this out in a March blog titled The GPU Is Being Split in Half. The framing is correct, and it matters more than the day-one stock chart.

On March 13, 2026, Cerebras and AWS announced a disaggregated inference solution. AWS Trainium handles prefill, the compute-intensive prompt processing phase. The Cerebras CS-3 handles decode, the memory-bandwidth-intensive token generation phase. They are connected over Amazon's Elastic Fabric Adapter and accessible through Amazon Bedrock. David Brown, VP of Compute and ML Services at AWS, framed the expected result as inference "an order of magnitude faster" than what is available today.

Three days later, at GTC 2026 in San Jose, Nvidia announced its LPX rack, built around the LP30 chip, integrating Groq's LPU architecture into the Vera Rubin inference stack. The technique is called Attention FFN Disaggregation. The GPU runs attention, which is memory-bound and dynamic. The LPU runs the feed-forward network portion of decode, which is deterministic and latency-sensitive. They are connected through Spectrum-X Ethernet.

The two stacks use different chips, different cloud platforms, and slightly different disaggregation strategies. But the architectural principle is the same: production inference at scale requires specialized silicon for each phase of the workload, not monolithic GPU clusters. Cerebras's own blog captures it well: "Every cloud provider is moving here. AWS and Cerebras, NVIDIA and Groq, Oracle, Azure. Every major LLM serving framework, Dynamo, SGLang, vLLM, llm-d, already supports disaggregation."
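To make the prefill and decode split concrete, here is a toy sketch of the control flow. It is not how AWS, Cerebras, Nvidia, or any serving framework implements disaggregation. Every class and function name below is hypothetical, and the "model" is a stand-in that only labels token positions. The point is the shape: one tier processes the whole prompt once and hands its state to a second tier that generates tokens one at a time.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class KVCache:
    """Stand-in for the attention key/value state that prefill produces."""
    prompt_tokens: List[str]
    generated: List[str] = field(default_factory=list)


class PrefillWorker:
    """Hypothetical compute-bound tier: processes the whole prompt in one pass."""

    def run(self, prompt: str) -> KVCache:
        return KVCache(prompt_tokens=prompt.split())


class DecodeWorker:
    """Hypothetical bandwidth-bound tier: emits one token per step."""

    def run(self, cache: KVCache, max_new_tokens: int) -> List[str]:
        for _ in range(max_new_tokens):
            # A real model would read the full cache at every step, which is
            # why decode is limited by memory bandwidth. This toy only labels
            # each new token with its position in the sequence.
            position = len(cache.prompt_tokens) + len(cache.generated)
            cache.generated.append(f"tok{position}")
        return cache.generated


def serve(prompt: str, max_new_tokens: int,
          prefill: PrefillWorker, decode: DecodeWorker) -> List[str]:
    cache = prefill.run(prompt)  # phase 1, on the prefill tier
    # In a real deployment the cache would cross an interconnect here.
    return decode.run(cache, max_new_tokens)  # phase 2, on the decode tier


output = serve("explain wafer scale engines", 3, PrefillWorker(), DecodeWorker())
print(output)
```

Because the two phases only meet at the cache handoff, each tier can in principle be sized and built on different silicon, which is the property both announcements rely on.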

This is the architectural verdict the IPO is actually pricing. Not "Cerebras as the GPU alternative," which was always the wrong framing. Cerebras as the decode-tier silicon inside a disaggregated stack that is becoming the industry standard. Nvidia and AWS placed parallel bets on the same architectural future, using different specialized chips, three days apart.

The performance case for the decode tier is published and specific. Per Cerebras's own benchmark comparison, the WSE-3 delivers roughly 1,000 to 2,000 times higher effective memory bandwidth than the Nvidia B200, because it keeps memory on-chip rather than relying on off-chip HBM. The agent workload math is even more concrete. A 10-step agent chain at standard GPU inference speeds of 50 tokens per second takes more than 30 seconds. The same chain on Codex Spark at 1,200 tokens per second finishes in under 3 seconds. That is not a marginal improvement. It is the difference between an agent that feels usable and one that does not.
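The agent chain figures can be reproduced with one line of arithmetic. The sketch below assumes each step generates 160 output tokens. That number is my assumption, not something from the benchmarks above. It also counts generation time only, ignoring prefill, network, and tool-call latency.

```python
def chain_seconds(steps: int, tokens_per_step: int, tokens_per_second: float) -> float:
    """Generation time for a sequential agent chain (generation only)."""
    return steps * tokens_per_step / tokens_per_second


STEPS = 10
TOKENS_PER_STEP = 160  # assumption for illustration, not from the source

gpu_seconds = chain_seconds(STEPS, TOKENS_PER_STEP, 50)
spark_seconds = chain_seconds(STEPS, TOKENS_PER_STEP, 1200)

print(f"50 tok/s:    {gpu_seconds:.1f} s")
print(f"1,200 tok/s: {spark_seconds:.1f} s")
print(f"speedup:     {gpu_seconds / spark_seconds:.0f}x")
```

Under these assumptions, both claims (more than 30 seconds, under 3 seconds) hold for any per-step output between 150 and 360 tokens. The speedup is 24x regardless of step size, because it is just the ratio of the two token rates.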

The race from here is execution. Nvidia plans to ship LPX in Q3 2026. On the AWS side, my hunch is that the real production rollout and the strategic positioning of the Trainium-CS-3 solution will land at AWS re:Invent in late November or early December 2026. AWS uses re:Invent as its flagship product launch venue every year, and the March announcement's "available in the next couple of months" framing fits that timeline.

The Manufacturing Math

The Master Relationship Agreement disclosed in the S-1 commits OpenAI to over $20 billion for 750 megawatts of Cerebras inference capacity through 2028, with expansion provisions to 2 gigawatts by 2030.

Those numbers demand a calculation worth running carefully.

Each CS-3 system draws 23 kilowatts. A production deployment also requires supporting infrastructure (CPU machines, networking hardware, and cooling), adding roughly 10 to 12 kilowatts per unit. When NextPlatform did this math in January using raw CS-3 power alone, the answer came to 32,768 systems. When you account for supporting infrastructure overhead, the number comes down toward 20,000 systems. The realistic deployment range is somewhere in between.
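Here is that calculation as a short sketch. It uses only the figures quoted above: 750 megawatts, 23 kilowatts per CS-3, and 10 to 12 kilowatts of overhead. The names are illustrative, and the sketch assumes the full contracted power goes to systems and their direct supporting infrastructure.

```python
CONTRACT_MW = 750           # contracted capacity, from the S-1 as quoted above
CS3_KW = 23                 # per-system draw, as quoted above
OVERHEAD_KW_RANGE = (10, 12)  # supporting infrastructure per unit, as quoted above


def systems_for(total_mw: float, kw_per_system: float) -> float:
    """Number of systems a power budget supports at a given per-system draw."""
    return total_mw * 1000 / kw_per_system


raw = systems_for(CONTRACT_MW, CS3_KW)
with_low_overhead = systems_for(CONTRACT_MW, CS3_KW + OVERHEAD_KW_RANGE[0])
with_high_overhead = systems_for(CONTRACT_MW, CS3_KW + OVERHEAD_KW_RANGE[1])

print(f"raw CS-3 power only: {raw:,.0f} systems")
print(f"with 10 kW overhead: {with_low_overhead:,.0f} systems")
print(f"with 12 kW overhead: {with_high_overhead:,.0f} systems")
```

Straight division on raw power gives about 32,600 systems, close to the 32,768 figure cited above (which is exactly 2 to the 15th power). Adding the overhead brings the count to roughly 21,400 to 22,700, which is where the "toward 20,000" estimate comes from.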

What strikes me about this number is that the constraint is not capital. The IPO raised $5.55 billion. The market cap gives the company currency for further expansion. The constraint is execution: TSMC capacity allocation, supply chain coordination, and logistics across OpenAI's growing data center footprint. TSMC fabricates for Nvidia, Apple, AMD, and dozens of customers whose volumes dwarf Cerebras's. The Ranovus optical interconnect work is one of several pieces being built toward cluster-scale deployment, but the core silicon bottleneck remains TSMC capacity.

The manufacturing ramp is the variable that decides whether the OpenAI deal becomes the revenue the S-1 describes or a timeline that keeps stretching.

The Multimodal Question

One angle the analyst coverage has touched lightly is whether Cerebras competes for image generation and video generation workloads. Looking at the public benchmark roster, Cerebras has emphasized LLM inference and frontier training so far, with multimodal generation benchmarks notably absent compared to what GPUs and TPUs have published.

This is not a gap in the architecture. The wafer-scale design's memory bandwidth characteristics are well suited for diffusion workloads. The question is what happens when multimodal moves to the front of the queue.

The next wave of AI compute demand is not just faster text. It is real-time video, generative media, and multimodal production at scale. If Cerebras publishes competitive image and video benchmarks in the next twelve to eighteen months, the addressable market expands considerably beyond what the text-centric story supports. That upside is not yet priced into the current valuation. It is one of the most consequential open questions about where Cerebras goes from here.

What the Next Thirty-Six Months Decide

The IPO is a pricing event. The next thirty-six months are an execution event. Two scenarios diverge from here.

In the first, disaggregated inference becomes the standard and Cerebras holds the decode tier inside AWS while Nvidia holds it inside Vera Rubin. The OpenAI deployment scales toward the contracted 750 megawatts. Bedrock customers begin routing real enterprise traffic through CS-3. Cerebras publishes multimodal benchmarks. Revenue diversifies beyond the UAE concentration as the AWS and OpenAI channels mature. By 2028, the question for enterprise AI buyers is not whether to use specialized inference silicon, but which cloud stack they want to live inside.

In the second, Nvidia's LPX integration ships faster and absorbs the disaggregated inference market through CUDA ecosystem gravity before the AWS-Cerebras path scales. Cerebras holds a strong position in sovereign AI, frontier model training, and direct enterprise sales but does not achieve the breadth through Bedrock that the current valuation implies. UAE customer concentration remains a structural risk.

The convergence makes the first scenario more likely than the previous "Cerebras versus Nvidia" framing did, because both architectural paths now point toward specialized decode silicon as a permanent layer of the stack. But which specific implementations win which workloads is genuinely unsettled. The execution differences over the next eighteen months will produce most of the evidence.

The watch items are familiar at this point: whether the AWS Bedrock Trainium-CS-3 solution reaches general availability at re:Invent with real customer numbers, whether Nvidia's LPX ships on time in Q3 2026 with named deployment customers, whether OpenAI expands the Cerebras deployment from one model to several, and whether Cerebras publishes competitive multimodal benchmarks by early 2027.

What I Hope Happens From Here

A few hopes I have for Cerebras over the next eighteen months.

I hope Cerebras stays special-purpose. The biggest risk after a $100 billion valuation event is that the company tries to expand its story by taking on every adjacent workload and competing with Nvidia on Nvidia's terms. The wafer-scale architecture is most valuable when it is doing what nothing else can do, not when it is duplicating what GPUs already do well. Cerebras as the decode-tier silicon for latency-critical AI inference is a defensible position. Cerebras as another general-purpose accelerator chasing every compute problem on earth is not.

I hope the next WSE generation pushes on-chip SRAM significantly higher. 44 gigabytes on WSE-3 is the architectural moat today, but context windows are growing, reasoning models consume more memory per token, and multimodal workloads are even more memory-intensive than text. The industry direction is clear, including Groq's planned LP40 reportedly using hybrid bonded DRAM to extend on-chip memory. I would like to see WSE-4 push toward 100 gigabytes or more. The advantage depends on staying ahead on memory bandwidth at scale.

I hope Cerebras stays independent. The most architecturally interesting outcome of this IPO is that there is now a public, well-capitalized alternative to the GPU stack, accessible inside the AWS cloud. If Cerebras gets acquired, the bifurcation thesis weakens. The decode tier becomes a feature inside someone else's roadmap rather than a real competitive layer of the market. I hope the public markets give Cerebras the runway to stay standalone for at least the next decade.

I hope the multimodal benchmarks land within twelve to eighteen months. The architecture supports it. The team can prioritize it. Publishing benchmarks that compete credibly with TPUs and GPUs on image and video generation would close the most visible gap in the current story.

The Bottom Line

The IPO did not validate a thesis. It priced a business that had already been validated by paying customers, and that now sits inside an industry-wide architectural shift toward disaggregated inference. What Cerebras still has to prove is whether the architecture can be manufactured at the scale the OpenAI contract implies, distributed at the scale the AWS partnership requires, and expanded into multimodal workloads.

What that means for decision-makers is more concrete than waiting for re:Invent. Three things make sense to do now.

Start the architectural conversation about disaggregated inference today. The bifurcation is settled. Whether your organization ends up using Cerebras through AWS, LPX through Nvidia, or both, the framework for thinking about prefill versus decode is a conversation your AI infrastructure leaders should be having now. The vendor question follows the architectural one, and most teams have not yet had the architectural one.

Run a real pilot. Cerebras Cloud is publicly available today, with a free API tier for evaluation and paid tiers starting at $10 self-serve per the Cerebras pricing page. Worth noting that Cerebras Code, the higher-tier coding subscription at $50 to $200 per month, is currently sold out on that same page, which is itself a useful signal of where capacity is being allocated. The cost of a four-week evaluation on your actual latency-sensitive workloads is small. The cost of being late to the architectural shift, if it lands the way I think it will, is higher.
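For a pilot, the two numbers worth capturing per request are time to first token and sustained tokens per second. The sketch below is a vendor-neutral harness that measures both over any iterable of streamed tokens. It does not call the Cerebras API or any real SDK. The `fake_stream` generator is a hypothetical stand-in that you would replace with your provider's streaming response.

```python
import time
from typing import Dict, Iterable, Iterator, Optional


def measure_stream(stream: Iterable[str]) -> Dict[str, Optional[float]]:
    """Consume a token stream and report count, time to first token, and throughput."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    elapsed = time.perf_counter() - start
    return {
        "tokens": count,
        "ttft_s": None if first_token_at is None else first_token_at - start,
        "tokens_per_s": count / elapsed if elapsed > 0 else None,
    }


def fake_stream(n_tokens: int, delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a provider's streaming response."""
    for i in range(n_tokens):
        time.sleep(delay_s)  # simulates per-token generation latency
        yield f"tok{i}"


stats = measure_stream(fake_stream(20, 0.001))
print(stats)
```

Running the same harness against each candidate endpoint, with prompts drawn from your own latency-sensitive workload, gives a like-for-like comparison. Published benchmark figures measured on someone else's traffic do not.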

Treat re:Invent as confirmation, not a gate. If the Trainium-CS-3 disaggregated solution reaches general availability with strong customer numbers in November or December, you have your scaling answer for managed services and you can move from pilot to production. If it does not, direct Cerebras Cloud remains available, Nvidia's LPX is shipping in parallel, and you have already done the architectural work to use either path.

The bet Cerebras made a decade ago was right. The bet the industry has now made, parallel to Cerebras, is that the bifurcation is permanent and the decode tier matters. The next eighteen months will show which implementations of that bet win which workloads. The decision to start preparing your stack for that future does not require waiting for the answer.

P.S. Mark your calendar for AWS re:Invent in late November 2026, but do not wait until then to act. Identify one AI workflow in your organization where latency directly determines product quality. That is your candidate for a disaggregated inference pilot. Cerebras Cloud is available this week. The re:Invent moment will confirm which path is executing faster. The decision to start does not need that confirmation.