6Pages write-ups are some of the most comprehensive and insightful I’ve come across – they lay out a path to the future that businesses need to pay attention to.
— Head of Deloitte Pixel
At 500 Startups, we’ve found 6Pages briefs to be super helpful in staying smart on a wide range of key issues and shaping discussions with founders and partners.
— Thomas Jeng, Director of Innovation & Partnerships, 500 Startups
6Pages is a fantastic source for quickly gaining a deep understanding of a topic. I use their briefs for driving conversations with industry players.
— Associate Investment Director, Cambridge Associates
1. Deal-making and inference chips
  • Driving down the cost of inference is critical for the AI industry’s economics. Inference chips (vs. training chips) will be the larger part of the AI chip market – perhaps as soon as this year – and will eventually represent 90-95% of the market. By one estimate, as many as 80-90% of GPUs (graphics processing units) might already be used for inference – up from about 40% two years ago.
  • This past year has seen growing momentum behind inference chips, in terms of deal-making, fundraising, and custom accelerators. This week, AWS (Amazon Web Services) announced a multi-year partnership to use startup Cerebras’ inference chip – the Wafer-Scale Engine (WSE) – in its data centers. Cerebras’ dinner-plate-sized WSE with on-chip embedded SRAM (static random-access memory) is optimized for speed at large scale, claiming up to 15x faster inference vs. GPU-based clouds. Cerebras signed a $10B deal with OpenAI in Jan 2026, and last month raised $1B at a $23B valuation.
  • Separately, Nvidia plans to unveil a new “inference system” next month that will include a chip designed by startup Groq, which likewise uses an on-chip embedded SRAM architecture. Groq’s LPUs (Language Processing Units) are specialized AI accelerators optimized for low latency in applications like real-time voice and chatbots. According to Groq, inference using its LPUs is “faster than Nvidia’s best chips” and uses 1/3 to 1/6 as much power. Back in Dec 2025, Nvidia acquired a non-exclusive license to Groq’s IP and hired its CEO, in a reported $20B deal.
  • OpenAI’s aggressive maneuvers to spread its bets have spurred other major chip buyers to lock up their own supply. OpenAI embarked on a spate of chip deal-making in H2 2025 that has continued in recent months (e.g. AMD, Oracle, CoreWeave, SK Hynix, Samsung, Broadcom, Nvidia, Amazon, Cerebras). Last month, Meta followed with its own series of major chip deals across the industry – a 6-GW agreement with AMD, an expanded relationship with Nvidia worth tens of billions of dollars, and a deal worth billions of dollars to rent TPUs (Tensor Processing Units) from Google. Both OpenAI and Meta are also working on their own accelerators, with Meta revealing 4 new in-house AI chips earlier this week.
  • M&A is on the table as players seek to bolster their inference-chip capabilities. Last month, after acquisition talks failed, Intel invested $350M in AI chip startup SambaNova – viewed as relatively weak among the SRAM-based chip startups – in a multi-year collaboration that will see SambaNova use Intel server chips and graphics cards. Before that, in Jun 2025, AMD acquired the team behind AI inference chip developer Untether AI. In the same vein, Meta in Sep 2025 acquired AI chip startup Rivos, which was working on a GPU for inference based on the open-source RISC-V chip architecture. (Meta had previously made an $800M offer for AI chip startup FuriosaAI, which was turned down in Mar 2025.)
  • Fundraising for inference-chip startups has been on a tear. In addition to Cerebras’ raise, there have been big funding rounds for D-Matrix ($275M at a $2B valuation; Nov 2025), Positron ($230M at a $1B+ valuation; Feb 2026), Taalas ($169M; Feb 2026), and Axelera AI ($250M+; Feb 2026), among others.
  • The recent attention on inference chips largely comes down to two factors – speed and cost. Nvidia’s general-purpose GPU architectures – which are dominant in training – are not always ideal for inference. Nvidia GPUs’ external high-bandwidth memory (HBM) offers more capacity (good for training large models) but is significantly slower than the SRAM-heavy architectures used by Cerebras, Groq, and startups like D-Matrix and Taalas. As a point of reference, OpenAI has been using Nvidia chips to power most of its inference fleet but was reportedly dissatisfied with their inference speed in domains like software development and agentic AI. (GPU instances and slower inference can also mean higher operational costs in cloud environments with usage-based billing.) SRAM can be up to 1,000x faster than the HBM4 used in Nvidia GPUs, although it is space-inefficient with lower throughput-per-dollar (a back-of-envelope throughput sketch follows this list). Cerebras’ and Groq’s traction suggests, however, that the market is willing to pay more for low latency.
  • Cost is still a factor for workloads that can be readily shifted to non-GPU processors. While inference costs have plummeted on a per-query basis, surging usage means that overall spend has continued its steep rise (a short arithmetic sketch follows this list). OpenAI’s inference costs – which it pays in cash – roughly tripled last year, with some estimates suggesting they may have exceeded its revenue. Anthropic’s inference costs in 2025 were reportedly 23% higher than expected, resulting in gross margins (40%) that were 10% lower than expected.
  • These different chips can be combined in various ways, depending on the use case. For instance, CPUs can be integrated with GPUs to produce powerful thin-and-light PCs with long battery life – a laptop analogue of today’s smartphones. An “AI PC” might add a specialized NPU (neural processing unit) alongside the CPU and GPU to run AI models locally (an illustrative snippet for targeting an NPU follows this list).
  • Some companies are taking a disaggregated approach to inference, using different chips for the compute-heavy “prefill” stage that processes the input prompt vs. the memory-bound “decode” stage that generates the answer token by token (a routing sketch follows this list). Amazon is linking Cerebras chips with its own in-house Trainium3 accelerators using Elastic Fabric Adapter (EFA) networking, with Trainium3 chips handling prefill and Cerebras chips handling decode. Similarly, Nvidia’s anticipated inference system is expected to use its GPUs (Rubin or Rubin CPX) for prefill, while Groq’s chips handle decode. The disaggregated approach is considered ideal for “large, stable workloads.”
  • Inference infrastructure isn’t just hardware. Startups optimizing the inference-serving software layer are seeing a fundraising boom. AI inference platform Fireworks AI raised $250M at a $4B valuation (Oct 2025). “AWS for inference” startup Baseten raised $300M at a $5B valuation (Jan 2026), with Nvidia investing $150M. Inferact (from the creators of vLLM) raised $150M at an $800M valuation (Jan 2026). RadixArk (developer of SGLang) raised $400M (Jan 2026). Other inference-optimization startups include Luminal, Tensormesh, Clarifai, FriendliAI, and Together AI. (A minimal vLLM example follows this list.)
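Why SRAM-heavy chips are so much faster at decode comes down to memory bandwidth: generating each token requires streaming roughly the full model weights from memory. Here is a minimal back-of-envelope sketch, using illustrative numbers of our own rather than vendor specifications:

```python
# Back-of-envelope: batch-1 decode is memory-bandwidth-bound, since each
# new token requires reading roughly all model weights from memory.
# Both bandwidth figures below are illustrative assumptions, not specs.

def max_decode_tokens_per_sec(model_bytes: float, mem_bw_bytes_per_sec: float) -> float:
    """Rough upper bound on single-stream decode throughput."""
    return mem_bw_bytes_per_sec / model_bytes

MODEL_BYTES = 70e9  # 70B-parameter model at 1 byte/param (8-bit quantized)

print(max_decode_tokens_per_sec(MODEL_BYTES, 3.3e12))  # ~47 tok/s at ~3.3 TB/s (HBM-class)
print(max_decode_tokens_per_sec(MODEL_BYTES, 1e15))    # ~14,000 tok/s at ~1 PB/s (SRAM-class)
```

The gap narrows with batching and caching, but the single-stream latency advantage of on-chip SRAM is what Cerebras and Groq are selling.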
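The divergence between per-query and total inference costs is simple arithmetic – a toy illustration with entirely invented numbers:

```python
# Hypothetical illustration (all numbers invented): the per-token price
# falls 10x while query volume rises 50x, so total spend still rises 5x.
TOKENS_PER_QUERY = 1_000

def annual_spend(queries: float, price_per_m_tokens: float) -> float:
    """Total inference spend in dollars."""
    return queries * TOKENS_PER_QUERY / 1e6 * price_per_m_tokens

print(annual_spend(1e9, 10.00))  # year 1: $10M at $10 per 1M tokens
print(annual_spend(50e9, 1.00))  # year 2: $50M at $1 per 1M tokens
```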
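For a flavor of how software targets an AI PC’s NPU, here is a hedged sketch using ONNX Runtime’s execution-provider mechanism; which providers are actually available depends on the hardware and the installed onnxruntime build, and “model.onnx” is a hypothetical file:

```python
# Sketch: run a local model on an AI PC's NPU via ONNX Runtime, falling
# back to GPU then CPU. Provider availability depends on hardware and on
# the installed onnxruntime build; "model.onnx" is a hypothetical file,
# assumed here to take a single float32 input.
import numpy as np
import onnxruntime as ort

preferred = ["QNNExecutionProvider",  # e.g. Qualcomm NPUs
             "DmlExecutionProvider",  # DirectML (Windows GPU)
             "CPUExecutionProvider"]  # universal fallback
providers = [p for p in preferred if p in ort.get_available_providers()]

session = ort.InferenceSession("model.onnx", providers=providers)
inp = session.get_inputs()[0]
shape = [d if isinstance(d, int) else 1 for d in inp.shape]  # fill dynamic dims
outputs = session.run(None, {inp.name: np.zeros(shape, dtype=np.float32)})
```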
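The disaggregated prefill/decode pattern can also be sketched in the abstract. The interfaces below are hypothetical stand-ins (not Amazon’s or Nvidia’s actual software): one backend runs the compute-heavy prefill and hands the resulting KV cache to a second backend for the token-by-token decode loop:

```python
# Minimal sketch of disaggregated inference: one backend does prefill over
# the full prompt, a second does the decode loop. All interfaces here are
# hypothetical; KV-cache updates during decode are elided for brevity.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class KVCache:
    data: bytes  # stand-in for the attention key/value state

class PrefillBackend(Protocol):
    def prefill(self, prompt_tokens: list[int]) -> KVCache: ...

class DecodeBackend(Protocol):
    def decode_step(self, cache: KVCache) -> int: ...  # returns next token id

def generate(prompt: list[int], prefiller: PrefillBackend,
             decoder: DecodeBackend, max_new_tokens: int, eos: int = 2) -> list[int]:
    cache = prefiller.prefill(prompt)        # e.g. the GPU / Trainium3 side
    out: list[int] = []
    for _ in range(max_new_tokens):          # e.g. the SRAM-chip side
        token = decoder.decode_step(cache)
        if token == eos:
            break
        out.append(token)
    return out
```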
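Finally, as a taste of the inference-serving software layer itself, a minimal offline-inference example using vLLM’s documented Python API (the model id is just an example; any supported Hugging Face model works):

```python
# Minimal offline inference with vLLM, the open-source engine built by the
# team behind Inferact. The model id is an example, not a recommendation.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Explain prefill vs. decode in one sentence."], params)
print(outputs[0].outputs[0].text)
```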
Related Content:
  • Feb 27 2026 (3 Shifts): Nvidia is returning to consumer PCs
  • Jan 9 2026 (3 Shifts): TPUs & custom AI chips will be a big category
Disclosure: Contributors have financial interests in Meta, Microsoft, Alphabet, Amazon, OpenAI, Anthropic, Broadcom, Robinhood, and Coinbase. Amazon, Google, and OpenAI are vendors of 6Pages.
Have a comment about this brief or a topic you'd like to see us cover? Send us a note at tips@6pages.com.