Reading Time Estimate
10 min read
Listen on: Apple Podcasts · Spotify · Google Podcasts
1. Google's TurboQuant algorithm
  • On Tuesday, Google Research described a software breakthrough called TurboQuant that’s been shaking up parts of the AI ecosystem. TurboQuant is an AI memory compression algorithm that Google claims is dramatically more efficient than existing approaches. In Google’s words, TurboQuant “reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency.” The CEO of CDN (content delivery network) Cloudflare, Matthew Prince, is calling this Google’s “DeepSeek moment.” Despite TurboQuant still being in the lab, it has already sent memory-chip stocks tumbling by nearly $100B this week amid a broader decline. Micron’s stock price alone is down 11% since Tuesday.
  • Background: AI generally works by manipulating high-dimensional vectors – data structures that serve as mathematical representations of complex information like a word or image. Because these vectors are so memory-intensive, a key-value (KV) cache is used during an AI chat as a running "digital cheat sheet" of prior context/attention – avoiding repeat calculations and allowing for faster output generation. The KV cache for large models can become massive as chats progress (see the sizing sketch below), which means it becomes a limiting factor as it uses up memory, slows down responsiveness, and raises costs (e.g. hardware, power). This is one of the reasons (but not the only one) why an AI chatbot can start breaking down during a long chat.
  • The process of vector quantization – mapping a larger input set (of continuous values) to a smaller output set (of discrete values) – can shrink high-dimensional vectors, but typically at a cost in accuracy, speed, and/or memory overhead (see the quantization sketch below). TurboQuant’s promise of zero accuracy loss, an 8x speedup, and a 6x reduction in needed memory seems to break this tradeoff.
  • To achieve its outcomes, TurboQuant leans on two other algorithms – PolarQuant and Quantized Johnson-Lindenstrauss (QJL). PolarQuant is used for high-quality compression, applying a clever technique in which the vector is converted into polar coordinates and randomly rotated in a way that maps the data onto a more predictable sphere-like grid (see the rotation sketch below for the general intuition). The other algorithm, QJL, is an efficient error-checker that addresses the small amount of error left over from PolarQuant.
  • According to Google, TurboQuant is particularly useful in two areas – (1) compressing the KV cache during inference, and (2) vector search. The first use case is getting the most attention right now, since the KV cache has become a key bottleneck in today’s AI production deployments. On the second, Google believes TurboQuant could enable faster and more efficient semantic search at its scale.
  • TurboQuant’s ability to compress the KV cache is particularly important during the decode stage of inference. During decode, a much smaller KV cache means less data needs to be moved from memory to the processors for every token generated, so the processors aren’t sitting idle as long waiting for data (see the decode-bandwidth sketch below). It doesn’t remove the memory bottleneck (i.e. decode doesn’t become compute-bound), but it does raise the ceiling.
  • Earlier this week, Menlo Ventures justified its investment in multi-silicon inference startup Gimlet Labs based on inference’s heterogeneous needs, saying: “Each stage requires different hardware: Prefill [translating queries into tokens] is compute-bound; decode [generating output] is memory-bound; and tool calls are network-bound.” People are now asking: will decode continue to be memory-bound in the future?
  • Driving down the cost of inference is critical for the AI industry’s economics. Inference (vs. training) will be the larger part of the AI market. By one estimate, as many as 80-90% of GPUs (graphics processing units) are already being used for inference – up from about 40% two years ago. TurboQuant comes at a time when the industry is experiencing major memory-chip shortages, which have been driving up the cost of AI chipsets as well as the price of electronics from laptops to PlayStation 5s. Earlier this month, memory-chip maker SK Group’s chairman indicated that he expected shortages to last for at least 4-5 years.
  • TurboQuant has the potential to impact the economics of the AI business, but maybe not as much as implied by the market reaction. First, it has less of an impact on training, as well as on the prefill stage of inference. In the decode stage of inference, if the KV cache is notionally 40-60% of memory in a datacenter, then a 6x reduction in the KV cache would roughly translate into a 2x reduction overall (see the arithmetic sketch below). (This would be even less if actual KV cache memory savings were more like 2.7x, as some analysts believe, rather than 6x.) While the cost of serving could fall and make deployment more profitable for AI players, memory would remain a bottleneck, despite the higher ceiling.
  • Also, keep in mind that memory-chip shortages – given the current supply-demand imbalance and the extended time to stand up more capacity – are likely to persist in the near term. And further out, cheaper inference could mean even more usage (Jevons Paradox) and demand – especially if compression algorithms allow larger models and long-context AI to be run on consumer edge devices like laptops and phones.
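To put the “massive” KV cache in concrete terms, the sizing sketch below works through the arithmetic for a hypothetical large transformer. The dimensions (layers, heads, head size, context length, fp16 storage) are illustrative assumptions, not figures from Google or any particular model.

```python
# Back-of-the-envelope KV-cache sizing (hypothetical model dimensions).

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the key-value cache for a single sequence.

    The leading 2 accounts for storing both keys and values;
    bytes_per_value=2 assumes fp16/bf16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Assumed dimensions for a large dense model (illustrative only):
cache = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=32_000)
print(f"KV cache at a 32k-token context: {cache / 1e9:.1f} GB per sequence")
# ≈ 84 GB for a single long conversation; a 6x compression would cut this to ~14 GB.
```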
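The decode-bandwidth sketch shows why that size matters most during decode: every new token requires streaming the (growing) KV cache from memory to the processors. The numbers are rough, using the cache size above and an approximate bandwidth figure for a modern HBM-based accelerator; both are assumptions, not benchmarks.

```python
# Rough decode-stage memory-traffic math (illustrative, not a measurement).

kv_cache_gb = 84.0            # from the sizing sketch above (assumption)
hbm_bandwidth_gb_s = 3350.0   # approximate bandwidth of a modern HBM accelerator

time_per_token_ms = kv_cache_gb / hbm_bandwidth_gb_s * 1000
print(f"Uncompressed cache: ~{time_per_token_ms:.0f} ms of memory traffic per token")

compressed_ms = (kv_cache_gb / 6) / hbm_bandwidth_gb_s * 1000
print(f"With a 6x-smaller cache: ~{compressed_ms:.1f} ms per token")
# Decode stays memory-bound, but the ceiling on tokens/sec rises roughly
# in proportion to the compression ratio.
```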
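The quantization sketch illustrates the usual tradeoff with a generic 8-bit scalar quantization of a random high-dimensional vector. This is not TurboQuant; it simply shows how shrinking the representation normally introduces some reconstruction error.

```python
import numpy as np

# Generic 8-bit scalar quantization of a vector (not TurboQuant):
# fewer bits per value -> less memory, but some reconstruction error.

rng = np.random.default_rng(0)
v = rng.standard_normal(4096).astype(np.float32)   # a high-dimensional vector

scale = np.abs(v).max() / 127.0                     # map floats onto int8 levels
q = np.round(v / scale).astype(np.int8)             # 1 byte per value instead of 4
v_hat = q.astype(np.float32) * scale                # dequantize for use in compute

mem_saving = v.nbytes / q.nbytes
rel_error = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
print(f"memory reduced {mem_saving:.0f}x, relative error {rel_error:.4f}")
# ~4x smaller with a small but nonzero error – the tradeoff TurboQuant claims to break.
```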
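The rotation sketch conveys the general intuition behind rotation-based quantizers: randomly rotating a vector spreads its energy evenly across coordinates, so a simple fixed grid captures it with less error. This is only that intuition in miniature; it is not the published PolarQuant or QJL algorithms.

```python
import numpy as np

# Intuition sketch: random rotation before quantization (not PolarQuant/QJL).

rng = np.random.default_rng(1)
d = 512
v = rng.standard_normal(d)
v[0] = 100.0                       # one dominant "spike" amid small values

# Random orthogonal rotation via QR decomposition of a Gaussian matrix
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
v_rot = Q @ v                      # same length, but energy spread across coordinates

def quantize_error(x, levels=16):
    """Relative error after rounding onto a uniform grid with `levels` levels."""
    scale = np.abs(x).max() / (levels / 2 - 1)
    x_hat = np.round(x / scale) * scale
    return np.linalg.norm(x - x_hat) / np.linalg.norm(x)

print(f"4-bit error without rotation: {quantize_error(v):.3f}")
print(f"4-bit error with rotation:    {quantize_error(v_rot):.3f}")
# The rotated vector quantizes with noticeably lower error at the same bit budget.
```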
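Finally, the arithmetic sketch shows the back-of-the-envelope math behind the "roughly 2x overall" figure, using the brief's own assumptions (KV cache at 40-60% of datacenter inference memory; 6x or 2.7x cache compression). The logic is Amdahl's-law-style: only the KV-cache share of memory shrinks.

```python
# How a KV-cache-only reduction translates into overall memory savings.
# Shares and compression ratios are the brief's assumptions, not measurements.

def overall_reduction(kv_share, kv_compression):
    remaining = (1 - kv_share) + kv_share / kv_compression
    return 1 / remaining

for share in (0.4, 0.5, 0.6):
    for compression in (6, 2.7):
        print(f"KV cache {share:.0%} of memory, {compression}x cache compression "
              f"-> {overall_reduction(share, compression):.1f}x overall")
# 6x cache compression yields roughly 1.5-2.0x overall; 2.7x yields roughly 1.3-1.6x.
```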
Related Content:
  • Mar 13 2026 (3 Shifts): Deal-making and inference chips
  • Dec 19 2025 (3 Shifts): Memory-chip shortages & higher prices
Become an All-Access Member to read the full brief here
All-Access Members get unlimited access to the full 6Pages Repository of 886 market shifts.
Become a Member
Disclosure: Contributors have financial interests in Meta, Microsoft, Alphabet, OpenAI, Anthropic, and Discord. Google and OpenAI are vendors of 6Pages.
Have a comment about this brief or a topic you'd like to see us cover? Send us a note at tips@6pages.com.