1. Compressing models without losing capability
  • PrismML has released its 1-bit Bonsai models for free under the permissive Apache 2.0 license. The research underlying the architecture was developed at Caltech, which owns the intellectual property, with PrismML as the exclusive licensee. While PrismML open-sourced the Bonsai models themselves, its whitepaper was vague about the compression method, which remains proprietary.
  • At the core of this is the insight that inference has different needs than training. PrismML is aiming to develop high-performance, efficient AI models that can run locally and privately on edge devices (e.g. phones, laptops, wearables, robotics), with low latency. Its 1-bit Bonsai 8B-parameter model has a memory footprint of just 1.15 GB (vs. a typical 16 GB), while its 4B- and 1.7B-parameter versions have memory footprints of 0.5 GB and 0.24 GB, respectively. As a result, it is suitable for “low-latency inference on consumer-grade CPUs, NPUs, and edge GPUs.”
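The memory footprints above follow from simple arithmetic on bits per weight. A rough back-of-envelope check (the helper function below is illustrative, not from PrismML, and it ignores runtime overhead such as activations and KV cache):

```python
# Back-of-envelope memory footprint for model weights at different
# precisions. The 1.15 GB figure for the 8B Bonsai model and the
# "typical 16 GB" baseline come from the brief; this arithmetic is
# a rough sanity check, counting weights only.

def weight_footprint_gb(params: float, bits_per_weight: float) -> float:
    """Size of the weights alone, in GB (10^9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

print(weight_footprint_gb(8e9, 16))     # fp16 baseline for 8B params: 16.0 GB
print(weight_footprint_gb(8e9, 1.125))  # ~1.125 GB, close to the reported 1.15 GB
```

The small gap between the computed 1.125 GB and the reported 1.15 GB is consistent with a handful of layers or metadata being kept at higher precision.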
  • PrismML’s model is significantly larger than Microsoft’s BitNet, and appears to be high-quality based on benchmarks. At 1.125 bits per weight, it is also closer to a true binary 1-bit model than BitNet’s 1.58-bit (ternary) design, a tradeoff Microsoft made for performance reasons. PrismML’s reported efficiency gains in memory use, speed, and energy consumption are a substantial step up from BitNet as well, offering greater incentives for adoption. Bonsai is also easier to use, since it runs on Nvidia GPUs via the widely used llama.cpp inference framework as well as on Apple Silicon.
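The "1.58-bit" and "1.125-bit" figures are information-theoretic costs per weight. A ternary weight drawn from {-1, 0, +1} carries log2(3) bits, which is where BitNet's name comes from; a near-binary scheme sits much closer to the 1-bit floor. A quick check (the decomposition of 1.125 as one sign bit per weight plus one shared bit per 8 weights is our plausible reading, not something the brief confirms):

```python
import math

# BitNet's ternary weights {-1, 0, +1} cost log2(3) bits of information each.
print(math.log2(3))  # ~1.585 -> the "1.58-bit" in BitNet's name

# One plausible (unconfirmed) reading of Bonsai's 1.125 bits/weight:
# 1 sign bit per weight + 1 bit of shared metadata per group of 8 weights.
print(1 + 1 / 8)     # 1.125
```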
  • PrismML’s results suggest that many modern LLMs may be over-parameterized for the tasks they perform. Conventional 16-bit floating-point weights mean each parameter carries high-resolution numerical information that may not be necessary for inference. These inefficiencies can be ironed out by improving how information is encoded and accessed, enabling compression without catastrophic degradation on downstream tasks. Bonsai is reportedly quite good at code generation and tool calls, for instance, although it carries some tradeoffs for math and reasoning.
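To make the encoding idea concrete, here is a minimal sketch of the general technique behind extreme quantization — replacing high-resolution floating-point weights with their signs plus one shared scale per row. This is a BitNet-style illustration, not PrismML's actual (proprietary) method:

```python
# Minimal sketch, assuming a BitNet-style absmean binarization scheme.
# Each weight is reduced to its sign; one fp scale per row preserves
# the average magnitude. Per-weight cost drops from 16 bits to ~1 bit.
import numpy as np

def binarize(w: np.ndarray):
    """Return (per-row absmean scale, sign of each weight as int8)."""
    scale = np.abs(w).mean(axis=1, keepdims=True)  # one fp scale per row
    signs = np.sign(w).astype(np.int8)             # values in {-1, 0, +1}
    return scale, signs

def dequantize(scale: np.ndarray, signs: np.ndarray) -> np.ndarray:
    """Reconstruct an approximation of the original weights."""
    return scale * signs

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
scale, signs = binarize(w)
w_hat = dequantize(scale, signs)
# w_hat keeps each weight's sign and the row's average magnitude,
# discarding the per-weight resolution the brief argues is often unneeded.
```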
  • The PrismML team believes that the future of inference is in 1-bit precision. The 1-bit architecture could be applied to larger models, shrinking a 2-TB model, for instance, down to 150 GB. If AI hardware (e.g. specialized accelerators) is designed for 1-bit precision in the future, it could make these models even faster and more energy-efficient by swapping complex multiplications for simple addition/subtraction.
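The hardware point can be illustrated directly: with weights constrained to {-1, +1}, a dot product reduces to running additions and subtractions of the activations, with no multiplications at all. A hypothetical sketch, not PrismML's kernel:

```python
# With 1-bit weights in {-1, +1}, each term of a dot product is either
# +x_i or -x_i, so the whole computation needs only add/subtract --
# the simplification that future 1-bit accelerators could exploit.
def dot_via_add_sub(x, signs):
    acc = 0.0
    for xi, s in zip(x, signs):
        acc = acc + xi if s > 0 else acc - xi  # add or subtract, never multiply
    return acc

x = [0.5, -1.0, 2.0, 0.25]
signs = [1, -1, -1, 1]
print(dot_via_add_sub(x, signs))  # 0.5 + 1.0 - 2.0 + 0.25 = -0.25
```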
  • The largest models will probably not be replaced wholesale by compressed 1-bit models. Instead, we’re likely to see a stratification. Highly capable frontier models will continue to serve as training and distillation backbones, and will handle the subset of more complex inference requests coming in from endpoints. In parallel, more efficient compressed models will be widely deployed for real-time inference across a multitude of everyday devices. This latter shift could prompt a broader rethinking of AI scaling laws as algorithmic efficiency rises in importance, opening up scaling dimensions that are less reliant on hardware and massive compute budgets.
Related Content:
  • Mar 13 2026 (3 Shifts): Deal-making and inference chips
Disclosure: Contributors have financial interests in Meta, Microsoft, Alphabet, Amazon, and Rivian. Amazon and Google are vendors of 6Pages.
Have a comment about this brief or a topic you'd like to see us cover? Send us a note at tips@6pages.com.