Apr 3 2026
12 min read
1. Compressing models without losing capability
- On Tuesday, Khosla Ventures-backed AI startup PrismML came out of stealth with a splash, open-sourcing “the world's first commercially viable 1-bit large language models.” “1-bit” refers to the amount of memory used to represent each weight in the model – the standard for AI model training is 16-bit (although mixed precision is commonly used in practice). The choice of precision has an impact on reasoning performance, memory use, inference compute, latency/speed, and energy consumption. In the past, low-bit models were not capable enough to be practically deployable. According to the PrismML team, its 1-bit Bonsai 8B-parameter model is competitive in performance with full-precision 8B models such as Meta’s Llama 3 8B, while being 14x smaller, 8x faster, and 4-5x more energy-efficient (i.e. it uses 75-80% less power).
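As a quick sanity check, the “75-80% less power” parenthetical follows directly from the “4-5x more energy-efficient” claim – an Nx-efficient model draws 1/N of the power. The figures are PrismML’s; the conversion below is just arithmetic:

```python
# Convert an "Nx more energy-efficient" multiple into percent power savings.
def savings_pct(efficiency_multiple: float) -> float:
    """An Nx-efficient model uses 1/N of the power, saving (1 - 1/N) of it."""
    return (1 - 1 / efficiency_multiple) * 100

print(savings_pct(4))  # 4x efficient -> 75% less power
print(savings_pct(5))  # 5x efficient -> 80% less power
```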
- PrismML has released the 1-bit Bonsai models for free under the permissive Apache 2.0 license. The research underlying its architecture was developed at Caltech, which owns the underlying intellectual property, with PrismML as the sole exclusive licensee. While PrismML open-sourced the Bonsai models, its whitepaper was vague about the compression method, which remains proprietary.
- Typical model compression through quantization – mapping continuous, high-precision values (e.g. 32-bit floating-point) to a smaller set of discrete, low-precision values (e.g. 8-bit integers) – reduces precision but still retains the numerical values. In contrast, for 1-bit models that are natively trained (rather than quantized after training), the weights are fully replaced with binary (-1, +1) or ternary (-1, 0, +1) values, requiring specialized training-time techniques and architectural changes. Bonsai, in particular, was trained from Alibaba’s open-source Qwen3-8B using Google v4 TPUs. Given that PrismML has not mentioned any faster training, it’s likely that this current generation of 1-bit models is still dependent on large-scale training pipelines.
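The distinction between the two approaches can be sketched in a few lines. Note that the ternarization rule below is a common heuristic from the ternary-network literature (a threshold on the mean absolute weight), used here purely for illustration – PrismML’s actual training-time method remains proprietary:

```python
# Post-training quantization: each high-precision weight is mapped onto a grid
# of discrete levels, but an approximate numeric value is retained.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) * scale for w in weights]  # dequantized values

# Native ternary training replaces weights outright with {-1, 0, +1}.
# This absmean-threshold rule is an illustrative stand-in, not PrismML's method.
def ternarize(weights):
    threshold = 0.7 * sum(abs(w) for w in weights) / len(weights)
    return [0 if abs(w) <= threshold else (1 if w > 0 else -1) for w in weights]

w = [0.9, -0.02, 0.4, -0.6]
print(quantize_int8(w))  # stays close to the original values
print(ternarize(w))      # [1, 0, 1, -1]
```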
- At the core of this is the insight that inference has different needs than training. PrismML is aiming to develop high-performance, efficient AI models that can run locally and privately on edge devices (e.g. phones, laptops, wearables, robotics), with low latency. Its 1-bit Bonsai 8B-parameter model has a memory footprint of just 1.15 GB (vs. a typical 16 GB), while its 4B- and 1.7B-parameter versions have memory footprints of 0.5 GB and 0.24 GB, respectively. As a result, it is suitable for “low-latency inference on consumer-grade CPUs, NPUs, and edge GPUs.”
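The memory footprints above fall out of bits-per-weight arithmetic. Using the 1.125 bits/weight figure cited later in the brief (the small gap vs. the reported 1.15 GB plausibly comes from embeddings and other layers kept at higher precision, though PrismML doesn’t break this down):

```python
# Approximate weight-storage size in GB (decimal, 1 GB = 1e9 bytes).
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(model_size_gb(8, 16))     # conventional 16-bit 8B model -> 16.0 GB
print(model_size_gb(8, 1.125))  # 1.125 bits/weight -> 1.125 GB, near Bonsai's 1.15 GB
```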
- Researchers have been talking about 1-bit LLMs (large language models) for more than two years. Microsoft introduced its BitNet 1-bit architecture in Oct 2023, later open-sourcing its bitnet.cpp inference framework for CPUs in Oct 2024 and its BitNet-b1.58-2B-4T (1.58-bit, 2B parameters, 4T training tokens) model in Apr 2025, both under the permissive MIT License. On CPUs, Microsoft reported that its BitNet model was 1.37x to 6.17x faster with 55-82% less energy consumption (vs. 16-bit models). It was also more memory-efficient at 0.4 GB (vs. 2 GB to 4.8 GB). While the BitNet model was competitive in performance against other small 1B- to 2B-parameter models, users grumbled that it was still “basically unusable.”
- 1-bit model selection has generally been limited up to now. As one Reddit user complained previously, “[T]here are no trained from scratch bitnet models that are any good.” Adoption of Microsoft’s BitNet was also constrained by the need to use the bitnet.cpp framework (vs. Hugging Face transformers) to achieve efficiency gains. There was the potential for biased and inaccurate outputs as well, with Microsoft warning, “We do not recommend using BitNet b1.58 in commercial or real-world applications without further testing and development.”
- PrismML’s model is significantly larger than Microsoft’s BitNet, and appears to be high-quality based on benchmarks. It is closer to being a true binary 1-bit model with 1.125 bits per weight, vs. BitNet’s 1.58-bit (ternary) model, which was a tradeoff that Microsoft made for performance reasons. PrismML’s reported efficiency gains in memory use, speed, and energy consumption are also a substantial step up from BitNet, offering greater incentives for adoption. Bonsai is also easier to use, since it works on Nvidia GPUs using the more common llama.cpp inference framework as well as on Apple Silicon.
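The “1.58-bit” label comes from information theory: a weight drawn from k distinct values needs log2(k) bits to encode. (Bonsai’s 1.125 bits/weight presumably reflects a small overhead above pure binary; the exact accounting isn’t disclosed.)

```python
import math

# Minimum bits needed to encode one weight drawn from k distinct levels.
def bits_per_weight(num_levels: int) -> float:
    return math.log2(num_levels)

print(bits_per_weight(2))  # binary {-1, +1}     -> 1.0 bit
print(bits_per_weight(3))  # ternary {-1, 0, +1} -> ~1.585 bits (BitNet's "1.58-bit")
```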
- Model evaluation criteria are being reframed in terms of efficiency-adjusted performance. While Bonsai doesn’t strictly outperform models with a similar number of parameters, it dominates in terms of intelligence density, defined by PrismML as “the amount of intelligence a model can deliver per unit size (measured in GB).” Calculated as the negative log of the model’s error rate divided by the model size, intelligence density comes out to 1.060 for Bonsai vs. 0.096 for Qwen3-8B. Another evaluation from the community rates Bonsai’s accuracy per GiB at 0.72 vs. Qwen3.5-9B’s 0.17.
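The metric rewards small models steeply. PrismML doesn’t publish the underlying benchmark error rates or the log base, so the error rates below are hypothetical values chosen only to illustrate the calculation (they happen to land near the reported scores under natural log):

```python
import math

# Intelligence density as defined by PrismML: -log(error rate) / size in GB.
# Error rates here are hypothetical, for illustration only.
def intelligence_density(error_rate: float, size_gb: float) -> float:
    return -math.log(error_rate) / size_gb

# A small model with a somewhat higher error rate still dominates on density:
print(intelligence_density(0.30, 1.15))   # compact 1-bit model, ~1.05
print(intelligence_density(0.21, 16.0))   # full-precision 8B model, ~0.10
```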
- PrismML’s results suggest that many modern LLMs may be over-parameterized for the tasks they perform. Conventional 16-bit floating-point weights mean each parameter carries high-resolution numerical information that may not be necessary for inference. These inefficiencies can be ironed out by improving how information is encoded and accessed, enabling compression without catastrophic degradation in downstream tasks. Bonsai is reportedly quite good at code generation and tool calls, for instance, although it carries some tradeoffs for math and reasoning.
- The PrismML team believes that the future of inference is in 1-bit precision. The 1-bit architecture could be applied to larger models, shrinking a 2-TB model, for instance, down to 150 GB. If AI hardware (e.g. specialized accelerators) is designed for 1-bit precision in the future, it could make these models even faster and more energy-efficient by swapping complex multiplications for simple addition/subtraction.
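The hardware point in the last sentence is visible even in a toy matrix-vector product: with ternary weights, every multiply collapses into an add, a subtract, or a skip. This is a conceptual sketch, not PrismML’s (or BitNet’s) actual kernel:

```python
# Toy matrix-vector product with ternary weights: no multiplications needed.
# Each weight in {-1, 0, +1} turns w*x into +x, -x, or nothing at all.
def ternary_matvec(weights, x):
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1 weight: add
            elif w == -1:
                acc -= xi      # -1 weight: subtract
            # 0 weight: skipped entirely
        out.append(acc)
    return out

W = [[1, 0, -1], [-1, 1, 1]]
x = [2.0, 3.0, 5.0]
print(ternary_matvec(W, x))  # [-3.0, 6.0]
```

Hardware built around this pattern could drop multiplier circuits altogether, which is where the projected speed and energy gains of 1-bit-native accelerators would come from.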
- If larger models that previously required multi-GPU clusters can eventually run on a single consumer device, workloads – and their economics – would shift away from centralized cloud providers and toward decentralized edge devices. At the same time, the growth of inference requests would likely drive a surge in demand for backend infrastructure to support all of these endpoints. 1-bit architectures could furthermore be applied inside data centers, “improving hardware utilization, lowering operating costs, and reducing energy consumption.” As one Google exec put it, “Efficiency at the model level compounds across infrastructure.”
- The largest models will probably not be replaced wholesale with compressed 1-bit models. Instead, we’re likely to see more of a stratification. Highly capable frontier models will continue as training and distillation backbones, and be used to serve the subset of more complex inference requests coming in from endpoints. In parallel, more efficient, compressed models will become widely deployed for real-time inference in a multitude of everyday devices. What could emerge in this latter shift is a broader rethinking of AI scaling laws, as algorithmic efficiency rises in importance. It has the potential to open up dimensions that are less heavily reliant on hardware and massive compute budgets.
Related Content:
- Mar 27 2026 (3 Shifts): Google's TurboQuant algorithm
- Mar 13 2026 (3 Shifts): Deal-making and inference chips
Become an All-Access Member to read the full brief
All-Access Members get unlimited access to the full 6Pages Repository of 889 market shifts.
Become a Member
Already a Member? Log In
Disclosure: Contributors have financial interests in Meta, Microsoft, Alphabet, Amazon, and Rivian. Amazon and Google are vendors of 6Pages.
Have a comment about this brief or a topic you'd like to see us cover? Send us a note at tips@6pages.com.