A couple of weeks ago the news landed: Anthropic moved OpenClaw users straight to API pricing without further ado. For OpenClaw users this means costs many times higher than the subscription pricing they were used to (Claude Subscribers Now Have to Pay to Use OpenClaw).
Anthropic’s move is a good illustration of why building on a single model provider is a risk. The phenomenon isn’t Anthropic-specific, either: other providers can make similarly arbitrary changes. And regardless of decisions like these, we’re living in a situation where cloud infrastructure isn’t keeping up with the growth in AI usage. There is even talk that the big model providers might go bankrupt; if that happens, prices will rise and capacity won’t be available.
In this post I’ll explain what AI self-sufficiency is, when it’s worth investing in, and how to get there in practice.
Why is dependence on AI model providers a problem?
There are several reasons why this matters, and they keep coming up in conversations with clients:
- Arbitrary decisions by model providers. Model providers change pricing, discontinue products and tighten their terms of service whenever they please. If your entire AI operation is built on a single vendor’s services, a change like this can cause serious trouble on the customer side and, at worst, kill the whole operation.
- AI usage is growing faster than cloud capacity. More and more of us use AI all the time, piling into ChatGPT, Claude and elsewhere to chat about our coding tasks, job applications and everything else. Load spikes follow, showing up as slow responses, rate limits and outages. And the more AI becomes part of our normal work, the more painful those outages feel.
- Economic uncertainty and AI. Model providers’ businesses are capital-intensive and the hype is intense. If we’re in a bubble and it pops, a model provider can fall over and disappear from the market entirely. Organisations’ solutions and users’ habits are often tightly coupled to the model or model family in use. Switching providers is technically trivial today, but after the switch the user experience can go from great to dismal, and broader instability can follow as well.
- AI and information security. Some clients handle sensitive material, and others have entire development environments running on-prem. In defence industry companies, for example, the whole development environment can be completely disconnected from the internet. In those cases an external API simply isn’t an option.
- Cost. Using model providers costs money, and the costs grow the more you use the models. Especially with autonomous agents, token counts grow quickly.
These factors and the constantly growing mission-criticality of AI make this a very timely topic. Should you go along with the crowd, or take AI under your own control? What if you ran the models you need yourself, on your own machine?
What does AI self-sufficiency mean?
AI self-sufficiency means owning and running the infrastructure where the models execute, and using them for your own needs. The AI self-sufficient don’t buy resources from OpenAI, Anthropic, Mistral or anywhere else. Instead they use open source models, downloaded onto their own machine and run there.
In practice this means:
- You don’t spend money on anything other than electricity (and depreciation on the hardware you bought).
- You can use the models to your heart’s content and as far as your hardware can handle — no token budgets.
- There are no outages caused by external load spikes, because the only user is you or your team.
- If a model provider goes bankrupt, no problem. Your own models keep running at home or in your own data centre.
- You aren’t dependent on model providers’ pricing changes or other arbitrary decisions.
- The materials you process don’t go out to the internet — they stay on your own machines.
In return, you give up access to the very latest top-tier models and you take on the responsibility of keeping your own hardware running. This is a good solution for many use cases, and in a sense it makes you market-independent.
My path towards AI self-sufficiency
The first time I ran AI models on my own machine was at the end of 2018. Back then there were no model providers in the current sense of the word; OpenAI itself was only founded in December 2015, and generative AI wasn’t yet generally available. The models in use were quite different: open embedding models, of the kind still used today in vector search, and other NLP models that few people use any more. I ran these daily in my work, when NLP solutions had to be built starting from a much lower abstraction level. I wrote about this in an earlier post.
Before the LLM era, AI was always about self-sufficiency.
Then in 2022 ChatGPT arrived, and within a few years practically everyone working in NLP had moved over to using large language models. Soon LLM-related frameworks started appearing, like LangChain, LangGraph and Ollama. A language model itself is just a thing you pour text into and get text out of; if you want to build applications and services on top of it, you have to do that yourself, and frameworks like these help with that work.
In 2024 I bought serious hardware: a machine with a reasonable amount of VRAM and good memory bandwidth so I could run larger local language models on it. Even before ChatGPT was released, open language models that you could download to your own machine had started appearing. I had decided to dig into using them.
I quickly noticed that the LLM frameworks of the time had a strong monetisation push: LangChain packaged its product so that support mainly covered paid model providers. So if you wanted to build some kind of agent with LangChain, you had to use the model vendors’ paid models or contribute integrations for local models to LangChain yourself. From a self-sufficiency perspective this felt like the wrong direction to me, and I wrote my own small LLM framework. Along the way I got to learn how a language model’s memory and tools are built. On top of the framework I’d written and the local models, I built a small ArXiv assistant that followed publications and could discuss topics based on the articles published on ArXiv.
All of this is about the same thing: how to make use of AI without external help or resources. Right now I’m exploring and experimenting with AI self-sufficiency in software development — and really it was this latest step that prompted me to write this piece.
How do you become AI self-sufficient?
You need two things: a suitable machine and language models. You’ll most likely also need some of the LLM frameworks I mentioned above.
What does a suitable machine look like?
Two properties stand out above the others: the amount of VRAM on the GPU and the memory bandwidth.
During execution the language model is mainly kept in the GPU’s VRAM, and the size of the model determines how much VRAM is needed. If there isn’t enough VRAM, what happens? It’s worth understanding this correctly:
- Modern inference tools (MLX, llama.cpp, Ollama) offload some of the model’s layers into system memory (RAM), or in the worst case onto disk, when VRAM runs out. The system doesn’t necessarily crash, but inference slows down significantly because, for each token, part of the computation has to wait on slower data movement.
- If, on the other hand, you use a strict GPU-only configuration or the model’s quantisation has been chosen incorrectly, model loading ends in an out-of-memory error.
- In practice, for generative use where you want answers in a reasonable time, the model has to fit in VRAM; otherwise the experience quickly becomes frustrating.
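As a rule of thumb you can estimate whether a model fits from its parameter count and quantisation level. Here is a minimal sketch; the 20% headroom for KV cache and buffers is an assumption I use, not an exact figure:

```python
def model_size_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone. Real usage adds
    KV cache and runtime overhead on top of this."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

def fits_in_vram(params_billions: float, bits_per_weight: int,
                 vram_gb: float, overhead: float = 1.2) -> bool:
    """Allow ~20% headroom for KV cache and buffers (assumed margin)."""
    return model_size_gb(params_billions, bits_per_weight) * overhead <= vram_gb

# A 70B model at 4-bit quantisation: about 35 GB of weights.
print(model_size_gb(70, 4))        # 35.0
print(fits_in_vram(70, 4, 48))     # True: ~42 GB with headroom fits in 48 GB
print(fits_in_vram(70, 8, 48))     # False: ~84 GB with headroom does not
```

The same arithmetic also shows why quantisation matters so much: halving the bits per weight halves the VRAM requirement.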
Once there’s enough VRAM, the next important thing is memory bandwidth. During execution the GPU does matrix computation. The computation an inference call requires isn’t performed all at once; it’s a large multi-stage operation broken into tiles: parts of the model and data are moved from VRAM into the compute cores’ buffers or registers, the computation runs, and the results move back into VRAM. The compute operations themselves are fast; it’s the transfers that take time. Memory bandwidth determines how quickly data moves between the compute cores and VRAM, which is why it’s one of the most important single spec numbers for LLM use.
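A back-of-the-envelope model shows why bandwidth dominates: for each generated token, roughly the whole set of weights has to stream through the compute cores once, so generation speed is capped at about bandwidth divided by model size. The bandwidth figures below are illustrative assumptions, not benchmarks:

```python
def max_tokens_per_sec(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed for a dense model: every token
    touches roughly all weights once, so generation can't be faster
    than bandwidth / model size. Real speeds are lower than this."""
    return bandwidth_gb_s / model_size_gb

# A ~35 GB model (70B at 4-bit):
print(round(max_tokens_per_sec(35, 800), 1))  # 22.9 tok/s at ~800 GB/s (high-end Mac class, assumed)
print(round(max_tokens_per_sec(35, 273), 1))  # 7.8 tok/s at ~273 GB/s (DGX Spark class, assumed)
```

This is why two machines that can both hold the same model in memory can still differ several-fold in how fast they produce responses.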
Hardware options
- Apple and unified memory. If you’re one of Apple’s disciples, like yours truly, you can buy a powerful MacBook Pro or, for a desktop, a Mac Studio. Apple’s current devices have so-called unified memory: the GPU shares the same physical memory with everything else, so you effectively have as much “VRAM” available as the machine has memory in total. On Apple’s current devices, VRAM = RAM.
- Self-built or spec’d PC. You choose the motherboard, GPUs and other components and assemble the machine yourself or order it pre-built. This gives the most freedom, but also the most fiddling.
- Off-the-shelf AI servers. Lately, ready-made AI servers or small servers have also come on the market, for example Nvidia’s DGX Spark. Prices are steep and memory tops out at 128 GB, but they can also be combined: two units can be linked together to form an AI server with, for example, 256 GB of memory. Like Apple devices and self-built workstations, these are devices you can keep in a closet at home or even on your desk.
- Rack servers. If you have your own data centre, the scale is of course much larger, and so is the price.
Where do you get language models?
The de facto model hub for open source models is Hugging Face. You can find models there for every purpose: general-purpose chat models, models optimised for programming, models tuned for tool use (tool calling, agent use), reasoning models for tasks that need thinking, vision models for understanding images, and embedding models for vector search. The same model is often available in different quantisation levels (8-bit, 4-bit, etc.) so you can fit the model into smaller VRAM.
A bit more on running models on a Mac
Lately I’ve mostly been running models on a Mac.
For pure LLM inference use (= you say something to the model and the model replies), Mac is currently a competitive option: you can get enough memory and the memory bandwidth is fast. In inference use this shows up directly as the model producing responses quickly. For comparison, the Nvidia DGX Spark AI server I mentioned above can certainly hold large models in memory, but its speed at producing responses isn’t any better.
If, on the other hand, I were training large models myself, then cloud hardware would be a suitable option. Training is done only once, so it would be a one-off cost in the form of cloud fees. A multi-GPU workstation or training on a Dask cluster are also good options for training. Inference and training are different processes and they benefit from different performance profiles.
On a Mac it’s worth using MLX-LM-optimised models, that is, models packaged to run on Apple hardware on top of the MLX framework. On Hugging Face there’s an entire organisation (mlx-community) that publishes these versions.
One concrete observation from my own LLM framework work: implementing memory around MLX-LM models was much more straightforward and efficient than with the other options (Ollama and transformers). MLX-LM models can save their internal state (the KV cache, i.e. the attention layers’ key/value states) after each inference request, and this state can be fed back to the model on the next inference. The model then continues from where the previous request ended and “remembers” what was said in the previous inference.
In the other LLM execution frameworks I tried (Ollama and transformers), the only option seemed to be feeding the conversation history in as part of each inference call. That is far less efficient: on every round the entire history has to be processed again on top of the new input, and the framework maintainer also has to build the logic for storing the history. Let me know if the situation has changed or if I missed something; everything in this area moves fast.
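The difference can be illustrated with a toy cost model (pure illustration, not real inference code): count how many tokens get processed over a multi-turn conversation, with and without a reusable KV cache:

```python
def tokens_processed(turn_lengths, reuse_kv_cache: bool) -> int:
    """Toy cost model: each turn adds n new tokens.
    With KV-cache reuse only the new tokens are processed;
    without it, the whole history is re-processed every round."""
    total = 0
    history = 0
    for n in turn_lengths:
        if reuse_kv_cache:
            total += n            # only the new prompt tokens
        else:
            total += history + n  # full history + new prompt, every round
        history += n
    return total

turns = [200] * 10               # ten turns of ~200 tokens each
print(tokens_processed(turns, reuse_kv_cache=True))   # 2000
print(tokens_processed(turns, reuse_kv_cache=False))  # 11000
```

The cost without cache reuse grows quadratically with conversation length, which is exactly why it hurts in long agent sessions.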
Closing thoughts
Local AI and different use cases
AI assistance: AI-assisted work, for example software development, writing, and various design tasks at work and at home, is user-driven work alongside AI. Here the user has to wait for the AI’s responses, so the AI’s speed is a decisive factor. A local language model is fairly slow for its size, slower than cloud models outside peak hours, so it’s worth being prepared for that.
OpenClaw and other autonomous agents: OpenClaw as a use case is completely different. First, OpenClaw reacts to various stimuli (for example an email to the user, an alert from a monitoring system, a bug report) and can start working immediately when such an input arrives, which means it may have finished its work before the user has had a chance to react at all. Second, OpenClaw can run a kind of self-examination on a schedule, and that too happens in a time window where a little slowness doesn’t really matter. I’d say local models pair very well with OpenClaw.
There are other use cases too, but those are perhaps the two most timely.
Thanks for your interest! Send me a message or comment (mail, LinkedIn, YouTube) about your own experiences, or if you’d like to dig deeper into any of this!