Like the sound of private, air-gapped LLMs? Welcome to generative AI on CPUs.

Dynamic sparsity allows generative AI to run on CPUs, side-stepping GPU shortages and enabling private, air-gapped LLMs.
13 July 2023

Air-gapped generative AI: dynamic sparsity makes it possible to run billion-parameter models on CPUs and have pocket LLMs that users could carry around. And if that wasn’t enough, the approach could fix GPU shortages too.


We are living in boom times for GPUs. The huge appetite for augmenting software with large language models (LLMs) and building advanced chatbots is great news for chip designers such as NVIDIA, whose hardware powers generative AI. Users are scrambling for LLM training stacks and inference architecture, leading to supply issues. But what if you could sidestep GPU shortages entirely by running deep learning algorithms on CPUs instead?

It’s a radical idea that some analysts believe could impact NVIDIA’s stock price in the long term. And not only would switching to currently underutilized CPUs address the problem of GPU shortages, but energy-efficient approaches to training LLMs on CPUs would bring other benefits to the table as well.

Fed with huge datasets, LLMs can take months to train and optimize, consuming large amounts of energy and water (generative AI is thirsty for cooling) in the process. And environmentalists are concerned about the carbon footprint of having banks of GPUs running 24/7 as developers race to build ever more powerful LLMs and push ahead of the competition.

Dynamic sparsity saves resources

OpenAI’s GPT-3, which provided the foundation for the hugely successful ChatGPT, has 175 billion parameters – the various weights that determine how deep learning neural networks map inputs into outputs. And, as it turns out, the number of parameters the model actually uses for any given input can be made dynamically much smaller without impacting model accuracy, which is a game changer for democratizing AI.

Being able to train deep learning models with hundreds of thousands of input dimensions and several thousand hidden layers on a CPU makes generative AI much more portable and standalone. So-called pocket LLMs – which could be carried in a backpack and kept offline – provide air-gapped generative AI capabilities that are ideal for keeping company data safe.


In the excitement to see what’s possible using conversational AI, many users may have inadvertently submitted sensitive data to LLMs hosted in the cloud. AI heavyweights such as OpenAI and Google warn users not to enter sensitive information – pointing out that human reviewers may process conversations for quality purposes.

Big players have invested substantial sums in reinforcement learning with human feedback to make advanced chatbots the success story that they’ve become. And air-gapped, pocket LLMs running on CPUs give users the chance to bring that model refinement in-house and take full control over the development of domain-specific solutions trained with private data.

Chips in high demand – why are there GPU shortages?

GPUs became all the rage in the AI community when developers saw that bigger was better. Large models with many hidden layers fed with huge amounts of data produced rubbish to begin with. But leaving that compute whirring away for days – back-propagating results to optimize the weights of the neural network, and completing multiple passes (epochs) through the whole training set – produced staggering results.
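
For readers who want to see that loop in miniature, the sketch below trains a toy two-weight model with plain gradient descent over repeated epochs. It is purely illustrative: the data, model size, and learning rate are made up and bear no relation to a real LLM training stack.

```python
# Toy illustration of the training loop described above: repeated passes
# (epochs) over a dataset, with the error propagated back to nudge the
# weights. Everything here (data, model size, learning rate) is made up.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))                                 # training inputs
y = X @ np.array([3.0, -1.5]) + 0.1 * rng.normal(size=1000)    # targets

w = np.zeros(2)        # the model's weights, optimized over time
lr = 0.5               # learning rate

for epoch in range(20):                  # multiple passes through the data
    pred = X @ w                         # forward pass
    grad = X.T @ (pred - y) / len(y)     # gradient of the squared error
    w -= lr * grad                       # weight update (gradient descent)
    loss = np.mean((pred - y) ** 2)
    print(f"epoch {epoch:2d}  loss {loss:.4f}")
```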

In the case of next-word predicting LLMs, data doesn’t even have to be labelled. Models are shown sentences with words removed and have to guess what’s missing. Incredibly, when performed at scale, this unsupervised learning method is capable of teaching computers how to translate languages, write code, and converse convincingly with humans – to list just a few wonders of generative AI.
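
Here’s a toy illustration of how that self-supervised setup manufactures training examples from raw text, with no labelling required. The sentence and the whole-word splitting are simplifications; real systems work on sub-word tokens and vastly more text.

```python
# A toy example of self-supervised next-word prediction: every position in
# an unlabelled sentence yields a (context, next word) training pair.
text = "dynamic sparsity allows generative ai to run on cpus"
words = text.split()

examples = [(words[:i], words[i]) for i in range(1, len(words))]
for context, target in examples[:3]:
    print(" ".join(context), "->", target)
# dynamic -> sparsity
# dynamic sparsity -> allows
# dynamic sparsity allows -> generative
```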

Under the hood, words are broken down into sub-parts known as tokens and stored in large vectors that are multiplied together. And it’s the ability to efficiently perform vector multiplication that gives GPUs the edge, hence the high demand for chips from leading designers such as NVIDIA.
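
The sketch below shows that idea at toy scale: each token maps to a row of an embedding matrix, and a layer’s work is one large matrix multiplication over those vectors. The vocabulary, dimensions, and random weights are invented purely for illustration; real models use thousands of dimensions and tens of thousands of sub-word tokens.

```python
# Minimal sketch of the vector arithmetic described above: tokens become
# rows of an embedding matrix, which are then multiplied through a layer.
import numpy as np

vocab = {"air": 0, "gapped": 1, "llm": 2, "cpu": 3}
rng = np.random.default_rng(1)
embeddings = rng.normal(size=(len(vocab), 8))        # one 8-d vector per token

tokens = ["air", "gapped", "llm"]
vectors = embeddings[[vocab[t] for t in tokens]]     # look up the token vectors

weights = rng.normal(size=(8, 16))                   # one dense layer
activations = vectors @ weights                      # 3 tokens x 16 outputs
print(activations.shape)                             # (3, 16)
```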

GPUs have relatively small instruction sets, but large numbers of cores. Originally, they were devised to perform simple tasks – turning pixels on and off – but rapidly, in parallel and at scale. And today, GPU use goes beyond graphics, having proven popular when mining cryptocurrency was all the rage, and more recently underpinning the latest AI boom.

But for how long? As we hinted at earlier, dynamic sparsity changes things by only considering the AI model parameters that are necessary for the current sample. Training deep learning algorithms on GPUs is a brute-force approach that performs billions of vector multiplications whether they are needed or not. In reality, only a few thousand model parameters may actually be updated when shown a new input, with the bulk of the calculations simply multiplying something by zero.
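
Some rough arithmetic makes the point. The layer width and active-neuron count below are illustrative numbers, not measurements of any particular model, but they show how much of a dense pass is wasted when only a few thousand neurons matter for a given sample.

```python
# Back-of-the-envelope arithmetic for the point above (illustrative numbers).
input_dim = 4096
layer_width = 50_000        # neurons in one wide hidden layer
active_neurons = 2_000      # neurons that actually matter for this sample

dense_ops = input_dim * layer_width        # what a dense (GPU-style) pass computes
sparse_ops = input_dim * active_neurons    # what the sample actually needs
print(f"dense : {dense_ops:,} multiply-adds")
print(f"sparse: {sparse_ops:,} multiply-adds "
      f"({100 * sparse_ops / dense_ops:.0f}% of the dense work)")
```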

Bolt: a dynamic sparsity engine for CPUs

By focusing the compute on the high activations and ignoring the low activations, developers find that it’s possible to use commonly available CPUs – even though such chips have fewer cores. And users can once again benefit from the superior memory capacity of CPUs – one of the constraints of fast, parallel-processing GPUs.
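
Below is a minimal sketch of what such a sparse forward pass looks like on a CPU: only the weight rows belonging to the high-activation neurons are touched, and the rest of the layer is skipped. Crucially, how an engine like Bolt identifies those neurons cheaply is the hard part and isn’t shown here; the active set in this sketch is simply assumed to be known, and all sizes are toy values.

```python
# Minimal sketch of a sparse forward pass on a CPU: compute only the weight
# rows for the neurons expected to fire and skip the rest. The active set
# below is assumed to be given; finding it cheaply is the real engineering.
import numpy as np

rng = np.random.default_rng(2)
input_dim, layer_width, k = 512, 10_000, 256

x = rng.normal(size=input_dim)                 # one input sample
W = rng.normal(size=(layer_width, input_dim))  # layer weights

dense_out = W @ x                              # dense pass: every neuron computed

active = rng.choice(layer_width, size=k, replace=False)   # assumed given
sparse_out = np.zeros(layer_width)
sparse_out[active] = W[active] @ x             # only k rows of W are touched

print("dense multiply-adds :", layer_width * input_dim)
print("sparse multiply-adds:", k * input_dim)
```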

“People haven’t seen what CPUs can do,” Anshu Shrivastava, CEO and Founder of ThirdAI (pronounced third-eye), told TechHQ. “We’ve done all the heavy lifting to make model training very efficient.”

Rather than having to pipe their business intelligence into a third-party service, users can instead – thanks to ThirdAI’s dynamic sparsity engine, dubbed Bolt – build and deploy billion-parameter models that run on their own CPUs. And having that information locally, in-house, makes it so much easier to keep those generative AI solutions – which could be internal company chatbots, or highly interactive document search tools – up to date.

“Every data point that goes through inference can become part of a new training set,” adds Shrivastava. “It’s a continuous process.”