Is running AI on CPUs making a comeback?

How can AI on CPUs outperform GPUs? Efficient memory addressing using hashing functions dramatically improves model training and inference.
2 August 2023

Pocket LLMs could soon be in fashion thanks to hardware-efficient hashing algorithms that make running AI on CPUs a breeze.

If somebody told you that a refurbished laptop could eclipse the performance of an NVIDIA A100 GPU when training a 200 million-parameter neural network, you’d want to know the secret. Running AI routines on CPUs is supposed to be slow, which is why GPUs are in high demand, and NVIDIA shareholders are celebrating. But maybe it’s not that simple.

Part of the issue is that the development and availability of GPUs, which can massively parallelize matrix multiplications, have made it possible to brute-force progress in AI. Bigger is better when it comes to both the amount of data used to train neural networks and the size of the models, reflected in the number of parameters.

Considering state-of-the-art large language models (LLMs) such as OpenAI’s GPT-4, the number of parameters is now measured in the billions. And training what is, in effect, a vast, multi-layered equation – by first specifying model weights at random and then refining those parameters through backpropagation and gradient descent – is now firmly GPU territory.

Nobody runs high-performance AI routines on CPUs, or at least that’s the majority view. The growth in model size, driven by the gains in accuracy, has led users to overwhelmingly favor much faster GPUs to carry out billions of calculations back and forth.

But the scale of the latest generative AI models is putting this brute-force GPU approach to the test. And many developers no longer have the time, money, or computing resources to compete when that means tuning the billions of parameters that make up these many-layered networks.

Experts in the field are asking if there’s another, more efficient way of training neural networks to perform tasks such as image recognition, product recommendation, and natural language processing (NLP) search.

Artificial neural networks are often compared to the workings of the human brain. But the comparison is a loose one, as the human brain runs on about the power of a dim light bulb, whereas state-of-the-art AI models consume vast amounts of power, have worryingly large carbon footprints, and need extensive cooling.

That said, the human brain consumes a considerable amount of energy compared with other organs in the body. But its efficiency advantage over GPUs, which spans orders of magnitude, stems from the fact that the brain’s chemistry recruits only the neurons it needs rather than performing calculations in bulk.

AI developers are trying to mimic those brain-like efficiencies in computing hardware by engineering architectures known as spiking neural networks. Neurons behave more like accumulators and fire only when repeatedly prompted. But it’s a work in progress.

However, it’s long been known that training AI algorithms could be made much more efficient. Matrix multiplications assume dense computations, but researchers showed a decade ago that computing only the top ten percent of neuron activations still produces high-quality results.

The issue is that to identify the top ten percent you would still have to run all of those sums in bulk, which would remain wasteful. But what if you could look up a list of those most active neurons based on a given input?

The answer to that question opens up the path to running AI on CPUs, which is potentially game-changing, as the claim that a refurbished laptop can eclipse the performance of an NVIDIA A100 GPU suggests.
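
The catch with the top-ten-percent shortcut can be made concrete with a minimal Python sketch. The layer shape, the random weights, and the helper names here are all made up for illustration. The point is simply that ranking activations the brute-force way means paying for every dot product first.

```python
import random

def dense_activations(weights, x):
    """One dot product per neuron: the full, brute-force cost."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]

def top_k_indices(activations, k):
    """Indices of the k largest activations, strongest first."""
    order = sorted(range(len(activations)), key=lambda i: activations[i], reverse=True)
    return order[:k]

random.seed(0)
DIM, N_NEURONS = 16, 100  # hypothetical layer shape, chosen only for this sketch
weights = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(N_NEURONS)]
x = [random.gauss(0.0, 1.0) for _ in range(DIM)]

# All 100 dot products are computed...
activations = dense_activations(weights, x)
# ...only for 90 of them to be thrown away.
active = top_k_indices(activations, N_NEURONS // 10)
```

A lookup that returned something close to `active` without first filling in `activations` would avoid most of that work, and that is the gap hashing is meant to fill.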

How to run AI on CPUs

So what is this magic? At the heart of the approach is the use of hash tables, which famously run in constant time (or thereabouts). In other words, the time taken to look up an entry in a hash table is, on average, independent of the number of entries stored. And Google puts this principle to work in its web search.

For example, if you type ‘Best restaurants in London’ into Google, that query – thanks to hashing, which turns the input into a compact fingerprint – provides the index to a list of topical websites that Google has filed away at that location. And it’s why, despite having billions of web pages stored in its vast index, Google can deliver search results to users in a matter of milliseconds.

And, just as your search query – in effect – provides a lookup address for Google, a similar approach can be used to identify which artificial neurons are most strongly associated with a piece of training data, such as a picture of a cat.

In neural networks, hash tables can be used to tell the algorithm which activations need to be calculated, dramatically reducing the computational burden to a fraction of brute force methods, which makes it possible to run AI on CPUs.

In fact, the class of hash functions that turns out to be most useful is known as locality-sensitive hashing (LSH). Regular hash functions are great for fast memory addressing and duplicate detection, whereas LSH functions provide near-duplicate detection.

Dynamic sparsity

LSH functions can be used to hash data points that are near to each other – in other words, similar – into the same buckets with high probability. And this, in terms of deep learning, dramatically improves the sampling performance during model training.
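
As a rough illustration, here is a minimal Python sketch of one well-known LSH family, signed random projections (often called SimHash). Choosing this family is an assumption made for the example, since the article does not say which LSH functions are used in practice, and the dimensions and helper names are invented. Each random hyperplane contributes one bit, depending on which side of it a vector falls, so vectors pointing in similar directions tend to share most or all of their bits.

```python
import random

def make_hyperplanes(n_bits, dim, rng):
    """Draw n_bits random hyperplanes; each one contributes a single bit of the hash."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def simhash(vector, hyperplanes):
    """Bucket key: one bit per hyperplane, set by the side the vector falls on."""
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vector)) >= 0 else 0
        for h in hyperplanes
    )

def hamming(a, b):
    """Number of bit positions where two signatures disagree."""
    return sum(bit_a != bit_b for bit_a, bit_b in zip(a, b))

rng = random.Random(42)
DIM, N_BITS = 32, 16  # hypothetical sizes, chosen only for this sketch
planes = make_hyperplanes(N_BITS, DIM, rng)

base = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
near = [v + rng.gauss(0.0, 0.01) for v in base]  # slightly perturbed copy of base
far = [-v for v in base]                         # points the opposite way

sig_base = simhash(base, planes)
near_mismatches = hamming(sig_base, simhash(near, planes))  # expected to be small
far_mismatches = hamming(sig_base, simhash(far, planes))    # opposite vector flips the bits
```

An ordinary hash function would treat `base` and `near` as unrelated inputs. Here the number of mismatching bits grows with the angle between the two vectors, which is what makes near-duplicate lookup possible.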

Hash functions can also be used to improve the user experience once models have been trained. Computer scientists at Rice University in Texas, Stanford University in California, and the Pocket LLM pioneer ThirdAI have proposed a method dubbed HALOS: Hashing Large Output Space for Cheap Inference, which speeds up inference without compromising model performance.

As the team explains, HALOS reduces inference to a sub-linear computation by selectively activating only a small set of likely-to-be-relevant output layer neurons. “Given a query vector, the computation can be focused on a tiny subset of the large database,” write the authors in their conference paper. “Our extensive evaluations show that HALOS matches or even outperforms the accuracy of given models with 21× speed up and 87% energy reduction.”
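
The general idea of scoring only a shortlist of output neurons can be sketched in Python as follows. This is not the HALOS algorithm itself, whose details are not given here. It is a generic illustration that assumes signed-random-projection hashing, unit-length weight vectors, and a hypothetical `OutputNeuronIndex` helper, with the table count, bit width, and layer size picked arbitrarily.

```python
import math
import random
from collections import defaultdict

def dot(a, b):
    return sum(a_i * b_i for a_i, b_i in zip(a, b))

def unit(v):
    """Scale a vector to length 1, so the largest dot product means the closest direction."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def signature(vector, planes):
    """Bucket key: one sign bit per random hyperplane."""
    return tuple(dot(p, vector) >= 0 for p in planes)

class OutputNeuronIndex:
    """Hypothetical helper: several LSH tables built over output-neuron weights."""

    def __init__(self, weights, n_tables, n_bits, rng):
        self.weights = weights
        self.tables = []
        dim = len(weights[0])
        for _ in range(n_tables):
            planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
            buckets = defaultdict(list)
            for neuron_id, w in enumerate(weights):
                buckets[signature(w, planes)].append(neuron_id)
            self.tables.append((planes, buckets))

    def candidates(self, query):
        """Union of the neurons that share a bucket with the query in any table."""
        found = set()
        for planes, buckets in self.tables:
            found.update(buckets.get(signature(query, planes), ()))
        return found

    def predict(self, query):
        """Score only the shortlist; return the best neuron id, or None if no bucket matched."""
        found = self.candidates(query)
        if not found:
            return None
        return max(found, key=lambda j: dot(self.weights[j], query))

rng = random.Random(7)
DIM, N_OUTPUTS = 32, 2000  # hypothetical output layer, chosen only for this sketch
weights = [unit([rng.gauss(0.0, 1.0) for _ in range(DIM)]) for _ in range(N_OUTPUTS)]
index = OutputNeuronIndex(weights, n_tables=4, n_bits=8, rng=rng)

# Idealised query: identical to neuron 123's weights, so it lands in that neuron's
# bucket in every table. A noisy query would be matched only with high probability.
query = list(weights[123])
shortlist = index.candidates(query)
prediction = index.predict(query)
```

At query time the cost is a fixed number of hashing dot products (`n_tables * n_bits`) plus one dot product per shortlisted neuron, rather than one per output neuron. Using several tables is a common way to reduce the chance that the right neuron is missed by a single unlucky hash.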

Field test

Commercially, this approach is helping merchants such as Wayfair – an online retailer that offers customers millions of products for their homes. Over the years, the firm has worked hard to improve its recommendation engine, noting an Amazon study which found that even a 100-millisecond delay in serving results can put a noticeable dent in sales.

And, sticking briefly with online shopping habits, more recent findings published by Akamai report that over half of mobile website visitors will leave a page that takes more than three seconds to load – food for thought as half of consumers are said to browse for products and services on their smartphones.

All of this puts pressure on claims that clever use of hash functions can enable AI to run on CPUs. But the approach more than lived up to expectations, as Wayfair has confirmed in a blog post. “We were able to train our version three classifier model on commodity CPUs, while at the same time achieve a markedly lower latency rate,” commented Weiyi Sun – Associate Director of Machine Learning at the company.

Plus, as the computer scientists described in their study, the use of hash-based processing algorithms accelerated inference too.