Edge AI: how to make deep learning more efficient

Quantization, pruning, and teacher and student models are just a few ways to make deep learning more efficient and open up edge AI use cases.
22 August 2023

Efficient edge AI on the horizon: multi-layer neural networks are powerful universal function approximators that can now be squeezed into smaller footprint hardware thanks to a suite of deep learning optimization techniques.

Artificial intelligence (AI) is transformative across today’s industrial landscape. Everything from enterprise software to machine automation is benefiting from the ability of multi-layered neural networks – with sufficient data and training – to make sense of our world. But as the size of deep learning models balloons, opening the door to more complex natural language processing (NLP) and other AI applications, so does the amount of compute that’s required. And that’s a problem when it comes to edge AI.

Edge AI trend

Deploying deep learning algorithms on portable computing hardware, such as smartphones or computers on board vehicles, gives users access to powerful image recognition capabilities – to give just one of many use cases. And running models locally on edge AI hardware provides resilience against any interruption in connectivity.

There are also energy considerations. Users are starting to question the environmental impact of running giant AI algorithms in the cloud, given the energy cost of training models with billions of parameters and the large amounts of cooling water consumed in the process. But, as it turns out, developers have become experts at slimming down their models to reduce the computing demands of deep learning inference with only a minor impact on the accuracy of results.

These efficiency measures are great news for enabling edge AI. And to understand how the various methods work, it’s useful to first paint a picture of deep learning and consider how multi-layer neural networks turn inputs into meaningful outputs.

At an abstract level, you can think of a deep neural network as a universal function approximator. Given enough parameters, a network can approximate a very wide range of mathematical functions to a high degree of accuracy. You might have seen formulae that look like shells when plotted in 3D or fractals that resemble tree branches. And large numbers of artificial neurons have proven to be capable of describing images and finding missing words in sentences.

Training these AI algorithms involves adjusting millions of model weights to make patterns of artificial neurons sensitive to certain inputs, such as edge features in an image. It’s also necessary to set biases for each of the nodes in the network, to determine the strength of the activation that’s required to make the corresponding artificial neurons ‘fire’.
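To make the roles of weights and biases concrete, here is a minimal sketch of a single artificial neuron. The sigmoid activation and all of the numbers are assumptions chosen for illustration; real networks use many neurons and a range of activation functions.

```python
import math


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, squashed by a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Toy values, not taken from any real model.
weights = [2.0, -1.0]
strong_input = [1.0, 0.0]  # lines up with the large positive weight
weak_input = [0.0, 1.0]    # lines up with the negative weight

# The negative bias raises the bar that the weighted sum has to clear
# before the output climbs above 0.5, i.e. before the neuron 'fires'.
print(neuron(strong_input, weights, bias=-1.0))  # above 0.5
print(neuron(weak_input, weights, bias=-1.0))    # below 0.5
```

Training amounts to nudging `weights` and `bias` across the whole network until outputs like these line up with the labelled examples.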

If you’ve ever seen an analog music synthesizer covered in knobs, this is a good analogy, but multiply the number of dials by a million or more. And our input could be the feed from a video camera, which – after passing through all of the settings – turns on a light whenever a dog is seen in the image.

Looking at the numbers on the dials, we might see that some parameters are more important than others. And that brings us to the concept of model pruning, which is one approach to squeezing algorithms onto edge AI hardware.

Today, developers use a variety of methods to make edge AI neural networks faster to run and smaller to accommodate with little or no loss in performance. One approach is to zero out very small model weights, on the basis that they belong to connections that have little impact on how the algorithm behaves.
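Zeroing out small weights is often called magnitude pruning. The following is a minimal sketch rather than any particular toolkit's implementation; the function name `magnitude_prune` and the toy weight matrix are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out roughly `sparsity` (0 to 1) of the weights, smallest first.

    `weights` is a list of rows. Ties at the threshold are all pruned,
    so slightly more than the requested fraction can be removed.
    """
    magnitudes = sorted(abs(w) for row in weights for w in row)
    n_prune = int(len(magnitudes) * sparsity)
    if n_prune == 0:
        return [row[:] for row in weights]
    threshold = magnitudes[n_prune - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


# Toy 2x3 weight matrix, for illustration only.
layer = [[0.8, -0.05, 0.3],
         [0.01, -0.6, 0.02]]

print(magnitude_prune(layer, sparsity=0.5))
# [[0.8, 0.0, 0.3], [0.0, -0.6, 0.0]]
```

Zeros on their own only save memory and compute when the storage format or hardware can skip them, which is one reason pruning is usually combined with the other methods described here.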

Another trick is to retrain the pruned model over a few iterations, which may result in fine tweaks to the remaining parameters, to recover any lost accuracy. Some pruned image recognition algorithms can even outperform the original neural networks, which is a great result for edge AI.
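The retraining step can be sketched with a toy linear model, where a mask keeps the pruned weight at zero while gradient descent adjusts the survivors. Everything here (the data, the learning rate, the `fine_tune` helper) is an assumption made for illustration; real fine-tuning runs on the full network using a deep learning framework.

```python
def predict(weights, xs):
    return sum(w * x for w, x in zip(weights, xs))


def mse(weights, samples):
    """Mean squared error of a linear model over (inputs, target) pairs."""
    return sum((predict(weights, xs) - y) ** 2 for xs, y in samples) / len(samples)


def fine_tune(weights, mask, samples, lr=0.1, steps=200):
    """Gradient descent on the MSE, with masked (pruned) weights held at zero."""
    w = [wi * m for wi, m in zip(weights, mask)]
    n = len(samples)
    for _ in range(steps):
        grads = [0.0] * len(w)
        for xs, y in samples:
            err = predict(w, xs) - y
            for i, xi in enumerate(xs):
                grads[i] += 2.0 * err * xi / n
        # Multiplying by the mask stops pruned weights from growing back.
        w = [(wi - lr * g) * m for wi, g, m in zip(w, grads, mask)]
    return w


# Toy set-up: the second input always equals the first, so the small
# second weight is redundant once the first weight is adjusted.
original = [2.0, 0.1, -1.0]
inputs = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0], [2.0, 2.0, 1.0], [1.0, 1.0, 2.0]]
samples = [(xs, predict(original, xs)) for xs in inputs]

mask = [1.0, 0.0, 1.0]              # prune the smallest weight
pruned = [w * m for w, m in zip(original, mask)]
tuned = fine_tune(original, mask, samples)

print(mse(pruned, samples))  # error introduced by pruning
print(mse(tuned, samples))   # much smaller after a short fine-tune
print(tuned)                 # first weight has drifted towards 2.1
```

In this contrived case the remaining weights can absorb everything the pruned weight was doing; in a real network the recovery is usually only partial.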

Unfortunately, large language models (LLMs) can be harder to optimize as the retraining step isn’t trivial. But a new approach termed Wanda (pruning by weights and activations), which has been evaluated on the LLaMA family of LLMs, shows that scoring each weight by its magnitude together with the size of the activations feeding into it allows 50% of the weights to be pruned without a major loss in performance. And, importantly, the training doesn’t need to be rerun to update the weights.
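As a rough sketch of the idea behind Wanda, each weight can be scored by its absolute value multiplied by the norm of the input activations it sees over a small calibration set, with the lowest scores removed separately for each output neuron. This is a simplified illustration, not the authors' code: the function name, the toy layer, and the calibration values are all made up.

```python
import math


def wanda_style_prune(weights, calibration_inputs, sparsity=0.5):
    """Prune each output row by |weight| * L2 norm of its input feature.

    weights: rows are output neurons, columns are input features.
    calibration_inputs: list of input vectors from a calibration set.
    """
    n_in = len(weights[0])
    norms = [math.sqrt(sum(x[j] ** 2 for x in calibration_inputs))
             for j in range(n_in)]
    n_prune = int(n_in * sparsity)
    pruned = []
    for row in weights:
        scores = [abs(w) * norms[j] for j, w in enumerate(row)]
        lowest = sorted(range(n_in), key=lambda j: scores[j])[:n_prune]
        drop = set(lowest)
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


# Toy layer with one output neuron and four inputs.
layer = [[0.1, 0.5, -0.4, 0.05]]
# The first input feature is consistently large, the second tiny.
calibration = [[10.0, 0.1, 1.0, 1.0],
               [8.0, 0.2, 1.0, 1.0]]

print(wanda_style_prune(layer, calibration))
# [[0.1, 0.0, -0.4, 0.0]]
```

Note that the small weight of 0.1 survives because its input is large, while the bigger weight of 0.5 is dropped because its input barely registers. Pruning on weight size alone would have made the opposite choice.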

Considering how the weights are represented can help too – for example, storing values as 8-bit integers rather than in single-precision floating-point format (FP32) cuts the memory needed for the weights by a factor of four. The integers are paired with a stored scale factor so that close approximations of the original values can still be recovered for processing.
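A minimal sketch of 8-bit quantization, assuming a simple symmetric scheme with one scale factor for the whole set of weights. Production toolkits offer more refined variants, such as per-channel scales and zero points, and the weight values below are invented for the example.

```python
from array import array


def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the stored integers."""
    return [qi * scale for qi in q]


weights = [0.42, -1.27, 0.003, 0.9]  # toy values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)         # small integers, one byte each
print(restored)  # close to, but not exactly, the original weights

# One byte per weight instead of four.
print(array("b", q).itemsize, "byte vs", array("f", weights).itemsize, "bytes")
```

The rounding error on each weight is at most half of `scale`, which is why very small values such as 0.003 can collapse to zero.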

Another strategy for making algorithms more efficient for edge AI applications is to deploy so-called teacher and student models, where the student learns from the richer information provided by the teacher. Specifically, the teacher model can give the student model its probability distribution over the possible results, rather than just the single most likely answer, as training targets.
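The teacher's 'richer information' can be sketched as a loss that compares the student's predicted probabilities with the teacher's. The temperature parameter, which softens both distributions, is a common choice in the distillation literature but is an assumption here, as are the function names and the toy scores.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between the teacher's soft targets and the student's output."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(p * math.log(q) for p, q in zip(teacher_probs, student_probs))


# Toy scores over three classes, e.g. dog, wolf, cat.
teacher = [4.0, 2.0, 0.5]
close_student = [3.9, 2.1, 0.4]
poor_student = [0.5, 2.0, 4.0]

print(softmax(teacher, temperature=2.0))          # soft targets
print(distillation_loss(teacher, close_student))  # low loss
print(distillation_loss(teacher, poor_student))   # higher loss
```

A hard label would only say 'dog'. The soft targets also tell the student that 'wolf' is a more plausible mistake than 'cat', which is the extra signal it learns from.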

This approach has been used successfully to build DistilBERT, a distilled version of BERT: smaller, faster, cheaper, and lighter. Using teacher and student models (also known as knowledge distillation), Hugging Face researchers showed that it’s possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster.

And to understand why this is such a big deal, it’s worth noting that BERT is one of the most useful NLP models out there. BERT can be used for text encoding to retrieve similar passages from other data. It can summarize large amounts of text information and provide answers to questions.

Considering edge AI, lightweight NLP models could process data locally to preserve privacy and protect sensitive information that clients may not want to be sent to the cloud. And companies could use DistilBERT to build their own proprietary semantic search engines for navigating business data without having to send any of that data to Google, for example.

AI success stories

AI success stories in the cloud are inspiring a variety of use cases. And, as developers become more accomplished at compressing that algorithmic performance into smaller footprints, we can expect those benefits to translate into edge AI applications too.

Also, users have a growing number of tools to lean on to optimize their machine-learning models. Google’s TensorFlow Model Optimization Toolkit supports the deployment of models to edge devices that have restrictions on processing, memory, power consumption, network usage, and model storage space.

There are other options too, such as model optimization SDKs that enable efficient deep learning for embedded systems. And providers include Swedish deep tech firm Embedl, which recently raised 45 MSEK (USD 4.1 million) to scale up its operations.

“Embedl’s solution provides significant performance enhancement when developing Autonomous Driving Systems (AD) and Advanced Driving Assistance Systems (ADAS) in the automotive sector,” writes the firm on its website. “It will also allow AI to be incorporated into consumer products with less powerful hardware.”

According to the company, customers can use the SDK to create deep learning algorithms that can be run on battery-powered devices, which signposts another trend in edge AI.