Google shows how LLMs with vision can change businesses

A multimodal embeddings model allows developers to use text and image vectors interchangeably to turbocharge product search.
23 August 2023

Multimodal embeddings extend the idea that similar data clusters together, and tech giants such as Google and Facebook are showing how combining images, text, and other media can improve product search, recommendation systems, and a range of other use cases.

Imagine an information space where data is arranged based on similarity. And, in such a digital world, embeddings are the vector coordinates for those various pieces of information. What’s more, developments in multimodal embeddings, which open the door to large vision language models (VLMs), are expected to turbocharge product search and other e-commerce features offered by marketplaces.

Embeddings are useful for establishing similarities between words and phrases – for example, the words ‘king’ and ‘queen’ may be closer to each other than to unrelated text such as ‘bananas’ and ‘DIY’. Using numbers, or more specifically multi-dimensional vectors, to represent words adds meaning and enables a range of operations.
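
To make the idea concrete, here’s a minimal sketch of how cosine similarity over those vector coordinates captures the ‘king’/‘queen’ versus ‘bananas’ distinction. The four-dimensional vectors below are made up for illustration; real models learn hundreds of dimensions per word, but the comparison works the same way.

```python
import numpy as np

# Toy 4-dimensional vectors standing in for learned word embeddings
# (values are illustrative only; trained models use hundreds of dimensions).
embeddings = {
    "king":    np.array([0.80, 0.65, 0.10, 0.05]),
    "queen":   np.array([0.78, 0.70, 0.12, 0.04]),
    "bananas": np.array([0.05, 0.10, 0.90, 0.60]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity of two vectors, from -1 (opposite) to 1 (identical)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))    # close to 1
print(cosine_similarity(embeddings["king"], embeddings["bananas"]))  # much lower
```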

Embedding space arithmetic

For example, depending on the training dataset, it’s been shown that subtracting ‘England’ from ‘London’ and then adding ‘Japan’ produces a vector close to ‘Tokyo’, thanks to the coordinates assigned during the embedding process. And, as mentioned, things get more interesting still when you can add images to the mix to generate so-called multimodal embeddings.
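
That arithmetic can be reproduced with off-the-shelf tooling. The sketch below assumes gensim’s downloader and a small set of pretrained GloVe vectors; the exact nearest neighbour depends on the training corpus, so ‘tokyo’ isn’t guaranteed to top the list every time.

```python
import gensim.downloader as api

# Download a small set of pretrained GloVe vectors (roughly 100 MB on disk);
# results vary with the corpus the vectors were trained on.
vectors = api.load("glove-wiki-gigaword-100")

# 'london' - 'england' + 'japan' ≈ 'tokyo' (embedding-space arithmetic)
result = vectors.most_similar(positive=["london", "japan"], negative=["england"], topn=3)
print(result)  # 'tokyo' typically appears near the top of the list
```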

Embeddings representing words, images, or a combination of both highlight content that’s used in a similar context. And these relationships are useful to developers when building search applications and recommender systems – to give a couple of examples. Also, the methodology has come a long way in the past decade, moving from static pre-trained values determined using Word2Vec or GloVe embeddings to contextual approaches such as ELMo.

Contextual approaches assign more than one vector to a word to account for multiple meanings – for example, the use of the word ‘present’ in the vicinity of ‘gift’ is different to ‘present’ in the vicinity of ‘now’. And these models are trained by running forwards and backwards over large amounts of data and predicting the next and preceding words, respectively, in the sequence.
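
The ‘present’ example can be checked directly. The sketch below uses Hugging Face’s bert-base-uncased as a readily available stand-in for a contextual model (ELMo works on the same principle of assigning context-dependent vectors) and compares the vectors given to ‘present’ in two different sentences.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# BERT used here as an easily downloadable contextual model; each token
# receives a vector that depends on the words around it.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_for(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual vector assigned to `word` within `sentence`."""
    tokens = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**tokens).last_hidden_state[0]
    word_id = tokenizer.convert_tokens_to_ids(word)
    position = (tokens["input_ids"][0] == word_id).nonzero()[0].item()
    return hidden[position]

gift = embedding_for("she wrapped the present in ribbon", "present")
now  = embedding_for("at present the market is quiet", "present")
print(torch.cosine_similarity(gift, now, dim=0))  # noticeably below 1.0
```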

Multimodal embeddings include a shared representation that gathers information from different kinds of data, such as text and images. And early breakthroughs such as ViLBERT, which adds image context to natural language processing (NLP), point to the advantages of using multimodal training sets.

ViLBERT made it possible to generate image captions, use text to highlight people and objects in photographs, find visuals that match a description, and complete many more tasks. And with multimodal embeddings now available through Google Cloud’s Vertex AI machine learning (ML) platform, the ability to combine LLMs and image data can add value to semantic search.
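
As a rough sketch, generating such embeddings from Vertex AI looks something like the snippet below. It assumes the Vertex AI Python SDK, a Google Cloud project with the API enabled, and placeholder values for the project ID, region, and image file.

```python
import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

# Placeholder project ID, region, and image path – swap in real values.
vertexai.init(project="my-project", location="us-central1")

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
image = Image.load_from_file("ceramic_cup.jpg")

# Image and text are embedded into the same vector space, so either vector
# can be compared against the other with a simple dot product.
embeddings = model.get_embeddings(
    image=image,
    contextual_text="ceramic cup in Google logo colors",
)
print(len(embeddings.image_embedding), len(embeddings.text_embedding))  # same length
```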

Multimodal embeddings support marketplaces

Working with 5.8 million product images from Mercari’s online marketplace, Google researchers showed how multimodal embeddings enabled a powerful search experience simply by pointing the VLM at item images. “The images are sorted into extremely specific categories, providing a glimpse into the complex way that [the] model understands the images,” write the developers.

A search for ‘cups in Google logo colors’ produced 60 results of tumblers for sale in red, yellow, blue, and green – matching the search giant’s colorway. “The model can identify the colors of the Google logo and which images contain those colors, all without any explicit training (zero shot learning),” add the researchers in their blog post.

Using the approach, which is underpinned by multimodal embeddings, e-commerce operators can recommend an item name, description, and selling price to sellers based on an uploaded image.
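
The underlying retrieval pattern is straightforward: precompute an embedding for every catalogue image, embed the shopper’s text query with the same model, and rank by similarity. The sketch below illustrates this with random placeholder vectors and hypothetical item IDs rather than a real catalogue.

```python
import numpy as np

# Hypothetical catalogue: one embedding per product image, precomputed with a
# multimodal model (e.g. the Vertex AI sketch above), plus made-up item IDs.
item_ids = ["cup_001", "tumbler_042", "jacket_917"]
image_vectors = np.random.rand(3, 1408)  # placeholder vectors
image_vectors /= np.linalg.norm(image_vectors, axis=1, keepdims=True)

def search(query_vector: np.ndarray, top_k: int = 2) -> list[str]:
    """Rank catalogue images by cosine similarity to a text-query embedding."""
    query_vector = query_vector / np.linalg.norm(query_vector)
    scores = image_vectors @ query_vector
    best = np.argsort(scores)[::-1][:top_k]
    return [item_ids[i] for i in best]

# In practice the query vector would come from embedding the shopper's text,
# e.g. "cups in Google logo colors", with the same multimodal model.
print(search(np.random.rand(1408)))
```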

Other opportunities include security monitoring, as systems can be loaded with prompts such as ‘person at the door’ so that they can alert staff when live images match the text. It’s possible that VLMs will also help to better label data for training autonomous vehicles (AVs) or at least help developers find images quickly based on text prompts, even when photos are untagged.

Google has a demo site that matches an uploaded image (or a link to an image) to products on its own merchandise store when a relevant match is found. And putting the demo to the test, it worked well for TechHQ – returning a variety of mugs and tumblers in the search results in response to a photo of a ceramic cup.

And it’s not just Google that’s exploring the opportunities. In May 2023, Facebook Research uploaded a paper dubbed ‘ImageBind: One Embedding Space to Bind Them All’ to arXiv, which shows how to build a joint embedding across six modalities. In this case, the multimodal embedding encompassed images, text, audio, depth, thermal, and IMU data (such as accelerometer or gyroscope readings).

The joint embeddings enable image retrieval based on a sound, for example, and the model – currently only a research prototype – could extend search beyond text inputs and even voice commands. Plus, it takes the embedding space arithmetic that’s possible to the next level.
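
Reference code for ImageBind has been published on GitHub, and a retrieval query that ranks images against a sound looks roughly like the sketch below. The file names are placeholders, and the pretrained checkpoint is a multi-gigabyte download.

```python
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

# Load the pretrained ImageBind checkpoint (large download on first run).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = imagebind_model.imagebind_huge(pretrained=True).to(device).eval()

# Placeholder files: a query sound and a small set of candidate images.
audio_paths = ["birdsong.wav"]
image_paths = ["fruit_bowl.jpg", "city_street.jpg", "garden_tree.jpg"]

inputs = {
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
}

with torch.no_grad():
    embeddings = model(inputs)

# Because audio and images share one embedding space, a softmax over their
# dot products ranks the candidate images against the sound.
scores = torch.softmax(
    embeddings[ModalityType.AUDIO] @ embeddings[ModalityType.VISION].T, dim=-1
)
print(scores)
```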

Alaa El-Nouby, who was involved in building ImageBind, has shared some embedding maths examples on his Twitter feed, such as how adding the sound of birdsong to photos of fruit produces images of birds in fruit trees. And a picture of a signpost plus thunderstorm audio generates photos of cloudy and rainy city streets.
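
In embedding terms, those examples amount to adding normalised vectors from two modalities and retrieving the nearest images. The snippet below is only an illustration of that idea with random placeholder vectors, not El-Nouby’s actual code.

```python
import torch

def normalize(v: torch.Tensor) -> torch.Tensor:
    return v / v.norm(dim=-1, keepdim=True)

# Placeholder embeddings standing in for ImageBind-style outputs.
fruit_image = normalize(torch.randn(1024))       # photo of fruit
birdsong    = normalize(torch.randn(1024))       # birdsong audio clip
candidates  = normalize(torch.randn(500, 1024))  # retrieval gallery of images

# 'Embedding maths': sum the two query vectors, then retrieve the nearest
# gallery images – in the shared space, these skew towards birds in fruit trees.
query = normalize(fruit_image + birdsong)
top5 = torch.topk(candidates @ query, k=5).indices
print(top5)
```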

It’s interesting to consider multimodal embeddings alongside the proliferation of smartphones with increasingly capable audio and video capture, especially given that those media files typically have metadata stored with them, which could further contribute to shared vector representations.

For example, researchers in the US have shown how visual representations can be learned by creating a joint embedding between sections of a photo and EXIF metadata.

There are numerous ways that multimodal embeddings can benefit businesses. In the financial sector, analysts have shown how combining representations of company returns with market segments defined in news articles can provide superior company descriptions.

“Industry classification schemes provide a taxonomy for segmenting companies based on their business activities,” writes the University College Dublin team in its paper. “However, even modern classification schemes have failed to embrace the era of big data and remain a largely subjective undertaking prone to inconsistency and misclassification.”

Exploiting the information shared between different data types is another example of how multimodal embeddings can offer firms a fresh solution.