Gain an Initial Understanding Of This Transformative Technology
Visual large language models (VLLMs), also known as visual language models, are a subset of generative AI. They leverage the same architectures and techniques that make text-only LLMs like ChatGPT so powerful and extend them to the visual domain. These models are best described as multimodal, as they can accept text and images at once. Where some applications can only receive data in text form, a visual LLM lets the user also pass in visual content and ask questions about it.
Imagine inputting a recording of a video call with colleagues. A visual LLM could process this video along with a text prompt such as “Who was at the meeting?”, “What was discussed during this call?” or “Where was each attendee seated?” Based on the prompt and the video data provided, you could get a list of attendees, a summary, or even an analysis of participants’ sentiments. An application like this, which needs both video and text to produce an accurate response, uses a visual LLM as its foundation model.
VLLMs are typically used for the following vision and language tasks (two of them are sketched in code after the list):
Image Captioning: Generating text to describe the contents of the image.
Visual Question Answering (VQA): Answering questions about the contents of the image.
Image-Text Classification: Determining whether an image matches a given text description.
Visual Grounding / Object Detection: Identifying and localizing objects within an image based on text descriptions.
Object Segmentation: Segmenting parts of an image based on text descriptions.
Optical Character Recognition (OCR): Extracting text from an image.
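To make two of these tasks concrete, here is a minimal sketch using the Hugging Face transformers library. The checkpoint names are examples of publicly available models and the image file name is a placeholder; any compatible checkpoint and image could be substituted.

```python
# Minimal sketch: image captioning and visual question answering with
# off-the-shelf checkpoints from the Hugging Face Hub.
from PIL import Image
from transformers import pipeline

image = Image.open("dinner_photo.jpg")  # placeholder: any local image file

# Image captioning: generate a short description of the image.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
caption = captioner(image)[0]["generated_text"]
print("Caption:", caption)

# Visual question answering: ask a free-form question about the same image.
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
answer = vqa(image=image, question="What food is on the plate?")
print("Answer:", answer[0]["answer"])
```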
What makes VLLMs so groundbreaking is that they are generalist models. While models already existed for each of the tasks above, they were highly domain-specific and required retraining for each new application. VLLMs, on the other hand, have learned general concepts thanks to their massive training datasets and architecture. This makes them suitable for a vast number of applications without any modification, at the cost of greater computational requirements.
For those following the growth of AI and its continued development, this form of deep learning represents an astounding leap forward, particularly when compared to traditional machine learning.
Use Cases For Visual Large Language Models
A visual LLM’s ability to receive, process, and output information based on multimodal data provides an opportunity for many industries to develop powerful use cases.
Here are some examples of new and potential visual LLM uses, broken out by a variety of industries:
Virtual medicine: A patient or doctor can upload images or videos of symptoms, along with a health history summary, to receive a more accurate diagnosis.
Business: A team working with a new client during a video call can receive a transcript of this call for review, along with an analysis of when clients seemed less engaged in the conversation.
Customer service: A customer wants a chatbot to help them initiate a merchandise return by showing the extent of the product’s shipping damage and explaining that they do not want to be held liable for the damage.
Consumer use: A person having dinner in a restaurant loves the dish they are eating so much that they want to know how to make it. They upload an image of their dinner and prompt the model to add the dish’s ingredients to their weekly grocery delivery app.
Agricultural use: A farmer wants to better prepare their crop for the hot weather in the week ahead. They input video of their crop’s current health along with detailed weather forecasts to make determinations about the optimal volume of resources, like fertilizer and water, to keep their plants healthy during the heatwave.
Manufacturing: A visual LLM can generate training information for employees about a new piece of machinery. To ensure that the training is tonally and factually accurate, a facility trainer only needs to prompt the model with past training videos (or other videos they consider good examples) and a manual for the machine the employees are being trained on.
What Are Some Of the Most Well-Known Visual Large Language Models?
Several models from leading technology companies provide the foundation for visual language model applications available today.
Some of the most popular models include:
CLIP: Developed by OpenAI, CLIP (Contrastive Language-Image Pretraining) is trained to match images with text descriptions. It is a “zero-shot” model, meaning that it doesn’t have to be trained for a specific task to perform well on it (see the sketch after this list).
Flamingo: An early visual language model from Google DeepMind, Flamingo bridges pretrained vision and language models, which results in strong performance from few or even zero task-specific examples.
Florence: Developed by Microsoft, Florence is a vision foundation model pretrained on large-scale image-text data that delivers strong results across many multimodal tasks, including in zero-shot settings.
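To illustrate the zero-shot idea, here is a minimal sketch of image-text matching with CLIP through the Hugging Face transformers library. The image file and candidate labels are placeholders; CLIP simply scores how well each piece of text matches the image, with no task-specific training.

```python
# Minimal sketch: zero-shot image-text classification with CLIP.
from PIL import Image
from transformers import pipeline

classifier = pipeline(
    "zero-shot-image-classification", model="openai/clip-vit-base-patch32"
)

image = Image.open("warehouse_photo.jpg")  # placeholder: any local image file
labels = ["a damaged shipping box", "an intact shipping box", "a pallet of crates"]

# CLIP ranks the candidate texts by how well they match the image.
for result in classifier(image, candidate_labels=labels):
    print(f'{result["label"]}: {result["score"]:.2f}')
```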
Although pretrained models can provide accurate results on tasks they were never explicitly trained for (zero-shot), even better results can be achieved if the VLLM is fine-tuned on examples from the specific application.
Depending on the use case, an engineering team designing a new application may prefer a fine-tuned model over a zero-shot one. While a zero-shot model may be user-friendly enough for general queries, fine-tuning allows the team to train the visual LLM on specific kinds of data to ensure maximum accuracy and deliver an effective, marketable use case.
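As a rough illustration of what such fine-tuning could look like, the sketch below continues the CLIP example and nudges the model toward a specific domain using its built-in contrastive loss. The image files and captions are hypothetical placeholders; a real project would use a proper dataset, batching, and evaluation.

```python
# Minimal sketch: fine-tuning CLIP on a handful of domain-specific
# image-caption pairs using its image-text contrastive loss.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical training pairs: each image file alongside the caption
# that should match it.
pairs = [
    ("damaged_box_01.jpg", "a shipping box with a crushed corner"),
    ("intact_box_01.jpg", "an undamaged shipping box"),
]
images = [Image.open(path) for path, _ in pairs]
texts = [caption for _, caption in pairs]
inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for epoch in range(3):
    outputs = model(**inputs, return_loss=True)  # CLIP's contrastive loss
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"epoch {epoch}: loss = {outputs.loss.item():.4f}")
```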
As more and more deep learning applications enter the market, stakeholders want to be sure that the output they provide is beneficial and safe for users. For example, Hugging Face recently partnered with the University of Edinburgh’s Natural Language Processing Group and Open Life Science AI to create benchmarks for how LLMs perform when providing medical diagnoses and responses.
Want to try a VLLM yourself? LLaVA, another strong vision language model, is available on Hugging Face; a minimal sketch of querying it follows below.
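The checkpoint name and prompt format below follow the publicly available llava-hf model cards, the image file is a placeholder, and a GPU is assumed for reasonable speed.

```python
# Minimal sketch: asking LLaVA a free-form question about an image.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("meeting_frame.jpg")  # placeholder: e.g. a frame from a video call
prompt = "USER: <image>\nWho appears in this picture and what are they doing? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```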
Learn More About What’s Next For Machine Learning and Deep Learning
Continue following the conversation about groundbreaking AI engineering by exploring recent articles on general and specific topics, no matter where you are in your journey with this technology.
Scientific researchers and software developers can explore these deep learning and AI topics:
If you’re interested in understanding why generative AI is different from machine learning, or you’ve heard of neural networks and other key terms, but you’re unsure how they work, try these articles that explain the basics: