🧩 What is a Multimodal AI?

AI's attempt to be more human-centric

What is a multimodal AI and why does it matter? Let's investigate!

Executive summary: Multimodal AI gives machines a broader, more human-like ability to understand and respond to the world. Learn the definition of the term and why it matters, and see some real-world examples without technical jargon (other than my using the word multimodal, that is).


As you might imagine, I spend a lot of time researching AI products: how they work, what they are (and are not) suitable for, and how they measure up against similar products. One feature that so-called foundation AI models support is multimodality, which, of course, rolls right off your tongue and is super easy to understand, right? That was a joke, but knowing what the term means and how you can benefit from it is essential, so let's learn why we should care about an AI with multimodal capabilities!

⌨️ Unimodal

Some popular AI models interact with you in a singular (unimodal) way. While I may be generalizing a bit here, if the AI solution only has one type of input and output, then you would refer to it as unimodal. An excellent example of this is a model that only accepts text as an input and responds with text as an output.

Unimodal examples

Unimodal AI models focus on doing one thing well. Excellent examples of unimodal models include:

  • DeepSpeech: An open-source speech-to-text engine that receives audio as input and outputs transcribed text
  • Google Translate: Translates text from one language to another
  • WaveNet: Converts text to natural-sounding speech

Imagine you created a training course and want to make it available in multiple languages but do not have the budget to hire staff to re-create each course in a native language. In that case, you might:

  1. Record the training session
  2. Use a tool like DeepSpeech to create a transcription of your course
  3. Use Google Translate to convert the text into the other language(s) you wish to deliver the training in
  4. Use WaveNet to output new audio for the training course in that translated language

You might be using best-of-breed AI models, but you would have to master each one individually, which requires highly specialized technical skills. Wouldn't it be nice if one tool could do all (or even a portion) of that for you? It is not easy to combine all these skills into a single, unified platform, but it is happening, and we call the result multimodal AI models.
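If you like to think in code, the four-step workflow above can be sketched as a simple chain of unimodal tools. Note that the three helper functions below are hypothetical stand-ins for products like DeepSpeech, Google Translate, and WaveNet, reduced to plain Python so the flow is easy to follow; they are not the real APIs.

```python
def transcribe(audio_path: str) -> str:
    # Hypothetical stand-in for a speech-to-text engine such as DeepSpeech.
    return "Welcome to the training course."

def translate(text: str, target_language: str) -> str:
    # Hypothetical stand-in for a text translator such as Google Translate.
    return f"[{target_language}] {text}"

def synthesize(text: str, output_path: str) -> str:
    # Hypothetical stand-in for a text-to-speech engine such as WaveNet.
    return output_path

def localize_course(audio_path: str, target_language: str) -> str:
    """Transcribe, translate, then re-voice a recorded training course."""
    transcript = transcribe(audio_path)                  # step 2: speech to text
    translated = translate(transcript, target_language)  # step 3: translate
    return synthesize(translated, f"course_{target_language}.wav")  # step 4: new audio

print(localize_course("course_recording.wav", "es"))
# prints "course_es.wav"
```

Even in this toy form, you can see the cost of the unimodal approach: you are the glue between three separate tools, each with its own inputs, outputs, and quirks.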

Introducing Multimodal Models

A gradient background with a microphone, camera, phone, AI on a laptop, and handwritten alphabet letters.
Multimodal models can interact with you in multiple ways and understand the context.

You will often hear the term multimodal in transportation planning. For example, city planners might develop multimodal transportation systems where a single street can support buses, trains, and bike lanes to better serve the community's diverse needs.

Just like our city planning example, there are times when you want to interact with the AI in different ways. Perhaps you want to speak your prompt or take a picture and ask a question about it. Further, you might like the AI to respond with text, images, or voice. This ability to switch contexts and understand the conversation makes an AI model multimodal.

Now that you understand the concept, let's explore how we can benefit from multimodal AI.

🖼️ Image and photo creation

In the following example, we can use text to describe an image we want the AI to create. The AI combines text and image generation capabilities to create a coherent, visually accurate depiction of the described scene.

Prompt: Create a stunning photorealistic view of the sun setting over the ocean.
Sunset over a calm ocean with vibrant orange, pink, and purple skies, reflecting on gentle waves.
Embracing the day's final light as the sun kisses the horizon, casting golden reflections across the tranquil ocean.
AI: Here is the photorealistic view of the sun setting over the ocean that you requested. I hope you find it stunning!

Multimodal capabilities used

  • Natural language processing (NLP): Understands and interprets the text prompt to ensure the generated image matches the user's description
  • Generative models (GANs, diffusion models): Synthesize the initial photorealistic image, drawing on learned patterns from large image datasets to create a sunset over the ocean
  • Computer vision (style transfer and enhancement): Refines and enhances the generated image, ensuring it captures an actual sunset's details and visual effects
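For the curious, here is how those three stages hand off to one another, sketched as a pipeline. Every function is a hypothetical stand-in (real products run large neural networks at each stage); the point is only the shape of the flow: interpret, generate, refine.

```python
def interpret_prompt(prompt: str) -> dict:
    # Stand-in for NLP: turn free text into a structured scene description.
    return {"subject": "sun setting", "setting": "ocean", "style": "photorealistic"}

def generate_image(scene: dict) -> str:
    # Stand-in for a generative model (GAN or diffusion) producing a draft image.
    return f"draft:{scene['subject']} over the {scene['setting']}"

def refine_image(image: str, style: str) -> str:
    # Stand-in for computer-vision refinement (style transfer, enhancement).
    return f"{style} {image}"

def text_to_image(prompt: str) -> str:
    scene = interpret_prompt(prompt)   # stage 1: NLP
    draft = generate_image(scene)      # stage 2: generative model
    return refine_image(draft, scene["style"])  # stage 3: refinement

print(text_to_image("Create a stunning photorealistic view of the sun setting over the ocean."))
# prints "photorealistic draft:sun setting over the ocean"
```

The multimodal part is that all three stages live behind one prompt box: you never see the handoffs.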

🎨 Image creation and creative thinking

In the following example, we ask AI to create a logo and branding styles.

Prompt: I canned tomatoes and want to sell them. Create a stylized logo that represents the product. Since I want brand consistency when advertising them, tell me what colors and fonts I should use.
AI: Here is a stylized tomato designed for use as a logo. It has a modern and abstract look, with clean lines and vibrant colors, making it visually striking and easily recognizable.
AI: The brand colors for your company are centered around a vibrant and modern palette inspired by the stylized tomato logo. The primary color, Tomato Red (#D32F2F), conveys the richness and ripeness of fresh tomatoes, while Leaf Green (#388E3C) serves as a secondary color, emphasizing freshness and natural quality. Bright Highlight (#FFCDD2) adds lighter accents, creating depth, and Deep Green (#1B5E20) is used for shadows and darker contrasts. A Neutral Background (#F1F8E9) keeps the overall look clean and minimalistic, ensuring the brand's visual identity is cohesive and inviting across all platforms.
Color names and hex codes:

  • Tomato Red: #D32F2F
  • Leaf Green: #388E3C
  • Bright Highlight: #FFCDD2
  • Deep Green: #1B5E20
  • Neutral Background: #F1F8E9
AI: For your brand, consider using a combination of fonts that reflect both the freshness and handcrafted quality of your canned tomatoes. "Montserrat" is an excellent choice for your primary font; it’s a modern, clean, and versatile sans-serif that pairs well with the vibrant, minimalist aesthetic of your logo. For a more personal, artisanal touch, you might complement it with a hand-drawn or script font like "Pacifico" or "Amatic SC," which adds a sense of warmth and authenticity, perfect for labeling and secondary text. This blend of modern and hand-crafted fonts will create a balanced, approachable, and cohesive brand identity.

Multimodal capabilities used

  • Natural Language Processing (NLP) to read and understand your prompt
  • Generative AI to create the logo
  • Computer vision to inspect the logo for use in designing the brand standards
  • Knowledge representation and recommendation systems to suggest the best color palette and fonts

🧐 Image analysis and summarization

In the following example, I do not create an image. Instead, I upload a photo and ask the AI to inspect it and tell me what it sees. Since I might be publishing the image, it is always a best practice to create a caption and alt text for those who may not be able to see it.

Prompt: Describe this image
Cat and dog wearing sunglasses, lounging on a beach chair by the ocean, enjoying a sunny day.
An image I submitted to ChatGPT for analysis.
AI: The image depicts a cat and a dog relaxing together on a beach chair. Both the cat and the dog are wearing sunglasses, with the cat sporting a pair of brown frames and the dog wearing orange ones. The background features a serene beach scene with the ocean in the distance, giving the impression that they are enjoying a sunny day by the sea. The overall mood of the image is playful and lighthearted, highlighting the companionship between the two animals.
Prompt: Create a caption for this image in 100 characters or less.
AI: Chillin' in style: Just a couple of cool friends soaking up the sun!
Prompt: Now create some alt text for people who cannot see the image.
AI: Cat and dog wearing sunglasses, lounging on a beach chair by the ocean, enjoying a sunny day.

Multimodal capabilities used

  • Computer vision for image recognition and object detection
  • Natural language processing (NLP) and text generation to understand and respond to your prompts
  • Image-to-text conversion (which many would say is a multimodal activity in and of itself), where the AI understands the image well enough to describe its contents in text
  • Assistive AI to create captions and alt text for the photo
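To make the analyze-then-caption flow concrete, here is a small sketch. The describe_image function is a hypothetical stand-in for the computer-vision step; the caption helper simply enforces the 100-character limit from my prompt, while the alt text reuses the full description.

```python
def describe_image(image_path: str) -> str:
    # Hypothetical stand-in for image recognition + image-to-text conversion.
    return ("Cat and dog wearing sunglasses, lounging on a beach chair "
            "by the ocean, enjoying a sunny day.")

def make_caption(description: str, limit: int = 100) -> str:
    # Keep the caption short: take the first clause and respect the limit.
    caption = description.split(",")[0]
    return caption if len(caption) <= limit else caption[:limit - 1] + "…"

def make_alt_text(description: str) -> str:
    # Alt text can simply reuse the full literal description.
    return description

desc = describe_image("beach_pets.jpg")
print(make_caption(desc))  # prints "Cat and dog wearing sunglasses"
print(make_alt_text(desc))
```

A real multimodal model does all of this in one conversational turn, which is exactly what makes it feel effortless.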

🧮 Analysis, reasoning, and math

Some AI models can even solve math equations, but let's take that further and use different modalities. In the following example, we upload an image with a chalkboard containing a math equation and ask the AI to solve the problem.

Prompt: Solve this
Chalkboard with "2 x 2 =" written on it, surrounded by an abacus, hourglass, and other antique tools.
I submitted an image to ChatGPT to analyze and answer the math problem on the chalkboard.
AI: The math problem on the chalkboard in the image is 2×2.
The solution is: 2×2=4

Multimodal capabilities used

  • Computer vision and optical character recognition (OCR) analyze the image to detect and recognize written text, converting visual characters into a digital format
  • Natural language processing (NLP) interprets the extracted text, recognizing it as a math problem that needs to be solved
  • Mathematical reasoning and computation solve the identified equation using basic arithmetic operations

🗣️ Voice

Speaking to an AI means you can prompt with your voice. You can ask it to do the things I already showed you, such as generating an image or brainstorming a new business venture complete with a logo and branding. You can also use it for real-time translation with other people, as in the demonstration in the following video.

I know some people in my newsletter may be discouraged from using YouTube at work, so I will describe the video here:

  1. A person uses ChatGPT with their voice and asks for a real-time English<>Italian conversation.
  2. The person asks a question in English.
  3. The AI (ChatGPT) speaks the question in Italian.
  4. Another person responds in Italian.
  5. The AI translates that response back into English.
  6. The entire conversation feels natural, intuitive, and, honestly, a little fun since the AI was not reading out words like a 1950s robot.

Multimodal capabilities used

  • Automatic speech recognition (ASR) converts spoken language into text for processing and translation.
  • Natural language processing (NLP) handles the text's understanding, translation, and contextual interpretation, enabling seamless communication between languages.
  • Text-to-speech (TTS) converts the translated text into spoken language, allowing continuous, voice-based interaction.
  • Dialogue management orchestrates the conversation flow, ensuring that the AI listens, translates, and responds in the correct language at the right times.

🤖 Are there other modalities?

Yes, there are, and any complete list would quickly become outdated. Here are a few more multimodal scenarios:

  • Sensors, video, and text to identify people or objects in a video feed that may require human intervention, and to describe the scene in text
  • Voice and gesture input for virtual reality headsets
  • Image, text, and audio for storytelling and training
  • Generative AI for text and audio to create songs with instrumentation
  • Natural language processing and video to create unique videos or animations

To date, I am not aware of any one AI that can perform all of these modalities, but we are getting close! ChatGPT and Microsoft Copilot seem to be the closest (recognizing that Microsoft Copilot builds on OpenAI's models). Google's Gemini is catching up, and at the moment, Anthropic's Claude seems more focused on text and natural language processing.

🙋 Have an AI project? Want to talk offline? Grab some time on my calendar!

I hope this article helped you better understand multimodal AI and how you can benefit from it. If you want to discuss this topic further or book a one-on-one session to explore it in more depth, grab some time on my calendar.

30 Minute Meeting - Bill Raymond

Dr. Deborah Chen provided editing services for this article.