
How AI Learns to See and Read Together with Multimodal LLMs

Discover how multimodal LLMs combine vision and language to automate document processing, enhance reasoning, and transform enterprise workflows with AI.

1. What Are Multimodal LLMs?

Traditional Large Language Models (LLMs), such as earlier GPT versions or, for the more technically inclined, Llama and Mistral models, process only one type of data: text. These models excel at reasoning, summarizing, and generating natural language, but they cannot directly interpret what’s inside an image, video, or audio clip.

Multimodal LLMs, on the other hand, can handle multiple modalities of information, where a modality refers to a specific type of input signal, such as text, images, sound, or video.

Let’s focus on the most common and practical combination today: text + images.

A multimodal model can both see (the image pixels) and read (the text in the image), a capability that enables entirely new categories of AI systems, from document understanding to visual question answering, and even medical image interpretation.

Figure 1: You can send an image and talk about it [1]

2. Real-World Use Cases

Multimodal models are emerging in almost every domain where humans process both visual and textual cues. Let’s explore some representative examples.

  • Image Captioning

The most intuitive task: give the model a picture, and it describes what it sees. Send a photo of a cat on a laptop, and the response might be “A curious cat sitting on a laptop keyboard.” (A minimal code sketch of captioning and VQA follows this list.)

  • Visual Question Answering (VQA)

You can ask questions about the image you sent: “How many people are in this photo?” or “What brand is the car?”. The model looks, reasons, and answers in natural language.

  • Chart, Diagram, and Table Parsing

Models can now extract numerical data from visual elements: reading charts, identifying anomalies, and even converting tables into LaTeX or Markdown automatically.

  • Cross-Modal Reasoning

You can ask questions that blend text and visuals: “Does the chart in this slide support the argument in the summary below?”. Such reasoning across modalities represents a step toward true multimodal cognition.
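To make the captioning and VQA examples above concrete, here is a minimal sketch using the open-source LLaVA 1.5 checkpoint via Hugging Face transformers. The model ID, prompt template, and file name are illustrative assumptions; any instruction-tuned vision-language model would work similarly.

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Illustrative open checkpoint; any instruction-tuned vision-language model works similarly.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("cat_on_laptop.jpg")  # hypothetical local file
# LLaVA 1.5 expects the image placeholder inside a USER/ASSISTANT prompt template.
prompt = "USER: <image>\nWhat is the cat in this photo doing?\nASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=60)
print(processor.decode(output[0], skip_special_tokens=True))
```

Swapping the question for “Describe this picture.” turns the same call into image captioning; chart, table, and cross-modal questions work the same way, limited mainly by what the model was trained on.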

3. How Do Multimodal LLMs Work?

Under the hood, these models combine language modeling and computer vision in a unified neural architecture. There are two dominant design families.

Figure 2: Two Methods of Multimodal Language Models [2]

Method A: Unified Embedding + Decoder Architecture

This approach feeds both text and image inputs into a single decoder model, usually adapted from a pre-trained LLM such as GPT-2, Gemma, Llama 3.2, or Phi-3.

  • Step 1: Image Encoding

An image encoder, typically a Vision Transformer (ViT) or a CLIP encoder, splits an image into small patches (like dividing text into tokens). Each patch is converted into an embedding vector representing visual features.

Figure 3: A classic setup of Vision Transformer (ViT) [3]
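As a rough sketch of this step (dimensions are illustrative, not tied to any particular model), patch embedding can be implemented with a single strided convolution, where each kernel application covers exactly one patch:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and embed each one,
    analogous to tokenizing text. Illustrative sketch, not a full ViT."""
    def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=256):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # A strided convolution maps every 16x16 patch to one embedding vector.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, pixels):               # (B, 3, 224, 224)
        x = self.proj(pixels)                # (B, 256, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (B, 196, 256) = patch embeddings

patches = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(patches.shape)  # torch.Size([1, 196, 256])
```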
  • Step 2: Linear Projection

Since the image embeddings and text embeddings may have different vector dimensions, a linear projection layer (a single fully connected layer) is used to align them. For instance, a 256-dimensional image vector might be projected to match a 768-dimensional text embedding.

Figure 4: Linear projection layer that projects flattened image patches from a 256-dimensional into a 768-dimensional embedding space [2]
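In code, this projector really is just one linear layer; the 256 to 768 sizes below mirror the example in Figure 4 and are otherwise arbitrary:

```python
import torch
import torch.nn as nn

# The projector is a single fully connected layer; 256 -> 768 mirrors Figure 4.
projector = nn.Linear(256, 768)

image_patches = torch.randn(1, 196, 256)   # (batch, num_patches, image_embedding_dim)
visual_tokens = projector(image_patches)   # (batch, num_patches, text_embedding_dim)
print(visual_tokens.shape)                 # torch.Size([1, 196, 768])
```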
  • Step 3: Concatenation and Decoding

Once aligned, the image and text embeddings are concatenated and passed together into the LLM decoder. The model then reasons jointly over both modalities. Think of it as teaching a language model to “read images” by converting pictures into a form of visual tokens that it can process just like words.

Figure 5: Side-by-side image tokenization and text tokenization, where the role of the projector is to match the text token embedding dimensions.
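Here is a minimal sketch of the concatenation idea, using a generic transformer layer as a stand-in for the LLM decoder (a real decoder would also apply causal masking and predict next tokens):

```python
import torch
import torch.nn as nn

embed_dim, vocab_size = 768, 32000
token_embedding = nn.Embedding(vocab_size, embed_dim)  # stand-in for the LLM's token embedding table
decoder_block = nn.TransformerEncoderLayer(embed_dim, nhead=12, batch_first=True)  # stand-in for a decoder layer

visual_tokens = torch.randn(1, 196, embed_dim)                        # projected image patches from the previous step
text_tokens = token_embedding(torch.randint(0, vocab_size, (1, 12)))  # embedded prompt tokens

# The core of Method A: image and text embeddings are simply concatenated
# along the sequence dimension and processed jointly by the decoder.
sequence = torch.cat([visual_tokens, text_tokens], dim=1)             # (1, 196 + 12, 768)
hidden = decoder_block(sequence)
print(hidden.shape)                                                   # torch.Size([1, 208, 768])
```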

This approach is used in systems like LLaVA, OpenFlamingo, and Fuyu, with Fuyu being notable for simplifying the architecture by learning its own patch embeddings instead of relying on an external vision encoder.

Method B: Cross-Modality Attention Architecture

The second major design pattern integrates the two modalities via cross-attention rather than direct concatenation.

Here, the image encoder and text model remain largely independent, but inside the transformer’s attention layers, queries from text tokens attend to keys and values from image embeddings.

If you recall the original Attention Is All You Need architecture [4], this is akin to how the decoder attends to encoder outputs during translation.

Figure 6: Cross-attention mechanism used in the original transformer architecture [4]
Figure 7: Regular self-attention mechanism, only one input [2]
Figure 8: Cross-attention, where there can be two different inputs x1 and x2 [2]

In the original Transformer from Attention Is All You Need [4], x1 is the input sequence for the decoder, and x2 is the sequence produced by the encoder. In a multimodal LLM, x2 instead comes from an image encoder.
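Conceptually, the fusion step can be sketched with a single cross-attention layer, where the text stream supplies the queries and the image stream supplies the keys and values (shapes are illustrative):

```python
import torch
import torch.nn as nn

embed_dim = 768
cross_attention = nn.MultiheadAttention(embed_dim, num_heads=12, batch_first=True)

text_hidden = torch.randn(1, 12, embed_dim)    # x1: text tokens inside the decoder
image_hidden = torch.randn(1, 196, embed_dim)  # x2: outputs of the image encoder

# Queries come from the text stream; keys and values come from the image stream,
# so each text token decides which image patches to pull information from.
fused, weights = cross_attention(query=text_hidden, key=image_hidden, value=image_hidden)
print(fused.shape)    # torch.Size([1, 12, 768])  text enriched with visual context
print(weights.shape)  # torch.Size([1, 12, 196])  each text token's attention over the 196 patches
```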

Method B offers several advantages:

  • Efficiency: Image embeddings are introduced only where needed, reducing the input context length.
  • Modularity: The base LLM can retain its text-only performance while still learning to “see.”
  • Flexibility: Easier fine-tuning across specialized domains (e.g., charts vs. photos).

This architecture is adopted by models like Gemini 1.5 Pro, GPT-4V, and NVLM (NVIDIA), all leveraging attention-based fusion between vision and language streams.

4. Training Multimodal LLMs

Training such models typically proceeds in two phases, similar to modern text-only LLMs:

Phase 1: Pretraining

  • Start from a pretrained LLM and a pretrained image encoder (e.g., CLIP).
  • Keep both frozen; train only a small projector network that aligns image features to text embeddings (sketched in code after this list).
  • This phase ensures the model learns a shared representation space.
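
A minimal sketch of this recipe, with toy placeholder modules standing in for the pretrained encoder and LLM (a real setup would load CLIP/ViT and an actual LLM): both backbones are frozen and only the projector is handed to the optimizer.

```python
import torch
import torch.nn as nn

# Toy stand-ins; a real setup would load a pretrained CLIP/ViT encoder and a pretrained LLM.
vision_encoder = nn.Linear(1024, 256)  # stand-in for the pretrained image encoder
llm = nn.TransformerEncoderLayer(768, nhead=12, batch_first=True)  # stand-in for the LLM
projector = nn.Linear(256, 768)        # the only component trained in this phase

# Freeze both pretrained backbones; gradients flow only through the projector.
for module in (vision_encoder, llm):
    for param in module.parameters():
        param.requires_grad = False

optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)
print(sum(p.numel() for p in projector.parameters()))  # a few hundred thousand weights vs. billions in the LLM
```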

Phase 2: Instruction Fine-Tuning

  • Unfreeze parts of the LLM (and possibly cross-attention layers), as sketched after this list.
  • Train on multimodal instruction datasets, such as visual QA, captioning, and OCR + reasoning pairs.
  • The model learns to follow natural language prompts involving both text and image references.
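
Continuing the same hypothetical setup, Phase 2 mostly comes down to flipping `requires_grad` back on for the parts you want to adapt; the module names below are illustrative.

```python
import torch
import torch.nn as nn

# Hypothetical wrapper mirroring the Phase 1 components (names are illustrative).
model = nn.ModuleDict({
    "vision_encoder": nn.Linear(1024, 256),                              # stays frozen
    "projector": nn.Linear(256, 768),                                    # keeps training
    "llm": nn.TransformerEncoderLayer(768, nhead=12, batch_first=True),  # now unfrozen
})

# Phase 2: unfreeze the LLM (and any cross-attention layers) alongside the
# projector, then train on multimodal instruction data at a lower learning rate.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(("projector", "llm"))

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=2e-5
)
```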

Some models also apply RLHF (Reinforcement Learning from Human Feedback) or contrastive alignment to reduce hallucinations and improve factual grounding, which is still a persistent challenge in multimodal reasoning.

Figure 9: Aligning models and reducing hallucinations is an important, ongoing effort for the Gemini team [5]

Why Multimodality Matters Now

In enterprise workflows, visual information dominates: invoices, receipts, contracts, dashboards, medical scans, diagrams, and marketing creatives all combine visual and textual elements.

Until recently, these required separate pipelines: OCR systems for reading, LLMs for reasoning. Now, multimodal LLMs unify this process, enabling end-to-end automation.

Key benefits include:

  • Higher automation: No need for handcrafted OCR rules.
  • Contextual reasoning: Understands layout, text, and semantics together.
  • Reduced latency: One model replaces multiple specialized components.
  • Better user experience: Natural interaction like “Upload your document and ask what it means.”

5. MDP AI Expense Portal

To illustrate these ideas in action, consider our MDP AI Expense Portal, a computer-vision-powered expense management system designed for enterprises.

  • Employees upload receipts or invoices as images.
  • The multimodal AI model automatically reads vendor, date, total, VAT, and item details, even from low-quality or handwritten receipts.
  • Extracted data populates digital expense forms instantly.
  • Managers review and approve requests through an integrated dashboard.

This system blends Vision Transformers, OCR fine-tuning, and instruction-aligned LLM reasoning, turning unstructured visual data into structured financial records.
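
As an illustration of how such a pipeline can prompt a multimodal model for structured output, here is a simplified sketch using an open-source checkpoint. This is not the portal's production code; the model ID, prompt, file name, and field names are illustrative assumptions.

```python
import json
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Illustrative open checkpoint and prompt; the portal's production models differ.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

receipt = Image.open("receipt.jpg")  # hypothetical uploaded image
prompt = (
    "USER: <image>\nExtract the vendor, date, total, VAT, and line items from "
    "this receipt. Answer with JSON only.\nASSISTANT:"
)

inputs = processor(text=prompt, images=receipt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=256)
answer = processor.decode(output[0], skip_special_tokens=True)

# The decoded string still contains the prompt; keep only the model's reply and
# parse it. Production code would validate the fields and handle parse failures.
fields = json.loads(answer.split("ASSISTANT:")[-1].strip())
print(fields.get("vendor"), fields.get("total"))
```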

The result:

  • Reduction in manual data entry,
  • Error rates significantly lower than traditional OCR pipelines,
  • and a faster, more transparent expense approval workflow.

In short, it’s a real-world embodiment of multimodal LLM technology, transforming document understanding into operational efficiency.

6. Looking Ahead

The trajectory of multimodal LLM research is accelerating rapidly. Recent models such as Gemini 2.5 show that image-language fusion is becoming more efficient, data-driven, and instruction-tuned.

As models learn to integrate vision, text, and sound more naturally, we edge closer to generalist AI systems capable of perceiving and reasoning about the world holistically.

  • For enterprises, that means smarter automation.
  • For developers, it means unified APIs across modalities.
  • And for users, it means AI that can finally “understand” both what we say and what we show.

If you’d like to see how multimodal LLMs can reshape your business operations, the MDP AI Expense Portal is a good place to start: it shows how seeing and reading together can turn piles of receipts into structured, actionable insight in seconds.

Let's talk about your AI journey and transform your business with multimodal LLMs: https://mdpgroup.com/en/contact-us/

References

[1] https://arxiv.org/pdf/2302.14045

[2] https://sebastianraschka.com/blog/2024/understanding-multimodal-llms.html

[3] https://arxiv.org/abs/2010.11929

[4] https://arxiv.org/abs/1706.03762

[5] https://x.com/zswitten/status/1948483504481964382

