
Designing a Multilingual Semantic Search Architecture

Most enterprise search systems still rely on keyword matching, a method that often fails to capture intent, especially in multilingual environments. At MDP Group, we built a semantic retrieval architecture that understands meaning, not just words, enabling smarter, faster, and more accurate search experiences across domains. In this article, we explore the limitations of keyword search, introduce our three-stage semantic architecture, and share key insights from real-world experiments.

From Keyword Search to Semantic Understanding

Most enterprise search systems start with a textbox. You type in a few words, and the system returns items containing those words. This simple model, keyword-based search, has powered intranets, document management systems, and knowledge bases for decades. But in 2025, it’s simply not enough.

Keyword search works on the surface: it matches strings, not meaning. Especially in multilingual environments, or in internal corpora where people use varied terminology, keywords fail to capture intent. Take these examples:
  • “Sharing new product ideas” and “submitting a new solution proposal” mean nearly the same thing — but share zero words.
  • “Vendor evaluation” and “supplier comparison” belong in the same workflow, but won’t match unless you write them identically.
The result? Users either:
  • Miss critical information entirely,
  • Submit duplicate content unknowingly,
  • Or give up altogether.
This is where semantic retrieval becomes essential: not matching what was said, but what was meant.

What Is Semantic Retrieval?

Semantic retrieval doesn’t just look at words; it looks at meaning. It converts both queries and documents into dense vector representations using modern language models. These vectors capture the intent behind the words, making it possible to find similar ideas even if they’re phrased differently, or written in different languages.

At MDP Group, we’ve applied this idea in practice: we built a modular architecture with three key components to handle this retrieval pipeline with flexibility and speed.
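To make this concrete, here is a minimal sketch (assuming the sentence-transformers library and an illustrative multilingual model, not our production setup) showing that the earlier example phrasings, which share no words, still land close together in embedding space:

```python
# A minimal illustration of dense-vector similarity; the model name is an
# assumption, any multilingual sentence-embedding model behaves similarly.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

a = model.encode("Sharing new product ideas", convert_to_tensor=True)
b = model.encode("Submitting a new solution proposal", convert_to_tensor=True)
c = model.encode("Quarterly payroll report", convert_to_tensor=True)

# Cosine similarity: the paraphrase pair scores much higher than the unrelated
# pair, even though the paraphrases share no keywords.
print(util.cos_sim(a, b).item())  # relatively high
print(util.cos_sim(a, c).item())  # low
```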

Understanding the Components of Semantic Retrieval

Before diving into the full architecture, it’s important to understand the role of each core component in the proposed semantic retrieval pipeline:
  • Retriever

    Responsible for quickly identifying semantically relevant candidates from a large corpus using dense vector representations.

How Does the Retriever Work?

  • Query Encoding: The user’s input is passed through a multilingual sentence embedding model, which transforms the text into a high-dimensional vector that captures its semantic meaning.
  • Vector Index Lookup: The system uses a vector search engine, such as Pinecone or FAISS, to compare the query vector against a precomputed index of document vectors. These vectors were previously generated from the corpus during the indexing phase.
  • Similarity Scoring: Each item in the index is scored based on vector similarity (typically cosine similarity or inner product) relative to the query.
  • Top-k Retrieval: The most semantically similar items are returned as initial candidates for further processing.
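Put together, a minimal retriever sketch might look like the following (the model name, the toy corpus, and the in-memory FAISS index standing in for a managed vector store are all illustrative assumptions):

```python
# Minimal retriever sketch: encode a corpus, index it, then answer queries.
# The embedding model and toy corpus are illustrative assumptions; a managed
# vector database such as Pinecone would replace the in-memory FAISS index.
import faiss
from sentence_transformers import SentenceTransformer

corpus = [
    "Sharing new product ideas",
    "Submitting a new solution proposal",
    "Vendor evaluation checklist",
    "Supplier comparison template",
]

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Indexing phase: encode every document once and store normalized vectors,
# so that inner product equals cosine similarity.
doc_vectors = model.encode(corpus, normalize_embeddings=True)
index = faiss.IndexFlatIP(doc_vectors.shape[1])
index.add(doc_vectors)

# Query phase: encode the query and retrieve the top-k nearest neighbors.
query_vector = model.encode(["How do I propose a new solution?"], normalize_embeddings=True)
scores, ids = index.search(query_vector, 3)

for score, doc_id in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {corpus[doc_id]}")
```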
  • Reranker

    Many candidates returned by the retriever are only loosely related to the query. The reranker ensures that the top-ranked results are those most relevant to the specific query context, especially in use cases involving nuance or domain-specific language.

How Does the Reranker Work?

  • Pair Construction: Each of the top-k candidates retrieved in the first stage is paired with the original query, forming input pairs like (query, candidate).
  • Contextual Scoring: These pairs are passed through a reranking model (e.g., MiniLM, BERT-based cross-encoder) that evaluates their semantic and contextual alignment.
  • Score Assignment: Each candidate is assigned a new relevance score based on this model's output, typically a scalar value between 0 and 1.
  • Reordering: The candidate list is reordered according to these scores, with the most contextually relevant result promoted to the top.
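A minimal reranking sketch with a cross-encoder could look like this (the model name is an assumption, and this particular checkpoint is English-only; in practice the candidates would come from the retriever stage and a multilingual cross-encoder would be used):

```python
# Minimal reranker sketch using a cross-encoder; the model name and the
# hard-coded candidate list are illustrative assumptions.
from sentence_transformers import CrossEncoder

query = "submitting a new solution proposal"
candidates = [
    "Sharing new product ideas",
    "Vendor evaluation checklist",
    "How to submit a proposal for a new internal tool",
]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Pair construction: every candidate is scored jointly with the query.
pairs = [(query, candidate) for candidate in candidates]
scores = reranker.predict(pairs)

# Reordering: sort candidates by their new contextual relevance score.
reranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
for candidate, score in reranked:
    print(f"{score:.3f}  {candidate}")
```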
  • Judge LLM

    Its role is to verify the semantic equivalence between the query and the top-ranked result, ensuring that the match goes beyond surface-level similarity and aligns meaningfully with the user’s intent, even if expressed differently or in another language.
This stage is especially important in high-stakes or precision-critical use cases such as deduplication, legal or support document retrieval, and decision-making systems.

How Does the Judge LLM Work?

  • Input Preparation: The top result from the reranker (or directly from the retriever if reranking is skipped) is paired with the original query.
  • LLM Evaluation: This pair is sent to a large language model fine-tuned or prompted for semantic judgment tasks. The model evaluates not just literal meaning, but also context, intent, and domain nuance.
  • Decision Output: The model returns a discrete label or boolean decision, for example:
  • "Equivalent"
  • "Related but not equivalent"
  • "Not relevant"
Each layer adds precision and depth to the retrieval process, enabling robust performance across multilingual and high-variance enterprise data.

Three-Stage Architecture

To balance performance, accuracy, and adaptability, we use a three-stage pipeline for semantic matching:

1. Retriever: Embedding-Based Nearest Neighbors

The user’s input is encoded using a multilingual sentence embedding model, such as:
  • all-mpnet-base-v2
  • multilingual-e5-large
  • jina-embeddings-v3
We search for the top 10 semantically closest candidates using Pinecone. This stage is fast, typically returning results in milliseconds.
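As a rough sketch (the index name, API key handling, and metadata fields are illustrative assumptions, not our production configuration), the Pinecone lookup looks roughly like this:

```python
# Hypothetical Pinecone lookup; index name and metadata fields are assumptions.
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("semantic-search-demo")  # assumes document vectors were upserted earlier

model = SentenceTransformer("intfloat/multilingual-e5-large")

# E5-style models expect a "query: " prefix at query time.
query_vector = model.encode("query: submitting a new solution proposal").tolist()

# Ask Pinecone for the 10 nearest document vectors.
results = index.query(vector=query_vector, top_k=10, include_metadata=True)

for match in results.matches:
    print(f"{match.score:.3f}  {match.metadata.get('text', '')}")
```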

2. Reranker: Contextual Fine-Grained Scoring

The initial 10 candidates are then re-scored using a lightweight reranker, which is typically a small LLM trained to assess sentence-pair relevance. This reranker helps filter out surface-level matches and boosts truly relevant suggestions to the top.

3. Judge LLM: Semantic Equivalence Check

Finally, the top-ranked candidate is passed to a Judge LLM that returns a binary decision: "Is this semantically equivalent, categorically related, or proposing a similar solution?" This step provides an additional layer of trust. Especially when suggestions or offers are slightly paraphrased, this model decides if they're meaningfully the same.
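Chaining the three stages together, a compact end-to-end sketch could look like this (an in-memory FAISS index stands in for Pinecone, and the model names, toy corpus, and judge prompt are illustrative assumptions):

```python
# End-to-end sketch of the three-stage pipeline: retrieve -> rerank -> judge.
# Model names, the in-memory FAISS index (standing in for Pinecone), and the
# judge prompt are illustrative assumptions.
import faiss
from openai import OpenAI
from sentence_transformers import CrossEncoder, SentenceTransformer

corpus = [
    "Sharing new product ideas",
    "Submitting a new solution proposal",
    "Vendor evaluation checklist",
    "Supplier comparison template",
]

embedder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
client = OpenAI()

doc_vectors = embedder.encode(corpus, normalize_embeddings=True)
index = faiss.IndexFlatIP(doc_vectors.shape[1])
index.add(doc_vectors)

def search(query: str, k: int = 10) -> tuple[str, str]:
    # Stage 1 - Retriever: dense top-k lookup.
    query_vector = embedder.encode([query], normalize_embeddings=True)
    _, ids = index.search(query_vector, min(k, len(corpus)))
    candidates = [corpus[i] for i in ids[0]]

    # Stage 2 - Reranker: cross-encoder scoring of (query, candidate) pairs.
    scores = reranker.predict([(query, c) for c in candidates])
    best = max(zip(candidates, scores), key=lambda item: item[1])[0]

    # Stage 3 - Judge LLM: verdict on the single top-ranked candidate.
    prompt = (
        "Answer 'yes' or 'no': is the candidate semantically equivalent to, "
        "categorically related to, or proposing a similar solution as the query?\n\n"
        f"Query: {query}\nCandidate: {best}"
    )
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    ).choices[0].message.content.strip()
    return best, verdict

print(search("I want to propose a new solution"))
```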

Experiments

We tested multiple versions of this pipeline on a 16 GB VRAM GPU, using both local models and external APIs. The results show interesting trade-offs:

Configuration                          Top-k   Latency (s)   Notes
Embedder + Reranker + Local Judge      10      17.18         Fully local
Embedder + Reranker + API Judge        10      2.79          API-based verification
Embedder + API Judge (No Reranker)     10      1.78          Fastest configuration
API Judge, Top-k = 5                   5       2.97          Minimal gain from k drop
API Judge, Top-k = 3                   3       2.92          Similar to k=5

Key Findings

  • Top-k has little impact on latency; most of the processing time is dominated by the Judge step.
  • Moving the Judge model to an API significantly reduces latency (roughly a 6x improvement).
  • Removing the reranker speeds things up even further, which is useful for real-time use cases.
  • For highly sensitive scenarios, the local pipeline may be acceptable despite slower speed.

Conclusion

  • Keyword search is no longer sufficient: it matches exact words rather than meaning, making it ineffective in multilingual or semantically diverse environments.
  • Semantic retrieval addresses this gap: It captures the intent behind different phrasings, enabling more accurate and flexible information discovery.
  • Our three-stage architecture balances speed and accuracy:
    • Retriever: Finds top semantic matches quickly using vector search (e.g., FAISS or Pinecone).
    • Reranker: Reorders the results for fine-grained relevance using a lightweight model.
    • Judge LLM: Verifies semantic equivalence, adding an extra layer of trust.
  • Key findings from experiments:
    • The Judge LLM dominates latency — it is the slowest step in the pipeline.
    • Using the Judge via API significantly improves speed (up to 6× faster than local).
    • Removing the reranker further reduces latency, with acceptable trade-offs in accuracy.
    • Changing top-k (e.g., 3, 5, or 10) has minimal effect on latency; the Judge cost remains the bottleneck.
  • Practical recommendations:
    • For real-time systems: Use an embedder + Judge API without reranker for fastest results.
    • For high-stakes or privacy-sensitive scenarios: Run all components locally to maximize control and reliability.
    • For resource-constrained environments: The fastest viable option is embedder + Judge via API, offering good balance between speed and accuracy.

Final Thoughts

Semantic retrieval goes beyond matching words: it captures meaning. In today’s multilingual and dynamic environments, this shift is essential. By focusing on intent, we unlock faster, smarter, and more reliable search experiences.

At MDP Group, we see semantic search as a cornerstone of intelligent systems. Through our AI consulting services, we help you design and implement advanced, scalable, and business-ready AI solutions tailored to your unique needs.
