Modern enterprise AI systems rely on one foundational layer: embeddings.
If embeddings fail to represent meaning accurately, especially in noisy, domain-specific business language, every downstream system weakens: retrieval, similarity, and interpretation.
For English, this foundational layer is mature; however, for Turkish, it is still catching up.
Although multilingual models have improved substantially, high-quality Turkish sentence embeddings tailored for enterprise use cases remain limited. Research consistently shows that Turkish’s rich agglutinative morphology and highly productive derivational structure pose challenges for current NLP architectures [1, 2]. Errors in tokenization and semantic drift frequently occur when models trained mostly on English or multilingual corpora attempt to encode Turkish business text.
Given production data, we asked: Which existing embedding model performs best on real Turkish enterprise data?
To answer this, we benchmarked 15 state-of-the-art embedding models on a small but highly representative dataset for semantic textual similarity.
This experiment produced a clear internal winner. But more importantly, it revealed a deeper insight:
The Turkish NLP ecosystem urgently needs a strong, general-purpose embedding model that is linguistically robust, domain-aware, and production-ready.
We selected a wide variety of embedding architectures: multilingual, Turkish-specific, compact, large, dense, and hybrid.
We avoided generic STS datasets and used text that directly reflects MDP’s production workloads.
This dataset captures the messy linguistic reality of Turkish enterprise environments, precisely where multilingual models struggle most.
We measured:
Our goal was not just accuracy, but also:
“How predictable and reliable is this model at real production thresholds?”
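To make the role of a production threshold concrete, here is a minimal sketch of how a similarity cutoff turns an embedding comparison into a match or no-match decision. It assumes cosine similarity as the scoring function. The vectors are toy three-dimensional values and the function names are hypothetical, chosen for illustration only; this is not the evaluation code used in the benchmark.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def is_match(a, b, threshold):
    """Declare two texts semantically equivalent if their similarity reaches the threshold."""
    return cosine_similarity(a, b) >= threshold


# Toy vectors standing in for real sentence embeddings (illustrative assumption).
query = [1.0, 0.0, 0.0]
near = [0.9, 0.1, 0.0]   # points in almost the same direction as the query
far = [0.0, 1.0, 0.0]    # orthogonal to the query

print(is_match(query, near, threshold=0.90))
print(is_match(query, far, threshold=0.90))
```

With real embeddings the scores cluster much more tightly than in this toy case, which is why small changes in the threshold can shift the balance between false positives and missed matches.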
Across nearly all tasks, multilingual-e5-large demonstrated the most stable and reliable performance.
Its strengths aligned closely with TR-MTEB findings, where E5 variants also led the benchmark with the highest Mean (Task) and Mean (Type) scores [6].
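| Threshold | TP | FP | Precision | Recall | F1 |
|-----------|----|----|-----------|--------|-------|
| 0.89 | 19 | 37 | 0.339 | 0.9048 | 0.493 |
| 0.90 | 16 | 3 | 0.842 | 0.7619 | 0.800 |
| 0.91 | 15 | 2 | 0.882 | 0.7143 | 0.789 |
| 0.92 | 13 | 0 | 1.000 | 0.6190 | 0.764 |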
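The threshold results above can be re-derived from the TP and FP counts alone. The sketch below assumes 21 positive pairs in total. That figure is not stated in the text; it is inferred from the reported recall values (for example, 19 / 0.9048 ≈ 21). The function name is hypothetical.

```python
def precision_recall_f1(tp, fp, total_positives):
    """Compute precision, recall, and F1 from true/false positive counts."""
    fn = total_positives - tp  # positives the model missed at this threshold
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Assumption: 21 positive pairs, inferred from the reported recall values.
TOTAL_POSITIVES = 21

# (threshold, TP, FP) as reported for each threshold.
rows = [(0.89, 19, 37), (0.90, 16, 3), (0.91, 15, 2), (0.92, 13, 0)]

for threshold, tp, fp in rows:
    p, r, f1 = precision_recall_f1(tp, fp, TOTAL_POSITIVES)
    print(f"{threshold:.2f}  precision={p:.3f}  recall={r:.4f}  f1={f1:.3f}")
```

The printed values agree with the reported ones up to rounding in the last digit. The jump in false positives from 3 to 37 when the threshold drops from 0.90 to 0.89 is what drives the fall in precision and F1 at that setting.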
Our internal results align with the broader academic landscape.
The TR-MTEB benchmark, introduced in 2025, is the first large-scale, task-diverse evaluation suite for Turkish embeddings [8]. It includes:
These models represent important progress but remain research-grade, and they have not yet been validated across noisy enterprise domains such as finance, logistics, customer support, or procurement.
Research on Turkish morphology shows persistent issues around:
Combined with real enterprise text issues:
Even strong multilingual models lose nuance and mis-rank similarities.
In practice, this leads to:
Thus, even though E5-large is a strong baseline, there is still both room and need for:
A monolingual Turkish embedder is not just a linguistic improvement. It’s also lighter, faster, and cheaper to deploy at scale.
Our benchmark confirmed an important truth:
As MDP continues to expand its AI product ecosystem, building a high-quality Turkish embedder becomes a strategic foundation, not just for our own systems but for any serious enterprise operating in Türkiye’s data landscape.
[1] Çöltekin, Ç. (2014, May). A set of open-source tools for Turkish natural language processing. In LREC (pp. 1079-1086).
[2] Tohma, K., & Kutlu, Y. (2020). Challenges encountered in Turkish natural language processing studies. Natural and Engineering Sciences, 5(3), 204-211.
[3] Wang, L., Yang, N., Huang, X., Yang, L., Majumder, R., & Wei, F. (2024). Multilingual E5 Text Embeddings: A Technical Report. arXiv preprint arXiv:2402.05672.
[4] Zhang, X., Zhang, Y., Long, D., Xie, W., Dai, Z., Tang, J., … Others. (2024). mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, 1393–1412.
[5] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv preprint arXiv:1908.10084.
[6] Baysan, M. S., & Gungor, T. (2025, November). TR-MTEB: A Comprehensive Benchmark and Embedding Model Suite for Turkish Sentence Representations. In C. Christodoulopoulos, T. Chakraborty, C. Rose, & V. Peng (Eds), Findings of the Association for Computational Linguistics: EMNLP 2025 (pp. 8867–8887). doi: 10.18653/v1/2025.findings-emnlp.471
[7] Kesgin, H. T., Yuce, M. K., & Amasyali, M. F. (2023). Developing and Evaluating Tiny to Medium-Sized Turkish BERT Models. arXiv preprint arXiv:2307.14134.
[8] https://huggingface.co/trmteb/turkish-embedding-model
[9] Ezerceli, Ö., Gümüşçekiçci, G., Erkoç, T., & Özenç, B. (2025, September). TurkEmbed: Turkish Embedding Model on Natural Language Inference & Sentence Text Similarity Tasks. 2025 Innovations in Intelligent Systems and Applications Conference (ASYU), 1–6. doi:10.1109/asyu67174.2025.11208511
[10] Ezerceli, Ö., Hussieni, M. E., Taş, S., Bayraktar, R., Terzioğlu, F. B., Çelebi, Y., & Asker, Y. (2025). TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval. arXiv [Cs.CL]. Retrieved from http://arxiv.org/abs/2511.16528