How to Build High-Performance LLM Systems

If an LLM is accurate but slow, users abandon it.  

“Fast” is not a single number. It’s the result of choices across model size, serving stack, batching, caching, memory layout, and even prompt design.  

In this guide we design for fast LLM interaction using a chatbot as the running example.  

1) The Interaction Timeline: A Mental Model and Key Metrics 

When a user sends a message to a chatbot, the request flows through several stages before the response appears on the screen:  

client → network → prompt assembly → (prefill) → first token → (decode) → stream → post-process   

Two of these stages matter most for perceived speed: prefill and decode.

  • Prefill stage: The model processes the entire input prompt, encodes it, and fills the KV (key–value) cache. This step directly influences how long it takes for the first token to appear.  
  • Decode stage: Once the KV cache is ready, the model generates tokens one by one, attending to the cached values. This step determines how smooth and continuous the streaming experience feels.  

To understand and optimize chatbot responsiveness, we need to measure specific timepoints in this pipeline:  

  • TTFT (Time To First Token): The elapsed time from the request start until the very first token is emitted.  

Intuition: “How long am I staring at a blank screen?”  

  • TPOT (Time Per Output Token): The average time it takes to produce each subsequent token after the first one, often reported as its inverse in tokens per second (TPS = 1 / TPOT).

Intuition: “Does the response stream feel natural and fluent?”  

  • Token Generation Time: The total time to generate all tokens after the first one:  

t_generation = num_output_tokens * TPOT 

  • E2E Latency (End-to-End Latency): The full user-perceived latency, from request start to the last token:  

E2E = TTFT + t_generation 
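
As a concrete illustration, here is a minimal Python sketch for capturing TTFT, TPOT, and E2E while consuming a streamed response; stream_tokens is a hypothetical stand-in for whatever token iterator your client exposes:

import time

def measure_stream(stream_tokens):
    # stream_tokens: any iterator that yields output tokens as they arrive (hypothetical).
    # Assumes at least one token is produced.
    t_start = time.perf_counter()
    t_first = None
    num_tokens = 0
    for _ in stream_tokens:
        num_tokens += 1
        if t_first is None:
            t_first = time.perf_counter()      # first token observed -> TTFT
    t_last = time.perf_counter()
    ttft = t_first - t_start
    tpot = (t_last - t_first) / max(num_tokens - 1, 1)
    e2e = t_last - t_start
    return ttft, tpot, e2e

Keep in mind that stream chunks are not always single tokens (see Section 7), so measure against real token boundaries where possible.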

For short, conversational responses (≈50–150 tokens), some rules of thumb apply:  

  • TTFT < 200 ms feels “snappy.”  
  • 3–4 tokens/s matches human reading pace.  
  • <2 tokens/s feels sluggish and unnatural.  
  • P99 E2E latency under ~2 s ensures even the slowest cases remain acceptable.  

2) Percentiles, Queues, and the Pain of the Tail  

Looking only at averages hides the true user experience. In production, what matters most is not the typical response but the worst-case responses that users still encounter. This is why latency percentiles are critical:  

  • P50 (median): what half of users experience.  
  • P95: the slower 5% of requests.  
  • P99: the worst 1%, often the difference between a “usable” system and one that feels broken.  

When traffic is bursty or batching is too aggressive, requests can queue up. These queues inflate the time-to-first-token and total latency for unlucky users. Even if your P50 looks great, the P99 can silently be disastrous.   

Although real serving systems are more complex than textbook formulas, basic queueing theory provides useful intuition.  

  • Let λ = request arrival rate (requests per second).  
  • Let μ = service rate (how many requests per second the system can process).  
  • utilization (ρ) = λ / μ 

As utilization ρ approaches 1, expected waiting time grows sharply. Even a small increase in traffic when the system is already near saturation can cause P99 latency to explode, while averages remain stable.  
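
A toy M/M/1 calculation makes that intuition concrete; this is a deliberate simplification, since real serving systems batch and parallelize, so treat the numbers as shape rather than prediction:

def expected_latency(arrival_rate, service_rate):
    # Expected time in system (waiting + service) for an M/M/1 queue.
    rho = arrival_rate / service_rate
    if rho >= 1:
        return float("inf")            # unstable: the queue grows without bound
    return 1.0 / (service_rate - arrival_rate)

for lam in (5, 8, 9, 9.5, 9.9):        # requests/s against a service rate of 10 requests/s
    print(f"rho={lam / 10:.2f}  latency={expected_latency(lam, 10):.2f} s")
# Latency doubles between rho=0.90 and rho=0.95, and keeps exploding as rho approaches 1.

The absolute values mean little for a batched GPU server, but the hockey-stick shape near saturation is exactly what P99 dashboards show under overload.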

To deliver consistently fast chatbot responses, you must optimize not only raw throughput but also tail behavior:  

  • Batching window tuning: shorter windows lower TTFT but reduce throughput; longer windows improve throughput but worsen tails.  
  • Admission control: reject or shed excess load early to keep utilization ρ comfortably below 1.  

In other words, system design must aim for predictable latency under load, not just impressive throughput numbers on paper. 

3) Always Report Useful Output  

When evaluating the performance of an LLM system, it’s tempting to talk only about throughput. But not all throughput is created equal.  

Let’s break it down.  

3.1) Requests vs. Tokens  

  • RPS (Requests Per Second): How many complete user requests the system can serve per second. This is a user-level measure.  
  • TPS (Tokens Per Second): How many tokens the model processes or generates per second. This is a model-level measure, and it comes in two flavors:  
      • Input TPS (Prefill): the rate at which input tokens are encoded and written into the KV cache.  
      • Output TPS (Decode): the rate at which output tokens are generated and streamed back to the user.  

Both matter. A system with high TPS but low RPS is good at long documents but poor at handling many concurrent chats. The reverse may hold for short, bursty workloads.  

3.2) Throughput vs. Goodput 

Raw throughput ignores failures and timeouts. What users actually perceive is the number of successful tokens delivered per second.  

goodput_TPS = TPS * (1 - failure_rate) 

Example:  

  • Raw throughput: 1000 TPS.  
  • 20% of requests timeout.  
  • Goodput = 1000 × (1 – 0.2) = 800 TPS.  

Even if your dashboard shows 1000 TPS, users only experience 800 TPS worth of useful work.  

Always report goodput, not just throughput. Successful user experience depends on the tokens that arrive correctly and on time, not the theoretical maximum the system could produce.  

4) Where Time (and Memory) Goes: Bottlenecks  

Even when the model is well-optimized, there are two main areas where latency and resource usage accumulate: the prefill vs. decode cost and the KV cache memory wall.  

4.1) Prefill vs. Decode Cost  

Every request passes through two distinct computational stages:  

  • Prefill: The model reads the entire input prompt, encodes it, and writes the results into the KV (key–value) cache. This operation is expensive because its cost scales directly with the prompt length (S) and the overall model size (layers, heads, hidden dimensions). Long prompts can significantly increase TTFT.  
  • Decode: After the KV cache is filled, the model generates tokens one by one. Each new token reuses the cache, so decoding is cheaper than prefill. However, it still scales with sequence length because each new token must attend over the growing cache. This is why longer contexts slow down generation, even if per-token FLOPs seem small.  
  • Rule of thumb: Prefill dominates latency for long prompts (many input tokens), while decode dominates for long generations (many output tokens).  

4.2) The KV Cache “Memory Wall”  

Surprisingly, the biggest bottleneck in modern LLM serving is often memory, not compute. The KV cache, which stores the intermediate keys and values used in attention, grows linearly with both sequence length and batch size.  

The approximate memory cost is:  

  • Memory_KV ≈ (#layers) * (#heads) * (head_dim) * (seq_len) * (batch_size) * 2 * (bytes_per_element)  

The ×2 accounts for storing both keys (K) and values (V).  

  • bytes_per_element depends on precision: FP16 ≈ 2 bytes, FP8 ≈ 1 byte, INT4 ≈ 0.5 byte (if supported).  
  • Long prompts (large seq_len) and large batch sizes blow up KV memory usage.  
  • Models often “run out of VRAM” due to the cache before compute FLOPs are saturated.  
  • This limits the maximum batch size (hence throughput) and/or forces serving with shorter context windows.  
  • This “memory wall” is exactly why techniques like PagedAttention were developed: they restructure KV memory management to avoid fragmentation and let you scale batch sizes without hitting the wall prematurely.  
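
To make the formula above concrete, here is a small sizing sketch; the 32-layer, 32-head, head_dim-128 shape is an assumed Llama-7B-class configuration with an FP16 cache:

def kv_cache_gb(layers, heads, head_dim, seq_len, batch_size, bytes_per_element=2):
    # Keys and values are both stored, hence the factor of 2.
    return layers * heads * head_dim * seq_len * batch_size * 2 * bytes_per_element / 1e9

print(kv_cache_gb(32, 32, 128, seq_len=4096, batch_size=8))   # ≈ 17 GB for the cache alone

At 4k context and batch size 8, the cache alone can rival or exceed the FP16 weights of the model serving it, which is why the limits listed above bite so quickly.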

5) Optimizations That Actually Move the Needle  

Only a handful of techniques consistently improve LLM inference. These target memory usage, compute efficiency, and scheduling. 

5.1) Quantization 

Quantization reduces model weights (and sometimes activations) to lower-bit formats like INT8 or INT4 instead of full precision. This lowers VRAM use and memory bandwidth, speeding up inference since more data fits in cache. The cost is a possible accuracy drop: usually negligible for chat, but more noticeable in math or coding tasks. Methods like GPTQ, AWQ, or SpQR help balance this trade-off. INT8 is the stable, widely supported default, while INT4 offers larger efficiency gains but needs careful testing to avoid regressions. 
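
As a minimal sketch of weight-only quantization with the Hugging Face transformers + bitsandbytes stack (the model id is an assumed example, and exact arguments vary by library version):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                        # or load_in_8bit=True for the safer default
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",    # assumed model id
    quantization_config=bnb_config,
    device_map="auto",
)

Whichever method you pick (bitsandbytes, GPTQ, AWQ), re-run your own evaluation set afterwards; the accuracy cost is task-dependent.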

5.2) PagedAttention  

Traditional KV cache allocation reserves large contiguous memory chunks, which causes fragmentation, wasted VRAM, and limits batch sizes. PagedAttention solves this by managing the cache like virtual memory: it splits storage into fixed-size blocks and maps sequences dynamically. This approach reduces fragmentation, frees memory as soon as sequences end, and allows larger effective batch sizes with higher throughput, often improving tail latency. Block size matters. Smaller blocks reduce fragmentation but add overhead, while larger blocks lower overhead but risk fragmentation again. The best setting depends on model and hardware, so defaults are a good start before profiling. 
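
A minimal vLLM sketch; parameter names such as gpu_memory_utilization and block_size may differ slightly between versions, and the model id is an assumed example:

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed model id
    gpu_memory_utilization=0.90,                  # VRAM fraction shared by weights and KV blocks
    block_size=16,                                # KV cache block size in tokens
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain PagedAttention in one paragraph."], params)
print(outputs[0].outputs[0].text)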

5.3) FlashAttention  

FlashAttention is a fused GPU kernel that optimizes the attention operation by minimizing costly reads and writes to high-bandwidth memory (HBM). Instead, it leverages faster on-chip SRAM and registers, which significantly speeds up both prefill and decode, particularly in long context windows where memory traffic dominates. The performance boost varies by GPU architecture and kernel support, so results depend on the hardware. FlashAttention is most effective when combined with KV cache optimizations like PagedAttention, delivering both speed and memory efficiency.  
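
In recent transformers versions the kernel can be requested at load time; this sketch assumes the flash-attn package is installed and the GPU supports it (the model id is an assumed example):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",        # assumed model id
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",      # raises if the kernel is unavailable on this setup
    device_map="auto",
)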

5.4) Speculative Decoding  

Speculative decoding pairs a small “draft” model with a larger target model. The draft proposes multiple tokens in advance, and the larger model validates or rejects them. When the acceptance rate is high, this approach can speed up generation by 30–40%, shifting much of the work to the smaller model. However, its benefit drops in structured tasks like JSON or grammar-constrained outputs, where acceptance rates are lower. Effectiveness is workload-dependent, so it should always be benchmarked before adoption. 
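
One accessible form is assisted generation in transformers, where a small draft model proposes tokens for the target to verify; the model pair below is an assumed example, and the draft must share the target's tokenizer:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")   # assumed ids
target = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")

inputs = tok("Summarize speculative decoding in two sentences.", return_tensors="pt").to(target.device)
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))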

5.5) Prefix/Prompt Caching  

Prefix caching stores encoder states or KV blocks for repeated prompt segments, such as system messages or common RAG scaffolds. By reusing these cached components, it cuts prefill cost and speeds up requests with recurring patterns. Chatbots with fixed system instructions benefit the most. The main challenge is managing cache validity — even small prompt changes can break reuse. Reliable use requires careful versioning and strict memory budgeting to avoid wasted capacity.  
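
In vLLM, automatic prefix caching can be switched on at engine construction; the argument name may vary by version, and the model id is an assumed example:

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed model id
    enable_prefix_caching=True,                   # reuse KV blocks for identical prompt prefixes
)
system = "You are a helpful assistant answering with sources.\n\n"
params = SamplingParams(max_tokens=64)
# The shared system prefix is prefilled once, then reused across these requests.
llm.generate([system + "Question 1: ...", system + "Question 2: ..."], params)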

5.6) Dynamic Batching & Scheduling  

Dynamic batching groups incoming requests into a single execution window, balancing throughput and latency. Short windows (around 10 ms) minimize time-to-first-token (TTFT) but limit overall throughput, while longer windows (30 ms or more) improve throughput at the cost of higher tail latency. To keep performance predictable under heavy load, admission control is critical: instead of queuing endlessly and letting P99 latency spike, the system should reject excess requests early. This ensures stable service quality even when demand surges. 
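
A framework-agnostic sketch of the two ideas, a bounded admission queue plus a time-boxed batching loop; run_batch is a hypothetical stand-in for your actual inference call, and the thresholds are placeholders to tune against your SLOs:

import asyncio

MAX_QUEUE = 64            # admission-control threshold
MAX_BATCH = 32            # cap on requests per batch
BATCH_WINDOW_MS = 10      # shorter window: lower TTFT, less batching

queue: asyncio.Queue = asyncio.Queue()

async def admit(request):
    # Shed load early instead of letting a deep queue destroy P99 latency.
    if queue.qsize() >= MAX_QUEUE:
        raise RuntimeError("busy: reject or ask the client to retry later")
    await queue.put(request)

async def batch_loop(run_batch):
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]                        # block until at least one request
        deadline = loop.time() + BATCH_WINDOW_MS / 1000
        while len(batch) < MAX_BATCH:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        await run_batch(batch)                             # hand the batch to the model server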

The optimizations that matter most (quantization, PagedAttention, FlashAttention, speculative decoding, prefix caching, and dynamic batching) are all about removing real bottlenecks: they directly target memory pressure, compute utilization, and scheduling delays.  

6) Designing a Fast Chatbot Pipeline  

A chatbot is not just a model; it is a pipeline. Each stage adds its own latency, and optimizing the entire chain is the only way to achieve snappy, low-latency interactions. A simplified flow looks like this:  

User → Frontend (streaming UI) → Gateway (rate limiting, authentication) → Prompt Builder (system + history + user + tools) → [RAG, optional] Retriever → Reranker → Context Assembler → LLM Server (vLLM / TGI / llama.cpp / Ollama) → Post-process (validation, guardrails) → Stream to user  

6.1) Breaking Down the Pipeline  

  • Frontend (Streaming UI): The interface should render tokens as soon as they arrive. Streaming dramatically reduces perceived latency, even if total E2E time doesn’t change.  
  • Gateway: Handles authentication and rate limiting. Efficient gateways prevent overload without adding noticeable delay.  
  • Prompt Builder: Constructs the full prompt from system instructions, conversation history, user query, and tool outputs. Well-designed caching or summarization can cut this cost.  
  • Retriever & Reranker (Optional RAG): If the chatbot uses retrieval-augmented generation, documents are fetched and ranked before being injected into the prompt. Latency here depends on vector DB performance, cache hit rate, and reranking complexity.  
  • LLM Server: The inference engine (vLLM, TGI, llama.cpp, or Ollama) performs prefill and decode. This is usually the dominant contributor to latency.  
  • Post-Processing: Adds guardrails, validates JSON/structured output, or enforces formatting. Lightweight but non-negligible.  
  • Streaming Back: Tokens are streamed back to the frontend, where they are rendered progressively.  

6.2) Latency Budget Example  

For a short response (≈100 tokens), a realistic latency budget might look like this:  

  • Retrieval + rerank: 30–70 ms (optimize with local vector DBs, aggressive caching, and low top-k).  
  • Prompt assembly: 5–15 ms.  
  • TTFT (prefill + queuing + first token): 150–200 ms.  
  • Decode (≈100 tokens at ~3–8 ms per token): 300–800 ms.  
  • Post-processing: 5–20 ms.  

6.3) End-to-End Target  

  • P50 latency: ≈ 0.7–1.5 seconds.  
  • P99 latency: under 2–3 seconds, even under load.  

This budget ensures that the chatbot feels instant and conversational for the majority of users, while keeping tail latency under control.  

7) Measurement That Matters 

It’s impossible to optimize what you don’t measure. A fast chatbot requires careful tracking of both user-perceived metrics (what the user feels) and server-internal metrics (what actually happens inside the system).  

7.1) Core Latency Metrics  

TTFT (Time to First Token): Time from request start until the first token is delivered.  

  • TTFT = t_first_token - t_request_start 
  • User intuition: “How long until I see something on screen?”  

TPOT (Time per Output Token): Average time per generated token after the first token arrives.  

  • TPOT = (t_last_token - t_first_token) / num_output_tokens 
  • Reported as tokens per second (TPS = 1 / TPOT).  

E2E (End-to-End Latency): Full time from request start to last token.  

  • E2E = t_last_token - t_request_start 

Goodput: The throughput that actually succeeds (ignores failures/timeouts).  

  • Goodput = successful_tokens / time 

Additional System Metrics  

  • Percentiles: Always compute P50, P95, and P99 for TTFT and E2E. Averages alone mask tail latency issues.  
  • Cache hit ratio: For prefix caches and retrieval caches, higher hit rates reduce prefill cost.  
  • Queue times vs. service times: Expose queue delays separately from processing delays.  
  • GPU memory usage: Track both max_memory_allocated and max_memory_reserved to diagnose memory fragmentation and over-provisioning.  
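
As a concrete example of the percentile guidance above, computing P50/P95/P99 takes a single NumPy call once per-request measurements are logged; the numbers below are illustrative only:

import numpy as np

# One TTFT measurement per request, e.g. collected with a helper like the measure_stream sketch in Section 1.
ttft_ms = np.array([120, 140, 135, 180, 950, 160, 150, 145, 2100, 155], dtype=float)
p50, p95, p99 = np.percentile(ttft_ms, [50, 95, 99])
print(f"TTFT  mean={ttft_ms.mean():.0f} ms  P50={p50:.0f} ms  P95={p95:.0f} ms  P99={p99:.0f} ms")
# P50 still looks snappy; the mean is already skewed and P99 exposes the two-second outlier.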

7.2) Pitfalls to Avoid  

  • Confusing stream chunks with tokens: Many servers stream text in fragments, which don’t always equal tokens. Always measure at the token level.  
  • Reporting only averages: P95/P99 latencies tell the real story.  
  • Ignoring warm vs. cold cache runs: Cache warm-up often skews first-run results.  
  • Not documenting batching policy: Record batch window size, maximum batch size, and concurrency when reporting benchmarks.  
  • Forgetting output constraints: Grammar enforcement, JSON schemas, and tool-calling can all significantly impact speed. Benchmarks without these constraints may overstate real-world performance.  

8) Reproducible Benchmarks 

A claim like “our system handles 1000 TPS” means little without context. To make performance data credible, benchmarks must be reproducible, transparent, and workload-aware. That requires consistent data collection, clear reporting, and visualizations that expose both averages and tails.  

8.1) What to Benchmark  

When designing benchmarks, cover all aspects that affect latency and throughput. 

  • Workload types: short chat prompts (≈50–150 output tokens) and long-form generations (≈300–600 tokens).  
  • Concurrency levels: representative of real load (e.g., 8, 32, 128 users).  
  • Serving modes: Hugging Face baseline vs. vLLM (PagedAttention), with and without quantization.  
  • Hardware: CPU vs. consumer GPUs (8 GB, 16 GB, 24 GB VRAM).  

8.2) Collecting Raw Data  

Each benchmark run should produce a CSV log with one row per request. Suggested schema:  

  • timestamp – The exact time the request was initiated (typically in ISO 8601 format). 
  • hardware – Hardware specification used for the run (e.g., A100-80GB, RTX4090, CPU). 
  • model – Model name and version (e.g., Llama-3-8B, Qwen2-7B). 
  • precision – Numeric precision used during inference (fp32, fp16, bf16, int8, nf4, etc.). 
  • server – Server or instance identifier (e.g., node-1, srv-azure01). 
  • concurrency – Number of parallel requests processed simultaneously. 
  • batch_window_ms – Micro-batch window size in milliseconds (how long requests wait before batching). 
  • prompt_tokens – Number of input tokens provided in the request. 
  • output_tokens – Number of tokens generated by the model in the response. 
  • ttft_ms (Time To First Token) – Latency in milliseconds from request start to the first generated token. 
  • e2e_ms (End-to-End Latency) – Total latency in milliseconds from request submission to full completion. 
  • tpot_ms (Time Per Output Token) – Average latency per output token in milliseconds. 
  • success – Boolean flag indicating if the request succeeded (1 = success, 0 = failure). 
  • gpu_alloc_gb – GPU memory actively allocated for this request (in GB). 
  • gpu_reserved_gb – Total GPU memory reserved by the serving framework (in GB). 
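
A minimal logging helper under this schema might look like the following; the file name and the example values are placeholders:

import csv
import os
from datetime import datetime, timezone

FIELDS = ["timestamp", "hardware", "model", "precision", "server", "concurrency",
          "batch_window_ms", "prompt_tokens", "output_tokens", "ttft_ms", "e2e_ms",
          "tpot_ms", "success", "gpu_alloc_gb", "gpu_reserved_gb"]

def log_row(path, row):
    # Append one request record, writing the header on first use.
    first_write = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if first_write:
            writer.writeheader()
        writer.writerow(row)

log_row("bench.csv", {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "hardware": "RTX4090", "model": "Llama-3-8B", "precision": "fp16",
    "server": "node-1", "concurrency": 32, "batch_window_ms": 10,
    "prompt_tokens": 512, "output_tokens": 120, "ttft_ms": 180,
    "e2e_ms": 1100, "tpot_ms": 7.7, "success": 1,
    "gpu_alloc_gb": 14.2, "gpu_reserved_gb": 16.0,
})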

8.3) Visualizing the Data  

Charts make it easy to spot trends and tail issues:  

  • Bar charts: Compare TTFT (P50 or P99) across models or servers.  
  • Line charts: Plot batch window size vs. P99 TTFT to reveal batching trade-offs.  
  • Box plots: Show latency distributions (spread between P50, P95, P99).  
  • Stacked bars: Break down E2E latency into retrieval, prefill, decode, post-processing.  
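
With the CSV from Section 8.2, a couple of pandas/matplotlib lines already surface the tails; this is a sketch, with column and file names following the schema above:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("bench.csv")                 # per-request log from Section 8.2
# A box plot per server shows the spread (P50 vs. P95/P99) that a bar of means would hide.
df.boxplot(column="ttft_ms", by="server")
plt.ylabel("TTFT (ms)")
plt.suptitle("")                              # drop pandas' automatic super-title
plt.title("TTFT distribution by server")
plt.savefig("ttft_by_server.png", dpi=150)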

8.4) Benchmarking Checklist  

  • Report hardware details (GPU type, VRAM, CPU cores).  
  • State batching policy (window size, max batch).  
  • Separate warm vs. cold cache runs.  
  • Include percentiles (P50/P95/P99), not just averages.  
  • Note any grammar/JSON constraints or tool-calling overheads.  

Benchmarks without clear methodology, percentiles, and reproducible data are misleading. Tables and charts are not just for pretty visuals; they are essential for diagnosing bottlenecks and validating optimizations.  

9) RAG Without Regret: Timing the Whole Pipeline  

Most production chatbots today use Retrieval-Augmented Generation (RAG). While RAG improves accuracy and grounding, it also adds extra latency stages (retrieval, reranking, and context assembly) on top of the model’s inference time.  

The common mistake is to benchmark only LLM inference. In real-world scenarios, what matters is the end-to-end (E2E) pipeline latency, which the user perceives.  

9.1) The RAG Latency Budget  

We can express the end-to-end latency in a RAG system as:  

E2E ≈ t_retrieval + t_rerank + t_prompt_assembly + TTFT + (tokens_out * TPOT) + t_post 

  • t_retrieval: time for vector search or database lookup.  
  • t_rerank: time for a cross-encoder or reranker to reorder candidates.  
  • t_prompt_assembly: joining retrieved content with system/user prompts.  
  • TTFT: time to first token (prefill + queuing).  
  • tokens_out × TPOT: total generation time, depending on output length.  
  • t_post: any validation, guardrails, or formatting after generation.  

This equation shows that retrieval latency can be as significant as decode latency, especially if the retriever or reranker is inefficient. 
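
To sanity-check a budget, the terms can simply be added up; the numbers in this sketch are illustrative and roughly track the budget from Section 6.2:

def rag_e2e_ms(t_retrieval, t_rerank, t_assembly, ttft, tokens_out, tpot, t_post):
    # All inputs in milliseconds except tokens_out (a count) and tpot (ms per token).
    return t_retrieval + t_rerank + t_assembly + ttft + tokens_out * tpot + t_post

print(rag_e2e_ms(t_retrieval=50, t_rerank=20, t_assembly=10,
                 ttft=180, tokens_out=100, tpot=5, t_post=10))   # 770 ms end to end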

9.2) How to Speed Up RAG  

To keep RAG responsive without sacrificing quality, focus on these strategies:  
  • Keep documents short: Chunk text into 300–600 tokens. This reduces the retrieval cost and keeps prompts lighter during prefill.  
  • Reduce top-k candidates: Don’t over-fetch. Use multi-stage retrieval: a fast retriever (e.g., dense or hybrid search) followed by a small reranker.  
  • Cache frequent queries and contexts: High cache hit rates dramatically cut retrieval latency.  
  • Use prefix caching: For system prompts and static RAG scaffolds (e.g., “You are a helpful assistant answering with sources”), cache prefill states to avoid recomputation.  

RAG is not “free”. It introduces additional latency layers. To deliver a fast and grounded chatbot, you must budget time across the entire pipeline, not just the model. Optimizations in retrieval, caching, and reranking often matter as much as quantization or attention tricks inside the LLM. 

10) UX and Product Choices That Affect Speed  

System-level optimizations like quantization, caching, and batching are crucial, but the user experience (UX) layer also shapes how fast a chatbot feels. Even when backend latency stays the same, smart design choices in how answers are delivered can make the system feel far more responsive. 

10.1) Streaming UI  

Instead of holding the response until it’s complete, stream tokens to the user as soon as they are generated. This drops perceived latency because users see progress immediately. It’s like reading subtitles in real time instead of waiting for the full transcript after a talk.  
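
A minimal server-side sketch, assuming a FastAPI gateway; generate_tokens is a hypothetical stand-in for your LLM client's streaming call:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def generate_tokens(prompt):
    # Placeholder: substitute your LLM client's streaming call here.
    for word in ("This ", "is ", "a ", "streamed ", "reply."):
        yield word

@app.get("/chat")
async def chat(q: str):
    async def token_stream():
        async for token in generate_tokens(q):
            yield token                       # flushed to the client as it is produced
    return StreamingResponse(token_stream(), media_type="text/plain")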

10.2) Partial Rendering  

Render the first paragraph or a summary first, and progressively add details as the model finishes. This gives users actionable information early. A customer support bot, for example, can show the first troubleshooting step immediately, then expand with the full instructions.  

10.3) Chunked Reasoning  

Design answers to start short by default, and let users explicitly expand them if they want more. This reduces end-to-end latency for most interactions, while still supporting depth when needed. A common pattern is “Expandable answers” where users click More to reveal additional details.  

10.4) Grammar and Output Constraints  

Forcing strict output formats like JSON, XML, or SQL can slow generation, since speculative decoding and fast sampling work less effectively under rigid constraints. The best practice is to enable grammar enforcement only when absolutely necessary (e.g., structured API outputs), and keep chat responses freer for faster streaming. 

A chatbot doesn’t need sub-100 ms TTFT to feel fast. UX techniques like streaming, partial rendering, chunked reasoning, and selective grammar constraints can dramatically improve perceived speed.  

In practice, user-perceived responsiveness = system optimizations × UX design choices. 

11) Serving Stack: What to Run  

Choosing the right serving stack is as important as model selection. Different frameworks offer different trade-offs in terms of throughput, latency, hardware efficiency, and developer ergonomics.  

11.1) vLLM  

  • Strengths: Implements PagedAttention for efficient KV cache management and supports dynamic batching out of the box.  
  • Best for: High-throughput workloads where you need to serve many concurrent users with decent TTFT.  
  • Trade-off: Slightly more complex to deploy than single-binary tools like Ollama.  

11.2) TGI (Text Generation Inference)  

  • Strengths: Strong production features such as multi-GPU scaling, quantization hooks, and Hugging Face ecosystem integration.  
  • Best for: Teams already invested in HF tooling or needing robust deployment.  
  • Consideration: Kernel and FlashAttention support should be verified for maximum efficiency.  

11.3) llama.cpp  

  • Strengths: Extremely portable, runs on CPU and small GPUs. Optimized C++ implementation makes it lightweight and efficient, even on laptops.  
  • Best for: Local apps, prototyping, or edge deployments.  
  • Limitation: Not designed for large-scale serving at high concurrency.  

11.4) Ollama  

  • Strengths: Offers frictionless user experience with simple model downloads and one-line runs. Perfect for quick experiments and desktop chatbots.  
  • Best for: Developers who want to experiment with LLMs locally with minimal setup.  
  • Limitation: Not tuned for high-performance serving pipelines at scale.  

Regardless of the serving stack, monitoring is critical for ensuring performance and reliability. Common metrics to track include:  

  • Latency percentiles: TTFT, TPOT, and E2E (P50/P95/P99).  
  • Queue length & wait time: To diagnose batching and scheduling bottlenecks.  
  • Batch window actuals and sizes: Ensure batching policies are behaving as configured.  
  • GPU memory usage: Track both allocated and reserved memory to detect fragmentation or leaks.  
  • Cache hit ratios: For both prefix caches and retrieval caches.  
  • Timeout, retry, and drop rates: To measure goodput rather than raw throughput.  

Pick the serving stack that matches your scale and constraints from lightweight local experiments (Ollama, llama.cpp) to high-throughput production workloads (vLLM, TGI). But no matter which you choose, instrumentation and monitoring are non-negotiable for achieving reliable low-latency LLM serving.  

12) Putting It All Together  

The right latency target depends on the use case. Different applications place different emphasis on TTFT, TPOT, or throughput.  

  • General chatbot: Aim for TTFT < 200 ms, an output rate of at least 3 tokens/s (TPOT ≤ ~330 ms), and P99 E2E < 2 seconds for short conversational replies. This keeps interactions feeling natural and responsive.  
  • Realtime copilots (typing assistants): Here, responsiveness is everything. TTFT < 150 ms is crucial, and maintaining a smooth TPOT is more important than raw throughput. If the assistant lags behind the typing speed, the experience breaks.  
  • Code or search completion: TTFT < 100 ms matters most, because developers expect results as soon as they type. Throughput is less critical; low latency dominates.  
  • Document processing / batch jobs: In workloads like summarizing reports or parsing PDFs, throughput is king. Here, optimize for goodput TPS rather than TTFT. Users tolerate higher latency per request as long as throughput scales with batch size.  

13) Final Checklist  

To design a chatbot that feels instant, use this checklist as a guide:  

  • Define SLIs (Service Level Indicators): Track TTFT P99, E2E P99, goodput, cache hit ratios, and timeout rates.  
  • Set SLOs (Service Level Objectives): For example, “P99 TTFT < 250 ms at 99.5% success over 30 days.”  
  • Measure prefill vs. decode: Time both stages separately, as optimizations affect them differently.  
  • Instrument RAG: If using retrieval, measure retrieval latency, reranking, and prompt assembly along with LLM inference.  
  • Enable PagedAttention (vLLM): Essential for memory efficiency and higher throughput.  
  • Consider FlashAttention: Especially valuable for long contexts to reduce compute cost.  
  • Experiment with quantization: Start with INT8, then test INT4. Always validate accuracy on your own tasks.  
  • Tune batching policies: Adjust the batching window for your concurrency profile; add admission control to protect P99 latency.  
  • Publish transparent benchmarks: Always document model versions, quantization, serving frameworks, hardware, batching policy, and whether runs were warm or cold.  

14) Conclusion 

Fast LLM interaction is a system problem.  

  • Measure the right metrics: TTFT, TPOT, E2E, goodput, and P99 latency.  
  • Eliminate bottlenecks: use PagedAttention for memory, FlashAttention for compute, quantization and speculative decoding for speed, and prefix caching for repeated prompts.  
  • Benchmark the entire pipeline end-to-end, not just the model, with transparent tables and charts under realistic concurrency. 
  • In RAG systems, speed is not only about the model: retrieval, reranking, and prompt assembly also add latency, so the optimization effort must cover those stages too. 
