You’re debugging a race condition. Three hours in, you can almost see the interleaving threads. Slack pings. You reply, turn back, and the mental model is gone. Twenty minutes to rebuild what took three hours to construct.
This is a context management failure. And it’s the same class of problem whether the “processor” is your prefrontal cortex, a 128K-token LLM window, or a microservice juggling request-scoped state. All three share the same constraints: limited capacity, degradation under load, and catastrophic loss when boundaries are violated.
This post is about the practical techniques that work across all three.
Context operates on three layers. Each has different constraints, but the management strategies transfer between them.
| Layer | Description |
| --- | --- |
| System Context | State at the code level: request-scoped variables, database transactions, bounded contexts in DDD. Capacity is defined by memory and architecture. |
| AI Context | The token window: everything an LLM can "see" at inference time. Capacity is defined by the model's context window (4K to 200K+ tokens) and your budget. |
| Cognitive Context | Human working memory: four to seven chunks that degrade under stress, fatigue, and interruption. Capacity is biological and non-negotiable. |
The core tension is identical at every layer: what do you keep, what do you compress, and what do you offload? The rest of this post answers that question with concrete techniques.
Every context has a budget. The skill is fitting maximum signal into minimum space.
In prompt engineering, this means treating your system prompt like expensive real estate. Every token should be load-bearing (remove it and the output breaks) or it shouldn’t be there.
Instead of one massive instruction block, structure your prompt in priority tiers:
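A minimal sketch of what tiering might look like in Python. The tier names, contents, and the `include_reference` switch are illustrative assumptions, not a real API:

```python
# Sketch: assemble a system prompt from priority tiers instead of one
# monolithic block. Tier names and contents are made up for illustration.
TIERS = {
    # Tier 1: always present, load-bearing identity and hard constraints.
    "identity":    "You are a support assistant for Acme's billing API.",
    "constraints": "Answer in under 150 words. Never reveal internal IDs.",
    # Tier 2: reference material, included only when the task needs it.
    "reference":   "Refund policy: full refunds within 30 days of purchase.",
}

def build_prompt(include_reference: bool = False) -> str:
    """Keep only load-bearing tiers; pull reference material on demand."""
    parts = [TIERS["identity"], TIERS["constraints"]]
    if include_reference:
        parts.append(TIERS["reference"])
    return "\n\n".join(parts)
```

A refund question would call `build_prompt(include_reference=True)`; everything else gets the two-tier version and pays no token cost for the policy text.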
This mirrors progressive disclosure in UI design: show only what’s needed for the current decision. A well-structured prompt with 800 tokens consistently outperforms a kitchen-sink prompt with 3,000 tokens because the model spends less attention on irrelevant instructions.
When a conversation exceeds your context budget, don’t just truncate from the top. Summarize completed threads and keep active ones intact:
Instead of keeping 15 raw messages:
[Summary] User wants a REST API for inventory management. We agreed on FastAPI + PostgreSQL. Authentication will use JWT with refresh tokens. Endpoints for /products and /orders are finalized.

[Active thread] Currently designing the /shipments endpoint. User wants batch operations and webhook notifications.
This is lossy compression. You’re betting that the summary captures what matters. The quality of that bet determines everything downstream.
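The mechanics can be sketched in a few lines. The `summarize` stub and the `keep_active` threshold are assumptions; in practice the summary would come from an LLM call:

```python
# Sketch: lossy compression of a conversation history.
def summarize(messages: list[str]) -> str:
    # Stand-in for an LLM summarization call.
    return f"[Summary of {len(messages)} messages]"

def compress(history: list[str], keep_active: int = 4) -> list[str]:
    """Summarize completed threads; keep the active tail verbatim."""
    if len(history) <= keep_active:
        return history
    completed, active = history[:-keep_active], history[-keep_active:]
    return [summarize(completed)] + active
```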
Cognitive parallel: This is exactly what good meeting notes do. Nobody transcribes every word. You capture decisions, open questions, and follow-up actions. The rest is noise.
Sophie Leroy’s research on attention residue showed that when you switch from Task A to Task B, part of your mind stays on Task A. The residue is worse when Task A is unfinished. This maps directly to both LLM context and system architecture.
In prompt engineering, boundary violations look like context pollution: instructions from one task leaking into another.
When building multi-step AI workflows, treat each step as a bounded context with explicit inputs and outputs:
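A sketch of a three-step pipeline where each step sees only the previous step's structured output. The `extract`, `validate`, and `generate` functions are hypothetical stand-ins for separate LLM calls:

```python
# Sketch: each step is a bounded context with explicit inputs/outputs.
def extract(raw_document: str) -> dict:
    """Step 1: raw text in, structured fields out."""
    return {"title": raw_document.split("\n")[0],
            "body_chars": len(raw_document)}

def validate(fields: dict) -> dict:
    """Step 2: sees only the extraction, never the raw document."""
    fields["valid"] = bool(fields["title"])
    return fields

def generate(fields: dict) -> str:
    """Step 3: sees only validated fields."""
    return f"Report on '{fields['title']}' ({fields['body_chars']} chars)"

report = generate(validate(extract("Q3 Revenue\nDetails...")))
```

Because each step's input is a plain dict or string, each is independently testable with fixtures, and a failure is attributable to one step rather than one mega-prompt.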
Each step gets a clean context window. Step 3 never sees the raw document, only the structured extraction. This prevents the model from getting confused by irrelevant details and makes each step independently testable.
The anti-pattern is the “mega-prompt” that tries to do extraction, validation, and generation in a single call. It works for simple cases and fails unpredictably for complex ones.
When you must combine different concerns in a single prompt, use explicit delimiters:
```
<task>Generate a product description</task>
<constraints>Max 150 words. Tone: professional.</constraints>
<context>Product specs: [data]</context>
<negative>Do not mention competitor products.</negative>
```
XML-style tags give the model clear signals about where one context ends and another begins. Without fences, instructions blend together and priority becomes ambiguous.
Cognitive parallel: Cal Newport’s deep work framework is the same principle applied to human attention. Block uninterrupted time, make the boundaries explicit, defend them. Context fencing for your calendar.
RAG (Retrieval-Augmented Generation) solved a fundamental problem for LLMs: you can’t fit everything in the window, so you fetch what you need at query time. The same logic applies to human cognition and team knowledge.
Instead of stuffing your system prompt with every possible scenario, build a retrieval layer that pulls relevant context based on the user’s query:
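As a minimal sketch, here is naive keyword retrieval standing in for a vector search. The knowledge base and overlap scoring are illustrative assumptions; a production system would use embeddings:

```python
# Sketch: fetch only the chunks relevant to this query at request time.
KNOWLEDGE = [
    "Refunds: full refund within 30 days of purchase.",
    "Shipping: orders dispatch within 2 business days.",
    "Returns: items must be unused and in original packaging.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Score chunks by word overlap with the query; return the top k."""
    q = set(query.lower().split())
    scored = sorted(KNOWLEDGE,
                    key=lambda c: -len(q & set(c.lower().split())))
    return scored[:k]

def build_context(query: str) -> str:
    """Only the retrieved chunk enters the prompt, not all of KNOWLEDGE."""
    return "\n".join(retrieve(query)) + "\n\nUser: " + query
```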
Key insight: retrieval quality depends on how you stored the information, not just how you search for it. A vector database performs terribly if the chunking strategy is wrong. A note-taking system is useless without connection metadata.
When pausing a complex prompt chain, save a structured checkpoint:
```json
{
  "task": "API design for payment service",
  "decisions_made": [
    "REST over GraphQL",
    "Stripe as payment provider",
    "Idempotency keys on all POST endpoints"
  ],
  "open_questions": [
    "Webhook retry policy",
    "Partial refund handling"
  ],
  "current_focus": "Webhook endpoint design"
}
```
Feed this snapshot into the next session instead of replaying the entire conversation. This is the LLM equivalent of interstitial journaling: writing a brief context dump during transitions so you can restore state without full replay.
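Restoring from such a snapshot might look like this; the field names match the checkpoint above, and `resume_prompt` is an illustrative helper, not a library function:

```python
import json

# Sketch: rebuild working context from a structured checkpoint
# instead of replaying the full transcript.
checkpoint = json.loads("""{
  "task": "API design for payment service",
  "decisions_made": ["REST over GraphQL", "Stripe as payment provider"],
  "open_questions": ["Webhook retry policy"],
  "current_focus": "Webhook endpoint design"
}""")

def resume_prompt(cp: dict) -> str:
    """Compress the checkpoint into a few lines for the next session."""
    return (
        f"Task: {cp['task']}\n"
        f"Decided: {'; '.join(cp['decisions_made'])}\n"
        f"Open: {'; '.join(cp['open_questions'])}\n"
        f"Focus now: {cp['current_focus']}"
    )
```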
Cognitive parallel: Architecture Decision Records (ADRs) serve the same function for teams. An ADR that says "we chose Kafka" without recording why is a document that fails at its only job: preserving decision context across time.
Statelessness is elegant. Stateless functions are easy to test, stateless services scale horizontally, and stateless conversations avoid privacy headaches. But complex domains are inherently stateful. Pretending otherwise just pushes the state somewhere harder to manage.
Hold minimum state for the current operation but leave a trail for reconstruction. In prompt chains, this looks like conversation summarization with selective memory:
Each new turn sees:

```
[reference context] + [archived summary] + [active messages]
```
This gives the model enough history to stay coherent without burning your entire context window on stale conversation. The pattern is identical to event sourcing in system design.
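A sketch of assembling each turn's window from those three parts. The character budget and the inline drop-note are illustrative assumptions (a real system would count tokens and re-summarize):

```python
# Sketch: minimum live state per turn, with a trail for reconstruction.
def assemble_window(reference: str, summary: str, active: list[str],
                    budget_chars: int = 2000) -> str:
    """Reference + archived summary + active messages; oldest active
    messages are folded into the summary if the budget is exceeded."""
    while active and (len(reference) + len(summary)
                      + sum(map(len, active))) > budget_chars:
        # In practice the dropped message would be re-summarized by an
        # LLM, not truncated like this.
        summary += " | dropped: " + active.pop(0)[:40]
    return "\n\n".join([reference, summary] + active)
```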
Anti-pattern: Context hoarding. Keeping everything "just in case." In prompts, this means stuffing the system message with every instruction you've ever needed. More context is not better context. Relevant context is better context.
Three failure modes to watch for: context pollution (instructions from one task leaking into another), context hoarding (keeping everything "just in case"), and catastrophic loss (state destroyed when a boundary is violated).
Context management requires context. You need to understand your system’s constraints to manage them. You need attention to manage attention. You need memory to build a memory system.
This recursion isn’t a bug. It’s a feature. Every improvement compounds. The developer who builds good module boundaries thinks more clearly. The prompt engineer who masters compression builds better systems. The team that writes good ADRs onboards faster.
The goal isn’t a mind like still water. It’s a mind like a well-maintained repository: clear history, clean boundaries, useful documentation, and the confidence that when you need something, you know exactly where to find it.