You’re debugging a race condition. Three hours in, you can almost see the interleaving threads. Slack pings. You reply, turn back, and the mental model is gone. Twenty minutes to rebuild what took three hours to construct.
This is a context management failure. And it’s the same class of problem whether the “processor” is your prefrontal cortex, a 128K-token LLM window, or a microservice juggling request-scoped state. All three share the same constraints: limited capacity, degradation under load, and catastrophic loss when boundaries are violated.
This post is about the practical techniques that work across all three.
Table of Contents
Context operates on three layers. Each has different constraints, but the management strategies transfer between them.
Layer
Description
System Context
State at the code level. Request-scoped variables, database transactions, bounded contexts in DDD. Capacity is defined by memory and architecture.
AI Context
The token window. Everything an LLM can "see" at inference time. Capacity is defined by the model's context window (4K to 200K+ tokens) and your budget.
Cognitive Context
Human working memory. Four to seven chunks that degrade under stress, fatigue, and interruption. Capacity is biological and non-negotiable.
The core tension is identical at every layer: what do you keep, what do you compress, and what do you offload? The rest of this post answers that question with concrete techniques.
Every context has a budget. The skill is fitting maximum signal into minimum space.
In prompt engineering, this means treating your system prompt like expensive real estate. Every token should be load-bearing (remove it and the output breaks) or it shouldn’t be there.
Instead of one massive instruction block, structure your prompt in priority tiers:
This mirrors progressive disclosure in UI design: show only what’s needed for the current decision. A well-structured prompt with 800 tokens consistently outperforms a kitchen-sink prompt with 3,000 tokens because the model spends less attention on irrelevant instructions.
When a conversation exceeds your context budget, don’t just truncate from the top. Summarize completed threads and keep active ones intact:
Instead of keeping 15 raw messages:
[Summary] User wants a REST API for inventory management. We agreed on FastAPI + PostgreSQL. Authentication will use JWT with refreshing tokens. Endpoints for /products and /orders are finalized.[Active thread] Currently designing the /shipments endpoint. User wants batch operations and webhook notifications.
[Summary] User wants a REST API for inventory management. We agreed on FastAPI + PostgreSQL. Authentication will use JWT with refreshing tokens. Endpoints for /products and /orders are finalized.
[Active thread] Currently designing the /shipments endpoint. User wants batch operations and webhook notifications.
This is lossy compression. You’re betting that the summary captures what matters. The quality of that bet determines everything downstream.
Cognitive parallel: This is exactly what good meeting notes do. Nobody transcribes every word. You capture decisions, open questions, and follow the next steps. The rest is noisy.
Sophie Leroy’s research on attention residue showed that when you switch from Task A to Task B, part of your mind stays on Task A. The residue is worse when Task A is unfinished. This maps directly to both LLM context and system architecture.
In prompt engineering, boundary violations look like context pollution: instructions from one task leaking into another.
When building multi-step AI workflows, treat each step as a bounded context with explicit inputs and outputs:
Each step gets a clean context window. Step 3 never sees the raw document, only the structured extraction. This prevents the model from getting confused by irrelevant details and makes each step independently testable.
The anti-pattern is the “mega-prompt” that tries to do extraction, validation, and generation in a single call. It works for simple cases and fails unpredictably for complex ones.
When you must combine different concerns in a single prompt, use explicit delimiters:
<task>Generate a product description</task> <constraints>Max 150 words. Tone: professional.</constraints> <context>Product specs: [data]</context> <negative>Do not mention competitor products.</negative>
XML-style tags give the model clear signals about where one context ends and another begins. Without fences, instructions blend together and priority becomes ambiguous.
Cognitive parallel: Cal Newport’s deep work framework is the same principle applied to human attention. Block uninterrupted time, make the boundaries explicit, defend them. Context fencing for your calendar.
RAG (Retrieval-Augmented Generation) solved a fundamental problem for LLMs: you can’t fit everything in the window, so you fetch what you need at query time. The same logic applies to human cognition and team knowledge.
Instead of stuffing your system prompt with every possible scenario, build a retrieval layer that pulls relevant context based on the user’s query:
Key insight: retrieval quality depends on how you stored the information, not just how you search for it. A vector database performs terribly if the chunking strategy is wrong. A note-taking system is useless without connection metadata.
When pausing a complex prompt chain, save a structured checkpoint:
{ "task": "API design for payment service", "decisions_made": [ "REST over GraphQL", "Stripe as payment provider", "Idempotency keys on all POST endpoints" ], "open_questions": [ "Webhook retry policy", "Partial refund handling" ], "current_focus": "Webhook endpoint design" }
Feed this snapshot into the next session instead of replaying the entire conversation. This is the LLM equivalent of interstitial journaling: writing a brief context dump during transitions so you can restore state without full replay.
Cognitive parallel: Architecture Decision Records (ADRs) serve the same function for teams. An ADR that says “we chose Kafka” without recording why it is a document that fails at its only job: preserving decision context across time.
Statelessness is elegant. Stateless functions are easy to test, stateless services scale horizontally, and stateless conversations avoid privacy headaches. But complex domains are inherently stateful. Pretending otherwise just pushes the state somewhere harder to manage.
Hold minimum state for the current operation but leave a trail for reconstruction. In prompt chains, this looks like conversation summarization with selective memory:
Each new turn sees: [reference context] + [archived summary] + [active messages]
This gives the model enough history to stay coherent without burning your entire context window on stale conversation. The pattern is identical to event sourcing in system design.
Anti-pattern: Context hoarding. Keeping everything “just in case.” In prompts, this means stuffing the system message with every instruction you’ve ever needed. More context is not a better context. The relevant context is a better context.
Three failure modes to watch for:
Context management requires context. You need to understand your system’s constraints to manage them. You need attention to manage attention. You need memory to build a memory system.
This recursion isn’t a bug. It’s a feature. Every improvement compound. The developer who builds good module boundaries thinks more clearly. The prompt engineer who masters compression builds better systems. The team that writes good ADRs onboards faster.
The goal isn’t a mind like still water. It’s a mind like a well-maintained repository: clear history, clean boundaries, useful documentation, and the confidence that when you need something, you know exactly where to find it.
Data Scientist Data Scientist architecting scalable ML systems. Builds production-grade solutions that combine predictive analytics with agentic AI capabilities.
Integration of SAP Cloud for Customer with S/4HANA Cloud
SAP S/4HANA is one of SAP's next-generation business solutions, built on the SAP HANA database technology. This platform enables businesses to...
Understanding Open-Source Large Language Models
Open-source large language models give individuals and organizations full control over their data, infrastructure, and AI capabilities, offering a...
Main Features and Benefits of SAP TM
In today's globalised and rapidly increasing technology, competing with competitors has become more challenging than ever. In such a business world,...
What is SAP Flexible Workflow?
Flexible Workflow is a new workflow model designed to facilitate workflow configurations. The workflow model has features that can send instant...
The Benefits of Working with an Experienced SAP Consulting Firm
Businesses around the world continue to invest in SAP because of the many benefits it can bring to their organisations. SAP offers a wide range of...
SAP Signavio Process Manager – SAP Solution Manager Integration
New Business Process Connector for SAP Signavio SolutionsThe SAP Signavio solutions now feature an upgraded business process model connector,...
How are Kaizen Events Done?
Kaizen is a business philosophy focused on the idea of achieving significant improvements by making small, incremental changes to processes, products...
What Do SAP Integration Suite Adapters Provide? Guide
Today's world of technology has required businesses to have an integrated structure. As businesses invest in new technologies day by day, their...
Benefits of Using SAP Fiori Applications for Businesses
SAP Fiori includes multiple intuitive applications and guides that optimize the user experience, enabling users to reduce errors and increase...
Your mail has been sent successfully. You will be contacted as soon as possible.
Your message could not be delivered! Please try again later.