What Makes LLMs ‘Forget’? Token Limits and Persistence
Have you ever had a brilliant, insightful conversation with an AI model, only for it to completely “forget” a crucial detail just a few turns later? It’s a frustrating experience that might lead you to believe LLMs have a short-term memory problem. While that’s an intuitive way to think about it, the reality is more nuanced and rooted in fundamental architectural and operational design choices.
This post will peel back the layers to explain precisely why LLMs seem to forget, focusing on two core concepts: the context window and token limits, and the stateless nature of LLM APIs. We’ll then explore the clever engineering strategies developers employ to give these seemingly forgetful agents a persistent memory.
The Core Problem: The Context Window and Token Limits
At the heart of an LLM’s “forgetfulness” lies its context window, which is essentially its working memory. To understand this, we first need to understand tokens.
What Exactly is a Token?
When you send text to an LLM, it doesn’t process individual letters or even entire words in the way humans do. Instead, it breaks down the text into smaller units called tokens. A token can be a whole word, a part of a word (like “un-” or “-ing”), punctuation, or even a single character for certain languages. For English, a good rule of thumb is that 1,000 tokens equate to roughly 750 words.
LLMs operate by predicting the next token based on the sequence of tokens they have seen so far. This sequence, encompassing your prompt and the model’s previous responses (if you’re maintaining a conversation), is what’s fed into the model’s neural network.
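Here's a quick way to see that splitting in action. This minimal sketch uses OpenAI's open-source tiktoken library (install with pip install tiktoken); the sample sentence and the choice of the cl100k_base encoding are illustrative, and other models use different tokenizers with different counts:

```python
# Quick look at tokenization with OpenAI's tiktoken library (pip install tiktoken).
# Different models use different tokenizers, so exact counts vary by provider.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by many recent OpenAI models
text = "Unbelievably, tokenization splits text into sub-word pieces."
tokens = enc.encode(text)

print(len(tokens))                          # number of tokens, not words or characters
print([enc.decode([t]) for t in tokens])    # inspect how the text was actually split
```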
The Finite Nature of the Context Window
Every LLM has a predefined maximum context window size, measured in tokens. This is a hard limit on how much information the model can consider at any given moment to generate its response. Think of it like a very narrow, high-focus spotlight. Everything illuminated by the spotlight is what the LLM “sees” and uses for its next prediction.
Why does this limit exist? The primary reason is computational complexity. The core mechanism behind LLMs, particularly the “Transformer” architecture, relies heavily on a component called the “attention mechanism.” The computational cost of attention grows quadratically with the length of the input sequence (O(n^2), where ’n’ is the number of tokens). This means doubling the context window length doesn’t just double the computational cost; it quadruples it!
This quadratic scaling makes processing extremely long sequences prohibitively expensive in terms of GPU memory, processing time, and ultimately, cost. Even with massive advances, there’s still a practical limit to how large a context window can be before inference becomes too slow or too expensive for real-time applications.
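To make the quadratic scaling concrete, here's a back-of-the-envelope illustration (not a model implementation): the attention score matrix for a sequence of n tokens has n × n entries, so doubling the sequence length quadruples the number of scores to compute and store.

```python
# Back-of-the-envelope illustration of quadratic attention scaling.
# Each token attends to every token, so the score matrix has n * n entries
# (per head, per layer) before any optimizations.
for n in [1_000, 2_000, 4_000, 8_000]:
    print(f"{n:>6} tokens -> {n * n:>14,} attention scores")
# Doubling from 4,000 to 8,000 tokens goes from 16,000,000 to 64,000,000 scores:
# twice the tokens, four times the work.
```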
How Forgetting Happens: The Sliding Window Effect
When you engage in a long conversation with an LLM, the entire conversation history (your prompts and the AI’s responses) is often bundled together and sent back to the model with each new turn. As the conversation grows, it eventually hits the context window limit.
Once that limit is reached, a “sliding window” effect comes into play. The oldest tokens in the conversation are truncated or pushed out to make room for the new input and generated output. The LLM literally loses access to that information because it’s no longer within its current context window. It’s not that the LLM has a “bad memory”; it’s that the information is no longer presented to it.
For example, if an LLM has a 4,096-token context window and your conversation exceeds that, the very first messages you exchanged will be dropped from the input, making them inaccessible to the model for subsequent responses. This is why the LLM can “forget” something you discussed 20 messages ago, even if it was crucial.
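In practice, this truncation is usually something the application performs explicitly before each API call. Below is a minimal sketch of the idea; the count_tokens helper is a crude stand-in for a real tokenizer, and the 4,096-token budget simply mirrors the example figure above:

```python
# Minimal sketch of sliding-window truncation before an API call.
# count_tokens() is a crude stand-in for a real tokenizer (e.g. tiktoken);
# the 4,096-token budget mirrors the example above.

def count_tokens(text: str) -> int:
    # Rough English approximation: about 750 words per 1,000 tokens.
    return int(len(text.split()) / 0.75) + 1

def trim_history(messages: list[dict], budget: int = 4096) -> list[dict]:
    """Drop the oldest messages until the remaining ones fit within the budget."""
    kept, used = [], 0
    for msg in reversed(messages):            # walk from newest to oldest
        cost = count_tokens(msg["content"])
        if used + cost > budget:
            break                             # everything older is "forgotten"
        kept.append(msg)
        used += cost
    return list(reversed(kept))               # restore chronological order
```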
The Stateless Nature of LLM APIs
Beyond the context window, another fundamental reason for “forgetfulness” in practical applications is the stateless nature of LLM APIs.
When you interact with an LLM through an API (like OpenAI’s GPT-4, Google’s Gemini, or Anthropic’s Claude), each request is typically treated as an independent transaction. The server hosting the LLM doesn’t inherently remember your previous requests or maintain a persistent “session” for you.
Why stateless? Statelessness offers significant advantages for scalable web services:
- Scalability: Each request can be handled by any available server, making it easy to distribute load and scale horizontally.
- Simplicity: No complex session management required on the server side.
- Robustness: If a server goes down, no conversational state is lost for active users, as the state is managed by the client or an external system.
This means that for the LLM to “remember” anything from a previous turn, the application or client interacting with the API must explicitly send the relevant past conversation history (or a summary of it) with every single new request. If your application doesn’t do this, or only sends the last few turns to save on tokens, the LLM will indeed appear to have amnesia.
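Here's what that looks like in a provider-agnostic sketch. The call_llm function is a placeholder for whichever SDK you actually use, and the sample dialogue is invented; the point is that the full history list travels with every request, because the server keeps nothing between calls:

```python
# Client-side conversation state for a stateless chat API.
# call_llm() is a placeholder for a real provider SDK call; the key point is
# that the entire history must be resent with every request.

def call_llm(messages: list[dict]) -> str:
    # Stand-in: a real implementation would send `messages` to the provider
    # over HTTPS and return the assistant's reply.
    return f"(model reply to: {messages[-1]['content']!r})"

history = [{"role": "system", "content": "You are a helpful assistant."}]

def chat(user_input: str) -> str:
    history.append({"role": "user", "content": user_input})
    reply = call_llm(messages=history)        # the full history goes along every time
    history.append({"role": "assistant", "content": reply})
    return reply

chat("My name is Priya.")
print(chat("What's my name?"))  # only answerable because `history` was resent
```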
Bridging the “Memory Gap”: Persistence Strategies
Given these inherent limitations, how do developers build LLM applications that maintain coherent, long-term conversations? The answer lies in sophisticated persistence strategies, primarily implemented outside the LLM itself.
1. Prompt Engineering for Context Management
Before resorting to external systems, clever prompt engineering can help manage the context within the token limit:
- Summarization: For very long conversations, a common technique is to periodically summarize the past dialogue. Instead of sending the full transcript, you send a concise summary of what has been discussed so far, followed by the latest turn. This keeps the token count down while retaining the essence of the conversation.
- Example: After every 10 turns, prompt the LLM: “Please provide a concise summary of our conversation so far, focusing on key decisions, facts, and open questions.” Then inject this summary into the next prompt (a minimal sketch of this pattern follows this list).
- Keyword/Fact Extraction: Similar to summarization, you can instruct the LLM (or a separate smaller model) to extract key entities, facts, or instructions from the conversation and inject these as bullet points into future prompts. This is less conversational but very effective for goal-oriented interactions.
- “Remember This” Instructions (Limited Use): While you can tell an LLM, “Remember that I prefer coffee over tea,” this instruction itself consumes tokens. If the conversation continues for many turns, this instruction might eventually be pushed out of the context window. It’s helpful for immediate, short-term memory but not for truly persistent facts across sessions.
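Here is the rolling-summarization pattern from the first bullet as a minimal sketch. The keep_recent cutoff and the summary prompt wording are arbitrary choices, and call_llm is the same kind of placeholder used in the earlier sketch:

```python
# Rolling summarization: replace older turns with an LLM-written summary so the
# prompt stays under the token budget. call_llm() is a placeholder, as before.

def call_llm(messages: list[dict]) -> str:
    # Stand-in for a real provider call.
    return "(concise summary of the older turns)"

SUMMARY_PROMPT = (
    "Please provide a concise summary of our conversation so far, focusing on "
    "key decisions, facts, and open questions.\n\n{transcript}"
)

def compress_history(history: list[dict], keep_recent: int = 4) -> list[dict]:
    """Keep the last few turns verbatim and summarize everything older."""
    if len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    summary = call_llm([{"role": "user",
                         "content": SUMMARY_PROMPT.format(transcript=transcript)}])
    return [{"role": "system",
             "content": f"Summary of the conversation so far: {summary}"}] + recent
```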
2. External Memory Systems: Retrieval Augmented Generation (RAG)
The most powerful and widely adopted strategy for giving LLMs long-term memory is Retrieval Augmented Generation (RAG). This architecture connects the LLM to external knowledge bases.
Here’s how RAG typically works (a toy end-to-end sketch follows the list):
- Storage (Vector Databases): Conversational history, documents, specific facts, or user preferences are broken down into smaller chunks (e.g., paragraphs, sentences). Each chunk is then converted into a numerical representation called an “embedding” using an embedding model. These embeddings are stored in a specialized database known as a vector database (e.g., Pinecone, Weaviate, ChromaDB, Milvus). Embeddings allow semantic similarity search – chunks with similar meanings will have similar numerical representations.
- Retrieval: When a user asks a new question or provides a new input, that input is also converted into an embedding. This query embedding is then used to search the vector database for the most semantically relevant chunks. The system “retrieves” the pieces of information most likely to be useful to answer the current query or continue the conversation coherently.
- Augmentation: The retrieved relevant chunks are then added to the LLM’s prompt along with the user’s original query and any recent conversation history that fits within the context window.
- Generation: The LLM then generates its response using this augmented prompt. Because it has access to the relevant external knowledge, it can provide accurate, contextually aware, and “remembered” information that extends far beyond its immediate context window.
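Below is the toy end-to-end sketch promised above. The embed function is a deliberately fake stand-in (a character hash, not a real embedding model), the stored "facts" are invented, and the vector store is just a Python list; a production system would use a real embedding API plus a vector database such as Chroma or Pinecone, but the four steps are the same:

```python
# Toy RAG loop: storage, retrieval, augmentation, then hand-off to generation.
# embed() is a fake stand-in (character hashing), NOT a real embedding model,
# and the "vector store" is a plain list; real systems use an embedding API
# plus a vector database.
import numpy as np

def embed(text: str) -> np.ndarray:
    vec = np.zeros(64)
    for i, ch in enumerate(text.lower()):
        vec[(i + ord(ch)) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

# 1. Storage: chunk your data, embed each chunk, keep the vector next to the text.
chunks = ["User prefers coffee over tea.",
          "Project deadline is March 14.",
          "User's dog is named Biscuit."]
store = [(embed(c), c) for c in chunks]

# 2. Retrieval: embed the query and rank chunks by cosine similarity.
def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(store, key=lambda pair: float(q @ pair[0]), reverse=True)
    return [text for _, text in ranked[:k]]
    # (With real embeddings, the coffee/tea chunk would be expected to rank first
    # for a drinks-related question.)

# 3. Augmentation: prepend the retrieved chunks to the user's question.
question = "What drink should I offer the user?"
prompt = "Context:\n" + "\n".join(retrieve(question)) + f"\n\nQuestion: {question}"

# 4. Generation: `prompt` is what you would now send to the LLM.
print(prompt)
```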
Benefits of RAG:
- Overcomes Token Limits: RAG effectively bypasses the context window limitation for long-term memory. The LLM only needs to process the relevant retrieved information, not the entire historical corpus.
- Grounds Responses: By grounding the LLM in specific, verifiable information from external sources, RAG significantly reduces the likelihood of hallucinations (the LLM making up facts).
- Dynamic Knowledge: Knowledge can be updated in the external database without retraining the LLM, making systems more agile and up-to-date.
- Source Attribution: Some RAG implementations can point to the specific source document or conversation turn from which the information was retrieved.
Drawbacks of RAG:
- Complexity: Implementing a robust RAG system adds significant architectural complexity compared to simply calling an LLM API.
- Retrieval Quality: The effectiveness of RAG heavily depends on the quality of the embeddings and the retrieval algorithm. “Bad retrieval” (fetching irrelevant or only loosely related chunks) can confuse the LLM or lead to poor responses. And even when the right chunk is retrieved, the model can struggle to focus on it if the augmented prompt grows long, a cousin of the “needle in a haystack” problem discussed later.
- Cost: Maintaining and querying vector databases adds to the operational cost.
Libraries like LangChain and LlamaIndex provide powerful frameworks for building RAG applications, abstracting away much of the underlying complexity.
3. Fine-tuning (For Persistent Knowledge, Not Dynamic Memory)
While not a solution for dynamic conversational memory, it’s worth mentioning fine-tuning. Fine-tuning involves taking a pre-trained LLM and training it further on a smaller, domain-specific dataset. This process adjusts the model’s weights, effectively embedding specific knowledge, styles, or behaviors directly into the model itself.
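As a concrete (and hypothetical) illustration, many providers accept chat-style training examples written as one JSON object per line. The exact schema varies by provider, and the AcmeDB support scenario below is invented:

```python
# Hypothetical chat-style fine-tuning record written as JSONL (one object per line).
# The exact schema differs between providers; the AcmeDB scenario is invented.
import json

record = {
    "messages": [
        {"role": "system", "content": "You are a support agent for AcmeDB."},
        {"role": "user", "content": "How do I rotate my API key?"},
        {"role": "assistant", "content": "Open Settings > API Keys and click Rotate."},
    ]
}

with open("train.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```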
Fine-tuning is excellent for:
- Specific Knowledge: Teaching the model about a particular niche topic not well covered in its initial training data.
- Desired Tone/Style: Making the model consistently respond in a certain way (e.g., a customer service agent, a witty poet).
However, fine-tuning is not suitable for dynamic, changing conversational memory because:
- It’s a one-time (or infrequent) training process, not a continuous learning loop.
- It’s costly and time-consuming.
- It doesn’t allow the model to “remember” individual user interactions or dynamically updated information unless that information was part of its fine-tuning dataset.
RAG is for dynamic, contextual memory; fine-tuning is for static, intrinsic knowledge and behavior.
The Road Ahead: Towards Longer Contexts and True “Memory”
The field of LLMs is evolving rapidly. We are seeing incredible progress in increasing context window sizes:
- Models like Google’s Gemini 1.5 Pro boast context windows of up to 1 million tokens (and even 2 million in preview), allowing them to process entire books, codebases, or multi-hour videos in a single prompt.
- Anthropic’s Claude 3 Opus can handle 200K tokens, with experimental larger contexts.
While these massive context windows significantly reduce the “forgetting” problem for many use cases, they don’t eliminate it entirely. The quadratic scaling still exists, and there’s a phenomenon where models can struggle with retrieving very specific pieces of information from extremely long contexts, sometimes referred to as the “needle in a haystack” problem.
Beyond simply larger windows, research is ongoing into more efficient attention mechanisms and novel architectures that could fundamentally change how LLMs handle long-term dependencies. The concept of AI agents that can autonomously manage their own memory using external tools, databases, and planning capabilities is also a promising frontier, moving us closer to truly “remembering” AI systems.
Note: While architectural improvements aim for efficiency, the underlying concept of a finite processing window remains. Truly infinite memory without external mechanisms is a very complex challenge, bordering on artificial general intelligence (AGI).
Conclusion
The apparent “forgetfulness” of LLMs isn’t a sign of cognitive deficiency but rather a direct consequence of their architectural design: fixed context windows constrained by computational power, and the stateless nature of their API interactions.
Understanding these limitations is paramount for anyone building or interacting with LLM applications. While the models themselves don’t inherently “remember” past conversations, the innovative application of prompt engineering and, crucially, external memory systems like Retrieval Augmented Generation (RAG) empowers developers to create intelligent agents that can maintain coherence, recall past facts, and learn from extended interactions, bridging the gap towards a more persistent and intelligent AI future.
References & Further Reading:
- OpenAI Token Explanations: OpenAI Tokenizer
- Transformers (Attention Mechanism): Attention Is All You Need - Original Paper