This is the first article in a planned series about AI. Here we’ll dip a bit into the technical side of LLM memory and context. We’ll break down why models so often forget facts or simply make them up.


How Attention Breaks Down in Long Contexts

Let’s start with the fundamentals. Transformers rely on self-attention: every token attends to every other token and decides what deserves focus. That attention is O(n²) in sequence length, which is why long-context models lean on memory-efficient kernels like FlashAttention-2/3 and on positional-encoding schemes like RoPE and ALiBi to stretch the window.
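To make the quadratic cost concrete, here is a minimal single-head sketch of scaled dot-product attention in plain NumPy (for clarity it reuses the input as queries, keys, and values; real models apply learned projections). The intermediate score matrix is n × n, so doubling the context quadruples the work:

```python
import numpy as np

def self_attention(X):
    """Single-head scaled dot-product attention over n token vectors.

    X: (n, d) matrix of token embeddings, reused as Q, K, and V.
    """
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)  # (n, n): every token scored against every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ X  # (n, d) contextualized token vectors

X = np.random.default_rng(0).normal(size=(8, 16))
out = self_attention(X)
print(out.shape)  # (8, 16); the score matrix in between was 8 x 8
```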

But long contexts come at a cost of their own. Positional biases creep in:

  • Primacy bias — the model remembers the beginning of the context best.
  • Recency bias — it also remembers the end fairly well.
  • Everything in the middle — gets weakened attention.

Imagine reading a 500-page book in one sitting. You remember the first chapter and the last one, but what was on page 247? The model works roughly the same way — except it won’t admit it forgot. Instead, it confidently makes something up.


Lost-in-the-Middle

The classic 2023 paper by Nelson Liu and colleagues from Stanford, “Lost in the Middle: How Language Models Use Long Contexts,” is familiar to almost everyone working with RAG. It showed a clear U-shaped curve: accuracy drops by 30–50% when the relevant information sits in the middle of a context containing 20–30 documents.

Models have improved since 2023, but the problem persists.

In 2025–2026, teams from MIT and other labs confirmed: the Lost-in-the-Middle effect still exists even in models with 1M+ token windows. The root cause is architectural — causal attention masking combined with decay in positional embeddings (especially RoPE). The longer the window, the larger the “middle” becomes, and the stronger the bias gets.

In simple terms: increasing the context window doesn’t solve the problem — it just enlarges the zone where the model gets confused and forgets.


Benchmarks 2025–2026

Needle-in-a-Haystack — the test originally popularized by Greg Kamradt in 2023. Hide a fact deep in a long text and ask the model to find it. Simple, visual, and very convincing.
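A minimal version of the test is easy to build yourself. The sketch below (function and variable names are mine, not from any benchmark) plants a needle at a chosen relative depth and produces the prompt you would send to the model at each depth; sweeping the depth and plotting accuracy gives the classic U-shaped heatmap:

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert `needle` at relative position `depth` in [0, 1]
    (0 = start of context, 1 = end) among the filler sentences."""
    assert 0.0 <= depth <= 1.0
    pos = round(depth * len(filler_sentences))
    sentences = filler_sentences[:pos] + [needle] + filler_sentences[pos:]
    return " ".join(sentences)

filler = [f"Filler sentence number {i}." for i in range(100)]
needle = "The secret code is 4711."

for depth in (0.0, 0.5, 1.0):
    haystack = build_haystack(filler, needle, depth)
    prompt = f"{haystack}\n\nQuestion: What is the secret code?"
    # answer = llm.invoke(prompt)  # model call omitted; score answers per depth
```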

By 2025–2026, it has evolved into a whole family of benchmarks:

  • U-NIAH (2026) — directly compares long-context LLMs versus RAG.
  • NeedleBench, BABILong, LooGLE — test not just finding a single fact, but synthesizing information scattered across different parts of the context.


Why Hallucinations Increase with Context Length

This raises a logical question: why does the model lie instead of saying “I don’t know”?

Here are three key insights from recent research:

Vectara Hallucination Leaderboard (HHEM-2.3, February 2026): On documents longer than 32K tokens, hallucination rates are consistently higher than on short ones. Models start inventing connections between facts that aren’t actually there. They see fragment A and fragment C, miss B in the middle, and create their own version of B.

OpenAI research (December 2025), “When More Becomes Less”: Increasing context length raises inference cost but does not improve quality. The model receives partial evidence and fills in the gaps plausibly. That is literally the definition of hallucination.

MIT (January 2025): The most interesting (and annoying) finding — models use more confident language precisely when they hallucinate in long contexts. The less real evidence the model sees, the more assertively it makes things up.

Simplified version: Long context → more noise → attention becomes diffuse → model fails to find the real fact → but it was trained to always provide an answer → so it starts getting creative.

This is not a bug in any specific model. It’s a consequence of how the transformer architecture works together with RLHF, which teaches the model to be helpful (i.e., always respond).


What Actually Helps in Production

Now let’s talk practical techniques.

Chunking — 70% of RAG Success

If you’re struggling with hallucinations in a RAG system, the first thing to check is how you’re splitting your documents.

Don’t trust the default splitter blindly.

  • RecursiveCharacterTextSplitter (LangChain) — the reliable workhorse. 400–512 tokens with 10–20% overlap. Simple, predictable, and great for structured text.
  • SemanticChunker — splits based on meaning using embeddings. Gives +15–20% accuracy on complex documents where logical blocks don’t align with paragraphs.
  • HierarchicalNodeParser (LlamaIndex) — multi-level chunks (2048 → 512 → 128). Lets the system first find the right section, then the right paragraph, then the right sentence.

The key idea: don’t shove 100K tokens into the model “just in case.” Give it 3–5 truly relevant 500-token chunks instead. Less noise = fewer hallucinations.

Example: Chunking + Reranking

Python

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_experimental.text_splitter import SemanticChunker

# 1. Recursive splitter with overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=100,  # ~20% of chunk_size
    separators=["\n\n", "\n", ". ", " ", ""]
)

# 2. Semantic splitter (for complex documents)
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)
semantic_splitter = SemanticChunker(embeddings)

# Build the vector store from the split chunks
chunks = text_splitter.split_documents(docs)  # docs: your loaded documents
vectorstore = FAISS.from_documents(chunks, embeddings)

# 3. Reranking: a cross-encoder scores (query, chunk) pairs jointly
reranker_model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-large")
compressor = CrossEncoderReranker(model=reranker_model, top_n=5)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(
        search_kwargs={"k": 20}  # retrieve 20, reranker keeps top 5
    )
)

Retrieve 20 chunks, let the reranker pick the top 5. This is critical. Without reranking, you send a bunch of semi-relevant chunks that only add noise.

Verification Layers

Even with perfect chunking, the model can still hallucinate. That’s why you need a second line of defense.

  • Self-Consistency / Chain-of-Verification (CoVe) — ask the model to generate an answer, then generate questions to verify it, then answer those questions.
  • Critic Agent — a separate (cheaper, smaller) LLM checks whether the answer is fully supported by the context. “Here’s the context, here’s the answer. Is everything in the answer grounded? Yes/No + explanation.”
  • Symbolic verification — after generation, run the output through a Knowledge Graph or a Pydantic schema with Guardrails. If the answer contains dates, numbers, or names — validate them programmatically.
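A bare-bones Chain-of-Verification loop looks like the sketch below. Here `ask` stands in for your actual LLM call (a function from prompt to text, passed in rather than taken from any specific library), so treat this as the pattern, not an API:

```python
def chain_of_verification(question, context, ask):
    """Chain-of-Verification (CoVe) pattern: draft, verify, revise.

    `ask` is your LLM call, a callable prompt -> text (an assumption
    of this sketch, not a library interface).
    """
    # 1. Draft an answer from the context
    draft = ask(f"Context: {context}\nQuestion: {question}\nAnswer:")

    # 2. Generate verification questions about the draft
    checks = ask(
        f"Answer to verify: {draft}\n"
        "List fact-checking questions, one per line:"
    ).splitlines()

    # 3. Answer each check against the context alone
    findings = [
        ask(f"Context: {context}\nQuestion: {q}") for q in checks if q.strip()
    ]

    # 4. Revise the draft in light of the findings
    return ask(
        f"Original answer: {draft}\n"
        f"Verification results: {findings}\n"
        "Rewrite the answer, removing anything unsupported:"
    )
```

In production you would typically route steps 1 and 4 to your main model and the cheaper verification calls in steps 2–3 to a smaller one.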

Agentic Workflows

This is the 2026 pattern. Instead of a single “context → answer” call, build a loop with multiple agents:

  1. Planner — breaks the task into subtasks.
  2. Retriever Agent — searches for each subtask separately.
  3. Executor — generates the answer.
  4. Verifier Agent — checks it and sends it back for revision if needed.

It’s more expensive and slower. But in production, where a hallucination can cost money or reputation, it pays off.

Here’s a minimal verification loop example:

Python

from langchain.agents import create_react_agent, AgentExecutor
from langchain.tools import Tool

# Assumes `llm` (a chat model) and `react_prompt` (a standard ReAct
# prompt template) are already defined elsewhere.

def verify_answer(payload: str) -> str:
    """A separate critic model checks the answer.

    ReAct tools receive a single string, so the input is expected
    as 'context ||| answer'."""
    context, _, answer = payload.partition("|||")
    prompt = (
        "Check whether the answer is fully supported by the context. "
        "Reply YES/NO + explanation.\n"
        f"Context: {context.strip()}\n"
        f"Answer: {answer.strip()}"
    )
    return llm.invoke(prompt).content

tools = [
    Tool(
        name="Verifier",
        func=verify_answer,
        description=(
            "Checks if the answer is grounded in the context. "
            "Input format: '<context> ||| <answer>'."
        )
    )
]

agent = create_react_agent(llm, tools, prompt=react_prompt)
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    max_iterations=5  # no more than 5 revision attempts
)

Hybrid Neuro-Symbolic Approaches

For domains with strict accuracy requirements — law, medicine, finance — there’s one more level.

Combine LLM + Knowledge Graph + symbolic verifier. The LLM generates, the graph checks facts and relationships, and the symbolic layer validates logic.

2025 research shows 60–80% reduction in hallucinations in such domains. However, this requires heavy infrastructure: a knowledge graph, formal rules, and a team to maintain it.
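As a toy illustration of the symbolic layer (the graph and the triples here are invented for the example; a real system would use an actual graph store plus an extraction model), facts pulled from the LLM’s answer can be checked against a curated triple set:

```python
# Tiny in-memory knowledge graph: (subject, relation, object) triples.
KG = {
    ("aspirin", "treats", "headache"),
    ("aspirin", "interacts_with", "warfarin"),
}

def check_claims(claims):
    """Split extracted claims into supported and unsupported ones."""
    supported = [c for c in claims if c in KG]
    unsupported = [c for c in claims if c not in KG]
    return supported, unsupported

# Claims an extraction step pulled from the model's answer
claims = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "treats", "insomnia"),  # hallucinated relation
]
ok, bad = check_claims(claims)
print(bad)  # unsupported claims get flagged for rejection or revision
```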

If you’re building a chatbot for an online store, this is probably overkill.


Bottom line: Feed it less — get better results.

If this article felt too technical but the topic interests you, let me know — I’ll prepare a simpler, more beginner-friendly version that explains everything in plain language.