Beyond Retrieval: World Models for Knowledge Memory
Retrieval finds stored items. Memory generalises from experience. The gap between them defines the next research problem in AI knowledge systems.
A system that has learned the generative structure of a knowledge domain can answer questions that no stored document addresses. This is not a product claim. It is a research hypothesis we intend to test.
Three posts in, we have established a set of distinctions. Retrieval is not memory: retrieval returns stored items by similarity; memory generalises by applying learned structure to unseen inputs. Privacy is an architecture problem: encrypting vectors does not protect documents, and jurisdiction matters more than ciphertexts. Graph structure is not an audit trail: seeing the knowledge graph does not explain what the LLM did with it. These are critiques of the current generation. This post asks the harder question: what comes after?
The retrieval ceiling
Current RAG systems, including the one we run in production, share a fundamental property: they do not learn. A retrieval system with ten thousand documents is not smarter than the same system with one thousand. It has broader coverage. It can surface more relevant passages. But it has not understood anything about the domain. It has no model of how concepts in that domain relate, evolve, or constrain each other.
This is not a shortcoming that better indices fix. Dense retrieval, sparse retrieval, hybrid retrieval, graph-structured retrieval: all of them are search over a fixed representation. The representation is computed at index time. It does not change when the system encounters a query it could not answer. It does not reorganise when the corpus shifts underneath it. It is a lookup table with varying degrees of sophistication in the lookup.
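To make the "lookup table" claim concrete, here is a minimal sketch of what every flavour of dense retrieval reduces to: vectors fixed at index time, ranked by similarity at query time. The document names and embeddings are invented toy values, not a real index.

```python
import math

# Toy index, fixed at "index time". A real system would use a learned
# encoder; here each document is a hand-made vector (hypothetical data).
index = {
    "din-4102-fire-resistance": [0.9, 0.1, 0.0],
    "gdpr-art-6-lawfulness": [0.1, 0.8, 0.3],
    "directive-2003-88-working-time": [0.0, 0.3, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, k=1):
    # The representation never changes at query time: retrieval is
    # ranking over vectors computed at index time, nothing more.
    ranked = sorted(index, key=lambda doc: cosine(query_vec, index[doc]),
                    reverse=True)
    return ranked[:k]

top = retrieve([0.85, 0.2, 0.05])  # nearest stored item, not an inference
```

However sophisticated the encoder or the index structure, nothing in this loop updates when a query goes unanswered.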
For many workloads this is adequate. A fire-safety agent that retrieves the correct DIN norm paragraphs for a given building configuration does not need to learn. It needs fast, accurate retrieval over a stable corpus. We ship this today. It works.
For other workloads it is not adequate. Consider a compliance agent working across EU regulatory frameworks: directives, national transpositions, implementing regulations, court interpretations, across 27 member states, evolving over time. The relationships between these sources are not co-occurrence in embedding space. They are hierarchical constraints, temporal supersession chains, cross-domain dependencies, and jurisdictional scoping rules. A retrieval system can find relevant documents. It cannot reason about which prior interpretation still holds after an amendment, or predict which downstream regulations a proposed directive change will invalidate. That requires understanding the domain, not indexing it.
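One of the relationships named above, temporal supersession, can be written down explicitly as a sketch. All provision and amendment names below are invented; the point is that the rule is hand-coded here, whereas the research question is whether a model can learn such rules from the corpus.

```python
# Hypothetical supersession chain: each entry maps a provision version to
# the amendment that superseded it (None = still in force).
superseded_by = {
    "directive-art-4-v1": "amendment-2023-art-4",
    "amendment-2023-art-4": "amendment-2025-art-4",
    "amendment-2025-art-4": None,
}

def current_version(provision):
    """Walk the supersession chain until the version still in force."""
    seen = set()
    while superseded_by.get(provision) is not None:
        if provision in seen:  # guard against a malformed cyclic chain
            raise ValueError("cyclic supersession chain")
        seen.add(provision)
        provision = superseded_by[provision]
    return provision
```

A retrieval system would happily return the 2023 amendment for a matching query; answering "which version holds today" requires traversing the chain, and answering "which interpretations become unstable after this amendment" requires knowing the rules that generate it.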
The intelligent librarian
A human librarian who has worked thirty years in a specialised collection does not retrieve by keyword. She has built an internal model of her domain. She anticipates what you will need before you finish asking. She knows that if you are researching Topic A, you will inevitably need to understand Topic B, because C constrains them both. She can tell you that a source has been superseded without looking it up, because she understands the rules by which supersession operates in that field. She gets better with every interaction, not merely broader.
This is not retrieval. This is a world model of a knowledge domain.
Yann LeCun’s A Path Towards Autonomous Machine Intelligence proposes world models as a core component of intelligent systems. The argument: an agent that can predict the consequences of actions in an internal model is fundamentally more capable than one that can only react. The paper describes a Joint Embedding Predictive Architecture (JEPA) that learns representations by predicting in abstract representation space, not by generating tokens or pixels.
This is not a theoretical proposal waiting for validation. JEPA is expanding across domains at pace. V-JEPA 2, released by Meta in 2025, trains a 1.2 billion parameter model on over one million hours of video and achieves zero-shot robot control in unseen environments. It learns physical causality: object permanence, gravity, collision dynamics. LLM-JEPA, co-authored by LeCun and published at ICLR 2026, demonstrates that adding a JEPA objective to standard language model training outperforms next-token prediction alone across Llama3, Gemma2, and OLMo models. Graph-JEPA applies the same principle to graph-structured data: mask a subgraph, predict its representation from context. JEPA-DNA applies it to genomics, solving what the authors call the “granularity trap”: models trained on masked tokens learn local syntax but miss global biological function. JEPA training forces the model to predict higher-level functional embeddings instead.
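The shared skeleton of these systems can be sketched in a few lines: encode the visible context, encode the masked target with a separate encoder (EMA-updated in practice), predict the target representation from the context, and score the prediction in embedding space. No tokens or pixels are generated. All weights and inputs below are toy values, not a trained model.

```python
# Minimal sketch of the JEPA objective: predict in representation space.

def linear(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def l2_loss(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target))

W_context = [[0.5, 0.1], [0.2, 0.7]]  # context encoder (toy weights)
W_target = [[0.5, 0.1], [0.2, 0.7]]   # target encoder (EMA copy in practice)
W_pred = [[1.0, 0.0], [0.0, 1.0]]     # predictor head

visible, masked = [1.0, 0.0], [0.0, 1.0]  # e.g. unmasked / masked subgraph

z_ctx = linear(W_context, visible)
z_tgt = linear(W_target, masked)      # computed without gradient flow
prediction = linear(W_pred, z_ctx)

loss = l2_loss(prediction, z_tgt)     # train by minimising this
```

What varies across V-JEPA, Graph-JEPA, and LLM-JEPA is what "visible" and "masked" mean: video patches, subgraphs, or token spans. The objective stays the same.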
The pattern: every domain with structure that goes beyond surface-level token co-occurrence benefits from predictive learning in representation space. The expansion from vision to video to language to graphs to genomics follows a clear trajectory. The next domain with rich, formal, hierarchical structure that existing approaches fail to fully capture is knowledge.
Now consider the same principle applied to a regulatory knowledge domain. Instead of predicting the next video frame, the system predicts relationships between legal concepts, or between engineering specifications and failure modes. The representations are not pixel embeddings but concept embeddings drawn from a structured corpus. The training signal is not “predict the masked patch” but “predict the missing constraint” or “predict the consequence of this change.”
A world model for EU employment law would not retrieve paragraphs. It would hold a learned representation of how employment-law concepts relate: which provisions constrain which others, which court decisions narrowed or broadened a statutory term, which temporal rules govern supersession. Given a novel fact pattern, the model would predict which legal constraints apply, even if no stored document describes that exact situation. Given a proposed amendment, it would predict which existing interpretations become unstable.
We do not know whether this is achievable. The components exist separately: Graph-JEPA proves the architecture works on graphs; LLM-JEPA proves it works for language; JEPA-DNA proves it captures global structure that token-level training misses. What has not been demonstrated is the combination: JEPA-style training on structured knowledge corpora, where the “patches” are not image regions or token spans but legal concepts, regulatory relationships, and temporal constraints.
The training-data problem is real but takes a different shape than in vision: there are not millions of unlabelled samples, but there are thousands of formally structured documents with explicit hierarchical and temporal relationships. The evaluation problem is equally hard: there is no ImageNet for legal reasoning. We do not know what "accuracy" means for a world model of employment law. And the fundamental question remains open: JEPA trains on perceptual units (spatial patches, token spans). Knowledge operates on logical units (constraints, implications, hierarchies). Whether the architecture needs modification to handle this difference, or whether it generalises as-is, is the core research question.
Why EU regulatory structure is a uniquely suited domain
European regulation is arguably the most structurally complex knowledge system that exists in codified form. A single compliance question may require traversing:
- An EU Directive (setting objectives)
- National transposition laws (27 variants, each with different implementation choices)
- Implementing regulations (technical detail)
- Court interpretations (ECJ and national courts, creating precedent chains)
- Temporal supersession (amendments that invalidate prior interpretations)
- Cross-domain interactions (where labour law meets data protection meets corporate governance)
This is not a flat corpus. It is a multi-layered, temporally evolving, hierarchically constrained knowledge system with formal structure. For retrieval, this complexity is a curse: more documents, more ambiguity, more false positives. For a predictive model, this complexity is a training signal. The richer the structure, the more there is to learn.
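The layers above can be rendered as typed relations rather than a flat document set; this is roughly the structure a predictive model would train on. The nodes and relation names below are invented for illustration.

```python
# Hypothetical typed edges across the regulatory layers described above.
edges = [
    ("de-transposition-law", "transposes", "eu-directive"),
    ("fr-transposition-law", "transposes", "eu-directive"),
    ("implementing-reg-7", "implements", "eu-directive"),
    ("ecj-case-c-55-18", "interprets", "eu-directive"),
    ("amendment-2024", "supersedes", "implementing-reg-7"),
]

def related(node, relation):
    """All sources standing in `relation` to `node`."""
    return sorted(src for src, rel, dst in edges
                  if rel == relation and dst == node)

# A single compliance question touches several layers at once:
transpositions = related("eu-directive", "transposes")
interpretations = related("eu-directive", "interprets")
```

For retrieval, each edge type is just another source of candidate documents. For a predictive model, each edge type is a distinct regularity to learn: masking `("amendment-2024", "supersedes", ?)` and predicting the target is exactly the Graph-JEPA training signal, applied to regulatory structure.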
There is a deeper reason why regulatory knowledge may be particularly suited to JEPA-style training. LLM-JEPA is most effective on data that provides “multiple views of the same underlying knowledge”: text and code representing the same functionality, for instance. Regulatory domains have this structure natively. A statutory provision and the court decision interpreting it are two views of the same legal concept. An EU directive and its national transposition are two views of the same regulatory intent. A norm text, its academic commentary, and its practical application are three views at different levels of abstraction. This multi-view structure is not something we would need to engineer. It is how legal knowledge is organised.
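The multi-view property can be stated as an objective: two views of the same legal concept (a provision and the decision interpreting it) should land near each other in embedding space, while views of different concepts should not. The embeddings below are toy values standing in for a trained encoder, and the margin is illustrative.

```python
# Sketch of a multi-view alignment objective over legal concepts.

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

views = {
    ("working-time-art-6", "statute"): [0.9, 0.1],
    ("working-time-art-6", "ecj-case"): [0.85, 0.15],  # same concept, other view
    ("data-protection-art-5", "statute"): [0.1, 0.9],  # different concept
}

positive = sq_dist(views[("working-time-art-6", "statute")],
                   views[("working-time-art-6", "ecj-case")])
negative = sq_dist(views[("working-time-art-6", "statute")],
                   views[("data-protection-art-5", "statute")])

# A multi-view objective pushes `positive` below `negative` by a margin:
margin = 0.5
alignment_loss = max(0.0, positive - negative + margin)
```

In vision this pairing has to be manufactured through augmentations; in regulatory corpora the pairs (statute and case law, directive and transposition, norm and commentary) already exist as curated, labelled structure.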
Adjacent research supports the premise from multiple directions. GraphCompliance demonstrates that aligning policy knowledge graphs with context graphs outperforms both LLM-only and RAG baselines by 4-7 percentage points on GDPR compliance scenarios. A research agenda for Regulation-Aware Neuro-Symbolic Legal World Models proposes “Legal Digital Twins” that simulate regulatory environments using explicit rule systems. We share their diagnosis of the problem but differ in approach: where they encode rules explicitly, we investigate whether rules can be learned from structured corpora through predictive training. The gap between “predict a missing link in a knowledge graph” and “learn the generative rules of a regulatory domain” remains wide. But the trajectory from Graph-JEPA to a domain-specific Knowledge-JEPA is shorter than the trajectory from nothing.
Privacy-preserving memory as a separate research axis
If a knowledge system performs learned computation rather than storage and lookup, the privacy problem changes shape.
In a retrieval system, privacy is primarily an access-control problem. Documents are stored. Queries match against stored representations. The boundary is clear: who can access which documents? Tenant isolation, jurisdiction control, and access-control lists handle most of the problem. We argued in Privacy in RAG Is an Architecture Problem that this architectural approach is more verifiable than cryptographic approaches that encrypt vectors but leave documents exposed.
A world model complicates this. The model does not store documents. It learns a compressed representation of a knowledge domain. The privacy-relevant question is no longer “who can read which document” but “what did the model learn from which data, and can that learning be attributed or reversed?” This is the machine-unlearning problem, and it is unsolved in general.
If the model serves multiple tenants, the problem deepens. Tenant A’s regulatory corpus and Tenant B’s regulatory corpus both contribute to the model’s understanding. Can the model serve Tenant B without leaking structural insights derived from Tenant A’s documents? Federated graph learning offers partial answers: models trained across organisational boundaries without centralising data. Differential privacy offers formal guarantees at a cost to model quality. Confidential computing offers hardware-level isolation. None of these are sufficient alone, and their combination at production quality does not yet exist.
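The differential-privacy ingredient mentioned above has a well-known recipe at the level of a single update: clip each tenant's contribution to a fixed norm, then add calibrated noise before it touches the shared model. The sketch below shows the mechanics only; the parameters are illustrative, not a tuned privacy budget, and real guarantees require accounting across all updates.

```python
import math
import random

def clip(update, max_norm):
    """Bound one tenant's influence on the shared model."""
    norm = math.sqrt(sum(u * u for u in update))
    if norm > max_norm:
        scale = max_norm / norm
        return [u * scale for u in update]
    return update

def privatise(update, max_norm=1.0, noise_std=0.1, rng=None):
    """Clip, then add Gaussian noise calibrated to the clipping norm."""
    rng = rng or random.Random(0)
    clipped = clip(update, max_norm)
    return [u + rng.gauss(0.0, noise_std) for u in clipped]

tenant_a_update = [3.0, 4.0]  # norm 5.0, will be clipped to norm 1.0
private_update = privatise(tenant_a_update)
```

The cost the post mentions is visible here: the same clipping and noise that bound what Tenant B can infer about Tenant A's corpus also degrade the signal the model learns from.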
This is why we treat privacy-preserving memory as a separate research axis, not a feature of the retrieval layer. The retrieval layer has a known privacy model: access control over stored items. The reasoning layer, if one ever exists, will require a different privacy model: formal guarantees over learned representations. These are different engineering problems with different threat models.
Scalable memory that improves with use
The third axis follows from the first two. A retrieval system that ingests more documents gets broader coverage but not deeper understanding. A system with a world model would get both. The mechanism would be some form of continual learning: the model updates its internal representation as new documents arrive, as queries reveal gaps, as outcomes signal which predictions were correct and which were not.
Continual learning in neural networks is an active research area with well-known failure modes. Catastrophic forgetting remains the central challenge. Regularisation methods offer partial mitigation. Replay buffers offer another. None are production-grade for the kind of knowledge system we are describing.
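The replay-buffer mitigation is simple enough to sketch: keep a bounded, uniformly sampled reservoir of past examples and mix them into every new training batch so old structure is rehearsed. Capacity and data below are illustrative.

```python
import random

class ReplayBuffer:
    """Reservoir-style buffer: a bounded uniform sample of everything seen."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = rng or random.Random(0)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Reservoir sampling: every item ever seen ends up in the
            # buffer with equal probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item

    def sample(self, k):
        return self.rng.sample(self.items, min(k, len(self.items)))

buf = ReplayBuffer(capacity=100)
for doc_id in range(1000):  # stand-in for a stream of training examples
    buf.add(doc_id)
# A training step would mix buf.sample(n) with the newest examples.
```

The limitation is equally visible: the buffer is a lossy sample. Whatever falls out of it is exactly what the model is free to forget, which is why this remains a mitigation rather than a solution.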
The retrieval system, by contrast, does not forget. A document ingested today is retrievable tomorrow. This is a real advantage. It is also the reason that retrieval remains the practical choice for production workloads. The memory system, if it exists, would need to match this reliability while adding the ability to improve. That is a hard constraint, and we do not yet know where the boundary sits.
Honest scope
We do not know if this works. We want to find out.
None of what we have described in this post is shipped. None of it has a timeline. These are research hypotheses we are investigating alongside a production knowledge platform that ships retrieval with known properties and known limits.
What we know: JEPA works for vision, video, language, graphs, and genomics. Graph-level masked prediction learns meaningful representations. Multi-view training on language outperforms single-objective approaches. What we do not know: whether the architecture transfers to knowledge domains where the fundamental units are logical (constraints, implications, hierarchies) rather than perceptual (patches, tokens, subgraphs). Whether regulatory corpora of thousands of documents provide sufficient training signal. Whether the result is evaluable in any meaningful sense. These are not engineering problems. They are research questions that may have negative answers.
The production system is honest about what it does: it retrieves context from structured indices, returns sources, logs queries, and tells the operator when confidence is low. The research direction asks what would need to be true for those limits to shift. These are different activities with different standards. We do not blur them.
The question we are trying to answer: Can JEPA-style predictive models, trained on structured regulatory corpora, learn the generative rules of a knowledge domain well enough to predict consequences, detect contradictions, and generalise beyond stored facts, while preserving formal privacy guarantees across tenants? If yes, that changes what knowledge infrastructure means. If no, we will have learned something precise about where the boundary between retrieval and reasoning actually sits, and that knowledge has value too.
Why the architecture is designed for this boundary
Enchilada’s engine layer is pluggable. LightRAG today, LazyGraphRAG or a pure vector engine tomorrow, with no API change at the proxy. The proxy returns RetrievedContext, not Memories. The type system enforces the distinction.
This was not an accident. If a reasoning layer becomes practical, it will slot in alongside retrieval, not replace it. A query might draw on both: the retrieval engine surfaces relevant stored items; the world model predicts relationships not present in the corpus. The results merge at the proxy level. The calling application sees a unified response. The architecture keeps the categories separate because they are separate.
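The boundary described above can be sketched as types. `RetrievedContext` is the name the post uses; everything else here (field names, `PredictedRelation`, the stub engine) is a hypothetical rendering, not Enchilada's actual API.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass(frozen=True)
class RetrievedContext:
    source_id: str
    passage: str
    score: float

class RetrievalEngine(Protocol):
    """Any engine is pluggable if it returns RetrievedContext."""
    def query(self, text: str) -> list[RetrievedContext]: ...

@dataclass(frozen=True)
class PredictedRelation:  # what a future world model might emit
    subject: str
    relation: str
    confidence: float

@dataclass(frozen=True)
class ProxyResponse:
    retrieved: list[RetrievedContext]
    predicted: list[PredictedRelation]  # empty until a reasoning layer exists

class StubEngine:
    def query(self, text: str) -> list[RetrievedContext]:
        return [RetrievedContext("doc-1", "stub passage", 0.9)]

def answer(engine: RetrievalEngine, text: str) -> ProxyResponse:
    # The engine behind the proxy can change; the response type cannot.
    return ProxyResponse(retrieved=engine.query(text), predicted=[])

resp = answer(StubEngine(), "which provisions apply?")
```

The design choice is that the two result types never merge into one: stored items and predicted relationships arrive in separate fields, so a calling application always knows which claims have a source and which are inferences.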
We are building the retrieval layer now. We are investigating the reasoning layer alongside it. If you are running workloads that pressure the boundary between retrieval and understanding, we are interested in hearing about them.
Sources
- arxiv A Path Towards Autonomous Machine Intelligence (LeCun, 2022)
- arxiv LLM-JEPA: Large Language Models Meet Joint Embedding Predictive Architectures (Huang, LeCun, Balestriero, 2025)
- arxiv Graph-JEPA: Graph-level Representation Learning with Joint-Embedding Predictive Architectures (2023)
- arxiv JEPA-DNA: Grounding Genomic Foundation Models through JEPA (2026)
- arxiv V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning (Meta, 2025)
- paper Regulation-Aware Neuro-Symbolic Legal World Models for Cross-Border Commerce
- arxiv GraphCompliance: Aligning Policy and Context Graphs for LLM-Based Regulatory Compliance (2025)
- blog Retrieval Is Not Memory
- blog Privacy in RAG Is an Architecture Problem