Two weeks ago, I came across my first mention of a “world model” outside of robotics, where the term is arguably already overused. Maybe others have used it before and I haven’t been paying attention; either way, it made a compelling case for organizations that work more like control-loop systems. The inclination to be at the vanguard of tech, and the luxury of a clean, measurable output signal (money) to feed back into the system, influence its decisions, and mutate the setup in ways that measurably improve the output, make Block an outlier, though.
Almost everyone else is in the literal trenches of AI adoption, on a gradient ranging from “Claude licenses for everyone” to “outsourced entire flows or departments to agents”. The issues range accordingly, from not reaping benefits because the mechanics of the organization aren’t evolving, to “what are these things even doing?”. Both camps hit walls that look different but ultimately point to the same underlying truth: organizations are messy, the link between output and the actual success metrics is convoluted, companies are never temporally stable (nor can they afford to be), and workflows bend and twist around these changes in the most creative or banal ways.
Enter: ontologies, RAG, embeddings, graph traversal - all intellectually stimulating topics to talk about, but it’s easy to get lost without reflecting on the why. We already have an intuition that we somehow have to capture reality and collapse it into these tools to make it machine-readable. But the concomitant need for everything to change is driving a significant amount of additional tension into the requirement. It’s not just giving an AI a flowchart to follow. It’s giving it a flowchart inside an ever-changing internal representation of the organization.
We’re here not because the industry tried different solutions and failed, but rather because the sequence of innovation in the LLM space (or what we broadly call AI now) had to lead to this. That’s the charitable read. The not-so-charitable read is that the industry landed on tools before addressing the substrate because tool usage makes for better demos and a more compelling glimpse into the future than some abstract representation of knowledge.
It boils down to the arc of prompt engineering (now a bit comical in retrospect), prompt intentionality, wrapping LLMs into agents with tools and their respective personas, and circulating .md files marketed as “employees”, inevitably leading to the realization that agents really lack common sense and intuition.
There’s a wall you hit when you operate in an environment that has a certain structural truth and base your behaviour on first principles alone. Most discussions with agents are “in principle”. They’re not directly relevant if you don’t inject an entire memory lane into the context. It’s not on Notion. It’s not in BambooHR. It’s not in the way GitHub conventions are encoded or what Actions are set up. This structural truth is encoded in how things work together. In how decisions lead to outcomes that lead to new decisions - leading to better outcomes or maybe worse ones.
Where we are with agents is a softer version of the age-old garbage-in/garbage-out principle in AI. Maybe even a more insidious version of it. Garbage-in is not bad data; it is the absence of data, plus first principles that are, at most (and charitably), a partial fit.
Humans pick up on how things work, involuntarily, and they develop this organizational intuition. They have random coffee chats that uncover more of this structural truth. They navigate internal hurdles, inefficiencies, quirks and continuously update their world model. We take it for granted because it comes naturally.
We build a lot of dirt roads masquerading as paved roads because human intuition and communication were doing the heavy lifting. An agent navigating these roads lurches from pothole to pothole. Policies that are outdated. Org structures that don’t exist anymore, or exist but do something different now. Decision-level context that changed. There’s a gradient of fidelity from org structure all the way down to decision-level structure that we haven’t quite solved for yet. And it sits at the intersection of systems rather than in the systems themselves. With every new system or process, the problem space doesn’t stay additive, it becomes multiplicative. Thankfully, the value we extract from solving this does too.
A refund over $25,000 looks policy-compliant in the finance system, but approval actually depends on contract carveouts in CRM, an active fraud review in Zendesk, a regional exception the ops lead granted in Slack, and a temporary controller delegation recorded in email. The agent can retrieve each artifact separately, but we’re relying on it to figure out how they relate. Before you know it, you’re writing if-else statements disguised as prompts so it knows who actually has authority, which rule takes precedence, or whether the exception is still in force.
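The if-else soup that scenario produces can be made concrete. Below is a minimal sketch, in Python, of what the cross-system refund check actually has to join; every system name, field, and threshold is made up for illustration, not drawn from any real product’s API:

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical artifacts, one per system. In reality each would be
# retrieved separately (finance DB, CRM, Zendesk, Slack, email archive).
@dataclass
class Artifact:
    system: str
    payload: dict

def refund_approvable(amount: float, artifacts: list[Artifact]) -> bool:
    """Naive cross-system check: each branch below is a rule that would
    otherwise end up as an if-else statement disguised as a prompt."""
    by_system = {a.system: a.payload for a in artifacts}

    # Finance: the only policy the agent can see directly.
    if amount <= by_system["finance"]["auto_approve_limit"]:
        return True

    # Zendesk: an active fraud review blocks approval outright.
    if by_system.get("zendesk", {}).get("fraud_review_open", False):
        return False

    # CRM: a contract carveout can raise the limit for this customer.
    carveout = by_system.get("crm", {}).get("refund_carveout_limit", 0)

    # Slack: a regional exception only counts while it is still in force.
    exception = by_system.get("slack", {})
    exception_active = exception.get("expires", date.min) >= date.today()

    # Email: a temporary controller delegation changes who can sign off.
    delegated = by_system.get("email", {}).get("controller_delegation", False)

    return amount <= carveout or (exception_active and delegated)
```

The point of the sketch is what it costs: every branch encodes knowledge that lives in a different system, none of it is discoverable from the finance policy alone, and a change in any one source silently invalidates the rule.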
That’s the shape of the problem. It’s why we are starting to talk about human-in-the-loop systems. HITL in principle is fine. HITL as it’s getting deployed is a band-aid and should be recognized as such. It positions humans as janitors rather than orchestrators. As we flood some poor knowledge worker’s inbox with all sorts of decisions to verify, greenlight, and contextualize, it’s time we also had a discussion on how to model the environment and address this bottleneck in model cognition. We’re too inclined to treat humans as semaphores in real-time agent decision-making and not inclined enough to look at how humans pollinate the substrate and only step in as a last resort.
The good news is that the tooling, harnesses and capabilities of LLMs allow us to do something meaningful in this space. The bad news is that we can’t continue to wave our hands and reach for other demos. This is the wall. Cross-ontology reasoning that goes beyond the horrible piles of mappings and rules - the “old world” - is now feasible. It’s not perfect. Mapping ontologies to context graphs and doing something useful with them is, if anything, an art form, and most tooling out there offloads the actual hard part onto developers. But it’s possible. And without it, we’ll have agents with flimsy intent compounding problems at machine speed.
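To make “context graph” slightly less abstract, here is a minimal sketch. All node and relation names are entirely illustrative: nodes are (system, id) pairs, edges are typed relations, and each system contributes only its own slice. The traversal answers a question that no single system can answer on its own:

```python
from collections import defaultdict

# Adjacency list keyed by (system, id) node; values are (relation, node) pairs.
edges = defaultdict(list)

def link(src, relation, dst):
    edges[src].append((relation, dst))

# Each system contributes its own slice of the graph.
link(("email", "alice"), "delegates_to", ("hr", "bob"))           # email record
link(("hr", "bob"), "member_of", ("org", "finance-emea"))         # org chart
link(("org", "finance-emea"), "owns_policy", ("fin", "refunds"))  # finance system

def reachable(start, goal, seen=None):
    """Depth-first traversal across edges that live in different source
    systems: is there any chain of relations from `start` to `goal`?"""
    if start == goal:
        return True
    seen = seen if seen is not None else set()
    seen.add(start)
    return any(
        reachable(dst, goal, seen)
        for _, dst in edges[start]
        if dst not in seen
    )
```

A real system would carry edge semantics (validity windows, precedence, provenance) rather than treat every relation as equal, but even this toy version shows where the value is: the answer to “can Alice’s delegation reach the refund policy?” exists only in the joins between systems, never inside one of them.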