Context vs Knowledge Graphs

In my previous piece I hinted at distinguishing fact from observation. And this is where a lot of the thought leadership in the space gets a bit confused. We’re sometimes faced with knowledge graphs masquerading as context graphs. Paris is the capital of France. Aspirin treats headache. Relationships follow a defined schema or ontology, and the graph is meant to be queried like a database. “Show me places in my vicinity to have a beer” makes for a fun demo, as does “tell me who owns this system or technical artifact”.

The masquerade is getting harder to spot because the vocabulary is converging. Take recent work on “automatic event ontology construction”, for instance. It ticks all the boxes: dynamic, incremental, LLM-driven. Then it quietly produces the same thing: a stable knowledge graph that collapses the probabilistic and deterministic layers into one and throws human cleanup into the mix. Sophisticated construction, same wall.

The differences are pretty subtle. First, answering what is “true” requires different mechanics than answering what is “relevant, right now, given who’s asking and what they’re doing”. Second, a “true” that changes slowly and deliberately (think curation or validation) is different from something that mutates constantly with each new decision or event. We’re intuitively aware of this, hence the knee-jerk reaction to pull human-in-the-loop (HITL) out of our bag of tricks and paper over it (badly).

A refund over $25,000 looks policy-compliant in the finance system, but approval actually depends on contract carveouts in the CRM, an active fraud review in Zendesk, a regional exception the ops lead granted in Slack, and a temporary controller delegation recorded in email. What a mess! The agent can retrieve each artifact separately, but we’re relying on it to figure out how they relate. Before you know it, you’re writing if-else statements disguised as prompts so it knows who actually has authority, which rule takes precedence, or whether the exception is still in force.

Here’s where retrieval-augmented agents fall apart. Let’s break it down:

  1. The finance system says compliant. That’s a fact.
  2. The actual approval state isn’t a fact sitting in one system. It’s a structure to be assembled. The CRM carveout affects policy. The fraud review suspends authority pending resolution. We can go infinitely deep here. My point is: the relationships (“modifies”, “suspends”, and so on) exist only in the context of this specific decision and scenario, as sketched below.
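
To make that second point concrete, here’s a minimal sketch of what a decision-scoped context graph could look like. The entity kinds, relation names (“modifies”, “suspends”) and source systems are illustrative, not a proposed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass(frozen=True)
class Artifact:
    id: str          # stable id after entity resolution
    source: str      # system of record: "crm", "zendesk", "slack", "email", ...
    kind: str        # "carveout", "fraud_review", "exception", "delegation", ...

@dataclass(frozen=True)
class Relation:
    subject: str     # Artifact.id
    predicate: str   # "modifies", "suspends", "delegates_to", ...
    object: str      # Artifact.id or a policy/decision id
    decision_id: str # the relation exists only in the context of this decision
    valid_from: datetime | None = None
    valid_to: datetime | None = None

@dataclass
class ContextGraph:
    decision_id: str
    artifacts: dict[str, Artifact] = field(default_factory=dict)
    relations: list[Relation] = field(default_factory=list)

    def add(self, artifact: Artifact) -> None:
        self.artifacts[artifact.id] = artifact

    def relate(self, subject: str, predicate: str, obj: str, **window) -> None:
        self.relations.append(Relation(subject, predicate, obj, self.decision_id, **window))

# Usage, matching the refund example: the fraud review suspends approval authority.
g = ContextGraph(decision_id="refund-4471")
g.add(Artifact("zendesk:case-99", "zendesk", "fraud_review"))
g.add(Artifact("policy:refund-approval", "finance", "policy"))
g.relate("zendesk:case-99", "suspends", "policy:refund-approval")
```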

Part 1 - The worst rule engine ever

Here’s what many of these systems miss. Retrieval gets you the individual artifacts. The agent fetches a bunch of entities, dumps them into context, and the engineer has to make an offering to whatever divine process helps the agent infer the precedence a human compliance officer would derive from years of tacit knowledge.

And it gets better: when it gets it wrong, you patch the prompt, naturally. “If there’s an active fraud review, ignore exceptions, unless…”. Congratulations, you’ve invented the worst version of a rule engine, with no audit trail. No, testing or versioning prompts won’t help here.

The pragmatic tell: when prompts start looking like policy logic, you’ve hit the wall. The reason this collapses into if-else-disguised-as-prompts is that you’re asking the LLM to do two jobs at once:

  1. Build the context graph. Resolve entities across systems, type relationships, apply temporal logic, evaluate delegation chains, and so on.
  2. Reason over it.

These are two different jobs. You want #1 to be watertight. Deterministic. Something you can show an auditor. #2 is where judgment lives.
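
A minimal sketch of the split, with every function stubbed out; the names (assemble_context, decide) are mine, the point is only where the boundary sits:

```python
from datetime import date

def assemble_context(decision_id: str, as_of: date) -> dict:
    """Job #1: deterministic. Entity resolution, temporal validity, policy
    evaluation. Plain data in, plain data out: testable, versionable, auditable."""
    return {
        "decision_id": decision_id,
        "as_of": as_of.isoformat(),
        "amount": 26_000,
        "active_fraud_review": True,           # resolved from Zendesk, say
        "exception_applies": False,            # a date comparison (see Part 2)
        "requires_controller_approval": True,  # a threshold rule (see Part 2)
    }

def decide(situation: dict, llm_judgment) -> dict:
    """Job #2: judgment. The model reasons over an already-assembled situation;
    it never assembles the facts itself."""
    return llm_judgment(situation)

# Usage: swap the lambda for a real model call.
print(decide(
    assemble_context("refund-4471", date(2025, 4, 2)),
    lambda s: {"recommendation": "hold", "rationale": "fraud review suspends authority"},
))
```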

Now here’s where the actually hard version of the problem lives, and where a lot of the “just build a context graph” hand-waving falls over. The graph is a living structure. The instances (in our example: the refund, the amount, the form) and the schema (what even is a refund?) both drift. Which brings me to…

Part 2 - The Janitor Problem

The context graph changes over time. What if the refund isn’t $25k? What if the definition of a refund changed? Gift cards? (OK, a $25k gift card isn’t realistic, but bear with me.) How do those work?

What has to be deterministic is narrower than most people assume, and we should be wary of the siren song of injecting LLM judgment everywhere just because we live in a world where cognition is becoming increasingly cheap.

Think pure bookkeeping: identity resolution, temporal validity, provenance. “The Slack exception granted March 3rd was valid through March 31st. The refund request is dated April 2nd. Therefore the exception does not apply.” This is not judgment, it’s a date comparison. Cross this threshold into judgment and you have no audit trail; you’ve lost the game before reasoning even starts.
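
That “it’s a date comparison” claim is literal. A tiny sketch of the example above (the years are made up so the snippet runs):

```python
from datetime import date

# The Slack exception from the example: granted March 3rd, valid through March 31st.
exception_valid_from = date(2025, 3, 3)
exception_valid_to = date(2025, 3, 31)
refund_requested_on = date(2025, 4, 2)

# Temporal validity is bookkeeping: an inclusive range check, nothing more.
exception_applies = exception_valid_from <= refund_requested_on <= exception_valid_to
print(exception_applies)  # False: the exception does not apply
```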

Policy evaluation, i.e. “refunds over $25k require controller approval”, has to be mechanical. Rule engines have been doing this for decades, so let’s not overthink it.
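
And the policy rule is just data plus a comparison, the kind of thing rule engines have expressed forever. A sketch, with field names and thresholds taken from the running example:

```python
# "Refunds over $25k require controller approval" as plain, versionable data.
RULES = [
    {
        "id": "refund-controller-approval",
        "applies_to": "refund",
        "when": lambda r: r["amount"] > 25_000,
        "require": "controller_approval",
    },
]

def required_approvals(request: dict) -> list[str]:
    return [
        rule["require"]
        for rule in RULES
        if rule["applies_to"] == request["kind"] and rule["when"](request)
    ]

print(required_approvals({"kind": "refund", "amount": 26_000}))  # ['controller_approval']
```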

Lastly: definitions. They don’t disappear. They change over time. They get extended or superseded. A classic case for versioning, and for evaluating against the definition in force at a specific timestamp. Mechanical work.
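
Versioned definitions are the same mechanical move: keep every version with its validity window and evaluate a request against the definition in force at the request’s timestamp. A minimal sketch (the gift-card clause and the dates are invented for illustration):

```python
from datetime import date

# Each version of the "refund" definition carries its own validity window.
REFUND_DEFINITIONS = [
    {"version": 1, "valid_from": date(2024, 1, 1), "valid_to": date(2025, 6, 30),
     "includes_gift_cards": False},
    {"version": 2, "valid_from": date(2025, 7, 1), "valid_to": None,
     "includes_gift_cards": True},
]

def definition_as_of(when: date) -> dict:
    """Pick the definition version in force at a given timestamp."""
    for d in REFUND_DEFINITIONS:
        if d["valid_from"] <= when and (d["valid_to"] is None or when <= d["valid_to"]):
            return d
    raise LookupError(f"no refund definition in force on {when}")

print(definition_as_of(date(2025, 3, 15))["version"])             # 1
print(definition_as_of(date(2025, 8, 1))["includes_gift_cards"])  # True
```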

But don’t fret: there’s a time and place for LLMs, too. Classification at the edges. Is “Yeah, go ahead on the thing” an acknowledgement or an exception grant? Not a clear-cut answer, but one the LLM can reason about, with a confidence and a rationale attached. Anything more is theater.
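
Here’s roughly what “classification at the edges” could look like: the model returns a label, a confidence and a rationale, and nothing else crosses the boundary. The schema and the stubbed call are illustrative; swap in whatever model client you actually use.

```python
from dataclasses import dataclass

@dataclass
class EdgeClassification:
    label: str         # e.g. "acknowledgement" or "exception_grant"
    confidence: float  # 0.0 to 1.0, as stated by the model
    rationale: str     # short explanation, stored alongside the decision

def classify_message(text: str, call_llm) -> EdgeClassification:
    """call_llm is a stand-in for your model client; assume it returns a dict
    with 'label', 'confidence' and 'rationale' keys."""
    raw = call_llm(
        "Is this message a plain acknowledgement or an exception grant? "
        "Reply with a label, a confidence between 0 and 1, and a one-line rationale.\n\n"
        + text
    )
    return EdgeClassification(raw["label"], float(raw["confidence"]), raw["rationale"])

# Usage with a canned response, just to show the shape of the result.
fake_llm = lambda prompt: {"label": "exception_grant", "confidence": 0.62,
                           "rationale": "Imperative approval phrasing, no scope stated."}
print(classify_message("Yeah go ahead on the thing", fake_llm))
```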

Some situations are truly novel. The rules don’t exist. An escalation might be needed.

But really, where LLMs fit, and why they can change the game beyond rigid mapping, is in reasoning about intent over your (hopefully auditable) assembled context. Evaluate, recommend actions, explain tradeoffs, and so on.

Sounds like HITL again? Well, because it is. The context is assembled deterministically. Entities are resolved. The agent proposes an assembled situation, with artifacts, evaluations, confidences, recommendations and so on. This mental model of “agent does work, human is gate” is really the wrong division of labor, mechanically or aspirationally. Frankly, the aspirational part should have been the first red flag but that’s a different matter.

Basically: that’s a human doing cleanup work behind a machine that’s doing interesting cognitive labor. The machine is the knowledge worker and the human is the maintenance worker for the machine’s worldview. That’s demotion dressed up as “curation”.

Whether this janitorial phase is temporary or not is a topic for a different time. Not a dodge, rather a deferral, because the question stands on its own. It cuts to “but what are humans supposed to do?”, and not in a reassuring way.

For now, let’s assume the human is there to look for the stuff that the deterministic layer can’t resolve and can’t pretend to: genuinely novel situations, ratifying policy changes, provisional assertions, overrides, and so on.

What should be aspirational across all of these is for the human’s output to be structured data and decisions written back into the substrate, not a real-time call on whether the agent gets to proceed.
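
One way to picture “structured data written back into the substrate”: the human’s ruling becomes a record with the same provenance and validity discipline as everything else, not an approve/reject click. All field names below are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class HumanAssertion:
    decision_id: str   # the situation this ruling belongs to
    kind: str          # "policy_ratification", "override", "provisional_assertion", ...
    statement: str     # what the human actually asserted
    asserted_by: str   # resolved identity, not a free-text name
    asserted_at: datetime
    valid_to: datetime | None = None  # provisional assertions can expire

# The write-back is an append to the substrate, reusable by every future decision,
# not a one-off gate on whether this particular agent run may proceed.
record = HumanAssertion(
    decision_id="refund-4471",
    kind="override",
    statement="Regional exception ratified for Q2; treat as policy until superseded.",
    asserted_by="person:controller-emea-v3",
    asserted_at=datetime.now(timezone.utc),
)
```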

Part 3 - On shaky ground

Here’s an interesting failure mode: the controller role exists today, but the team got reorganized in July. “Controller, Americas” is now “Controller”. Different person in each seat, different reporting line. An agent evaluating a decision today that references a pre-reorg email would have to answer whether the delegation still means anything.

My point: org change looks like a data problem when it’s actually a semantics problem. And org changes are rarely clean: they’re not atomic, they’re unlikely to be announced to your systems in machine-readable form, and it’s unclear when (and at what pace) they graduate from slideware, Slack posts, and HRIS changes to reality. They’re really messy.

Same question as with schema drift before, but with sharper teeth this time, and we have to up our game in precision when referencing entities. A person shouldn’t be referenced by name, a role not by string. Authority claims need, at the very least, a triple of (role_version, assignment_at_time, authority_binding_at_time). And that’s still only scratching the surface.
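
Here’s a sketch of what referencing authority by triple rather than by name could look like. The structures and the lookup are illustrative; a real org model has far more edge cases than this.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class RoleVersion:
    role_id: str       # stable id, not the display string
    version: int
    title: str         # "Controller, Americas" pre-reorg, "Controller" after
    valid_from: date
    valid_to: date | None

@dataclass(frozen=True)
class Assignment:
    person_id: str     # stable person id, not a name
    role_id: str
    role_version: int
    valid_from: date
    valid_to: date | None

def authority_at(person_id: str, role_id: str, when: date,
                 roles: list[RoleVersion], assignments: list[Assignment]) -> bool:
    """Does this person hold this role, under the role version in force, at `when`?"""
    def active(valid_from: date, valid_to: date | None) -> bool:
        return valid_from <= when and (valid_to is None or when <= valid_to)

    role = next((r for r in roles if r.role_id == role_id
                 and active(r.valid_from, r.valid_to)), None)
    if role is None:
        return False
    return any(a.person_id == person_id and a.role_id == role_id
               and a.role_version == role.version
               and active(a.valid_from, a.valid_to)
               for a in assignments)
```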

You end up with a stack that looks like a versioned org model, sitting underneath a versioned policy model, sitting underneath a bounded context graph, sitting underneath an LLM that reasons over an assembled, time-pinned situation, with the true value unlocked at the intersection of systems.

And it arguably gets worse. True AI adoption requires organizational change, a change in the mechanics of how organizations work, and it’s only going to accelerate. We may have to add not only decision time but belief time: when did the organization notice that the definition shifted? That could make or break an audit.

A tri-temporal substrate. The literature is mature (e.g. TSQL2, event-driven ontologies), but almost nothing has been productized for this modeling challenge. It’s no wonder the substrate remains unsolved. And the demo? Barely any bang for the buck.
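
To make the three axes tangible, here’s one plausible reading in a single record: valid time (when the fact held in the world), decision time (when a decision relied on it) and belief time (when the organization learned it). The shape and the dates are illustrative.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class TemporalFact:
    statement: str
    valid_from: date         # when the fact held in the world
    valid_to: date | None
    decided_at: date | None  # when a decision relied on it
    believed_since: date     # when the organization learned / recorded it

fact = TemporalFact(
    statement="'Controller, Americas' superseded by 'Controller'",
    valid_from=date(2025, 7, 1),      # the reorg took effect in July
    valid_to=None,
    decided_at=date(2025, 9, 12),     # e.g. a refund evaluated against it
    believed_since=date(2025, 8, 4),  # but systems only reflected it in August
)

# An audit can now ask: what did we believe at decision time vs. what was true?
print(fact.believed_since > fact.valid_from)  # True: the organization noticed late
```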

Refund #4471 evaluated before and after the reorg

Part 4 - Zooming back out

The point I’m making with a fairly benign example is that we can take the technical depth infinitely deep. A natural reaction is “but wait, we don’t have to solve everything in one go”. And we don’t. Each organization has a structural reality with specific hotspots, and it’s the structural reality around those hotspots we care about. This is where the discussion stops being technical and starts moving towards go-to-market and distribution.

What we’ve been describing here is, roughly, a gradient of fidelity. Whether it’s HRIS systems, GitHub, Jira or agent traces, there are structural-level truths and decision-level truths. They all feed into each other, but you have to start somewhere. Starting with agent-level traces doesn’t sound terribly realistic to me, not initially at least. Never mind that they’re not even designed for this use case.

Anyway, there’s a pretty straightforward funnel with not-so-straightforward answers:

  1. Which use cases benefit from agentic flows?
  2. What level of fidelity do these agentic flows require from the substrate to generate value? Basically, a fancy way of saying “what’s the shallowest integration I can get away with, and take it from there”.

Both questions are really asking the same thing with different words: which parts of the domain logic are we willing to collapse, and which do we actually have to preserve? Every agent deployment is making this choice, and many times the choice is made implicitly. A wrong answer is fixable. A choice you didn’t know you made, not so much.