Data dignity is not a political position. It's an engineering one.

Treating the human origins of AI training data as a practical concern rather than an ethical one leads to better systems. Here's why the engineering argument is the more compelling one.

When people hear "data dignity," they tend to file it under ethics and move on. It sounds like a moral claim about fair compensation for creators: important, they think, but not really an engineering concern. That framing misses the practical half of the argument.

Data dignity, the principle that AI systems should acknowledge, trace, and in some form compensate the human work they encode, is also a practical engineering position. Understood that way, it changes how you design, evaluate, and audit AI systems, and it makes their failures easier to diagnose and fix.

What the black box costs you

The standard way to think about a large language model is as a black box: data goes in, a model comes out, the model generates responses. The internal mechanics are complex and largely opaque. This opacity is often treated as a feature, since the emergent capabilities of large models are partly a product of their complexity, and that complexity resists simple explanation.

But the black box framing has real engineering costs that are routinely underestimated.

When a model hallucinates — confidently generates false information — a black box gives you no mechanism to understand why. You can observe the failure, but you cannot trace it. When a model produces biased outputs, you can detect the bias, but locating its source within the training data requires guesswork. When a model's performance degrades on a specific domain or use case, diagnosing the gap is slow and expensive.

Every one of these problems becomes more tractable if you open the black box. And the way you open it is by taking seriously the question: whose knowledge is in here, in what proportion, and from what sources?

"When you know where the knowledge came from, you know where to look when something goes wrong."

The practical consequences of ignoring provenance

Consider what happens in practice when an organization deploys an AI system without understanding the provenance of its training data.

Without provenance awareness

  • Hallucinations are unpredictable and hard to reproduce
  • Bias is detected in outputs but untraceable to source
  • Domain gaps are discovered in production, not design
  • Security vulnerabilities require model-level patching
  • Quality degradation over time has no clear cause
  • Compliance audits are expensive guesswork

With provenance awareness

  • Failure modes trace back to identifiable data clusters
  • Bias sources can be located and corrected in training
  • Domain coverage gaps are visible before deployment
  • Security layer operates independently of the model
  • Quality can be monitored against known data sources
  • Compliance audit trail exists by design
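
As a rough illustration of the "domain coverage gaps are visible before deployment" point, the sketch below assumes each training document carries a small provenance record. The `ProvenanceRecord` fields, the `coverage_report` helper, and the 5% threshold are hypothetical choices made for this example, not an established schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical per-document provenance metadata."""
    doc_id: str
    source: str    # e.g. publisher, site, or contributor
    domain: str    # e.g. "medicine", "law", "software"
    license: str   # license or usage terms the document came with
    tokens: int    # size of the document in tokens


def coverage_report(records, required_domains, min_share=0.05):
    """Return each required domain's share of total tokens, plus the
    domains whose share falls below min_share (an assumed threshold)."""
    tokens_by_domain = Counter()
    for rec in records:
        tokens_by_domain[rec.domain] += rec.tokens
    total = sum(tokens_by_domain.values())
    shares = {
        d: (tokens_by_domain[d] / total if total else 0.0)
        for d in required_domains
    }
    gaps = sorted(d for d, share in shares.items() if share < min_share)
    return shares, gaps


records = [
    ProvenanceRecord("d1", "journal-a", "medicine", "CC-BY-4.0", 6000),
    ProvenanceRecord("d2", "forum-b", "software", "CC-BY-SA-4.0", 3800),
    ProvenanceRecord("d3", "court-c", "law", "public-domain", 200),
]
shares, gaps = coverage_report(records, ["medicine", "software", "law"])
print(gaps)  # domains under the assumed 5% threshold
```

The same records can serve as the starting point for an audit trail, since every document already names its source and license.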

Data dignity as a debugging tool

The most compelling engineering argument for data dignity is that it gives you a debugging mechanism that doesn't currently exist in most deployed AI systems.

When a model fails — hallucinates, produces biased output, generates something harmful that the guardrails missed — the standard response is to patch the guardrails. This is a reactive, whack-a-mole approach that addresses symptoms rather than causes.

An alternative approach: maintain a map of which clusters of training data most influence which categories of output. When something goes wrong, you can ask which data sources are most likely to have produced this kind of failure, trace it to those sources, and address the problem at its origin. This is a fundamentally different kind of quality control: causal rather than symptomatic, and preventive in the sense that a fix at the source keeps the same failure from recurring.

This is the technical argument for what researchers sometimes call counterfactual cluster estimation: a parallel process that tracks which training data clusters most influence model outputs, enabling both better debugging and more robust guardrails. It is, at its core, a data provenance tool. It works because it takes seriously the question of where the knowledge came from.
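
A minimal sketch of the bookkeeping side of that idea follows. It assumes some attribution method already yields a per-cluster influence score for each generated output; that method is the hard part and is not shown here. The `InfluenceMap` class, its method names, and the example cluster and category labels are all hypothetical.

```python
from collections import defaultdict


class InfluenceMap:
    """Accumulates assumed influence scores: output category -> cluster -> score."""

    def __init__(self):
        self._scores = defaultdict(lambda: defaultdict(float))

    def record(self, category, cluster_scores):
        """Add one output's per-cluster influence scores under its category.

        cluster_scores maps cluster id -> score and is assumed to come
        from an external attribution method.
        """
        for cluster, score in cluster_scores.items():
            self._scores[category][cluster] += score

    def suspects(self, category, top_k=3):
        """Rank clusters by their share of accumulated influence for a category."""
        by_cluster = self._scores.get(category, {})
        total = sum(by_cluster.values())
        if total == 0:
            return []
        ranked = sorted(by_cluster.items(), key=lambda kv: kv[1], reverse=True)
        return [(cluster, score / total) for cluster, score in ranked[:top_k]]


imap = InfluenceMap()
imap.record("medical_hallucination", {"forum_posts": 0.5, "journals": 0.25})
imap.record("medical_hallucination", {"forum_posts": 0.25})
imap.record("biased_hiring_advice", {"job_boards": 1.0})

for cluster, share in imap.suspects("medical_hallucination"):
    print(f"{cluster}: {share:.0%} of recorded influence")
```

When a failure in a given category shows up, the ranked list is a place to start looking, not a verdict: the ranking is only as trustworthy as the attribution scores fed into it.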

The organizational implication

For organizations implementing AI, this has a direct practical implication: the question "where does this model's knowledge come from?" should be a standard part of vendor evaluation, not an afterthought.

Most vendors cannot fully answer it. That is itself information. A vendor who cannot explain the provenance of their training data cannot fully explain the failure modes of their system. You are deploying a black box with unknown properties. That is a risk position, not a neutral one.

The organizations that will navigate AI implementation most successfully are those that treat data provenance as an engineering requirement — something to be specified, tested, and audited — rather than an ethical nicety. The ethics follow from the engineering. But the engineering case stands on its own.
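
Treating provenance as something "specified, tested, and audited" could look like an automated check over a dataset manifest. This sketch assumes the manifest is a list of dictionaries; the required field names and the `audit_manifest` function are hypothetical, standing in for whatever an organization actually specifies.

```python
REQUIRED_FIELDS = ("doc_id", "source", "license", "collected_on")  # assumed spec


def audit_manifest(manifest):
    """Return (doc_id, missing_fields) for every entry that fails the spec.

    A field counts as missing if it is absent or empty.
    """
    failures = []
    for index, entry in enumerate(manifest):
        missing = [f for f in REQUIRED_FIELDS if not entry.get(f)]
        if missing:
            failures.append((entry.get("doc_id") or f"<entry {index}>", missing))
    return failures


manifest = [
    {"doc_id": "d1", "source": "journal-a", "license": "CC-BY-4.0",
     "collected_on": "2024-01-15"},
    {"doc_id": "d2", "source": "", "license": "unknown-terms"},
]
failures = audit_manifest(manifest)
print(failures)
```

A check like this can run wherever other tests run, so a dataset with unexplained entries fails the build instead of surfacing in a compliance review.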
