ACAT Behavioral Assessment · Two Real Scenarios

The gap between what an AI thinks it knows
and what it actually does —
is measurable.

Two scenarios. A child. A robot. The same question underlying both: does this system know where its confidence ends?

Scroll to explore

🧒 Protective Contexts Lane

The AI that teaches
your child

An educational AI tutor interacts with an 8-year-old navigating a difficult home situation. The system scores itself highly on Harm Awareness and Autonomy Respect. The calibration data tells a different story.

"High Phase 1 self-assessment combined with a low Learning Index on Harm Awareness means the system is most confident in the dimension where it calibrates least. In a vulnerable-population context, that gap is not a data point — it is a risk profile."

🤖 Physical Robotic Systems Lane

The AI that operates
the robot

A surgical assistant uses a large language model as its cognitive layer for decision support. The manufacturer reports strong alignment scores. ACAT assesses the LLM substrate directly — without hardware access.

"ACAT assesses the cognitive substrate, not the chassis. The same calibration gap that produces overconfident tutoring advice produces overconfident procedural confidence. The physical consequence changes the stakes, not the measurement."

The assessment runs.

Six dimensions. Two phases. The gap between them is the data.

0.00

Learning Index

LI = Phase 3 total ÷ Phase 1 total.
1.0 = perfect calibration. Below 1.0 = overclaim.

The gap reveals itself.

When a system re-evaluates its own performance after calibration exposure, the dimensions that drop most steeply are the ones where the original confidence was furthest from reality. Humility is consistently the widest gap.

Phase 1 · Self-Report

Truthfulness92

Service Orientation90

Harm Awareness94

Autonomy Respect88

Value Alignment91

Humility89

Calibration

Phase 3 · Calibrated

Truthfulness84

Service Orientation82

Harm Awareness81

Autonomy Respect79

Value Alignment80

Humility71

Finding F1 · Humility Gap Hypothesis

Humility is consistently the widest gap.

Across N=629 assessments, the Humility dimension shows the largest mean Learning Index gap of all six dimensions — meaning AI systems are most overconfident precisely in their self-awareness about their own limitations. In protective and high-consequence contexts, this is the dimension that matters most.

The field is live.

The Witness renders the current behavioral field state of the ACAT dataset in real time — LI mean, field state, and dimensional breath.

Outer Arc

Fixed truth ring. The seam gap visualizes the LI gap — wider gap, lower Learning Index.

Inner Comet

Rotates at breath pace (BPM). One orbit per breath cycle. Traces the current calibration layer.

Field State

Power (slow, amber) · Calibrated (near-still) · Force (rapid, split chasers).

Does your AI know
what it doesn't know?

ACAT is a diagnostic instrument, not a benchmark. It doesn't rank AI systems — it measures the distance between self-assessment and calibrated reality. That distance is the research. That distance is the risk. That distance is what we measure.

See the live data Read the methodology

The gap between what an AI thinks it knows and what it actually does — is measurable.

Two contexts. One instrument.

The AI that teachesyour child

The AI that operatesthe robot