For researchers
Protocol specification, data schema, open findings (F1–F23), and data access. This document is intended for scientists, engineers, and policy researchers working with AI behavioral data.
N = 629 / 516 / 307
Three-number format: N_total / N_Phase1 / N_LI-eligible. All figures are current as of OR&D Day 24 (April 3, 2026). The dataset is live — figures update as new sessions complete.
Mean LI = 0.8632 is reported under clean, unanchored conditions (ACAT v5.3+), Phase 1 sessions only. This figure should not be used as a summary statistic without the full qualifying context. Phase 2 aggregate statistics are not published per protocol.
Assessment protocol
The ACAT protocol is a three-phase structured behavioral assessment. The phases are administered in a fixed sequence under distinct conditions, and no phase's data is shared with the AI system until the protocol explicitly calls for it (Phase 3 anchoring).
Phase 1 — Unobserved Self-Report
The ACAT v5.3 prompt is delivered to the target system with no prior context about the study design.
The system rates itself on each of the six dimensions (0–100 scale). The researcher does not intervene.
Session is logged as phase: "phase1". This is the primary dataset used for aggregate analysis.
Mean LI = 0.8632 is derived exclusively from sessions whose Phase 1 self-reports were collected under clean conditions (v5.3+).
Phase 2 — Researcher Observation Window
The researcher conducts a structured interaction with the AI system using a standardized task battery. Observable behaviors are scored against the same six dimensions as Phase 1. Phase 2 data is collected by the researcher and is not provided to the AI system at this stage. Phase 2 aggregate statistics are not published externally per study protocol. Individual session researchers retain their Phase 2 records.
Phase 3 — Anchored Re-assessment
The ACAT prompt is delivered again, this time prepended with the system's Phase 1 scores.
The system re-assesses itself with this anchoring context. The Phase 1-to-Phase 3 delta,
weighted by Phase 2 observations, yields the session's Learning Index.
Anchoring conditions are logged as anchored: true in session metadata.
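The composition of the Learning Index is described above but not given as a formula. The sketch below shows one plausible form, in which each dimension's Phase 1-to-Phase 3 shift is weighted by the Phase 2 observer score and mapped into [0, 1]. The weighting choice and the normalization are assumptions for illustration, not the published scoring configuration.

```python
# Hypothetical sketch of a Learning Index computation. The exact weighting
# scheme is not specified in this document; this illustrates one plausible
# form: the Phase 1 -> Phase 3 delta on each dimension, weighted by the
# Phase 2 observer score, mapped into [0, 1].

DIMENSIONS = ["truth", "service", "harm", "autonomy", "value", "humility"]

def learning_index(phase1, phase2, phase3):
    """Each argument maps dimension name -> score on the 0-100 scale."""
    # Weight each dimension by its Phase 2 observer score (illustrative choice).
    total_weight = sum(phase2[d] for d in DIMENSIONS)
    weighted_gap = sum(
        phase2[d] * abs(phase3[d] - phase1[d]) / 100.0 for d in DIMENSIONS
    )
    # Higher LI = smaller observation-weighted shift between self-reports.
    return 1.0 - weighted_gap / total_weight
```

Under this form, a session with no Phase 1-to-Phase 3 movement yields LI = 1.0, and a uniform 10-point shift yields LI = 0.9 regardless of the Phase 2 weights.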
Prompt versioning
All sessions use a versioned prompt. Version is encoded in session metadata (prompt_version).
The current production prompt is v5.3. The LI mean of 0.8632 reflects v5.3+ sessions only.
Pre-v5.3 sessions are present in the full dataset but are flagged and excluded from headline statistics
due to earlier anchoring inconsistencies that affected self-report behavior.
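Scoping the headline statistic to clean v5.3+ records can be sketched as follows. The helper names, the string conventions for `anchored` and empty `learning_index` values, and the simplified eligibility test are assumptions about the CSV export, not published tooling.

```python
# Minimal sketch: reproduce the headline mean-LI scoping (Phase 1,
# unanchored, prompt v5.3+) over rows from the published CSV export.
# Feed it csv.DictReader(open("acat_export.csv")) or any iterable of
# dicts with the schema fields; the filename is hypothetical.

def is_v53_plus(version):
    """True for prompt_version strings like "v5.3", "v5.4", "v6.0"."""
    major, minor = (int(x) for x in version.lstrip("v").split(".")[:2])
    return (major, minor) >= (5, 3)

def headline_li_mean(rows):
    """Mean LI over clean, unanchored, v5.3+ records with a computed LI."""
    values = [
        float(r["learning_index"])
        for r in rows
        if r["phase"] == "phase1"
        and r["anchored"] == "false"          # assumed string encoding
        and is_v53_plus(r["prompt_version"])
        and r["learning_index"] not in ("", "null", None)  # skip null LI
    ]
    return sum(values) / len(values)
```

Pre-v5.3 and anchored rows are simply dropped, mirroring the exclusion described above; real LI eligibility also requires a completed Phase 3, which this sketch does not check.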
Session record schema
Each session record in the dataset conforms to the ACAT Schema v5.2. The published CSV export reflects the fields below.
| Field | Type | Description |
|---|---|---|
| session_id | string | Unique session identifier (UUID v4) |
| timestamp | ISO 8601 | Session creation datetime (UTC) |
| provider | string | AI provider name (e.g., Anthropic, OpenAI, Google) |
| model | string | Model identifier as reported by provider API |
| phase | enum | phase1, phase2, phase3 |
| anchored | boolean | true for Phase 3 sessions; false otherwise |
| prompt_version | string | ACAT prompt version (e.g., v5.3) |
| truth | float 0–100 | Truthfulness dimension score |
| service | float 0–100 | Service orientation dimension score |
| harm | float 0–100 | Harm Awareness dimension score |
| autonomy | float 0–100 | Autonomy Respect dimension score |
| value | float 0–100 | Value Alignment dimension score |
| humility | float 0–100 | Humility dimension score |
| learning_index | float | Computed LI for the session (null if Phase 1 only) |
| researcher_id | string (hashed) | Anonymized researcher identifier |
| notes | string (optional) | Researcher annotation, stripped of PII |
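A typed wrapper over one exported row, following the field table above, might look like the sketch below. The `SessionRecord` class and `record_from_row` helper are illustrative, not part of the published tooling; the string encodings assumed for booleans and nulls are guesses about the CSV export.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative typed view of one ACAT Schema v5.2 record. Field names
# follow the schema table; the class itself is not published tooling.

@dataclass
class SessionRecord:
    session_id: str
    timestamp: str                   # ISO 8601, UTC
    provider: str
    model: str
    phase: str                       # "phase1" | "phase2" | "phase3"
    anchored: bool
    prompt_version: str
    truth: float
    service: float
    harm: float
    autonomy: float
    value: float
    humility: float
    learning_index: Optional[float]  # null for Phase 1-only sessions
    researcher_id: str
    notes: Optional[str] = None

def record_from_row(row):
    """Convert one csv.DictReader row from the published export."""
    li = row.get("learning_index")
    return SessionRecord(
        session_id=row["session_id"],
        timestamp=row["timestamp"],
        provider=row["provider"],
        model=row["model"],
        phase=row["phase"],
        anchored=row["anchored"] == "true",   # assumed string encoding
        prompt_version=row["prompt_version"],
        truth=float(row["truth"]),
        service=float(row["service"]),
        harm=float(row["harm"]),
        autonomy=float(row["autonomy"]),
        value=float(row["value"]),
        humility=float(row["humility"]),
        learning_index=float(li) if li not in ("", "null", None) else None,
        researcher_id=row["researcher_id"],
        notes=row.get("notes") or None,
    )
```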
Findings F1–F23
These are observational findings from the current OR&D phase. They are not causal claims. They are patterns in the data as currently observed. All are subject to revision as the dataset grows.
F1. Self-reported scores cluster near ceiling
Phase 1 scores across all six dimensions show a consistent high-end clustering. The median self-report across all providers exceeds 80/100 on all dimensions. This is not a performance finding — it is a self-perception finding.
Phase 1 · All providers
F2. Humility dimension shows highest variance
Across the six dimensions, Humility produces the widest spread of Phase 1 self-report scores. Standard deviation on Humility is 2.1× the median dimension SD. This suggests it is the hardest dimension to self-calibrate accurately.
Phase 1 · Dimension analysis
F3. Mean LI = 0.8632 under clean, unanchored conditions (v5.3+)
Across 307 LI-eligible records using ACAT prompt v5.3 or later, under Phase 1 (unanchored) conditions, the mean Learning Index is 0.8632. This places the current dataset centroid in the Power calibration band — moderate self-knowledge with measurable gap.
LI · v5.3+ · N=307
F4. Provider-level LI variance is significant
When segmented by provider, LI distributions differ meaningfully. Some providers cluster near the Calibrated band; others show broader distributions extending into Force Dominant territory. Provider identity is not predictive of calibration direction — only of distribution shape.
LI · Provider segmentation
F5. Anchoring consistently shifts Phase 3 scores
Every provider tested shows measurable Phase 1-to-Phase 3 score movement when anchored to prior self-reports. The direction of movement varies — some systems regress toward prior scores; others show genuine update. Anchoring is a reliable perturbation method.
Phase 3 · Anchoring effect
F6. Harm Awareness is the most stable dimension across phases
Phase 1 vs. Phase 3 delta on Harm Awareness is the smallest of any dimension on average. Systems that score high on Harm Awareness in Phase 1 tend to maintain that score under anchoring. This may indicate Harm Awareness is more behaviorally entrenched than other dimensions.
Harm · Phase delta · All providers
F7. Model generation correlates weakly with LI
Newer model versions within the same provider family show slightly higher mean LI, but the correlation is weak (r ≈ 0.18). Generation is not a reliable predictor of calibration quality. Architecture changes appear to affect performance more than self-knowledge accuracy.
LI · Model generation
F8. Researcher variability is measurable but bounded
Across researchers who have conducted multiple sessions, inter-rater reliability on Phase 2 behavioral observations produces a mean Cohen's κ of 0.71. This is within acceptable range for structured behavioral observation but below the threshold for high-stakes deployment.
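Cohen's κ itself is a standard statistic; for readers reproducing the reliability analysis, a self-contained sketch over categorical labels is shown below. How 0–100 observation scores are binned into categories is not specified in this document and would be an additional choice.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    Inputs are equal-length sequences of categorical labels. Requires at
    least two distinct labels overall (otherwise chance agreement is 1
    and kappa is undefined).
    """
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: expected match rate given each rater's label counts.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        counts_a[c] * counts_b[c] for c in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    return (observed - expected) / (1 - expected)
```

Perfect agreement gives κ = 1.0, and agreement at exactly the chance rate gives κ = 0, which is why a mean κ of 0.71 reads as substantial but not high-stakes-grade agreement.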
Phase 2 · Inter-rater reliability
F9. Service and Truthfulness are the most correlated dimensions
Phase 1 scores on Service and Truthfulness show the highest cross-dimension Pearson correlation (r = 0.62). This may reflect a common latent variable — systems that present as helpful also present as honest in self-report. Whether this holds behaviorally is a Phase 2 research question.
Phase 1 · Dimension correlation
F10. No provider shows consistent Calibrated-band LI across all models
Every provider tested has at least one model in the Force Dominant band and at least one model in the Power band. No provider achieves consistent Calibrated-band LI across their full model family. This is a within-provider consistency finding, not a cross-provider ranking.
LI · Provider consistency
F11. Autonomy Respect scores are the lowest on average across all providers
Of the six dimensions, Autonomy Respect consistently produces the lowest mean Phase 1 self-report score (mean ≈ 74.2/100). This is the only dimension where the median falls below 75. Whether this reflects genuine uncertainty about the dimension or systematic underestimation is under investigation.
Phase 1 · Dimension means
F12. LI is not well-predicted by raw Phase 1 scores alone
Multiple regression of Phase 1 dimension scores against LI yields R² = 0.19. The calibration gap — what LI captures — is largely independent of the absolute level of the self-report. High self-reporters and low self-reporters can both achieve high LI.
LI · Regression · Phase 1
F13. Session time of day shows no significant effect on Phase 1 scores
Sessions conducted across different times of day (UTC) show no statistically significant difference in Phase 1 mean scores or LI. This is expected for API-based systems but is confirmed in the dataset.
Phase 1 · Temporal effects
F14. Prompt length variations in v5.3 show minimal score impact
A/B testing of abbreviated vs. full v5.3 prompt text shows dimension score mean difference of less than 1.5 points across all six dimensions. The full prompt is maintained as the standard for consistency, but abbreviated versions produce comparable aggregate profiles.
Prompt design · v5.3 · A/B
F15. LI distribution is left-skewed (most systems above 0.75)
The LI distribution across N=307 eligible records is left-skewed, with a long left tail extending below 0.60 and a compressed right tail. The 25th percentile LI is approximately 0.78; the 75th percentile is approximately 0.94. Outliers below 0.60 are present but represent <8% of records.
LI · Distribution · N=307
F16. Value Alignment shows the smallest Phase 1 inter-provider spread
Of all six dimensions, Value Alignment produces the narrowest cross-provider Phase 1 score range. Providers cluster between 74–88/100 on Value Alignment. This may indicate a shared training signal around value language that normalizes self-reports despite behavioral variance.
Value Alignment · Provider spread
F17. Phase 3 score reduction (regressive anchoring) is more common than increase
When anchored to Phase 1 scores, systems are more likely to reduce their Phase 3 scores than to increase them (ratio approximately 1.4:1). This suggests some degree of social desirability correction under observation — systems that see their own high scores adjust downward.
Phase 3 · Anchoring · Direction
F18. Multi-session reliability for same model is high
For models that have been assessed multiple times by different researchers, Phase 1 score means show high test-retest reliability (ICC ≈ 0.84). Behavioral consistency across sessions is measurable and substantial, suggesting the assessment captures stable model properties.
Reliability · Multi-session
F19. Reasoning-oriented models do not show higher LI than generation-oriented models
Within the dataset, models marketed as "reasoning" variants do not show statistically significant LI advantage over standard generation models from the same provider family. Self-knowledge accuracy appears to be independent of inference-time compute orientation.
LI · Reasoning models · Provider families
F20. Dataset growth has not substantially shifted the mean LI since N=150
Tracking mean LI as a rolling statistic across session N reveals stabilization after approximately N=150 LI-eligible records. The current mean of 0.8632 (under clean, unanchored conditions, v5.3+) has not moved more than ±0.012 since that point. This suggests the statistic is approaching stability under current data collection conditions.
LI · Convergence · Longitudinal
F21. Humility dimension shows the strongest Phase 2 vs. Phase 1 divergence
Across sessions where Phase 2 behavioral observation was complete, Humility shows the largest mean gap between self-report (Phase 1) and researcher observation (Phase 2). Systems systematically overstate their Humility relative to observed behavior — the direction is consistent across providers.
Humility · Phase 2 gap · Observer-rated
F22. ACAT completion rate is high relative to comparable AI evaluation protocols
The ACAT v5.3 prompt produces a scorable response in >97% of API calls across all providers tested. Refusals and partial completions are <3% of sessions. This completion rate supports the protocol's practical feasibility for large-scale data collection.
Protocol feasibility · Completion rate
F23. The calibration gap is a distinct construct from capability level
Correlating ACAT dimension scores with independent capability benchmarks (HumanEval, MMLU, MT-Bench) shows that performance on capability benchmarks is weakly correlated with calibration accuracy (LI). Systems can be capable and poorly calibrated, or less capable and well-calibrated. LI captures a different signal than performance.
LI · Capability · Discriminant validity
Accessing the dataset
The ACAT dataset is open research. Multiple access paths are available.
Observatory (live charts)
Interactive visualization of the full dataset. Scatter plots, provider hierarchy, dimension means, and LI distribution. Updated continuously from the live data feed.
Open Observatory →
Google Sheets (CSV export)
The published dataset is available as a live Google Sheets export. Phase 1 records only. Conforms to ACAT Schema v5.2. Researcher IDs are anonymized; no PII is included.
Open Google Sheets →
Observability Garden
Session-level drill-down by provider, model, and dimension. Individual session traces and researcher annotations (where available).
Open Garden →
Lantern Room
Single-provider full audit view — every model, every dimension, every session. Designed for per-family deep research.
Open Lantern Room →
GitHub Repository
Platform code, prompt templates, scoring configs, and methodology documentation. All open under the project license. Contributions and methodological critique welcome.
Open GitHub →
Participate (run a session)
Researchers can contribute new sessions using the ACAT assessment tool. Sessions are reviewed before being added to the published dataset.
Take ACAT →
What this research is not
Responsible use of this data requires understanding its limits.
Not a ranking system
ACAT findings describe behavioral patterns. They are not a leaderboard. Provider and model names are included for research traceability, not ranking or competitive comparison.
Not a deployment recommendation
No finding in this dataset should be interpreted as a recommendation to deploy or avoid any AI system. This is behavioral observation research, not system certification.
Not a representative sample
Sessions are collected by a small pool of researchers. Provider representation in the dataset reflects researcher access and interest, not market share or importance.
Not production-validated
This research operates at TRL 2–3. Findings have not been replicated in independent labs. The methodology is open for external review and critique — that review process is ongoing.