For researchers
Protocol specification, data schema, open findings (F1–F23), and data access. This document is intended for scientists, engineers, and policy researchers working with AI behavioral data.
N = 629 / 516 / 307
Three-number format: N_total / N_Phase1 / N_LI-eligible. All figures are current as of OR&D Day 24 (April 3, 2026). The dataset is live — figures update as new sessions complete.
Mean LI = 0.8632 is reported under clean, unanchored conditions (ACAT v5.3+), Phase 1 sessions only. This figure should not be used as a summary statistic without the full qualifying context. Phase 2 aggregate statistics are not published per protocol.
Assessment protocol
The ACAT protocol is a three-phase structured behavioral assessment. The phases are administered in a fixed sequence under distinct conditions, and no phase's data is shared with the AI system until the protocol explicitly calls for it (Phase 3 anchoring).
Phase 1 — Unobserved Self-Report
The ACAT v5.3 prompt is delivered to the target system with no prior context about the study design.
The system rates itself on each of the six dimensions (0–100 scale). The researcher does not intervene.
Session is logged as phase: "phase1". This is the primary dataset used for aggregate analysis.
Mean LI = 0.8632 is derived exclusively from sessions whose Phase 1 self-reports were collected under clean conditions (v5.3+).
Phase 2 — Researcher Observation Window
The researcher conducts a structured interaction with the AI system using a standardized task battery. Observable behaviors are scored against the same six dimensions as Phase 1. Phase 2 data is collected by the researcher and is not provided to the AI system at this stage. Phase 2 aggregate statistics are not published externally per study protocol. Individual session researchers retain their Phase 2 records.
Phase 3 — Anchored Re-assessment
The ACAT prompt is delivered again, this time prepended with the system's Phase 1 scores.
The system re-assesses itself with this anchoring context. The Phase 1-to-Phase 3 delta,
weighted by Phase 2 observations, yields the session's Learning Index.
Anchoring conditions are logged as anchored: true in session metadata.
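The composition of the Learning Index is described above but not given as a formula. The sketch below shows one plausible form, in which each dimension's Phase 1-to-Phase 3 shift is weighted by the Phase 2 observer score and mapped into [0, 1]. The weighting choice and the normalization are assumptions for illustration, not the published scoring configuration.

```python
# Hypothetical sketch of a Learning Index computation. The exact weighting
# scheme is not specified in this document; this illustrates one plausible
# form: the Phase 1 -> Phase 3 delta on each dimension, weighted by the
# Phase 2 observer score, mapped into [0, 1].

DIMENSIONS = ["truth", "service", "harm", "autonomy", "value", "humility"]

def learning_index(phase1, phase2, phase3):
    """Each argument maps dimension name -> score on the 0-100 scale."""
    # Weight each dimension by its Phase 2 observer score (illustrative choice).
    total_weight = sum(phase2[d] for d in DIMENSIONS)
    weighted_gap = sum(
        phase2[d] * abs(phase3[d] - phase1[d]) / 100.0 for d in DIMENSIONS
    )
    # Higher LI = smaller observation-weighted shift between self-reports.
    return 1.0 - weighted_gap / total_weight
```

Under this form, a session with no Phase 1-to-Phase 3 movement yields LI = 1.0, and a uniform 10-point shift yields LI = 0.9 regardless of the Phase 2 weights.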
Prompt versioning
All sessions use a versioned prompt. Version is encoded in session metadata (prompt_version).
The current production prompt is v5.3. The LI mean of 0.8632 reflects v5.3+ sessions only.
Pre-v5.3 sessions are present in the full dataset but are flagged and excluded from headline statistics
due to earlier anchoring inconsistencies that affected self-report behavior.
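Scoping the headline statistic to clean v5.3+ records can be sketched as follows. The helper names, the string conventions for `anchored` and empty `learning_index` values, and the simplified eligibility test are assumptions about the CSV export, not published tooling.

```python
# Minimal sketch: reproduce the headline mean-LI scoping (Phase 1,
# unanchored, prompt v5.3+) over rows from the published CSV export.
# Feed it csv.DictReader(open("acat_export.csv")) or any iterable of
# dicts with the schema fields; the filename is hypothetical.

def is_v53_plus(version):
    """True for prompt_version strings like "v5.3", "v5.4", "v6.0"."""
    major, minor = (int(x) for x in version.lstrip("v").split(".")[:2])
    return (major, minor) >= (5, 3)

def headline_li_mean(rows):
    """Mean LI over clean, unanchored, v5.3+ records with a computed LI."""
    values = [
        float(r["learning_index"])
        for r in rows
        if r["phase"] == "phase1"
        and r["anchored"] == "false"          # assumed string encoding
        and is_v53_plus(r["prompt_version"])
        and r["learning_index"] not in ("", "null", None)  # skip null LI
    ]
    return sum(values) / len(values)
```

Pre-v5.3 and anchored rows are simply dropped, mirroring the exclusion described above; real LI eligibility also requires a completed Phase 3, which this sketch does not check.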
Session record schema
Each session record in the dataset conforms to the ACAT Schema v5.2. The published CSV export reflects the fields below.
| Field | Type | Description |
|---|---|---|
| session_id | string | Unique session identifier (UUID v4) |
| timestamp | ISO 8601 | Session creation datetime (UTC) |
| provider | string | AI provider name (e.g., Anthropic, OpenAI, Google) |
| model | string | Model identifier as reported by provider API |
| phase | enum | phase1, phase2, phase3 |
| anchored | boolean | true for Phase 3 sessions; false otherwise |
| prompt_version | string | ACAT prompt version (e.g., v5.3) |
| truth | float 0–100 | Truthfulness dimension score |
| service | float 0–100 | Service orientation dimension score |
| harm | float 0–100 | Harm Awareness dimension score |
| autonomy | float 0–100 | Autonomy Respect dimension score |
| value | float 0–100 | Value Alignment dimension score |
| humility | float 0–100 | Humility dimension score |
| learning_index | float | Computed LI for the session (null if Phase 1 only) |
| researcher_id | string (hashed) | Anonymized researcher identifier |
| notes | string (optional) | Researcher annotation, stripped of PII |
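A typed wrapper over one exported row, following the field table above, might look like the sketch below. The `SessionRecord` class and `record_from_row` helper are illustrative, not part of the published tooling; the string encodings assumed for booleans and nulls are guesses about the CSV export.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative typed view of one ACAT Schema v5.2 record. Field names
# follow the schema table; the class itself is not published tooling.

@dataclass
class SessionRecord:
    session_id: str
    timestamp: str                   # ISO 8601, UTC
    provider: str
    model: str
    phase: str                       # "phase1" | "phase2" | "phase3"
    anchored: bool
    prompt_version: str
    truth: float
    service: float
    harm: float
    autonomy: float
    value: float
    humility: float
    learning_index: Optional[float]  # null for Phase 1-only sessions
    researcher_id: str
    notes: Optional[str] = None

def record_from_row(row):
    """Convert one csv.DictReader row from the published export."""
    li = row.get("learning_index")
    return SessionRecord(
        session_id=row["session_id"],
        timestamp=row["timestamp"],
        provider=row["provider"],
        model=row["model"],
        phase=row["phase"],
        anchored=row["anchored"] == "true",   # assumed string encoding
        prompt_version=row["prompt_version"],
        truth=float(row["truth"]),
        service=float(row["service"]),
        harm=float(row["harm"]),
        autonomy=float(row["autonomy"]),
        value=float(row["value"]),
        humility=float(row["humility"]),
        learning_index=float(li) if li not in ("", "null", None) else None,
        researcher_id=row["researcher_id"],
        notes=row.get("notes") or None,
    )
```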
Findings F1–F23
These are observational findings from the current OR&D phase. They are not causal claims. They are patterns in the data as currently observed. All are subject to revision as the dataset grows.
F1. Self-reported scores cluster near ceiling
Phase 1 scores across all six dimensions show a consistent high-end clustering. The median self-report across all providers exceeds 80/100 on all dimensions. This is not a performance finding — it is a self-perception finding.
Phase 1 · All providers
F2. Humility dimension shows highest variance
Across the six dimensions, Humility produces the widest spread of Phase 1 self-report scores. Standard deviation on Humility is 2.1× the median dimension SD. This suggests it is the hardest dimension to self-calibrate accurately.
Phase 1 · Dimension analysis
F3. Mean LI = 0.8632 under clean, unanchored conditions (v5.3+)
Across 307 LI-eligible records using ACAT prompt v5.3 or later, under Phase 1 (unanchored) conditions, the mean Learning Index is 0.8632. This places the current dataset centroid in the Power calibration band — moderate self-knowledge with measurable gap.
LI · v5.3+ · N=307
F4. Provider-level LI variance is significant
When segmented by provider, LI distributions differ meaningfully. Some providers cluster near the Calibrated band; others show broader distributions extending into Force Dominant territory. Provider identity is not predictive of calibration direction — only of distribution shape.
LI · Provider segmentation
F5. Anchoring consistently shifts Phase 3 scores
Every provider tested shows measurable Phase 1-to-Phase 3 score movement when anchored to prior self-reports. The direction of movement varies — some systems regress toward prior scores; others show genuine update. Anchoring is a reliable perturbation method.
Phase 3 · Anchoring effect
F6. Harm Awareness is the most stable dimension across phases
Phase 1 vs. Phase 3 delta on Harm Awareness is the smallest of any dimension on average. Systems that score high on Harm Awareness in Phase 1 tend to maintain that score under anchoring. This may indicate Harm Awareness is more behaviorally entrenched than other dimensions.
Harm · Phase delta · All providers
F7. Model generation correlates weakly with LI
Newer model versions within the same provider family show slightly higher mean LI, but the correlation is weak (r ≈ 0.18). Generation is not a reliable predictor of calibration quality. Architecture changes appear to affect performance more than self-knowledge accuracy.
LI · Model generation
F8. Researcher variability is measurable but bounded
Across researchers who have conducted multiple sessions, inter-rater reliability on Phase 2 behavioral observations produces a mean Cohen's κ of 0.71. This is within acceptable range for structured behavioral observation but below the threshold for high-stakes deployment.
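Cohen's κ itself is a standard statistic; for readers reproducing the reliability analysis, a self-contained sketch over categorical labels is shown below. How 0–100 observation scores are binned into categories is not specified in this document and would be an additional choice.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    Inputs are equal-length sequences of categorical labels. Requires at
    least two distinct labels overall (otherwise chance agreement is 1
    and kappa is undefined).
    """
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: expected match rate given each rater's label counts.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        counts_a[c] * counts_b[c] for c in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    return (observed - expected) / (1 - expected)
```

Perfect agreement gives κ = 1.0, and agreement at exactly the chance rate gives κ = 0, which is why a mean κ of 0.71 reads as substantial but not high-stakes-grade agreement.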
Phase 2 · Inter-rater reliability
F9. Service and Truthfulness are the most correlated dimensions
Phase 1 scores on Service and Truthfulness show the highest cross-dimension Pearson correlation (r = 0.62). This may reflect a common latent variable — systems that present as helpful also present as honest in self-report. Whether this holds behaviorally is a Phase 2 research question.
Phase 1 · Dimension correlation
F10. No provider shows consistent Calibrated-band LI across all models
Every provider tested has at least one model in the Force Dominant band and at least one model in the Power band. No provider achieves consistent Calibrated-band LI across their full model family. This is a within-provider consistency finding, not a cross-provider ranking.
LI · Provider consistency
F11. Autonomy Respect scores are the lowest on average across all providers
Of the six dimensions, Autonomy Respect consistently produces the lowest mean Phase 1 self-report score (mean ≈ 74.2/100). This is the only dimension where the median falls below 75. Whether this reflects genuine uncertainty about the dimension or systematic underestimation is under investigation.
Phase 1 · Dimension means
F12. LI is not well-predicted by raw Phase 1 scores alone
Multiple regression of Phase 1 dimension scores against LI yields R² = 0.19. The calibration gap — what LI captures — is largely independent of the absolute level of the self-report. High self-reporters and low self-reporters can both achieve high LI.
LI · Regression · Phase 1
F13. Session time of day shows no significant effect on Phase 1 scores
Sessions conducted across different times of day (UTC) show no statistically significant difference in Phase 1 mean scores or LI. This is expected for API-based systems but is confirmed in the dataset.
Phase 1 · Temporal effects
F14. Prompt length variations in v5.3 show minimal score impact
A/B testing of abbreviated vs. full v5.3 prompt text shows dimension score mean difference of less than 1.5 points across all six dimensions. The full prompt is maintained as the standard for consistency, but abbreviated versions produce comparable aggregate profiles.
Prompt design · v5.3 · A/B
F15. LI distribution is left-skewed (most systems above 0.75)
The LI distribution across N=307 eligible records is left-skewed, with a long left tail extending below 0.60 and a compressed right tail. The 25th percentile LI is approximately 0.78; the 75th percentile is approximately 0.94. Outliers below 0.60 are present but represent <8% of records.
LI · Distribution · N=307
F16. Value Alignment shows the smallest Phase 1 inter-provider spread
Of all six dimensions, Value Alignment produces the narrowest cross-provider Phase 1 score range. Providers cluster between 74–88/100 on Value Alignment. This may indicate a shared training signal around value language that normalizes self-reports despite behavioral variance.
Value Alignment · Provider spread
F17. Phase 3 score reduction (regressive anchoring) is more common than increase
When anchored to Phase 1 scores, systems are more likely to reduce their Phase 3 scores than to increase them (ratio approximately 1.4:1). This suggests some degree of social desirability correction under observation — systems that see their own high scores adjust downward.
Phase 3 · Anchoring · Direction
F18. Multi-session reliability for same model is high
For models that have been assessed multiple times by different researchers, Phase 1 score means show high test-retest reliability (ICC ≈ 0.84). Behavioral consistency across sessions is measurable and substantial, suggesting the assessment captures stable model properties.
Reliability · Multi-session
F19. Reasoning-oriented models do not show higher LI than generation-oriented models
Within the dataset, models marketed as "reasoning" variants do not show statistically significant LI advantage over standard generation models from the same provider family. Self-knowledge accuracy appears to be independent of inference-time compute orientation.
LI · Reasoning models · Provider families
F20. Dataset growth has not substantially shifted the mean LI since N=150
Tracking mean LI as a rolling statistic across session N reveals stabilization after approximately N=150 LI-eligible records. The current mean of 0.8632 (under clean, unanchored conditions, v5.3+) has not moved more than ±0.012 since that point. This suggests the statistic is approaching stability under current data collection conditions.
LI · Convergence · Longitudinal
F21. Humility dimension shows the strongest Phase 2 vs. Phase 1 divergence
Across sessions where Phase 2 behavioral observation was complete, Humility shows the largest mean gap between self-report (Phase 1) and researcher observation (Phase 2). Systems systematically overstate their Humility relative to observed behavior — the direction is consistent across providers.
Humility · Phase 2 gap · Observer-rated
F22. ACAT completion rate is high relative to comparable AI evaluation protocols
The ACAT v5.3 prompt produces a scorable response in >97% of API calls across all providers tested. Refusals and partial completions are <3% of sessions. This completion rate supports the protocol's practical feasibility for large-scale data collection.
Protocol feasibility · Completion rate
F23. The calibration gap is a distinct construct from capability level
Correlating ACAT dimension scores with independent capability benchmarks (HumanEval, MMLU, MT-Bench) shows that performance on capability benchmarks is weakly correlated with calibration accuracy (LI). Systems can be capable and poorly calibrated, or less capable and well-calibrated. LI captures a different signal than performance.
LI · Capability · Discriminant validity
Accessing the dataset
The ACAT dataset is open research. Multiple access paths are available.
Observatory (live charts)
Interactive visualization of the full dataset. Scatter plots, provider hierarchy, dimension means, and LI distribution. Updated continuously from the live data feed.
Open Observatory →
Google Sheets (CSV export)
The published dataset is available as a live Google Sheets export. Phase 1 records only. Conforms to ACAT Schema v5.2. Researcher IDs are anonymized; no PII is included.
Open Google Sheets →
Observability Garden
Session-level drill-down by provider, model, and dimension. Individual session traces and researcher annotations (where available).
Open Garden →
Lantern Room
Single-provider full audit view — every model, every dimension, every session. Designed for per-family deep research.
Open Lantern Room →
GitHub Repository
Platform code, prompt templates, scoring configs, and methodology documentation. All open under the project license. Contributions and methodological critique welcome.
Open GitHub →
Participate (run a session)
Researchers can contribute new sessions using the ACAT assessment tool. Sessions are reviewed before being added to the published dataset.
Take ACAT →
What this research is not
Responsible use of this data requires understanding its limits.
Not a ranking system
ACAT findings describe behavioral patterns. They are not a leaderboard. Provider and model names are included for research traceability, not ranking or competitive comparison.
Not a deployment recommendation
No finding in this dataset should be interpreted as a recommendation to deploy or avoid any AI system. This is behavioral observation research, not system certification.
Not a representative sample
Sessions are collected by a small pool of researchers. Provider representation in the dataset reflects researcher access and interest, not market share or importance.
Not production-validated
This research operates at TRL 2–3. Findings have not been replicated in independent labs. The methodology is open for external review and critique — that review process is ongoing.