SNAPSHOT · ONE REAL RUN

run bbb3a1d8… · promoted agent-xgb-onehot-iter0

One Kaggle dataset. One run. The full audit chain.

Spaceship-Titanic, uploaded as a CSV. The agent strategist proposed 3 specs across model families; each was trained and then run through the same 6-validator battery; the best was promoted at AUC 0.8964 on the held-out split. Every number on this page is read from the run's audit chain, with no staging.

1 · Propose

2 · Train

3 · Validate

4 · Promote

5 · Calibrate

Strategist

Claude Sonnet (agent)

Iterations

1 of 8 max

Specs proposed

3

Completed

2026-05-22

A · Dataset

Spaceship-Titanic

Kaggle MLE-Bench Lite competition

Task: Binary classification
Target: Transported
Primary metric: AUC
Rows: 6,844
Features: 12
Classes: 2

Columns

Age FoodCourt RoomService ShoppingMall Spa VRDeck Cabin CryoSleep Destination HomePlanet Name VIP

Uploaded as CSV. Mixed types — numeric (Age, FoodCourt, …), categorical (HomePlanet, Destination), free-text (Name, Cabin). Per-spec encoding is the strategist's choice, not the upload script's.

B · Strategist

Agent loop, Sonnet

The strategist sees the dataset schema + prior-iteration results and proposes a batch of TabularSpec objects — model family, hyperparameters, categorical encoder, feature transforms. Validators run on every spec; only specs that clear honest-eval are eligible for promotion.
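As a rough illustration of what one proposed spec might carry, here is a minimal sketch of a TabularSpec-like record. The page only names the four ingredients (model family, hyperparameters, categorical encoder, feature transforms); the field names, types, and defaults below are assumptions, not the product's actual schema.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class TabularSpec:
    """Hypothetical shape of one proposed spec; every field name is an assumption."""
    spec_id: str
    model_family: str
    hyperparameters: dict[str, Any] = field(default_factory=dict)
    categorical_encoder: str = "onehot"
    feature_transforms: tuple[str, ...] = ()


# Example values taken from the promoted spec shown on this page.
spec = TabularSpec(
    spec_id="agent-xgb-onehot-iter0",
    model_family="xgboost",
    hyperparameters={"n_estimators": 500, "max_depth": 6, "learning_rate": 0.05},
    categorical_encoder="onehot",
)
print(spec.spec_id, spec.model_family, spec.categorical_encoder)
```

Keeping the record immutable is one way to make sure the spec that was validated is the same spec that gets promoted.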

Mode: agent
Batch this iter: 3
LLM tokens: 36,786
Iter duration: 368.5s
Calibrator: not fitted
Stop condition: honest-eval clean + AUC ≥ 0.85
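The stop condition can be read as a filter followed by a max. The sketch below assumes each spec arrives with an AUC and a single boolean saying whether it cleared honest-eval; how that boolean is derived from individual validator statuses is not stated on this page, so the flags in the example are illustrative only. `SpecResult` and `pick_promotable` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SpecResult:
    spec_id: str
    auc: float
    honest_eval_clean: bool  # assumed to be computed upstream; rule not given in the source


def pick_promotable(results, min_auc=0.85):
    """Return the best eligible result, or None if no spec meets the stop condition."""
    eligible = [r for r in results if r.honest_eval_clean and r.auc >= min_auc]
    return max(eligible, key=lambda r: r.auc, default=None)


# AUCs are the ones reported on this page; the True flags are an assumption.
results = [
    SpecResult("agent-lgbm-cabin-split-iter0", 0.8858, True),
    SpecResult("agent-xgb-onehot-iter0", 0.8964, True),
    SpecResult("agent-et-onehot-iter0", 0.8728, True),
]
best = pick_promotable(results)
print(best.spec_id if best else "keep iterating")
```

When nothing is eligible the function returns None, which corresponds to the loop continuing up to its iteration cap.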

C · Specs proposed

3 specs, one batch.

The agent picked three models with different inductive biases: two gradient-boosting libraries (LightGBM, XGBoost) and one non-boosting tree ensemble (Extra Trees). The spec labels show two categorical strategies across them: cabin split for LightGBM, onehot for the other two. Each spec is trained and validated independently.
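To make the two encoding labels concrete, here is a minimal pure-Python sketch. It assumes Cabin values follow the deck/num/side pattern given in the competition's data description, and that "cabin split" means breaking that string into three features; what the agent's transform actually does is not shown on this page. The output column names are hypothetical.

```python
def split_cabin(cabin):
    """Split a 'deck/num/side' string into three parts; missing cabins yield None values."""
    if not cabin:
        return {"CabinDeck": None, "CabinNum": None, "CabinSide": None}
    deck, num, side = cabin.split("/")
    return {"CabinDeck": deck, "CabinNum": int(num), "CabinSide": side}


def one_hot(values):
    """One-hot encode a list of categorical values; None becomes an all-zero row."""
    categories = sorted({v for v in values if v is not None})
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows


print(split_cabin("B/0/P"))
print(one_hot(["Earth", "Europa", None, "Earth"]))
```

Splitting the cabin first keeps the one-hot width small: deck and side have a handful of levels, whereas the raw Cabin string is nearly unique per row.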

spec_id

agent-lgbm-cabin-split-iter0

validated

LightGBM · cabin split

0.8858 AUC

duration 193.8s · findings 6

hyperparameters (11)

C: 1.0
n_estimators: 600
num_leaves: 31
learning_rate: 0.05
subsample: 0.8
colsample_bytree: 0.8
min_child_samples: 20
reg_alpha: 0.1
reg_lambda: 1.0
random_state: 42
n_jobs: -1

spec_id

agent-xgb-onehot-iter0

validated

XGBoost · onehot

0.8964 AUC

duration 52.0s · findings 6

hyperparameters (13)

C: 1.0
n_estimators: 500
max_depth: 6
learning_rate: 0.05
subsample: 0.8
colsample_bytree: 0.7
min_child_weight: 3
gamma: 0.1
reg_alpha: 0.1
reg_lambda: 1.0
tree_method: hist
random_state: 0
n_jobs: -1

★ promoted

spec_id

agent-et-onehot-iter0

validated

Extra Trees · onehot

0.8728 AUC

duration 23.0s · findings 6

hyperparameters (6)

C: 1.0
n_estimators: 500
min_samples_leaf: 2
min_samples_split: 4
max_features: sqrt
random_state: 7

D · Validator matrix

Six adversarial validators × every spec.

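Each cell is one validator run against one spec. PASS means the result is under threshold (no concern raised). WARN means findings were flagged. ERR means the validator itself raised an exception, typically because there was not enough holdout data to evaluate; the error is reported rather than hidden. In this run, secondary holdout returned ERR for all three specs.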

Spec	permutation fwer	randomized feature	shuffled label	secondary holdout	distribution shift	multi calibration
LightGBM · cabin split	PASS	PASS	PASS	ERR	WARN	WARN
XGBoost · onehot ★	PASS	PASS	PASS	ERR	WARN	PASS
Extra Trees · onehot	PASS	PASS	PASS	ERR	WARN	WARN

All findings are persisted in validation_findings and surfaced verbatim, with no rewriting.
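The PASS / WARN / ERR semantics described above can be sketched as a thin wrapper around a validator call. This is an illustration under assumptions: `run_validator` is a hypothetical name, and the real validators may compare more than one number against their thresholds. The example scores are the ECE values from the multi_calibration findings on this page, with its threshold of 0.050.

```python
def run_validator(validator, threshold):
    """Map one validator call to (status, detail).

    `validator` is a zero-argument callable returning a numeric score.
    """
    try:
        score = validator()
    except Exception as exc:
        # The validator itself failed; report the error instead of guessing a status.
        return "ERR", f"validator raised {type(exc).__name__}: {exc}"
    status = "PASS" if score < threshold else "WARN"
    return status, f"score={score:.4f} threshold={threshold:.3f}"


def empty_holdout():
    """Stand-in for a validator that receives a zero-row holdout."""
    raise ValueError("Found array with 0 sample(s)")


print(run_validator(lambda: 0.0383, 0.050))  # XGBoost ECE from this page
print(run_validator(lambda: 0.0609, 0.050))  # LightGBM ECE from this page
print(run_validator(empty_holdout, 0.050))
```

Catching the exception at this level is what lets an ERR cell sit in the matrix next to real results instead of aborting the whole battery.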

E · Findings — verbatim

Every validator output, persisted.

LightGBM · cabin split · agent-lgbm-cabin-split-iter0 validated · AUC 0.8858

WARN · distribution_shift
gap=+0.1543 concept_share=+0.00% label_kl=0.0000 covariate=0.024

PASS · honest_eval.permutation_fwer
observed=+0.7982 p=0.0476 n_null=20

PASS · honest_eval.randomized_feature
observed=+0.7982 randomized_mean=+0.4994 randomized_max=+0.5651 retained=+0.00%

ERR · honest_eval.secondary_holdout
validator raised ValueError: Found array with 0 sample(s) (shape=(0, 13)) while a minimum of 1 is required by LGBMClassifier.

PASS · honest_eval.shuffled_label
observed=+0.7982 chance=+0.5037 shuffled_mean=+0.5035 shuffled_max=+0.5292 retained=+0.00%

WARN · multi_calibration
ece=0.0609 mce=0.2135 worst_slice=None worst_slice_ece=0.0000 threshold=0.050
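A note on the permutation_fwer figures: p=0.0476 with n_null=20 equals 1/21, which is what the common add-one permutation estimator returns when none of the null scores reaches the observed score. It is also the smallest p-value that estimator can produce with 20 null draws. Whether the validator uses exactly this estimator, and how it applies the family-wise correction, is an assumption; the sketch below only reproduces the arithmetic.

```python
def permutation_p_value(observed, null_scores):
    """Add-one permutation p-value: (1 + #null >= observed) / (1 + n_null)."""
    exceed = sum(1 for s in null_scores if s >= observed)
    return (1 + exceed) / (1 + len(null_scores))


# Placeholder null scores: 20 draws, all below the observed value.
null_scores = [0.50] * 20
p = permutation_p_value(0.7982, null_scores)
print(round(p, 4))  # 0.0476
```

Because 1/21 is the floor, all three specs report the same p even though their observed scores differ; more null draws would be needed to separate them.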

XGBoost · onehot · agent-xgb-onehot-iter0 validated · AUC 0.8964

WARN · distribution_shift
gap=+0.1063 concept_share=+0.00% label_kl=0.0000 covariate=0.018

PASS · honest_eval.permutation_fwer
observed=+0.8165 p=0.0476 n_null=20

PASS · honest_eval.randomized_feature
observed=+0.8165 randomized_mean=+0.4995 randomized_max=+0.5651 retained=+0.00%

ERR · honest_eval.secondary_holdout
validator raised ValueError: Found empty input array (e.g., `y_true` or `y_pred`) while a minimum of 1 sample is required.

PASS · honest_eval.shuffled_label
observed=+0.8165 chance=+0.5037 shuffled_mean=+0.5103 shuffled_max=+0.5563 retained=+2.13%

PASS · multi_calibration
ece=0.0383 mce=0.1605 worst_slice=None worst_slice_ece=0.0000 threshold=0.050
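For readers unfamiliar with the ece and mce fields, here is a minimal sketch of the usual binned definitions: expected calibration error is the size-weighted average gap between mean predicted probability and observed positive rate per bin, and maximum calibration error is the largest such gap. The bin count and equal-width binning are assumptions; the validator's actual binning is not stated on this page.

```python
def calibration_errors(probs, labels, n_bins=10):
    """Return (ece, mce) for binary predictions using equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    n = len(probs)
    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        gap = abs(frac_pos - mean_p)
        ece += len(members) / n * gap  # weight each bin by its share of samples
        mce = max(mce, gap)
    return ece, mce


# Toy data: confident and correct, but off by 0.1 in each occupied bin.
ece, mce = calibration_errors([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1])
print(ece > 0.050)  # compare against the page's threshold of 0.050
```

A model can rank well (high AUC) and still fail this check, which is why the matrix reports calibration separately from the headline metric.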

Extra Trees · onehot · agent-et-onehot-iter0 validated · AUC 0.8728

WARN · distribution_shift
gap=+0.0674 concept_share=+0.00% label_kl=0.0000 covariate=0.020

PASS · honest_eval.permutation_fwer
observed=+0.7749 p=0.0476 n_null=20

PASS · honest_eval.randomized_feature
observed=+0.7749 randomized_mean=+0.4876 randomized_max=+0.5863 retained=+0.00%

ERR · honest_eval.secondary_holdout
validator raised ValueError: Found array with 0 sample(s) (shape=(0, 33)) while a minimum of 1 is required by ExtraTreesClassifier.

PASS · honest_eval.shuffled_label
observed=+0.7749 chance=+0.5037 shuffled_mean=+0.4923 shuffled_max=+0.5841 retained=+0.00%

WARN · multi_calibration
ece=0.0748 mce=0.1479 worst_slice=None worst_slice_ece=0.0000 threshold=0.050
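The source does not define the retained field. One reading that is consistent with the shuffled_label rows above is the share of above-chance signal that survives after the labels are shuffled, floored at zero. Under that assumption the XGBoost row gives about 2.1% (the page reports +2.13%; the difference is within the rounding of the displayed inputs), and the LightGBM and Extra Trees rows floor to 0% because their shuffled means sit below chance. `retained_share` is a hypothetical name and the formula is a guess, not the validator's documented definition.

```python
def retained_share(observed, chance, degraded_mean):
    """Fraction of above-chance signal that survives the degradation, floored at 0."""
    signal = observed - chance
    if signal <= 0:
        return 0.0  # no above-chance signal to retain
    return max(0.0, (degraded_mean - chance) / signal)


# Inputs copied from the shuffled_label rows on this page.
xgb = retained_share(0.8165, 0.5037, 0.5103)
lgbm = retained_share(0.7982, 0.5037, 0.5035)
et = retained_share(0.7749, 0.5037, 0.4923)
print(f"{xgb:.2%} {lgbm:.2%} {et:.2%}")
```

A value near zero is the desired outcome: it means the model's score collapses to chance once the link between features and labels is broken.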

F · Verdict

Best spec validated and promoted.

★ promoted agent-xgb-onehot-iter0

0.8964 AUC on held-out split

Headline metric is held-out AUC; the calibration split AUC (used to fit the multicalibrator) was preserved separately. The promoted spec's model.joblib bundle is signed with HMAC and stored under the run's artefact root — the predict endpoint loads it directly.
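A minimal sketch of signing and verifying an artefact with HMAC, using only the Python standard library. The hash function (SHA-256), the key handling, and the function names are assumptions; the page says only that the bundle is HMAC-signed.

```python
import hashlib
import hmac


def sign_bundle(bundle_bytes: bytes, key: bytes) -> str:
    """Return a hex HMAC tag over the serialized bundle (SHA-256 is an assumption)."""
    return hmac.new(key, bundle_bytes, hashlib.sha256).hexdigest()


def verify_bundle(bundle_bytes: bytes, key: bytes, expected_hex: str) -> bool:
    """Constant-time comparison of the recomputed tag against the stored one."""
    return hmac.compare_digest(sign_bundle(bundle_bytes, key), expected_hex)


key = b"demo-key-not-for-production"          # hypothetical key
bundle = b"stand-in for model.joblib bytes"   # placeholder payload
tag = sign_bundle(bundle, key)

print(verify_bundle(bundle, key, tag))
print(verify_bundle(bundle + b"tampered", key, tag))
```

Verifying the tag before deserializing matters because joblib files are pickle-based, and loading an untrusted pickle can execute arbitrary code.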

Calibrator

skipped

When ≥ 2 specs clear honest-eval, the multicalibrator stacks them on the calibration split, equalising expected calibration error across protected subgroups before serving predictions.

Audit chain

spec → train → validators → promote → model.joblib + HMAC

Run this loop against your own CSV.