Spaceship-Titanic, uploaded as a CSV. The agent strategist proposed 3 specs across model families; each one trained, then ran the same 6-validator battery; the best was promoted at AUC 0.8964 on the held-out split. Every number on this page is read from the run's audit chain — no staging.
A · Dataset
Kaggle MLE-Bench Lite competition
Columns
Uploaded as CSV. Mixed types — numeric (Age, FoodCourt, …), categorical (HomePlanet, Destination), free-text (Name, Cabin). Per-spec encoding is the strategist's choice, not the upload script's.
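As an illustration of a per-spec encoding choice: Cabin values in this dataset follow a deck/num/side pattern, so a spec can split the column into three features instead of treating it as one opaque string. A minimal sketch, assuming that pattern holds; the helper name `split_cabin` and the handling of missing values are illustrative assumptions, not the strategist's actual transform.

```python
from typing import Optional, Tuple


def split_cabin(cabin: Optional[str]) -> Tuple[Optional[str], Optional[int], Optional[str]]:
    """Split a 'deck/num/side' string into (deck, num, side).

    Hypothetical helper: missing or malformed values yield (None, None, None)
    so that a downstream imputer or encoder can decide what to do with them.
    """
    if not cabin:
        return (None, None, None)
    parts = cabin.split("/")
    # Expect exactly three parts with a numeric middle component.
    if len(parts) != 3 or not parts[1].isdigit():
        return (None, None, None)
    deck, num, side = parts
    return (deck, int(num), side)


print(split_cabin("B/0/P"))
```

Returning `None` for malformed values, rather than raising, keeps the decision about imputation with the encoder that each spec selects.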
B · Strategist
The strategist sees the dataset schema + prior-iteration results and proposes a batch of TabularSpec objects — model family, hyperparameters, categorical encoder, feature transforms. Validators run on every spec; only specs that clear honest-eval are eligible for promotion.
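The shape of a spec and the promotion rule can be sketched as below. This is a hedged illustration: the field names on `TabularSpec`, the `EvalResult` record, and the `pick_promoted` helper are all assumptions made for the example, and the real criteria for clearing honest-eval are not shown on this page, so they are reduced here to a single boolean.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class TabularSpec:
    """Hypothetical shape of one strategist proposal (field names assumed)."""
    spec_id: str
    model_family: str
    hyperparameters: Dict[str, Any] = field(default_factory=dict)
    categorical_encoder: str = "onehot"
    feature_transforms: List[str] = field(default_factory=list)


@dataclass(frozen=True)
class EvalResult:
    """Hypothetical per-spec outcome after training and validation."""
    spec_id: str
    holdout_auc: float
    cleared_honest_eval: bool


def pick_promoted(results: List[EvalResult]) -> Optional[EvalResult]:
    """Return the eligible result with the highest held-out AUC, if any."""
    eligible = [r for r in results if r.cleared_honest_eval]
    return max(eligible, key=lambda r: r.holdout_auc, default=None)


# Placeholder values for illustration only, not numbers from this run.
results = [
    EvalResult("spec-a", 0.80, cleared_honest_eval=True),
    EvalResult("spec-b", 0.90, cleared_honest_eval=False),
    EvalResult("spec-c", 0.85, cleared_honest_eval=True),
]
print(pick_promoted(results).spec_id)
```

In this sketch the highest-scoring spec is not promoted because it did not clear validation: eligibility is filtered first, and the metric is compared only among the survivors.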
C · Specs proposed
The agent picked three models with different inductive biases: two gradient-boosting implementations (LightGBM, XGBoost) and a non-boosting tree ensemble (Extra Trees), each paired with its own categorical strategy. Each spec is trained and validated independently.
spec_id: agent-lgbm-cabin-split-iter0
spec_id: agent-xgb-onehot-iter0
spec_id: agent-et-onehot-iter0
D · Validator matrix
Each cell is one validator run against one spec. PASS means the result stayed under the threshold and no concern was raised; WARN means findings were flagged; ERR means the validator itself raised an error (typically not enough holdout data to evaluate) and is reported as such rather than hidden.
| Spec | permutation fwer | randomized feature | shuffled label | secondary holdout | distribution shift | multi calibration |
|---|---|---|---|---|---|---|
| LightGBM · cabin split | PASS | PASS | PASS | ERR | WARN | WARN |
| XGBoost · onehot ★ | PASS | PASS | PASS | ERR | WARN | PASS |
| Extra Trees · onehot | PASS | PASS | PASS | ERR | WARN | WARN |
Each cell carries the validator's raw output. All findings are persisted in validation_findings and surfaced verbatim, with no rewriting.
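The three cell states can be sketched as a thin wrapper around a validator call. This is an assumption-laden illustration: the `run_validator` helper and the convention that a validator returns a list of finding strings are hypothetical, chosen only to show how PASS, WARN, and ERR differ.

```python
from typing import Callable, List


def run_validator(validator: Callable[[], List[str]]) -> str:
    """Map one validator run to a cell status.

    Hypothetical convention: the validator returns a list of findings,
    and raises if it cannot evaluate at all.
    """
    try:
        findings = validator()
    except Exception:
        # The validator itself failed, e.g. too little holdout data.
        return "ERR"
    # Any finding is a WARN; an empty list means nothing was flagged.
    return "WARN" if findings else "PASS"


def too_small_holdout() -> List[str]:
    raise ValueError("not enough holdout rows to evaluate")


print(run_validator(lambda: []))
print(run_validator(lambda: ["calibration gap in one subgroup"]))
print(run_validator(too_small_holdout))
```

Keeping ERR distinct from PASS matters: a validator that could not run has not cleared the spec, it has only failed to examine it.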
E · Findings — verbatim
F · Verdict
The headline metric is held-out AUC; AUC on the calibration split (the split used to fit the multicalibrator) is recorded separately. The promoted spec's model.joblib bundle is signed with an HMAC and stored under the run's artefact root, and the predict endpoint loads it directly.
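Signing and checking a serialized bundle can be sketched with Python's standard `hmac` module. The hash function (SHA-256), the key handling, and the helper names `sign_bundle` and `verify_bundle` are assumptions for illustration; the page does not state which digest or key scheme the run uses.

```python
import hashlib
import hmac


def sign_bundle(bundle: bytes, key: bytes) -> str:
    """Return a hex HMAC over the serialized bundle (SHA-256 assumed)."""
    return hmac.new(key, bundle, hashlib.sha256).hexdigest()


def verify_bundle(bundle: bytes, key: bytes, signature: str) -> bool:
    """Check a signature using a constant-time comparison."""
    expected = sign_bundle(bundle, key)
    return hmac.compare_digest(expected, signature)


# Placeholder bytes stand in for the contents of model.joblib.
key = b"example-secret-key"
bundle = b"serialized model bytes"
signature = sign_bundle(bundle, key)

print(verify_bundle(bundle, key, signature))
print(verify_bundle(bundle + b"tampered", key, signature))
```

Because joblib files are pickle-based and loading one can execute arbitrary code, a loader would typically run a check like `verify_bundle` on the raw bytes before deserialising them.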
Calibrator: skipped
When ≥ 2 specs clear honest-eval, the multicalibrator stacks them on the calibration split, equalising expected calibration error across protected subgroups before serving predictions.
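The quantity being equalised can be sketched as a per-subgroup expected calibration error (ECE). This is a minimal illustration using equal-width probability bins; the bin count, the function names, and the grouping scheme are assumptions, and the sketch only measures the per-group error rather than implementing the multicalibrator itself.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted mean of |mean predicted prob - observed positive rate| per bin."""
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Clamp so that p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_p = sum(p for p, _ in b) / len(b)
        frac_pos = sum(y for _, y in b) / len(b)
        ece += (len(b) / total) * abs(mean_p - frac_pos)
    return ece


def ece_by_group(
    probs: Sequence[float],
    labels: Sequence[int],
    groups: Sequence[Hashable],
    n_bins: int = 10,
) -> Dict[Hashable, float]:
    """Compute ECE separately for each subgroup label."""
    by_group = defaultdict(lambda: ([], []))
    for p, y, g in zip(probs, labels, groups):
        by_group[g][0].append(p)
        by_group[g][1].append(y)
    return {
        g: expected_calibration_error(ps, ys, n_bins)
        for g, (ps, ys) in by_group.items()
    }


# Toy data for illustration only: group "a" is well calibrated, group "b" is not.
per_group = ece_by_group(
    probs=[0.0, 1.0, 0.5, 0.5],
    labels=[0, 1, 1, 1],
    groups=["a", "a", "b", "b"],
)
print(per_group)
```

A large gap between groups, as in the toy data, is the kind of disparity that a multicalibration step aims to shrink.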
Audit chain
spec → train → validators → promote → model.joblib + HMAC
The same loop (agent strategist, six validators, honest-eval promotion, multicalibrated ensemble) runs against any tabular dataset you upload.