GNOSYS LABS
SNAPSHOT · ONE REAL RUN
run bbb3a1d8… · promoted agent-xgb-onehot-iter0

One Kaggle dataset. One run. The full audit chain.

Spaceship-Titanic, uploaded as a CSV. The agent strategist proposed 3 specs across model families; each one trained, then ran the same 6-validator battery; the best was promoted at AUC 0.8964 on the held-out split. Every number on this page is read from the run's audit chain — no staging.

1 · Propose → 2 · Train → 3 · Validate → 4 · Promote → 5 · Calibrate

Strategist: Claude Sonnet (agent)
Iterations: 1 of 8 max
Specs proposed: 3
Completed: 2026-05-22

A · Dataset

Spaceship-Titanic

Kaggle MLE-Bench Lite competition

Task: Binary classification
Target: Transported
Primary metric: AUC
Rows: 6,844
Features: 12
Classes: 2

Columns

Age FoodCourt RoomService ShoppingMall Spa VRDeck Cabin CryoSleep Destination HomePlanet Name VIP

Uploaded as CSV. Mixed types — numeric (Age, FoodCourt, …), categorical (HomePlanet, Destination), free-text (Name, Cabin). Per-spec encoding is the strategist's choice, not the upload script's.
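As an illustration of a per-spec feature transform, the sketch below splits a Cabin value into three parts. It assumes the public Spaceship-Titanic convention of `deck/num/side` strings (for example `B/0/P`) and assumes that the "cabin split" label refers to this kind of split; the helper name `split_cabin` is hypothetical and not taken from the run.

```python
from typing import Optional, Tuple

CabinParts = Tuple[Optional[str], Optional[int], Optional[str]]


def split_cabin(cabin: object) -> CabinParts:
    """Split a 'deck/num/side' cabin string into (deck, num, side).

    Missing or malformed values (None, NaN, empty string, wrong shape)
    yield (None, None, None) so downstream encoders can treat them as
    missing.
    """
    if not isinstance(cabin, str):
        return (None, None, None)
    parts = cabin.split("/")
    if len(parts) != 3 or not parts[1].isdigit():
        return (None, None, None)
    deck, num, side = parts
    return (deck, int(num), side)


print(split_cabin("B/0/P"))
print(split_cabin(None))
```

Splitting first and encoding afterwards keeps the deck and side as low-cardinality categoricals instead of one high-cardinality free-text column.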

B · Strategist

Agent loop, Sonnet

The strategist sees the dataset schema + prior-iteration results and proposes a batch of TabularSpec objects — model family, hyperparameters, categorical encoder, feature transforms. Validators run on every spec; only specs that clear honest-eval are eligible for promotion.
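A minimal sketch of what such a spec object could look like. The page names the `TabularSpec` type and its four ingredients, but every field name below is an assumption made for illustration; the example values are copied from the Extra Trees card in section C.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class TabularSpec:
    """Hypothetical shape of one proposed spec (field names are assumed)."""

    spec_id: str
    model_family: str
    categorical_encoder: str
    hyperparameters: Dict[str, Any] = field(default_factory=dict)
    feature_transforms: List[str] = field(default_factory=list)


# Values as listed on the Extra Trees card of this run.
spec = TabularSpec(
    spec_id="agent-et-onehot-iter0",
    model_family="extra_trees",
    categorical_encoder="onehot",
    hyperparameters={
        "C": 1.0,
        "n_estimators": 500,
        "min_samples_leaf": 2,
        "min_samples_split": 4,
        "max_features": "sqrt",
        "random_state": 7,
    },
)

print(spec.spec_id, len(spec.hyperparameters))
```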

Mode: agent
Batch this iter: 3
LLM tokens: 36,786
Iter duration: 368.5s
Calibrator: not fitted
Stop condition: honest-eval clean + AUC ≥ 0.85
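One possible reading of the stop condition, sketched as a predicate. The 0.85 floor comes from the page; the meaning of "clean" does not. The sketch assumes that "clean" means no `honest_eval.*` validator returned WARN and that ERR counts as "could not evaluate" rather than as a failure, which appears consistent with this run's promotion but is not confirmed by the page. The function name is hypothetical.

```python
from typing import Dict

AUC_FLOOR = 0.85  # from the stop condition shown above


def meets_stop_condition(auc: float, statuses: Dict[str, str]) -> bool:
    """Return True when a spec satisfies the assumed stop condition."""
    honest = {
        name: status
        for name, status in statuses.items()
        if name.startswith("honest_eval.")
    }
    # Assumption: ERR is not blocking; only WARN makes honest-eval "unclean".
    clean = all(status != "WARN" for status in honest.values())
    return clean and auc >= AUC_FLOOR


# Statuses as reported for agent-xgb-onehot-iter0 in this run.
xgb_statuses = {
    "distribution_shift": "WARN",
    "honest_eval.permutation_fwer": "PASS",
    "honest_eval.randomized_feature": "PASS",
    "honest_eval.secondary_holdout": "ERR",
    "honest_eval.shuffled_label": "PASS",
    "multi_calibration": "PASS",
}

print(meets_stop_condition(0.8964, xgb_statuses))
```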

C · Specs proposed

3 specs, one batch.

The agent picked three model families with different inductive biases — gradient boosting (LightGBM, XGBoost) plus a non-boosting tree ensemble (Extra Trees) — and three categorical strategies. Each spec is trained and validated independently.

LightGBM · cabin split · agent-lgbm-cabin-split-iter0 validated · AUC 0.8858
duration 193.8s · findings 6
hyperparameters (11)
  C: 1.0
  n_estimators: 600
  num_leaves: 31
  learning_rate: 0.05
  subsample: 0.8
  colsample_bytree: 0.8
  min_child_samples: 20
  reg_alpha: 0.1
  reg_lambda: 1.0
  random_state: 42
  n_jobs: -1

Extra Trees · onehot · agent-et-onehot-iter0 validated · AUC 0.8728
duration 23.0s · findings 6
hyperparameters (6)
  C: 1.0
  n_estimators: 500
  min_samples_leaf: 2
  min_samples_split: 4
  max_features: sqrt
  random_state: 7

D · Validator matrix

Six adversarial validators × every spec.

Each cell is one validator run against one spec. PASS means under threshold (no concern raised), WARN means findings flagged, ERR means the validator itself raised (typically not enough holdout to evaluate — rare, but honest).

Spec                   | permutation fwer | randomized feature | shuffled label | secondary holdout | distribution shift | multi calibration
LightGBM · cabin split | PASS             | PASS               | PASS           | ERR               | WARN               | WARN
XGBoost · onehot       | PASS             | PASS               | PASS           | ERR               | WARN               | PASS
Extra Trees · onehot   | PASS             | PASS               | PASS           | ERR               | WARN               | WARN

All findings are persisted in validation_findings and surfaced verbatim, with no rewriting.
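Most persisted findings in this run are single lines of space-separated `key=value` pairs (the ERR entries are free-text exception messages instead). A small parser like the sketch below is enough to read them back into numbers; the helper names are hypothetical, and the sample line is the multi_calibration finding for the XGBoost spec.

```python
from typing import Dict, Union

Value = Union[float, str]


def _coerce(raw: str) -> Value:
    """Turn '+0.1543' or '+2.13%' into a float; leave 'None' etc. as text."""
    try:
        return float(raw.rstrip("%"))
    except ValueError:
        return raw


def parse_finding(line: str) -> Dict[str, Value]:
    """Parse 'k1=v1 k2=v2' into a dict, skipping tokens without '='."""
    parsed: Dict[str, Value] = {}
    for token in line.split():
        key, sep, raw = token.partition("=")
        if sep:
            parsed[key] = _coerce(raw)
    return parsed


finding = parse_finding(
    "ece=0.0383 mce=0.1605 worst_slice=None worst_slice_ece=0.0000 threshold=0.050"
)
# An ECE below the threshold corresponds to the PASS shown for this spec.
print(finding["ece"] < finding["threshold"])
```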

E · Findings — verbatim

Every validator output, persisted.

LightGBM · cabin split · agent-lgbm-cabin-split-iter0 validated · AUC 0.8858
  • WARN
    distribution_shift
    gap=+0.1543 concept_share=+0.00% label_kl=0.0000 covariate=0.024
  • PASS
    honest_eval.permutation_fwer
    observed=+0.7982 p=0.0476 n_null=20
  • PASS
    honest_eval.randomized_feature
    observed=+0.7982 randomized_mean=+0.4994 randomized_max=+0.5651 retained=+0.00%
  • ERR
    honest_eval.secondary_holdout
    validator raised ValueError: Found array with 0 sample(s) (shape=(0, 13)) while a minimum of 1 is required by LGBMClassifier.
  • PASS
    honest_eval.shuffled_label
    observed=+0.7982 chance=+0.5037 shuffled_mean=+0.5035 shuffled_max=+0.5292 retained=+0.00%
  • WARN
    multi_calibration
    ece=0.0609 mce=0.2135 worst_slice=None worst_slice_ece=0.0000 threshold=0.050
XGBoost · onehot · agent-xgb-onehot-iter0 validated · AUC 0.8964
  • WARN
    distribution_shift
    gap=+0.1063 concept_share=+0.00% label_kl=0.0000 covariate=0.018
  • PASS
    honest_eval.permutation_fwer
    observed=+0.8165 p=0.0476 n_null=20
  • PASS
    honest_eval.randomized_feature
    observed=+0.8165 randomized_mean=+0.4995 randomized_max=+0.5651 retained=+0.00%
  • ERR
    honest_eval.secondary_holdout
    validator raised ValueError: Found empty input array (e.g., `y_true` or `y_pred`) while a minimum of 1 sample is required.
  • PASS
    honest_eval.shuffled_label
    observed=+0.8165 chance=+0.5037 shuffled_mean=+0.5103 shuffled_max=+0.5563 retained=+2.13%
  • PASS
    multi_calibration
    ece=0.0383 mce=0.1605 worst_slice=None worst_slice_ece=0.0000 threshold=0.050
Extra Trees · onehot · agent-et-onehot-iter0 validated · AUC 0.8728
  • WARN
    distribution_shift
    gap=+0.0674 concept_share=+0.00% label_kl=0.0000 covariate=0.020
  • PASS
    honest_eval.permutation_fwer
    observed=+0.7749 p=0.0476 n_null=20
  • PASS
    honest_eval.randomized_feature
    observed=+0.7749 randomized_mean=+0.4876 randomized_max=+0.5863 retained=+0.00%
  • ERR
    honest_eval.secondary_holdout
    validator raised ValueError: Found array with 0 sample(s) (shape=(0, 33)) while a minimum of 1 is required by ExtraTreesClassifier.
  • PASS
    honest_eval.shuffled_label
    observed=+0.7749 chance=+0.5037 shuffled_mean=+0.4923 shuffled_max=+0.5841 retained=+0.00%
  • WARN
    multi_calibration
    ece=0.0748 mce=0.1479 worst_slice=None worst_slice_ece=0.0000 threshold=0.050
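The permutation findings all report p=0.0476 with n_null=20. That is the value the standard add-one permutation estimate gives when none of the 20 null scores reaches the observed score: (0 + 1) / (20 + 1) = 1/21 ≈ 0.0476, the smallest p-value reachable with 20 permutations. Whether the validator uses exactly this estimator is an assumption; the sketch only shows that the reported numbers are consistent with it.

```python
from typing import Sequence


def permutation_p_value(observed: float, null_scores: Sequence[float]) -> float:
    """Add-one permutation p-value: (1 + #null >= observed) / (1 + n_null)."""
    exceed = sum(1 for score in null_scores if score >= observed)
    return (1 + exceed) / (1 + len(null_scores))


# 20 placeholder null scores near chance; none reaches the observed 0.8165.
null_scores = [0.50] * 20
p = permutation_p_value(0.8165, null_scores)
print(round(p, 4))  # 0.0476
```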

F · Verdict

Best spec validated and promoted.

Calibrator: skipped

When ≥ 2 specs clear honest-eval, the multicalibrator stacks them on the calibration split, equalising expected calibration error across protected subgroups before serving predictions.
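For reference, expected calibration error can be sketched as below. This is the common equal-width binning definition with 10 bins; the binning scheme actually used by the multi_calibration validator is not stated on this page, so treat both the bin count and the function name as assumptions. A multicalibration check would compute the same quantity per subgroup and compare the worst slice against the threshold.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    y_true: Sequence[int], y_prob: Sequence[float], n_bins: int = 10
) -> float:
    """Weighted mean |accuracy - confidence| over equal-width probability bins."""
    n = len(y_true)
    if n == 0 or n != len(y_prob):
        raise ValueError("y_true and y_prob must be non-empty and equal length")
    bins: List[List[Tuple[int, float]]] = [[] for _ in range(n_bins)]
    for label, prob in zip(y_true, y_prob):
        index = min(int(prob * n_bins), n_bins - 1)  # prob == 1.0 goes to last bin
        bins[index].append((label, prob))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        accuracy = sum(label for label, _ in members) / len(members)
        confidence = sum(prob for _, prob in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - confidence)
    return ece


# Toy data: predictions are 0.1 too cautious in both bins.
ece = expected_calibration_error([1, 1, 0, 0], [0.9, 0.9, 0.1, 0.1])
print(ece > 0.050)  # compared against the 0.050 threshold used in this run
```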

Audit chain

spec → train → validators → promote → model.joblib + HMAC
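The last link in the chain pairs the serialized model with an HMAC so that later tampering is detectable. A minimal sketch using Python's standard library follows; the choice of SHA-256, the key handling, and the function names are assumptions, since the page only states that an HMAC accompanies model.joblib.

```python
import hashlib
import hmac


def sign_artifact(artifact_bytes: bytes, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over the artifact bytes."""
    return hmac.new(key, artifact_bytes, hashlib.sha256).hexdigest()


def verify_artifact(artifact_bytes: bytes, key: bytes, expected_tag: str) -> bool:
    """Check a tag in constant time to avoid timing side channels."""
    return hmac.compare_digest(sign_artifact(artifact_bytes, key), expected_tag)


# Placeholder bytes stand in for the contents of model.joblib.
artifact = b"serialized-model-bytes"
secret_key = b"example-key-not-for-production"
tag = sign_artifact(artifact, secret_key)

print(verify_artifact(artifact, secret_key, tag))
print(verify_artifact(artifact + b"x", secret_key, tag))
```

Unlike a plain hash, the tag cannot be recomputed by someone who modifies the file but does not hold the key.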

Run this loop against your own CSV.

The same loop shown above (agent strategist, six validators, honest-eval promotion, multicalibrated ensemble) runs against any tabular dataset you upload.