Closes the loop without the human in the middle.
ML teams are bottlenecked by humans doing computational work — propose, run, measure, iterate, dressed up as judgment. Gnosys runs that loop. Generation is cheap. Trusted measurement is not — and the validation record is the asset.
For frontier labs and ML product teams where researcher headcount is the binding constraint.
01 · Thesis
Naive LLM-driven research loops collapse to the same failure mode: finding patterns that don't exist, overfitting aggressively, mistaking selection bias for signal. MLE-STAR (Table 5) shows the AUC drop directly — 0.803 to 0.734 when the downstream checker is removed. Sakana's AI Scientist and DeepMind's FunSearch validated that the closed-loop category is real. The remaining work is the system that exploits the capability without grading its own homework.
AutoML tunes hyperparameters; Sakana writes papers; FunSearch evolves programs. Each uses one judge — usually itself — to decide what's real. Selection-bias correction is ad-hoc or unstated. Negative results live in one researcher's head.
Six independent statistical validators run against every promoted spec. AST-validated code generation. Distribution-shift decomposition. Multi-calibrated ensembles. Every cross-agent message a typed, frozen contract; SHA-256 audit chain back to spec.
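One way to picture a typed, frozen contract with a SHA-256 chain behind it is the minimal sketch below. It is illustrative only: the `SpecMessage` fields, the `GENESIS` value, and the canonical-JSON encoding are assumptions made for the example, not the platform's actual schema or hashing scheme.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

GENESIS = "0" * 64  # hypothetical starting hash for an empty chain


@dataclass(frozen=True)
class SpecMessage:
    """Hypothetical cross-agent message; field names are illustrative."""
    spec_id: str
    model_family: str
    cv_folds: int


def link_hash(prev_hash: str, message: SpecMessage) -> str:
    """Hash the previous link together with a canonical JSON encoding."""
    payload = json.dumps(asdict(message), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def build_chain(messages):
    """Return one hash per message, each committing to everything before it."""
    hashes, prev = [], GENESIS
    for msg in messages:
        prev = link_hash(prev, msg)
        hashes.append(prev)
    return hashes


def verify_chain(messages, hashes) -> bool:
    """Recompute the chain and compare it with the recorded hashes."""
    return build_chain(messages) == list(hashes)
```

Because each hash covers the previous one, changing any earlier message changes every later hash, which is what lets a shipped model be traced back to its spec. Freezing the dataclass means a message cannot be mutated in place after it has been hashed.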
02 · The loop
Four phases. One typed contract between them. Every step a row in the validation record.
Strategist picks the cohort. HP sweep, LLM agent that reads prior-round outcomes, sandboxed code-exec when curated specs aren't enough — or all three blended on one loop.
Executor fits each spec under a wall-clock timeout. Feature transforms and categorical encoding fit per-spec, persist in the joblib bundle, replay at predict time. Code paths cross a 7-phase AST validator first.
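The seven phases of the AST validator are not spelled out on this page. As a rough illustration of what a single phase of that kind could look like, the sketch below parses generated code with Python's standard `ast` module and rejects it on a deny-list match. The `BANNED_MODULES` and `BANNED_CALLS` sets and the `validate_source` name are hypothetical.

```python
import ast

# Hypothetical policy; the platform's actual seven phases are not described here.
BANNED_MODULES = {"os", "subprocess", "socket", "shutil"}
BANNED_CALLS = {"eval", "exec", "compile", "__import__", "open"}


def validate_source(source: str) -> list[str]:
    """Return a list of violations; an empty list means the code passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]

    violations = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in BANNED_MODULES:
                    violations.append(f"line {node.lineno}: import {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in BANNED_MODULES:
                violations.append(f"line {node.lineno}: from {node.module} import ...")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BANNED_CALLS:
                violations.append(f"line {node.lineno}: call to {node.func.id}()")
    return violations
```

A static deny-list like this is a pre-filter, not a sandbox: it only catches direct names, so it would sit in front of process-level isolation and the wall-clock timeout rather than replace them.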
Six independent statistical validators on every spec. No LLM in the validator loop. Multi-calibrated ensembles. Distribution-shift attribution. The only place the platform's judgment is trusted.
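As an illustration of an LLM-free validator, here is a minimal sketch of a shuffled-label check in plain Python. The idea: refit the same spec on training labels that have been randomly permuted and confirm that almost none of the holdout lift survives. The output keys mirror the `honest_eval.shuffled_label` line in the run log further down this page. The formulas `lift = AUC - 0.5` and `retained = shuffled_lift / observed_lift` are inferred from those logged numbers rather than confirmed. The `max_retained` threshold and the toy `knn_fit_predict` model are assumptions for the example.

```python
import random


def auc(labels, scores):
    """Rank-based AUC (Mann-Whitney U), counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def knn_fit_predict(train_x, train_y, test_x, k=15):
    """Toy 1-D k-nearest-neighbour scorer, used only to exercise the check."""
    scores = []
    for x in test_x:
        nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
        scores.append(sum(train_y[i] for i in nearest) / k)
    return scores


def shuffled_label_check(fit_predict, train_x, train_y, test_x, test_y,
                         n_shuffles=20, max_retained=0.1, seed=0):
    """Compare lift over chance with true labels vs. shuffled training labels."""
    rng = random.Random(seed)
    observed_lift = auc(test_y, fit_predict(train_x, train_y, test_x)) - 0.5
    shuffled_lifts = []
    for _ in range(n_shuffles):
        fake_y = list(train_y)
        rng.shuffle(fake_y)  # break any real link between features and labels
        shuffled_lifts.append(auc(test_y, fit_predict(train_x, fake_y, test_x)) - 0.5)
    shuffled_lift = sum(shuffled_lifts) / n_shuffles
    # Signed ratio: a spec with no observed lift cannot pass.
    retained = shuffled_lift / observed_lift if observed_lift > 0 else float("inf")
    return {"observed_lift": observed_lift, "shuffled_lift": shuffled_lift,
            "retained": retained, "passed": retained <= max_retained}
```

A spec that keeps a large share of its lift after the labels are scrambled is picking up leakage or selection artefacts rather than signal, which is the failure this kind of check is meant to catch.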
Survivors deduped by output correlation, ensembled, run through MCGrad subgroup multicalibration. Predict endpoint, downloadable model card, signed audit chain. Whatever ships, ships with the receipt.
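Deduping by output correlation can be sketched as a greedy pass: rank survivors by validation score, then keep a spec only if its holdout predictions are not near-duplicates of something already kept. The `threshold` value, the input layout, and the unweighted averaging below are assumptions for illustration. The MCGrad multicalibration step is not shown.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length prediction vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant predictions carry no correlation information
    return cov / math.sqrt(var_a * var_b)


def dedupe_by_correlation(candidates, threshold=0.98):
    """Greedy dedupe: keep the best-scoring spec from each correlated cluster.

    `candidates` maps spec name -> (validation_score, holdout_predictions).
    """
    ranked = sorted(candidates.items(), key=lambda kv: kv[1][0], reverse=True)
    kept = []
    for name, (_score, preds) in ranked:
        if all(pearson(preds, candidates[k][1]) < threshold for k in kept):
            kept.append(name)
    return kept


def average_ensemble(candidates, kept):
    """Unweighted mean of the surviving specs' predictions."""
    columns = [candidates[name][1] for name in kept]
    return [sum(col) / len(col) for col in zip(*columns)]
```

Deduping before ensembling matters because two specs with near-identical outputs add weight to the average without adding information.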
POST /v1/runs/{id}/predict
GET /v1/runs/{id}/model_card
POST /v1/runs/{id}/submission
Every cross-agent message is a typed, frozen contract. SHA-256 audit chain back to spec.
03 · What survives the loop
A slice from the live research database. Adversarial validation is supposed to reject — that's the work. What's left at the bottom is what we'd ship.
Numbers from the research validation record. Replay one experiment going through the loop →
04 · Current status · Q2 2026
The bar is honest: only the loop is autonomous, only the validators decide what's real, and we don't claim production traction we don't have.
gnosys-mcp MCP server: the same /v1 surface, driven from Claude Code.
code (LLM writes the agent) and rl_train (PPO with sandboxed reward / curriculum / BC-label hooks). Available on design-partner contracts. Not on the self-serve SaaS.
05 · Moat
Every shipped model carries a record: which validators it cleared, at what severity, on what holdout. That record — not the model — is the durable asset. New agent frameworks ship every quarter; trusted measurement records do not. The validation record is a queryable library across ML domains.
Once the record exists, the alternative is rebuilding it from zero. The cost of rebuilding grows monotonically with the number of experiments evaluated. Every quarter we keep running this loop, the gap widens.
spec agent-xgb-conservative-iter0 promoted
  → model_family=xgb_classifier · cv_folds=5 · cat_encoding=native
primary AUC 0.8955 on stratified 60/20/20
honest_eval.shuffled_label      PASS → observed_lift 0.395  shuffled_lift 0.004  retained 0.010
honest_eval.randomized_feature  PASS → observed_lift 0.395  shuffled_lift 0.001  retained 0.003
honest_eval.secondary_holdout   PASS → primary_secondary 0.891  delta 0.005
honest_eval.permutation_fwer    PASS → empirical_p 0.00  adjusted_p 0.00 (BH-FDR per batch)
multi_calibration               PASS → ECE 0.019 (post-MCGrad)  worst-slice ECE 0.041
dist_shift                      PASS → concept_share_of_gap 12% (label + covariate dominate)
audit sha256:d208…525b  exportable HTML
  → signed audit chain over typed contracts
Real run from Spaceship Titanic, May 2026. Reproducible from the SDK — see client/examples/mle_bench_example.py.
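The `empirical_p` and `adjusted_p (BH-FDR per batch)` fields in the log can be read against the minimal sketch below, which computes a one-sided permutation p-value and the standard Benjamini-Hochberg adjustment over a batch. The function names are hypothetical, and how the platform actually builds its null distribution or batches its specs is not stated here. Benjamini-Hochberg controls the false discovery rate, a weaker guarantee than family-wise error control.

```python
def empirical_p_value(observed, null_scores):
    """One-sided permutation p-value with the +1 correction."""
    exceed = sum(1 for s in null_scores if s >= observed)
    return (exceed + 1) / (len(null_scores) + 1)


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

With the +1 correction the permutation p-value is never exactly zero: its floor is 1 / (number of permutations + 1), so the number of permutations sets how small an adjusted p-value a batch can reach.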
06 · Why now
Frontier models (Claude 4.x, GPT-5) cleared the reliability bar for structured code generation under constraint. Multi-step agent reliability at the 30 to 50 steps a full research run requires has only existed for roughly twelve months.
Running thousands of hypothesis generations used to cost more than the experiments could plausibly justify. Inference costs dropped roughly 10x in the last year — thousands of validated experiments per day now fit a single-founder budget. A year ago this loop wasn't budget-feasible.
Sakana AI Scientist and DeepMind FunSearch normalized closed-loop discovery from research curiosity to obvious next step. The remaining work is the integrated system that exploits the capability without fooling itself.
07 · Who this is for
Primary: frontier labs and ML product teams running architecture, hyperparameter, or feature-pipeline iteration at scale. The hard problem isn't access to compute or models — it's that you cannot hire more $200K–$2M loaded-cost researchers fast enough to keep up with the candidate space.
Healthcare, fintech, integrity, ad-tech. Anywhere the next model release goes through model-risk, compliance, or a regulator before production. The validation record doubles as a compliance artefact — what reviewers ask for during a model risk review is exactly what the loop produces.
08 · Pricing
Design partner / enterprise is where the validation record is load-bearing. Self-serve SaaS exists so you can run the loop on your own data before talking to us. Full pricing →
Co-developed validation record on customer data. Self-hosted, SAML / SCIM, SOC2 evidence, dedicated CSM. Strategy-domain (RL, agent-code) included. Below-band pricing for design partners in the early phase in exchange for co-developing on shared roadmap items.
Design partnership →
Tabular classification + regression on uploaded CSV / parquet. Honest-eval and the predict endpoint included on every plan. Free tier is 5 runs / month with hp_sweep; Starter adds the LLM agent strategist; Team adds SSO and 5 users.
Compare plans →
Drop your email and a 1–2 line note about the loop you're trying to close. We reply to every message that includes context. Prefer to try the self-serve loop first? /signup →