GNOSYS LABS
AUTONOMOUS EXPERIMENTATION · VALIDATION LAYER FIRST · DESIGN PARTNERS OPEN

The OS for
Experimentation.

Closes the loop without the human in the middle.

ML teams are bottlenecked by humans doing computational work: propose, run, measure, iterate, all dressed up as judgment. Gnosys runs that loop. Generation is cheap. Trusted measurement is not, and the validation record is the asset.

For frontier labs and ML product teams where researcher headcount is the binding constraint.

01 · Thesis

A system that both generates candidates and evaluates them starts fooling itself.

Naive LLM-driven research loops collapse to the same failure mode: finding patterns that don't exist, overfitting aggressively, mistaking selection bias for signal. MLE-STAR (Table 5) shows the AUC drop directly — 0.803 to 0.734 when the downstream checker is removed. Sakana's AI Scientist and DeepMind's FunSearch validated that the closed-loop category is real. The remaining work is the system that exploits the capability without grading its own homework.

The failure mode
Single-evaluator loop

AutoML tunes hyperparameters; Sakana writes papers; FunSearch evolves programs. Each uses one judge — usually itself — to decide what's real. Selection-bias correction is ad-hoc or unstated. Negative results live in one researcher's head.

  • 50–200 hypotheses / researcher / year
  • $200K–$2M loaded cost per researcher
  • No persistent record of what failed
The Gnosys answer
Validation layer, no LLM in the judge seat

Six independent statistical validators run against every spec before promotion. AST-validated code generation. Distribution-shift decomposition. Multi-calibrated ensembles. Every cross-agent message is a typed, frozen contract, with a SHA-256 audit chain back to spec.

  • 10,000+ validated hypotheses at scale
  • Selection bias corrected per batch (perm-FWER + BH-FDR)
  • Negative results persistent, queryable, deduped
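
The per-batch correction named above can be illustrated with the Benjamini-Hochberg step-up procedure. This is a minimal sketch under stated assumptions, not the Gnosys implementation: the function names `bh_adjust` and `promote_mask` and the example p-values are hypothetical, and the permutation-based FWER step is not shown.

```python
def bh_adjust(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    if m == 0:
        return []
    # Indices sorted by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def promote_mask(p_values, alpha=0.05):
    """True where a hypothesis survives FDR control at level alpha."""
    return [q <= alpha for q in bh_adjust(p_values)]


# Hypothetical batch of per-spec p-values (illustrative numbers only).
batch = [0.001, 0.008, 0.039, 0.041, 0.20]
survivors = promote_mask(batch)
```

In this made-up batch, four p-values fall under 0.05 before correction, but only the first two survive once the batch size is accounted for.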

02 · The loop

Propose → run → validate → promote.

Four phases. One typed contract between them. Every step a row in the validation record.

Phase 1
Propose

Strategist picks the cohort. HP sweep, LLM agent that reads prior-round outcomes, sandboxed code-exec when curated specs aren't enough — or all three blended on one loop.

  • hp_sweep
  • agent (draft → improve → debug)
  • code-exec (AST-validated)
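
As an illustration of what AST validation of generated code can look like, here is a minimal sketch using Python's standard `ast` module. The allowlist, the banned-call set, and the function name `validate_source` are assumptions made for this example; the page does not document the platform's actual checks.

```python
import ast

ALLOWED_IMPORTS = {"math", "statistics"}  # hypothetical allowlist
BANNED_CALLS = {"eval", "exec", "open", "compile", "__import__"}


def validate_source(source):
    """Return a list of violations; an empty list means the code passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    violations = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    violations.append(f"import not allowed: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            root = (node.module or "").split(".")[0]
            if root not in ALLOWED_IMPORTS:
                violations.append(f"import not allowed: {node.module}")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BANNED_CALLS:
                violations.append(f"call not allowed: {node.func.id}")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            violations.append(f"dunder attribute access: {node.attr}")
    return violations
```

Static checks of this kind reject code before it runs; they complement a sandbox rather than replace one.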
Phase 2
Run

Executor fits each spec under a wall-clock timeout. Feature transforms and categorical encoding fit per-spec, persist in the joblib bundle, replay at predict time. Code paths cross a 7-phase AST validator first.

  • 13 model families
  • 7 composable feature ops
  • HMAC-signed artefacts
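
Signing an artefact with HMAC can be sketched with the standard library alone. The key handling, the function names, and the choice of SHA-256 as the digest are assumptions for illustration; the page does not specify how Gnosys signs bundles.

```python
import hashlib
import hmac


def sign_artifact(payload: bytes, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over the serialized artefact bytes."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_artifact(payload: bytes, key: bytes, signature: str) -> bool:
    """Constant-time comparison avoids leaking tag bytes through timing."""
    expected = sign_artifact(payload, key)
    return hmac.compare_digest(expected, signature)


# Hypothetical key and payload, for demonstration only.
demo_key = b"demo-key-not-for-production"
bundle_bytes = b"serialized model bundle bytes"
tag = sign_artifact(bundle_bytes, demo_key)
```

Verifying the tag before deserializing matters for pickle-based formats such as joblib bundles, because loading an untrusted pickle can execute arbitrary code.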
Phase 3 · the moat
Validate

Six independent statistical validators on every spec. No LLM in the validator loop. Multi-calibrated ensembles. Distribution-shift attribution. The only place the platform's judgment is trusted.

  • shuffled_label
  • randomized_feature
  • secondary_holdout
  • perm_fwer (BH-FDR)
  • dist_shift
  • multi_calibration
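
To make the shuffled-label idea concrete, the sketch below runs a permutation test on fixed model scores. This is a simpler stand-in for refitting a model on shuffled labels, and it is not the platform's validator. Defining lift as AUC minus 0.5 and `retained` as shuffled lift divided by observed lift are assumptions chosen for this example, and the data is synthetic.

```python
import random


def auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def shuffled_label_check(labels, scores, n_shuffles=200, seed=0):
    rng = random.Random(seed)
    observed_lift = auc(labels, scores) - 0.5
    shuffled = list(labels)
    null_lifts = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real label/score relationship
        null_lifts.append(auc(shuffled, scores) - 0.5)
    shuffled_lift = sum(abs(x) for x in null_lifts) / n_shuffles
    # The +1 correction keeps the empirical p-value strictly positive.
    exceed = sum(1 for x in null_lifts if x >= observed_lift)
    empirical_p = (exceed + 1) / (n_shuffles + 1)
    retained = shuffled_lift / observed_lift if observed_lift > 0 else float("inf")
    return {
        "observed_lift": observed_lift,
        "shuffled_lift": shuffled_lift,
        "retained": retained,
        "empirical_p": empirical_p,
    }


# Synthetic data: positives score 0.5 higher on average than negatives.
data_rng = random.Random(1)
labels = [i % 2 for i in range(200)]
scores = [0.5 * y + data_rng.random() for y in labels]
result = shuffled_label_check(labels, scores)
```

A model with real signal should keep almost none of its lift once labels are shuffled; a model whose lift survives shuffling is fitting something other than the label.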
Phase 4
Promote

Survivors deduped by output correlation, ensembled, run through MCGrad subgroup multicalibration. Predict endpoint, downloadable model card, signed audit chain. Whatever ships, ships with the receipt.

  • POST /v1/runs/{id}/predict
  • GET /v1/runs/{id}/model_card
  • POST /v1/runs/{id}/submission
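
Deduplicating survivors by output correlation, as the Promote phase describes, can be sketched as a greedy filter followed by a simple mean ensemble. The 0.95 threshold, the best-first ordering, the use of Pearson correlation, and all candidate names and numbers are assumptions for illustration. The MCGrad multicalibration step is not shown.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length prediction vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant predictions carry no correlation signal
    return cov / math.sqrt(var_a * var_b)


def dedupe_by_correlation(candidates, threshold=0.95):
    """candidates: (name, score, predictions); keep greedily, best score first."""
    kept = []
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    for name, score, preds in ranked:
        if all(abs(pearson(preds, kp)) < threshold for _, _, kp in kept):
            kept.append((name, score, preds))
    return kept


def ensemble_mean(kept):
    """Unweighted average of the kept models' predictions."""
    n = len(kept[0][2])
    return [sum(c[2][i] for c in kept) / len(kept) for i in range(n)]


# Hypothetical survivors: xgb_b is a near-duplicate of xgb_a.
candidates = [
    ("xgb_a", 0.89, [0.10, 0.40, 0.80, 0.90]),
    ("xgb_b", 0.88, [0.12, 0.41, 0.79, 0.91]),
    ("logreg", 0.85, [0.30, 0.20, 0.60, 0.70]),
]
kept = dedupe_by_correlation(candidates)
blend = ensemble_mean(kept)
```

Dropping near-duplicates first keeps the ensemble from counting the same model twice under different names.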

Every cross-agent message is a typed, frozen contract. SHA-256 audit chain back to spec.
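
One way to picture a typed, frozen contract with a SHA-256 chain is a frozen dataclass whose canonical JSON is hashed together with the previous link. This is a sketch under stated assumptions: the class name `SpecMessage`, the `spec_id` field, the genesis value, and the chaining rule are all hypothetical, not the platform's schema.

```python
import hashlib
import json
from dataclasses import FrozenInstanceError, asdict, dataclass


@dataclass(frozen=True)
class SpecMessage:
    """Immutable message; assigning to a field raises FrozenInstanceError."""
    spec_id: str
    model_family: str
    cv_folds: int


def chain_hash(prev_hash: str, message: SpecMessage) -> str:
    # Canonical JSON (sorted keys, fixed separators) so equal messages hash equally.
    body = json.dumps(asdict(message), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def build_chain(messages, genesis="0" * 64):
    """Each link commits to its message and to every link before it."""
    hashes, prev = [], genesis
    for msg in messages:
        prev = chain_hash(prev, msg)
        hashes.append(prev)
    return hashes


def verify_chain(messages, hashes, genesis="0" * 64):
    return build_chain(messages, genesis) == hashes


msgs = [
    SpecMessage("agent-xgb-conservative-iter0", "xgb_classifier", 5),
    SpecMessage("example-spec-2", "logreg", 5),
]
links = build_chain(msgs)
```

Because every link includes the previous hash, changing any earlier message invalidates all later links, which is what makes the chain auditable back to the spec.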

03 · What survives the loop

Most generated experiments should not make it. The funnel is the discipline.

A slice from the live research database. Adversarial validation is supposed to reject — that's the work. What's left at the bottom is what we'd ship.

  • Generated: 383 specs proposed by the strategist
  • Cleared AST · backtested: 129 passed code validation, ran to completion
  • Critic-reviewed: 112 survived adversarial critic pass
  • Promoted: 9 cleared all six validators end-to-end

Numbers from the research validation record. Replay one experiment going through the loop →
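
The stage-to-stage pass rates implied by the funnel counts above can be worked out directly. The counts are taken from this page; the stage keys and the helper name are made up for the example.

```python
# Funnel counts as reported on this page.
funnel = [
    ("generated", 383),
    ("cleared_ast_backtested", 129),
    ("critic_reviewed", 112),
    ("promoted", 9),
]


def stage_rates(stages):
    """Fraction of each stage's input that reaches the next stage."""
    rates = []
    for (_, prev_n), (name, n) in zip(stages, stages[1:]):
        rates.append((name, n / prev_n))
    return rates


rates = stage_rates(funnel)
overall = funnel[-1][1] / funnel[0][1]

for name, rate in rates:
    print(f"{name}: {rate:.1%}")
print(f"overall: {overall:.1%}")
```

About a third of generated specs run to completion, most of those pass the critic, and roughly 8% of critic-reviewed specs clear all six validators, for an overall promotion rate near 2.3%.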

04 · Current status · Q2 2026

What ships today, what's in flight, what's research preview.

The bar is honest: only the loop is autonomous, only the validators decide what's real, and we don't claim production traction we don't have.

Shipped
Tabular SaaS · validation layer · predict endpoint
  • Spaceship Titanic 0.8955 AUC, top-10% Kaggle, end-to-end LLM cost ~$0.05. Reproducible from the SDK.
  • All six adversarial validators on every spec, FDR-corrected per batch.
  • MCGrad subgroup multicalibration (gradient-boosted correctors, after Meta KDD 2026).
  • Composable LLM-authored feature pipelines — split, regex, hash, datetime, text-length, n-gram-hash, drop. AST-validated; persisted in the joblib bundle; replayed at predict.
  • Blended strategist — hp_sweep, agent, and code-exec on the same loop.
  • Draft → improve → debug repair loop on LLM-authored code.
  • Categorical encoders across 13 model families (native, onehot, label, target, auto).
  • gnosys-mcp MCP server — the same /v1 surface, driven from Claude Code.
  • Downloadable HTML model card. SHA-256 audit chain over typed contracts. HMAC-signed artefacts.
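
A composable feature pipeline that is persisted and replayed at predict time can be pictured as a list of declarative steps interpreted against each row. The sketch below is illustrative only: the op names mirror four of the ops listed above, but the step schema, the argument names, and the use of JSON in place of a joblib bundle are assumptions.

```python
import hashlib
import json


def op_split(row, column, sep, names):
    parts = str(row.get(column, "")).split(sep)
    out = dict(row)
    for i, name in enumerate(names):
        out[name] = parts[i] if i < len(parts) else None
    return out


def op_text_length(row, column, into):
    out = dict(row)
    out[into] = len(str(row.get(column, "")))
    return out


def op_hash(row, column, into, buckets):
    # hashlib is stable across processes, unlike the salted built-in hash().
    digest = hashlib.sha256(str(row.get(column, "")).encode("utf-8")).hexdigest()
    out = dict(row)
    out[into] = int(digest, 16) % buckets
    return out


def op_drop(row, columns):
    return {k: v for k, v in row.items() if k not in columns}


OPS = {"split": op_split, "text_length": op_text_length,
       "hash": op_hash, "drop": op_drop}


def run_pipeline(steps, row):
    """Apply each declared step in order; unknown ops are rejected."""
    for step in steps:
        if step["op"] not in OPS:
            raise ValueError(f"unknown op: {step['op']}")
        row = OPS[step["op"]](row, **step["args"])
    return row


steps = [
    {"op": "split", "args": {"column": "Cabin", "sep": "/",
                             "names": ["deck", "cabin_num", "side"]}},
    {"op": "text_length", "args": {"column": "Name", "into": "name_len"}},
    {"op": "hash", "args": {"column": "deck", "into": "deck_bucket", "buckets": 8}},
    {"op": "drop", "args": {"columns": ["Cabin", "Name"]}},
]
row = {"Cabin": "B/0/P", "Name": "Maham Ofracculy", "Age": 39.0}

saved = json.dumps(steps)                             # persist at fit time
features = run_pipeline(json.loads(saved), dict(row))  # replay at predict time
```

Keeping the pipeline as data rather than free-form code means the same steps can be validated, stored, and replayed without re-running the generator.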
In progress
Co-developed with design partners
  • Five enterprise design partners over the next twelve months. Frontier labs and ML product teams; validation record co-developed on real customer data.
  • Strategy-domain RL training at competitive compute budgets — pipeline is shipped (BC warm-start, PPO + GAE-λ, self-play league); the ML ceiling against hand-tuned baselines is the live problem.
  • 100,000+ validated experiments as a queryable, deduped library across ML domains.
  • Self-serve billing in dashboard, multi-user team RBAC.
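
For readers unfamiliar with the GAE-λ term in the RL bullet above, here is the textbook recurrence as a minimal sketch: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), and A_t = delta_t + gamma * lambda * A_{t+1}. This is the standard estimator, not code from the Gnosys pipeline; the function name and default hyperparameters are assumptions.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95, last_value=0.0):
    """Generalized advantage estimation over one trajectory segment.

    values[t] is the critic's estimate V(s_t); last_value bootstraps the
    state after the final step (0.0 if the episode terminated there).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    # Value targets for the critic are advantage plus baseline.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

With lam=1 the estimate reduces to the discounted return minus the baseline; with lam=0 it reduces to the one-step TD error, trading variance for bias.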
Research preview
Strategy domain · agent-code & RL
  • Kaggle Orbit Wars: same propose-validate-promote loop, applied to multi-agent RL competitions. Two execution modes — code (LLM writes the agent) and rl_train (PPO with sandboxed reward / curriculum / BC-label hooks).
  • Platform-side is complete: domain protocol, finalization, submission packaging. Honest ceiling: at small training budgets, RL is 0% vs the hand-tuned starter; the code-agent path reaches ~12%. Open ML problem, not a platform bug.
  • Vision-domain plant-pathology adapter in research; not production.

Available on design-partner contracts. Not on the self-serve SaaS.

05 · Moat

Generation is cheap. Trusted measurement is not.

Every shipped model carries a record: which validators it cleared, at what severity, on what holdout. That record — not the model — is the durable asset. New agent frameworks ship every quarter; trusted measurement records do not. The validation record is a queryable library across ML domains.

What compounds
  • Every promoted spec, with the six-validator verdict
  • Every rejection, with the validator that caught it
  • Calibration history across model × cohort
  • Distribution-shift attributions per holdout
  • Audit chain: row-level SHA-256 over typed contracts
Why it's a moat

Once the record exists, the alternative is rebuilding it from zero. The cost grows monotonically with experiments evaluated. Every quarter we keep running this loop, the gap widens.

model_card · one promoted spec from a tabular run
spec          agent-xgb-conservative-iter0  promoted
  → model_family=xgb_classifier · cv_folds=5 · cat_encoding=native

primary       AUC 0.8955  on stratified 60/20/20

honest_eval.shuffled_label      PASS
  → observed_lift 0.395  shuffled_lift 0.004  retained 0.010

honest_eval.randomized_feature  PASS
  → observed_lift 0.395  shuffled_lift 0.001  retained 0.003

honest_eval.secondary_holdout   PASS
  → primary_secondary 0.891  delta 0.005

honest_eval.permutation_fwer    PASS
  → empirical_p 0.00  adjusted_p 0.00  (BH-FDR per batch)

multi_calibration              PASS
  → ECE 0.019 (post-MCGrad)  worst-slice ECE 0.041

dist_shift                     PASS
  → concept_share_of_gap 12%  (label + covariate dominate)

audit         sha256:d208…525b  exportable HTML
  → signed audit chain over typed contracts

Real run from Spaceship Titanic, May 2026. Reproducible from the SDK — see client/examples/mle_bench_example.py.
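
The ECE and worst-slice ECE lines in the card above can be read against a standard definition of expected calibration error. The sketch below uses equal-width bins and hypothetical slice labels; the binning scheme and slice definition are assumptions, the data is invented, and the MCGrad correction itself is not shown.

```python
def ece(probs, labels, n_bins=10):
    """Expected calibration error with equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    error = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_p = sum(p for p, _ in bucket) / len(bucket)
        avg_y = sum(y for _, y in bucket) / len(bucket)
        error += (len(bucket) / total) * abs(avg_p - avg_y)
    return error


def worst_slice_ece(probs, labels, groups, n_bins=10):
    """Return (slice_name, ece) for the worst-calibrated slice."""
    by_group = {}
    for p, y, g in zip(probs, labels, groups):
        ps, ys = by_group.setdefault(g, ([], []))
        ps.append(p)
        ys.append(y)
    scores = {g: ece(ps, ys, n_bins) for g, (ps, ys) in by_group.items()}
    worst = max(scores, key=scores.get)
    return worst, scores[worst]


# Invented data: calibrated overall, but slice "b" is badly off.
probs = [0.2] * 5 + [0.8] * 5
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
groups = ["a"] * 4 + ["b"] + ["a"] * 4 + ["b"]

overall_ece = ece(probs, labels)
worst_name, worst_score = worst_slice_ece(probs, labels, groups)
```

Here the overall ECE is zero while slice "b" has an ECE of 0.8, which is why a subgroup check is reported alongside the global figure.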

06 · Why now

Three conditions are true for the first time.

Agent loops are reliable

Frontier models (Claude 4.x, GPT-5) cleared the reliability bar for structured code generation under constraint. Multi-step agent reliability at the 30 to 50 steps a full research run requires has only existed for roughly twelve months.

Inference economics inverted

Running thousands of hypothesis generations used to cost more than the experiments could plausibly justify. Inference costs dropped roughly 10x in the last year — thousands of validated experiments per day now fit a single-founder budget. A year ago this loop wasn't budget-feasible.

Autonomous research crossed over

Sakana AI Scientist and DeepMind FunSearch normalized closed-loop discovery from research curiosity to obvious next step. The remaining work is the integrated system that exploits the capability without fooling itself.

07 · Who this is for

Teams where researcher headcount is the binding constraint.

Primary: frontier labs and ML product teams running architecture, hyperparameter, or feature-pipeline iteration at scale. The hard problem isn't access to compute or models — it's that you cannot hire more $200K–$2M loaded-cost researchers fast enough to keep up with the candidate space.

Primary · frontier labs & ML product teams
  • You run hundreds-to-thousands of training experiments per quarter and the wall is researcher time, not GPU time.
  • Your existing eval infrastructure tracks what ran. You want a system that decides what was real.
  • You've had a model degrade in production after looking great offline — and didn't know whether it was concept drift, covariate shift, or selection bias.
Secondary · regulated-domain ML

Healthcare, fintech, integrity, ad-tech. Anywhere the next model release goes through model-risk, compliance, or a regulator before production. The validation record doubles as a compliance artefact — what reviewers ask for during a model risk review is exactly what the loop produces.

Healthcare ML · Fintech / credit · AML / fraud · Content integrity · Ad-tech measurement · Insurance pricing · Clinical trials
The twelve-month plan
  • 5 enterprise design partners, validation record co-developed on real customer data
  • 100,000+ validated experiments (AST-validated, FDR-corrected, contract-frozen, audit-chained) as a queryable library across ML domains
  • $10M+ ARR per engineer. Built like our customers will be: the platform replaces researchers in the loop and we run on the same logic.

08 · Pricing

Priced where the validation record is load-bearing.

The design-partner and enterprise tier is where that record matters most. Self-serve SaaS exists so you can run the loop on your own data before talking to us. Full pricing →

Design partner · enterprise
$30K – $1M+ / year

Co-developed validation record on customer data. Self-hosted, SAML / SCIM, SOC2 evidence, dedicated CSM. Strategy-domain (RL, agent-code) included. Below-band pricing for design partners in the early phase in exchange for co-developing on shared roadmap items.

Design partnership →
Self-serve · try the loop yourself
$0 · $49 · $199 / month

Tabular classification + regression on uploaded CSV / parquet. Honest-eval and the predict endpoint included on every plan. Free tier is 5 runs / month with hp_sweep; Starter adds the LLM agent strategist; Team adds SSO and 5 users.

Compare plans →

Talk to us about a design partnership.

Drop your email and a 1–2 line note about the loop you're trying to close. We reply to every message that includes context. Prefer to try the self-serve loop first? /signup →