Design partners · Gnosys Labs

Who this is for

ML org leaders who cannot hire researchers fast enough.

Every ML team running classification, regression, or RL on complex systems lives in the same loop: propose architectures, hyperparameters, or feature pipelines; train; measure; iterate. That loop is computation dressed up as judgment, sitting inside a $200K–$2M loaded-cost human. Gnosys Labs runs the loop and exports the validation record back to you.

Primary · frontier labs & ML product teams

Hundreds-to-thousands of training experiments per quarter; the wall is researcher time, not GPU time.
Existing eval infrastructure tracks what ran. You want a system that decides what was real.
You've had a model degrade in production after looking great offline — and didn't know whether it was concept drift, covariate shift, or selection bias.

Secondary · regulated-domain ML

Healthcare diagnostics, fintech credit and AML, content integrity, ad-tech measurement. The validation record we produce is exactly what model-risk reviewers ask for — documented null tests, calibration on the cohorts that matter, distribution-shift attribution, an audit chain. It's a deliverable, not a side effect.

Healthcare ML · Fintech / credit · AML / fraud · Content integrity · Ad-tech measurement · Insurance pricing · Clinical trials

If none of these are true, the self-serve tabular product at /signup is probably the right starting point.

What a design partnership looks like

A co-developed validation record. Yours when the engagement ends.

The validation record is the load-bearing artefact. Every spec we propose, every rejection, every promotion, and every calibration update is SHA-256 audit-chained, contract-frozen, and queryable. In one line: the validation record is a queryable library across ML domains.
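
To make "SHA-256 audit-chained" concrete, here is a minimal sketch of a hash chain in Python. It is an illustration under our own assumptions, not the Gnosys Labs implementation: the entry fields, the `GENESIS` placeholder, and the `append` / `verify` helpers are all hypothetical names. Each entry's hash covers the previous entry's hash, so editing any earlier record invalidates everything after it.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain: list, payload: dict) -> None:
    """Append a record whose hash depends on everything before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev": prev, "hash": entry_hash(prev, payload)})


def verify(chain: list) -> bool:
    """Recompute every hash in order; any edited payload or broken link fails."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


chain = []
append(chain, {"event": "propose", "spec": "spec-001"})
append(chain, {"event": "reject", "spec": "spec-001", "validator": "shuffled-label"})
print(verify(chain))  # True

chain[0]["payload"]["spec"] = "spec-999"  # tamper with an early record
print(verify(chain))  # False
```

A reviewer only needs the final hash to confirm that the whole history is the one that was signed off.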

What you get

The full propose→run→validate→promote loop wired against your data, with all six validators (shuffled-label, randomized-feature, secondary-holdout, perm-FWER, dist-shift, multi-calibration) firing on every spec.
Co-developed validation record on your data: every spec, every rejection, every calibration delta, every distribution-shift attribution — contract-frozen, audit-chained, queryable. Yours.
An audit-ready HTML model card per run with subgroup-conditional ECE (MCGrad) and distribution-shift attribution. The artefact compliance actually reads.
A predict endpoint over the kept-spec ensemble, calibrated by MCGrad on the subgroup definitions you care about.
Strategy-domain preview — agent-code and RL competitions on the same loop. Available on design-partner contracts; not on self-serve.
Self-hosted deployment, SSO via SAML / SCIM, SOC2 evidence packet, and a dedicated CSM if you need them.
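
For readers unfamiliar with a shuffled-label null test, the idea can be sketched as a generic permutation test: retrain on permuted labels many times and ask how often the null models score as well as the real one. The sketch below is an assumption-laden illustration, not the product's validator: the 1-nearest-neighbour model, the synthetic data, and the function names are all hypothetical stand-ins.

```python
import random


def one_nn_predict(train_x, train_y, x):
    """Predict the label of the closest training point (1-nearest-neighbour)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def accuracy(train_x, train_y, test_x, test_y):
    """Fraction of test points whose 1-NN prediction matches the true label."""
    hits = sum(one_nn_predict(train_x, train_y, x) == y for x, y in zip(test_x, test_y))
    return hits / len(test_y)


def shuffled_label_test(train_x, train_y, test_x, test_y, n_perm=200, seed=0):
    """Return (observed accuracy, permutation p-value) under shuffled training labels."""
    rng = random.Random(seed)
    observed = accuracy(train_x, train_y, test_x, test_y)
    exceed = 0
    for _ in range(n_perm):
        shuffled = train_y[:]
        rng.shuffle(shuffled)  # break the feature-label link in training only
        if accuracy(train_x, shuffled, test_x, test_y) >= observed:
            exceed += 1
    # The +1 terms keep the p-value strictly positive with finite permutations.
    return observed, (1 + exceed) / (1 + n_perm)


# Synthetic data (assumed for illustration): one feature, class means 0 and 3.
data_rng = random.Random(42)
labels = [i % 2 for i in range(200)]
features = [data_rng.gauss(3.0 * y, 1.0) for y in labels]
train_x, test_x = features[:120], features[120:]
train_y, test_y = labels[:120], labels[120:]

observed, p_value = shuffled_label_test(train_x, train_y, test_x, test_y)
print(f"accuracy={observed:.3f} p_value={p_value:.4f}")
```

A spec whose score is indistinguishable from the shuffled-label runs has learned nothing from the labels, however good the headline metric looks.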

What we need from you

Production-shaped data access (or a faithful staging mirror) — CSV / parquet on the tabular product; richer shapes on the design-partner track.
Ground-truth labels for the task. We don't generate ground truth; we validate models that learn against it.
One ML or risk engineer who owns the engagement — the loop replaces process, not domain judgment about which subgroups matter, which features are appropriate, or what the outcome metric should be.
A holdout the strategist will never see. Honest-eval's secondary-holdout validator depends on it; without it, that validator can't fire.
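
One common way to carve out a holdout that stays stable across data refreshes is to assign rows by a salted hash of a row identifier. This is a sketch of that general technique, not a description of how the product splits data; the function name, salt, and 20% fraction are assumptions chosen for illustration.

```python
import hashlib


def in_sealed_holdout(row_id: str, salt: str = "example-salt", fraction: float = 0.2) -> bool:
    """Deterministically place roughly `fraction` of rows in the sealed holdout."""
    digest = hashlib.sha256(f"{salt}:{row_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < fraction


row_ids = [f"row-{i}" for i in range(10_000)]
holdout = [r for r in row_ids if in_sealed_holdout(r)]
visible = [r for r in row_ids if not in_sealed_holdout(r)]
print(len(holdout), len(visible))
```

Because membership depends only on the ID and the salt, the same row lands on the same side every time, and keeping the salt private keeps the split out of the search loop.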

Transfers from the SaaS product / what's bespoke

Transfers unchanged

· The six honest-eval validators (no LLM in the judge seat)
· MCGrad multicalibration + ensemble dedup
· Strategist surface (hp_sweep, agent, sandboxed code-exec, blended)
· Composable LLM-authored feature pipelines, draft→improve→debug repair
· Audit chain, model card, predict endpoint, MCP server

Bespoke per engagement

· Data adapters into your training + eval harness
· Subgroup definitions for multi-calibration
· Promotion thresholds (your outcome metric, your bar)
· Custom validators specific to your domain
· Hosting topology (managed, self-hosted, VPC peering)
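
Subgroup definitions matter because a model can look calibrated overall while being miscalibrated inside every subgroup. The sketch below computes expected calibration error (ECE) for binary predictions, overall and per subgroup, using equal-width bins. It is a plain-Python illustration and does not use MCGrad; the function names and the toy data are assumptions.

```python
from collections import defaultdict


def ece(probs, labels, n_bins=10):
    """Expected calibration error for binary predictions with equal-width bins."""
    bins = defaultdict(list)
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    err = 0.0
    for members in bins.values():
        avg_conf = sum(p for p, _ in members) / len(members)
        avg_pos = sum(y for _, y in members) / len(members)
        err += len(members) / total * abs(avg_conf - avg_pos)
    return err


def subgroup_ece(probs, labels, groups, n_bins=10):
    """ECE computed separately within each subgroup."""
    by_group = defaultdict(lambda: ([], []))
    for p, y, g in zip(probs, labels, groups):
        by_group[g][0].append(p)
        by_group[g][1].append(y)
    return {g: ece(ps, ys, n_bins) for g, (ps, ys) in by_group.items()}


# Toy data: the model predicts 0.5 for everyone.
probs = [0.5] * 8
labels = [1, 1, 1, 0, 0, 0, 0, 1]   # 50% positive overall
groups = ["A"] * 4 + ["B"] * 4      # A is 75% positive, B is 25% positive

print(ece(probs, labels))                   # 0.0
print(subgroup_ece(probs, labels, groups))  # {'A': 0.25, 'B': 0.25}
```

The overall number hides a 25-point miss in each cohort, which is the gap subgroup-conditional calibration is meant to expose.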

Pricing

Two bands. Smaller ML teams or workflow integration.

Smaller ML teams

$30K – $200K / year

Closes the loop on architecture, hyperparameter, and feature-pipeline iteration with the full honest-eval validator stack, OOD detection, and distribution-shift decomposition. One team, one model class to start.

Workflow integration

$200K – $1M+ / year

Gnosys Labs becomes part of the core experimentation workflow. Multiple teams and model classes; the validation record integrates with your eval infrastructure. Self-hosted, SAML / SCIM, SOC2 evidence, dedicated CSM. Custom validators and subgroup definitions for MCGrad. Strategy-domain (RL, agent-code) included.

Design partners during the early phase get below-band pricing in exchange for co-developing shared roadmap items. The self-serve SaaS ($0 / $49 / $199 / month) exists for teams who want to run the loop on their own data first.

Closes the loop on your problem,
with the validation record exported back to you.

Talk to us.
