Shipping ML into regulated environments.

What it actually takes to move a model from notebook to production inside SR 11-7, HIPAA or analogous institutional controls. The artifacts, the reviewers, the order.

Practice notes · July 2025 · 10 min read

A model that works in a notebook is roughly five percent of a model that works in production. A model that works in production is roughly fifty percent of a model that survives a model-risk review. The remaining fifty percent is the artifact set, the reviewers, the order. None of it is glamorous, and all of it is the difference between a system that ships and a system that does not.

The notes below are written from the shipping side. They will not replace the official guidance for SR 11-7, HIPAA, the institutional equivalents or the NIST AI RMF. They will tell you how we sequence the work so that the official guidance becomes a sign-off, not a project.

In a regulated shop, the model is the easy part. The evidence that it is safe to ship is the work.

Start with the disposition memo

Before the first feature is engineered, write the document that will eventually accompany the model into review. The disposition memo is two to four pages. It states:

What decision the model informs and who owns that decision.
What the model does (one paragraph; no math).
The data it is trained on, with provenance.
The success criteria, in business terms.
The known failure modes and how the design accommodates them.
The fallback when the model is unavailable or untrusted.
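
Because the memo starts mostly empty and fills in over time, it can help to track it as a structured record. A minimal sketch, assuming one free-text field per item in the list above; the class and field names are hypothetical, not a prescribed template:

```python
from dataclasses import dataclass, fields


@dataclass
class DispositionMemo:
    # One field per memo section; all names here are illustrative.
    decision_and_owner: str = ""
    model_summary: str = ""             # one paragraph, no math
    training_data_provenance: str = ""
    success_criteria: str = ""          # in business terms
    failure_modes: str = ""
    fallback: str = ""


def open_sections(memo):
    """Return the names of memo sections that are still blank."""
    return [f.name for f in fields(memo) if not getattr(memo, f.name).strip()]
```

Calling `open_sections` at each milestone gives a quick view of what the memo still owes the reviewers.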

Most of this is unknown at the start; the memo is a living artifact. The point is that the work is being done from day one in a frame that the model-risk function will recognize. Teams that skip this step rediscover, late, that they have built the wrong thing.

The artifact set, in shipping order

1. Data lineage and treatment

Every feature has a documented source, a documented transformation chain, and a documented sensitivity classification. PHI, PII and other restricted classifications are flagged at the source. Treatment of missing values, outliers and out-of-range inputs is documented before modeling starts.
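
One way to keep this honest is a small lineage registry that is checked against the model's actual inputs. The sketch below is an assumption about how such a registry might look; the classification levels, field names and helper functions are hypothetical:

```python
from dataclasses import dataclass
from enum import Enum


class Sensitivity(Enum):
    # Illustrative levels; use your institution's own classification scheme.
    PUBLIC = "public"
    PII = "pii"
    PHI = "phi"


@dataclass(frozen=True)
class FeatureLineage:
    name: str
    source: str                    # system or table the raw value comes from
    transformations: tuple         # ordered, human-readable steps
    sensitivity: Sensitivity
    missing_value_treatment: str
    out_of_range_treatment: str


def undocumented(features, model_inputs):
    """Model inputs that have no lineage entry."""
    documented = {f.name for f in features}
    return sorted(set(model_inputs) - documented)


def restricted(features):
    """Features carrying any classification other than PUBLIC."""
    return sorted(f.name for f in features if f.sensitivity is not Sensitivity.PUBLIC)
```

A non-empty result from `undocumented` is a reason to stop before modeling starts, which is when the privacy review expects the treatment to be settled.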

2. Evaluation methodology

The protocol for measuring model quality, including the holdout strategy, the metrics, the slices, the calibration tests and the statistical tests. Written before model training, not after. Walk-forward back-testing is the norm; random splits are the exception.
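
For readers who have not written a walk-forward split by hand, here is a minimal sketch with an expanding training window. It assumes rows are already sorted by time; the function name and parameters are illustrative:

```python
def walk_forward_splits(n_samples, n_folds, min_train):
    """Yield (train_indices, test_indices) pairs for time-ordered data.

    Each fold trains on everything before its test window, so no fold
    ever sees data from its own future.
    """
    if min_train >= n_samples:
        raise ValueError("min_train must be smaller than n_samples")
    test_size = (n_samples - min_train) // n_folds
    if test_size < 1:
        raise ValueError("not enough samples for the requested folds")
    for k in range(n_folds):
        train_end = min_train + k * test_size
        test_end = train_end + test_size
        yield range(0, train_end), range(train_end, test_end)
```

The property a validator will look for is that every training index precedes every test index in the same fold. A random split cannot offer that guarantee.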

3. Challenger model

A simpler model trained on the same data, evaluated against the same protocol. The challenger is the answer to the reviewer’s implicit question: could you have done this with logistic regression? If the answer is “yes, with comparable performance,” you should ship the logistic regression.
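
The rule above can be written down as an explicit decision, which makes "comparable performance" a number agreed in advance rather than a judgment made after the results are in. A sketch, assuming a single higher-is-better metric and a hypothetical `min_gap` agreed with the validators:

```python
def choose_model(champion_score, challenger_score, min_gap):
    """Pick the complex model only if it beats the simple one by min_gap.

    Scores are assumed to come from the same evaluation protocol and to be
    higher-is-better; min_gap is the smallest improvement worth defending.
    """
    gap = champion_score - challenger_score
    return "champion" if gap >= min_gap else "challenger"
```

For example, with `min_gap=0.02`, a champion at 0.84 against a challenger at 0.83 resolves to the challenger.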

4. Fairness analysis

Sliced performance across cohorts the regulator will ask about. Disparate-impact metrics, calibration by group, false-positive and false-negative rates by group. Documented mitigations where there are gaps, and a documented decision where the gap is being accepted.
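
A minimal sketch of the per-group numbers, assuming binary labels and predictions (1 is the positive outcome) and a single group attribute. The function names are illustrative, and which cohorts and thresholds matter is a question for your regulator and model-risk function:

```python
from collections import defaultdict


def rates_by_group(y_true, y_pred, groups):
    """Selection rate, false-positive rate and false-negative rate per group."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        # "t"/"f": prediction correct or not; "p"/"n": predicted class.
        key = ("t" if t == p else "f") + ("p" if p == 1 else "n")
        counts[g][key] += 1
    out = {}
    for g, c in counts.items():
        total = sum(c.values())
        negatives = c["fp"] + c["tn"]
        positives = c["tp"] + c["fn"]
        out[g] = {
            "selection_rate": (c["tp"] + c["fp"]) / total,
            "fpr": c["fp"] / negatives if negatives else None,
            "fnr": c["fn"] / positives if positives else None,
        }
    return out


def disparate_impact_ratio(rates, protected, reference):
    """Ratio of selection rates, protected group over reference group."""
    ref = rates[reference]["selection_rate"]
    return rates[protected]["selection_rate"] / ref if ref else None
```

Rates are returned as `None` when a group has no actual positives or negatives, so an empty slice shows up as a gap in the report rather than as a misleading zero.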

5. Monitoring plan

Feature drift, prediction drift, performance drift, freshness, saturation. The thresholds at which each fires. The on-call rotation. The retrain trigger.
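
As one concrete example of a feature-drift signal, the sketch below computes the Population Stability Index (PSI) between a reference sample and a recent sample. PSI is a common choice, not something these notes prescribe; the bin edges and the `eps` floor are assumptions you would set per feature:

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples of one feature.

    edges are ascending bin boundaries; values above the last edge fall in
    the final bin. eps floors empty bins so the logarithm stays defined.
    """
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v > e for e in edges)] += 1  # index of the bin for v
        return [max(c / len(values), eps) for c in counts]

    e_shares, a_shares = shares(expected), shares(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_shares, a_shares))


def fires(value, threshold):
    """True when a monitored value crosses its documented threshold."""
    return value >= threshold
```

Identical distributions score zero and the score grows as the recent sample moves away from the reference. The threshold passed to `fires` is the number the monitoring plan has to state in advance.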

6. Deployment scaffolding

Shadow mode, canary, full rollout, rollback. The plan for each, and the criteria for advancing or reverting. Production traffic does not see the model until it has been shadowed and canaried.
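
The advance-or-revert logic is simple enough to state as code, which also makes it easy to review. A sketch under the assumption that each stage has a single pass/fail gate and that any rollback trigger overrides it; the stage names follow the paragraph above and the function is hypothetical:

```python
STAGES = ["shadow", "canary", "full"]


def next_stage(current, criteria_met, rollback_triggered):
    """Decide where a deployment goes next.

    A rollback trigger always wins. Otherwise the model advances one stage
    only when the current stage's exit criteria are met, and never skips.
    """
    if current not in STAGES:
        raise ValueError(f"unknown stage: {current}")
    if rollback_triggered:
        return "rollback"
    if not criteria_met:
        return current
    return STAGES[min(STAGES.index(current) + 1, len(STAGES) - 1)]
```

Because the function can only move one step at a time, there is no path to full rollout that bypasses shadow and canary.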

7. Audit log

Every prediction, with the input features, the model version, the output, the confidence, the downstream action taken. Retained for the regulator-required period. Queryable.
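
A minimal sketch of one audit record, serialized as a JSON line so that it stays queryable with ordinary tooling. The field names and the `by_model_version` helper are assumptions for illustration; retention and storage are outside the scope of this sketch:

```python
import json
from datetime import datetime, timezone


def audit_record(features, model_version, output, confidence, action, now=None):
    """Serialize one prediction and its downstream action as a JSON line."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return json.dumps(
        {
            "timestamp": timestamp,
            "model_version": model_version,
            "features": features,
            "output": output,
            "confidence": confidence,
            "action": action,
        },
        sort_keys=True,
    )


def by_model_version(lines, model_version):
    """Return the parsed records produced by one model version."""
    records = (json.loads(line) for line in lines)
    return [r for r in records if r["model_version"] == model_version]
```

Logging the input features alongside the model version is what lets a reviewer reconstruct a specific decision later without access to the live system.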

The reviewers, in calling order

Different reviewers care about different artifacts. Calling them in the wrong order is the most common cause of last-minute scope changes. The order we use:

Privacy / data office. Confirms the data treatment is acceptable. Before any modeling.
Security. Confirms the deployment topology meets the environment’s controls. Before any deployment work.
Model risk (independent validation). Reviews the evaluation methodology, challenger, fairness analysis, monitoring plan. Before production traffic.
Business owner. Signs off on the disposition memo and the success criteria. Before launch.
Operations. Confirms the on-call, runbooks and alerting. Before launch.
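
The calling order can be encoded as gates on project phases, so that a missing sign-off is visible before work starts rather than at launch. A sketch with hypothetical phase and reviewer keys; treating the gates as cumulative (a later phase also requires every earlier sign-off) is our assumption, not something the list above states:

```python
# Phases in the order work proceeds; names are illustrative.
PHASES = ["modeling", "deployment_work", "production_traffic", "launch"]

# Sign-offs that must exist before each phase begins.
GATES = {
    "modeling": {"privacy"},
    "deployment_work": {"security"},
    "production_traffic": {"model_risk"},
    "launch": {"business_owner", "operations"},
}


def missing_signoffs(phase, obtained):
    """Sign-offs still needed before `phase` can start (cumulative)."""
    idx = PHASES.index(phase)
    required = set().union(*(GATES[p] for p in PHASES[: idx + 1]))
    return sorted(required - set(obtained))
```

For instance, `missing_signoffs("production_traffic", {"privacy"})` reports that both security and model risk are still outstanding.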

Three sequencing choices that compound

Engage validation early, not at the end. Walking the independent validators through the evaluation methodology when it is still a draft is a one-hour conversation. Discovering at the end that the validators wanted a different holdout strategy is a six-week conversation.

Ship the challenger, not the champion. Default to the simpler model. Spend the complexity budget where the performance gap is large enough to defend. Reviewers reward this; so do operators.

Make monitoring a first-class artifact. The monitoring plan is part of the model, not a follow-up project. Models without monitoring fail silently; models with monitoring fail loudly, which is the only kind of failure you can act on.

The mistake we see most often

Teams build the model, then attempt to assemble the artifact set after the fact. The artifacts that result are accurate but unconvincing, because they read as documentation written to satisfy a reviewer rather than analysis that shaped the design. The reviewer notices. The model goes back.

Build the artifacts as you go, in the order the reviewers will ask for them. Treat the review as a checkpoint, not an obstacle. Models shipped this way go through review in days, not quarters, and ship more often.