The Governance Codex · Edition 2026.1
Governing Artificial Intelligence in Clinical Trials
An auditor's playbook for safe, ethical, and defensible AI.
Written from the chair of a federal-level reviewer, this playbook turns the NIST AI Risk Management Framework, the FDA's 2025 credibility-assessment guidance, and the brand-new FDA–EMA Guiding Principles of Good AI Practice (January 2026) into a single, plain-language operating model your team can actually run — without a data-science degree to read it.
- Document: AUR-GOV-CODEX-2026.1
- Classification: Public · Complimentary
- Frameworks: NIST AI RMF · FDA · EMA
- Last reviewed: June 2026
- Audience: Sponsors · CROs · QA · RA
- Reading time: ~45 minutes
The Auditor's Foreword
Why a governance playbook, and why now — explained by someone whose job is to ask the hard questions after the trial is over.
◆ What you'll understand
- Why regulators now treat AI as something to be governed, not just adopted
- How a federal reviewer reads an AI-enabled submission
- What "credibility" means and why it outranks accuracy
◆ What you'll be able to do
- Frame AI governance as a patient-safety and evidence-integrity question
- Explain to a non-technical board why governance is non-optional
- Use this playbook as a working checklist, not a brochure
Picture the moment a marketing application lands on a reviewer's desk and, somewhere inside it, a sentence reads: "Eligibility was determined with the support of a machine-learning model." The reviewer does not panic. The reviewer asks a sequence of very old questions in a very new context: What did the model decide? How much did its answer matter? And can you prove it was right — not once, but reliably?
That is the whole game. AI does not get a special pass and it does not get a special punishment. It gets the same scrutiny every other piece of evidence in a regulated trial receives: show your work, show your controls, and show that a human remained accountable for the outcome. The agencies have now said this out loud. In January 2025 the FDA published its first dedicated draft guidance on using AI to support regulatory decisions, built around a risk-based credibility assessment. One year later, in January 2026, the FDA and EMA jointly issued ten Guiding Principles of Good AI Practice in Drug Development — a shared transatlantic foundation for every guidance still to come.
From the reviewer's chair
I am not impressed by a model that is 98% accurate. I am impressed by a sponsor who can tell me where the model sits in the decision, what happens if it is wrong, and what evidence they gathered in proportion to that risk. Governance is simply the discipline of being able to answer those three questions before I ask them.
How to use this playbook
This is a working document, not a reference shelf. Each section opens with plain-language objectives framed in Bloom's taxonomy — from remembering a definition to creating your own governance artifacts — so you always know what you should be able to do when you finish. Interactive tools let you classify your own AI tools, walk the credibility steps, and score your organization's readiness. Your progress saves automatically; the Resume button in the top bar brings you back to exactly where you stopped. When you reach the end, you can issue yourself a completion certificate.
A note on scope. This playbook is an educational synthesis of public frameworks, offered free by Aurelyn AI Clinical. It is not legal advice and it does not replace the source documents, your regulatory affairs team, or direct engagement with a health authority. Every framework cited is referenced in §10.
The 2026 Regulatory Landscape
A guided tour of the rules that now apply to AI in trials — what each one is for, and how they fit together rather than compete.
◆ What you'll understand
- The major AI-relevant frameworks active in 2026
- How voluntary frameworks and binding law differ in force
- Where these instruments overlap and reinforce each other
◆ What you'll be able to do
- Name the framework that governs a given AI question
- Distinguish "good practice" from "legal requirement"
- Brief a colleague on the 2025–2026 regulatory timeline
The single most common mistake is to assume these frameworks are rivals you must choose between. They are layers. NIST gives you the management system. The FDA gives you the evidence standard for regulatory decisions. The FDA–EMA principles give you the shared values. And data-integrity and privacy rules give you the non-negotiable floor. A mature program speaks all four at once.
The instruments, in plain terms
| Instrument | What it is | Force | What it governs for you |
|---|---|---|---|
| NIST AI RMF 1.0 + Generative AI Profile (2024) | A voluntary risk-management framework structured around four functions: Govern, Map, Measure, Manage. | Voluntary | The operating system for your whole AI governance program. |
| FDA Draft Guidance (Jan 2025 · Docket FDA-2024-D-4689) | "Considerations for the Use of AI to Support Regulatory Decision-Making", which introduces a 7-step, risk-based credibility assessment tied to a model's context of use. | Draft guidance | How to prove an AI output is trustworthy enough for a specific regulatory decision. |
| FDA–EMA Guiding Principles (Jan 14, 2026) | Ten jointly issued principles for good AI practice across the medicine lifecycle. | Principles | The shared expectations that future binding guidance will be built on. |
| ICH E6(R3) · GCP | The modern Good Clinical Practice standard, written to accommodate digital and computerized systems. | Adopted GCP | The trial-conduct umbrella every AI tool still operates under. |
| 21 CFR Part 11 | US rule for electronic records and electronic signatures — trustworthiness, audit trails, access control. | Regulation (US) | The record-keeping floor for any system AI touches. |
| EU AI Act | Risk-tiered horizontal AI law; many clinical and safety uses fall in the "high-risk" tier. | Binding law (EU) | Hard legal obligations when you operate in or ship to the EU. |
How we got here — the timeline that matters
- Jan 2023: NIST releases AI RMF 1.0. The four-function framework (Govern · Map · Measure · Manage) becomes the de-facto global standard for managing AI risk.
- Jul 2024: NIST Generative AI Profile (AI 600-1). Extends the RMF to large language models and agentic systems, the model classes now appearing in trial operations.
- Sep 2024: EMA Reflection Paper on AI in medicines. Europe sets out transparency, data quality, and stakeholder-engagement expectations across the lifecycle.
- Jan 2025: FDA's first dedicated AI draft guidance. Introduces context of use and the 7-step credibility assessment framework for AI supporting regulatory decisions.
- Apr 2025: Public comment period closes. Industry, academia, and patient groups flag how to apply the steps to generative and LLM-based tools.
- Jan 14, 2026: FDA & EMA issue 10 joint Guiding Principles. A transatlantic foundation for harmonized AI guidance, and the centerpiece of §06 of this playbook.
- Expected 2026: FDA final guidance anticipated. Expected to incorporate comment-period input and align with the January 2026 joint principles.
The through-line
Read the timeline closely and one word recurs in every document: risk-based. Nobody is asking you to validate a low-stakes scheduling assistant to the same standard as a model that influences who is dosed. The entire system is built to make the amount of proof you owe scale with the amount of harm a wrong answer could cause. Master that one idea and the rest of this playbook becomes common sense.
The Aurelyn Governance Model
NIST's four functions, translated from abstract risk language into concrete clinical-trial controls. Open each function to see what it asks of you.
◆ What you'll understand
- The four NIST functions and what each is for
- Why Govern is the cross-cutting function that holds the others
- How each function maps to a real trial control
◆ What you'll be able to do
- Place any governance activity into the right function
- Identify which function your organization is weakest in
- Assign ownership for each control by ID (GOV / MAP / MEA / MAN)
NIST gives us four verbs. They are not a sequence you finish; they are a loop you keep running. Govern sits in the middle and feeds the other three: it is the culture, the policies, and the people. Map sets the context. Measure tests and evaluates. Manage acts on what you find. Below, each function is expanded into the specific controls a clinical-trial program should own, identified by prefix: GOV (Govern), MAP (Map), MEA (Measure), and MAN (Manage).
- GOV-01 Stand up a cross-functional AI governance body with clinical, regulatory, quality, data-science, ethics, and IT security at the table.
- GOV-02 Maintain a written AI policy that defines acceptable uses, prohibited uses, and the approval path before any tool touches trial data.
- GOV-03 Assign a named accountable owner for every AI system — a person, never "the vendor."
- GOV-04 Map and actively manage AI-related legal and regulatory obligations (Part 11, GCP, EU AI Act, privacy law).
- GOV-05 Govern third-party and "AI-as-a-service" risk: contracts, audit rights, and documentation flow-down to vendors.
- GOV-06 Train staff on AI literacy and on this policy, and record that they were trained.
- MAP-01 Maintain a living inventory of every AI system, model, and AI-enabled vendor service in use across studies.
- MAP-02 For each, write a one-paragraph context of use: what it does, where in the trial, and what decision it supports.
- MAP-03 Identify affected parties — participants, investigators, reviewers — and the potential impact on each.
- MAP-04 Set risk thresholds and categorize each tool before deployment, not after an incident.
- MAP-05 Document the data the system relies on and whether it is fit-for-use for this population and question.
- MEA-01 Run testing, evaluation, verification & validation (TEVV) sized to each tool's risk tier.
- MEA-02 Evaluate the whole system, including the human reviewing the output — not the model in isolation.
- MEA-03 Measure the trustworthiness characteristics in §05 with metrics appropriate to the context of use.
- MEA-04 Test for bias and subgroup performance across the populations the trial will enroll.
- MEA-05 Document methods, datasets, results, and any deviations so the evidence is independently reviewable.
- MAN-01 Prioritize and treat identified risks; document the rationale where you accept rather than mitigate.
- MAN-02 Define and preserve meaningful human oversight — a person who can question, override, and is accountable for the call.
- MAN-03 Monitor deployed models on a schedule for data drift and performance decay; re-evaluate periodically.
- MAN-04 Run an incident process: detect, respond, recover, and communicate when an AI system misbehaves.
- MAN-05 Manage change control — any model update is a change that re-enters Map and Measure.
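One way to act on GOV-03 is to keep the control IDs above in a small registry and flag every control that lacks a named person. The sketch below is illustrative only: the `Control` record, the helper names, and the example owners are hypothetical, and treating "the vendor" as unowned is an assumption drawn from the wording of GOV-03.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

# Prefix-to-function mapping, taken from the control IDs listed above.
FUNCTIONS = {"GOV": "Govern", "MAP": "Map", "MEA": "Measure", "MAN": "Manage"}


@dataclass
class Control:
    control_id: str          # e.g. "GOV-03"
    owner: Optional[str]     # a named person; None means nobody is assigned


def function_of(control_id: str) -> str:
    """Return the NIST function a control ID belongs to (KeyError if unknown)."""
    prefix = control_id.split("-")[0]
    return FUNCTIONS[prefix]


def unowned_by_function(controls):
    """Group controls with no named owner under their NIST function."""
    gaps = defaultdict(list)
    for control in controls:
        owner = (control.owner or "").strip().lower()
        # Assumption: "the vendor" never counts as an accountable owner.
        if owner in ("", "the vendor"):
            gaps[function_of(control.control_id)].append(control.control_id)
    return dict(gaps)


# Hypothetical registry entries for illustration.
registry = [
    Control("GOV-03", "J. Rivera"),
    Control("MAP-01", None),
    Control("MEA-04", "the vendor"),
    Control("MAN-03", "A. Chen"),
]
print(unowned_by_function(registry))
```

A grouping by function also shows at a glance which of the four functions has the most ownership gaps.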
Why Govern is in the center
Map, Measure, and Manage are activities. Govern is the gravity that keeps them in orbit. Without a governance body, an owner, and a policy, your validation evidence becomes a pile of disconnected PDFs that no one can find during an inspection. Build Govern first — everything else has somewhere to attach.
Context of Use & Model Risk
The single most important calculation in the FDA framework — made interactive. Tell the matrix how much your model matters and what happens if it's wrong, and it tells you how much evidence you owe.
◆ What you'll understand
- What "context of use" (COU) precisely means
- The two ingredients of model risk: influence & consequence
- Why two models with the same accuracy can owe very different proof
◆ What you'll be able to do
- Write a defensible context-of-use statement
- Classify any AI tool into a risk tier using the matrix
- Right-size the rigor of your credibility evidence
Before you can assess an AI model, the FDA asks you to pin down its context of use — the specific role and scope of the model: what question it answers, and how its output will be used in a decision. "A model that flags potential serious adverse events for human review in a Phase II oncology study" is a context of use. "We use AI" is not.
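A context-of-use statement can be held as a small structured record so that a vague entry is caught before it reaches the inventory. This is a minimal sketch, assuming the three elements named in MAP-02 (what the tool does, where in the trial, and what decision it supports); the class, its field names, and the example decision text are hypothetical and not part of the FDA framework.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ContextOfUse:
    what_it_does: str         # the model's role
    where_in_trial: str       # the study setting or phase
    decision_supported: str   # the decision its output feeds

    def missing_fields(self):
        """Names of the elements that were left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def statement(self) -> str:
        """Render a one-sentence statement, refusing incomplete records."""
        gaps = self.missing_fields()
        if gaps:
            raise ValueError("Context of use is incomplete: " + ", ".join(gaps))
        return (
            f"A model that {self.what_it_does} in {self.where_in_trial}, "
            f"supporting {self.decision_supported}."
        )


# Based on the example in the text; the decision wording is an assumption.
sae = ContextOfUse(
    what_it_does="flags potential serious adverse events for human review",
    where_in_trial="a Phase II oncology study",
    decision_supported="which events the safety reviewer examines first",
)
print(sae.statement())

# "We use AI" leaves two of the three elements empty.
vague = ContextOfUse("uses AI", "", "")
print(vague.missing_fields())
```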
From the context of use, the framework derives model risk as the combination of two things:
① Model influence
How much does the model's output drive the decision, relative to other evidence and to human judgment? A model whose output is one input among many a clinician weighs has low influence. A model whose output is acted on automatically has high influence.
② Decision consequence
If the decision is wrong, how bad is the harm? A wrong answer that delays a newsletter is low consequence. A wrong answer that affects participant safety, dosing, or the integrity of the pivotal evidence is high consequence.
The Model Risk Determination Matrix
The matrix crosses model influence with decision consequence. Find the cell that best matches one of your AI tools: the cell gives the tool's risk tier and, with it, the credibility rigor the FDA framework would expect in return.
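As a rough illustration of how influence and consequence could combine into a tier, the sketch below uses three ordinal levels per axis and a simple additive rule. The level names, the cut-offs, and the `risk_tier` function are assumptions made for this example; they are not the cell assignments of the FDA's own matrix, and a real program should document its own.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def risk_tier(influence: Level, consequence: Level) -> str:
    """Map (model influence, decision consequence) to an illustrative tier.

    Assumed rule: add the two ordinal levels; 2-3 is low, 4 is medium,
    5-6 is high. The rule is symmetric in its two inputs.
    """
    score = influence + consequence
    if score <= 3:
        return "low"
    if score == 4:
        return "medium"
    return "high"


# Two tools could share the same accuracy and still land in different tiers.
scheduling_assistant = risk_tier(Level.LOW, Level.LOW)
automated_dosing_input = risk_tier(Level.HIGH, Level.HIGH)
print(scheduling_assistant, automated_dosing_input)
```

Whatever rule is chosen, the written justification for each tool's placement matters more than the arithmetic.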
From the reviewer's chair
Sponsors lose credibility two ways: by under-proving a high-risk model, and by drowning a trivial tool in validation no one needed. The matrix protects you from both. When you tell me a tool is low-risk, I expect a crisp justification for why — low influence, low consequence — not an absence of thought.
The 7-Step Credibility Assessment
The FDA's core method, walked one step at a time. Use the controls to move through the sequence; each step shows what to do and a clinical-trial worked example.
◆ What you'll understand
- The seven steps in order
- Why "plan before you test" is the heart of the method
- How each step produces a documented artifact
◆ What you'll be able to do
- Build a credibility assessment plan for one of your models
- Decide when to engage the agency early
- Assemble the documentation a reviewer will expect
The framework's elegance is that it forces you to decide how much proof you need before you go looking for it. You define the question, define the context, weigh the risk, and only then design a credibility plan proportionate to that risk — execute it, document it, and judge whether it was enough.
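The sequence described above can be tracked as an ordered checklist in which every step must produce an artifact before the next one counts. The sketch below is a hedged illustration: the step labels paraphrase the paragraph above, and the helper functions and file names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Step labels paraphrased from the prose above, in order.
STEPS = (
    "Define the question of interest",
    "Define the context of use",
    "Assess the model risk",
    "Develop a credibility assessment plan",
    "Execute the plan",
    "Document the results and any deviations",
    "Decide whether the model is adequate for the context of use",
)


def next_open_step(artifacts: Dict[int, str]) -> Optional[Tuple[int, str]]:
    """Return the first step (number, label) with no documented artifact."""
    for number, label in enumerate(STEPS, start=1):
        if not artifacts.get(number):
            return number, label
    return None


def out_of_order(artifacts: Dict[int, str]) -> List[int]:
    """Steps that have an artifact although an earlier step does not."""
    gap = next_open_step(artifacts)
    if gap is None:
        return []
    return sorted(n for n, ref in artifacts.items() if ref and n > gap[0])


# Hypothetical file names: testing was run (step 5) before risk was assessed.
artifacts = {1: "question-of-interest.pdf", 2: "cou-statement.pdf", 5: "test-run-03.pdf"}
print(next_open_step(artifacts))
print(out_of_order(artifacts))
```

A non-empty `out_of_order` result is the "tested before planning" pattern that the method is designed to prevent.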
Engage early
For higher-risk contexts of use, the framework actively encourages sponsors to discuss the credibility plan with the agency before executing it — for example, through existing meeting pathways. An auditor would far rather see an aligned plan than a clever after-the-fact justification.
The Seven Marks of Trustworthy AI
NIST defines what "trustworthy" actually means in seven concrete characteristics. For each, here is what it means inside a trial — and the question an auditor will ask you.
◆ What you'll understand
- The seven NIST trustworthiness characteristics
- Why they often trade off against each other
- How to weigh which characteristics matter most for a given tool
◆ What you'll be able to do
- Translate each characteristic into a testable requirement
- Answer the auditor's question for each one with evidence
- Document the trade-offs you deliberately accepted
"Trustworthy" is not a vibe. NIST breaks it into seven characteristics, and the honest part of the framework is that they compete: making a model more secure can make it less usable; maximizing accuracy can reduce explainability. Governance is not about scoring perfectly on all seven — it is about making the trade-offs deliberately and writing down why.
Valid & reliable
The output is accurate for its purpose and stays accurate across repeated, real-world use.
Safe
The system does not, under foreseeable conditions, lead to states that endanger human life or health.
Secure & resilient
It withstands adversarial inputs and recovers gracefully from disruption or attack.
Accountable & transparent
Information about the system is available, and a responsible human owner is identifiable.
Explainable & interpretable
You can describe how it works and give meaning to its outputs in human terms.
Privacy-enhanced
It safeguards individual autonomy, identity, and dignity in how it handles data.
Fair, with bias managed
Harmful bias is identified and managed so the system does not produce inequitable outcomes.
The trade-off you must document
A reviewer is not unsettled that you made trade-offs — every real system does. A reviewer is unsettled when trade-offs appear to have happened by accident. If you accepted lower explainability for higher accuracy in a high-risk tool, say so, justify it against the context of use, and show the compensating human oversight. Deliberate beats perfect.
Ten Principles of Good AI Practice
The FDA–EMA joint principles of January 2026, quoted faithfully and paired with the one operational move each demands of you.
◆ What you'll understand
- All ten FDA–EMA principles
- How they consolidate everything in §02–§05
- Which principles your program already satisfies
◆ What you'll be able to do
- Map each principle to an owner and an artifact
- Use the principles as the agenda for a governance review
- Anticipate the expectations of future binding guidance
On 14 January 2026 the FDA and EMA jointly released ten principles to steer responsible AI across the medicine lifecycle. They are not yet binding, but they are explicitly described as the foundation for future guidance. Treat them as a preview of the questions you will be asked. The quoted text under each heading is the agencies' own wording; the note beneath it is your operational move.
Human-centric by design
"The development and use of AI technologies align with ethical and human-centric values."
Your move · Put patient interest and public health first in design reviews; build safeguards in from the start, not as a patch.
Risk-based approach
"…a risk-based approach with proportionate validation, risk mitigation, and oversight based on the context of use and determined model risk."
Your move · Run every tool through the §03 matrix and size your evidence to the tier it lands in.
Adherence to standards
"AI technologies adhere to relevant legal, ethical, technical, scientific, cybersecurity, and regulatory standards, including Good Practices (GxP)."
Your move · Connect AI controls to Part 11, ICH E6(R3), GAMP 5, and your security standards — don't run a parallel universe.
Clear context of use
"AI technologies have a well-defined context of use (role and scope for why it is being used)."
Your move · Maintain a written COU statement for every tool in your inventory (MAP-02).
Multidisciplinary expertise
"Multidisciplinary expertise covering both the AI technology and its context of use are integrated throughout the technology's life cycle."
Your move · Ensure clinical, data-science, and regulatory voices all sign off — not data science alone.
Data governance and documentation
"Data source provenance, processing steps, and analytical decisions are documented in a detailed, traceable, and verifiable manner, in line with GxP requirements."
Your move · Apply ALCOA+ thinking to data lineage; protect sensitive data across the whole lifecycle.
Model design and development practices
"…best practices in model and system design and software engineering… leverages data that is fit-for-use, considering interpretability, explainability, and predictive performance."
Your move · Treat model development as engineering: version control, testing, and documented design decisions.
Risk-based performance assessment
"…evaluate the complete system including human-AI interactions, using fit-for-use data and metrics appropriate for the intended context of use…"
Your move · Validate the human-plus-model system together; pick metrics that match the decision, not generic accuracy.
Life cycle management
"Risk-based quality management systems are implemented throughout the AI technologies' life cycles… scheduled monitoring and periodic re-evaluation… (e.g., to address data drift)."
Your move · Schedule monitoring and re-validation; a model is never "done" (MAN-03).
Clear, essential information
"Plain language is used to present clear, accessible, and contextually relevant information… regarding the AI technology's context of use, performance, limitations, underlying data, updates, and interpretability or explainability."
Your move · Write a plain-language summary for users and, where relevant, participants — this very playbook models the standard.
Lifecycle Controls in Practice
Governance is not an event at launch — it runs from data sourcing to retirement. These are the four control domains an auditor inspects most closely.
◆ What you'll understand
- The four lifecycle control domains
- How established GxP rules already cover most AI obligations
- What "human oversight" must actually look like
◆ What you'll be able to do
- Connect AI controls to Part 11, GAMP 5, and ALCOA+
- Design a drift-monitoring and re-validation cadence
- Specify vendor obligations in contracts and audits
Provenance & ALCOA+
Every dataset feeding a model must be Attributable, Legible, Contemporaneous, Original, and Accurate — plus Complete, Consistent, Enduring, and Available. Document where data came from, how it was processed, and every analytical decision. Under 21 CFR Part 11, the systems holding that data need audit trails, access controls, and trustworthy electronic records.
Computerized system validation & GAMP 5
GAMP 5's risk-based, category-driven approach maps cleanly onto AI: scale validation effort to the system's risk and novelty. The twist for AI is that validation is not one-and-done — a model that learns or is retrained re-enters validation through change control.
Meaningful human control
Oversight is "meaningful" only if the human can realistically understand, question, and override the output, and is accountable for the decision. A rubber-stamp reviewer who cannot interpret the model is not oversight — it is the appearance of oversight, which an auditor will see through immediately.
Drift, decay & re-evaluation
Real-world data shifts away from what a model learned — data drift. The FDA–EMA principles call for scheduled monitoring and periodic re-evaluation. Define your triggers (calendar-based and performance-based), your thresholds, and what happens when a threshold is breached.
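To make "triggers and thresholds" concrete, the sketch below pairs a performance-based trigger with a calendar-based one. It uses the Population Stability Index (PSI) as the drift measure, which is a common industry technique swapped in for illustration and not something the FDA–EMA principles prescribe. The 0.25 threshold is a widely used rule of thumb, and the 180-day review interval and function names are assumptions.

```python
import math
from datetime import date, timedelta


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions.

    expected_counts: per-bin counts from the data the model was validated on.
    actual_counts:   per-bin counts from recent real-world data (same bins).
    """
    if len(expected_counts) != len(actual_counts):
        raise ValueError("Both distributions must use the same bins")
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    value = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_share = max(e / e_total, eps)   # eps avoids log(0) for empty bins
        a_share = max(a / a_total, eps)
        value += (a_share - e_share) * math.log(a_share / e_share)
    return value


def revalidation_reasons(psi_value, last_review, today,
                         psi_threshold=0.25, max_age_days=180):
    """List which triggers (if any) call for re-evaluation."""
    reasons = []
    if psi_value >= psi_threshold:
        reasons.append("performance-based: drift threshold breached")
    if today - last_review > timedelta(days=max_age_days):
        reasons.append("calendar-based: scheduled review overdue")
    return reasons


# Hypothetical age-band counts at validation time versus recent enrollment.
baseline = [50, 30, 20]
recent = [20, 30, 50]
drift = psi(baseline, recent)
print(revalidation_reasons(drift, date(2026, 1, 1), date(2026, 3, 1)))
```

What happens after a trigger fires (who is notified, whether the tool is paused) belongs in the monitoring plan, not in the code.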
From the reviewer's chair
"AI-as-a-service" is where I find the most surprises. A sponsor outsources a model to a vendor, and with it, quietly outsources the documentation it can no longer produce for me. You can delegate the work. You cannot delegate the accountability. Build audit rights and documentation flow-down into the contract on day one.
Governance Readiness Self-Assessment
Sixteen questions, mapped to the four NIST functions. Answer honestly — the score is for you, and it saves automatically so you can return and re-test as you improve.
◆ What you'll understand
- Where your program is strong and where it is exposed
- Your overall readiness against the four functions
◆ What you'll be able to do
- Produce a baseline readiness score to track over time
- Prioritize the function with the lowest score first
- Generate a focused remediation conversation for leadership
Your readiness, scored live
Each answer: Yes = in place & evidenced · Partial = informal or undocumented · No = gap.
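For teams that want to reproduce the score offline, a minimal sketch follows. The weights (Yes = 1, Partial = 0.5, No = 0), the even split of four questions per function, and the function names are all assumptions for illustration; the playbook's own scoring may differ.

```python
# Assumed weights for the three answer options described above.
WEIGHTS = {"yes": 1.0, "partial": 0.5, "no": 0.0}


def score_by_function(answers):
    """Percentage score per NIST function.

    answers maps a function name to its list of 'yes' / 'partial' / 'no'.
    """
    return {
        function: 100 * sum(WEIGHTS[a] for a in given) / len(given)
        for function, given in answers.items()
    }


def overall_score(answers):
    """Percentage score across every question, regardless of function."""
    all_answers = [a for given in answers.values() for a in given]
    return 100 * sum(WEIGHTS[a] for a in all_answers) / len(all_answers)


def weakest_function(scores):
    """The function to prioritize first: the one with the lowest score."""
    return min(scores, key=scores.get)


# Hypothetical answers: sixteen questions, four per function.
answers = {
    "Govern": ["yes", "yes", "partial", "partial"],
    "Map": ["yes", "partial", "no", "no"],
    "Measure": ["partial", "partial", "partial", "partial"],
    "Manage": ["yes", "yes", "yes", "yes"],
}
scores = score_by_function(answers)
print(scores, overall_score(answers), weakest_function(scores))
```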
The Audit Evidence Pack
The artifacts a federal reviewer will actually ask to see. Tick them off as your program produces each one — your checkmarks save automatically.
◆ What you'll understand
- The documents that constitute an audit-ready file
- How each artifact maps back to a framework requirement
◆ What you'll be able to do
- Assemble a credibility dossier for a single AI tool
- Identify which artifacts you are missing today
- Create your own version-controlled evidence index
Inspection-readiness is mostly a documentation problem. If you can hand a reviewer this set of artifacts for any AI tool in your inventory, you are in genuinely good shape. Use it as a packing list.
- AI system inventory entry: Tool name, owner, vendor, and where it operates in the study (MAP-01).
- Context-of-use statement: The model's role, scope, and the decision it supports (MAP-02).
- Model risk determination: Influence × consequence, the resulting tier, and the justification (§03).
- Credibility assessment plan: The proportionate evidence plan, ideally agreed with the agency for high risk (§04).
- Data provenance & ALCOA+ record: Source, processing steps, analytical decisions, and fitness-for-use (Principle 6).
- Validation / TEVV report: Performance, subgroup/bias testing, and whole-system (human-AI) evaluation (MEA-01–05).
- Part 11 controls evidence: Audit trails, access control, and electronic-record integrity for systems involved.
- Human oversight procedure: Who reviews, what they can override, and the record that they did (MAN-02).
- Monitoring & drift plan: Triggers, thresholds, cadence, and re-validation rules (MAN-03).
- Change-control log: Every model update, retraining, or version change and its re-assessment (MAN-05).
- Vendor governance package: Contracts, audit rights, and documentation flow-down for AI-as-a-service (GOV-05).
- Plain-language summary: Context, performance, and limitations stated clearly for users (Principle 10).
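The packing list above lends itself to a simple evidence index per AI tool, kept under version control alongside the documents it points to. The sketch below is one possible shape, assuming a plain mapping from artifact name to document reference; the helper names and the example references are hypothetical.

```python
# Artifact names copied from the packing list above.
REQUIRED_ARTIFACTS = (
    "AI system inventory entry",
    "Context-of-use statement",
    "Model risk determination",
    "Credibility assessment plan",
    "Data provenance & ALCOA+ record",
    "Validation / TEVV report",
    "Part 11 controls evidence",
    "Human oversight procedure",
    "Monitoring & drift plan",
    "Change-control log",
    "Vendor governance package",
    "Plain-language summary",
)


def missing_artifacts(index):
    """Required artifacts with no document reference, in packing-list order."""
    return [name for name in REQUIRED_ARTIFACTS if not index.get(name)]


def unrecognized_entries(index):
    """Index keys that do not match any required artifact (likely typos)."""
    return sorted(set(index) - set(REQUIRED_ARTIFACTS))


# Hypothetical index for one tool; an empty string means "not yet produced".
index = {
    "AI system inventory entry": "INV-014 v2",
    "Context-of-use statement": "COU-014 v1",
    "Model risk determination": "",
    "Validaton report": "VAL-014 v1",   # misspelled on purpose
}
print(missing_artifacts(index))
print(unrecognized_entries(index))
```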
Glossary & Sources
Plain-language definitions for the terms used throughout, and the primary public sources every claim is drawn from.
Glossary
- Context of Use (COU): The specific role and scope of an AI model, meaning what question it answers and how its output is used in a decision.
- Model risk: The combination of model influence (how much it drives the decision) and decision consequence (how bad a wrong answer is).
- Credibility: Trust in an AI model's output for a particular context of use, established by evidence proportionate to its risk.
- TEVV: Testing, Evaluation, Verification & Validation, the evidence-gathering activities under NIST's Measure function.
- Data drift: The gradual divergence of real-world data from the data a model was trained on, which degrades performance over time.
- ALCOA+: Data-integrity expectations: Attributable, Legible, Contemporaneous, Original, Accurate, plus Complete, Consistent, Enduring, Available.
- GxP: The family of "Good Practice" quality regulations (GCP, GMP, GLP) governing regulated clinical and manufacturing work.
- GAMP 5: A risk-based framework for validating computerized systems, scaling effort to system risk and novelty.
Primary sources
- NIST. Artificial Intelligence Risk Management Framework (AI RMF 1.0). January 26, 2023; plus the Generative AI Profile (NIST AI 600-1), July 26, 2024.
- FDA. "Considerations for the Use of Artificial Intelligence to Support Regulatory Decision-Making for Drug and Biological Products." Draft Guidance for Industry, January 2025 (Docket FDA-2024-D-4689).
- FDA & EMA. "Guiding Principles of Good AI Practice in Drug Development." January 2026 (10 jointly issued principles, quoted in §06).
- FDA. "Artificial Intelligence and Medical Products: How CBER, CDER, CDRH, and OCP are Working Together." March 2024, revised February 2025.
- ICH. E6(R3) Good Clinical Practice.
- US FDA. 21 CFR Part 11 — Electronic Records; Electronic Signatures.
- European Union. Regulation (EU) 2024/1689 (the EU AI Act); EMA Reflection Paper on the use of AI in the medicinal product lifecycle (September 2024).
- ISPE. GAMP 5: A Risk-Based Approach to Compliant GxP Computerized Systems (2nd ed.).
You've reached the end
You now have the full operating model: the landscape, the four functions, context of use and risk, the seven credibility steps, the seven marks of trust, the ten principles, the lifecycle controls, a scored readiness baseline, and an evidence packing list. Issue yourself a certificate of completion — then go put one tool through the §03 matrix this week.