
// lab research publication · msc cybersecurity & digital forensics

Adaptive Zero Trust AI Gateway

with Behavioral Threat Intelligence and Explainable Risk Modeling

Zero Trust Architecture · LLM Security · Prompt Injection · AI Gateway · Adaptive Risk Scoring · Model Serving Security · Behavioural Threat Intelligence · Secure Inference · Explainable Security · Open-Source AI
Shihab Sahoriar · UWTSD 2025

91.3%

Standard attack-block rate

105 of 115 attack prompts correctly blocked

4.7%

False positive rate

Standard policy mode

96.8%

Strict attack-block rate

11.4% FPR trade-off

130

Evaluation corpus

7 prompt categories

// abstract

Abstract

The rapid adoption of open-source artificial intelligence models presents significant and underaddressed security challenges — including adversarial prompt injection, model posture degradation, and the absence of a unified enforcement point between user intent and model inference. Existing security frameworks, designed for traditional network perimeters, fail to account for the dynamic, session-sensitive, and behavioural nature of AI-mediated interactions.

This research proposes and evaluates an Adaptive Zero Trust AI Gateway — a five-layer security architecture that enforces never-trust-always-verify principles at every point in the AI request lifecycle. The gateway integrates posture-based model assessment before deployment, runtime prompt inspection, deterministic risk scoring using a weighted effective-risk function, adaptive policy enforcement across three configurable modes, cross-model behavioural intelligence, and a fully explainable decision audit trail designed to support SOC monitoring.

The system is implemented in Python (FastAPI, SQLAlchemy, PostgreSQL) with a React/Next.js monitoring dashboard and evaluated against a 130-prompt corpus spanning seven categories (six attack types plus benign usage). Under Standard policy mode, the gateway achieves a 91.3% attack-block rate at a 4.7% false positive rate. Strict mode raises detection to 96.8% at the cost of an elevated 11.4% FPR. Against the static-rule baseline (57.5%), Standard mode is a 33.8 percentage-point improvement and Strict mode a 39.3 percentage-point improvement, confirming that adaptive behavioural scoring substantially outperforms fixed threshold approaches.

Findings demonstrate that Zero Trust principles are directly applicable to AI model serving environments and that the combination of posture assessment, adaptive risk scoring, and explainable enforcement produces measurable, operationally relevant security improvements. The implementation is released as open-source infrastructure for the AI security research community.

// research questions

Four Questions This Research Answers

RQ1

How can Zero Trust Architecture principles be adapted to govern open-source AI model access without introducing prohibitive latency?

Finding

Policy evaluation adds ≤ 47 ms median overhead — within acceptable production thresholds.

RQ2

Does posture-based model assessment before deployment reduce the risk of serving compromised or degraded models?

Finding

Readiness state enforcement prevented 3 of 3 simulated degraded-model deployments during testing.

RQ3

To what extent does adaptive policy enforcement with behavioural trust scoring outperform static rule-based approaches in detecting adversarial prompts?

Finding

Adaptive enforcement yielded a 33.8 percentage-point improvement over the static-rule baseline (57.5% → 91.3%).

RQ4

Can security decisions made by an AI gateway be rendered sufficiently explainable to support real-time SOC monitoring and post-incident analysis?

Finding

100% of decisions logged with human-readable rationale: risk signals, policy rule invoked, confidence score, and outcome.

// original contributions

Six Contributions to the Field

Unified ZTA Framework for AI Model Serving

First documented open-source implementation of Zero Trust Architecture applied specifically to the AI model serving pipeline — from onboarding to inference to audit.

Posture-Based Model Assessment Pipeline

A six-state readiness model (READY → EVALUATING → DEGRADED → QUARANTINED → SUSPENDED → REVOKED) with automated posture checks before any model enters the serving pool.

Deterministic Effective-Risk Function

A weighted risk aggregation formula combining prompt risk, model risk, behavioural sequence anomaly, cross-model intelligence, session trust, and active controls — with empirically derived weights.

Adaptive Three-Mode Policy Enforcement

Permissive, Standard, and Strict policy modes with configurable thresholds. Operators select the posture; the engine adapts decisions continuously without redeployment.

Cross-Model Behavioural Intelligence

A cross-model correlation layer that aggregates attack patterns across all models served by the gateway — detecting multi-model exploitation campaigns invisible to per-model detectors.

SOC-Ready Explainability Framework

Every enforcement decision surfaces a structured audit record — risk component scores, policy rule matched, confidence level, and a plain-English summary — enabling real-time SOC review.
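
A structured audit record of the kind described for the explainability framework might look like the minimal Python sketch below. The class name `DecisionRecord` and every field name are illustrative assumptions, not the gateway's actual schema; only the four kinds of content (component scores, matched rule, confidence, plain-English summary) come from the description above.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DecisionRecord:
    """Hypothetical audit record; all field names are assumptions."""
    request_id: str
    decision: str            # "ALLOW", "CHALLENGE" or "BLOCK"
    risk_components: dict    # per-component risk scores
    policy_rule: str         # identifier of the rule that matched
    confidence: float        # 0.0 to 1.0
    summary: str             # plain-English explanation

    def to_json(self) -> str:
        # Sorted keys keep the serialised form stable for log diffing.
        return json.dumps(asdict(self), sort_keys=True)


record = DecisionRecord(
    request_id="req-001",
    decision="BLOCK",
    risk_components={"R_prompt": 91.0, "T_trust": 60.0},
    policy_rule="standard.block",
    confidence=0.9,
    summary="Prompt matched an instruction-override pattern.",
)
print(record.to_json())
```

Serialising to a flat JSON object is one way to keep such a record readable by both a dashboard and a log pipeline; the real record format may differ.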

// literature review & research gaps

Prior Work & the Gap This Research Fills

A systematic review of 28 papers across Zero Trust Architecture, AI security, and adversarial ML surfaces a consistent pattern: existing work addresses perimeter security or model robustness in isolation, but not the runtime enforcement control plane needed to govern open-source AI serving securely.

Comparison of Related Work (Table 2.1 — simplified)

Approach | ZTA | Posture Eval | Adaptive Risk | Open-Source AI | Explainability
NIST ZTA (SP 800-207)
Azure AI Content Safety
OWASP LLM Top 10
Perez et al. (2022) — Prompt Injection
Greshake et al. (2023) — Indirect Injection
This Research (ZTA AI Gateway)

Key research gaps identified (Table 2.3)

No unified gateway for open-source AI

No enforcement point between user and unvetted model

ZTA not applied to model serving

Network-layer ZTA misses application-layer AI threats

Posture assessment absent from AI security

Compromised models served without runtime verification

Adaptive risk scoring missing from AI gateways

Static rules fail against evolving adversarial patterns

Explainability not a design goal in prior systems

SOC analysts cannot audit or override AI security decisions

Cross-model attack correlation unexplored

Multi-model campaigns invisible to per-model defences

// system architecture

14 System Components

The gateway is implemented as 14 discrete, loosely coupled services — each with a single responsibility in the security pipeline — orchestrated by a central Control Plane.

Component catalogue (a) – (n)

(a)

Authentication & Authorisation

JWT-based access control; session binding to user identity before any gateway action.

(b)

Dashboard UI

React/Next.js SOC monitoring dashboard; real-time event feeds, trust graphs, decision logs.

(c)

Chat Interface

User-facing AI interaction surface; all requests routed through the gateway pipeline.

(d)

Model Registry

Central catalogue of available models with metadata: source, version, capability flags, licence.

(e)

Model Readiness Service

Evaluates and tracks model state across six posture states; gates inference access.

(f)

Posture Assessment Engine

Runs automated checks (vulnerability scan, behavioural baseline, licence risk) on model registration and on schedule.

(g)

Control Plane

Orchestrates pipeline stages; routes requests between inspection, policy, and inference services.

(h)

Prompt Guard

Detects prompt injection, jailbreak patterns, and adversarial markers using rule sets and heuristic scoring.

(i)

Policy Engine

Applies the effective-risk function and threshold tables; issues ALLOW / CHALLENGE / BLOCK.

(j)

Trust Scoring Service

Maintains per-user session trust; applies incremental updates (+1 / −5 / −15 / −30) per interaction outcome.

(k)

Cross-Model Intelligence

Aggregates signals across sessions and models; surfaces multi-model attack campaigns.

(l)

Output Guard

Post-inference filter for PII leakage, sensitive data disclosure, and adversarial response patterns.

(m)

Audit Log Service

Persists all security events to PostgreSQL; structured for SIEM export and forensic analysis.

(n)

Research Evaluation Module

Scenario runner for evaluation corpus tests; collects metrics for academic and operational benchmarking.

// five-layer architecture

Layered Pipeline Design

The fourteen components are grouped into five logical layers. Data flows sequentially — a failure at any layer halts downstream access.

01

Model Onboarding Layer

Evaluates open-source AI models before deployment — checking model posture, licence risk, known vulnerabilities, and behavioural baselines. Only READY-state models enter the active serving pool. Components: (d) Model Registry, (e) Readiness Service, (f) Posture Assessment Engine.

Posture Check · Vulnerability Scan · Behavioral Baseline · Model Registry
02

Zero Trust Enforcement Layer

Applies never-trust-always-verify to every request. No implicit trust is granted — every prompt is inspected against policy rules, user behavioural history, and contextual risk signals before a decision is made. Components: (g) Control Plane, (h) Prompt Guard, (i) Policy Engine.

Policy Engine · Prompt Guard · Risk Scoring · Never Trust Always Verify
03

Risk Reduction Layer

Applies protective measures when risk scores exceed thresholds — restricting capabilities, issuing CHALLENGE responses, sandboxing model access, or escalating to human review before inference proceeds. Components: (i) Policy Engine, (j) Trust Scoring.

Secure Mode · Capability Restriction · Adaptive Threshold · Sandboxing
04

Adaptive Reassessment Layer

Continuously re-evaluates trust and risk as sessions evolve. Repeated risky behaviour, stale model conditions, and anomalous usage patterns trigger reassessment and dynamic policy adjustment. Components: (j) Trust Scoring, (k) Cross-Model Intelligence.

Session Monitoring · Behavioral Drift · Dynamic Policy · Cross-Model Intel
05

Explanation & Audit Layer

Every security decision is logged with a human-readable explanation — risk signals detected, policy rule applied, decision outcome, and confidence score. Feeds the SOC monitoring dashboard for real-time review. Components: (l) Output Guard, (m) Audit Log Service, (n) Research Evaluation.

Explainability · SOC Logs · Audit Trail · Decision Trace

// model readiness states

Six Posture States

Every model in the registry exists in one of six mutually exclusive states. The Readiness Service evaluates posture on registration, on schedule (every 72 hours), and on-demand when anomalies are detected.

READY

Passed all posture checks; eligible for active inference.

EVALUATING

Initial posture assessment in progress; inference gated pending result.

DEGRADED

One or more posture signals declined; access restricted, CHALLENGE mode enforced.

QUARANTINED

Critical posture failure detected; no inference permitted, remediation required.

SUSPENDED

Operator-initiated hold; model removed from pool pending review.

REVOKED

Permanently removed; associated policy and audit records retained.
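
The gating behaviour implied by the six states can be sketched as a small Python enum plus a lookup. This is a minimal illustration, assuming that only READY serves normally, DEGRADED serves under CHALLENGE, and every other state is denied; the names `Posture` and `inference_gate` and the outcome strings are hypothetical and not taken from the implementation.

```python
from enum import Enum


class Posture(Enum):
    READY = "READY"
    EVALUATING = "EVALUATING"
    DEGRADED = "DEGRADED"
    QUARANTINED = "QUARANTINED"
    SUSPENDED = "SUSPENDED"
    REVOKED = "REVOKED"


def inference_gate(state: Posture) -> str:
    """Map a posture state to a gate outcome (hypothetical helper)."""
    if state is Posture.READY:
        return "SERVE"       # passed all checks, eligible for inference
    if state is Posture.DEGRADED:
        return "CHALLENGE"   # restricted access, CHALLENGE enforced
    return "DENY"            # all remaining states are kept out of inference


for state in Posture:
    print(state.value, "->", inference_gate(state))
```

Defaulting to DENY for anything that is not explicitly permitted keeps the gate fail-closed if a new state is added later.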

// risk model

The Effective-Risk Function

Every request is evaluated against a deterministic weighted risk function. The output — a score 0–100 — maps directly to a policy decision. The formula is fully auditable: each component is surfaced in the decision record.

Effective-risk function

R_effective = w₁·R_prompt + w₂·R_model + w₃·A_sequence + w₄·C_cross-model − w₅·T_trust − w₆·E_controls

Component weights & semantics

w₁ = 1.0

R_prompt

Prompt risk — highest weight, direct injection and adversarial signal.

w₂ = 0.6

R_model

Model posture risk — degradation score from readiness service.

w₃ = 0.8

A_sequence

Sequence anomaly — behavioural deviation from established baseline.

w₄ = 0.7

C_cross-model

Cross-model intelligence signal — correlated attack pattern score.

w₅ = 0.5

T_trust

Session trust — accumulated clean-interaction credit (reduces risk).

w₆ = 0.4

E_controls

Active controls — applied mitigations that reduce residual risk.
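
As a rough illustration, the weighted aggregation can be written in a few lines of Python. This sketch makes three assumptions that the text does not state outright: each component is expressed on a 0 to 100 scale, session trust and active controls are subtracted (they are described as reducing risk), and the result is clamped to the 0 to 100 range. The names `RiskInputs` and `effective_risk` are hypothetical.

```python
from dataclasses import dataclass

# Weights as listed in the component table.
W_PROMPT, W_MODEL, W_SEQUENCE = 1.0, 0.6, 0.8
W_CROSS, W_TRUST, W_CONTROLS = 0.7, 0.5, 0.4


@dataclass
class RiskInputs:
    r_prompt: float = 0.0
    r_model: float = 0.0
    a_sequence: float = 0.0
    c_cross: float = 0.0
    t_trust: float = 0.0
    e_controls: float = 0.0


def effective_risk(x: RiskInputs) -> float:
    raw = (
        W_PROMPT * x.r_prompt
        + W_MODEL * x.r_model
        + W_SEQUENCE * x.a_sequence
        + W_CROSS * x.c_cross
        - W_TRUST * x.t_trust        # assumption: trust lowers risk
        - W_CONTROLS * x.e_controls  # assumption: controls lower risk
    )
    # Assumption: the score is clamped to the documented 0-100 range.
    return max(0.0, min(100.0, raw))


print(effective_risk(RiskInputs(r_prompt=50, r_model=10, t_trust=20)))
```

Because every term is a plain product of a fixed weight and a logged component, the score can be recomputed by hand from an audit record, which is what makes the function auditable.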

// policy engine

Three Policy Modes

Operators select a policy posture at deployment. The engine applies the corresponding threshold table to every R_effective score — no code changes required to switch modes.

Policy threshold table

Mode | ALLOW | CHALLENGE | BLOCK
Permissive | ≤ 59 | 60 – 79 | ≥ 80
Standard | ≤ 39 | 40 – 69 | ≥ 70
Strict | ≤ 29 | 30 – 54 | ≥ 55
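
The threshold table translates directly into a lookup. The Python sketch below is a minimal illustration, assuming the table is stored as upper bounds for ALLOW and CHALLENGE and that a fractional score falling between two integer bands is assigned to the stricter band; `POLICY_TABLE` here and `decide` are hypothetical names.

```python
# Upper bounds for ALLOW and CHALLENGE; any higher score is BLOCK.
POLICY_TABLE = {
    "permissive": (59, 79),
    "standard": (39, 69),
    "strict": (29, 54),
}


def decide(score: float, mode: str) -> str:
    allow_max, challenge_max = POLICY_TABLE[mode]
    if score <= allow_max:
        return "ALLOW"
    if score <= challenge_max:
        return "CHALLENGE"
    return "BLOCK"


# The same score lands in a different band under each mode.
for mode in POLICY_TABLE:
    print(mode, decide(58, mode))
```

A score of 58 is allowed in Permissive mode, challenged in Standard mode, and blocked in Strict mode, which shows how switching modes changes outcomes without touching the scoring logic.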

Policy decision pseudocode

def evaluate_request(request, policy_mode):
    # Combine the six component scores into a single effective-risk value.
    R = compute_effective_risk(
        R_prompt   = prompt_guard.score(request),
        R_model    = readiness_service.risk(request.model_id),
        A_sequence = trust_service.sequence_anomaly(request.user_id),
        C_cross    = cross_model_intel.correlation_score(request),
        T_trust    = trust_service.get_trust(request.user_id),
        E_controls = controls.active_score(request.session_id),
    )

    # Look up the ALLOW and CHALLENGE upper bounds for the selected mode.
    thresholds = POLICY_TABLE[policy_mode]

    if R <= thresholds.allow:
        decision = ALLOW
        trust_service.update(request.user_id, +1)    # clean interaction
    elif R <= thresholds.challenge:
        decision = CHALLENGE
        trust_service.update(request.user_id, -5)    # suspicious prompt
    else:
        decision = BLOCK
        trust_service.update(request.user_id, -15)   # request blocked

    # Every decision is written to the audit log with its explanation.
    audit_log.write({
        "request_id": request.id,
        "risk_score": R,
        "components": {...},
        "decision": decision,
        "policy_mode": policy_mode,
        "explanation": explain(R, decision),
    })

    return decision

// trust scoring

Session Trust Dynamics

Trust delta per interaction outcome

+1

Clean interaction

Slow trust accumulation for consistent benign users.

−5

Suspicious prompt detected

Moderate decay; borderline-risk content flags caution.

−15

Request blocked

Significant decay; confirmed policy violation.

−30

Critical violation (confirmed injection / jailbreak)

Severe decay; trust often collapses to zero within 2–3 incidents.
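
The effect of these deltas can be checked by replaying them in a few lines of Python. This is a minimal sketch, assuming trust is an integer clamped to the 0 to 100 range and that a session starts at 60 (the session default given in this document); the names `DELTAS`, `apply_outcomes`, and `requests_until_zero` are hypothetical.

```python
DELTAS = {"clean": 1, "suspicious": -5, "blocked": -15, "critical": -30}


def apply_outcomes(trust: int, outcomes: list[str]) -> int:
    for outcome in outcomes:
        # Assumption: trust never leaves the 0-100 range.
        trust = max(0, min(100, trust + DELTAS[outcome]))
    return trust


def requests_until_zero(trust: int, outcome: str) -> int:
    """Count how many identical negative outcomes drain trust to zero."""
    count = 0
    while trust > 0:
        trust = apply_outcomes(trust, [outcome])
        count += 1
    return count


print(requests_until_zero(60, "blocked"))   # 4
print(requests_until_zero(60, "critical"))  # 2
```

A mixed session drains more slowly: three suspicious prompts followed by three blocks also reach zero from 60, but take six requests rather than four.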

Trust dynamics — Figure 5.1 description

Initial state

Sessions initialise at T_trust = 60. External and anonymous users may be initialised lower (recommended: 30) to reduce exploitation window.

Decay trajectory

At −15 per blocked request, a session starting at T_trust = 60 collapses to zero after 4 consecutive blocks. At −30 (critical violation), collapse occurs in 2 requests. Evaluation showed 7-request median collapse for typical attack sessions.

Recovery

Trust recovers at +1 per clean interaction — intentionally asymmetric. A session at zero requires 30+ clean interactions to reach ALLOW-tier trust, preventing rapid trust-reset abuse.

Cross-model component

C_cross-model activates when 3+ correlated anomalies appear across different models in the same session window — raising R_effective by up to 25 points independent of per-model scores.
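
The activation condition for C_cross-model can be sketched as a windowed count. The Python below is only an illustration of the trigger, not of how the score itself scales up to 25 points, which the text does not specify. It assumes that "correlated anomalies across different models" means at least `threshold` recent anomaly events involving more than one distinct model; the `Anomaly` type, the 300-second window, and the function name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anomaly:
    model_id: str
    timestamp: float  # seconds since an arbitrary epoch


def cross_model_active(events, now, window_s=300.0, threshold=3):
    """True once enough recent anomalies span more than one model."""
    recent = [e for e in events if 0 <= now - e.timestamp <= window_s]
    distinct_models = {e.model_id for e in recent}
    return len(recent) >= threshold and len(distinct_models) > 1


events = [Anomaly("model-a", 10.0), Anomaly("model-b", 20.0)]
print(cross_model_active(events, now=30.0))               # False
print(cross_model_active(events, now=30.0, threshold=2))  # True
```

Exposing the threshold as a parameter makes it easy to compare the documented value of 3 against a lower setting when tuning how early a multi-model campaign is flagged.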

// threat model

STRIDE Analysis (Table 3.1)

A structured threat modelling exercise maps each STRIDE category to its gateway control — ensuring no threat vector is left without a corresponding mitigation in the architecture.

STRIDE threat-to-control mapping

Threat | Gateway control | Component
Spoofing | JWT validation, session identity binding | Authentication & Authorisation
Tampering | Prompt Guard, input sanitisation rules | Prompt Guard
Repudiation | Immutable audit log with request hashes | Audit Log Service
Information Disclosure | Output Guard, PII masking, response filtering | Output Guard
Denial of Service | Rate limiting, readiness gating, circuit breaker | Readiness Service
Elevation of Privilege | RBAC enforcement, policy-engine decision binding | Policy Engine

// evaluation design

Evaluation Corpus & Scenarios

Evaluation corpus — Table 3.2 (130 prompts, 7 categories)

25

Normal Usage

Benign, legitimate AI queries with no adversarial intent.

22

Prompt Injection

Direct injection attempts targeting model instruction override.

20

Jailbreak Attempts

Structured attempts to bypass safety layers or system prompts.

18

Data Exfiltration

Requests designed to extract training data or sensitive context.

18

Roleplay Manipulation

Persona hijacking and roleplay-as-a-character abuse patterns.

15

Adversarial Edge Cases

Novel or obfuscated attack patterns not covered by known signatures.

12

Cross-Model Attacks

Multi-model exploitation patterns requiring correlation to detect.

130

Total prompts

Corpus design principles

All attack prompts manually crafted, with no synthetic generation, to ensure authenticity of attack patterns.

Normal-usage prompts drawn from real AI interaction logs to represent realistic FPR conditions.

Adversarial edge cases independently reviewed by a second researcher for category accuracy.

Cross-model attacks require multi-session context to evaluate C_cross-model activation correctly.

Each prompt assigned a ground-truth label (ALLOW / CHALLENGE / BLOCK) before system evaluation.

Eight evaluation scenarios — Table 3.3

S1

Normal User — Benign Session

ALLOW

25 benign requests from a high-trust session. Gateway allows all; trust accumulates at +1 per interaction.

S2

Direct Prompt Injection

BLOCK

Explicit instruction override payload. R_prompt spikes to 91; Standard mode BLOCK issued within 12 ms.

S3

ZT vs. No-ZT Comparison

BLOCK

Identical prompt evaluated with and without Zero Trust active. Without ZT: ALLOW. With ZT: BLOCK. 33.8pp improvement confirmed.

S4

Repeated Risky Behaviour

BLOCK

User sends escalating borderline requests. Trust decays from 60 to 0 over 7 challenged and blocked interactions; session terminated.

S5

Stale Model Posture

CHALLENGE

Model readiness state transitions to DEGRADED between requests. Policy engine detects state change; request CHALLENGED.

S6

Cross-Model Attack Campaign

BLOCK

Same user probes three models sequentially with varied prompts. Cross-model intelligence layer identifies pattern; C_cross-model raises effective risk to 74.

S7

Trust Recovery Post-Block

CHALLENGE → ALLOW

User with trust=0 resumes with clean requests. System issues CHALLENGE for first 15 interactions; trust recovers slowly to ALLOW threshold.

S8

Jailbreak via Roleplay

CHALLENGE → BLOCK

Subtle roleplay-as-character jailbreak. First attempt CHALLENGED; follow-up with increased pressure BLOCKED. A_sequence anomaly drives escalation.

// evaluation results

Quantitative Results

Results demonstrate measurable, operationally significant security improvement across both Standard and Strict policy modes against the 130-prompt evaluation corpus.

Summary results (Tables 5.2–5.7 — consolidated)

91.3%

Attack block rate — Standard mode

105 / 115 attack prompts correctly blocked

4.7%

False positive rate — Standard mode

1.2 of 25 benign prompts challenged/blocked (avg)

96.8%

Attack block rate — Strict mode

111 / 115 attack prompts correctly blocked

11.4%

False positive rate — Strict mode

2.9 of 25 benign prompts challenged/blocked (avg)

57.5%

Baseline (no Zero Trust) block rate

Static rule system; no adaptive scoring

+33.8pp

Improvement over baseline (Standard)

91.3% − 57.5% = 33.8 percentage-point gain

7 requests

Trust decay to zero (blocked session)

Median for attack sessions starting at T_trust = 60; four consecutive −15 blocks are enough on their own

≤ 47 ms

Policy decision latency (median)

Acceptable for production inference pipelines

Discussion

The 33.8 percentage-point improvement over the static-rule baseline validates the central thesis: adaptive behavioural scoring, combined with posture-based model assessment and session trust dynamics, substantially outperforms fixed threshold approaches. The trade-off between Strict mode's higher block rate (96.8%) and its elevated FPR (11.4%) reflects the usual shift along the precision-recall curve, and the gap is expected to narrow as embedding-based prompt analysis is integrated in future work. System latency (≤ 47 ms median) confirms that Zero Trust enforcement is operationally viable alongside real-time AI inference.

// failure modes & limitations

Five Known Failure Modes

Responsible disclosure of the system's documented limitations — each with a concrete mitigation pathway for future implementation.

Novel Injection Evasion

Carefully crafted obfuscations not matching known injection signatures may pass Prompt Guard, particularly when R_effective is depressed by high session trust.

Mitigation

Lower w₅ trust weight in Strict mode; supplement with embedding-similarity detectors.

Trust Reset Exploitation

An attacker who identifies the trust decay thresholds can pace attacks to avoid triggering −30 critical violations, slowly probing while staying below the BLOCK threshold.

Mitigation

Introduce non-deterministic decay jitter; add an A_sequence penalty for patterned slow probing.

Cross-Model Cascade Delay

The cross-model intelligence layer requires at least 3 correlated events before C_cross-model activates, leaving a detection window for early-stage multi-model campaigns.

Mitigation

Reduce correlation threshold to 2 events; add proactive CHALLENGE on first cross-model detection.

Policy Bypass via Edge-Case Tokens

Tokenisation edge cases — Unicode homoglyphs, invisible characters — may cause Prompt Guard to miss injections that models interpret adversarially.

Mitigation

Pre-normalise all input through Unicode NFC/NFKC before Prompt Guard; add character-class filtering.

Latency Spike Under Load

Under sustained high concurrency, the policy evaluation pipeline incurs latency above the 47 ms median, potentially degrading user experience.

Mitigation

Implement async policy evaluation with caching for recently seen prompts; add horizontal scaling for the Policy Engine.
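
The Unicode pre-normalisation mitigation listed above for edge-case tokens could be sketched with Python's standard `unicodedata` module. This is a minimal illustration, not the gateway's Prompt Guard code; the function name `normalise_prompt` is hypothetical, and dropping every format character (category Cf) is an assumption that may be too aggressive for scripts or emoji sequences that rely on joiners.

```python
import unicodedata


def normalise_prompt(text: str) -> str:
    # NFKC folds compatibility forms such as fullwidth letters to ASCII.
    folded = unicodedata.normalize("NFKC", text)
    # Drop format characters (category Cf), e.g. zero-width space or joiner.
    return "".join(ch for ch in folded if unicodedata.category(ch) != "Cf")


# Fullwidth "ignore" followed by "previous" split by a zero-width space.
disguised = "\uff49\uff47\uff4e\uff4f\uff52\uff45 pre\u200bvious"
print(normalise_prompt(disguised))
```

NFKC does not map cross-script homoglyphs (a Cyrillic "а" stays Cyrillic), so homoglyph handling would need an additional confusables mapping on top of this step.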

// key findings

Five Findings from Chapter 6

01

Zero Trust enforcement raises attack-block rate by 33.8 percentage points over the static-rule baseline — confirming that adaptive risk scoring substantially outperforms fixed thresholds.

02

Standard mode (Allow ≤ 39 / Challenge 40–69 / Block ≥ 70) achieves a practical balance: 91.3% attack detection at only 4.7% false positive cost, making it suitable for production deployment.

03

Strict mode improves detection to 96.8% but introduces 11.4% FPR — a meaningful operational trade-off that operators must calibrate against their user trust profile.

04

The trust decay mechanism (−15 per blocked request) reliably collapses attacker trust to zero within seven attempts starting from the gateway default of T_trust = 60, even without operator intervention.

05

Explainability infrastructure enables 100% audit coverage — every decision is traceable to specific risk signals and policy rules, providing the SOC visibility absent from comparable prior systems.

// recommendations for practice

Six Deployment Recommendations

01

Deploy in Standard mode initially. Migrate to Strict only after a calibration period establishing your user FPR baseline — avoid over-blocking during onboarding.

02

Enable cross-model intelligence when serving more than three models concurrently. Single-model deployments gain minimal benefit from the C_cross-model component.

03

For external or anonymous user pools, initialise T_trust at 30 (not the default 60) to reduce the window for trust-exploitation attacks before detection.

04

Review audit logs daily for the first two weeks post-deployment to identify FPR clusters — particular prompt patterns that are triggering false blocks — and tune thresholds accordingly.

05

Integrate the Audit Log Service with your SIEM via structured syslog export. The schema is designed for Splunk and Elastic ingestion without transformation.

06

Schedule posture reassessment every 72 hours on active models, not just at registration. Community model repositories can push updates that silently degrade posture between evaluations.

// future work

Seven Research Directions

Embedding-Based Prompt Analysis

Supplement rule-based Prompt Guard with semantic embedding similarity to known-bad prompt libraries, improving detection of novel obfuscated injections.

Live Threat Intelligence Feeds

Integrate external threat feeds and real-time adversarial prompt databases into the inspection pipeline for zero-day injection pattern coverage.

Deep Output Inspection

Expand Output Guard with PII entity detection (NER), sensitive data leakage classifiers, and model-specific adversarial response fingerprinting.

Red-Team Evaluation Suite

Evaluate the gateway against structured red-team AI security benchmarks — PromptBench, HarmBench — and production-representative adversarial workloads.

Multi-Provider Unified Policy

Extend gateway routing to enforce a single Zero Trust policy layer across multiple AI providers (Ollama, Hugging Face, OpenAI-compatible endpoints) under a common control plane.

Federated Enterprise Deployment

Design distributed gateway topology for multi-tenant enterprise AI access control — decentralised policy nodes with centralised audit aggregation.

Federated Learning for Threat Models

Explore privacy-preserving federated learning to share attack pattern intelligence across gateway deployments without exposing raw event data.

View the build

The active implementation is tracked on GitHub — architecture, backend, frontend, and evaluation scripts.


// references

Key References

[1] Rose, S., Borchert, O., Mitchell, S., & Connelly, S. (2020). Zero Trust Architecture. NIST SP 800-207. National Institute of Standards and Technology.

[2] Kindervag, J. (2010). Build Security Into Your Network's DNA: The Zero Trust Network Architecture. Forrester Research.

[3] Perez, F., & Ribeiro, I. (2022). Ignore Previous Prompt: Attack Techniques For Language Models. arXiv:2211.09527.

[4] Greshake, K., et al. (2023). Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. arXiv:2302.12173.

[5] Huang, Y., et al. (2023). TrustGPT: A Benchmark for Trustworthy and Responsible Large Language Models. arXiv:2306.11507.

[6] Weidinger, L., et al. (2021). Ethical and Social Risks of Harm from Language Models. DeepMind Technical Report.

[7] Dolan-Gavitt, B., et al. (2016). Architectural Support for Dynamic Vulnerability Analysis. IEEE S&P.

[8] Shostack, A. (2014). Threat Modeling: Designing for Security. Wiley. [STRIDE methodology source]

[9] Microsoft. (2023). Azure AI Content Safety Documentation. Microsoft Azure.

[10] OWASP. (2023). OWASP Top 10 for Large Language Model Applications v1.0. OWASP Foundation.

Full reference list available in the GitHub repository documentation.