Operational Model

This document describes Hexarch’s deployment patterns, runtime assumptions, upgrade model, and failure modes.

Architecture Overview

Hexarch consists of three components:

  1. Control Plane (Hexarch Application)
  2. Data Plane (Gateway Nodes)
  3. Guardrails API (Python SDK Backend)

Deployment Patterns

Single-Region Deployment

┌─────────────────────────────────────────────────┐
│                     Region                      │
│                                                 │
│  ┌──────────┐      ┌──────────┐                 │
│  │ Control  │ ───→ │ Gateway  │ ───→ Backends   │
│  │  Plane   │      │  Fleet   │                 │
│  └──────────┘      └──────────┘                 │
│       │                                         │
│       ↓                                         │
│  ┌──────────┐                                   │
│  │ Database │                                   │
│  └──────────┘                                   │
└─────────────────────────────────────────────────┘

Multi-Region Deployment

┌─────────────────────────┐    ┌─────────────────────────┐
│        Region A         │    │        Region B         │
│                         │    │                         │
│  ┌──────────┐           │    │           ┌──────────┐  │
│  │ Control  │ ───→ ───→ ┼────┼ ───→ ───→ │ Gateway  │  │
│  │  Plane   │           │    │           │  Fleet   │  │
│  └──────────┘           │    │           └──────────┘  │
│       │                 │    │                         │
│       ↓                 │    │                         │
│  ┌──────────┐           │    │                         │
│  │ Database │           │    │                         │
│  │ (Primary)│           │    │                         │
│  └──────────┘           │    │                         │
└─────────────────────────┘    └─────────────────────────┘

Runtime Assumptions

Gateway Node Requirements

From the GatewayNode interface and ClusterManager.tsx:

interface GatewayNode {
  softwareVersion: string;
  metrics: {
    cpu: number;     // Percentage
    memory: number;  // Percentage (JVM heap)
    tps: number;     // Transactions per second
  };
}

Memory: Gateway nodes hold policy snapshots in memory. Larger policy sets require more heap. Monitor JVM heap usage.

CPU: Policy evaluation is CPU-bound. Complex policies (regex matching, many conditions) increase CPU load.

Network: Nodes require connectivity to the control plane for configuration updates and audit log submission.
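
A minimal sketch of a resource check over these metrics is shown below; the 80% heap threshold matches the monitoring table later in this document, while the CPU threshold and function name are assumptions for illustration.

// Illustrative resource check over GatewayNode.metrics.
// The 80% heap threshold follows the monitoring recommendations below;
// the CPU threshold is an assumption, not a Hexarch default.
interface NodeMetrics {
  cpu: number;     // Percentage
  memory: number;  // Percentage (JVM heap)
  tps: number;     // Transactions per second
}

function resourceWarnings(metrics: NodeMetrics): string[] {
  const warnings: string[] = [];
  if (metrics.memory > 80) warnings.push(`JVM heap at ${metrics.memory}%`);
  if (metrics.cpu > 80) warnings.push(`CPU at ${metrics.cpu}%`);
  return warnings; // Empty array: node is within thresholds
}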

Control Plane Requirements

Database: PostgreSQL for production. Connection pooling recommended for high-throughput deployments.

API Server: Stateless FastAPI application. Horizontal scaling behind a load balancer.

State: All state is in the database. Application instances can be replaced without data loss.

Configuration Distribution

Snapshot Lifecycle

From the ConfigSnapshot and ConfigArtifact interfaces:

Define → Validate → Bundle → Deploy → Verify
  1. Define: Create policies, rules, entitlements
  2. Validate: Artifact passes schema checks and optional security scan
  3. Bundle: Create named snapshot with version map
  4. Deploy: Push snapshot to target clusters
  5. Verify: Confirm nodes applied configuration via hash comparison
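
The actual ConfigSnapshot and ConfigArtifact shapes are not reproduced in this document; the sketch below is an assumed minimal shape matching the lifecycle above (validated artifacts, a named snapshot with a version map, a content hash for verification). Field names are illustrative, not the real interfaces.

// Assumed shapes, for illustration only; field names are not taken
// from the real ConfigSnapshot/ConfigArtifact interfaces.
interface ConfigArtifactSketch {
  id: string;
  version: string;
  schemaValid: boolean;          // Validate: schema checks
  securityScanPassed?: boolean;  // Validate: optional security scan
}

interface ConfigSnapshotSketch {
  id: string;
  name: string;                        // Bundle: named snapshot
  versionMap: Record<string, string>;  // Bundle: artifact id -> version
  contentHash: string;                 // Verify: hash comparison target
}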

Propagation Model

From the GatewayNode interface:

desiredSnapshotId: string;  // What control plane wants
appliedSnapshotId: string;  // What node is running

The control plane sets desiredSnapshotId. Nodes pull configuration, apply it, and report appliedSnapshotId. A mismatch between the two indicates drift.
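
A minimal drift check over these two fields might look like the following sketch; the node list is assumed to come from the control plane's view of the fleet.

// Find nodes whose applied snapshot differs from the desired one.
interface SnapshotState {
  nodeId: string;
  desiredSnapshotId: string;
  appliedSnapshotId: string;
}

function driftedNodes(nodes: SnapshotState[]): SnapshotState[] {
  return nodes.filter(n => n.appliedSnapshotId !== n.desiredSnapshotId);
}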

Hot-Swap Updates

Gateway nodes support policy hot-swap: a new configuration snapshot is applied in place, without restarting the node.

Failure mode: If a hot-swap fails (e.g., incompatible filter signature), the node reports lastSyncError and remains on its previous configuration.
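
Combining the sync error with the snapshot fields gives a per-node outcome; the sketch below assumes lastSyncError is empty or absent when a swap succeeds.

// Classify a node's outcome after a hot-swap attempt (assumed shape).
interface SyncReport {
  desiredSnapshotId: string;
  appliedSnapshotId: string;
  lastSyncError?: string;
}

type SyncOutcome = 'applied' | 'failed' | 'pending';

function syncOutcome(report: SyncReport): SyncOutcome {
  if (report.lastSyncError) return 'failed';  // Node stayed on previous configuration
  return report.appliedSnapshotId === report.desiredSnapshotId
    ? 'applied'
    : 'pending';  // Pull or apply still in progress
}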

Upgrade Model

Control Plane Upgrades

  1. Deploy new control plane version alongside existing
  2. Migrate database schema if needed
  3. Switch traffic to new version
  4. Verify functionality
  5. Decommission old version

Rollback: Keep old version available. Database schema migrations should be reversible.

Gateway Upgrades

From the GatewayNode.softwareVersion field:

  1. Deploy new gateway version to a subset of nodes
  2. Verify functionality and performance
  3. Roll out to remaining nodes
  4. Monitor for drift or sync errors

Canary pattern: Update one node per cluster first. Monitor cohesion metrics before proceeding.
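
A sketch of that canary gate follows; softwareVersion and status come from GatewayNode, while the cohesion value is assumed to be the fleet-wide percentage reported by ClusterManager.

// Decide whether to roll the new gateway version beyond the canary.
interface CanaryNode {
  softwareVersion: string;
  status: 'Ready' | 'Starting' | 'Error' | 'Divergent' | 'Offline';
}

function canaryHealthy(
  canary: CanaryNode,
  targetVersion: string,
  cohesionPercent: number,  // Fleet cohesion as reported by ClusterManager
): boolean {
  return (
    canary.status === 'Ready' &&
    canary.softwareVersion === targetVersion &&
    cohesionPercent === 100
  );
}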

Configuration Upgrades

  1. Create new snapshot with updated policies
  2. Deploy to staging cluster
  3. Verify behavior
  4. Promote to production clusters
  5. Monitor decision patterns

Rollback: Previous snapshots are retained. Deploy the prior snapshot to revert.
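
Since a rollback is just another deployment of an earlier snapshot, it can be sketched as below; deploySnapshot is a hypothetical stand-in for whatever API or console action pushes a snapshot to a cluster.

// Hypothetical rollback helper: re-deploy the snapshot that preceded
// the current one. snapshotHistory is assumed ordered oldest -> newest.
async function rollbackToPrevious(
  clusterId: string,
  snapshotHistory: string[],
  deploySnapshot: (clusterId: string, snapshotId: string) => Promise<void>,
): Promise<string> {
  if (snapshotHistory.length < 2) {
    throw new Error('No earlier snapshot retained for this cluster');
  }
  const previous = snapshotHistory[snapshotHistory.length - 2];
  await deploySnapshot(clusterId, previous);
  return previous;  // Verify the rollback via hash comparison as usual
}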

Failure Modes

Control Plane Unavailable

Impact: Gateway nodes continue operating with last-known configuration. New policy changes cannot be deployed. Audit logs queue locally (if configured).

Detection: Health checks fail. Gateway nodes show stale lastHeartbeat.

Recovery: Restore control plane. Nodes reconcile automatically.

Gateway Node Failure

From the GatewayNode.status enum:

status: 'Ready' | 'Starting' | 'Error' | 'Divergent' | 'Offline'

Impact: Traffic is routed to healthy nodes (load balancer responsibility).

Detection: Node status changes to Error or Offline. Cohesion percentage drops.

Recovery: Node restarts and reconciles to desired configuration.

Database Failure

Impact: Control plane cannot read or write state. Policy changes fail. Audit logging fails.

Detection: Health endpoint reports database: error.

Recovery: Restore database from backup. Audit chain may have gaps during outage.

Network Partition

Control plane ↔ Gateway: Nodes operate on last-known configuration. Configuration drift is possible.

Gateway ↔ Backends: The gateway cannot forward requests to affected backends; how failures surface to callers depends on timeout configuration.

Detection: Heartbeat gaps, cohesion drops, error rate increases.

Recovery: Restore connectivity. Force reconciliation if nodes diverged.

Monitoring Recommendations

Key Metrics

Metric                   | Source                      | Threshold
-------------------------|-----------------------------|---------------------------------------------
Fleet cohesion           | ClusterManager              | < 100% = investigate
Node status              | GatewayNode.status          | Error/Divergent = alert
JVM heap                 | GatewayNode.metrics.memory  | > 80% = scale or tune
Audit chain verification | /audit-logs/verify          | ok: false = investigate immediately
Decision rate            | SSE stream                  | Anomalies indicate traffic or policy issues

Health Endpoints

Guardrails API: GET /health

{
  "status": "ok",
  "version": "1.0.0",
  "database": "ok"
}
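
A simple probe against this endpoint could look like the sketch below; the base URL is a placeholder and the response shape mirrors the example above.

// Poll GET /health and treat anything other than "ok" for status and
// database as unhealthy. Base URL is a placeholder for the deployment.
interface HealthResponse {
  status: string;
  version: string;
  database: string;
}

async function isHealthy(baseUrl: string): Promise<boolean> {
  try {
    const res = await fetch(`${baseUrl}/health`);
    if (!res.ok) return false;
    const body = (await res.json()) as HealthResponse;
    return body.status === 'ok' && body.database === 'ok';
  } catch {
    return false;  // Network failure counts as a failed health check
  }
}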

Gateway Node: (JMX or custom endpoint per deployment)

Alerting

Recommended alerts:

  1. Cohesion < 100% for > 5 minutes
  2. Any node in Error/Divergent for > 2 minutes
  3. Audit chain verification failure
  4. Control plane health check failure
  5. Database connectivity failure
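
Alerts 1 and 2 are duration-based; a minimal evaluator for that pattern is sketched below. The five-minute window comes from the list above; the in-memory state tracking is illustrative, and most deployments would express these rules in their monitoring system instead.

// Fire once a condition has been continuously true for a full window.
// Illustrative only; real alerting normally lives in the monitoring stack.
class DurationAlert {
  private since: number | null = null;

  constructor(private readonly windowMs: number) {}

  // Call on each evaluation tick with the current condition value.
  check(conditionTrue: boolean, now: number = Date.now()): boolean {
    if (!conditionTrue) {
      this.since = null;
      return false;
    }
    if (this.since === null) this.since = now;
    return now - this.since >= this.windowMs;
  }
}

// Example: cohesion < 100% for more than 5 minutes (alert 1).
const cohesionAlert = new DurationAlert(5 * 60 * 1000);
// On each tick: if (cohesionAlert.check(currentCohesion < 100)) { notify(...) }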

Disaster Recovery

Backup Strategy

Recovery Procedures

  1. Control plane loss: Restore from database backup. Gateway nodes reconcile automatically.
  2. Gateway fleet loss: Deploy new nodes. They pull current configuration on startup.
  3. Database corruption: Restore from backup. Verify audit chain integrity post-restore.

RTO/RPO Considerations

Further Reading