Operational Model

This document describes Hexarch’s deployment patterns, runtime assumptions, upgrade model, and failure modes.

Architecture Overview

Hexarch consists of three components:

  1. Control Plane (Hexarch Application)
  2. Data Plane (Gateway Nodes)
  3. Guardrails API (Python SDK Backend)

Deployment Patterns

Single-Region Deployment

┌─────────────────────────────────────────────────┐
│                     Region                      │
│                                                 │
│  ┌──────────┐      ┌──────────┐                 │
│  │ Control  │ ───→ │ Gateway  │ ───→ Backends   │
│  │  Plane   │      │  Fleet   │                 │
│  └──────────┘      └──────────┘                 │
│       │                                         │
│       ↓                                         │
│  ┌──────────┐                                   │
│  │ Database │                                   │
│  └──────────┘                                   │
└─────────────────────────────────────────────────┘

Multi-Region Deployment

┌─────────────────────────┐    ┌─────────────────────────┐
│        Region A         │    │        Region B         │
│                         │    │                         │
│  ┌──────────┐           │    │           ┌──────────┐  │
│  │ Control  │ ───→ ───→ ┼────┼ ───→ ───→ │ Gateway  │  │
│  │  Plane   │           │    │           │  Fleet   │  │
│  └──────────┘           │    │           └──────────┘  │
│       │                 │    │                         │
│       ↓                 │    │                         │
│  ┌──────────┐           │    │                         │
│  │ Database │           │    │                         │
│  │ (Primary)│           │    │                         │
│  └──────────┘           │    │                         │
└─────────────────────────┘    └─────────────────────────┘

Runtime Assumptions

Gateway Node Requirements

From the GatewayNode interface and ClusterManager.tsx:

interface GatewayNode {
  softwareVersion: string;
  metrics: {
    cpu: number;     // Percentage
    memory: number;  // Percentage (JVM heap)
    tps: number;     // Transactions per second
  };
}

Memory: Gateway nodes hold policy snapshots in memory. Larger policy sets require more heap. Monitor JVM heap usage.

CPU: Policy evaluation is CPU-bound. Complex policies (regex matching, many conditions) increase CPU load.

Network: Nodes require connectivity to the control plane for configuration updates and audit log submission.
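
A minimal sketch of a resource check over these metrics is shown below; the 80% heap threshold matches the monitoring table later in this document, while the CPU threshold and function name are assumptions for illustration.

// Illustrative resource check over GatewayNode.metrics.
// The 80% heap threshold follows the monitoring recommendations below;
// the CPU threshold is an assumption, not a Hexarch default.
interface NodeMetrics {
  cpu: number;     // Percentage
  memory: number;  // Percentage (JVM heap)
  tps: number;     // Transactions per second
}

function resourceWarnings(metrics: NodeMetrics): string[] {
  const warnings: string[] = [];
  if (metrics.memory > 80) warnings.push(`JVM heap at ${metrics.memory}%`);
  if (metrics.cpu > 80) warnings.push(`CPU at ${metrics.cpu}%`);
  return warnings; // Empty array: node is within thresholds
}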

Control Plane Requirements

Database: PostgreSQL for production. Connection pooling recommended for high-throughput deployments.

API Server: Stateless FastAPI application. Horizontal scaling behind a load balancer.

State: All state is in the database. Application instances can be replaced without data loss.

Configuration Distribution

Snapshot Lifecycle

From the ConfigSnapshot and ConfigArtifact interfaces:

Define → Validate → Bundle → Deploy → Verify
  1. Define: Create policies, rules, entitlements
  2. Validate: Artifact passes schema checks and optional security scan
  3. Bundle: Create named snapshot with version map
  4. Deploy: Push snapshot to target clusters
  5. Verify: Confirm nodes applied configuration via hash comparison
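
The actual ConfigSnapshot and ConfigArtifact shapes are not reproduced in this document; the sketch below is an assumed minimal shape matching the lifecycle above (validated artifacts, a named snapshot with a version map, a content hash for verification). Field names are illustrative, not the real interfaces.

// Assumed shapes, for illustration only; field names are not taken
// from the real ConfigSnapshot/ConfigArtifact interfaces.
interface ConfigArtifactSketch {
  id: string;
  version: string;
  schemaValid: boolean;          // Validate: schema checks
  securityScanPassed?: boolean;  // Validate: optional security scan
}

interface ConfigSnapshotSketch {
  id: string;
  name: string;                        // Bundle: named snapshot
  versionMap: Record<string, string>;  // Bundle: artifact id -> version
  contentHash: string;                 // Verify: hash comparison target
}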

Propagation Model

From the GatewayNode interface:

desiredSnapshotId: string;  // What control plane wants
appliedSnapshotId: string;  // What node is running

The control plane sets desiredSnapshotId. Nodes pull configuration, apply it, and report appliedSnapshotId. A mismatch between the two indicates drift.
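
A minimal drift check over these two fields might look like the following sketch; the node list is assumed to come from the control plane's view of the fleet.

// Find nodes whose applied snapshot differs from the desired one.
interface SnapshotState {
  nodeId: string;
  desiredSnapshotId: string;
  appliedSnapshotId: string;
}

function driftedNodes(nodes: SnapshotState[]): SnapshotState[] {
  return nodes.filter(n => n.appliedSnapshotId !== n.desiredSnapshotId);
}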

Hot-Swap Updates

Gateway nodes support policy hot-swap: a new configuration snapshot is applied in place, without restarting the node.

Failure mode: If a hot-swap fails (e.g., incompatible filter signature), the node reports lastSyncError and remains on its previous configuration.
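
Combining the sync error with the snapshot fields gives a per-node outcome; the sketch below assumes lastSyncError is empty or absent when a swap succeeds.

// Classify a node's outcome after a hot-swap attempt (assumed shape).
interface SyncReport {
  desiredSnapshotId: string;
  appliedSnapshotId: string;
  lastSyncError?: string;
}

type SyncOutcome = 'applied' | 'failed' | 'pending';

function syncOutcome(report: SyncReport): SyncOutcome {
  if (report.lastSyncError) return 'failed';  // Node stayed on previous configuration
  return report.appliedSnapshotId === report.desiredSnapshotId
    ? 'applied'
    : 'pending';  // Pull or apply still in progress
}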

Upgrade Model

Control Plane Upgrades

  1. Deploy new control plane version alongside existing
  2. Migrate database schema if needed
  3. Switch traffic to new version
  4. Verify functionality
  5. Decommission old version

Rollback: Keep old version available. Database schema migrations should be reversible.

Gateway Upgrades

From the GatewayNode.softwareVersion field:

  1. Deploy new gateway version to a subset of nodes
  2. Verify functionality and performance
  3. Roll out to remaining nodes
  4. Monitor for drift or sync errors

Canary pattern: Update one node per cluster first. Monitor cohesion metrics before proceeding.
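
A sketch of that canary gate follows; softwareVersion and status come from GatewayNode, while the cohesion value is assumed to be the fleet-wide percentage reported by ClusterManager.

// Decide whether to roll the new gateway version beyond the canary.
interface CanaryNode {
  softwareVersion: string;
  status: 'Ready' | 'Starting' | 'Error' | 'Divergent' | 'Offline';
}

function canaryHealthy(
  canary: CanaryNode,
  targetVersion: string,
  cohesionPercent: number,  // Fleet cohesion as reported by ClusterManager
): boolean {
  return (
    canary.status === 'Ready' &&
    canary.softwareVersion === targetVersion &&
    cohesionPercent === 100
  );
}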

Configuration Upgrades

  1. Create new snapshot with updated policies
  2. Deploy to staging cluster
  3. Verify behavior
  4. Promote to production clusters
  5. Monitor decision patterns

Rollback: Previous snapshots are retained. Deploy the prior snapshot to revert.
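
Since a rollback is just another deployment of an earlier snapshot, it can be sketched as below; deploySnapshot is a hypothetical stand-in for whatever API or console action pushes a snapshot to a cluster.

// Hypothetical rollback helper: re-deploy the snapshot that preceded
// the current one. snapshotHistory is assumed ordered oldest -> newest.
async function rollbackToPrevious(
  clusterId: string,
  snapshotHistory: string[],
  deploySnapshot: (clusterId: string, snapshotId: string) => Promise<void>,
): Promise<string> {
  if (snapshotHistory.length < 2) {
    throw new Error('No earlier snapshot retained for this cluster');
  }
  const previous = snapshotHistory[snapshotHistory.length - 2];
  await deploySnapshot(clusterId, previous);
  return previous;  // Verify the rollback via hash comparison as usual
}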

Failure Modes

Control Plane Unavailable

Impact: Gateway nodes continue operating with last-known configuration. New policy changes cannot be deployed. Audit logs queue locally (if configured).

Detection: Health checks fail. Gateway nodes show stale lastHeartbeat.

Recovery: Restore control plane. Nodes reconcile automatically.

Gateway Node Failure

From the GatewayNode.status enum:

status: 'Ready' | 'Starting' | 'Error' | 'Divergent' | 'Offline'

Impact: Traffic is routed to healthy nodes (load balancer responsibility).

Detection: Node status changes to Error or Offline. Cohesion percentage drops.

Recovery: Node restarts and reconciles to desired configuration.

Database Failure

Impact: Control plane cannot read or write state. Policy changes fail. Audit logging fails.

Detection: Health endpoint reports database: error.

Recovery: Restore database from backup. Audit chain may have gaps during outage.

Network Partition

Control plane ↔ Gateway: Nodes operate on last-known configuration. Configuration drift is possible.

Gateway ↔ Backends: The gateway cannot forward requests to affected backends; how failures surface to callers depends on timeout configuration.

Detection: Heartbeat gaps, cohesion drops, error rate increases.

Recovery: Restore connectivity. Force reconciliation if nodes diverged.

Monitoring Recommendations

Key Metrics

Metric                   | Source                      | Threshold
-------------------------|-----------------------------|---------------------------------------------
Fleet cohesion           | ClusterManager              | < 100% = investigate
Node status              | GatewayNode.status          | Error/Divergent = alert
JVM heap                 | GatewayNode.metrics.memory  | > 80% = scale or tune
Audit chain verification | /audit-logs/verify          | ok: false = investigate immediately
Decision rate            | SSE stream                  | Anomalies indicate traffic or policy issues

Health Endpoints

Guardrails API: GET /health

{
  "status": "ok",
  "version": "1.0.0",
  "database": "ok"
}
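
A simple probe against this endpoint could look like the sketch below; the base URL is a placeholder and the response shape mirrors the example above.

// Poll GET /health and treat anything other than "ok" for status and
// database as unhealthy. Base URL is a placeholder for the deployment.
interface HealthResponse {
  status: string;
  version: string;
  database: string;
}

async function isHealthy(baseUrl: string): Promise<boolean> {
  try {
    const res = await fetch(`${baseUrl}/health`);
    if (!res.ok) return false;
    const body = (await res.json()) as HealthResponse;
    return body.status === 'ok' && body.database === 'ok';
  } catch {
    return false;  // Network failure counts as a failed health check
  }
}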

Gateway Node: (JMX or custom endpoint per deployment)

Alerting

Recommended alerts:

  1. Cohesion < 100% for > 5 minutes
  2. Any node in Error/Divergent for > 2 minutes
  3. Audit chain verification failure
  4. Control plane health check failure
  5. Database connectivity failure
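
Alerts 1 and 2 are duration-based; a minimal evaluator for that pattern is sketched below. The five-minute window comes from the list above; the in-memory state tracking is illustrative, and most deployments would express these rules in their monitoring system instead.

// Fire once a condition has been continuously true for a full window.
// Illustrative only; real alerting normally lives in the monitoring stack.
class DurationAlert {
  private since: number | null = null;

  constructor(private readonly windowMs: number) {}

  // Call on each evaluation tick with the current condition value.
  check(conditionTrue: boolean, now: number = Date.now()): boolean {
    if (!conditionTrue) {
      this.since = null;
      return false;
    }
    if (this.since === null) this.since = now;
    return now - this.since >= this.windowMs;
  }
}

// Example: cohesion < 100% for more than 5 minutes (alert 1).
const cohesionAlert = new DurationAlert(5 * 60 * 1000);
// On each tick: if (cohesionAlert.check(currentCohesion < 100)) { notify(...) }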

Disaster Recovery

Backup Strategy

Recovery Procedures

  1. Control plane loss: Restore from database backup. Gateway nodes reconcile automatically.
  2. Gateway fleet loss: Deploy new nodes. They pull current configuration on startup.
  3. Database corruption: Restore from backup. Verify audit chain integrity post-restore.

RTO/RPO Considerations

Further Reading