Operational Model
This document describes Hexarch’s deployment patterns, runtime assumptions, upgrade model, and failure modes.
Architecture Overview
Hexarch consists of three components:
Control Plane (Hexarch Application)
- Technology: React frontend + FastAPI backend
- State: PostgreSQL (production) or SQLite (development)
- Purpose: Policy management, configuration distribution, audit logging
Data Plane (Gateway Nodes)
- Technology: Java Gateway (z-gateway)
- State: In-memory policy snapshot
- Purpose: Request routing, policy enforcement, decision logging
Guardrails API (Python SDK Backend)
- Technology: FastAPI with OPA integration
- State: SQLite or PostgreSQL
- Purpose: Policy evaluation, authorization decisions, audit chain management
Deployment Patterns
Single-Region Deployment
┌─────────────────────────────────────────────────┐
│ Region │
│ │
│ ┌──────────┐ ┌──────────┐ │
│ │ Control │ ───→ │ Gateway │ ───→ Backends │
│ │ Plane │ │ Fleet │ │
│ └──────────┘ └──────────┘ │
│ │ │
│ ↓ │
│ ┌──────────┐ │
│ │ Database │ │
│ └──────────┘ │
└─────────────────────────────────────────────────┘
- Control plane manages gateway fleet in one region
- Suitable for single-region applications
- Database is the single source of truth
Multi-Region Deployment
┌─────────────────────────┐ ┌─────────────────────────┐
│ Region A │ │ Region B │
│ │ │ │
│ ┌──────────┐ │ │ ┌──────────┐ │
│ │ Control │ ───→ ───→ ┼────┼ ───→ ───→ │ Gateway │ │
│ │ Plane │ │ │ │ Fleet │ │
│ └──────────┘ │ │ └──────────┘ │
│ │ │ │ │
│ ↓ │ │ │
│ ┌──────────┐ │ │ │
│ │ Database │ │ │ │
│ │ (Primary)│ │ │ │
│ └──────────┘ │ │ │
└─────────────────────────┘ └─────────────────────────┘
- Control plane in primary region
- Gateway fleets in multiple regions
- Configuration propagates cross-region
- Database replication for disaster recovery
Runtime Assumptions
Gateway Node Requirements
From the GatewayNode interface and ClusterManager.tsx:
interface GatewayNode {
softwareVersion: string;
metrics: {
cpu: number; // Percentage
memory: number; // Percentage (JVM heap)
tps: number; // Transactions per second
};
}
Memory: Gateway nodes hold policy snapshots in memory. Larger policy sets require more heap. Monitor JVM heap usage.
CPU: Policy evaluation is CPU-bound. Complex policies (regex matching, many conditions) increase CPU load.
Network: Nodes require connectivity to the control plane for configuration updates and audit log submission.
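As a rough illustration of these resource considerations, per-node metrics can be checked against simple limits before alerting; the nodeNeedsAttention helper and the specific cutoffs below are hypothetical, not part of the Hexarch API.
// Hypothetical helper: flag nodes whose resource usage suggests scaling or tuning.
interface NodeMetrics {
  cpu: number;    // percentage
  memory: number; // percentage (JVM heap)
  tps: number;    // transactions per second
}

function nodeNeedsAttention(metrics: NodeMetrics): string[] {
  const reasons: string[] = [];
  if (metrics.memory > 80) reasons.push('JVM heap above 80%: scale or tune');
  if (metrics.cpu > 90) reasons.push('CPU above 90%: policy evaluation may be saturating the node');
  return reasons;
}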
Control Plane Requirements
Database: PostgreSQL for production. Connection pooling is recommended for high-throughput deployments.
API Server: Stateless FastAPI application that scales horizontally behind a load balancer.
State: All state is in the database. Application instances can be replaced without data loss.
Configuration Distribution
Snapshot Lifecycle
From the ConfigSnapshot and ConfigArtifact interfaces:
Define → Validate → Bundle → Deploy → Verify
- Define: Create policies, rules, entitlements
- Validate: The artifact passes schema checks and an optional security scan
- Bundle: Create named snapshot with version map
- Deploy: Push snapshot to target clusters
- Verify: Confirm nodes applied configuration via hash comparison
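The real ConfigSnapshot and ConfigArtifact interfaces live in the codebase; the sketch below is only an assumed shape that maps the lifecycle stages above onto types, for orientation.
// Assumed shapes, for illustration only; the actual interfaces are authoritative.
type SnapshotStage = 'Define' | 'Validate' | 'Bundle' | 'Deploy' | 'Verify';

interface ConfigArtifactSketch {
  id: string;
  schemaValid: boolean;         // set during Validate
  securityScanPassed?: boolean; // optional security scan result
}

interface ConfigSnapshotSketch {
  id: string;
  name: string;                        // named snapshot
  versionMap: Record<string, string>;  // artifact id -> version
  stage: SnapshotStage;
  contentHash: string;                 // compared against node hashes during Verify
}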
Propagation Model
From the GatewayNode interface:
desiredSnapshotId: string; // What control plane wants
appliedSnapshotId: string; // What node is running
The control plane sets desiredSnapshotId. Nodes pull the configuration, apply it, and report appliedSnapshotId. A mismatch between the two indicates drift.
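A minimal drift check over these two fields, assuming a list of node records, might look like this:
interface NodeSyncState {
  id: string;
  desiredSnapshotId: string; // what the control plane wants
  appliedSnapshotId: string; // what the node is running
}

// Nodes whose applied snapshot differs from the desired one have drifted.
function driftedNodes(nodes: NodeSyncState[]): NodeSyncState[] {
  return nodes.filter(n => n.appliedSnapshotId !== n.desiredSnapshotId);
}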
Hot-Swap Updates
Gateway nodes support policy hot-swap:
- New configuration loads while current config serves traffic
- Atomic switchover when new config is ready
- No request interruption during updates
Failure mode: If hot-swap fails (e.g., incompatible filter signature), the node reports lastSyncError and remains on the previous configuration.
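Conceptually, hot-swap amounts to building and validating the new snapshot on the side and then switching a single active reference; the sketch below is an illustrative model, not the gateway's actual Java implementation.
interface PolicySnapshot {
  id: string;
  evaluate(request: unknown): 'allow' | 'deny';
}

class HotSwapEngine {
  private active: PolicySnapshot;

  constructor(initial: PolicySnapshot) {
    this.active = initial;
  }

  // In-flight requests always read whichever snapshot is currently active.
  handle(request: unknown): 'allow' | 'deny' {
    return this.active.evaluate(request);
  }

  // Validate the candidate first; only then switch the reference.
  // If validation throws (e.g., incompatible filter signature), the previous
  // snapshot stays active and the error surfaces as lastSyncError.
  swap(candidate: PolicySnapshot, validate: (s: PolicySnapshot) => void): void {
    validate(candidate);
    this.active = candidate;
  }
}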
Upgrade Model
Control Plane Upgrades
- Deploy new control plane version alongside existing
- Migrate database schema if needed
- Switch traffic to new version
- Verify functionality
- Decommission old version
Rollback: Keep the old version available. Database schema migrations should be reversible.
Gateway Upgrades
From the GatewayNode.softwareVersion field:
- Deploy new gateway version to a subset of nodes
- Verify functionality and performance
- Roll out to remaining nodes
- Monitor for drift or sync errors
Canary pattern: Update one node per cluster first. Monitor cohesion metrics before proceeding.
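A sketch of the canary selection step; clusterId is an assumed field here, since only softwareVersion is shown in the interface above.
interface UpgradeTarget {
  id: string;
  clusterId: string;       // assumed field used to group nodes by cluster
  softwareVersion: string;
}

// Pick one node per cluster to receive the new version first.
function selectCanaries(nodes: UpgradeTarget[]): UpgradeTarget[] {
  const byCluster = new Map<string, UpgradeTarget>();
  for (const node of nodes) {
    if (!byCluster.has(node.clusterId)) byCluster.set(node.clusterId, node);
  }
  return Array.from(byCluster.values());
}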
Configuration Upgrades
- Create new snapshot with updated policies
- Deploy to staging cluster
- Verify behavior
- Promote to production clusters
- Monitor decision patterns
Rollback: Previous snapshots are retained. Deploy the prior snapshot to revert.
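Because previous snapshots are retained, rolling back means selecting an earlier snapshot and deploying it again; the helper below, including its field names, is a hypothetical illustration of picking the snapshot that preceded a bad one.
interface RetainedSnapshot {
  id: string;
  createdAt: string; // ISO timestamp
}

// Hypothetical: find the most recent snapshot created before the bad one,
// so it can be redeployed as the rollback target.
function rollbackTarget(snapshots: RetainedSnapshot[], badId: string): RetainedSnapshot | undefined {
  const newestFirst = [...snapshots].sort((a, b) => b.createdAt.localeCompare(a.createdAt));
  const badIndex = newestFirst.findIndex(s => s.id === badId);
  return badIndex >= 0 ? newestFirst[badIndex + 1] : undefined;
}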
Failure Modes
Control Plane Unavailable
Impact: Gateway nodes continue operating with last-known configuration. New policy changes cannot be deployed. Audit logs queue locally (if configured).
Detection: Health checks fail. Gateway nodes show a stale lastHeartbeat.
Recovery: Restore the control plane. Nodes reconcile automatically.
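Stale heartbeats can be detected directly from lastHeartbeat; the field name matches the text above, but the 90-second threshold below is an assumption to tune per deployment.
interface HeartbeatInfo {
  id: string;
  lastHeartbeat: string; // ISO timestamp of the node's last check-in
}

// Flag nodes that have not checked in within the allowed window.
function staleNodes(nodes: HeartbeatInfo[], now: Date = new Date(), maxAgeMs = 90_000): HeartbeatInfo[] {
  return nodes.filter(n => now.getTime() - new Date(n.lastHeartbeat).getTime() > maxAgeMs);
}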
Gateway Node Failure
From the GatewayNode.status enum:
status: 'Ready' | 'Starting' | 'Error' | 'Divergent' | 'Offline'
Impact: Traffic is routed to healthy nodes (load balancer responsibility).
Detection: Node status changes to Error or Offline. Cohesion percentage drops.
Recovery: The node restarts and reconciles to the desired configuration.
Database Failure
Impact: Control plane cannot read or write state. Policy changes fail. Audit logging fails.
Detection: Health endpoint reports database: error.
Recovery: Restore database from backup. Audit chain may have gaps during outage.
Network Partition
Control plane ↔ Gateway: Nodes operate on last-known configuration. Configuration drift is possible.
Gateway ↔ Backends: The gateway cannot forward requests; client-facing impact depends on the configured timeouts.
Detection: Heartbeat gaps, cohesion drops, error rate increases.
Recovery: Restore connectivity. Force reconciliation if nodes have diverged.
Monitoring Recommendations
Key Metrics
| Metric | Source | Threshold |
|---|---|---|
| Fleet cohesion | ClusterManager | < 100% = investigate |
| Node status | GatewayNode.status | Error/Divergent = alert |
| JVM heap | GatewayNode.metrics.memory | > 80% = scale or tune |
| Audit chain verification | /audit-logs/verify | ok: false = investigate immediately |
| Decision rate | SSE stream | Anomalies indicate traffic or policy issues |
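Fleet cohesion can be read as the share of nodes running the desired snapshot; the calculation below is an assumption about how ClusterManager derives the percentage.
interface FleetNode {
  desiredSnapshotId: string;
  appliedSnapshotId: string;
}

// Percentage of nodes whose applied snapshot matches the desired snapshot.
function fleetCohesion(nodes: FleetNode[]): number {
  if (nodes.length === 0) return 100;
  const inSync = nodes.filter(n => n.appliedSnapshotId === n.desiredSnapshotId).length;
  return (inSync / nodes.length) * 100;
}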
Health Endpoints
Guardrails API: GET /health
{
"status": "ok",
"version": "1.0.0",
"database": "ok"
}
Gateway Node: exposed via JMX or a custom health endpoint, depending on the deployment.
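A minimal probe against the Guardrails API GET /health shown above; the base URL is a placeholder and the response type mirrors the example payload.
interface HealthResponse {
  status: string;
  version: string;
  database: string;
}

// Returns true only if the API responds and reports a healthy database.
async function checkGuardrailsHealth(baseUrl: string): Promise<boolean> {
  const res = await fetch(`${baseUrl}/health`);
  if (!res.ok) return false;
  const body = (await res.json()) as HealthResponse;
  return body.status === 'ok' && body.database === 'ok';
}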
Alerting
Recommended alerts:
- Cohesion < 100% for > 5 minutes
- Any node in Error/Divergent for > 2 minutes
- Audit chain verification failure
- Control plane health check failure
- Database connectivity failure
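The same alerts can be encoded as data for whichever monitoring system is in use; the rule structure and names below are hypothetical, while the conditions and durations mirror the list above.
interface AlertRule {
  name: string;
  forMinutes: number; // how long the condition must hold before firing
  condition: string;  // human-readable; wire up to the monitoring backend
}

const recommendedAlerts: AlertRule[] = [
  { name: 'fleet-cohesion-degraded', forMinutes: 5, condition: 'cohesion < 100%' },
  { name: 'node-unhealthy',          forMinutes: 2, condition: 'node status is Error or Divergent' },
  { name: 'audit-chain-broken',      forMinutes: 0, condition: '/audit-logs/verify returns ok: false' },
  { name: 'control-plane-down',      forMinutes: 0, condition: 'control plane health check fails' },
  { name: 'database-unreachable',    forMinutes: 0, condition: 'database connectivity failure' },
];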
Disaster Recovery
Backup Strategy
- Database: Regular backups per your database’s best practices
- Audit chain: Periodic exports to immutable storage (S3, GCS)
- Configuration: Snapshots are versioned; export current active snapshots
Recovery Procedures
- Control plane loss: Restore from database backup. Gateway nodes reconcile automatically.
- Gateway fleet loss: Deploy new nodes. They pull current configuration on startup.
- Database corruption: Restore from backup. Verify audit chain integrity post-restore.
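After a restore, verifying the audit chain is a single call to the /audit-logs/verify endpoint referenced above; the response is assumed to carry at least an ok field.
// Assumed response shape: the endpoint is documented above, but only the
// ok field is relied on here.
interface AuditVerifyResponse {
  ok: boolean;
  [extra: string]: unknown;
}

async function verifyAuditChainAfterRestore(baseUrl: string): Promise<void> {
  const res = await fetch(`${baseUrl}/audit-logs/verify`);
  const body = (await res.json()) as AuditVerifyResponse;
  if (!body.ok) {
    throw new Error('Audit chain verification failed after restore; investigate gaps before resuming.');
  }
}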
RTO/RPO Considerations
- RTO: Depends on deployment infrastructure; orchestrated environments such as Kubernetes recover quickly by rescheduling failed instances.
- RPO: Depends on backup frequency. Continuous replication minimizes data loss.
Further Reading
- Fleet Governance — authority vs. execution model
- Configuration Lifecycle — snapshot deployment process
- Security Model — threat model and trust boundaries