High Availability Deployment Model
Objective
Define how Cyprob EE maintains service continuity under component failures and growth in scan workload.
Deployment Modes (Current Model)
Cyprob EE supports two operational modes in production packaging:
- Standalone: API + embedded workers in a single runtime path
- HA-Lite: API/control-plane separated from dedicated worker containers
Mode selection can be driven by environment and capacity constraints; operationally, HA-Lite is the preferred path for higher throughput and failure isolation.
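As an illustration of environment-driven mode selection, the sketch below reads the mode from an environment variable and rejects unknown values. The variable name CYPROB_DEPLOY_MODE and the mode strings "standalone" and "ha-lite" are assumptions made for this example, not documented Cyprob EE settings.

```python
import os

# Hypothetical mode identifiers; the real packaging may name them differently.
VALID_MODES = ("standalone", "ha-lite")


def select_mode(env=None, default="standalone"):
    """Return the deployment mode named by the (assumed) CYPROB_DEPLOY_MODE variable."""
    env = os.environ if env is None else env
    raw = env.get("CYPROB_DEPLOY_MODE", default).strip().lower()
    if raw not in VALID_MODES:
        # Fail fast instead of silently falling back to a mode nobody asked for.
        raise ValueError(f"unknown deployment mode: {raw!r}")
    return raw
```

Failing on an unrecognized value keeps a typo in deployment configuration from quietly starting the wrong topology.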
Recommended Topology (HA-Lite)
- 1+ API/control-plane instance(s)
- PostgreSQL as primary datastore (with customer-preferred resilience pattern)
- Multiple stateless worker containers
- Reverse proxy/ingress in front of API
- Optional watcher/governor process for worker scaling control
Why HA-Lite Matters
- Isolates scan execution from API responsiveness
- Scales worker capacity without redesigning control-plane
- Limits the blast radius when workers crash or restart
Failure Behavior Expectations
Worker Failure
Expected behavior:
- In-flight work from the failed worker is retried or reassigned according to the queue/recovery logic.
- Control-plane and UI/API should remain available.
Operational action:
- Replace/restart failed worker.
- Verify queue drain and scan progression.
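One common way to implement retry/reassignment is heartbeat-based orphan detection. The sketch below assumes that approach; it is not a description of Cyprob EE's actual recovery logic. The Job fields, the 30-second timeout, and the attempt limit are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    job_id: str
    worker_id: Optional[str] = None  # None while the job waits in the queue
    attempts: int = 0                # how many times the job has been started
    state: str = "queued"            # queued | running | failed


def requeue_orphaned(jobs: List[Job], heartbeats: Dict[str, float], now: float,
                     timeout_s: float = 30.0, max_attempts: int = 3) -> List[str]:
    """Requeue running jobs whose worker has no heartbeat within timeout_s.

    Returns the ids of requeued jobs. Jobs that already used max_attempts
    are marked failed instead of being retried again.
    """
    requeued = []
    for job in jobs:
        if job.state != "running":
            continue
        last_seen = heartbeats.get(job.worker_id)
        if last_seen is not None and now - last_seen <= timeout_s:
            continue  # worker is still alive; leave the job with it
        job.worker_id = None
        if job.attempts >= max_attempts:
            job.state = "failed"
        else:
            job.state = "queued"
            requeued.append(job.job_id)
    return requeued
```

Capping attempts keeps a job that repeatedly crashes its worker from cycling through the queue indefinitely.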
API/Control-Plane Restart
Expected behavior:
- Temporary API interruption possible.
- Background work continuity depends on queue/state persistence and worker availability.
Operational action:
- Restore API service.
- Re-check scan states and health endpoints.
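The health re-check after an API restart can be scripted as a bounded polling loop. This is a minimal sketch using only the Python standard library; the URL passed to http_probe, the attempt count, and the delay are assumptions to adapt to the deployment.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout_s=2.0):
    """Return True if a GET to url answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response.
        return False


def wait_until_healthy(probe, attempts=10, delay_s=1.0, sleep=time.sleep):
    """Call probe() until it returns True.

    Returns the 1-based attempt number that succeeded, or None if every
    attempt failed.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay_s)
    return None
```

A call such as wait_until_healthy(lambda: http_probe("http://localhost:8080/health")) would poll a local instance; the host and port here are placeholders.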
Database Unavailability
Expected behavior:
- Core platform operations degrade or pause.
- New scan creation and state transitions are impacted.
Operational action:
- Restore DB availability first.
- Validate connection pool and migration status.
Minimum Sizing Guidance (Starting Point)
- Small pilot: 1 API + embedded worker path, 4-8 GB RAM class
- Production baseline: 1 API + dedicated workers, 8+ GB RAM class
- Scale strategy: add worker replicas first, then tune API/DB resources
Final sizing must be validated against target count, scan policy depth, and expected concurrency.
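For a first estimate of worker replicas, a back-of-the-envelope calculation can help before load testing. The formula below is an assumption for illustration only: it divides expected concurrent scans by a per-worker capacity that must be measured, then adds spare replicas so that one worker failure does not reduce capacity below demand.

```python
import math


def estimate_worker_replicas(expected_concurrent_scans, scans_per_worker,
                             spare_replicas=1):
    """Rough starting point for worker count; validate with real workloads.

    scans_per_worker is a measured value for the chosen scan policy depth,
    not a product constant.
    """
    if scans_per_worker <= 0:
        raise ValueError("scans_per_worker must be positive")
    if expected_concurrent_scans < 0:
        raise ValueError("expected_concurrent_scans must not be negative")
    needed = math.ceil(expected_concurrent_scans / scans_per_worker)
    # Always keep at least one worker, plus headroom for a failed replica.
    return max(needed, 1) + spare_replicas
```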
Operational Health Checks
- API health endpoint returns healthy (/health or deployment-specific /healthz)
- Worker heartbeat/current status visible in operations telemetry
- Queue depth remains within expected range during peak windows
- Failed/retried job rates are monitored and bounded
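The checks above can be combined into a single evaluation that reports why a deployment counts as degraded. In this sketch the snapshot fields and every threshold (60 seconds, 500 jobs, 5% failure ratio) are hypothetical; actual values depend on the telemetry available and on the expected peak load.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HealthSnapshot:
    api_healthy: bool               # result of the API health endpoint probe
    oldest_heartbeat_age_s: float   # age of the stalest worker heartbeat
    queue_depth: int
    jobs_total: int                 # jobs finished in the sampling window
    jobs_failed_or_retried: int


def degraded_reasons(snap: HealthSnapshot,
                     max_heartbeat_age_s: float = 60.0,
                     max_queue_depth: int = 500,
                     max_failure_ratio: float = 0.05) -> List[str]:
    """Return the list of violated checks; an empty list means healthy."""
    reasons = []
    if not snap.api_healthy:
        reasons.append("api_unhealthy")
    if snap.oldest_heartbeat_age_s > max_heartbeat_age_s:
        reasons.append("worker_heartbeat_stale")
    if snap.queue_depth > max_queue_depth:
        reasons.append("queue_depth_high")
    if (snap.jobs_total > 0
            and snap.jobs_failed_or_retried / snap.jobs_total > max_failure_ratio):
        reasons.append("failure_ratio_high")
    return reasons
```

Returning named reasons rather than a single boolean makes it easier to route each condition to a different alert.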
HA Validation Checklist
- Simulate one worker crash during active scans and verify recovery.
- Restart API/control-plane during a non-critical window and verify continuity.
- Confirm scan queue drains after transient failures.
- Confirm alerting/observability catches degraded states.
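To make the queue-drain item verifiable, one option is to sample queue depth at a fixed interval after the induced failure and require several consecutive readings at or below a baseline. How the samples are collected is deployment-specific; the baseline and the number of settle samples below are assumptions.

```python
def queue_drained(depth_samples, baseline=0, settle_samples=3):
    """True if the last settle_samples readings are all at or below baseline.

    depth_samples is a chronological list of queue depth readings taken
    after the simulated failure.
    """
    if settle_samples <= 0:
        raise ValueError("settle_samples must be positive")
    if len(depth_samples) < settle_samples:
        return False  # not enough evidence yet
    return all(depth <= baseline for depth in depth_samples[-settle_samples:])
```

Requiring several consecutive readings avoids declaring success on a single momentary dip while retried jobs are still being re-enqueued.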
Limitations
- HA-Lite is not equivalent to full multi-region active-active architecture.
- Resilience posture depends on database and infrastructure hardening choices.
Next Action
Continue with FAQ for common technical and commercial objections during evaluation.