High Availability Deployment Model

Objective

Define how Cyprob EE maintains service continuity under component failures and growth in scan workload.

Deployment Modes (Current Model)

Cyprob EE supports two operational modes in production packaging:

Standalone: API + embedded workers in a single runtime path
HA-Lite: API/control-plane separated from dedicated worker containers

Choose the mode based on environment and capacity constraints. Operationally, HA-Lite is the preferred mode when higher throughput and failure isolation are required.

Recommended Topology (HA-Lite)

One or more API/control-plane instances
PostgreSQL as primary datastore (with customer-preferred resilience pattern)
Multiple stateless worker containers
Reverse proxy/ingress in front of API
Optional watcher/governor process for worker scaling control
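
The scaling decision an optional watcher/governor makes can be sketched as follows. This is a minimal illustration, not Cyprob EE's actual governor: the function name, the jobs_per_worker parameter, and the replica bounds are all hypothetical.

```python
import math


def desired_workers(queue_depth: int, jobs_per_worker: int,
                    min_workers: int = 1, max_workers: int = 8) -> int:
    """Return a worker replica count for the current queue depth.

    All parameters are illustrative assumptions, not product settings.
    """
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    # Enough workers to cover the backlog, rounded up.
    needed = math.ceil(queue_depth / jobs_per_worker)
    # Clamp so the governor never scales to zero or beyond the host budget.
    return max(min_workers, min(max_workers, needed))
```

Clamping to a minimum and maximum keeps a burst in queue depth from exhausting host resources, and keeps at least one worker running when the queue is empty.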

Why HA-Lite Matters

Isolates scan execution from API responsiveness
Scales worker capacity without redesigning control-plane
Limits the blast radius of worker crashes and restarts

Failure Behavior Expectations

Worker Failure

Expected behavior:

In-flight work owned by the failed worker is retried or reassigned according to the queue and recovery logic.
Control-plane and UI/API should remain available.

Operational action:

Replace/restart failed worker.
Verify queue drain and scan progression.
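
One common way to implement this kind of recovery is heartbeat-based requeueing, sketched below. It assumes workers report heartbeats and that jobs carry an attempt counter; the Job fields, the timeout, and the attempt limit are hypothetical and may not match the product's queue logic.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    job_id: str
    status: str                      # "queued", "running", "done", or "failed"
    worker_id: Optional[str] = None
    attempts: int = 0


def requeue_orphaned(jobs: List[Job], heartbeats: Dict[str, float], now: float,
                     timeout_s: float = 60.0, max_attempts: int = 3) -> List[str]:
    """Requeue running jobs whose worker heartbeat is missing or stale.

    Returns the IDs of the jobs that were put back on the queue.
    """
    requeued = []
    for job in jobs:
        if job.status != "running":
            continue
        last_seen = heartbeats.get(job.worker_id)
        if last_seen is not None and now - last_seen <= timeout_s:
            continue  # the owning worker is still alive
        if job.attempts >= max_attempts:
            job.status = "failed"    # stop retrying to keep retries bounded
        else:
            job.status = "queued"
            job.attempts += 1
            requeued.append(job.job_id)
        job.worker_id = None
    return requeued
```

The attempt limit matters: without it, a job that reliably crashes its worker would be retried indefinitely.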

API/Control-Plane Restart

Expected behavior:

A temporary API interruption is possible.
Background work continuity depends on queue/state persistence and worker availability.

Operational action:

Restore API service.
Re-check scan states and health endpoints.
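
The health re-check can be scripted. The sketch below probes the /health and /healthz paths named in this document using only the Python standard library; the base URL, the timeout, and the assumption that a healthy API answers with HTTP 200 are illustrative.

```python
import urllib.error
import urllib.request
from typing import Sequence


def api_is_healthy(base_url: str,
                   paths: Sequence[str] = ("/health", "/healthz"),
                   timeout_s: float = 3.0) -> bool:
    """Return True if any candidate health path answers with HTTP 200."""
    for path in paths:
        url = base_url.rstrip("/") + path
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or a non-2xx response: try the next path.
            continue
    return False
```

A call such as api_is_healthy("http://localhost:8080") would be repeated until it returns True before scan states are re-checked; the host and port here are placeholders.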

Database Unavailability

Expected behavior:

Core platform operations degrade or pause.
New scan creation and state transitions are impacted.

Operational action:

Restore DB availability first.
Validate connection pool and migration status.
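
Services that depend on the database typically wait for it with bounded retries rather than failing immediately. A minimal sketch of exponential backoff follows; the probe callable is a hypothetical stand-in for a real connectivity check (for example, running SELECT 1 through the connection pool), and the delays are illustrative.

```python
import time
from typing import Callable


def wait_for_database(probe: Callable[[], bool], attempts: int = 5,
                      base_delay_s: float = 1.0, max_delay_s: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe() until it succeeds, doubling the wait between attempts."""
    delay = base_delay_s
    for attempt in range(attempts):
        try:
            if probe():
                return True
        except Exception:
            pass  # treat any probe error as "database not ready yet"
        if attempt < attempts - 1:
            sleep(delay)
            delay = min(delay * 2, max_delay_s)  # cap the backoff
    return False
```

Passing sleep as a parameter keeps the function testable without real waiting.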

Minimum Sizing Guidance (Starting Point)

Small pilot: 1 API + embedded worker path, 4-8 GB RAM class
Production baseline: 1 API + dedicated workers, 8+ GB RAM class
Scale strategy: add worker replicas first, then tune API/DB resources

Final sizing must be validated against the expected target count, scan policy depth, and expected concurrency.
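
As a rough starting estimate before validation, worker count can be derived from target count, average scan duration, and the scan window. The arithmetic below is a back-of-the-envelope sketch; the parameter names and the example figures are assumptions, not measured Cyprob EE values.

```python
import math


def estimate_workers(targets: int, avg_scan_minutes: float,
                     window_minutes: float, scans_per_worker: int) -> int:
    """Estimate the workers needed to scan all targets within the window."""
    if window_minutes <= 0 or scans_per_worker <= 0:
        raise ValueError("window_minutes and scans_per_worker must be positive")
    # Total scan time divided by the window gives the required parallelism.
    concurrent_scans = math.ceil(targets * avg_scan_minutes / window_minutes)
    return max(1, math.ceil(concurrent_scans / scans_per_worker))
```

For example, 500 targets at 12 minutes each within a 480-minute window need 13 concurrent scans, which is 4 workers if each worker runs 4 scans at a time. Deeper scan policies raise the average duration and therefore the estimate.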

Operational Health Checks

API health endpoint returns healthy (/health or deployment-specific /healthz)
Worker heartbeat/current status visible in operations telemetry
Queue depth remains within expected range during peak windows
Failed/retried job rates are monitored and bounded
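
The four checks above can be combined into a single evaluation that feeds alerting. The sketch below assumes the metrics are already collected into a snapshot; the field names and thresholds are hypothetical and should be replaced with values observed in the target deployment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    api_healthy: bool
    stale_workers: int            # workers with a missing or old heartbeat
    queue_depth: int
    jobs_total: int
    jobs_failed_or_retried: int


def degraded_reasons(s: Snapshot, max_queue_depth: int = 100,
                     max_failure_rate: float = 0.05) -> List[str]:
    """Return the reasons the deployment looks degraded (empty if healthy)."""
    reasons = []
    if not s.api_healthy:
        reasons.append("API health endpoint is not healthy")
    if s.stale_workers > 0:
        reasons.append(f"{s.stale_workers} worker(s) missing heartbeats")
    if s.queue_depth > max_queue_depth:
        reasons.append(f"queue depth {s.queue_depth} exceeds {max_queue_depth}")
    rate = s.jobs_failed_or_retried / s.jobs_total if s.jobs_total else 0.0
    if rate > max_failure_rate:
        reasons.append(f"failed/retried rate {rate:.1%} exceeds {max_failure_rate:.1%}")
    return reasons
```

Returning a list of reasons rather than a single boolean makes alerts actionable, since the operator sees which check failed.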

HA Validation Checklist

Simulate one worker crash during active scans and verify recovery.
Restart the API/control-plane during a non-critical window and verify continuity.
Confirm scan queue drains after transient failures.
Confirm alerting/observability catches degraded states.

Limitations

HA-Lite is not equivalent to a full multi-region active-active architecture.
Resilience posture depends on database and infrastructure hardening choices.

Next Action

Continue with the FAQ for common technical and commercial objections during evaluation.
