Incident runbook
What to check, in order, when something breaks. Each section is intentionally short: print it out and keep it where the on-call engineer can find it at 02:00. Severities below match what you’d use in a Jira / Linear ticket.
0. Before anything — confirm the failure
- Reproduce from a second device / browser / IP. ~10% of reported “outages” are a single laptop’s DNS or VPN.
- Check status.cloudnx.in for platform-side incidents.
- Check your audit logs for the last action — many incidents start with “I changed X then it stopped working”.
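When reproducing from a second machine, it helps to know which kind of failure you are seeing. The sketch below is one possible way to classify a TCP connection attempt. The function name `probe_tcp` and the result labels are illustrative, not part of any CloudNx tooling.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify one TCP connection attempt.

    Returns 'open', 'refused', 'timeout', 'dns' or 'error'.
    The labels are illustrative; they are not an official taxonomy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"          # something is listening on host:port
    except socket.gaierror:
        return "dns"               # name did not resolve from this machine
    except ConnectionRefusedError:
        return "refused"           # host answered, nothing listening on the port
    except (socket.timeout, TimeoutError):
        return "timeout"           # no answer within `timeout` seconds
    except OSError:
        return "error"             # anything else (no route, network down, ...)
```

Run it from two different networks against the same host and port. If one machine reports `dns` or `timeout` and the other reports `open`, the problem is likely local to the first machine rather than a platform outage.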
P1 — VM unreachable / production down
Symptom: users report 502 / connection refused / timeout.
- Is the VM running? Portal → Instances. Status should be `running`. If `stopped` or `error`, click Start. If `provisioning` > 5 min, ping support.
- Browser SSH from the portal (instance detail → Open SSH). If the browser SSH works, the VM is alive and the problem is application-level. Skip to the next section.
- Disk full? `df -h`. If `/` is ≥ 95%, the OS won’t accept writes: truncate / rotate logs, delete `/var/cache/apt/archives/*.deb`, look at `journalctl --vacuum-size=500M`.
- Service crashed? `sudo systemctl status <your-service>`. `sudo journalctl -u <your-service> --since "30 minutes ago"` tells you why. Restart with `sudo systemctl restart <your-service>`.
- Roll back to a snapshot if a recent deploy broke things. Instance detail → Snapshots → Restore. Takes ~30 seconds. Data written since the snapshot is lost, so copy `/var/log` off first.
- Escalate to [email protected] with subject prefix `[P1]` and your account UUID + instance ID. Postpaid customers: use the on-call phone number on your contract.
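The disk-full check can also be scripted, for example as a cron job that warns before the 95% mark is reached. This is a minimal sketch using only the Python standard library. The helper names are hypothetical, and the 0.95 default simply mirrors the threshold quoted in the checklist.

```python
import shutil


def disk_used_fraction(path: str = "/") -> float:
    """Fraction of the filesystem holding `path` that is in use.

    Computed as used / (used + available), which is how `df` derives
    its Use% column (df rounds the result up to a whole percent).
    """
    usage = shutil.disk_usage(path)
    return usage.used / (usage.used + usage.free)


def is_disk_critical(used_fraction: float, threshold: float = 0.95) -> bool:
    """True when usage is at or above the threshold (95% by default)."""
    return used_fraction >= threshold


fraction = disk_used_fraction("/")
print(f"/ is {fraction:.0%} full, critical: {is_disk_critical(fraction)}")
```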
P2 — Slow database
Symptom: queries that normally take 50 ms now take 5 seconds.
- Top by query. `SELECT * FROM pg_stat_activity WHERE state <> 'idle' ORDER BY query_start;` One long-running query usually blocks everything else.
- Locks. `SELECT * FROM pg_locks WHERE NOT granted;` Anything not granted means contention. Find the blocking PID and decide whether to `SELECT pg_terminate_backend(<pid>)`.
- I/O. `iostat -xz 1` on the VM. `%util` at 100% on the data device means you’ve outgrown the volume: resize up, or move the data dir to a bigger NVMe pool.
- Vacuum. `SELECT relname, n_dead_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10;` If dead tuples are in millions and last autovacuum is hours old, run `VACUUM ANALYZE <table>;`.
- Missing index. `EXPLAIN ANALYZE` your slow query. Seq scans on tables > 10k rows are a smell.
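Finding the blocking PID by hand gets tedious when several sessions wait on each other in a chain. One possible approach, sketched below, uses PostgreSQL's built-in `pg_blocking_pids()` (available since 9.6) and then picks out the sessions that block others without waiting themselves. The helper `root_blockers` and the way the rows are fetched are assumptions. Use whatever database driver you already have to run the query.

```python
# Query assumed to feed the helper below; run it with your own driver.
BLOCKED_QUERY = """
SELECT pid, pg_blocking_pids(pid) AS blocked_by
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
"""


def root_blockers(blocked_by: dict[int, list[int]]) -> list[int]:
    """Return PIDs that block other sessions but are not waiting themselves.

    `blocked_by` maps a waiting PID to the PIDs it is waiting on.
    PID 0 stands for a prepared transaction and is skipped, because
    there is no backend to terminate for it.
    """
    waiting = set(blocked_by)
    blockers = {pid for pids in blocked_by.values() for pid in pids if pid > 0}
    return sorted(blockers - waiting)


# Illustrative data: 102 waits on 101, which waits on 100.
print(root_blockers({101: [100], 102: [101]}))
```

Terminating a root blocker usually releases the whole chain, so check what that session is doing before calling `pg_terminate_backend` on it.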
P2 — Disk filling up
- `sudo du -h --max-depth=1 / | sort -h | tail -10` gives the top 10 largest dirs.
- Common culprits: `/var/log` (rotate / truncate), `/var/lib/docker` (`docker system prune -af`), `/tmp` (just delete), `~/.cache` on the app user.
- If you can’t shrink, grow: /portal/volumes → Resize. Online, no downtime. Then `resize2fs /dev/vdX` inside the VM.
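If you want the same "largest directories" view inside a monitoring script, a rough Python equivalent of the `du` one-liner might look like this. It is a sketch, not a replacement for `du`: it sums apparent file sizes rather than allocated disk blocks, so the numbers can differ slightly. The function names are hypothetical.

```python
import os


def dir_size(path: str) -> int:
    """Sum the apparent sizes of all files under `path` (symlinks not followed)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def largest_subdirs(parent: str, top: int = 10) -> list[tuple[str, int]]:
    """Return up to `top` immediate subdirectories of `parent`, largest first."""
    sizes = []
    with os.scandir(parent) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((entry.path, dir_size(entry.path)))
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]
```

Calling `largest_subdirs("/var")` as root returns `(path, bytes)` pairs. Without root, unreadable directories are silently counted as smaller than they are.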
P1 — Suspected security event
Symptom: unfamiliar SSH login, unexpected process running, outbound traffic to suspicious IPs.
- Don’t shut down. Snapshot the VM first (instance detail → Snapshots → Take snapshot) for forensic analysis.
- Isolate. Firewall → set ingress to your IP only. Leaves the attacker locked out without destroying evidence.
- Rotate everything. SSH keys (/portal/ssh-keys), API keys (/portal/api-keys), database passwords, KMS keys (`cloudnx kms rotate`), any embedded secrets in container images.
- Notify. Email [email protected] and your DPO. We’ll coordinate audit-log export and confirm whether the breach touched any CloudNx-managed service.
- Post-mortem. Write it within 48 hours. Even if the incident was small. Habits matter more than the specific event.
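To back up the "unfamiliar SSH login" symptom with evidence, you can scan the sshd log for accepted logins from addresses you do not recognise. The sketch below assumes the standard OpenSSH "Accepted ... for ... from ... port ..." log line. Where those lines live (for example `/var/log/auth.log` on Debian/Ubuntu, or the systemd journal) depends on the distribution, and the helper name and allowlist are illustrative.

```python
import re

# Matches OpenSSH lines such as:
#   "... sshd[812]: Accepted publickey for deploy from 203.0.113.7 port 50514 ssh2"
ACCEPTED = re.compile(
    r"Accepted (?P<method>\S+) for (?P<user>\S+) from (?P<ip>\S+) port \d+"
)


def unfamiliar_logins(log_lines, known_ips):
    """Return (user, ip, method) for accepted logins from IPs not in `known_ips`."""
    found = []
    for line in log_lines:
        match = ACCEPTED.search(line)
        if match and match.group("ip") not in known_ips:
            found.append(
                (match.group("user"), match.group("ip"), match.group("method"))
            )
    return found
```

Run it against a copy of the log taken from the snapshot. An attacker with root access may have edited the logs on the live VM, so treat an empty result as inconclusive.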
Region-wide outage
Single-region today (EU West). If the entire region is unreachable, your options are:
- Wait — we publish status updates every 15 min on status.cloudnx.in and email all customers within 30 min of detection.
- Failover to your offsite backup (see Backup strategy). Bring up a temporary stack on AWS / GCP from the latest backup. The 5-step CloudNx-to-AWS migration runs in reverse.
- For prolonged outages (> 4 hours), we credit prepaid customers automatically per the SLA.
Multi-region active-active will land when the second AX41 is procured. Until then, layer-3 offsite backups (CloudNx S3 + Backblaze B2) are your DR plan.
Escalation matrix
| Severity | Definition | Channel | Response SLA |
|---|---|---|---|
| P1 | Production down — users affected | support@ with [P1] · postpaid: oncall phone | Paid: < 1h · Postpaid: < 30 min |
| P2 | Degraded — workaround exists | support@ | Paid: < 4h |
| P3 | Question / non-urgent bug | support@ or Discord | Paid: < 24h business · Free: best-effort |
| Security | Suspected breach / vulnerability | [email protected] | < 1h, 24×7 |
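If your alerting or ticketing glue needs the matrix in machine-readable form, it could be encoded roughly as follows. The response times are copied from the table above. The plan keys, the function names and the exact subject layout beyond the `[P1]`-style prefix are assumptions made for illustration.

```python
# Response SLAs transcribed from the escalation matrix.
# "*" means the SLA applies regardless of plan.
RESPONSE_SLA = {
    "P1": {"paid": "< 1h", "postpaid": "< 30 min"},
    "P2": {"paid": "< 4h"},
    "P3": {"paid": "< 24h business", "free": "best-effort"},
    "Security": {"*": "< 1h, 24×7"},
}


def response_sla(severity: str, plan: str) -> str:
    """Look up the response SLA; 'not listed' when the table has no entry."""
    slas = RESPONSE_SLA[severity]
    return slas.get(plan, slas.get("*", "not listed"))


def escalation_subject(severity: str, account_uuid: str,
                       instance_id: str, summary: str) -> str:
    """Build an email subject with the severity prefix the runbook asks for.

    Putting the account UUID and instance ID in the subject is a choice
    made here; the runbook only requires that they are included.
    """
    return f"[{severity}] {summary} (account {account_uuid}, instance {instance_id})"


print(response_sla("P1", "postpaid"))
print(escalation_subject("P1", "<account-uuid>", "<instance-id>", "VM unreachable"))
```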