Incident Playbook for Beginners: Real-World Monitoring and Troubleshooting Stories
Calmly troubleshoot slowdowns, errors, and outages — even at 3AM. These are real stories, explained in language anyone new to on-call can follow.
🚦 "Why does my API suddenly slow down?"
Symptom:
Out of the blue, your API takes 2 seconds instead of 100ms to respond. Users say: “It’s randomly slow.” No errors in your logs.
Where to start
- Check CPU, memory, database, network traffic metrics.
- Open tracing tools (Datadog, Grafana) — see where the request lags.
- Is your API waiting on another service?
- Are too many requests hitting you at once?
What’s really happening?
- User makes a request.
- You call a DB or another API — it’s sometimes slow.
- Your service waits, builds up more requests.
- Finally replies… late.
Why?
- DB is backed up, so your app waits for a free connection.
- External API is just slow.
- Too many requests in your queue.
- App is paused for GC (memory cleanup).
Quick checks
- Is DB or an external service overloaded?
- Look for latency spikes in dashboards.
- Shell command:
kubectl top pod
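To see the slowness from the outside as well, here is a minimal sketch, assuming a Kubernetes deployment called my-api and a hypothetical endpoint at api.example.com:
# Time one request end to end; a big gap between connect and ttfb means the backend is waiting on something
curl -s -o /dev/null -w 'connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' https://api.example.com/orders
# Any timeout or slow-dependency messages in recent logs?
kubectl logs deploy/my-api --since=10m | grep -iE 'timeout|slow'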
Quick fixes
- Set timeouts on outgoing calls.
- Add caching for frequent requests.
- Temporarily scale up your service.
- Restart any stuck pod.
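If you run on Kubernetes, the last two fixes might look like this (the deployment name my-api is just an example):
# Temporarily add replicas to absorb the backlog
kubectl scale deployment my-api --replicas=5
# Restart pods with a rolling restart instead of deleting them by hand
kubectl rollout restart deployment my-api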
Prevent this next time
- Alert on high latency.
- Watch dependency health.
- Do real-traffic load simulations before release.
🚦 "Why are users seeing 500 errors, but my app logs are empty?"
Symptom:
Users see a 500 error. You don’t see any error logged.
Where to check
- Error before your app? (Check load balancer, gateway, firewall.)
- Did your container restart before it could log?
- Are logs being sampled or filtered too aggressively?
What’s really happening?
- LB gets user request, forwards to you.
- Your service crashes — no log gets written.
- LB returns 500 to the user.
Why?
- Crash before log flush.
- Proxy returns error for you.
- Log level too high for error details.
- App killed by OS for OOM (out-of-memory).
Quick checks
kubectl describe pod <pod>
kubectl get events
dmesg | grep -i oom
- Compare request IDs across system logs.
- Temporarily lower log level.
Quick fixes
- Roll back to last stable version.
- Restart pods, give more memory.
- Add try/catch + log in all routes.
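On Kubernetes, the rollback part can be a single command (again, my-api is a placeholder name):
# Roll back to the previous revision of the deployment
kubectl rollout undo deployment my-api
# Check which revision is live now
kubectl rollout history deployment my-api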
Prevent this next time
- Always include correlation/request IDs everywhere.
- Synthetic user tests.
- Make error logging more robust by default.
🚦 "99% uptime, but users still complain"
Symptom:
Dashboards: “100% uptime.” Users: “The site’s down/slow!”
Where to check
- Are your health checks only hitting /health?
- Regional/ISP issue?
- Are users using different API endpoints?
- Is frontend actually failing, not backend?
What’s really happening?
- Health check always returns OK, but real routes fail.
- Problems in certain regions or features, not covered by synthetic checks.
Why?
- Health check is too shallow — just “is server alive.”
- Downstream service is the one broken.
- Metrics average out and hide regional outages.
Quick checks
- Test real endpoints from multiple regions.
- Compare synthetic vs real user traffic in dashboards.
- Review readiness/liveness probes for depth.
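A quick sketch of the "test real endpoints" idea, with hypothetical URLs; run it from more than one region if you can (a VPN or a small cloud VM in that region works):
# Shallow check: often returns 200 even when dependencies are broken
curl -s -o /dev/null -w 'health=%{http_code}\n' https://api.example.com/health
# A real route that touches the database: this is what users actually experience
curl -s -o /dev/null -w 'orders=%{http_code} time=%{time_total}s\n' https://api.example.com/orders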
Quick fixes
- Make health checks exercise real app dependencies.
- Add geographic dashboards.
- Reroute or throttle bad regions.
Prevent this next time
- RUM (real user monitoring) everywhere.
- Define uptime as “real requests succeed.”
- SLOs/SLAs based on end-user outcome.
🚦 "Suddenly 100s of alerts: what now?"
Symptom:
You’re paged at 3AM, alerts flooding in. Every service looks down. What now?
Where to check
- Which alert/service failed FIRST?
- Recent deploy or infra change?
- Any dependencies that triggered cascade failures?
What’s really happening?
- Something core (DB, major service) failed.
- All dependents start to fail, too.
- Each one fires an alert, and you get swamped.
Why?
- No alert grouping/deduplication, so you get one alert per service per pod.
- Downstream failures avalanche from the real root cause.
Quick checks
- Timeline: which alert came first?
- Any recent deploy or config change?
- Focus on user symptoms first.
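On Kubernetes, two commands help build that timeline (my-api is a placeholder):
# Cluster events sorted by time: look for the earliest failure, not the loudest one
kubectl get events --sort-by=.metadata.creationTimestamp
# Did a deploy land right before the storm?
kubectl rollout history deployment my-api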
Quick fixes
- Silence duplicate alerts.
- Roll back latest change if needed.
- Triage: fix what hits users, not just what’s red.
Prevent this next time
- Root-cause alert grouping.
- Write recovery playbooks/runbooks.
- Blameless postmortems after each storm.
🚦 "App’s memory keeps rising — OOM killed"
Symptom:
App runs fine, then restarts. Memory usage graph only goes up. Pod status shows OOMKilled in Kubernetes.
Where to check
- Container/pod status for restarts or OOM reason.
- Memory profiles/heap dumps.
- Any feature rolled out using more memory?
What’s really happening?
- App creates objects/data, doesn’t release old ones.
- GC can’t free enough space.
- OS kills process for hitting memory limit.
Why?
- Memory leak (long-lived refs).
- Cache grows endlessly.
- Unclosed file handles, sockets.
Quick checks
kubectl top pod
- App exposes a heap profile? (e.g. /debug/pprof in Go, VisualVM for Java)
- See what uses most RAM.
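A hedged sketch of those checks, assuming Kubernetes and a Go app exposing pprof on port 6060 (adjust for your stack):
# Why did the container last terminate? Look for OOMKilled
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Grab a heap profile and see which allocations dominate
curl -s http://localhost:6060/debug/pprof/heap > heap.out
go tool pprof -top heap.out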
Quick fixes
- Restart service (buy time).
- Cap cache size or TTL.
- Fix leaks, close all resources.
- Raise memory limits (short-term).
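The short-term "raise memory limits" fix on Kubernetes might look like this (deployment name and sizes are examples, not recommendations):
# Give the container more headroom while you hunt the leak
kubectl set resources deployment my-api --requests=memory=512Mi --limits=memory=1Gi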
Prevent next time
- Test with profilers pre-prod.
- Add dashboards & alerts for leak trends.
- Enforce resource cleanup with automated/static checking.
🚦 "Random DB CPU spikes, app slows to a crawl"
Symptom:
Database CPU spikes to 90%, queries slow way down. No obvious changes — except, did you just ship a big new feature?
Where to check
- DB CPU, query, and connection metrics.
- Recent code deploys?
- Batch/report jobs during peak?
- App’s DB pool usage?
What’s really happening?
- New query hits DB, scans big table.
- More user traffic = more bad queries.
- Everything backs up and users time out.
Why?
- Missing index.
- Slow JOINs/filters.
- Batch/report jobs.
Quick checks
SHOW FULL PROCESSLIST (MySQL)
EXPLAIN SELECT ...
- Check pool/slow query dashboard.
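If the database is MySQL and the client is already configured (credentials via ~/.my.cnf, for example), the same checks from a shell might look like this; the table and query are hypothetical:
# What is the database doing right now? Long-running queries show up here
mysql -e 'SHOW FULL PROCESSLIST'
# Does the suspect query use an index, or scan the whole table?
mysql -e 'EXPLAIN SELECT * FROM orders WHERE user_id = 42'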
Quick fixes
- Add index, kill stuck queries.
- Temp: scale up DB.
- Schedule heavy jobs off-hours.
Prevent next time
- Alerts on slow queries.
- Load test DB after code changes.
- Cap timeouts/connections.
🚦 "Only users in Asia are timing out"
Symptom:
Users in one country see errors; everyone else is fine.
Where to check
- DNS/CDN routes.
- Per-region LB or healthcheck status.
- Test from VPN/tools in that region.
What’s really happening?
- DNS routes Asia users to asia-server.
- That server's DB is down. US/EU fine.
- Only Asia reports issues.
Why?
- Edge server misconfigured.
- Cloud region down.
- DNS cache poisoning or wrong routes.
Quick checks
dig yourapp.com
traceroute yourapp.com
- Regional dashboard per endpoint.
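A minimal sketch of the regional checks (yourapp.com is from the example above; the edge IP is a placeholder):
# What IPs do resolvers hand out? Compare your local answer with a public resolver
dig yourapp.com +short
dig @1.1.1.1 yourapp.com +short
# Hit the app through a specific edge IP and compare status code and latency per region
curl -s -o /dev/null -w 'code=%{http_code} time=%{time_total}s\n' --resolve yourapp.com:443:<asia-edge-ip> https://yourapp.com/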
Quick fixes
- Reroute traffic to healthy region.
- Restart broken services.
- Invalidate DNS/CDN caches.
Prevent next time
- Alerts + dashboards by region/country.
- Regular synthetic tests around the world.
- Failover plans ready.
🌟 Final Takeaway
Incidents are just your system’s way of teaching you. Every error, crash, or slowdown is a story. Calmly follow the symptoms, check the steps, fix the problem — and you’ll get better at it every time.