Backend Engineering · Incident Response · Monitoring · SRE · Troubleshooting · On-call

Incident Playbook for Beginners: Real-World Monitoring and Troubleshooting Stories

Satyam Parmar
October 30, 2025
7 min read


Calmly troubleshoot slowdowns, errors, and outages — even at 3AM. These are real stories, explained in language anyone new to on-call can follow.

🚦 "Why does my API suddenly slow down?"

Symptom:

Out of the blue, your API takes 2 seconds instead of 100ms to respond. Users say: “It’s randomly slow.” No errors in your logs.

Where to start

  • Check CPU, memory, database, network traffic metrics.
  • Open tracing tools (Datadog, Grafana) — see where the request lags.
  • Is your API waiting on another service?
  • Are too many requests hitting you at once?

What’s really happening?

  1. User makes a request.
  2. You call a DB or another API — it’s sometimes slow.
  3. Your service waits, builds up more requests.
  4. Finally replies… late.

Why?

  • DB is backed up, so your app waits for a free connection.
  • External API is just slow.
  • Too many requests in your queue.
  • App is paused for GC (memory cleanup).

Quick checks

  • Is DB or an external service overloaded?
  • Look for latency spikes in dashboards.
  • Shell command (shows which pods are using the most CPU and memory):
    kubectl top pod

Quick fixes

  • Set timeouts on outgoing calls (see the Go sketch after this list).
  • Add caching for frequent requests.
  • Temporarily scale up your service.
  • Restart any stuck pod.
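A minimal sketch of the "timeout on outgoing calls" idea in Go, assuming your handler calls a downstream HTTP service. The service name and the exact durations are placeholders, not recommendations:

    package main

    import (
        "context"
        "fmt"
        "net/http"
        "time"
    )

    // callDownstream gives the dependency at most 500ms to answer.
    // Failing fast keeps requests from piling up behind a slow service.
    func callDownstream(ctx context.Context) (*http.Response, error) {
        ctx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)
        defer cancel()

        // "inventory-service" is a made-up name for whatever your API calls.
        req, err := http.NewRequestWithContext(ctx, http.MethodGet,
            "http://inventory-service/items", nil)
        if err != nil {
            return nil, err
        }

        // The client-level timeout is a second safety net for the whole call.
        client := &http.Client{Timeout: 2 * time.Second}
        return client.Do(req)
    }

    func main() {
        resp, err := callDownstream(context.Background())
        if err != nil {
            fmt.Println("downstream call failed:", err)
            return
        }
        defer resp.Body.Close()
        fmt.Println("downstream status:", resp.Status)
    }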

Prevent this next time

  • Alert on high latency.
  • Watch dependency health.
  • Do real-traffic load simulations before release.

🚦 "Why are users seeing 500 errors, but my app logs are empty?"

Symptom:

Users see a 500 error. You don’t see any error logged.

Where to check

  • Error before your app? (Check load balancer, gateway, firewall.)
  • Did your container restart before it could log?
  • Are logs being sampled or filtered too aggressively?

What’s really happening?

  1. LB gets user request, forwards to you.
  2. Your service crashes — no log gets written.
  3. LB returns 500 to the user.

Why?

  • Crash before the log is flushed.
  • Proxy/LB returns the error on your behalf (e.g. an upstream timeout), so your app never logs it.
  • Log level set too high, so the details around the error never get written.
  • App killed by OS for OOM (out-of-memory).

Quick checks

  • kubectl describe pod <pod> (look for restarts, OOMKilled, or failed probes)
  • kubectl get events (recent crashes, evictions, scheduling problems)
  • dmesg | grep -i oom (did the kernel kill your process for running out of memory?)
  • Compare Request IDs across system logs.
  • Temporarily lower log level.

Quick fixes

  • Roll back to last stable version.
  • Restart pods, give more memory.
  • Add try/catch (or panic recovery) + log in all routes; see the sketch below.
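As a rough Go sketch of that last fix, assuming a plain net/http service (the header name and route are illustrative): a tiny middleware that recovers from panics, logs them together with the request ID, and returns a 500 you can actually find in your logs.

    package main

    import (
        "log"
        "net/http"
    )

    // withRecovery wraps any handler so a panic gets logged (with the request ID)
    // instead of killing the request before anything reaches the logs.
    func withRecovery(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // Assumes your LB or gateway sets this header; adjust to your setup.
            reqID := r.Header.Get("X-Request-ID")
            defer func() {
                if err := recover(); err != nil {
                    log.Printf("panic on %s %s (request_id=%s): %v",
                        r.Method, r.URL.Path, reqID, err)
                    http.Error(w, "internal server error", http.StatusInternalServerError)
                }
            }()
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/boom", func(w http.ResponseWriter, r *http.Request) {
            panic("something went very wrong") // simulate the silent crash
        })
        log.Fatal(http.ListenAndServe(":8080", withRecovery(mux)))
    }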

Prevent this next time

  • Always include correlation/request IDs everywhere.
  • Synthetic user tests.
  • Make error logging more robust by default.

🚦 "99% uptime, but users still complain"

Symptom:

Dashboards: “100% uptime.” Users: “The site’s down/slow!”

Where to check

  • Are your health checks only hitting /health?
  • Regional/ISP issue?
  • Are users using different API endpoints?
  • Is frontend actually failing, not backend?

What’s really happening?

  1. Health check always returns OK, but real routes fail.
  2. Problems in certain regions or features, not covered by synthetic checks.

Why?

  • Health check is too shallow — just “is server alive.”
  • Downstream service is the one broken.
  • Metrics average out and hide regional outages.

Quick checks

  • Test real endpoints from multiple regions.
  • Compare synthetic vs real user traffic in dashboards.
  • Review readiness/liveness probes for depth.

Quick fixes

  • Make health checks exercise real app dependencies (sketch below).
  • Add geographic dashboards.
  • Reroute or throttle bad regions.
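A minimal sketch of a deeper health check in Go, assuming the service depends on one SQL database (the driver, DSN, and port are placeholders): it answers 503 when a real dependency is unreachable instead of just confirming the process is alive.

    package main

    import (
        "context"
        "database/sql"
        "log"
        "net/http"
        "time"
        // _ "github.com/lib/pq" // import whichever driver your app already uses
    )

    // healthHandler pings the database with a short deadline, so /health
    // reflects whether real requests could actually succeed.
    func healthHandler(db *sql.DB) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
            defer cancel()

            if err := db.PingContext(ctx); err != nil {
                log.Printf("health check failed: %v", err)
                http.Error(w, "dependency unavailable", http.StatusServiceUnavailable)
                return
            }
            w.Write([]byte("ok"))
        }
    }

    func main() {
        // Placeholder DSN; in a real service, reuse the pool your app already has.
        db, err := sql.Open("postgres", "postgres://user:pass@db:5432/app")
        if err != nil {
            log.Fatal(err)
        }
        http.HandleFunc("/health", healthHandler(db))
        log.Fatal(http.ListenAndServe(":8080", nil))
    }

Keep the dependency check cheap and time-bound; a health check that itself hangs only makes the outage worse.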

Prevent this next time

  • RUM (real user monitoring) everywhere.
  • Define uptime as “real requests succeed.”
  • SLOs/SLAs based on end-user outcome.

🚦 "Suddenly 100s of alerts: what now?"

Symptom:

You’re paged at 3AM, alerts flooding in. Every service looks down. What now?

Where to check

  • Which alert/service failed FIRST?
  • Recent deploy or infra change?
  • Any dependencies that triggered cascade failures?

What’s really happening?

  1. Something core (DB, major service) failed.
  2. All dependents start to fail, too.
  3. Each one fires an alert, you get swamped.

Why?

  • No alert grouping/deduplication, so you get one alert per service per pod.
  • Downstream failures avalanche from the real root cause.

Quick checks

  • Timeline: which alert came first?
  • Any recent deploy or config change?
  • Focus on user symptoms first.

Quick fixes

  • Silence duplicate alerts.
  • Roll back latest change if needed.
  • Triage: fix what hits users, not just what’s red.

Prevent this next time

  • Root-cause alert grouping.
  • Write recovery playbooks/runbooks.
  • Blameless postmortems after each storm.

🚦 "App’s memory keeps rising — OOM killed"

Symptom:

App runs fine, then restarts. Memory usage graph only goes up. OOMKilled in Kubernetes.

Where to check

  • Container/pod status for restarts or OOM reason.
  • Memory profiles/heap dumps.
  • Any feature rolled out using more memory?

What’s really happening?

  1. App creates objects/data, doesn’t release old ones.
  2. GC can’t free enough space.
  3. OS kills process for hitting memory limit.

Why?

  • Memory leak (long-lived refs).
  • Cache grows endlessly.
  • Unclosed file handles, sockets.

Quick checks

  • kubectl top pod
  • Check the app's heap profile (e.g. /debug/pprof in Go, VisualVM for Java); see the sketch after this list.
  • See what uses most RAM.
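If the service happens to be written in Go, exposing the standard pprof endpoints is a one-import job. A minimal sketch (the port is an assumption; keep it internal-only):

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
    )

    func main() {
        // Then, for example: go tool pprof http://localhost:6060/debug/pprof/heap
        log.Fatal(http.ListenAndServe("localhost:6060", nil))
    }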

Quick fixes

  • Restart service (buy time).
  • Cap cache size or TTL (bounded-cache sketch below).
  • Fix leaks, close all resources.
  • Raise memory limits (short-term).
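A rough sketch of the "cap the cache" idea in Go: entries expire after a TTL and the map never grows past a fixed size. The numbers and the eviction policy are placeholders meant to show the shape of the fix, not a production cache:

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    type entry struct {
        value   string
        expires time.Time
    }

    // boundedCache drops stale values and refuses to grow past maxEntries,
    // so memory use stays flat instead of climbing forever.
    type boundedCache struct {
        mu         sync.Mutex
        items      map[string]entry
        maxEntries int
        ttl        time.Duration
    }

    func newBoundedCache(maxEntries int, ttl time.Duration) *boundedCache {
        return &boundedCache{items: make(map[string]entry), maxEntries: maxEntries, ttl: ttl}
    }

    func (c *boundedCache) Get(key string) (string, bool) {
        c.mu.Lock()
        defer c.mu.Unlock()
        e, ok := c.items[key]
        if !ok || time.Now().After(e.expires) {
            delete(c.items, key) // expired entries are cleaned up lazily
            return "", false
        }
        return e.value, true
    }

    func (c *boundedCache) Set(key, value string) {
        c.mu.Lock()
        defer c.mu.Unlock()
        if len(c.items) >= c.maxEntries {
            // Crude eviction: drop one arbitrary entry. A real cache would use LRU.
            for k := range c.items {
                delete(c.items, k)
                break
            }
        }
        c.items[key] = entry{value: value, expires: time.Now().Add(c.ttl)}
    }

    func main() {
        cache := newBoundedCache(1000, 5*time.Minute)
        cache.Set("user:42", "cached profile")
        fmt.Println(cache.Get("user:42"))
    }

For anything serious, an existing LRU/TTL cache library is a better bet than rolling your own; the point is simply that the cache needs a ceiling.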

Prevent next time

  • Test with profilers pre-prod.
  • Add dashboards & alerts for leak trends.
  • Enforce resource cleanup with automated/static checking.

🚦 "Random DB CPU spikes, app slows to a crawl"

Symptom:

Database CPU spikes to 90%, queries slow way down. No obvious changes — except, did you just ship a big new feature?

Where to check

  • DB CPU, query, and connection metrics.
  • Recent code deploys?
  • Batch/report jobs during peak?
  • App’s DB pool usage?

What’s really happening?

  1. New query hits DB, scans big table.
  2. More user traffic = more bad queries.
  3. Everything backs up, users time out.

Why?

  • Missing index.
  • Slow JOINs/filters.
  • Batch/report jobs.

Quick checks

  • SHOW FULL PROCESSLIST (MySQL): what is running right now, and what is stuck?
  • EXPLAIN SELECT ...: is the slow query using an index or scanning the whole table?
  • Check pool/slow query dashboard.

Quick fixes

  • Add index, kill stuck queries.
  • Temp: scale up DB.
  • Schedule heavy jobs off-hours.

Prevent next time

  • Alerts on slow queries.
  • Load test DB after code changes.
  • Cap timeouts/connections (see the pool-settings sketch below).
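A minimal Go sketch of capping connections and query time with database/sql. The driver, DSN, table name, and the exact limits are placeholders for whatever your app actually uses:

    package main

    import (
        "context"
        "database/sql"
        "log"
        "time"
        // _ "github.com/go-sql-driver/mysql" // whichever driver you already use
    )

    func main() {
        db, err := sql.Open("mysql", "app:secret@tcp(db:3306)/shop") // placeholder DSN
        if err != nil {
            log.Fatal(err)
        }

        // Cap how hard the app can hit the database.
        db.SetMaxOpenConns(25)
        db.SetMaxIdleConns(10)
        db.SetConnMaxLifetime(5 * time.Minute)

        // Give each query a deadline so one slow query can't hang forever.
        ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
        defer cancel()

        var count int
        err = db.QueryRowContext(ctx, "SELECT COUNT(*) FROM orders").Scan(&count)
        if err != nil {
            log.Printf("query failed (or timed out): %v", err)
            return
        }
        log.Printf("orders: %d", count)
    }

Pair the app-side caps with server-side limits (max connections, statement timeouts) so a single bad deploy can't exhaust the database on its own.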

🚦 "Only users in Asia are timing out"

Symptom:

Users in one country have errors, everyone else is fine.

Where to check

  • DNS/CDN routes.
  • Per-region LB or healthcheck status.
  • Test from VPN/tools in that region.

What’s really happening?

  1. DNS routes Asia users to asia-server.
  2. That server’s DB is down. US/EU fine.
  3. Only Asia reports issues.

Why?

  • Edge server misconfigured.
  • Cloud region down.
  • Stale or poisoned DNS caches, or wrong routes.

Quick checks

  • dig yourapp.com (which IPs does DNS return?)
  • traceroute yourapp.com (where along the path do packets stall?)
  • Regional dashboard per endpoint.

Quick fixes

  • Reroute traffic to healthy region.
  • Restart broken services.
  • Invalidate DNS/CDN caches.

Prevent next time

  • Alerts + dashboards by region/country.
  • Regular synthetic tests around the world (probe sketch below).
  • Failover plans ready.
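A tiny Go sketch of such a synthetic probe; the URL is a placeholder for a real user-facing route. Run something like it on a schedule from each region and alert when a region's success rate or latency degrades:

    package main

    import (
        "fmt"
        "net/http"
        "time"
    )

    // probe hits a real user-facing endpoint and reports status plus latency,
    // so per-region dashboards reflect what users actually experience.
    func probe(url string) {
        client := &http.Client{Timeout: 5 * time.Second}

        start := time.Now()
        resp, err := client.Get(url)
        elapsed := time.Since(start)

        if err != nil {
            fmt.Printf("FAIL %s after %v: %v\n", url, elapsed, err)
            return
        }
        defer resp.Body.Close()
        fmt.Printf("%s -> %d in %v\n", url, resp.StatusCode, elapsed)
    }

    func main() {
        // Probe a real route, not just /health.
        probe("https://yourapp.com/api/products")
    }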

🌟 Final Takeaway

Incidents are just your system’s way of teaching you. Every error, crash, or slowdown is a story. Calmly follow the symptoms, check the steps, fix the problem — and you’ll get better at it every time.
