Incident Playbook for Beginners: Real-World Monitoring and Troubleshooting Stories
Calmly troubleshoot slowdowns, errors, and outages — even at 3AM. These are real stories, explained in language anyone new to on-call can follow.
🚦 "Why does my API suddenly slow down?"
Symptom:
Out of the blue, your API takes 2 seconds instead of 100ms to respond. Users say: “It’s randomly slow.” No errors in your logs.
Where to start
- Check CPU, memory, database, network traffic metrics.
- Open tracing tools (Datadog, Grafana) — see where the request lags.
- Is your API waiting on another service?
- Are too many requests hitting you at once?
What’s really happening?
- User makes a request.
- You call a DB or another API — it’s sometimes slow.
- Your service waits, builds up more requests.
- Finally replies… late.
Why?
- DB is backed up, so your app waits for a free connection.
- External API is just slow.
- Too many requests in your queue.
- App is paused for GC (memory cleanup).
Quick checks
- Is DB or an external service overloaded?
- Look for latency spikes in dashboards.
- Shell command:
kubectl top pod
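To see the slowness from the outside as well, here is a minimal sketch, assuming a Kubernetes deployment called my-api and a hypothetical endpoint at api.example.com:
# Time one request end to end; a big gap between connect and ttfb means the backend is waiting on something
curl -s -o /dev/null -w 'connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' https://api.example.com/orders
# Any timeout or slow-dependency messages in recent logs?
kubectl logs deploy/my-api --since=10m | grep -iE 'timeout|slow'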
Quick fixes
- Set timeouts on outgoing calls.
- Add caching for frequent requests.
- Temporarily scale up your service.
- Restart any stuck pod.
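If you run on Kubernetes, the last two fixes might look like this (the deployment name my-api is just an example):
# Temporarily add replicas to absorb the backlog
kubectl scale deployment my-api --replicas=5
# Restart pods with a rolling restart instead of deleting them by hand
kubectl rollout restart deployment my-api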
Prevent this next time
- Alert on high latency.
- Watch dependency health.
- Do real-traffic load simulations before release.
🚦 "Why are users seeing 500 errors, but my app logs are empty?"
Symptom:
Users see a 500 error. You don’t see any error logged.
Where to check
- Error before your app? (Check load balancer, gateway, firewall.)
- Did your container restart before it could log?
- Are logs being sampled or filtered too aggressively?
What’s really happening?
- LB gets user request, forwards to you.
- Your service crashes — no log gets written.
- LB returns 500 to the user.
Why?
- Crash before log flush.
- Proxy returns error for you.
- Log level too high for error details.
- App killed by OS for OOM (out-of-memory).
Quick checks
kubectl describe pod <pod>
kubectl get events
dmesg | grep -i oom
- Compare request IDs across system logs.
- Temporarily lower log level.
Quick fixes
- Roll back to last stable version.
- Restart pods, give more memory.
- Add try/catch + log in all routes.
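On Kubernetes, the rollback part can be a single command (again, my-api is a placeholder name):
# Roll back to the previous revision of the deployment
kubectl rollout undo deployment my-api
# Check which revision is live now
kubectl rollout history deployment my-api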
Prevent this next time
- Always include correlation/request IDs everywhere.
- Synthetic user tests.
- Make error logging more robust by default.
🚦 "99% uptime, but users still complain"
Symptom:
Dashboards: “100% uptime.” Users: “The site’s down/slow!”
Where to check
- Are your health checks only hitting /health?
- Regional/ISP issue?
- Are users using different API endpoints?
- Is frontend actually failing, not backend?
What’s really happening?
- Health check always returns OK, but real routes fail.
- Problems in certain regions or features, not covered by synthetic checks.
Why?
- Health check is too shallow — just “is server alive.”
- Downstream service is the one broken.
- Metrics average out and hide regional outages.
Quick checks
- Test real endpoints from multiple regions.
- Compare synthetic vs real user traffic in dashboards.
- Review readiness/liveness probes for depth.
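A quick sketch of the "test real endpoints" idea, with hypothetical URLs; run it from more than one region if you can (a VPN or a small cloud VM in that region works):
# Shallow check: often returns 200 even when dependencies are broken
curl -s -o /dev/null -w 'health=%{http_code}\n' https://api.example.com/health
# A real route that touches the database: this is what users actually experience
curl -s -o /dev/null -w 'orders=%{http_code} time=%{time_total}s\n' https://api.example.com/orders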
Quick fixes
- Make health checks exercise real app dependencies.
- Add geographic dashboards.
- Reroute or throttle bad regions.
Prevent this next time
- RUM (real user monitoring) everywhere.
- Define uptime as “real requests succeed.”
- SLOs/SLAs based on end-user outcome.
🚦 "Suddenly 100s of alerts: what now?"
Symptom:
You’re paged at 3AM, alerts flooding in. Every service looks down. What now?
Where to check
- Which alert/service failed FIRST?
- Recent deploy or infra change?
- Any dependencies that triggered cascade failures?
What’s really happening?
- Something core (DB, major service) failed.
- All dependents start to fail, too.
- Each one fires an alert, and you get swamped.
Why?
- No alert grouping/deduplication, so you get one alert per service per pod.
- Downstream failures avalanche from the real root cause.
Quick checks
- Timeline: which alert came first?
- Any recent deploy or config change?
- Focus on user symptoms first.
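On Kubernetes, two commands help build that timeline (my-api is a placeholder):
# Cluster events sorted by time: look for the earliest failure, not the loudest one
kubectl get events --sort-by=.metadata.creationTimestamp
# Did a deploy land right before the storm?
kubectl rollout history deployment my-api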
Quick fixes
- Silence duplicate alerts.
- Roll back latest change if needed.
- Triage: fix what hits users, not just what’s red.
Prevent this next time
- Root-cause alert grouping.
- Write recovery playbooks/runbooks.
- Blameless postmortems after each storm.
🚦 "App’s memory keeps rising — OOM killed"
Symptom:
App runs fine, then restarts. Memory usage graph only goes up. Pod status shows OOMKilled in Kubernetes.
Where to check
- Container/pod status for restarts or OOM reason.
- Memory profiles/heap dumps.
- Any feature rolled out using more memory?
What’s really happening?
- App creates objects/data, doesn’t release old ones.
- GC can’t free enough space.
- OS kills process for hitting memory limit.
Why?
- Memory leak (long-lived refs).
- Cache grows endlessly.
- Unclosed file handles, sockets.
Quick checks
kubectl top pod
- App exposes a heap profile? (e.g. /debug/pprof in Go, VisualVM for Java)
- See what uses most RAM.
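A hedged sketch of those checks, assuming Kubernetes and a Go app exposing pprof on port 6060 (adjust for your stack):
# Why did the container last terminate? Look for OOMKilled
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Grab a heap profile and see which allocations dominate
curl -s http://localhost:6060/debug/pprof/heap > heap.out
go tool pprof -top heap.out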
Quick fixes
- Restart service (buy time).
- Cap cache size or TTL.
- Fix leaks, close all resources.
- Raise memory limits (short-term).
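The short-term "raise memory limits" fix on Kubernetes might look like this (deployment name and sizes are examples, not recommendations):
# Give the container more headroom while you hunt the leak
kubectl set resources deployment my-api --requests=memory=512Mi --limits=memory=1Gi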
Prevent next time
- Test with profilers pre-prod.
- Add dashboards & alerts for leak trends.
- Enforce resource cleanup with automated/static checking.
🚦 "Random DB CPU spikes, app slows to a crawl"
Symptom:
Database CPU spikes to 90%, queries slow way down. No obvious changes — except, did you just ship a big new feature?
Where to check
- DB CPU, query, and connection metrics.
- Recent code deploys?
- Batch/report jobs during peak?
- App’s DB pool usage?
What’s really happening?
- New query hits DB, scans big table.
- More user traffic = more bad queries.
- Everything backs up and users time out.
Why?
- Missing index.
- Slow JOINs/filters.
- Batch/report jobs.
Quick checks
SHOW FULL PROCESSLIST (MySQL)
EXPLAIN SELECT ...
- Check pool/slow query dashboard.
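If the database is MySQL and the client is already configured (credentials via ~/.my.cnf, for example), the same checks from a shell might look like this; the table and query are hypothetical:
# What is the database doing right now? Long-running queries show up here
mysql -e 'SHOW FULL PROCESSLIST'
# Does the suspect query use an index, or scan the whole table?
mysql -e 'EXPLAIN SELECT * FROM orders WHERE user_id = 42'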
Quick fixes
- Add index, kill stuck queries.
- Temp: scale up DB.
- Schedule heavy jobs off-hours.
Prevent next time
- Alerts on slow queries.
- Load test DB after code changes.
- Cap timeouts/connections.
🚦 "Only users in Asia are timing out"
Symptom:
Users in one country see errors; everyone else is fine.
Where to check
- DNS/CDN routes.
- Per-region LB or healthcheck status.
- Test from VPN/tools in that region.
What’s really happening?
- DNS routes Asia users to asia-server.
- That server's DB is down. US/EU fine.
- Only Asia reports issues.
Why?
- Edge server misconfigured.
- Cloud region down.
- DNS cache poisoning or wrong routes.
Quick checks
dig yourapp.com
traceroute yourapp.com
- Regional dashboard per endpoint.
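A minimal sketch of the regional checks (yourapp.com is from the example above; the edge IP is a placeholder):
# What IPs do resolvers hand out? Compare your local answer with a public resolver
dig yourapp.com +short
dig @1.1.1.1 yourapp.com +short
# Hit the app through a specific edge IP and compare status code and latency per region
curl -s -o /dev/null -w 'code=%{http_code} time=%{time_total}s\n' --resolve yourapp.com:443:<asia-edge-ip> https://yourapp.com/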
Quick fixes
- Reroute traffic to healthy region.
- Restart broken services.
- Invalidate DNS/CDN caches.
Prevent next time
- Alerts + dashboards by region/country.
- Regular synthetic tests around the world.
- Failover plans ready.
🌟 Final Takeaway
Incidents are just your system’s way of teaching you. Every error, crash, or slowdown is a story. Calmly follow the symptoms, check the steps, fix the problem — and you’ll get better at it every time.