Technology · 6 min read

How Site Reliability Engineers Use iPhone Notes for Incident Insights

SREs juggle incident response, SLO tracking, and postmortems across distributed systems. Here is how to capture the patterns that don't make it into runbooks until it's too late.

By Taha Baalla

Site reliability engineering sits at the intersection of software engineering and operations. You're simultaneously writing code, responding to pages, tracking error budgets, and conducting postmortems — often across multiple services and time zones. The insights that matter most arrive at the worst times: during an active incident at 2am, mid-postmortem when someone identifies a latency pattern, or during a capacity planning review when a failure mode becomes obvious.

iPhone notes give SREs a capture layer that stays with them through on-call rotations, cross-team meetings, and production investigations. The notes you take during an incident become the seeds of runbooks. The patterns you observe across postmortems become SLO refinements.

Why SREs Need a Mobile Note System

SRE work doesn't happen at a desk. You're responding to pages from a couch, reviewing dashboards during lunch, or discussing service dependencies in hallways. A mobile note system that syncs with your engineering workflow captures context when it's live — not reconstructed hours later.

The cost of not capturing: incident timelines reconstructed from memory miss subtle details. Postmortem action items assigned verbally get forgotten. SLO violation patterns observed across multiple incidents don't get connected until a major outage reveals them. Every SRE has lost insights to the gap between the observation and the documentation.

What SREs Capture in iPhone Notes

Incident observations: Timeline notes during an active incident are invaluable. "Latency spike started at 14:32 UTC, correlated with deployment d4a9f2, rollback at 14:48 resolved p95 within 3 minutes" — this level of detail, captured live, transforms postmortem quality. Note the symptoms you see, the hypotheses you test, and the mitigations you try. Even failed mitigations are worth documenting.

SLO violation patterns: When you investigate an error budget burn, note what you find. "Third time this quarter that batch job retries have elevated error rate during business hours — needs circuit breaker, not just retry backoff." These cross-incident patterns are invisible in individual postmortems but become obvious across a note history.

Runbook gaps: During incident response, note every time you reach for a runbook and find it missing a step, outdated, or pointing to a deprecated service. "Runbook for DB failover doesn't mention the read replica lag window — caused 4 minutes of confusion" becomes a prioritized runbook update after the incident is resolved.

Capacity signals: Note when you observe capacity headroom shrinking, unexpected traffic patterns, or resource utilization crossing thresholds you haven't formally tracked. "Redis memory at 78% on prod-cache-03, projected to hit 90% in 6 weeks at current growth rate" — captured informally and converted to a formal capacity ticket.

Toil observations: Any repetitive manual task you perform gets noted with approximate time cost and frequency. These notes fuel toil reduction prioritization: "Manually rotating service account credentials takes 45 minutes each quarter across 12 services — candidate for Vault automation."

Architecture concerns: During incident investigation, you often discover architectural assumptions that don't hold under load. Note them immediately. "Service X assumes synchronous response from service Y but doesn't implement timeout — cascading failure risk if Y degrades."

The SRE Observation Note

Here's the format that works for SRE-specific observations:

```
Incident: [service/symptom]
Time: [UTC timestamp]
Trigger: [deployment/traffic spike/dependency failure]
Timeline: [key events with times]
Mitigation: [what resolved it]
Root cause hypothesis: [preliminary finding]
Runbook gap: [what was missing]
Follow-up: [ticket or action item]
```
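If you later move notes out of your phone and into tooling, the incident template maps naturally onto a small record type. The following is only a sketch: the `IncidentNote` class and its `render` method are hypothetical names, not part of any note app, and the example values reuse the latency-spike note from earlier with a made-up date and service name.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class IncidentNote:
    """One incident observation; fields mirror the template above."""
    incident: str
    time: datetime
    trigger: str
    timeline: str
    mitigation: str
    root_cause_hypothesis: str
    runbook_gap: str
    follow_up: str

    def render(self) -> str:
        # Convert to UTC so notes written in different time zones line up.
        stamp = self.time.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
        fields = [
            ("Incident", self.incident),
            ("Time", stamp),
            ("Trigger", self.trigger),
            ("Timeline", self.timeline),
            ("Mitigation", self.mitigation),
            ("Root cause hypothesis", self.root_cause_hypothesis),
            ("Runbook gap", self.runbook_gap),
            ("Follow-up", self.follow_up),
        ]
        return "\n".join(f"{label}: {value}" for label, value in fields)


# Hypothetical example: the date and service name are invented for illustration.
note = IncidentNote(
    incident="api latency spike",
    time=datetime(2025, 1, 15, 14, 32, tzinfo=timezone.utc),
    trigger="deployment d4a9f2",
    timeline="14:32 spike started; 14:48 rollback; p95 recovered within 3 minutes",
    mitigation="rollback",
    root_cause_hypothesis="TBD",
    runbook_gap="none noted",
    follow_up="TBD",
)
print(note.render())
```

Passing a timezone-aware `datetime` matters here: a naive value would be interpreted as local time before conversion, which is exactly the ambiguity UTC timestamps are meant to avoid.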

For a capacity observation:

```
Service: [name]
Resource: [CPU/memory/connections/disk]
Current utilization: [%]
Growth rate: [per week/month]
Projected breach: [timeframe]
Action needed: [scale/optimize/alert threshold]
```
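The "Projected breach" field is simple arithmetic once you have a current value and a growth rate. Here is a minimal sketch that assumes growth is linear, which real workloads often are not. The function name `weeks_until_breach` is hypothetical, and the 2 points per week figure is the rate implied by the earlier Redis example (78% to 90% in 6 weeks).

```python
from typing import Optional


def weeks_until_breach(current_pct: float, threshold_pct: float,
                       growth_pct_per_week: float) -> Optional[float]:
    """Weeks until utilization reaches the threshold, assuming linear growth.

    Returns 0.0 if utilization is already at or past the threshold, and
    None if utilization is flat or shrinking (no breach projected).
    """
    if current_pct >= threshold_pct:
        return 0.0
    if growth_pct_per_week <= 0:
        return None
    return (threshold_pct - current_pct) / growth_pct_per_week


# Redis memory at 78%, alert threshold 90%, growing 2 points per week.
print(weeks_until_breach(78, 90, 2))  # 6.0
```

Treat the result as a prompt to open a capacity ticket, not as a forecast. A step change in traffic will invalidate the linear assumption.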

For a toil note:

```
Task: [what you did manually]
Frequency: [how often]
Time cost: [minutes/hours]
Automation opportunity: [tool/approach]
Priority: [high/medium/low based on frequency × cost]
```
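The "frequency × cost" priority can be made concrete by normalizing every task to minutes per year and sorting. This is a rough sketch: `ToilEntry` and `rank_toil` are hypothetical names, the credential-rotation entry reads the earlier example as 45 minutes once per quarter, and the second task is invented purely for comparison.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToilEntry:
    task: str
    runs_per_year: int
    minutes_per_run: float

    @property
    def minutes_per_year(self) -> float:
        # Frequency x cost: the score behind the template's Priority field.
        return self.runs_per_year * self.minutes_per_run


def rank_toil(entries: List[ToilEntry]) -> List[ToilEntry]:
    """Sort toil entries by yearly time cost, most expensive first."""
    return sorted(entries, key=lambda e: e.minutes_per_year, reverse=True)


toil_log = [
    ToilEntry("Rotate service account credentials", runs_per_year=4, minutes_per_run=45),
    # Invented entry, included only to show the ranking.
    ToilEntry("Restart stuck batch worker", runs_per_year=52, minutes_per_run=10),
]

for entry in rank_toil(toil_log):
    print(f"{entry.minutes_per_year:>6.0f} min/year  {entry.task}")
```

A frequent ten-minute task can outrank a slow quarterly one, which is the kind of result that is easy to miss without writing the numbers down.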

Connecting Notes to On-Call Workflow

The best SRE note systems connect mobile capture to on-call tooling. Notes taken during an incident feed directly into postmortem templates. Capacity observations become monitoring alerts. Runbook gaps become prioritized updates.

Némos's pinning system helps here: pin the notes for your current on-call rotation. When you get a page, the context from your last three related incidents is immediately accessible. When you hand off to the next on-call engineer, your notes become their briefing.

Tag notes by service and incident type. An SRE covering a dozen services quickly builds a note corpus that reveals patterns invisible in individual incidents. "Every time service A deploys on Thursday, service B shows elevated p99 latency" — observable only by connecting notes across time.
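Once notes carry consistent tags, spotting repeats is a counting problem. The sketch below assumes notes have been exported as dictionaries with `service` and `incident_type` keys. That export format, the function name `recurring_patterns`, and the sample data are all assumptions for illustration, not a feature of any particular app.

```python
from collections import Counter
from typing import Dict, List, Tuple


def recurring_patterns(notes: List[Dict[str, str]],
                       min_count: int = 2) -> List[Tuple[Tuple[str, str], int]]:
    """Return (service, incident_type) pairs seen at least min_count times."""
    counts = Counter((n["service"], n["incident_type"]) for n in notes)
    # most_common() orders pairs from most to least frequent.
    return [(pair, c) for pair, c in counts.most_common() if c >= min_count]


# Invented sample notes; real ones would come from your tagged note history.
notes = [
    {"service": "service-b", "incident_type": "latency"},
    {"service": "service-a", "incident_type": "deploy-failure"},
    {"service": "service-b", "incident_type": "latency"},
]

for (service, kind), count in recurring_patterns(notes):
    print(f"{service}: {kind} seen {count} times")
```

Counting only works if tags are spelled the same way every time, so it is worth settling on a short fixed vocabulary for incident types.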

From Notes to Runbooks

The highest-leverage thing SRE notes enable is runbook quality. Most runbooks are written once, rarely updated, and contain the happy path only. Notes captured during actual incidents contain the edge cases, the "this step took 20 minutes because of X", and the "if you see Y, do Z instead."

The workflow: capture during the incident → review and clean up within 24 hours → identify runbook gaps → update runbooks before the next incident. SREs who do this consistently have noticeably better on-call experiences than those who rely on memory.

FAQ

Q: Should I take notes during an active incident or wait until after? A: Both. During the incident, capture a raw timeline with timestamps — even bullet points work. After resolution, spend 15 minutes converting the raw notes to a structured postmortem seed. The timestamps captured live are irreplaceable; the analysis can wait.

Q: How do I keep incident notes separate from other engineering notes? A: Use a consistent naming convention: "INC-[service]-[date]" for incident notes. Or use a dedicated Némos notebook for on-call. The key is findability: you'll want to search across past incidents when investigating a new one.

Q: What's the most important thing to capture during an incident? A: The precise timeline with UTC timestamps. Everything else (root cause, contributing factors, action items) can be reconstructed or refined. The timeline — what you observed and when — is only accurate when captured live.

Q: How do SRE notes help with postmortems? A: A postmortem written from live notes has a far more accurate timeline than one reconstructed from memory. It also catches contributing factors that would otherwise be forgotten by the time of the retrospective meeting, and surfaces the "we almost did X but did Y instead" decision points that reveal process improvements.

Q: Should I note every page or just major incidents? A: Note every page with at least a one-line entry: service, trigger, resolution, time to resolve. This creates a frequency and severity baseline. Even "alert fired, false alarm, adjusted threshold" notes are valuable — they help you identify noisy alerts that should be tuned.

Q: How do I use notes to reduce toil? A: Keep a running "toil log" where every manual task gets a time estimate and frequency. Review it monthly. The tasks with high frequency × high time cost are your automation priorities. The log also builds the business case for infrastructure investment.

Q: What about sensitive production data in notes? A: Never capture customer data, PII, or confidential business data in mobile notes. Capture system behavior, timestamps, and patterns, not the content of requests or responses. Write "User signup flow showed 500 errors", not "user <email address> got a 500 error."
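For the last answer above, a small scrubbing step can act as a backstop if you export or share notes. This is a sketch under clear limits: the function name `scrub_emails` and the placeholder text are hypothetical, the pattern is a deliberately simple approximation of email syntax, and it does nothing for other kinds of PII such as names, phone numbers, or account IDs.

```python
import re

# Simplified email pattern; it will miss unusual but valid addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub_emails(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[redacted-email]", text)


print(scrub_emails("user jane@example.com got a 500 error"))
# user [redacted-email] got a 500 error
```

The better habit is still to never write the address down in the first place. A filter like this only catches the slips.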

Taha Baalla · Founder, Némos

Taha built Némos after years of losing screenshots and voice memos across a dozen apps. He writes about on-device AI, personal knowledge management, and building privacy-first tools for iPhone.
