sre-practices
Install
mkdir -p .claude/skills/sre-practices && curl -L -o skill.zip "https://agentskills.codes/api/skills/download/14464" && unzip -o skill.zip -d .claude/skills/sre-practices && rm skill.zip
Installs to .claude/skills/sre-practices
Activation
Site Reliability Engineering practices from Google - the company that invented SRE. Master SLOs, error budgets, incident response, and toil elimination. Use when designing reliable systems, implementing SRE practices, or improving operational excellence. Learn from the team that runs Google Search, Gmail, and YouTube at billion-user scale.
About this skill
SRE Practices - Google Site Reliability Engineering
Expert: Alex Kim (Google SRE, 11 years)
Level: 10/10 - Google invented SRE
Overview
Site Reliability Engineering (SRE), as practiced at Google, is what happens when you ask a software engineer to design an operations team. It is not traditional ops or DevOps under a new name: it applies software engineering to infrastructure and operations problems.
Google runs services for billions of users (Search, Gmail, YouTube, Maps) with a stated 99.99%+ uptime. These practices are what make that achievable.
Core SRE Principles
1. Embrace Risk
100% uptime is the wrong target. Use error budgets to balance reliability vs velocity.
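To make the idea of an error budget concrete, here is a minimal sketch that converts an availability SLO into allowed downtime. The function name and the 30-day window are assumptions chosen for illustration, not something this skill prescribes.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) an availability SLO allows over a window.

    slo is a fraction, e.g. 0.999 for "three nines".
    The 30-day default window is an illustrative assumption.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    # The budget is simply the fraction of the window allowed to be "bad".
    return (1.0 - slo) * window_minutes


if __name__ == "__main__":
    for slo in (0.99, 0.999, 0.9999):
        print(f"SLO {slo}: {error_budget_minutes(slo):.2f} minutes per 30 days")
```

A 99.9% SLO over 30 days leaves roughly 43 minutes of budget. A 100% target leaves none, which is why it is rejected: any change at all would put the target at risk.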
2. Service Level Objectives (SLOs)
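Define and measure service quality with SLIs (service level indicators: what you measure), SLOs (service level objectives: the target for an SLI), and SLAs (service level agreements: the external commitment, with consequences if it is missed).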
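As a rough illustration of how an SLI relates to an SLO, the sketch below computes a request-based availability SLI and compares it with a target. The names `SliReport` and `availability_sli` are hypothetical, and a real SLI would come from your monitoring system, not from hand-entered counts.

```python
from dataclasses import dataclass


@dataclass
class SliReport:
    sli: float        # measured fraction of good requests
    slo: float        # target fraction of good requests
    meets_slo: bool   # True when the measurement is at or above the target


def availability_sli(good_requests: int, total_requests: int, slo: float) -> SliReport:
    """Compute a request-based availability SLI and compare it with an SLO."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= good_requests <= total_requests:
        raise ValueError("good_requests must be between 0 and total_requests")
    sli = good_requests / total_requests
    return SliReport(sli=sli, slo=slo, meets_slo=sli >= slo)


if __name__ == "__main__":
    # Hypothetical counts: 999,500 good requests out of 1,000,000.
    print(availability_sli(999_500, 1_000_000, slo=0.999))
```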
3. Eliminate Toil
Automate manual, repetitive work. Keep toil below 50% of each SRE's time so that the rest goes to engineering work.
4. Monitoring & Alerting
Alert on symptoms (user-facing), not causes. Use golden signals.
5. Incident Response
Run blameless postmortems, define clear escalation paths, and reduce mean time to recovery (MTTR).
6. Capacity Planning
Plan for growth, forecast demand, optimize resource usage.
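One simple way to turn "forecast demand" into a number is compound growth plus a headroom margin. The sketch below assumes a constant monthly growth rate and a fixed per-instance capacity; both are simplifications, and every parameter name and value is illustrative.

```python
import math


def forecast_peak(current_peak_qps: float, monthly_growth: float, months: int) -> float:
    """Project peak load, assuming a constant compound monthly growth rate."""
    if months < 0:
        raise ValueError("months must be non-negative")
    return current_peak_qps * (1.0 + monthly_growth) ** months


def instances_needed(peak_qps: float, qps_per_instance: float, headroom: float = 0.3) -> int:
    """Instances required to serve peak_qps while keeping `headroom` of capacity free.

    headroom=0.3 means each instance is planned to run at no more than 70% of its
    capacity. The 30% default is an assumption, not a recommendation from the source.
    """
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    usable_qps = qps_per_instance * (1.0 - headroom)
    return math.ceil(peak_qps / usable_qps)


if __name__ == "__main__":
    peak = forecast_peak(current_peak_qps=1000, monthly_growth=0.10, months=2)
    print(f"forecast peak: {peak:.0f} qps, instances: {instances_needed(peak, 100)}")
```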
SRE Workflow
- Define SLOs - What reliability do users need?
- Measure SLIs - Track service quality metrics
- Monitor error budget - How much budget consumed?
- Respond to incidents - Restore service quickly
- Conduct postmortems - Learn from failures
- Automate toil - Reduce manual work
- Plan capacity - Scale for growth
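The "monitor error budget" step in the workflow above can be sketched as follows. The helper names and the release policy string are hypothetical; a real error budget policy is something each team has to agree on for itself.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the budget is being spent relative to plan.

    A burn rate of 1.0 uses up the budget exactly at the end of the SLO window;
    2.0 uses it up in half the window.
    """
    return error_ratio / (1.0 - slo)


def budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1.0 - slo) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


def release_decision(remaining: float) -> str:
    """Illustrative policy: stop feature releases once the budget is exhausted."""
    return "freeze feature releases" if remaining <= 0 else "releases allowed"


if __name__ == "__main__":
    remaining = budget_remaining(slo=0.999, good_events=999_500, total_events=1_000_000)
    print(f"budget remaining: {remaining:.0%} -> {release_decision(remaining)}")
```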
Google's Production Scale
SRE practices power:
- Google Search: 8.5 billion searches/day
- Gmail: 1.8 billion users
- YouTube: 2 billion users, 1 billion hours/day
- Google Maps: 1 billion users
- 99.99%+ uptime across all services
Golden Signals (Google's 4 Key Metrics)
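- Latency - Time taken to serve a request; track successful and failed requests separately
- Traffic - Demand on the system, e.g. requests per second
- Errors - Rate of failed requests
- Saturation - How full the service is, i.e. utilization of its most constrained resource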
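As a hedged sketch of what the four signals look like as numbers, the function below derives them from a batch of request records. The `Request` type, the choice of p99 for latency, and the use of CPU utilization as the saturation measure are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def golden_signals(requests: list, window_seconds: float, cpu_utilization: float) -> dict:
    """Summarize the four golden signals for one observation window."""
    total = len(requests)
    if total == 0:
        raise ValueError("need at least one request")
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 99th percentile: the smallest value with >= 99% of samples at or below it.
    p99_index = math.ceil(99 * total / 100) - 1
    failed = sum(1 for r in requests if not r.ok)
    return {
        "latency_p99_ms": latencies[p99_index],
        "traffic_rps": total / window_seconds,
        "error_ratio": failed / total,
        "saturation": cpu_utilization,  # supplied by the caller, e.g. from host metrics
    }


if __name__ == "__main__":
    sample = [Request(latency_ms=float(i), ok=(i > 2)) for i in range(1, 101)]
    print(golden_signals(sample, window_seconds=10, cpu_utilization=0.6))
```

Symptom-based alerts would then be defined on values like `error_ratio` and `latency_p99_ms`, since those are what users actually experience.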
Best Practices
- SLOs over SLAs - Keep internal targets (SLOs) stricter than external commitments (SLAs)
- Error budget policy - Define consequences when budget exhausted
- Blameless culture - Learn from failures, don't blame
- Toil automation - Invest in eliminating repetitive work
- On-call sustainability - At most 25% of time on-call and at most 50% on operational work overall (on-call plus tickets), leaving at least 50% for engineering
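The on-call caps can be checked against a team's time tracking. This sketch reads them as at most 25% of time on-call and at most 50% on operational work overall; the category names (`on_call`, `tickets`, `toil`) are hypothetical labels, not a standard schema.

```python
ON_CALL_CAP = 0.25      # max share of time spent on-call
OPERATIONS_CAP = 0.50   # max share of time on all operational work combined

# Hypothetical category labels counted as operational work.
OPERATIONAL_CATEGORIES = ("on_call", "tickets", "toil")


def check_time_split(hours: dict) -> list:
    """Return a list of cap violations for a mapping of category -> hours worked."""
    total = sum(hours.values())
    if total <= 0:
        raise ValueError("total hours must be positive")
    on_call_share = hours.get("on_call", 0) / total
    ops_share = sum(hours.get(c, 0) for c in OPERATIONAL_CATEGORIES) / total
    violations = []
    if on_call_share > ON_CALL_CAP:
        violations.append(f"on-call share {on_call_share:.0%} exceeds {ON_CALL_CAP:.0%}")
    if ops_share > OPERATIONS_CAP:
        violations.append(f"operational share {ops_share:.0%} exceeds {OPERATIONS_CAP:.0%}")
    return violations


if __name__ == "__main__":
    print(check_time_split({"on_call": 15, "tickets": 10, "project": 15}))
```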
Related Skills
- kubernetes-expert - Infrastructure platform
- observability - Monitoring & tracing
- chaos-engineering - Resilience testing
Last Updated: 2026-02-03
Expert: Alex Kim (Google SRE, 11 years) - Runs billion-user services