Metrics
What gets measured gets improved. We use a balanced set of metrics to understand our engineering health.
DORA Metrics
The DORA (DevOps Research and Assessment) research program, now part of Google Cloud, identified four key metrics that measure software delivery performance.
1. Deployment Frequency
How often do we deploy to production?
| Level | Frequency | Description |
|---|---|---|
| 🟢 Elite | On-demand (multiple times per day) | Changes flow to production quickly and safely |
| 🟡 High | Between once per day and once per week | Regular, predictable releases |
| 🟠 Medium | Between once per week and once per month | Slower release cycles |
| 🔴 Low | Between once per month and once every 6 months | Big bang releases, high risk |
Why it matters: Higher frequency means smaller changes, less risk, faster feedback.
How to improve:
- Smaller batch sizes
- Feature flags for incomplete work
- Automated deployment pipelines
- Reduce manual approval steps
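As a rough illustration of how a deployment count maps onto the levels in the table, here is a minimal sketch. The function names are hypothetical, a month is assumed to be 30 days, and anything slower than once per month is reported as "Low"; your own tooling may draw the boundaries differently.

```python
def deployments_per_day(deploy_count: int, window_days: int) -> float:
    """Average number of production deployments per day over a window."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return deploy_count / window_days


def classify_deployment_frequency(per_day: float) -> str:
    """Map a deployment rate onto the levels in the table above.

    Assumption: a month is treated as 30 days, and anything slower than
    once per month is reported as "Low".
    """
    if per_day > 1:        # multiple times per day
        return "Elite"
    if per_day >= 1 / 7:   # between once per day and once per week
        return "High"
    if per_day >= 1 / 30:  # between once per week and once per month
        return "Medium"
    return "Low"


if __name__ == "__main__":
    rate = deployments_per_day(deploy_count=84, window_days=7)
    print(rate, classify_deployment_frequency(rate))  # 12.0 Elite
```

Measuring over a fixed window (for example the last 7 or 30 days) keeps the number comparable from one review to the next.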
2. Lead Time for Changes
How long does it take from commit to production?
| Level | Lead Time | Description |
|---|---|---|
| 🟢 Elite | Less than one hour | Rapid feedback and iteration |
| 🟡 High | Between one day and one week | Reasonable pace |
| 🟠 Medium | Between one week and one month | Slower iteration |
| 🔴 Low | Between one month and six months | Long delays |
Why it matters: Shorter lead time means faster value delivery and quicker learning.
How to improve:
- Fast builds and tests
- Parallel pipeline stages
- Automated quality gates
- Reduce manual handoffs
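Lead time can be computed from pairs of commit and deploy timestamps. The sketch below uses the median rather than the mean so that a single slow change does not dominate. That choice, the function name, and the sample data are assumptions for illustration, not a prescribed method.

```python
from datetime import datetime, timedelta
from statistics import median


def median_lead_time(changes: list[tuple[datetime, datetime]]) -> timedelta:
    """Median time from commit to production deploy.

    `changes` holds (committed_at, deployed_at) pairs; raises
    statistics.StatisticsError if the list is empty.
    """
    seconds = [
        (deployed - committed).total_seconds()
        for committed, deployed in changes
    ]
    return timedelta(seconds=median(seconds))


# Hypothetical sample data: three changes committed at the same moment.
COMMITTED = datetime(2024, 1, 8, 9, 0)
SAMPLE_CHANGES = [
    (COMMITTED, COMMITTED + timedelta(minutes=30)),
    (COMMITTED, COMMITTED + timedelta(hours=2)),
    (COMMITTED, COMMITTED + timedelta(hours=26)),  # one slow outlier
]

if __name__ == "__main__":
    print(median_lead_time(SAMPLE_CHANGES))  # 2:00:00
```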
3. Change Failure Rate
What percentage of deployments cause failures?
| Level | Failure Rate | Description |
|---|---|---|
| 🟢 Elite | 0-15% | Most changes succeed |
| 🟡 High | 0-15% | Acceptable failure rate |
| 🟠 Medium | 16-30% | Frequent issues |
| 🔴 Low | 16-30% or higher | Unreliable deployments |
Why it matters: Lower failure rate means more confidence in deployments.
How to improve:
- Comprehensive testing
- Incremental rollouts
- Canary deployments
- Automated rollbacks
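The change failure rate is a simple ratio: deployments that caused a failure divided by all deployments in the same period. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def change_failure_rate(total_deployments: int, failed_deployments: int) -> float:
    """Percentage of deployments that caused a failure in production."""
    if total_deployments <= 0:
        raise ValueError("total_deployments must be positive")
    if not 0 <= failed_deployments <= total_deployments:
        raise ValueError("failed_deployments must be between 0 and total_deployments")
    return 100.0 * failed_deployments / total_deployments


if __name__ == "__main__":
    # Hypothetical month: 50 deployments, 4 of which needed a rollback or hotfix.
    print(change_failure_rate(50, 4))  # 8.0
```

What counts as a "failed" deployment (rollback, hotfix, incident) is a team decision; the number is only comparable over time if that definition stays fixed.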
4. Time to Restore Service
How quickly can we recover from failures?
| Level | Recovery Time | Description |
|---|---|---|
| 🟢 Elite | Less than one hour | Rapid recovery |
| 🟡 High | Less than one day | Same-day recovery |
| 🟠 Medium | Less than one week | Slower recovery |
| 🔴 Low | More than one week | Extended outages |
Why it matters: Faster recovery reduces impact on customers and business.
How to improve:
- Good monitoring and alerting
- Runbooks for common issues
- Automated remediation
- Chaos engineering practice
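Time to restore can be averaged over incidents once each one has a start and a restore timestamp. The sketch below assumes those two timestamps are available per incident; the function name and the sample incidents are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average duration between the start of a failure and its recovery.

    `incidents` holds (started_at, restored_at) pairs.
    """
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (restored - started for started, restored in incidents),
        timedelta(),  # start value so sum() adds timedeltas, not ints
    )
    return total / len(incidents)


# Hypothetical incidents: one 10-minute and one 20-minute outage.
SAMPLE_INCIDENTS = [
    (datetime(2024, 3, 1, 14, 0), datetime(2024, 3, 1, 14, 10)),
    (datetime(2024, 3, 9, 2, 30), datetime(2024, 3, 9, 2, 50)),
]

if __name__ == "__main__":
    print(mean_time_to_restore(SAMPLE_INCIDENTS))  # 0:15:00
```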
Balanced Metrics
DORA metrics tell us about delivery performance. We also track:
Quality Metrics
| Metric | Target | Why |
|---|---|---|
| Test Coverage | >80% | Confidence in changes |
| Code Review Time | <24 hours | Fast feedback |
| Security Vulnerabilities | 0 critical/high | Security first |
| Technical Debt Ratio | <10% | Sustainable codebase |
Productivity Metrics
| Metric | Target | Why |
|---|---|---|
| Build Time | <10 minutes | Fast feedback |
| Time to First PR | <3 days | Reduced WIP |
| Developer Satisfaction | >4/5 | Retention and motivation |
| Onboarding Time | <2 weeks | Team scalability |
Reliability Metrics
| Metric | Target | Why |
|---|---|---|
| Uptime | >99.9% | Availability |
| MTBF (Mean Time Between Failures) | >30 days | Stability |
| Error Rate | <0.1% | Quality in production |
| Alert Fatigue | <2 false alerts/week | Sustainable operations |
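To make the uptime and error-rate targets concrete, it helps to translate them into absolute numbers. The sketch below is a worked example under stated assumptions: a 30-day period, and hypothetical function names and request counts.

```python
def allowed_downtime_minutes(uptime_target_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime a period can absorb while still meeting the target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_target_percent / 100)


def error_rate_percent(failed_requests: int, total_requests: int) -> float:
    """Share of requests that failed, as a percentage."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100.0 * failed_requests / total_requests


if __name__ == "__main__":
    # 99.9% over 30 days leaves roughly 43 minutes of downtime.
    print(round(allowed_downtime_minutes(99.9), 1))
    # 5 failures in 10,000 requests is 0.05%, inside the <0.1% target.
    print(round(error_rate_percent(5, 10_000), 3))
```

Seen this way, a 99.9% target leaves well under an hour of downtime per month, which is one reason the recovery-time metric matters so much.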
Using Metrics Wisely
Do’s ✅
- Use metrics to guide improvement, not punish
- Look at trends, not single data points
- Combine multiple metrics for a balanced view
- Share metrics transparently with the team
- Review metrics regularly in retrospectives
Don’ts ❌
- Do not use metrics to compare teams unfairly
- Do not optimize a single metric at the expense of others
- Do not ignore context when interpreting metrics
- Do not turn a metric into the goal itself; a measure that becomes a target stops being a good measure (Goodhart’s Law)
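The advice to look at trends rather than single data points can be sketched as a rolling average over weekly values. This is a minimal illustration: the function name, the window size, and the sample numbers are all assumptions.

```python
def rolling_mean(values: list[float], window: int) -> list[float]:
    """Mean of each run of `window` consecutive values.

    Returns one value per complete window, so the result is shorter
    than the input by `window - 1` items.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


if __name__ == "__main__":
    # Hypothetical weekly change failure rates (%), with one bad week.
    weekly_failure_rate = [6, 9, 30, 9, 6]
    print(rolling_mean(weekly_failure_rate, window=3))  # [15.0, 16.0, 15.0]
```

The single 30% week looks alarming on its own, while the smoothed series shows a level that is stable from window to window.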
Dashboard Example
A good engineering dashboard shows:
┌─────────────────────────────────────────────────┐
│ Engineering Health Dashboard                    │
├─────────────────────────────────────────────────┤
│ DORA Metrics                                    │
│ ├── Deployment Frequency: 12/day 🟢             │
│ ├── Lead Time: 2 hours 🟢                       │
│ ├── Change Failure Rate: 8% 🟢                  │
│ └── Time to Restore: 15 min 🟢                  │
│                                                 │
│ Quality                                         │
│ ├── Test Coverage: 85% ✅                       │
│ └── Open Vulnerabilities: 2 ⚠️                  │
│                                                 │
│ Productivity                                    │
│ ├── Build Time: 8 min ✅                        │
│ └── Developer Satisfaction: 4.2/5 ✅            │
└─────────────────────────────────────────────────┘