Playbook
Step-by-step guides for common Engineering Excellence scenarios.
Starting a New Project
1. Repository Setup
# Create repository
git init
# Add initial files
touch README.md
mkdir -p src tests docs
# Add .gitignore (use templates)
curl -o .gitignore https://raw.githubusercontent.com/github/gitignore/main/Python.gitignore
# Initial commit
git add .
git commit -m "Initial commit"
2. CI/CD Pipeline
Create .github/workflows/ci.yml:
name: CI
on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up environment
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install -r requirements-dev.txt
      - name: Lint
        run: flake8 src
      - name: Type check
        run: mypy src
      - name: Test
        run: pytest --cov=src --cov-report=xml
      - name: Security scan
        run: bandit -r src
3. Project Structure
my-project/
├── README.md                # Project overview
├── LICENSE
├── .gitignore
├── requirements.txt         # Dependencies
├── requirements-dev.txt     # Dev dependencies
├── Makefile                 # Common commands
├── src/                     # Source code
│   └── myproject/
│       ├── __init__.py
│       └── main.py
├── tests/                   # Tests
│   ├── __init__.py
│   ├── unit/
│   └── integration/
├── docs/                    # Documentation
│   ├── adr/                 # Architecture decisions
│   └── runbooks/            # Operational guides
└── .github/
    └── workflows/           # CI/CD
Onboarding a New Team Member
Week 1: Environment & Culture
Day 1:
- Machine setup (use automated scripts)
- Repository access
- Team introductions
- Product overview
- First commit (update README with their name)
Day 2-3:
- Codebase walkthrough (pair with buddy)
- Run tests locally
- First code review (observe)
- Documentation review
Day 4-5:
- First bug fix (small, guided)
- Deploy to staging
- Attend team ceremonies
Week 2-4: Increasing Independence
- First feature (small, with support)
- On-call shadowing
- Write/update documentation
- Present in team demo
Buddy System
Assign an “onboarding buddy” who:
- Checks in daily during week 1
- Is available for questions
- Reviews first few PRs
- Gives feedback at 30 days
Running a Postmortem
When to Run
After any incident that:
- Affected customers
- Required manual intervention
- Took >30 minutes to resolve
Timeline
- Within 24 hours: Initial timeline
- Within 1 week: Full postmortem meeting
- Within 2 weeks: Action items completed
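The trigger criteria and deadlines above can be encoded so that incident tooling applies them consistently. The following is a minimal sketch, not a prescribed implementation: the `Incident` record and its field names are hypothetical, and counting the deadlines from the resolution time is an assumption, since the playbook does not state the starting point.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    affected_customers: bool
    manual_intervention: bool
    minutes_to_resolve: int
    resolved_at: datetime


def needs_postmortem(incident: Incident) -> bool:
    """Return True if any of the three trigger criteria is met."""
    return (
        incident.affected_customers
        or incident.manual_intervention
        or incident.minutes_to_resolve > 30
    )


def postmortem_deadlines(incident: Incident) -> dict[str, datetime]:
    """Compute the three deadlines, assuming the clock starts at resolution."""
    start = incident.resolved_at
    return {
        "initial_timeline": start + timedelta(hours=24),
        "postmortem_meeting": start + timedelta(weeks=1),
        "action_items_completed": start + timedelta(weeks=2),
    }
```

If your team counts the deadlines from detection rather than resolution, only the `start` value needs to change.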
Template
# Postmortem: [Incident Name]
## Summary
- **Date**: YYYY-MM-DD
- **Duration**: XX minutes
- **Impact**: [What was affected]
- **Severity**: P0/P1/P2
## Timeline (all times in UTC)
- 09:00 - Issue detected via alert
- 09:05 - Engineer paged
- 09:15 - Root cause identified
- 09:30 - Fix deployed
- 09:45 - Service fully recovered
## Root Cause
[What caused the issue]
## Impact
- Users affected: XXX
- Transactions failed: XXX
- Revenue impact: $XXX
## Detection
How did we know? (monitoring, customer report, etc.)
## Resolution
What fixed it?
## Lessons Learned
What went well:
-
What could be better:
-
## Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| | | |
## Prevention
How do we prevent this from happening again?
Ground Rules
- Blameless: Focus on systems, not people
- Psychologically safe: Everyone can speak freely
- Action-oriented: Come out with concrete improvements
- Shared widely: Postmortems are learning opportunities
Optimizing CI/CD Pipeline
Diagnosing Slow Pipelines
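# List the 10 most recent commits to correlate with slow pipeline runs
# (per-stage timings come from your CI provider's run summary, not from git)
echo "Recent commits:"
git log --pretty=format:"%h %s" -10
Common Bottlenecks: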
- Large dependencies: Cache them
- Sequential tests: Run in parallel
- Heavy integration tests: Split into separate job
- Large Docker images: Use multi-stage builds
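To decide which of these bottlenecks to tackle first, it helps to see how much of the total run each stage consumes. The sketch below assumes you have already collected per-stage durations (for example, copied from the CI run summary). The stage names and timings are hypothetical.

```python
def stage_breakdown(durations: dict[str, float]) -> list[tuple[str, float, float]]:
    """Return (stage, seconds, percent of total) rows, slowest stage first."""
    total = sum(durations.values())
    if total <= 0:
        return []
    rows = [(name, secs, 100.0 * secs / total) for name, secs in durations.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical timings in seconds; replace with values from your CI provider.
    example = {
        "install": 120.0,
        "lint": 20.0,
        "unit tests": 60.0,
        "integration tests": 200.0,
    }
    for name, secs, pct in stage_breakdown(example):
        print(f"{name:<20} {secs:>6.0f}s {pct:>5.1f}%")
```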
Caching Strategy
- uses: actions/cache@v3
  with:
    path: |
      ~/.cache/pip
      ~/.npm
    key: ${{ runner.os }}-deps-${{ hashFiles('**/requirements.txt') }}
Parallelization
strategy:
  matrix:
    test-group: [unit, integration, e2e]
Optimization Checklist
- Dependencies cached
- Tests run in parallel
- Only affected tests run on PR
- Docker layers cached
- Artifacts cleaned up
- Unnecessary steps removed
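One way to approach the "only affected tests run on PR" item is to map changed files to the tests that cover them. The sketch below assumes a hypothetical naming convention in which `src/myproject/<name>.py` is covered by `tests/unit/test_<name>.py`. Real projects often need a dependency-aware tool instead, so treat this as an illustration of the idea only.

```python
from pathlib import PurePosixPath


def affected_tests(changed_files: list[str]) -> list[str]:
    """Map changed Python files to the unit tests assumed to cover them.

    Assumes the hypothetical convention that src/myproject/<name>.py is
    covered by tests/unit/test_<name>.py. Changed test files are kept as-is.
    """
    selected: set[str] = set()
    for raw in changed_files:
        path = PurePosixPath(raw)
        if path.suffix != ".py":
            continue  # Non-Python changes select no tests in this sketch.
        if path.parts[0] == "tests":
            selected.add(str(path))
        elif path.parts[0] == "src" and path.name != "__init__.py":
            selected.add(f"tests/unit/test_{path.stem}.py")
    return sorted(selected)
```

The list of changed files could come from `git diff --name-only origin/main...HEAD`, and the result can be passed to `pytest` as positional arguments.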
Handling Technical Debt
Assessment
Quantify the debt:
- Code complexity (cyclomatic)
- Test coverage gaps
- Outdated dependencies
- Documentation debt
Prioritize:
- Impact: How much does it slow us down?
- Risk: What could break?
- Effort: How hard to fix?
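The three prioritization questions can be turned into a rough ranking. The sketch below assumes each factor is rated from 1 (low) to 5 (high) and combines them as impact times risk divided by effort. Both the scale and the formula are assumptions for illustration, not a rule from this playbook.

```python
from dataclasses import dataclass


@dataclass
class DebtItem:
    """Hypothetical debt record; each factor is rated 1 (low) to 5 (high)."""
    name: str
    impact: int
    risk: int
    effort: int


def priority(item: DebtItem) -> float:
    """Higher impact and risk raise priority; higher effort lowers it."""
    return (item.impact * item.risk) / item.effort


def rank(items: list[DebtItem]) -> list[DebtItem]:
    """Return items ordered from highest to lowest priority."""
    return sorted(items, key=priority, reverse=True)
```

With these ratings, a cheap fix for a moderately painful problem can outrank an expensive rewrite, which is usually the behavior you want from a first-pass ranking.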
Payback Strategies
Boy Scout Rule: Leave code cleaner than you found it
- Every PR touches debt in related files
- Small, continuous improvements
Debt Sprints: Dedicate 20% of sprint to debt
- One day per week
- Rotating ownership
Big Bang: Major refactoring project
- When debt is overwhelming
- Requires dedicated time
Template: Tech Debt Proposal
# Tech Debt Proposal: [Title]
## Current State
[What is the problem]
## Impact
- Development time: +X%
- Bugs per release: X
- On-call interrupts: X/week
## Proposed Solution
[What we will do]
## Effort
- Estimated: X days
- Team: X engineers
- Timeline: X weeks
## Success Metrics
- [ ] Metric 1
- [ ] Metric 2
## Risks
-
## Rollback Plan
-
Implementing Feature Flags
When to Use
- Gradual rollout: Enable for 1% → 10% → 100%
- A/B testing: Compare variants
- Kill switch: Disable without deploy
- Premium features: Control access
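A gradual rollout needs each user to land in a stable bucket, so that someone enabled at 1% stays enabled at 10% and 100%. The sketch below shows one common way to do this with a hash. It is a generic technique, not a description of how any particular flag service assigns users, and the function and parameter names are hypothetical.

```python
import hashlib


def in_rollout(flag_key: str, user_id: str, percent: float) -> bool:
    """Return True if the user falls inside the rollout percentage.

    Hashing flag_key together with user_id gives each user a stable bucket
    in [0, 100), so raising the percentage only ever adds users.
    """
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10000 / 100.0  # 0.00 to 99.99
    return bucket < percent
```

Including the flag key in the hash means different flags get different 1% cohorts, so the same users are not always the first to see every new feature.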
Implementation
Using LaunchDarkly:
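from ldclient import LDClient
from ldclient.config import Config

# Create the client once at startup and reuse it
ld_client = LDClient(config=Config("your-sdk-key"))

# The third argument is the default returned if the flag cannot be evaluated
if ld_client.variation("new-feature", user_context, False):
    show_new_feature()
else:
    show_old_feature()
Best Practices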
- Short-lived: Remove flags after rollout
- Clear naming: feature-checkout-redesign
- Default off: Safe if flag service fails
- Track usage: Know which flags are active
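Keeping flags short-lived is easier when stale ones are reported automatically. The sketch below assumes you can export, for each flag, the date it reached 100% rollout. The input mapping is hypothetical, and the two-week default is a parameter you can adjust.

```python
from datetime import date, timedelta


def stale_flags(
    fully_rolled_out_on: dict[str, date],
    today: date,
    max_age: timedelta = timedelta(weeks=2),
) -> list[str]:
    """Return flag names that have been at 100% for at least max_age."""
    return sorted(
        name
        for name, since in fully_rolled_out_on.items()
        if today - since >= max_age
    )
```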
Cleanup
# Find stale flags
grep -r "feature-.*: true" src/ | grep -v test
# After 2 weeks at 100%, remove flag
Measuring Team Health
Metrics to Track
Velocity Trends:
- Story points completed per sprint
- Cycle time (start to finish)
- Work in progress (WIP) limits
Quality Metrics:
- Bugs escaped to production
- Time spent on bugs vs features
- Customer-reported issues
Team Sentiment:
- Regular retrospectives
- 1:1 discussions
- Engagement surveys
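Of the metrics above, cycle time is the most mechanical to compute. The sketch below assumes each work item is exported from your tracker as a hypothetical (started, finished) timestamp pair. It reports the median because a single long-running item can distort the mean.

```python
from datetime import datetime
from statistics import median


def cycle_times_days(items: list[tuple[datetime, datetime]]) -> list[float]:
    """Return cycle time in days for each (started, finished) pair."""
    return [(done - started).total_seconds() / 86400 for started, done in items]


def median_cycle_time(items: list[tuple[datetime, datetime]]) -> float:
    """Median cycle time in days; raises StatisticsError if items is empty."""
    return median(cycle_times_days(items))
```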
Warning Signs
- Velocity dropping for 3+ sprints
- Increasing bug count
- Rising cycle time
- Decreasing satisfaction scores
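The first warning sign can be checked automatically from sprint history. The sketch below interprets "dropping for 3+ sprints" as three or more consecutive sprint-over-sprint decreases ending at the most recent sprint. That interpretation and the function names are assumptions.

```python
def consecutive_drops(velocities: list[float]) -> int:
    """Count how many sprints in a row, ending at the latest, velocity fell."""
    drops = 0
    # Walk backwards through (previous sprint, current sprint) pairs.
    for previous, current in zip(reversed(velocities[:-1]), reversed(velocities[1:])):
        if current < previous:
            drops += 1
        else:
            break
    return drops


def velocity_warning(velocities: list[float], threshold: int = 3) -> bool:
    """Flag the warning sign: velocity dropping for `threshold` or more sprints."""
    return consecutive_drops(velocities) >= threshold
```

A history of `[30, 28, 25, 21]` triggers the warning, while `[30, 32, 28, 25]` does not, because only the last two sprints declined.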
Interventions
- Process changes: Retrospective actions
- Technical changes: Pay down debt
- Team changes: Adjust scope or staffing
- Support: Training or mentorship