Playbook

Step-by-step guides for common Engineering Excellence scenarios.

Starting a New Project

1. Repository Setup

# Create repository
git init

# Add initial files
touch README.md
mkdir -p src tests docs

# Add .gitignore (use templates)
curl -o .gitignore https://raw.githubusercontent.com/github/gitignore/main/Python.gitignore

# Initial commit
git add .
git commit -m "Initial commit"

2. CI/CD Pipeline

Create .github/workflows/ci.yml:

name: CI

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    
    - name: Set up environment
      uses: actions/setup-python@v4
      with:
        python-version: '3.11'
    
    - name: Install dependencies
      run: |
        pip install -r requirements.txt
        pip install -r requirements-dev.txt
    
    - name: Lint
      run: flake8 src
    
    - name: Type check
      run: mypy src
    
    - name: Test
      run: pytest --cov=src --cov-report=xml
    
    - name: Security scan
      run: bandit -r src

3. Project Structure

my-project/
├── README.md              # Project overview
├── LICENSE
├── .gitignore
├── requirements.txt       # Dependencies
├── requirements-dev.txt   # Dev dependencies
├── Makefile              # Common commands
├── src/                  # Source code
│   └── myproject/
│       ├── __init__.py
│       └── main.py
├── tests/                # Tests
│   ├── __init__.py
│   ├── unit/
│   └── integration/
├── docs/                 # Documentation
│   ├── adr/             # Architecture decisions
│   └── runbooks/        # Operational guides
└── .github/
    └── workflows/        # CI/CD

Onboarding a New Team Member

Week 1: Environment & Culture

Day 1:

Machine setup (use automated scripts)
Repository access
Team introductions
Product overview
First commit (update README with their name)

Day 2-3:

Codebase walkthrough (pair with buddy)
Run tests locally
First code review (observe)
Documentation review

Day 4-5:

First bug fix (small, guided)
Deploy to staging
Attend team ceremonies

Week 2-4: Increasing Independence

First feature (small, with support)
On-call shadowing
Write/update documentation
Present in team demo

Buddy System

Assign an “onboarding buddy” who:

Checks in daily during week 1
Available for questions
Reviews first few PRs
Gives feedback at 30 days

Running a Postmortem

When to Run

After any incident that:

Affected customers
Required manual intervention
Took >30 minutes to resolve

Timeline

Within 24 hours: Initial timeline
Within 1 week: Full postmortem meeting
Within 2 weeks: Action items completed

Template

# Postmortem: [Incident Name]

## Summary
- **Date**: YYYY-MM-DD
- **Duration**: XX minutes
- **Impact**: [What was affected]
- **Severity**: P0/P1/P2

## Timeline (all times in UTC)
- 09:00 - Issue detected via alert
- 09:05 - Engineer paged
- 09:15 - Root cause identified
- 09:30 - Fix deployed
- 09:45 - Service fully recovered

## Root Cause
[What caused the issue]

## Impact
- Users affected: XXX
- Transactions failed: XXX
- Revenue impact: $XXX

## Detection
How did we know? (monitoring, customer report, etc.)

## Resolution
What fixed it?

## Lessons Learned
What went well:
- 

What could be better:
-

## Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| | | |

## Prevention
How do we prevent this from happening again?

Ground Rules

Blameless: Focus on systems, not people
Psychologically safe: Everyone can speak freely
Action-oriented: Come out with concrete improvements
Shared widely: Postmortems are learning opportunities

Optimizing CI/CD Pipeline

Diagnosing Slow Pipelines

# Check build time breakdown
echo "Build stages:"
git log --pretty=format:"%h %s" -10

Common Bottlenecks:

Large dependencies: Cache them
Sequential tests: Run in parallel
Heavy integration tests: Split into separate job
Large Docker images: Use multi-stage builds

Caching Strategy

- uses: actions/cache@v3
  with:
    path: |
      ~/.cache/pip
      ~/.npm
    key: ${{ runner.os }}-deps-${{ hashFiles('**/requirements.txt') }}

Parallelization

strategy:
  matrix:
    test-group: [unit, integration, e2e]

Optimization Checklist

Dependencies cached
Tests run in parallel
Only affected tests run on PR
Docker layers cached
Artifacts cleaned up
Unnecessary steps removed

Handling Technical Debt

Assessment

Quantify the debt:

Code complexity (cyclomatic)
Test coverage gaps
Outdated dependencies
Documentation debt

Prioritize:

Impact: How much does it slow us down?
Risk: What could break?
Effort: How hard to fix?

Payback Strategies

Boy Scout Rule: Leave code cleaner than you found it

Every PR touches debt in related files
Small, continuous improvements

Debt Sprints: Dedicate 20% of sprint to debt

One day per week
Rotating ownership

Big Bang: Major refactoring project

When debt is overwhelming
Requires dedicated time

Template: Tech Debt Proposal

# Tech Debt Proposal: [Title]

## Current State
[What is the problem]

## Impact
- Development time: +X%
- Bugs per release: X
- On-call interrupts: X/week

## Proposed Solution
[What we will do]

## Effort
- Estimated: X days
- Team: X engineers
- Timeline: X weeks

## Success Metrics
- [ ] Metric 1
- [ ] Metric 2

## Risks
- 

## Rollback Plan
-

Implementing Feature Flags

When to Use

Gradual rollout: Enable for 1% → 10% → 100%
A/B testing: Compare variants
Kill switch: Disable without deploy
Premium features: Control access

Implementation

Using LaunchDarkly:

from ldclient import LDClient

ld_client = LDClient(sdk_key="your-key")

if ld_client.variation("new-feature", user_context, False):
    show_new_feature()
else:
    show_old_feature()

Best Practices

Short-lived: Remove flags after rollout
Clear naming: feature-checkout-redesign
Default off: Safe if flag service fails
Track usage: Know which flags are active

Cleanup

# Find stale flags
grep -r "feature-.*: true" src/ | grep -v test

# After 2 weeks at 100%, remove flag

Measuring Team Health

Metrics to Track

Velocity Trends:

Story points completed per sprint
Cycle time (start to finish)
Work in progress (WIP) limits

Quality Metrics:

Bugs escaped to production
Time spent on bugs vs features
Customer-reported issues

Team Sentiment:

Regular retrospectives
1:1 discussions
Engagement surveys

Warning Signs

Velocity dropping for 3+ sprints
Increasing bug count
Rising cycle time
Decreasing satisfaction scores

Interventions

Process changes: Retrospective actions
Technical changes: Pay down debt
Team changes: Adjust scope or staffing
Support: Training or mentorship