Playbook

Step-by-step guides for common Engineering Excellence scenarios.

Starting a New Project

1. Repository Setup

# Create repository
git init

# Add initial files
touch README.md
mkdir -p src tests docs

# Add .gitignore (use templates)
curl -o .gitignore https://raw.githubusercontent.com/github/gitignore/main/Python.gitignore

# Initial commit
git add .
git commit -m "Initial commit"

2. CI/CD Pipeline

Create .github/workflows/ci.yml:

name: CI

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    
    - name: Set up environment
      uses: actions/setup-python@v4
      with:
        python-version: '3.11'
    
    - name: Install dependencies
      run: |
        pip install -r requirements.txt
        pip install -r requirements-dev.txt
    
    - name: Lint
      run: flake8 src
    
    - name: Type check
      run: mypy src
    
    - name: Test
      run: pytest --cov=src --cov-report=xml
    
    - name: Security scan
      run: bandit -r src
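
The pipeline above assumes requirements-dev.txt provides the lint, type-check, test, and security tools it runs. A minimal sketch (illustrative pins; adjust to your stack):

# requirements-dev.txt
flake8==7.0.0
mypy==1.8.0
pytest==8.0.0
pytest-cov==4.1.0
bandit==1.7.7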

3. Project Structure

my-project/
├── README.md               # Project overview
├── LICENSE
├── .gitignore
├── requirements.txt        # Dependencies
├── requirements-dev.txt    # Dev dependencies
├── Makefile                # Common commands
├── src/                    # Source code
│   └── myproject/
│       ├── __init__.py
│       └── main.py
├── tests/                  # Tests
│   ├── __init__.py
│   ├── unit/
│   └── integration/
├── docs/                   # Documentation
│   ├── adr/                # Architecture decisions
│   └── runbooks/           # Operational guides
└── .github/
    └── workflows/          # CI/CD
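
The Makefile in the tree wraps the same commands CI runs, so local checks match the pipeline. A minimal sketch (recipe lines must be indented with tabs):

.PHONY: install lint typecheck test

install:
	pip install -r requirements.txt -r requirements-dev.txt

lint:
	flake8 src

typecheck:
	mypy src

test:
	pytest --cov=src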

Onboarding a New Team Member

Week 1: Environment & Culture

Day 1:

  • Machine setup (use automated scripts; see the sketch after this list)
  • Repository access
  • Team introductions
  • Product overview
  • First commit (update README with their name)
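
A hypothetical bootstrap script for the machine-setup step (the repository URL and steps are placeholders; adapt to your team):

#!/usr/bin/env bash
# setup.sh - one-shot environment bootstrap
set -euo pipefail

git clone https://github.com/your-org/my-project.git
cd my-project
pip install -r requirements.txt -r requirements-dev.txt

# Prove the environment works before the day ends
pytest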

Days 2-3:

  • Codebase walkthrough (pair with buddy)
  • Run tests locally
  • First code review (observe)
  • Documentation review

Days 4-5:

  • First bug fix (small, guided)
  • Deploy to staging
  • Attend team ceremonies

Weeks 2-4: Increasing Independence

  • First feature (small, with support)
  • On-call shadowing
  • Write/update documentation
  • Present in team demo

Buddy System

Assign an “onboarding buddy” who:

  • Checks in daily during week 1
  • Is available for questions
  • Reviews first few PRs
  • Gives feedback at 30 days

Running a Postmortem

When to Run

After any incident that:

  • Affected customers
  • Required manual intervention
  • Took >30 minutes to resolve

Timeline

  • Within 24 hours: Initial timeline
  • Within 1 week: Full postmortem meeting
  • Within 2 weeks: Action items completed

Template

# Postmortem: [Incident Name]

## Summary
- **Date**: YYYY-MM-DD
- **Duration**: XX minutes
- **Impact**: [What was affected]
- **Severity**: P0/P1/P2

## Timeline (all times in UTC)
- 09:00 - Issue detected via alert
- 09:05 - Engineer paged
- 09:15 - Root cause identified
- 09:30 - Fix deployed
- 09:45 - Service fully recovered

## Root Cause
[What caused the issue]

## Impact
- Users affected: XXX
- Transactions failed: XXX
- Revenue impact: $XXX

## Detection
How did we know? (monitoring, customer report, etc.)

## Resolution
What fixed it?

## Lessons Learned
What went well:
- 

What could be better:
-

## Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| | | |

## Prevention
How do we prevent this from happening again?

Ground Rules

  • Blameless: Focus on systems, not people
  • Psychologically safe: Everyone can speak freely
  • Action-oriented: Come out with concrete improvements
  • Shared widely: Postmortems are learning opportunities

Optimizing CI/CD Pipeline

Diagnosing Slow Pipelines

# List recent workflow runs with status and elapsed time (GitHub CLI)
gh run list --limit 10

# Drill into a single run for per-job timing
gh run view <run-id>

Common Bottlenecks:

  1. Large dependencies: Cache them
  2. Sequential tests: Run in parallel
  3. Heavy integration tests: Split into separate job
  4. Large Docker images: Use multi-stage builds (see the Dockerfile sketch below)
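
A multi-stage Dockerfile keeps the build toolchain out of the final image. A sketch for the Python layout above (paths and entrypoint are illustrative):

# Stage 1: install dependencies with the full toolchain
FROM python:3.11 AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --prefix=/install -r requirements.txt

# Stage 2: ship only the runtime and installed packages
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY src/ ./src/
ENV PYTHONPATH=/app/src
CMD ["python", "-m", "myproject.main"]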

Caching Strategy

- uses: actions/cache@v3
  with:
    path: |
      ~/.cache/pip
      ~/.npm
    key: ${{ runner.os }}-deps-${{ hashFiles('**/requirements*.txt', '**/package-lock.json') }}

Parallelization

strategy:
  matrix:
    test-group: [unit, integration, e2e]
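
In context, each matrix entry becomes its own parallel job, and the group name selects which tests run (this assumes a tests/e2e directory alongside the unit and integration ones shown earlier):

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        test-group: [unit, integration, e2e]
    steps:
      - uses: actions/checkout@v3
      - name: Run one test group
        run: pytest tests/${{ matrix.test-group }}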

Optimization Checklist

  • Dependencies cached
  • Tests run in parallel
  • Only affected tests run on PR
  • Docker layers cached (see the sketch after this list)
  • Artifacts cleaned up
  • Unnecessary steps removed
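
For the Docker layer caching item, GitHub Actions can persist BuildKit layers through the docker build-push action's cache options (a sketch; tags and push settings are up to you):

- uses: docker/setup-buildx-action@v3

- uses: docker/build-push-action@v5
  with:
    context: .
    push: false
    tags: my-project:ci
    cache-from: type=gha
    cache-to: type=gha,mode=max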

Handling Technical Debt

Assessment

Quantify the debt:

  • Code complexity (cyclomatic)
  • Test coverage gaps
  • Outdated dependencies
  • Documentation debt
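
Standard tools can put numbers on most of these, for example:

# Cyclomatic complexity per function (radon is a common Python tool)
radon cc src -s -a

# Test coverage gaps
pytest --cov=src --cov-report=term-missing

# Outdated dependencies
pip list --outdated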

Prioritize:

  • Impact: How much does it slow us down?
  • Risk: What could break?
  • Effort: How hard to fix?
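
One simple way to combine those three scores into a ranking (a sketch, assuming 1-5 scores; weight them however your team prefers):

def debt_priority(impact: int, risk: int, effort: int) -> float:
    """Higher is more urgent: big impact or risk, low effort."""
    return (impact + risk) / effort

# Example: slows everyone down (5), moderate risk (3), two-day fix (2)
print(debt_priority(impact=5, risk=3, effort=2))  # 4.0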

Payback Strategies

Boy Scout Rule: Leave code cleaner than you found it

  • Every PR touches debt in related files
  • Small, continuous improvements

Debt Sprints: Dedicate 20% of each sprint to debt

  • One day per week
  • Rotating ownership

Big Bang: Major refactoring project

  • When debt is overwhelming
  • Requires dedicated time

Template: Tech Debt Proposal

# Tech Debt Proposal: [Title]

## Current State
[What the problem is]

## Impact
- Development time: +X%
- Bugs per release: X
- On-call interrupts: X/week

## Proposed Solution
[What we will do]

## Effort
- Estimated: X days
- Team: X engineers
- Timeline: X weeks

## Success Metrics
- [ ] Metric 1
- [ ] Metric 2

## Risks
- 

## Rollback Plan
-

Implementing Feature Flags

When to Use

  • Gradual rollout: Enable for 1% → 10% → 100%
  • A/B testing: Compare variants
  • Kill switch: Disable without deploy
  • Premium features: Control access

Implementation

Using LaunchDarkly:

import ldclient
from ldclient import Context
from ldclient.config import Config

# Initialize once at application startup
ldclient.set_config(Config("your-sdk-key"))
ld_client = ldclient.get()

# Evaluate the flag for this user; the final argument is the
# default returned if the flag service is unreachable
user_context = Context.builder("user-key").build()
if ld_client.variation("new-feature", user_context, False):
    show_new_feature()
else:
    show_old_feature()

Best Practices

  • Short-lived: Remove flags after rollout
  • Clear naming: feature-checkout-redesign
  • Default off: Safe if flag service fails
  • Track usage: Know which flags are active

Cleanup

# Find flag checks still in the code
grep -rn 'variation("feature-' src/ | grep -v test

# After 2 weeks at 100%, remove the flag and the dead code path
Measuring Team Health

Metrics to Track

Velocity Trends:

  • Story points completed per sprint
  • Cycle time (start to finish; see the sketch after this list)
  • Work in progress (WIP) against agreed limits
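
Cycle time, for instance, is easy to compute from work-item timestamps (the field names here are hypothetical):

from datetime import date

# Hypothetical work items with start and finish dates
items = [
    {"started": date(2024, 1, 2), "finished": date(2024, 1, 5)},
    {"started": date(2024, 1, 3), "finished": date(2024, 1, 10)},
    {"started": date(2024, 1, 8), "finished": date(2024, 1, 9)},
]

cycle_times = sorted((i["finished"] - i["started"]).days for i in items)
median = cycle_times[len(cycle_times) // 2]
print(f"Median cycle time: {median} days")  # -> 3 days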

Quality Metrics:

  • Bugs escaped to production
  • Time spent on bugs vs features
  • Customer-reported issues

Team Sentiment:

  • Regular retrospectives
  • 1:1 discussions
  • Engagement surveys

Warning Signs

  • Velocity dropping for 3+ sprints
  • Increasing bug count
  • Rising cycle time
  • Decreasing satisfaction scores

Interventions

  • Process changes: Retrospective actions
  • Technical changes: Pay down debt
  • Team changes: Adjust scope or staffing
  • Support: Training or mentorship

Resources