Real CI/CD: What the tutorials don't tell you

CI/CD tutorials are always the same: an npm test and a docker push. In reality, a production pipeline has secrets to rotate, caches that invalidate at the worst time, rollbacks that fail, and builds that take 45 minutes for reasons nobody can explain.

Let's see how to set up a real pipeline, with all the gotchas you'll encounter.

Mental Model

A good CI/CD pipeline is a state machine with several properties:

  1. Deterministic: Same input → same result (or at least it should)
  2. Fast: If it takes more than 10 minutes, nobody waits for feedback
  3. Informative: When it fails, you know exactly what and why
  4. Recoverable: You can re-run any step without side effects

Push → Lint → Test → Build → Push Image → Deploy Staging → E2E Tests → Deploy Prod

                                         [rollback ready]

The trick is that each step is idempotent and has clear outputs.
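
As a sketch of what "idempotent with clear outputs" can look like in a shell step: the helper below keys each step on its input (for example the commit SHA) and skips the work when a marker for that input already exists. run_step, CI_OUT_DIR, and the .ci-out directory are hypothetical names for illustration, not part of the pipeline in this post.

```shell
set -eu

# Hypothetical helper: run a command once per (step, input key) pair.
# Usage: run_step <step-name> <input-key> <command> [args...]
run_step() {
  local name="$1" key="$2"
  shift 2
  local out_dir="${CI_OUT_DIR:-.ci-out}"
  local marker="$out_dir/$name-$key.done"

  mkdir -p "$out_dir"
  if [ -f "$marker" ]; then
    echo "skip $name ($key): already done"
    return 0
  fi

  # Write the marker only on success, so a failed step runs again next time.
  if ! "$@"; then
    echo "failed $name ($key)" >&2
    return 1
  fi
  touch "$marker"
  echo "done $name ($key)"
}
```

For example, run_step build "$GITHUB_SHA" go build ./... would do the work once per commit; re-running the job for the same SHA skips straight through.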

Under the Hood: Anatomy of a real pipeline

Directory structure

.github/
  workflows/
    ci.yml          # Tests and lint on each PR
    cd.yml          # Deploy to staging/prod
    release.yml     # Tags and releases
  actions/
    setup/          # Reusable action for setup
      action.yml

The real CI workflow

# .github/workflows/ci.yml
name: CI
 
on:
  pull_request:
    branches: [main]
  push:
    branches: [main]
 
# Cancel previous runs from the same PR
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
 
env:
  GO_VERSION: '1.22'
  GOLANGCI_LINT_VERSION: 'v1.55.2'
 
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
 
      - uses: actions/setup-go@v5
        with:
          go-version: ${{ env.GO_VERSION }}
          cache: true  # Module caching
 
      - name: golangci-lint
        uses: golangci/golangci-lint-action@v4
        with:
          version: ${{ env.GOLANGCI_LINT_VERSION }}
          # Report only issues introduced by the change, so existing warnings don't fail PRs
          only-new-issues: true
 
  test:
    runs-on: ubuntu-latest
    services:
      # Real services for integration tests
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: test
          POSTGRES_DB: test
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
 
      redis:
        image: redis:7
        ports:
          - 6379:6379
 
    steps:
      - uses: actions/checkout@v4
 
      - uses: actions/setup-go@v5
        with:
          go-version: ${{ env.GO_VERSION }}
          cache: true
 
      - name: Run tests
        env:
          DATABASE_URL: postgres://postgres:test@localhost:5432/test?sslmode=disable
          REDIS_URL: redis://localhost:6379
        run: |
          go test -v -race -coverprofile=coverage.out ./...
 
      - name: Upload coverage
        uses: codecov/codecov-action@v4
        with:
          files: ./coverage.out
          fail_ci_if_error: false  # Don't block due to codecov being down
 
  build:
    needs: [lint, test]
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write  # GITHUB_TOKEN needs this to push to ghcr.io
    steps:
      - uses: actions/checkout@v4
 
      - uses: docker/setup-buildx-action@v3
 
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
 
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: ${{ github.event_name == 'push' }}  # Only push on merge to main
          tags: |
            ghcr.io/${{ github.repository }}:${{ github.sha }}
            ghcr.io/${{ github.repository }}:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max
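
💡Tip: Concurrency groups

cancel-in-progress: true cancels the previous run when you push new commits to the same PR, which saves CI minutes. Because the group is keyed on github.ref, a new push to main also cancels the run still in flight for the previous commit.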

The real CD workflow

# .github/workflows/cd.yml
name: CD
 
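on:
  # Heads-up: this fires at the same time as ci.yml, so the image for this SHA
  # may not be in the registry yet. Gate on CI (e.g. workflow_run) if that bites.
  push:
    branches: [main]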
  workflow_dispatch:
    inputs:
      environment:
        description: 'Environment to deploy'
        required: true
        default: 'staging'
        type: choice
        options:
          - staging
          - production
 
jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
 
      - name: Deploy to staging
        env:
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG_STAGING }}
        run: |
          echo "$KUBECONFIG_DATA" | base64 -d > kubeconfig
          export KUBECONFIG=kubeconfig
 
          kubectl set image deployment/myapp \
            myapp=ghcr.io/${{ github.repository }}:${{ github.sha }}
 
          kubectl rollout status deployment/myapp --timeout=300s
 
      - name: Run smoke tests
        run: |
          ./scripts/smoke-test.sh https://staging.example.com
 
  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
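    environment: production  # Approval gate, if required reviewers are configured
    # Pushes to main continue to prod; manual runs only when 'production' was chosen
    if: github.event_name == 'push' || github.event.inputs.environment == 'production'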
    steps:
      - uses: actions/checkout@v4
 
      - name: Deploy to production
        env:
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG_PROD }}
        run: |
          echo "$KUBECONFIG_DATA" | base64 -d > kubeconfig
          export KUBECONFIG=kubeconfig
 
          # Canary deployment: 10% traffic first
          kubectl set image deployment/myapp-canary \
            myapp=ghcr.io/${{ github.repository }}:${{ github.sha }}
 
          kubectl rollout status deployment/myapp-canary --timeout=300s
 
          # If health checks pass, full rollout
          kubectl set image deployment/myapp \
            myapp=ghcr.io/${{ github.repository }}:${{ github.sha }}
 
          kubectl rollout status deployment/myapp --timeout=300s
 
      - name: Notify Slack
        if: always()
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "Deploy ${{ job.status }}: ${{ github.repository }}@${{ github.sha }}"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
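
The workflow calls ./scripts/smoke-test.sh without showing it. Here is a minimal sketch of what such a script might contain, assuming the service exposes a /healthz endpoint. That endpoint, the function names, and the retry defaults are all assumptions, not something taken from the original setup.

```shell
set -eu

# Hypothetical probe: fails on connection errors, timeouts, and HTTP 4xx/5xx.
probe() {
  curl --silent --fail --max-time 5 --output /dev/null "$1"
}

# Retry the probe until it succeeds or the attempts run out.
# Usage: smoke_test <base-url> [attempts] [delay-seconds]
smoke_test() {
  local base_url="$1" attempts="${2:-5}" delay="${3:-3}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if probe "$base_url/healthz"; then  # /healthz is an assumed endpoint
      echo "ok after $i attempt(s)"
      return 0
    fi
    echo "attempt $i/$attempts failed" >&2
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "smoke test failed for $base_url" >&2
  return 1
}
```

Saved as scripts/smoke-test.sh with a final line of smoke_test "$1", this would match the call in the workflow. The retries matter because a rollout can report success a few seconds before the ingress actually routes traffic.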
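
🚨Gotcha: Environments and secrets

GitHub environments (environment: production) give you approval gates and environment-specific secrets. But WATCH OUT: environment secrets are only visible to jobs that declare that environment, and they are NOT shared between environments. If staging and production need the same value, you have to duplicate it or store it as a repository or organization secret.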

Real example: Debugging slow CI

Slow CI diagnosis

$ gh run view 12345678 --log | grep -E 'Run|##\[group\]'
##[group]Run actions/checkout@v4 - 3s
##[group]Run actions/setup-go@v5 - 45s ← PROBLEM
##[group]Run go test - 120s
##[group]Run docker/build-push-action@v5 - 180s ← PROBLEM
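
The grep above shows which steps ran, but not how long each one took. A rough way to get per-step durations is to diff the first and last timestamp of each step in the log. This sketch assumes gh run view --log prints tab-separated lines (job, step, then a line starting with an ISO timestamp). step_durations is a hypothetical helper, and it only looks at the time of day, so runs that cross midnight UTC will come out wrong.

```shell
set -eu

# Hypothetical helper: read `gh run view <run-id> --log` on stdin and print
# "<seconds><TAB><job> / <step>" per step, slowest first.
step_durations() {
  awk -F '\t' '
    # Assumed line shape: job <TAB> step <TAB> 2024-01-15T10:23:45.1234567Z message
    NF >= 3 && match($3, /T[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/) {
      split(substr($3, RSTART + 1, 8), hms, ":")
      t = hms[1] * 3600 + hms[2] * 60 + hms[3]
      key = $1 " / " $2
      if (!(key in first)) first[key] = t
      last[key] = t
    }
    END {
      for (key in first) printf "%d\t%s\n", last[key] - first[key], key
    }
  ' | sort -rn
}
```

Typical use would be gh run view 12345678 --log | step_durations, which puts the slowest steps at the top.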

Fix 1: setup-go caching

- uses: actions/setup-go@v5
  with:
    go-version: '1.22'
    cache: true  # Cache go modules
    cache-dependency-path: go.sum

Fix 2: Docker layer caching

- uses: docker/build-push-action@v5
  with:
    cache-from: type=gha
    cache-to: type=gha,mode=max

Fix 3: Parallelize tests

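jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        package: [./pkg/..., ./internal/..., ./cmd/...]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.22'
      # One job per package group, all running in parallel
      - run: go test -v ${{ matrix.package }}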

Result: From 15 minutes to 4 minutes.
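
Fixed package groups like the matrix in Fix 3 can end up uneven, with one job doing most of the work. One alternative is to split the package list round-robin across shards. This sketch assumes the matrix provides a shard index and a shard total; the SHARD_INDEX and SHARD_TOTAL names and the shard_packages helper are hypothetical.

```shell
set -eu

# Hypothetical helper: keep every line whose 0-based position modulo <total>
# equals <index>.
# Usage: go list ./... | shard_packages <index> <total>
shard_packages() {
  local index="$1" total="$2"
  awk -v i="$index" -v n="$total" '(NR - 1) % n == i'
}
```

In a job step this could be used as go test -v $(go list ./... | shard_packages "$SHARD_INDEX" "$SHARD_TOTAL"). Round-robin does not balance by test duration, but it does stop new packages from silently piling into one group.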

Gotchas and common mistakes

1. Secrets in logs

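# BAD: the secret is pasted into the script text before the shell even runs
run: curl -H "Authorization: ${{ secrets.API_KEY }}" https://api.com

# GOOD: pass it through env. GitHub masks secrets it knows about;
# ::add-mask:: also covers values you derive at runtime
env:
  API_KEY: ${{ secrets.API_KEY }}
run: |
  echo "::add-mask::$API_KEY"
  curl -H "Authorization: $API_KEY" https://api.com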

2. Cache that never hits

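# The key changes whenever go.sum changes, so exact hits are rare after dependency bumps
- uses: actions/cache@v4
  with:
    path: ~/.cache/go-build
    key: go-${{ hashFiles('**/go.sum') }}
    # Fallback to the most recent cache whose key starts with "go-".
    # Keep comments out of the block below: inside "|" they become part of the key.
    restore-keys: |
      go-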

3. Rollback that doesn't rollback

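# kubectl rollout undo goes back one revision, which DOESN'T help if that revision was also broken
# Better: keep a reference to the last working SHA, or pin a known-good revision explicitly

kubectl rollout history deployment/myapp   # find the last good revision
kubectl rollout undo deployment/myapp --to-revision=42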
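
To act on "keep reference to the last working SHA", the rollback can redeploy that image instead of relying on revision history. This is a sketch using the same kubectl commands as the CD workflow. rollback_to_sha and is_full_sha are hypothetical helpers, and where you record the known-good SHA (a Git tag, a deploy log) is left open.

```shell
set -eu

# True only for a full 40-character lowercase hex commit SHA.
is_full_sha() {
  case "$1" in
    "" | *[!0-9a-f]*) return 1 ;;
  esac
  [ "${#1}" -eq 40 ]
}

# Hypothetical helper: redeploy a known-good image and wait for the rollout.
# Usage: rollback_to_sha <image-repo> <known-good-sha>
rollback_to_sha() {
  local image_repo="$1" good_sha="$2"
  if ! is_full_sha "$good_sha"; then
    echo "refusing to roll back: '$good_sha' is not a full commit SHA" >&2
    return 2
  fi
  kubectl set image deployment/myapp "myapp=$image_repo:$good_sha" &&
    kubectl rollout status deployment/myapp --timeout=300s
}
```

Rejecting anything that is not a full SHA keeps a tag like latest from sneaking in, since latest may already point at the broken build.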

4. Flaky tests that block deploys

test:
  strategy:
    fail-fast: false  # Matrix only: don't cancel sibling jobs when one fails
  steps:
    - run: go test -v -count=1 ./...  # -count=1 bypasses the test result cache

5. Protected branches that block hotfixes

# In emergencies, workflow_dispatch allows manual deploy
on:
  workflow_dispatch:
    inputs:
      skip_tests:
        description: 'Skip tests (emergency only)'
        type: boolean
        default: false

TL;DR Checklist

  • concurrency.cancel-in-progress to avoid accumulated runs
  • Use cache: true in setup-go (setup-node takes a package manager name instead, e.g. cache: npm)
  • Docker build with cache-from: type=gha
  • Services for real DB/Redis in tests
  • Environments to separate staging/prod secrets
  • workflow_dispatch for emergency manual deploys
  • Smoke tests after each deploy
  • Notifications (Slack/Discord) in if: always()
  • fail-fast: false if using matrix
  • Never log secrets, use ::add-mask::

Next topic: How to set up pipeline observability? Build time metrics, failure rates, and alerts when CI degrades.