Real CI/CD: What the tutorials don't tell you
CI/CD tutorials are always the same: an npm test and a docker push. A production pipeline is different: it has secrets to rotate, caches that invalidate at the worst time, rollbacks that fail, and runs where you end up debugging "why did it take 45 minutes?".
Let's see how to set up a real pipeline, with all the gotchas you'll encounter.
Mental Model
A good CI/CD pipeline is a state machine with several properties:
- Deterministic: Same input → same result (or at least it should be)
- Fast: If it takes more than 10 minutes, nobody waits for feedback
- Informative: When it fails, you know exactly what and why
- Recoverable: You can re-run any step without side effects

    Push → Lint → Test → Build → Push Image → Deploy Staging → E2E Tests → Deploy Prod
                                                                                ↓
                                                                        [rollback ready]

The trick is that each step is idempotent and has clear outputs.
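
As a minimal sketch of what "idempotent with clear outputs" can look like in a script step: the wrapper below runs a command once per step name and records success in a marker file. `run_step` and `STATE_DIR` are hypothetical names, not part of any CI system, and on ephemeral runners the marker directory would have to be restored from a cache or artifact for this to carry across runs.

```shell
#!/usr/bin/env bash
# Sketch of an idempotent step wrapper. STATE_DIR and run_step are
# hypothetical names used only for illustration.
STATE_DIR="${STATE_DIR:-.pipeline-state}"

# run_step NAME CMD...: run CMD once per NAME. A marker file records success,
# so re-running the same step is a no-op instead of a second side effect.
run_step() {
  local name="$1"
  shift
  mkdir -p "$STATE_DIR"
  if [ -f "$STATE_DIR/$name.done" ]; then
    echo "skip $name (already done)"
    return 0
  fi
  # A failed command leaves no marker, so the step can simply be retried.
  "$@" || return $?
  touch "$STATE_DIR/$name.done"
  echo "done $name"
}
```

The step name should include whatever identifies the input, for example `run_step "lint-$GITHUB_SHA" golangci-lint run`, so that a new commit is never skipped because an older one already passed.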
Under the Hood: Anatomy of a real pipeline
Directory structure

    .github/
      workflows/
        ci.yml       # Tests and lint on each PR
        cd.yml       # Deploy to staging/prod
        release.yml  # Tags and releases
      actions/
        setup/       # Reusable action for setup
          action.yml

The real CI workflow

    # .github/workflows/ci.yml
    name: CI

    on:
      pull_request:
        branches: [main]
      push:
        branches: [main]

    # Cancel previous runs from the same PR
    concurrency:
      group: ${{ github.workflow }}-${{ github.ref }}
      cancel-in-progress: true

    env:
      GO_VERSION: '1.22'
      GOLANGCI_LINT_VERSION: 'v1.55.2'

    jobs:
      lint:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-go@v5
            with:
              go-version: ${{ env.GO_VERSION }}
              cache: true  # Module caching
          - name: golangci-lint
            uses: golangci/golangci-lint-action@v4
            with:
              version: ${{ env.GOLANGCI_LINT_VERSION }}
              # Important: report only issues introduced by the change,
              # so pre-existing warnings don't fail the PR
              only-new-issues: true

      test:
        runs-on: ubuntu-latest
        services:
          # Real services for integration tests
          postgres:
            image: postgres:15
            env:
              POSTGRES_PASSWORD: test
              POSTGRES_DB: test
            ports:
              - 5432:5432
            options: >-
              --health-cmd pg_isready
              --health-interval 10s
              --health-timeout 5s
              --health-retries 5
          redis:
            image: redis:7
            ports:
              - 6379:6379
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-go@v5
            with:
              go-version: ${{ env.GO_VERSION }}
              cache: true
          - name: Run tests
            env:
              DATABASE_URL: postgres://postgres:test@localhost:5432/test?sslmode=disable
              REDIS_URL: redis://localhost:6379
            run: |
              go test -v -race -coverprofile=coverage.out ./...
          - name: Upload coverage
            uses: codecov/codecov-action@v4
            with:
              files: ./coverage.out
              fail_ci_if_error: false  # Don't block due to codecov being down

      build:
        needs: [lint, test]
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: docker/setup-buildx-action@v3
          - uses: docker/login-action@v3
            with:
              registry: ghcr.io
              username: ${{ github.actor }}
              password: ${{ secrets.GITHUB_TOKEN }}
          - name: Build and push
            uses: docker/build-push-action@v5
            with:
              context: .
              push: ${{ github.event_name == 'push' }}  # Only push on merge to main
              tags: |
                ghcr.io/${{ github.repository }}:${{ github.sha }}
                ghcr.io/${{ github.repository }}:latest
              cache-from: type=gha
              cache-to: type=gha,mode=max

cancel-in-progress: true cancels previous runs when you push new commits to the same PR. Saves CI minutes.
The real CD workflow

    # .github/workflows/cd.yml
    name: CD

    on:
      push:
        branches: [main]
      workflow_dispatch:
        inputs:
          environment:
            description: 'Environment to deploy'
            required: true
            default: 'staging'
            type: choice
            options:
              - staging
              - production

    jobs:
      deploy-staging:
        runs-on: ubuntu-latest
        environment: staging
        steps:
          - uses: actions/checkout@v4
          - name: Deploy to staging
            env:
              KUBECONFIG_DATA: ${{ secrets.KUBECONFIG_STAGING }}
            run: |
              echo "$KUBECONFIG_DATA" | base64 -d > kubeconfig
              export KUBECONFIG=kubeconfig
              kubectl set image deployment/myapp \
                myapp=ghcr.io/${{ github.repository }}:${{ github.sha }}
              kubectl rollout status deployment/myapp --timeout=300s
          - name: Run smoke tests
            run: |
              ./scripts/smoke-test.sh https://staging.example.com

      deploy-production:
        needs: deploy-staging
        runs-on: ubuntu-latest
        environment: production  # Manual approval, if the environment has required reviewers
        if: github.event.inputs.environment == 'production' || github.ref == 'refs/heads/main'
        steps:
          - uses: actions/checkout@v4
          - name: Deploy to production
            env:
              KUBECONFIG_DATA: ${{ secrets.KUBECONFIG_PROD }}
            run: |
              echo "$KUBECONFIG_DATA" | base64 -d > kubeconfig
              export KUBECONFIG=kubeconfig
              # Canary deployment: 10% traffic first
              kubectl set image deployment/myapp-canary \
                myapp=ghcr.io/${{ github.repository }}:${{ github.sha }}
              kubectl rollout status deployment/myapp-canary --timeout=300s
              # If health checks pass, full rollout
              kubectl set image deployment/myapp \
                myapp=ghcr.io/${{ github.repository }}:${{ github.sha }}
              kubectl rollout status deployment/myapp --timeout=300s
          - name: Notify Slack
            if: always()
            uses: slackapi/slack-github-action@v1
            with:
              payload: |
                {
                  "text": "Deploy ${{ job.status }}: ${{ github.repository }}@${{ github.sha }}"
                }
            env:
              SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}

GitHub environments (environment: production) allow approval gates and environment-specific secrets. But WATCH OUT: an environment's secrets are only visible to jobs that declare that environment, and staging and production do not share them. A secret needed in both has to be duplicated in each environment or defined once at the repository or organization level.
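
The staging job calls ./scripts/smoke-test.sh but the script itself is not shown. A minimal sketch of what it might contain follows, assuming the service exposes a `/healthz` endpoint; that path, the retry counts, and the function names are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical contents for scripts/smoke-test.sh. The /healthz path and the
# retry counts are assumptions, not taken from the workflow.

# retry ATTEMPTS DELAY_SECONDS CMD...: re-run CMD until it succeeds
# or the attempts are used up.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# check_health BASE_URL: fails on connection errors, timeouts,
# and HTTP status 400 or above.
check_health() {
  curl --fail --silent --show-error --max-time 10 "$1/healthz" > /dev/null
}

# smoke_test BASE_URL: give a freshly rolled-out service a few chances to respond.
smoke_test() {
  retry 5 3 check_health "$1"
}
```

The script's last line would then be `smoke_test "$1"`, so a non-zero exit status fails the workflow step. Retrying is a deliberate choice here: right after a rollout the first request can hit a pod that is still warming up, and a single failed probe should not fail the deploy.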
Real example: Debugging slow CI
Fix 1: setup-go caching

    - uses: actions/setup-go@v5
      with:
        go-version: '1.22'
        cache: true  # Cache go modules
        cache-dependency-path: go.sum

Fix 2: Docker layer caching

    - uses: docker/build-push-action@v5
      with:
        cache-from: type=gha
        cache-to: type=gha,mode=max

Fix 3: Parallelize tests

    jobs:
      test:
        strategy:
          matrix:
            package: [./pkg/..., ./internal/..., ./cmd/...]
        steps:
          - run: go test -v ${{ matrix.package }}

Result: From 15 minutes to 4 minutes.
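
A hard-coded package list has to be edited as directories come and go, and the three groups may take very different amounts of time. One alternative, sketched here and not what the workflow above does, is to split the output of `go list ./...` round-robin across shards. `shard` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# shard INDEX TOTAL: read one item per line on stdin and print the items that
# belong to shard INDEX (0-based) out of TOTAL, assigned round-robin.
shard() {
  awk -v idx="$1" -v total="$2" '(NR - 1) % total == idx'
}
```

With a matrix over shard numbers 0, 1 and 2, each job could run something like `go list ./... | shard 0 3 | xargs go test -v`, with the 0 replaced by the job's own matrix value. Round-robin balances package counts, not run times, so a few slow packages can still dominate one shard.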
Gotchas and common mistakes
1. Secrets in logs
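
    # BAD: the secret is pasted into the script text and can surface in error output
    run: curl -H "Authorization: ${{ secrets.API_KEY }}" https://api.com

    # GOOD: pass it through an environment variable and mask it.
    # GitHub already masks registered secrets in logs; ::add-mask:: matters
    # most for values derived from them at runtime.
    env:
      API_KEY: ${{ secrets.API_KEY }}
    run: |
      echo "::add-mask::$API_KEY"
      curl -H "Authorization: $API_KEY" https://api.com

2. Cache that never hits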
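
    # The cache misses whenever the key changes, so restore-keys gives it
    # a prefix to fall back to any earlier go cache.
    # Keep comments out of the "|" block: inside it they become part of the key.
    - uses: actions/cache@v4
      with:
        path: ~/.cache/go-build
        key: go-${{ hashFiles('**/go.sum') }}  # Changes with each new dependency
        restore-keys: |
          go-

3. Rollback that doesn't roll back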

    # kubectl rollout undo returns to the previous revision, which DOESN'T help
    # if that revision was also broken
    # Better: keep a reference to the last working SHA (or revision) and target it explicitly
    kubectl rollout undo deployment/myapp --to-revision=42

4. Flaky tests that block deploys

    test:
      strategy:
        fail-fast: false  # Don't cancel the other matrix jobs when one fails
      steps:
        - run: go test -v -count=1 ./...  # -count=1 bypasses the test result cache

5. Protected branches that block hotfixes

    # In emergencies, workflow_dispatch allows manual deploy
    on:
      workflow_dispatch:
        inputs:
          skip_tests:
            description: 'Skip tests (emergency only)'
            type: boolean
            default: false

TL;DR Checklist
- concurrency.cancel-in-progress to avoid accumulated runs
- Use cache: true in setup-go (setup-node takes a package manager name instead, e.g. cache: npm)
- Docker build with cache-from: type=gha
- Services for real DB/Redis in tests
- Environments to separate staging/prod secrets
- workflow_dispatch for emergency manual deploys
- Smoke tests after each deploy
- Notifications (Slack/Discord) guarded by if: always()
- fail-fast: false if using matrix
- Never log secrets, use ::add-mask::
Next topic: How to set up pipeline observability? Build time metrics, failure rates, and alerts when CI degrades.