
Zero-Downtime Deployments Without the Drama

Author
Maksim P.
DevOps Engineer / SRE

TL;DR

  • Zero-downtime deployments aren’t just for FAANG. Small teams can do it with basic patterns.
  • Blue-green works great for stateless apps. Rolling updates handle everything else.
  • Database migrations are the real villain. Use expand-contract pattern.
  • Health checks prevent half-broken deployments from taking down your app.
  • Start with one service. Perfect the process. Then expand.

Who this is for

Teams shipping to production multiple times per week who are tired of “scheduled maintenance” windows. You have 3-10 engineers, basic CI/CD in place, and customers who notice when you deploy at 3pm on a Tuesday.

The deployment patterns that actually matter

Forget the 47 different deployment strategies you read about. For small teams, two patterns cover 95% of use cases.

Blue-green deployments

Perfect for stateless applications. You run two identical environments (blue and green). Deploy to the inactive one, test it, then switch traffic.

Here’s the simplest possible implementation using nginx:

# /etc/nginx/sites-available/app
upstream app_blue {
    server 10.0.1.10:3000;
}

upstream app_green {
    server 10.0.1.20:3000;
}

# Point to blue initially
upstream app_current {
    server 10.0.1.10:3000;
}

server {
    listen 80;
    server_name app.example.com;
    
    location / {
        proxy_pass http://app_current;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Deploy to green, verify it works, then update the app_current upstream and reload nginx. The reload is graceful — new workers pick up the config while old workers finish their in-flight requests — so clients see effectively zero downtime.
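The switch itself is worth scripting so nobody hand-edits nginx configs under pressure. A minimal sketch in Python — the addresses are the ones from the config above, and the file path below is an assumption; in a real deploy script you'd rewrite the file, run `nginx -t`, then `nginx -s reload`:

```python
import re

BLUE = "10.0.1.10:3000"
GREEN = "10.0.1.20:3000"

def flip_upstream(config_text: str) -> str:
    """Toggle the server line inside the app_current upstream block."""
    pattern = r"(upstream app_current \{\s*server )([\d.:]+)(;)"

    def swap(match: re.Match) -> str:
        # Whichever environment is live, point at the other one
        target = GREEN if match.group(2) == BLUE else BLUE
        return match.group(1) + target + match.group(3)

    return re.sub(pattern, swap, config_text)
```

Keeping the function pure (text in, text out) makes it trivial to test; the surrounding script just reads `/etc/nginx/sites-available/app`, writes the flipped text back, and reloads.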

Rolling updates

Better for containerized workloads or when you can’t afford double infrastructure. Replace instances one at a time.

If you’re on Kubernetes (even k3s), this is built-in:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # one extra pod during deploy
      maxUnavailable: 0  # never go below replica count
  template:
    spec:
      containers:
      - name: api
        image: myapp:v2
        readinessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 10
          periodSeconds: 5

Kubernetes won’t route traffic to new pods until they pass readiness checks. Old pods keep serving until new ones are ready.

Health checks save your bacon

The difference between “zero-downtime” and “zero-working-deployments” is health checks.

Bad health check:

@app.route('/health')
def health():
    return 'OK', 200

Good health check:

@app.route('/health')
def health():
    checks = {
        'database': check_db_connection(),
        'redis': check_redis_connection(),
        'disk_space': check_disk_space()
    }
    
    if all(checks.values()):
        return jsonify(checks), 200
    else:
        return jsonify(checks), 503

Your health check should verify the app can actually serve requests. Check database connectivity, required services, disk space. If your app can’t function, it shouldn’t receive traffic.
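The helper functions in the example are left undefined; a minimal sketch, assuming a DB-API-style connection handle and a local filesystem (`check_redis_connection` would follow the same try/except shape around a `ping()` call):

```python
import shutil

def check_db_connection(conn):
    """Round-trip a trivial query; any exception means the DB is unreachable."""
    try:
        conn.execute("SELECT 1")
        return True
    except Exception:
        return False

def check_disk_space(path="/", min_free_bytes=1 * 1024**3):
    """Healthy only while at least min_free_bytes (default ~1 GB) remain free."""
    return shutil.disk_usage(path).free >= min_free_bytes
```

The thresholds are assumptions — tune `min_free_bytes` to whatever your app actually needs to keep writing logs and temp files.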

Database migrations without tears

Here’s where most zero-downtime efforts die. You can’t blue-green a database. You need the expand-contract pattern.

Never do this:

  1. Deploy code that expects new schema
  2. Run migration
  3. Hope for the best

Do this instead:

  1. Expand: Add new columns/tables without removing old ones
  2. Deploy code that works with both schemas
  3. Migrate data in background
  4. Deploy code that only uses new schema
  5. Contract: Remove old columns/tables

Example: renaming a column from username to email.

Migration 1 (expand):

ALTER TABLE users ADD COLUMN email VARCHAR(255);
UPDATE users SET email = username WHERE email IS NULL;

Application code (works with both):

def get_user_identifier(user):
    # New column takes precedence
    return user.email or user.username

def save_user(user, user_data):
    # Write to both columns during the transition
    user.username = user_data['identifier']
    user.email = user_data['identifier']
    user.save()
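Step 3 — migrating data in the background — is where a one-shot UPDATE can hurt on large tables (one long transaction, lots of row locks). A batched sketch, using sqlite3 here purely for illustration; the table and column names match the example above:

```python
import sqlite3

def backfill_email(conn, batch_size=1000):
    """Copy username into email a batch at a time to keep transactions short."""
    while True:
        cur = conn.execute(
            "UPDATE users SET email = username "
            "WHERE id IN (SELECT id FROM users WHERE email IS NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()
        if cur.rowcount == 0:
            break  # nothing left to backfill
```

On a real database you'd run this from a job runner or cron, and add a short sleep between batches if replication lag is a concern.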

Migration 2 (contract, after next deploy):

ALTER TABLE users DROP COLUMN username;

Yes, this means two deploys for schema changes. That’s the price of zero downtime.

Load balancer configuration

Your load balancer needs to know when to stop sending traffic to instances.

For AWS ALB/ELB, use connection draining:

  • Set deregistration delay to match your longest request (usually 30-60s)
  • Health check interval: 10s
  • Unhealthy threshold: 2 (catches issues in 20s)

For nginx, use health checks with the upstream module:

upstream backend {
    server 10.0.1.10:3000 max_fails=2 fail_timeout=30s;
    server 10.0.1.20:3000 max_fails=2 fail_timeout=30s;
}

This marks a server down after two failed requests and removes it from rotation for 30 seconds.

Testing your deployment process

You can’t claim zero-downtime until you’ve tested under load.

Simple load test during deployment:

# Terminal 1: Generate load
while true; do
    curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" https://api.example.com/health
    sleep 0.1
done

# Terminal 2: Deploy
./deploy.sh

# Watch terminal 1 - should see only 200s, no timeouts

For more realistic testing, use vegeta or k6:

echo "GET https://api.example.com/products" | \
    vegeta attack -duration=5m -rate=100 | \
    vegeta report

Run this during deployment. Success metrics:

  • Zero 5xx errors
  • 99th percentile latency stays under 2x normal
  • No connection timeouts
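Those thresholds are easy to check mechanically instead of eyeballing terminal output. A sketch that evaluates (status, latency) samples collected during a deploy — the function names and the nearest-rank percentile choice are assumptions, not from vegeta or k6:

```python
def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[rank]

def deploy_passed(samples, baseline_p99):
    """samples: (status_code, latency_seconds) pairs recorded during the deploy."""
    no_5xx = all(code < 500 for code, _ in samples)
    p99 = percentile([latency for _, latency in samples], 99)
    # Pass only if nothing errored and tail latency stayed under 2x normal
    return no_5xx and p99 <= 2 * baseline_p99
```

Wire this into CI and a deploy that degrades service fails the pipeline instead of waiting for a human to notice.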

When zero-downtime isn’t worth it

Some reality checks:

Skip zero-downtime deployments if:

  • You deploy once a month
  • Your app has natural maintenance windows (B2B with weekend downtime)
  • The complexity exceeds the benefit (2-person team, 10 users)

Zero-downtime is about customer experience, not engineering pride. If your customers don’t care, spend the effort elsewhere.

Operations checklist

Before claiming victory:

  • Health checks actually check dependencies
  • Load balancer drains connections gracefully
  • Database migrations use expand-contract pattern
  • Deployment process is scripted, not manual
  • You’ve tested deploys under real load
  • Rollback procedure takes <5 minutes
  • Monitoring alerts on failed deployments
  • Team knows the runbook when things go wrong
