Docker Compose in production: a guide for small teams

April 2, 2026 · 12 min read

Mikel Martin

CTO, Keni Engineering

Docker Compose is often dismissed as "just for development." That is wrong. With the right setup, Compose runs production workloads reliably for thousands of companies. It is simple, declarative, and easy to reason about. For teams with 2-30 developers running a handful of services on a single host, it is often the best choice you can make.

The problem is not Compose itself. The problem is that most teams use it the same way in production as they do in development: no health checks, no resource limits, no monitoring, no backup strategy. That is what turns a reliable tool into a liability. This guide covers everything you need to run Compose in production properly.

Why Compose works in production

The case for Compose in production comes down to simplicity. A single docker-compose.yml file describes your entire stack: services, networks, volumes, environment variables, and dependencies. Anyone on the team can read it and understand what is running. Compare that to Kubernetes, where the equivalent setup requires dozens of YAML files, a control plane, and a dedicated platform team.

Compose gives you declarative configuration. You define the desired state, and Docker makes it happen. Need to roll back? Change the image tag and run docker compose up -d. The previous version is running again in seconds. No deployment pipelines to revert, no Helm charts to roll back.

Restart policies mean your services come back after a crash or a server reboot. Health checks let Docker know when a container is actually ready to serve traffic, not just running. These two features alone cover most of the "self-healing" behavior that people think requires an orchestrator.

The single-host constraint is real, but it is also an advantage. There is no distributed consensus to fail, no network partitions between nodes, no split-brain scenarios. Your application runs on one server, and you can SSH in and inspect it directly when something goes wrong. For most small teams, that simplicity is worth more than horizontal scaling they do not need yet.

The production-ready Compose stack

A development Compose file and a production Compose file should look very different. In development, you mount source code, skip health checks, and run everything with default settings. In production, you lock things down. Here is what a production-ready Compose stack needs.

Health checks on every service

A running container is not necessarily a healthy container. Your application might have started but failed to connect to the database. Or it might be stuck in a crash loop. Without health checks, Docker has no way to know.

Add a healthcheck to every service in your Compose file. For web services, an HTTP check against a /health endpoint works well. For databases, use the native client: pg_isready for PostgreSQL, redis-cli ping for Redis. Set reasonable intervals, timeouts, and retry counts. Five seconds between checks, three retries, and a ten-second timeout is a good starting point for most services.

Health checks also enable dependency ordering. When service A depends on service B with a condition: service_healthy clause, Compose will wait until B's health check passes before starting A. This solves the classic "app starts before database is ready" problem without sleep hacks in your entrypoint.
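Put together, a minimal sketch might look like the following. The service names, image tags, port, and the /health endpoint are placeholders; the check command must exist inside the image (curl here, but wget or a tiny built-in health command work too):

```yaml
services:
  db:
    image: postgres:16
    healthcheck:
      # pg_isready exits 0 once PostgreSQL accepts connections
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 10s
      retries: 3

  app:
    image: myapp:1.4.2   # placeholder image
    healthcheck:
      # assumes the app serves a /health endpoint on port 8000
      # and that curl is installed in the image
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 5s
      timeout: 10s
      retries: 3
    depends_on:
      db:
        condition: service_healthy   # wait until db's check passes
```

With this in place, docker compose up starts the app only after PostgreSQL reports healthy, with no sleep loops in the entrypoint.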

Restart policies

Every production service should have a restart policy. The two practical options are always and unless-stopped. The difference: always restarts containers even after you explicitly stop them (on the next Docker daemon start), while unless-stopped respects manual stops.

For most services, unless-stopped is the right choice. It gives you self-healing (automatic restart after a crash or reboot) while still letting you stop a service manually for maintenance without it coming back immediately.
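In Compose syntax this is a single line per service (the image name is a placeholder):

```yaml
services:
  app:
    image: myapp:1.4.2       # placeholder image
    restart: unless-stopped  # restart on crash or reboot, respect manual stops
```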

Resource limits

Without memory limits, a single misbehaving service can consume all available RAM and take down every other container on the host. This is the number one cause of "everything crashed at 3 AM" incidents on single-host setups.

Set mem_limit and cpus for every service. Start with generous limits based on observed usage, then tighten them over time. A typical web application might get 512MB of memory and 0.5 CPU; a database might need 2GB and 1.0 CPU. Monitor actual usage through cAdvisor or docker stats, and adjust accordingly.

Also set memswap_limit equal to mem_limit to prevent swap usage. Containers swapping to disk will make your entire host slow and unresponsive, which is worse than the container being killed for exceeding its memory limit.
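As a sketch, using the example numbers above (these are the classic Compose keys; under Swarm mode the equivalent lives in deploy.resources):

```yaml
services:
  app:
    image: myapp:1.4.2    # placeholder image
    mem_limit: 512m
    memswap_limit: 512m   # equal to mem_limit, so no swap is allowed
    cpus: 0.5

  db:
    image: postgres:16
    mem_limit: 2g
    memswap_limit: 2g
    cpus: 1.0
```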

Named volumes for persistent data

Never use bind mounts for production data. Named volumes are managed by Docker, survive container recreation, and have predictable paths. Define them in the top-level volumes section of your Compose file and reference them in your services.

Database data, uploaded files, and any state that must survive a docker compose down && docker compose up cycle should live in a named volume. Everything else should be ephemeral. If you lose your container and your application breaks because it was storing data in the container filesystem, that is a bug.
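A sketch of the pattern (paths and names are illustrative; the PostgreSQL data directory shown is the image's default):

```yaml
services:
  db:
    image: postgres:16
    volumes:
      - db-data:/var/lib/postgresql/data   # survives container recreation
  app:
    image: myapp:1.4.2                     # placeholder image
    volumes:
      - uploads:/app/uploads               # user-uploaded files

volumes:
  db-data:
  uploads:
```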

Network isolation between services

By default, all services in a Compose file share a single network. That means your frontend can talk directly to your database, which is not ideal. Create separate networks for different security zones. A common pattern: a frontend network for services that need to be exposed to the reverse proxy, and a backend network for internal communication between the app and the database.

Your web application joins both networks. Your database only joins the backend network. The reverse proxy only joins the frontend network. Now your database is unreachable from outside the backend network, even if someone compromises the proxy.
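The two-zone layout described above, as a sketch (image tags are assumptions):

```yaml
services:
  proxy:
    image: traefik:v3.0           # assumed version tag
    networks: [frontend]
  app:
    image: myapp:1.4.2            # placeholder image
    networks: [frontend, backend] # the only service bridging both zones
  db:
    image: postgres:16
    networks: [backend]           # unreachable from the proxy

networks:
  frontend:
  backend:
```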

Environment variables via .env files

Never hardcode credentials, API keys, or database URLs in your Compose file. Use env_file to load variables from a .env file that is gitignored. This keeps secrets out of version control and makes it easy to have different configurations per environment.

For extra security, consider Docker secrets (modern Compose supports file-based secrets directly; full secret management requires Swarm mode and docker stack deploy) or mounting secrets from an external secrets manager. At minimum, your .env file should have restrictive permissions (chmod 600) and be owned by the user running Docker.
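The Compose side is one directive per service:

```yaml
services:
  app:
    image: myapp:1.4.2   # placeholder image
    env_file:
      - .env             # gitignored; lock it down with: chmod 600 .env
```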

Reverse proxy and TLS

Your Compose services should never be exposed directly to the internet. A reverse proxy sits in front, handles TLS termination, and routes traffic to the correct container. For Compose-based setups, Traefik is the best choice because it integrates natively with Docker.

Traefik watches the Docker socket and automatically discovers services based on labels. When you add labels like traefik.http.routers.myapp.rule=Host(`app.example.com`) to a service in your Compose file, Traefik picks it up and configures routing automatically. No config files to edit, no reloads to trigger.

TLS certificates are handled by Traefik's built-in ACME client. Point it at Let's Encrypt, and certificates are issued and renewed automatically for every discovered service. You configure it once, and then you never think about certificates again. No certbot cron jobs, no manual renewals, no expired certificate incidents at 2 AM.

The Traefik container itself runs as part of your Compose stack. It listens on ports 80 and 443, redirects HTTP to HTTPS, and proxies traffic to your other services over the internal Docker network. The only ports exposed on the host are 80 and 443. Everything else is internal. For a deeper comparison of proxy options, see our post on Traefik vs Nginx vs HAProxy.
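A hedged sketch of that setup follows. The flags match Traefik v2/v3 static configuration, but the version tag, email, hostname, and app image are placeholders; check the Traefik documentation before copying:

```yaml
services:
  traefik:
    image: traefik:v3.0   # assumed version tag
    command:
      - --providers.docker=true
      - --providers.docker.exposedbydefault=false   # opt-in via labels only
      - --entrypoints.web.address=:80
      - --entrypoints.websecure.address=:443
      - --entrypoints.web.http.redirections.entrypoint.to=websecure
      - --certificatesresolvers.le.acme.email=ops@example.com
      - --certificatesresolvers.le.acme.storage=/letsencrypt/acme.json
      - --certificatesresolvers.le.acme.httpchallenge.entrypoint=web
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro   # service discovery
      - letsencrypt:/letsencrypt                       # persisted certificates

  app:
    image: myapp:1.4.2   # placeholder image
    labels:
      - traefik.enable=true
      - traefik.http.routers.myapp.rule=Host(`app.example.com`)
      - traefik.http.routers.myapp.entrypoints=websecure
      - traefik.http.routers.myapp.tls.certresolver=le

volumes:
  letsencrypt:
```

Note that the app declares no ports at all: Traefik reaches it over the internal Docker network, so only 80 and 443 are exposed on the host.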

Zero-downtime deployments

The basic deployment workflow with Compose is straightforward. Pull the new images, then run docker compose up -d. Compose detects which services have new images, stops the old containers, and starts new ones. The whole process takes a few seconds.

For most internal tools and low-traffic applications, this is fine. There is a brief moment where the service is unavailable while the new container starts, but it is typically under five seconds. If your users refresh and see a momentary error, that is acceptable for many workloads.

For true zero-downtime, you need a bit more setup. The approach: run two instances of your service behind the reverse proxy. Pull the new image, scale up to two containers with the new version, wait for the health check to pass, then scale down the old one. Traefik handles this naturally because it routes traffic based on health check status. Once the new container is healthy, traffic shifts to it. Once the old container is removed, Traefik stops routing to it.

You can script this with a simple deploy script: docker compose pull, then docker compose up -d --no-deps --scale myapp=2 myapp, wait for health, then docker compose up -d --no-deps --scale myapp=1 myapp. It is not as elegant as Kubernetes rolling updates, but it works, and it takes ten lines of bash instead of a cluster.
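As a sketch, those steps in bash. The service name myapp is a placeholder, and the fixed sleep is a stand-in for a real health poll (for example, inspecting container health with docker inspect); the script only runs the deploy when docker is actually available:

```shell
#!/usr/bin/env bash
# Sketch of the scale-up/scale-down deploy described above.
# Assumes a Compose service named "myapp" with a healthcheck defined.
set -euo pipefail

deploy() {
  docker compose pull myapp
  # run old and new containers side by side
  docker compose up -d --no-deps --scale myapp=2 myapp
  # crude health wait; replace with a real poll of container health
  sleep 15
  # drop back to one replica; Compose removes the surplus container
  docker compose up -d --no-deps --scale myapp=1 myapp
}

# only attempt a deploy when the docker CLI is present
if command -v docker >/dev/null 2>&1; then
  deploy
fi
```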

One critical detail: make sure your application handles graceful shutdown. When Docker sends SIGTERM, your app should finish in-flight requests, close database connections, and exit cleanly. Set stop_grace_period in your Compose file to give it enough time. Thirty seconds is a reasonable default.
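In the Compose file, that is:

```yaml
services:
  app:
    image: myapp:1.4.2      # placeholder image
    stop_grace_period: 30s  # time between SIGTERM and SIGKILL
```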

Monitoring your Compose stack

Running services without monitoring is flying blind. The good news: you can run a complete monitoring stack as Compose services alongside your application. No external services needed, no SaaS bills.

The standard stack is Prometheus for metrics collection, Grafana for dashboards and alerting, node-exporter for host-level metrics (CPU, memory, disk, network), and cAdvisor for container-level metrics (per-container resource usage). All four run as containers in your Compose file.

  • Prometheus scrapes metrics from your services on a configurable interval. Most applications can expose a /metrics endpoint using a Prometheus client library. Prometheus stores the data locally and provides a query language (PromQL) to analyze it.
  • Grafana connects to Prometheus and lets you build dashboards that visualize your metrics. More importantly, it handles alerting. You can set up alerts for high CPU, low disk space, service down, or any custom metric your application exports. Alerts go to email, Google Chat, PagerDuty, or any other supported channel.
  • node-exporter exposes host-level metrics that Prometheus scrapes. Disk usage trending toward full, memory pressure, high CPU load. These are the metrics that tell you about infrastructure problems before they cause application problems.
  • cAdvisor provides container-level metrics. How much memory is each container using? Which one is consuming the most CPU? Is any container hitting its resource limits? This data helps you right-size your resource limits over time.
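The four services above can be sketched in Compose as follows. The images are the commonly used ones, but the tags, mount paths, and the prometheus.yml scrape config are assumptions to adapt (pinning versions instead of latest is wise in production):

```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro  # scrape targets (assumed path)
      - prom-data:/prometheus
    mem_limit: 256m

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana-data:/var/lib/grafana   # dashboards and alert rules
    mem_limit: 128m

  node-exporter:
    image: prom/node-exporter:latest
    pid: host                           # see host processes, not container's
    volumes:
      - /:/host:ro,rslave
    command: [--path.rootfs=/host]      # read host metrics from the mount
    mem_limit: 64m

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
    mem_limit: 128m

volumes:
  prom-data:
  grafana-data:
```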

The entire monitoring stack adds roughly 500MB of memory overhead. On a 4GB or 8GB server, that is a reasonable cost for the visibility it provides. Without monitoring, you are guessing. With it, you know exactly what is happening on your server at all times.

Backup strategy

Backups are non-negotiable. If your server dies and you do not have backups, you lose everything. The strategy has three parts: database dumps, volume backups, and offsite storage.

For databases, run regular logical dumps. pg_dump for PostgreSQL, mysqldump for MySQL. Schedule these via cron or a dedicated backup container. Logical dumps are portable, human-readable, and can be restored to any version of the database. Run them daily at minimum, hourly if the data changes frequently.
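One way to run the dump on a schedule is a small sidecar in the same Compose file. This is a sketch: the database name, user, host, and daily interval are placeholders, and the $$ doubles the dollar sign so Compose does not try to interpolate it:

```yaml
services:
  db-backup:
    image: postgres:16   # match your database's major version
    env_file: .env       # assumes PGPASSWORD is defined there
    volumes:
      - db-backups:/backups
    # loop forever: dump once, then sleep a day (86400 seconds)
    entrypoint: >
      sh -c 'while true; do
        pg_dump -h db -U postgres appdb > /backups/appdb-$$(date +%F).sql;
        sleep 86400;
      done'

volumes:
  db-backups:
```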

For Docker volumes that contain uploaded files or application state, use Restic. It is an encrypted, deduplicated backup tool that supports multiple storage backends: S3, B2, SFTP, and local storage. Restic runs as a container in your Compose stack, mounts the volumes it needs to back up, and pushes encrypted snapshots to your offsite storage on a schedule.
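A Restic sidecar might be sketched like this. The volume names are placeholders, and the repository location and password are assumed to live in the gitignored .env file (RESTIC_REPOSITORY, RESTIC_PASSWORD, plus S3 credentials if applicable); the retention flags are just one reasonable policy:

```yaml
services:
  backup:
    image: restic/restic:latest
    env_file: .env   # RESTIC_REPOSITORY, RESTIC_PASSWORD, S3 keys
    volumes:
      - uploads:/data/uploads:ro   # back up volumes mounted read-only
      - db-backups:/data/db:ro     # the pg_dump output, if you keep one
    # snapshot daily, then prune to 7 daily and 4 weekly snapshots
    entrypoint: >
      sh -c 'while true; do
        restic backup /data &&
        restic forget --keep-daily 7 --keep-weekly 4 --prune;
        sleep 86400;
      done'

volumes:
  uploads:
  db-backups:
```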

The offsite part is critical. If your backups live on the same server as your data, a disk failure takes out both. Use S3 (or an S3-compatible service like Backblaze B2 or Wasabi) as your backup target. Storage costs are minimal: a few dollars per month for most small teams.

Test your restores. A backup that has never been restored is a backup that might not work. Schedule a monthly restore test to a separate environment. It takes an hour and gives you confidence that your recovery process actually works when you need it.

When Compose is not enough

Compose is excellent for many production workloads, but it has real limitations. Being honest about them helps you make the right decision.

  • Multi-host requirements. Compose runs on a single host. If your application needs to run across multiple servers for redundancy or capacity, you need an orchestrator. Docker Swarm is the simplest step up from Compose (it uses nearly identical YAML). Kubernetes is the industry standard for larger deployments.
  • 30+ services. A Compose file with 30 or more services becomes hard to manage. Dependency graphs get complex, startup order becomes fragile, and a single docker compose up takes too long. At this scale, splitting into multiple Compose files or moving to an orchestrator makes sense.
  • Auto-scaling. Compose does not scale services based on load. You set a fixed number of replicas, and that is what you get. If your traffic is spiky and you need to handle 10x the normal load for short periods, you need something that can scale dynamically.
  • Multi-team deployments. When multiple teams need to deploy independently to the same infrastructure, Compose's single-file model becomes a bottleneck. Kubernetes namespaces or separate Compose projects can help, but the coordination overhead grows.

If any of these apply to you, read our detailed comparison of Kubernetes vs Docker Compose to understand the tradeoffs. The key takeaway: move to Kubernetes when you need its features, not before. Premature orchestration is one of the most common sources of unnecessary complexity in small engineering teams.

Making it work for your team

Docker Compose in production is not a compromise. It is a deliberate choice to keep your infrastructure simple, understandable, and maintainable. Add health checks to every service. Set resource limits so one bad deploy does not take down the host. Put Traefik in front for TLS and routing. Monitor everything with Prometheus and Grafana. Back up your data with Restic to offsite storage. Test your restores.

That is the entire playbook. No cluster to manage, no control plane to keep alive, no etcd backups to worry about. Just your application, running in containers, on a server you can understand. For teams with 2-30 developers, this setup handles far more traffic and complexity than most people realize.

If you want help setting this up for your team, our platform engineering service builds production-ready Compose stacks with all of the above included. Or if you already have a setup and want a second opinion, start with our DevOps consulting.
