The DevOps checklist for startups: 15 things to get right before you scale

March 19, 2026 · 10 min read

Scaling a product is hard. Scaling a product on top of fragile infrastructure is a nightmare. Most startups know they need "better DevOps" but do not know where to start. This checklist covers the 15 foundations that every startup should have in place before scaling. Not aspirational best practices. The practical minimum.

1. Version control with Git

Every line of code, configuration, and infrastructure definition should live in Git. Not just application code. Dockerfiles, CI configs, Terraform modules, database migrations. If it affects how your software runs, it belongs in version control. Use GitHub or GitLab. Enforce branch protection on main: require pull requests, at least one review, and passing CI checks before merging.

2. CI pipeline

Every push to your repository should trigger an automated pipeline that builds your code, runs tests, and reports the result. No exceptions. If a developer can merge broken code into main, your CI is not doing its job. GitHub Actions is the easiest starting point for most teams. GitLab CI if you are already on GitLab. Jenkins if you need maximum flexibility and do not mind the maintenance.

3. Automated testing

You do not need 100% coverage. You need tests that catch the bugs that matter. Start with integration tests for your critical paths: user signup, payment processing, the core feature that generates revenue. Add unit tests for complex business logic. Run them in CI on every push. A test suite that takes more than 10 minutes will get ignored, so keep it fast.

4. Containerization

Docker is the standard. Containerize your applications so they run identically in development, staging, and production. A well-written Dockerfile eliminates "works on my machine" problems. Use multi-stage builds to keep images small. Pin your base image versions. Do not run containers as root.

5. Secrets management

API keys, database passwords, and tokens must never be in your code repository. Not in environment files committed to Git, not in Docker images, not in CI logs. Use a secrets manager: 1Password for small teams, HashiCorp Vault for complex setups, or your cloud provider's native solution (AWS Secrets Manager, GCP Secret Manager). Inject secrets at runtime, never at build time.
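As a minimal sketch of runtime injection, assume your secrets manager or orchestrator exposes secrets to the process as environment variables at startup. The application then reads them once and refuses to start if any are missing. The function and exception names here are illustrative, not part of any particular tool.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret was not injected into the environment."""


def load_secrets(required, environ=None):
    """Return the required secrets from the environment, failing fast if any is missing."""
    env = os.environ if environ is None else environ
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Report only the names, never the values, so nothing sensitive reaches the logs.
        raise MissingSecretError("missing secrets: " + ", ".join(sorted(missing)))
    return {name: env[name] for name in required}
```

At startup you might call load_secrets(["DATABASE_PASSWORD", "PAYMENT_API_KEY"]), where both variable names are hypothetical. Failing at boot is usually easier to debug than failing on the first request that needs the secret.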

6. Monitoring and alerting

You need to know when your application is down before your users tell you. At minimum: uptime monitoring (is the service responding?), error rate tracking, and response time metrics. Grafana with Prometheus is the open-source standard. Datadog if you want a managed solution. Set up alerts for the things that actually matter: 5xx error spikes, response times above your SLA target, disk usage above 80%. Do not alert on everything or you will ignore everything.
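To make "alert only on what matters" concrete, here is a small sketch of the three checks above as plain threshold logic. In practice these rules would live in your monitoring tool rather than in application code. The 80% disk threshold comes from the text, while the 1% error rate, the 500 ms latency target, and all field names are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One measurement window for a service (field names are hypothetical)."""
    total_requests: int
    server_errors: int          # responses with a 5xx status
    p95_latency_ms: float
    disk_used_fraction: float   # 0.0 to 1.0


def evaluate_alerts(snap, max_error_rate=0.01, sla_latency_ms=500.0, max_disk=0.80):
    """Return the list of alert messages that should fire for this snapshot."""
    alerts = []
    error_rate = snap.server_errors / snap.total_requests if snap.total_requests else 0.0
    if error_rate > max_error_rate:
        alerts.append(f"5xx error rate {error_rate:.1%} above {max_error_rate:.1%}")
    if snap.p95_latency_ms > sla_latency_ms:
        alerts.append(f"p95 latency {snap.p95_latency_ms:.0f} ms above {sla_latency_ms:.0f} ms")
    if snap.disk_used_fraction > max_disk:
        alerts.append(f"disk usage {snap.disk_used_fraction:.0%} above {max_disk:.0%}")
    return alerts
```

A healthy snapshot returns an empty list, which is the point: three rules, each tied to something a user would notice.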

7. Backup strategy

Databases must be backed up automatically, daily at minimum. Store backups in a different region or provider than your primary data. Test your restores regularly. A backup you have never restored is not a backup, it is a hope. Document the restore procedure so any team member can execute it under pressure.
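A restore test can be automated in the same way as the backup itself. The sketch below uses SQLite from the Python standard library purely as a stand-in, because it runs anywhere. For PostgreSQL or MySQL you would use that database's own dump and restore tooling, but the shape is the same: copy, restore into a scratch database, then check that the data is really there. The function names are illustrative.

```python
import sqlite3


def take_backup(source, target):
    """Copy every page of the source database into the target connection."""
    source.backup(target)


def verify_restore(restored, table, min_rows):
    """Run an integrity check and confirm the restored table holds enough rows."""
    integrity = restored.execute("PRAGMA integrity_check").fetchone()[0]
    # The table name is interpolated, so only pass trusted, hard-coded names.
    rows = restored.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    return integrity == "ok" and rows >= min_rows
```

Running a check like this on a schedule, against a real backup file, turns "we have backups" into something you have actually verified.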

8. Infrastructure as Code

Every server, database, load balancer, and DNS record should be defined in code. Terraform is the industry standard for cloud infrastructure. The goal: if your entire infrastructure disappears, you can recreate the resources by running terraform apply and then restore the data from your backups. No clicking through cloud consoles, no undocumented manual steps. Start with your core infrastructure (VPC, compute, database) and expand from there.

9. Documentation

Document three things at minimum: how to set up the development environment (it should take a new developer less than 30 minutes), how to deploy to production (every step, including rollback), and an architecture overview covering what services exist, how they communicate, and where data lives. Keep docs next to the code in the repository. Docs in a wiki nobody visits are docs that do not exist.

10. Incident response plan

When production goes down at 2am, you need a plan, not a panic. Define: who gets paged (use PagerDuty or Opsgenie for on-call rotation), how to communicate with stakeholders (status page, Slack channel), and runbooks for common failures (database connection issues, high memory usage, deployment rollback). Practice it. Run a game day where you simulate a failure and work through the response.

11. Centralized logging

SSH-ing into servers to read log files does not scale. Aggregate logs from all services into a central location where you can search and filter. The ELK stack (Elasticsearch, Logstash, Kibana) is the open-source option. Grafana Loki is lighter weight and works well with Prometheus. Use structured logging (JSON) so you can query by fields like user ID, request ID, or error type.
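Structured logging does not require a new library to get started. The sketch below uses Python's standard logging module with a small custom formatter that writes one JSON object per line. The context field names (user_id, request_id, error_type) follow the examples in the text, and the class and function names are illustrative.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    # Context fields copied from the record when the caller supplied them.
    CONTEXT_FIELDS = ("user_id", "request_id", "error_type")

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


def build_logger(name, stream=None):
    """Return a logger that writes JSON lines to the stream (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as build_logger("api").error("payment failed", extra={"request_id": "req-1"}) then produces a line your log aggregator can filter by request_id instead of by searching raw text.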

12. SSL/TLS everywhere

Every external-facing endpoint must use HTTPS. No exceptions. Let's Encrypt provides free certificates with automatic renewal. Use a reverse proxy (Traefik, Nginx) to terminate TLS. Internal service-to-service communication should also be encrypted, especially if your services communicate across the public internet. Certificate expiration is a preventable outage. Automate renewal and alert 30 days before expiration.
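An expiry check is simple enough to script yourself if your monitoring tool does not already provide one. The sketch below uses only the Python standard library: one function fetches the certificate's notAfter field from a live endpoint, and another turns it into days remaining. The function names are illustrative, and the 30-day default mirrors the threshold suggested above.

```python
import socket
import ssl
import time


def fetch_not_after(hostname, port=443, timeout=5.0):
    """Connect over TLS and return the server certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now=None):
    """Days left before a notAfter timestamp such as 'Jan  5 09:34:43 2018 GMT'."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def needs_alert(not_after, threshold_days=30, now=None):
    """True when the certificate expires in fewer than threshold_days days."""
    return days_until_expiry(not_after, now) < threshold_days
```

Note that fetch_not_after opens a real network connection, so run it from a scheduled job rather than inside a request path. A check like needs_alert(fetch_not_after("your-domain.example")) is enough to feed an alert.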

13. Dependency updates

Outdated dependencies are security vulnerabilities waiting to happen. Enable Dependabot (GitHub) or Renovate to automatically create PRs for dependency updates. Review and merge security patches within 48 hours. Schedule a regular cadence (weekly or biweekly) for reviewing non-critical updates. Pin your dependency versions so updates are intentional, not accidental.
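Pinning is easy to enforce in CI. As one possible sketch, assuming a Python project that lists dependencies in a pip-style requirements.txt, the function below returns every requirement that is not pinned to an exact version. Other ecosystems solve this with lockfiles instead. The regular expression is deliberately simple and does not cover every form pip accepts, such as direct URLs.

```python
import re

# Matches "name==1.2.3", optionally with extras ("name[extra]==1.2.3")
# and an environment marker after a semicolon. Wildcards like "==1.*" do not match.
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?==[0-9A-Za-z.!+_-]+\s*(;.*)?$")


def unpinned_requirements(text):
    """Return the requirement lines that are not pinned to an exact version."""
    unpinned = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()    # drop comments and whitespace
        if not line or line.startswith("-"):   # skip blanks and pip options like -r
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned
```

Failing the build when this list is non-empty keeps version changes confined to the pull requests that Dependabot or Renovate open.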

14. Access control

Principle of least privilege: every person and service should have the minimum access needed to do their job. No shared credentials. No root access for everyday operations. Use SSO for your cloud provider. Enable MFA on every account, no exceptions. Review access quarterly and revoke immediately when someone leaves the team. Use IAM roles for services instead of long-lived API keys.
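The quarterly review is easier to keep up when part of it is scripted. The sketch below assumes you can export a list of accounts from your identity provider or cloud IAM, with each account's MFA status and last-use date. The Account fields and the 90-day idle cutoff (roughly one quarter) are assumptions for illustration, not a real provider's schema.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Account:
    """Hypothetical row from an identity provider or cloud IAM export."""
    name: str
    mfa_enabled: bool
    last_used: date
    is_shared: bool = False


def review_access(accounts, today, max_idle_days=90):
    """Return {account name: [findings]} for accounts that need attention."""
    findings = {}
    cutoff = today - timedelta(days=max_idle_days)
    for account in accounts:
        issues = []
        if not account.mfa_enabled:
            issues.append("MFA disabled")
        if account.is_shared:
            issues.append("shared credential")
        if account.last_used < cutoff:
            issues.append(f"unused for more than {max_idle_days} days")
        if issues:
            findings[account.name] = issues
    return findings
```

The output is a starting list for a human reviewer, not an automatic revocation: an idle account may still be legitimate, but someone should confirm it.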

15. Deployment strategy

Define how code gets from "merged to main" to "running in production." At minimum, have a one-command deploy process and a one-command rollback. Blue-green or rolling deployments let you update without downtime. Start simple: deploy from CI when a version tag is pushed for a commit on main. Add complexity (canary releases, feature flags) only when the simple approach stops working. Always be able to roll back in under 5 minutes.
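The "one-command deploy, one-command rollback" idea can be sketched independently of any platform. In the example below, activate is a hypothetical hook that switches traffic to a given version (for instance by updating a symlink or a load balancer target), and health_check is a hypothetical function that reports whether the new version is serving correctly. Only releases that pass the health check are recorded, so there is always a known-good version to return to.

```python
class Releases:
    """Tracks healthy releases so the previous one can be restored quickly."""

    def __init__(self, activate):
        # activate(version) is a hypothetical hook that switches traffic to a version.
        self._activate = activate
        self._history = []

    @property
    def current(self):
        """The last release that passed its health check, or None."""
        return self._history[-1] if self._history else None

    def deploy(self, version, health_check):
        """Activate version; restore the last healthy release if the check fails."""
        self._activate(version)
        if health_check(version):
            self._history.append(version)
            return True
        if self._history:
            self._activate(self._history[-1])
        return False

    def rollback(self):
        """Return to the release before the current one."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier release to roll back to")
        self._history.pop()
        self._activate(self._history[-1])
        return self._history[-1]
```

Whatever tooling you use, the design choice worth copying is that rollback only re-activates an artifact that already exists. It never rebuilds anything, which is what keeps it inside the five-minute target.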

Where to start

You do not need all 15 on day one. Prioritize in this order: version control (1), CI pipeline (2), automated testing (3), secrets management (5), and monitoring (6). These five give you the biggest immediate impact. Add the rest as you grow.

Not sure where your team stands on this checklist? Our infrastructure audit evaluates your setup against these foundations and more. Or try our free healthcheck to get a quick score. Need hands-on help implementing it? That is what our DevOps consulting is for.