Site Reliability Engineering Manager

Skills
  • Python
  • TypeScript
  • Node.js
  • CI/CD
  • Kubernetes
  • Grafana
  • Infrastructure as Code
  • OpenTelemetry
  • Prometheus

We’re building a lean, high-impact Site Reliability Engineering (SRE) function at the core of
our SaaS platform’s production reliability and long-term quality strategy, and we’re looking for
a hands-on Team Leader to drive it.


What You’ll Do:

  • Lead and scale a small SRE team (2–3 total) with end-to-end ownership of
    observability and diagnostics across production.
  • Design and implement a central observability platform supporting engineering,
    support, and NOC teams.
  • Write production-grade code and automation to enhance system reliability, tooling,
    and platform resilience.
  • Drive operational excellence: incident response, alerting, monitoring, and continuous
    reliability improvements.


Your Toolbox:

  • Deep experience in SRE or Production Engineering, ideally in cloud-native SaaS
    environments.
  • Strong coding skills in languages such as Python, Node.js, or TypeScript;
    you’re expected to build, not just configure.
  • Mastery of monitoring, logging, and distributed tracing (e.g., Prometheus, Grafana,
    OpenTelemetry).
  • Solid understanding of CI/CD, Kubernetes, infrastructure as code, and scalable
    operations.
  • A “builder” mindset: hands-on, practical, and quality-obsessed.


This role is perfect for someone who wants to define and own a strategic reliability function
from day one.

AU10TIX