We’re building a lean, high-impact Site Reliability Engineering (SRE) function at the core of
our SaaS platform’s production reliability and long-term quality strategy, and we’re looking for
a hands-on Team Leader to drive it.
What You’ll Do:
- Lead and scale a small SRE team (2–3 total) with end-to-end ownership of
  observability and diagnostics across production.
- Design and implement a central observability platform supporting engineering,
  support, and NOC teams.
- Write production-grade code and automation to enhance system reliability, tooling,
  and platform resilience.
- Drive operational excellence: incident response, alerting, monitoring, and continuous
  reliability improvements.
Your Toolbox:
- Deep experience in SRE or Production Engineering, ideally in cloud-native SaaS
  environments.
- Strong coding skills in languages such as Python, Node.js, or TypeScript;
  you’re expected to build, not just configure.
- Mastery of monitoring, logging, and distributed tracing (e.g., Prometheus, Grafana,
  OpenTelemetry).
- Solid understanding of CI/CD, Kubernetes, infrastructure as code, and scalable
  operations.
- A “builder” mindset: hands-on, practical, and quality-obsessed.
This role is perfect for someone who wants to define and own a strategic reliability function
from day one.