Why Join Us?
We are looking for a technically strong and AI-savvy Escalation & Reliability Manager to own production reliability, incident management, and cross-functional prioritization. This role leads our AI-driven automation strategy, drives self-healing infrastructure development, and sets a new standard for modern reliability engineering.
Key Responsibilities
- Own production incidents and escalations end-to-end — from mitigation to RCA to corrective action.
- Lead the design and development of self-healing systems capable of detecting, diagnosing, and remediating incidents autonomously.
- Drive automation of repetitive operational workflows using AI/ML-based solutions to reduce toil and MTTR.
- Lead and mentor the SRE team; improve monitoring, alerting, and observability.
- Manage the cross-functional Squad handling customer and production issues; align priorities across Support, QA, R&D, and Sources.
- Track key operational metrics and lead long-term reliability improvements.
Qualifications
- 3+ years in SRE or Incident Management.
- Mandatory: Hands-on experience applying AI/ML to operational challenges (AIOps, anomaly detection, LLM-based automation, or auto-remediation).
- Proven track record of automating workflows and reducing manual toil at scale.
- Strong cloud background (AWS/Azure/GCP) and experience with Kubernetes, Docker, and CI/CD.
- Proficiency with observability tools (Grafana, Prometheus, ELK) and scripting (Python, Bash).
- Demonstrated leadership in high-pressure, cross-functional environments.
Advantages
- Background in cybersecurity or SaaS platforms.
- Experience with LLMOps, AI agents, or orchestration platforms (e.g., n8n, Temporal).
- Familiarity with Jira or Salesforce.
Key Attributes
- Strong ownership, accountability, and composure under pressure.
- Passionate about leveraging AI to automate workflows, reduce toil, and accelerate incident resolution.
- Visionary about self-healing operations — able to both define the strategy and drive its implementation.
- Collaborative leader with the ability to align cross-functional stakeholders.
- Technically hands-on, systems-level thinker with the drive to engineer scalable, long-term solutions.