We are looking for a technically strong and AI-savvy SRE Team Lead & Escalation Manager to own production reliability, incident management, and cross-functional prioritization. This role leads our AI-driven automation strategy, drives self-healing infrastructure development, and sets a new standard for modern reliability engineering.
Major Responsibilities
- Lead and mentor the SRE team; improve monitoring, alerting, and observability.
- Own production incidents and escalations end-to-end, from mitigation to RCA to corrective action.
- Lead the design and development of self-healing systems capable of detecting, diagnosing, and remediating incidents autonomously.
- Drive automation of repetitive operational workflows using AI/ML-based solutions to reduce toil and MTTR.
- Manage the cross-functional Squad handling customer and production issues; align priorities across Support, QA, R&D, and Sources.
- Track key operational metrics and lead long-term reliability improvements.
Desired Background
- 3-5 years in SRE or Incident Management.
- Mandatory: Hands-on AI experience applied to operational challenges (AIOps, anomaly detection, LLM-based automation, or auto-remediation).
- Proven track record of automating workflows and reducing manual toil at scale.
- Strong cloud background (AWS/Azure/GCP) and experience with Kubernetes, Docker, and CI/CD.
- Proficiency with observability tools (Grafana, Prometheus, ELK) and scripting (Python, Bash).
- Demonstrated leadership in high-pressure, cross-functional environments.
Advantages
- Background in cybersecurity or SaaS platforms.
- Experience with LLMOps, AI agents, or orchestration platforms (e.g., n8n, Temporal).
Key Attributes
- Strong ownership, accountability, and composure under pressure.
- Passionate about leveraging AI to automate workflows, reduce toil, and accelerate incident resolution.
- Visionary about self-healing operations, able to both define the strategy and drive its implementation.
- Collaborative leader with the ability to align cross-functional stakeholders.
- Technically hands-on systems-level thinker with the drive to engineer scalable, long-term solutions.