SRE Team Leader & Escalation Manager

Overview
Skills
  • Python
  • Bash
  • Elasticsearch
  • CI/CD
  • AWS
  • Azure
  • GCP
  • Docker
  • Kubernetes
  • Grafana
  • AIOps
  • Prometheus
  • LLM-based automation
  • auto-remediation
  • anomaly detection
  • LLMOps
  • n8n
  • AI agents
  • Temporal

We are looking for a technically strong and AI-savvy SRE Team Lead & Escalation Manager to own production reliability, incident management, and cross-functional prioritization. This role leads our AI-driven automation strategy, drives self-healing infrastructure development, and sets a new standard for modern reliability engineering.



Major Responsibilities

  • Lead and mentor the SRE team; improve monitoring, alerting, and observability.
  • Own production incidents and escalations end-to-end, from mitigation to RCA to corrective action.
  • Lead the design and development of self-healing systems capable of detecting, diagnosing, and remediating incidents autonomously.
  • Drive automation of repetitive operational workflows using AI/ML-based solutions to reduce toil and MTTR.
  • Manage the cross-functional Squad handling customer and production issues; align priorities across Support, QA, R&D, and Sources.
  • Track key operational metrics and lead long-term reliability improvements.

Desired Background

  • 3-5 years in SRE or Incident Management.
  • Mandatory: Hands-on experience applied to operational challenges (AIOps, anomaly detection, LLM-based automation, or auto-remediation).
  • Proven track record of automating workflows and reducing manual toil at scale.
  • Strong cloud background (AWS/Azure/GCP) and experience with Kubernetes, Docker, and CI/CD.
  • Proficiency with observability tools (Grafana, Prometheus, ELK) and scripting (Python, Bash).
  • Demonstrated leadership in high-pressure, cross-functional environments.

Advantages

  • Background in cybersecurity or SaaS platforms.
  • Experience with LLMOps, AI agents, or orchestration platforms (e.g., n8n, Temporal).

Key Attributes

  • Strong ownership, accountability, and composure under pressure.
  • Passionate about leveraging AI to automate workflows, reduce toil, and accelerate incident resolution.
  • Visionary about self-healing operations, able to both define the strategy and drive its implementation.
  • Collaborative leader with the ability to align cross-functional stakeholders.
  • Technically hands-on systems-level thinker with the drive to engineer scalable, long-term solutions.

Check Point Software Technologies