Escalation Team Leader

Skills
  • Python
  • Bash
  • Elasticsearch
  • JIRA
  • CI/CD
  • AWS
  • Azure
  • GCP
  • Salesforce
  • Docker
  • Kubernetes
  • Grafana
  • AIOps
  • Prometheus
  • LLM-based automation
  • auto-remediation
  • anomaly detection
  • LLMOps
  • n8n
  • orchestration platforms
  • AI agents
  • Temporal

Why Join Us?

We are looking for a technically strong and AI-savvy Escalation & Reliability Manager to own production reliability, incident management, and cross-functional prioritization. This role leads our AI-driven automation strategy, drives self-healing infrastructure development, and sets a new standard for modern reliability engineering.

Key Responsibilities

  • Own production incidents and escalations end-to-end — from mitigation to RCA to corrective action.
  • Lead the design and development of self-healing systems capable of detecting, diagnosing, and remediating incidents autonomously.
  • Drive automation of repetitive operational workflows using AI/ML-based solutions to reduce toil and MTTR.
  • Lead and mentor the SRE team; improve monitoring, alerting, and observability.
  • Manage the cross-functional Squad handling customer and production issues; align priorities across Support, QA, R&D, and Sources.
  • Track key operational metrics and lead long-term reliability improvements.

Qualifications

  • 3+ years in SRE or Incident Management.
  • Mandatory: Hands-on experience applying AI to operational challenges (AIOps, anomaly detection, LLM-based automation, or auto-remediation).
  • Proven track record of automating workflows and reducing manual toil at scale.
  • Strong cloud background (AWS/Azure/GCP) and experience with Kubernetes, Docker, and CI/CD.
  • Proficiency with observability tools (Grafana, Prometheus, ELK) and scripting (Python, Bash).
  • Demonstrated leadership in high-pressure, cross-functional environments.

Advantages

  • Background in cybersecurity or SaaS platforms.
  • Experience with LLMOps, AI agents, or orchestration platforms (e.g., n8n, Temporal).
  • Familiarity with Jira or Salesforce.

Key Attributes

  • Strong ownership, accountability, and composure under pressure.
  • Passionate about leveraging AI to automate workflows, reduce toil, and accelerate incident resolution.
  • Visionary about self-healing operations — able to both define the strategy and drive its implementation.
  • Collaborative leader with the ability to align cross-functional stakeholders.
  • Technically hands-on, systems-level thinker with the drive to engineer scalable, long-term solutions.

Check Point Software Technologies