DevJobs

SRE Lead

Overview
Skills
  • DevOps · 5y
  • AWS
  • GCP
  • Kubernetes
  • Site Reliability Engineering · 5y
  • Systems Engineering · 5y
  • AI-driven IDEs
  • Automation
  • Capacity Planning
  • Distributed Systems
  • Incident Management
  • Infrastructure as Code
This role requires strong ownership, deep technical expertise in modern infrastructure, and the ability to define reliability standards across the organization. You will take charge of complex troubleshooting, system stability, and automation initiatives, while laying the groundwork to build and lead a high-performing SRE function.

Responsibilities

  • Define and implement the SRE team charter, establishing reliability practices, SLAs, SLOs, and SLIs across the engineering organization.
  • Serve as the highest point of technical escalation for production issues, taking ownership of complex troubleshooting and blameless post-mortems.
  • Work closely with our external NOC and SOC teams to ensure seamless incident management, monitoring, and operational continuity.
  • Manage and investigate security-related incidents, escalations, and infrastructure vulnerabilities as needed.
  • Partner closely with development and infrastructure teams to ensure reliability, performance, and scalability are built into our systems from day one.
  • Drive automation and leverage modern tooling to streamline operations, manage infrastructure as code, and speed up incident resolution.

Requirements

  • B.Sc. in Computer Science, Software Engineering, Information Systems, or a related technical field, or equivalent experience.
  • 5+ years of hands-on industry experience in Site Reliability Engineering, DevOps, or Systems Engineering, with a proven track record in a leadership or management capacity.
  • Deep background in the DevOps ecosystem, including extensive hands-on experience managing and operating Kubernetes (K8s) clusters.
  • Profound understanding of, and hands-on experience with, cloud infrastructure platforms (e.g., AWS, GCP).
  • Proven experience working with, scaling, and troubleshooting high-scale production environments.
  • Extensive experience with incident management, capacity planning, and managing highly available distributed systems.
  • Deep knowledge of, and experience working with, AI-driven IDEs and environments to accelerate workflows and infrastructure management.
  • Strong technical leadership skills, excellent problem-solving abilities, and the ability to communicate complex architectural decisions clearly to both technical and non-technical stakeholders.
Priority Software