DevJobs

Site Reliability Engineer

Skills
  • Go
  • Python
  • GCP
  • Kubernetes
  • Ansible
  • Terraform
  • Grafana
  • Prometheus

Job Description:

We are looking for a Site Reliability Engineer (SRE) to join our Infrastructure team and help build, scale, and secure the cloud foundation that powers Pendo’s products. As an SRE, you’ll work at the intersection of software development and systems engineering, improving reliability, scalability, observability, and operational efficiency.

Our stack includes Google Kubernetes Engine (GKE), Terraform, Cloud Functions, BigQuery, and more. You’ll be responsible for creating reliable CI/CD pipelines, automating infrastructure, monitoring services, and ensuring the uptime and performance of systems that process over 15 billion events per day.

Key Responsibilities:

  • Design and implement scalable, reliable infrastructure using Infrastructure as Code (IaC) tools such as Terraform and Ansible.
  • Build and maintain CI/CD pipelines and development environments to support rapid feature delivery.
  • Own monitoring, alerting, and incident response for production systems.
  • Participate in on-call rotation and troubleshoot live issues in high-scale environments.
  • Collaborate with engineering and product teams to define SLIs/SLOs and meet reliability goals.
  • Proactively manage capacity planning, cost optimization, and cloud resources.
  • Partner with Security to ensure compliance with standards like SOC 2.

Requirements:

Minimum Qualifications:

  • 3+ years of experience as an SRE, DevOps Engineer, or in a similar role.
  • Hands-on experience with cloud providers (GCP preferred) and tools like Terraform or Ansible.
  • Proficiency in Go, Python, or similar languages.
  • Experience running Kubernetes clusters in production environments.
  • Strong understanding of system design, reliability concepts, and failure scenarios.
  • Clear communication skills and ability to document processes/runbooks effectively.

Preferred Qualifications:

  • Experience with Google Cloud Platform services (e.g., GKE, Pub/Sub, BigQuery).
  • Familiarity with monitoring and observability tools (e.g., Prometheus, Grafana).
  • Experience in incident management and on-call rotations.
  • Knowledge of CI/CD best practices and release automation.

Pendo