DevJobs

DevOps Team Lead

Skills
  • Bash
  • Python
  • Go
  • Kafka
  • Redis
  • AWS · 5y
  • GCP
  • Snowflake
  • Kubernetes
  • Ansible
  • Grafana
  • Terraform
  • Chef
  • GKE
  • EKS
  • Datadog
  • Karpenter
  • KEDA
  • MSK
  • OpenSearch
  • Prometheus
  • Redshift
  • Thanos
  • VictoriaMetrics
  • Flux
  • Cortex
  • ClickHouse
  • BigQuery
  • ArgoCD

About the Role

We are looking for a hands-on DevOps Team Lead to own our multi-cloud infrastructure and drive the integration of large-scale platforms across AWS and GCP.

You will be responsible for production reliability, cloud operations, observability, security, FinOps, and infrastructure strategy while leading and mentoring a team of DevOps engineers.

This is a high-impact player-coach role that requires both strong technical leadership and deep hands-on expertise in operating large-scale distributed systems, real-time data pipelines, and mission-critical production environments processing billions of events per day.


Key Responsibilities

  • Lead, mentor, and develop DevOps and Platform engineers while remaining highly hands-on.
  • Own and evolve cloud infrastructure across AWS and GCP, including Kubernetes-based platforms (EKS/GKE), networking, IAM, storage, and core infrastructure services.
  • Lead infrastructure integration efforts during acquisitions, platform consolidations, and cloud migration projects.
  • Design, deploy, and maintain Infrastructure-as-Code using Terraform.
  • Act as the primary escalation point for infrastructure and production issues.
  • Lead incident response, post-mortems, and continuous operational improvements.
  • Build and maintain observability platforms using Prometheus, Grafana, Datadog, and related tools, including monitoring standards, alerting strategies, SLOs, and SLAs.
  • Support large-scale data pipelines, real-time event processing systems, and high-throughput production environments handling billions of events per day.
  • Collaborate with engineering teams to improve reliability, observability, scalability, and performance across production systems.
  • Troubleshoot and optimize large-scale distributed systems, including capacity planning and performance tuning.
  • Lead cloud cost optimization initiatives across AWS and GCP, including FinOps practices, resource governance, and cost visibility.
  • Support SOC 2, ISO 27001, and infrastructure security initiatives by implementing operational controls and security best practices.


What You'll Bring

  • 5+ years of hands-on experience managing large-scale production environments on AWS, with practical experience in GCP.
  • Proven experience leading DevOps, SRE, or Platform Engineering teams, including mentoring engineers, driving operational excellence, and taking ownership of mission-critical production environments.
  • Deep expertise in Kubernetes (EKS/GKE), including autoscaling with Karpenter and KEDA, as well as cloud networking, infrastructure security, and Infrastructure-as-Code using Terraform.
  • Experience with infrastructure tooling such as Ansible and Chef.
  • Strong experience supporting distributed data platforms and production services, including Kafka (MSK), Redis, OpenSearch, and similar technologies.
  • Strong experience operating highly available distributed systems, large-scale data pipelines, streaming platforms, and real-time event processing environments.
  • Hands-on experience with observability and production operations, including Prometheus, Grafana, Datadog, monitoring, alerting, incident response, root cause analysis, and performance optimization.
  • Experience with capacity planning, cloud cost optimization (FinOps), and infrastructure governance.
  • Experience leading infrastructure integration during acquisitions, platform consolidations, or large-scale cloud migrations.
  • Strong troubleshooting skills and the ability to perform effectively under pressure in complex production environments.


Bonus Points For

  • Experience supporting SOC 2, ISO 27001, or similar security and compliance frameworks.
  • Experience with AdTech, MarTech, Gaming, Analytics, or other high-scale data-driven platforms.
  • Experience with ClickHouse, BigQuery, Redshift, Snowflake, or similar analytics platforms.
  • Experience with VictoriaMetrics, Thanos, Cortex, ArgoCD, Flux, or other modern observability and GitOps tools.
  • Proficiency in Python, Go, or Bash for automation and tooling.


AnyClip