About the Role
We are looking for a hands-on DevOps Team Lead to own our multi-cloud infrastructure and drive the integration of large-scale platforms across AWS and GCP.
You will be responsible for production reliability, cloud operations, observability, security, FinOps, and infrastructure strategy while leading and mentoring a team of DevOps engineers.
This is a high-impact player-coach role that requires both strong technical leadership and deep hands-on expertise in operating large-scale distributed systems, real-time data pipelines, and mission-critical production environments that process billions of events per day.
Key Responsibilities
- Lead, mentor, and develop DevOps and Platform engineers while remaining highly hands-on.
- Own and evolve cloud infrastructure across AWS and GCP, including Kubernetes-based platforms (EKS/GKE), networking, IAM, storage, and core infrastructure services.
- Lead infrastructure integration efforts during acquisitions, platform consolidations, and cloud migration projects.
- Design, deploy, and maintain Infrastructure-as-Code using Terraform.
- Act as the primary escalation point for infrastructure and production issues.
- Lead incident response, post-mortems, and continuous operational improvements.
- Build and maintain observability platforms using Prometheus, Grafana, Datadog, and related tools, including monitoring standards, alerting strategies, SLOs, and SLAs.
- Support large-scale data pipelines, real-time event processing systems, and high-throughput production environments handling billions of events per day.
- Collaborate with engineering teams to improve reliability, observability, scalability, and performance across production systems.
- Troubleshoot and optimize large-scale distributed systems, including capacity planning and performance tuning.
- Lead cloud cost optimization initiatives across AWS and GCP, including FinOps practices, resource governance, and cost visibility.
- Support SOC 2, ISO 27001, and infrastructure security initiatives by implementing operational controls and security best practices.
What You'll Bring
- 5+ years of hands-on experience managing large-scale production environments on AWS, with practical experience in GCP.
- Proven experience leading DevOps, SRE, or Platform Engineering teams, including mentoring engineers, driving operational excellence, and taking ownership of mission-critical production environments.
- Deep expertise in Kubernetes (EKS/GKE), including autoscaling with Karpenter and KEDA, as well as cloud networking, infrastructure security, and Infrastructure-as-Code using Terraform.
- Experience with configuration management tooling, such as Ansible and Chef.
- Strong experience supporting distributed data platforms and production services, including Kafka (MSK), Redis, OpenSearch, and similar technologies.
- Strong experience operating highly available distributed systems, large-scale data pipelines, streaming platforms, and real-time event processing environments.
- Hands-on experience with observability and production operations, including Prometheus, Grafana, Datadog, monitoring, alerting, incident response, root cause analysis, and performance optimization.
- Experience with capacity planning, cloud cost optimization (FinOps), and infrastructure governance.
- Experience leading infrastructure integration during acquisitions, platform consolidations, or large-scale cloud migrations.
- Strong troubleshooting skills and the ability to perform effectively under pressure in complex production environments.
Bonus Points For
- Experience supporting SOC 2, ISO 27001, or similar security and compliance frameworks.
- Experience with AdTech, MarTech, Gaming, Analytics, or other high-scale data-driven platforms.
- Experience with ClickHouse, BigQuery, Redshift, Snowflake, or similar analytics platforms.
- Experience with VictoriaMetrics, Thanos, Cortex, Argo CD, Flux, or other modern observability and GitOps tools.
- Proficiency in Python, Go, or Bash for automation and tooling.