
Senior SRE Engineer

Overview
Skills
  • Go
  • Python
  • Java
  • Kafka
  • PostgreSQL
  • Elasticsearch
  • GitHub Actions
  • AWS
  • Kubernetes
  • Terraform
  • RabbitMQ
  • Grafana
  • EKS
  • Pulumi
  • Prometheus
  • Databricks
  • Datadog

Grip Security is looking for a Senior SRE Engineer to join our community!

We are a fast-growing startup in the software-as-a-service (SaaS) security industry. We provide innovative solutions for securing the entire organization-to-SaaS surface. (More details: https://grip.security)

Using the newest technologies, we're working on solving a huge problem all enterprises face today: governing the access of all their employees to all third-party vendors (GitHub, SendGrid, Atlassian, and thousands more!), and making sure there is no leftover or unwanted access to any of the organization's SaaS assets. The SaaS security field is complex and challenging; therefore, we're looking for super-talented people who are not afraid of technical challenges and of breaking down barriers to achieve good solutions.


The Job

We're looking for a Senior SRE Engineer who combines strong infrastructure expertise with solid programming skills to help scale our platform, and who can balance operational excellence with software development. This is an exciting opportunity to build SRE processes from the ground up, creating new reliability pipelines, monitoring frameworks, and foundational practices that will scale with our rapid growth. You'll lead our infrastructure and reliability efforts while writing code to automate, optimize, and enhance our systems. This role requires both deep technical expertise and the ability to mentor team members as we scale.


Stack: AWS, Python, EKS, K8s, Kafka, RabbitMQ, Pulumi, PostgreSQL, Databricks, GitHub Actions


Core Responsibilities

  • Design and implement scalable, reliable infrastructure solutions on AWS using Infrastructure as Code (Terraform/Pulumi).
  • Build and maintain sophisticated CI/CD pipelines with GitOps methodologies.
  • Develop custom tooling and automation scripts in Python/Go/similar languages to improve operational efficiency.
  • Architect and implement comprehensive observability solutions (metrics, logging, tracing, alerting).
  • Define and track SLIs/SLOs/Error Budgets to ensure system reliability.
  • Lead incident response, conduct thorough post-mortems, and drive systemic improvements.
  • Optimize cloud costs through data-driven analysis and architectural improvements.
  • Collaborate with development teams to improve application reliability and performance.
  • Mentor team members on SRE best practices and infrastructure design patterns.


Requirements

  • 5+ years of DevOps/SRE experience in production environments.
  • Solid programming skills in at least one language (Python, Go, Java, or similar), with the ability to write production-quality code.
  • Strong understanding of SRE principles: reliability engineering, capacity planning, chaos engineering.
  • Deep expertise with Kubernetes (EKS preferred) including operators, CRDs, and advanced networking.
  • Proven experience implementing Infrastructure as Code at scale.
  • Hands-on experience with observability stacks (Prometheus, Grafana, ELK, Datadog, or similar).
  • Experience with distributed systems concepts and troubleshooting.
  • Excellent problem-solving skills with a systematic approach to debugging.
  • Strong communication skills and the ability to work across teams.


What Sets You Apart

  • You write code to solve operational problems, not just configure existing tools.
  • You think in systems and can identify root causes across complex architectures.
  • You're passionate about automation and eliminating toil.
  • You balance perfectionism with pragmatism to deliver reliable solutions quickly.
  • You stay current with cloud-native technologies and best practices.
  • You can translate technical concepts for various audiences.

Grip Security