DevJobs

Senior Site Reliability Engineer

Overview
Skills
  • Python
  • Go
  • Kafka
  • Django
  • MySQL
  • MongoDB
  • Cassandra
  • Redis
  • Linux
  • Microservices
  • Jenkins
  • GitHub Actions
  • AWS
  • Kubernetes
  • Helm
  • RabbitMQ
  • Networking
  • Terraform
  • Splunk
  • SLOs
  • SQS
  • Storage
  • Tracing
  • Application Design
  • Compute
  • Containers
  • Distributed monitoring
  • Metrics
  • S3
  • Terraspace
  • EKS
  • ElastiCache
  • Lambda
  • New Relic
  • RDS

Why Work For Us

Grubhub, part of Wonder Group Inc, is all about connecting hungry diners with our network of over 375,000 merchants nationwide. Innovative technology, user-friendly platforms and streamlined delivery capabilities set us apart and make us an industry leader in the world of online food ordering. When you join our team, you become part of a community that works together to innovate, solve problems, grow, work hard and have a ton of fun in the process!


About the Opportunity:

Grubhub, a leader in connecting diners with restaurants nationwide, is seeking a Senior Site Reliability Engineer to join our Campus and On-Site team. This role is crucial for simplifying the dining experience for students across the US. You will be instrumental in architecting resilient and self-healing solutions, managing AWS infrastructure, closing observability gaps, designing scaling approaches, and shaping incident management processes. Your contributions will span the entire development lifecycle, encompassing the building and maintenance of CI/CD pipelines. Collaboration with other SRE teams is vital for guidance, knowledge sharing, and fostering camaraderie.


About the Team:

Our On-Site SRE team is dedicated to building more resilient and self-healing solutions. You'll contribute to managing AWS infrastructure, addressing observability challenges, designing scalable systems, and refining incident management processes. We emphasize close collaboration with other SRE teams for mutual support, knowledge exchange, and team spirit. You will also partner with service owners to design and build robust CI/CD pipelines and contribute to the long-term architectural vision of our products.


The Day to Day:

As an SRE within the "Runtime Engineering" organization, you will co-own critical production service designs, ensuring their high reliability. You will actively drive improvements in reliability and observability using SLOs and telemetry data. Your responsibilities include developing and enhancing internal tools and automation software to effectively and safely maintain production services. You will also lead reliability-focused practices, including Failure Analysis, Load and Capacity Planning, Service Reviews, Architecture Design, and Incident Postmortems. As a senior engineer, you will also be responsible for mentoring junior engineers.

What You'll Need:

  • Experience:
    • Senior SRE: 4+ years of experience
    • SRE II: 2+ years of experience
  • Technical Skills:
    • Deep knowledge of CI/CD tools (e.g., Jenkins, GitHub Actions).
    • Software engineering experience in Python, Go, or a similar object-oriented language.
    • Proficiency with datastores (MySQL, Mongo, Cassandra, Redis) and message brokers (Kafka/SQS/RabbitMQ).
    • Experience with Microservice Architecture and Application Design.
    • Distributed monitoring experience, including SLOs, metrics, and tracing.
    • Working knowledge of Kubernetes-based software solutions and their ecosystem.
    • Working knowledge of Cloud technologies (AWS, Compute/Containers, Storage, Linux, networking).
  • Soft Skills:
    • Strong technical writing, documentation, and communication skills.
    • Experience with highly trafficked web-based services.


About Our Tech:

The On-Site tech stack primarily utilizes Python, with some services written in Go, for tooling, automation, and service code. We leverage Django as our primary web framework. For monitoring, we use New Relic and Splunk. Our robust infrastructure is built with Infrastructure as Code (IaC) using Terraspace (wrapped around Terraform). Our services run on Kubernetes, deployed via Helm. Our cloud technologies encompass various AWS services, including EKS, S3, ElastiCache, and Lambda. Data technologies include MongoDB, MySQL (RDS), Redis (ElastiCache), RabbitMQ, and Kafka. CI/CD is managed through Jenkins. The On-Site tech stack handles a significant portion of Grubhub's daily orders and is rapidly growing. Your role will be pivotal in ensuring the platform's scalability to support our continuously expanding customer base, evidenced by the addition of 30 new campuses and a 25% year-over-year increase in order volume.


Perks:

We offer flexible PTO, comprehensive health programs, abundant opportunities for learning and career growth, and engaging events led by our Culture Crew. Grubhub is an equal opportunity employer committed to diversity and inclusion. We value innovation, problem-solving, calculated risk-taking, hard work, and, most importantly, having a lot of fun!

Grubhub