DevJobs

Experienced Site Reliability Engineer (SRE)

Overview
Skills
  • Bash
  • Go
  • Python
  • Azure DevOps
  • GitHub Actions
  • AWS · 5y
  • Terraform · 5y
  • Grafana
  • ArgoCD
  • Datadog
  • Prometheus

Our Company is where we transform vision into reality. It's where ideas become technologies, and cutting-edge technologies become solutions for animal care and management.


We support farmers by providing real-time, actionable information to help them manage their herds. We provide pet owners with smart devices and data that give them a better understanding of their pets’ activity and health needs, enriching their relationships. We help conservationists safeguard natural environments and wildlife.


Drawing on decades of research and development experience across many markets, technologies, and species, together with established development environments and quality assurance procedures, we are always inventing new ways to look after the health and well-being of animals. That experience keeps us ahead of the curve in applying advanced technology, from enhancing the bond between people and their pets to advancing animal healthcare and wildlife preservation.



We are looking for an exceptional Senior Site Reliability Engineer (SRE) to help establish and lead the technical practices of SRE within our CloudOps team. This is a hands-on role for an experienced professional who can implement SRE principles, build frameworks and tools to ensure system reliability, and mentor others in adopting these practices.

If you are passionate about operational excellence, love solving complex technical challenges, and thrive in highly collaborative environments, this is the role for you.



What You’ll Do:


Define and Build the SRE Function

  • Help define and implement SRE principles and practices.
  • Partner with development and DevOps teams to create Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs) for critical services.
  • Advocate for and implement system architectures that prioritize reliability, scalability, and fault tolerance.

Develop Automation and Resilience

  • Build automation tools to reduce toil, streamline operations, and improve reliability using Infrastructure as Code (IaC) tools like Terraform and Crossplane.
  • Implement self-healing systems, automate incident detection and response, and integrate chaos engineering practices to test system resilience.

Drive Observability and Monitoring Excellence

  • Create and maintain advanced observability systems with tools like Datadog, Prometheus, and Grafana to ensure uptime and system health.
  • Develop efficient alerting and monitoring strategies, including synthetic tests and automated anomaly detection.
  • Strong, proven experience with AWS services and with IaC using Terraform.
  • Analyze system logs and telemetry data to detect patterns, identify issues, and optimize system performance.

Incident Response and Problem Solving

  • Take ownership of incident response processes, ensuring swift recovery of services and conducting thorough Root Cause Analysis (RCA) for long-term improvements.
  • Document incident learnings and collaborate with teams to enhance on-call processes and system documentation.

Contribute to Continuous Improvement

  • Improve deployment pipelines (CI/CD) using tools like GitHub Actions, Azure DevOps, or ArgoCD, ensuring smooth and reliable releases.
  • Continuously evaluate and refine operational processes to reduce manual effort and increase efficiency.


Requirements:


Technical Expert

  • 5+ years of hands-on experience in Site Reliability Engineering.
  • Proven expertise in AWS services, with experience working with distributed, event-driven architectures and microservices.
  • Experience with GitOps workflows and tools.
  • Advanced skills in automation tools like Terraform and proficiency in scripting or programming languages (e.g., Python, Go, Bash).

Problem Solver and Collaborator

  • Exceptional problem-solving skills and a proactive approach to identifying and addressing technical challenges.
  • Effective communicator and collaborator with the ability to work across teams to deliver operational excellence.
  • Strong analytical skills, especially in troubleshooting and optimizing complex systems.

Preferred

  • Familiarity with chaos engineering tools like Gremlin or LitmusChaos.

MSD Animal Health Technology Labs