Skills
- Docker ꞏ 5y
- Kubernetes ꞏ 5y
- Grafana
- Cloud platforms ꞏ 3y
- Datadog
- Prometheus
Description
Key Responsibilities:
- Ensure critical systems meet uptime and performance SLAs (Service Level Agreements) and SLOs (Service Level Objectives).
- Participate in on-call rotations, lead post-mortems, and drive root cause analysis.
- Implement redundancy, failover, and high availability strategies to keep services running smoothly.
- Build and maintain robust monitoring, alerting, and observability systems (e.g., Prometheus, Grafana, Datadog).
- Ensure the security of infrastructure and pipelines by implementing best practices for access control, encryption, and vulnerability management.
- Collaborate with DevOps/Dev teams to build, maintain, and improve CI/CD pipelines.
- Have fun with a great team while tackling hard challenges.
Requirements
- 5 years of experience designing, deploying, maintaining, and troubleshooting large-scale distributed systems.
- Hands-on experience with infrastructure services such as caching systems, message queues, distributed storage, and load balancers.
- Proven experience in building and maintaining monitoring solutions using tools like Prometheus, Grafana, or equivalent platforms.
- 5 years of hands-on experience with containerization technologies like Docker and orchestration tools like Kubernetes.
- At least 3 years of experience working with cloud platforms.
- Understanding of network security principles (e.g., segmentation, firewalls, VPNs, zero trust).
- Familiarity with securing cloud resources: encryption, security groups, secrets management, etc.
- Cloud certifications – Advantage
- Bachelor's degree (Computer Science, Computer Engineering, or Data Science) – Advantage
DriveNets