Own, design, and evolve our production cloud platform (AWS), Kubernetes, and IaC (Terraform) so teams ship reliably, safely, and fast.
What You’ll Do
- Architect, build, and run resilient, scalable cloud infrastructure; drive the AWS Well-Architected pillars (operational excellence, security, reliability, performance efficiency, cost optimization, sustainability).
- Champion GitOps (e.g., Argo CD): declarative configs, PR-driven changes, continuous reconciliation.
- Implement and evolve CI/CD (GitHub Actions/Argo CD), secrets management, policy-as-code, and environment promotion.
- Build first-class observability (OpenTelemetry + Prometheus) across apps and infra.
- Partner with internal and external teams (engineering, data, vendors, customers) to deliver platform capabilities and services.
- Lead/mentor engineers; drive Terraform standards, modules, and reviews.
- Optimize cost and efficiency (FinOps) while maintaining reliability.
- Define SLOs/SLIs and error budgets; lead incident readiness, response, and post-mortems.
Requirements:
What You Bring (must-haves)
- 10+ years of hands-on DevOps/SRE experience managing large-scale production workloads.
- Strong software development background, with extensive experience building and maintaining production-grade applications.
- Deep production experience with AWS (or another major public cloud), Kubernetes, and Terraform.
- Proven ownership: design → implement → release → operate → improve (independent and team-based).
- Excellent communication; comfortable collaborating with internal and external stakeholders.
Nice to have
- Linux, networking (DNS/HTTP/TCP/IP), and security fundamentals.
- CI/CD with GitHub Actions/Argo CD; service mesh; policy-as-code.
- Observability: OpenTelemetry, Prometheus, Grafana.
- SRE practices (SLOs/error budgets); experience improving DORA-style outcomes.
- FinOps experience.
- Python or Node.js.
About Alice:
Alice is a trust, safety, and security company built for the AI era. We safeguard the communication technologies people use to create, collaborate, and interact, whether with each other or with machines.
In a world where AI has fundamentally changed the nature of risk, Alice provides end-to-end coverage across the entire AI lifecycle. We support frontier model labs, enterprises, and user-generated content (UGC) platforms with a comprehensive suite of solutions: from model hardening evaluations and pre-deployment red-teaming to runtime guardrails and ongoing drift detection.