Join our Agent Engineering org to lead the charge on monitoring, observability, and performance analytics for production AI agents.
Responsibilities:
- Own the observability stack for our AI agents in production: build and maintain dashboards, alerts, and automated reporting using Datadog and internal tools.
- Define and operationalize escalation policies, incident workflows, and on-call rotations across engineering and support teams.
- Establish and scale our AI analytics discipline: define metrics, build data pipelines, and deliver performance visibility through tools like Tableau and Snowflake.
- Partner with Product, Agent Engineering, and Customer Success to ensure that agent performance meets functional, quality, and SLA expectations.
- Proactively surface regressions, coverage gaps, and UX bottlenecks by combining telemetry, structured data, and qualitative signals.
- Drive a culture of data-driven ownership, operational excellence, and continuous improvement.
Requirements:
- 3+ years of experience in analytics or data engineering, with a track record of building data products, dashboards, or observability systems, including 1+ year in a leadership or tech lead role.
- Strong experience with data platforms (e.g., Snowflake, BigQuery, Athena), visualization tools (e.g., Tableau, Looker), and monitoring systems (e.g., Datadog).
- Familiarity with defining and managing KPIs and performance metrics in complex production systems.
- Experience setting up alerts, playbooks, escalation protocols, and on-call schedules.
- Comfort working close to production environments, even if you don't come from a traditional SRE/infra role.
- Hands-on experience with LLM systems and/or AI agents in production environments, including an understanding of their performance characteristics and failure modes.
- Excellent collaboration and communication skills across engineering, product, and business teams.