abra R&D is looking for a Reliability Engineer!
abra R&D is looking for a Reliability Engineer who will take part in building the next-generation agentic analytics platform, the first real-time database optimized for AI agents at scale.
We’re looking for a Senior AI Evaluation & Reliability Engineer to define how AI agents are measured, validated, monitored, and improved in production, and to build the systems that make that possible. This role sits at the intersection of LLM systems, evaluation research, and production-grade engineering.
You will design evaluation methodologies, build LLM-as-a-judge systems, and develop agent-based testing frameworks to ensure correctness, robustness, and reliability of complex multi-agent workflows operating on real-time data.
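If the LLM-as-a-judge pattern is new to you, the minimal Python sketch below shows the core idea: a judge model grades an agent's output against a reference and returns structured scores. The prompt template, score schema, and `call_llm` transport here are illustrative placeholders for this posting, not our production stack:

```python
# Illustrative sketch only: a minimal LLM-as-a-judge scorer.
# The judge prompt, 1-5 score schema, and call_llm transport are
# hypothetical placeholders, not abra's actual implementation.
import json
from dataclasses import dataclass
from typing import Callable

JUDGE_PROMPT = """You are an impartial evaluator. Score the candidate answer
against the reference on correctness and reasoning quality, each from 1 to 5.
Question: {question}
Reference: {reference}
Candidate: {candidate}
Respond with JSON: {{"correctness": int, "reasoning_quality": int, "rationale": str}}"""

@dataclass
class Verdict:
    correctness: int
    reasoning_quality: int
    rationale: str

def judge(question: str, reference: str, candidate: str,
          call_llm: Callable[[str], str]) -> Verdict:
    """Ask a judge model to grade one agent output and parse its JSON verdict."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, reference=reference,
                                       candidate=candidate))
    data = json.loads(raw)
    return Verdict(int(data["correctness"]), int(data["reasoning_quality"]),
                   str(data["rationale"]))

# Stub transport for local testing; in practice this wraps a judge-model API call.
fake = lambda prompt: '{"correctness": 4, "reasoning_quality": 5, "rationale": "matches reference"}'
print(judge("2+2?", "4", "4", call_llm=fake))
```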
What You’ll Do:
- Design and implement evaluation frameworks for AI agents and multi-agent systems
- Build LLM-as-a-judge pipelines to assess correctness, reasoning, and output quality
- Develop agent-based evaluation systems (agents evaluating agents) for scalable testing
- Define metrics, benchmarks, scorecards, and methodologies for agent reliability and performance (a toy scorecard sketch follows this list)
- Build data-driven evaluation pipelines using synthetic and real-world datasets
- Identify and analyze failure modes, edge cases, and non-deterministic behaviors
- Improve agent robustness, consistency, and reliability in production environments
- Work with tools such as Google ADK, Opik, and related evaluation frameworks
- Collaborate closely with AI, platform, and database teams to shape agent–data interaction quality
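To make the scorecard bullet above concrete, here is a toy example of rolling per-case judge scores into release-gate metrics. The metric names, 1-5 score scale, and the 0.95 threshold are hypothetical examples chosen for illustration:

```python
# Illustrative sketch only: aggregating per-case judge scores into a scorecard.
# Metric names and the pass threshold are hypothetical, not production values.
from statistics import mean

def scorecard(results: list[dict]) -> dict:
    """Roll per-case scores (1-5) into summary metrics for an agent build."""
    pass_rate = mean(1.0 if r["correctness"] >= 4 else 0.0 for r in results)
    return {
        "cases": len(results),
        "pass_rate": pass_rate,  # share of cases the judge scored as correct
        "mean_reasoning": mean(r["reasoning_quality"] for r in results),
        "ship": pass_rate >= 0.95,  # hypothetical release gate
    }

print(scorecard([{"correctness": 5, "reasoning_quality": 4},
                 {"correctness": 3, "reasoning_quality": 2}]))
```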
Requirements:
Must have:
- 4–8+ years of experience in software engineering, AI systems, or evaluation/QA engineering
- Strong programming skills in Python
- Hands-on experience working with LLMs in production environments
- Experience building evaluation systems, automation frameworks, or testing infrastructure
- Strong understanding of prompt engineering, tool use, and agent behavior
- Ability to think in terms of metrics, correctness, and system reliability
Nice to have:
- Experience with LLM evaluation frameworks (Opik, LangSmith, etc.)
- Experience with Google ADK / agent frameworks
- Experience implementing LLM-as-a-judge or ranking systems
- Background in data systems, analytics, or real-time pipelines
- Experience with multi-agent systems
- Familiarity with statistical evaluation methods or experimentation (A/B testing, scoring systems)