
AI Evaluation & Reliability Engineer

Skills
  • Python
  • LLMs
  • Prompt Engineering
  • A/B Testing
  • Agent Frameworks
  • Analytics
  • Data Systems
  • Google ADK
  • LangSmith
  • Opik
  • Real-time Pipelines
  • Scoring Systems
  • Statistical Evaluation Methods

abra R&D is looking for a Reliability Engineer who will take part in building its next-generation agentic analytics platform: the first real-time database optimized for AI agents at scale.

We’re looking for a Senior AI Evaluation & Reliability Engineer to define and build how AI agents are measured, validated, monitored, and improved in production. This role sits at the intersection of LLM systems, evaluation research, and production-grade engineering.

You will design evaluation methodologies, build LLM-as-a-judge systems, and develop agent-based testing frameworks to ensure correctness, robustness, and reliability of complex multi-agent workflows operating on real-time data.
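
To make the LLM-as-a-judge idea concrete, here is a minimal sketch in Python. Everything in it is illustrative: the `call_llm` helper stands in for whatever chat-completion client the stack actually uses, and the rubric and JSON schema are assumptions, not abra's implementation.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any chat-completion client (OpenAI, Vertex, etc.).
    Assumed to return the model's raw text response."""
    raise NotImplementedError

JUDGE_PROMPT = """You are grading an AI agent's answer to an analytics question.

Question: {question}
Agent answer: {answer}
Reference answer: {reference}

Reply with JSON only:
{{"correct": true or false, "reasoning_quality": 1-5, "notes": "<short explanation>"}}"""

def judge(question: str, answer: str, reference: str) -> dict:
    """Ask a (typically stronger) model to grade a single agent output."""
    raw = call_llm(JUDGE_PROMPT.format(
        question=question, answer=answer, reference=reference))
    return json.loads(raw)  # real pipelines validate/repair the JSON before trusting it

def pass_rate(cases: list[dict]) -> float:
    """Aggregate judge verdicts over a labeled test set into a single score."""
    verdicts = [judge(c["question"], c["agent_answer"], c["reference"]) for c in cases]
    return sum(v["correct"] for v in verdicts) / len(verdicts)
```

A production pipeline would add schema validation of the judge output, multiple judges, and calibration against human labels; this sketch only shows the shape of the loop.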

What You’ll Do:

  • Design and implement evaluation frameworks for AI agents and multi-agent systems
  • Build LLM-as-a-judge pipelines to assess correctness, reasoning quality, and output quality
  • Develop agent-based evaluation systems (agents evaluating agents) for scalable testing
  • Define metrics, benchmarks, scorecards, and methodologies for agent reliability and performance (a toy scorecard is sketched after this list)
  • Build data-driven evaluation pipelines using synthetic and real-world datasets
  • Identify and analyze failure modes, edge cases, and non-deterministic behaviors
  • Improve agent robustness, consistency, and reliability in production environments
  • Work with tools such as Google ADK, Opik, and related evaluation frameworks
  • Collaborate closely with AI, platform, and database teams to shape agent–data interaction quality
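
As referenced above, here is a hedged sketch of what a reliability scorecard could look like: each case is run several times so that non-deterministic behavior surfaces as disagreement between runs. `run_agent` and `judge` are hypothetical callables, and the metrics are illustrative, not a prescribed methodology.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Scorecard:
    pass_rate: float    # fraction of cases whose majority verdict was "correct"
    consistency: float  # fraction of cases where all repeated runs agreed
    flagged: list[str]  # case ids showing failures or non-deterministic verdicts

def build_scorecard(run_agent, judge, cases, repeats: int = 3) -> Scorecard:
    """Run each case several times so non-determinism shows up as disagreement."""
    passes = agreements = 0
    flagged: list[str] = []
    for case in cases:
        outputs = [run_agent(case["input"]) for _ in range(repeats)]
        verdicts = [judge(case, out) for out in outputs]  # booleans
        majority, count = Counter(verdicts).most_common(1)[0]
        passes += majority            # True counts as 1
        agreements += count == repeats
        if count != repeats or not majority:
            flagged.append(case["id"])
    n = len(cases)
    return Scorecard(pass_rate=passes / n, consistency=agreements / n, flagged=flagged)
```

The flagged-case list is often the most useful output: it turns vague "the agent is flaky" reports into a concrete queue of inputs to analyze for failure modes and edge cases.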

Requirements:

Must have:

  • 4–8+ years of experience in software engineering, AI systems, or evaluation/QA engineering
  • Strong programming skills in Python
  • Hands-on experience working with LLMs in production environments
  • Experience building evaluation systems, automation frameworks, or testing infrastructure
  • Strong understanding of prompt engineering, tool use, and agent behavior
  • Ability to think in terms of metrics, correctness, and system reliability

Nice to have:

  • Experience with LLM evaluation frameworks (Opik, LangSmith, etc.)
  • Experience with Google ADK / agent frameworks
  • Experience implementing LLM-as-a-judge or ranking systems
  • Background in data systems, analytics, or real-time pipelines
  • Experience with multi-agent systems
  • Familiarity with statistical evaluation methods or experimentation (A/B testing, scoring systems; a minimal example follows this list)
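
For the experimentation point above, a minimal example of the statistics involved: a two-proportion z-test comparing the pass rates of two agent variants on the same eval set. This is the textbook formulation, not any specific framework's API, and the numbers are made up for illustration.

```python
import math

def two_proportion_p_value(passes_a: int, n_a: int, passes_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two pass rates
    (e.g. agent variant A vs. variant B on the same eval set)."""
    p_a, p_b = passes_a / n_a, passes_b / n_b
    pooled = (passes_a + passes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

# Illustrative numbers: variant B passes 880/1000 cases vs. variant A's 850/1000.
print(f"p-value: {two_proportion_p_value(850, 1000, 880, 1000):.4f}")  # ~0.05
```

Since both variants are usually scored on the same cases, paired tests or bootstrapping over cases give tighter comparisons; this unpaired form is just the simplest starting point.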