LLM Evaluations Engineering Lead
SF Bay Area (Onsite)
Full-time / Permanent

We’re partnering with a deep-tech AI company building autonomous, agentic systems for complex, real-world physical environments. The team operates at the edge of what’s possible today, designing AI systems that plan, act, recover, and improve over long horizons in high-stakes settings.

They’re hiring an LLM Evaluations Engineering Lead to own the evaluation, verification, and regression layer for agentic LLM systems running end-to-end workflows.

This is not a metrics-only role. You’ll be building the guardrails that determine whether the system is actually getting better.

Why this role matters

As agentic LLM systems move into long-horizon planning and execution, evals become the bottleneck.

This role defines whether:
  • Agents are actually improving
  • Changes introduce silent regressions
  • Uncertainty is shrinking or compounding
  • “Success” reflects real-world outcomes, not proxy metrics

If evals are wrong, everything downstream is wrong. This role sits directly on that fault line.

What you’ll do
  • Build eval harnesses for agentic LLM systems (offline and in-workflow)
  • Design evals for planning, execution, recovery, and safety
  • Implement verifier-driven scoring and regression gates (see the sketch after this list)
  • Turn eval failures into training signals (SFT / DPO / RL)
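
For a sense of the kind of work involved, here is a minimal, hypothetical sketch of a verifier-driven regression gate: score a candidate agent's outputs with a verifier, compare its pass rate against a stored baseline, and block the change if it regresses beyond a tolerance. All names, signatures, and thresholds below are illustrative assumptions, not the company's actual stack.

```python
# Hypothetical sketch of a verifier-driven regression gate.
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class EvalCase:
    prompt: str
    expected: str  # ground-truth outcome the verifier checks against


def pass_rate(
    cases: Sequence[EvalCase],
    run_agent: Callable[[str], str],     # hypothetical: runs the agent end-to-end on one prompt
    verify: Callable[[str, str], bool],  # hypothetical: verifier accepts/rejects one output
) -> float:
    """Fraction of cases whose agent output the verifier accepts."""
    passed = sum(verify(run_agent(c.prompt), c.expected) for c in cases)
    return passed / len(cases)


def regression_gate(candidate_rate: float, baseline_rate: float, tolerance: float = 0.01) -> bool:
    """Return True if the candidate may ship (no regression beyond the tolerance)."""
    return candidate_rate >= baseline_rate - tolerance


if __name__ == "__main__":
    # Toy example: a stubbed agent and an exact-match verifier.
    cases = [EvalCase("2+2=", "4"), EvalCase("capital of France?", "Paris")]
    agent = lambda prompt: {"2+2=": "4", "capital of France?": "Paris"}[prompt]
    verifier = lambda output, expected: output.strip() == expected

    rate = pass_rate(cases, agent, verifier)
    verdict = "PASS" if regression_gate(rate, baseline_rate=0.95) else "BLOCK"
    print(f"pass rate: {rate:.2f} | gate: {verdict}")
```

In practice the verifier and threshold would be far richer (task-specific checks, per-capability breakdowns, confidence intervals over pass rates), but the gate pattern is the same.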

What they’re looking for
  • Strong experience building evaluation systems for ML models (LLMs strongly preferred)
  • Excellent software engineering fundamentals:
    • Python
    • Data pipelines
    • Test harnesses
    • Distributed execution
    • Reproducibility
  • Deep understanding of agentic failure modes, including:
    • Tool misuse
    • Hallucinated evidence
    • Reward hacking
    • Brittle formatting and schema drift
  • Ability to reason about what to measure, not just how to measure it
  • Comfortable operating between research experimentation and production systems

Why join
  • Work on frontier agentic AI systems with real-world consequences
  • Own a foundational layer that determines system reliability and progress
  • High autonomy, strong technical peers, and meaningful equity
  • Build evals that actually matter, not academic benchmarks