SF Bay Area (Onsite)
Full-time / Permanent
We’re partnering with a deep-tech AI company building autonomous, agentic systems for complex, real-world physical environments. The team operates at the edge of what’s possible today, designing AI systems that plan, act, recover, and improve over long horizons in high-stakes settings.
They’re hiring an LLM Evaluations Engineering Lead to own the evaluation, verification, and regression layer for agentic LLM systems running end-to-end workflows.
This is not a metrics-only role. You’ll be building the guardrails that determine whether the system is actually getting better.
Why this role matters
As agentic LLM systems move into long-horizon planning and execution, evals become the bottleneck.
This role defines whether:
- Agents are actually improving
- Changes introduce silent regressions
- Uncertainty is shrinking or compounding
- “Success” reflects real-world outcomes, not proxy metrics
If evals are wrong, everything downstream is wrong. This role sits directly on that fault line.
What you’ll do
- Build eval harnesses for agentic LLM systems (offline and in-workflow)
- Design evals for planning, execution, recovery, and safety
- Implement verifier-driven scoring and regression gates (see the sketch after this list)
- Turn eval failures into training signals (SFT / DPO / RL)
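To make the regression-gate idea concrete, here is a minimal Python sketch under assumptions of our own: a hypothetical `run_agent` entry point, per-task pass/fail verifiers, and a placeholder regression tolerance. It is illustrative only, not the company’s actual harness.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalTask:
    """One eval case: a prompt plus a programmatic pass/fail verifier."""
    task_id: str
    prompt: str
    verifier: Callable[[str], bool]  # True if the agent transcript passes

def run_regression_gate(
    run_agent: Callable[[str], str],  # candidate agent: prompt -> transcript
    tasks: list[EvalTask],
    baseline_pass_rate: float,
    max_regression: float = 0.02,     # placeholder tolerance, not a real threshold
) -> bool:
    """Score a candidate against verifiers and flag changes that regress."""
    passes = 0
    failures: list[str] = []
    for task in tasks:
        transcript = run_agent(task.prompt)
        if task.verifier(transcript):
            passes += 1
        else:
            failures.append(task.task_id)  # failed cases feed SFT / DPO / RL later
    pass_rate = passes / len(tasks)
    print(f"pass_rate={pass_rate:.3f} baseline={baseline_pass_rate:.3f} failed={failures}")
    return pass_rate >= baseline_pass_rate - max_regression
```

A production harness would layer on distributed execution, seeded reproducibility, and stored transcripts, which is exactly what the engineering fundamentals below point at.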
What they’re looking for
- Strong experience building evaluation systems for ML models (LLMs strongly preferred)
- Excellent software engineering fundamentals:
- Python
- Data pipelines
- Test harnesses
- Distributed execution
- Reproducibility
- Deep understanding of agentic failure modes (one of which is sketched after this list), including:
- Tool misuse
- Hallucinated evidence
- Reward hacking
- Brittle formatting and schema drift
- Ability to reason about what to measure, not just how to measure it
- Comfortable operating between research experimentation and production systems
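As an illustration of the last two failure modes above, here is a minimal standard-library sketch of a formatting and schema check on an agent’s tool call; the expected fields and the sample output are invented for the example.

```python
import json

# Hypothetical expected shape of a tool call emitted by the agent.
EXPECTED_FIELDS = {"tool": str, "arguments": dict}

def check_tool_call(raw_output: str) -> list[str]:
    """Return a list of formatting/schema problems (empty list means OK)."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]              # brittle formatting
    if not isinstance(call, dict):
        return ["top-level value is not a JSON object"]
    problems: list[str] = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in call:
            problems.append(f"missing field: {field}")  # schema drift
        elif not isinstance(call[field], expected_type):
            problems.append(f"wrong type for {field}: {type(call[field]).__name__}")
    unexpected = set(call) - set(EXPECTED_FIELDS)
    if unexpected:
        problems.append(f"unexpected fields: {sorted(unexpected)}")
    return problems

# Example: a drifted tool call where the argument container was renamed.
print(check_tool_call('{"tool": "search", "args": {"query": "status"}}'))
# -> ['missing field: arguments', "unexpected fields: ['args']"]
```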
Why join
- Work on frontier agentic AI systems with real-world consequences
- Own a foundational layer that determines system reliability and progress
- High autonomy, strong technical peers, and meaningful equity
- Build evals that actually matter, not academic benchmarks