SVP, Site Reliability Engineering (AI-Native SaaS Platform)
Remote

We’re hiring an experienced, hands-on SVP of Site Reliability Engineering to lead reliability, incident response, and AI-driven operations for a fast-scaling enterprise SaaS platform. This is a high-impact leadership role focused on building and evolving an AI-first SRE function where automation, agentic workflows, and intelligent remediation are central to the operating model.

You’ll lead a small team of senior engineers responsible for platform uptime, customer experience, observability, incident management, and auto-remediation systems across a large-scale AWS environment. This role requires a technical leader who remains close to the code, drives critical incident response, and partners directly with executive leadership and enterprise customers.

What We’re Looking For
  • 10 years of experience in SaaS infrastructure, SRE, DevOps, or platform engineering
  • Proven leadership experience at VP/SVP/Head of Engineering level
  • Deep expertise in AWS, cloud-native infrastructure, distributed systems, and multi-region production environments
  • Strong background in AIOps, agentic automation, auto-remediation, and AI-driven incident response
  • Hands-on experience with observability and incident management platforms such as Grafana, Prometheus, Datadog, PagerDuty, Loki, or similar
  • Strong coding and automation skills with a passion for operational excellence
  • Experience improving uptime, MTTR, reliability, and customer satisfaction at scale
  • Executive communication skills with enterprise customer-facing experience
  • Comfortable operating in a fast-paced, high-ownership, fully remote environment

Please apply for more information.