SVP, Site Reliability Engineering (AI-Native SaaS Platform)
Remote
We’re hiring an experienced, hands-on SVP of Site Reliability Engineering to lead reliability, incident response, and AI-driven operations for a fast-scaling enterprise SaaS platform. This is a high-impact leadership role focused on building and evolving an AI-first SRE function where automation, agentic workflows, and intelligent remediation are central to the operating model.
You’ll lead a small team of senior engineers responsible for platform uptime, customer experience, observability, incident management, and auto-remediation systems across a large-scale AWS environment. This role requires a technical leader who remains close to the code, drives critical incident response, and partners directly with executive leadership and enterprise customers.
What We’re Looking For
- 10 years of experience in SaaS infrastructure, SRE, DevOps, or platform engineering
- Proven leadership experience at the VP, SVP, or Head of Engineering level
- Deep expertise in AWS, cloud-native infrastructure, distributed systems, and multi-region production environments
- Strong background in AIOps, agentic automation, auto-remediation, and AI-driven incident response
- Hands-on experience with observability and incident management platforms such as Grafana, Prometheus, Datadog, PagerDuty, Loki, or similar
- Strong coding and automation skills with a passion for operational excellence
- Experience improving uptime, reliability, and customer satisfaction and reducing MTTR at scale
- Executive communication skills with enterprise customer-facing experience
- Comfortable operating in a fast-paced, high-ownership, fully remote environment
Please apply for more information.
