Senior MLOps Engineer – GPU Infrastructure & Inference
Our client is building AI-native systems at the intersection of machine learning, scientific computing, and materials innovation, applying large-scale ML to solve complex, real-world problems with global impact. They are seeking a Senior MLOps Engineer to own and operate a production-grade GPU platform supporting large-scale model training and low-latency inference for computational chemistry and LLM workloads serving thousands of users.
This role holds end-to-end responsibility for the ML platform, spanning Kubernetes-based GPU orchestration, cloud infrastructure and Infrastructure-as-Code, ML pipelines, CI/CD, observability, reliability, and disaster recovery. You will design and operate hardened, multi-tenant ML systems on AWS, build and optimize high-performance inference stacks using vLLM and TensorRT-based runtimes, and drive measurable improvements in latency, throughput, and GPU utilization through batching, caching, quantization, and kernel-level optimizations. You will also establish SLO-driven operational standards, robust monitoring and alerting, on-call readiness, and repeatable release and rollback workflows.
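As a flavor of the inference work involved, the sketch below spins up a vLLM engine with tensor parallelism and batched generation. It is a minimal, hypothetical example: the model checkpoint, parallelism degree, and prompts are illustrative assumptions, not details of the client's stack.

```python
from vllm import LLM, SamplingParams

# Hypothetical configuration for illustration only; the client's actual
# models and parallelism settings are not specified in this posting.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model checkpoint
    tensor_parallel_size=2,        # shard weights across two GPUs
    gpu_memory_utilization=0.90,   # leave headroom for the KV cache
    # quantization="awq",          # throughput lever; requires a pre-quantized checkpoint
)

params = SamplingParams(temperature=0.0, max_tokens=256)

# vLLM continuously batches concurrent requests internally, which is one
# of the batching levers for throughput mentioned above.
prompts = [
    "Summarize the key properties of a perovskite solar cell material.",
    "Explain SMILES notation in two sentences.",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```

In production, an engine like this would typically sit behind vLLM's OpenAI-compatible server, with the latency, throughput, and GPU-utilization targets tracked by the SLO-driven monitoring described above.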
The position requires deep hands-on experience running GPU workloads on Kubernetes, including scheduling, autoscaling, multi-tenancy, and debugging GPU runtime issues, alongside strong Terraform and cloud-native fundamentals. You will work closely with research scientists and product teams to productionize models reliably, support distributed training and inference across multi-node GPU clusters, and ensure high-throughput data pipelines for large scientific datasets. Ideal candidates bring 5 years of experience in MLOps, platform, or infrastructure engineering, strong proficiency in Python and modern DevOps practices, and a proven track record of operating scalable, high-performance ML systems in production. Experience supporting scientific computing, computational chemistry, or other physics-based workloads is highly desirable, as is prior exposure to large-scale LLM serving, distributed training frameworks, and regulated production environments.
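Similarly, the day-to-day Kubernetes GPU work often resembles the minimal sketch below, which uses the official kubernetes Python client to schedule a one-off pod onto a GPU node. The image, namespace, node label, and taint reflect a typical NVIDIA device-plugin setup and are assumptions, not the client's actual configuration.

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

# Request a single NVIDIA GPU via the device plugin's extended resource.
container = client.V1Container(
    name="trainer",
    image="nvcr.io/nvidia/pytorch:24.05-py3",  # assumed image for illustration
    command=["python", "train.py"],
    resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
)

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-smoke-test", labels={"team": "ml-platform"}),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[container],
        # Steer the pod onto GPU nodes and tolerate their taint, a common
        # pattern for isolating GPU capacity in multi-tenant clusters.
        node_selector={"nvidia.com/gpu.present": "true"},  # assumed feature-discovery label
        tolerations=[
            client.V1Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="ml-workloads", body=pod)
```

In this role, the same primitives (extended GPU resources, node selectors, taints and tolerations) underpin the scheduling, autoscaling, and multi-tenancy responsibilities described above, typically managed declaratively rather than through ad-hoc scripts.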