Senior Machine Learning Infrastructure Engineer | San Francisco | Competitive Salary + Equity
Our client is an early-stage AI company building foundation models for physics to enable end-to-end industrial automation, from simulation and design through optimization, validation, and production. They are assembling a small, elite, founder-led team focused on shipping real systems into production, backed by world-class investors and technical advisors.
They are hiring a Senior Machine Learning Infrastructure Engineer to own the full ML infrastructure stack behind physics-based foundation models. Working directly with the CEO and founding team, you will build, scale, and operate production-grade ML systems used by real customers.
 
What you will do
  • Own distributed training and fine-tuning infrastructure across multi-GPU and multi-node clusters
  • Design and operate low-latency, highly reliable inference and model serving systems
  • Build secure fine-tuning pipelines allowing customers to adapt models to their data and workflows
  • Deliver deployments across cloud and on-prem environments, including enterprise and air-gapped setups
  • Design data pipelines for large-scale simulation and CFD datasets
  • Implement observability, monitoring, and debugging across training, serving, and data pipelines
  • Work directly with customers on deployment, integration, and scaling challenges
  • Move quickly from prototype to production infrastructure
 
What our client is looking for
  • 3 years of experience building and scaling ML infrastructure for training, fine-tuning, serving, or deployment
  • Strong experience with AWS, GCP, or Azure
  • Hands-on expertise with Kubernetes, Docker, and infrastructure-as-code
  • Experience with distributed training frameworks such as PyTorch Distributed, DeepSpeed, or Ray
  • Proven experience building production-grade inference systems
  • Strong Python skills and deep understanding of the end-to-end ML lifecycle
  • High execution velocity, strong debugging instincts, and comfort operating in ambiguity
 
Nice to have
  • Background in physics, simulation, or computer-aided engineering software
  • Experience deploying ML systems into enterprise or regulated environments
  • Foundation model fine-tuning infrastructure experience
  • GPU performance optimization experience (CUDA, Triton, etc.)
  • Large-scale ML data engineering and validation pipelines
  • Experience at high-growth AI startups or leading AI research labs
  • Customer-facing or forward-deployed engineering experience
  • Open-source contributions to ML infrastructure
 
This role suits someone who earns respect through hands-on technical contribution, thrives in intense, execution-driven environments, values deep, focused work, and takes full ownership of outcomes.
The company offers ownership of core infrastructure, direct collaboration with the CEO and founding team, and work on high-impact AI and physics problems. The package includes competitive compensation with meaningful equity, strong benefits, daily meals, stipends, and immigration support. The culture is in-person-first, five days a week.