Our client is an early-stage AI company building foundation models for physics to enable end-to-end industrial automation, from simulation and design through optimization, validation, and production. They are assembling a small, elite, founder-led team focused on shipping real systems into production, backed by world-class investors and technical advisors.
They are hiring a Machine Learning Cloud Infrastructure Engineer to own the full ML infrastructure stack behind physics-based foundation models. Working directly with the CEO and founding team, you will build, scale, and operate production-grade ML systems used by real customers.
What you will do
- Own distributed training and fine-tuning infrastructure across multi-GPU and multi-node clusters
- Design and operate low-latency, highly reliable inference and model serving systems
- Build secure fine-tuning pipelines allowing customers to adapt models to their data and workflows
- Deliver deployments across cloud and on-prem environments, including enterprise and air-gapped setups
- Design data pipelines for large-scale simulation and CFD datasets
- Implement observability, monitoring, and debugging across training, serving, and data pipelines
- Work directly with customers on deployment, integration, and scaling challenges
- Move quickly from prototype to production infrastructure
What our client is looking for
- 3 years of experience building and scaling ML infrastructure for training, fine-tuning, serving, or deployment
- Strong experience with AWS, GCP, or Azure
- Hands-on expertise with Kubernetes, Docker, and infrastructure-as-code
- Experience with distributed training frameworks such as PyTorch Distributed, DeepSpeed, or Ray
- Proven experience building production-grade inference systems
- Strong Python skills and deep understanding of the end-to-end ML lifecycle
- High execution velocity, strong debugging instincts, and comfort operating in ambiguity
Nice to have
- Background in physics, simulation, or computer-aided engineering software
- Experience deploying ML systems into enterprise or regulated environments
- Foundation model fine-tuning infrastructure experience
- GPU performance optimization experience (CUDA, Triton, etc.)
- Large-scale ML data engineering and validation pipelines
- Experience at high-growth AI startups or leading AI research labs
- Customer-facing or forward-deployed engineering experience
- Open-source contributions to ML infrastructure
This role suits someone who earns respect through hands-on technical contribution, thrives in intense, execution-driven environments, values deep, focused work, and takes full ownership of outcomes. The company offers ownership of core infrastructure; direct collaboration with the CEO and founding team; work on high-impact AI and physics problems; competitive compensation with meaningful equity; an in-person-first culture five days a week; and strong benefits, including daily meals, stipends, and immigration support.