They recently closed a $41 million Seed round co-led by two top-tier US venture firms, with participation from a leading global investor, and are rapidly expanding across Product, Engineering, Go-to-Market, and Growth.
About the Role
You'll focus on the full training stack: profiling GPU behavior, debugging training pipelines, improving throughput, choosing the right parallelism strategies, and designing the infrastructure that lets the team train models efficiently at scale. The work spans cluster management, model training, efficient data pipelines for video and audio, inference, and optimizing PyTorch code. Your contribution will shape the foundation on which all of their generative models are built and iterated.
Key Responsibilities
- Identify ideal training strategies (parallelism approaches, precision trade-offs) for a variety of model sizes and compute loads
- Profile, debug, and optimize single- and multi-GPU operations using tools like Nsight and stack trace viewers to understand what's actually happening at the hardware level
- Analyze and improve the entire training pipeline end to end, including efficient data storage, data loading, distributed training, checkpoint and artifact saving, and logging
- Set up scalable systems for experiment tracking, data and model versioning, and experiment insights
- Design, deploy, and maintain large-scale ML training clusters running SLURM for distributed workload orchestration
Requirements
- Familiarity with the latest and most effective techniques for optimizing training and inference workloads, gained by implementing them rather than just reading papers
- Deep understanding of GPU memory hierarchy and compute capabilities, knowing what the hardware can do in theory and what prevents you from achieving it in practice
- Experience optimizing for both memory-bound and compute-bound operations, with a clear sense of when each constraint matters
- Expertise with efficient attention algorithms and their performance characteristics at different scales
- Experience implementing custom GPU kernels and integrating them into PyTorch
- Experience with diffusion and autoregressive models and an understanding of their specific optimization challenges
- Familiarity with high-performance storage solutions (VAST, blob storage) and their performance characteristics for ML workloads
- Experience managing SLURM clusters at scale
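The memory-bound vs compute-bound distinction above can be made concrete with a quick arithmetic-intensity estimate. This is a minimal sketch, not part of the role description; the peak-throughput numbers are illustrative (roughly an A100 in fp16), and the helper name is hypothetical.

```python
def arithmetic_intensity_matmul(m, n, k, bytes_per_el=2):
    """FLOPs per byte moved for an (m, k) @ (k, n) matmul in half precision."""
    flops = 2 * m * n * k                                  # one multiply + one add per MAC
    bytes_moved = bytes_per_el * (m * k + k * n + m * n)   # read A and B, write C
    return flops / bytes_moved

# Illustrative roofline numbers (roughly an A100, fp16 tensor cores):
PEAK_FLOPS = 312e12            # ~312 TFLOP/s
PEAK_BW = 2.0e12               # ~2.0 TB/s HBM bandwidth
RIDGE = PEAK_FLOPS / PEAK_BW   # ~156 FLOP/byte: below this, the op is memory-bound

for shape in [(4096, 4096, 4096), (4096, 4096, 64)]:
    ai = arithmetic_intensity_matmul(*shape)
    bound = "compute-bound" if ai > RIDGE else "memory-bound"
    print(f"{shape}: intensity {ai:.0f} FLOP/byte -> {bound}")
```

A large square matmul lands well above the ridge point and is compute-bound, while the same matmul with a thin inner dimension falls below it and is limited by HBM bandwidth instead, which is when kernel fusion and data-layout work pay off.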
Why Join
- Pivotal moment. Fresh funding is secured and traction is building; this is the point where your contributions can make a real difference to the company's trajectory.
- True ownership from day one. Genuine autonomy and responsibility, with ideas and work that directly shape both product and company direction.
- Competitive compensation and equity. Strong packages that ensure you share in the success you help create.
- Build for the next generation of creators. Be part of the innovation that will transform how creators work and thrive.
