Salary: €80,000 to €150,000 plus equity
Location: Fully remote within Europe (CET ±2 hours)
Stage: Recently funded Series A AI startup
We are partnering with a fast-growing generative AI company building the next generation of creative tooling. Their platform generates hyper-realistic sound, speech, and music directly from video, effectively bringing silent content to life. The technology is already being used across gaming, video platforms, and creator ecosystems, with a clear ambition to become foundational infrastructure for audio-visual storytelling.
Backed by top-tier venture capital and fresh Series A funding, the company is now scaling its core engineering group. This is a chance to join at a point where the technical challenges are deep, the scope is wide, and individual impact is unmistakable.
The Role:
As a Training Infrastructure Engineer, you will own and evolve the full model training stack. This is a hands-on, systems-level role focused on making large-scale training fast, reliable, and efficient. You will work close to the hardware and close to the models, shaping how cutting-edge generative systems are trained and iterated on.
What You Will Do:
- Design and evaluate optimal training strategies including parallelism approaches and precision trade-offs across different model sizes and workloads
- Profile, debug, and optimise GPU workloads at the single- and multi-GPU level, using low-level tooling to understand real hardware behaviour
- Improve the entire training pipeline end to end, from data storage and loading through distributed training, checkpointing, and logging
- Build scalable systems for experiment tracking, model and data versioning, and training insights
- Design, deploy, and maintain large-scale training clusters orchestrated with SLURM
What You Will Bring:
- Proven experience optimising training and inference workloads through hands-on implementation, not just theory
- Deep understanding of GPU memory hierarchy and compute constraints, including the gap between theoretical and practical performance
- Strong intuition for memory-bound vs compute-bound workloads and how to optimise for each
- Expertise in efficient attention mechanisms and how their performance characteristics change at scale
- Experience writing custom GPU kernels and integrating them into PyTorch
- Background working with diffusion or autoregressive models
- Familiarity with high-performance storage systems such as VAST or large-scale object storage
- Experience managing SLURM clusters in production environments
Why Join:
- Join at a pivotal growth stage with fresh funding and strong momentum
- Genuine ownership and autonomy from day one, with direct influence over technical direction
- Competitive salary and equity so you share in the upside you help create
- Work on technology that is redefining how creators produce and experience content