Senior Inference Engineer

AI Video Generation Company (Stealth) | Palo Alto, CA | Hybrid

About the Role

We are seeking a Senior Inference Engineer to accelerate the performance of our AI-driven video generation products. In this highly technical role, you will operate at the intersection of cutting-edge inference acceleration, GPU parallelism, advanced model deployment, and video generation technologies. Your expertise will drive significant improvements to model speed and efficiency, ensuring our creative AI systems deliver industry-leading user experiences at scale.

You will design and optimize inference pipelines, implement state-of-the-art acceleration techniques, and work closely with researchers and engineers across the team to push the boundaries of what's possible in real-time AI deployment. Your efforts will play a foundational role in powering the next generation of our video and language models.

What You'll Do

Accelerate Inference: Lead and implement advanced inference acceleration techniques, including attention optimization and quantization for efficient model serving.
Maximize GPU Parallelism: Engineer and optimize GPU strategies across tensor, sequence, and pipeline parallelism (TP, SP, PP) for maximal efficiency and scalability.
Programming for Performance: Develop and optimize high-performance computing kernels and distributed workloads using CUDA and NCCL.
Advance AI Deployment: Collaborate with research and engineering teams to bring state-of-the-art video generation and large language models into production.
Improve Training Efficiency (Bonus): Contribute to improvements in model training speed, stability, and resource utilization as part of our deployment lifecycle.
Technical Excellence: Drive rigorous code reviews, participate in technical discussions, and mentor fellow engineers on best practices in inference and GPU programming.

What We're Looking For

Experience: 5 years of engineering experience, with a strong track record in inference acceleration and model deployment at scale.
Inference Mastery: Proven expertise in inference optimization, including quantization, attention acceleration, and deep learning compiler stacks.
GPU and Parallelism: Deep knowledge of GPU programming (CUDA, NCCL) and experience with tensor, sequence, and pipeline parallelism (TP, SP, PP) and other forms of parallelism for distributed inference.
AI Domain Knowledge: Familiarity with video generation models and large language models (LLMs).
Collaboration: Strong cross-discipline communication skills; able to drive shared goals across research and engineering functions.
Ownership Mindset: Self-driven, solutions-oriented, and capable of managing ambiguity in a fast-paced startup environment.

Nice to Have

Experience with high-throughput video or real-time streaming model deployment.
Familiarity with distributed training and optimization toolkits.
Contributions to open source projects in AI infrastructure or deep learning compilers.
Startup or rapid prototyping experience.

What We Offer

Competitive salary commensurate with AI industry benchmarks.
Equity in a fast-growing company shaping the future of generative AI.
Comprehensive health benefits, monthly stipends, and company retreats.
A collaborative, in-office culture focused on building and shipping together.

About the Company

The company is a well-funded, early-stage AI video generation startup headquartered in Palo Alto, CA. The team is building technology to make video creation seamless, intuitive, and universally accessible through the transformative power of AI. Tight-knit and highly energetic, the company values efficiency, intellectual curiosity, and the ambition to make a meaningful impact on the world.
