AI Inference, Serving & Model Efficiency

DeepRec.ai recruits top engineers specialising in inference performance, scalable serving architectures, and efficiency-critical AI infrastructure.

DeepRec.ai specialises in identifying and placing engineers who focus on running AI systems in the real world. Whether they're optimising how models are served and scaled, maintaining inference pipelines, or building real-time AI workloads, these essential roles sit at the intersection of machine learning and infrastructure. 

This is a high-value area where adaptability and practical experience matter most, traits that are hard to hire for in deep tech's shallow talent pool. This is where the DeepRec.ai team are uniquely positioned to add value. Through our global AI engineering community and our proven delivery experience, we've developed a granular understanding of how strong production engineers operate. 

When you need to move beyond standard job titles, narrow your search, and invest real time into identifying the ideal candidate for your roles, DeepRec.ai takes care of it. 

Find incredible candidates:

Talk to a Consultant

Explore the latest jobs in AI Inference and Serving:

Live jobs

Where DeepRec.ai Specialises

Inference, Serving, and Model Efficiency recruitment goals vary widely between organisations, but they share a common focus: running large models reliably and efficiently in production.

DeepRec.ai supports hiring engineers working across areas such as model serving, inference optimisation, and performance-critical AI infrastructure. This often includes engineers building and maintaining serving stacks using frameworks such as vLLM, TensorRT-LLM, TGI, SGLang, Ray Serve, KServe, and BentoML, depending on scale, latency requirements, and deployment environments.

Rather than relying on job titles alone, we focus on engineers with ownership of production systems and measurable real-world impact.

Engineers in this space are responsible for improving performance, reducing cost, and ensuring reliability under real-world workloads.

This commonly involves:

  • Applying optimisation techniques such as quantization (INT8 / FP8), pruning, and distillation

  • Implementing model compression strategies to reduce memory footprint and inference cost

  • Leveraging kernel-level optimisations such as FlashAttention and Flash-Decoding

  • Tuning inference pipelines for throughput, latency, and hardware efficiency

These decisions are often highly context-dependent, requiring engineers to balance accuracy, speed, cost, and operational complexity.

Why Choose DeepRec.ai as Your Talent Partner? 

The AI industry often approaches hiring for roles across AI Infrastructure & Distributed Systems as an extension of traditional ML or backend recruitment. Generic job titles are reused, CVs are screened for surface-level familiarity, and critical production experience is assumed rather than validated.

In reality, these roles demand engineers who can operate under real-world constraints, which means balancing latency, throughput, reliability, and cost in production AI systems. Hiring successfully requires context, judgement, and a deep understanding of how these systems behave at scale.

This is where DeepRec.ai adds value. We specialise in identifying engineers who have built, operated, and optimised production AI systems, not just experimented with them. Our experience delivering complex hiring mandates in performance-critical AI environments allows us to assess beyond titles and tooling, focusing instead on real-world capability and impact.

When you partner with DeepRec.ai, you get: 

  • A dedicated delivery team who specialise purely in Inference, Serving & Efficiency across AI infrastructure and distributed systems. The result is faster shortlists and higher-confidence hiring decisions.

  • The niche expertise of a boutique agency, but the resilience and resources of a global brand. We're part of Trinnovo Group, an international staffing business that provides the operational scale, governance, and delivery capability required to support business-critical hiring initiatives.

  • Adaptable recruitment models to suit your unique business goals, from embedded solutions for high-volume hiring through to executive search for critical leadership hires.

  • Access to a global AI engineering community of engaged, qualified, and production-ready engineers. 

  • A consultative, delivery-first approach to recruitment.

Check out our case studies

Roles We Recruit For

We support hiring across a range of production-focused AI engineering roles, including:

  • AI Inference Engineers

  • Model Serving Engineers

  • AI Infrastructure Engineers

  • Backend Engineers supporting AI workloads

  • Distributed Systems Engineers working on AI platforms

  • Performance and optimisation-focused AI engineers

Common Use Cases We Support

Inference, serving, and model efficiency hiring is most critical for teams:

  • Scaling LLM-powered products into production

  • Operating real-time or low-latency AI systems

  • Managing high-throughput inference workloads

  • Optimising infrastructure cost as AI usage grows

  • Building internal AI platforms or developer tooling

We work with teams where inference performance and system reliability directly affect product quality and commercial outcomes.

FAQ

What makes inference and serving roles difficult to hire for?

These roles require hybrid skill sets across ML, backend engineering, and infrastructure, combined with real-world production experience that is difficult to validate through CVs alone.

Do you recruit for MLOps roles?

Yes. We specialise in MLOps hiring alongside inference, serving, and AI infrastructure roles, supporting teams responsible for deploying, operating, and maintaining production AI systems.

Do you support startup and enterprise hiring?

Yes. We work with startups, scale-ups, and established organisations where production AI systems are business-critical.

Can you support confidential or business-critical hires?

Absolutely. We regularly deliver complex and sensitive hiring mandates where discretion and precision are essential.

Which locations do you service? 

We primarily deliver recruitment services across the UK, Ireland, the DACH region, and the United States, where we have deep market knowledge and an established presence. Alongside this, we regularly deliver AI infrastructure, inference, serving, and MLOps hiring mandates on a global basis.

Ready to Build Production-Grade AI Teams?

If you’re building or scaling AI systems where performance and reliability matter, DeepRec.ai can help.

Speak with a specialist

MEET THE TEAM

Berlin, Germany
Senior Inference Optimization Engineer

About our client:

Our client is a fast-scaling automation platform that operates cloud-native and AI infrastructure at scale. By embedding autonomous decision-making directly into Kubernetes and cloud environments, the platform continuously optimizes performance, reliability, and efficiency in production, replacing tickets, alerts, and manual tuning with continuous automation that adapts infrastructure as conditions change.

The company is trusted by over two thousand organizations, including a number of globally recognized enterprises across technology, automotive, media, and financial services. It operates as a distributed, international team spanning more than thirty countries across Europe, North America, Latin America, and APAC. The business recently reached unicorn status following a strategic investment from a major corporate venture arm, with a valuation now in excess of one billion dollars and strong momentum behind its next phase of growth.

About the role:

Throughput. Latency. KV cache utilization. Move those three numbers in the right direction, and two things happen. Customers get faster, cheaper inference, and our client's margins improve. That is the entire thesis of this role. Every kernel you tune, every quantization scheme you ship, and every scheduler tweak you land shows up directly in a customer's p99 and on the P&L.

This is a high-impact seat, and a high-autonomy one. You will be given the room to lead the technical direction of inference optimization rather than execute someone else's roadmap.

The problem is that running LLMs in production is a moving target. The right model and serving configuration for a workload depend on traffic shape, sequence-length distribution, batch dynamics, GPU SKU, memory bandwidth, quantization tolerance, and a dozen other variables that shift week to week. Most teams pick a model once, over-provision GPUs, and absorb the cost.

Our client's system makes that decision automatically, continuously matching workloads to the most cost-efficient, best-performing LLM and serving configuration on a customer's infrastructure. The team is building the optimization layer between the model and the hardware, and needs engineers who understand both sides deeply.

Stack

Python; vLLM; SGLang; TensorRT-LLM; PyTorch; CUDA-adjacent tooling; Kubernetes; gRPC; ClickHouse; PostgreSQL; GCP Pub/Sub; AWS, GCP, and Azure; GitLab CI; ArgoCD; Prometheus; Grafana; Loki; Tempo.

Requirements

  • Five or more years building real ML systems, with a portfolio that shows depth in inference or training infrastructure, not just model training notebooks.

  • Strong Python, with experience building production services rather than scripts.

  • Hands-on experience with at least one of vLLM, SGLang, or TensorRT-LLM, and a working mental model of why an inference engine performs the way it does on a given GPU.

  • Fluency with quantization tradeoffs. You have measured quality regressions, not just compression ratios.

  • Comfort with distributed systems, including collective communication, sharding strategies, and the practical failure modes of multi-GPU and multi-node setups.

  • A bias toward measurement. You instrument before you optimize, and you can tell the difference between a real win and a benchmark artifact.

  • Self-direction. This role comes with a wide mandate, and you should be excited by that rather than unsettled by it.

Responsibilities

  • Push throughput. Continuous batching, speculative decoding, chunked prefill, and kernel-level tuning across vLLM, SGLang, and TensorRT-LLM. Find the ceiling on each GPU SKU, then raise it.

  • Cut latency. Attack TTFT and TPOT separately. Profile, identify the actual bottleneck whether compute, memory bandwidth, scheduling, or networking, and fix it rather than the bottleneck you assumed.

  • Get more out of the KV cache. Paged attention, prefix caching, eviction policies, cache reuse across requests, and quantized KV. This is where a lot of the unrealized throughput lives, and it is an area you will own.

  • Quantize without regressing quality. INT8, INT4, and FP8 across weights, activations, and KV. Empirical work that measures quality on real workloads, not just perplexity benchmarks.

  • Shrink cold starts and memory footprint. Faster init, smarter weight loading, and tighter memory accounting, which is the difference between a model that scales and one that does not.

  • Scale across nodes. Distributed inference topologies, network-aware placement, and checkpointing strategies that do not bottleneck on storage or interconnect.

  • Set the technical direction. Decide what to benchmark, what to adopt, and what to build in-house. Bring the team along with strong writeups and reproducible experiments.

Sam Warwick

Palo Alto, California, United States
Senior Inference Engineer

AI Video Generation Company (Stealth) | Palo Alto, CA | Hybrid

About the Role

We are seeking a Senior Inference Engineer to accelerate the performance of our AI-driven video generation products. In this highly technical role, you will operate at the intersection of cutting-edge inference acceleration, GPU parallelism, advanced model deployment, and video generation technologies. Your expertise will drive significant improvements to model speed and efficiency, ensuring our creative AI systems deliver industry-leading user experiences at scale.

You will design and optimize inference pipelines, implement state-of-the-art acceleration techniques, and work closely with researchers and engineers across the team to push the boundaries of what's possible in real-time AI deployment. Your efforts will play a foundational role in powering the next generation of our video and language models.

What You'll Do

  • Accelerate Inference: Lead and implement advanced inference acceleration techniques, including attention optimization and quantization for efficient model serving.

  • Maximize GPU Parallelism: Engineer and optimize GPU strategies across tensor, sequence, and pipeline parallelism (TP, SP, PP) for maximal efficiency and scalability.

  • Programming for Performance: Develop and optimize high-performance computing kernels and distributed workloads using CUDA and NCCL.

  • Advance AI Deployment: Collaborate with research and engineering teams to bring state-of-the-art video generation and large language models into production.

  • Improve Training Efficiency (Bonus): Contribute to improvements in model training speed, stability, and resource utilization as part of our deployment lifecycle.

  • Technical Excellence: Drive rigorous code reviews, participate in technical discussions, and mentor fellow engineers on best practices in inference and GPU programming.

What We're Looking For

  • Experience: 5 years of engineering experience, with a strong track record in inference acceleration and model deployment at scale.

  • Inference Mastery: Proven expertise in inference optimization, including quantization, attention acceleration, and deep learning compiler stacks.

  • GPU and Parallelism: Deep knowledge of GPU programming (CUDA, NCCL) and experience with SP, TP, PP, and other forms of parallelism for distributed inference.

  • AI Domain Knowledge: Familiarity with video generation models and large language models (LLMs).

  • Collaboration: Strong cross-discipline communication skills; able to drive shared goals across research and engineering functions.

  • Ownership Mindset: Self-driven, solutions-oriented, and capable of managing ambiguity in a fast-paced startup environment.

Nice to Have

  • Experience with high-throughput video or real-time streaming model deployment.

  • Familiarity with distributed training and optimization toolkits.

  • Contributions to open source projects in AI infrastructure or deep learning compilers.

  • Startup or rapid prototyping experience.

What We Offer

  • Competitive salary commensurate with AI industry benchmarks.

  • Equity in a fast-growing company shaping the future of generative AI.

  • Comprehensive health benefits, monthly stipends, and company retreats.

  • A collaborative, in-office culture focused on building and shipping together.

About the Company

A well-funded, early-stage AI video generation startup headquartered in Palo Alto, CA. The team is building technology to make video creation seamless, intuitive, and universally accessible through the transformative power of AI. Tight-knit and highly energetic, the company values efficiency, intellectual curiosity, and the ambition to make a meaningful impact on the world.

Sam Warwick

Palo Alto, California, United States
Staff Software Engineer (AI Infrastructure)

Staff/Lead Software Engineer, AI Infrastructure

About the Company

A well-funded Bay Area AI startup operating at the frontier of generative media, with a product shipping to users at scale. The company is building the core infrastructure that powers its AI capabilities, and this is a senior, high-ownership hire on that team.

About the Role

This is a critical hire to build and scale the infrastructure behind the company's AI capabilities. You'll lead the design and implementation of GPU infrastructure, AI model serving APIs, and general AI infrastructure execution, enabling the machine learning features that drive the product.

You'll architect robust, distributed systems optimized for high-performance AI workloads, large-scale GPU orchestration, and low-latency, reliable API serving. Your work will directly shape how users experience generative AI at scale. As a senior technical leader, you'll also mentor engineers, drive best practices, and set the technical vision for AI infrastructure.

What You'll Do

  • Design, develop, and maintain scalable GPU infrastructure for training and serving state-of-the-art AI models.

  • Architect and optimize high-throughput, low-latency APIs for AI model serving and inference.

  • Lead the orchestration, scheduling, and efficient utilization of heterogeneous GPU resources across clusters.

  • Build and support robust systems for model deployment, monitoring, scaling, and reliability in production.

  • Collaborate with ML, backend, and platform engineering teams to deliver seamless AI-powered product features.

  • Drive technical direction, code reviews, and mentorship across the AI Infrastructure team.

What We're Looking For

  • 5 years as a software engineer working on systems infrastructure, including hands-on ML serving and GPU orchestration.

  • Deep knowledge of distributed systems, Kubernetes (or similar orchestration frameworks), and cloud-native infrastructure (AWS/GCP/Azure).

  • Proven expertise building and optimizing APIs for large-scale AI model serving (TensorFlow Serving, Triton, TorchServe, or similar).

  • Familiarity with the challenges of high-throughput, scalable GPU fleet management, scheduling, and efficient model execution.

  • Proficiency in backend languages such as Python, Go, or C++, with experience optimizing for performance and reliability.

  • Ownership mentality and the drive to solve complex problems independently in ambiguous, high-growth environments.

  • Excellent communication, collaboration, and mentorship skills.

Nice to Have

  • Experience with multi-modal AI model infrastructure (LLMs, generative models, video/image/speech models).

  • Background building infra for multi-tenant SaaS, enterprise AI/ML platforms, or operational automation at scale.

  • Previous startup experience, or a track record leading high-impact projects through ambiguity and rapid iteration.

  • Experience with competitive coding or large-scale distributed computing environments.

Sam Warwick

Philadelphia, Pennsylvania, United States
Machine Learning Engineer (Inference Optimization)

Machine Learning Engineer – Inference Optimization

Overview

We are looking for a Machine Learning Engineer focused on low-latency inference optimization to help build, tune, and productionize high-performance model serving systems. This role sits at the intersection of machine learning, systems engineering, and GPU performance. You will work on inference workloads where latency, throughput, reliability, and hardware efficiency all matter, and where a deep understanding of modern inference runtimes can meaningfully improve production outcomes.

You will work closely with researchers and engineers to understand model structure, identify inference bottlenecks, and turn research ideas into efficient production systems. The work may involve other types of models, but focuses on transformer-style architectures and structured inference workloads. You will evaluate and tune frameworks and related serving or compilation systems, while also reasoning about GPU execution, memory layout, batching strategies, precision tradeoffs, and end-to-end latency.

What you'll do:

  • Design, build, and optimize low-latency inference systems for production machine learning workloads.

  • Profile model inference pipelines across model execution, runtime configuration, batching, memory movement, serialization, networking, and I/O.

  • Evaluate, integrate, and tune inference runtime systems.

  • Improve latency, throughput, and GPU utilization for production inference workloads.

  • Build and support benchmarking and profiling tools to compare model variants, hardware targets, runtime configurations, and deployment strategies.

  • Debug performance issues involving GPU memory, compute saturation, kernel behavior, CPU/GPU coordination, data movement, and serving-layer overhead.

  • Help shape model and system design choices so that research models are efficient to deploy under real latency constraints.

  • Where necessary, collaborate with lower-level systems or GPU specialists on custom operators, kernel-level optimization, or hardware-specific performance work.

What we're looking for:

  • Experience deploying, optimizing, or operating machine learning inference workloads in production or production-like environments.

  • Programming experience in Python, Java, C#, etc., and at least one systems language such as C, C++, Rust, or Go.

  • Solid understanding of modern ML frameworks such as PyTorch, including model execution, export, tracing, compilation, and performance profiling.

  • Ability to reason about latency, throughput, batching, memory use, GPU utilization, and reliability under real workloads.

  • Strong practical judgment around tradeoffs between model quality, latency, throughput, implementation complexity, and maintainability.

Preferred qualifications:

  • Experience optimizing inference for latency-sensitive or high-throughput applications.

  • Experience with model optimization techniques such as quantization, pruning, distillation, operator fusion, graph lowering, custom operators, or model compilation.

  • Exposure to CUDA, Triton language, ROCm, PTX, CuTe, CUTLASS, FlashInfer, or similar low-level GPU programming tools.

  • Experience running inference workloads on Kubernetes or GPU clusters, including scheduling, autoscaling, observability, and resource management.

  • Background in mathematics, physics, computer science, engineering, statistics, or another technical field.

  • Demonstrated ability to improve real-world inference performance beyond a baseline framework implementation.

Sam Warwick