Sam Warwick

Sam is a Senior Consultant operating within the North American market, while maintaining strong ties to his European network. He specialises in placing professionals at the intersection of machine learning and AI infra. His domain covers Data Science & Machine Learning, Infrastructure & Engineering, and Product.

With over six years’ experience in recruitment, Sam has a proven track record of identifying the right individuals to meet strategic goals, drive innovation, and add a fresh dynamic to established teams, all while respecting the parameters of each professional relationship. He works by the principle: we have two ears and one mouth for a reason; listening twice as much as we speak leads to better outcomes.

Fuelled by a lifelong devotion to football (yes, he supports Tottenham - please send thoughts and prayers) and a not-so-guilty obsession with Star Wars, Sam splits his time between the pitch and a galaxy far, far away (when he’s not immersed in the field of Geo, of course). Lest we forget, he’s powered by long runs and low heart rates; catch him in Zone 2, where the pace is chill, but the gains are real.

At DeepRec.ai, we’re more than recruiters; we’re strategic partners. As a certified B Corp, we’re committed to making a positive impact on people and the planet, with diversity and inclusion woven into every stage of the hiring journey. Whether you're advancing AI or seeking specialist talent, Sam is here to support your mission.

Connect with Sam to explore how he can help bring your deep tech vision to life.

CONTACT SAM

+1 617 994 5457

Sam.warwick@deeprec.ai

Jobs from Sam Warwick

Berlin, Germany

Senior Inference Optimization Engineer

Permanent, €150,000 - €200,000 per annum

About our client:
Our client is a fast-scaling automation platform that operates cloud-native and AI infrastructure at scale. By embedding autonomous decision-making directly into Kubernetes and cloud environments, the platform continuously optimizes performance, reliability, and efficiency in production, replacing tickets, alerts, and manual tuning with continuous automation that adapts infrastructure as conditions change.

The company is trusted by over two thousand organizations, including a number of globally recognized enterprises across technology, automotive, media, and financial services. It operates as a distributed, international team spanning more than thirty countries across Europe, North America, Latin America, and APAC. The business recently reached unicorn status following a strategic investment from a major corporate venture arm, with a valuation now in excess of one billion dollars and strong momentum behind its next phase of growth.

About the role:
Throughput. Latency. KV cache utilization. Move those three numbers in the right direction, and two things happen. Customers get faster, cheaper inference, and our client's margins improve. That is the entire thesis of this role. Every kernel you tune, every quantization scheme you ship, and every scheduler tweak you land shows up directly in a customer's p99 and on the P&L.

This is a high-impact seat, and a high-autonomy one. You will be given the room to lead the technical direction of inference optimization rather than execute someone else's roadmap.

The problem is that running LLMs in production is a moving target. The right model and serving configuration for a workload depend on traffic shape, sequence-length distribution, batch dynamics, GPU SKU, memory bandwidth, quantization tolerance, and a dozen other variables that shift week to week. Most teams pick a model once, over-provision GPUs, and absorb the cost. Our client's system makes that decision automatically, continuously matching workloads to the most cost-efficient, best-performing LLM and serving configuration on a customer's infrastructure. The team is building the optimization layer between the model and the hardware, and needs engineers who understand both sides deeply.

Stack:
Python; vLLM; SGLang; TensorRT-LLM; PyTorch; CUDA-adjacent tooling; Kubernetes; gRPC; ClickHouse; PostgreSQL; GCP Pub/Sub; AWS, GCP, and Azure; GitLab CI; ArgoCD; Prometheus; Grafana; Loki; Tempo.

Requirements:
- Five or more years building real ML systems, with a portfolio that shows depth in inference or training infrastructure, not just model training notebooks.
- Strong Python, with experience building production services rather than scripts.
- Hands-on experience with at least one of vLLM, SGLang, or TensorRT-LLM, and a working mental model of why an inference engine performs the way it does on a given GPU.
- Fluency with quantization tradeoffs. You have measured quality regressions, not just compression ratios.
- Comfort with distributed systems, including collective communication, sharding strategies, and the practical failure modes of multi-GPU and multi-node setups.
- A bias toward measurement. You instrument before you optimize, and you can tell the difference between a real win and a benchmark artifact.
- Self-direction. This role comes with a wide mandate, and you should be excited by that rather than unsettled by it.

Responsibilities:
- Push throughput. Continuous batching, speculative decoding, chunked prefill, and kernel-level tuning across vLLM, SGLang, and TensorRT-LLM. Find the ceiling on each GPU SKU, then raise it.
- Cut latency. Attack TTFT and TPOT separately. Profile, identify the actual bottleneck, whether compute, memory bandwidth, scheduling, or networking, and fix it rather than the bottleneck you assumed.
- Get more out of the KV cache. Paged attention, prefix caching, eviction policies, cache reuse across requests, and quantized KV. This is where a lot of the unrealized throughput lives, and it is an area you will own.
- Quantize without regressing quality. INT8, INT4, and FP8 across weights, activations, and KV. Empirical work that measures quality on real workloads, not just perplexity benchmarks.
- Shrink cold starts and memory footprint. Faster init, smarter weight loading, and tighter memory accounting, which is the difference between a model that scales and one that does not.
- Scale across nodes. Distributed inference topologies, network-aware placement, and checkpointing strategies that do not bottleneck on storage or interconnect.
- Set the technical direction. Decide what to benchmark, what to adopt, and what to build in-house. Bring the team along with strong writeups and reproducible experiments.

Sam Warwick

Posted 28 days ago


Palo Alto, California, United States

Senior Inference Engineer

Permanent, $200,000 - $300,000 per annum

Senior Inference Engineer
AI Video Generation Company (Stealth) | Palo Alto, CA | Hybrid

About the Role
We are seeking a Senior Inference Engineer to accelerate the performance of our AI-driven video generation products. In this highly technical role, you will operate at the intersection of cutting-edge inference acceleration, GPU parallelism, advanced model deployment, and video generation technologies. Your expertise will drive significant improvements to model speed and efficiency, ensuring our creative AI systems deliver industry-leading user experiences at scale.

You will design and optimize inference pipelines, implement state-of-the-art acceleration techniques, and work closely with researchers and engineers across the team to push the boundaries of what's possible in real-time AI deployment. Your efforts will play a foundational role in powering the next generation of our video and language models.

What You'll Do
- Accelerate Inference: Lead and implement advanced inference acceleration techniques, including attention optimization and quantization for efficient model serving.
- Maximize GPU Parallelism: Engineer and optimize GPU strategies across tensor, sequence, and pipeline parallelism (TP, SP, PP) for maximal efficiency and scalability.
- Programming for Performance: Develop and optimize high-performance computing kernels and distributed workloads using CUDA and NCCL.
- Advance AI Deployment: Collaborate with research and engineering teams to bring state-of-the-art video generation and large language models into production.
- Improve Training Efficiency: Contribute to improvements in model training speed, stability, and resource utilization as part of our deployment lifecycle. (Bonus)
- Technical Excellence: Drive rigorous code reviews, participate in technical discussions, and mentor fellow engineers on best practices in inference and GPU programming.

What We're Looking For
- Experience: 5 years of engineering experience, with a strong track record in inference acceleration and model deployment at scale.
- Inference Mastery: Proven expertise in inference optimization, including quantization, attention acceleration, and deep learning compiler stacks.
- GPU and Parallelism: Deep knowledge of GPU programming (CUDA, NCCL) and experience with SP, TP, PP, and other forms of parallelism for distributed inference.
- AI Domain Knowledge: Familiarity with video generation models and large language models (LLMs).
- Collaboration: Strong cross-discipline communication skills; able to drive shared goals across research and engineering functions.
- Ownership Mindset: Self-driven, solutions-oriented, and capable of managing ambiguity in a fast-paced startup environment.

Nice to Have
- Experience with high-throughput video or real-time streaming model deployment.
- Familiarity with distributed training and optimization toolkits.
- Contributions to open source projects in AI infrastructure or deep learning compilers.
- Startup or rapid prototyping experience.

What We Offer
- Competitive salary commensurate with AI industry benchmarks.
- Equity in a fast-growing company shaping the future of generative AI.
- Comprehensive health benefits, monthly stipends, and company retreats.
- A collaborative, in-office culture focused on building and shipping together.

About the Company
A well-funded, early-stage AI video generation startup headquartered in Palo Alto, CA. The team is building technology to make video creation seamless, intuitive, and universally accessible through the transformative power of AI. Tight-knit and highly energetic, the company values efficiency, intellectual curiosity, and the ambition to make a meaningful impact on the world.

Sam Warwick

Posted 28 days ago


Palo Alto, California, United States

Staff Software Engineer (AI Infrastructure)

Permanent, $200,000 - $300,000 per annum

Staff/Lead Software Engineer, AI Infrastructure

About the Company
A well-funded Bay Area AI startup operating at the frontier of generative media, with a product shipping to users at scale. The company is building the core infrastructure that powers its AI capabilities, and this is a senior, high-ownership hire on that team.

About the Role
This is a critical hire to build and scale the infrastructure behind the company's AI capabilities. You'll lead the design and implementation of GPU infrastructure, AI model serving APIs, and general AI infrastructure execution, enabling the machine learning features that drive the product. You'll architect robust, distributed systems optimized for high-performance AI workloads, large-scale GPU orchestration, and low-latency, reliable API serving. Your work will directly shape how users experience generative AI at scale. As a senior technical leader, you'll also mentor engineers, drive best practices, and set the technical vision for AI infrastructure.

What You'll Do
- Design, develop, and maintain scalable GPU infrastructure for training and serving state-of-the-art AI models.
- Architect and optimize high-throughput, low-latency APIs for AI model serving and inference.
- Lead the orchestration, scheduling, and efficient utilization of heterogeneous GPU resources across clusters.
- Build and support robust systems for model deployment, monitoring, scaling, and reliability in production.
- Collaborate with ML, backend, and platform engineering teams to deliver seamless AI-powered product features.
- Drive technical direction, code reviews, and mentorship across the AI Infrastructure team.

What We're Looking For
- 5 years as a software engineer working on systems infrastructure, including hands-on ML serving and GPU orchestration.
- Deep knowledge of distributed systems, Kubernetes (or similar orchestration frameworks), and cloud-native infrastructure (AWS/GCP/Azure).
- Proven expertise building and optimizing APIs for large-scale AI model serving (TensorFlow Serving, Triton, TorchServe, or similar).
- Familiarity with the challenges of high-throughput, scalable GPU fleet management, scheduling, and efficient model execution.
- Proficiency in backend languages such as Python, Go, or C++, with experience optimizing for performance and reliability.
- Ownership mentality and the drive to solve complex problems independently in ambiguous, high-growth environments.
- Excellent communication, collaboration, and mentorship skills.

Nice to Have
- Experience with multi-modal AI model infrastructure (LLMs, generative models, video/image/speech models).
- Background building infra for multi-tenant SaaS, enterprise AI/ML platforms, or operational automation at scale.
- Previous startup experience, or a track record leading high-impact projects through ambiguity and rapid iteration.
- Experience with competitive coding or large-scale distributed computing environments.

Sam Warwick

Posted 28 days ago


Philadelphia, Pennsylvania, United States

Machine Learning Engineer (Inference Optimization)

Permanent, $250,000 - $450,000 per annum

Machine Learning Engineer – Inference Optimization

Overview
We are looking for a Machine Learning Engineer focused on low-latency inference optimization to help build, tune, and productionize high-performance model serving systems. This role sits at the intersection of machine learning, systems engineering, and GPU performance. You will work on inference workloads where latency, throughput, reliability, and hardware efficiency all matter, and where a deep understanding of modern inference runtimes can meaningfully improve production outcomes.

You will work closely with researchers and engineers to understand model structure, identify inference bottlenecks, and turn research ideas into efficient production systems. The work may involve other types of models, but focuses on transformer-style architectures and structured inference workloads. You will evaluate and tune frameworks and related serving or compilation systems, while also reasoning about GPU execution, memory layout, batching strategies, precision tradeoffs, and end-to-end latency.

What you'll do:
- Design, build, and optimize low-latency inference systems for production machine learning workloads.
- Profile model inference pipelines across model execution, runtime configuration, batching, memory movement, serialization, networking, and I/O.
- Evaluate, integrate, and tune inference runtime systems.
- Improve latency, throughput, and GPU utilization for production inference workloads.
- Build and support benchmarking and profiling tools to compare model variants, hardware targets, runtime configurations, and deployment strategies.
- Debug performance issues involving GPU memory, compute saturation, kernel behavior, CPU/GPU coordination, data movement, and serving-layer overhead.
- Help shape model and system design choices so that research models are efficient to deploy under real latency constraints.
- Where necessary, collaborate with lower-level systems or GPU specialists on custom operators, kernel-level optimization, or hardware-specific performance work.

What we're looking for:
- Experience deploying, optimizing, or operating machine learning inference workloads in production or production-like environments.
- Programming experience in Python, Java, C#, etc., and at least one systems language such as C, C++, Rust, or Go.
- Solid understanding of modern ML frameworks such as PyTorch, including model execution, export, tracing, compilation, and performance profiling.
- Ability to reason about latency, throughput, batching, memory use, GPU utilization, and reliability under real workloads.
- Strong practical judgment around tradeoffs between model quality, latency, throughput, implementation complexity, and maintainability.

Preferred qualifications:
- Experience optimizing inference for latency-sensitive or high-throughput applications.
- Experience with model optimization techniques such as quantization, pruning, distillation, operator fusion, graph lowering, custom operators, or model compilation.
- Exposure to CUDA, Triton language, ROCm, PTX, CuTe, CUTLASS, FlashInfer, or similar low-level GPU programming tools.
- Experience running inference workloads on Kubernetes or GPU clusters, including scheduling, autoscaling, observability, and resource management.
- Background in mathematics, physics, computer science, engineering, statistics, or another technical field.
- Demonstrated ability to improve real-world inference performance beyond a baseline framework implementation.

Sam Warwick

Posted 28 days ago
