About our client:
Our client is a fast-scaling automation platform that operates cloud-native and AI infrastructure at scale. By embedding autonomous decision-making directly into Kubernetes and cloud environments, the platform continuously optimizes performance, reliability, and efficiency in production, replacing tickets, alerts, and manual tuning with continuous automation that adapts infrastructure as conditions change.
The company is trusted by over two thousand organizations, including a number of globally recognized enterprises across technology, automotive, media, and financial services. It operates as a distributed, international team spanning more than thirty countries across Europe, North America, Latin America, and APAC. The business recently reached unicorn status following a strategic investment from a major corporate venture arm, with a valuation now in excess of one billion dollars and strong momentum behind its next phase of growth.
About the role:
Throughput. Latency. KV cache utilization. Move those three numbers in the right direction, and two things happen. Customers get faster, cheaper inference, and our client's margins improve. That is the entire thesis of this role. Every kernel you tune, every quantization scheme you ship, and every scheduler tweak you land shows up directly in a customer's p99 and on the P&L. This is a high-impact seat, and a high-autonomy one. You will be given the room to lead the technical direction of inference optimization rather than execute someone else's roadmap.
The problem is that running LLMs in production is a moving target. The right model and serving configuration for a workload depend on traffic shape, sequence-length distribution, batch dynamics, GPU SKU, memory bandwidth, quantization tolerance, and a dozen other variables that shift week to week. Most teams pick a model once, over-provision GPUs, and absorb the cost. Our client's system makes that decision automatically, continuously matching workloads to the most cost-efficient, best-performing LLM and serving configuration on a customer's infrastructure. The team is building the optimization layer between the model and the hardware, and needs engineers who understand both sides deeply.
Stack
Python; vLLM; SGLang; TensorRT-LLM; PyTorch; CUDA-adjacent tooling; Kubernetes; gRPC; ClickHouse; PostgreSQL; GCP Pub/Sub; AWS, GCP, and Azure; GitLab CI; ArgoCD; Prometheus; Grafana; Loki; Tempo.
Requirements
  • Five or more years building real ML systems, with a portfolio that shows depth in inference or training infrastructure, not just model training notebooks.
  • Strong Python, with experience building production services rather than scripts.
  • Hands-on experience with at least one of vLLM, SGLang, or TensorRT-LLM, and a working mental model of why an inference engine performs the way it does on a given GPU.
  • Fluency with quantization tradeoffs. You have measured quality regressions, not just compression ratios.
  • Comfort with distributed systems, including collective communication, sharding strategies, and the practical failure modes of multi-GPU and multi-node setups.
  • A bias toward measurement. You instrument before you optimize, and you can tell the difference between a real win and a benchmark artifact.
  • Self-direction. This role comes with a wide mandate, and you should be excited by that rather than unsettled by it.
Responsibilities
  • Push throughput. Continuous batching, speculative decoding, chunked prefill, and kernel-level tuning across vLLM, SGLang, and TensorRT-LLM. Find the ceiling on each GPU SKU, then raise it.
  • Cut latency. Attack TTFT and TPOT separately. Profile, identify the actual bottleneck (compute, memory bandwidth, scheduling, or networking), and fix that one rather than the one you assumed.
  • Get more out of the KV cache. Paged attention, prefix caching, eviction policies, cache reuse across requests, and quantized KV. This is where a lot of the unrealized throughput lives, and it is an area you will own.
  • Quantize without regressing quality. INT8, INT4, and FP8 across weights, activations, and KV. Empirical work that measures quality on real workloads, not just perplexity benchmarks.
  • Shrink cold starts and memory footprint. Faster init, smarter weight loading, and tighter memory accounting. Together these are the difference between a model that scales and one that does not.
  • Scale across nodes. Distributed inference topologies, network-aware placement, and checkpointing strategies that do not bottleneck on storage or interconnect.
  • Set the technical direction. Decide what to benchmark, what to adopt, and what to build in-house. Bring the team along with strong writeups and reproducible experiments.