LLM Performance Engineer
Baden-Württemberg
Remote, with quarterly in-person engineering workshops
€85,000 – €110,000
The work
Most ML engineers never see what actually happens on the GPU. They train models, call an inference API, and trust the framework.
If you have ever opened Nsight or the PyTorch profiler, followed a request through kernel launches and communication calls, and wondered why half the GPU time disappears into overhead, this work will feel very familiar.
The problem
Large language models behave very differently in production than they do in benchmarks. Token generation patterns change. Prefill and decode phases behave unpredictably. Communication overhead quietly kills throughput. Schedulers make decisions based on incomplete information.
Most infrastructure platforms cannot see any of this. So they optimise the wrong things.
Your work changes that.
What you will actually build
You will make the entire LLM execution path observable, from the moment a request hits the system to the moment CUDA kernels execute on the GPU.
That means generating traces that capture:
token-level model behaviour
kernel launches and GPU utilisation
runtime scheduling decisions
memory movement and communication between GPUs
You will use those traces to answer questions like:
Why is a GPU only 55% utilised?
Where does latency appear between prefill and decode?
Why does a supposedly optimised attention kernel stall under load?
Then you turn those answers into improvements.
Better kernel behaviour.
Better runtime execution.
Better scheduling decisions across GPU fleets.
The results show up in real numbers: higher GPU utilisation, lower latency, and higher throughput on production workloads.
Why this work is different
Most ML roles sit above the framework layer. This sits underneath it.
You will spend your time inside PyTorch execution paths, CUDA behaviour, inference runtimes and distributed communication. The interesting problems live in the gaps between those layers.
The systems you work on also run at meaningful scale.
Clusters range from small internal deployments to environments with tens of thousands of GPUs. Performance improvements do not save milliseconds. They change how large fleets of hardware are used.
The environment
Small engineering team. Around sixty people.
No layers of product managers translating problems for you. Engineers talk directly to each other and to the system.
Work is fully remote, with occasional engineering sessions in Heidelberg focused on deep technical work rather than company rituals.
Performance improvements are measured, validated and shipped to production systems used by paying customers.
You will likely enjoy this if
You like profiling GPU workloads.
You have dug into CUDA kernels, PyTorch internals or distributed training behaviour to understand why something performs poorly.
You prefer investigating real systems over building ML features or training models.
You care more about how models run than about how they are trained.
Location: Baden-Württemberg, Germany
Job Type: Permanent
Salary: €85,000 – €110,000 per annum