AI Infrastructure

Expert Infrastructure Recruitment for Teams Building and Operating AI at Scale

DeepRec.ai supports organisations designing, building, and scaling the AI infrastructure that underpins today's production machine learning and inference platforms. Our AI infrastructure practice helps companies hire specialist engineers across compute, platforms, and systems, where architecture, performance, efficiency, and reliability determine whether AI systems succeed outside the lab.

As AI models move into real-world use, AI infrastructure has become the defining challenge of production AI. Organisations are under increasing pressure to provision, orchestrate, and operate compute and data platforms at scale, meeting strict requirements around latency, throughput, cost, and availability. This has driven unprecedented demand for AI infrastructure capability, and for engineers who can build and operate the systems that inference, training, and experimentation depend on.

DeepRec.ai’s recruitment consultants work closely with teams operating at this level of complexity, giving us a clear view of the skills, experience, and systems required to build production-grade AI. Whether that’s AI platform engineering, GPU and accelerator infrastructure, distributed systems, or inference at scale, we connect organisations with AI engineers who can operate effectively in real-world environments.

Hire AI Infrastructure Talent:

Talk to a Consultant

Find a Job in AI Infrastructure: 

Explore Careers

Why Leading AI Teams Choose DeepRec.ai for AI Infrastructure Hiring

DeepRec.ai's specialist infrastructure consultants are trusted by tech pioneers across the UK, Ireland, Germany, Switzerland, and the United States.

Our consultants work directly with teams building and operating production AI systems, giving us first-hand exposure to the architectures, constraints, and trade-offs involved.

This includes teams working on distributed training and inference, high-performance computing, GPU and accelerator clusters, and AI platform reliability, where system-level performance and infrastructure design are critical to deploying AI systems at scale.

Dedicated AI Infrastructure Delivery Teams

DeepRec.ai operates through dedicated divisions and delivery teams, each focused on a specific area of deep tech. This structure allows our AI infrastructure practice to work with depth and continuity, rather than spreading expertise across unrelated markets.

We speak Deep Tech

AI infrastructure is not a generic hiring problem. When you need to hire niche AI talent, you need a specialist who speaks deep tech. We know our serving systems from our pipelines, and we know how to talk about them with top-tier candidates. 

Cross-border hiring expertise - SECO & AUG Licensed

As part of Trinnovo Group, DeepRec.ai maintains both SECO and AUG licenses, enabling us to provide compliant cross-border recruitment and employment services across Switzerland and Germany. In addition to permanent hiring, we can payroll talent in-house and manage the full administrative and compliance burden on behalf of our clients. This is supported by an internal compliance team, ensuring hiring processes remain robust, transparent, and aligned with local regulatory requirements.

A Deep Tech Community

Much of the most in-demand AI infrastructure talent does not engage with traditional hiring channels. Through sustained involvement in the deep tech ecosystem, including events, collaboration, and research, DeepRec.ai maintains close ties to the AI infrastructure community, enabling trusted access to engineers and technical leaders who are typically difficult to reach through conventional recruitment. Find out more about DeepRec.ai's social hub here: https://www.deeprec.ai/community

A Perfect Client Net Promoter Score (+100)

DeepRec.ai maintains a client Net Promoter Score of +100, a reflection of consistent delivery, clear communication, and long-term partnerships built on trust. For our clients, this typically means a recruitment experience that is focused, technically credible, and aligned with the realities of hiring in complex, talent-constrained deep tech markets.

AI Inference and Model Serving Efficiency

Alongside our broader AI Infrastructure division, DeepRec.ai has a dedicated team focused purely on AI inference and serving efficiency.

As AI systems move from research environments into production, inference becomes the moment of truth. Latency, throughput, cost per request, hardware utilisation, and system reliability all come under pressure at scale. The engineering challenges shift from experimentation to optimisation, from building models to operating them in live, user-facing environments.

Our inference-focused consultants work with teams building high-performance serving systems, real-time and batch inference pipelines, model optimisation frameworks, and accelerator-aware deployment environments. We support organisations hiring engineers who understand quantisation, model compression, distributed inference, GPU scheduling, and system-level efficiency.

If your priority is deploying models reliably and efficiently in production, explore our AI Inference recruitment expertise to see how we support teams operating at this level.

Learn more

Who We Partner With 

We work with organisations building, scaling, and operating AI infrastructure in production, ranging from early-stage teams establishing core platforms to scale-ups expanding distributed systems, and enterprises investing in large-scale AI compute and platform capability.

We also work closely with engineers, researchers, and technical leaders who build and operate AI infrastructure. Many of the people we support are not actively looking for new roles, but are open to conversations about work that is technically meaningful, well-resourced, and aligned with how they want to operate.

Our role is to bring these two sides together thoughtfully, matching organisations with engineers where technical context, expectations, and long-term goals are aligned.

If you're interested in exploring a fulfilling new role in AI infrastructure, learning more about current market trends, or you'd like to hire exceptional talent, our consultants are always available to support you. Please get in touch with us directly, and we'll get back to you as soon as possible: 

Contact the team

AI INFRASTRUCTURE CONSULTANTS

Anthony Kelly

Co-Founder & MD EU/UK

Sam Warwick

Senior Consultant - ML Systems + AI Infra

Jacob Graham

Senior Consultant

LATEST JOBS

San Mateo, California, United States
Senior MLOps Engineer
Senior MLOps / ML Infrastructure Engineer

About the Company

Our client is a Series B, venture-backed deep-tech company building a Physics AI platform that helps engineering teams bring products to market faster, reduce development risk, and explore better designs with greater confidence. The platform combines large-scale simulation data with modern machine learning to generate high-fidelity predictions of physical behavior in near real time. Customers include leading organizations across aerospace, automotive, and advanced manufacturing, working on some of the most demanding real-world engineering problems.

The Role

This role focuses on building and operating the infrastructure that powers physics-based AI systems at scale. The position enables ML engineers and scientists to train, track, deploy, and monitor models reliably without managing low-level infrastructure. The work sits at the intersection of ML systems, cloud infrastructure, and large-scale simulation data, with a strong emphasis on performance, reliability, and developer productivity. It is a hands-on engineering role in a fast-moving, in-office environment, working closely with ML researchers, platform engineers, and product teams.

What You’ll Do

- Design, build, and maintain robust MLOps infrastructure supporting the full ML lifecycle, from experimentation and training through to production deployment and monitoring
- Implement automated training pipelines, experiment tracking, and model lifecycle management using tools such as Kubeflow, MLflow, and Argo Workflows
- Develop scalable data pipelines capable of handling large volumes of unstructured data, particularly 3D geometric data and physics simulation outputs
- Deploy machine learning models into production inference systems with strong standards for performance, reliability, and observability
- Manage model registries and integrate them with CI/CD workflows to support consistent and reliable model releases
- Implement monitoring systems that continuously track model health and performance in production
- Collaborate closely with ML researchers, platform engineers, and product teams to evolve the infrastructure platform for physics-based AI applications
- Write production-grade code and optimize cloud infrastructure, primarily on Google Cloud Platform, while making thoughtful trade-offs around scalability, cost, and operational simplicity using Docker and Kubernetes

What We’re Looking For

- Bachelor’s degree or higher in Computer Science, Data Science, Applied Mathematics, or a closely related field
- 5 years of industry experience building MLOps platforms or ML systems in production environments
- Strong proficiency in Python, with working knowledge of Bash and SQL
- Hands-on experience with cloud infrastructure such as GCP, AWS, or Azure
- Experience with containerization and orchestration tools including Docker and Kubernetes
- Familiarity with modern MLOps frameworks such as Kubeflow, MLflow, and Argo Workflows
- Experience building and maintaining scalable data pipelines, ideally working with unstructured or high-dimensional data
- Ability to independently deploy models and implement monitored inference systems in production
- Comfortable troubleshooting complex distributed systems and building reliable infrastructure that other teams depend on

Nice to Have

- Interest in physics simulation, scientific computing, or HPC environments
- Experience building production MLOps platforms in deep-tech or simulation-heavy environments
- Familiarity with additional programming languages such as Go or C

Working Style and Culture

This role suits someone who enjoys startup environments, learns quickly, and communicates clearly across disciplines. The team works on-site five days a week and values close collaboration, fast feedback loops, and hands-on problem solving. There is a strong belief that great infrastructure should be largely invisible, enabling engineers and scientists to move faster without friction.
Sam Warwick