About the Role
As a founding member of the Platform Team, you'll architect and build the software backbone of the company. The software needs to span two very different worlds: high-performance distributed computing for AI research (training, massive data ingestion, RL simulation) and resource-constrained, real-time execution on physical robot hardware.
This is a high-impact, greenfield opportunity. You won't be maintaining legacy code; you'll be making critical architectural decisions that define how the company scales from prototype to production. You'll act as the bridge between research scientists and hardware, ensuring that state-of-the-art models can be trained efficiently and deployed reliably to the real world.
Your Responsibilities
- Architect and build. Design and implement a scalable software platform that unifies research workflows (training, simulation) with production realities (real-time inference, data collection).
- Bridge the gap. Develop seamless tooling that facilitates the transition of models from Python-heavy research environments to performant C++/Rust runtimes on hardware.
- Performance optimization. Optimize the stack's critical path, focusing on inference latency, distributed training throughput, and system resource management.
- Infrastructure and tooling. Establish engineering excellence by setting up robust CI/CD pipelines, build systems (Bazel), and containerization strategies (Docker).
- Reliability. Engineer fault-tolerant systems capable of handling long-running experiments and safety-critical operations on physical robots.
Your Qualifications
- Education. MS in Computer Science or a comparable technical field.
- Software engineering. 5 years shipping high-quality software, with a track record of owning large features from design through deployment.
- Language proficiency. Expert-level fluency in Python (for tooling and ML infrastructure), plus strong proficiency in either modern C++ or Rust.
- System architecture. Demonstrated experience designing scalable software architectures, including microservices, API design (gRPC/REST), and distributed systems.
- Engineering rigor. A commitment to automated testing, code reviews, and writing maintainable, modular code.
- Machine learning systems. Experience building ML frameworks (PyTorch), MLOps infrastructure, data pipelines, or deploying models.
- Build and deploy. Hands-on experience with Docker and Bazel. Experience with orchestration (Kubernetes) or job schedulers (SLURM) is a plus.
- Robotics middleware. Familiarity with ROS2, DDS, or similar message-passing frameworks.
- Cloud infrastructure. Experience managing compute resources on AWS, GCP, or Azure using Infrastructure-as-Code (Terraform, Ansible).
- Simulation. Experience integrating with simulation environments (Isaac Sim, MuJoCo) for Reinforcement Learning.
