Job Description

We are hiring a Performance Engineer to join our Growth team. In this role, your product is the efficiency and throughput of our massive-scale AI clusters. As we scale our network, the difference between a working cluster and an optimized one represents millions of dollars in value and weeks of saved research time for our customers.
You will sit at the intersection of systems engineering and research, profiling end-to-end training runs to hunt down bottlenecks in compute, communication, and storage.
What You'll Do
Profile & Optimize: Conduct end-to-end profiling of training workloads to identify bottlenecks across GPU kernels, NCCL communication, and storage I/O.
System Refinement: Collaborate with systems engineers to improve scheduling efficiency, collective communication performance, and kernel execution.
Observability: Build and maintain high-fidelity tooling to monitor and visualize MFU, throughput, and cluster uptime.
Process Design: Design technical processes (...

Ready to Apply?

Take the next step in your AI career. Submit your application to Confidential today.

Submit Application