Job Description

**Overview**

Microsoft Azure operates one of the world’s largest and most complex cloud compute fleets. As a Principal Technical Program Manager (TPM) in Compute Fleet Infrastructure, you will lead cross‑functional initiatives that ensure node‑level health, availability, and automated recovery across Azure’s global fleet, directly supporting the reliability and stability of customer workloads at scale.

This role operates at the intersection of hardware, host operating system (OS), virtualization, control plane services, and data center operations. The mission is to transform low‑level node health signals into predictable, automated, and scalable recovery outcomes, protecting customer workloads while continuously raising the reliability standards of the Azure platform.

You will own end‑to‑end programs that span health signal definition, fleet‑wide detection, mitigation strategies, and recovery automation. This work involves close collaboration with engineering...
