Job Description

Role Responsibilities
Act as the Single Point of Contact (SPOC) and technical owner for GPU cluster operations.
Coordinate across GPU hardware, networking, data center facilities, and external vendors to ensure stable operations.
Plan, manage, and execute change management, system upgrades, and maintenance windows.
Interface directly with customers and with vendors' technical support teams to resolve operational and service issues.
Own SLA management and incident tracking, and conduct post-incident reviews and Root Cause Analysis (RCA).

Requirements
Proven experience in AI / HPC cluster operations and support.
Solid understanding of end-to-end architecture, including GPU, InfiniBand (IB), and system-level integration.
Strong communication, coordination, and execution skills, with the ability to drive issues to resolution across multiple stakeholders.
Willingness to work onsite at the client's data center.

Ready to Apply?

Take the next step in your AI career. Submit your application to ByteBridge today.