Job Description
Responsibilities
Designing, supporting, and operating HPC compute and storage infrastructure across on-premises and cloud environments
Maintaining and improving large-scale Linux systems spanning compute, storage, networking, automation, and monitoring
Troubleshooting complex issues across OS, storage, networking, and cluster scheduling layers in collaboration with other infrastructure teams
Managing and optimizing batch and containerized workloads across diverse compute resources (CPU and GPU)
Developing, operating, and improving cloud infrastructure across various providers such as GCP, AWS, and Azure
Building and maintaining HPC management tools, user access modules, and internal libraries
Developing metrics and observability pipelines, analyzing system performance, and driving improvements in cluster utilization and efficiency
Managing code deployments, upgrades, fixes, and infrastructure life...
Ready to Apply?
Take the next step in your AI career. Submit your application to Tower Research Capital today.