Job Description

Responsibilities
  • Designing, supporting, and operating HPC compute and storage infrastructure across on-premises and cloud environments

  • Maintaining and improving large-scale Linux systems spanning compute, storage, networking, automation, and monitoring

  • Troubleshooting complex issues across OS, storage, networking, and cluster scheduling layers in collaboration with other infrastructure teams

  • Managing and optimizing batch and containerized workloads across diverse compute resources (CPU and GPU)

  • Developing, operating, and improving cloud infrastructure across various providers such as GCP, AWS, and Azure

  • Building and maintaining HPC management tools, user access modules, and internal libraries

  • Developing metrics and observability pipelines, analyzing system performance, and driving improvements in cluster utilization and efficiency

  • Managing code deployments, upgrades, fixes, and infrastructure life...

Ready to Apply?
  Take the next step in your AI career. Submit your application to Tower Research Capital today.