Job Description

Position Overview (Job Summary):
The role is for an HPC Engineer responsible for designing, deploying, managing, and optimizing an on-premises High Performance Computing (HPC) environment.
The environment includes SLURM-managed CPU and GPU clusters.
The role places a strong emphasis on HPC architecture, Linux administration, job scheduling, and cluster operations.
Experience with parallel/distributed storage (Weka FS, Scality) is preferred but not required.
Primary Skills:
HPC Operations & Cluster Management (CPU & GPU)
SLURM Workload Manager (Mandatory): install, configure, and manage SLURM across multiple clusters; partitions/queues, fairshare, job priority, and scheduling policies; upgrades, migrations, and automation via API/integrations
Linux System Administration (RHEL focus): OS patching, hardening, tuning, package management
Troubleshooting & Performance Optimization: cluster health, node/job failures, bottlenecks, utilization optimization
Parallel Computing Knowledge: MPI, OpenMP, ...

Ready to Apply?

Take the next step in your AI career. Submit your application to HCLTech today.