Job Description

Responsibilities

  • Run managed services, not just systems. Operate multi‑tenant data/AI platforms (Spark, Airflow, Flink, Jupyter) with clear SLAs/SLIs/SLOs, cost guardrails, and capacity plans across AWS/GCP + Kubernetes.
  • Be the face of reliability. Lead incidents end‑to‑end, own customer comms and post‑incident reviews (RCA with actions customers can see and feel).
  • Design for customer experience. Help data scientists and customers reduce failed/slow jobs, improve time‑to‑data, and optimize costs, so customers notice faster pipelines and fewer surprises.
  • Standardize & scale. Build service runbooks, golden paths, and automation that make onboarding and daily ops predictable across customers.
  • Automate the toil away. Ship tooling (Bash/Python, GitOps, CI/CD) for backups, DR drills, upgrades, access, and environment bootstrapping.
  • Make signals meaningful. Instrument platforms with metrics/logs/traces; tune alerting to cu...

Ready to Apply?

Take the next step in your AI career. Submit your application to OpsWerks today.
