Job Description
We are seeking an experienced DevOps / Site Reliability Engineer (L5) to own and scale the production operations of a large-scale, AI-first platform. In this role, you will be responsible for reliability, performance, observability, and cost efficiency across cloud-native workloads running on GCP and Kubernetes. You will work closely with platform, data, and AI teams to ensure resilient, secure, and highly available systems in production.
Key Responsibilities
Own day-2 production operations for a large-scale AI-driven platform running on Google Cloud Platform (GCP).
Run, scale, and harden GKE-based Kubernetes workloads integrated with GCP managed services (data, messaging, AI, networking, and security).
Define, implement, and operate SLIs, SLOs, and error budgets across platform and AI services.
Build and manage end-to-end observability using New Relic (APM, infrastructure monitoring, logging, alerts, and dashboards).
Design,...
Ready to Apply?
Take the next step in your AI career. Submit your application to GeekyAnts today.