Job Description
We are seeking a hands-on Site Reliability Engineer (SRE) / AI Platform DevOps Engineer to own infrastructure provisioning, CI/CD automation, telemetry pipelines, and production deployment for AI-powered services, agents, and orchestration systems.
This is an SRE-heavy, infrastructure-first role focused on ensuring that AI systems running in production are:
Reliable
Observable
Scalable
Secure
Cost-efficient
Safe to deploy and operate
You will play a critical role in building and maintaining the platform foundation that enables AI services to run safely and efficiently at scale.
Key Responsibilities
1. Infrastructure Provisioning & Automation
Design and manage cloud infrastructure using Infrastructure as Code (Terraform or similar)
Provision and maintain Kubernetes clusters ...
Ready to Apply?
Take the next step in your AI career. Submit your application to Endava today.