Job Description
What you’ll be doing
Implement and optimise CI/CD pipelines, automation frameworks, and infrastructure-as-code solutions using AWS, GitOps, and container technologies.
Design, develop, and troubleshoot large-scale distributed systems across on-prem and cloud environments, ensuring reliability and scalability.
Lead performance and scale testing, monitoring, and analysis to improve system stability, security, and efficiency.
Integrate AI-driven observability and monitoring tools to improve incident detection, anomaly identification, and root-cause analysis.
Implement and maintain ML/AI automation (e.g., predictive scaling, automated remediation, intelligent alerting) to improve platform reliability and reduce manual toil.
Proactively identify and mitigate risks, perform root-cause analysis, and implement preventive measures following incidents.
Champion best practices in Site Reliability Engineering, mentor team ...
Ready to Apply?
Take the next step in your AI career. Submit your application to BT Group today.