Job Description

What you’ll be doing

  • Implement and optimise CI/CD pipelines, automation frameworks, and infrastructure-as-code solutions using AWS, GitOps, and container technologies.
  • Design, develop, and troubleshoot large-scale distributed systems across on-prem and cloud environments, ensuring reliability and scalability.
  • Lead performance and scale testing, monitoring, and analysis to improve system stability, security, and efficiency.
  • Integrate AI-driven observability and monitoring tools to improve incident detection, anomaly identification, and root‑cause analysis.
  • Implement and maintain ML/AI automation (e.g., predictive scaling, automated remediation, intelligent alerting) to improve platform reliability and reduce manual toil.
  • Proactively identify and mitigate risks, perform root cause analysis, and implement preventive measures following incidents.
  • Champion best practices in Site Reliability Engineering, mentor team ...
Ready to Apply?

Take the next step in your AI career. Submit your application to BT Group today.
