Job Description

Reliability & Performance

  • Define and manage SLIs, SLOs, and error budgets.
  • Improve system reliability, scalability, and resilience.
  • Lead reliability reviews and prevent incidents proactively.


Observability & Monitoring

  • Build and maintain monitoring, logging, and alerting.
  • Ensure actionable alerts and effective dashboards.
  • Implement distributed tracing.


Automation & Tooling

  • Automate operational tasks to reduce toil.
  • Build tools for reliability and automated remediation.


CI/CD & Deployments

  • Improve CI/CD pipelines for safe deployments.
  • Implement canary, blue/green, and rollback strategies.
  • Ensure production readiness.


Incident Management

  • Join on-call rotations.
  • Lead incident response and post-incident reviews.
  • Promote a blameless culture.

Ready to Apply?

Take the next step in your AI career. Submit your application to Michael Page today.