Job Description
Reliability & Performance
Define and manage SLIs, SLOs, and error budgets. Improve system reliability, scalability, and resilience. Lead reliability reviews and prevent incidents proactively.
Observability & Monitoring
Build and maintain monitoring, logging, and alerting. Ensure actionable alerts and effective dashboards. Implement distributed tracing.
Automation & Tooling
Automate operational tasks to reduce toil. Build tools for reliability and automated remediation.
CI/CD & Deployments
Improve CI/CD pipelines for safe deployments. Implement canary, blue/green, and rollback strategies. Ensure production readiness.
Incident Management
Join on-call rotations. Lead incident response and post-incident reviews. Promote a blameless culture.
Ready to Apply?
Take the next step in your AI career. Submit your application to Michael Page today.