Job Description
Position Overview
Observability-Focused Mindset
Deep understanding of telemetry: Metrics, logs, traces, and events.
Experience with observability tools such as Datadog, AWS CloudWatch, SolarWinds, and OpenTelemetry.
Ability to define and refine SLIs/SLOs to measure system health.
Proactive monitoring: Builds dashboards and alerts that detect issues before users do.
Incident Management & Communication
Calm under pressure: Handles high-severity incidents with clarity and focus.
Strong communicator: Clearly articulates impact, status, and resolution steps to stakeholders.
Postmortem discipline: Writes blameless post-incident reports and drives follow-ups.
Collaboration: Works closely with developers, product, and support teams during incidents.
Logging & Iterative Improvements
Strategic logging: Adds meaningful logs that aid in debuggi...
Ready to Apply?
Take the next step in your AI career. Submit your application to Zelis today.