Job Description
• Own the day-to-day operational health of compliance-approved AI and Data platforms in production, ensuring high availability, performance, and reliability
• Monitor AI and Data services, model inference layers, APIs, and data dependencies using logs, metrics, dashboards, and alerts
• Provide production-focused user support for AI tools and Data platforms, prioritizing issue resolution
• Lead incident triage, coordination, and resolution for platform outages or service degradations, partnering with development and infrastructure teams
• Perform deep technical troubleshooting across applications, data, and system layers
• Enhance observability, alerting, and operational runbooks to reduce mean time to detect (MTTD) and mean time to resolve (MTTR) incidents
• Conduct post-incident root cause analysis and drive corrective and preventive improvements
• Support production deployments, configuration changes, and platform ...
Ready to Apply?
Take the next step in your AI career. Submit your application to Point72 today.